Tuesday, April 17, 2012

Applying Bayesian techniques to spam filtering

The white paper "Why Bayesian filtering is the most effective anti-spam technology" describes how a company can apply Bayesian mathematics to the spam e-mail problem by creating an adaptive, ‘statistical intelligence’ technique that increases spam detection rates. What distinguishes this approach from other spam filters is that a company or organization can customize the filter based on its own email characteristics and update the database with newly detected spam characteristics.

Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from previous occurrences of that event. According to the paper, before we can filter spam using this method, we need to generate a database of words and tokens collected from a sample of spam mail and valid mail (also referred to as ‘ham’). Each word or token is assigned a probability value, calculated from how often that word occurs in spam as opposed to legitimate mail. The word probabilities are calculated by analyzing the users’ outbound mail and known spam.

For example, if the word “mortgage” occurs in 400 of 3,000 spam mails and in 5 of 300 legitimate emails, then its spam probability is (400/3000) divided by [(5/300) + (400/3000)], i.e. 0.8889.
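The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the example; the function name, its parameters, and the neutral value returned for a word that was never seen are my own assumptions and do not come from the white paper.

```python
def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability of one word, following the 'mortgage' example.

    spam_hits / spam_total: spam mails containing the word / all spam mails
    ham_hits / ham_total:   legitimate mails containing the word / all ham mails
    (hypothetical helper, not taken from the white paper)
    """
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        # Assumption: a word seen in neither database is treated as neutral.
        return 0.5
    return spam_freq / (spam_freq + ham_freq)


# "mortgage": 400 of 3,000 spam mails, 5 of 300 legitimate mails
p = word_spam_probability(400, 3000, 5, 300)
print(round(p, 4))  # 0.8889
```

Because the function divides by each database's size first, a spam database ten times larger than the ham database does not by itself push every word toward "spam".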

Creating the ham database: The analysis of ham mail is performed on the organization’s own mail, and is therefore tailored to that particular organization. A financial institution, for example, might use the word “mortgage” many times over and would get a lot of false positives if using a general anti-spam rule set. However, the Bayesian filter, if tailored to your company through an initial training period, takes note of the company’s valid outbound mail and recognizes “mortgage” as being frequently used in legitimate messages, and therefore has a much better spam detection rate and a far lower false positive rate.

Creating the spam database: Along with the ham database, the Bayesian filter also relies on a spam data file. The spam data file must include a large sample of known spam and must be constantly updated with the latest spam by the anti-spam software. This ensures that the Bayesian filter is aware of the latest spam tricks, resulting in a high spam detection rate.

How the actual filtering is done: Once the ham and spam databases have been created, the word probabilities can be calculated and the filter is ready for use. When a new mail arrives, it is broken down into words, and the most relevant words, that is, those most significant in identifying whether the mail is spam or not, are singled out. From these words, the Bayesian filter calculates the probability that the new message is spam. If the probability is greater than a threshold, 0.9 for instance, then the message is classified as spam. This Bayesian approach to spam is highly effective: a May 2003 BBC article reported that spam detection rates of over 99.7 percent can be achieved with a very low number of false positives.
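A minimal sketch of the filtering step might look like the following. The post does not say how the individual word probabilities are combined, so this assumes the common naive Bayes combination (multiply the probabilities, multiply their complements, and normalize). The tokenizer, the limit of 15 words, the 0.4 value for unknown words, and the clamping to the range 0.01 to 0.99 are all illustrative assumptions, not details from the white paper.

```python
import math
import re


def tokenize(text):
    # Assumption: a deliberately simple tokenizer; real filters also use
    # headers, HTML fragments, and other tokens.
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def spam_score(text, word_probs, top_n=15, unknown=0.4):
    """Combine per-word spam probabilities into one message score.

    word_probs maps a word to its spam probability (hypothetical database).
    Only the top_n words furthest from the neutral value 0.5 are used.
    """
    probs = [min(0.99, max(0.01, word_probs.get(tok, unknown)))
             for tok in tokenize(text)]
    # "Most relevant" is taken to mean "furthest from 0.5" (an assumption).
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top_n]
    if not probs:
        return 0.5  # nothing to go on: treat the message as neutral
    # Work in log space so that many small factors do not underflow.
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def is_spam(text, word_probs, threshold=0.9):
    # 0.9 is the example threshold mentioned in the text.
    return spam_score(text, word_probs) > threshold


# Hypothetical word probabilities for illustration only.
word_probs = {"mortgage": 0.8889, "free": 0.95, "meeting": 0.05}
print(is_spam("Free mortgage", word_probs))     # True
print(is_spam("Mortgage meeting", word_probs))  # False
```

In the second message the strongly legitimate word "meeting" outweighs "mortgage", so the score stays well under the threshold.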

Bayesian filtering, if implemented the right way and tailored to an organization, is the most effective technology to combat spam. The downside to this technique is that we have to wait at least two weeks for it to learn and build the ham and spam databases. Nevertheless, over time, the Bayesian filter becomes more and more effective as it learns about the organization’s email habits and is updated from other anti-spam databases.

GFI (2008). Why Bayesian filtering is the most effective anti-spam technology [White paper]. Retrieved from http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf


  1. Google, for example, uses the emails that users flag as spam (by clicking the "Report Spam" button) to strengthen their spam filter. Other free services began doing this as well after seeing Google's initial success with a Bayesian spam filter.

  2. I wonder if it ever gets applied in reverse - looking for words common to 'ham' emails so that, even if a message contains "mortgage", if it also contains "Christmas" or "family" or something, it would counter the effects of the spam-words.

    1. What Google does in that type of case is train the filter. The more examples it has of spam and its interaction with ham, the better job it can do of detecting spam.