Description of b8

Last updated on 19 June 2018

From the b8 README.

What is b8?

b8 is a spam filter implemented in PHP. It is intended to keep your weblog or guestbook spam-free.

The filter can be used anywhere in your PHP code and tells you whether a text is spam or not, using statistical text analysis.

In short: you give b8 a text and it returns a value between 0 and 1, indicating ham when the value is near 0 and spam when it is near 1. To be able to do this, b8 first has to learn some spam and some ham (non-spam) texts. If it misclassifies an unknown text, or the result is not distinct enough, b8 can be told what the text actually is, so it gets better with each learned text.

b8 is a statistical spam filter. I'm not a mathematician, but as far as I can grasp it, the math used in b8 does not have much to do with Bayes' theorem itself, so I call it a "statistical" spam filter rather than a "Bayesian" one. In principle, it's a program like Bogofilter or SpamBayes, but it is not intended to classify emails. Therefore, b8 works slightly differently from email spam filters. See "What's different?" if you're interested in the details.

An example of what we're talking about here:

At the time of this writing (November 2012), b8 has, since December 2006, classified 26869 guestbook entries and weblog comments on my homepage. 145 of them were ham. 76 spam texts (0.28 %) were falsely rated as ham (false negatives) and I had to remove them manually. Only one single ham message has been falsely classified as spam (a false positive), back in June 2010, but -- in b8's defense -- this was the very first English ham text I got. Up to then, every one of the 15024 English texts posted had been spam.

Texts with Chinese, Japanese or Cyrillic content (all of them spam as well) did not appear until 2011. This results in a sensitivity of 99.72 % (the probability that a spam text is actually rated as spam) and a specificity of 99.31 % (the probability that a ham text is actually rated as ham) for my homepage. Before the one false positive, of course, the specificity was 100 % ;-)

How does it work?

In principle, b8 uses the math and technique described in Gary Robinson's articles "A Statistical Approach to the Spam Problem" [^1]
and "Spam Detection"[^2]. The "degeneration" method Paul Graham proposed in "Better Bayesian Filtering"[^3] has also been implemented.

b8 cuts the text to classify into pieces, extracting things like email addresses, links and HTML tags as well as normal words. For each such token, it calculates the probability that a text containing it is spam, based on what the filter has learned so far. When a token has not been seen before, b8 tries to find similar ones using "degeneration" and uses the most relevant value found. If nothing at all is found, b8 assumes a default rating for this token in the further calculations. Then, b8 takes the most relevant values (those with a rating far from 0.5, which would mean we don't know what the token is) and calculates the combined probability that the whole text is spam.
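The core of this calculation can be sketched as follows. This is a minimal, illustrative Python rendition of the math from Robinson's articles, not b8's actual PHP code; the function names and the `rob_s`/`rob_x` parameter names are assumptions modeled on Robinson's notation.

```python
import math

def token_probability(spam_count, ham_count, total_spam, total_ham,
                      rob_s=0.3, rob_x=0.5):
    """Robinson's f(w): a smoothed per-token spam probability."""
    # Raw estimate based on relative frequencies in each corpus.
    spam_ratio = spam_count / total_spam if total_spam else 0.0
    ham_ratio = ham_count / total_ham if total_ham else 0.0
    if spam_ratio + ham_ratio == 0.0:
        return rob_x  # token never seen: fall back to the default rating
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Shrink rarely seen tokens towards the prior rob_x.
    return (rob_s * rob_x + n * p) / (rob_s + n)

def combine(probabilities):
    """Robinson's geometric-mean combination, yielding a 0..1 rating."""
    n = len(probabilities)
    P = 1.0 - math.prod(1.0 - p for p in probabilities) ** (1.0 / n)
    Q = 1.0 - math.prod(probabilities) ** (1.0 / n)
    # The indicator S: 0.5 means "undecided", near 1 means spam.
    return (1.0 + (P - Q) / (P + Q)) / 2.0
```

With no evidence either way (all token probabilities at 0.5), `combine` returns exactly 0.5; a set of strongly spammy tokens pushes the result towards 1.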

What's different?

b8 has been designed to classify forum posts, weblog comments or guestbook entries, not emails. For this reason, it uses a slightly different technique than most other statistical spam filters.

My experience was that spam entries on my weblog or guestbook were often quite short, sometimes just something like "123abc" as the text plus a link to a dubious homepage. Some spam bots did not even distinguish between e.g. the "name" and "text" fields and posted their text as the email address. Considering this, b8 simply takes one string to classify, making no distinction between "headers" and "text". The other difference is that most statistical spam filters count each token once, no matter how often it appears in the text (as Paul Graham describes it in[^4]). b8 does count how often a token has been seen, and both learns and considers this count. Why? Because a text containing one link (no matter where it points to, indicated just by a "http://" or a "www.") might not be spam, but a text containing 20 links might well be.
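The difference between the two counting schemes can be illustrated with a toy tokenizer. This is an assumption-laden sketch, not b8's lexer: real link extraction is more involved, and the `URL` marker name is made up here.

```python
import re
from collections import Counter

def tokenize(text):
    """Crude stand-in for b8's lexer (illustrative only)."""
    # Collapse every link to a single 'URL' marker, so a text with
    # 20 links yields that marker 20 times.
    text = re.sub(r'(?:https?://|www\.)\S+', 'URL', text)
    return text.split()

spammy = "buy now " + "http://spam.example/x " * 20

graham_view = set(tokenize(spammy))      # each token at most once
b8_view = Counter(tokenize(spammy))      # occurrence counts are kept
```

In the set-based view the 20 links collapse into a single `URL` token, while the counted view preserves the fact that the text is link-stuffed.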

This means that b8 might be good for classifying weblog comments, guestbook entries or forum posts (I really think it is ;-)) -- but very likely, it will work quite poorly when used for something else, like classifying emails. At least with the default lexer. But as said above, there are lots of very good filters out there to choose from for that task.

Tips on operation

Before b8 can decide whether a text is spam or ham, you have to tell it what you consider spam or ham. At least one learned spam or one learned ham text is needed to calculate anything; with nothing learned, b8 rates everything with 0.5 (or whatever `rob_x` has been set to).

To get good ratings, you need both learned ham and learned spam texts, the more the better. What counts as ham or spam can differ greatly depending on where the filter is deployed. On my homepage, practically every text posted in English or using non-latin-1 letters is spam. On an English or Russian homepage, this will not be the case. So I don't think it's really meaningful to provide some "spam data" to start with. Just train b8 with "your" spam and ham.

For practical use, I advise giving the filter all available data. E.g. name, email address, homepage and of course the text itself should be assembled in one variable (e.g. separated by a `\n` or just a space or tab after each block) and then classified. Learning should also be done with all available data. Saving the IP address is probably only meaningful for spam entries, because spammers often use the same IP address multiple times. In principle, you can leave out the IP of ham entries.
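Assembling the fields could look like this; the field names are illustrative and should match whatever your form actually provides.

```python
def assemble(name, email, homepage, text):
    """Join all posted fields into the single string given to the filter."""
    # One block per field, separated by newlines, as suggested above.
    return "\n".join([name, email, homepage, text])
```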

You can use b8 e.g. in a guestbook script and let it classify the text before saving it. Everyone has to decide for themselves which rating is needed to classify a text as "spam", but a rating of >= 0.8 seems reasonable to me. If you expect the spam to be in a different language than the ham entries, or the spam texts are normally very short, you could also consider a limit of 0.7. The email filters out there mostly use > 0.9 or even > 0.99; but keep in mind that they have much more data to analyze in most cases. A guestbook entry may be quite short, especially when it's spam.

In my opinion, an autolearn function is very handy. I automatically save messages with a rating higher than 0.7 but less than 0.9 as spam. I don't do this with ham messages in an automated way, to prevent the filter from saving a false negative as ham and then classifying and learning all the spam as ham while I'm on holiday ;-)
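The policy described above can be sketched as a small decision function. The thresholds are the author's personal choices as stated in the text, not b8 defaults, and the function name is made up for illustration.

```python
SPAM_CUTOFF = 0.8  # rating at or above which the entry is treated as spam

def classify_and_autolearn(rating):
    """Return (is_spam, autolearn_as_spam) for a given b8 rating."""
    is_spam = rating >= SPAM_CUTOFF
    # Autolearn only borderline spam, ratings in (0.7, 0.9): clearly
    # rated spam would just bloat the database, and ham is never
    # autolearned so a false negative cannot teach the filter that
    # spam is ham.
    autolearn = 0.7 < rating < 0.9
    return is_spam, autolearn
```

A rating of 0.95 is blocked but not relearned; 0.85 is blocked and fed back; anything clearly hammy is left alone.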

Learning spam or ham that has already been rated very high or low will not improve spam detection (b8 could already classify the text correctly!) but will probably only bloat the database. So don't do that.

Tobias Leupold

References

[^1]: Gary Robinson, A Statistical Approach to the Spam Problem http://linuxjournal.com/article/6467
[^2]: Gary Robinson, Spam Detection (from the web archive)
[^3]: Paul Graham, Better Bayesian Filtering http://paulgraham.com/better.html
[^4]: Paul Graham, A Plan For Spam http://paulgraham.com/spam.html
