Content-based spam filtering is a dead-end path

by Andy Lester

In the arms race of spam prevention, content-based filters, including any Bayesian ones you care to throw at it, have been beaten. Until we get truly intelligent recognition, where a computer is smart enough to know that a subject of "She will love you for it" is Viagra spam, and that "I was at the end of my rope until I found this" is some money scam, the spammers will be able to get any content past the filters.


In addition to the tricks discussed in the ActiveState Field Guide To Spam, spammers have already started foiling the filters by throwing in random real words. I regularly get spam through two levels of filtering (SpamAssassin and Eudora) that looks like this:



Our rates are the lowest! You can get 3.45% fixed for
rough pencil final happy
30-years! Follow this link to get the best rates
napkins canine amazed
in the country, but only for a limited time!



The extra random non-spam text foils the filters. And, since the words are random, tactics to get a checksum or signature on the message are, or will be, useless. I suspect it won't be long before spam comes through with three lines of spam content and a couple K of random words. If we get to where words that are clearly random are somehow caught, then the spammers will turn to pulling random pages off the net for their obscuring text. Maybe they'll throw in, say, a few pages of Macbeth to foil things.
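To make the checksum point concrete, here is a minimal Python sketch. It assumes an exact-match signature scheme (a SHA-256 hash of the message body), which is a simplification of what real signature services do; the names `signature` and `add_filler` and the filler word list are made up for illustration.

```python
import hashlib
import random

def signature(body):
    """Exact-match signature: SHA-256 hex digest of the message body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def add_filler(pitch_lines, filler_words, rng, words_per_line=3):
    """Interleave each line of the pitch with a line of random filler words."""
    out = []
    for line in pitch_lines:
        out.append(line)
        out.append(" ".join(rng.choice(filler_words) for _ in range(words_per_line)))
    return "\n".join(out)

pitch = [
    "Our rates are the lowest! You can get 3.45% fixed for",
    "30-years! Follow this link to get the best rates",
    "in the country, but only for a limited time!",
]
# Hypothetical filler vocabulary; a spammer would draw from a full dictionary.
filler = ["rough", "pencil", "final", "happy", "napkins", "canine", "amazed"]

rng = random.Random()
copy_a = add_filler(pitch, filler, rng)
copy_b = add_filler(pitch, filler, rng)

# Both copies carry the same pitch, but each gets its own filler, so a
# signature taken from one copy will almost never match the other.
print(signature(copy_a))
print(signature(copy_b))
```

The pitch survives intact in every copy, while the hash changes whenever even one filler word changes.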



The answer is to stop the spammers before they get their message in. All content-based filtering depends on the spammer getting their payload to us first, instead of checking them at the gate. This will mean a replacement of SMTP. Until then, SPF seems to have potential, but it has its drawbacks.



Mind you, I'm not throwing away my SpamAssassin install. It helps stop a significant amount of the spam. Unfortunately, content-based filtering is a Band-Aid on the real problem.

Do you see any solution outside of replacing SMTP?


17 Comments

anonymous2
2003-10-03 16:16:56
content filtering not by words but by math?
Some frequency analysis may help. The emails are converted to frequencies and run through some Fourier analysis. Random words will show up as noise, while the spam content will show up as a regular pattern. To counteract spammers changing single words, the filter can use a probability or statistical closeness to a frequency standard. For further comparison, Fourier waveforms of real email of varying content, length, etc. can also be used.


Another possibility is to use the Fourier analysis to strip out random words and then run the resulting message through a content filter.


Interesting problem. The key I think is identifying the characteristics of spam the spammers can't change.


anonymous2
2003-10-03 17:48:13
Well
First off, these filters generally employ the Naive Bayes algorithm; they are not a 'Bayesian filter.' Being a Bayesian filter only means that the prediction algorithm is trained via Bayes' Rule. Second, what do you think you're doing when YOU read the email and realize it's spam? Content-based filtering. What you're really saying is that Naive Bayes is dead, which may or may not be true. Naive Bayes is something like 30-40 years old now (AFAIK it's been invented a couple of times, so it depends on how you count), and text categorization is an active field of research. Naive Bayes just happens to be (a) very easy to implement and (b) able to do a Pretty Good Job.


Also, using two NB filters probably won't give you much additional power since you'll be depriving the second tier of training data for the filter. If you could pool training data across filters that might be different (the Boosting people do something like that I think---they somehow manage to chain together a bunch of poorly performing machine learning algorithms into one that does a Really Good Job. It doesn't seem like it should work, but somehow it does.)

anonymous2
2003-10-03 17:53:21
content filtering not by words but by math?
How exactly would you convert the emails into frequencies?


In any case, "frequency analysis" is what is already going on: spam is classified via the probability of a document being spam given the set of words it contains. This probability is calculated from the observed frequency of words in a set of documents with known Spam and Not Spam labels. (Modulo what's called a pseudocount, because saying that some word has a probability of 0 in, say, the set of Spam documents is a much different statement than saying you've never observed that particular word in a Spam document.)
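A minimal Python sketch of the calculation described above: a word-count Naive Bayes classifier with a pseudocount. It is an illustration under simplifying assumptions (whitespace tokenization, a four-message toy corpus, a pseudocount of 1), not the code of any particular filter; `train` and `spam_probability` are hypothetical names.

```python
import math
from collections import Counter

def train(docs):
    """docs: iterable of (label, text) pairs, label being 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def spam_probability(text, word_counts, doc_counts, pseudocount=1.0):
    """P(spam | words) under the Naive Bayes independence assumption."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # The extra +1 reserves one slot for words never seen in training.
        denom = total_words + pseudocount * (len(vocab) + 1)
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # The pseudocount keeps an unseen word from forcing a zero probability.
            score += math.log((word_counts[label][word] + pseudocount) / denom)
        log_score[label] = score
    # Normalize in log space to avoid underflow on long messages.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])

# Toy training set, invented for illustration.
corpus = [
    ("spam", "lowest rates limited time"),
    ("spam", "best rates follow link"),
    ("ham", "meeting notes for tomorrow"),
    ("ham", "lunch tomorrow at noon"),
]
word_counts, doc_counts = train(corpus)
print(spam_probability("lowest rates", word_counts, doc_counts))
print(spam_probability("meeting tomorrow", word_counts, doc_counts))
```

Without the pseudocount, a single word that never appeared in the spam training set would drive the spam score to zero (and `math.log(0)` would raise an error), no matter how spammy the rest of the message was.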

anonymous2
2003-10-03 18:59:13
Well
Oh yeah,


- Byron "Too Lazy to Register" Ellis :-)

anonymous2
2003-10-03 19:31:34
EMI or proximity filtering?
Well, maybe it's time to look at the stuff that is more computationally intensive? Maybe some sort of knn or clustering approach?


Maybe neural networks? Maybe some sort of constraint satisfaction system?


Interesting field to take a look at ...

anonymous2
2003-10-03 20:36:51
SpamBayes works for me!
Sorry, I don't accept the premise that Bayesian filtering has lost the arms race. My SpamBayes setup almost never gets fooled by the irrelevant word dodge. Three days after purging junk, right now I have 581 definite spams (never had a false positive!) in the Junk folder and 15 possible spams in the MaybeSpam folder, about half of which are real spam. Out of hundreds of spams a day, maybe 90% of my total email volume, it is very rare to see spam in my inbox after SpamBayes filters it.


I'll admit that the percentage of real messages that end up in the MaybeSpam folder has gone up over the 8 months or so I've been using SpamBayes. Probably some more sophisticated logic or easier control over the weights could improve that. My only complaint is that I can't get the Powers that Be to use content filtering as powerful as SpamBayes provides on the server side to keep me from having to download the stuff that is virtually certain to be spam.


I think you might want to get a better Bayesian filter rather than give up on content filtering.

anonymous2
2003-10-04 01:01:51
Are you using the SpamAssassin network tests?
The SpamAssassin network tests, which require the Net::DNS Perl module, also help with getting a more correct "spamminess" score. Which, if any, of the network tests do you use, Andy: Razor2, pyzor, dcc, which of the RBLs, etc.? For me, a combination of many spam-scoring techniques has made my email life much more bearable. I agree that content-based filtering *alone* is not enough!

anonymous2
2003-10-04 04:57:16
Are you using the SpamAssassin network tests?
How does one set these up? I'm running a Red Hat box that let me install SpamAssassin as an option at install time.

anonymous2
2003-10-04 09:24:58
SpamBayes works for me!
I agree, this article is ridiculous. I'm using spambayes as well and having no problems with spam. No false positives, very few spams showing up in my unsure folder, and even fewer in my inbox.


However, I would caution against using spambayes on the server side. Part of the point is that it learns what is spam for YOU, not for a large group of people. I have to believe that's part of why it works so well.

pudge
2003-10-05 09:04:59
SpamBayes works for me!
Yeah, I dunno ... Bayesian filters for me in Eudora 6 have well over 99% accuracy, in both positives and negatives. Is it ideal? No. Would I rather stop spam through reasonable authentication? Hell yes. But Bayesian filters are preferable to traditional blacklisting and whitelisting and just about any other method I've tried.

Jonathan Gennick
2003-10-05 11:54:17
Already using random passages of real text
Andy, you mention the possibility of using random passages of text from Macbeth. Spammers are already doing this. I regularly get HTML emails in which the text content is a passage taken from a real book. Just today, I received the following in the text side of an HTML message:



Carl Jung, an early Freudian disciple and later heretic, extended this
model of memory by adding another area of repressed memories to the
...


The html side of this message was a "refinance the house" spam. My guess is that the short passage from a real book is there to make the spam look less like spam to the filters. I get quite a lot of this type of message, actually.

stan_krute
2003-10-05 12:25:55
Sorry Andy, but you're wrong
I use POPFile, an open-source
Bayesian filter that's quite
good at dealing with spammer
tricks.


It's running at over 99% accuracy
for the past 9 months.


(disclosure: I've contributed some
code to POPFile).


anonymous2
2003-10-06 11:36:28
To the point
Why do spammers send you messages? It's not for the sheer fun of it; they want to make a profit. How do they make a profit? By directly or indirectly selling something (perhaps only a validated address). Can they sell you that something right in your mail client? No. So this is their weakest point: either there is


(1) a clickable (or plain) hyperlink to some other page, or
(2) an embedded image pointing to another server (and thus validating your address upon loading)


99.99 percent of the latest spam in my inbox has either one.


Spammers might insert fake, bogus or copied text which might fool filters, but they will never be able to get their selling points hosted and thus validated at trusted sites.


Yet I wonder why so many spam detection engines rely on scanning _all_ content, instead of concentrating on URL/URI validation. Do I receive messages from friends with embedded URLs? Yes. Do they ever upload holiday images to free.hosting.site.ru/maria2003 ? No.
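A rough Python sketch of the URL-centered check suggested here: pull every link or image URL out of the message body and flag the hosts that are not on a list of trusted domains. The regular expression is deliberately simple, and the trusted-domain list, the function names, and the sample message are assumptions for illustration only.

```python
import re
from urllib.parse import urlparse

# Simplified pattern: matches http(s) URLs in plain text or HTML attributes.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def extract_hosts(message_body):
    """Return the set of hostnames referenced by URLs in the message."""
    hosts = set()
    for url in URL_PATTERN.findall(message_body):
        # Drop sentence punctuation that often trails a URL in plain text.
        host = urlparse(url.rstrip(".,;:!?)")).hostname
        if host:
            hosts.add(host.lower())
    return hosts

def untrusted_hosts(message_body, trusted_domains):
    """Hosts that neither equal nor are subdomains of a trusted domain."""
    def is_trusted(host):
        return any(host == d or host.endswith("." + d) for d in trusted_domains)
    return {h for h in extract_hosts(message_body) if not is_trusted(h)}

# Hypothetical whitelist and message.
trusted = {"example.org"}
body = (
    'Holiday pics: <img src="http://free.hosting.site.ru/maria2003/pic.gif"> '
    "and the rest are at http://www.example.org/photos."
)
print(untrusted_hosts(body, trusted))
```

A filter could then score a message by how many of its hosts are untrusted, rather than by the surrounding text the spammer controls completely.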

anonymous2
2003-10-17 12:11:31
White list with autoreply and code word
I block spam by allowing in only messages from people on my address book. If anyone else writes to me, Eudora autoreplies with an apology and tells the writer to try again with the code word in the subject line. If Eudora sees the code word in the subject line, it accepts the message without fuss. I can then add that person to my address list.


Slightly inconvenient for people the first time they write to me, but very effective. Would this work for you?
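The rule described above can be sketched in a few lines of Python. This is not how Eudora implements its filters; it only illustrates the decision logic, and the function name, addresses, and code word are all made up.

```python
from email.utils import parseaddr

def triage(from_header, subject, address_book, code_word):
    """Return 'accept' or 'autoreply' for an incoming message.

    address_book is a set of lowercase email addresses.
    """
    sender = parseaddr(from_header)[1].lower()
    if sender in address_book:
        return "accept"
    # Unknown sender: let the message in only if the subject has the code word.
    if code_word.lower() in subject.lower():
        return "accept"
    # Otherwise reply with an apology asking the sender to retry with the code word.
    return "autoreply"

# Hypothetical address book and code word.
address_book = {"alice@example.com"}
print(triage("Alice <alice@example.com>", "Lunch?", address_book, "aardvark"))
print(triage("Bob <bob@example.net>", "Hello", address_book, "aardvark"))
print(triage("Bob <bob@example.net>", "Hello [aardvark]", address_book, "aardvark"))
```

One design caveat: the From header is trivially forged, so this scheme works only as long as spammers do not guess an address that is already in the address book.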

xamde
2003-12-26 12:41:26
To the point
Exactly! Good filters should concentrate MOSTLY on the embedded URLs. It would be nice to launch DDoS attacks from a network of filter clients against the most evil spammers. OK, that's not legal, but spam makes me angry.

elleirdad
2004-01-31 06:29:04
SpamBayes works for me!
I have been using a version of SpamBayes named InBoxer for six months now. It remains accurate and really protects me from the recent outbreaks of mail.


Here in the U.K., Personal Computer World just did a major review of anti-spam software. SpamBayes, Spam Assassin, InBoxer and 7 others were compared. They tested on over 1000 messages. The results:


PERCENTAGE OF SPAM KEPT FROM INBOX:


SpamBayes 97.94%
InBoxer 97.07%
Spam Assassin 93.72%


TOTAL INTERVENTIONS REQUIRED (a misfiled message either false-positive or false-negative):


InBoxer 37
SpamBayes 80
Spam Assassin 80


They gave both Editor's Choice and Best Buy awards to InBoxer (http://www.inboxer.com), in large part because it worked better than SpamBayes and Spam Assassin out of the box. SpamBayes was given the Great Value award.

Morat
2006-05-20 10:04:55
I can't agree with this post at all. I've been using a naive bayesian filter for some time and it has no trouble with these random words tactics at all. There's enough spam content for them to keep on tripping the filter, whether it be in headers or the message itself.