[Caml-list] posting policy and spam
Date: 2004-01-04 (16:22)
From: Scott Alexander <salex@d...>
Subject: Re: [Caml-list] posting policy and spam
On Sun, 2004-01-04 at 05:23, Richard Zidlicky wrote:
> On Sun, Jan 04, 2004 at 12:28:37AM +0100, Sven Luther wrote:
> > On Sat, Jan 03, 2004 at 10:24:49AM +0100, Xavier Leroy wrote:
> > > There have been several complains recently about spam getting through
> > > the caml-list.
> > >
> > > For your information, the list is filtered through SpamOracle, and the
> > > posting address receives several hundred spams a day. Due to spammers
> > > getting more clever, the efficiency of the filtering went from perfect
> > > to about 99%. That's enough to let significant amounts of spam slip
> > > through.
> >
> > Well, on a similar subject, is there any chance of implementing a
> > workaround in spamoracle to counter those spams specifically designed to
> > fool the bayesian filters ? You know, those who have 4 lines of random
> > words in a text attachement, and then some html spam.
> >
> > I don't know if the bayesian filters or a modification thereof is able
> > to counter this kind of email, but i don't think so.
>
> n-grams should be able to cope with the random words. There is already
> at least one library at sf implementing them so I am not sure it is
> worth to reimplement it in spamoracle.

FWIW, I've found the Bayesian stuff to do pretty well even with random
words, given enough training. (I'm using spambayes, if it matters.) Most
of the random words they pick aren't in my common words list, as it turns
out, and many of the words in their actual message are in my spam list.
(Obviously, this isn't a correct statement of how the algorithm actually
works, but I think it gives the right idea.)

After reading Paul Graham's look back on Bayesian filtering after a year
(http://www.paulgraham.com/sofar.html), I looked more closely at how some
of my spam and ham were scoring. Looking at the misspelling approach, I
currently score "viagra" as 0.974978, "vi@gra" as 0.844828, and "v1@gra"
as 0.908163.
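To make the intuition concrete, here is a minimal sketch of the combining rule described in Paul Graham's essays, applied to the three token scores quoted above. It is an illustration only and is not a statement of how spambayes or SpamOracle combine scores. The function name `combine`, the 15-token cutoff, and the 0.4 score for unseen filler words are assumptions taken from Graham's description, and the "message" of twenty filler words is hypothetical.

```python
from math import prod


def combine(probs, max_tokens=15):
    """Combine per-token spam probabilities, Graham style (illustrative)."""
    # Keep only the tokens whose score is furthest from the neutral 0.5.
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:max_tokens]
    spam = prod(interesting)                  # evidence for spam
    ham = prod(1.0 - p for p in interesting)  # evidence for ham
    return spam / (spam + ham)


# Token scores quoted in the message.
scores = {"viagra": 0.974978, "vi@gra": 0.844828, "v1@gra": 0.908163}

# Hypothetical padding: 20 random words never seen before, each given an
# assumed near-neutral score of 0.4.
filler = [0.4] * 20

spam_only = combine(list(scores.values()))
with_filler = combine(list(scores.values()) + filler)

print(f"spam tokens only:     {spam_only:.4f}")
print(f"with 20 filler words: {with_filler:.4f}")
```

Under these assumptions the filler lowers the combined score, but it stays above 0.9: only the 15 most extreme tokens are counted, and a word scored 0.4 carries far less weight than one scored above 0.9.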
As for random words, looking through my list of messages to be trained, I
have a typical spam titled "Re: YGOCP, to the procurator". Despite a long
list of random words and bogus tags breaking up the message
("<p>O</rigid>ur U</immature>S Li</prominent>censed Doc</shepherd>tors
wi</calve>ll<BR> Prescr</violate>ibes Y</esophagi>our Me</antonym>dication
F</eigenvector>or F</irreversible>ree"), it scores as 99.79%. Not only
does it have some URL elements (like biz) which are high on my spam list,
but some of the random words have themselves become spam identifiers
(euclid, metalwork, adequacy, bourgeoisie, cornish, rectilinear). It did
hit a few words on the ham list (oregon, weird, and laminar appear in spam
for the first time with this message), but not enough to be significant.

I do train on (almost) every message that I receive and have done so for
several months. According to the statistics section I have "Total emails
trained: Spam: 3893 Ham: 12685". And I am having a false positive problem
with Caml-list after the rash of spams. It seems to be getting close to
being trained back, but Caml-list is a relatively low-volume list for me.

Anyway, enough nattering on. I'm amazed by the Bayesian stuff and find it
interesting.

Best,
Scott
--
Scott Alexander <salex@dsl.cis.upenn.edu>

-------------------
To unsubscribe, mail caml-list-request@inria.fr
Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs
FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners