Setting this up has been a bitch.  All early attempts floundered because it
turned out there was *some* systematic difference between the ham and spam
archives that made the job trivial.

The ham archive:  I selected 20,000 messages, and broke them into 5 sets of
4,000 each, at random, from a python-list archive Barry put together,
containing msgs only after SpamAssassin was put into play on python.org.
It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs.
As will be seen below, it's not clean enough.

The spam archive:  This is essentially all of Bruce Guenter's 2002 spam
collection, at <http://www.em.ca/~bruceg/spam/>.  It was broken at random
into 5 sets of 2,750 spams each.

Problems included:

+ Mailman added distinctive headers to every message in the ham archive,
  which appear nowhere in the spam archive.  A Bayesian classifier picks up
  on that immediately.

+ Mailman also adds "[name-of-list]" to every Subject line.

+ The spam headers had tons of clues about Bruce Guenter's mailing
  addresses that appear nowhere in the ham headers.

+ The spam archive has Windows line ends (\r\n), but the ham archive plain
  Unix \n.  This turned out to be a killer clue(!) in the simplest
  character n-gram attempts.  (Note:  I can't use text mode to read msgs,
  because there are binary characters in the archives that Windows treats
  as EOF in text mode -- indeed, 400MB of the ham archive vanishes when
  read in text mode!)

What I'm reporting on here is after normalizing all line-ends to \n, and
ignoring the headers *completely*.  There are obviously good clues in the
headers; the problem is that they're killer-good clues for accidental
reasons in this test data.  I don't want to write code to suppress these
clues either, as then I'd be testing some mix of my insights (or lack
thereof) with what a blind classifier would do.  But I don't care how good
I am, I only care about how well the algorithm does.

Since it's ignoring the headers, I think it's safe to view this as a lower
bound on what can be achieved.  There's another way this should be a lower
bound:

    def tokenize_split(string):
        for w in string.split():
            yield w

    tokenize = tokenize_split

    class Msg(object):
        def __init__(self, dir, name):
            path = dir + "/" + name
            self.path = path
            f = file(path, 'rb')
            guts = f.read()
            f.close()

            # Skip the headers.
            i = guts.find('\n\n')
            if i >= 0:
                guts = guts[i+2:]

            self.guts = guts

        def __iter__(self):
            return tokenize(self.guts)

This is about the stupidest tokenizer imaginable, merely splitting the body
on whitespace.

Here's the output from the first run, training against one pair of spam+ham
groups, then seeing how its predictions stack up against each of the four
other pairs of spam+ham groups:

    Training on Data/Ham/Set1 and Data/Spam/Set1 ... 4000 hams and 2750 spams
    testing against Data/Spam/Set2 and Data/Ham/Set2
    tested 4000 hams and 2750 spams
        false positive: 0.00725  (i.e., under 1%)
        false negative: 0.0530909090909  (i.e., over 5%)

    testing against Data/Spam/Set3 and Data/Ham/Set3
    tested 4000 hams and 2750 spams
        false positive: 0.007
        false negative: 0.056

    testing against Data/Spam/Set4 and Data/Ham/Set4
    tested 4000 hams and 2750 spams
        false positive: 0.0065
        false negative: 0.0545454545455

    testing against Data/Spam/Set5 and Data/Ham/Set5
    tested 4000 hams and 2750 spams
        false positive: 0.00675
        false negative: 0.0516363636364
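For concreteness, the driver loop behind output of this shape is roughly as
below.  This is only a sketch, not the actual Tester.py code; the learn() /
spamprob() interface on the GrahamBayes class (more on it later) and the
spam cutoff are assumptions here, and the Msg class and tokenizer are the
ones defined above.  The rates themselves are just error counts divided by
messages tested -- e.g., 0.00725 is 29 of the 4,000 hams judged spam, and
0.0530909... is 146 of the 2,750 spams judged ham.

    import os

    SPAM_CUTOFF = 0.90   # assumed cutoff; scores above this are called spam

    def msgs(dir):
        for name in os.listdir(dir):
            yield Msg(dir, name)

    def test_pair(classifier, spam_dir, ham_dir):
        # Count mistakes on a spam/ham set pair the classifier never
        # trained on.
        nham = nspam = fp = fn = 0
        for m in msgs(ham_dir):
            nham += 1
            if classifier.spamprob(m) > SPAM_CUTOFF:
                fp += 1        # a ham judged to be spam
        for m in msgs(spam_dir):
            nspam += 1
            if classifier.spamprob(m) <= SPAM_CUTOFF:
                fn += 1        # a spam judged to be ham
        print("tested %d hams and %d spams" % (nham, nspam))
        print("    false positive: %g" % (float(fp) / nham))
        print("    false negative: %g" % (float(fn) / nspam))

    classifier = GrahamBayes()         # interface assumed, not verbatim
    for m in msgs('Data/Ham/Set1'):
        classifier.learn(m, False)     # ham
    for m in msgs('Data/Spam/Set1'):
        classifier.learn(m, True)      # spam
    for i in (2, 3, 4, 5):
        test_pair(classifier, 'Data/Spam/Set%d' % i, 'Data/Ham/Set%d' % i)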
It's a Good Sign that the false positive/negative rates are very close
across the four test runs.  It's possible to quantify just how good a sign
that is, but they're so close by eyeball that there's no point in bothering.

This is using the new Tester.py in the sandbox, and that class
automatically remembers the false positives and negatives.  Here's the
start of the first false positive from the first run:

"""
It's not really hard!!  Turn $6.00 into $1,000 or more...read this to find
out how!!

READING THIS COULD CHANGE YOUR LIFE!!  I found this on a bulletin board and
decided to try it.  A little while back, while chatting on the internet, I
came across an article similar to this that said you could make thousands
of dollars in cash within weeks with only an initial investment of $6.00!
So I thought, "Yeah right, this must be a scam", but like most of us, I was
curious, so I kept reading.  Anyway, it said that you send $1.00 to each of
the six names and address stated in the article.  You then place your own
name and address in the bottom of the list at #6, and post the article in
at least 200 newsgroups (There are thousands) or e-mail them.  No
"""

Call me forgiving, but I think it's vaguely possible that this should have
been in the spam corpus instead <wink>.

Here's the start of the second false positive:

"""
Please forward this message to anyone you know who is active in the stock
market.  See Below for Press Release

xXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX

Dear Friends,

I am a normal investor same as you.  I am not a finance professional nor
am I connected to FDNI in any way.

I recently stumbled onto this OTC stock (FDNI) while searching through
yahoo for small float, big potential stocks.  At the time, the company had
released a press release which stated they were doing a stock buyback.
Intrigued, I bought 5,000 shares at $.75 each.  The stock went to $1.50 and
I sold my shares.  I then bought them back at $1.15.  The company then
circulated another press release about a foreign acquisition (see below).
The stock jumped to $2.75 (I sold @ $2.50 for a massive profit).  I then
bought back in at $1.25 where I am holding until the next major piece of
news.
"""

Here's the start of the third:

"""
Grand Treasure Industrial Limited

Contact Information

We are a manufacturer and exporter in Hong Kong for all kinds of plastic
products, We export to worldwide markets.
Recently , we join-ventured with a bag factory in China produce all kinds
of shopping , lady's , traveller's bags....  visit our page and send us
your enquiry by email now.

Contact Address :
Rm. 1905, Asian Trade Centre , 79 Lei Muk Rd, Tsuen Wan , Hong Kong.
Telephone : ( 852 ) 2408 9382
"""

That is, all the "false positives" there are blatant spam.  It will take a
long time to sort this all out, but I want to make a point here now:  the
classifier works so well that it can *help* clean the ham corpus!  I
haven't found a non-spam among the "false positives" yet.  Another lesson
reinforces one from my previous life in speech recognition:  rigorous data
collection, cleaning, tagging and maintenance is crucial when working with
statistical approaches, and is damned expensive to do.
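That suggests an easy way to chip away at the problem:  point the trained
classifier back at the ham sets and rank messages by how spammy they score;
whatever floats to the top gets a human look.  A rough sketch, reusing the
(assumed) spamprob() interface and the Msg class from above -- this is not
the project's actual cleaning tool:

    import os

    def suspects(classifier, ham_dir, n=20):
        # Score every "ham" and return the n most spam-like paths for a
        # human to eyeball.
        scored = []
        for name in os.listdir(ham_dir):
            m = Msg(ham_dir, name)
            scored.append((classifier.spamprob(m), m.path))
        scored.sort()
        scored.reverse()        # most spam-like first
        return scored[:n]

    for prob, path in suspects(classifier, 'Data/Ham/Set2'):
        print("%.3f %s" % (prob, path))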
Here's the start of the first "false negative" (including the headers):

"""
Return-Path: <911@911.COM>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 24322 invoked from network); 28 Jul 2002 12:51:44 -0000
Received: from unknown (HELO PC-5.) (61.48.16.65)
  by churchill.factcomp.com with SMTP; 28 Jul 2002 12:51:44 -0000
x-esmtp: 0 0 1
Message-ID: <1604543-22002702894513952@smtp.vip.sina.com>
To: "NEW020515" <911@911.COM>
From: "=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www.itdatabase.net =A3=A9" <911@911.COM>
Subject: =D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www.itdatabase.net =A3=A9
Date: Sun, 28 Jul 2002 17:45:13 +0800
MIME-Version: 1.0
Content-type: text/plain; charset=gb2312
Content-Transfer-Encoding: quoted-printable
Content-Length: 977

=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www=2Eitdatabase=2Enet =A3=
=A9=CC=E1=B9=A9=B4=F3=C1=BF=D3=D0=B9=D8=D6=D0=B9=FAIT/=CD=A8=D0=C5=CA=D0=B3=
=A1=D2=D4=BC=B0=C8=AB=C7=F2IT/=CD=A8=D0=C5=CA=D0=B3=A1=B5=C4=CF=E0=B9=D8=CA=
=FD=BE=DD=BA=CD=B7=D6=CE=F6=A1=A3 =B1=BE=CD=F8=D5=BE=C9=E6=BC=B0=D3=D0=B9=D8=
=B5=E7=D0=C5=D4=CB=D3=AA=CA=D0=B3=A1=A1=A2=B5=E7=D0=C5=D4=CB=D3=AA=C9=CC=A1=
"""

Since I'm ignoring the headers, and the tokenizer is just a whitespace
split, each line of quoted-printable looks like a single word to the
classifier.  Since it's never seen these "words" before, it has no reason
to believe they're either spam or ham indicators, and favors calling it
ham.
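That ham-favoring pull is easy to see in the arithmetic.  Graham's
published scheme gives a never-seen word a mildly hammy probability (he
used 0.4) and combines the most extreme word probabilities as
prod(p) / (prod(p) + prod(1-p)); I'm assuming this classifier does
something close to that.  A quick sketch of what happens to a body that
tokenizes into nothing but unknown quoted-printable gibberish:

    def graham_combine(probs):
        # Graham's combining rule:  prod(p) / (prod(p) + prod(1-p)).
        p = q = 1.0
        for x in probs:
            p *= x
            q *= 1.0 - x
        return p / (p + q)

    # 15 never-before-seen "words", each scored at the (assumed)
    # unknown-word probability of 0.4:
    print(graham_combine([0.4] * 15))   # ~0.0023 -- solidly ham

No matter how spammy such a message looks to a human, a score like that
can't get anywhere near any sane spam cutoff.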
One more mondo cool thing and that's it for now.  The GrahamBayes class
keeps track of how many times each word makes it into the list of the 15
strongest indicators.  These are the "killer clues" the classifier gets the
most value from.  The most valuable spam indicator turned out to be "<br>"
-- there's simply almost no HTML mail in the ham archive (but note that
this clue would be missed if you stripped HTML!).  You're never going to
guess what the most valuable non-spam indicator was, but it's quite
plausible after you see it.  Go ahead, guess.  Chicken <wink>.

Here are the 15 most-used killer clues across the runs shown above:  the
repr of the word, followed by the # of times it made it into the 15-best
list, and the estimated probability that a msg is spam if it contains this
word:

    testing against Data/Spam/Set2 and Data/Ham/Set2
    best discrimators:
        'Helvetica,' 243 0.99
        'object' 245 0.01
        'language' 258 0.01
        '<BR>' 292 0.99
        '>' 339 0.179104
        'def' 397 0.01
        'article' 423 0.01
        'module' 436 0.01
        'import' 499 0.01
        '<br>' 652 0.99
        '>>>' 667 0.01
        'wrote' 677 0.01
        'python' 755 0.01
        'Python' 1947 0.01
        'wrote:' 1988 0.01

    testing against Data/Spam/Set3 and Data/Ham/Set3
    best discrimators:
        'string' 494 0.01
        'Helvetica,' 496 0.99
        'language' 524 0.01
        '<BR>' 553 0.99
        '>' 687 0.179104
        'article' 851 0.01
        'module' 857 0.01
        'def' 875 0.01
        'import' 1019 0.01
        '<br>' 1288 0.99
        '>>>' 1344 0.01
        'wrote' 1355 0.01
        'python' 1461 0.01
        'Python' 3858 0.01
        'wrote:' 3984 0.01

    testing against Data/Spam/Set4 and Data/Ham/Set4
    best discrimators:
        'object' 749 0.01
        'Helvetica,' 757 0.99
        'language' 763 0.01
        '<BR>' 877 0.99
        '>' 954 0.179104
        'article' 1240 0.01
        'module' 1260 0.01
        'def' 1364 0.01
        'import' 1517 0.01
        '<br>' 1765 0.99
        '>>>' 1999 0.01
        'wrote' 2071 0.01
        'python' 2160 0.01
        'Python' 5848 0.01
        'wrote:' 6021 0.01

    testing against Data/Spam/Set5 and Data/Ham/Set5
    best discrimators:
        'object' 980 0.01
        'language' 992 0.01
        'Helvetica,' 1005 0.99
        '<BR>' 1139 0.99
        '>' 1257 0.179104
        'article' 1678 0.01
        'module' 1702 0.01
        'def' 1846 0.01
        'import' 2003 0.01
        '<br>' 2387 0.99
        '>>>' 2624 0.01
        'wrote' 2743 0.01
        'python' 2864 0.01
        'Python' 7830 0.01
        'wrote:' 8060 0.01

Note that an "intelligent" tokenizer would likely miss that the Python
prompt ('>>>') is a great non-spam indicator on python-list.  I've had this
argument with some of you before <wink>, but the best way to let this kind
of thing be as intelligent as it can be is not to try to help it too much:
it will learn things you'll never dream of, provided only you don't filter
clues out in an attempt to be clever.

everything's-a-clue-ly y'rs  - tim