Don't count words multiple times, and you'll probably get fewer false
positives. That's the main reason I don't do it-- because it magnifies the
effect of some random word like water happening to have a big spam
probability. (Incidentally, why so high? In my db it's only 0.3930784.)

--pg

--Tim Peters wrote:

> FYI. After cleaning the blatant spam identified by the classifier out of my
> ham corpus, and replacing it with new random msgs from Barry's corpus, the
> reported false positive rate fell to about 0.2% (averaging 8 per each batch
> of 4000 ham test messages). This seems remarkable given that it's ignoring
> headers, and just splitting the raw text on whitespace in total ignorance of
> HTML & MIME etc.
>
> 'FREE' (all caps) moved into the ranks of best spam indicators. The false
> negative rate got reduced by a small amount, but I doubt it's a
> statistically significant reduction (I'll compute that stuff later; I'm
> looking for Big Things now).
>
> Some of these false positives are almost certainly spam, and at least one is
> almost certainly a virus: these are msgs that are 100% base64-encoded, or
> maximally obfuscated quoted-printable. That could almost certainly be fixed
> by, e.g., decoding encoded text.
>
> The other false positives seem harder to deal with:
>
> + Brief HTML msgs from newbies. I doubt the headers will help these
>   get through, as they're generally first-time posters, and aren't
>   replies to earlier msgs. There's little positive content, while
>   all elements of raw HTML have high "it's spam" probability.
>
>   Example:
>
> """
> --------------=_4D4800B7C99C4331D7B8
> Content-Description: filename="text1.txt"
> Content-Type: text/plain; charset=ISO-8859-1
> Content-Transfer-Encoding: quoted-printable
>
> Is there a version of Python with Prolog Extension??
> Where can I find it if there is?
>
> Thanks,
> Luis.
>
> P.S. Could you please reply to the sender too.
>
>
> --------------=_4D4800B7C99C4331D7B8
> Content-Description: filename="text1.html"
> Content-Type: text/html
> Content-Transfer-Encoding: quoted-printable
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
> <HTML>
> <HEAD>
> <TITLE>Prolog Extension</TITLE>
> <META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">
> <META NAME=3D"CREATED" CONTENT=3D"19991127;12040200">
> <META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes">
> <META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700">
> </HEAD>
> <BODY>
> <PRE>Is there a version of Python with Prolog Extension??
> Where can I find it if there is?
>
> Thanks,
> Luis.
>
> P.S. Could you please reply to the sender too.</PRE>
> </BODY>
> </HTML>
>
> --------------=_4D4800B7C99C4331D7B8--
> """
>
> Here's how it got scored:
>
> prob = 0.999958816093
> prob('<META') = 0.957529
> prob('<META') = 0.957529
> prob('<META') = 0.957529
> prob('<BODY>') = 0.979284
> prob('Prolog') = 0.01
> prob('<HEAD>') = 0.97989
> prob('Thanks,') = 0.0337316
> prob('Prolog') = 0.01
> prob('Python') = 0.01
> prob('NAME=3D"GENERATOR"') = 0.99
> prob('<HTML>') = 0.99
> prob('</HTML>') = 0.989494
> prob('</BODY>') = 0.987429
> prob('Thanks,') = 0.0337316
> prob('Python') = 0.01
>
> Note that '<META' gets penalized 3 times. More on that later.
>
> + Msgs talking *about* HTML, and including HTML in examples. This one
>   may be troublesome, but there are mercifully few of them.
>
> + Brief msgs with obnoxious employer-generated signatures. Example:
>
> """
> Hi there,
>
> I am looking for you recommendations on training courses available in the UK
> on Python. Can you help?
>
> Thanks,
>
> Vickie Mills
> IS Training Analyst
>
> Tel: 0131 245 1127
> Fax: 0131 245 1550
> E-mail: vickie_mills@standardlife.com
>
> For more information on Standard Life, visit our website
> http://www.standardlife.com/ The Standard Life Assurance Company, Standard
> Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland
> (No SZ4) and regulated by the Personal Investment Authority. Tel: 0131 225
> 2552 - calls may be recorded or monitored. This confidential e-mail is for
> the addressee only. If received in error, do not retain/copy/disclose it
> without our consent and please return it to us. We virus scan all e-mails
> but are not responsible for any damage caused by a virus or alteration by a
> third party after it is sent.
> """
>
> The scoring:
>
> prob = 0.98654879055
> prob('our') = 0.928936
> prob('sent.') = 0.939891
> prob('Tel:') = 0.0620155
> prob('Thanks,') = 0.0337316
> prob('received') = 0.940256
> prob('Tel:') = 0.0620155
> prob('Hi') = 0.0533333
> prob('help?') = 0.01
> prob('Personal') = 0.970976
> prob('regulated') = 0.99
> prob('Road,') = 0.01
> prob('Training') = 0.99
> prob('e-mails') = 0.987542
> prob('Python.') = 0.01
> prob('Investment') = 0.99
>
> The brief human-written part is fine, but the longer boilerplate sig is
> indistinguishable from spam.
>
> + The occasional non-Python conference announcement(!). These are
>   long, so I'll skip an example. In effect, it's automated bulk email
>   trying to sell you a conference, so is prone to use the language and
>   artifacts of advertising. Here's typical scoring, for the TOOLS
>   Europe '99 conference announcement:
>
> prob = 0.983583974285
> prob('THE') = 0.983584
> prob('Object') = 0.01
> prob('Bell') = 0.01
> prob('Object-Oriented') = 0.01
> prob('**************************************************************') = 0.99
> prob('Bertrand') = 0.01
> prob('Rational') = 0.01
> prob('object-oriented') = 0.01
> prob('CONTACT') = 0.99
> prob('**************************************************************') = 0.99
> prob('innovative') = 0.99
> prob('**************************************************************') = 0.99
> prob('Olivier') = 0.01
> prob('VISIT') = 0.99
> prob('OUR') = 0.99
>
> Note the repeated penalty for the lines of asterisks. That segues into the
> next one:
>
> + Artifacts of the fact that the algorithm counts multiple instances of
>   "a word" multiple times. These are baffling at first sight! The two
>   clearest examples:
>
> """
> > > Can you create and use new files with dbhash.open()?
> >
> > Yes. But if I run db_dump on these files, it says "unexpected file type
> > or format", regardless which db_dump version I use (2.0.77, 3.0.55,
> > 3.1.17)
> >
>
> It may be that db_dump isn't compatible with version 1.85 databse files. I
> can't remember. I seem to recall that there was an option to build 1.85
> versions of db_dump and db_load. Check the configure options for
> BerkeleyDB to find out. (Also, while you are there, make sure that
> BerkeleyDB was built the same on both of your platforms...)
>
>
> >
> > > Try running db_verify (one of the utilities built
> > > when you compiled DB) on the file and see what it tells you.
> >
> > There is no db_verify among my Berkeley DB utilities.
>
> There should have been a bunch of them built when you compiled DB. I've got
> these:
>
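
To make the don't-count-duplicates point concrete, here is a rough sketch of a
Graham-style scorer that counts each token at most once per message. This is
not the actual spambayes code: the word_probs mapping, the 0.4 default for
unknown words, and the 15-token cutoff are assumptions in the spirit of the
scheme under discussion.

    def score(text, word_probs, max_tokens=15, unknown=0.4):
        # word_probs: hypothetical dict mapping token -> P(spam | token),
        # already clamped to [0.01, 0.99] during training.
        tokens = set(text.split())   # a set, so '<META' can score only once
        probs = [word_probs.get(t, unknown) for t in tokens]
        # Keep the max_tokens probabilities farthest from a neutral 0.5.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        probs = probs[:max_tokens]
        # Combine them with the usual product formula.
        prod, inv = 1.0, 1.0
        for p in probs:
            prod *= p
            inv *= 1.0 - p
        return prod / (prod + inv)

With a list instead of a set, the three '<META' hits in the Prolog example
above would each grab one of the 15 slots and drag the score toward 1.0;
deduplicated, they count only once.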