[Tim Peters]
[... extremely good work and stuff and comments, for a good while now ...]

Hi, Tim. I have been reading your messages, following your work and progress in that area with great interest, and also saving them for later contemplation! :-) Spam has always annoyed me, as it does most of us, and despite my many efforts it is increasingly successful at getting through my filters -- so this idea of Graham or Bayesian filters is timely and welcome.

Most previous filters I have seen are based on various ad hoc tests or events (you surely know all this), and `procmail'-based filters, or even the popular SpamAssassin, are at best slow. The tool I have used since 1998 is much faster, especially since I rewrote it in Python!, but it too is based on various tests or events.

Your work concentrated on tuning the statistical formulas and the lexical analysis, and on building operational data from preset corpora. I'm sure all the knowledge gleaned there will make its way everywhere, and eventually reach me. For my own tiny share, I decided to experiment with the day-to-day user aspects of such a filter, and built a Gnus interface over Eric Raymond's Bogofilter. The program has two functions: one learns from messages known to be ham or spam, the other classifies incoming messages. By the way, if there are Gnus users among you, just ask me for the recipe... It has gone pretty well for me so far.

The principle, put forward by Paul Graham, is to give the user two delete commands: delete-as-ham and delete-as-spam. Eric pushed this idea a bit further by postponing learning until the user quits the mail reader, `mutt' in his case. As Gnus lets me have many mailgroups and folders and shuffle messages between them, I postpone learning until the user switches mailgroups or quits, and only for the _final_ disposition of a message: that is, when a message is merely saved into another folder, the decision will be taken when leaving that other folder, not the current one.
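For readers who want the mechanics spelled out, here is a minimal Python sketch of that deferred-learning idea. It is my own illustration, not the actual Gnus/Elisp code: `DeferredTrainer`, `mark` and `leave_folder` are hypothetical names, and `train` stands in for whatever actually feeds Bogofilter.

```python
from collections import defaultdict

class DeferredTrainer:
    """Postpone ham/spam learning until a message reaches its final folder."""

    def __init__(self, train):
        # train(message, is_spam) performs the real learning step.
        self.train = train
        self.pending = defaultdict(list)  # folder -> [(message, is_spam)]

    def mark(self, folder, message, disposition):
        # disposition is "ham", "spam", or "saved" (moved to another folder).
        if disposition == "saved":
            return  # decision postponed; the message must not be learned twice
        self.pending[folder].append((message, disposition == "spam"))

    def leave_folder(self, folder):
        # Called when the user switches mailgroups or quits:
        # only now are the accumulated dispositions actually learned.
        for message, is_spam in self.pending.pop(folder, []):
            self.train(message, is_spam)
```

A saved message simply produces no training event in its current folder; it will be marked again, and eventually learned, in whatever folder it finally leaves.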
Messages marked as "saved" are _not_ sent to the learner, so as to avoid double learning. The fact is that ham messages are more likely to be postponed than spam, because ham is more often filed here and there. Even if many or most ham messages are deleted immediately, this introduces a short-term bias in the learning statistics by which the proportion of spam seems higher (in my case, 1157 messages have been learned in about three days, 20% of which were spam), but this percentage will later be lowered as filed messages get reprocessed. Another effect is that the delay itself in ham learning may slightly affect classification, but since both ham and spam are well represented, the effect is likely negligible.

Tim's corpora are surely very clean, at least by now, while day-to-day learning may yield slightly tainted training data. In my case, when a thread does not interest me, I often kill all the articles it contains with one command, without opening each of them to check whether it might be spam: the threading itself makes that unlikely. It is nevertheless possible; you have surely noticed that the bad guys now fetch and re-use already published subjects as a way to get through. That means that while big corpora are thinkable for mailing lists that have existed for a while, they are probably not very usable for individual users. GBayes, Bogofilter and the others should ideally resist some amount of ham-tainted-as-spam or spam-tainted-as-ham at learning time.

After adding Graham filtering as a supplementary method to my spam detection tool, I gladly observe that it successfully detects many spam messages which would otherwise fall through the cracks, so it really brings something to me. But I also see many spam cases (are they?) it does not detect and that it hardly could: one simple example is that, _for me_, invalidly structured MIME is indicative of an uninteresting message, as interesting people know better!
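That last heuristic is easy to express in Python, since the standard library's email parser records structural problems as "defects" instead of raising. The sketch below is my illustration of the idea, assuming one wants a simple boolean hint for the filter; a message claiming `multipart/mixed` without a boundary, for instance, parses with a defect attached.

```python
from email import message_from_string, policy

def mime_defective(raw):
    """Return True when a raw RFC 2822 message has structural MIME defects."""
    msg = message_from_string(raw, policy=policy.default)
    defects = []
    # walk() yields the message itself plus every subpart, so this
    # collects defects recorded anywhere in the MIME tree.
    for part in msg.walk():
        defects.extend(part.defects)
    return bool(defects)
```

Such a test would not replace the statistical score, only feed it one more event, in the spirit of the tests my existing tool already applies.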
One particular problem I observed is Tim's messages themselves, which are undoubtedly very yummy ham, yet discuss and quote a lot of spam inside them. Should these be registered as ham or as spam? :-) Would they not defeat the learning to some extent? Where should Tim put his own messages in the corpora he uses, and what changes would result in `GBayes' effectiveness?

--
François Pinard   http://www.iro.umontreal.ca/~pinard