[Skip Montanaro]
> ...
> One thing I think would be worthwhile would be to run GBayes first, then
> only run stuff it thought was spam through SpamAssassin.  Only
> messages that both systems categorized as spam would drop into the spam
> folder.  This has a couple benefits over running one or the other in
> isolation:
>
> * The training set for GBayes probably doesn't need to be as big

Training GBayes is cheap, and the more you feed it the less need to do
information-destroying transformations (like folding case or ignoring
punctuation).

> * The two systems use substantially different approaches to
>   identifying spam,

Which could indeed be a killer-strong benefit.

>   so I suspect your false positive rate would go way down.

I'm already having a real problem with this just looking at content:  the
false positive rate is already so low that I can't make statistically
significant conclusions about things that may improve it (e.g., if I do
something that removes just *one* false positive in a test run on 4000 hams,
the false-positive rate falls by 12.5% -- I don't have enough false
positives to make fine-grained judgments.  And, indeed, every time I test a
change to the algorithm, the most *significant* thing I find is that it
turns up another class of blatant spam hiding in the ham corpus:  my
training data is still too dirty, and cleaning it up is labor-intensive).

>   False negatives would go up, but only testing can suggest by how
>   much.
>
> * Since SA is dog slow most of the time, SA users get a big speedup,
>   since a substantially smaller fraction of your messages get run
>   through it.
>
> This sort of chaining is pretty trivial to setup with procmail.
> Dunno what the Windows set will do though.

There are different audiences here.  Greg is keen to have a better approach
for python.org as a whole, while Barry is keen about that and about doing
something more generic for Mailman.  Windows isn't an issue for either of
those.  Everyone else can eat cake <wink>.
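
For concreteness, the chaining Skip describes might look roughly like the sketch below. This is only an illustration under loud assumptions: `cheap_stat_filter` and `slow_rule_filter` are hypothetical toy stand-ins, not the real GBayes or SpamAssassin interfaces, and `chain` and `CountingClassifier` are names made up for this example. It shows the two properties under discussion: a message is called spam only when both stages agree, and the slow second stage runs only on what the first stage flagged.

```python
from typing import Callable

Classifier = Callable[[str], bool]


def chain(first: Classifier, second: Classifier) -> Classifier:
    """Return a classifier that says spam only when both stages agree.

    `second` is consulted only for messages `first` already flagged,
    so a slow second stage sees a fraction of the mail.
    """
    def is_spam(message: str) -> bool:
        # `and` short-circuits: second() never runs when first() says ham.
        return first(message) and second(message)
    return is_spam


class CountingClassifier:
    """Wrap a classifier and count how often it is actually invoked."""

    def __init__(self, inner: Classifier) -> None:
        self.inner = inner
        self.calls = 0

    def __call__(self, message: str) -> bool:
        self.calls += 1
        return self.inner(message)


# Hypothetical toy stand-ins; real filters are far more elaborate.
def cheap_stat_filter(message: str) -> bool:
    return "free" in message.lower()


def slow_rule_filter(message: str) -> bool:
    return "click here" in message.lower()


slow = CountingClassifier(slow_rule_filter)
combined = chain(cheap_stat_filter, slow)

mailbox = [
    "Free money, click here now",     # both stages flag it
    "Feel free to review my patch",   # first stage alone would misfire
    "Meeting moved to 3pm",           # neither stage flags it
    "Click here for the build logs",  # second alone would misfire, never runs
]
verdicts = [combined(m) for m in mailbox]
print(verdicts, "slow filter calls:", slow.calls)
```

In this toy mailbox each filter on its own produces one false positive, the chain produces none, and the slow stage is invoked for only two of the four messages. The flip side, as noted above, is that any spam the first stage misses is never seen by the second, so false negatives can only go up.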