[Tim, last week]
> What's an acceptable false positive rate?

[my response]
> Speaking as one of the people who reviews suspected spam for python.org
> and rescues false positives, I would say that the more relevant figure
> is: how much suspected spam do I have to review every morning?  < 10
> messages would be peachy; right now it's around 5-20 messages per day.

[Tim again]
> I must be missing something.  I would *hope* that you review *all* messages
> claimed to be spam, in which case the number of msgs to be reviewed would,
> in a perfectly accurate system, be equal to the number of spams received.

Good lord, certainly not!  Remember that Exim rejects a couple hundred
messages a day that never get near SpamAssassin -- that's mostly
Chinese/Korean junk that's rejected on the basis of 8-bit chars or
banned charsets in the headers.

Then, probably 50-75% of what SA gets its hands on scores >= 10.0, so it
too is rejected at SMTP time.  Only messages that score < 10 are
accepted, and those that score >= 5.0 are set aside in /var/mail/spam
for review.  That's 10-30 messages/day.

(I do occasionally scan Exim's reject log on mail.python.org to see
what's getting rejected today -- Exim kindly logs the full headers of
every message that is rejected after the DATA command.  I usually make
it to about 11am of a given day's logfile before my eyes glaze over from
the endless stream of spam and viruses.)

Note that we *used* to accept messages before passing them to
SpamAssassin, so never rejected anything on the basis of its SA score.
Back then, we saved and reviewed probably 50-70 messages/day.  Very,
very, very few (if any) false positives scored >= 10.0, which is why
that's the threshold for SMTP-time rejection.

> OTOH, the false positive rate doesn't have anything to do with the number of
> spams received, it has to do with the number of non-spams received.

Err, yeah, good point.

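For illustration only, the score-based triage described above can be
sketched in a few lines of Python.  Only the 5.0 and 10.0 thresholds
come from the description; the function and constant names are made up,
and this is not the Exim or SpamAssassin configuration actually running
on mail.python.org.

```python
# Thresholds as described in this message; everything else here is a
# hypothetical sketch, not real Exim/SpamAssassin configuration.
REJECT_SCORE = 10.0   # at or above this: rejected at SMTP time
REVIEW_SCORE = 5.0    # at or above this (but below 10.0): set aside


def triage(sa_score):
    """Return 'reject', 'review', or 'deliver' for a SpamAssassin score."""
    if sa_score >= REJECT_SCORE:
        return "reject"
    if sa_score >= REVIEW_SCORE:
        return "review"
    return "deliver"


def review_load(scores):
    """Count how many messages would land in the manual-review pile."""
    return sum(1 for s in scores if triage(s) == "review")
```

Something like review_load() over a day's scores gives the figure that
matters to the reviewer: the size of the set-aside pile, not the total
volume of spam received.
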
I make a point of talking about "suspected spam", which is any message
that scores between 5.0 and 10.0.  IMHO, the true nature of those
messages can only be determined by manual inspection.

> Maybe you don't want this kind of approach at all.  The classifier doesn't
> have "gray areas" in practice:  it tends to give probabilities near 1, or
> near 0, and there's very little in between -- a msg either has a
> preponderance of spam indicators, or a preponderance of non-spam indicators.

That's a great improvement over SpamAssassin then: with SA, the grey
area (IMHO) is scores from 3 to 10... which is why several python.org
lists now have a little bit of Mailman configuration magic that makes MM
set aside messages with an SA score >= 3 for list admin review.  (It's
probably worth getting the list admin to do a bit more work in order to
avoid sending low-scoring spam to the list.)

However, as long as "very little" != "nothing", we still need to worry a
bit about that grey area.  What do you think we should do with a message
whose spam probability is between (say) 0.1 and 0.9?  Send it on, reject
it, or set it aside?  Just how many messages fall in that grey area
anyways?

        Greg
--
Greg Ward <gward@python.net>         http://www.gerg.ca/
MTV -- get off the air! -- Dead Kennedys