[Eric S. Raymond] > Your users' mailers would have two delete buttons -- spam and nonspam. > On each delete the message would be shipped to bogofilter, which would > would merge the content into its token lists. I want to raise a caution here. Graham pulled his formulas out of thin air, and one part of the scoring step is quite dubious. This requires detail to understand. Where X means "word x is present" and similarly for Y, and S means "it's spam" and "not-S" means "it's not spam", and sticking to just the two-word case for simplicity: P(S | X and Y) = [by Bayes] P(X and Y | S) * P(S) / P(X and Y) = [by the usual expanded form of Bayes] P(X and Y | S) * P(S) / (P(S)*P(X and Y | S) + P(not-S)*P(X and Y | not-S)) All that is rigorously correct so far. Now we make the simplifying assumption that puts the "naive" in "naive Bayes", that the probability of X is independent of the probability of Y, so that the conjoined probabilities can be replaced by multiplication of non-conjoined probabilities. This yields P(X|S)*P(Y|S)*P(S) --------------------------------------------------- P(S)*P(X|S)*P(Y|S) + P(not-S)*P(X|not-S)*P(Y|not-S) Then, unlike a "normal" formulation of Bayesian classification, Graham's scheme simply doesn't know anything about P(X|S) and P(Y|S) etc. It only knows about probabilities in the other direction (P(S|X) etc). It takes 3 more applications of Bayes to get what we want from what we know. That is, P(X|S) = [again by Bayes] P(S|X) * P(X) / P(S) Plug that in, mutatis mutandis, in six places, to get P(S|X)*P(X)/P(S)*P(S|Y)*P(Y)/P(S)*P(S) --------------------------------------------------- P(S)*P(S|X)*P(X)/P(S)*P(S|Y)*P(Y)/P(S) + ... P(not-S)*P(not-S|X)*P(X)/P(not-S)*P(not-S|Y)*P(Y)/P(not-S) The factor P(X)*P(Y) cancels out of numerator and denominator, leaving P(S|X)*/P(S)*P(S|Y)*/P(S)*P(S) --------------------------------------------------- P(S)*P(S|X)/P(S)*P(S|Y)/P(S) + ... P(not-S)*P(not-S|X)/P(not-S)*P(not-S|Y)/P(not-S) and simplifying some P(whatever)/P(whatever) instances away gives P(S|X)*P(S|Y)/P(S) --------------------------------------------------- P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S) This isn't what Graham computes, though: the P(S) and P(not-S) terms are missing in his formulation. Given that P(not-S) = 1-P(S), and P(not-S|whatever) = 1-P(S|whatever), what he actually computes is P(S|X)*P(S|Y) ------------------------------------- P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y) This is the same as the Bayesian result only if P(S) = 0.5 (in which case all the instances of P(S) and P(not-S) cancel out). Else it's a distortion of the naive Bayesian result. For this reason, it's best that the number of spam msgs fed into your database be approximately equal to the number of non-spam msgs fed into it: that's the only way to make P(S) ~= P(not-S), so that the distortion doesn't matter. Indeed, it may be that Graham found he had to multiply his "good counts" by 2 in order to make up for that in real life he has twice as many non-spam messages as spam messages in his inbox, but that the final scoring implicitly assumes they're of equal number (and so overly favors the "it's spam" outcome unless the math is fudged elsewhere to make up for that). It would likely be better still to train the database with a proportion of spam to not-spam messages reflecting what you actually get in your inbox, and change the scoring to use the real-life P(S) and P(not-S) estimates. In that case the "mystery bias" of 2 may actively hurt, overly favoring the "not spam" outcome. Note that Graham said: Here's a sketch of how I do statistical filtering. I start with one corpus of spam and one of nonspam mail. At the moment each one has about 4000 messages in it. That's consistent with all the above, although it's unclear whether Graham intended "about the same" to be a precondition for using this formulation, or whether fudging elsewhere was introduced empirically to make up for the scoring formula neglecting P(S) and P(not-S) by oversight.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4