>                 P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)

[Eric S. Raymond]
> ...
> OK.  So, maybe I'm just being stupid, but this seems easy to solve.
> We already *have* estimates of P(S) and P(not-S) -- we have a message
> count associated with both wordlists.  So why not use the running
> ratios between 'em?

a. There are other fudges in the code that may rely on this fudge to
   cancel out, intentionally or unintentionally.  I'm loath to type
   more about this instead of working on the code, because I've
   already typed about it.  See a later msg for a concrete example of
   how the factor-of-2 "good count" bias acts in part to counter the
   distortion here.  Take one away, and the other(s) may well become
   "a problem".

b. Unless the proportion of spam to not-spam in the training sets is
   a good approximation to the real-life ratio of spam to not-spam,
   it's also dubious to train the system with bogus P(S) and
   P(not-S) values.

c. I'll get back to this when our testing infrastructure is
   trustworthy.  At the moment I'm hosed because the spam corpus I
   pulled off the web turns out to be trivial to recognize in
   contrast to Barry's corpus of good msgs from python.org mailing
   lists: every msg in the spam corpus has stuff about the fellow
   who collected the spam in the headers, while nothing in the
   python.org corpus does; contrarily, every msg in the python.org
   corpus has python.org header info not in the spam corpus headers.
   This is an easy way to get 100% precision and 100% recall, but
   not particularly realistic -- the rules it's learning are of the
   form "it's spam if and only if it's addressed to bruceg"; "it's
   not spam if and only if the headers contain 'List-Unsubscribe'";
   etc.  The learning can't be faulted, but the teacher can <wink>.

d. I only exposed the math for the two-word case above, and the
   generalization to n words may not be clear from the final result
   (although it's clear enough if you back off a few steps).  If
   there are n words, w[0] thru w[n-1]:

       prod1  <- product for i in range(n) of P(S|w[i])/P(S)
       prod2  <- product for i in range(n) of (1-P(S|w[i]))/(1-P(S))
       result <- prod1*P(S) / (prod1*P(S) + prod2*(1-P(S)))

   (A short Python sketch of this rule appears at the end of this
   msg.)  That's if you're better set up to experiment now.  If you
   do this, the most interesting thing to see is whether results get
   better or worse if you *also* get rid of the artificial "good
   count" boost by the factor of 2.

> As long as we initialize with "good" and "bad" corpora that are
> approximately the same size, this should work no worse than the
> equiprobability assumption.

"not spam" is already being given an artificial boost in a couple of
ways.  Given that in real life most people still get more not-spam
than spam, removing the counter-bias in the scoring math may boost
the false negative rate.

> The ratios will correct in time based on incoming traffic.

Depends on how training is done.

> Oh, and do you mind if I use your algebra as part of bogofilter's
> documentation?

Not at all, although if you wait until we get our version of this
ironed out, you'll almost certainly be able to pull an
anally-proofread version out of a plain-text doc file I'll feel
compelled to write <wink>.
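[Editor's note: a minimal Python sketch of the n-word combining rule
from point d, with P(S) estimated from training message counts as
Eric proposes (point b's caveat about the training ratio matching
real life still applies).  This is an illustration only, not code
from bogofilter or spambayes; the function and variable names are
made up, and the message counts in the example are hypothetical.]

    def combine(p_spam_given_word, p_spam):
        """Combine per-word P(S|w[i]) estimates into one spam score.

        p_spam_given_word: list of P(S|w[i]) for the n words found.
        p_spam: the prior P(S), here taken from training counts
                instead of the hardwired equiprobability value 0.5.
        """
        prod1 = 1.0  # product over i of P(S|w[i]) / P(S)
        prod2 = 1.0  # product over i of (1 - P(S|w[i])) / (1 - P(S))
        for p in p_spam_given_word:
            prod1 *= p / p_spam
            prod2 *= (1.0 - p) / (1.0 - p_spam)
        num = prod1 * p_spam
        return num / (num + prod2 * (1.0 - p_spam))

    # Hypothetical trained corpus: 1000 spam msgs, 4000 good msgs.
    nspam, nham = 1000, 4000
    p_spam = nspam / (nspam + nham)              # 0.2 rather than 0.5
    print(combine([0.99, 0.90, 0.20], p_spam))   # -> roughly 0.9997

Note that with the real prior in place, a word with P(S|w[i]) equal
to P(S) contributes nothing to either product, which is exactly the
"tells you nothing" behavior the derivation calls for.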