Tim Peters <tim.one@comcast.net>:
> P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
>
> This isn't what Graham computes, though: the P(S) and P(not-S) terms are
> missing in his formulation.  Given that P(not-S) = 1-P(S), and
> P(not-S|whatever) = 1-P(S|whatever), what he actually computes is
>
> P(S|X)*P(S|Y)
> -------------------------------------
> P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)
>
> This is the same as the Bayesian result only if P(S) = 0.5 (in which case
> all the instances of P(S) and P(not-S) cancel out).  Else it's a distortion
> of the naive Bayesian result.

OK.  So, maybe I'm just being stupid, but this seems easy to solve.  We
already *have* estimates of P(S) and P(not-S) -- we have a message count
associated with both wordlists.  So why not use the running ratio between
'em?  As long as we initialize with "good" and "bad" corpora of
approximately the same size, this should work no worse than the
equiprobability assumption, and the ratios will correct themselves over
time based on incoming traffic.

Oh, and do you mind if I use your algebra as part of bogofilter's
documentation?
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>