Archived from: https://mail.python.org/pipermail/python-dev/2002-August/028259.html

[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Eric S. Raymond esr@thyrsus.com
Sat, 24 Aug 2002 00:44:16 -0400
Tim Peters <tim.one@comcast.net>:
>                P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
> 
> This isn't what Graham computes, though:  the P(S) and P(not-S) terms are
> missing in his formulation.  Given that P(not-S) = 1-P(S), and
> P(not-S|whatever) = 1-P(S|whatever), what he actually computes is
> 
>            P(S|X)*P(S|Y)
> -------------------------------------
> P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)
> 
> This is the same as the Bayesian result only if P(S) = 0.5 (in which case
> all the instances of P(S) and P(not-S) cancel out).  Else it's a distortion
> of the naive Bayesian result.
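[A minimal sketch of the two formulas Tim contrasts above, not code from GBayes.py or bogofilter; the function names are mine. It makes the point checkable: the two combiners agree exactly when P(S) = 0.5 and diverge otherwise.]

```python
def bayes_combine(p_sx, p_sy, p_s):
    """Full naive-Bayes combination, keeping the P(S) and P(not-S) terms."""
    num = p_sx * p_sy / p_s
    den = num + (1 - p_sx) * (1 - p_sy) / (1 - p_s)
    return num / den

def graham_combine(p_sx, p_sy):
    """Graham's formulation: the prior terms are dropped."""
    num = p_sx * p_sy
    den = num + (1 - p_sx) * (1 - p_sy)
    return num / den

# With an equiprobable prior the P(S) factors cancel and the results match;
# with any other prior, Graham's version is a distortion of the Bayesian one.
```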

OK.  So, maybe I'm just being stupid, but this seems easy to solve.
We already *have* estimates of P(S) and P(not-S) -- we have a message
count associated with both wordlists.  So why not use the running
ratios between 'em?

As long as we initialize with "good" and "bad" corpora of approximately
the same size, this should work no worse than the equiprobability assumption.
The ratios will correct over time based on incoming traffic.
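[A sketch of the suggestion above, assuming hypothetical names rather than bogofilter's actual internals: derive the P(S) prior from the running message counts kept alongside the two wordlists, falling back to the equiprobability assumption when no messages have been seen.]

```python
def prior_from_counts(n_spam, n_good):
    """Estimate P(S) as the running ratio of spam to total messages seen."""
    total = n_spam + n_good
    if total == 0:
        return 0.5  # no traffic yet: fall back to the equiprobable prior
    return n_spam / total
```

If the initial corpora really are about the same size, this starts out near 0.5 and then tracks the incoming spam/non-spam ratio as messages are classified.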

Oh, and do you mind if I use your algebra as part of bogofilter's
documentation?
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>


