[Paul Prescod]
> Some perhaps relevant links (with no off-topic discussion):
>
> * http://www.tuxedo.org/~esr/bogofilter/

Damn -- wish I'd read that before. Among other things, Eric found a good use for Judy arrays <wink>.

> * http://www.ai.mit.edu/~jrennie/ifile/

Knew about that. Good stuff.

> http://groups.google.com/groups?selm=ajk8mj%241c3qah%243%40ID-125932.news.dfncis.de

Seems confused, assuming Graham's approach is a minor variant of ifile's. But Graham's computation is to classic Bayesian classifiers (like ifile) as Python's lambda is to Lisp's <0.7 wink>. Heart of the confusion:

    Integrating the whole set of statistics together requires adding
    up statistics for _all_ the words found in a message, not just
    the words "sex" and "sexy."

The rub is that Graham doesn't try to add up the statistics for all the words found in a msg. To the contrary, his scheme ends up ignoring almost all of the words. In particular, if the database indicates that "sex" and "sexy" aren't good spam-vs-non-spam discriminators, Graham's approach ignores them completely: their presence or absence doesn't affect the final outcome at all -- it's as if the words didn't exist. This isn't what ifile does, and ifile probably couldn't get away with it, because it's trying to do N-way classification instead of strictly 2-way. Someone who understands the math and reads Graham's article carefully will likely have a hard time figuring out what Bayes has to do with it at all! I sure did. (A sketch of the combining step is at the end of this message.)

> """My finding is that it is _nowhere_ near sufficient to have two
> populations, "spam" versus "not spam."

In ifile I believe that. But the data will speak for itself soon enough, so I'm not going to argue about this.
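To make the "ignoring" concrete, here's a minimal sketch of Graham-style combining. It assumes a spamprob mapping from word to estimated P(spam | word); the names graham_score and spamprob are made up for illustration, and unseen words get a neutral 0.5 here where Graham's article uses 0.4 (he also clamps word probabilities to [0.01, 0.99], which keeps the final division safe):

    def graham_score(words, spamprob, n_extreme=15):
        # Rank the message's distinct words by how far their spam
        # probability sits from a neutral 0.5, and keep only the
        # n_extreme most extreme -- every other word in the message
        # is ignored completely.
        probs = sorted((spamprob.get(w, 0.5) for w in set(words)),
                       key=lambda p: abs(p - 0.5),
                       reverse=True)[:n_extreme]
        p = q = 1.0
        for x in probs:
            p *= x          # product of P(spam | word)
            q *= 1.0 - x    # product of P(not spam | word)
        return p / (p + q)  # near 1.0 => spam, near 0.0 => not spam

Note that a word scored at exactly 0.5 multiplies p and q by the same factor and cancels out of the ratio even if it survives the cut, so poor discriminators truly contribute nothing -- which is the behavior described above, and exactly what an all-words classifier like ifile doesn't do.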