[Paul Prescod]
> Some perhaps relevant links (with no off-topic discussion):
>
> * http://www.tuxedo.org/~esr/bogofilter/

Damn -- wish I'd read that before. Among other things, Eric found a good use for Judy arrays <wink>.

> * http://www.ai.mit.edu/~jrennie/ifile/

Knew about that. Good stuff.

> http://groups.google.com/groups?selm=ajk8mj%241c3qah%243%40ID-125932.news.dfncis.de

Seems confused, assuming Graham's approach is a minor variant of ifile's. But Graham's computation is to classic Bayesian classifiers (like ifile) as Python's lambda is to Lisp's <0.7 wink>. Heart of the confusion:

    Integrating the whole set of statistics together requires adding
    up statistics for _all_ the words found in a message, not just
    the words "sex" and "sexy."

The rub is that Graham doesn't try to add up the statistics for all the words found in a msg. To the contrary, his scheme ends up ignoring almost all of the words. In particular, if the database indicates that "sex" and "sexy" aren't good spam-vs-non-spam discriminators, Graham's approach ignores them completely: their presence or absence doesn't affect the final outcome at all -- it's as if the words didn't exist. This isn't what ifile does, and ifile probably couldn't get away with it, because it's trying to do N-way classification instead of strictly 2-way. Someone who understands the math and reads Graham's article carefully will likely have a hard time figuring out what Bayes has to do with it at all! I sure did. (A sketch of the combining step is at the end of this message.)

> """My finding is that it is _nowhere_ near sufficient to have two
> populations, "spam" versus "not spam."

In ifile I believe that. But the data will speak for itself soon enough, so I'm not going to argue about this.
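To make the "ignoring" concrete, here's a minimal sketch of Graham-style combining. It assumes a spamprob mapping from word to estimated P(spam | word); the names graham_score and spamprob are made up for illustration, and unseen words get a neutral 0.5 here where Graham's article uses 0.4 (he also clamps word probabilities to [0.01, 0.99], which keeps the final division safe):

    def graham_score(words, spamprob, n_extreme=15):
        # Rank the message's distinct words by how far their spam
        # probability sits from a neutral 0.5, and keep only the
        # n_extreme most extreme -- every other word in the message
        # is ignored completely.
        probs = sorted((spamprob.get(w, 0.5) for w in set(words)),
                       key=lambda p: abs(p - 0.5),
                       reverse=True)[:n_extreme]
        p = q = 1.0
        for x in probs:
            p *= x          # product of P(spam | word)
            q *= 1.0 - x    # product of P(not spam | word)
        return p / (p + q)  # near 1.0 => spam, near 0.0 => not spam

Note that a word scored at exactly 0.5 multiplies p and q by the same factor and cancels out of the ratio even if it survives the cut, so poor discriminators truly contribute nothing -- which is the behavior described above, and exactly what an all-words classifier like ifile doesn't do.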