This brings up a couple of questions, one related to the theory behind this Bayesian spam filtering, and one about Python optimization ... apologies in advance for the long post.

Tim Peters <tim.one@comcast.net> wrote:
>
> I want to raise a caution here.  Graham pulled his formulas out of thin air,
> and one part of the scoring step is quite dubious.  This requires detail to
> understand.

[detail deleted]

>    P(S|X)*P(S|Y)/P(S)
>    ---------------------------------------------------
>    P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
>
> This isn't what Graham computes, though: the P(S) and P(not-S) terms are
> missing in his formulation.  Given that P(not-S) = 1-P(S), and
> P(not-S|whatever) = 1-P(S|whatever), what he actually computes is
>
>    P(S|X)*P(S|Y)
>    -------------------------------------
>    P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)
>
> This is the same as the Bayesian result only if P(S) = 0.5 (in which case
> all the instances of P(S) and P(not-S) cancel out).  Else it's a distortion
> of the naive Bayesian result.

Is there an easy fix to this problem?  I implemented this in Python after reading about it on the weekend, and it might explain why my results are not quite as fabulous as the author claimed (I'm getting more false positives than he reported).  Note that I'm not so good with the above notation; I'm more at home with plain algebraic stuff :).

But on to the more interesting Python question: I'm running into some performance problems with my implementation.

Details: the analysis stage of my implementation (I'll refer to it as "spamalyzer" for now) stores the "mail corpus" and the term list on disk.  The mail corpus is two dictionaries (one for spam, one for good mail), each of which contains two further dictionaries -- one uses the filenames of analyzed messages as keys (the values are ignored and stored as 0), and the other maps terms to their number of occurrences.  The terms list is a single dictionary mapping each term to a pair of floats (the term's spam probability and that probability's distance from 0.5).

My first try at this used cPickle to store these items, but loading them back in was excruciatingly slow.  On a lightly loaded P3-500/128MB running Linux 2.2.x, each of the following is a separate run of a benchmarking Python script:

Loading corpus
--------------
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 65.190000000000 seconds.
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 64.790000000000 seconds.
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 65.010000000000 seconds.

Loading terms
-------------
pickle method: got 12986 terms in 3.460000000000 seconds.
pickle method: got 12986 terms in 3.470000000000 seconds.
pickle method: got 12986 terms in 3.450000000000 seconds.

For a lark, I decided to try an alternative way of storing the data (and no, I haven't tried the marshal module directly).  I wrote a function to write the contents of the dictionary to a text file in the form of Python source, so that you can re-load the data with a simple "import" command.  To my surprise, this was significantly faster!  The first import, of course, takes a while, as the interpreter compiles the .py file to .pyc format, but subsequent runs are an order of magnitude faster than cPickle.load():

Loading corpus
--------------
[charlesc@charon spamalyzer]$ rm mail_corpus.pyc
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 194.210000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 3.500000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 3.260000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 3.260000000000 seconds.

Loading terms
-------------
[charlesc@charon spamalyzer]$ rm terms.pyc
custom method: got 12986 terms in 3.110000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
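
The dump/load code itself is nothing fancy.  Stripped down, the idea looks something like the following -- a simplified sketch rather than the exact code I'm running, and the names (dump_as_module, load_from_module, the module-level variable "data") are just illustrative:

import time
import cPickle

def dump_as_module (data, name):
    # Write the dictionary out as Python source ("data = {...}"),
    # so that importing the generated module re-creates it.
    f = open (name + '.py', 'w')
    f.write ('data = ' + repr (data) + '\n')
    f.close ()

def load_from_module (name):
    # Re-load by importing the generated module.  The first import
    # compiles name.py to name.pyc; later runs just read the .pyc.
    module = __import__ (name, {}, {}, [])
    return module.data

def load_from_pickle (filename):
    # The original approach, for comparison.
    f = open (filename, 'rb')
    data = cPickle.load (f)
    f.close ()
    return data

# Rough shape of the timing runs.  Each line of output above came from
# a separate invocation of the script, so nothing was cached in memory.
start = time.time ()
terms = load_from_module ('terms')
print 'custom method: got %d terms in %f seconds.' % (len (terms), time.time () - start)

The corpus is dumped and re-loaded the same way, just with the two larger dictionaries.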
So the big question is, why is my naive "o = __import__ (f, {}, {}, [])" so much faster than the more obvious "o = cPickle.load (f)"?  And what can I do to make it faster :).

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon                            <python@discworld.dyndns.org>
GPL'ed software available at:  http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------