[Charles Cazabon]
> Is there an easy fix to this problem?

I don't know that there is "a problem".  The step is dubious, but other
steps are also dubious, and naive Bayes itself is dubious (the assumption
that word probabilities are independent is simply false in this
application).  But outside of, perhaps, quantum chromodynamics, all models
of reality are more or less dubious, and it's darned hard to say whether
the fudging needed to make them appear to work is lucky or principled,
robust or brittle.  The more gross deviations there are from a model,
though, the less one can appeal to the model for guidance.  In the limit,
you can end up with a pile of arbitrary tricks, with no idea which gimmicks
matter anymore (given enough free parameters to fiddle, you can train even
a horribly inappropriate model to fit a specific blob of data exactly).

> I implemented this in Python after reading about it on the weekend, and it
> might explain why my results are not quite as fabulous as the author noted
> (I'm getting more false positives than he claimed he was).

How many lines of code do you have?  That's a gross lower bound on the
number of places it might have screwed up <wink>.

> Note that I'm not so good with the above notation; I'm more at home with
> plain algebraic stuff :).

It's all plain -- and simple -- algebra, it's just long-winded.  You may be
confusing yourself, e.g., by reading P(S|X) as if it's a complex expression
in its own right.  But it's not -- given an incarnation of the universe, it
denotes a fixed number.  Think of it as "w" instead <wink>.

Let's get concrete.  You have a spam corpus with 1000 messages.  100 of
them contain the word x, and 500 contain the word y.  Then

    P(X|S) = 100/1000 = 1/10
    P(Y|S) = 500/1000 = 1/2

You also have a non-spam corpus with 2000 messages.  100 of them contain x
too, and 500 contain y.  Then

    P(X|not-S) = 100/2000 = 1/20
    P(Y|not-S) = 500/2000 = 1/4

This is the entire universe, and it's all you know.  If you pick a message
at random, what's P(S) = the probability that it's from the spam corpus?
It's trivial:

    P(S) = 1000/(1000+2000) = 1/3

and

    P(not-S) = 2/3

Now *given that* you've picked a message at random, and *know* it contains
x, but don't know anything else, what's the probability it's spam (== what's
P(S|X))?  Well, it has to be one of the 100 spam messages that contains x,
or one of the 100 non-spam messages that contains x.  They're all equally
likely, so

    P(S|X) = 100/(100+100) = 1/2

and

    P(S|Y) = 500/(500+500) = 1/2

too, by the same reasoning.  P(not-S|X) and P(not-S|Y) are also 1/2 each.

So far, there's nothing a reasonable person can argue with.  Given that
this is our universe, these numbers fall directly out of what reasonable
people agree "probability" means.
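If you'd rather see that counting argument as code than as prose, here's a
bare-bones Python sketch of the same thing.  The counts are just the toy
numbers above, and the variable names are made up for this example -- it
has nothing to do with any real tokenizer or corpus:

    # Toy universe from the example above: message counts only.
    n_spam, n_ham = 1000, 2000            # spam / non-spam corpus sizes
    spam_with_x, ham_with_x = 100, 100    # messages containing the word x

    p_x_given_spam = spam_with_x / n_spam   # P(X|S) = 1/10
    p_x_given_ham  = ham_with_x / n_ham     # P(X|not-S) = 1/20
    p_spam = n_spam / (n_spam + n_ham)      # P(S) = 1/3

    # P(S|X): of all messages containing x, what fraction are spam?
    p_spam_given_x = spam_with_x / (spam_with_x + ham_with_x)   # = 1/2

    # Same number via Bayes' rule, as a sanity check.
    p_x = (spam_with_x + ham_with_x) / (n_spam + n_ham)         # P(X)
    assert abs(p_spam_given_x - p_x_given_spam * p_spam / p_x) < 1e-12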
When it comes to P(S|X and Y), life is more difficult.  If we *agree* to
assume that word probabilities are independent (which is itself dubious,
but has the virtue of appearing to work pretty well anyway), then the
number of messages in the spam corpus we can expect to contain both X and
Y is

    P(X|S)*P(Y|S)*number_spams = (1/10)*(1/2)*1000 = 50

Similarly, the number of non-spam messages we can expect to contain both X
and Y is

    (1/20)*(1/4)*2000 = 25

Since that's all the messages that contain both X and Y, the probability
that a message containing both X and Y is spam is

    P(S | X and Y) = 50/(50 + 25) = 2/3

Note that this agrees with the formula whose derivation I spelled out from
first principles:

                   P(S|X)*P(S|Y)/P(S)
    --------------------------------------------------- =
    P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)

            (1/2)*(1/2)/(1/3)              2
    -------------------------------------- = -
    (1/2)*(1/2)/(1/3) + (1/2)*(1/2)/(2/3)    3

It's educational to work through Graham's formulation on the same example.
To start with, P(S|X) is approximated by a different means, fudging the
"good count" by a factor of 2, giving

    P'(S|X) = (100/1000) / (100/1000 + 2*100/2000) = 1/2

and similarly for P'(S|Y).  These are the same probabilities I gave above,
but the only reason they're the same is that I deliberately picked a spam
corpus exactly half the size of the non-spam corpus, knowing in advance
that this factor-of-2 fudge would make the results come out the same.

The only difference in what's computed then is in the scoring step, where
Graham's formulation computes

    P'(S | X and Y) = (1/2)*(1/2) / ((1/2)*(1/2) + (1/2)*(1/2)) = 1/2

instead of the 2/3 that's actually true in this universe.  If the corpus
sizes diverge more, the discrepancies at the end grow too, and the way of
computing P(S|X) at the start also diverges.  Is that good or bad?  I say
no more now than that it's dubious <wink>.

> But the more interesting Python question:  I'm running into some
> performance problems with my implementation.  Details:

English never helps with these.  Whittle it down and post actual code to
comp.lang.python for help.  Or study the sandbox code in the CVS repository
(see the Subject line of this msg) -- it's not having any speed problems.
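PS:  If it helps to see the two scoring steps side by side, here's the toy
example written out in Python.  This is just the two formulas from above
with made-up variable names -- it isn't the sandbox code, and the numbers
only hold for this contrived universe:

    # Toy numbers from the example above.
    p_s, p_not_s = 1/3, 2/3        # P(S), P(not-S)
    p_s_x = p_s_y = 1/2            # P(S|X), P(S|Y)
    p_ns_x = p_ns_y = 1/2          # P(not-S|X), P(not-S|Y)

    # Combination derived from first principles (divides out the priors):
    num = p_s_x * p_s_y / p_s
    den = num + p_ns_x * p_ns_y / p_not_s
    print(num / den)               # 0.666..., i.e. the 2/3 that's true here

    # Graham-style scoring step: same product form, but no priors:
    g_num = p_s_x * p_s_y
    g_den = g_num + p_ns_x * p_ns_y
    print(g_num / g_den)           # 0.5 -- not 2/3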