Tim Peters <tim.one@comcast.net>:
> a. There are other fudges in the code that may rely on this fudge
>    to cancel out, intentionally or unintentionally. I'm loathe to
>    type more about this instead of working on the code, because I've
>    already typed about it. See a later msg for a concrete example of
>    how the factor-of-2 "good count" bias acts in part to counter the
>    distortion here. Take one away, and the other(s) may well become
>    "a problem".

I was thinking of shooting that "goodness bias" through the head and
seeing what happens, actually. I've been unhappy with that fudge in
Paul's original formula from the beginning.

> b. Unless the proportion of spam to not-spam in the training sets
>    is a good approximation to the real-life ratio of spam to not-
>    spam, it's also dubious to train the system with bogus P(S) and
>    P(not-S) values.

Right -- which is why I want to experiment with actually *using* the
real-life running ratio.

> c. I'll get back to this when our testing infrastructure is trustworthy.
>    At the moment I'm hosed because the spam corpus I pulled off the
>    web turns out to be trivial to recognize in contrast to Barry's
>    corpus of good msgs from python.org mailing lists:

Ouch. That's a trap I'll have to watch out for in handling other
people's corpora.
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
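[Editorial note for readers outside the thread: the "goodness bias" being discussed is the factor-of-2 weighting of good-token counts in Paul Graham's original word-probability formula from "A Plan for Spam". A minimal sketch of that per-token computation follows; the function name, the illustrative counts, and the `good_bias` parameter (exposed here so the fudge can be switched off, as ESR proposes) are this note's own assumptions, not code from the project.]

```python
def graham_token_prob(bad_count, good_count, nbad, ngood, good_bias=2.0):
    """Per-token spam probability in the style of Graham's original formula.

    good_bias=2.0 reproduces the factor-of-2 "good count" fudge;
    good_bias=1.0 is the unbiased variant under discussion.
    nbad/ngood are the number of spam/ham messages trained on.
    """
    g = good_bias * good_count
    b = bad_count
    if g + b < 1:
        # Token seen too rarely to be informative: treat as neutral.
        return 0.5
    bad_ratio = min(1.0, b / nbad)
    good_ratio = min(1.0, g / ngood)
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp away from 0 and 1 so no single token is ever conclusive.
    return max(0.01, min(0.99, p))
```

For a token seen 5 times in each corpus (100 spams, 100 hams), the doubled good count pulls the probability down to 1/3, while the unbiased variant leaves it at the neutral 0.5 -- which is exactly why removing the fudge can unmask other distortions it was cancelling.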