Tim> FYI, here's the closest thing to a real false positive I've seen so
Tim> far:

I have much smaller spam and ham corpora (currently about 400 msgs each), but both consist only of messages sent to me in the past couple weeks (though not all messages sent during that interval), so some of the header clues which skewed Tim's tests shouldn't be present.

Using my currently undeleted Python mail as "unknown" (but which doesn't actually contain any spam), I saw two false positives. One had an attached gif image. The other was a one-line text+html message whose "words" were thus dominated by the HTML tags in the second part.

Once my spam and ham grow to something more like 2000 each I will try Tim's technique of splitting them into smaller chunks, training on one chunk, then testing against the remaining chunks.
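
For concreteness, the chunked train-and-test procedure mentioned above might look roughly like the sketch below. It is not Tim's actual test driver. The names split_into_chunks and train_one_test_rest are made up for illustration, the round-robin split is an assumption (the message does not say how the chunks are formed), and train stands for any hypothetical function that takes ham and spam lists and returns an is_spam(msg) callable.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical types: a classifier maps a message to True (spam) or False (ham),
# and a trainer builds a classifier from a ham list and a spam list.
Classifier = Callable[[str], bool]
Trainer = Callable[[List[str], List[str]], Classifier]


def split_into_chunks(msgs: Sequence[str], n_chunks: int) -> List[List[str]]:
    """Deal messages round-robin into n_chunks lists of near-equal size."""
    if n_chunks < 2:
        raise ValueError("need at least 2 chunks: one to train on, the rest to test")
    return [list(msgs[i::n_chunks]) for i in range(n_chunks)]


def train_one_test_rest(
    ham: Sequence[str],
    spam: Sequence[str],
    n_chunks: int,
    train: Trainer,
) -> Iterator[Tuple[int, int, int, int]]:
    """Train on chunk i, then score every other chunk j.

    Yields (train_chunk, test_chunk, false_positives, false_negatives),
    where a false positive is ham flagged as spam and a false negative
    is spam that was let through.
    """
    ham_chunks = split_into_chunks(ham, n_chunks)
    spam_chunks = split_into_chunks(spam, n_chunks)
    for i in range(n_chunks):
        is_spam = train(ham_chunks[i], spam_chunks[i])
        for j in range(n_chunks):
            if j == i:
                continue  # never test on the chunk used for training
            false_pos = sum(1 for m in ham_chunks[j] if is_spam(m))
            false_neg = sum(1 for m in spam_chunks[j] if not is_spam(m))
            yield i, j, false_pos, false_neg
```

With about 2000 messages of each kind and, say, 5 chunks (the chunk count is only an example), each run would train on roughly 400 ham and 400 spam and produce one false-positive and one false-negative count for each of the other 4 chunks.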