I see, if you count the punctuation as part of the token,
you end up with undersized-corpus effects.  Esp if you are
case-sensitive too.  If I were you I'd map your input down
into a narrower set of tokens, or you'll get too many errors.

--pg

--Tim Peters wrote:

> [Paul Graham]
> > Don't count words multiple times, and you'll probably
> > get fewer false positives.  That's the main reason I
> > don't do it-- because it magnifies the effect of some
> > random word like water happening to have a big spam
> > probability.
>
> Yes, that makes sense, but I'm trained not to think <wink>.  Experiment will
> decide it (although I *expect* it's a good change, and counting multiple
> occurrences was obviously a factor in several of the rare false positives).
> If spam really is different, it should be different in several distinct
> ways.
>
> > (Incidentally, why so high?  In my db it's only 0.3930784.)  --pg
>
> I expect it's because this tokenizer *only* split on whitespace.
> Punctuation was left intact.  So, e.g., on the Python discussion list stuff
> like
>
>     The new approach blows it out of the water:
> and
>     This is very deep water;
> and
>     Then you'll take to Python like a duck takes to water!
>
> are counted as "water:" and "water;" and "water!", not as "water".
>
> The spam corpus is chock full o' "water", though:
>
> + Porn sites advertising water sports.
> + Assorted bottled water pitches.
> + Assorted "oxygenated water" pitches.
> + Claims of environmental friendliness explicated via stuff like
>   "no harmful chlorine to pollute the water or air!".
> + Pitches for weight-loss gimmicks emphasizing that you'll really
>   lose fat, not just reduce water retention.
> + Pitches for weight-loss gimmicks emphasizing that you'll reduce
>   water retention as well as lose fat.
> + One repeated bizarre analogy for how a breast enlargement cream
>   works in the way "a sponge absorbs water".
> + This revolutionary new flat garden hose will really cut your water
>   bills.
> + Ditto this miracle new laundry tablet lets you use a fraction of
>   the water needed by old-fashioned detergents.
> + Survivalist pitches often mention water in the same sentence as
>   air and medical care.
>
> I got tired then <wink>.
>
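
To make the tokenization point in this thread concrete, here is a minimal
sketch, not the code either poster was actually running. It compares a
whitespace-only split with a narrower token set (lowercased, surrounding
punctuation stripped), and counts each token at most once per message. The
function names and the three-message sample are assumptions made up for
illustration; the sample sentences are the ones quoted above.

```python
import string
from collections import Counter


def tokenize_whitespace(text):
    """Split on whitespace only; punctuation and case are left intact."""
    return text.split()


def tokenize_normalized(text):
    """Lowercase each token and strip leading/trailing punctuation."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if token:  # skip tokens that were nothing but punctuation
            tokens.append(token)
    return tokens


def count_once_per_message(messages, tokenizer):
    """For each token, count how many messages contain it (not occurrences)."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenizer(message)))
    return counts


# Hypothetical three-message "ham" sample built from the quoted sentences.
ham = [
    "The new approach blows it out of the water:",
    "This is very deep water;",
    "Then you'll take to Python like a duck takes to water!",
]

raw_counts = count_once_per_message(ham, tokenize_whitespace)
norm_counts = count_once_per_message(ham, tokenize_normalized)

print(raw_counts["water"], norm_counts["water"])  # prints: 0 3
```

With the whitespace-only split, the bare token "water" never shows up in
this sample, so any occurrence in spam would look like strong evidence. After
normalization all three messages contribute to the same "water" count.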