[Paul Graham]
> Don't count words multiple times, and you'll probably
> get fewer false positives. That's the main reason I
> don't do it-- because it magnifies the effect of some
> random word like water happening to have a big spam
> probability.

Yes, that makes sense, but I'm trained not to think <wink>. Experiment will
decide it (although I *expect* it's a good change, and counting multiple
occurrences was obviously a factor in several of the rare false positives).
If spam really is different, it should be different in several distinct ways.

> (Incidentally, why so high? In my db it's only 0.3930784.) --pg

I expect it's because this tokenizer *only* split on whitespace.
Punctuation was left intact. So, e.g., on the Python discussion list stuff
like

    The new approach blows it out of the water:

and

    This is very deep water;

and

    Then you'll take to Python like a duck takes to water!

are counted as "water:" and "water;" and "water!", not as "water". The spam
corpus is chock full o' "water", though:

+ Porn sites advertising water sports.

+ Assorted bottled water pitches.

+ Assorted "oxygenated water" pitches.

+ Claims of environmental friendliness explicated via stuff like "no
  harmful chlorine to pollute the water or air!".

+ Pitches for weight-loss gimmicks emphasizing that you'll really lose
  fat, not just reduce water retention.

+ Pitches for weight-loss gimmicks emphasizing that you'll reduce water
  retention as well as lose fat.

+ One repeated bizarre analogy for how a breast enlargement cream works in
  the way "a sponge absorbs water".

+ This revolutionary new flat garden hose will really cut your water bills.

+ Ditto this miracle new laundry tablet lets you use a fraction of the
  water needed by old-fashioned detergents.

+ Survivalist pitches often mention water in the same sentence as air and
  medical care.

I got tired then <wink>.
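
For anyone who wants to see the tokenizer effect concretely, here is a minimal sketch. It is not the tokenizer used in these experiments; the function names (tokenize_whitespace, tokenize_stripped, unique_tokens, messages_containing) are made up for illustration, and stripping leading and trailing punctuation is just one assumed alternative to splitting on whitespace alone. It also shows the "count each word once per message" idea by reducing a message to its set of distinct tokens.

```python
import string


def tokenize_whitespace(text):
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.split()


def tokenize_stripped(text):
    """Split on whitespace, then strip leading/trailing punctuation.

    This is an assumed alternative, not the tokenizer from the post.
    """
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]


def unique_tokens(tokens):
    """Count each distinct token at most once per message."""
    return set(tokens)


def messages_containing(word, messages, tokenizer):
    """Number of messages in which `word` appears as a token."""
    return sum(1 for msg in messages if word in unique_tokens(tokenizer(msg)))


# The three example lines from the Python discussion list quoted above.
ham = [
    "The new approach blows it out of the water:",
    "This is very deep water;",
    "Then you'll take to Python like a duck takes to water!",
]

ham_whitespace = messages_containing("water", ham, tokenize_whitespace)
ham_stripped = messages_containing("water", ham, tokenize_stripped)
```

With whitespace-only splitting, none of the three lines contributes a "water" token, which is how the word can end up looking much spammier than it is; with punctuation stripped, all three do.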