[Tim, to Paul Graham]
> ...
> I also noted earlier that FREE (all caps) is now one of the 15 words that
> most often makes it into the scorer's best-15 list, and cutting
> the legs off a clue like that is unattractive on the face of it. So I'm
> loathe to fold case unless experiment proves that's an improvement, and it
> just doesn't look likely to do so.

Those experiments have been run now. Folding case gave a slight but significant improvement in the false negative rate. It had no effect on the false positive rate, but did change the *set* of messages flagged as false positives: conference announcements are no longer flagged (for their VISIT OUR WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING), but some highly off-topic messages now are (e.g., talking about money is now indistinguishable from screaming about MONEY).

So, overall, I'm leaving case-folding in. It does (of course) reduce the database size, and reduce the amount of training data needed. I have no idea what this does for corpora in languages other than English (for that matter, I don't even know what "fold case" *means* in other languages <wink>).

Experiment also showed that boosting the "unknown word" probability from 0.2 to 0.5 was a pure win: it had no significant effect on the false positive rate, but cut the false negative rate by a third. The only change I've seen that had a bigger effect on reducing false negatives was adding special parsing and tagging for embedded URLs.
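
For anyone who wants to poke at the two knobs discussed above, here is a minimal sketch, not the actual scorer. It assumes a Graham-style combining rule (product of the most extreme per-word probabilities), whitespace tokenization, and str.lower() as the meaning of "fold case". The names tokenize, spam_score and word_probs are hypothetical, and word_probs is assumed to hold values strictly between 0 and 1.

```python
UNKNOWN_WORD_PROB = 0.5   # the post reports raising this from 0.2 to 0.5
MAX_DISCRIMINATORS = 15   # size of the "best-15 list" mentioned in the post


def tokenize(text, fold_case=True):
    """Split on whitespace; folding case here just means str.lower() (an assumption)."""
    if fold_case:
        text = text.lower()
    return text.split()


def spam_score(text, word_probs, fold_case=True, unknown=UNKNOWN_WORD_PROB):
    """Combine per-word spam probabilities, Graham style (assumed, not confirmed).

    word_probs maps token -> probability in (0, 1); tokens missing from it
    get the `unknown` probability.
    """
    tokens = set(tokenize(text, fold_case))
    probs = [word_probs.get(tok, unknown) for tok in tokens]
    # Keep only the clues farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam = ham = 1.0
    for p in probs[:MAX_DISCRIMINATORS]:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)
```

With folding on, "FREE" and "free" hit the same database entry, which is why the all-caps clue disappears; with unknown=0.5, a never-seen word is neutral and leaves the score alone, while unknown=0.2 pulls every message containing new words toward ham.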