Showing content from http://mail.python.org/pipermail/python-dev/2002-August/028495.html below:

[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Sat, 31 Aug 2002 17:43:39 -0400
[Tim, to Paul Graham]
> ...
> I also noted earlier that FREE (all caps) is now one of the 15 words that
> most often makes it into the scorer's best-15 list, and cutting
> the legs off a clue like that is unattractive on the face of it.  So I'm
> loath to fold case unless experiment proves that's an improvement, and it
> just doesn't look likely to do so.

Those experiments have been run now.  Folding case gave a slight but
significant improvement in the false negative rate.  It had no effect on the
false positive rate, but did change the *set* of messages flagged as false
positives:  conference announcements are no longer flagged (for their VISIT
OUR WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING), but some
highly off-topic messages are (e.g., talking about money is now
indistinguishable from screaming about MONEY).  So, overall, I'm leaving
case-folding in.  It does (of course) reduce the database size, and reduce
the amount of training data needed.  I have no idea what this does for
corpora in languages other than English (for that matter, I don't even know
what "fold case" *means* in other languages <wink>).
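
To make the database-size point concrete, here is a minimal sketch of what
folding case does to word counts.  The tokenize() helper and the sample
string are made up for illustration; the real tokenizer is not shown in this
message and certainly does more than split on whitespace.

```python
from collections import Counter

def tokenize(text, fold_case=True):
    """Hypothetical tokenizer: split on whitespace, optionally fold case."""
    words = text.split()
    if fold_case:
        # str.lower() is adequate for English; str.casefold() is the more
        # aggressive Unicode-aware variant for other languages.
        words = [w.lower() for w in words]
    return words

sample = "FREE money now! Get your free MONEY today"

unfolded = Counter(tokenize(sample, fold_case=False))
folded = Counter(tokenize(sample, fold_case=True))

# Folding merges FREE/free and MONEY/money into single entries, so the
# database holds fewer distinct words, each backed by more evidence.
print(len(unfolded), len(folded))
```

The same merging is what makes talking about money look like screaming about
MONEY:  after folding, both contribute to a single "money" entry.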

Experiment also showed that boosting the "unknown word" probability from 0.2
to 0.5 was a pure win:  it had no significant effect on the false positive
rate, but cut the false negative rate by a third.  The only change I've seen
that had a bigger effect on reducing false negatives was adding special
parsing and tagging for embedded URLs.
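
A rough sketch of why the unknown-word probability matters.  This assumes the
scorer picks the (up to) 15 probabilities farthest from 0.5 and combines them
the way Paul Graham's "A Plan for Spam" describes; the formula is not spelled
out in this message, and the score() function, the word table, and the
example words are all invented for illustration.

```python
from math import prod

def score(words, spamprob, unknown_prob=0.5, max_clues=15):
    """Hypothetical Graham-style scorer (an assumption, not the real code)."""
    # Words missing from the table get the "unknown word" probability.
    probs = [spamprob.get(w, unknown_prob) for w in set(words)]
    # Keep the clues farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    clues = probs[:max_clues]
    spam = prod(clues)
    ham = prod(1.0 - p for p in clues)
    return spam / (spam + ham)

# Made-up table: one strong spam clue, one strong ham clue.
table = {"free": 0.99, "meeting": 0.05}
message = ["free", "meeting", "zzyzx", "qwerty"]  # two unknown words

low = score(message, table, unknown_prob=0.2)
neutral = score(message, table, unknown_prob=0.5)
print(round(low, 3), round(neutral, 3))
```

At 0.2, every never-seen word counts as mild evidence of ham, so a spam
stuffed with novel words is dragged toward "not spam".  At 0.5 an unknown
word is neutral and drops out of the product entirely, which fits the
observed drop in false negatives.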



