Showing content from https://mail.python.org/pipermail/python-dev/2002-September/028505.html below:

[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Sun, 01 Sep 2002 19:40:38 -0400
[Delaney, Timothy]
> Speaking of which, I had a thought this morning (in the shower of
> course ;) about a slightly more intelligent tokeniser.

"Intelligence" isn't necessarily helpful with a statistical scheme, and
always makes it harder to adapt to other languages.

> Split on whitespace, then runs of punctuation at the end of "words" are
> split off as a separate word.

For example <wink>, "free!!" never appears in a ham msg in my corpora, but
appears often in the spam samples.  OTOH, plain "free" is a weak spam
indicator on c.l.py, given the frequent supposedly on-topic arguments about
free beer versus free speech, etc.
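
To make the "free!!" versus "free" point concrete, here is a small sketch that counts how many messages contain each token under two tokenisers. The four messages are invented for illustration and come from no real corpus, and the helper names (doc_counts, whitespace, split_trailing_bang) are hypothetical. It is not code from the sandbox.

```python
from collections import Counter

# Made-up toy messages, for illustration only; not from any real corpus.
spam = ["free!! money now", "get it free!!"]
ham = ["free beer versus free speech", "is python free software"]

def doc_counts(msgs, tokenizer):
    """Number of messages in which each token appears at least once."""
    counts = Counter()
    for msg in msgs:
        counts.update(set(tokenizer(msg)))
    return counts

def whitespace(msg):
    # Baseline: split on whitespace only, so "free!!" stays one token.
    return msg.split()

def split_trailing_bang(msg):
    # Crude stand-in for the proposal: split a trailing run of "!" off a word.
    out = []
    for w in msg.split():
        core = w.rstrip("!")
        if core and core != w:
            out += [core, w[len(core):]]
        else:
            out.append(w)
    return out

for tok in (whitespace, split_trailing_bang):
    s, h = doc_counts(spam, tok), doc_counts(ham, tok)
    # Each value is (spam message count, ham message count).
    print(tok.__name__, {t: (s[t], h[t]) for t in ("free", "free!!", "!!")})
```

On this toy data, whitespace splitting keeps "free!!" as a spam-only token, while splitting the punctuation off folds it into plain "free" (now equally common in both sets) and leaves "!!" as a separate token. Whether that trade helps or hurts accuracy on real data is exactly what would have to be tested.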

>     a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
>
>     A phrase. -> 'A', 'phrase', '.'
>
>     WTF??? -> 'WTF', '???'
>
>     >>> import module -> '>>>', 'import', 'module'

The first and last are the same as just splitting on whitespace.  The
2nd-last may lose the distinction between WTF??? and a solicitation to join
the World Trade Federation <wink>; WTF isn't likely to make it into a list
of smoking guns regardless.  Hard to guess about the 2nd.  The database
isn't large enough to worry about reducing its size, btw -- the only
gimmicks I care about are those that increase accuracy.
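
For anyone who wants to try it, a minimal sketch of the proposed tokeniser follows. It assumes "punctuation" means any character that is neither a word character nor whitespace, which the proposal does not spell out, and the names tokenize and _TRAILING_PUNCT are made up here. Hooking it into timtest.py is not shown.

```python
import re

# A word that contains at least one word character and ends with a run of
# punctuation. Assumption: "punctuation" is anything that is neither a word
# character nor whitespace; the proposal does not define it.
_TRAILING_PUNCT = re.compile(r'^(.*?\w)([^\w\s]+)$')

def tokenize(text):
    """Split on whitespace, then split off trailing punctuation runs."""
    tokens = []
    for word in text.split():
        m = _TRAILING_PUNCT.match(word)
        if m:
            # e.g. "WTF???" -> "WTF", "???"
            tokens.extend(m.groups())
        else:
            # No trailing punctuation, or punctuation only (e.g. ">>>").
            tokens.append(word)
    return tokens

print(tokenize("a.b.c"))              # ['a.b.c']
print(tokenize("A phrase."))          # ['A', 'phrase', '.']
print(tokenize("WTF???"))             # ['WTF', '???']
print(tokenize(">>> import module"))  # ['>>>', 'import', 'module']
```

Only the second and third examples come out differently from text.split(), which matches the observation above about the first and last.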

> Might this be useful? No code of course ;)

It takes about an hour to run and evaluate tests for one change.  If you
want to motivate me to try, supply a patch against timtest.py (in the
sandbox), else I've already got far more ideas than time to test them
properly.  Anyone else want to test this one?



