[Delaney, Timothy]
> Speaking of which, I had a thought this morning (in the shower of
> course ;) about a slightly more intelligent tokeniser.

"Intelligence" isn't necessarily helpful with a statistical scheme, and
always makes it harder to adapt to other languages.

> Split on whitespace, then runs of punctuation at the end of "words" are
> split off as a separate word.

For example <wink>, "free!!" never appears in a ham msg in my corpora, but
appears often in the spam samples.  OTOH, plain "free" is a weak spam
indicator on c.l.py, given the frequent supposedly on-topic arguments about
free beer versus free speech, etc.

>     a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
>
>     A phrase. -> 'A', 'phrase', '.'
>
>     WTF??? -> 'WTF', '???'
>
>     >>> import module -> '>>>', 'import', 'module'

The first and last are the same as just splitting on whitespace.  The
2nd-last may lose the distinction between WTF??? and a solicitation to join
the World Trade Federation <wink>; WTF isn't likely to make it into a list
of smoking guns regardless.  Hard to guess about the 2nd.

The database isn't large enough to worry about reducing its size, btw --
the only gimmicks I care about are those that increase accuracy.

> Might this be useful? No code of course ;)

It takes about an hour to run and evaluate tests for one change.  If you
want to motivate me to try, supply a patch against timtest.py (in the
sandbox), else I've already got far more ideas than time to test them
properly.  Anyone else want to test this one?
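
A minimal sketch of that splitting rule, for anyone who wants to wire it
into a patch (assuming I've read the proposal right -- tokenize() and
PUNCT are made-up names, not anything in timtest.py, and
string.punctuation is only a rough stand-in for what counts as
"punctuation"):

    import string

    PUNCT = set(string.punctuation)

    def tokenize(text):
        """Split on whitespace, then peel a trailing run of punctuation
        off each word as its own token.  A word that is entirely
        punctuation (like '>>>') or has only internal punctuation
        (like 'a.b.c') is left intact."""
        for word in text.split():
            # Walk backward past the trailing punctuation run, if any.
            i = len(word)
            while i > 0 and word[i-1] in PUNCT:
                i -= 1
            if 0 < i < len(word):
                yield word[:i]   # the word proper
                yield word[i:]   # the trailing punctuation run
            else:
                yield word       # nothing to split off

which reproduces the examples above:

    >>> list(tokenize('>>> import module'))
    ['>>>', 'import', 'module']
    >>> list(tokenize('A phrase.  WTF??? a.b.c free!!'))
    ['A', 'phrase', '.', 'WTF', '???', 'a.b.c', 'free', '!!']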