[Delaney, Timothy]
> Speaking of which, I had a thought this morning (in the shower of
> course ;) about a slightly more intelligent tokeniser.

"Intelligence" isn't necessarily helpful with a statistical scheme, and
always makes it harder to adapt to other languages.

> Split on whitespace, then runs of punctuation at the end of "words" are
> split off as a separate word.

For example <wink>, "free!!" never appears in a ham msg in my corpora, but
appears often in the spam samples.  OTOH, plain "free" is a weak spam
indicator on c.l.py, given the frequent supposedly on-topic arguments about
free beer versus free speech, etc.

>     a.b.c -> 'a.b.c' (main use: keeps file extensions with filenames)
>
>     A phrase. -> 'A', 'phrase', '.'
>
>     WTF??? -> 'WTF', '???'
>
>     >>> import module -> '>>>', 'import', 'module'

The first and last are the same as just splitting on whitespace.  The
2nd-last may lose the distinction between WTF??? and a solicitation to join
the World Trade Federation <wink>; WTF isn't likely to make it into a list
of smoking guns regardless.  Hard to guess about the 2nd.

The database isn't large enough to worry about reducing its size, btw --
the only gimmicks I care about are those that increase accuracy.

> Might this be useful? No code of course ;)

It takes about an hour to run and evaluate tests for one change.  If you
want to motivate me to try, supply a patch against timtest.py (in the
sandbox), else I've already got far more ideas than time to test them
properly.  Anyone else want to test this one?
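
A minimal sketch of that splitting rule, for anyone who wants to wire it
into a patch (assuming I've read the proposal right -- tokenize() and
PUNCT are made-up names, not anything in timtest.py, and
string.punctuation is only a rough stand-in for what counts as
"punctuation"):

    import string

    PUNCT = set(string.punctuation)

    def tokenize(text):
        """Split on whitespace, then peel a trailing run of punctuation
        off each word as its own token.  A word that is entirely
        punctuation (like '>>>') or has only internal punctuation
        (like 'a.b.c') is left intact."""
        for word in text.split():
            # Walk backward past the trailing punctuation run, if any.
            i = len(word)
            while i > 0 and word[i-1] in PUNCT:
                i -= 1
            if 0 < i < len(word):
                yield word[:i]   # the word proper
                yield word[i:]   # the trailing punctuation run
            else:
                yield word       # nothing to split off

which reproduces the examples above:

    >>> list(tokenize('>>> import module'))
    ['>>>', 'import', 'module']
    >>> list(tokenize('A phrase.  WTF??? a.b.c free!!'))
    ['A', 'phrase', '.', 'WTF', '???', 'a.b.c', 'free', '!!']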