> From: Tim Peters [mailto:tim.one@comcast.net]
>
> Training GBayes is cheap, and the more you feed it the less need to do
> information-destroying transformations (like folding case or ignoring
> punctuation).

Speaking of which, I had a thought this morning (in the shower of course ;)
about a slightly more intelligent tokeniser. Split on whitespace, then split
any run of punctuation at the end of a "word" off as a separate word. So:

    a.b.c             -> 'a.b.c'   (main use: keeps file extensions with filenames)
    A phrase.         -> 'A', 'phrase', '.'
    WTF???            -> 'WTF', '???'
    >>> import module -> '>>>', 'import', 'module'

Might this be useful? No code of course ;)

Tim Delaney
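Since no code was offered, here is a minimal sketch of what such a tokeniser might look like. The function name `tokenize` and the regex are my own choices, not anything from the thread; the regex treats any non-word, non-space character as "punctuation", and a word that is *all* punctuation (like '>>>') is kept whole, matching the examples above.

```python
import re

# A trailing run of punctuation (non-word, non-space chars) at the end
# of a word, preceded by at least some non-punctuation prefix.
_TRAILING_PUNCT = re.compile(r'^(.*?)([^\w\s]+)$')

def tokenize(text):
    """Split on whitespace, then split trailing punctuation runs
    off each word as separate tokens."""
    tokens = []
    for word in text.split():
        m = _TRAILING_PUNCT.match(word)
        if m and m.group(1):
            # Word ends in punctuation and has a non-punctuation prefix:
            # emit the prefix and the punctuation run separately.
            tokens.append(m.group(1))
            tokens.append(m.group(2))
        else:
            # No trailing punctuation, or the word is all punctuation
            # (e.g. '>>>'): keep it intact.
            tokens.append(word)
    return tokens
```

Note that internal punctuation survives, so 'a.b.c' stays a single token, while 'phrase.' becomes 'phrase' plus '.'.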