[Eric S. Raymond] > VM, dude, VM is your friend. I thought this through carefully. The > size of bogofilter's working set isn't limited by core. Do you expect that to be an issue? When I built a database from 20,000 messages, the whole thing fit in a Python dict consuming about 10MB. That's all. As is always the case in this kind of thing, about half the tokens are utterly useless since they appear only once in only one message (think of things like misspellings and message ids here -- about half the tokens generated will be "obviously useless", although the useless won't become obvious until, e.g., a month passes and you never seen the token again). In addition, this is a statistical-sampling method, and 20,000 messages is a very large sample. I concluded that, in practice, and since we do have ways to identify and purge useless tokens, 5MB is a reasonable upper bound on the size of this thing. It doesn't fit in my L2 cache, but I'd need a magnifying glass to find it in my RAM. > And because it's a B-tree variant, the access frequency will be proportional > to log2 of the wordlist size I don't believe Judy is faster than Python string-keyed dicts at the sizes I'm expecting (I've corresponded with Douglas about that too, and his timing data has a hard time disagreeing <wink>). > and the patterns will be spatially bursty. Why? Do you sort tokens before looking them up? Else I don't see a reason to expect that, from one lookup to the next, the paths from root to leaf will enjoy spatial overlap beyond the root node. > This is a memory access pattern that plays nice with an LRU pager. Well, as I said before, all the evidence I've seen says the scoring time for a message is near-trivial (including the lookup times), even in pure Python. It's only the parsing that I'm still worried about, and I may yet confess a bad case of Flex envy. > I'm working on a simpler solution, one which might have a Pythonic spinoff. > Stay tuned. I figure this means something simpler than a BTree under ZODB. If so, you should set yourself a tougher challenge <wink>.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4