Tim Peters <tim_one@email.msn.com>: > [Eric S. Raymond] > > VM, dude, VM is your friend. I thought this through carefully. The > > size of bogofilter's working set isn't limited by core. > > Do you expect that to be an issue? When I built a database from 20,000 > messages, the whole thing fit in a Python dict consuming about 10MB. Hm, that's a bit smaller than I would have thought, but the order of magnitude I was expecting. > That's > all. As is always the case in this kind of thing, about half the tokens are > utterly useless since they appear only once in only one message (think of > things like misspellings and message ids here -- about half the tokens > generated will be "obviously useless", although the useless won't become > obvious until, e.g., a month passes and you never seen the token again). Recognition features should age! Wow! That's a good point! With the age counter being reset when they're recognized. > > and the patterns will be spatially bursty. > > Why? Do you sort tokens before looking them up? I thought part of the point of the method was that you get sorting for free because of the way elements are inserted. > Else I don't see a reason > to expect that, from one lookup to the next, the paths from root to leaf > will enjoy spatial overlap beyond the root node. No, but think about how the pointer in a binary search moves. It's spatially bursty, Memory accesses frequencies for repeated binary searches will be a sum of bursty signals, analogous to the way network traffic volumes look in the time domain. In fact the graph of memory adress vs. number of accesses is gonna win up looking an awful lot like 1/f noise, I think. *Not* evenly distributed; something there for LRU to weork with. > > I'm working on a simpler solution, one which might have a Pythonic spinoff. > > Stay tuned. > > I figure this means something simpler than a BTree under ZODB. If so, you > should set yourself a tougher challenge <wink>. What I'm starting to test now is a refactoring of the program where it spawn a daemon version of itself first time it's called. The daemon eats the wordlists and stays in core fielding requests from subsequent program runs. Basically an answer to "how you call bogofilter 1K times a day from procmail without bringing your disks to their knees" problem" -- persistence on the cheap. Thing is that the solution to this problem is very generic. Might turn into a Python framework. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4