[Tim]
>> Do you expect that to be an issue?  When I built a database from 20,000
>> messages, the whole thing fit in a Python dict consuming about 10MB.

[Eric S. Raymond]
> Hm, that's a bit smaller than I would have thought, but the order of
> magnitude I was expecting.

It's even smaller than that <wink>.  The dict here maps strings to instances
of a Python class (WordInfo).  The latter uses new-in-2.2 __slots__, and
those give major memory efficiencies over old-style classes, but there's
still substantial memory overhead compared to what's possible in C.  In
addition, there are memory overheads for the Python objects stored in the
WordInfo instances, including a Python float object in each record recording
the time.time() of last access by the scoring method.

IOW, there are tons of memory overheads here, yet the size was still minor.
So I have no hesitation leaving this part in Python, and coding this part up
was a trivial finger exercise.  You know all that, though!  It makes your
decision to use C from the start hard to fathom.

> ...
> Recognition features should age!  Wow!  That's a good point!  With the
> age counter being reset when they're recognized.

For concreteness, here's the comment from the Python code, which I believe
is accurate:

    # (*)atime is the last access time, a UTC time.time() value.  It's the
    # most recent time this word was used by scoring (i.e., by spamprob(),
    # not by training via learn()); or, if the word has never been used by
    # scoring, the time the word record was created (i.e., by learn()).
    # One good criterion for identifying junk (word records that have no
    # value) is to delete words that haven't been used for a long time.
    # Perhaps they were typos, or unique identifiers, or relevant to a
    # once-hot topic or scam that's fallen out of favor.  Whatever, if
    # a word is no longer being used, it's just wasting space.

Besides the space-saving gimmick, there may be practical value in expiring
older words that are still getting used, but less frequently over time.
That would be evidence that the nature of the world is changing, and more
aggressively expiring the model for how the world *used* to be may speed
adaptation to the new realities.  I'm not saving enough info to do that,
though, and it's unclear whether it would really help.  Against it, while I
see new spam gimmicks pop up regularly, the old ones never seem to go away
(e.g., I don't do anything to try to block spam on my email accounts, and
the bulk of the spam I get is still easily recognized from the subject line
alone).  However, because it's all written in Python <wink>, it will be very
easy to set up experiments to answer such questions.

BTW, the ifile FAQ gives a little info about the expiration scheme ifile
uses.  Rennie's paper gives more:

    Age is defined as the number of e-mail messages which have been
    added to the model since frequency statistics have been kept for
    the word.  Old, infrequent words are to be dropped while young
    words and old, frequent words should be kept.  One way to quantify
    this is to say that words which occur fewer than log2(age)-1 times
    should be discarded from the model.  For example, if "baseball"
    occurred in the 1st document and occurred 5 or fewer times in the
    next 63 documents, the word and its corresponding statistics would
    be eliminated from the model's database.  This feature selection
    cutoff is used in ifile and is found to significantly improve
    efficiency without noticeably affecting classification performance.
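To put some flesh on the WordInfo talk earlier, here's a minimal sketch of
what such a record, and the delete-words-not-used-in-a-long-time criterion
from the comment, might look like.  This is an illustration, not the real
classifier code -- the field names, the prune_stale() helper, and the
120-day cutoff are all mine:

    import time

    class WordInfo(object):
        # __slots__ suppresses the per-instance __dict__, which is where
        # most of the memory savings over an old-style class comes from.
        __slots__ = ('spamcount', 'hamcount', 'atime')

        def __init__(self, now):
            self.spamcount = 0
            self.hamcount = 0
            self.atime = now    # set by learn(), updated by scoring

    def prune_stale(wordinfo, max_days=120):
        # Drop word records scoring hasn't touched in max_days -- the
        # "no longer being used, just wasting space" criterion above.
        cutoff = time.time() - max_days * 24 * 3600
        for word in [w for w, r in wordinfo.items() if r.atime < cutoff]:
            del wordinfo[word]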
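Rennie's cutoff itself is a one-liner.  A sketch, with a function name of my
own invention, taking count as the number of messages the word has occurred
in and age as defined in the quote above:

    import math

    def should_expire(count, age):
        # Rennie's rule: discard words occurring fewer than
        # log2(age) - 1 times.  E.g. at age 64, log2(64) - 1 == 5, so a
        # word seen 4 or fewer times by then gets dropped.
        return count < math.log2(age) - 1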
I'm not sure how that policy would work with Graham's scheme (which has many
differences from the more conventional scheme ifile uses).  Our Python code
also saves a count of the number of times each word makes it into Graham's
"best 15" list, and I expect that to be a direct measure of the value we're
getting out of keeping a word ("word" means whatever the tokenizer passes to
the classifier -- it's really any hashable and (for now) pickleable Python
object).

[on Judy]
> I thought part of the point of the method was that you get
> sorting for free because of the way elements are inserted.

Sure, if range-search or final sorted order is important, it's a great
benefit.  I was only wondering about why you'd expect spatial locality in
the input as naturally ordered.

> ...
> No, but think about how the pointer in a binary search moves.  It's
> spatially bursty.  Memory access frequencies for repeated binary
> searches will be a sum of bursty signals, analogous to the way
> network traffic volumes look in the time domain.  In fact the
> graph of memory address vs. number of accesses is gonna wind up
> looking an awful lot like 1/f noise, I think.  *Not* evenly
> distributed; something there for LRU to work with.

Judy may or may not be able to exploit something here; I don't know, and I'd
need to know a lot more about Judy's implementation to even start to guess.
Plain binary search has horrid cache behavior, though.  Indeed, most older
research papers on B-Trees suggest that binary search doesn't buy anything
in even rather large B-Tree nodes, because the cache misses swamp the
reduced instruction count over what a simple linear search does.  Worse,
that linear search can be significantly faster if the HW is smart enough to
detect the regular address access pattern of linear search and do some
helpful prefetching for you.  More recent research on B-Trees is on ways to
get away from bad binary search behavior; two current lines are using
explicit prefetch instructions to minimize the stalls, and using a more
cache-friendly data structure inside a B-Tree node.  Out-guessing modern VM
implementations is damned hard.

With a Python dict you're likely to get a cache miss per lookup.  If that's
a disaster, time.clock() isn't revealing it <wink>.

> ...
> What I'm starting to test now is a refactoring of the program where it
> spawns a daemon version of itself first time it's called.  The daemon
> eats the wordlists and stays in core fielding requests from subsequent
> program runs.  Basically an answer to the "how do you call bogofilter 1K
> times a day from procmail without bringing your disks to their knees"
> problem -- persistence on the cheap.
>
> Thing is that the solution to this problem is very generic.  Might
> turn into a Python framework.

Sounds good!  Just don't call it "compilerlike" <wink>.
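For the curious, the shape of that daemon trick in Python is roughly as
below.  A sketch, not ESR's code: the socket path and one-word-per-connection
protocol are invented for illustration, and a real version would want
locking, startup-race handling, and a saner wire format:

    import os
    import socket

    SOCK = "/tmp/wordlist.sock"   # invented path, for illustration only

    def serve(wordinfo):
        # Load the wordlists once, stay in core, and field lookups from
        # the short-lived runs procmail keeps spawning.
        try:
            os.unlink(SOCK)
        except OSError:
            pass
        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        srv.bind(SOCK)
        srv.listen(5)
        while True:
            conn, _ = srv.accept()
            word = conn.recv(1024).decode()
            conn.sendall(repr(wordinfo.get(word)).encode())
            conn.close()

    def lookup(word):
        # What each short-lived program run does instead of re-reading
        # the wordlists from disk.
        cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        cli.connect(SOCK)
        cli.sendall(word.encode())
        reply = cli.recv(4096).decode()
        cli.close()
        return reply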