This is really interesting. When I was at Dragon (well, actually, after Tim left and it became L&H), I ported my natural language parsing/understanding system from Python to C++ so it could run quickly enough for embedded devices. The core of this system was an associative container, so I knew that its performance would be crucial. I used C++ generics, which made it really easy to swap in different associative container implementations, and I tried lots, including the red-black tree containers built into most C++ implementations, and hash tables.

My experience was that getting a significant speed increase over the tree containers was extremely difficult, because it was really hard to come up with a good hash function. Furthermore, even when I succeeded it was like black magic: the results were inconsistent across my test cases, and there was no way to understand why it worked well or to get a feeling for how it would scale to problems outside those cases. I ended up hand-coding a two-level scheme based on binary searches in contiguous arrays, which blew away anything I'd been able to do with a hash table (a rough sketch of the idea appears after the quoted message below).

My conclusion was that for general-purpose use, the red-black tree is pretty good, despite its relatively high memory overhead of 3 pointers per node: it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance, even asymptotically. Hashing, on the other hand, requires that the user supply both a hash function and an equality detector which must agree with one another, requires hand-tuning of the hash function for performance, and is rather more unpredictable (the two containers' contracts are sketched below as well).

We've been talking about adding hash-based containers to the C++ standard library, but I'm reluctant on these grounds. It seems to me that when you really care about speed, some kind of hand-coded solution might be a better investment than trying to come up with a good hash function. I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?

-Dave

From: "Tim Peters" <tim.one@comcast.net>
> There's just no contest here. BTrees have many other virtues, like
> supporting range searches, and automatically playing nice with ZODB
> persistence, but they're plain sluggish compared to dicts. To be
> completely fair and unfair at the same time <wink>, there are also 4
> other flavors of Zope BTree, purely for optimization reasons. In
> particular, the IIBTree maps Python ints to Python ints, and does so by
> avoiding Python int objects altogether, storing C longs directly and
> comparing them at native "compare a long to a long" C speed. That's
> *almost* as fast as Python 2.1 int->int dicts (which endure all-purpose
> Python object comparison), except for deletion (the BTree spends a lot
> of time tearing apart all the tree pointers again).
>
> Now that's a perfectly height-balanced search tree that "chunks up"
> blocks of keys for storage and speed efficiency, and rarely needs more
> than a simple local adjustment to maintain balance. I expect that puts
> it at the fast end of what can be achieved with a balanced tree scheme.
>
> The ABC language (which Guido worked on before Python) used AVL trees
> for just about everything under the covers. It's not a coincidence that
> Python doesn't use balanced trees for anything <wink>.
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev
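For concreteness, here is a rough sketch of what a lookup in a two-level scheme of binary searches over contiguous arrays might look like. The post above doesn't describe the actual layout, so the split of the key into two levels, the types, and all names below are illustrative assumptions, written in present-day C++:

    // A minimal sketch (not the actual Dragon/L&H code) of a two-level
    // binary-search lookup: an outer sorted array of first-level keys, each
    // owning an inner sorted array of (second-level key, value) pairs.
    #include <algorithm>
    #include <string>
    #include <vector>

    struct Inner {
        std::string key2;
        int value;
    };

    struct Outer {
        std::string key1;
        std::vector<Inner> entries;   // kept sorted by key2
    };

    // Both levels are contiguous, sorted arrays, so a lookup is two
    // cache-friendly binary searches and no hash function is needed.
    const int* lookup(const std::vector<Outer>& table,   // kept sorted by key1
                      const std::string& key1,
                      const std::string& key2)
    {
        auto outer = std::lower_bound(
            table.begin(), table.end(), key1,
            [](const Outer& o, const std::string& k) { return o.key1 < k; });
        if (outer == table.end() || outer->key1 != key1)
            return nullptr;

        auto inner = std::lower_bound(
            outer->entries.begin(), outer->entries.end(), key2,
            [](const Inner& e, const std::string& k) { return e.key2 < k; });
        if (inner == outer->entries.end() || inner->key2 != key2)
            return nullptr;

        return &inner->value;
    }

Because both levels are sorted, contiguous arrays, the only requirement on the key types is an ordering, and the access pattern stays predictable; the trade-off is that insertions into the middle of an array are expensive, which matters less for a mostly-read lookup table.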
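To make the contrast between the two containers' contracts concrete, here is a minimal sketch using today's standard containers (hashed containers were not yet part of the standard library when this thread was written). The case-insensitive string key is just an example, and the functors below are hypothetical:

    #include <algorithm>
    #include <cctype>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <unordered_map>

    // For the tree container, the user supplies only a strict weak ordering.
    struct CaseInsensitiveLess {
        bool operator()(const std::string& a, const std::string& b) const {
            return std::lexicographical_compare(
                a.begin(), a.end(), b.begin(), b.end(),
                [](unsigned char x, unsigned char y) {
                    return std::tolower(x) < std::tolower(y);
                });
        }
    };

    // For the hashed container, the user supplies a hash function *and* an
    // equality detector, and the two must agree: keys that compare equal
    // must produce the same hash, or lookups silently break.
    struct CaseInsensitiveHash {
        std::size_t operator()(const std::string& s) const {
            std::size_t h = 0;
            for (unsigned char c : s)
                h = h * 131 + std::tolower(c);  // the quality of this mix is the hard part
            return h;
        }
    };

    struct CaseInsensitiveEqual {
        bool operator()(const std::string& a, const std::string& b) const {
            if (a.size() != b.size()) return false;
            for (std::size_t i = 0; i < a.size(); ++i)
                if (std::tolower(static_cast<unsigned char>(a[i])) !=
                    std::tolower(static_cast<unsigned char>(b[i])))
                    return false;
            return true;
        }
    };

    using TreeMap = std::map<std::string, int, CaseInsensitiveLess>;
    using HashMap = std::unordered_map<std::string, int,
                                       CaseInsensitiveHash, CaseInsensitiveEqual>;

The tree container asks for one small, easily verified thing; the hashed container asks for two things that must stay consistent with each other, plus a hash whose distribution largely determines whether it actually beats the tree in practice.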