This is really interesting. When I was at Dragon (well, actually, after Tim left and it became L&H), I ported my natural language parsing/understanding system from Python to C++ so it could run quickly enough for embedded devices. The core of this system was an associative container, so I knew that its performance would be crucial. I used C++ generics, which made it really easy to swap in different associative container implementations, and I tried lots, including the red-black tree containers built into most C++ implementations, and hash tables.

My experience was that getting a significant speed increase over the tree containers was extremely difficult, because it was really hard to come up with a good hash function. Furthermore, even when I succeeded it was like black magic: the results were inconsistent across my test cases, and there was no way to understand why it worked well or to get a feeling for how it would scale to problems outside those cases. I ended up hand-coding a two-level scheme based on binary searches in contiguous arrays, which blew away anything I'd been able to do with a hash table (a rough sketch of the idea appears after the quoted message below).

My conclusion was that for general-purpose use, the red-black tree is pretty good, despite its relatively high memory overhead of 3 pointers per node: it places easy requirements on the user (supply a strict weak ordering) and provides predictable and smooth performance, even asymptotically. Hashing, on the other hand, requires that the user supply both a hash function and an equality detector which must agree with one another, requires hand-tuning of the hash function for performance, and is rather more unpredictable (the two containers' contracts are sketched below as well).

We've been talking about adding hash-based containers to the C++ standard library, but I'm reluctant on these grounds. It seems to me that when you really care about speed, some kind of hand-coded solution might be a better investment than trying to come up with a good hash function. I'm ready to believe that hashing is the most appropriate choice for Python, but I wonder what makes the difference?

-Dave

From: "Tim Peters" <tim.one@comcast.net>
> There's just no contest here. BTrees have many other virtues, like
> supporting range searches, and automatically playing nice with ZODB
> persistence, but they're plain sluggish compared to dicts. To be
> completely fair and unfair at the same time <wink>, there are also 4
> other flavors of Zope BTree, purely for optimization reasons. In
> particular, the IIBTree maps Python ints to Python ints, and does so by
> avoiding Python int objects altogether, storing C longs directly and
> comparing them at native "compare a long to a long" C speed. That's
> *almost* as fast as Python 2.1 int->int dicts (which endure all-purpose
> Python object comparison), except for deletion (the BTree spends a lot
> of time tearing apart all the tree pointers again).
>
> Now that's a perfectly height-balanced search tree that "chunks up"
> blocks of keys for storage and speed efficiency, and rarely needs more
> than a simple local adjustment to maintain balance. I expect that puts
> it at the fast end of what can be achieved with a balanced tree scheme.
>
> The ABC language (which Guido worked on before Python) used AVL trees
> for just about everything under the covers. It's not a coincidence that
> Python doesn't use balanced trees for anything <wink>.
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev
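For concreteness, here is a rough sketch of what a lookup in a two-level scheme of binary searches over contiguous arrays might look like. The post above doesn't describe the actual layout, so the split of the key into two levels, the types, and all names below are illustrative assumptions, written in present-day C++:

    // A minimal sketch (not the actual Dragon/L&H code) of a two-level
    // binary-search lookup: an outer sorted array of first-level keys, each
    // owning an inner sorted array of (second-level key, value) pairs.
    #include <algorithm>
    #include <string>
    #include <vector>

    struct Inner {
        std::string key2;
        int value;
    };

    struct Outer {
        std::string key1;
        std::vector<Inner> entries;   // kept sorted by key2
    };

    // Both levels are contiguous, sorted arrays, so a lookup is two
    // cache-friendly binary searches and no hash function is needed.
    const int* lookup(const std::vector<Outer>& table,   // kept sorted by key1
                      const std::string& key1,
                      const std::string& key2)
    {
        auto outer = std::lower_bound(
            table.begin(), table.end(), key1,
            [](const Outer& o, const std::string& k) { return o.key1 < k; });
        if (outer == table.end() || outer->key1 != key1)
            return nullptr;

        auto inner = std::lower_bound(
            outer->entries.begin(), outer->entries.end(), key2,
            [](const Inner& e, const std::string& k) { return e.key2 < k; });
        if (inner == outer->entries.end() || inner->key2 != key2)
            return nullptr;

        return &inner->value;
    }

Because both levels are sorted, contiguous arrays, the only requirement on the key types is an ordering, and the access pattern stays predictable; the trade-off is that insertions into the middle of an array are expensive, which matters less for a mostly-read lookup table.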
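To make the contrast between the two containers' contracts concrete, here is a minimal sketch using today's standard containers (hashed containers were not yet part of the standard library when this thread was written). The case-insensitive string key is just an example, and the functors below are hypothetical:

    #include <algorithm>
    #include <cctype>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <unordered_map>

    // For the tree container, the user supplies only a strict weak ordering.
    struct CaseInsensitiveLess {
        bool operator()(const std::string& a, const std::string& b) const {
            return std::lexicographical_compare(
                a.begin(), a.end(), b.begin(), b.end(),
                [](unsigned char x, unsigned char y) {
                    return std::tolower(x) < std::tolower(y);
                });
        }
    };

    // For the hashed container, the user supplies a hash function *and* an
    // equality detector, and the two must agree: keys that compare equal
    // must produce the same hash, or lookups silently break.
    struct CaseInsensitiveHash {
        std::size_t operator()(const std::string& s) const {
            std::size_t h = 0;
            for (unsigned char c : s)
                h = h * 131 + std::tolower(c);  // the quality of this mix is the hard part
            return h;
        }
    };

    struct CaseInsensitiveEqual {
        bool operator()(const std::string& a, const std::string& b) const {
            if (a.size() != b.size()) return false;
            for (std::size_t i = 0; i < a.size(); ++i)
                if (std::tolower(static_cast<unsigned char>(a[i])) !=
                    std::tolower(static_cast<unsigned char>(b[i])))
                    return false;
            return true;
        }
    };

    using TreeMap = std::map<std::string, int, CaseInsensitiveLess>;
    using HashMap = std::unordered_map<std::string, int,
                                       CaseInsensitiveHash, CaseInsensitiveEqual>;

The tree container asks for one small, easily verified thing; the hashed container asks for two things that must stay consistent with each other, plus a hash whose distribution largely determines whether it actually beats the tree in practice.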