I'm still trying to work off the backlog of ignored dict ideas. Way back here: http://mail.python.org/pipermail/python-dev/2000-December/011085.html Christian Tismer suggested using polynomial division instead of multiplication for generating the probe sequence, as a way to get all the bits of the hash code into play. The desirability of doing that is illustrated by, e.g., this program: def f(keys): from time import clock d = {} s = clock() for k in keys: d[k] = k f = clock() print "build time %.3f" % (f-s) s = clock() for k in keys: assert d.has_key(k) f = clock() print "search time %.3f" % (f-s) # Excellent performance. keys = range(20000) for i in range(5): f(keys) # Terrible performance; > 500x slower. keys = [i << 16 for i in range(20000)] for i in range(5): f(keys) Christian had a very clever (cheap and effective) solution: Old algortithm (multiplication): shift the index left by 1 if index > mask: xor the index with the generator polynomial New algorithm (division): if low bit of index set: xor the index with the generator polynomial shift the index right by 1 where "index" should really read "increment", and unlike today we do not mask off any of the bits of the initial increment (and that's what lets *all* the bits of the hash code come into play; there's no point to doing this otherwise). I've since discovered that it's got a fatal rare flaw: the new algorithm can generate a 0 increment, while the old algorithm cannot. Example: poly is 131 and hash is 145. Because we don't mask off any bits in computing the initial increment, the initial increment is computed as incr = hash ^ (hash >> 3) == 145 ^ (145 >> 3) == 145 ^ 18 == 131 == poly So if we don't hit on the first probe, the new if low bit of index set: xor the index with the generator polynomial shift the index right by 1 business sets incr to 0, and the result is an infinite loop (0 is a fixed point). I hate to add another branch to this. As is, the existing branch in both the old and new ways is of the worst possible kind: it's taken half the time, with a pseudo-random distribution. So there's not a branch-prediction gimmick on earth it won't fool. Note that there's no reasonable way to identify "bad values" for incr before the loop starts, either -- there's really no way to tell whether incr mod poly is 0 without a loop to do division steps until incr < poly (if incr < poly and incr != 0, incr can never become 0, so there's no more need to test after reaching that point). Such a "pre loop" would cost more than the existing loop in most cases, as we usually get out of the existing loop today on its first iteration. But in that case, what am I worried about <wink>? time-for-a-checkin-ly y'rs - tim
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4