I've been doing some investigation of Python hashtables and the hash function used in Python. It turns out that when running pystone.py, as much time is spent in string_hash() as is spent in lookdict_string(). If you look at Python's hash function, shown below, you can see why: a multiply is done for every character in the string. The following function, called 2^22 times in a loop, times in at 530ms for 7-character strings.

static long
string_hash1(register char *p, int size)
{
    register int len;
    register long x;

    len = size;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= size;
    return x;
}

Looking around the web I found the following hash function, which is much faster and seems to distribute better than the Python hash function. Called 2^22 times in a loop, it times in at 160ms for 7-character strings. (Note that it walks the string until the NUL terminator and ignores its size argument.)

static long
string_hash3(register char *p, register int size)
{
    register long x;
    register char c;

    x = 0xd2d84a61;
    while ((c = *p++) != 0)
        x ^= (x << 7) + c + (x >> 5);
    return x;
}

I also came up with a hash function of my own, which seems to distribute near-perfectly (in the sense of being comparable to assigning random numbers as hashes to unique strings) and to be faster yet, at least on my P4 box. It reads the string two bytes at a time, as unsigned shorts. Called 2^22 times in a loop, it times in at 120ms for 7-character strings.

static long
string_hash6(register unsigned short *p, int size)
{
    register unsigned long x = 0;
    register int len;

    if (size == 0)
        return 0;
    len = (size + 1) >> 1;      /* number of 16-bit chunks */
    while (len--)
        x += (x >> 14) + (*p++ * 0xd2d84a61);
    x += (x >> 14) + (size * 0xd2d84a61);
    return x;
}

I've tested these functions by hashing a large set of strings, generated by instrumenting the Python interpreter to emit a string every time one is added to a dictionary. These strings are hashed and thrown into the buckets of various sized tables, and I then calculate a sigma statistic, sum((observed-expected)^2)/(N-1), over the number of items in the buckets of each table. Here are the sum(sigma) results I am getting in my tests:

string_hash1       sum(sigma) ~= 30K
string_hash3       sum(sigma) ~= 23-24K
string_hash6       sum(sigma) ~= 23-24K
string_hashRandom  sum(sigma) ~= 23-24K

Maddeningly, I haven't been able to translate these near-perfect distributions into improvements in pystones. With the current hash function, pystone does an average of 1.2 probes in lookdict_string(). With either of these "near-perfect" hash functions, it does at best 1.3 probes (given exhaustive searches for 'good' shift parameters, though not multipliers).

I'm not sure what other tests to try, or how I might go about tweaking the hashes to perform better in pystone. I'm hoping that someone on python-dev has experience in testing hash functions and can suggest some additional tests and/or tweaks.
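For anyone who wants to reproduce the timings, here is a minimal sketch of the kind of harness described above. The 2^22 loop count and the 7-character string match what I measured, but main() and its details (the particular string, the clock() timing) are illustrative stand-ins rather than my exact test code.

#include <stdio.h>
#include <string.h>
#include <time.h>

static long
string_hash1(const char *p, int size)       /* Python's hash, as above */
{
    int len = size;
    long x = *p << 7;
    while (--len >= 0)
        x = (1000003 * x) ^ *p++;
    return x ^ size;
}

int
main(void)
{
    char s[] = "abcdefg";                   /* a 7-character string */
    int size = (int)strlen(s);
    long h = 0, i;
    clock_t t0 = clock();

    for (i = 0; i < (1L << 22); i++) {
        s[0] = 'a' + (char)(i & 15);        /* vary the input so the call can't be hoisted */
        h ^= string_hash1(s, size);         /* xor keeps the result live */
    }
    printf("%ld ms (h=%ld)\n",
           (long)((clock() - t0) * 1000L / CLOCKS_PER_SEC), h);
    return 0;
}

Swapping in string_hash3 or string_hash6 (with the buffer cast to unsigned short * for the latter) gives the corresponding numbers.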
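And a sketch of the distribution test, again with stand-ins: the word list below substitutes for the strings logged from the instrumented interpreter, NBUCKETS substitutes for the range of table sizes I actually swept, and N in sum((observed-expected)^2)/(N-1) is taken to be the number of buckets, which is my reading of that formula.

#include <stdio.h>
#include <string.h>

#define NBUCKETS 64         /* a stand-in for the various table sizes */

static long
string_hash3(char *p)       /* the web hash from above; size argument dropped */
{
    long x = 0xd2d84a61;
    char c;
    while ((c = *p++) != 0)
        x ^= (x << 7) + c + (x >> 5);
    return x;
}

int
main(void)
{
    /* Stand-in input; the real test used strings captured from the
       interpreter as they were added to dictionaries. */
    char *words[] = { "self", "args", "kwargs", "count", "append",
                      "range", "len", "i", "j", "result" };
    int nwords = sizeof(words) / sizeof(words[0]);
    long buckets[NBUCKETS] = {0};
    double expected = (double)nwords / NBUCKETS;
    double sum = 0.0;
    int i;

    /* Throw each string into a power-of-two table, as the dict does. */
    for (i = 0; i < nwords; i++)
        buckets[string_hash3(words[i]) & (NBUCKETS - 1)]++;

    /* sigma = sum((observed - expected)^2) / (N - 1) over bucket counts */
    for (i = 0; i < NBUCKETS; i++) {
        double d = (double)buckets[i] - expected;
        sum += d * d;
    }
    printf("sigma = %g\n", sum / (NBUCKETS - 1));
    return 0;
}

Summing this sigma across the different table sizes gives the sum(sigma) figures quoted above.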