[Christian Tismer] > ... > For short strings, this prime has bad influence on the low bits, > making it perform supoptimally for small dicts. > See the new2 algo which funnily corrects for that. > The reason is obvious: Just look at the bit pattern > of 1000003: '0xf4243' > > Without giving proof, this smells like bad bit distribution on small > strings to me. You smell it too, right? > ... [Tim] > As is, string hashes are a lot like integer hashes, in that > "consecutive" strings > > J001 > J002 > J003 > J004 > ... > > yield hashes very close together in value. [back to Christian] > A bad generator in that case. I'll look for a better one. Not necessarily! It's for that same reason "consecutive strings" can have "better than random" behavior today. And consecutive strings-- like consecutive ints --are a common case. Here are the numbers for the synthesized string cases: N=1000 trips for strings old=293 new=302 new2=221 trips for bin strings old=0 new=0 new2=0 N=2000 trips for strings old=1093 new=1109 new2=786 trips for bin strings old=0 new=0 new2=0 N=3000 trips for strings old=810 new=839 new2=609 trips for bin strings old=0 new=0 new2=0 N=4000 trips for strings old=1850 new=1843 new2=1375 trips for bin strings old=0 new=0 new2=0 Here they are again, after doing nothing except changing the "^" to "+" in the string hash, i.e. replacing x = (1000003*x) ^ *p++; by x = (1000003*x) + *p++; N=1000 trips for strings old=140 new=127 new2=108 trips for bin strings old=0 new=0 new2=0 N=2000 trips for strings old=480 new=434 new2=411 trips for bin strings old=0 new=0 new2=0 N=3000 trips for strings old=821 new=857 new2=631 trips for bin strings old=0 new=0 new2=0 N=4000 trips for strings old=1892 new=1852 new2=1476 trips for bin strings old=0 new=0 new2=0 The first two sizes are dramatically better, the last two a wash. If you want to see a real disaster, replace the "+" with "*" <wink>: N=1000 trips for strings old=71429 new=6449 new2=2048 trips for bin strings old=81187 new=41117 new2=41584 N=2000 trips for strings old=26882 new=9300 new2=6103 trips for bin strings old=96018 new=46932 new2=42408 I got tired of waiting at that point ... suspecting-a-better-string-hash-is-hard-to-find-ly y'rs - tim
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4