> Jeremy had a possibly happy idea wrt this: make the collision probe
> sequence start in a slot adjacent to the colliding slot. That's likely to
> get sucked into cache "for free", tagging along with the slot that collided.
> If that's effective, it could buy much of the "large dict" speed gains
> Raymond saw without increasing the dict size.

I worked on similar approaches last month and found them wanting. The concept was that a 64-byte cache line held 5.3 dict entries and that probing those was much less expensive than making a random probe into memory outside of the cache.

The first thing I learned was that the random probes were necessary to reduce collisions. Checking the adjacent space is like a single step of linear chaining; it increases the number of collisions. That would be fine if the cost were offset by decreased memory access time; however, for small dicts, the whole dict is already in cache, and having more collisions degrades performance with no compensating gain.

The next bright idea was to have separate lookup functions for small dicts and for larger dictionaries. I set the large-dict lookup to search adjacent entries. The good news is that an artificial test of big dicts showed a substantial improvement (around 25%). The bad news is that real programs were worse off than before.

A day of investigation showed the cause. The artificial test accessed keys randomly and showed the anticipated benefit. However, real programs access some keys more frequently than others (I believe Zipf's law applies). Those keys *and* their collision chains are likely already in the cache. So, big dicts had the same limitation as small dicts: you always lose when you accept more collisions in return for exploiting cache locality.

The conclusion was clear: the best way to gain performance is to have fewer collisions in the first place. Hence, I resumed experiments on sparsification.
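For reference, the "random probe" behavior being defended here can be sketched in Python. The recurrence below mirrors the one in CPython's dictobject.c (`i = (i << 2) + i + perturb + 1`, with `perturb` shifted right by 5 each step); the exact ordering of the update and shift is simplified, and the hash value and table size are illustrative. Mixing the high hash bits in via `perturb` is what makes early probes jump around the table instead of scanning adjacent slots.

```python
from itertools import islice

def probe_sequence(hash_value, mask, perturb_shift=5):
    """Yield the slot indices an open-addressed lookup would visit for a
    given hash in a table of size mask + 1 (a power of two).  A sketch of
    the scheme in CPython's dictobject.c, not the exact code."""
    i = hash_value & mask
    perturb = hash_value
    while True:
        yield i & mask
        # High-order hash bits are folded in via perturb, so consecutive
        # probes are scattered across the table rather than adjacent.
        i = (i << 2) + i + perturb + 1
        perturb >>= perturb_shift

# Once perturb decays to zero, i -> 5*i + 1 (mod 2**k) is a full-period
# recurrence, so every slot is eventually probed.
slots = set(islice(probe_sequence(12345, mask=7), 64))
```

Checking `slots` confirms the sequence visits all eight slots of the table, which is the property that lets the dict guarantee termination without tombstone-free chains.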
> > If someone wants to experiment with that in lookdict_string(), stick a new
> >     ++i;
> > before the for loop, and move the existing
> >     i = (i << 2) + i + perturb + 1;
> > to the bottom of that loop. Likewise for lookdict().

PyStone gains 1%.
PyBench loses 1%.
timecell gains 2%  (spreadsheet benchmark)
timemat loses 2%   (pure Python matrix package benchmark)
timepuzzle loses 1%  (class-based graph traverser)

Raymond Hettinger

P.S. There is one other way to improve cache behavior, but it involves touching code throughout dictobject.c: move the entry values into a separate array from the key/hash pairs. That way, you get eight entries per cache line.

P.P.S. One other idea is to use a different search pattern for small dictionaries. Store entries in a self-organizing list with no holes. Dummy fields aren't needed, which saves a test in the linear search loop. When an entry is found, move it one closer to the head of the list so that the most common entries get found instantly. Since there are no holes, all eight cells can be used instead of the current maximum of five. Like the current arrangement, the whole small dict fits into just two cache lines.
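The P.P.S. scheme can be sketched in Python as a hypothetical illustration (this is not code from dictobject.c, and the class name and layout are invented for the example): entries live in a dense list with no holes, lookup is a linear scan, and a hit is transposed one position toward the head so frequently accessed keys drift to the front.

```python
class SmallDict:
    """Sketch of a self-organizing small-dict: a dense entry list with
    linear search and a move-one-toward-front (transpose) heuristic.
    Hypothetical illustration of the idea, not CPython's implementation."""

    def __init__(self):
        self._entries = []  # list of (hash, key, value) tuples, no holes

    def __setitem__(self, key, value):
        h = hash(key)
        for idx, (eh, ek, _) in enumerate(self._entries):
            if eh == h and ek == key:
                self._entries[idx] = (h, key, value)
                return
        self._entries.append((h, key, value))

    def __getitem__(self, key):
        h = hash(key)
        for idx, (eh, ek, ev) in enumerate(self._entries):
            if eh == h and ek == key:
                if idx:
                    # Transpose: swap the hit one slot toward the head so
                    # hot keys are found after fewer comparisons next time.
                    self._entries[idx - 1], self._entries[idx] = (
                        self._entries[idx], self._entries[idx - 1])
                return ev
        raise KeyError(key)
```

Because there are no dummy (deleted-slot) markers, the scan needs no "is this a tombstone?" test, and under a Zipf-like access pattern the hottest keys settle near the head of the list, which is exactly the locality argument made above.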