[Scott Gilbert]
> I'm curious if there is any literature that you've come across, or if
> you've done any experiments with merging more than two parts at a
> time.

There's a literal mountain of research on the topic. I recommend

    "A Meticulous Analysis of Mergesort Programs"
    Jyrki Katajainen, Jesper Larsson Traff

for a careful accounting of all operations that go into one of these beasts. They got the best results (and much better than quicksort) out of a 4-way bottom-up mergesort via very tedious code (e.g., it effectively represents which input run currently has the smallest next key via the program counter, by way of massive code duplication and oodles of gotos); they were afraid to write actual code for an 8-way version <wink>.

OTOH, they were sorting random integers, and, e.g., were delighted to increase the # of comparisons when that could save a few other "dirt cheap" operations.

> ...
> My thinking is that many machines (probably yours for instance) have a
> cache that is 4-way associative, so merging only 2 blocks at a time might
> not be using the cache as well as it could. Also, changing from merging 2
> blocks to 3 or 4 blocks at a time would change the number of passes you
> have to make (the log part of N*log(N)).
>
> It's quite possible that this isn't worth the trade off in complexity and
> space (or your time :-).

The real reason it's uninteresting to me is that it has no clear applicability to the cases this sort aims at: exploiting significant pre-existing order of various kinds. That leads to unbalanced run lengths when we're lucky, and if I'm merging a 2-element run with a 100,000-element run, high cache associativity isn't of much use.

From the timings I showed before, it's clear that "good cases" of pre-existing order take time that depends almost entirely on just the number of comparisons needed; e.g., 3sort and +sort were as fast as /sort, where the latter does nothing but N-1 comparisons in a single left-to-right scan of the array. Comparisons are expensive enough in Python that doing O(log N) additional comparisons in 3sort and +sort, then moving massive amounts of the array around to fit the oddballs in place, costs almost nothing more in percentage terms. Since these cases are already effectively as fast as a single left-to-right scan, there's simply no potential remaining for significant gain (unless you can speed a single left-to-right scan! that would be way cool).

If you think you can write a sort for random Python arrays faster than the samplesort hybrid, be my guest: I'd love to see it! You should be aware that I've been making this challenge for years <wink>.

Something to note: I think you have an overly simple view of Python's lists in mind. When we're merging two runs in the timing test, it's not *just* the list memory that's getting scanned. The lists contain pointers *to* float objects. The float objects have to get read up from memory too, and there goes the rest of your 4-way associativity. Indeed, if you read the comments in Lib/test/sortperf.py, you'll find that it performs horrid trickery to ensure that ~sort and =sort work on physically distinct float objects; way back when, these particular tests ran much faster, and that turned out to be partly because, e.g., [0.5] * N constructs a list with N pointers to a single float object, and that was much easier on the memory system. We got a really nice slowdown <wink> by forcing N distinct copies of 0.5.
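Here's a rough sketch of that effect (this is not sortperf.py's actual trickery; the size n and the abs() trick are just illustrative). Both lists compare equal element-for-element, but one is n pointers to a single float object while the other holds n physically distinct floats:

    # Shared vs. distinct float objects -- illustrative only.
    import time

    n = 1000000
    shared = [0.5] * n                        # n pointers to ONE float object
    distinct = [abs(-0.5) for i in range(n)]  # abs() builds a fresh float each time

    assert shared[0] is shared[1]
    assert distinct[0] is not distinct[1]

    for name, data in [("shared", shared), ("distinct", distinct)]:
        copy = data[:]
        t0 = time.perf_counter()
        copy.sort()
        print(name, time.perf_counter() - t0)

How big the gap comes out depends entirely on your memory system and Python version; the point is only that the float objects themselves get dragged through the cache, not just the list's pointer vector.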
In earlier Pythons the comparison also got short-circuited by an early pointer-equality test ("if they're the same object, they must be equal"), but that's not done anymore. A quick run just now showed that =sort still runs significantly quicker if given a list of identical objects; the only explanation left for that appears to be cache effects.

> Keeping track of comparisons that you've already made could get ugly,

Most researchers have found that a fancy data structure for this is counter-productive: so long as the m in m-way merging isn't ridiculously large, keeping the head entries in a straight vector with m elements runs fastest. But they're not worried about Python's expensive-comparison case. External sorts using m-way merging with large m typically use a selection tree much like a heap to reduce the expense of keeping track (see, e.g., Knuth for details).

> and your temp space requirement would go from N/2 to possibly 3N/4...
> But since you're diving so deeply into this problem, I figured I'd
> throw it out there.
>
> OTOH, this could be close to the speedup that heavily optimized FFT algs
> get when they go from radix-2 to radix-4. Just thinking out loud...

I don't think that's comparable. Moving to radix 4 cuts the total number of non-trivial complex multiplies an FFT has to do, and non-trivial complex multiplies are the expensive part of what an FFT does. In contrast, boosting the m in m-way merging doesn't cut the number of comparisons needed at all (to the contrary, if you're not very careful it increases them), and comparisons are what kill sorting routines in Python. The elaborate gimmicks in timsort for doing merges of unbalanced runs do cut the total number of comparisons needed, and that's where the huge wins come from.
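For concreteness, here's a toy comparison-counting sketch (nothing like timsort's actual galloping code; the run sizes and the Counted class are made up for illustration) showing why merging a 2-element run with a 100,000-element run rewards something smarter than a plain pairwise merge:

    # Count __lt__ calls for two ways of merging very unbalanced runs.
    from bisect import insort

    class Counted(object):
        calls = 0
        def __init__(self, v):
            self.v = v
        def __lt__(self, other):
            Counted.calls += 1
            return self.v < other.v

    big = [Counted(i) for i in range(100000)]      # 100,000-element run
    small = [Counted(12345.5), Counted(67890.5)]   # 2-element run

    def count(merge):
        Counted.calls = 0
        merge()
        return Counted.calls

    def plain_merge():
        # One comparison per element consumed until a run is exhausted.
        out, i, j = [], 0, 0
        while i < len(big) and j < len(small):
            if small[j] < big[i]:
                out.append(small[j])
                j += 1
            else:
                out.append(big[i])
                i += 1
        out.extend(big[i:])
        out.extend(small[j:])

    def binary_search_merge():
        # Binary-search each element of the short run into a copy of the
        # long one: about len(small) * log2(len(big)) comparisons.
        out = big[:]
        for x in small:
            insort(out, x)

    print("plain pairwise merge:", count(plain_merge))           # tens of thousands
    print("binary-search inserts:", count(binary_search_merge))  # a few dozen

The binary-search variant still moves just as much data around (list.insert is O(n)), but it slashes the comparison count, and in Python the comparisons are what cost.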