[Tim] > ... > I also noted that msort() gets a 32% speedup on my box when sorting a > 1.33-million line snapshot of the Python-Dev archive. This is a puzzler > to account for, since you wouldn't think there's significant pre-existing > lexicographic order in a file like that. McIlroy noted similar results > from experiments on text, PostScript and C source files in his adaptive > mergesort (which is why I tried sorting Python-Dev to begin with), but > didn't offer a hypothesis. Just a note to clarify what "the puzzle" here is. msort() may or may not be faster than sort() on random input on a given box due to platform quirks, but that isn't relevant in this case. What McIlroy noted is that the total # of compares done in these cases was significantly less than log2(N!). That just can't happen (except with vanishingly small probability) if the input data is randomly ordered, and under any comparison-based sorting method. The only hypothesis I have is that, for a stable sort, all the instances of a given element are, by definition of stability, already "in sorted order". So, e.g., "\n" is a popular line in text files, and all the occurrences of "\n" are already sorted. msort can exploit that -- and seemingly does. This doesn't necessarily contradict that ~sort happens to run slower on my box under msort, because ~sort is such an extreme case. OK, if I remove all but the first occurrence of each unique line, the # of lines drops to about 600,000. The speedup msort enjoys also drops, to 6.8%. So exploiting duplicates does appear to account for the bulk of it, but not all of it. If, after removing duplicates, I run that through random.shuffle() before sorting, msort suffers an 8% slowdown(!) relative to samplesort. If I shuffle first but don't remove duplicates, msort enjoys a 10% speedup. So it's clear that msort is getting a significant advantage out of the duplicates, but it's not at all clear what else it's exploiting -- only that there is something else, and that it's significant. Now many times has someone posted an alphabetical list of Python's keywords <wink>?
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4