[Armin Rigo]
>> from heapq import *
>>
>> def isorted(iterable):
>>     heap = list(iterable)
>>     heapify(heap)
>>     while heap:
>>         yield heappop(heap)
>>
>> This generator is similar to the new list.sorted() but starts
>> yielding elements after only O(n) operations (in heapify).
>> ...

[Raymond Hettinger]
> How much of the iterator can be consumed before it becomes preferable
> (in terms of speed and memory) to have used iter(list.sort())?
>
> My guess is that the break-even point for speed is around 10%
> depending on how much order already exists in the underlying list.

This depends so much on the speed of the heap implementation.  When it
gets into the log-time part, a high multiplicative constant due to
fixed overheads makes a slow heap run like a fast heap would if the
latter were working on an *exponentially* larger list.

I just tried on my laptop, under 2.3.2, with lists of a million random
floats.  That's a bad case for list.sort() (there's no order to
exploit, and it wastes some compares trying to find order to exploit),
and is an average case for a heapsort.  Even if I only asked for
*just* the first element of the sorted result, using sort() and
peeling off the first element was about 25% faster than using heapify
followed by one heappop.  That says something about how dramatic the
overheads are in calling Python-coded heap functions (well, it also
says something about the amount of effort I put into optimizing
list.sort() <wink>).

There are deeper problems the heap approach has to fight:

1. A heapsort does substantially more element compares than a
   mergesort, and element compares are expensive in Python, so that's
   hard to overcome.

2. Heapsort has terrible spatial locality, and blowing the cache
   becomes even more important than comparison speed as the number of
   elements grows large.  One of the experiments I did when writing
   the 2.3 sort was to compare a straight mergesort to an enhanced
   version of "weak-heap sort".  Both of those do close to the
   theoretical minimum number of compares on random data.  Despite
   that the mergesort moved more memory around, the always-sequential
   data access in the mergesort left it much faster than the
   cache-hostile weak-heap sort.  A regular heapsort isn't as
   cache-hostile as a weak-heap sort, but it's solidly on the
   cache-hostile side of sorting algorithms, and does more compares
   too.

There's another way to get an iterative sort: do an ordinary recursive
top-down mergesort, but instead of shuffling sublists in place,
*generate* the merge of the subsequences (which are themselves
generators, etc).  That's a very elegant sort, with the remarkable
property that the first element of the final result is generated after
doing exactly N-1 compares, which achieves the theoretical minimum for
finding the smallest element.  Getting result elements after that
takes O(log(N)) additional compares each.  No array storage is needed
beyond the original input list (which isn't changed), but there are
O(N) generators hiding in the runtime stack.  Alas, for that reason
it's impractical for large lists, and the overheads are deadly for
short lists.  It does enjoy the advantage of beauty <wink>.

> In terms of memory, I think list.sort() always beats the above
> implementation.

That can't be -- the heap method only requires a fixed (independent of
N) and small amount of working storage.
list.sort() may need to allocate O(N) additional temp bytes under the covers (to create a working area for doing merges; it can be expected to allocate 2*N temp bytes for a random array of len N, which is its worst case; if there's a lot of pre-existing order in the input array, it can sometimes get away without allocating any temp space).
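For anyone who wants to repeat the million-float experiment described
above, its shape is roughly the following.  The helper names are mine,
and the 25% figure quoted in the message comes from a Python-coded
heapq under 2.3.2; on a current CPython, where heapq has a C
accelerator, the balance can easily come out the other way.

    # Compare "sort, then take the first element" with "heapify, then
    # one heappop" on a million random floats.
    import random
    import time
    from heapq import heapify, heappop

    def first_via_sort(data):
        data = list(data)        # sort() works in place, so copy first
        data.sort()
        return data[0]

    def first_via_heap(data):
        heap = list(data)
        heapify(heap)            # O(n) to establish the heap invariant
        return heappop(heap)     # one O(log n) pop for the smallest item

    data = [random.random() for _ in range(1000000)]
    for name, func in [("sort", first_via_sort), ("heap", first_via_heap)]:
        start = time.perf_counter()
        func(data)
        print(name, time.perf_counter() - start)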
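The generator-based mergesort described in prose above can be written
down in a few lines.  This is a sketch in present-day Python (the
original thread predates "yield from"), and the names lazy_merge and
gsorted are mine.  The input list is never modified; the price is the
O(N) suspended generators that sit behind the scenes.

    _done = object()   # sentinel marking an exhausted input

    def lazy_merge(a, b):
        # Lazily merge two sorted iterators; each element yielded costs
        # at most one compare, and nothing is materialized in a list.
        a, b = iter(a), iter(b)
        x, y = next(a, _done), next(b, _done)
        while x is not _done and y is not _done:
            if y < x:
                yield y
                y = next(b, _done)
            else:                  # ties go left, so the sort is stable
                yield x
                x = next(a, _done)
        if x is not _done:
            yield x
            yield from a
        elif y is not _done:
            yield y
            yield from b

    def gsorted(data, lo=0, hi=None):
        # Ordinary top-down mergesort over data[lo:hi], except that it
        # *generates* the merged result instead of shuffling sublists
        # in place.  data itself is never touched.
        if hi is None:
            hi = len(data)
        n = hi - lo
        if n == 0:
            return
        if n == 1:
            yield data[lo]
            return
        mid = lo + n // 2
        yield from lazy_merge(gsorted(data, lo, mid),
                              gsorted(data, mid, hi))

With this sketch, next(gsorted(xs)) hands back the minimum of a
non-empty xs after exactly len(xs) - 1 compares, each further next()
costs O(log N) more, and list(gsorted(xs)) is a correct (if slow)
stable sort.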