On 3/13/2011 2:05 PM, Daniel Stutzbach wrote: > On Sat, Mar 12, 2011 at 3:44 PM, Guido van Rossum <guido at python.org > <mailto:guido at python.org>> wrote: > > I recently advised a Googler who was sorting a large dataset and > running out of memory. My analysis of the situation was that he was > sorting a huge list of short lines of the form "shortstring,integer" > with a key function that returned a tuple of the form ("shortstring", > integer). > > > As Raymond pointed out, a change I made for 3.2 significantly shrinks > the memory footprint of sorting with a key (although it's still more > memory-intensive than sorting with cmp). > > He could reduce the memory footprint further by sorting in two passes > instead of using a tuple, leveraging the fact that Python guarantees a > stable sort. In 3.2 or later, this technique will require roughly twice > as much memory as just storing the list: > > biglist.sort(key=lambda s: int(s.split(',')[1])) # Sort by the integer > biglist.sort(key=lambda s: s.split(',')[0]) # Sort by the shortstring > > I think the use cases are pretty narrow where there's plenty of memory > for storing the list but not enough to store two copies. Here are two variations on the idea of two pass sorting, both of which only sort ints for duplicate shortstrings and neither of which use key=. 1. If duplication of shortstrings is (known to be) sparse, sort the lines as they are, then scan to detect runs (slices) of duplicate shortstrings and sort and replace those. (If sort had start and stop index keywords, this could be done in place.) 2. If duplication of shortstrings is (known to be) heavy, read the dataset into a shortstring-list_of_ints dict rathar than a list. Sort the keys in a separate (much shorted) list and sort each value (list of ints) separately. For some post-sort usages, this would be more useful than a flat sorted list. A third idea is to presort the dataset into chunks (a-m, n-z) or however appropriate, perhaps as it is gathered; sort each separately; and then concatenate. This will handle cases where even the flat list is too big for memory. -- Terry Jan Reedy
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4