[Guido, on grouping elements of a sequence by key] > ... > Or is there a more elegant approach than my original code that I've > missed all these years? I've always done it like: d = {} for x in sequence: d.setdefault(key(x), []).append(x) # Now d has partitioned sequence by key. The keys are # available as d.keys(), the associated groups as d.values(). # So, e.g., for key, group in d.iteritems(): d[key] = sum(group) There's no code duplication, or warts for an empty sequence, which are the ugly parts of the non-dict approach. It doesn't matter here whether the elements orginally appear with equal keys all adjacent, and input often isn't sorted that way. When it isn't, not needing to sort first can be a major time savings if the sequence is big. Against it, a dict is a large data structure. I don't think it's ever been a real problem that it requires keys to be hashable. groupby() looks very nice when it applies. > ... > totals = {} > for key, group in groupby(keyfunc, sequence): > totals[key] = sum(group) Or totals = dict((key, sum(group)) for key, group in groupby(keyfunc, sequence)) exploiting generator expressions too. [after Raymond wonders about cases where the consumer doesn't iterate over the group generators ] > I don't think those semantics should be implemented. You should be > required to iterate through each group. Brrrr. Sounds error-prone (hard to explain, and impossible to enforce unless the implementation does almost all the work it would need to allow groups to get skipped -- if the implementation can detect that a group hasn't been fully iterated, then it could almost as easily go on to skip over remaining equal keys itself instead of whining about it; but if the implementation can't detect it, accidental violations of the requirement will be hard to track down). You're a security guy now. You've got a log with line records of the form month day hhmmss severity_level threat_id It's sorted ascending by month then desceding by severity_level. You want a report of the top 10 threats seen each month. for month, lines in groupby(lamdba s: s.split()[0], input_file): print month print itertools.islice(lines, 10) Like array[:10], islice() does the right thing if there are fewer than 10 lines in a month. It's just not natural to require that an iterator be run to exhaustion (if it *were* natural, this wouldn't be the first context ever to require it <wink>).
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4