> I've always done it like:
>
>     d = {}
>     for x in sequence:
>         d.setdefault(key(x), []).append(x)
>     # Now d has partitioned sequence by key.  The keys are
>     # available as d.keys(), the associated groups as d.values().
>     # So, e.g.,
>     for key, group in d.iteritems():
>         d[key] = sum(group)
>
> There's no code duplication, or warts for an empty sequence, which are the
> ugly parts of the non-dict approach.  It doesn't matter here whether the
> elements originally appear with equal keys all adjacent, and input often
> isn't sorted that way.  When it isn't, not needing to sort first can be a
> major time savings if the sequence is big.  Against it, a dict is a large
> data structure.  I don't think it's ever been a real problem that it
> requires keys to be hashable.

The major downside of this is that it keeps everything in memory.  When
that's acceptable, it's a great approach (especially because it doesn't
require sorting).  But often you really want to be able to handle input of
arbitrary size.  For example, suppose you are given a file with some kind
of records, timestamped and maintained in chronological order (e.g. a log
file -- a perfect example of data that won't fit in memory and is already
sorted).  You're supposed to format this for printing, inserting a header
at the start of each day and a footer at the end of each day with various
counts or totals per day.

> groupby() looks very nice when it applies.

Right. :-)

> > ...
> >
> >     totals = {}
> >     for key, group in groupby(keyfunc, sequence):
> >         totals[key] = sum(group)
>
> Or
>
>     totals = dict((key, sum(group))
>                   for key, group in groupby(keyfunc, sequence))
>
> exploiting generator expressions too.

Nice.  When can we get these? :-)

> [after Raymond wonders about cases where the consumer doesn't
>  iterate over the group generators
> ]
>
> > I don't think those semantics should be implemented.  You should be
> > required to iterate through each group.
>
> Brrrr.  Sounds error-prone (hard to explain, and impossible to enforce
> unless the implementation does almost all the work it would need to allow
> groups to get skipped -- if the implementation can detect that a group
> hasn't been fully iterated, then it could almost as easily go on to skip
> over remaining equal keys itself instead of whining about it; but if the
> implementation can't detect it, accidental violations of the requirement
> will be hard to track down).

I take it back after seeing Raymond's implementation -- it's simple enough
to make sure that each group is exhausted before starting the next group,
and this is clearly the "natural" semantics.

--Guido van Rossum (home page: http://www.python.org/~guido/)
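[Editor's note: the dict-partitioning idiom quoted at the top of the thread still works in modern Python, with `items()` in place of the Python 2 `iteritems()`. A minimal runnable sketch, with a made-up key function and data for illustration:]

```python
def partition(sequence, key):
    """Partition sequence into a dict of lists, keyed by key(x).

    Input order within each group is preserved; the input need not
    be sorted, but keys must be hashable.
    """
    d = {}
    for x in sequence:
        d.setdefault(key(x), []).append(x)
    return d

# Illustrative data: group integers by their tens digit.
groups = partition([1, 2, 3, 10, 11, 20], key=lambda x: x // 10)
# groups == {0: [1, 2, 3], 1: [10, 11], 2: [20]}

# Then reduce each group, as in the quoted example:
totals = {k: sum(g) for k, g in groups.items()}
# totals == {0: 6, 1: 21, 2: 20}
```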
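[Editor's note: the streaming use case Guido describes -- chronologically ordered records, a header at the start of each day and a footer with per-day counts at the end -- maps directly onto `itertools.groupby` as it was eventually released (`groupby(iterable, key=...)`, iterable first). The record format below is invented for the sketch:]

```python
from itertools import groupby

# Hypothetical timestamped records, already in chronological order,
# standing in for lines read lazily from a large log file.
records = [
    ("2003-11-30 09:00", "login"),
    ("2003-11-30 17:00", "logout"),
    ("2003-12-01 08:30", "login"),
]

lines = []
# Key on the date prefix of the timestamp; equal keys are adjacent
# because the input is sorted, so each group is one day.
for day, group in groupby(records, key=lambda r: r[0][:10]):
    lines.append(f"=== {day} ===")            # per-day header
    count = 0
    for timestamp, event in group:
        lines.append(f"{timestamp} {event}")
        count += 1
    lines.append(f"--- {count} records ---")  # per-day footer
```

Only one day's worth of state is held at a time, so input of arbitrary size is fine.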
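[Editor's note: the totals-by-key pattern from the thread, in today's syntax. The released `groupby` takes the iterable first and the key function as a keyword, and dict comprehensions now cover the `dict(...)`-over-a-genexp spelling; the sample data is invented. It also shows the "natural" semantics Guido settles on at the end -- each group is exhausted automatically when you advance to the next:]

```python
from itertools import groupby

data = [("a", 1), ("a", 2), ("b", 5)]  # already sorted by key

totals = {key: sum(v for _, v in group)
          for key, group in groupby(data, key=lambda kv: kv[0])}
# totals == {"a": 3, "b": 5}

# Skipping a group is allowed: advancing the outer iterator simply
# discards whatever is left of the previous group.
keys_only = [key for key, group in groupby(data, key=lambda kv: kv[0])]
# keys_only == ["a", "b"]
```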