Showing content from https://github.com/pandas-dev/pandas/issues/48749 below:

Don't reorder categoricals when grouping by an unordered categorical and `sort=False` · Issue #48749 · pandas-dev/pandas · GitHub

xref: dask/dask#9486 (comment)

TLDR: When calling df.groupby(key=categorical<ordered=False>, sort=False, observed=False), the resulting CategoricalIndex has both its values and its categories reordered into as-encountered order.

In [1]: from pandas import Categorical, DataFrame
   ...: df = DataFrame(
   ...:     [
   ...:         ["(7.5, 10]", 10, 10],
   ...:         ["(7.5, 10]", 8, 20],
   ...:         ["(2.5, 5]", 5, 30],
   ...:         ["(5, 7.5]", 6, 40],
   ...:         ["(2.5, 5]", 4, 50],
   ...:         ["(0, 2.5]", 1, 60],
   ...:         ["(5, 7.5]", 7, 70],
   ...:     ],
   ...:     columns=["range", "foo", "bar"],
   ...: )

In [2]: col = "range"

In [3]: df["range"] = Categorical(df["range"], ordered=False)

In [4]: df.groupby(col, sort=True, observed=False).first().index
Out[4]: CategoricalIndex(['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], categories=['(0, 2.5]', '(2.5, 5]', '(5, 7.5]', '(7.5, 10]'], ordered=False, dtype='category', name='range')

In [5]: df.groupby(col, sort=False, observed=False).first().index
Out[5]: CategoricalIndex(['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], categories=['(7.5, 10]', '(2.5, 5]', '(5, 7.5]', '(0, 2.5]'], ordered=False, dtype='category', name='range')

It's reasonable that the values are not sorted, but a lot of extra work can be spent reordering the categories in:

    # sort=False should order groups in as-encountered order (GH-8868)
    cat = c.unique()

    # See GH-38140 for block below
    # exclude nan from indexer for categories
    take_codes = cat.codes[cat.codes != -1]
    if cat.ordered:
        take_codes = np.sort(take_codes)
    cat = cat.set_categories(cat.categories.take(take_codes))

    # But for groupby to work, all categories should be present,
    # including those missing from the data (GH-13179), which .unique()
    # above dropped
    cat = cat.add_categories(c.categories[~c.categories.isin(cat.categories)])

    return c.reorder_categories(cat.categories), None

This may have been an outcome of fixing #8868, but if the as-encountered order of the values under sort=False can be achieved without reordering the categories, there would probably be a nice performance benefit.
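
To make the idea concrete, here is a rough sketch of how the as-encountered group labels might be derived from the codes alone, leaving the categories in their original order. The helper name `encounter_order_index` is hypothetical and not part of pandas; the sketch only assumes that unobserved categories can be appended after the observed ones, and it is not how groupby is actually implemented.

```python
import numpy as np
import pandas as pd


def encounter_order_index(cat: pd.Categorical) -> pd.CategoricalIndex:
    """Hypothetical helper: group labels in as-encountered order, categories untouched."""
    # Copy the codes so we work on a writable array.
    codes = np.array(cat.codes)
    # Distinct codes in order of first appearance; -1 marks missing values.
    seen = pd.unique(codes)
    seen = seen[seen != -1]
    # With observed=False every category needs a group, so append unobserved ones last.
    unseen = np.setdiff1d(np.arange(len(cat.categories)), seen)
    order = np.concatenate([seen, unseen])
    # Reusing the original dtype keeps the categories in their existing order.
    return pd.CategoricalIndex(pd.Categorical.from_codes(order, dtype=cat.dtype))


values = ["(7.5, 10]", "(7.5, 10]", "(2.5, 5]", "(5, 7.5]", "(2.5, 5]", "(0, 2.5]", "(5, 7.5]"]
cat = pd.Categorical(values, ordered=False)
idx = encounter_order_index(cat)
print(list(idx))             # group labels, as encountered
print(list(idx.categories))  # categories, unchanged from cat.categories
```

Only the integer codes are permuted here, so no call to set_categories, add_categories or reorder_categories is needed; whether that actually yields a measurable speedup inside groupby would need benchmarking.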

