For dataframes, Pandas currently only supports per-column quantiles; that is, given df[['c', 'a']].quantile(...), Pandas will compute the individual quantiles for columns c and a:
>>> df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})
>>> df[['c', 'a']].quantile([0, 0.5, 1])
       c     a
0.0  0.0   0.0
0.5  2.0   2.0
1.0  5.0  12.0
It would be nice if Pandas also supported multi-column quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would compute the quantiles for the dataframe sorted by all columns. This is currently implemented by cuDF's dataframe:
In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]:
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Describe the solution you'd like
I imagine support for multi-column quantiles could be added in one of two ways:
- adding an option to DataFrame.quantile to specify whether or not we want multi-column quantiles
- adding a new method separate from quantile
In either case, my preference here would be to have this functionality accessible via DataFrame.quantiles, to maintain consistency with cuDF.
I can't think of any breakages this would cause, as long as any direct changes to quantile ensure that the original behavior is maintained by default.
This could be accomplished by sorting the dataframe by all columns and then indexing based on manually computed quantiles, but I imagine there's a more performant way to do this.
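As a rough illustration of that workaround, here is a minimal sketch. The helper name multi_column_quantiles is hypothetical and not part of Pandas or cuDF, and it assumes that each quantile should map to the nearest existing row of the sorted frame rather than an interpolated one. That rounding rule is an assumption and may not match cuDF in every case.

```python
import numpy as np
import pandas as pd


def multi_column_quantiles(df, q):
    """Hypothetical helper: rows of `df`, sorted by all columns, at quantiles `q`."""
    if len(df) == 0:
        raise ValueError("cannot compute quantiles of an empty dataframe")
    q = np.atleast_1d(np.asarray(q, dtype=float))
    # Sort lexicographically by every column, in the column order given.
    ordered = df.sort_values(by=list(df.columns)).reset_index(drop=True)
    # Map each quantile to a row position; rounding to the nearest row
    # (instead of interpolating) is an assumption made for this sketch.
    positions = np.round(q * (len(ordered) - 1)).astype(int)
    result = ordered.iloc[positions].copy()
    result.index = q
    return result


gdf_like = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
print(multi_column_quantiles(gdf_like, [0, 0.5, 1]))
```

On the cuDF example data above, this selects the rows (1, 1), (1, 3) and (1, 5), matching the output shown there. It performs a full sort, so it is only a baseline for the more performant approach hoped for.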
Additional context
If this functionality were added, along with a multi-columnar searchsorted, it would enable Dask dataframes to compute sort_values with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.
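To make the idea of a multi-column searchsorted concrete, here is one possible shape for it, written in pure Python for clarity rather than speed. The function name multi_column_searchsorted is hypothetical; nothing here is an existing Pandas, Dask, or dask-cudf API. It assumes rows compare lexicographically as tuples and that no values are missing.

```python
import bisect

import pandas as pd


def multi_column_searchsorted(sorted_df, values, side="left"):
    """Hypothetical helper: insertion positions of the rows of `values`
    in `sorted_df`, which must already be sorted by all of its columns."""
    columns = list(sorted_df.columns)
    # Represent each row as a tuple so rows compare lexicographically.
    keys = list(sorted_df.itertuples(index=False, name=None))
    find = bisect.bisect_left if side == "left" else bisect.bisect_right
    rows = values[columns].itertuples(index=False, name=None)
    return [find(keys, row) for row in rows]


# Division rows, e.g. as produced by multi-column quantiles of columns c and a.
divisions = pd.DataFrame({"c": [0, 2, 5], "a": [1, 12, 11]})
df = pd.DataFrame({"a": [1, 0, 11, 12, 2], "c": [0, 1, 5, 2, 3]})
print(multi_column_searchsorted(divisions, df))  # [0, 1, 2, 1, 2]
```

Positions like these could then be used to decide which output partition each row belongs to when sorting by several columns. The exact partitioning rule is left open here.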