Showing content from https://github.com/pandas-dev/pandas/issues/43881 below:

Add support for multi-column quantiles of DataFrame · Issue #43881 · pandas-dev/pandas · GitHub

Is your feature request related to a problem?

For dataframes, Pandas currently only supports per-column quantiles; that is, given df[['c', 'a']].quantile(...), Pandas will compute the individual quantiles for columns c and a:

>>> df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})
>>> df[['c', 'a']].quantile([0, 0.5, 1])
       c     a
0.0  0.0   0.0
0.5  2.0   2.0
1.0  5.0  12.0

It would be nice if Pandas also supported multi-column quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would sort the dataframe by all of the selected columns and return the rows at the requested quantiles. This is currently implemented by cuDF's DataFrame.quantiles:

In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]: 
     a  b
0.0  1  1
0.5  1  3
1.0  1  5

Describe the solution you'd like

I imagine the addition of multi-column quantiles support could happen in two ways:

  1. the addition of a kwarg to DataFrame.quantile, defaulting to the current per-column behavior, to specify whether or not we want multi-column quantiles
  2. the addition of a new method to compute multi-column quantiles independent from the logic of quantile

In either case, my preference here would be to have this functionality accessible via DataFrame.quantiles, to maintain consistency with cuDF.

API breaking implications

I can't think of any breakages this would cause, as long as any direct changes to quantile ensure that the original behavior is maintained by default.

Describe alternatives you've considered

This could be accomplished by sorting the dataframe by all columns and then indexing based on manually computed quantiles, but I imagine there's a more performant way to do this.
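For reference, the sort-then-index workaround described above could look roughly like the sketch below. The helper name multi_column_quantiles and its columns parameter are hypothetical and not part of pandas or cuDF. Picking the row at floor(q * (n - 1)) is an assumption made for simplicity; how cuDF chooses a row when a quantile falls between two positions is not assumed here.

```python
import numpy as np
import pandas as pd


def multi_column_quantiles(df, q, columns=None):
    """Return the rows of `df` found at quantiles `q` after sorting by `columns`.

    Hypothetical helper for illustration only; not a pandas or cuDF API.
    """
    columns = list(df.columns) if columns is None else list(columns)
    q = np.atleast_1d(np.asarray(q, dtype=float))
    if ((q < 0) | (q > 1)).any():
        raise ValueError("quantiles must lie in [0, 1]")
    if df.empty:
        raise ValueError("cannot compute quantiles of an empty dataframe")

    # Sort lexicographically by every selected column, left to right.
    ordered = df[columns].sort_values(by=columns, kind="stable")

    # Map each quantile to a row position. Taking the floor selects an
    # existing row rather than interpolating between rows (an assumption).
    positions = np.floor(q * (len(ordered) - 1)).astype(int)

    result = ordered.iloc[positions]
    result.index = q  # label each selected row with its quantile
    return result


df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})
print(multi_column_quantiles(df, [0, 0.5, 1], columns=['c', 'a']))
```

Unlike the per-column result, each returned row here is an actual row of the dataframe, so the values of c and a stay paired. The full sort makes this O(n log n), which is the cost a dedicated implementation could presumably avoid.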

Additional context

If this functionality were added, along with a multi-columnar searchsorted, it would enable Dask dataframes to compute sort_values with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.
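To make the "multi-columnar searchsorted" part concrete, here is a minimal pure-Python sketch of what such an operation might compute. The function name multi_column_searchsorted is hypothetical, and the tuple-based bisect is chosen for clarity rather than speed. The idea that quantile rows would serve as partition dividers, with searchsorted assigning each row to a partition, is my reading of the use case and not a description of dask-cudf's actual code.

```python
import bisect

import pandas as pd


def multi_column_searchsorted(sorted_df, values_df, side="left"):
    """Find insertion positions of each row of `values_df` in `sorted_df`.

    `sorted_df` must already be sorted lexicographically by all of its
    columns. Hypothetical helper for illustration; not a pandas API.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    columns = list(sorted_df.columns)

    # Rows become plain tuples, which Python compares lexicographically.
    haystack = list(sorted_df.itertuples(index=False, name=None))
    needles = values_df[columns].itertuples(index=False, name=None)

    search = bisect.bisect_left if side == "left" else bisect.bisect_right
    return [search(haystack, row) for row in needles]


dividers = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 3, 0, 5]})
rows = pd.DataFrame({"a": [1, 2, 0, 3], "b": [2, 5, 9, 0]})
print(multi_column_searchsorted(dividers, rows))
```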

