At the moment, we have a few different methods for storing indexed higher-dimensional arrays:
For some datasets, I've found the product MultiIndex (PMI) to be the best option, together with occasional workarounds for performance bottlenecks. Operations which are slow for a general MultiIndex, like unstack() or swaplevel().sortlevel(), can be sped up for PMIs (see below).
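For concreteness, here is a minimal illustration of the distinction being drawn (assuming "PMI" means an index built with MultiIndex.from_product, i.e. one whose labels are the full Cartesian product of its levels):

```python
import pandas as pd

# A product MultiIndex: its labels are exactly the Cartesian product of its
# levels, so the index is fully described by the levels alone.
pmi = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]])

# A general MultiIndex has no such guarantee; after selecting a subset of
# rows the labels are just an arbitrary collection of tuples.
general = pmi[[0, 2, 5]]
```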
It would be great if we could do something like this more generally, with fast paths for PMIs. We could maybe have MultiIndex.from_product() return a PMI object, which would upcast to MultiIndex when necessary. We could also have stack() and unstack() create PMI objects where possible, and perhaps add an argument to concat() and set_index() to create PMIs. Slow MultiIndex operations could then have a fast path for PMI objects.
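As a rough sketch of how such a fast path could be dispatched (the helpers is_product_index and unstack_fast below are hypothetical, not existing pandas API; a real PMI type would guarantee the product structure by construction rather than checking for it):

```python
import numpy as np
import pandas as pd


def is_product_index(idx):
    """Heuristic stand-in for a PMI type: True if the labels are exactly
    the sorted Cartesian product of the levels."""
    if not isinstance(idx, pd.MultiIndex):
        return False
    n_product = np.prod([len(level) for level in idx.levels])
    return len(idx) == n_product and idx.is_unique and idx.is_monotonic_increasing


def unstack_fast(df):
    """unstack() that reshapes the underlying ndarray when the index is a
    two-level product index, falling back to the generic path otherwise."""
    if df.index.nlevels != 2 or not is_product_index(df.index):
        return df.unstack()
    outer, inner = df.index.levels
    # (outer*inner, cols) -> (outer, inner, cols) -> (outer, cols, inner)
    values = df.values.reshape(len(outer), len(inner), -1).swapaxes(1, 2)
    columns = pd.MultiIndex.from_product([df.columns, inner])
    return pd.DataFrame(values.reshape(len(outer), -1), index=outer, columns=columns)
```

The same pattern would apply to swaplevel().sortlevel(); the benchmark below measures both reshape-based paths against the generic implementations.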
```python
import numpy as np
import pandas as pd

# Set up a DataFrame whose index is a product MultiIndex with m*m rows
# and n integer columns.
m = 100
n = 1000
levels = np.arange(m)
index = pd.MultiIndex.from_product([levels] * 2)
columns = np.arange(n)
values = np.arange(m * m * n).reshape(m * m, n)
df = pd.DataFrame(values, index, columns)


# >>> timeit slow_unstack()
# 1 loop, best of 3: 363 ms per loop
def slow_unstack():
    return df.unstack()


# >>> timeit fast_unstack()
# 10 loops, best of 3: 55 ms per loop
def fast_unstack():
    # Reshape the underlying ndarray directly instead of going through the
    # generic MultiIndex machinery.
    columns = pd.MultiIndex.from_product([df.columns, levels])
    values = df.values.reshape(m, m, n).swapaxes(1, 2).reshape(m, m * n)
    return pd.DataFrame(values, levels, columns)


# >>> timeit slow_swaplevel_sortlevel()
# 1 loop, best of 3: 213 ms per loop
def slow_swaplevel_sortlevel():
    return df.swaplevel().sortlevel()


# >>> timeit fast_swaplevel_sortlevel()
# 10 loops, best of 3: 38.7 ms per loop
def fast_swaplevel_sortlevel():
    # Swapping the two index levels of a PMI is just a transpose of the
    # first two axes of the reshaped ndarray.
    values = df.values.reshape(m, m, n).swapaxes(0, 1).reshape(m * m, n)
    index = df.index.swaplevel().sortlevel()[0]
    return pd.DataFrame(values, index, df.columns)
```
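As a quick sanity check (not part of the original snippet), the reshape-based versions should return the same data as the generic ones:

```python
# Verify the fast paths against the generic MultiIndex operations.
assert slow_unstack().equals(fast_unstack())
assert slow_swaplevel_sortlevel().equals(fast_swaplevel_sortlevel())
```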