```python
import numpy as np
import pandas as pd

series = pd.DataFrame(dict(
    A=np.random.randint(0, 10000, 100000),
    B=np.random.randint(0, 10000, 100000),
    V=np.random.rand(100000))).groupby(['A', 'B']).V.sum()
filtered_series = series[series < 0.1]

%time x = filtered_series.index.remove_unused_levels()
%time y = filtered_series.reset_index().set_index(['A', 'B']).index
```

#### Problem description

On my laptop, `x` takes 20 to 40 times as long as `y`, despite `y` doing the extra work of sorting the second level and reindexing the series in the process. The outputs are identical except for the sorting of the second level. Why is `remove_unused_levels` so slow?
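Continuing from the snippet above, a quick check along these lines illustrates the equivalence (a minimal sketch; the second-level ordering is the only difference I expect):

```python
# x and y are the indexes computed above. Element-wise they are equal;
# only the internal ordering of the regenerated second level differs.
assert x.equals(y)
# The rebuilt levels contain the same values, possibly re-sorted.
assert sorted(x.levels[1]) == sorted(y.levels[1])
```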
#### Expected Output

`remove_unused_levels` should be at least as fast on large indexes as the `reset_index`/`set_index` hack.
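Until this is addressed, the hack can be wrapped in a small helper (a sketch under the same setup as above; `fast_remove_unused_levels` is a hypothetical name, not pandas API):

```python
import pandas as pd

def fast_remove_unused_levels(s: pd.Series) -> pd.MultiIndex:
    # Round-trip through a flat DataFrame: reset_index materializes only
    # the label combinations actually present, and set_index rebuilds the
    # MultiIndex from them, dropping unused level values along the way.
    return s.reset_index().set_index(list(s.index.names)).index
```

For example, `fast_remove_unused_levels(filtered_series)` yields the same index as `y` above.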
#### Output of `pd.show_versions()`

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.2-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8
pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
```