```python
import numpy as np
import pandas as pd

series = pd.DataFrame(dict(
    A=np.random.randint(0, 10000, 100000),
    B=np.random.randint(0, 10000, 100000),
    V=np.random.rand(100000))).groupby(['A', 'B']).V.sum()
filtered_series = series[series < 0.1]

%time x = filtered_series.index.remove_unused_levels()
%time y = filtered_series.reset_index().set_index(['A', 'B']).index
```

#### Problem description

On my laptop, `x` takes 20 to 40 times as long as `y`, despite `y` doing the extra work of sorting the second level and reindexing the series in the process. The outputs are identical except for the sorting of the second level. Why is `remove_unused_levels` so slow?
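Continuing from the snippet above, a quick check along these lines illustrates the equivalence (a minimal sketch; the second-level ordering is the only difference I expect):

```python
# x and y are the indexes computed above. Element-wise they are equal;
# only the internal ordering of the regenerated second level differs.
assert x.equals(y)
# The rebuilt levels contain the same values, possibly re-sorted.
assert sorted(x.levels[1]) == sorted(y.levels[1])
```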
#### Expected Output

`remove_unused_levels` should be at least as fast on large indexes as the `reset_index`/`set_index` hack.
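Until this is addressed, the hack can be wrapped in a small helper (a sketch under the same setup as above; `fast_remove_unused_levels` is a hypothetical name, not pandas API):

```python
import pandas as pd

def fast_remove_unused_levels(s: pd.Series) -> pd.MultiIndex:
    # Round-trip through a flat DataFrame: reset_index materializes only
    # the label combinations actually present, and set_index rebuilds the
    # MultiIndex from them, dropping unused level values along the way.
    return s.reset_index().set_index(list(s.index.names)).index
```

For example, `fast_remove_unused_levels(filtered_series)` yields the same index as `y` above.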
#### Output of `pd.show_versions()`

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.11.2-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8
pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
```