Showing content from https://github.com/pandas-dev/pandas/issues/13288 below:

DataFrame.describe() breaks with a column index of object type and numeric entries · Issue #13288 · pandas-dev/pandas · GitHub

While preparing a commit for another issue in .describe(), I ran into this puzzling bug, which is surprisingly easy to trigger.

Symptoms
df = pd.DataFrame({'A': list("BCDE"), 0: [1,2,3,4]})
df.describe()
# Long traceback listing formatting and internal functions...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

However:

df.describe(include='all')
               0    A
count   4.000000    4
unique       NaN    4
top          NaN    D
freq         NaN    1
mean    2.500000  NaN
std     1.290994  NaN
min     1.000000  NaN
25%     1.750000  NaN
50%     2.500000  NaN
75%     3.250000  NaN
max     4.000000  NaN

# It's OK if we don't print on screen:
x = df.describe()
x.columns
Out[8]: Index([0], dtype='int64')

# Fixing this suspicious index (int works too):
x.columns = x.columns.astype(object)
x
Out[10]: 
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

The same issue happens with a simpler data frame:

df0 = pd.DataFrame([1,2,3,4])
# It's OK now
df0.describe()
Out[28]: 
              0
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

# Modify column index:
df0.columns = pd.Index([0], dtype=object)
df0.describe()
# ...
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
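
One way to see what distinguishes the two frames is to compare the declared dtype of the column index with the type pandas infers from its entries. The sketch below is only illustrative: column_index_summary is a hypothetical helper (not part of pandas), and it relies on nothing more than Index.dtype and Index.inferred_type.

```python
import pandas as pd


def column_index_summary(df):
    """Hypothetical helper: report the declared dtype of the column index
    next to the type pandas infers from the entries it actually holds."""
    cols = df.columns
    return {"dtype": str(cols.dtype), "inferred_type": cols.inferred_type}


# Default integer column index: declared dtype and entries agree.
df0 = pd.DataFrame([1, 2, 3, 4])
before = column_index_summary(df0)

# Object-dtype column index that holds only integers: the combination
# that triggers the error reported in this issue.
df0.columns = pd.Index([0], dtype=object)
after = column_index_summary(df0)

print(before)
print(after)
```

A column index whose dtype is object while its inferred type is integer matches the failing case shown above.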

Current version (but the bug is also present in pandas release 0.18.1):

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)_i5-2520M_CPU_@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1+64.g7ed22fe.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0.dev0+3f3c371
IPython: 4.0.1
...

Reason

My guess is that some internal function gets confused by the dtype of the column index. The faulty index itself, however, is created in .describe().

# Output from %debug df.describe()
# NDFrame.describe() in pandas/core/generic.py:
#
   4943             data = self
   4944         else:
   4945             data = self.select_dtypes(include=include, exclude=exclude)
   4946 
   4947         ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
   4948         # set a convenient order for rows
   4949         names = []
   4950         ldesc_indexes = sorted([x.index for x in ldesc], key=len)
   4951         for idxnames in ldesc_indexes:
   4952             for name in idxnames:
   4953                 if name not in names:
   4954                     names.append(name)
   4955 
   4956         d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
1> 4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
   4958         d.columns.names = data.columns.names
   4959         return d

The _shallow_copy() call on the marked line (4957) changes d.columns from an Int64Index into a plain Index that still reports dtype int64:

ipdb> p d.columns
Int64Index([0], dtype='int64')
ipdb> n
> /home/users/piotr/workspace/pandas-pijucha/pandas/core/generic.py(4958)describe()
1  4957         d.columns = self.columns._shallow_copy(values=d.columns.values)
-> 4958         d.columns.names = data.columns.names
   4959         return d
ipdb> p d.columns
Index([0], dtype='int64')

Possible solutions

Lines 4957-4958 exist to work around what pd.concat does to the columns: they try to carry the column structure over from self to d.
I think a simpler solution is to replace these lines with:

d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
d.columns = data.columns
return d

or

d = pd.DataFrame(pd.concat(ldesc, axis=1), index=pd.Index(names), columns=data.columns)
return d

data is a subframe of self and retains the same column structure.

pd.concat has some parameters that help pass a hierarchical index but can't do anything on its own with a categorical one.
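
The first suggestion above can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the actual pandas code: describe_numeric is a hypothetical name, Series.describe() stands in for the internal describe_1d, data.items() replaces iteritems(), and the join_axes argument is left out because recent pandas releases no longer accept it.

```python
import pandas as pd


def describe_numeric(df):
    """Hypothetical stand-in for the numeric branch of describe().

    Summarizes each numeric column separately, concatenates the results,
    and then reuses the column index of the selected subframe unchanged.
    """
    data = df.select_dtypes(include="number")
    ldesc = [s.describe() for _, s in data.items()]
    d = pd.concat(ldesc, axis=1)
    # The proposed fix: take the columns straight from `data` instead of
    # rebuilding them from the values that pd.concat produced.
    d.columns = data.columns
    return d


df = pd.DataFrame({"A": list("BCDE"), 0: [1, 2, 3, 4]})
result = describe_numeric(df)
print(result)
print(result.columns)
```

Because data is selected from df, its column index keeps the object dtype of df.columns, so the result carries a consistent index rather than one rebuilt from raw values.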

I'm going to submit a pull request with this fix, together with some others related to describe(). I hope I haven't overlooked anything obvious; if I have, any comments are very welcome.

