For a DataFrame, I want to preserve the rows that belong to groups fulfilling a specific condition and replace all other rows with NaN. I used a combination of 'groupby' and 'filter' (with dropna=False). In the special case where no group fulfils the condition, an exception occurred:
AttributeError                            Traceback (most recent call last)
<ipython-input-11-ffb9adbc134a> in <module>()
----> 1 pd.DataFrame({'a': [1,1,2], 'b':[1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)

....../local/lib/python2.7/site-packages/pandas/core/groupby.py in filter(self, func, dropna, *args, **kwargs)
   3570                                 type(res).__name__)
   3571
-> 3572         return self._apply_filter(indices, dropna)
   3573
   3574

....../local/lib/python2.7/site-packages/pandas/core/groupby.py in _apply_filter(self, indices, dropna)
    831             mask = np.empty(len(self._selected_obj.index), dtype=bool)
    832             mask.fill(False)
--> 833             mask[indices.astype(int)] = True
    834             # mask fails to broadcast when passed to where; broadcast manually.
    835             mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T

AttributeError: 'list' object has no attribute 'astype'
The problem is in the _apply_filter method of the _GroupBy class (core/groupby.py): the line "mask[indices.astype(int)] = True" raises because in my case indices is the plain list [], which has no astype method. Shouldn't it be "indices = np.array([])" instead of "indices = []" when len(indices) == 0?
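The suggested change can be tried in isolation with plain NumPy. The following is only a sketch: build_mask is a hypothetical stand-in for the mask-building branch of _apply_filter, not pandas code, and it uses np.array([], dtype=int) rather than the bare np.array([]) proposed above (an assumption on my part, so that the empty array is already integer-typed).

```python
import numpy as np


def build_mask(n_rows, group_indices):
    """Hypothetical stand-in for the dropna=False branch of _apply_filter."""
    if len(group_indices) == 0:
        # Proposed change: an empty integer array instead of a plain list,
        # so that .astype(int) below is available.
        indices = np.array([], dtype=int)
    else:
        indices = np.sort(np.concatenate(group_indices))

    mask = np.zeros(n_rows, dtype=bool)
    # Indexing with an empty integer array selects nothing, so the mask
    # simply stays all False when no group passes the condition.
    mask[indices.astype(int)] = True
    return mask


# No group passes the filter: every row is masked out.
print(build_mask(3, []))

# Two groups pass, covering rows 2 and 0-1.
print(build_mask(3, [np.array([2]), np.array([0, 1])]))
```

With the original "indices = []", the same indexing line fails because a Python list has no astype attribute, which matches the traceback.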
def _apply_filter(self, indices, dropna):
    if len(indices) == 0:
        indices = []
    else:
        indices = np.sort(np.concatenate(indices))
    if dropna:
        filtered = self._selected_obj.take(indices, axis=self.axis)
    else:
        mask = np.empty(len(self._selected_obj.index), dtype=bool)
        mask.fill(False)
        mask[indices.astype(int)] = True
        # mask fails to broadcast when passed to where; broadcast manually.
        mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T
        filtered = self._selected_obj.where(mask)  # Fill with NaNs.
    return filtered

Code Sample, a copy-pastable example if possible
>>> import pandas as pd
>>> pd.DataFrame({'a': [1,1,2], 'b': [1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)

Expected Output
    a   b
0 NaN NaN
1 NaN NaN
2 NaN NaN

output of pd.show_versions()
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-56-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 1.5.6
setuptools: 12.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.3
sphinx: None
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.6
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
sqlalchemy: None
pymysql: 0.6.6.None
psycopg2: None
jinja2: 2.8
boto: None
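Until the empty case is handled, one possible workaround is to build the row mask with transform and apply it with DataFrame.where, which avoids filter(dropna=False) altogether. This is a sketch under assumptions: mask_failing_groups and its parameters are hypothetical names introduced for illustration, and the condition is hard-wired to "group sum greater than a threshold" as in the sample above.

```python
import pandas as pd


def mask_failing_groups(df, key, column, threshold):
    """Hypothetical helper: NaN out rows whose group sum is not above threshold."""
    # transform('sum') returns a Series aligned with df's index, holding each
    # row's group total, so the comparison yields a row-wise boolean mask.
    keep = df.groupby(key)[column].transform('sum') > threshold
    # where() keeps rows where the mask is True and fills the rest with NaN;
    # an all-False mask is fine, so the "no group passes" case does not raise.
    return df.where(keep)


df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]})

# Same condition as in the report: no group has a sum above 5.
print(mask_failing_groups(df, 'a', 'b', 5))
```

Unlike filter, this computes the condition per row rather than per group, so it only fits conditions that can be expressed through a group-wise aggregate.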