Hi guys,
Working with pandas is great, however I might have noticed a bug while grouping a 2-row DataFrame by time and by column:
>>> import pandas as pd
>>> import numpy as np
>>> from datetime import datetime
>>> freq = 's'
>>> t1 = np.datetime64(datetime.utcnow(), freq)
>>> index = pd.date_range(start=t1, periods=2, freq=freq)
# DatetimeIndex(['2015-09-24 08:55:27', '2015-09-24 08:55:28'], dtype='datetime64[ns]', freq='S', tz=None)
>>> df = pd.DataFrame([['A', 10], ['B', 15]], columns=['metric', 'values'], index=index)
#                     metric  values
# 2015-09-24 08:55:27      A      10
# 2015-09-24 08:55:28      B      15
>>> grouped = df.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
# here the grouping should output something similar to the input DataFrame,
# since each row is already its own group with regard to the parameters of the groupby function.
>>> grouped.mean()
#                                                       values
# <pandas.tseries.resample.TimeGrouper object at ...        10
# metric                                                    15
#
# notice how the index is broken: a new TimeGrouper object is the first index value,
# while the second value is the name of the column used to create the second group...
# now let's try to add another row: a new second, a new metric
>>> df_2 = pd.DataFrame([['C', 0]], columns=df.columns, index=[df.index.shift(-1, freq)[0]])
>>> df_2 = df_2.append(df)
#                     metric  values
# 2015-09-24 08:55:26      C       0
# 2015-09-24 08:55:27      A      10
# 2015-09-24 08:55:28      B      15
>>> grouped = df_2.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
>>> grouped.mean()
#                             values
#                     metric
# 2015-09-24 08:55:26 C            0
# 2015-09-24 08:55:27 A           10
# 2015-09-24 08:55:28 B           15
# works as expected with 3 rows!
# let's try with 1 row:
>>> df_2.iloc[0:1].groupby([pd.Grouper(level=0, freq=freq), 'metric']).mean()
#                             values
#                     metric
# 2015-09-24 08:55:26 C            0
# works as expected too!
I have tried grouping by key instead of by level, and aggregating at a different frequency (building the DataFrame with freq='s', then aggregating with freq='T'), but the result is the same.
Did I miss something?
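For anyone who wants to check the three variants mentioned above on their own pandas version, here is a minimal sketch. It makes a few assumptions that are not in the original session: fixed timestamps replace datetime.utcnow() so the run is reproducible, the column name "ts" and the helper index_looks_sane are hypothetical, and the frequency is spelled "min" rather than "T". It only tests whether the grouped index has the expected shape; it does not claim what any particular version returns.

```python
import pandas as pd

# Fixed timestamps (an assumption, replacing datetime.utcnow()) keep the run reproducible.
index = pd.date_range(start="2015-09-24 08:55:27", periods=2, freq="s")
df = pd.DataFrame([["A", 10], ["B", 15]], columns=["metric", "values"], index=index)

# Variant 1: group on the index level, as in the report.
by_level = df.groupby([pd.Grouper(level=0, freq="s"), "metric"]).mean()

# Variant 2: move the timestamps into a column (hypothetical name "ts") and group by key.
df_key = df.rename_axis("ts").reset_index()
by_key = df_key.groupby([pd.Grouper(key="ts", freq="s"), "metric"]).mean()

# Variant 3: aggregate at a coarser frequency (one-minute bins).
by_minute = df.groupby([pd.Grouper(level=0, freq="min"), "metric"]).mean()


def index_looks_sane(result):
    """Return True if the result has a 2-level index whose first level holds timestamps."""
    idx = result.index
    return (
        isinstance(idx, pd.MultiIndex)
        and idx.nlevels == 2
        and all(isinstance(ts, pd.Timestamp) for ts in idx.get_level_values(0))
    )


for name, result in [("level", by_level), ("key", by_key), ("minute", by_minute)]:
    print(name, index_looks_sane(result))
```

If any of the three lines prints False, the index is malformed in the way described in this report.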
Please note that using the resampling API provides the expected result, but I think the grouping API should provide consistent results:
>>> df.groupby(['metric']).resample(how='mean', freq=freq)
#                             values
# metric
# A      2015-09-24 08:55:27      10
# B      2015-09-24 08:55:28      15
Here are the dependencies I have installed with pandas (working on Ubuntu 12.04.5 LTS):
>>> from pandas.util.print_versions import show_versions
>>> show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.5.0-54-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None