These are new features and improvements of note in each release.
v0.20.3 (July 7, 2017)¶
This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes and bug fixes. We recommend that all users upgrade to this version.
Bug Fixes¶
- Fixed a bug in failing to compute rolling computations of a column-MultiIndexed DataFrame (GH16789, GH16825)
- Fixed a pickle compat issue prior to the v0.20.x series, when UTC is a timezone in a Series/DataFrame/Index (GH16608)
- Bug in Series construction when passing a Series with dtype='category' (GH16524)
- Bug in DataFrame.astype() when passing a Series as the dtype kwarg (GH16717)
- Bug in Float64Index causing an empty array instead of None to be returned from .get(np.nan) on a Series whose index did not contain any NaNs (GH8569)
- Bug in MultiIndex.isin causing an error when passing an empty iterable (GH16777)
- Fixed a bug in slicing a DataFrame/Series that has a TimedeltaIndex (GH16637)
- Bug in read_csv() in which files weren't opened as binary files by the C engine on Windows, causing EOF characters mid-field, which would fail (GH16039, GH16559, GH16675)
- Bug in read_hdf() in which reading a Series saved to an HDF file in 'fixed' format fails when an explicit mode='r' argument is supplied (GH16583)
- Bug in DataFrame.to_latex() where bold_rows was wrongly specified to be True by default, whereas in reality row labels remained non-bold whatever parameter was provided (GH16707)
- Fixed an issue with DataFrame.style() where generated element ids were not unique (GH16780)
- Fixed loading a DataFrame with a PeriodIndex, from a format='fixed' HDFStore, in Python 3, that was written in Python 2 (GH16781)
- Fixed an issue with DataFrame.plot.scatter() that incorrectly raised a KeyError when categorical data is used for plotting (GH16199)
- PeriodIndex / TimedeltaIndex.join was missing the sort= kwarg (GH16541)
- Bug in joining on a MultiIndex with a category dtype for a level (GH16627)
- Bug in .merge() when merging/joining with multiple categorical columns (GH16767)
- Bug in DataFrame.sort_values not respecting the kind parameter with categorical data (GH16793)
v0.20.2 (June 4, 2017)¶
This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.
Enhancements¶
- Series provides a to_latex method (GH16180)
- A new groupby method ngroup(), parallel to the existing cumcount(), has been added to return the group order (GH11642); see here.
Performance Improvements¶
- Improved performance of .clip() with scalar arguments (GH15400)
- Improved performance of MultiIndex.remove_unused_levels() (GH16556)
Bug Fixes¶
- Bug in using pathlib.Path or py.path.local objects with io functions (GH16291)
- Bug in Index.symmetric_difference() on two equal MultiIndex's, resulting in a TypeError (GH13490)
- Bug in DataFrame.update() with overwrite=False and NaN values (GH15593)
- Passing an invalid engine to read_csv() now raises an informative ValueError rather than UnboundLocalError (GH16511)
- Bug in unique() on an array of tuples (GH16519)
- Bug in cut() when labels are set, resulting in incorrect label ordering (GH16459)
- Fixed a compatibility issue with IPython 6.0's tab completion showing deprecation warnings on Categoricals (GH16409)
- Bug in to_numeric() in which empty data inputs were causing a segfault of the interpreter (GH16302)
- Silenced numpy warnings when broadcasting a DataFrame to a Series with comparison ops (GH16378, GH16306)
- Bug in DataFrame.reset_index(level=) with a single level index (GH16263)
- Bug in MultiIndex.remove_unused_levels() that would not return a MultiIndex equal to the original (GH16556)
- Bug in read_csv() when comment is passed in a space delimited text file (GH16472)
- Bug in read_csv() not raising an exception with nonexistent columns in usecols when it had the correct length (GH14671)
- Bug that raised an IndexError when HTML-rendering an empty DataFrame (GH15953)
- Bug in read_csv() in which tarfile object inputs were raising an error in Python 2.x for the C engine (GH16530)
- Bug where DataFrame.to_html() ignored the index_names parameter (GH16493)
- Bug where pd.read_hdf() returns numpy strings for index names (GH13492)
- Bug in HDFStore.select_as_multiple() where start/stop arguments were not respected (GH16209)
- Bug in DataFrame.plot with a single column and a list-like color (GH3486)
- Bug in plot where NaT in DatetimeIndex results in Timestamp.min (GH12405)
- Bug in DataFrame.boxplot where the figsize keyword was not respected for non-grouped boxplots (GH11959)
- Bug in creating a time-based rolling window on an empty DataFrame (GH15819)
- Bug in rolling.cov() with an offset window (GH16058)
- Bug in .resample() and .groupby() when aggregating on integers (GH16361)
- Bug in construction of SparseDataFrame from scipy.sparse.dok_matrix (GH16179)
- Bug in DataFrame.stack with unsorted levels in MultiIndex columns (GH16323)
- Bug in pd.wide_to_long() where no error was raised when i was not a unique identifier (GH16382)
- Bug in Series.isin(..) with a list of tuples (GH16394)
- Bug in construction of a DataFrame with mixed dtypes including an all-NaT column (GH16395)
- Bug in DataFrame.agg() and Series.agg() with aggregating on non-callable attributes (GH16405)
- Bug in .interpolate(), where limit_direction was not respected when limit=None (default) was passed (GH16282)
- Bug in DataFrame.drop() with an empty list with non-unique indices (GH16270)
v0.20.1 (May 5, 2017)¶
This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
- New .agg() API for Series/DataFrame similar to the groupby-rolling-resample APIs, see here
- Integration with the feather-format, including a new top-level pd.read_feather() and DataFrame.to_feather() method, see here.
- The .ix indexer has been deprecated, see here
- Panel has been deprecated, see here
- Addition of an IntervalIndex and Interval scalar type, see here
- Improved user API when grouping by index levels in .groupby(), see here
- Improved support for UInt64 dtypes, see here
- A new orient for JSON serialization, orient='table', that uses the Table Schema spec and that gives the possibility for a more interactive repr in the Jupyter Notebook, see here
- Support for exporting styled DataFrames (DataFrame.style) to Excel, see here
- Window binary corr/cov operations now return a MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here
- Support for S3 handling now uses s3fs, see here
- Google BigQuery support now uses the pandas-gbq library, see here
Warning
Pandas has changed the internal structure and layout of the codebase. This can affect imports that are not from the top-level pandas.* namespace; please see the changes here.
Check the API Changes and deprecations before updating.
Note
This is a combined release for 0.20.0 and 0.20.1. Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas' utils routines. (GH16250)
agg API for DataFrame/Series¶
Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API from groupby, window operations, and resampling. This allows aggregation operations in a concise way by using agg() and transform(). The full documentation is here (GH1623).
Here is a sample:
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'], ...: index=pd.date_range('1/1/2000', periods=10)) ...: In [2]: df.iloc[3:7] = np.nan In [3]: df Out[3]: A B C 2000-01-01 1.474071 -0.064034 -1.282782 2000-01-02 0.781836 -1.071357 0.441153 2000-01-03 2.353925 0.583787 0.221471 2000-01-04 NaN NaN NaN 2000-01-05 NaN NaN NaN 2000-01-06 NaN NaN NaN 2000-01-07 NaN NaN NaN 2000-01-08 0.901805 1.171216 0.520260 2000-01-09 -1.197071 -1.066969 -0.303421 2000-01-10 -0.858447 0.306996 -0.028665
One can operate using string function names, callables, lists, or dictionaries of these.
Using a single function is equivalent to .apply.
In [4]: df.agg('sum') Out[4]: A 3.456119 B -0.140361 C -0.431984 dtype: float64
Multiple aggregations with a list of functions.
In [5]: df.agg(['sum', 'min']) Out[5]: A B C sum 3.456119 -0.140361 -0.431984 min -1.197071 -1.071357 -1.282782
Using a dict provides the ability to apply specific aggregations per column. You will get a matrix-like output of all of the aggregators. The output has one column per unique function. Those functions that are not applied to a particular column will be NaN:
In [6]: df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']}) Out[6]: A B max NaN 1.171216 min -1.197071 -1.071357 sum 3.456119 NaN
The API also supports a .transform() function for broadcasting results.
In [7]: df.transform(['abs', lambda x: x - x.min()]) Out[7]: A B C abs <lambda> abs <lambda> abs <lambda> 2000-01-01 1.474071 2.671143 0.064034 1.007322 1.282782 0.000000 2000-01-02 0.781836 1.978907 1.071357 0.000000 0.441153 1.723935 2000-01-03 2.353925 3.550996 0.583787 1.655143 0.221471 1.504252 2000-01-04 NaN NaN NaN NaN NaN NaN 2000-01-05 NaN NaN NaN NaN NaN NaN 2000-01-06 NaN NaN NaN NaN NaN NaN 2000-01-07 NaN NaN NaN NaN NaN NaN 2000-01-08 0.901805 2.098877 1.171216 2.242573 0.520260 1.803042 2000-01-09 1.197071 0.000000 1.066969 0.004388 0.303421 0.979361 2000-01-10 0.858447 0.338624 0.306996 1.378353 0.028665 1.254117
When presented with mixed dtypes that cannot be aggregated, .agg() will only take the valid aggregations. This is similar to how groupby .agg() works. (GH15015)
In [8]: df = pd.DataFrame({'A': [1, 2, 3], ...: 'B': [1., 2., 3.], ...: 'C': ['foo', 'bar', 'baz'], ...: 'D': pd.date_range('20130101', periods=3)}) ...: In [9]: df.dtypes Out[9]: A int64 B float64 C object D datetime64[ns] dtype: object
In [10]: df.agg(['min', 'sum']) Out[10]: A B C D min 1 1.0 bar 2013-01-01 sum 6 6.0 foobarbaz NaT
dtype keyword for data IO¶
The 'python' engine for read_csv(), as well as the read_fwf() function for parsing fixed-width text files and read_excel() for parsing Excel files, now accept the dtype keyword argument for specifying the types of specific columns (GH14295). See the io docs for more information.
In [11]: data = "a b\n1 2\n3 4" In [12]: pd.read_fwf(StringIO(data)).dtypes Out[12]: a int64 b int64 dtype: object In [13]: pd.read_fwf(StringIO(data), dtype={'a':'float64', 'b':'object'}).dtypes Out[13]: a float64 b object dtype: object
.to_datetime() has gained an origin parameter¶
to_datetime() has gained a new parameter, origin, to define a reference date from where to compute the resulting timestamps when parsing numerical values with a specific unit specified. (GH11276, GH11745)
For example, with 1960-01-01 as the starting date:
In [14]: pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01')) Out[14]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
The default is set at origin='unix', which defaults to 1970-01-01 00:00:00, which is commonly called 'unix epoch' or POSIX time. This was the previous default, so this is a backward compatible change.
In [15]: pd.to_datetime([1, 2, 3], unit='D') Out[15]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
Groupby Enhancements¶
Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names. Previously, only column names could be referenced. This makes it easy to group by a column and an index level at the same time. (GH5677)
In [16]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] ....: In [17]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second']) In [18]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3], ....: 'B': np.arange(8)}, ....: index=index) ....: In [19]: df Out[19]: A B first second bar one 1 0 two 1 1 baz one 1 2 two 1 3 foo one 2 4 two 2 5 qux one 3 6 two 3 7 In [20]: df.groupby(['second', 'A']).sum() Out[20]: B second A one 1 2 2 4 3 6 two 1 4 2 5 3 7
Better support for compressed URLs in read_csv¶
The compression code was refactored (GH12688). As a result, reading dataframes from URLs in read_csv() or read_table() now supports additional compression methods: xz, bz2, and zip (GH14570). Previously, only gzip compression was supported. By default, compression of URLs and paths is now inferred using their file extensions. Additionally, support for bz2 compression in the Python 2 C-engine was improved (GH14874).
In [21]: url = 'https://github.com/{repo}/raw/{branch}/{path}'.format( ....: repo = 'pandas-dev/pandas', ....: branch = 'master', ....: path = 'pandas/tests/io/parser/data/salaries.csv.bz2', ....: ) ....: In [22]: df = pd.read_table(url, compression='infer') # default, infer compression In [23]: df = pd.read_table(url, compression='bz2') # explicitly specify compression In [24]: df.head(2) Out[24]: S X E M 0 13876 1 1 1 1 11608 1 3 0
Pickle file I/O now supports compression¶
read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can now read from and write to compressed pickle files. Compression methods can be an explicit parameter or be inferred from the file extension. See the docs here.
In [25]: df = pd.DataFrame({ ....: 'A': np.random.randn(1000), ....: 'B': 'foo', ....: 'C': pd.date_range('20130101', periods=1000, freq='s')}) ....:
Using an explicit compression type:
In [26]: df.to_pickle("data.pkl.compress", compression="gzip") In [27]: rt = pd.read_pickle("data.pkl.compress", compression="gzip") In [28]: rt.head() Out[28]: A B C 0 0.384316 foo 2013-01-01 00:00:00 1 1.574159 foo 2013-01-01 00:00:01 2 1.588931 foo 2013-01-01 00:00:02 3 0.476720 foo 2013-01-01 00:00:03 4 0.473424 foo 2013-01-01 00:00:04
The default is to infer the compression type from the extension (compression='infer'):
In [29]: df.to_pickle("data.pkl.gz") In [30]: rt = pd.read_pickle("data.pkl.gz") In [31]: rt.head() Out[31]: A B C 0 0.384316 foo 2013-01-01 00:00:00 1 1.574159 foo 2013-01-01 00:00:01 2 1.588931 foo 2013-01-01 00:00:02 3 0.476720 foo 2013-01-01 00:00:03 4 0.473424 foo 2013-01-01 00:00:04 In [32]: df["A"].to_pickle("s1.pkl.bz2") In [33]: rt = pd.read_pickle("s1.pkl.bz2") In [34]: rt.head() Out[34]: 0 0.384316 1 1.574159 2 1.588931 3 0.476720 4 0.473424 Name: A, dtype: float64
UInt64 Support Improved¶
Pandas has significantly improved support for operations involving unsigned, or purely non-negative, integers. Previously, handling these integers would result in improper rounding or data-type casting, leading to incorrect results. Notably, a new numerical index, UInt64Index, has been created (GH14937).
In [35]: idx = pd.UInt64Index([1, 2, 3]) In [36]: df = pd.DataFrame({'A': ['a', 'b', 'c']}, index=idx) In [37]: df.index Out[37]: UInt64Index([1, 2, 3], dtype='uint64')
- Bug in Series.unique() in which unsigned 64-bit integers were causing overflow (GH14721)
- Bug in DataFrame construction in which unsigned 64-bit integer elements were being converted to objects (GH14881)
- Bug in pd.read_csv() in which unsigned 64-bit integer elements were being improperly converted to the wrong data types (GH14983)
- Bug in pd.unique() in which unsigned 64-bit integers were causing overflow (GH14915)
- Bug in pd.value_counts() in which unsigned 64-bit integers were being erroneously truncated in the output (GH14934)
GroupBy on Categoricals¶
In previous versions, .groupby(..., sort=False) would fail with a ValueError when grouping on a categorical series with some categories not appearing in the data. (GH13179)
In [38]: chromosomes = np.r_[np.arange(1, 23).astype(str), ['X', 'Y']] In [39]: df = pd.DataFrame({ ....: 'A': np.random.randint(100), ....: 'B': np.random.randint(100), ....: 'C': np.random.randint(100), ....: 'chromosomes': pd.Categorical(np.random.choice(chromosomes, 100), ....: categories=chromosomes, ....: ordered=True)}) ....: In [40]: df Out[40]: A B C chromosomes 0 21 62 10 17 1 21 62 10 Y 2 21 62 10 13 3 21 62 10 8 4 21 62 10 22 5 21 62 10 3 6 21 62 10 19 .. .. .. .. ... 93 21 62 10 17 94 21 62 10 Y 95 21 62 10 Y 96 21 62 10 22 97 21 62 10 5 98 21 62 10 20 99 21 62 10 X [100 rows x 4 columns]
Previous Behavior:
In [3]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum() --------------------------------------------------------------------------- ValueError: items in new_categories are not the same as in old categories
New Behavior:
In [41]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum() Out[41]: A B C chromosomes 2 42.0 124.0 20.0 3 105.0 310.0 50.0 4 63.0 186.0 30.0 5 84.0 248.0 40.0 6 84.0 248.0 40.0 7 63.0 186.0 30.0 8 189.0 558.0 90.0 ... ... ... ... 20 126.0 372.0 60.0 21 42.0 124.0 20.0 22 84.0 248.0 40.0 X 63.0 186.0 30.0 Y 126.0 372.0 60.0 1 NaN NaN NaN 12 NaN NaN NaN [24 rows x 3 columns]
Table Schema Output¶
The new orient 'table' for DataFrame.to_json() will generate a Table Schema compatible string representation of the data.
In [42]: df = pd.DataFrame( ....: {'A': [1, 2, 3], ....: 'B': ['a', 'b', 'c'], ....: 'C': pd.date_range('2016-01-01', freq='d', periods=3), ....: }, index=pd.Index(range(3), name='idx')) ....: In [43]: df Out[43]: A B C idx 0 1 a 2016-01-01 1 2 b 2016-01-02 2 3 c 2016-01-03 In [44]: df.to_json(orient='table') Out[44]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"}, "data": [{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
See IO: Table Schema for more information.
Additionally, the repr for DataFrame and Series can now publish this JSON Table schema representation of the Series or DataFrame if you are using IPython (or another frontend like nteract using the Jupyter messaging protocol). This gives frontends like the Jupyter notebook and nteract more flexibility in how they display pandas objects, since they have more information about the data. You must enable this by setting the display.html.table_schema option to True.
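A minimal sketch of enabling this, using the standard pandas option API and the option name given above:

import pandas as pd
# once enabled, frontends that understand the Jupyter messaging protocol
# receive a Table Schema payload alongside the usual repr
pd.set_option('display.html.table_schema', True)
df = pd.DataFrame({'A': [1, 2, 3]})
df  # in IPython/Jupyter, displaying df now also publishes the schema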
SciPy sparse matrix from/to SparseDataFrame¶
Pandas now supports creating sparse dataframes directly from scipy.sparse.spmatrix instances. See the documentation for more information. (GH4343)
All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed.
In [45]: from scipy.sparse import csr_matrix In [46]: arr = np.random.random(size=(1000, 5)) In [47]: arr[arr < .9] = 0 In [48]: sp_arr = csr_matrix(arr) In [49]: sp_arr Out[49]: <1000x5 sparse matrix of type '<class 'numpy.float64'>' with 500 stored elements in Compressed Sparse Row format> In [50]: sdf = pd.SparseDataFrame(sp_arr) In [51]: sdf Out[51]: 0 1 2 3 4 0 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN 3 NaN NaN NaN NaN 0.997522 4 NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN 0.911034 6 NaN NaN NaN NaN NaN .. ... .. .. .. ... 993 0.925879 NaN NaN NaN NaN 994 NaN NaN NaN NaN 0.955585 995 NaN NaN NaN NaN NaN 996 NaN NaN NaN NaN NaN 997 NaN NaN NaN NaN NaN 998 NaN NaN NaN NaN 0.904855 999 NaN NaN NaN NaN NaN [1000 rows x 5 columns]
To convert a SparseDataFrame back to a sparse SciPy matrix in COO format, you can use:
In [52]: sdf.to_coo() Out[52]: <1000x5 sparse matrix of type '<class 'numpy.float64'>' with 500 stored elements in COOrdinate format>
Excel output for styled DataFrames¶
Experimental support has been added to export DataFrame.style formats to Excel using the openpyxl engine. (GH15530)
For example, after running the following, styled.xlsx renders as below:
In [53]: np.random.seed(24) In [54]: df = pd.DataFrame({'A': np.linspace(1, 10, 10)}) In [55]: df = pd.concat([df, pd.DataFrame(np.random.RandomState(24).randn(10, 4), ....: columns=list('BCDE'))], ....: axis=1) ....: In [56]: df.iloc[0, 2] = np.nan In [57]: df Out[57]: A B C D E 0 1.0 1.329212 NaN -0.316280 -0.990810 1 2.0 -1.070816 -1.438713 0.564417 0.295722 2 3.0 -1.626404 0.219565 0.678805 1.889273 3 4.0 0.961538 0.104011 -0.481165 0.850229 4 5.0 1.453425 1.057737 0.165562 0.515018 5 6.0 -1.336936 0.562861 1.392855 -0.063328 6 7.0 0.121668 1.207603 -0.002040 1.627796 7 8.0 0.354493 1.037528 -0.385684 0.519818 8 9.0 1.686583 -1.325963 1.428984 -2.089354 9 10.0 -0.129820 0.631523 -0.586538 0.290720 In [58]: styled = df.style.\ ....: applymap(lambda val: 'color: %s' % 'red' if val < 0 else 'black').\ ....: highlight_max() ....: In [59]: styled.to_excel('styled.xlsx', engine='openpyxl')
See the Style documentation for more detail.
IntervalIndex¶
pandas has gained an IntervalIndex with its own dtype, interval, as well as the Interval scalar type. These allow first-class support for interval notation, specifically as a return type for the categories in cut() and qcut(). The IntervalIndex allows some unique indexing, see the docs. (GH7640, GH8625)
Warning
These indexing behaviors of the IntervalIndex are provisional and may change in a future version of pandas. Feedback on usage is welcome.
Previous Behavior:
The returned categories were strings, representing Intervals:
In [1]: c = pd.cut(range(4), bins=2) In [2]: c Out[2]: [(-0.003, 1.5], (-0.003, 1.5], (1.5, 3], (1.5, 3]] Categories (2, object): [(-0.003, 1.5] < (1.5, 3]] In [3]: c.categories Out[3]: Index(['(-0.003, 1.5]', '(1.5, 3]'], dtype='object')
New Behavior:
In [60]: c = pd.cut(range(4), bins=2) In [61]: c Out[61]: [(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]] Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]] In [62]: c.categories Out[62]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]] closed='right', dtype='interval[float64]')
Furthermore, this allows one to bin other data with these same bins, with NaN representing a missing value similar to other dtypes.
In [63]: pd.cut([0, 3, 5, 1], bins=c.categories) Out[63]: [(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]] Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
An IntervalIndex can also be used in Series and DataFrame as the index.
In [64]: df = pd.DataFrame({'A': range(4), ....: 'B': pd.cut([0, 3, 1, 1], bins=c.categories)} ....: ).set_index('B') ....: In [65]: df Out[65]: A B (-0.003, 1.5] 0 (1.5, 3.0] 1 (-0.003, 1.5] 2 (-0.003, 1.5] 3
Selecting via a specific interval:
In [66]: df.loc[pd.Interval(1.5, 3.0)] Out[66]: A 1 Name: (1.5, 3.0], dtype: int64
Selecting via a scalar value that is contained in the intervals.
In [67]: df.loc[0] Out[67]: A B (-0.003, 1.5] 0 (-0.003, 1.5] 2 (-0.003, 1.5] 3
Other Enhancements¶
- DataFrame.rolling() now accepts the parameter closed='right'|'left'|'both'|'neither' to choose the rolling window-endpoint closedness; see the documentation (GH13965). A short sketch follows this list.
- Integration with the feather-format, including a new top-level pd.read_feather() and DataFrame.to_feather() method, see here.
- Series.str.replace() now accepts a callable, as replacement, which is passed to re.sub (GH15055)
- Series.str.replace() now accepts a compiled regular expression as a pattern (GH15446)
- Series.sort_index accepts parameters kind and na_position (GH13589, GH14444)
- DataFrame and DataFrame.groupby() have gained a nunique() method to count the distinct values over an axis (GH14336, GH15197).
- DataFrame has gained a melt() method, equivalent to pd.melt(), for unpivoting from a wide to long format (GH12640).
- pd.read_excel() now preserves sheet order when using sheetname=None (GH9930)
- Multiple offset aliases with decimal points are now supported (e.g. 0.5min is parsed as 30s) (GH8419)
- .isnull() and .notnull() have been added to the Index object to make them more consistent with the Series API (GH15300)
- New UnsortedIndexError (subclass of KeyError) raised when indexing/slicing into an unsorted MultiIndex (GH11897). This allows differentiation between errors due to lack of sorting or an incorrect key. See here
- MultiIndex has gained a .to_frame() method to convert to a DataFrame (GH12397)
- pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
- pd.qcut has gained the duplicates='raise'|'drop' option to control whether to raise on duplicated edges (GH7751)
- Series provides a to_excel method to output Excel files (GH8825)
- The usecols argument in pd.read_csv() now accepts a callable function as a value (GH14154)
- The skiprows argument in pd.read_csv() now accepts a callable function as a value (GH10882)
- The nrows and chunksize arguments in pd.read_csv() are supported if both are passed (GH6774, GH15755)
- DataFrame.plot now prints a title above each subplot if subplots=True and title is a list of strings (GH14753)
- DataFrame.plot can pass the matplotlib 2.0 default color cycle as a single string as color parameter, see here. (GH15516)
- Series.interpolate() now supports timedelta as an index type with method='time' (GH6424)
- Addition of a level keyword to DataFrame/Series.rename to rename labels in the specified level of a MultiIndex (GH4160).
- DataFrame.reset_index() will now interpret a tuple index.name as a key spanning across levels of columns, if this is a MultiIndex (GH16164)
- Timedelta.isoformat method added for formatting Timedeltas as an ISO 8601 duration. See the Timedelta docs (GH15136)
- .select_dtypes() now allows the string datetimetz to generically select datetimes with tz (GH14910)
- The .to_latex() method will now accept multicolumn and multirow arguments to use the accompanying LaTeX enhancements
- pd.merge_asof() gained the option direction='backward'|'forward'|'nearest' (GH14887)
- Series/DataFrame.asfreq() have gained a fill_value parameter, to fill missing values (GH3715).
- Series/DataFrame.resample.asfreq have gained a fill_value parameter, to fill missing values during resampling (GH3715).
- pandas.util.hash_pandas_object() has gained the ability to hash a MultiIndex (GH15224)
- Series/DataFrame.squeeze() have gained the axis parameter. (GH15339)
- DataFrame.to_excel() has a new freeze_panes parameter to turn on Freeze Panes when exporting to Excel (GH15160)
- pd.read_html() will parse multiple header rows, creating a MultiIndex header. (GH13434).
- HTML table output skips the colspan or rowspan attribute if equal to 1. (GH15403)
- The pandas.io.formats.style.Styler template now has blocks for easier extension, see the example notebook (GH15649)
- Styler.render() now accepts **kwargs to allow user-defined variables in the template (GH15649)
- TimedeltaIndex now has a custom date-tick formatter specifically designed for nanosecond level precision (GH8711)
- pd.api.types.union_categoricals gained the ignore_ordered argument to allow ignoring the ordered attribute of unioned categoricals (GH13410). See the categorical union docs for more information.
- DataFrame.to_latex() and DataFrame.to_string() now allow optional header aliases. (GH15536)
- Re-enabled the parse_dates keyword of pd.read_excel() to parse string columns as dates (GH14326)
- Added the .empty property to subclasses of Index. (GH15270)
- Enabled floor division for Timedelta and TimedeltaIndex (GH15828)
- pandas.io.json.json_normalize() gained the option errors='ignore'|'raise'; the default is errors='raise' which is backward compatible. (GH14583)
- pandas.io.json.json_normalize() with an empty list will return an empty DataFrame (GH15534)
- pandas.io.json.json_normalize() has gained a sep option that accepts str to separate joined fields; the default is ".", which is backward compatible. (GH14883)
- MultiIndex.remove_unused_levels() has been added to facilitate removing unused levels. (GH15694)
- pd.read_csv() will now raise a ParserError error whenever any parsing error occurs (GH15913, GH15925)
- pd.read_csv() now supports the error_bad_lines and warn_bad_lines arguments for the Python parser (GH15925)
- The display.show_dimensions option can now also be used to specify whether the length of a Series should be shown in its repr (GH7117).
- parallel_coordinates() has gained a sort_labels keyword argument that sorts class labels and the colors assigned to them (GH15908)
- Options have been added to allow turning use of bottleneck and numexpr on or off, see here (GH16157)
- DataFrame.style.bar() now accepts two more options to further customize the bar chart. Bar alignment is set with align='left'|'mid'|'zero', the default is "left", which is backward compatible; you can now pass a list of color=[color_negative, color_positive]. (GH14757)
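As referenced in the list above, a minimal sketch of two of these enhancements: the closed keyword of .rolling(), which requires a time-based window, and a callable replacement in Series.str.replace(). The data is arbitrary:

import pandas as pd
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]},
                  index=pd.date_range('2017-01-01', periods=4, freq='s'))
# exclude the right endpoint of each 2-second window
left_closed = df.rolling('2s', closed='left').sum()
# the callable receives each match object, as with re.sub
s = pd.Series(['foo 123', 'bar 45'])
swapped = s.str.replace(r'(\d+)', lambda m: m.group(1)[::-1])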
Backwards incompatible API changes¶
Possible incompatibility for HDF5 formats created with pandas < 0.13.0¶
pd.TimeSeries was deprecated officially in 0.17.0, though it has been an alias since 0.13.0. It has been dropped in favor of pd.Series. (GH15098)
This may cause HDF5 files that were created in prior versions to become unreadable if pd.TimeSeries was used. This is most likely to be for pandas < 0.13.0. If you find yourself in this situation, you can use a recent prior version of pandas to read in your HDF5 files, then write them out again after applying the procedure below.
In [2]: s = pd.TimeSeries([1,2,3], index=pd.date_range('20130101', periods=3)) In [3]: s Out[3]: 2013-01-01 1 2013-01-02 2 2013-01-03 3 Freq: D, dtype: int64 In [4]: type(s) Out[4]: pandas.core.series.TimeSeries In [5]: s = pd.Series(s) In [6]: s Out[6]: 2013-01-01 1 2013-01-02 2 2013-01-03 3 Freq: D, dtype: int64 In [7]: type(s) Out[7]: pandas.core.series.Series
Map on Index types now return other Index types¶
map on an Index now returns an Index, not a numpy array (GH12766)
In [68]: idx = Index([1, 2]) In [69]: idx Out[69]: Int64Index([1, 2], dtype='int64') In [70]: mi = MultiIndex.from_tuples([(1, 2), (2, 4)]) In [71]: mi Out[71]: MultiIndex(levels=[[1, 2], [2, 4]], labels=[[0, 1], [0, 1]])
Previous Behavior:
In [5]: idx.map(lambda x: x * 2) Out[5]: array([2, 4]) In [6]: idx.map(lambda x: (x, x * 2)) Out[6]: array([(1, 2), (2, 4)], dtype=object) In [7]: mi.map(lambda x: x) Out[7]: array([(1, 2), (2, 4)], dtype=object) In [8]: mi.map(lambda x: x[0]) Out[8]: array([1, 2])
New Behavior:
In [72]: idx.map(lambda x: x * 2) Out[72]: Int64Index([2, 4], dtype='int64') In [73]: idx.map(lambda x: (x, x * 2)) Out[73]: MultiIndex(levels=[[1, 2], [2, 4]], labels=[[0, 1], [0, 1]]) In [74]: mi.map(lambda x: x) Out[74]: MultiIndex(levels=[[1, 2], [2, 4]], labels=[[0, 1], [0, 1]]) In [75]: mi.map(lambda x: x[0]) Out[75]: Int64Index([1, 2], dtype='int64')
map on a Series with datetime64 values may return int64 dtypes rather than int32:
In [76]: s = Series(date_range('2011-01-02T00:00', '2011-01-02T02:00', freq='H').tz_localize('Asia/Tokyo')) In [77]: s Out[77]: 0 2011-01-02 00:00:00+09:00 1 2011-01-02 01:00:00+09:00 2 2011-01-02 02:00:00+09:00 dtype: datetime64[ns, Asia/Tokyo]
Previous Behavior:
In [9]: s.map(lambda x: x.hour) Out[9]: 0 0 1 1 2 2 dtype: int32
New Behavior:
In [78]: s.map(lambda x: x.hour) Out[78]: 0 0 1 1 2 2 dtype: int64
Accessing datetime fields of Index now return Index¶
The datetime-related attributes (see here for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously returned numpy arrays. They will now return a new Index object, except in the case of a boolean field, where the result will still be a boolean ndarray. (GH15022)
Previous Behavior:
In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H') In [2]: idx.hour Out[2]: array([ 0, 10, 20, 6, 16], dtype=int32)
New Behavior:
In [79]: idx = pd.date_range("2015-01-01", periods=5, freq='10H') In [80]: idx.hour Out[80]: Int64Index([0, 10, 20, 6, 16], dtype='int64')
This has the advantage that specific Index methods are still available on the result. On the other hand, this might have backward incompatibilities: e.g. compared to numpy arrays, Index objects are not mutable. To get the original ndarray, you can always convert explicitly using np.asarray(idx.hour).
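A minimal sketch of that conversion, using the np.asarray call mentioned above:

import numpy as np
import pandas as pd
idx = pd.date_range("2015-01-01", periods=5, freq='10H')
hours_index = idx.hour               # now an Int64Index
hours_array = np.asarray(idx.hour)   # the original ndarray behavior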
pd.unique will now be consistent with extension types¶
In prior versions, using Series.unique() and pandas.unique() on Categorical and tz-aware data-types would yield different return types. These are now made consistent. (GH15903)
Datetime tz-aware
Previous Behavior:
# Series In [5]: pd.Series([pd.Timestamp('20160101', tz='US/Eastern'), pd.Timestamp('20160101', tz='US/Eastern')]).unique() Out[5]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object) In [6]: pd.unique(pd.Series([pd.Timestamp('20160101', tz='US/Eastern'), pd.Timestamp('20160101', tz='US/Eastern')])) Out[6]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]') # Index In [7]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'), pd.Timestamp('20160101', tz='US/Eastern')]).unique() Out[7]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None) In [8]: pd.unique([pd.Timestamp('20160101', tz='US/Eastern'), pd.Timestamp('20160101', tz='US/Eastern')]) Out[8]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')
New Behavior:
# Series, returns an array of Timestamp tz-aware In [81]: pd.Series([pd.Timestamp('20160101', tz='US/Eastern'), ....: pd.Timestamp('20160101', tz='US/Eastern')]).unique() ....: Out[81]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object) In [82]: pd.unique(pd.Series([pd.Timestamp('20160101', tz='US/Eastern'), ....: pd.Timestamp('20160101', tz='US/Eastern')])) ....: Out[82]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object) # Index, returns a DatetimeIndex In [83]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'), ....: pd.Timestamp('20160101', tz='US/Eastern')]).unique() ....: Out[83]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None) In [84]: pd.unique(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'), ....: pd.Timestamp('20160101', tz='US/Eastern')])) ....: Out[84]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
Categoricals
Previous Behavior:
In [1]: pd.Series(list('baabc'), dtype='category').unique() Out[1]: [b, a, c] Categories (3, object): [b, a, c] In [2]: pd.unique(pd.Series(list('baabc'), dtype='category')) Out[2]: array(['b', 'a', 'c'], dtype=object)
New Behavior:
# returns a Categorical In [85]: pd.Series(list('baabc'), dtype='category').unique() Out[85]: [b, a, c] Categories (3, object): [b, a, c] In [86]: pd.unique(pd.Series(list('baabc'), dtype='category')) Out[86]: [b, a, c] Categories (3, object): [b, a, c]
S3 File Handling¶
pandas now uses s3fs for handling S3 connections. This shouldn't break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. (GH11915)
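A minimal sketch, assuming s3fs is installed; the bucket and key here are hypothetical placeholders:

import pandas as pd
# io functions accept s3:// URLs directly once s3fs is available
df = pd.read_csv('s3://my-bucket/path/to/data.csv')  # hypothetical bucket/key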
Partial String Indexing Changes¶
DatetimeIndex Partial String Indexing now works as an exact match, provided that string resolution coincides with index resolution, including a case when both are seconds (GH14826). See Slice vs. Exact Match for details.
In [87]: df = DataFrame({'a': [1, 2, 3]}, DatetimeIndex(['2011-12-31 23:59:59', ....: '2012-01-01 00:00:00', ....: '2012-01-01 00:00:01'])) ....:
Previous Behavior:
In [4]: df['2011-12-31 23:59:59'] Out[4]: a 2011-12-31 23:59:59 1 In [5]: df['a']['2011-12-31 23:59:59'] Out[5]: 2011-12-31 23:59:59 1 Name: a, dtype: int64
New Behavior:
In [4]: df['2011-12-31 23:59:59'] KeyError: '2011-12-31 23:59:59' In [5]: df['a']['2011-12-31 23:59:59'] Out[5]: 1
Concat of different float dtypes will not automatically upcast¶
Previously, concat of multiple objects with different float dtypes would automatically upcast results to a dtype of float64. Now the smallest acceptable dtype will be used (GH13247)
In [88]: df1 = pd.DataFrame(np.array([1.0], dtype=np.float32, ndmin=2)) In [89]: df1.dtypes Out[89]: 0 float32 dtype: object In [90]: df2 = pd.DataFrame(np.array([np.nan], dtype=np.float32, ndmin=2)) In [91]: df2.dtypes Out[91]: 0 float32 dtype: object
Previous Behavior:
In [7]: pd.concat([df1, df2]).dtypes Out[7]: 0 float64 dtype: object
New Behavior:
In [92]: pd.concat([df1, df2]).dtypes Out[92]: 0 float32 dtype: object
Pandas Google BigQuery support has moved¶
pandas has split off Google BigQuery support into a separate package, pandas-gbq. You can conda install pandas-gbq -c conda-forge or pip install pandas-gbq to get it. The functionality of read_gbq() and DataFrame.to_gbq() remains the same with the currently released version of pandas-gbq=0.1.4. Documentation is now hosted here. (GH15347)
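A minimal sketch, assuming pandas-gbq is installed; the query and project id are hypothetical placeholders:

import pandas as pd
# read_gbq() now delegates to the separate pandas-gbq package
df = pd.read_gbq('SELECT name FROM my_dataset.my_table',  # hypothetical query
                 project_id='my-project')                 # hypothetical project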
Memory Usage for Index is more Accurate¶
In previous versions, showing .memory_usage() on a pandas structure that has an index would only include actual index values and not include structures that facilitated fast indexing. This will generally be different for Index and MultiIndex and less-so for other index types. (GH15237)
Previous Behavior:
In [8]: index = Index(['foo', 'bar', 'baz']) In [9]: index.memory_usage(deep=True) Out[9]: 180 In [10]: index.get_loc('foo') Out[10]: 0 In [11]: index.memory_usage(deep=True) Out[11]: 180
New Behavior:
In [8]: index = Index(['foo', 'bar', 'baz']) In [9]: index.memory_usage(deep=True) Out[9]: 180 In [10]: index.get_loc('foo') Out[10]: 0 In [11]: index.memory_usage(deep=True) Out[11]: 260
DataFrame.sort_index changes¶
In certain cases, calling .sort_index() on a MultiIndexed DataFrame would return the same DataFrame without seeming to sort. This would happen with lexsorted, but non-monotonic, levels. (GH15622, GH15687, GH14015, GH13431, GH15797)
This is unchanged from prior versions, but shown for illustration purposes:
In [93]: df = DataFrame(np.arange(6), columns=['value'], index=MultiIndex.from_product([list('BA'), range(3)])) In [94]: df Out[94]: value B 0 0 1 1 2 2 A 0 3 1 4 2 5
In [95]: df.index.is_lexsorted() Out[95]: False In [96]: df.index.is_monotonic Out[96]: False
Sorting works as expected
In [97]: df.sort_index() Out[97]: value A 0 3 1 4 2 5 B 0 0 1 1 2 2
In [98]: df.sort_index().index.is_lexsorted() Out[98]: True In [99]: df.sort_index().index.is_monotonic Out[99]: True
However, this example, which has a non-monotonic 2nd level, doesn’t behave as desired.
In [100]: df = pd.DataFrame( .....: {'value': [1, 2, 3, 4]}, .....: index=pd.MultiIndex(levels=[['a', 'b'], ['bb', 'aa']], .....: labels=[[0, 0, 1, 1], [0, 1, 0, 1]])) .....: In [101]: df Out[101]: value a bb 1 aa 2 b bb 3 aa 4
Previous Behavior:
In [11]: df.sort_index() Out[11]: value a bb 1 aa 2 b bb 3 aa 4 In [14]: df.sort_index().index.is_lexsorted() Out[14]: True In [15]: df.sort_index().index.is_monotonic Out[15]: False
New Behavior:
In [102]: df.sort_index() Out[102]: value a aa 2 bb 1 b aa 4 bb 3 In [103]: df.sort_index().index.is_lexsorted() Out[103]: True In [104]: df.sort_index().index.is_monotonic Out[104]: True
Groupby Describe Formatting¶
The output formatting of groupby.describe() now labels the describe() metrics in the columns instead of the index. This format is consistent with groupby.agg() when applying multiple functions at once. (GH4792)
Previous Behavior:
In [1]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]}) In [2]: df.groupby('A').describe() Out[2]: B A 1 count 2.000000 mean 1.500000 std 0.707107 min 1.000000 25% 1.250000 50% 1.500000 75% 1.750000 max 2.000000 2 count 2.000000 mean 3.500000 std 0.707107 min 3.000000 25% 3.250000 50% 3.500000 75% 3.750000 max 4.000000 In [3]: df.groupby('A').agg([np.mean, np.std, np.min, np.max]) Out[3]: B mean std amin amax A 1 1.5 0.707107 1 2 2 3.5 0.707107 3 4
New Behavior:
In [105]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]}) In [106]: df.groupby('A').describe() Out[106]: B count mean std min 25% 50% 75% max A 1 2.0 1.5 0.707107 1.0 1.25 1.5 1.75 2.0 2 2.0 3.5 0.707107 3.0 3.25 3.5 3.75 4.0 In [107]: df.groupby('A').agg([np.mean, np.std, np.min, np.max]) Out[107]: B mean std amin amax A 1 1.5 0.707107 1 2 2 3.5 0.707107 3 4
Window Binary Corr/Cov operations return a MultiIndex DataFrame¶
A binary window operation, like .corr() or .cov(), when operating on a .rolling(..), .expanding(..), or .ewm(..) object, will now return a 2-level MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here. These are equivalent in function, but a MultiIndexed DataFrame enjoys more support in pandas. See the section on Windowed Binary Operations for more information. (GH15677)
In [108]: np.random.seed(1234) In [109]: df = pd.DataFrame(np.random.rand(100, 2), .....: columns=pd.Index(['A', 'B'], name='bar'), .....: index=pd.date_range('20160101', .....: periods=100, freq='D', name='foo')) .....: In [110]: df.tail() Out[110]: bar A B foo 2016-04-05 0.640880 0.126205 2016-04-06 0.171465 0.737086 2016-04-07 0.127029 0.369650 2016-04-08 0.604334 0.103104 2016-04-09 0.802374 0.945553
Previous Behavior:
In [2]: df.rolling(12).corr() Out[2]: <class 'pandas.core.panel.Panel'> Dimensions: 100 (items) x 2 (major_axis) x 2 (minor_axis) Items axis: 2016-01-01 00:00:00 to 2016-04-09 00:00:00 Major_axis axis: A to B Minor_axis axis: A to B
New Behavior:
In [111]: res = df.rolling(12).corr() In [112]: res.tail() Out[112]: bar A B foo bar 2016-04-07 B -0.132090 1.000000 2016-04-08 A 1.000000 -0.145775 B -0.145775 1.000000 2016-04-09 A 1.000000 0.119645 B 0.119645 1.000000
Retrieving a correlation matrix for a cross-section
In [113]: df.rolling(12).corr().loc['2016-04-07'] Out[113]: bar A B foo bar 2016-04-07 A 1.00000 -0.13209 B -0.13209 1.00000
HDFStore where string comparison¶
In previous versions, most types could be compared to a string column in an HDFStore, usually resulting in an invalid comparison that returned an empty result frame. These comparisons will now raise a TypeError (GH15492)
In [114]: df = pd.DataFrame({'unparsed_date': ['2014-01-01', '2014-01-01']}) In [115]: df.to_hdf('store.h5', 'key', format='table', data_columns=True) In [116]: df.dtypes Out[116]: unparsed_date object dtype: object
Previous Behavior:
In [4]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts') File "<string>", line 1 (unparsed_date > 1970-01-01 00:00:01.388552400) ^ SyntaxError: invalid token
New Behavior:
In [18]: ts = pd.Timestamp('2014-01-01') In [19]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts') TypeError: Cannot compare 2014-01-01 00:00:00 of type <class 'pandas.tslib.Timestamp'> to string column
Index.intersection and inner join now preserve the order of the left Index¶
Index.intersection() now preserves the order of the calling Index (left) instead of the other Index (right) (GH15582). This affects inner joins, DataFrame.join() and merge(), and the .align method.
Index.intersection
In [117]: left = pd.Index([2, 1, 0]) In [118]: left Out[118]: Int64Index([2, 1, 0], dtype='int64') In [119]: right = pd.Index([1, 2, 3]) In [120]: right Out[120]: Int64Index([1, 2, 3], dtype='int64')
Previous Behavior:
In [4]: left.intersection(right) Out[4]: Int64Index([1, 2], dtype='int64')
New Behavior:
In [121]: left.intersection(right) Out[121]: Int64Index([2, 1], dtype='int64')
DataFrame.join and pd.merge
In [122]: left = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0]) In [123]: left Out[123]: a 2 20 1 10 0 0 In [124]: right = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3]) In [125]: right Out[125]: b 1 100 2 200 3 300
Previous Behavior:
In [4]: left.join(right, how='inner') Out[4]: a b 1 10 100 2 20 200
New Behavior:
In [126]: left.join(right, how='inner') Out[126]: a b 2 20 200 1 10 100
Pivot Table always returns a DataFrame¶
The documentation for pivot_table() states that a DataFrame is always returned. Here a bug is fixed that allowed this to return a Series under certain circumstances. (GH4386)
In [127]: df = DataFrame({'col1': [3, 4, 5], .....: 'col2': ['C', 'D', 'E'], .....: 'col3': [1, 3, 9]}) .....: In [128]: df Out[128]: col1 col2 col3 0 3 C 1 1 4 D 3 2 5 E 9
Previous Behavior:
In [2]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum) Out[2]: col3 col2 1 C 3 3 D 4 9 E 5 Name: col1, dtype: int64
New Behavior:
In [129]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum) Out[129]: col1 col3 col2 1 C 3 3 D 4 9 E 5
Other API Changes¶
- The numexpr version is now required to be >= 2.4.6 and it will not be used at all if this requisite is not fulfilled (GH15213).
- CParserError has been renamed to ParserError in pd.read_csv() and will be removed in the future (GH12665)
- SparseArray.cumsum() and SparseSeries.cumsum() will now always return SparseArray and SparseSeries respectively (GH12855)
- DataFrame.applymap() with an empty DataFrame will return a copy of the empty DataFrame instead of a Series (GH8222)
- Series.map() now respects default values of dictionary subclasses with a __missing__ method, such as collections.Counter (GH15999)
- .loc has compat with .ix for accepting iterators, and NamedTuples (GH15120)
- interpolate() and fillna() will raise a ValueError if the limit keyword argument is not greater than 0. (GH9217)
- pd.read_csv() will now issue a ParserWarning whenever there are conflicting values provided by the dialect parameter and the user (GH14898)
- pd.read_csv() will now raise a ValueError for the C engine if the quote character is larger than one byte (GH11592)
- inplace arguments now require a boolean value, else a ValueError is thrown (GH14189)
- pandas.api.types.is_datetime64_ns_dtype will now report True on a tz-aware dtype, similar to pandas.api.types.is_datetime64_any_dtype
- DataFrame.asof() will return a null filled Series instead of the scalar NaN if a match is not found (GH15118)
- Specific support for copy.copy() and copy.deepcopy() functions on NDFrame objects (GH15444)
- Series.sort_values() accepts a one element list of bool for consistency with the behavior of DataFrame.sort_values() (GH15604)
- .merge() and .join() on category dtype columns will now preserve the category dtype when possible (GH10409)
- SparseDataFrame.default_fill_value will be 0, previously was nan in the return from pd.get_dummies(..., sparse=True) (GH15594)
- The default behavior of Series.str.match has changed from extracting groups to matching the pattern. The extracting behavior was deprecated since pandas version 0.13.0 and can be done with the Series.str.extract method (GH5224). As a consequence, the as_indexer keyword is ignored (no longer needed to specify the new behavior) and is deprecated.
- NaT will now correctly report False for datetimelike boolean operations such as is_month_start (GH15781)
- NaT will now correctly return np.nan for Timedelta and Period accessors such as days and quarter (GH15782)
- NaT will now return NaT for the tz_localize and tz_convert methods (GH15830)
- The DataFrame and Panel constructors with invalid input will now raise ValueError rather than pandas.core.common.PandasError, if called with scalar inputs and not axes; the exception PandasError is removed as well. (GH15541)
- The exception pandas.core.common.AmbiguousIndexError is removed as it is not referenced (GH15541)
Reorganization of the library: Privacy Changes¶
Modules Privacy Has Changed¶
Some formerly public python/c/c++/cython extension modules have been moved and/or renamed. These are all removed from the public API. Furthermore, the pandas.core, pandas.compat, and pandas.util top-level modules are now considered to be PRIVATE. If indicated, a deprecation warning will be issued if you reference these modules. (GH12588)
Some new subpackages are created with public functionality that is not directly exposed in the top-level namespace: pandas.errors, pandas.plotting and pandas.testing (more details below). Together with pandas.api.types and certain functions in the pandas.io and pandas.tseries submodules, these are now the public subpackages.
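A minimal sketch of importing from the public locations named above:

import pandas.plotting            # plotting helpers, e.g. scatter_matrix
import pandas.testing             # assert_frame_equal and friends
from pandas.api.types import is_datetime64_any_dtype
from pandas.errors import ParserError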
Further changes:
- The function union_categoricals() is now importable from pandas.api.types, formerly from pandas.types.concat (GH15998)
- The type pandas.tslib.NaTType is deprecated and can be replaced by using type(pandas.NaT) (GH16146)
- The public functions in pandas.tools.hashing are deprecated from that location, but are now importable from pandas.util (GH16223)
- The modules in pandas.util: decorators, print_versions, doctools, validators, depr_module are now private. Only the functions exposed in pandas.util itself are public (GH16223)
pandas.errors¶
We are adding a standard public module for all pandas exceptions & warnings, pandas.errors. (GH14800). Previously these exceptions & warnings could be imported from pandas.core.common or pandas.io.common. These exceptions and warnings will be removed from the *.common locations in a future release. (GH15541)
The following are now part of this API:
['DtypeWarning', 'EmptyDataError', 'OutOfBoundsDatetime', 'ParserError', 'ParserWarning', 'PerformanceWarning', 'UnsortedIndexError', 'UnsupportedFunctionCall']
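A minimal sketch of catching one of these exceptions from its new public home; the malformed CSV here is arbitrary:

import pandas as pd
from pandas.errors import ParserError
from io import StringIO

bad = StringIO('a,b\n1,2\n3,4,5,6\n')  # second data row has too many fields
try:
    pd.read_csv(bad)
except ParserError as err:
    print('parse failed:', err)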
pandas.plotting¶
A new public pandas.plotting module has been added that holds plotting functionality that was previously in either pandas.tools.plotting or in the top-level namespace. See the deprecations sections for more details.
Other Development Changes¶
- Building pandas for development now requires cython >= 0.23 (GH14831)
Deprecations¶
Deprecate .ix¶
The .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers. .ix offers a lot of magic on the inference of what the user wants to do. To wit, .ix can decide to index positionally OR via labels, depending on the data type of the index. This has caused quite a bit of user confusion over the years. The full indexing documentation is here. (GH14218)
The recommended methods of indexing are:
- .loc if you want to label index
- .iloc if you want to positionally index.
Using .ix will now show a DeprecationWarning with a link to some examples of how to convert code here.
In [130]: df = pd.DataFrame({'A': [1, 2, 3], .....: 'B': [4, 5, 6]}, .....: index=list('abc')) .....: In [131]: df Out[131]: A B a 1 4 b 2 5 c 3 6
Previous Behavior, where you wish to get the 0th and the 2nd elements from the index in the ‘A’ column.
In [3]: df.ix[[0, 2], 'A'] Out[3]: a 1 c 3 Name: A, dtype: int64
Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.
In [132]: df.loc[df.index[[0, 2]], 'A'] Out[132]: a 1 c 3 Name: A, dtype: int64
Using .iloc. Here we will get the location of the 'A' column, then use positional indexing to select things.
In [133]: df.iloc[[0, 2], df.columns.get_loc('A')] Out[133]: a 1 c 3 Name: A, dtype: int64
Deprecate Panel¶
Panel is deprecated and will be removed in a future version. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. Pandas provides a to_xarray() method to automate this conversion. For more details see the Deprecate Panel documentation. (GH13563).
In [134]: p = tm.makePanel() In [135]: p Out[135]: <class 'pandas.core.panel.Panel'> Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis) Items axis: ItemA to ItemC Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00 Minor_axis axis: A to D
Convert to a MultiIndex DataFrame
In [136]: p.to_frame() Out[136]: ItemA ItemB ItemC major minor 2000-01-03 A 0.628776 -1.409432 0.209395 B 0.988138 -1.347533 -0.896581 C -0.938153 1.272395 -0.161137 D -0.223019 -0.591863 -1.051539 2000-01-04 A 0.186494 1.422986 -0.592886 B -0.072608 0.363565 1.104352 C -1.239072 -1.449567 0.889157 D 2.123692 -0.414505 -0.319561 2000-01-05 A 0.952478 -2.147855 -1.473116 B -0.550603 -0.014752 -0.431550 C 0.139683 -1.195524 0.288377 D 0.122273 -1.425795 -0.619993
Convert to an xarray DataArray
In [137]: p.to_xarray() Out[137]: <xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)> array([[[ 0.628776, 0.988138, -0.938153, -0.223019], [ 0.186494, -0.072608, -1.239072, 2.123692], [ 0.952478, -0.550603, 0.139683, 0.122273]], [[-1.409432, -1.347533, 1.272395, -0.591863], [ 1.422986, 0.363565, -1.449567, -0.414505], [-2.147855, -0.014752, -1.195524, -1.425795]], [[ 0.209395, -0.896581, -0.161137, -1.051539], [-0.592886, 1.104352, 0.889157, -0.319561], [-1.473116, -0.43155 , 0.288377, -0.619993]]]) Coordinates: * items (items) object 'ItemA' 'ItemB' 'ItemC' * major_axis (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05 * minor_axis (minor_axis) object 'A' 'B' 'C' 'D'
Deprecate groupby.agg() with a dictionary when renaming¶
The .groupby(..).agg(..), .rolling(..).agg(..), and .resample(..).agg(..) syntax can accept a variety of inputs, including scalars, lists, and a dict of column names to scalars or lists. This provides a useful syntax for constructing multiple (potentially different) aggregations.
However, .agg(..) can also accept a dict that allows 'renaming' of the result columns. This is a complicated and confusing syntax, as well as not consistent between Series and DataFrame. We are deprecating this 'renaming' functionality:
- Deprecation of passing a dict to a grouped/rolled/resampled Series. This allowed one to rename the resulting aggregation, but this had a completely different meaning than passing a dictionary to a grouped DataFrame, which accepts column-to-aggregations.
- Deprecation of passing a dict-of-dicts to a grouped/rolled/resampled DataFrame in a similar manner.
This is an illustrative example:
In [138]: df = pd.DataFrame({'A': [1, 1, 1, 2, 2], .....: 'B': range(5), .....: 'C': range(5)}) .....: In [139]: df Out[139]: A B C 0 1 0 0 1 1 1 1 2 1 2 2 3 2 3 3 4 2 4 4
Here is a typical useful syntax for computing different aggregations for different columns. This is a natural, and useful syntax. We aggregate from the dict-to-list by taking the specified columns and applying the list of functions. This returns a MultiIndex for the columns (this is not deprecated).
In [140]: df.groupby('A').agg({'B': 'sum', 'C': 'min'}) Out[140]: B C A 1 3 0 2 7 3
Here's an example of the first deprecation, passing a dict to a grouped Series. This is a combination aggregation & renaming:
In [6]: df.groupby('A').B.agg({'foo': 'count'}) FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version Out[6]: foo A 1 3 2 2
You can accomplish the same operation more idiomatically by:
In [141]: df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'}) Out[141]: foo A 1 3 2 2
Here's an example of the second deprecation, passing a dict-of-dict to a grouped DataFrame:
In [23]: (df.groupby('A') .agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}}) ) FutureWarning: using a dict with renaming is deprecated and will be removed in a future version Out[23]: B C foo bar A 1 3 0 2 7 3
You can accomplish nearly the same by:
In [142]: (df.groupby('A') .....: .agg({'B': 'sum', 'C': 'min'}) .....: .rename(columns={'B': 'foo', 'C': 'bar'}) .....: ) .....: Out[142]: foo bar A 1 3 0 2 7 3
Deprecate .plotting¶
The pandas.tools.plotting module has been deprecated, in favor of the top level pandas.plotting module. All the public plotting functions are now available from pandas.plotting (GH12548).
Furthermore, the top-level pandas.scatter_matrix and pandas.plot_params are deprecated. Users can import these from pandas.plotting as well.
Previous script:
pd.tools.plotting.scatter_matrix(df) pd.scatter_matrix(df)
Should be changed to:
pd.plotting.scatter_matrix(df)
Other Deprecations¶
- SparseArray.to_dense() has deprecated the fill parameter, as that parameter was not being respected (GH14647)
- SparseSeries.to_dense() has deprecated the sparse_only parameter (GH14647)
- Series.repeat() has deprecated the reps parameter in favor of repeats (GH12662); a short sketch of these keyword renames follows this list
- The Series constructor and .astype method have deprecated accepting timestamp dtypes without a frequency (e.g. np.datetime64) for the dtype parameter (GH15524)
- Index.repeat() and MultiIndex.repeat() have deprecated the n parameter in favor of repeats (GH12662)
- Categorical.searchsorted() and Series.searchsorted() have deprecated the v parameter in favor of value (GH12662)
- TimedeltaIndex.searchsorted(), DatetimeIndex.searchsorted(), and PeriodIndex.searchsorted() have deprecated the key parameter in favor of value (GH12662)
- DataFrame.astype() has deprecated the raise_on_error parameter in favor of errors (GH14878)
- Series.sortlevel and DataFrame.sortlevel have been deprecated in favor of Series.sort_index and DataFrame.sort_index (GH15099)
- Importing concat from pandas.tools.merge has been deprecated in favor of imports from the pandas namespace. This should only affect explicit imports (GH15358)
- Series/DataFrame/Panel.consolidate() has been deprecated as a public method. (GH15483)
- The as_indexer keyword of Series.str.match() has been deprecated (ignored keyword) (GH15257)
- pd.pnow(), replaced by Period.now()
- pd.Term is removed, as it is not applicable to user code. Instead use in-line string expressions in the where clause when searching in HDFStore
- pd.Expr is removed, as it is not applicable to user code.
- pd.match() is removed.
- pd.groupby(), replaced by using the .groupby() method directly on a Series/DataFrame
- pd.get_store(), replaced by a direct call to pd.HDFStore(...)
- is_any_int_dtype, is_floating_dtype, and is_sequence are deprecated from pandas.api.types (GH16042)
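As noted in the list above, a minimal sketch of the renamed keyword arguments; the data is arbitrary:

import pandas as pd
s = pd.Series([1, 2])
s.repeat(repeats=2)                   # formerly reps=
s.searchsorted(value=2)               # formerly v=
s.astype('float64', errors='raise')   # formerly raise_on_error=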
Removal of prior version deprecations/changes¶
- The pandas.rpy module is removed. Similar functionality can be accessed through the rpy2 project. See the R interfacing docs for more details.
- The pandas.io.ga module with a google-analytics interface is removed (GH11308). Similar functionality can be found in the Google2Pandas package.
- pd.to_datetime and pd.to_timedelta have dropped the coerce parameter in favor of errors (GH13602)
- pandas.stats.fama_macbeth, pandas.stats.ols, pandas.stats.plm and pandas.stats.var, as well as the top-level pandas.fama_macbeth and pandas.ols routines are removed. Similar functionality can be found in the statsmodels package. (GH11898)
- The TimeSeries and SparseTimeSeries classes, aliases of Series and SparseSeries, are removed (GH10890, GH15098).
- Series.is_time_series is dropped in favor of Series.index.is_all_dates (GH15098)
- The deprecated irow, icol, iget and iget_value methods are removed in favor of iloc and iat as explained here (GH10711).
- The deprecated DataFrame.iterkv() has been removed in favor of DataFrame.iteritems() (GH10711)
- The Categorical constructor has dropped the name parameter (GH10632)
- Categorical has dropped support for NaN categories (GH10748)
- The take_last parameter has been dropped from the duplicated(), drop_duplicates(), nlargest(), and nsmallest() methods (GH10236, GH10792, GH10920)
- Series, Index, and DataFrame have dropped the sort and order methods (GH10726)
- Where clauses in pytables are only accepted as strings and expression types and not other data-types (GH12027)
- DataFrame has dropped the combineAdd and combineMult methods in favor of add and mul respectively (GH10735)
Performance Improvements¶
- Improved performance of pd.wide_to_long() (GH14779)
- Improved performance of pd.factorize() by releasing the GIL with object dtype when inferred as strings (GH14859, GH16057)
- Improved performance of timeseries plotting with an irregular DatetimeIndex (or with compat_x=True) (GH15073)
- Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
- Improved performance and reduced memory when indexing with a MultiIndex (GH15245)
- When reading a buffer object in the read_sas() method without a specified format, the filepath string is inferred rather than a buffer object. (GH14947)
- Improved performance of .rank() for categorical data (GH15498)
- Improved performance when using .unstack() (GH15503)
- Improved performance of merge/join on category columns (GH10409)
- Improved performance of drop_duplicates() on bool columns (GH12963)
- Improved performance of pd.core.groupby.GroupBy.apply when the applied function used the .name attribute of the group DataFrame (GH15062)
- Improved performance of .iloc indexing with a list or array (GH15504)
- Improved performance of Series.sort_index() with a monotonic index (GH15694)
- Improved performance in pd.read_csv() on some platforms with buffered reads (GH16039)
Bug Fixes¶
- Timestamp.replace now raises TypeError when incorrect argument names are given; previously this raised ValueError (GH15240)
- Timestamp.replace with compat for passing long integers (GH15030)
- Timestamp returning UTC based time/date attributes when a timezone was provided (GH13303, GH6538)
- Timestamp incorrectly localizing timezones during construction (GH11481, GH15777)
- TimedeltaIndex addition where overflow was being allowed without error (GH14816)
- TimedeltaIndex raising a ValueError when boolean indexing with loc (GH14946)
- Timestamp + Timedelta/Offset operations (GH15126)
- DatetimeIndex.round() and Timestamp.round() floating point accuracy when rounding by milliseconds or less (GH14440, GH15578)
- astype() where inf values were incorrectly converted to integers. This now raises an error with astype() for Series and DataFrames (GH14265)
- DataFrame(..).apply(to_numeric) when values are of type decimal.Decimal (GH14827)
- describe() when passing a numpy array which does not contain the median to the percentiles keyword argument (GH14908)
- PeriodIndex constructor, including raising on floats more consistently (GH13277)
- __deepcopy__ on empty NDFrame objects (GH15370)
- .replace() may result in incorrect dtypes (GH12747, GH15765)
- Series.replace and DataFrame.replace which failed on empty replacement dicts (GH15289)
- Series.replace which replaced a numeric by string (GH15743)
- Index construction with NaN elements and integer dtype specified (GH15187)
- Series construction with a datetimetz (GH14928)
- Series.dt.round() inconsistent behaviour on NaT's with different arguments (GH14940)
- Series constructor when both copy=True and dtype arguments are provided (GH15125)
- Series was returned by comparison methods (e.g., lt, gt, ...) against a constant for an empty DataFrame (GH15077)
- Series.ffill() with mixed dtypes containing tz-aware datetimes (GH14956)
- DataFrame.fillna() where the argument downcast was ignored when the fillna value was of type dict (GH15277)
- .asfreq(), where frequency was not set for empty Series (GH14320)
- DataFrame construction with nulls and datetimes in a list-like (GH15869)
- DataFrame.fillna() with tz-aware datetimes (GH15855)
- is_string_dtype, is_timedelta64_ns_dtype, and is_string_like_dtype in which an error was raised when None was passed in (GH15941)
- pd.unique on a Categorical, which was returning an ndarray and not a Categorical (GH15903)
- Index.to_series() where the index was not copied (and so mutating later would change the original) (GH15949)
- Series construction where passing an invalid dtype didn't raise an error (GH15520)
- Index power operations with reversed operands (GH14973)
- DataFrame.sort_values() when sorting by multiple columns where one column is of type int64 and contains NaT (GH14922)
- DataFrame.reindex() in which method was ignored when passing columns (GH14992)
- DataFrame.loc with indexing a MultiIndex with a Series indexer (GH14730, GH15424)
- DataFrame.loc with indexing a MultiIndex with a numpy array (GH15434)
- Series.asof which raised if the series contained all np.nan (GH15713)
- .at when selecting from a tz-aware column (GH15822)
- Series.where() and DataFrame.where() where array-like conditionals were being rejected (GH15414)
- Series.where() where TZ-aware data was converted to float representation (GH15701)
- .loc that would not return the correct dtype for scalar access for a DataFrame (GH11617)
- MultiIndex when names are integers (GH12223, GH15262)
- Categorical.searchsorted() where alphabetical instead of the provided categorical order was used (GH14522)
- Series.iloc where a Categorical object for list-like indexes input was returned, where a Series was expected (GH14580)
- DataFrame.isin comparing datetimelike to empty frame (GH15473)
- .reset_index() when an all NaN level of a MultiIndex would fail (GH6322)
- .reset_index() when raising error for index name already present in MultiIndex columns (GH16120)
- MultiIndex with tuples and not passing a list of names; this will now raise ValueError (GH15110)
- MultiIndex and truncation (GH14882)
- .info() where a qualifier (+) would always be displayed with a MultiIndex that contains only non-strings (GH15245)
- pd.concat() where the names of the MultiIndex of the resulting DataFrame were not handled correctly when None was present in the names of the MultiIndex of the input DataFrame (GH15787)
- DataFrame.sort_index() and Series.sort_index() where na_position doesn't work with a MultiIndex (GH14784, GH16604)
- pd.concat() when combining objects with a CategoricalIndex (GH16111)
- CategoricalIndex (GH16123)
- pd.to_numeric() in which float and unsigned integer elements were being improperly casted (GH14941, GH15005)
- pd.read_fwf() where the skiprows parameter was not being respected during column width inference (GH11256)
- pd.read_csv() in which the dialect parameter was not being verified before processing (GH14898)
- pd.read_csv() in which missing data was being improperly handled with usecols (GH6710)
- pd.read_csv() in which a file containing a row with many columns followed by rows with fewer columns would cause a crash (GH14125)
- pd.read_csv() for the C engine where usecols were being indexed incorrectly with parse_dates (GH14792)
- pd.read_csv() with parse_dates when multiline headers are specified (GH15376)
- pd.read_csv() with float_precision='round_trip' which caused a segfault when a text entry is parsed (GH15140)
- pd.read_csv() when an index was specified and no values were specified as null values (GH15835)
- pd.read_csv() in which certain invalid file objects caused the Python interpreter to crash (GH15337)
- pd.read_csv() in which invalid values for nrows and chunksize were allowed (GH15767)
- pd.read_csv() for the Python engine in which unhelpful error messages were being raised when parsing errors occurred (GH15910)
- pd.read_csv() in which the skipfooter parameter was not being properly validated (GH15925)
- pd.to_csv() in which there was numeric overflow when a timestamp index was being written (GH15982)
- pd.util.hashing.hash_pandas_object() in which hashing of categoricals depended on the ordering of categories, instead of just their values (GH15143)
- .to_json() where lines=True and contents (keys or values) contain escaped characters (GH15096)
- .to_json() causing single byte ascii characters to be expanded to four byte unicode (GH15344)
- .to_json() for the C engine where rollover was not correctly handled for case where frac is odd and diff is exactly 0.5 (GH15716, GH15864)
- pd.read_json() for Python 2 where lines=True and contents contain non-ascii unicode characters (GH15132)
- pd.read_msgpack() in which Series categoricals were being improperly processed (GH14901)
- pd.read_msgpack() which did not allow loading of a dataframe with an index of type CategoricalIndex (GH15487)
- pd.read_msgpack() when deserializing a CategoricalIndex (GH15487)
- DataFrame.to_records() with converting a DatetimeIndex with a timezone (GH13937)
- DataFrame.to_records() which failed with unicode characters in column names (GH11879)
- .to_sql() when writing a DataFrame with numeric index names (GH15404)
- DataFrame.to_html() with index=False and max_rows raising an IndexError (GH14998)
- pd.read_hdf() passing a Timestamp to the where parameter with a non date column (GH15492)
- DataFrame.to_stata() and StataWriter which produced incorrectly formatted files for some locales (GH13856)
- StataReader and StataWriter which allowed invalid encodings (GH15723)
- Series repr not showing the length when the output was truncated (GH15962)
- DataFrame.hist where plt.tight_layout caused an AttributeError (use matplotlib >= 2.0.1) (GH9351)
- DataFrame.boxplot where fontsize was not applied to the tick labels on both axes (GH15108)
- pd.scatter_matrix() could accept either color or c, but not both (GH14855)
- .groupby(..).resample() when passed the on= kwarg (GH15021)
- __name__ and __qualname__ for Groupby.* functions (GH14620)
- GroupBy.get_group() failing with a categorical grouper (GH15155)
- .groupby(...).rolling(...) when on is specified and using a DatetimeIndex (GH15130, GH13966)
- timedelta64 when passing numeric_only=False (GH5724)
- groupby.apply() coercing object dtypes to numeric types, when not all values were numeric (GH14423, GH15421, GH15670)
- resample, where a non-string loffset argument would not be applied when resampling a timeseries (GH13218)
- DataFrame.groupby().describe() when grouping on Index containing tuples (GH14848)
- groupby().nunique() with a datetimelike-grouper where bins counts were incorrect (GH13453)
- groupby.transform() that would coerce the resultant dtypes back to the original (GH10972, GH11444)
- groupby.agg() incorrectly localizing timezone on datetime (GH15426, GH10668, GH13046)
- .rolling/expanding() functions where count() was not counting np.Inf, nor handling object dtypes (GH12541)
- .rolling() where pd.Timedelta or datetime.timedelta was not accepted as a window argument (GH15440)
- Rolling.quantile function that caused a segmentation fault when called with a quantile value outside of the range [0, 1] (GH15463)
- DataFrame.resample().median() if duplicate column names are present (GH14233)
- SparseSeries.reindex on single level with list of length 1 (GH15447)
- SparseDataFrame after a value was set on (a copy of) one of its series (GH15488)
- SparseDataFrame construction with lists not coercing to dtype (GH15682)
- pd.merge_asof() where left_index or right_index caused a failure when multiple by was specified (GH15676)
- pd.merge_asof() where left_index/right_index together caused a failure when tolerance was specified (GH15135)
- DataFrame.pivot_table() where dropna=True would not drop all-NaN columns when the columns was a category dtype (GH15193)
- pd.melt() where passing a tuple value for value_vars caused a TypeError (GH15348)
- pd.pivot_table() where no error was raised when the values argument was not in the columns (GH14938)
- pd.concat() in which concatenating with an empty dataframe with join='inner' was being improperly handled (GH15328)
- sort=True in DataFrame.join and pd.merge when joining on indexes (GH15582)
- DataFrame.nsmallest and DataFrame.nlargest where identical values resulted in duplicated rows (GH15297)
- .rank() which incorrectly ranks ordered categories (GH15420)
- .corr() and .cov() where the column and index were the same object (GH14617)
- .mode() where mode was not returned if it was only a single value (GH15714)
- pd.cut() with a single bin on an all 0s array (GH15428)
- pd.qcut() with a single quantile and an array with identical values (GH15431)
- pandas.tools.utils.cartesian_product() with large input could cause overflow on Windows (GH15265)
- .eval() which caused multiline evals to fail with local variables not on the first line (GH15342)
- .interpolate() (GH15662)
- .qcut/cut; bins will now be int64 dtype (GH14866)
- Qt when a QtApplication already exists (GH14372)
- np.finfo() during import pandas removed to mitigate deadlock on Python GIL misuse (GH14641)
This is a minor bug-fix release in the 0.19.x series and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
The pd.merge_asof(), added in 0.19.0, gained some improvements:
- pd.merge_asof() gained left_index/right_index and left_by/right_by arguments (GH14253)
- pd.merge_asof() can take multiple columns in the by parameter and has specialized dtypes for better performance (GH13936)
Performance Improvements¶
- PeriodIndex (GH14822)
- .replace() (GH12745)
- Series creation with a datetime index and dictionary data (GH14894)
Bug Fixes¶
- dateutil==2.6.0; segfault reported in the testing suite (GH14621)
- nanoseconds in Timestamp.replace as a kwarg (GH14621)
- pd.read_csv in which aliasing was being done for na_values when passed in as a dictionary (GH14203)
- pd.read_csv in which column indices for a dict-like na_values were not being respected (GH14203)
- pd.read_csv where reading files fails, if the number of headers is equal to the number of lines in the file (GH14515)
- pd.read_csv for the Python engine in which an unhelpful error message was being raised when multi-char delimiters were not being respected with quotes (GH14582)
- pd.read_sas and pandas.io.sas.sas7bdat.SAS7BDATReader that caused problems when reading a SAS file incrementally
- pd.read_csv for the Python engine in which an unhelpful error message was being raised when skipfooter was not being respected by Python's CSV library (GH13879)
- .fillna() in which timezone aware datetime64 values were incorrectly rounded (GH14872)
- .groupby(..., sort=True) of a non-lexsorted MultiIndex when grouping with multiple levels (GH14776)
- pd.cut with negative values and a single bin (GH14652)
- pd.to_numeric where a 0 was not unsigned on a downcast='unsigned' argument (GH14401)
- sharex=True or ax.twinx() (GH13341, GH14322)
- DatetimeIndex in local TZ, covering a DST change, which would raise AmbiguousTimeError (GH14682)
- RecursionError into KeyError or IndexingError (GH14554)
- HDFStore when writing a MultiIndex when using data_columns=True (GH14435)
- HDFStore.append() when writing a Series and passing a min_itemsize argument containing a value for the index (GH11412)
- HDFStore in table format with a min_itemsize value for the index and without asking to append (GH10381)
- Series.groupby.nunique() raising an IndexError for an empty Series (GH12553)
- DataFrame.nlargest and DataFrame.nsmallest when the index had duplicate values (GH13412)
- .to_clipboard() and Excel compat (GH12529)
- DataFrame.combine_first() for integer columns (GH14687)
- pd.read_csv() in which the dtype parameter was not being respected for empty data (GH14712)
- pd.read_csv() in which the nrows parameter was not being respected for large input when using the C engine for parsing (GH7626)
- pd.merge_asof() could not handle timezone-aware DatetimeIndex when a tolerance was specified (GH14844)
- to_stata and StataWriter for out-of-range values when writing doubles (GH14618)
- .plot(kind='kde') which did not drop missing values to generate the KDE Plot, instead generating an empty plot (GH14821)
- unstack(), if called with a list of column(s) as an argument, coerced all columns to object dtype regardless of their original dtypes (GH11847)
This is a minor bug-fix release from 0.19.0 and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.
Performance Improvements¶
- Period data (GH14338)
- Series.asof(where) when where is a scalar (GH14461)
- DataFrame.asof(where) when where is a scalar (GH14461)
- .to_json() when lines=True (GH14408)
Bug Fixes¶
- cython installed, as in previous versions (GH14204)
- read_csv (c engine) (GH14418)
- DataFrame.quantile when missing values were present in some columns (GH14357)
- Index.difference where the freq of a DatetimeIndex was incorrectly set (GH14323)
- pandas.core.common.array_equivalent with a deprecation warning (GH14555)
- pd.read_csv for the C engine in which quotation marks were improperly parsed in skipped rows (GH14459)
- pd.read_csv for Python 2.x in which Unicode quote characters were no longer being respected (GH14477)
- Index.append when categorical indices were appended (GH14545)
- pd.DataFrame where the constructor fails when given a dict with a None value (GH14381)
- DatetimeIndex._maybe_cast_slice_bound when the index is empty (GH14354)
- TimedeltaIndex addition with a Datetime-like object where addition overflow in the negative direction was not being caught (GH14068, GH14453)
- object Index may raise AttributeError (GH14424)
- ValueError on empty input to pd.eval() and df.query() (GH13139)
- RangeIndex.intersection when the result is an empty set (GH14364)
- Series.__setitem__ which allowed mutating read-only arrays (GH14359)
- DataFrame.insert where multiple calls with duplicate columns can fail (GH14291)
- pd.merge() will raise ValueError with non-boolean parameters in passed boolean type arguments (GH14434)
- Timestamp where dates very near the minimum (1677-09) could underflow on creation (GH14415)
- pd.concat where the names of the keys were not propagated to the resulting MultiIndex (GH14252)
- pd.concat where axis could not take the string parameters 'rows' or 'columns' (GH14369)
- pd.concat with dataframes heterogeneous in length and tuple keys (GH14438)
- MultiIndex.set_levels where illegal level values were still set after raising an error (GH13754)
- DataFrame.to_json where lines=True and a value contained a } character (GH14391)
- df.groupby causing an AttributeError when grouping a single index frame by a column and the index level (GH14327)
- df.groupby where a TypeError was raised when pd.Grouper(key=...) is passed in a list (GH14334)
- pd.pivot_table may raise TypeError or ValueError when index or columns is not scalar and values is not specified (GH14380)
This is a major release from 0.18.1 and includes a number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
- merge_asof() for asof-style time-series joining, see here
- .rolling() is now time-series aware, see here
- read_csv() now supports parsing Categorical data, see here
- A function union_categorical() has been added for combining categoricals, see here
- PeriodIndex now has its own period dtype, and changed to be more consistent with other Index classes. See here
- Sparse data structures gained enhanced support of int and bool dtypes, see here
- Comparison operations with Series no longer ignore the index, see here for an overview of the API changes.
- Deprecation of Panel4D and PanelND. We recommend representing these types of n-dimensional data with the xarray package.
- Removal of the previously deprecated modules pandas.io.data, pandas.io.wb, pandas.tools.rplot.
Warning
pandas >= 0.19.0 will no longer silence numpy ufunc warnings upon import, see here.
New features¶
merge_asof for asof-style time-series joining¶
A long-time requested feature has been added through the merge_asof()
function, to support asof style joining of time-series (GH1870, GH13695, GH13709, GH13902). Full documentation is here.
The merge_asof()
performs an asof merge, which is similar to a left-join except that we match on nearest key rather than equal keys.
In [1]: left = pd.DataFrame({'a': [1, 5, 10], ...: 'left_val': ['a', 'b', 'c']}) ...: In [2]: right = pd.DataFrame({'a': [1, 2, 3, 6, 7], ...: 'right_val': [1, 2, 3, 6, 7]}) ...: In [3]: left Out[3]: a left_val 0 1 a 1 5 b 2 10 c In [4]: right Out[4]: a right_val 0 1 1 1 2 2 2 3 3 3 6 6 4 7 7
We typically want to match exactly when possible, and use the most recent value otherwise.
In [5]: pd.merge_asof(left, right, on='a') Out[5]: a left_val right_val 0 1 a 1 1 5 b 3 2 10 c 7
We can also match rows ONLY with prior data, and not an exact match.
In [6]: pd.merge_asof(left, right, on='a', allow_exact_matches=False) Out[6]: a left_val right_val 0 1 a NaN 1 5 b 3.0 2 10 c 7.0
In a typical time-series example, we have trades
and quotes
and we want to asof-join
them. This also illustrates using the by
parameter to group data before merging.
In [7]: trades = pd.DataFrame({ ...: 'time': pd.to_datetime(['20160525 13:30:00.023', ...: '20160525 13:30:00.038', ...: '20160525 13:30:00.048', ...: '20160525 13:30:00.048', ...: '20160525 13:30:00.048']), ...: 'ticker': ['MSFT', 'MSFT', ...: 'GOOG', 'GOOG', 'AAPL'], ...: 'price': [51.95, 51.95, ...: 720.77, 720.92, 98.00], ...: 'quantity': [75, 155, ...: 100, 100, 100]}, ...: columns=['time', 'ticker', 'price', 'quantity']) ...: In [8]: quotes = pd.DataFrame({ ...: 'time': pd.to_datetime(['20160525 13:30:00.023', ...: '20160525 13:30:00.023', ...: '20160525 13:30:00.030', ...: '20160525 13:30:00.041', ...: '20160525 13:30:00.048', ...: '20160525 13:30:00.049', ...: '20160525 13:30:00.072', ...: '20160525 13:30:00.075']), ...: 'ticker': ['GOOG', 'MSFT', 'MSFT', ...: 'MSFT', 'GOOG', 'AAPL', 'GOOG', ...: 'MSFT'], ...: 'bid': [720.50, 51.95, 51.97, 51.99, ...: 720.50, 97.99, 720.50, 52.01], ...: 'ask': [720.93, 51.96, 51.98, 52.00, ...: 720.93, 98.01, 720.88, 52.03]}, ...: columns=['time', 'ticker', 'bid', 'ask']) ...:
In [9]: trades Out[9]: time ticker price quantity 0 2016-05-25 13:30:00.023 MSFT 51.95 75 1 2016-05-25 13:30:00.038 MSFT 51.95 155 2 2016-05-25 13:30:00.048 GOOG 720.77 100 3 2016-05-25 13:30:00.048 GOOG 720.92 100 4 2016-05-25 13:30:00.048 AAPL 98.00 100 In [10]: quotes Out[10]: time ticker bid ask 0 2016-05-25 13:30:00.023 GOOG 720.50 720.93 1 2016-05-25 13:30:00.023 MSFT 51.95 51.96 2 2016-05-25 13:30:00.030 MSFT 51.97 51.98 3 2016-05-25 13:30:00.041 MSFT 51.99 52.00 4 2016-05-25 13:30:00.048 GOOG 720.50 720.93 5 2016-05-25 13:30:00.049 AAPL 97.99 98.01 6 2016-05-25 13:30:00.072 GOOG 720.50 720.88 7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
An asof merge joins on the on
, typically a datetimelike field, which is ordered, and in this case we are using a grouper in the by
field. This is like a left-outer join, except that forward filling happens automatically taking the most recent non-NaN value.
In [11]: pd.merge_asof(trades, quotes, ....: on='time', ....: by='ticker') ....: Out[11]: time ticker price quantity bid ask 0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96 1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98 2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93 3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
This returns a merged DataFrame with the entries in the same order as the original left passed DataFrame (trades
in this case), with the fields of the quotes
merged.
.rolling()
is now time-series aware¶
.rolling()
objects are now time-series aware and can accept a time-series offset (or convertible) for the window
argument (GH13327, GH12995). See the full documentation here.
In [12]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]}, ....: index=pd.date_range('20130101 09:00:00', periods=5, freq='s')) ....: In [13]: dft Out[13]: B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:01 1.0 2013-01-01 09:00:02 2.0 2013-01-01 09:00:03 NaN 2013-01-01 09:00:04 4.0
This is a regular frequency index. Using an integer window parameter works to roll along the window frequency.
In [14]: dft.rolling(2).sum() Out[14]: B 2013-01-01 09:00:00 NaN 2013-01-01 09:00:01 1.0 2013-01-01 09:00:02 3.0 2013-01-01 09:00:03 NaN 2013-01-01 09:00:04 NaN In [15]: dft.rolling(2, min_periods=1).sum() Out[15]: B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:01 1.0 2013-01-01 09:00:02 3.0 2013-01-01 09:00:03 2.0 2013-01-01 09:00:04 4.0
Specifying an offset allows a more intuitive specification of the rolling frequency.
In [16]: dft.rolling('2s').sum() Out[16]: B 2013-01-01 09:00:00 0.0 2013-01-01 09:00:01 1.0 2013-01-01 09:00:02 3.0 2013-01-01 09:00:03 2.0 2013-01-01 09:00:04 4.0
Using a non-regular, but still monotonic index, rolling with an integer window does not impart any special calculation.
In [17]: dft = DataFrame({'B': [0, 1, 2, np.nan, 4]}, ....: index = pd.Index([pd.Timestamp('20130101 09:00:00'), ....: pd.Timestamp('20130101 09:00:02'), ....: pd.Timestamp('20130101 09:00:03'), ....: pd.Timestamp('20130101 09:00:05'), ....: pd.Timestamp('20130101 09:00:06')], ....: name='foo')) ....: In [18]: dft Out[18]: B foo 2013-01-01 09:00:00 0.0 2013-01-01 09:00:02 1.0 2013-01-01 09:00:03 2.0 2013-01-01 09:00:05 NaN 2013-01-01 09:00:06 4.0 In [19]: dft.rolling(2).sum() Out[19]: B foo 2013-01-01 09:00:00 NaN 2013-01-01 09:00:02 1.0 2013-01-01 09:00:03 3.0 2013-01-01 09:00:05 NaN 2013-01-01 09:00:06 NaN
Using the time-specification generates variable windows for this sparse data.
In [20]: dft.rolling('2s').sum() Out[20]: B foo 2013-01-01 09:00:00 0.0 2013-01-01 09:00:02 1.0 2013-01-01 09:00:03 3.0 2013-01-01 09:00:05 NaN 2013-01-01 09:00:06 4.0
Furthermore, we now allow an optional on
parameter to specify a column (rather than the default of the index) in a DataFrame.
In [21]: dft = dft.reset_index() In [22]: dft Out[22]: foo B 0 2013-01-01 09:00:00 0.0 1 2013-01-01 09:00:02 1.0 2 2013-01-01 09:00:03 2.0 3 2013-01-01 09:00:05 NaN 4 2013-01-01 09:00:06 4.0 In [23]: dft.rolling('2s', on='foo').sum() Out[23]: foo B 0 2013-01-01 09:00:00 0.0 1 2013-01-01 09:00:02 1.0 2 2013-01-01 09:00:03 3.0 3 2013-01-01 09:00:05 NaN 4 2013-01-01 09:00:06 4.0
read_csv
has improved support for duplicate column names¶
Duplicate column names are now supported in read_csv()
whether they are in the file or passed in as the names
parameter (GH7160, GH9424)
In [24]: data = '0,1,2\n3,4,5' In [25]: names = ['a', 'b', 'a']
Previous behavior:
In [2]: pd.read_csv(StringIO(data), names=names) Out[2]: a b a 0 2 1 2 1 5 4 5
The first a
column contained the same data as the second a
column, when it should have contained the values [0, 3]
.
New behavior:
In [26]: pd.read_csv(StringIO(data), names=names) Out[26]: a b a.1 0 0 1 2 1 3 4 5
read_csv
supports parsing Categorical
directly¶
The read_csv()
function now supports parsing a Categorical
column when specified as a dtype (GH10153). Depending on the structure of the data, this can result in a faster parse time and lower memory usage compared to converting to Categorical
after parsing. See the io docs here.
In [27]: data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3' In [28]: pd.read_csv(StringIO(data)) Out[28]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3 In [29]: pd.read_csv(StringIO(data)).dtypes Out[29]: col1 object col2 object col3 int64 dtype: object In [30]: pd.read_csv(StringIO(data), dtype='category').dtypes Out[30]: col1 category col2 category col3 category dtype: object
Individual columns can be parsed as a Categorical
using a dict specification
In [31]: pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes Out[31]: col1 category col2 object col3 int64 dtype: object
Note
The resulting categories will always be parsed as strings (object dtype). If the categories are numeric they can be converted using the to_numeric()
function, or as appropriate, another converter such as to_datetime()
.
In [32]: df = pd.read_csv(StringIO(data), dtype='category') In [33]: df.dtypes Out[33]: col1 category col2 category col3 category dtype: object In [34]: df['col3'] Out[34]: 0 1 1 2 2 3 Name: col3, dtype: category Categories (3, object): [1, 2, 3] In [35]: df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories) In [36]: df['col3'] Out[36]: 0 1 1 2 2 3 Name: col3, dtype: category Categories (3, int64): [1, 2, 3]
Categorical Concatenation¶
A function union_categoricals()
has been added for combining categoricals, see Unioning Categoricals (GH13361, GH13763, GH13846, GH14173)
In [37]: from pandas.api.types import union_categoricals In [38]: a = pd.Categorical(["b", "c"]) In [39]: b = pd.Categorical(["a", "b"]) In [40]: union_categoricals([a, b]) Out[40]: [b, c, a, b] Categories (3, object): [b, c, a]
concat
and append
now can concat category
dtypes with different categories
as object
dtype (GH13524)
In [41]: s1 = pd.Series(['a', 'b'], dtype='category') In [42]: s2 = pd.Series(['b', 'c'], dtype='category')
Previous behavior:
In [1]: pd.concat([s1, s2]) ValueError: incompatible categories in categorical concat
New behavior:
In [43]: pd.concat([s1, s2]) Out[43]: 0 a 1 b 0 b 1 c dtype: object
Pandas has gained new frequency offsets, SemiMonthEnd
(‘SM’) and SemiMonthBegin
(‘SMS’). These provide date offsets anchored (by default) to the 15th and end of month, and 15th and 1st of month respectively. (GH1543)
In [44]: from pandas.tseries.offsets import SemiMonthEnd, SemiMonthBegin
SemiMonthEnd:
In [45]: Timestamp('2016-01-01') + SemiMonthEnd() Out[45]: Timestamp('2016-01-15 00:00:00') In [46]: pd.date_range('2015-01-01', freq='SM', periods=4) Out[46]: DatetimeIndex(['2015-01-15', '2015-01-31', '2015-02-15', '2015-02-28'], dtype='datetime64[ns]', freq='SM-15')
SemiMonthBegin:
In [47]: Timestamp('2016-01-01') + SemiMonthBegin() Out[47]: Timestamp('2016-01-15 00:00:00') In [48]: pd.date_range('2015-01-01', freq='SMS', periods=4) Out[48]: DatetimeIndex(['2015-01-01', '2015-01-15', '2015-02-01', '2015-02-15'], dtype='datetime64[ns]', freq='SMS-15')
Using the anchoring suffix, you can also specify the day of month to use instead of the 15th.
In [49]: pd.date_range('2015-01-01', freq='SMS-16', periods=4) Out[49]: DatetimeIndex(['2015-01-01', '2015-01-16', '2015-02-01', '2015-02-16'], dtype='datetime64[ns]', freq='SMS-16') In [50]: pd.date_range('2015-01-01', freq='SM-14', periods=4) Out[50]: DatetimeIndex(['2015-01-14', '2015-01-31', '2015-02-14', '2015-02-28'], dtype='datetime64[ns]', freq='SM-14')
New Index methods¶
The following methods and options are added to Index
, to be more consistent with the Series
and DataFrame
API.
Index
now supports the .where()
function for same shape indexing (GH13170)
In [51]: idx = pd.Index(['a', 'b', 'c']) In [52]: idx.where([True, False, True]) Out[52]: Index(['a', nan, 'c'], dtype='object')
Index
now supports .dropna()
to exclude missing values (GH6194)
In [53]: idx = pd.Index([1, 2, np.nan, 4]) In [54]: idx.dropna() Out[54]: Float64Index([1.0, 2.0, 4.0], dtype='float64')
For MultiIndex
, values are dropped if any level is missing by default. Specifying how='all'
only drops values where all levels are missing.
In [55]: midx = pd.MultiIndex.from_arrays([[1, 2, np.nan, 4], ....: [1, 2, np.nan, np.nan]]) ....: In [56]: midx Out[56]: MultiIndex(levels=[[1, 2, 4], [1, 2]], labels=[[0, 1, -1, 2], [0, 1, -1, -1]]) In [57]: midx.dropna() Out[57]: MultiIndex(levels=[[1, 2, 4], [1, 2]], labels=[[0, 1], [0, 1]]) In [58]: midx.dropna(how='all') Out[58]: MultiIndex(levels=[[1, 2, 4], [1, 2]], labels=[[0, 1, 2], [0, 1, -1]])
Index
now supports .str.extractall()
which returns a DataFrame
, see the docs here (GH10008, GH13156)
In [59]: idx = pd.Index(["a1a2", "b1", "c1"]) In [60]: idx.str.extractall("[ab](?P<digit>\d)") Out[60]: digit match 0 0 1 1 2 1 0 1
Index.astype()
now accepts an optional boolean argument copy
, which allows optional copying if the requirements on dtype are satisfied (GH13209)
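For illustration, a minimal sketch of the copy argument; whether the no-copy path reuses the same underlying object is an implementation detail that may vary by version:

import pandas as pd

idx = pd.Index([1, 2, 3], dtype='int64')

# copy=True (the default) always allocates new data
a = idx.astype('int64', copy=True)

# copy=False may reuse the existing data when the target dtype
# already matches, skipping an unnecessary copy
b = idx.astype('int64', copy=False)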
read_gbq() method has gained the dialect argument to allow users to specify whether to use BigQuery’s legacy SQL or BigQuery’s standard SQL. See the docs for more details (GH13615).
to_gbq() method now allows the DataFrame column order to differ from the destination table schema (GH11359).
Previous versions of pandas would permanently silence numpy’s ufunc error handling when pandas
was imported. Pandas did this in order to silence the warnings that would arise from using numpy ufuncs on missing data, which are usually represented as NaN
s. Unfortunately, this silenced legitimate warnings arising in non-pandas code in the application. Starting with 0.19.0, pandas will use the numpy.errstate
context manager to silence these warnings in a more fine-grained manner, only around where these operations are actually used in the pandas codebase. (GH13109, GH13145)
After upgrading pandas, you may see new RuntimeWarnings
being issued from your code. These are likely legitimate, and the underlying cause likely existed in the code when using previous versions of pandas that simply silenced the warning. Use numpy.errstate around the source of the RuntimeWarning
to control how these conditions are handled.
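As a minimal sketch of scoping the suppression yourself (the division below is our own illustration of an operation that would otherwise warn):

import numpy as np

a = np.array([1.0, 0.0, -1.0])

# silence only this block, instead of silencing warnings globally
with np.errstate(divide='ignore', invalid='ignore'):
    out = a / 0.0   # would normally emit RuntimeWarnings

print(out)  # [ inf  nan -inf]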
get_dummies
now returns integer dtypes¶
The pd.get_dummies
function now returns dummy-encoded columns as small integers, rather than floats (GH8725). This should provide an improved memory footprint.
Previous behavior:
In [1]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes Out[1]: a float64 b float64 c float64 dtype: object
New behavior:
In [61]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes Out[61]: a uint8 b uint8 c uint8 dtype: object
Downcast values to smallest possible dtype in to_numeric¶
pd.to_numeric()
now accepts a downcast
parameter, which will downcast the data if possible to smallest specified numerical dtype (GH13352)
In [62]: s = ['1', 2, 3] In [63]: pd.to_numeric(s, downcast='unsigned') Out[63]: array([1, 2, 3], dtype=uint8) In [64]: pd.to_numeric(s, downcast='integer') Out[64]: array([1, 2, 3], dtype=int8)
pandas development API¶
As part of making pandas API more uniform and accessible in the future, we have created a standard sub-package of pandas, pandas.api
to hold public API’s. We are starting by exposing type introspection functions in pandas.api.types
. More sub-packages and officially sanctioned API’s will be published in future versions of pandas (GH13147, GH13634)
The following are now part of this API:
In [65]: import pprint In [66]: from pandas.api import types In [67]: funcs = [ f for f in dir(types) if not f.startswith('_') ] In [68]: pprint.pprint(funcs) ['CategoricalDtype', 'DatetimeTZDtype', 'IntervalDtype', 'PeriodDtype', 'infer_dtype', 'is_any_int_dtype', 'is_bool', 'is_bool_dtype', 'is_categorical', 'is_categorical_dtype', 'is_complex', 'is_complex_dtype', 'is_datetime64_any_dtype', 'is_datetime64_dtype', 'is_datetime64_ns_dtype', 'is_datetime64tz_dtype', 'is_datetimetz', 'is_dict_like', 'is_dtype_equal', 'is_extension_type', 'is_file_like', 'is_float', 'is_float_dtype', 'is_floating_dtype', 'is_hashable', 'is_int64_dtype', 'is_integer', 'is_integer_dtype', 'is_interval', 'is_interval_dtype', 'is_iterator', 'is_list_like', 'is_named_tuple', 'is_number', 'is_numeric_dtype', 'is_object_dtype', 'is_period', 'is_period_dtype', 'is_re', 'is_re_compilable', 'is_scalar', 'is_sequence', 'is_signed_integer_dtype', 'is_sparse', 'is_string_dtype', 'is_timedelta64_dtype', 'is_timedelta64_ns_dtype', 'is_unsigned_integer_dtype', 'pandas_dtype', 'union_categoricals']
Note
Calling these functions from the internal module pandas.core.common
will now show a DeprecationWarning
(GH13990)
Timestamp
can now accept positional and keyword parameters similar to datetime.datetime()
(GH10758, GH11630)
In [69]: pd.Timestamp(2012, 1, 1) Out[69]: Timestamp('2012-01-01 00:00:00') In [70]: pd.Timestamp(year=2012, month=1, day=1, hour=8, minute=30) Out[70]: Timestamp('2012-01-01 08:30:00')
The .resample()
function now accepts a on=
or level=
parameter for resampling on a datetimelike column or MultiIndex
level (GH13500)
In [71]: df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5), ....: 'a': np.arange(5)}, ....: index=pd.MultiIndex.from_arrays([ ....: [1,2,3,4,5], ....: pd.date_range('2015-01-01', freq='W', periods=5)], ....: names=['v','d'])) ....: In [72]: df Out[72]: a date v d 1 2015-01-04 0 2015-01-04 2 2015-01-11 1 2015-01-11 3 2015-01-18 2 2015-01-18 4 2015-01-25 3 2015-01-25 5 2015-02-01 4 2015-02-01 In [73]: df.resample('M', on='date').sum() Out[73]: a date 2015-01-31 6 2015-02-28 4 In [74]: df.resample('M', level='d').sum() Out[74]: a d 2015-01-31 6 2015-02-28 4
The .get_credentials()
method of GbqConnector
can now first try to fetch the application default credentials. See the docs for more details (GH13577).
The .tz_localize()
method of DatetimeIndex
and Timestamp
has gained the errors
keyword, so you can potentially coerce nonexistent timestamps to NaT
. The default behavior remains to raise a NonExistentTimeError
(GH13057)
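A minimal sketch of the keyword as it works in this release (the errors keyword was later replaced in newer pandas; the Warsaw spring-forward date is our own example):

import pandas as pd

# 02:30 on 2015-03-29 does not exist in Europe/Warsaw (clocks jump 02:00 -> 03:00)
ts = pd.Timestamp('2015-03-29 02:30:00')

# default: raises NonExistentTimeError
# ts.tz_localize('Europe/Warsaw')

# with errors='coerce', the nonexistent time becomes NaT
print(ts.tz_localize('Europe/Warsaw', errors='coerce'))  # NaT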
.to_hdf/read_hdf()
now accept path objects (e.g. pathlib.Path
, py.path.local
) for the file path (GH11773)
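For example, a small sketch using pathlib (the file name example.h5 is hypothetical, and HDF support requires the PyTables package):

import pandas as pd
from pathlib import Path

df = pd.DataFrame({'a': [1, 2, 3]})

path = Path('example.h5')        # a Path object instead of a string
df.to_hdf(path, 'df', mode='w')  # requires PyTables to be installed
print(pd.read_hdf(path, 'df'))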
The pd.read_csv()
with engine='python'
has gained support for the decimal
(GH12933), na_filter
(GH13321) and the memory_map
option (GH13381).
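A minimal sketch of the decimal option with the Python engine, using made-up European-style data (';' separator, ',' decimal mark):

import pandas as pd
from io import StringIO

data = 'a;b\n1,5;2,5\n3,0;4,0'

# the Python engine now understands the decimal mark directly
df = pd.read_csv(StringIO(data), sep=';', decimal=',', engine='python')
print(df.dtypes)  # both columns parse as float64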
Consistent with the Python API, pd.read_csv()
will now interpret +inf
as positive infinity (GH13274)
The pd.read_html()
has gained support for the na_values
, converters
, keep_default_na
options (GH13461)
Categorical.astype()
now accepts an optional boolean argument copy
, effective when dtype is categorical (GH13209)
DataFrame
has gained the .asof()
method to return the last non-NaN values according to the selected subset (GH13358)
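For illustration, a small sketch of the method (the dates and values are our own example):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10.0, 20.0, np.nan, 40.0]},
                  index=pd.date_range('2016-01-01', periods=4))

# for each requested label, take the last row at or before it
# whose values are not NaN
print(df.asof(pd.to_datetime(['2016-01-03', '2016-01-04'])))
# 2016-01-03 falls back to the 2016-01-02 value (20.0); 2016-01-04 gives 40.0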
The DataFrame
constructor will now respect key ordering if a list of OrderedDict
objects are passed in (GH13304)
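A minimal sketch of the ordering behavior (the column names are arbitrary):

import pandas as pd
from collections import OrderedDict

rows = [OrderedDict([('b', 1), ('a', 2)]),
        OrderedDict([('b', 3), ('a', 4)])]

# columns come out in insertion order, ['b', 'a'], rather than sorted
print(pd.DataFrame(rows).columns.tolist())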
pd.read_html()
has gained support for the decimal
option (GH12907)
Series
has gained the properties .is_monotonic
, .is_monotonic_increasing
, .is_monotonic_decreasing
, similar to Index
(GH13336)
DataFrame.to_sql()
now allows a single value as the SQL type for all columns (GH11886).
Series.append
now supports the ignore_index
option (GH13677)
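A small sketch of the option as it works in this release (note that Series.append was deprecated and removed in much later pandas versions):

import pandas as pd

s1 = pd.Series([1, 2])
s2 = pd.Series([3, 4])

# without ignore_index the result keeps the duplicated labels 0, 1, 0, 1;
# with it, the result is renumbered 0..3
print(s1.append(s2, ignore_index=True))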
.to_stata()
and StataWriter
can now write variable labels to Stata dta files using a dictionary to map column names to labels (GH13535, GH13536)
.to_stata()
and StataWriter
will automatically convert datetime64[ns]
columns to Stata format %tc
, rather than raising a ValueError
(GH12259)
read_stata()
and StataReader
raise with a more explicit error message when reading Stata files with repeated value labels when convert_categoricals=True
(GH13923)
DataFrame.style
will now render sparsified MultiIndexes (GH11655)
DataFrame.style
will now show column level names (e.g. DataFrame.columns.names
) (GH13775)
DataFrame
has gained support to re-order the columns based on the values in a row using df.sort_values(by='...', axis=1)
(GH10806)
In [75]: df = pd.DataFrame({'A': [2, 7], 'B': [3, 5], 'C': [4, 8]}, ....: index=['row1', 'row2']) ....: In [76]: df Out[76]: A B C row1 2 3 4 row2 7 5 8 In [77]: df.sort_values(by='row2', axis=1) Out[77]: B A C row1 3 2 4 row2 5 7 8
Added documentation to I/O regarding the perils of reading in columns with mixed dtypes and how to handle it (GH13746)
to_html()
now has a border
argument to control the value in the opening <table>
tag. The default is the value of the html.border
option, which defaults to 1. This also affects the notebook HTML repr, but since Jupyter’s CSS includes a border-width attribute, the visual effect is the same. (GH11563).
Raise ImportError
in the sql functions when sqlalchemy
is not installed and a connection string is used (GH11920).
Compatibility with matplotlib 2.0. Older versions of pandas should also work with matplotlib 2.0 (GH13333)
Timestamp
, Period
, DatetimeIndex
, PeriodIndex
and .dt
accessor have gained a .is_leap_year
property to check whether the date belongs to a leap year. (GH13727)
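A minimal sketch of the property on a scalar and via the .dt accessor (freq='A' is the annual alias used in this era of pandas; the dates are our own example):

import pandas as pd

print(pd.Timestamp('2016-01-01').is_leap_year)   # True

# elementwise via the .dt accessor
s = pd.Series(pd.date_range('2015-12-31', periods=2, freq='A'))
print(s.dt.is_leap_year.tolist())                # [False, True]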
astype()
will now accept a dict of column name to data types mapping as the dtype
argument. (GH12086)
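For example, a small sketch casting only selected columns (the frame is our own example):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]})

# map column name -> dtype; unmentioned columns are left unchanged
out = df.astype({'a': 'float64'})
print(out.dtypes)  # a becomes float64, b stays float64 as before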
The pd.read_json
and DataFrame.to_json
has gained support for reading and writing json lines with lines
option see Line delimited json (GH9180)
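A minimal round-trip sketch (lines=True requires orient='records'; newer pandas may prefer wrapping the literal string in StringIO for read_json):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# one JSON object per line, suitable for streaming or appending
payload = df.to_json(orient='records', lines=True)
print(payload)                            # {"a":1,"b":"x"} then {"a":2,"b":"y"}
print(pd.read_json(payload, lines=True))  # round-trips to the original frame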
read_excel()
now supports the true_values and false_values keyword arguments (GH13347)
groupby()
will now accept a scalar and a single-element list for specifying level
on a non-MultiIndex
grouper. (GH13907)
Non-convertible dates in an excel date column will be returned without conversion and the column will be object
dtype, rather than raising an exception (GH10001).
pd.Timedelta(None)
is now accepted and will return NaT
, mirroring pd.Timestamp
(GH13687)
pd.read_stata()
can now handle some format 111 files, which are produced by SAS when generating Stata dta files (GH11526)
Series
and Index
now support divmod
which will return a tuple of series or indices. This behaves like a standard binary operator with regards to broadcasting rules (GH14208).
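For illustration, a minimal sketch with the built-in divmod (the numbers are arbitrary):

import pandas as pd

s = pd.Series([10, 25, 7])

# elementwise floor-division quotient and remainder in one call
q, r = divmod(s, 4)
print(q.tolist())  # [2, 6, 1]
print(r.tolist())  # [2, 1, 3]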
Series.tolist()
will now return Python types¶
Series.tolist()
will now return Python types in the output, mimicking NumPy .tolist()
behavior (GH10904)
In [78]: s = pd.Series([1,2,3])
Previous behavior:
In [7]: type(s.tolist()[0]) Out[7]: <class 'numpy.int64'>
New behavior:
In [79]: type(s.tolist()[0]) Out[79]: int
Series
operators for different indexes¶
The following Series
operators have been changed to make all operators consistent, including DataFrame
(GH1134, GH4581, GH13538)
- Series comparison operators now raise ValueError when their indexes are different.
- Series logical operators align the index of both the left and right hand side.
Warning
Until 0.18.1, comparing Series with the same length would succeed even if the .index are different (the result ignores .index). As of 0.19.0, this will raise a ValueError
to be more strict. This section also describes how to keep previous behavior or align different indexes, using the flexible comparison methods like .eq
.
As a result, Series
and DataFrame
operators behave as below:
Arithmetic operators align both index
(no changes).
In [80]: s1 = pd.Series([1, 2, 3], index=list('ABC')) In [81]: s2 = pd.Series([2, 2, 2], index=list('ABD')) In [82]: s1 + s2 Out[82]: A 3.0 B 4.0 C NaN D NaN dtype: float64 In [83]: df1 = pd.DataFrame([1, 2, 3], index=list('ABC')) In [84]: df2 = pd.DataFrame([2, 2, 2], index=list('ABD')) In [85]: df1 + df2 Out[85]: 0 A 3.0 B 4.0 C NaN D NaN
Comparison operators¶
Comparison operators raise ValueError
when .index
are different.
Previous Behavior (Series
):
Series
compared values ignoring the .index
as long as both had the same length:
In [1]: s1 == s2 Out[1]: A False B True C False dtype: bool
New behavior (Series
):
In [2]: s1 == s2 Out[2]: ValueError: Can only compare identically-labeled Series objects
Note
To achieve the same result as previous versions (compare values based on locations ignoring .index
), compare both .values
.
In [86]: s1.values == s2.values Out[86]: array([False, True, False], dtype=bool)
If you want to compare Series
aligning its .index
, see flexible comparison methods section below:
In [87]: s1.eq(s2) Out[87]: A False B True C False D False dtype: bool
Current Behavior (DataFrame
, no change):
In [3]: df1 == df2 Out[3]: ValueError: Can only compare identically-labeled DataFrame objects
Logical operators¶
Logical operators align both .index
of left and right hand side.
Previous behavior (Series
), only left hand side index
was kept:
In [4]: s1 = pd.Series([True, False, True], index=list('ABC')) In [5]: s2 = pd.Series([True, True, True], index=list('ABD')) In [6]: s1 & s2 Out[6]: A True B False C False dtype: bool
New behavior (Series
):
In [88]: s1 = pd.Series([True, False, True], index=list('ABC')) In [89]: s2 = pd.Series([True, True, True], index=list('ABD')) In [90]: s1 & s2 Out[90]: A True B False C False D False dtype: bool
Note
Series
logical operators fill a NaN
result with False
.
Note
To achieve the same result as previous versions (compare values based on only left hand side index), you can use reindex_like
:
In [91]: s1 & s2.reindex_like(s1) Out[91]: A True B False C False dtype: bool
Current Behavior (DataFrame
, no change):
In [92]: df1 = pd.DataFrame([True, False, True], index=list('ABC')) In [93]: df2 = pd.DataFrame([True, True, True], index=list('ABD')) In [94]: df1 & df2 Out[94]: 0 A True B False C NaN D NaN
Flexible comparison methods¶
Series
flexible comparison methods like eq
, ne
, le
, lt
, ge
and gt
now align both index
. Use these methods if you want to compare two Series that have different indexes.
In [95]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c']) In [96]: s2 = pd.Series([2, 2, 2], index=['b', 'c', 'd']) In [97]: s1.eq(s2) Out[97]: a False b True c False d False dtype: bool In [98]: s1.ge(s2) Out[98]: a False b True c True d False dtype: bool
Previously, this worked the same as comparison operators (see above).
.to_datetime()
changes¶
Previously if .to_datetime()
encountered mixed integers/floats and strings, but no datetimes with errors='coerce'
it would convert all to NaT
.
Previous behavior:
In [2]: pd.to_datetime([1, 'foo'], errors='coerce') Out[2]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)
Current behavior:
This will now convert integers/floats with the default unit of ns
.
In [104]: pd.to_datetime([1, 'foo'], errors='coerce') Out[104]: DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None)
Bug fixes related to .to_datetime()
:
- pd.to_datetime() when passing integers or floats, and no unit and errors='coerce' (GH13180)
- pd.to_datetime() when passing invalid datatypes (e.g. bool); will now respect the errors keyword (GH13176)
- pd.to_datetime() which overflowed on int8, and int16 dtypes (GH13451)
- pd.to_datetime() raise AttributeError with NaN and the other string is not valid when errors='ignore' (GH12424)
- pd.to_datetime() did not cast floats correctly when unit was specified, resulting in truncated datetime (GH13834)
Merging will now preserve the dtype of the join keys (GH8596)
In [105]: df1 = pd.DataFrame({'key': [1], 'v1': [10]}) In [106]: df1 Out[106]: key v1 0 1 10 In [107]: df2 = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]}) In [108]: df2 Out[108]: key v1 0 1 20 1 2 30
Previous behavior:
In [5]: pd.merge(df1, df2, how='outer') Out[5]: key v1 0 1.0 10.0 1 1.0 20.0 2 2.0 30.0 In [6]: pd.merge(df1, df2, how='outer').dtypes Out[6]: key float64 v1 float64 dtype: object
New behavior:
We are able to preserve the join keys
In [109]: pd.merge(df1, df2, how='outer') Out[109]: key v1 0 1 10 1 1 20 2 2 30 In [110]: pd.merge(df1, df2, how='outer').dtypes Out[110]: key int64 v1 int64 dtype: object
Of course if you have missing values that are introduced, then the resulting dtype will be upcast, which is unchanged from previous.
In [111]: pd.merge(df1, df2, how='outer', on='key') Out[111]: key v1_x v1_y 0 1 10.0 20 1 2 NaN 30 In [112]: pd.merge(df1, df2, how='outer', on='key').dtypes Out[112]: key int64 v1_x float64 v1_y int64 dtype: object
.describe()
changes¶
Percentile identifiers in the index of a .describe()
output will now be rounded to the least precision that keeps them distinct (GH13104)
In [113]: s = pd.Series([0, 1, 2, 3, 4]) In [114]: df = pd.DataFrame([0, 1, 2, 3, 4])
Previous behavior:
The percentiles were rounded to at most one decimal place, which could raise ValueError
for a data frame if the percentiles were duplicated.
In [3]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]) Out[3]: count 5.000000 mean 2.000000 std 1.581139 min 0.000000 0.0% 0.000400 0.1% 0.002000 0.1% 0.004000 50% 2.000000 99.9% 3.996000 100.0% 3.998000 100.0% 3.999600 max 4.000000 dtype: float64 In [4]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]) Out[4]: ... ValueError: cannot reindex from a duplicate axis
New behavior:
In [115]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]) Out[115]: count 5.000000 mean 2.000000 std 1.581139 min 0.000000 0.01% 0.000400 0.05% 0.002000 0.1% 0.004000 50% 2.000000 99.9% 3.996000 99.95% 3.998000 99.99% 3.999600 max 4.000000 dtype: float64 In [116]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]) Out[116]: 0 count 5.000000 mean 2.000000 std 1.581139 min 0.000000 0.01% 0.000400 0.05% 0.002000 0.1% 0.004000 50% 2.000000 99.9% 3.996000 99.95% 3.998000 99.99% 3.999600 max 4.000000
Furthermore:
- Passing duplicated percentiles will now raise a ValueError.
- Bug in .describe() on a DataFrame with a mixed-dtype column index, which would previously raise a TypeError (GH13288)
Period changes¶
PeriodIndex now has period dtype¶
PeriodIndex
now has its own period
dtype. The period
dtype is a pandas extension dtype like category
or the timezone aware dtype (datetime64[ns, tz]
) (GH13941). As a consequence of this change, PeriodIndex
no longer has an integer dtype:
Previous behavior:
In [1]: pi = pd.PeriodIndex(['2016-08-01'], freq='D') In [2]: pi Out[2]: PeriodIndex(['2016-08-01'], dtype='int64', freq='D') In [3]: pd.api.types.is_integer_dtype(pi) Out[3]: True In [4]: pi.dtype Out[4]: dtype('int64')
New behavior:
In [117]: pi = pd.PeriodIndex(['2016-08-01'], freq='D') In [118]: pi Out[118]: PeriodIndex(['2016-08-01'], dtype='period[D]', freq='D') In [119]: pd.api.types.is_integer_dtype(pi) Out[119]: False In [120]: pd.api.types.is_period_dtype(pi) Out[120]: True In [121]: pi.dtype Out[121]: period[D] In [122]: type(pi.dtype) Out[122]: pandas.core.dtypes.dtypes.PeriodDtype
Period('NaT')
now returns pd.NaT
¶
Previously, Period
has its own Period('NaT')
representation different from pd.NaT
. Now Period('NaT')
has been changed to return pd.NaT
. (GH12759, GH13582)
Previous behavior:
In [5]: pd.Period('NaT', freq='D') Out[5]: Period('NaT', 'D')
New behavior:
These now return pd.NaT, even without providing a freq option.
In [123]: pd.Period('NaT') Out[123]: NaT In [124]: pd.Period(None) Out[124]: NaT
To be compatible with Period
addition and subtraction, pd.NaT
now supports addition and subtraction with int
. Previously it raised ValueError
.
Previous behavior:
In [5]: pd.NaT + 1 ... ValueError: Cannot add integral value to Timestamp without freq.
New behavior:
In [125]: pd.NaT + 1 Out[125]: NaT In [126]: pd.NaT - 1 Out[126]: NaT
PeriodIndex.values
now returns array of Period
object¶
.values
is changed to return an array of Period
objects, rather than an array of integers (GH13988).
Previous behavior:
In [6]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M') In [7]: pi.values array([492, 493])
New behavior:
In [127]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M') In [128]: pi.values Out[128]: array([Period('2011-01', 'M'), Period('2011-02', 'M')], dtype=object)
Index + / - no longer used for set operations¶
Addition and subtraction of the base Index type and of DatetimeIndex (not the numeric index types) previously performed set operations (set union and difference). This behavior had been deprecated since 0.15.0 (in favor of using the specific .union()
and .difference()
methods), and is now disabled. When possible, +
and -
are now used for element-wise operations, for example for concatenating strings or subtracting datetimes (GH8227, GH14127).
Previous behavior:
In [1]: pd.Index(['a', 'b']) + pd.Index(['a', 'c']) FutureWarning: using '+' to provide set union with Indexes is deprecated, use '|' or .union() Out[1]: Index(['a', 'b', 'c'], dtype='object')
New behavior: the same operation will now perform element-wise addition:
In [129]: pd.Index(['a', 'b']) + pd.Index(['a', 'c']) Out[129]: Index(['aa', 'bc'], dtype='object')
Note that numeric Index objects already performed element-wise operations. For example, the behavior of adding two integer Indexes is unchanged. The base Index
is now made consistent with this behavior.
In [130]: pd.Index([1, 2, 3]) + pd.Index([2, 3, 4]) Out[130]: Int64Index([3, 5, 7], dtype='int64')
Further, because of this change, it is now possible to subtract two DatetimeIndex objects resulting in a TimedeltaIndex:
Previous behavior:
In [1]: pd.DatetimeIndex(['2016-01-01', '2016-01-02']) - pd.DatetimeIndex(['2016-01-02', '2016-01-03']) FutureWarning: using '-' to provide set differences with datetimelike Indexes is deprecated, use .difference() Out[1]: DatetimeIndex(['2016-01-01'], dtype='datetime64[ns]', freq=None)
New behavior:
In [131]: pd.DatetimeIndex(['2016-01-01', '2016-01-02']) - pd.DatetimeIndex(['2016-01-02', '2016-01-03']) Out[131]: TimedeltaIndex(['-1 days', '-1 days'], dtype='timedelta64[ns]', freq=None)
Index.difference
and .symmetric_difference
changes¶
Index.difference
and Index.symmetric_difference
will now, more consistently, treat NaN
values as any other values. (GH13514)
In [132]: idx1 = pd.Index([1, 2, 3, np.nan]) In [133]: idx2 = pd.Index([0, 1, np.nan])
Previous behavior:
In [3]: idx1.difference(idx2) Out[3]: Float64Index([nan, 2.0, 3.0], dtype='float64') In [4]: idx1.symmetric_difference(idx2) Out[4]: Float64Index([0.0, nan, 2.0, 3.0], dtype='float64')
New behavior:
In [134]: idx1.difference(idx2) Out[134]: Float64Index([2.0, 3.0], dtype='float64') In [135]: idx1.symmetric_difference(idx2) Out[135]: Float64Index([0.0, 2.0, 3.0], dtype='float64')
Index.unique
consistently returns Index
¶
Index.unique()
now returns unique values as an Index
of the appropriate dtype
. (GH13395). Previously, most Index
classes returned np.ndarray
, and DatetimeIndex
, TimedeltaIndex
and PeriodIndex
returned Index
to keep metadata like timezone.
Previous behavior:
In [1]: pd.Index([1, 2, 3]).unique() Out[1]: array([1, 2, 3]) In [2]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique() Out[2]: DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00', '2011-01-03 00:00:00+09:00'], dtype='datetime64[ns, Asia/Tokyo]', freq=None)
New behavior:
In [136]: pd.Index([1, 2, 3]).unique() Out[136]: Int64Index([1, 2, 3], dtype='int64') In [137]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique() Out[137]: DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00', '2011-01-03 00:00:00+09:00'], dtype='datetime64[ns, Asia/Tokyo]', freq=None)
MultiIndex
constructors, groupby
and set_index
preserve categorical dtypes¶
MultiIndex.from_arrays
and MultiIndex.from_product
will now preserve categorical dtype in MultiIndex
levels (GH13743, GH13854).
In [138]: cat = pd.Categorical(['a', 'b'], categories=list("bac")) In [139]: lvl1 = ['foo', 'bar'] In [140]: midx = pd.MultiIndex.from_arrays([cat, lvl1]) In [141]: midx Out[141]: MultiIndex(levels=[['b', 'a', 'c'], ['bar', 'foo']], labels=[[1, 0], [1, 0]])
Previous behavior:
In [4]: midx.levels[0] Out[4]: Index(['b', 'a', 'c'], dtype='object') In [5]: midx.get_level_values[0] Out[5]: Index(['a', 'b'], dtype='object')
New behavior: the single level is now a CategoricalIndex
:
In [142]: midx.levels[0] Out[142]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, dtype='category') In [143]: midx.get_level_values(0) Out[143]: CategoricalIndex(['a', 'b'], categories=['b', 'a', 'c'], ordered=False, dtype='category')
An analogous change has been made to MultiIndex.from_product
. As a consequence, groupby
and set_index
also preserve categorical dtypes in indexes
In [144]: df = pd.DataFrame({'A': [0, 1], 'B': [10, 11], 'C': cat}) In [145]: df_grouped = df.groupby(by=['A', 'C']).first() In [146]: df_set_idx = df.set_index(['A', 'C'])
Previous behavior:
In [11]: df_grouped.index.levels[1] Out[11]: Index(['b', 'a', 'c'], dtype='object', name='C') In [12]: df_grouped.reset_index().dtypes Out[12]: A int64 C object B float64 dtype: object In [13]: df_set_idx.index.levels[1] Out[13]: Index(['b', 'a', 'c'], dtype='object', name='C') In [14]: df_set_idx.reset_index().dtypes Out[14]: A int64 C object B int64 dtype: object
New behavior:
In [147]: df_grouped.index.levels[1] Out[147]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, name='C', dtype='category') In [148]: df_grouped.reset_index().dtypes Out[148]: A int64 C category B float64 dtype: object In [149]: df_set_idx.index.levels[1] Out[149]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, name='C', dtype='category') In [150]: df_set_idx.reset_index().dtypes Out[150]: A int64 C category B int64 dtype: object
read_csv
will progressively enumerate chunks¶
When read_csv()
is called with chunksize=n
and without specifying an index, each chunk used to have an independently generated index from 0
to n-1
. They are now given instead a progressive index, starting from 0
for the first chunk, from n
for the second, and so on, so that, when concatenated, they are identical to the result of calling read_csv()
without the chunksize=
argument (GH12185).
In [151]: data = 'A,B\n0,1\n2,3\n4,5\n6,7'
Previous behavior:
In [2]: pd.concat(pd.read_csv(StringIO(data), chunksize=2)) Out[2]: A B 0 0 1 1 2 3 0 4 5 1 6 7
New behavior:
In [152]: pd.concat(pd.read_csv(StringIO(data), chunksize=2)) Out[152]: A B 0 0 1 1 2 3 2 4 5 3 6 7
Sparse Changes¶
These changes allow pandas to handle sparse data with more dtypes, and aim to provide a smoother experience with data handling.
int64
and bool
support enhancements¶
Sparse data structures have gained enhanced support for int64 and bool dtypes (GH667, GH13849).
Previously, sparse data were float64
dtype by default, even if all inputs were of int
or bool
dtype. You had to specify dtype
explicitly to create sparse data with int64
dtype. Also, fill_value
had to be specified explicitly because the default was np.nan
which doesn’t appear in int64
or bool
data.
In [1]: pd.SparseArray([1, 2, 0, 0]) Out[1]: [1.0, 2.0, 0.0, 0.0] Fill: nan IntIndex Indices: array([0, 1, 2, 3], dtype=int32) # specifying int64 dtype, but all values are stored in sp_values because # fill_value default is np.nan In [2]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64) Out[2]: [1, 2, 0, 0] Fill: nan IntIndex Indices: array([0, 1, 2, 3], dtype=int32) In [3]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64, fill_value=0) Out[3]: [1, 2, 0, 0] Fill: 0 IntIndex Indices: array([0, 1], dtype=int32)
As of v0.19.0, sparse data keeps the input dtype, and uses more appropriate fill_value
defaults (0
for int64
dtype, False
for bool
dtype).
In [153]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64) Out[153]: [1, 2, 0, 0] Fill: 0 IntIndex Indices: array([0, 1], dtype=int32) In [154]: pd.SparseArray([True, False, False, False]) Out[154]: [True, False, False, False] Fill: False IntIndex Indices: array([0], dtype=int32)
See the docs for more details.
Operators now preserve dtypes¶
Sparse data structures can now preserve dtype after arithmetic ops (GH13848)
In [155]: s = pd.SparseSeries([0, 2, 0, 1], fill_value=0, dtype=np.int64) In [156]: s.dtype Out[156]: dtype('int64') In [157]: s + 1 Out[157]: 0 1 1 3 2 1 3 2 dtype: int64 BlockIndex Block locations: array([1, 3], dtype=int32) Block lengths: array([1, 1], dtype=int32)
Sparse data structures now support astype to convert the internal dtype (GH13900)
In [158]: s = pd.SparseSeries([1., 0., 2., 0.], fill_value=0) In [159]: s Out[159]: 0 1.0 1 0.0 2 2.0 3 0.0 dtype: float64 BlockIndex Block locations: array([0, 2], dtype=int32) Block lengths: array([1, 1], dtype=int32) In [160]: s.astype(np.int64) Out[160]: 0 1 1 0 2 2 3 0 dtype: int64 BlockIndex Block locations: array([0, 2], dtype=int32) Block lengths: array([1, 1], dtype=int32)
astype fails if data contains values which cannot be converted to the specified dtype. Note that this limitation also applies to fill_value, whose default is np.nan.
In [7]: pd.SparseSeries([1., np.nan, 2., np.nan], fill_value=np.nan).astype(np.int64) Out[7]: ValueError: unable to coerce current fill_value nan to int64 dtype
SparseDataFrame
and SparseSeries
now preserve class types when slicing or transposing. (GH13787)SparseArray
with bool
dtype now supports logical (bool) operators (GH14000)SparseSeries
with MultiIndex
[]
indexing may raise IndexError
(GH13144)SparseSeries
with MultiIndex
[]
indexing result may have normal Index
(GH13144)SparseDataFrame
in which axis=None
did not default to axis=0
(GH13048)SparseSeries
and SparseDataFrame
creation with object
dtype may raise TypeError
(GH11633)SparseDataFrame
doesn’t respect passed SparseArray
or SparseSeries
‘s dtype and fill_value
(GH13866)SparseArray
and SparseSeries
don’t apply ufunc to fill_value
(GH13853)SparseSeries.abs
incorrectly keeps negative fill_value
(GH13853)SparseDataFrame
s, types were previously forced to float (GH13917)SparseSeries
slicing changes integer dtype to float (GH8292)SparseDataFrame
comparison ops may raise TypeError
(GH13001)SparseDataFrame.isnull
raises ValueError
(GH8276)SparseSeries
representation with bool
dtype may raise IndexError
(GH13110)SparseSeries
and SparseDataFrame
of bool
or int64
dtype may display its values like float64
dtype (GH13110)SparseArray
with bool
dtype may return incorrect result (GH13985)SparseArray
created from SparseSeries
may lose dtype
(GH13999)SparseSeries
comparison with dense returns normal Series
rather than SparseSeries
(GH13999)Note
This change only affects 64-bit Python running on Windows, and only affects relatively advanced indexing operations.
Methods such as Index.get_indexer
that return an indexer array coerce that array to a “platform int”, so that it can be used directly in third-party library operations like numpy.take
. Previously, a platform int was defined as np.int_
which corresponds to a C integer, but the correct type, and what is being used now, is np.intp
, which corresponds to the C integer size that can hold a pointer (GH3033, GH13972).
These types are the same on many platforms, but for 64-bit Python on Windows, np.int_
is 32 bits, and np.intp
is 64 bits. Changing this behavior improves performance for many operations on that platform.
Previous behavior:
In [1]: i = pd.Index(['a', 'b', 'c']) In [2]: i.get_indexer(['b', 'b', 'c']).dtype Out[2]: dtype('int32')
New behavior:
In [1]: i = pd.Index(['a', 'b', 'c']) In [2]: i.get_indexer(['b', 'b', 'c']).dtype Out[2]: dtype('int64')Other API Changes¶
Timestamp.to_pydatetime
will issue a UserWarning
when warn=True
, and the instance has a non-zero number of nanoseconds, previously this would print a message to stdout (GH14101).Series.unique()
with datetime and timezone now returns an array of Timestamp
with timezone (GH13565).Panel.to_sparse()
will raise a NotImplementedError
exception when called (GH13778).Index.reshape()
will raise a NotImplementedError
exception when called (GH12882)..filter()
enforces mutual exclusion of the keyword arguments (GH12399).eval
‘s upcasting rules for float32
types have been updated to be more consistent with NumPy’s rules. New behavior will not upcast to float64
if you multiply a pandas float32
object by a scalar float64 (GH12388).UnsupportedFunctionCall
error is now raised if NumPy ufuncs like np.mean
are called on groupby or resample objects (GH12811).__setitem__
will no longer apply a callable rhs as a function instead of storing it. Call where
directly to get the previous behavior (GH13299)..sample()
will respect the random seed set via numpy.random.seed(n)
(GH13161)Styler.apply
is now more strict about the outputs your function must return. For axis=0
or axis=1
, the output shape must be identical. For axis=None
, the output must be a DataFrame with identical columns and index labels (GH13222).Float64Index.astype(int)
will now raise ValueError
if Float64Index
contains NaN
values (GH13149)TimedeltaIndex.astype(int)
and DatetimeIndex.astype(int)
will now return Int64Index
instead of np.array
(GH13209)Period
with multiple frequencies to normal Index
now returns Index
with object
dtype (GH13664)PeriodIndex.fillna
with Period
has different freq now coerces to object
dtype (GH13664)DataFrame.boxplot(by=col)
now return a Series
when return_type
is not None. Previously these returned an OrderedDict
. Note that when return_type=None
, the default, these still return a 2-D NumPy array (GH12216, GH7096).pd.read_hdf
will now raise a ValueError
instead of KeyError
, if a mode other than r
, r+
and a
is supplied. (GH13623)pd.read_csv()
, pd.read_table()
, and pd.read_hdf()
raise the builtin FileNotFoundError
exception for Python 3.x when called on a nonexistent file; this is back-ported as IOError
in Python 2.x (GH14086)CParserError
(GH13652).pd.read_csv()
in the C engine will now issue a ParserWarning
or raise a ValueError
when sep
encoded is more than one character long (GH14065)DataFrame.values
will now return float64
with a DataFrame
of mixed int64
and uint64
dtypes, conforming to np.find_common_type
(GH10364, GH13917).groupby.groups
will now return a dictionary of Index
objects, rather than a dictionary of np.ndarray
or lists
(GH14293)Series.reshape
and Categorical.reshape
have been deprecated and will be removed in a subsequent release (GH12882, GH12882)PeriodIndex.to_datetime
has been deprecated in favor of PeriodIndex.to_timestamp
(GH8254)Timestamp.to_datetime
has been deprecated in favor of Timestamp.to_pydatetime
(GH8254)Index.to_datetime
and DatetimeIndex.to_datetime
have been deprecated in favor of pd.to_datetime
(GH8254)pandas.core.datetools
module has been deprecated and will be removed in a subsequent release (GH14094)SparseList
has been deprecated and will be removed in a future version (GH13784)DataFrame.to_html()
and DataFrame.to_latex()
have dropped the colSpace
parameter in favor of col_space
(GH13857)DataFrame.to_sql()
has deprecated the flavor
parameter, as it is superfluous when SQLAlchemy is not installed (GH13611)read_csv
keywords:
compact_ints
and use_unsigned
have been deprecated and will be removed in a future version (GH13320)buffer_lines
has been deprecated and will be removed in a future version (GH13360)as_recarray
has been deprecated and will be removed in a future version (GH13373)skip_footer
has been deprecated in favor of skipfooter
and will be removed in a future version (GH13349)pd.ordered_merge()
has been renamed to pd.merge_ordered()
and the original name will be removed in a future version (GH13358)Timestamp.offset
property (and named arg in the constructor), has been deprecated in favor of freq
(GH12160)pd.tseries.util.pivot_annual
is deprecated. Use pivot_table
as alternative, an example is here (GH736)pd.tseries.util.isleapyear
has been deprecated and will be removed in a subsequent release. Datetime-likes now have a .is_leap_year
property (GH13727)Panel4D
and PanelND
constructors are deprecated and will be removed in a future version. The recommended way to represent these types of n-dimensional data are with the xarray package. Pandas provides a to_xarray()
method to automate this conversion (GH13564).pandas.tseries.frequencies.get_standard_freq
is deprecated. Use pandas.tseries.frequencies.to_offset(freq).rule_code
instead (GH13874)pandas.tseries.frequencies.to_offset
‘s freqstr
keyword is deprecated in favor of freq
(GH13874)Categorical.from_array
has been deprecated and will be removed in a future version (GH13854)SparsePanel
class has been removed (GH13778)pd.sandbox
module has been removed in favor of the external library pandas-qt
(GH13670)pandas.io.data
and pandas.io.wb
modules are removed in favor of the pandas-datareader package (GH13724).pandas.tools.rplot
module has been removed in favor of the seaborn package (GH13855)DataFrame.to_csv()
has dropped the engine
parameter, as was deprecated in 0.17.1 (GH11274, GH13419)DataFrame.to_dict()
has dropped the outtype
parameter in favor of orient
(GH13627, GH8486)pd.Categorical
has dropped setting of the ordered
attribute directly in favor of the set_ordered
method (GH13671)pd.Categorical
has dropped the levels
attribute in favor of categories
(GH8376)DataFrame.to_sql()
has dropped the mysql
option for the flavor
parameter (GH13611)Panel.shift()
has dropped the lags
parameter in favor of periods
(GH14041)pd.Index
has dropped the diff
method in favor of difference
(GH13669)pd.DataFrame
has dropped the to_wide
method in favor of to_panel
(GH14039)Series.to_csv
has dropped the nanRep
parameter in favor of na_rep
(GH13804)Series.xs
, DataFrame.xs
, Panel.xs
, Panel.major_xs
, and Panel.minor_xs
have dropped the copy
parameter (GH13781)str.split
has dropped the return_type
parameter in favor of expand
(GH13701)ValueError
. For the list of currently supported offsets, see here.return_type
parameter for DataFrame.plot.box
and DataFrame.boxplot
changed from None
to "axes"
. These methods will now return a matplotlib axes by default instead of a dictionary of artists. See here (GH6581).tquery
and uquery
functions in the pandas.io.sql
module are removed (GH5950).IntIndex.intersect
(GH13082)BlockIndex
when the number of blocks are large, though recommended to use IntIndex
in such cases (GH13082)DataFrame.quantile()
as it now operates per-block (GH11623)DataFrameGroupBy.transform
(GH12737)Index
and Series
.duplicated
(GH10235)Index.difference
(GH12044)RangeIndex.is_monotonic_increasing
and is_monotonic_decreasing
(GH13749)DatetimeIndex
(GH13692)Period
(GH12817)factorize
of datetime with timezone (GH13750)groupby.groups
(GH14293)groupby().shift()
, which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (GH13813)groupby().cumsum()
calculating cumprod
when axis=1
. (GH13994)pd.to_timedelta()
in which the errors
parameter was not being respected (GH13613)io.json.json_normalize()
, where non-ascii keys raised an exception (GH13213)Series
as xerr
or yerr
in .plot()
(GH11858)DataFrame
assignment with an object-dtyped Index
where the resultant column is mutable to the original object. (GH13522)AutoDataFormatter
; this restores the second scaled formatting and re-adds micro-second scaled formatting (GH13131)HDFStore
with a fixed format and start
and/or stop
specified will now return the selected range (GH8287)Categorical.from_codes()
where an unhelpful error was raised when an invalid ordered
parameter was passed in (GH14058)Series
construction from a tuple of integers on windows not returning default dtype (int64) (GH13646)TimedeltaIndex
addition with a Datetime-like object where addition overflow was not being caught (GH14068).groupby(..).resample(..)
when the same object is called multiple times (GH13174).to_records()
when index name is a unicode string (GH13172).memory_usage()
on object which doesn’t implement (GH12924)Series.quantile
with nans (also shows up in .median()
and .describe()
); furthermore now names the Series
with the quantile (GH13098, GH13146)SeriesGroupBy.transform
with datetime values and missing groups (GH13191)Series
were incorrectly coerced in datetime-like numeric operations (GH13844)Categorical
constructor when passed a Categorical
containing datetimes with timezones (GH14190)Series.str.extractall()
with str
index raises ValueError
(GH13156)Series.str.extractall()
with single group and quantifier (GH13382)DatetimeIndex
and Period
subtraction raises ValueError
or AttributeError
rather than TypeError
(GH13078)Index
and Series
created with NaN
and NaT
mixed data may not have datetime64
dtype (GH13324)Index
and Series
may ignore np.datetime64('nat')
and np.timedelta64('nat')
to infer dtype (GH13324)PeriodIndex
and Period
subtraction raises AttributeError
(GH13071)PeriodIndex
construction returning a float64
index in some circumstances (GH13067).resample(..)
with a PeriodIndex
not changing its freq
appropriately when empty (GH13067).resample(..)
with a PeriodIndex
not retaining its type or name when the DataFrame
is empty (GH13212)groupby(..).apply(..)
when the passed function returns scalar values per group (GH13468).groupby(..).resample(..)
where passing some keywords would raise an exception (GH13235).tz_convert
on a tz-aware DateTimeIndex
that relied on index being sorted for correct results (GH13306).tz_localize
with dateutil.tz.tzlocal
may return incorrect result (GH13583)DatetimeTZDtype
dtype with dateutil.tz.tzlocal
cannot be regarded as valid dtype (GH13583)pd.read_hdf()
where attempting to load an HDF file with a single dataset, that had one or more categorical columns, failed unless the key argument was set to the name of the dataset. (GH13231).rolling()
that allowed a negative integer window in contruction of the Rolling()
object, but would later fail on aggregation (GH13383)Series
indexing with tuple-valued data and a numeric index (GH13509)pd.DataFrame
where unusual elements with the object
dtype were causing segfaults (GH13717)Series
which could result in segfaults (GH13445)DatetimeIndex
, which did not honour the copy=True
(GH13205)DatetimeIndex.is_normalized
returns incorrectly for normalized date_range in case of local timezones (GH13459)pd.concat
and .append
may coerce datetime64
and timedelta
to object
dtype containing python built-in datetime
or timedelta
rather than Timestamp
or Timedelta
(GH13626)PeriodIndex.append
may raise AttributeError
when the result is object
dtype (GH13221)CategoricalIndex.append
may accept normal list
(GH13626)pd.concat
and .append
with the same timezone get reset to UTC (GH7795)Series
and DataFrame
.append
raises AmbiguousTimeError
if data contains datetime near DST boundary (GH13626)DataFrame.to_csv()
in which float values were being quoted even though quotations were specified for non-numeric values only (GH12922, GH13259)DataFrame.describe()
raising ValueError
with only boolean columns (GH13898)MultiIndex
slicing where extra elements were returned when level is non-unique (GH12896).str.replace
does not raise TypeError
for invalid replacement (GH13438)MultiIndex.from_arrays
which didn’t check for input array lengths matching (GH13599)cartesian_product
and MultiIndex.from_product
which may raise with empty input arrays (GH12258)pd.read_csv()
which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (GH13703)pd.read_csv()
which caused errors to be raised when a dictionary containing scalars is passed in for na_values
(GH12224)pd.read_csv()
which caused BOM files to be incorrectly parsed by not ignoring the BOM (GH4793)pd.read_csv()
with engine='python'
which raised errors when a numpy array was passed in for usecols
(GH12546)pd.read_csv()
where the index columns were being incorrectly parsed when parsed as dates with a thousands
parameter (GH14066)pd.read_csv()
with engine='python'
in which NaN
values weren’t being detected after data was converted to numeric values (GH13314)pd.read_csv()
in which the nrows
argument was not properly validated for both engines (GH10476)pd.read_csv()
with engine='python'
in which infinities of mixed-case forms were not being interpreted properly (GH13274)pd.read_csv()
with engine='python'
in which trailing NaN
values were not being parsed (GH13320)pd.read_csv()
with engine='python'
when reading from a tempfile.TemporaryFile
on Windows with Python 3 (GH13398)pd.read_csv()
that prevents usecols
kwarg from accepting single-byte unicode strings (GH13219)pd.read_csv()
that prevents usecols
from being an empty set (GH13402)pd.read_csv()
in the C engine where the NULL character was not being parsed as NULL (GH14012)pd.read_csv()
with engine='c'
in which NULL quotechar
was not accepted even though quoting
was specified as None
(GH13411)pd.read_csv()
with engine='c'
in which fields were not properly cast to float when quoting was specified as non-numeric (GH13411)pd.read_csv()
in Python 2.x with non-UTF8 encoded, multi-character separated data (GH3404)pd.read_csv()
, where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (GH13549)pd.read_csv
, pd.read_table
, pd.read_fwf
, pd.read_stata
and pd.read_sas
where files were opened by parsers but not closed if both chunksize
and iterator
were None
. (GH13940)StataReader
, StataWriter
, XportReader
and SAS7BDATReader
where a file was not properly closed when an error was raised. (GH13940)pd.pivot_table()
where margins_name
is ignored when aggfunc
is a list (GH13354)pd.Series.str.zfill
, center
, ljust
, rjust
, and pad
when passing non-integers, did not raise TypeError
(GH13598)TimedeltaIndex
, which always returned True
(GH13603)Series
arithmetic raises TypeError
if it contains datetime-like as object
dtype (GH13043)Series.isnull()
and Series.notnull()
ignore Period('NaT')
(GH13737)Series.fillna()
and Series.dropna()
don’t affect Period('NaT')
(GH13737).fillna(value=np.nan)
incorrectly raises KeyError
on a category
dtyped Series
(GH14021).resample(..)
where incorrect warnings were triggered by IPython introspection (GH13618)NaT
- Period
raises AttributeError
(GH13071)Series
comparison may output incorrect result if rhs contains NaT
(GH9005)Series
and Index
comparison may output incorrect result if it contains NaT
with object
dtype (GH13592)Period
addition raises TypeError
if Period
is on the right hand side (GH13069)Period
and Series
or Index
comparison raises TypeError
(GH13200)pd.set_eng_float_format()
that would prevent NaN and Inf from formatting (GH11981).unstack
with Categorical
dtype resets .ordered
to True
(GH13249)factorize
raises AmbiguousTimeError
if data contains datetime near DST boundary (GH13750).set_index
raises AmbiguousTimeError
if new index contains DST boundary and multi levels (GH12920).shift
raises AmbiguousTimeError
if data contains datetime near DST boundary (GH13926)pd.read_hdf()
returns incorrect result when a DataFrame
with a categorical
column and a query which doesn’t match any values (GH13792).iloc
when indexing with a non lex-sorted MultiIndex (GH13797).loc
when indexing with date strings in a reverse sorted DatetimeIndex
(GH14316)Series
comparison operators when dealing with zero dim NumPy arrays (GH13006).combine_first
may return incorrect dtype
(GH7630, GH10567)groupby
where apply
returns different result depending on whether first result is None
or not (GH12824)groupby(..).nth()
where the group key is included inconsistently if called after .head()/.tail()
(GH12839).to_html
, .to_latex
and .to_string
silently ignore custom datetime formatter passed through the formatters
keyword (GH10690)DataFrame.iterrows()
, not yielding a Series
subclass if defined (GH13977)pd.to_numeric
when errors='coerce'
and input contains non-hashable objects (GH13324)Timedelta
arithmetic and comparison may raise ValueError
rather than TypeError
(GH13624)to_datetime
and DatetimeIndex
may raise TypeError
rather than ValueError
(GH11169, GH11287)Index
created with tz-aware Timestamp
and mismatched tz
option incorrectly coerces timezone (GH13692)DatetimeIndex
with nanosecond frequency does not include timestamp specified with end
(GH13672)`Series`
when setting a slice with a `np.timedelta64`
(GH14155)Index
raises OutOfBoundsDatetime
if datetime
exceeds datetime64[ns]
bounds, rather than coercing to object
dtype (GH13663)Index
may ignore specified datetime64
or timedelta64
passed as dtype
(GH13981)RangeIndex
can now be created without arguments rather than raising TypeError
(GH13793).value_counts()
raises OutOfBoundsDatetime
if data exceeds datetime64[ns]
bounds (GH13663)DatetimeIndex
may raise OutOfBoundsDatetime
if input np.datetime64
has other unit than ns
(GH9114)Series
creation with np.datetime64
which has other unit than ns
as object
dtype results in incorrect values (GH13876)resample
with timedelta data where data was cast to float (GH13119).pd.isnull()
pd.notnull()
raise TypeError
if input datetime-like has other unit than ns
(GH13389)pd.merge()
may raise TypeError
if input datetime-like has other unit than ns
(GH13389)HDFStore
/read_hdf()
discarded DatetimeIndex.name
if tz
was set (GH13884)Categorical.remove_unused_categories()
changes .codes
dtype to platform int (GH13261)groupby
with as_index=False
returns all NaN’s when grouping on multiple columns including a categorical one (GH13204)df.groupby(...)[...]
where getitem with Int64Index
raised an error (GH13731)DataFrame.style
for index names. Previously they were assigned "col_heading level<n> col<c>"
where n
was the number of levels + 1. Now they are assigned "index_name level<n>"
, where n
is the correct level for that MultiIndex.pd.read_gbq()
could throw ImportError: No module named discovery
as a result of a naming conflict with another python package called apiclient (GH13454)Index.union
returns an incorrect result with a named empty index (GH13432)Index.difference
and DataFrame.join
raise in Python3 when using mixed-integer indexes (GH13432, GH12814)datetime.datetime
from tz-aware datetime64
series (GH14088).to_excel()
when DataFrame contains a MultiIndex which contains a label with a NaN value (GH13511)ValueError
(GH13930)concat
and groupby
for hierarchical frames with RangeIndex
levels (GH13542).Series.str.contains()
for Series containing only NaN
values of object
dtype (GH14171)agg()
function on groupby dataframe changes dtype of datetime64[ns]
column to float64
(GH12821)PeriodIndex
to add or subtract integer raise IncompatibleFrequency
. Note that using standard operators like +
or -
is recommended, because they use a more efficient path (GH13980)NaT
returning float
instead of datetime64[ns]
(GH12941)Series
flexible arithmetic methods (like .add()
) raises ValueError
when axis=None
(GH13894)DataFrame.to_csv()
with MultiIndex
columns in which a stray empty line was added (GH6618)DatetimeIndex
, TimedeltaIndex
and PeriodIndex.equals()
may return True
when input isn’t Index
but contains the same values (GH13107)pd.eval()
and HDFStore
query truncating long float literals with python 2 (GH14241)Index
raises KeyError
displaying incorrect column when column is not in the df and columns contains duplicate values (GH13822)Period
and PeriodIndex
creating wrong dates when frequency has combined offset aliases (GH13874).to_string()
when called with an integer line_width
and index=False
raises an UnboundLocalError exception because idx
was referenced before assignment.eval()
where the resolvers
argument would not accept a list (GH14095)stack
, get_dummies
, make_axis_dummies
which don’t preserve categorical dtypes in (multi)indexes (GH13854)PeriodIndex
can now accept list
and array
which contain pd.NaT
(GH13430)df.groupby
where .median()
returns arbitrary values if grouped dataframe contains empty bins (GH13629)Index.copy()
where name
parameter was ignored (GH14302)This is a minor bug-fix release from 0.18.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
.groupby(...)
has been enhanced to provide convenient syntax when working with .rolling(..)
, .expanding(..)
and .resample(..)
per group, see herepd.to_datetime()
has gained the ability to assemble dates from a DataFrame
, see heresparse
, see hereThe CustomBusinessHour
is a mixture of BusinessHour
and CustomBusinessDay
which allows you to specify arbitrary holidays. For details, see Custom Business Hour (GH11514)
In [1]: from datetime import datetime In [2]: from pandas.tseries.offsets import CustomBusinessHour In [3]: from pandas.tseries.holiday import USFederalHolidayCalendar In [4]: bhour_us = CustomBusinessHour(calendar=USFederalHolidayCalendar())
Friday before MLK Day
In [5]: dt = datetime(2014, 1, 17, 15) In [6]: dt + bhour_us Out[6]: Timestamp('2014-01-17 16:00:00')
Tuesday after MLK Day (Monday is skipped because it’s a holiday)
In [7]: dt + bhour_us * 2 Out[7]: Timestamp('2014-01-20 09:00:00')
.groupby(..)
syntax with window and resample operations¶
.groupby(...)
has been enhanced to provide convenient syntax when working with .rolling(..)
, .expanding(..)
and .resample(..)
per group, see (GH12486, GH12738).
You can now use .rolling(..)
and .expanding(..)
as methods on groupbys. These return another deferred object (similar to what .rolling()
and .expanding()
do on ungrouped pandas objects). You can then operate on these RollingGroupby
objects in a similar manner.
Previously you would have to do this to get a rolling window mean per-group:
In [7]: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8, ...: 'B': np.arange(40)}) ...: In [8]: df Out[8]: A B 0 1 0 1 1 1 2 1 2 3 1 3 4 1 4 5 1 5 6 1 6 .. .. .. 33 3 33 34 3 34 35 3 35 36 3 36 37 3 37 38 3 38 39 3 39 [40 rows x 2 columns]
In [9]: df.groupby('A').apply(lambda x: x.rolling(4).B.mean()) Out[9]: A 1 0 NaN 1 NaN 2 NaN 3 1.5 4 2.5 5 3.5 6 4.5 ... 3 33 NaN 34 NaN 35 33.5 36 34.5 37 35.5 38 36.5 39 37.5 Name: B, Length: 40, dtype: float64
Now you can do:
In [10]: df.groupby('A').rolling(4).B.mean() Out[10]: A 1 0 NaN 1 NaN 2 NaN 3 1.5 4 2.5 5 3.5 6 4.5 ... 3 33 NaN 34 NaN 35 33.5 36 34.5 37 35.5 38 36.5 39 37.5 Name: B, Length: 40, dtype: float64
For .resample(..)
type of operations, previously you would have to:
In [11]: df = pd.DataFrame({'date': pd.date_range(start='2016-01-01', ....: periods=4, ....: freq='W'), ....: 'group': [1, 1, 2, 2], ....: 'val': [5, 6, 7, 8]}).set_index('date') ....: In [12]: df Out[12]: group val date 2016-01-03 1 5 2016-01-10 1 6 2016-01-17 2 7 2016-01-24 2 8
In [13]: df.groupby('group').apply(lambda x: x.resample('1D').ffill()) Out[13]: group val group date 1 2016-01-03 1 5 2016-01-04 1 5 2016-01-05 1 5 2016-01-06 1 5 2016-01-07 1 5 2016-01-08 1 5 2016-01-09 1 5 ... ... ... 2 2016-01-18 2 7 2016-01-19 2 7 2016-01-20 2 7 2016-01-21 2 7 2016-01-22 2 7 2016-01-23 2 7 2016-01-24 2 8 [16 rows x 2 columns]
Now you can do:
In [14]: df.groupby('group').resample('1D').ffill() Out[14]: group val group date 1 2016-01-03 1 5 2016-01-04 1 5 2016-01-05 1 5 2016-01-06 1 5 2016-01-07 1 5 2016-01-08 1 5 2016-01-09 1 5 ... ... ... 2 2016-01-18 2 7 2016-01-19 2 7 2016-01-20 2 7 2016-01-21 2 7 2016-01-22 2 7 2016-01-23 2 7 2016-01-24 2 8 [16 rows x 2 columns]Method chaining improvements¶
The following methods / indexers now accept a callable
. It is intended to make these more useful in method chains, see the documentation. (GH11485, GH12533)
.where()
and .mask()
.loc[]
, iloc[]
and .ix[]
[]
indexing.where()
and .mask()
¶
These can accept a callable for the condition and other
arguments.
In [15]: df = pd.DataFrame({'A': [1, 2, 3], ....: 'B': [4, 5, 6], ....: 'C': [7, 8, 9]}) ....: In [16]: df.where(lambda x: x > 4, lambda x: x + 10) Out[16]: A B C 0 11 14 7 1 12 5 8 2 13 6 9
.loc[]
, .iloc[]
, .ix[]
¶
These can accept a callable, or a tuple of callables, as a slicer. The callable can return a valid boolean indexer or anything which is valid for these indexers’ input.
# callable returns bool indexer In [17]: df.loc[lambda x: x.A >= 2, lambda x: x.sum() > 10] Out[17]: B C 1 5 8 2 6 9 # callable returns list of labels In [18]: df.loc[lambda x: [1, 2], lambda x: ['A', 'B']] Out[18]: A B 1 2 5 2 3 6
[]
indexing¶
Finally, you can use a callable in []
indexing of Series, DataFrame and Panel. The callable must return a valid input for []
indexing depending on its class and index type.
In [19]: df[lambda x: 'A'] Out[19]: 0 1 1 2 2 3 Name: A, dtype: int64
Using these methods / indexers, you can chain data selection operations without using a temporary variable.
In [20]: bb = pd.read_csv('data/baseball.csv', index_col='id') In [21]: (bb.groupby(['year', 'team']) ....: .sum() ....: .loc[lambda df: df.r > 100] ....: ) ....: Out[21]: stint g ab r h X2b X3b hr rbi sb cs bb \ year team 2007 CIN 6 379 745 101 203 35 2 36 125.0 10.0 1.0 105 DET 5 301 1062 162 283 54 4 37 144.0 24.0 7.0 97 HOU 4 311 926 109 218 47 6 14 77.0 10.0 4.0 60 LAN 11 413 1021 153 293 61 3 36 154.0 7.0 5.0 114 NYN 13 622 1854 240 509 101 3 61 243.0 22.0 4.0 174 SFN 5 482 1305 198 337 67 6 40 171.0 26.0 7.0 235 TEX 2 198 729 115 200 40 4 28 115.0 21.0 4.0 73 TOR 4 459 1408 187 378 96 2 58 223.0 4.0 2.0 190 so ibb hbp sh sf gidp year team 2007 CIN 127.0 14.0 1.0 1.0 15.0 18.0 DET 176.0 3.0 10.0 4.0 8.0 28.0 HOU 212.0 3.0 9.0 16.0 6.0 17.0 LAN 141.0 8.0 9.0 3.0 8.0 29.0 NYN 310.0 24.0 23.0 18.0 15.0 48.0 SFN 188.0 51.0 8.0 16.0 6.0 41.0 TEX 140.0 4.0 5.0 2.0 8.0 16.0 TOR 265.0 16.0 12.0 4.0 16.0 38.0Partial string indexing on
DateTimeIndex
when part of a MultiIndex
¶
Partial string indexing now matches on DateTimeIndex
when part of a MultiIndex
(GH10331)
In [22]: dft2 = pd.DataFrame(np.random.randn(20, 1), ....: columns=['A'], ....: index=pd.MultiIndex.from_product([pd.date_range('20130101', ....: periods=10, ....: freq='12H'), ....: ['a', 'b']])) ....: In [23]: dft2 Out[23]: A 2013-01-01 00:00:00 a 0.156998 b -0.571455 2013-01-01 12:00:00 a 1.057633 b -0.791489 2013-01-02 00:00:00 a -0.524627 b 0.071878 2013-01-02 12:00:00 a 1.910759 ... ... 2013-01-04 00:00:00 b 1.015405 2013-01-04 12:00:00 a 0.749185 b -0.675521 2013-01-05 00:00:00 a 0.440266 b 0.688972 2013-01-05 12:00:00 a -0.276646 b 1.924533 [20 rows x 1 columns] In [24]: dft2.loc['2013-01-05'] Out[24]: A 2013-01-05 00:00:00 a 0.440266 b 0.688972 2013-01-05 12:00:00 a -0.276646 b 1.924533
On other levels
In [25]: idx = pd.IndexSlice In [26]: dft2 = dft2.swaplevel(0, 1).sort_index() In [27]: dft2 Out[27]: A a 2013-01-01 00:00:00 0.156998 2013-01-01 12:00:00 1.057633 2013-01-02 00:00:00 -0.524627 2013-01-02 12:00:00 1.910759 2013-01-03 00:00:00 0.513082 2013-01-03 12:00:00 1.043945 2013-01-04 00:00:00 1.459927 ... ... b 2013-01-02 12:00:00 0.787965 2013-01-03 00:00:00 -0.546416 2013-01-03 12:00:00 2.107785 2013-01-04 00:00:00 1.015405 2013-01-04 12:00:00 -0.675521 2013-01-05 00:00:00 0.688972 2013-01-05 12:00:00 1.924533 [20 rows x 1 columns] In [28]: dft2.loc[idx[:, '2013-01-05'], :] Out[28]: A a 2013-01-05 00:00:00 0.440266 2013-01-05 12:00:00 -0.276646 b 2013-01-05 00:00:00 0.688972 2013-01-05 12:00:00 1.924533Assembling Datetimes¶
pd.to_datetime()
has gained the ability to assemble datetimes from a passed in DataFrame
or a dict. (GH8158).
In [29]: df = pd.DataFrame({'year': [2015, 2016], ....: 'month': [2, 3], ....: 'day': [4, 5], ....: 'hour': [2, 3]}) ....: In [30]: df Out[30]: day hour month year 0 4 2 2 2015 1 5 3 3 2016
Assembling using the passed frame.
In [31]: pd.to_datetime(df) Out[31]: 0 2015-02-04 02:00:00 1 2016-03-05 03:00:00 dtype: datetime64[ns]
You can pass only the columns that you need to assemble.
In [32]: pd.to_datetime(df[['year', 'month', 'day']]) Out[32]: 0 2015-02-04 1 2016-03-05 dtype: datetime64[ns]Other Enhancements¶
pd.read_csv()
now supports delim_whitespace=True
for the Python engine (GH12958)
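A minimal sketch of this combination (the inline data and the Python 3 io.StringIO import are illustrative assumptions):
from io import StringIO   # on Python 2, pd.compat.StringIO works the same way
import pandas as pd
data = 'a b c\n1 2 3\n4 5 6'   # whitespace-delimited text
df = pd.read_csv(StringIO(data), delim_whitespace=True, engine='python')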
pd.read_csv()
now supports opening ZIP files that contain a single CSV, via extension inference or explicit compression='zip'
(GH12175)
pd.read_csv()
now supports opening files using xz compression, via extension inference or explicit compression='xz'
; xz
compression is also supported by DataFrame.to_csv
in the same way (GH11852)
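A sketch of the round-trip with these options; the file name 'frame.csv.xz' is hypothetical:
df = pd.DataFrame({'a': [1, 2, 3]})
df.to_csv('frame.csv.xz', compression='xz')       # write xz-compressed output explicitly
back = pd.read_csv('frame.csv.xz', index_col=0)   # compression inferred from the .xz extension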
pd.read_msgpack()
now always gives writeable ndarrays even when compression is used (GH12359).
pd.read_msgpack()
now supports serializing and de-serializing categoricals with msgpack (GH12573)
.to_json()
now supports NDFrames
that contain categorical and sparse data (GH10778)
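For instance, serializing a categorical Series, which previously raised (a minimal sketch; the data is illustrative):
s = pd.Series(['a', 'b', 'b'], dtype='category')
s.to_json()   # '{"0":"a","1":"b","2":"b"}' -- the categorical is written out like a dense Series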
interpolate()
now supports method='akima'
(GH7588).
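A minimal sketch (Akima interpolation delegates to scipy, which must be installed; the data is illustrative):
import numpy as np
s = pd.Series([0.0, 1.0, np.nan, 3.0, 4.0, np.nan, 6.0])
s.interpolate(method='akima')   # NaNs are filled along an Akima spline through the known points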
pd.read_excel()
now accepts path objects (e.g. pathlib.Path
, py.path.local
) for the file path, in line with other read_*
functions (GH12655)
Added .weekday_name
property as a component to DatetimeIndex
and the .dt
accessor. (GH11128)
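For example (January 1, 2016 fell on a Friday):
idx = pd.date_range('2016-01-01', periods=3)
idx.weekday_name                 # array(['Friday', 'Saturday', 'Sunday'], dtype=object)
pd.Series(idx).dt.weekday_name   # the same values through the .dt accessor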
Index.take
now handles allow_fill
and fill_value
consistently (GH12631)
In [33]: idx = pd.Index([1., 2., 3., 4.], dtype='float') # default, allow_fill=True, fill_value=None In [34]: idx.take([2, -1]) Out[34]: Float64Index([3.0, 4.0], dtype='float64') In [35]: idx.take([2, -1], fill_value=True) Out[35]: Float64Index([3.0, nan], dtype='float64')
Index
now supports .str.get_dummies()
which returns MultiIndex
, see Creating Indicator Variables (GH10008, GH10103)
In [36]: idx = pd.Index(['a|b', 'a|c', 'b|c']) In [37]: idx.str.get_dummies('|') Out[37]: MultiIndex(levels=[[0, 1], [0, 1], [0, 1]], labels=[[1, 1, 0], [1, 0, 1], [0, 1, 1]], names=['a', 'b', 'c'])
pd.crosstab()
has gained a normalize
argument for normalizing frequency tables (GH12569). Examples in the updated docs here.
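A minimal sketch of the new keyword (the frame is made up; normalize also accepts True/'all', 'index', or 'columns'):
df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'], 'B': ['p', 'q', 'p', 'p']})
pd.crosstab(df.A, df.B, normalize='index')   # each row of counts is divided by its row total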
.resample(..).interpolate()
is now supported (GH12925)
.isin()
now accepts passed sets
(GH12988)
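For example:
pd.Series([1, 2, 3]).isin({1, 3})   # True, False, True -- a set now works just like a list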
These changes make sparse handling return the correct types and work toward a smoother experience with indexing.
SparseArray.take
now returns a scalar for scalar input, SparseArray
for others. Furthermore, it handles a negative indexer with the same rule as Index
(GH10560, GH12796)
In [38]: s = pd.SparseArray([np.nan, np.nan, 1, 2, 3, np.nan, 4, 5, np.nan, 6]) In [39]: s.take(0) Out[39]: nan In [40]: s.take([1, 2, 3]) Out[40]: [nan, 1.0, 2.0] Fill: nan IntIndex Indices: array([1, 2], dtype=int32)
SparseSeries[]
indexing with Ellipsis
raises KeyError
(GH9467)SparseArray[]
indexing with tuples are not handled properly (GH12966)SparseSeries.loc[]
with list-like input raises TypeError
(GH10560)SparseSeries.iloc[]
with scalar input may raise IndexError
(GH10560)SparseSeries.loc[]
, .iloc[]
with slice
returns SparseArray
, rather than SparseSeries
(GH10560)SparseDataFrame.loc[]
, .iloc[]
may result in dense Series
, rather than SparseSeries
(GH12787)SparseArray
addition ignores fill_value
of right hand side (GH12910)SparseArray
mod raises AttributeError
(GH12910)SparseArray
pow calculates 1 ** np.nan
as np.nan
which must be 1 (GH12910)SparseArray
comparison output may incorrect result or raise ValueError
(GH12971)SparseSeries.__repr__
raises TypeError
when it is longer than max_rows
(GH10560)SparseSeries.shape
ignores fill_value
(GH10452)SparseSeries
and SparseArray
may have different dtype
from its dense values (GH12908)SparseSeries.reindex
incorrectly handle fill_value
(GH12797)SparseArray.to_frame()
results in DataFrame
, rather than SparseDataFrame
(GH9850)SparseSeries.value_counts()
does not count fill_value
(GH6749)SparseArray.to_dense()
does not preserve dtype
(GH10648)SparseArray.to_dense()
incorrectly handle fill_value
(GH12797)pd.concat()
of SparseSeries
results in dense (GH10536)pd.concat()
of SparseDataFrame
incorrectly handle fill_value
(GH9765)pd.concat()
of SparseDataFrame
may raise AttributeError
(GH12174)SparseArray.shift()
may raise NameError
or TypeError
(GH12908).groupby(..).nth()
changes¶
The index in .groupby(..).nth()
output is now more consistent when the as_index
argument is passed (GH11039):
In [41]: df = pd.DataFrame({'A' : ['a', 'b', 'a'], ....: 'B' : [1, 2, 3]}) ....: In [42]: df Out[42]: A B 0 a 1 1 b 2 2 a 3
Previous Behavior:
In [3]: df.groupby('A', as_index=True)['B'].nth(0) Out[3]: 0 1 1 2 Name: B, dtype: int64 In [4]: df.groupby('A', as_index=False)['B'].nth(0) Out[4]: 0 1 1 2 Name: B, dtype: int64
New Behavior:
In [43]: df.groupby('A', as_index=True)['B'].nth(0) Out[43]: A a 1 b 2 Name: B, dtype: int64 In [44]: df.groupby('A', as_index=False)['B'].nth(0) Out[44]: 0 1 1 2 Name: B, dtype: int64
Furthermore, previously, a .groupby
would always sort, regardless of whether sort=False
was passed with .nth()
.
In [45]: np.random.seed(1234) In [46]: df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b']) In [47]: df['c'] = np.random.randint(0, 4, 100)
Previous Behavior:
In [4]: df.groupby('c', sort=True).nth(1) Out[4]: a b c 0 -0.334077 0.002118 1 0.036142 -2.074978 2 -0.720589 0.887163 3 0.859588 -0.636524 In [5]: df.groupby('c', sort=False).nth(1) Out[5]: a b c 0 -0.334077 0.002118 1 0.036142 -2.074978 2 -0.720589 0.887163 3 0.859588 -0.636524
New Behavior:
In [48]: df.groupby('c', sort=True).nth(1) Out[48]: a b c 0 -0.334077 0.002118 1 0.036142 -2.074978 2 -0.720589 0.887163 3 0.859588 -0.636524 In [49]: df.groupby('c', sort=False).nth(1) Out[49]: a b c 2 -0.720589 0.887163 3 0.859588 -0.636524 0 -0.334077 0.002118 1 0.036142 -2.074978numpy function compatibility¶
Compatibility between pandas array-like methods (e.g. sum
and take
) and their numpy
counterparts has been greatly increased by augmenting the signatures of the pandas
methods so as to accept arguments that can be passed in from numpy
, even if they are not necessarily used in the pandas
implementation (GH12644, GH12638, GH12687)
.searchsorted()
for Index
and TimedeltaIndex
now accept a sorter
argument to maintain compatibility with numpy’s searchsorted
function (GH12238)np.round()
on a Series
(GH12600)An example of this signature augmentation is illustrated below:
In [50]: sp = pd.SparseDataFrame([1, 2, 3]) In [51]: sp Out[51]: 0 0 1 1 2 2 3
Previous behaviour:
In [2]: np.cumsum(sp, axis=0) ... TypeError: cumsum() takes at most 2 arguments (4 given)
New behaviour:
In [52]: np.cumsum(sp, axis=0) Out[52]: 0 0 1 1 3 2 6Using
.apply
on groupby resampling¶
Using apply
on resampling groupby operations (using a pd.TimeGrouper
) now has the same output types as similar apply
calls on other groupby operations. (GH11742).
In [53]: df = pd.DataFrame({'date': pd.to_datetime(['10/10/2000', '11/10/2000']), ....: 'value': [10, 13]}) ....: In [54]: df Out[54]: date value 0 2000-10-10 10 1 2000-11-10 13
Previous behavior:
In [1]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x.value.sum()) Out[1]: ... TypeError: cannot concatenate a non-NDFrame object # Output is a Series In [2]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum()) Out[2]: date 2000-10-31 value 10 2000-11-30 value 13 dtype: int64
New Behavior:
# Output is a Series In [55]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x.value.sum()) Out[55]: date 2000-10-31 10 2000-11-30 13 Freq: M, dtype: int64 # Output is a DataFrame In [56]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum()) Out[56]: value date 2000-10-31 10 2000-11-30 13Changes in
read_csv
exceptions¶
In order to standardize the read_csv
API for both the c
and python
engines, both will now raise an EmptyDataError
, a subclass of ValueError
, in response to empty columns or header (GH12493, GH12506)
Previous behaviour:
In [1]: df = pd.read_csv(StringIO(''), engine='c') ... ValueError: No columns to parse from file In [2]: df = pd.read_csv(StringIO(''), engine='python') ... StopIteration
New behaviour:
In [1]: df = pd.read_csv(StringIO(''), engine='c') ... pandas.io.common.EmptyDataError: No columns to parse from file In [2]: df = pd.read_csv(StringIO(''), engine='python') ... pandas.io.common.EmptyDataError: No columns to parse from file
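Because EmptyDataError subclasses ValueError, existing except ValueError handlers continue to work; to catch it specifically, import it from the location shown in the tracebacks above (later pandas versions also expose it as pandas.errors.EmptyDataError):
from io import StringIO
from pandas.io.common import EmptyDataError
try:
    df = pd.read_csv(StringIO(''))
except EmptyDataError:      # an `except ValueError` clause would also catch this
    df = pd.DataFrame()     # fall back to an empty frame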
In addition to this error change, several others have been made as well:
CParserError
now sub-classes ValueError
instead of just an Exception
(GH12551)CParserError
is now raised instead of a generic Exception
in read_csv
when the c
engine cannot parse a column (GH12506)ValueError
is now raised instead of a generic Exception
in read_csv
when the c
engine encounters a NaN
value in an integer column (GH12506)ValueError
is now raised instead of a generic Exception
in read_csv
when true_values
is specified, and the c
engine encounters an element in a column containing unencodable bytes (GH12506)pandas.parser.OverflowError
exception has been removed and has been replaced with Python’s built-in OverflowError
exception (GH12506)pd.read_csv()
no longer allows a combination of strings and integers for the usecols
parameter (GH12678)to_datetime
error changes¶
Bugs have been fixed in pd.to_datetime()
when passing a unit
with convertible entries and errors='coerce'
, or with non-convertible entries and errors='ignore'
. Furthermore, an OutOfBoundsDatetime
exception will be raised when an out-of-range value is encountered for that unit when errors='raise'
. (GH11758, GH13052, GH13059)
Previous behaviour:
In [27]: pd.to_datetime(1420043460, unit='s', errors='coerce') Out[27]: NaT In [28]: pd.to_datetime(11111111, unit='D', errors='ignore') OverflowError: Python int too large to convert to C long In [29]: pd.to_datetime(11111111, unit='D', errors='raise') OverflowError: Python int too large to convert to C long
New behaviour:
In [2]: pd.to_datetime(1420043460, unit='s', errors='coerce') Out[2]: Timestamp('2014-12-31 16:31:00') In [3]: pd.to_datetime(11111111, unit='D', errors='ignore') Out[3]: 11111111 In [4]: pd.to_datetime(11111111, unit='D', errors='raise') OutOfBoundsDatetime: cannot convert input with unit 'D'Other API changes¶
.swaplevel()
for Series
, DataFrame
, Panel
, and MultiIndex
now features defaults for its first two parameters i
and j
that swap the two innermost levels of the index. (GH12934).searchsorted()
for Index
and TimedeltaIndex
now accept a sorter
argument to maintain compatibility with numpy’s searchsorted
function (GH12238)Period
and PeriodIndex
now raises IncompatibleFrequency
error which inherits ValueError
rather than raw ValueError
(GH12615)Series.apply
for category dtype now applies the passed function to each of the .categories
(and not the .codes
), and returns a category
dtype if possible (GH12473)read_csv
will now raise a TypeError
if parse_dates
is neither a boolean, list, or dictionary (matches the doc-string) (GH5636).query()/.eval()
is now engine=None
, which will use numexpr
if it’s installed; otherwise it will fallback to the python
engine. This mimics the pre-0.18.1 behavior if numexpr
is installed (and which, previously, if numexpr was not installed, .query()/.eval()
would raise). (GH12749)pd.show_versions()
now includes pandas_datareader
version (GH12740)__name__
and __qualname__
attributes for generic functions (GH12021)pd.concat(ignore_index=True)
now uses RangeIndex
as default (GH12695)pd.merge()
and DataFrame.join()
will show a UserWarning
when merging/joining a single- with a multi-leveled dataframe (GH9455, GH12219)scipy
> 0.17 for deprecated piecewise_polynomial
interpolation method; support for the replacement from_derivatives
method (GH12887)Index.sym_diff()
is deprecated and can be replaced by Index.symmetric_difference()
(GH12591)Categorical.sort()
is deprecated in favor of Categorical.sort_values()
(GH12882).groupby(..).cumcount()
(GH11039)pd.read_csv()
when using skiprows=an_integer
(GH13005)DataFrame.to_sql
when checking case sensitivity for tables. Now only checks if table has been created correctly when table name is not lower case. (GH12876)Period
construction and time series plotting (GH12903, GH11831)..str.encode()
and .str.decode()
methods (GH13008)to_numeric
if input is numeric dtype (GH12777)IntIndex
(GH13036)usecols
parameter in pd.read_csv
is now respected even when the lines of a CSV file are not even (GH12203)groupby.transform(..)
when axis=1
is specified with a non-monotonic ordered index (GH12713)Period
and PeriodIndex
creation raises KeyError
if freq="Minute"
is specified. Note that “Minute” freq is deprecated in v0.17.0, and recommended to use freq="T"
instead (GH11854).resample(...).count()
with a PeriodIndex
always raising a TypeError
(GH12774).resample(...)
with a PeriodIndex
casting to a DatetimeIndex
when empty (GH12868).resample(...)
with a PeriodIndex
when resampling to an existing frequency (GH12770)Period
with different freq
raises ValueError
(GH12615)Series
construction with Categorical
and dtype='category'
is specified (GH12574)display.max_rows
(GH12411, GH12045, GH11594, GH10571, GH12211)float_format
option with option not being validated as a callable. (GH12706)GroupBy.filter
when dropna=False
and no groups fulfilled the criteria (GH12768)__name__
of .cum*
functions (GH12021).astype()
of a Float64Inde/Int64Index
to an Int64Index
(GH12881).to_json()/.read_json()
when orient='index'
(the default) (GH12866)Categorical
dtypes cause error when attempting stacked bar plot (GH13019)numpy
1.11 for NaT
comparions (GH12969).drop()
with a non-unique MultiIndex
. (GH12701).concat
of datetime tz-aware and naive DataFrames (GH12467)ValueError
in .resample(..).fillna(..)
when passing a non-string (GH12952)pd.read_sas()
(GH12659, GH12654, GH12647, GH12809)pd.crosstab()
where would silently ignore aggfunc
if values=None
(GH12569).DataFrame.to_json
when serialising datetime.time
(GH11473).DataFrame.to_json
when attempting to serialise 0d array (GH11299).to_json
when attempting to serialise a DataFrame
or Series
with non-ndarray values; now supports serialization of category
, sparse
, and datetime64[ns, tz]
dtypes (GH10778).DataFrame.to_json
with unsupported dtype not passed to default handler (GH12554)..align
not returning the sub-class (GH12983)Series
with a DataFrame
(GH13037)ABCPanel
in which Panel4D
was not being considered as a valid instance of this generic type (GH12810).name
on .groupby(..).apply(..)
cases (GH12363)Timestamp.__repr__
that caused pprint
to fail in nested structures (GH12622)Timedelta.min
and Timedelta.max
, the properties now report the true minimum/maximum timedeltas
as recognized by pandas. See the documentation. (GH12727).quantile()
with interpolation may coerce to float
unexpectedly (GH12772).quantile()
with empty Series
may return scalar rather than empty Series
(GH12772).loc
with out-of-bounds in a large indexer would raise IndexError
rather than KeyError
(GH12527)TimedeltaIndex
and .asfreq()
, would previously not include the final fencepost (GH12926)Categorical
in a DataFrame
(GH12564)GroupBy.first()
, .last()
returns incorrect row when TimeGrouper
is used (GH7453)pd.read_csv()
with the c
engine when specifying skiprows
with newlines in quoted items (GH10911, GH12775)DataFrame
timezone lost when assigning tz-aware datetime Series
with alignment (GH12981).value_counts()
when normalize=True
and dropna=True
where nulls still contributed to the normalized count (GH12558)Series.value_counts()
loses name if its dtype is category
(GH12835)Series.value_counts()
loses timezone info (GH12835)Series.value_counts(normalize=True)
with Categorical
raises UnboundLocalError
(GH12835)Panel.fillna()
ignoring inplace=True
(GH12633)pd.read_csv()
when specifying names
, usecols
, and parse_dates
simultaneously with the c
engine (GH9755)pd.read_csv()
when specifying delim_whitespace=True
and lineterminator
simultaneously with the c
engine (GH12912)Series.rename
, DataFrame.rename
and DataFrame.rename_axis
not treating Series
as mappings to relabel (GH12623)..rolling.min
and .rolling.max
to enhance dtype handling (GH12373)groupby
where complex types are coerced to float (GH12902)Series.map
raises TypeError
if its dtype is category
or tz-aware datetime
(GH12473)RangeIndex
construction (GH12893)DataFrame
defined to return subclassed Series
may return normal Series
(GH11559).str
accessor methods may raise ValueError
if input has name
and the result is DataFrame
or MultiIndex
(GH12617)DataFrame.last_valid_index()
and DataFrame.first_valid_index()
on empty frames (GH12800)CategoricalIndex.get_loc
returns different result from regular Index
(GH12531)PeriodIndex.resample
where name not propagated (GH12769)date_range
closed
keyword and timezones (GH12684).pd.concat
raises AttributeError
when input data contains tz-aware datetime and timedelta (GH12620)pd.concat
did not handle empty Series
properly (GH11082).plot.bar
alignment when width
is specified with int
(GH12979)fill_value
is ignored if the argument to a binary operator is a constant (GH12723)pd.read_html()
when using bs4 flavor and parsing table with a header and only one column (GH9178).pivot_table
when margins=True
and dropna=True
where nulls still contributed to margin count (GH12577).pivot_table
when dropna=False
where table index/column names disappear (GH12133)pd.crosstab()
when margins=True
and dropna=False
which raised (GH12642)Series.name
when name
attribute can be a hashable type (GH12610).describe()
resets categorical columns information (GH11558)loffset
argument was not applied when calling resample().count()
on a timeseries (GH12725)pd.read_excel()
now accepts column names associated with keyword argument names
(GH12870)pd.to_numeric()
with Index
returns np.ndarray
, rather than Index
(GH12777)pd.to_numeric()
with datetime-like may raise TypeError
(GH12777)pd.to_numeric()
with scalar raises ValueError
(GH12777)This is a major release from 0.17.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Warning
pandas >= 0.18.0 no longer supports compatibility with Python version 2.6 and 3.3 (GH7718, GH11273)
Warning
numexpr
version 2.4.4 will now show a warning and not be used as a computation back-end for pandas because of some buggy behavior. This does not affect other versions (>= 2.1 and >= 2.4.6). (GH12489)
Highlights include:
.groupby
, see here.RangeIndex
as a specialized form of the Int64Index
for memory savings, see here..resample
method to make it more .groupby
like, see here.TypeError
, see here..to_xarray()
function has been added for compatibility with the xarray package, see here.read_sas
function has been enhanced to read sas7bdat
files, see here.pd.test()
top-level nose test runner is available (GH4327).Check the API Changes and deprecations before updating.
New features¶ Window functions are now methods¶Window functions have been refactored to be methods on Series/DataFrame
objects, rather than top-level functions, which are now deprecated. This allows these window-type functions, to have a similar API to that of .groupby
. See the full documentation here (GH11603, GH12373)
In [1]: np.random.seed(1234) In [2]: df = pd.DataFrame({'A' : range(10), 'B' : np.random.randn(10)}) In [3]: df Out[3]: A B 0 0 0.471435 1 1 -1.190976 2 2 1.432707 3 3 -0.312652 4 4 -0.720589 5 5 0.887163 6 6 0.859588 7 7 -0.636524 8 8 0.015696 9 9 -2.242685
Previous Behavior:
In [8]: pd.rolling_mean(df,window=3) FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with DataFrame.rolling(window=3,center=False).mean() Out[8]: A B 0 NaN NaN 1 NaN NaN 2 1 0.237722 3 2 -0.023640 4 3 0.133155 5 4 -0.048693 6 5 0.342054 7 6 0.370076 8 7 0.079587 9 8 -0.954504
New Behavior:
In [4]: r = df.rolling(window=3)
These show a descriptive repr
In [5]: r Out[5]: Rolling [window=3,center=False,axis=0]
with tab-completion of available methods and properties.
In [9]: r. r.A r.agg r.apply r.count r.exclusions r.max r.median r.name r.skew r.sum r.B r.aggregate r.corr r.cov r.kurt r.mean r.min r.quantile r.std r.var
The methods operate on the Rolling
object itself
In [6]: r.mean() Out[6]: A B 0 NaN NaN 1 NaN NaN 2 1.0 0.237722 3 2.0 -0.023640 4 3.0 0.133155 5 4.0 -0.048693 6 5.0 0.342054 7 6.0 0.370076 8 7.0 0.079587 9 8.0 -0.954504
They provide getitem accessors
In [7]: r['A'].mean() Out[7]: 0 NaN 1 NaN 2 1.0 3 2.0 4 3.0 5 4.0 6 5.0 7 6.0 8 7.0 9 8.0 Name: A, dtype: float64
And multiple aggregations
In [8]: r.agg({'A' : ['mean','std'], ...: 'B' : ['mean','std']}) ...: Out[8]: A B mean std mean std 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 1.0 1.0 0.237722 1.327364 3 2.0 1.0 -0.023640 1.335505 4 3.0 1.0 0.133155 1.143778 5 4.0 1.0 -0.048693 0.835747 6 5.0 1.0 0.342054 0.920379 7 6.0 1.0 0.370076 0.871850 8 7.0 1.0 0.079587 0.750099 9 8.0 1.0 -0.954504 1.162285Changes to rename¶
Series.rename
and NDFrame.rename_axis
can now take a scalar or list-like argument for altering the Series or axis name, in addition to their old behaviors of altering labels. (GH9494, GH11965)
In [9]: s = pd.Series(np.random.randn(5)) In [10]: s.rename('newname') Out[10]: 0 1.150036 1 0.991946 2 0.953324 3 -2.021255 4 -0.334077 Name: newname, dtype: float64
In [11]: df = pd.DataFrame(np.random.randn(5, 2)) In [12]: (df.rename_axis("indexname") ....: .rename_axis("columns_name", axis="columns")) ....: Out[12]: columns_name 0 1 indexname 0 0.002118 0.405453 1 0.289092 1.321158 2 -1.546906 -0.202646 3 -0.655969 0.193421 4 0.553439 1.318152
The new functionality works well in method chains. Previously these methods only accepted functions or dicts mapping a label to a new label. This continues to work as before for function or dict-like values.
Range Index¶A RangeIndex
has been added to the Int64Index
sub-classes to support a memory saving alternative for common use cases. This has a similar implementation to the python range
object (xrange
in python 2), in that it only stores the start, stop, and step values for the index. It will transparently interact with the user API, converting to Int64Index
if needed.
This will now be the default constructed index for NDFrame
objects, rather than previous an Int64Index
. (GH939, GH12070, GH12071, GH12109, GH12888)
Previous Behavior:
In [3]: s = pd.Series(range(1000)) In [4]: s.index Out[4]: Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 990, 991, 992, 993, 994, 995, 996, 997, 998, 999], dtype='int64', length=1000) In [6]: s.index.nbytes Out[6]: 8000
New Behavior:
In [13]: s = pd.Series(range(1000)) In [14]: s.index Out[14]: RangeIndex(start=0, stop=1000, step=1) In [15]: s.index.nbytes Out[15]: 80Changes to str.cat¶
The method .str.cat()
concatenates the members of a Series
. Before, if NaN
values were present in the Series, calling .str.cat()
on it would return NaN
, unlike the rest of the Series.str.*
API. This behavior has been amended to ignore NaN
values by default. (GH11435).
A new, friendlier ValueError
is added to protect against the mistake of supplying the sep
as an arg, rather than as a kwarg. (GH11334).
In [27]: pd.Series(['a','b',np.nan,'c']).str.cat(sep=' ') Out[27]: 'a b c' In [28]: pd.Series(['a','b',np.nan,'c']).str.cat(sep=' ', na_rep='?') Out[28]: 'a b ? c'
In [2]: pd.Series(['a','b',np.nan,'c']).str.cat(' ') ValueError: Did you mean to supply a `sep` keyword?Datetimelike rounding¶
DatetimeIndex
, Timestamp
, TimedeltaIndex
, Timedelta
have gained the .round()
, .floor()
and .ceil()
method for datetimelike rounding, flooring and ceiling. (GH4314, GH11963)
Naive datetimes
In [29]: dr = pd.date_range('20130101 09:12:56.1234', periods=3) In [30]: dr Out[30]: DatetimeIndex(['2013-01-01 09:12:56.123400', '2013-01-02 09:12:56.123400', '2013-01-03 09:12:56.123400'], dtype='datetime64[ns]', freq='D') In [31]: dr.round('s') Out[31]: DatetimeIndex(['2013-01-01 09:12:56', '2013-01-02 09:12:56', '2013-01-03 09:12:56'], dtype='datetime64[ns]', freq=None) # Timestamp scalar In [32]: dr[0] Out[32]: Timestamp('2013-01-01 09:12:56.123400', freq='D') In [33]: dr[0].round('10s') Out[33]: Timestamp('2013-01-01 09:13:00')
Tz-aware are rounded, floored and ceiled in local times
In [34]: dr = dr.tz_localize('US/Eastern') In [35]: dr Out[35]: DatetimeIndex(['2013-01-01 09:12:56.123400-05:00', '2013-01-02 09:12:56.123400-05:00', '2013-01-03 09:12:56.123400-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D') In [36]: dr.round('s') Out[36]: DatetimeIndex(['2013-01-01 09:12:56-05:00', '2013-01-02 09:12:56-05:00', '2013-01-03 09:12:56-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
Timedeltas
In [37]: t = pd.timedelta_range('1 days 2 hr 13 min 45 us',periods=3,freq='d') In [38]: t Out[38]: TimedeltaIndex(['1 days 02:13:00.000045', '2 days 02:13:00.000045', '3 days 02:13:00.000045'], dtype='timedelta64[ns]', freq='D') In [39]: t.round('10min') Out[39]: TimedeltaIndex(['1 days 02:10:00', '2 days 02:10:00', '3 days 02:10:00'], dtype='timedelta64[ns]', freq=None) # Timedelta scalar In [40]: t[0] Out[40]: Timedelta('1 days 02:13:00.000045') In [41]: t[0].round('2h') Out[41]: Timedelta('1 days 02:00:00')
In addition, .round()
, .floor()
and .ceil()
will be available through the .dt
accessor of Series
.
In [42]: s = pd.Series(dr) In [43]: s Out[43]: 0 2013-01-01 09:12:56.123400-05:00 1 2013-01-02 09:12:56.123400-05:00 2 2013-01-03 09:12:56.123400-05:00 dtype: datetime64[ns, US/Eastern] In [44]: s.dt.round('D') Out[44]: 0 2013-01-01 00:00:00-05:00 1 2013-01-02 00:00:00-05:00 2 2013-01-03 00:00:00-05:00 dtype: datetime64[ns, US/Eastern]Formatting of Integers in FloatIndex¶
Integers in FloatIndex
, e.g. 1., are now formatted with a decimal point and a 0
digit, e.g. 1.0
(GH11713) This change not only affects the display to the console, but also the output of IO methods like .to_csv
or .to_html
.
Previous Behavior:
In [2]: s = pd.Series([1,2,3], index=np.arange(3.)) In [3]: s Out[3]: 0 1 1 2 2 3 dtype: int64 In [4]: s.index Out[4]: Float64Index([0.0, 1.0, 2.0], dtype='float64') In [5]: print(s.to_csv(path=None)) 0,1 1,2 2,3
New Behavior:
In [45]: s = pd.Series([1,2,3], index=np.arange(3.)) In [46]: s Out[46]: 0.0 1 1.0 2 2.0 3 dtype: int64 In [47]: s.index Out[47]: Float64Index([0.0, 1.0, 2.0], dtype='float64') In [48]: print(s.to_csv(path=None)) 0.0,1 1.0,2 2.0,3Changes to dtype assignment behaviors¶
When a DataFrame’s slice is updated with a new slice of the same dtype, the dtype of the DataFrame will now remain the same. (GH10503)
Previous Behavior:
In [5]: df = pd.DataFrame({'a': [0, 1, 1], 'b': pd.Series([100, 200, 300], dtype='uint32')}) In [7]: df.dtypes Out[7]: a int64 b uint32 dtype: object In [8]: ix = df['a'] == 1 In [9]: df.loc[ix, 'b'] = df.loc[ix, 'b'] In [11]: df.dtypes Out[11]: a int64 b int64 dtype: object
New Behavior:
In [49]: df = pd.DataFrame({'a': [0, 1, 1], ....: 'b': pd.Series([100, 200, 300], dtype='uint32')}) ....: In [50]: df.dtypes Out[50]: a int64 b uint32 dtype: object In [51]: ix = df['a'] == 1 In [52]: df.loc[ix, 'b'] = df.loc[ix, 'b'] In [53]: df.dtypes Out[53]: a int64 b uint32 dtype: object
When a DataFrame’s integer slice is partially updated with a new slice of floats that could potentially be downcasted to integer without losing precision, the dtype of the slice will be set to float instead of integer.
Previous Behavior:
In [4]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3), columns=list('abc'), index=[[4,4,8], [8,10,12]]) In [5]: df Out[5]: a b c 4 8 1 2 3 10 4 5 6 8 12 7 8 9 In [7]: df.ix[4, 'c'] = np.array([0., 1.]) In [8]: df Out[8]: a b c 4 8 1 2 0 10 4 5 1 8 12 7 8 9
New Behavior:
In [54]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3), ....: columns=list('abc'), ....: index=[[4,4,8], [8,10,12]]) ....: In [55]: df Out[55]: a b c 4 8 1 2 3 10 4 5 6 8 12 7 8 9 In [56]: df.loc[4, 'c'] = np.array([0., 1.]) In [57]: df Out[57]: a b c 4 8 1 2 0.0 10 4 5 1.0 8 12 7 8 9.0
to_xarray¶
In a future version of pandas, we will be deprecating Panel
and other > 2 ndim objects. In order to provide for continuity, all NDFrame
objects have gained the .to_xarray()
method in order to convert to xarray
objects, which has a pandas-like interface for > 2 ndim. (GH11972)
See the full xarray documentation here.
In [1]: p = Panel(np.arange(2*3*4).reshape(2,3,4)) In [2]: p.to_xarray() Out[2]: <xarray.DataArray (items: 2, major_axis: 3, minor_axis: 4)> array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]) Coordinates: * items (items) int64 0 1 * major_axis (major_axis) int64 0 1 2 * minor_axis (minor_axis) int64 0 1 2 3
Latex Representation¶
DataFrame has gained a ._repr_latex_() method in order to allow for conversion to latex in an ipython/jupyter notebook using nbconvert. (GH11778)
Note that this must be activated by setting the option pd.display.latex.repr=True
(GH12182)
For example, if you have a jupyter notebook you plan to convert to latex using nbconvert, place the statement pd.display.latex.repr=True
in the first cell to have the contained DataFrame output also stored as latex.
The options display.latex.escape
and display.latex.longtable
have also been added to the configuration and are used automatically by the to_latex
method. See the available options docs for more info.
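As a minimal sketch, these options can be set through the standard options API (the values here are illustrative):
import pandas as pd

pd.set_option('display.latex.repr', True)       # store DataFrame notebook output as latex
pd.set_option('display.latex.escape', False)    # leave special characters unescaped
pd.set_option('display.latex.longtable', True)  # emit a longtable environment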
pd.read_sas() changes¶
read_sas has gained the ability to read SAS7BDAT files, including compressed files. The files can be read in their entirety, or incrementally. For full details see here. (GH4052)
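A minimal sketch of both modes (the file name is hypothetical):
import pandas as pd

# read a SAS7BDAT file in its entirety
df = pd.read_sas('example.sas7bdat')

# or read it incrementally, 10,000 rows at a time
for chunk in pd.read_sas('example.sas7bdat', chunksize=10000):
    process(chunk)  # process() stands in for your own handling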
Series.to_string
(GH11729)read_excel
now supports s3 urls of the format s3://bucketname/filename
(GH11447)AWS_S3_HOST
env variable when reading from s3 (GH12198)Panel.round()
is now implemented (GH11763)round(DataFrame)
, round(Series)
, round(Panel)
will work (GH11763)sys.getsizeof(obj)
returns the memory usage of a pandas object, including the values it contains (GH11597)Series
gained an is_unique
attribute (GH11946)DataFrame.quantile
and Series.quantile
now accept interpolation
keyword (GH10174).DataFrame.style.format
for more flexible formatting of cell values (GH11692)DataFrame.select_dtypes
now allows the np.float16
typecode (GH11990)pivot_table()
now accepts most iterables for the values
parameter (GH12017)BigQuery
service account authentication support, which enables authentication on remote servers. (GH11881, GH12572). For further details see hereHDFStore
is now iterable: for k in store
is equivalent to for k in store.keys()
(GH12221)..dt
for Period
(GH8848)PEP
-ified (GH12096).to_string(index=False)
method (GH11833)out
parameter has been removed from the Series.round()
method. (GH11763)DataFrame.round()
leaves non-numeric columns unchanged in its return, rather than raises. (GH11885)DataFrame.head(0)
and DataFrame.tail(0)
return empty frames, rather than self
. (GH11937)Series.head(0)
and Series.tail(0)
return empty series, rather than self
. (GH11937)to_msgpack
and read_msgpack
encoding now defaults to 'utf-8'
. (GH12170).read_csv()
, .read_table()
, .read_fwf()
) changed to group related arguments. (GH11555)NaTType.isoformat
now returns the string 'NaT'
to allow the result to be passed to the constructor of Timestamp
. (GH12300)NaT
and Timedelta
have expanded arithmetic operations, which are extended to Series
arithmetic where applicable. Operations defined for datetime64[ns]
or timedelta64[ns]
are now also defined for NaT
(GH11564).
NaT
now supports arithmetic operations with integers and floats.
In [58]: pd.NaT * 1 Out[58]: NaT In [59]: pd.NaT * 1.5 Out[59]: NaT In [60]: pd.NaT / 2 Out[60]: NaT In [61]: pd.NaT * np.nan Out[61]: NaT
NaT
defines more arithmetic operations with datetime64[ns]
and timedelta64[ns]
.
In [62]: pd.NaT / pd.NaT Out[62]: nan In [63]: pd.Timedelta('1s') / pd.NaT Out[63]: nan
NaT
may represent either a datetime64[ns]
null or a timedelta64[ns]
null. Given the ambiguity, it is treated as a timedelta64[ns]
, which allows more operations to succeed.
In [64]: pd.NaT + pd.NaT Out[64]: NaT # same as In [65]: pd.Timedelta('1s') + pd.Timedelta('1s') Out[65]: Timedelta('0 days 00:00:02')
as opposed to
In [3]: pd.Timestamp('19900315') + pd.Timestamp('19900315') TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'
However, when wrapped in a Series
whose dtype
is datetime64[ns]
or timedelta64[ns]
, the dtype
information is respected.
In [1]: pd.Series([pd.NaT], dtype='<M8[ns]') + pd.Series([pd.NaT], dtype='<M8[ns]') TypeError: can only operate on a datetimes for subtraction, but the operator [__add__] was passed
In [66]: pd.Series([pd.NaT], dtype='<m8[ns]') + pd.Series([pd.NaT], dtype='<m8[ns]') Out[66]: 0 NaT dtype: timedelta64[ns]
Timedelta
division by floats
now works.
In [67]: pd.Timedelta('1s') / 2.0 Out[67]: Timedelta('0 days 00:00:00.500000')
Subtracting a Series of Timedelta from a Timestamp works (GH11925)
In [68]: ser = pd.Series(pd.timedelta_range('1 day', periods=3)) In [69]: ser Out[69]: 0 1 days 1 2 days 2 3 days dtype: timedelta64[ns] In [70]: pd.Timestamp('2012-01-01') - ser Out[70]: 0 2011-12-31 1 2011-12-30 2 2011-12-29 dtype: datetime64[ns]
NaT.isoformat() now returns 'NaT'. This change allows pd.Timestamp to rehydrate any timestamp-like object from its isoformat (GH12300).
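A short sketch of the round-trip this enables:
import pandas as pd

stamp = pd.Timestamp('2016-01-01 12:30:45')
pd.Timestamp(stamp.isoformat())   # Timestamp('2016-01-01 12:30:45')
pd.Timestamp(pd.NaT.isoformat())  # NaT, since isoformat() now returns 'NaT'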
Forward incompatible changes in msgpack
writing format were made over 0.17.0 and 0.18.0; older versions of pandas cannot read files packed by newer versions (GH12129, GH10527)
Bugs in to_msgpack and read_msgpack introduced in 0.17.0 and fixed in 0.18.0 caused files packed in Python 2 to be unreadable by Python 3 (GH12142). The following table describes the backward and forward compatibility of msgpacks.
Warning
Packed with | Can be unpacked with
pre-0.17 / Python 2 | any
pre-0.17 / Python 3 | any
0.17 / Python 2 | Python 2 only
0.18.0 is backward-compatible for reading files packed by older versions, except for files packed with 0.17 in Python 2, which can only be unpacked in Python 2.
Signature change for .rank¶
Series.rank and DataFrame.rank now have the same signature (GH11759)
Previous signature
In [3]: pd.Series([0,1]).rank(method='average', na_option='keep', ascending=True, pct=False) Out[3]: 0 1 1 2 dtype: float64 In [4]: pd.DataFrame([0,1]).rank(axis=0, numeric_only=None, method='average', na_option='keep', ascending=True, pct=False) Out[4]: 0 0 1 1 2
New signature
In [71]: pd.Series([0,1]).rank(axis=0, method='average', numeric_only=None, ....: na_option='keep', ascending=True, pct=False) ....: Out[71]: 0 1.0 1 2.0 dtype: float64 In [72]: pd.DataFrame([0,1]).rank(axis=0, method='average', numeric_only=None, ....: na_option='keep', ascending=True, pct=False) ....: Out[72]: 0 0 1.0 1 2.0
Bug in QuarterBegin with n=0¶
In previous versions, the behavior of the QuarterBegin offset was inconsistent depending on the date when the n
parameter was 0. (GH11406)
The general semantics of anchored offsets for n=0
is to not move the date when it is an anchor point (e.g., a quarter start date), and otherwise roll forward to the next anchor point.
In [73]: d = pd.Timestamp('2014-02-01') In [74]: d Out[74]: Timestamp('2014-02-01 00:00:00') In [75]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2) Out[75]: Timestamp('2014-02-01 00:00:00') In [76]: d + pd.offsets.QuarterBegin(n=0, startingMonth=1) Out[76]: Timestamp('2014-04-01 00:00:00')
For the QuarterBegin offset in previous versions, the date would be rolled backwards if the date was in the same month as the quarter start date.
In [3]: d = pd.Timestamp('2014-02-15') In [4]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2) Out[4]: Timestamp('2014-02-01 00:00:00')
This behavior has been corrected in version 0.18.0, which is consistent with other anchored offsets like MonthBegin
and YearBegin
.
In [77]: d = pd.Timestamp('2014-02-15') In [78]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2) Out[78]: Timestamp('2014-05-01 00:00:00')
Resample API¶
Like the change in the window functions API above, .resample(...)
is changing to have a more groupby-like API. (GH11732, GH12702, GH12202, GH12332, GH12334, GH12348, GH12448).
In [79]: np.random.seed(1234) In [80]: df = pd.DataFrame(np.random.rand(10,4), ....: columns=list('ABCD'), ....: index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s')) ....: In [81]: df Out[81]: A B C D 2010-01-01 09:00:00 0.191519 0.622109 0.437728 0.785359 2010-01-01 09:00:01 0.779976 0.272593 0.276464 0.801872 2010-01-01 09:00:02 0.958139 0.875933 0.357817 0.500995 2010-01-01 09:00:03 0.683463 0.712702 0.370251 0.561196 2010-01-01 09:00:04 0.503083 0.013768 0.772827 0.882641 2010-01-01 09:00:05 0.364886 0.615396 0.075381 0.368824 2010-01-01 09:00:06 0.933140 0.651378 0.397203 0.788730 2010-01-01 09:00:07 0.316836 0.568099 0.869127 0.436173 2010-01-01 09:00:08 0.802148 0.143767 0.704261 0.704581 2010-01-01 09:00:09 0.218792 0.924868 0.442141 0.909316
Previous API:
You would write a resampling operation that immediately evaluates. If a how
parameter was not provided, it would default to how='mean'
.
In [6]: df.resample('2s') Out[6]: A B C D 2010-01-01 09:00:00 0.485748 0.447351 0.357096 0.793615 2010-01-01 09:00:02 0.820801 0.794317 0.364034 0.531096 2010-01-01 09:00:04 0.433985 0.314582 0.424104 0.625733 2010-01-01 09:00:06 0.624988 0.609738 0.633165 0.612452 2010-01-01 09:00:08 0.510470 0.534317 0.573201 0.806949
You could also specify a how
directly
In [7]: df.resample('2s', how='sum') Out[7]: A B C D 2010-01-01 09:00:00 0.971495 0.894701 0.714192 1.587231 2010-01-01 09:00:02 1.641602 1.588635 0.728068 1.062191 2010-01-01 09:00:04 0.867969 0.629165 0.848208 1.251465 2010-01-01 09:00:06 1.249976 1.219477 1.266330 1.224904 2010-01-01 09:00:08 1.020940 1.068634 1.146402 1.613897
New API:
Now, you can write .resample(..)
as a 2-stage operation like .groupby(...)
, which yields a Resampler
.
In [82]: r = df.resample('2s') In [83]: r Out[83]: DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left, label=left, convention=start, base=0]
Downsampling¶
You can then use this object to perform operations. These are downsampling operations (going from a higher frequency to a lower one).
In [84]: r.mean() Out[84]: A B C D 2010-01-01 09:00:00 0.485748 0.447351 0.357096 0.793615 2010-01-01 09:00:02 0.820801 0.794317 0.364034 0.531096 2010-01-01 09:00:04 0.433985 0.314582 0.424104 0.625733 2010-01-01 09:00:06 0.624988 0.609738 0.633165 0.612452 2010-01-01 09:00:08 0.510470 0.534317 0.573201 0.806949
In [85]: r.sum() Out[85]: A B C D 2010-01-01 09:00:00 0.971495 0.894701 0.714192 1.587231 2010-01-01 09:00:02 1.641602 1.588635 0.728068 1.062191 2010-01-01 09:00:04 0.867969 0.629165 0.848208 1.251465 2010-01-01 09:00:06 1.249976 1.219477 1.266330 1.224904 2010-01-01 09:00:08 1.020940 1.068634 1.146402 1.613897
Furthermore, resample now supports getitem
operations to perform the resample on specific columns.
In [86]: r[['A','C']].mean() Out[86]: A C 2010-01-01 09:00:00 0.485748 0.357096 2010-01-01 09:00:02 0.820801 0.364034 2010-01-01 09:00:04 0.433985 0.424104 2010-01-01 09:00:06 0.624988 0.633165 2010-01-01 09:00:08 0.510470 0.573201
and .aggregate
type operations.
In [87]: r.agg({'A' : 'mean', 'B' : 'sum'}) Out[87]: A B 2010-01-01 09:00:00 0.485748 0.894701 2010-01-01 09:00:02 0.820801 1.588635 2010-01-01 09:00:04 0.433985 0.629165 2010-01-01 09:00:06 0.624988 1.219477 2010-01-01 09:00:08 0.510470 1.068634
These accessors can, of course, be combined
In [88]: r[['A','B']].agg(['mean','sum']) Out[88]: A B mean sum mean sum 2010-01-01 09:00:00 0.485748 0.971495 0.447351 0.894701 2010-01-01 09:00:02 0.820801 1.641602 0.794317 1.588635 2010-01-01 09:00:04 0.433985 0.867969 0.314582 0.629165 2010-01-01 09:00:06 0.624988 1.249976 0.609738 1.219477 2010-01-01 09:00:08 0.510470 1.020940 0.534317 1.068634
Upsampling¶
Upsampling operations take you from a lower frequency to a higher frequency. These are now performed with the Resampler
objects with backfill()
, ffill()
, fillna()
and asfreq()
methods.
In [89]: s = pd.Series(np.arange(5,dtype='int64'), ....: index=date_range('2010-01-01', periods=5, freq='Q')) ....: In [90]: s Out[90]: 2010-03-31 0 2010-06-30 1 2010-09-30 2 2010-12-31 3 2011-03-31 4 Freq: Q-DEC, dtype: int64
Previously
In [6]: s.resample('M', fill_method='ffill') Out[6]: 2010-03-31 0 2010-04-30 0 2010-05-31 0 2010-06-30 1 2010-07-31 1 2010-08-31 1 2010-09-30 2 2010-10-31 2 2010-11-30 2 2010-12-31 3 2011-01-31 3 2011-02-28 3 2011-03-31 4 Freq: M, dtype: int64
New API
In [91]: s.resample('M').ffill() Out[91]: 2010-03-31 0 2010-04-30 0 2010-05-31 0 2010-06-30 1 2010-07-31 1 2010-08-31 1 2010-09-30 2 2010-10-31 2 2010-11-30 2 2010-12-31 3 2011-01-31 3 2011-02-28 3 2011-03-31 4 Freq: M, dtype: int64
Note
In the new API, you can either downsample OR upsample. The prior implementation would allow you to pass an aggregator function (like mean
) even though you were upsampling, providing a bit of confusion.
Warning
This new API for resample includes some internal changes to keep the prior-to-0.18.0 API working with a deprecation warning in most cases, as the resample operation returns a deferred object. We can intercept operations and just do what the (pre-0.18.0) API did (with a warning). Here is a typical use case:
In [4]: r = df.resample('2s') In [6]: r*10 pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operation use .resample(...).mean() instead of .resample(...) Out[6]: A B C D 2010-01-01 09:00:00 4.857476 4.473507 3.570960 7.936154 2010-01-01 09:00:02 8.208011 7.943173 3.640340 5.310957 2010-01-01 09:00:04 4.339846 3.145823 4.241039 6.257326 2010-01-01 09:00:06 6.249881 6.097384 6.331650 6.124518 2010-01-01 09:00:08 5.104699 5.343172 5.732009 8.069486
However, getting and assignment operations directly on a Resampler
will raise a ValueError
:
In [7]: r.iloc[0] = 5 ValueError: .resample() is now a deferred operation use .resample(...).mean() instead of .resample(...)
There is one situation where the new API cannot perform all the operations of the original code. This code intended to resample every 2s, take the mean AND then take the min of those results.
In [4]: df.resample('2s').min() Out[4]: A 0.433985 B 0.314582 C 0.357096 D 0.531096 dtype: float64
The new API will:
In [92]: df.resample('2s').min() Out[92]: A B C D 2010-01-01 09:00:00 0.191519 0.272593 0.276464 0.785359 2010-01-01 09:00:02 0.683463 0.712702 0.357817 0.500995 2010-01-01 09:00:04 0.364886 0.013768 0.075381 0.368824 2010-01-01 09:00:06 0.316836 0.568099 0.397203 0.436173 2010-01-01 09:00:08 0.218792 0.143767 0.442141 0.704581
The good news is the return dimensions will differ between the new API and the old API, so this should loudly raise an exception.
To replicate the original operation
In [93]: df.resample('2s').mean().min() Out[93]: A 0.433985 B 0.314582 C 0.357096 D 0.531096 dtype: float64
Changes to eval¶
In prior versions, new column assignments in an eval
expression resulted in an inplace change to the DataFrame
. (GH9297, GH8664, GH10486)
In [94]: df = pd.DataFrame({'a': np.linspace(0, 10, 5), 'b': range(5)}) In [95]: df Out[95]: a b 0 0.0 0 1 2.5 1 2 5.0 2 3 7.5 3 4 10.0 4
In [12]: df.eval('c = a + b') FutureWarning: eval expressions containing an assignment currently default to operating inplace. This will change in a future version of pandas, use inplace=True to avoid this warning. In [13]: df Out[13]: a b c 0 0.0 0 0.0 1 2.5 1 3.5 2 5.0 2 7.0 3 7.5 3 10.5 4 10.0 4 14.0
In version 0.18.0, a new inplace
keyword was added to choose whether the assignment should be done inplace or return a copy.
In [96]: df Out[96]: a b c 0 0.0 0 0.0 1 2.5 1 3.5 2 5.0 2 7.0 3 7.5 3 10.5 4 10.0 4 14.0 In [97]: df.eval('d = c - b', inplace=False) Out[97]: a b c d 0 0.0 0 0.0 0.0 1 2.5 1 3.5 2.5 2 5.0 2 7.0 5.0 3 7.5 3 10.5 7.5 4 10.0 4 14.0 10.0 In [98]: df Out[98]: a b c 0 0.0 0 0.0 1 2.5 1 3.5 2 5.0 2 7.0 3 7.5 3 10.5 4 10.0 4 14.0 In [99]: df.eval('d = c - b', inplace=True) In [100]: df Out[100]: a b c d 0 0.0 0 0.0 0.0 1 2.5 1 3.5 2.5 2 5.0 2 7.0 5.0 3 7.5 3 10.5 7.5 4 10.0 4 14.0 10.0
Warning
For backwards compatibility, inplace defaults to True if not specified. This will change in a future version of pandas. If your code depends on an inplace assignment you should update to explicitly set inplace=True.
The inplace keyword parameter was also added to the query method.
In [101]: df.query('a > 5') Out[101]: a b c d 3 7.5 3 10.5 7.5 4 10.0 4 14.0 10.0 In [102]: df.query('a > 5', inplace=True) In [103]: df Out[103]: a b c d 3 7.5 3 10.5 7.5 4 10.0 4 14.0 10.0
Warning
Note that the default value for inplace
in a query
is False
, which is consistent with prior versions.
eval
has also been updated to allow multi-line expressions for multiple assignments. These expressions will be evaluated one at a time in order. Only assignments are valid for multi-line expressions.
In [104]: df Out[104]: a b c d 3 7.5 3 10.5 7.5 4 10.0 4 14.0 10.0 In [105]: df.eval(""" .....: e = d + a .....: f = e - 22 .....: g = f / 2.0""", inplace=True) .....: In [106]: df Out[106]: a b c d e f g 3 7.5 3 10.5 7.5 15.0 -7.0 -3.5 4 10.0 4 14.0 10.0 20.0 -2.0 -1.0
Other API Changes¶
DataFrame.between_time
and Series.between_time
now only parse a fixed set of time strings. Parsing of date strings is no longer supported and raises a ValueError
. (GH11818)
In [107]: s = pd.Series(range(10), pd.date_range('2015-01-01', freq='H', periods=10)) In [108]: s.between_time("7:00am", "9:00am") Out[108]: 2015-01-01 07:00:00 7 2015-01-01 08:00:00 8 2015-01-01 09:00:00 9 Freq: H, dtype: int64
This will now raise.
In [2]: s.between_time('20150101 07:00:00','20150101 09:00:00') ValueError: Cannot convert arg ['20150101 07:00:00'] to a time.
.memory_usage()
now includes values in the index, as does memory_usage in .info()
(GH11597)
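For example (illustrative only):
import pandas as pd

s = pd.Series(range(1000))
s.memory_usage()             # now includes the index by default
s.memory_usage(index=False)  # values only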
DataFrame.to_latex()
now supports non-ascii encodings (eg utf-8
) in Python 2 with the parameter encoding
(GH7061)
pandas.merge()
and DataFrame.merge()
will show a specific error message when trying to merge with an object that is not of type DataFrame
or a subclass (GH12081)
DataFrame.unstack
and Series.unstack
now take fill_value
keyword to allow direct replacement of missing values when an unstack results in missing values in the resulting DataFrame
. As an added benefit, specifying fill_value
will preserve the data type of the original stacked data. (GH9746)
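A minimal sketch of the new keyword:
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x')])
s = pd.Series([1, 2, 3], index=idx)

# the missing ('b', 'y') cell becomes 0 and the int64 dtype is preserved
s.unstack(fill_value=0)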
As part of the new API for window functions and resampling, aggregation functions have been clarified, raising more informative error messages on invalid aggregations. (GH9052). A full set of examples are presented in groupby.
Statistical functions for NDFrame
objects (like sum(), mean(), min()
) will now raise if non-numpy-compatible arguments are passed in for **kwargs
(GH12301)
.to_latex
and .to_html
gain a decimal
parameter like .to_csv
; the default is '.'
(GH12031)
More helpful error message when constructing a DataFrame
with empty data but with indices (GH8020)
.describe()
will now properly handle bool dtype as a categorical (GH6625)
More helpful error message with an invalid .transform
with user defined input (GH10165)
Exponentially weighted functions now allow specifying alpha directly (GH10789) and raise ValueError
if parameters violate 0 < alpha <= 1
(GH12492)
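For instance (a minimal sketch):
import pandas as pd

s = pd.Series(range(5))
s.ewm(alpha=0.5).mean()  # smoothing factor given directly
# s.ewm(alpha=1.5)       # would raise ValueError, since 0 < alpha <= 1 is required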
The functions pd.rolling_*
, pd.expanding_*
, and pd.ewm*
are deprecated and replaced by the corresponding method call. Note that the new suggested syntax includes all of the arguments (even if default) (GH11603)
In [1]: s = pd.Series(range(3)) In [2]: pd.rolling_mean(s,window=2,min_periods=1) FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a future version, replace with Series.rolling(min_periods=1,window=2,center=False).mean() Out[2]: 0 0.0 1 0.5 2 1.5 dtype: float64 In [3]: pd.rolling_cov(s, s, window=2) FutureWarning: pd.rolling_cov is deprecated for Series and will be removed in a future version, replace with Series.rolling(window=2).cov(other=<Series>) Out[3]: 0 NaN 1 0.5 2 0.5 dtype: float64
The freq and how arguments to the .rolling, .expanding, and .ewm (new) functions are deprecated, and will be removed in a future version. You can simply resample the input prior to creating a window function. (GH11603).
For example, instead of s.rolling(window=5,freq='D').max()
to get the max value on a rolling 5 Day window, one could use s.resample('D').mean().rolling(window=5).max()
, which first resamples the data to daily data, then provides a rolling 5 day window.
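Spelled out as code, the suggested replacement looks like this (a sketch with made-up data):
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(100),
              index=pd.date_range('2016-01-01', periods=100, freq='H'))

# instead of the deprecated s.rolling(window=5, freq='D').max()
s.resample('D').mean().rolling(window=5).max()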
The pd.tseries.frequencies.get_offset_name function is deprecated. Use the offset's .freqstr property as an alternative (GH11192)
pandas.stats.fama_macbeth
routines are deprecated and will be removed in a future version (GH6077)
pandas.stats.ols
, pandas.stats.plm
and pandas.stats.var
routines are deprecated and will be removed in a future version (GH6077)
show a FutureWarning
rather than a DeprecationWarning
on using long-time deprecated syntax in HDFStore.select
, where the where
clause is not a string-like (GH12027)
The pandas.options.display.mpl_style
configuration has been deprecated and will be removed in a future version of pandas. This functionality is better handled by matplotlib’s style sheets (GH11783).
In GH4892 indexing with floating point numbers on a non-Float64Index
was deprecated (in version 0.14.0). In 0.18.0, this deprecation warning is removed and these will now raise a TypeError
. (GH12165, GH12333)
In [109]: s = pd.Series([1, 2, 3], index=[4, 5, 6]) In [110]: s Out[110]: 4 1 5 2 6 3 dtype: int64 In [111]: s2 = pd.Series([1, 2, 3], index=list('abc')) In [112]: s2 Out[112]: a 1 b 2 c 3 dtype: int64
Previous Behavior:
# this is label indexing In [2]: s[5.0] FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point Out[2]: 2 # this is positional indexing In [3]: s.iloc[1.0] FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point Out[3]: 2 # this is label indexing In [4]: s.loc[5.0] FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point Out[4]: 2 # .ix would coerce 1.0 to the positional 1, and index In [5]: s2.ix[1.0] = 10 FutureWarning: scalar indexers for index type Index should be integers and not floating point In [6]: s2 Out[6]: a 1 b 10 c 3 dtype: int64
New Behavior:
For iloc, getting & setting via a float scalar will always raise.
In [3]: s.iloc[2.0] TypeError: cannot do label indexing on <class 'pandas.indexes.numeric.Int64Index'> with these indexers [2.0] of <type 'float'>
Other indexers will coerce to a like integer for both getting and setting. The FutureWarning
has been dropped for .loc
, .ix
and []
.
In [113]: s[5.0] Out[113]: 2 In [114]: s.loc[5.0] Out[114]: 2
and setting
In [115]: s_copy = s.copy() In [116]: s_copy[5.0] = 10 In [117]: s_copy Out[117]: 4 1 5 10 6 3 dtype: int64 In [118]: s_copy = s.copy() In [119]: s_copy.loc[5.0] = 10 In [120]: s_copy Out[120]: 4 1 5 10 6 3 dtype: int64
Positional setting with .ix
and a float indexer will ADD this value to the index, rather than previously setting the value by position.
In [3]: s2.ix[1.0] = 10 In [4]: s2 Out[4]: a 1 b 2 c 3 1.0 10 dtype: int64
Slicing will also coerce integer-like floats to integers for a non-Float64Index
.
In [121]: s.loc[5.0:6] Out[121]: 5 2 6 3 dtype: int64
Note that for floats that are NOT coercible to ints, the label-based bounds will be excluded.
In [122]: s.loc[5.1:6] Out[122]: 6 3 dtype: int64
Float indexing on a Float64Index
is unchanged.
In [123]: s = pd.Series([1, 2, 3], index=np.arange(3.)) In [124]: s[1.0] Out[124]: 2 In [125]: s[1.0:2.5] Out[125]: 1.0 2 2.0 3 dtype: int64
Removal of prior version deprecations/changes¶
rolling_corr_pairwise
in favor of .rolling().corr(pairwise=True)
(GH4950)expanding_corr_pairwise
in favor of .expanding().corr(pairwise=True)
(GH4950)DataMatrix
module. This was not imported into the pandas namespace in any event (GH12111)cols
keyword in favor of subset
in DataFrame.duplicated()
and DataFrame.drop_duplicates()
(GH6680)read_frame
and frame_query
(both aliases for pd.read_sql
) and write_frame
(alias of to_sql
) functions in the pd.io.sql
namespace, deprecated since 0.14.0 (GH6292).order
keyword from .factorize()
(GH6930)andrews_curves
(GH11534)DatetimeIndex
, PeriodIndex
and TimedeltaIndex
‘s ops performance including NaT
(GH10277)pandas.concat
(GH11958)StataReader
(GH11591)Categoricals
with Series
of datetimes containing NaT
(GH12077)GroupBy.size
when data-frame is empty. (GH11699)Period.end_time
when a multiple of time period is requested (GH11738).clip
with tz-aware datetimes (GH11838)date_range
when the boundaries fell on the frequency (GH11804, GH12409).groupby(...).agg(...)
(GH9052)Timedelta
constructor (GH11995)StataReader
when reading incrementally (GH12014)DateOffset
when n
parameter is 0
(GH11370)NaT
comparison changes (GH12049)read_csv
when reading from a StringIO
in threads (GH11790)NaT
as a missing value in datetimelikes when factorizing & with Categoricals
(GH12077)Series
were tz-aware (GH12089)Series.str.get_dummies
when one of the variables was ‘name’ (GH12180)pd.concat
while concatenating tz-aware NaT series. (GH11693, GH11755, GH12217)pd.read_stata
with version <= 108 files (GH12232)Series.resample
using a frequency of Nano
when the index is a DatetimeIndex
and contains non-zero nanosecond parts (GH12037).nunique
and a sparse index (GH12352)boto
in python 3.5 (GH11915)NaT
subtraction from Timestamp
or DatetimeIndex
with timezones (GH11718)Series
of a single tz-aware Timestamp
(GH12290).next()
(GH12299)Timedelta.round
with negative values (GH11690).loc
against CategoricalIndex
may result in normal Index
(GH11586)DataFrame.info
when duplicated column names exist (GH11761).copy
of datetime tz-aware objects (GH11794)Series.apply
and Series.map
where timedelta64
was not boxed (GH11349)DataFrame.set_index()
with tz-aware Series
(GH12358)DataFrame
where AttributeError
did not propagate (GH11808)Timestamp
(GH11616)pd.read_clipboard
and pd.to_clipboard
functions not supporting Unicode; upgrade included pyperclip
to v1.5.15 (GH9263)DataFrame.query
containing an assignment (GH8664)from_msgpack
where __contains__()
fails for columns of the unpacked DataFrame
, if the DataFrame
has object columns. (GH11880).resample
on categorical data with TimedeltaIndex
(GH12169)DataFrame
(GH11682)Index
creation from Timestamp
with mixed tz coerces to UTC (GH11488)to_numeric
where it does not raise if input is more than one dimension (GH11776)df.plot
using incorrect colors for bar plots under matplotlib 1.5+ (GH11614)groupby
plot
method when using keyword arguments (GH11805).DataFrame.duplicated
and drop_duplicates
causing spurious matches when setting keep=False
(GH11864).loc
result with duplicated key may have Index
with incorrect dtype (GH11497)pd.rolling_median
where memory allocation failed even with sufficient memory (GH11696)DataFrame.style
with spurious zeros (GH12134)DataFrame.style
with integer columns not starting at 0 (GH12125).style.bar
may not rendered properly using specific browser (GH11678)Timedelta
with a numpy.array
of Timedelta
that caused an infinite recursion (GH11835)DataFrame.round
dropping column index name (GH11986)df.replace
while replacing value in mixed dtype Dataframe
(GH11698)Index
prevents copying name of passed Index
, when a new name is not provided (GH11193)read_excel
failing to read any non-empty sheets when empty sheets exist and sheetname=None
(GH11711)read_excel
failing to raise NotImplemented
error when keywords parse_dates
and date_parser
are provided (GH11544)read_sql
with pymysql
connections failing to return chunked data (GH11522).to_csv
ignoring formatting parameters decimal
, na_rep
, float_format
for float indexes (GH11553)Int64Index
and Float64Index
preventing the use of the modulo operator (GH9244)MultiIndex.drop
for not lexsorted multi-indexes (GH12078)DataFrame
when masking an empty DataFrame
(GH11859).plot
potentially modifying the colors
input when the number of columns didn’t match the number of series provided (GH12039).Series.plot
failing when index has a CustomBusinessDay
frequency (GH7222)..to_sql
for datetime.time
values with sqlite fallback (GH8341)read_excel
failing to read data with one column when squeeze=True
(GH12157)read_excel
failing to read one empty column (GH12292, GH9002).groupby
where a KeyError
was not raised for a wrong column if there was only one row in the dataframe (GH11741).read_csv
with dtype specified on empty data producing an error (GH12048).read_csv
where strings like '2E'
are treated as valid floats (GH12237)millisecond
property of DatetimeIndex
. This would always raise a ValueError
(GH12019).Series
constructor with read-only data (GH11502)pandas.util.testing.choice()
. Should use np.random.choice()
, instead. (GH12386).loc
setitem indexer preventing the use of a TZ-aware DatetimeIndex (GH12050).style
indexes and multi-indexes not appearing (GH11655)to_msgpack
and from_msgpack
which did not correctly serialize or deserialize NaT
(GH12307)..skew
and .kurt
due to roundoff error for highly similar values (GH11974)Timestamp
constructor where microsecond resolution was lost if HHMMSS were not separated with ‘:’ (GH10041)buffer_rd_bytes
src->buffer could be freed more than once if reading failed, causing a segfault (GH12098)crosstab
where arguments with non-overlapping indexes would return a KeyError
(GH10291)DataFrame.apply
in which reduction was not being prevented for cases in which dtype
was not a numpy dtype (GH12244)DatetimeIndex
by setting utc=True
in .to_datetime
(GH11934)read_csv
(GH12494)DataFrame
with duplicate column names (GH12344)
v0.17.1 (November 21, 2015)¶
Note
We are proud to announce that pandas has become a sponsored project of the NumFOCUS organization. This will help ensure the success of development of pandas as a world-class open-source project.
This is a minor bug-fix release from 0.17.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
DataFrame.drop_duplicates
from 0.16.2, causing incorrect results on integer values (GH11376)
Conditional HTML Formatting¶
Warning
This is a new feature and is under active development. We'll be adding features and possibly making breaking changes in future releases. Feedback is welcome.
We've added experimental support for conditional HTML formatting: the visual styling of a DataFrame based on the data. The styling is accomplished with HTML and CSS. Access the styler class with the pandas.DataFrame.style attribute, an instance of Styler with your data attached.
Here’s a quick example:
In [1]: np.random.seed(123) In [2]: df = DataFrame(np.random.randn(10, 5), columns=list('abcde')) In [3]: html = df.style.background_gradient(cmap='viridis', low=.5)
We can render the HTML to get the following table.
a b c d e 0 -1.085631 0.997345 0.282978 -1.506295 -0.5786 1 1.651437 -2.426679 -0.428913 1.265936 -0.86674 2 -0.678886 -0.094709 1.49139 -0.638902 -0.443982 3 -0.434351 2.20593 2.186786 1.004054 0.386186 4 0.737369 1.490732 -0.935834 1.175829 -1.253881 5 -0.637752 0.907105 -1.428681 -0.140069 -0.861755 6 -0.255619 -2.798589 -1.771533 -0.699877 0.927462 7 -0.173636 0.002846 0.688223 -0.879536 0.283627 8 -0.805367 -1.727669 -0.3909 0.573806 0.338589 9 -0.01183 2.392365 0.412912 0.978736 2.238143Styler
interacts nicely with the Jupyter Notebook. See the documentation for more.
DatetimeIndex
now supports conversion to strings with astype(str)
(GH10442)
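For example:
import pandas as pd

idx = pd.date_range('2015-01-01', periods=2)
idx.astype(str)  # Index(['2015-01-01', '2015-01-02'], dtype='object')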
Support for compression
(gzip/bz2) in pandas.DataFrame.to_csv()
(GH7615)
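A minimal sketch (the output path is hypothetical):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_csv('out.csv.gz', compression='gzip')  # or compression='bz2'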
pd.read_*
functions can now also accept pathlib.Path
, or py._path.local.LocalPath
objects for the filepath_or_buffer
argument. (GH11033) - The DataFrame
and Series
functions .to_csv()
, .to_html()
and .to_latex()
can now handle paths beginning with tildes (e.g. ~/Documents/
) (GH11438)
DataFrame
now uses the fields of a namedtuple
as columns, if columns are not supplied (GH11181)
DataFrame.itertuples()
now returns namedtuple
objects, when possible. (GH11269, GH11625)
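A short sketch of both behaviors:
import pandas as pd
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
df = pd.DataFrame([Point(0, 3), Point(1, 4)])  # columns become ['x', 'y']

for row in df.itertuples():
    print(row.Index, row.x, row.y)  # fields are accessible by name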
Added axvlines_kwds
to parallel coordinates plot (GH10709)
Option to .info()
and .memory_usage()
to provide for deep introspection of memory consumption. Note that this can be expensive to compute and therefore is an optional parameter. (GH11595)
In [4]: df = DataFrame({'A' : ['foo']*1000}) In [5]: df['B'] = df['A'].astype('category') # shows the '+' as we have object dtypes In [6]: df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 2 columns): A 1000 non-null object B 1000 non-null category dtypes: category(1), object(1) memory usage: 9.0+ KB # we have an accurate memory assessment (but can be expensive to compute this) In [7]: df.info(memory_usage='deep') <class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 2 columns): A 1000 non-null object B 1000 non-null category dtypes: category(1), object(1) memory usage: 75.4 KB
Index
now has a fillna
method (GH10089)
In [8]: pd.Index([1, np.nan, 3]).fillna(2) Out[8]: Float64Index([1.0, 2.0, 3.0], dtype='float64')
Series of type category
now make .str.<...>
and .dt.<...>
accessor methods / properties available, if the categories are of that type. (GH10661)
In [9]: s = pd.Series(list('aabb')).astype('category') In [10]: s Out[10]: 0 a 1 a 2 b 3 b dtype: category Categories (2, object): [a, b] In [11]: s.str.contains("a") Out[11]: 0 True 1 True 2 False 3 False dtype: bool In [12]: date = pd.Series(pd.date_range('1/1/2015', periods=5)).astype('category') In [13]: date Out[13]: 0 2015-01-01 1 2015-01-02 2 2015-01-03 3 2015-01-04 4 2015-01-05 dtype: category Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05] In [14]: date.dt.day Out[14]: 0 1 1 2 2 3 3 4 4 5 dtype: int64
pivot_table
now has a margins_name
argument so you can use something other than the default of ‘All’ (GH3335)
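For example (a minimal sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two'], 'B': [1, 2, 3]})
pd.pivot_table(df, index='A', values='B', aggfunc=np.sum,
               margins=True, margins_name='Total')  # margin row is labelled 'Total'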
Implement export of datetime64[ns, tz]
dtypes with a fixed HDF5 store (GH11411)
Pretty printing sets (e.g. in DataFrame cells) now uses set literal syntax ({x, y}
) instead of Legacy Python syntax (set([x, y])
) (GH11215)
Improve the error message in pandas.io.gbq.to_gbq()
when a streaming insert fails (GH11285) and when the DataFrame does not match the schema of the destination table (GH11359)
NotImplementedError
in Index.shift
for non-supported index types (GH8038)min
and max
reductions on datetime64
and timedelta64
dtyped series now result in NaT
and not nan
(GH11245).TypeError
, instead of a ValueError
(GH11356)Series.ptp
will now ignore missing values by default (GH11163)pandas.io.ga
module which implements google-analytics
support is deprecated and will be removed in a future version (GH11308)engine
keyword in .to_csv()
, which will be removed in a future version (GH11274)Series.dropna
performance improvement when its dtype can’t contain NaN
(GH11159)DatetimeIndex.year
, Series.dt.year
), normalization, and conversion to and from Period
, DatetimeIndex.to_period
and PeriodIndex.to_timestamp
(GH11263)rolling_median
, rolling_mean
, rolling_max
, rolling_min
, rolling_var
, rolling_kurt
, rolling_skew
(GH11450)read_csv
, read_table
(GH11272)rolling_median
(GH11450)to_excel
(GH11352)Categorical
categories, which was rendering the strings before chopping them for display (GH11305)Categorical.remove_unused_categories
, (GH11643).Series
constructor with no data and DatetimeIndex
(GH11433)shift
, cumprod
, and cumsum
with groupby (GH4095)SparseArray.__iter__()
now does not cause PendingDeprecationWarning
in Python 3.5 (GH11622)Series.sort_index()
now correctly handles the inplace
option (GH11402)PyPi
when reading a csv of floats and passing na_values=<a scalar>
would show an exception (GH11374).to_latex()
output broken when the index has a name (GH10660)HDFStore.append
with strings whose encoded length exceeded the max unencoded length (GH11234)datetime64[ns, tz]
dtypes (GH11405)HDFStore.select
when comparing with a numpy scalar in a where clause (GH11283)DataFrame.ix
with a multi-index indexer (GH11372)date_range
with ambiguous endpoints (GH11626).str
, .dt
and .cat
. Retrieving such a value was not possible, so error out on setting it. (GH10673).dt
accessors (GH11295)DataFrame.replace
with a datetime64[ns, tz]
and a non-compat to_replace (GH11326, GH11153)isnull
where numpy.datetime64('NaT')
in a numpy.array
was not determined to be null (GH11206)pivot_table
with margins=True
when indexes are of Categorical
dtype (GH10993)DataFrame.plot
cannot use hex strings colors (GH10299)DataFrame.drop_duplicates
from 0.16.2, causing incorrect results on integer values (GH11376)pd.eval
where unary ops in a list error (GH11235)squeeze()
with zero length arrays (GH11230, GH8999)describe()
dropping column names for hierarchical indexes (GH11517)DataFrame.pct_change()
not propagating axis
keyword on .fillna
method (GH11150).to_csv()
when a mix of integer and string column names are passed as the columns
parameter (GH11637)range
, (GH11652)to_sql
using unicode column names giving UnicodeEncodeError (GH11431).xticks
in plot
(GH11529).holiday.dates
where observance rules could not be applied to holiday and doc enhancement (GH11477, GH11533)Axes
instances instead of SubplotAxes
(GH11520, GH11556).DataFrame.to_latex()
produces an extra rule when header=False
(GH7124)df.groupby(...).apply(func)
when a func returns a Series
containing a new datetimelike column (GH11324)pandas.json
when file to load is big (GH11344)to_excel
with duplicate columns (GH11007, GH10982, GH10970)datetime64[ns, tz]
(GH11245).read_excel
with multi-index containing integers (GH11317)to_excel
with openpyxl 2.2+ and merging (GH11408)DataFrame.to_dict()
produces a np.datetime64
object instead of Timestamp
when only datetime is present in data (GH11327)DataFrame.corr()
raises exception when computes Kendall correlation for DataFrames with boolean and not boolean columns (GH11560)inline
functions on FreeBSD 10+ (with clang
) (GH10510)DataFrame.to_csv
in passing through arguments for formatting MultiIndexes
, including date_format
(GH7791)DataFrame.join()
with how='right'
producing a TypeError
(GH11519)Series.quantile
with empty list results has Index
with object
dtype (GH11588)pd.merge
results in empty Int64Index
rather than Index(dtype=object)
when the merge result is empty (GH11588)Categorical.remove_unused_categories
when having NaN
values (GH11599)DataFrame.to_sparse()
loses column names for MultiIndexes (GH11600)DataFrame.round()
with non-unique column index producing a Fatal Python error (GH11611)DataFrame.round()
with decimals
being a non-unique indexed Series producing extra columns (GH11618)
v0.17.0 (October 9, 2015)¶
This is a major release from 0.16.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Warning
pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (GH9118)
Warning
The pandas.io.data
package is deprecated and will be replaced by the pandas-datareader package. This will allow the data modules to be independently updated to your pandas installation. The API for pandas-datareader v0.1.1
is exactly the same as in pandas v0.17.0
(GH8961, GH10861).
After installing pandas-datareader, you can easily change your imports:
from pandas.io import data, wb
becomes
from pandas_datareader import data, wb
Highlights include:
Plotting methods are now available as attributes of the .plot accessor, see here
Support for a datetime64[ns] with timezones as a first-class dtype, see here
The default for to_datetime will now be to raise when presented with unparseable formats; previously this would return the original input. Also, date parse functions now return consistent results. See here
The default for dropna in HDFStore has changed to False, to store by default all rows even if they are all NaN, see here
Datetime accessor (dt) now supports Series.dt.strftime to generate formatted strings for datetime-likes, and Series.dt.total_seconds to generate each duration of the timedelta in seconds. See here
Period and PeriodIndex can handle a multiplied freq like 3D, which corresponds to a 3-day span. See here
Development installed versions of pandas will now have PEP440 compliant version strings (GH9518)
Check the API Changes and deprecations before updating.
New features¶
Datetime with TZ¶
We are adding an implementation that natively supports datetime with timezones. A Series or a DataFrame column previously could be assigned a datetime with timezones, and would work as an object dtype. This had performance issues with a large number of rows. See the docs for more details. (GH8260, GH10763, GH11034).
The new implementation allows for having a single timezone across all rows, with operations performed in a performant manner.
In [1]: df = DataFrame({'A' : date_range('20130101',periods=3), ...: 'B' : date_range('20130101',periods=3,tz='US/Eastern'), ...: 'C' : date_range('20130101',periods=3,tz='CET')}) ...: In [2]: df Out[2]: A B C 0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:00 1 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:00 2 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00 In [3]: df.dtypes Out[3]: A datetime64[ns] B datetime64[ns, US/Eastern] C datetime64[ns, CET] dtype: object
In [4]: df.B Out[4]: 0 2013-01-01 00:00:00-05:00 1 2013-01-02 00:00:00-05:00 2 2013-01-03 00:00:00-05:00 Name: B, dtype: datetime64[ns, US/Eastern] In [5]: df.B.dt.tz_localize(None) Out[5]: 0 2013-01-01 1 2013-01-02 2 2013-01-03 Name: B, dtype: datetime64[ns]
This uses a new-dtype representation as well, that is very similar in look-and-feel to its numpy cousin datetime64[ns]
In [6]: df['B'].dtype Out[6]: datetime64[ns, US/Eastern] In [7]: type(df['B'].dtype) Out[7]: pandas.core.dtypes.dtypes.DatetimeTZDtype
Note
There is a slightly different string repr for the underlying DatetimeIndex
as a result of the dtype changes, but functionally these are the same.
Previous Behavior:
In [1]: pd.date_range('20130101',periods=3,tz='US/Eastern') Out[1]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns]', freq='D', tz='US/Eastern') In [2]: pd.date_range('20130101',periods=3,tz='US/Eastern').dtype Out[2]: dtype('<M8[ns]')
New Behavior:
In [8]: pd.date_range('20130101',periods=3,tz='US/Eastern') Out[8]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D') In [9]: pd.date_range('20130101',periods=3,tz='US/Eastern').dtype Out[9]: datetime64[ns, US/Eastern]
Releasing the GIL¶
We are releasing the global-interpreter-lock (GIL) on some cython operations. This will allow other threads to run simultaneously during computation, potentially allowing performance improvements from multi-threading. Notably groupby
, nsmallest
, value_counts
and some indexing operations benefit from this. (GH8882)
For example the groupby expression in the following code will have the GIL released during the factorization step, e.g. df.groupby('key')
as well as the .sum()
operation.
N = 1000000 ngroups = 10 df = DataFrame({'key' : np.random.randint(0,ngroups,size=N), 'data' : np.random.randn(N) }) df.groupby('key')['data'].sum()
Releasing the GIL could benefit an application that uses threads for user interactions (e.g. QT), or one performing multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask library.
Plot submethods¶
The Series and DataFrame .plot() method allows for customizing plot types by supplying the kind keyword argument. Unfortunately, many of these kinds of plots use different required and optional keyword arguments, which makes it difficult to discover what any given plot kind uses out of the dozens of possible arguments.
To alleviate this issue, we have added a new, optional plotting interface, which exposes each kind of plot as a method of the .plot
attribute. Instead of writing series.plot(kind=<kind>, ...)
, you can now also use series.plot.<kind>(...)
:
In [10]: df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b']) In [11]: df.plot.bar()
As a result of this change, these methods are now all discoverable via tab-completion:
In [12]: df.plot.<TAB> df.plot.area df.plot.barh df.plot.density df.plot.hist df.plot.line df.plot.scatter df.plot.bar df.plot.box df.plot.hexbin df.plot.kde df.plot.pie
Each method signature only includes relevant arguments. Currently, these are limited to required arguments, but in the future these will include optional arguments, as well. For an overview, see the new Plotting API documentation.
Additional methods for the dt accessor¶
strftime¶
We are now supporting a Series.dt.strftime method for datetime-likes to generate a formatted string (GH10110). Examples:
# DatetimeIndex In [13]: s = pd.Series(pd.date_range('20130101', periods=4)) In [14]: s Out[14]: 0 2013-01-01 1 2013-01-02 2 2013-01-03 3 2013-01-04 dtype: datetime64[ns] In [15]: s.dt.strftime('%Y/%m/%d') Out[15]: 0 2013/01/01 1 2013/01/02 2 2013/01/03 3 2013/01/04 dtype: object
# PeriodIndex In [16]: s = pd.Series(pd.period_range('20130101', periods=4)) In [17]: s Out[17]: 0 2013-01-01 1 2013-01-02 2 2013-01-03 3 2013-01-04 dtype: object In [18]: s.dt.strftime('%Y/%m/%d') Out[18]: 0 2013/01/01 1 2013/01/02 2 2013/01/03 3 2013/01/04 dtype: object
The string format follows the python standard library, and details can be found here
total_seconds¶
pd.Series of type timedelta64 has a new method .dt.total_seconds() returning the duration of the timedelta in seconds (GH10817)
# TimedeltaIndex In [19]: s = pd.Series(pd.timedelta_range('1 minutes', periods=4)) In [20]: s Out[20]: 0 0 days 00:01:00 1 1 days 00:01:00 2 2 days 00:01:00 3 3 days 00:01:00 dtype: timedelta64[ns] In [21]: s.dt.total_seconds() Out[21]: 0 60.0 1 86460.0 2 172860.0 3 259260.0 dtype: float64
Period Frequency Enhancement¶
Period
, PeriodIndex
and period_range
can now accept multiplied freq. Also, Period.freq
and PeriodIndex.freq
are now stored as a DateOffset
instance like DatetimeIndex
, and not as str
(GH7811)
A multiplied freq represents a span of corresponding length. The example below creates a period of 3 days. Addition and subtraction will shift the period by its span.
In [22]: p = pd.Period('2015-08-01', freq='3D') In [23]: p Out[23]: Period('2015-08-01', '3D') In [24]: p + 1 Out[24]: Period('2015-08-04', '3D') In [25]: p - 2 Out[25]: Period('2015-07-26', '3D') In [26]: p.to_timestamp() Out[26]: Timestamp('2015-08-01 00:00:00') In [27]: p.to_timestamp(how='E') Out[27]: Timestamp('2015-08-03 00:00:00')
You can use the multiplied freq in PeriodIndex
and period_range
.
In [28]: idx = pd.period_range('2015-08-01', periods=4, freq='2D') In [29]: idx Out[29]: PeriodIndex(['2015-08-01', '2015-08-03', '2015-08-05', '2015-08-07'], dtype='period[2D]', freq='2D') In [30]: idx + 1 Out[30]: PeriodIndex(['2015-08-03', '2015-08-05', '2015-08-07', '2015-08-09'], dtype='period[2D]', freq='2D')
Support for SAS XPORT files¶
read_sas()
provides support for reading SAS XPORT format files. (GH4052).
df = pd.read_sas('sas_xport.xpt')
It is also possible to obtain an iterator and read an XPORT file incrementally.
for df in pd.read_sas('sas_xport.xpt', chunksize=10000):
    do_something(df)
See the docs for more details.
Support for Math Functions in .eval()¶
eval() now supports calling math functions (GH4893)
df = pd.DataFrame({'a': np.random.randn(10)}) df.eval("b = sin(a)")
The supported math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
These functions map to the intrinsics for the NumExpr
engine. For the Python engine, they are mapped to NumPy
calls.
Changes to Excel with MultiIndex¶
In version 0.16.2 a DataFrame with MultiIndex columns could not be written to Excel via to_excel. That functionality has been added (GH10564), along with updating read_excel so that the data can be read back with no loss of information by specifying which columns/rows make up the MultiIndex in the header and index_col parameters (GH4679)
See the documentation for more details.
In [31]: df = pd.DataFrame([[1,2,3,4], [5,6,7,8]], ....: columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']], ....: names = ['col1', 'col2']), ....: index = pd.MultiIndex.from_product([['j'], ['l', 'k']], ....: names = ['i1', 'i2'])) ....: In [32]: df Out[32]: col1 foo bar col2 a b a b i1 i2 j l 1 2 3 4 k 5 6 7 8 In [33]: df.to_excel('test.xlsx') In [34]: df = pd.read_excel('test.xlsx', header=[0,1], index_col=[0,1]) In [35]: df Out[35]: col1 foo bar col2 a b a b i1 i2 j l 1 2 3 4 k 5 6 7 8
Previously, it was necessary to specify the has_index_names argument in read_excel if the serialized data had index names. For version 0.17.0 the output format of to_excel has been changed to make this keyword unnecessary - the change is shown below.
Old
New
Warning
Excel files saved in version 0.16.2 or prior that had index names will still be able to be read in, but the has_index_names argument must be specified as True.
pandas.io.gbq.to_gbq()
function if the destination table/dataset does not exist. (GH8325, GH11121).pandas.io.gbq.to_gbq()
function via the if_exists
argument. See the docs for more details (GH8325).InvalidColumnOrder
and InvalidPageToken
in the gbq module will raise ValueError
instead of IOError
.generate_bq_schema()
function is now deprecated and will be removed in a future version (GH11121)
Display Alignment with Unicode East Asian Width¶
Warning
Enabling this option will affect the performance for printing of DataFrame
and Series
(about 2 times slower). Use only when it is actually required.
Some East Asian countries use Unicode characters whose width corresponds to two alphabetic characters. If a DataFrame or Series contains these characters, the default output cannot be aligned properly. The following options are added to enable precise handling for these characters.
display.unicode.east_asian_width
: Whether to use the Unicode East Asian Width to calculate the display text width. (GH2612)display.unicode.ambiguous_as_wide
: Whether to handle Unicode characters that belong to Ambiguous as Wide. (GH11102)
In [36]: df = pd.DataFrame({u'国籍': ['UK', u'日本'], u'名前': ['Alice', u'しのぶ']}) In [37]: df;
In [38]: pd.set_option('display.unicode.east_asian_width', True) In [39]: df;
For further details, see here
Other enhancements¶
Support for openpyxl >= 2.2. The API for style support is now stable (GH10125)
merge
now accepts the argument indicator
which adds a Categorical-type column (by default called _merge
) to the output object that takes on the values (GH8790)
Observation Origin | _merge value
Merge key only in 'left' frame | left_only
Merge key only in 'right' frame | right_only
Merge key in both frames | both
In [40]: df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']}) In [41]: df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]}) In [42]: pd.merge(df1, df2, on='col1', how='outer', indicator=True) Out[42]: col1 col_left col_right _merge 0 0 a NaN left_only 1 1 b 2.0 both 2 2 NaN 2.0 right_only 3 2 NaN 2.0 right_only
For more, see the updated docs
pd.to_numeric
is a new function to coerce strings to numbers (possibly with coercion) (GH11133)
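For example:
import pandas as pd

pd.to_numeric(pd.Series(['1.0', '2', -3]))                        # numeric Series
pd.to_numeric(pd.Series(['apple', '1.0', '2']), errors='coerce')  # 'apple' becomes NaN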
pd.merge
will now allow duplicate column names if they are not merged upon (GH10639).
pd.pivot
will now allow passing index as None
(GH3962).
pd.concat
will now use existing Series names if provided (GH10698).
In [43]: foo = pd.Series([1,2], name='foo') In [44]: bar = pd.Series([1,2]) In [45]: baz = pd.Series([4,5])
Previous Behavior:
In [1]: pd.concat([foo, bar, baz], 1) Out[1]: 0 1 2 0 1 1 4 1 2 2 5
New Behavior:
In [46]: pd.concat([foo, bar, baz], 1) Out[46]: foo 0 1 0 1 1 4 1 2 2 5
DataFrame
has gained the nlargest
and nsmallest
methods (GH10393)
Add a limit_direction
keyword argument that works with limit
to enable interpolate
to fill NaN
values forward, backward, or both (GH9218, GH10420, GH11115)
In [47]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13]) In [48]: ser.interpolate(limit=1, limit_direction='both') Out[48]: 0 NaN 1 5.0 2 5.0 3 7.0 4 NaN 5 11.0 6 13.0 dtype: float64
Added a DataFrame.round
method to round the values to a variable number of decimal places (GH10568).
In [49]: df = pd.DataFrame(np.random.random([3, 3]), columns=['A', 'B', 'C'], ....: index=['first', 'second', 'third']) ....: In [50]: df Out[50]: A B C first 0.342764 0.304121 0.417022 second 0.681301 0.875457 0.510422 third 0.669314 0.585937 0.624904 In [51]: df.round(2) Out[51]: A B C first 0.34 0.30 0.42 second 0.68 0.88 0.51 third 0.67 0.59 0.62 In [52]: df.round({'A': 0, 'C': 2}) Out[52]: A B C first 0.0 0.304121 0.42 second 1.0 0.875457 0.51 third 1.0 0.585937 0.62
drop_duplicates
and duplicated
now accept a keep
keyword to target first, last, and all duplicates. The take_last
keyword is deprecated, see here (GH6511, GH8505)
In [53]: s = pd.Series(['A', 'B', 'C', 'A', 'B', 'D']) In [54]: s.drop_duplicates() Out[54]: 0 A 1 B 2 C 5 D dtype: object In [55]: s.drop_duplicates(keep='last') Out[55]: 2 C 3 A 4 B 5 D dtype: object In [56]: s.drop_duplicates(keep=False) Out[56]: 2 C 5 D dtype: object
Reindex now has a tolerance
argument that allows for finer control of limits on filling while reindexing (GH10411):
In [57]: df = pd.DataFrame({'x': range(5), ....: 't': pd.date_range('2000-01-01', periods=5)}) ....: In [58]: df.reindex([0.1, 1.9, 3.5], ....: method='nearest', ....: tolerance=0.2) ....: Out[58]: t x 0.1 2000-01-01 0.0 1.9 2000-01-03 2.0 3.5 NaT NaN
When used on a DatetimeIndex
, TimedeltaIndex
or PeriodIndex
, tolerance
will be coerced into a Timedelta
if possible. This allows you to specify tolerance with a string:
In [59]: df = df.set_index('t') In [60]: df.reindex(pd.to_datetime(['1999-12-31']), ....: method='nearest', ....: tolerance='1 day') ....: Out[60]: x 1999-12-31 0
tolerance
is also exposed by the lower level Index.get_indexer
and Index.get_loc
methods.
Added functionality to use the base
argument when resampling a TimedeltaIndex
(GH10530)
DatetimeIndex
can be instantiated using strings containing NaT
(GH7599)
to_datetime
can now accept the yearfirst
keyword (GH7599)
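For example:
import pandas as pd

pd.to_datetime('10-11-12', yearfirst=True)  # Timestamp('2010-11-12 00:00:00')
pd.to_datetime('10-11-12')                  # Timestamp('2012-10-11 00:00:00'), month first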
pandas.tseries.offsets
larger than the Day
offset can now be used with a Series
for addition/subtraction (GH10699). See the docs for more details.
pd.Timedelta.total_seconds()
now returns Timedelta duration to ns precision (previously microsecond precision) (GH10939)
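For example:
import pandas as pd

pd.Timedelta('1 min').total_seconds()                     # 60.0
pd.Timedelta(minutes=1, nanoseconds=250).total_seconds()  # 60.00000025, nanoseconds kept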
PeriodIndex
now supports arithmetic with np.ndarray
(GH10638)
Support pickling of Period
objects (GH10439)
.as_blocks
will now take a copy
optional argument to return a copy of the data; the default is to copy (no change in behavior from prior versions) (GH9607)
regex
argument to DataFrame.filter
now handles numeric column names instead of raising ValueError
(GH10384).
Enable reading gzip compressed files via URL, either by explicitly setting the compression parameter or by inferring from the presence of the HTTP Content-Encoding header in the response (GH8685)
Enable writing Excel files in memory using StringIO/BytesIO (GH7074)
Enable serialization of lists and dicts to strings in ExcelWriter
(GH8188)
SQL io functions now accept a SQLAlchemy connectable. (GH7877)
pd.read_sql
and to_sql
can accept database URI as con
parameter (GH10214)
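A minimal sketch (requires SQLAlchemy; the URI and table name are hypothetical):
import pandas as pd

uri = 'sqlite:///example.db'
df = pd.DataFrame({'a': [1, 2]})
df.to_sql('demo', uri, index=False)     # the URI string is accepted directly as con
pd.read_sql('SELECT * FROM demo', uri)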
read_sql_table
will now allow reading from views (GH10750).
Enable writing complex values to HDFStores
when using the table
format (GH10447)
Enable pd.read_hdf
to be used without specifying a key when the HDF file contains a single dataset (GH10443)
pd.read_stata
will now read Stata 118 type files. (GH9882)
msgpack
submodule has been updated to 0.4.6 with backward compatibility (GH10581)
DataFrame.to_dict
now accepts orient='index'
keyword argument (GH10844).
DataFrame.apply
will return a Series of dicts if the passed function returns a dict and reduce=True
(GH8735).
Allow passing kwargs to the interpolation methods (GH10378).
Improved error message when concatenating an empty iterable of DataFrame
objects (GH9157)
pd.read_csv
can now read bz2-compressed files incrementally, and the C parser can read bz2-compressed files from AWS S3 (GH11070, GH11072).
In pd.read_csv
, recognize s3n://
and s3a://
URLs as designating S3 file storage (GH11070, GH11071).
Read CSV files from AWS S3 incrementally, instead of first downloading the entire file. (Full file download still required for compressed files in Python 2.) (GH11070, GH11073)
pd.read_csv
is now able to infer compression type for files read from AWS S3 storage (GH11070, GH11074).
The sorting API has had some longtime inconsistencies. (GH9816, GH8239).
Here is a summary of the API PRIOR to 0.17.0:
Series.sort is INPLACE while DataFrame.sort returns a new object.
Series.order returns a new object.
It was possible to use Series/DataFrame.sort_index to sort by values by passing the by keyword.
Series/DataFrame.sortlevel worked only on a MultiIndex for sorting by index.
To address these issues, we have revamped the API:
We have introduced a new method, DataFrame.sort_values(), which is the merger of DataFrame.sort(), Series.sort(), and Series.order(), to handle sorting of values.
The existing methods Series.sort(), Series.order(), and DataFrame.sort() have been deprecated and will be removed in a future version.
The by argument of DataFrame.sort_index() has been deprecated and will be removed in a future version.
The existing method .sort_index() will gain the level keyword to enable level sorting.
We now have two distinct and non-overlapping methods of sorting. A * marks items that will show a FutureWarning.
To sort by the values:

Previous                        Replacement
* Series.order()                Series.sort_values()
* Series.sort()                 Series.sort_values(inplace=True)
* DataFrame.sort(columns=...)   DataFrame.sort_values(by=...)

To sort by the index:

Previous                        Replacement
Series.sort_index()             Series.sort_index()
Series.sortlevel(level=...)     Series.sort_index(level=...)
DataFrame.sort_index()          DataFrame.sort_index()
DataFrame.sortlevel(level=...)  DataFrame.sort_index(level=...)
* DataFrame.sort()              DataFrame.sort_index()

We have also deprecated and changed similar methods in two Series-like classes, Index and Categorical.

Previous                        Replacement
* Index.order()                 Index.sort_values()
* Categorical.order()           Categorical.sort_values()
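As a quick illustration of the split shown in the tables above (a minimal sketch; the frame is invented):

df = pd.DataFrame({'a': [2, 1, 3]}, index=[10, 30, 20])
df.sort_values(by='a')   # replaces the deprecated df.sort(columns='a')
df.sort_index()          # index-based sorting, kept under sort_index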
Changes to to_datetime and to_timedelta¶
Error handling¶
The default for pd.to_datetime
error handling has changed to errors='raise'
. In prior versions it was errors='ignore'
. Furthermore, the coerce
argument has been deprecated in favor of errors='coerce'
. This means that invalid parsing will raise rather than return the original input as it did in previous versions. (GH10636)
Previous Behavior:
In [2]: pd.to_datetime(['2009-07-31', 'asd']) Out[2]: array(['2009-07-31', 'asd'], dtype=object)
New Behavior:
In [3]: pd.to_datetime(['2009-07-31', 'asd']) ValueError: Unknown string format
Of course you can coerce this as well.
In [61]: to_datetime(['2009-07-31', 'asd'], errors='coerce') Out[61]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None)
To keep the previous behavior, you can use errors='ignore'
:
In [62]: to_datetime(['2009-07-31', 'asd'], errors='ignore') Out[62]: array(['2009-07-31', 'asd'], dtype=object)
Furthermore, pd.to_timedelta
has gained a similar API, of errors='raise'|'ignore'|'coerce'
, and the coerce
keyword has been deprecated in favor of errors='coerce'
.
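A minimal sketch of the shared errors= behavior for pd.to_timedelta:

pd.to_timedelta(['1 day', 'foo'], errors='coerce')   # invalid entries become NaT
pd.to_timedelta(['1 day', 'foo'], errors='ignore')   # returns the input unchanged
# errors='raise' (the default) raises on the invalid entry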
The string parsing of to_datetime
, Timestamp
and DatetimeIndex
has been made consistent. (GH7599)
Prior to v0.17.0, Timestamp
and to_datetime
could parse a year-only datetime string incorrectly, using today's date, whereas DatetimeIndex
uses the beginning of the year. Timestamp
and to_datetime
may raise ValueError
for some kinds of datetime strings that DatetimeIndex
can parse, such as a quarterly string.
Previous Behavior:
In [1]: Timestamp('2012Q2')
Traceback
   ...
ValueError: Unable to parse 2012Q2

# Results in today's date.
In [2]: Timestamp('2014')
Out[2]: 2014-08-12 00:00:00
v0.17.0 can parse them as below. It works on DatetimeIndex
also.
New Behavior:
In [63]: Timestamp('2012Q2')
Out[63]: Timestamp('2012-04-01 00:00:00')

In [64]: Timestamp('2014')
Out[64]: Timestamp('2014-01-01 00:00:00')

In [65]: DatetimeIndex(['2012Q2', '2014'])
Out[65]: DatetimeIndex(['2012-04-01', '2014-01-01'], dtype='datetime64[ns]', freq=None)
Note
If you want to perform calculations based on today’s date, use Timestamp.now()
and pandas.tseries.offsets
.
In [66]: import pandas.tseries.offsets as offsets

In [67]: Timestamp.now()
Out[67]: Timestamp('2017-07-07 12:29:28.795000')

In [68]: Timestamp.now() + offsets.DateOffset(years=1)
Out[68]: Timestamp('2018-07-07 12:29:28.796446')

Changes to Index Comparisons¶
Operator equal on Index
should behave similarly to Series
(GH9947, GH10637)
Starting in v0.17.0, comparing Index
objects of different lengths will raise a ValueError
. This is to be consistent with the behavior of Series
.
Previous Behavior:
In [2]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[2]: array([ True, False, False], dtype=bool)

In [3]: pd.Index([1, 2, 3]) == pd.Index([2])
Out[3]: array([False, True, False], dtype=bool)

In [4]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
Out[4]: False
New Behavior:
In [8]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[8]: array([ True, False, False], dtype=bool)

In [9]: pd.Index([1, 2, 3]) == pd.Index([2])
ValueError: Lengths must match to compare

In [10]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
ValueError: Lengths must match to compare
Note that this is different from the numpy
behavior where a comparison can be broadcast:
In [69]: np.array([1, 2, 3]) == np.array([1]) Out[69]: array([ True, False, False], dtype=bool)
or it can return False if broadcasting cannot be done:
In [70]: np.array([1, 2, 3]) == np.array([1, 2])
Out[70]: False

Changes to Boolean Comparisons vs. None¶
Boolean comparisons of a Series
vs None
will now be equivalent to comparing with np.nan
, rather than raise TypeError
. (GH1079).
In [71]: s = Series(range(3))

In [72]: s.iloc[1] = None

In [73]: s
Out[73]:
0    0.0
1    NaN
2    2.0
dtype: float64
Previous Behavior:
In [5]: s==None
TypeError: Could not compare <type 'NoneType'> type with Series
New Behavior:
In [74]: s==None
Out[74]:
0    False
1    False
2    False
dtype: bool
Usually you simply want to know which values are null.
In [75]: s.isnull()
Out[75]:
0    False
1     True
2    False
dtype: bool
Warning
You generally will want to use isnull/notnull
for these types of comparisons, as isnull/notnull
tells you which elements are null. One has to be mindful that nan's
don’t compare equal, but None's
do. Note that Pandas/numpy uses the fact that np.nan != np.nan
, and treats None
like np.nan
.
In [76]: None == None
Out[76]: True

In [77]: np.nan == np.nan
Out[77]: False

HDFStore dropna behavior¶
The default behavior for HDFStore write functions with format='table'
is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing except for the index. The previous behavior can be replicated using the dropna=True
option. (GH9382)
Previous Behavior:
In [78]: df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
   ....:                                 'col2': [1, np.nan, np.nan]})

In [79]: df_with_missing
Out[79]:
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN
In [27]: df_with_missing.to_hdf('file.h5', 'df_with_missing', format='table', mode='w')

In [28]: pd.read_hdf('file.h5', 'df_with_missing')
Out[28]:
   col1  col2
0     0     1
2     2   NaN
New Behavior:
In [80]: df_with_missing.to_hdf('file.h5',
   ....:                        'df_with_missing',
   ....:                        format='table',
   ....:                        mode='w')

In [81]: pd.read_hdf('file.h5', 'df_with_missing')
Out[81]:
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN
See the docs for more details.
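To replicate the previous row-dropping behavior, pass dropna=True explicitly (a minimal sketch reusing the frame above):

df_with_missing.to_hdf('file.h5', 'df_with_missing',
                       format='table', mode='w', dropna=True)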
Changes to the display.precision option¶
The display.precision
option has been clarified to refer to decimal places (GH10451).
Earlier versions of pandas would format floating point numbers to have one less decimal place than the value in display.precision
.
In [1]: pd.set_option('display.precision', 2)

In [2]: pd.DataFrame({'x': [123.456789]})
Out[2]:
       x
0  123.5
If interpreting precision as "significant figures", this worked for scientific notation, but the same interpretation did not work for values with standard formatting. It was also out of step with how numpy handles formatting.
Going forward the value of display.precision
will directly control the number of places after the decimal, for regular formatting as well as scientific notation, similar to how numpy’s precision
print option works.
In [82]: pd.set_option('display.precision', 2)

In [83]: pd.DataFrame({'x': [123.456789]})
Out[83]:
        x
0  123.46
To preserve output behavior with prior versions the default value of display.precision
has been reduced to 6
from 7
.
Categorical.unique
¶
Categorical.unique
now returns new Categoricals
with categories
and codes
that are unique, rather than returning np.array
(GH10508)
In [84]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
   ....:                      categories=['A', 'B', 'C'],
   ....:                      ordered=True)

In [85]: cat
Out[85]:
[C, A, B, C]
Categories (3, object): [A < B < C]

In [86]: cat.unique()
Out[86]:
[C, A, B]
Categories (3, object): [A < B < C]

In [87]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
   ....:                      categories=['A', 'B', 'C'])

In [88]: cat
Out[88]:
[C, A, B, C]
Categories (3, object): [A, B, C]

In [89]: cat.unique()
Out[89]:
[C, A, B]
Categories (3, object): [C, A, B]

Other API Changes¶
Line and kde plot with subplots=True
now uses default colors, not all black. Specify color='k'
to draw all lines in black (GH9894)
Calling the .value_counts()
method on a Series with a categorical
dtype now returns a Series with a CategoricalIndex
(GH10704)
The metadata properties of subclasses of pandas objects will now be serialized (GH10553).
groupby
using Categorical
follows the same rule as Categorical.unique
described above (GH10508)
Previously, constructing a DataFrame with an array of complex64 dtype meant the corresponding column was automatically promoted to the complex128 dtype. Pandas will now preserve the itemsize of the input for complex data (GH10952)
Some numeric reduction operators would raise ValueError, rather than TypeError, on object types that include strings and numbers (GH11131)
Passing currently unsupported chunksize
argument to read_excel
or ExcelFile.parse
will now raise NotImplementedError
(GH8011)
Allow an ExcelFile
object to be passed into read_excel
(GH11198)
DatetimeIndex.union
does not infer freq
if self
and the input have None
as freq
(GH11086)
NaT's methods now either raise ValueError, or return np.nan or NaT (GH9513):
return np.nan: weekday, isoweekday
return NaT: date, now, replace, to_datetime, today
return np.datetime64('NaT'): to_datetime64 (unchanged)
raise ValueError: all other public methods (names not beginning with underscores)
For Series
the following indexing functions are deprecated (GH10177).
.irow(i)         .iloc[i] or .iat[i]
.iget(i)         .iloc[i] or .iat[i]
.iget_value(i)   .iloc[i] or .iat[i]
For DataFrame
the following indexing functions are deprecated (GH10177).
.irow(i)            .iloc[i]
.iget_value(i, j)   .iloc[i, j] or .iat[i, j]
.icol(j)            .iloc[:, j]
Note
These indexing functions have been deprecated in the documentation since 0.11.0.
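A minimal sketch of the replacements (data invented):

s = pd.Series([10, 20, 30])
s.iloc[1]       # instead of the deprecated s.irow(1)
s.iat[1]        # fast scalar access, also a replacement

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.iloc[0]      # instead of df.irow(0)
df.iloc[:, 1]   # instead of df.icol(1)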
Categorical.name was deprecated to make Categorical more numpy.ndarray like. Use Series(cat, name="whatever") instead (GH10482).
Setting missing values (NaN) in a Categorical's categories will issue a warning (GH10748). You can still have missing values in the values.
drop_duplicates and duplicated's take_last keyword was deprecated in favor of keep. (GH6511, GH8505)
Series.nsmallest and nlargest's take_last keyword was deprecated in favor of keep. (GH10792)
DataFrame.combineAdd and DataFrame.combineMult are deprecated. They can easily be replaced by using the add and mul methods: DataFrame.add(other, fill_value=0) and DataFrame.mul(other, fill_value=1.) (GH10735).
TimeSeries deprecated in favor of Series (note that this has been an alias since 0.13.0) (GH10890)
SparsePanel deprecated and will be removed in a future version (GH11157).
Series.is_time_series deprecated in favor of Series.index.is_all_dates (GH11135)
Legacy offsets (like 'A@JAN') are deprecated (note that this has been an alias since 0.8.0) (GH10878)
WidePanel deprecated in favor of Panel, LongPanel in favor of DataFrame (note these have been aliases since < 0.11.0) (GH10892)
DataFrame.convert_objects has been deprecated in favor of type-specific functions pd.to_datetime, pd.to_timestamp and pd.to_numeric (new in 0.17.0) (GH11133).

Removal of prior version deprecations/changes¶
Removal of na_last parameters from Series.order() and Series.sort(), in favor of na_position. (GH5231)
Removal of percentile_width
from .describe()
, in favor of percentiles
. (GH7088)
Removal of colSpace
parameter from DataFrame.to_string()
, in favor of col_space
, circa version 0.8.0.
Removal of automatic time-series broadcasting (GH2304)
In [90]: np.random.seed(1234)

In [91]: df = DataFrame(np.random.randn(5,2),
   ....:                columns=list('AB'),
   ....:                index=date_range('20130101',periods=5))

In [92]: df
Out[92]:
                   A         B
2013-01-01  0.471435 -1.190976
2013-01-02  1.432707 -0.312652
2013-01-03 -0.720589  0.887163
2013-01-04  0.859588 -0.636524
2013-01-05  0.015696 -2.242685
Previously
In [3]: df + df.A
FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.
Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index

Out[3]:
                   A         B
2013-01-01  0.942870 -0.719541
2013-01-02  2.865414  1.120055
2013-01-03 -1.441177  0.166574
2013-01-04  1.719177  0.223065
2013-01-05  0.031393 -2.226989
Current
In [93]: df.add(df.A, axis='index')
Out[93]:
                   A         B
2013-01-01  0.942870 -0.719541
2013-01-02  2.865414  1.120055
2013-01-03 -1.441177  0.166574
2013-01-04  1.719177  0.223065
2013-01-05  0.031393 -2.226989
Remove table
keyword in HDFStore.put/append
, in favor of using format=
(GH4645)
Remove kind
in read_excel/ExcelFile
as it is unused (GH4712)
Remove infer_type
keyword from pd.read_html
as it is unused (GH4770, GH7032)
Remove offset
and timeRule
keywords from Series.tshift/shift
, in favor of freq
(GH4853, GH4864)
Remove pd.load/pd.save
aliases in favor of pd.to_pickle/pd.read_pickle
(GH3787)
Categorical.value_counts
(GH10804)SeriesGroupBy.nunique
and SeriesGroupBy.value_counts
and SeriesGroupby.transform
(GH10820, GH11077)DataFrame.drop_duplicates
with integer dtypes (GH10917)DataFrame.duplicated
with wide frames. (GH10161, GH11180)timedelta
string parsing (GH6755, GH10426)timedelta64
and datetime64
ops (GH6755)MultiIndex
with slicers (GH10287)iloc
using list-like input (GH10791)Series.isin
for datetimelike/integer Series (GH10287)concat
of Categoricals when categories are identical (GH10587)to_datetime
when specified format string is ISO8601 (GH10178)Series.value_counts
for float dtype (GH10821)infer_datetime_format
in to_datetime
when date components do not have 0 padding (GH11142)DataFrame
from nested dictionary (GH11084)DateOffset
with Series
or DatetimeIndex
(GH10744, GH11205).mean()
on timedelta64[ns]
because of overflow (GH9442).isin
on older numpies (GH11232)DataFrame.to_html(index=False)
renders unnecessary name
row (GH10344)DataFrame.to_latex()
the column_format
argument could not be passed (GH9402)DatetimeIndex
when localizing with NaT
(GH10477)Series.dt
ops in preserving meta-data (GH10477)NaT
when passed in an otherwise invalid to_datetime
construction (GH10477)DataFrame.apply
when function returns categorical series. (GH9573)to_datetime
with invalid dates and formats supplied (GH10154)Index.drop_duplicates
dropping name(s) (GH10115)Series.quantile
dropping name (GH10881)pd.Series
when setting a value on an empty Series
whose index has a frequency. (GH10193)pd.Series.interpolate
with invalid order
keyword values. (GH10633)DataFrame.plot
raises ValueError
when color name is specified by multiple characters (GH10387)Index
construction with a mixed list of tuples (GH10697)DataFrame.reset_index
when index contains NaT
. (GH10388)ExcelReader
when worksheet is empty (GH6403)BinGrouper.group_info
where returned values are not compatible with base class (GH10914)DataFrame.pop
and a subsequent inplace op (GH10912)Index
causing an ImportError
(GH10610)Series.count
when index has nulls (GH10946)DatetimeIndex
(GH11002)DataFrame.where
to not respect the axis
parameter when the frame has a symmetric shape. (GH9736)Table.select_column
where name is not preserved (GH10392)offsets.generate_range
where start
and end
have finer precision than offset
(GH9907)pd.rolling_*
where Series.name
would be lost in the output (GH10565)stack
when index or columns are not unique. (GH10417)Panel
when an axis has a multi-index (GH10360)USFederalHolidayCalendar
where USMemorialDay
and USMartinLutherKingJr
were incorrect (GH10278 and GH9760 ).sample()
where returned object, if set, gives unnecessary SettingWithCopyWarning
(GH10738).sample()
where weights passed as Series
were not aligned along axis before being treated positionally, potentially causing problems if weight indices were not aligned with sampled object. (GH10738)DataFrame.interpolate
with axis=1
and inplace=True
(GH10395)io.sql.get_schema
when specifying multiple columns as primary key (GH10385).groupby(sort=False)
with datetime-like Categorical
raises ValueError
(GH10505)groupby(axis=1)
with filter()
throws IndexError
(GH11041)test_categorical
on big-endian builds (GH10425)Series.shift
and DataFrame.shift
not supporting categorical data (GH9416)Series.map
using categorical Series
raises AttributeError
(GH10324)MultiIndex.get_level_values
including Categorical
raises AttributeError
(GH10460)pd.get_dummies
with sparse=True
not returning SparseDataFrame
(GH10531)Index
subtypes (such as PeriodIndex
) not returning their own type for .drop
and .insert
methods (GH10620)algos.outer_join_indexer
when right
array is empty (GH10618)filter
(regression from 0.16.0) and transform
when grouping on multiple keys, one of which is datetime-like (GH10114)to_datetime
and to_timedelta
causing Index
name to be lost (GH10875)len(DataFrame.groupby)
causing IndexError
when there’s a column containing only NaNs (GH11016)DatetimeIndex
and PeriodIndex.value_counts
resets name from its result, but retains in result’s Index
. (GH10150)pd.eval
using numexpr
engine coerces 1 element numpy array to scalar (GH10546)pd.concat
with axis=0
when column is of dtype category
(GH10177)read_msgpack
where input type is not always checked (GH10369, GH10630)pd.read_csv
with kwargs index_col=False
, index_col=['a', 'b']
or dtype
(GH10413, GH10467, GH10577)Series.from_csv
with header
kwarg not setting the Series.name
or the Series.index.name
(GH10483)groupby.var
which caused variance to be inaccurate for small float values (GH10448)Series.plot(kind='hist')
Y Label not informative (GH10485)read_csv
when using a converter which generates a uint8
type (GH9266)Panel
sliced along the major or minor axes when the right-hand side is a DataFrame
(GH11014)None
and does not raise NotImplementedError
when operator functions (e.g. .add
) of Panel
are not implemented (GH7692)subplots=True
(GH9894)DataFrame.plot
raises ValueError
when color name is specified by multiple characters (GH10387)align
of Series
with MultiIndex
may be inverted (GH10665)join
of with MultiIndex
may be inverted (GH10741)read_stata
when reading a file with a different order set in columns
(GH10757)Categorical
may not representing properly when category contains tz
or Period
(GH10713)Categorical.__iter__
may not returning correct datetime
and Period
(GH10713)PeriodIndex
on an object with a PeriodIndex
(GH4125)read_csv
with engine='c'
: EOF preceded by a comment, blank line, etc. was not handled correctly (GH10728, GH10548)DataReader
results in HTTP 404 error because of the website url is changed (GH10591).read_msgpack
where DataFrame to decode has duplicate column names (GH9618)io.common.get_filepath_or_buffer
which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (GH10604)datetime.date
and numpy datetime64
(GH10408, GH10412)Index.take
may add unnecessary freq
attribute (GH10791)merge
with empty DataFrame
may raise IndexError
(GH10824)to_latex
where unexpected keyword argument for some documented arguments (GH10888)DataFrame
where IndexError
is uncaught (GH10645 and GH10692)read_csv
when using the nrows
or chunksize
parameters if file contains only a header line (GH9535)category
types in HDF5 in presence of alternate encodings. (GH10366)pd.DataFrame
when constructing an empty DataFrame with a string dtype (GH9428)pd.DataFrame.diff
when DataFrame is not consolidated (GH10907)pd.unique
for arrays with the datetime64
or timedelta64
dtype that meant an array with object dtype was returned instead the original dtype (GH9431)Timedelta
raising error when slicing from 0s (GH10583)DatetimeIndex.take
and TimedeltaIndex.take
may not raise IndexError
against invalid index (GH10295)Series([np.nan]).astype('M8[ms]')
, which now returns Series([pd.NaT])
(GH10747)PeriodIndex.order
reset freq (GH10295)date_range
when freq
divides end
as nanos (GH10885)iloc
allowing memory outside bounds of a Series to be accessed with negative integers (GH10779)read_msgpack
where encoding is not respected (GH10581)iloc
with a list containing the appropriate negative integer (GH10547, GH10779)TimedeltaIndex
formatter causing error while trying to save DataFrame
with TimedeltaIndex
using to_csv
(GH10833)DataFrame.where
when handling Series slicing (GH10218, GH9558)pd.read_gbq
throws ValueError
when Bigquery returns zero rows (GH10273)to_json
which was causing segmentation fault when serializing 0-rank ndarray (GH9576)IndexError
when plotted on GridSpec
(GH10819)groupby
incorrect computation for aggregation on DataFrame
with NaT
(E.g first
, last
, min
). (GH10590, GH11010)DataFrame
where passing a dictionary with only scalar values and specifying columns did not raise an error (GH10856).var()
causing roundoff errors for highly similar values (GH10242)DataFrame.plot(subplots=True)
with duplicated columns outputs incorrect result (GH10962)Index
arithmetic may result in incorrect class (GH10638)date_range
results in empty if freq is negative annualy, quarterly and monthly (GH11018)DatetimeIndex
cannot infer negative freq (GH11018)Index
dtype may not applied properly (GH11017)io.gbq
when testing for minimum google api client version (GH10652)DataFrame
construction from nested dict
with timedelta
keys (GH11129).fillna
against may raise TypeError
when data contains datetime dtype (GH7095, GH11153).groupby
when number of keys to group by is same as length of index (GH11185)convert_objects
where converted values might not be returned if all null and coerce
(GH9589)convert_objects
where copy
keyword was not respected (GH9589)

v0.16.2 (June 12, 2015)¶This is a minor bug-fix release from 0.16.1 and includes a large number of bug fixes along with some new features (pipe()
method), enhancements, and performance improvements.
We recommend that all users upgrade to this version.
Highlights include:
New features¶
Pipe¶
We’ve introduced a new method DataFrame.pipe()
. As suggested by the name, pipe
should be used to pipe data through a chain of function calls. The goal is to avoid confusing nested function calls like
# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)
The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as
(df.pipe(h)
   .pipe(g, arg1=1)
   .pipe(f, arg2=2, arg3=3)
)
Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.
In the example above, the functions f
, g
, and h
each expected the DataFrame as the first positional argument. When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple of (function, keyword)
indicating where the DataFrame should flow. For example:
In [1]: import statsmodels.formula.api as sm In [2]: bb = pd.read_csv('data/baseball.csv', index_col='id') # sm.poisson takes (formula, data) In [3]: (bb.query('h > 0') ...: .assign(ln_h = lambda df: np.log(df.h)) ...: .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)') ...: .fit() ...: .summary() ...: ) ...: Optimization terminated successfully. Current function value: 2.116284 Iterations 24 Out[3]: <class 'statsmodels.iolib.summary.Summary'> """ Poisson Regression Results ============================================================================== Dep. Variable: hr No. Observations: 68 Model: Poisson Df Residuals: 63 Method: MLE Df Model: 4 Date: Fri, 07 Jul 2017 Pseudo R-squ.: 0.6878 Time: 12:29:29 Log-Likelihood: -143.91 converged: True LL-Null: -460.91 LLR p-value: 6.774e-136 =============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------- Intercept -1267.3636 457.867 -2.768 0.006 -2164.767 -369.960 C(lg)[T.NL] -0.2057 0.101 -2.044 0.041 -0.403 -0.008 ln_h 0.9280 0.191 4.866 0.000 0.554 1.302 year 0.6301 0.228 2.762 0.006 0.183 1.077 g 0.0099 0.004 2.754 0.006 0.003 0.017 =============================================================================== """
The pipe method is inspired by unix pipes, which stream text through processes. More recently dplyr and magrittr have introduced the popular (%>%)
pipe operator for R.
See the documentation for more. (GH10129)
Other Enhancements¶
Added rsplit to Index/Series StringMethods (GH10303)
Removed the hard-coded size limits on the DataFrame
HTML representation in the IPython notebook, and leave this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH10231).
Note that the notebook has a toggle output scrolling
feature to limit the display of very large frames (by clicking left of the output). You can also configure the way DataFrames are displayed using the pandas options, see here.
axis
parameter of DataFrame.quantile
now accepts also index
and column
. (GH9543)
Holiday
now raises NotImplementedError
if both offset
and observance
are used in the constructor instead of returning an incorrect result (GH10217).Series.resample
performance with dtype=datetime64[ns]
(GH7754)str.split
when expand=True
(GH10081)Series.hist
raises an error when a one row Series
was given (GH10214)HDFStore.select
modifies the passed columns list (GH7212)Categorical
repr with display.width
of None
in Python 3 (GH10087)to_json
with certain orients and a CategoricalIndex
would segfault (GH10317)DataFrame.quantile
on checking that a valid axis was passed (GH9543)groupby.apply
aggregation for Categorical
not preserving categories (GH10138)to_csv
where date_format
is ignored if the datetime
is fractional (GH10209)DataFrame.to_json
with mixed data types (GH10289)mean()
where integer dtypes can overflow (GH10172)Panel.from_dict
does not set dtype when specified (GH10058)Index.union
raises AttributeError
when passing array-likes. (GH10149)Timestamp
‘s’ microsecond
, quarter
, dayofyear
, week
and daysinmonth
properties return np.int
type, not built-in int
. (GH10050)NaT
raises AttributeError
when accessing to daysinmonth
, dayofweek
properties. (GH10096)max_seq_items=None
setting (GH10182).dateutil
on various platforms ( GH9059, GH8639, GH9663, GH10121)setitem
where type promotion is applied to the entire block (GH10280)Series
arithmetic methods may incorrectly hold names (GH10068)GroupBy.get_group
when grouping on multiple keys, one of which is categorical. (GH10132)DatetimeIndex
and TimedeltaIndex
names are lost after timedelta arithmetics ( GH9926)DataFrame
construction from nested dict
with datetime64
(GH10160)Series
construction from dict
with datetime64
keys (GH9456)Series.plot(label="LABEL")
not correctly setting the label (GH10119)plot
not defaulting to matplotlib axes.grid
setting (GH9792)int
instead of float
in engine='python'
for the read_csv
parser (GH9565)Series.align
resets name
when fill_value
is specified (GH10067)read_csv
causing index name not to be set on an empty DataFrame (GH10184)SparseSeries.abs
resets name
(GH10241)TimedeltaIndex
slicing may reset freq (GH10292)GroupBy.get_group
raises ValueError
when group key contains NaT
(GH6992)SparseSeries
constructor ignores input data name (GH10258)Categorical.remove_categories
causing a ValueError
when removing the NaN
category if underlying dtype is floating-point (GH10156)DataFrame.to_hdf()
where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH9057)DataFrame
(GH10126).read_csv
with a date_parser
that returned a datetime64
array of other time resolution than [ns]
(GH10245)Panel.apply
when the result has ndim=0 (GH10332)read_hdf
where auto_close
could not be passed (GH9327).read_hdf
where open stores could not be used (GH10330).Concatenation of empty DataFrames now results in a DataFrame
that .equals
an empty DataFrame
(GH10181).to_hdf
and HDFStore
which did not check that complib choices were valid (GH4582, GH8874).

v0.16.1 (May 11, 2015)¶This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
CategoricalIndex, a category based index, see here
sample for drawing random samples from Series, DataFrames and Panels. See here
The default Index printing has changed to a more uniform format, see here
BusinessHour datetime-offset is now supported, see here
Further enhancements to the .str accessor to make string operations easier, see here

Warning
In pandas 0.17.0, the sub-package pandas.io.data
will be removed in favor of a separately installable package. See here for details (GH8961)
We introduce a CategoricalIndex
, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical
(introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series
with a category
dtype would convert this to regular object-based Index
.
In [1]: df = DataFrame({'A' : np.arange(6), ...: 'B' : Series(list('aabbca')).astype('category', ...: categories=list('cab')) ...: }) ...: In [2]: df Out[2]: A B 0 0 a 1 1 a 2 2 b 3 3 b 4 4 c 5 5 a In [3]: df.dtypes Out[3]: A int64 B category dtype: object In [4]: df.B.cat.categories Out[4]: Index(['c', 'a', 'b'], dtype='object')
setting the index will create a CategoricalIndex
In [5]: df2 = df.set_index('B') In [6]: df2.index Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
indexing with __getitem__/.iloc/.loc/.ix
works similarly to an Index with duplicates. The indexers MUST be in the category or the operation will raise.
In [7]: df2.loc['a'] Out[7]: A B a 0 a 1 a 5
and preserves the CategoricalIndex
In [8]: df2.loc['a'].index Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
sorting will order by the order of the categories
In [9]: df2.sort_index() Out[9]: A B c 4 a 0 a 1 a 5 b 2 b 3
groupby operations on the index will preserve the index nature as well
In [10]: df2.groupby(level=0).sum() Out[10]: A B c 4 a 6 b 5 In [11]: df2.groupby(level=0).sum().index Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
reindexing operations will return a resulting index based on the type of the passed indexer, meaning that passing a list will return a plain-old-Index
; indexing with a Categorical
will return a CategoricalIndex
, indexed according to the categories of the PASSED Categorical
dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.
In [12]: df2.reindex(['a','e']) Out[12]: A B a 0.0 a 1.0 a 5.0 e NaN In [13]: df2.reindex(['a','e']).index Out[13]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B') In [14]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))) Out[14]: A B a 0.0 a 1.0 a 5.0 e NaN In [15]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index Out[15]: CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')
See the documentation for more. (GH7629, GH10038, GH10039)
Sample¶Series, DataFrames, and Panels now have a new method: sample()
. The method accepts a specific number of rows or columns to return, or a fraction of the total number or rows or columns. It also has options for sampling with or without replacement, for passing in a column for weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH2419)
In [16]: example_series = Series([0,1,2,3,4,5]) # When no arguments are passed, returns 1 In [17]: example_series.sample() Out[17]: 3 3 dtype: int64 # One may specify either a number of rows: In [18]: example_series.sample(n=3) Out[18]: 5 5 1 1 4 4 dtype: int64 # Or a fraction of the rows: In [19]: example_series.sample(frac=0.5) Out[19]: 4 4 1 1 0 0 dtype: int64 # weights are accepted. In [20]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4] In [21]: example_series.sample(n=3, weights=example_weights) Out[21]: 2 2 3 3 5 5 dtype: int64 # weights will also be normalized if they do not sum to one, # and missing values will be treated as zeros. In [22]: example_weights2 = [0.5, 0, 0, 0, None, np.nan] In [23]: example_series.sample(n=1, weights=example_weights2) Out[23]: 0 0 dtype: int64
When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.
In [24]: df = DataFrame({'col1': [9, 8, 7, 6],
   ....:                 'weight_column': [0.5, 0.4, 0.1, 0]})

In [25]: df.sample(n=3, weights='weight_column')
Out[25]:
   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1

String Methods Enhancements¶
Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard python string operations.
Added StringMethods
(.str
accessor) to Index
(GH9068)
The .str
accessor is now available for both Series
and Index
.
In [26]: idx = Index([' jack', 'jill ', ' jesse ', 'frank']) In [27]: idx.str.strip() Out[27]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
One special case for the .str accessor on Index
is that if a string method returns bool
, the .str
accessor will return a np.array
instead of a boolean Index
(GH8875). This enables the following expression to work naturally:
In [28]: idx = Index(['a1', 'a2', 'b1', 'b2']) In [29]: s = Series(range(4), index=idx) In [30]: s Out[30]: a1 0 a2 1 b1 2 b2 3 dtype: int64 In [31]: idx.str.startswith('a') Out[31]: array([ True, True, False, False], dtype=bool) In [32]: s[s.index.str.startswith('a')] Out[32]: a1 0 a2 1 dtype: int64
The following new methods are accessible via the .str
accessor to apply the function to each value. (GH9766, GH9773, GH10031, GH10045, GH10052)
capitalize()
swapcase()
normalize()
partition()
rpartition()
index()
rindex()
translate()
split
now takes expand
keyword to specify whether to expand dimensionality. return_type
is deprecated. (GH9847)
In [33]: s = Series(['a,b', 'a,c', 'b,c']) # return Series In [34]: s.str.split(',') Out[34]: 0 [a, b] 1 [a, c] 2 [b, c] dtype: object # return DataFrame In [35]: s.str.split(',', expand=True) Out[35]: 0 1 0 a b 1 a c 2 b c In [36]: idx = Index(['a,b', 'a,c', 'b,c']) # return Index In [37]: idx.str.split(',') Out[37]: Index([['a', 'b'], ['a', 'c'], ['b', 'c']], dtype='object') # return MultiIndex In [38]: idx.str.split(',', expand=True) Out[38]: MultiIndex(levels=[['a', 'b'], ['b', 'c']], labels=[[0, 0, 1], [0, 1, 1]])
Improved extract
and get_dummies
methods for Index.str
(GH9980)
BusinessHour
offset is now supported, which represents business hours starting from 09:00 - 17:00 on BusinessDay
by default. See here for details. (GH7905)
In [39]: from pandas.tseries.offsets import BusinessHour In [40]: Timestamp('2014-08-01 09:00') + BusinessHour() Out[40]: Timestamp('2014-08-01 10:00:00') In [41]: Timestamp('2014-08-01 07:00') + BusinessHour() Out[41]: Timestamp('2014-08-01 10:00:00') In [42]: Timestamp('2014-08-01 16:30') + BusinessHour() Out[42]: Timestamp('2014-08-04 09:30:00')
DataFrame.diff
now takes an axis
parameter that determines the direction of differencing (GH9727)
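For instance (a minimal sketch; the frame is invented):

df = pd.DataFrame({'a': [1, 2], 'b': [4, 8]})
df.diff(axis=1)   # difference across columns instead of down rows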
Allow clip
, clip_lower
, and clip_upper
to accept array-like arguments as thresholds (This is a regression from 0.11.0). These methods now have an axis
parameter which determines how the Series or DataFrame will be aligned with the threshold(s). (GH6966)
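A minimal sketch of array-like thresholds with alignment along an axis (data invented):

df = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 6, 10]})
lower = pd.Series([2, 4, 8])
# per-row lower bounds, aligned with the index via axis=0
df.clip(lower=lower, axis=0)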
DataFrame.mask()
and Series.mask()
now support same keywords as where
(GH8801)
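For instance (a minimal sketch):

df = pd.DataFrame({'a': [1, -2, 3]})
# mask() is the inverse of where(): replace values where the condition is True
df.mask(df < 0, other=0)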
drop
function can now accept errors
keyword to suppress ValueError
raised when any of the labels do not exist in the target data. (GH6736)
In [43]: df = DataFrame(np.random.randn(3, 3), columns=['A', 'B', 'C']) In [44]: df.drop(['A', 'X'], axis=1, errors='ignore') Out[44]: B C 0 1.058969 -0.397840 1 1.047579 1.045938 2 -0.122092 0.124713
Add support for separating years and quarters using dashes, for example 2014-Q1. (GH9688)
Allow conversion of values with dtype datetime64
or timedelta64
to strings using astype(str)
(GH9757)
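A minimal sketch (dates invented):

s = pd.Series(pd.date_range('2015-01-01', periods=2))
s.astype(str)   # object dtype of string representations
pd.Series(pd.to_timedelta(['1 days'])).astype(str)   # works for timedelta64 too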
get_dummies
function now accepts sparse
keyword. If set to True
, the return DataFrame
is sparse, e.g. SparseDataFrame
. (GH8823)
Period
now accepts datetime64
as value input. (GH9054)
Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. 0:00:00 vs 00:00:00. (GH9570)
Allow Panel.shift
with axis='items'
(GH9890)
Trying to write an excel file now raises NotImplementedError
if the DataFrame
has a MultiIndex
instead of writing a broken Excel file. (GH9794)
Allow Categorical.add_categories
to accept Series
or np.array
. (GH9927)
Add/delete str/dt/cat
accessors dynamically from __dir__
. (GH9910)
Add normalize
as a dt
accessor method. (GH10047)
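For example (a minimal sketch):

s = pd.Series(pd.to_datetime(['2015-01-01 09:30', '2015-01-02 18:45']))
s.dt.normalize()   # times are reset to midnight, dates preserved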
DataFrame
and Series
now have _constructor_expanddim
property as overridable constructor for one higher dimensionality data. This should be used only when it is really needed, see here
pd.lib.infer_dtype
now returns 'bytes'
in Python 3 where appropriate. (GH10032)
When passing in an ax to df.plot( ..., ax=ax)
, the sharex kwarg will now default to False. The result is that the visibility of xlabels and xticklabels will not anymore be changed. You have to do that by yourself for the right axes in your figure or set sharex=True
explicitly (but this changes the visible for all axes in the figure, not only the one which is passed in!). If pandas creates the subplots itself (e.g. no passed in ax kwarg), then the default is still sharex=True
and the visibility changes are applied.assign()
now inserts new columns in alphabetical order. Previously the order was arbitrary. (GH9777)read_csv
and read_table
will now try to infer the compression type based on the file extension. Set compression=None
to restore the previous behavior (no decompression). (GH9770)Series.str.split
‘s return_type
keyword was removed in favor of expand
(GH9847)The string representation of Index
and its sub-classes have now been unified. These will show a single-line display if there are few values; a wrapped multi-line display for a lot of values (but less than display.max_seq_items
; if lots of items (> display.max_seq_items
) will show a truncated display (the head and tail of the data). The formatting for MultiIndex
is unchanged (a multi-line wrapped display). The display width responds to the option display.max_seq_items
, which is defaulted to 100. (GH6482)
Previous Behavior
In [2]: pd.Index(range(4),name='foo') Out[2]: Int64Index([0, 1, 2, 3], dtype='int64') In [3]: pd.Index(range(104),name='foo') Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64') In [4]: pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern') Out[4]: <class 'pandas.tseries.index.DatetimeIndex'> [2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00] Length: 4, Freq: D, Timezone: US/Eastern In [5]: pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern') Out[5]: <class 'pandas.tseries.index.DatetimeIndex'> [2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00] Length: 104, Freq: D, Timezone: US/Eastern
New Behavior
In [45]: pd.set_option('display.width', 80) In [46]: pd.Index(range(4), name='foo') Out[46]: RangeIndex(start=0, stop=4, step=1, name='foo') In [47]: pd.Index(range(30), name='foo') Out[47]: RangeIndex(start=0, stop=30, step=1, name='foo') In [48]: pd.Index(range(104), name='foo') Out[48]: RangeIndex(start=0, stop=104, step=1, name='foo') In [49]: pd.CategoricalIndex(['a','bb','ccc','dddd'], ordered=True, name='foobar') Out[49]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category') In [50]: pd.CategoricalIndex(['a','bb','ccc','dddd']*10, ordered=True, name='foobar') Out[50]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category') In [51]: pd.CategoricalIndex(['a','bb','ccc','dddd']*100, ordered=True, name='foobar') Out[51]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', ... 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category', length=400) In [52]: pd.date_range('20130101',periods=4, name='foo', tz='US/Eastern') Out[52]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', name='foo', freq='D') In [53]: pd.date_range('20130101',periods=25, freq='D') Out[53]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08', '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12', '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16', '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20', '2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24', '2013-01-25'], dtype='datetime64[ns]', freq='D') In [54]: pd.date_range('20130101',periods=104, name='foo', tz='US/Eastern') Out[54]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00', '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00', '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00', '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00', ... '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00', '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00', '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00', '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00', '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'], dtype='datetime64[ns, US/Eastern]', name='foo', length=104, freq='D')Performance Improvements¶
pd.lib.max_len_string_array
by 5-7x (GH10024)DataFrame.plot()
, passing label=
arguments works, and Series indices are no longer mutated. (GH9542)read_csv
where missing trailing delimiters would cause segfault. (GH5664)scatter_matrix
draws unexpected axis ticklabels (GH5662)StataWriter
resulting in changes to input DataFrame
upon save (GH9795).transform
causing length mismatch when null entries were present and a fast aggregator was being used (GH9697)equals
causing false negatives when block order differed (GH9330)pd.Grouper
where one is non-time based (GH10063)read_sql_table
error when reading postgres table with timezone (GH7139)DataFrame
slicing may not retain metadata (GH9776)TimdeltaIndex
were not properly serialized in fixed HDFStore
(GH9635)TimedeltaIndex
constructor ignoring name
when given another TimedeltaIndex
as data (GH10025).DataFrameFormatter._get_formatted_index
with not applying max_colwidth
to the DataFrame
index (GH7856).loc
with a read-only ndarray data source (GH10043)groupby.apply()
that would raise if a passed user defined function either returned only None
(for all input). (GH9685)secondary_y
may not show legend properly. (GH9610, GH9779)DataFrame.plot(kind="hist")
results in TypeError
when DataFrame
contains non-numeric columns (GH9853)DataFrame
with a DatetimeIndex
may raise TypeError
(GH9852)setup.py
that would allow an incompat cython version to build (GH9827)secondary_y
incorrectly attaches right_ax
property to secondary axes specifying itself recursively. (GH9861)Series.quantile
on empty Series of type Datetime
or Timedelta
(GH9675)where
causing incorrect results when upcasting was required (GH9731)FloatArrayFormatter
where decision boundary for displaying “small” floats in decimal format is off by one order of magnitude for a given display.precision (GH9764)DataFrame.plot()
raised an error when both color
and style
keywords were passed and there was no color symbol in the style strings (GH9671)DeprecationWarning
on combining list-likes with an Index
(GH10083)read_csv
and read_table
when using skip_rows
parameter if blank lines are present. (GH9832)read_csv()
interprets index_col=True
as 1
(GH9798)==
failing on Index/MultiIndex type incompatibility (GH9785)SparseDataFrame
could not take nan as a column name (GH8822)to_msgpack
and read_msgpack
zlib and blosc compression support (GH9783)GroupBy.size
doesn’t attach index name properly if grouped by TimeGrouper
(GH9925)length_of_indexer
returns wrong results (GH9995)Categorical
(GH9603)TimedeltaIndex
incorrectly raised ValueError
instead of AttributeError
(GH9680)Series(Categorical(list("abc"), ordered=True)) > "d"
. This returned False
for all elements, but now raises a TypeError
. Equality comparisons also now return False
for ==
and True
for !=
. (GH9848)__setitem__
when right hand side is a dictionary (GH9874)where
when dtype is datetime64/timedelta64
, but dtype of other is not (GH9804)MultiIndex.sortlevel()
results in unicode level name breaks (GH9856)groupby.transform
incorrectly enforced output dtypes to match input dtypes. (GH9807)DataFrame
constructor when columns
parameter is set, and data
is an empty list (GH9939)log=True
raises TypeError
if all values are less than 1 (GH9905)log=True
(GH9905)Decimal
by another Decimal
would raise. (GH9787)AbstractHolidayCalendar
to be at the instance level rather than at the class level as the latter can result in unexpected behaviour. (GH9552)DataFrame.loc
(GH9596)transform
and filter
when grouping on a categorical variable (GH9921)transform
when groups are equal in number and dtype to the input index (GH9700)oauth2client.tools.run()
(GH8327)DataFrame
. It may not return the correct class, when slicing or subsetting it. (GH9632).median()
where non-float null values are not handled correctly (GH10040)

v0.16.0 (March 22, 2015)¶This is a major release from 0.15.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
DataFrame.assign
method, see hereSeries.to_coo/from_coo
methods to interact with scipy.sparse
, see hereTimedelta
to conform the .seconds
attribute with datetime.timedelta
, see here.loc
slicing API to conform with the behavior of .ix
see hereCategorical
constructor, see here.str
accessor to make string operations easier, see herepandas.tools.rplot
, pandas.sandbox.qtpandas
and pandas.rpy
modules are deprecated. We refer users to external packages like seaborn, pandas-qt and rpy2 for similar or equivalent functionality, see hereCheck the API Changes and deprecations before updating.
New features¶ DataFrame Assign¶Inspired by dplyr’s mutate
verb, DataFrame has a new assign()
method. The function signature for assign
is simply **kwargs
. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series
or NumPy array), or a function of one argument to be called on the DataFrame
. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.
In [1]: iris = read_csv('data/iris.data') In [2]: iris.head() Out[2]: SepalLength SepalWidth PetalLength PetalWidth Name 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa In [3]: iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head() Out[3]: SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio 0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275 1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245 2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851 3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913 4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
Above was an example of inserting a precomputed value. We can also pass in a function to be evaluated.
In [4]: iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] / ...: x['SepalLength'])).head() ...: Out[4]: SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio 0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275 1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245 2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851 3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913 4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
The power of assign
comes when used in chains of operations. For example, we can limit the DataFrame to just those with a Sepal Length greater than 5, calculate the ratio, and plot
In [5]: (iris.query('SepalLength > 5') ...: .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength, ...: PetalRatio = lambda x: x.PetalWidth / x.PetalLength) ...: .plot(kind='scatter', x='SepalRatio', y='PetalRatio')) ...: Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x13df38208>
See the documentation for more. (GH9229)
Interaction with scipy.sparse¶Added SparseSeries.to_coo()
and SparseSeries.from_coo()
methods (GH8048) for converting to and from scipy.sparse.coo_matrix
instances (see here). For example, given a SparseSeries with MultiIndex we can convert to a scipy.sparse.coo_matrix by specifying the row and column labels as index levels:
In [6]: from numpy import nan In [7]: s = Series([3.0, nan, 1.0, 3.0, nan, nan]) In [8]: s.index = MultiIndex.from_tuples([(1, 2, 'a', 0), ...: (1, 2, 'a', 1), ...: (1, 1, 'b', 0), ...: (1, 1, 'b', 1), ...: (2, 1, 'b', 0), ...: (2, 1, 'b', 1)], ...: names=['A', 'B', 'C', 'D']) ...: In [9]: s Out[9]: A B C D 1 2 a 0 3.0 1 NaN 1 b 0 1.0 1 3.0 2 1 b 0 NaN 1 NaN dtype: float64 # SparseSeries In [10]: ss = s.to_sparse() In [11]: ss Out[11]: A B C D 1 2 a 0 3.0 1 NaN 1 b 0 1.0 1 3.0 2 1 b 0 NaN 1 NaN dtype: float64 BlockIndex Block locations: array([0, 2], dtype=int32) Block lengths: array([1, 2], dtype=int32) In [12]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'], ....: column_levels=['C', 'D'], ....: sort_labels=False) ....: In [13]: A Out[13]: <3x4 sparse matrix of type '<class 'numpy.float64'>' with 3 stored elements in COOrdinate format> In [14]: A.todense() Out[14]: matrix([[ 3., 0., 0., 0.], [ 0., 0., 1., 3.], [ 0., 0., 0., 0.]]) In [15]: rows Out[15]: [(1, 2), (1, 1), (2, 1)] In [16]: columns Out[16]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
The from_coo method is a convenience method for creating a SparseSeries
from a scipy.sparse.coo_matrix
:
In [17]: from scipy import sparse In [18]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), ....: shape=(3, 4)) ....: In [19]: A Out[19]: <3x4 sparse matrix of type '<class 'numpy.float64'>' with 3 stored elements in COOrdinate format> In [20]: A.todense() Out[20]: matrix([[ 0., 0., 1., 2.], [ 3., 0., 0., 0.], [ 0., 0., 0., 0.]]) In [21]: ss = SparseSeries.from_coo(A) In [22]: ss Out[22]: 0 2 1.0 3 2.0 1 0 3.0 dtype: float64 BlockIndex Block locations: array([0], dtype=int32) Block lengths: array([3], dtype=int32)String Methods Enhancements¶
The following new methods are accessible via the .str
accessor to apply the function to each value. This is intended to make it more consistent with standard methods on strings. (GH9282, GH9352, GH9386, GH9387, GH9439)
isalnum()
isalpha()
isdigit()
isspace()
islower()
isupper()
istitle()
isnumeric()
isdecimal()
find()
rfind()
ljust()
rjust()
zfill()
In [23]: s = Series(['abcd', '3456', 'EFGH']) In [24]: s.str.isalpha() Out[24]: 0 True 1 False 2 True dtype: bool In [25]: s.str.find('ab') Out[25]: 0 0 1 -1 2 -1 dtype: int64
Series.str.pad()
and Series.str.center()
now accept fillchar
option to specify filling character (GH9352)
In [26]: s = Series(['12', '300', '25']) In [27]: s.str.pad(5, fillchar='_') Out[27]: 0 ___12 1 __300 2 ___25 dtype: object
Added Series.str.slice_replace()
, which previously raised NotImplementedError
(GH8888)
In [28]: s = Series(['ABCD', 'EFGH', 'IJK']) In [29]: s.str.slice_replace(1, 3, 'X') Out[29]: 0 AXD 1 EXH 2 IX dtype: object # replaced with empty char In [30]: s.str.slice_replace(0, 1) Out[30]: 0 BCD 1 FGH 2 JK dtype: object
Reindex now supports method='nearest'
for frames or series with a monotonic increasing or decreasing index (GH9258):
In [31]: df = pd.DataFrame({'x': range(5)}) In [32]: df.reindex([0.2, 1.8, 3.5], method='nearest') Out[32]: x 0.2 0 1.8 2 3.5 4
This method is also exposed by the lower level Index.get_indexer
and Index.get_loc
methods.
The read_excel()
function’s sheetname argument now accepts a list and None
, to get multiple or all sheets respectively. If more than one sheet is specified, a dictionary is returned. (GH9450)
# Returns the 1st and 4th sheet, as a dictionary of DataFrames. pd.read_excel('path_to_file.xls',sheetname=['Sheet1',3])
Allow Stata files to be read incrementally with an iterator; support for long strings in Stata files. See the docs here (GH9493).
Paths beginning with ~ will now be expanded to begin with the user’s home directory (GH9066)
Added time interval selection in get_data_yahoo
(GH9071)
Added Timestamp.to_datetime64()
to complement Timedelta.to_timedelta64()
(GH9255)
tseries.frequencies.to_offset()
now accepts Timedelta
as input (GH9064)
Lag parameter was added to the autocorrelation method of Series
, defaults to lag-1 autocorrelation (GH9192)
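For example (a minimal sketch; the series is invented):

s = pd.Series([1.0, 3.0, 2.0, 4.0, 3.5])
s.autocorr()       # lag-1 autocorrelation, the default
s.autocorr(lag=2)  # autocorrelation at lag 2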
Timedelta
will now accept nanoseconds
keyword in constructor (GH9273)
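For example (a minimal sketch):

pd.Timedelta(seconds=1, nanoseconds=500)   # Timedelta('0 days 00:00:01.000000500')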
SQL code now safely escapes table and column names (GH8986)
Added auto-complete for Series.str.<tab>
, Series.dt.<tab>
and Series.cat.<tab>
(GH9322)
Index.get_indexer
now supports method='pad'
and method='backfill'
even for any target array, not just monotonic targets. These methods also work for monotonic decreasing as well as monotonic increasing indexes (GH9258).
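A minimal sketch (index and targets invented):

idx = pd.Index([0, 5, 10])
# position of the last label at or before each target
idx.get_indexer([1, 6, 12], method='pad')        # -> array([0, 1, 2])
# position of the first label at or after each target
idx.get_indexer([1, 6, 12], method='backfill')   # -> array([ 1,  2, -1])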
Index.asof
now works on all index types (GH9258).
A verbose
argument has been augmented in io.read_excel()
, defaults to False. Set to True to print sheet names as they are parsed. (GH9450)
Added days_in_month
(compatibility alias daysinmonth
) property to Timestamp
, DatetimeIndex
, Period
, PeriodIndex
, and Series.dt
(GH9572)
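For example (a minimal sketch):

pd.Timestamp('2015-02-14').days_in_month   # -> 28
pd.Timestamp('2016-02-14').daysinmonth     # compatibility alias; 2016 is a leap year -> 29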
Added decimal
option in to_csv
to provide formatting for non-‘.’ decimal separators (GH781)
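For example (a minimal sketch; the file name is invented):

df = pd.DataFrame({'a': [1.5, 2.25]})
# write with a comma as the decimal separator
df.to_csv('out.csv', sep=';', decimal=',')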
Added normalize
option for Timestamp
to normalize to midnight (GH8794)
Added example for DataFrame
import to R using HDF5 file and rhdf5
library. See the documentation for more (GH9636).
In v0.15.0 a new scalar type Timedelta
was introduced, that is a sub-class of datetime.timedelta
. Mentioned here was a notice of an API change w.r.t. the .seconds
accessor. The intent was to provide a user-friendly set of accessors that give the ‘natural’ value for that unit, e.g. if you had a Timedelta('1 day, 10:11:12')
, then .seconds
would return 12. However, this is at odds with the definition of datetime.timedelta
, which defines .seconds
as 10 * 3600 + 11 * 60 + 12 == 36672
.
So in v0.16.0, we are restoring the API to match that of datetime.timedelta
. Further, the component values are still available through the .components
accessor. This affects the .seconds
and .microseconds
accessors, and removes the .hours
, .minutes
, .milliseconds
accessors. These changes affect TimedeltaIndex
and the Series .dt
accessor as well. (GH9185, GH9139)
Previous Behavior
In [2]: t = pd.Timedelta('1 day, 10:11:12.100123') In [3]: t.days Out[3]: 1 In [4]: t.seconds Out[4]: 12 In [5]: t.microseconds Out[5]: 123
New Behavior
In [33]: t = pd.Timedelta('1 day, 10:11:12.100123') In [34]: t.days Out[34]: 1 In [35]: t.seconds Out[35]: 36672 In [36]: t.microseconds Out[36]: 100123
Using .components
allows the full component access
In [37]: t.components Out[37]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0) In [38]: t.components.seconds Out[38]: 12Indexing Changes¶
The behavior of a small sub-set of edge cases for using .loc
have changed (GH8613). Furthermore we have improved the content of the error messages that are raised:
Slicing with .loc
where the start and/or stop bound is not found in the index is now allowed; this previously would raise a KeyError
. This makes the behavior the same as .ix
in this case. This change is only for slicing, not when indexing with a single label.
In [39]: df = DataFrame(np.random.randn(5,4), ....: columns=list('ABCD'), ....: index=date_range('20130101',periods=5)) ....: In [40]: df Out[40]: A B C D 2013-01-01 -0.322795 0.841675 2.390961 0.076200 2013-01-02 -0.566446 0.036142 -2.074978 0.247792 2013-01-03 -0.897157 -0.136795 0.018289 0.755414 2013-01-04 0.215269 0.841009 -1.445810 -1.401973 2013-01-05 -0.100918 -0.548242 -0.144620 0.354020 In [41]: s = Series(range(5),[-2,-1,1,2,3]) In [42]: s Out[42]: -2 0 -1 1 1 2 2 3 3 4 dtype: int64
Previous Behavior
In [4]: df.loc['2013-01-02':'2013-01-10'] KeyError: 'stop bound [2013-01-10] is not in the [index]' In [6]: s.loc[-10:3] KeyError: 'start bound [-10] is not the [index]'
New Behavior
In [43]: df.loc['2013-01-02':'2013-01-10'] Out[43]: A B C D 2013-01-02 -0.566446 0.036142 -2.074978 0.247792 2013-01-03 -0.897157 -0.136795 0.018289 0.755414 2013-01-04 0.215269 0.841009 -1.445810 -1.401973 2013-01-05 -0.100918 -0.548242 -0.144620 0.354020 In [44]: s.loc[-10:3] Out[44]: -2 0 -1 1 1 2 2 3 3 4 dtype: int64
Allow slicing with float-like values on an integer index for .ix. Previously this was only enabled for .loc:
Previous Behavior
In [8]: s.ix[-1.0:2] TypeError: the slice start value [-1.0] is not a proper indexer for this index type (Int64Index)
New Behavior
In [2]: s.ix[-1.0:2] Out[2]: -1 1 1 2 2 3 dtype: int64
Provide a useful exception for indexing with an invalid type for that index when using .loc. For example trying to use .loc on an index of type DatetimeIndex or PeriodIndex or TimedeltaIndex, with an integer (or a float).
Previous Behavior
In [4]: df.loc[2:3] KeyError: 'start bound [2] is not the [index]'
New Behavior
In [4]: df.loc[2:3] TypeError: Cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with <type 'int'> keys
In prior versions, Categoricals that had an unspecified ordering (meaning no ordered keyword was passed) were defaulted as ordered Categoricals. Going forward, the ordered keyword in the Categorical constructor will default to False. Ordering must now be explicit.
Furthermore, previously you could change the ordered attribute of a Categorical by just setting the attribute, e.g. cat.ordered=True; this is now deprecated and you should use cat.as_ordered() or cat.as_unordered(). These will by default return a new object and not modify the existing object. (GH9347, GH9190)
Previous Behavior
In [3]: s = Series([0,1,2], dtype='category') In [4]: s Out[4]: 0 0 1 1 2 2 dtype: category Categories (3, int64): [0 < 1 < 2] In [5]: s.cat.ordered Out[5]: True In [6]: s.cat.ordered = False In [7]: s Out[7]: 0 0 1 1 2 2 dtype: category Categories (3, int64): [0, 1, 2]
New Behavior
In [45]: s = Series([0,1,2], dtype='category') In [46]: s Out[46]: 0 0 1 1 2 2 dtype: category Categories (3, int64): [0, 1, 2] In [47]: s.cat.ordered Out[47]: False In [48]: s = s.cat.as_ordered() In [49]: s Out[49]: 0 0 1 1 2 2 dtype: category Categories (3, int64): [0 < 1 < 2] In [50]: s.cat.ordered Out[50]: True # you can set in the constructor of the Categorical In [51]: s = Series(Categorical([0,1,2],ordered=True)) In [52]: s Out[52]: 0 0 1 1 2 2 dtype: category Categories (3, int64): [0 < 1 < 2] In [53]: s.cat.ordered Out[53]: True
For ease of creation of series of categorical data, we have added the ability to pass keywords when calling .astype(). These are passed directly to the constructor.
In [54]: s = Series(["a","b","c","a"]).astype('category',ordered=True) In [55]: s Out[55]: 0 a 1 b 2 c 3 a dtype: category Categories (3, object): [a < b < c] In [56]: s = Series(["a","b","c","a"]).astype('category',categories=list('abcdef'),ordered=False) In [57]: s Out[57]: 0 a 1 b 2 c 3 a dtype: category Categories (6, object): [a, b, c, d, e, f]
Other API Changes¶
Index.duplicated now returns np.array(dtype=bool) rather than Index(dtype=object) containing bool values. (GH8875)
DataFrame.to_json now returns accurate type serialisation for each column for frames of mixed dtype (GH9037)
Previously data was coerced to a common dtype before serialisation, which for example resulted in integers being serialised to floats:
In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json() Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'
Now each column is serialised using its correct dtype:
In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json() Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'
DatetimeIndex, PeriodIndex and TimedeltaIndex.summary now output the same format. (GH9116)
TimedeltaIndex.freqstr now outputs the same string format as DatetimeIndex. (GH9116)
Bar and horizontal bar plots no longer add a dashed line along the info axis. The prior style can be achieved with matplotlib's axhline or axvline methods (GH9088).
Series accessors .dt, .cat and .str now raise AttributeError instead of TypeError if the series does not contain the appropriate type of data (GH9617). This follows Python's built-in exception hierarchy more closely and ensures that tests like hasattr(s, 'cat') are consistent on both Python 2 and 3.
Series now supports bitwise operations for integral types (GH9016). Previously, even if the input dtypes were integral, the output dtype was coerced to bool.
Previous Behavior
In [2]: pd.Series([0,1,2,3], list('abcd')) | pd.Series([4,4,4,4], list('abcd')) Out[2]: a True b True c True d True dtype: bool
New Behavior. If the input dtypes are integral, the output dtype is also integral and the output values are the result of the bitwise operation.
In [2]: pd.Series([0,1,2,3], list('abcd')) | pd.Series([4,4,4,4], list('abcd')) Out[2]: a 4 b 5 c 6 d 7 dtype: int64
During division involving a Series or DataFrame, 0/0 and 0//0 now give np.nan instead of np.inf. (GH9144, GH8445)
Previous Behavior
In [2]: p = pd.Series([0, 1]) In [3]: p / 0 Out[3]: 0 inf 1 inf dtype: float64 In [4]: p // 0 Out[4]: 0 inf 1 inf dtype: float64
New Behavior
In [58]: p = pd.Series([0, 1]) In [59]: p / 0 Out[59]: 0 NaN 1 inf dtype: float64 In [60]: p // 0 Out[60]: 0 NaN 1 inf dtype: float64
Series.value_counts and Series.describe for categorical data will now put NaN entries at the end. (GH9443)
Series.describe for categorical data will now give counts and frequencies of 0, not NaN, for unused categories (GH9443)
Due to a bug fix, looking up a partial string label with DatetimeIndex.asof now includes values that match the string, even if they are after the start of the partial string label (GH9258).
Old behavior:
In [4]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02') Out[4]: Timestamp('2000-01-31 00:00:00')
Fixed behavior:
In [61]: pd.to_datetime(['2000-01-31', '2000-02-28']).asof('2000-02') Out[61]: Timestamp('2000-02-28 00:00:00')
To reproduce the old behavior, simply add more precision to the label (e.g., use 2000-02-01 instead of 2000-02).
Deprecations¶
The rplot trellis plotting interface is deprecated and will be removed in a future version. We refer to external packages like seaborn for similar but more refined functionality (GH3445). The documentation includes some examples of how to convert your existing code using rplot to seaborn: see the rplot docs.
The pandas.sandbox.qtpandas interface is deprecated and will be removed in a future version. We refer users to the external package pandas-qt. (GH9615)
The pandas.rpy interface is deprecated and will be removed in a future version. Similar functionality can be accessed through the rpy2 project (GH9602)
Adding DatetimeIndex/PeriodIndex to another DatetimeIndex/PeriodIndex is being deprecated as a set-operation. This will be changed to a TypeError in a future version. .union() should be used for the union set operation. (GH9094)
Subtracting DatetimeIndex/PeriodIndex from another DatetimeIndex/PeriodIndex is being deprecated as a set-operation. This will be changed to an actual numeric subtraction yielding a TimedeltaIndex in a future version. .difference() should be used for the differencing set operation. (GH9094)
Removal of prior version deprecations/changes¶
DataFrame.pivot_table and crosstab's rows and cols keyword arguments were removed in favor of index and columns (GH6581)
The DataFrame.to_excel and DataFrame.to_csv cols keyword argument was removed in favor of columns (GH6581)
Removed convert_dummies in favor of get_dummies (GH6581)
Removed value_range in favor of describe (GH6581)
Performance Improvements¶
Fixed a performance regression for .loc indexing with an array or list-like (GH9126).
DataFrame.to_json 30x performance improvement for mixed dtype frames. (GH9037)
Performance improvements in MultiIndex.duplicated by working with labels instead of values (GH9125)
Improved the speed of nunique by calling unique instead of value_counts (GH9129, GH7771)
Performance improvements in DataFrame.count and DataFrame.dropna by taking advantage of homogeneous/heterogeneous dtypes appropriately (GH9136)
Performance improvement in DataFrame.count when using a MultiIndex and the level keyword argument (GH9163)
Performance improvement in merge when key space exceeds int64 bounds (GH9151)
Performance improvements in groupby (GH9429)
Performance improvements in MultiIndex.sortlevel (GH9445)
Performance and memory usage improvements in DataFrame.duplicated (GH9398)
Performance improvements in Period (GH9440)
Decreased memory usage on to_hdf (GH9648)
Bug Fixes¶
Changed .to_html to remove leading/trailing spaces in table body (GH4987)
Fixed issue using read_csv on s3 with Python 3 (GH9452)
Fixed compatibility issue in DatetimeIndex affecting architectures where numpy.int_ defaults to numpy.int32 (GH8943)
Bug in Series.dt.components where the index was reset to the default index (GH9247)
Bug in Categorical.__getitem__/__setitem__ with listlike input getting incorrect results from indexer coercion (GH9469)
Bug in to_sql when mapping a Timestamp object column (datetime column with timezone info) to the appropriate sqlalchemy type (GH9085).
Bug in the to_sql dtype argument not accepting an instantiated SQLAlchemy type (GH9083).
Bug in .loc partial setting with a np.datetime64 (GH9516)
Incorrect dtypes inferred on datetimelike-looking Series and on .xs slices (GH9477)
Items in Categorical.unique() (and s.unique() if s is of dtype category) now appear in the order in which they are originally found, not in sorted order (GH9331). This is now consistent with the behavior for other dtypes in pandas.
Fixed a bug in StataReader (GH8688).
Bug in MultiIndex.has_duplicates when having many levels causing an indexer overflow (GH9075, GH5873)
Bug in pivot and unstack where nan values would break index alignment (GH4862, GH7401, GH7403, GH7405, GH7466, GH9497)
Bug in join on a multi-index with sort=True or null values (GH9210).
Bug in MultiIndex where inserting new keys would fail (GH9250).
Bug in groupby when key space exceeds int64 bounds (GH9096).
Bug in unstack with TimedeltaIndex or DatetimeIndex and nulls (GH9491).
Bug in rank where comparing floats with tolerance will cause inconsistent behaviour (GH8365).
Bug in read_stata and StataReader when loading data from a URL (GH9231).
Bug where adding offsets.Nano to other offsets raises TypeError (GH9284)
Bug in DatetimeIndex iteration, related to (GH8890), fixed in (GH9100)
Bugs in resample around DST transitions. This required fixing offset classes so they behave correctly on DST transitions. (GH5172, GH8744, GH8653, GH9173, GH9468).
Bug in binary operator methods (e.g. .mul()) alignment with integer levels (GH9463).
Bug where the layout kw may show an unnecessary warning (GH9464)
Bug in fillna (GH9221)
DataFrame now properly supports simultaneous copy and dtype arguments in constructor (GH9099)
Bug in read_csv when using skiprows on a file with CR line endings with the c engine. (GH9079)
isnull now detects NaT in PeriodIndex (GH9129)
Bug in groupby .nth() with a multiple column groupby (GH8979)
Bug where DataFrame.where and Series.where coerce numerics to string incorrectly (GH9280)
Bug where DataFrame.where and Series.where raise ValueError when a string list-like is passed. (GH9280)
Accessing Series.str methods with non-string values now raises TypeError instead of producing incorrect results (GH9184)
Bug in DatetimeIndex.__contains__ when the index has duplicates and is not monotonic increasing (GH9512)
Fixed a bug in Series.kurt() when all values are equal (GH9197)
Fixed an issue with the xlsxwriter engine where it added a default 'General' format to cells if no other format was applied. This prevented other row or column formatting being applied. (GH9167)
Fixed an issue with index_col=False when usecols is also specified in read_csv. (GH9082)
Bug where wide_to_long would modify the input stubnames list (GH9204)
Bug in to_sql not storing float64 values using double precision. (GH9009)
SparseSeries and SparsePanel now accept zero argument constructors (same as their non-sparse counterparts) (GH9272).
Regression in merging Categorical and object dtypes (GH9426)
Bug in read_csv with buffer overflows with certain malformed input files (GH9205)
Bug in Series.groupby where grouping on MultiIndex levels would ignore the sort argument (GH9444)
Bug in DataFrame.groupby where sort=False is ignored in the case of Categorical columns. (GH8868)
Bug in Series.value_counts excluding NaN for categorical-type Series with dropna=True (GH9443)
Fixed missing numeric_only option for DataFrame.std/var/sem (GH9201)
Support constructing a Panel or Panel4D with scalar data (GH8285)
Bug in Series text representation disconnected from max_rows/max_columns (GH7508).
Fixed Series number formatting being inconsistent when truncated (GH8532).
Previous Behavior
In [2]: pd.options.display.max_rows = 10 In [3]: s = pd.Series([1,1,1,1,1,1,1,1,1,1,0.9999,1,1]*10) In [4]: s Out[4]: 0 1 1 1 2 1 ... 127 0.9999 128 1.0000 129 1.0000 Length: 130, dtype: float64
New Behavior
0 1.0000 1 1.0000 2 1.0000 3 1.0000 4 1.0000 ... 125 1.0000 126 1.0000 127 0.9999 128 1.0000 129 1.0000 dtype: float64
A spurious SettingWithCopy warning was generated when setting a new item in a frame in some cases (GH8730)
The following would previously report a SettingWithCopy warning.
In [1]: df1 = DataFrame({'x': Series(['a','b','c']), 'y': Series(['d','e','f'])}) In [2]: df2 = df1[['x']] In [3]: df2['y'] = ['g', 'h', 'i']
This is a minor release from 0.15.1 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. A small number of API changes were necessary to fix existing bugs. We recommend that all users upgrade to this version.
API changes¶
Indexing in MultiIndex beyond lex-sort depth is now supported, though a lexically sorted index will have better performance. (GH2646)
In [1]: df = pd.DataFrame({'jim':[0, 0, 1, 1], ...: 'joe':['x', 'x', 'z', 'y'], ...: 'jolie':np.random.rand(4)}).set_index(['jim', 'joe']) ...: In [2]: df Out[2]: jolie jim joe 0 x 0.123943 x 0.119381 1 z 0.738523 y 0.587304 In [3]: df.index.lexsort_depth Out[3]: 1 # in prior versions this would raise a KeyError # will now show a PerformanceWarning In [4]: df.loc[(1, 'z')] Out[4]: jolie jim joe 1 z 0.738523 # lexically sorting In [5]: df2 = df.sort_index() In [6]: df2 Out[6]: jolie jim joe 0 x 0.123943 x 0.119381 1 y 0.587304 z 0.738523 In [7]: df2.index.lexsort_depth Out[7]: 2 In [8]: df2.loc[(1,'z')] Out[8]: jolie jim joe 1 z 0.738523
Bug in unique of Series with category dtype, which returned all categories regardless of whether they were "used" or not (see GH8559 for the discussion). Previous behaviour was to return all categories:
In [3]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']) In [4]: cat Out[4]: [a, b, a] Categories (3, object): [a < b < c] In [5]: cat.unique() Out[5]: array(['a', 'b', 'c'], dtype=object)
Now, only the categories that do effectively occur in the array are returned:
In [9]: cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']) In [10]: cat.unique() Out[10]: [a, b] Categories (2, object): [a, b]
Series.all and Series.any now support the level and skipna parameters. Series.all, Series.any, Index.all, and Index.any no longer support the out and keepdims parameters, which existed for compatibility with ndarray. Various index types no longer support the all and any aggregation functions and will now raise TypeError. (GH8302).
Allow equality comparisons of Series with a categorical dtype and object dtype; previously these would raise TypeError (GH8938)
Bug in NDFrame: conflicting attribute/column names now behave consistently between getting and setting. Previously, when both a column and attribute named y existed, data.y would return the attribute, while data.y = z would update the column (GH8994)
In [11]: data = pd.DataFrame({'x':[1, 2, 3]}) In [12]: data.y = 2 In [13]: data['y'] = [2, 4, 6] In [14]: data Out[14]: x y 0 1 2 1 2 4 2 3 6 # this assignment was inconsistent In [15]: data.y = 5
Old behavior:
In [6]: data.y Out[6]: 2 In [7]: data['y'].values Out[7]: array([5, 5, 5])
New behavior:
In [16]: data.y Out[16]: 5 In [17]: data['y'].values Out[17]: array([2, 4, 6])
Timestamp('now') is now equivalent to Timestamp.now() in that it returns the local time rather than UTC. Also, Timestamp('today') is now equivalent to Timestamp.today() and both have tz as a possible argument. (GH9000)
Fix negative step support for label-based slices (GH8753)
Old behavior:
In [1]: s = pd.Series(np.arange(3), ['a', 'b', 'c']) Out[1]: a 0 b 1 c 2 dtype: int64 In [2]: s.loc['c':'a':-1] Out[2]: c 2 dtype: int64
New behavior:
In [18]: s = pd.Series(np.arange(3), ['a', 'b', 'c']) In [19]: s.loc['c':'a':-1] Out[19]: c 2 b 1 a 0 dtype: int64
Categorical enhancements:
order_categoricals was added to StataReader and read_stata to select whether to order imported categorical data (GH8836). See here for more information on importing categorical variables from Stata data files.
category dtyped data is stored in a more efficient manner. See here for an example and caveats w.r.t. prior versions of pandas.
Added searchsorted() on the Categorical class (GH8420).
Other enhancements:
Added the ability to specify the SQL type of columns when writing a DataFrame to a database (GH8778). For example, specifying to use the sqlalchemy String type instead of the default Text type for string columns:
from sqlalchemy.types import String data.to_sql('data_dtype', engine, dtype={'Col_1': String})
Series.all and Series.any now support the level and skipna parameters (GH8302):
In [20]: s = pd.Series([False, True, False], index=[0, 0, 1]) In [21]: s.any(level=0) Out[21]: 0 True 1 False dtype: bool
Panel now supports the all and any aggregation functions. (GH8302):
In [22]: p = pd.Panel(np.random.rand(2, 5, 4) > 0.1) In [23]: p.all() Out[23]: 0 1 0 True True 1 True True 2 False False 3 True True
Added support for utcfromtimestamp(), fromtimestamp(), and combine() on the Timestamp class (GH5351).
Added Google Analytics (pandas.io.ga) basic documentation (GH8835). See here: http://pandas.pydata.org/pandas-docs/version/0.15.2/remote_data.html#remote-data-ga
Timedelta arithmetic returns NotImplemented in unknown cases, allowing extensions by custom classes (GH8813).
Timedelta now supports arithmetic with numpy.ndarray objects of the appropriate dtype (numpy 1.8 or newer only) (GH8884).
Added the Timedelta.to_timedelta64() method to the public API (GH8884).
Added the gbq.generate_bq_schema() function to the gbq module (GH8325).
Series now works with map objects the same way as generators (GH8909).
Added a context manager to HDFStore for automatic closing (GH8791).
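A brief sketch of two of these; the HDF5 file name is hypothetical, and writing to an HDFStore assumes PyTables is installed:
import pandas as pd
pd.Timedelta('1 day').to_timedelta64()  # the equivalent numpy.timedelta64 scalar
# HDFStore can now be used as a context manager; the store is closed on exit
with pd.HDFStore('example_store.h5') as store:
    store['s'] = pd.Series([1, 2, 3])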
to_datetime gains an exact keyword to allow a format to not require an exact match for a provided format string (if it is False). exact defaults to True (meaning that exact matching is still the default) (GH8904)
Added an axvlines boolean option to the parallel_coordinates plot function; it determines whether vertical lines will be printed, and defaults to True
Added ability to read table footers to read_html (GH8552)
to_sql now infers datatypes of non-NA values for columns that contain NA values and have dtype object (GH8778).
Performance¶
Performance boost for to_datetime conversions with a passed format=, and the exact=False (GH8904)
Bug Fixes¶
Bug in concat of Series with category dtype which were coercing to object. (GH8641)
Bug where some cases now raise TypeError rather than ValueError (a couple of edge cases only), (GH8865)
Bug in using pd.Grouper(key=...) with no level/axis or level only (GH8795, GH8866)
Report a TypeError when invalid/no parameters are passed in a groupby (GH8015)
Fixed bug in packaging pandas with py2app/cx_Freeze (GH8602, GH8831)
Fixed bug in groupby signatures that didn't include *args or **kwargs (GH8733).
io.data.Options now raises RemoteDataError when no expiry dates are available from Yahoo and when it receives no data from Yahoo (GH8761), (GH8783).
Timedelta kwargs may now be numpy ints and floats (GH8757).
Fixed several outstanding bugs in Timedelta arithmetic and comparisons (GH8813, GH5963, GH5436).
sql_schema now generates dialect appropriate CREATE TABLE statements (GH8697)
The slice string method now takes step into account (GH8754)
Bug in BlockManager where setting values with a different type would break block integrity (GH8850)
Bug in DatetimeIndex when using a time object as a key (GH8667)
Bug in merge where how='left' and sort=False would not preserve left frame order (GH7331)
Bug in MultiIndex.reindex where reindexing at a level would not reorder labels (GH4088)
Bug in to_datetime when parsing nanoseconds using the %f format (GH8989)
Bug in to_html with index=False which would add an extra column (GH8452).
The .size attribute across NDFrame objects now provides compat with numpy >= 1.9.1; previously buggy with np.array_split (GH8846)
Fixed get_data_google returning object dtypes (GH3995)
Bug in DataFrame.stack(..., dropna=False) when the DataFrame's columns is a MultiIndex whose labels do not reference all its levels. (GH8844)
Bug involving __enter__ (GH8514)
Bug where DataFrame.plot(kind='scatter') fails when checking if an np.array is in the DataFrame (GH8852)
Bug in pd.infer_freq/DataFrame.inferred_freq that prevented proper sub-daily frequency inference when the index contained DST days (GH8772).
Bug where the index name was still used when plotting a series with use_index=False (GH8558).
Bugs in MultiIndex where __contains__ returns a wrong result if the index is not lexically sorted or unique (GH7724)
Bug where Timestamp does not parse the 'Z' zone designator for UTC (GH8771)
This is a minor bug-fix release from 0.15.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
API changes¶
s.dt.hour and other .dt accessors will now return np.nan for missing values (rather than previously -1), (GH8689)
In [1]: s = Series(date_range('20130101',periods=5,freq='D')) In [2]: s.iloc[2] = np.nan In [3]: s Out[3]: 0 2013-01-01 1 2013-01-02 2 NaT 3 2013-01-04 4 2013-01-05 dtype: datetime64[ns]
previous behavior:
In [6]: s.dt.hour Out[6]: 0 0 1 0 2 -1 3 0 4 0 dtype: int64
current behavior:
In [4]: s.dt.hour Out[4]: 0 0.0 1 0.0 2 NaN 3 0.0 4 0.0 dtype: float64
groupby with as_index=False will not add erroneous extra columns to the result (GH8582):
In [5]: np.random.seed(2718281) In [6]: df = pd.DataFrame(np.random.randint(0, 100, (10, 2)), ...: columns=['jim', 'joe']) ...: In [7]: df.head() Out[7]: jim joe 0 61 81 1 96 49 2 55 65 3 72 51 4 77 12 In [8]: ts = pd.Series(5 * np.random.randint(0, 3, 10))
previous behavior:
In [4]: df.groupby(ts, as_index=False).max() Out[4]: NaN jim joe 0 0 72 83 1 5 77 84 2 10 96 65
current behavior:
In [9]: df.groupby(ts, as_index=False).max() Out[9]: jim joe 0 72 83 1 77 84 2 96 65
groupby will not erroneously exclude columns if the column name conflicts with the grouper name (GH8112):
In [10]: df = pd.DataFrame({'jim': range(5), 'joe': range(5, 10)}) In [11]: df Out[11]: jim joe 0 0 5 1 1 6 2 2 7 3 3 8 4 4 9 In [12]: gr = df.groupby(df['jim'] < 2)
previous behavior (excludes 1st column from output):
In [4]: gr.apply(sum) Out[4]: joe jim False 24 True 11
current behavior:
In [13]: gr.apply(sum) Out[13]: jim joe jim False 9 24 True 1 11
Support for slicing with monotonic decreasing indexes, even if start or stop is not found in the index (GH7860):
In [14]: s = pd.Series(['a', 'b', 'c', 'd'], [4, 3, 2, 1]) In [15]: s Out[15]: 4 a 3 b 2 c 1 d dtype: object
previous behavior:
In [8]: s.loc[3.5:1.5] KeyError: 3.5
current behavior:
In [16]: s.loc[3.5:1.5] Out[16]: 3 b 2 c dtype: object
io.data.Options has been fixed for a change in the format of the Yahoo Options page (GH8612), (GH8741)
Note
As a result of a change in Yahoo's option page layout, when an expiry date is given, Options methods now return data for a single expiry date. Previously, methods returned all data for the selected month.
The month and year parameters have been undeprecated and can be used to get all options data for a given month.
If an expiry date that is not valid is given, data for the next expiry after the given date is returned.
Option data frames are now saved on the instance as callsYYMMDD or putsYYMMDD. Previously they were saved as callsMMYY and putsMMYY. The next expiry is saved as calls and puts.
New features:
expiry_dates was added, which returns all available expiry dates.
Current behavior:
In [17]: from pandas.io.data import Options In [18]: aapl = Options('aapl','yahoo') In [19]: aapl.get_call_data().iloc[0:5,0:1] Out[19]: Last Strike Expiry Type Symbol 80 2014-11-14 call AAPL141114C00080000 29.05 84 2014-11-14 call AAPL141114C00084000 24.80 85 2014-11-14 call AAPL141114C00085000 24.05 86 2014-11-14 call AAPL141114C00086000 22.76 87 2014-11-14 call AAPL141114C00087000 21.74 In [20]: aapl.expiry_dates Out[20]: [datetime.date(2014, 11, 14), datetime.date(2014, 11, 22), datetime.date(2014, 11, 28), datetime.date(2014, 12, 5), datetime.date(2014, 12, 12), datetime.date(2014, 12, 20), datetime.date(2015, 1, 17), datetime.date(2015, 2, 20), datetime.date(2015, 4, 17), datetime.date(2015, 7, 17), datetime.date(2016, 1, 15), datetime.date(2017, 1, 20)] In [21]: aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3]).iloc[0:5,0:1] Out[21]: Last Strike Expiry Type Symbol 109 2014-11-22 call AAPL141122C00109000 1.48 2014-11-28 call AAPL141128C00109000 1.79 110 2014-11-14 call AAPL141114C00110000 0.55 2014-11-22 call AAPL141122C00110000 1.02 2014-11-28 call AAPL141128C00110000 1.32
pandas now also registers the datetime64 dtype in matplotlib's units registry to plot such values as datetimes. This is activated once pandas is imported. In previous versions, plotting an array of datetime64 values would have resulted in plotted integer values. To keep the previous behaviour, you can do del matplotlib.units.registry[np.datetime64] (GH8614).
concat permits a wider variety of iterables of pandas objects to be passed as the first parameter (GH8645):
In [17]: from collections import deque In [18]: df1 = pd.DataFrame([1, 2, 3]) In [19]: df2 = pd.DataFrame([4, 5, 6])
previous behavior:
In [7]: pd.concat(deque((df1, df2))) TypeError: first argument must be a list-like of pandas objects, you passed an object of type "deque"
current behavior:
In [20]: pd.concat(deque((df1, df2))) Out[20]: 0 0 1 1 2 2 3 0 4 1 5 2 6
Represent MultiIndex labels with a dtype that utilizes memory based on the level size. In prior versions, the memory usage was a constant 8 bytes per element in each level. In addition, in prior versions, the reported memory usage was incorrect as it didn't show the usage for the memory occupied by the underlying data array. (GH8456)
In [21]: dfi = DataFrame(1,index=pd.MultiIndex.from_product([['a'],range(1000)]),columns=['A'])
previous behavior:
# this was underreported in prior versions In [1]: dfi.memory_usage(index=True) Out[1]: Index 8000 # took about 24008 bytes in < 0.15.1 A 8000 dtype: int64
current behavior:
In [22]: dfi.memory_usage(index=True) Out[22]: Index 11040 A 8000 dtype: int64
Added Index properties is_monotonic_increasing and is_monotonic_decreasing (GH8680).
Added option to select columns when importing Stata files (GH7935)
Qualify memory usage in DataFrame.info() by adding + if it is a lower bound (GH8578)
Raise errors in certain aggregation cases where an argument such as numeric_only is not handled (GH8592).
Added support for 3-character ISO and non-standard country codes in io.wb.download() (GH8482)
World Bank data requests now will warn/raise based on an errors argument, as well as a list of hard-coded country codes and the World Bank's JSON response. In prior versions, the error messages didn't look at the World Bank's JSON response. Problem-inducing inputs were simply dropped prior to the request. The issue was that many good countries were cropped in the hard-coded approach. All countries will work now, but some bad countries will raise exceptions because some edge cases break the entire response. (GH8482)
Added an option to Series.str.split() to return a DataFrame rather than a Series (GH8428); see the sketch below
Added an option to df.info(null_counts=None|True|False) to override the default display options and force showing of the null-counts (GH8701)
Bug Fixes¶
Bug involving the CustomBusinessDay object (GH8591)
Bug in coercing Categorical to a records array, e.g. df.to_records() (GH8626)
Bug in Categorical not created properly with Series.to_frame() (GH8626)
Bug in coercing in astype of a Categorical of a passed pd.Categorical (this now raises TypeError correctly), (GH8626)
Bug in cut/qcut when using Series and retbins=True (GH8589)
Bug when writing categorical data with to_sql (GH8624).
Bug in a Categorical of datetime raising when being compared to a scalar datetime (GH8687)
Bug in selecting from a Categorical with .iloc (GH8623)
Bug in a Categorical reflected comparison operator raising if the first argument was a numpy array scalar (e.g. np.int64) (GH8658)
Bug in DataFrame.dtypes when options.mode.use_inf_as_null is True (GH8722)
Bug in read_csv where the dialect parameter would not take a string (GH8703)
Bug involving np.nan on numpy 1.7 (GH8980)
Fixed the .shape attribute for MultiIndex (GH8609)
Bug in GroupBy where a name conflict between the grouper and columns would break groupby operations (GH7115, GH8112)
Bug where plotting a column y and specifying a label would mutate the index name of the original DataFrame (GH8494)
Bug in date_range where partially-specified dates would incorporate the current date (GH6961)
Bug where DataReaders would fail if one of the symbols passed was invalid. Now returns data for valid symbols and np.nan for invalid ones (GH8494)
Bug in get_quote_yahoo that wouldn't allow non-float return values (GH5229).
This is a major release from 0.14.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Warning
pandas >= 0.15.0 will no longer support compatibility with NumPy versions < 1.7.0. If you want to use the latest versions of pandas, please upgrade to NumPy >= 1.7.0 (GH7711)
Highlights include:
The Categorical type was integrated as a first-class pandas type, see here
New scalar type Timedelta, and a new index type TimedeltaIndex, see here
New datetimelike properties accessor .dt for Series, see Datetimelike Properties
New DataFrame default display for df.info() to include memory usage, see Memory Usage
read_csv will now by default ignore blank lines when parsing, see here
Refactoring of the Index class to no longer sub-class ndarray, see Internal Refactoring
Dropped support for PyTables less than version 3.0.0, and numexpr less than version 2.1 (GH7990)
Warning
In 0.15.0 Index has internally been refactored to no longer sub-class ndarray but instead subclass PandasObject, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (See the Internal Refactoring)
Warning
The refactorings in Categorical changed the two argument constructor from "codes/labels and levels" to "values and levels (now called 'categories')". This can lead to subtle bugs. If you use Categorical directly, please audit your code before updating to this pandas version and change it to use the from_codes() constructor. See more on Categorical here
Categorical can now be included in Series and DataFrames and gained new methods to manipulate. Thanks to Jan Schulz for much of this API/implementation. (GH3943, GH5313, GH5314, GH7444, GH7839, GH7848, GH7864, GH7914, GH7768, GH8006, GH3678, GH8075, GH8076, GH8143, GH8453, GH8518).
For full docs, see the categorical introduction and the API documentation.
In [1]: df = DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']}) In [2]: df["grade"] = df["raw_grade"].astype("category") In [3]: df["grade"] Out[3]: 0 a 1 b 2 b 3 a 4 a 5 e Name: grade, dtype: category Categories (3, object): [a, b, e] # Rename the categories In [4]: df["grade"].cat.categories = ["very good", "good", "very bad"] # Reorder the categories and simultaneously add the missing categories In [5]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"]) In [6]: df["grade"] Out[6]: 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good] In [7]: df.sort_values("grade") Out[7]: id raw_grade grade 5 6 e very bad 1 2 b good 2 3 b good 0 1 a very good 3 4 a very good 4 5 a very good In [8]: df.groupby("grade").size() Out[8]: grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64
pandas.core.group_agg and pandas.core.factor_agg were removed. As an alternative, construct a dataframe and use df.groupby(<group>).agg(<func>).
Supplying "codes/labels and levels" to the Categorical constructor is not supported anymore. Supplying two arguments to the constructor is now interpreted as "values and levels (now called 'categories')". Please change your code to use the from_codes() constructor.
The Categorical.labels attribute was renamed to Categorical.codes and is read only. If you want to manipulate codes, please use one of the API methods on Categoricals.
The Categorical.levels attribute is renamed to Categorical.categories.
We introduce a new scalar type Timedelta, which is a subclass of datetime.timedelta, and behaves in a similar manner, but allows compatibility with np.timedelta64 types as well as a host of custom representation, parsing, and attributes. This type is very similar to how Timestamp works for datetimes. It is a nice-API box for the type. See the docs. (GH3009, GH4533, GH8209, GH8187, GH8190, GH7869, GH7661, GH8345, GH8471)
Warning
Timedelta scalars (and TimedeltaIndex) component fields are not the same as the component fields on a datetime.timedelta object. For example, .seconds on a datetime.timedelta object returns the total number of seconds combined between hours, minutes and seconds. In contrast, the pandas Timedelta breaks out hours, minutes, microseconds and nanoseconds separately.
# Timedelta accessor In [9]: tds = Timedelta('31 days 5 min 3 sec') In [10]: tds.minutes Out[10]: 5L In [11]: tds.seconds Out[11]: 3L # datetime.timedelta accessor # this is 5 minutes * 60 + 3 seconds In [12]: tds.to_pytimedelta().seconds Out[12]: 303
Note: this is no longer true starting from v0.16.0, where full compatibility with datetime.timedelta is introduced. See the 0.16.0 whatsnew entry
Warning
Prior to 0.15.0 pd.to_timedelta would return a Series for list-like/Series input, and a np.timedelta64 for scalar input. It will now return a TimedeltaIndex for list-like input, Series for Series input, and Timedelta for scalar input.
The arguments to pd.to_timedelta are now (arg,unit='ns',box=True,coerce=False), previously they were (arg,box=True,unit='ns') as these are more logical.
Construct a scalar
In [9]: Timedelta('1 days 06:05:01.00003') Out[9]: Timedelta('1 days 06:05:01.000030') In [10]: Timedelta('15.5us') Out[10]: Timedelta('0 days 00:00:00.000015') In [11]: Timedelta('1 hour 15.5us') Out[11]: Timedelta('0 days 01:00:00.000015') # negative Timedeltas have this string repr # to be more consistent with datetime.timedelta conventions In [12]: Timedelta('-1us') Out[12]: Timedelta('-1 days +23:59:59.999999') # a NaT In [13]: Timedelta('nan') Out[13]: NaT
Access fields for a Timedelta
In [14]: td = Timedelta('1 hour 3m 15.5us') In [15]: td.seconds Out[15]: 3780 In [16]: td.microseconds Out[16]: 15 In [17]: td.nanoseconds Out[17]: 500
Construct a TimedeltaIndex
In [18]: TimedeltaIndex(['1 days','1 days, 00:00:05', ....: np.timedelta64(2,'D'),timedelta(days=2,seconds=2)]) ....: Out[18]: TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00', '2 days 00:00:02'], dtype='timedelta64[ns]', freq=None)
Constructing a TimedeltaIndex with a regular range
In [19]: timedelta_range('1 days',periods=5,freq='D') Out[19]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D') In [20]: timedelta_range(start='1 days',end='2 days',freq='30T') Out[20]: TimedeltaIndex(['1 days 00:00:00', '1 days 00:30:00', '1 days 01:00:00', '1 days 01:30:00', '1 days 02:00:00', '1 days 02:30:00', '1 days 03:00:00', '1 days 03:30:00', '1 days 04:00:00', '1 days 04:30:00', '1 days 05:00:00', '1 days 05:30:00', '1 days 06:00:00', '1 days 06:30:00', '1 days 07:00:00', '1 days 07:30:00', '1 days 08:00:00', '1 days 08:30:00', '1 days 09:00:00', '1 days 09:30:00', '1 days 10:00:00', '1 days 10:30:00', '1 days 11:00:00', '1 days 11:30:00', '1 days 12:00:00', '1 days 12:30:00', '1 days 13:00:00', '1 days 13:30:00', '1 days 14:00:00', '1 days 14:30:00', '1 days 15:00:00', '1 days 15:30:00', '1 days 16:00:00', '1 days 16:30:00', '1 days 17:00:00', '1 days 17:30:00', '1 days 18:00:00', '1 days 18:30:00', '1 days 19:00:00', '1 days 19:30:00', '1 days 20:00:00', '1 days 20:30:00', '1 days 21:00:00', '1 days 21:30:00', '1 days 22:00:00', '1 days 22:30:00', '1 days 23:00:00', '1 days 23:30:00', '2 days 00:00:00'], dtype='timedelta64[ns]', freq='30T')
You can now use a TimedeltaIndex as the index of a pandas object
In [21]: s = Series(np.arange(5), ....: index=timedelta_range('1 days',periods=5,freq='s')) ....: In [22]: s Out[22]: 1 days 00:00:00 0 1 days 00:00:01 1 1 days 00:00:02 2 1 days 00:00:03 3 1 days 00:00:04 4 Freq: S, dtype: int64
You can select with partial string selections
In [23]: s['1 day 00:00:02'] Out[23]: 2 In [24]: s['1 day':'1 day 00:00:02'] Out[24]: 1 days 00:00:00 0 1 days 00:00:01 1 1 days 00:00:02 2 Freq: S, dtype: int64
Finally, the combination of TimedeltaIndex with DatetimeIndex allows certain combination operations that are NaT preserving:
In [25]: tdi = TimedeltaIndex(['1 days',pd.NaT,'2 days']) In [26]: tdi.tolist() Out[26]: [Timedelta('1 days 00:00:00'), NaT, Timedelta('2 days 00:00:00')] In [27]: dti = date_range('20130101',periods=3) In [28]: dti.tolist() Out[28]: [Timestamp('2013-01-01 00:00:00', freq='D'), Timestamp('2013-01-02 00:00:00', freq='D'), Timestamp('2013-01-03 00:00:00', freq='D')] In [29]: (dti + tdi).tolist() Out[29]: [Timestamp('2013-01-02 00:00:00'), NaT, Timestamp('2013-01-05 00:00:00')] In [30]: (dti - tdi).tolist() Out[30]: [Timestamp('2012-12-31 00:00:00'), NaT, Timestamp('2013-01-01 00:00:00')]
Iterating over a Series (e.g. list(Series(...))) of timedelta64[ns] would prior to v0.15.0 return np.timedelta64 for each element. These will now be wrapped in Timedelta.
Implemented methods to find memory usage of a DataFrame. See the FAQ for more. (GH6852).
A new display option display.memory_usage (see Options and Settings) sets the default behavior of the memory_usage argument in the df.info() method. By default display.memory_usage is True.
In [31]: dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]', ....: 'complex128', 'object', 'bool'] ....: In [32]: n = 5000 In [33]: data = dict([ (t, np.random.randint(100, size=n).astype(t)) ....: for t in dtypes]) ....: In [34]: df = DataFrame(data) In [35]: df['categorical'] = df['object'].astype('category') In [36]: df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 8 columns): bool 5000 non-null bool complex128 5000 non-null complex128 datetime64[ns] 5000 non-null datetime64[ns] float64 5000 non-null float64 int64 5000 non-null int64 object 5000 non-null object timedelta64[ns] 5000 non-null timedelta64[ns] categorical 5000 non-null category dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1) memory usage: 289.1+ KB
Additionally memory_usage() is an available method for a dataframe object which returns the memory usage of each column.
In [37]: df.memory_usage(index=True) Out[37]: Index 80 bool 5000 complex128 80000 datetime64[ns] 40000 float64 40000 int64 40000 object 40000 timedelta64[ns] 40000 categorical 10920 dtype: int64
.dt accessor¶
Series has gained an accessor to succinctly return datetime-like properties for the values of the Series, if it is a datetime/period-like Series. (GH7207) This will return a Series, indexed like the existing Series. See the docs
# datetime In [38]: s = Series(date_range('20130101 09:10:12',periods=4)) In [39]: s Out[39]: 0 2013-01-01 09:10:12 1 2013-01-02 09:10:12 2 2013-01-03 09:10:12 3 2013-01-04 09:10:12 dtype: datetime64[ns] In [40]: s.dt.hour Out[40]: 0 9 1 9 2 9 3 9 dtype: int64 In [41]: s.dt.second Out[41]: 0 12 1 12 2 12 3 12 dtype: int64 In [42]: s.dt.day Out[42]: 0 1 1 2 2 3 3 4 dtype: int64 In [43]: s.dt.freq Out[43]: <Day>
This enables nice expressions like this:
In [44]: s[s.dt.day==2] Out[44]: 1 2013-01-02 09:10:12 dtype: datetime64[ns]
You can easily produce tz aware transformations:
In [45]: stz = s.dt.tz_localize('US/Eastern') In [46]: stz Out[46]: 0 2013-01-01 09:10:12-05:00 1 2013-01-02 09:10:12-05:00 2 2013-01-03 09:10:12-05:00 3 2013-01-04 09:10:12-05:00 dtype: datetime64[ns, US/Eastern] In [47]: stz.dt.tz Out[47]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>
You can also chain these types of operations:
In [48]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern') Out[48]: 0 2013-01-01 04:10:12-05:00 1 2013-01-02 04:10:12-05:00 2 2013-01-03 04:10:12-05:00 3 2013-01-04 04:10:12-05:00 dtype: datetime64[ns, US/Eastern]
The .dt accessor works for period and timedelta dtypes.
# period In [49]: s = Series(period_range('20130101',periods=4,freq='D')) In [50]: s Out[50]: 0 2013-01-01 1 2013-01-02 2 2013-01-03 3 2013-01-04 dtype: object In [51]: s.dt.year Out[51]: 0 2013 1 2013 2 2013 3 2013 dtype: int64 In [52]: s.dt.day Out[52]: 0 1 1 2 2 3 3 4 dtype: int64
# timedelta In [53]: s = Series(timedelta_range('1 day 00:00:05',periods=4,freq='s')) In [54]: s Out[54]: 0 1 days 00:00:05 1 1 days 00:00:06 2 1 days 00:00:07 3 1 days 00:00:08 dtype: timedelta64[ns] In [55]: s.dt.days Out[55]: 0 1 1 1 2 1 3 1 dtype: int64 In [56]: s.dt.seconds Out[56]: 0 5 1 6 2 7 3 8 dtype: int64 In [57]: s.dt.components Out[57]: days hours minutes seconds milliseconds microseconds nanoseconds 0 1 0 0 5 0 0 0 1 1 0 0 6 0 0 0 2 1 0 0 7 0 0 0 3 1 0 0 8 0 0 0
Timezone handling improvements¶
tz_localize(None) for tz-aware Timestamp and DatetimeIndex now removes the timezone holding local time, previously this resulted in Exception or TypeError (GH7812)
In [58]: ts = Timestamp('2014-08-01 09:00', tz='US/Eastern') In [59]: ts Out[59]: Timestamp('2014-08-01 09:00:00-0400', tz='US/Eastern') In [60]: ts.tz_localize(None) Out[60]: Timestamp('2014-08-01 09:00:00') In [61]: didx = DatetimeIndex(start='2014-08-01 09:00', freq='H', periods=10, tz='US/Eastern') In [62]: didx Out[62]: DatetimeIndex(['2014-08-01 09:00:00-04:00', '2014-08-01 10:00:00-04:00', '2014-08-01 11:00:00-04:00', '2014-08-01 12:00:00-04:00', '2014-08-01 13:00:00-04:00', '2014-08-01 14:00:00-04:00', '2014-08-01 15:00:00-04:00', '2014-08-01 16:00:00-04:00', '2014-08-01 17:00:00-04:00', '2014-08-01 18:00:00-04:00'], dtype='datetime64[ns, US/Eastern]', freq='H') In [63]: didx.tz_localize(None) Out[63]: DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00', '2014-08-01 11:00:00', '2014-08-01 12:00:00', '2014-08-01 13:00:00', '2014-08-01 14:00:00', '2014-08-01 15:00:00', '2014-08-01 16:00:00', '2014-08-01 17:00:00', '2014-08-01 18:00:00'], dtype='datetime64[ns]', freq='H')
tz_localize now accepts the ambiguous keyword which allows for passing an array of bools indicating whether the date belongs in DST or not, 'NaT' for setting transition times to NaT, 'infer' for inferring DST/non-DST, and 'raise' (default) for an AmbiguousTimeError to be raised. See the docs for more details (GH7943)
DataFrame.tz_localize and DataFrame.tz_convert now accept an optional level argument for localizing a specific level of a MultiIndex (GH7846)
Timestamp.tz_localize and Timestamp.tz_convert now raise TypeError in error cases, rather than Exception (GH8025)
a timeseries/index localized to UTC when inserted into a Series/DataFrame will preserve the UTC timezone (rather than being a naive datetime64[ns]) as object dtype (GH8411)
Timestamp.__repr__ displays dateutil.tz.tzoffset info (GH7907)
rolling_min(), rolling_max(), rolling_cov(), and rolling_corr() now return objects with all NaN when len(arg) < min_periods <= window rather than raising. (This makes all rolling functions consistent in this behavior). (GH7766)
Prior to 0.15.0
In [64]: s = Series([10, 11, 12, 13])
In [15]: rolling_min(s, window=10, min_periods=5) ValueError: min_periods (5) must be <= window (4)
New behavior
In [4]: pd.rolling_min(s, window=10, min_periods=5) Out[4]: 0 NaN 1 NaN 2 NaN 3 NaN dtype: float64
rolling_max(), rolling_min(), rolling_sum(), rolling_mean(), rolling_median(), rolling_std(), rolling_var(), rolling_skew(), rolling_kurt(), rolling_quantile(), rolling_cov(), rolling_corr(), rolling_corr_pairwise(), rolling_window(), and rolling_apply() with center=True previously would return a result of the same structure as the input arg with NaN in the final (window-1)/2 entries.
Now the final (window-1)/2 entries of the result are calculated as if the input arg were followed by (window-1)/2 NaN values (or with shrinking windows, in the case of rolling_apply()). (GH7925, GH8269)
Prior behavior (note final value is NaN):
In [7]: rolling_sum(Series(range(4)), window=3, min_periods=0, center=True) Out[7]: 0 1 1 3 2 6 3 NaN dtype: float64
New behavior (note final value is 5 = sum([2, 3, NaN])):
In [7]: rolling_sum(Series(range(4)), window=3, min_periods=0, center=True) Out[7]: 0 1 1 3 2 6 3 5 dtype: float64
rolling_window() now normalizes the weights properly in rolling mean mode (mean=True) so that the calculated weighted means (e.g. 'triang', 'gaussian') are distributed about the same means as those calculated without weighting (i.e. 'boxcar'). See the note on normalization for further details. (GH7618)
In [65]: s = Series([10.5, 8.8, 11.4, 9.7, 9.3])
Behavior prior to 0.15.0:
In [39]: rolling_window(s, window=3, win_type='triang', center=True) Out[39]: 0 NaN 1 6.583333 2 6.883333 3 6.683333 4 NaN dtype: float64
New behavior
In [10]: pd.rolling_window(s, window=3, win_type='triang', center=True) Out[10]: 0 NaN 1 9.875 2 10.325 3 10.025 4 NaN dtype: float64
Removed the center argument from all expanding_ functions (see list), as the results produced when center=True did not make much sense. (GH7925)
Added an optional ddof argument to expanding_cov() and rolling_cov(). The default value of 1 is backwards-compatible. (GH8279)
Documented the ddof argument to expanding_var(), expanding_std(), rolling_var(), and rolling_std(). These functions' support of a ddof argument (with a default value of 1) was previously undocumented. (GH8064)
ewma(), ewmstd(), ewmvol(), ewmvar(), ewmcov(), and ewmcorr() now interpret min_periods in the same manner that the rolling_*() and expanding_*() functions do: a given result entry will be NaN if the (expanding, in this case) window does not contain at least min_periods values. The previous behavior was to set to NaN the min_periods entries starting with the first non-NaN value. (GH7977)
Prior behavior (note values start at index 2, which is min_periods after index 0 (the index of the first non-empty value)):
In [66]: s = Series([1, None, None, None, 2, 3])
In [51]: ewma(s, com=3., min_periods=2) Out[51]: 0 NaN 1 NaN 2 1.000000 3 1.000000 4 1.571429 5 2.189189 dtype: float64
New behavior (note values start at index 4, the location of the 2nd (since min_periods=2) non-empty value):
In [2]: pd.ewma(s, com=3., min_periods=2) Out[2]: 0 NaN 1 NaN 2 NaN 3 NaN 4 1.759644 5 2.383784 dtype: float64
ewmstd(), ewmvol(), ewmvar(), ewmcov(), and ewmcorr() now have an optional adjust argument, just like ewma() does, affecting how the weights are calculated. The default value of adjust is True, which is backwards-compatible. See Exponentially weighted moment functions for details. (GH7911)
ewma(), ewmstd(), ewmvol(), ewmvar(), ewmcov(), and ewmcorr() now have an optional ignore_na argument. When ignore_na=False (the default), missing values are taken into account in the weights calculation. When ignore_na=True (which reproduces the pre-0.15.0 behavior), missing values are ignored in the weights calculation. (GH7543)
In [7]: pd.ewma(Series([None, 1., 8.]), com=2.) Out[7]: 0 NaN 1 1.0 2 5.2 dtype: float64 In [8]: pd.ewma(Series([1., None, 8.]), com=2., ignore_na=True) # pre-0.15.0 behavior Out[8]: 0 1.0 1 1.0 2 5.2 dtype: float64 In [9]: pd.ewma(Series([1., None, 8.]), com=2., ignore_na=False) # new default Out[9]: 0 1.000000 1 1.000000 2 5.846154 dtype: float64
Warning
By default (ignore_na=False) the ewm*() functions' weights calculation in the presence of missing values is different than in pre-0.15.0 versions. To reproduce the pre-0.15.0 calculation of weights in the presence of missing values one must specify explicitly ignore_na=True.
Bug in expanding_cov(), expanding_corr(), rolling_cov(), rolling_cor(), ewmcov(), and ewmcorr() returning results with columns sorted by name and producing an error for non-unique columns; now handles non-unique columns and returns columns in original order (except for the case of two DataFrames with pairwise=False, where behavior is unchanged) (GH7542)
Bug in rolling_count() and expanding_*() functions unnecessarily producing an error message for zero-length data (GH8056)
Bug in rolling_apply() and expanding_apply() interpreting min_periods=0 as min_periods=1 (GH8080)
Bug in expanding_std() and expanding_var() for a single value producing a confusing error message (GH7900)
Bug in rolling_std() and rolling_var() for a single value producing 0 rather than NaN (GH7900)
Bug in ewmstd(), ewmvol(), ewmvar(), and ewmcov() calculation of de-biasing factors when bias=False (the default). Previously an incorrect constant factor was used, based on adjust=True, ignore_na=True, and an infinite number of observations. Now a different factor is used for each entry, based on the actual weights (analogous to the usual N/(N-1) factor). In particular, for a single point a value of NaN is returned when bias=False, whereas previously a value of (approximately) 0 was returned.
For example, consider the following pre-0.15.0 results for ewmvar(..., bias=False), and the corresponding debiasing factors:
In [67]: s = Series([1., 2., 0., 4.])
In [89]: ewmvar(s, com=2., bias=False) Out[89]: 0 -2.775558e-16 1 3.000000e-01 2 9.556787e-01 3 3.585799e+00 dtype: float64 In [90]: ewmvar(s, com=2., bias=False) / ewmvar(s, com=2., bias=True) Out[90]: 0 1.25 1 1.25 2 1.25 3 1.25 dtype: float64
Note that entry 0 is approximately 0, and the debiasing factors are a constant 1.25. By comparison, the following 0.15.0 results have a NaN for entry 0, and the debiasing factors are decreasing (towards 1.25):
In [14]: pd.ewmvar(s, com=2., bias=False) Out[14]: 0 NaN 1 0.500000 2 1.210526 3 4.089069 dtype: float64 In [15]: pd.ewmvar(s, com=2., bias=False) / pd.ewmvar(s, com=2., bias=True) Out[15]: 0 NaN 1 2.083333 2 1.583333 3 1.425439 dtype: float64
See Exponentially weighted moment functions for details. (GH7912)
Added support for a chunksize parameter to the to_sql function. This allows DataFrames to be written in chunks and avoids packet-size overflow errors (GH8062).
Added support for a chunksize parameter to the read_sql function. Specifying this argument will return an iterator through chunks of the query result (GH2908).
Added support for writing datetime.date and datetime.time object columns with to_sql (GH6932).
Added support for specifying a schema to read from/write to with read_sql_table and to_sql (GH7441, GH7952). For example:
df.to_sql('table', engine, schema='other_schema') pd.read_sql_table('table', engine, schema='other_schema')
Added support for writing NaN values with to_sql (GH2754).
Added support for writing datetime64 columns with to_sql for all database flavors (GH7103).
API changes related to Categorical (see here for more details):
The Categorical constructor with two arguments changed from "codes/labels and levels" to "values and levels (now called 'categories')". This can lead to subtle bugs. If you use Categorical directly, please audit your code by changing it to use the from_codes() constructor.
An old function call like (prior to 0.15.0):
pd.Categorical([0,1,0,2,1], levels=['a', 'b', 'c'])
will have to be adapted to the following to keep the same behaviour:
In [2]: pd.Categorical.from_codes([0,1,0,2,1], categories=['a', 'b', 'c']) Out[2]: [a, b, a, c, b] Categories (3, object): [a, b, c]
API changes related to the introduction of the Timedelta scalar (see above for more details):
Prior to 0.15.0 to_timedelta() would return a Series for list-like/Series input, and a np.timedelta64 for scalar input. It will now return a TimedeltaIndex for list-like input, Series for Series input, and Timedelta for scalar input.
For API changes related to the rolling and expanding functions, see the detailed overview above.
Other notable API changes:
Consistency when indexing with .loc and a list-like indexer when no values are found.
In [68]: df = DataFrame([['a'],['b']],index=[1,2]) In [69]: df Out[69]: 0 1 a 2 b
In prior versions there was a difference in these two constructs:
df.loc[[3]] would return a frame reindexed by 3 (with all np.nan values)
df.loc[[3],:] would raise KeyError.
Both will now raise a KeyError. The rule is that at least 1 indexer must be found when using a list-like and .loc (GH7999)
Furthermore in prior versions these were also different:
df.loc[[1,3]] would return a frame reindexed by [1,3]
df.loc[[1,3],:] would raise KeyError.
Both will now return a frame reindexed by [1,3]. E.g.
In [70]: df.loc[[1,3]] Out[70]: 0 1 a 3 NaN In [71]: df.loc[[1,3],:] Out[71]: 0 1 a 3 NaN
This can also be seen in multi-axis indexing with a Panel.
In [72]: p = Panel(np.arange(2*3*4).reshape(2,3,4), ....: items=['ItemA','ItemB'], ....: major_axis=[1,2,3], ....: minor_axis=['A','B','C','D']) ....: In [73]: p Out[73]: <class 'pandas.core.panel.Panel'> Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis) Items axis: ItemA to ItemB Major_axis axis: 1 to 3 Minor_axis axis: A to D
The following would raise KeyError prior to 0.15.0:
In [74]: p.loc[['ItemA','ItemD'],:,'D'] Out[74]: ItemA ItemD 1 3 NaN 2 7 NaN 3 11 NaN
Furthermore, .loc will raise if no values are found in a multi-index with a list-like indexer:
In [75]: s = Series(np.arange(3,dtype='int64'), ....: index=MultiIndex.from_product([['A'],['foo','bar','baz']], ....: names=['one','two']) ....: ).sort_index() ....: In [76]: s Out[76]: one two A bar 1 baz 2 foo 0 dtype: int64 In [77]: try: ....: s.loc[['D']] ....: except KeyError as e: ....: print("KeyError: " + str(e)) ....: KeyError: "['D'] not in index"
Assigning values to None now considers the dtype when choosing an 'empty' value (GH7941).
Previously, assigning to None in numeric containers changed the dtype to object (or errored, depending on the call). It now uses NaN:
In [78]: s = Series([1, 2, 3]) In [79]: s.loc[0] = None In [80]: s Out[80]: 0 NaN 1 2.0 2 3.0 dtype: float64
NaT is now used similarly for datetime containers.
For object containers, we now preserve None values (previously these were converted to NaN values).
In [81]: s = Series(["a", "b", "c"]) In [82]: s.loc[0] = None In [83]: s Out[83]: 0 None 1 b 2 c dtype: object
To insert a NaN, you must explicitly use np.nan. See the docs.
In prior versions, updating a pandas object inplace would not reflect in other python references to this object. (GH8511, GH5104)
In [84]: s = Series([1, 2, 3]) In [85]: s2 = s In [86]: s += 1.5
Behavior prior to v0.15.0
# the original object In [5]: s Out[5]: 0 2.5 1 3.5 2 4.5 dtype: float64 # a reference to the original object In [7]: s2 Out[7]: 0 1 1 2 2 3 dtype: int64
This is now the correct behavior
# the original object In [87]: s Out[87]: 0 2.5 1 3.5 2 4.5 dtype: float64 # a reference to the original object In [88]: s2 Out[88]: 0 2.5 1 3.5 2 4.5 dtype: float64
Made both the C-based and Python engines for read_csv and read_table ignore empty lines in input as well as whitespace-filled lines, as long as sep is not whitespace. This is an API change that can be controlled by the keyword parameter skip_blank_lines. See the docs (GH4466)
A timeseries/index localized to UTC when inserted into a Series/DataFrame will preserve the UTC timezone and be inserted as object dtype rather than being converted to a naive datetime64[ns] (GH8411).
Bug in passing a DatetimeIndex with a timezone that was not being retained in DataFrame construction from a dict (GH7822)
In prior versions this would drop the timezone, now it retains the timezone, but gives a column of object dtype:
In [89]: i = date_range('1/1/2011', periods=3, freq='10s', tz = 'US/Eastern') In [90]: i Out[90]: DatetimeIndex(['2011-01-01 00:00:00-05:00', '2011-01-01 00:00:10-05:00', '2011-01-01 00:00:20-05:00'], dtype='datetime64[ns, US/Eastern]', freq='10S') In [91]: df = DataFrame( {'a' : i } ) In [92]: df Out[92]: a 0 2011-01-01 00:00:00-05:00 1 2011-01-01 00:00:10-05:00 2 2011-01-01 00:00:20-05:00 In [93]: df.dtypes Out[93]: a datetime64[ns, US/Eastern] dtype: object
Previously this would have yielded a column of datetime64 dtype, but without timezone info.
The behaviour of assigning a column to an existing dataframe as df['a'] = i remains unchanged (this already returned an object column with a timezone).
When passing multiple levels to stack(), it will now raise a ValueError when the levels aren't all level names or all level numbers (GH7660). See Reshaping by stacking and unstacking.
Raise a ValueError in df.to_hdf with 'fixed' format, if df has non-unique columns as the resulting file will be broken (GH7761)
SettingWithCopy raise/warnings (according to the option mode.chained_assignment) will now be issued when setting a value on a sliced mixed-dtype DataFrame using chained-assignment. (GH7845, GH7950)
In [1]: df = DataFrame(np.arange(0,9), columns=['count']) In [2]: df['group'] = 'b' In [3]: df.iloc[0:5]['group'] = 'a' /usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
merge, DataFrame.merge, and ordered_merge now return the same type as the left argument (GH7737).

Previously an enlargement with a mixed-dtype frame would act unlike .append which will preserve dtypes (related GH2578, GH8176):

In [94]: df = DataFrame([[True, 1], [False, 2]],
   ....:                columns=["female", "fitness"])

In [95]: df
Out[95]:
  female  fitness
0   True        1
1  False        2

In [96]: df.dtypes
Out[96]:
female      bool
fitness    int64
dtype: object

# dtypes are now preserved
In [97]: df.loc[2] = df.loc[1]

In [98]: df
Out[98]:
  female  fitness
0   True        1
1  False        2
2  False        2

In [99]: df.dtypes
Out[99]:
female      bool
fitness    int64
dtype: object
Series.to_csv() now returns a string when path=None, matching the behaviour of DataFrame.to_csv() (GH8215).

read_hdf now raises IOError when a file that doesn't exist is passed in. Previously, a new, empty file was created, and a KeyError raised (GH7715).
DataFrame.info() now ends its output with a newline character (GH8114)

Concatenating no objects will now raise a ValueError rather than a bare Exception.

Merge errors will now be sub-classes of ValueError rather than raw Exception (GH8501)
DataFrame.plot and Series.plot keywords now have consistent orders (GH8037)

In 0.15.0 Index has internally been refactored to no longer sub-class ndarray but instead subclass PandasObject, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (GH5080, GH7439, GH7796, GH8024, GH8367, GH7997, GH8522):
you should unpickle with pd.read_pickle rather than pickle.load; see the pickle docs
when plotting with a PeriodIndex, the matplotlib internal axes will now be arrays of Period rather than a PeriodIndex (this is similar to how a DatetimeIndex passes arrays of datetimes now)
when plotting with a DatetimeIndex, axis labels may be shown as integers (the internal representation of a datetime64). UPDATE: this is fixed in 0.15.1, see here.

The Categorical labels and levels attributes are deprecated and renamed to codes and categories.
The outtype argument to pd.DataFrame.to_dict has been deprecated in favor of orient. (GH7840)
The convert_dummies method has been deprecated in favor of get_dummies (GH8140)
The infer_dst argument in tz_localize will be deprecated in favor of ambiguous to allow for more flexibility in dealing with DST transitions. Replace infer_dst=True with ambiguous='infer' for the same behavior (GH7943). See the docs for more details.
pd.value_range has been deprecated and can be replaced by .describe() (GH8481)

The Index set operations + and - were deprecated in order to provide these for numeric type operations on certain index types. + can be replaced by .union() or |, and - by .difference(). Further the method name Index.diff() is deprecated and can be replaced by Index.difference() (GH8226)

# +
Index(['a', 'b', 'c']) + Index(['b', 'c', 'd'])

# should be replaced by
Index(['a', 'b', 'c']).union(Index(['b', 'c', 'd']))

# -
Index(['a', 'b', 'c']) - Index(['b', 'c', 'd'])

# should be replaced by
Index(['a', 'b', 'c']).difference(Index(['b', 'c', 'd']))
The infer_types argument to read_html() now has no effect and is deprecated (GH7762, GH7032).

Removed the DataFrame.delevel method in favor of DataFrame.reset_index
Enhancements in the importing/exporting of Stata files:

Added support for bool, uint8, uint16 and uint32 datatypes in to_stata (GH7097, GH7365)
DataFrame.to_stata and StataWriter check string length for compatibility with limitations imposed in dta files where fixed-width strings must contain 244 or fewer characters. Attempting to write Stata dta files with strings longer than 244 characters raises a ValueError. (GH7858)
read_stata and StataReader can import missing data information into a DataFrame by setting the argument convert_missing to True. When using this option, missing values are returned as StataMissingValue objects and columns containing missing values have object data type. (GH8045)

Enhancements in the plotting functions:
Added layout keyword to DataFrame.plot. You can pass a tuple of (rows, columns), one of which can be -1 to automatically infer (GH6667, GH8071).
Allow passing multiple axes to DataFrame.plot, hist and boxplot (GH5353, GH6970, GH7069)
Support for c, colormap and colorbar arguments for DataFrame.plot with kind='scatter' (GH7780)
Histogram plots from DataFrame.plot with kind='hist' (GH7809), see the docs.
Boxplots from DataFrame.plot with kind='box' (GH7998), see the docs.

Other:
read_csv now has a keyword parameter float_precision which specifies which floating-point converter the C engine should use during parsing, see here (GH8002, GH8044)

Added searchsorted method to Series objects (GH7447)
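As a quick illustration (the data is made up), searchsorted returns the position(s) at which the given value(s) would be inserted to keep the Series sorted:

import pandas as pd

s = pd.Series([1, 2, 3, 5, 8])
s.searchsorted(5)        # position where 5 slots into the sorted values
s.searchsorted([0, 6])   # array-likes work element-wise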
describe() on mixed-types DataFrames is more flexible. Type-based column filtering is now possible via the include/exclude arguments. See the docs (GH8164).

In [100]: df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
   .....:                 'catB': ['a', 'b', 'c', 'd'] * 6,
   .....:                 'numC': np.arange(24),
   .....:                 'numD': np.arange(24.) + .5})

In [101]: df.describe(include=["object"])
Out[101]:
       catA catB
count    24   24
unique    2    4
top     foo    b
freq     16    6

In [102]: df.describe(include=["number", "object"], exclude=["float"])
Out[102]:
       catA catB       numC
count    24   24  24.000000
unique    2    4        NaN
top     foo    b        NaN
freq     16    6        NaN
mean    NaN  NaN  11.500000
std     NaN  NaN   7.071068
min     NaN  NaN   0.000000
25%     NaN  NaN   5.750000
50%     NaN  NaN  11.500000
75%     NaN  NaN  17.250000
max     NaN  NaN  23.000000

Requesting all columns is possible with the shorthand 'all'

In [103]: df.describe(include='all')
Out[103]:
       catA catB       numC       numD
count    24   24  24.000000  24.000000
unique    2    4        NaN        NaN
top     foo    b        NaN        NaN
freq     16    6        NaN        NaN
mean    NaN  NaN  11.500000  12.000000
std     NaN  NaN   7.071068   7.071068
min     NaN  NaN   0.000000   0.500000
25%     NaN  NaN   5.750000   6.250000
50%     NaN  NaN  11.500000  12.000000
75%     NaN  NaN  17.250000  17.750000
max     NaN  NaN  23.000000  23.500000
Without those arguments, describe will behave as before, including only numerical columns or, if none are, only categorical columns. See also the docs
Added split as an option to the orient argument in pd.DataFrame.to_dict. (GH7840)

The get_dummies method can now be used on DataFrames. By default only categorical columns are encoded as 0's and 1's, while other columns are left untouched.

In [104]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   .....:                 'C': [1, 2, 3]})

In [105]: pd.get_dummies(df)
Out[105]:
   C  A_a  A_b  B_b  B_c
0  1    1    0    0    1
1  2    0    1    0    1
2  3    1    0    1    0
PeriodIndex supports resolution the same as DatetimeIndex (GH7708)

pandas.tseries.holiday has added support for additional holidays and ways to observe holidays (GH7070)

pandas.tseries.holiday.Holiday now supports a list of offsets in Python3 (GH7070)

pandas.tseries.holiday.Holiday now supports a days_of_week parameter (GH7070)

GroupBy.nth() now supports selecting multiple nth values (GH7910)

In [106]: business_dates = date_range(start='4/1/2014', end='6/30/2014', freq='B')

In [107]: df = DataFrame(1, index=business_dates, columns=['a', 'b'])

# get the first, 4th, and last date index for each month
In [108]: df.groupby((df.index.year, df.index.month)).nth([0, 3, -1])
Out[108]:
        a  b
2014 4  1  1
     4  1  1
     4  1  1
     5  1  1
     5  1  1
     5  1  1
     6  1  1
     6  1  1
     6  1  1
Period and PeriodIndex support addition/subtraction with timedelta-likes (GH7966)

If the Period freq is D, H, T, S, L, U or N, a Timedelta-like can be added if the result can have the same freq. Otherwise, only the same offsets can be added.

In [109]: idx = pd.period_range('2014-07-01 09:00', periods=5, freq='H')

In [110]: idx
Out[110]:
PeriodIndex(['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00',
             '2014-07-01 12:00', '2014-07-01 13:00'],
            dtype='period[H]', freq='H')

In [111]: idx + pd.offsets.Hour(2)
Out[111]:
PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',
             '2014-07-01 14:00', '2014-07-01 15:00'],
            dtype='period[H]', freq='H')

In [112]: idx + Timedelta('120m')
Out[112]:
PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',
             '2014-07-01 14:00', '2014-07-01 15:00'],
            dtype='period[H]', freq='H')

In [113]: idx = pd.period_range('2014-07', periods=5, freq='M')

In [114]: idx
Out[114]: PeriodIndex(['2014-07', '2014-08', '2014-09', '2014-10', '2014-11'], dtype='period[M]', freq='M')

In [115]: idx + pd.offsets.MonthEnd(3)
Out[115]: PeriodIndex(['2014-10', '2014-11', '2014-12', '2015-01', '2015-02'], dtype='period[M]', freq='M')
Added experimental compatibility with openpyxl for versions >= 2.0. The DataFrame.to_excel method engine keyword now recognizes openpyxl1 and openpyxl2 which will explicitly require openpyxl v1 and v2 respectively, failing if the requested version is not available. The openpyxl engine is now a meta-engine that automatically uses whichever version of openpyxl is installed. (GH7177)
DataFrame.fillna can now accept a DataFrame as a fill value (GH8377)
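A minimal sketch of the new behaviour (the frames below are illustrative): the fill DataFrame is aligned on index and columns, and only the NaN cells of the original are filled.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})
fill = pd.DataFrame({'a': [10.0, 20.0], 'b': [30.0, 40.0]})

df.fillna(fill)  # only the NaN cells are taken from the aligned cells of fill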
Passing multiple levels to stack() will now work when multiple level numbers are passed (GH7660). See Reshaping by stacking and unstacking.
set_names(), set_labels(), and set_levels() methods now take an optional level keyword argument to allow modification of specific level(s) of a MultiIndex. Additionally, set_names() now accepts a scalar string value when operating on an Index or on a specific level of a MultiIndex (GH7792)

In [116]: idx = MultiIndex.from_product([['a'], range(3), list("pqr")], names=['foo', 'bar', 'baz'])

In [117]: idx.set_names('qux', level=0)
Out[117]:
MultiIndex(levels=[['a'], [0, 1, 2], ['p', 'q', 'r']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['qux', 'bar', 'baz'])

In [118]: idx.set_names(['qux', 'baz'], level=[0, 1])
Out[118]:
MultiIndex(levels=[['a'], [0, 1, 2], ['p', 'q', 'r']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['qux', 'baz', 'baz'])

In [119]: idx.set_levels(['a', 'b', 'c'], level='bar')
Out[119]:
MultiIndex(levels=[['a'], ['a', 'b', 'c'], ['p', 'q', 'r']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['foo', 'bar', 'baz'])

In [120]: idx.set_levels([['a', 'b', 'c'], [1, 2, 3]], level=[1, 2])
Out[120]:
MultiIndex(levels=[['a'], ['a', 'b', 'c'], [1, 2, 3]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['foo', 'bar', 'baz'])
Index.isin now supports a level argument to specify which index level to use for membership tests (GH7892, GH7890)

In [1]: idx = MultiIndex.from_product([[0, 1], ['a', 'b', 'c']])

In [2]: idx.values
Out[2]: array([(0, 'a'), (0, 'b'), (0, 'c'), (1, 'a'), (1, 'b'), (1, 'c')], dtype=object)

In [3]: idx.isin(['a', 'c', 'e'], level=1)
Out[3]: array([ True, False,  True,  True, False,  True], dtype=bool)

Index now supports duplicated and drop_duplicates. (GH4060)

In [121]: idx = Index([1, 2, 3, 4, 1, 2])

In [122]: idx
Out[122]: Int64Index([1, 2, 3, 4, 1, 2], dtype='int64')

In [123]: idx.duplicated()
Out[123]: array([False, False, False, False,  True,  True], dtype=bool)

In [124]: idx.drop_duplicates()
Out[124]: Int64Index([1, 2, 3, 4], dtype='int64')
add copy=True argument to pd.concat to enable pass thru of complete blocks (GH8252)

Added support for numpy 1.8+ data types (bool_, int_, float_, string_) for conversion to R dataframe (GH8400)
Performance improvements in DatetimeIndex.__iter__ to allow faster iteration (GH7683)
Performance improvements in Period creation (and PeriodIndex setitem) (GH5155)
Performance improvements in StataReader when reading large files (GH8040, GH8073)
Performance improvements in StataWriter when writing large files (GH8079)
Performance and memory usage improvements in groupby (GH8128)
Performance improvements in groupby .agg and .apply where builtins max/min were not mapped to numpy/cythonized versions (GH7722)
Performance improvement in writing to sql (to_sql) of up to 50% (GH8208)
Performance improvements in CustomBusinessDay, CustomBusinessMonth (GH8236)
Performance improvements in MultiIndex.values for multi-level indexes containing datetimes (GH8543)

Bug in read_csv where squeeze=True would return a view (GH8217)
Bug in read_sql in certain cases (GH7826)
Bug in DataFrame.groupby where Grouper does not recognize level when frequency is specified (GH7885)
Bug in Series 0-division with a float and integer operand dtypes (GH7785)
Bug in Series.astype("unicode") not calling unicode on the values correctly (GH7758)
Bug in DataFrame.as_matrix() with mixed datetime64[ns] and timedelta64[ns] dtypes (GH7778)
Bug in HDFStore.select_column() not preserving UTC timezone info when selecting a DatetimeIndex (GH7777)
Bug in to_datetime when format='%Y%m%d' and coerce=True are specified, where previously an object array was returned (rather than a coerced time-series with NaT) (GH7930)
Bug where DatetimeIndex and PeriodIndex in-place addition and subtraction cause a different result from the normal one (GH6527)
Bug where adding and subtracting PeriodIndex with PeriodIndex raise TypeError (GH7741)
Bug where combine_first with PeriodIndex data raises TypeError (GH3367)
Bug in Timestamp comparisons with == and int64 dtype (GH8058)
Bug where DateOffset may raise AttributeError when the normalize attribute is referred to internally (GH7748)
Bug in Panel when using major_xs and copy=False is passed (deprecation warning fails because of missing warnings) (GH8152)
Bug where setting a PeriodIndex into a Series would convert to int64 dtype, rather than object of Periods (GH7932)
Bug in HDFStore iteration when passing a where (GH8014)
Bug in DataFrameGroupby.transform when transforming with a passed non-sorted key (GH8046, GH8430)
Bug that could raise a ValueError or use an incorrect kind (GH7733)
Bug in MultiIndex with datetime.date inputs (GH7888)
Bug in get where an IndexError would not cause the default value to be returned (GH7725)
Bug where offsets.apply, rollforward and rollback may reset nanosecond (GH7697)
Bug where offsets.apply, rollforward and rollback may raise AttributeError if a Timestamp has dateutil tzinfo (GH7697)
Bug in Float64Index (GH8017)
Bug in DataFrame alignment (GH7763)
Bug where is_superperiod and is_subperiod cannot handle frequencies higher than S (GH7760, GH7772, GH7803)
Bug in Series.shift (GH8129)
Bug where PeriodIndex.unique returns an int64 np.ndarray (GH7540)
Bug in groupby.apply with a non-affecting mutation in the function (GH8467)
Bug in DataFrame.reset_index where a MultiIndex containing a PeriodIndex, or a DatetimeIndex with tz, raises ValueError (GH7746, GH7793)
Bug in DataFrame.plot with subplots=True that may draw unnecessary minor xticks and yticks (GH7801)
Bug in StataReader which did not read variable labels in 117 files due to a difference between the Stata documentation and implementation (GH7816)
Bug in StataReader where strings were always converted to 244 characters-fixed width irrespective of underlying string size (GH7858)
Bug where DataFrame.plot and Series.plot may ignore the rot and fontsize keywords (GH7844)
Bug where DatetimeIndex.value_counts doesn't preserve tz (GH7735)
Bug where PeriodIndex.value_counts results in Int64Index (GH7735)
Bug in DataFrame.join when doing a left join on index and there are multiple matches (GH5391)
Bug in GroupBy.transform() where int groups with a transform that didn't preserve the index were incorrectly truncated (GH7972)
Bug in groupby where callable objects without name attributes would take the wrong path, and produce a DataFrame instead of a Series (GH7929)
Bug in the groupby error message when a DataFrame grouping column is duplicated (GH7511)
Bug in read_html where the infer_types argument forced coercion of date-likes incorrectly (GH7762, GH7032)
Bug in Series.str.cat with an index which was filtered as to not include the first item (GH7857)
Bug where Timestamp cannot parse nanosecond from string (GH7878)
Bug where Timestamp with string offset and tz gives incorrect results (GH7833)
Bug where tslib.tz_convert and tslib.tz_convert_single may return different results (GH7798)
Bug where DatetimeIndex.intersection of non-overlapping timestamps with tz raises IndexError (GH7880)
Bug in GroupBy.filter() where fast path vs. slow path made the filter return a non scalar value that appeared valid but wasn't (GH7870)
Bug in date_range()/DatetimeIndex() when the timezone was inferred from input dates yet incorrect times were returned when crossing DST boundaries (GH7835, GH7901)
Bug in to_excel() where a negative sign was being prepended to positive infinity and was absent for negative infinity (GH7949)
Bug in area plot drawing the legend with an incorrect alpha when stacked=True (GH8027)
Bug where Period and PeriodIndex addition/subtraction with np.timedelta64 results in incorrect internal representations (GH7740)
Bug in Holiday with no offset or observance (GH7987)
Bug in DataFrame.to_latex formatting when columns or index is a MultiIndex (GH7982)
Bug where DateOffset around Daylight Savings Time produces unexpected results (GH5175)
Bug in DataFrame.shift where empty columns would throw ZeroDivisionError on numpy 1.7 (GH8019)
Bug in installation where html_encoding/*.html wasn't installed and therefore some tests were not running correctly (GH7927)
Bug in read_html where bytes objects were not tested for in _read (GH7927)
Bug in DataFrame.stack() when one of the column levels was a datelike (GH8039)
Bug in broadcasting numpy scalars with DataFrame (GH8116)
Bug where pivot_table performed with nameless index and columns raises KeyError (GH8103)
Bug where DataFrame.plot(kind='scatter') draws points and errorbars with different colors when the color is specified by the c keyword (GH8081)
Bug in Float64Index where iat and at were not testing and were failing (GH8092)
Bug in DataFrame.boxplot() where y-limits were not set correctly when producing multiple axes (GH7528, GH5517)
Bug in read_csv where line comments were not handled correctly given a custom line terminator or delim_whitespace=True (GH8122)
Bug in read_html where empty tables caused a StopIteration (GH7575)
Bug in GroupBy when the original grouper was a tuple (GH8121)
Bug in .at that would accept integer indexers on a non-integer index and do fallback (GH7814)
Bug in GroupBy.count with float32 data type where nan values were not excluded (GH8169)
Bug with the limit keyword when no values needed interpolating (GH7173)
Bug where col_space was ignored in DataFrame.to_string() when header=False (GH8230)
Bug in DatetimeIndex.asof incorrectly matching partial strings and returning the wrong date (GH8245)
Bug in DataFrame.__setitem__ that caused errors when setting a dataframe column to a sparse array (GH8131)
Bug where DataFrame.boxplot() failed when an entire column was empty (GH8181)
Bug in the radviz visualization (GH8199)
Bug in to_clipboard that would clip long column data (GH8305)
Bug in DataFrame terminal display: setting max_column/max_rows to zero did not trigger auto-resizing of dfs to fit terminal width/height (GH7180)
Bug in DataFrame.dropna that interpreted non-existent columns in the subset argument as the 'last column' (GH8303)
Bug in Index.intersection on non-monotonic non-unique indexes (GH8362)
Bug where NDFrame.equals gives false negatives with dtype=object (GH8437)
Bug in NDFrame.loc indexing when row/column names were lost when target was a list/ndarray (GH6552)
Bug in NDFrame.loc indexing when rows/columns were converted to Float64Index if target was an empty list/ndarray (GH7774)
Bug in Series that allowed it to be indexed by a DataFrame, which has unexpected results. Such indexing is no longer permitted (GH8444)
Bug in item assignment of a DataFrame with multi-index columns where right-hand-side columns were not aligned (GH7655)
Bug in DataFrame.eval() where the dtype of the not operator (~) was not correctly inferred as bool.

This is a minor release from 0.14.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
New methods select_dtypes() to select columns based on the dtype, and sem() to calculate the standard error of the mean.
Support for ignoring full line comments in the read_csv() text parser.

Openpyxl now raises a ValueError on construction of the openpyxl writer instead of warning on pandas import (GH7284).
For StringMethods.extract, when no match is found, the result - only containing NaN values - now also has dtype=object instead of float (GH7242)

Period objects no longer raise a TypeError when compared using == with another object that isn't a Period. Instead, when comparing a Period with another object using ==, False is returned if the other object isn't a Period. (GH7376)
Previously, the behaviour on resetting the time or not in offsets.apply, rollforward and rollback operations differed between offsets. With the support of the normalize keyword for all offsets (see below) with a default value of False (preserve time), the behaviour changed for certain offsets (BusinessMonthBegin, MonthEnd, BusinessMonthEnd, CustomBusinessMonthEnd, BusinessYearBegin, LastWeekOfMonth, FY5253Quarter, LastWeekOfMonth, Easter):

In [6]: from pandas.tseries import offsets

In [7]: d = pd.Timestamp('2014-01-01 09:00')

# old behaviour < 0.14.1
In [8]: d + offsets.MonthEnd()
Out[8]: Timestamp('2014-01-31 00:00:00')

Starting from 0.14.1 all offsets preserve time by default. The old behaviour can be obtained with normalize=True

# new behaviour
In [1]: d + offsets.MonthEnd()
Out[1]: Timestamp('2014-01-31 09:00:00')

In [2]: d + offsets.MonthEnd(normalize=True)
Out[2]: Timestamp('2014-01-31 00:00:00')

Note that for the other offsets the default behaviour did not change.
Add back #N/A N/A as a default NA value in text parsing (regression from 0.12) (GH5521)

Raise a TypeError on inplace-setting with a .where and a non np.nan value, as this is inconsistent with a set-item expression like df[mask] = None (GH7656)
Add dropna argument to value_counts and nunique (GH5569).
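A small sketch of the new argument (illustrative data): with dropna=False, missing values get their own row in the counts and are counted by nunique.

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, np.nan])
s.value_counts()              # NaN excluded by default
s.value_counts(dropna=False)  # NaN gets its own count
s.nunique(dropna=False)       # 3; NaN counted as a distinct value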
Add select_dtypes() method to allow selection of columns based on dtype (GH7316). See the docs.
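A minimal sketch of the method (the column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1.0, 2.0], 'c': ['x', 'y']})
df.select_dtypes(include=[np.number])  # the numeric columns: a and b
df.select_dtypes(include=['object'])   # the object (string) column: c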
All offsets support the normalize keyword to specify whether offsets.apply, rollforward and rollback reset the time (hour, minute, etc) or not (default False, preserves time) (GH7156):

In [3]: import pandas.tseries.offsets as offsets

In [4]: day = offsets.Day()

In [5]: day.apply(Timestamp('2014-01-01 09:00'))
Out[5]: Timestamp('2014-01-02 09:00:00')

In [6]: day = offsets.Day(normalize=True)

In [7]: day.apply(Timestamp('2014-01-01 09:00'))
Out[7]: Timestamp('2014-01-02 00:00:00')
PeriodIndex is represented in the same format as DatetimeIndex (GH7601)

StringMethods now work on empty Series (GH7242)

The file parsers read_csv and read_table now ignore line comments provided by the parameter comment, which accepts only a single character for the C reader. In particular, they allow for comments before file data begins (GH2685)
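A hedged sketch of full-line comments (the inline data is illustrative):

import pandas as pd
from io import StringIO

data = "# comment before the data begins\na,b\n1,2\n# a full-line comment\n3,4\n"
pd.read_csv(StringIO(data), comment='#')  # both comment lines are ignored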
Add NotImplementedError for simultaneous use of chunksize and nrows for read_csv() (GH6774).

Tests for basic reading of public S3 buckets now exist (GH7281).
read_html now sports an encoding argument that is passed to the underlying parser library. You can use this to read non-ascii encoded web pages (GH7323).

read_excel now supports reading from URLs in the same way that read_csv does. (GH6809)

Support for dateutil timezones, which can now be used in the same way as pytz timezones across pandas. (GH4688)

In [8]: rng = date_range('3/6/2012 00:00', periods=10, freq='D',
   ...:                  tz='dateutil/Europe/London')

In [9]: rng.tz
Out[9]: tzfile('/usr/share/zoneinfo/Europe/London')

See the docs.
Implemented sem (standard error of the mean) operation for Series, DataFrame, Panel, and Groupby (GH6897)

Add nlargest and nsmallest to the Series groupby whitelist, which means you can now use these methods on a SeriesGroupBy object (GH7053).
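As a brief, hedged illustration (made-up data): nlargest on a SeriesGroupBy returns the top values within each group.

import pandas as pd

s = pd.Series([3, 1, 2, 5, 4], index=['a', 'a', 'b', 'b', 'b'])
s.groupby(level=0).nlargest(2)  # the two largest values per group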
All offsets apply, rollforward and rollback can now handle np.datetime64, previously resulting in ApplyTypeError (GH7452)

Period and PeriodIndex can contain NaT in their values (GH7485)

Support pickling Series, DataFrame and Panel objects with non-unique labels along the item axis (index, columns and items respectively) (GH7370).

Improved inference of datetime/timedelta with mixed null objects. Regression from 0.13.1 in interpretation of an object Index with all null elements (GH7431)
Performance improvements for int64, timedelta64 and datetime64 dtypes (GH7223)
pandas.io.data.Options has a new method, get_all_data, and now consistently returns a multi-indexed DataFrame (GH5602)
io.gbq.read_gbq and io.gbq.to_gbq were refactored to remove the dependency on the Google bq.py command line client. This submodule now uses httplib2 and the Google apiclient and oauth2client API client libraries which should be more stable and, therefore, reliable than bq.py. See the docs. (GH6937).

Bug in DataFrame.where with a symmetric shaped frame and a passed other of a DataFrame (GH7506)
Bug in groupby nth with a Series and integer-like column name (GH7559)
Bug in Series.get with a boolean accessor (GH7407)
Bug in value_counts where NaT did not qualify as missing (NaN) (GH7423)
Bug in to_timedelta that accepted invalid units and misinterpreted 'm/h' (GH7611, GH6423)
Bug in line plot not setting correct xlim if secondary_y=True (GH7459)
Bug where hist and scatter plots use the old figsize default (GH7394)
Bug in plotting subplots with DataFrame.plot, hist clears the passed ax even if the number of subplots is one (GH7391)
Bug where DataFrame.boxplot with the by kw raises ValueError if the number of subplots exceeds 1 (GH7391)
Bug where subplots display ticklabels and labels according to different rules (GH5897)
Bug in Panel.apply with a multi-index as an axis (GH7469)
Bug where DatetimeIndex.insert doesn't preserve name and tz (GH7299)
Bug where DatetimeIndex.asobject doesn't preserve name (GH7299)
Bug where Index.min and max don't handle nan and NaT properly (GH7261)
Bug where PeriodIndex.min/max results in int (GH7609)
Bug in resample where fill_method was ignored if you passed how (GH2073)
Bug where TimeGrouper doesn't exclude the column specified by key (GH7227)
Bug where DataFrame and Series bar and barh plots raise TypeError when the bottom and left keywords are specified (GH7226)
Bug where DataFrame.hist raises TypeError when it contains a non numeric column (GH7277)
Bug where Index.delete does not preserve name and freq attributes (GH7302)
Bug in DataFrame.query()/eval where local string variables with the @ sign were being treated as temporaries attempting to be deleted (GH7300)
Bug in Float64Index which didn't allow duplicates (GH7149)
Bug in DataFrame.replace() where truthy values were being replaced (GH7140)
Bug in StringMethods.extract() where a single match group Series would use the matcher's name instead of the group name (GH7313)
Bug in isnull() when mode.use_inf_as_null == True where isnull wouldn't test True when it encountered an inf/-inf (GH7315)
Bug where Easter returns an incorrect date when the offset is negative (GH7195)
Bug with .div, integer dtypes and divide-by-zero (GH7325)
Bug where CustomBusinessDay.apply raises NameError when an np.datetime64 object is passed (GH7196)
Bug where MultiIndex.append, concat and pivot_table don't preserve timezone (GH6606)
Bug in .loc with a list of indexers on a single-multi index level (that is not nested) (GH7349)
Bug in Series.map when mapping a dict with tuple keys of different lengths (GH7333)
Bug where StringMethods did not work on empty Series (GH7242)
Bug where a DataFrame with a Float64Index raised a TypeError during a call to np.isnan (GH7366)
Bug where NDFrame.replace() didn't correctly replace objects with Period values (GH7379)
Bug in .ix getitem, which should always return a Series (GH7150)
Bug where a DatetimeIndex was not correctly sliced (GH7408)
Bug where NaT wasn't repr'd correctly in a MultiIndex (GH7406, GH7409)
Bug handling nan in convert_objects (GH7416)
Bug in .quantile ignoring the axis keyword argument (GH7306)
Bug where nanops._maybe_null_out doesn't work with complex numbers (GH7353)
Bug in nanops functions when axis==0 for 1-dimensional nan arrays (GH7354)
Bug where nanops.nanmedian doesn't work when axis==None (GH7352)
Bug where nanops._has_infs doesn't work with many dtypes (GH7357)
Bug in StataReader.data where reading a 0-observation dta failed (GH7369)
Bug in StataReader when reading Stata 13 (117) files containing fixed width strings (GH7360)
Bug in StataWriter where encoding was ignored (GH7286)
Bug where DatetimeIndex comparison doesn't handle NaT properly (GH7529)
Bug where passing input with tzinfo to some offsets apply, rollforward or rollback resets tzinfo or raises ValueError (GH7465)
Bug where DatetimeIndex.to_period, PeriodIndex.asobject, PeriodIndex.to_timestamp don't preserve name (GH7485)
Bug where DatetimeIndex.to_period and PeriodIndex.to_timestamp handle NaT incorrectly (GH7228)
Bug where offsets.apply, rollforward and rollback may return a normal datetime (GH7502)
Bug where resample raises ValueError when the target contains NaT (GH7227)
Bug where Timestamp.tz_localize resets nanosecond info (GH7534)
Bug where DatetimeIndex.asobject raises ValueError when it contains NaT (GH7539)
Bug where Timestamp.__new__ doesn't preserve nanosecond properly (GH7610)
Bug in Index.astype(float) where it would return an object dtype Index (GH7464)
Bug where DataFrame.reset_index loses tz (GH3950)
Bug where DatetimeIndex.freqstr raises AttributeError when freq is None (GH7606)
Bug where GroupBy.size created by TimeGrouper raises AttributeError (GH7453)
Bug raising a ValueError (GH7471)
Bug where Index.union may preserve name incorrectly (GH7458)
Bug where DatetimeIndex.intersection doesn't preserve timezone (GH4690)
Bug in rolling_var where a window larger than the array would raise an error (GH7297)
Bug with the last plotted timeseries dictating xlim (GH2960)
Bug with the secondary_y axis not being considered for timeseries xlim (GH3490)
Bug in Float64Index assignment with a non scalar indexer (GH7586)
Bug where pandas.core.strings.str_contains does not properly match in a case insensitive fashion when regex=False and case=False (GH7505)
Bug in expanding_cov, expanding_corr, rolling_cov, and rolling_corr for two arguments with mismatched index (GH7512)
Bug in to_sql taking a boolean column as a text column (GH7678)
Bug in .loc performing fallback integer indexing with object dtype indices (GH7496)
Bug in the PeriodIndex constructor when passed Series objects (GH7701).

This is a major release from 0.13.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:

SQL interfaces updated to use sqlalchemy, see here.
Holiday calendars are now supported in CustomBusinessDay, see here.

Warning

In 0.14.0 all NDFrame based containers have undergone significant internal refactoring. Before that each block of homogeneous data had its own labels and extra care was necessary to keep those in sync with the parent container's labels. This should not have any visible user/API behavior changes (GH6745)
read_excel uses 0 as the default sheet (GH6573)

iloc will now accept out-of-bounds indexers for slices, e.g. a value that exceeds the length of the object being indexed. These will be excluded. This will make pandas conform more with python/numpy indexing of out-of-bounds values. A single indexer that is out-of-bounds and drops the dimensions of the object will still raise IndexError (GH6296, GH6299). This could result in an empty axis (e.g. an empty DataFrame being returned)

In [1]: dfl = DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [2]: dfl
Out[2]:
          A         B
0  1.583584 -0.438313
1 -0.402537 -0.780572
2 -0.141685  0.542241
3  0.370966 -0.251642
4  0.787484  1.666563

In [3]: dfl.iloc[:, 2:3]
Out[3]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

In [4]: dfl.iloc[:, 1:3]
Out[4]:
          B
0 -0.438313
1 -0.780572
2  0.542241
3 -0.251642
4  1.666563

In [5]: dfl.iloc[4:6]
Out[5]:
          A         B
4  0.787484  1.666563

These are out-of-bounds selections

dfl.iloc[[4, 5, 6]]
IndexError: positional indexers are out-of-bounds

dfl.iloc[:, 4]
IndexError: single positional indexer is out-of-bounds
Slicing with negative start, stop & step values handles corner cases better (GH6531):

df.iloc[:-len(df)] is now empty
df.iloc[len(df)::-1] now enumerates all elements in reverse

The DataFrame.interpolate() keyword downcast default has been changed from infer to None. This is to preserve the original dtype unless explicitly requested otherwise (GH6290).
When converting a dataframe to HTML it used to return Empty DataFrame. This special case has been removed, instead a header with the column names is returned (GH6062).

Series and Index now internally share more common operations, e.g. factorize(), nunique(), value_counts() are now supported on Index types as well. The Series.weekday property is removed from Series for API consistency. Using a DatetimeIndex/PeriodIndex method on a Series will now raise a TypeError. (GH4551, GH4056, GH5519, GH6380, GH7206).
Add is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end accessors for DateTimeIndex / Timestamp which return a boolean array of whether the timestamp(s) are at the start/end of the month/quarter/year defined by the frequency of the DateTimeIndex / Timestamp (GH4565, GH6998)
Local variable usage has changed in pandas.eval()/DataFrame.eval()/DataFrame.query() (GH5987). For the DataFrame methods, two things have changed

Column names are now given precedence over locals
Local variables must be referred to explicitly. This means that even if you have a local variable that is not a column, you must still refer to it with the '@' prefix.
You can have an expression like df.query('@a < a') with no complaints from pandas about ambiguity of the name a.
The top-level pandas.eval() function does not allow you to use the '@' prefix and provides you with an error message telling you so.
NameResolutionError was removed because it isn't necessary anymore.

Define and document the order of column vs index names in query/eval (GH6676)
concat will now concatenate mixed Series and DataFrames using the Series name or numbering columns as needed (GH2385). See the docs

Slicing and advanced/boolean indexing operations on Index classes as well as the Index.delete() and Index.drop() methods will no longer change the type of the resulting index (GH6440, GH7040)

In [6]: i = pd.Index([1, 2, 3, 'a', 'b', 'c'])

In [7]: i[[0, 1, 2]]
Out[7]: Index([1, 2, 3], dtype='object')

In [8]: i.drop(['a', 'b', 'c'])
Out[8]: Index([1, 2, 3], dtype='object')

Previously, the above operation would return Int64Index. If you'd like to do this manually, use Index.astype()

In [9]: i[[0, 1, 2]].astype(np.int_)
Out[9]: Int64Index([1, 2, 3], dtype='int64')
set_index no longer converts MultiIndexes to an Index of tuples. For example, the old behavior returned an Index in this case (GH6459):

# Old behavior, casted MultiIndex to an Index
In [10]: tuple_ind
Out[10]: Index([('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')], dtype='object')

In [11]: df_multi.set_index(tuple_ind)
Out[11]:
               0         1
(a, c)  0.471435 -1.190976
(a, d)  1.432707 -0.312652
(b, c) -0.720589  0.887163
(b, d)  0.859588 -0.636524

# New behavior
In [12]: mi
Out[12]:
MultiIndex(levels=[['a', 'b'], ['c', 'd']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [13]: df_multi.set_index(mi)
Out[13]:
            0         1
a c  0.471435 -1.190976
  d  1.432707 -0.312652
b c -0.720589  0.887163
  d  0.859588 -0.636524

This also applies when passing multiple indices to set_index:

# Old output, 2-level MultiIndex of tuples
In [14]: df_multi.set_index([df_multi.index, df_multi.index])
Out[14]:
                      0         1
(a, c) (a, c)  0.471435 -1.190976
(a, d) (a, d)  1.432707 -0.312652
(b, c) (b, c) -0.720589  0.887163
(b, d) (b, d)  0.859588 -0.636524

# New output, 4-level MultiIndex
In [15]: df_multi.set_index([df_multi.index, df_multi.index])
Out[15]:
                 0         1
a c a c   0.471435 -1.190976
  d a d   1.432707 -0.312652
b c b c  -0.720589  0.887163
  d b d   0.859588 -0.636524
pairwise keyword was added to the statistical moment functions rolling_cov, rolling_corr, ewmcov, ewmcorr, expanding_cov, expanding_corr to allow the calculation of moving window covariance and correlation matrices (GH4950). See Computing rolling pairwise covariances and correlations in the docs.

In [1]: df = DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

In [4]: covs = pd.rolling_cov(df[['A', 'B', 'C']], df[['B', 'C', 'D']], 5, pairwise=True)

In [5]: covs[df.index[-1]]
Out[5]:
          B         C         D
A  0.035310  0.326593 -0.505430
B  0.137748 -0.006888 -0.005383
C -0.006888  0.861040  0.020762
Series.iteritems() is now lazy (returns an iterator rather than a list). This was the documented behavior prior to 0.14. (GH6760)

Added nunique and value_counts functions to Index for counting unique elements. (GH6734)
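A minimal sketch (illustrative values):

import pandas as pd

idx = pd.Index(['a', 'b', 'a', 'c', 'a'])
idx.value_counts()  # a: 3, b: 1, c: 1
idx.nunique()       # 3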
stack and unstack now raise a ValueError when the level keyword refers to a non-unique item in the Index (previously raised a KeyError). (GH6738)
drop unused order argument from Series.sort; args are now in the same order as Series.order; add na_position arg to conform to Series.order (GH6847)

default sorting algorithm for Series.order is now quicksort, to conform with Series.sort (and numpy defaults)

add inplace keyword to Series.order/sort to make them inverses (GH6859)

DataFrame.sort now places NaNs at the beginning or end of the sort according to the na_position parameter. (GH3917)

accept TextFileReader in concat, which was affecting a common user idiom (GH6583); this was a regression from 0.13.1
Added factorize functions to Index and Series to get the indexer and unique values (GH7090)
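A quick sketch of factorize on a Series (made-up values): it returns an integer indexer plus the unique values, in order of first appearance.

import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
labels, uniques = s.factorize()
# labels  -> array([0, 1, 0, 2])
# uniques -> Index(['b', 'a', 'c'], dtype='object')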
describe on a DataFrame with a mix of Timestamp and string like objects returns a different Index (GH7088). Previously the index was unintentionally sorted.
Arithmetic operations with only bool dtypes now give a warning indicating that they are evaluated in Python space for +, -, and * operations and raise for all others (GH7011, GH6762, GH7015, GH7210)

x = pd.Series(np.random.rand(10) > 0.5)
y = True
x + y  # warning generated: should do x | y instead
x / y  # this raises because it doesn't make sense

NotImplementedError: operator '/' not implemented for bool dtypes
In HDFStore, select_as_multiple will always raise a KeyError, when a key or the selector is not found (GH6177)

df['col'] = value and df.loc[:, 'col'] = value are now completely equivalent; previously the .loc would not necessarily coerce the dtype of the resultant series (GH6149)

dtypes and ftypes now return a series with dtype=object on empty containers (GH5740)

df.to_csv will now return a string of the CSV data if neither a target path nor a buffer is provided (GH6061)
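A minimal sketch of the new return behaviour (illustrative frame):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
csv_text = df.to_csv()  # no path and no buffer, so the CSV comes back as a string
print(csv_text)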
pd.infer_freq() will now raise a TypeError if given an invalid Series/Index type (GH6407, GH6463)

A tuple passed to DataFrame.sort_index will be interpreted as the levels of the index, rather than requiring a list of tuples (GH4370)

all offset operations now return Timestamp types (rather than datetime), Business/Week frequencies were incorrect (GH4069)

to_excel now converts np.inf into a string representation, customizable by the inf_rep keyword argument (Excel has no native inf representation) (GH6782)
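A hedged sketch (the output filename is hypothetical, and an Excel writer engine such as openpyxl must be installed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.inf, -np.inf]})
df.to_excel('out.xlsx', inf_rep='inf')  # inf values are written as the string 'inf'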
Replace pandas.compat.scipy.scoreatpercentile with numpy.percentile (GH6810)

.quantile on a datetime[ns] series now returns Timestamp instead of np.datetime64 objects (GH6810)

change AssertionError to TypeError for invalid types passed to concat (GH6583)

Raise a TypeError when DataFrame is passed an iterator as the data argument (GH5357)
The default way of printing large DataFrames has changed. DataFrames exceeding max_rows and/or max_columns are now displayed in a centrally truncated view, consistent with the printing of a pandas.Series (GH5603).

In previous versions, a DataFrame was truncated once the dimension constraints were reached and an ellipse (...) signaled that part of the data was cut off.

In the current version, large DataFrames are centrally truncated, showing a preview of head and tail in both dimensions.

allow option 'truncate' for display.show_dimensions to only show the dimensions if the frame is truncated (GH6547).

The default for display.show_dimensions will now be truncate. This is consistent with how Series display length.

In [16]: dfd = pd.DataFrame(np.arange(25).reshape(-1, 5),
   ....:                    index=[0, 1, 2, 3, 4], columns=[0, 1, 2, 3, 4])

# show dimensions since this is truncated
In [17]: with pd.option_context('display.max_rows', 2, 'display.max_columns', 2,
   ....:                        'display.show_dimensions', 'truncate'):
   ....:     print(dfd)
   ....:
     0  ...   4
0    0  ...   4
..  ..  ...  ..
4   20  ...  24

[5 rows x 5 columns]

# will not show dimensions since it is not truncated
In [18]: with pd.option_context('display.max_rows', 10, 'display.max_columns', 40,
   ....:                        'display.show_dimensions', 'truncate'):
   ....:     print(dfd)
   ....:
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
Regression in the display of a MultiIndexed Series when display.max_rows is less than the length of the series (GH7101)

Fixed a bug in the HTML repr of a truncated Series or DataFrame not showing the class name with the large_repr set to 'info' (GH7105)

The verbose keyword in DataFrame.info(), which controls whether to shorten the info representation, is now None by default. This will follow the global setting in display.max_info_columns. The global setting can be overridden with verbose=True or verbose=False.

Fixed a bug with the info repr not honoring the display.max_info_columns setting (GH6939)

Offset/freq info is now in the Timestamp __repr__ (GH4553)
read_csv()/read_table() will now be noisier w.r.t invalid options rather than falling back to the PythonParser.

Raise ValueError when sep is specified with delim_whitespace=True in read_csv()/read_table() (GH6607)
Raise ValueError when engine='c' is specified with unsupported options in read_csv()/read_table() (GH6607)
Raise ValueError when fallback to the python parser causes options to be ignored (GH6607)
Produce ParserWarning on fallback to the python parser when no options are ignored (GH6607)
Translate sep='\s+' to delim_whitespace=True in read_csv()/read_table() if no other C-unsupported options are specified (GH6607)

More consistent behaviour for some groupby methods:
groupby head and tail now act more like filter rather than an aggregation:

In [19]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [20]: g = df.groupby('A')

In [21]: g.head(1)  # filters DataFrame
Out[21]:
   A  B
0  1  2
2  5  6

In [22]: g.apply(lambda x: x.head(1))  # used to simply fall-through
Out[22]:
     A  B
A
1 0  1  2
5 2  5  6

groupby head and tail respect column selection:

In [23]: g[['B']].head(1)
Out[23]:
   B
0  2
2  6
groupby nth now reduces by default; filtering can be achieved by passing as_index=False. With an optional dropna argument to ignore NaN. See the docs.

Reducing

In [24]: df = DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])

In [25]: g = df.groupby('A')

In [26]: g.nth(0)
Out[26]:
     B
A
1  NaN
5  6.0

# this is equivalent to g.first()
In [27]: g.nth(0, dropna='any')
Out[27]:
     B
A
1  4.0
5  6.0

# this is equivalent to g.last()
In [28]: g.nth(-1, dropna='any')
Out[28]:
     B
A
1  4.0
5  6.0

Filtering

In [29]: gf = df.groupby('A', as_index=False)

In [30]: gf.nth(0)
Out[30]:
   A    B
0  1  NaN
2  5  6.0

In [31]: gf.nth(0, dropna='any')
Out[31]:
   A    B
A
1  1  4.0
5  5  6.0
groupby will now not return the grouped column for non-cython functions (GH5610, GH5614, GH6732), as it is already the index

In [32]: df = DataFrame([[1, np.nan], [1, 4], [5, 6], [5, 8]], columns=['A', 'B'])

In [33]: g = df.groupby('A')

In [34]: g.count()
Out[34]:
   B
A
1  1
5  2

In [35]: g.describe()
Out[35]:
       B
   count  mean       std  min  25%  50%  75%  max
A
1    1.0   4.0       NaN  4.0  4.0  4.0  4.0  4.0
5    2.0   7.0  1.414214  6.0  6.5  7.0  7.5  8.0

passing as_index will leave the grouped column in-place (this is not a change in 0.14.0)

In [36]: df = DataFrame([[1, np.nan], [1, 4], [5, 6], [5, 8]], columns=['A', 'B'])

In [37]: g = df.groupby('A', as_index=False)

In [38]: g.count()
Out[38]:
   A  B
0  1  1
1  5  2

In [39]: g.describe()
Out[39]:
       A                                        B                            \
   count  mean  std  min  25%  50%  75%  max  count  mean       std  min  25%
0    2.0   1.0  0.0  1.0  1.0  1.0  1.0  1.0    1.0   4.0       NaN  4.0  4.0
1    2.0   5.0  0.0  5.0  5.0  5.0  5.0  5.0    2.0   7.0  1.414214  6.0  6.5

   50%  75%  max
0  4.0  4.0  4.0
1  7.0  7.5  8.0
Allow specification of a more complex groupby via pd.Grouper, such as grouping by a Time and a string field simultaneously. See the docs. (GH3794)

Better propagation/preservation of Series names when performing groupby operations:

SeriesGroupBy.agg will ensure that the name attribute of the original series is propagated to the result (GH6265).
If GroupBy.apply returns a named series, the name of the series will be kept as the name of the column index of the DataFrame returned by GroupBy.apply (GH6124). This facilitates DataFrame.stack operations where the name of the column index is used as the name of the inserted column containing the pivoted data.

The SQL reading and writing functions now support more database flavors through SQLAlchemy (GH2717, GH4163, GH5950, GH6292). All databases supported by SQLAlchemy can be used, such as PostgreSQL, MySQL, Oracle, Microsoft SQL server (see documentation of SQLAlchemy on included dialects).

The functionality of providing DBAPI connection objects will only be supported for sqlite3 in the future. The 'mysql' flavor is deprecated.
The new functions read_sql_query() and read_sql_table() are introduced. The function read_sql() is kept as a convenience wrapper around the other two and will delegate to the specific function depending on the provided input (database table name or sql query).

In practice, you have to provide a SQLAlchemy engine to the sql functions. To connect with SQLAlchemy you use the create_engine() function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to. For an in-memory sqlite database:

In [40]: from sqlalchemy import create_engine

# Create your connection.
In [41]: engine = create_engine('sqlite:///:memory:')

This engine can then be used to write or read data to/from this database:

In [42]: df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

In [43]: df.to_sql('db_table', engine, index=False)

You can read data from a database by specifying the table name:

In [44]: pd.read_sql_table('db_table', engine)
Out[44]:
   A  B
0  1  a
1  2  b
2  3  c

or by specifying a sql query:

In [45]: pd.read_sql_query('SELECT * FROM db_table', engine)
Out[45]:
   A  B
0  1  a
1  2  b
2  3  c
Some other enhancements to the sql functions include:

support for writing the index, controllable with the index keyword (default is True).
specify the column label to use when writing the index with index_label.
specify string columns to parse as datetimes with the parse_dates keyword in read_sql_query() and read_sql_table().

Warning

Some of the existing functions or function aliases have been deprecated and will be removed in future versions. This includes: tquery, uquery, read_frame, frame_query, write_frame.

Warning

The support for the 'mysql' flavor when using DBAPI connection objects has been deprecated. MySQL will be further supported with SQLAlchemy engines (GH6900).
MultiIndexing Using Slicers¶

In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels, they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

See the docs. See also issues (GH6134, GH4036, GH3057, GH2598, GH5641, GH7106)

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.

You should do this:

df.loc[(slice('A1','A3'),.....),:]

rather than this:

df.loc[(slice('A1','A3'),.....)]

Warning

You will need to make sure that the selection axes are fully lexsorted!
In [46]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [47]: index = MultiIndex.from_product([mklbl('A', 4),
   ....:                                  mklbl('B', 2),
   ....:                                  mklbl('C', 4),
   ....:                                  mklbl('D', 2)])

In [48]: columns = MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
   ....:                                   ('b', 'foo'), ('b', 'bah')],
   ....:                                  names=['lvl0', 'lvl1'])

In [49]: df = DataFrame(np.arange(len(index) * len(columns)).reshape((len(index), len(columns))),
   ....:                index=index,
   ....:                columns=columns).sort_index().sort_index(axis=1)

In [50]: df
Out[50]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic multi-index slicing using slices, lists, and labels.

In [51]: df.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
Out[51]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use a pd.IndexSlice to shortcut the creation of these slices

In [52]: idx = pd.IndexSlice

In [53]: df.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[53]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [54]: df.loc['A1', (slice(None), 'foo')]
Out[54]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
      D1   84   86
   C3 D0   88   90
...       ...  ...
B1 C0 D1  100  102
   C1 D0  104  106
      D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [55]: df.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[55]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [56]: mask = df[('a', 'foo')] > 200

In [57]: df.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[57]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [58]: df.loc(axis=0)[:, :, ['C1', 'C3']]
Out[58]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
         D1   45   44   47   46
      C3 D0   57   56   59   58
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore you can set the values using these methods

In [59]: df2 = df.copy()

In [60]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10

In [61]: df2
Out[61]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0  -10  -10  -10  -10
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can use a right-hand-side of an alignable object as well.

In [62]: df2 = df.copy()

In [63]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000

In [64]: df2
Out[64]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
         D1      21      20      23      22
      C3 D0   25000   24000   27000   26000
...             ...     ...     ...     ...
A3 B1 C0 D1     229     228     231     230
      C1 D0  233000  232000  235000  234000
         D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]

Plotting¶
Hexagonal bin plots from DataFrame.plot with kind='hexbin' (GH5478), see the docs.

DataFrame.plot and Series.plot now support area plots by specifying kind='area' (GH6656), see the docs

Pie plots from Series.plot and DataFrame.plot with kind='pie' (GH6976), see the docs.

Plotting with Error Bars is now supported in the .plot method of DataFrame and Series objects (GH3796, GH6834), see the docs.

DataFrame.plot and Series.plot now support a table keyword for plotting a matplotlib.Table, see the docs. The table keyword can receive the following values.

False: Do nothing (default).
True: Draw a table using the DataFrame or Series calling the plot method. Data will be transposed to meet matplotlib's default layout.
DataFrame or Series: Draw a matplotlib.table using the passed data. The data will be drawn as displayed in the print method (not transposed automatically). Also, a helper function pandas.tools.plotting.table is added to create a table from DataFrame and Series, and add it to an matplotlib.Axes.

plot(legend='reverse') will now reverse the order of legend labels for most plot kinds. (GH6014)

Line plots and area plots can be stacked by stacked=True (GH6656)

The following keywords are now acceptable for DataFrame.plot() with kind='bar' and kind='barh':

Because of the default align value change, coordinates of bar plots are now located on integer values (0.0, 1.0, 2.0 ...). This is intended to make bar plots be located on the same coordinates as line plots. However, bar plots may differ unexpectedly when you manually adjust the bar location or drawing area, such as using set_xlim, set_ylim, etc. In these cases, please modify your script to fit the new coordinates.
The parallel_coordinates() function now takes argument color instead of colors. A FutureWarning is raised to alert that the old colors argument will not be supported in a future release. (GH6956)

The parallel_coordinates() and andrews_curves() functions now take positional argument frame instead of data. A FutureWarning is raised if the old data argument is used by name. (GH6956)

DataFrame.boxplot() now supports the layout keyword (GH6769)

DataFrame.boxplot() has a new keyword argument, return_type. It accepts 'dict', 'axes', or 'both', in which case a namedtuple with the matplotlib axes and a dict of matplotlib Lines is returned.
There are prior version deprecations that are taking effect as of 0.14.0.

Remove DateRange in favor of DatetimeIndex (GH6816)
Remove the column keyword from DataFrame.sort (GH4370)
Remove the precision keyword from set_eng_float_format() (GH395)
Remove the force_unicode keyword from DataFrame.to_string(), DataFrame.to_latex(), and DataFrame.to_html(); these functions encode in unicode by default (GH2224, GH2225)
Remove the nanRep keyword from DataFrame.to_csv() and DataFrame.to_string() (GH275)
Remove the unique keyword from HDFStore.select_column() (GH3256)
Remove the inferTimeRule keyword from Timestamp.offset() (GH391)
Remove the name keyword from get_data_yahoo() and get_data_google() (commit b921d1a)
Remove the offset keyword from the DatetimeIndex constructor (commit 3136390)
Remove time_rule from several rolling-moment statistical functions, such as rolling_sum() (GH1042)
Remove neg - boolean operations on numpy arrays in favor of inv ~, as this is going to be deprecated in numpy 1.9 (GH6960)

The pivot_table()/DataFrame.pivot_table() and crosstab() functions now take arguments index and columns instead of rows and cols. A FutureWarning is raised to alert that the old rows and cols arguments will not be supported in a future release (GH5505)
The DataFrame.drop_duplicates() and DataFrame.duplicated() methods now take argument subset instead of cols to better align with DataFrame.dropna(). A FutureWarning is raised to alert that the old cols arguments will not be supported in a future release (GH6680)

The DataFrame.to_csv() and DataFrame.to_excel() functions now take argument columns instead of cols. A FutureWarning is raised to alert that the old cols arguments will not be supported in a future release (GH6645)

Indexers will warn FutureWarning when used with a scalar indexer and a non-floating point Index (GH4892, GH6960)

# non-floating point indexes can only be indexed by integers / labels
In [1]: Series(1, np.arange(5))[3.0]
pandas/core/index.py:469: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[1]: 1

In [2]: Series(1, np.arange(5)).iloc[3.0]
pandas/core/index.py:469: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[2]: 1

In [3]: Series(1, np.arange(5)).iloc[3.0:4]
pandas/core/index.py:527: FutureWarning: slice indexers when using iloc should be integers and not floating point
Out[3]:
3    1
dtype: int64

# these are Float64Indexes, so integer or floating point is acceptable
In [4]: Series(1, np.arange(5.))[3]
Out[4]: 1

In [5]: Series(1, np.arange(5.))[3.0]
Out[6]: 1
Numpy 1.9 compat w.r.t. deprecation warnings (GH6960)

Panel.shift() now has a function signature that matches DataFrame.shift(). The old positional argument lags has been changed to a keyword argument periods with a default value of 1. A FutureWarning is raised if the old argument lags is used by name. (GH6910)

The order keyword argument of factorize() will be removed. (GH6926).

Remove the copy keyword from DataFrame.xs(), Panel.major_xs(), Panel.minor_xs(). A view will be returned if possible, otherwise a copy will be made. Previously the user could think that copy=False would ALWAYS return a view. (GH6894)
The parallel_coordinates() function now takes argument color instead of colors. A FutureWarning is raised to alert that the old colors argument will not be supported in a future release. (GH6956)

The parallel_coordinates() and andrews_curves() functions now take positional argument frame instead of data. A FutureWarning is raised if the old data argument is used by name. (GH6956)

The support for the 'mysql' flavor when using DBAPI connection objects has been deprecated. MySQL will be further supported with SQLAlchemy engines (GH6900).

The following io.sql functions have been deprecated: tquery, uquery, read_frame, frame_query, write_frame.

The percentile_width keyword argument in describe() has been deprecated. Use the percentiles keyword instead, which takes a list of percentiles to display. The default output is unchanged.

The default return type of boxplot() will change from a dict to a matplotlib Axes in a future release. You can use the future behavior now by passing return_type='axes' to boxplot.
DataFrame and Series will create a MultiIndex object if passed a tuples dict, see the docs (GH3323)

In [65]: Series({('a', 'b'): 1, ('a', 'a'): 0,
   ....:         ('a', 'c'): 2, ('b', 'a'): 3, ('b', 'b'): 4})
Out[65]:
a  a    0
   b    1
   c    2
b  a    3
   b    4
dtype: int64

In [66]: DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:            ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:            ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:            ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:            ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
Out[66]:
       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
Added the sym_diff
method to Index
(GH5543)
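A quick sketch (sym_diff was later renamed symmetric_difference):
import pandas as pd
i1 = pd.Index([1, 2, 3, 4])
i2 = pd.Index([2, 3, 5])
i1.sym_diff(i2)  # elements appearing in exactly one index -> Index([1, 4, 5])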
DataFrame.to_latex
now takes a longtable keyword, which if True will return a table in a longtable environment. (GH6617)
Add option to turn off escaping in DataFrame.to_latex
(GH6472)
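A sketch combining both to_latex options (the column values are hypothetical):
import pandas as pd
df = pd.DataFrame({'a': ['x_1', 'y_2']})
# longtable environment instead of tabular; leave underscores unescaped
tex = df.to_latex(longtable=True, escape=False)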
pd.read_clipboard
will, if the keyword sep
is unspecified, try to detect data copied from a spreadsheet and parse accordingly. (GH6223)
Joining a singly-indexed DataFrame with a multi-indexed DataFrame (GH3662)
See the docs. Joining multi-index DataFrames on both the left and right is not yet supported.
In [67]: household = DataFrame(dict(household_id = [1,2,3], ....: male = [0,1,0], ....: wealth = [196087.3,316478.7,294750]), ....: columns = ['household_id','male','wealth'] ....: ).set_index('household_id') ....: In [68]: household Out[68]: male wealth household_id 1 0 196087.3 2 1 316478.7 3 0 294750.0 In [69]: portfolio = DataFrame(dict(household_id = [1,2,2,3,3,3,4], ....: asset_id = ["nl0000301109","nl0000289783","gb00b03mlx29", ....: "gb00b03mlx29","lu0197800237","nl0000289965",np.nan], ....: name = ["ABN Amro","Robeco","Royal Dutch Shell","Royal Dutch Shell", ....: "AAB Eastern Europe Equity Fund","Postbank BioTech Fonds",np.nan], ....: share = [1.0,0.4,0.6,0.15,0.6,0.25,1.0]), ....: columns = ['household_id','asset_id','name','share'] ....: ).set_index(['household_id','asset_id']) ....: In [70]: portfolio Out[70]: name share household_id asset_id 1 nl0000301109 ABN Amro 1.00 2 nl0000289783 Robeco 0.40 gb00b03mlx29 Royal Dutch Shell 0.60 3 gb00b03mlx29 Royal Dutch Shell 0.15 lu0197800237 AAB Eastern Europe Equity Fund 0.60 nl0000289965 Postbank BioTech Fonds 0.25 4 NaN NaN 1.00 In [71]: household.join(portfolio, how='inner') Out[71]: male wealth name \ household_id asset_id 1 nl0000301109 0 196087.3 ABN Amro 2 nl0000289783 1 316478.7 Robeco gb00b03mlx29 1 316478.7 Royal Dutch Shell 3 gb00b03mlx29 0 294750.0 Royal Dutch Shell lu0197800237 0 294750.0 AAB Eastern Europe Equity Fund nl0000289965 0 294750.0 Postbank BioTech Fonds share household_id asset_id 1 nl0000301109 1.00 2 nl0000289783 0.40 gb00b03mlx29 0.60 3 gb00b03mlx29 0.15 lu0197800237 0.60 nl0000289965 0.25
quotechar
, doublequote
, and escapechar
can now be specified when using DataFrame.to_csv
(GH5414, GH4528)
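A sketch of the new to_csv quoting controls (file name hypothetical):
import pandas as pd
df = pd.DataFrame({'text': ['a "quoted" value', 'plain']})
# escape embedded quote characters instead of doubling them
df.to_csv('out.csv', quotechar='"', doublequote=False, escapechar='\\')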
Partially sort by only the specified levels of a MultiIndex with the sort_remaining
boolean kwarg. (GH3984)
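A sketch of partial sorting, assuming the sortlevel spelling of this era (later versions expose the same kwarg through sort_index(level=...)):
import pandas as pd
idx = pd.MultiIndex.from_tuples([(1, 'b'), (0, 'a'), (1, 'a'), (0, 'b')])
s = pd.Series(range(4), index=idx)
# order the first level only; leave the order within each group untouched
s.sortlevel(0, sort_remaining=False)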
Added to_julian_date
to Timestamp
and DatetimeIndex
. The Julian Date is used primarily in astronomy and represents the number of days from noon, January 1, 4713 BC. Because nanoseconds are used to define the time in pandas, the actual range of dates that you can use is 1678 AD to 2262 AD. (GH4041)
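Sketch:
import pandas as pd
pd.Timestamp('2014-01-01').to_julian_date()  # 2456658.5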
DataFrame.to_stata
will now check data for compatibility with Stata data types and will upcast when needed. When it is not possible to losslessly upcast, a warning is issued (GH6327)
DataFrame.to_stata
and StataWriter
will accept keyword arguments time_stamp and data_label which allow the time stamp and dataset label to be set when creating a file. (GH6545)
pandas.io.gbq
now handles reading unicode strings properly. (GH5940)
Holidays Calendars are now available and can be used with the CustomBusinessDay
offset (GH6719)
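A minimal sketch pairing a calendar with the offset (using the USFederalHolidayCalendar that ships in pandas.tseries.holiday):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
pd.Timestamp('2014-07-03') + bday_us  # skips July 4th -> Timestamp('2014-07-07')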
Float64Index
is now backed by a float64
dtype ndarray instead of an object
dtype array (GH6471).
Implemented Panel.pct_change
(GH6904)
Added how
option to rolling-moment functions to dictate how to handle resampling; rolling_max()
defaults to max, rolling_min()
defaults to min, and all others default to mean (GH6297)
CustomBusinessMonthBegin
and CustomBusinessMonthEnd
are now available (GH6866)
Series.quantile()
and DataFrame.quantile()
now accept an array of quantiles.
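Sketch:
import pandas as pd
s = pd.Series(range(100))
s.quantile([0.25, 0.5, 0.75])  # a Series indexed by the requested quantiles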
describe()
now accepts an array of percentiles to include in the summary statistics (GH4196)
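Sketch:
import numpy as np
import pandas as pd
s = pd.Series(np.random.randn(1000))
s.describe(percentiles=[0.05, 0.25, 0.75, 0.95])  # extra rows in the summary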
pivot_table
can now accept Grouper
by index
and columns
keywords (GH6913)
In [72]: import datetime In [73]: df = DataFrame({ ....: 'Branch' : 'A A A A A B'.split(), ....: 'Buyer': 'Carl Mark Carl Carl Joe Joe'.split(), ....: 'Quantity': [1, 3, 5, 1, 8, 1], ....: 'Date' : [datetime.datetime(2013,11,1,13,0), datetime.datetime(2013,9,1,13,5), ....: datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0), ....: datetime.datetime(2013,11,1,20,0), datetime.datetime(2013,10,2,10,0)], ....: 'PayDay' : [datetime.datetime(2013,10,4,0,0), datetime.datetime(2013,10,15,13,5), ....: datetime.datetime(2013,9,5,20,0), datetime.datetime(2013,11,2,10,0), ....: datetime.datetime(2013,10,7,20,0), datetime.datetime(2013,9,5,10,0)]}) ....: In [74]: df Out[74]: Branch Buyer Date PayDay Quantity 0 A Carl 2013-11-01 13:00:00 2013-10-04 00:00:00 1 1 A Mark 2013-09-01 13:05:00 2013-10-15 13:05:00 3 2 A Carl 2013-10-01 20:00:00 2013-09-05 20:00:00 5 3 A Carl 2013-10-02 10:00:00 2013-11-02 10:00:00 1 4 A Joe 2013-11-01 20:00:00 2013-10-07 20:00:00 8 5 B Joe 2013-10-02 10:00:00 2013-09-05 10:00:00 1 In [75]: pivot_table(df, index=Grouper(freq='M', key='Date'), ....: columns=Grouper(freq='M', key='PayDay'), ....: values='Quantity', aggfunc=np.sum) ....: Out[75]: PayDay 2013-09-30 2013-10-31 2013-11-30 Date 2013-09-30 NaN 3.0 NaN 2013-10-31 6.0 NaN 1.0 2013-11-30 NaN 9.0 NaN
Arrays of strings can be wrapped to a specified width (str.wrap
) (GH6999)
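Sketch:
import pandas as pd
s = pd.Series(['a fairly long line of text to be wrapped'])
s.str.wrap(12)  # newlines inserted so no line exceeds 12 characters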
Add nsmallest()
and Series.nlargest()
methods to Series; see the docs (GH3960)
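Sketch:
import pandas as pd
s = pd.Series([7, 3, 10, 2, 5])
s.nlargest(3)   # 10, 7, 5, keeping their original index labels
s.nsmallest(2)  # 2, 3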
PeriodIndex fully supports partial string indexing like DatetimeIndex (GH7043)
In [76]: prng = period_range('2013-01-01 09:00', periods=100, freq='H') In [77]: ps = Series(np.random.randn(len(prng)), index=prng) In [78]: ps Out[78]: 2013-01-01 09:00 0.015696 2013-01-01 10:00 -2.242685 2013-01-01 11:00 1.150036 2013-01-01 12:00 0.991946 2013-01-01 13:00 0.953324 2013-01-01 14:00 -2.021255 2013-01-01 15:00 -0.334077 ... 2013-01-05 06:00 0.566534 2013-01-05 07:00 0.503592 2013-01-05 08:00 0.285296 2013-01-05 09:00 0.484288 2013-01-05 10:00 1.363482 2013-01-05 11:00 -0.781105 2013-01-05 12:00 -0.468018 Freq: H, Length: 100, dtype: float64 In [79]: ps['2013-01-02'] Out[79]: 2013-01-02 00:00 0.553439 2013-01-02 01:00 1.318152 2013-01-02 02:00 -0.469305 2013-01-02 03:00 0.675554 2013-01-02 04:00 -1.817027 2013-01-02 05:00 -0.183109 2013-01-02 06:00 1.058969 ... 2013-01-02 17:00 0.076200 2013-01-02 18:00 -0.566446 2013-01-02 19:00 0.036142 2013-01-02 20:00 -2.074978 2013-01-02 21:00 0.247792 2013-01-02 22:00 -0.897157 2013-01-02 23:00 -0.136795 Freq: H, Length: 24, dtype: float64
read_excel
can now read milliseconds in Excel dates and times with xlrd >= 0.9.3. (GH5945)
pd.stats.moments.rolling_var
now uses Welford’s method for increased numerical stability (GH6817)
pd.expanding_apply and pd.rolling_apply now take args and kwargs that are passed on to the func (GH6289)
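A sketch using the top-level helpers of this era (later pandas replaces them with .rolling(...).apply(...)); the constant forwarded here is hypothetical:
import numpy as np
import pandas as pd
s = pd.Series(np.arange(10.0))
# extra positional/keyword arguments are forwarded to the window function
pd.rolling_apply(s, 3, lambda w, k: w.sum() + k, args=(100,))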
DataFrame.rank()
now has a percentage rank option (GH5971)
Series.rank()
now has a percentage rank option (GH5971)
Series.rank()
and DataFrame.rank()
now accept method='dense'
for ranks without gaps (GH6514)
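A sketch of both ranking options:
import pandas as pd
s = pd.Series([1, 2, 2, 3])
s.rank(pct=True)        # ranks rescaled into (0, 1]
s.rank(method='dense')  # 1, 2, 2, 3 -- no gap after the tie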
Support passing encoding
with xlwt (GH3710)
Refactor Block classes removing Block.items attributes to avoid duplication in item handling (GH6745, GH6988).
Testing statements updated to use specialized asserts (GH6175)
DatetimeIndex
to floating ordinals using DatetimeConverter
(GH6636)DataFrame.shift
(GH5609)CustomBusinessDay
(GH6584)DataFrame.from_records
when reading a specified number of rows from an iterable (GH6700)take_2d
(GH6749)GroupBy.count()
is now implemented in Cython and is much faster for large numbers of groups (GH7016).
There are no experimental changes in 0.14.0
Bug Fixes¶pd.DataFrame.sort_index
where mergesort wasn’t stable when ascending=False
(GH6399)pd.tseries.frequencies.to_offset
when argument has leading zeroes (GH6391)Timestamp
/ to_datetime
for current year (GH5958).xs
with a Series multiindex (GH6258, GH5684)eval
where type-promotion failed for large expressions (GH6205)inplace=True
(GH6281)HDFStore.remove
now handles start and stop (GH6177)HDFStore.select_as_multiple
handles start and stop the same way as select
(GH6177)HDFStore.select_as_coordinates
and select_column
works with a where
clause that results in filters (GH6177)agg
with a single function and a mixed-type frame (GH6337)DataFrame.replace()
when passing a non- bool
to_replace
argument (GH6332)groups
was missing) (GH3881)pd.eval
when parsing strings with possible tokens like '&'
(GH6351)-inf
in Panels when dividing by integer 0 (GH6178)DataFrame.shift
with axis=1
was raising (GH6371)nosetests -A disabled
) (GH6048).DataFrame.replace()
when passing a nested dict
that contained keys not in the values to be replaced (GH6342)str.match
ignored the na flag (GH6609).Series.get
, was using a buggy access method (GH6383)where=[('date', '>=', datetime(2013,1,1)), ('date', '<=', datetime(2014,1,1))]
(GH6313)DataFrame.dropna
with duplicate indices (GH6355)Float64Index
with nans not comparing correctly (GH6401)eval
/query
expressions with strings containing the @
character will now work (GH6366).Series.reindex
when specifying a method
with some nan values was inconsistent (noted on a resample) (GH6418)DataFrame.replace()
where nested dicts were erroneously depending on the order of dictionary keys and values (GH5338).sym_diff
on Index
objects with NaN
values (GH6444)MultiIndex.from_product
with a DatetimeIndex
as input (GH6439)str.extract
when passed a non-default index (GH6348)str.split
when passed pat=None
and n=1
(GH6466)io.data.DataReader
when passed "F-F_Momentum_Factor"
and data_source="famafrench"
(GH6460)sum
of a timedelta64[ns]
series (GH6462)resample
with a timezone and certain offsets (GH6397)iat/iloc
with duplicate indices on a Series (GH6493)read_html
where nan’s were incorrectly being used to indicate missing values in text. Should use the empty string for consistency with the rest of pandas (GH5129).read_html
tests where redirected invalid URLs would make one test fail (GH6445)..loc
on non-unique indices (GH6504)datetime64
non-ns dtypes in Series creation (GH6529).names
attribute of MultiIndexes passed to set_index
are now preserved (GH6459)..loc
on mixed integer Indexes (GH6546)pd.read_stata
which would use the wrong data types and missing values (GH6327)DataFrame.to_stata
that lead to data loss in certain cases, and could be exported using the wrong data types and missing values (GH6335)StataWriter
replaces missing values in string columns by empty string (GH6802)Timestamp
addition/subtraction (GH6543)IndexError
exceptions (GH6536, GH6551)Series.quantile
raising on an object
dtype (GH6555).xs
with a nan
in level when dropped (GH6574)method='bfill/ffill'
and datetime64[ns]
dtype (GH6587)Series.pop
(GH6600)iloc
indexing when positional indexer matched Int64Index
of the corresponding axis and no reordering happened (GH6612)fillna
with limit
and value
specifiedDataFrame.to_stata
when columns have non-string names (GH4558)np.compress
, surfaced in (GH6658)DataFrame.to_stata
which incorrectly handles nan values and ignores with_index
keyword argument (GH6685)how=None
resample freq is the same as the axis frequency (GH5955)obj.blocks
on sparse containers dropping all but the last items of same for dtype (GH6748)NaT (NaTType)
(GH4606)DataFrame.replace()
where regex metacharacters were being treated as regexs even when regex=False
(GH6777)..index
(GH6785)make clean
(GH6768)HDFStore
(GH6166)DataFrame._reduce
where non bool-like (0/1) integers were being converted into bools. (GH6806)fillna
and a Series on datetime-like (GH6344)np.timedelta64
to DatetimeIndex
with timezone outputs incorrect results (GH6818)DataFrame.replace()
where changing a dtype through replacement would only replace the first occurrence of a value (GH6689)Period
construction (GH5332)Series.__unicode__
when max_rows=None
and the Series has more than 1000 rows. (GH6863)groupby.get_group
where a datelike wasn’t always accepted (GH5267)groupby.get_group
created by TimeGrouper
raises AttributeError
(GH6914)DatetimeIndex.tz_localize
and DatetimeIndex.tz_convert
converting NaT
incorrectly (GH5546)NaT
(GH6873)Series.str.extract
where the resulting Series
from a single group match wasn’t renamed to the group nameDataFrame.to_csv
where setting index=False
ignored the header
kwarg (GH6186)DataFrame.plot
and Series.plot
, where the legend behaved inconsistently when plotting to the same axes repeatedly (GH6678)__finalize__
/ bug in merge not finalizing (GH6923, GH6927)TextFileReader
in concat
, which was affecting a common user idiom (GH6583)delim_whitespace=True
and \r
-delimited linesSeries.rank
and DataFrame.rank
that caused small floats (<1e-13) to all receive the same rank (GH6886)DataFrame.apply
with functions that used *args or **kwargs and returned an empty result (GH6952)Panel.shift
to NDFrame.slice_shift
and fixed to respect multiple dtypes. (GH6959)subplots=True
in DataFrame.plot
with only a single column raises TypeError
, and Series.plot
raises AttributeError
(GH6951)DataFrame.plot
draws unnecessary axes when enabling subplots
and kind=scatter
(GH6951)read_csv
from a filesystem with non-utf-8 encoding (GH6807)iloc
when setting / aligning (GH6766)groupby.plot
when using a Float64Index
(GH7025)parallel_coordinates
and radviz
where reordering of class column caused possible color/class mismatch (GH6956)radviz
and andrews_curves
where multiple values of ‘color’ were being passed to plotting method (GH6956)Float64Index.isin()
where containing nan
s would make indices claim that they contained all the things (GH7066).DataFrame.boxplot
where it failed to use the axis passed as the ax
argument (GH3578)XlsxWriter
and XlwtWriter
implementations that resulted in datetime columns being formatted without the time (GH7075)read_fwf()
treats None
in colspec
like regular python slices. It now reads from the beginning or until the end of the line when colspec
contains a None
(previously raised a TypeError
)_is_view
property to NDFrame
to correctly predict views; mark is_copy
on xs
only if it's an actual copy (and not a view) (GH7084)dayfirst=True
(GH5917)MultiIndex.from_arrays
created from DatetimeIndex
doesn’t preserve freq
and tz
(GH7090)unstack
raises ValueError
when MultiIndex
contains PeriodIndex
(GH4342)boxplot
and hist
draws unnecessary axes (GH6769)groupby.nth()
for out-of-bounds indexers (GH6621)quantile
with datetime values (GH6965)DataFrame.set_index
, reindex
and pivot
don’t preserve DatetimeIndex
and PeriodIndex
attributes (GH3950, GH5878, GH6631)MultiIndex.get_level_values
doesn’t preserve DatetimeIndex
and PeriodIndex
attributes (GH7092)Groupby
doesn’t preserve tz
(GH3950)PeriodIndex
partial string slicing (GH6716)DatetimeIndex
specifying freq
raises ValueError
when passed value is too short (GH7098)PeriodIndex
string slicing with out of bounds values (GH5407)isnull
when applied to 0-dimensional object arrays (GH7176)query
/eval
where global constants were not looked up correctly (GH7178)iloc
and a multi-axis tuple indexer (GH7189)
v0.13.1 (February 3, 2014)¶This is a minor release from 0.13.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.
Highlights include:
infer_datetime_format
keyword to read_csv/to_datetime
to allow speedups for homogeneously formatted datetimes.
Enhanced Panel apply() method.
Warning
0.13.1 fixes a bug that was caused by a combination of having numpy < 1.8, and doing chained assignment on a string-like array. Please review the docs, chained indexing can have unexpected results and should generally be avoided.
This would previously segfault:
In [1]: df = DataFrame(dict(A = np.array(['foo','bar','bah','foo','bar']))) In [2]: df['A'].iloc[0] = np.nan In [3]: df Out[3]: A 0 NaN 1 bar 2 bah 3 foo 4 bar
The recommended way to do this type of assignment is:
In [4]: df = DataFrame(dict(A = np.array(['foo','bar','bah','foo','bar']))) In [5]: df.loc[0,'A'] = np.nan In [6]: df Out[6]: A 0 NaN 1 bar 2 bah 3 foo 4 bar
Output Formatting Enhancements¶
df.info() view now display dtype info per column (GH5682)
df.info() now honors the option max_info_rows
, to disable null counts for large frames (GH5974)
In [7]: max_info_rows = pd.get_option('max_info_rows') In [8]: df = DataFrame(dict(A = np.random.randn(10), ...: B = np.random.randn(10), ...: C = date_range('20130101',periods=10))) ...: In [9]: df.iloc[3:6,[0,2]] = np.nan
# set to not display the null counts In [10]: pd.set_option('max_info_rows',0) In [11]: df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 10 entries, 0 to 9 Data columns (total 3 columns): A float64 B float64 C datetime64[ns] dtypes: datetime64[ns](1), float64(2) memory usage: 320.0 bytes
# this is the default (same as in 0.13.0) In [12]: pd.set_option('max_info_rows',max_info_rows) In [13]: df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 10 entries, 0 to 9 Data columns (total 3 columns): A 7 non-null float64 B 10 non-null float64 C 7 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(2) memory usage: 320.0 bytes
Add show_dimensions
display option for the new DataFrame repr to control whether the dimensions print.
In [14]: df = DataFrame([[1, 2], [3, 4]]) In [15]: pd.set_option('show_dimensions', False) In [16]: df Out[16]: 0 1 0 1 2 1 3 4 In [17]: pd.set_option('show_dimensions', True) In [18]: df Out[18]: 0 1 0 1 2 1 3 4 [2 rows x 2 columns]
The ArrayFormatter
for datetime
and timedelta64
now intelligently limits precision based on the values in the array (GH3401)
Previously output might look like:
age today diff 0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00
Now the output looks like:
In [19]: df = DataFrame([ Timestamp('20010101'), ....: Timestamp('20040601') ], columns=['age']) ....: In [20]: df['today'] = Timestamp('20130419') In [21]: df['diff'] = df['today']-df['age'] In [22]: df Out[22]: age today diff 0 2001-01-01 2013-04-19 4491 days 1 2004-06-01 2013-04-19 3244 days [2 rows x 3 columns]
Add -NaN
and -nan
to the default set of NA values (GH5952). See NA Values.
Added Series.str.get_dummies
vectorized string method (GH6021), to extract dummy/indicator variables for separated string columns:
In [23]: s = Series(['a', 'a|b', np.nan, 'a|c']) In [24]: s.str.get_dummies(sep='|') Out[24]: a b c 0 1 0 0 1 1 1 0 2 0 0 0 3 1 0 1 [4 rows x 3 columns]
Added the NDFrame.equals()
method to compare whether two NDFrames are equal, i.e. have equal axes, dtypes, and values. Added the array_equivalent
function to compare if two ndarrays are equal. NaNs in identical locations are treated as equal. (GH5283) See also the docs for a motivating example.
In [25]: df = DataFrame({'col':['foo', 0, np.nan]}) In [26]: df2 = DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0]) In [27]: df.equals(df2) Out[27]: False In [28]: df.equals(df2.sort_index()) Out[28]: True In [29]: import pandas.core.common as com In [30]: com.array_equivalent(np.array([0, np.nan]), np.array([0, np.nan])) Out[30]: True In [31]: np.array_equal(np.array([0, np.nan]), np.array([0, np.nan])) Out[31]: False
DataFrame.apply
will use the reduce
argument to determine whether a Series
or a DataFrame
should be returned when the DataFrame
is empty (GH6007).
Previously, calling DataFrame.apply
on an empty DataFrame
would return either a DataFrame
if there were no columns, or the function being applied would be called with an empty Series
to guess whether a Series
or DataFrame
should be returned:
In [32]: def applied_func(col): ....: print("Apply function being called with: ", col) ....: return col.sum() ....: In [33]: empty = DataFrame(columns=['a', 'b']) In [34]: empty.apply(applied_func) Apply function being called with: Series([], Length: 0, dtype: float64) Out[34]: a NaN b NaN Length: 2, dtype: float64
Now, when apply
is called on an empty DataFrame
: if the reduce
argument is True
a Series
will be returned; if it is False
a DataFrame
will be returned, and if it is None
(the default) the function being applied will be called with an empty Series to try to guess the return type.
In [35]: empty.apply(applied_func, reduce=True) Out[35]: a NaN b NaN Length: 2, dtype: float64 In [36]: empty.apply(applied_func, reduce=False) Out[36]: Empty DataFrame Columns: [a, b] Index: [] [0 rows x 2 columns]
There are no announced changes in 0.13 or prior that are taking effect as of 0.13.1
Deprecations¶There are no deprecations of prior behavior in 0.13.1
Enhancements¶pd.read_csv
and pd.to_datetime
learned a new infer_datetime_format
keyword which greatly improves parsing perf in many cases. Thanks to @lexual for suggesting and @danbirken for rapidly implementing. (GH5490, GH6021)
If parse_dates
is enabled and this flag is set, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
# Try to infer the format for the index column
df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
                 infer_datetime_format=True)
date_format
and datetime_format
keywords can now be specified when writing to excel
files (GH4133)
MultiIndex.from_product
convenience function for creating a MultiIndex from the cartesian product of a set of iterables (GH6055):
In [37]: shades = ['light', 'dark'] In [38]: colors = ['red', 'green', 'blue'] In [39]: MultiIndex.from_product([shades, colors], names=['shade', 'color']) Out[39]: MultiIndex(levels=[['dark', 'light'], ['blue', 'green', 'red']], labels=[[1, 1, 1, 0, 0, 0], [2, 1, 0, 2, 1, 0]], names=['shade', 'color'])
Panel apply()
will work on non-ufuncs. See the docs.
In [40]: import pandas.util.testing as tm In [41]: panel = tm.makePanel(5) In [42]: panel Out[42]: <class 'pandas.core.panel.Panel'> Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis) Items axis: ItemA to ItemC Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00 Minor_axis axis: A to D In [43]: panel['ItemA'] Out[43]: A B C D 2000-01-03 0.694103 1.893534 -1.735349 -0.850346 2000-01-04 0.678630 0.639633 1.210384 1.176812 2000-01-05 0.239556 -0.962029 0.797435 -0.524336 2000-01-06 0.151227 -2.085266 -0.379811 0.700908 2000-01-07 0.816127 1.930247 0.702562 0.984188 [5 rows x 4 columns]
Specifying an apply
that operates on a Series (to return a single element)
In [44]: panel.apply(lambda x: x.dtype, axis='items') Out[44]: A B C D 2000-01-03 float64 float64 float64 float64 2000-01-04 float64 float64 float64 float64 2000-01-05 float64 float64 float64 float64 2000-01-06 float64 float64 float64 float64 2000-01-07 float64 float64 float64 float64 [5 rows x 4 columns]
A similar reduction type operation
In [45]: panel.apply(lambda x: x.sum(), axis='major_axis') Out[45]: ItemA ItemB ItemC A 2.579643 3.062757 0.379252 B 1.416120 -1.960855 0.923558 C 0.595222 -1.079772 -3.118269 D 1.487226 -0.734611 -1.979310 [4 rows x 3 columns]
This is equivalent to
In [46]: panel.sum('major_axis') Out[46]: ItemA ItemB ItemC A 2.579643 3.062757 0.379252 B 1.416120 -1.960855 0.923558 C 0.595222 -1.079772 -3.118269 D 1.487226 -0.734611 -1.979310 [4 rows x 3 columns]
A transformation operation that returns a Panel, but is computing the z-score across the major_axis
In [47]: result = panel.apply( ....: lambda x: (x-x.mean())/x.std(), ....: axis='major_axis') ....: In [48]: result Out[48]: <class 'pandas.core.panel.Panel'> Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis) Items axis: ItemA to ItemC Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00 Minor_axis axis: A to D In [49]: result['ItemA'] Out[49]: A B C D 2000-01-03 0.595800 0.907552 -1.556260 -1.244875 2000-01-04 0.544058 0.200868 0.915883 0.953747 2000-01-05 -0.924165 -0.701810 0.569325 -0.891290 2000-01-06 -1.219530 -1.334852 -0.418654 0.437589 2000-01-07 1.003837 0.928242 0.489705 0.744830 [5 rows x 4 columns]
Panel apply()
operating on cross-sectional slabs. (GH1148)
In [50]: f = lambda x: ((x.T-x.mean(1))/x.std(1)).T In [51]: result = panel.apply(f, axis = ['items','major_axis']) In [52]: result Out[52]: <class 'pandas.core.panel.Panel'> Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis) Items axis: A to D Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00 Minor_axis axis: ItemA to ItemC In [53]: result.loc[:,:,'ItemA'] Out[53]: A B C D 2000-01-03 0.331409 1.071034 -0.914540 -0.510587 2000-01-04 -0.741017 -0.118794 0.383277 0.537212 2000-01-05 0.065042 -0.767353 0.655436 0.069467 2000-01-06 0.027932 -0.569477 0.908202 0.610585 2000-01-07 1.116434 1.133591 0.871287 1.004064 [5 rows x 4 columns]
This is equivalent to the following
In [54]: result = Panel(dict([ (ax,f(panel.loc[:,:,ax])) ....: for ax in panel.minor_axis ])) ....: In [55]: result Out[55]: <class 'pandas.core.panel.Panel'> Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis) Items axis: A to D Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00 Minor_axis axis: ItemA to ItemC In [56]: result.loc[:,:,'ItemA'] Out[56]: A B C D 2000-01-03 0.331409 1.071034 -0.914540 -0.510587 2000-01-04 -0.741017 -0.118794 0.383277 0.537212 2000-01-05 0.065042 -0.767353 0.655436 0.069467 2000-01-06 0.027932 -0.569477 0.908202 0.610585 2000-01-07 1.116434 1.133591 0.871287 1.004064 [5 rows x 4 columns]
Performance improvements for 0.13.1
count/dropna
for axis=1
dtypes/ftypes
methods (GH5968)DataFrame.apply
(GH6013)
There are no experimental changes in 0.13.1
Bug Fixes¶See V0.13.1 Bug Fixes for an extensive list of bugs that have been fixed in 0.13.1.
See the full release notes or issue tracker on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.
v0.13.0 (January 3, 2014)¶This is a major release from 0.12.0 and includes a number of API changes, several new features and enhancements along with a large number of bug fixes.
Highlights include:
Float64Index
, and other Indexing enhancementsHDFStore
has a new string based syntax for query specificationtimedelta
operationsextract
isin
for DataFramesSeveral experimental features are added, including:
eval/query
methods for expression evaluationmsgpack
serializationBigQuery
There are several new or updated docs sections including:
eval/query
.Warning
In 0.13.0 Series
has internally been refactored to no longer sub-class ndarray
but instead subclass NDFrame
, similar to the rest of the pandas containers. This should be a transparent change with only very limited API implications. See Internal Refactoring
read_excel
now supports an integer in its sheetname
argument giving the index of the sheet to read in (GH4301).
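Sketch (the workbook name is hypothetical; later versions spell the keyword sheet_name):
import pandas as pd
df = pd.read_excel('book.xlsx', sheetname=1)  # second sheet, by position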
Text parser now treats anything that reads like inf (“inf”, “Inf”, “-Inf”, “iNf”, etc.) as infinity. (GH4220, GH4219), affecting read_table
, read_csv
, etc.
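Sketch:
import pandas as pd
from io import StringIO  # StringIO.StringIO on the Python 2 of this era
pd.read_csv(StringIO("A\ninf\n-Inf\niNf\n"))  # all parse as +/- infinity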
pandas
is now Python 2/3 compatible without the need for 2to3 thanks to @jtratner. As a result, pandas now uses iterators more extensively. This also led to the introduction of substantive parts of Benjamin Peterson’s six
library into compat. (GH4384, GH4375, GH4372)
pandas.util.compat
and pandas.util.py3compat
have been merged into pandas.compat
. pandas.compat
now includes many functions allowing 2/3 compatibility. It contains both list and iterator versions of range, filter, map and zip, plus other necessary elements for Python 3 compatibility. lmap
, lzip
, lrange
and lfilter
all produce lists instead of iterators, for compatibility with numpy
, subscripting and pandas
constructors.(GH4384, GH4375, GH4372)
Series.get
with negative indexers now returns the same as []
(GH4390)
Changes to how Index
and MultiIndex
handle metadata (levels
, labels
, and names
) (GH4039):
# previously, you would have set levels or labels directly
index.levels = [[1, 2, 3, 4], [1, 2, 4, 4]]
# now, you use the set_levels or set_labels methods
index = index.set_levels([[1, 2, 3, 4], [1, 2, 4, 4]])
# similarly, for names, you can rename the object
# but setting names is not deprecated
index = index.set_names(["bob", "cranberry"])
# and all methods take an inplace kwarg - but return None
index.set_names(["bob", "cranberry"], inplace=True)
All division with NDFrame
objects is now true division, regardless of the future import. This means that operating on pandas objects will by default use floating point division, and return a floating point dtype. You can use //
and floordiv
to do integer division.
Integer division
In [3]: arr = np.array([1, 2, 3, 4]) In [4]: arr2 = np.array([5, 3, 2, 1]) In [5]: arr / arr2 Out[5]: array([0, 0, 1, 4]) In [6]: Series(arr) // Series(arr2) Out[6]: 0 0 1 0 2 1 3 4 dtype: int64
True Division
In [7]: pd.Series(arr) / pd.Series(arr2) # no future import required Out[7]: 0 0.200000 1 0.666667 2 1.500000 3 4.000000 dtype: float64
Infer and downcast dtype if downcast='infer'
is passed to fillna/ffill/bfill
(GH4604)
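Sketch:
import numpy as np
import pandas as pd
s = pd.Series([1.0, np.nan, 3.0])
s.fillna(2, downcast='infer')  # int64 result, since no precision is lost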
__nonzero__
for all NDFrame objects, will now raise a ValueError
, which reverts to the (GH1073, GH4633) behavior. See gotchas for a more detailed discussion.
This prevents doing boolean comparison on entire pandas objects, which is inherently ambiguous. These all will raise a ValueError
.
if df: ....
df1 and df2
s1 and s2
Added the .bool()
method to NDFrame
objects to facilitate evaluating single-element boolean Series:
In [1]: Series([True]).bool() Out[1]: True In [2]: Series([False]).bool() Out[2]: False In [3]: DataFrame([[True]]).bool() Out[3]: True In [4]: DataFrame([[False]]).bool() Out[4]: False
All non-Index NDFrames (Series
, DataFrame
, Panel
, Panel4D
, SparsePanel
, etc.), now support the entire set of arithmetic operators and arithmetic flex methods (add, sub, mul, etc.). SparsePanel
does not support pow
or mod
with non-scalars. (GH3765)
Series
and DataFrame
now have a mode()
method to calculate the statistical mode(s) by axis/Series. (GH5367)
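Sketch:
import pandas as pd
pd.Series([1, 1, 2, 3, 3]).mode()  # Series([1, 3]) -- every mode is returned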
Chained assignment will now by default warn if the user is assigning to a copy. This can be changed with the option mode.chained_assignment
, allowed options are raise/warn/None
. See the docs.
In [5]: dfc = DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]}) In [6]: pd.set_option('chained_assignment','warn')
The following warning / exception will show if this is attempted.
In [7]: dfc.loc[0]['A'] = 1111
Traceback (most recent call last) ... SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_index,col_indexer] = value instead
Here is the correct method of assignment.
In [8]: dfc.loc[0,'A'] = 11 In [9]: dfc Out[9]: A B 0 11 1 1 bbb 2 2 ccc 3 [3 rows x 2 columns]
Panel.reindex
has the following call signature Panel.reindex(items=None, major_axis=None, minor_axis=None, **kwargs)
to conform with other NDFrame
objects. See Internal Refactoring for more information.
Series.argmin
and Series.argmax
are now aliased to Series.idxmin
and Series.idxmax
. These return the index of the
min or max element respectively. Prior to 0.13.0 these would return the position of the min / max element. (GH6214)
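A sketch of the label-returning behavior:
import pandas as pd
s = pd.Series([10, 30, 20], index=list('abc'))
s.idxmax()  # 'b' -- the index label, not position 1
s.argmax()  # same as idxmax() now that the two are aliased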
These were announced changes in 0.12 or prior that are taking effect as of 0.13.0
Factor
(GH3650)set_printoptions/reset_printoptions
(GH3046)_verbose_info
(GH3215)read_clipboard/to_clipboard/ExcelFile/ExcelWriter
from pandas.io.parsers
(GH3717) These are available as functions in the main pandas namespace (e.g. pd.read_clipboard
)tupleize_cols
is now False
for both to_csv
and read_csv
. Fair warning in 0.12 (GH3604)
Deprecated in 0.13.0
iterkv
, which will be removed in a future release (this was an alias of iteritems used to bypass 2to3
‘s changes). (GH4384, GH4375, GH4372)match
, whose role is now performed more idiomatically by extract
. In a future release, the default behavior of match
will change to become analogous to contains
, which returns a boolean indexer. (Their distinction is strictness: match
relies on re.match
while contains
relies on re.search
.) In this release, the deprecated behavior is the default, but the new behavior is available through the keyword argument as_indexer=True
.
Prior to 0.13, it was impossible to use a label indexer (.loc/.ix
) to set a value that was not contained in the index of a particular axis. (GH2578). See the docs
In the Series
case this is effectively an appending operation
In [10]: s = Series([1,2,3]) In [11]: s Out[11]: 0 1 1 2 2 3 Length: 3, dtype: int64 In [12]: s[5] = 5. In [13]: s Out[13]: 0 1.0 1 2.0 2 3.0 5 5.0 Length: 4, dtype: float64
In [14]: dfi = DataFrame(np.arange(6).reshape(3,2), ....: columns=['A','B']) ....: In [15]: dfi Out[15]: A B 0 0 1 1 2 3 2 4 5 [3 rows x 2 columns]
This would previously raise a KeyError
In [16]: dfi.loc[:,'C'] = dfi.loc[:,'A'] In [17]: dfi Out[17]: A B C 0 0 1 0 1 2 3 2 2 4 5 4 [3 rows x 3 columns]
This is like an append
operation.
In [18]: dfi.loc[3] = 5 In [19]: dfi Out[19]: A B C 0 0 1 0 1 2 3 2 2 4 5 4 3 5 5 5 [4 rows x 3 columns]
A Panel setting operation on an arbitrary axis aligns the input to the Panel
In [20]: p = pd.Panel(np.arange(16).reshape(2,4,2), ....: items=['Item1','Item2'], ....: major_axis=pd.date_range('2001/1/12',periods=4), ....: minor_axis=['A','B'],dtype='float64') ....: In [21]: p Out[21]: <class 'pandas.core.panel.Panel'> Dimensions: 2 (items) x 4 (major_axis) x 2 (minor_axis) Items axis: Item1 to Item2 Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00 Minor_axis axis: A to B In [22]: p.loc[:,:,'C'] = Series([30,32],index=p.items) In [23]: p Out[23]: <class 'pandas.core.panel.Panel'> Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis) Items axis: Item1 to Item2 Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00 Minor_axis axis: A to C In [24]: p.loc[:,:,'C'] Out[24]: Item1 Item2 2001-01-12 30.0 32.0 2001-01-13 30.0 32.0 2001-01-14 30.0 32.0 2001-01-15 30.0 32.0 [4 rows x 2 columns]
Float64Index API Change¶
Added a new index type, Float64Index
. This will be automatically created when passing floating values in index creation. This enables a pure label-based slicing paradigm that makes [],ix,loc
for scalar indexing and slicing work exactly the same. See the docs, (GH263)
Construction is by default for floating type values.
In [25]: index = Index([1.5, 2, 3, 4.5, 5]) In [26]: index Out[26]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64') In [27]: s = Series(range(5),index=index) In [28]: s Out[28]: 1.5 0 2.0 1 3.0 2 4.5 3 5.0 4 Length: 5, dtype: int64
Scalar selection for [],.ix,.loc
will always be label based. An integer will match an equal float index (e.g. 3
is equivalent to 3.0
)
In [29]: s[3] Out[29]: 2 In [30]: s.loc[3] Out[30]: 2
The only positional indexing is via iloc
In [31]: s.iloc[3] Out[31]: 3
A scalar index that is not found will raise KeyError
Slicing is ALWAYS on the values of the index, for [],ix,loc
and ALWAYS positional with iloc
In [32]: s[2:4] Out[32]: 2.0 1 3.0 2 Length: 2, dtype: int64 In [33]: s.loc[2:4] Out[33]: 2.0 1 3.0 2 Length: 2, dtype: int64 In [34]: s.iloc[2:4] Out[34]: 3.0 2 4.5 3 Length: 2, dtype: int64
In float indexes, slicing using floats is allowed
In [35]: s[2.1:4.6] Out[35]: 3.0 2 4.5 3 Length: 2, dtype: int64 In [36]: s.loc[2.1:4.6] Out[36]: 3.0 2 4.5 3 Length: 2, dtype: int64
Indexing on other index types is preserved (and positional fallback for [],ix
), with the exception that floating point slicing on non-Float64Index indexes
will now raise a TypeError
.
In [1]: Series(range(5))[3.5] TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index) In [1]: Series(range(5))[3.5:4.5] TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
Using a scalar float indexer will be deprecated in a future version, but is allowed for now.
In [3]: Series(range(5))[3.0] Out[3]: 3
Query Format Changes. A much more string-like query format is now supported. See the docs.
In [37]: path = 'test.h5' In [38]: dfq = DataFrame(randn(10,4), ....: columns=list('ABCD'), ....: index=date_range('20130101',periods=10)) ....: In [39]: dfq.to_hdf(path,'dfq',format='table',data_columns=True)
Use boolean expressions, with in-line function evaluation.
In [40]: read_hdf(path,'dfq', ....: where="index>Timestamp('20130104') & columns=['A', 'B']") ....: Out[40]: A B 2013-01-05 1.057633 -0.791489 2013-01-06 1.910759 0.787965 2013-01-07 1.043945 2.107785 2013-01-08 0.749185 -0.675521 2013-01-09 -0.276646 1.924533 2013-01-10 0.226363 -2.078618 [6 rows x 2 columns]
Use an inline column reference
In [41]: read_hdf(path,'dfq', ....: where="A>0 or C>0") ....: Out[41]: A B C D 2013-01-01 -0.414505 -1.425795 0.209395 -0.592886 2013-01-02 -1.473116 -0.896581 1.104352 -0.431550 2013-01-03 -0.161137 0.889157 0.288377 -1.051539 2013-01-04 -0.319561 -0.619993 0.156998 -0.571455 2013-01-05 1.057633 -0.791489 -0.524627 0.071878 2013-01-06 1.910759 0.787965 0.513082 -0.546416 2013-01-07 1.043945 2.107785 1.459927 1.015405 2013-01-08 0.749185 -0.675521 0.440266 0.688972 2013-01-09 -0.276646 1.924533 0.411204 0.890765 2013-01-10 0.226363 -2.078618 -0.387886 -0.087107 [10 rows x 4 columns]
the format
keyword now replaces the table
keyword; allowed values are fixed(f)
or table(t)
. The same defaults as prior to 0.13.0 remain, e.g. put
implies fixed
format and append
implies table
format. This default format can be set as an option by setting io.hdf.default_format
.
In [42]: path = 'test.h5' In [43]: df = pd.DataFrame(np.random.randn(10,2)) In [44]: df.to_hdf(path,'df_table',format='table') In [45]: df.to_hdf(path,'df_table2',append=True) In [46]: df.to_hdf(path,'df_fixed') In [47]: with pd.HDFStore(path) as store: ....: print(store) ....: <class 'pandas.io.pytables.HDFStore'> File path: test.h5 /df_fixed frame (shape->[10,2]) /df_table frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index]) /df_table2 frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])
Significant table writing performance improvements
handle a passed Series
in table format (GH4330)
can now serialize a timedelta64[ns]
dtype in a table (GH3577), See the docs.
added an is_open
property to indicate whether the underlying file handle is open; a closed store will now report ‘CLOSED’ when viewing the store (rather than raising an error) (GH4409)
a close of a HDFStore
now will close that instance of the HDFStore
but will only close the actual file if the ref count (by PyTables
) across all of the open handles is 0. Essentially you have a local instance of HDFStore
referenced by a variable. Once you close it, it will report closed. Other references (to the same file) will continue to operate until they themselves are closed. Performing an action on a closed file will raise ClosedFileError
In [48]: path = 'test.h5' In [49]: df = DataFrame(randn(10,2)) In [50]: store1 = HDFStore(path) In [51]: store2 = HDFStore(path) In [52]: store1.append('df',df) In [53]: store2.append('df2',df) In [54]: store1 Out[54]: <class 'pandas.io.pytables.HDFStore'> File path: test.h5 /df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index]) In [55]: store2 Out[55]: <class 'pandas.io.pytables.HDFStore'> File path: test.h5 /df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index]) /df2 frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index]) In [56]: store1.close() In [57]: store2 Out[57]: <class 'pandas.io.pytables.HDFStore'> File path: test.h5 /df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index]) /df2 frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index]) In [58]: store2.close() In [59]: store2 Out[59]: <class 'pandas.io.pytables.HDFStore'> File path: test.h5 File is CLOSED
removed the _quiet
attribute, replaced by a DuplicateWarning
if retrieving duplicate rows from a table (GH4367)
removed the warn
argument from open
. Instead a PossibleDataLossError
exception will be raised if you try to use mode='w'
with an OPEN file handle (GH4367)
allow a passed locations array or mask as a where
condition (GH4467). See the docs for an example.
add the keyword dropna=True
to append
to control whether all-NaN rows are written to the store (the default is True
, meaning all-NaN rows are NOT written), also settable via the option io.hdf.dropna_table
(GH4625)
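Sketch (the store path is hypothetical):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, np.nan]})
with pd.HDFStore('test.h5') as store:
    store.append('df', df, dropna=False)  # keep the all-NaN row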
pass thru store creation arguments; can be used to support in-memory stores
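A sketch of an in-memory store, assuming the extra kwargs are passed through to PyTables' open_file as described:
import pandas as pd
# driver kwargs are forwarded to the underlying PyTables file
store = pd.HDFStore('in_memory.h5', mode='w',
                    driver='H5FD_CORE', driver_core_backing_store=0)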
The HTML and plain text representations of DataFrame
now show a truncated view of the table once it exceeds a certain size, rather than switching to the short info view (GH4886, GH5550). This makes the representation more consistent as small DataFrames get larger.
To get the info view, call DataFrame.info()
. If you prefer the info view as the repr for large DataFrames, you can set this by running set_option('display.large_repr', 'info')
.
df.to_clipboard()
learned a new excel
keyword that lets you paste DataFrame data directly into Excel (enabled by default). (GH5070).
read_html
now raises a URLError
instead of catching and raising a ValueError
(GH4303, GH4305)
Added a test for read_clipboard()
and to_clipboard()
(GH4282)
Clipboard functionality now works with PySide (GH4282)
Added a more informative error message when plot arguments contain overlapping color and style arguments (GH4402)
to_dict
now takes records
as a possible outtype. Returns an array of column-keyed dictionaries. (GH4936)
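Sketch (outtype is this era's keyword; later versions rename it orient):
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_dict(outtype='records')  # [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}]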
NaN
handling in get_dummies (GH4446) with dummy_na
# previously, nan was erroneously counted as 2 here # now it is not counted at all In [60]: get_dummies([1, 2, np.nan]) Out[60]: 1.0 2.0 0 1 0 1 0 1 2 0 0 [3 rows x 2 columns] # unless requested In [61]: get_dummies([1, 2, np.nan], dummy_na=True) Out[61]: 1.0 2.0 NaN 0 1 0 0 1 0 1 0 2 0 0 1 [3 rows x 3 columns]
timedelta64[ns]
operations. See the docs.
Warning
Most of these operations require numpy >= 1.7
Using the new top-level to_timedelta
, you can convert a scalar or array from the standard timedelta format (produced by to_csv
) into a timedelta type (np.timedelta64
in nanoseconds
).
In [62]: to_timedelta('1 days 06:05:01.00003') Out[62]: Timedelta('1 days 06:05:01.000030') In [63]: to_timedelta('15.5us') Out[63]: Timedelta('0 days 00:00:00.000015') In [64]: to_timedelta(['1 days 06:05:01.00003','15.5us','nan']) Out[64]: TimedeltaIndex(['1 days 06:05:01.000030', '0 days 00:00:00.000015', NaT], dtype='timedelta64[ns]', freq=None) In [65]: to_timedelta(np.arange(5),unit='s') Out[65]: TimedeltaIndex(['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04'], dtype='timedelta64[ns]', freq=None) In [66]: to_timedelta(np.arange(5),unit='d') Out[66]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)
A Series of dtype timedelta64[ns]
can now be divided by another timedelta64[ns]
object, or astyped to yield a float64
dtyped Series. This is frequency conversion. See the docs.
In [67]: from datetime import timedelta In [68]: td = Series(date_range('20130101',periods=4))-Series(date_range('20121201',periods=4)) In [69]: td[2] += np.timedelta64(timedelta(minutes=5,seconds=3)) In [70]: td[3] = np.nan In [71]: td Out[71]: 0 31 days 00:00:00 1 31 days 00:00:00 2 31 days 00:05:03 3 NaT Length: 4, dtype: timedelta64[ns] # to days In [72]: td / np.timedelta64(1,'D') Out[72]: 0 31.000000 1 31.000000 2 31.003507 3 NaN Length: 4, dtype: float64 In [73]: td.astype('timedelta64[D]') Out[73]: 0 31.0 1 31.0 2 31.0 3 NaN Length: 4, dtype: float64 # to seconds In [74]: td / np.timedelta64(1,'s') Out[74]: 0 2678400.0 1 2678400.0 2 2678703.0 3 NaN Length: 4, dtype: float64 In [75]: td.astype('timedelta64[s]') Out[75]: 0 2678400.0 1 2678400.0 2 2678703.0 3 NaN Length: 4, dtype: float64
Dividing or multiplying a timedelta64[ns]
Series by an integer or integer Series
In [76]: td * -1 Out[76]: 0 -31 days +00:00:00 1 -31 days +00:00:00 2 -32 days +23:54:57 3 NaT Length: 4, dtype: timedelta64[ns] In [77]: td * Series([1,2,3,4]) Out[77]: 0 31 days 00:00:00 1 62 days 00:00:00 2 93 days 00:15:09 3 NaT Length: 4, dtype: timedelta64[ns]
Absolute DateOffset
objects can act equivalently to timedeltas
In [78]: from pandas import offsets In [79]: td + offsets.Minute(5) + offsets.Milli(5) Out[79]: 0 31 days 00:05:00.005000 1 31 days 00:05:00.005000 2 31 days 00:10:03.005000 3 NaT Length: 4, dtype: timedelta64[ns]
Fillna is now supported for timedeltas
In [80]: td.fillna(0) Out[80]: 0 31 days 00:00:00 1 31 days 00:00:00 2 31 days 00:05:03 3 0 days 00:00:00 Length: 4, dtype: timedelta64[ns] In [81]: td.fillna(timedelta(days=1,seconds=5)) Out[81]: 0 31 days 00:00:00 1 31 days 00:00:00 2 31 days 00:05:03 3 1 days 00:00:05 Length: 4, dtype: timedelta64[ns]
You can do numeric reduction operations on timedeltas.
In [82]: td.mean() Out[82]: Timedelta('31 days 00:01:41') In [83]: td.quantile(.1) Out[83]: Timedelta('31 days 00:00:00')
plot(kind='kde')
now accepts the optional parameters bw_method
and ind
, passed to scipy.stats.gaussian_kde() (for scipy >= 0.11.0) to set the bandwidth, and to gkde.evaluate() to specify the indices at which it is evaluated, respectively. See scipy docs. (GH4298)
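Sketch (requires scipy and matplotlib):
import numpy as np
import pandas as pd
s = pd.Series(np.random.randn(200))
s.plot(kind='kde', bw_method=0.3, ind=np.linspace(-3, 3, 100))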
DataFrame constructor now accepts a numpy masked record array (GH3478)
The new vectorized string method extract
returns regular expression matches more conveniently.
In [84]: Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)') Out[84]: 0 1 1 2 2 NaN Length: 3, dtype: object
Elements that do not match return NaN
. Extracting a regular expression with more than one group returns a DataFrame with one column per group.
In [85]: Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)') Out[85]: 0 1 0 a 1 1 b 2 2 NaN NaN [3 rows x 2 columns]
Elements that do not match return a row of NaN
. Thus, a Series of messy strings can be converted into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get()
to access tuples or re.match
objects.
Named groups like
In [86]: Series(['a1', 'b2', 'c3']).str.extract( ....: '(?P<letter>[ab])(?P<digit>\d)') ....: Out[86]: letter digit 0 a 1 1 b 2 2 NaN NaN [3 rows x 2 columns]
and optional groups can also be used.
In [87]: Series(['a1', 'b2', '3']).str.extract( ....: '(?P<letter>[ab])?(?P<digit>\d)') ....: Out[87]: letter digit 0 a 1 1 b 2 2 NaN 3 [3 rows x 2 columns]
read_stata
now accepts Stata 13 format (GH4291)
read_fwf
now infers the column specifications from the first 100 rows of the file if the data has correctly separated and properly aligned columns using the delimiter provided to the function (GH4488).
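Sketch:
import pandas as pd
from io import StringIO
data = "col1  col2\n1     a\n22    bb\n"
pd.read_fwf(StringIO(data))  # column specs inferred from the alignment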
support for nanosecond times as an offset
Warning
These operations require numpy >= 1.7
Period conversions in the range of seconds and below were reworked and extended up to nanoseconds. Periods in the nanosecond range are now available.
In [88]: date_range('2013-01-01', periods=5, freq='5N') Out[88]: DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq='5N')
or with frequency as offset
In [89]: date_range('2013-01-01', periods=5, freq=pd.offsets.Nano(5)) Out[89]: DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq='5N')
Timestamps can be modified in the nanosecond range
In [90]: t = Timestamp('20130101 09:01:02') In [91]: t + pd.tseries.offsets.Nano(123) Out[91]: Timestamp('2013-01-01 09:01:02.000000123')
A new method, isin
for DataFrames, which plays nicely with boolean indexing. The argument to isin
, what we’re comparing the DataFrame to, can be a DataFrame, Series, dict, or array of values. See the docs for more.
To get the rows where any of the conditions are met:
In [92]: dfi = DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'f', 'n']}) In [93]: dfi Out[93]: A B 0 1 a 1 2 b 2 3 f 3 4 n [4 rows x 2 columns] In [94]: other = DataFrame({'A': [1, 3, 3, 7], 'B': ['e', 'f', 'f', 'e']}) In [95]: mask = dfi.isin(other) In [96]: mask Out[96]: A B 0 True False 1 False False 2 True True 3 False False [4 rows x 2 columns] In [97]: dfi[mask.any(1)] Out[97]: A B 0 1 a 2 3 f [2 rows x 2 columns]
Series
now supports a to_frame
method to convert it to a single-column DataFrame (GH5164)
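Sketch:
import pandas as pd
s = pd.Series([1, 2, 3], name='x')
s.to_frame()  # one-column DataFrame whose column is named 'x'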
All R datasets listed here http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html can now be loaded into pandas objects
# note that pandas.rpy was deprecated in v0.16.0
import pandas.rpy.common as com
com.load_data('Titanic')
tz_localize
can infer a fall daylight savings transition based on the structure of the unlocalized data (GH4230), see the docs
DatetimeIndex
is now in the API documentation, see the docs
json_normalize()
is a new method to allow you to create a flat table from semi-structured JSON data. See the docs (GH1067)
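A minimal sketch (the records are hypothetical):
from pandas.io.json import json_normalize
data = [{'state': 'FL', 'info': {'pop': 19552860}},
        {'state': 'OH', 'info': {'pop': 11570808}}]
json_normalize(data)  # nested fields become dotted columns, e.g. 'info.pop'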
Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.
Python csv parser now supports usecols (GH4335)
Frequencies gained several new offsets: LastWeekOfMonth (GH4637), FY5253, and FY5253Quarter (GH4511)
DataFrame has a new interpolate
method, similar to Series (GH4434, GH1892)
In [98]: df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8], ....: 'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]}) ....: In [99]: df.interpolate() Out[99]: A B 0 1.0 0.25 1 2.1 1.50 2 3.4 2.75 3 4.7 4.00 4 5.6 12.20 5 6.8 14.40 [6 rows x 2 columns]
Additionally, the method
argument to interpolate
has been expanded to include 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'piecewise_polynomial', 'pchip', 'polynomial', 'spline'
The new methods require scipy. Consult the Scipy reference guide and documentation for more information about when the various methods are appropriate. See the docs.
Interpolate now also accepts a limit
keyword argument. This works similarly to fillna
‘s limit:
In [100]: ser = Series([1, 3, np.nan, np.nan, np.nan, 11]) In [101]: ser.interpolate(limit=2) Out[101]: 0 1.0 1 3.0 2 5.0 3 7.0 4 NaN 5 11.0 Length: 6, dtype: float64
Added wide_to_long
panel data convenience function. See the docs.
In [102]: np.random.seed(123) In [103]: df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"}, .....: "A1980" : {0 : "d", 1 : "e", 2 : "f"}, .....: "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7}, .....: "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1}, .....: "X" : dict(zip(range(3), np.random.randn(3))) .....: }) .....: In [104]: df["id"] = df.index In [105]: df Out[105]: A1970 A1980 B1970 B1980 X id 0 a d 2.5 3.2 -1.085631 0 1 b e 1.2 1.3 0.997345 1 2 c f 0.7 0.1 0.282978 2 [3 rows x 6 columns] In [106]: wide_to_long(df, ["A", "B"], i="id", j="year") Out[106]: X A B id year 0 1970 -1.085631 a 2.5 1 1970 0.997345 b 1.2 2 1970 0.282978 c 0.7 0 1980 -1.085631 d 3.2 1 1980 0.997345 e 1.3 2 1980 0.282978 f 0.1 [6 rows x 3 columns]
to_csv
now takes a date_format
keyword argument that specifies how output datetime objects should be formatted. Datetimes encountered in the index, columns, and values will all have this formatting applied. (GH4313)DataFrame.plot
will scatter plot x versus y by passing kind='scatter'
(GH2215)The new eval()
function implements expression evaluation using numexpr
behind the scenes. This results in large speedups for complicated expressions involving large DataFrames/Series. For example,
In [107]: nrows, ncols = 20000, 100 In [108]: df1, df2, df3, df4 = [DataFrame(randn(nrows, ncols)) .....: for _ in range(4)] .....:
# eval with NumExpr backend In [109]: %timeit pd.eval('df1 + df2 + df3 + df4') 7.17 ms +- 323 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
# pure Python evaluation In [110]: %timeit df1 + df2 + df3 + df4 10.6 ms +- 598 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
For more details, see the docs
Similar to pandas.eval
, DataFrame
has a new DataFrame.eval
method that evaluates an expression in the context of the DataFrame
. For example,
In [111]: df = DataFrame(randn(10, 2), columns=['a', 'b']) In [112]: df.eval('a + b') Out[112]: 0 -0.685204 1 1.589745 2 0.325441 3 -1.784153 4 -0.432893 5 0.171850 6 1.895919 7 3.065587 8 -0.092759 9 1.391365 Length: 10, dtype: float64
query()
method has been added that allows you to select elements of a DataFrame
using a natural query syntax nearly identical to Python syntax. For example,
In [113]: n = 20 In [114]: df = DataFrame(np.random.randint(n, size=(n, 3)), columns=['a', 'b', 'c']) In [115]: df.query('a < b < c') Out[115]: a b c 11 1 5 8 15 8 16 19 [2 rows x 3 columns]
selects all the rows of df
where a < b < c
evaluates to True
. For more details see the docs.
pd.read_msgpack()
and pd.to_msgpack()
are now a supported method of serialization of arbitrary pandas (and python objects) in a lightweight portable binary format. See the docs
Warning
Since this is an EXPERIMENTAL LIBRARY, the storage format may not be stable until a future release.
In [116]: df = DataFrame(np.random.rand(5,2),columns=list('AB')) In [117]: df.to_msgpack('foo.msg') In [118]: pd.read_msgpack('foo.msg') Out[118]: A B 0 0.251082 0.017357 1 0.347915 0.929879 2 0.546233 0.203368 3 0.064942 0.031722 4 0.355309 0.524575 [5 rows x 2 columns] In [119]: s = Series(np.random.rand(5),index=date_range('20130101',periods=5)) In [120]: pd.to_msgpack('foo.msg', df, s) In [121]: pd.read_msgpack('foo.msg') Out[121]: [ A B 0 0.251082 0.017357 1 0.347915 0.929879 2 0.546233 0.203368 3 0.064942 0.031722 4 0.355309 0.524575 [5 rows x 2 columns], 2013-01-01 0.022321 2013-01-02 0.227025 2013-01-03 0.383282 2013-01-04 0.193225 2013-01-05 0.110977 Freq: D, Length: 5, dtype: float64]
You can pass iterator=True
to iterate over the unpacked results
In [122]: for o in pd.read_msgpack('foo.msg',iterator=True): .....: print(o) .....:
pandas.io.gbq
provides a simple way to extract from, and load data into, Google’s BigQuery Data Sets by way of pandas DataFrames. BigQuery is a high performance SQL-like database service, useful for performing ad-hoc queries against extremely large datasets. See the docs
import pandas as pd
from pandas.io import gbq

# A query to select the average monthly temperatures
# in the year 2000 across the USA. The dataset,
# publicdata:samples.gsod, is available on all BigQuery accounts,
# and is based on NOAA gsod data.
query = """SELECT station_number as STATION, month as MONTH, AVG(mean_temp) as MEAN_TEMP FROM publicdata:samples.gsod WHERE YEAR = 2000 GROUP BY STATION, MONTH ORDER BY STATION, MONTH ASC"""

# Fetch the result set for this query
# Your Google BigQuery Project ID
# To find this, see your dashboard:
# https://console.developers.google.com/iam-admin/projects?authuser=0
projectid = 'xxxxxxxxx'
df = gbq.read_gbq(query, project_id = projectid)

# Use pandas to process and reshape the dataset
df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
df3 = pd.concat([df2.min(), df2.mean(), df2.max()],
                axis=1, keys=["Min Tem", "Mean Temp", "Max Temp"])
The resulting DataFrame is:
> df3 Min Tem Mean Temp Max Temp MONTH 1 -53.336667 39.827892 89.770968 2 -49.837500 43.685219 93.437932 3 -77.926087 48.708355 96.099998 4 -82.892858 55.070087 97.317240 5 -92.378261 61.428117 102.042856 6 -77.703334 65.858888 102.900000 7 -87.821428 68.169663 106.510714 8 -89.431999 68.614215 105.500000 9 -86.611112 63.436935 107.142856 10 -78.209677 56.880838 92.103333 11 -50.125000 48.861228 94.996428 12 -50.332258 42.286879 94.396774
Warning
To use this module, you will need a BigQuery account. See <https://cloud.google.com/products/big-query> for details.
As of 10/10/13, there is a bug in Google’s API preventing result sets from being larger than 100,000 rows. A patch is scheduled for the week of 10/14/13.
In 0.13.0 there is a major refactor primarily to subclass Series
from NDFrame
, which is the base class currently for DataFrame
and Panel
, to unify methods and behaviors. Series formerly subclassed directly from ndarray
. (GH4080, GH3862, GH816)
Warning
There are two potential incompatibilities from < 0.13.0
Using certain numpy functions would previously return a Series
if passed a Series
as an argument. This seems only to affect np.ones_like
, np.empty_like
, np.diff
and np.where
. These now return ndarrays
.
In [123]: s = Series([1,2,3,4])
Numpy Usage
In [124]: np.ones_like(s) Out[124]: array([1, 1, 1, 1]) In [125]: np.diff(s) Out[125]: array([1, 1, 1]) In [126]: np.where(s>1,s,np.nan) Out[126]: array([ nan, 2., 3., 4.])
Pandonic Usage
In [127]: Series(1,index=s.index) Out[127]: 0 1 1 1 2 1 3 1 Length: 4, dtype: int64 In [128]: s.diff() Out[128]: 0 NaN 1 1.0 2 1.0 3 1.0 Length: 4, dtype: float64 In [129]: s.where(s>1) Out[129]: 0 NaN 1 2.0 2 3.0 3 4.0 Length: 4, dtype: float64
Passing a Series
directly to a cython function expecting an ndarray
type will no longer work directly; you must pass Series.values
, See Enhancing Performance
Series(0.5)
would previously return the scalar 0.5
, instead this will return a 1-element Series
This change breaks rpy2<=2.3.8
. An issue has been opened against rpy2 and a workaround is detailed in GH5698. Thanks @JanSchulz.
Pickle compatibility is preserved for pickles created prior to 0.13. These must be unpickled with pd.read_pickle
, see Pickling.
Refactor of series.py/frame.py/panel.py to move common code to generic.py
_setup_axes
to create generic NDFrame structuresfrom_axes,_wrap_array,axes,ix,loc,iloc,shape,empty,swapaxes,transpose,pop
__iter__,keys,__contains__,__len__,__neg__,__invert__
convert_objects,as_blocks,as_matrix,values
__getstate__,__setstate__
(compat remains in frame/panel)__getattr__,__setattr__
_indexed_same,reindex_like,align,where,mask
fillna,replace
(Series
replace is now consistent with DataFrame
)filter
(also added axis argument to selectively filter on a different axis)reindex,reindex_axis,take
truncate
(moved to become part of NDFrame
These are API changes which make Panel more consistent with DataFrame

- swapaxes on a Panel with the same axes specified now returns a copy
- filter supports the same API as the DataFrame filter

Reindex called with no arguments will now return a copy of the input object
TimeSeries is now an alias for Series. The property is_time_series can be used to distinguish (if desired)
Refactor of Sparse objects to use BlockManager

- A new block type in internals, SparseBlock, which can hold multi-dtypes and is non-consolidatable. SparseSeries and SparseDataFrame now inherit more methods from their hierarchy (Series/DataFrame), and no longer inherit from SparseArray (which instead is the object of the SparseBlock)
- Fancy indexing on SparseSeries for boolean/integer/slices
- SparsePanels implementation is unchanged (e.g. not using BlockManager, needs work)

added ftypes method to Series/DataFrame, similar to dtypes, but indicates if the underlying is sparse/dense (as well as the dtype)
All NDFrame objects can now use __finalize__() to specify various values to propagate to new objects from an existing one (e.g. name in Series will follow more automatically now)
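A small sketch of the visible effect, propagation of the name attribute (values are illustrative):

import pandas as pd

s = pd.Series([1, 2, 3], name='foo')
print((s * 2).name)   # 'foo' -- metadata propagates to the arithmetic result
print(s[:2].name)     # 'foo' -- and to slices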
Internal type checking is now done via a suite of generated classes, allowing isinstance(value, klass) without having to directly import the klass, courtesy of @jtratner
Bug in Series.update where the parent frame was not updating its cache based on changes (GH4080) or types (GH3217); also fixed for fillna (GH3386)
Refactor Series.reindex to core/generic.py (GH4604, GH4618); allow method= in reindexing on a Series to work
Series.copy no longer accepts the order parameter and is now consistent with NDFrame copy
Refactor rename methods to core/generic.py; fixes Series.rename (GH4605), and adds rename with the same signature for Panel
Refactor clip methods to core/generic.py (GH4798)
Refactor of _get_numeric_data/_get_bool_data to core/generic.py, allowing Series/Panel functionality
Series (for index) / Panel (for items) now allow attribute access to their elements (GH1903)
In [130]: s = Series([1, 2, 3], index=list('abc'))

In [131]: s.b
Out[131]: 2

In [132]: s.a = 5

In [133]: s
Out[133]:
a    5
b    2
c    3
Length: 3, dtype: int64
See V0.13.0 Bug Fixes for an extensive list of bugs that have been fixed in 0.13.0.
See the full release notes or issue tracker on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.
v0.12.0 (July 24, 2013)¶This is a major release from 0.11.0 and includes several new features and enhancements along with a large number of bug fixes.
Highlights include a consistent I/O API naming scheme, routines to read html, write multi-indexes to csv files, read & write STATA data files, read & write JSON format files, Python 3 support for HDFStore, filtering of groupby expressions via filter, and a revamped replace routine that accepts regular expressions.
API changes¶
The I/O API is now much more consistent with a set of top level reader functions accessed like pd.read_csv() that generally return a pandas object.
read_csv
read_excel
read_hdf
read_sql
read_json
read_html
read_stata
read_clipboard
The corresponding writer functions are object methods that are accessed like df.to_csv()
to_csv
to_excel
to_hdf
to_sql
to_json
to_html
to_stata
to_clipboard
Fix modulo and integer division on Series/DataFrames to act similarly to float dtypes and return np.nan or np.inf as appropriate (GH3590). This corrects a numpy bug that treats integer and float dtypes differently.

In [1]: p = DataFrame({'first': [4, 5, 8], 'second': [0, 0, 3]})

In [2]: p % 0
Out[2]:
   first  second
0    NaN     NaN
1    NaN     NaN
2    NaN     NaN

[3 rows x 2 columns]

In [3]: p % p
Out[3]:
   first  second
0    0.0     NaN
1    0.0     NaN
2    0.0     0.0

[3 rows x 2 columns]

In [4]: p / p
Out[4]:
   first  second
0    1.0     NaN
1    1.0     NaN
2    1.0     1.0

[3 rows x 2 columns]

In [5]: p / 0
Out[5]:
   first  second
0    inf     NaN
1    inf     NaN
2    inf     inf

[3 rows x 2 columns]
Add squeeze keyword to groupby to allow reduction from DataFrame -> Series if groups are unique. This is a regression from 0.10.1. We are reverting back to the prior behavior. This means groupby will return the same shaped objects whether the groups are unique or not. Revert this issue (GH2893) with (GH3596).

In [6]: df2 = DataFrame([{"val1": 1, "val2": 20}, {"val1": 1, "val2": 19},
   ...:                  {"val1": 1, "val2": 27}, {"val1": 1, "val2": 12}])

In [7]: def func(dataf):
   ...:     return dataf["val2"] - dataf["val2"].mean()

# squeezing the result frame to a series (because we have unique groups)
In [8]: df2.groupby("val1", squeeze=True).apply(func)
Out[8]:
0    0.5
1   -0.5
2    7.5
3   -7.5
Name: 1, Length: 4, dtype: float64

# no squeezing (the default, and behavior in 0.10.1)
In [9]: df2.groupby("val1").apply(func)
Out[9]:
val2    0    1    2    3
val1
1     0.5 -0.5  7.5 -7.5

[1 rows x 4 columns]
Raise on iloc when boolean indexing with a label based indexer mask; e.g. a boolean Series, even with integer labels, will raise. Since iloc is purely positional based, the labels on the Series are not alignable (GH3631).

This case is rarely used, and there are plenty of alternatives. This preserves the iloc API to be purely positional based.

In [10]: df = DataFrame(lrange(5), list('ABCDE'), columns=['a'])

In [11]: mask = (df.a % 2 == 0)

In [12]: mask
Out[12]:
A     True
B    False
C     True
D    False
E     True
Name: a, Length: 5, dtype: bool

# this is what you should use
In [13]: df.loc[mask]
Out[13]:
   a
A  0
C  2
E  4

[3 rows x 1 columns]

# this will work as well
In [14]: df.iloc[mask.values]
Out[14]:
   a
A  0
C  2
E  4

[3 rows x 1 columns]

df.iloc[mask] will raise a ValueError
The raise_on_error argument to plotting functions is removed. Instead, plotting functions raise a TypeError when the dtype of the object is object, to remind you to avoid object arrays whenever possible; you should cast to an appropriate numeric dtype if you need to plot something.

Add colormap keyword to DataFrame plotting methods. Accepts either a matplotlib colormap object (ie, matplotlib.cm.jet) or a string name of such an object (ie, 'jet'). The colormap is sampled to select the color for each column. Please see Colormaps for more information. (GH3860)
DataFrame.interpolate() is now deprecated. Please use DataFrame.fillna() and DataFrame.replace() instead. (GH3582, GH3675, GH3676)

The method and axis arguments of DataFrame.replace() are deprecated.
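A minimal sketch of the recommended replacement, using fillna for the common forward-fill case (values are illustrative):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.fillna(method='pad'))   # forward-fill, one fillna-based replacement for interpolate()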
DataFrame.replace's infer_types parameter is removed and now performs conversion by default. (GH3907)
allow_duplicates
toDataFrame.insert
to allow a duplicate column to be inserted ifTrue
, default isFalse
(same as prior to 0.12) (GH3679)IO api
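A quick sketch of the keyword in use (data values are illustrative):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df.insert(1, 'A', [7, 8, 9], allow_duplicates=True)  # a second 'A' column
print(df.columns.tolist())                           # ['A', 'A']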
IO api

added top-level function read_excel to replace the following. The original API is deprecated and will be removed in a future version

from pandas.io.parsers import ExcelFile
xls = ExcelFile('path_to_file.xls')
xls.parse('Sheet1', index_col=None, na_values=['NA'])

With

import pandas as pd
pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

added top-level function read_sql that is equivalent to the following

from pandas.io.sql import read_frame
read_frame(....)
DataFrame.to_html and DataFrame.to_latex now accept a path for their first argument (GH3702)

Do not allow astypes on datetime64[ns] except to object, and timedelta64[ns] to object/int (GH3425)

The behavior of datetime64 dtypes has changed with respect to certain so-called reduction operations (GH3726). The following operations now raise a TypeError when performed on a Series and return an empty Series when performed on a DataFrame, similar to performing these operations on, for example, a DataFrame of slice objects:
- sum, prod, mean, std, var, skew, kurt, corr, and cov
read_html now defaults to None when reading, and falls back on bs4 + html5lib when lxml fails to parse. A list of parsers to try until success is also valid.

The internal pandas class hierarchy has changed (slightly). The previous PandasObject now is called PandasContainer and a new PandasObject has become the baseclass for PandasContainer as well as Index, Categorical, GroupBy, SparseList, and SparseArray (+ their base classes). Currently, PandasObject provides string methods (from StringMixin). (GH4090, GH4092)
StringMixin
that, given a__unicode__
method, gets python 2 and python 3 compatible string methods (__str__
,__bytes__
, and__repr__
). Plus string safety throughout. Now employed in many places throughout the pandas library. (GH4090, GH4092)
I/O Enhancements¶
pd.read_html() can now parse HTML strings, files or urls and return DataFrames, courtesy of @cpcloud. (GH3477, GH3605, GH3606, GH3616). It works with a single parser backend: BeautifulSoup4 + html5lib. See the docs

You can use pd.read_html() to read the output from DataFrame.to_html() like so

In [15]: df = DataFrame({'a': range(3), 'b': list('abc')})

In [16]: print(df)
   a  b
0  0  a
1  1  b
2  2  c

[3 rows x 2 columns]

In [17]: html = df.to_html()

In [18]: alist = pd.read_html(html, index_col=0)

In [19]: print(df == alist[0])
      a     b
0  True  True
1  True  True
2  True  True

[3 rows x 2 columns]

Note that alist here is a Python list, so pd.read_html() and DataFrame.to_html() are not inverses.
pd.read_html() no longer performs hard conversion of date strings (GH3656).

Added module for reading and writing Stata files: pandas.io.stata (GH1512), accessible via the read_stata top-level function for reading, and the to_stata DataFrame method for writing. See the docs

Added module for reading and writing json format files: pandas.io.json, accessible via the read_json top-level function for reading, and the to_json DataFrame method for writing. See the docs. Various issues (GH1226, GH3804, GH3876, GH3867, GH1305)
MultiIndex column support for reading and writing csv format files
- The header option in read_csv now accepts a list of the rows from which to read the index.
- The option tupleize_cols can now be specified in both to_csv and read_csv, to provide compatibility for the pre-0.12 behavior of writing and reading MultiIndex columns via a list of tuples. The default in 0.12 is to write lists of tuples and not interpret a list of tuples as a MultiIndex column.
  Note: The default behavior in 0.12 remains unchanged from prior versions, but starting with 0.13, the default to write and read MultiIndex columns will be in the new format. (GH3571, GH1651, GH3141)
- If an index_col is not specified (e.g. you don't have an index, or wrote it with df.to_csv(..., index=False)), then any names on the columns index will be lost.

In [20]: from pandas.util.testing import makeCustomDataframe as mkdf

In [21]: df = mkdf(5, 3, r_idx_nlevels=2, c_idx_nlevels=4)

In [22]: df.to_csv('mi.csv', tupleize_cols=False)

In [23]: print(open('mi.csv').read())
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2

In [24]: pd.read_csv('mi.csv', header=[0,1,2,3], index_col=[0,1], tupleize_cols=False)
Out[24]:
C0              C_l0_g0 C_l0_g1 C_l0_g2
C1              C_l1_g0 C_l1_g1 C_l1_g2
C2              C_l2_g0 C_l2_g1 C_l2_g2
C3              C_l3_g0 C_l3_g1 C_l3_g2
R0      R1
R_l0_g0 R_l1_g0    R0C0    R0C1    R0C2
R_l0_g1 R_l1_g1    R1C0    R1C1    R1C2
R_l0_g2 R_l1_g2    R2C0    R2C1    R2C2
R_l0_g3 R_l1_g3    R3C0    R3C1    R3C2
R_l0_g4 R_l1_g4    R4C0    R4C1    R4C2

[5 rows x 3 columns]
Support for HDFStore (via PyTables 3.0.0) on Python3

Iterator support via read_hdf that automatically opens and closes the store when iteration is finished. This is only for tables

In [25]: path = 'store_iterator.h5'

In [26]: DataFrame(randn(10, 2)).to_hdf(path, 'df', table=True)

In [27]: for df in read_hdf(path, 'df', chunksize=3):
   ....:     print(df)
   ....:
          0         1
0  0.713216 -0.778461
1 -0.661062  0.862877
2  0.344342  0.149565
          0         1
3 -0.626968 -0.875772
4 -0.930687 -0.218983
5  0.949965 -0.442354
          0         1
6 -0.402985  1.111358
7 -0.241527 -0.670477
8  0.049355  0.632633
          0         1
9 -1.502767 -1.225492
read_csv will now throw a more informative error message when a file contains no columns, e.g., all newline characters
Other Enhancements¶
DataFrame.replace() now allows regular expressions on contained Series with object dtype. See the examples section in the regular docs Replacing via String Expression

For example you can do

In [25]: df = DataFrame({'a': list('ab..'), 'b': [1, 2, 3, 4]})

In [26]: df.replace(regex=r'\s*\.\s*', value=np.nan)
Out[26]:
     a  b
0    a  1
1    b  2
2  NaN  3
3  NaN  4

[4 rows x 2 columns]

to replace all occurrences of the string '.' with zero or more instances of surrounding whitespace with NaN.

Regular string replacement still works as expected. For example, you can do

In [27]: df.replace('.', np.nan)
Out[27]:
     a  b
0    a  1
1    b  2
2  NaN  3
3  NaN  4

[4 rows x 2 columns]

to replace all occurrences of the string '.' with NaN.
pd.melt() now accepts the optional parameters var_name and value_name to specify custom column names of the returned DataFrame.
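A short sketch of the new parameters (column names are illustrative):

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'height': [1.5, 1.8], 'weight': [60, 80]})
# 'measurement' and 'reading' replace the default 'variable'/'value' names
print(pd.melt(df, id_vars=['id'], var_name='measurement', value_name='reading'))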
pd.set_option() now allows N option, value pairs (GH3667).

Let's say that we had an option 'a.b' and another option 'b.c'. We can set them at the same time:

In [28]: pd.get_option('a.b')
Out[28]: 2

In [29]: pd.get_option('b.c')
Out[29]: 3

In [30]: pd.set_option('a.b', 1, 'b.c', 4)

In [31]: pd.get_option('a.b')
Out[31]: 1

In [32]: pd.get_option('b.c')
Out[32]: 4
The filter method for group objects returns a subset of the original object. Suppose we want to take only elements that belong to groups with a group sum greater than 2.

In [33]: sf = Series([1, 1, 2, 3, 3, 3])

In [34]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[34]:
3    3
4    3
5    3
Length: 3, dtype: int64

The argument of filter must be a function that, applied to the group as a whole, returns True or False.

Another useful operation is filtering out elements that belong to groups with only a couple members.
In [35]: dff = DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})

In [36]: dff.groupby('B').filter(lambda x: len(x) > 2)
Out[36]:
   A  B
2  2  b
3  3  b
4  4  b
5  5  b

[4 rows x 2 columns]

Alternatively, instead of dropping the offending groups, we can return a like-indexed object where the groups that do not pass the filter are filled with NaNs.

In [37]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
Out[37]:
     A    B
0  NaN  NaN
1  NaN  NaN
2  2.0    b
3  3.0    b
4  4.0    b
5  5.0    b
6  NaN  NaN
7  NaN  NaN

[8 rows x 2 columns]

Series and DataFrame hist methods now take a figsize argument (GH3834)

DatetimeIndexes no longer try to convert mixed-integer indexes during join operations (GH3877)
Timestamp.min and Timestamp.max now represent valid Timestamp instances instead of the default datetime.min and datetime.max (respectively), thanks @SleepingPills
read_html now raises when no tables are found and BeautifulSoup==4.2.0 is detected (GH4214)
Experimental Features¶
Added experimental CustomBusinessDay class to support DateOffsets with custom holiday calendars and custom weekmasks. (GH2301)

Note: This uses the numpy.busdaycalendar API introduced in Numpy 1.7 and therefore requires Numpy 1.7.0 or newer.

In [38]: from pandas.tseries.offsets import CustomBusinessDay

In [39]: from datetime import datetime

# As an interesting example, let's look at Egypt where
# a Friday-Saturday weekend is observed.
In [40]: weekmask_egypt = 'Sun Mon Tue Wed Thu'

# They also observe International Workers' Day so let's
# add that for a couple of years
In [41]: holidays = ['2012-05-01', datetime(2013, 5, 1), np.datetime64('2014-05-01')]

In [42]: bday_egypt = CustomBusinessDay(holidays=holidays, weekmask=weekmask_egypt)

In [43]: dt = datetime(2013, 4, 30)

In [44]: print(dt + 2 * bday_egypt)
2013-05-05 00:00:00

In [45]: dts = date_range(dt, periods=5, freq=bday_egypt)

In [46]: print(Series(dts.weekday, dts).map(Series('Mon Tue Wed Thu Fri Sat Sun'.split())))
2013-04-30    Tue
2013-05-02    Thu
2013-05-05    Sun
2013-05-06    Mon
2013-05-07    Tue
Freq: C, Length: 5, dtype: object
Bug Fixes¶

Plotting functions now raise a TypeError before trying to plot anything if the associated objects have a dtype of object (GH1818, GH3572, GH3911, GH3912), but they will try to convert object arrays to numeric arrays if possible so that you can still plot, for example, an object array with floats. This happens before any drawing takes place, which eliminates any spurious plots from showing up.
fillna methods now raise a TypeError if the value parameter is a list or tuple.
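A quick sketch of the behavior (values are illustrative; the failing call is wrapped so the snippet runs to completion):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
s.fillna(0)                 # scalar values are fine
try:
    s.fillna([0, 1, 2])     # lists and tuples raise
except TypeError as exc:
    print(exc)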
Series.str now supports iteration (GH3638). You can iterate over the individual elements of each string in the Series. Each iteration yields a Series with either a single character at each index of the original Series or NaN. For example,

In [47]: strs = 'go', 'bow', 'joe', 'slow'

In [48]: ds = Series(strs)

In [49]: for s in ds.str:
   ....:     print(s)
   ....:
0    g
1    b
2    j
3    s
Length: 4, dtype: object
0    o
1    o
2    o
3    l
Length: 4, dtype: object
0    NaN
1      w
2      e
3      o
Length: 4, dtype: object
0    NaN
1    NaN
2    NaN
3      w
Length: 4, dtype: object

In [50]: s
Out[50]:
0    NaN
1    NaN
2    NaN
3      w
Length: 4, dtype: object

In [51]: s.dropna().values.item() == 'w'
Out[51]: True

The last element yielded by the iterator will be a Series containing the last element of the longest string in the Series with all other elements being NaN. Here, since 'slow' is the longest string and there are no other strings with the same length, 'w' is the only non-null string in the yielded Series.
HDFStore

- will retain index attributes (freq, tz, name) on recreation (GH3499)
- will warn with an AttributeConflictWarning if you are attempting to append an index with a different frequency than the existing, or attempting to append an index with a different name than the existing
- support datelike columns with a timezone as data_columns (GH2852)
Non-unique index support clarified (GH3468).
- Fixed a bug where assigning a new index to a duplicate index in a DataFrame would fail (GH3468)
- Fix construction of a DataFrame with a duplicate index
- ref_locs support to allow duplicative indices across dtypes, allows iget support to always find the index (even across dtypes) (GH2194)
- applymap on a DataFrame with a non-unique index now works (removed warning) (GH2786), and fix (GH3230)
- Fix to_csv to handle non-unique columns (GH3495)
- Duplicate indexes with getitem will return items in the correct order (GH3455, GH3457) and handle missing elements like unique indices (GH3561)
- Duplicate indexes with an empty DataFrame.from_records will return a correct frame (GH3562)
- Concat producing non-unique columns when duplicates are across dtypes is fixed (GH3602)
- Allow insert/delete to non-unique columns (GH3679)
- Non-unique indexing with a slice via loc and friends fixed (GH3659)
- Extend reindex to correctly deal with non-unique indices (GH3679)
- DataFrame.itertuples() now works with frames with duplicate column names (GH3873)
- Bug in non-unique indexing via iloc (GH4017); added takeable argument to reindex for location-based taking
- Allow non-unique indexing in series via .ix/.loc and __getitem__ (GH4246)
- Fixed non-unique indexing memory allocation issue with .ix/.loc (GH4280)
DataFrame.from_records did not accept empty recarrays (GH3682)
read_html now correctly skips tests (GH3741)

Fixed a bug where DataFrame.replace with a compiled regular expression in the to_replace argument wasn't working (GH3907)

Improved network test decorator to catch IOError (and therefore URLError as well). Added with_connectivity_check decorator to allow explicitly checking a website as a proxy for seeing if there is network connectivity. Plus, new optional_args decorator factory for decorators. (GH3910, GH3914)

Fixed testing issue where too many sockets were open thus leading to a connection reset issue (GH3982, GH3985, GH4028, GH4054)
Fixed failing tests in test_yahoo, test_google where symbols were not retrieved but were being accessed (GH3982, GH3985, GH4028, GH4054)
Series.hist will now take the figure from the current environment if one is not passed

Fixed bug where a 1xN DataFrame would barf on a 1xN mask (GH4071)

Fixed running of tox under python3 where the pickle import was getting rewritten in an incompatible way (GH4062, GH4063)

Fixed bug where sharex and sharey were not being passed to grouped_hist (GH4089)
Fixed bug in DataFrame.replace where a nested dict wasn't being iterated over when regex=False (GH4115)

Fixed bug in the parsing of microseconds when using the format argument in to_datetime (GH4152)

Fixed bug in PandasAutoDateLocator where invert_xaxis triggered incorrectly MilliSecondLocator (GH3990)

Fixed bug in plotting that wasn't raising on invalid colormap for matplotlib 1.1.1 (GH4215)

Fixed the legend displaying in DataFrame.plot(kind='kde') (GH4216)

Fixed bug where Index slices weren't carrying the name attribute (GH4226)

Fixed bug in initializing DatetimeIndex with an array of strings in a certain time zone (GH4229)

Fixed bug where html5lib wasn't being properly skipped (GH4265)
Fixed bug where get_data_famafrench wasn’t using the correct file edges (GH4281)
See the full release notes or issue tracker on GitHub for a complete list.
v0.11.0 (April 22, 2013)¶This is a major release from 0.10.1 and includes many new features and enhancements along with a large number of bug fixes. The methods of Selecting Data have had quite a number of additions, and Dtype support is now full-fledged. There are also a number of important API changes that long-time pandas users should pay close attention to.
There is a new section in the documentation, 10 Minutes to Pandas, primarily geared to new users.
There is a new section in the documentation, Cookbook, a collection of useful recipes in pandas (and that we want contributions!).
There are several libraries that are now Recommended Dependencies
Selection Choices¶Starting in 0.11.0, object selection has had a number of user-requested additions in order to support more explicit location based indexing. Pandas now supports three types of multi-axis indexing.
.loc is strictly label based, will raise KeyError when the items are not found. Allowed inputs are:

- A single label like 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index)
- A list or array of labels ['a', 'b', 'c']
- A slice object with labels 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!)

See more at Selection by Label
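A minimal sketch of the three input styles (the frame and its labels are illustrative):

import pandas as pd

df = pd.DataFrame({'x': range(6)}, index=list('abcdef'))
df.loc['c']           # a single label
df.loc[['a', 'c']]    # a list of labels
df.loc['a':'c']       # a label slice: both 'a' and 'c' are included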
.iloc is strictly integer position based (from 0 to length-1 of the axis), will raise IndexError when the requested indices are out of bounds. Allowed inputs are:

- An integer, e.g. 5
- A list or array of integers [4, 3, 0]
- A slice object with ints 1:7

See more at Selection by Position
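And the positional equivalents, on the same illustrative frame:

import pandas as pd

df = pd.DataFrame({'x': range(6)}, index=list('abcdef'))
df.iloc[5]          # a single integer position
df.iloc[[4, 3, 0]]  # a list of integer positions
df.iloc[1:5]        # an integer slice; the endpoint is excluded, as in python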
.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access. .ix is the most general and will support any of the inputs to .loc and .iloc, as well as support for floating point label schemes. .ix is especially useful when dealing with mixed positional and label based hierarchical indexes.

As using integer slices with .ix have different behavior depending on whether the slice is interpreted as position based or label based, it's usually better to be explicit and use .iloc or .loc.
See more at Advanced Indexing and Advanced Hierarchical.
Starting in version 0.11.0, these methods may be deprecated in future versions.
- irow
- icol
- iget_value
See the section Selection by Position for substitutes.
Dtypes¶Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.
In [1]: df1 = DataFrame(randn(8, 1), columns=['A'], dtype='float32')

In [2]: df1
Out[2]:
          A
0  1.392665
1 -0.123497
2 -0.402761
3 -0.246604
4 -0.288433
5 -0.763434
6  2.069526
7 -1.203569

[8 rows x 1 columns]

In [3]: df1.dtypes
Out[3]:
A    float32
Length: 1, dtype: object

In [4]: df2 = DataFrame(dict(A=Series(randn(8), dtype='float16'),
   ...:                      B=Series(randn(8)),
   ...:                      C=Series(randn(8), dtype='uint8')))

In [5]: df2
Out[5]:
          A         B    C
0  0.591797 -0.038605    0
1  0.841309 -0.460478    1
2 -0.500977 -0.310458    0
3 -0.816406  0.866493  254
4 -0.207031  0.245972    0
5 -0.664062  0.319442    1
6  0.580566  1.378512    1
7 -0.965820  0.292502  255

[8 rows x 3 columns]

In [6]: df2.dtypes
Out[6]:
A    float16
B    float64
C      uint8
Length: 3, dtype: object

# here you get some upcasting
In [7]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [8]: df3
Out[8]:
          A         B      C
0  1.984462 -0.038605    0.0
1  0.717812 -0.460478    1.0
2 -0.903737 -0.310458    0.0
3 -1.063011  0.866493  254.0
4 -0.495465  0.245972    0.0
5 -1.427497  0.319442    1.0
6  2.650092  1.378512    1.0
7 -2.169390  0.292502  255.0

[8 rows x 3 columns]

In [9]: df3.dtypes
Out[9]:
A    float32
B    float64
C    float64
Length: 3, dtype: object

Dtype Conversion¶
This is lowest-common-denominator upcasting, meaning you get the dtype which can accommodate all of the types
In [10]: df3.values.dtype
Out[10]: dtype('float64')
Conversion
In [11]: df3.astype('float32').dtypes
Out[11]:
A    float32
B    float32
C    float32
Length: 3, dtype: object
Mixed Conversion
In [12]: df3['D'] = '1.'

In [13]: df3['E'] = '1'

In [14]: df3.convert_objects(convert_numeric=True).dtypes
Out[14]:
A    float32
B    float64
C    float64
D    float64
E      int64
Length: 5, dtype: object

# same, but specific dtype conversion
In [15]: df3['D'] = df3['D'].astype('float16')

In [16]: df3['E'] = df3['E'].astype('int32')

In [17]: df3.dtypes
Out[17]:
A    float32
B    float64
C    float64
D    float16
E      int32
Length: 5, dtype: object
Forcing Date coercion (and setting NaT when not datelike)

In [18]: from datetime import datetime

In [19]: s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
   ....:             Timestamp('20010104'), '20010105'], dtype='O')

In [20]: s.convert_objects(convert_dates='coerce')
Out[20]:
0   2001-01-01
1          NaT
2          NaT
3          NaT
4   2001-01-04
5   2001-01-05
Length: 6, dtype: datetime64[ns]

Dtype Gotchas¶
Platform Gotchas
Starting in 0.11.0, construction of DataFrame/Series will use default dtypes of int64 and float64, regardless of platform. This is not an apparent change from earlier versions of pandas. If you specify dtypes, they WILL be respected, however (GH2837)
The following will all result in int64 dtypes

In [21]: DataFrame([1, 2], columns=['a']).dtypes
Out[21]:
a    int64
Length: 1, dtype: object

In [22]: DataFrame({'a': [1, 2]}).dtypes
Out[22]:
a    int64
Length: 1, dtype: object

In [23]: DataFrame({'a': 1}, index=range(2)).dtypes
Out[23]:
a    int64
Length: 1, dtype: object
Keep in mind that DataFrame(np.array([1,2])) WILL result in int32 on 32-bit platforms!
Upcasting Gotchas
Performing indexing operations on integer type data can easily upcast the data. The dtype of the input data will be preserved in cases where nans are not introduced.
In [24]: dfi = df3.astype('int32')

In [25]: dfi['D'] = dfi['D'].astype('int64')

In [26]: dfi
Out[26]:
   A  B    C  D  E
0  1  0    0  1  1
1  0  0    1  1  1
2  0  0    0  1  1
3 -1  0  254  1  1
4  0  0    0  1  1
5 -1  0    1  1  1
6  2  1    1  1  1
7 -2  0  255  1  1

[8 rows x 5 columns]

In [27]: dfi.dtypes
Out[27]:
A    int32
B    int32
C    int32
D    int64
E    int32
Length: 5, dtype: object

In [28]: casted = dfi[dfi > 0]

In [29]: casted
Out[29]:
     A    B      C  D  E
0  1.0  NaN    NaN  1  1
1  NaN  NaN    1.0  1  1
2  NaN  NaN    NaN  1  1
3  NaN  NaN  254.0  1  1
4  NaN  NaN    NaN  1  1
5  NaN  NaN    1.0  1  1
6  2.0  1.0    1.0  1  1
7  NaN  NaN  255.0  1  1

[8 rows x 5 columns]

In [30]: casted.dtypes
Out[30]:
A    float64
B    float64
C    float64
D      int64
E      int32
Length: 5, dtype: object
While float dtypes are unchanged.
In [31]: df4 = df3.copy()

In [32]: df4['A'] = df4['A'].astype('float32')

In [33]: df4.dtypes
Out[33]:
A    float32
B    float64
C    float64
D    float16
E      int32
Length: 5, dtype: object

In [34]: casted = df4[df4 > 0]

In [35]: casted
Out[35]:
          A         B      C    D  E
0  1.984462       NaN    NaN  1.0  1
1  0.717812       NaN    1.0  1.0  1
2       NaN       NaN    NaN  1.0  1
3       NaN  0.866493  254.0  1.0  1
4       NaN  0.245972    NaN  1.0  1
5       NaN  0.319442    1.0  1.0  1
6  2.650092  1.378512    1.0  1.0  1
7       NaN  0.292502  255.0  1.0  1

[8 rows x 5 columns]

In [36]: casted.dtypes
Out[36]:
A    float32
B    float64
C    float64
D    float16
E      int32
Length: 5, dtype: object

Datetimes Conversion¶
Datetime64[ns] columns in a DataFrame (or a Series) allow the use of np.nan to indicate a nan value, in addition to the traditional NaT, or not-a-time. This allows convenient nan setting in a generic way. Furthermore datetime64[ns] columns are created by default, when passed datetimelike objects (this change was introduced in 0.10.1) (GH2809, GH2810)
In [37]: df = DataFrame(randn(6, 2), date_range('20010102', periods=6),
   ....:                columns=['A', 'B'])

In [38]: df['timestamp'] = Timestamp('20010103')

In [39]: df
Out[39]:
                   A         B  timestamp
2001-01-02  1.023958  0.660103 2001-01-03
2001-01-03  1.236475 -2.170629 2001-01-03
2001-01-04 -0.270630 -1.685677 2001-01-03
2001-01-05 -0.440747 -0.115070 2001-01-03
2001-01-06 -0.632102 -0.585977 2001-01-03
2001-01-07 -1.444787 -0.201135 2001-01-03

[6 rows x 3 columns]

# datetime64[ns] out of the box
In [40]: df.get_dtype_counts()
Out[40]:
datetime64[ns]    1
float64           2
Length: 2, dtype: int64

# use the traditional nan, which is mapped to NaT internally
In [41]: df.loc[df.index[2:4], ['A', 'timestamp']] = np.nan

In [42]: df
Out[42]:
                   A         B  timestamp
2001-01-02  1.023958  0.660103 2001-01-03
2001-01-03  1.236475 -2.170629 2001-01-03
2001-01-04       NaN -1.685677        NaT
2001-01-05       NaN -0.115070        NaT
2001-01-06 -0.632102 -0.585977 2001-01-03
2001-01-07 -1.444787 -0.201135 2001-01-03

[6 rows x 3 columns]
Astype conversion on datetime64[ns] to object, implicitly converts NaT to np.nan

In [43]: import datetime

In [44]: s = Series([datetime.datetime(2001, 1, 2, 0, 0) for i in range(3)])

In [45]: s.dtype
Out[45]: dtype('<M8[ns]')

In [46]: s[1] = np.nan

In [47]: s
Out[47]:
0   2001-01-02
1          NaT
2   2001-01-02
Length: 3, dtype: datetime64[ns]

In [48]: s.dtype
Out[48]: dtype('<M8[ns]')

In [49]: s = s.astype('O')

In [50]: s
Out[50]:
0    2001-01-02 00:00:00
1                    NaT
2    2001-01-02 00:00:00
Length: 3, dtype: object

In [51]: s.dtype
Out[51]: dtype('O')

API changes¶
Enhancements¶
- Added to_series() method to indices, to facilitate the creation of indexers (GH3275)
HDFStore

- added the method select_column to select a single column from a table as a Series.
- deprecated the unique method, can be replicated by select_column(key, column).unique()
- min_itemsize parameter to append will now automatically create data_columns for passed keys
Improved performance of df.to_csv() by up to 10x in some cases. (GH3059)
Numexpr is now a Recommended Dependency, to accelerate certain types of numerical and boolean operations

Bottleneck is now a Recommended Dependency, to accelerate certain types of nan operations
HDFStore support

read_hdf/to_hdf API similar to read_csv/to_csv

In [52]: df = DataFrame(dict(A=lrange(5), B=lrange(5)))

In [53]: df.to_hdf('store.h5', 'table', append=True)

In [54]: read_hdf('store.h5', 'table', where=['index>2'])
Out[54]:
   A  B
3  3  3
4  4  4

[2 rows x 2 columns]
get
from stores, e.g.store.df == store['df']
new keywords
iterator=boolean
, andchunksize=number_in_a_chunk
are provided to support iteration onselect
andselect_as_multiple
(GH3076)You can now select timestamps from an unordered timeseries similarly to an ordered timeseries (GH2437)
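A small sketch of the dotted access and chunked iteration above, assuming PyTables is installed (the file name 'demo.h5' is illustrative):

import pandas as pd

with pd.HDFStore('demo.h5', mode='w') as store:
    store.append('df', pd.DataFrame({'A': range(9)}))
    print(store.df.equals(store['df']))   # dotted attribute access: True
    for chunk in store.select('df', chunksize=3):
        print(len(chunk))                 # 3, 3, 3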
You can now select with a string from a DataFrame with a datelike index, in a similar way to a Series (GH3070)
In [55]: idx = date_range("2001-10-1", periods=5, freq='M')

In [56]: ts = Series(np.random.rand(len(idx)), index=idx)

In [57]: ts['2001']
Out[57]:
2001-10-31    0.663256
2001-11-30    0.079126
2001-12-31    0.587699
Freq: M, Length: 3, dtype: float64

In [58]: df = DataFrame(dict(A=ts))

In [59]: df['2001']
Out[59]:
                   A
2001-10-31  0.663256
2001-11-30  0.079126
2001-12-31  0.587699

[3 rows x 1 columns]
Squeeze to possibly remove length 1 dimensions from an object.

In [60]: p = Panel(randn(3, 4, 4), items=['ItemA', 'ItemB', 'ItemC'],
   ....:           major_axis=date_range('20010102', periods=4),
   ....:           minor_axis=['A', 'B', 'C', 'D'])

In [61]: p
Out[61]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 4 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2001-01-02 00:00:00 to 2001-01-05 00:00:00
Minor_axis axis: A to D

In [62]: p.reindex(items=['ItemA']).squeeze()
Out[62]:
                   A         B         C         D
2001-01-02 -1.203403  0.425882 -0.436045 -0.982462
2001-01-03  0.348090 -0.969649  0.121731  0.202798
2001-01-04  1.215695 -0.218549 -0.631381 -0.337116
2001-01-05  0.404238  0.907213 -0.865657  0.483186

[4 rows x 4 columns]

In [63]: p.reindex(items=['ItemA'], minor=['B']).squeeze()
Out[63]:
2001-01-02    0.425882
2001-01-03   -0.969649
2001-01-04   -0.218549
2001-01-05    0.907213
Freq: D, Name: B, Length: 4, dtype: float64

In pd.io.data.Options,
- Fix bug when trying to fetch data for the current month when already past expiry.
- Now using lxml to scrape html instead of BeautifulSoup (lxml was faster).
- New instance variables for calls and puts are automatically created when a method that creates them is called. This works for the current month where the instance variables are simply calls and puts. Also works for future expiry months and saves the instance variable as callsMMYY or putsMMYY, where MMYY are, respectively, the month and year of the option's expiry.
- Options.get_near_stock_price now allows the user to specify the month for which to get relevant options data.
- Options.get_forward_data now has optional kwargs near and above_below. This allows the user to specify if they would like to only return forward looking data for options near the current stock price. This just obtains the data from Options.get_near_stock_price instead of Options.get_xxx_data() (GH2758).

Cursor coordinate information is now displayed in time-series plots.
added option display.max_seq_items to control the number of elements printed per sequence pprinting it. (GH2979)
added option display.chop_threshold to control display of small numerical values. (GH2739)
added option display.max_info_rows to prevent verbose_info from being calculated for frames above 1M rows (configurable). (GH2807, GH2918)
value_counts() now accepts a “normalize” argument, for normalized histograms. (GH2710).
DataFrame.from_records now accepts not only dicts but any instance of the collections.Mapping ABC.
added option display.mpl_style providing a sleeker visual style for plots. Based on https://gist.github.com/huyng/816622 (GH3075).
Treat boolean values as integers (values 1 and 0) for numeric operations. (GH2641)
to_html() now accepts an optional "escape" argument to control reserved HTML character escaping (enabled by default) and escapes &, in addition to < and >. (GH2919)
See the full release notes or issue tracker on GitHub for a complete list.
v0.10.1 (January 22, 2013)¶This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, there is substantial new HDFStore functionality contributed by Jeff Reback.
An undesired API breakage with functions taking the inplace option has been reverted and deprecation warnings added.

Functions taking an inplace option return the calling object as before. A deprecation message has been added.

You may need to upgrade your existing data files. Please visit the compatibility section in the main docs.
You can designate (and index) certain columns that you want to be able to perform queries on a table, by passing a list to data_columns
In [1]: store = HDFStore('store.h5')

In [2]: df = DataFrame(randn(8, 3), index=date_range('1/1/2000', periods=8),
   ...:                columns=['A', 'B', 'C'])

In [3]: df['string'] = 'foo'

In [4]: df.loc[df.index[4:6], 'string'] = np.nan

In [5]: df.loc[df.index[7:9], 'string'] = 'bar'

In [6]: df['string2'] = 'cool'

In [7]: df
Out[7]:
                   A         B         C string string2
2000-01-01  1.885136 -0.183873  2.550850    foo    cool
2000-01-02  0.180759 -1.117089  0.061462    foo    cool
2000-01-03 -0.294467 -0.591411 -0.876691    foo    cool
2000-01-04  3.127110  1.451130  0.045152    foo    cool
2000-01-05 -0.242846  1.195819  1.533294    NaN    cool
2000-01-06  0.820521 -0.281201  1.651561    NaN    cool
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool
2000-01-08 -2.290958 -1.601262 -0.256718    bar    cool

[8 rows x 5 columns]

# on-disk operations
In [8]: store.append('df', df, data_columns=['B', 'C', 'string', 'string2'])

In [9]: store.select('df', "B>0 and string=='foo'")
Out[9]:
                   A         B         C string string2
2000-01-04  3.127110  1.451130  0.045152    foo    cool
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool

[2 rows x 5 columns]

# this is in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == 'foo')]
Out[10]:
                   A         B         C string string2
2000-01-04  3.127110  1.451130  0.045152    foo    cool
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool

[2 rows x 5 columns]
Retrieving unique values in an indexable or data column.
# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique('df', 'index')
store.unique('df', 'string')
You can now store datetime64 in data columns

In [11]: df_mixed = df.copy()

In [12]: df_mixed['datetime64'] = Timestamp('20010102')

In [13]: df_mixed.loc[df_mixed.index[3:4], ['A', 'B']] = np.nan

In [14]: store.append('df_mixed', df_mixed)

In [15]: df_mixed1 = store.select('df_mixed')

In [16]: df_mixed1
Out[16]:
                   A         B         C string string2 datetime64
2000-01-01  1.885136 -0.183873  2.550850    foo    cool 2001-01-02
2000-01-02  0.180759 -1.117089  0.061462    foo    cool 2001-01-02
2000-01-03 -0.294467 -0.591411 -0.876691    foo    cool 2001-01-02
2000-01-04       NaN       NaN  0.045152    foo    cool 2001-01-02
2000-01-05 -0.242846  1.195819  1.533294    NaN    cool 2001-01-02
2000-01-06  0.820521 -0.281201  1.651561    NaN    cool 2001-01-02
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool 2001-01-02
2000-01-08 -2.290958 -1.601262 -0.256718    bar    cool 2001-01-02

[8 rows x 6 columns]

In [17]: df_mixed1.get_dtype_counts()
Out[17]:
datetime64[ns]    1
float64           3
object            2
Length: 3, dtype: int64
You can pass the columns keyword to select to filter a list of the return columns; this is equivalent to passing a Term('columns', list_of_columns_to_filter)

In [18]: store.select('df', columns=['A', 'B'])
Out[18]:
                   A         B
2000-01-01  1.885136 -0.183873
2000-01-02  0.180759 -1.117089
2000-01-03 -0.294467 -0.591411
2000-01-04  3.127110  1.451130
2000-01-05 -0.242846  1.195819
2000-01-06  0.820521 -0.281201
2000-01-07 -0.034086  0.252394
2000-01-08 -2.290958 -1.601262

[8 rows x 2 columns]
HDFStore now serializes multi-index dataframes when appending tables.

In [19]: index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
   ....:                            ['one', 'two', 'three']],
   ....:                    labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
   ....:                            [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   ....:                    names=['foo', 'bar'])

In [20]: df = DataFrame(np.random.randn(10, 3), index=index,
   ....:                columns=['A', 'B', 'C'])

In [21]: df
Out[21]:
                  A         B         C
foo bar
foo one    0.239369  0.174122 -1.131794
    two   -1.948006  0.980347 -0.674429
    three -0.361633 -0.761218  1.768215
bar one    0.152288 -0.862613 -0.210968
    two   -0.859278  1.498195  0.462413
baz two   -0.647604  1.511487 -0.727189
    three -0.342928 -0.007364  1.427674
qux one    0.104020  2.052171 -1.230963
    two   -0.019240 -1.713238  0.838912
    three -0.637855  0.215109 -1.515362

[10 rows x 3 columns]

In [22]: store.append('mi', df)

In [23]: store.select('mi')
Out[23]:
                  A         B         C
foo bar
foo one    0.239369  0.174122 -1.131794
    two   -1.948006  0.980347 -0.674429
    three -0.361633 -0.761218  1.768215
bar one    0.152288 -0.862613 -0.210968
    two   -0.859278  1.498195  0.462413
baz two   -0.647604  1.511487 -0.727189
    three -0.342928 -0.007364  1.427674
qux one    0.104020  2.052171 -1.230963
    two   -0.019240 -1.713238  0.838912
    three -0.637855  0.215109 -1.515362

[10 rows x 3 columns]

# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]:
                A         B         C
foo bar
bar one  0.152288 -0.862613 -0.210968
    two -0.859278  1.498195  0.462413

[2 rows x 3 columns]
Multi-table creation via append_to_multiple and selection via select_as_multiple can create/select from multiple tables and return a combined result, by using where on a selector table.

In [25]: df_mt = DataFrame(randn(8, 6), index=date_range('1/1/2000', periods=8),
   ....:                   columns=['A', 'B', 'C', 'D', 'E', 'F'])

In [26]: df_mt['foo'] = 'bar'

# you can also create the tables individually
In [27]: store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None},
   ....:                          df_mt, selector='df1_mt')

In [28]: store
Out[28]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame_table  (typ->appendable,nrows->8,ncols->5,indexers->[index],dc->[B,C,string,string2])
/df1_mt        frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A,B])
/df2_mt        frame_table  (typ->appendable,nrows->8,ncols->5,indexers->[index])
/df_mixed      frame_table  (typ->appendable,nrows->8,ncols->6,indexers->[index])
/mi            frame_table  (typ->appendable_multi,nrows->10,ncols->5,indexers->[index],dc->[bar,foo])

# individual tables were created
In [29]: store.select('df1_mt')
Out[29]:
                   A         B
2000-01-01  1.586924 -0.447974
2000-01-02 -0.102206  0.870302
2000-01-03  1.249874  1.458210
2000-01-04 -0.616293  0.150468
2000-01-05 -0.431163  0.016640
2000-01-06  0.800353 -0.451572
2000-01-07  1.239198  0.185437
2000-01-08 -0.040863  0.290110

[8 rows x 2 columns]

In [30]: store.select('df2_mt')
Out[30]:
                   C         D         E         F  foo
2000-01-01 -1.573998  0.630925 -0.071659 -1.277640  bar
2000-01-02  1.275280 -1.199212  1.060780  1.673018  bar
2000-01-03 -0.710542  0.825392  1.557329  1.993441  bar
2000-01-04  0.132104  0.580923 -0.128750  1.445964  bar
2000-01-05  0.904578 -1.645852 -0.688741  0.228006  bar
2000-01-06  0.831767  0.228760  0.932498 -2.200069  bar
2000-01-07 -0.540770 -0.370038  1.298390  1.662964  bar
2000-01-08 -0.096145  1.717830 -0.462446 -0.112019  bar

[8 rows x 5 columns]

# as a multiple
In [31]: store.select_as_multiple(['df1_mt', 'df2_mt'], where=['A>0', 'B>0'],
   ....:                          selector='df1_mt')
Out[31]:
                   A         B         C         D         E         F  foo
2000-01-03  1.249874  1.458210 -0.710542  0.825392  1.557329  1.993441  bar
2000-01-07  1.239198  0.185437 -0.540770 -0.370038  1.298390  1.662964  bar

[2 rows x 7 columns]
Enhancements

- HDFStore now can read native PyTables table format tables
- You can pass nan_rep = 'my_nan_rep' to append, to change the default nan representation on disk (which converts to/from np.nan); this defaults to nan.
- You can pass index to append. This defaults to True. This will automagically create indices on the indexables and data columns of the table
- You can pass chunksize=an integer to append, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
- You can pass expectedrows=an integer to the first append, to set the TOTAL number of rows that PyTables will expect. This will optimize read/write performance. (A sketch of these append/select options follows this list.)
- Select now supports passing start and stop to provide selection space limiting in selection.
- DataFrame.merge can now handle combinatorial sizes too large for 64-bit integer (GH2690)
- Plotting: a logx parameter to change the x-axis to log scale (GH2327)
- ExcelFile: a kind argument to specify the file type (GH2613)
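A minimal sketch of the HDFStore options named above, assuming PyTables is installed (the file name and sizes are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'])
with pd.HDFStore('opts.h5', mode='w') as store:
    store.append('df', df,
                 nan_rep='my_nan_rep',  # on-disk NaN representation
                 index=True,            # create indices on indexables/data columns
                 chunksize=25,          # writing chunksize
                 expectedrows=100)      # size hint passed to PyTables
    # start/stop limit the selection space
    print(len(store.select('df', start=0, stop=10)))  # 10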
Bug Fixes

- HDFStore tables can now store float32 types correctly (cannot be mixed with float64 however)
- raise on pattern in HDFStore expressions when pattern is not a valid regex (GH2694)

See the full release notes or issue tracker on GitHub for a complete list.
v0.10.0 (December 17, 2012)¶This is a major release from 0.9.1 and includes many new features and enhancements along with a large number of bug fixes. There are also a number of important API changes that long-time pandas users should pay close attention to.
File parsing new features¶The delimited file parsing engine (the guts of read_csv and read_table) has been rewritten from the ground up and now uses a fraction of the amount of memory while parsing, while being 40% or more faster in most use cases (in some cases much faster).
There are also many new features:
- Much-improved Unicode handling via the encoding option.
- Column filtering (usecols)
- Dtype specification (dtype argument)
- Ability to yield NumPy record arrays (as_recarray)
- High performance delim_whitespace option
- Easier CSV dialect options: escapechar, lineterminator, quotechar, etc.

A short read_csv sketch of several of these options follows below.

Deprecated DataFrame BINOP TimeSeries special case behavior
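As a quick illustration of the column-filtering and dtype options (the inline data is illustrative):

from io import StringIO
import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6"
df = pd.read_csv(StringIO(data),
                 usecols=['a', 'c'],      # column filtering
                 dtype={'a': 'float64'})  # explicit per-column dtype
print(df.dtypes)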
The default behavior of binary operations between a DataFrame and a Series has always been to align on the DataFrame’s columns and broadcast down the rows, except in the special case that the DataFrame contains time series. Since there are now method for each binary operator enabling you to specify how you want to broadcast, we are phasing out this special case (Zen of Python: Special cases aren’t special enough to break the rules). Here’s what I’m talking about:
In [1]: import pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(6, 4),
   ...:                   index=pd.date_range('1/1/2000', periods=6))

In [3]: df
Out[3]:
                   0         1         2         3
2000-01-01 -0.134024 -0.205969  1.348944 -1.198246
2000-01-02 -1.626124  0.982041  0.059493 -0.460111
2000-01-03 -1.565401 -0.025706  0.942864  2.502156
2000-01-04 -0.302741  0.261551 -0.066342  0.897097
2000-01-05  0.268766 -1.225092  0.582752 -1.490764
2000-01-06 -0.639757 -0.952750 -0.892402  0.505987

[6 rows x 4 columns]

# deprecated now
In [4]: df - df[0]
Out[4]:
            2000-01-01 00:00:00  2000-01-02 00:00:00  2000-01-03 00:00:00  \
2000-01-01                  NaN                  NaN                  NaN
2000-01-02                  NaN                  NaN                  NaN
2000-01-03                  NaN                  NaN                  NaN
2000-01-04                  NaN                  NaN                  NaN
2000-01-05                  NaN                  NaN                  NaN
2000-01-06                  NaN                  NaN                  NaN

            2000-01-04 00:00:00  2000-01-05 00:00:00  2000-01-06 00:00:00   0  \
2000-01-01                  NaN                  NaN                  NaN NaN
2000-01-02                  NaN                  NaN                  NaN NaN
2000-01-03                  NaN                  NaN                  NaN NaN
2000-01-04                  NaN                  NaN                  NaN NaN
2000-01-05                  NaN                  NaN                  NaN NaN
2000-01-06                  NaN                  NaN                  NaN NaN

             1   2   3
2000-01-01 NaN NaN NaN
2000-01-02 NaN NaN NaN
2000-01-03 NaN NaN NaN
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN

[6 rows x 10 columns]

# Change your code to
In [5]: df.sub(df[0], axis=0)  # align on axis 0 (rows)
Out[5]:
              0         1         2         3
2000-01-01  0.0 -0.071946  1.482967 -1.064223
2000-01-02  0.0  2.608165  1.685618  1.166013
2000-01-03  0.0  1.539695  2.508265  4.067556
2000-01-04  0.0  0.564293  0.236399  1.199839
2000-01-05  0.0 -1.493857  0.313986 -1.759530
2000-01-06  0.0 -0.312993 -0.252645  1.145744

[6 rows x 4 columns]
You will get a deprecation warning in the 0.10.x series, and the deprecated functionality will be removed in 0.11 or later.
Altered resample default behavior
The default time series resample binning behavior of daily D and higher frequencies has been changed to closed='left', label='left'. Lower frequencies are unaffected. The prior defaults were causing a great deal of confusion for users, especially resampling data to daily frequency (which labeled the aggregated group with the end of the interval: the next day).
In [1]: dates = pd.date_range('1/1/2000', '1/5/2000', freq='4h')

In [2]: series = Series(np.arange(len(dates)), index=dates)

In [3]: series
Out[3]:
2000-01-01 00:00:00     0
2000-01-01 04:00:00     1
2000-01-01 08:00:00     2
2000-01-01 12:00:00     3
2000-01-01 16:00:00     4
2000-01-01 20:00:00     5
2000-01-02 00:00:00     6
2000-01-02 04:00:00     7
2000-01-02 08:00:00     8
2000-01-02 12:00:00     9
2000-01-02 16:00:00    10
2000-01-02 20:00:00    11
2000-01-03 00:00:00    12
2000-01-03 04:00:00    13
2000-01-03 08:00:00    14
2000-01-03 12:00:00    15
2000-01-03 16:00:00    16
2000-01-03 20:00:00    17
2000-01-04 00:00:00    18
2000-01-04 04:00:00    19
2000-01-04 08:00:00    20
2000-01-04 12:00:00    21
2000-01-04 16:00:00    22
2000-01-04 20:00:00    23
2000-01-05 00:00:00    24
Freq: 4H, dtype: int64

In [4]: series.resample('D', how='sum')
Out[4]:
2000-01-01     15
2000-01-02     51
2000-01-03     87
2000-01-04    123
2000-01-05     24
Freq: D, dtype: int64

In [5]: # old behavior

In [6]: series.resample('D', how='sum', closed='right', label='right')
Out[6]:
2000-01-01      0
2000-01-02     21
2000-01-03     57
2000-01-04     93
2000-01-05    129
Freq: D, dtype: int64
Infinity and negative infinity are no longer treated as NA by isnull and notnull. That they ever were was a relic of early pandas. This behavior can be re-enabled globally by the mode.use_inf_as_null option:

In [6]: s = pd.Series([1.5, np.inf, 3.4, -np.inf])

In [7]: pd.isnull(s)
Out[7]:
0    False
1    False
2    False
3    False
Length: 4, dtype: bool

In [8]: s.fillna(0)
Out[8]:
0    1.500000
1         inf
2    3.400000
3        -inf
Length: 4, dtype: float64

In [9]: pd.set_option('use_inf_as_null', True)

In [10]: pd.isnull(s)
Out[10]:
0    False
1     True
2    False
3     True
Length: 4, dtype: bool

In [11]: s.fillna(0)
Out[11]:
0    1.5
1    0.0
2    3.4
3    0.0
Length: 4, dtype: float64

In [12]: pd.reset_option('use_inf_as_null')
Methods with the inplace option now all return None instead of the calling object. E.g. code written like df = df.fillna(0, inplace=True) may stop working. To fix, simply delete the unnecessary variable assignment.

pandas.merge no longer sorts the group keys (sort=False) by default. This was done for performance reasons: the group-key sorting is often one of the more expensive parts of the computation and is often unnecessary.

The default column names for a file with no header have been changed to the integers 0 through N - 1. This is to create consistency with the DataFrame constructor with no columns specified. The v0.9.0 behavior (names X0, X1, ...) can be reproduced by specifying prefix='X':

In [13]: data = 'a,b,c\n1,Yes,2\n3,No,4'

In [14]: print(data)
a,b,c
1,Yes,2
3,No,4

In [15]: pd.read_csv(StringIO(data), header=None)
Out[15]:
   0    1  2
0  a    b  c
1  1  Yes  2
2  3   No  4

[3 rows x 3 columns]

In [16]: pd.read_csv(StringIO(data), header=None, prefix='X')
Out[16]:
  X0   X1 X2
0  a    b  c
1  1  Yes  2
2  3   No  4

[3 rows x 3 columns]
Values like 'Yes' and 'No' are not interpreted as boolean by default, though this can be controlled by new true_values and false_values arguments:

In [17]: print(data)
a,b,c
1,Yes,2
3,No,4

In [18]: pd.read_csv(StringIO(data))
Out[18]:
   a    b  c
0  1  Yes  2
1  3   No  4

[2 rows x 3 columns]

In [19]: pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])
Out[19]:
   a      b  c
0  1   True  2
1  3  False  4

[2 rows x 3 columns]
The file parsers will not recognize non-string values arising from a converter function as NA if passed in the na_values argument. It's better to do post-processing using the replace function instead.

Calling fillna on Series or DataFrame with no arguments is no longer valid code. You must either specify a fill value or an interpolation method:

In [20]: s = Series([np.nan, 1., 2., np.nan, 4])

In [21]: s
Out[21]:
0    NaN
1    1.0
2    2.0
3    NaN
4    4.0
Length: 5, dtype: float64

In [22]: s.fillna(0)
Out[22]:
0    0.0
1    1.0
2    2.0
3    0.0
4    4.0
Length: 5, dtype: float64

In [23]: s.fillna(method='pad')
Out[23]:
0    NaN
1    1.0
2    2.0
3    2.0
4    4.0
Length: 5, dtype: float64
Convenience methods ffill and bfill have been added:

In [24]: s.ffill()
Out[24]:
0    NaN
1    1.0
2    2.0
3    2.0
4    4.0
Length: 5, dtype: float64
Series.apply will now operate on a returned value from the applied function, that is itself a series, and possibly upcast the result to a DataFrame

In [25]: def f(x):
   ....:     return Series([x, x**2], index=['x', 'x^2'])

In [26]: s = Series(np.random.rand(5))

In [27]: s
Out[27]:
0    0.717478
1    0.815199
2    0.452478
3    0.848385
4    0.235477
Length: 5, dtype: float64

In [28]: s.apply(f)
Out[28]:
          x       x^2
0  0.717478  0.514775
1  0.815199  0.664550
2  0.452478  0.204737
3  0.848385  0.719757
4  0.235477  0.055449

[5 rows x 2 columns]
New API functions for working with pandas options (GH2097):

- get_option / set_option - get/set the value of an option. Partial names are accepted.
- reset_option - reset one or more options to their default value. Partial names are accepted.
- describe_option - print a description of one or more options. When called with no arguments, print all registered options.

Note: set_printoptions / reset_printoptions are now deprecated (but functioning); the print options now live under "display.XYZ". For example:
In [29]: get_option("display.max_rows")
Out[29]: 15
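A small sketch of the whole option API round trip (the option name and value are illustrative):

import pandas as pd

pd.set_option('display.max_rows', 100)
print(pd.get_option('display.max_rows'))    # 100
pd.reset_option('display.max_rows')         # back to the default
pd.describe_option('display.max_rows')      # prints a description of the option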
to_string() methods now always return unicode strings (GH2224).
Instead of printing the summary information, pandas now splits the string representation across multiple rows by default:
In [30]: wide_frame = DataFrame(randn(5, 16))

In [31]: wide_frame
Out[31]:
          0         1         2         3         4         5         6  \
0 -0.681624  0.191356  1.180274 -0.834179  0.703043  0.166568 -0.583599
1  0.441522 -0.316864 -0.017062  1.570114 -0.360875 -0.880096  0.235532
2 -0.412451 -0.462580  0.422194  0.288403 -0.487393 -0.777639  0.055865
3 -0.277255  1.331263  0.585174 -0.568825 -0.719412  1.191340 -0.456362
4 -1.642511  0.432560  1.218080 -0.564705 -0.581790  0.286071  0.048725

          7         8         9        10        11        12        13  \
0 -1.201796 -1.422811 -0.882554  1.209871 -0.941235  0.863067 -0.336232
1  0.207232 -1.983857 -1.702547 -1.621234 -0.906840  1.014601 -0.475108
2  1.383381  0.085638  0.246392  0.965887  0.246354 -0.727728 -0.094414
3  0.089931  0.776079  0.752889 -1.195795 -1.425911 -0.548829  0.774225
4  1.002440  1.276582  0.054399  0.241963 -0.471786  0.314510 -0.059986

         14        15
0 -0.976847  0.033862
1 -0.358944  1.262942
2 -0.276854  0.158399
3  0.740501  1.510263
4 -2.069319 -1.115104

[5 rows x 16 columns]
The old behavior of printing out summary information can be achieved via the ‘expand_frame_repr’ print option:
In [32]: pd.set_option('expand_frame_repr', False)

In [33]: wide_frame
Out[33]:
          0         1         2         3         4         5         6         7         8         9        10        11        12        13        14        15
0 -0.681624  0.191356  1.180274 -0.834179  0.703043  0.166568 -0.583599 -1.201796 -1.422811 -0.882554  1.209871 -0.941235  0.863067 -0.336232 -0.976847  0.033862
1  0.441522 -0.316864 -0.017062  1.570114 -0.360875 -0.880096  0.235532  0.207232 -1.983857 -1.702547 -1.621234 -0.906840  1.014601 -0.475108 -0.358944  1.262942
2 -0.412451 -0.462580  0.422194  0.288403 -0.487393 -0.777639  0.055865  1.383381  0.085638  0.246392  0.965887  0.246354 -0.727728 -0.094414 -0.276854  0.158399
3 -0.277255  1.331263  0.585174 -0.568825 -0.719412  1.191340 -0.456362  0.089931  0.776079  0.752889 -1.195795 -1.425911 -0.548829  0.774225  0.740501  1.510263
4 -1.642511  0.432560  1.218080 -0.564705 -0.581790  0.286071  0.048725  1.002440  1.276582  0.054399  0.241963 -0.471786  0.314510 -0.059986 -2.069319 -1.115104

[5 rows x 16 columns]
The width of each line can be changed via ‘line_width’ (80 by default):
In [34]: pd.set_option('line_width', 40)
line_width has been deprecated, use display.width instead (currently both are identical)

In [35]: wide_frame
Out[35]:
          0         1         2  \
0 -0.681624  0.191356  1.180274
1  0.441522 -0.316864 -0.017062
2 -0.412451 -0.462580  0.422194
3 -0.277255  1.331263  0.585174
4 -1.642511  0.432560  1.218080

          3         4         5  \
0 -0.834179  0.703043  0.166568
1  1.570114 -0.360875 -0.880096
2  0.288403 -0.487393 -0.777639
3 -0.568825 -0.719412  1.191340
4 -0.564705 -0.581790  0.286071

          6         7         8  \
0 -0.583599 -1.201796 -1.422811
1  0.235532  0.207232 -1.983857
2  0.055865  1.383381  0.085638
3 -0.456362  0.089931  0.776079
4  0.048725  1.002440  1.276582

          9        10        11  \
0 -0.882554  1.209871 -0.941235
1 -1.702547 -1.621234 -0.906840
2  0.246392  0.965887  0.246354
3  0.752889 -1.195795 -1.425911
4  0.054399  0.241963 -0.471786

         12        13        14  \
0  0.863067 -0.336232 -0.976847
1  1.014601 -0.475108 -0.358944
2 -0.727728 -0.094414 -0.276854
3 -0.548829  0.774225  0.740501
4  0.314510 -0.059986 -2.069319

         15
0  0.033862
1  1.262942
2  0.158399
3  1.510263
4 -1.115104

[5 rows x 16 columns]

Updated PyTables Support¶
Docs for PyTables Table format & several enhancements to the api. Here is a taste of what to expect.
In [36]: store = HDFStore('store.h5')

In [37]: df = DataFrame(randn(8, 3), index=date_range('1/1/2000', periods=8),
   ....:                columns=['A', 'B', 'C'])

In [38]: df
Out[38]:
                   A         B         C
2000-01-01 -0.369325 -1.502617 -0.376280
2000-01-02  0.511936 -0.116412 -0.625256
2000-01-03 -0.550627  1.261433 -0.552429
2000-01-04  1.695803 -1.025917 -0.910942
2000-01-05  0.426805 -0.131749  0.432600
2000-01-06  0.044671 -0.341265  1.844536
2000-01-07 -2.036047  0.000830 -0.955697
2000-01-08 -0.898872 -0.725411  0.059904

[8 rows x 3 columns]

# appending data frames
In [39]: df1 = df[0:4]

In [40]: df2 = df[4:]

In [41]: store.append('df', df1)

In [42]: store.append('df', df2)

In [43]: store
Out[43]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame_table  (typ->appendable,nrows->8,ncols->3,indexers->[index])

# selecting the entire store
In [44]: store.select('df')
Out[44]:
                   A         B         C
2000-01-01 -0.369325 -1.502617 -0.376280
2000-01-02  0.511936 -0.116412 -0.625256
2000-01-03 -0.550627  1.261433 -0.552429
2000-01-04  1.695803 -1.025917 -0.910942
2000-01-05  0.426805 -0.131749  0.432600
2000-01-06  0.044671 -0.341265  1.844536
2000-01-07 -2.036047  0.000830 -0.955697
2000-01-08 -0.898872 -0.725411  0.059904

[8 rows x 3 columns]
In [45]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   ....:            major_axis=date_range('1/1/2000', periods=5),
   ....:            minor_axis=['A', 'B', 'C', 'D'])

In [46]: wp
Out[46]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

# storing a panel
In [47]: store.append('wp', wp)

# selecting via A QUERY
In [48]: store.select('wp', "major_axis>20000102 and minor_axis=['A','B']")
Out[48]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to B

# removing data from tables
In [49]: store.remove('wp', "major_axis>20000103")
Out[49]: 8

In [50]: store.select('wp')
Out[50]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-03 00:00:00
Minor_axis axis: A to D

# deleting a store
In [51]: del store['df']

In [52]: store
Out[52]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/wp            wide_table   (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])
Enhancements
added the ability to use hierarchical keys
In [53]: store.put('foo/bar/bah', df)

In [54]: store.append('food/orange', df)

In [55]: store.append('food/apple', df)

In [56]: store
Out[56]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/foo/bar/bah   frame        (shape->[8,3])
/food/apple    frame_table  (typ->appendable,nrows->8,ncols->3,indexers->[index])
/food/orange   frame_table  (typ->appendable,nrows->8,ncols->3,indexers->[index])
/wp            wide_table   (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])

# remove all nodes under this level
In [57]: store.remove('food')

In [58]: store
Out[58]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/foo/bar/bah   frame        (shape->[8,3])
/wp            wide_table   (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])
added mixed-dtype support!
In [59]: df['string'] = 'string'

In [60]: df['int'] = 1

In [61]: store.append('df', df)

In [62]: df1 = store.select('df')

In [63]: df1
Out[63]:
                   A         B         C  string  int
2000-01-01 -0.369325 -1.502617 -0.376280  string    1
2000-01-02  0.511936 -0.116412 -0.625256  string    1
2000-01-03 -0.550627  1.261433 -0.552429  string    1
2000-01-04  1.695803 -1.025917 -0.910942  string    1
2000-01-05  0.426805 -0.131749  0.432600  string    1
2000-01-06  0.044671 -0.341265  1.844536  string    1
2000-01-07 -2.036047  0.000830 -0.955697  string    1
2000-01-08 -0.898872 -0.725411  0.059904  string    1

[8 rows x 5 columns]

In [64]: df1.get_dtype_counts()
Out[64]:
float64    3
int64      1
object     1
Length: 3, dtype: int64
performance improvements on table writing
support for arbitrarily indexed dimensions
SparseSeries now has a density property (GH2384)
enable Series.str.strip/lstrip/rstrip methods to take an input argument to strip arbitrary characters (GH2411)
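A minimal sketch of the new argument (the Series below is invented for illustration):

s = Series(['xx_ab_xx', 'x_cd_x'])
s.str.strip('x')    # strips 'x' from both ends -> ['_ab_', '_cd_']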
implement value_vars in melt to limit values to certain columns and add melt to pandas namespace (GH2412)
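A minimal sketch of value_vars (the frame is invented for illustration):

df = DataFrame({'first': ['John', 'Mary'], 'last': ['Doe', 'Bo'],
                'height': [5.5, 6.0], 'weight': [130, 150]})
melt(df, id_vars=['first', 'last'], value_vars=['height'])
# melts only 'height'; result columns: first, last, variable, value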
Bug Fixes¶
- added Term method of specifying where conditions (GH1996).
- del store['df'] now calls store.remove('df') for store deletion
- the min_itemsize parameter can be specified in table creation to force a minimum size for indexing columns (the previous implementation would set the column size based on the first append); see the sketch after this list
- indexing via create_table_index (requires PyTables >= 2.3) (GH698).
- appending on a store would fail if the table was not first created via put
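A hedged sketch of min_itemsize (the key name and size are illustrative, not from the original notes):

# reserve 50 bytes per string value so longer strings can be appended later
store.append('df_strings', df, min_itemsize={'values': 50})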
Compatibility¶

Version 0.10 of HDFStore is backwards compatible for reading tables created in a prior version of pandas; however, query terms using the prior (undocumented) methodology are unsupported. You must read in the entire file and write it out using the new format to take advantage of the updates.
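A hypothetical migration sketch (the file names are illustrative, not from the original notes):

old = HDFStore('legacy.h5')        # file written by an older pandas
new = HDFStore('migrated.h5')
for key in old.keys():
    new.append(key, old.select(key))   # read each object in full, rewrite in the new table format
old.close()
new.close()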
Added experimental support for Panel4D and factory functions to create n-dimensional named panels. See the docs for NDim. Here is a taste of what to expect.
In [65]: p4d = Panel4D(randn(2, 2, 5, 4),
   ....:               labels=['Label1', 'Label2'],
   ....:               items=['Item1', 'Item2'],
   ....:               major_axis=date_range('1/1/2000', periods=5),
   ....:               minor_axis=['A', 'B', 'C', 'D'])
   ....:

In [66]: p4d
Out[66]:
<class 'pandas.core.panelnd.Panel4D'>
Dimensions: 2 (labels) x 2 (items) x 5 (major_axis) x 4 (minor_axis)
Labels axis: Label1 to Label2
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
See the full release notes or issue tracker on GitHub for a complete list.
v0.9.1 (November 14, 2012)¶

This is a bugfix release from 0.9.0 and includes several new features and enhancements along with a large number of bug fixes. The new features include by-column sort order for DataFrame and Series, improved NA handling for the rank method, masking functions for DataFrame, and intraday time-series filtering for DataFrame.
New features¶
Series.sort, DataFrame.sort, and DataFrame.sort_index can now be specified in a per-column manner to support multiple sort orders (GH928)
In [2]: df = DataFrame(np.random.randint(0, 2, (6, 3)), columns=['A', 'B', 'C'])

In [3]: df.sort(['A', 'B'], ascending=[1, 0])
Out[3]:
   A  B  C
3  0  1  1
4  0  1  1
2  0  0  1
0  1  0  0
1  1  0  0
5  1  0  0

DataFrame.rank now supports additional argument values for the na_option parameter so missing values can be assigned either the largest or the smallest rank (GH1508, GH2159)
In [1]: df = DataFrame(np.random.randn(6, 3), columns=['A', 'B', 'C'])

In [2]: df.loc[2:4] = np.nan

In [3]: df.rank()
Out[3]:
     A    B    C
0  3.0  1.0  3.0
1  2.0  2.0  1.0
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
5  1.0  3.0  2.0

[6 rows x 3 columns]

In [4]: df.rank(na_option='top')
Out[4]:
     A    B    C
0  6.0  4.0  6.0
1  5.0  5.0  4.0
2  2.0  2.0  2.0
3  2.0  2.0  2.0
4  2.0  2.0  2.0
5  4.0  6.0  5.0

[6 rows x 3 columns]

In [5]: df.rank(na_option='bottom')
Out[5]:
     A    B    C
0  3.0  1.0  3.0
1  2.0  2.0  1.0
2  5.0  5.0  5.0
3  5.0  5.0  5.0
4  5.0  5.0  5.0
5  1.0  3.0  2.0

[6 rows x 3 columns]

DataFrame has new where and mask methods to select values according to a given boolean mask (GH2109, GH2151)
DataFrame currently supports slicing via a boolean vector the same length as the DataFrame (inside the []). The returned DataFrame has the same number of columns as the original, but is sliced on its index.
In [6]: df = DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [7]: df
Out[7]:
          A         B         C
0  1.744738 -0.356939  0.092791
1  1.222637  1.909179  0.195946
2  0.481559 -0.404023 -1.115882
3  2.093925  0.010808 -1.775758
4  1.303175  0.025683 -1.795489

[5 rows x 3 columns]

In [8]: df[df['A'] > 0]
Out[8]:
          A         B         C
0  1.744738 -0.356939  0.092791
1  1.222637  1.909179  0.195946
2  0.481559 -0.404023 -1.115882
3  2.093925  0.010808 -1.775758
4  1.303175  0.025683 -1.795489

[5 rows x 3 columns]

If a DataFrame is sliced with a DataFrame-based boolean condition (with the same size as the original DataFrame), then a DataFrame the same size (index and columns) as the original is returned, with elements that do not meet the boolean condition as NaN. This is accomplished via the new method DataFrame.where. In addition, where takes an optional other argument for replacement.
In [9]: df[df > 0]
Out[9]:
          A         B         C
0  1.744738       NaN  0.092791
1  1.222637  1.909179  0.195946
2  0.481559       NaN       NaN
3  2.093925  0.010808       NaN
4  1.303175  0.025683       NaN

[5 rows x 3 columns]

In [10]: df.where(df > 0)
Out[10]:
          A         B         C
0  1.744738       NaN  0.092791
1  1.222637  1.909179  0.195946
2  0.481559       NaN       NaN
3  2.093925  0.010808       NaN
4  1.303175  0.025683       NaN

[5 rows x 3 columns]

In [11]: df.where(df > 0, -df)
Out[11]:
          A         B         C
0  1.744738  0.356939  0.092791
1  1.222637  1.909179  0.195946
2  0.481559  0.404023  1.115882
3  2.093925  0.010808  1.775758
4  1.303175  0.025683  1.795489

[5 rows x 3 columns]

Furthermore, where now aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting is possible. This is analogous to partial setting via .ix (but on the contents rather than the axis labels).
In [12]: df2 = df.copy()

In [13]: df2[df2[1:4] > 0] = 3

In [14]: df2
Out[14]:
          A         B         C
0  1.744738 -0.356939  0.092791
1  3.000000  3.000000  3.000000
2  3.000000 -0.404023 -1.115882
3  3.000000  3.000000 -1.775758
4  1.303175  0.025683 -1.795489

[5 rows x 3 columns]

DataFrame.mask is the inverse boolean operation of where.
In [15]: df.mask(df <= 0)
Out[15]:
          A         B         C
0  1.744738       NaN  0.092791
1  1.222637  1.909179  0.195946
2  0.481559       NaN       NaN
3  2.093925  0.010808       NaN
4  1.303175  0.025683       NaN

[5 rows x 3 columns]

Enable referencing of Excel columns by their column names (GH1936)
In [16]: xl = ExcelFile('data/test.xls')

In [17]: xl.parse('Sheet1', index_col=0, parse_dates=True,
   ....:          parse_cols='A:D')
   ....:
Out[17]:
                   A         B         C
2000-01-03  0.980269  3.685731 -0.364217
2000-01-04  1.047916 -0.041232 -0.161812
2000-01-05  0.498581  0.731168 -0.537677
2000-01-06  1.120202  1.567621  0.003641
2000-01-07 -0.487094  0.571455 -1.611639
2000-01-10  0.836649  0.246462  0.588543
2000-01-11 -0.157161  1.340307  1.195778

[7 rows x 3 columns]

Added option to disable pandas-style tick locators and formatters using series.plot(x_compat=True) or pandas.plot_params['x_compat'] = True (GH2205)
Existing TimeSeries methods at_time and between_time were added to DataFrame (GH2149)
DataFrame.dot can now accept ndarrays (GH2042)
DataFrame.drop now supports non-unique indexes (GH2101)
Panel.shift now supports negative periods (GH2164)
DataFrame now supports unary ~ operator (GH2110)
API changes¶

Upsampling data with a PeriodIndex will result in a higher frequency TimeSeries that spans the original time window
In [1]: prng = period_range('2012Q1', periods=2, freq='Q')

In [2]: s = Series(np.random.randn(len(prng)), prng)

In [4]: s.resample('M')
Out[4]:
2012-01   -1.471992
2012-02         NaN
2012-03         NaN
2012-04   -0.493593
2012-05         NaN
2012-06         NaN
Freq: M, dtype: float64

Period.end_time now returns the last nanosecond in the time interval (GH2124, GH2125, GH1764)
In [18]: p = Period('2012')

In [19]: p.end_time
Out[19]: Timestamp('2012-12-31 23:59:59.999999999')

File parsers no longer coerce to float or bool for columns that have custom converters specified (GH2184)
In [20]: data = 'A,B,C\n00001,001,5\n00002,002,6'

In [21]: read_csv(StringIO(data), converters={'A': lambda x: x.strip()})
Out[21]:
       A  B  C
0  00001  1  5
1  00002  2  6

[2 rows x 3 columns]
See the full release notes or issue tracker on GitHub for a complete list.
v0.9.0 (October 7, 2012)¶

This is a major release from 0.8.1 and includes several new features and enhancements along with a large number of bug fixes. New features include vectorized unicode encoding/decoding for Series.str, a to_latex method for DataFrame, more flexible parsing of boolean values, and enabling the download of options data from Yahoo! Finance.
New features¶
- Add encode and decode for unicode handling to vectorized string processing methods in Series.str (GH1706)
- Add DataFrame.to_latex method (GH1735)
- Add convenient expanding window equivalents of all rolling_* ops (GH1785)
- Add Options class to pandas.io.data for fetching options data from Yahoo! Finance (GH1748, GH1739)
- More flexible parsing of boolean values (Yes, No, TRUE, FALSE, etc) (GH1691, GH1295)
- Add level parameter to Series.reset_index
- TimeSeries.between_time can now select times across midnight (GH1871)
- Series constructor can now handle generator as input (GH1679)
- DataFrame.dropna can now take multiple axes (tuple/list) as input (GH924)
- Enable skip_footer parameter in ExcelFile.parse (GH1843)
API changes¶

- The default column names when header=None and no column names are passed to functions like read_csv have changed to be more Pythonic and amenable to attribute access:
In [1]: data = '0,0,1\n1,1,0\n0,1,0'

In [2]: df = read_csv(StringIO(data), header=None)

In [3]: df
Out[3]:
   0  1  2
0  0  0  1
1  1  1  0
2  0  1  0

[3 rows x 3 columns]
- Creating a Series from another Series while passing an index will now cause reindexing to happen inside the constructor rather than treating the Series like an ndarray. Technically improper usages like Series(df[col1], index=df[col2]) that worked before "by accident" (this was never intended) will lead to all NA Series in some cases. To be perfectly clear:

In [4]: s1 = Series([1, 2, 3])

In [5]: s1
Out[5]:
0    1
1    2
2    3
Length: 3, dtype: int64

In [6]: s2 = Series(s1, index=['foo', 'bar', 'baz'])

In [7]: s2
Out[7]:
foo   NaN
bar   NaN
baz   NaN
Length: 3, dtype: float64
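If the positional pairing really was intended, one way to keep the old behavior (a sketch, not from the original notes) is to hand the constructor the raw values so no reindexing occurs:

s3 = Series(s1.values, index=['foo', 'bar', 'baz'])   # ndarray in, no reindexing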
- The day_of_year API removed from PeriodIndex; use dayofyear (GH1723)
- first and last methods in GroupBy no longer drop non-numeric columns (GH1809)
- na_values of type dict no longer override default NAs unless keep_default_na is set to false explicitly (GH1657)
- DataFrame.dot no longer does data alignment, and also works with Series (GH1915)

See the full release notes or issue tracker on GitHub for a complete list.
v0.8.1 (July 22, 2012)¶

This release includes a few new features, performance enhancements, and over 30 bug fixes from 0.8.0. Notable new features include NA-friendly string processing functionality and a series of new plot types and options.
Performance improvements¶
- Improved implementation of rolling min and max (thanks to Bottleneck!)
- Add accelerated 'median' GroupBy option (GH1358)
- Significantly improve the performance of parsing ISO8601-format date strings with DatetimeIndex or to_datetime (GH1571)
- Improve the performance of GroupBy on single-key aggregations and use with Categorical types
- Significant datetime parsing performance improvements
v0.8.0 (June 29, 2012)¶

This is a major release from 0.7.3 and includes extensive work on the time series handling and processing infrastructure as well as a great deal of new functionality throughout the library. It includes over 700 commits from more than 20 distinct authors. Most pandas 0.7.3 and earlier users should not experience any issues upgrading, but due to the migration to the NumPy datetime64 dtype, there may be a number of bugs and incompatibilities lurking. Lingering incompatibilities will be fixed ASAP in a 0.8.1 release if necessary. See the full release notes or issue tracker on GitHub for a complete list.
Support for non-unique indexes¶

All objects can now work with non-unique indexes. Data alignment / join operations work according to SQL join semantics (including, if applicable, index duplication in many-to-many joins).
NumPy datetime64 dtype and 1.6 dependency¶

Time series data are now represented using NumPy's datetime64 dtype; thus, pandas 0.8.0 now requires at least NumPy 1.6. It has been tested and verified to work with the development version (1.7+) of NumPy as well, which includes some significant user-facing API changes. NumPy 1.6 also has a number of bugs having to do with nanosecond resolution data, so I recommend that you steer clear of NumPy 1.6's datetime64 API functions (though limited as they are) and only interact with this data using the interface that pandas provides.
See the end of the 0.8.0 section for a “porting” guide listing potential issues for users migrating legacy codebases from pandas 0.7 or earlier to 0.8.0.
Bug fixes to the 0.7.x series for legacy NumPy < 1.6 users will be provided as they arise. There will be no further development in 0.7.x beyond bug fixes.
Time series changes and improvements¶

Note
With this release, legacy scikits.timeseries users should be able to port their code to use pandas.
- New PeriodIndex and Period classes for representing time spans and performing calendar logic, including the 12 fiscal quarterly frequencies. This is a partial port of, and a substantial enhancement to, elements of the scikits.timeseries codebase. Support for conversion between PeriodIndex and DatetimeIndex.
- tz_localize methods added to TimeSeries and DataFrame. All timestamps are stored as UTC; timestamps from DatetimeIndex objects with a time zone set will be localized to local time. Time zone conversions are therefore essentially free. Users need to know very little about the pytz library now; only time zone names as strings are required. Time zone-aware timestamps are equal if and only if their UTC timestamps match. Operations between time zone-aware time series with different time zones will result in a UTC-indexed time series.
- date_range, bdate_range, and period_range factory functions.
- inferred_freq property of DatetimeIndex, with option to infer frequency on construction of DatetimeIndex.
- Ability to select a time series at a particular time of day (TimeSeries.at_time) or between two times (TimeSeries.between_time).
- New cut and qcut functions (like R's cut function) for computing a categorical variable from a continuous variable by binning values either into value-based (cut) or quantile-based (qcut) bins.
- Rename Factor to Categorical and add a number of usability features.
- Fix pivot_table bugs (empty columns being introduced).
- any and all methods added to DataFrame.
- Series.plot now supports a secondary_y option:
In [1]: plt.figure()
Out[1]: <matplotlib.figure.Figure at 0x133e4b8d0>

In [2]: fx['FR'].plot(style='g')
Out[2]: <matplotlib.axes._subplots.AxesSubplot at 0x12ad13cc0>

In [3]: fx['IT'].plot(style='k--', secondary_y=True)
Out[3]: <matplotlib.axes._subplots.AxesSubplot at 0x133554dd8>
Vytautas Jancauskas, the 2012 GSOC participant, has added many new plot types. For example, 'kde' is a new option:
In [4]: s = Series(np.concatenate((np.random.randn(1000),
   ...:                            np.random.randn(1000) * 0.5 + 3)))
   ...:

In [5]: plt.figure()
Out[5]: <matplotlib.figure.Figure at 0x1300adda0>

In [6]: s.hist(normed=True, alpha=0.2)
Out[6]: <matplotlib.axes._subplots.AxesSubplot at 0x13099ca20>

In [7]: s.plot(kind='kde')
Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x13099ca20>
See the plotting page for much more.
Other API changes¶

- Deprecation of offset, time_rule, and timeRule argument names in time series functions. Warnings will be printed until pandas 0.9 or 1.0.

Potential porting issues for pandas <= 0.7.3 users¶

The major change that may affect you in pandas 0.8.0 is that time series indexes use NumPy's datetime64 data type instead of dtype=object arrays of Python's built-in datetime.datetime objects. DateRange has been replaced by DatetimeIndex but otherwise behaves identically. But, if you have code that converts DateRange or Index objects that used to contain datetime.datetime values to plain NumPy arrays, you may have bugs lurking with code using scalar values, because you are handing control over to NumPy:
In [8]: import datetime

In [9]: rng = date_range('1/1/2000', periods=10)

In [10]: rng[5]
Out[10]: Timestamp('2000-01-06 00:00:00', freq='D')

In [11]: isinstance(rng[5], datetime.datetime)
Out[11]: True

In [12]: rng_asarray = np.asarray(rng)

In [13]: scalar_val = rng_asarray[5]

In [14]: type(scalar_val)
Out[14]: numpy.datetime64
pandas's Timestamp object is a subclass of datetime.datetime that has nanosecond support (the nanosecond field stores the nanosecond value between 0 and 999). It should substitute directly into any code that used datetime.datetime values before. Thus, I recommend not casting DatetimeIndex to regular NumPy arrays.
If you have code that requires an array of datetime.datetime objects, you have a couple of options. First, the asobject property of DatetimeIndex produces an array of Timestamp objects:
In [15]: stamp_array = rng.asobject

In [16]: stamp_array
Out[16]:
Index([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00,
       2000-01-04 00:00:00, 2000-01-05 00:00:00, 2000-01-06 00:00:00,
       2000-01-07 00:00:00, 2000-01-08 00:00:00, 2000-01-09 00:00:00,
       2000-01-10 00:00:00],
      dtype='object')

In [17]: stamp_array[5]
Out[17]: Timestamp('2000-01-06 00:00:00', freq='D')
To get an array of proper datetime.datetime objects, use the to_pydatetime method:
In [18]: dt_array = rng.to_pydatetime()

In [19]: dt_array
Out[19]:
array([datetime.datetime(2000, 1, 1, 0, 0),
       datetime.datetime(2000, 1, 2, 0, 0),
       datetime.datetime(2000, 1, 3, 0, 0),
       datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 5, 0, 0),
       datetime.datetime(2000, 1, 6, 0, 0),
       datetime.datetime(2000, 1, 7, 0, 0),
       datetime.datetime(2000, 1, 8, 0, 0),
       datetime.datetime(2000, 1, 9, 0, 0),
       datetime.datetime(2000, 1, 10, 0, 0)], dtype=object)

In [20]: dt_array[5]
Out[20]: datetime.datetime(2000, 1, 6, 0, 0)
matplotlib knows how to handle datetime.datetime but not Timestamp objects. While I recommend that you plot time series using TimeSeries.plot, you can either use to_pydatetime or register a converter for the Timestamp type. See the matplotlib documentation for more on this.
Warning
There are bugs in the user-facing API with the nanosecond datetime64 unit in NumPy 1.6. In particular, the string version of the array shows garbage values, and conversion to dtype=object is similarly broken.
In [21]: rng = date_range('1/1/2000', periods=10)

In [22]: rng
Out[22]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', freq='D')

In [23]: np.asarray(rng)
Out[23]:
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000',
       '2000-01-05T00:00:00.000000000', '2000-01-06T00:00:00.000000000',
       '2000-01-07T00:00:00.000000000', '2000-01-08T00:00:00.000000000',
       '2000-01-09T00:00:00.000000000', '2000-01-10T00:00:00.000000000'],
      dtype='datetime64[ns]')

In [24]: converted = np.asarray(rng, dtype=object)

In [25]: converted[5]
Out[25]: 947116800000000000
Trust me: don't panic. If you are using NumPy 1.6 and restrict your interaction with datetime64 values to pandas's API you will be just fine. There is nothing wrong with the data-type (a 64-bit integer internally); all of the important data processing happens in pandas and is heavily tested. I strongly recommend that you do not work directly with datetime64 arrays in NumPy 1.6 and only use the pandas API.
Support for non-unique indexes: In the latter case, you may have code inside a try:/except: block that failed due to the index not being unique. In many cases it will no longer fail (some methods like append still check for uniqueness unless disabled). However, all is not lost: you can inspect index.is_unique and raise an exception explicitly if it is False, or go to a different code branch.
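A minimal sketch of that explicit check (the helper name is invented for illustration):

def require_unique_index(obj):
    # obj is any Series/DataFrame; fail fast instead of relying on try/except
    if not obj.index.is_unique:
        raise ValueError('operation requires a unique index')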
v.0.7.3 (April 12, 2012)¶

This is a minor release from 0.7.2 and fixes many minor bugs and adds a number of nice new features. There are also a couple of API changes to note; these should not affect very many users, and we are inclined to call them "bug fixes" even though they do constitute a change in behavior. See the full release notes or issue tracker on GitHub for a complete list.
NA Boolean Comparison API Change¶

Reverted some changes to how NA values (represented typically as NaN or None) are handled in non-numeric Series:
In [1]: series = Series(['Steve', np.nan, 'Joe'])

In [2]: series == 'Steve'
Out[2]:
0     True
1    False
2    False
Length: 3, dtype: bool

In [3]: series != 'Steve'
Out[3]:
0    False
1     True
2     True
Length: 3, dtype: bool
In comparisons, NA / NaN will always come through as False, except with !=, which is True. Be very careful with boolean arithmetic, especially negation, in the presence of NA data. You may wish to add an explicit NA filter into boolean array operations if you are worried about this:
In [4]: mask = series == 'Steve'

In [5]: series[mask & series.notnull()]
Out[5]:
0    Steve
Length: 1, dtype: object
While propagating NA in comparisons may seem like the right behavior to some users (and you could argue on purely technical grounds that this is the right thing to do), the evaluation was made that propagating NA everywhere, including in numerical arrays, would cause a large amount of problems for users. Thus, a “practicality beats purity” approach was taken. This issue may be revisited at some point in the future.
Other API Changes¶

When calling apply on a grouped Series, the return value will also be a Series, to be more consistent with the groupby behavior with DataFrame:
In [6]: df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                       'foo', 'bar', 'foo', 'foo'],
   ...:                 'B': ['one', 'one', 'two', 'three',
   ...:                       'two', 'two', 'one', 'three'],
   ...:                 'C': np.random.randn(8), 'D': np.random.randn(8)})
   ...:

In [7]: df
Out[7]:
     A      B         C         D
0  foo    one  1.075059 -0.449141
1  bar    one  0.785676  1.443014
2  foo    two  0.958157  0.612324
3  bar  three  1.477773 -0.178818
4  foo    two -1.006023  0.133072
5  bar    two -1.506997 -0.550981
6  foo    one  1.218042 -2.043335
7  foo  three -0.565878  0.753539

[8 rows x 4 columns]

In [8]: grouped = df.groupby('A')['C']

In [9]: grouped.describe()
Out[9]:
     count      mean       std       min       25%       50%       75%  \
A
bar    3.0  0.252151  1.562274 -1.506997 -0.360661  0.785676  1.131724
foo    5.0  0.335871  1.039915 -1.006023 -0.565878  0.958157  1.075059

          max
A
bar  1.477773
foo  1.218042

[2 rows x 8 columns]

In [10]: grouped.apply(lambda x: x.sort_values()[-2:])  # top 2 values
Out[10]:
A
bar  1    0.785676
     3    1.477773
foo  0    1.075059
     6    1.218042
Name: C, Length: 4, dtype: float64

v.0.7.2 (March 16, 2012)¶
This release targets bugs in 0.7.1, and adds a few minor features.
New features¶
- Add additional tie-breaking methods in DataFrame.rank (GH874)
- Add ascending parameter to rank in Series, DataFrame (GH875)
- Add coerce_float option to DataFrame.from_records (GH893)
- Add sort_columns parameter to allow unsorted plots (GH918)
- Enable column access via attributes on GroupBy (GH882)
- Can pass dict of values to DataFrame.fillna (GH661)
- Can select multiple hierarchical groups by passing list of values in .ix (GH134)
- Add axis option to DataFrame.fillna (GH174)
- Add level keyword to drop for dropping values from a level (GH159)
v.0.7.1 (February 29, 2012)¶
This release includes a few new features and addresses over a dozen bugs in 0.7.0.
New features¶
- Add to_clipboard function to pandas namespace for writing objects to the system clipboard (GH774)
- Add itertuples method to DataFrame for iterating through the rows of a dataframe as tuples (GH818)
- Add ability to pass fill_value and method to DataFrame and Series align method (GH806, GH807)
- Add fill_value option to reindex, align methods (GH784)
- Enable concat to produce DataFrame from Series (GH787)
- Add between method to Series (GH802)
- Support for reading Excel 2007 XML documents using openpyxl
v.0.7.0 (February 9, 2012)¶

New features¶
- Improve performance and memory usage of fillna on DataFrame
- Can concatenate a list of Series along axis=1 to obtain a DataFrame (GH787)
- Improves performance of Series.append and DataFrame.append (GH468, GH479, GH273)
- Can pass multiple DataFrames to DataFrame.append to concatenate (stack) and multiple Series to Series.append too
- You can now set multiple columns in a DataFrame via __getitem__, useful for transformation (GH342)
- Handle differently-indexed output values in DataFrame.apply (GH498):

In [1]: df = DataFrame(randn(10, 4))

In [2]: df.apply(lambda x: x.describe())
Out[2]:
               0          1          2          3
count  10.000000  10.000000  10.000000  10.000000
mean   -0.409608   0.539495   0.163276   0.051646
std     1.397779   0.968808   0.874489   0.719651
min    -2.539411  -0.737206  -1.202276  -1.050435
25%    -1.202202   0.021308  -0.368812  -0.383608
50%    -0.384480   0.306124   0.211431   0.165586
75%     0.186280   1.024039   0.730744   0.494457
max     2.524998   2.533114   1.334428   1.147396

[8 rows x 4 columns]
- Add reorder_levels method to Series and DataFrame (GH534)
- Add dict-like get function to DataFrame and Panel (GH521)
- Add DataFrame.iterrows method for efficiently iterating through the rows of a DataFrame
- Add DataFrame.to_panel with code adapted from LongPanel.to_long
- reindex_axis method added to DataFrame
- Add level option to binary arithmetic functions on DataFrame and Series
- Add level option to the reindex and align methods on Series and DataFrame for broadcasting values across a level (GH542, GH552, others)
- Add attribute-based item access to Panel and add IPython completion (GH563)
- Add logy option to Series.plot for log-scaling on the Y axis
- Add index and header options to DataFrame.to_string
- Can pass multiple DataFrames to DataFrame.join to join on index (GH115)
- Can pass multiple Panels to Panel.join (GH115)
- Add justify argument to DataFrame.to_string to allow different alignment of column headers
- Add sort option to GroupBy to allow disabling sorting of the group keys for potential speedups (GH595)
- Add DataFrame.lookup, fancy-indexing analogue for retrieving values given a sequence of row and column labels (GH338)
- Add cummin and cummax on Series and DataFrame to get cumulative minimum and maximum, respectively (GH647)
- value_range added as utility function to get min and max of a dataframe (GH288)
- Added encoding argument to read_csv, read_table, to_csv and from_csv for non-ascii text (GH717)
- Added abs method to pandas objects
- Added crosstab function for easily computing frequency tables
- Added isin method to index objects
- Added level argument to xs method of DataFrame

API changes to integer indexing¶

One of the potentially riskiest API changes in 0.7.0, but also one of the most important, was a complete review of how integer indexes are handled with regard to label-based indexing. Here is an example:
In [3]: s = Series(randn(10), index=range(0, 20, 2))

In [4]: s
Out[4]:
0    -0.543429
2     1.425447
4    -0.408795
6    -1.489348
8    -1.166408
10   -0.481205
12   -0.810355
14   -0.985491
16   -0.336246
18   -0.629058
Length: 10, dtype: float64

In [5]: s[0]
Out[5]: -0.54342898765020686

In [6]: s[2]
Out[6]: 1.4254474252163707

In [7]: s[4]
Out[7]: -0.40879476802408349
This is all exactly identical to the behavior before. However, if you ask for a key not contained in the Series, in versions 0.6.1 and prior, Series would fall back on a location-based lookup. This now raises a KeyError.
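For instance, continuing the session above (an illustrative continuation; 1 is not one of the even-valued labels):

In [8]: s[1]
KeyError: 1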
This change also has the same impact on DataFrame:
In [3]: df = DataFrame(randn(8, 4), index=range(0, 16, 2))

In [4]: df
          0       1       2        3
0   0.88427  0.3363 -0.1787  0.03162
2   0.14451 -0.1415  0.2504  0.58374
4  -1.44779 -0.9186 -1.4996  0.27163
6  -0.26598 -2.4184 -0.2658  0.11503
8  -0.58776  0.3144 -0.8566  0.61941
10  0.10940 -0.7175 -1.0108  0.47990
12 -1.16919 -0.3087 -0.6049 -0.43544
14 -0.07337  0.3410  0.0424 -0.16037

In [5]: df.ix[3]
KeyError: 3
In order to support purely integer-based indexing, the following methods have been added:
Method                      Description
Series.iget_value(i)        Retrieve value stored at location i
Series.iget(i)              Alias for iget_value
DataFrame.irow(i)           Retrieve the i-th row
DataFrame.icol(j)           Retrieve the j-th column
DataFrame.iget_value(i, j)  Retrieve the value at row i and column j
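A short usage sketch of these accessors, continuing the objects above:

s.iget(2)       # third value by position, regardless of the index labels
df.irow(0)      # first row as a Series
df.icol(1)      # second column as a Series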
API tweaks regarding label-based slicing¶
Label-based slicing using ix now requires that the index be sorted (monotonic) unless both the start and endpoint are contained in the index:
In [1]: s = Series(randn(6), index=list('gmkaec'))

In [2]: s
Out[2]:
g   -1.182230
m   -0.276183
k   -0.243550
a    1.628992
e    0.073308
c   -0.539890
dtype: float64
Then this is OK:
In [3]: s.ix['k':'e']
Out[3]:
k   -0.243550
a    1.628992
e    0.073308
dtype: float64
But this is not:
In [12]: s.ix['b':'h']
KeyError 'b'
If the index had been sorted, the “range selection” would have been possible:
In [4]: s2 = s.sort_index()

In [5]: s2
Out[5]:
a    1.628992
c   -0.539890
e    0.073308
g   -1.182230
k   -0.243550
m   -0.276183
dtype: float64

In [6]: s2.ix['b':'h']
Out[6]:
c   -0.539890
e    0.073308
g   -1.182230
dtype: float64

Changes to Series [] operator¶
As a notational convenience, you can pass a sequence of labels or a label slice to a Series when getting and setting values via [] (i.e. the __getitem__ and __setitem__ methods). The behavior will be the same as passing similar input to ix except in the case of integer indexing:
In [8]: s = Series(randn(6), index=list('acegkm'))

In [9]: s
Out[9]:
a   -0.297788
c    0.499769
e    0.810531
g    0.414649
k   -1.551478
m    1.012459
Length: 6, dtype: float64

In [10]: s[['m', 'a', 'c', 'e']]
Out[10]:
m    1.012459
a   -0.297788
c    0.499769
e    0.810531
Length: 4, dtype: float64

In [11]: s['b':'l']
Out[11]:
c    0.499769
e    0.810531
g    0.414649
k   -1.551478
Length: 4, dtype: float64

In [12]: s['c':'k']
Out[12]:
c    0.499769
e    0.810531
g    0.414649
k   -1.551478
Length: 4, dtype: float64
In the case of integer indexes, the behavior will be exactly as before (shadowing ndarray):
In [13]: s = Series(randn(6), index=range(0, 12, 2))

In [14]: s[[4, 0, 2]]
Out[14]:
4    0.928877
0    1.171752
2    0.026488
Length: 3, dtype: float64

In [15]: s[1:5]
Out[15]:
2    0.026488
4    0.928877
6   -1.264991
8    0.419449
Length: 4, dtype: float64
If you wish to do indexing with sequences and slicing on an integer index with label semantics, use ix.
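A brief sketch, continuing the integer-indexed Series above: ix stays label-based even here, and slice endpoints are inclusive:

s.ix[[4, 0, 2]]   # the labels 4, 0, 2 -- not positions
s.ix[2:6]         # labels 2 through 6 inclusive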
Other API changes and improvements¶

- The LongPanel class has been completely removed
- If Series.sort is called on a column of a DataFrame, an exception will now be raised. Before it was possible to accidentally mutate a DataFrame's column by doing df[col].sort() instead of the side-effect free method df[col].order() (GH316)
- Miscellaneous renames and deprecations which will (harmlessly) raise FutureWarning
- drop added as an optional parameter to DataFrame.reset_index (GH699)
- Faster reset_index on DataFrame with a regular (non-hierarchical) index (GH476)
- Improved performance of operations with a level parameter passed (GH545)
- Improved performance of rolling_median by about 5-10x in most typical use cases (GH374)
- Added melt function to pandas.core.reshape
- Added level parameter to group by level in Series and DataFrame descriptive statistics (GH313)
- Added head and tail methods to Series, analogous to DataFrame (GH296)
- Added Series.isin function which checks if each value is contained in a passed sequence (GH289)
- Added float_format option to Series.to_string
- Added skip_footer (GH291) and converters (GH343) options to read_csv and read_table
- Added drop_duplicates and duplicated functions for removing duplicate DataFrame rows and checking for duplicate rows, respectively (GH319)
- Added Series.mad, mean absolute deviation
- Added QuarterEnd DateOffset (GH321)
- Added dot to DataFrame (GH65)
- Added orient option to Panel.from_dict (GH359, GH301)
- Added orient option to DataFrame.from_dict
- Improvements to DataFrame.from_records (GH357)
- Can sort by multiple columns via the by argument of DataFrame.sort_index (GH92, GH362)
- Added fast get_value and put_value methods to DataFrame (GH360)
- Added cov instance methods to Series and DataFrame (GH194, GH362)
- Added kind='bar' option to DataFrame.plot (GH348)
- Added idxmin and idxmax to Series and DataFrame (GH286)
- Added read_clipboard function to parse DataFrame from clipboard (GH300)
- Added nunique function to Series for counting unique elements (GH297)
- Added DataFrame.to_html for writing DataFrame to HTML (GH387)
- Added DataFrame.boxplot function (GH368)
- Support for DataFrame.join with a vector on argument (GH312)
- Added legend boolean flag to DataFrame.plot (GH324)
- Can pass multiple levels to stack and unstack (GH370)
- Can pass multiple values columns to pivot_table (GH381)
- Added raw option to DataFrame.apply for performance when only an ndarray is needed (GH309)
- Cythonized cache_readonly, resulting in substantial micro-performance enhancements throughout the codebase (GH361)
- Improved performance of MultiIndex.from_tuples
- Added raw option to DataFrame.apply for getting better performance when the passed function only requires an ndarray (GH309)
- New map_infer function speeds up Series.apply and Series.map significantly when passed an elementwise Python function, motivated by GH355
- Substantially improved performance of Series.order, which also makes np.unique called on a Series faster (GH327)
- Added DataFrame.align method with standard join options
- Added parse_dates option to read_csv and read_table methods to optionally try to parse dates in the index columns
- Added nrows, chunksize, and iterator arguments to read_csv and read_table. The last two return a new TextParser class capable of lazily iterating through chunks of a flat file (GH242)
- Improvements to DataFrame.join (GH214)
- Added private _get_duplicates function to Index for identifying duplicate values more easily (ENH5c)
- Improvements to Series.describe for Series containing objects (GH241)
- Improved performance of DataFrame.join when joining on key(s) (GH248)
- Improvements to __getitem__ (GH253)
- Added pivot_table convenience function to pandas namespace (GH234)
- Implemented Panel.rename_axis function (GH243)
- Implemented Panel.take
- Added set_eng_float_format for alternate DataFrame floating point string formatting (ENH61)
- Added convenience set_index function for creating a DataFrame index from its existing columns
- Implemented groupby hierarchical index level name (GH223)
- Added support for different delimiters in DataFrame.to_csv (GH244)
- Major performance improvements in file parsing functions read_csv and read_table
- Improved performance of DataFrame.xs on mixed-type DataFrame objects by about 5x, regression from 0.3.0 (GH215)
- Refactored DataFrame.align method, speeding up binary operations between differently-indexed DataFrame objects by 10-25%
- Improved performance of __repr__ and count on large mixed-type DataFrame objects
- Added name attribute to Series, now prints as part of Series.__repr__
- Added instance methods isnull and notnull to Series (GH209, GH203)
- Added Series.align method for aligning two series with choice of join method (ENH56)
- Added method get_level_values to MultiIndex (GH188)
- Can set values in mixed-type DataFrame objects via .ix indexing attribute (GH135)
- Added new DataFrame methods get_dtype_counts and property dtypes (ENHdc)
- Added ability to use DataFrame.append to stack DataFrames (ENH1b)
- read_csv tries to sniff delimiters using csv.Sniffer (GH146)
- read_csv can read multiple columns into a MultiIndex; DataFrame's to_csv method writes out a corresponding MultiIndex (GH151)
- DataFrame.rename has a new copy parameter to rename a DataFrame in place (ENHed)
- Enable sortlevel to work by level (GH141)
- Fixed regression in isnull and notnull from v0.3.0 (GH187)
- Refactored DataFrame.join so that intermediate aligned copies of the data in each DataFrame argument do not need to be created. Substantial performance increases result (GH176)
- Substantially improved performance of generic Index.intersection and Index.union
- Implemented BlockManager.take resulting in significantly faster take performance on mixed-type DataFrame objects (GH104)
- Improved performance of Series.sort_index
- Faster _ensure_index function, resulting in performance savings in type-checking Index objects