Showing content from http://pandas.pydata.org/pandas-docs/version/0.20/whatsnew.html below:

What’s New — pandas 0.20.3 documentation

What’s New¶

These are new features and improvements of note in each release.

v0.20.3 (July 7, 2017)¶

This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes and bug fixes. We recommend that all users upgrade to this version.

Bug Fixes¶

Conversion¶

Indexing¶

I/O¶

Plotting¶

Reshaping¶

Categorical¶

v0.20.2 (June 4, 2017)¶

This is a minor bug-fix release in the 0.20.x series and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.

Enhancements¶

Performance Improvements¶

Bug Fixes¶

Conversion¶

Indexing¶

I/O¶

Plotting¶

Groupby/Resample/Rolling¶

Sparse¶

Reshaping¶

Numeric¶

Categorical¶

Other¶

v0.20.1 (May 5, 2017)¶

This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

Warning

Pandas has changed the internal structure and layout of the codebase. This can affect imports that are not from the top-level pandas.* namespace; please see the changes here.

Check the API Changes and deprecations before updating.

Note

This is a combined release for 0.20.0 and 0.20.1. Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas’ utils routines. (GH16250)

New features¶

agg API for DataFrame/Series¶

Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API from groupby, window operations, and resampling. This allows aggregation operations in a concise way by using agg() and transform(). The full documentation is here (GH1623).

Here is a sample

In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   ...:                  index=pd.date_range('1/1/2000', periods=10))
   ...: 

In [2]: df.iloc[3:7] = np.nan

In [3]: df
Out[3]: 
                   A         B         C
2000-01-01  1.474071 -0.064034 -1.282782
2000-01-02  0.781836 -1.071357  0.441153
2000-01-03  2.353925  0.583787  0.221471
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.901805  1.171216  0.520260
2000-01-09 -1.197071 -1.066969 -0.303421
2000-01-10 -0.858447  0.306996 -0.028665

One can operate using string function names, callables, lists, or dictionaries of these.

Using a single function is equivalent to .apply.

In [4]: df.agg('sum')
Out[4]: 
A    3.456119
B   -0.140361
C   -0.431984
dtype: float64

Multiple aggregations with a list of functions.

In [5]: df.agg(['sum', 'min'])
Out[5]: 
            A         B         C
sum  3.456119 -0.140361 -0.431984
min -1.197071 -1.071357 -1.282782

Using a dict provides the ability to apply specific aggregations per column. You will get a matrix-like output of all of the aggregators. The output has one row per unique function. Functions that were not requested for a particular column will be NaN:

In [6]: df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
Out[6]: 
            A         B
max       NaN  1.171216
min -1.197071 -1.071357
sum  3.456119       NaN
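A small deterministic sketch can make the NaN placement easier to see. The column values below are illustrative assumptions, not data from the release notes:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not taken from the release notes).
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

# 'sum' is requested only for A, 'max' only for B, and 'min' for both.
result = df.agg({"A": ["sum", "min"], "B": ["min", "max"]})

# Look results up by label, since the row order may differ between versions.
print(result.loc["sum", "A"])            # the sum of column A
print(np.isnan(result.loc["sum", "B"]))  # 'sum' was not requested for B
```

Each requested function contributes one row, and a cell is filled only where that function was requested for that column.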

The API also supports a .transform() function for broadcasting results.

In [7]: df.transform(['abs', lambda x: x - x.min()])
Out[7]: 
                   A                   B                   C          
                 abs  <lambda>       abs  <lambda>       abs  <lambda>
2000-01-01  1.474071  2.671143  0.064034  1.007322  1.282782  0.000000
2000-01-02  0.781836  1.978907  1.071357  0.000000  0.441153  1.723935
2000-01-03  2.353925  3.550996  0.583787  1.655143  0.221471  1.504252
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.901805  2.098877  1.171216  2.242573  0.520260  1.803042
2000-01-09  1.197071  0.000000  1.066969  0.004388  0.303421  0.979361
2000-01-10  0.858447  0.338624  0.306996  1.378353  0.028665  1.254117
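The difference between the two entry points can be sketched as follows, assuming a small illustrative frame: transform() keeps the shape of the input, while agg() with a reducing function collapses each column to one value.

```python
import pandas as pd

# Illustrative data (an assumption, not taken from the release notes).
df = pd.DataFrame({"A": [3.0, 1.0, 2.0], "B": [10.0, 30.0, 20.0]})

# transform() returns an object with the same index and shape as the input.
shifted = df.transform(lambda x: x - x.min())

# agg() with a reducing function returns one value per column.
totals = df.agg("sum")

print(shifted.shape, totals.shape)
```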

When presented with mixed dtypes that cannot be aggregated, .agg() will only take the valid aggregations. This is similar to how groupby .agg() works. (GH15015)

In [8]: df = pd.DataFrame({'A': [1, 2, 3],
   ...:                    'B': [1., 2., 3.],
   ...:                    'C': ['foo', 'bar', 'baz'],
   ...:                    'D': pd.date_range('20130101', periods=3)})
   ...: 

In [9]: df.dtypes
Out[9]: 
A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [10]: df.agg(['min', 'sum'])
Out[10]: 
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT

dtype keyword for data IO¶

The 'python' engine for read_csv(), as well as the read_fwf() function for parsing fixed-width text files and read_excel() for parsing Excel files, now accept the dtype keyword argument for specifying the types of specific columns (GH14295). See the io docs for more information.

In [11]: data = "a  b\n1  2\n3  4"

In [12]: pd.read_fwf(StringIO(data)).dtypes
Out[12]: 
a    int64
b    int64
dtype: object

In [13]: pd.read_fwf(StringIO(data), dtype={'a':'float64', 'b':'object'}).dtypes
Out[13]: 
a    float64
b     object
dtype: object
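The same keyword applies to the 'python' engine of read_csv(). The following is a minimal, self-contained sketch; the comma-separated sample data is an assumption made for illustration, and the StringIO import that the session above relies on is spelled out:

```python
from io import StringIO

import pandas as pd

# Illustrative comma-separated data (an assumption for this sketch).
data = "a,b\n1,2\n3,4"

# Without dtype, both columns are inferred as integers.
inferred = pd.read_csv(StringIO(data), engine="python")

# With dtype, each column is parsed as the requested type.
typed = pd.read_csv(StringIO(data), engine="python",
                    dtype={"a": "float64", "b": "object"})

print(typed.dtypes)
```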
.to_datetime() has gained an origin parameter¶

to_datetime() has gained a new parameter, origin, to define a reference date from where to compute the resulting timestamps when parsing numerical values with a specific unit specified. (GH11276, GH11745)

For example, with 1960-01-01 as the starting date:

In [14]: pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
Out[14]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)

The default is origin='unix', which corresponds to 1970-01-01 00:00:00, commonly called the ‘unix epoch’ or POSIX time. This was the previous default, so this is a backward compatible change.

In [15]: pd.to_datetime([1, 2, 3], unit='D')
Out[15]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
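As a further sketch, the origin can also be given as a date string, and omitting it should be equivalent to passing origin='unix'. The reference date 2000-01-01 is an arbitrary choice for illustration:

```python
import pandas as pd

# Days counted from a custom reference date; the string is converted
# to a Timestamp by to_datetime.
custom = pd.to_datetime([0, 1, 2], unit="D", origin="2000-01-01")

# Omitting origin is the same as origin='unix' (1970-01-01).
default = pd.to_datetime([0, 1, 2], unit="D")
explicit = pd.to_datetime([0, 1, 2], unit="D", origin="unix")

print(custom)
print(default.equals(explicit))
```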
Groupby Enhancements¶

Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names. Previously, only column names could be referenced. This makes it easy to group by a column and an index level at the same time. (GH5677)

In [16]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....: 

In [17]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [18]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
   ....:                    'B': np.arange(8)},
   ....:                   index=index)
   ....: 

In [19]: df
Out[19]: 
              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

In [20]: df.groupby(['second', 'A']).sum()
Out[20]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
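A smaller sketch of the same idea with a single-level index; the names 'level', 'A' and 'B' and the data are illustrative assumptions:

```python
import pandas as pd

# 'level' is an index level name; 'A' and 'B' are column names.
index = pd.Index(["x", "y", "x", "y"], name="level")
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 20, 30, 40]}, index=index)

# Index level names and column names can be mixed in a single by= list.
grouped = df.groupby(["level", "A"])["B"].sum()

print(grouped)
```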
Better support for compressed URLs in read_csv¶

The compression code was refactored (GH12688). As a result, reading dataframes from URLs in read_csv() or read_table() now supports additional compression methods: xz, bz2, and zip (GH14570). Previously, only gzip compression was supported. By default, compression of URLs and paths is now inferred from their file extensions. Additionally, support for bz2 compression in the Python 2 C-engine was improved (GH14874).

In [21]: url = 'https://github.com/{repo}/raw/{branch}/{path}'.format(
   ....:     repo = 'pandas-dev/pandas',
   ....:     branch = 'master',
   ....:     path = 'pandas/tests/io/parser/data/salaries.csv.bz2',
   ....: )
   ....: 

In [22]: df = pd.read_table(url, compression='infer')  # default, infer compression

In [23]: df = pd.read_table(url, compression='bz2')  # explicitly specify compression

In [24]: df.head(2)
Out[24]: 
       S  X  E  M
0  13876  1  1  1
1  11608  1  3  0
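Extension-based inference can also be tried without network access. The sketch below uses a local temporary file rather than a URL, a hypothetical file name, and assumes a pandas version in which to_csv() likewise infers compression from the extension:

```python
import os
import tempfile

import pandas as pd

# Illustrative data (an assumption for this sketch).
df = pd.DataFrame({"S": [13876, 11608], "X": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "salaries.csv.bz2")  # hypothetical file name

    # The .bz2 extension selects bz2 compression when writing.
    df.to_csv(path, index=False)

    # The first bytes show whether the file is a bz2 stream.
    with open(path, "rb") as fh:
        magic = fh.read(3)

    # compression='infer' is the default, so no keyword is needed here.
    restored = pd.read_csv(path)

print(magic, restored.equals(df))
```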
Pickle file I/O now supports compression¶

read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can now read from and write to compressed pickle files. Compression methods can be an explicit parameter or be inferred from the file extension. See the docs here.

In [25]: df = pd.DataFrame({
   ....:     'A': np.random.randn(1000),
   ....:     'B': 'foo',
   ....:     'C': pd.date_range('20130101', periods=1000, freq='s')})
   ....: 

Using an explicit compression type

In [26]: df.to_pickle("data.pkl.compress", compression="gzip")

In [27]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")

In [28]: rt.head()
Out[28]: 
          A    B                   C
0  0.384316  foo 2013-01-01 00:00:00
1  1.574159  foo 2013-01-01 00:00:01
2  1.588931  foo 2013-01-01 00:00:02
3  0.476720  foo 2013-01-01 00:00:03
4  0.473424  foo 2013-01-01 00:00:04

The default is to infer the compression type from the extension (compression='infer'):

In [29]: df.to_pickle("data.pkl.gz")

In [30]: rt = pd.read_pickle("data.pkl.gz")

In [31]: rt.head()
Out[31]: 
          A    B                   C
0  0.384316  foo 2013-01-01 00:00:00
1  1.574159  foo 2013-01-01 00:00:01
2  1.588931  foo 2013-01-01 00:00:02
3  0.476720  foo 2013-01-01 00:00:03
4  0.473424  foo 2013-01-01 00:00:04

In [32]: df["A"].to_pickle("s1.pkl.bz2")

In [33]: rt = pd.read_pickle("s1.pkl.bz2")

In [34]: rt.head()
Out[34]: 
0    0.384316
1    1.574159
2    1.588931
3    0.476720
4    0.473424
Name: A, dtype: float64
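To check that inference really produces a compressed file, one can inspect the leading bytes of the output. This is a minimal sketch using a temporary directory and illustrative data; the two-byte prefix checked is the standard gzip magic number:

```python
import os
import tempfile

import pandas as pd

# Illustrative data (an assumption for this sketch).
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": "foo"})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl.gz")

    # gzip is inferred from the '.gz' extension.
    df.to_pickle(path)

    # A gzip stream starts with the bytes 0x1f 0x8b.
    with open(path, "rb") as fh:
        magic = fh.read(2)

    restored = pd.read_pickle(path)

print(magic == b"\x1f\x8b", restored.equals(df))
```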
UInt64 Support Improved¶

Pandas has significantly improved support for operations involving unsigned, or purely non-negative, integers. Previously, handling these integers would result in improper rounding or data-type casting, leading to incorrect results. Notably, a new numerical index, UInt64Index, has been created. (GH14937)

In [35]: idx = pd.UInt64Index([1, 2, 3])

In [36]: df = pd.DataFrame({'A': ['a', 'b', 'c']}, index=idx)

In [37]: df.index
Out[37]: UInt64Index([1, 2, 3], dtype='uint64')

GroupBy on Categoricals¶

In previous versions, .groupby(..., sort=False) would fail with a ValueError when grouping on a categorical series with some categories not appearing in the data. (GH13179)

In [38]: chromosomes = np.r_[np.arange(1, 23).astype(str), ['X', 'Y']]

In [39]: df = pd.DataFrame({
   ....:     'A': np.random.randint(100),
   ....:     'B': np.random.randint(100),
   ....:     'C': np.random.randint(100),
   ....:     'chromosomes': pd.Categorical(np.random.choice(chromosomes, 100),
   ....:                                   categories=chromosomes,
   ....:                                   ordered=True)})
   ....: 

In [40]: df
Out[40]: 
     A   B   C chromosomes
0   21  62  10          17
1   21  62  10           Y
2   21  62  10          13
3   21  62  10           8
4   21  62  10          22
5   21  62  10           3
6   21  62  10          19
..  ..  ..  ..         ...
93  21  62  10          17
94  21  62  10           Y
95  21  62  10           Y
96  21  62  10          22
97  21  62  10           5
98  21  62  10          20
99  21  62  10           X

[100 rows x 4 columns]

Previous Behavior:

In [3]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
---------------------------------------------------------------------------
ValueError: items in new_categories are not the same as in old categories

New Behavior:

In [41]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
Out[41]: 
                 A      B     C
chromosomes                    
2             42.0  124.0  20.0
3            105.0  310.0  50.0
4             63.0  186.0  30.0
5             84.0  248.0  40.0
6             84.0  248.0  40.0
7             63.0  186.0  30.0
8            189.0  558.0  90.0
...            ...    ...   ...
20           126.0  372.0  60.0
21            42.0  124.0  20.0
22            84.0  248.0  40.0
X             63.0  186.0  30.0
Y            126.0  372.0  60.0
1              NaN    NaN   NaN
12             NaN    NaN   NaN

[24 rows x 3 columns]

Table Schema Output¶

The new orient 'table' for DataFrame.to_json() will generate a Table Schema compatible string representation of the data.

In [42]: df = pd.DataFrame(
   ....:     {'A': [1, 2, 3],
   ....:      'B': ['a', 'b', 'c'],
   ....:      'C': pd.date_range('2016-01-01', freq='d', periods=3),
   ....:     }, index=pd.Index(range(3), name='idx'))
   ....: 

In [43]: df
Out[43]: 
     A  B          C
idx                 
0    1  a 2016-01-01
1    2  b 2016-01-02
2    3  c 2016-01-03

In [44]: df.to_json(orient='table')
Out[44]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"}, "data": [{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'

See IO: Table Schema for more information.
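Because the output is plain JSON, the schema part can be inspected with the standard library. A minimal sketch, assuming a frame with only integer and float columns so that the reported field types are easy to predict:

```python
import json

import pandas as pd

# Illustrative data (an assumption for this sketch).
df = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]},
                  index=pd.Index(range(3), name="idx"))

payload = json.loads(df.to_json(orient="table"))

# Map each field name to its Table Schema type.
field_types = {f["name"]: f["type"] for f in payload["schema"]["fields"]}
primary_key = payload["schema"]["primaryKey"]
rows = payload["data"]

print(field_types)
print(primary_key, len(rows))
```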

Additionally, the repr for DataFrame and Series can now publish this JSON Table schema representation of the Series or DataFrame if you are using IPython (or another frontend like nteract using the Jupyter messaging protocol). This gives frontends like the Jupyter notebook and nteract more flexibility in how they display pandas objects, since they have more information about the data. You must enable this by setting the display.html.table_schema option to True.

SciPy sparse matrix from/to SparseDataFrame¶

Pandas now supports creating sparse dataframes directly from scipy.sparse.spmatrix instances. See the documentation for more information. (GH4343)

All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed.

In [45]: from scipy.sparse import csr_matrix

In [46]: arr = np.random.random(size=(1000, 5))

In [47]: arr[arr < .9] = 0

In [48]: sp_arr = csr_matrix(arr)

In [49]: sp_arr
Out[49]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
	with 500 stored elements in Compressed Sparse Row format>

In [50]: sdf = pd.SparseDataFrame(sp_arr)

In [51]: sdf
Out[51]: 
            0   1   2   3         4
0         NaN NaN NaN NaN       NaN
1         NaN NaN NaN NaN       NaN
2         NaN NaN NaN NaN       NaN
3         NaN NaN NaN NaN  0.997522
4         NaN NaN NaN NaN       NaN
5         NaN NaN NaN NaN  0.911034
6         NaN NaN NaN NaN       NaN
..        ...  ..  ..  ..       ...
993  0.925879 NaN NaN NaN       NaN
994       NaN NaN NaN NaN  0.955585
995       NaN NaN NaN NaN       NaN
996       NaN NaN NaN NaN       NaN
997       NaN NaN NaN NaN       NaN
998       NaN NaN NaN NaN  0.904855
999       NaN NaN NaN NaN       NaN

[1000 rows x 5 columns]

To convert a SparseDataFrame back to sparse SciPy matrix in COO format, you can use:

In [52]: sdf.to_coo()
Out[52]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
	with 500 stored elements in COOrdinate format>

Excel output for styled DataFrames¶

Experimental support has been added to export DataFrame.style formats to Excel using the openpyxl engine. (GH15530)

For example, after running the following, styled.xlsx renders as below:

In [53]: np.random.seed(24)

In [54]: df = pd.DataFrame({'A': np.linspace(1, 10, 10)})

In [55]: df = pd.concat([df, pd.DataFrame(np.random.RandomState(24).randn(10, 4),
   ....:                                  columns=list('BCDE'))],
   ....:                axis=1)
   ....: 

In [56]: df.iloc[0, 2] = np.nan

In [57]: df
Out[57]: 
      A         B         C         D         E
0   1.0  1.329212       NaN -0.316280 -0.990810
1   2.0 -1.070816 -1.438713  0.564417  0.295722
2   3.0 -1.626404  0.219565  0.678805  1.889273
3   4.0  0.961538  0.104011 -0.481165  0.850229
4   5.0  1.453425  1.057737  0.165562  0.515018
5   6.0 -1.336936  0.562861  1.392855 -0.063328
6   7.0  0.121668  1.207603 -0.002040  1.627796
7   8.0  0.354493  1.037528 -0.385684  0.519818
8   9.0  1.686583 -1.325963  1.428984 -2.089354
9  10.0 -0.129820  0.631523 -0.586538  0.290720

In [58]: styled = df.style.\
   ....:     applymap(lambda val: 'color: %s' % ('red' if val < 0 else 'black')).\
   ....:     highlight_max()
   ....: 

In [59]: styled.to_excel('styled.xlsx', engine='openpyxl')

See the Style documentation for more detail.

IntervalIndex¶

pandas has gained an IntervalIndex with its own dtype, interval, as well as the Interval scalar type. These allow first-class support for interval notation, specifically as a return type for the categories in cut() and qcut(). The IntervalIndex allows some unique indexing; see the docs. (GH7640, GH8625)

Warning

These indexing behaviors of the IntervalIndex are provisional and may change in a future version of pandas. Feedback on usage is welcome.

Previous behavior:

The returned categories were strings, representing Intervals

In [1]: c = pd.cut(range(4), bins=2)

In [2]: c
Out[2]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3], (1.5, 3]]
Categories (2, object): [(-0.003, 1.5] < (1.5, 3]]

In [3]: c.categories
Out[3]: Index(['(-0.003, 1.5]', '(1.5, 3]'], dtype='object')

New behavior:

In [60]: c = pd.cut(range(4), bins=2)

In [61]: c
Out[61]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [62]: c.categories
Out[62]: 
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]]
              closed='right',
              dtype='interval[float64]')

Furthermore, this allows one to bin other data with these same bins, with NaN representing a missing value similar to other dtypes.

In [63]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[63]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

An IntervalIndex can also be used in Series and DataFrame as the index.

In [64]: df = pd.DataFrame({'A': range(4),
   ....:                    'B': pd.cut([0, 3, 1, 1], bins=c.categories)}
   ....:                  ).set_index('B')
   ....: 

In [65]: df
Out[65]: 
               A
B               
(-0.003, 1.5]  0
(1.5, 3.0]     1
(-0.003, 1.5]  2
(-0.003, 1.5]  3

Selecting via a specific interval:

In [66]: df.loc[pd.Interval(1.5, 3.0)]
Out[66]: 
A    1
Name: (1.5, 3.0], dtype: int64

Selecting via a scalar value that is contained in the intervals.

In [67]: df.loc[0]
Out[67]: 
               A
B               
(-0.003, 1.5]  0
(-0.003, 1.5]  2
(-0.003, 1.5]  3
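The pieces above can be combined in a short sketch. The break points 0, 2 and 4 are arbitrary assumptions; IntervalIndex.from_breaks builds right-closed intervals by default:

```python
import pandas as pd

# Two right-closed intervals: (0, 2] and (2, 4].
bins = pd.IntervalIndex.from_breaks([0, 2, 4])

# Values that fall outside every interval become NaN.
binned = pd.cut([1, 3, 5], bins=bins)

# Each non-missing element is an Interval scalar that supports membership tests.
first = binned[0]
print(first, 1 in first, 0 in first)
```

The left edge is open, so 0 is not contained in the first interval while 1 is.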
Other Enhancements¶

Backwards incompatible API changes¶

Possible incompatibility for HDF5 formats created with pandas < 0.13.0¶

pd.TimeSeries was officially deprecated in 0.17.0, though it has been an alias since 0.13.0. It has been dropped in favor of pd.Series. (GH15098)

This may cause HDF5 files that were created in prior versions to become unreadable if pd.TimeSeries was used. This is most likely to be the case for pandas < 0.13.0. If you find yourself in this situation, you can use a recent prior version of pandas to read in your HDF5 files, then write them out again after applying the procedure below.

In [2]: s = pd.TimeSeries([1,2,3], index=pd.date_range('20130101', periods=3))

In [3]: s
Out[3]:
2013-01-01    1
2013-01-02    2
2013-01-03    3
Freq: D, dtype: int64

In [4]: type(s)
Out[4]: pandas.core.series.TimeSeries

In [5]: s = pd.Series(s)

In [6]: s
Out[6]:
2013-01-01    1
2013-01-02    2
2013-01-03    3
Freq: D, dtype: int64

In [7]: type(s)
Out[7]: pandas.core.series.Series

Map on Index types now return other Index types¶

map on an Index now returns an Index, not a numpy array (GH12766)

In [68]: idx = Index([1, 2])

In [69]: idx
Out[69]: Int64Index([1, 2], dtype='int64')

In [70]: mi = MultiIndex.from_tuples([(1, 2), (2, 4)])

In [71]: mi
Out[71]: 
MultiIndex(levels=[[1, 2], [2, 4]],
           labels=[[0, 1], [0, 1]])

Previous Behavior:

In [5]: idx.map(lambda x: x * 2)
Out[5]: array([2, 4])

In [6]: idx.map(lambda x: (x, x * 2))
Out[6]: array([(1, 2), (2, 4)], dtype=object)

In [7]: mi.map(lambda x: x)
Out[7]: array([(1, 2), (2, 4)], dtype=object)

In [8]: mi.map(lambda x: x[0])
Out[8]: array([1, 2])

New Behavior:

In [72]: idx.map(lambda x: x * 2)
Out[72]: Int64Index([2, 4], dtype='int64')

In [73]: idx.map(lambda x: (x, x * 2))
Out[73]: 
MultiIndex(levels=[[1, 2], [2, 4]],
           labels=[[0, 1], [0, 1]])

In [74]: mi.map(lambda x: x)
Out[74]: 
MultiIndex(levels=[[1, 2], [2, 4]],
           labels=[[0, 1], [0, 1]])

In [75]: mi.map(lambda x: x[0])
Out[75]: Int64Index([1, 2], dtype='int64')

map on a Series with datetime64 values may return int64 dtypes rather than int32

In [76]: s = Series(date_range('2011-01-02T00:00', '2011-01-02T02:00', freq='H').tz_localize('Asia/Tokyo'))

In [77]: s
Out[77]: 
0   2011-01-02 00:00:00+09:00
1   2011-01-02 01:00:00+09:00
2   2011-01-02 02:00:00+09:00
dtype: datetime64[ns, Asia/Tokyo]

Previous Behavior:

In [9]: s.map(lambda x: x.hour)
Out[9]:
0    0
1    1
2    2
dtype: int32

New Behavior:

In [78]: s.map(lambda x: x.hour)
Out[78]: 
0    0
1    1
2    2
dtype: int64
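For the Index case, the return types can be checked directly rather than read off the repr. A minimal sketch with an illustrative two-element index:

```python
import pandas as pd

idx = pd.Index([1, 2])

# Scalar results are collected into an Index.
doubled = idx.map(lambda x: x * 2)

# Tuple results are collected into a MultiIndex.
pairs = idx.map(lambda x: (x, x * 2))

print(type(doubled).__name__, list(doubled))
print(type(pairs).__name__, list(pairs))
```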
Accessing datetime fields of Index now return Index¶

The datetime-related attributes (see here for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously returned numpy arrays. They will now return a new Index object, except in the case of a boolean field, where the result will still be a boolean ndarray. (GH15022)

Previous Behavior:

In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [2]: idx.hour
Out[2]: array([ 0, 10, 20,  6, 16], dtype=int32)

New Behavior:

In [79]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [80]: idx.hour
Out[80]: Int64Index([0, 10, 20, 6, 16], dtype='int64')

This has the advantage that specific Index methods are still available on the result. On the other hand, this might have backward incompatibilities: e.g. compared to numpy arrays, Index objects are not mutable. To get the original ndarray, you can always convert explicitly using np.asarray(idx.hour).
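The immutability point can be sketched as follows. The timestamps are illustrative, and the array is copied before modification so that the sketch does not depend on whether np.asarray returns a view:

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(["2015-01-01 00:00",
                        "2015-01-01 10:00",
                        "2015-01-01 20:00"])

hours = idx.hour  # an Index, not an ndarray

# Index objects do not support item assignment.
try:
    hours[0] = 5
    mutated = True
except TypeError:
    mutated = False

# Convert explicitly (and copy) to get a mutable ndarray.
hours_arr = np.asarray(hours).copy()
hours_arr[0] = 5

print(mutated, hours_arr)
```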

pd.unique will now be consistent with extension types¶

In prior versions, using Series.unique() and pandas.unique() on Categorical and tz-aware data-types would yield different return types. These are now made consistent. (GH15903)
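For Categorical data, the consistency can be checked with a short sketch. It assumes, as in the pandas versions this was written against, that both entry points return a Categorical; the sample values are illustrative:

```python
import pandas as pd

cat = pd.Categorical(list("baab"))

# Both the method and the top-level function should return the same type.
from_series = pd.Series(cat).unique()
from_function = pd.unique(cat)

print(type(from_series).__name__, list(from_series))
print(type(from_function).__name__, list(from_function))
```

In both cases the unique values are reported in order of appearance.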

S3 File Handling¶

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. (GH11915).

Partial String Indexing Changes¶

DatetimeIndex Partial String Indexing now works as an exact match, provided that string resolution coincides with index resolution, including a case when both are seconds (GH14826). See Slice vs. Exact Match for details.

In [87]: df = DataFrame({'a': [1, 2, 3]}, DatetimeIndex(['2011-12-31 23:59:59',
   ....:                                                 '2012-01-01 00:00:00',
   ....:                                                 '2012-01-01 00:00:01']))
   ....: 

Previous Behavior:

In [4]: df['2011-12-31 23:59:59']
Out[4]:
                       a
2011-12-31 23:59:59  1

In [5]: df['a']['2011-12-31 23:59:59']
Out[5]:
2011-12-31 23:59:59    1
Name: a, dtype: int64

New Behavior:

In [4]: df['2011-12-31 23:59:59']
KeyError: '2011-12-31 23:59:59'

In [5]: df['a']['2011-12-31 23:59:59']
Out[5]: 1
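The new behavior can be wrapped in a self-contained sketch. The use of .loc at the end is an addition for illustration: it shows one way to still select the row by label when the string matches the index resolution exactly.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]},
                  index=pd.DatetimeIndex(["2011-12-31 23:59:59",
                                          "2012-01-01 00:00:00",
                                          "2012-01-01 00:00:01"]))
key = "2011-12-31 23:59:59"

# On a Series, an exact-resolution string is an exact match: a scalar.
value = df["a"][key]

# On a DataFrame, [] looks the string up among the columns and fails.
try:
    df[key]
    frame_lookup_failed = False
except KeyError:
    frame_lookup_failed = True

# Label-based row selection is still available through .loc.
row = df.loc[key]

print(value, frame_lookup_failed, row["a"])
```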
Concat of different float dtypes will not automatically upcast¶

Previously, concat of multiple objects with different float dtypes would automatically upcast results to a dtype of float64. Now the smallest acceptable dtype will be used (GH13247)

In [88]: df1 = pd.DataFrame(np.array([1.0], dtype=np.float32, ndmin=2))

In [89]: df1.dtypes
Out[89]: 
0    float32
dtype: object

In [90]: df2 = pd.DataFrame(np.array([np.nan], dtype=np.float32, ndmin=2))

In [91]: df2.dtypes
Out[91]: 
0    float32
dtype: object

Previous Behavior:

In [7]: pd.concat([df1, df2]).dtypes
Out[7]:
0    float64
dtype: object

New Behavior:

In [92]: pd.concat([df1, df2]).dtypes
Out[92]: 
0    float32
dtype: object
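A compact sketch of the rule, using illustrative single-cell frames: concatenating equal float dtypes keeps that dtype, while mixing float32 with float64 still yields float64.

```python
import numpy as np
import pandas as pd

f32 = pd.DataFrame(np.array([[1.0]], dtype=np.float32))
f64 = pd.DataFrame(np.array([[2.0]], dtype=np.float64))

# Same dtype on both sides: no upcast to float64.
same = pd.concat([f32, f32]).dtypes[0]

# Mixed dtypes: the result must be able to hold float64 values.
mixed = pd.concat([f32, f64]).dtypes[0]

print(same, mixed)
```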
Pandas Google BigQuery support has moved¶

pandas has split off Google BigQuery support into a separate package, pandas-gbq. You can install it with conda install pandas-gbq -c conda-forge or pip install pandas-gbq. The functionality of read_gbq() and DataFrame.to_gbq() remains the same with the currently released version of pandas-gbq=0.1.4. Documentation is now hosted here. (GH15347)

Memory Usage for Index is more Accurate¶

In previous versions, calling .memory_usage() on a pandas structure that has an index would only include the actual index values and not the structures that facilitate fast indexing. The result will generally be different for Index and MultiIndex, and less so for other index types. (GH15237)

Previous Behavior:

In [8]: index = Index(['foo', 'bar', 'baz'])

In [9]: index.memory_usage(deep=True)
Out[9]: 180

In [10]: index.get_loc('foo')
Out[10]: 0

In [11]: index.memory_usage(deep=True)
Out[11]: 180

New Behavior:

In [8]: index = Index(['foo', 'bar', 'baz'])

In [9]: index.memory_usage(deep=True)
Out[9]: 180

In [10]: index.get_loc('foo')
Out[10]: 0

In [11]: index.memory_usage(deep=True)
Out[11]: 260

DataFrame.sort_index changes¶

In certain cases, calling .sort_index() on a MultiIndexed DataFrame would return the same DataFrame without seeming to sort. This would happen with lexsorted but non-monotonic levels. (GH15622, GH15687, GH14015, GH13431, GH15797)

This is unchanged from prior versions, but shown for illustration purposes:

In [93]: df = DataFrame(np.arange(6), columns=['value'], index=MultiIndex.from_product([list('BA'), range(3)]))

In [94]: df
Out[94]: 
     value
B 0      0
  1      1
  2      2
A 0      3
  1      4
  2      5

In [95]: df.index.is_lexsorted()
Out[95]: False

In [96]: df.index.is_monotonic
Out[96]: False

Sorting works as expected

In [97]: df.sort_index()
Out[97]: 
     value
A 0      3
  1      4
  2      5
B 0      0
  1      1
  2      2

In [98]: df.sort_index().index.is_lexsorted()
Out[98]: True

In [99]: df.sort_index().index.is_monotonic
Out[99]: True

However, this example, which has a non-monotonic 2nd level, doesn’t behave as desired.

In [100]: df = pd.DataFrame(
   .....:         {'value': [1, 2, 3, 4]},
   .....:          index=pd.MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
   .....:                              labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
   .....: 

In [101]: df
Out[101]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

Previous Behavior:

In [11]: df.sort_index()
Out[11]:
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [14]: df.sort_index().index.is_lexsorted()
Out[14]: True

In [15]: df.sort_index().index.is_monotonic
Out[15]: False

New Behavior:

In [102]: df.sort_index()
Out[102]: 
      value
a aa      2
  bb      1
b aa      4
  bb      3

In [103]: df.sort_index().index.is_lexsorted()
Out[103]: True

In [104]: df.sort_index().index.is_monotonic
Out[104]: True
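The expected outcome can be expressed as a small check. Note the assumptions: this sketch builds the index with MultiIndex.from_tuples, which creates sorted levels and therefore does not reproduce the non-monotonic-levels case itself, and it uses the is_monotonic_increasing attribute rather than is_monotonic. It only illustrates what a correctly sorted result looks like:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("a", "bb"), ("a", "aa"),
                                   ("b", "bb"), ("b", "aa")])
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

sorted_df = df.sort_index()

# After sorting, the labels are in lexicographic order and the index is monotonic.
print(list(sorted_df.index))
print(sorted_df.index.is_monotonic_increasing)
```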
Groupby Describe Formatting¶

The output formatting of groupby.describe() now labels the describe() metrics in the columns instead of the index. This format is consistent with groupby.agg() when applying multiple functions at once. (GH4792)

Previous Behavior:

In [1]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [2]: df.groupby('A').describe()
Out[2]:
                B
A
1 count  2.000000
  mean   1.500000
  std    0.707107
  min    1.000000
  25%    1.250000
  50%    1.500000
  75%    1.750000
  max    2.000000
2 count  2.000000
  mean   3.500000
  std    0.707107
  min    3.000000
  25%    3.250000
  50%    3.500000
  75%    3.750000
  max    4.000000

In [3]: df.groupby('A').agg([np.mean, np.std, np.min, np.max])
Out[3]:
     B
  mean       std amin amax
A
1  1.5  0.707107    1    2
2  3.5  0.707107    3    4

New Behavior:

In [105]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [106]: df.groupby('A').describe()
Out[106]: 
      B                                          
  count mean       std  min   25%  50%   75%  max
A                                                
1   2.0  1.5  0.707107  1.0  1.25  1.5  1.75  2.0
2   2.0  3.5  0.707107  3.0  3.25  3.5  3.75  4.0

In [107]: df.groupby('A').agg([np.mean, np.std, np.min, np.max])
Out[107]: 
     B                    
  mean       std amin amax
A                         
1  1.5  0.707107    1    2
2  3.5  0.707107    3    4
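With the metrics in the columns, individual statistics can be selected through the column MultiIndex. A minimal sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})
desc = df.groupby("A").describe()

# One statistic for every group: select a (column, metric) pair.
b_means = desc[("B", "mean")]

# Every statistic for one group: select a row and the top-level column.
group_1 = desc.loc[1, "B"]

print(b_means)
print(group_1["max"])
```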
Window Binary Corr/Cov operations return a MultiIndex DataFrame¶

A binary window operation, like .corr() or .cov(), when operating on a .rolling(..), .expanding(..), or .ewm(..) object, will now return a 2-level MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here. These are equivalent in function, but a MultiIndexed DataFrame enjoys more support in pandas. See the section on Windowed Binary Operations for more information. (GH15677)

In [108]: np.random.seed(1234)

In [109]: df = pd.DataFrame(np.random.rand(100, 2),
   .....:                   columns=pd.Index(['A', 'B'], name='bar'),
   .....:                   index=pd.date_range('20160101',
   .....:                                       periods=100, freq='D', name='foo'))
   .....: 

In [110]: df.tail()
Out[110]: 
bar                A         B
foo                           
2016-04-05  0.640880  0.126205
2016-04-06  0.171465  0.737086
2016-04-07  0.127029  0.369650
2016-04-08  0.604334  0.103104
2016-04-09  0.802374  0.945553

Previous Behavior:

In [2]: df.rolling(12).corr()
Out[2]:
<class 'pandas.core.panel.Panel'>
Dimensions: 100 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2016-01-01 00:00:00 to 2016-04-09 00:00:00
Major_axis axis: A to B
Minor_axis axis: A to B

New Behavior:

In [111]: res = df.rolling(12).corr()

In [112]: res.tail()
Out[112]: 
bar                    A         B
foo        bar                    
2016-04-07 B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

Retrieving a correlation matrix for a cross-section

In [113]: df.rolling(12).corr().loc['2016-04-07']
Out[113]: 
bar                   A        B
foo        bar                  
2016-04-07 A    1.00000 -0.13209
           B   -0.13209  1.00000
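
Because the result is an ordinary DataFrame, a single pairwise series can be pulled out with the usual MultiIndex tools. The following is a minimal sketch that rebuilds the df used above; the names ab and direct are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(100, 2),
                  columns=pd.Index(['A', 'B'], name='bar'),
                  index=pd.date_range('20160101', periods=100,
                                      freq='D', name='foo'))

res = df.rolling(12).corr()

# Column 'A' holds corr(A, A) and corr(A, B) for every date; keeping the
# rows whose inner index level is 'B' gives the A-vs-B correlation by date.
ab = res['A'].xs('B', level='bar')

# The same series computed directly from the two columns, for comparison.
direct = df['A'].rolling(12).corr(df['B'])
```

Both ab and direct are indexed by date only, so either can be plotted or joined back onto df.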
HDFStore where string comparison¶

In previous versions, most types could be compared to a string column in an HDFStore, usually resulting in an invalid comparison that returned an empty result frame. These comparisons will now raise a TypeError (GH15492).

In [114]: df = pd.DataFrame({'unparsed_date': ['2014-01-01', '2014-01-01']})

In [115]: df.to_hdf('store.h5', 'key', format='table', data_columns=True)

In [116]: df.dtypes
Out[116]: 
unparsed_date    object
dtype: object

Previous Behavior:

In [4]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')
File "<string>", line 1
  (unparsed_date > 1970-01-01 00:00:01.388552400)
                        ^
SyntaxError: invalid token

New Behavior:

In [18]: ts = pd.Timestamp('2014-01-01')

In [19]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')
TypeError: Cannot compare 2014-01-01 00:00:00 of
type <class 'pandas.tslib.Timestamp'> to string column
Index.intersection and inner join now preserve the order of the left Index¶

Index.intersection() now preserves the order of the calling Index (left) instead of the other Index (right) (GH15582). This affects inner joins, DataFrame.join() and merge(), and the .align method.
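
As a small illustration of the ordering rule, the sketch below intersects an unsorted Index with a sorted one. The labels are made up for the example.

```python
import pandas as pd

left = pd.Index(['c', 'a', 'b'])
right = pd.Index(['a', 'b', 'c', 'd'])

# The shared labels come back in the order of the calling (left) Index,
# not in the order of the argument.
common = left.intersection(right)
```

Here common lists the labels as 'c', 'a', 'b', following left.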

Pivot Table always returns a DataFrame¶

The documentation for pivot_table() states that a DataFrame is always returned. This release fixes a bug that allowed it to return a Series under certain circumstances. (GH4386)

In [127]: df = pd.DataFrame({'col1': [3, 4, 5],
   .....:                    'col2': ['C', 'D', 'E'],
   .....:                    'col3': [1, 3, 9]})
   .....: 

In [128]: df
Out[128]: 
   col1 col2  col3
0     3    C     1
1     4    D     3
2     5    E     9

Previous Behavior:

In [2]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum)
Out[2]:
col3  col2
1     C       3
3     D       4
9     E       5
Name: col1, dtype: int64

New Behavior:

In [129]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum)
Out[129]: 
           col1
col3 col2      
1    C        3
3    D        4
9    E        5
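
Code that relied on the old Series result can select the value column explicitly. This is a minimal sketch using the same data as above; passing the aggregation as the string 'sum' is a choice made for this example.

```python
import pandas as pd

df = pd.DataFrame({'col1': [3, 4, 5],
                   'col2': ['C', 'D', 'E'],
                   'col3': [1, 3, 9]})

# pivot_table returns a DataFrame, even with a single value column.
table = df.pivot_table(values='col1', index=['col3', 'col2'], aggfunc='sum')

# Selecting that column recovers the Series shape of the old behavior.
series = table['col1']
```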
Other API Changes¶

Reorganization of the library: Privacy Changes¶

Modules Privacy Has Changed¶

Some formerly public Python/C/C++/Cython extension modules have been moved and/or renamed. These are all removed from the public API. Furthermore, the pandas.core, pandas.compat, and pandas.util top-level modules are now considered to be PRIVATE. If indicated, a deprecation warning will be issued if you reference these modules. (GH12588)

Previous Location     New Location                        Deprecated
pandas.lib            pandas._libs.lib                    X
pandas.tslib          pandas._libs.tslib                  X
pandas.computation    pandas.core.computation             X
pandas.msgpack        pandas.io.msgpack
pandas.index          pandas._libs.index
pandas.algos          pandas._libs.algos
pandas.hashtable      pandas._libs.hashtable
pandas.indexes        pandas.core.indexes
pandas.json           pandas._libs.json / pandas.io.json  X
pandas.parser         pandas._libs.parsers                X
pandas.formats        pandas.io.formats
pandas.sparse         pandas.core.sparse
pandas.tools          pandas.core.reshape                 X
pandas.types          pandas.core.dtypes                  X
pandas.io.sas.saslib  pandas.io.sas._sas
pandas._join          pandas._libs.join
pandas._hash          pandas._libs.hashing
pandas._period        pandas._libs.period
pandas._sparse        pandas._libs.sparse
pandas._testing       pandas._libs.testing
pandas._window        pandas._libs.window

Some new subpackages are created with public functionality that is not directly exposed in the top-level namespace: pandas.errors, pandas.plotting and pandas.testing (more details below). Together with pandas.api.types and certain functions in the pandas.io and pandas.tseries submodules, these are now the public subpackages.

Further changes:

pandas.errors¶

We are adding a standard public module for all pandas exceptions and warnings, pandas.errors (GH14800). Previously these exceptions and warnings could be imported from pandas.core.common or pandas.io.common. They will be removed from the *.common locations in a future release. (GH15541)

The following are now part of this API:

['DtypeWarning',
 'EmptyDataError',
 'OutOfBoundsDatetime',
 'ParserError',
 'ParserWarning',
 'PerformanceWarning',
 'UnsortedIndexError',
 'UnsupportedFunctionCall']
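
A short sketch of importing from the new location: it catches EmptyDataError raised by read_csv() on empty input. The variable names are illustrative.

```python
from io import StringIO

import pandas as pd
from pandas.errors import EmptyDataError

caught = False
try:
    # An empty buffer has no columns to parse.
    pd.read_csv(StringIO(''))
except EmptyDataError:
    caught = True
```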
pandas.plotting¶

A new public pandas.plotting module has been added that holds plotting functionality that was previously in either pandas.tools.plotting or in the top-level namespace. See the deprecations sections for more details.

Other Development Changes¶

Deprecations¶

Deprecate .ix¶

The .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers. .ix offers a lot of magic on the inference of what the user wants to do. To wit, .ix can decide to index positionally OR via labels, depending on the data type of the index. This has caused quite a bit of user confusion over the years. The full indexing documentation is here. (GH14218)

The recommended methods of indexing are .loc for label-based indexing and .iloc for positional indexing.

Using .ix will now show a DeprecationWarning with a link to some examples of how to convert code here.

In [130]: df = pd.DataFrame({'A': [1, 2, 3],
   .....:                    'B': [4, 5, 6]},
   .....:                   index=list('abc'))
   .....: 

In [131]: df
Out[131]: 
   A  B
a  1  4
b  2  5
c  3  6

Previous Behavior, where you wish to get the 0th and the 2nd elements from the index in the ‘A’ column.

In [3]: df.ix[[0, 2], 'A']
Out[3]:
a    1
c    3
Name: A, dtype: int64

Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.

In [132]: df.loc[df.index[[0, 2]], 'A']
Out[132]: 
a    1
c    3
Name: A, dtype: int64

Using .iloc. Here we will get the location of the ‘A’ column, then use positional indexing to select things.

In [133]: df.iloc[[0, 2], df.columns.get_loc('A')]
Out[133]: 
a    1
c    3
Name: A, dtype: int64
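
The ambiguity that motivated the deprecation is easiest to see with an integer index, where a label and a position can be the same number. The sketch below uses a made-up Series to show that .loc and .iloc each give one unambiguous answer.

```python
import pandas as pd

# Integer labels that do not match the positions.
s = pd.Series(['x', 'y', 'z'], index=[2, 1, 0])

by_label = s.loc[0]      # the row labelled 0, which is the last row
by_position = s.iloc[0]  # the first row, whatever its label
```

With .loc the 0 always means the label and with .iloc it always means the position, so the reader never has to guess which rule applied.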
Deprecate Panel¶

Panel is deprecated and will be removed in a future version. The recommended ways to represent 3-D data are a MultiIndex on a DataFrame, via the to_frame() method, or the xarray package. Pandas provides a to_xarray() method to automate this conversion. For more details see the Deprecate Panel documentation. (GH13563)

In [134]: p = tm.makePanel()

In [135]: p
Out[135]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

Convert to a MultiIndex DataFrame

In [136]: p.to_frame()
Out[136]: 
                     ItemA     ItemB     ItemC
major      minor                              
2000-01-03 A      0.628776 -1.409432  0.209395
           B      0.988138 -1.347533 -0.896581
           C     -0.938153  1.272395 -0.161137
           D     -0.223019 -0.591863 -1.051539
2000-01-04 A      0.186494  1.422986 -0.592886
           B     -0.072608  0.363565  1.104352
           C     -1.239072 -1.449567  0.889157
           D      2.123692 -0.414505 -0.319561
2000-01-05 A      0.952478 -2.147855 -1.473116
           B     -0.550603 -0.014752 -0.431550
           C      0.139683 -1.195524  0.288377
           D      0.122273 -1.425795 -0.619993

Convert to an xarray DataArray

In [137]: p.to_xarray()
Out[137]: 
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.628776,  0.988138, -0.938153, -0.223019],
        [ 0.186494, -0.072608, -1.239072,  2.123692],
        [ 0.952478, -0.550603,  0.139683,  0.122273]],

       [[-1.409432, -1.347533,  1.272395, -0.591863],
        [ 1.422986,  0.363565, -1.449567, -0.414505],
        [-2.147855, -0.014752, -1.195524, -1.425795]],

       [[ 0.209395, -0.896581, -0.161137, -1.051539],
        [-0.592886,  1.104352,  0.889157, -0.319561],
        [-1.473116, -0.43155 ,  0.288377, -0.619993]]])
Coordinates:
  * items       (items) object 'ItemA' 'ItemB' 'ItemC'
  * major_axis  (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
  * minor_axis  (minor_axis) object 'A' 'B' 'C' 'D'
Deprecate groupby.agg() with a dictionary when renaming¶

The .groupby(..).agg(..), .rolling(..).agg(..), and .resample(..).agg(..) syntax can accept a variety of inputs, including scalars, lists, and a dict of column names to scalars or lists. This provides a useful syntax for constructing multiple (potentially different) aggregations.

However, .agg(..) can also accept a dict that allows ‘renaming’ of the result columns. This is a complicated and confusing syntax, and it is not consistent between Series and DataFrame. We are deprecating this ‘renaming’ functionality.

This is an illustrative example:

In [138]: df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
   .....:                    'B': range(5),
   .....:                    'C': range(5)})
   .....: 

In [139]: df
Out[139]: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  2  3  3
4  2  4  4

Here is a typical, useful syntax for computing different aggregations for different columns. Each specified column is aggregated with the function, or list of functions, mapped to it. Mapping a column to a list of functions returns MultiIndex columns; mapping it to a single function, as here, returns flat columns. This usage is not deprecated.

In [140]: df.groupby('A').agg({'B': 'sum', 'C': 'min'})
Out[140]: 
   B  C
A      
1  3  0
2  7  3
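
When a column is mapped to a list of functions, the result carries MultiIndex columns. The sketch below shows that form on the same data and one possible way to flatten the column labels; the underscore-joined names are a convention chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# Lists of functions per column give two-level column labels,
# for example ('B', 'sum') and ('B', 'mean').
res = df.groupby('A').agg({'B': ['sum', 'mean'], 'C': ['min']})
two_level = res.columns.nlevels

# Flatten the labels into plain strings such as 'B_sum'.
res.columns = ['_'.join(col) for col in res.columns]
```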

Here’s an example of the first deprecation, passing a dict to a grouped Series. This is a combination aggregation & renaming:

In [6]: df.groupby('A').B.agg({'foo': 'count'})
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version

Out[6]:
   foo
A
1    3
2    2

You can accomplish the same operation more idiomatically with:

In [141]: df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})
Out[141]: 
   foo
A     
1    3
2    2

Here’s an example of the second deprecation, passing a dict-of-dict to a grouped DataFrame:

In [23]: (df.groupby('A')
            .agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
         )
FutureWarning: using a dict with renaming is deprecated and
will be removed in a future version

Out[23]:
     B   C
   foo bar
A
1   3   0
2   7   3

You can accomplish nearly the same by:

In [142]: (df.groupby('A')
   .....:    .agg({'B': 'sum', 'C': 'min'})
   .....:    .rename(columns={'B': 'foo', 'C': 'bar'})
   .....: )
   .....: 
Out[142]: 
   foo  bar
A          
1    3    0
2    7    3
Deprecate .plotting¶

The pandas.tools.plotting module has been deprecated, in favor of the top level pandas.plotting module. All the public plotting functions are now available from pandas.plotting (GH12548).

Furthermore, the top-level pandas.scatter_matrix and pandas.plot_params are deprecated. Users can import these from pandas.plotting as well.

Previous script:

pd.tools.plotting.scatter_matrix(df)
pd.scatter_matrix(df)

Should be changed to:

pd.plotting.scatter_matrix(df)
Other Deprecations¶

Removal of prior version deprecations/changes¶

Performance Improvements¶

Bug Fixes¶

Conversion¶

Indexing¶

I/O¶

Plotting¶

Groupby/Resample/Rolling¶

Sparse¶

Reshaping¶

Numeric¶

Other¶

v0.19.2 (December 24, 2016)¶

This is a minor bug-fix release in the 0.19.x series and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

Enhancements¶

The pd.merge_asof(), added in 0.19.0, gained some improvements:

Performance Improvements¶

Bug Fixes¶

v0.19.1 (November 3, 2016)¶

This is a minor bug-fix release from 0.19.0 and includes some small regression fixes, bug fixes and performance improvements. We recommend that all users upgrade to this version.

Performance Improvements¶

Bug Fixes¶

v0.19.0 (October 2, 2016)¶

This is a major release from 0.18.1 and includes a number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

Warning

pandas >= 0.19.0 will no longer silence numpy ufunc warnings upon import, see here.

New features¶

merge_asof for asof-style time-series joining¶

A long-requested feature has been added through the merge_asof() function, to support asof-style joining of time-series (GH1870, GH13695, GH13709, GH13902). Full documentation is here.

merge_asof() performs an asof merge, which is similar to a left join except that it matches on the nearest key rather than on equal keys.

In [1]: left = pd.DataFrame({'a': [1, 5, 10],
   ...:                      'left_val': ['a', 'b', 'c']})
   ...: 

In [2]: right = pd.DataFrame({'a': [1, 2, 3, 6, 7],
   ...:                      'right_val': [1, 2, 3, 6, 7]})
   ...: 

In [3]: left
Out[3]: 
    a left_val
0   1        a
1   5        b
2  10        c

In [4]: right
Out[4]: 
   a  right_val
0  1          1
1  2          2
2  3          3
3  6          6
4  7          7

We typically want to match exactly when possible, and use the most recent value otherwise.

In [5]: pd.merge_asof(left, right, on='a')
Out[5]: 
    a left_val  right_val
0   1        a          1
1   5        b          3
2  10        c          7

We can also match rows only with strictly prior data, excluding exact matches.

In [6]: pd.merge_asof(left, right, on='a', allow_exact_matches=False)
Out[6]: 
    a left_val  right_val
0   1        a        NaN
1   5        b        3.0
2  10        c        7.0

In a typical time-series example, we have trades and quotes and we want to asof-join them. This also illustrates using the by parameter to group data before merging.

In [7]: trades = pd.DataFrame({
   ...:     'time': pd.to_datetime(['20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.038',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.048']),
   ...:     'ticker': ['MSFT', 'MSFT',
   ...:                'GOOG', 'GOOG', 'AAPL'],
   ...:     'price': [51.95, 51.95,
   ...:               720.77, 720.92, 98.00],
   ...:     'quantity': [75, 155,
   ...:                  100, 100, 100]},
   ...:     columns=['time', 'ticker', 'price', 'quantity'])
   ...: 

In [8]: quotes = pd.DataFrame({
   ...:     'time': pd.to_datetime(['20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.030',
   ...:                             '20160525 13:30:00.041',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.049',
   ...:                             '20160525 13:30:00.072',
   ...:                             '20160525 13:30:00.075']),
   ...:     'ticker': ['GOOG', 'MSFT', 'MSFT',
   ...:                'MSFT', 'GOOG', 'AAPL', 'GOOG',
   ...:                'MSFT'],
   ...:     'bid': [720.50, 51.95, 51.97, 51.99,
   ...:             720.50, 97.99, 720.50, 52.01],
   ...:     'ask': [720.93, 51.96, 51.98, 52.00,
   ...:             720.93, 98.01, 720.88, 52.03]},
   ...:     columns=['time', 'ticker', 'bid', 'ask'])
   ...: 
In [9]: trades
Out[9]: 
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

In [10]: quotes
Out[10]: 
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

An asof merge joins on the on field, typically an ordered datetime-like field; in this case we also use a grouper in the by field. This is like a left outer join, except that forward filling happens automatically, taking the most recent non-NaN value.

In [11]: pd.merge_asof(trades, quotes,
   ....:               on='time',
   ....:               by='ticker')
   ....: 
Out[11]: 
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

This returns a merged DataFrame with the entries in the same order as the left DataFrame that was passed in (trades in this case), with the fields of quotes merged in.
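
merge_asof() also takes a tolerance argument that limits how far back a match may reach. The sketch below uses small integer keys, invented for this example, so the effect is easy to check by hand.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 5, 10],
                     'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'a': [1, 2, 4, 6, 7],
                      'right_val': [1, 2, 4, 6, 7]})

# Key 1 matches key 1 exactly and key 5 matches key 4 (distance 1).
# The closest prior key for 10 is 7 (distance 3), which is outside
# the tolerance of 2, so that row gets NaN.
result = pd.merge_asof(left, right, on='a', tolerance=2)
```

For datetime keys the tolerance would be a Timedelta rather than an integer.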

.rolling() is now time-series aware¶

.rolling() objects are now time-series aware and can accept a time-series offset (or convertible) for the window argument (GH13327, GH12995). See the full documentation here.

In [12]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
   ....:                    index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))
   ....: 

In [13]: dft
Out[13]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  2.0
2013-01-01 09:00:03  NaN
2013-01-01 09:00:04  4.0

This is a regular frequency index. Using an integer window parameter works to roll along the window frequency.

In [14]: dft.rolling(2).sum()
Out[14]: 
                       B
2013-01-01 09:00:00  NaN
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  3.0
2013-01-01 09:00:03  NaN
2013-01-01 09:00:04  NaN

In [15]: dft.rolling(2, min_periods=1).sum()
Out[15]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  3.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:04  4.0

Specifying an offset allows a more intuitive specification of the rolling frequency.

In [16]: dft.rolling('2s').sum()
Out[16]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:01  1.0
2013-01-01 09:00:02  3.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:04  4.0

With a non-regular but still monotonic index, rolling with an integer window does not perform any time-aware calculation.

In [17]: dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
   ....:                    index=pd.Index([pd.Timestamp('20130101 09:00:00'),
   ....:                                    pd.Timestamp('20130101 09:00:02'),
   ....:                                    pd.Timestamp('20130101 09:00:03'),
   ....:                                    pd.Timestamp('20130101 09:00:05'),
   ....:                                    pd.Timestamp('20130101 09:00:06')],
   ....:                                   name='foo'))
   ....: 

In [18]: dft
Out[18]: 
                       B
foo                     
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0

In [19]: dft.rolling(2).sum()
Out[19]: 
                       B
foo                     
2013-01-01 09:00:00  NaN
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  NaN

Using the time-specification generates variable windows for this sparse data.

In [20]: dft.rolling('2s').sum()
Out[20]: 
                       B
foo                     
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0

Furthermore, we now allow an optional on parameter to specify a column (rather than the default of the index) in a DataFrame.

In [21]: dft = dft.reset_index()

In [22]: dft
Out[22]: 
                  foo    B
0 2013-01-01 09:00:00  0.0
1 2013-01-01 09:00:02  1.0
2 2013-01-01 09:00:03  2.0
3 2013-01-01 09:00:05  NaN
4 2013-01-01 09:00:06  4.0

In [23]: dft.rolling('2s', on='foo').sum()
Out[23]: 
                  foo    B
0 2013-01-01 09:00:00  0.0
1 2013-01-01 09:00:02  1.0
2 2013-01-01 09:00:03  3.0
3 2013-01-01 09:00:05  NaN
4 2013-01-01 09:00:06  4.0
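
To make the variable-window behavior concrete, the sketch below recomputes the '2s' rolling sum by hand on the irregular timestamps used above. It assumes each window covers the interval (t - 2s, t], that is, open on the left and closed on the right; the helper names are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2013-01-01 09:00:00', '2013-01-01 09:00:02',
                      '2013-01-01 09:00:03', '2013-01-01 09:00:05',
                      '2013-01-01 09:00:06'])
s = pd.Series([0, 1, 2, np.nan, 4], index=idx)

rolled = s.rolling('2s').sum()

# Manual version: for each timestamp t, sum the observations that fall
# in (t - 2s, t]. min_count=1 keeps NaN when the window has no valid data.
window = pd.Timedelta('2s')
manual = pd.Series(
    [s[(s.index > t - window) & (s.index <= t)].sum(min_count=1)
     for t in s.index],
    index=s.index)
```

The number of rows in each window varies with the spacing of the timestamps, which is what distinguishes an offset window from an integer one.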
read_csv has improved support for duplicate column names¶

Duplicate column names are now supported in read_csv(), whether they are in the file or passed in as the names parameter (GH7160, GH9424).

In [24]: data = '0,1,2\n3,4,5'

In [25]: names = ['a', 'b', 'a']

Previous behavior:

In [2]: pd.read_csv(StringIO(data), names=names)
Out[2]:
   a  b  a
0  2  1  2
1  5  4  5

The first a column contained the same data as the second a column, when it should have contained the values [0, 3].

New behavior:

In [26]: pd.read_csv(StringIO(data), names=names)
Out[26]: 
   a  b  a.1
0  0  1    2
1  3  4    5
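
The same de-duplication applies when the repeated names come from the header row of the file itself. This is a minimal sketch with an inline CSV string invented for the example.

```python
from io import StringIO

import pandas as pd

# The header row repeats the name 'a'.
data = 'a,b,a\n0,1,2\n3,4,5'
df = pd.read_csv(StringIO(data))

# The second 'a' is renamed to 'a.1', so each column keeps its own data.
columns = list(df.columns)
```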
read_csv supports parsing Categorical directly¶

The read_csv() function now supports parsing a Categorical column when specified as a dtype (GH10153). Depending on the structure of the data, this can result in a faster parse time and lower memory usage compared to converting to Categorical after parsing. See the io docs here.

In [27]: data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

In [28]: pd.read_csv(StringIO(data))
Out[28]: 
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [29]: pd.read_csv(StringIO(data)).dtypes
Out[29]: 
col1    object
col2    object
col3     int64
dtype: object

In [30]: pd.read_csv(StringIO(data), dtype='category').dtypes
Out[30]: 
col1    category
col2    category
col3    category
dtype: object

Individual columns can be parsed as a Categorical using a dict specification

In [31]: pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes
Out[31]: 
col1    category
col2      object
col3       int64
dtype: object

Note

The resulting categories will always be parsed as strings (object dtype). If the categories are numeric they can be converted using the to_numeric() function, or as appropriate, another converter such as to_datetime().

In [32]: df = pd.read_csv(StringIO(data), dtype='category')

In [33]: df.dtypes
Out[33]: 
col1    category
col2    category
col3    category
dtype: object

In [34]: df['col3']
Out[34]: 
0    1
1    2
2    3
Name: col3, dtype: category
Categories (3, object): [1, 2, 3]

In [35]: df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)

In [36]: df['col3']
Out[36]: 
0    1
1    2
2    3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]
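
An alternative sketch for the same conversion uses the rename_categories() accessor method instead of assigning to .cat.categories. It is offered as an equivalent under the assumption that a non-mutating style is preferred; it is not the approach shown in the session above.

```python
from io import StringIO

import pandas as pd

data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'
df = pd.read_csv(StringIO(data), dtype='category')

# Convert the parsed categories to numbers, then swap them in.
# rename_categories returns a new categorical with the same codes.
numeric = pd.to_numeric(df['col3'].cat.categories)
df['col3'] = df['col3'].cat.rename_categories(numeric)
```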
Categorical Concatenation¶

Semi-Month Offsets¶

Pandas has gained new frequency offsets, SemiMonthEnd (‘SM’) and SemiMonthBegin (‘SMS’). These provide date offsets anchored (by default) to the 15th and end of month, and 15th and 1st of month respectively. (GH1543)

In [44]: from pandas.tseries.offsets import SemiMonthEnd, SemiMonthBegin

SemiMonthEnd:

In [45]: pd.Timestamp('2016-01-01') + SemiMonthEnd()
Out[45]: Timestamp('2016-01-15 00:00:00')

In [46]: pd.date_range('2015-01-01', freq='SM', periods=4)
Out[46]: DatetimeIndex(['2015-01-15', '2015-01-31', '2015-02-15', '2015-02-28'], dtype='datetime64[ns]', freq='SM-15')

SemiMonthBegin:

In [47]: pd.Timestamp('2016-01-01') + SemiMonthBegin()
Out[47]: Timestamp('2016-01-15 00:00:00')

In [48]: pd.date_range('2015-01-01', freq='SMS', periods=4)
Out[48]: DatetimeIndex(['2015-01-01', '2015-01-15', '2015-02-01', '2015-02-15'], dtype='datetime64[ns]', freq='SMS-15')

Using the anchoring suffix, you can also specify the day of month to use instead of the 15th.

In [49]: pd.date_range('2015-01-01', freq='SMS-16', periods=4)
Out[49]: DatetimeIndex(['2015-01-01', '2015-01-16', '2015-02-01', '2015-02-16'], dtype='datetime64[ns]', freq='SMS-16')

In [50]: pd.date_range('2015-01-01', freq='SM-14', periods=4)
Out[50]: DatetimeIndex(['2015-01-14', '2015-01-31', '2015-02-14', '2015-02-28'], dtype='datetime64[ns]', freq='SM-14')
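
The offsets can also be added to or subtracted from a date that falls between two anchors. The sketch below uses an arbitrary mid-month date to show which anchor each operation lands on.

```python
import pandas as pd
from pandas.tseries.offsets import SemiMonthBegin, SemiMonthEnd

day = pd.Timestamp('2016-01-20')

# Adding moves forward to the next anchor date.
next_end = day + SemiMonthEnd()      # end of January
next_begin = day + SemiMonthBegin()  # first of February

# Subtracting moves back to the previous anchor date.
prev_begin = day - SemiMonthBegin()  # 15th of January
```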
New Index methods¶

The following methods and options are added to Index, to be more consistent with the Series and DataFrame API.

Index now supports the .where() function for same shape indexing (GH13170)

In [51]: idx = pd.Index(['a', 'b', 'c'])

In [52]: idx.where([True, False, True])
Out[52]: Index(['a', nan, 'c'], dtype='object')

Index now supports .dropna() to exclude missing values (GH6194)

In [53]: idx = pd.Index([1, 2, np.nan, 4])

In [54]: idx.dropna()
Out[54]: Float64Index([1.0, 2.0, 4.0], dtype='float64')

For MultiIndex, values are dropped if any level is missing by default. Specifying how='all' only drops values where all levels are missing.

In [55]: midx = pd.MultiIndex.from_arrays([[1, 2, np.nan, 4],
   ....:                                     [1, 2, np.nan, np.nan]])
   ....: 

In [56]: midx
Out[56]: 
MultiIndex(levels=[[1, 2, 4], [1, 2]],
           labels=[[0, 1, -1, 2], [0, 1, -1, -1]])

In [57]: midx.dropna()
Out[57]: 
MultiIndex(levels=[[1, 2, 4], [1, 2]],
           labels=[[0, 1], [0, 1]])

In [58]: midx.dropna(how='all')
Out[58]: 
MultiIndex(levels=[[1, 2, 4], [1, 2]],
           labels=[[0, 1, 2], [0, 1, -1]])

Index now supports .str.extractall() which returns a DataFrame, see the docs here (GH10008, GH13156)

In [59]: idx = pd.Index(["a1a2", "b1", "c1"])

In [60]: idx.str.extractall(r"[ab](?P<digit>\d)")
Out[60]: 
        digit
  match      
0 0         1
  1         2
1 0         1

Index.astype() now accepts an optional boolean argument copy, which allows optional copying if the requirements on dtype are satisfied (GH13209)

Google BigQuery Enhancements¶

Fine-grained numpy errstate¶

Previous versions of pandas would permanently silence numpy’s ufunc error handling when pandas was imported. Pandas did this in order to silence the warnings that would arise from using numpy ufuncs on missing data, which are usually represented as NaNs. Unfortunately, this silenced legitimate warnings arising in non-pandas code in the application. Starting with 0.19.0, pandas will use the numpy.errstate context manager to silence these warnings in a more fine-grained manner, only around where these operations are actually used in the pandas codebase. (GH13109, GH13145)

After upgrading pandas, you may see new RuntimeWarnings being issued from your code. These are likely legitimate, and the underlying cause likely existed in the code when using previous versions of pandas that simply silenced the warning. Use numpy.errstate around the source of the RuntimeWarning to control how these conditions are handled.
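
A minimal sketch of the numpy.errstate pattern mentioned above, using a made-up array: one block silences the floating-point conditions locally and another turns them into exceptions.

```python
import numpy as np

values = np.array([1.0, 0.0])

# Silence divide-by-zero and invalid-operation conditions only here.
with np.errstate(divide='ignore', invalid='ignore'):
    quiet = values / 0.0  # inf for 1/0, nan for 0/0

# Or make the same conditions raise, to find where they come from.
raised = False
try:
    with np.errstate(divide='raise', invalid='raise'):
        values / 0.0
except FloatingPointError:
    raised = True
```

Outside the with blocks the previous error-handling settings are restored, so the choice stays local to the code that needs it.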

get_dummies now returns integer dtypes¶

The pd.get_dummies function now returns dummy-encoded columns as small integers, rather than floats (GH8725). This should provide an improved memory footprint.

Previous behavior:

In [1]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes

Out[1]:
a    float64
b    float64
c    float64
dtype: object

New behavior:

In [61]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
Out[61]: 
a    uint8
b    uint8
c    uint8
dtype: object
Downcast values to smallest possible dtype in to_numeric¶

pd.to_numeric() now accepts a downcast parameter, which will downcast the data, if possible, to the smallest dtype of the specified numerical kind (GH13352).

In [62]: s = ['1', 2, 3]

In [63]: pd.to_numeric(s, downcast='unsigned')
Out[63]: array([1, 2, 3], dtype=uint8)

In [64]: pd.to_numeric(s, downcast='integer')
Out[64]: array([1, 2, 3], dtype=int8)
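
The chosen dtype depends on the values themselves. The sketch below, with inputs invented for the example, shows a case that needs a wider integer, a float downcast, and a case where the requested downcast cannot be applied.

```python
import pandas as pd

# 300 does not fit in int8, so the next size up is used.
wider = pd.to_numeric(pd.Series([1, 2, 300]), downcast='integer')

# Strings are parsed first, then the floats are downcast.
floats = pd.to_numeric(pd.Series(['1.5', '2']), downcast='float')

# A negative value rules out every unsigned dtype, so the data is
# returned with its original dtype.
signed = pd.to_numeric(pd.Series([-1, 2]), downcast='unsigned')
```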
pandas development API¶

As part of making the pandas API more uniform and accessible in the future, we have created a standard sub-package of pandas, pandas.api, to hold public APIs. We are starting by exposing type introspection functions in pandas.api.types. More sub-packages and officially sanctioned APIs will be published in future versions of pandas (GH13147, GH13634).

The following are now part of this API:

In [65]: import pprint

In [66]: from pandas.api import types

In [67]: funcs = [ f for f in dir(types) if not f.startswith('_') ]

In [68]: pprint.pprint(funcs)
['CategoricalDtype',
 'DatetimeTZDtype',
 'IntervalDtype',
 'PeriodDtype',
 'infer_dtype',
 'is_any_int_dtype',
 'is_bool',
 'is_bool_dtype',
 'is_categorical',
 'is_categorical_dtype',
 'is_complex',
 'is_complex_dtype',
 'is_datetime64_any_dtype',
 'is_datetime64_dtype',
 'is_datetime64_ns_dtype',
 'is_datetime64tz_dtype',
 'is_datetimetz',
 'is_dict_like',
 'is_dtype_equal',
 'is_extension_type',
 'is_file_like',
 'is_float',
 'is_float_dtype',
 'is_floating_dtype',
 'is_hashable',
 'is_int64_dtype',
 'is_integer',
 'is_integer_dtype',
 'is_interval',
 'is_interval_dtype',
 'is_iterator',
 'is_list_like',
 'is_named_tuple',
 'is_number',
 'is_numeric_dtype',
 'is_object_dtype',
 'is_period',
 'is_period_dtype',
 'is_re',
 'is_re_compilable',
 'is_scalar',
 'is_sequence',
 'is_signed_integer_dtype',
 'is_sparse',
 'is_string_dtype',
 'is_timedelta64_dtype',
 'is_timedelta64_ns_dtype',
 'is_unsigned_integer_dtype',
 'pandas_dtype',
 'union_categoricals']

Note

Calling these functions from the internal module pandas.core.common will now show a DeprecationWarning (GH13990)
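
As a usage sketch, two of the functions listed above can be imported from the public location and used to inspect a small, made-up frame.

```python
import pandas as pd
from pandas.api.types import is_list_like, is_numeric_dtype

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [1.5, 2.5]})

# Keep the names of the columns whose dtype is numeric.
numeric_cols = [col for col in df.columns if is_numeric_dtype(df[col])]

# is_list_like treats a list as list-like but a plain string as scalar.
list_check = is_list_like([1, 2])
str_check = is_list_like('abc')
```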

Other enhancements¶

API changes¶

Series.tolist() will now return Python types¶

Series.tolist() will now return Python types in the output, mimicking NumPy .tolist() behavior (GH10904)

In [78]: s = pd.Series([1,2,3])

Previous behavior:

In [7]: type(s.tolist()[0])
Out[7]:
 <class 'numpy.int64'>

New behavior:

In [79]: type(s.tolist()[0])
Out[79]: int
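
One practical consequence is that the list can be handed to code that expects built-in types. The sketch below passes it to the standard-library json module, chosen here only as an example consumer.

```python
import json

import pandas as pd

s = pd.Series([1, 2, 3])
values = s.tolist()

# The elements are plain Python ints, which json can serialize directly.
encoded = json.dumps(values)
```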
Series operators for different indexes¶

The following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538).

Warning

Until 0.18.1, comparing Series of the same length would succeed even if their .index values differed (the result ignored .index). As of 0.19.0, this raises a ValueError to be stricter. This section also describes how to keep the previous behavior or align different indexes, using the flexible comparison methods like .eq.

As a result, Series and DataFrame operators behave as below:

Arithmetic operators¶

Arithmetic operators align both indexes (no change).

In [80]: s1 = pd.Series([1, 2, 3], index=list('ABC'))

In [81]: s2 = pd.Series([2, 2, 2], index=list('ABD'))

In [82]: s1 + s2
Out[82]: 
A    3.0
B    4.0
C    NaN
D    NaN
dtype: float64

In [83]: df1 = pd.DataFrame([1, 2, 3], index=list('ABC'))

In [84]: df2 = pd.DataFrame([2, 2, 2], index=list('ABD'))

In [85]: df1 + df2
Out[85]: 
     0
A  3.0
B  4.0
C  NaN
D  NaN
Comparison operators¶

Comparison operators raise a ValueError when the .index values differ.

Previous Behavior (Series):

Series compared values ignoring the .index as long as both had the same length:

In [1]: s1 == s2
Out[1]:
A    False
B     True
C    False
dtype: bool

New behavior (Series):

In [2]: s1 == s2
Out[2]:
ValueError: Can only compare identically-labeled Series objects

Note

To achieve the same result as previous versions (compare values based on locations ignoring .index), compare both .values.

In [86]: s1.values == s2.values
Out[86]: array([False,  True, False], dtype=bool)

To compare Series while aligning their .index, see the flexible comparison methods section below:

In [87]: s1.eq(s2)
Out[87]: 
A    False
B     True
C    False
D    False
dtype: bool

Current Behavior (DataFrame, no change):

In [3]: df1 == df2
Out[3]:
ValueError: Can only compare identically-labeled DataFrame objects
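
The three options for Series described above can be put side by side. This is a sketch under the behavior documented in this section; the variable names are illustrative only.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list('ABC'))
s2 = pd.Series([2, 2, 2], index=list('ABD'))

# The == operator refuses to compare Series whose labels differ.
try:
    s1 == s2
    raised = False
except ValueError:
    raised = True

# Positional comparison that ignores the labels entirely.
by_position = (s1.values == s2.values).tolist()

# Label-aligned comparison; labels present on only one side compare as False.
by_label = s1.eq(s2)
```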
Logical operators¶

Logical operators align the .index of both the left- and right-hand sides.

Previous behavior (Series), only left hand side index was kept:

In [4]: s1 = pd.Series([True, False, True], index=list('ABC'))
In [5]: s2 = pd.Series([True, True, True], index=list('ABD'))
In [6]: s1 & s2
Out[6]:
A     True
B    False
C    False
dtype: bool

New behavior (Series):

In [88]: s1 = pd.Series([True, False, True], index=list('ABC'))

In [89]: s2 = pd.Series([True, True, True], index=list('ABD'))

In [90]: s1 & s2
Out[90]: 
A     True
B    False
C    False
D    False
dtype: bool

Note

Series logical operators fill a NaN result with False.

Note

To achieve the same result as previous versions (compare values based on only left hand side index), you can use reindex_like:

In [91]: s1 & s2.reindex_like(s1)
Out[91]: 
A     True
B    False
C    False
dtype: bool

Current Behavior (DataFrame, no change):

In [92]: df1 = pd.DataFrame([True, False, True], index=list('ABC'))

In [93]: df2 = pd.DataFrame([True, True, True], index=list('ABD'))

In [94]: df1 & df2
Out[94]: 
       0
A   True
B  False
C    NaN
D    NaN
Flexible comparison methods¶

Series flexible comparison methods like eq, ne, le, lt, ge and gt now align both indexes. Use these methods to compare two Series that have different indexes.

In [95]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [96]: s2 = pd.Series([2, 2, 2], index=['b', 'c', 'd'])

In [97]: s1.eq(s2)
Out[97]: 
a    False
b     True
c    False
d    False
dtype: bool

In [98]: s1.ge(s2)
Out[98]: 
a    False
b     True
c     True
d    False
dtype: bool

Previously, this worked the same as comparison operators (see above).

.to_datetime() changes¶

Previously, if .to_datetime() encountered mixed integers/floats and strings (but no datetimes) with errors='coerce', it would convert all values to NaT.

Previous behavior:

In [2]: pd.to_datetime([1, 'foo'], errors='coerce')
Out[2]: DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

Current behavior:

This will now convert integers/floats with the default unit of ns.

In [104]: pd.to_datetime([1, 'foo'], errors='coerce')
Out[104]: DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None)

Bug fixes related to .to_datetime():

Merging changes¶

Merging will now preserve the dtype of the join keys (GH8596)

In [105]: df1 = pd.DataFrame({'key': [1], 'v1': [10]})

In [106]: df1
Out[106]: 
   key  v1
0    1  10

In [107]: df2 = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})

In [108]: df2
Out[108]: 
   key  v1
0    1  20
1    2  30

Previous behavior:

In [5]: pd.merge(df1, df2, how='outer')
Out[5]:
   key    v1
0  1.0  10.0
1  1.0  20.0
2  2.0  30.0

In [6]: pd.merge(df1, df2, how='outer').dtypes
Out[6]:
key    float64
v1     float64
dtype: object

New behavior:

The dtype of the join keys is now preserved:

In [109]: pd.merge(df1, df2, how='outer')
Out[109]: 
   key  v1
0    1  10
1    1  20
2    2  30

In [110]: pd.merge(df1, df2, how='outer').dtypes
Out[110]: 
key    int64
v1     int64
dtype: object

Of course, if missing values are introduced, the resulting dtype will be upcast, which is unchanged from the previous behavior.

In [111]: pd.merge(df1, df2, how='outer', on='key')
Out[111]: 
   key  v1_x  v1_y
0    1  10.0    20
1    2   NaN    30

In [112]: pd.merge(df1, df2, how='outer', on='key').dtypes
Out[112]: 
key       int64
v1_x    float64
v1_y      int64
dtype: object
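
A quick way to see which columns were upcast by a merge is to look for the columns that received missing values. This is a minimal sketch using the frames above; the names key_dtype_kept and upcast_columns are hypothetical and not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1], 'v1': [10]})
df2 = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})

merged = pd.merge(df1, df2, how='outer', on='key')

# The join key keeps the dtype it had in the inputs.
key_dtype_kept = merged['key'].dtype == df1['key'].dtype

# Only the columns that gained missing values are upcast to float.
upcast_columns = [c for c in merged.columns if merged[c].isnull().any()]
```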
.describe() changes¶

Percentile identifiers in the index of a .describe() output will now be rounded to the least precision that keeps them distinct (GH13104)

In [113]: s = pd.Series([0, 1, 2, 3, 4])

In [114]: df = pd.DataFrame([0, 1, 2, 3, 4])

Previous behavior:

The percentiles were rounded to at most one decimal place, which could raise ValueError for a data frame if the percentiles were duplicated.

In [3]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[3]:
count     5.000000
mean      2.000000
std       1.581139
min       0.000000
0.0%      0.000400
0.1%      0.002000
0.1%      0.004000
50%       2.000000
99.9%     3.996000
100.0%    3.998000
100.0%    3.999600
max       4.000000
dtype: float64

In [4]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[4]:
...
ValueError: cannot reindex from a duplicate axis

New behavior:

In [115]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[115]: 
count     5.000000
mean      2.000000
std       1.581139
min       0.000000
0.01%     0.000400
0.05%     0.002000
0.1%      0.004000
50%       2.000000
99.9%     3.996000
99.95%    3.998000
99.99%    3.999600
max       4.000000
dtype: float64

In [116]: df.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[116]: 
               0
count   5.000000
mean    2.000000
std     1.581139
min     0.000000
0.01%   0.000400
0.05%   0.002000
0.1%    0.004000
50%     2.000000
99.9%   3.996000
99.95%  3.998000
99.99%  3.999600
max     4.000000
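
Because the labels are now distinct, an individual percentile can be selected by its label. The sketch below assumes the label format shown in the new output above ('99.95%'); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([0, 1, 2, 3, 4])
percentiles = [0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999]

summary = df.describe(percentiles=percentiles)

# Distinct labels mean the summary index has no duplicates.
labels_unique = summary.index.is_unique

# Select one percentile by its label; column 0 is the only data column.
p9995 = summary.loc['99.95%', 0]
```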

Furthermore:

Period changes¶ PeriodIndex now has period dtype¶

PeriodIndex now has its own period dtype. The period dtype is a pandas extension dtype like category or the timezone aware dtype (datetime64[ns, tz]) (GH13941). As a consequence of this change, PeriodIndex no longer has an integer dtype:

Previous behavior:

In [1]: pi = pd.PeriodIndex(['2016-08-01'], freq='D')

In [2]: pi
Out[2]: PeriodIndex(['2016-08-01'], dtype='int64', freq='D')

In [3]: pd.api.types.is_integer_dtype(pi)
Out[3]: True

In [4]: pi.dtype
Out[4]: dtype('int64')

New behavior:

In [117]: pi = pd.PeriodIndex(['2016-08-01'], freq='D')

In [118]: pi
Out[118]: PeriodIndex(['2016-08-01'], dtype='period[D]', freq='D')

In [119]: pd.api.types.is_integer_dtype(pi)
Out[119]: False

In [120]: pd.api.types.is_period_dtype(pi)
Out[120]: True

In [121]: pi.dtype
Out[121]: period[D]

In [122]: type(pi.dtype)
Out[122]: pandas.core.dtypes.dtypes.PeriodDtype
Period('NaT') now returns pd.NaT¶

Previously, Period had its own Period('NaT') representation, different from pd.NaT. Now Period('NaT') has been changed to return pd.NaT. (GH12759, GH13582)

Previous behavior:

In [5]: pd.Period('NaT', freq='D')
Out[5]: Period('NaT', 'D')

New behavior:

These return pd.NaT without requiring the freq option.

In [123]: pd.Period('NaT')
Out[123]: NaT

In [124]: pd.Period(None)
Out[124]: NaT

To be compatible with Period addition and subtraction, pd.NaT now supports addition and subtraction with int. Previously it raised ValueError.

Previous behavior:

In [5]: pd.NaT + 1
...
ValueError: Cannot add integral value to Timestamp without freq.

New behavior:

In [125]: pd.NaT + 1
Out[125]: NaT

In [126]: pd.NaT - 1
Out[126]: NaT
PeriodIndex.values now returns array of Period object¶

.values is changed to return an array of Period objects, rather than an array of integers (GH13988).

Previous behavior:

In [6]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')
In [7]: pi.values
array([492, 493])

New behavior:

In [127]: pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')

In [128]: pi.values
Out[128]: array([Period('2011-01', 'M'), Period('2011-02', 'M')], dtype=object)
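
Code that relied on the old integer output can recover the integers from the Period objects themselves. This is a sketch, assuming the new behavior above; reading the ordinal attribute of each Period is one possible approach and not necessarily the recommended migration path.

```python
import pandas as pd

pi = pd.PeriodIndex(['2011-01', '2011-02'], freq='M')

# String form of the period dtype.
dtype_name = str(pi.dtype)

# .values now holds Period objects, not integers.
first = pi.values[0]

# Each Period exposes its integer ordinal.
ordinals = [p.ordinal for p in pi]
```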
Index + / - no longer used for set operations¶

Addition and subtraction of the base Index type and of DatetimeIndex (not the numeric index types) previously performed set operations (set union and difference). This behavior has been deprecated since 0.15.0 (in favor of using the specific .union() and .difference() methods), and is now disabled. When possible, + and - are now used for element-wise operations, for example for concatenating strings or subtracting datetimes (GH8227, GH14127).

Previous behavior:

In [1]: pd.Index(['a', 'b']) + pd.Index(['a', 'c'])
FutureWarning: using '+' to provide set union with Indexes is deprecated, use '|' or .union()
Out[1]: Index(['a', 'b', 'c'], dtype='object')

New behavior: the same operation will now perform element-wise addition:

In [129]: pd.Index(['a', 'b']) + pd.Index(['a', 'c'])
Out[129]: Index(['aa', 'bc'], dtype='object')

Note that numeric Index objects already performed element-wise operations. For example, the behavior of adding two integer Indexes is unchanged. The base Index is now made consistent with this behavior.

In [130]: pd.Index([1, 2, 3]) + pd.Index([2, 3, 4])
Out[130]: Int64Index([3, 5, 7], dtype='int64')

Further, because of this change, it is now possible to subtract two DatetimeIndex objects resulting in a TimedeltaIndex:

Previous behavior:

In [1]: pd.DatetimeIndex(['2016-01-01', '2016-01-02']) - pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
FutureWarning: using '-' to provide set differences with datetimelike Indexes is deprecated, use .difference()
Out[1]: DatetimeIndex(['2016-01-01'], dtype='datetime64[ns]', freq=None)

New behavior:

In [131]: pd.DatetimeIndex(['2016-01-01', '2016-01-02']) - pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
Out[131]: TimedeltaIndex(['-1 days', '-1 days'], dtype='timedelta64[ns]', freq=None)
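
Code that used + and - for set logic can switch to the explicit methods. The following sketch contrasts the set methods with the new element-wise operators; the variable names are illustrative.

```python
import pandas as pd

left = pd.Index(['a', 'b'])
right = pd.Index(['a', 'c'])

union = left.union(right)            # set union
only_left = left.difference(right)   # set difference
concatenated = left + right          # element-wise string concatenation

dates = pd.DatetimeIndex(['2016-01-01', '2016-01-02'])
# Element-wise subtraction of two DatetimeIndex objects gives a TimedeltaIndex.
gaps = dates - pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
```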
Index.difference and .symmetric_difference changes¶

Index.difference and Index.symmetric_difference will now, more consistently, treat NaN values as any other values. (GH13514)

In [132]: idx1 = pd.Index([1, 2, 3, np.nan])

In [133]: idx2 = pd.Index([0, 1, np.nan])

Previous behavior:

In [3]: idx1.difference(idx2)
Out[3]: Float64Index([nan, 2.0, 3.0], dtype='float64')

In [4]: idx1.symmetric_difference(idx2)
Out[4]: Float64Index([0.0, nan, 2.0, 3.0], dtype='float64')

New behavior:

In [134]: idx1.difference(idx2)
Out[134]: Float64Index([2.0, 3.0], dtype='float64')

In [135]: idx1.symmetric_difference(idx2)
Out[135]: Float64Index([0.0, 2.0, 3.0], dtype='float64')
Index.unique consistently returns Index¶

Index.unique() now returns unique values as an Index of the appropriate dtype. (GH13395). Previously, most Index classes returned np.ndarray, and DatetimeIndex, TimedeltaIndex and PeriodIndex returned Index to keep metadata like timezone.

Previous behavior:

In [1]: pd.Index([1, 2, 3]).unique()
Out[1]: array([1, 2, 3])

In [2]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique()
Out[2]:
DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00',
               '2011-01-03 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Tokyo]', freq=None)

New behavior:

In [136]: pd.Index([1, 2, 3]).unique()
Out[136]: Int64Index([1, 2, 3], dtype='int64')

In [137]: pd.DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], tz='Asia/Tokyo').unique()
Out[137]: 
DatetimeIndex(['2011-01-01 00:00:00+09:00', '2011-01-02 00:00:00+09:00',
               '2011-01-03 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Tokyo]', freq=None)
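
Since the result is an Index, Index methods can be chained onto it directly. A small sketch, with made-up values:

```python
import pandas as pd

idx = pd.Index([3, 1, 3, 2, 1])

# unique() keeps the first occurrence of each value and returns an Index.
uniques = idx.unique()

# Index methods such as sort_values() are available on the result.
sorted_uniques = uniques.sort_values()
```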
MultiIndex constructors, groupby and set_index preserve categorical dtypes¶

MultiIndex.from_arrays and MultiIndex.from_product will now preserve categorical dtype in MultiIndex levels (GH13743, GH13854).

In [138]: cat = pd.Categorical(['a', 'b'], categories=list("bac"))

In [139]: lvl1 = ['foo', 'bar']

In [140]: midx = pd.MultiIndex.from_arrays([cat, lvl1])

In [141]: midx
Out[141]: 
MultiIndex(levels=[['b', 'a', 'c'], ['bar', 'foo']],
           labels=[[1, 0], [1, 0]])

Previous behavior:

In [4]: midx.levels[0]
Out[4]: Index(['b', 'a', 'c'], dtype='object')

In [5]: midx.get_level_values(0)
Out[5]: Index(['a', 'b'], dtype='object')

New behavior: the single level is now a CategoricalIndex:

In [142]: midx.levels[0]
Out[142]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, dtype='category')

In [143]: midx.get_level_values(0)
Out[143]: CategoricalIndex(['a', 'b'], categories=['b', 'a', 'c'], ordered=False, dtype='category')

An analogous change has been made to MultiIndex.from_product. As a consequence, groupby and set_index also preserve categorical dtypes in indexes.

In [144]: df = pd.DataFrame({'A': [0, 1], 'B': [10, 11], 'C': cat})

In [145]: df_grouped = df.groupby(by=['A', 'C']).first()

In [146]: df_set_idx = df.set_index(['A', 'C'])

Previous behavior:

In [11]: df_grouped.index.levels[1]
Out[11]: Index(['b', 'a', 'c'], dtype='object', name='C')
In [12]: df_grouped.reset_index().dtypes
Out[12]:
A      int64
C     object
B    float64
dtype: object

In [13]: df_set_idx.index.levels[1]
Out[13]: Index(['b', 'a', 'c'], dtype='object', name='C')
In [14]: df_set_idx.reset_index().dtypes
Out[14]:
A      int64
C     object
B      int64
dtype: object

New behavior:

In [147]: df_grouped.index.levels[1]
Out[147]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, name='C', dtype='category')

In [148]: df_grouped.reset_index().dtypes
Out[148]: 
A       int64
C    category
B     float64
dtype: object

In [149]: df_set_idx.index.levels[1]
Out[149]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=False, name='C', dtype='category')

In [150]: df_set_idx.reset_index().dtypes
Out[150]: 
A       int64
C    category
B       int64
dtype: object
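
The practical effect is that a categorical column survives a round trip through the index, including categories that do not occur in the data. A minimal sketch of that round trip with set_index, using the same cat as above:

```python
import pandas as pd

cat = pd.Categorical(['a', 'b'], categories=list('bac'))
df = pd.DataFrame({'A': [0, 1], 'B': [10, 11], 'C': cat})

indexed = df.set_index(['A', 'C'])
round_tripped = indexed.reset_index()

# The index level built from the categorical column is a CategoricalIndex.
level_is_categorical = isinstance(indexed.index.levels[1], pd.CategoricalIndex)

# The unused category 'c' and the category order are retained.
categories = list(round_tripped['C'].cat.categories)
```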
read_csv will progressively enumerate chunks¶

When read_csv() is called with chunksize=n and without specifying an index, each chunk used to have an independently generated index from 0 to n-1. Chunks are now given a progressive index instead, starting from 0 for the first chunk, from n for the second, and so on, so that, when concatenated, they are identical to the result of calling read_csv() without the chunksize= argument (GH12185).

In [151]: data = 'A,B\n0,1\n2,3\n4,5\n6,7'

Previous behavior:

In [2]: pd.concat(pd.read_csv(StringIO(data), chunksize=2))
Out[2]:
   A  B
0  0  1
1  2  3
0  4  5
1  6  7

New behavior:

In [152]: pd.concat(pd.read_csv(StringIO(data), chunksize=2))
Out[152]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
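
The equivalence with an unchunked read can be checked directly. This sketch uses the same data string as above and io.StringIO from the standard library:

```python
from io import StringIO

import pandas as pd

data = 'A,B\n0,1\n2,3\n4,5\n6,7'

chunks = list(pd.read_csv(StringIO(data), chunksize=2))

# Each chunk continues the index where the previous one stopped.
chunk_indexes = [list(chunk.index) for chunk in chunks]

combined = pd.concat(chunks)
whole = pd.read_csv(StringIO(data))
```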
Sparse Changes¶

These changes allow pandas to handle sparse data with more dtypes, and make for a smoother experience with data handling.

int64 and bool support enhancements¶

Sparse data structures have gained enhanced support for int64 and bool dtypes (GH667, GH13849).

Previously, sparse data were float64 dtype by default, even if all inputs were of int or bool dtype. You had to specify dtype explicitly to create sparse data with int64 dtype. Also, fill_value had to be specified explicitly because the default was np.nan which doesn’t appear in int64 or bool data.

In [1]: pd.SparseArray([1, 2, 0, 0])
Out[1]:
[1.0, 2.0, 0.0, 0.0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

# specifying int64 dtype, but all values are stored in sp_values because
# fill_value default is np.nan
In [2]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
Out[2]:
[1, 2, 0, 0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

In [3]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64, fill_value=0)
Out[3]:
[1, 2, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)

As of v0.19.0, sparse data keeps the input dtype, and uses more appropriate fill_value defaults (0 for int64 dtype, False for bool dtype).

In [153]: pd.SparseArray([1, 2, 0, 0], dtype=np.int64)
Out[153]: 
[1, 2, 0, 0]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)

In [154]: pd.SparseArray([True, False, False, False])
Out[154]: 
[True, False, False, False]
Fill: False
IntIndex
Indices: array([0], dtype=int32)

See the docs for more details.

Operators now preserve dtypes¶ Other sparse fixes¶ Indexer dtype changes¶

Note

This change only affects 64-bit Python running on Windows, and only affects relatively advanced indexing operations.

Methods such as Index.get_indexer that return an indexer array coerce that array to a “platform int”, so that it can be directly used in 3rd party library operations like numpy.take. Previously, a platform int was defined as np.int_, which corresponds to a C integer, but the correct type, and what is being used now, is np.intp, which corresponds to the C integer size that can hold a pointer (GH3033, GH13972).

These types are the same on many platforms, but for 64-bit Python on Windows, np.int_ is 32 bits, and np.intp is 64 bits. Changing this behavior improves performance for many operations on that platform.

Previous behavior:

In [1]: i = pd.Index(['a', 'b', 'c'])

In [2]: i.get_indexer(['b', 'b', 'c']).dtype
Out[2]: dtype('int32')

New behavior:

In [1]: i = pd.Index(['a', 'b', 'c'])

In [2]: i.get_indexer(['b', 'b', 'c']).dtype
Out[2]: dtype('int64')
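
Rather than comparing against a fixed width such as int64, portable code can compare the indexer dtype with np.intp. This is a sketch; the array passed to numpy.take is made up for illustration.

```python
import numpy as np
import pandas as pd

i = pd.Index(['a', 'b', 'c'])
indexer = i.get_indexer(['b', 'b', 'c'])

# np.intp is the pointer-sized integer on the current platform.
is_platform_int = indexer.dtype == np.intp

# The indexer can be passed straight to numpy.take.
taken = np.take(np.array([10, 20, 30]), indexer)
```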
Other API Changes¶ Deprecations¶ Removal of prior version deprecations/changes¶ Performance Improvements¶ Bug Fixes¶ v0.18.1 (May 3, 2016)¶

This is a minor bug-fix release from 0.18.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

New features¶ Custom Business Hour¶

The CustomBusinessHour is a mixture of BusinessHour and CustomBusinessDay which allows you to specify arbitrary holidays. For details, see Custom Business Hour (GH11514)

In [1]: from pandas.tseries.offsets import CustomBusinessHour

In [2]: from pandas.tseries.holiday import USFederalHolidayCalendar

In [3]: bhour_us = CustomBusinessHour(calendar=USFederalHolidayCalendar())

Friday before MLK Day

In [4]: dt = datetime(2014, 1, 17, 15)

In [5]: dt + bhour_us
Out[5]: Timestamp('2014-01-17 16:00:00')

Tuesday after MLK Day (Monday is skipped because it’s a holiday)

In [6]: dt + bhour_us * 2
Out[6]: Timestamp('2014-01-20 09:00:00')
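
Holidays can also be listed by hand instead of coming from a calendar object. The sketch below is an illustration rather than part of the original example: it assumes the single holiday '2014-01-20' (MLK Day 2014) and the default business hours of 09:00 to 17:00.

```python
from datetime import datetime

from pandas.tseries.offsets import CustomBusinessHour

# Assumption for illustration: one hand-listed holiday instead of a calendar.
bhour = CustomBusinessHour(holidays=['2014-01-20'])

friday_3pm = datetime(2014, 1, 17, 15)

# One business hour later is still within Friday's business hours.
one_hour_later = friday_3pm + bhour

# Three business hours: two remain on Friday, and the third is taken on the
# next business day, skipping the weekend and the listed holiday.
three_hours_later = friday_3pm + bhour * 3
```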
.groupby(..) syntax with window and resample operations¶

.groupby(...) has been enhanced to provide convenient syntax when working with .rolling(..), .expanding(..) and .resample(..) per group (GH12486, GH12738).

You can now use .rolling(..) and .expanding(..) as methods on groupbys. These return another deferred object (similar to what .rolling() and .expanding() do on ungrouped pandas objects). You can then operate on these RollingGroupby objects in a similar manner.

Previously you would have to do this to get a rolling window mean per-group:

In [7]: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
   ...:                    'B': np.arange(40)})
   ...: 

In [8]: df
Out[8]: 
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
5   1   5
6   1   6
.. ..  ..
33  3  33
34  3  34
35  3  35
36  3  36
37  3  37
38  3  38
39  3  39

[40 rows x 2 columns]
In [9]: df.groupby('A').apply(lambda x: x.rolling(4).B.mean())
Out[9]: 
A    
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
         ... 
3  33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, Length: 40, dtype: float64

Now you can do:

In [10]: df.groupby('A').rolling(4).B.mean()
Out[10]: 
A    
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
         ... 
3  33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, Length: 40, dtype: float64
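
On a smaller frame it is easier to see that the window restarts for each group. A minimal sketch with made-up data, selecting column B with brackets rather than attribute access:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [0, 1, 2, 10, 11, 12]})

# A rolling mean over 2 rows, computed separately within each group of A.
rolled = df.groupby('A')['B'].rolling(2).mean()
```

The result is indexed by the group key and the original row label, and the first row of each group is NaN because its window is incomplete.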

For .resample(..) type of operations, previously you would have to:

In [11]: df = pd.DataFrame({'date': pd.date_range(start='2016-01-01',
   ....:                                          periods=4,
   ....:                                          freq='W'),
   ....:                    'group': [1, 1, 2, 2],
   ....:                    'val': [5, 6, 7, 8]}).set_index('date')
   ....: 

In [12]: df
Out[12]: 
            group  val
date                  
2016-01-03      1    5
2016-01-10      1    6
2016-01-17      2    7
2016-01-24      2    8
In [13]: df.groupby('group').apply(lambda x: x.resample('1D').ffill())
Out[13]: 
                  group  val
group date                  
1     2016-01-03      1    5
      2016-01-04      1    5
      2016-01-05      1    5
      2016-01-06      1    5
      2016-01-07      1    5
      2016-01-08      1    5
      2016-01-09      1    5
...                 ...  ...
2     2016-01-18      2    7
      2016-01-19      2    7
      2016-01-20      2    7
      2016-01-21      2    7
      2016-01-22      2    7
      2016-01-23      2    7
      2016-01-24      2    8

[16 rows x 2 columns]

Now you can do:

In [14]: df.groupby('group').resample('1D').ffill()
Out[14]: 
                  group  val
group date                  
1     2016-01-03      1    5
      2016-01-04      1    5
      2016-01-05      1    5
      2016-01-06      1    5
      2016-01-07      1    5
      2016-01-08      1    5
      2016-01-09      1    5
...                 ...  ...
2     2016-01-18      2    7
      2016-01-19      2    7
      2016-01-20      2    7
      2016-01-21      2    7
      2016-01-22      2    7
      2016-01-23      2    7
      2016-01-24      2    8

[16 rows x 2 columns]
Method chaining improvements¶

The following methods / indexers now accept a callable. This is intended to make them more useful in method chains, see the documentation. (GH11485, GH12533)

.where() and .mask()¶

These can accept a callable for the condition and other arguments.

In [15]: df = pd.DataFrame({'A': [1, 2, 3],
   ....:                    'B': [4, 5, 6],
   ....:                    'C': [7, 8, 9]})
   ....: 

In [16]: df.where(lambda x: x > 4, lambda x: x + 10)
Out[16]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9
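
The callable form matters most when the frame being tested did not exist before the chain started. A sketch with made-up data, where .where() sees a column created one step earlier by .assign():

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

result = (df
          .assign(C=lambda d: d.A + d.B)           # C = [5, 7, 9]
          .where(lambda d: d > 4, lambda d: -d))   # negate values that are not > 4
```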
.loc[], .iloc[], .ix[]¶

These can accept a callable, or a tuple of callables, as a slicer. The callable can return a valid boolean indexer or anything else that is valid input for these indexers.

# callable returns bool indexer
In [17]: df.loc[lambda x: x.A >= 2, lambda x: x.sum() > 10]
Out[17]: 
   B  C
1  5  8
2  6  9

# callable returns list of labels
In [18]: df.loc[lambda x: [1, 2], lambda x: ['A', 'B']]
Out[18]: 
   A  B
1  2  5
2  3  6
[] indexing¶

Finally, you can use a callable in [] indexing of Series, DataFrame and Panel. The callable must return a valid input for [] indexing depending on its class and index type.

In [19]: df[lambda x: 'A']
Out[19]: 
0    1
1    2
2    3
Name: A, dtype: int64

Using these methods / indexers, you can chain data selection operations without using a temporary variable.

In [20]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [21]: (bb.groupby(['year', 'team'])
   ....:    .sum()
   ....:    .loc[lambda df: df.r > 100]
   ....: )
   ....: 
Out[21]: 
           stint    g    ab    r    h  X2b  X3b  hr    rbi    sb   cs   bb  \
year team                                                                    
2007 CIN       6  379   745  101  203   35    2  36  125.0  10.0  1.0  105   
     DET       5  301  1062  162  283   54    4  37  144.0  24.0  7.0   97   
     HOU       4  311   926  109  218   47    6  14   77.0  10.0  4.0   60   
     LAN      11  413  1021  153  293   61    3  36  154.0   7.0  5.0  114   
     NYN      13  622  1854  240  509  101    3  61  243.0  22.0  4.0  174   
     SFN       5  482  1305  198  337   67    6  40  171.0  26.0  7.0  235   
     TEX       2  198   729  115  200   40    4  28  115.0  21.0  4.0   73   
     TOR       4  459  1408  187  378   96    2  58  223.0   4.0  2.0  190   

              so   ibb   hbp    sh    sf  gidp  
year team                                       
2007 CIN   127.0  14.0   1.0   1.0  15.0  18.0  
     DET   176.0   3.0  10.0   4.0   8.0  28.0  
     HOU   212.0   3.0   9.0  16.0   6.0  17.0  
     LAN   141.0   8.0   9.0   3.0   8.0  29.0  
     NYN   310.0  24.0  23.0  18.0  15.0  48.0  
     SFN   188.0  51.0   8.0  16.0   6.0  41.0  
     TEX   140.0   4.0   5.0   2.0   8.0  16.0  
     TOR   265.0  16.0  12.0   4.0  16.0  38.0  
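
The same pattern can be tried without the baseball file. The frame below is made up for illustration and is not taken from that data set.

```python
import pandas as pd

# Made-up numbers for illustration only.
scores = pd.DataFrame({'team': ['x', 'x', 'y', 'y', 'z'],
                       'runs': [60, 55, 40, 30, 120]})

# The callable receives the aggregated frame, so no temporary variable is needed.
high_scoring = (scores
                .groupby('team')
                .sum()
                .loc[lambda d: d.runs > 100])
```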
Partial string indexing on DatetimeIndex when part of a MultiIndex¶

Partial string indexing now matches on DatetimeIndex when part of a MultiIndex (GH10331)

In [22]: dft2 = pd.DataFrame(np.random.randn(20, 1),
   ....:                     columns=['A'],
   ....:                     index=pd.MultiIndex.from_product([pd.date_range('20130101',
   ....:                                                                     periods=10,
   ....:                                                                     freq='12H'),
   ....:                                                      ['a', 'b']]))
   ....: 

In [23]: dft2
Out[23]: 
                              A
2013-01-01 00:00:00 a  0.156998
                    b -0.571455
2013-01-01 12:00:00 a  1.057633
                    b -0.791489
2013-01-02 00:00:00 a -0.524627
                    b  0.071878
2013-01-02 12:00:00 a  1.910759
...                         ...
2013-01-04 00:00:00 b  1.015405
2013-01-04 12:00:00 a  0.749185
                    b -0.675521
2013-01-05 00:00:00 a  0.440266
                    b  0.688972
2013-01-05 12:00:00 a -0.276646
                    b  1.924533

[20 rows x 1 columns]

In [24]: dft2.loc['2013-01-05']
Out[24]: 
                              A
2013-01-05 00:00:00 a  0.440266
                    b  0.688972
2013-01-05 12:00:00 a -0.276646
                    b  1.924533

On other levels

In [25]: idx = pd.IndexSlice

In [26]: dft2 = dft2.swaplevel(0, 1).sort_index()

In [27]: dft2
Out[27]: 
                              A
a 2013-01-01 00:00:00  0.156998
  2013-01-01 12:00:00  1.057633
  2013-01-02 00:00:00 -0.524627
  2013-01-02 12:00:00  1.910759
  2013-01-03 00:00:00  0.513082
  2013-01-03 12:00:00  1.043945
  2013-01-04 00:00:00  1.459927
...                         ...
b 2013-01-02 12:00:00  0.787965
  2013-01-03 00:00:00 -0.546416
  2013-01-03 12:00:00  2.107785
  2013-01-04 00:00:00  1.015405
  2013-01-04 12:00:00 -0.675521
  2013-01-05 00:00:00  0.688972
  2013-01-05 12:00:00  1.924533

[20 rows x 1 columns]

In [28]: dft2.loc[idx[:, '2013-01-05'], :]
Out[28]: 
                              A
a 2013-01-05 00:00:00  0.440266
  2013-01-05 12:00:00 -0.276646
b 2013-01-05 00:00:00  0.688972
  2013-01-05 12:00:00  1.924533
Assembling Datetimes¶

pd.to_datetime() has gained the ability to assemble datetimes from a passed in DataFrame or a dict. (GH8158).

In [29]: df = pd.DataFrame({'year': [2015, 2016],
   ....:                    'month': [2, 3],
   ....:                    'day': [4, 5],
   ....:                    'hour': [2, 3]})
   ....: 

In [30]: df
Out[30]: 
   day  hour  month  year
0    4     2      2  2015
1    5     3      3  2016

Assembling using the passed frame.

In [31]: pd.to_datetime(df)
Out[31]: 
0   2015-02-04 02:00:00
1   2016-03-05 03:00:00
dtype: datetime64[ns]

You can pass only the columns that you need to assemble.

In [32]: pd.to_datetime(df[['year', 'month', 'day']])
Out[32]: 
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]
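
The dict form mentioned above works the same way as the DataFrame form. A short sketch, reusing the values from the example; the variable names are illustrative.

```python
import pandas as pd

parts = {'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5]}

# Assemble from a plain dict of equal-length lists.
from_dict = pd.to_datetime(parts)

# Assemble from a DataFrame that also carries an hour column.
frame = pd.DataFrame(parts).assign(hour=[2, 3])
from_frame = pd.to_datetime(frame)
```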
Other Enhancements¶ Sparse changes¶

These changes conform sparse handling to return the correct types and work to make a smoother experience with indexing.

SparseArray.take now returns a scalar for scalar input, SparseArray for others. Furthermore, it handles a negative indexer with the same rule as Index (GH10560, GH12796)

In [38]: s = pd.SparseArray([np.nan, np.nan, 1, 2, 3, np.nan, 4, 5, np.nan, 6])

In [39]: s.take(0)
Out[39]: nan

In [40]: s.take([1, 2, 3])
Out[40]: 
[nan, 1.0, 2.0]
Fill: nan
IntIndex
Indices: array([1, 2], dtype=int32)
API changes¶ .groupby(..).nth() changes¶

The index in .groupby(..).nth() output is now more consistent when the as_index argument is passed (GH11039):

In [41]: df = pd.DataFrame({'A' : ['a', 'b', 'a'],
   ....:                    'B' : [1, 2, 3]})
   ....: 

In [42]: df
Out[42]: 
   A  B
0  a  1
1  b  2
2  a  3

Previous Behavior:

In [3]: df.groupby('A', as_index=True)['B'].nth(0)
Out[3]:
0    1
1    2
Name: B, dtype: int64

In [4]: df.groupby('A', as_index=False)['B'].nth(0)
Out[4]:
0    1
1    2
Name: B, dtype: int64

New Behavior:

In [43]: df.groupby('A', as_index=True)['B'].nth(0)
Out[43]: 
A
a    1
b    2
Name: B, dtype: int64

In [44]: df.groupby('A', as_index=False)['B'].nth(0)
Out[44]: 
0    1
1    2
Name: B, dtype: int64

Furthermore, previously, .groupby would always sort, regardless of whether sort=False was passed with .nth().

In [45]: np.random.seed(1234)

In [46]: df = pd.DataFrame(np.random.randn(100, 2), columns=['a', 'b'])

In [47]: df['c'] = np.random.randint(0, 4, 100)

Previous Behavior:

In [4]: df.groupby('c', sort=True).nth(1)
Out[4]:
          a         b
c
0 -0.334077  0.002118
1  0.036142 -2.074978
2 -0.720589  0.887163
3  0.859588 -0.636524

In [5]: df.groupby('c', sort=False).nth(1)
Out[5]:
          a         b
c
0 -0.334077  0.002118
1  0.036142 -2.074978
2 -0.720589  0.887163
3  0.859588 -0.636524

New Behavior:

In [48]: df.groupby('c', sort=True).nth(1)
Out[48]: 
          a         b
c                    
0 -0.334077  0.002118
1  0.036142 -2.074978
2 -0.720589  0.887163
3  0.859588 -0.636524

In [49]: df.groupby('c', sort=False).nth(1)
Out[49]:
          a         b
c                    
2 -0.720589  0.887163
3  0.859588 -0.636524
0 -0.334077  0.002118
1  0.036142 -2.074978
numpy function compatibility¶

Compatibility between pandas array-like methods (e.g. sum and take) and their numpy counterparts has been greatly increased by augmenting the signatures of the pandas methods so as to accept arguments that can be passed in from numpy, even if they are not necessarily used in the pandas implementation (GH12644, GH12638, GH12687)

An example of this signature augmentation is illustrated below:

In [50]: sp = pd.SparseDataFrame([1, 2, 3])

In [51]: sp
Out[51]: 
   0
0  1
1  2
2  3

Previous behaviour:

In [2]: np.cumsum(sp, axis=0)
...
TypeError: cumsum() takes at most 2 arguments (4 given)

New behaviour:

In [52]: np.cumsum(sp, axis=0)
Out[52]: 
   0
0  1
1  3
2  6
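
As a rough sketch of what the augmented signatures make possible: when np.cumsum is handed a pandas object, NumPy forwards extra keyword arguments (such as dtype and out) to the object's own cumsum method, so that method has to tolerate them. The example assumes a plain DataFrame in place of the SparseDataFrame shown above, and the names frame, via_numpy and via_pandas are illustrative only.

```python
import numpy as np
import pandas as pd

# A plain DataFrame stands in for the SparseDataFrame above (assumption).
frame = pd.DataFrame([1, 2, 3])

# NumPy dispatches to frame.cumsum(...), passing along arguments that
# pandas accepts for compatibility even if it does not use them.
via_numpy = np.cumsum(frame, axis=0)

# The direct pandas call, for comparison.
via_pandas = frame.cumsum(axis=0)

print(via_numpy)
```

Both calls should produce the same running totals, which is the point of the compatibility work.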
Using .apply on groupby resampling¶

Using apply on resampling groupby operations (using a pd.TimeGrouper) now has the same output types as similar apply calls on other groupby operations. (GH11742).

In [53]: df = pd.DataFrame({'date': pd.to_datetime(['10/10/2000', '11/10/2000']),
   ....:                   'value': [10, 13]})
   ....: 

In [54]: df
Out[54]: 
        date  value
0 2000-10-10     10
1 2000-11-10     13

Previous behavior:

In [1]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x.value.sum())
Out[1]:
...
TypeError: cannot concatenate a non-NDFrame object

# Output is a Series
In [2]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())
Out[2]:
date
2000-10-31  value    10
2000-11-30  value    13
dtype: int64

New Behavior:

# Output is a Series
In [55]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x.value.sum())
Out[55]: 
date
2000-10-31    10
2000-11-30    13
Freq: M, dtype: int64

# Output is a DataFrame
In [56]: df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())
Out[56]:
            value
date             
2000-10-31     10
2000-11-30     13
Changes in read_csv exceptions¶

In order to standardize the read_csv API for both the c and python engines, both will now raise an EmptyDataError, a subclass of ValueError, in response to empty columns or header (GH12493, GH12506)

Previous behaviour:

In [1]: df = pd.read_csv(StringIO(''), engine='c')
...
ValueError: No columns to parse from file

In [2]: df = pd.read_csv(StringIO(''), engine='python')
...
StopIteration

New behaviour:

In [1]: df = pd.read_csv(StringIO(''), engine='c')
...
pandas.io.common.EmptyDataError: No columns to parse from file

In [2]: df = pd.read_csv(StringIO(''), engine='python')
...
pandas.io.common.EmptyDataError: No columns to parse from file
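
Because the new exception is a subclass of ValueError, code that already catches ValueError keeps working for both engines. The following is a minimal sketch of that idea; the helper name load_or_report is hypothetical, and the exact exception name printed may depend on the pandas version installed.

```python
from io import StringIO

import pandas as pd


def load_or_report(text, engine):
    """Return the parsed frame, or the name of the error raised for empty input."""
    try:
        return pd.read_csv(StringIO(text), engine=engine)
    except ValueError as exc:  # EmptyDataError subclasses ValueError
        return type(exc).__name__


# Both engines are expected to report the same error for empty input.
results = {engine: load_or_report('', engine) for engine in ('c', 'python')}
print(results)
```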

In addition to this error change, several others have been made as well:

to_datetime error changes¶

Bugs in pd.to_datetime() have been fixed when passing a unit with convertible entries and errors='coerce', or non-convertible entries with errors='ignore'. Furthermore, an OutOfBoundsDatetime exception will be raised when an out-of-range value is encountered for that unit when errors='raise'. (GH11758, GH13052, GH13059)

Previous behaviour:

In [27]: pd.to_datetime(1420043460, unit='s', errors='coerce')
Out[27]: NaT

In [28]: pd.to_datetime(11111111, unit='D', errors='ignore')
OverflowError: Python int too large to convert to C long

In [29]: pd.to_datetime(11111111, unit='D', errors='raise')
OverflowError: Python int too large to convert to C long

New behaviour:

In [2]: pd.to_datetime(1420043460, unit='s', errors='coerce')
Out[2]: Timestamp('2014-12-31 16:31:00')

In [3]: pd.to_datetime(11111111, unit='D', errors='ignore')
Out[3]: 11111111

In [4]: pd.to_datetime(11111111, unit='D', errors='raise')
OutOfBoundsDatetime: cannot convert input with unit 'D'
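
One way to use the corrected behaviour is sketched below, under the assumption that the caller wants None rather than an exception for values that cannot be represented. The helper epoch_to_timestamp is hypothetical and not part of pandas.

```python
import pandas as pd


def epoch_to_timestamp(value, unit):
    """Convert an epoch offset to a Timestamp, or None if it cannot be represented."""
    try:
        return pd.to_datetime(value, unit=unit, errors='raise')
    except (OverflowError, ValueError):
        # OutOfBoundsDatetime is a ValueError subclass; OverflowError
        # covers the older behaviour shown above.
        return None


in_range = epoch_to_timestamp(1420043460, 's')

# Convertible entries are no longer turned into NaT by errors='coerce'.
coerced = pd.to_datetime(1420043460, unit='s', errors='coerce')

print(in_range, coerced)
```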
Other API changes¶

Deprecations¶

Performance Improvements¶

Bug Fixes¶

v0.18.0 (March 13, 2016)¶

This is a major release from 0.17.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.18.0 no longer supports Python versions 2.6 and 3.3 (GH7718, GH11273)

Warning

numexpr version 2.4.4 will now show a warning and not be used as a computation back-end for pandas because of some buggy behavior. This does not affect other versions (>= 2.1 and >= 2.4.6). (GH12489)

Highlights include:

Check the API Changes and deprecations before updating.

New features¶

Window functions are now methods¶

Window functions have been refactored to be methods on Series/DataFrame objects, rather than top-level functions, which are now deprecated. This allows these window-type functions to have a similar API to that of .groupby. See the full documentation here (GH11603, GH12373)

In [1]: np.random.seed(1234)

In [2]: df = pd.DataFrame({'A' : range(10), 'B' : np.random.randn(10)})

In [3]: df
Out[3]: 
   A         B
0  0  0.471435
1  1 -1.190976
2  2  1.432707
3  3 -0.312652
4  4 -0.720589
5  5  0.887163
6  6  0.859588
7  7 -0.636524
8  8  0.015696
9  9 -2.242685

Previous Behavior:

In [8]: pd.rolling_mean(df,window=3)
        FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with
                       DataFrame.rolling(window=3,center=False).mean()
Out[8]:
    A         B
0 NaN       NaN
1 NaN       NaN
2   1  0.237722
3   2 -0.023640
4   3  0.133155
5   4 -0.048693
6   5  0.342054
7   6  0.370076
8   7  0.079587
9   8 -0.954504

New Behavior:

In [4]: r = df.rolling(window=3)

These show a descriptive repr

In [5]: r
Out[5]: Rolling [window=3,center=False,axis=0]

with tab-completion of available methods and properties.

In [9]: r.
r.A           r.agg         r.apply       r.count       r.exclusions  r.max         r.median      r.name        r.skew        r.sum
r.B           r.aggregate   r.corr        r.cov         r.kurt        r.mean        r.min         r.quantile    r.std         r.var

The methods operate on the Rolling object itself

In [6]: r.mean()
Out[6]: 
     A         B
0  NaN       NaN
1  NaN       NaN
2  1.0  0.237722
3  2.0 -0.023640
4  3.0  0.133155
5  4.0 -0.048693
6  5.0  0.342054
7  6.0  0.370076
8  7.0  0.079587
9  8.0 -0.954504

They provide getitem accessors

In [7]: r['A'].mean()
Out[7]: 
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    6.0
8    7.0
9    8.0
Name: A, dtype: float64

And multiple aggregations

In [8]: r.agg({'A' : ['mean','std'],
   ...:        'B' : ['mean','std']})
   ...: 
Out[8]: 
     A              B          
  mean  std      mean       std
0  NaN  NaN       NaN       NaN
1  NaN  NaN       NaN       NaN
2  1.0  1.0  0.237722  1.327364
3  2.0  1.0 -0.023640  1.335505
4  3.0  1.0  0.133155  1.143778
5  4.0  1.0 -0.048693  0.835747
6  5.0  1.0  0.342054  0.920379
7  6.0  1.0  0.370076  0.871850
8  7.0  1.0  0.079587  0.750099
9  8.0  1.0 -0.954504  1.162285
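
The two-stage pattern above can be checked on a small deterministic frame. This is only an illustrative sketch; the names frame, roller, means and a_sums are made up for the example.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0],
                      'B': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Stage 1: build the Rolling object; nothing is aggregated yet.
roller = frame.rolling(window=3)

# Stage 2: apply reductions, either to every column or to a selection.
means = roller.mean()
a_sums = roller['A'].sum()

print(means)
print(a_sums)
```

The first two rows are NaN because a window of three observations is not yet full.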
Changes to rename¶

Series.rename and NDFrame.rename_axis can now take a scalar or list-like argument for altering the Series or axis name, in addition to their old behaviors of altering labels. (GH9494, GH11965)

In [9]: s = pd.Series(np.random.randn(5))

In [10]: s.rename('newname')
Out[10]: 
0    1.150036
1    0.991946
2    0.953324
3   -2.021255
4   -0.334077
Name: newname, dtype: float64

In [11]: df = pd.DataFrame(np.random.randn(5, 2))

In [12]: (df.rename_axis("indexname")
   ....:    .rename_axis("columns_name", axis="columns"))
   ....: 
Out[12]: 
columns_name         0         1
indexname                       
0             0.002118  0.405453
1             0.289092  1.321158
2            -1.546906 -0.202646
3            -0.655969  0.193421
4             0.553439  1.318152

The new functionality works well in method chains. Previously these methods only accepted functions or dicts mapping a label to a new label. This continues to work as before for function or dict-like values.
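
A short sketch contrasting the two modes of rename, assuming small illustrative objects (series, frame and the label 'first' are not from the release notes):

```python
import pandas as pd

series = pd.Series([1, 2, 3])
frame = pd.DataFrame([[1, 2], [3, 4]])

# A scalar argument sets the Series name and leaves the labels alone.
named = series.rename('newname')

# A dict (or function) still relabels individual index entries.
relabelled = series.rename({0: 'first'})

# rename_axis names the index and, with axis='columns', the columns.
labelled = (frame.rename_axis('indexname')
                 .rename_axis('columns_name', axis='columns'))

print(named.name, relabelled.index.tolist())
print(labelled.index.name, labelled.columns.name)
```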

Range Index¶

A RangeIndex has been added to the Int64Index sub-classes to support a memory saving alternative for common use cases. This has a similar implementation to the python range object (xrange in python 2), in that it only stores the start, stop, and step values for the index. It will transparently interact with the user API, converting to Int64Index if needed.

This will now be the default constructed index for NDFrame objects, rather than an Int64Index as before. (GH939, GH12070, GH12071, GH12109, GH12888)

Previous Behavior:

In [3]: s = pd.Series(range(1000))

In [4]: s.index
Out[4]:
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            990, 991, 992, 993, 994, 995, 996, 997, 998, 999], dtype='int64', length=1000)

In [6]: s.index.nbytes
Out[6]: 8000

New Behavior:

In [13]: s = pd.Series(range(1000))

In [14]: s.index
Out[14]: RangeIndex(start=0, stop=1000, step=1)

In [15]: s.index.nbytes
Out[15]: 80
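
The memory saving can be checked by comparing the default index with an explicitly materialized integer index. This is a sketch only: the exact nbytes reported for a RangeIndex varies between pandas versions, so the comparison below is relative rather than absolute.

```python
import numpy as np
import pandas as pd

series = pd.Series(range(1000))
index = series.index

# An index built from a concrete int64 array stores every value.
materialized = pd.Index(np.arange(1000, dtype='int64'))

# The RangeIndex keeps only start, stop and step.
print(type(index).__name__, index.start, index.stop, index.step)
print(index.nbytes, materialized.nbytes)
```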
Changes to str.cat¶

The method .str.cat() concatenates the members of a Series. Before, if NaN values were present in the Series, calling .str.cat() on it would return NaN, unlike the rest of the Series.str.* API. This behavior has been amended to ignore NaN values by default. (GH11435).

A new, friendlier ValueError is added to protect against the mistake of supplying the sep as an arg, rather than as a kwarg. (GH11334).

In [27]: pd.Series(['a','b',np.nan,'c']).str.cat(sep=' ')
Out[27]: 'a b c'

In [28]: pd.Series(['a','b',np.nan,'c']).str.cat(sep=' ', na_rep='?')
Out[28]: 'a b ? c'

In [2]: pd.Series(['a','b',np.nan,'c']).str.cat(' ')
ValueError: Did you mean to supply a `sep` keyword?
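
The two outcomes can be reproduced in a plain script as follows. The variable names are illustrative, and sep is passed by keyword as the error message above recommends.

```python
import numpy as np
import pandas as pd

words = pd.Series(['a', 'b', np.nan, 'c'])

skipped = words.str.cat(sep=' ')             # missing values are ignored
filled = words.str.cat(sep=' ', na_rep='?')  # or replaced by a placeholder

print(repr(skipped), repr(filled))
```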
Datetimelike rounding¶

DatetimeIndex, Timestamp, TimedeltaIndex, Timedelta have gained the .round(), .floor() and .ceil() methods for datetimelike rounding, flooring and ceiling. (GH4314, GH11963)

Naive datetimes

In [29]: dr = pd.date_range('20130101 09:12:56.1234', periods=3)

In [30]: dr
Out[30]: 
DatetimeIndex(['2013-01-01 09:12:56.123400', '2013-01-02 09:12:56.123400',
               '2013-01-03 09:12:56.123400'],
              dtype='datetime64[ns]', freq='D')

In [31]: dr.round('s')
Out[31]: 
DatetimeIndex(['2013-01-01 09:12:56', '2013-01-02 09:12:56',
               '2013-01-03 09:12:56'],
              dtype='datetime64[ns]', freq=None)

# Timestamp scalar
In [32]: dr[0]
Out[32]: Timestamp('2013-01-01 09:12:56.123400', freq='D')

In [33]: dr[0].round('10s')
Out[33]: Timestamp('2013-01-01 09:13:00')

Tz-aware datetimes are rounded, floored and ceiled in local time

In [34]: dr = dr.tz_localize('US/Eastern')

In [35]: dr
Out[35]: 
DatetimeIndex(['2013-01-01 09:12:56.123400-05:00',
               '2013-01-02 09:12:56.123400-05:00',
               '2013-01-03 09:12:56.123400-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [36]: dr.round('s')
Out[36]:
DatetimeIndex(['2013-01-01 09:12:56-05:00', '2013-01-02 09:12:56-05:00',
               '2013-01-03 09:12:56-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq=None)

Timedeltas

In [37]: t = pd.timedelta_range('1 days 2 hr 13 min 45 us',periods=3,freq='d')

In [38]: t
Out[38]: 
TimedeltaIndex(['1 days 02:13:00.000045', '2 days 02:13:00.000045',
                '3 days 02:13:00.000045'],
               dtype='timedelta64[ns]', freq='D')

In [39]: t.round('10min')
Out[39]: TimedeltaIndex(['1 days 02:10:00', '2 days 02:10:00', '3 days 02:10:00'], dtype='timedelta64[ns]', freq=None)

# Timedelta scalar
In [40]: t[0]
Out[40]: Timedelta('1 days 02:13:00.000045')

In [41]: t[0].round('2h')
Out[41]: Timedelta('1 days 02:00:00')

In addition, .round(), .floor() and .ceil() are available through the .dt accessor of Series.

In [42]: s = pd.Series(dr)

In [43]: s
Out[43]: 
0   2013-01-01 09:12:56.123400-05:00
1   2013-01-02 09:12:56.123400-05:00
2   2013-01-03 09:12:56.123400-05:00
dtype: datetime64[ns, US/Eastern]

In [44]: s.dt.round('D')
Out[44]:
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
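
To make the difference between the three methods concrete, the sketch below applies each of them to the same scalar and then uses the .dt accessor on a small Series. The sample values and the hourly frequency in the last step are assumptions chosen for illustration.

```python
import pandas as pd

stamp = pd.Timestamp('2013-01-01 09:12:56.1234')
delta = pd.Timedelta('1 days 02:13:00.000045')

rounded = stamp.round('10s')   # nearest multiple of 10 seconds
floored = stamp.floor('10s')   # previous multiple
ceiled = stamp.ceil('10s')     # next multiple

delta_rounded = delta.round('2h')

# The same methods are reachable through the .dt accessor of a Series.
series = pd.Series(pd.to_datetime(['2013-01-01 09:12:56',
                                   '2013-01-02 18:40:01']))
by_hour = series.dt.round('h')

print(rounded, floored, ceiled, delta_rounded)
print(by_hour)
```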
Formatting of Integers in FloatIndex¶

Integers in FloatIndex, e.g. 1., are now formatted with a decimal point and a 0 digit, e.g. 1.0 (GH11713). This change affects not only the display to the console, but also the output of IO methods like .to_csv or .to_html.

Previous Behavior:

In [2]: s = pd.Series([1,2,3], index=np.arange(3.))

In [3]: s
Out[3]:
0    1
1    2
2    3
dtype: int64

In [4]: s.index
Out[4]: Float64Index([0.0, 1.0, 2.0], dtype='float64')

In [5]: print(s.to_csv(path=None))
0,1
1,2
2,3

New Behavior:

In [45]: s = pd.Series([1,2,3], index=np.arange(3.))

In [46]: s
Out[46]: 
0.0    1
1.0    2
2.0    3
dtype: int64

In [47]: s.index
Out[47]: Float64Index([0.0, 1.0, 2.0], dtype='float64')

In [48]: print(s.to_csv(path=None))
0.0,1
1.0,2
2.0,3
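
Code that parses the CSV text may need to expect the decimal point in the index column. The following sketch calls to_csv without a target so that it returns a string; header=False is passed because later pandas versions write a header row by default, which is an assumption about the installed version rather than something stated above.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3], index=np.arange(3.))

# With no target given, to_csv returns the text; keep only the data rows.
lines = series.to_csv(header=False).splitlines()

print(lines)
```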
Changes to dtype assignment behaviors¶

When a DataFrame’s slice is updated with a new slice of the same dtype, the dtype of the DataFrame will now remain the same. (GH10503)

Previous Behavior:

In [5]: df = pd.DataFrame({'a': [0, 1, 1],
                           'b': pd.Series([100, 200, 300], dtype='uint32')})

In [7]: df.dtypes
Out[7]:
a     int64
b    uint32
dtype: object

In [8]: ix = df['a'] == 1

In [9]: df.loc[ix, 'b'] = df.loc[ix, 'b']

In [11]: df.dtypes
Out[11]:
a    int64
b    int64
dtype: object

New Behavior:

In [49]: df = pd.DataFrame({'a': [0, 1, 1],
   ....:                    'b': pd.Series([100, 200, 300], dtype='uint32')})
   ....: 

In [50]: df.dtypes
Out[50]: 
a     int64
b    uint32
dtype: object

In [51]: ix = df['a'] == 1

In [52]: df.loc[ix, 'b'] = df.loc[ix, 'b']

In [53]: df.dtypes
Out[53]: 
a     int64
b    uint32
dtype: object
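
The same check can be written as a small script that records the dtype before and after the assignment. The names before, after and mask are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'a': [0, 1, 1],
                      'b': pd.Series([100, 200, 300], dtype='uint32')})
before = frame['b'].dtype

# Assign a slice back onto itself: same values, same dtype.
mask = frame['a'] == 1
frame.loc[mask, 'b'] = frame.loc[mask, 'b']

after = frame['b'].dtype
print(before, after)
```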

When a DataFrame’s integer slice is partially updated with a new slice of floats that could potentially be downcast to integer without losing precision, the dtype of the slice will be set to float instead of integer.

Previous Behavior:

In [4]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3),
                          columns=list('abc'),
                          index=[[4,4,8], [8,10,12]])

In [5]: df
Out[5]:
      a  b  c
4 8   1  2  3
  10  4  5  6
8 12  7  8  9

In [7]: df.ix[4, 'c'] = np.array([0., 1.])

In [8]: df
Out[8]:
      a  b  c
4 8   1  2  0
  10  4  5  1
8 12  7  8  9

New Behavior:

In [54]: df = pd.DataFrame(np.array(range(1,10)).reshape(3,3),
   ....:                   columns=list('abc'),
   ....:                   index=[[4,4,8], [8,10,12]])
   ....: 

In [55]: df
Out[55]: 
      a  b  c
4 8   1  2  3
  10  4  5  6
8 12  7  8  9

In [56]: df.loc[4, 'c'] = np.array([0., 1.])

In [57]: df
Out[57]: 
      a  b    c
4 8   1  2  0.0
  10  4  5  1.0
8 12  7  8  9.0
to_xarray¶

In a future version of pandas, we will be deprecating Panel and other > 2 ndim objects. In order to provide for continuity, all NDFrame objects have gained the .to_xarray() method in order to convert to xarray objects, which have a pandas-like interface for > 2 ndim. (GH11972)

See the full xarray documentation here.

In [1]: p = pd.Panel(np.arange(2*3*4).reshape(2,3,4))

In [2]: p.to_xarray()
Out[2]:
<xarray.DataArray (items: 2, major_axis: 3, minor_axis: 4)>
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Coordinates:
  * items       (items) int64 0 1
  * major_axis  (major_axis) int64 0 1 2
  * minor_axis  (minor_axis) int64 0 1 2 3
Latex Representation¶

DataFrame has gained a ._repr_latex_() method in order to allow for conversion to latex in an ipython/jupyter notebook using nbconvert. (GH11778)

Note that this must be activated by setting the option pd.options.display.latex.repr = True (GH12182)

For example, if you have a jupyter notebook you plan to convert to latex using nbconvert, place the statement pd.options.display.latex.repr = True in the first cell to have the contained DataFrame output also stored as latex.

The options display.latex.escape and display.latex.longtable have also been added to the configuration and are used automatically by the to_latex method. See the available options docs for more info.

pd.read_sas() changes¶

read_sas has gained the ability to read SAS7BDAT files, including compressed files. The files can be read in entirety, or incrementally. For full details see here. (GH4052)

Other enhancements¶

Backwards incompatible API changes¶

NaT and Timedelta operations¶

NaT and Timedelta have expanded arithmetic operations, which are extended to Series arithmetic where applicable. Operations defined for datetime64[ns] or timedelta64[ns] are now also defined for NaT (GH11564).

NaT now supports arithmetic operations with integers and floats.

In [58]: pd.NaT * 1
Out[58]: NaT

In [59]: pd.NaT * 1.5
Out[59]: NaT

In [60]: pd.NaT / 2
Out[60]: NaT

In [61]: pd.NaT * np.nan
Out[61]: NaT

NaT defines more arithmetic operations with datetime64[ns] and timedelta64[ns].

In [62]: pd.NaT / pd.NaT
Out[62]: nan

In [63]: pd.Timedelta('1s') / pd.NaT
Out[63]: nan

NaT may represent either a datetime64[ns] null or a timedelta64[ns] null. Given the ambiguity, it is treated as a timedelta64[ns], which allows more operations to succeed.

In [64]: pd.NaT + pd.NaT
Out[64]: NaT

# same as
In [65]: pd.Timedelta('1s') + pd.Timedelta('1s')
Out[65]: Timedelta('0 days 00:00:02')

as opposed to

In [3]: pd.Timestamp('19900315') + pd.Timestamp('19900315')
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'

However, when wrapped in a Series whose dtype is datetime64[ns] or timedelta64[ns], the dtype information is respected.

In [1]: pd.Series([pd.NaT], dtype='<M8[ns]') + pd.Series([pd.NaT], dtype='<M8[ns]')
TypeError: can only operate on a datetimes for subtraction,
           but the operator [__add__] was passed

In [66]: pd.Series([pd.NaT], dtype='<m8[ns]') + pd.Series([pd.NaT], dtype='<m8[ns]')
Out[66]: 
0   NaT
dtype: timedelta64[ns]

Timedelta division by floats now works.

In [67]: pd.Timedelta('1s') / 2.0
Out[67]: Timedelta('0 days 00:00:00.500000')

Subtracting a Series of Timedelta values from a Timestamp now works (GH11925)

In [68]: ser = pd.Series(pd.timedelta_range('1 day', periods=3))

In [69]: ser
Out[69]: 
0   1 days
1   2 days
2   3 days
dtype: timedelta64[ns]

In [70]: pd.Timestamp('2012-01-01') - ser
Out[70]:
0   2011-12-31
1   2011-12-30
2   2011-12-29
dtype: datetime64[ns]
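
A few of the operations above, gathered into one script as a hedged sketch. The variable names are illustrative, and the NaT result is checked with pd.isna rather than by comparison, since NaT does not compare equal to itself.

```python
import pandas as pd

# Division of a Timedelta by a float.
half_second = pd.Timedelta('1s') / 2.0

# Scaling NaT by an integer stays NaT.
nat_scaled = pd.NaT * 1

# A Timestamp minus a Series of timedeltas gives a Series of datetimes.
offsets = pd.Series(pd.timedelta_range('1 day', periods=3))
dates = pd.Timestamp('2012-01-01') - offsets

print(half_second, nat_scaled)
print(dates)
```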

NaT.isoformat() now returns 'NaT'. This change allows pd.Timestamp to rehydrate any timestamp-like object from its isoformat (GH12300).
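
The round trip can be sketched with a small helper. The name round_trip is hypothetical and exists only for this example.

```python
import pandas as pd


def round_trip(stamp):
    """Rebuild a timestamp-like value from its ISO 8601 string."""
    return pd.Timestamp(stamp.isoformat())


regular = round_trip(pd.Timestamp('2016-03-13 12:30:00'))
missing = round_trip(pd.NaT)

print(regular, missing)
```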

Changes to msgpack¶

Forward incompatible changes in msgpack writing format were made over 0.17.0 and 0.18.0; older versions of pandas cannot read files packed by newer versions (GH12129, GH10527)

Bugs in to_msgpack and read_msgpack introduced in 0.17.0 and fixed in 0.18.0, caused files packed in Python 2 unreadable by Python 3 (GH12142). The following table describes the backward and forward compat of msgpacks.

Warning

Packed with            Can be unpacked with
pre-0.17 / Python 2    any
pre-0.17 / Python 3    any
0.17 / Python 2        Python 2 only
0.17 / Python 3        >=0.18 / any Python
0.18                   >= 0.18

0.18.0 is backward-compatible for reading files packed by older versions, except for files packed with 0.17 in Python 2, which can only be unpacked in Python 2.

Signature change for .rank¶

Series.rank and DataFrame.rank now have the same signature (GH11759)

Previous signature

In [3]: pd.Series([0,1]).rank(method='average', na_option='keep',
                              ascending=True, pct=False)
Out[3]:
0    1
1    2
dtype: float64

In [4]: pd.DataFrame([0,1]).rank(axis=0, numeric_only=None,
                                 method='average', na_option='keep',
                                 ascending=True, pct=False)
Out[4]:
   0
0  1
1  2

New signature

In [71]: pd.Series([0,1]).rank(axis=0, method='average', numeric_only=None,
   ....:                       na_option='keep', ascending=True, pct=False)
   ....: 
Out[71]: 
0    1.0
1    2.0
dtype: float64

In [72]: pd.DataFrame([0,1]).rank(axis=0, method='average', numeric_only=None,
   ....:                          na_option='keep', ascending=True, pct=False)
   ....: 
Out[72]:
     0
0  1.0
1  2.0
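
One practical consequence is that a single set of keyword arguments can be reused for both object types. The sketch below leaves out numeric_only, since the values it accepts have changed in later pandas versions; that omission is a choice made for this example, not part of the change described above.

```python
import pandas as pd

# Keyword arguments accepted by both Series.rank and DataFrame.rank.
rank_kwargs = dict(axis=0, method='average', na_option='keep',
                   ascending=True, pct=False)

series_ranks = pd.Series([0, 1]).rank(**rank_kwargs)
frame_ranks = pd.DataFrame([0, 1]).rank(**rank_kwargs)

print(series_ranks)
print(frame_ranks)
```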
Bug in QuarterBegin with n=0¶

In previous versions, the behavior of the QuarterBegin offset was inconsistent depending on the date when the n parameter was 0. (GH11406)

The general semantics of anchored offsets for n=0 is to not move the date when it is an anchor point (e.g., a quarter start date), and otherwise roll forward to the next anchor point.

In [73]: d = pd.Timestamp('2014-02-01')

In [74]: d
Out[74]: Timestamp('2014-02-01 00:00:00')

In [75]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[75]: Timestamp('2014-02-01 00:00:00')

In [76]: d + pd.offsets.QuarterBegin(n=0, startingMonth=1)
Out[76]: Timestamp('2014-04-01 00:00:00')

For the QuarterBegin offset in previous versions, the date would be rolled backwards if the date was in the same month as the quarter start date.

In [3]: d = pd.Timestamp('2014-02-15')

In [4]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[4]: Timestamp('2014-02-01 00:00:00')

This behavior has been corrected in version 0.18.0, which is consistent with other anchored offsets like MonthBegin and YearBegin.

In [77]: d = pd.Timestamp('2014-02-15')

In [78]: d + pd.offsets.QuarterBegin(n=0, startingMonth=2)
Out[78]: Timestamp('2014-05-01 00:00:00')
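
The n=0 rule (stay put on an anchor point, otherwise roll forward) can be verified side by side, as in this sketch. The variable names are illustrative.

```python
import pandas as pd

anchor = pd.offsets.QuarterBegin(n=0, startingMonth=2)

# Already a quarter start for startingMonth=2, so the date does not move.
on_anchor = pd.Timestamp('2014-02-01') + anchor

# Mid-month, so the date rolls forward to the next quarter start.
off_anchor = pd.Timestamp('2014-02-15') + anchor

print(on_anchor, off_anchor)
```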
Resample API¶

Like the change in the window functions API above, .resample(...) is changing to have a more groupby-like API. (GH11732, GH12702, GH12202, GH12332, GH12334, GH12348, GH12448).

In [79]: np.random.seed(1234)

In [80]: df = pd.DataFrame(np.random.rand(10,4),
   ....:                   columns=list('ABCD'),
   ....:                   index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
   ....: 

In [81]: df
Out[81]: 
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.622109  0.437728  0.785359
2010-01-01 09:00:01  0.779976  0.272593  0.276464  0.801872
2010-01-01 09:00:02  0.958139  0.875933  0.357817  0.500995
2010-01-01 09:00:03  0.683463  0.712702  0.370251  0.561196
2010-01-01 09:00:04  0.503083  0.013768  0.772827  0.882641
2010-01-01 09:00:05  0.364886  0.615396  0.075381  0.368824
2010-01-01 09:00:06  0.933140  0.651378  0.397203  0.788730
2010-01-01 09:00:07  0.316836  0.568099  0.869127  0.436173
2010-01-01 09:00:08  0.802148  0.143767  0.704261  0.704581
2010-01-01 09:00:09  0.218792  0.924868  0.442141  0.909316

Previous API:

You would write a resampling operation that immediately evaluates. If a how parameter was not provided, it would default to how='mean'.

In [6]: df.resample('2s')
Out[6]:
                         A         B         C         D
2010-01-01 09:00:00  0.485748  0.447351  0.357096  0.793615
2010-01-01 09:00:02  0.820801  0.794317  0.364034  0.531096
2010-01-01 09:00:04  0.433985  0.314582  0.424104  0.625733
2010-01-01 09:00:06  0.624988  0.609738  0.633165  0.612452
2010-01-01 09:00:08  0.510470  0.534317  0.573201  0.806949

You could also specify a how directly

In [7]: df.resample('2s', how='sum')
Out[7]:
                         A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

New API:

Now, you can write .resample(..) as a 2-stage operation like .groupby(...), which yields a Resampler.

In [82]: r = df.resample('2s')

In [83]: r
Out[83]: DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left, label=left, convention=start, base=0]
Downsampling¶

You can then use this object to perform operations. These are downsampling operations (going from a higher frequency to a lower one).

In [84]: r.mean()
Out[84]: 
                            A         B         C         D
2010-01-01 09:00:00  0.485748  0.447351  0.357096  0.793615
2010-01-01 09:00:02  0.820801  0.794317  0.364034  0.531096
2010-01-01 09:00:04  0.433985  0.314582  0.424104  0.625733
2010-01-01 09:00:06  0.624988  0.609738  0.633165  0.612452
2010-01-01 09:00:08  0.510470  0.534317  0.573201  0.806949

In [85]: r.sum()
Out[85]: 
                            A         B         C         D
2010-01-01 09:00:00  0.971495  0.894701  0.714192  1.587231
2010-01-01 09:00:02  1.641602  1.588635  0.728068  1.062191
2010-01-01 09:00:04  0.867969  0.629165  0.848208  1.251465
2010-01-01 09:00:06  1.249976  1.219477  1.266330  1.224904
2010-01-01 09:00:08  1.020940  1.068634  1.146402  1.613897

Furthermore, resample now supports getitem operations to perform the resample on specific columns.

In [86]: r[['A','C']].mean()
Out[86]: 
                            A         C
2010-01-01 09:00:00  0.485748  0.357096
2010-01-01 09:00:02  0.820801  0.364034
2010-01-01 09:00:04  0.433985  0.424104
2010-01-01 09:00:06  0.624988  0.633165
2010-01-01 09:00:08  0.510470  0.573201

and .aggregate type operations.

In [87]: r.agg({'A' : 'mean', 'B' : 'sum'})
Out[87]: 
                            A         B
2010-01-01 09:00:00  0.485748  0.894701
2010-01-01 09:00:02  0.820801  1.588635
2010-01-01 09:00:04  0.433985  0.629165
2010-01-01 09:00:06  0.624988  1.219477
2010-01-01 09:00:08  0.510470  1.068634

These accessors can, of course, be combined

In [88]: r[['A','B']].agg(['mean','sum'])
Out[88]: 
                            A                   B          
                         mean       sum      mean       sum
2010-01-01 09:00:00  0.485748  0.971495  0.447351  0.894701
2010-01-01 09:00:02  0.820801  1.641602  0.794317  1.588635
2010-01-01 09:00:04  0.433985  0.867969  0.314582  0.629165
2010-01-01 09:00:06  0.624988  1.249976  0.609738  1.219477
2010-01-01 09:00:08  0.510470  1.020940  0.534317  1.068634
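
The downsampling steps above can be reproduced on a small deterministic frame, which makes the per-bin results easy to verify by hand. This is a sketch with made-up data; frame, resampler and the result names are illustrative.

```python
import pandas as pd

index = pd.date_range('2010-01-01 09:00:00', periods=4, freq='s')
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                      'B': [10.0, 20.0, 30.0, 40.0]}, index=index)

# Stage 1: a deferred Resampler; no aggregation has happened yet.
resampler = frame.resample('2s')

# Stage 2: reduce every column, one column, or different columns differently.
means = resampler.mean()
a_sums = resampler['A'].sum()
mixed = resampler.agg({'A': 'mean', 'B': 'sum'})

print(means)
print(mixed)
```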
Upsampling¶

Upsampling operations take you from a lower frequency to a higher frequency. These are now performed with the Resampler objects with backfill(), ffill(), fillna() and asfreq() methods.

In [89]: s = pd.Series(np.arange(5,dtype='int64'),
   ....:               index=pd.date_range('2010-01-01', periods=5, freq='Q'))
   ....: 

In [90]: s
Out[90]: 
2010-03-31    0
2010-06-30    1
2010-09-30    2
2010-12-31    3
2011-03-31    4
Freq: Q-DEC, dtype: int64

Previously

In [6]: s.resample('M', fill_method='ffill')
Out[6]:
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, dtype: int64

New API

In [91]: s.resample('M').ffill()
Out[91]: 
2010-03-31    0
2010-04-30    0
2010-05-31    0
2010-06-30    1
2010-07-31    1
2010-08-31    1
2010-09-30    2
2010-10-31    2
2010-11-30    2
2010-12-31    3
2011-01-31    3
2011-02-28    3
2011-03-31    4
Freq: M, dtype: int64
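
A smaller upsampling sketch, assuming daily data expanded to 12-hour steps rather than the quarterly-to-monthly case above, shows how ffill() and asfreq() differ in the slots that upsampling creates.

```python
import pandas as pd

daily = pd.Series([0, 1, 2],
                  index=pd.date_range('2010-01-01', periods=3, freq='D'))

resampler = daily.resample('12h')

forward_filled = resampler.ffill()   # carry the last observation forward
as_is = resampler.asfreq()           # leave the new slots empty (NaN)

print(forward_filled)
print(as_is)
```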

Note

In the new API, you can either downsample OR upsample. The prior implementation would allow you to pass an aggregator function (like mean) even though you were upsampling, which caused some confusion.

Previous API will work but with deprecations¶

Warning

The new API for resample includes some internal changes so that code written against the pre-0.18.0 API keeps working, with a deprecation warning, in most cases. Because the resample operation now returns a deferred object, we can intercept operations on it and do what the pre-0.18.0 API did (with a warning). Here is a typical use case:

In [4]: r = df.resample('2s')

In [6]: r*10
pandas/tseries/resample.py:80: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

Out[6]:
                      A         B         C         D
2010-01-01 09:00:00  4.857476  4.473507  3.570960  7.936154
2010-01-01 09:00:02  8.208011  7.943173  3.640340  5.310957
2010-01-01 09:00:04  4.339846  3.145823  4.241039  6.257326
2010-01-01 09:00:06  6.249881  6.097384  6.331650  6.124518
2010-01-01 09:00:08  5.104699  5.343172  5.732009  8.069486

However, getting and assignment operations directly on a Resampler will raise a ValueError:

In [7]: r.iloc[0] = 5
ValueError: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)

There is one situation where original code cannot behave the same way under the new API. The following code was intended to resample every 2s, take the mean, and then take the min of those results.

In [4]: df.resample('2s').min()
Out[4]:
A    0.433985
B    0.314582
C    0.357096
D    0.531096
dtype: float64

Under the new API, the same code takes the min within each 2s bin instead:

In [92]: df.resample('2s').min()
Out[92]: 
                            A         B         C         D
2010-01-01 09:00:00  0.191519  0.272593  0.276464  0.785359
2010-01-01 09:00:02  0.683463  0.712702  0.357817  0.500995
2010-01-01 09:00:04  0.364886  0.013768  0.075381  0.368824
2010-01-01 09:00:06  0.316836  0.568099  0.397203  0.436173
2010-01-01 09:00:08  0.218792  0.143767  0.442141  0.704581

The good news is the return dimensions will differ between the new API and the old API, so this should loudly raise an exception.

To replicate the original operation

In [93]: df.resample('2s').mean().min()
Out[93]: 
A    0.433985
B    0.314582
C    0.357096
D    0.531096
dtype: float64
Changes to eval¶

In prior versions, new column assignments in an eval expression resulted in an inplace change to the DataFrame. (GH9297, GH8664, GH10486)

In [94]: df = pd.DataFrame({'a': np.linspace(0, 10, 5), 'b': range(5)})

In [95]: df
Out[95]: 
      a  b
0   0.0  0
1   2.5  1
2   5.0  2
3   7.5  3
4  10.0  4

In [12]: df.eval('c = a + b')
FutureWarning: eval expressions containing an assignment currentlydefault to operating inplace.
This will change in a future version of pandas, use inplace=True to avoid this warning.

In [13]: df
Out[13]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

In version 0.18.0, a new inplace keyword was added to choose whether the assignment should be done inplace or return a copy.

In [96]: df
Out[96]: 
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

In [97]: df.eval('d = c - b', inplace=False)
Out[97]:
      a  b     c     d
0   0.0  0   0.0   0.0
1   2.5  1   3.5   2.5
2   5.0  2   7.0   5.0
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

In [98]: df
Out[98]:
      a  b     c
0   0.0  0   0.0
1   2.5  1   3.5
2   5.0  2   7.0
3   7.5  3  10.5
4  10.0  4  14.0

In [99]: df.eval('d = c - b', inplace=True)

In [100]: df
Out[100]: 
      a  b     c     d
0   0.0  0   0.0   0.0
1   2.5  1   3.5   2.5
2   5.0  2   7.0   5.0
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

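The two modes can be compared side by side in a short script. This is a minimal sketch: the frame mirrors the one above, and inplace is always passed explicitly so that the result does not depend on the default of the installed pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.linspace(0, 10, 5), "b": range(5)})
df["c"] = df["a"] + df["b"]

# inplace=False returns a new DataFrame and leaves df untouched.
with_d = df.eval("d = c - b", inplace=False)
had_d_before = "d" in df.columns

# inplace=True mutates df and returns None.
result = df.eval("d = c - b", inplace=True)

print(had_d_before, "d" in with_d.columns, result, "d" in df.columns)
```

Passing the keyword explicitly keeps the code unambiguous whichever default is in effect.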
Warning

For backwards compatibility, inplace defaults to True if not specified. This will change in a future version of pandas. If your code depends on an inplace assignment you should update it to explicitly set inplace=True.

The inplace keyword parameter was also added to the query method.

In [101]: df.query('a > 5')
Out[101]: 
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

In [102]: df.query('a > 5', inplace=True)

In [103]: df
Out[103]: 
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

Warning

Note that the default value for inplace in a query is False, which is consistent with prior versions.

eval has also been updated to allow multi-line expressions for multiple assignments. These expressions will be evaluated one at a time in order. Only assignments are valid for multi-line expressions.

In [104]: df
Out[104]: 
      a  b     c     d
3   7.5  3  10.5   7.5
4  10.0  4  14.0  10.0

In [105]: df.eval("""
   .....: e = d + a
   .....: f = e - 22
   .....: g = f / 2.0""", inplace=True)
   .....: 

In [106]: df
Out[106]: 
      a  b     c     d     e    f    g
3   7.5  3  10.5   7.5  15.0 -7.0 -3.5
4  10.0  4  14.0  10.0  20.0 -2.0 -1.0
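
The same multi-line evaluation can return a new frame instead of modifying the original. The sketch below uses a small hand-built frame (an assumption for illustration) and passes inplace=False explicitly.

```python
import pandas as pd

df = pd.DataFrame({"a": [7.5, 10.0], "d": [7.5, 10.0]})

# Each line is one assignment; later lines may use columns created by earlier ones.
out = df.eval(
    "e = d + a\n"
    "f = e - 22\n"
    "g = f / 2.0",
    inplace=False,
)

print(out)
print(list(df.columns))  # df itself still has only a and d
```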
Other API Changes¶ Deprecations¶ Removal of deprecated float indexers¶

In GH4892 indexing with floating point numbers on a non-Float64Index was deprecated (in version 0.14.0). In 0.18.0, this deprecation warning is removed and these will now raise a TypeError. (GH12165, GH12333)

In [109]: s = pd.Series([1, 2, 3], index=[4, 5, 6])

In [110]: s
Out[110]: 
4    1
5    2
6    3
dtype: int64

In [111]: s2 = pd.Series([1, 2, 3], index=list('abc'))

In [112]: s2
Out[112]: 
a    1
b    2
c    3
dtype: int64

Previous Behavior:

# this is label indexing
In [2]: s[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[2]: 2

# this is positional indexing
In [3]: s.iloc[1.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[3]: 2

# this is label indexing
In [4]: s.loc[5.0]
FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
Out[4]: 2

# .ix would coerce 1.0 to the positional 1, and index
In [5]: s2.ix[1.0] = 10
FutureWarning: scalar indexers for index type Index should be integers and not floating point

In [6]: s2
Out[6]:
a     1
b    10
c     3
dtype: int64

New Behavior:

For iloc, getting & setting via a float scalar will always raise.

In [3]: s.iloc[2.0]
TypeError: cannot do label indexing on <class 'pandas.indexes.numeric.Int64Index'> with these indexers [2.0] of <type 'float'>

Other indexers will coerce an integer-like float to the equivalent integer for both getting and setting. The FutureWarning has been dropped for .loc, .ix and [].

In [113]: s[5.0]
Out[113]: 2

In [114]: s.loc[5.0]
Out[114]: 2

and setting

In [115]: s_copy = s.copy()

In [116]: s_copy[5.0] = 10

In [117]: s_copy
Out[117]: 
4     1
5    10
6     3
dtype: int64

In [118]: s_copy = s.copy()

In [119]: s_copy.loc[5.0] = 10

In [120]: s_copy
Out[120]: 
4     1
5    10
6     3
dtype: int64

Positional setting with .ix and a float indexer will ADD this value to the index, rather than setting the value by position as it did previously.

In [3]: s2.ix[1.0] = 10
In [4]: s2
Out[4]:
a       1
b       2
c       3
1.0    10
dtype: int64

Slicing will also coerce integer-like floats to integers for a non-Float64Index.

In [121]: s.loc[5.0:6]
Out[121]: 
5    2
6    3
dtype: int64

Note that for floats that are NOT coercible to ints, the label-based bounds will be excluded:

In [122]: s.loc[5.1:6]
Out[122]: 
6    3
dtype: int64

Float indexing on a Float64Index is unchanged.

In [123]: s = pd.Series([1, 2, 3], index=np.arange(3.))

In [124]: s[1.0]
Out[124]: 2

In [125]: s[1.0:2.5]
Out[125]:
1.0    2
2.0    3
dtype: int64
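
The label-based and positional rules can be checked together in a short script. This is a sketch that only exercises .loc and .iloc; .ix is left out because it is not available in later pandas releases, and the exact error message is version dependent, so only the exception type is inspected.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=[4, 5, 6])

# Label-based lookup: the integer-like float 5.0 matches the label 5.
by_label = s.loc[5.0]

# Positional lookup: a float position is rejected.
try:
    s.iloc[1.0]
    iloc_error = None
except TypeError as exc:
    iloc_error = exc

print(by_label, type(iloc_error).__name__)
```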
Removal of prior version deprecations/changes¶ Performance Improvements¶ Bug Fixes¶ v0.17.1 (November 21, 2015)¶

Note

We are proud to announce that pandas has become a sponsored project of the NumFOCUS organization. This will help ensure the success of development of pandas as a world-class open-source project.

This is a minor bug-fix release from 0.17.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

New features¶ Conditional HTML Formatting¶

Warning

This is a new feature and is under active development. We’ll be adding features and possibly making breaking changes in future releases. Feedback is welcome.

We’ve added experimental support for conditional HTML formatting: the visual styling of a DataFrame based on the data. The styling is accomplished with HTML and CSS. Access the Styler class through the pandas.DataFrame.style attribute, which returns an instance of Styler with your data attached.

Here’s a quick example:

In [1]: np.random.seed(123)

In [2]: df = DataFrame(np.random.randn(10, 5), columns=list('abcde'))

In [3]: html = df.style.background_gradient(cmap='viridis', low=.5)

We can render the HTML to get the following table.

           a          b          c          d          e
0  -1.085631   0.997345   0.282978  -1.506295    -0.5786
1   1.651437  -2.426679  -0.428913   1.265936   -0.86674
2  -0.678886  -0.094709    1.49139  -0.638902  -0.443982
3  -0.434351    2.20593   2.186786   1.004054   0.386186
4   0.737369   1.490732  -0.935834   1.175829  -1.253881
5  -0.637752   0.907105  -1.428681  -0.140069  -0.861755
6  -0.255619  -2.798589  -1.771533  -0.699877   0.927462
7  -0.173636   0.002846   0.688223  -0.879536   0.283627
8  -0.805367  -1.727669    -0.3909   0.573806   0.338589
9   -0.01183   2.392365   0.412912   0.978736   2.238143

Styler interacts nicely with the Jupyter Notebook. See the documentation for more.

Enhancements¶ API changes¶ Deprecations¶ Performance Improvements¶ Bug Fixes¶ v0.17.0 (October 9, 2015)¶

This is a major release from 0.16.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.17.0 will no longer support compatibility with Python version 3.2 (GH9118)

Warning

The pandas.io.data package is deprecated and will be replaced by the pandas-datareader package. This will allow the data modules to be independently updated to your pandas installation. The API for pandas-datareader v0.1.1 is exactly the same as in pandas v0.17.0 (GH8961, GH10861).

After installing pandas-datareader, you can easily change your imports:

from pandas.io import data, wb

becomes

from pandas_datareader import data, wb

Highlights include:

Check the API Changes and deprecations before updating.

New features¶ Datetime with TZ¶

We are adding an implementation that natively supports datetime with timezones. A Series or a DataFrame column previously could be assigned a datetime with timezones, and would work as an object dtype. This had performance issues with a large number of rows. See the docs for more details. (GH8260, GH10763, GH11034).

The new implementation supports a single timezone across all rows of a column, so that operations can be performed efficiently.

In [1]: df = DataFrame({'A' : date_range('20130101',periods=3),
   ...:                 'B' : date_range('20130101',periods=3,tz='US/Eastern'),
   ...:                 'C' : date_range('20130101',periods=3,tz='CET')})
   ...: 

In [2]: df
Out[2]: 
           A                         B                         C
0 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00+01:00
1 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-02 00:00:00+01:00
2 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-03 00:00:00+01:00

In [3]: df.dtypes
Out[3]:
A                datetime64[ns]
B    datetime64[ns, US/Eastern]
C           datetime64[ns, CET]
dtype: object

In [4]: df.B
Out[4]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
Name: B, dtype: datetime64[ns, US/Eastern]

In [5]: df.B.dt.tz_localize(None)
Out[5]:
0   2013-01-01
1   2013-01-02
2   2013-01-03
Name: B, dtype: datetime64[ns]

This also uses a new dtype representation that is very similar in look-and-feel to its numpy cousin datetime64[ns].

In [6]: df['B'].dtype
Out[6]: datetime64[ns, US/Eastern]

In [7]: type(df['B'].dtype)
Out[7]: pandas.core.dtypes.dtypes.DatetimeTZDtype

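A short script can confirm that a timezone-aware column carries its timezone in the dtype and that the .dt accessor can strip or convert it. This is a sketch: tz_convert is not discussed above and is included only as a closely related accessor method, and the time unit shown in the dtype string may differ between pandas versions.

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

tz_dtype = s.dtype                 # a timezone-aware dtype rather than object
naive = s.dt.tz_localize(None)     # drop the timezone, keep the wall-clock times
as_utc = s.dt.tz_convert("UTC")    # same instants, expressed in UTC

print(tz_dtype)
print(naive.iloc[0], as_utc.iloc[0])
```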
Note

There is a slightly different string repr for the underlying DatetimeIndex as a result of the dtype changes, but functionally these are the same.

Previous Behavior:

In [1]: pd.date_range('20130101',periods=3,tz='US/Eastern')
Out[1]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
                       '2013-01-03 00:00:00-05:00'],
                      dtype='datetime64[ns]', freq='D', tz='US/Eastern')

In [2]: pd.date_range('20130101',periods=3,tz='US/Eastern').dtype
Out[2]: dtype('<M8[ns]')

New Behavior:

In [8]: pd.date_range('20130101',periods=3,tz='US/Eastern')
Out[8]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [9]: pd.date_range('20130101',periods=3,tz='US/Eastern').dtype
Out[9]: datetime64[ns, US/Eastern]
Releasing the GIL¶

We are releasing the global-interpreter-lock (GIL) on some cython operations. This will allow other threads to run simultaneously during computation, potentially allowing performance improvements from multi-threading. Notably groupby, nsmallest, value_counts and some indexing operations benefit from this. (GH8882)

For example, the groupby expression in the following code will have the GIL released during the factorization step (df.groupby('key')) as well as during the .sum() operation.

N = 1000000
ngroups = 10
df = DataFrame({'key' : np.random.randint(0,ngroups,size=N),
                'data' : np.random.randn(N) })
df.groupby('key')['data'].sum()

Releasing the GIL could benefit an application that uses threads for user interactions (e.g. Qt), or one that performs multi-threaded computations. A nice example of a library that can handle these types of computation-in-parallel is the dask library.

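To see what this makes possible, the same aggregation can be run on several frames from a thread pool. This is a sketch using the standard library's concurrent.futures, which is our choice here and not something the release itself prescribes. The frame sizes and worker count are arbitrary, and whether the threaded version is actually faster depends on the machine and the pandas build.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n_rows, n_groups = 200_000, 10
frames = [
    pd.DataFrame({"key": rng.randint(0, n_groups, size=n_rows),
                  "data": rng.randn(n_rows)})
    for _ in range(4)
]

def group_sum(frame):
    # Factorization and the sum run in compiled code that can release the GIL.
    return frame.groupby("key")["data"].sum()

# Run the same aggregation serially and on a small thread pool.
serial = [group_sum(frame) for frame in frames]
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(group_sum, frames))

print(all(a.equals(b) for a, b in zip(serial, threaded)))
```

The results are identical either way; only the wall-clock time can change.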
Plot submethods¶

The Series and DataFrame .plot() method allows for customizing plot types by supplying the kind keyword argument. Unfortunately, many of these kinds of plots use different required and optional keyword arguments, which makes it difficult to discover what any given plot kind uses out of the dozens of possible arguments.

To alleviate this issue, we have added a new, optional plotting interface, which exposes each kind of plot as a method of the .plot attribute. Instead of writing series.plot(kind=<kind>, ...), you can now also use series.plot.<kind>(...):

In [10]: df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])

In [11]: df.plot.bar()

As a result of this change, these methods are now all discoverable via tab-completion:

In [12]: df.plot.<TAB>
df.plot.area     df.plot.barh     df.plot.density  df.plot.hist     df.plot.line     df.plot.scatter
df.plot.bar      df.plot.box      df.plot.hexbin   df.plot.kde      df.plot.pie

Each method signature only includes relevant arguments. Currently, these are limited to required arguments, but in the future these will include optional arguments, as well. For an overview, see the new Plotting API documentation.

Additional methods for dt accessor¶ strftime¶

We are now supporting a Series.dt.strftime method for datetime-likes to generate a formatted string (GH10110). Examples:

# DatetimeIndex
In [13]: s = pd.Series(pd.date_range('20130101', periods=4))

In [14]: s
Out[14]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
dtype: datetime64[ns]

In [15]: s.dt.strftime('%Y/%m/%d')
Out[15]:
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
dtype: object

# PeriodIndex
In [16]: s = pd.Series(pd.period_range('20130101', periods=4))

In [17]: s
Out[17]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
dtype: object

In [18]: s.dt.strftime('%Y/%m/%d')
Out[18]:
0    2013/01/01
1    2013/01/02
2    2013/01/03
3    2013/01/04
dtype: object

The format string follows the same conventions as the Python standard library; details can be found here

total_seconds¶

A pd.Series of type timedelta64 has a new method .dt.total_seconds() returning the duration of the timedelta in seconds (GH10817)

# TimedeltaIndex
In [19]: s = pd.Series(pd.timedelta_range('1 minutes', periods=4))

In [20]: s
Out[20]: 
0   0 days 00:01:00
1   1 days 00:01:00
2   2 days 00:01:00
3   3 days 00:01:00
dtype: timedelta64[ns]

In [21]: s.dt.total_seconds()
Out[21]:
0        60.0
1     86460.0
2    172860.0
3    259260.0
dtype: float64
Period Frequency Enhancement¶

Period, PeriodIndex and period_range can now accept multiplied freq. Also, Period.freq and PeriodIndex.freq are now stored as a DateOffset instance like DatetimeIndex, and not as str (GH7811)

A multiplied freq represents a span of corresponding length. The example below creates a period of 3 days. Addition and subtraction will shift the period by its span.

In [22]: p = pd.Period('2015-08-01', freq='3D')

In [23]: p
Out[23]: Period('2015-08-01', '3D')

In [24]: p + 1
Out[24]: Period('2015-08-04', '3D')

In [25]: p - 2
Out[25]: Period('2015-07-26', '3D')

In [26]: p.to_timestamp()
Out[26]: Timestamp('2015-08-01 00:00:00')

In [27]: p.to_timestamp(how='E')
Out[27]: Timestamp('2015-08-03 00:00:00')

You can use the multiplied freq in PeriodIndex and period_range.

In [28]: idx = pd.period_range('2015-08-01', periods=4, freq='2D')

In [29]: idx
Out[29]: PeriodIndex(['2015-08-01', '2015-08-03', '2015-08-05', '2015-08-07'], dtype='period[2D]', freq='2D')

In [30]: idx + 1
Out[30]: PeriodIndex(['2015-08-03', '2015-08-05', '2015-08-07', '2015-08-09'], dtype='period[2D]', freq='2D')
Support for SAS XPORT files¶

read_sas() provides support for reading SAS XPORT format files. (GH4052).

df = pd.read_sas('sas_xport.xpt')

It is also possible to obtain an iterator and read an XPORT file incrementally.

for df in pd.read_sas('sas_xport.xpt', chunksize=10000):
    do_something(df)

See the docs for more details.

Support for Math Functions in .eval()¶

eval() now supports calling math functions (GH4893)

df = pd.DataFrame({'a': np.random.randn(10)})
df.eval("b = sin(a)")

The supported math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.

These functions map to the intrinsics for the NumExpr engine. For the Python engine, they are mapped to NumPy calls.

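As a quick check that the function names inside the expression behave like their NumPy counterparts, the sketch below evaluates an expression with sin and sqrt and compares it with the same computation done directly in NumPy. The input values are arbitrary, and inplace=False is passed explicitly so the result is a new frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.linspace(0.0, 1.0, 5)})

# sin and sqrt are resolved by eval itself; no NumPy call appears in the expression.
out = df.eval("b = sin(a) + sqrt(a)", inplace=False)
expected = np.sin(df["a"]) + np.sqrt(df["a"])

print(np.allclose(out["b"], expected))
```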
Changes to Excel with MultiIndex¶

In version 0.16.2 a DataFrame with MultiIndex columns could not be written to Excel via to_excel. That functionality has been added (GH10564), along with updating read_excel so that the data can be read back with no loss of information by specifying which columns/rows make up the MultiIndex in the header and index_col parameters (GH4679)

See the documentation for more details.

In [31]: df = pd.DataFrame([[1,2,3,4], [5,6,7,8]],
   ....:                   columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']],
   ....:                                                        names = ['col1', 'col2']),
   ....:                   index = pd.MultiIndex.from_product([['j'], ['l', 'k']],
   ....:                                                      names = ['i1', 'i2']))
   ....: 

In [32]: df
Out[32]: 
col1  foo    bar   
col2    a  b   a  b
i1 i2              
j  l    1  2   3  4
   k    5  6   7  8

In [33]: df.to_excel('test.xlsx')

In [34]: df = pd.read_excel('test.xlsx', header=[0,1], index_col=[0,1])

In [35]: df
Out[35]: 
col1  foo    bar   
col2    a  b   a  b
i1 i2              
j  l    1  2   3  4
   k    5  6   7  8

Previously, it was necessary to specify the has_index_names argument in read_excel if the serialized data had index names. For version 0.17.0 the output format of to_excel has been changed to make this keyword unnecessary - the change is shown below.

Old

New

Warning

Excel files saved in version 0.16.2 or prior that had index names can still be read in, but the has_index_names argument must be set to True.

Google BigQuery Enhancements¶ Display Alignment with Unicode East Asian Width¶

Warning

Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times slower). Use only when it is actually required.

Some East Asian countries use Unicode characters whose display width corresponds to two Latin characters. If a DataFrame or Series contains these characters, the default output cannot be aligned properly. The following options are added to enable precise handling for these characters.

In [36]: df = pd.DataFrame({u'国籍': ['UK', u'日本'], u'名前': ['Alice', u'しのぶ']})

In [37]: df;
In [38]: pd.set_option('display.unicode.east_asian_width', True)

In [39]: df;

For further details, see here

Other enhancements¶ Backwards incompatible API changes¶ Changes to sorting API¶

The sorting API has had some longtime inconsistencies. (GH9816, GH8239).

Here is a summary of the API PRIOR to 0.17.0:

To address these issues, we have revamped the API:

We now have two distinct and non-overlapping methods of sorting. A * marks items that will show a FutureWarning.

To sort by the values:

    Previous                        Replacement
*   Series.order()                  Series.sort_values()
*   Series.sort()                   Series.sort_values(inplace=True)
*   DataFrame.sort(columns=...)     DataFrame.sort_values(by=...)

To sort by the index:

    Previous                        Replacement
    Series.sort_index()             Series.sort_index()
    Series.sortlevel(level=...)     Series.sort_index(level=...)
    DataFrame.sort_index()          DataFrame.sort_index()
    DataFrame.sortlevel(level=...)  DataFrame.sort_index(level=...)
*   DataFrame.sort()                DataFrame.sort_index()

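The replacement methods can be exercised on a tiny frame. This is a sketch with made-up data; it uses only the new-style names, since the deprecated ones are not available in later pandas releases.

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 1, 2]}, index=["b", "c", "a"])

by_values = df.sort_values(by="x")   # replaces DataFrame.sort(columns=...)
by_index = df.sort_index()           # replaces DataFrame.sort() with no arguments

s = df["x"].copy()
sorted_copy = s.sort_values()        # replaces Series.order(); returns a new Series
s.sort_values(inplace=True)          # replaces Series.sort(); sorts s itself

print(list(by_values.index), list(by_index.index), list(s))
```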
We have also deprecated and changed similar methods in two Series-like classes, Index and Categorical.

    Previous                Replacement
*   Index.order()           Index.sort_values()
*   Categorical.order()     Categorical.sort_values()

Changes to to_datetime and to_timedelta¶ Error handling¶

The default for pd.to_datetime error handling has changed to errors='raise'. In prior versions it was errors='ignore'. Furthermore, the coerce argument has been deprecated in favor of errors='coerce'. This means that invalid parsing will raise rather than return the original input as in previous versions. (GH10636)

Previous Behavior:

In [2]: pd.to_datetime(['2009-07-31', 'asd'])
Out[2]: array(['2009-07-31', 'asd'], dtype=object)

New Behavior:

In [3]: pd.to_datetime(['2009-07-31', 'asd'])
ValueError: Unknown string format

Of course you can coerce this as well.

In [61]: to_datetime(['2009-07-31', 'asd'], errors='coerce')
Out[61]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None)

To keep the previous behavior, you can use errors='ignore':

In [62]: to_datetime(['2009-07-31', 'asd'], errors='ignore')
Out[62]: array(['2009-07-31', 'asd'], dtype=object)

Furthermore, pd.to_timedelta has gained a similar API, of errors='raise'|'ignore'|'coerce', and the coerce keyword has been deprecated in favor of errors='coerce'.

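The raise and coerce modes can be compared in one script. This sketch leaves out errors='ignore' because that mode is not accepted by every later pandas release, and it catches ValueError broadly because the concrete exception class and message vary by version.

```python
import pandas as pd

values = ["2009-07-31", "asd"]

# Default (errors='raise'): an unparseable entry raises.
try:
    pd.to_datetime(values)
    raised = False
except ValueError:
    raised = True

# errors='coerce': unparseable entries become NaT.
coerced = pd.to_datetime(values, errors="coerce")

# to_timedelta accepts the same keyword.
coerced_td = pd.to_timedelta(["1 days", "asd"], errors="coerce")

print(raised)
print(coerced)
print(coerced_td)
```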
Consistent Parsing¶

The string parsing of to_datetime, Timestamp and DatetimeIndex has been made consistent. (GH7599)

Prior to v0.17.0, Timestamp and to_datetime could parse a year-only datetime string incorrectly, using today’s date, whereas DatetimeIndex used the beginning of the year. Timestamp and to_datetime could also raise ValueError for some kinds of datetime string that DatetimeIndex could parse, such as a quarterly string.

Previous Behavior:

In [1]: Timestamp('2012Q2')
Traceback
   ...
ValueError: Unable to parse 2012Q2

# Results in today's date.
In [2]: Timestamp('2014')
Out [2]: 2014-08-12 00:00:00

v0.17.0 can parse them as below. It works on DatetimeIndex also.

New Behavior:

In [63]: Timestamp('2012Q2')
Out[63]: Timestamp('2012-04-01 00:00:00')

In [64]: Timestamp('2014')
Out[64]: Timestamp('2014-01-01 00:00:00')

In [65]: DatetimeIndex(['2012Q2', '2014'])
Out[65]: DatetimeIndex(['2012-04-01', '2014-01-01'], dtype='datetime64[ns]', freq=None)

Note

If you want to perform calculations based on today’s date, use Timestamp.now() and pandas.tseries.offsets.

In [66]: import pandas.tseries.offsets as offsets

In [67]: Timestamp.now()
Out[67]: Timestamp('2017-07-07 12:29:28.795000')

In [68]: Timestamp.now() + offsets.DateOffset(years=1)
Out[68]: Timestamp('2018-07-07 12:29:28.796446')
Changes to Index Comparisons¶

The equality operator on Index should behave similarly to Series (GH9947, GH10637)

Starting in v0.17.0, comparing Index objects of different lengths will raise a ValueError. This is to be consistent with the behavior of Series.

Previous Behavior:

In [2]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[2]: array([ True, False, False], dtype=bool)

In [3]: pd.Index([1, 2, 3]) == pd.Index([2])
Out[3]: array([False,  True, False], dtype=bool)

In [4]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
Out[4]: False

New Behavior:

In [8]: pd.Index([1, 2, 3]) == pd.Index([1, 4, 5])
Out[8]: array([ True, False, False], dtype=bool)

In [9]: pd.Index([1, 2, 3]) == pd.Index([2])
ValueError: Lengths must match to compare

In [10]: pd.Index([1, 2, 3]) == pd.Index([1, 2])
ValueError: Lengths must match to compare

Note that this is different from the numpy behavior where a comparison can be broadcast:

In [69]: np.array([1, 2, 3]) == np.array([1])
Out[69]: array([ True, False, False], dtype=bool)

or it can return False if broadcasting cannot be done:

In [70]: np.array([1, 2, 3]) == np.array([1, 2])
Out[70]: False
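
Code that may compare indexes of different lengths can guard against the new error. The sketch below shows the element-wise case, the failing case, and Index.equals. Index.equals is not mentioned in the text above; it is included as one way to ask whether two indexes are identical as a whole.

```python
import pandas as pd

left = pd.Index([1, 2, 3])

same_length = left == pd.Index([1, 4, 5])   # element-wise boolean array

# Different lengths: the comparison raises instead of returning False.
try:
    left == pd.Index([1, 2])
    length_error = None
except ValueError as exc:
    length_error = exc

# Index.equals answers "are these the same index?" with a single boolean.
whole_index_equal = left.equals(pd.Index([1, 2]))

print(list(same_length), type(length_error).__name__, whole_index_equal)
```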
Changes to Boolean Comparisons vs. None¶

Boolean comparisons of a Series vs None will now be equivalent to comparing with np.nan, rather than raise TypeError. (GH1079).

In [71]: s = Series(range(3))

In [72]: s.iloc[1] = None

In [73]: s
Out[73]: 
0    0.0
1    NaN
2    2.0
dtype: float64

Previous Behavior:

In [5]: s==None
TypeError: Could not compare <type 'NoneType'> type with Series

New Behavior:

In [74]: s==None
Out[74]: 
0    False
1    False
2    False
dtype: bool

Usually you simply want to know which values are null.

In [75]: s.isnull()
Out[75]: 
0    False
1     True
2    False
dtype: bool

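The difference between the two checks can be seen in a short script. This is a sketch that builds the Series directly from floats, so that it does not depend on how a particular pandas version handles assigning None into an integer Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 2.0])

# Comparing with None is element-wise and is False everywhere, even at the NaN.
eq_none = s == None

# isnull is True only where the value is missing.
null_mask = s.isnull()

print(list(eq_none), list(null_mask))
```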
Warning

You generally will want to use isnull/notnull for these types of comparisons, as isnull/notnull tells you which elements are null. One has to be mindful that nan's don’t compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.

In [76]: None == None
Out[76]: True

In [77]: np.nan == np.nan
Out[77]: False
HDFStore dropna behavior¶

The default behavior for HDFStore write functions with format='table' is now to keep rows that are all missing. Previously, the behavior was to drop rows that were all missing save the index. The previous behavior can be replicated using the dropna=True option. (GH9382)

Previous Behavior:

In [78]: df_with_missing = pd.DataFrame({'col1':[0, np.nan, 2],
   ....:                                 'col2':[1, np.nan, np.nan]})
   ....: 

In [79]: df_with_missing
Out[79]: 
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

In [27]: df_with_missing.to_hdf('file.h5',
                                'df_with_missing',
                                format='table',
                                mode='w')

In [28]: pd.read_hdf('file.h5', 'df_with_missing')

Out [28]:
      col1  col2
  0     0     1
  2     2   NaN

New Behavior:

In [80]: df_with_missing.to_hdf('file.h5',
   ....:                        'df_with_missing',
   ....:                         format='table',
   ....:                         mode='w')
   ....: 

In [81]: pd.read_hdf('file.h5', 'df_with_missing')
Out[81]: 
   col1  col2
0   0.0   1.0
1   NaN   NaN
2   2.0   NaN

See the docs for more details.

Changes to display.precision option¶

The display.precision option has been clarified to refer to decimal places (GH10451).

Earlier versions of pandas would format floating point numbers to have one less decimal place than the value in display.precision.

In [1]: pd.set_option('display.precision', 2)

In [2]: pd.DataFrame({'x': [123.456789]})
Out[2]:
       x
0  123.5

If interpreting precision as “significant figures” this did work for scientific notation but that same interpretation did not work for values with standard formatting. It was also out of step with how numpy handles formatting.

Going forward the value of display.precision will directly control the number of places after the decimal, for regular formatting as well as scientific notation, similar to how numpy’s precision print option works.

In [82]: pd.set_option('display.precision', 2)

In [83]: pd.DataFrame({'x': [123.456789]})
Out[83]: 
        x
0  123.46

To preserve output behavior with prior versions the default value of display.precision has been reduced to 6 from 7.

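The effect of the option can be checked without changing global state. This is a sketch using pd.option_context, a helper not mentioned above, which restores the previous setting when the block exits.

```python
import pandas as pd

df = pd.DataFrame({"x": [123.456789]})

# Two digits after the decimal point.
with pd.option_context("display.precision", 2):
    two_places = df.to_string()

# Four digits after the decimal point.
with pd.option_context("display.precision", 4):
    four_places = df.to_string()

print(two_places)
print(four_places)
```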
Changes to Categorical.unique¶

Categorical.unique now returns new Categoricals with categories and codes that are unique, rather than returning np.array (GH10508)

In [84]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
   ....:                      categories=['A', 'B', 'C'],
   ....:                      ordered=True)
   ....: 

In [85]: cat
Out[85]: 
[C, A, B, C]
Categories (3, object): [A < B < C]

In [86]: cat.unique()
Out[86]:
[C, A, B]
Categories (3, object): [A < B < C]

In [87]: cat = pd.Categorical(['C', 'A', 'B', 'C'],
   ....:                      categories=['A', 'B', 'C'])
   ....: 

In [88]: cat
Out[88]: 
[C, A, B, C]
Categories (3, object): [A, B, C]

In [89]: cat.unique()
Out[89]:
[C, A, B]
Categories (3, object): [C, A, B]
Other API Changes¶ Deprecations¶

Note

These indexing functions have been deprecated in the documentation since 0.11.0.

Removal of prior version deprecations/changes¶ Performance Improvements¶ Bug Fixes¶ v0.16.2 (June 12, 2015)¶

This is a minor bug-fix release from 0.16.1 and includes a large number of bug fixes along with some new features (pipe() method), enhancements, and performance improvements.

We recommend that all users upgrade to this version.

Highlights include:

New features¶ Pipe¶

We’ve introduced a new method DataFrame.pipe(). As suggested by the name, pipe should be used to pipe data through a chain of function calls. The goal is to avoid confusing nested function calls like

# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)

The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as

(df.pipe(h)
   .pipe(g, arg1=1)
   .pipe(f, arg2=2, arg3=3)
)

Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.

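To make the equivalence concrete, here is a minimal runnable sketch. The functions f, g and h are hypothetical stand-ins with made-up bodies; only their call signatures match the example above.

```python
import pandas as pd

# h, g and f are hypothetical stand-ins for real transformation steps.
def h(frame):
    return frame.dropna()

def g(frame, arg1):
    return frame * arg1

def f(frame, arg2, arg3):
    return frame + arg2 + arg3

df = pd.DataFrame({"x": [1.0, None, 3.0]})

# Nested calls read from the inside out.
nested = f(g(h(df), arg1=1), arg2=2, arg3=3)

# The pipe chain reads from top to bottom and gives the same result.
piped = (df.pipe(h)
           .pipe(g, arg1=1)
           .pipe(f, arg2=2, arg3=3))

print(piped.equals(nested))
```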
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument. When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple of (function, keyword) indicating where the DataFrame should flow. For example:

In [1]: import statsmodels.formula.api as sm

In [2]: bb = pd.read_csv('data/baseball.csv', index_col='id')

# sm.poisson takes (formula, data)
In [3]: (bb.query('h > 0')
   ...:    .assign(ln_h = lambda df: np.log(df.h))
   ...:    .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   ...:    .fit()
   ...:    .summary()
   ...: )
   ...: 
Optimization terminated successfully.
         Current function value: 2.116284
         Iterations 24
Out[3]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                          Poisson Regression Results                          
==============================================================================
Dep. Variable:                     hr   No. Observations:                   68
Model:                        Poisson   Df Residuals:                       63
Method:                           MLE   Df Model:                            4
Date:                Fri, 07 Jul 2017   Pseudo R-squ.:                  0.6878
Time:                        12:29:29   Log-Likelihood:                -143.91
converged:                       True   LL-Null:                       -460.91
                                        LLR p-value:                6.774e-136
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -1267.3636    457.867     -2.768      0.006   -2164.767    -369.960
C(lg)[T.NL]    -0.2057      0.101     -2.044      0.041      -0.403      -0.008
ln_h            0.9280      0.191      4.866      0.000       0.554       1.302
year            0.6301      0.228      2.762      0.006       0.183       1.077
g               0.0099      0.004      2.754      0.006       0.003       0.017
===============================================================================
"""
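Both calling conventions can be tried without statsmodels. The following is a minimal sketch; drop_missing, scale, and column_total are hypothetical helpers invented for illustration, with column_total standing in for a function that, like sm.poisson, takes its data as the second argument.

```python
import pandas as pd

# Hypothetical helpers, named here only for illustration.
def drop_missing(df):
    # Takes the DataFrame as its first positional argument.
    return df.dropna()

def scale(df, factor=1):
    # Also takes the DataFrame first, plus one keyword argument.
    return df * factor

def column_total(column, data):
    # Takes its data as the second argument, so it needs the tuple form.
    return data[column].sum()

df = pd.DataFrame({"hr": [1.0, None, 3.0]})

# Nested form: reads from the inside out.
nested = column_total("hr", data=scale(drop_missing(df), factor=2))

# Piped form: reads from top to bottom.
piped = (df.pipe(drop_missing)
           .pipe(scale, factor=2)
           # 'data' names the keyword that should receive the DataFrame
           .pipe((column_total, "data"), "hr"))

print(nested, piped)
```

The two spellings compute the same value; only the reading order differs.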

The pipe method is inspired by Unix pipes, which stream text through processes. More recently, dplyr and magrittr have introduced the popular %>% pipe operator for R.

See the documentation for more. (GH10129)

Other Enhancements¶ API Changes¶ Performance Improvements¶ Bug Fixes¶ v0.16.1 (May 11, 2015)¶

This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

Warning

In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package. See here for details (GH8961)

Enhancements¶ CategoricalIndex¶

We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert it to a regular object-based Index.

In [1]: df = DataFrame({'A' : np.arange(6),
   ...:                 'B' : Series(list('aabbca')).astype('category',
   ...:                                                     categories=list('cab'))
   ...:                })
   ...: 

In [2]: df
Out[2]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]: 
A       int64
B    category
dtype: object

In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')

setting the index will create a CategoricalIndex

In [5]: df2 = df.set_index('B')

In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the category or the operation will raise.

In [7]: df2.loc['a']
Out[7]: 
   A
B   
a  0
a  1
a  5

and preserves the CategoricalIndex

In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

sorting will order by the order of the categories

In [9]: df2.sort_index()
Out[9]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

groupby operations on the index will preserve the index nature as well

In [10]: df2.groupby(level=0).sum()
Out[10]: 
   A
B   
c  4
a  6
b  5

In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

reindexing operations return a resulting index based on the type of the passed indexer: passing a list returns a plain-old Index, while indexing with a Categorical returns a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.

In [12]: df2.reindex(['a','e'])
Out[12]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN

In [13]: df2.reindex(['a','e']).index
Out[13]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [14]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
Out[14]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN

In [15]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
Out[15]: CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')

See the documentation for more. (GH7629, GH10038, GH10039)
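The label lookup and sorting behaviour can also be reproduced by building the index directly. This is a minimal sketch that constructs the CategoricalIndex with its constructor rather than through astype and set_index; the variable names are illustrative only.

```python
import pandas as pd

# Build the index directly; the category order is deliberately 'c', 'a', 'b'.
idx = pd.CategoricalIndex(list("aabbca"), categories=list("cab"), name="B")
df2 = pd.DataFrame({"A": range(6)}, index=idx)

# Label-based selection works as with any index containing duplicates
# and keeps the CategoricalIndex type.
subset = df2.loc["a"]
print(type(subset.index).__name__)

# Sorting follows the order of the categories, not alphabetical order.
sorted_labels = list(df2.sort_index().index)
print(sorted_labels)

# A label that is not one of the categories raises.
try:
    df2.loc["z"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```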

Sample¶

Series, DataFrames, and Panels now have a new method: sample(). The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column for weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH2419)

In [16]: example_series = Series([0,1,2,3,4,5])

# When no arguments are passed, returns 1 row
In [17]: example_series.sample()
Out[17]: 
3    3
dtype: int64

# One may specify either a number of rows:
In [18]: example_series.sample(n=3)
Out[18]: 
5    5
1    1
4    4
dtype: int64

# Or a fraction of the rows:
In [19]: example_series.sample(frac=0.5)
Out[19]: 
4    4
1    1
0    0
dtype: int64

# weights are accepted.
In [20]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [21]: example_series.sample(n=3, weights=example_weights)
Out[21]: 
2    2
3    3
5    5
dtype: int64

# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [22]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]

In [23]: example_series.sample(n=1, weights=example_weights2)
Out[23]: 
0    0
dtype: int64

When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.

In [24]: df = DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})

In [25]: df.sample(n=3, weights='weight_column')
Out[25]: 
   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1
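
Because sample() draws at random, the outputs above change from run to run. The sketch below shows how the seed option (the random_state keyword) makes a draw repeatable, and how zero weights exclude rows; the seed values are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# The same seed gives the same draw, which helps with replication.
first = s.sample(n=3, random_state=42)
second = s.sample(n=3, random_state=42)
print(first.equals(second))

# Rows with zero weight can never be selected.
weighted = s.sample(n=3, weights=[0, 0, 0.2, 0.2, 0.2, 0.4], random_state=0)
print(sorted(weighted.index))

# With a DataFrame, a column name can supply the weights. Only three rows
# have a non-zero weight, so drawing three without replacement returns them.
df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})
rows = df.sample(n=3, weights="weight_column", random_state=0)
print(sorted(rows.index))
```
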
String Methods Enhancements¶

Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard python string operations.

Other Enhancements¶ API changes¶ Deprecations¶ Index Representation¶

The string representation of Index and its sub-classes has now been unified. These will show a single-line display if there are few values; a wrapped multi-line display for a lot of values (but fewer than display.max_seq_items); and a truncated display (the head and tail of the data) if there are lots of items (> display.max_seq_items). The formatting for MultiIndex is unchanged (a multi-line wrapped display). The display width responds to the option display.max_seq_items, which defaults to 100. (GH6482)

Previous Behavior

In [2]: pd.Index(range(4),name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')

In [3]: pd.Index(range(104),name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [4]: pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern

In [5]: pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

New Behavior

In [45]: pd.set_option('display.width', 80)

In [46]: pd.Index(range(4), name='foo')
Out[46]: RangeIndex(start=0, stop=4, step=1, name='foo')

In [47]: pd.Index(range(30), name='foo')
Out[47]: RangeIndex(start=0, stop=30, step=1, name='foo')

In [48]: pd.Index(range(104), name='foo')
Out[48]: RangeIndex(start=0, stop=104, step=1, name='foo')

In [49]: pd.CategoricalIndex(['a','bb','ccc','dddd'], ordered=True, name='foobar')
Out[49]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category')

In [50]: pd.CategoricalIndex(['a','bb','ccc','dddd']*10, ordered=True, name='foobar')
Out[50]: 
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
                  'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb',
                  'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
                  'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd',
                  'a', 'bb', 'ccc', 'dddd'],
                 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category')

In [51]: pd.CategoricalIndex(['a','bb','ccc','dddd']*100, ordered=True, name='foobar')
Out[51]: 
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
                  'bb',
                  ...
                  'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
                  'dddd'],
                 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, name='foobar', dtype='category', length=400)

In [52]: pd.date_range('20130101',periods=4, name='foo', tz='US/Eastern')
Out[52]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', name='foo', freq='D')

In [53]: pd.date_range('20130101',periods=25, freq='D')
Out[53]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
               '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
               '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
               '2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
               '2013-01-25'],
              dtype='datetime64[ns]', freq='D')

In [54]: pd.date_range('20130101',periods=104, name='foo', tz='US/Eastern')
Out[54]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
               '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
               '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
               '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
               ...
               '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
               '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
               '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
               '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
               '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
              dtype='datetime64[ns, US/Eastern]', name='foo', length=104, freq='D')
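
The effect of display.max_seq_items can be observed on a small index by lowering the option temporarily. This is a sketch only: the threshold of 10 and the 26-element index are arbitrary choices, and the exact wrapping of the output may vary between pandas versions.

```python
import pandas as pd

idx = pd.Index(list("abcdefghijklmnopqrstuvwxyz"))

# With the default threshold (100) all 26 labels are shown.
full_repr = repr(idx)

# Lowering the threshold below the index length truncates the display
# to the head and tail of the data and reports the total length.
with pd.option_context("display.max_seq_items", 10):
    short_repr = repr(idx)

print(full_repr)
print(short_repr)
```
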
Performance Improvements¶ Bug Fixes¶ v0.16.0 (March 22, 2015)¶

This is a major release from 0.15.2 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

Check the API Changes and deprecations before updating.

New features¶ DataFrame Assign¶

Inspired by dplyr’s mutate verb, DataFrame has a new assign() method. The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.

In [1]: iris = read_csv('data/iris.data')

In [2]: iris.head()
Out[2]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [3]: iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength']).head()
Out[3]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

Above was an example of inserting a precomputed value. We can also pass in a function to be evaluated.

In [4]: iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
   ...:                                      x['SepalLength'])).head()
   ...: 
Out[4]: 
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa     0.686275
1          4.9         3.0          1.4         0.2  Iris-setosa     0.612245
2          4.7         3.2          1.3         0.2  Iris-setosa     0.680851
3          4.6         3.1          1.5         0.2  Iris-setosa     0.673913
4          5.0         3.6          1.4         0.2  Iris-setosa     0.720000

The power of assign comes when it is used in chains of operations. For example, we can limit the DataFrame to just those rows with a Sepal Length greater than 5, calculate the ratio, and plot:

In [5]: (iris.query('SepalLength > 5')
   ...:      .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
   ...:              PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
   ...:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
   ...: 
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x13df38208>
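The chained form can be tried without the iris file or matplotlib. The following is a minimal sketch on a small hand-made frame (the three rows are invented for illustration); it shows that the callable receives the already-filtered DataFrame and that the original is left unmodified.

```python
import pandas as pd

# A tiny stand-in for the iris data; the values are made up for illustration.
df = pd.DataFrame({"SepalLength": [5.1, 4.9, 6.0],
                   "SepalWidth": [3.5, 3.0, 3.0]})

# The lambda is called on the result of query(), not on the original df,
# so the ratio is only computed for the rows that survive the filter.
result = (df.query("SepalLength > 5")
            .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength))

print(result)

# assign returns a new object; df itself has no SepalRatio column.
print(list(df.columns))
```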

See the documentation for more. (GH9229)

Interaction with scipy.sparse¶

Added SparseSeries.to_coo() and SparseSeries.from_coo() methods (GH8048) for converting to and from scipy.sparse.coo_matrix instances (see here). For example, given a SparseSeries with MultiIndex we can convert to a scipy.sparse.coo_matrix by specifying the row and column labels as index levels:

In [6]: from numpy import nan

In [7]: s = Series([3.0, nan, 1.0, 3.0, nan, nan])

In [8]: s.index = MultiIndex.from_tuples([(1, 2, 'a', 0),
   ...:                                   (1, 2, 'a', 1),
   ...:                                   (1, 1, 'b', 0),
   ...:                                   (1, 1, 'b', 1),
   ...:                                   (2, 1, 'b', 0),
   ...:                                   (2, 1, 'b', 1)],
   ...:                                   names=['A', 'B', 'C', 'D'])
   ...: 

In [9]: s
Out[9]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64

# SparseSeries
In [10]: ss = s.to_sparse()

In [11]: ss
Out[11]: 
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)

In [12]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
   ....:                              column_levels=['C', 'D'],
   ....:                              sort_labels=False)
   ....: 

In [13]: A
Out[13]: 
<3x4 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [14]: A.todense()
Out[14]: 
matrix([[ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  1.,  3.],
        [ 0.,  0.,  0.,  0.]])

In [15]: rows
Out[15]: [(1, 2), (1, 1), (2, 1)]

In [16]: columns
Out[16]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

The from_coo method is a convenience method for creating a SparseSeries from a scipy.sparse.coo_matrix:

In [17]: from scipy import sparse

In [18]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
   ....:                             shape=(3, 4))
   ....: 

In [19]: A
Out[19]: 
<3x4 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [20]: A.todense()
Out[20]: 
matrix([[ 0.,  0.,  1.,  2.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

In [21]: ss = SparseSeries.from_coo(A)

In [22]: ss
Out[22]: 
0  2    1.0
   3    2.0
1  0    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)
String Methods Enhancements¶ Other enhancements¶ Backwards incompatible API changes¶ Changes in Timedelta¶

In v0.15.0 a new scalar type Timedelta was introduced, that is a sub-class of datetime.timedelta. Mentioned here was a notice of an API change w.r.t. the .seconds accessor. The intent was to provide a user-friendly set of accessors that give the ‘natural’ value for that unit, e.g. if you had a Timedelta('1 day, 10:11:12'), then .seconds would return 12. However, this is at odds with the definition of datetime.timedelta, which defines .seconds as 10 * 3600 + 11 * 60 + 12 == 36672.

So in v0.16.0, we are restoring the API to match that of datetime.timedelta. Further, the component values are still available through the .components accessor. This affects the .seconds and .microseconds accessors, and removes the .hours, .minutes, .milliseconds accessors. These changes affect TimedeltaIndex and the Series .dt accessor as well. (GH9185, GH9139)

Previous Behavior

In [2]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [3]: t.days
Out[3]: 1

In [4]: t.seconds
Out[4]: 12

In [5]: t.microseconds
Out[5]: 123

New Behavior

In [33]: t = pd.Timedelta('1 day, 10:11:12.100123')

In [34]: t.days
Out[34]: 1

In [35]: t.seconds
Out[35]: 36672

In [36]: t.microseconds
Out[36]: 100123

Using .components allows the full component access

In [37]: t.components
Out[37]: Components(days=1, hours=10, minutes=11, seconds=12, milliseconds=100, microseconds=123, nanoseconds=0)

In [38]: t.components.seconds
Out[38]: 12
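
The numbers above follow from the definition used by the standard library. The sketch below uses only datetime.timedelta, which the restored accessors are meant to match; the divmod decomposition illustrates how the component values relate to .seconds and .microseconds and is not a description of how pandas computes .components.

```python
from datetime import timedelta

t = timedelta(days=1, hours=10, minutes=11, seconds=12, microseconds=100123)

print(t.days)          # 1
print(t.seconds)       # 36672, i.e. 10 * 3600 + 11 * 60 + 12
print(t.microseconds)  # 100123

# Break the totals back down into per-unit components.
hours, rest = divmod(t.seconds, 3600)
minutes, seconds = divmod(rest, 60)
milliseconds, microseconds = divmod(t.microseconds, 1000)
print(hours, minutes, seconds, milliseconds, microseconds)  # 10 11 12 100 123
```
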
Indexing Changes¶

The behavior of a small sub-set of edge cases for using .loc has changed (GH8613). Furthermore we have improved the content of the error messages that are raised:

Categorical Changes¶

In prior versions, Categoricals that had an unspecified ordering (meaning no ordered keyword was passed) were defaulted as ordered Categoricals. Going forward, the ordered keyword in the Categorical constructor will default to False. Ordering must now be explicit.

Furthermore, previously you could change the ordered attribute of a Categorical by just setting the attribute, e.g. cat.ordered=True. This is now deprecated and you should use cat.as_ordered() or cat.as_unordered(). These will by default return a new object and not modify the existing object. (GH9347, GH9190)

Previous Behavior

In [3]: s = Series([0,1,2], dtype='category')

In [4]: s
Out[4]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [5]: s.cat.ordered
Out[5]: True

In [6]: s.cat.ordered = False

In [7]: s
Out[7]:
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]

New Behavior

In [45]: s = Series([0,1,2], dtype='category')

In [46]: s
Out[46]: 
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0, 1, 2]

In [47]: s.cat.ordered
Out[47]: False

In [48]: s = s.cat.as_ordered()

In [49]: s
Out[49]: 
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [50]: s.cat.ordered
Out[50]: True

# you can set in the constructor of the Categorical
In [51]: s = Series(Categorical([0,1,2],ordered=True))

In [52]: s
Out[52]: 
0    0
1    1
2    2
dtype: category
Categories (3, int64): [0 < 1 < 2]

In [53]: s.cat.ordered
Out[53]: True
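
The same points can be checked on a bare Categorical, without a Series. This is a minimal sketch; the variable names are illustrative only.

```python
import pandas as pd

# No ordered keyword: the Categorical is unordered by default.
cat = pd.Categorical([0, 1, 2])

# as_ordered() returns a new object and leaves the original untouched.
ordered = cat.as_ordered()
print(cat.ordered, ordered.ordered)

# Ordering can also be requested explicitly in the constructor,
# which makes order-dependent operations such as max() meaningful.
explicit = pd.Categorical([0, 1, 2], ordered=True)
print(explicit.max())
```
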

For ease of creation of series of categorical data, we have added the ability to pass keywords when calling .astype(). These are passed directly to the constructor.

In [54]: s = Series(["a","b","c","a"]).astype('category',ordered=True)

In [55]: s
Out[55]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]

In [56]: s = Series(["a","b","c","a"]).astype('category',categories=list('abcdef'),ordered=False)

In [57]: s
Out[57]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (6, object): [a, b, c, d, e, f]
Other API Changes¶ Deprecations¶ Removal of prior version deprecations/changes¶ Performance Improvements¶ Bug Fixes¶ v0.15.2 (December 12, 2014)¶

This is a minor release from 0.15.1 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. A small number of API changes were necessary to fix existing bugs. We recommend that all users upgrade to this version.

API changes¶ Enhancements¶

Categorical enhancements:

Other enhancements:

Performance¶ Bug Fixes¶ v0.15.1 (November 9, 2014)¶

This is a minor bug-fix release from 0.15.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

API changes¶ Enhancements¶ Bug Fixes¶ v0.15.0 (October 18, 2014)¶

This is a major release from 0.14.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

pandas >= 0.15.0 will no longer support compatibility with NumPy versions < 1.7.0. If you want to use the latest versions of pandas, please upgrade to NumPy >= 1.7.0 (GH7711)

Warning

In 0.15.0 Index has internally been refactored to no longer sub-class ndarray but instead subclass PandasObject, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (See the Internal Refactoring)

Warning

The refactorings in Categorical changed the two argument constructor from “codes/labels and levels” to “values and levels (now called ‘categories’)”. This can lead to subtle bugs. If you use Categorical directly, please audit your code before updating to this pandas version and change it to use the from_codes() constructor. See more on Categorical here

New features¶ Categoricals in Series/DataFrame¶

Categorical can now be included in Series and DataFrames and has gained new methods to manipulate its categories. Thanks to Jan Schulz for much of this API/implementation. (GH3943, GH5313, GH5314, GH7444, GH7839, GH7848, GH7864, GH7914, GH7768, GH8006, GH3678, GH8075, GH8076, GH8143, GH8453, GH8518).

For full docs, see the categorical introduction and the API documentation.

In [1]: df = DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

In [2]: df["grade"] = df["raw_grade"].astype("category")

In [3]: df["grade"]
Out[3]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

# Rename the categories
In [4]: df["grade"].cat.categories = ["very good", "good", "very bad"]

# Reorder the categories and simultaneously add the missing categories
In [5]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In [6]: df["grade"]
Out[6]: 
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

In [7]: df.sort_values("grade")
Out[7]: 
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

In [8]: df.groupby("grade").size()
Out[8]: 
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
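
The same workflow can also be written with method calls only. This is a hedged sketch, assuming pandas is importable as pd: rename_categories() is used here in place of assigning to .cat.categories, which is a substitution made for this example rather than the form shown in the session above.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

grade = df["raw_grade"].astype("category")

# Rename positionally: a -> very good, b -> good, e -> very bad.
grade = grade.cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the ones that do not occur in the data.
grade = grade.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df["grade"] = grade

# Sorting follows the category order, not the alphabetical order.
sorted_grades = df.sort_values("grade")["grade"].tolist()
```
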
TimedeltaIndex/Scalar¶

We introduce a new scalar type Timedelta, which is a subclass of datetime.timedelta and behaves in a similar manner, but allows compatibility with np.timedelta64 types as well as a host of custom representations, parsing options, and attributes. This type is very similar to how Timestamp works for datetimes: it is a convenient API wrapper (a "box") for the underlying type. See the docs. (GH3009, GH4533, GH8209, GH8187, GH8190, GH7869, GH7661, GH8345, GH8471)

Warning

The component fields of Timedelta scalars (and TimedeltaIndex) are not the same as the component fields on a datetime.timedelta object. For example, .seconds on a datetime.timedelta object returns the total number of seconds combined between hours, minutes and seconds. In contrast, the pandas Timedelta breaks out hours, minutes, microseconds and nanoseconds separately.

# Timedelta accessor
In [9]: tds = Timedelta('31 days 5 min 3 sec')

In [10]: tds.minutes
Out[10]: 5L

In [11]: tds.seconds
Out[11]: 3L

# datetime.timedelta accessor
# this is 5 minutes * 60 + 3 seconds
In [12]: tds.to_pytimedelta().seconds
Out[12]: 303

Note: this is no longer true starting from v0.16.0, where full compatibility with datetime.timedelta is introduced. See the 0.16.0 whatsnew entry
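
To illustrate the note above, here is a small sketch of the behavior from 0.16.0 onwards, assuming pandas is importable as pd. The keyword-argument constructor and the variable names are choices made for this example.

```python
import datetime
import pandas as pd

td = pd.Timedelta(days=31, minutes=5, seconds=3)

# From 0.16.0 on, .seconds matches datetime.timedelta: 5 * 60 + 3.
combined_seconds = td.seconds
stdlib_seconds = datetime.timedelta(days=31, minutes=5, seconds=3).seconds

# The broken-out fields are available on .components instead.
parts = td.components
minutes_field = parts.minutes
seconds_field = parts.seconds
```

Code written against the 0.15.0 semantics of .minutes and .seconds would need to read the separated fields from .components after that change.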

Warning

Prior to 0.15.0 pd.to_timedelta would return a Series for list-like/Series input, and a np.timedelta64 for scalar input. It will now return a TimedeltaIndex for list-like input, Series for Series input, and Timedelta for scalar input.

The arguments to pd.to_timedelta are now (arg,unit='ns',box=True,coerce=False); previously they were (arg,box=True,unit='ns'). The new order is more logical.
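
The return types described in the warning can be checked with a short sketch, assuming pandas is importable as pd. Only the arg parameter is used here, so the example does not depend on the box or coerce keywords.

```python
import pandas as pd

# Scalar input gives a Timedelta.
scalar = pd.to_timedelta("1 days")

# List-like input gives a TimedeltaIndex.
from_list = pd.to_timedelta(["1 days", "2 days"])

# Series input gives a Series of timedelta values.
from_series = pd.to_timedelta(pd.Series(["1 days", "2 days"]))

type_names = [type(scalar).__name__,
              type(from_list).__name__,
              type(from_series).__name__]
```
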

Construct a scalar

In [9]: Timedelta('1 days 06:05:01.00003')
Out[9]: Timedelta('1 days 06:05:01.000030')

In [10]: Timedelta('15.5us')
Out[10]: Timedelta('0 days 00:00:00.000015')

In [11]: Timedelta('1 hour 15.5us')
Out[11]: Timedelta('0 days 01:00:00.000015')

# negative Timedeltas have this string repr
# to be more consistent with datetime.timedelta conventions
In [12]: Timedelta('-1us')
Out[12]: Timedelta('-1 days +23:59:59.999999')

# a NaT
In [13]: Timedelta('nan')
Out[13]: NaT

Access fields for a Timedelta

In [14]: td = Timedelta('1 hour 3m 15.5us')

In [15]: td.seconds
Out[15]: 3780

In [16]: td.microseconds
Out[16]: 15

In [17]: td.nanoseconds
Out[17]: 500

Construct a TimedeltaIndex

In [18]: TimedeltaIndex(['1 days','1 days, 00:00:05',
   ....:                 np.timedelta64(2,'D'),timedelta(days=2,seconds=2)])
   ....: 
Out[18]: 
TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00',
                '2 days 00:00:02'],
               dtype='timedelta64[ns]', freq=None)

Constructing a TimedeltaIndex with a regular range

In [19]: timedelta_range('1 days',periods=5,freq='D')
Out[19]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')

In [20]: timedelta_range(start='1 days',end='2 days',freq='30T')
Out[20]: 
TimedeltaIndex(['1 days 00:00:00', '1 days 00:30:00', '1 days 01:00:00',
                '1 days 01:30:00', '1 days 02:00:00', '1 days 02:30:00',
                '1 days 03:00:00', '1 days 03:30:00', '1 days 04:00:00',
                '1 days 04:30:00', '1 days 05:00:00', '1 days 05:30:00',
                '1 days 06:00:00', '1 days 06:30:00', '1 days 07:00:00',
                '1 days 07:30:00', '1 days 08:00:00', '1 days 08:30:00',
                '1 days 09:00:00', '1 days 09:30:00', '1 days 10:00:00',
                '1 days 10:30:00', '1 days 11:00:00', '1 days 11:30:00',
                '1 days 12:00:00', '1 days 12:30:00', '1 days 13:00:00',
                '1 days 13:30:00', '1 days 14:00:00', '1 days 14:30:00',
                '1 days 15:00:00', '1 days 15:30:00', '1 days 16:00:00',
                '1 days 16:30:00', '1 days 17:00:00', '1 days 17:30:00',
                '1 days 18:00:00', '1 days 18:30:00', '1 days 19:00:00',
                '1 days 19:30:00', '1 days 20:00:00', '1 days 20:30:00',
                '1 days 21:00:00', '1 days 21:30:00', '1 days 22:00:00',
                '1 days 22:30:00', '1 days 23:00:00', '1 days 23:30:00',
                '2 days 00:00:00'],
               dtype='timedelta64[ns]', freq='30T')

You can now use a TimedeltaIndex as the index of a pandas object

In [21]: s = Series(np.arange(5),
   ....:            index=timedelta_range('1 days',periods=5,freq='s'))
   ....: 

In [22]: s
Out[22]: 
1 days 00:00:00    0
1 days 00:00:01    1
1 days 00:00:02    2
1 days 00:00:03    3
1 days 00:00:04    4
Freq: S, dtype: int64

You can select with partial string selections

In [23]: s['1 day 00:00:02']
Out[23]: 2

In [24]: s['1 day':'1 day 00:00:02']
Out[24]: 
1 days 00:00:00    0
1 days 00:00:01    1
1 days 00:00:02    2
Freq: S, dtype: int64

Finally, combining a TimedeltaIndex with a DatetimeIndex allows certain NaT-preserving operations:

In [25]: tdi = TimedeltaIndex(['1 days',pd.NaT,'2 days'])

In [26]: tdi.tolist()
Out[26]: [Timedelta('1 days 00:00:00'), NaT, Timedelta('2 days 00:00:00')]

In [27]: dti = date_range('20130101',periods=3)

In [28]: dti.tolist()
Out[28]: 
[Timestamp('2013-01-01 00:00:00', freq='D'),
 Timestamp('2013-01-02 00:00:00', freq='D'),
 Timestamp('2013-01-03 00:00:00', freq='D')]

In [29]: (dti + tdi).tolist()
Out[29]: [Timestamp('2013-01-02 00:00:00'), NaT, Timestamp('2013-01-05 00:00:00')]

In [30]: (dti - tdi).tolist()
Out[30]: [Timestamp('2012-12-31 00:00:00'), NaT, Timestamp('2013-01-01 00:00:00')]
Memory Usage¶

Implemented methods to find memory usage of a DataFrame. See the FAQ for more. (GH6852).

A new display option display.memory_usage (see Options and Settings) sets the default behavior of the memory_usage argument in the df.info() method. By default display.memory_usage is True.

In [31]: dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
   ....:           'complex128', 'object', 'bool']
   ....: 

In [32]: n = 5000

In [33]: data = dict([ (t, np.random.randint(100, size=n).astype(t))
   ....:                 for t in dtypes])
   ....: 

In [34]: df = DataFrame(data)

In [35]: df['categorical'] = df['object'].astype('category')

In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
bool               5000 non-null bool
complex128         5000 non-null complex128
datetime64[ns]     5000 non-null datetime64[ns]
float64            5000 non-null float64
int64              5000 non-null int64
object             5000 non-null object
timedelta64[ns]    5000 non-null timedelta64[ns]
categorical        5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 289.1+ KB
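
The effect of the display.memory_usage option can be sketched as follows, assuming pandas is importable as pd. The helper info_text is a hypothetical name introduced for this example; it simply captures the output of info() in a string.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})


def info_text(frame):
    """Return the text that frame.info() would print."""
    buf = io.StringIO()
    frame.info(buf=buf)
    return buf.getvalue()


# Default: the summary ends with a "memory usage" line.
with_usage = info_text(df)

# With the option switched off, that line is omitted.
with pd.option_context("display.memory_usage", False):
    without_usage = info_text(df)

has_line_by_default = "memory usage" in with_usage
has_line_when_disabled = "memory usage" in without_usage
```

Using option_context keeps the change local to the with block, so the global default is left untouched.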

Additionally, memory_usage() is available as a DataFrame method; it returns the memory usage of each column.

In [37]: df.memory_usage(index=True)
Out[37]: 
Index                 80
bool                5000
complex128         80000
datetime64[ns]     40000
float64            40000
int64              40000
object             40000
timedelta64[ns]    40000
categorical        10920
dtype: int64
.dt accessor¶

Series has gained an accessor to succinctly return datetime-like properties for the values of the Series, if it is a datetime/period-like Series (GH7207). This will return a Series, indexed like the existing Series. See the docs.

# datetime
In [38]: s = Series(date_range('20130101 09:10:12',periods=4))

In [39]: s
Out[39]: 
0   2013-01-01 09:10:12
1   2013-01-02 09:10:12
2   2013-01-03 09:10:12
3   2013-01-04 09:10:12
dtype: datetime64[ns]

In [40]: s.dt.hour
Out[40]: 
0    9
1    9
2    9
3    9
dtype: int64

In [41]: s.dt.second
Out[41]: 
0    12
1    12
2    12
3    12
dtype: int64

In [42]: s.dt.day
Out[42]: 
0    1
1    2
2    3
3    4
dtype: int64

In [43]: s.dt.freq
Out[43]: <Day>

This enables nice expressions like this:

In [44]: s[s.dt.day==2]
Out[44]: 
1   2013-01-02 09:10:12
dtype: datetime64[ns]

You can easily produce tz aware transformations:

In [45]: stz = s.dt.tz_localize('US/Eastern')

In [46]: stz
Out[46]: 
0   2013-01-01 09:10:12-05:00
1   2013-01-02 09:10:12-05:00
2   2013-01-03 09:10:12-05:00
3   2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]

In [47]: stz.dt.tz
Out[47]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>

You can also chain these types of operations:

In [48]: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[48]: 
0   2013-01-01 04:10:12-05:00
1   2013-01-02 04:10:12-05:00
2   2013-01-03 04:10:12-05:00
3   2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]

The .dt accessor works for period and timedelta dtypes.

# period
In [49]: s = Series(period_range('20130101',periods=4,freq='D'))

In [50]: s
Out[50]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
dtype: object

In [51]: s.dt.year
Out[51]: 
0    2013
1    2013
2    2013
3    2013
dtype: int64

In [52]: s.dt.day
Out[52]: 
0    1
1    2
2    3
3    4
dtype: int64

# timedelta
In [53]: s = Series(timedelta_range('1 day 00:00:05',periods=4,freq='s'))

In [54]: s
Out[54]: 
0   1 days 00:00:05
1   1 days 00:00:06
2   1 days 00:00:07
3   1 days 00:00:08
dtype: timedelta64[ns]

In [55]: s.dt.days
Out[55]: 
0    1
1    1
2    1
3    1
dtype: int64

In [56]: s.dt.seconds
Out[56]: 
0    5
1    6
2    7
3    8
dtype: int64

In [57]: s.dt.components
Out[57]: 
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0     1      0        0        5             0             0            0
1     1      0        0        6             0             0            0
2     1      0        0        7             0             0            0
3     1      0        0        8             0             0            0
Timezone handling improvements¶
Rolling/Expanding Moments improvements¶
Improvements in the sql io module¶
Backwards incompatible API changes¶
Breaking changes¶

API changes related to Categorical (see here for more details):

API changes related to the introduction of the Timedelta scalar (see above for more details):

For API changes related to the rolling and expanding functions, see detailed overview above.

Other notable API changes:

Internal Refactoring¶

In 0.15.0 Index has internally been refactored to no longer sub-class ndarray but instead subclass PandasObject, similarly to the rest of the pandas objects. This change allows very easy sub-classing and creation of new index types. This should be a transparent change with only very limited API implications (GH5080, GH7439, GH7796, GH8024, GH8367, GH7997, GH8522):

Deprecations¶
Removal of prior version deprecations/changes¶
Enhancements¶

Enhancements in the importing/exporting of Stata files:

Enhancements in the plotting functions:

Other:

Performance¶
Bug Fixes¶

v0.14.1 (July 11, 2014)¶

This is a minor release from 0.14.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

API changes¶
Enhancements¶
Performance¶
Experimental¶
Bug Fixes¶

v0.14.0 (May 31, 2014)¶

This is a major release from 0.13.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Warning

In 0.14.0 all NDFrame based containers have undergone significant internal refactoring. Before that each block of homogeneous data had its own labels and extra care was necessary to keep those in sync with the parent container’s labels. This should not have any visible user/API behavior changes (GH6745)

API changes¶
Display Changes¶
Text Parsing API Changes¶

read_csv()/read_table() will now be noisier with respect to invalid options, rather than silently falling back to the PythonParser.

Groupby API Changes¶

More consistent behaviour for some groupby methods:

SQL¶

The SQL reading and writing functions now support more database flavors through SQLAlchemy (GH2717, GH4163, GH5950, GH6292). All databases supported by SQLAlchemy can be used, such as PostgreSQL, MySQL, Oracle, Microsoft SQL server (see documentation of SQLAlchemy on included dialects).

The functionality of providing DBAPI connection objects will only be supported for sqlite3 in the future. The 'mysql' flavor is deprecated.

The new functions read_sql_query() and read_sql_table() are introduced. The function read_sql() is kept as a convenience wrapper around the other two and will delegate to the specific function depending on the provided input (database table name or sql query).

In practice, you have to provide a SQLAlchemy engine to the sql functions. To connect with SQLAlchemy you use the create_engine() function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to. For an in-memory sqlite database:

In [40]: from sqlalchemy import create_engine

# Create your connection.
In [41]: engine = create_engine('sqlite:///:memory:')

This engine can then be used to write or read data to/from this database:

In [42]: df = pd.DataFrame({'A': [1,2,3], 'B': ['a', 'b', 'c']})

In [43]: df.to_sql('db_table', engine, index=False)

You can read data from a database by specifying the table name:

In [44]: pd.read_sql_table('db_table', engine)
Out[44]: 
   A  B
0  1  a
1  2  b
2  3  c

or by specifying a sql query:

In [45]: pd.read_sql_query('SELECT * FROM db_table', engine)
Out[45]: 
   A  B
0  1  a
1  2  b
2  3  c
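
Because DBAPI connections remain supported for sqlite3, the same round trip can be sketched with only the standard library's sqlite3 module and no engine. This is an illustrative sketch, assuming pandas is importable as pd; the table name and variable names are made up for the example.

```python
import sqlite3
import pandas as pd

# A plain DBAPI connection to an in-memory sqlite database.
conn = sqlite3.connect(":memory:")

df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})
df.to_sql("db_table", conn, index=False)

# Read back with an explicit query.
via_query = pd.read_sql_query("SELECT * FROM db_table", conn)

# read_sql() is the convenience wrapper; given a query it delegates
# to read_sql_query().
via_wrapper = pd.read_sql("SELECT A FROM db_table WHERE A > 1", conn)

conn.close()
```

Reading by table name with read_sql_table() is the part that needs a SQLAlchemy engine, which is why this sketch sticks to queries.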

Some other enhancements to the sql functions include:

Warning

Some of the existing functions or function aliases have been deprecated and will be removed in future versions. This includes: tquery, uquery, read_frame, frame_query, write_frame.

Warning

The support for the ‘mysql’ flavor when using DBAPI connection objects has been deprecated. MySQL will continue to be supported through SQLAlchemy engines (GH6900).

MultiIndexing Using Slicers¶

In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels, they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

See the docs. See also issues (GH6134, GH4036, GH3057, GH2598, GH5641, GH7106).

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.

You should do this:

df.loc[(slice('A1','A3'),.....),:]

rather than this:

df.loc[(slice('A1','A3'),.....)]

Warning

You will need to make sure that the selection axes are fully lexsorted!
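
A small sketch of what sorting first looks like in practice, assuming pandas and NumPy are importable as pd and np. The labels and the column name x are illustrative; the index is deliberately built out of order so that sort_index() has something to do.

```python
import numpy as np
import pandas as pd

# Built out of order on purpose: A1 before A0, B1 before B0.
index = pd.MultiIndex.from_product([["A1", "A0"], ["B1", "B0"]])
df = pd.DataFrame({"x": np.arange(4)}, index=index)

# Sort the row index before slicing with multiple indexers.
sorted_df = df.sort_index()

# Both axes are specified: a tuple of row indexers, then ':' for columns.
result = sorted_df.loc[(slice("A0", "A1"), ["B0"]), :]

selected_labels = list(result.index)
selected_values = result["x"].tolist()
```
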

In [46]: def mklbl(prefix,n):
   ....:     return ["%s%s" % (prefix,i)  for i in range(n)]
   ....: 

In [47]: index = MultiIndex.from_product([mklbl('A',4),
   ....:                                  mklbl('B',2),
   ....:                                  mklbl('C',4),
   ....:                                  mklbl('D',2)])
   ....: 

In [48]: columns = MultiIndex.from_tuples([('a','foo'),('a','bar'),
   ....:                                   ('b','foo'),('b','bah')],
   ....:                                    names=['lvl0', 'lvl1'])
   ....: 

In [49]: df = DataFrame(np.arange(len(index)*len(columns)).reshape((len(index),len(columns))),
   ....:                index=index,
   ....:                columns=columns).sort_index().sort_index(axis=1)
   ....: 

In [50]: df
Out[50]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0   25   24   27   26
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  233  232  235  234
         D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic multi-index slicing using slices, lists, and labels.

In [51]: df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
Out[51]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
         D1  109  108  111  110
      C3 D0  121  120  123  122
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use a pd.IndexSlice to shortcut the creation of these slices

In [52]: idx = pd.IndexSlice

In [53]: df.loc[idx[:,:,['C1','C3']],idx[:,'foo']]
Out[53]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [54]: df.loc['A1',(slice(None),'foo')]
Out[54]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
      D1   84   86
   C3 D0   88   90
...       ...  ...
B1 C0 D1  100  102
   C1 D0  104  106
      D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [55]: df.loc[idx[:,:,['C1','C3']],idx[:,'foo']]
Out[55]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
         D1   44   46
      C3 D0   56   58
...          ...  ...
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [56]: mask = df[('a','foo')]>200

In [57]: df.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [58]: df.loc(axis=0)[:,:,['C1','C3']]
Out[58]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
         D1   45   44   47   46
      C3 D0   57   56   59   58
...          ...  ...  ...  ...
A3 B0 C1 D1  205  204  207  206
      C3 D0  217  216  219  218
         D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore you can set the values using these methods

In [59]: df2 = df.copy()

In [60]: df2.loc(axis=0)[:,:,['C1','C3']] = -10

In [61]: df2
Out[61]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
         D1   21   20   23   22
      C3 D0  -10  -10  -10  -10
...          ...  ...  ...  ...
A3 B1 C0 D1  229  228  231  230
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can also use an alignable object as the right-hand side.

In [62]: df2 = df.copy()

In [63]: df2.loc[idx[:,:,['C1','C3']],:] = df2*1000

In [64]: df2
Out[64]: 
lvl0              a               b        
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
         D1      21      20      23      22
      C3 D0   25000   24000   27000   26000
...             ...     ...     ...     ...
A3 B1 C0 D1     229     228     231     230
      C1 D0  233000  232000  235000  234000
         D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]
Plotting¶
Prior Version Deprecations/Changes¶

There are prior version deprecations that are taking effect as of 0.14.0.

Deprecations¶
Known Issues¶
Enhancements¶
Performance¶
Experimental¶

There are no experimental changes in 0.14.0

Bug Fixes¶

v0.13.1 (February 3, 2014)¶

This is a minor release from 0.13.0 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

Warning

0.13.1 fixes a bug that was caused by a combination of having numpy < 1.8 and doing chained assignment on a string-like array. Please review the docs: chained indexing can have unexpected results and should generally be avoided.

This would previously segfault:

In [1]: df = DataFrame(dict(A = np.array(['foo','bar','bah','foo','bar'])))

In [2]: df['A'].iloc[0] = np.nan

In [3]: df
Out[3]: 
     A
0  NaN
1  bar
2  bah
3  foo
4  bar

The recommended way to do this type of assignment is:

In [4]: df = DataFrame(dict(A = np.array(['foo','bar','bah','foo','bar'])))

In [5]: df.loc[0,'A'] = np.nan

In [6]: df
Out[6]: 
     A
0  NaN
1  bar
2  bah
3  foo
4  bar
Output Formatting Enhancements¶
API changes¶
Prior Version Deprecations/Changes¶

There are no announced changes in 0.13 or prior that are taking effect as of 0.13.1

Deprecations¶

There are no deprecations of prior behavior in 0.13.1

Enhancements¶
Performance¶

Performance improvements for 0.13.1

Experimental¶

There are no experimental changes in 0.13.1

Bug Fixes¶

See V0.13.1 Bug Fixes for an extensive list of bugs that have been fixed in 0.13.1.

See the full release notes or issue tracker on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.

v0.13.0 (January 3, 2014)¶

This is a major release from 0.12.0 and includes a number of API changes, several new features and enhancements along with a large number of bug fixes.

Highlights include:

Several experimental features are added, including:

There are several new or updated docs sections including:

Warning

In 0.13.0 Series has internally been refactored to no longer sub-class ndarray but instead subclass NDFrame, similar to the rest of the pandas containers. This should be a transparent change with only very limited API implications. See Internal Refactoring

API changes¶
Prior Version Deprecations/Changes¶

These were announced changes in 0.12 or prior that are taking effect as of 0.13.0

Deprecations¶

Deprecated in 0.13.0

Indexing API Changes¶

Prior to 0.13, it was impossible to use a label indexer (.loc/.ix) to set a value that was not contained in the index of a particular axis (GH2578). See the docs.

In the Series case this is effectively an appending operation

In [10]: s = Series([1,2,3])

In [11]: s
Out[11]: 
0    1
1    2
2    3
Length: 3, dtype: int64

In [12]: s[5] = 5.

In [13]: s
Out[13]: 
0    1.0
1    2.0
2    3.0
5    5.0
Length: 4, dtype: float64

In [14]: dfi = DataFrame(np.arange(6).reshape(3,2),
   ....:                 columns=['A','B'])
   ....: 

In [15]: dfi
Out[15]: 
   A  B
0  0  1
1  2  3
2  4  5

[3 rows x 2 columns]

This would previously raise a KeyError

In [16]: dfi.loc[:,'C'] = dfi.loc[:,'A']

In [17]: dfi
Out[17]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4

[3 rows x 3 columns]

This is like an append operation.

In [18]: dfi.loc[3] = 5

In [19]: dfi
Out[19]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

[4 rows x 3 columns]

A Panel setting operation on an arbitrary axis aligns the input to the Panel

In [20]: p = pd.Panel(np.arange(16).reshape(2,4,2),
   ....:             items=['Item1','Item2'],
   ....:             major_axis=pd.date_range('2001/1/12',periods=4),
   ....:             minor_axis=['A','B'],dtype='float64')
   ....: 

In [21]: p
Out[21]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00
Minor_axis axis: A to B

In [22]: p.loc[:,:,'C'] = Series([30,32],index=p.items)

In [23]: p
Out[23]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2001-01-12 00:00:00 to 2001-01-15 00:00:00
Minor_axis axis: A to C

In [24]: p.loc[:,:,'C']
Out[24]: 
            Item1  Item2
2001-01-12   30.0   32.0
2001-01-13   30.0   32.0
2001-01-14   30.0   32.0
2001-01-15   30.0   32.0

[4 rows x 2 columns]
Float64Index API Change¶
HDFStore API Changes¶
DataFrame repr Changes¶

The HTML and plain text representations of DataFrame now show a truncated view of the table once it exceeds a certain size, rather than switching to the short info view (GH4886, GH5550). This makes the representation more consistent as small DataFrames get larger.

To get the info view, call DataFrame.info(). If you prefer the info view as the repr for large DataFrames, you can set this by running set_option('display.large_repr', 'info').
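
The two repr modes can be compared with a short sketch, assuming pandas and NumPy are importable as pd and np. The frame size and the max_rows value are arbitrary choices that make the frame count as "large"; option_context is used so the settings do not leak outside the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100, 3)), columns=list("abc"))

# Truncated view: head and tail of the table plus a dimensions footer.
with pd.option_context("display.max_rows", 10,
                       "display.large_repr", "truncate"):
    truncated_repr = repr(df)

# Info view: the same summary that DataFrame.info() prints.
with pd.option_context("display.max_rows", 10,
                       "display.large_repr", "info"):
    info_repr = repr(df)

shows_dimensions = "[100 rows x 3 columns]" in truncated_repr
shows_info_summary = "100 entries" in info_repr
```
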

Enhancements¶
Experimental¶
Internal Refactoring¶

In 0.13.0 there is a major refactor primarily to subclass Series from NDFrame, which is the base class currently for DataFrame and Panel, to unify methods and behaviors. Series formerly subclassed directly from ndarray. (GH4080, GH3862, GH816)

Warning

There are two potential incompatibilities from < 0.13.0

Bug Fixes¶

See V0.13.0 Bug Fixes for an extensive list of bugs that have been fixed in 0.13.0.

See the full release notes or issue tracker on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.

v0.12.0 (July 24, 2013)¶

This is a major release from 0.11.0 and includes several new features and enhancements along with a large number of bug fixes.

Highlights include a consistent I/O API naming scheme, routines to read html, write multi-indexes to csv files, read & write STATA data files, read & write JSON format files, Python 3 support for HDFStore, filtering of groupby expressions via filter, and a revamped replace routine that accepts regular expressions.

API changes¶
I/O Enhancements¶
Other Enhancements¶
Experimental Features¶
Bug Fixes¶

See the full release notes or issue tracker on GitHub for a complete list.

v0.11.0 (April 22, 2013)¶

This is a major release from 0.10.1 and includes many new features and enhancements along with a large number of bug fixes. The methods of Selecting Data have had quite a number of additions, and Dtype support is now full-fledged. There are also a number of important API changes that long-time pandas users should pay close attention to.

There is a new section in the documentation, 10 Minutes to Pandas, primarily geared to new users.

There is a new section in the documentation, Cookbook, a collection of useful recipes in pandas (contributions are welcome!).

There are several libraries that are now Recommended Dependencies

Selection Choices¶

Starting in 0.11.0, object selection has had a number of user-requested additions in order to support more explicit location based indexing. Pandas now supports three types of multi-axis indexing.
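A minimal sketch of the label-based and position-based indexers, .loc and .iloc, is given below. The third indexer introduced at the time, .ix, is left out because it is deprecated as of 0.20. The frame and its labels are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
                  index=["x", "y", "z"])

# Label-based: both endpoints of a label slice are included.
by_label = df.loc["x":"y", "A"]

# Position-based: the usual half-open Python slice semantics.
by_position = df.iloc[0:2, 0]

print(by_label)
print(by_position)
```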

Selection Deprecations¶

Starting in version 0.11.0, these methods may be deprecated in future versions.

See the section Selection by Position for substitutes.

Dtypes¶

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.

In [1]: df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')

In [2]: df1
Out[2]: 
          A
0  1.392665
1 -0.123497
2 -0.402761
3 -0.246604
4 -0.288433
5 -0.763434
6  2.069526
7 -1.203569

[8 rows x 1 columns]

In [3]: df1.dtypes
Out[3]: 
A    float32
Length: 1, dtype: object

In [4]: df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
   ...:                       B = Series(randn(8)),
   ...:                       C = Series(randn(8),dtype='uint8') ))
   ...: 

In [5]: df2
Out[5]: 
          A         B    C
0  0.591797 -0.038605    0
1  0.841309 -0.460478    1
2 -0.500977 -0.310458    0
3 -0.816406  0.866493  254
4 -0.207031  0.245972    0
5 -0.664062  0.319442    1
6  0.580566  1.378512    1
7 -0.965820  0.292502  255

[8 rows x 3 columns]

In [6]: df2.dtypes
Out[6]: 
A    float16
B    float64
C      uint8
Length: 3, dtype: object

# here you get some upcasting
In [7]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [8]: df3
Out[8]: 
          A         B      C
0  1.984462 -0.038605    0.0
1  0.717812 -0.460478    1.0
2 -0.903737 -0.310458    0.0
3 -1.063011  0.866493  254.0
4 -0.495465  0.245972    0.0
5 -1.427497  0.319442    1.0
6  2.650092  1.378512    1.0
7 -2.169390  0.292502  255.0

[8 rows x 3 columns]

In [9]: df3.dtypes
Out[9]: 
A    float32
B    float64
C    float64
Length: 3, dtype: object

Dtype Conversion¶

This is lowest-common-denominator upcasting, meaning you get the dtype which can accommodate all of the types.

In [10]: df3.values.dtype
Out[10]: dtype('float64')
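The same idea can be shown on a small, self-contained frame. This is only a sketch: the column values are arbitrary, and the dtypes are chosen to mirror the ones used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": np.array([1.5, 2.5], dtype="float32"),
    "B": np.array([0.1, 0.2], dtype="float64"),
    "C": np.array([1, 255], dtype="uint8"),
})

# Each column keeps the dtype it was created with.
per_column = {name: str(dtype) for name, dtype in df.dtypes.items()}

# A single ndarray needs one dtype that can hold every column.
combined = df.values.dtype

print(per_column, combined)
```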

Conversion

In [11]: df3.astype('float32').dtypes
Out[11]: 
A    float32
B    float32
C    float32
Length: 3, dtype: object

Mixed Conversion

In [12]: df3['D'] = '1.'

In [13]: df3['E'] = '1'

In [14]: df3.convert_objects(convert_numeric=True).dtypes
Out[14]: 
A    float32
B    float64
C    float64
D    float64
E      int64
Length: 5, dtype: object

# same, but specific dtype conversion
In [15]: df3['D'] = df3['D'].astype('float16')

In [16]: df3['E'] = df3['E'].astype('int32')

In [17]: df3.dtypes
Out[17]: 
A    float32
B    float64
C    float64
D    float16
E      int32
Length: 5, dtype: object

Forcing Date coercion (and setting NaT when not datelike)

In [18]: from datetime import datetime

In [19]: s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
   ....:             Timestamp('20010104'), '20010105'],dtype='O')
   ....: 

In [20]: s.convert_objects(convert_dates='coerce')
Out[20]: 
0   2001-01-01
1          NaT
2          NaT
3          NaT
4   2001-01-04
5   2001-01-05
Length: 6, dtype: datetime64[ns]
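convert_objects was later deprecated in favour of type-specific functions. The sketch below shows comparable coercion with pd.to_numeric and pd.to_datetime, assuming a pandas version that provides pd.to_numeric (0.17 or later). The input values are simplified relative to the session above and are illustrative only.

```python
import pandas as pd

raw_numbers = pd.Series(["1.", "2", "not a number"])
raw_dates = pd.Series(["2001-01-04", "foo"])

# Unparseable entries become NaN / NaT instead of raising.
numbers = pd.to_numeric(raw_numbers, errors="coerce")
dates = pd.to_datetime(raw_dates, errors="coerce")

print(numbers)
print(dates)
```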
Dtype Gotchas¶

Platform Gotchas

Starting in 0.11.0, construction of DataFrame/Series will use default dtypes of int64 and float64, regardless of platform. This is not an apparent change from earlier versions of pandas. If you specify dtypes, however, they WILL be respected (GH2837).

The following will all result in int64 dtypes

In [21]: DataFrame([1,2],columns=['a']).dtypes
Out[21]: 
a    int64
Length: 1, dtype: object

In [22]: DataFrame({'a' : [1,2] }).dtypes
Out[22]: 
a    int64
Length: 1, dtype: object

In [23]: DataFrame({'a' : 1 }, index=range(2)).dtypes
Out[23]: 
a    int64
Length: 1, dtype: object

Keep in mind that DataFrame(np.array([1,2])) WILL result in int32 on 32-bit platforms!
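The three cases can be compared side by side in a short sketch. The variable names are illustrative; the last frame simply inherits whatever integer dtype NumPy picks on the current platform.

```python
import numpy as np
import pandas as pd

# Python integers: pandas picks int64 regardless of platform.
from_list = pd.DataFrame({"a": [1, 2]})

# An explicitly requested dtype is respected.
explicit = pd.DataFrame({"a": [1, 2]}, dtype="int32")

# An existing ndarray keeps the dtype NumPy gave it.
from_array = pd.DataFrame(np.array([1, 2]))

print(from_list.dtypes, explicit.dtypes, from_array.dtypes, sep="\n")
```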

Upcasting Gotchas

Performing indexing operations on integer type data can easily upcast the data. The dtype of the input data will be preserved in cases where nans are not introduced.

In [24]: dfi = df3.astype('int32')

In [25]: dfi['D'] = dfi['D'].astype('int64')

In [26]: dfi
Out[26]: 
   A  B    C  D  E
0  1  0    0  1  1
1  0  0    1  1  1
2  0  0    0  1  1
3 -1  0  254  1  1
4  0  0    0  1  1
5 -1  0    1  1  1
6  2  1    1  1  1
7 -2  0  255  1  1

[8 rows x 5 columns]

In [27]: dfi.dtypes
Out[27]: 
A    int32
B    int32
C    int32
D    int64
E    int32
Length: 5, dtype: object

In [28]: casted = dfi[dfi>0]

In [29]: casted
Out[29]: 
     A    B      C  D  E
0  1.0  NaN    NaN  1  1
1  NaN  NaN    1.0  1  1
2  NaN  NaN    NaN  1  1
3  NaN  NaN  254.0  1  1
4  NaN  NaN    NaN  1  1
5  NaN  NaN    1.0  1  1
6  2.0  1.0    1.0  1  1
7  NaN  NaN  255.0  1  1

[8 rows x 5 columns]

In [30]: casted.dtypes
Out[30]: 
A    float64
B    float64
C    float64
D      int64
E      int32
Length: 5, dtype: object
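A smaller version of the same gotcha, with made-up column names, makes the rule easier to see: only the column that actually receives a NaN is upcast.

```python
import pandas as pd

ints = pd.DataFrame({"mixed_sign": [1, -1, 2], "all_positive": [1, 2, 3]})

# Boolean masking replaces the non-matching cells with NaN.
masked = ints[ints > 0]

print(masked)
print(masked.dtypes)
```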

Float dtypes, by contrast, are unchanged.

In [31]: df4 = df3.copy()

In [32]: df4['A'] = df4['A'].astype('float32')

In [33]: df4.dtypes
Out[33]: 
A    float32
B    float64
C    float64
D    float16
E      int32
Length: 5, dtype: object

In [34]: casted = df4[df4>0]

In [35]: casted
Out[35]: 
          A         B      C    D  E
0  1.984462       NaN    NaN  1.0  1
1  0.717812       NaN    1.0  1.0  1
2       NaN       NaN    NaN  1.0  1
3       NaN  0.866493  254.0  1.0  1
4       NaN  0.245972    NaN  1.0  1
5       NaN  0.319442    1.0  1.0  1
6  2.650092  1.378512    1.0  1.0  1
7       NaN  0.292502  255.0  1.0  1

[8 rows x 5 columns]

In [36]: casted.dtypes
Out[36]: 
A    float32
B    float64
C    float64
D    float16
E      int32
Length: 5, dtype: object

Datetimes Conversion¶

Datetime64[ns] columns in a DataFrame (or a Series) allow the use of np.nan to indicate a nan value, in addition to the traditional NaT, or not-a-time. This allows convenient nan setting in a generic way. Furthermore datetime64[ns] columns are created by default, when passed datetimelike objects (this change was introduced in 0.10.1) (GH2809, GH2810)

In [37]: df = DataFrame(randn(6,2),date_range('20010102',periods=6),columns=['A','B'])

In [38]: df['timestamp'] = Timestamp('20010103')

In [39]: df
Out[39]: 
                   A         B  timestamp
2001-01-02  1.023958  0.660103 2001-01-03
2001-01-03  1.236475 -2.170629 2001-01-03
2001-01-04 -0.270630 -1.685677 2001-01-03
2001-01-05 -0.440747 -0.115070 2001-01-03
2001-01-06 -0.632102 -0.585977 2001-01-03
2001-01-07 -1.444787 -0.201135 2001-01-03

[6 rows x 3 columns]

# datetime64[ns] out of the box
In [40]: df.get_dtype_counts()
Out[40]: 
datetime64[ns]    1
float64           2
Length: 2, dtype: int64

# use the traditional nan, which is mapped to NaT internally
In [41]: df.loc[df.index[2:4], ['A','timestamp']] = np.nan

In [42]: df
Out[42]: 
                   A         B  timestamp
2001-01-02  1.023958  0.660103 2001-01-03
2001-01-03  1.236475 -2.170629 2001-01-03
2001-01-04       NaN -1.685677        NaT
2001-01-05       NaN -0.115070        NaT
2001-01-06 -0.632102 -0.585977 2001-01-03
2001-01-07 -1.444787 -0.201135 2001-01-03

[6 rows x 3 columns]
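The same mapping can be checked on a standalone Series. This is a minimal sketch with arbitrary dates; it only tests that the assigned np.nan is stored as NaT and that the column stays datetime-typed.

```python
import numpy as np
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2001-01-02", "2001-01-03", "2001-01-04"]))

# Assign the generic missing-value marker to a datetime column.
stamps.iloc[1] = np.nan

print(stamps)
```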

Astype conversion on datetime64[ns] to object implicitly converts NaT to np.nan

In [43]: import datetime

In [44]: s = Series([datetime.datetime(2001, 1, 2, 0, 0) for i in range(3)])

In [45]: s.dtype
Out[45]: dtype('<M8[ns]')

In [46]: s[1] = np.nan

In [47]: s
Out[47]: 
0   2001-01-02
1          NaT
2   2001-01-02
Length: 3, dtype: datetime64[ns]

In [48]: s.dtype
Out[48]: dtype('<M8[ns]')

In [49]: s = s.astype('O')

In [50]: s
Out[50]: 
0    2001-01-02 00:00:00
1                    NaT
2    2001-01-02 00:00:00
Length: 3, dtype: object

In [51]: s.dtype
Out[51]: dtype('O')

API changes¶
Enhancements¶

See the full release notes or issue tracker on GitHub for a complete list.

v0.10.1 (January 22, 2013)¶

This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, there is substantial new HDFStore functionality contributed by Jeff Reback.

An undesired API breakage with functions taking the inplace option has been reverted and deprecation warnings added.

API changes¶
New features¶
HDFStore¶

You may need to upgrade your existing data files. Please visit the compatibility section in the main docs.

You can designate (and index) certain columns that you want to be able to query on a table by passing a list to data_columns:

In [1]: store = HDFStore('store.h5')

In [2]: df = DataFrame(randn(8, 3), index=date_range('1/1/2000', periods=8),
   ...:            columns=['A', 'B', 'C'])
   ...: 

In [3]: df['string'] = 'foo'

In [4]: df.loc[df.index[4:6], 'string'] = np.nan

In [5]: df.loc[df.index[7:9], 'string'] = 'bar'

In [6]: df['string2'] = 'cool'

In [7]: df
Out[7]: 
                   A         B         C string string2
2000-01-01  1.885136 -0.183873  2.550850    foo    cool
2000-01-02  0.180759 -1.117089  0.061462    foo    cool
2000-01-03 -0.294467 -0.591411 -0.876691    foo    cool
2000-01-04  3.127110  1.451130  0.045152    foo    cool
2000-01-05 -0.242846  1.195819  1.533294    NaN    cool
2000-01-06  0.820521 -0.281201  1.651561    NaN    cool
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool
2000-01-08 -2.290958 -1.601262 -0.256718    bar    cool

[8 rows x 5 columns]

# on-disk operations
In [8]: store.append('df', df, data_columns = ['B','C','string','string2'])

In [9]: store.select('df', "B>0 and string=='foo'")
Out[9]: 
                   A         B         C string string2
2000-01-04  3.127110  1.451130  0.045152    foo    cool
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool

[2 rows x 5 columns]

# this is in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == 'foo')]
Out[10]: 
                   A         B         C string string2
2000-01-04  3.127110  1.451130  0.045152    foo    cool
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool

[2 rows x 5 columns]

Retrieving unique values in an indexable or data column.

# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique('df','index')
store.unique('df','string')

You can now store datetime64 in data columns

In [11]: df_mixed               = df.copy()

In [12]: df_mixed['datetime64'] = Timestamp('20010102')

In [13]: df_mixed.loc[df_mixed.index[3:4], ['A','B']] = np.nan

In [14]: store.append('df_mixed', df_mixed)

In [15]: df_mixed1 = store.select('df_mixed')

In [16]: df_mixed1
Out[16]: 
                   A         B         C string string2 datetime64
2000-01-01  1.885136 -0.183873  2.550850    foo    cool 2001-01-02
2000-01-02  0.180759 -1.117089  0.061462    foo    cool 2001-01-02
2000-01-03 -0.294467 -0.591411 -0.876691    foo    cool 2001-01-02
2000-01-04       NaN       NaN  0.045152    foo    cool 2001-01-02
2000-01-05 -0.242846  1.195819  1.533294    NaN    cool 2001-01-02
2000-01-06  0.820521 -0.281201  1.651561    NaN    cool 2001-01-02
2000-01-07 -0.034086  0.252394 -0.498772    foo    cool 2001-01-02
2000-01-08 -2.290958 -1.601262 -0.256718    bar    cool 2001-01-02

[8 rows x 6 columns]

In [17]: df_mixed1.get_dtype_counts()
Out[17]: 
datetime64[ns]    1
float64           3
object            2
Length: 3, dtype: int64

You can pass the columns keyword to select to filter the list of returned columns; this is equivalent to passing a Term('columns', list_of_columns_to_filter).

In [18]: store.select('df',columns = ['A','B'])
Out[18]: 
                   A         B
2000-01-01  1.885136 -0.183873
2000-01-02  0.180759 -1.117089
2000-01-03 -0.294467 -0.591411
2000-01-04  3.127110  1.451130
2000-01-05 -0.242846  1.195819
2000-01-06  0.820521 -0.281201
2000-01-07 -0.034086  0.252394
2000-01-08 -2.290958 -1.601262

[8 rows x 2 columns]

HDFStore now serializes multi-index dataframes when appending tables.

In [19]: index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
   ....:                            ['one', 'two', 'three']],
   ....:                    labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
   ....:                            [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   ....:                    names=['foo', 'bar'])
   ....: 

In [20]: df = DataFrame(np.random.randn(10, 3), index=index,
   ....:                columns=['A', 'B', 'C'])
   ....: 

In [21]: df
Out[21]: 
                  A         B         C
foo bar                                
foo one    0.239369  0.174122 -1.131794
    two   -1.948006  0.980347 -0.674429
    three -0.361633 -0.761218  1.768215
bar one    0.152288 -0.862613 -0.210968
    two   -0.859278  1.498195  0.462413
baz two   -0.647604  1.511487 -0.727189
    three -0.342928 -0.007364  1.427674
qux one    0.104020  2.052171 -1.230963
    two   -0.019240 -1.713238  0.838912
    three -0.637855  0.215109 -1.515362

[10 rows x 3 columns]

In [22]: store.append('mi',df)

In [23]: store.select('mi')
Out[23]: 
                  A         B         C
foo bar                                
foo one    0.239369  0.174122 -1.131794
    two   -1.948006  0.980347 -0.674429
    three -0.361633 -0.761218  1.768215
bar one    0.152288 -0.862613 -0.210968
    two   -0.859278  1.498195  0.462413
baz two   -0.647604  1.511487 -0.727189
    three -0.342928 -0.007364  1.427674
qux one    0.104020  2.052171 -1.230963
    two   -0.019240 -1.713238  0.838912
    three -0.637855  0.215109 -1.515362

[10 rows x 3 columns]

# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]: 
                A         B         C
foo bar                              
bar one  0.152288 -0.862613 -0.210968
    two -0.859278  1.498195  0.462413

[2 rows x 3 columns]

Multi-table creation via append_to_multiple and selection via select_as_multiple can create/select from multiple tables and return a combined result, by using where on a selector table.

In [25]: df_mt = DataFrame(randn(8, 6), index=date_range('1/1/2000', periods=8),
   ....:                                columns=['A', 'B', 'C', 'D', 'E', 'F'])
   ....: 

In [26]: df_mt['foo'] = 'bar'

# you can also create the tables individually
In [27]: store.append_to_multiple({ 'df1_mt' : ['A','B'], 'df2_mt' : None }, df_mt, selector = 'df1_mt')

In [28]: store
Out[28]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df                  frame_table  (typ->appendable,nrows->8,ncols->5,indexers->[index],dc->[B,C,string,string2])
/df1_mt              frame_table  (typ->appendable,nrows->8,ncols->2,indexers->[index],dc->[A,B])               
/df2_mt              frame_table  (typ->appendable,nrows->8,ncols->5,indexers->[index])                         
/df_mixed            frame_table  (typ->appendable,nrows->8,ncols->6,indexers->[index])                         
/mi                  frame_table  (typ->appendable_multi,nrows->10,ncols->5,indexers->[index],dc->[bar,foo])    

# individual tables were created
In [29]: store.select('df1_mt')
Out[29]: 
                   A         B
2000-01-01  1.586924 -0.447974
2000-01-02 -0.102206  0.870302
2000-01-03  1.249874  1.458210
2000-01-04 -0.616293  0.150468
2000-01-05 -0.431163  0.016640
2000-01-06  0.800353 -0.451572
2000-01-07  1.239198  0.185437
2000-01-08 -0.040863  0.290110

[8 rows x 2 columns]

In [30]: store.select('df2_mt')
Out[30]: 
                   C         D         E         F  foo
2000-01-01 -1.573998  0.630925 -0.071659 -1.277640  bar
2000-01-02  1.275280 -1.199212  1.060780  1.673018  bar
2000-01-03 -0.710542  0.825392  1.557329  1.993441  bar
2000-01-04  0.132104  0.580923 -0.128750  1.445964  bar
2000-01-05  0.904578 -1.645852 -0.688741  0.228006  bar
2000-01-06  0.831767  0.228760  0.932498 -2.200069  bar
2000-01-07 -0.540770 -0.370038  1.298390  1.662964  bar
2000-01-08 -0.096145  1.717830 -0.462446 -0.112019  bar

[8 rows x 5 columns]

# as a multiple
In [31]: store.select_as_multiple(['df1_mt','df2_mt'], where = [ 'A>0','B>0' ], selector = 'df1_mt')
Out[31]: 
                   A         B         C         D         E         F  foo
2000-01-03  1.249874  1.458210 -0.710542  0.825392  1.557329  1.993441  bar
2000-01-07  1.239198  0.185437 -0.540770 -0.370038  1.298390  1.662964  bar

[2 rows x 7 columns]

Enhancements

Bug Fixes

See the full release notes or issue tracker on GitHub for a complete list.

v0.10.0 (December 17, 2012)¶

This is a major release from 0.9.1 and includes many new features and enhancements along with a large number of bug fixes. There are also a number of important API changes that long-time pandas users should pay close attention to.

File parsing new features¶

The delimited file parsing engine (the guts of read_csv and read_table) has been rewritten from the ground up and now uses a fraction of the amount of memory while parsing, while being 40% or more faster in most use cases (in some cases much faster).

There are also many new features:

API changes¶

Deprecated DataFrame BINOP TimeSeries special case behavior

The default behavior of binary operations between a DataFrame and a Series has always been to align on the DataFrame’s columns and broadcast down the rows, except in the special case that the DataFrame contains time series. Since there are now methods for each binary operator enabling you to specify how you want to broadcast, we are phasing out this special case (Zen of Python: Special cases aren’t special enough to break the rules). Here’s what I’m talking about:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(6, 4),
   ...:                   index=pd.date_range('1/1/2000', periods=6))
   ...: 

In [3]: df
Out[3]: 
                   0         1         2         3
2000-01-01 -0.134024 -0.205969  1.348944 -1.198246
2000-01-02 -1.626124  0.982041  0.059493 -0.460111
2000-01-03 -1.565401 -0.025706  0.942864  2.502156
2000-01-04 -0.302741  0.261551 -0.066342  0.897097
2000-01-05  0.268766 -1.225092  0.582752 -1.490764
2000-01-06 -0.639757 -0.952750 -0.892402  0.505987

[6 rows x 4 columns]

# deprecated now
In [4]: df - df[0]
Out[4]: 
            2000-01-01 00:00:00  2000-01-02 00:00:00  2000-01-03 00:00:00  \
2000-01-01                  NaN                  NaN                  NaN   
2000-01-02                  NaN                  NaN                  NaN   
2000-01-03                  NaN                  NaN                  NaN   
2000-01-04                  NaN                  NaN                  NaN   
2000-01-05                  NaN                  NaN                  NaN   
2000-01-06                  NaN                  NaN                  NaN   

            2000-01-04 00:00:00  2000-01-05 00:00:00  2000-01-06 00:00:00   0  \
2000-01-01                  NaN                  NaN                  NaN NaN   
2000-01-02                  NaN                  NaN                  NaN NaN   
2000-01-03                  NaN                  NaN                  NaN NaN   
2000-01-04                  NaN                  NaN                  NaN NaN   
2000-01-05                  NaN                  NaN                  NaN NaN   
2000-01-06                  NaN                  NaN                  NaN NaN   

             1   2   3  
2000-01-01 NaN NaN NaN  
2000-01-02 NaN NaN NaN  
2000-01-03 NaN NaN NaN  
2000-01-04 NaN NaN NaN  
2000-01-05 NaN NaN NaN  
2000-01-06 NaN NaN NaN  

[6 rows x 10 columns]

# Change your code to
In [5]: df.sub(df[0], axis=0) # align on axis 0 (rows)
Out[5]: 
              0         1         2         3
2000-01-01  0.0 -0.071946  1.482967 -1.064223
2000-01-02  0.0  2.608165  1.685618  1.166013
2000-01-03  0.0  1.539695  2.508265  4.067556
2000-01-04  0.0  0.564293  0.236399  1.199839
2000-01-05  0.0 -1.493857  0.313986 -1.759530
2000-01-06  0.0 -0.312993 -0.252645  1.145744

[6 rows x 4 columns]

You will get a deprecation warning in the 0.10.x series, and the deprecated functionality will be removed in 0.11 or later.
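To make the two broadcasting directions concrete, here is a small sketch with deterministic values. The frame contents and variable names are chosen for the example and are not part of the release notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(3, 4),
                  index=pd.date_range("2000-01-01", periods=3))

# Subtract the first column from every column, matching on the row index.
row_aligned = df.sub(df[0], axis=0)

# Subtract the first row from every row, matching on the column labels.
column_aligned = df.sub(df.iloc[0], axis=1)

print(row_aligned)
print(column_aligned)
```

Spelling out the axis makes the intent explicit, so the result no longer depends on whether the index happens to hold dates.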

Altered resample default behavior

The default time series resample binning behavior of daily D and higher frequencies has been changed to closed='left', label='left'. Lower frequencies are unaffected. The prior defaults were causing a great deal of confusion for users, especially resampling data to daily frequency (which labeled the aggregated group with the end of the interval: the next day).

In [1]: dates = pd.date_range('1/1/2000', '1/5/2000', freq='4h')

In [2]: series = Series(np.arange(len(dates)), index=dates)

In [3]: series
Out[3]:
2000-01-01 00:00:00     0
2000-01-01 04:00:00     1
2000-01-01 08:00:00     2
2000-01-01 12:00:00     3
2000-01-01 16:00:00     4
2000-01-01 20:00:00     5
2000-01-02 00:00:00     6
2000-01-02 04:00:00     7
2000-01-02 08:00:00     8
2000-01-02 12:00:00     9
2000-01-02 16:00:00    10
2000-01-02 20:00:00    11
2000-01-03 00:00:00    12
2000-01-03 04:00:00    13
2000-01-03 08:00:00    14
2000-01-03 12:00:00    15
2000-01-03 16:00:00    16
2000-01-03 20:00:00    17
2000-01-04 00:00:00    18
2000-01-04 04:00:00    19
2000-01-04 08:00:00    20
2000-01-04 12:00:00    21
2000-01-04 16:00:00    22
2000-01-04 20:00:00    23
2000-01-05 00:00:00    24
Freq: 4H, dtype: int64

In [4]: series.resample('D', how='sum')
Out[4]:
2000-01-01     15
2000-01-02     51
2000-01-03     87
2000-01-04    123
2000-01-05     24
Freq: D, dtype: int64

In [5]: # old behavior
In [6]: series.resample('D', how='sum', closed='right', label='right')
Out[6]:
2000-01-01      0
2000-01-02     21
2000-01-03     57
2000-01-04     93
2000-01-05    129
Freq: D, dtype: int64
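
The how= keyword used above was later replaced by method chaining on the resampler object. The sketch below repeats the comparison in that spelling, assuming a pandas version (0.18 or later) in which Series.resample(...).sum() is available; the names new_default and old_default are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1/1/2000", "1/5/2000", freq="4h")
series = pd.Series(np.arange(len(dates)), index=dates)

# Current defaults for daily bins: closed='left', label='left'.
new_default = series.resample("D").sum()

# The pre-0.10.0 behaviour has to be requested explicitly.
old_default = series.resample("D", closed="right", label="right").sum()

print(new_default)
print(old_default)
```
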
In [6]: s = pd.Series([1.5, np.inf, 3.4, -np.inf])

In [7]: pd.isnull(s)
Out[7]: 
0    False
1    False
2    False
3    False
Length: 4, dtype: bool

In [8]: s.fillna(0)
Out[8]: 
0    1.500000
1         inf
2    3.400000
3        -inf
Length: 4, dtype: float64

In [9]: pd.set_option('use_inf_as_null', True)

In [10]: pd.isnull(s)
Out[10]: 
0    False
1     True
2    False
3     True
Length: 4, dtype: bool

In [11]: s.fillna(0)
Out[11]: 
0    1.5
1    0.0
2    3.4
3    0.0
Length: 4, dtype: float64

In [12]: pd.reset_option('use_inf_as_null')
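
Where toggling a global option is undesirable, infinities can also be turned into missing values explicitly. This is a sketch of that alternative, not something the release notes prescribe; it relies only on Series.replace and np.nan.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.inf, 3.4, -np.inf])

# Map both infinities to NaN so that the usual missing-data tools apply.
cleaned = s.replace([np.inf, -np.inf], np.nan)
filled = cleaned.fillna(0)

print(pd.isnull(cleaned))
print(filled)
```
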
In [13]: data= 'a,b,c\n1,Yes,2\n3,No,4'

In [14]: print(data)
a,b,c
1,Yes,2
3,No,4

In [15]: pd.read_csv(StringIO(data), header=None)
Out[15]: 
   0    1  2
0  a    b  c
1  1  Yes  2
2  3   No  4

[3 rows x 3 columns]

In [16]: pd.read_csv(StringIO(data), header=None, prefix='X')
Out[16]: 
  X0   X1 X2
0  a    b  c
1  1  Yes  2
2  3   No  4

[3 rows x 3 columns]

In [17]: print(data)
a,b,c
1,Yes,2
3,No,4

In [18]: pd.read_csv(StringIO(data))
Out[18]: 
   a    b  c
0  1  Yes  2
1  3   No  4

[2 rows x 3 columns]

In [19]: pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])
Out[19]: 
   a      b  c
0  1   True  2
1  3  False  4

[2 rows x 3 columns]
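
The session above uses StringIO without showing where it comes from. A self-contained version of the boolean-parsing example, assuming Python 3 where StringIO lives in the io module, might look like this:

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,Yes,2\n3,No,4"

# Without hints, column b is read as plain strings.
plain = pd.read_csv(StringIO(data))

# With true_values / false_values, column b is parsed as booleans.
as_bool = pd.read_csv(StringIO(data), true_values=["Yes"], false_values=["No"])

print(plain.dtypes)
print(as_bool.dtypes)
```
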
In [20]: s = Series([np.nan, 1., 2., np.nan, 4])

In [21]: s
Out[21]: 
0    NaN
1    1.0
2    2.0
3    NaN
4    4.0
Length: 5, dtype: float64

In [22]: s.fillna(0)
Out[22]: 
0    0.0
1    1.0
2    2.0
3    0.0
4    4.0
Length: 5, dtype: float64

In [23]: s.fillna(method='pad')
Out[23]: 
0    NaN
1    1.0
2    2.0
3    2.0
4    4.0
Length: 5, dtype: float64

Convenience methods ffill and bfill have been added:

In [24]: s.ffill()
Out[24]: 
0    NaN
1    1.0
2    2.0
3    2.0
4    4.0
Length: 5, dtype: float64
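
For comparison, here is a small standalone sketch of both convenience methods on the same data, assuming a recent pandas with explicit imports. The names forward and backward are chosen for the example only:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, 2.0, np.nan, 4.0])

forward = s.ffill()   # propagate the last valid value forward
backward = s.bfill()  # propagate the next valid value backward

print(forward.tolist())
print(backward.tolist())
```

The leading NaN survives ffill because there is no earlier value to carry forward, while bfill fills it from the next valid entry.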
Wide DataFrame Printing¶

Instead of printing the summary information, pandas now splits the string representation across multiple rows by default:

In [30]: wide_frame = DataFrame(randn(5, 16))

In [31]: wide_frame
Out[31]: 
         0         1         2         3         4         5         6   \
0 -0.681624  0.191356  1.180274 -0.834179  0.703043  0.166568 -0.583599   
1  0.441522 -0.316864 -0.017062  1.570114 -0.360875 -0.880096  0.235532   
2 -0.412451 -0.462580  0.422194  0.288403 -0.487393 -0.777639  0.055865   
3 -0.277255  1.331263  0.585174 -0.568825 -0.719412  1.191340 -0.456362   
4 -1.642511  0.432560  1.218080 -0.564705 -0.581790  0.286071  0.048725   

         7         8         9         10        11        12        13  \
0 -1.201796 -1.422811 -0.882554  1.209871 -0.941235  0.863067 -0.336232   
1  0.207232 -1.983857 -1.702547 -1.621234 -0.906840  1.014601 -0.475108   
2  1.383381  0.085638  0.246392  0.965887  0.246354 -0.727728 -0.094414   
3  0.089931  0.776079  0.752889 -1.195795 -1.425911 -0.548829  0.774225   
4  1.002440  1.276582  0.054399  0.241963 -0.471786  0.314510 -0.059986   

         14        15  
0 -0.976847  0.033862  
1 -0.358944  1.262942  
2 -0.276854  0.158399  
3  0.740501  1.510263  
4 -2.069319 -1.115104  

[5 rows x 16 columns]

The old behavior of printing out summary information can be achieved via the ‘expand_frame_repr’ print option:

In [32]: pd.set_option('expand_frame_repr', False)

In [33]: wide_frame
Out[33]: 
         0         1         2         3         4         5         6         7         8         9         10        11        12        13        14        15
0 -0.681624  0.191356  1.180274 -0.834179  0.703043  0.166568 -0.583599 -1.201796 -1.422811 -0.882554  1.209871 -0.941235  0.863067 -0.336232 -0.976847  0.033862
1  0.441522 -0.316864 -0.017062  1.570114 -0.360875 -0.880096  0.235532  0.207232 -1.983857 -1.702547 -1.621234 -0.906840  1.014601 -0.475108 -0.358944  1.262942
2 -0.412451 -0.462580  0.422194  0.288403 -0.487393 -0.777639  0.055865  1.383381  0.085638  0.246392  0.965887  0.246354 -0.727728 -0.094414 -0.276854  0.158399
3 -0.277255  1.331263  0.585174 -0.568825 -0.719412  1.191340 -0.456362  0.089931  0.776079  0.752889 -1.195795 -1.425911 -0.548829  0.774225  0.740501  1.510263
4 -1.642511  0.432560  1.218080 -0.564705 -0.581790  0.286071  0.048725  1.002440  1.276582  0.054399  0.241963 -0.471786  0.314510 -0.059986 -2.069319 -1.115104

[5 rows x 16 columns]

The width of each line can be changed via ‘line_width’ (80 by default):

In [34]: pd.set_option('line_width', 40)
line_width has been deprecated, use display.width instead (currently both are
identical)


In [35]: wide_frame
Out[35]: 
         0         1         2   \
0 -0.681624  0.191356  1.180274   
1  0.441522 -0.316864 -0.017062   
2 -0.412451 -0.462580  0.422194   
3 -0.277255  1.331263  0.585174   
4 -1.642511  0.432560  1.218080   

         3         4         5   \
0 -0.834179  0.703043  0.166568   
1  1.570114 -0.360875 -0.880096   
2  0.288403 -0.487393 -0.777639   
3 -0.568825 -0.719412  1.191340   
4 -0.564705 -0.581790  0.286071   

         6         7         8   \
0 -0.583599 -1.201796 -1.422811   
1  0.235532  0.207232 -1.983857   
2  0.055865  1.383381  0.085638   
3 -0.456362  0.089931  0.776079   
4  0.048725  1.002440  1.276582   

         9         10        11  \
0 -0.882554  1.209871 -0.941235   
1 -1.702547 -1.621234 -0.906840   
2  0.246392  0.965887  0.246354   
3  0.752889 -1.195795 -1.425911   
4  0.054399  0.241963 -0.471786   

         12        13        14  \
0  0.863067 -0.336232 -0.976847   
1  1.014601 -0.475108 -0.358944   
2 -0.727728 -0.094414 -0.276854   
3 -0.548829  0.774225  0.740501   
4  0.314510 -0.059986 -2.069319   

         15  
0  0.033862  
1  1.262942  
2  0.158399  
3  1.510263  
4 -1.115104  

[5 rows x 16 columns]
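
Since line_width is reported as deprecated in favor of display.width, a hedged sketch of the same wrapping behavior in a recent pandas could use DataFrame.to_string with an explicit line_width, or a temporary option_context. The variable names wrapped and unwrapped are illustrative only:

```python
import numpy as np
import pandas as pd

wide_frame = pd.DataFrame(np.random.randn(5, 16))

# to_string accepts an explicit line_width, independent of global options.
wrapped = wide_frame.to_string(line_width=40)
unwrapped = wide_frame.to_string()

# Wrapped output marks each continued block with a trailing backslash.
print(wrapped)

# option_context restores the previous settings when the block exits.
with pd.option_context("display.width", 40, "display.expand_frame_repr", True):
    print(wide_frame)
```

Using option_context instead of set_option keeps the change local, which avoids surprising other code that prints frames later in the same session.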
Updated PyTables Support¶

Documentation for the PyTables Table format has been added, along with several enhancements to the API. Here is a taste of what to expect.

In [36]: store = HDFStore('store.h5')

In [37]: df = DataFrame(randn(8, 3), index=date_range('1/1/2000', periods=8),
   ....:            columns=['A', 'B', 'C'])
   ....: 

In [38]: df
Out[38]: 
                   A         B         C
2000-01-01 -0.369325 -1.502617 -0.376280
2000-01-02  0.511936 -0.116412 -0.625256
2000-01-03 -0.550627  1.261433 -0.552429
2000-01-04  1.695803 -1.025917 -0.910942
2000-01-05  0.426805 -0.131749  0.432600
2000-01-06  0.044671 -0.341265  1.844536
2000-01-07 -2.036047  0.000830 -0.955697
2000-01-08 -0.898872 -0.725411  0.059904

[8 rows x 3 columns]

# appending data frames
In [39]: df1 = df[0:4]

In [40]: df2 = df[4:]

In [41]: store.append('df', df1)

In [42]: store.append('df', df2)

In [43]: store
Out[43]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame_table  (typ->appendable,nrows->8,ncols->3,indexers->[index])

# selecting the entire store
In [44]: store.select('df')
Out[44]: 
                   A         B         C
2000-01-01 -0.369325 -1.502617 -0.376280
2000-01-02  0.511936 -0.116412 -0.625256
2000-01-03 -0.550627  1.261433 -0.552429
2000-01-04  1.695803 -1.025917 -0.910942
2000-01-05  0.426805 -0.131749  0.432600
2000-01-06  0.044671 -0.341265  1.844536
2000-01-07 -2.036047  0.000830 -0.955697
2000-01-08 -0.898872 -0.725411  0.059904

[8 rows x 3 columns]
In [45]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   ....:        major_axis=date_range('1/1/2000', periods=5),
   ....:        minor_axis=['A', 'B', 'C', 'D'])
   ....: 

In [46]: wp
Out[46]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

# storing a panel
In [47]: store.append('wp',wp)

# selecting via a query
In [48]: store.select('wp', "major_axis>20000102 and minor_axis=['A','B']")
Out[48]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to B

# removing data from tables
In [49]: store.remove('wp', "major_axis>20000103")
Out[49]: 8

In [50]: store.select('wp')
Out[50]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-03 00:00:00
Minor_axis axis: A to D

# deleting an object from a store
In [51]: del store['df']

In [52]: store
Out[52]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/wp            wide_table   (typ->appendable,nrows->12,ncols->2,indexers->[major_axis,minor_axis])

Enhancements

Bug Fixes

Compatibility

The 0.10 version of HDFStore is backwards compatible for reading tables created in a prior version of pandas; however, query terms using the prior (undocumented) methodology are unsupported. You must read in the entire file and write it out using the new format to take advantage of the updates.

N Dimensional Panels (Experimental)¶

This release adds experimental support for Panel4D and factory functions to create n-dimensional named panels. See the docs for NDim. Here is a taste of what to expect.

In [65]: p4d = Panel4D(randn(2, 2, 5, 4),
   ....:       labels=['Label1','Label2'],
   ....:       items=['Item1', 'Item2'],
   ....:       major_axis=date_range('1/1/2000', periods=5),
   ....:       minor_axis=['A', 'B', 'C', 'D'])
   ....: 

In [66]: p4d
Out[66]: 
<class 'pandas.core.panelnd.Panel4D'>
Dimensions: 2 (labels) x 2 (items) x 5 (major_axis) x 4 (minor_axis)
Labels axis: Label1 to Label2
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

See the full release notes or issue tracker on GitHub for a complete list.

v0.9.1 (November 14, 2012)¶

This is a bugfix release from 0.9.0 and includes several new features and enhancements along with a large number of bug fixes. The new features include by-column sort order for DataFrame and Series, improved NA handling for the rank method, masking functions for DataFrame, and intraday time-series filtering for DataFrame.

New features¶
API changes¶

See the full release notes or issue tracker on GitHub for a complete list.

v0.9.0 (October 7, 2012)¶

This is a major release from 0.8.1 and includes several new features and enhancements along with a large number of bug fixes. New features include vectorized unicode encoding/decoding for Series.str, to_latex method to DataFrame, more flexible parsing of boolean values, and enabling the download of options data from Yahoo! Finance.

New features¶
API changes¶
In [1]: data = '0,0,1\n1,1,0\n0,1,0'

In [2]: df = read_csv(StringIO(data), header=None)

In [3]: df
Out[3]: 
   0  1  2
0  0  0  1
1  1  1  0
2  0  1  0

[3 rows x 3 columns]
In [4]: s1 = Series([1, 2, 3])

In [5]: s1
Out[5]: 
0    1
1    2
2    3
Length: 3, dtype: int64

In [6]: s2 = Series(s1, index=['foo', 'bar', 'baz'])

In [7]: s2
Out[7]: 
foo   NaN
bar   NaN
baz   NaN
Length: 3, dtype: float64

See the full release notes or issue tracker on GitHub for a complete list.

v0.8.1 (July 22, 2012)¶

This release includes a few new features, performance enhancements, and over 30 bug fixes from 0.8.0. Notable new features include NA-friendly string processing functionality and a series of new plot types and options.

Performance improvements¶
v0.8.0 (June 29, 2012)¶

This is a major release from 0.7.3 and includes extensive work on the time series handling and processing infrastructure as well as a great deal of new functionality throughout the library. It includes over 700 commits from more than 20 distinct authors. Most pandas 0.7.3 and earlier users should not experience any issues upgrading, but due to the migration to the NumPy datetime64 dtype, there may be a number of bugs and incompatibilities lurking. Lingering incompatibilities will be fixed ASAP in a 0.8.1 release if necessary. See the full release notes or issue tracker on GitHub for a complete list.

Support for non-unique indexes¶

All objects can now work with non-unique indexes. Data alignment / join operations work according to SQL join semantics (including, if applicable, index duplication in many-to-many joins).
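
As a rough illustration of the SQL-style alignment described above, the following sketch joins two frames whose indexes both repeat the label "a". The frame contents and the names left and right are made up for the example, and it is written against a recent pandas rather than 0.8.0:

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "a", "b"])
right = pd.DataFrame({"y": [10, 20, 30]}, index=["a", "a", "c"])

# Key "a" appears twice on each side, so an inner join yields 2 x 2 = 4 rows.
inner = left.join(right, how="inner")

# A left join also keeps "b", which has no match and gets NaN for y.
outer_left = left.join(right, how="left")

print(inner)
print(outer_left)
```

The many-to-many case is where index duplication shows up: every left row labeled "a" is paired with every right row labeled "a".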

NumPy datetime64 dtype and 1.6 dependency¶

Time series data are now represented using NumPy’s datetime64 dtype; thus, pandas 0.8.0 now requires at least NumPy 1.6. It has been tested and verified to work with the development version (1.7+) of NumPy as well which includes some significant user-facing API changes. NumPy 1.6 also has a number of bugs having to do with nanosecond resolution data, so I recommend that you steer clear of NumPy 1.6’s datetime64 API functions (though limited as they are) and only interact with this data using the interface that pandas provides.

See the end of the 0.8.0 section for a “porting” guide listing potential issues for users migrating legacy codebases from pandas 0.7 or earlier to 0.8.0.

Bug fixes to the 0.7.x series for legacy NumPy < 1.6 users will be provided as they arise. There will be no further development in 0.7.x beyond bug fixes.

Time series changes and improvements¶

Note

With this release, legacy scikits.timeseries users should be able to port their code to use pandas.

Other new features¶
New plotting methods¶

Series.plot now supports a secondary_y option:

In [1]: plt.figure()
Out[1]: <matplotlib.figure.Figure at 0x133e4b8d0>

In [2]: fx['FR'].plot(style='g')
Out[2]: <matplotlib.axes._subplots.AxesSubplot at 0x12ad13cc0>

In [3]: fx['IT'].plot(style='k--', secondary_y=True)
Out[3]: <matplotlib.axes._subplots.AxesSubplot at 0x133554dd8>

Vytautas Jancauskas, the 2012 GSOC participant, has added many new plot types. For example, 'kde' is a new option:

In [4]: s = Series(np.concatenate((np.random.randn(1000),
   ...:                            np.random.randn(1000) * 0.5 + 3)))
   ...: 

In [5]: plt.figure()
Out[5]: <matplotlib.figure.Figure at 0x1300adda0>

In [6]: s.hist(normed=True, alpha=0.2)
Out[6]: <matplotlib.axes._subplots.AxesSubplot at 0x13099ca20>

In [7]: s.plot(kind='kde')
Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x13099ca20>

See the plotting page for much more.

Other API changes¶
Potential porting issues for pandas <= 0.7.3 users¶

The major change that may affect you in pandas 0.8.0 is that time series indexes use NumPy’s datetime64 data type instead of dtype=object arrays of Python’s built-in datetime.datetime objects. DateRange has been replaced by DatetimeIndex but otherwise behaves identically. However, if you have code that converts DateRange or Index objects that used to contain datetime.datetime values to plain NumPy arrays, you may have bugs lurking in code that uses scalar values, because you are handing control over to NumPy:

In [8]: import datetime

In [9]: rng = date_range('1/1/2000', periods=10)

In [10]: rng[5]
Out[10]: Timestamp('2000-01-06 00:00:00', freq='D')

In [11]: isinstance(rng[5], datetime.datetime)
Out[11]: True

In [12]: rng_asarray = np.asarray(rng)

In [13]: scalar_val = rng_asarray[5]

In [14]: type(scalar_val)
Out[14]: numpy.datetime64

pandas’s Timestamp object is a subclass of datetime.datetime that has nanosecond support (the nanosecond field stores the nanosecond value between 0 and 999). It should substitute directly into any code that used datetime.datetime values before. Thus, I recommend not casting DatetimeIndex to regular NumPy arrays.

If you have code that requires an array of datetime.datetime objects, you have a couple of options. First, the asobject property of DatetimeIndex produces an array of Timestamp objects:

In [15]: stamp_array = rng.asobject

In [16]: stamp_array
Out[16]: 
Index([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00,
       2000-01-04 00:00:00, 2000-01-05 00:00:00, 2000-01-06 00:00:00,
       2000-01-07 00:00:00, 2000-01-08 00:00:00, 2000-01-09 00:00:00,
       2000-01-10 00:00:00],
      dtype='object')

In [17]: stamp_array[5]
Out[17]: Timestamp('2000-01-06 00:00:00', freq='D')

To get an array of proper datetime.datetime objects, use the to_pydatetime method:

In [18]: dt_array = rng.to_pydatetime()

In [19]: dt_array
Out[19]: 
array([datetime.datetime(2000, 1, 1, 0, 0),
       datetime.datetime(2000, 1, 2, 0, 0),
       datetime.datetime(2000, 1, 3, 0, 0),
       datetime.datetime(2000, 1, 4, 0, 0),
       datetime.datetime(2000, 1, 5, 0, 0),
       datetime.datetime(2000, 1, 6, 0, 0),
       datetime.datetime(2000, 1, 7, 0, 0),
       datetime.datetime(2000, 1, 8, 0, 0),
       datetime.datetime(2000, 1, 9, 0, 0),
       datetime.datetime(2000, 1, 10, 0, 0)], dtype=object)

In [20]: dt_array[5]
Out[20]: datetime.datetime(2000, 1, 6, 0, 0)

matplotlib knows how to handle datetime.datetime but not Timestamp objects. While I recommend that you plot time series using TimeSeries.plot, you can either use to_pydatetime or register a converter for the Timestamp type. See matplotlib documentation for more on this.
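
The three scalar types discussed in this section can be compared side by side. The sketch below assumes a recent pandas and NumPy with explicit imports, and the variable names are illustrative:

```python
import datetime

import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=10)

# Elements of a DatetimeIndex are Timestamps, which subclass datetime.datetime.
stamp = rng[5]
print(isinstance(stamp, datetime.datetime))

# Casting to a plain NumPy array hands back numpy.datetime64 scalars instead.
scalar_val = np.asarray(rng)[5]
print(type(scalar_val))

# to_pydatetime returns an object array of plain datetime.datetime values.
dt_array = rng.to_pydatetime()
print(type(dt_array[5]), dt_array[5])
```

Code that calls datetime.datetime methods on scalars keeps working with the first and third forms, but not with the numpy.datetime64 scalar in the middle.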

Warning

There are bugs in the user-facing API with the nanosecond datetime64 unit in NumPy 1.6. In particular, the string version of the array shows garbage values, and conversion to dtype=object is similarly broken.

In [21]: rng = date_range('1/1/2000', periods=10)

In [22]: rng
Out[22]: 
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', freq='D')

In [23]: np.asarray(rng)
Out[23]: 
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000',
       '2000-01-05T00:00:00.000000000', '2000-01-06T00:00:00.000000000',
       '2000-01-07T00:00:00.000000000', '2000-01-08T00:00:00.000000000',
       '2000-01-09T00:00:00.000000000', '2000-01-10T00:00:00.000000000'], dtype='datetime64[ns]')

In [24]: converted = np.asarray(rng, dtype=object)

In [25]: converted[5]
Out[25]: 947116800000000000

Trust me: don’t panic. If you are using NumPy 1.6 and restrict your interaction with datetime64 values to pandas’s API you will be just fine. There is nothing wrong with the data-type (a 64-bit integer internally); all of the important data processing happens in pandas and is heavily tested. I strongly recommend that you do not work directly with datetime64 arrays in NumPy 1.6 and only use the pandas API.

Support for non-unique indexes: You may have code inside a try: ... except: block that relied on an operation failing because the index was not unique. In many cases it will no longer fail (some methods, like append, still check for uniqueness unless disabled). However, all is not lost: you can inspect index.is_unique and either raise an exception explicitly if it is False or go to a different code branch.
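
One way to restore the old fail-fast behavior is an explicit guard on index.is_unique. The helper below, require_unique_index, is a hypothetical name for this sketch and not part of pandas. The choice of ValueError is likewise an assumption:

```python
import pandas as pd


def require_unique_index(obj):
    """Raise if obj's index has duplicate labels (hypothetical helper)."""
    if not obj.index.is_unique:
        dupes = obj.index[obj.index.duplicated()].unique().tolist()
        raise ValueError(f"index contains duplicate labels: {dupes}")
    return obj


ok = pd.Series([1, 2, 3], index=["a", "b", "c"])
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])

require_unique_index(ok)  # passes silently

message = None
try:
    require_unique_index(dup)
except ValueError as exc:
    message = str(exc)
    print(message)
```

Returning the object from the guard lets it be chained in front of an operation that assumes unique labels.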

v.0.7.3 (April 12, 2012)¶

This is a minor release from 0.7.2 and fixes many minor bugs and adds a number of nice new features. There are also a couple of API changes to note; these should not affect very many users, and we are inclined to call them “bug fixes” even though they do constitute a change in behavior. See the full release notes or issue tracker on GitHub for a complete list.

NA Boolean Comparison API Change¶

Reverted some changes to how NA values (represented typically as NaN or None) are handled in non-numeric Series:

In [1]: series = Series(['Steve', np.nan, 'Joe'])

In [2]: series == 'Steve'
Out[2]: 
0     True
1    False
2    False
Length: 3, dtype: bool

In [3]: series != 'Steve'
Out[3]: 
0    False
1     True
2     True
Length: 3, dtype: bool

In comparisons, NA / NaN will always come through as False except with != which is True. Be very careful with boolean arithmetic, especially negation, in the presence of NA data. You may wish to add an explicit NA filter into boolean array operations if you are worried about this:

In [4]: mask = series == 'Steve'

In [5]: series[mask & series.notnull()]
Out[5]: 
0    Steve
Length: 1, dtype: object

While propagating NA in comparisons may seem like the right behavior to some users (and you could argue on purely technical grounds that this is the right thing to do), the evaluation was made that propagating NA everywhere, including in numerical arrays, would cause a large number of problems for users. Thus, a “practicality beats purity” approach was taken. This issue may be revisited at some point in the future.
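
The negation pitfall mentioned above is easy to reproduce. This sketch assumes a recent pandas and forces dtype=object so that the comparison semantics match the ones described here. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

series = pd.Series(["Steve", np.nan, "Joe"], dtype=object)

is_steve = series == "Steve"   # the NA entry compares as False
not_steve = series != "Steve"  # the NA entry compares as True

# Negating the equality mask therefore also selects the NA entry.
negated = ~is_steve

# An explicit NA filter keeps missing values out of the selection.
known_not_steve = series[negated & series.notnull()]

print(negated.tolist())
print(known_not_steve.tolist())
```

Without the notnull filter, the missing entry would be counted among the rows that are "not Steve", which is rarely what is intended.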

Other API Changes¶

When calling apply on a grouped Series, the return value will also be a Series, to be more consistent with the groupby behavior with DataFrame:

In [6]: df = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                     'foo', 'bar', 'foo', 'foo'],
   ...:                 'B' : ['one', 'one', 'two', 'three',
   ...:                        'two', 'two', 'one', 'three'],
   ...:                 'C' : np.random.randn(8), 'D' : np.random.randn(8)})
   ...: 

In [7]: df
Out[7]: 
     A      B         C         D
0  foo    one  1.075059 -0.449141
1  bar    one  0.785676  1.443014
2  foo    two  0.958157  0.612324
3  bar  three  1.477773 -0.178818
4  foo    two -1.006023  0.133072
5  bar    two -1.506997 -0.550981
6  foo    one  1.218042 -2.043335
7  foo  three -0.565878  0.753539

[8 rows x 4 columns]

In [8]: grouped = df.groupby('A')['C']

In [9]: grouped.describe()
Out[9]: 
     count      mean       std       min       25%       50%       75%  \
A                                                                        
bar    3.0  0.252151  1.562274 -1.506997 -0.360661  0.785676  1.131724   
foo    5.0  0.335871  1.039915 -1.006023 -0.565878  0.958157  1.075059   

          max  
A              
bar  1.477773  
foo  1.218042  

[2 rows x 8 columns]

In [10]: grouped.apply(lambda x: x.sort_values()[-2:]) # top 2 values
Out[10]: 
A     
bar  1    0.785676
     3    1.477773
foo  0    1.075059
     6    1.218042
Name: C, Length: 4, dtype: float64
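
To make the shape of the result easier to check, here is a deterministic variant of the session above with fixed values instead of random ones. It assumes a recent pandas, and uses iloc for the positional tail slice:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar"],
    "C": [1.0, 5.0, 3.0, 4.0, 2.0, 6.0],
})

grouped = df.groupby("A")["C"]

# Each group returns a Series, so the combined result is a Series whose
# index pairs the group key with the original row label.
top2 = grouped.apply(lambda x: x.sort_values().iloc[-2:])

print(top2)
```

The outer index level holds the group key and the inner level keeps the original row label, so each value can still be traced back to its source row.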
v.0.7.2 (March 16, 2012)¶

This release targets bugs in 0.7.1, and adds a few minor features.

New features¶
Performance improvements¶
v.0.7.1 (February 29, 2012)¶

This release includes a few new features and addresses over a dozen bugs in 0.7.0.

New features¶
Performance improvements¶
v.0.7.0 (February 9, 2012)¶
New features¶
In [1]: df = DataFrame(randn(10, 4))

In [2]: df.apply(lambda x: x.describe())
Out[2]: 
               0          1          2          3
count  10.000000  10.000000  10.000000  10.000000
mean   -0.409608   0.539495   0.163276   0.051646
std     1.397779   0.968808   0.874489   0.719651
min    -2.539411  -0.737206  -1.202276  -1.050435
25%    -1.202202   0.021308  -0.368812  -0.383608
50%    -0.384480   0.306124   0.211431   0.165586
75%     0.186280   1.024039   0.730744   0.494457
max     2.524998   2.533114   1.334428   1.147396

[8 rows x 4 columns]
API Changes to integer indexing¶

One of the potentially riskiest API changes in 0.7.0, but also one of the most important, was a complete review of how integer indexes are handled with regard to label-based indexing. Here is an example:

In [3]: s = Series(randn(10), index=range(0, 20, 2))

In [4]: s
Out[4]: 
0    -0.543429
2     1.425447
4    -0.408795
6    -1.489348
8    -1.166408
10   -0.481205
12   -0.810355
14   -0.985491
16   -0.336246
18   -0.629058
Length: 10, dtype: float64

In [5]: s[0]
Out[5]: -0.54342898765020686

In [6]: s[2]
Out[6]: 1.4254474252163707

In [7]: s[4]
Out[7]: -0.40879476802408349

This is all exactly identical to the behavior before. However, if you ask for a key not contained in the Series, in versions 0.6.1 and prior, Series would fall back on a location-based lookup. This now raises a KeyError.

This change also has the same impact on DataFrame:

In [3]: df = DataFrame(randn(8, 4), index=range(0, 16, 2))

In [4]: df
    0        1       2       3
0   0.88427  0.3363 -0.1787  0.03162
2   0.14451 -0.1415  0.2504  0.58374
4  -1.44779 -0.9186 -1.4996  0.27163
6  -0.26598 -2.4184 -0.2658  0.11503
8  -0.58776  0.3144 -0.8566  0.61941
10  0.10940 -0.7175 -1.0108  0.47990
12 -1.16919 -0.3087 -0.6049 -0.43544
14 -0.07337  0.3410  0.0424 -0.16037

In [5]: df.ix[3]
KeyError: 3
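
The same label-versus-position distinction can be sketched with the loc and iloc indexers found in later pandas releases. They are used here as an assumption, since the ix indexer shown above is specific to older versions:

```python
import pandas as pd

s = pd.Series(range(10), index=range(0, 20, 2))

# Label 4 exists, so a label lookup returns the value stored under it.
by_label = s.loc[4]

# Label 3 does not exist; the lookup raises instead of falling back to
# position 3.
try:
    s.loc[3]
    fell_back = True
except KeyError:
    fell_back = False

# Positional access has to be requested explicitly.
by_position = s.iloc[3]

print(by_label, fell_back, by_position)
```

Keeping the two access styles separate removes the ambiguity that the old fallback introduced for integer indexes.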

In order to support purely integer-based indexing, the following methods have been added:

Method                      Description
Series.iget_value(i)        Retrieve value stored at location i
Series.iget(i)              Alias for iget_value
DataFrame.irow(i)           Retrieve the i-th row
DataFrame.icol(j)           Retrieve the j-th column
DataFrame.iget_value(i, j)  Retrieve the value at row i and column j

API tweaks regarding label-based slicing¶

Label-based slicing using ix now requires that the index be sorted (monotonic) unless both the start and endpoint are contained in the index:

In [1]: s = Series(randn(6), index=list('gmkaec'))

In [2]: s
Out[2]:
g   -1.182230
m   -0.276183
k   -0.243550
a    1.628992
e    0.073308
c   -0.539890
dtype: float64

Then this is OK:

In [3]: s.ix['k':'e']
Out[3]:
k   -0.243550
a    1.628992
e    0.073308
dtype: float64

But this is not:

In [12]: s.ix['b':'h']
KeyError 'b'

If the index had been sorted, the “range selection” would have been possible:

In [4]: s2 = s.sort_index()

In [5]: s2
Out[5]:
a    1.628992
c   -0.539890
e    0.073308
g   -1.182230
k   -0.243550
m   -0.276183
dtype: float64

In [6]: s2.ix['b':'h']
Out[6]:
c   -0.539890
e    0.073308
g   -1.182230
dtype: float64
Changes to Series [] operator¶

As a notational convenience, you can pass a sequence of labels or a label slice to a Series when getting and setting values via [] (i.e. the __getitem__ and __setitem__ methods). The behavior will be the same as passing similar input to ix except in the case of integer indexing:

In [8]: s = Series(randn(6), index=list('acegkm'))

In [9]: s
Out[9]: 
a   -0.297788
c    0.499769
e    0.810531
g    0.414649
k   -1.551478
m    1.012459
Length: 6, dtype: float64

In [10]: s[['m', 'a', 'c', 'e']]
Out[10]: 
m    1.012459
a   -0.297788
c    0.499769
e    0.810531
Length: 4, dtype: float64

In [11]: s['b':'l']
Out[11]: 
c    0.499769
e    0.810531
g    0.414649
k   -1.551478
Length: 4, dtype: float64

In [12]: s['c':'k']
Out[12]: 
c    0.499769
e    0.810531
g    0.414649
k   -1.551478
Length: 4, dtype: float64

In the case of integer indexes, the behavior will be exactly as before (shadowing ndarray):

In [13]: s = Series(randn(6), index=range(0, 12, 2))

In [14]: s[[4, 0, 2]]
Out[14]: 
4    0.928877
0    1.171752
2    0.026488
Length: 3, dtype: float64

In [15]: s[1:5]
Out[15]: 
2    0.026488
4    0.928877
6   -1.264991
8    0.419449
Length: 4, dtype: float64

If you wish to do indexing with sequences and slicing on an integer index with label semantics, use ix.
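
To see how positional and label semantics differ on an integer index, the sketch below uses the iloc and loc indexers from later pandas releases. They stand in for plain [] and ix respectively, which is an assumption about the reader's version, and the data is made up:

```python
import pandas as pd

s = pd.Series([10, 11, 12, 13, 14, 15], index=[0, 2, 4, 6, 8, 10])

# Positional slice: positions 1 through 4, end point excluded.
by_position = s.iloc[1:5]

# Label slice: labels 2 through 8, end point included.
by_label = s.loc[2:8]

# Labels 1 and 5 are absent; on this sorted index the label slice keeps
# the labels that fall between them.
between = s.loc[1:5]

print(list(by_position.index), list(by_label.index), list(between.index))
```

The first two slices happen to select the same rows here, but only because of how the labels are spaced. The third shows that the same numbers mean something different under label semantics.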

Other API Changes¶
Performance improvements¶
v.0.6.1 (December 13, 2011)¶
Performance improvements¶
v.0.6.0 (November 25, 2011)¶
New Features¶
Performance Enhancements¶
v.0.5.0 (October 24, 2011)¶
New Features¶
Performance Enhancements¶
v.0.4.3 through v0.4.1 (September 25 - October 9, 2011)¶
New Features¶
Performance Enhancements¶
