These are the changes in pandas 1.5.0. See Release notes for a full changelog including other versions of pandas.
Enhancements
pandas-stubs
The pandas-stubs library is now supported by the pandas development team, providing type stubs for the pandas API. Please visit pandas-dev/pandas-stubs for more information.
We thank VirtusLab and Microsoft for their initial, significant contributions to pandas-stubs.
With PyArrow installed, users can now create pandas objects that are backed by a pyarrow.ChunkedArray and pyarrow.DataType.
The dtype argument can accept a string of a pyarrow data type with pyarrow in brackets, e.g. "int64[pyarrow]", or, for pyarrow data types that take parameters, an ArrowDtype initialized with a pyarrow.DataType.
    In [1]: import pyarrow as pa

    In [2]: ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")

    In [3]: ser_float
    Out[3]:
    0     1.0
    1     2.0
    2    <NA>
    dtype: float[pyarrow]

    In [4]: list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))

    In [5]: ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)

    In [6]: ser_list
    Out[6]:
    0      [1. 2.]
    1    [ 3. nan]
    dtype: list<item: int64>[pyarrow]

    In [7]: ser_list.take([1, 0])
    Out[7]:
    1    [ 3. nan]
    0      [1. 2.]
    dtype: list<item: int64>[pyarrow]

    In [8]: ser_float * 5
    Out[8]:
    0     5.0
    1    10.0
    2    <NA>
    dtype: float[pyarrow]

    In [9]: ser_float.mean()
    Out[9]: 1.5

    In [10]: ser_float.dropna()
    Out[10]:
    0    1.0
    1    2.0
    dtype: float[pyarrow]
Most operations are supported and have been implemented using pyarrow compute functions. We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.
Warning
This feature is experimental, and the API can change in a future release without warning.
DataFrame interchange protocol implementation
pandas now implements the DataFrame interchange API spec. See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html
The protocol consists of two parts:
New method DataFrame.__dataframe__() which produces the interchange object. It effectively “exports” the pandas dataframe as an interchange object so any other library which has the protocol implemented can “import” that dataframe without knowing anything about the producer except that it makes an interchange object.
New function pandas.api.interchange.from_dataframe() which can take an arbitrary interchange object from any conformant library and construct a pandas DataFrame out of it.
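The two parts of the protocol can be sketched as a round trip within pandas itself. This is a minimal illustration, assuming pandas 1.5 or later and a frame with only numeric columns; the variable names are made up for the example, and in practice the consumer would usually be a different library.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# A small frame with numeric columns only (kept simple on purpose).
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# "Export": produce the interchange object that other libraries can consume.
interchange_obj = df.__dataframe__()

# "Import": build a pandas DataFrame back from the interchange object.
round_tripped = from_dataframe(interchange_obj)
print(round_tripped)
```

Passing the interchange object rather than `df` itself matters here, because `from_dataframe` is expected to return a pandas DataFrame input unchanged.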
The most notable development is the new method Styler.concat() which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts etc. (GH 43875, GH 46186)
Additionally there is an alternative output method Styler.to_string(), which allows using the Styler’s formatting methods to create, for example, CSVs (GH 44502).
A new feature Styler.relabel_index() is also made available to provide full customisation of the display of index or column headers (GH 47864)
Minor feature improvements are:
Adding the ability to render border and border-{side} CSS properties in Excel (GH 42276)
Making keyword arguments consistent: Styler.highlight_null() now accepts color and deprecates null_color although this remains backwards compatible (GH 45907)
Control of index with group_keys in DataFrame.resample()
The argument group_keys has been added to the method DataFrame.resample(). As with DataFrame.groupby(), this argument controls whether each group is added to the index in the resample when Resampler.apply() is used.
Warning
Not specifying the group_keys argument will retain the previous behavior and emit a warning if the result will change by specifying group_keys=False. In a future version of pandas, not specifying group_keys will default to the same behavior as group_keys=False.
    In [11]: df = pd.DataFrame(
       ....:     {'a': range(6)},
       ....:     index=pd.date_range("2021-01-01", periods=6, freq="8H")
       ....: )
       ....:

    In [12]: df.resample("D", group_keys=True).apply(lambda x: x)
    Out[12]:
                                    a
    2021-01-01 2021-01-01 00:00:00  0
               2021-01-01 08:00:00  1
               2021-01-01 16:00:00  2
    2021-01-02 2021-01-02 00:00:00  3
               2021-01-02 08:00:00  4
               2021-01-02 16:00:00  5

    In [13]: df.resample("D", group_keys=False).apply(lambda x: x)
    Out[13]:
                         a
    2021-01-01 00:00:00  0
    2021-01-01 08:00:00  1
    2021-01-01 16:00:00  2
    2021-01-02 00:00:00  3
    2021-01-02 08:00:00  4
    2021-01-02 16:00:00  5
Previously, the resulting index would depend upon the values returned by apply, as seen in the following example.
    In [1]: # pandas 1.3

    In [2]: df.resample("D").apply(lambda x: x)
    Out[2]:
                         a
    2021-01-01 00:00:00  0
    2021-01-01 08:00:00  1
    2021-01-01 16:00:00  2
    2021-01-02 00:00:00  3
    2021-01-02 08:00:00  4
    2021-01-02 16:00:00  5

    In [3]: df.resample("D").apply(lambda x: x.reset_index())
    Out[3]:
                               index  a
    2021-01-01 0 2021-01-01 00:00:00  0
               1 2021-01-01 08:00:00  1
               2 2021-01-01 16:00:00  2
    2021-01-02 0 2021-01-02 00:00:00  3
               1 2021-01-02 08:00:00  4
               2 2021-01-02 16:00:00  5
from_dummies
Added new function from_dummies() to convert a dummy coded DataFrame into a categorical DataFrame.
    In [11]: import pandas as pd

    In [12]: df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
       ....:                    "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
       ....:                    "col2_c": [0, 0, 1]})
       ....:

    In [13]: pd.from_dummies(df, sep="_")
    Out[13]:
      col1 col2
    0    a    b
    1    b    a
    2    a    c
Writing to ORC files
The new method DataFrame.to_orc() allows writing to ORC files (GH 43864).
This functionality depends on the pyarrow library. For more details, see the IO docs on ORC.
Warning
It is highly recommended to install pyarrow using conda because of some issues caused by pyarrow.
to_orc() requires pyarrow>=7.0.0.
to_orc() is not supported on Windows yet; you can find valid environments on install optional dependencies.
For supported dtypes please refer to supported ORC features in Arrow.
Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
    df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
    df.to_orc("./out.orc")
Reading directly from TAR archives
I/O methods like read_csv() or DataFrame.to_json() now allow reading and writing directly on TAR archives (GH 44787).
    df = pd.read_csv("./movement.tar.gz")
    # ...
    df.to_csv("./out.tar.gz")
This supports .tar, .tar.gz, .tar.bz2 and .tar.xz archives. The compression method used is inferred from the filename. If the compression method cannot be inferred, use the compression argument:
    df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"})  # noqa F821
(mode being one of tarfile.open’s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
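A round trip through a TAR archive can be sketched as follows. This is a minimal illustration, assuming pandas 1.5 or later; the temporary directory and the file name out.tar.gz are chosen only for the example, and the compression is inferred from the .tar.gz suffix.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # The ".tar.gz" suffix lets pandas infer a gzip-compressed TAR archive.
    path = os.path.join(tmp_dir, "out.tar.gz")
    df.to_csv(path, index=False)

    # Reading infers the same compression from the file name.
    restored = pd.read_csv(path)

print(restored)
```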
read_xml now supports dtype, converters, and parse_dates
Similar to other IO methods, pandas.read_xml() now supports assigning specific dtypes to columns, applying converter methods, and parsing dates (GH 43567).
    In [14]: from io import StringIO

    In [15]: xml_dates = """<?xml version='1.0' encoding='utf-8'?>
       ....: <data>
       ....:   <row>
       ....:     <shape>square</shape>
       ....:     <degrees>00360</degrees>
       ....:     <sides>4.0</sides>
       ....:     <date>2020-01-01</date>
       ....:   </row>
       ....:   <row>
       ....:     <shape>circle</shape>
       ....:     <degrees>00360</degrees>
       ....:     <sides/>
       ....:     <date>2021-01-01</date>
       ....:   </row>
       ....:   <row>
       ....:     <shape>triangle</shape>
       ....:     <degrees>00180</degrees>
       ....:     <sides>3.0</sides>
       ....:     <date>2022-01-01</date>
       ....:   </row>
       ....: </data>"""
       ....:

    In [16]: df = pd.read_xml(
       ....:     StringIO(xml_dates),
       ....:     dtype={'sides': 'Int64'},
       ....:     converters={'degrees': str},
       ....:     parse_dates=['date']
       ....: )
       ....:

    In [17]: df
    Out[17]:
          shape degrees  sides       date
    0    square   00360      4 2020-01-01
    1    circle   00360   <NA> 2021-01-01
    2  triangle   00180      3 2022-01-01

    In [18]: df.dtypes
    Out[18]:
    shape              object
    degrees            object
    sides               Int64
    date       datetime64[ns]
    dtype: object
read_xml now supports large XML using iterparse
For very large XML files that can range in hundreds of megabytes to gigabytes, pandas.read_xml() now supports parsing such sizeable files using lxml’s iterparse and etree’s iterparse, which are memory-efficient methods to iterate through XML trees and extract specific elements and attributes without holding the entire tree in memory (GH 45442).
    In [1]: df = pd.read_xml(
    ...      "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
    ...      iterparse = {"page": ["title", "ns", "id"]}
    ...  )

    In [2]: df
    Out[2]:
                                                         title   ns        id
    0                                       Gettysburg Address    0     21450
    1                                                Main Page    0     42950
    2                            Declaration by United Nations    0      8435
    3             Constitution of the United States of America    0      8435
    4                     Declaration of Independence (Israel)    0     17858
    ...                                                    ...  ...       ...
    3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
    3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
    3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
    3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
    3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

    [3578765 rows x 3 columns]
Copy on Write
A new feature copy_on_write was added (GH 46958). Copy on write ensures that any DataFrame or Series derived from another in any way always behaves as a copy. Copy on write disallows updating any other object than the object the method was applied to.
Copy on write can be enabled through either of the following:
    pd.set_option("mode.copy_on_write", True)
    pd.options.mode.copy_on_write = True
Alternatively, copy on write can be enabled locally through:
    with pd.option_context("mode.copy_on_write", True):
        ...
Without copy on write, the parent DataFrame is updated when updating a child DataFrame that was derived from this DataFrame.
    In [19]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})

    In [20]: view = df["foo"]

    In [21]: view.iloc[0] = 10

    In [22]: df
    Out[22]:
       foo  bar
    0   10    1
    1    2    1
    2    3    1
With copy on write enabled, df won’t be updated anymore:
    In [23]: with pd.option_context("mode.copy_on_write", True):
       ....:     df = pd.DataFrame({"foo": [1, 2, 3], "bar": 1})
       ....:     view = df["foo"]
       ....:     view.iloc[0] = 10
       ....:     df
       ....:
A more detailed explanation can be found here.
Other enhancements
Series.map() now raises when arg is dict but na_action is not either None or 'ignore' (GH 46588)
MultiIndex.to_frame()
now supports the argument allow_duplicates
and raises on duplicate labels if it is missing or False (GH 45245)
StringArray
now accepts array-likes containing nan-likes (None
, np.nan
) for the values
parameter in its constructor in addition to strings and pandas.NA
. (GH 40839)
Improved the rendering of categories
in CategoricalIndex
(GH 45218)
DataFrame.plot()
will now allow the subplots
parameter to be a list of iterables specifying column groups, so that columns may be grouped together in the same subplot (GH 29688).
to_numeric()
now preserves float64 arrays when downcasting would generate values not representable in float32 (GH 43693)
Series.reset_index()
and DataFrame.reset_index()
now support the argument allow_duplicates
(GH 44410)
DataFrameGroupBy.min(), SeriesGroupBy.min(), DataFrameGroupBy.max(), and SeriesGroupBy.max() now support Numba execution with the engine keyword (GH 45428)
read_csv()
now supports defaultdict
as a dtype
parameter (GH 41574)
DataFrame.rolling()
and Series.rolling()
now support a step
parameter with fixed-length windows (GH 15354)
Implemented a bool
-dtype Index
, passing a bool-dtype array-like to pd.Index
will now retain bool
dtype instead of casting to object
(GH 45061)
Implemented a complex-dtype Index
, passing a complex-dtype array-like to pd.Index
will now retain complex dtype instead of casting to object
(GH 45845)
Series and DataFrame with IntegerDtype now support bitwise operations (GH 34463)
Add milliseconds
field support for DateOffset
(GH 43371)
DataFrame.where()
tries to maintain dtype of DataFrame
if fill value can be cast without loss of precision (GH 45582)
DataFrame.reset_index()
now accepts a names
argument which renames the index names (GH 6878)
concat()
now raises when levels
is given but keys
is None (GH 46653)
concat()
now raises when levels
contains duplicate values (GH 46653)
Added numeric_only
argument to DataFrame.corr()
, DataFrame.corrwith()
, DataFrame.cov()
, DataFrame.idxmin()
, DataFrame.idxmax()
, DataFrameGroupBy.idxmin()
, DataFrameGroupBy.idxmax()
, DataFrameGroupBy.var()
, SeriesGroupBy.var()
, DataFrameGroupBy.std()
, SeriesGroupBy.std()
, DataFrameGroupBy.sem()
, SeriesGroupBy.sem()
, and DataFrameGroupBy.quantile()
(GH 46560)
An errors.PerformanceWarning is now thrown when using string[pyarrow] dtype with methods that don’t dispatch to pyarrow.compute methods (GH 42613, GH 46725)
Added validate
argument to DataFrame.join()
(GH 46622)
Added numeric_only
argument to Resampler.sum()
, Resampler.prod()
, Resampler.min()
, Resampler.max()
, Resampler.first()
, and Resampler.last()
(GH 46442)
times
argument in ExponentialMovingWindow
now accepts np.timedelta64
(GH 47003)
DataError
, SpecificationError
, SettingWithCopyError
, SettingWithCopyWarning
, NumExprClobberingError
, UndefinedVariableError
, IndexingError
, PyperclipException
, PyperclipWindowsException
, CSSWarning
, PossibleDataLossError
, ClosedFileError
, IncompatibilityWarning
, AttributeConflictWarning
, DatabaseError
, PossiblePrecisionLoss
, ValueLabelTypeMismatch
, InvalidColumnName
, and CategoricalConversionWarning
are now exposed in pandas.errors
(GH 27656)
Added check_like
argument to testing.assert_series_equal()
(GH 47247)
Add support for DataFrameGroupBy.ohlc()
and SeriesGroupBy.ohlc()
for extension array dtypes (GH 37493)
Allow reading compressed SAS files with read_sas()
(e.g., .sas7bdat.gz
files)
pandas.read_html()
now supports extracting links from table cells (GH 13141)
DatetimeIndex.astype()
now supports casting timezone-naive indexes to datetime64[s]
, datetime64[ms]
, and datetime64[us]
, and timezone-aware indexes to the corresponding datetime64[unit, tzname]
dtypes (GH 47579)
Series
reducers (e.g. min
, max
, sum
, mean
) will now successfully operate when the dtype is numeric and numeric_only=True
is provided; previously this would raise a NotImplementedError
(GH 47500)
RangeIndex.union() now can return a RangeIndex instead of an Int64Index if the resulting values are equally spaced (GH 47557, GH 43885)
DataFrame.compare() now accepts an argument result_names to allow the user to specify the names used in the result for the left and right DataFrame being compared. These are 'self' and 'other' by default (GH 44354)
DataFrame.quantile()
gained a method
argument that can accept table
to evaluate multi-column quantiles (GH 43881)
Interval
now supports checking whether one interval is contained by another interval (GH 46613)
Added copy
keyword to Series.set_axis()
and DataFrame.set_axis()
to allow user to set axis on a new object without necessarily copying the underlying data (GH 47932)
The method ExtensionArray.factorize()
accepts use_na_sentinel=False
for determining how null values are to be treated (GH 46601)
The Dockerfile
now installs a dedicated pandas-dev
virtual environment for pandas development instead of using the base
environment (GH 48427)
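A few of the enhancements listed above can be sketched together. This is a minimal illustration, assuming pandas 1.5 or later; the data and the names "label", "left" and "right" are made up for the example.

```python
import pandas as pd

ser = pd.Series(range(6))

# rolling() with step=2 evaluates every second fixed-length window.
stepped = ser.rolling(window=2, step=2).sum()
print(stepped)

# reset_index(names=...) names the column created from the index.
df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
flat = df.reset_index(names="label")
print(flat)

# compare(result_names=...) replaces the default 'self'/'other' labels.
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"a": [1, 3]})
diff = left.compare(right, result_names=("left", "right"))
print(diff)
```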
These are bug fixes that might have notable behavior changes.
Using dropna=True with groupby transforms
A transform is an operation whose result has the same size as its input. When the result is a DataFrame or Series, it is also required that the index of the result matches that of the input. In pandas 1.4, using DataFrameGroupBy.transform() or SeriesGroupBy.transform() with null values in the groups and dropna=True gave incorrect results. As the examples below demonstrate, the incorrect results either contained incorrect values, or did not have the same index as the input.
In [24]: df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})
Old behavior:
    In [3]: # Value in the last row should be np.nan
            df.groupby('a', dropna=True).transform('sum')
    Out[3]:
       b
    0  5
    1  5
    2  5

    In [3]: # Should have one additional row with the value np.nan
            df.groupby('a', dropna=True).transform(lambda x: x.sum())
    Out[3]:
       b
    0  5
    1  5

    In [3]: # The value in the last row is np.nan interpreted as an integer
            df.groupby('a', dropna=True).transform('ffill')
    Out[3]:
                         b
    0                    2
    1                    3
    2 -9223372036854775808

    In [3]: # Should have one additional row with the value np.nan
            df.groupby('a', dropna=True).transform(lambda x: x)
    Out[3]:
       b
    0  2
    1  3
New behavior:
    In [25]: df.groupby('a', dropna=True).transform('sum')
    Out[25]:
         b
    0  5.0
    1  5.0
    2  NaN

    In [26]: df.groupby('a', dropna=True).transform(lambda x: x.sum())
    Out[26]:
         b
    0  5.0
    1  5.0
    2  NaN

    In [27]: df.groupby('a', dropna=True).transform('ffill')
    Out[27]:
         b
    0  2.0
    1  3.0
    2  NaN

    In [28]: df.groupby('a', dropna=True).transform(lambda x: x)
    Out[28]:
         b
    0  2.0
    1  3.0
    2  NaN
Serializing tz-naive Timestamps with to_json() with iso_dates=True
DataFrame.to_json(), Series.to_json(), and Index.to_json() would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps to UTC. (GH 38760)
Note that this patch does not fix the localization of tz-aware Timestamps to UTC upon serialization. (Related issue GH 12997)
Old Behavior
    In [32]: index = pd.date_range(
       ....:     start='2020-12-28 00:00:00',
       ....:     end='2020-12-28 02:00:00',
       ....:     freq='1H',
       ....: )
       ....:

    In [33]: a = pd.Series(
       ....:     data=range(3),
       ....:     index=index,
       ....: )
       ....:

    In [4]: from io import StringIO

    In [5]: a.to_json(date_format='iso')
    Out[5]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'

    In [6]: pd.read_json(StringIO(a.to_json(date_format='iso')), typ="series").index == a.index
    Out[6]: array([False, False, False])
New Behavior
    In [34]: from io import StringIO

    In [35]: a.to_json(date_format='iso')
    Out[35]: '{"2020-12-28T00:00:00.000":0,"2020-12-28T01:00:00.000":1,"2020-12-28T02:00:00.000":2}'

    # Roundtripping now works
    In [36]: pd.read_json(StringIO(a.to_json(date_format='iso')), typ="series").index == a.index
    Out[36]: array([ True,  True,  True])
DataFrameGroupBy.value_counts with non-grouping categorical columns and observed=True
Calling DataFrameGroupBy.value_counts() with observed=True would incorrectly drop non-observed categories of non-grouping columns (GH 46357).
    In [6]: df = pd.DataFrame(["a", "b", "c"], dtype="category").iloc[0:2]

    In [7]: df
    Out[7]:
       0
    0  a
    1  b
Old Behavior
    In [8]: df.groupby(level=0, observed=True).value_counts()
    Out[8]:
    0  a    1
    1  b    1
    dtype: int64
New Behavior
    In [9]: df.groupby(level=0, observed=True).value_counts()
    Out[9]:
    0  a    1
    1  a    0
       b    1
    0  b    0
       c    0
    1  c    0
    dtype: int64
Backwards incompatible API changes
Increased minimum versions for dependencies
Some minimum supported versions of dependencies were updated. If installed, we now require:
    Package         Minimum Version  Required  Changed
    numpy           1.20.3           X         X
    mypy (dev)      0.971                      X
    beautifulsoup4  4.9.3                      X
    blosc           1.21.0                     X
    bottleneck      1.3.2                      X
    fsspec          2021.07.0                  X
    hypothesis      6.13.0                     X
    gcsfs           2021.07.0                  X
    jinja2          3.0.0                      X
    lxml            4.6.3                      X
    numba           0.53.1                     X
    numexpr         2.7.3                      X
    openpyxl        3.0.7                      X
    pandas-gbq      0.15.0                     X
    psycopg2        2.8.6                      X
    pymysql         1.0.2                      X
    pyreadstat      1.1.2                      X
    pyxlsb          1.0.8                      X
    s3fs            2021.08.0                  X
    scipy           1.7.1                      X
    sqlalchemy      1.4.16                     X
    tabulate        0.8.9                      X
    xarray          0.19.0                     X
    xlsxwriter      1.4.3                      X
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
    Package         Minimum Version  Changed
    beautifulsoup4  4.9.3            X
    blosc           1.21.0           X
    bottleneck      1.3.2            X
    brotlipy        0.7.0
    fastparquet     0.4.0
    fsspec          2021.08.0        X
    html5lib        1.1
    hypothesis      6.13.0           X
    gcsfs           2021.08.0        X
    jinja2          3.0.0            X
    lxml            4.6.3            X
    matplotlib      3.3.2
    numba           0.53.1           X
    numexpr         2.7.3            X
    odfpy           1.4.1
    openpyxl        3.0.7            X
    pandas-gbq      0.15.0           X
    psycopg2        2.8.6            X
    pyarrow         1.0.1
    pymysql         1.0.2            X
    pyreadstat      1.1.2            X
    pytables        3.6.1
    python-snappy   0.6.0
    pyxlsb          1.0.8            X
    s3fs            2021.08.0        X
    scipy           1.7.1            X
    sqlalchemy      1.4.16           X
    tabulate        0.8.9            X
    tzdata          2022a
    xarray          0.19.0           X
    xlrd            2.0.1
    xlsxwriter      1.4.3            X
    xlwt            1.3.0
    zstandard       0.15.2
See Dependencies and Optional dependencies for more.
Other API changes
BigQuery I/O methods read_gbq() and DataFrame.to_gbq() default to auth_local_webserver = True. Google has deprecated the auth_local_webserver = False “out of band” (copy-paste) flow. The auth_local_webserver = False option is planned to stop working in October 2022. (GH 46312)
read_json() now raises FileNotFoundError (previously ValueError) when input is a string ending in .json, .json.gz, .json.bz2, etc. but no such file exists. (GH 29102)
Operations with Timestamp or Timedelta that would previously raise OverflowError instead raise OutOfBoundsDatetime or OutOfBoundsTimedelta where appropriate (GH 47268)
When read_sas() previously returned None, it now returns an empty DataFrame (GH 47410)
DataFrame constructor raises if index or columns arguments are sets (GH 47215)
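Two of the API changes above can be observed directly. This is a minimal sketch, assuming pandas 1.5 or later; the file name is a made-up placeholder that is assumed not to exist, and the exception type shown for the set case (ValueError) reflects the pandas implementation as far as known rather than anything stated in these notes.

```python
import pandas as pd

# A string ending in ".json" that names no existing file raises FileNotFoundError.
try:
    pd.read_json("surely_missing_file_for_this_demo.json")
    read_json_error = None
except FileNotFoundError as exc:
    read_json_error = type(exc).__name__

# Sets are unordered, so the DataFrame constructor rejects them as index/columns.
try:
    pd.DataFrame({"a": [1, 2]}, index={"x", "y"})
    set_index_error = None
except ValueError as exc:
    set_index_error = type(exc).__name__

print(read_json_error, set_index_error)
```

Passing a list (for example `sorted(my_set)`) instead of the set gives the constructor a well-defined order.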
Warning
In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation such as making the standard library zoneinfo the default timezone implementation instead of pytz, having the Index support all data types instead of having multiple subclasses (CategoricalIndex, Int64Index, etc.), and more. The changes under consideration are logged in this GitHub issue, and any feedback or concerns are welcome.
In a future version, integer slicing on a Series with an Int64Index or RangeIndex will be treated as label-based, not positional. This will make the behavior consistent with other Series.__getitem__() and Series.__setitem__() behaviors (GH 45162).
For example:
    In [29]: ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])
In the old behavior, ser[2:4] treats the slice as positional:
Old behavior:
    In [3]: ser[2:4]
    Out[3]:
    5    3
    7    4
    dtype: int64
In a future version, this will be treated as label-based:
Future behavior:
    In [4]: ser.loc[2:4]
    Out[4]:
    2    1
    3    2
    dtype: int64
To retain the old behavior, use series.iloc[i:j]. To get the future behavior, use series.loc[i:j].
Slicing on a DataFrame will not be affected.
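Writing the indexer explicitly removes the ambiguity regardless of pandas version. The following sketch reuses the series from the example above; the variable names by_position and by_label are illustrative only.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

# Positional: elements at positions 2 and 3 (end exclusive).
by_position = ser.iloc[2:4]

# Label-based: labels from 2 through 4 (end inclusive); only 2 and 3 exist.
by_label = ser.loc[2:4]

print(by_position)
print(by_label)
```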
ExcelWriter attributes
All attributes of ExcelWriter were previously documented as not public. However some third party Excel engines documented accessing ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these and possibly other attributes. Previously these attributes were not safe to use; e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets and conversely. In order to support this, pandas has made some attributes public and improved their implementations so that they may now be safely used. (GH 45572)
The following attributes are now public and considered safe to access.
book
check_extension
close
date_format
datetime_format
engine
if_sheet_exists
sheets
supported_extensions
The following attributes have been deprecated. They now raise a FutureWarning
when accessed and will be removed in a future version. Users should be aware that their usage is considered unsafe, and can lead to unexpected results.
cur_sheet
handles
path
save
write_cells
See the documentation of ExcelWriter for further details.
group_keys with transformers in DataFrameGroupBy.apply() and SeriesGroupBy.apply()
In previous versions of pandas, if it was inferred that the function passed to DataFrameGroupBy.apply() or SeriesGroupBy.apply() was a transformer (i.e. the resulting index was equal to the input index), the group_keys argument of DataFrame.groupby() and Series.groupby() was ignored and the group keys would never be added to the index of the result. In the future, the group keys will be added to the index when the user specifies group_keys=True.
As group_keys=True is the default value of DataFrame.groupby() and Series.groupby(), not specifying group_keys with a transformer will raise a FutureWarning. This can be silenced and the previous behavior retained by specifying group_keys=False.
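The difference between the two settings can be sketched with an identity function, which is a transformer because its result has the same index as its input. This is a minimal illustration, assuming pandas 1.5 or later with group_keys passed explicitly; the series, its labels, and the grouping keys are made up for the example.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=["r0", "r1", "r2"])
keys = ["x", "x", "y"]

# group_keys=True: the group keys become an extra outer index level.
with_keys = ser.groupby(keys, group_keys=True).apply(lambda s: s)

# group_keys=False: the result keeps the original index.
without_keys = ser.groupby(keys, group_keys=False).apply(lambda s: s)

print(with_keys)
print(without_keys)
```

Passing group_keys explicitly in either form avoids relying on the default, which is what triggers the FutureWarning.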
loc and iloc
Most of the time setting values with DataFrame.iloc() attempts to set values inplace, only falling back to inserting a new array if necessary. There are some cases where this rule is not followed, for example when setting an entire column from an array with different dtype:
    In [30]: df = pd.DataFrame({'price': [11.1, 12.2]}, index=['book1', 'book2'])

    In [31]: original_prices = df['price']

    In [32]: new_prices = np.array([98, 99])
Old behavior:
    In [3]: df.iloc[:, 0] = new_prices

    In [4]: df.iloc[:, 0]
    Out[4]:
    book1    98
    book2    99
    Name: price, dtype: int64

    In [5]: original_prices
    Out[5]:
    book1    11.1
    book2    12.2
    Name: price, dtype: float64
This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace.
Future behavior:
    In [3]: df.iloc[:, 0] = new_prices

    In [4]: df.iloc[:, 0]
    Out[4]:
    book1    98.0
    book2    99.0
    Name: price, dtype: float64

    In [5]: original_prices
    Out[5]:
    book1    98.0
    book2    99.0
    Name: price, dtype: float64
To get the old behavior, use DataFrame.__setitem__() directly:
    In [3]: df[df.columns[0]] = new_prices

    In [4]: df.iloc[:, 0]
    Out[4]:
    book1    98
    book2    99
    Name: price, dtype: int64

    In [5]: original_prices
    Out[5]:
    book1    11.1
    book2    12.2
    Name: price, dtype: float64
To get the old behavior when df.columns is not unique and you want to change a single column by index, you can use DataFrame.isetitem(), which has been added in pandas 1.5:
    In [3]: df_with_duplicated_cols = pd.concat([df, df], axis='columns')

    In [3]: df_with_duplicated_cols.isetitem(0, new_prices)

    In [4]: df_with_duplicated_cols.iloc[:, 0]
    Out[4]:
    book1    98
    book2    99
    Name: price, dtype: int64

    In [5]: original_prices
    Out[5]:
    book1    11.1
    book2    12.2
    Name: price, dtype: float64
numeric_only default value
Across the DataFrame, DataFrameGroupBy, and Resampler operations such as min, sum, and idxmax, the default value of the numeric_only argument, if it exists at all, was inconsistent. Furthermore, operations with the default value None can lead to surprising results. (GH 46560)
    In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

    In [2]: # Reading the next line without knowing the contents of df, one would
            # expect the result to contain the products for both columns a and b.
            df[["a", "b"]].prod()
    Out[2]:
    a    2
    dtype: int64
To avoid this behavior, specifying the value numeric_only=None has been deprecated, and will be removed in a future version of pandas. In the future, all operations with a numeric_only argument will default to False. Users should either call the operation only with columns that can be operated on, or specify numeric_only=True to operate only on Boolean, integer, and float columns.
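The two recommended alternatives can be sketched with the frame from the example above. This is a minimal illustration; the variable names selected and numeric are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Option 1: call the operation only on columns it can operate on.
selected = df[["a"]].prod()

# Option 2: state explicitly that only numeric columns should be used.
numeric = df.prod(numeric_only=True)

print(selected)
print(numeric)
```

Either form makes the dropped column visible in the code instead of leaving it to the default.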
In order to support the transition to the new behavior, the following methods have gained the numeric_only argument.
DataFrame.rolling() operations
DataFrame.expanding() operations
DataFrame.ewm() operations
Deprecated the keyword line_terminator in DataFrame.to_csv() and Series.to_csv(), use lineterminator instead; this is for consistency with read_csv() and the standard library ‘csv’ module (GH 9568)
Deprecated behavior of SparseArray.astype()
, Series.astype()
, and DataFrame.astype()
with SparseDtype
when passing a non-sparse dtype
. In a future version, this will cast to that non-sparse dtype instead of wrapping it in a SparseDtype
(GH 34457)
Deprecated behavior of DatetimeIndex.intersection()
and DatetimeIndex.symmetric_difference()
(union
behavior was already deprecated in version 1.3.0) with mixed time zones; in a future version both will be cast to UTC instead of object dtype (GH 39328, GH 45357)
Deprecated DataFrame.iteritems()
, Series.iteritems()
, HDFStore.iteritems()
in favor of DataFrame.items()
, Series.items()
, HDFStore.items()
(GH 45321)
Deprecated Series.is_monotonic()
and Index.is_monotonic()
in favor of Series.is_monotonic_increasing()
and Index.is_monotonic_increasing()
(GH 45422, GH 21335)
Deprecated behavior of DatetimeIndex.astype()
, TimedeltaIndex.astype()
, PeriodIndex.astype()
when converting to an integer dtype other than int64
. In a future version, these will convert to exactly the specified dtype (instead of always int64
) and will raise if the conversion overflows (GH 45034)
Deprecated the __array_wrap__
method of DataFrame and Series, rely on standard numpy ufuncs instead (GH 45451)
Deprecated treating float-dtype data as wall-times when passed with a timezone to Series
or DatetimeIndex
(GH 45573)
Deprecated the behavior of Series.fillna()
and DataFrame.fillna()
with timedelta64[ns]
dtype and incompatible fill value; in a future version this will cast to a common dtype (usually object) instead of raising, matching the behavior of other dtypes (GH 45746)
Deprecated the warn
parameter in infer_freq()
(GH 45947)
Deprecated allowing non-keyword arguments in ExtensionArray.argsort()
(GH 46134)
Deprecated treating all-bool object
-dtype columns as bool-like in DataFrame.any()
and DataFrame.all()
with bool_only=True
, explicitly cast to bool instead (GH 46188)
Deprecated the behavior of DataFrame.quantile(): in a future version the numeric_only argument will default to False, so datetime/timedelta columns will be included in the result (GH 7308).
Deprecated Timedelta.freq
and Timedelta.is_populated
(GH 46430)
Deprecated Timedelta.delta
(GH 46476)
Deprecated passing arguments as positional in DataFrame.any()
and Series.any()
(GH 44802)
Deprecated passing positional arguments to DataFrame.pivot()
and pivot()
except data
(GH 30228)
Deprecated the methods DataFrame.mad()
, Series.mad()
, and the corresponding groupby methods (GH 11787)
Deprecated positional arguments to Index.join()
except for other
, use keyword-only arguments instead of positional arguments (GH 46518)
Deprecated positional arguments to StringMethods.rsplit()
and StringMethods.split()
except for pat
, use keyword-only arguments instead of positional arguments (GH 47423)
Deprecated indexing on a timezone-naive DatetimeIndex
using a string representing a timezone-aware datetime (GH 46903, GH 36148)
Deprecated allowing unit="M"
or unit="Y"
in Timestamp
constructor with a non-round float value (GH 47267)
Deprecated the display.column_space
global configuration option (GH 7576)
Deprecated the argument na_sentinel
in factorize()
, Index.factorize()
, and ExtensionArray.factorize()
; pass use_na_sentinel=True
instead to use the sentinel -1
for NaN values and use_na_sentinel=False
instead of na_sentinel=None
to encode NaN values (GH 46910)
Deprecated DataFrameGroupBy.transform()
not aligning the result when the UDF returned DataFrame (GH 45648)
Clarified warning from to_datetime() when delimited dates can’t be parsed in accordance to specified dayfirst argument (GH 46210)
Emit warning from to_datetime() when delimited dates can’t be parsed in accordance to specified dayfirst argument even for dates where leading zero is omitted (e.g. 31/1/2001) (GH 47880)
Deprecated Series and Resampler reducers (e.g. min, max, sum, mean) raising a NotImplementedError when the dtype is non-numeric and numeric_only=True is provided; this will raise a TypeError in a future version (GH 47500)
Deprecated Series.rank()
returning an empty result when the dtype is non-numeric and numeric_only=True
is provided; this will raise a TypeError
in a future version (GH 47500)
Deprecated argument errors for Series.mask(), Series.where(), DataFrame.mask(), and DataFrame.where() as errors had no effect on these methods (GH 47728)
Deprecated arguments *args
and **kwargs
in Rolling
, Expanding
, and ExponentialMovingWindow
ops. (GH 47836)
Deprecated the inplace
keyword in Categorical.set_ordered()
, Categorical.as_ordered()
, and Categorical.as_unordered()
(GH 37643)
Deprecated setting a categorical’s categories with cat.categories = ['a', 'b', 'c'], use Categorical.rename_categories() instead (GH 37643)
Deprecated unused arguments encoding
and verbose
in Series.to_excel()
and DataFrame.to_excel()
(GH 47912)
Deprecated the inplace
keyword in DataFrame.set_axis()
and Series.set_axis()
, use obj = obj.set_axis(..., copy=False)
instead (GH 48130)
Deprecated producing a single element when iterating over a DataFrameGroupBy
or a SeriesGroupBy
that has been grouped by a list of length 1; A tuple of length one will be returned instead (GH 42795)
Fixed up warning message of deprecation of MultiIndex.lexsort_depth() as public method, as the message previously referred to MultiIndex.is_lexsorted() instead (GH 38701)
Deprecated the sort_columns
argument in DataFrame.plot()
and Series.plot()
(GH 47563).
Deprecated positional arguments for all but the first argument of DataFrame.to_stata()
and read_stata()
, use keyword arguments instead (GH 48128).
Deprecated the mangle_dupe_cols
argument in read_csv()
, read_fwf()
, read_table()
and read_excel()
. The argument was never implemented, and a new argument where the renaming pattern can be specified will be added instead (GH 47718)
Deprecated allowing dtype='datetime64'
or dtype=np.datetime64
in Series.astype()
, use "datetime64[ns]" instead (GH 47844)
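A few of the deprecations above have a direct replacement. The following is a minimal sketch of three of them, using made-up data; the variable names are illustrative only. The copy=False keyword mentioned for set_axis is left out so that the sketch does not depend on that keyword being available.

```python
import pandas as pd

# Rename categories through the accessor method instead of assigning
# to `cat.categories` (GH 37643).
ser = pd.Series(["a", "b", "a"], dtype="category")
renamed = ser.cat.rename_categories(["x", "y"])

# Reassign the result of set_axis instead of passing inplace=True (GH 48130).
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df = df.set_axis(["left", "right"], axis=1)

# Spell out the unit instead of passing the bare "datetime64" (GH 47844).
stamps = pd.Series(["2022-09-19", "2022-09-20"]).astype("datetime64[ns]")

print(renamed.tolist(), list(df.columns), stamps.dtype)
```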
Performance improvement in DataFrame.corrwith()
for column-wise (axis=0) Pearson and Spearman correlation when other is a Series
(GH 46174)
Performance improvement in DataFrameGroupBy.transform()
and SeriesGroupBy.transform()
for some user-defined DataFrame -> Series functions (GH 45387)
Performance improvement in DataFrame.duplicated()
when subset consists of only one column (GH 45236)
Performance improvement in DataFrameGroupBy.diff()
and SeriesGroupBy.diff()
(GH 16706)
Performance improvement in DataFrameGroupBy.transform()
and SeriesGroupBy.transform()
when broadcasting values for user-defined functions (GH 45708)
Performance improvement in DataFrameGroupBy.transform()
and SeriesGroupBy.transform()
for user-defined functions when only a single group exists (GH 44977)
Performance improvement in DataFrameGroupBy.apply()
and SeriesGroupBy.apply()
when grouping on a non-unique unsorted index (GH 46527)
Performance improvement in DataFrame.loc()
and Series.loc()
for tuple-based indexing of a MultiIndex
(GH 45681, GH 46040, GH 46330)
Performance improvement in DataFrameGroupBy.var()
and SeriesGroupBy.var()
with ddof
other than one (GH 48152)
Performance improvement in DataFrame.to_records()
when the index is a MultiIndex
(GH 47263)
Performance improvement in MultiIndex.values
when the MultiIndex contains levels of type DatetimeIndex, TimedeltaIndex or ExtensionDtypes (GH 46288)
Performance improvement in merge()
when left and/or right are empty (GH 45838)
Performance improvement in DataFrame.join()
when left and/or right are empty (GH 46015)
Performance improvement in DataFrame.reindex()
and Series.reindex()
when target is a MultiIndex
(GH 46235)
Performance improvement when setting values in a pyarrow backed string array (GH 46400)
Performance improvement in factorize()
(GH 46109)
Performance improvement in DataFrame
and Series
constructors for extension dtype scalars (GH 45854)
Performance improvement in read_excel()
when nrows
argument provided (GH 32727)
Performance improvement in Styler.to_excel()
when applying repeated CSS formats (GH 47371)
Performance improvement in MultiIndex.is_monotonic_increasing()
(GH 47458)
Performance improvement in BusinessHour
str
and repr
(GH 44764)
Performance improvement in datetime arrays string formatting when one of the default strftime formats "%Y-%m-%d %H:%M:%S"
or "%Y-%m-%d %H:%M:%S.%f"
is used. (GH 44764)
Performance improvement in Series.to_sql()
and DataFrame.to_sql()
(SQLiteTable
) when processing time arrays. (GH 44764)
Performance improvement to read_sas()
(GH 47404)
Performance improvement in argmax
and argmin
for arrays.SparseArray
(GH 34197)
Bug in Categorical.view()
not accepting integer dtypes (GH 25464)
Bug in CategoricalIndex.union()
when the index's categories are integer-dtype and the index contains NaN
values incorrectly raising instead of casting to float64
(GH 45362)
Bug in concat()
when concatenating two (or more) unordered CategoricalIndex
variables, whose categories are permutations, yields incorrect index values (GH 24845)
Bug in DataFrame.quantile()
with datetime-like dtypes and no rows incorrectly returning float64
dtype instead of retaining datetime-like dtype (GH 41544)
Bug in to_datetime()
with sequences of np.str_
objects incorrectly raising (GH 32264)
Bug in Timestamp
construction when passing datetime components as positional arguments and tzinfo
as a keyword argument incorrectly raising (GH 31929)
Bug in Index.astype()
when casting from object dtype to timedelta64[ns]
dtype incorrectly casting np.datetime64("NaT")
values to np.timedelta64("NaT")
instead of raising (GH 45722)
Bug in SeriesGroupBy.value_counts()
index when passing categorical column (GH 44324)
Bug in DatetimeIndex.tz_localize()
localizing to UTC failing to make a copy of the underlying data (GH 46460)
Bug in DatetimeIndex.resolution()
incorrectly returning "day" instead of "nanosecond" for nanosecond-resolution indexes (GH 46903)
Bug in Timestamp
with an integer or float value and unit="Y"
or unit="M"
giving slightly-wrong results (GH 47266)
Bug in DatetimeArray
construction when passed another DatetimeArray
and freq=None
incorrectly inferring the freq from the given array (GH 47296)
Bug in to_datetime()
where OutOfBoundsDatetime
would be thrown even if errors=coerce
if there were more than 50 rows (GH 45319)
Bug when adding a DateOffset
to a Series
would not add the nanoseconds
field (GH 47856)
Bug in astype_nansafe()
where astype("timedelta64[ns]") failed when np.nan was included (GH 45798)
Bug in constructing a Timedelta
with a np.timedelta64
object and a unit
sometimes silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta
(GH 46827)
Bug in constructing a Timedelta
from a large integer or float with unit="W"
silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta
(GH 47268)
Bug in operations with array-likes with dtype="boolean"
and NA
incorrectly altering the array in-place (GH 45421)
Bug in arithmetic operations with nullable types without NA
values not matching the same operation with non-nullable types (GH 48223)
Bug in floordiv
when dividing by IntegerDtype
0
would return 0
instead of inf
(GH 48223)
Bug in division, pow
and mod
operations on array-likes with dtype="boolean"
not being like their np.bool_
counterparts (GH 46063)
Bug in multiplying a Series
with IntegerDtype
or FloatingDtype
by an array-like with timedelta64[ns]
dtype incorrectly raising (GH 45622)
Bug in mean()
where the optional dependency bottleneck
causes precision loss linear in the length of the array. bottleneck
has been disabled for mean()
improving the loss to log-linear but may result in a performance decrease. (GH 42878)
Bug in DataFrame.astype()
not preserving subclasses (GH 40810)
Bug in constructing a Series
from a float-containing list or a floating-dtype ndarray-like (e.g. dask.Array
) and an integer dtype raising instead of casting like we would with an np.ndarray
(GH 40110)
Bug in Float64Index.astype()
to unsigned integer dtype incorrectly casting to np.int64
dtype (GH 45309)
Bug in Series.astype()
and DataFrame.astype()
from floating dtype to unsigned integer dtype failing to raise in the presence of negative values (GH 45151)
Bug in array()
with FloatingDtype
and values containing float-castable strings incorrectly raising (GH 45424)
Bug when comparing string and datetime64[ns] objects causing OverflowError
exception. (GH 45506)
Bug in metaclass of generic abstract dtypes causing DataFrame.apply()
and Series.apply()
to raise for the built-in function type
(GH 46684)
Bug in DataFrame.to_records()
returning inconsistent numpy types if the index was a MultiIndex
(GH 47263)
Bug in DataFrame.to_dict()
for orient="list"
or orient="index"
was not returning native types (GH 46751)
Bug in DataFrame.apply()
that returns a DataFrame
instead of a Series
when applied to an empty DataFrame
and axis=1
(GH 39111)
Bug when inferring the dtype from an iterable that is not a NumPy ndarray
consisting of all NumPy unsigned integer scalars did not result in an unsigned integer dtype (GH 47294)
Bug in DataFrame.eval()
when pandas objects (e.g. 'Timestamp'
) were column names (GH 44603)
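Two of the conversion fixes above can be checked with a short script. This is a sketch with made-up data, assuming a pandas version that includes the fixes for GH 46751 and GH 45151; it only observes behaviour and is not taken from the pandas test suite.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# orient="list" is expected to hand back plain Python scalars (GH 46751).
as_lists = df.to_dict(orient="list")
first = as_lists["a"][0]
print(type(first))

# Casting negative floats to an unsigned integer dtype is expected
# to raise rather than wrap around silently (GH 45151).
try:
    pd.Series([-1.0, 2.0]).astype("uint8")
    raised = False
except ValueError:
    raised = True
print(raised)
```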
Bug in str.startswith()
and str.endswith()
when using another Series as the pat parameter; this now raises TypeError
(GH 3485)
Bug in Series.str.zfill()
when strings contain leading signs, padding "0" before the sign character rather than after it, as str.zfill
from the standard library does (GH 20868)
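The zfill fix means the Series method should agree with Python's built-in str.zfill on signed strings. A small sketch comparing the two, with illustrative input values:

```python
import pandas as pd

values = ["-5", "+7", "12", "-123456"]

# The vectorised method on the Series ...
padded = pd.Series(values).str.zfill(4)

# ... compared with the standard-library method applied element by element.
expected = [s.zfill(4) for s in values]

print(padded.tolist())
print(expected)
```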
Bug in IntervalArray.__setitem__()
when setting np.nan
into an integer-backed array raising ValueError
instead of TypeError
(GH 45484)
Bug in IntervalDtype
when using datetime64[ns, tz] as a dtype string (GH 46999)
Bug in DataFrame.iloc()
where indexing a single row on a DataFrame
with a single ExtensionDtype column gave a copy instead of a view on the underlying data (GH 45241)
Bug in DataFrame.__getitem__()
returning copy when DataFrame
has duplicated columns even if a unique column is selected (GH 45316, GH 41062)
Bug in Series.align()
does not create MultiIndex
with union of levels when both MultiIndexes intersections are identical (GH 45224)
Bug in setting a NA value (None
or np.nan
) into a Series
with int-based IntervalDtype
incorrectly casting to object dtype instead of a float-based IntervalDtype
(GH 45568)
Bug in indexing setting values into an ExtensionDtype
column with df.iloc[:, i] = values
with values
having the same dtype as df.iloc[:, i]
incorrectly inserting a new array instead of setting in-place (GH 33457)
Bug in Series.__setitem__()
with a non-integer Index
when using an integer key to set a value that cannot be set inplace where a ValueError
was raised instead of casting to a common dtype (GH 45070)
Bug in DataFrame.loc()
not casting None
to NA
when setting value as a list into DataFrame
(GH 47987)
Bug in Series.__setitem__()
when setting incompatible values into a PeriodDtype
or IntervalDtype
Series
raising when indexing with a boolean mask but coercing when indexing with otherwise-equivalent indexers; these now consistently coerce, along with Series.mask()
and Series.where()
(GH 45768)
Bug in DataFrame.where()
with multiple columns with datetime-like dtypes failing to downcast results consistent with other dtypes (GH 45837)
Bug in isin()
upcasting to float64
with unsigned integer dtype and list-like argument without a dtype (GH 46485)
Bug in Series.loc.__setitem__()
and Series.loc.__getitem__()
not raising when using multiple keys without using a MultiIndex
(GH 13831)
Bug in Index.reindex()
raising AssertionError
when level
was specified but no MultiIndex
was given; level is ignored now (GH 35132)
Bug when setting a value too large for a Series
dtype failing to coerce to a common type (GH 26049, GH 32878)
Bug in loc.__setitem__()
treating range
keys as positional instead of label-based (GH 45479)
Bug in DataFrame.__setitem__()
casting extension array dtypes to object when setting with a scalar key and DataFrame
as value (GH 46896)
Bug in Series.__setitem__()
when setting a scalar to a nullable pandas dtype would not raise a TypeError
if the scalar could not be cast (losslessly) to the nullable type (GH 45404)
Bug in Series.__setitem__()
when setting boolean
dtype values containing NA
incorrectly raising instead of casting to boolean
dtype (GH 45462)
Bug in Series.loc()
raising with boolean indexer containing NA
when Index
did not match (GH 46551)
Bug in Series.__setitem__()
where setting NA
into a numeric-dtype Series
would incorrectly upcast to object-dtype rather than treating the value as np.nan
(GH 44199)
Bug in DataFrame.loc()
when setting values to a column and right hand side is a dictionary (GH 47216)
Bug in Series.__setitem__()
with datetime64[ns]
dtype, an all-False
boolean mask, and an incompatible value incorrectly casting to object
instead of retaining datetime64[ns]
dtype (GH 45967)
Bug in Index.__getitem__()
raising ValueError
when indexer is from boolean dtype with NA
(GH 45806)
Bug in Series.__setitem__()
losing precision when enlarging Series
with scalar (GH 32346)
Bug in Series.mask()
with inplace=True
or setting values with a boolean mask with small integer dtypes incorrectly raising (GH 45750)
Bug in DataFrame.mask()
with inplace=True
and ExtensionDtype
columns incorrectly raising (GH 45577)
Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (GH 42950)
Bug in DataFrame.__getattribute__()
raising AttributeError
if columns have "string"
dtype (GH 46185)
Bug in DataFrame.compare()
returning all NaN
column when comparing extension array dtype and numpy dtype (GH 44014)
Bug in DataFrame.where()
setting wrong values with "boolean"
mask for numpy dtype (GH 44014)
Bug in indexing on a DatetimeIndex
with a np.str_
key incorrectly raising (GH 45580)
Bug in CategoricalIndex.get_indexer()
when index contains NaN
values, resulting in elements that are in target but not present in the index to be mapped to the index of the NaN element, instead of -1 (GH 45361)
Bug in setting large integer values into Series
with float32
or float16
dtype incorrectly altering these values instead of coercing to float64
dtype (GH 45844)
Bug in Series.asof()
and DataFrame.asof()
incorrectly casting bool-dtype results to float64
dtype (GH 16063)
Bug in NDFrame.xs()
, DataFrame.iterrows()
, DataFrame.loc()
and DataFrame.iloc()
not always propagating metadata (GH 28283)
Bug in DataFrame.sum()
where min_count changed the dtype if the input contained NaNs (GH 46947)
Bug in IntervalTree
that led to an infinite recursion (GH 46658)
Bug in PeriodIndex
raising AttributeError
when indexing on NA
, rather than putting NaT
in its place. (GH 46673)
Bug in DataFrame.at()
would allow the modification of multiple columns (GH 48296)
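Two of the indexing fixes above lend themselves to a quick check. The sketch below uses made-up data and assumes a pandas version that includes the fixes for GH 45361 and GH 44199.

```python
import numpy as np
import pandas as pd

# Targets that are absent from the index are expected to map to -1,
# not to the position of the NaN entry (GH 45361).
idx = pd.CategoricalIndex(["a", "b", np.nan])
positions = idx.get_indexer(["a", "c"])
print(positions)

# Setting pd.NA into a float Series is expected to store NaN and keep
# the float dtype instead of upcasting to object (GH 44199).
ser = pd.Series([1.0, 2.0, 3.0])
ser[0] = pd.NA
print(ser.dtype)
```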
Bug in Series.fillna()
and DataFrame.fillna()
with downcast
keyword not being respected in some cases where there are no NA values present (GH 45423)
Bug in Series.fillna()
and DataFrame.fillna()
with IntervalDtype
and incompatible value raising instead of casting to a common (usually object) dtype (GH 45796)
Bug in Series.map()
not respecting na_action
argument if mapper is a dict
or Series
(GH 47527)
Bug in DataFrame.interpolate()
with object-dtype column not returning a copy with inplace=False
(GH 45791)
Bug in DataFrame.dropna()
allowing both of the incompatible arguments how
and thresh
to be set (GH 46575)
Bug in DataFrame.fillna()
ignored axis
when DataFrame
is single block (GH 47713)
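The dropna change above (GH 46575) can be illustrated as follows. This is a sketch with made-up data: passing both how and thresh is expected to raise, while either argument on its own keeps working.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# Passing both arguments together is expected to raise a TypeError.
try:
    df.dropna(how="all", thresh=1)
    raised = False
except TypeError:
    raised = True

# Each argument on its own is still accepted.
by_how = df.dropna(how="all")      # drop rows where every value is missing
by_thresh = df.dropna(thresh=1)    # keep rows with at least one non-missing value

print(raised, len(by_how), len(by_thresh))
```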
Bug in DataFrame.loc()
returning empty result when slicing a MultiIndex
with a negative step size and non-null start/stop values (GH 46156)
Bug in DataFrame.loc()
raising when slicing a MultiIndex
with a negative step size other than -1 (GH 46156)
Bug in DataFrame.loc()
raising when slicing a MultiIndex
with a negative step size and slicing a non-int labeled index level (GH 46156)
Bug in Series.to_numpy()
where multiindexed Series could not be converted to numpy arrays when an na_value
was supplied (GH 45774)
Bug in MultiIndex.equals
not commutative when only one side has extension array dtype (GH 46026)
Bug in MultiIndex.from_tuples()
cannot construct Index of empty tuples (GH 45608)
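As an illustration of the Series.to_numpy() fix above (GH 45774), the sketch below converts a Series with a MultiIndex while supplying an na_value. The data and labels are made up.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)])
ser = pd.Series([1.0, np.nan], index=index)

# Missing entries are replaced by na_value in the returned array.
arr = ser.to_numpy(na_value=0.0)
print(arr)
```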
Bug in DataFrame.to_stata()
where no error is raised if the DataFrame
contains -np.inf
(GH 45350)
Bug in read_excel()
results in an infinite loop with certain skiprows
callables (GH 45585)
Bug in DataFrame.info()
where a new line at the end of the output is omitted when called on an empty DataFrame
(GH 45494)
Bug in read_csv()
not recognizing line break for on_bad_lines="warn"
for engine="c"
(GH 41710)
Bug in DataFrame.to_csv()
not respecting float_format
for Float64
dtype (GH 45991)
Bug in read_csv()
not respecting a specified converter to index columns in all cases (GH 40589)
Bug in read_csv()
interpreting second row as Index
names even when index_col=False
(GH 46569)
Bug in read_parquet()
when engine="pyarrow"
which caused partial write to disk when column of unsupported datatype was passed (GH 44914)
Bug in DataFrame.to_excel()
and ExcelWriter
would raise when writing an empty DataFrame to a .ods
file (GH 45793)
Bug in read_csv()
ignoring non-existing header row for engine="python"
(GH 47400)
Bug in read_excel()
raising uncontrolled IndexError
when header
references non-existing rows (GH 43143)
Bug in read_html()
where elements surrounding <br>
were joined without a space between them (GH 29528)
Bug in read_csv()
when data is longer than header leading to issues with callables in usecols
expecting strings (GH 46997)
Bug in Parquet roundtrip for Interval dtype with datetime64[ns]
subtype (GH 45881)
Bug in read_excel()
when reading a .ods
file with newlines between xml elements (GH 45598)
Bug in read_parquet()
when engine="fastparquet"
where the file was not closed on error (GH 46555)
DataFrame.to_html()
now excludes the border
attribute from <table>
elements when border
keyword is set to False
.
Bug in read_sas()
with certain types of compressed SAS7BDAT files (GH 35545)
Bug in read_excel()
not forward filling MultiIndex
when no names were given (GH 47487)
Bug in read_sas()
returned None
rather than an empty DataFrame for SAS7BDAT files with zero rows (GH 18198)
Bug in DataFrame.to_string()
using wrong missing value with extension arrays in MultiIndex
(GH 47986)
Bug in StataWriter
where value labels were always written with default encoding (GH 46750)
Bug in StataWriterUTF8
where some valid characters were removed from variable names (GH 47276)
Bug in DataFrame.to_excel()
when writing an empty dataframe with MultiIndex
(GH 19543)
Bug in read_sas()
with RLE-compressed SAS7BDAT files that contain 0x40 control bytes (GH 31243)
Bug in read_sas()
that scrambled column names (GH 31243)
Bug in read_sas()
with RLE-compressed SAS7BDAT files that contain 0x00 control bytes (GH 47099)
Bug in read_parquet()
with use_nullable_dtypes=True
where float64
dtype was returned instead of nullable Float64
dtype (GH 45694)
Bug in DataFrame.to_json()
where PeriodDtype
would not make the serialization roundtrip when read back with read_json()
(GH 44720)
Bug in read_xml()
when reading XML files with Chinese character tags, which would raise XMLSyntaxError
(GH 47902)
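Among the I/O fixes above, the float_format change for the nullable Float64 dtype (GH 45991) is easy to observe because to_csv() returns a string when no path is given. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"x": pd.array([1.23456, 2.5], dtype="Float64")})

# With the fix, float_format is applied to the nullable Float64 column too.
csv_text = df.to_csv(float_format="%.2f")
print(csv_text)
```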
Bug in subtraction of Period
from PeriodArray
returning wrong results (GH 45999)
Bug in Period.strftime()
and PeriodIndex.strftime()
, directives %l
and %u
were giving wrong results (GH 46252)
Bug in inferring an incorrect freq
when passing a string to Period
with microseconds that are a multiple of 1000 (GH 46811)
Bug in constructing a Period
from a Timestamp
or np.datetime64
object with non-zero nanoseconds and freq="ns"
incorrectly truncating the nanoseconds (GH 46811)
Bug in adding np.timedelta64("NaT", "ns")
to a Period
with a timedelta-like freq incorrectly raising IncompatibleFrequency
instead of returning NaT
(GH 47196)
Bug in adding an array of integers to an array with PeriodDtype
giving incorrect results when dtype.freq.n > 1
(GH 47209)
Bug in subtracting a Period
from an array with PeriodDtype
returning incorrect results instead of raising OverflowError
when the operation overflows (GH 47538)
Bug in DataFrame.plot.barh()
that prevented labeling the x-axis and xlabel
updating the y-axis label (GH 45144)
Bug in DataFrame.plot.box()
that prevented labeling the x-axis (GH 45463)
Bug in DataFrame.boxplot()
that prevented passing in xlabel
and ylabel
(GH 45463)
Bug in DataFrame.boxplot()
that prevented specifying vert=False
(GH 36918)
Bug in DataFrame.plot.scatter()
that prevented specifying norm
(GH 45809)
Fix showing "None" as ylabel in Series.plot()
when not setting ylabel (GH 46129)
Bug in DataFrame.plot()
that led to xticks and vertical grids being improperly placed when plotting a quarterly series (GH 47602)
Bug in DataFrame.plot()
that prevented setting y-axis label, limits and ticks for a secondary y-axis (GH 47753)
Bug in DataFrame.resample()
ignoring closed="right"
on TimedeltaIndex
(GH 45414)
Bug in DataFrameGroupBy.transform()
fails when func="size"
and the input DataFrame has multiple columns (GH 27469)
Bug in DataFrameGroupBy.size()
and DataFrameGroupBy.transform()
with func="size"
produced incorrect results when axis=1
(GH 45715)
Bug in ExponentialMovingWindow.mean()
with axis=1
and engine='numba'
when the DataFrame
has more columns than rows (GH 46086)
Bug when using engine="numba"
would return the same jitted function when modifying engine_kwargs
(GH 46086)
Bug in DataFrameGroupBy.transform()
fails when axis=1
and func
is "first"
or "last"
(GH 45986)
Bug in DataFrameGroupBy.cumsum()
with skipna=False
giving incorrect results (GH 46216)
Bug in DataFrameGroupBy.sum()
, SeriesGroupBy.sum()
, DataFrameGroupBy.prod()
, SeriesGroupBy.prod()
, DataFrameGroupBy.cumsum()
, and SeriesGroupBy.cumsum()
with integer dtypes losing precision (GH 37493)
Bug in DataFrameGroupBy.cumsum()
and SeriesGroupBy.cumsum()
with timedelta64[ns]
dtype failing to recognize NaT
as a null value (GH 46216)
Bug in DataFrameGroupBy.cumsum()
and SeriesGroupBy.cumsum()
with integer dtypes causing overflows when sum was bigger than maximum of dtype (GH 37493)
Bug in DataFrameGroupBy.cummin()
, SeriesGroupBy.cummin()
, DataFrameGroupBy.cummax()
and SeriesGroupBy.cummax()
with nullable dtypes incorrectly altering the original data in place (GH 46220)
Bug in DataFrame.groupby()
raising error when None
is in first level of MultiIndex
(GH 47348)
Bug in DataFrameGroupBy.cummax()
and SeriesGroupBy.cummax()
with int64
dtype with leading value being the smallest possible int64 (GH 46382)
Bug in DataFrameGroupBy.cumprod()
and SeriesGroupBy.cumprod()
NaN
influences calculation in different columns with skipna=False
(GH 48064)
Bug in DataFrameGroupBy.max()
and SeriesGroupBy.max()
with empty groups and uint64
dtype incorrectly raising RuntimeError
(GH 46408)
Bug in DataFrameGroupBy.apply()
and SeriesGroupBy.apply()
would fail when func
was a string and args or kwargs were supplied (GH 46479)
Bug in SeriesGroupBy.apply()
would incorrectly name its result when there was a unique group (GH 46369)
Bug in Rolling.sum()
and Rolling.mean()
would give incorrect result with window of same values (GH 42064, GH 46431)
Bug in Rolling.var()
and Rolling.std()
would give non-zero result with window of same values (GH 42064)
Bug in Rolling.skew()
and Rolling.kurt()
would give NaN with window of same values (GH 30993)
Bug in Rolling.var()
would segfault calculating weighted variance when window size was larger than data size (GH 46760)
Bug in Grouper.__repr__()
where dropna
was not included. Now it is (GH 46754)
Bug in DataFrame.rolling()
gives ValueError when center=True, axis=1 and win_type is specified (GH 46135)
Bug in DataFrameGroupBy.describe()
and SeriesGroupBy.describe()
produces inconsistent results for empty datasets (GH 41575)
Bug in DataFrame.resample()
reduction methods when used with on
would attempt to aggregate the provided column (GH 47079)
Bug in DataFrame.groupby()
and Series.groupby()
would not respect dropna=False
when the input DataFrame/Series had NaN values in a MultiIndex
(GH 46783)
Bug in DataFrameGroupBy.resample()
raises KeyError
when getting the result from a key list which misses the resample key (GH 47362)
Bug in DataFrame.groupby()
would lose index columns when the DataFrame is empty for transforms, like fillna (GH 47787)
Bug in DataFrame.groupby()
and Series.groupby()
with dropna=False
and sort=False
would put any null groups at the end instead of in the order in which they are encountered (GH 46584)
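Two of the groupby and rolling fixes above can be observed directly. The sketch below uses made-up data and assumes a pandas version that includes the fixes for GH 46584 and GH 42064.

```python
import pandas as pd

# With dropna=False and sort=False, the null group is expected to appear
# where it is first encountered rather than at the end (GH 46584).
df = pd.DataFrame({"key": ["b", None, "a", "b"], "val": [1, 2, 3, 4]})
totals = df.groupby("key", dropna=False, sort=False)["val"].sum()
print(totals)

# A rolling variance over a window of identical values is expected
# to be zero (GH 42064).
flat_var = pd.Series([3.0] * 6).rolling(3).var()
print(flat_var.tolist())
```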
Bug in concat()
between a Series
with integer dtype and another with CategoricalDtype
with integer categories and containing NaN
values casting to object dtype instead of float64
(GH 45359)
Bug in get_dummies()
that selected object and categorical dtypes but not string (GH 44965)
Bug in DataFrame.align()
when aligning a MultiIndex
to a Series
with another MultiIndex
(GH 46001)
Bug in concatenation with IntegerDtype
, or FloatingDtype
arrays where the resulting dtype did not mirror the behavior of the non-nullable dtypes (GH 46379)
Bug in concat()
losing dtype of columns when join="outer"
and sort=True
(GH 47329)
Bug in concat()
not sorting the column names when None
is included (GH 47331)
Bug in concat()
with identical key leads to error when indexing MultiIndex
(GH 46519)
Bug in pivot_table()
raising TypeError
when dropna=True
and aggregation column has extension array dtype (GH 47477)
Bug in merge()
raising error for how="cross"
when using FIPS
mode in ssl library (GH 48024)
Bug in DataFrame.join()
with a list when using suffixes to join DataFrames with duplicate column names (GH 46396)
Bug in DataFrame.pivot_table()
with sort=False
results in sorted index (GH 17041)
Bug in concat()
when axis=1
and sort=False
where the resulting Index was an Int64Index
instead of a RangeIndex
(GH 46675)
Bug in wide_to_long()
raises when stubnames
is missing in columns and i
contains string dtype column (GH 46044)
Bug in DataFrame.join()
with categorical index results in unexpected reordering (GH 47812)
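The get_dummies() fix above (GH 44965) means columns with the "string" dtype are encoded alongside object and categorical columns. A minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "color": pd.array(["red", "blue", "red"], dtype="string"),
        "n": [1, 2, 3],
    }
)

# The string column is expanded into indicator columns; the numeric
# column is passed through unchanged.
dummies = pd.get_dummies(df)
print(list(dummies.columns))
```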
Bug in Series.where()
and DataFrame.where()
with SparseDtype
failing to retain the array's fill_value
(GH 45691)
Bug in SparseArray.unique()
fails to keep original elements order (GH 47809)
Bug in IntegerArray.searchsorted()
and FloatingArray.searchsorted()
returning inconsistent results when acting on np.nan
(GH 45255)
Bug when attempting to apply styling functions to an empty DataFrame subset (GH 45313)
Bug in CSSToExcelConverter
leading to TypeError
when border color provided without border style for xlsxwriter
engine (GH 42276)
Bug in Styler.set_sticky()
leading to white text on white background in dark mode (GH 46984)
Bug in Styler.to_latex()
causing UnboundLocalError
when clines="all;data"
and the DataFrame
has no rows. (GH 47203)
Bug in Styler.to_excel()
when using vertical-align: middle;
with xlsxwriter
engine (GH 30107)
Bug when applying styles to a DataFrame with boolean column labels (GH 47838)
Fixed metadata propagation in DataFrame.melt()
(GH 28283)
Fixed metadata propagation in DataFrame.explode()
(GH 28283)
Bug in assert_index_equal()
with names=True
and check_order=False
not checking names (GH 47328)
A total of 271 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.
Aadharsh Acharya +
Aadharsh-Acharya +
Aadhi Manivannan +
Adam Bowden
Aditya Agarwal +
Ahmed Ibrahim +
Alastair Porter +
Alex Povel +
Alex-Blade
Alexandra Sciocchetti +
AlonMenczer +
Andras Deak +
Andrew Hawyrluk
Andy Grigg +
Aneta Kahleová +
Anthony Givans +
Anton Shevtsov +
B. J. Potter +
BarkotBeyene +
Ben Beasley +
Ben Wozniak +
Bernhard Wagner +
Boris Rumyantsev
Brian Gollop +
CCXXXI +
Chandrasekaran Anirudh Bhardwaj +
Charles Blackmon-Luca +
Chris Moradi +
ChrisAlbertsen +
Compro Prasad +
DaPy15
Damian Barabonkov +
Daniel I +
Daniel Isaac +
Daniel Schmidt
Danil Iashchenko +
Dare Adewumi
Dennis Chukwunta +
Dennis J. Gray +
Derek Sharp +
Dhruv Samdani +
Dimitra Karadima +
Dmitry Savostyanov +
Dmytro Litvinov +
Do Young Kim +
Dries Schaumont +
Edward Huang +
Eirik +
Ekaterina +
Eli Dourado +
Ezra Brauner +
Fabian Gabel +
FactorizeD +
Fangchen Li
Francesco Romandini +
Greg Gandenberger +
Guo Ci +
Hiroaki Ogasawara
Hood Chatham +
Ian Alexander Joiner +
Irv Lustig
Ivan Ng +
JHM Darbyshire
JHM Darbyshire (MBP)
JHM Darbyshire (iMac)
JMBurley
Jack Goldsmith +
James Freeman +
James Lamb
James Moro +
Janosh Riebesell
Jarrod Millman
Jason Jia +
Jeff Reback
Jeremy Tuloup +
Johannes Mueller
John Bencina +
John Mantios +
John Zangwill
Jon Bramley +
Jonas Haag
Jordan Hicks
Joris Van den Bossche
Jose Ortiz +
JosephParampathu +
José Duarte
Julian Steger +
Kai Priester +
Kapil E. Iyer +
Karthik Velayutham +
Kashif Khan
Kazuki Igeta +
Kevin Jan Anker +
Kevin Sheppard
Khor Chean Wei
Kian Eliasi
Kian S +
Kim, KwonHyun +
Kinza-Raza +
Konjeti Maruthi +
Leonardus Chen
Linxiao Francis Cong +
Loïc Estève
LucasG0 +
Lucy Jiménez +
Luis Pinto
Luke Manley
Marc Garcia
Marco Edward Gorelli
Marco Gorelli
MarcoGorelli
Margarete Dippel +
Mariam-ke +
Martin Fleischmann
Marvin John Walter +
Marvin Walter +
Mateusz
Matilda M +
Matthew Roeschke
Matthias Bussonnier
MeeseeksMachine
Mehgarg +
Melissa Weber Mendonça +
Michael Milton +
Michael Wang
Mike McCarty +
Miloni Atal +
Mitlasóczki Bence +
Moritz Schreiber +
Morten Canth Hels +
Nick Crews +
NickFillot +
Nicolas Hug +
Nima Sarang
Noa Tamir +
Pandas Development Team
Parfait Gasana
Parthi +
Partho +
Patrick Hoefler
Peter
Peter Hawkins +
Philipp A
Philipp Schaefer +
Pierrot +
Pratik Patel +
Prithvijit
Purna Chandra Mansingh +
Radoslaw Lemiec +
RaphSku +
Reinert Huseby Karlsen +
Richard Shadrach
Richard Shadrach +
Robbie Palmer
Robert de Vries
Roger +
Roger Murray +
Ruizhe Deng +
SELEE +
Sachin Yadav +
Saiwing Yeung +
Sam Rao +
Sandro Casagrande +
Sebastiaan Vermeulen +
Shaghayegh +
Shantanu +
Shashank Shet +
Shawn Zhong +
Shuangchi He +
Simon Hawkins
Simon Knott +
Solomon Song +
Somtochi Umeh +
Stefan Krawczyk +
Stefanie Molin
Steffen Rehberg
Steven Bamford +
Steven Rotondo +
Steven Schaerer
Sylvain MARIE +
Sylvain Marié
Tarun Raghunandan Kaushik +
Taylor Packard +
Terji Petersen
Thierry Moisan
Thomas Grainger
Thomas Hunter +
Thomas Li
Tim McFarland +
Tim Swast
Tim Yang +
Tobias Pitters
Tom Aarsen +
Tom Augspurger
Torsten Wörtwein
TraverseTowner +
Tyler Reddy
Valentin Iovene
Varun Sharma +
Vasily Litvinov
Venaturum
Vinicius Akira Imaizumi +
Vladimir Fokow +
Wenjun Si
Will Lachance +
William Andrea
Wolfgang F. Riedl +
Xingrong Chen
Yago González
Yikun Jiang +
Yuanhao Geng
Yuval +
Zero
Zhengfei Wang +
abmyii
alexondor +
alm
andjhall +
anilbey +
arnaudlegout +
asv-bot +
ateki +
auderson +
bherwerth +
bicarlsen +
carbonleakage +
charles +
charlogazzo +
code-review-doctor +
dataxerik +
deponovo
dimitra-karadima +
dospix +
ehallam +
ehsan shirvanian +
ember91 +
eshirvana
fractionalhare +
gaotian98 +
gesoos
github-actions[bot]
gunghub +
hasan-yaman
iansheng +
iasoon +
jbrockmendel
joshuabello2550 +
jyuv +
kouya takahashi +
mariana-LJ +
matt +
mattB1989 +
nealxm +
partev
poloso +
realead
roib20 +
rtpsw
ryangilmour +
shourya5 +
srotondo +
stanleycai95 +
staticdev +
tehunter +
theidexisted +
tobias.pitters +
uncjackg +
vernetya
wany-oh +
wfr +
z3c0 +