Showing content from https://github.com/pandas-dev/pandas/issues/54792 below:

new default String dtype (pyarrow-backed, numpy NaN semantics) · Issue #54792 · pandas-dev/pandas · GitHub

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Description

Overview of work for the future string dtype (PDEP-14).

Main implementation:

Testing related:

Open design questions / behaviour changes to implement:

Known bugs that need to be fixed:

Documentation:

[original issue body]

With PDEP-10 (#52711, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.

For pandas 2.1, an option was added to enable this future default data type ahead of time. With it set, the various ways to construct a DataFrame (type inference in the constructor, IO methods) use the new string dtype by default:

>>> pd.options.future.infer_string = True

>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string

>>> pd.Series(["a", "b", None]).dtype
string[pyarrow_numpy]

This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings

One aspect that was discussed after the PDEP (mostly at the sprint, I think; I'm creating this issue for a better public record of it) is that a data type that would become the default in pandas 3.0 (which otherwise still uses numpy dtypes with numpy NaN missing value semantics throughout) should probably also follow those same default semantics. In particular, operations on a string column that produce a boolean or numeric result (e.g. .str.startswith(..), .str.len(..), .str.count(), etc., or comparison operators like ==) should return numpy data types.
(This way, a user only gets an ArrowDtype column when explicitly asking for one, and not by default through using the default string dtype.)
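As a rough illustration of the behaviour described above, the sketch below checks which dtypes come back from a few string operations when the future string dtype is enabled. The helper name `result_dtypes` and the sample values are made up for this example. The fallback branch is an assumption for environments where pyarrow is not installed or the option is unavailable. The input deliberately contains no missing values, since the result dtype for missing data may differ between pandas versions.

```python
import pandas as pd


def result_dtypes(values):
    """Collect the dtypes produced by a few string operations (hypothetical helper)."""
    ser = pd.Series(values)
    return {
        "source": ser.dtype,
        "startswith": ser.str.startswith("a").dtype,
        "len": ser.str.len().dtype,
        "eq": (ser == "apple").dtype,
    }


values = ["apple", "banana", "avocado"]  # no missing values on purpose

try:
    # Opt in to the future string dtype for the duration of the block.
    with pd.option_context("future.infer_string", True):
        dtypes = result_dtypes(values)
except (ImportError, KeyError):
    # Assumption: pyarrow is missing or the option does not exist in this
    # pandas version, so fall back to the default (object dtype) inference.
    dtypes = result_dtypes(values)

for name, dtype in dtypes.items():
    print(f"{name}: {dtype!r}")
```

If the proposal is implemented as described, the boolean and integer results should be plain numpy dtypes rather than ArrowDtype, whichever string storage ends up backing the source column.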

To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, to add another variant which uses StringDtype(storage="pyarrow_numpy") instead of ArrowDtype("string"). From the updated whatsnew: "This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator". Main PR:

plus some follow-ups (#54720, #54585, #54591).

cc @pandas-dev/pandas-core
