Showing content from https://github.com/pandas-dev/pandas/issues/54792 below:

new default String dtype (pyarrow-backed, numpy NaN semantics) · Issue #54792 · pandas-dev/pandas · GitHub

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Description

Overview of work for the future string dtype (PDEP-14).

Main implementation:

Implement the object-dtype based fallback:
- String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451
Rename the storage options (from the PDEP: storage="pyarrow_numpy" to storage="pyarrow", na_value=np.nan)
- String dtype: rename the storage options and add na_value keyword in StringDtype() #59330
- Update the tests to stop using "pyarrow_numpy" -> TST (string dtype): remove usage of 'string[pyarrow_numpy]' alias #59758
- Deprecate the "pyarrow_numpy" storage option -> String dtype: deprecate the pyarrow_numpy storage option #60152
- String dtype: restrict options.mode.string_storage to python|pyarrow (remove pyarrow_numpy) #59376
Change the string alias to "str" for the NaN-variant of the dtype
- String dtype: use 'str' string alias and representation for NaN-variant of the dtype #59388
Ensure dtype=str / astype(str) works as an alias when the future mode is enabled (and any other alias that currently converts the input to strings, like "U"?)
String dtype: avoid surfacing pyarrow exception in binary operations #59610

Testing related:

Set up test build with future.infer_string enabled:
Get all tests passing with future.infer_string (tackle all xfails / TODO(infer_string) tests)
- TST (string dtype): clean-up xpasssing tests with future string dtype #59323
- TST (string dtype): fix groupby xfails with using_infer_string + update error message #59430
- TST (string dtype): un-xfail string tests specific to object dtype #59433
- Fix tests:
  - IO - SQL
  - IO - pytables
  - IO - parser
  - IO - parquet/feather/orc: String dtype: fix pyarrow-based IO + update tests #59478
  - ...

Open design questions / behaviour changes to implement:

Known bugs that need to be fixed:

Documentation:

List and document all breaking / behaviour changes in upgrade guide
- String dtype: overview of breaking behaviour changes #59328
- DOC: add pandas 3.0 migration guide for the string dtype #61705
Update the user guide page on text data
Update the installation instructions to recommend always installing pyarrow

[original issue body]

With PDEP-10 (#52711, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.

For pandas 2.1, an option was added to opt in to this future default data type. Once it is enabled, the various ways to construct a DataFrame (type inference in the constructor, IO methods) use the new string dtype by default:

>>> pd.options.future.infer_string = True

>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string

>>> pd.Series(["a", "b", None]).dtype
string[pyarrow_numpy]

This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings

One aspect that was discussed after the PDEP (mostly at the sprint, I think; creating this issue for a better public record of it) is that a data type that would become the default in pandas 3.0 (which otherwise still uses all numpy dtypes with numpy NaN missing value semantics) should probably also still use the same default semantics, and result in numpy data types when doing operations on the string column that produce a boolean or numeric result (e.g. .str.startswith(..), .str.len(..), .str.count(), etc., or comparison operators like ==).
(This way, a user only gets an ArrowDtype column when explicitly asking for one, and not by default through using the default string dtype.)

To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, to add another variant which uses StringDtype(storage="pyarrow_numpy") instead of ArrowDtype("string"). From the updated whatsnew: "This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator". Main PR:

Implement Arrow String Array that is compatible with NumPy semantics #54533

plus some follow-ups (#54720, #54585, #54591).

cc @pandas-dev/pandas-core
