Overview of work for the future string dtype (PDEP-14).

Main implementation:
storage="pyarrow_numpy" to storage="pyarrow", na_value=np.nan
na_value keyword in StringDtype() (#59330)
"str" for the NaN-variant of the dtype
dtype=str / astype(str) works as an alias when the future mode is enabled (and any other alias which will currently convert the input to strings, like "U"?)

Testing related:
future.infer_string enabled:
future.infer_string (tackle all xfails / TODO(infer_string) tests)

Open design questions / behaviour changes to implement:
select_dtypes (#61916)

Known bugs that need to be fixed:
testing.assert_frame_equal unhelpful error message for string[pyarrow] (#54190)

Documentation:
[original issue body]
With PDEP-10 (#52711, https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html), we decided to start using pyarrow for the default string data type in pandas 3.0.
For pandas 2.1, an option was added to enable this future default data type ahead of time. With the option enabled, the various ways to construct a DataFrame (type inference in the constructor, IO methods) use the new string dtype by default:
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None])
0      a
1      b
2    NaN
dtype: string
>>> pd.Series(["a", "b", None]).dtype
string[pyarrow_numpy]
This is documented at https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#whatsnew-210-enhancements-infer-strings
One aspect that was discussed after the PDEP (mostly at the sprint, I think; creating this issue for a better public record of it) is that a data type that would become the default in pandas 3.0 (which otherwise still uses all numpy dtypes with numpy NaN missing value semantics) should probably also still use the same default semantics, and result in numpy data types when doing operations on the string column that produce a boolean or numeric data type (e.g. .str.startswith(..), .str.len(..), .str.count(), etc., or comparison operators like ==).
(This way, a user only gets an ArrowDtype column when explicitly asking for one, and not by default through using the default string dtype.)
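To make the distinction concrete, here is a minimal sketch that contrasts the two sets of semantics using dtypes that already exist in pandas: object dtype (NumPy semantics, NaN as the missing value) and the nullable "string" dtype (pd.NA, masked result dtypes). It does not use the new default dtype itself, and the variable names are only illustrative.

```python
import pandas as pd

values = ["a", "bb", None]

# Object dtype follows NumPy semantics: the missing value shows up as NaN,
# so the integer result of .str.len() is upcast to float64, and a
# comparison returns a plain NumPy bool column.
obj = pd.Series(values, dtype=object)
obj_len = obj.str.len()
obj_eq = obj == "a"

# The nullable "string" dtype propagates pd.NA instead and returns
# masked (nullable) result dtypes.
nullable = pd.Series(values, dtype="string")
nullable_len = nullable.str.len()
nullable_eq = nullable == "a"

print(obj_len.dtype, obj_eq.dtype)
print(nullable_len.dtype, nullable_eq.dtype)
```

The idea discussed here is that the default string dtype should behave like the first case for result dtypes, so that nullable or Arrow-backed result columns only appear when a user opts in.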
To achieve this, @phofl has done several PRs to refactor the current pyarrow-based string dtypes, to add another variant which uses StringDtype(storage="pyarrow_numpy") instead of ArrowDtype("string"). From the updated whatsnew: "This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator". Main PR:
plus some follow-ups (#54720, #54585, #54591).
cc @pandas-dev/pandas-core