Apache Arrow has native support for storing UTF-8 data, and work is ongoing
to add kernels (e.g. str.isupper()) to operate on that data. This issue
is to discuss how we can expose the native string dtype to pandas users.
There are several things to discuss:
The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).
StringArray is marked as experimental, so our usual API-breaking restrictions
don't apply. But we want to do this in a way that's not too disruptive.
There are three ways to get a StringDtype-dtype array today:
- pd.array(['a', 'b', None])
- dtype=pd.StringDtype()
- dtype="string"
My preference is for all of these to stay consistent: they should all give
either a StringArray backed by an object-dtype ndarray or a StringArray backed
by Arrow memory.
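For reference, the three spellings can be compared directly with today's
(object-backed) implementation; all three should report the same dtype no
matter which backend ends up storing the data:

```python
import pandas as pd

# The three spellings from the list above. Whichever backend is active,
# these should all produce the same StringDtype.
inferred = pd.array(["a", "b", None])
explicit = pd.array(["a", "b", None], dtype=pd.StringDtype())
by_name = pd.array(["a", "b", None], dtype="string")

print(inferred.dtype, explicit.dtype, by_name.dtype)
```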
I also have a preference for not keeping our old implementation around for too
long. So I don't think we want something like pd.PythonStringDtype() as a way
to get the StringArray backed by an object-dtype ndarray.
The easiest way to support this is, I think, an option:
>>> pd.options.mode.use_arrow_string_dtype = True
Then all of those would create an Arrow-backed StringArray.
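A minimal sketch of what option-driven construction could look like. Note that
the option name, classes, and factory function here are all hypothetical
stand-ins, not actual pandas API:

```python
# Hypothetical global option; in pandas this would live in pd.options.mode.
OPTIONS = {"mode.use_arrow_string_dtype": False}


class ObjectBackedStringArray:
    """Stand-in for today's StringArray over an object-dtype ndarray."""
    def __init__(self, values):
        self._data = list(values)


class ArrowBackedStringArray:
    """Stand-in for a StringArray over Arrow UTF-8 memory."""
    def __init__(self, values):
        self._data = list(values)


def string_array(values):
    # The constructor consults the option once, at construction time,
    # so pd.array(...), dtype=pd.StringDtype(), and dtype="string"
    # all stay consistent with each other.
    if OPTIONS["mode.use_arrow_string_dtype"]:
        return ArrowBackedStringArray(values)
    return ObjectBackedStringArray(values)


arr = string_array(["a", "b", None])        # object-backed by default
OPTIONS["mode.use_arrow_string_dtype"] = True
arrow_arr = string_array(["a", "b", None])  # now Arrow-backed
```

Keeping the dispatch at construction time means the rest of pandas never needs
to check the option; it just works with whichever array it was handed.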
Fallback Mode

It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does
>>> Series(['a', 'b'], dtype="string").str.normalize() # no arrow kernel
we have a few options: raise, or fall back to the object-dtype implementation
for that operation.
I'm not sure which is best. My preference for now is probably to raise, but I could see doing either.
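The fall-back option could be sketched as a try/except around the Arrow
kernel, dropping to element-wise Python when the kernel is missing. The
arrow_normalize function below is a stand-in for a kernel that doesn't exist
yet; everything here is illustrative, not pandas internals:

```python
import unicodedata


def arrow_normalize(values, form):
    # Stand-in for an Arrow compute kernel that hasn't been implemented.
    raise NotImplementedError("no Arrow kernel for normalize")


def str_normalize(values, form="NFC"):
    try:
        return arrow_normalize(values, form)
    except NotImplementedError:
        # Fallback mode: do the work element-wise in Python, as the
        # object-backed StringArray does today. The alternative design is
        # to re-raise here and make the user convert explicitly.
        return [unicodedata.normalize(form, v) if v is not None else None
                for v in values]


# Decomposed "a" + combining acute accent normalizes to a single "á";
# missing values pass through untouched.
result = str_normalize(["a\u0301", None])
```

The trade-off: falling back silently keeps code working but hides a large
performance cliff (Arrow memory must round-trip through Python objects), which
is an argument for raising instead.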