Showing content from https://github.com/pandas-dev/pandas/issues/35169 below:

Plan for a native string dtype · Issue #35169 · pandas-dev/pandas · GitHub

Apache Arrow natively supports storing UTF-8 data, and work is ongoing to add
kernels (e.g. str.isupper()) that operate on that data. This issue
is to discuss how we can expose the native string dtype to pandas' users.

There are several things to discuss:

  1. How do users opt into this behavior?
  2. A fallback mode for kernels that are not implemented.

How do users opt into Arrow-backed StringArray?

The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).

StringArray is marked as experimental, so our usual restrictions on
API-breaking changes don't apply. But we want to do this in a way that's not
too disruptive.

There are three ways to get a StringDtype-dtype array today:

  1. Infer: pd.array(['a', 'b', None])
  2. Explicit dtype=pd.StringDtype()
  3. String alias dtype="string"
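
As a quick illustration of those three spellings, here is a minimal sketch. It assumes a pandas version that ships StringDtype (1.0 or later); the variable names are only for illustration, and the backing storage you see will depend on the installed version.

```python
import pandas as pd

data = ["a", "b", None]

# 1. Inference: pd.array chooses a string dtype for str-or-missing input.
inferred = pd.array(data)

# 2. Explicit dtype object.
explicit = pd.array(data, dtype=pd.StringDtype())

# 3. String alias.
aliased = pd.array(data, dtype="string")

for arr in (inferred, explicit, aliased):
    # Print the concrete array class and dtype each spelling produced.
    print(type(arr).__name__, arr.dtype)
```

Whatever the backing storage ends up being, all three spellings should keep landing on the same StringDtype family.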

My preference is for all of these to stay consistent: they should all give
either a StringArray backed by an object-dtype ndarray or a StringArray backed
by Arrow memory.

I also have a preference for not keeping around our old implementation for too
long. So I don't think we want something like pd.PythonStringDtype() as a
way to get the StringArray backed by an object-dtype ndarray.

The easiest way to support this is, I think, an option.

>>> pd.options.mode.use_arrow_string_dtype = True

Then all of those would create an Arrow-backed StringArray.

Fallback Mode

It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does

>>> pd.Series(['a', 'b'], dtype="string").str.normalize("NFC")  # no arrow kernel

we have a few options:

  1. Raise, stating that there's no kernel for normalize.
  2. Emit a PerformanceWarning, astype to object, do the operation, and
     convert back.

I'm not sure which is best. My preference for now is probably to raise, but I could see doing either.
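
To make the two options concrete, here is a rough pure-Python sketch of the dispatch. Everything in it is hypothetical and only for illustration: the `NATIVE_KERNELS` and `OBJECT_FALLBACKS` registries, the `apply_str_op` helper, the `on_missing` parameter, and the local `PerformanceWarning` class (a stand-in so the snippet needs only the standard library). Plain lists stand in for the arrays; it is not how pandas or Arrow would implement this.

```python
import unicodedata
import warnings


class PerformanceWarning(Warning):
    """Local stand-in warning category (hypothetical, for this sketch only)."""


def _map_skipping_missing(func, values):
    # Apply func to each string, passing None (missing) through unchanged.
    return [None if v is None else func(v) for v in values]


# Hypothetical registry of "native" kernels; normalize is deliberately absent.
NATIVE_KERNELS = {
    "upper": lambda values: _map_skipping_missing(str.upper, values),
}

# Slow, element-by-element implementations, standing in for the object path.
OBJECT_FALLBACKS = {
    "normalize": lambda values, form: _map_skipping_missing(
        lambda s: unicodedata.normalize(form, s), values
    ),
}


def apply_str_op(values, name, *args, on_missing="raise"):
    """Run a native kernel if one exists, else raise or warn and fall back."""
    kernel = NATIVE_KERNELS.get(name)
    if kernel is not None:
        return kernel(values, *args)
    if on_missing == "raise":
        # Option 1: refuse, stating that there is no kernel.
        raise NotImplementedError(f"no native kernel for {name!r}")
    if on_missing == "warn":
        # Option 2: warn, then do the work on the slow path.
        warnings.warn(
            f"no native kernel for {name!r}; falling back to the object path",
            PerformanceWarning,
            stacklevel=2,
        )
        return OBJECT_FALLBACKS[name](values, *args)
    raise ValueError(f"unknown on_missing mode: {on_missing!r}")
```

With this shape, switching between the two behaviors is a one-argument change, so either could be the default without touching the kernels themselves.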

