In order for Dask to perform large shuffles (set_index, join on a non-index column, ...) on a column it needs to be able to compute quantiles.
To do this it is useful to compute min/max values.
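To make the connection between quantiles and min/max concrete, here is a minimal sketch of how partition boundaries could be picked for a string column. It is not Dask's actual algorithm; `approximate_divisions` and `partition_of` are hypothetical names invented for this illustration, and sorting the whole column is only viable for a toy example.

```python
from bisect import bisect_right


def approximate_divisions(values, npartitions):
    """Return npartitions + 1 boundaries: min, evenly spaced quantiles, max."""
    if npartitions < 1 or not values:
        raise ValueError("need at least one value and one partition")
    ordered = sorted(values)  # toy approach: sort everything in memory
    last = len(ordered) - 1
    # Index 0 is the minimum and index `last` is the maximum of the column.
    return [ordered[i * last // npartitions] for i in range(npartitions + 1)]


def partition_of(value, divisions):
    """Index of the output partition that `value` is routed to."""
    # Only the interior boundaries decide routing; min and max bracket the range.
    return bisect_right(divisions[1:-1], value)


values = ["d", "a", "c", "b", "f", "e", "h", "g"]
divisions = approximate_divisions(values, npartitions=2)
print(divisions)
print([partition_of(v, divisions) for v in values])
```

In this sketch the outermost boundaries are exactly the column's min and max, so a dtype that cannot compute them cannot produce a complete set of boundaries.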
What actually breaks

When I try to do this on columns of type string[pyarrow], I get the following exception:
    import pandas as pd

    s = pd.Series(["a", "b", "c"]).astype("string[pyarrow]")
    s.min()
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
  10825             )
  10826         def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10827             return NDFrame.min(self, axis, skipna, level, numeric_only, **kwargs)
  10828
  10829         setattr(cls, "min", min)

~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
  10348
  10349     def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10350         return self._stat_function(
  10351             "min", nanops.nanmin, axis, skipna, level, numeric_only, **kwargs
  10352         )

~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
  10343             name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
  10344         )
> 10345         return self._reduce(
  10346             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  10347         )

~/miniconda/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   4380         if isinstance(delegate, ExtensionArray):
   4381             # dispatch to ExtensionArray interface
-> 4382             return delegate._reduce(name, skipna=skipna, **kwds)
   4383
   4384         else:

~/miniconda/lib/python3.8/site-packages/pandas/core/arrays/string_arrow.py in _reduce(self, name, skipna, **kwargs)
    377     def _reduce(self, name: str, skipna: bool = True, **kwargs):
    378         if name in ["min", "max"]:
--> 379             return getattr(self, name)(skipna=skipna)
    380
    381         raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

AttributeError: 'ArrowStringArray' object has no attribute 'min'

Solution
I am hopeful that Arrow already has a min/max implementation and that it just hasn't been hooked up yet.