This is a long read: a proposal with quite a large scope and a decent amount of backwards-compatibility impact. I hope it will make SciPy's behavior much more consistent and predictable, as well as yield significant performance gains. I'll post this on the mailing list too. I suggest discussing the big picture first; I'll also ask on the mailing list for process suggestions, in case we'd like to split up the discussion, open a doc/proposal PR for detailed review, or anything like that.
The basic design principle I propose aiming for in all of SciPy for array libraries is: container type in == container type out.
Python sequences (lists, tuples, etc.) will continue to be converted to NumPy arrays, as will other unknown types that are coercible with `np.asarray`.
The scope of this proposal is: how to treat array inputs. This includes different kinds of NumPy arrays and ndarrays with particular dtypes, as well as array/tensor objects coming from other libraries.
Out of scope for this proposal are (a) dataframe libraries, and (b) implementation of a dispatch mechanism for when non-numpy array types hit C/C++/Cython/Fortran code inside SciPy. Both of these topics are touched upon in the Appendix section.
I'll dive straight into the design here; for context on why we'd want/need this design or what has been done and discussed before, see the Context and Problems & opportunities sections further down.
## Array/tensor types support

Array types and how to treat them:
| Input type | Return type | When | Behavior | Notes |
|---|---|---|---|---|
| `numpy.ndarray` | `numpy.ndarray` | always | | |
| `torch.Tensor` (CPU) | `torch.Tensor` | always | | |
| `torch.Tensor` (GPU) | `torch.Tensor` | when possible | raise `TypeError` otherwise | for pure Python implementations for now; see text below for more details |
| `cupy.ndarray` | `cupy.ndarray` | when possible | raise `TypeError` otherwise | |
| has `__array_namespace__` | same as input type | when possible | raise `TypeError` otherwise | |
| array with `object` dtype | n/a | | raise `TypeError` | |
| `numpy.matrix` | n/a | for NumPy >= 2.0 (or always?) | raise `TypeError` | |
| `numpy.ma.MaskedArray` | `numpy.ma.MaskedArray` | only for `stats.mstats` | raise `TypeError` otherwise | future `nan_policy`/`mstats` plans are unaffected by this |
| `numpy.ndarray` subclasses | same subclass type | always | | so always use `np.asanyarray` checks |
| `torch.Tensor` subclasses | same subclass type | same as for `torch.Tensor` on CPU/GPU | | |
| PyData/Sparse arrays | same sparse array type | once it has array API support | `TypeError` otherwise | `TypeError` is already the current state; no auto-densification |
| `dask.Array` | `dask.Array` | once it has array API support | `TypeError` otherwise (?) | bc-breaking, but mixing with the current conversion to a numpy array may not be healthy |
| `jax.numpy.Array` | `jax.numpy.Array` | once added to `array-api-compat` | | same as PyTorch for CPU and GPU |
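To make the "container type in == container type out" principle in the table concrete, here is a minimal sketch. All class names and the function are made up for illustration; they are not actual SciPy code:

```python
# Illustrative sketch of the table's behavior: the output container
# matches the input container, unsupported inputs raise TypeError,
# and unknown types are coerced the way np.asarray would coerce them.

class ArrayAPIArray:
    """Stand-in for any array type exposing __array_namespace__."""
    def __init__(self, data):
        self.data = list(data)

    def __array_namespace__(self, api_version=None):
        return type(self)


class ObjectDtypeArray:
    dtype = object  # stands in for a numpy array with object dtype


def scipy_like_func(x):
    if getattr(x, "dtype", None) is object:
        raise TypeError("object arrays are not supported")
    if hasattr(x, "__array_namespace__"):
        # container type in == container type out
        return type(x)([v + 1 for v in x.data])
    # unknown types: coerce (a plain list stands in for np.asarray here)
    return list(x)
```

The key point is that nothing is silently densified, transferred, or converted to `numpy.ndarray` for known non-NumPy array types.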
When a non-NumPy array type hits compiled code in SciPy (which tends to use the NumPy C API), we have a couple of options:

1. Dispatch to an implementation for that array type (e.g., one provided by the array library itself).
2. Convert to a NumPy array (e.g., via `__array__`), use the compiled code in question, then convert back.
3. Raise a `TypeError`.

I'll note that (1) is the long-term goal; how to implement this is out of scope for this proposal - for more on that, see the Appendix section. For now we choose to do (2) when possible, and (3) otherwise. Switching from that approach to (1) in the future will be backwards-compatible.
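As a sketch of options (2) and (3), here is how a SciPy-like function backed by compiled code could treat non-NumPy inputs. The classes and `compiled_kernel` are hypothetical stand-ins, not actual SciPy or PyTorch code:

```python
# Options (2) and (3) in miniature: convert a CPU tensor, run the
# "compiled" code, convert back; raise for GPU tensors rather than
# doing silent device transfers.

class FakeGPUTensor:
    """Stand-in for e.g. a CUDA torch.Tensor (no cheap NumPy view)."""
    def __init__(self, data, device="cuda"):
        self.data, self.device = list(data), device


class FakeCPUTensor(FakeGPUTensor):
    def __init__(self, data):
        super().__init__(data, device="cpu")


def compiled_kernel(values):
    # stands in for NumPy-C-API code that only understands plain buffers
    return [v * 2 for v in values]


def scipy_like_func(x):
    if isinstance(x, FakeGPUTensor) and x.device != "cpu":
        # option (3): refuse rather than silently transferring devices
        raise TypeError(f"GPU array of type {type(x).__name__} not supported")
    out = compiled_kernel(x.data)   # option (2): convert and compute ...
    return type(x)(out)             # ... then convert back: type in == type out


y = scipy_like_func(FakeCPUTensor([1, 2, 3]))
```

Option (1), real dispatch, would replace the `compiled_kernel` call with a call into the input's own library.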
A note on `numpy.matrix`: the only place it is needed is `scipy.sparse`; it can be vendored there, and for NumPy >= 2.0 instances of the vendored matrix code can be returned. That allows deprecating it in NumPy. We need to support `scipy.sparse.*_matrix` long-term for backwards compatibility (they're too widely used to deprecate); for new code, however, we have `sparse.*_array` instances and PyData/Sparse.
Regarding array API support: when it's present in an array library, SciPy will require it to be complete (v2022.12), including complex number support and the `linalg` and `fft` submodules - supporting partial implementations, in particular ones without `linalg` support, seems unnecessary.
For as-yet-unsupported GPU execution when hitting compiled code, we will raise exceptions. The alternative considered was to transfer to CPU, execute, and transfer back (e.g., for PyTorch). The pro of doing that is that everything works, and there may still be performance gains; the con is that it silently performs device transfers, which is usually not a good idea. On balance, something like this is only a good idea if there's a well-defined plan to make GPU execution work for most functionality on a reasonable time scale (~12-18 months max). That means both addressing the dispatch (uarray & co) problem that I am trying hard to avoid diving into here, and having time commitments for doing the work. Since we don't have either, raising exceptions is the way to go.
## Development and introduction strategy

The following strategy is meant to allow implementing support in a gradual way, and to have control over when to default to a change in behavior - which necessarily is going to have some backwards-compatibility impact.
- Use `array-api-compat` as an optional dependency for now.
- Use `numpy.array_api` for compliance testing in APIs that start supporting array-API-compliant input, and perhaps pandas Series/DataFrame. GPU CI may materialize at some point, but let's not count on or plan for that.
- Put the support infrastructure in `scipy._lib` and depend on that from other submodules. That means we'll have a single replacement for `np.asarray`/`np.asanyarray` in one central location.
### The main `numpy` namespace

- The `numpy.array_api` module is useful for testing that code is actually using only standard APIs, because it's a minimal implementation. It's not going to be used as a layer for production usage though; it's for portability testing only.
- `array-api-compat` looks like the way to go. It is quite a bit nicer than the previous approach with `numpy.array_api`, and the work to support PyTorch that way has been quite well-received.
- NEP 30 (`__duckarray__`), NEP 31 (`uarray`) and NEP 37 (`__array_module__`) are all effectively dead - I'll propose rejecting these.
- NEP 47 (`numpy.array_api`) has been implemented; however, I'm going to propose marking it superseded via a new NEP for the main `numpy` namespace.
- `__array_function__` and `__array_ufunc__` are fully implemented and will continue to exist and be supported in NumPy. We won't support those mechanisms in SciPy though, since we coerce unknown input types to `ndarray` and error out if that fails. The exception is the ufuncs in `scipy.special`, which happen to work already because we reuse the NumPy ufunc machinery there. We can probably leave that as is, since it's not problematic.
- `ndarray` subclass support is complicated by `np.matrix` instances not being Liskov-substitutable (a subclass should be a drop-in replacement, with no changes in behavior but only extensions), which makes it difficult to accept subclasses. We can explicitly start rejecting `np.matrix` with clear exceptions; that will make regular subclasses a lot more useful.

### Masked arrays

- `numpy.ma` has tons of issues and isn't well-maintained. There's a full rewrite floating around that is ~90% complete and with some luck will make it into NumPy 2.0. However, it hasn't seen movement for several years, and completing that work is not planned. `numpy.ma` in its current form should be considered legacy.
- `scipy.stats.mstats` is the only module that specifically supports masked arrays.
- `scipy.stats` has a `nan_policy` keyword in many functions; it is well-maintained at this point and has a documented specification. That is probably not applicable to other submodules though.
- Masked-array support in `interpolate` and `spatial.KDTree` (see gh-18230) may make sense, but ideally that'd use solid NumPy support (e.g., via a null-aware dtype), and that does not exist.
- Hence the proposal above: outside of `stats.mstats`, raise `TypeError` on `numpy.ma.MaskedArray` instances.

## Context

The motivation for all the effort on interoperability is that the current state of SciPy's behavior causes issues, and that there's an opportunity/desire to gain a lot of performance by using other libraries (e.g., PyTorch, CuPy, JAX).
Problems include:
- object arrays holding `Decimal` instances and other random things that people stuff into them, on purpose or (more likely) by accident, being handled very inconsistently.

Opportunities include:
- array API standard support (direct use of `numpy.array_api` is a bit cumbersome): gh-15395

## Appendix

Note: these topics are included for reference/context because they are related, but they are out of scope for this proposal. Please avoid diving into these (I suggest pinging me directly first in case you do see a reason to discuss these topics).
### Dataframe support

For dataframe library support, the situation is a little trickier. We have to think about pandas `Series` and `DataFrame` instances with numpy and non-numpy dtypes, the presence of nullable integers, and other dataframe types, which may be backed entirely by Apache Arrow or by another non-numpy library (e.g., cuDF & CuPy), or may have implemented everything within their own library and may or may not have any NumPy compatibility.
This would be one option, which is somewhat similar to what scikit-learn does (except, it converts nullable integers to float64 with nans):
| Input type | Return type | When | Behavior | Notes |
|---|---|---|---|---|
| `pandas.Series` | `pandas.Series` | if dtype is a numpy dtype | raise `TypeError` otherwise | non-NumPy dtypes otherwise get coerced to object arrays; avoid that |
| `pandas.DataFrame` | `pandas.DataFrame` | if dtype is a numpy dtype | raise `TypeError` otherwise | |
| has `__dataframe__` | same as input type | | | needs a small extension to the dataframe interchange protocol to reconstruct the input type |
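In miniature, the pandas rows of that table amount to the following. `FakeSeries` and `scipy_like_func` are made-up stand-ins for illustration, not actual SciPy or pandas code:

```python
# Sketch of the dataframe rule: preserve the Series container when the
# dtype is a plain numeric one, and reject other dtypes rather than
# silently coercing them to object arrays.

class FakeSeries:  # stands in for pandas.Series
    def __init__(self, values, dtype=float, name=None):
        self.values = list(values)
        self.dtype = dtype
        self.name = name


def scipy_like_func(s):
    if s.dtype not in (int, float):  # stands in for "not a numpy dtype"
        raise TypeError(f"unsupported dtype {s.dtype!r}")
    out = [v * 2 for v in s.values]  # compute on the unwrapped values
    # Series in == Series out, metadata preserved
    return FakeSeries(out, dtype=s.dtype, name=s.name)
```

A nullable-integer dtype (e.g., `"Int64"`) would hit the `TypeError` branch here, which is the behavior the table proposes.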
There's a lot to learn from the effort scikit-learn went through to support pandas dataframes better. See, e.g., the scikit-learn 1.2 release highlights showing the set_output
feature to request pandas dataframe as the return type.
Note: I'd like to not work this out in lots of detail here, because it will require time and that should not block progress on array library support. I just want to put it on the radar, because we do need to deal with it at some point; current treatment of dataframes is quite patchy.
### Dispatching mechanism

For compiled code, other array types (whether CPU, GPU or distributed) are likely not going to work at all; the SciPy code is written for the NumPy C API. It's not impossible that some Cython code will work with other array types if those support the buffer protocol and the Cython code uses memoryviews - but that's uncommon (it won't work at all on GPU, and PyTorch doesn't support the buffer protocol on CPU either).
There has been a lot of discussion on how this should work. The leading candidate is `uarray`, which we already use in `scipy.fft` (as do the matching `fft` APIs in CuPy and PyFFTW) and which has other PRs pending in both SciPy and CuPy. However, there is also resistance to it because it involves a lot of complexity - perhaps too much. So significant work is needed to simplify it, or to switch to another mechanism. This is important work that has to be done, but I'd prefer not to mix it with this proposal.
Whatever the mechanism, it should work transparently: `scipy.xxx.yyy(x)`, where `x` is a non-NumPy array, should dispatch to the library which implements the array type of `x`.
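To show what "transparently" means, here is a deliberately minimal type-registry sketch. It is illustrative only; the real candidate mechanisms (`uarray` and friends) are far more capable (argument coercion, backend priorities, multiple dispatch):

```python
# Minimal transparent-dispatch sketch: the public function inspects the
# type of its input and routes to whichever backend registered for it.

_backends = {}  # maps array type -> implementing function


def register(array_type):
    """Register an implementation of `yyy` for a given array type."""
    def decorator(func):
        _backends[array_type] = func
        return func
    return decorator


def yyy(x):
    """Public API, standing in for scipy.xxx.yyy(x)."""
    for array_type, impl in _backends.items():
        if isinstance(x, array_type):
            return impl(x)
    raise TypeError(f"no backend registered for {type(x).__name__}")


class OtherLibArray:  # stands in for e.g. a CuPy array
    def __init__(self, data):
        self.data = list(data)


@register(OtherLibArray)
def _yyy_otherlib(x):
    # the "other library's" implementation, returning its own type
    return OtherLibArray([v * 10 for v in x.data])


@register(list)  # stands in for the default NumPy implementation
def _yyy_default(x):
    return [v * 10 for v in x]
```

The caller never names a backend; the array type alone selects the implementation, which is the property any chosen mechanism needs to preserve.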
We have a uarray label in the issue tracker. See gh-14353 for the tracker with completed and pending implementations. For more context and real-world examples, see: