SciPy array types & libraries support · Issue #18286 · scipy/scipy · GitHub

This is a long read: a proposal with quite a large scope and a decent amount of backwards-compatibility impact. I hope it will make SciPy's behavior much more consistent and predictable, as well as yield significant performance gains. I'll post this on the mailing list too. I'd suggest discussing the big picture first; I'll also ask on the mailing list for process suggestions, in case we'd like to split up the discussion, open a doc/proposal PR for detailed review, or anything like that.

The basic design principle I propose aiming for in all of SciPy for array libraries is: container type in == container type out.
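The principle can be sketched in a few lines. This is a hypothetical illustration only (the helper names `get_namespace` and `relu_demo` are invented here; SciPy's real machinery relies on array-api-compat), assuming the input either exposes `__array_namespace__` or is coercible to a NumPy array:

```python
import numpy as np

def get_namespace(x):
    # Hypothetical helper: pick the array API namespace matching the
    # input's container type, falling back to NumPy for everything else.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np  # lists, tuples, and other unknown types fall back to NumPy

def relu_demo(x):
    xp = get_namespace(x)
    x = xp.asarray(x)
    # Computed entirely within xp, so: container type in == container type out.
    return xp.where(x > 0, x, xp.zeros_like(x))
```

A `torch.Tensor` input would go through the `torch` namespace and come back as a `torch.Tensor`; a plain list falls back to NumPy and comes back as an `ndarray`.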

Python sequences (lists, tuples, etc.) will continue to be converted to NumPy arrays, as will other unknown types that are coercible with np.asarray.

The scope of this proposal is: how to treat array inputs. This includes different kinds of NumPy arrays and ndarrays with particular dtypes, as well as array/tensor objects coming from other libraries.

Out of scope for this proposal are (a) dataframe libraries, and (b) implementation of a dispatch mechanism for when non-numpy array types hit C/C++/Cython/Fortran code inside SciPy. Both of these topics are touched upon in the Appendix section.

I'll dive straight into the design here; for context on why we'd want/need this design or what has been done and discussed before, see the Context and Problems & opportunities sections further down.

array/tensor types support

Array types and how to treat them:

| Input type | Return type | When | Behavior otherwise | Notes |
|---|---|---|---|---|
| `numpy.ndarray` | `numpy.ndarray` | always | | |
| `torch.Tensor` (CPU) | `torch.Tensor` | always | | |
| `torch.Tensor` (GPU) | `torch.Tensor` | when possible | raise `TypeError` otherwise | for pure Python implementations for now; see text below for more details |
| `cupy.ndarray` | `cupy.ndarray` | when possible | raise `TypeError` otherwise | |
| has `__array_namespace__` | same as input type | when possible | raise `TypeError` otherwise | |
| array with object dtype | n/a | | raise `TypeError` | |
| `numpy.matrix` | n/a | for NumPy >= 2.0 (or always?) | raise `TypeError` | |
| `numpy.ma.MaskedArray` | `numpy.ma.MaskedArray` | only for `stats.mstats` | raise `TypeError` otherwise | future `nan_policy`/`mstats` plans are unaffected by this |
| `numpy.ndarray` subclasses | same subclass type | always | | so always use `np.asanyarray` checks |
| `torch.Tensor` subclasses | same subclass type | same as for `torch.Tensor` on CPU/GPU | | |
| PyData/Sparse arrays | same sparse array type | once it has array API support | `TypeError` otherwise | `TypeError` is already the current state; no auto-densification |
| `dask.Array` | `dask.Array` | once it has array API support | `TypeError` otherwise (?) | bc-breaking, but mixing with the current conversion to a numpy array may not be healthy |
| `jax.numpy.Array` | `jax.numpy.Array` | once added to array-api-compat | | same as PyTorch for CPU and GPU |

When a non-NumPy array type sees compiled code in SciPy (which tends to use the NumPy C API), we have a couple of options:

  1. dispatch back to the other library (PyTorch, CuPy, etc.).
  2. convert to a NumPy array when possible (i.e., on CPU via the buffer protocol, DLPack, or __array__), use the compiled code in question, then convert back.
  3. don't support this at all

I'll note that (1) is the long-term goal; how to implement this is out of scope for this proposal - for more on that, see the Appendix section. For now we choose to do (2) when possible, and (3) otherwise. Switching from that approach to (1) in the future will be backwards-compatible.
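Option (2) can be sketched as a wrapper around a compiled routine. This is a hypothetical illustration (the decorator `via_numpy` and the stand-in `compiled_sort` are invented here), assuming the foreign array is on CPU and convertible via the buffer protocol, DLPack, or `__array__`:

```python
import numpy as np

def via_numpy(func):
    # Sketch of option (2): for a foreign CPU array, convert to NumPy,
    # run the compiled routine, then convert back to the input's type.
    def wrapper(x):
        foreign = hasattr(x, "__array_namespace__") and not isinstance(x, np.ndarray)
        x_np = np.asarray(x)   # via the buffer protocol, DLPack, or __array__
        out = func(x_np)       # compiled code that expects an ndarray
        if foreign:
            xp = x.__array_namespace__()
            return xp.asarray(out)  # convert back to the input's container type
        return out
    return wrapper

@via_numpy
def compiled_sort(a):
    return np.sort(a)  # stand-in for a routine backed by compiled code
```

Because the wrapper only changes which container comes back, later replacing its body with real dispatch to the other library (option (1)) stays backwards-compatible, as noted above.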

A note on numpy.matrix: the only place it is needed is scipy.sparse; it can be vendored there, and for NumPy >= 2.0 instances of the vendored matrix code can be returned. That allows deprecating it in NumPy. We need to support scipy.sparse.*_matrix long-term for backwards compatibility (they're too widely used to deprecate); for new code, however, we have sparse.*_array instances and PyData/Sparse.

Regarding array API support: when it's present in an array library, SciPy will require it to be complete (v2022.12), including complex number support and the linalg and fft submodules - supporting partial implementations without linalg support in particular seems unnecessary.
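A completeness check of this kind might look as follows. This is a hypothetical sketch (the function `check_array_api_complete` is invented here, and it only probes for the two extension submodules rather than the full 2022.12 surface):

```python
import numpy as np

def check_array_api_complete(xp):
    # Hypothetical check: require the linalg and fft extensions of the
    # 2022.12 array API standard before accepting a namespace.
    missing = [ext for ext in ("linalg", "fft") if not hasattr(xp, ext)]
    if missing:
        raise TypeError(
            f"array API namespace is missing required extension(s): {missing}")
    return xp

# NumPy exposes both numpy.linalg and numpy.fft, so it passes.
check_array_api_complete(np)
```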

For as-yet-unsupported GPU execution when hitting compiled code, we will raise exceptions. The alternative considered was to transfer to CPU, execute, and transfer back (e.g., for PyTorch). A pro of doing that is that everything would work, and there may still be performance gains. A con is that it silently does device transfers, which is usually not a good idea. On balance, something like this is only a good idea if there's a well-defined plan to make GPU execution work for most functionality on a reasonable time scale (~12-18 months max). That would mean both addressing the dispatch (uarray & co.) problem that I am trying hard to avoid diving into here, and having time commitments for doing the work. Since we don't have either, raising exceptions is the way to go.
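A guard of this kind could be a small helper at the boundary to compiled code. This is a hypothetical sketch (the function `ensure_cpu` is invented here), assuming GPU arrays expose a `device` attribute whose type is not `"cpu"`, as PyTorch tensors do:

```python
def ensure_cpu(x):
    # Hypothetical guard: raise rather than silently transferring a GPU
    # array to the CPU before calling into compiled code.
    device = getattr(x, "device", None)
    dev_type = getattr(device, "type", device)  # torch devices carry .type
    if dev_type not in (None, "cpu"):
        raise TypeError(
            f"got an array on device {dev_type!r}; this compiled routine "
            "only supports CPU arrays - transfer explicitly if intended")
    return x
```

The explicit error keeps device transfers in the user's hands, which is the point of choosing exceptions over silent round trips.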

Development and introduction strategy

The following strategy is meant to allow implementing support gradually, and to control when to switch the default behavior - which will necessarily have some backwards-compatibility impact.

Context

Problems & opportunities

The motivation for all the effort on interoperability is that the current state of SciPy's behavior causes issues, and that there's an opportunity/desire to gain a lot of performance by using other libraries (e.g., PyTorch, CuPy, JAX).

Problems include:

Opportunities include:

References

Appendix

Note: these topics are included for reference/context because they are related, but they are out of scope for this proposal. Please avoid diving into them (I suggest pinging me directly first in case you do see a reason to discuss these topics).

dataframe support

For dataframe library support, the situation is a little trickier. We have to think about pandas Series and DataFrame instances with numpy and non-numpy dtypes, the presence of nullable integer dtypes, and other dataframe types that may be completely backed by Apache Arrow, backed by another non-NumPy library (e.g., cuDF & CuPy), or implemented entirely within their own library, with or without any NumPy compatibility.

This would be one option, somewhat similar to what scikit-learn does (except that scikit-learn converts nullable integers to float64 with NaNs):

| Input type | Return type | When | Behavior otherwise | Notes |
|---|---|---|---|---|
| `pandas.Series` | `pandas.Series` | if dtype is a numpy dtype | raise `TypeError` otherwise | non-NumPy dtypes would otherwise get coerced to object arrays; avoid that |
| `pandas.DataFrame` | `pandas.DataFrame` | if dtype is a numpy dtype | raise `TypeError` otherwise | |
| has `__dataframe__` | same as input type | | | needs a small extension to the dataframe interchange protocol to reconstruct the input type |
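The dtype gate in the table above could be as simple as the following. This is a hypothetical sketch (the function `check_series_dtype` is invented here), assuming pandas is available:

```python
import numpy as np
import pandas as pd

def check_series_dtype(s):
    # Hypothetical check mirroring the table above: accept a Series only
    # when its dtype is a plain NumPy dtype. Nullable and Arrow-backed
    # extension dtypes would otherwise be coerced to object arrays, so
    # raise instead of silently degrading.
    if not isinstance(s.dtype, np.dtype):
        raise TypeError(f"unsupported extension dtype: {s.dtype}")
    return s
```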

There's a lot to learn from the effort scikit-learn went through to support pandas dataframes better. See, e.g., the scikit-learn 1.2 release highlights showing the set_output feature to request pandas dataframes as the return type.

Note: I'd prefer not to work this out in detail here, because it will take time and should not block progress on array library support. I just want to put it on the radar, because we do need to deal with it at some point; the current treatment of dataframes is quite patchy.

Dispatching mechanism

For compiled code, other array types (whether CPU, GPU or distributed) are likely not going to work at all; the SciPy code is written for the NumPy C API. It's not impossible that some Cython code will work with other array types if those support the buffer protocol and the Cython code uses memoryviews - but that's uncommon (won't work at all on GPU, and PyTorch doesn't support the buffer protocol on CPU either).

There has been a lot of discussion on how this should work. The leading candidate is Uarray, which we already use in scipy.fft (as do matching fft APIs in CuPy and PyFFTW) and has other PRs pending in both SciPy and CuPy. However, there is also resistance to that because it involves a lot of complexity - perhaps too much. So significant work is needed to simplify that. Or switch to another mechanism. This is important work that has to be done, but I'd prefer not to mix that with this proposal.

Whatever the mechanism, it should work transparently, such that a call scipy.xxx.yyy(x), where x is a non-NumPy array, dispatches to the library that implements the array type of x.
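The desired behavior can be illustrated with a minimal type-based registry. To be clear, this is NOT the uarray machinery; it is a hypothetical sketch (the names `register_backend`, `yyy`, and `_backends` are invented here) of what transparent dispatch looks like from the caller's side:

```python
import numpy as np

_backends = {}  # array type -> implementation

def register_backend(array_type):
    # Each backend registers an implementation keyed on its array type.
    def decorator(impl):
        _backends[array_type] = impl
        return impl
    return decorator

def yyy(x):
    # The public scipy.xxx.yyy-style entry point: pick the implementation
    # matching type(x), or fail loudly if no backend handles it.
    for array_type, impl in _backends.items():
        if isinstance(x, array_type):
            return impl(x)
    raise TypeError(f"no backend registered for {type(x).__name__}")

@register_backend(np.ndarray)
def _yyy_numpy(x):
    return np.sum(x)  # placeholder for the default NumPy implementation
```

A CuPy or PyTorch backend would register its own implementation the same way, and callers would never see the dispatch layer.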

We have a uarray label in the issue tracker. See gh-14353 for the tracker with completed and pending implementations. For more context and real-world examples, see:
