scipy.stats.mstats is a mostly-separate re-implementation of scipy.stats with support for NumPy masked arrays. Masked values are treated as missing: for 1-D slices, the result is typically the same as if the masked value were not present.
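To make the "treated as missing" behavior concrete, here is a minimal sketch. The choice of skew and the sample values are only illustrative assumptions; the comparison is between the mstats function on a masked array and the regular function on the same data with the masked element dropped.

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

# Five observations; the third one is marked as missing via the mask.
data = np.array([2.0, 4.0, 100.0, 8.0, 16.0])
masked = np.ma.masked_array(data, mask=[False, False, True, False, False])

# mstats version: the masked element is ignored.
skew_masked = float(mstats.skew(masked))

# Regular version, applied to the data with the missing element removed.
skew_dropped = float(stats.skew(data[~masked.mask]))

# For this 1-D case the two results are expected to agree.
print(np.isclose(skew_masked, skew_dropped))
```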
While there seems to be demand for statistical functions to support missing values, I'd suggest that having two separate implementations of these functions is not the best way to satisfy the need.
[…] scipy.stats function or the masked capabilities of its scipy.stats.mstats counterpart [1].
I have seen the opinion that we can combine the implementations but must maintain a separate scipy.stats.mstats namespace. While this does not double the workload, maintaining two interfaces is more work than maintaining one. For instance, many scipy.stats.mstats functions are missing "Returns" (#22065 (comment)) and "Examples" (gh-7168) sections of their documentation. Also, having separate interfaces for essentially identical functionality is unnecessarily complicated for users.
I see two other reasons why a namespace should not be devoted to NumPy masked arrays.
[…] array_namespace function.
Fortunately, many scipy.stats functions already offer the same functionality as their scipy.stats.mstats counterparts, making the separate namespace redundant. There are actually two obvious ways [2] to ignore missing values in most scipy.stats functions with a scipy.stats.mstats counterpart:
- […] nan and use nan_policy='omit'.
- […] scipy.stats function. This behavior has been handled by the _axis_nan_policy decorator for several years.
Both of these avoid a common pitfall of NumPy masked arrays, which mask non-finite values that arise during calculations. This behavior is problematic because NaNs and infinities should not always be treated the same as missing data.
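As a rough illustration of the two approaches and of the pitfall, consider the sketch below. It assumes a recent SciPy with the SCIPY_ARRAY_API environment variable unset, and uses skew only as a convenient example; the data and the `missing` array are made up for the demonstration.

```python
import numpy as np
from scipy import stats

data = np.array([2.0, 4.0, 100.0, 8.0, 16.0])
missing = np.array([False, False, True, False, False])

# Reference value: the statistic of the data with the missing element removed.
reference = stats.skew(data[~missing])

# Approach 1: mark missing values as NaN and request nan_policy='omit'.
with_nan = np.where(missing, np.nan, data)
skew_nan = stats.skew(with_nan, nan_policy='omit')

# Approach 2: pass a NumPy masked array directly to the scipy.stats function.
skew_ma = stats.skew(np.ma.masked_array(data, mask=missing))

print(np.isclose(skew_nan, reference), np.isclose(skew_ma, reference))

# The pitfall: NumPy masked arrays mask non-finite results on their own,
# so a division by zero ends up looking like missing data.
x = np.ma.masked_array([1.0, 0.0, 2.0])
reciprocal = 1.0 / x
print(reciprocal.mask)
```

In both approaches the caller decides which values are missing; nothing is masked as a side effect of the calculation itself.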
Update April 2025: The specific plan suggested here has changed; see #22194 (comment) for an update.
Here is the proposed alternative:
- […] scipy.stats.mstats functions to add to scipy.stats, and add them. These new functions should be subject to the same level of review as any other new scipy.stats function, as standards have changed since they were introduced to mstats.
- […] scipy.stats functions with a scipy.stats.mstats counterpart support the following:
  - nan_policy='omit' (always)
  - […] SCIPY_ARRAY_API is unspecified)
- […] scipy.stats.mstats functions with a helpful warning for transitioning to the scipy.stats equivalent.
- […] scipy.stats.mstats functions after the usual deprecation period.
- […] scipy.stats.mstats namespace either a) along with the last scipy.stats.mstats functions (preferably) or b) in SciPy 2.0.0 (if it is necessary to wait for some reason).
Looking beyond this, I would also suggest that as scipy.stats functions are translated to use the Python Array API, they can also be adapted to natively support marray, which adds masks to any Python Array API compatible backend. In most cases, the only special consideration for MArrays is that the count of non-masked elements along axis should be used in place of the length of the array along axis.
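The point about counts can be sketched with plain NumPy. This is not marray's API; `masked_mean` is a hypothetical helper written only to show where the count of non-masked elements replaces the array length along axis.

```python
import numpy as np

def masked_mean(data, mask, axis):
    """Mean along `axis`, ignoring elements where `mask` is True (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Masked elements contribute nothing to the sum.
    total = np.sum(np.where(mask, 0.0, data), axis=axis)
    # Divide by the count of non-masked elements, not by data.shape[axis].
    count = np.sum(~mask, axis=axis)
    return total / count

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
mask = np.array([[False, True, False],
                 [False, False, False]])

result = masked_mean(data, mask, axis=1)

# Cross-check against NumPy's own masked-array mean.
expected = np.ma.masked_array(data, mask=mask).mean(axis=1)
print(np.allclose(result, expected))
```

The sketch does not handle slices in which every element is masked; a real implementation would need to decide what to return in that case.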
Closing this will close gh-5474
[1] If the scipy.stats version did not already support masked arrays - but many do. (Addressed below.)
[2] Ideally, nan_policy='omit' could also be eliminated, and the same behavior could be achieved by passing an MArray (discussed below) to the function. MArrays do not automatically mask non-finite values that arise during calculations.