This is a follow-up to gh-14688. That issue was originally about a kernel panic (fixed in macOS 12.0.1); after that fix, the same reproducer showed severe performance issues. This issue is about those performance issues. Note that while the reproducer is the same, it is not clear whether the kernel panic and the performance issues share a root cause.
Issue reproducer

A reproducer (warning: do NOT run on macOS 11.x, it will crash the OS):

from time import perf_counter

import numpy as np
from scipy.sparse.linalg import eigsh

n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(10):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")
Running scipy.test() or scipy.linalg.test() will also show a significant performance impact.
In situations where we hit the performance problem, the above code will show:
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.062 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.888 s
...
And if we don't hit that problem:
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
...
So a ~50x slowdown for this particular example.
There is in general an impact on functions that use BLAS/LAPACK. The impact on the total time taken by scipy.test() was about 30% (311 sec. with default settings, 234 sec. when using OPENBLAS_NUM_THREADS=1) - note that this was just a single test on one build config, results may vary: #14688 (comment). The single-threaded case has similar timings to running the test suite on a scipy install that doesn't show the problem at all (~240 sec. seems expected on arm64 macOS, independent of the threading setting, because test arrays are always small). Important: ensure pytest-xdist is not installed when looking at the time taken by the test suite (see gh-14425 for why).
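For reference, a minimal sketch of how one could reproduce such a timing comparison (this is not from the original report): run it once as-is and once with OPENBLAS_NUM_THREADS=1 set in the environment before starting Python, and make sure pytest-xdist is not installed.

from time import perf_counter

import scipy

tic = perf_counter()
scipy.test()  # full test suite; takes a few minutes
print(f"scipy.test() took {perf_counter() - tic:.1f} s")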
The discussion in gh-14688 showed that this problem gets hit when two copies of libopenblas get loaded. The following configurations showed a problem so far:

- numpy and scipy from a wheel (e.g., numpy 1.21.4 from PyPI and the latest 1.8.0.dev0 wheel from https://anaconda.org/scipy-wheels-nightly/scipy/)
- numpy 1.21.4 from PyPI and installing scipy locally, built against conda-forge's openblas.

These configurations did not show a problem:

- numpy 1.21.4 from PyPI and installing scipy locally, built against Homebrew's openblas.
- Configurations in which only a single copy of libopenblas is loaded.

It is unclear right now what the exact root cause is. The situation when using conda-forge's openblas is very similar to that using Homebrew's openblas, yet only one of those triggers the issue. The most important situation is installing both NumPy and SciPy from wheels though - that's what the vast majority of pip/PyPI users will get.
A difference between conda-forge and Homebrew that may be relevant is that the former uses @rpath and the latter a hardcoded path to load libopenblas:
% # conda-forge
% otool -L _fblas.cpython-39-darwin.so
_fblas.cpython-39-darwin.so:
@rpath/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)
% # Homebrew
% otool -L /opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-*-darwin.so
/opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-39-darwin.so:
/opt/homebrew/opt/openblas/lib/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
/opt/homebrew/opt/gcc/lib/gcc/11/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
/opt/homebrew/opt/gcc/lib/gcc/11/libgcc_s.1.1.dylib (compatibility version 1.0.0, current version 1.1.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)
That may not be the only difference, e.g. the compilers used to build libopenblas and scipy were not the same. Also, libopenblas can be built with either pthreads or openmp - the numpy and scipy wheels use pthreads, while conda-forge and Homebrew both use openmp.
To check if two libopenblas libraries get loaded, use:
❯ python -m threadpoolctl -i scipy.linalg
[
{
"user_api": "blas",
"internal_api": "openblas",
"prefix": "libopenblas",
"filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib",
"version": "0.3.18",
"threading_layer": "pthreads",
"architecture": "armv8",
"num_threads": 8
},
{
"user_api": "blas",
"internal_api": "openblas",
"prefix": "libopenblas",
"filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib",
"version": "0.3.17",
"threading_layer": "pthreads",
"architecture": "armv8",
"num_threads": 8
}
]
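The same check can also be done programmatically. A small sketch (assuming threadpoolctl is installed) that flags the two-copies situation from within Python:

from threadpoolctl import threadpool_info

import numpy.linalg
import scipy.linalg  # importing both ensures each package's vendored libopenblas gets loaded

openblas_paths = [
    info["filepath"] for info in threadpool_info()
    if info["internal_api"] == "openblas"
]
print(openblas_paths)
if len(openblas_paths) > 1:
    print("more than one libopenblas is loaded - this issue may apply")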
Context: why do 2 libopenblas copies get loaded
The reason is that the NumPy and SciPy wheels both vendor a copy of libopenblas within them, and extension modules that need libopenblas depend directly on that vendored copy:
% cd /path/to/site-packages/scipy/linalg
% otool -L _fblas.cpython-39-darwin.so
_fblas.cpython-39-darwin.so:
@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
@loader_path/../.dylibs/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)
% cd ../../numpy/linalg
% otool -L _umath_linalg.cpython-39-darwin.so
_umath_linalg.cpython-39-darwin.so:
@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)
This is how we have been shipping wheels for years, and it works fine across Windows, Linux and macOS. It seems like a weird thing to do of course (if you know how package managers work but are new to PyPI/wheels) - it's a long story, but the tl;dr is that PyPI wasn't designed with non-Python dependencies in mind, so the usual approach is to bundle those all into a wheel (it tends to work, unless you have complex non-Python dependencies). It'd be very much nontrivial to do any kind of unbundling here, and doing so would break situations where numpy and scipy are not installed in the same way (e.g., the former from conda-forge/Homebrew, the latter from PyPI).
The kernel panic had to do with spin locks apparently. It is not clear if the performance issues are also due to that, or have a completely different root cause. It does seem to be the case that two copies of the same shared library with the same version (all are libopenblas.0.dylib) cause a conflict at the OS level somehow. Anything beyond that is speculation at this point.
If we release wheels for macOS 12, many people are going to hit this problem. A 50x slowdown for some code using linalg functionality in the default install configuration of pip install numpy scipy does not seem acceptable - that will send too many users on wild goose chases. On the other hand, it should be pointed out that if users build SciPy 1.7.2 from source on a native arm64 Python install, they will hit the same problem anyway. So not releasing any wheels isn't much better; at best it signals to users that they shouldn't use arm64 just yet but stick with x86_64 (though that has some performance implications as well).
At this point it looks like controlling the number of threads that OpenBLAS uses is the way we can work around this problem (or let users do so). Ways to control threading:

- threadpoolctl (see the README at https://github.com/joblib/threadpoolctl for how, and the sketch below)
- the OPENBLAS_NUM_THREADS environment variable
- rebuilding the libopenblas we bundle in the wheel to have a max number of threads of 1, 2, or 4.

SciPy doesn't have a threadpoolctl runtime dependency, and it doesn't seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an optional dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.
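As an illustration of the threadpoolctl option, a minimal sketch (assuming threadpoolctl is installed; it reuses the reproducer from above) that limits all loaded BLAS libraries to a single thread for the duration of a block:

import numpy as np
from scipy.sparse.linalg import eigsh
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
K = X @ X.T

# Limit every loaded BLAS library (i.e. both vendored libopenblas copies) to one thread.
with threadpool_limits(limits=1, user_api="blas"):
    s, _ = eigsh(K, 3, which="LA", tol=0)
print(s)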
Rebuilding libopenblas with a low max number of threads does not allow users who know what they are doing, or who don't suffer from the problem, to optimize threading behavior for their own code. It was pointed out in #14688 (comment) that this is undesirable.
Setting an environment variable is also not a great thing to do (a library should normally never do this), but if it works to do so in scipy/__init__.py then that may be the most pragmatic solution right now. However, this must be done before libopenblas is first loaded or it won't take effect. So if users import numpy first, setting an env var will have no effect on that copy of libopenblas. It needs testing whether this still works around the problem in that case.
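For illustration only, a hypothetical sketch of what such an early default in scipy/__init__.py could look like (this is not actual SciPy code, and it only has an effect if it runs before any extension module linked against libopenblas has been imported):

import os

# Respect an explicit user setting; only provide a default if none is set,
# and do it before importing any submodule that loads the vendored libopenblas.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")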
Note: I wanted to have everything in one place, but let's discuss the release strategy on the mailing list (link to thread), and the actual performance issue here.
Testing on other macOS arm64 build/install configurations

Request: if you have a build config on macOS arm64 that is not covered by the above summary yet, please run the following and reply on this issue with the results:
% python -m threadpoolctl -i scipy.linalg
% cd /PATH/TO/scipy/linalg
% otool -L _fblas.cpython-*-darwin.so
% cd /PATH/TO/numpy/linalg
% otool -L _umath_linalg.cpython-*-darwin.so
% # Run the reproducer (again, only on macOS 12 - you will trigger an OS
% # crash on macOS 11.x!) and report if the time per `eigsh` call is ~0.02 sec. or ~1 sec.
% pip list # if using pip for everything
% conda list # if using conda