Welcome to the Design Document for CPU parallelism in NumPy, SciPy, scikit-learn, and pandas. Each library has varying levels of support for running computations in parallel. This document details the current state of parallelism in the code shipped on PyPI and possible paths for improvement.
Current Landscape

Each library ships with an assortment of parallel interfaces on PyPI:

- NumPy: The linalg module and matrix multiplication utilize BLAS, which is multi-threaded by default for most implementations. On PyPI, NumPy ships with OpenBLAS built with pthreads.
- SciPy: The linalg module uses OpenBLAS, which is also multi-threaded by default and built with pthreads. SciPy also has a workers parameter that configures multiprocessing, multithreading, or pthreads.
- Pandas: Numba-accelerated operations can run in parallel by passing parallel=True to engine_kwargs.
- Scikit-learn: Uses the linalg module from NumPy and SciPy, which is multi-threaded by default. Scikit-learn ships with OpenMP on PyPI and runs OpenMP-accelerated code in parallel by default. The library also has an n_jobs parameter that uses Python's multithreading or loky, an improved Python-based process pool executor, via joblib.

On PyPI, if a library requires OpenMP or OpenBLAS, it bundles the shared library into its wheel.
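For example, threadpoolctl's threadpool_info() offers one way to inspect which bundled runtimes end up loaded in a process (a minimal sketch, assuming NumPy and threadpoolctl are installed):

```python
import numpy  # importing NumPy loads the OpenBLAS bundled in its PyPI wheel
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # Each entry describes one loaded runtime: its API ("blas" or "openmp"),
    # the shared library prefix (e.g. "libopenblas"), and its default thread count.
    print(pool["user_api"], pool["prefix"], pool["num_threads"])
```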
Issues with the Current Landscape

The current landscape has three broad categories of issues:
There are three ways to configure parallelism across the libraries: environment variables, threadpoolctl, or a library-specific Python API.
Examples of environment variables include:

- OPENBLAS_NUM_THREADS for OpenBLAS
- MKL_NUM_THREADS for MKL
- OMP_NUM_THREADS for OpenMP

These environment variables control how many threads a specific backend uses. They do not influence code that does not use a particular backend; for example, SciPy's fft module uses pthreads directly.
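A minimal sketch of using them from Python; the key constraint is that they must be set before the backends initialize their thread pools (i.e. before the first import):

```python
import os

# Must be set before NumPy/SciPy load their BLAS and OpenMP runtimes.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # BLAS calls below now run on a single thread

a = np.random.default_rng(0).standard_normal((1000, 1000))
np.linalg.svd(a)
```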
Threadpoolctl provides a Python interface for configuring the number of threads in OpenBLAS, MKL, and OpenMP. Linear algebra function calls from NumPy, SciPy, or scikit-learn can all be configured with threadpoolctl or an environment variable. These configuration options also control Scikit-learn’s OpenMP routines.
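A short sketch of what that looks like with threadpoolctl's threadpool_limits context manager:

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.default_rng(0).standard_normal((2000, 2000))

# Temporarily cap all detected BLAS thread pools at two threads.
with threadpool_limits(limits=2, user_api="blas"):
    np.linalg.svd(a)

# user_api="openmp" would instead limit OpenMP-backed routines,
# such as scikit-learn's Cython/OpenMP code.
```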
SciPy and scikit-learn have library-specific Python APIs for controlling parallelism. SciPy's workers can mean multithreading, multiprocessing, or pthreads. Scikit-learn's n_jobs is either multiprocessing or multithreading. Note that scikit-learn's n_jobs does not configure OpenMP or OpenBLAS parallelism.
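A brief sketch of both APIs (scipy.fft and RandomForestClassifier are used here only as familiar examples):

```python
import numpy as np
from scipy import fft
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# SciPy: `workers` requests parallel execution for this one call.
x = rng.standard_normal((8, 4096))
x_hat = fft.fft(x, workers=4)

# scikit-learn: `n_jobs` parallelizes over estimators via joblib;
# it does not change OpenMP or OpenBLAS thread counts.
X, y = rng.standard_normal((200, 10)), rng.integers(0, 2, 200)
clf = RandomForestClassifier(n_estimators=50, n_jobs=4).fit(X, y)
```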
Here is a two-step proposal:

- Adopt the workers parameter because it is more consistent in controlling the number of cores used.
- workers denotes any form of parallelism, such as multi-threading, multiprocessing, OpenMP threads, or pthreads. Please see the FAQ section for more information.

BLAS implementations such as OpenBLAS are multi-threaded by default. Scikit-learn followed this convention with OpenMP, which is also multi-threaded by default. Using all the CPU cores by default is convenient for interactive use cases such as a Jupyter notebook. The downside of using all CPU cores appears when deploying to shared environments: the user needs to know which of the APIs from the section above to configure so that their program runs serially.
There can be oversubscription when multiprocessing or multi-threading is used together with OpenBLAS or OpenMP. Distributed Python libraries such as Dask and Ray recommend setting environment variables to configure OpenBLAS and OpenMP to run serially.
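A sketch of how this oversubscription arises and one way to tame it; joblib and threadpoolctl are used purely for illustration:

```python
import numpy as np
from joblib import Parallel, delayed
from threadpoolctl import threadpool_limits

def task(seed):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((500, 500))
    # Without this limit, each of the 4 workers would call a multi-threaded
    # OpenBLAS, spawning roughly 4 x (number of cores) threads in total.
    with threadpool_limits(limits=1, user_api="blas"):
        return np.linalg.inv(a)

results = Parallel(n_jobs=4)(delayed(task)(i) for i in range(8))
```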
Proposal

Here are some possible paths we can take:

- Option 1 keeps the status quo: the linalg module and scikit-learn's OpenMP-accelerated routines will continue to be parallel as the default.

Options 2 and 3 help with oversubscription because the library is serial by default.
Interactions Between Different Forms of Parallelism

When different parallelism interfaces are running concurrently, it is possible to run into crashes or oversubscription. The following is a list of known issues:
- libgomp (GCC's OpenMP runtime library) is not fork(1)-safe, while libomp (LLVM's OpenMP runtime library) is fork(1)-safe. Scikit-learn's community developed loky as a workaround. There is a patch to GCC OpenMP to make it fork(1)-safe, but it has not progressed. For details, see scikit-learn's FAQ entry.
- libomp (LLVM's OpenMP runtime library) is not compatible with libiomp (OpenMP for the Intel compiler). The workaround is to set MKL_THREADING_LAYER=GNU. See this link for details.
- libgomp (GCC's OpenMP runtime library) is also not compatible with libiomp (OpenMP for the Intel compiler): pytorch#37377
- OpenBLAS worker threads can keep spinning after a computation and interfere with other parallel code; the workaround is to set OPENBLAS_THREAD_TIMEOUT=1 on the affected platforms.

The following are feasible steps we can take to improve the issues listed above:
- Detect when libgomp and fork(1) are used together, raising an error.
- Warn users to set MKL_THREADING_LAYER when LLVM OpenMP and Intel OpenMP are loaded together. For example, threadpoolctl has such a warning.

The OpenMP and OpenBLAS wheels are shipped with their header files. When building an upstream library such as NumPy, extensions will use RPATH to link to the OpenMP and OpenBLAS wheels. auditwheel repair needs a patch so that it does not copy PyPI libraries into the wheel: auditwheel#392. Note that PEP 513 explicitly allows for shared libraries to be distributed as separate packages on PyPI.
There are two options: libgomp (GCC's OpenMP runtime library) or libomp (LLVM's OpenMP runtime library).

- libgomp is not fork(1)-safe, but it is the runtime used by GCC and is shipped with all Linux distros. We advocate for the patch in GCC to make it fork(1)-safe.
- libomp is fork(1)-safe, but this is an implementation detail and not part of the OpenMP specification.

On PyPI, I propose we go with libomp, because it has the same symbols as libgomp and is fork(1)-safe. Upstream libraries such as NumPy or SciPy can still use GCC as their compiler. Package managers can still ship libraries linked with libgomp. SciPy has an existing discussion regarding OpenMP adoption and the compiler choice: scipy#10239.
FAQ

Besides the OpenMP wheel, can workers configure and use another threading API like pthreads?

Yes, this design document does not restrict the usage of other APIs for parallelism. The only requirement is to link the library to the OpenMP wheel if it uses OpenMP. This way, all libraries can share the same OpenMP thread pool.
Can Python's multiprocessing and threading modules still be used?

Yes, this design document does not restrict the usage of other APIs for parallelism. If Python's threading or multiprocessing fits your library's use case, you are welcome to use them.
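For instance, a minimal standard-library sketch:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def work(i):
    return i * i

if __name__ == "__main__":
    # Process-based parallelism for CPU-bound Python code...
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(work, range(8))))

    # ...and thread-based parallelism, useful when the work releases the GIL.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(work, range(8))))
```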
Libraries will do their best to account for over-subscription from nested parallelism. For example, if multiple OpenMP threads call a BLAS routine, then the BLAS routine is configured to run serially. On the other hand, a library can have an API to perform multiprocessing on a user-defined function. If the user-defined function is also parallelized, then there is nested parallelism. The best a library can do is to document how its parallelism interacts with user-defined functions.
How does conda-forge work?

On conda-forge, if a library requires OpenMP or an implementation of BLAS (such as MKL or OpenBLAS), it depends on the package manager distributing the shared library:
For BLAS, conda-forge builds against the Netlib reference implementation. At installation time, BLAS can be switched to other implementations such as MKL, BLIS, or OpenBLAS. See this link for details.
For OpenMP, conda-forge builds with libgomp, the GNU build of OpenMP. At installation time, OpenMP can be switched to libomp, the LLVM build of OpenMP. Recall that the LLVM implementation is fork(1)-safe. Note that the GNU implementation has target offloading symbols, while LLVM does not. See this link for details.
Conda-forge has a mutex package ensuring that a single OpenMP or BLAS library is installed and loaded.