A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://github.com/toxa81/COSMA below:

toxa81/COSMA: Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm

COSMA is a parallel, high-performance, GPU-accelerated, matrix-matrix mutliplication algorithm that is communication-optimal for all combinations of matrix dimensions, number of processors and memory sizes, without the need for any parameter tuning. The key idea behind COSMA is to first derive a tight optimal sequential schedule and only then parallelize it, preserving I/O optimality between processes. This stands in contrast with the 2D and 3D algorithms, which fix process domain decomposition upfront and then map it to the matrix dimensions, which may result in asymptotically more communication. The final design of COSMA facilitates the overlap of computation and communication, ensuring speedups and applicability of modern mechanisms such as RDMA. COSMA allows to not utilize some processors in order to optimize the processor grid, which reduces the communication volume even further and increases the computation volume per processor.

COSMA got the Best Student Paper Award at the prestigious Supercomputing 2019 conference in Denver, US.

COSMA alleviates the issues of current state-of-the-art algorithms, which can be summarized as follows:

In addition to being communication-optimal, this implementation is higly-optimized to reduce the memory footprint in the following sense:

The library supports both one-sided and two-sided MPI communication backends. It uses dgemm for the local computations, but also has a support for the GPU acceleration through our Tiled-MM library using cublas or rocBLAS.

The paper and other materials on COSMA are available under the following link:

See Installation Instructions.

COSMA is a CMake project and requires a recent CMake(>=3.12).

External dependencies:

Some dependencies are bundled as submodules and need not be installed explicitly:

To allow easy integration, COSMA can be used in the following ways:

The documentation for the latter option will soon be published here.

Using COSMA in 30 seconds

For easy integration, it is enough to build COSMA with ScaLAPACK API and then link your code to COSMA before linking to any other library providing ScaLAPACK pxgemm. This way, all pxgemm calls will be using COSMA pxgemm wrappers. To achieve this, please follow these steps:

  1. Build COSMA with ScaLAPACK API:
###############
# get COSMA
###############
git clone --recursive https://github.com/eth-cscs/COSMA cosma && cd cosma

##############################
# build and install COSMA
##############################
mkdir build && cd build
# choose BLAS and SCALAPACK versions you want to use
# COSMA_BLAS can be: MKL, OpenBLAS, CRAY_LIBSCI, CUDA, ROCM, CUSTOM
# COSMA_SCALAPACK can be MKL, CRAY_LIBSCI, CUSTOM
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCMAKE_INSTALL_PREFIX=<installation dir>/cosma ..
make -j 8
make install
  1. Link your code to COSMA:

  2. Include headers:

-I/cosma/include

COSMA is integrated into the CP2K quantum chemistry simulator. Since COSMA provides ScaLAPACK API, it is enough to link CP2K to COSMA, without changing CP2K code at all, which makes the integration trivial even if (as in the case of CP2K) the simulation code is in written Fortran.

In the production run, we ran Random-Phase Approximation (RPA) benchmark of 128 water molecules, using the Resolution of Identity (RI). The benchmark was run once on 1024 and once on 128 nodes of the GPU partition on Piz Daint supercomputer (Cray XC50). Computationally, the most dominant part of this benchmark consists of 46 tall-and-skinny dense matrix multiplications, with the parameters shown in the table below:

On 1024 nodes, we compared the performance of CP2K using COSMA and Cray-libsci_acc (version: 19.10.1), both being GPU accelerated, for all dense matrix-matrix multiplications (pdgemm routine). As can be seen in the following table, the version with COSMA was approximately 2x faster.

On 128 nodes, we compared the performance of CP2K using the following algorithms for multiplying matrices (pdgemm routine): MKL (version: 19.0.1.144), Cray-libsci (version: 19.06.1), Cray-libsci_acc (version: 19.10.1, GPU accelerated) and COSMA (both CPU-only and GPU-accelerated versions) libraries. The version with COSMA was the fastest on both CPU and GPU. The CPU version of COSMA achieved the peak performance, whereas the GPU version achieved more than 65% of the peak performance of GPUs. Keep in mind that the peak performance of GPUs assumes the data is already residing on GPUs which is not the case here, since matrices were initially residing on CPU. This is one of the reasons why the peak performance is not achieved with the GPU version. Still, the GPU version of COSMA was 25-27% faster than the second best in this case. The results are summarized in the following table:

With COSMA, even higher speedups are possible, depending on matrix shapes. To illustrate possible performance gains, we also ran different square matrix multiplications on the same number of nodes (=128) of Piz Daint supercomputer. The block size is 128x128 and the processor grid is also square: 16x16 (2 ranks per node). The performance of COSMA is compared against Intel MKL ScaLAPACK (version: 19.0.1.144). The results on Cray XC50 (GPU-accelerated) and Cray XC40 (CPU-only) are summarized in the following table:

All the results from this section assumed matrices given in (block-cyclic) ScaLAPACK data layout. However, if the native COSMA layout is used, even higher throughput is possible.

# for CPU-only version
sbatch schedule_miniapp_on_daint_cpu.sh
# for Hybrid (CPU+GPU) version
sbatch schedule_miniapp_on_daint_gpu.sh

The script will use SLURM to submit a job on 10 nodes. The job will run 2 matrix multiplications and output the time COSMA algorithm took.

The project contains a miniapp that produces two random matrices A and B, computes their product C with the COSMA algorithm and outputs the time of the multiplication.

The miniapp consists of an executable ./build/miniapp/cosma-miniapp which can be run with the following command line (assuming we are in the root folder of the project):

mpirun --oversubscribe -np 4 ./build/miniapp/cosma-miniapp -m 1000 -n 1000 -k 1000 -P 4 -s pm2,sm2,pk2 -r 2

The flags have the following meaning:

COSMA also contains a wrapper for ScaLAPACK pxgemm calls which offers scalapack interface (pxgemm functions with exactly the same signatures as ScaLAPACK). Running these functions will take care of transforming the matrices between ScaLAPACK and COSMA data layout, perform the multiplication using COSMA algorithm and transform the result back to the specified ScaLAPACK data layout.

The miniapp consists of an executable ./build/miniapp/pdgemm-miniapp which can be run with the following command line on Piz Daint (assuming we are in the root folder of the project):

OMP_NUM_THREADS=18 MKL_NUM_THREADS=18 srun -C mc -N 8 -n 16 ./build/miniapp/pdgemm-miniapp -m 1000 -n 1000 -k 1000 --block_a 128x128 --block_b 128x128 --block_c 128x128 -p 4 -q 4 --trans_a -r 2

The flags have the following meaning:

Use -DCOSMA_WITH_PROFILING=ON to instrument the code. We use the profiler, called semiprof, written by Benjamin Cumming (https://github.com/bcumming).

Running the miniapp locally (from the project root folder) with the following command:

mpirun --oversubscribe -np 4 ./build/miniapp/cosma-miniapp -m 1000 -n 1000 -k 1000 -P 4

Produces the following output from rank 0:

Matrix dimensions (m, n, k) = (1000, 1000, 1000)
Number of processors: 4

_p_ REGION                     CALLS      THREAD        WALL       %
_p_ total                          -       0.110       0.110   100.0
_p_   multiply                     -       0.098       0.098    88.7
_p_     computation                2       0.052       0.052    47.1
_p_     communication              -       0.046       0.046    41.6
_p_       copy                     3       0.037       0.037    33.2
_p_       reduce                   3       0.009       0.009     8.3
_p_     layout                    18       0.000       0.000     0.0
_p_   preprocessing                3       0.012       0.012    11.3

The precentage is always relative to the first level above. All time measurements are in seconds.

Cite as:

@inproceedings{cosma_algorithm_2019,
  title={Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication},
  author={Kwasniewski, Grzegorz and Kabi{\'c}, Marko and Besta, Maciej and VandeVondele, Joost and Solc{\`a}, Raffaele and Hoefler, Torsten},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  pages={1--22},
  year={2019}
}

For questions, feel free to contact us, and we will soon get back to you:

If you need any help with the integration of COSMA into your library, we will be more than happy to help you!

This work was funded in part by:

We thank Thibault Notargiacomo, Sam Yates, Benjamin Cumming and Simon Pintarelli for their generous contribution to the project: great ideas, useful advices and fruitful discussions.


RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4