Deep neural networks achieve outstanding performance in a variety of fields, such as computer vision, speech recognition, and natural language processing. The computational power needed to process these neural networks is rapidly increasing, so efficient models and computation are crucial. Neural network pruning, removing unnecessary model parameters to yield a sparse network, is a useful way to reduce model complexity while maintaining accuracy.
Figure 1. Fine-grained structured sparsity results in even load balancing, regular memory accesses, and 2x math efficiency with no loss in network accuracy.

To exploit fine-grained network pruning, the NVIDIA Ampere GPU architecture introduces the concept of fine-grained structured sparsity. On the NVIDIA A100 GPU, the structure manifests as a 2:4 pattern: out of every four elements, at least two must be zero. This reduces the data footprint and bandwidth of one matrix multiply (also known as GEMM) operand by 2x and doubles throughput by skipping the computation of the zero values using new NVIDIA Sparse Tensor Cores.
Figure 2 shows how NVIDIA Sparse Tensor Cores operate on only half of one input to double the math efficiency. On the left is a weight matrix pruned to meet the expected 2:4 sparse pattern: in each group of four weights (outlined in orange), only two weights are nonzero (shades of green). This matrix is compressed to half the size of the original matrix, plus a small amount of metadata that keeps track of where the nonzeros were in the original, uncompressed matrix. This metadata is used to select only the corresponding activations from the second input matrix, letting the NVIDIA Sparse Tensor Core skip computing multiplications by zero to achieve twice the throughput of a regular Tensor Core.

Figure 2. NVIDIA A100 GPU Sparse Tensor Cores perform twice the effective work in the same time as the GPU's standard third-generation Tensor Cores, as long as the sparse operand uses 2:4 structured sparsity.
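To make the 2:4 layout concrete, here is a small host-side sketch of the idea for a single group of four weights. It is illustrative only: the values, and the way the two nonzeros and their 2-bit positions are stored, are simplified stand-ins for the actual compressed format and metadata used by cuSPARSELt and the hardware.

#include <cstdio>
#include <cstdint>

// Illustrative 2:4 compression of one group of four weights.
// The real cuSPARSELt/hardware metadata format differs; this only shows the idea.
int main() {
    float group[4] = {0.0f, -1.5f, 0.25f, 0.0f};   // already 2:4 sparse: two zeros
    float acts[4]  = {0.7f,  0.2f, -0.4f, 1.1f};   // matching activations

    float   values[2];   // compressed nonzero weights (half the storage)
    uint8_t indices[2];  // 2-bit positions of the nonzeros within the group

    int n = 0;
    for (int i = 0; i < 4 && n < 2; ++i) {
        if (group[i] != 0.0f) {
            values[n]  = group[i];
            indices[n] = static_cast<uint8_t>(i);
            ++n;
        }
    }

    // The metadata selects only the two activations paired with nonzero weights,
    // so only two multiply-adds are performed instead of four.
    float acc = 0.0f;
    for (int j = 0; j < n; ++j)
        acc += values[j] * acts[indices[j]];

    std::printf("partial dot product = %f\n", acc);  // -1.5*0.2 + 0.25*(-0.4) = -0.4
    return 0;
}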
cuSPARSELt: A high-performance CUDA library for sparse matrix-dense matrix multiplication

To make it easy to use the NVIDIA Ampere architecture sparse capabilities, NVIDIA introduces cuSPARSELt, a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a sparse matrix. The cuSPARSELt library lets you use the NVIDIA third-generation Tensor Core Sparse Matrix Multiply-Accumulate (SpMMA) operation without the complexity of low-level programming. The library also provides helper functions for pruning and compressing matrices.
The key features of cuSPARSELt include the following:

- FP16 inputs/output, FP32 Tensor Core accumulation
- BFLOAT16 inputs/output, FP32 Tensor Core accumulation
- INT8 inputs/output, INT32 Tensor Core accumulation

The cuSPARSELt library follows an equivalent approach and adopts similar concepts to cuBLASLt and cuTENSOR. The library programming model requires organizing the computation in such a way that the same setup can be repeatedly used for different inputs.
In particular, the common workflow consists of the following steps (Figure 3):

1. Initialize the library handle: cusparseLtInit.
2. Initialize the matrix descriptors: cusparseLtDenseDescriptorInit, cusparseLtStructuredDescriptorInit.
3. Initialize the matrix multiplication descriptor: cusparseLtMatmulDescriptorInit.
4. Initialize the algorithm selection: cusparseLtMatmulAlgSelectionInit.
5. Initialize the matrix multiplication plan: cusparseLtMatmulPlanInit.
6. Prune the sparse matrix: cusparseLtSpMMAPrune. This step is not needed if the user provides a matrix that already satisfies the 2:4 structured sparsity constraint, such as a weight matrix generated by the ASP library.
7. Compress the pruned matrix: cusparseLtSpMMACompress.
8. Execute the matrix multiplication: cusparseLtMatmul. This step can be repeated multiple times with different inputs.
9. Destroy the plan and the library handle: cusparseLtMatmulPlanDestroy, cusparseLtDestroy.

As with dense matrix multiplication, the performance of sparse matrix multiplications varies with GEMM dimensions, layouts, and data types. Here's a snapshot of the relative performance of dense and sparse GEMMs with today's software.
The following charts show the performance of cuSPARSELt and cuBLAS for the following operation:

D = alpha*op(A) * op(B) + beta*C

In this operation, A, B, and D=C are dense matrices of sizes MxK, KxN, and MxN, respectively. We denote the layouts of the matrices A and B with N for column-major order (op is non-transposed) and T for row-major order (op is transposed).
To showcase the performance achievable with cuSPARSELt for a real workload, the following table shows some common GEMM sizes used by a pruned BERT-Large model (seqlen=128, BS=128) with column-major TN FP16 kernels. In general, the larger the workload is, the more that sparsity can help.
Layer | M    | N     | K    | Dense TFLOPs | Sparse TFLOPs | Speedup
QKV   | 3072 | 16384 | 1024 | 188          | 263           | 1.4x
PROJ  | 1024 | 16384 | 1024 | 183          | 242           | 1.3x
FC1   | 4096 | 16384 | 1024 | 196          | 270           | 1.4x
FC2   | 1024 | 16384 | 4096 | 211          | 339           | 1.6x

Table 1. cuSPARSELt performance for different layers of a BERT-Large model.

Structured sparse matrix-matrix multiplication code example

Now that you've seen the available performance, here's an example of performing a matrix multiplication with structured sparsity in the cuSPARSELt library using Sparse Tensor Cores on the NVIDIA A100 (GA100) GPU. For more information, see the NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/spmma GitHub repo.
First, include the cuSPARSELt header, set up some device pointers and data structures, and initialize the cuSPARSELt handle.
#include <cusparseLt.h>       // cusparseLt header

// Device pointers and coefficient definitions
float alpha = 1.0f;
float beta  = 0.0f;
__half* dA = ...
__half* dB = ...
__half* dC = ...

// cusparseLt data structures and handle initialization
cusparseLtHandle_t             handle;
cusparseLtMatDescriptor_t      matA, matB, matC;
cusparseLtMatmulDescriptor_t   matmul;
cusparseLtMatmulAlgSelection_t alg_sel;
cusparseLtMatmulPlan_t         plan;
cudaStream_t                   stream = nullptr;
cusparseLtInit(&handle);
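For readability, the listings in this post call the library without checking return values. Every cusparseLt* routine returns a cusparseStatus_t, so in real code you may want a small checking helper; the CHECK_CUSPARSE macro below is a local convention for this post, not part of the library.

#include <cstdio>    // std::printf
#include <cstdlib>   // EXIT_FAILURE

// Minimal error-checking helper (local convention, not a cuSPARSELt API).
#define CHECK_CUSPARSE(call)                                              \
    {                                                                     \
        cusparseStatus_t status_ = (call);                                \
        if (status_ != CUSPARSE_STATUS_SUCCESS) {                         \
            std::printf("cuSPARSELt error %d at %s:%d\n",                 \
                        static_cast<int>(status_), __FILE__, __LINE__);   \
            return EXIT_FAILURE;                                          \
        }                                                                 \
    }

// Example usage:
// CHECK_CUSPARSE( cusparseLtInit(&handle) )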
Next, initialize the structured sparse input matrix (matrix A), dense input matrix (matrix B), and dense output matrix (matrix C) descriptors.
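The descriptor calls that follow take matrix sizes, leading dimensions, a buffer alignment, a data type, and a memory order. These variables are left undefined in the snippets, so here is one way they might be set up for an FP16, column-major problem with neither operand transposed. The concrete sizes and the compute type (used later by the matmul descriptor) are illustrative assumptions, not values from the original example.

// Illustrative problem setup (assumed values): FP16 data, column-major layout,
// neither operand transposed. With these settings A is M x K, B is K x N,
// and C/D are M x N.
int     M = 4096, N = 4096, K = 4096;          // GEMM dimensions (example values)
int     num_A_rows = M, num_A_cols = K;        // A is M x K
int     num_B_rows = K, num_B_cols = N;        // B is K x N
int     num_C_rows = M, num_C_cols = N;        // C and D are M x N
int     lda = num_A_rows, ldb = num_B_rows, ldc = num_C_rows;  // column-major leading dimensions
unsigned alignment = 16;                       // byte alignment of the device buffers
cudaDataType        type  = CUDA_R_16F;        // FP16 inputs/outputs
cusparseOrder_t     order = CUSPARSE_ORDER_COL;
cusparseOperation_t opA   = CUSPARSE_OPERATION_NON_TRANSPOSE;
cusparseOperation_t opB   = CUSPARSE_OPERATION_NON_TRANSPOSE;
// Compute type used later in cusparseLtMatmulDescriptorInit; the enumerator
// name differs across cuSPARSELt releases (early headers: CUSPARSE_COMPUTE_16F,
// newer ones: CUSPARSE_COMPUTE_32F for FP16 inputs).
cusparseComputeType compute_type = CUSPARSE_COMPUTE_16F;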
cusparseLtStructuredDescriptorInit(&handle, &matA, num_A_rows, num_A_cols, lda,
                                   alignment, type, order,
                                   CUSPARSELT_SPARSITY_50_PERCENT);
cusparseLtDenseDescriptorInit(&handle, &matB, num_B_rows, num_B_cols, ldb,
                              alignment, type, order);
cusparseLtDenseDescriptorInit(&handle, &matC, num_C_rows, num_C_cols, ldc,
                              alignment, type, order);
With the descriptors ready, you can prepare the matrix multiplication operation's descriptor, select an algorithm to use to perform the matmul operation, and initialize the matmul plan.
cusparseLtMatmulDescriptorInit(&handle, &matmul, opA, opB,
                               &matA, &matB, &matC, &matC, compute_type);
cusparseLtMatmulAlgSelectionInit(&handle, &alg_sel, &matmul,
                                 CUSPARSELT_MATMUL_ALG_DEFAULT);
int alg = 0;    // set algorithm ID
cusparseLtMatmulAlgSetAttribute(&handle, &alg_sel,
                                CUSPARSELT_MATMUL_ALG_CONFIG_ID,
                                &alg, sizeof(alg));
size_t workspace_size, compressed_size;
cusparseLtMatmulGetWorkspace(&handle, &alg_sel, &workspace_size);
cusparseLtMatmulPlanInit(&handle, &plan, &matmul, &alg_sel, workspace_size);
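One detail worth noting: cusparseLtMatmulGetWorkspace only reports how many bytes of scratch space the selected algorithm needs; nothing is allocated for you. The final listing in this post passes a null workspace pointer, which is only safe when the reported size is zero, so a more general setup allocates the buffer explicitly (a minimal sketch):

// Allocate scratch space if the selected algorithm requires any.
// When workspace_size is 0 (as the final listing assumes by passing nullptr),
// this allocation is simply skipped.
void* d_workspace = nullptr;
if (workspace_size > 0)
    cudaMalloc(&d_workspace, workspace_size);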
If the sparse matrix hasn’t been pruned by another process, you can do it at this point. Don’t forget to check the validity of the sparsity pattern to make sure it can be accelerated with Sparse Tensor Cores.
cusparseLtSpMMAPrune(&handle, &matmul, dA, dA, CUSPARSELT_PRUNE_SPMMA_TILE, stream);

// checking the correctness
int is_valid = 0;
cusparseLtSpMMAPruneCheck(&handle, &matmul, dA, &is_valid, stream);
if (is_valid != 0) {
    std::printf("!!!! The matrix does not conform to the SpMMA sparsity pattern. "
                "cusparseLtMatmul does not provide correct results\n");
    return EXIT_FAILURE;
}
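CUSPARSELT_PRUNE_SPMMA_TILE is one of the pruning algorithms the library defines; a strip-based variant can be selected through the same argument of the same call, as sketched below.

// Alternative: prune in 1x4 strips instead of tiles (same entry point,
// different cusparseLtPruneAlg_t value).
cusparseLtSpMMAPrune(&handle, &matmul, dA, dA, CUSPARSELT_PRUNE_SPMMA_STRIP, stream);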
Now that matrix A has been pruned with 2:4 sparsity, you can compress it to roughly half of its original size. The execution time for this step is negligible compared to the actual matrix multiplication (less than 5%).
cusparseLtSpMMACompressedSize(&handle, &plan, &compressed_size);

__half* dA_compressed;   // device buffer for the compressed copy of A
cudaMalloc((void**) &dA_compressed, compressed_size);
cusparseLtSpMMACompress(&handle, &plan, dA, dA_compressed, stream);
With the setup complete, perform the matmul operation. The call to cusparseLtMatmul can be repeated many times with different B matrices. You only have to set up the sparse matrix one time. For use cases where the A matrix values change, the cusparseLtSpMMACompress routine must be called again to set up the data structures for the sparse matrix.
void* d_workspace = nullptr;
int num_streams = 0;
cudaStream_t* streams = nullptr;
__half* dD = dC;   // as in the operation described earlier, D and C refer to the same matrix
cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB, &beta, dC, dD,
                 d_workspace, streams, num_streams);
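Because the plan, the compressed A matrix, and the workspace do not depend on the activation values, the same call can then be issued once per batch; only the B and D pointers change. The buffer arrays and batch count below are hypothetical names used for illustration.

// Reuse the plan and the compressed sparse matrix across several inputs.
// dB_batches, dD_batches, and num_batches are hypothetical, preallocated
// device buffers and a counter standing in for the application's batching logic.
for (int i = 0; i < num_batches; ++i) {
    cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB_batches[i],
                     &beta, dC, dD_batches[i], d_workspace, streams, num_streams);
}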
Finally, clean up the used memory by destroying the matmul plan and cuSPARSELt handle.
cusparseLtMatmulPlanDestroy(&plan);
cusparseLtDestroy(&handle);
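The two calls above release the library's internal state; device buffers allocated with the CUDA runtime earlier in the example, such as the compressed matrix, still need to be freed separately, for example:

// Free device allocations made earlier in the example.
cudaFree(dA_compressed);
cudaFree(d_workspace);   // a no-op when no workspace was allocated (nullptr)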
Get started with cuSPARSELt

The cuSPARSELt library makes it easy to exploit NVIDIA Sparse Tensor Core operations, significantly improving the performance of matrix-matrix multiplication for deep learning applications without reducing the network's accuracy. The library also provides utilities for matrix compression, pruning, and performance auto-tuning. In short, cuSPARSELt reduces computation, power consumption, execution time, and memory storage compared to the common dense math approach.
The latest version of cuSPARSELt with NVIDIA Ampere architecture support can be found in NVIDIA GPU Accelerated Libraries. For more information about APIs, installation notes, new features, and examples, see cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication.