GitHub - c3sr/tcu_scope

Accelerating Reduction and Scan Using Tensor Core Units
@inproceedings{Dakkak:2019:ARS:3330345.3331057,
 author = {Dakkak, Abdul and Li, Cheng and Xiong, Jinjun and Gelado, Isaac and Hwu, Wen-mei},
 title = {Accelerating Reduction and Scan Using Tensor Core Units},
 booktitle = {Proceedings of the ACM International Conference on Supercomputing},
 series = {ICS '19},
 year = {2019},
 isbn = {978-1-4503-6079-1},
 location = {Phoenix, Arizona},
 pages = {46--57},
 numpages = {12},
 url = {http://doi.acm.org/10.1145/3330345.3331057},
 doi = {10.1145/3330345.3331057},
 acmid = {3331057},
 publisher = {ACM},
 address = {New York, NY, USA},
}

CMake version 3.8 or later is required. (Hunter has a known problem with CMake 3.10.2.)

  cd /tmp
  wget https://cmake.org/files/v3.10/cmake-3.10.1-Linux-x86_64.sh
  sudo sh cmake-3.10.1-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir

You may also want to remove the default installation:

  sudo apt-get remove cmake

On ppc64le, you need to install CMake from source.

To compile the project, run the following commands:

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

If you get errors about nvcc not supporting your gcc compiler, you may want to use:

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_HOST_COMPILER=`which gcc-6` ..

The following benchmarks are currently available:

Name                       Description
-------------------------  ------------------------------------------------------------------------------
CUBLAS_WMMA_GEMM           CUBLAS GEMM w/ Tensor Cores. AType, BType = half, CType = float
CUBLAS_GEMM                CUBLAS GEMM w/o Tensor Cores. AType, BType = half, CType = float
CUTLASS_WGEMM              CUTLASS GEMM w/ Tensor Cores. AType, BType = half, CType = float
CUDA_WMMA_GEMM_NAIVE       Naive CUDA GEMM w/ Tensor Cores. AType, BType = half, CType = float
CUDA_WMMA_GEMM_SHAREDMEM   Shared memory CUDA GEMM w/ Tensor Cores. AType, BType = half, CType = float
CUDA_WMMA_GEMM             Optimized CUDA GEMM (from CUDA Samples). AType, BType = half, CType = float

Name                Description
------------------  ------------------------------------------------------------------
CUBLAS_WMMA_HGEMM   CUBLAS HGEMM w/ Tensor Cores. AType, BType = half, CType = half
CUBLAS_HGEMM        CUBLAS HGEMM w/o Tensor Cores. AType, BType = half, CType = half

inType = half, outType = half

WMMA_TILES_PER_WARP and WARPS_PER_BLOCK tuning for CUDA_WMMA_SEGMENTED_PREFIXSUM_16 and CUDA_WMMA_SEGMENTED_PREFIXSUM_256:

(WMMA_TILES_PER_WARP = 2, WARPS_PER_BLOCK = 4) performs best.

Note: There's a bug in cub::WarpScan for LOGICAL_THREADS_PER_WARP = 16.

WMMA_TILES_PER_WARP and WARPS_PER_BLOCK tuning for CUDA_WMMA_SEGMENTED_REDUCTION_16 and CUDA_WMMA_SEGMENTED_REDUCTION_256:

(WMMA_TILES_PER_WARP = 1, WARPS_PER_BLOCK = 8) performs best.

Running Individual Benchmarks

You can benchmark each primitive individually using:

./bench --benchmark_filter=[name_of_primitive]

For example:

./bench --benchmark_filter=WMMA_GEMM

Further controls over the benchmarks are described in the --help output.
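
The filter is a regular expression that is matched against the registered benchmark names. The sketch below previews which names a pattern would select. It is only an approximation: the name list is copied from the tables in this README (the names the binary registers may carry extra suffixes), the helper `select` is hypothetical, and Python's `re` dialect is assumed to agree with the benchmark library for simple patterns like these.

```python
import re

# Benchmark names copied from the tables in this README. The names actually
# registered by ./bench may differ (for example, with argument suffixes).
NAMES = [
    "CUBLAS_WMMA_GEMM",
    "CUBLAS_GEMM",
    "CUTLASS_WGEMM",
    "CUDA_WMMA_GEMM_NAIVE",
    "CUDA_WMMA_GEMM_SHAREDMEM",
    "CUDA_WMMA_GEMM",
    "CUBLAS_WMMA_HGEMM",
    "CUBLAS_HGEMM",
]


def select(pattern, names=NAMES):
    """Return the names in which the regular expression matches anywhere."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


if __name__ == "__main__":
    for name in select("WMMA_GEMM"):
        print(name)
```

With this list, `WMMA_GEMM` selects the four GEMM variants that use Tensor Cores through WMMA and skips CUBLAS_WMMA_HGEMM, whose name contains `WMMA_HGEMM` rather than `WMMA_GEMM`. Use --benchmark_list_tests together with the filter to confirm the selection against the real binary.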

Benchmark all the primitives

Running all the benchmarks prints output to stdout similar to the following:

------------------------------------------------------------------------------
Benchmark                       Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------
SGEMM/1000/1/1/-1/1             5 us          5 us     126475 K=1 M=1000 N=1 alpha=-1 beta=1
SGEMM/128/169/1728/1/0        539 us        534 us       1314 K=1.728k M=128 N=169 alpha=1 beta=0
SGEMM/128/729/1200/1/0       1042 us       1035 us        689 K=1.2k M=128 N=729 alpha=1 beta=0
SGEMM/192/169/1728/1/0        729 us        724 us        869 K=1.728k M=192 N=169 alpha=1 beta=0
SGEMM/256/169/1/1/1             9 us          9 us      75928 K=1 M=256 N=169 alpha=1 beta=1
SGEMM/256/729/1/1/1            35 us         35 us      20285 K=1 M=256 N=729 alpha=1 beta=1
SGEMM/384/169/1/1/1            18 us         18 us      45886 K=1 M=384 N=169 alpha=1 beta=1
SGEMM/384/169/2304/1/0       2475 us       2412 us        327 K=2.304k M=384 N=169 alpha=1 beta=0
SGEMM/50/1000/1/1/1            10 us         10 us      73312 K=1 M=50 N=1000 alpha=1 beta=1
SGEMM/50/1000/4096/1/0       6364 us       5803 us        100 K=4.096k M=50 N=1000 alpha=1 beta=0
SGEMM/50/4096/1/1/1            46 us         45 us      13491 K=1 M=50 N=4.096k alpha=1 beta=1
SGEMM/50/4096/4096/1/0      29223 us      26913 us         20 K=4.096k M=50 N=4.096k alpha=1 beta=0
SGEMM/50/4096/9216/1/0      55410 us      55181 us         10 K=9.216k M=50 N=4.096k alpha=1 beta=0
SGEMM/96/3025/1/1/1            55 us         51 us      14408 K=1 M=96 N=3.025k alpha=1 beta=1
SGEMM/96/3025/363/1/0        1313 us       1295 us        570 K=363 M=96 N=3.025k alpha=1 beta=0
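
To turn a row of this output into a throughput figure, the following sketch may help. It assumes the name fields are ordered M/N/K/alpha/beta, which is what the user counters in the sample rows suggest, and it uses the conventional 2*M*N*K operation count for GEMM, ignoring the alpha and beta scaling. The helper names are hypothetical and not part of this repository.

```python
def parse_sgemm_name(name):
    """Split a name such as 'SGEMM/128/169/1728/1/0' into its fields."""
    label, m, n, k, alpha, beta = name.split("/")
    return {
        "label": label,
        "M": int(m),
        "N": int(n),
        "K": int(k),
        "alpha": int(alpha),
        "beta": int(beta),
    }


def gemm_gflops(m, n, k, time_us):
    """Throughput in GFLOP/s, counting one multiply and one add per inner-product term."""
    flops = 2 * m * n * k
    seconds = time_us * 1e-6
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Second row of the sample output: 539 us wall-clock time.
    shape = parse_sgemm_name("SGEMM/128/169/1728/1/0")
    print(round(gemm_gflops(shape["M"], shape["N"], shape["K"], 539), 1))  # about 138.7
```

The Time column is rounded to whole microseconds, so results for the very short runs (5 us, 9 us) carry a large relative error. The JSON output keeps more digits.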

Output as JSON using:

./bench --benchmark_out_format=json --benchmark_out=test.json

or preferably:

./bench --benchmark_out_format=json --benchmark_out=`hostname`.json
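
The JSON report can then be post-processed with a few lines of Python. This is a minimal sketch, assuming the usual Google Benchmark report layout: a top-level "benchmarks" list whose entries carry "name", "real_time", "cpu_time", "time_unit" and "iterations". The function name `summarize` is hypothetical.

```python
import json
import sys


def summarize(report):
    """Return (name, real_time, time_unit) tuples, slowest benchmark first."""
    rows = [
        (entry["name"], entry["real_time"], entry.get("time_unit", "ns"))
        for entry in report.get("benchmarks", [])
    ]
    # Sorting by the raw number assumes all entries share one time unit.
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Usage: python summarize.py test.json
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for name, real_time, unit in summarize(json.load(handle)):
                print(f"{name:50s} {real_time:14.1f} {unit}")
```

Naming the file after the host, as in the second command, makes it easy to keep reports from several machines side by side and feed each one through the same script.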

To reconfigure from a clean build directory:

mkdir -p build && cd build && rm -fr * && cmake -DCMAKE_BUILD_TYPE=Release ..

Disable CPU frequency scaling

If you see this warning:

***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

you might want to disable the CPU frequency scaling while running the benchmark:

sudo cpupower frequency-set --governor performance
./mybench
sudo cpupower frequency-set --governor powersave
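
To check the governor before a run, a small script can read the Linux cpufreq entries in sysfs. This is a sketch, assuming the standard `/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor` layout and assuming that any governor other than `performance` is what triggers the warning. Both helper functions are hypothetical.

```python
from pathlib import Path

# Relative pattern for the per-CPU governor files under the sysfs CPU root.
GOVERNOR_GLOB = "cpu[0-9]*/cpufreq/scaling_governor"


def read_governors(root="/sys/devices/system/cpu"):
    """Map each CPU directory name (cpu0, cpu1, ...) to its current governor."""
    return {
        path.parent.parent.name: path.read_text().strip()
        for path in sorted(Path(root).glob(GOVERNOR_GLOB))
    }


def scaling_enabled(governors):
    """True if at least one CPU uses a governor other than 'performance'."""
    return any(governor != "performance" for governor in governors.values())


if __name__ == "__main__":
    governors = read_governors()
    if not governors:
        print("no cpufreq information found")
    elif scaling_enabled(governors):
        print("CPU frequency scaling is enabled:", sorted(set(governors.values())))
    else:
        print("all CPUs use the performance governor")
```

On machines without cpufreq support (some virtual machines and containers) the dictionary is empty, and the script reports that instead of guessing.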

Generate a figure from a plot specification:

python plot/plot.py plot/spec/full_reduction.yml

or generate all figures

Install nvidia-docker, then list the available benchmarks:

nvidia-docker run  --rm raiproject/tensorcore_bench:latest bench --benchmark_list_tests

You can run benchmarks in the following way (probably with the --benchmark_filter flag).

nvidia-docker run --privileged --rm -v `readlink -f .`:/data -u `id -u`:`id -g` raiproject/tensorcore_bench:amd64-latest ./run_benchmarks.sh
