@inproceedings{Dakkak:2019:ARS:3330345.3331057,
author = {Dakkak, Abdul and Li, Cheng and Xiong, Jinjun and Gelado, Isaac and Hwu, Wen-mei},
title = {Accelerating Reduction and Scan Using Tensor Core Units},
booktitle = {Proceedings of the ACM International Conference on Supercomputing},
series = {ICS '19},
year = {2019},
isbn = {978-1-4503-6079-1},
location = {Phoenix, Arizona},
pages = {46--57},
numpages = {12},
url = {http://doi.acm.org/10.1145/3330345.3331057},
doi = {10.1145/3330345.3331057},
acmid = {3331057},
publisher = {ACM},
address = {New York, NY, USA},
}
CMake version >= 3.8 is required (there is a known problem with Hunter when using CMake 3.10.2). To install CMake 3.10.1 on x86_64 Linux:
cd /tmp
wget https://cmake.org/files/v3.10/cmake-3.10.1-Linux-x86_64.sh
sudo sh cmake-3.10.1-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir
You may also want to remove the default installation: sudo apt-get remove cmake
On ppc64le you need to build CMake from source.
To compile the project, run the following commands:
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
If you get errors about nvcc not supporting your gcc version, point CMake at a supported host compiler:
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_HOST_COMPILER=`which gcc-6` ..
The following benchmarks are currently available.
Name                      Description
CUBLAS_WMMA_GEMM          cuBLAS GEMM w/ Tensor Cores. AType, BType = half; CType = float
CUBLAS_GEMM               cuBLAS GEMM w/o Tensor Cores. AType, BType = half; CType = float
CUTLASS_WGEMM             CUTLASS GEMM w/ Tensor Cores. AType, BType = half; CType = float
CUDA_WMMA_GEMM_NAIVE      Naive CUDA GEMM w/ Tensor Cores. AType, BType = half; CType = float
CUDA_WMMA_GEMM_SHAREDMEM  Shared-memory CUDA GEMM w/ Tensor Cores. AType, BType = half; CType = float
CUDA_WMMA_GEMM            Optimized CUDA GEMM (from the CUDA Samples). AType, BType = half; CType = float

Name                      Description
CUBLAS_WMMA_HGEMM         cuBLAS HGEMM w/ Tensor Cores. AType, BType = half; CType = half
CUBLAS_HGEMM              cuBLAS HGEMM w/o Tensor Cores. AType, BType = half; CType = half
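The w/ Tensor Cores variants are built on the CUDA WMMA API. For reference, here is a minimal sketch in the style of the naive variant (not the repository's exact kernel; indexing and layouts are illustrative) of a half-in/float-out WMMA GEMM, assuming M, N, and K are multiples of 16:

#include <mma.h>
using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A * B, with A and B in half
// precision and C accumulated in float. Assumes row-major A and C,
// column-major B, and M, N, K multiples of 16.
__global__ void wmma_gemm_naive(const half *A, const half *B, float *C,
                                int M, int N, int K) {
  // Tile coordinates handled by this warp.
  int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
  int warpN = blockIdx.y * blockDim.y + threadIdx.y;

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
  wmma::fill_fragment(acc_frag, 0.0f);

  // March along K in 16-wide steps, accumulating into the fragment.
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
    wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
  }
  wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc_frag, N,
                          wmma::mem_row_major);
}

The SHAREDMEM and optimized variants stage tiles through shared memory instead of loading fragments straight from global memory.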
Tuning WMMA_TILES_PER_WARP and WARPS_PER_BLOCK for CUDA_WMMA_SEGMENTED_PREFIXSUM_16 and CUDA_WMMA_SEGMENTED_PREFIXSUM_256 shows that WMMA_TILES_PER_WARP = 2 with WARPS_PER_BLOCK = 4 performs best.
Note: There's a bug in cub::WarpScan for LOGICAL_THREADS_PER_WARP = 16.
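For context, that corresponds to cub's LOGICAL_WARP_THREADS template parameter. A minimal, hypothetical usage of the affected configuration (16-wide logical warps, so two independent scans per hardware warp) looks like:

#include <cub/cub.cuh>

// Hypothetical illustration of the configuration the note refers to:
// cub::WarpScan with LOGICAL_WARP_THREADS = 16.
__global__ void warp_scan_16(int *data) {
  using WarpScan16 = cub::WarpScan<int, 16>;
  // One TempStorage per logical warp (two per 32-thread hardware warp).
  __shared__ typename WarpScan16::TempStorage temp_storage[2];
  int logical_warp_id = threadIdx.x / 16;
  int value = data[threadIdx.x];
  WarpScan16(temp_storage[logical_warp_id]).InclusiveSum(value, value);
  data[threadIdx.x] = value;
}
// launch as e.g. warp_scan_16<<<1, 32>>>(d_data);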
Tuning WMMA_TILES_PER_WARP and WARPS_PER_BLOCK for CUDA_WMMA_SEGMENTED_REDUCTION_16 and CUDA_WMMA_SEGMENTED_REDUCTION_256 shows that WMMA_TILES_PER_WARP = 1 with WARPS_PER_BLOCK = 8 performs best.
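These segmented primitives map reduction and scan onto Tensor Core matrix multiplies, following the paper cited above: multiplying a 16x16 data tile by an all-ones matrix yields row sums (segmented reduction), while multiplying by an upper-triangular ones matrix yields row-wise prefix sums. Below is a minimal sketch of the reduction case, not the repository's actual kernel; the kernel name, indexing, and output layout are illustrative.

#include <mma.h>
using namespace nvcuda;

#define WARPS_PER_BLOCK 8   // illustrative; the tuned value reported above
#define WMMA_TILES_PER_WARP 1

// Segmented reduction with 16-element segments: each warp loads a 16x16
// half tile (16 segments, one per row) and computes tile * ones. Every
// column of row i in the result then holds the sum of segment i.
__global__ void segmented_reduction_16(const half *in, float *out,
                                       int num_tiles) {
  int warp_id = blockIdx.x * WARPS_PER_BLOCK + threadIdx.x / warpSize;
  if (warp_id >= num_tiles) return;

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> data;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> ones;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

  wmma::fill_fragment(ones, __float2half(1.0f)); // all-ones matrix
  wmma::fill_fragment(acc, 0.0f);
  wmma::load_matrix_sync(data, in + warp_id * 256, 16);
  wmma::mma_sync(acc, data, ones, acc);

  // Store the full 16x16 result; out[warp_id * 256 + i * 16] is the sum of
  // segment i. (For prefix sum, replace `ones` with an upper-triangular
  // ones matrix loaded from memory; fill_fragment cannot build it.)
  wmma::store_matrix_sync(out + warp_id * 256, acc, 16, wmma::mem_row_major);
}

With WARPS_PER_BLOCK = 8 each block is launched with 8 * 32 threads; WMMA_TILES_PER_WARP presumably controls how many consecutive tiles each warp processes before exiting.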
Running Individual Benchmarks
You can benchmark each primitive individually using
./bench --benchmark_filter=[name_of_primitive]
For example:
./bench --benchmark_filter=WMMA_GEMM
Further controls over the benchmarks are explained via the --help option.
The above will output to stdout something like:
------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------
SGEMM/1000/1/1/-1/1 5 us 5 us 126475 K=1 M=1000 N=1 alpha=-1 beta=1
SGEMM/128/169/1728/1/0 539 us 534 us 1314 K=1.728k M=128 N=169 alpha=1 beta=0
SGEMM/128/729/1200/1/0 1042 us 1035 us 689 K=1.2k M=128 N=729 alpha=1 beta=0
SGEMM/192/169/1728/1/0 729 us 724 us 869 K=1.728k M=192 N=169 alpha=1 beta=0
SGEMM/256/169/1/1/1 9 us 9 us 75928 K=1 M=256 N=169 alpha=1 beta=1
SGEMM/256/729/1/1/1 35 us 35 us 20285 K=1 M=256 N=729 alpha=1 beta=1
SGEMM/384/169/1/1/1 18 us 18 us 45886 K=1 M=384 N=169 alpha=1 beta=1
SGEMM/384/169/2304/1/0 2475 us 2412 us 327 K=2.304k M=384 N=169 alpha=1 beta=0
SGEMM/50/1000/1/1/1 10 us 10 us 73312 K=1 M=50 N=1000 alpha=1 beta=1
SGEMM/50/1000/4096/1/0 6364 us 5803 us 100 K=4.096k M=50 N=1000 alpha=1 beta=0
SGEMM/50/4096/1/1/1 46 us 45 us 13491 K=1 M=50 N=4.096k alpha=1 beta=1
SGEMM/50/4096/4096/1/0 29223 us 26913 us 20 K=4.096k M=50 N=4.096k alpha=1 beta=0
SGEMM/50/4096/9216/1/0 55410 us 55181 us 10 K=9.216k M=50 N=4.096k alpha=1 beta=0
SGEMM/96/3025/1/1/1 55 us 51 us 14408 K=1 M=96 N=3.025k alpha=1 beta=1
SGEMM/96/3025/363/1/0 1313 us 1295 us 570 K=363 M=96 N=3.025k alpha=1 beta=0
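The slash-separated numbers in each benchmark name are the arguments the benchmark was registered with; comparing them against the UserCounters column, the format is SGEMM/M/N/K/alpha/beta. These binaries are built on Google Benchmark, so a hypothetical registration that would produce such lines looks like this (a sketch, not the repository's actual benchmark code):

#include <benchmark/benchmark.h>

// Sketch of how an SGEMM/M/N/K/alpha/beta line is produced: the
// slash-separated values come from Args, and the UserCounters column
// comes from state.counters.
static void SGEMM(benchmark::State &state) {
  const int M = state.range(0), N = state.range(1), K = state.range(2);
  const int alpha = state.range(3), beta = state.range(4);
  for (auto _ : state) {
    // ... run the cuBLAS SGEMM call for these sizes here ...
  }
  state.counters["M"] = M;
  state.counters["N"] = N;
  state.counters["K"] = K;
  state.counters["alpha"] = alpha;
  state.counters["beta"] = beta;
}
BENCHMARK(SGEMM)->Args({1000, 1, 1, -1, 1}); // shows up as SGEMM/1000/1/1/-1/1
BENCHMARK_MAIN();

Because the Args values appear in the benchmark name in order, the --benchmark_filter expressions above match on substrings of these names.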
Output as JSON using
./bench --benchmark_out_format=json --benchmark_out=test.json
or preferably
./bench --benchmark_out_format=json --benchmark_out=`hostname`.json
To reconfigure from a clean build directory:
mkdir -p build && cd build && rm -fr * && cmake -DCMAKE_BUILD_TYPE=Release ..
Disable CPU frequency scaling
If you see this error:
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
you might want to disable the CPU frequency scaling while running the benchmark:
sudo cpupower frequency-set --governor performance
./mybench
sudo cpupower frequency-set --governor powersave
Generate a figure from a plot specification using
python plot/plot.py plot/spec/full_reduction.yml
or generate all figures
Install nvidia-docker, then list the available benchmarks:
nvidia-docker run --rm raiproject/tensorcore_bench:latest bench --benchmark_list_tests
You can run benchmarks in the following way (probably with the --benchmark_filter flag).
nvidia-docker run --privileged --rm -v `readlink -f .`:/data -u `id -u`:`id -g` raiproject/tensorcore_bench:amd64-latest ./run_benchmarks.sh
--privileged is needed to set the NUMA policy for NUMA benchmarks.
-v `readlink -f .`:/data maps the current directory into the container as /data.
--benchmark_out=/data/`hostname`.json tells the bench binary to write its output to /data, which is mapped to the current directory.
-u `id -u`:`id -g` tells Docker to run as the current user ID and group ID, so files the container produces can be modified from the host system without root permissions.