CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops

One of the most common tasks in CUDA programming is to parallelize a loop using a kernel. As an example, let’s use our old friend SAXPY. Here’s the basic sequential implementation, which uses a for loop. To efficiently parallelize this, we need to launch enough threads to fully utilize the GPU.

void saxpy(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Common CUDA guidance is to launch one thread per data element, which means to parallelize the above SAXPY loop we write a kernel that assumes we have enough threads to more than cover the array size.

__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) 
        y[i] = a * x[i] + y[i];
}

I’ll refer to this style of kernel as a monolithic kernel, because it assumes a single large grid of threads to process the entire array in one pass. You might use the following code to launch the saxpy kernel to process one million elements.

// Perform SAXPY on 1M elements
saxpy<<<4096,256>>>(1<<20, 2.0, x, y);
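The grid size is hard-coded here because 4096 * 256 is exactly 2^20. More generally, the block count for a monolithic kernel is computed from n with a rounded-up division, as in this sketch (x and y are assumed to be device pointers, and the names are illustrative):

int n = 1 << 20;
int blockSize = 256;
// Round up so the grid covers every element; the kernel's
// bounds check (i < n) handles the surplus threads in the last block.
int numBlocks = (n + blockSize - 1) / blockSize;
saxpy<<<numBlocks, blockSize>>>(n, 2.0f, x, y);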

Instead of completely eliminating the loop when parallelizing the computation, I recommend using a grid-stride loop, as in the following kernel.

__global__
void saxpy(int n, float a, float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; 
         i < n; 
         i += blockDim.x * gridDim.x) 
    {
        y[i] = a * x[i] + y[i];
    }
}

Rather than assume that the thread grid is large enough to cover the entire data array, this kernel loops over the data array one grid-size at a time.
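For example, the same kernel runs correctly even when launched with far fewer threads than there are elements; the launch configuration below is only an illustration:

// Grid of 32 * 256 = 8192 threads for 1M elements:
// each thread processes (1<<20) / 8192 = 128 elements.
saxpy<<<32, 256>>>(1<<20, 2.0f, x, y);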

Notice that the stride of the loop is blockDim.x * gridDim.x, which is the total number of threads in the grid. So if there are 1280 threads in the grid, thread 0 will compute elements 0, 1280, 2560, etc. This is why I call this a grid-stride loop. By using a loop with stride equal to the grid size, we ensure that all addressing within warps is unit-stride, so we get maximum memory coalescing, just as in the monolithic version.

When launched with a grid large enough to cover all iterations of the loop, the grid-stride loop should have essentially the same instruction cost as the if statement in the monolithic kernel, because the loop increment will only be evaluated when the loop condition evaluates to true.

There are several benefits to using a grid-stride loop. First, scalability and thread reuse: because the kernel no longer depends on launching one thread per element, you can size the grid for the hardware, for example by matching the number of blocks to the number of multiprocessors, and reuse those threads across many elements. Second, easy debugging: you can serialize the computation by launching a single block with a single thread, without changing any kernel code. Third, portability and readability: the kernel body closely resembles the original sequential loop. The first two points are sketched below.
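Here is a minimal sketch of hardware-sized launches and single-thread debugging, assuming x and y are device pointers and devId is the device in use; the multiplier of 32 is just an occupancy heuristic to tune:

int devId = 0;   // assumed device index
int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, devId);

// Size the grid to the hardware, not to the problem size
saxpy<<<32 * numSMs, 256>>>(1<<20, 2.0, x, y);

// Debugging: serialize the computation with a single thread
saxpy<<<1, 1>>>(1<<20, 2.0, x, y);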

Grid-stride loops are a great way to make your CUDA kernels flexible, scalable, and debuggable. While the examples in this post have all used CUDA C++, the same concepts apply in other CUDA languages such as CUDA Fortran.

I’d like to thank Justin Luitjens from the NVIDIA Developer Technology group for the idea and many of the details in this CUDA Pro Tip.

