From the previous reduction optimization, the profiler's compute analysis may warn you about an inefficient kernel caused by divergent branches, as follows:
A divergence of 73.4% means that the kernel spends most of its branch executions on an inefficient path. We can determine that the reduction addressing is the issue, highlighted next:
__global__ void reduction_kernel(float* d_out, float* d_in, unsigned int size) {
    unsigned int idx_x = blockIdx.x * blockDim.x + threadIdx.x;

    extern __shared__ float s_data[];
    s_data[threadIdx.x] = (idx_x < size) ? d_in[idx_x] : 0.f;
    __syncthreads();

    // do reduction
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        // thread synchronous reduction
        if ((idx_x % (stride * 2)) == 0)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        d_out[blockIdx.x] = s_data[0];
}
When it comes to reduction addressing, we can select one of these CUDA thread indexing strategies:
- Interleaved addressing
- Sequential addressing
Let's review what each strategy is and compare their performance by implementing both. Since only the reduction kernel changes, we can reuse the host code for the next two implementations.