Determining divergence as a performance bottleneck

When profiling the reduction kernel from the previous optimization, you might find a warning in the compute analysis that the kernel is inefficient due to divergent branches, as follows:

A branch divergence of 73.4% means that the warps frequently take inefficient execution paths. We can determine that the reduction's addressing scheme is the issue, highlighted next:

__global__ void reduction_kernel(float* d_out, float* d_in, unsigned int size) {
    unsigned int idx_x = blockIdx.x * blockDim.x + threadIdx.x;

    extern __shared__ float s_data[];
    s_data[threadIdx.x] = (idx_x < size) ? d_in[idx_x] : 0.f;

    __syncthreads();

    // do reduction with interleaved addressing
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        // only threads whose index is a multiple of (stride * 2) are active,
        // so active threads are scattered across each warp (divergent branch)
        if ((idx_x % (stride * 2)) == 0)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];

        __syncthreads();
    }

    if (threadIdx.x == 0)
        d_out[blockIdx.x] = s_data[0];
}

When it comes to reduction addressing, we can select one of these CUDA thread indexing strategies:

  • Interleaved addressing
  • Sequential addressing

Let's review each strategy and compare their performance by implementing them. Since only the reduction kernel changes, we can reuse the host code for the next two implementations.
