From the previous reduction optimization, the profiler's compute analysis may warn you about an inefficient kernel caused by divergent branches, as follows:
A divergence of 73.4% means that the kernel spends most of its branch executions on an inefficient path. We can determine that the reduction addressing is the issue, highlighted next:
__global__ void reduction_kernel(float* d_out, float* d_in, unsigned int size) {
    unsigned int idx_x = blockIdx.x * blockDim.x + threadIdx.x;

    extern __shared__ float s_data[];
    s_data[threadIdx.x] = (idx_x < size) ? d_in[idx_x] : 0.f;
    __syncthreads();

    // do reduction
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        // thread synchronous reduction
        if ((idx_x % (stride * 2)) == 0)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        d_out[blockIdx.x] = s_data[0];
}
When it comes to reduction addressing, we can select one of these CUDA thread indexing strategies:
- Interleaved addressing
- Sequential addressing
Let's review what each strategy is and compare their performance by implementing both. Since only the reduction kernel changes, we can reuse the host code for the next two implementations.