Interleaved addressing

In this strategy, consecutive CUDA threads fetch input data using interleaved addressing. Compared to the previous version, CUDA threads access input data with a stride value that doubles at every step of the reduction. The following diagram shows how CUDA threads are interleaved with the reduction items:

This interleaved addressing can be implemented as follows:

__global__ void
interleaved_reduction_kernel(float* g_out, float* g_in, unsigned int size) {
    unsigned int idx_x = blockIdx.x * blockDim.x + threadIdx.x;

    extern __shared__ float s_data[];
    s_data[threadIdx.x] = (idx_x < size) ? g_in[idx_x] : 0.f;
    __syncthreads();

    // do reduction
    // interleaved addressing
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * threadIdx.x;
        if (index < blockDim.x)
            s_data[index] += s_data[index + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        g_out[blockIdx.x] = s_data[0];
}

Run the following command to compile the preceding code:

$ nvcc -run -m64 -gencode arch=compute_70,code=sm_70 -I/usr/local/cuda/samples/common/inc -o reduction ./reduction.cpp ./reduction_kernel_interleaving.cu

The measured kernel execution time is 0.446 ms on a Tesla V100. It is slower than the previous version because the thread blocks are not fully utilized in this approach; profiling the kernel's metrics would give more detail.

Now we will try another addressing approach, which is designed so that each thread block computes more data.
