Interleaved addressing

In this strategy, consecutive CUDA threads fetch input data using interleaved addressing. Compared to the previous version, CUDA threads access input data with a stride value that doubles at every step of the reduction. The following diagram shows how CUDA threads are interleaved with the reduction items:

This interleaved addressing can be implemented as follows:

__global__ void
interleaved_reduction_kernel(float* g_out, float* g_in, unsigned int size) {
    unsigned int idx_x = blockIdx.x * blockDim.x + threadIdx.x;

    extern __shared__ float s_data[];
    s_data[threadIdx.x] = (idx_x < size) ? g_in[idx_x] : 0.f;
    __syncthreads();

    // do reduction
    // interleaved addressing
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * threadIdx.x;
        if (index < blockDim.x)
            s_data[index] += s_data[index + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        g_out[blockIdx.x] = s_data[0];
}

Run the following command to compile the preceding code:

$ nvcc -run -m64 -gencode arch=compute_70,code=sm_70 -I/usr/local/cuda/samples/common/inc -o reduction ./reduction.cpp ./reduction_kernel_interleaving.cu

The measured kernel execution time is 0.446 ms on a Tesla V100. It is slower than the previous version because the thread blocks are not fully utilized in this approach; profiling the kernel's metrics would give more detail.

Now we will try another addressing approach, which is designed so that each thread block computes more data.
