Using Nsight to understand the warp lockstep property in CUDA

We will now use Nsight to step through some code to better understand part of the CUDA GPU architecture and how branching within a kernel is handled. This will give us some insight into how to write more efficient CUDA kernels. By branching, we mean how the GPU handles control flow statements such as if, else, or switch within a CUDA kernel. In particular, we are interested in how branch divergence is handled, which is what happens when one thread in a kernel satisfies the condition of an if statement and executes its body, while another thread does not and executes the else branch instead: the threads are divergent because they are executing different pieces of code.

Let's write a small CUDA-C program as an experiment: we will start with a small kernel that prints one output if its threadIdx.x value is even and another if it is odd. We then write a main function that will launch this kernel over one single block consisting of 32 different threads:

#include <cuda_runtime.h>
#include <stdio.h>

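__global__ void divergence_test_ker()
{
    // Even and odd threads take different branches of this if statement.
    if( threadIdx.x % 2 == 0)
        printf("threadIdx.x %d : This is an even thread.\n", threadIdx.x);
    else
        printf("threadIdx.x %d : This is an odd thread.\n", threadIdx.x);
}

__host__ int main()
{
    cudaSetDevice(0);
    // Launch the kernel over a single block of 32 threads.
    divergence_test_ker<<<1, 32>>>();
    // Wait for the kernel to finish so that its printf output is flushed.
    cudaDeviceSynchronize();
    cudaDeviceReset();
}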

(This code is also available as divergence_test.cu in the repository.)

If we compile and run this from the command line, we might naively expect the strings from even and odd threads to alternate, or perhaps to be randomly interleaved; since all of the threads run concurrently and branch at about the same time, this would make sense.

Instead, every single time we run this, we get the same output: all of the strings corresponding to even threads are printed first, and all of the strings corresponding to odd threads are printed second.

Perhaps the Nsight debugger can shed some light on this; let's import this little program into an Nsight project as we did in the last section, putting a breakpoint at the first if statement in our kernel. We will then do a step over, so that the debugger stops at the first printf statement. Since the default thread in Nsight is (0,0,0), which satisfies the condition of the if statement, it will remain at this printf until the debugger continues.

Let's switch over to an odd thread, say (1,0,0), and see where it is in our program now:

Very strange! Thread (1,0,0) is also at the same place in execution as thread (0,0,0). Indeed, if we check every single other odd thread here, it will be stuck in the same place—at a printf statement that all of the odd threads should have skipped right past.

What gives? This is known as the warp lockstep property. A warp in the CUDA architecture is a unit of 32 "lanes" on which the GPU executes the threads of a kernel, with each lane executing a single thread. A major limitation of warps is that all threads executing on a single warp must step through the exact same code in lockstep; this means that every thread does indeed run the same code, but just ignores the steps that are not applicable to it. (This is called lockstep because it's like a group of soldiers marching in unison, whether they want to march or not!)
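To make the relationship between threads, warps, and lanes more concrete, here is a minimal sketch of a kernel in which each thread works out which warp and lane it belongs to. The helper functions warp_index and lane_index and the kernel name warp_lane_ker are hypothetical names introduced only for this illustration; the sketch assumes a one-dimensional block, where consecutive threadIdx.x values are grouped into warps, and it relies on the built-in warpSize variable rather than hard-coding 32.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helpers (not from the original text): map a linear thread
// index within a 1D block to its warp index and to its lane in that warp.
__host__ __device__ inline int warp_index(int tid, int warp_size)
{
    return tid / warp_size;
}

__host__ __device__ inline int lane_index(int tid, int warp_size)
{
    return tid % warp_size;
}

__global__ void warp_lane_ker()
{
    // warpSize is a built-in device variable (32 on current NVIDIA GPUs).
    int w = warp_index(threadIdx.x, warpSize);
    int l = lane_index(threadIdx.x, warpSize);

    // Print only the first and last lane of each warp to keep the output short.
    if (l == 0 || l == warpSize - 1)
        printf("threadIdx.x %d : warp %d, lane %d\n", threadIdx.x, w, l);
}

int main()
{
    cudaSetDevice(0);
    // 64 threads in one block, which the GPU splits into two warps of 32.
    warp_lane_ker<<<1, 64>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
```

With a launch of 64 threads, threads 0 through 31 should report one warp and threads 32 through 63 another. The order in which the two warps print is not something this sketch controls.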

The lockstep property implies that if one single thread running on a warp diverges from all 31 other threads in a single if statement, all 31 other threads have their execution delayed until this single anomalous thread finishes and returns from its solitary if divergence. This is a property that you should always keep in mind when writing kernels, and why branch divergence should be minimized as much as possible as a general rule in CUDA programming.
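One way to reduce divergence, when the algorithm allows us to choose which threads do which work, is to branch on a condition that all 32 threads of a warp agree on. The following is a rough sketch of that idea rather than a general recipe: instead of branching on whether threadIdx.x is even or odd, it branches on whether the thread's warp index is even or odd. The names takes_first_branch_divergent, takes_first_branch_aligned, and aligned_branch_ker are hypothetical and introduced only for this example, which again assumes a one-dimensional block.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Divergent condition from our experiment: neighboring lanes of the same
// warp disagree, so the warp has to step through both branches.
__host__ __device__ inline bool takes_first_branch_divergent(int tid)
{
    return tid % 2 == 0;
}

// Warp-aligned condition (hypothetical alternative): every thread in a
// given warp gets the same answer, so each warp follows only one branch.
__host__ __device__ inline bool takes_first_branch_aligned(int tid, int warp_size)
{
    return (tid / warp_size) % 2 == 0;
}

__global__ void aligned_branch_ker()
{
    if (takes_first_branch_aligned(threadIdx.x, warpSize))
        printf("threadIdx.x %d : This thread is in an even warp.\n", threadIdx.x);
    else
        printf("threadIdx.x %d : This thread is in an odd warp.\n", threadIdx.x);
}

int main()
{
    cudaSetDevice(0);
    // Two warps of 32 threads: one warp takes each branch as a whole.
    aligned_branch_ker<<<1, 64>>>();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
```

Here the work is still split evenly between the two branches, but no warp contains threads on both sides of the if statement, so no lanes have to sit idle while the others run. This only helps when the work can be regrouped in this way; a branch that truly depends on per-thread data cannot always be restructured like this.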
