Using printf for debugging

Let's go over an example to see how we can approach debugging a CUDA kernel with printf before we move on. There is no exact science to this method, but it is a skill that can be learned through experience. We will start with a CUDA kernel for matrix-matrix multiplication that has several bugs in it. (The reader is encouraged to go through the code as we go along; it is available as the broken_matrix_ker.py file in this chapter's directory within the repository.)

Let's briefly review matrix-matrix multiplication before we continue. Suppose we have two matrices, $A$ and $B$, and we multiply these together to get another matrix, $C$, of the same size, as follows: $AB = C$. We do this by iterating over all tuples $(i, j)$ and setting the value of $C_{i,j}$ to the dot product of the $i$th row of $A$ and the $j$th column of $B$.

In other words, we set each $(i, j)$ element of the output matrix $C$ as follows:

$$C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \, B_{k,j}$$

Suppose we have already written a kernel that performs matrix-matrix multiplication, which takes in two arrays representing the input matrices, a preallocated float array that the output will be written to, and an integer that indicates the height and width of each matrix (we will assume that all matrices are square and of the same size). These matrices are all represented as flat float * arrays in a row-wise one-dimensional layout. Furthermore, this is implemented so that each CUDA thread handles a single row/column tuple of the output matrix.
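To make the layout concrete: in a row-wise (row-major) layout, element $(i, j)$ of an $N \times N$ matrix sits at flat offset i*N + j. Here is a minimal CPU reference sketch of the computation we expect the kernel to perform (the function and variable names here are illustrative, not taken from broken_matrix_ker.py):

// CPU reference: C = A * B for square N x N matrices in row-major layout.
// Element (i, j) of a matrix lives at flat offset i*N + j.
void matmul_reference(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            float val = 0.0f;
            for (int k = 0; k < N; k++)
                val += A[i*N + k] * B[k*N + j];
            C[i*N + j] = val;
        }
}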

We make a small test case that checks the CUDA output against NumPy's matmul on two 4 x 4 matrices, and it fails the assertion check, as follows:

test_a = np.float32([range(1, 5)] * 4)        # four identical rows: [1, 2, 3, 4]
test_b = np.float32([range(14, 10, -1)] * 4)  # four identical rows: [14, 13, 12, 11]
output_mat = np.matmul(test_a, test_b)

test_a_gpu = gpuarray.to_gpu(test_a)
test_b_gpu = gpuarray.to_gpu(test_b)
output_mat_gpu = gpuarray.empty_like(test_a_gpu)

matrix_ker(test_a_gpu, test_b_gpu, output_mat_gpu, np.int32(4), block=(2,2,1), grid=(2,2,1))

assert( np.allclose(output_mat_gpu.get(), output_mat) )

If we run this program now, we will unsurprisingly get an AssertionError raised from the final check.

Let's now look at the CUDA C code, which consists of a kernel and a device function:

ker = SourceModule('''
// row-column dot-product for matrix multiplication
__device__ float rowcol_dot(float *matrix_a, float *matrix_b, int row, int col, int N)
{
    float val = 0;

    for (int k = 0; k < N; k++)
    {
        val += matrix_a[ row + k*N ] * matrix_b[ col*N + k ];
    }

    return(val);
}

// matrix multiplication kernel that is parallelized over row/column tuples
__global__ void matrix_mult_ker(float * matrix_a, float * matrix_b, float * output_matrix, int N)
{
    int row = blockIdx.x + threadIdx.x;
    int col = blockIdx.y + threadIdx.y;

    output_matrix[col + row*N] = rowcol_dot(matrix_a, matrix_b, col, row, N);
}
''')
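For completeness, the host code also needs to pull the compiled kernel out of the module with PyCUDA's get_function; this is where the matrix_ker handle used in the test above would come from:

matrix_ker = ker.get_function('matrix_mult_ker')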

Our goal is to place printf invocations intelligently throughout our CUDA code so that we can monitor a number of appropriate values and variables in the kernel and device function; we should also be sure to print out the thread and block numbers alongside these values at every printf invocation.

Let's start at the entry point of our kernel. We see two variables, row and col, so we should check these right away. Let's put the following line right after we set them (since this is parallelized over two dimensions, we should print the x and y values of threadIdx and blockIdx):

printf("threadIdx.x,y: %d,%d blockIdx.x,y: %d,%d -- row is %d, col is %d.\n", threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y, row, col);

Running the code again, we get this output:
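Since the kernel was launched over a (2, 2, 1) block and a (2, 2, 1) grid, and the index computation above is deterministic, we should expect sixteen lines along these lines (the ordering of threads is not guaranteed):

threadIdx.x,y: 0,0 blockIdx.x,y: 0,0 -- row is 0, col is 0.
threadIdx.x,y: 1,0 blockIdx.x,y: 0,0 -- row is 1, col is 0.
threadIdx.x,y: 0,0 blockIdx.x,y: 1,0 -- row is 1, col is 0.
threadIdx.x,y: 1,0 blockIdx.x,y: 1,0 -- row is 2, col is 0.
...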

There are two things that are immediately salient: there are repeated values for row and column tuples (each individual tuple should appear only once), and the row and column values never exceed two, when they should both reach three (since this unit test uses 4 x 4 matrices). This indicates that we are calculating the row and column values incorrectly; indeed, we forgot to multiply the blockIdx values by the blockDim values to find the actual row/column values. We fix this as follows:

int row = blockIdx.x*blockDim.x + threadIdx.x;
int col = blockIdx.y*blockDim.y + threadIdx.y;

If we run the program again, though, we still get an assertion error. Let's keep our original printf invocation in place so that we can continue monitoring the values. We see that the kernel invokes a device function, rowcol_dot, so we decide to look there next. Let's first ensure that the variables are being passed into the device function correctly by putting this printf invocation at its beginning:

printf("threadIdx.x,y: %d,%d blockIdx.x,y: %d,%d -- row is %d, col is %d, N is %d.\n", threadIdx.x, threadIdx.y, blockIdx.x, blockIdx.y, row, col, N);

When we run our program, even more lines come out. However, we will see one line that reads threadIdx.x,y: 0,0 blockIdx.x,y: 1,0 -- row is 2, col is 0. (printed from the kernel) and another that reads threadIdx.x,y: 0,0 blockIdx.x,y: 1,0 -- row is 0, col is 2, N is 4. (printed from the device function). By the threadIdx and blockIdx values, we can see that these come from the same thread in the same block, but with the row and col values reversed. Indeed, when we look at the invocation of the rowcol_dot device function, we see that row and col are passed in the reverse order from that in the declaration of the device function. We fix this, but when we run the program again, we get yet another assertion error.
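That is, the call in the kernel should pass row before col, matching the device function's parameter order:

output_matrix[col + row*N] = rowcol_dot(matrix_a, matrix_b, row, col, N);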

Let's place another printf invocation in the device function, within the for loop; this loop, of course, is what computes the dot product between a row of matrix A and a column of matrix B. We will check the values of the matrix elements we are multiplying, as well as k; we will also restrict the printout to the very first thread, or else we will get an incoherent mess of an output:

if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.x == 0 && blockIdx.y == 0)
    printf("Dot-product loop: k value is %d, matrix_a value is %f, matrix_b is %f.\n", k, matrix_a[ row + k*N ], matrix_b[ col*N + k ]);

Let's look at the values of the A and B matrices that are set up for our unit tests before we continue:
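These follow directly from the test setup at the start of this section:

test_a:
[[ 1.  2.  3.  4.]
 [ 1.  2.  3.  4.]
 [ 1.  2.  3.  4.]
 [ 1.  2.  3.  4.]]

test_b:
[[ 14.  13.  12.  11.]
 [ 14.  13.  12.  11.]
 [ 14.  13.  12.  11.]
 [ 14.  13.  12.  11.]]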

We see that both matrices vary when we switch between columns but are constant when we change between rows. Therefore, by the nature of matrix multiplication, the values of matrix A should vary across k in our for loop, while the values of B should remain constant. Let's run the program again and check the pertinent output:
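Given the indexing used in the loop, the first thread (computing row 0, column 0) reads matrix_a[k*4], which walks down the first column of A (constant), and matrix_b[k], which walks across the first row of B (varying), so we should expect to see something like this:

Dot-product loop: k value is 0, matrix_a value is 1.000000, matrix_b is 14.000000.
Dot-product loop: k value is 1, matrix_a value is 1.000000, matrix_b is 13.000000.
Dot-product loop: k value is 2, matrix_a value is 1.000000, matrix_b is 12.000000.
Dot-product loop: k value is 3, matrix_a value is 1.000000, matrix_b is 11.000000.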

So, it appears that we are not accessing the elements of the matrices correctly; remembering that these matrices are stored in a row-wise format, we modify the indices so that their values are accessed in the proper manner:

val += matrix_a[ row*N + k ] * matrix_b[ col + k*N];

Running the program again will yield no assertion errors. Congratulations, you have just debugged a CUDA kernel using only printf!
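For reference, here is a sketch of what the kernel code looks like once all three fixes (the row/column computation, the argument order in the call to rowcol_dot, and the dot-product indexing) are applied:

// row-column dot-product for matrix multiplication
__device__ float rowcol_dot(float *matrix_a, float *matrix_b, int row, int col, int N)
{
    float val = 0;

    for (int k = 0; k < N; k++)
    {
        // row-major access: element (row, k) of A, element (k, col) of B
        val += matrix_a[ row*N + k ] * matrix_b[ col + k*N ];
    }

    return(val);
}

// matrix multiplication kernel that is parallelized over row/column tuples
__global__ void matrix_mult_ker(float * matrix_a, float * matrix_b, float * output_matrix, int N)
{
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    int col = blockIdx.y*blockDim.y + threadIdx.y;

    output_matrix[col + row*N] = rowcol_dot(matrix_a, matrix_b, row, col, N);
}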
