Using printf from within CUDA kernels

It may come as a surprise, but we can actually print text to the standard output directly from within a CUDA kernel; not only that, each individual thread can print its own output. This will come in particularly handy when we are debugging our kernels, since we may need to monitor the values of particular variables or computations at particular points in our code, and it will also free us from the shackles of stepping through our code with a debugger. Printing output from a CUDA kernel is done with none other than the most fundamental function in all of C/C++ programming, the one that most people learn when they write their first Hello world program in C: printf. printf is, of course, the standard function that prints a string to the standard output, and is really the C programming language's equivalent of Python's print function.

Let's now briefly review how to use printf before we see how to use it in CUDA. The first thing to remember is that printf always takes a string as its first parameter; so printing "Hello world!" in C is done with printf("Hello world!\n");. (Of course, \n indicates a "new line" or "return", which moves the output in the Terminal to the next line.) printf can also take a variable number of additional parameters in case we want to print any constants or variables directly from within C: if we want to print the integer 123 to the output, we do this with printf("%d", 123); (where %d indicates that an integer follows the string).

Similarly, we use %f, %e, or %g to print floating-point values (where %f is decimal notation, %e is scientific notation, and %g chooses the shorter of the two representations). We can even print several values in a row, remembering to place these specifiers in the correct order: printf("%d is a prime number, %f is close to pi, and %d is even.\n", 17, 3.14, 4); will print "17 is a prime number, 3.140000 is close to pi, and 4 is even." on the Terminal (%f pads to six decimal places by default; use %g if you want just 3.14).

Now, nearly halfway through this book, we will finally embark on creating our first parallel Hello world program in CUDA! We start by importing the appropriate modules into Python and then write our kernel. We will start out by printing the thread and block identification of each individual thread (we will only launch this over one-dimensional blocks and grids, so we only need the x values):

import pycuda.autoinit
from pycuda.compiler import SourceModule

ker = SourceModule('''
__global__ void hello_world_ker()
{
 printf("Hello world from thread %d, in block %d!\\n", threadIdx.x, blockIdx.x);

Let's stop for a second and note that we wrote \\n rather than \n. This is due to the fact that Python's triple-quoted string will itself interpret \n as a "new line", so we have to indicate that we mean the literal two characters by using a double backslash, which passes \n directly through to the CUDA compiler.
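We can convince ourselves of this behavior from within Python itself; in this short sketch, the variable names single and double are just for illustration:

```python
# Inside a Python string, \n is a single newline character,
# while \\n is two characters: a literal backslash followed by 'n'.
single = '''printf("Hello\n");'''    # Python consumes the escape: a real newline
double = '''printf("Hello\\n");'''   # passed through to nvcc as the two characters \n

assert "\n" in single       # the source string itself got broken across lines
assert "\n" not in double   # no raw newline here...
assert "\\n" in double      # ...instead the literal \n that nvcc expects
print(repr(double))
```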

We will now print some information about the block and grid dimensions, but we want to ensure that this happens only after every thread has finished its initial printf call. We can do this by putting in a __syncthreads(); statement, which makes each thread wait until all of the other threads have executed the first printf (strictly speaking, __syncthreads synchronizes the threads within a single block, not across the entire grid):

 __syncthreads();

Now, we want to print the block and grid dimensions to the terminal only once; if we just placed printf statements here, every single thread would print out the same information. We can avoid this by having only one designated thread print to the output; let's go with the 0th thread of the 0th block, which is the only thread that is guaranteed to exist no matter what block and grid dimensions we choose. We can do this with a C if statement:

 if(threadIdx.x == 0 && blockIdx.x == 0)
{

We will now print the dimensionality of our block and grid and close up the if statement, and that will be the end of our CUDA kernel:

 printf("-------------------------------------\\n");
 printf("This kernel was launched over a grid consisting of %d blocks,\\n", gridDim.x);
 printf("where each block has %d threads.\\n", blockDim.x);
}
}
''')
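For reference, the fragments above assemble into the following complete kernel source (identical to the pieces we just walked through, collected in one place):

```cuda
__global__ void hello_world_ker()
{
    printf("Hello world from thread %d, in block %d!\\n", threadIdx.x, blockIdx.x);
    __syncthreads();
    if(threadIdx.x == 0 && blockIdx.x == 0)
    {
        printf("-------------------------------------\\n");
        printf("This kernel was launched over a grid consisting of %d blocks,\\n", gridDim.x);
        printf("where each block has %d threads.\\n", blockDim.x);
    }
}
```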

We will now extract the kernel and then launch it over a grid consisting of two blocks, where each block has five threads:

hello_ker = ker.get_function("hello_world_ker")
hello_ker( block=(5,1,1), grid=(2,1,1) )

Let's run this right now (this program is also available as hello-world_gpu.py, under 6, in the repository). Each of our ten threads will print its own Hello world line (the order in which the two blocks appear in the output may vary from run to run), followed by the dashed line and the two lines describing the grid and block dimensions, which are printed exactly once, by thread 0 of block 0.
