CUDA streams

A stream is a first-in, first-out (FIFO) queue of device operations: operations issued to a stream execute in the order in which they were issued. Requests made from the host code are placed into these queues, which the device driver reads and processes asynchronously with respect to the host while guaranteeing that the commands within a single queue are processed in sequence. For example, a memory copy issued before a kernel launch in the same stream completes before that kernel starts.
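The in-order behavior can be seen in a minimal sketch such as the following. It is not the code from the book's repository: the kernel name scale, the helper run_single_stream, and the buffer size are assumptions chosen for illustration. The three operations are issued asynchronously into one stream, yet the kernel does not start until the host-to-device copy has finished, and the copy back does not start until the kernel has finished.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: runs copy -> kernel -> copy in one stream
// and returns the first element of the result.
float run_single_stream()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_data = NULL, *d_data = NULL;
    cudaStream_t stream;

    // Pinned host memory lets the copies run asynchronously to the host.
    cudaMallocHost((void **)&h_data, bytes);
    cudaMalloc((void **)&d_data, bytes);
    cudaStreamCreate(&stream);
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    // All three calls return to the host right away;
    // the stream executes them in the order they were issued.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    // Block the host until everything queued in the stream has finished.
    cudaStreamSynchronize(stream);
    float first = h_data[0];

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return first;
}

int main()
{
    printf("h_data[0] = %.1f\n", run_single_stream());
    return 0;
}
```

Because no explicit synchronization is placed between the three calls, the correct result relies entirely on the stream's ordering guarantee.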

The general idea behind using multiple streams is that CUDA operations issued to different streams may run concurrently. This can result in multiple kernels overlapping with one another, or in memory copies overlapping with kernel execution.
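As a rough illustration of this idea, the sketch below splits a vector addition into chunks and gives each chunk its own stream, so that the copies for one chunk may overlap with the kernel of another. It is a simplified stand-in, not the repository's vector_addition.cu; the names vec_add, run_overlapped_add, and kStreams, as well as the problem size, are assumptions. Whether any overlap actually occurs depends on the GPU's copy engines and on the host buffers being pinned.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: c = a + b, element-wise.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: returns the number of wrong elements (0 on success).
int run_overlapped_add()
{
    const int kStreams = 4;            // assumed number of streams
    const int n = 1 << 20;             // assumed size, divisible by kStreams
    const int chunk = n / kStreams;
    const size_t bytes = n * sizeof(float);
    const size_t chunk_bytes = chunk * sizeof(float);

    float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    cudaMallocHost((void **)&h_a, bytes);   // pinned memory is required for overlap
    cudaMallocHost((void **)&h_b, bytes);
    cudaMallocHost((void **)&h_c, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; s++) cudaStreamCreate(&streams[s]);

    // Each stream handles one chunk: copy in, compute, copy out.
    // Operations in different streams are free to run concurrently.
    for (int s = 0; s < kStreams; s++) {
        int off = s * chunk;
        cudaMemcpyAsync(d_a + off, h_a + off, chunk_bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(d_b + off, h_b + off, chunk_bytes, cudaMemcpyHostToDevice, streams[s]);
        vec_add<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_a + off, d_b + off, d_c + off, chunk);
        cudaMemcpyAsync(h_c + off, d_c + off, chunk_bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();           // wait for all streams to finish

    int errors = 0;
    for (int i = 0; i < n; i++)
        if (h_c[i] != 3.0f * i) errors++;

    for (int s = 0; s < kStreams; s++) cudaStreamDestroy(streams[s]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    cudaFreeHost(h_a); cudaFreeHost(h_b); cudaFreeHost(h_c);
    return errors;
}

int main()
{
    printf("mismatches: %d\n", run_overlapped_add());
    return 0;
}
```

Within each stream the copy-compute-copy order is still preserved; only operations belonging to different streams are allowed to overlap.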

To understand CUDA streams, we will look at two applications. The first is a simple vector addition code with streams added so that it can overlap data transfers with kernel execution. The second is an image merging application, which will also be used in Chapter 9, GPU Programming Using OpenACC.

To start, configure your environment according to the following steps:

Prepare your GPU application. As an example, we will be merging two images. This code can be found in the 06_multi-gpu/streams folder in this book's GitHub repository.
Compile your application with the nvcc compiler as follows:

$ nvcc --default-stream per-thread -o vector_addition -Xcompiler -fopenmp -lgomp vector_addition.cu
$ nvcc --default-stream per-thread -o merging_multi_gpu -Xcompiler -fopenmp -lgomp scrImagePgmPpmPackage.cu image_merging.cu
$ ./vector_addition
$ ./merging_multi_gpu

The preceding commands create two binaries named vector_addition and merging_multi_gpu and then run them. As you might have observed, we are passing additional arguments to the compiler. Let's understand them in more detail:

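--default-stream per-thread: This flag tells nvcc to give each host thread its own default stream, rather than having all host threads share the single legacy default stream. Work that different host threads issue to the default stream can then run concurrently.
-Xcompiler -fopenmp -lgomp: -Xcompiler tells nvcc to pass the option that follows it (-fopenmp) to the underlying CPU compiler, which enables OpenMP when compiling the host part of the code. The -lgomp flag links the GNU OpenMP runtime library into the application.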
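The effect of combining these flags can be sketched as follows. This is an illustrative example only, not the book's code; the kernel fill, the helper run_per_thread_streams, and the constant kTasks are assumptions. Each OpenMP thread issues work to stream 0, which refers to that thread's own default stream when the file is compiled with --default-stream per-thread -Xcompiler -fopenmp -lgomp, as in the commands above.

```cuda
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: writes 'value' into every element.
__global__ void fill(int *out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Hypothetical helper: returns the number of tasks with a wrong result.
int run_per_thread_streams()
{
    const int kTasks = 4;      // assumed number of independent tasks
    const int n = 1 << 16;     // assumed buffer size per task
    int h_first[kTasks];

    // The loop iterations are shared among the OpenMP host threads.
    #pragma omp parallel for num_threads(kTasks)
    for (int t = 0; t < kTasks; t++) {
        int *d_buf = NULL;
        cudaMalloc((void **)&d_buf, n * sizeof(int));

        // No stream is named, so the default stream is used. With
        // --default-stream per-thread, that is the calling host thread's
        // own stream, so threads do not serialize on one shared stream.
        fill<<<(n + 255) / 256, 256>>>(d_buf, n, t);
        cudaMemcpyAsync(&h_first[t], d_buf, sizeof(int), cudaMemcpyDeviceToHost, 0);
        cudaStreamSynchronize(0);   // waits for this thread's default stream

        cudaFree(d_buf);
    }

    int errors = 0;
    for (int t = 0; t < kTasks; t++)
        if (h_first[t] != t) errors++;
    return errors;
}

int main()
{
    printf("mismatches: %d\n", run_per_thread_streams());
    return 0;
}
```

The program also produces correct results when compiled without --default-stream per-thread; in that case all threads share the legacy default stream and their work is serialized rather than overlapped.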

We will divide this section into two parts: application 1 demonstrates using streams on a single GPU, and application 2 demonstrates using them across multiple GPUs.
