Dynamic parallelism

First, we will take a look at dynamic parallelism, a feature of CUDA that allows a kernel to launch and manage other kernels directly, without any interaction or input on the part of the host. This also makes many of the host-side CUDA-C features available on the GPU, such as device memory allocation/deallocation, device-to-device memory copies, context-wide synchronizations, and streams.
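For instance, here is a minimal, hypothetical sketch (not taken from the book's repository; the kernel name and buffer size are illustrative assumptions) of two of these device-side conveniences: heap allocation with malloc/free and a device-to-device memcpy, performed entirely inside a kernel. It uses the DynamicSourceModule interface that we will introduce in a moment:

from __future__ import division
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import DynamicSourceModule
import pycuda.autoinit

DeviceFeaturesCode = '''
__global__ void device_features_ker(float *out, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        // device-side heap allocation, analogous to a host-side cudaMalloc
        float *scratch = (float *) malloc(n * sizeof(float));
        for (int i = 0; i < n; i++)
            scratch[i] = 2.0f * i;
        // device-to-device copy into the output buffer
        memcpy(out, scratch, n * sizeof(float));
        free(scratch);
    }
}'''

features_mod = DynamicSourceModule(DeviceFeaturesCode)
features_ker = features_mod.get_function('device_features_ker')
out = gpuarray.zeros(8, dtype=np.float32)
features_ker(out, np.int32(8), grid=(1,1,1), block=(1,1,1))
print(out.get())  # expected: the even numbers from 0 to 14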

Let's start with a very simple example. We will create a small kernel over N threads in which each thread prints a short message to the terminal; the kernel will then recursively launch another kernel over N - 1 threads, and this process will continue until N reaches 1. (Of course, beyond illustrating how dynamic parallelism works, this example would be pretty pointless.)

Let's start with the import statements in Python:

from __future__ import division
import numpy as np
from pycuda.compiler import DynamicSourceModule
import pycuda.autoinit

Notice that we have to import DynamicSourceModule rather than the usual SourceModule! This is because the dynamic parallelism feature requires particular configuration details from the compiler and linker (the device code has to be compiled as relocatable and linked against the CUDA device runtime library). Otherwise, this will look and act like a usual SourceModule operation. Now we can continue writing the kernel:

DynamicParallelismCode='''
__global__ void dynamic_hello_ker(int depth)
{
    printf("Hello from thread %d, recursion depth %d!\\n", threadIdx.x, depth);
    if (threadIdx.x == 0 && blockIdx.x == 0 && blockDim.x > 1)
    {
        printf("Launching a new kernel from depth %d.\\n", depth);
        printf("-----------------------------------------\\n");
        dynamic_hello_ker<<< 1, blockDim.x - 1 >>>(depth + 1);
    }
}'''

The most important thing to note here is this: we must be careful that only a single thread launches the next iteration of kernels, which we arrange with a well-placed if statement that checks the threadIdx and blockIdx values. If we don't do this, then each thread will launch far more kernel instances than necessary at every depth iteration; with four threads, all four would launch child grids at depth 0, every thread of those grids would launch more at depth 1, and so on. Also, notice how we can just launch the kernel in the normal way, with the usual CUDA-C triple-bracket notation; we don't have to use any obscure or low-level commands to make use of dynamic parallelism.

When using the CUDA dynamic parallelism feature, always be careful to avoid unnecessary kernel launches. This can be done by having a designated thread launch the next iteration of kernels.

Now let's finish this up:

dp_mod = DynamicSourceModule(DynamicParallelismCode)
hello_ker = dp_mod.get_function('dynamic_hello_ker')
hello_ker(np.int32(0), grid=(1,1,1), block=(4,1,1))

Now we can run the preceding code. With a block of four threads, we will see a greeting from each of the four threads at depth 0, then from three threads at depth 1, from two at depth 2, and finally from a single thread at depth 3, at which point the recursion stops. (The ordering of the messages within any one depth is not guaranteed.)

This example can also be found in the dynamic_hello.py file in this book's GitHub repository.
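As a more realistic use of the same machinery, dynamic parallelism shines when a parent kernel has to size a child grid from data that only exists in device memory. The following is a hypothetical sketch (not part of the book's repository; all kernel and variable names are illustrative) in which a parent kernel reads an element count from device memory and launches a child kernel with that many threads:

from __future__ import division
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import DynamicSourceModule
import pycuda.autoinit

NestedSquareCode = '''
__global__ void square_child(float *x)
{
    x[threadIdx.x] *= x[threadIdx.x];
}

__global__ void square_parent(float *x, int *n)
{
    // single designated thread, as in the example above
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        // the child grid is sized from a value that lives in device
        // memory, so the host never has to read it back
        square_child<<< 1, *n >>>(x);
    }
}'''

nested_mod = DynamicSourceModule(NestedSquareCode)
square_parent = nested_mod.get_function('square_parent')
x = gpuarray.to_gpu(np.arange(8, dtype=np.float32))
n = gpuarray.to_gpu(np.array([8], dtype=np.int32))
square_parent(x, n, grid=(1,1,1), block=(1,1,1))
print(x.get())  # expected: [ 0. 1. 4. 9. 16. 25. 36. 49.]

Note that a parent grid is not considered complete until all of its child grids have finished, so by the time x.get() copies the result back to the host, the child kernel's work is guaranteed to be done.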