Kernels, Threads, Blocks, and Grids

In this chapter, we'll see how to write effective CUDA kernels. In GPU programming, a kernel (a term we use interchangeably with CUDA kernel and kernel function) is a parallel function that can be launched directly from the host (the CPU) onto the device (the GPU), while a device function is a function that can only be called from a kernel or from another device function. (Generally speaking, device functions look and act like normal serial C/C++ functions, except that they run on the GPU and are called in parallel from kernels.)
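As a rough sketch of this distinction, the following CUDA C fragment shows a device function and a kernel that calls it (the function names here are illustrative, not taken from this chapter):

```cuda
// A device function: callable only from GPU code (a kernel or
// another device function), never directly from the host.
__device__ float square(float x)
{
    return x * x;
}

// A kernel: launched from the host, executed in parallel on the
// device. Each thread handles one element of the input.
__global__ void square_kernel(float *out, const float *in)
{
    int i = threadIdx.x;
    out[i] = square(in[i]);
}
```

Note the qualifiers: `__global__` marks a kernel, while `__device__` marks a device function.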

We'll then get an understanding of how CUDA uses the notions of threads, blocks, and grids to abstract away some of the underlying technical details of the GPU (such as cores, warps, and streaming multiprocessors, which we'll cover later in this book), and how these abstractions reduce the cognitive overhead of parallel programming. We'll learn about thread synchronization (both block-level and grid-level) and inter-thread communication in CUDA using both global and shared memory. Finally, we'll delve into the technical details of how to implement our own parallel prefix-type algorithms on the GPU (that is, the scan/reduce-type functions we covered in the last chapter), which will let us put all of the principles from this chapter into practice.

The learning outcomes for this chapter are as follows:

  • Understanding the difference between a kernel and a device function
  • How to compile and launch a kernel in PyCUDA and use a device function within a kernel
  • Effectively using threads, blocks, and grids in the context of launching a kernel and how to use threadIdx and blockIdx within a kernel
  • How and why to synchronize threads within a kernel, using __syncthreads() to synchronize all threads within a single block and the host to synchronize all threads across an entire grid of blocks
  • How to use device global and shared memory for inter-thread communication
  • How to use all of our newly acquired knowledge about kernels to properly implement a GPU version of the parallel prefix sum
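To preview how several of these pieces fit together, here is a minimal sketch of a block-level reduction (not the chapter's implementation; it assumes a power-of-two block size of at most 256 threads and an input with one element per thread):

```cuda
// Illustrative kernel combining threadIdx/blockIdx indexing, shared
// memory for inter-thread communication, and __syncthreads() for
// block-level synchronization.
__global__ void block_sum(float *out, const float *in)
{
    __shared__ float partial[256];          // visible to all threads in this block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;  // this thread's global index

    partial[tid] = in[i];
    __syncthreads();                        // all loads finish before we reduce

    // Tree-style reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // each round completes before the next
    }

    if (tid == 0)
        out[blockIdx.x] = partial[0];       // one partial sum per block
}
```

Because __syncthreads() only synchronizes threads within a block, combining the per-block results into a single sum requires grid-level synchronization from the host, a point we return to later in the chapter.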