Performance Optimization in CUDA

In this penultimate chapter, we will cover some fairly advanced CUDA features that we can use for low-level performance optimization. We will start with dynamic parallelism, which allows kernels to launch and manage other kernels directly on the GPU, and see how we can use it to implement quicksort entirely on the GPU. We will learn about vectorized memory access, which can speed up reads from the GPU's global memory. We will then look at CUDA atomic operations, which are thread-safe functions that can operate on shared data without thread synchronization or mutex locks. We will learn about warps, which are fundamental units of 32 or fewer threads in which threads can read or write to each other's variables directly, and then make a brief foray into the world of PTX assembly by writing some basic PTX inline within our CUDA-C code, which itself will be inline in our Python code! Finally, we will bring all of these low-level tweaks together in one final example, where we will apply them to write a blazingly fast summation kernel, and compare it to PyCUDA's sum. A small sketch previewing some of these ideas follows.
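To give a flavor of where we are headed, here is a minimal sketch of a CUDA-C summation kernel (the kernel name and structure here are illustrative, not the chapter's final code) that previews three of these ideas: each thread performs one vectorized float4 load from global memory, the threads of each 32-thread warp combine their partial sums by reading each other's variables with warp shuffling, and one thread per warp then adds the warp's total to the result with a thread-safe atomicAdd, with no explicit synchronization or mutex locks:

// Illustrative sketch only, not the chapter's final summation kernel.
// Assumes the input length is a multiple of 4, so it can be viewed as
// n4 float4 elements, and that *out has been zeroed before launch.
__global__ void warp_sum_sketch(const float4 *in, float *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float partial = 0.0f;
    if (i < n4)
    {
        // One vectorized 128-bit load pulls in four floats at once.
        float4 v = in[i];
        partial = v.x + v.y + v.z + v.w;
    }
    // Warp shuffling: each thread reads a variable from another thread
    // in its 32-thread warp, halving the number of active lanes each step.
    for (int offset = 16; offset > 0; offset >>= 1)
        partial += __shfl_down_sync(0xffffffff, partial, offset);
    // The first thread of each warp atomically adds the warp's total to
    // *out; atomicAdd is thread-safe, so no mutex locks are needed.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, partial);
}

In the chapter proper, we will compile and launch kernels like this from Python with PyCUDA's SourceModule, just as we have throughout the book.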

The learning outcomes for this chapter are as follows:

  • Dynamic parallelism in CUDA
  • Implementing quicksort on the GPU with dynamic parallelism
  • Using vectorized types to speed up device memory accesses
  • Using thread-safe CUDA atomic operations
  • Sharing data between threads in a warp with warp shuffling
  • Basic PTX assembly
  • Applying all of these concepts to write a performance-optimized summation kernel