Summary

We started this chapter by learning about dynamic parallelism, a paradigm that allows us to launch and manage kernels directly on the GPU from other kernels. We saw how we can use this to implement a quicksort algorithm directly on the GPU. We then learned about vectorized datatypes in CUDA, and saw how we can use these to speed up memory reads from global device memory. We then learned about CUDA Warps, which are small units of 32 threads or fewer on the GPU, and we saw how threads within a single Warp can directly read and write to each other's registers using Warp Shuffling. We then looked at how we can write a few basic operations in PTX assembly, including important operations such as determining the lane ID and splitting a 64-bit variable into two 32-bit variables. Finally, we ended this chapter by writing a new performance-optimized summation kernel for arrays of doubles, applying almost all of the tricks we've learned in this chapter. We saw that this is actually faster than the standard PyCUDA sum on double arrays whose length is on the order of 500,000 elements.

We have gotten through all of the technical chapters of this book! You should be proud of yourself, since you are now surely a skilled GPU programmer with many tricks up your sleeve. We will now embark upon the final chapter, where we will take a brief tour of a few of the different paths you can take to apply and extend your GPU programming knowledge from here.
