In this chapter, we will cover parallel programming patterns that will help you understand how to parallelize different algorithms and optimize them in CUDA. The techniques in this chapter apply to a variety of problems. For example, the parallel reduction we looked at in Chapter 3, CUDA Thread Programming, can be used to design an efficient softmax layer for neural network operations.
We will cover the following topics:
- Matrix multiplication optimization
- Image convolution
- Prefix sum
- Pack and split
- N-body operation
- QuickSort in CUDA using dynamic parallelism
- Radix sort
- Histogram calculation