Summary 

In this chapter, we covered several kernel execution mechanisms. We looked at what CUDA streams are and how to use them to execute multiple kernel functions concurrently. By exploiting the asynchronous operation between the host and the GPU, we learned that we can hide kernel execution time by building a pipeline of data transfers and kernel executions. We also saw that a CUDA stream can call a host function through a callback, and that we can create a prioritized stream and confirm its prioritized execution. To measure the exact execution time of a kernel function, we used CUDA events, and we learned that CUDA events can also be used to synchronize with the host. In the last section, we compared the performance of each kernel execution method.
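
The following is a minimal sketch, not the chapter's exact code, of how stream-based pipelining and event timing fit together: each chunk's host-to-device copy, kernel launch, and device-to-host copy are queued in their own stream so they can overlap with other chunks, and a pair of events measures the elapsed time. The kernel name vector_add, the even chunking, and NUM_STREAMS are placeholder choices; the host buffers are assumed to be pinned (allocated with cudaMallocHost) so the asynchronous copies can actually overlap.

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define NUM_STREAMS 4

    __global__ void vector_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    void pipelined_launch(float *h_a, float *h_b, float *h_c,
                          float *d_a, float *d_b, float *d_c, int n)
    {
        cudaStream_t streams[NUM_STREAMS];
        for (int s = 0; s < NUM_STREAMS; s++)
            cudaStreamCreate(&streams[s]);

        // Events recorded in the default stream: with legacy default-stream
        // semantics, the stop event completes only after all work queued in the
        // other (blocking) streams has finished, so the elapsed time covers the
        // whole pipeline.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);

        int chunk = n / NUM_STREAMS;   // assume n divides evenly for simplicity
        for (int s = 0; s < NUM_STREAMS; s++) {
            int offset = s * chunk;
            // Copies and the kernel launch for one chunk share a stream, so the
            // chunks of different streams overlap with each other.
            cudaMemcpyAsync(d_a + offset, h_a + offset, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            cudaMemcpyAsync(d_b + offset, h_b + offset, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            vector_add<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
                d_a + offset, d_b + offset, d_c + offset, chunk);
            cudaMemcpyAsync(h_c + offset, d_c + offset, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }

        cudaEventRecord(stop);
        cudaEventSynchronize(stop);          // host waits for the stop event
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("pipeline took %.3f ms\n", ms);

        for (int s = 0; s < NUM_STREAMS; s++)
            cudaStreamDestroy(streams[s]);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }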

We also covered other kernel operation models: dynamic parallelism and grid-level cooperative groups. Dynamic parallelism enables kernel launches from inside a kernel function, which makes recursive operations on the GPU possible. Grid-level cooperative groups enable versatile grid-wide synchronization, and we discussed how this feature can be useful in specific areas such as graph search, genetic algorithms, and particle simulations.
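
The two sketches below are illustrative only, with kernels of my own naming. The first shows dynamic parallelism: a kernel relaunches itself from the device, which is what makes recursive algorithms expressible on the GPU; building it requires relocatable device code (for example, nvcc -rdc=true -lcudadevrt). The second shows a grid-level cooperative group: grid.sync() is a barrier across every block in the grid, so a second phase can safely read what other blocks wrote in the first phase. A kernel that calls grid.sync() must be launched with cudaLaunchCooperativeKernel rather than the <<<...>>> syntax.

    #include <cstdio>
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Dynamic parallelism: the kernel launches itself from device code.
    __global__ void recursive_kernel(int depth, int max_depth)
    {
        if (threadIdx.x == 0) {
            printf("depth %d\n", depth);
            if (depth < max_depth)
                recursive_kernel<<<1, 32>>>(depth + 1, max_depth);  // device-side launch
        }
    }

    // Grid-level cooperative group: two phases separated by a grid-wide barrier.
    __global__ void two_phase_kernel(float *data, float *out, int n)
    {
        cg::grid_group grid = cg::this_grid();
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
            data[i] *= 2.0f;                       // phase one: each block updates its elements

        grid.sync();                               // every block reaches here before any continues

        if (i < n)
            out[i] = data[i] + data[(i + 1) % n];  // phase two: safely reads other blocks' results
    }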

Then, we expanded our coverage to the host side. CUDA kernels can be called from multiple threads or multiple processes. To launch kernels from multiple host threads, we used OpenMP with CUDA and discussed its usefulness. We then used MPI to simulate multi-process operation and saw how the Multi-Process Service (MPS) benefits overall application performance.
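
As a hedged sketch of the multithreaded case, the code below has each OpenMP host thread create its own stream and submit an independent chunk of work, so launches from different threads do not serialize on a single stream. The kernel name scale_kernel and the even chunking are placeholders, and the compile line is only indicative (something like nvcc -Xcompiler -fopenmp host_threads.cu).

    #include <omp.h>
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    void launch_from_openmp_threads(float *d_data, int n, int num_threads)
    {
        #pragma omp parallel num_threads(num_threads)
        {
            int tid = omp_get_thread_num();
            int chunk = n / num_threads;           // assume n divides evenly

            // Each host thread owns a stream, so its launches can proceed
            // concurrently with launches from the other threads.
            cudaStream_t stream;
            cudaStreamCreate(&stream);

            scale_kernel<<<(chunk + 255) / 256, 256, 0, stream>>>(
                d_data + tid * chunk, chunk);

            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
        }
    }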

As we saw in this chapter, choosing the right kernel execution model is as important as thread programming itself, because it can significantly reduce application execution time. In the next chapter, we will expand our discussion to multi-GPU programming to solve larger problems.
