Questions

In the first CUDA-C program that we wrote, we didn't use a cudaDeviceSynchronize command after the calls we made to allocate memory arrays on the GPU with cudaMalloc. Why was this not necessary? (Hint: Review the last chapter.)
Suppose we have a single kernel that is launched over a grid consisting of two blocks, where each block has 32 threads. Suppose all of the threads in the first block execute an if statement, while all of the threads in the second block execute the corresponding else statement. Will all of the threads in the second block have to "lockstep" through the commands in the if statement as the threads in the first block are actually executing them?
What if we executed a similar piece of code, only over a grid consisting of a single block of 64 threads, where the first 32 threads execute the if statement and the remaining 32 execute the else statement?
What can the nvprof profiler measure for us that Python's cProfile cannot?
Name some contexts where we might prefer to use printf to debug a CUDA kernel and other contexts where it might be easier to use Nsight to debug a CUDA kernel.
What is the purpose of the cudaSetDevice command in CUDA-C?
Why do we have to use cudaDeviceSynchronize after every kernel launch or memory copy in CUDA-C?
