Why bother with threads and blocks?

It might not be obvious why we need this additional hierarchy of threads and blocks. It adds complexity: the developer has to choose suitable block and grid sizes, and computing a global index becomes less straightforward. The hierarchy exists because of restrictions that the CUDA programming model puts in place.

Unlike blocks, threads have mechanisms to communicate and synchronize efficiently. Real-world applications often need threads to exchange data and to wait until that data is available before proceeding. The CUDA programming model allows this kind of communication, but only between threads within the same block; threads belonging to different blocks cannot communicate or synchronize with each other during the execution of a kernel, as the sketch below illustrates. This restriction lets the scheduler place blocks on SMs independently of one another. As a result, if new hardware is released with more SMs and the code has enough parallelism, performance scales linearly: the hardware simply runs more blocks in parallel, based on the GPU's capability.
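As a minimal sketch of this rule (the kernel name, the array size, and the block size of 256 are assumptions for this example), the following kernel has each thread write one element and then read a value written by a neighboring thread in the same block. The __syncthreads() barrier makes this safe within a block; no such barrier exists between blocks during a kernel launch:

__global__ void neighbor_exchange(int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    if (i < n)
        data[i] = i * 2;                             // phase 1: every thread writes

    __syncthreads();   // barrier: all threads in THIS block finished phase 1

    // phase 2: read the next thread's value, wrapping around within the
    // block; reading across block boundaries would be a data race
    int j = blockIdx.x * blockDim.x + (threadIdx.x + 1) % blockDim.x;
    if (i < n && j < n)
        out[i] = data[j];
}

int main()
{
    const int n = 1024, block = 256;
    int *data, *out;
    cudaMalloc(&data, n * sizeof(int));
    cudaMalloc(&out,  n * sizeof(int));

    // enough blocks to cover n elements; blocks may run in any order
    neighbor_exchange<<<(n + block - 1) / block, block>>>(data, out, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    cudaFree(out);
    return 0;
}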

Threads communicate with each other using a special memory known as shared memory. We will cover shared memory extensively in Chapter 2, CUDA Memory Management, where we will explore the other memory hierarchies in the GPU and their optimal usage. Because blocks are independent, the same grid scales across GPUs with different numbers of SMs: a GPU with more SMs simply executes more blocks concurrently, so the same code runs faster without any changes.
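As a preview of what Chapter 2 covers in depth, here is an illustrative sketch of threads cooperating through shared memory (the kernel name and the fixed block size of 256 are assumptions for this example). Each block sums its slice of the input using a shared array and __syncthreads():

__global__ void block_sum(const float *in, float *block_results, int n)
{
    __shared__ float cache[256];          // visible to this block only

    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    cache[tid] = (i < n) ? in[i] : 0.0f;  // each thread loads one element
    __syncthreads();                      // all loads done before reducing

    // tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                  // wait before the next step
    }

    if (tid == 0)                         // one partial sum per block
        block_results[blockIdx.x] = cache[0];
}

Note that each block produces only a partial sum; the host (or a second kernel launch) must combine them, precisely because blocks cannot synchronize with each other inside a single launch.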

Now, let's find out more about launching kernels in multiple dimensions.
