Low latency versus higher throughput

As we mentioned in the previous section, the CPU architecture is optimized for low latency access, while the GPU architecture is optimized for data-parallel throughput computation. As shown in the following screenshot, the CPU has a large amount of cache compared to the GPU, organized into several levels. The closer a level sits to the core, that is, going from L3 to L1, the smaller the cache becomes, but the lower its latency. The CPU architecture is designed for low latency access to cached datasets. A large number of transistors are used to implement speculative execution and out-of-order execution. Since CPUs run at very high clock speeds, it becomes necessary to hide the latency of fetching data by keeping frequently used data in caches and predicting the next instruction to execute. Applications that can exploit this temporal locality make optimal use of the CPU cache. Similarly, applications that keep the instruction pipeline full, for example, code with few if and else branches, benefit by hiding the latency of fetching instructions. Hence, the CPU architecture is a latency reducing architecture.
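To make the temporal locality point concrete, here is a minimal host-side sketch (not from the book; the array sizes and names are illustrative). It performs the same total number of additions twice: once over a small array that is reused many times and stays resident in cache, and once over a large array that streams from main memory. On a typical CPU, the cache-resident version runs noticeably faster per element:

```cpp
// temporal_locality.cpp: a sketch showing that a working set that fits
// in cache is processed faster than one that does not, because repeated
// passes over the small set hit in L1/L2 (temporal locality).
#include <chrono>
#include <cstdio>
#include <vector>

// Sum `data` repeatedly; the total work is set by the caller so that
// both calls below perform the same number of additions.
long long sumPasses(const std::vector<int>& data, size_t passes) {
    long long total = 0;
    for (size_t p = 0; p < passes; ++p)
        for (int v : data) total += v;
    return total;
}

int main() {
    const size_t small = 16 * 1024;        // ~64 KB of ints: fits in L1/L2
    const size_t large = 16 * 1024 * 1024; // ~64 MB of ints: exceeds L3
    std::vector<int> a(small, 1), b(large, 1);

    auto time = [](auto&& fn) {
        auto t0 = std::chrono::steady_clock::now();
        long long r = fn();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sum=%lld  %.1f ms\n", r,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    // Same number of additions overall: the small set is reused 1024 times.
    time([&] { return sumPasses(a, large / small); });  // cache-resident
    time([&] { return sumPasses(b, 1); });              // streams from DRAM
    return 0;
}
```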

The following screenshot shows how the CPU and GPU architectures dedicate their chip die area to different memory and compute units. While the GPU spends most of its transistors on compute ALUs, the CPU spends them on reducing latency:

The GPU architecture, on the other hand, is called a latency hiding or high throughput architecture. The GPU hides latency with computation from other threads: while one thread waits for its data to become available, other threads can execute, so no clock cycles are wasted. If you are familiar with CUDA, then you might know about the concept of warps. We will cover the concept of warps in the upcoming chapters. (In CUDA, the unit of execution is a warp, not a thread, so context switching happens between warps rather than threads.)
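As a minimal CUDA sketch of this idea (the kernel and launch parameters are illustrative, not from the book), consider a memory-bound vector add launched with far more threads than there are cores. While one warp stalls on its global memory load, the SM's warp scheduler issues instructions from another resident warp, with no costly context switch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Memory-bound kernel: each thread issues two global loads and one store.
// A single warp would stall for hundreds of cycles on each load; with
// many warps resident per SM, the scheduler picks a ready warp while
// others wait, hiding that latency.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 24;  // 16M elements: enough threads to keep SMs busy
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Oversubscribe: thousands of blocks give each SM many resident warps
    // to switch between whenever one of them stalls on memory.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```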

Some of you might already be wondering why we can't create these threads on the CPU and do the same thing to hide latency. The reason is that GPUs have a very large register file, and all of the per-thread context needed for switching is kept there, in the fastest memory available. A CPU, in contrast, has a limited set of registers, so thread context has to be saved to and restored from lower levels of the memory hierarchy, such as the cache. For example, the Volta GPU contains 20 MB of register storage. Because of this, the context switching time between threads on a CPU is much higher than on a GPU.
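You can verify the register file size on your own device with a short query. The following is a minimal sketch using the CUDA runtime API; the 20 MB figure falls out for a Tesla V100, which has 80 SMs with 64 K 32-bit registers each:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // Each register is 32 bits (4 bytes); multiply by the per-SM register
    // count and the number of SMs to get the total register file size.
    size_t regFileBytes = (size_t)prop.regsPerMultiprocessor * 4
                        * prop.multiProcessorCount;

    printf("%s: %d SMs, %d registers/SM, register file = %.1f MB\n",
           prop.name, prop.multiProcessorCount, prop.regsPerMultiprocessor,
           regFileBytes / (1024.0 * 1024.0));
    return 0;
}
```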

Now, let's take a look at the different approaches to programming on the GPU.
