Optimizing unified memory using data prefetching

Now, let's look at a simpler method called data prefetching. One key thing about CUDA is that it offers the developer a range of methods, from the easiest to use to those that require ninja programming skills. Data prefetching is basically a hint to the driver to prefetch the data that we believe will be used on a device prior to its use. CUDA provides a prefetching API called cudaMemPrefetchAsync() for this purpose. To see it in action, let's look at the unified_memory_prefetch.cu file, which we compiled earlier. A snapshot of this code is shown in the following code snippet:

// Allocate Unified Memory -- accessible from CPU or GPU
cudaMallocManaged(&x, N*sizeof(float)); cudaMallocManaged(&y, N*sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
//prefetch the memory to GPU
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);
...
add<<<numBlocks, blockSize>>>(N, x, y);
//prefetch the memory to CPU
cudaMemPrefetchAsync(y, N*sizeof(float), cudaCpuDeviceId, NULL);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
...
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i]-3.0f));
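
The snapshot above elides the kernel and the launch configuration; a minimal, self-contained version of the same pattern might look as follows (the kernel body, the value of N, and the launch parameters are assumptions chosen for illustration, not taken from the original file):

#include <cstdio>
#include <cmath>

// Element-wise add kernel, assumed for illustration
__global__ void add(int n, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = x[i] + y[i];
}

int main() {
  const int N = 1 << 20;   // illustrative size
  float *x, *y;

  // Allocate unified memory -- accessible from CPU or GPU
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  // Initialize x and y arrays on the host
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  // Prefetch both arrays to the GPU before the kernel launch
  int device = -1;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(x, N * sizeof(float), device, NULL);
  cudaMemPrefetchAsync(y, N * sizeof(float), device, NULL);

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Prefetch the result back to the CPU before the host reads it
  cudaMemPrefetchAsync(y, N * sizeof(float), cudaCpuDeviceId, NULL);

  // Wait for the GPU to finish before accessing on the host
  cudaDeviceSynchronize();

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i] - 3.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(x);
  cudaFree(y);
  return 0;
}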

The code is fairly self-explanatory. The concept is simple: when it is known which memory will be used on a particular device, that memory can be prefetched there ahead of time. Let's take a look at the profiling result, which is shown in the following screenshot.

As we can see, the add<<<>>> kernel now achieves the bandwidth we expect from it:

Unified memory is an evolving feature and changes with every CUDA version and GPU architecture release. It is expected that you keep yourself informed by accessing the latest CUDA programming guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd).

So far, we have seen the usefulness of the UM concept: it not only provides ease of programming (no explicit memory management using the CUDA API), but is also much more powerful and helpful when porting applications that were otherwise either impossible or very difficult to port to the GPU. One of the key advantages of UM is over-subscription. GPU memory is quite limited compared to CPU memory; the latest GPU (the Volta V100 card) provides at most 32 GB per GPU. With the help of UM, multiple pieces of GPU memory, along with CPU memory, can be seen as one big memory. For example, the NVIDIA DGX-2 machine, which has 16 Volta GPUs with 32 GB each, can be seen as a collection of GPU memory with a total size of 512 GB. The advantages of this are enormous for applications such as Computational Fluid Dynamics (CFD) and analytics: problem sizes that previously could not fit in GPU memory now can, whereas moving pieces by hand is error-prone and requires tuning to the memory size.
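
As a rough sketch of over-subscription (the 64 GB figure is an assumption for illustration and requires a Pascal or later GPU plus enough free system memory), a single managed allocation can exceed the physical memory of the GPU:

#include <cstdio>

int main() {
  // Request more managed memory than a 32 GB V100 physically has.
  // The size is an illustrative assumption; pages migrate on demand,
  // so the GPU only needs to hold the current working set.
  size_t bytes = 64ULL << 30;   // 64 GB
  float *big = nullptr;
  cudaError_t err = cudaMallocManaged(&big, bytes);
  if (err != cudaSuccess) {
    printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  cudaFree(big);
  return 0;
}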

Also, the advent of high-speed interconnects such as NVLink and NVSwitch allows for fast transfers between GPUs with high bandwidth and low latency. You can actually get high performance with unified memory!

Data prefetching, combined with hints specifying where the data will actually reside, is helpful when multiple processors need to access the same data simultaneously. The API used in this case is cudaMemAdvise(). By knowing your application inside out, you can optimize access by making use of these hints. They are also useful if you wish to override some of the driver's heuristics. The advice currently accepted by the API is listed next, followed by a combined sketch:

  • cudaMemAdviseSetReadMostly: As the name suggests, this implies that the data will mostly be read. The driver creates a read-only copy of the data on the device, which reduces page faults. Note that the data can still be written to; in that case, all copies of the page are invalidated, except on the device that wrote the memory:
// Hint that data (count given in bytes) is mostly read-only for the GPU
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, processorId);
mykernel<<<..., s>>>(data, N);
  • cudaMemAdviseSetPreferredLocation: This advice sets the preferred location for the data to be the memory belonging to the specified device. Setting the preferred location does not cause the data to migrate to that location immediately. In the following code, mykernel<<<>>> will page fault and generate a direct mapping to the data on the CPU. The driver tries to resist migrating data away from the preferred location set with cudaMemAdvise:
// Prefer keeping input (N bytes) in the memory of processorId
cudaMemAdvise(input, N, cudaMemAdviseSetPreferredLocation, processorId);
mykernel<<<..., s>>>(input, N);
  • cudaMemAdviseSetAccessedBy: This advice implies that the data will be accessed by the specified device. That device creates a direct mapping of input in CPU memory, and no page faults are generated:
// Map input (N bytes) so that processorId can access it without faulting
cudaMemAdvise(input, N, cudaMemAdviseSetAccessedBy, processorId);
mykernel<<<..., s>>>(input, N);
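
Putting these hints together with prefetching, a minimal sketch might look as follows (the kernel, the array size, and the choice of float data are assumptions for illustration):

#include <cstdio>

// Illustrative kernel: reads the advised input buffer, writes a separate output
__global__ void mykernel(const float *input, float *output, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    output[i] = 2.0f * input[i];
}

int main() {
  const int N = 1 << 20;
  const size_t bytes = N * sizeof(float);
  int device = -1;
  cudaGetDevice(&device);

  float *input = nullptr, *output = nullptr;
  cudaMallocManaged(&input, bytes);
  cudaMallocManaged(&output, bytes);
  for (int i = 0; i < N; i++) input[i] = 1.0f;

  // input is only read by the kernel: mark it read-mostly and prefer the GPU
  cudaMemAdvise(input, bytes, cudaMemAdviseSetReadMostly, device);
  cudaMemAdvise(input, bytes, cudaMemAdviseSetPreferredLocation, device);
  // output will also be read by the CPU afterwards: register the CPU as an accessor
  cudaMemAdvise(output, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

  // Prefetch the input to the GPU so the kernel does not page fault on it
  cudaMemPrefetchAsync(input, bytes, device, NULL);

  mykernel<<<(N + 255) / 256, 256>>>(input, output, N);
  cudaDeviceSynchronize();

  printf("output[0] = %f\n", output[0]);
  cudaFree(input);
  cudaFree(output);
  return 0;
}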

In the next section, we will take a holistic view of how the different memory types in the GPU have evolved with newer architectures.
