GPUDirect peer to peer

The GPUDirect technology was created to allow high-bandwidth, low-latency communication between GPUs, both within a single node and across different nodes, and to eliminate CPU overhead when one GPU needs to communicate with another. GPUDirect can be classified into the following major categories:

  • Peer-to-peer (P2P) transfers between GPUs: Allows CUDA programs to use high-speed Direct Memory Access (DMA) to copy data between two GPUs in the same system. It also allows optimized access to the memory of other GPUs within the same system.
  • Accelerated communication between network and storage: This technology helps with direct access to CUDA memory from third-party devices such as InfiniBand network adapters or storage. It eliminates unnecessary memory copies and CPU overhead and hence reduces the latency of transfer and access. This feature is supported from CUDA 3.1 onward. 
  • GPUDirect for video: This technology optimizes pipelines for frame-based video devices. It allows low-latency communication with OpenGL, DirectX, or CUDA and is supported from CUDA 4.2 onward.
  • Remote Direct Memory Access (RDMA): This feature allows direct communication between GPUs across the nodes of a cluster. It is supported from CUDA 5.0 onward.

In this section, we will be converting our sequential code to make use of the P2P feature of GPUDirect so that it can run on multiple GPUs within the same system.

The GPUDirect P2P feature allows the following:

  • GPUDirect transfers: cudaMemcpy() initiates a DMA copy from GPU 1's memory to GPU 2's memory.
  • Direct access: GPU 1 can read or write GPU 2's memory (load/store); see the sketch after this list.
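
Both features are exposed through the CUDA runtime API. The following is a minimal sketch, assuming two P2P-capable GPUs (device IDs 0 and 1) and an illustrative 1 MiB buffer, that checks peer capability with cudaDeviceCanAccessPeer(), enables direct access with cudaDeviceEnablePeerAccess(), and then performs a GPUDirect transfer with cudaMemcpyPeer():

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;          // 1 MiB payload (illustrative)
    int canAccess = 0;

    // Direct access: can GPU 0 read/write GPU 1's memory?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    printf("GPU0 can access GPU1: %d\n", canAccess);

    float *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaMalloc((void**)&buf0, bytes);
    if (canAccess)
        cudaDeviceEnablePeerAccess(1, 0);  // let kernels on GPU 0 load/store GPU 1's memory

    cudaSetDevice(1);
    cudaMalloc((void**)&buf1, bytes);

    // GPUDirect transfer: DMA copy from GPU 1's memory to GPU 0's memory
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}

Error checking is omitted here for brevity; real code should verify the return status of every CUDA call.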

The following diagram demonstrates these features:

To understand the advantage of P2P, it is necessary to understand the PCIe bus specification. PCIe was designed primarily for optimal communication with other nodes through interconnects such as InfiniBand, which is a different goal from optimally sending and receiving data between individual GPUs within the same system. The following is a sample PCIe topology where eight GPUs are connected to various CPUs and NIC/InfiniBand cards:

In the preceding diagram, P2P transfer is allowed between GPU0 and GPU1 as they both sit under the same PCIe switch. However, GPU0 and GPU4 cannot perform a P2P transfer, as PCIe P2P communication is not supported between two I/O Hubs (IOHs): the IOH does not support non-contiguous bytes from PCI Express for remote peer-to-peer MMIO transactions. The nature of the QPI link connecting the two CPUs ensures that a direct P2P copy between GPU memories is not possible if the GPUs reside on different PCIe domains. Thus, a copy from the memory of GPU0 to the memory of GPU4 requires copying over the PCIe link to the memory attached to CPU0, transferring it over the QPI link to CPU1, and then sending it over PCIe again to GPU4. As you can imagine, this process adds a significant amount of overhead in terms of both latency and bandwidth.
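
To make that staged path concrete, the following sketch shows a GPU-to-GPU copy routed through a pinned host buffer, which is essentially the route a GPU0-to-GPU4 copy has to take in the topology above. The helper function and its parameters are illustrative assumptions, not a CUDA API; in practice, cudaMemcpyPeer() falls back to this kind of host staging automatically when direct P2P access is not available:

#include <cuda_runtime.h>

// Copy 'bytes' from a buffer on srcDev to a buffer on dstDev by staging
// through pinned host memory (illustrative helper, not a CUDA API).
void stagedCopy(void *dst, int dstDev, const void *src, int srcDev, size_t bytes) {
    void *host = nullptr;
    cudaMallocHost(&host, bytes);                           // pinned staging buffer in CPU memory

    cudaSetDevice(srcDev);
    cudaMemcpy(host, src, bytes, cudaMemcpyDeviceToHost);   // source GPU -> CPU memory over PCIe

    cudaSetDevice(dstDev);
    cudaMemcpy(dst, host, bytes, cudaMemcpyHostToDevice);   // CPU memory -> destination GPU over PCIe
                                                            // (crossing the QPI link between sockets)
    cudaFreeHost(host);
}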

The following diagram shows another system where GPUs are connected to each other via an NVLink interconnect that supports P2P transfers:

The preceding diagram shows a sample NVLink topology: an eight-GPU cube-mesh in which each GPU is connected to other GPUs with at most one hop.

The more important question is: how can we figure out this topology, and which GPUs support P2P transfer? Fortunately, there are tools for this. nvidia-smi is one such tool, and it is installed as part of the NVIDIA driver installation. The following screenshot shows the output of running nvidia-smi on the NVIDIA DGX server whose network topology is shown in the preceding diagram:

The preceding screenshot shows the result of running the nvidia-smi topo -m command on the DGX system, which has 8 GPUs. As you can see, any GPU that is connected to another GPU via the SMP interconnect (QPI/UPI) cannot perform a P2P transfer with it. For example, GPU0 will not be able to do P2P with GPU5, GPU6, or GPU7. Another way to determine P2P capability is via the CUDA APIs, which we will use when we convert our code in the next section.
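
As a preview of that CUDA API approach, cudaDeviceCanAccessPeer() reports whether one device can directly access another device's memory. The following standalone sketch (not part of this chapter's application code) prints a P2P capability matrix for all device pairs, similar in spirit to the nvidia-smi topo -m output:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int numGpus = 0;
    cudaGetDeviceCount(&numGpus);

    // For every ordered pair of devices, report whether P2P access is possible
    for (int src = 0; src < numGpus; ++src) {
        for (int dst = 0; dst < numGpus; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            printf("GPU%d -> GPU%d : P2P %s\n",
                   src, dst, canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}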

Now that we understand the system topology, we can start converting our application to run on multiple GPUs within a single node/server.
