Heterogeneous programming with PyCUDA

The CUDA programming model (and, hence, that of PyCUDA) is designed for the joint execution of a software application on a CPU and a GPU: the sequential parts of the application run on the CPU, while the parts that can be parallelized run on the GPU. Unfortunately, the machine cannot work out this distribution autonomously, so it is up to the developer to indicate which parts should be run by the CPU and which by the GPU.

In fact, a CUDA application is composed of serial components, which are executed by the system CPU (the host), and parallel components, called kernels, which are executed by the GPU (the device).

A kernel launch is organized as a grid, which is in turn decomposed into blocks that are assigned to the various multiprocessors, thus implementing coarse-grained parallelism. Inside the blocks lies the fundamental computational unit, the thread, with a very fine parallel granularity. A thread can belong to only one block and is identified by a unique index for the whole kernel, derived from its block and thread coordinates. For convenience, both blocks within the grid and threads within a block can be given multidimensional indices (up to three dimensions each on current hardware). Kernels are executed sequentially with respect to one another (when launched in the default stream), whereas blocks and threads are executed in parallel. The number of threads running in parallel depends on their organization into blocks and on their resource requests relative to the resources available on the device.
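
The following is a minimal sketch of this indexing scheme, assuming the standard PyCUDA API (pycuda.autoinit, pycuda.driver, and pycuda.compiler.SourceModule) and the standard CUDA built-in variables blockIdx, blockDim, and threadIdx; the global_index kernel is a hypothetical example used purely for illustration:

    import numpy as np
    import pycuda.autoinit  # initializes the CUDA driver and creates a context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    # Each thread combines its block index with its position inside the
    # block to obtain an index that is unique across the whole kernel.
    mod = SourceModule("""
    __global__ void global_index(int *out)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        out[idx] = idx;
    }
    """)

    out = np.zeros(8, dtype=np.int32)
    # A grid of 2 blocks (coarse-grained), each with 4 threads (fine-grained).
    mod.get_function("global_index")(cuda.Out(out), block=(4, 1, 1), grid=(2, 1))
    print(out)  # prints [0 1 2 3 4 5 6 7]

Here, blockIdx.x selects the block within the grid and threadIdx.x selects the thread within the block; multiplying the block index by blockDim.x (the number of threads per block) yields a globally unique offset for each thread.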

To visualize the concepts expressed previously, refer to Figure 5 at https://sites.google.com/site/computationvisualization/programming/cuda/article1.

Blocks are designed to guarantee scalability: because they are assigned to multiprocessors independently, the same GPU application can run both on an architecture with two multiprocessors and on one with four, obviously with different execution times and levels of parallelism.

The execution of a heterogeneous program according to the PyCUDA programming model is thus structured as follows (a minimal code sketch of these steps is shown after the list):

  1. Allocate memory on the host.
  2. Transfer data from the host memory to the device memory.
  3. Launch the computation on the device by invoking the kernel functions.
  4. Transfer the results from the device memory to the host memory.
  5. Release the memory allocated on the device.
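
What follows is a minimal sketch of these five steps, assuming the standard PyCUDA API (pycuda.autoinit, pycuda.driver, and pycuda.compiler.SourceModule); the double_array kernel is a hypothetical example used purely to illustrate the flow:

    import numpy as np
    import pycuda.autoinit  # initializes the CUDA driver and creates a context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    # 1. Allocate memory on the host.
    a_host = np.random.randn(400).astype(np.float32)

    # 2. Transfer data from the host memory to the device memory.
    a_dev = cuda.mem_alloc(a_host.nbytes)
    cuda.memcpy_htod(a_dev, a_host)

    # 3. Launch the computation on the device by invoking a kernel
    #    (double_array is a hypothetical kernel that doubles each element).
    mod = SourceModule("""
    __global__ void double_array(float *a)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        a[idx] *= 2.0f;
    }
    """)
    double_array = mod.get_function("double_array")
    double_array(a_dev, block=(400, 1, 1), grid=(1, 1))

    # 4. Transfer the results from the device memory to the host memory.
    result = np.empty_like(a_host)
    cuda.memcpy_dtoh(result, a_dev)

    # 5. Release the memory allocated on the device.
    a_dev.free()

    assert np.allclose(result, 2 * a_host)

Note that steps 2 and 4 are explicit here; PyCUDA also provides convenience wrappers (such as pycuda.driver.In and pycuda.driver.Out) that perform these transfers automatically at kernel invocation.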

The following diagram shows the execution flow of a program according to the PyCUDA programming model:

PyCUDA programming model

In the next section, we will work through a concrete example of the programming methodology to follow in order to build PyCUDA applications.
