Using the CUDA Driver API

We will now translate our little Mandelbrot generation program so that we can use our wrapper library. Let's start with the appropriate import statements; notice how we load all of our wrappers into the current namespace:

from __future__ import division
from time import time
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
from cuda_driver import *

Let's put all of our GPU code into the mandelbrot function, as we did previously. We will start by initializing the CUDA Driver API with cuInit and then checking if there is at least one GPU installed on the system, raising an exception otherwise:

def mandelbrot(breadth, low, high, max_iters, upper_bound):
    cuInit(0)
    cnt = c_int(0)
    cuDeviceGetCount(byref(cnt))
    if cnt.value == 0:
        raise Exception('No GPU device found!')

Notice the byref here: this is the Ctypes equivalent of the reference operator (&) from C programming. We'll now apply this idea again, remembering that the device handle and CUDA context can be represented as c_int and c_void_p with Ctypes:

    cuDevice = c_int(0)
    cuDeviceGet(byref(cuDevice), 0)
    cuContext = c_void_p()
    cuCtxCreate(byref(cuContext), 0, cuDevice)
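
If you want to confirm which GPU the program is running on, the Driver API also provides cuDeviceGetName. Assuming our wrapper library exposes this function as well (and re-exports Ctypes' create_string_buffer along with the other Ctypes names we've been using), a minimal sketch would look like this:

    # assumes cuDeviceGetName is wrapped in cuda_driver
    gpu_name = create_string_buffer(100)             # buffer to receive the name
    cuDeviceGetName(gpu_name, c_int(100), cuDevice)
    print 'Using GPU: %s' % gpu_name.value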

We will now load our PTX module, remembering to typecast the filename to a C string with c_char_p:

    cuModule = c_void_p()
    cuModuleLoad(byref(cuModule), c_char_p('./mandelbrot.ptx'))
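
If you need to regenerate the PTX file, it can be compiled from the kernel's CUDA-C source with nvcc's -ptx flag; assuming the kernel source is in a file named mandelbrot.cu, the command would be:

nvcc -ptx mandelbrot.cu -o mandelbrot.ptx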

Now we will set up the lattice on the host side, as well as a NumPy array of zeros called graph that will be used to store the output on the host side. We will also allocate memory on the GPU for both the lattice and the graph output, and then copy the lattice to the GPU with cuMemcpyHtoD:

    lattice = np.linspace(low, high, breadth, dtype=np.float32)
    lattice_c = lattice.ctypes.data_as(POINTER(c_float))
    lattice_gpu = c_void_p(0)
    graph = np.zeros(shape=(lattice.size, lattice.size), dtype=np.float32)
    cuMemAlloc(byref(lattice_gpu), c_size_t(lattice.size*sizeof(c_float)))
    graph_gpu = c_void_p(0)
    cuMemAlloc(byref(graph_gpu), c_size_t(lattice.size**2 * sizeof(c_float)))
    cuMemcpyHtoD(lattice_gpu, lattice_c, c_size_t(lattice.size*sizeof(c_float)))
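
To put these sizes in perspective: with the breadth of 512 that we'll pass in from the main function, the lattice occupies 512 * 4 = 2,048 bytes on the GPU (each 32-bit float is 4 bytes), while the 512 x 512 output graph occupies 512 * 512 * 4 = 1,048,576 bytes, or exactly 1 MB.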

Now we will get a handle to the Mandelbrot kernel with cuModuleGetFunction and set up some of the inputs:

    mandel_ker = c_void_p(0)
    cuModuleGetFunction(byref(mandel_ker), cuModule, c_char_p('mandelbrot_ker'))
    max_iters = c_int(max_iters)
    upper_bound_squared = c_float(upper_bound**2)
    lattice_size = c_int(lattice.size)

The next step is a little trickier. Before we continue, we have to understand how parameters are passed into a CUDA kernel with cuLaunchKernel. Let's see how this works in CUDA-C first.

We express the input parameters in kernelParams as an array of void * values, which are, themselves, pointers to the inputs we desire to plug into our kernel. In the case of our Mandelbrot kernel, it would look like this:

void * mandel_params [] = {&lattice_gpu, &graph_gpu, &max_iters, &upper_bound_squared, &lattice_size};

Now let's see how we can express this in Ctypes, which isn't immediately obvious. First, let's put all of our inputs into a Python list, in the proper order:

    mandel_args0 = [lattice_gpu, graph_gpu, max_iters, upper_bound_squared, lattice_size]

Now we need pointers to each of these values, typecast to the void * type. Let's use the Ctypes function addressof to get the address of each Ctypes variable here (which is similar to byref, only not bound to a particular type), and then typecast it to c_void_p. We'll store these values in another list:

    mandel_args = [c_void_p(addressof(x)) for x in mandel_args0]
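
To make the distinction concrete, here is a tiny sketch (with a throwaway variable x) of what addressof gives us:

x = c_int(123)
addr = addressof(x)     # a plain Python integer holding x's address
ptr = c_void_p(addr)    # the same address, wrapped as a C void * value
# byref(x), by contrast, yields a lightweight reference object that can
# only be passed as a function argument; it can't be stored in our array below.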

Now let's use Ctypes to convert this Python list to an array of void * pointers, like so:

    mandel_params = (c_void_p * len(mandel_args))(*mandel_args)

We can now set up our grid's size, as we did previously, and launch our kernel with this set of parameters using cuLaunchKernel, synchronizing the context afterward. After the kernel handle, cuLaunchKernel takes the grid dimensions (gridsize, 1, 1), the block dimensions (32, 1, 1), the number of bytes of dynamically allocated shared memory (10,000 here), the stream (None, meaning the default stream), the array of kernel parameters, and finally an extra-options argument (also None):

    # one thread per point of the lattice, in one-dimensional blocks of 32 threads
    gridsize = int(np.ceil(lattice.size**2 / 32))
    cuLaunchKernel(mandel_ker, gridsize, 1, 1, 32, 1, 1, 10000, None, mandel_params, None)
    cuCtxSynchronize()
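
For the breadth of 512 that we use in the main function, this works out to ceil(512 * 512 / 32) = 8,192 blocks of 32 threads each, for 262,144 threads in total, one per point of the 512 x 512 lattice.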

We will now copy the data from the GPU into our NumPy array using cuMemcpyDtoH. The array's ctypes.data member gives us the address of the array's underlying memory, which lets us access the array directly from C as a chunk of heap memory; we typecast this address to c_void_p using the Ctypes cast function:

    cuMemcpyDtoH(cast(graph.ctypes.data, c_void_p), graph_gpu, c_size_t(lattice.size**2 * sizeof(c_float)))
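
Equivalently, we could have written graph.ctypes.data_as(c_void_p) here; this is the same NumPy facility we used earlier to get a POINTER(c_float) to the lattice.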

We are now done! Let's free the arrays we allocated on the GPU and end our GPU session by destroying the current context. We will then return the graph NumPy array to the calling function:

    cuMemFree(lattice_gpu)
    cuMemFree(graph_gpu)
    cuCtxDestroy(cuContext)
    return graph
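
One caveat before we move on: every Driver API function returns a CUresult status code, where 0 (CUDA_SUCCESS) means success, and the calls above silently discard it. If your wrapper library doesn't already check the status itself, a hypothetical helper like this sketch can catch failures early:

def cu_check(result):
    # hypothetical helper; 0 is CUDA_SUCCESS, anything else is an error code
    if result != 0:
        raise Exception('CUDA Driver API call failed with code %d' % result)

You could then wrap any of our calls, for example cu_check(cuInit(0)).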

Now we can set up our main function exactly as before:

if __name__ == '__main__':
    t1 = time()
    mandel = mandelbrot(512, -2, 2, 256, 2)
    t2 = time()
    mandel_time = t2 - t1
    print 'It took %s seconds to calculate the Mandelbrot graph.' % mandel_time

    fig = plt.figure(1)
    plt.imshow(mandel, extent=(-2, 2, -2, 2))
    plt.show()

Now try running this function to ensure that it yields the same output as the other Mandelbrot programs we just wrote. 

Congratulations—you've just written a direct interface to the low-level CUDA Driver API and successfully launched a kernel with it! 

This program is also available as the mandelbrot_driver.py file in this book's GitHub repository.