Writing wrappers for the CUDA Driver API

We will now look at how we can write our very own wrappers for some pre-packaged binary CUDA library functions using Ctypes. In particular, we will be writing wrappers for the CUDA Driver API, which will allow us to perform all of the necessary operations needed for basic GPU usage, including GPU initialization, memory allocation/transfers/deallocation, kernel launching, and context creation/synchronization/destruction. This is a very powerful piece of knowledge; it will allow us to use our GPU without going through PyCUDA, and also without writing any cumbersome host-side C-function wrappers.

We will now write a small module that will act as a wrapper library for the CUDA Driver API. Let's talk about what this means for a minute. The Driver API is slightly different and a little more technical than the CUDA Runtime API, the latter being what we have been working with from CUDA C throughout this text. The Driver API is designed to be used with a regular C/C++ compiler rather than with NVCC, and it has some different conventions, such as using the cuLaunchKernel function to launch a kernel rather than the <<< gridsize, blocksize >>> bracket notation. This will allow us to directly access the functions we need to launch a kernel from a PTX file with Ctypes.

Let's start writing this module by importing everything from Ctypes into the module's namespace, and then importing the sys module. We will make our module usable from both Windows and Linux by loading the proper library file (either nvcuda.dll or libcuda.so), checking the system's OS with sys.platform, like so:

from ctypes import *
import sys
if 'linux' in sys.platform:
    cuda = CDLL('libcuda.so')
elif 'win' in sys.platform:
    cuda = CDLL('nvcuda.dll')

We have successfully loaded the CUDA Driver API, and we can now begin writing wrappers for the necessary functions for basic GPU usage. We will look at the prototypes of each Driver API function as we go along, which is generally necessary to do when you are writing Ctypes wrappers.

The reader is encouraged to look up all of the functions we will be using in this section in the official Nvidia CUDA Driver API Documentation, which is available here: https://docs.nvidia.com/cuda/cuda-driver-api/.

Let's start with the most fundamental function from the Driver API, cuInit, which will initialize the Driver API. This takes an unsigned integer used for flags as an input parameter and returns a value of type CUresult, which is actually just an integer value. We can write our wrapper like so:

cuInit = cuda.cuInit
cuInit.argtypes = [c_uint]
cuInit.restype = int
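
As a quick sanity check (a minimal sketch, not part of the wrapper module itself), we might call our new wrapper and verify that it returns CUDA_SUCCESS, like so:

# Initialize the Driver API; the flags argument must currently be 0.
# A return value of 0 means CUDA_SUCCESS; anything else is an error code.
err = cuInit(0)
if err != 0:
    raise RuntimeError('cuInit failed with CUresult error code %d' % err)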

Now let's move on to the next function, cuDeviceGetCount, which will tell us how many NVIDIA GPUs we have installed on our computer. This takes an integer pointer as its single input, which is actually a single integer output value that is returned by reference. The return value is another CUresult integer; all of the Driver API functions return CUresult, which is a standardization of their error values. For instance, if any function we see returns 0, this means the result is CUDA_SUCCESS, while non-zero results always mean an error or warning:

cuDeviceGetCount = cuda.cuDeviceGetCount
cuDeviceGetCount.argtypes = [POINTER(c_int)]
cuDeviceGetCount.restype = int
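
Since the count is returned by reference, we have to hand this wrapper a pointer to a Ctypes integer. A brief sketch of how a call might look:

# The device count is written into a c_int that we pass by reference.
num_gpus = c_int(0)
cuDeviceGetCount(byref(num_gpus))
print('Found %d CUDA-capable device(s).' % num_gpus.value)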

Now let's write a wrapper for cuDeviceGet, which will return a device handle by reference in the first input. This will correspond to the ordinal GPU given in the second input. The first parameter is of the type CUdevice *, which is actually just an integer pointer:

cuDeviceGet = cuda.cuDeviceGet
cuDeviceGet.argtypes = [POINTER(c_int), c_int]
cuDeviceGet.restype = int
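
Again, this is a by-reference output, so a call might look like the following sketch (taking the first GPU, ordinal 0):

# Get a handle to GPU number 0; a CUdevice handle is really just an integer.
device = c_int(0)
cuDeviceGet(byref(device), 0)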

Let's remember that every CUDA session will require at least one CUDA context, which can be thought of as analogous to a process running on the CPU. While this is handled automatically with the Runtime API, here we will have to create a context manually on a device (using a device handle) before we can use it, and we will have to destroy this context when our CUDA session is over.

We can create a CUDA context with the cuCtxCreate function, which will, of course, create a context. Let's look at the prototype listed in the documentation: 

 CUresult cuCtxCreate ( CUcontext* pctx, unsigned int flags, CUdevice dev )

Of course, the return value is CUresult. The first input is a pointer to a type called CUcontext, which is actually itself a pointer to a particular C structure used internally by CUDA. Since our only interaction with CUcontext from Python will be to hold onto its value to pass between other functions, we can just store it as a C void * type, which is used to store a generic pointer address for any type. Since the first parameter is actually a pointer to a CUcontext (again, which is itself a pointer to an internal data structure; this is another pass-by-reference output value), we can set its type to be just a plain void *, which is the c_void_p type in Ctypes. The second value is an unsigned integer, while the final value is the device handle on which to create the new context; let's remember that this is itself just an integer. We are now prepared to create our wrapper for cuCtxCreate:

cuCtxCreate = cuda.cuCtxCreate
cuCtxCreate.argtypes = [c_void_p, c_uint, c_int]
cuCtxCreate.restype = int

You can always use the void * type in C/C++ (c_void_p in Ctypes) to point to any arbitrary data or variable, even structures and objects whose definition may not be available.
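
Putting this to use might look like the following sketch, which assumes the device handle we obtained from cuDeviceGet previously:

# Create a new context on our device; the resulting CUcontext (an opaque
# pointer) is written by reference into a c_void_p that we hold onto.
context = c_void_p()
cuCtxCreate(byref(context), 0, device)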

The next function is cuModuleLoad, which will load a PTX module file for us. The first argument is a CUmodule by reference (again, we can just use a c_void_p here), and the second is the filename, which will be a typical null-terminated C string; this is a char *, or c_char_p in Ctypes:

cuModuleLoad = cuda.cuModuleLoad
cuModuleLoad.argtypes = [c_void_p, c_char_p]
cuModuleLoad.restype = int
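
Note that in Python 3 the filename has to be passed as a bytes object, since c_char_p expects a null-terminated C string. A brief sketch (the filename here is only a placeholder):

# Load a compiled PTX module from disk; the CUmodule handle comes back by reference.
module = c_void_p()
cuModuleLoad(byref(module), c_char_p(b'mandelbrot.ptx'))  # placeholder filename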

The next function is for synchronizing all launched operations over the current CUDA context, and is called cuCtxSynchronize (this takes no arguments):

cuCtxSynchronize = cuda.cuCtxSynchronize
cuCtxSynchronize.argtypes = []
cuCtxSynchronize.restype = int

The next function is used for retrieving a kernel function handle from a loaded module so that we may launch it onto the GPU, which corresponds exactly to PyCUDA's get_function method, which we've seen many times at this point. The documentation tells us that the prototype is CUresult cuModuleGetFunction ( CUfunction* hfunc, CUmodule hmod, const char* name ). We can now write the wrapper:

cuModuleGetFunction = cuda.cuModuleGetFunction
cuModuleGetFunction.argtypes = [c_void_p, c_void_p, c_char_p ]
cuModuleGetFunction.restype = int
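
A sketch of how we might pull a kernel handle out of the module we just loaded follows; the kernel name used here is only a placeholder:

# Retrieve a CUfunction handle by the kernel's name as it appears in the PTX file.
kernel = c_void_p()
cuModuleGetFunction(byref(kernel), module, c_char_p(b'mandelbrot_ker'))  # placeholder name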

Now let's write the wrappers for the standard dynamic memory operations; these will be necessary since we won't have the luxury of using PyCUDA gpuarray objects. These are practically the same as the CUDA Runtime operations that we have worked with before, namely cudaMalloc, cudaMemcpy, and cudaFree:

cuMemAlloc = cuda.cuMemAlloc
cuMemAlloc.argtypes = [c_void_p, c_size_t]
cuMemAlloc.restype = int

cuMemcpyHtoD = cuda.cuMemcpyHtoD
cuMemcpyHtoD.argtypes = [c_void_p, c_void_p, c_size_t]
cuMemcpyHtoD.restype = int

cuMemcpyDtoH = cuda.cuMemcpyDtoH
cuMemcpyDtoH.argtypes = [c_void_p, c_void_p, c_size_t]
cuMemcpyDtoH.restype = int

cuMemFree = cuda.cuMemFree
cuMemFree.argtypes = [c_void_p]
cuMemFree.restype = int
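
To see how these fit together, here is a minimal round-trip sketch using a NumPy array; NumPy is not part of our wrapper module, so the import here is only for the example:

import numpy as np

x = np.arange(10, dtype=np.float32)

# Allocate device memory; the device pointer comes back by reference into a c_void_p.
x_gpu = c_void_p()
cuMemAlloc(byref(x_gpu), c_size_t(x.nbytes))

# Copy host -> device, copy back device -> host, and then free the allocation.
cuMemcpyHtoD(x_gpu, x.ctypes.data_as(c_void_p), c_size_t(x.nbytes))
y = np.empty_like(x)
cuMemcpyDtoH(y.ctypes.data_as(c_void_p), x_gpu, c_size_t(x.nbytes))
cuMemFree(x_gpu)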

Now, we will write a wrapper for the cuLaunchKernel function. Of course, this is what we will use to launch a CUDA kernel onto the GPU, provided that we have already initialized the CUDA Driver API, set up a context, loaded a module, allocated memory and configured inputs, and have extracted the kernel function handle from the loaded module. This one is a little more complex than the other functions, so we will look at the prototype: 

CUresult cuLaunchKernel ( CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, CUstream hStream, void** kernelParams, void** extra )  

The first parameter is a handle to the kernel function we want to launch, which we can represent as c_void_p. The six gridDim and blockDim parameters are used to indicate the grid and block dimensions. The unsigned integer, sharedMemBytes, is used to indicate how many bytes of shared memory will be allocated for each block upon kernel launch. CUstream hStream is an optional parameter that we can use to set up a custom stream, or set to NULL (0) if we wish to use the default stream, which we can represent as c_void_p in Ctypes. Finally, the kernelParams and extra parameters are used to set the inputs to a kernel; these are a little involved, so for now just know that we can also represent these as c_void_p:

cuLaunchKernel = cuda.cuLaunchKernel
cuLaunchKernel.argtypes = [c_void_p, c_uint, c_uint, c_uint, c_uint, c_uint, c_uint, c_uint, c_void_p, c_void_p, c_void_p]
cuLaunchKernel.restype = int
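
To give a feel for how kernelParams works before we cover it properly, here is a hedged sketch: for a hypothetical kernel taking a single device-pointer argument (reusing the kernel handle and device allocation from the earlier sketches), kernelParams is an array of void pointers, with each entry pointing to the memory that holds one argument's value:

# Launch our (hypothetical) kernel over a grid of 32 blocks of 128 threads each,
# with no dynamic shared memory and the default (NULL) stream.
# Each entry of kernelParams points to one argument's value, so here we point
# at the c_void_p object holding our device pointer.
args = (c_void_p * 1)(addressof(x_gpu))
cuLaunchKernel(kernel, 32, 1, 1, 128, 1, 1, 0, None, args, None)
cuCtxSynchronize()  # wait for the launch to finish before reading back results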

Now we have one last function to write a wrapper for, cuCtxDestroy. We use this at the end of a CUDA session to destroy a context on the GPU. The only input is a CUcontext object, which is represented by c_void_p:

cuCtxDestroy = cuda.cuCtxDestroy
cuCtxDestroy.argtypes = [c_void_p]
cuCtxDestroy.restype = int

Let's save this into the cuda_driver.py file. We have now completed our Driver API wrapper module! Next, we will look at how to load a PTX module and launch a kernel using only our module and our Mandelbrot PTX. 

This example is also available as the cuda_driver.py file in this book's GitHub repository.