Compiling and launching pure PTX code

We have just seen how to call a pure-C function from Python with Ctypes. In some ways, this approach is a little inelegant: our binary file must contain both host code and compiled GPU code, which can feel cumbersome. Can we use pure, compiled GPU code alone and launch it appropriately onto the GPU, without writing a C wrapper each and every time? Fortunately, we can.

The NVCC compiler compiles CUDA-C into PTX (Parallel Thread Execution), an interpreted pseudo-assembly language that is compatible across NVIDIA's various GPU architectures. Whenever you compile a program that uses a CUDA kernel with NVCC into an executable EXE, DLL, .so, or ELF file, there will be PTX code for that kernel contained within the file. We can also directly compile a CUDA .cu file into a file with the extension .ptx, which will contain only the compiled GPU kernels. Luckily for us, PyCUDA includes an interface to load a CUDA kernel directly from a PTX file, freeing us from the shackles of just-in-time compilation while still letting us use all of the other nice features of PyCUDA.

Now let's compile the Mandelbrot code we just wrote into a PTX file; we don't need to make any changes to it. Just type the following into the command line in either Linux or Windows:

nvcc -ptx -o mandelbrot.ptx mandelbrot.cu
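Since PTX is a plain-text format, you can open the resulting .ptx file in any text editor and inspect it directly. As a minimal sketch (using a hand-written PTX fragment for illustration, not the actual output of the command above), kernel entry points can be picked out by searching for the `.entry` directive:

```python
import re

# A hand-written fragment in the style of NVCC's PTX output -- NOT the
# actual contents of mandelbrot.ptx, just an illustration of the format.
sample_ptx = """
.version 6.0
.target sm_50
.address_size 64

.visible .entry mandelbrot_ker(
    .param .u64 mandelbrot_ker_param_0,
    .param .u64 mandelbrot_ker_param_1
)
{
    ret;
}
"""

def kernel_names(ptx_text):
    # Kernel entry points are declared with the .entry directive;
    # the kernel's symbol name immediately follows it.
    return re.findall(r'\.entry\s+(\w+)', ptx_text)

print(kernel_names(sample_ptx))  # ['mandelbrot_ker']
```

This kind of quick inspection is handy for confirming which kernel symbols a .ptx file actually exports before trying to load them.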

Now let's modify the Python program from the last section to use PTX code instead. We will remove ctypes from the imports and add the appropriate PyCUDA imports:

from __future__ import division
from time import time
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
import pycuda
from pycuda import gpuarray
import pycuda.autoinit

Now let's load the PTX file using PyCUDA's module_from_file function, like so:

mandel_mod = pycuda.driver.module_from_file('./mandelbrot.ptx')

Now we can get a reference to our kernel with get_function, just like we did with PyCUDA's SourceModule:

mandel_ker = mandel_mod.get_function('mandelbrot_ker')
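One caveat worth knowing: get_function looks the kernel up by its literal symbol name, which is only 'mandelbrot_ker' here because the kernel was declared with extern "C" in the .cu file. Otherwise, NVCC (which compiles CUDA-C as C++) mangles the symbol under the Itanium C++ ABI: a `_Z` prefix, the length of the identifier, and then the identifier, followed by encoded parameter types. A sketch of just that prefix (parameter-type encoding elided):

```python
def itanium_mangle_prefix(name):
    # Under the Itanium C++ ABI (used by NVCC), a mangled symbol begins
    # with _Z, then the length of the identifier, then the identifier
    # itself; the parameter-type encoding that follows is elided here.
    return '_Z%d%s' % (len(name), name)

print(itanium_mangle_prefix('mandelbrot_ker'))  # _Z14mandelbrot_ker
```

If get_function ever raises a "not found" error on a PTX module, a mangled name like this in the .ptx file is the usual culprit.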

We can now rewrite the Mandelbrot function to use this kernel with the appropriate gpuarray objects and typecast inputs. (We won't go over this one line by line, since its functionality should be obvious at this point.):

def mandelbrot(breadth, low, high, max_iters, upper_bound):
    lattice = gpuarray.to_gpu(np.linspace(low, high, breadth, dtype=np.complex64))
    out_gpu = gpuarray.empty(shape=(lattice.size, lattice.size), dtype=np.float32)
    gridsize = int(np.ceil(lattice.size**2 / 32))
    mandel_ker(lattice, out_gpu, np.int32(max_iters), np.float32(upper_bound**2), np.int32(lattice.size), grid=(gridsize, 1, 1), block=(32, 1, 1))
    out = out_gpu.get()

    return out
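To see why the launch configuration above covers the whole image, the arithmetic can be checked on the CPU with plain NumPy. (The numbers below assume the 512-point lattice used in the main function; the variable names are ours, not from the kernel.)

```python
import numpy as np

breadth = 512                   # lattice points per axis, as in main()
threads_per_block = 32
total_points = breadth ** 2     # one thread per output pixel

# Same grid-size computation as in mandelbrot() above; ceil() guards the
# case where the point count is not a multiple of the block size.
gridsize = int(np.ceil(total_points / threads_per_block))
print(gridsize)  # 8192 blocks of 32 threads cover all 262144 points

# The scalar arguments must be typecast so their sizes match the kernel's
# C parameter types exactly: a 4-byte int and a 4-byte float here.
assert np.int32(256).nbytes == 4
assert np.float32(2.0 ** 2).nbytes == 4
```

The explicit np.int32 and np.float32 casts matter because PyCUDA passes scalar arguments by value, byte for byte, into the kernel's parameter space.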

The main function will be exactly the same as in the last section:

if __name__ == '__main__':
    t1 = time()
    mandel = mandelbrot(512, -2, 2, 256, 2)
    t2 = time()
    mandel_time = t2 - t1
    print('It took %s seconds to calculate the Mandelbrot graph.' % mandel_time)
    plt.figure(1)
    plt.imshow(mandel, extent=(-2, 2, -2, 2))
    plt.show()

Now, try running this to ensure that the output is correct. You may also notice some speed improvements over the Ctypes version. 

This code is also available in the mandelbrot_ptx.py file under the "10" directory in this book's GitHub repository.