The PyCUDA SourceModule function

We'll use the SourceModule function from PyCUDA to compile raw inline CUDA C code into usable kernels that we can launch from Python. We should note that SourceModule actually compiles code into a CUDA module; this is like a Python module or a Windows DLL, only it contains a collection of compiled CUDA code. This means we'll have to "pull out" a reference to the kernel we want to use with PyCUDA's get_function before we can actually launch it. Let's start with a basic example of how to use a CUDA kernel with SourceModule.

As before, we'll start by making one of the simplest kernel functions possible: one that multiplies a vector by a scalar. We'll begin with the imports:

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda import gpuarray
from pycuda.compiler import SourceModule

Now we can immediately dive into writing our kernel:

ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
    int i = threadIdx.x;
    outvec[i] = scalar*vec[i];
}
""")

So, let's stop and contrast this with how it was done in ElementwiseKernel. First, when we declare a kernel function in CUDA C proper, we precede it with the __global__ keyword. This distinguishes the function as a kernel to the compiler. We'll always declare a kernel as a void function, because we always get our output values by writing to a chunk of memory whose pointer we pass in as a parameter. We can declare the parameters as we would with any standard C function: first we have outvec, which will be our output scaled vector, which is of course a floating-point array pointer. Next, we have scalar, which is represented with a mere float; notice that this is not a pointer! If we wish to pass simple singleton input values to our kernel, we can always do so without using pointers. Finally, we have our input vector, vec, which is of course another floating-point array pointer.

Singleton input parameters to a kernel function can be passed in directly from the host without using pointers or allocated device memory. 

Let's peer into the kernel before we continue with testing it. We recall that ElementwiseKernel automatically parallelized over multiple GPU threads by a value, i, which was set for us by PyCUDA. Here, each individual thread is identified by the threadIdx value, which we retrieve ourselves as follows: int i = threadIdx.x;.

threadIdx is used to tell each individual thread its identity. This is usually used to determine an index for what values should be processed on the input and output data arrays. (This can also be used for assigning particular threads different tasks than others with standard C control flow statements such as if or switch.)

Now, we are ready to perform our scalar multiplication in parallel as before: outvec[i] = scalar*vec[i];.
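To make the role of the thread index concrete, here is a minimal CPU-only sketch of the same logic in plain Python. It is an emulation, not PyCUDA code: the names scalar_multiply_thread and launch_emulated are hypothetical, and the loop runs the "threads" one after another, whereas a GPU would run them concurrently.

```python
# CPU-only emulation of the kernel body; no GPU or PyCUDA is involved.
# Both function names are hypothetical and exist only for illustration.

def scalar_multiply_thread(thread_idx_x, outvec, scalar, vec):
    """The work done by one thread, given its index."""
    i = thread_idx_x              # plays the role of: int i = threadIdx.x;
    outvec[i] = scalar * vec[i]   # each thread writes exactly one element


def launch_emulated(num_threads, outvec, scalar, vec):
    """Run the body once per thread index (sequentially, unlike a GPU)."""
    for thread_idx_x in range(num_threads):
        scalar_multiply_thread(thread_idx_x, outvec, scalar, vec)


vec = [0.5, 1.0, 1.5, 2.0]
outvec = [0.0] * len(vec)         # pre-allocated output, like outvec in the kernel
launch_emulated(len(vec), outvec, 2.0, vec)
print(outvec)                     # [1.0, 2.0, 3.0, 4.0]
```

Note that the function returns nothing: as with the void kernel, the result arrives through the output buffer that was passed in.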

Now, let's test this code: we first must pull out a reference to our compiled kernel function from the CUDA module we just compiled with SourceModule. We can get this kernel reference with PyCUDA's get_function as follows:

scalar_multiply_gpu = ker.get_function("scalar_multiply_kernel")

Now, we have to put some data on the GPU to actually test our kernel. Let's set up a floating-point array of 512 random values, and then copy these into an array in the GPU's global memory using the gpuarray.to_gpu function. (We're going to multiply this random vector by a scalar both on the GPU and CPU, and see if the output matches.) We'll also allocate a chunk of empty memory in the GPU's global memory using the gpuarray.empty_like function:

testvec = np.random.randn(512).astype(np.float32)
testvec_gpu = gpuarray.to_gpu(testvec)
outvec_gpu = gpuarray.empty_like(testvec_gpu)

We are now prepared to launch our kernel. We'll set the scalar value as 2. (Again, since the scalar is a singleton, we don't have to copy this value to the GPU—we should be careful that we typecast it properly, however.) Here we'll have to specifically set the number of threads to 512 with the block and grid parameters. We are now ready to launch:

scalar_multiply_gpu(outvec_gpu, np.float32(2), testvec_gpu, block=(512,1,1), grid=(1,1,1))
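The typecast to a 32-bit float matters because the kernel declares scalar as a C float, which is 4 bytes wide, while a plain Python float is a C double of 8 bytes. The following sketch uses only the standard struct module to show that the two encodings are not interchangeable. It illustrates the general byte-width issue and makes no claim about how PyCUDA itself reacts to a mismatched argument.

```python
import struct

# Byte widths of the two C types involved (standard sizes, little-endian).
FLOAT_SIZE = struct.calcsize("<f")    # C float, as declared in the kernel signature
DOUBLE_SIZE = struct.calcsize("<d")   # C double, what a plain Python float holds

# Pack the scalar 2 both ways to compare the raw bytes.
as_float = struct.pack("<f", 2.0)
as_double = struct.pack("<d", 2.0)

# Reading the first 4 bytes of the double as if they were a float
# does not recover 2.0: the bit layouts differ.
misread = struct.unpack("<f", as_double[:FLOAT_SIZE])[0]

print(FLOAT_SIZE, DOUBLE_SIZE)   # 4 8
print(misread)                   # 0.0
```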

We can now check whether the output matches the expected output by calling the get function of our gpuarray output object and comparing the result to the correct output with NumPy's allclose function:

print("Does our kernel work correctly? : {}".format(np.allclose(outvec_gpu.get(), 2*testvec)))

(The code to this example is available as the file, under 4 in the repository.)
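As an aside on the correctness check, a tolerance-based comparison such as allclose is the usual habit for floating-point results, since values that have been rounded to 32 bits rarely match their 64-bit counterparts exactly. The sketch below reproduces the test that NumPy documents for allclose, |a - b| <= atol + rtol * |b| with defaults rtol=1e-5 and atol=1e-8, in plain Python. The helpers to_float32 and allclose_sketch are hypothetical names, not part of NumPy or PyCUDA.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def allclose_sketch(a, b, rtol=1e-5, atol=1e-8):
    """Hypothetical stand-in for np.allclose on two equal-length lists."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


expected = [0.1, 0.2, 0.3]                     # 64-bit reference values
computed = [to_float32(x) for x in expected]   # same values after 32-bit rounding

print(computed == expected)                    # False: exact equality fails
print(allclose_sketch(computed, expected))     # True: equal within tolerance
```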

Now we are starting to remove the training wheels of the PyCUDA kernel templates we learned in the previous chapter—we can now directly write a kernel in pure CUDA C and launch it to use a specific number of threads on our GPU. However, we'll have to learn a bit more about how CUDA structures threads into collections of abstract units known as blocks and grids before we can continue with kernels.

