How to do it...

In order to show the PyCUDA programming model, we consider the task of having to double all the elements of a 5 × 5 matrix:

  1. We import the libraries needed for the task we want to perform:
import pycuda.driver as cuda 
import pycuda.autoinit 
from pycuda.compiler import SourceModule 
import numpy 
  2. The numpy library, which we imported, allows us to construct the input to our problem, that is, a 5 × 5 matrix whose values are chosen randomly:
a = numpy.random.randn(5,5) 
a = a.astype(numpy.float32) 
  3. The matrix, thus built, must be copied from the memory of the host to the memory of the device. For this, we allocate a memory space (a_gpu) on the device that is large enough to contain matrix a. We use the mem_alloc function, which takes the number of bytes to allocate as its argument; for matrix a, this is given by a.nbytes:
a_gpu = cuda.mem_alloc(a.nbytes) 
  4. After that, we can transfer the matrix from the host to the memory area just created on the device by using the memcpy_htod function (a higher-level, one-call alternative with gpuarray is sketched after the recipe):
cuda.memcpy_htod(a_gpu, a) 
  5. Inside the device, the doubles_matrix kernel function will operate. Its purpose is to multiply each element of the input matrix by 2. As you can see, the syntax of the doubles_matrix function is C-like, while the SourceModule statement hands the code to the NVIDIA compiler (the nvcc compiler), which creates a module that, in this case, consists of the doubles_matrix function only. Since the 5 × 5 matrix is stored in row-major order, each thread computes its linear index as threadIdx.x + threadIdx.y*5, where 5 is the row stride; for example, the thread at (2, 3) handles a[17]:
mod = SourceModule(""" 
  __global__ void doubles_matrix(float *a){ 
    int idx = threadIdx.x + threadIdx.y*5; /* row stride is 5 for a 5x5 matrix */ 
    a[idx] *= 2;} 
  """)
  6. With the get_function method, we obtain a reference, func, to the doubles_matrix function contained in the mod module:
func = mod.get_function("doubles_matrix") 
  7. Finally, we run the kernel function. In order to successfully execute a kernel function on the device, the CUDA user must specify the input for the kernel and the size of the execution thread block. In the following case, the input is the a_gpu matrix that was previously copied to the device, while the dimension of the thread block is (5,5,1); a sketch of a grid launch for larger matrices follows the recipe:
func(a_gpu, block=(5,5,1)) 
  8. Back on the host, we allocate an area of memory equal in size to that of the input matrix a:
a_doubled = numpy.empty_like(a) 
  9. Then, we copy the contents of the memory area allocated on the device, that is, the a_gpu matrix, to the previously defined memory area, a_doubled:
cuda.memcpy_dtoh(a_doubled, a_gpu) 
  10. Finally, we print the contents of the input matrix a and of the output matrix in order to verify that the implementation is correct (the full script, assembled, follows the recipe):
print ("ORIGINAL MATRIX") 
print (a) 
print ("DOUBLED MATRIX AFTER PyCUDA EXECUTION") 
print (a_doubled) 
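
For convenience, here is the whole recipe assembled into a single runnable script, with the row stride fixed to 5; the final allclose check is our addition, not part of the original listing:

import pycuda.driver as cuda 
import pycuda.autoinit 
from pycuda.compiler import SourceModule 
import numpy 

a = numpy.random.randn(5, 5) 
a = a.astype(numpy.float32) 

a_gpu = cuda.mem_alloc(a.nbytes) 
cuda.memcpy_htod(a_gpu, a) 

mod = SourceModule(""" 
  __global__ void doubles_matrix(float *a){ 
    int idx = threadIdx.x + threadIdx.y*5; 
    a[idx] *= 2;} 
  """) 

func = mod.get_function("doubles_matrix") 
func(a_gpu, block=(5, 5, 1)) 

a_doubled = numpy.empty_like(a) 
cuda.memcpy_dtoh(a_doubled, a_gpu) 

print("ORIGINAL MATRIX") 
print(a) 
print("DOUBLED MATRIX AFTER PyCUDA EXECUTION") 
print(a_doubled) 

# Sanity check (our addition): every element must be exactly twice the original. 
assert numpy.allclose(a_doubled, 2 * a) 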
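
As promised in step 7, here is a minimal sketch of how the kernel could be generalized to matrices larger than a single thread block; it reuses the imports from the recipe, and the kernel name doubles_matrix_big, the size parameter n, and the 10 × 10 example are our illustrative assumptions, not part of the recipe:

mod_big = SourceModule(""" 
  __global__ void doubles_matrix_big(float *a, int n){ 
    /* global row/column of this thread across the whole grid */ 
    int col = threadIdx.x + blockIdx.x * blockDim.x; 
    int row = threadIdx.y + blockIdx.y * blockDim.y; 
    if (row < n && col < n) 
      a[row * n + col] *= 2; 
  } 
  """) 
func_big = mod_big.get_function("doubles_matrix_big") 
# For a 10 x 10 matrix, a 2 x 2 grid of 5 x 5 blocks covers all 100 elements 
# (a_gpu10 here is a hypothetical device buffer holding the 10 x 10 matrix): 
# func_big(a_gpu10, numpy.int32(10), block=(5, 5, 1), grid=(2, 2)) 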
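
As mentioned in step 4, PyCUDA also provides the higher-level gpuarray module, which combines allocation and host-to-device transfer in a single call; a minimal sketch, assuming you only need the doubled result and not an explicit kernel:

import pycuda.autoinit 
import pycuda.gpuarray as gpuarray 
import numpy 

a = numpy.random.randn(5, 5).astype(numpy.float32) 
a_gpu = gpuarray.to_gpu(a)      # allocate on the device and copy in one call 
a_doubled = (2 * a_gpu).get()   # elementwise doubling, then copy back to the host 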