Filling in the gaps with CUDA-C

We will now go through the very basics of how to write a full-on CUDA-C program. We'll start small and just translate the fixed version of the little matrix multiplication test program we just debugged in the last section to a pure CUDA-C program, which we will then compile from the command line with NVIDIA's nvcc compiler into a native Windows or Linux executable file (we will see how to use the Nsight IDE in the next section, so we will just be doing this with only a text editor and the command line for now). Again, the reader is encouraged to look at the code we are translating from Python as we go along, which is available as the matrix_ker.py file in the repository.

Now, let's open our favorite text editor and create a new file entitled matrix_ker.cu. The extension will indicate that this is a CUDA-C program, which can be compiled with the nvcc compiler.

CUDA-C program and library source code filenames always use the .cu file extension

Let's start at the beginning: just as Python uses the import keyword at the start of a program to pull in libraries, we recall that the C language uses #include. We will need to include a few header files before we continue.

Let's start with these:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

Let's briefly think about what we need these for: cuda_runtime.h is the header file that has the declarations of all of the particular CUDA datatypes, functions, and structures that we will need for our program. We will need to include this for any pure CUDA-C program that we write. stdio.h, of course, gives us all of the standard I/O functions for the host such as printf, and we need stdlib.h for using the malloc and free dynamic memory allocation functions on the host.

Remember to always put #include <cuda_runtime.h> at the beginning of every pure CUDA-C program!

Now, before we continue, we remember that we will ultimately have to check the output of our kernel against a known correct output, as we did with NumPy's allclose function. Unfortunately, C has no standard, easy-to-use numerical math library as Python has with NumPy; for something this simple, it's usually easier to just write your own equivalent function, which means we will now explicitly make our own version of allclose. We will use the #define macro in C to set up a constant called _EPSILON, which will act as the maximum amount by which an output value may differ from the expected value and still be considered a match, and we will also set up a macro called _ABS, which will give us the absolute value of a number. We do so as follows:

#define _EPSILON 0.001
#define _ABS(x) ( (x) > 0.0f ? (x) : -(x) )
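Notice the parentheses around x in the body of _ABS: since we will apply this macro to expressions such as A[i] - B[i], the parentheses ensure that the negation in the second branch applies to the whole expression rather than only to its first term.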

We can now create our own version of allclose. This will take in two float pointers and an integer value, len. We loop through both arrays and compare them element by element: if any corresponding elements differ by more than _EPSILON, we return -1; otherwise, we return 0 to indicate that the two arrays do indeed match.

We note one thing: since we are using CUDA-C, we precede the definition of the function with __host__, to indicate that this function is intended to be run on the CPU rather than on the GPU:

__host__ int allclose(float *A, float *B, int len)
{
    int returnval = 0;

    for (int i = 0; i < len; i++)
    {
        if ( _ABS(A[i] - B[i]) > _EPSILON )
        {
            returnval = -1;
            break;
        }
    }

    return(returnval);
}

We can now cut and paste the device and kernel functions exactly as they appear in our Python version here:


__device__ float rowcol_dot(float *matrix_a, float *matrix_b, int row, int col, int N)
{
    float val = 0;

    for (int k=0; k < N; k++)
    {
        val += matrix_a[ row*N + k ] * matrix_b[ col + k*N ];
    }

    return(val);
}

__global__ void matrix_mult_ker(float * matrix_a, float * matrix_b, float * output_matrix, int N)
{
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    int col = blockIdx.y*blockDim.y + threadIdx.y;

    output_matrix[col + row*N] = rowcol_dot(matrix_a, matrix_b, row, col, N);
}

Again, in contrast with __host__, notice that the CUDA device function is preceded by __device__, while the CUDA kernel is preceded by __global__.

Now, as in any C program, we will need to write the main function, which will run on the host; this is where we will set up our test case and from where we will explicitly launch our CUDA kernel onto the GPU. Again, in contrast to vanilla C, we will have to explicitly specify that this function is also to be run on the CPU with __host__:

__host__ int main()
{

The first thing we will have to do is select and initialize our GPU. We do so with cudaSetDevice as follows:

cudaSetDevice(0);

cudaSetDevice(0) will select the default GPU. If you have multiple GPUs installed in your system, you can select and use them instead with cudaSetDevice(1), cudaSetDevice(2), and so on.
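If you're not sure how many GPUs are installed in your system, you can query the count at runtime with cudaGetDeviceCount. A minimal sketch, separate from our test program, might look like this:

int num_devices;
cudaGetDeviceCount(&num_devices);   // how many CUDA-capable GPUs are installed?
printf("Found %d CUDA device(s).\n", num_devices);
cudaSetDevice(0);                   // stick with the default GPU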

We will now set up N as in Python to indicate the height/width of our matrix. Since our test case will consist only of 4 x 4 matrices, we set it to 4. Since we will be working with dynamically allocated arrays and pointers, we will also have to set up a value that will indicate the number of bytes our test matrices will require. The matrices will consist of N x N floats, and we can determine the number of bytes required by a float with the sizeof keyword in C:

int N = 4;
int num_bytes = sizeof(float)*N*N;
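For N = 4, and given that a float is 4 bytes wide on the platforms CUDA supports, num_bytes here works out to 4 x 4 x 4 = 64 bytes.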

We now set up our test matrices as follows; these will correspond exactly to the test_a and test_b matrices that we saw in our Python test program (notice how we use the h_ prefix to indicate that these arrays are stored on the host, rather than on the device):


float h_A[] = { 1.0, 2.0, 3.0, 4.0,
                1.0, 2.0, 3.0, 4.0,
                1.0, 2.0, 3.0, 4.0,
                1.0, 2.0, 3.0, 4.0 };

float h_B[] = { 14.0, 13.0, 12.0, 11.0,
                14.0, 13.0, 12.0, 11.0,
                14.0, 13.0, 12.0, 11.0,
                14.0, 13.0, 12.0, 11.0 };

We now set up another array, which will indicate the expected output of the matrix multiplication of the prior test matrices. We will have to calculate this explicitly and put these values into our C code. Ultimately, we will compare this to the GPU output at the end of the program, but let's just set it up and get it out of the way:

float h_AxB[] = { 140.0, 130.0, 120.0, 110.0,
                  140.0, 130.0, 120.0, 110.0,
                  140.0, 130.0, 120.0, 110.0,
                  140.0, 130.0, 120.0, 110.0 };
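Each entry of h_AxB is just the dot product of a row of h_A with the corresponding column of h_B; for example, the top-left entry is 1.0*14.0 + 2.0*14.0 + 3.0*14.0 + 4.0*14.0 = 140.0.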

We now declare some pointers for arrays that will live on the GPU: one each to hold copies of h_A and h_B, and one for the GPU's output. Notice how we just use standard float pointers for this. Also, notice the d_ prefix: this is another standard CUDA-C convention indicating that these arrays will exist on the device:

float * d_A;
float * d_B;
float * d_output;

Now, we will allocate some memory on the device for d_A and d_B with cudaMalloc, which is almost the same as malloc in C; this is the function that PyCUDA gpuarray functions such as empty or to_gpu have been invisibly calling for us throughout this book to allocate memory arrays on the GPU:

cudaMalloc((float **) &d_A, num_bytes);
cudaMalloc((float **) &d_B, num_bytes);

Let's think a bit about how this works: in C, we can get the address of a variable by preceding it with an ampersand (&); if we have an integer, x, we can get its address with &x. Since &x is a pointer to an integer, its type will be int *. We can use this to set a caller's variables from within a C function through its parameters, rather than relying only on return values.

Since cudaMalloc sets the pointer through a parameter rather than through its return value (in contrast to the regular malloc), we have to use the ampersand operator on d_A and d_B; the resulting value is a pointer to a pointer, because it is the address of a float pointer (hence float **). We have to typecast this value explicitly with the parenthesized cast, since cudaMalloc can allocate arrays of any type. Finally, in the second parameter, we have to indicate how many bytes to allocate on the GPU; we already set up num_bytes to be the number of bytes needed to hold a 4 x 4 matrix of floats, so we plug this in and continue.
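To see why we need to pass the address of the pointer, consider how an ordinary C function can hand an allocated address back to its caller through a pointer-to-pointer parameter, just as cudaMalloc does. The following is a small illustrative sketch, not part of our program, using the hypothetical names my_alloc and h_array:

// An ordinary C function that "returns" an allocation through a
// pointer-to-pointer parameter, analogous to how cudaMalloc works.
void my_alloc(float **ptr, int num_bytes)
{
    *ptr = (float *) malloc(num_bytes);
}

// usage: pass the address of a float pointer, just as we do with &d_A
float *h_array;
my_alloc(&h_array, num_bytes);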

We can now copy the values from h_A and h_B to d_A and d_B respectively with two invocations of the function cudaMemcpy, as follows:

cudaMemcpy(d_A, h_A, num_bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, num_bytes, cudaMemcpyHostToDevice);

cudaMemcpy always takes a destination pointer as its first argument, a source pointer as its second, the number of bytes to copy as its third, and a direction flag as its final argument. The final parameter indicates whether we are copying from the host to the GPU with cudaMemcpyHostToDevice, from the GPU to the host with cudaMemcpyDeviceToHost, or between two arrays on the GPU with cudaMemcpyDeviceToDevice.

We will now allocate an array to hold the output of our matrix multiplication on the GPU with another invocation of cudaMalloc:

cudaMalloc((float **) &d_output, num_bytes);

Finally, we will need some memory set up on the host to store the GPU's output when we want to check the results of our kernel. Let's set up a regular C float pointer and allocate memory with malloc, as we would normally:

float * h_output;
h_output = (float *) malloc(num_bytes);

Now, we are almost ready to launch our kernel. CUDA uses a data structure called dim3 to indicate block and grid sizes for kernel launches; we will set these up as such, since we want a grid with a dimension of 2 x 2 and blocks that are also of a dimension of 2 x 2:

dim3 block(2,2,1);
dim3 grid(2,2,1);
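Any dimension we leave unspecified in a dim3 constructor defaults to 1, so we could equivalently have written dim3 block(2,2); here; we spell out the third dimension just to be explicit.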

We are now ready to launch our kernel; we use the triple angle brackets to indicate to the CUDA-C compiler the block and grid sizes that the kernel should be launched over:

matrix_mult_ker <<< grid, block >>> (d_A, d_B, d_output, N);
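With a 2 x 2 grid of 2 x 2 blocks, we launch exactly 4 x 4 = 16 threads, one per element of the output matrix. For a larger N we would not hard-code these values; a minimal sketch of how the launch might be generalized follows. (This is only an illustration, not part of our test program: since matrix_mult_ker performs no bounds checking, it assumes N is a multiple of the block width.)

// Hypothetical generalization for an N x N output matrix.
// Assumes N is a multiple of 2, since matrix_mult_ker has no bounds check.
dim3 block(2, 2, 1);
dim3 grid( (N + block.x - 1) / block.x, (N + block.y - 1) / block.y, 1);
matrix_mult_ker <<< grid, block >>> (d_A, d_B, d_output, N);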

Now, of course, before we can copy the output of the kernel back to the host, we have to ensure that the kernel has finished executing. We do this by calling cudaDeviceSynchronize, which will block the host from issuing any more commands to the GPU until the kernel has finished execution:

cudaDeviceSynchronize();

We can now copy the output of our kernel to the array we've allocated on the host:

cudaMemcpy(h_output, d_output, num_bytes, cudaMemcpyDeviceToHost);

Again, we synchronize:

cudaDeviceSynchronize();

Before we check the output, we realize that we no longer need any of the arrays we allocated on the GPU. We free this memory by calling cudaFree on each array:

cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_output);

We're done with the GPU, so we call cudaDeviceReset:

cudaDeviceReset();

Now, we finally check the output we copied back to the host against the expected output, using the allclose function we wrote at the beginning of this program. If the actual output doesn't match the expected output, we print an error and return -1; otherwise, we print that it does match and return 0. We then put a closing bracket on our program's main function:

if (allclose(h_AxB, h_output, N*N) < 0)
{
    printf("Error! Output of kernel does not match expected output.\n");
    free(h_output);
    return(-1);
}
else
{
    printf("Success! Output of kernel matches expected output.\n");
    free(h_output);
    return(0);
}
}
Notice that in both cases we make one final call to the standard C free function, since we allocated memory for h_output with malloc.

We now save our file, and compile it into a Windows or Linux executable file from the command line with nvcc matrix_ker.cu -o matrix_ker. This should output a binary executable file, matrix_ker.exe (in Windows) or matrix_ker (in Linux). Let's try compiling and running it right now:
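On a Linux system, the compilation and run should look something like the following (the last line is simply the output of the printf call in our success branch):

$ nvcc matrix_ker.cu -o matrix_ker
$ ./matrix_ker
Success! Output of kernel matches expected output.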

Congratulations, you've just created your first pure CUDA-C program! (This example is available as matrix_ker.cu in the repository, under 7.)
