Level-1 AXPY with cuBLAS

Let's start with a basic level-1 ax + y (or AXPY) operation with cuBLAS. Before we dive in, let's stop for a moment and review a bit of linear algebra and think about what this means. Here, a is considered to be a scalar; that is, a real number, such as -10, 0, 1.345, or 100. x and y are considered to be vectors in some vector space, ℝⁿ. This means that x and y are n-tuples of real numbers, so in the case of ℝ³, these could be values such as [1, 2, 3] or [-0.345, 8.15, -15.867]. The product ax means the scaling of x by a, so if a is 10 and x is the first vector just given, then ax is each individual value of x multiplied by a; that is, [10, 20, 30]. Finally, the sum ax + y means that we add the values in each corresponding slot of both vectors to produce a new vector, which (assuming that y is the second vector given) would be [9.655, 28.15, 14.133].
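If you want to check this arithmetic for yourself, the same computation can be done on the CPU with plain NumPy (a minimal sketch; the variable names here are purely illustrative):

import numpy as np

a = np.float32(10)                        # the scalar
x = np.float32([1, 2, 3])                 # first vector
y = np.float32([-0.345, 8.15, -15.867])   # second vector

print(a * x + y)   # approximately [9.655, 28.15, 14.133]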

Let's do this in cuBLAS now. First, let's import the appropriate modules:

import pycuda.autoinit
from pycuda import gpuarray
import numpy as np

Now let's import cuBLAS:

from skcuda import cublas

We can now set up our vector arrays and copy them to the GPU. Note that we are using 32-bit (single precision) floating point numbers:

a = np.float32(10)
x = np.float32([1,2,3])
y = np.float32([-0.345, 8.15, -15.867])
x_gpu = gpuarray.to_gpu(x)
y_gpu = gpuarray.to_gpu(y)

We now have to create a cuBLAS context. This is similar in nature to CUDA contexts, which we discussed in Chapter 5, Streams, Events, Contexts, and Concurrency, only this time it is used explicitly for managing cuBLAS sessions. The cublasCreate function creates a cuBLAS context and gives a handle to it as its output. We will need to hold onto this handle for as long as we intend to use cuBLAS in this session:

cublas_context_h = cublas.cublasCreate()

We can now use the cublasSaxpy function. The S stands for single precision, which is what we will need since we are working with 32-bit floating point arrays:

cublas.cublasSaxpy(cublas_context_h, x_gpu.size, a, x_gpu.gpudata, 1, y_gpu.gpudata, 1)

Let's discuss what we just did. Also, let's keep in mind that this is a direct wrapper to a low-level C function, so its inputs may look more like those of a C function than of a true Python function. In short, this performed an "AXPY" operation, ultimately putting the output data into the y_gpu array. Let's go through each input parameter one by one.

The first input is always the cuBLAS context handle. We then have to specify the size of the vectors, since this function will ultimately be operating on C pointers; we can do this by using the size attribute of a gpuarray. Having already typecast our scalar to a NumPy float32 variable, we can pass the a variable right over as the scalar parameter. We then hand the underlying C pointer of the x_gpu array to this function using the gpudata attribute. Then we specify the stride of the first array as 1: the stride specifies how many steps we should take between each input value. (In contrast, if you were using a vector taken from a column of a row-wise matrix, you would set the stride to the width of the matrix.) We then put in the pointer to the y_gpu array, and set its stride to 1 as well.
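To make the stride parameter more concrete, here is a hypothetical sketch (not part of the walkthrough above) that uses Saxpy to add a scaled vector to the first column of a 3 x 4 row-wise matrix by stepping through the matrix with a stride of 4:

A = np.arange(12, dtype=np.float32).reshape(3, 4)   # 3 x 4 row-wise matrix
v = np.float32([1, 1, 1])
A_gpu = gpuarray.to_gpu(A)
v_gpu = gpuarray.to_gpu(v)
h = cublas.cublasCreate()
# n = 3 elements; stride 1 through v, stride 4 (the matrix width) down A's first column
cublas.cublasSaxpy(h, 3, np.float32(2), v_gpu.gpudata, 1, A_gpu.gpudata, 4)
cublas.cublasDestroy(h)
print(A_gpu.get())   # only the first column has changed: it is now [2, 6, 10]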

We are done with our computation; now we have to explicitly destroy our cuBLAS context:

cublas.cublasDestroy(cublas_context_h)

We can now verify whether the result matches the same computation performed with NumPy, using the allclose function, like so:

print('This is close to the NumPy approximation: %s' % np.allclose(a*x + y, y_gpu.get()))

Again, notice that the final output was put into the y_gpu array, which was also an input.

Always remember that BLAS and cuBLAS functions act in-place to save the time and memory of a new allocation call. This means that an input array will also be used as an output!
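For instance, if we had wanted to keep the original contents of y_gpu, we could have copied the array before the Saxpy call (a hypothetical variation on the code above):

y_backup = y_gpu.copy()   # allocate a new GPU array holding the original y
cublas.cublasSaxpy(cublas_context_h, x_gpu.size, a, x_gpu.gpudata, 1, y_gpu.gpudata, 1)
# y_gpu now holds ax + y, while y_backup still holds the original y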

We just saw how to perform an AXPY operation using the cublasSaxpy function.

Let's discuss the prominent upper case S. As we mentioned previously, this stands for single precision; that is, 32-bit real floating point values (float32). If we want to operate on arrays of 64-bit real floating point values (float64 in NumPy and PyCUDA), then we would use the cublasDaxpy function; for 64-bit single precision complex values (complex64), we would use cublasCaxpy, while for 128-bit double precision complex values (complex128), we would use cublasZaxpy.

We can tell what type of data a BLAS or cuBLAS function operates on by checking the letter that precedes the rest of the function name. Functions that use single precision reals are always marked with S, double precision reals with D, single precision complex with C, and double precision complex with Z.
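As a quick illustration, the double precision version of our earlier operation looks almost identical; only the dtypes and the function name change (a minimal sketch, reusing the same values as before):

a = np.float64(10)
x_gpu = gpuarray.to_gpu(np.float64([1, 2, 3]))
y_gpu = gpuarray.to_gpu(np.float64([-0.345, 8.15, -15.867]))
handle = cublas.cublasCreate()
# D prefix: operates on 64-bit (double precision) real values
cublas.cublasDaxpy(handle, x_gpu.size, a, x_gpu.gpudata, 1, y_gpu.gpudata, 1)
cublas.cublasDestroy(handle)
print(y_gpu.get())   # approximately [9.655, 28.15, 14.133]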