Let's look at how to perform a GEMV matrix-vector multiplication. For an m x n matrix A, an n-dimensional vector x, an m-dimensional vector y, and scalars alpha and beta, this is defined as the following operation:

y ← αAx + βy
Now let's look at how the function is laid out before we continue:
cublasSgemv(handle, trans, m, n, alpha, A, lda, x, incx, beta, y, incy)
Let's go through these inputs one-by-one:
- handle refers to the cuBLAS context handle.
- trans refers to the operation applied to the matrix: we can specify whether we want to use the original matrix, its direct transpose, or its conjugate transpose (for complex matrices). This is important to keep in mind because, in every case, this function expects the matrix A to be stored in column-major format.
- m and n are the number of rows and columns of the matrix A that we want to use.
- alpha is the floating-point value for α.
- A is the m x n matrix A.
- lda indicates the leading dimension of the matrix A, that is, the number of rows in its allocated storage, so the total size of the stored matrix is actually lda x n. This matters in the column-major format because the matrix is stored as a one-dimensional array, and cuBLAS uses lda to step from the start of one column to the next; lda must therefore be at least m, or cuBLAS will read the values of A from the wrong positions.
- We then have x and its stride, incx; x is the underlying C pointer of the vector being multiplied by A, and incx is the spacing between its consecutive elements in memory. Remember that x has to be of size n, that is, the number of columns of A (assuming no transpose).
- beta is the floating-point value for β.
- Finally, we have y and its stride, incy, as the last parameters. Remember that y has to be of size m, the number of rows of A.
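Before we move on, it may help to see the operation these parameters describe written out in plain NumPy. This is just an illustrative CPU-side sketch of y ← αAx + βy, not part of the cuBLAS call itself:

```python
import numpy as np

m, n = 3, 4
alpha, beta = 2.0, 0.5
A = np.arange(m * n, dtype=np.float32).reshape(m, n)  # an m x n matrix
x = np.ones(n, dtype=np.float32)   # length n: the number of columns of A
y = np.ones(m, dtype=np.float32)   # length m: the number of rows of A

# GEMV: y <- alpha * A x + beta * y
y = alpha * A.dot(x) + beta * y
print(y)  # [12.5 44.5 76.5]
```

This is exactly the computation that cublasSgemv performs on the GPU, with the extra bookkeeping (trans, lda, incx, incy) describing how A, x, and y are laid out in memory.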
Let's test this by generating a 10 x 100 matrix of random values A, and a vector x of 100 random values. We'll initialize y as a vector of 10 zeros. We will set alpha to 1 and beta to 0, just to get a direct matrix multiplication with no scaling:
import numpy as np

m = 10
n = 100
alpha = 1
beta = 0
A = np.random.rand(m,n).astype('float32')
x = np.random.rand(n).astype('float32')
y = np.zeros(m).astype('float32')
We will now have to get A into column-major (or column-wise) format. NumPy stores matrices in row-major (or row-wise) format by default, meaning that the underlying one-dimensional array iterates through all of the values of the first row, then all of the values of the second row, and so on. Recall that a transpose operation swaps the rows of a matrix with its columns; conveniently, this means that the one-dimensional array underlying the transposed matrix is the same as the original matrix laid out in column-major format. We can make a copy of the transpose of A with A.T.copy(), and copy this, as well as x and y, to the GPU:
import pycuda.autoinit
from pycuda import gpuarray

A_columnwise = A.T.copy()
A_gpu = gpuarray.to_gpu(A_columnwise)
x_gpu = gpuarray.to_gpu(x)
y_gpu = gpuarray.to_gpu(y)
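To convince yourself that the transposed copy really is A in column-major order, you can compare its flat buffer against NumPy's own Fortran-order (column-major) ravel of A. This is a quick CPU-side check, not part of the GPU workflow:

```python
import numpy as np

A = np.random.rand(10, 100).astype('float32')
A_columnwise = A.T.copy()

# The flat (C-order) buffer of the transposed copy walks A column by column,
# which is exactly A raveled in Fortran (column-major) order.
print(np.array_equal(A_columnwise.ravel(), A.ravel(order='F')))  # True
```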
Since we now have the column-wise matrix stored properly on the GPU, we can set the trans variable so that cuBLAS does not take a transpose, using the _CUBLAS_OP dictionary from scikit-cuda's cublas module:

from skcuda import cublas
trans = cublas._CUBLAS_OP['N']
Since the leading dimension of our column-major matrix is exactly the number of rows we want to use, we set lda to m. The strides for the x and y vectors are 1, since we want to use every element. We now have all of the values we need set up, and can now create our cuBLAS context and store its handle, like so:
lda = m
incx = 1
incy = 1
handle = cublas.cublasCreate()
We can now launch our function. Remember that A, x, and y are actually PyCUDA gpuarray objects, so we have to use the gpudata attribute of each to pass their underlying device pointers into this function. Other than doing this, this is pretty straightforward:
cublas.cublasSgemv(handle, trans, m, n, alpha, A_gpu.gpudata, lda, x_gpu.gpudata, incx, beta, y_gpu.gpudata, incy)
We can now destroy our cuBLAS context and check the output against NumPy's dot on the CPU to ensure that it is correct:
cublas.cublasDestroy(handle)
print('cuBLAS returned the correct value: %s' % np.allclose(np.dot(A, x), y_gpu.get()))
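As a side note, the host-side transposed copy can in principle be avoided: the flat row-major buffer of the m x n matrix A is, bit for bit, the column-major layout of the n x m matrix A-transpose. We could therefore pass A's buffer to cuBLAS directly, describe it as an n x m matrix with lda set to n, and set trans to cublas._CUBLAS_OP['T'] so that the transpose is undone inside the GEMV. The following NumPy sketch only demonstrates the layout argument on the CPU; adapting the cuBLAS call accordingly is left as an exercise:

```python
import numpy as np

m, n = 10, 100
A = np.random.rand(m, n).astype('float32')
x = np.random.rand(n).astype('float32')

# Reinterpret A's row-major buffer as an n x m column-major matrix:
# column 0 is then A's first row, so this matrix is exactly A.T.
A_reinterpreted = A.ravel().reshape(n, m, order='F')
print(np.allclose(A_reinterpreted, A.T))  # True

# Transposing it back (which is what trans='T' asks cuBLAS to do)
# recovers the original product A x.
print(np.allclose(A_reinterpreted.T.dot(x), A.dot(x)))  # True
```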