Level-3 GEMM in cuBLAS for measuring GPU performance

We will now look at how to perform a general matrix-matrix multiplication (GEMM) with cuBLAS. This time, we will make something a little more utilitarian than the last few cuBLAS examples we saw: we will use GEMM as a performance metric for our GPU to determine the number of Floating Point Operations Per Second (FLOPS) it can perform, measured separately for single precision and for double precision. Using GEMM is a standard technique for evaluating the performance of computing hardware in FLOPS, as it gives a much better sense of sheer computational power than a raw clock speed in MHz or GHz.

If you need a brief review, recall that we covered matrix-matrix multiplication in depth in the last chapter; if you have forgotten how it works, it's strongly suggested that you review that material before moving on to this section.

First, let's see how a GEMM operation is defined:

C ← alpha * AB + beta * C

This means that we perform a matrix multiplication of A and B, scale the result by alpha, and then add this to the C matrix that we have scaled by beta, placing the final result in C.
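
Before we bring in cuBLAS, it can help to see the same update written with plain NumPy on the CPU. This is just an illustrative sketch with made-up sizes and scalars, not part of the timing code:

import numpy as np

# A small CPU-side example of what a GEMM computes.
m, k, n = 3, 4, 2
A = np.random.randn(m, k)
B = np.random.randn(k, n)
C = np.random.randn(m, n)
alpha, beta = 2.0, 0.5

# C <- alpha * (AB) + beta * C
C = alpha * A.dot(B) + beta * C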

Let's think about how many floating point operations are executed to get the final result of a real-valued GEMM operation, assuming that A is an m x k (where m is rows and k is columns) matrix, B is a k x n matrix, and C is an m x n matrix. First, let's figure out how many operations are required for computing AB. Let's take a single column of B and multiply A by it: this will amount to k multiplies and k - 1 adds for each of the m rows of A, which is km + (k-1)m operations for that one column. There are n columns in B, so computing AB will total kmn + (k-1)mn = 2kmn - mn operations. Now, we use alpha to scale AB, which will be mn operations, since that is the size of the matrix AB; similarly, scaling C by beta is another mn operations. Finally, we add these two resulting matrices, which is yet another mn operations. This means that we will have a total of 2kmn - mn + 3mn = 2kmn + 2mn = 2mn(k+1) floating point operations in a given GEMM operation.
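
As a quick sanity check of this count, take m = n = k = 2: computing AB costs 2kmn - mn = 16 - 4 = 12 operations, while scaling AB by alpha, scaling C by beta, and the final matrix addition each cost mn = 4 operations, for a total of 24, which agrees with 2mn(k+1) = 2 x 2 x 2 x 3 = 24.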

Now the only thing we have to do is run a timed GEMM operation, taking note of the sizes of the matrices, and divide 2kmn + 2mn by the total time duration to calculate the FLOPS of our GPU. The resulting number will be very large, so we will represent it in terms of GFLOPS, that is, how many billions (10^9) of operations can be computed per second. We can compute this by multiplying the FLOPS value by 10^-9.

Now we are ready to start coding this up. Let's start with our import statements, as well as the time function:

import pycuda.autoinit
from pycuda import gpuarray
import numpy as np
from skcuda import cublas
from time import time

Now we will set the m, n, and k variables for our matrix sizes. We want our matrices to be relatively big so that the time duration is large enough to avoid division-by-zero errors. The following values should be sufficient for any GPU released up to mid-2018 or earlier; users with newer cards may consider increasing these values:

m = 5000
n = 10000
k = 10000
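
Before moving on, we can plug these sizes into the operation count we derived earlier; this quick sanity check (not part of the benchmark itself) shows that each GEMM call performs roughly a trillion floating point operations:

# 2mn(k+1) = 2 * 5000 * 10000 * 10001 = 1,000,100,000,000 operations,
# so a GPU finishing one such GEMM per second would be doing about 1,000 GFLOPS.
print(2 * m * n * (k + 1))   # prints 1000100000000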

We will now write a function that computes the GFLOPS for both single and double precision. We will set the input value to 'D' if we wish to use double precision, or 'S' otherwise:

def compute_gflops(precision='S'):

    if precision=='S':
        float_type = 'float32'
    elif precision=='D':
        float_type = 'float64'
    else:
        return -1

Now let's generate some random matrices of the appropriate precision that we will use for timing. GEMM operations act similarly to the GEMV operation we saw before in that cuBLAS expects column-major storage, so we will have to transpose these matrices before we copy them to the GPU. (Since we are just doing timing, this step isn't strictly necessary, but it's good practice to remember it.)
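
To see what this transposition does to the underlying memory, here is a small illustrative snippet (separate from the benchmark code) showing that taking the transpose and copying rearranges the buffer into column-major order:

import numpy as np

# NumPy stores this 2 x 3 matrix in row-major (C) order by default.
X = np.array([[0, 1, 2],
              [3, 4, 5]])
print(X.flatten())           # memory order as stored: [0 1 2 3 4 5]

# X.T.copy() holds the same elements in column-major order,
# which is the layout cuBLAS assumes:
print(X.T.copy().flatten())  # [0 3 1 4 2 5]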

We will set up some other necessary variables for GEMM, whose purpose should be self-explanatory at this point (transa, lda, ldb, and so on):

    A = np.random.randn(m, k).astype(float_type)
    B = np.random.randn(k, n).astype(float_type)
    C = np.random.randn(m, n).astype(float_type)

    # Transpose to get column-major copies, then transfer them to the GPU.
    A_cm = A.T.copy()
    B_cm = B.T.copy()
    C_cm = C.T.copy()
    A_gpu = gpuarray.to_gpu(A_cm)
    B_gpu = gpuarray.to_gpu(B_cm)
    C_gpu = gpuarray.to_gpu(C_cm)

    alpha = np.random.randn()
    beta = np.random.randn()

    # No further transposition inside GEMM; the leading dimensions
    # match the column-major storage of A, B, and C.
    transa = cublas._CUBLAS_OP['N']
    transb = cublas._CUBLAS_OP['N']
    lda = m
    ldb = k
    ldc = m

We can now start the timer! First, we will create a cuBLAS context:

    t = time()
    handle = cublas.cublasCreate()

We will now launch GEMM. Keep in mind that there are two versions for the real case: cublasSgemm for single precision and cublasDgemm for double precision. We can execute the appropriate function using a little Python trick: we will write a string with cublas%sgemm with the appropriate parameters, and then replace the %s with D or S by appending % precision to the string. We will then execute this string as Python code with the exec function, like so:

    exec('cublas.cublas%sgemm(handle, transa, transb, m, n, k, alpha, A_gpu.gpudata, lda, B_gpu.gpudata, ldb, beta, C_gpu.gpudata, ldc)' % precision)
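
If you would rather not use exec, the same dispatch can be done by looking the function up by name with getattr; the following is an equivalent sketch of that alternative (using the same variables as above), not the version used in this program:

# Alternative to exec: fetch cublasSgemm or cublasDgemm by name and call it.
gemm = getattr(cublas, 'cublas%sgemm' % precision)
gemm(handle, transa, transb, m, n, k, alpha, A_gpu.gpudata, lda, B_gpu.gpudata, ldb, beta, C_gpu.gpudata, ldc)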

We can now destroy the cuBLAS context and get the final time for our computation:

    cublas.cublasDestroy(handle)
    t = time() - t

Then we need to compute the GFLOPS using the equation we derived and return it as the output of this function:

    gflops = 2*m*n*(k+1)*(10**-9) / t
    return gflops

Now we can set up our main function. We will output the GFLOPS in both the single and double precision cases:

if __name__ == '__main__':
    print('Single-precision performance: %s GFLOPS' % compute_gflops('S'))
    print('Double-precision performance: %s GFLOPS' % compute_gflops('D'))

Now let's do a little homework before we run this program: go to https://www.techpowerup.com and search for your GPU, and take note of two things, the single precision floating point performance and the double precision floating point performance. I am using a GTX 1050 right now, and its listing claims 1,862 GFLOPS of single precision performance and 58.20 GFLOPS of double precision performance. Let's run this program right now and see if this aligns with the truth:

Lo and behold, it does!

This program is also available as the cublas_gemm_flops.py file under the directory in this book's repository.