How to do it...

The following code fragment shows the calculation of the product of two matrices, M × N, using the standard sequential approach. Each element of the output matrix, P, is obtained by taking the dot product of a row of matrix M with a column of matrix N:

void SequentialMatrixMultiplication(float *M, float *N, float *P, int width) {
  for (int i = 0; i < width; ++i) {
      for (int j = 0; j < width; ++j) {
          // The dot product of row i of M and column j of N gives P[i][j]
          float sum = 0;
          for (int k = 0; k < width; ++k) {
              float a = M[i * width + k];
              float b = N[k * width + j];
              sum += a * b;
          }
          P[i * width + j] = sum;
      }
  }
}

In this case, if each thread were given the task of calculating a single element of the output matrix, memory access would dominate the execution time of the algorithm.

What we can do instead is rely on a block of threads to calculate one output submatrix at a time. In this way, the threads that access the same memory block cooperate to optimize accesses, thereby minimizing the total calculation time (a tiled, shared-memory variant of the kernel is sketched at the end of this recipe):

  1. The first step is to load all the necessary modules to implement the algorithm:
import numpy as np 
from pycuda import driver, compiler, gpuarray, tools 
  2. Then, initialize the GPU device:
import pycuda.autoinit 
  3. We implement kernel_code_template, which implements the product of two matrices, indicated by the parameters a and b, with the result stored in the parameter c. Note that the MATRIX_SIZE parameter will be defined in the next step:
kernel_code_template = """ 
__global__ void MatrixMulKernel(float *a, float *b, float *c) 
{ 
    int tx = threadIdx.x; 
    int ty = threadIdx.y; 
    float Pvalue = 0; 
    for (int k = 0; k < %(MATRIX_SIZE)s; ++k) { 
        float Aelement = a[ty * %(MATRIX_SIZE)s + k]; 
        float Belement = b[k * %(MATRIX_SIZE)s + tx]; 
        Pvalue += Aelement * Belement; 
    } 
    c[ty * %(MATRIX_SIZE)s + tx] = Pvalue; 
}""" 
  4. The following parameter will be used to set the dimensions of the matrices. In this case, the size is 5 × 5:
MATRIX_SIZE = 5
  5. We define the two input matrices, a_cpu and b_cpu, which will contain random floating-point values:
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32) 
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
  6. Then, we calculate the product of the two matrices, a and b, on the host (CPU):
c_cpu = np.dot(a_cpu, b_cpu) 
  7. We allocate memory areas on the device (GPU), equal in size to the input matrices, and copy the matrices into them:
a_gpu = gpuarray.to_gpu(a_cpu)  
b_gpu = gpuarray.to_gpu(b_cpu) 
  8. We allocate a memory area on the GPU, equal in size to the output matrix resulting from the product of the two matrices. In this case, the resulting matrix, c_gpu, will have a size of 5 × 5:
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32) 
  9. The following kernel_code redefines kernel_code_template, but with the MATRIX_SIZE parameter set:
kernel_code = kernel_code_template % { 
    'MATRIX_SIZE': MATRIX_SIZE} 
  10. The SourceModule directive tells nvcc (the NVIDIA CUDA Compiler) to create a module, that is, a collection of functions, containing the previously defined kernel_code:
mod = compiler.SourceModule(kernel_code) 
  11. We then retrieve the MatrixMulKernel function from the module, mod, and give it the name matrixmul:
matrixmul = mod.get_function("MatrixMulKernel")
  12. We execute the kernel on the two device matrices, a_gpu and b_gpu, storing the result in c_gpu. The thread block size is defined as MATRIX_SIZE, MATRIX_SIZE, 1; since the whole 5 × 5 output fits in a single block, no grid argument is needed (PyCUDA's default grid is (1, 1)):
matrixmul( 
    a_gpu, b_gpu,  
    c_gpu,  
    block = (MATRIX_SIZE, MATRIX_SIZE, 1))
  13. Print the input matrices and the resulting matrix:
print ("-" * 80) 
print ("Matrix A (GPU):") 
print (a_gpu.get()) 
print ("-" * 80) 
print ("Matrix B (GPU):") 
print (b_gpu.get()) 
print ("-" * 80) 
print ("Matrix C (GPU):") 
print (c_gpu.get()) 
  14. To check the validity of the calculation performed on the GPU, we compare the results of the two implementations: the one performed on the host (CPU) and the one performed on the device (GPU). To do this, we use NumPy's allclose function, which verifies that the two arrays are element-wise equal within a tolerance (by default, a relative tolerance of 1e-05):
print(np.allclose(c_cpu, c_gpu.get()))
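
The MatrixMulKernel defined earlier launches a single block and reads every operand directly from global memory. The following is a minimal sketch of the tiled, shared-memory variant mentioned at the beginning of this recipe, in which the threads of each block cooperatively load one submatrix (tile) of a and b into shared memory before accumulating the partial products. The TILE_SIZE constant, the MatrixMulTiled name, and the assumption that the matrix width is a multiple of TILE_SIZE are illustrative choices, not part of the original recipe:

# Illustrative sketch: tiled, shared-memory matrix multiplication.
# TILE_SIZE, MatrixMulTiled, and the divisibility assumption are our own choices.
tiled_kernel_template = """
#define TILE_SIZE %(TILE_SIZE)s
__global__ void MatrixMulTiled(float *a, float *b, float *c, int width)
{
    // Tiles cooperatively loaded by the threads of one block
    __shared__ float a_tile[TILE_SIZE][TILE_SIZE];
    __shared__ float b_tile[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float Pvalue = 0;

    // Assumes width is a multiple of TILE_SIZE
    for (int t = 0; t < width / TILE_SIZE; ++t) {
        // Each thread loads one element of each tile from global memory
        a_tile[threadIdx.y][threadIdx.x] = a[row * width + t * TILE_SIZE + threadIdx.x];
        b_tile[threadIdx.y][threadIdx.x] = b[(t * TILE_SIZE + threadIdx.y) * width + col];
        __syncthreads();  // wait until the whole tile is loaded

        for (int k = 0; k < TILE_SIZE; ++k)
            Pvalue += a_tile[threadIdx.y][k] * b_tile[k][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten
    }
    c[row * width + col] = Pvalue;
}
"""

TILE_SIZE = 5   # illustrative: here a single tile covers the whole 5 x 5 matrix
tiled_code = tiled_kernel_template % {'TILE_SIZE': TILE_SIZE}
tiled_mod = compiler.SourceModule(tiled_code)
matrixmul_tiled = tiled_mod.get_function("MatrixMulTiled")
matrixmul_tiled(
    a_gpu, b_gpu, c_gpu, np.int32(MATRIX_SIZE),
    block=(TILE_SIZE, TILE_SIZE, 1),
    grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE))

For the 5 × 5 case this behaves exactly like MatrixMulKernel, but with a larger MATRIX_SIZE (a multiple of TILE_SIZE) and a grid of several blocks, each element of a and b is read from global memory once per block rather than once per thread, which is what makes the cooperative approach pay off.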