How it works...

Let's consider the PyCUDA programming workflow. First, we prepare the two input matrices and compute their product on the CPU, which will serve as the reference result:

import numpy as np

MATRIX_SIZE = 5
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
c_cpu = np.dot(a_cpu, b_cpu)

Then, we transfer these matrices to the GPU device by using the gpuarray.to_gpu() PyCUDA function:

import pycuda.autoinit  # initializes the CUDA driver and creates a context
from pycuda import gpuarray

a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

The core of the algorithm is the following kernel function. Note that the __global__ keyword specifies that this function is a kernel function, which means that it will be executed by the device (GPU) following a call from the host code (CPU):

__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
    // Each thread computes one element of the result matrix c.
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    float Pvalue = 0;
    // Dot product of row ty of a and column tx of b.
    for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
        float Aelement = a[ty * %(MATRIX_SIZE)s + k];
        float Belement = b[k * %(MATRIX_SIZE)s + tx];
        Pvalue += Aelement * Belement;
    }
    c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;
}
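The %(MATRIX_SIZE)s markers are not CUDA syntax: they are Python %-interpolation placeholders that are filled in before the source string reaches the CUDA compiler. The following sketch shows this substitution step; the variable names kernel_code_template and kernel_code are illustrative, and the compile-and-launch lines are shown only as comments, since they require a CUDA-capable GPU:

```python
# The kernel source is an ordinary Python template string; %-interpolation
# replaces every %(MATRIX_SIZE)s placeholder with the concrete size.
kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c){
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  float Pvalue = 0;
  for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
    float Aelement = a[ty * %(MATRIX_SIZE)s + k];
    float Belement = b[k * %(MATRIX_SIZE)s + tx];
    Pvalue += Aelement * Belement;}
  c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;
}
"""

MATRIX_SIZE = 5
kernel_code = kernel_code_template % {"MATRIX_SIZE": MATRIX_SIZE}

# On a machine with a CUDA-capable GPU, the interpolated source would then
# be compiled and the kernel launched on a single 5 x 5 block of threads:
#
#   from pycuda.compiler import SourceModule
#   mod = SourceModule(kernel_code)
#   matrixmul = mod.get_function("MatrixMulKernel")
#   matrixmul(a_gpu, b_gpu, c_gpu,
#             block=(MATRIX_SIZE, MATRIX_SIZE, 1))
print(kernel_code)
```

After the interpolation, the loop bound and the index arithmetic contain the literal constant 5, so the matrix size is baked into the compiled kernel.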

threadIdx.x and threadIdx.y are the coordinates that identify each thread within the two-dimensional block of the grid. Note that all the threads in the block execute the same kernel code, but on different pieces of data. If we compare the parallel version with the sequential one, we immediately notice that the cycle indexes, i and j, have been replaced by the threadIdx.x and threadIdx.y indexes.

This means that in the parallel version, each thread performs what was a single (i, j) iteration of the two outer cycles, so only the inner cycle over k remains. In fact, the MatrixMulKernel kernel will be executed by a block of 5 × 5 parallel threads, one per element of the result matrix.
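For comparison, here is a minimal sketch of the sequential version the text refers to, with the explicit i and j cycles that the thread indexes replace (the variable names are illustrative):

```python
import numpy as np

MATRIX_SIZE = 5
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
c_seq = np.zeros((MATRIX_SIZE, MATRIX_SIZE), np.float32)

# Sequential version: the outer i and j cycles enumerate the elements of
# the result matrix. In the kernel, they become threadIdx.y and
# threadIdx.x respectively, and each thread keeps only the inner k cycle.
for i in range(MATRIX_SIZE):
    for j in range(MATRIX_SIZE):
        Pvalue = 0.0
        for k in range(MATRIX_SIZE):
            Pvalue += a_cpu[i, k] * b_cpu[k, j]
        c_seq[i, j] = Pvalue

print(np.allclose(c_seq, np.dot(a_cpu, b_cpu)))
```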

This thread organization is shown in the following diagram:

Grid and block of thread organization for the example
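The indexing scheme in the diagram can be checked with a few lines of plain Python (a sketch; tx and ty stand for threadIdx.x and threadIdx.y):

```python
MATRIX_SIZE = 5

# Each thread (tx, ty) in the 5 x 5 block owns exactly one element of the
# row-major result matrix, at flat offset ty * MATRIX_SIZE + tx.
offsets = [ty * MATRIX_SIZE + tx
           for ty in range(MATRIX_SIZE)
           for tx in range(MATRIX_SIZE)]

# The 25 offsets cover every element of the 5 x 5 matrix exactly once.
print(sorted(offsets) == list(range(MATRIX_SIZE * MATRIX_SIZE)))
```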

Then, we verify the product computation by copying c_gpu back to the host with the get() method and comparing the two resulting matrices:

np.allclose(c_cpu, c_gpu.get())

The output is as follows:

C:\>python memManagementPycuda.py

---------------------------------------------------------------------
Matrix A (GPU):
[[ 0.90780383 -0.4782407 0.23222363 -0.63184392 1.05509627]
[-1.27266967 -1.02834761 -0.15528528 -0.09468858 1.037099 ]
[-0.18135822 -0.69884419 0.29881889 -1.15969539 1.21021318]
[ 0.20939326 -0.27155793 -0.57454145 0.1466181 1.84723163]
[ 1.33780348 -0.42343542 -0.50257754 -0.73388749 -1.883829 ]]
---------------------------------------------------------------------
Matrix B (GPU):
[[ 0.04523897 0.99969769 -1.04473436 1.28909719 1.10332143]
[-0.08900332 -1.3893919 0.06948703 -0.25977209 -0.49602833]
[-0.6463753 -1.4424541 -0.81715286 0.67685211 -0.94934392]
[ 0.4485206 -0.77086055 -0.16582981 0.08478995 1.26223004]
[-0.79841441 -0.16199949 -0.35969591 -0.46809086 0.20455229]]
---------------------------------------------------------------------
Matrix C (GPU):
[[-1.19226956 1.55315971 -1.44614291 0.90420711 0.43665022]
[-0.73617989 0.28546685 1.02769876 -1.97204924 -0.65403283]
[-1.62555301 1.05654192 -0.34626681 -0.51481217 -1.35338223]
[-1.0040834 1.00310731 -0.4568972 -0.90064859 1.47408712]
[ 1.59797418 3.52156591 -0.21708387 2.31396151 0.85150564]]
---------------------------------------------------------------------

True