Matrix multiplication optimization

Although we have used matrix multiplication code in many examples, we haven't investigated whether the operation was optimized. Now, let's review how it works and where we can find an opportunity for optimization.

Matrix multiplication is a collection of dot product operations between the rows of one matrix and the columns of another. The simplest way to parallelize it is to have each CUDA thread compute the dot product for one output element. However, this approach is inefficient in terms of memory usage because the data that's loaded from global memory isn't reused. To confirm this hypothesis, let's measure the performance limiter. The following chart shows the GPU utilization of a Tesla V100 card, as reported by Nsight Compute:

Based on our performance limiter analysis, this utilization ratio can be categorized as memory bound. Therefore, we should review the memory utilization to identify the bottleneck. The following screenshot shows the Memory Workload Analysis section:

From this analysis, we can see that the L2 cache hit rate is low and that the max bandwidth utilization is low. We can presume that this is because the original matrix multiplication operation does not reuse loaded data, as we mentioned earlier. This can be resolved by using shared memory, that is, by reusing the loaded data and reducing global memory traffic. Now, let's review matrix multiplication and see how we can optimize it to use shared memory, which has only a small memory space.
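Before introducing the tiled version, a minimal sketch of the naive kernel we have been profiling makes the missing reuse explicit. The kernel name, the alpha/beta scaling, and the row-major M x K / K x N layout here are illustrative assumptions, not the exact listing used earlier:

// Naive SGEMM sketch: each thread computes one element of
// C = alpha * A * B + beta * C and reloads every operand from
// global memory, so nothing it loads is shared with other threads.
__global__ void sgemm_naive(const float *A, const float *B, float *C,
                            int M, int N, int K, float alpha, float beta)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    if (row >= M || col >= N)
        return;

    float sum = 0.f;
    for (int i = 0; i < K; ++i)
        sum += A[row * K + i] * B[i * N + col];  // one global load per operand

    C[row * N + col] = alpha * sum + beta * C[row * N + col];
}

Every element of A and B is fetched from global memory K times across the grid, which is why the profile shows the kernel starving on memory bandwidth.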

Matrix multiplication can be decomposed into a group of small matrix multiplications whose partial results are accumulated into the output. The small matrices are called tiles, and each tile maps to a portion of the input matrices corresponding to a region of the output matrix. Each tile computes its own output in parallel. This operation can be implemented in the following steps:

  1. Determine the tile size for the two input matrices and the output matrix.
  2. Traverse the input tiles along their directions (the tile of matrix A moves to the right, and the tile of matrix B moves down).
  3. Compute the matrix multiplication within the current tiles.
  4. Repeat from the second step until the tiles reach the end of the input matrices.
  5. Flush the output.

The following diagram shows the concept of tiled matrix multiplication:

In the preceding diagram, we compute a matrix multiplication, C = AB. We compute a smaller matrix multiplication for each tile, shown in green, taken from matrix A and matrix B. Then, we traverse the input tile positions in their respective directions. Each result is accumulated onto the previous output to produce the final matrix multiplication output.

This operation provides an optimization opportunity because we can break down the large matrix operation into small problems that fit in a small memory space. In CUDA programming, we place the small matrices in shared memory and reduce global memory accesses. In our implementation, we will match each tile to a CUDA thread block. The tile's position will be determined by its block index, and each thread's position within the tile by the tid_* variables.
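The following sketch puts these pieces together. The kernel name, the BLOCK_DIM value, and the assumption that M, N, and K are multiples of the tile size are illustrative choices, not the book's final listing:

#define BLOCK_DIM 16  // tile width, assumed equal to the thread block width

// Tiled SGEMM sketch: each thread block loads one tile of A and one tile
// of B into shared memory, every thread accumulates its partial dot
// product from those shared tiles, and the block then slides to the next
// pair of tiles (A moving right, B moving down).
__global__ void sgemm_shared(const float *A, const float *B, float *C,
                             int M, int N, int K, float alpha, float beta)
{
    int tid_x = threadIdx.x;
    int tid_y = threadIdx.y;
    int col = blockIdx.x * blockDim.x + tid_x;  // column of C for this thread
    int row = blockIdx.y * blockDim.y + tid_y;  // row of C for this thread

    __shared__ float s_tile_A[BLOCK_DIM][BLOCK_DIM];
    __shared__ float s_tile_B[BLOCK_DIM][BLOCK_DIM];

    float sum = 0.f;

    // Steps 2 and 4: traverse the input tiles until they reach the end
    for (int step = 0; step < K; step += BLOCK_DIM) {
        // Load the current tiles of A and B into shared memory
        s_tile_A[tid_y][tid_x] = A[row * K + step + tid_x];
        s_tile_B[tid_y][tid_x] = B[(step + tid_y) * N + col];
        __syncthreads();  // the whole tile must be loaded before it is used

        // Step 3: compute the matrix multiplication within the tiles
        for (int e = 0; e < BLOCK_DIM; ++e)
            sum += s_tile_A[tid_y][e] * s_tile_B[e][tid_x];
        __syncthreads();  // finish using the tiles before they are overwritten
    }

    // Step 5: flush the accumulated output to global memory
    C[row * N + col] = alpha * sum + beta * C[row * N + col];
}

Launched with a dim3(BLOCK_DIM, BLOCK_DIM) thread block and a grid that covers the output matrix, each block now reads every element of its input tiles from global memory only once and reuses it BLOCK_DIM times from shared memory, which is exactly the data reuse that should raise the L2 hit rate we saw in the profile.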
