Using the PyCUDA stream class

We will start with a simple PyCUDA program; all this will do is generate a series of random GPU arrays, process each array with a simple kernel, and copy the arrays back to the host. We will then modify this to use streams. Keep in mind this program will have no point at all, beyond illustrating how to use streams and some basic performance gains you can get. (This program can be seen in the multi-kernel.py file, under the 5 directory in the GitHub repository.)

Of course, we'll start by importing the appropriate Python modules, as well as the time function:

import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.compiler import SourceModule
import numpy as np
from time import time

We now will specify how many arrays we wish to process—here, each array will be processed by a different kernel launch. We also specify the length of the random arrays we will generate, as follows:

num_arrays = 200
array_len = 1024**2
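
As a rough sizing note (the print line below is an addition, not part of the original program), each array holds 1024**2 single-precision floats, or about 4 MB, so all 200 arrays together occupy roughly 800 MB of GPU memory; we can print this out to double-check that the device has room:

print('Total data size: %d MB' % (num_arrays * array_len * 4 // 1024**2))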

We now write a kernel that operates on each array; all it does is iterate over every point in the array and multiply and divide it by 2, 50 times over, ultimately leaving the array unchanged. We want to restrict the number of threads that each kernel launch uses, since using fewer threads per launch will help us gain concurrency among the many kernel launches on the GPU; each thread will therefore iterate over a different part of the array with a for loop. (Again, remember that this kernel function is completely useless for anything other than learning about streams and synchronization!) If each kernel launch used too many threads, it would be harder to gain concurrency later:

ker = SourceModule("""
__global__ void mult_ker(float * array, int array_len)
{
    int thd = blockIdx.x*blockDim.x + threadIdx.x;
    int num_iters = array_len / blockDim.x;

    for(int j=0; j < num_iters; j++)
    {
        int i = j * blockDim.x + thd;

        for(int k = 0; k < 50; k++)
        {
            array[i] *= 2.0;
            array[i] /= 2.0;
        }
    }
}
""")

mult_ker = ker.get_function('mult_ker')
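
As a quick sanity check (this snippet is an addition, not part of the original listing), we can launch the kernel once over a small test array and confirm that all of that multiplying and dividing really does leave the data unchanged:

# small test array whose length is divisible by the 64 threads we will launch
test = np.random.randn(512).astype('float32')
test_gpu = gpuarray.to_gpu(test)
mult_ker(test_gpu, np.int32(512), block=(64,1,1), grid=(1,1,1))
# multiplying and dividing by 2 is exact in floating point, so this should hold
assert np.allclose(test_gpu.get(), test)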

Now, we will generate some random data arrays, copy these arrays to the GPU, iteratively launch our kernel over each array across 64 threads, copy the output data back to the host, and assert that the outputs match the original inputs with NumPy's allclose function. We will time the duration of all operations from start to finish by using Python's time function, as follows:

data = []
data_gpu = []
gpu_out = []

# generate random arrays.
for _ in range(num_arrays):
    data.append(np.random.randn(array_len).astype('float32'))

t_start = time()

# copy arrays to GPU.
for k in range(num_arrays):
    data_gpu.append(gpuarray.to_gpu(data[k]))

# process arrays.
for k in range(num_arrays):
    mult_ker(data_gpu[k], np.int32(array_len), block=(64,1,1), grid=(1,1,1))

# copy arrays from GPU.
for k in range(num_arrays):
    gpu_out.append(data_gpu[k].get())

t_end = time()

for k in range(num_arrays):
    assert (np.allclose(gpu_out[k], data[k]))

print('Total time: %f' % (t_end - t_start))

We are now prepared to run this program:

So, it took almost three seconds for this program to complete. We will make a few simple modifications so that our program can use streams, and then see if we can get any performance gains (this can be seen in the multi-kernel_streams.py file in the repository).

First, we note that for each kernel launch we have a separate array of data that it processes, and these are stored in Python lists. We will have to create a separate stream object for each individual array/kernel launch pair, so let's first add an empty list, called streams, that will hold our stream objects:

data = []
data_gpu = []
gpu_out = []
streams = []

We can now generate a series of streams that we will use to organize the kernel launches. We can get a stream object from the pycuda.driver submodule with the Stream class. Since we've imported this submodule and aliased it as drv, we can fill up our list with new stream objects, as follows:

for _ in range(num_arrays):
    streams.append(drv.Stream())
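
As a side note (not in the original listing), each of these Stream objects can be queried or waited on individually; is_done reports, without blocking, whether all of the work queued on the stream so far has finished, while synchronize blocks the host until it has:

print(streams[0].is_done())  # True here, since nothing has been queued on this stream yet
streams[0].synchronize()     # blocks the host until all work queued on stream 0 completes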

Now, we will first have to modify the memory operations that transfer data to the GPU. Consider the following steps:

  1. Look for the first loop that copies the arrays to the GPU with the gpuarray.to_gpu function. We will want to switch to the asynchronous and stream-friendly version of this function, gpuarray.to_gpu_async, instead. (We must now also specify which stream each memory operation should use with the stream parameter):

for k in range(num_arrays):
    data_gpu.append(gpuarray.to_gpu_async(data[k], stream=streams[k]))

  2. We can now launch our kernels. This is exactly as before, only we must specify what stream to use by using the stream parameter:

for k in range(num_arrays):
    mult_ker(data_gpu[k], np.int32(array_len), block=(64,1,1), grid=(1,1,1), stream=streams[k])

  3. Finally, we need to pull our data off the GPU. We can do this by switching the gpuarray get function to get_async, and again using the stream parameter, as follows (a brief note on host synchronization follows these steps):

for k in range(num_arrays):
    gpu_out.append(data_gpu[k].get_async(stream=streams[k]))
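
One caveat worth noting (this is an addition, not something the original program does): since these copies and kernel launches are now issued asynchronously, the host thread can race ahead of the streams. If you want t_end (and the later allclose check) to reflect fully completed work, you can synchronize every stream just before the existing t_end = time() line:

# block the host until every stream has finished all of the work queued on it
for k in range(num_arrays):
    streams[k].synchronize()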

We are now ready to run our stream-friendly modified program:

In this case, we have roughly a threefold performance gain, which is not bad at all considering the small number of modifications we had to make. But before we move on, let's try to get a deeper understanding of why this works.

Let's consider the case of two CUDA kernel launches. We will also perform GPU memory operations corresponding to each kernel before and after we launch it, for a total of six operations. We can visualize the operations happening on the GPU with respect to time with a graph: moving to the right on the x-axis corresponds to time duration, while the y-axis corresponds to the operations being executed on the GPU at a particular time. This is depicted in the following diagram:

It's not too hard to see why streams yield such a performance increase: since operations queued in a given stream are blocked only until all of the necessary prior operations in that same stream are completed, operations issued on distinct streams are free to run concurrently, letting us make fuller use of our device. This can be seen in the large overlap of concurrent operations. We can visualize the stream-based concurrency over time as follows:
