Events and streams

We will now see how to use event objects with streams. This gives us a fine-grained level of control over the flow of our various GPU operations: we can check exactly how far each individual stream has progressed via an event's query function, and we can even synchronize particular streams with the host while ignoring the other streams.

First, though, we have to be aware of one thing: each stream needs its own dedicated collection of event objects; multiple streams cannot share an event object. Let's see what this means exactly by modifying the prior example, multi_kernel_streams.py. After the kernel definition, let's add two additional empty lists, start_events and end_events. We will fill these lists with event objects, one pair corresponding to each stream that we have. This will allow us to time one GPU operation in each stream, since timing each GPU operation requires two events: one recorded at the start and one at the end:

data = []
data_gpu = []
gpu_out = []
streams = []
start_events = []
end_events = []

for _ in range(num_arrays):
    streams.append(drv.Stream())
    start_events.append(drv.Event())
    end_events.append(drv.Event())

Now, we can time each kernel launch individually by modifying the second loop so that it records an event at the beginning and at the end of each launch. Notice that, since there are multiple streams, we have to pass the appropriate stream as a parameter to each event object's record function. Also notice that we can capture the end events in a second loop; this still lets us capture each kernel's execution duration accurately, without delaying the launch of the subsequent kernels. Now consider the following code:

for k in range(num_arrays):
    start_events[k].record(streams[k])
    mult_ker(data_gpu[k], np.int32(array_len), block=(64,1,1), grid=(1,1,1), stream=streams[k])

for k in range(num_arrays):
    end_events[k].record(streams[k])

Now we're going to extract the duration of each individual kernel launch. Let's add a new empty list after the iterative assert check, and fill it with the durations by way of the time_till function:

kernel_times = []
for k in range(num_arrays):
    kernel_times.append(start_events[k].time_till(end_events[k]))

Let's now add two print statements at the very end to tell us the mean and standard deviation of the kernel execution times:

print 'Mean kernel duration (milliseconds): %f' % np.mean(kernel_times)
print 'Standard deviation of kernel duration (milliseconds): %f' % np.std(kernel_times)

We can now run this. (This example is also available as multi-kernel_events.py in the repository.)

We see that there is a relatively low standard deviation in kernel duration, which is good, considering that each kernel processes the same amount of data over the same block and grid size. If there were a high degree of deviation, it would mean that we were making highly uneven use of the GPU in our kernel executions, and we would have to re-tune the parameters to gain a greater level of concurrency.
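To make this interpretation concrete, here is a small self-contained NumPy sketch. The durations below are made-up illustrative values standing in for what start_events[k].time_till(end_events[k]) would return; they are not real measurements:

```python
import numpy as np

# Hypothetical kernel durations in milliseconds (illustrative values only,
# standing in for the measured time_till results).
kernel_times = [1.20, 1.22, 1.19, 1.21, 1.23, 1.18, 1.20, 1.22]

mean_ms = np.mean(kernel_times)  # average kernel duration
std_ms = np.std(kernel_times)    # spread of the kernel durations

# The ratio of standard deviation to mean gives a scale-free measure of
# how evenly the kernels shared the GPU; a small value is what we want.
relative_spread = std_ms / mean_ms
```

For these illustrative values, relative_spread comes out at roughly one percent, which is the kind of result we expect when every kernel does the same amount of work over the same block and grid size.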
