Concurrent Conway's Game of Life using CUDA streams

We will now see a more interesting application: we will modify the LIFE (Conway's Game of Life) simulation from the last chapter so that four independent windows of animation are displayed concurrently. (If you haven't looked at that example yet, it is suggested you do so before continuing.)

Let's get a copy of the old LIFE simulation from the last chapter in the repository, which should be under conway_gpu.py in the 4 directory. We will now modify this into our new CUDA stream-based concurrent LIFE simulation. (The finished streams-based simulation we are about to build is also available as conway_gpu_streams.py in this chapter's directory, 5.)

Go to the main function at the end of the file. We will set a new variable, num_concurrent, that indicates how many concurrent animations we will display at once (as before, N indicates the height/width of the simulation lattice). We will set it to 4 here, but feel free to try other values:

if __name__ == '__main__':

    N = 128
    num_concurrent = 4

We will now need a collection of num_concurrent stream objects, and we will also need to allocate a collection of input and output lattices on the GPU. We'll of course just store these in lists and initialize the lattices as before. We will set up some empty lists and fill each with the appropriate objects over a loop, as follows (notice how we set up a new initial-state lattice on each iteration, send it to the GPU, and append it to lattices_gpu):

streams = []
lattices_gpu = []
newLattices_gpu = []

for k in range(num_concurrent):
    streams.append(drv.Stream())
    lattice = np.int32( np.random.choice([1,0], N*N, p=[0.25, 0.75]).reshape(N, N) )
    lattices_gpu.append(gpuarray.to_gpu(lattice))
    newLattices_gpu.append(gpuarray.empty_like(lattices_gpu[k]))
Since we're only running this loop once during the startup of our program, and virtually all of the computational work will happen in the animation loop, we don't really have to worry about actually making use of the streams we've just generated here.
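(As a side note, if you did want these setup copies issued on their respective streams, PyCUDA also provides to_gpu_async. The following is just a sketch of a variant of the loop body above, not part of our program; keep in mind the copy is only truly asynchronous when the host array sits in page-locked memory:)

# optional variant of the loop body above: enqueue the initial
# host-to-device copy on this lattice's stream rather than blocking
# (only truly asynchronous with page-locked host memory)
lattices_gpu.append(gpuarray.to_gpu_async(lattice, stream=streams[k]))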

We will now set up the environment with Matplotlib's subplots function; notice how we can set up multiple animation plots by setting the ncols parameter. We will keep another list, imgs, holding the image objects that the animation updates will act on. Notice how we can now populate it with get_async and the appropriate corresponding stream:

fig, ax = plt.subplots(nrows=1, ncols=num_concurrent)
imgs = []

for k in range(num_concurrent):
    imgs.append( ax[k].imshow(lattices_gpu[k].get_async(stream=streams[k]), interpolation='nearest') )
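One caveat if you experiment with other values of num_concurrent: when it is set to 1, plt.subplots returns a single Axes object rather than an array, so the ax[k] indexing above will fail. A small defensive sketch (our own addition, not in the original listing):

fig, ax = plt.subplots(nrows=1, ncols=num_concurrent)
ax = np.atleast_1d(ax)   # makes ax indexable even when num_concurrent == 1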

The last thing to change in the main function is the penultimate line, which starts with ani = animation.FuncAnimation. Let's modify the arguments to the update_gpu function to reflect the new lists we are using, and add two more arguments: one to pass our streams list, and a parameter indicating how many concurrent animations there are:

ani = animation.FuncAnimation(fig, update_gpu, fargs=(imgs, newLattices_gpu, lattices_gpu, N, streams, num_concurrent), interval=0, frames=1000, save_count=1000)
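(As in the last chapter, the final line of main should remain unchanged; it simply hands control over to Matplotlib and starts the animations:)

plt.show()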

We now duly make the required modifications to the update_gpu function to take these extra parameters. Scroll up a bit in the file and modify the function signature as follows:

def update_gpu(frameNum, imgs, newLattices_gpu, lattices_gpu, N, streams, num_concurrent):

We now need to modify this function to iterate num_concurrent times, setting each element of imgs as before, and finally returning the whole imgs list:

for k in range(num_concurrent):
    conway_ker( newLattices_gpu[k], lattices_gpu[k], grid=(N//32, N//32, 1), block=(32,32,1), stream=streams[k] )
    imgs[k].set_data(newLattices_gpu[k].get_async(stream=streams[k]))
    lattices_gpu[k].set_async(newLattices_gpu[k], stream=streams[k])

return imgs

Notice the changes we made: each kernel is launched in the appropriate stream, while get has been switched to get_async, synchronized on that same stream. (Note also the integer division N//32 in the grid dimensions, since grid sizes must be integers.)
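To make the distinction concrete, here is a minimal sketch of the two transfer styles (illustrative only; with an ordinary pageable NumPy array on the host, get_async still completes the copy before returning, so the benefit lies chiefly in how the copy is ordered relative to the kernel launched on the same stream):

data = lattices_gpu[k].get()                          # synchronous copy; blocks the host
data = lattices_gpu[k].get_async(stream=streams[k])   # enqueued on streams[k], ordered
                                                      # after the kernel launched there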

Finally, the last line in the loop copies GPU data from one device array to another without any re-allocation. Before, we could use the shorthand slicing operator [:] to copy the elements between the arrays directly, without re-allocating any memory on the GPU; in that case, the slicing notation acts as an alias for the PyCUDA set function for GPU arrays. (set, of course, is the function that copies one GPU array to another of the same size, without any re-allocation.) Luckily, there is a stream-synchronized version of this function, set_async, but to use it we have to call the function explicitly, specifying the array to copy and the stream to use.
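In other words, all three of the following lines copy one GPU array into another of identical shape and dtype without re-allocation; only the last one is enqueued on a stream (a sketch using the illustrative names a_gpu, b_gpu, and stream):

a_gpu[:] = b_gpu                        # slicing shorthand; an alias for set
a_gpu.set(b_gpu)                        # explicit device-to-device copy
a_gpu.set_async(b_gpu, stream=stream)   # the same copy, enqueued on the given stream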

We're now finished and ready to run this. Go to a Terminal and enter python conway_gpu_streams.py at the command line to enjoy the show: you should see four LIFE lattices animating side by side.
