Conway's Game of Life

The Game of Life (often called LIFE for short) is a cellular automaton simulation invented by the British mathematician John Conway in 1970. This sounds complex, but it's really quite simple: LIFE is a zero-player game that consists of a two-dimensional binary lattice of cells, each of which is considered either live or dead. The lattice is iteratively updated by the following set of rules:

  • Any live cell with fewer than two live neighbors dies
  • Any live cell with two or three live neighbors lives
  • Any live cell with more than three live neighbors dies
  • Any dead cell with exactly three live neighbors comes to life

These four simple rules give rise to a complex simulation with interesting mathematical properties that is also aesthetically quite pleasing to watch when animated. However, with a large number of cells in the lattice, it can run quite slowly, and it usually results in choppy animation when programmed in pure serial Python. Fortunately, the problem is highly parallelizable: each cell in the lattice can be managed by a single CUDA thread.
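Before turning to CUDA, it may help to see the rules expressed in a few lines of serial NumPy (a reference sketch only; the function name life_step is our own, not part of the chapter's code):

import numpy as np

def life_step(lattice):
    # count each cell's live neighbors by summing the eight shifted copies
    # of the lattice; np.roll wraps around the edges, matching the
    # wrap-around behavior we'll build into the CUDA kernel
    n = sum(np.roll(np.roll(lattice, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    # survival with 2 or 3 live neighbors, birth with exactly 3
    return ((n == 3) | ((lattice == 1) & (n == 2))).astype(np.int32)

Each CUDA thread will perform this same per-cell update in parallel.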

We'll now implement LIFE as a CUDA kernel and animate it using the matplotlib.animation module. This will be of particular interest to us because we'll be able to apply our new knowledge of blocks and grids here.

We'll start by including the appropriate modules as follows:

import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.compiler import SourceModule
import numpy as np
import matplotlib.pyplot as plt 
import matplotlib.animation as animation

Now, let's dive into writing our kernel via SourceModule. We're going to start by using the C language's #define directive to set up some constants and macros that we'll use throughout our kernel. Let's look at the first two we'll set up, _X and _Y:

ker = SourceModule("""
#define _X  ( threadIdx.x + blockIdx.x * blockDim.x )
#define _Y  ( threadIdx.y + blockIdx.y * blockDim.y )

Let's first remember how #define works here: it will literally replace any occurrence of _X or _Y with the defined values (in the parentheses here) at compilation time; that is, it creates macros for us. For example, the line int x = _X; will be compiled as if we had written int x = ( threadIdx.x + blockIdx.x * blockDim.x );. (As a matter of personal style, I usually precede all of my C macros with an underscore.)

In C and C++, #define is used for creating macros. This means that #define doesn't create any function or set up a proper constant variable; it just allows us to write things in shorthand in our code by swapping text out right before compilation time.

Now, let's talk about what _X and _Y mean specifically: these will be the Cartesian x and y values of a single CUDA thread's cell on the two-dimensional lattice we are using for LIFE. We'll launch the kernel over a two-dimensional grid consisting of two-dimensional blocks, which together will correspond to the entire cell lattice. We'll have to use both the thread and block constants to find the Cartesian point on the lattice, since a thread's threadIdx.x and threadIdx.y values only locate it within its own block.

At this point, you may be wondering why we don't just launch our kernel over a single block, so we could set _X as threadIdx.x and _Y as threadIdx.y and be done with it. This is due to a limitation on block size imposed on us by CUDA: currently, only blocks consisting of at most 1,024 threads are supported. This means that our cell lattice could be of dimensions at most 32 x 32, which would make for a rather boring simulation that might be better done on a CPU, so we'll have to launch multiple blocks over a grid. (The dimensions of the current block are given by blockDim.x and blockDim.y, which will help us determine the absolute x and y coordinates, as we'll see.)
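To make this concrete, consider 32 x 32 blocks: the thread with threadIdx.x = 5 in the block with blockIdx.x = 2 covers lattice column 5 + 2 * 32 = 69. A quick Python check (our own sketch, not part of the chapter's code) confirms that this mapping covers every column exactly once:

# with 32 x 32 blocks on an 8 x 8 grid (a 256 x 256 lattice), check that
# threadIdx.x + blockIdx.x * blockDim.x hits each lattice column exactly once
N, bdim = 256, 32
coords = [tx + bx * bdim for bx in range(N // bdim) for tx in range(bdim)]
assert sorted(coords) == list(range(N))   # no gaps and no collisions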

Similarly, blockIdx.x and blockIdx.y tell us which block we are in within the two-dimensional grid, just as threadIdx.x and threadIdx.y locate a thread within its block.

After we think about the math a little, it should be clear that _X should be defined as ( threadIdx.x + blockIdx.x * blockDim.x ) and _Y should be defined as ( threadIdx.y + blockIdx.y * blockDim.y ). (The parentheses are added so as not to interfere with the order of operations when the macros are inserted into the code.) Now, let's continue defining the remaining macros:

#define _WIDTH  ( blockDim.x * gridDim.x )
#define _HEIGHT ( blockDim.y * gridDim.y  )

#define _XM(x)  ( (x + _WIDTH) % _WIDTH )
#define _YM(y)  ( (y + _HEIGHT) % _HEIGHT )

The _WIDTH and _HEIGHT macros give us the width and height of our cell lattice, respectively: the lattice spans the full grid, so each dimension is the block size multiplied by the grid size. Now let's discuss the _XM and _YM macros. In our implementation of LIFE, we'll have the endpoints "wrap around" to the other side of the lattice: for example, we'll consider an x-value of -1 to be _WIDTH - 1 and a y-value of -1 to be _HEIGHT - 1, and we'll likewise consider an x-value of _WIDTH to be 0 and a y-value of _HEIGHT to be 0. Why do we need this? When we calculate the number of living neighbors of a given cell, we might be at an edge, where some of the neighbors are external points; defining these macros to take the modulus of our points covers this for us automatically. Notice that we have to add the width or height before we apply C's modulus operator; this is because, unlike in Python, the modulus operator in C can return negative values for negative operands.
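We can sanity-check this wrap-around logic on the Python side (a throwaway sketch; xm is a hypothetical helper mirroring the _XM macro):

WIDTH = 256

def xm(x):
    return (x + WIDTH) % WIDTH    # mirrors the _XM macro

print(xm(-1), xm(WIDTH))          # prints: 255 0 (both edges wrap around)
# Python's % already yields 255 for -1 % 256; in C, however, -1 % 256
# evaluates to -1, which is why the macro adds _WIDTH before taking %.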

We now have one final macro to define. Recall that PyCUDA passes two-dimensional arrays into CUDA C as one-dimensional pointers; two-dimensional arrays are passed row-wise from Python into one-dimensional C pointers. This means that we'll have to translate a given Cartesian (x, y) point for a given cell on the lattice into a one-dimensional index within the pointer corresponding to the lattice. We can do so as follows:

#define _INDEX(x,y)  ( _XM(x)  + _YM(y) * _WIDTH )
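We can verify that this matches NumPy's row-major layout with a quick check (our own sketch, assuming NumPy is imported as np):

lat = np.arange(16).reshape(4, 4)              # a toy 4 x 4 "lattice"
x, y = 1, 2
assert lat[y, x] == lat.flatten()[x + y * 4]   # both give element 9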

Since our cell lattice is stored row-wise, we have to multiply the y-value by the width to offset to the start of the appropriate row. We can now finally begin our implementation of LIFE. Let's start with the most important part of LIFE: counting the number of living neighbors a given cell has. We'll implement this using a CUDA device function, as follows:

__device__ int nbrs(int x, int y, int * in)
{
    // sum the values of the eight cells surrounding (x, y); since live
    // cells are 1 and dead cells are 0, this is the live-neighbor count
    return ( in[_INDEX(x-1, y+1)] + in[_INDEX(x-1, y)] + in[_INDEX(x-1, y-1)]
           + in[_INDEX(x,   y+1)]                      + in[_INDEX(x,   y-1)]
           + in[_INDEX(x+1, y+1)] + in[_INDEX(x+1, y)] + in[_INDEX(x+1, y-1)] );
}

A device function is a serial C function that is called by an individual CUDA thread from within a kernel; that is to say, this little function will be called in parallel by multiple threads from our kernel. We'll represent our cell lattice as a collection of 32-bit integers (1 will represent a living cell, and 0 will represent a dead one), so counting a cell's living neighbors amounts to summing the values of the eight cells around it.

A CUDA device function is a serial C function that is called by an individual CUDA thread from within a kernel. While these functions are serial in themselves, they can be run in parallel by multiple GPU threads. Device functions cannot by themselves be launched by a host computer onto a GPU; only kernels can.

We are now prepared to write our kernel implementation of LIFE. Actually, we've done most of the hard work already: we check the number of living neighbors of the current thread's cell, check whether the current cell is living or dead, and then use the appropriate switch-case statements to determine its status for the next iteration according to the rules of LIFE. We'll use two integer pointer arrays for this kernel: one referring to the last iteration as input (lattice), and the other referring to the iteration that we'll calculate as output (lattice_out):

__global__ void conway_ker(int * lattice_out, int * lattice)
{
    // x, y are the appropriate values for the cell covered by this thread
    int x = _X, y = _Y;

    // count the number of living neighbors around the current cell
    int n = nbrs(x, y, lattice);

    // if the current cell is alive, determine whether it lives or dies
    // in the next generation
    if ( lattice[_INDEX(x,y)] == 1 )
        switch(n)
        {
            // a live cell remains alive only if it has 2 or 3 live neighbors
            case 2:
            case 3: lattice_out[_INDEX(x,y)] = 1;
                    break;
            default: lattice_out[_INDEX(x,y)] = 0;
        }
    else if ( lattice[_INDEX(x,y)] == 0 )
        switch(n)
        {
            // a dead cell comes to life only if it has exactly 3 live neighbors
            case 3: lattice_out[_INDEX(x,y)] = 1;
                    break;
            default: lattice_out[_INDEX(x,y)] = 0;
        }
}
""")


conway_ker = ker.get_function("conway_ker")
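We remember to close off the inline CUDA C segment with the closing triple quotes and parenthesis, and we get a reference to our CUDA C kernel with get_function. Before wiring up the animation, we can give the kernel a quick smoke test on a tiny lattice (our own sanity check, not part of the chapter's listing; a 32 x 32 lattice fits in a single block, so grid=(1, 1, 1) suffices):

# a quick sanity check of the compiled kernel on a single 32 x 32 block
test_lattice = np.int32( np.random.choice([1,0], 32*32, p=[0.25, 0.75]).reshape(32, 32) )
test_gpu = gpuarray.to_gpu(test_lattice)
test_out_gpu = gpuarray.empty_like(test_gpu)
conway_ker(test_out_gpu, test_gpu, grid=(1, 1, 1), block=(32, 32, 1))
print(test_out_gpu.get())   # prints the next generation of the random lattice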

Since the kernel will only update the lattice once per launch, we'll set up a short function in Python that will cover all of the overhead of updating the lattice for the animation:

def update_gpu(frameNum, img, newLattice_gpu, lattice_gpu, N):    

The frameNum parameter is just a value required by Matplotlib's animation module for update functions, which we can ignore, while img is the representative image of our cell lattice that the module requires and will iteratively redraw.

Let's focus on the three remaining parameters: newLattice_gpu and lattice_gpu will be PyCUDA arrays that we'll keep persistent, since we want to avoid re-allocating chunks of memory on the GPU whenever we can. lattice_gpu will be the current generation of the cell lattice, corresponding to the lattice parameter in the kernel, while newLattice_gpu will be the next generation of the lattice. N indicates the height and width of the lattice (in other words, we'll be working with an N x N lattice).

We launch the kernel with the appropriate parameters and set the block and grid sizes as follows:

    conway_ker(newLattice_gpu, lattice_gpu, grid=(N//32, N//32, 1), block=(32, 32, 1))

We set the block size to 32 x 32 with (32, 32, 1); since we are only using two dimensions for our cell lattice, we can just set the z-dimension to one. Remember that blocks are limited to 1,024 threads, and 32 x 32 = 1,024, so this will work. (Keep in mind that there is nothing special about 32 x 32 here; we could use values such as 16 x 64 or 10 x 10 if we wanted to, as long as the total number of threads in the block does not exceed 1,024.) Also note the integer division (//) in the grid size: the grid dimensions must be integers, and in Python 3, N/32 would yield a float.

The number of threads in a CUDA block is limited to a maximum of 1,024.
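Rather than hard-coding this limit, we can query it at runtime through PyCUDA's device attributes (a small sketch using the device that pycuda.autoinit selected for us):

# query the block-size limits of the active device (typically 1,024
# threads per block on current NVIDIA hardware)
dev = pycuda.autoinit.device
print(dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK))
print(dev.get_attribute(drv.device_attribute.MAX_BLOCK_DIM_X))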

We now look at the grid value: here, since our block dimensions are 32, it should be clear that N (in this case) should be divisible by 32. That means that we are limited to lattices with dimensions such as 64 x 64, 96 x 96, 128 x 128, and 1024 x 1024. If we want to use lattices of a different size, we'll have to alter the dimensions of the blocks. (If this doesn't make sense, review how we defined the _WIDTH and _HEIGHT macros in our kernel: the lattice dimensions are exactly the block dimensions multiplied by the grid dimensions.)
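If we wanted to lift this divisibility requirement, a common CUDA pattern (not used in this chapter's kernel) is to round the grid size up with ceiling division and add a bounds check such as if (x < N && y < N) inside the kernel:

# round the grid up so an arbitrary N is fully covered; the kernel would
# then need a bounds check, since the last blocks may hang off the edge
bx = by = 32
grid = ((N + bx - 1) // bx, (N + by - 1) // by, 1)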

We can now set up the image data for our animation after grabbing the latest generated lattice from the GPU's memory with the get function. We finally copy the new lattice data over the current lattice data using the PyCUDA slice operator, [:], which copies into the previously allocated memory on the GPU so that we don't have to re-allocate:

    img.set_data(newLattice_gpu.get() )    
    lattice_gpu[:] = newLattice_gpu[:]
    
    return img
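The slice assignment is what makes this cheap: lattice_gpu[:] = newLattice_gpu[:] performs a device-to-device copy into the existing allocation, whereas a plain lattice_gpu = newLattice_gpu would merely rebind the Python name. We can confirm the buffer is reused by checking the device pointer (a throwaway check of our own, not part of the chapter's code):

# the device pointer is unchanged by the slice copy, so no re-allocation
ptr_before = int(lattice_gpu.gpudata)
lattice_gpu[:] = newLattice_gpu[:]
assert int(lattice_gpu.gpudata) == ptr_before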

Let's set up a lattice of size 256 x 256. We'll set an initial state for our lattice using the choice function from the numpy.random module, populating an N x N lattice of integers randomly with ones and zeros. Generally, if around 25% of the points are ones and the rest are zeros, we can generate some interesting lattice animations, so we'll go with that:

if __name__ == '__main__':
    # set lattice size
    N = 256
    
    lattice = np.int32( np.random.choice([1,0], N*N, p=[0.25, 0.75]).reshape(N, N) )

Finally, we can set up the lattices on the GPU with the appropriate gpuarray functions and set up the Matplotlib animation accordingly, as follows:

    lattice_gpu = gpuarray.to_gpu(lattice)
    newLattice_gpu = gpuarray.empty_like(lattice_gpu)

    fig, ax = plt.subplots()
    img = ax.imshow(lattice_gpu.get(), interpolation='nearest')
    ani = animation.FuncAnimation(fig, update_gpu, fargs=(img, newLattice_gpu, lattice_gpu, N), interval=0, frames=1000, save_count=1000)

    plt.show()
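If you'd rather save the animation to a video file, you can call ani.save before plt.show (a brief sketch; this requires an encoder such as ffmpeg to be installed, and the filename is our own choice):

    # optionally save the animation to disk instead of just displaying it
    ani.save('conway.mp4', fps=30)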

We can now run our program and enjoy the show (the code is also available as the conway_gpu.py file under the 4 directory in the GitHub repository).
