Using the __syncthreads() device function

In our prior example of Conway's Game of Life, our kernel only updated the lattice once each time it was launched by the host. In that case, there was no issue with synchronizing the launched threads, since every thread only had to read from the lattice's previous iteration, which was readily available.

Now let's suppose that we want to do something slightly different: we want to rewrite our kernel so that it performs a certain number of iterations on a given cell lattice without being re-launched over and over by the host. This may initially seem trivial; a naive solution would be to add an integer parameter indicating the number of iterations, wrap the body of the inline conway_ker kernel in a for loop, make some additional trivial changes, and be done with it.

However, this raises the issue of race conditions: multiple threads reading from and writing to the same memory address, with the problems that arise when the order of those accesses is unpredictable. Our old conway_ker kernel avoids this issue by using two arrays of memory, one that is strictly read from and one that is strictly written to on each iteration. Furthermore, since the kernel only performs a single iteration, we are effectively using the host to synchronize the threads.

We want to do multiple iterations of LIFE on the GPU that are fully synchronized, and we also want to use a single array of memory for the lattice. We can avoid race conditions by using a CUDA device function called __syncthreads(). This function is a block-level synchronization barrier: every thread that is executing within a block will stop when it reaches a __syncthreads() instance and wait until every other thread within the same block reaches that same invocation of __syncthreads() before any of the threads continue on to the subsequent lines of code.

 __syncthreads() can only synchronize threads within a single CUDA block, not all threads within a CUDA grid! 
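To make this concrete before we modify the LIFE kernel, here is a minimal sketch of the read/synchronize/write pattern that __syncthreads() makes safe; the shift_ker kernel and its example values are not from the text and are invented purely for illustration. Each thread reads a neighboring cell into a register, the whole block waits at the barrier, and only then does every thread overwrite the array in place:

import pycuda.autoinit
from pycuda import gpuarray
from pycuda.compiler import SourceModule
import numpy as np

shift_mod = SourceModule("""
__global__ void shift_ker(int *a)
{
    int tid = threadIdx.x;
    // read phase: every thread reads its left neighbor's value (wrapping around)
    int left = a[(tid + blockDim.x - 1) % blockDim.x];
    // barrier: wait until every thread in the block has finished reading
    __syncthreads();
    // write phase: now it is safe to overwrite the array in place
    a[tid] = left;
}
""")

shift_ker = shift_mod.get_function("shift_ker")
a_gpu = gpuarray.to_gpu(np.arange(8, dtype=np.int32))
shift_ker(a_gpu, block=(8,1,1), grid=(1,1,1))
print(a_gpu.get())    # prints [7 0 1 2 3 4 5 6]

Without the barrier, a thread might read a slot that its neighbor has already overwritten during the same step; this is exactly the kind of race condition described above.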

Let's now create our new kernel; this will be a modification of the prior LIFE kernel that performs a certain number of iterations and then stops. Since we won't present this as an animation but rather as a single static image, we'll load the appropriate Python modules at the beginning. (This code is also available in the conway_gpu_syncthreads.py file in the GitHub repository):

import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.compiler import SourceModule
import numpy as np
import matplotlib.pyplot as plt

Now, let's again set up our kernel that will compute LIFE:

ker = SourceModule("""

Of course, our CUDA C code will go here, and it will be largely the same as before; we only have to make a few changes to our kernel. We can preserve the device function, nbrs, along with the indexing macros it relies on. In our declaration, we'll use only one array to represent the cell lattice; we can do this because we'll be using proper thread synchronization. We'll also indicate the number of iterations with an integer parameter.
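As a reminder, the carried-over helpers look roughly like the following sketch, which assumes the same toroidal (wrap-around) indexing as the prior kernel; the exact definitions in your earlier code may differ slightly:

#define _X  ( threadIdx.x + blockIdx.x * blockDim.x )
#define _Y  ( threadIdx.y + blockIdx.y * blockDim.y )
#define _WIDTH  ( blockDim.x * gridDim.x )
#define _HEIGHT ( blockDim.y * gridDim.y )
#define _XM(x)  ( (x + _WIDTH) % _WIDTH )
#define _YM(y)  ( (y + _HEIGHT) % _HEIGHT )
#define _INDEX(x,y)  ( _XM(x) + _YM(y) * _WIDTH )

// count the living neighbors of the cell at (x, y)
__device__ int nbrs(int x, int y, int * in)
{
    return ( in[_INDEX(x-1, y+1)] + in[_INDEX(x-1, y)] + in[_INDEX(x-1, y-1)]
           + in[_INDEX(x,   y+1)]                      + in[_INDEX(x,   y-1)]
           + in[_INDEX(x+1, y+1)] + in[_INDEX(x+1, y)] + in[_INDEX(x+1, y-1)] );
}

With those helpers in place, we set the kernel's parameters as follows: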

__global__ void conway_ker(int * lattice, int iters)
{

We'll proceed similarly to before, only now iterating with a for loop:

    int x = _X, y = _Y;

    for (int i = 0; i < iters; i++)
    {
        int n = nbrs(x, y, lattice);
        int cell_value;

Let's recall that previously, we set the new cell value directly within the output array. Here, we'll hold the value in the cell_value variable until all of the threads in the block are synchronized. We proceed much as before, blocking execution with __syncthreads() until all of the new cell values have been determined for the current iteration, and only then writing the values into the lattice array:

        if ( lattice[_INDEX(x,y)] == 1)
            switch(n)
            {
                // if the cell is alive: it remains alive only if it has 2 or 3 neighbors.
                case 2:
                case 3: cell_value = 1;
                        break;
                default: cell_value = 0;
            }
        else if ( lattice[_INDEX(x,y)] == 0 )
            switch(n)
            {
                // a dead cell comes to life only if it has 3 neighbors that are alive.
                case 3: cell_value = 1;
                        break;
                default: cell_value = 0;
            }

        // wait until every thread in the block has finished reading the
        // current lattice and computed its new cell value
        __syncthreads();
        lattice[_INDEX(x,y)] = cell_value;
        // wait again so that all writes are visible before the next iteration's reads
        __syncthreads();
    }
}
""")

We'll now launch the kernel as before and display the output, iterating over the lattice 1,000,000 times. Note that we are using only a single block in our grid, with a size of 32 x 32, due to the limit of 1,024 threads per block; a quick way to check this limit on your own device is shown after the listing. (Again, it should be emphasized that __syncthreads() only works over all threads in a block, rather than over all threads in a grid, which is why we are limiting ourselves to a single block here):

conway_ker = ker.get_function("conway_ker")

if __name__ == '__main__':
    # set lattice size
    N = 32
    lattice = np.int32( np.random.choice([1,0], N*N, p=[0.25, 0.75]).reshape(N, N) )
    lattice_gpu = gpuarray.to_gpu(lattice)
    conway_ker(lattice_gpu, np.int32(1000000), grid=(1,1,1), block=(32,32,1))
    fig = plt.figure(1)
    plt.imshow(lattice_gpu.get())
    plt.show()
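If you'd like to confirm the per-block thread limit on your own GPU, a quick check with PyCUDA's device attribute query looks like this (a small sketch, not part of the original listing):

import pycuda.autoinit    # initializes a CUDA context for us
import pycuda.driver as drv

# query the device's hard limit on the number of threads per block
dev = drv.Device(0)
max_threads = dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)
print('Maximum threads per block: %d' % max_threads)
# a 32 x 32 block uses 32 * 32 = 1,024 threads, which exactly meets this
# limit on most modern GPUs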

When we run the LIFE program, we'll see the pattern that a random LIFE lattice converges to after one million iterations!
