Using shared memory

We can see from the prior example that the threads in a kernel can intercommunicate through arrays in the GPU's global memory. While it is possible to use global memory for most operations, we can speed things up by using shared memory. This is a type of memory meant specifically for intercommunication of threads within a single CUDA block; its advantage over global memory is that it is much faster for pure inter-thread communication. In contrast to global memory, though, shared memory cannot be accessed directly by the host: the kernel itself must first copy shared memory back into global memory.

Before we continue, let's step back for a moment and think about what this means. Consider some of the variables declared in the iterative LIFE kernel we just saw. First, look at x and y, two integers that hold the Cartesian coordinates of a particular thread's cell. Remember that we set their values with the _X and _Y macros. (Compiler optimizations notwithstanding, we store these values in variables to reduce computation, because using _X and _Y directly would recompute the x and y values every time these macros are referenced in our code):

 int x = _X, y = _Y; 

Note that, for every single thread, there will be a unique Cartesian point in the lattice corresponding to x and y. Similarly, we use a variable, n, declared with int n = nbrs(x, y, lattice);, to indicate the number of living neighbors around a particular cell. This works because, when we normally declare variables in CUDA, they are by default local to each individual thread. Note that even if we declare an array within a thread, such as int a[10];, there will be an array of size 10 that is local to each thread.

Local thread arrays (for example, a declaration of int a[10]; within the kernel) and pointers to global GPU memory (for example, a value passed as a kernel parameter of the form int * b) may look and act similarly, but are very different. For every thread in the kernel, there will be a separate a array that the other threads cannot read, yet there is a single b that will hold the same values and be equally accessible for all of the threads.
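To make the distinction concrete, here is a host-side Python analogy (not CUDA code, and only a sequential simulation): each simulated thread allocates its own local array a, while all of them receive the same shared list b, standing in for a global-memory pointer:

```python
# Host-side analogy for thread-local versus global arrays (illustrative only;
# real CUDA threads run concurrently, while this loop runs them one by one).

def run_thread(tid, b):
    a = [0] * 10      # like `int a[10];` in a kernel: a fresh array per thread
    a[0] = tid        # no other thread can see this write
    b[tid] = 2 * tid  # like writing through `int *b`: visible to all threads
    return a

b = [0] * 4  # a single array passed to every thread, as with a global pointer
local_arrays = [run_thread(t, b) for t in range(4)]
print(b)                   # prints "[0, 2, 4, 6]"
print(local_arrays[3][0])  # prints "3": each thread kept its own copy of a
```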

We are now prepared to use shared memory. Shared memory allows us to declare variables and arrays that are shared among the threads within a single CUDA block. This memory is much faster than the global memory pointers we have been using until now, and it also reduces the overhead of allocating memory through pointers.

Let's say we want a shared integer array of size 10. We declare it as follows: __shared__ int a[10];. Note that we don't have to limit ourselves to arrays; we can make shared singleton variables like so: __shared__ int x;.

Let's rewrite a few lines of the iterative version of LIFE that we saw in the last sub-section to make use of shared memory. First, let's rename the input pointer to p_lattice, so that we can use the name lattice for our shared array and lazily preserve all of the references to lattice in our code. Since we'll be sticking with a 32 x 32 cell lattice here, we set up the new shared lattice array as follows:

__global__ void conway_ker_shared(int * p_lattice, int iters)
{
int x = _X, y = _Y;
__shared__ int lattice[32*32];

We'll now have to copy all values from the global memory p_lattice array into lattice. We'll index our shared array in exactly the same way, so we can just use our old _INDEX macro here. Note that we make sure to put __syncthreads() after the copy, to ensure that all of the memory accesses to lattice have completed before we proceed with the LIFE algorithm:

 lattice[_INDEX(x,y)] = p_lattice[_INDEX(x,y)];
__syncthreads();

The rest of the kernel is exactly as before, except that at the end we have to copy from the shared lattice back into the global memory array. We do so as follows, and then close off the inline code:

 __syncthreads();
p_lattice[_INDEX(x,y)] = lattice[_INDEX(x,y)];
__syncthreads();
} """)

We can now run this as before, with the same exact test code. (This example can be seen in conway_gpu_syncthreads_shared.py in the GitHub repository.)
