Thread-safe atomic operations

We will now learn about atomic operations in CUDA. Atomic operations are very simple, thread-safe operations that read, modify, and write a single global array element or shared memory variable in one uninterruptible step; performing the same update with ordinary reads and writes would otherwise lead to race conditions.

Let's think of one example. Suppose we have a kernel in which every thread sets a local variable called x at some point. We then want to find the maximum value of x across all threads, and store it in a shared variable that we declare with __shared__ int x_largest. We can do this by simply calling atomicMax(&x_largest, x) from every thread.
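A minimal sketch of this pattern follows; it is not from the text, and the max_ker kernel and the data and results names are hypothetical. It assumes data holds one value per thread; each block reduces its threads' values into the shared variable and writes one result per block:

#include <limits.h>

__global__ void max_ker(int *data, int *results)
{

// one shared variable per block, holding that block's running maximum
__shared__ int x_largest;

int tid = blockIdx.x*blockDim.x + threadIdx.x;

// each thread's local value
int x = data[tid];

// a single thread initializes the shared variable before any thread uses it
if (threadIdx.x == 0)
    x_largest = INT_MIN;

__syncthreads();

// every thread atomically folds its value into the running maximum
atomicMax(&x_largest, x);

__syncthreads();

// a single thread writes this block's maximum back to global memory
if (threadIdx.x == 0)
    results[blockIdx.x] = x_largest;

}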

Let's look at a brief example of atomic operations. We will write a small program that performs two experiments:

  • Setting a variable to 0 and then adding 1 to it from every thread
  • Finding the maximum thread ID value across all threads

Let's start out by setting the tid integer to the global thread ID as usual, and then setting the global add_out variable to 0. In the past, we would have done this by having a single thread alter the variable within an if statement, but now we can call atomicExch(add_out, 0) from every thread. For contrast, the old single-thread approach would have looked something like this inside the kernel (a sketch, not taken from the text's code):
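if (tid == 0)
    *add_out = 0;

With atomicExch, every thread can safely perform the reset instead. Let's do the imports and write our kernel up to this point: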

from __future__ import division
import numpy as np
from pycuda.compiler import SourceModule
import pycuda.autoinit
from pycuda import gpuarray
import pycuda.driver as drv

AtomicCode='''
__global__ void atomic_ker(int *add_out, int *max_out)
{

int tid = blockIdx.x*blockDim.x + threadIdx.x;

atomicExch(add_out, 0);

It should be noted that while atomics are indeed thread-safe, they by no means guarantee that all threads will access them at the same time; they may be executed at different times by different threads. This is problematic here, since we will be modifying add_out in the next step: add_out might be reset by a lagging thread after it has already been partially incremented by other threads. Let's do a block-wide synchronization to guard against this:

 __syncthreads();

We can now use atomicAdd to add 1 to add_out for each thread, which will give us the total number of threads:

 atomicAdd(add_out, 1);

Now, let's find the maximum value of tid across all threads by using atomicMax, and then close off our CUDA kernel:

 atomicMax(max_out, tid);

}
'''

We will now add the test code; let's try launching this over one block of 100 threads. We only need two output values here, so we will allocate two gpuarray objects of size 1. We will then print the output:

atomic_mod = SourceModule(AtomicCode)
atomic_ker = atomic_mod.get_function('atomic_ker')

add_out = gpuarray.empty((1,), dtype=np.int32)
max_out = gpuarray.empty((1,), dtype=np.int32)

atomic_ker(add_out, max_out, grid=(1,1,1), block=(100,1,1))

print 'Atomic operations test:'
print 'add_out: %s' % add_out.get()[0]
print 'max_out: %s' % max_out.get()[0]

Now we are prepared to run this:
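Since each of the 100 threads increments add_out exactly once, and the thread IDs run from 0 to 99, we should see the following output:

Atomic operations test:
add_out: 100
max_out: 99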

This example is also available as the atomic.py file in this book's GitHub repository.