Warp shuffling

We will now look at what is known as warp shuffling. This is a feature in CUDA that allows threads that exist concurrently within the same CUDA Warp to communicate by directly reading and writing to each other's registers (that is, their local variables), without the use of shared variables or global device memory. Warp shuffling is actually much faster and easier to use than the other two options. This almost sounds too good to be true, so there must be a catch—indeed, the catch is that this only works between threads that exist on the same CUDA Warp, which limits shuffling operations to groups of threads of size 32 or less. Another catch is that we can only use datatypes that are 32 bits or smaller. This means that we can't shuffle 64-bit long long integers or double floating point values across a Warp.

Only 32-bit (or smaller) datatypes can be used with CUDA Warp shuffling! This means that while we can use integers, floats, and chars, we cannot use doubles or long long integers!

Let's briefly review CUDA Warps before we move on to any coding. (You might wish to review the section entitled The warp lockstep property in Chapter 6, Debugging and Profiling Your CUDA Code, before we continue.) A CUDA Warp is the minimal execution unit in CUDA, consisting of 32 threads or fewer, that runs on exactly 32 GPU cores. Just as a Grid consists of Blocks, Blocks similarly consist of one or more Warps, depending on the number of threads the Block uses – if a Block consists of 32 threads, it will use one Warp, and if it uses 96 threads, it will consist of three Warps. Even a Warp with fewer than 32 threads is scheduled as a full Warp: this means that a Block with only a single thread will still occupy 32 cores. It also implies that a Block of 33 threads will consist of two Warps and occupy 64 cores.
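
Since this Warp arithmetic comes up every time we size our Blocks, here is a small, purely illustrative host-side sketch (plain Python, with no CUDA involved; the warps_per_block helper is a name we are inventing just for this illustration) that reproduces the counts we just worked out:

# Illustrative only: how many Warps (and thus cores) a Block of a given size occupies.
def warps_per_block(num_threads, warp_size=32):
    # Blocks are carved into whole Warps, so we round up with integer arithmetic.
    return (num_threads + warp_size - 1) // warp_size

for threads in (32, 96, 1, 33):
    w = warps_per_block(threads)
    print('%2d threads -> %d Warp(s), occupying %d cores' % (threads, w, w * 32))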

To recall what we looked at in Chapter 6, Debugging and Profiling Your CUDA Code, a Warp has what is known as the Lockstep Property. This means that every thread in a Warp will iterate through every instruction, perfectly in parallel with every other thread in the Warp. That is to say, every thread in a single Warp will step through the same exact instructions simultaneously, ignoring any instructions that are not applicable to a particular thread – this is why any divergence among threads within a single Warp is to be avoided as much as possible. NVIDIA calls this execution model Single Instruction Multiple Thread, or SIMT. By now, you should understand why we have tried to use Blocks of 32 threads consistently throughout the text!

We need to learn one more term before we get going—a lane in a Warp is a unique identifier for a particular thread within the warp, which will be between 0 and 31. Sometimes, this is also called the Lane ID.
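
To make the relationship between thread indices, Warps, and Lanes concrete, here is another purely illustrative host-side sketch (plain Python, not CUDA); within a kernel, a thread's Lane ID in a one-dimensional Block is simply threadIdx.x % 32:

# Illustrative only: Warp and Lane IDs for a 1D Block of 64 threads (that is, two Warps).
for tid in (0, 1, 31, 32, 33, 63):
    warp_id = tid // 32   # which Warp within the Block this thread belongs to
    lane_id = tid % 32    # the thread's Lane within its Warp (0 through 31)
    print('thread %2d -> Warp %d, Lane %2d' % (tid, warp_id, lane_id))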

Let's start with a simple example: we will use the __shfl_xor command to swap the values of a particular variable between all even and odd numbered Lanes (threads) within our warp. This is actually very quick and easy to do, so let's write our kernel and take a look:

from __future__ import division
import numpy as np
from pycuda.compiler import SourceModule
import pycuda.autoinit
from pycuda import gpuarray


ShflCode='''
__global__ void shfl_xor_ker(int *input, int *output) {

    // Each thread reads its own element into a register variable.
    int temp = input[threadIdx.x];

    // Exchange temp with the thread whose Lane ID differs only in the lowest bit.
    // (Note: on CUDA 9.0 and later toolkits, the __shfl_xor_sync variant, which takes a
    // mask of participating threads such as 0xFFFFFFFF as its first argument, should be used.)
    temp = __shfl_xor(temp, 1, blockDim.x);

    output[threadIdx.x] = temp;
}'''

Everything here is familiar to us except __shfl_xor. This is how an individual CUDA thread sees it: this function takes the value of temp as an input from the current thread. It performs an XOR operation on the binary Lane ID of the current thread with 1, which identifies either its left neighbor (if the least significant bit of this thread's Lane is 1 in binary) or its right neighbor (if the least significant bit is 0). It then sends the current thread's temp value to that neighbor, while retrieving the neighbor's temp value, which is what __shfl_xor returns right back into temp. We then store this value in the output array, which swaps the values of our input array between each even/odd pair of Lanes.
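
Before we run the kernel, we can mock the Lane pairing on the host to see what to expect. This purely illustrative Python snippet (no CUDA involved) applies the same XOR-with-1 pairing to the integers 0 through 31:

# Illustrative only: the even/odd Lane swap that __shfl_xor with a laneMask of 1 produces.
input_array = list(range(32))
swapped = [input_array[lane ^ 1] for lane in range(32)]
print(swapped)   # [1, 0, 3, 2, 5, 4, ...] -- every adjacent even/odd pair is exchanged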

Now let's write the rest of the test code and then check the output:

shfl_mod = SourceModule(ShflCode)
shfl_ker = shfl_mod.get_function('shfl_xor_ker')

dinput = gpuarray.to_gpu(np.int32(range(32)))
doutput = gpuarray.empty_like(dinput)

shfl_ker(dinput, doutput, grid=(1,1,1), block=(32,1,1))

print('input array: %s' % dinput.get())
print('array after __shfl_xor: %s' % doutput.get())

Running the preceding code prints the input array of the integers 0 through 31, followed by the same values with every adjacent even/odd pair swapped.

Let's do one more warp-shuffling example before we move on—we will implement an operation to sum a single local variable over all of the threads in a Warp. Let's recall the Naive Parallel Sum algorithm from Chapter 4, Kernels, Threads, Blocks, and Grids, which is very fast but makes the naive assumption that we have as many processors as we do pieces of data—this is one of the few cases in life where we actually will, assuming that we're working with an array of size 32 or less. We will use the __shfl_down function to implement this in a single Warp. __shfl_down takes the thread's variable as its first parameter and works by shifting that variable between threads by the number of Lanes indicated in its second parameter, while its third parameter indicates the total width of the Warp.
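
To see why doubling the shift distance at every step sums the whole Warp, here is a purely illustrative host-side mock of the reduction (plain Python, not CUDA). Lanes that would read past the edge of the Warp just keep their own value, mirroring how __shfl_down behaves there, and only Lane 0's final value matters:

# Illustrative only: a host-side mock of the __shfl_down reduction over one Warp.
vals = list(range(32))
delta = 1
while delta < 32:
    # Lane i adds in the value held by Lane i + delta; Lanes past the edge reuse their own value.
    vals = [vals[i] + (vals[i + delta] if i + delta < 32 else vals[i]) for i in range(32)]
    delta *= 2
print(vals[0])   # 496, the same as sum(range(32)); only Lane 0's result is kept by the kernel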

Let's implement this right now. Again, if you aren't familiar with the Naive Parallel Sum or don't remember why this should work, please review Chapter 4, Kernels, Threads, Blocks, and Grids. We will implement a straight-up sum with __shfl_down, run this on an array containing the integers 0 through 31, and then compare the result against Python's own sum function to ensure correctness:

from __future__ import division
import numpy as np
from pycuda.compiler import SourceModule
import pycuda.autoinit
from pycuda import gpuarray


ShflSumCode='''
__global__ void shfl_sum_ker(int *input, int *out) {

    // Each thread reads its own element into a register variable.
    int temp = input[threadIdx.x];

    // Naive Parallel Sum over the Warp: each iteration pulls in a value from the Lane
    // i places to the right, doubling the shift distance each time.
    // (Note: on CUDA 9.0 and later toolkits, use the __shfl_down_sync variant, which takes
    // a mask of participating threads such as 0xFFFFFFFF as its first argument.)
    for (int i=1; i < 32; i *= 2)
        temp += __shfl_down(temp, i, 32);

    // After log2(32) = 5 iterations, Lane 0 holds the sum over the whole Warp.
    if (threadIdx.x == 0)
        *out = temp;
}'''

shfl_mod = SourceModule(ShflSumCode)
shfl_sum_ker = shfl_mod.get_function('shfl_sum_ker')

array_in = gpuarray.to_gpu(np.int32(range(32)))
out = gpuarray.empty((1,), dtype=np.int32)

shfl_sum_ker(array_in, out, grid=(1,1,1), block=(32,1,1))

print('Input array: %s' % array_in.get())
print('Summed value: %s' % out.get()[0])
print("Does this match with Python's sum? : %s" % (out.get()[0] == sum(array_in.get())))

Running this prints the input array, the summed value of 496, and confirmation that it matches Python's own sum.

The examples in this section are also available as the shfl_sum.py and shfl_xor.py files under the Chapter11 directory in this book's GitHub repository.