Parallel scan and reduction kernel basics

Let's look at a basic function in PyCUDA that reproduces the functionality of reduce: InclusiveScanKernel. (You can find the code in the file named simple_scankernal0.py.) Let's execute a basic example that sums a small list of numbers on the GPU:

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.scan import InclusiveScanKernel
seq = np.array([1, 2, 3, 4], dtype=np.int32)
seq_gpu = gpuarray.to_gpu(seq)
sum_gpu = InclusiveScanKernel(np.int32, "a+b")
print(sum_gpu(seq_gpu).get())
print(np.cumsum(seq))

We construct our kernel by first specifying the input/output type (here, NumPy int32) and then the string "a+b". Here, InclusiveScanKernel sets up elements named a and b on the GPU automatically, so you can think of this string input as being analogous to lambda a, b: a + b in Python. We can put any (associative) binary operation here, provided we remember to write it in C.
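
Since any associative binary operation written in C will work here, we could just as easily compute a running product instead of a running sum. A minimal sketch, reusing the seq_gpu array from above:

prod_gpu = InclusiveScanKernel(np.int32, "a*b")
print(prod_gpu(seq_gpu).get())  # should print [ 1  2  6 24]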

When we run sum_gpu, we get an array of the same size as the input array. Each element of the output represents the value of one step in the calculation (as we can see, the NumPy cumsum function gives the same output). The last element is the final output we are seeking, which corresponds to the output of reduce:
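
For the input above, both print statements should show the same cumulative sums, with 10 as the final, reduce-style result:

[ 1  3  6 10]
[ 1  3  6 10]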

Let's try something a little more challenging; let's find the maximum value in an int32 array:

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.scan import InclusiveScanKernel
seq = np.array([1, 100, -3, -10000, 4, 10000, 66, 14, 21], dtype=np.int32)
seq_gpu = gpuarray.to_gpu(seq)
max_gpu = InclusiveScanKernel(np.int32, "a > b ? a : b")
print(max_gpu(seq_gpu).get()[-1])
print(np.max(seq))

(You can find the complete code in the file named simple_scankernal1.py.)

Here, the main change we have made is replacing the a + b string with a > b ? a : b. (In Python, this would be rendered within a reduce statement as lambda a, b: max(a, b).) Here, we are using C's ternary ? : operator to yield the maximum of a and b. We finally display the last element of the resulting output array, which is the final result of the scan (and which we can always retrieve in Python with the [-1] index).
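
Since 10000 is the largest value in seq, both print statements here should output the same value:

10000
10000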

Now, let's finally look at one more PyCUDA function for generating GPU kernels: ReductionKernel. Effectively, ReductionKernel acts like an ElementwiseKernel function followed by a parallel scan kernel. What algorithm is a good candidate for implementing with a ReductionKernel? The first that tends to come to mind is the dot product from linear algebra. Let's remember that computing the dot product of two vectors consists of two steps:

  1. Multiply the vectors pointwise
  2. Sum the resulting pointwise multiples
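
In NumPy terms, these two steps amount to a one-liner (vec1 and vec2 here stand for ordinary NumPy arrays):

np.sum(vec1 * vec2)  # multiply pointwise, then accumulate the products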

These two steps are also called multiply and accumulate. Let's set up a kernel to do this computation now:

from pycuda.reduction import ReductionKernel
dot_prod = ReductionKernel(np.float32, neutral="0", reduce_expr="a+b", map_expr="vec1[i]*vec2[i]", arguments="float *vec1, float *vec2")

First, note the datatype we use for our kernel (float32). We then set up the input arguments to our CUDA C kernel with arguments (here, two float arrays representing the vectors, each designated with float *), and we set the pointwise calculation with map_expr; here, it is pointwise multiplication. As with ElementwiseKernel, this expression is indexed over i. We set up reduce_expr the same as with InclusiveScanKernel; this takes the resulting output of the element-wise operation and performs a reduce-type operation over the array. Finally, we set the neutral element with neutral. This is an element that acts as an identity for reduce_expr; here, we set neutral="0", because 0 is always the identity under addition (under multiplication, 1 is the identity). We'll see exactly why we have to set this up when we cover parallel prefix in greater depth later in this book.
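
To see how this kernel would be used, here is a minimal sketch; the test vectors and the comparison against NumPy's np.dot are illustrative assumptions rather than part of the example above:

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.reduction import ReductionKernel

dot_prod = ReductionKernel(np.float32, neutral="0", reduce_expr="a+b", map_expr="vec1[i]*vec2[i]", arguments="float *vec1, float *vec2")

# hypothetical test vectors, chosen purely for illustration
vec1 = np.random.rand(1000).astype(np.float32)
vec2 = np.random.rand(1000).astype(np.float32)
vec1_gpu = gpuarray.to_gpu(vec1)
vec2_gpu = gpuarray.to_gpu(vec2)

# calling the kernel returns a one-element GPUArray holding the result
print(dot_prod(vec1_gpu, vec2_gpu).get())
print(np.dot(vec1, vec2))  # should agree up to float32 rounding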
