Work-efficient parallel prefix — implementation 

As a capstone for this chapter, we'll write an implementation of this algorithm that can operate on arrays of arbitrary size greater than 1,024. This means the computation will span multiple blocks across the grid, so we'll have to use the host for synchronization. We'll therefore implement two separate kernels, one each for the up-sweep and down-sweep phases, which act as the parfor loops of the pseudocode, along with two Python functions that act as the outer for loops of the two phases.

Let's begin with the up-sweep kernel. Since we'll be iteratively re-launching this kernel from the host, we'll also need a parameter that indicates the current iteration, k. We'll use two arrays for the computation to avoid race conditions: x (for the current iteration) and x_old (for the prior iteration). We declare the kernel as follows:

from __future__ import division
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.compiler import SourceModule

up_ker = SourceModule("""
__global__ void up_ker(double *x, double *x_old, int k)
{

Now let's set the tid variable, which gives the current thread's index among all threads across all blocks in the grid. We use the same trick as in our original grid-level implementation of Conway's Game of Life that we saw earlier:

int tid = blockIdx.x*blockDim.x + threadIdx.x;

We'll now use C's bit-wise shift operators to generate 2^k and 2^(k+1) directly from k. We then set j to tid times _2k1; by construction, this lets us drop the pseudocode's "if j is divisible by 2^(k+1)" check, so we only launch as many threads as we need:

int _2k = 1 << k;
int _2k1 = 1 << (k+1);

int j = tid*_2k1;

We can easily generate dyadic (power-of-two) integers in CUDA C with the left bit-wise shift operator (<<). Recall that the integer 1 (that is, 2^0) is represented as 0001, 2 (2^1) is represented as 0010, 4 (2^2) is represented as 0100, and so on. We can therefore compute 2^k with the 1 << k operation.
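As a quick aside, Python's << behaves the same as C's for non-negative integers, so we can verify on the host exactly what the kernel computes (this little check is our own, not from the text):

for k in range(4):
    _2k = 1 << k         # 2**k
    _2k1 = 1 << (k+1)    # 2**(k+1)
    print(k, _2k, _2k1)  # 0 1 2, then 1 2 4, then 2 4 8, then 3 8 16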

We can now run the up-sweep phase with a single line, noting that j is indeed divisible by 2^(k+1) by its construction:


x[j + _2k1 - 1] = x_old[j + _2k - 1] + x_old[j + _2k1 - 1];
}
""")

We're done writing our kernel! Of course, this is not a full implementation of the up-sweep; we have to do the rest in Python. Let's get our kernel and begin the implementation. The code mostly speaks for itself, as it follows the pseudocode exactly. Recall that we update x_old_gpu by copying from x_gpu with [:], which preserves the memory allocation and merely copies the new data over rather than re-allocating. Also note how we set our block and grid sizes depending on how many threads we have to launch: we try to keep our block sizes as multiples of 32, which is our rule of thumb in this text (we go into the details of why we use 32 specifically in Chapter 11, Performance Optimization in CUDA). This is also why we put from __future__ import division at the beginning of our file: we use Python 3-style division in calculating our block and grid sizes.

One issue to mention: we are assuming that x has a dyadic (power-of-two) length of 32 or greater. If you wish to have this operate on arrays of other sizes, that can be fixed trivially by padding the input with zeros, as sketched below.
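Here is a minimal sketch of such padding; the helper name pad_to_pow2 is our own, not from the text, and it relies on the fact that trailing zeros leave the prefix sums of the original elements unchanged:

def pad_to_pow2(x):
    # Pad x with zeros up to the next power-of-two length (at least 32);
    # trailing zeros don't affect the prefix sums of the original entries.
    x = np.asarray(x, dtype=np.float64)
    n = max(32, int(2 ** np.ceil(np.log2(x.size))))
    return np.concatenate([x, np.zeros(n - x.size)])

With that caveat addressed, here is the full up-sweep implementation: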


up_gpu = up_ker.get_function("up_ker")

def up_sweep(x):
    # x is assumed to have a dyadic (power-of-two) length >= 32
    x = np.float64(x)
    x_gpu = gpuarray.to_gpu(x)
    x_old_gpu = x_gpu.copy()
    for k in range(int(np.log2(x.size))):
        num_threads = int(np.ceil(x.size / 2**(k+1)))
        grid_size = int(np.ceil(num_threads / 32))

        if grid_size > 1:
            block_size = 32
        else:
            block_size = num_threads

        up_gpu(x_gpu, x_old_gpu, np.int32(k), block=(block_size,1,1), grid=(grid_size,1,1))
        x_old_gpu[:] = x_gpu[:]

    x_out = x_gpu.get()
    return x_out
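As a quick sanity check of our own (not from the text), recall that after the up-sweep the last element of the array holds the sum of the entire input:

x = np.arange(1, 33, dtype=np.float64)  # dyadic length 32
up = up_sweep(x)
print(up[-1], x.sum())  # both should print 528.0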

Now we'll embark on writing the down-sweep. Again, let's start with the kernel, which has the functionality of the inner parfor loop of the pseudocode. It follows similarly to before: again we use two arrays, so the temp variable from the pseudocode is unnecessary here, and again we use the bit-shift operators to obtain the values of 2^k and 2^(k+1). We calculate j as before:

down_ker = SourceModule("""
__global__ void down_ker(double *y, double *y_old, int k)
{
int tid = blockIdx.x*blockDim.x + threadIdx.x;

int _2k = 1 << k;
int _2k1 = 1 << (k+1);

int j = tid*_2k1;

y[j + _2k - 1] = y_old[j + _2k1 - 1];
y[j + _2k1 - 1] = y_old[j + _2k1 - 1] + y_old[j + _2k - 1];
}
""")

down_gpu = down_ker.get_function("down_ker")

We can now write the Python function that iteratively launches the kernel, corresponding to the outer for loop of the down-sweep phase. This is similar to the Python function for the up-sweep phase. One important distinction from the pseudocode is that we have to iterate from the largest value in the outer for loop down to the smallest; we can just use Python's reversed function to do this. Now we can implement the down-sweep phase:


def down_sweep(y):
    y = np.float64(y)
    y[-1] = 0
    y_gpu = gpuarray.to_gpu(y)
    y_old_gpu = y_gpu.copy()
    for k in reversed(range(int(np.log2(y.size)))):
        num_threads = int(np.ceil(y.size / 2**(k+1)))
        grid_size = int(np.ceil(num_threads / 32))

        if grid_size > 1:
            block_size = 32
        else:
            block_size = num_threads

        down_gpu(y_gpu, y_old_gpu, np.int32(k), block=(block_size,1,1), grid=(grid_size,1,1))
        y_old_gpu[:] = y_gpu[:]

    y_out = y_gpu.get()
    return y_out

Having implemented both the up-sweep and down-sweep phases, we can complete our final task trivially:

def efficient_prefix(x):
    return down_sweep(up_sweep(x))

We have now fully implemented a host-synchronized version of the work-efficient parallel prefix algorithm! (This implementation is available in the work-efficient_prefix.py file in the repository, along with some test code.)
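If you'd like a quick test of your own before consulting the repository, here is a minimal sketch: the output of efficient_prefix should match an exclusive prefix sum computed on the host, which we can obtain from NumPy's cumsum shifted right by one element:

x = np.random.rand(1024)
prefix = efficient_prefix(x)
# Exclusive prefix sum on the host: shift the cumulative sum right by one.
expected = np.concatenate(([0], np.cumsum(x)[:-1]))
print(np.allclose(prefix, expected))  # should print True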
