Quicksort with dynamic parallelism

Now let's look at a slightly more interesting and utilitarian application of dynamic parallelism—the Quicksort Algorithm. This is actually a well-suited algorithm for parallelization, as we will see.

Let's start with a brief review. Quicksort is a recursive and in-place sorting algorithm that has an average and best case performance of O(N log N), and worst-case performance of O(N²). Quicksort is performed by choosing an arbitrary point called a pivot in an unsorted array, and then partitioning the array into a left array (which contains all points less than the pivot), a right array (which contains all points equal to or greater than the pivot), with the pivot in-between the two arrays. If one or both of the arrays now has a length greater than 1, then we recursively call quicksort again on one or both of the sub-arrays, with the pivot point now in its final position.

Quicksort can be implemented in a single line in pure Python using functional programming:
qsort = lambda xs : [] if xs == [] else qsort(filter(lambda x: x < xs[-1] , xs[0:-1])) + [xs[-1]] + qsort(filter(lambda x: x >= xs[-1] , xs[0:-1]))

We can see where parallelism will come into play by the fact that quicksort is recursively called on both the right and left arrays—we can see how this will start with one thread operating on an initial large array, but by the time the arrays get very small, there should be many threads working on them. Here, we will actually accomplish this by launching all of the kernels over one single thread each!

Let's get going, and start with the import statements. (We will ensure that we import the shuffle function from the standard random module for the example that we will go over later.):

from __future__ import division
import numpy as np
from pycuda.compiler import DynamicSourceModule
import pycuda.autoinit
from pycuda import gpuarray
from random import shuffle

Now we'll write our quicksort kernel. We'll write a device function for the partitioning step, which will take an integer pointer, the lowest point of the subarray to partition, and the highest point of the subarray. This function will also use the highest point of this subarray as the pivot. Ultimately, after this function is done, it will return the final resting place of the pivot:

DynamicQuicksortCode='''
__device__ int partition(int * a, int lo, int hi)
{
 int i = lo;
 int pivot = a[hi];
 int temp;

 for (int k=lo; k<hi; k++)
 {
  if (a[k] < pivot)
  {
   temp = a[k];
   a[k] = a[i];
   a[i] = temp;
   i++;
  }
 }
 
 a[hi] = a[i];
 a[i] = pivot;
  
 return i;
}

Now we can write the kernel that implements this partition function into a parallel quicksort. We'll have to use the CUDA-C conventions for streams, which we haven't seen so far: to launch a kernel k in a stream s in CUDA-C, we use k<<<grid, block, sharedMemBytesPerBlock, s>>>(...). By using two streams here, we can be sure that they are launched in parallel. (Considering that we won't be using shared memory, we'll set the third launch parameter to "0".) The creation and destruction of the stream objects should be self-explanatory:

__global__ void quicksort_ker(int *a, int lo, int hi)
{

 cudaStream_t s_left, s_right; 
 cudaStreamCreateWithFlags(&s_left, cudaStreamNonBlocking);
 cudaStreamCreateWithFlags(&s_right, cudaStreamNonBlocking);
 
 int mid = partition(a, lo, hi);
  
 if(mid - 1 - lo > 0)
   quicksort_ker<<< 1, 1, 0, s_left >>>(a, lo, mid - 1);
 if(hi - (mid + 1) > 0)
   quicksort_ker<<< 1, 1, 0, s_right >>>(a, mid + 1, hi);
    
 cudaStreamDestroy(s_left);
 cudaStreamDestroy(s_right);

}
'''

Now let's randomly shuffle a list of 100 integers and have our kernel sort this for us. Notice how we launch the kernel over a single thread:

qsort_mod = DynamicSourceModule(DynamicQuicksortCode)

qsort_ker = qsort_mod.get_function('quicksort_ker')

if __name__ == '__main__':
    a = range(100)
    shuffle(a)
    
    a = np.int32(a)
    
    d_a = gpuarray.to_gpu(a)
    
    print 'Unsorted array: %s' % a
    
    qsort_ker(d_a, np.int32(0), np.int32(a.size - 1), grid=(1,1,1), block=(1,1,1))
    
    a_sorted = list(d_a.get())
    
    print 'Sorted array: %s' % a_sorted

This program is also available in the dynamic_quicksort.py file in this book's GitHub repository.

Table of Contents for Quicksort with dynamic parallelism

Create new playlist

Sign In

Sign Up

Table of Contents for
Quicksort with dynamic parallelism