Performance-optimized array sum 

For the final example of this book, we will write a standard array-summation kernel for an array of doubles, except this time we will use every trick we've learned in this chapter to make it as fast as possible. We will check the output of our summing kernel against NumPy's sum function, and then we will run some tests with the standard timeit function to see how our kernel's performance compares to PyCUDA's own sum function for gpuarray objects.

Let's get started by importing all of the necessary libraries, and then write a laneid device function similar to the one we used in the previous section:

from __future__ import division
import numpy as np
from pycuda.compiler import SourceModule
import pycuda.autoinit
from pycuda import gpuarray
import pycuda.driver as drv
from timeit import timeit

SumCode='''
__device__ void __inline__ laneid(int & id)
{
    asm("mov.u32 %0, %%laneid; " : "=r"(id));
}

Let's note a few things. First, notice the __inline__ qualifier in the declaration of our device function: this effectively turns the function into a macro, which shaves off a little of the overhead of calling and branching to a device function from the kernel. Second, notice that we set the id variable by reference instead of returning a value; returning could require a second integer register and an additional copy instruction, while passing by reference guarantees that this won't happen.
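For comparison, the return-by-value form that we are avoiding would look something like the following sketch (laneid_by_value is a hypothetical name, shown for illustration only and not used in our program):

__device__ int __inline__ laneid_by_value()
{
    int id;
    asm("mov.u32 %0, %%laneid; " : "=r"(id));
    return id;   // the return value may need a second register and an extra copy
}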

Let's write the other device functions in a similar fashion. We will need two more of them: one to split a 64-bit double into two 32-bit variables, and one to combine those two halves back into a double:

__device__ void __inline__ split64(double val, int & lo, int & hi)
{
    // reinterpret the 64-bit double as two 32-bit halves
    asm volatile("mov.b64 {%0, %1}, %2; ":"=r"(lo),"=r"(hi):"d"(val));
}

__device__ void __inline__ combine64(double &val, int lo, int hi)
{
    // pack the two 32-bit halves back into a 64-bit double
    asm volatile("mov.b64 %0, {%1, %2}; ":"=d"(val):"r"(lo),"r"(hi));
}
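If you want to convince yourself of what these mov.b64 instructions are doing, here is a small host-side analogue in pure Python (purely illustrative and not part of our program; it uses the standard struct module to reinterpret a double's bit pattern as two 32-bit integers and back):

import struct

def split64(val):
    # reinterpret the 8 bytes of a double as two unsigned 32-bit integers
    lo, hi = struct.unpack('II', struct.pack('d', val))
    return lo, hi

def combine64(lo, hi):
    # pack the two 32-bit halves back together and reinterpret as a double
    return struct.unpack('d', struct.pack('II', lo, hi))[0]

print(combine64(*split64(3.14)))   # prints 3.14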

Now let's start writing the kernel. We will take in an array of doubles called input and output the entire sum to out, which should be initialized to 0. We will start by getting the lane ID of the current thread, and then load two values from global memory into the current thread with a single vectorized memory load:

__global__ void sum_ker(double *input, double *out)
{
    int id;
    laneid(id);

    // vectorized load: each thread pulls in two consecutive doubles at once
    double2 vals = *reinterpret_cast<double2*>( &input[(blockDim.x*blockIdx.x + threadIdx.x) * 2] );
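To see how the indexing works out, consider thread 5 of block 3 with our 32-thread blocks: its offset into input is (32*3 + 5) * 2 = 202, so it loads input[202] and input[203] together as a single double2.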

Now let's sum the two values from the double2 vals variable into a new double variable, sum_val, which will accumulate all of the sums for this thread. We will also declare two 32-bit integers, s1 and s2, that we will use for splitting this value and sharing it via warp shuffling, as well as a temp variable for the values we reconstruct after receiving them from other threads in the warp:

    double sum_val = vals.x + vals.y;
    double temp;
    int s1, s2;

Now let's perform a naive parallel sum across the warp again. This works just like summing 32-bit integers across a warp, except that on each iteration we use our split64 and combine64 PTX functions on sum_val and temp:

    for (int i = 1; i < 32; i *= 2)
    {
        // use PTX assembly to split sum_val into two 32-bit halves
        split64(sum_val, s1, s2);

        // shuffle each half down by i lanes to transfer the data
        // (on CUDA 9.0 and later, use __shfl_down_sync(0xFFFFFFFF, s1, i, 32))
        s1 = __shfl_down(s1, i, 32);
        s2 = __shfl_down(s2, i, 32);

        // use PTX assembly to recombine the halves into a double
        combine64(temp, s1, s2);
        sum_val += temp;
    }
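To build some intuition for why five doubling steps are enough, here is a small NumPy simulation of this shuffle-down reduction over a 32-lane warp (illustrative only; it pads out-of-range lanes with zeros, whereas the real __shfl_down leaves such lanes with stale values, which is exactly why we will only trust lane 0's result in a moment):

import numpy as np

vals = np.random.randn(32)   # pretend each of the 32 lanes holds one value
lane_sums = vals.copy()

for i in [1, 2, 4, 8, 16]:
    shifted = np.zeros_like(lane_sums)
    shifted[:-i] = lane_sums[i:]   # lane j receives lane j+i's running sum
    lane_sums += shifted

print(np.allclose(lane_sums[0], vals.sum()))   # lane 0 holds the full sum: True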

Now that we are done, let's have the 0th lane of every warp add its final value to out using the thread-safe atomicAdd:

    if (id == 0)
        atomicAdd(out, sum_val);

}'''
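One caveat about that last step: atomicAdd on double operands is implemented in hardware only on GPUs of compute capability 6.0 (Pascal) and later. On older devices, you would have to emulate it in software with an atomic compare-and-swap loop, along the lines of the following sketch adapted from the CUDA C Programming Guide (the name atomicAdd_f64 is our own):

__device__ double atomicAdd_f64(double *address, double val)
{
    // view the target address as a 64-bit integer so that we can use atomicCAS
    unsigned long long int *address_as_ull = (unsigned long long int*) address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // store the new sum only if *address still holds the value we read
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}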

We will now write our test code, using timeit to measure the total time of 20 iterations each of our kernel and of PyCUDA's sum on an array of 10000*2*32 doubles. Since each 32-thread block processes 64 doubles (two per thread), we launch the kernel over a grid of 10,000 blocks:

sum_mod = SourceModule(SumCode)
sum_ker = sum_mod.get_function('sum_ker')

a = np.float64(np.random.randn(10000*2*32))
a_gpu = gpuarray.to_gpu(a)
out = gpuarray.zeros((1,), dtype=np.float64)

sum_ker(a_gpu, out, grid=(int(np.ceil(a.size/64)),1,1), block=(32,1,1))
drv.Context.synchronize()

print "Does sum_ker produce the same value as NumPy's sum (according to allclose)? : %s" % np.allclose(np.sum(a), out.get()[0])

print 'Performing sum_ker / PyCUDA sum timing tests (20 each)...'

sum_ker_time = timeit('''from __main__ import sum_ker, a_gpu, out, np, drv
sum_ker(a_gpu, out, grid=(int(np.ceil(a_gpu.size/64)),1,1), block=(32,1,1))
drv.Context.synchronize()''', number=20)

pycuda_sum_time = timeit('''from __main__ import gpuarray, a_gpu, drv
gpuarray.sum(a_gpu)
drv.Context.synchronize()''', number=20)

print "sum_ker total time for 20 runs: %s, PyCUDA's gpuarray.sum total time for 20 runs: %s" % (sum_ker_time, pycuda_sum_time)
print '(Performance improvement of sum_ker over gpuarray.sum: %s)' % (pycuda_sum_time / sum_ker_time)

Let's run this from IPython. Make sure that you have run both gpuarray.sum and sum_ker at least once beforehand, so that we aren't also timing nvcc's compilation.

So, while summing is normally pretty boring, we can be excited by the fact that our clever use of hardware tricks can speed up such a bland and trivial algorithm quite a bit.

This example is available as the performance_sum_ker.py file under the Chapter11 directory in this book's GitHub repository.