Computing definite integrals with the Monte Carlo method

We are now going to use the CUDA Math API to represent an arbitrary mathematical function, f, while using the cuRAND library to implement Monte Carlo integration. We will do this with metaprogramming: we will use Python to generate the code for a device function from a code template, which will be plugged into an appropriate Monte Carlo kernel for integration.

The idea here is that it will look and act similarly to some of the metaprogramming tools we've seen with PyCUDA, such as ElementwiseKernel.

Let's start by importing the appropriate modules into our new project:

import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.compiler import SourceModule
import numpy as np

We're going to use a trick in Python called dictionary-based string formatting. Let's go over this for a minute before we continue. Suppose we are writing a chunk of CUDA C code and we are unsure whether we want a particular collection of variables to be float or double; perhaps it looks like this: code_string = "float x, y; float * z;". We might actually want to format the code so that we can switch between floats and doubles on the fly. Let's change every reference to float in the string to %(precision)s, which gives us code_string = "%(precision)s x, y; %(precision)s * z;". We can now set up an appropriate dictionary that will swap %(precision)s with double, which is code_dict = {'precision' : 'double'}, and get the new double-precision string with code_double = code_string % code_dict. Let's take a look:
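
code_string = "%(precision)s x, y; %(precision)s * z;"
code_dict = {'precision' : 'double'}

# substitute 'double' for every %(precision)s placeholder in the string
code_double = code_string % code_dict
print(code_double)   # prints: double x, y; double * z;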

Now, let's think for a moment about how we want our new Monte Carlo integrator to work. It will take a string that is a math equation, written using the CUDA Math API, that defines the function we want to integrate. We can then fit this string into the code using the dictionary trick we just learned, and use this to integrate arbitrary functions. We will also use the template to switch between float and double precision, as per the user's discretion.
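
To make the underlying algorithm concrete before we write any GPU code, here is a rough pure-NumPy sketch of Monte Carlo integration. (This is only an illustration; monte_carlo_cpu is a hypothetical helper, not part of our project code.) We sample x uniformly over [lo, hi], average f over the samples, and multiply by the width of the interval:

def monte_carlo_cpu(f, lo, hi, num_samples=10**7):
    # draw uniform random samples over the interval of integration
    x = np.random.uniform(lo, hi, num_samples)
    # the estimate is the mean of f over the samples, times the interval width
    return (hi - lo) * np.mean(f(x))

# monte_carlo_cpu(np.sin, 0, np.pi) should come out close to 2

Our GPU version will do exactly this, except that the interval will be split among many threads, each estimating its own subintegral.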

We can now begin our CUDA C code:

MonteCarloKernelTemplate = '''
#include <curand_kernel.h>

We will keep the unsigned 64-bit integer macro from before, ULL. Let's also define some new macros for taking a reciprocal (_R) and for squaring (_P2):

#define ULL unsigned long long
#define _R(z) ( 1.0f / (z) )
#define _P2(z) ( (z) * (z) )
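
These macros will let us write compact function strings later; for example (an illustrative string, not one we use in this section), 'y = _R(1.0f + _P2(x))' would correspond to the function f(x) = 1 / (1 + x²).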

Now, let's define a device function that our equation string will plug into. We will use the math_function value when we swap in the text from our dictionary. We will have another value called p, for precision (which will be either float or double). We'll call this device function f. We'll put an inline in the declaration of the function, which will save us a little time by sparing us the overhead of a full function call when this is called from the kernel:

__device__ inline %(p)s f(%(p)s x)
{
%(p)s y;
%(math_function)s;
return y;
}

Now, let's think about how this will work: we declare a 32- or 64-bit floating-point value called y, plug in math_function, and then return y. This will only make sense if math_function is some code that acts on the input parameter x and sets some value to y, such as y = sin(x). Let's keep this in mind and continue.
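
For example, if math_function were the default 'y = sin(x)' and we chose single precision, the substituted device function would read as follows (shown purely for illustration; this is the output of the template, not code we write by hand):

__device__ inline float f(float x)
{
float y;
y = sin(x);
return y;
}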

We will now begin writing our Monte Carlo integration kernel. Let's remember that we have to make our CUDA kernel visible from plain C with the extern "C" keyword. We will then set up our kernel.

First, we will indicate how many random samples each thread in the kernel should take with iters; we then indicate the lower bound of integration (a) with lo and the upper bound (b) with hi, and pass in an array, ys_out, to store the collection of partial integrals for each thread (we will later sum over ys_out on the host side to get the value of the complete definite integral from lo to hi). Again, notice how we are referring to the precision as p:

extern "C" {
__global__ void monte_carlo(int iters, %(p)s lo, %(p)s hi, %(p)s * ys_out)
{

 We will need a curandState object for generating random values. We will also need to find the global thread ID and the total number of threads. Since we are working with a one-dimensional mathematical function, it makes sense to set up our block and grid parameters in one dimension, x, as well:

curandState cr_state;
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int num_threads = blockDim.x * gridDim.x;

We will now calculate the width of the subinterval between lo and hi that a single thread will process. We'll do this by dividing the entire length of the integration (which will be hi - lo) by the total number of threads. Again, note how we are using the templating trick so that this value can be of either precision:

%(p)s t_width = (hi - lo) / ( %(p)s ) num_threads;

Recall that we have a parameter called iters; this indicates how many random values each thread will sample. We will need to know the density of the samples in a moment; that is, the average number of samples per unit length. We calculate it like so, remembering to typecast the integer iters into a floating-point value:

%(p)s density = ( ( %(p)s ) iters ) / t_width;

Recall that we are dividing the interval we are integrating over among the threads, which means that each thread has its own start and end point. Since we are dividing the length fairly among the threads, we calculate these bounds like so:

%(p)s t_lo = t_width*tid + lo;
%(p)s t_hi = t_lo + t_width;
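
As a quick host-side sanity check (illustrative Python, not part of the kernel), we can verify that these per-thread subintervals tile [lo, hi] exactly, with no gaps or overlaps:

# hypothetical values chosen only for this check
lo, hi, num_threads = 0.0, np.pi, 3200
t_width = (hi - lo) / num_threads
t_lo = t_width * np.arange(num_threads) + lo
t_hi = t_lo + t_width
# consecutive subintervals share endpoints, and the last one ends at hi
assert np.allclose(t_hi[:-1], t_lo[1:]) and np.isclose(t_hi[-1], hi)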

We can now initialize cuRAND like we did previously, making sure that each thread is generating random values from its own individual seed:

curand_init( (ULL)  clock() + (ULL) tid, (ULL) 0, (ULL) 0, &cr_state);

Before we start sampling, we will need to set up some additional floating-point values. y will hold the final value for the integral estimate from t_lo to t_hi, and y_sum will hold the sum of all of the sampled values. We will also use the rand_val variable to hold the raw random value we generate, and x to store the scaled random value within the subinterval that we will be sampling from:

%(p)s y, y_sum = 0.0f;
%(p)s rand_val, x;

Now, let's loop to sample values from our function, adding the values into y_sum. The one salient thing to notice is the %(p_curand)s at the end of curand_uniform: the 32-bit floating-point version of this function is curand_uniform, while the 64-bit version is curand_uniform_double. We will have to swap this with either _double or an empty string later, depending on the level of precision we go with here. Also, notice how we scale rand_val so that x falls between t_lo and t_hi, remembering that the random uniform distribution in cuRAND only yields values between 0 and 1:

for (int i=0; i < iters; i++)
{
rand_val = curand_uniform%(p_curand)s(&cr_state);
x = t_lo + t_width * rand_val;
y_sum += f(x);
}

We can now calculate the value of the subintegral from t_lo to t_hi by dividing y_sum by density. Since density is iters / t_width, this is the same as (y_sum / iters) * t_width: the average sampled value of f multiplied by the width of the subinterval, which is exactly the Monte Carlo estimate:

y = y_sum / density;

We output this value into the array and close off our CUDA kernel, as well as the extern "C", with the final closing bracket. We're done writing CUDA C, so we will close off this section with a triple-quote:

ys_out[tid] = y;
}
}
'''
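
Before we wrap this template in a class, we can sanity-check the substitution by rendering the template with a trial dictionary and printing the result. (This is a quick, optional check; test_dict is a throwaway name used only here.)

test_dict = {'p' : 'float', 'p_curand' : '', 'math_function' : 'y = sin(x)'}
print(MonteCarloKernelTemplate % test_dict)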

We will now do something a little different: we're going to set up a class to handle our definite integrals. Let's call it MonteCarloIntegrator. We will start, of course, by writing the constructor, that is, the __init__ function, whose first input is the object reference, self. Let's set up the default value for math_function to be 'y = sin(x)', with the default precision as 'd', for double. We'll also set the default value for lo as 0 and hi as the NumPy approximation of π. Finally, we'll have values for the number of random samples each thread will take (samples_per_thread) and the grid size that we will launch our kernel over (num_blocks).

Let's start this function by storing the text string math_function within the self object for later use:

class MonteCarloIntegrator:

    def __init__(self, math_function='y = sin(x)', precision='d', lo=0, hi=np.pi, samples_per_thread=10**5, num_blocks=100):

        self.math_function = math_function

Now, let's set up the values related to our choice of floating-point precision that we will need for later, particularly for setting up our template dictionary. We will also store the lo and hi values within the object. Let's also be sure to raise exception errors if the user inputs an invalid datatype, or if hi is actually smaller than lo:

        if precision in [None, 's', 'S', 'single', np.float32]:
            self.precision = 'float'
            self.numpy_precision = np.float32
            self.p_curand = ''
        elif precision in ['d', 'D', 'double', np.float64]:
            self.precision = 'double'
            self.numpy_precision = np.float64
            self.p_curand = '_double'
        else:
            raise Exception('precision is invalid datatype!')

        if (hi - lo <= 0):
            raise Exception('hi - lo <= 0!')
        else:
            self.hi = hi
            self.lo = lo

We can now set up our code template dictionary:

        MonteCarloDict = {'p' : self.precision, 'p_curand' : self.p_curand, 'math_function' : self.math_function}

We can now generate the actual final code using dictionary-based string formatting, and compile. Let's also turn off warnings from the nvcc compiler by setting options=['-w'] in SourceModule:

        self.MonteCarloCode = MonteCarloKernelTemplate % MonteCarloDict

        self.ker = SourceModule(no_extern_c=True, options=['-w'], source=self.MonteCarloCode)

We will now set up a function reference in our object to our compiled kernel with get_function. Let's save the remaining two parameters within our object before we continue:

        self.f = self.ker.get_function('monte_carlo')
        self.num_blocks = num_blocks
        self.samples_per_thread = samples_per_thread

Now, while we will need different instantiations of MonteCarloIntegrator objects to evaluate definite integrals of different mathematical functions or at different floating-point precisions, we might want to evaluate the same integral over different lo and hi bounds, change the number of threads/grid size, or alter the number of samples we take at each thread. Thankfully, these are easy alterations to make, and can all be made at runtime.

We'll set up a specific function for evaluating the integral of a given object. We will set the default values of these parameters to be those that we stored during the call to the constructor:

    def definite_integral(self, lo=None, hi=None, samples_per_thread=None, num_blocks=None):
        if lo is None:
            lo = self.lo
        if hi is None:
            hi = self.hi
        if samples_per_thread is None:
            samples_per_thread = self.samples_per_thread
        if num_blocks is None:
            num_blocks = self.num_blocks
        grid = (num_blocks, 1, 1)

        block = (32, 1, 1)
        num_threads = 32 * num_blocks

We can finish this function off by setting up an empty array to store the partial sub-integrals and launching the kernel. We then need to sum over the sub-integrals to get the final value, which we return:

        self.ys = gpuarray.empty((num_threads,), dtype=self.numpy_precision)

        self.f(np.int32(samples_per_thread), self.numpy_precision(lo), self.numpy_precision(hi), self.ys, block=block, grid=grid)

        self.nintegral = np.sum(self.ys.get())

        return self.nintegral

We are ready to try this out. Let's just set up an object with the default values; this will integrate y = sin(x) from 0 to π. If you remember calculus, the antiderivative of sin(x) is -cos(x), so we can evaluate the definite integral by hand: -cos(π) - (-cos(0)) = 1 + 1 = 2.

Therefore, we should get a numerical value close to 2. Let's see what we get:
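
Here is a minimal test, using only the names we defined above (the exact printed estimate will vary slightly from run to run, since the samples are random):

if __name__ == '__main__':
    # integrate y = sin(x) from 0 to pi, using all of the default settings
    integrator = MonteCarloIntegrator()
    print('The Monte Carlo estimate of the integral is: %s' % integrator.definite_integral())
    # expected output: a value very close to 2.0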
