The Mandelbrot set revisited (again)

Let's revisit the Mandelbrot set that we looked at in Chapter 1, Why GPU Programming?, and Chapter 3, Getting Started with PyCUDA. First, we will write a full-on CUDA kernel that will compute the Mandelbrot set, given a particular set of parameters, along with an appropriate host-side wrapper function that we may interface to from Ctypes later. We will first be writing these functions into a single CUDA-C .cu source file and then compile this into a DLL or .so binary with the NVCC compiler. Finally, we will write some Python code so that we can run our binary code and display the Mandelbrot set.

We will now apply our knowledge of Ctypes to launch a pre-compiled CUDA kernel from Python without any assistance from PyCUDA. This will require us to write a host-side kernel launcher wrapper function in CUDA-C that we may call directly, which itself has been compiled into a dynamic library binary containing any necessary GPU code—that is, a Dynamic-Link Library (DLL) binary on Windows, or a shared object (.so) binary on Linux.

We will start, of course, by writing our CUDA-C code, so open up your favorite text editor and follow along. We will begin with the standard include statements:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

We'll now jump directly into writing our kernel. Notice extern "C" in the code, which will allow us to link to this function externally:

extern "C" __global__ void mandelbrot_ker(float * lattice, float * mandelbrot_graph, int max_iters, float upper_bound_squared, int lattice_size)
{

Let's think for a minute about how this will work: we will use a single one-dimensional array for both the real and imaginary components called lattice, which is of length lattice_size. We will use this to compute a two-dimensional Mandelbrot graph of the shape (lattice_size, lattice_size) into the pre-allocated array, mandelbrot_graph. We will specify the number of iterations to check for divergence at each point with max_iters, specifying the maximum upper bound as before by providing its squared value with upper_bound_squared. (We'll look at the motivation for using the square in a second.)

We will launch this kernel over a one-dimensional grid/block structure, with each thread corresponding to a single point in the graph image of the Mandelbrot set. We can then determine the real/imaginary lattice values for the corresponding point, like so:

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if ( tid < lattice_size*lattice_size )
    {
        int i = tid % lattice_size;
        int j = lattice_size - 1 - (tid / lattice_size);

        float c_re = lattice[i];
        float c_im = lattice[j];

Let's talk about this for a minute. First, remember that we may have to launch slightly more threads than necessary, so it's important to check that the thread ID corresponds to some point in the output image—hence the if statement. Also remember that the output array, mandelbrot_graph, is stored as a one-dimensional array representing a two-dimensional image in row-wise format, and that we will use tid as the index to write into this array. We will use i and j as the x and y coordinates of the graph on the complex plane. Since lattice is a series of real values sorted from small to large, we have to reverse their order to get the appropriate imaginary values. Also, notice that we are using plain floats here, rather than some structure or object, to represent a complex value: since every complex number has a real and an imaginary component, we need two floats to store the complex number corresponding to this thread's lattice point (c_re and c_im).
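To make the index arithmetic concrete, here is a small Python sketch that reproduces the tid-to-(i, j) mapping described above (the function name is our own, purely for illustration):

```python
def lattice_indices(tid, lattice_size):
    # i is the column index into the real axis; j is the row index,
    # reversed so that the imaginary axis runs from bottom to top.
    i = tid % lattice_size
    j = lattice_size - 1 - (tid // lattice_size)
    return i, j
```

For a 4 x 4 lattice, thread 0 (the top-left pixel of the row-wise image) maps to the smallest real value and the largest imaginary value, as intended.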

We will set up two more variables to handle the divergence check, z_re and z_im, and set the initial value of this thread's point on the graph to 1 before we check for divergence:

        float z_re = 0.0f;
        float z_im = 0.0f;

        mandelbrot_graph[tid] = 1;

Now we will do our check for divergence; if the point diverges within max_iters iterations, we set it to 0. Otherwise, it is left at 1:

        for (int k = 0; k < max_iters; k++)
        {
            float temp;

            temp = z_re*z_re - z_im*z_im + c_re;
            z_im = 2*z_re*z_im + c_im;
            z_re = temp;

            if ( (z_re*z_re + z_im*z_im) > upper_bound_squared )
            {
                mandelbrot_graph[tid] = 0;
                break;
            }
        }

Let's talk about this chunk of code for a minute before we continue. Remember that each iteration of the Mandelbrot set is computed with complex multiplication and addition—that is, z_new = z*z + c. Since we are not working with a class that handles complex values for us, the preceding operations are exactly what we need to do to compute the new real and imaginary components of z. We also need to compute the absolute value of z and see whether it exceeds a particular value; remember that the absolute value of a complex number, c = x + iy, is computed as √(x² + y²). It actually saves us some time here to compute the square of the upper bound on the host and plug that into the kernel, since this spares us from computing the square root of z_re*z_re + z_im*z_im on every iteration.
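The per-thread divergence loop can also be mirrored in plain Python, which is handy for sanity-checking the kernel's output later (the function name here is our own invention, not part of the CUDA code):

```python
def mandelbrot_point(c_re, c_im, max_iters, upper_bound_squared):
    # Mirrors the kernel's per-thread loop: iterate z = z*z + c and
    # return 0 if the point diverges within max_iters iterations, else 1.
    z_re, z_im = 0.0, 0.0
    for _ in range(max_iters):
        z_re, z_im = z_re*z_re - z_im*z_im + c_re, 2.0*z_re*z_im + c_im
        if z_re*z_re + z_im*z_im > upper_bound_squared:
            return 0
    return 1
```

Note that, just as in the kernel, the comparison is against the squared upper bound, so no square root is ever taken.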

We're now pretty much done with this kernel—we just need to close off the if statement and return from the kernel, and we're done:

    }
    return;
}

However, we are not completely finished just yet. We need to write a host-side wrapper function, declared with only extern "C" in the case of Linux, and with extern "C" __declspec(dllexport) in the case of Windows. (In contrast to a compiled CUDA kernel, this extra specifier is necessary if we want to be able to access a host-side function from Ctypes in Windows; in a portable source file, this is commonly handled with an #ifdef _WIN32 preprocessor check.) The parameters of this function correspond directly to those that go into the kernel, except that these will be stored on the host:

extern "C" __declspec(dllexport) void launch_mandelbrot(float * lattice,  float * mandelbrot_graph, int max_iters, float upper_bound, int lattice_size)
{

Now, our first task is to allocate sufficient memory to store the lattice and the output on the GPU with cudaMalloc, and then copy the lattice to the GPU with cudaMemcpy:

    int num_bytes_lattice = sizeof(float) * lattice_size;
    int num_bytes_graph = sizeof(float) * lattice_size * lattice_size;

    float * d_lattice;
    float * d_mandelbrot_graph;

    cudaMalloc((float **) &d_lattice, num_bytes_lattice);
    cudaMalloc((float **) &d_mandelbrot_graph, num_bytes_graph);

    cudaMemcpy(d_lattice, lattice, num_bytes_lattice, cudaMemcpyHostToDevice);

Like many of our other kernels, we will launch this over one-dimensional blocks of size 32 over a one-dimensional grid. We will take the ceiling value of the number of output points to compute, divided by 32, to determine the grid size, like so:

    int grid_size = (int) ceil( ((double) (lattice_size*lattice_size)) / ((double) 32) );
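This ceiling division is easy to check with a quick Python equivalent; the integer-arithmetic form below is a common idiom that gives the same result as the floating-point ceil call:

```python
def grid_size(lattice_size, block_size=32):
    # Ceiling division: the smallest number of blocks of block_size
    # threads that covers all lattice_size * lattice_size points.
    return (lattice_size * lattice_size + block_size - 1) // block_size
```

For example, a 10 x 10 lattice has 100 points, which needs 4 blocks of 32 threads, with the surplus threads filtered out by the kernel's if statement.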

Now we are ready to launch our kernel by using the traditional CUDA-C triple-triangle brackets to specify grid and block size. Notice how we square the upper bound beforehand here:

    mandelbrot_ker <<< grid_size, 32 >>> (d_lattice,  d_mandelbrot_graph, max_iters, upper_bound*upper_bound, lattice_size);

Now we just need to copy the output to the host after this is done, and then call cudaFree on the appropriate arrays. Then we can return from this function:

    cudaMemcpy(mandelbrot_graph, d_mandelbrot_graph, num_bytes_graph, cudaMemcpyDeviceToHost);
    cudaFree(d_lattice);
    cudaFree(d_mandelbrot_graph);
}

And with that, we are done with all of the CUDA-C code that we will need. Save this to a file named mandelbrot.cu, and let's continue to the next step.
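Before moving on, it may help to sketch the host side in Python with Ctypes, as promised at the start of the section. The library path, lattice bounds, and helper names below are our own assumptions for illustration only; on Linux, the binary might be built with something like nvcc -shared -Xcompiler -fPIC -o mandelbrot.so mandelbrot.cu:

```python
import ctypes
import numpy as np

def prepare_inputs(lattice_size, low=-2.0, high=2.0):
    # The lattice is a 1-D array of real values sorted from small to large;
    # the graph is a pre-allocated (lattice_size, lattice_size) float32 array
    # that launch_mandelbrot will fill in. Bounds of -2 to 2 are illustrative.
    lattice = np.linspace(low, high, lattice_size, dtype=np.float32)
    graph = np.empty((lattice_size, lattice_size), dtype=np.float32)
    return lattice, graph

def load_launcher(lib_path):
    # Load the compiled binary and declare launch_mandelbrot's signature
    # so that Ctypes marshals the arguments correctly.
    lib = ctypes.CDLL(lib_path)
    launch = lib.launch_mandelbrot
    launch.argtypes = [ctypes.POINTER(ctypes.c_float),
                       ctypes.POINTER(ctypes.c_float),
                       ctypes.c_int, ctypes.c_float, ctypes.c_int]
    launch.restype = None
    return launch
```

A call would then pass the NumPy arrays via their ctypes pointers, for example lattice.ctypes.data_as(ctypes.POINTER(ctypes.c_float)), along with max_iters, the (unsquared) upper bound, and lattice_size.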
