2

Configurations for MATLAB and CUDA

The C-MEX file is the basic starting point for using CUDA and GPUs. These functions are dynamically loaded during MATLAB sessions. There are many reasons why we write our own c-mex files. Although the Parallel Computing Toolbox from MathWorks and other third-party GPU toolboxes provide a CUDA interface, many constraints and limitations hinder full utilization of CUDA and GPUs. The generic approach of writing our own c-mex files, however, allows endless custom extensions and the flexibility to use the CUDA libraries provided by NVIDIA and other CUDA vendors. In this chapter, you learn about configuring MATLAB for c-mex programming, making the simplest c-mex example, Hello, c-mex, configuring CUDA for MATLAB, and making simple CUDA examples for MATLAB.

Keywords

c-mex configuration; C/C++ compiler; c-mex programming; CUDA configuration; vector addition example

2.1 Chapter Objectives

MATLAB functions written in C/C++ are called c-mex files. The c-mex file is the basic starting point for using CUDA and GPUs in MATLAB. These functions are dynamically loaded during MATLAB sessions. We write our own c-mex files for many reasons:

1. To reuse C/C++ functions in MATLAB.

2. For increased speed.

3. For endless custom extensions.

Although the Parallel Computing Toolbox from MathWorks and other third-party GPU toolboxes provide a CUDA interface, many constraints and limitations inhibit full utilization of CUDA and GPUs. However, the generic approach of writing our own c-mex files allows endless custom extensions and the flexibility to use the CUDA libraries provided by NVIDIA and other companies. In this chapter, we learn the following:

• To configure MATLAB for c-mex programming.

• To make the simplest c-mex example, Hello, c-mex.

• To configure CUDA for MATLAB.

• To make simple CUDA examples for MATLAB.

2.2 MATLAB Configuration for c-mex Programming

2.2.1 Checklists

MATLAB Executable (MEX) is intended to let you use C/C++ and FORTRAN code directly within the MATLAB environment, to accomplish higher execution speed and avoid application bottlenecks. We refer to the C/C++ MEX as C-MEX and focus on C-MEX only in this book, for the purpose of deploying a GPU device. Since c-mex requires building C/C++ executables and CUDA requires hardware-specific (NVIDIA GPU) code, we need extra installation steps in addition to a standard MATLAB installation. We first check the C/C++ compiler installation, followed by the CUDA installation.

2.2.1.1 C/C++ Compilers

When it comes to C-MEX programming, MATLAB makes use of the C/C++ compiler installed on your system to create a MATLAB-callable binary. You should be aware of which compilers are available on your system and where they are located. You first have to make sure your MATLAB version supports the compiler on your system. For this, you may have to visit the MathWorks website and check the version compatibility of your MATLAB and compiler at http://www.mathworks.com/support/compilers/R2013a/index.html. Typically, on Windows, the Microsoft Visual C++ compiler, cl.exe, is installed at

• C:\Program Files (x86)\Microsoft Visual Studio x.0\VC\bin on 64 bit

• C:\Program Files\Microsoft Visual Studio x.0\VC\bin on 32 bit

In Mac OS X and Linux distributions, the gcc/g++ compiler is supported by MATLAB, and its installation location depends on the distribution. Common installation locations are, for example,

• /Developer/usr/bin on Mac OS X

• /usr/local/bin on Linux distributions

At this point, verify that your compiler is installed properly and take note of its location if it is installed somewhere other than the default.

2.2.1.2 NVIDIA CUDA Compiler nvcc

To compile CUDA code, we need to download and install the CUDA Toolkit from NVIDIA's website. The toolkit is available free of charge. For steps and information on downloading and installing the CUDA Toolkit, please refer to Appendix 1, Download and Install the CUDA Library.

The nvcc compiler translates CUDA-specific code and invokes the C/C++ compiler to generate executable binaries or object files. Therefore, we need both a CUDA compiler and a C/C++ compiler to build a GPU-enabled executable.

It is very helpful to know beforehand where these compilers are located. You also need to know where the CUDA runtime libraries are located.

Oftentimes, compilation errors come from incorrectly defined or undefined compiler and library locations. Once you identify those locations and set their paths in your system environment accordingly, your sail through C-MEX and CUDA programming will be much easier and smoother.

2.2.2 Compiler Selection

We begin by selecting our C compiler from MATLAB. In the MATLAB command window, run mex -setup. Once you are greeted with the welcome message, continue by pressing [y] to let mex locate the installed compilers (Figure 2.1).

image

Figure 2.1 The c-mex configuration message in MATLAB command window.

In this example, two Microsoft Visual C++ compilers are available; we choose Microsoft Visual C++ 2010 as our C++ compiler by selecting [1]. MATLAB asks you to verify your choice. Confirm it by pressing y. Once MATLAB updates the option files, we are done with our compiler selection. The updated c-mex option file contains information about which C++ compiler we use and how we compile and link our C++ codes.

This option file is actually generated and updated from a template supplied with mex. All the template option files supported by mex are located in the MATLABroot\bin\win32\mexopts or MATLABroot\bin\win64\mexopts folder on Windows, or in the MATLABroot/bin folder on UNIX systems. Figure 2.2 shows an actual example session of the c-mex setup in the MATLAB command window.

image

Figure 2.2 C/C++ compiler verification during c-mex configuration.

MATLAB uses the built-in template for the chosen compiler. MATLAB provides a list of compilers supported by the given MATLAB version. The final option file stores all the compiler-specific compilation and linking options to be used for a c-mex compilation. You can edit this option file for specific compilation needs, such as warning and debugging options.

You can find the local copy of the option file, for example, at

• For Windows 7,
C:\Users\MyName\AppData\Roaming\MathWorks\MATLAB\R2012a\mexopts.bat.

• For Windows XP,
C:\Documents and Settings\MyName\Application Data\MathWorks\MATLAB\R2012a\mexopts.bat.

• For Mac OS X,
/Users/MyName/.matlab/R2012a/mexopts.sh.

If you open this option file on a Mac, you can also specify which SDK you would like the compiler to use. Simply edit SDKROOT and save (Figure 2.3).

• For Linux distributions,
~/.matlab/R2012a/mexopts.sh.

image

Figure 2.3 The mexopts.sh file for Mac OS X SDK selection.

The compilers supported by MATLAB vary from operating system to operating system and from one MATLAB version to another. Again, you have to make sure your installed compiler is supported by the version of MATLAB installed on your system.

2.3 “Hello, mex!” using C-MEX

Now, we go step by step to say 'Hello' to our c-mex.

Step 1. First, create an empty working directory, for example, at
c:\junk\MatlabMeetsCuda\Hello.

Step 2. Now, open MATLAB and set the working directory as our current folder in the MATLAB toolbar, as shown in Figure 2.4.

Step 3. Open the MATLAB editor and create a new script by choosing File>New>Script from the menu. Then, save this new script as helloMex.cpp in the editor (Figure 2.5).

Step 4. Type the following codes into the editor window and save the file by choosing File>Save:
1 #include "mex.h"
2
3 void mexFunction(int nlhs,
4  mxArray *plhs[],
5    int nrhs,
6    const mxArray *prhs[])
7 {
8  mexPrintf("Hello, mex! ");
9 }
The mexPrintf(...) function is equivalent to printf(...) in C/C++. Unlike printf, which prints to stdout, it prints your formatted message in the MATLAB command window. You will find its usage is the same as printf.
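Like printf, mexPrintf accepts format specifiers, so you can print values along with text. A quick sketch (the variable names here are just for illustration and are not part of the chapter's examples):

// mexPrintf uses printf-style format specifiers and prints to the MATLAB command window.
int count = 3;
double scale = 0.5;
mexPrintf("Processed %d items with scale factor %f\n", count, scale);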

Step 5. Go back to MATLAB. MATLAB now shows our newly created file, helloMex.cpp, in the current folder window. We then compile this code by running the following command in the command window:
>> mex helloMex.cpp
The mex command invokes our selected compiler to compile, link, and finally generate the binary, which we can then call in a normal MATLAB session.

Step 6. On success, this creates a new mex file, helloMex.mexw64 (or helloMex.mexw32) on Windows systems, in the same directory (Figure 2.6).

Step 7. Now, it is time to say hello to our c-mex by entering the command in the command window (Figure 2.7).
>> helloMex
With just a couple of lines, we have created our first C-MEX function!

image

Figure 2.4 Set a working directory as a current folder.

image

Figure 2.5 Save a new script as a C++ code.

image

Figure 2.6 Creating a c-mex file from the C++ code.

image

Figure 2.7 Running Hello, mex in the command window.

2.3.1.1 Summary

In this example, we created one special function called mexFunction. It is referred to as the gateway routine. This function provides the entry point into the mex shared library, just like the main(…) function in C/C++ programming. Let us take a brief look at the function definition and its input parameters; a short usage sketch follows the list:

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
nlhs: Number of output variables.
plhs: Array of mxArray pointers to the output variables.
nrhs: Number of input variables.
prhs: Array of mxArray pointers to the input variables.
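
As a quick illustration of how these parameters are used (a minimal sketch of our own, not one of the chapter's examples), the gateway routine below reports how MATLAB called it and, if asked, returns a copy of its first input using the standard mxDuplicateArray API:

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    // Report how many inputs were passed and how many outputs were requested.
    mexPrintf("nrhs = %d input(s), nlhs = %d output(s) requested\n", nrhs, nlhs);

    // If there is at least one input and one requested output,
    // hand a copy of the first input back to MATLAB.
    if (nrhs >= 1 && nlhs >= 1)
        plhs[0] = mxDuplicateArray(prhs[0]);
}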

We visit this in more detail in the following section.

2.4 CUDA Configuration for MATLAB

Let us now turn to CUDA. To compile CUDA code, we need the nvcc compiler, which translates CUDA-specific keywords in our code and generates machine code for the GPU and our host system. Under the hood, nvcc uses our configured C/C++ compiler.

2.4.1 Preparing CUDA Settings

Before we move on to adding CUDA code, it is a good idea to check whether CUDA has been installed properly. The CUDA Toolkit is installed by default at

• In Windows:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v#.#, where #.# is the version number, 3.2 or higher.

• In Mac OSX:
/Developer/NVIDIA/CUDA-#.#.

• In Linux:
/usr/local/cuda-#.#.

If it is properly installed, you should be able to invoke nvcc through the system command from the MATLAB console:

>> system('nvcc')

If we run this, we get an error message, "nvcc : fatal error : No input files specified; use option --help for more information." As the message indicates, we get this error simply because we did not specify an input file to compile; nvcc itself is working. If, however, MATLAB complains with the message, "nvcc is not recognized as an internal or external command, operable program or batch file", then CUDA has not been installed properly. In that case, refer to http://docs.nvidia.com/cuda/index.html to make sure CUDA is properly installed on your system before taking the next step.

Here are some tips on how to make sure the CUDA environment is configured properly.

In Windows, check your path by entering path at the command prompt.

C:\> path

You should see the NVIDIA compiler path in the PATH variable, as follows:

PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v#.#\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\Program Files\MATLAB\R2010b\bin;C:\Program Files\TortoiseSVN\bin;

Or, open Control Panel>All Control Panel Items>System>Advanced System Settings (Figure 2.8). Further information for Windows can be found at http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-microsoft-windows/index.html.

image

Figure 2.8 NVIDIA compiler path in the PATH variable.

In Mac OS X, open the Terminal application by going to /Applications/Utilities in the Finder. In the shell, enter:

imac:bin $ echo $PATH
/Developer/NVIDIA/CUDA-5.0/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin

You should find the path where your CUDA is installed. Further information for Mac OS X can be found at http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-mac-os-x/index.html.

In Linux, in bash, for example, enter:

[user@linux_bash ~]$ echo $PATH
/usr/java/default/bin: /opt/CollabNet_Subversion/bin:/bin:/sbin:/usr:/usr/bin:/usr/sbin:/opt/openoffice.org3/program:/usr/local/cuda/bin:/usr/java/default/bin:/usr/local/bin:/bin:/usr/bin:

Then look for the nvcc path. In this case, nvcc was installed at /usr/local/cuda/bin. Further information for Linux can be found at http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html.

2.5 Example: Simple Vector Addition Using CUDA

We start with the very popular and simple example of vector addition. In this exercise, we create a CUDA function that adds two input vectors of the same size and writes the result to a separate output vector of the same size.

Step 1. Create AddVectors.h in the working directory. Enter the following code and save it:
1 #ifndef __ADDVECTORS_H__
2 #define __ADDVECTORS_H__
3
4 extern void addVectors(float* A, float* B, float* C, int size);
5
6 #endif // __ADDVECTORS_H__
In this header file, we declare the prototype of the vector addition function that we will use in our mex function. The extern keyword indicates that the function is implemented in another file.

Step 2. We now implement our addVectors function in AddVectors.cu. The file extension .cu indicates a CUDA file. Create a new file in the MATLAB editor. Enter the following code and save it as AddVectors.cu:
1 #include "AddVectors.h"
2 #include "mex.h"
3
4 __global__ void addVectorsMask(float* A, float* B, float* C, int size)
5 {
6 int i=blockIdx.x;
7 if (i >= size)
8  return;
9
10 C[i]=A[i]+B[i];
11 }
12
13 void addVectors(float* A, float* B, float* C, int size)
14 {
15 float *devPtrA=0, *devPtrB=0, *devPtrC=0;
16
17 cudaMalloc(&devPtrA, sizeof(float) * size);
18 cudaMalloc(&devPtrB, sizeof(float) * size);
19 cudaMalloc(&devPtrC, sizeof(float) * size);
20
21 cudaMemcpy(devPtrA, A, sizeof(float) * size, cudaMemcpyHostToDevice);
22 cudaMemcpy(devPtrB, B, sizeof(float) * size, cudaMemcpyHostToDevice);
23
24 addVectorsMask<<<size, 1>>>(devPtrA, devPtrB, devPtrC, size);
25
26 cudaMemcpy(C, devPtrC, sizeof(float) * size, cudaMemcpyDeviceToHost);
27
28 cudaFree(devPtrA);
29 cudaFree(devPtrB);
30 cudaFree(devPtrC);
31 }
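Note that this listing, to stay short, ignores the status codes returned by the CUDA runtime calls. A minimal sketch of how one might check them (the checkCuda helper is our own addition, not part of the book's code; mexPrintf is available here because the listing already includes mex.h):

// Hypothetical helper: report a failed CUDA runtime call.
static void checkCuda(cudaError_t status, const char* what)
{
    if (status != cudaSuccess)
        mexPrintf("%s failed: %s\n", what, cudaGetErrorString(status));
}

// Possible usage inside addVectors:
//   checkCuda(cudaMalloc(&devPtrA, sizeof(float) * size), "cudaMalloc devPtrA");
//   checkCuda(cudaMemcpy(devPtrA, A, sizeof(float) * size,
//                        cudaMemcpyHostToDevice), "cudaMemcpy A");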

Step 3. In this step, we compile the CUDA code into an object file using the -c option; we will link this object file to our mex code later. To build an object file from this code, enter the following command in the MATLAB command window.
>> system('nvcc -c AddVectors.cu')
On success, you see a message similar to the following from nvcc in the command window:
AddVectors.cu
tmpxft_00000dc0_00000000-5_AddVectors.cudafe1.gpu
tmpxft_00000dc0_00000000-10_AddVectors.cudafe2.gpu
AddVectors.cu
tmpxft_00000dc0_00000000-5_AddVectors.cudafe1.cpp
tmpxft_00000dc0_00000000-15_AddVectors.ii
ans =
 0
If, however, you get the error message, "'nvcc' is not recognized as an internal or external command, operable program or batch file" in the command window, the CUDA bin directory is not in your system path (see Section 2.4.1). If nvcc instead complains that it cannot find the host C/C++ compiler, you can add the C++ compiler path to your system environment or pass it explicitly using the -ccbin option:
>> system('nvcc -c AddVectors.cu -ccbin "C:\Program Files\Microsoft Visual Studio 10.0\VC\bin"')

Step 4. Notice that the object file has now been created in the working directory, as shown in the MATLAB current folder window (Figure 2.9).

Step 5. In this step, we create the mex function from which we call our addVectors function. Just like our helloMex example, we start with mexFunction. Create a new file in the MATLAB editor. Enter the following code and save it as AddVectorsCuda.cpp:
1 #include "mex.h"
2 #include "AddVectors.h"
3
4 void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
5 {
6 if (nrhs != 2)
7  mexErrMsgTxt("Invaid number of input arguments");
8
9 if (nlhs != 1)
10  mexErrMsgTxt("Invalid number of outputs");
11
12 if (!mxIsSingle(prhs[0]) || !mxIsSingle(prhs[1]))
13  mexErrMsgTxt("input vector data type must be single");
14
15 int numRowsA=(int)mxGetM(prhs[0]);
16 int numColsA=(int)mxGetN(prhs[0]);
17 int numRowsB=(int)mxGetM(prhs[1]);
18 int numColsB=(int)mxGetN(prhs[1]);
19
20 if (numRowsA != numRowsB || numColsA != numColsB)
21  mexErrMsgTxt("Invalid size. The sizes of two vectors must be same");
22
23 int minSize=(numRowsA<numColsA) ? numRowsA : numColsA;
24 int maxSize=(numRowsA>numColsA) ? numRowsA : numColsA;
25
26 if (minSize != 1)
27  mexErrMsgTxt("Invalid size. The vector must be one dimentional");
28
29 float* A=(float*)mxGetData(prhs[0]);
30 float* B=(float*)mxGetData(prhs[1]);
31
32 plhs[0]=mxCreateNumericMatrix(numRowsA, numColsB, mxSINGLE_CLASS, mxREAL);
33 float* C=(float*)mxGetData(plhs[0]);
34
35 addVectors(A, B, C, maxSize);
36 }
In lines 6 to 13, we make sure that the numbers of inputs and outputs are correct and that the inputs have the supported data type. We then acquire the sizes of the input vectors and verify that they match and that each is one dimensional. In line 32, we create the output vector that will hold the result of the two-vector addition. In line 35, we call our CUDA-based function to add the two input vectors.
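As a side note, the mex API also provides mexErrMsgIdAndTxt, which attaches a message identifier that calling MATLAB code can catch and inspect; a small sketch (the identifier string is our own choice, not from the book):

// Same check as line 6, but with an error identifier and a formatted message.
if (nrhs != 2)
    mexErrMsgIdAndTxt("AddVectorsCuda:invalidNumInputs",
                      "Two input vectors are required, but %d were given.", nrhs);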

Step 6. To compile our mex function and link it to the CUDA object file we created, enter the following command in the MATLAB command window. For a 64-bit Windows system and CUDA v5.0,
>> mex AddVectorsCuda.cpp AddVectors.obj -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64"
If you have CUDA v4.0, replace v5.0 with v4.0. For 32-bit Windows, replace x64 with Win32; for example,
>> mex AddVectorsCuda.cpp AddVectors.obj -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\lib\Win32"
The -lcudart option tells mex that we are using the CUDA runtime libraries. The -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64" option tells mex where those CUDA runtime libraries are located.
For Mac OS X,
>> mex AddVectorsCuda.cpp AddVectors.o -lcudart -L"/Developer/NVIDIA/CUDA-5.0/lib"
And, for Linux distributions,
>> mex AddVectorsCuda.cpp AddVectors.o -lcudart -L"/usr/local/cuda/lib"
(Note that on Mac OS X and Linux, nvcc -c produces AddVectors.o rather than AddVectors.obj, so that is the object file name to link.)

Step 7. On success, a new mex file, AddVectorsCuda.mexw64, is created in the same working directory (Figure 2.10).

Step 8. Now, it is time to run our new mex function in MATLAB. In the command window, run
>> A=single([1 2 3 4 5 6 7 8 9 10]);
>> B=single([10 9 8 7 6 5 4 3 2 1]);
>> C=AddVectorsCuda(A, B);

Step 9. Verify the result stored in the vector C. Adding the corresponding elements of the two input vectors gives 11 in every position of the resulting vector C:
>> C
C =
11   11   11   11   11   11   11   11   11   11
You can run this whole process using runAddVectors.m, as follows:

image

Figure 2.9 The created object file.

image

Figure 2.10 The c-mex file created.

% runAddVectors.m

disp('1. nvcc AddVectors.cu compiling …'),

system('nvcc -c AddVectors.cu -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"')

disp('nvcc compiling done !'),

disp('2. C/C++ compiling for AddVectorsCuda.cpp with AddVectors.obj …'),

mex AddVectorsCuda.cpp AddVectors.obj -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64"

disp('C/C++ compiling done !'),

disp('3. Test AddVectorsCuda() …')

disp('Two input arrays:')

A = single([1 2 3 4 5 6 7 8 9 10])

B = single([10 9 8 7 6 5 4 3 2 1])

disp('Result:')

C = AddVectorsCuda(A, B)

2.5.1.1 Summary

We put the CUDA-related code in AddVectors.cu. AddVectors.h contains the prototype of the function defined in the AddVectors.cu file. Our mex function (the gateway routine in AddVectorsCuda.cpp) calls the CUDA function through AddVectors.h. Once we compile the CUDA code (.cu) into an object file (.obj) with nvcc, we use the mex command to compile our C/C++ code (.cpp) and link it to the CUDA object file (.obj). We finally obtain the binary mex executable (.mexw64), which combines the regular .cpp file and the .cu file. This process is depicted in Figure 2.11.

image

Figure 2.11 The summarized block diagram for CUDA related c-mex compilation (image: input source files, image: commands, image: generated files).

2.6 Example with Image Convolution

We now create a more complex example. First of all, we define our mex function. We read a sample image in MATLAB and pass it to our mex function, which then performs a simple two-dimensional convolution using a 3×3 mask. The result is returned to MATLAB. It will be interesting to show how we do this in three cases:

1. Use the MATLAB function, conv2.

2. Do the same in pure C++ code.

3. Do the same using CUDA.

In this section, we focus on how we do this in each case with sample code. For all cases, we use the same image (Figure 2.12), whose data type is single, and the same 3×3 mask of single data type. The example mask follows:

[1 2 1; 0 0 0; -1 -2 -1]

image

Figure 2.12 Input image as an example.

2.6.1 Convolution in MATLAB

MATLAB has a built-in function, conv2. It computes the two-dimensional convolution of two input matrices using a straightforward implementation. It is very simple to use. Let us first see how we do this in plain MATLAB.

Step 1. Read the sample image of coins in the MATLAB command window:
>> quarters = single(imread('eight.tif'));
>> mask = single([1 2 1; 0 0 0; -1 -2 -1]);
>> imagesc(quarters);
>> colormap(gray);
Note that we cast the input image and mask to the single data type. When we read an image using imread, it returns the image in the uint8 data type. Since we will work with the single data type in CUDA, we prepare the input data as single.

Step 2. Do two-dimensional convolution using conv2:
>> H = conv2(quarters, mask, 'same');
For now, we chose to do the convolution with the shape parameter 'same'. By specifying this third parameter, we ask MATLAB to return an output of the same size as the input image. Now, plot the output image H to see the resulting gradient image.
>> imagesc(H);
>> colormap(gray);
You can run this whole process using convol_matlab.m as follows:

% convol_matlab.m

quarters = single(imread('eight.tif'));

mask = single([1 2 1; 0 0 0; -1 -2 -1]);

imagesc(quarters);

colormap(gray);

H = conv2(quarters, mask, 'same');

imagesc(H);

colormap(gray);

Figure 2.13 shows the resulting image.

image

Figure 2.13 The resulting gradient image.

2.6.2 Convolution in Custom c-mex

We implement the same functionality as conv2 in our own custom c-mex function. Before we start implementing, let us revisit the gateway routine introduced in the previous example. The gateway routine takes four input parameters. The first two are used to pass outputs from our c-mex function to MATLAB. The last two are for passing inputs from MATLAB to our c-mex function:

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

Step 1. Create a new file, add the following code and save it as conv2Mex.cpp:
1 #include "mex.h"
2
3 void conv2Mex(float* src, float* dst, int numRows, int numCols, float* mask)
4 {
5 int boundCol = numCols - 1;
6 int boundRow = numRows - 1;
7
8 for (int c = 1; c < boundCol; c++)
9 {
10  for (int r = 1; r < boundRow; r++)
11  {
12   int dstIndex = c * numRows + r;
13   int kerIndex = 8;
14   for (int kc = -1; kc < 2; kc++)
15   {
16    int srcIndex = (c + kc) * numRows + r;
17    for (int kr = -1; kr < 2; kr++)
18     dst[dstIndex] += mask[kerIndex--] * src[srcIndex + kr];
19    }
20   }
21  }
22 }
23
24 void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
25 {
26 if (nrhs != 2)
27  mexErrMsgTxt("Invalid number of input arguments");
28
29 if (nlhs != 1)
30  mexErrMsgTxt("Invalid number of outputs");
31
32 if (!mxIsSingle(prhs[0]) || !mxIsSingle(prhs[1]))
33  mexErrMsgTxt("input image and mask type must be single");
34
35 float* image = (float*)mxGetData(prhs[0]);
36 float* mask = (float*)mxGetData(prhs[1]);
37
38 int numRows = (int)mxGetM(prhs[0]);
39 int numCols = (int)mxGetN(prhs[0]);
40 int numKRows = (int)mxGetM(prhs[1]);
41 int numKCols = (int)mxGetN(prhs[1]);
42
43 if (numKRows != 3 || numKCols != 3)
44  mexErrMsgTxt("Invalid mask size. It must be 3×3");
45
46 plhs[0] = mxCreateNumericMatrix(numRows, numCols, mxSINGLE_CLASS, mxREAL);
47 float* out = (float*)mxGetData(plhs[0]);
48
49 conv2Mex(image, out, numRows, numCols, mask);
50 }
In mexFunction, we first check the number of inputs and outputs. In this example, there must be two inputs, the image and the mask, and one output, the convolution result. We then make sure that the input data types are single. We find the sizes of the input image and the mask and make sure the mask is 3×3. In line 46, we prepare the output array that will hold our result and be passed back to MATLAB. Then, we call our custom convolution function, conv2Mex, to do the number crunching.
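It is also worth noting how conv2Mex computes its linear indices: MATLAB stores matrices in column-major order, so element (r, c) of a matrix with numRows rows sits at offset c * numRows + r in the buffer returned by mxGetData. A small sketch of this mapping (the values are just for illustration):

// Column-major indexing, as used for dstIndex and srcIndex above.
int numRows = 200;                   // matrix height (illustrative value)
int r = 10, c = 3;                   // zero-based row and column
int linearIndex = c * numRows + r;   // offset into the float* data buffer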

Step 2. We compile our mex function and call it from the MATLAB command window. Compiling is very simple:
>> mex conv2Mex.cpp
If it compiles successfully, mex creates a binary file, conv2Mex.mexw64 on 64-bit Windows.

Step 3. We can now call this function anywhere in MATLAB and display the result:

>> quarters = single(imread('eight.tif'));
>> mask = single([1 2 1; 0 0 0; -1 -2 -1]);
>> H2 = conv2Mex(quarters, mask);
>> imagesc(H2);
>> colormap(gray);

You can run this whole process using convol_mex.m as follows:

% convol_mex.m

mex conv2Mex.cpp

quarters = single(imread('eight.tif'));

mask = single([1 2 1; 0 0 0; -1 -2 -1]);

imagesc(quarters);

colormap(gray);

H2 = conv2Mex(quarters, mask);

imagesc(H2);

colormap(gray);

2.6.3 Convolution in Custom c-mex with CUDA

In this example, we use a CUDA function to perform the convolution operation. This function is functionally the same as the c-mex function in the previous section, except that it is implemented in CUDA.

Step 1. First, we declare the prototype of the function that will be implemented in CUDA and called from our mex function. Create a new file and save it as conv2Mex.h:
1 #ifndef __CONV2MEXCUDA_H__
2 #define __CONV2MEXCUDA_H__
3
4 extern void conv2Mex(float* in,
5     float* out,
6     int numRows,
7     int numCols,
8     float* mask);
9
10 #endif // __CONV2MEXCUDA_H__

Step 2. We implement our conv2Mex function in CUDA. Create conv2Mex.cu and add the following code:
1 #include "conv2Mex.h"
2
3 __global__ void conv2MexCuda(float* src,
4      float* dst,
5      int numRows,
6      int numCols,
7      float* mask)
8 {
9 int row=blockIdx.x;
10 if (row < 1 || row > numRows - 2)
11  return;
12
13 int col=blockIdx.y;
14 if (col < 1 || col > numCols - 2)
15  return;
16
17 int dstIndex=col * numRows+row;
18 dst[dstIndex]=0;
19 int kerIndex=3 * 3 - 1;
20 for (int kc=-1; kc<2; kc++)
21 {
22  int srcIndex=(col+kc) * numRows+row;
23  for (int kr=-1; kr<2; kr++)
24  {
25   dst[dstIndex] += mask[kerIndex--] * src[srcIndex+kr];
26  }
27 }
28 }
29
30 void conv2Mex(float* src, float* dst, int numRows, int numCols, float* ker)
31 {
32 int totalPixels=numRows * numCols;
33 float *deviceSrc, *deviceKer, *deviceDst;
34
35 cudaMalloc(&deviceSrc, sizeof(float) * totalPixels);
36 cudaMalloc(&deviceDst, sizeof(float) * totalPixels);
37 cudaMalloc(&deviceKer, sizeof(float) * 3 * 3);
38
39 cudaMemcpy(deviceSrc,
40    src,
41    sizeof(float) * totalPixels,
42    cudaMemcpyHostToDevice);
43
44 cudaMemcpy(deviceKer,
45    ker,
46    sizeof(float) * 3 * 3,
47    cudaMemcpyHostToDevice);
48
49 cudaMemset(deviceDst, 0, sizeof(float) * totalPixels);
50
51 dim3 gridSize(numRows, numCols);
52
53 conv2MexCuda<<<gridSize, 1>>>(deviceSrc,
54     deviceDst,
55     numRows,
56     numCols,
57     deviceKer);
58
59 cudaMemcpy(dst,
60    deviceDst,
61    sizeof(float) * totalPixels,
62    cudaMemcpyDeviceToHost);
63
64 cudaFree(deviceSrc);
65 cudaFree(deviceDst);
66 cudaFree(deviceKer);
67 }
We use cudaMalloc to allocate memory on the CUDA device. The function cudaMemcpy copies data from host to device or from device to host, depending on its fourth parameter. On the CUDA device, we allocate memory for the input and output images and the mask. Then, we copy the input and mask data from the host to the CUDA device. Using cudaMemset, we initialize the output data to zero. The kernel call is made with conv2MexCuda. Here, we simply set the grid size to be the same as the image size, so each block in the grid calculates the final value of one output pixel by applying the 3×3 mask in conv2MexCuda. We explain grid sizes in more detail in Chapter 4.
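One more practical point: a kernel launch such as conv2MexCuda<<<gridSize, 1>>>(...) does not return a status directly, so errors surface only on later runtime calls. A minimal sketch of how one might check the launch immediately afterward (our own addition, not part of the book's listing; printf requires <cstdio>):

// Right after the conv2MexCuda<<<gridSize, 1>>>(...) launch:
cudaError_t launchStatus = cudaGetLastError();      // error from the launch itself
cudaError_t syncStatus   = cudaDeviceSynchronize(); // error from kernel execution
if (launchStatus != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(launchStatus));
else if (syncStatus != cudaSuccess)
    printf("kernel execution failed: %s\n", cudaGetErrorString(syncStatus));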

Step 3. Compile our CUDA codes to an object file, which we link to our c-mex function in a later step. To compile in MATLAB, enter the following in MATLAB command window:
>> system('nvcc -c conv2Mex.cu')
If you encounter any compilation error, please refer to Section 2.5, “Example: Simple Vector Addition using CUDA.” On success, this generates a conv2Mex.obj file.

Step 4. We then create the mex function in which we call our CUDA-based convolution function. Create a new file, enter the code that follows, and save it as conv2MexCuda.cpp:
1 #include "mex.h"
2 #include "conv2Mex.h"
3
4 void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
5 {
6 if (nrhs != 2)
7  mexErrMsgTxt("Invalid number of input arguments");
8
9 if (nlhs != 1)
10  mexErrMsgTxt("Invalid number of outputs");
11
12 if (!mxIsSingle(prhs[0]) || !mxIsSingle(prhs[1]))
13  mexErrMsgTxt("input image and mask type must be single");
14
15 float* image=(float*)mxGetData(prhs[0]);
16 float* mask=(float*)mxGetData(prhs[1]);
17
18 int numRows=(int)mxGetM(prhs[0]);
19 int numCols=(int)mxGetN(prhs[0]);
20 int numKRows=(int)mxGetM(prhs[1]);
21 int numKCols=(int)mxGetN(prhs[1]);
22
23 if (numKRows != 3 || numKCols != 3)
24  mexErrMsgTxt("Invalid mask size. It must be 3×3");
25
26 plhs[0]=mxCreateNumericMatrix(numRows, numCols, mxSINGLE_CLASS, mxREAL);
27 float* out=(float*)mxGetData(plhs[0]);
28
29 conv2Mex(image, out, numRows, numCols, mask);
30 }
Our new mex function is almost the same as the one in the previous section. The only difference is the #include "conv2Mex.h" on line 2. Here, we declare our conv2Mex function in conv2Mex.h and implement it in conv2Mex.cu.

Step 5. We are ready to make our CUDA-based mex function. However, our conv2Mex function is in conv2Mex.obj, so we have to tell our linker where that function is. Also, we tell the linker that we will be using CUDA runtime libraries and where they are located. Enter the following in the MATLAB command window. For Windows 64 bit,
>> mex conv2MexCuda.cpp conv2Mex.obj -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64"
For Windows 32 bit,
>> mex conv2MexCuda.cpp conv2Mex.obj -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\Win32"
For Linux,
>> mex conv2MexCuda.cpp conv2Mex.o -lcudart -L"/usr/local/cuda/lib"
For Mac OS X,
>> mex conv2MexCuda.cpp conv2Mex.o -lcudart -L"/Developer/NVIDIA/CUDA-5.0/lib"
(Again, on Linux and Mac OS X the object file produced by nvcc is conv2Mex.o rather than conv2Mex.obj.)
On success, MATLAB creates our mex function, conv2MexCuda.mexw64 on 64-bit Windows.

Step 6. Execute our CUDA-based convolution function in the MATLAB command window:
>> quarters = single(imread('eight.tif'));
>> mask = single([1 2 1; 0 0 0; -1 -2 -1]);
>> H3 = conv2MexCuda(quarters, mask);
>> imagesc(H3);
>> colormap(gray);
We should now see the same output image in the MATLAB figure.

You can run this whole process using convol_cuda.m as follows:

% convol_cuda.m

system('nvcc -c conv2Mex.cu -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"')

mex conv2MexCuda.cpp conv2Mex.obj -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64"

quarters = single(imread('eight.tif'));

mask = single([1 2 1; 0 0 0; -1 -2 -1]);

imagesc(quarters);

colormap(gray);

H3 = conv2MexCuda(quarters, mask);

imagesc(H3);

colormap(gray);

2.6.4 Brief Time Performance Profiling

We performed the two-dimensional convolution operation in three ways, all producing the same output image. How each performs in terms of time is our main interest. More detailed instructions and information are given in the next chapter. Here, we briefly examine the timing using the tic and toc commands in MATLAB.

The following is a sample MATLAB session showing the time performance of each case.

>> tic; H1 = conv2(quarters, mask); toc;
Elapsed time is 0.001292 seconds.
>> tic; H1 = conv2(quarters, mask); toc;
Elapsed time is 0.001225 seconds.
>> tic; H2 = conv2Mex(quarters, mask); toc;
Elapsed time is 0.001244 seconds.
>> tic; H2 = conv2Mex(quarters, mask); toc;
Elapsed time is 0.001118 seconds.
>> tic; H3 = conv2MexCuda(quarters, mask); toc;
Elapsed time is 0.036286 seconds.
>> tic; H3 = conv2MexCuda(quarters, mask); toc;
Elapsed time is 0.035877 seconds.

Both the built-in conv2 and our custom conv2Mex functions perform about the same. However, our CUDA-based convolution in this example is very slow compared to the other two functions, because the data size is small and our grid and block sizes do not take advantage of the GPU architecture.
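To give a feel for what "taking advantage of the GPU architecture" could mean here, the sketch below (our own illustration, not the book's code, and only one of many possible configurations) assigns a 16×16 tile of threads to each block instead of a single thread per block; Chapter 4 treats this topic properly:

__global__ void conv2MexCudaTiled(float* src, float* dst,
                                  int numRows, int numCols, float* mask)
{
    // One thread per output pixel; many threads share a block.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    // Skip border pixels and any excess threads past the image edges.
    if (row < 1 || row > numRows - 2 || col < 1 || col > numCols - 2)
        return;

    float sum = 0.0f;
    int kerIndex = 3 * 3 - 1;
    for (int kc = -1; kc < 2; kc++)
    {
        int srcIndex = (col + kc) * numRows + row;  // column-major indexing
        for (int kr = -1; kr < 2; kr++)
            sum += mask[kerIndex--] * src[srcIndex + kr];
    }
    dst[col * numRows + row] = sum;
}

// A possible launch configuration:
//   dim3 blockSize(16, 16);
//   dim3 gridSize((numRows + 15) / 16, (numCols + 15) / 16);
//   conv2MexCudaTiled<<<gridSize, blockSize>>>(deviceSrc, deviceDst,
//                                              numRows, numCols, deviceKer);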


In later chapters, we introduce more detailed time profiling and show how to optimize our simple CUDA functions to achieve our goal of acceleration.

2.7 Summary

As we did in AddVectors, we put the CUDA-related functions in conv2Mex.cu (Figure 2.14). The file conv2Mex.h contains the declaration of the function that our mex function, conv2MexCuda.cpp, calls within its gateway routine. Once we compile the CUDA code into an object file, we use the mex command to compile our mex code and link it to the CUDA object file.

image

Figure 2.14 The summarized block diagram for the example (image: input source files, image: commands, image: generated files).
