Chapter 6. Programs and Kernels

In Chapter 2, we created a simple example that executed a trivial parallel OpenCL kernel on a device. In that example, a kernel object and a program object were created in order to facilitate execution on the device. Program and kernel objects are fundamental in working with OpenCL, and in this chapter we cover these objects in more detail. Specifically, this chapter covers

• Program and kernel object overview

• Creating program objects and building programs

• Program build options

• Creating kernel objects and setting kernel arguments

• Source versus binary program creation

• Querying kernel and program objects

Program and Kernel Object Overview

Two of the most important objects in OpenCL are kernel objects and program objects. OpenCL applications express the functions that will execute in parallel on a device as kernels. Kernels are written in the OpenCL C language (as described in Chapter 4) and are delineated with the __kernel qualifier. In order to be able to pass arguments to a kernel function, an application must create a kernel object. Kernel objects can be operated on using API functions that allow for setting the kernel arguments and querying the kernel for information.

Kernel objects are created from program objects. Program objects contain collections of kernel functions that are defined in the source code of a program. One of the primary purposes of the program object is to facilitate the compilation of the kernels for the devices to which the program is attached. Additionally, the program object provides facilities for determining build errors and querying the program for information.

An analogy that may be helpful in understanding the distinction between kernel objects and program objects is that the program object is like a dynamic library in that it holds a collection of kernel functions. The kernel object is like a handle to a function within the dynamic library. The program object is created from either source code (OpenCL C) or a compiled program binary (more on this later). The program gets built for any of the devices to which the program object is attached. The kernel object is then used to access properties of the compiled kernel function, enqueue calls to it, and set its arguments.

Program Objects

The first step in working with kernels and programs in OpenCL is to create and build a program object. The next sections will introduce the mechanisms available for creating program objects and how to build programs. Further, we detail the options available for building programs and how to query the program objects for information. Finally, we discuss the functions available for managing the resources used by program objects.

Creating and Building Programs

Program objects can be created either by passing in OpenCL C source code text or with a program binary. Creating program objects from OpenCL C source code is typically how a developer would create program objects. The source code to the OpenCL C program would be in an external file (for example, a .cl file as in our example code), and the application would create the program object from the source code using the clCreateProgramWithSource() function. Another alternative is to create the program object from a binary that has been precompiled for the devices. This method is discussed later in the chapter; for now we show how to create a program object from source using clCreateProgramWithSource():

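cl_program clCreateProgramWithSource(cl_context context,
                                     cl_uint count,
                                     const char **strings,
                                     const size_t *lengths,
                                     cl_int *errcode_ret)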

Calling clCreateProgramWithSource() will cause a new program object to be created using the source code passed in. The return value is a new program object attached to the context. Typically, the next step after calling clCreateProgramWithSource() would be to build the program object using clBuildProgram():

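cl_int clBuildProgram(cl_program program,
                      cl_uint num_devices,
                      const cl_device_id *device_list,
                      const char *options,
                      void (CL_CALLBACK *pfn_notify)(cl_program program,
                                                     void *user_data),
                      void *user_data)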

Invoking clBuildProgram() will cause the program object to be built for the list of devices that it was called with (or all devices attached to the context if no list is specified). This step is essentially equivalent to invoking a compiler/linker on a C program. The options parameter contains a string of build options, including preprocessor defines and various optimization and code generation options (e.g., -DUSE_FEATURE=1 -cl-mad-enable). These options are described at the end of this section in the “Program Build Options” subsection. The executable code is stored inside the program object for every device for which it was compiled. The clBuildProgram() function will return CL_SUCCESS if the program was successfully built for all devices; otherwise an error code will be returned. If there was a build error, the detailed build log can be retrieved by calling clGetProgramBuildInfo() with a param_name of CL_PROGRAM_BUILD_LOG.

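cl_int clGetProgramBuildInfo(cl_program program,
                             cl_device_id device,
                             cl_program_build_info param_name,
                             size_t param_value_size,
                             void *param_value,
                             size_t *param_value_size_ret)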
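Listing 6.1 below reads the build log into a fixed 16 KB buffer. When the log may be longer, a common alternative is to call clGetProgramBuildInfo() twice: once with a NULL buffer to learn the required size, then again to fetch the text. The following is a minimal sketch of that two-call pattern, not code from the HelloWorld example. The helper name QueryString and the shape of its query callback are assumptions made for illustration; the callback stands in for the last three parameters of clGetProgramBuildInfo() so that the sketch compiles without the OpenCL headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: "ask for the size, then ask for the data".
// `query` is any callable shaped like
//     int query(size_t bufSize, void* buf, size_t* sizeRet)
// that returns 0 on success. This mirrors the last three parameters
// of clGetProgramBuildInfo(), whose success code CL_SUCCESS is 0.
template <typename Query>
bool QueryString(Query query, std::string& out)
{
    // First call: no buffer, only ask how many bytes are needed.
    size_t size = 0;
    if (query(0, nullptr, &size) != 0)
        return false;

    // Second call: fetch the text into a buffer of that size.
    std::vector<char> buffer(size + 1, '\0');   // +1 guarantees termination
    if (size > 0 && query(size, buffer.data(), nullptr) != 0)
        return false;

    assert(buffer.back() == '\0');
    out.assign(buffer.data());                  // copy up to the first '\0'
    return true;
}

// With the OpenCL headers included, the build log could be read as:
//
//   std::string log;
//   QueryString([&](size_t n, void* p, size_t* r) {
//       return clGetProgramBuildInfo(program, device,
//                                    CL_PROGRAM_BUILD_LOG, n, p, r);
//   }, log);
```

The same size-then-data pattern applies to the other clGet*Info() queries used in this chapter that return variable-length results.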

Putting it all together, the code in Listing 6.1 (from the HelloWorld example in Chapter 2) demonstrates how to create a program object from source, build it for all attached devices, and query the build results for a single device.

Listing 6.1 Creating and Building a Program Object


cl_program CreateProgram(cl_context context, cl_device_id device,
                         const char* fileName)
{
    cl_int errNum;
    cl_program program;

    ifstream kernelFile(fileName, ios::in);
    if (!kernelFile.is_open())
    {
       cerr << "Failed to open file for reading: " << fileName <<
                endl;
       return NULL;
    }

    ostringstream oss;
    oss << kernelFile.rdbuf();

    string srcStdStr = oss.str();
    const char *srcStr = srcStdStr.c_str();
    program = clCreateProgramWithSource(context, 1,
                                        (const char**)&srcStr,
                                        NULL, NULL);
    if (program == NULL)
    {
        cerr << "Failed to create CL program from source." << endl;
        return NULL;
    }

    errNum = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    if (errNum != CL_SUCCESS)
    {
        // Determine the reason for the error
        char buildLog[16384];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(buildLog), buildLog, NULL);

        cerr << "Error in kernel: " << endl;
        cerr << buildLog;
        clReleaseProgram(program);
        return NULL;
    }

    return program;
}


Program Build Options

As described earlier in this section, clBuildProgram() takes as an argument a string (const char *options) that controls several types of build options:

• Preprocessor options

• Floating-point options (math intrinsics)

• Optimization options

• Miscellaneous options

Much like a C or C++ compiler, OpenCL has a wide range of options that control the behavior of program compilation. The OpenCL program compiler has a preprocessor, and it is possible to define options to the preprocessor within the options argument to clBuildProgram(). Table 6.1 lists the options that can be specified to the preprocessor.

Table 6.1 Preprocessor Build Options

image

One note about defining preprocessor variables is that the kernel function signatures for a program object must be the same for all of the devices for which the program is built. Take, for example, the following kernel source:

#ifdef SOME_MACRO
__kernel void my_kernel(__global const float* p) {
      // ...
}

#else // !SOME_MACRO

__kernel void my_kernel(__global const int* p) {
      // ...
}

#endif // !SOME_MACRO

In this example, the my_kernel() function signature differs based on whether SOME_MACRO is defined (its argument is either a __global const float* or a __global const int*). This, in and of itself, is not a problem. However, if we invoke clBuildProgram() separately for each device on the same program object, once passing -D SOME_MACRO for one device and once leaving SOME_MACRO undefined for another, the program will contain a kernel with two different function signatures, and this will fail. It is acceptable to pass different preprocessor directives that affect how the program is built for each device, but not in a way that changes the kernel function signatures: they must be the same for every device for which a single program object is built.
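Because clBuildProgram() receives all of its options as one space-separated string, applications that vary preprocessor defines per device often assemble that string at run time. The sketch below shows one way to do so. MakeBuildOptions is a hypothetical helper, not part of OpenCL, and the macro names in the note that follows are placeholders.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: builds the options string for clBuildProgram()
// from a list of (name, value) preprocessor defines plus any other flags.
// An empty value produces "-D NAME"; otherwise "-D NAME=VALUE".
std::string MakeBuildOptions(
    const std::vector<std::pair<std::string, std::string> >& defines,
    const std::string& otherFlags)
{
    std::string options;
    for (size_t i = 0; i < defines.size(); ++i)
    {
        assert(!defines[i].first.empty());   // a define needs a name
        if (!options.empty())
            options += ' ';
        options += "-D " + defines[i].first;
        if (!defines[i].second.empty())
            options += "=" + defines[i].second;
    }
    if (!otherFlags.empty())
    {
        if (!options.empty())
            options += ' ';
        options += otherFlags;
    }
    return options;
}
```

For example, the defines (USE_FEATURE, 1) and (SOME_MACRO, empty) combined with -cl-mad-enable yield the string -D USE_FEATURE=1 -D SOME_MACRO -cl-mad-enable, whose c_str() can be passed as the options argument. If different devices receive different strings, the defines should only alter kernel bodies, never kernel signatures.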

The OpenCL program compiler also has options that control the behavior of floating-point math. These options are described in Table 6.2 and, like the preprocessor options, can be specified in the options argument to clBuildProgram().

Table 6.2 Floating-Point Options (Math Intrinsics)

image

It is possible to also control the optimizations that the OpenCL C compiler is allowed to make. These options are listed in Table 6.3.

Table 6.3 Optimization Options

image

image

Finally, Table 6.4 lists the last set of miscellaneous options accepted by the OpenCL C compiler.

Table 6.4 Miscellaneous Options

image

Creating Programs from Binaries

An alternative to creating program objects from source is to create a program object from binaries. A program binary is a compiled version of the source code for a specific device. The data format of a program binary is opaque. That is, there is no standardized format for the contents of the binary. An OpenCL implementation could choose to store an executable version of the program in the binary, or it might choose to store an intermediate representation that can be converted into the executable at runtime.

Because program binaries have already been compiled (either partially to intermediate representation or fully to an executable), loading them will be faster and require less memory, thus reducing the load time of your application. Another advantage to using program binaries is protection of intellectual property: you can generate the program binaries at installation time and never store the original OpenCL C source code on disk. A typical application scenario would be to generate program binaries at either install time or first run and store the binaries on disk for later loading. The way program binaries are generated is by building the program from source using OpenCL and then querying back for the program binary. To get a program binary back from a built program, you would use clGetProgramInfo():

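cl_int clGetProgramInfo(cl_program program,
                        cl_program_info param_name,
                        size_t param_value_size,
                        void *param_value,
                        size_t *param_value_size_ret)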

After querying the program object for its binaries, the binaries can then be stored on disk for future runs. The next time the program is run, the program object can be created using clCreateProgramWithBinary():

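cl_program clCreateProgramWithBinary(cl_context context,
                                     cl_uint num_devices,
                                     const cl_device_id *device_list,
                                     const size_t *lengths,
                                     const unsigned char **binaries,
                                     cl_int *binary_status,
                                     cl_int *errcode_ret)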

The example HelloBinaryWorld demonstrates how to create a program from binaries. This is a modification of the HelloWorld example from Chapter 2. The difference is that the HelloBinaryWorld example for this chapter will attempt to retrieve the program binary the first time the application is run and store it to HelloWorld.cl.bin. On future executions, the application will load the program from this generated binary. The main logic that performs this caching is provided in Listing 6.2 from the main() function of HelloBinaryWorld.

Listing 6.2 Caching the Program Binary on First Run


program = CreateProgramFromBinary(context, device,
                                  "HelloWorld.cl.bin");
if (program == NULL)
{

    program = CreateProgram(context, device,
                            "HelloWorld.cl");
    if (program == NULL)
    {
        Cleanup(context, commandQueue, program,
                kernel, memObjects);
        return 1;
    }


    if (SaveProgramBinary(program, device, "HelloWorld.cl.bin")
                          == false)
    {
        std::cerr << "Failed to write program binary"
                  << std::endl;
        Cleanup(context, commandQueue, program,
                kernel, memObjects);
        return 1;
    }
}
else
{
    std::cout << "Read program from binary." << std::endl;
}


First let’s take a look at SaveProgramBinary(), which is the function that queries for and stores the program binary. This function assumes that the program object was already created and built from source. The code for SaveProgramBinary() is provided in Listing 6.3. The function first calls clGetProgramInfo() to query for the number of devices attached to the program. Next it retrieves the device IDs associated with each of the devices. After getting the list of devices, the function then retrieves the size of each of the program binaries for every device along with the program binaries themselves. After retrieving all of the program binaries, the function loops over the devices and finds the one that was passed as an argument to SaveProgramBinary(). This program binary is finally written to disk using fwrite() to the file HelloWorld.cl.bin.

Listing 6.3 Querying for and Storing the Program Binary


bool SaveProgramBinary(cl_program program, cl_device_id device,
                       const char* fileName)
{
    cl_uint numDevices = 0;
    cl_int errNum;

    // 1 - Query for number of devices attached to program
    errNum = clGetProgramInfo(program, CL_PROGRAM_NUM_DEVICES,
                              sizeof(cl_uint),
                              &numDevices, NULL);
    if (errNum != CL_SUCCESS)
    {
        std::cerr << "Error querying for number of devices."
                  << std::endl;
        return false;
    }

    // 2 - Get all of the Device IDs
    cl_device_id *devices = new cl_device_id[numDevices];
    errNum = clGetProgramInfo(program, CL_PROGRAM_DEVICES,
                              sizeof(cl_device_id) * numDevices,
                              devices, NULL);
    if (errNum != CL_SUCCESS)
    {
        std::cerr << "Error querying for devices." << std::endl;
        delete [] devices;
        return false;
    }

    // 3 - Determine the size of each program binary
    size_t *programBinarySizes = new size_t [numDevices];
    errNum = clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                              sizeof(size_t) * numDevices,
                              programBinarySizes, NULL);
    if (errNum != CL_SUCCESS)
    {
        std::cerr << "Error querying for program binary sizes."
                  << std::endl;
        delete [] devices;
        delete [] programBinarySizes;
        return false;
    }

    unsigned char **programBinaries =
        new unsigned char*[numDevices];
    for (cl_uint i = 0; i < numDevices; i++)
    {
        programBinaries[i] =
            new unsigned char[programBinarySizes[i]];
    }

    // 4 - Get all of the program binaries
    errNum = clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                              sizeof(unsigned char*) * numDevices,
                              programBinaries, NULL);
    if (errNum != CL_SUCCESS)
    {
        std::cerr << "Error querying for program binaries"
                  << std::endl;

        delete [] devices;
        delete [] programBinarySizes;
        for (cl_uint i = 0; i < numDevices; i++)
        {
            delete [] programBinaries[i];
        }
        delete [] programBinaries;
        return false;
    }

    // 5 - Finally store the binaries for the device requested
    //     out to disk for future reading.
    bool saved = false;
    for (cl_uint i = 0; i < numDevices; i++)
    {
        // Store the binary just for the device requested.
        // In a scenario where multiple devices were being used
        // you would save all of the binaries out here.
        if (devices[i] == device)
        {
            FILE *fp = fopen(fileName, "wb");
            if (fp != NULL)
            {
                // Report success only if the whole binary was written
                size_t written = fwrite(programBinaries[i], 1,
                                        programBinarySizes[i], fp);
                saved = (written == programBinarySizes[i]);
                fclose(fp);
            }
            break;
        }
    }

    // Cleanup
    delete [] devices;
    delete [] programBinarySizes;
    for (cl_uint i = 0; i < numDevices; i++)
    {
        delete [] programBinaries[i];
    }
    delete [] programBinaries;
    return saved;
}


There are several important factors that a developer needs to understand about program binaries. The first is that a program binary is valid only for the device with which it was created. The OpenCL implementation itself might choose to store in its binary format either an intermediate representation of the program or the executable code. It is a choice made by the implementation that the application has no way of knowing. It is not safe to assume that a binary will work across other devices unless an OpenCL vendor specifically gives this guarantee. Generally, it is important to recompile the binaries for new devices to be sure of compatibility.
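Since a cached binary is only known to be valid for the device that produced it, one precaution is to make the cache file name depend on the device. The sketch below derives a file name from the device name and driver version, which an application could obtain from clGetDeviceInfo() with CL_DEVICE_NAME and CL_DRIVER_VERSION. The function BinaryCacheFileName and its naming scheme are assumptions for illustration; HelloBinaryWorld itself always uses the fixed name HelloWorld.cl.bin.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Replace anything that is not a letter, digit, '.', '-' or '_' so the
// result is safe to use as part of a file name.
static std::string SanitizeForFileName(const std::string& text)
{
    std::string result;
    for (size_t i = 0; i < text.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        const bool keep = std::isalnum(c) || c == '.' || c == '-' || c == '_';
        result += keep ? static_cast<char>(c) : '_';
    }
    return result;
}

// Hypothetical naming scheme: <source>.<device>.<driver>.bin
std::string BinaryCacheFileName(const std::string& sourceFile,
                                const std::string& deviceName,
                                const std::string& driverVersion)
{
    assert(!sourceFile.empty());
    return sourceFile + "." + SanitizeForFileName(deviceName) + "." +
           SanitizeForFileName(driverVersion) + ".bin";
}
```

With such a scheme, a change of device or driver simply misses the cache, and the application falls back to building from source as in Listing 6.2.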

An example of the program binary that is produced by the NVIDIA OpenCL implementation is provided in Listing 6.4. This listing may look familiar to developers who have worked with CUDA, because NVIDIA stores its program binaries in its proprietary PTX format. Apple and AMD also store binaries in their own formats. None of these binaries should be expected to be compatible across multiple vendors. The PTX format happens to be readable text, but it is perfectly valid for a program binary to consist of bytes that are not human-readable.

Listing 6.4 Example Program Binary for HelloWorld.cl (NVIDIA)


//
// Generated by NVIDIA NVPTX Backend for LLVM
//

.version 2.0
.target sm_13, texmode_independent

// Global Launch Offsets
.const[0] .s32 %_global_num_groups[3];
.const[0] .s32 %_global_size[3];
.const[0] .u32 %_work_dim;
.const[0] .s32 %_global_block_offset[3];
.const[0] .s32 %_global_launch_offset[3];

.const .align 8 .b8 def___internal_i2opi_d[144] = {  0x08, 0x5D,
0x8D, 0x1F, 0xB1, 0x5F, 0xFB, 0x6B, 0xEA, 0x92, 0x52, 0x8A, 0xF7,
0x39, 0x07, 0x3D, 0x7B, 0xF1, 0xE5, 0xEB, 0xC7, 0xBA, 0x27, 0x75,
0x2D, 0xEA, 0x5F, 0x9E, 0x66, 0x3F, 0x46, 0x4F, 0xB7, 0x09, 0xCB,
0x27, 0xCF, 0x7E, 0x36, 0x6D, 0x1F, 0x6D, 0x0A, 0x5A, 0x8B, 0x11,
0x2F, 0xEF, 0x0F, 0x98, 0x05, 0xDE, 0xFF, 0x97, 0xF8, 0x1F, 0x3B,
0x28, 0xF9, 0xBD, 0x8B, 0x5F, 0x84, 0x9C, 0xF4, 0x39, 0x53, 0x83,
0x39, 0xD6, 0x91, 0x39, 0x41, 0x7E, 0x5F, 0xB4, 0x26, 0x70, 0x9C,
0xE9, 0x84, 0x44, 0xBB, 0x2E, 0xF5, 0x35, 0x82, 0xE8, 0x3E, 0xA7,
0x29, 0xB1, 0x1C, 0xEB, 0x1D, 0xFE, 0x1C, 0x92, 0xD1, 0x09, 0xEA,
0x2E, 0x49, 0x06, 0xE0, 0xD2, 0x4D, 0x42, 0x3A, 0x6E, 0x24, 0xB7,
0x61, 0xC5, 0xBB, 0xDE, 0xAB, 0x63, 0x51, 0xFE, 0x41, 0x90, 0x43,
0x3C, 0x99, 0x95, 0x62, 0xDB, 0xC0, 0xDD, 0x34, 0xF5, 0xD1, 0x57,
0x27, 0xFC, 0x29, 0x15, 0x44, 0x4E, 0x6E, 0x83, 0xF9, 0xA2 };
.const .align 4 .b8 def___GPU_i2opi_f[24] = {  0x41, 0x90, 0x43,
0x3C, 0x99, 0x95, 0x62, 0xDB, 0xC0, 0xDD, 0x34, 0xF5, 0xD1, 0x57,
0x27, 0xFC, 0x29, 0x15, 0x44, 0x4E, 0x6E, 0x83, 0xF9, 0xA2 };

.entry hello_kernel
(
      .param .b32 hello_kernel_param_0,
      .param .b32 hello_kernel_param_1,
      .param .b32 hello_kernel_param_2
)
{
      .reg .f32  %f<4>;
      .reg .s32  %r<9>;

_hello_kernel:
      {
      // get_global_id(0)
      .reg .u32   %vntidx;
      .reg .u32   %vctaidx;
      .reg .u32   %vtidx;
      mov.u32     %vntidx, %ntid.x;
      mov.u32     %vctaidx, %ctaid.x;
      mov.u32     %vtidx, %tid.x;
      mad.lo.s32  %r1, %vntidx, %vctaidx, %vtidx;
      .reg .u32   %temp;
      ld.const.u32 %temp, [%_global_launch_offset+0];
      add.u32     %r1, %r1, %temp;
      }

      shl.b32     %r2, %r1, 2;
      ld.param.u32   %r3, [hello_kernel_param_1];
      ld.param.u32   %r4, [hello_kernel_param_0];
      add.s32     %r5, %r4, %r2;
      add.s32     %r6, %r3, %r2;
      ld.param.u32   %r7, [hello_kernel_param_2];
      ld.global.f32  %f1, [%r5];
      ld.global.f32  %f2, [%r6];
      add.rn.f32  %f3, %f1, %f2;
      add.s32     %r8, %r7, %r2;
      st.global.f32  [%r8], %f3;
      ret;
}


On subsequent runs of the application, a binary version of the program will already be stored on disk (in HelloWorld.cl.bin). The HelloBinaryWorld application loads this program from binary as shown in Listing 6.5. At the beginning of CreateProgramFromBinary(), the program binary is loaded from disk. The program object is then created from the program binary for the passed-in device. Finally, after checking for errors, the program is built by calling clBuildProgram() just as would be done for a program that was created from source.

The last step of calling clBuildProgram() may at first seem strange. The program is already in binary format, so why does it need to be rebuilt? The answer stems from the fact that the program binary may or may not contain executable code. If it is an intermediate representation, then OpenCL will still need to compile it into the final executable. Thus, whether a program is created from source or binary, it must always be built before it can be used.

Listing 6.5 Creating a Program from Binary


cl_program CreateProgramFromBinary(cl_context context,
                                   cl_device_id device,
                                   const char* fileName)
{
    FILE *fp = fopen(fileName, "rb");
    if (fp == NULL)
    {
        return NULL;
    }

    // Determine the size of the binary
    size_t binarySize;
    fseek(fp, 0, SEEK_END);
    binarySize = ftell(fp);
    rewind(fp);

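    // Load binary from disk
    unsigned char *programBinary = new unsigned char[binarySize];
    size_t bytesRead = fread(programBinary, 1, binarySize, fp);
    fclose(fp);
    if (bytesRead != binarySize)
    {
        // A short read means the cached file is unusable
        delete [] programBinary;
        return NULL;
    }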

    cl_int errNum = 0;
    cl_program program;
    cl_int binaryStatus;

    program = clCreateProgramWithBinary(context,
                    1,
                    &device,
                    &binarySize,
                    (const unsigned char**)&programBinary,
                    &binaryStatus,
                    &errNum);

    delete [] programBinary;
    if (errNum != CL_SUCCESS)
    {
        std::cerr << "Error loading program binary." << std::endl;
        return NULL;
    }

    if (binaryStatus != CL_SUCCESS)
    {
        std::cerr << "Invalid binary for device" << std::endl;
        // The program object was created, so release it before returning
        clReleaseProgram(program);
        return NULL;
    }

    errNum = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    if (errNum != CL_SUCCESS)
    {
        // Determine the reason for the error
        char buildLog[16384];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(buildLog), buildLog, NULL);

        std::cerr << "Error in program: " << std::endl;
        std::cerr << buildLog << std::endl;
        clReleaseProgram(program);
        return NULL;
    }

    return program;
}


Managing and Querying Programs

To clean up a program after it has been used, the program can be deleted by calling clReleaseProgram(). Internally, OpenCL stores a reference count with each program object. The functions that create objects in OpenCL return the object with an initial reference count of 1. The act of calling clReleaseProgram() will reduce the reference count. If the reference count reaches 0, the program will be deleted.

If the user wishes to manually increase the reference count of the OpenCL program, this can be done using clRetainProgram():

cl_int clRetainProgram(cl_program program)
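Because each create call hands back an object whose reference count is 1, every successful create (and every retain) needs exactly one matching release, including on early-return error paths. One way to make that pairing automatic in C++ is a small scope guard. The class below is a sketch, not part of OpenCL or of the example code; ScopedHandle and its members are hypothetical names, and the release function is supplied by the caller (for a program object it would be clReleaseProgram).

```cpp
#include <cassert>

// Hypothetical RAII holder: calls release(handle) exactly once when the
// holder is destroyed, unless ownership was given up with Detach().
template <typename Handle, typename ReleaseFn>
class ScopedHandle
{
public:
    ScopedHandle(Handle handle, ReleaseFn release)
        : handle_(handle), release_(release), owned_(true) {}

    ~ScopedHandle()
    {
        if (owned_)
            release_(handle_);
    }

    Handle Get() const { return handle_; }

    // Hand the reference back to the caller; the destructor
    // will then not release it.
    Handle Detach()
    {
        assert(owned_);
        owned_ = false;
        return handle_;
    }

private:
    ScopedHandle(const ScopedHandle&);             // non-copyable: one owner,
    ScopedHandle& operator=(const ScopedHandle&);  // one release

    Handle    handle_;
    ReleaseFn release_;
    bool      owned_;
};

// With the OpenCL headers included, usage might look like:
//
//   ScopedHandle<cl_program, decltype(&clReleaseProgram)>
//       guard(program, &clReleaseProgram);
//   if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS)
//       return NULL;            // guard releases the program here
//   return guard.Detach();      // success: caller now owns the reference
```

The guard deliberately forbids copying, since a copy would lead to two releases for a single reference.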

Further, when an application is finished building programs, it can tell the OpenCL implementation that it is finished with the compiler by calling clUnloadCompiler(). An OpenCL implementation can choose to use this notification to unload any resources consumed by the compiler. Doing so may free up some memory used by the OpenCL implementation. If an application calls clBuildProgram() again after calling clUnloadCompiler(), the compiler will be reloaded automatically.

cl_int clUnloadCompiler(void)

Kernel Objects

So far we have been concerned with the creation and management of program objects. As discussed in the previous section, the program object is a container that stores the compiled executable code for each kernel on each device attached to it. In order to actually be able to execute a kernel, we must be able to pass arguments to the kernel function. This is the primary purpose of kernel objects. Kernel objects are containers that can be used to pass arguments to a kernel function that is contained within a program object. The kernel object can also be used to query for information about an individual kernel function.

Creating Kernel Objects and Setting Kernel Arguments

A kernel object is created by passing the name of the kernel function to clCreateKernel():

cl_kernel clCreateKernel(cl_program program,
                         const char *kernel_name,
                         cl_int *errcode_ret)

Once created, arguments can be passed in to the kernel function contained in the kernel object by calling clSetKernelArg():

cl_int clSetKernelArg(cl_kernel kernel,
                      cl_uint arg_index,
                      size_t arg_size,
                      const void *arg_value)

Each parameter in the kernel function has an index associated with it. The first argument has index 0, the second argument has index 1, and so on. For example, given the hello_kernel() in the HelloBinaryWorld example, argument a has index 0, argument b has index 1, and argument result has index 2.

__kernel void hello_kernel(__global const float *a,
                           __global const float *b,
                           __global float *result)
{
    int gid = get_global_id(0);

    result[gid] = a[gid] + b[gid];
}

Each of the parameters to hello_kernel() is a global pointer, and thus the arguments are provided using memory objects (allocated with clCreateBuffer()). The following block of code demonstrates how the kernel arguments are passed for hello_kernel:

kernel = clCreateKernel(program, "hello_kernel", NULL);
if (kernel == NULL)
{
    std::cerr << "Failed to create kernel" << std::endl;
    Cleanup(context, commandQueue, program, kernel, memObjects);
    return 1;
}

// Set the kernel arguments (a, b, result)
errNum  = clSetKernelArg(kernel, 0, sizeof(cl_mem),
                         &memObjects[0]);
errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem),
                         &memObjects[1]);
errNum |= clSetKernelArg(kernel, 2, sizeof(cl_mem),
                         &memObjects[2]);
if (errNum != CL_SUCCESS)
{
    std::cerr << "Error setting kernel arguments." << std::endl;
    Cleanup(context, commandQueue, program, kernel, memObjects);
    return 1;
}

When clSetKernelArg() is called, the argument value that arg_value points to is copied internally by the OpenCL implementation. This means that after calling clSetKernelArg(), it is safe to reuse or modify the memory the pointer referred to. The type of the argument sent in to the kernel is dependent on how the kernel is declared. For example, the following kernel takes a pointer, an integer, a floating-point value, and a local floating-point buffer:

__kernel void arg_example(global int *vertexArray,
                          int vertexCount,
                          float weight,
                          local float* localArray)
{
    ...
}

In this case, the first argument has index 0 and is passed a pointer to a cl_mem object because it is a global pointer. The second argument has index 1 and is passed a cl_int variable because it is an int argument, and likewise the third argument has index 2 and is passed a cl_float. The last argument has index 3 and is a bit trickier as it is qualified with local. Because it is a local argument, its contents are visible only within a work-group. As such, the call to clSetKernelArg() specifies only the size of the argument (in this case tied to the local work size so that there is one element per work-item), and the arg_value is NULL. The arguments would be set using the following calls to clSetKernelArg():

kernel = clCreateKernel(program, "arg_example", NULL);
cl_int vertexCount;
cl_float weight;
cl_mem vertexArray;
size_t localWorkSize[1] = { 32 };  // size_t, as clEnqueueNDRangeKernel() expects

// Create vertexArray with clCreateBuffer, assign values
// to vertexCount and weight
...

errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &vertexArray);
errNum |= clSetKernelArg(kernel, 1, sizeof(cl_int), &vertexCount);
errNum |= clSetKernelArg(kernel, 2, sizeof(cl_float), &weight);
errNum |= clSetKernelArg(kernel, 3,
                         sizeof(cl_float) * localWorkSize[0],
                         NULL);

The arguments that are set on a kernel object are persistent until changed. That is, even after invoking calls that queue the kernel for execution, the arguments will remain persistent.

An alternative to using clCreateKernel() to create kernel objects one kernel function at a time is to use clCreateKernelsInProgram() to create objects for all kernel functions in a program:

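cl_int clCreateKernelsInProgram(cl_program program,
                                cl_uint num_kernels,
                                cl_kernel *kernels,
                                cl_uint *num_kernels_ret)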

The use of clCreateKernelsInProgram() requires calling the function twice: first to determine the number of kernels in the program and next to create the kernel objects. The following block of code demonstrates its use:

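cl_uint numKernels = 0;
// First call: no output array, only ask how many kernels exist
errNum = clCreateKernelsInProgram(program, 0,
                                  NULL, &numKernels);

// Second call: create one kernel object per kernel function
cl_kernel *kernels = new cl_kernel[numKernels];
errNum = clCreateKernelsInProgram(program, numKernels, kernels,
                                  &numKernels);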

Thread Safety

The entire OpenCL API is specified to be thread-safe with one exception: clSetKernelArg(). The fact that the entire API except for a single function is defined to be thread-safe is likely to be an area of confusion for developers. First, let’s define what we mean by “thread-safe” and then examine why it is that clSetKernelArg() is the one exception.

In the realm of OpenCL, what it means for a function to be thread-safe is that an application can have multiple host threads simultaneously call the same function without having to provide mutual exclusion. That is, with the exception of clSetKernelArg(), an application may call the same OpenCL function from multiple threads on the host and the OpenCL implementation guarantees that its internal state will remain consistent.

You may be asking yourself what makes clSetKernelArg() special. It does not on the surface appear to be any different from other OpenCL function calls. The reason that the specification chose to make clSetKernelArg() not thread-safe is twofold:

• clSetKernelArg() is the most frequently called function in the OpenCL API. The specification authors took care to make sure that this function would be as lightweight as possible. Because providing thread safety implies some inherent overhead, it was defined not to be thread-safe to make it as fast as possible.

• In addition to the performance justification, it is hard to construct a reason that an application would need to set kernel arguments for the same kernel object in different threads on the host.

Pay special attention to the emphasis on “for the same kernel object” in the second item. One misinterpretation of saying that clSetKernelArg() is not thread-safe would be that it cannot be called from multiple host threads simultaneously. This is not the case. You can call clSetKernelArg() on multiple host threads simultaneously, just not on the same kernel object. As long as your application does not attempt to call clSetKernelArg() from different threads on the same kernel object, everything should work as expected.
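If an application nevertheless shares one kernel object between host threads, it has to supply the mutual exclusion itself, and the lock needs to cover both setting the arguments and enqueueing the kernel so that another thread cannot change an argument in between. The sketch below models that idea without the OpenCL headers: SharedKernel, its args array, and SetArgsAndLaunch are stand-ins invented for illustration, with comments marking where clSetKernelArg() and clEnqueueNDRangeKernel() would be called.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for a cl_kernel that several host threads want to share.
struct SharedKernel
{
    std::mutex lock;     // one lock per kernel object
    int args[2];         // models the argument values stored in the kernel

    SharedKernel() { args[0] = 0; args[1] = 0; }

    // Models clSetKernelArg(): must not run concurrently on one kernel.
    void SetArg(unsigned index, int value)
    {
        assert(index < 2);
        args[index] = value;
    }
};

// Sets both arguments and launches while holding the kernel's lock, so no
// other thread can overwrite an argument between the set and the launch.
// `launch` is called as launch(arg0, arg1) with the values that are
// current at launch time.
template <typename Launch>
void SetArgsAndLaunch(SharedKernel& kernel, int arg0, int arg1, Launch launch)
{
    std::lock_guard<std::mutex> hold(kernel.lock);

    kernel.SetArg(0, arg0);   // clSetKernelArg(kernel, 0, ...) would go here
    kernel.SetArg(1, arg1);   // clSetKernelArg(kernel, 1, ...) would go here

    launch(kernel.args[0], kernel.args[1]);   // clEnqueueNDRangeKernel(...)
}
```

Giving each host thread its own kernel object, created with a separate call to clCreateKernel(), avoids the lock entirely and is usually the simpler design.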

Managing and Querying Kernels

In addition to setting kernel arguments, it is also possible to query the kernel object to find out additional information. The function clGetKernelInfo() allows querying the kernel for basic information including the kernel function name, the number of arguments to the kernel function, the context, and the associated program object:

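cl_int clGetKernelInfo(cl_kernel kernel,
                       cl_kernel_info param_name,
                       size_t param_value_size,
                       void *param_value,
                       size_t *param_value_size_ret)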

Another important query function available for kernel objects is clGetKernelWorkGroupInfo(). This function allows the application to query the kernel object for information particular to a device. This can be very useful in determining how to break up a parallel workload across the different devices on which a kernel will be executed. The CL_KERNEL_WORK_GROUP_SIZE query returns the maximum work-group size that can be used to execute the kernel on the device. Further, the application can often improve performance by using a work-group size that is a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Additional queries are available for determining the resource utilization of the kernel on the device.

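cl_int clGetKernelWorkGroupInfo(cl_kernel kernel,
                                cl_device_id device,
                                cl_kernel_work_group_info param_name,
                                size_t param_value_size,
                                void *param_value,
                                size_t *param_value_size_ret)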
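As a rough illustration of how the two work-group queries can be combined, the sketch below picks a work-group size that respects the kernel's maximum and, where possible, is a multiple of the preferred value. ChooseWorkGroupSize is a hypothetical helper and its rounding policy is only one reasonable choice; the two limits are assumed to have been read beforehand with clGetKernelWorkGroupInfo().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper. Inputs are assumed to come from
// clGetKernelWorkGroupInfo():
//   maxWorkGroupSize  <- CL_KERNEL_WORK_GROUP_SIZE
//   preferredMultiple <- CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
// Returns a size that exceeds neither `desired` nor the maximum and,
// when possible, is a multiple of preferredMultiple.
size_t ChooseWorkGroupSize(size_t desired,
                           size_t maxWorkGroupSize,
                           size_t preferredMultiple)
{
    assert(maxWorkGroupSize > 0 && preferredMultiple > 0);

    size_t size = std::min(desired, maxWorkGroupSize);
    if (size >= preferredMultiple)
        size -= size % preferredMultiple;   // round down to a multiple

    return std::max<size_t>(size, 1);       // never return 0
}
```

For instance, a desired size of 100 with a maximum of 1024 and a preferred multiple of 32 gives 96. In OpenCL 1.x the global work size must also be evenly divisible by the chosen work-group size, which this sketch does not handle.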

Kernel objects can be released and retained in the same manner as program objects. The object reference count will be decremented by the function clReleaseKernel() and will be released when this reference count reaches 0:

cl_int clReleaseKernel(cl_kernel kernel)

One important consideration regarding the release of kernel objects is that a program object cannot be rebuilt until all of the kernel objects associated with it have been released. Consider this example block of pseudocode:

cl_program program = clCreateProgramWithSource(...);
clBuildProgram(program, ...);
cl_kernel k = clCreateKernel(program, "foo", NULL);

// .. CL API calls to enqueue kernels and other commands ..

clBuildProgram(program, ...); // This call will fail
                              // because the kernel
                              // object "k" above has
                              // not been released.

The second call to clBuildProgram() in this example would fail with a CL_INVALID_OPERATION error because there is still a kernel object associated with the program. In order to be able to build the program again, that kernel object (and any other ones associated with the program object) must be released using clReleaseKernel().

Finally, the reference count can be incremented by one by calling the function clRetainKernel():

cl_int clRetainKernel(cl_kernel kernel)
