Chapter 3

Introduction to OpenCL

Abstract

This chapter provides an introduction to the basics of using the OpenCL standard when developing parallel programs. It describes the different abstraction models defined in the standard and also presents a basic example of an OpenCL program to place some of the abstraction in context. It also provides an example of the OpenCL C++ application programming interface.

Keywords

OpenCL

parallel programming

parallel programming abstractions

heterogeneous computing

vector addition

3.1 Introduction

This chapter introduces OpenCL, the programming fabric that will allow us to weave our application to execute concurrently. Programmers familiar with C and C++ should have little trouble understanding the OpenCL syntax. We begin by reviewing the OpenCL standard.

3.1.1 The OpenCL Standard

OpenCL was refined into an initial proposal by Apple in collaboration with technical teams at AMD, IBM, Qualcomm, Intel, and NVIDIA, and was submitted to the Khronos Group. The initial 1.0 specification was released by the Khronos Group in 2008. OpenCL 1.0 defined the host application programming interface (API) and the OpenCL C kernel language used for executing data-parallel programs on different heterogeneous devices. Follow-up releases of OpenCL 1.1 and OpenCL 1.2 enhanced the OpenCL standard with features such as OpenGL interoperability, additional image formats, synchronization events, and device partitioning. In November 2013, the Khronos Group announced the ratification and public release of the finalized OpenCL 2.0 specification. A number of additional features were added to the OpenCL standard, such as shared virtual memory, nested parallelism, and generic address spaces. These advanced features have the potential to simplify parallel application development, and improve the performance portability of OpenCL applications.

Designers of open programming standards are tasked with a very challenging objective: arrive at a common set of programming standards that are acceptable to a range of competing needs and requirements. The Khronos Group, which manages the OpenCL standard, has done a good job addressing these requirements. It has developed an API that is general enough to run on significantly different architectures while being adaptable enough that each hardware platform can still achieve high performance. Using the core language and correctly following the specification, any program designed for one vendor can execute on another vendor’s hardware. The model set forth by OpenCL creates portable, vendor- and device-independent programs that are capable of being accelerated on many different hardware platforms.

The code that executes on an OpenCL device, which in general is not the same device as the host central processing unit (CPU), is written in the OpenCL C language. OpenCL C is a restricted version of the C99 language with extensions appropriate for executing data-parallel code on a variety of heterogeneous devices. The OpenCL C programming language also implements a subset of the C11 atomics and synchronization operations. While the OpenCL API itself is a C API, there are third-party bindings for many languages, including Java, C++, Python, and .NET. Additionally, a number of popular libraries in domains such as linear algebra and computer vision have integrated OpenCL to leverage heterogeneous platforms and gain substantial performance improvements.

3.1.2 The OpenCL Specification

The OpenCL specification is defined in four parts, which it refers to as models. The models are summarized here, and are explained in detail in the following sections.

1. Platform model: Specifies that there is one host processor coordinating execution, and one or more device processors whose job it is to execute OpenCL C kernels. It also defines an abstract hardware model for devices.

2. Execution model: Defines how the OpenCL environment is configured by the host, and how the host may direct the devices to perform work. This includes defining an environment for execution on the host, mechanisms for host-device interaction, and a concurrency model used when configuring kernels. The concurrency model defines how an algorithm is decomposed into OpenCL work-items and work-groups.

3. Kernel programming model: Defines how the concurrency model is mapped to physical hardware.

4. Memory model: Defines memory object types, and the abstract memory hierarchy that kernels use regardless of the actual underlying memory architecture. It also contains requirements for memory ordering and optional shared virtual memory between the host and devices.

In a typical scenario, we might observe an OpenCL implementation executing on a platform consisting of a host x86 CPU using a graphics processing unit (GPU) device as an accelerator. The host sets up a kernel for the GPU to run and sends a command to the GPU to execute the kernel with some specified degree of parallelism. This is the execution model. The memory for the data used by the kernel is allocated by the programmer to specific parts of an abstract memory hierarchy specified by the memory model. The runtime and driver will map these abstract memory regions to the physical hierarchy. Finally, the GPU creates hardware threads to execute the kernel, and maps them to its hardware units. This is done using the kernel programming model. Throughout this chapter, these ideas are discussed in further detail.

This chapter begins by introducing the OpenCL models, including the OpenCL API related to each model. Once the OpenCL host API has been described, it is demonstrated using a vector addition program. The full listing of the vector addition program is given at the end of the chapter in Section 3.6. The same vector addition program is then used to illustrate the OpenCL C++ API, and a comparison of an OpenCL program with a CUDA program.

3.2 The OpenCL Platform Model

An OpenCL platform consists of a host connected to one or more OpenCL devices. The platform model defines the roles of the host and the devices, and provides an abstract hardware model for devices. A device is divided into one or more compute units, which are further divided into one or more processing elements. A diagram of these concepts is provided in Figure 3.1.

Figure 3.1 An OpenCL platform with multiple compute devices. Each compute device contains one or more compute units. A compute unit is composed of one or more processing elements (PEs). A system could have multiple platforms present at the same time. For example, a system could have an AMD platform and an Intel platform present at the same time.

The platform model is key to application development for portability between OpenCL-capable systems. Even within a single system, there may be a number of different OpenCL platforms, any of which could be targeted by a given application. The platform model’s API allows an OpenCL application to adapt to and choose the desired platform and compute device for executing its computation.

In the API, a platform can be thought of as a common interface for interacting with a vendor-specific OpenCL runtime. The devices that a platform can target are thus limited to those with which a vendor knows how to interact. For example, if company A’s platform is chosen, it likely will not be able to communicate with company B’s GPU. However, platforms are not necessarily vendor exclusive. For example, implementations from AMD and Intel should be able to create platforms that target each other’s x86 CPUs as devices.

The platform model also presents an abstract device architecture that programmers target when writing OpenCL C code. Vendors map this abstract architecture to the physical hardware. The platform model defines a device as a group of multiple compute units, where each compute unit is functionally independent. Compute units are further divided into processing elements. Figure 3.1 illustrates this hierarchical model. As an example, the AMD Radeon R9 290X graphics card (device) comprises 44 vector processors (compute units). Each compute unit has four 16-lane SIMD engines, for a total of 64 lanes (processing elements). Each SIMD lane on the Radeon R9 290X executes a scalar instruction. This allows the GPU device to execute a total of 44 × 16 × 4 = 2816 instructions at a time.

3.2.1 Platforms and Devices

The API call clGetPlatformIDs() is used to discover the set of available OpenCL platforms for a given system. The most robust code will call clGetPlatformIDs() twice when querying the system for OpenCL platforms. The first call to clGetPlatformIDs() passes an unsigned integer pointer as the num_platforms argument and NULL for the platforms argument. The pointer is populated with the available number of platforms. The programmer can then allocate space (pointed to by platforms) to hold the platform objects (of type cl_platform_id). For the second call to clGetPlatformIDs(), the platforms pointer is passed to the implementation with enough space allocated for the desired number (num_entries) of platforms. After platforms have been discovered, the clGetPlatformInfo() API call can be used to determine which implementation (vendor) the platform was defined by. This API call, and all further API functions discussed in this chapter, are illustrated in the vector addition source code listing in Section 3.6.

cl_int
clGetPlatformIDs(
 cl_uint num_entries,
 cl_platform_id *platforms,
 cl_uint *num_platforms)

Once a platform has been selected, the next step is to query the devices available to that platform. The API call to do this is clGetDeviceIDs(), and the procedure for discovering devices is similar to clGetPlatformIDs(). The call to clGetDeviceIDs() takes the additional arguments of a platform and a device type, but otherwise the same three-step process occurs: discovery of the quantity of devices, allocation, and retrieval of the desired number of devices. The device_type argument can be used to limit the devices to GPUs only (CL_DEVICE_TYPE_GPU), CPUs only (CL_DEVICE_TYPE_CPU), all devices (CL_DEVICE_TYPE_ALL), as well as other options. The same option should be used for both calls to clGetDeviceIDs(). As with platforms, the clGetDeviceInfo() API call is used to retrieve information such as name, type, and vendor from each device.

cl_int
clGetDeviceIDs(
 cl_platform_id platform,
 cl_device_type device_type,
 cl_uint num_entries,
 cl_device_id *devices,
 cl_uint *num_devices)

The CLInfo program in the AMD accelerated parallel processing (APP) software development kit (SDK) uses clGetPlatformInfo() and clGetDeviceInfo() to print detailed information about the OpenCL-supported platforms and devices in a system. Hardware details such as memory sizes and bus widths are available using these commands, and the rest of the properties should become clear after completion of this chapter. A snippet of the output from CLInfo is shown in Figure 3.2.

Figure 3.2 Some of the output from the CLInfo program showing the characteristics of an OpenCL platform and devices. We see that the AMD platform has two devices (a CPU and a GPU). The output shown here can be queried using functions from the platform API.

3.3 The OpenCL Execution Model

The OpenCL platform model allows us to build a topology of a system with a coordinating host processor, and one or more devices that will be targeted to execute our OpenCL kernels. In order for the host to request that a kernel be executed on a device, a context must be configured that enables the host to pass commands and data to the device.

3.3.1 Contexts

In OpenCL, a context is an abstract environment within which coordination and memory management for kernel execution is valid and well defined. A context coordinates the mechanisms for host-device interaction, manages the memory objects available to the devices, and keeps track of the programs and kernels that are created for each device. The API function to create a context is clCreateContext().

cl_context
clCreateContext(
 const cl_context_properties *properties,
 cl_uint num_devices,
 const cl_device_id *devices,
 void (CL_CALLBACK *pfn_notify)(
  const char *errinfo,
  const void *private_info,
  size_t cb,
  void *user_data),
 void *user_data,
 cl_int *errcode_ret)

The properties argument is used to restrict the scope of the context. It may provide a specific platform, enable graphics interoperability, or enable other parameters in the future. Limiting the scope of a context to a given platform allows the programmer to provide contexts for multiple platforms and fully utilize a system comprising resources from a mixture of vendors. Next, the devices that the programmer wants to use with the context must be supplied. A user callback can also be provided when a programmer is creating a context, and can be used to report additional error information that might be generated throughout its lifetime.

OpenCL also provides a different API call for creating a context that alleviates the need to build a list of devices. The call clCreateContextFromType() allows a programmer to create a context that automatically includes all devices of the specified type (e.g. CPUs, GPUs, and all devices). After creation of a context, the function clGetContextInfo() can be used to query information such as the number of devices present and the device objects. In OpenCL, the process of discovering platforms and devices and setting up a context can be tedious. However, after the code to perform these steps has been written, it can be reused for almost any project.
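
As a brief sketch of this alternative call (assuming a platform object and a status variable as discovered earlier; names are illustrative), the properties list restricts the context to one platform and the type argument selects all of its GPU devices:

// Restrict the context to one platform and include all of its GPU devices
cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0};

cl_context context = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);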

3.3.2 Command-Queues

The execution model specifies that devices perform tasks based on commands which are sent from the host to the device. Actions specified by commands include executing kernels, performing data transfers, and performing synchronization. It is also possible for a device to send certain commands to itself, which is discussed later in the chapter.

A command-queue is the communication mechanism that the host uses to request action by a device. Once the host has decided which devices to work with and a context has been created, one command-queue needs to be created per device. Each command-queue is associated with only one device—this is required because the host needs to be able to submit commands to a specific device when multiple devices are present in the context. Whenever the host needs an action to be performed by a device, it will submit commands to the proper command-queue. The API call clCreateCommandQueueWithProperties() is used to create a command-queue and associate it with a device.

cl_command_queue
clCreateCommandQueueWithProperties(
 cl_context context,
 cl_device_id device,
 const cl_queue_properties *properties,
 cl_int *errcode_ret)

The properties parameter of clCreateCommandQueueWithProperties() is a zero-terminated list of property names and values. The CL_QUEUE_PROPERTIES property is a bit field that is used to enable profiling of commands (CL_QUEUE_PROFILING_ENABLE) and/or to allow out-of-order execution of commands (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE). Both are discussed in Chapter 5.

For in-order command-queues (the default), commands are pulled from the queue in the order they are received. Out-of-order command-queues allow the OpenCL implementation to search for commands that can be rearranged to execute more efficiently. If out-of-order command-queues are used, it is up to the user to specify dependencies that enforce a correct execution order.
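
For example, a queue with both options enabled might be requested as follows (a sketch; context, device, and status are assumed to be declared as elsewhere in this chapter):

cl_queue_properties props[] = {
    CL_QUEUE_PROPERTIES,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE,
    0};

cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, props, &status);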

Any API call that submits a command to a command-queue will begin with clEnqueue and require a command-queue as a parameter. For example, the clEnqueueReadBuffer() call requests that the device send data to the host, and clEnqueueNDRangeKernel() requests that a kernel be executed on the device. These calls will be discussed in detail later in this chapter.

In addition to API calls that submit commands to command-queues, OpenCL includes barrier operations that can be used to synchronize execution of command-queues. The API calls clFlush() and clFinish() are barrier operations for a command-queue. The clFinish() call blocks execution of the host thread until all of the commands in a command-queue have completed execution; its functionality is synonymous with a synchronization barrier. The clFlush() call blocks execution until all of the commands in a command-queue have been removed from the queue. This means that the commands will definitely be submitted to the device, but will not necessarily have completed execution. Each API call requires only the desired command-queue as an argument.

cl_int clFlush(cl_command_queue command_queue);

cl_int clFinish(cl_command_queue command_queue);

3.3.3 Events

In the OpenCL API, objects called events are used to specify dependencies between commands. As we discuss the various clEnqueue API calls, you will notice that all of them have three parameters in common: a pointer to a list of events that specify dependencies for the current command, the number of events in the wait list, and a pointer to an event that will represent the execution of the current command. The returned event can in turn be used to specify a dependency for future events. The array of events used to specify dependencies for a command is referred to as a wait list. Specifying dependencies with events is detailed in Chapter 5.

In addition to providing dependencies, events enable the execution status of a command to be queried at any time. As the event makes its way through the execution process, its status is updated by the implementation. The command will have one of six possible states:

• Queued: The command has been placed into a command-queue.

• Submitted: The command has been removed from the command-queue and has been submitted for execution on the device.

• Ready: The command is ready for execution on the device.

• Running: Execution of the command has started on the device.

• Ended: Execution of the command has finished on the device.

• Complete: The command and all of its child commands have finished.

The concept of child commands is related to device-side enqueuing, and is discussed in the next section. Successful completion is indicated when the event status associated with a command is set to CL_COMPLETE. Unsuccessful completion results in abnormal termination of the command, which is indicated by setting the event status to a negative value. In this case, the command-queue associated with the abnormally terminated command and all other command-queues in the same context may no longer be available. Querying an event’s status is done using the API call clGetEventInfo().

In addition to supplying dependencies between commands as they are enqueued, the API also includes the function clWaitForEvents(), which causes the host to wait for all events specified in the wait list to complete execution.

cl_int
clWaitForEvents(
 cl_uint num_events,
 const cl_event *event_list)
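
As a brief sketch of these mechanisms (variable names follow the vector addition example later in this chapter), an event can chain a nonblocking buffer write to a kernel launch, after which the host waits on the kernel’s event:

size_t globalSize[1] = {1024};
cl_event writeDone, kernelDone;

// Nonblocking write; writeDone tracks the command's status
status = clEnqueueWriteBuffer(cmdQueue, bufA, CL_FALSE, 0, datasize, A, 0, NULL, &writeDone);

// The kernel's wait list ensures it cannot start until the write has completed
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalSize, NULL, 1, &writeDone, &kernelDone);

// Block the host until the kernel has finished
status = clWaitForEvents(1, &kernelDone);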

3.3.4 Device-Side Enqueuing

Until now, we have described the execution model in terms of a master-worker paradigm where the host (master) sends commands to the device (worker). This execution model provides a simple paradigm for coordinating execution between the host and the device. However, in many cases the amount of work that has to be dispatched cannot be determined statically—especially in algorithms where each stage is dependent on the previous one. For example, in a combinatorial optimization application, the size of the search region may define the number of work-groups required. However, the size of the region may only be known from the previous iteration. In previous versions of OpenCL, this situation would require communication from the device to the host in order to appropriately set up the dimensions of the next kernel. To remove this requirement and potentially improve performance, OpenCL 2.0 provides a new feature in the execution model known as device-side enqueuing.

A kernel executing on a device now has the ability to enqueue another kernel into a device-side command-queue (shown in Figure 3.5). In this scenario, the kernel currently executing on a device is referred to as the parent kernel, and the kernel that is enqueued is known as the child kernel. Parent and child kernels execute asynchronously, although a parent kernel is not registered as complete until all its child kernels have completed. We can check that a parent kernel has completed execution when its event object is set to CL_COMPLETE. The device-side command-queue is an out-of-order command-queue, and follows the same behavior as the out-of-order command-queues exposed to the host. Commands enqueued to a device-side command-queue generate and use events to enforce dependencies just as commands in the host-side command-queues do. These events, however, are visible only to the parent kernel running on the device. Device-side enqueuing is discussed in more detail in Chapter 5.

Figure 3.3 Vector addition algorithm showing how each element can be added independently.
Figure 3.4 The hierarchical model used for creating an NDRange of work-items, grouped into work-groups.
Figure 3.5 The OpenCL runtime shown denotes an OpenCL context with two compute devices (a CPU device and a GPU device). Each compute device has its own command-queues. Host-side and device-side command-queues are shown. The device-side queues are visible only from kernels executing on the compute device. The memory objects have been defined within the memory model.

3.4 Kernels and the OpenCL Programming Model

The execution model API enables an application to manage the execution of OpenCL commands. The OpenCL commands describe the movement of data and the execution of kernels that process this data to perform some meaningful task. OpenCL kernels are the parts of an OpenCL application that actually execute on a device. Like many CPU concurrency models, an OpenCL kernel is syntactically similar to a standard C function; the key differences are a set of additional keywords and the concurrency model that OpenCL kernels implement.

When developing concurrent programs for a CPU using operating system threading APIs or OpenMP, for example, the programmer considers the physical resources available (e.g. CPU cores) and the overhead of creating and switching between threads when their number substantially exceeds the resource availability. With OpenCL, the goal is often to represent parallelism programmatically at the finest granularity possible. The generalization of the OpenCL interface and the low-level kernel language allows efficient mapping to a wide range of hardware.

The following discussion presents three versions of a function that performs an element-wise vector addition: a serial C implementation, a threaded C implementation, and an OpenCL C implementation. The code for a serial C implementation of the vector addition is shown in Listing 3.1 and executes a loop with as many iterations as there are elements to compute. Each loop iteration adds the corresponding locations in the input arrays together and stores the result into the output array. A diagram of the vector addition algorithm is shown in Figure 3.3.

Listing 3.1 Serial vector addition.
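
A minimal serial implementation along the lines of Listing 3.1 might look as follows (a sketch; variable names are illustrative):

// Perform an element-wise addition of A and B and store the result in C.
// There are N elements per array.
void vecadd(int *C, int *A, int *B, int N) {
    for (int i = 0; i < N; ++i) {
        C[i] = A[i] + B[i];
    }
}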

For a simple multicore device, we could either use a low-level coarse-grained threading API, such as Win32 or POSIX threads, or a data-parallel model such as OpenMP. Writing a coarse-grained multithreaded version of the same function would require dividing the work (i.e. loop iterations) between the threads. Because there may be a large number of loop iterations and the work per iteration is small, we would need to chunk the loop iterations into a larger granularity, a technique called strip mining [1]. The code for the multithreaded version may be as in Listing 3.2.

Listing 3.2 Vector addition chunked for coarse-grained parallelism (e.g., POSIX threads on a CPU). The input vector is partitioned among the available cores.
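
A coarse-grained version along the lines of Listing 3.2 might be sketched as follows (assuming the number of elements N is evenly divisible by NUM_THREADS, and that a threading API such as POSIX threads invokes this function once per thread with a unique tid):

// Each thread processes a contiguous chunk of the input vectors
void vecadd_chunked(int *C, int *A, int *B, int N, int NUM_THREADS, int tid) {
    int elementsPerThread = N / NUM_THREADS;
    int start = tid * elementsPerThread;
    for (int i = start; i < start + elementsPerThread; ++i) {
        C[i] = A[i] + B[i];
    }
}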

The unit of concurrent execution in OpenCL C is a work-item. Each work-item executes the kernel function body. Instead of manually strip mining the loop, we will map a single iteration of the loop to a work-item. We tell the OpenCL runtime to generate as many work-items as elements in the input and output arrays and allow the runtime to map those work-items to the underlying hardware, and hence CPU or GPU cores, in whatever way it deems appropriate. Conceptually, this is very similar to the parallelism inherent in a functional “map” operation (cf. MapReduce) or a data-parallel for loop in OpenMP. When an OpenCL device begins executing a kernel, it provides intrinsic functions that allow a work-item to identify itself. In the following code, the call to get_global_id(0) allows the programmer to make use of the position of the current work-item to access a unique element in the array. The parameter “0” to the get_global_id() function assumes that we have specified a one-dimensional configuration of work-items, and therefore only need its ID in the first dimension.
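
An OpenCL C kernel along these lines (matching the vecadd kernel used by the example program later in this chapter) is:

__kernel void vecadd(__global int *A, __global int *B, __global int *C) {
    int idx = get_global_id(0); // unique global index of this work-item
    C[idx] = A[idx] + B[idx];
}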

Given that OpenCL describes execution in fine-grained work-items and can dispatch vast numbers of work-items on architectures with hardware support for fine-grained threading, it is easy to have concerns about scalability. The hierarchical concurrency model implemented by OpenCL ensures that scalable execution can be achieved even while supporting a large number of work-items. When a kernel is executed, the programmer specifies the number of work-items that should be created as an n-dimensional range (NDRange). An NDRange is a one-, two-, or three-dimensional index space of work-items that will often map to the dimensions of either the input or the output data. The dimensions of the NDRange are specified as an N-element array of type size_t, where N represents the number of dimensions used to describe the work-items being created.

In the vector addition example, our data will be one-dimensional and, assuming that there are 1024 elements, the size can be specified by an array of one, two, or three values. The host code to specify a one-dimensional NDRange for 1024 elements may look like the following:

 size_t indexSpace[3] = {1024, 1, 1};

Achieving scalability comes from dividing the work-items of an NDRange into smaller, equally sized work-groups (Figure 3.4). An index space with N dimensions requires work-groups to be specified using the same N dimensions; thus, a three-dimensional index space requires three-dimensional work-groups. Work-items within a work-group have a special relationship with one another: they can perform barrier operations to synchronize and they have access to a shared memory address space. A work-group’s size is fixed per dispatch, and so communication costs between work-items do not increase for a larger dispatch. The fact that the communication cost between work-items is not dependent on the size of the dispatch allows OpenCL implementations to maintain scalability for larger dispatches.

For the vector addition example, the work-group size might be specified as

 size_t workgroupSize[3] = {64, 1, 1};

If the total number of work-items per array is 1024, this results in the creation of 16 work-groups (1024 work-items/(64 work-items per work-group) = 16 work-groups). For hardware efficiency, the work-group size is usually fixed to a favorable size. In previous versions of the OpenCL specification, the index space dimensions would have to be rounded up to be a multiple of the work-group dimensions. In the kernel code, we would then have to specify that extra work-items in each dimension simply return immediately without outputting any data. However, the OpenCL 2.0 specification allows each dimension of the index space that is not evenly divisible by the work-group size to be divided into two regions: one region where the number of work-items per work-group is as specified by the programmer, and another region of remainder work-groups which have fewer work-items. Since work-group sizes can be nonuniform in multiple dimensions, there are up to four different sizes possible for a two-dimensional NDRange, and up to eight different sizes for a three-dimensional NDRange.

For programs such as vector addition in which work-items behave independently (even within a work-group), OpenCL allows the work-group size to be ignored by the programmer altogether and to be generated automatically by the implementation; in this case, the developer can pass NULL when defining the work-group size.

3.4.1 Compilation and Argument Handling

An OpenCL program is a collection of OpenCL C kernels, functions called by the kernel, and constant data. For example, an algebraic solver application could contain a vector addition kernel, a matrix multiplication kernel, and a matrix transpose kernel within the same OpenCL program. OpenCL source code is compiled at runtime through a series of API calls. Runtime compilation gives the system an opportunity to optimize OpenCL kernels for a specific compute device. Runtime compilation also enables OpenCL kernel source code to run on a previously unknown OpenCL-compatible compute device. There is no need for an OpenCL application to have been prebuilt against the AMD, NVIDIA, or Intel runtimes, for example, if it is to run on compute devices produced by all of these vendors. OpenCL software links only to a common runtime layer called the installable client driver (ICD). All platform-specific activity is delegated to the respective vendor runtime through a dynamic library interface.

The process of creating a kernel from source code is as follows:

1. The OpenCL C source code is stored in a character array. If the source code is stored in a file on a disk, it must be read into memory and stored as a character array.

2. The source code is turned into a program object, cl_program, by calling clCreateProgramWithSource().

3. The program object is then compiled, for one or more OpenCL devices, with clBuildProgram(). If there are compile errors, they will be reported here.

4. A kernel object, cl_kernel, is then created by calling clCreateKernel() and specifying the program object and kernel name.

The final step of obtaining a cl_kernel object is similar to obtaining an exported function from a dynamic library. The name of the kernel that the program exports is used to request it from the compiled program object. The name of the kernel is passed to clCreateKernel(), along with the program object, and the kernel object will be returned if the program object was valid and the particular kernel is found. The relationship between an OpenCL program and OpenCL kernels is shown in Figure 3.5, where multiple kernels can be extracted from an OpenCL program. Each context can have multiple OpenCL programs that have been generated from OpenCL source code.

cl_kernel
clCreateKernel(
 cl_program program,
 const char *kernel_name,
 cl_int *errcode_ret)
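
Putting these calls together, the following sketch compiles a program and, on failure, retrieves the compiler’s build log using clGetProgramBuildInfo() with CL_PROGRAM_BUILD_LOG (variable names follow the example in Section 3.6):

cl_program program = clCreateProgramWithSource(context, 1, (const char **)&programSource, NULL, &status);
status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
if (status != CL_SUCCESS) {
    // Query the size of, and then retrieve, the build log for the first device
    size_t logSize = 0;
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    char *log = (char *)malloc(logSize);
    clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    printf("%s\n", log);
    free(log);
}
cl_kernel kernel = clCreateKernel(program, "vecadd", &status);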

The precise binary representation of an OpenCL kernel object is vendor specific. In the AMD runtime, there are two main classes of devices: x86 CPUs and GPUs. For x86 CPUs, clBuildProgram() generates x86 instructions that can be directly executed on the device. For the GPUs, it will create AMD’s GPU intermediate language, a high-level intermediate language that will be just-in-time compiled for a specific GPU’s architecture later, generating what is often known as instruction set architecture (ISA) code. NVIDIA uses a similar approach, calling its intermediate representation parallel thread execution (PTX). The advantage of using such an intermediate language is to allow the GPU ISA to change from one device or generation to another in what is still a very rapidly developing architectural space.

An additional feature of the build process is the ability to generate both the final binary format and various intermediate representations and serialize them (e.g. write them out to disk). As with most objects, OpenCL provides a function to return information about program objects, clGetProgramInfo(). One of the flags to this function is CL_PROGRAM_BINARIES, which returns a vendor-specific set of binary objects generated by clBuildProgram(). In addition to clCreateProgramWithSource(), OpenCL provides clCreateProgramWithBinary(), which takes a list of binaries that matches its device list. The binaries are previously created using clGetProgramInfo(). Using a binary representation of OpenCL kernels allows OpenCL programs to be distributed without exposing kernel source code as plain text.

Unlike invoking functions in C programs, we cannot simply call a kernel with a list of arguments. Executing a kernel requires dispatching it through an enqueue function. Owing to the syntax of C and the fact that kernel arguments are persistent (and hence we need not repeatedly set them to construct the argument list for such a dispatch), we must specify each kernel argument individually using clSetKernelArg(). This function takes a kernel object, an index specifying the argument number, the size of the argument, and a pointer to the argument. The type information in the kernel parameter list is then used by the runtime to unbox (similar to casting) the data to its appropriate type.

cl_int
clSetKernelArg(
 cl_kernel kernel,
 cl_uint arg_index,
 size_t arg_size,
 const void *arg_value)

3.4.2 Starting Kernel Execution on a Device

Enqueuing a command to a device to begin kernel execution is done with a call to clEnqueueNDRangeKernel(). A command-queue must be specified so the target device is known. The kernel object identifies the code to be executed. Four fields are then related to work-item creation. The work_dim parameter specifies the number of dimensions (one, two, or three) in which work-items will be created. The global_work_size parameter specifies the number of work-items in each dimension of the NDRange, and local_work_size specifies the number of work-items in each dimension of the work-groups. The parameter global_work_offset can be used to provide an offset so that the global IDs of the work-items do not start at zero.

cl_int
clEnqueueNDRangeKernel(
 cl_command_queue command_queue,
 cl_kernel kernel,
 cl_uint work_dim,
 const size_t *global_work_offset,
 const size_t *global_work_size,
 const size_t *local_work_size,
 cl_uint num_events_in_wait_list,
 const cl_event *event_wait_list,
 cl_event *event)

As with all clEnqueue API calls, an event_wait_list is provided, and for non-NULL values the runtime will guarantee that all corresponding events will have completed before the kernel begins execution. Similarly, clEnqueueNDRangeKernel() is asynchronous: it will return immediately after the command is enqueued in the command-queue and likely before the kernel has even started execution. An API call such as clWaitForEvents() or clFinish() can be used to block execution on the host until the kernel completes execution.
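
For example, a one-dimensional dispatch of 1024 work-items that lets the implementation choose the work-group size might look like the following sketch (queue and kernel names are illustrative):

size_t globalSize[1] = {1024};

// NULL for local_work_size lets the runtime pick the work-group dimensions
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalSize, NULL, 0, NULL, NULL);

// Block the host until all enqueued commands have completed
status = clFinish(cmdQueue);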

3.5 OpenCL Memory Model

Memory subsystems differ greatly between computing platforms. To support code portability, OpenCL’s approach is to define an abstract memory model that programmers can target when writing code and vendors can map to their actual memory hardware. Because a platform comprises many units of execution spread across the host and devices, the memory model must define how the values in memory are seen from each of these units of execution. The OpenCL memory model describes the structure of the memory system exposed by an OpenCL platform to the OpenCL program, and allows a programmer to reason about the correctness of OpenCL programs.

The OpenCL memory model tells programmers what they can expect from an OpenCL implementation: which memory operations are guaranteed to happen in which order and which memory values each read operation will return. The memory consistency model in OpenCL is based on the memory model from the ISO C11 programming language. Chapters 6 and 7 are dedicated to the OpenCL memory model, including details on the memory consistency model and shared virtual memory. Here we provide information on types of memory objects that are defined by OpenCL, and the memory regions that make up the abstract memory model. With this information, we will be able to execute our first OpenCL program.

3.5.1 Memory Objects

OpenCL kernels usually require some sort of input data (e.g. arrays or multidimensional matrices) and generate some sort of output data. Before execution can begin, the input data needs to be accessible by the device. In order for data to be transferred to a device, it must first be encapsulated as a memory object. In order for output data to be generated, space must also be allocated and encapsulated as a memory object. OpenCL defines three types of memory objects: buffers, images, and pipes.

Buffers

Buffers are equivalent to arrays in C created using malloc(), where data elements are stored contiguously in memory. Conceptually, it may help to visualize an OpenCL buffer object as a pointer that is valid on a device. The API function clCreateBuffer() allocates space for the buffer and returns a memory object.

cl_mem
clCreateBuffer(
 cl_context context,
 cl_mem_flags flags,
 size_t size,
 void *host_ptr,
 cl_int *errcode_ret)

The clCreateBuffer() API call is similar to malloc in C, or C++’s new operator. Creating a buffer requires supplying the size of the buffer and a context in which the buffer will be allocated; it is visible for all devices associated with the context. Optionally, the caller can supply flags that specify that the data is read only, write only, or read-write. Other flags also exist that specify additional options for creating and initializing a buffer. One simple option is to supply a host pointer with data used to initialize the buffer. We see from the signature that an OpenCL buffer is linked to a context, not a device, so it is the runtime that determines the precise time the data is moved. Buffer movement to and from specific devices is managed by the OpenCL runtime to satisfy data dependencies.
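
For instance, a buffer can be created and initialized from a host array in a single call by combining flags (a sketch; context, datasize, and the host array A follow the example at the end of the chapter):

// Create a read-only buffer and copy the contents of host array A into it
cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, datasize, A, &status);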

Images

Images are OpenCL memory objects that abstract the storage of physical data to allow device-specific optimizations. Unlike buffers, images cannot be directly referenced as if they were arrays. Further, adjacent data elements are not guaranteed to be stored contiguously in memory. The purpose of using images is to allow the hardware to take advantage of spatial locality and to utilize the hardware acceleration available on many devices.

cl_mem
clCreateImage(
 cl_context context,
 cl_mem_flags flags,
 const cl_image_format *image_format,
 const cl_image_desc *image_desc,
 void *host_ptr,
 cl_int *errcode_ret)

Unlike buffers, which do not have a data type or dimensions, an image is created using descriptors that provide specific details to the hardware about the data. The elements of an image are represented by a format descriptor (cl_image_format). The format descriptor specifies how the image elements are stored in memory using the concept of channels. The channel order specifies the number of elements that make up an image element (up to four elements, based on the traditional use of RGBA pixels), and the channel type specifies the size of each element. These elements can be sized anywhere from one to four bytes and in various different formats (e.g. integer or floating point). Other metadata are provided by an image descriptor (cl_image_desc), which includes the type of the image and the dimensions. An example using images is provided in Chapter 4, and the architectural design and trade-offs for images are discussed in detail in Chapters 6 and 7.
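
A minimal sketch of creating a two-dimensional RGBA image follows (width and height are assumed to be defined; error checking omitted):

cl_image_format format;
format.image_channel_order = CL_RGBA;           // four channels per image element
format.image_channel_data_type = CL_UNORM_INT8; // one byte per channel

cl_image_desc desc;
memset(&desc, 0, sizeof(desc)); // zero the unused fields
desc.image_type = CL_MEM_OBJECT_IMAGE2D;
desc.image_width = width;
desc.image_height = height;

cl_mem image = clCreateImage(context, CL_MEM_READ_ONLY, &format, &desc, NULL, &status);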

To support the abstraction provided by images, OpenCL C provides dedicated function calls for reading from and writing to images. The dedicated functions for reading and writing images allow a vendor to optimize image access routines independently from each other and possibly utilize hardware acceleration. Compared with buffers, the image read and write functions take additional parameters and are specific to the image’s data type. For example, the function read_imagef() is used for reading floating-point values and read_imageui() is used for reading unsigned integers. While there are many variations on these function signatures, read accesses usually require at least the coordinates to access and a sampler object. A sampler specifies how out-of-bounds image accesses are handled, whether interpolation should be used, and if coordinates are normalized. Writing to an image requires manual conversion to the proper storage data format (i.e. storing in the proper channel and with the proper size), as well as the destination coordinates.
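
In OpenCL C, such an access might be sketched as follows (the sampler settings shown are one possible choice, not a requirement):

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_pixels(__read_only image2d_t img, __global float4 *out) {
    int2 coord = (int2)(get_global_id(0), get_global_id(1));
    float4 pixel = read_imagef(img, smp, coord); // returns a float4 element
    out[coord.y * get_global_size(0) + coord.x] = pixel;
}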

In previous versions of the OpenCL standard, a kernel was not allowed to both read from and write to a single image. However, OpenCL 2.0 has relaxed this restriction by providing synchronization operations that let programmers safely read and write a single image within a kernel.

Pipes

A pipe memory object is an ordered sequence of data items (referred to as packets) that are stored on the basis of a first in, first out (FIFO) method. A pipe has a write endpoint into which data items are inserted, and a read endpoint from which data items are removed. When creating a pipe using the OpenCL API call clCreatePipe(), one must supply the packet size along with the number of entries in the pipe (i.e. the maximum number of packets that can fit into the pipe at once). The function clGetPipeInfo() can return information about the size of the pipe and the maximum number of packets that can reside in the pipe. The properties argument is reserved for future use, and should be NULL in OpenCL 2.0.

cl_mem
clCreatePipe(
 cl_context context,
 cl_mem_flags flags,
 cl_uint pipe_packet_size,
 cl_uint pipe_max_packets,
 const cl_pipe_properties *properties,
 cl_int *errcode_ret)

At any time, only one kernel may write into a pipe, and only one kernel may read from a pipe. To support the producer-consumer design pattern, one kernel connects to the write endpoint (the producer), while another kernel connects to the read endpoint (the consumer). The same kernel may not be both the writer and the reader for a pipe.

As with images, pipes are opaque data structures that can be accessed only via intrinsic function calls provided by OpenCL C (e.g. read_pipe() and write_pipe()). OpenCL C also provides functions for reserving sections of a pipe to read from and to write to. The intrinsic functions allow pipes to be accessed on a work-group granularity, without otherwise having individual work-items access the pipe and then perform synchronization. Pipes are described in more detail in Chapter 6.
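
A sketch of the producer-consumer pattern with a pipe follows (illustrative; error handling omitted). On the host, a pipe holding up to 1024 int-sized packets is created, and two kernels connect to its endpoints:

// Host side: a pipe of 1024 int-sized packets
cl_mem pipeObj = clCreatePipe(context, 0, sizeof(int), 1024, NULL, &status);

// Kernel side: one kernel writes packets, another reads them
__kernel void producer(__write_only pipe int out) {
    int value = get_global_id(0);
    write_pipe(out, &value);          // insert one packet
}

__kernel void consumer(__read_only pipe int in, __global int *results) {
    int value;
    if (read_pipe(in, &value) == 0) { // 0 indicates a packet was read
        results[get_global_id(0)] = value;
    }
}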

3.5.2 Data Transfer Commands

Before a kernel is executed, it is usually necessary to copy data from a host array into an allocated area of memory that is encapsulated as a memory object. Initializing buffers and images is possible within their respective clCreate calls. The host pointer arguments within the clCreate calls can be used to initialize memory objects with data from host memory. This allows us to initialize a memory object without the need to consider data movement any further. After the memory object is initialized, the runtime is responsible for ensuring that data is moved between devices as required by dependencies.

Despite the runtime’s management of data movement, we will often desire to initiate data transfers manually for performance reasons (described in Chapter 6). Explicit data transfers are also required to retrieve data back to host memory. Therefore, in general, we will often use the explicit data transfer commands to write the data to a device before the first time a memory object is used, and to read the data from a device after the last time it is used. Assuming that our memory object is a buffer, data in host memory is transferred to and from a buffer using calls to clEnqueueWriteBuffer() and clEnqueueReadBuffer(), respectively. If a kernel using a buffer is executed on a device with a discrete memory such as a GPU, the buffer may be transferred to the device when this command executes (e.g. across the PCI Express bus). The API calls for reading from and writing to buffers are very similar. The signature for clEnqueueWriteBuffer() is as follows:

cl_int
clEnqueueWriteBuffer(
 cl_command_queue command_queue,
 cl_mem buffer,
 cl_bool blocking_write,
 size_t offset,
 size_t cb,
 const void *ptr,
 cl_uint num_events_in_wait_list,
 const cl_event *event_wait_list,
 cl_event *event)

In addition to the command-queue, the clEnqueueWriteBuffer function requires the buffer memory object, the number of bytes to transfer, and an offset within the buffer. The combination of offset and number of bytes allows a subset of the buffer data to be written. The blocking_write option should be set to CL_TRUE if the programmer wants the transfer to complete before the function returns—effectively turning the otherwise asynchronous API call into a blocking call. Alternatively, setting blocking_write to CL_FALSE will cause clEnqueueWriteBuffer() to return immediately (likely well before the write operation has completed). Writing to and reading from buffers is shown in the vector addition at the end of the chapter.

3.5.3 Memory Regions

OpenCL classifies memory as either host memory or device memory. Host memory is directly available to the host, and is defined outside OpenCL. Data moves between the host and devices using functions within the OpenCL API or through a shared virtual memory interface. Device memory, in contrast, is memory that is available to executing kernels.

OpenCL divides device memory into four named memory regions as shown in Figure 3.6. These memory regions are relevant within OpenCL kernels. Within a kernel, keywords are associated with each region, and are used to specify where a variable should be created or where the data that it points to resides. Memory regions are logically disjoint, and data movement between different memory regions is controlled by the kernel developer. Each memory region has its own performance characteristics. Owing to these characteristics, accessing data for computation from the right memory region can greatly affect performance.

Figure 3.6 Memory regions and their scope in the OpenCL memory model.

The following provides a short description of each memory region.

• Global memory is visible to all work-items executing a kernel (similarly to the main memory on a CPU-based host system). Whenever data is transferred from the host to the device, the data will reside in global memory. Any data that is to be transferred back from the device to the host must also reside in global memory. The keyword global or __global is added to a pointer declaration to specify that data referenced by the pointer resides in global memory. For example, in the OpenCL C code at the end of the chapter, global int* A denotes that the data pointed to by A resides in global memory (although we will see that the pointer A itself actually resides in private memory).

• Constant memory is not designed for every type of read-only data; rather, it is intended for data where each element is accessed simultaneously by all work-items. Variables whose values never change (e.g. a data variable holding the value of π) also fall into this category. Constant memory is modeled as a part of global memory, so memory objects that are transferred to global memory can be specified as constant. Data is mapped to constant memory by using either the keyword constant or __constant.

• Local memory is a memory that is shared between work-items within a work-group. It is common for local memory to be mapped to on-chip memory, such as software-managed scratchpad memory. As such, accesses may have much shorter latency and much higher bandwidth than global memory. Calling clSetKernelArg() with a size, but no argument, allows local memory to be allocated at runtime. Within an OpenCL C kernel, a kernel parameter that corresponds to local memory is defined as a local or __local pointer (e.g. local int* sharedData). Alternatively, arrays can be statically declared in local memory by prepending the keyword local (e.g. local int sharedData[64]), although this requires specifying the array size at compile time. A brief sketch of both the host-side and the kernel-side mechanics is given after this list.

• Private memory is memory that is unique to an individual work-item. Local variables and nonpointer kernel arguments are private by default. In practice, these variables are usually mapped to registers, although private arrays and any spilled registers are usually mapped to an off-chip (i.e. long-latency) memory.
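
The following sketch (names illustrative) shows the local memory mechanics referenced above: the host reserves 64 integers of local memory for kernel argument 1 by passing a size with a NULL value, and the kernel declares the matching parameter as a local pointer.

// Host side: reserve 64 ints of local memory for kernel argument 1
status = clSetKernelArg(kernel, 1, 64 * sizeof(int), NULL);

// Kernel side: the corresponding parameter is a local pointer
__kernel void stage(__global int *data, __local int *sharedData) {
    sharedData[get_local_id(0)] = data[get_global_id(0)];
    work_group_barrier(CLK_LOCAL_MEM_FENCE); // wait until the whole work-group has written
    // ... work-items may now read one another's elements of sharedData ...
}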

Figure 3.7 details the relationship between OpenCL memory regions and those found on an AMD Radeon HD 7970 GPU.

Figure 3.7 Mapping the OpenCL memory model to an AMD Radeon HD 7970 GPU.

3.5.4 Generic Address Space

In earlier versions of the OpenCL specification, named address spaces sometimes required the creation of multiple versions of callable functions simply to manipulate data from different address spaces. To save programmer effort, a single generic address space was added to OpenCL 2.0, which is closely modeled after the concept of a generic address space used in the Embedded C extensions to the C language (ISO/IEC TR 18037). The generic address space supports conversion of pointers to and from private, local, and global address spaces, and hence lets a programmer write a single function that at compile time can take arguments from any of the three named address spaces. The generic address space is discussed further in Chapter 7.
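
A minimal OpenCL C sketch of this idea (assuming an OpenCL 2.0 compiler; names are illustrative): an unqualified pointer parameter resides in the generic address space, so one helper serves pointers from all three named address spaces.

// An unqualified pointer parameter is in the generic address space,
// so this single helper accepts global, local, or private pointers
int first_element(int *p) {
    return p[0];
}

__kernel void example(__global int *g, __local int *l) {
    int priv[2] = {1, 2};
    l[get_local_id(0)] = 3;
    int sum = first_element(g) + first_element(l) + first_element(priv);
    g[get_global_id(0)] = sum;
}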

3.6 The OpenCL Runtime with an Example

The four OpenCL models discussed in the previous sections are exposed to application developers through the OpenCL runtime API. The platform model is used to enable a host and one or more devices to participate in executing an OpenCL application. The application developer implements their core computation using OpenCL kernels whose execution is defined by the programming model. The computation that a kernel performs manipulates data in a way defined by the memory model. The developer leverages the execution model to submit commands to devices to perform data movement and execute kernels. This section puts all of these ideas together with our first complete OpenCL application.

The main steps to execute a simple OpenCL application are summarized below:

1. Discovering the platform and devices

2. Creating a context

3. Creating a command-queue per device

4. Creating memory objects (buffers) to hold data

5. Copying the input data onto the device

6. Creating and compiling a program from the OpenCL C source code

7. Extracting the kernel from the program

8. Executing the kernel

9. Copying output data back to the host

10. Releasing the OpenCL resources

The following code implements each of the summarized steps. Much of the setup to execute an OpenCL application is generic code that is required to allow implementations to span hardware platforms containing multiple styles of architectures from multiple vendors. Therefore, much of this code can be reused directly on many applications, and potentially abstracted into user-defined functions. The C++ API, shown later, is also less verbose than the C API.

We now discuss each step enumerated above. After this section, a full program listing is provided.

1. Discovering the platform and devices: Before a host can request that a kernel be executed on a device, a platform and a device or devices must be discovered.

cl_int status; // Used for error checking

// Retrieve the number of platforms
cl_uint numPlatforms = 0;
status = clGetPlatformIDs(0, NULL, &numPlatforms);

// Allocate enough space for each platform
cl_platform_id *platforms = NULL;
platforms = (cl_platform_id *)malloc(numPlatforms * sizeof(cl_platform_id));

// Fill in the platforms
status = clGetPlatformIDs(numPlatforms, platforms, NULL);

// Retrieve the number of devices
cl_uint numDevices = 0;
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);

// Allocate enough space for each device
cl_device_id *devices;
devices = (cl_device_id *)malloc(numDevices * sizeof(cl_device_id));

// Fill in the devices
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);

In the complete program listing that follows, we will assume that we are using the first platform and device that are found, which will allow us to reduce the number of function calls required. This will help provide clarity and brevity when viewing the source code.

2. Creating a context: Once the device or devices have been discovered, the context can be configured on the host.

// Create a context that includes all devices
cl_context context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);

3. Creating a command-queue per device: Once the host has decided which devices to work with and a context has been created, one command-queue needs to be created per device (i.e. each command-queue is associated with only one device). The host will ask the device to perform work by submitting commands to the command-queue.

// Only create a command-queue for the first device
cl_command_queue cmdQueue = clCreateCommandQueueWithProperties(context, devices[0], 0, &status);

4. Creating buffers to hold data: Creating a buffer requires supplying the size of the buffer and a context in which the buffer will be allocated; it is visible to all devices associated with the context. Optionally, the caller can supply flags that specify that the data is read only, write only, or read-write. By passing NULL as the fourth argument, we are not initializing the buffer at this step.

// Allocate two input buffers and one output buffer for the three vectors in the vector addition
cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);

5. Copying the input data onto the device: The next step is to copy data from a host pointer to a buffer. The API call takes a command-queue argument, so data will likely be copied directly to the device. By setting the third argument to CL_TRUE, we can ensure that data is copied before the API call returns.

// Write data from the input arrays to the buffers
status = clEnqueueWriteBuffer(cmdQueue, bufA, CL_TRUE, 0, datasize, A, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdQueue, bufB, CL_TRUE, 0, datasize, B, 0, NULL, NULL);

6. Creating and compiling a program from the OpenCL C source code: The vector addition kernel shown in Listing 3.3 is stored in a character array, programSource, and is used to create a program object which is then compiled. When we compile a program, we also supply the information for each device that the program may target.

Listing 3.3 OpenCL vector addition kernel.

// Create a program with source code
cl_program program = clCreateProgramWithSource(context, 1, (const char **)&programSource, NULL, &status);

// Build (compile) the program for the devices
status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);

7. Extracting the kernel from the program: The kernel is created by selecting the desired function from within the program.

// Create the vector addition kernel
cl_kernel kernel = clCreateKernel(program, "vecadd", &status);

8. Executing the kernel: Once the kernel has been created and data has been initialized, the buffers are set as arguments to the kernel. A command to execute the kernel can now be enqueued into the command-queue. Along with the kernel, the command requires specification of the NDRange configuration.

// Set the kernel arguments
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

// Define an index space of work-items for execution.
// A work-group size is not required, but can be used.
size_t indexSpaceSize[1], workGroupSize[1];
indexSpaceSize[0] = datasize / sizeof(int);
workGroupSize[0] = 256;

// Enqueue the kernel for execution
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, indexSpaceSize, workGroupSize, 0, NULL, NULL);

9. Copying output data back to the host: This step reads data back to a pointer on the host.

// Read the device output buffer to the host output array
status = clEnqueueReadBuffer(cmdQueue, bufC, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

10. Releasing resources: Once the kernel has completed execution and the resulting output has been retrieved from the device, the OpenCL resources that were allocated can be freed. This is similar to any C or C++ program where memory allocations, file handles, and other resources are explicitly released by the developer. As shown below, each OpenCL object has its own API calls to release its resources. The OpenCL context should be released last since all OpenCL objects such as buffers and command-queues are bound to a context. This is similar to deleting objects in C++, where member arrays must be freed before the object itself is freed.

clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(cmdQueue);
clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseContext(context);

3.6.1 Complete Vector Addition Listing

The following is the complete listing for the vector addition example. It follows the same steps from the previous section, but uses the first platform and device for simplicity.

3.7 Vector Addition Using an OpenCL C++ Wrapper

The Khronos Group has defined a C++ wrapper API to go with the OpenCL standard. The C++ API corresponds closely to the C API (e.g. cl::Memory maps to cl_mem), but offers the benefits of a high-level language such as classes and exception handling. The following source listing provides a vector addition example that corresponds to the C version in Listing 3.4.

Listing 3.4 OpenCL vector addition using the C API.
Listing 3.5 OpenCL vector addition with the C++ API.
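
Since the full listing is lengthy, the following condensed sketch conveys the flavor of the C++ wrapper (based on the cl.hpp bindings; the inline kernel string and variable names are illustrative, and error handling is omitted):

#include <string>
#include <vector>
#include <CL/cl.hpp>

int main() {
    const int N = 1024;
    std::vector<int> A(N, 1), B(N, 2), C(N);

    // Discover a platform and its devices
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);

    // Context and command-queue for the first device
    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0]);

    // Create buffers and copy the input data to the device
    cl::Buffer bufA(context, CL_MEM_READ_ONLY, N * sizeof(int));
    cl::Buffer bufB(context, CL_MEM_READ_ONLY, N * sizeof(int));
    cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, N * sizeof(int));
    queue.enqueueWriteBuffer(bufA, CL_TRUE, 0, N * sizeof(int), A.data());
    queue.enqueueWriteBuffer(bufB, CL_TRUE, 0, N * sizeof(int), B.data());

    // Build the program and extract the kernel
    std::string src =
        "__kernel void vecadd(__global int *A, __global int *B, __global int *C) {"
        "  int i = get_global_id(0);"
        "  C[i] = A[i] + B[i];"
        "}";
    cl::Program program(context, src, true); // true: build immediately
    cl::Kernel kernel(program, "vecadd");
    kernel.setArg(0, bufA);
    kernel.setArg(1, bufB);
    kernel.setArg(2, bufC);

    // Launch N work-items and read back the result
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange);
    queue.enqueueReadBuffer(bufC, CL_TRUE, 0, N * sizeof(int), C.data());
    return 0;
}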

3.8 OpenCL for CUDA Programmers

NVIDIA’s CUDA C is an API similar to OpenCL. A comparison of OpenCL and CUDA versions of the vector addition example is shown in Listing 3.6. Listing 3.6 shows that OpenCL and CUDA follow a one-to-one mapping for most of their commands. The reason for the additional API calls and function parameters in OpenCL is the fact that platform discovery and program compilation at runtime are required in OpenCL. Since CUDA C targets only NVIDIA’s GPUs, there is only a single platform that can be discovered automatically, and the program compilation step to PTX can be done when the host binary is compiled.

Listing 3.6 Vector addition using the CUDA C API.
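
As a brief sketch of the CUDA side of the comparison (illustrative, not the full Listing 3.6; assumes datasize bytes per array and N elements with N a multiple of 256):

// CUDA C kernel: one thread per element
__global__ void vecadd(int *A, int *B, int *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

// Host side (error checking omitted)
int *dA, *dB, *dC;
cudaMalloc((void **)&dA, datasize);
cudaMalloc((void **)&dB, datasize);
cudaMalloc((void **)&dC, datasize);
cudaMemcpy(dA, A, datasize, cudaMemcpyHostToDevice);
cudaMemcpy(dB, B, datasize, cudaMemcpyHostToDevice);
vecadd<<<N / 256, 256>>>(dA, dB, dC); // grid of N/256 blocks, 256 threads each
cudaMemcpy(C, dC, datasize, cudaMemcpyDeviceToHost);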

With OpenCL, platforms are discovered at runtime, and the program can choose a target device at runtime as well. Program compilation cannot be done prior to runtime because the intermediate language (IL)/ISA of the device that will execute a kernel is unknown. For example, with OpenCL it is perfectly reasonable that a kernel may have been developed and tested on an AMD GPU. However, it would also need to run on an OpenCL-compatible GPU from Intel that has a different ISA. The platform discovery and the runtime compilation of the program makes this possible.

The other major difference between OpenCL and the CUDA C API is that CUDA C provides special operators for kernel launching, with the requirement that code may only be compiled using a toolchain that includes an NVIDIA-supplied preprocessor. The code that the preprocessor generates will end up looking very much like OpenCL code.

3.9 Summary

In this chapter, we provided an introduction to the basics of using the OpenCL standard when developing parallel programs. We have described the different abstraction models defined in the standard and also presented a basic example of an OpenCL program to place some of the abstraction in context.

Reference

[1] Cooper K, Torczon L. Engineering a Compiler. Burlington, MA: Morgan Kaufmann; 2011.
