Foreword

In the last few years, computing has entered the heterogeneous era, which aims to bring together in a single device the best of both central processing units (CPUs) and graphics processing units (GPUs). Designers are creating an increasingly wide range of heterogeneous machines, and hardware vendors are making them broadly available. This change in hardware offers great platforms for exciting new applications. But because these designs are different, classical programming models do not work well on them, and it is important to learn new models such as those in OpenCL.

When the design of OpenCL started, the designers noticed that for one class of algorithms, those that were latency focused (e.g., spreadsheets), developers wrote code in C or C++ and ran it on a CPU, while for a second class, those that were throughput focused (e.g., matrix multiplication), developers often wrote in CUDA and used a GPU. These were two related approaches, but each worked on only one kind of processor: C++ did not run on a GPU, and CUDA did not run on a CPU. Developers had to specialize in one and ignore the other. But the real power of a heterogeneous device is that it can efficiently run applications that mix both classes of algorithms. The question was: how do you program such machines?

One solution is to add new features to the existing platforms; both C++ and CUDA are actively evolving to meet the challenge of new hardware. Another solution was to create a new set of programming abstractions specifically targeted at heterogeneous computing. Apple came up with an initial proposal for such a new paradigm. This proposal was refined by technical teams from many companies and became OpenCL. When the design started, I was privileged to be part of one of those teams. We had a number of goals for the kernel language: (1) let developers write kernels in a single source language; (2) allow those kernels to be functionally portable across CPUs, GPUs, field-programmable gate arrays, and other sorts of devices; (3) be low level, so that developers could tease out all the performance of each device; (4) keep the model abstract enough that the same code would work correctly on machines built by many different companies. And, of course, as with any computing project, we wanted to do all this fast. To speed up implementations, we chose to base the language on C99. In less than 6 months we produced the specification for OpenCL 1.0, and within a year the first implementations appeared. And then time passed and OpenCL met real developers …

So what happened? First, C developers pointed out all the great C++ features (a real memory model, atomics, etc.) that made them more productive, and CUDA developers pointed out all the new features NVIDIA had added to CUDA (e.g., nested parallelism) that made programs both simpler and faster. Second, as hardware architects explored heterogeneous computing, they figured out how to remove the early restrictions requiring CPUs and GPUs to have separate memories. One great hardware change was the development of integrated devices, which provide both a GPU and a CPU on one die (NVIDIA’s Tegra and AMD’s APUs are examples). And third, even though the specification was written with great care and there was a conformance suite, implementers of the compilers did not always read the specification in the same way—sometimes the same program could get a different answer on a different device.

All this led to a revised and more mature specification—OpenCL 2.0.

The new specification is a significant evolution that lets developers take advantage of the new integrated GPU/CPU processors.

The big changes include the following:

• Shared virtual memory—so that host and device code can share complex pointer-based structures such as trees and linked lists, getting rid of the costly data transfers between the host and devices.

• Dynamic parallelism—so that device kernels can launch work on the same device without host interaction, getting rid of significant bottlenecks.

• Generic address spaces—so that single functions can operate on either GPU or CPU data, making programming easier (a short sketch follows this list).

• C++-style atomics—so that work-items can share data across work-groups and devices, enabling a wider class of algorithms to be realized in OpenCL.
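
To make the third of these concrete, here is a minimal sketch in OpenCL C 2.0 of the generic address space (the function names average and averages are illustrative, not from this book). A helper written once with an unqualified, generic pointer serves both global and local data; under OpenCL 1.x the same helper would have needed a separate copy for each address space.

    /* Illustrative sketch: in OpenCL C 2.0, an unqualified pointer in a
     * non-kernel function lives in the generic address space, so one
     * function body accepts pointers to global and local memory alike. */
    float average(const float *data, int n)      /* generic pointer */
    {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i)
            sum += data[i];
        return sum / (float)n;
    }

    kernel void averages(global const float *in,
                         local float *scratch,
                         global float *out,
                         const int n)
    {
        size_t lid = get_local_id(0);

        /* Stage a copy of the input in local memory. */
        if (lid < (size_t)n)
            scratch[lid] = in[lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        if (get_global_id(0) == 0) {
            out[0] = average(in, n);       /* global pointer -> generic */
            out[1] = average(scratch, n);  /* local pointer  -> generic */
        }
    }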

This book provides a good introduction to OpenCL, either for a class on OpenCL programming, or as part of a class on parallel programming. It will also be valuable to developers who want to learn OpenCL on their own.

The authors have been working on high-performance computing that mixes GPUs and CPUs for quite some time, and I highly respect their work. Earlier editions of the book, covering earlier versions of OpenCL, were well received, and this edition expands that work to cover all the new OpenCL 2.0 features.

I encourage potential readers to go through the book, learn OpenCL, and build the exciting applications of the future.

Norm Rubin, Research Scientist, NVIDIA; Visiting Scholar, Northeastern University

January 2015
