List of Figures

Fig. 1.1 (a) Simple sorting: a divide-and-conquer implementation, breaking the list into shorter lists, sorting them, and then merging the shorter sorted lists. (b) Vector-scalar multiply: scattering the multiplies and then gathering the results to be summed up in a series of steps. 3

Fig. 1.2 Multiplying elements in arrays A and B, and storing the result in an array C. 4

Fig. 1.3 Task parallelism present in a fast Fourier transform (FFT) application. Different input images are processed independently in three separate tasks. 5

Fig. 1.4 Task-level parallelism, where multiple words can be compared concurrently. Also shown is finer-grained character-by-character parallelism present when characters within the words are compared with the search string. 6

Fig. 1.5 After all string comparisons in Figure 1.4 have been completed, we can sum up the number of matches in a combining network. 6

Fig. 1.6 The relationship between parallel and concurrent programs. Parallel and concurrent programs are subsets of all programs. 8

Fig. 2.1 Out-of-order execution of an instruction stream of simple assembly-like instructions. Note that in this syntax, the destination register is listed first. For example, add a,b,c is a = b+c. 18

Fig. 2.2 VLIW execution based on the out-of-order diagram in Figure 2.1. 20

Fig. 2.3 SIMD execution where a single instruction is scheduled in order, but executes over multiple ALUs at the same time. 21

Fig. 2.4 The out-of-order schedule seen in Figure 2.1 combined with a second thread and executed simultaneously. 23

Fig. 2.5 Two threads scheduled in a time-slice fashion. 24

Fig. 2.6 Taking temporal multithreading to an extreme as is done in throughput computing: a large number of threads interleave execution to keep the device busy, whereas each individual thread takes longer to execute than the theoretical minimum. 25

Fig. 2.7 The AMD Puma (left) and Steamroller (right) high-level designs (not shown to any shared scale). Puma is a low-power design that follows a traditional approach to mapping functional units to cores. Steamroller combines two cores within a module, sharing its floating-point (FP) units. 26

Fig. 2.8 The AMD Radeon HD 6970 GPU architecture. The device is divided into two halves, where instruction control (scheduling and dispatch) is performed by the wave scheduler for each half. The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level 1 (L1) caches and local data shares (scratchpad memory). 27

Fig. 2.9 The Niagara 2 CPU from Sun/Oracle. The design intends to make a high level of threading efficient. Note its relative similarity to the GPU design seen in Figure 2.8. Given enough threads, we can cover all memory access time with useful compute, without extracting instruction-level parallelism (ILP) through complicated hardware techniques. 32

Fig. 2.10 The AMD Radeon R9 290X architecture. The device has 44 cores in 11 clusters. Each core consists of a scalar execution unit that handles branches and basic integer operations, and four 16-lane SIMD ALUs. The clusters share instruction and scalar caches. 35

Fig. 2.11 The NVIDIA GeForce GTX 780 architecture. The device has 12 large cores that NVIDIA refers to as “streaming multiprocessors” (SMX). Each SMX has 12 SIMD units (with specialized double-precision and special function units), a single L1 cache, and a read-only data cache. 36

Fig. 2.12 The A10-7850K APU consists of two Steamroller-based CPU cores and eight Radeon R9 GPU cores (32 16-lane SIMD units in total). The APU includes a fast bus from the GPU to DDR3 memory, and a shared path that is optionally coherent with CPU caches. 37

Fig. 2.13 An Intel i7 processor with HD Graphics 4000. Although not termed an “APU” by Intel, the concept is the same as for the devices in that category from AMD. Intel combines four Haswell x86 cores with its graphics processors, connected to a shared last-level cache (LLC) via a ring bus. 38

Fig. 3.1 An OpenCL platform with multiple compute devices. Each compute device contains one or more compute units. A compute unit is composed of one or more processing elements (PEs). A system could have multiple platforms present at the same time, for example, both an AMD platform and an Intel platform. 43

Fig. 3.2 Some of the output from the CLInfo program showing the characteristics of an OpenCL platform and its devices. We see that the AMD platform has two devices (a CPU and a GPU). The output shown here can be queried using functions from the platform API. 46

Fig. 3.3 Vector addition algorithm showing how each element can be added independently. 50

Fig. 3.4 The hierarchical model used for creating an NDRange of work-items, grouped into work-groups. 52

Fig. 3.5 The OpenCL runtime shown here includes an OpenCL context with two compute devices (a CPU device and a GPU device). Each compute device has its own command-queues. Host-side and device-side command-queues are shown. The device-side queues are visible only from kernels executing on the compute device. The memory objects have been defined within the memory model. 54

Fig. 3.6 Memory regions and their scope in the OpenCL memory model. 61

Fig. 3.7 Mapping the OpenCL memory model to an AMD Radeon HD 7970 GPU. 62

Fig. 4.1 A histogram generated from an 8-bit image. Each bin corresponds to the frequency of the corresponding pixel value. 76

Fig. 4.2 An image rotated by 45°. Pixels that correspond to an out-of-bounds location in the input image are returned as black. 83

Fig. 4.3 Applying a convolution filter to a source image. 91

Fig. 4.4 The effect of different convolution filters applied to the same source image: (a) the original image; (b) blurring filter; and (c) embossing filter. 92

Fig. 4.5 The producer kernel will generate filtered pixels and send them via a pipe to the consumer kernel, which will then generate the histogram: (a) original image; (b) filtered image; and (c) histogram of filtered image. 99

Fig. 5.1 Multiple command-queues created for different devices declared within the same context. Two devices are shown, where one command-queue has been created for each device. 118

Fig. 5.2 Multiple devices working in a pipelined manner on the same data. The CPU queue will wait until the GPU kernel has finished. 119

Fig. 5.3 Multiple devices working in a parallel manner. In this scenario, the two GPUs do not share buffers and will execute independently. The CPU queue will wait until both GPU devices have finished. 120

Fig. 5.4 Executing the simple kernel shown in Listing 5.5. The different work-items in the NDRange are shown. 121

Fig. 5.5 Within a single kernel dispatch, synchronization regarding execution order is supported only within work-groups using barriers. Global synchronization is maintained by completion of the kernel, and the guarantee that on a completion event all work is complete and memory content is as expected. 126

Fig. 5.6 Example showing OpenCL memory objects mapping to arguments for clEnqueueNativeKernel() in Listing 5.8. 131

Fig. 5.7 A single-level fork-join execution paradigm compared with nested parallelism thread execution. 133

Fig. 6.1 An example showing a scenario where a buffer is created and initialized on the host, used for computation on the device, and transferred back to the host. Note that the runtime could have also created and initialized the buffer directly on the device. (a) Creation and initialization of a buffer in host memory. (b) Implicit data transfer from the host to the device prior to kernel execution. (c) Explicit copying of data back from the device to the host pointer. 150

Fig. 6.2 Data movement using explicit read-write commands. (a) Creation of an uninitialized buffer in device memory. (b) Explicit data transfer from the host to the device prior to execution. (c) Explicit data transfer from the device to the host following execution. 151

Fig. 6.3 Data movement using map/unmap. (a) Creation of an uninitialized buffer in device memory. (b) The buffer is mapped into the host’s address space. (c) The buffer is unmapped from the host’s address space. 158

Fig. 7.1 The memory spaces available to an OpenCL device. 164

Fig. 7.2 Data race when incrementing a shared variable. The value stored depends on the ordering of operations between the threads. 166

Fig. 7.3 Applying Z-order mapping to a two-dimensional memory space. 172

Fig. 7.4 The pattern of data flow for the example shown in the localAccess kernel. 177

Fig. 8.1 High-level design of AMD’s Piledriver-based FX-8350 CPU. 188

Fig. 8.2 OpenCL mapped onto an FX-8350 CPU. The FX-8350 CPU is both the OpenCL host and the device in this scenario. 189

Fig. 8.3 Implementation of work-group execution on an x86 architecture. 190

Fig. 8.4 Mapping the memory spaces for a work-group (work-group 0) onto a Piledriver CPU cache. 192

Fig. 8.5 High-level Radeon R9 290X diagram labeled with OpenCL execution and memory model terms. 193

Fig. 8.6 Memory bandwidths in the discrete system. 195

Fig. 8.7 Radeon R9 290X compute unit microarchitecture. 197

Fig. 8.8 Mapping OpenCL’s memory model onto a Radeon R9 290X GPU. 201

Fig. 8.9 Using vector reads provides a better opportunity to return data efficiently through the memory system. When work-items access consecutive elements, GPU hardware can achieve the same result through coalescing. 203

Fig. 8.10 Accesses to nonconsecutive elements return smaller pieces of data less efficiently. 203

Fig. 8.11 Mapping the Radeon R9 290X address space onto memory channels and DRAM banks. 204

Fig. 8.12 Radeon R9 290X memory subsystem. 205

Fig. 8.13 The accumulation pass of the prefix sum shown in Listing 8.2 over a 16-element array in local memory using 8 work-items. 208

Fig. 8.14 Step 1 in Figure 8.13 showing the behavior of an LDS with eight banks. 209

Fig. 8.15 Step 1 in Figure 8.14 with padding added to the original data set to remove bank conflicts in the LDS. 210

Fig. 9.1 An image classification pipeline. An algorithm such as SURF is used to generate features. A clustering algorithm such as k-means then generates a set of centroid features that can serve as a set of visual words for the image. The generated features are assigned to each centroid by the histogram builder. 214

Fig. 9.2 Feature generation using the SURF algorithm. The SURF algorithm accepts an image as an input and generates an array of features. Each feature includes position information and a set of 64 values known as a descriptor. 214

Fig. 9.3 The data transformation kernel used to enable memory coalescing is the same as a matrix transpose kernel. 219

Fig. 9.4 A transpose illustrated on a one-dimensional array. 220

Fig. 10.1 The session explorer for CodeXL in profile mode. Two application timeline sessions and one GPU performance counter session are shown. 233

Fig. 10.2 The Timeline View of CodeXL in profile mode for the Nbody application. We see the time spent in data transfer and kernel execution. 234

Fig. 10.3 The API Trace View of CodeXL in profile mode for the Nbody application. 235

Fig. 10.4 CodeXL Profiler showing the different GPU kernel performance counters for the Nbody kernel. 237

Fig. 10.5 AMD CodeXL explorer in analysis mode. The NBody OpenCL kernel has been compiled and analyzed for a number of different graphics architectures. 240

Fig. 10.6 The ISA view of KernelAnalyzer. The NBody OpenCL kernel has been compiled for multiple graphics architectures. For each architecture, the AMD IL and the GPU ISA can be evaluated. 241

Fig. 10.7 The Statistics view for the Nbody kernel shown by KernelAnalyzer. We see that the number of concurrent wavefronts that can be scheduled is limited by the number of vector registers. 241

Fig. 10.8 The Analysis view of the Nbody kernel is shown. The execution duration calculated by emulation is shown for different graphics architectures. 242

Fig. 10.9 A high-level overview of how CodeXL interacts with an OpenCL application. 243

Fig. 10.10 CodeXL API trace showing the history of the OpenCL functions called. 244

Fig. 10.11 A kernel breakpoint set on the Nbody kernel. 246

Fig. 10.12 The Multi-Watch window showing the values of a global memory buffer in the Nbody example. The values can also be visualized as an image. 247

Fig. 11.1 C++ AMP code example—vector addition. 250

Fig. 11.2 Vector addition, conceptual view. 251

Fig. 11.3 Functor version for C++ AMP vector addition (conceptual code). 256

Fig. 11.4 Further expanded version for C++ AMP vector addition (conceptual code). 257

Fig. 11.5 Host code implementation of parallel_for_each (conceptual code). 259

Fig. 11.6 C++ AMP Lambda—vector addition. 260

Fig. 11.7 Compiled OpenCL SPIR code—vector addition kernel. 261

Fig. 12.1 WebCL objects. 275

Fig. 12.2 Using multiple command-queues for overlapped data transfer. 281

Fig. 12.3 Typical runtime involving WebCL and WebGL. 283

Fig. 12.4 Two triangles in WebGL to draw a WebCL-generated image. 284
