Figures

Figure 1.1 The rate at which instructions are retired is the same in these two cases, but the power is much less with two cores running at half the frequency of a single core 5

Figure 1.2 A plot of peak performance versus power at the thermal design point for three processors produced on a 65nm process technology. Note: This is not to say that one processor is better or worse than the others. The point is that the more specialized the core, the more power-efficient it is 6

Figure 1.3 Block diagram of a modern desktop PC with multiple CPUs (potentially different) and a GPU, demonstrating that systems today are frequently heterogeneous 7

Figure 1.4 A simple example of data parallelism where a single task is applied concurrently to each element of a vector to produce a new vector 9

Figure 1.5 Task parallelism showing two ways of mapping six independent tasks onto three PEs. A computation is not done until every task is complete, so the goal should be a well-balanced load, that is, to have the time spent computing by each PE be the same 10

Figure 1.6 The OpenCL platform model with one host and one or more OpenCL devices. Each OpenCL device has one or more compute units, each of which has one or more processing elements 12

Figure 1.7 An example of how the global IDs, local IDs, and work-group indices are related for a two-dimensional NDRange. Other parameters of the index space are defined in the figure. The shaded block has a global ID of (gx, gy) = (6, 5) and a work-group plus local ID of (wx, wy) = (1, 1) and (lx, ly) = (2, 1) 16

Figure 1.8 A summary of the memory model in OpenCL and how the different memory regions interact with the platform model 23

Figure 1.9 This block diagram summarizes the components of OpenCL and the actions that occur on the host during an OpenCL application 35

Figure 2.1 CodeBlocks CL_Book project 42

Figure 2.2 Using cmake-gui to generate Visual Studio projects 43

Figure 2.3 Microsoft Visual Studio 2008 Project 44

Figure 2.4 Eclipse CL_Book project 45

Figure 3.1 Platform, devices, and contexts 84

Figure 3.2 Convolution of an 8×8 signal with a 3×3 filter, resulting in a 6×6 signal 90

Figure 4.1 Mapping get_global_id to a work-item 98

Figure 4.2 Converting a float4 to a ushort4 with round-to-nearest rounding and saturation 120

Figure 4.3 Adding two vectors 125

Figure 4.4 Multiplying a vector and a scalar with widening 126

Figure 4.5 Multiplying a vector and a scalar with conversion and widening 126

Figure 5.1 Example of the work-item functions 150

Figure 7.1 (a) 2D array represented as an OpenCL buffer; (b) 2D slice into the same buffer 269

Figure 9.1 A failed attempt to use the clEnqueueBarrier() command to establish a barrier between two command-queues. This doesn’t work because the barrier command in OpenCL applies only to the queue within which it is placed 316

Figure 9.2 Creating a barrier between queues using clEnqueueMarker() to post the barrier in one queue with its exported event to connect to a clEnqueueWaitForEvents() function in the other queue. Because clEnqueueWaitForEvents() does not imply a barrier, it must be preceded by an explicit clEnqueueBarrier() 317

Figure 10.1 A program demonstrating OpenCL/OpenGL interop. The positions of the vertices in the sine wave and the background texture color values are computed by kernels in OpenCL and displayed using OpenGL 344

Figure 11.1 A program demonstrating OpenCL/D3D interop. The positions of the vertices in the sine wave and the texture color values are programmatically set by kernels in OpenCL and displayed using Direct3D 368

Figure 12.1 C++ Wrapper API class hierarchy 370

Figure 15.1 OpenCL Sobel kernel: input image and output image after applying the Sobel filter 409

Figure 16.1 Summary of data in Table 16.1: NV GTX 295 (1 GPU, 2 GPU) and Intel Core i7 performance 419

Figure 16.2 Using one GPU versus two GPUs: NV GTX 295 (1 GPU, 2 GPU) and Intel Core i7 performance 420

Figure 16.3 Summary of data in Table 16.2: NV GTX 295 (1 GPU, 2 GPU) and Intel Core i7 performance—10 edges per vertex 421

Figure 16.4 Summary of data in Table 16.3: comparison of dual GPU, dual GPU + multicore CPU, multicore CPU, and CPU at vertex degree 1 423

Figure 17.1 AMD’s Samari demo, courtesy of Jason Yang 426

Figure 17.2 Masses and connecting links, similar to a mass/spring model for soft bodies 426

Figure 17.3 Creating a simulation structure from a cloth mesh 427

Figure 17.4 Cloth link structure 428

Figure 17.5 Cloth mesh with both structural links that stop stretching and bend links that resist folding of the material 428

Figure 17.6 Solving the mesh of a rope. Note how the motion applied between (a) and (b) propagates during solver iterations (c) and (d) until, eventually, the entire rope has been affected 429

Figure 17.7 The stages of Gauss-Seidel iteration on a set of soft-body links and vertices. In (a) we see the mesh at the start of the solver iteration. In (b) we apply the effects of the first link on its vertices. In (c) we apply those of another link, noting that we work from the positions computed in (b) 432

Figure 17.8 The same mesh as in Figure 17.7 is shown in (a). In (b) the update shown in Figure 17.7(c) has occurred as well as a second update represented by the dark mass and dotted lines 433

Figure 17.9 A mesh with structural links taken from the input triangle mesh and bend links created across triangle boundaries with one possible coloring into independent batches 434

Figure 17.10 Dividing the mesh into larger chunks and applying a coloring to those. Note that fewer colors are needed than in the direct link coloring approach. This pattern can repeat infinitely with the same four colors 439

Figure 18.1 A single frame from the Ocean demonstration 450

Figure 19.1 A pair of test images of a car trunk being closed. The first (a) and fifth (b) images of the test sequence are shown 470

Figure 19.2 Optical flow vectors recovered from the test images of a car trunk being closed. The fourth and fifth images in the sequence were used to generate this result 471

Figure 19.3 Pyramidal Lucas-Kanade optical flow algorithm 473

Figure 21.1 A matrix multiplication operation to compute a single element of the product matrix, C. This corresponds to summing into each element C(i,j) the dot product of the ith row of A with the jth column of B 500

Figure 21.2 Matrix multiplication where each work-item computes an entire row of the C matrix. This requires a change from a 2D NDRange of size 1000×1000 to a 1D NDRange of size 1000. We set the work-group size to 250, resulting in four work-groups (one for each compute unit in our GPU) 506

Figure 21.3 Matrix multiplication where each work-item computes an entire row of the C matrix. The same row of A is used for elements in the row of C so memory movement overhead can be dramatically reduced by copying a row of A into private memory 508

Figure 21.4 Matrix multiplication where each work-item computes an entire row of the C matrix. Memory traffic to global memory is minimized by copying a row of A into each work-item’s private memory and copying rows of B into local memory for each work-group 510

Figure 22.1 Sparse matrix example 516

Figure 22.2 A tile in a matrix and its relationship with input and output vectors 520

Figure 22.3 Format of a single-precision 128-byte packet 521

Figure 22.4 Format of a double-precision 192-byte packet 522

Figure 22.5 Format of the header block of a tiled and packetized sparse matrix 523

Figure 22.6 Single-precision SpMV performance across 22 matrices on seven platforms 528

Figure 22.7 Double-precision SpMV performance across 22 matrices on five platforms 528
