Index
Note: Page numbers followed by “b”, “f” and “t” refer to boxes, figures, and tables, respectively.
A
Abrupt underflow convention, 137
memory bandwidth of filling kernels, 485t
programmer productivity, 484
Accelerator::set_default static method, 526
Accelerator_view, C++ AMP, 526
AddVecKernel function, 36, 120
Adjacency matrix
  representation of simple graph, 259f
  sparse matrix representation of, 259f
Adjacent synchronization, 193
AMD Opteron family
Apodization filtering function, 306
Application programming interfaces (APIs), 19–20
  communication functions, 388
Applications software
Arithmetic and logic unit (ALU), 78–80
Arithmetic instructions, 78–80
Array data layout
  column-major layout, 50–51
Array of Structures (AoS), 488
pipelining with async and wait, 435f
Atomic operations
  atomic operation in cache memory, 210
  CUDA kernel for calculating histogram, 206f
  intrinsic functions, 205b
  strategy I for parallelizing histogram computation, 202f
Atomic operations, enhanced, 452–453
Audio digital signal processing, 150
B
Backward convolution, 353
Backward substitution, 143
Barrier __syncthreads(), 93
Basic Linear Algebra Subprograms (BLAS), 50–51, 73
Bezier curve
Bi-directional relations, 258
gridDim.x threads, 51, 207
BlockIdx variable, 34
Breadth-first search (BFS)
  maze routing in integrated circuits, 261f
Brent–Kung adder design, 183
Built-in variables, 28b, 39
C
C language
  multidimensional array, 49
  preprocessor directive, 26–27
  traditional C compiler, 23
  traditional C program, 22, 25
C++ Accelerated Massive Parallelism (C++ AMP), 515
  data-parallel computation, 521
  explicit and implicit data copies, 522–523
  extension to current C++ 11 standard, 515–516
  extensions to language, 516
  for_each function template, 518
  sharing concepts with CUDA, 515
  vehicle for reading and writing large data collections, 517
Cache
  cache memory, atomic operation in, 210
  hierarchy of modern processors, 158f
Central processing unit (CPU)
  CPU-based parallel programming models, 461–462
Circular buffer
  co_rank_circular function operates on, 255f
  managing shared memory tiles, 250f
  merge_sequential_circular function implementation, 255f
  simplified model for co-rank values, 252f
clEnqueueNDRangeKernel() function, 473
clReleaseMemObject() function, 473
co_rank function
  function based on binary search, 238f
Coalescing
  memory access patterns in C 2D arrays for, 106f
  shared memory to enable coalescing, 110f
Bezier curve calculation without dynamic parallelism, 301f
quadtree with dynamic parallelism, 302f, 303f
Collective communication function, 404–405
Column-major layout, 50–51
Compensated summation algorithm, 141–142
Compressed sparse row (CSR), 259
Computation intensity, 487
Computational microscope, 332
Computational thinking
  skills needed for parallel programmer, 380
Computer architecture, 380
Computer-aided design (CAD), 261–262
errors and launch failures, 284
launch environment configuration, 283
memory allocation and lifetime, 283–284
pending launch pool configuration, 284
Conjugate gradient algorithm (CG algorithm), 309–310
Conjugate gradient method, 217
"__constant__" keyword, 83
1D convolution kernel, 158f
Constructor for array views, 517–518
Control
Control flow efficiency, better, 451
Conventional machine learning systems, 346
Convolution
  calculation of P[3], 151f
  constant memory and caching, 156–159
Convolutional layer
  global memory traffic, 358f
  kernel for forward path, 357f
  parallelization of forward path, 356f
  reduction to matrix multiplication, 359–364
Coordinate format (COO)
  sequential loop implements SpMV/COO, 226f
base Coulomb potential calculation, 521f
CUDA vector addition, 516f
existing C99 keyword restrict, 519
providing math operations, 521
restriction specifiers, 520
Core performance evolution
  better control flow efficiency, 451
  configurable caching and scratchpad, 451–452
  double-precision speed, 451
  enhanced atomic operations, 452–453
  enhanced global memory access, 453
CPU–GPU execution of application, 4–5
CUDA
  C++ AMP sharing concepts with, 515
  calling CUDA device kernels from OpenACC, 439–440
  calling CUDA or libraries with OpenACC arrays, 437–438
  implementation of forward propagation, 355–358
  interoperability with CUDA and libraries, 437–440
  mapping between OpenCL and, 463f
  registers and shared memory, 97
  __syncthreads() statement, 59
CUDA API functions
  for data transfer between host and device, 30f
  for managing device global memory, 29f
CUDA C
  and CUDA Fortran differences, 493–495
  function declarations, 38
  keywords for function declaration, 35f
  model of host/device interaction, 444–449
  providing shortcut for launching kernel, 45
CUDA dynamic parallelism, 276
  events synchronization, 285, 287
  fixed vs. dynamic grids, 277f
  synchronization depth, 285
CUDA Fortran programming, 493
  asynchronous data transfers, 504–508
  calling CUDA C via iso_c_binding, 499–501
  data transfer and kernel execution, 507f, 508f
  kernel loop directives and reduction operations, 501–502
  overloading host/device routines, 498–499
CUDA linear algebra library (cuBLAS), 359
CUDA memory types
  "__constant__" keyword, 83
  CUDA variable type qualifiers, 82t
  "__device__" keyword, 83–84
  global memory in CUDA device, 78
  improving compute-to-global-memory-access ratio, 77
  memory vs. registers in modern computer, 79f
CUDA programs
  error checking and handling in, 32b
  synchronization constraints between blocks enable transparent scalability for, 60f
CUDA thread organization, 43–47
  hierarchical organizations, 44b
  multidimensional example of CUDA grid organization, 47f
CUDA-aware message passing interface (CUDA-aware MPI), 409–410
  compute process code, 407f
  revised MPI SendRec calls, 409f
cudaDeviceProp type, 63, 64
cudaDeviceSynchronize() function, 285
cudaGetDeviceProperties function, 98
cudaMemcpyAsync() function, 402–403
cudaMemcpyToSymbol() function, 157
cudaStreamCreate() function, 403
cudaStreamCreateWithFlags(), 287
cudaStreamWaitEvent(), 287
D
array size notation, 427f
for unstructured data directives, 430f
data clause array size notation, 427f
Jacobi iterative method with data region, 427f
speed-up with addition of, 428f
unstructured data directives in C++ class, 429f
update directive with MPI halo exchange, 430f
Data management techniques, 10
Data parallel computing
  CUDA C program structure, 22–25
  device global memory, 27–32
  function declarations, 38
  vector addition kernel, 25–27
Data parallelism
  conversion of color image to gray-scale image, 21f
  data-parallel computation, 521
  RGB color image representation, 21b
  task parallelism vs., 20b
Data server
  CUDA API function for, 30f
dev_prop.maxGridSize, 63–64
dev_prop.maxThreadsPerBlock, 63
dev_prop.multiProcessorCount, 63
dev_prop.sharedMemPerBlock, 98
Device global memory, 27–32
  CUDA API functions for managing, 29f
"__device__" keyword, 83–84
Device properties, querying, 61–64
Digital high-definition (HD) TV
Digital signal processors (DSPs), 462
Direct Coulomb Summation (DCS), 332, 335f
Direct memory access device (DMA device), 402
Direct3D techniques
DirectX interop—rotate vertex list, 533f
Divide-and-conquer approach, 231
Double-precision speed, 451
Driving direction services, 257–258
Dynamic partitioning of resources, 125–127
Dynamic random-access memory (DRAM)
  channels and banks in systems, 112f
E
Electronic gaming
Electrostatic potential
Enhanced atomic operations, 452–453
Enhanced global memory access, 453
Error checking and handling in CUDA, 32b
Events synchronization, 285, 287
Exception handling in kernel functions, 449–450
Excess-3 encoding, sorted by excess-3 ordering, 133f
supporting discrete accelerator, 525
CUDA kernels scalability, 37–38
Execution speed
  of matrix multiplication functions, 73
  of parallel programs, 12, 103
  of sequential programs
F
Fast Fourier Transform (FFT), 307–308
Fermi GPU architecture, 450
Field programmable gate arrays (FPGAs), 462
Finite difference methods, 172
"Flat" memory space, 49–50
Floating-point data representation, 132–134
  normalized representation of M, 132–133
Floating-point precision and accuracy validation, 326f
Function calls within kernel functions, 449
Function declarations, CUDA C, 38
G
GEneral Matrix to Matrix Multiplication (GEMM), 359, 360f
General-purpose programming interface
General-Purpose Programming using GPU (GPGPU)
Generation 2 interface (Gen2 interface)
Generic interfaces, overloading host/device routines, 498–499
get_global_id(0) function, 466
Giga floating-point operations per second (GFLOPS), 72
"__global__" keyword, 34, 35
Global memory access, enhanced, 453
Global memory bandwidth
  analyzing data access pattern, 112
  burst organization of modern DRAMs, 105
  coalesced access pattern, 107f
  memory access patterns in C 2D arrays for coalescing, 106f
  placing matrix elements into linear order, 106f
  shared memory to enable coalescing, 110f
  tiled matrix multiplication kernel using shared memory, 111f
  un-coalesced access pattern, 109f
GNU C Compiler (gcc), 205
Gradient backpropagation, 351
Graph data structure, 258
Graph search
  adjacency matrix representation, 259f
  graph data structure, 258
  graph with directional edges, 258f
  sparse matrix representation, 259f
  sparse representation, 260
Graphics API
Graphics Double Data Rate (GDDR), 6–8
Graphics processing units (GPUs)
  architecture of modern, 6–8
  design philosophy
  floating-point arithmetic units, 5–6
  IEEE Floating-Point Standard, 5–6
  PCI-E Gen3
H
Hardware trigonometry functions, 323–325
Heterogeneous computing cluster
  overlapping computation and communication, 400–408, 401f
  point-to-point communication, 393–400
  programmer's view of MPI processes, 388f
  small example of memory layout, 390f
Heterogeneous parallel computing
  many-thread trajectory, 2–3
  multicore trajectory
  NVIDIA
  practical form factors and easy accessibility
Hierarchical parallel scan for arbitrary-length inputs, 189–192
Hierarchical scan for arbitrary length inputs, 189f
High-Bandwidth Memory (HBM), 6–8
High-performance computing (HPC), 13, 387, 414
High-performance parallel programs, 14
Higher order stencil computation, 388–389
"__host__" keyword, 35, 38
Host/device interaction model, 444–449
classical gridded MRI reconstruction from spiral scan data, 384f
least squares reconstruction of spiral scan data, 385f
I
IEEE Floating-Point Standard, 5–6
IEEE format, special bit patterns and precision in, 138–139
IEEE-754 Floating-Point Standard, 132
If-then-else statement, 59
Installed base of processor
Instruction Register (IR), 79
Instruction/execution divergence, 381–382
Intel Pentium family
Interleaved data distribution, 114–115
Interoperability with CUDA and libraries, 437–440
  calling CUDA device kernels from OpenACC, 439–440
  calling CUDA or libraries with OpenACC arrays, 437–438
Intrinsic functions, 205b
Inverse Fast Fourier Transform (iFFT), 306–307
iso_c_binding module, calling CUDA C via, 499–501
J
Jacobi iterative method
  code with loop tile clause, 433f
  compiler feedback for, 423f
  using parallel directive, 422f
Jagged Diagonal Storage format (JDS format), 227
K
k-space sampling trajectory, 306
Kernel execution control
  exception handling in kernel functions, 449–450
  function calls within kernel functions, 449
  hardware queues and dynamic parallelism, 450
  simultaneous execution of multiple kernels, 450
Kernel parallelism structure, 312–317
  option of FHD kernel, 316f
  version of FHD kernel, 312f
loop directives and reduction operations, 501–502
and parallel directives comparison, 424–425
Kogge–Stone
  kernel for inclusive scan, 180f
  parallel exclusive scan algorithm, 181f
  parallel inclusive scan algorithm, 178f
L
Last-level on-chip caches
Latency-oriented design
Launch
  environment configuration, 283
Least-squares reconstruction (LS reconstruction), 385
Linear algebra functions, 51b
Linear algebra operation, 51b
Linear Bezier curves, 288
iterative reconstruction algorithm, 308–309
Loop fission technique, 313
Loop optimizations, OpenACC, 430–432
  loop directive specifying levels of parallelism, 431f
Loop splitting technique, 313
M
convolutional layer reduction to matrix multiplication, 359–364
Magnetic resonance imaging (MRI), 306, 371
  blockIdx.x and threadIdx.x values, 315
  Cartesian scan trajectories, 306–307
  experimental performance tuning, 326–327
  hardware trigonometry functions, 323–325
  kernel parallelism structure, 312–317
  loop fission or loop splitting, 313
  M/MU_THREADS_PER_BLOCK blocks, 314–315
  matrix–vector multiplication, 309, 310
  memory bandwidth limitation, 317–323
  non-Cartesian k-space sample trajectory, 308f
  non-Cartesian scan trajectories, 307
  physics principles behind MRI, 306
  quasi-Bayesian estimation problem formulation, 309
  ratio of floating-point operations, 311–312
  scanner k-space trajectories, 307f
Many-thread processors, 2–3
Map driving direction applications, 258
Map-reduce distributed computing frameworks, 233
Map-reduce frameworks, 231
Matrix multiplication
  actions of one thread block, 76f
  convolutional layer reduction to, 359–364
  execution example of matrixMulKernel, 76f
  execution of for-loop, 75
  execution speed of functions, 73
  function generates unrolled X matrix, 362f
  global memory accesses, 77
  high-performance implementation, 363f
  host code for invoking unroll kernel, 363f
  implementing forward path, 362f
  kernel using one thread, 75f
  using multiple blocks by tiling, 74f
Maze routing problem, 262
determining kernel parallelism structure, 312–317
experimental performance tuning, 326–327
getting around memory bandwidth limitation, 317–323
hardware trigonometry functions, 323–325
in integrated circuits, 261f
Memory
  memory-bound programs, 72
Memory access efficiency
Memory and data locality
  importance of memory access efficiency, 72–73
  matrix multiplication, 73–77
  parallelism, memory as limiting factor to, 97–99
  tiled matrix multiplication kernel, 90–94
  tiling for reduced memory traffic, 84–90
Memory bandwidth limitation
  adjusting k-space data layout, 323f
  chunking k-space data, 320f
  effect of k-space data layout, 322f
  registers to reduce memory accesses, 319f
Memory coalescing
  DCS kernel version 3, 340f
  organizing threads and memory layout, 339f
  reusing computation results among multiple grid points, 339f
  version 2 of DCS kernel, 339f
errors and launch failures, 284
launch environment configuration, 283
memory allocation and lifetime, 283–284
pending launch pool configuration, 284
Memory parallelism
  banking improving utilization of data transfer bandwidth, 113f
  channels and banks in DRAM systems, 112f
  distributing array elements into channels and banks, 115f
  M elements loaded by thread blocks, 116f
  matrix multiplication, 115f
Merge sort
  circular-buffer merge kernel, 249–254
  co-rank function implementation, 236–241
  sequential merge algorithm, 233–234
  sorted vs. unsorted lists, 232f
merge_circular_buffer_kernel, 249
Message passing interface (MPI)
  closing communication system, 391f
  MPI/CUDA programming, 387
  MPI_Comm_rank() function, 391
  overlapping computation and communication, 400–408
  point-to-point communication, 393–400
Microprocessors
Microscopes
Molecular visualization and analysis
  simple kernel implementation, 333–337
  single-thread CPU vs. CPU–GPU comparison, 343f
  thread granularity adjustment, 337–338
Multidimensional arrays, 49–50
Multidimensional data, threads mapping to, 47–54
  linearized access to three-dimensional array, 54
  multidimensional arrays, 49–50
  1D, 2D, or 3D thread organizations, 47
  row-major layout for 2D C array, 50f
  source code of colorToGreyscaleConversion, 52f
  2D thread grid to processing, 48f
N
Non-Cartesian MRI
  non-Cartesian k-space sample trajectory, 308f
  non-Cartesian scan trajectories, 307
  non-Cartesian trajectories, 384
  scanner k-space trajectories, 307f
Normalized representation of M, 132–133
Numerical considerations
  arithmetic accuracy and rounding, 139–140
  floating-point data representation, 132–134
  linear solvers and numerical stability, 142–146
  special bit patterns and precision in IEEE format, 138–139, 138f
Numerically stable values, 142
Numerically unstable values, 142
NVIDIA C Compiler (NVCC), 23
O
1D convolution
  boundary condition handling, 155f
  kernel with boundary condition handling, 155f
  mapping of threads to output elements, 154
  Mask_Width [size of masks], 155
  output element index, 154
Open Computing Language (OpenCL)
  building OpenCL kernel, 472f
  data access indexing in OpenCL and CUDA, 471f
  DCS kernel version 3 NDRange configuration, 470f
  device management and kernel launch, 466–469
  dimensions and indices to CUDA, 464f
  electrostatic potential map in, 469–473
  host code for kernel launch and parameter passing, 472f
  inner loop of OpenCL DCS kernel, 471f
  to managing devices, 467f
  mapping between OpenCL and CUDA, 463f
  mapping DCS NDRange to, 470f
  parallel execution model, 463f
OpenACC
  abstract machine model, 415f
  asynchronous computation and data, 434–435
  calling CUDA device kernels from OpenACC, 439–440
  calling CUDA or libraries with OpenACC arrays, 437–438
  compiler output from example kernels code, 421f
  GPU timeline of parallel loop code, 424f
  interoperability with CUDA and libraries, 437–440
  kernels and parallel directives comparison, 424–425
  offloading execution model, 415f
  performance speed-up, 424f
OpenCL clEnqueueReadBuffer(), 473
Ordered merge operations, 231
Overlapping computation and communication, 400–408, 401f
  device memory offsets, 404f
P
Padding
  hybrid approach to regulate, 224–227
  parallel SpMV/ELL kernel, 223f
Page locked memory buffer. See Pinned memory buffer
BFS host code function, 266f
block-level queue contents, 269f
kernel based on block-level privatized queues, 267f
atomic operation in cache memory, 210
block partitioning vs. interleaved partitioning, 206–207
C function for calculating histogram, 201f
comparison of scalability and performance, 378
energy value of grid point, 371
grid-centric decomposition, 371–372
latency vs. throughput of atomic operations, 207–209
nonbonded force calculation, 372–373
parallel histogram computation, 199
running time of three binned cutoff algorithms, 378, 378f
SPMD, shared memory and locality, 380–382
Parallel directive, OpenACC, 422–424
  and kernels directives comparison, 424–425
Parallel programming languages and models, 12–14
Parallel reduction algorithm, 122
Parallel scan
  algorithm with 16-element input, 178–179
  for arbitrary-length inputs, 189–192
  exclusive scan algorithm, 181f
  implementation of iterative calculations, 180
  inclusive scan operation, 176, 178f
  Kogge–Stone kernel for inclusive scan, 180f
  as primitive operation, 177
  sequential algorithm of computation, 177
  single-pass scan for memory access efficiency, 192–194
Parallelism
  memory as limiting factor to, 97–99
Pareto-Optimal-Curve-based method, 327
Pending launch pool configuration, 284
Performance considerations
  dynamic partitioning of resources, 125–127
Physical address spaces, 446
Pinned memory
Portfolio management process, 370
Portland Group (PGI), 414, 493
Processor cores
Programming interface for computing clusters, 388
  overlapping computation and communication, 400–408
Programmer productivity, 484
Q
Quadratic Bezier curves, 288
Querying device properties, 61–64
R
Reduced memory access throughput, 126
Representable numbers of floating-point format, 134–138
  abrupt underflow format, 137f
  alignment shifting of operands, 140
  arithmetic accuracy and rounding, 139–140
  3-bit unsigned integer format, 134f, 135f
  denormalization format, 137f
  discrepancy between sequential algorithms and parallel algorithms, 141
  intervals in neighborhood of 0, 135–136
  between negative infinity and positive infinity, 139
  no-zero, abrupt underflow, and denorm formats, 135f
  no-zero representation, 135f
  reduction computation, 141
  trend of increasing density, 136–137
Resource and capability queries, 62b
Resource assignment, 60–61
Restrict(amp) specification, 519
RGB color image representation, 21b
Root-mean-square (RMS), 327
S
SAXPY
  in CUDA C and Thrust, 483f
Scalable parallel execution
  CUDA thread organization, 43–47
  mapping threads to multidimensional data, 47–54
  querying device properties, 61–64
  resource assignment, 60–61
  scalable parallel program, 19
  synchronization and transparent scalability, 58–60
Semiconductor industry, 459
Sequential cutoff algorithm, 378
Sequential merge algorithm, 233–234
Sequential reduction algorithm, 121
Sequential SpMV/CSR loop, 218f, 220
SpMV/ELL kernel code, 223
Signal-to-noise ratio (SNR), 306
Signaling NaNs (SNaNs), 139
Simple kernel implementation, 333–337
Simulation algorithm, 277
Single Instruction Multiple Data (SIMD), 65, 117
  executing threads of warp, 119
  execution of revised algorithm, 124f, 125
  placing 2D threads into linear order, 119f
  __syncthreads() statement, 122
Single-pass scan for memory access efficiency, 192–194
SmallBin-Overlap algorithm, 379
Sparse matrix computation, 215
  data padding and transposition, 221–224
  dot product loop body, 223
  elements of data, col_index, and row_index, 216–217
  Gaussian elimination, 217
  hybrid approach to regulate padding, 224–227
  loop index iteration, 218, 219
  matrix–vector multiplication and accumulation, 218f
  in science and engineering problems, 216
  sequential implementation of SpMV, 218
  in solving linear system of N equations of N variables, 217
  sorting and partitioning for regularization, 227–229
  SpMV computation code, 219
  SpMV loop operating on, 219f
  transposition of JDS-CSR representation, 228–229
Sparse matrix–vector multiplication (SpMV), 217–218
  loop operating on sparse matrix, 219f
  sequential loop implements SpMV/COO, 226f
Sparse representation, 260
Special function units (SFU), 323
Speeding up real applications, 10–11
Statistical estimation methods
  non-Cartesian k-space sample trajectory, 308f
  scanner k-space trajectories, 307f
Statistically optimal image reconstruction method, 308
Streaming multiprocessors (SMs)
  executing threads in warp, 65
  hardware resources (built-in registers) for, 61
  size of shared memory in, 98
  thread block assignment to, 61f
  unit of thread scheduling in, 64–65, 64f
Streaming processors (SPs), 6–8
Structure of Arrays approach (SoA approach), 488–489
Supercomputing applications
Synchronous DRAM (SDRAM), 6–8
System of linear equations, 142
T
Tera floating-point operations per second (TFLOPS)
Thread-to-data mapping, 74
Threads
  in grid executing same kernel codes, 33f
  mapping to multidimensional data, 47–54
  multiple dimensions of, 118
  threadIdx.x and threadIdx.y values, 118–119
  threadIdx.x values with warp, 118
  3D thread organizations, 47
threadIdx.x, 34, 35–36, 48, 108, 111, 118, 122f, 161, 162, 166, 180, 181
Three-dimensional grid (3D grid), 333–334
Throughput
  throughput-oriented design
Thrust parallel template library, 475
  device_pointer_cast() function, 481
  dynamic optimization, 486
  generate, sort, and copy algorithms, 478
  iterators and memory space, 479–480
  native CUDA C interoperability, 481
  programmer productivity, 484
  raw_pointer_cast() function, 480
  for solving complementary set of problems, 478
Tiled 1D convolution
  accessing N elements and ghost cells, 164f
  kernel using constant memory, 163f
Tiled 2D convolution
  C type structure definition of image pixel element, 169f
  image array access reduction ratio, 172f
  padded image format, 167f
  starting element indices, 169f
Tiled matrix
  multiplication algorithm, 88, 89f
  multiplication kernel, 90–94
Tiled merge kernel
  identifying block-level output and input subarrays, 244f
  loading elements into shared memory, 245f
Tiling
  for reduced memory traffic, 84–90
Transparent scalability, 58–60
Transposition
  parallel SpMV/ELL kernel, 223f
Trigonometry functions, 312
2D convolution kernel, 169
Two-dimensional array, 50
U
CPU code to CUDA code, 447f
Unified virtual address space (UVAS), 445
Unstructured data directives, 429
V
vecAdd function
  complete version of host code in, 37f
Vector addition kernel function, 25–27, 34f
  revised vecAdd function, 26f
  traditional vector addition C code example, 25f
Virtual address spaces, 446
Visual Molecular Dynamics (VMD), 332
von Neumann model
  memory vs. registers, 79f
W
Warp-level queues (w-queues), 271–272
Warps
  analyzing impact of control divergence, 121
  approach to execution, 120
  execution of revised algorithm, 124f, 125
  placing 2D threads into linear order, 119f
  __syncthreads() statement, 122
Work-efficient parallel scan
  Brent–Kung kernel for inclusive scan, 186f
  distribution of partial sums to positions, 184
  minimal number of operations, 183–184
  number of operations in distribution tree stage, 187
  parallel inclusive scan algorithm, 183f
  reduction tree phase of, 185
X
X86 instruction set
Z
Zero-overhead thread scheduling, 65