Index

Note: Page numbers followed by “b”, “f”, and “t” refer to boxes, figures, and tables, respectively.

A

Abrupt underflow convention, 137
Abstraction, 484–486
memory bandwidth of filling kernels, 485t
programmer productivity, 484
real-world performance, 485–486
robustness, 484
Accelerator, 517
device, 414
management, 525–528, 526f
accelerator class, 525–526
accelerator::set_default static method, 526
accelerator_view, C++ AMP, 526
AddVecKernel function, 36, 120
Adjacency matrix, 258
representation of simple graph, 259f
sparse matrix representation of, 259f
Adjacent synchronization, 193
Algorithm, 153–156
considerations, 140–142
selection, 374–379
AMD Opteron family, 1
Amdahl’s law, 10, 373–374
ANSI C code, 23, 34, 35–36
ANSI C standard, 49
Apodization filtering function, 306
Application programming interfaces (APIs), 6, 19–20
communication functions, 388
Applications software, 2
Arithmetic accuracy, 139–140
Arithmetic and logic unit (ALU), 78–80
Arithmetic instructions, 78–80
Array data layout
column-major layout, 50–51
row-major layout, 50–51, 50f, 168f
Array of structs, 322
Array of Structures (AoS), 488
array_view, 517–518, 521–523
Asynchronous, 405, 409
async and wait, 434f
computation and data, 434–435
data transfers, 504–508
operation, 523–525, 524f
pipelining with async and wait, 435f
Atomic operations, 202–205, 334–335
atomic operation in cache memory, 210
CUDA kernel for calculation histogram, 206f
in Fermi GPU, 452–453
intrinsic functions, 205b
race condition, 203f, 204f
strategy I for parallelizing histogram computation, 202f
throughput of, 207–209
Atomic operations, enhanced, 452–453
Audio digital signal processing, 150
Auto clause, 425

B

Backward convolution, 353
Backward substitution, 143
Bank conflict, 114
Barrier synchronizations, 58, 117, 404–405, 407–408
execution timing of, 59f
__syncthreads() points, 59
Barrier __syncthreads(), 93
Basic Linear Algebra Subprograms (BLAS), 50–51, 73
Bezier curve
calculation, 288–290, 301f
linear, 288
quadratic, 288
Bi-directional relations, 258
Big Data, 9–10, 200–201
Binning process, 377–378
Block partitioning, 206–207
Block-level queue, 268–269, 269f, 270
blockDim variable, 33–34
blockDim.x variable, 33–34, 35–36
gridDim.x threads, 51, 207
blockIdx variable, 34
blockIdx.x, 34, 35–36
Bottlenecks, 103
Boundary checks, 94–96
Boundary tile, 160–161
Brain, sodium map of, 383–385
Breadth-first search (BFS), 260–262, 260f
maze routing in integrated circuits, 261f
Brent–Kung adder design, 183
Built-in variables, 28b, 39

C

C language
ANSI C code, 23
convention, 150
CUDA C extending, 35
malloc() function, 396, 402
multidimensional array, 49
pointers, 26b–27b
preprocessor directive, 26–27
run-time library, 28–29
traditional C compiler, 23
traditional C program, 22, 25
2D array, 49
C++ Accelerated Massive Parallelism (C++ AMP), 515
array_view, 517–518, 521–523
asynchronous operation, 523–525, 524f
data-parallel computation, 521
execution model, 522–525
explicit and implicit data copies, 522–523
extension to current C++ 11 standard, 515–516
extensions to language, 516
features, 516–522
focus of, 515
for_each function template, 518
graphics features, 531–533
managing accelerators, 525–528, 526f, 527f
sharing concepts with CUDA, 515
tiled execution, 528–530
vehicle for reading and writing large data collections, 517
C++ function, 475–476
classes, 476–477
objects, 482b–483b
Cache
cache memory, atomic operation in, 210
coherence mechanism, 159
L1 cache, 159
L2 cache, 159
halo cells, 166f
hierarchy of modern processors, 158f
Carpooling, 87
arrangements, 87–88
Cartesian scan trajectory, 306–307
Central processing unit (CPU), 1
CPU-based parallel programming models, 461–462
serial code, 23
Circular-buffer merge kernel, 249–254
See also Tiled merge kernel
co_rank_circular function operates on, 255f
managing shared memory tiles, 250f
merge_sequential_circular function implementation, 255f
part of, 251f, 254f
simplified model for co-rank values, 252f
clCreateBuffer() function, 471–472
clEnqueueNDRangeKernel() function, 473
clReleaseMemObject() function, 473
Co-rank function, 234
execution, 237f
function based on binary search, 238f
implementation, 236–241
operation example, 239f, 240f
Coalescing, 105
access pattern, 107f
coalesced pattern, 242
memory access patterns in C 2D arrays for, 106f
shared memory to enable coalescing, 110f
Code, 301
Bezier curve calculation without dynamic parallelism, 301f
quadtree with dynamic parallelism, 302f, 303f
Collective communication function, 404–405
Column-major layout, 50–51
Communication, 388
avoiding algorithms, 145–146
connections, 392
systems, 391
Communicator, 391
Compensated summation algorithm, 141–142
Compressed sparse row (CSR), 259
parallel SpMV using, 219–221
storage format, 216, 216f
Computation intensity, 487
Computational microscope, 332
Computational thinking, 379–380
See also Parallel computing
skills needed for parallel programmer, 380
strategies, 382–383
techniques, 14
Compute domain, 518
Compute process code, 399–400, 400f, 403f, 405f, 406f
Compute units (CUs), 464–465
Compute-to-global-memory-access ratio, 72, 77, 93–94
Computer architecture, 380
Computer-aided design (CAD), 261–262
Configurable caching, 451–452
Configurations management, 283–285
errors and launch failures, 284
launch environment configuration, 283
memory allocation and lifetime, 283–284
nesting depth, 284
pending launch pool configuration, 284
Conjugate gradient algorithm (CG algorithm), 309–310
Conjugate gradient method, 217
“__constant__” keyword, 83
Constant memory, 156–159, 282
CUDA memory model, 156f
1D convolution kernel, 158f
Constructor for array views, 517–518
Container classes, 476–477
Control
divergence, 121
flow divergence, 215, 221
points, 287
unit, 79
Control flow efficiency, better, 451
Conventional machine learning systems, 346
Convolution, 149, 150–153
boundary condition, 151f
calculation of P[3], 151f
constant memory and caching, 156–159
kernel, 150
masks, 150
1D example, 150f
1D parallel, 153–156
pattern, 55
tiled 1D convolution, 160–165, 160f, 165–166
tiled 2D convolution, 166–172
2D example, 152f, 154f
Convolutional layers, 347–348, 349, 355–358
dE/dW calculation, 354f
dE/dX calculation, 353f
forward path, 355f
global memory traffic, 358f
kernel for forward path, 357f
parallelization of forward path, 356f
reduction to matrix multiplication, 359–364
Convolutional neural networks (ConvNets), 345, 347–355, 365
backpropagation, 351–355, 352f
basic layers, 348–351
Coordinate format (COO format), 224, 225f, 226f
reordering, 225f
sequential loop implements SpMV/COO, 226f
Core C++ AMP features, 516–522
array_view, 521–522
base Coulomb potential calculation, 521f
CUDA code, 518
CUDA vector addition, 516f
existing C99 keyword restrict, 519
providing math operations, 521
restriction specifiers, 520
row-major storage layout, 520–521
set of restrictions, 519
vector addition in, 517f
Core performance evolution
better control flow efficiency, 451
configurable caching and scratchpad, 451–452
double-precision speed, 451
enhanced atomic operations, 452–453
enhanced global memory access, 453
Corner turning, 109–111, 187–188
CPU–GPU execution of application, 4–5
Critical-path analysis, 454, 454f, 455f
cublasSaxpy function, 437–438
CUDA
applications, 62–63, 104
block, 528
built-in variables, 39
C++ AMP sharing concepts with, 515
calling CUDA device kernels from OpenACC, 439–440
calling CUDA or libraries with OpenACC arrays, 437–438
CUDA pointers in, 438
device designs, 61
device global memory, 104–105
deviceptr clause, 438f
host function, 35
host_data region, 438f
implementation of forward propagation, 355–358
interoperability with CUDA and libraries, 437–440
kernel function, 32–33
mapping between OpenCL and, 463f
and OpenACC, 435–437
performance, 436
portability, 435–436
registers and shared memory, 97
run-time API, 39–41
runtime numbers, 63
runtime systems, 59–62
simplicity, 436
streams, 409
__syncthreads() statement, 59
terminology, 437f
CUDA API function, 26–27, 28, 30, 284
for data transfer between host and device, 30f
for managing device global memory, 29f
CUDA C
architecture, 19–20
built-in mechanism, 62
calling via iso_c_binding, 499–501
and CUDA Fortran differences, 493–495
CUDA-enabled GPUs, 443–444
extending C language, 35
function declarations, 38
kernel execution control, 449–451
kernel launching, 38
keyword “__global__”, 34
keywords for function declaration, 35f
memory bandwidth, 451–453
model of host/device interaction, 444–449
program, 22–25, 288
programming environment, 453–455
programming, 192–193
providing shortcut for launching kernel, 45
run-time API, 39–41
throughput computation, 451–453
CUDA dynamic parallelism, 276
code, 301
complex example, 287–293
configurations management, 283–285
dynamic parallelism, 278–279
events synchronization, 285, 287
example, 279–281
fixed vs. dynamic grids, 277f
graph search, 276
kernel launch patterns, 278, 278f
memory data visibility, 281–283
memory management, 283–285
recursive example, 293–297
streams, 285–287
synchronization, 285
synchronization depth, 285
CUDA Fortran programming, 493
asynchronous data transfers, 504–508
calling CUDA C via iso_c_binding, 499–501
calling Thrust from, 509–513
compilation and profiling, 508–509
and CUDA C differences, 493–495
data transfer and kernel execution, 507f, 508f
dynamic shared memory, 502–503
kernel loop directives and reduction operations, 501–502
multidimensional array in, 496–497
overloading host/device routines, 498–499
program, 495–496
CUDA linear algebra library (cuBLAS), 359
CUDA memory types, 77–84
“__constant__” keyword, 83
CUDA variable type qualifiers, 82t
“__device__” keyword, 83–84
device memory model, 78f
global memory in CUDA device, 78
improving compute-to-global-memory-access ratio, 77
memory vs. registers in modern computer, 79f
pointers usage, 84
registers, 77–78
“__shared__” keyword, 83
shared memory, 77–78, 80f, 81–82
CUDA Occupancy Calculator, 126–127
CUDA program, 22
error checking and handling in, 32b
execution, 24f
model, 14–15
synchronization constraints between blocks enables transparent scalability for, 60f
CUDA thread organization, 43–47
fields, 46–47
hierarchical organizations, 44b
host code, 45
multidimensional example of CUDA grid organization, 47f
CUDA-aware message passing interface (CUDA-aware MPI), 409–410
compute process code, 407f
data server code, 408f
revised MPI SendRec calls, 409f
CUDA-enabled GPU, 458
cudaDeviceProp type, 63, 64
cudaDeviceSynchronize() function, 285
cudaFree(), 29–30, 39
cudaGetDeviceProperties function, 98
API function, 63
cudaMalloc() function, 28–29, 84, 471–472
cudaMemcpy() function, 429–430, 444, 445, 473
cudaMemcpyAsync() function, 402–403
cudaMemcpyToSymbol() function, 157
cudaStreamCreate() function, 403
cudaStreamCreateWithFlags(), 287
cudaStreamWaitEvent(), 287
cuDNN library, 364–367, 365t
Cutoff binning, 375, 376f
algorithm, 376

D

Data bits, 104–105
Data clauses, 426, 427f
array size notation, 427f
for unstructured data directives, 430f
Data directives, OpenACC, 425–430
data clause array size notation, 427f
data clauses, 426f, 430f
GPU timeline, 428f
Jacobi iterative method with data region, 427f
speed-up with addition of, 428f
unstructured data directives in C++ class, 429f
update directive with MPI halo exchange, 430f
Data management techniques, 10
Data parallel computing
See also Heterogeneous parallel computing
built-in variables, 39
CUDA C program structure, 22–25
data parallelism, 20–22
data transfer, 27–32
device global memory, 27–32
function declarations, 38
kernel functions, 32–37
kernel launching, 37–38
run-time API, 39–41
threading, 32–37
vector addition kernel, 25–27
Data parallelism, 8, 20–22, 20b
See also Memory parallelism
conversion of color image to gray-scale image, 21f
data-parallel computation, 521
model, 462–464
pixels calculation, 22f
RGB color image representation, 21b
task parallelism vs., 20b
Data server
process, 398–399, 398f
Data sharing, 380–381, 383
Data transfer, 27–32
bandwidth, 112–113
CUDA API function for, 30f
host code allocation, 30
vecAdd function, 28, 31–32
Deep copy, 440
Deep learning, 346
Denormalization, 137–138
dev_prop.maxGridSize, 63–64
dev_prop.maxThreadsPerBlock, 63
dev_prop.multiProcessorCount, 63
dev_prop.sharedMemPerBlock, 98
Device global memory, 27–32
CUDA API functions for managing, 29f
host code allocation, 30
host memory and, 28, 28f
vecAdd function, 28
“__device__” keyword, 83–84
Device management, OpenCL, 466–469
Device properties, querying, 61–64
Digital high-definition (HD) TV, 9
Digital signal processors (DSPs), 462
Direct Coulomb Summation (DCS), 332, 335f
kernel version 1, 336f
Direct memory access device (DMA device), 402
Direct3D techniques, 6
DirectX interop—rotate vertex list, 533f
Divide-and-conquer approach, 231
Domain knowledge, 380
Domain partition, 387, 390
Double buffering, 263
Double data rate (DDR), 112–113
Double-precision speed, 451
Driving direction services, 257–258
Dynamic parallelism, 35, 273, 276, 278–279, 450
Dynamic partitioning of resources, 125–127
Dynamic random access memory (DRAM), 3–4, 10, 27–28, 104–105, 444
bursts, 105, 112
channels and banks in systems, 112f
GDDR, 6–8
Dynamic shared memory, 502–503

E

Edges, 257–258
Electronic gaming, 9
Electrostatic potential
calculation, 375
energy, 333–334
Electrostatic potential map, 332, 333f, 333f
algorithms, 378
in OpenCL, 469–473
ELL storage format, 221, 222f, 226f
example in, 222f
Energy grid array, 340–341
Enhanced atomic operations, 452–453
Enhanced global memory access, 453
Error checking and handling in CUDA, 32b
Events synchronization, 285, 287
Exception handling in kernel functions, 449–450
Excess encoding of E, 133–134
excess-3 encoding, sorted by excess-3 ordering, 133f
Exclusive scan operation, 176–177
Execution configuration parameters, 36–37, 43–44
Execution model, 522–525
asynchronous operation, 523–525, 524f
explicit and implicit data copies, 522–523, 522f
supporting discrete accelerator, 525
Execution speed, 12
CUDA kernels scalability, 37–38
of matrix multiplication functions, 73
of parallel programs, 12, 103
of sequential programs, 2
Explicit data copies, 522–523, 522f
Extent, 517–518

F

fadd instruction, 80
Fast Fourier Transform (FFT), 307–308
Feature extraction, 200–201
Feature map matrix, 364
Feed forward networks, 346, 346f
Fermi GPU architecture, 450
Field programmable gate arrays (FPGAs), 462
Filter-bank matrix, 360–361
Finite difference methods, 172
Fixed grid approach, 276–277
“Flat” memory space, 49–50
Floating-point data representation, 132–134
excess encoding of E, 133–134
normalized representation of M, 132–133
Floating-point precision and accuracy validation, 326f
Fluid dynamics, 20–21, 452
FORTRAN, 111, 494
Forward propagation, 349, 349f, 350f
CUDA implementation, 355–358
Frame buffer memory, 6–8
Full connection layers, 347–348
Function calls within kernel functions, 449
Function declarations, CUDA C, 38
Fusion, 487

G

G80, 8, 485
Gangs, 416, 431, 433
Gather, 372, 372f
Gaussian elimination, 217, 309–310
algorithm, 142, 143f, 144f, 145
Gaussian filters, 149, 351
GeForce 8800 GTS, 485t
GeForce GTX 280, 485t
GeForce GTX 480, 485t, 486f
GeForce GTX 680, 342
GEneral Matrix to Matrix Multiplication (GEMM), 359, 360f
General-purpose programming interface, 6
General-Purpose Programming using GPU (GPGPU), 6
Generation 2 interface (Gen2 interface), 8
Generic algorithms, 482
Generic interfaces, overloading host/device routines, 498–499
Generic programming, 482–484
Generics, 475–477
get_global_id(0) function, 466
“Ghost cells”, 152
Giga floating-point operations per second (GFLOPS), 1, 72
“__global__” keyword, 34, 35
Global memory, 6–8, 27–28, 281–282
access, 453
in CUDA device, 78
Global memory access, enhanced, 453
Global memory bandwidth, 72, 104–112
analyzing data access pattern, 112
burst organization of modern DRAMs, 105
coalesced access pattern, 107f
coalescing hardware, 106
of CUDA device, 104–105
memory access patterns in C 2D arrays for coalescing, 106f
placing matrix elements into linear order, 106f
shared memory to enable coalescing, 110f
tiled algorithm, 111–112
tiled matrix multiplication kernel using shared memory, 111f
un-coalesced access pattern, 109f
GMAC, 4–5, 13
Gnu C Compiler (gcc), 205
Gradient backpropagation, 351
Graph data structure, 258
Graph search, 257–258
adjacency matrix representation, 259f
graph data structure, 258
graph with directional edges, 258f
optimizations, 270–273
parallel BFS function, 265–270
sequential BFS function, 262–265, 263f
sparse matrix representation, 259f
sparse representation, 260
Graphics API, 6
Graphics chips, 3–4, 6
Graphics Double Data Rate (GDDR), 6–8
Graphics processing unit (GPU), 2–3, 443–444
architecture of modern, 6–8
computing, 11
CUDA-enabled GPUs, 443–444, 446
data parallelism, 20–22
design philosophy, 4
floating-point arithmetic units, 5–6
IEEE Floating-Point Standard, 5–6
NVIDIA GTX480, 27–28
PCI-E Gen3, 8
Graphs, 257–258
Greyscale, 20–21
Grid, 33
Grid data structure, 341
Gridding approach, 307–308
Gridding computation, 307–308

H

Hadoop, 233
Halo cells, 152
tiled 1D convolution, 160–165
tiled 2D convolution, 166–172
Halo elements, 225–226
Hardware queues, 450
Hardware system, 372
Hardware trigonometry functions, 323–325
Heterogeneous computing cluster programming
See also Parallel programming
background, 388
collective communication, 408–409
CUDA-aware MPI, 409–410
overlapping computation and communication, 400–408, 401f
point-to-point communication, 393–400
programmer’s view of MPI processes, 388f
running example, 388–390
small example of memory layout, 390f
3D grid array, 390f
Heterogeneous parallel computing, 2–6
See also Data parallel computing
CPUs, 3f
GPUs, 3f, 4–5
many-thread trajectory, 2–3
memory bandwidth, 3–4
multicore trajectory, 2
NVIDIA, 6
practical form factors and easy accessibility, 5
Hierarchical parallel scan for arbitrary-length inputs, 189–192
Hierarchical queues, 271–272
Hierarchical scan for arbitrary length inputs, 189f
High-Bandwidth Memory (HBM), 6–8
High-performance computing (HPC), 13, 387, 414
High-performance parallel programs, 14
Higher-order stencil computation, 388–389
Histogram, 200
Host, 464–465
Host copy, 429–430
“__host__” keyword, 35, 38
Host/device interaction model, 444–449
host_data region, 437, 438f
Hypothetical example, 383–385
classical gridded MRI reconstruction from spiral scan data, 384f
least squares reconstruction of spiral scan data, 385f
sodium images, 384f

I

I/O devices, 444
Identity matrix, 144
IEEE Floating-Point Standard, 5–6
IEEE format, special bit patterns and precision in, 138–139
IEEE standard, 133, 134
IEEE-754 Floating-Point Standard, 132
If-then-else statement, 59
Image blur, 54–58
Implicit data copies, 522–523, 522f
Implicit ranges, 489–491, 490f
Inclusive scan operation, 176–177
Independent clause, 425
Installed base of processor, 5
Instruction Register (IR), 79
Instruction/execution divergence, 381–382
Integrated circuits, 261–262
maze routing in, 261f
Intel Pentium family, 1
Interleaved data distribution, 114–115
Interleaved partitioning, 206–207
Internal cells, 161
Internal tiles, 161
Interoperability with CUDA and libraries, 437–440
calling CUDA device kernels from OpenACC, 439–440
calling CUDA or libraries with OpenACC arrays, 437–438
CUDA pointers in, 438
deviceptr clause, 438f
host_data region, 438f
Interruptible kernels, 450–451
Intrinsic functions, 205b
Inverse Fast Fourier Transform (iFFT), 306–307
Inverse matrix, 144
iso_c_binding module, calling CUDA C via, 499–501
Iterative approaches, 217–218
Iterative solver, 228
Iterators, 477, 479–480

J

Jacobi iterative method, 388–389, 418, 419f, 432
code with loop tile clause, 433f
compiler feedback for, 423f
with data region, 427f
using parallel directive, 422f
Jagged Diagonal Storage format (JDS format), 227

K

k-space sampling trajectory, 306
Kahan’s summation algorithm, See Compensated summation algorithm
Kernel execution control
exception handling in kernel functions, 449–450
function calls within kernel functions, 449
hardware queues and dynamic parallelism, 450
interruptible kernels, 450–451
simultaneous execution of multiple kernels, 450
Kernel functions, 32–37, 77–78, 463
OpenCL, 466, 467f
“__kernel” keyword, 466
Kernel launch, 37–38
OpenCL, 466–469
overhead, 272–273
Kernel parallelism structure, 312–317
cmpMu kernel, 314f
loop fission, 314f
loop interchange, 316f, 317f
option of FHD kernel, 316f
version of FHD kernel, 312f
Kernels, 23
complex, 54–58
image blur kernel, 56f
loop directives and reduction operations, 501–502
Kernels directive, OpenACC, 419–421, 420f
and parallel directives comparison, 424–425
Kogge–Stone algorithm, 177–178
kernel for inclusive scan, 180f
parallel exclusive scan algorithm, 181f
parallel inclusive scan algorithm, 178f

L

Laplace equation, 418, 418f
LargeBin algorithm, 378
Last-level on-chip caches, 4
Latency, 207–209
hiding, 65
tolerance, 64–66, 66b
Latency-oriented design, 4
Launch
environment configuration, 283
pool size, 292
Least-squares reconstruction (LS reconstruction), 385
LeNet-5 design, 348, 348f
Lifetime, 82
of constant variable, 83
of shared variable, 83
Linear algebra functions, 51b
Linear algebra operation, 51b
Linear algorithms, 175
Linear Bezier curves, 288
Linear solvers, 142–146
iterative reconstruction algorithm, 308–309
Load balance, 273
Local memory, 282
Locality, 90, 380–382
Loop directive, 422–423
OpenACC, 430–431
Loop fission technique, 313
Loop interchange, 313
Loop optimizations, OpenACC, 430–432
adjusting loop parameters, 431f, 432f
loop directive specifying levels of parallelism, 431f
Loop parallelism, 36
Loop splitting technique, 313

M

Machine learning, 345, 346–347
ConvNets, 347–355
convolutional layer, 355–358
convolutional layer reduction to matrix multiplication, 359–364
cuDNN library, 364–367, 365t
Magnetic resonance imaging (MRI), 306, 371
application of MRI, 306
background, 306–308
blockIdx.x and threadIdx.x values, 315
Cartesian scan trajectories, 306–307
CG algorithm, 309–310
chunking k-space data, 320f
cmpMu() kernel, 314–315, 314f
computing FHD, 310–327, 311f, 312f
device, 383
experimental performance tuning, 326–327
final evaluation, 327–329
hardware trigonometry functions, 323–325
iterative reconstruction, 308–310
k-space elements, 318–319
k-space regions, 306
kernel parallelism structure, 312–317
loop fission or loop splitting, 313
loop interchange, 316f
M/MU_THREADS_PER_BLOCK blocks, 314–315
matrix–vector multiplication, 309, 310
memory bandwidth limitation, 317–323
non-Cartesian k-space sample trajectory, 308f
non-Cartesian scan trajectories, 307
physics principles behind MRI, 306
quasi-Bayesian estimation problem formulation, 309
ratio of floating-point operations, 311–312
scanner k-space trajectories, 307f
Managed memory, 446–447
Many-thread processors, 2–3
Map driving direction applications, 258
Map-reduce distributed computing frameworks, 233
Map-reduce frameworks, 231
Matrix multiplication, 73–77, 115f, 153–154, 345, 359, 374–375
actions of one thread block, 76f
convolutional layer reduction to, 359–364
example to execution, 75
execution example of matrixMulKernel, 76f
execution of for-loop, 75
execution speed of functions, 73
function generates unrolled X matrix, 362f
global memory accesses, 77
high-performance implementation, 363f
host code for invoking unroll kernel, 363f
implementation, 73
implementing forward path, 362f
kernel using one thread, 75f
matrixMulKernel, 77
using multiple blocks by tiling, 74f
tiling illustration, 84
Matrix–matrix multiplication, See Matrix multiplication
matrixMulKernel, 77
Matrix–vector multiplication (FHD), 309–327
determining kernel parallelism structure, 312–317
experimental performance tuning, 326–327
getting around memory bandwidth limitation, 317–323
hardware trigonometry functions, 323–325
Maze routing problem, 262
in integrated circuits, 261f
Memory
access throughput, 104–105
allocation and lifetime, 283–284
bound, 201
memory-bound programs, 72
space, 49b, 479–480
Memory access efficiency
effect, 77
importance, 72–73
Memory and data locality
boundary checks, 94–96
CUDA memory types, 77–84
importance of memory access efficiency, 72–73
matrix multiplication, 73–77
parallelism, memory as limiting factor to, 97–99
tiled matrix multiplication kernel, 90–94
tiling for reduced memory traffic, 84–90
Memory bandwidth, 3–4, 112–113, 270–271, 451–453
adjusting k-space data layout, 323f
chunking k-space data, 320f
effect of k-space data layout, 322f
limitation, 317–323
registers to reducing memory accesses, 319f
revised FHD kernel, 321f
Memory coalescing, 104, 206, 207f, 338–342
DCS kernel version 3, 340f
organizing threads and memory layout, 339f
reusing computation results among multiple grid points, 339f
version 2 of DCS kernel, 339f
Memory data visibility, 281–283
constant memory, 282
global memory, 281–282
local memory, 282
shared memory, 283
texture memory, 283
zero-copy memory, 282
Memory management, 283–285
errors and launch failures, 284
launch environment configuration, 283
memory allocation and lifetime, 283–284
nesting depth, 284
pending launch pool configuration, 284
Memory parallelism, 112–116
See also Data parallelism
banking improving utilization of data transfer bandwidth, 113f
channel, 112
channels and banks in DRAM systems, 112f
distributing array elements into channels and banks, 115f
distribution scheme, 114–115
M elements loaded by thread blocks, 116f
matrix multiplication, 115f
Merge sort
circular-buffer merge kernel, 249–254
co-rank function implementation, 236–241
Hadoop, 233
merge operation, 232f
ordered merge function, 231–232
parallel merge kernel, 241–242
parallelization approach, 234–236
sequential merge algorithm, 233–234
sorted vs. unsorted lists, 232f
tiled merge kernel, 242–249, 243f
merge_circular_buffer_kernel, 249
merge_tiled_kernel, 249
Message Passing Interface (MPI), 12–13, 374, 387, 391–393
barrier synchronization, 404–405
closing communication system, 391f
collective communication, 408–409
main program, 392f
MPI/CUDA programming, 387
MPI/OpenACC, 387
MPI/OpenCL, 387
MPI_Barrier() function, 404–405
MPI_Comm_rank() function, 391
MPI_Comm_size() function, 395–396
MPI_Recv() function, 393, 394f
MPI_Send() function, 393, 394f
MPI_Sendrecv() function, 405–406, 406f
overlapping computation and communication, 400–408
point-to-point communication, 393–400
Microprocessors, 1
Microscopes, 9
Modern compilers, 205
Molecular dynamics, 331
applications, 20–21, 372–373, 373f
simulation, 332
Molecular visualization and analysis
background, 332
memory coalescing, 338–342
simple kernel implementation, 333–337
single-thread CPU vs. CPU–GPU comparison, 343f
thread granularity adjustment, 337–338
Motivation, 477–478
Multidimensional arrays, 49–50
in CUDA Fortran, 496–497
Multidimensional data, threads mapping to, 47–54
host code, 48
linearized access to three-dimensional array, 54
multidimensional arrays, 49–50
1D, 2D, or 3D thread organizations, 47
row-major layout for 2D C array, 50f
source code of colorToGreyscaleConversion, 52f
2D thread grid to processing, 48f

N

Nesting depth, 284
Non-Cartesian MRI
application of MRI, 305
computing FHD, 310–327
final evaluation, 327–329
iterative reconstruction, 308–310
non-Cartesian k-space sample trajectory, 308f
non-Cartesian scan trajectories, 307
non-Cartesian trajectories, 384
scanner k-space trajectories, 307f
norm, C++ AMP type, 531
Normalized representation of M, 132–133
Not a Number (NaN), 138, 139
Numerical considerations
algorithm considerations, 140–142
arithmetic accuracy and rounding, 139–140
floating-point data representation, 132–134
linear solvers and numerical stability, 142–146
representable numbers, 134–138
special bit patterns and precision in IEEE format, 138–139, 138f
Numerical stability, 142–146
Numerically stable values, 142
Numerically unstable values, 142
NVIDIA C Compiler (NVCC), 23

O

On-chip memory, 88
1D parallel convolution, 153–156
boundary condition handling, 155f
input parameters, 154
kernel with boundary condition handling, 155f
mapping of threads to output elements, 154
Mask_Width (size of masks), 155
output element index, 154
variable P value, 155
Open Computing Language (OpenCL), 14, 19–20, 461
building OpenCL kernel, 472f
data access indexing in OpenCL and CUDA, 471f
data parallelism model, 462–464
DCS kernel version 3 NDRange configuration, 470f
development of, 462
device architecture, 464–465, 465f
device management and kernel launch, 466–469
dimensions and indices to CUDA, 464f
electrostatic potential map in, 469–473
host code for kernel launch and parameter passing, 472f
inner loop of OpenCL DCS kernel, 471f
kernel functions, 466, 467f
to managing devices, 467f
mapping between OpenCL and CUDA, 463f
mapping DCS NDRange to, 470f
parallel execution model, 463f
OpenACC
abstract machine model, 415f
advantage, 13
asynchronous computation and data, 434–435
calling CUDA device kernels from OpenACC, 439–440
calling CUDA or libraries with OpenACC arrays, 437–438
compiler output from example kernels code, 421f
and CUDA, 435–437
CUDA pointers in, 438
data directives, 425–430
deviceptr clause, 438f
directive format, 416–418, 417f
by example, 418–435
execution model, 414–416
future of, 440–441
GPU timeline of parallel loop code, 424f
host_data region, 438f
interoperability with CUDA and libraries, 437–440
kernels and parallel directives comparison, 424–425
kernels directive, 419–421, 420f
loop optimizations, 430–432
offloading execution model, 415f
parallel directive, 422–424
performance, 436
performance speed-up, 424f
portability, 435–436
regions, 417
routine directive, 432–434
simplicity, 436
terminology, 437f
OpenCL clEnqueueReadBuffer(), 473
Operand value, 78–80
Ordered merge operations, 231
Output interference, 202
Overlapping computation and communication, 400–408, 401f
device memory offsets, 404f

P

Padding, 221–224
example in ELL, 222f
hybrid approach to regulate padding, 224–227
parallel SpMV/ELL kernel, 223f
Page-locked memory buffer
See Pinned memory buffer
Parallel BFS function, 265–270
BFS host code function, 266f
block-level queue contents, 269f
kernel based on block-level privatized queues, 267f
Parallel computing
aggregation, 211–213
algorithm selection, 374–379
atom-centric arrangement, 371–372
atomic operation in cache memory, 210
atomic operations, 202–205
block partitioning vs. interleaved partitioning, 206–207
C function for calculating histogram, 201f
comparison of scalability and performance, 378
cutoff algorithms, 378
energy value of grid point, 371
goal of, 370
grid-centric arrangement, 371–372
grid-centric decomposition, 371–372
histogram, 200
hypothetical example, 383–385
issue with binning, 377
latency vs. throughput of atomic operations, 207–209
nonbonded force calculation, 372–373
parallel histogram computation, 199
privatization technique, 210–211
problem decomposition, 371–374
running time of three binned cutoff algorithms, 378, 378f
sequential tasks, 374
SPMD, shared memory and locality, 380–382
threading arrangement, 371–372
work efficiency, 377
Parallel device code, 23
Parallel directive, OpenACC, 422–424
and kernels directives comparison, 424–425
Parallel execution, 371–372
Parallel merge kernel, 241–242
Parallel programming, 12
Parallel programming languages and models, 12–14
Parallel reduction algorithm, 122
Parallel scan, 175
algorithm with 16-element input, 178–179
for arbitrary-length inputs, 189–192
background, 176–177
exclusive scan algorithm, 181f
implementation of iterative calculations, 180
inclusive scan operation, 176, 178f
kernel launch, 179
Kogge–Stone kernel for inclusive scan, 180f
as primitive operation, 177
sequential algorithm of computation, 177
simple, 177–181
single-pass scan for memory access efficiency, 192–194
work efficiency, 181–183
work-efficient, 183–189
Parallel SpMV using CSR, 219–221
Parallelism, 8–10
memory as limiting factor to, 97–99
Parallelization approach, 234–236
Pareto-Optimal-Curve-based method, 327
Pascal GPU architecture, 448, 451, 453
Peak SNR (PSNR), 324–325
Pending launch pool configuration, 284
Performance cliff, 126
Performance considerations
dynamic partitioning of resources, 125–127
global memory bandwidth, 104–112
memory parallelism, 112–116
thread granularity, 127–128
warps and SIMD hardware, 117–125
Phantom object, 324–325
Physical address spaces, 446
Pinned memory
allocation, 401–402
buffer, 402
Pivoting, 145, 146f
Pointers, CUDA, 84
Pointers model, 477
Pooling layer, See Subsampling layers
Portfolio management process, 370
Portland Group (PGI), 414, 493
#pragma keyword, 417
Precision in IEEE format, 138–139, 138f
Predefined variables, See Built-in variables
Prefix sum, 175
Privatization technique, 210–211, 257–258
Problem decomposition, 371–374
Processing elements (PEs), 464–465
Processing units, 81b
Processor cores, 1
Program Counter (PC), 79
Programming environment, 453–455
Programming interface for computing clusters, 388
overlapping computation and communication, 400–408
3D stencil computation, 388–390, 389f
Programmer productivity, 484
Projection imaging, 307
PSNR, See Peak SNR (PSNR)
PTX code, 508–509

Q

Q computation, 310–311
Quadratic Bezier curves, 288
Querying device properties, 61–64
Queues, 257–258
Quiet NaN, 139

R

Race condition, 203, 203f, 204f
Rank, 236
Read-modify-write, 202, 203, 208
Real-world performance, 485–486
Reduced memory access throughput, 126
Reduction, 423
Reduction algorithm, 121
Reduction tree, 177–178, 183
Reference count, 426–427
Register File, 78
Registers, 77–78
in CUDA, 81–82
to SM, 97
Representable numbers of floating-point format, 134–138
abrupt underflow format, 137f
algorithm considerations, 140–142
alignment shifting of operands, 140
arithmetic accuracy and rounding, 139–140
bit patterns, IEEE standard format, 138–139, 138f
3-bit unsigned integer format, 134f, 135f
denormalization format, 137f
denormalization of, 137–138
discrepancy between sequential algorithms and parallel algorithms, 141
Gaussian elimination procedure, 142, 143f, 144, 145, 146f
intervals in neighborhood of 0, 135–136
major intervals, 135–136
mantissa bits, 132
NaNs, 139
between negative infinity and positive infinity, 139
no-zero, abrupt underflow, and denorm formats, 135f
no-zero representation, 135f
precision of, 131
quiet NaNs, 139
reduction computation, 141
represent of 0, 135
SNaNs, 139
trend of increasing density, 136–137
Resource and capability queries, 62b
Resource assignment, 60–61
restrict(amp) specification, 519
restrict(cpu, amp), 519
restrict(cpu), 519
RGB color image representation, 21b
Root-mean-square (RMS), 327
Rounding, 139–140
Routine directive, OpenACC, 432–434, 433f
Routing software, 262
Row-major layout, 50
for 2D C array, 50f
Run-time API, 39–41

S

SAXPY function, 482–484
in CUDA C and Thrust, 483f
SAXPY routine, 495
Scalability, 15
Scalable parallel execution
CUDA thread organization, 43–47
image blur, 54–58
latency tolerance, 64–66, 66b
mapping threads to multidimensional data, 47–54
querying device properties, 61–64
resource assignment, 60–61
scalable parallel program, 19
synchronization and transparent scalability, 58–60
thread scheduling, 64–66
Scalar variables, 82
Scan blocks, 190
Scatter operations, 372, 372f
Scratchpad, 451–452
Scratchpad memory, 80–81
Self clause, 429–430
Semiconductor industry, 459
Sequential BFS function, 262–265, 263f
Sequential cutoff algorithm, 378
Sequential merge algorithm, 233–234
Sequential reduction algorithm, 121
Sequential SpMV/CSR loop, 218f, 220
shortcoming, 221
SpMV/ELL kernel code, 223
“__shared__” keyword, 83
Shared memory, 77–78, 126, 283, 380–382
in CUDA, 81–82
size in SM, 98
usage, 97
Signal-to-noise ratio (SNR), 306
Signaling NaNs (SNaNs), 139
Simple kernel implementation, 333–337
Simulation algorithm, 277
Single Instruction Multiple Data (SIMD), 8, 65, 117
executing threads of warp, 119
execution extending, 120
execution of revised algorithm, 124f, 125
placing 2D threads into linear order, 119f
simple sum reduction kernel, 122f, 123f
__syncthreads() statement, 122
warps implementation, 117–118
Single program multiple data (SPMD), 15, 32–33, 380–382
Single-pass scan for memory access efficiency, 192–194
Skirt cells, 161
SmallBin algorithm, 379
SmallBin-Overlap algorithm, 379
Social network, 259
graphs, 273
Sodium map of brain, 383–385
Sorting, 231
Sparse matrix computation, 215
background, 216–219
column index array, 216f
compressed storage, 217
coordinate (COO) format, 224–225
CSR storage format, 216, 216f
data padding and transposition, 221–224
dot product loop body, 223
elements of data, col_index, and row_index, 216–217
FLOPS rating, 229
format for storing, 215
Gaussian elimination, 217
hybrid approach to regulate padding, 224–227
hybrid ELL and COO method for SpMV, 225–226
iterative approach, 217–218
in JDS format, 227, 228f
JDS–ELL representation, 227, 228f
loop index iteration, 218, 219
matrix–vector multiplication and accumulation, 218f
parallel SpMV using CSR, 219–221
parallel SpMV/ELL, 223f
real-world problems, 215
in science and engineering problems, 216
sequential implementation of SpMV, 218
sequential loop, 218f
in solving linear system of N equations of N variables, 217
sorting and partitioning for regularization, 227–229
sparse matrix, 216, 216f, 279–280
SpMV computation code, 219
SpMV loop operating on, 219f
transposition of JDS-CSR representation, 228–229
Sparse Matrix–Vector (SpMV multiplication), 217–218
loop operating on sparse matrix, 219f
sequential loop implements SpMV/COO, 226f
Sparse representation, 260
Sparsely connected graph, 258–259
Special bit patterns in IEEE format, 138–139, 138f
Special function units (SFU), 323
Speeding up real applications, 10–11
Standard Template Library (STL), 475, 515–516
Statistical estimation methods
application of MRI, 305
computing FHD, 310–327
final evaluation, 327–329
iterative reconstruction, 308–310
non-Cartesian k-space sample trajectory, 308f
scanner k-space trajectories, 307f
Statistically optimal image reconstruction method, 308
Stencil computation, 149, 172
25-stencil computation, 389, 389f
Stored program model, 79
“Strategy I”, 202, 206
Stream-based scan, 193, 194
Streaming multiprocessors (SMs), 6–8, 60–61, 65, 71, 292, 464–465
executing threads in warp, 65
hardware resources (built-in registers) for, 61
registers to, 97
size of shared memory in, 98
thread block assignment to, 61f
unit of thread scheduling in, 64–65, 64f
Streaming processors (SPs), 6–8
Strip-mining, 93
Structure of Arrays approach (SoA approach), 488–489
Stub function, 31–32
Subsampling layers, 347–348, 350, 351f
Supercomputing applications, 9
Synchronization, 58–60, 285
depth, 285
Synchronous DRAM (SDRAM), 6–8
__syncthreads() function, 58, 59, 122, 180, 285, 463
System of linear equations, 142

T

Task parallelism, 20b
Tera floating-point operations per second (TFLOPS), 1
Tesla P100, 453
Texture memory, 283
Thread scheduling, 64–66
Thread-to-data mapping, 74
Thread(s), 24b, 81b
adjustment, 337–338
block partition in, 206
coarsening, 331
CUDA runtime system, 59–60, 60–61, 126
granularity, 127–128
in grid executing same kernel codes, 33f
mapping to multidimensional data, 47–54
multiple dimensions of, 118
and SIMD hardware, 119
threadIdx.x and threadIdx.y values, 118–119
threadIdx.x values with warp, 118
3D thread organizations, 47
2D thread grid, 48f
threadIdx.x, 34, 35–36, 48, 108, 111, 118, 122f, 161, 162, 166, 180, 181
Threading, 32–37
arrangement, 371–372, 372f
Three-dimensional grid (3D grid), 333–334
Throughput
of atomic operations, 207–209
computation, 451–453
throughput-oriented design, 4
Thrust parallel template library, 475
abstraction layer, 480f
array of structures data layout, 488–489, 489f
background, 475–477
benefits of abstraction, 484–486
best practices, 486–491
C++ function objects, 482b–483b
calling from CUDA Fortran, 509–513
counting iterator, 491
device_pointer_cast() function, 481
dynamic optimization, 486
features, 478–482
fill algorithm, 485
fusion, 487
generate, sort, and copy algorithms, 478
generic programming, 482–484
implicit ranges, 489–491, 490f
interfacing Thrust to CUDA C, 480, 480f, 481f
interoperability, 480–481, 481f
iterators and memory space, 479–480
kernel fusion, 487
motivation, 477–478
native CUDA C interoperability, 481
programmer productivity, 484
raw_pointer_cast() function, 480
real-world performance, 485–486
robustness, 484
salient features, 478
SAXPY functor, 482–484
saxpy_functor, 482–484
SNRM2, 488f
for solving complementary set of problems, 478
value of high-level, 478
vector containers, 478
Tile clause, 432
Tiled 1D convolution, 165–166
accessing N elements and ghost cells, 164f
with halo cells, 160–165, 160f
kernel using constant memory, 163f
Tiled 2D convolution
C type structure definition of image pixel element, 169f
with halo cells, 166–172
image array access reduction ratio, 172f
padded image format, 167f
row-major layout, 168f
starting element indices, 169f
2D convolution kernel, 170f, 171f
Tiled algorithms, 87–88, 88f, 93–94, 99, 111
Tiled execution, 528–530
Tiled matrix
See also Matrix multiplication
multiplication, 529f
multiplication algorithm, 88, 89f
multiplication kernel, 90–94
Tiled merge kernel, 242–249, 243f
See also Circular-buffer merge kernel
identifying block-level output and input subarrays, 244f
loading elements into shared memory, 245f
running example, 248f
threads merge, 246f
Tiles, 84
Tiling, 152, 153, 374–375, 380, 432
buffer, 253–254
efficiency, 152
for reduced memory traffic, 84–90
Traffic congestion, 86
Training, 345
application logic, 346
of ConvNets, 351
data sets, 354
images, 351
Transparent scalability, 58–60
Transposition, 221–224
example in ELL, 222f
parallel SpMV/ELL kernel, 223f
Trigonometry functions, 312
2D convolution kernel, 169
Two-dimensional array, 50
Two-dimensional slice (2D slice), 333–334, 334f

U

Unified Memory, 446–447
CPU code to CUDA code, 447f
Unified virtual address space (UVAS), 445
unorm, C++ AMP type, 531
Unstructured data directives, 429
data clauses for, 430f
Update device clause, 429–430

V

vecAdd function, 26f, 31, 517
complete version of host code in, 37f
vecAddKernel function, 34, 34f, 36f, 45, 48, 51, 54
Vector addition kernel function, 25–27, 34f
launch statement, 36f
pointers in C language, 26b–27b
revised vecAdd function, 26f
traditional vector addition C code example, 25f
Vector length, 416
Vectors, 416, 431, 433
Vertices, 257–258
Virtual address spaces, 446
Visual Molecular Dynamics (VMD), 332
von Neumann model, 79b
memory vs. registers, 79f
von Neumann report, 1–2

W

Warp-level queues (w-queues), 271–272
Warps
analyzing impact of control divergence, 121
approach to execution, 120
execution, 117–118
execution of revised algorithm, 124f, 125
placing 2D threads into linear order, 119f
scheduling, 65
simple sum reduction kernel, 122f, 123f
__syncthreads() statement, 122
thread blocks, 118
Work efficiency, 265
Work-efficient parallel scan, 175, 183–189
Brent–Kung kernel for inclusive scan, 186f
distribution of partial sums to positions, 184
index values, 185
minimal number of operations, 183–184
number of operations in distribution tree stage, 187
parallel inclusive scan algorithm, 183f
partial sums, 184f
reduction tree phase of, 185
scan kernel, 186, 186f
two-position addition, 184–185
Workers, 416, 431, 433

X

x86 instruction set, 2

Z

Zero-copy memory, 282, 445
Zero-overhead thread scheduling, 65