Index

Note: Page numbers followed by “b”, “f”, and “t” refer to boxes, figures, and tables, respectively.

A

Abrupt underflow convention, 137
Abstraction, 484–486
memory bandwidth of filling kernels, 485t
programmer productivity, 484
real-world performance, 485–486
robustness, 484
Accelerator, 517
device, 414
management, 525–528, 526f
accelerator class, 525–526
accelerator::set_default static method, 526
accelerator_view, C++ AMP, 526
AddVecKernel function, 36, 120
Adjacency matrix, 258
representation of simple graph, 259f
sparse matrix representation of, 259f
Adjacent synchronization, 193
Algorithm, 153–156
considerations, 140–142
selection, 374–379
AMD Opteron family, 1
Amdahl’s law, 10, 373–374
ANSI C code, 23, 34, 35–36
ANSI C standard, 49
Apodization filtering function, 306
Application programming interfaces (APIs), 6, 19–20
communication functions, 388
Applications software, 2
Arithmetic accuracy, 139–140
Arithmetic and logic unit (ALU), 78–80
Arithmetic instructions, 78–80
Array data layout
column-major layout, 50–51
row-major layout, 50–51, 50f, 168f
Array of structs, 322
Array of Structures (AoS), 488
array_view, 517–518, 521–523
Asynchronous, 405, 409
async and wait, 434f
computation and data, 434–435
data transfers, 504–508
operation, 523–525, 524f
pipelining with async and wait, 435f
Atomic operations, 202–205, 334–335
atomic operation in cache memory, 210
CUDA kernel for calculation histogram, 206f
in Fermi GPU, 452–453
intrinsic functions, 205b
race condition, 203f, 204f
strategy I for parallelizing histogram computation, 202f
throughput of, 207–209
Atomic operations, enhanced, 452–453
Audio digital signal processing, 150
Auto clause, 425

B

Backward convolution, 353
Backward substitution, 143
Bank conflict, 114
Barrier synchronizations, 58, 117, 404–405, 407–408
execution timing of, 59f
__syncthreads() points, 59
Barrier __syncthreads(), 93
Basic Linear Algebra Subprograms (BLAS), 50–51, 73
Bezier curve
calculation, 288–290, 301f
linear, 288
quadratic, 288
Bi-directional relations, 258
Big Data, 9–10, 200–201
Binning process, 377–378
Block partitioning, 206–207
Block-level queue, 268–269, 269f, 270
blockDim variable, 33–34
blockDim.x variable, 33–34, 35–36
gridDim.x threads, 51, 207
blockIdx variable, 34
blockIdx.x, 34, 35–36
Bottlenecks, 103
Boundary checks, 94–96
Boundary tile, 160–161
Brain, sodium map of, 383–385
Breadth-first search (BFS), 260–262, 260f
maze routing in integrated circuits, 261f
Brent–Kung adder design, 183
Built-in variables, 28b, 39

C

C language
ANSI C code, 23
convention, 150
CUDA C extending, 35
malloc() function, 396, 402
multidimensional array, 49
pointers, 26b–27b
preprocessor directive, 26–27
run-time library, 28–29
traditional C compiler, 23
traditional C program, 22, 25
2D array, 49
C++ Accelerated Massive Parallelism (C++ AMP), 515
array_view, 517–518, 521–523
asynchronous operation, 523–525, 524f
data-parallel computation, 521
execution model, 522–525
explicit and implicit data copies, 522–523
extension to current C++ 11 standard, 515–516
extensions to language, 516
features, 516–522
focus of, 515
for_each function template, 518
graphics features, 531–533
managing accelerators, 525–528, 526f, 527f
sharing concepts with CUDA, 515
tiled execution, 528–530
vehicle for reading and writing large data collections, 517
C++ function, 475–476
classes, 476–477
objects, 482b–483b
Cache
cache memory, atomic operation in, 210
coherence mechanism, 159
L1 cache, 159
L2 cache, 159
halo cells, 166f
hierarchy of modern processors, 158f
Carpooling, 87
arrangements, 87–88
Cartesian scan trajectory, 306–307
Central processing unit (CPU), 1
CPU-based parallel programming models, 461–462
serial code, 23
Circular-buffer merge kernel, 249–254
See also Tiled merge kernel
co_rank_circular function operates on, 255f
managing shared memory tiles, 250f
merge_sequential_circular function implementation, 255f
part of, 251f, 254f
simplified model for co-rank values, 252f
clCreateBuffer() function, 471–472
clEnqueueNDRangeKernel() function, 473
clReleaseMemObject() function, 473
Co-rank function, 234
execution, 237f
function based on binary search, 238f
implementation, 236–241
operation example, 239f, 240f
Coalescing, 105
access pattern, 107f
coalesced pattern, 242
memory access patterns in C 2D arrays for, 106f
shared memory to enable coalescing, 110f
Code, 301
Bezier curve calculation without dynamic parallelism, 301f
quadtree with dynamic parallelism, 302f, 303f
Collective communication function, 404–405
Column-major layout, 50–51
Communication, 388
avoiding algorithms, 145–146
connections, 392
systems, 391
Communicator, 391
Compensated summation algorithm, 141–142
Compressed sparse row (CSR), 259
parallel SpMV using, 219–221
storage format, 216, 216f
Computation intensity, 487
Computational microscope, 332
Computational thinking, 379–380
See also Parallel computing
skills needed for parallel programmer, 380
strategies, 382–383
techniques, 14
Compute domain, 518
Compute process code, 399–400, 400f, 403f, 405f, 406f
Compute units (CUs), 464–465
Compute-to-global-memory-access ratio, 72, 77, 93–94
Computer architecture, 380
Computer-aided design (CAD), 261–262
Configurable caching, 451–452
Configurations management, 283–285
errors and launch failures, 284
launch environment configuration, 283
memory allocation and lifetime, 283–284
nesting depth, 284
pending launch pool configuration, 284
Conjugate gradient algorithm (CG algorithm), 309–310
Conjugate gradient method, 217
“__constant__” keyword, 83
Constant memory, 156–159, 282
CUDA memory model, 156f
1D convolution kernel, 158f
Constructor for array views, 517–518
Container classes, 476–477
Control
divergence, 121
flow divergence, 215, 221
points, 287
unit, 79
Control flow efficiency, better, 451
Conventional machine learning systems, 346
Convolution, 149, 150–153
boundary condition, 151f
calculation of P[3], 151f
constant memory and caching, 156–159
kernel, 150
masks, 150
1D example, 150f
1D parallel, 153–156
pattern, 55
tiled 1D convolution, 160–165, 160f, 165–166
tiled 2D convolution, 166–172
2D example, 152f, 154f
Convolutional layers, 347–348, 349, 355–358
dE/dW calculation, 354f
dE/dX calculation, 353f
forward path, 355f
global memory traffic, 358f
kernel for forward path, 357f
parallelization of forward path, 356f
reduction to matrix multiplication, 359–364
Convolutional neural networks (ConvNets), 345, 347–355, 365
backpropagation, 351–355, 352f
basic layers, 348–351
Coordinate format (COO format), 224, 225f, 226f
reordering, 225f
sequential loop implements SpMV/COO, 226f
Core C++ AMP features, 516–522
array_view, 521–522
base Coulomb potential calculation, 521f
CUDA code, 518
CUDA vector addition, 516f
existing C99 keyword restrict, 519
providing math operations, 521
restriction specifiers, 520
row-major storage layout, 520–521
set of restrictions, 519
vector addition in, 517f
Core performance evolution
better control flow efficiency, 451
configurable caching and scratchpad, 451–452
double-precision speed, 451
enhanced atomic operations, 452–453
enhanced global memory access, 453
Corner turning, 109–111, 187–188
CPU–GPU execution of application, 4–5
Critical-path analysis, 454, 454f, 455f
cublasSaxpy function, 437–438
CUDA
applications, 62–63, 104
block, 528
built-in variables, 39
C++ AMP sharing concepts with, 515
calling CUDA device kernels from OpenACC, 439–440
calling CUDA or libraries with OpenACC arrays, 437–438
CUDA pointers in, 438
device designs, 61
device global memory, 104–105
deviceptr clause, 438f
host function, 35
host_data region, 438f
implementation of forward propagation, 355–358
interoperability with CUDA and libraries, 437–440
kernel function, 32–33
mapping between OpenCL and, 463f
and OpenACC, 435–437
performance, 436
portability, 435–436
registers and shared memory, 97
run-time API, 39–41
runtime numbers, 63
runtime systems, 59–62
simplicity, 436
streams, 409
__syncthreads() statement, 59
terminology, 437f
CUDA API function, 26–27, 28, 30, 284
for data transfer between host and device, 30f
for managing device global memory, 29f
CUDA C
architecture, 19–20
built-in mechanism, 62
calling via iso_c_binding, 499–501
and CUDA Fortran differences, 493–495
CUDA-enabled GPUs, 443–444
extending C language, 35
function declarations, 38
kernel execution control, 449–451
kernel launching, 38
keyword “__global__”, 34
keywords for function declaration, 35f
memory bandwidth, 451–453
model of host/device interaction, 444–449
program, 22–25, 288
programming environment, 453–455
programming, 192–193
providing shortcut for launching kernel, 45
run-time API, 39–41
throughput computation, 451–453
CUDA dynamic parallelism, 276
code, 301
complex example, 287–293
configurations management, 283–285
dynamic parallelism, 278–279
events synchronization, 285, 287
example, 279–281
fixed vs. dynamic grids, 277f
graph search, 276
kernel launch patterns, 278, 278f
memory data visibility, 281–283
memory management, 283–285
recursive example, 293–297
streams, 285–287
synchronization, 285
synchronization depth, 285
CUDA Fortran programming, 493
asynchronous data transfers, 504–508
calling CUDA C via iso_c_binding, 499–501
calling Thrust from, 509–513
compilation and profiling, 508–509
and CUDA C differences, 493–495
data transfer and kernel execution, 507f, 508f
dynamic shared memory, 502–503
kernel loop directives and reduction operations, 501–502
multidimensional array in, 496–497
overloading host/device routines, 498–499
program, 495–496
CUDA linear algebra library (cuBLAS), 359
CUDA memory types, 77–84
“__constant__” keyword, 83
CUDA variable type qualifiers, 82t
“__device__” keyword, 83–84
device memory model, 78f
global memory in CUDA device, 78
improving compute-to-global-memory-access ratio, 77
memory vs. registers in modern computer, 79f
pointers usage, 84
registers, 77–78
“__shared__” keyword, 83
shared memory, 77–78, 80f, 81–82
CUDA Occupancy Calculator, 126–127
CUDA program, 22
error checking and handling in, 32b
execution, 24f
model, 14–15
synchronization constraints between blocks enables transparent scalability for, 60f
CUDA thread organization, 43–47
fields, 46–47
hierarchical organizations, 44b
host code, 45
multidimensional example of CUDA grid organization, 47f
CUDA-aware message passing interface (CUDA-aware MPI), 409–410
compute process code, 407f
data server code, 408f
revised MPI SendRec calls, 409f
CUDA-enabled GPU, 458
cudaDeviceProp type, 63, 64
cudaDeviceSynchronize() function, 285
cudaFree(), 29–30, 39
cudaGetDeviceProperties function, 98
API function, 63
cudaMalloc() function, 28–29, 84, 471–472
cudaMemcpy() function, 429–430, 444, 445, 473
cudaMemcpyAsync() function, 402–403
cudaMemcpyToSymbol() function, 157
cudaStreamCreate() function, 403
cudaStreamCreateWithFlags(), 287
cudaStreamWaitEvent(), 287
cuDNN library, 364–367, 365t
Cutoff binning, 375, 376f
algorithm, 376

D

Data bits, 104–105
Data clauses, 426, 427f
array size notation, 427f
for unstructured data directives, 430f
Data directives, OpenACC, 425–430
data clause array size notation, 427f
data clauses, 426f, 430f
GPU timeline, 428f
Jacobi iterative method with data region, 427f
speed-up with addition of, 428f
unstructured data directives in C++ class, 429f
update directive with MPI halo exchange, 430f
Data management techniques, 10
Data parallel computing
See also Heterogeneous parallel computing
built-in variables, 39
CUDA C program structure, 22–25
data parallelism, 20–22
data transfer, 27–32
device global memory, 27–32
function declarations, 38
kernel functions, 32–37
kernel launching, 37–38
run-time API, 39–41
threading, 32–37
vector addition kernel, 25–27
Data parallelism, 8, 20–22, 20b
See also Memory parallelism
conversion of color image to gray-scale image, 21f
data-parallel computation, 521
model, 462–464
pixels calculation, 22f
RGB color image representation, 21b
task parallelism vs., 20b
Data server
process, 398–399, 398f
Data sharing, 380–381, 383
Data transfer, 27–32
bandwidth, 112–113
CUDA API function for, 30f
host code allocation, 30
vecAdd function, 28, 31–32
Deep copy, 440
Deep learning, 346
Denormalization, 137–138
dev_prop.maxGridSize, 63–64
dev_prop.maxThreadsPerBlock, 63
dev_prop.multiProcessorCount, 63
dev_prop.sharedMemPerBlock, 98
Device global memory, 27–32
CUDA API functions for managing, 29f
host code allocation, 30
host memory and, 28, 28f
vecAdd function, 28
“__device__” keyword, 83–84
Device management, OpenCL, 466–469
Device properties, querying, 61–64
Digital high-definition (HD) TV, 9
Digital signal processors (DSPs), 462
Direct Coulomb Summation (DCS), 332, 335f
kernel version 1, 336f
Direct memory access device (DMA device), 402
Direct3D techniques, 6
DirectX interop—rotate vertex list, 533f
Divide-and-conquer approach, 231
Domain knowledge, 380
Domain partition, 387, 390
Double buffering, 263
Double data rate (DDR), 112–113
Double-precision speed, 451
Driving direction services, 257–258
Dynamic parallelism, 35, 273, 276, 278–279, 450
Dynamic partitioning of resources, 125–127
Dynamic random access memory (DRAM), 3–4, 10, 27–28, 104–105, 444
bursts, 105, 112
channels and banks in systems, 112f
GDDR, 6–8
Dynamic shared memory, 502–503

E

Edges, 257–258
Electronic gaming, 9
Electrostatic potential
calculation, 375
energy, 333–334
Electrostatic potential map, 332, 333f, 333f
algorithms, 378
in OpenCL, 469–473
ELL storage format, 221, 222f, 226f
example in, 222f
Energy grid array, 340–341
Enhanced atomic operations, 452–453
Enhanced global memory access, 453
Error checking and handling in CUDA, 32b
Events synchronization, 285, 287
Exception handling in kernel functions, 449–450
Excess encoding of E, 133–134
excess-3 encoding, sorted by excess-3 ordering, 133f
Exclusive scan operation, 176–177
Execution configuration parameters, 36–37, 43–44
Execution model, 522–525
asynchronous operation, 523–525, 524f
explicit and implicit data copies, 522–523, 522f
supporting discrete accelerator, 525
Execution speed, 12
CUDA kernels scalability, 37–38
of matrix multiplication functions, 73
of parallel programs, 12, 103
of sequential programs, 2
Explicit data copies, 522–523, 522f
Extent, 517–518

F

fadd instruction, 80
Fast Fourier Transform (FFT), 307–308
Feature extraction, 200–201
Feature map matrix, 364
Feed forward networks, 346, 346f
Fermi GPU architecture, 450
Field programmable gate arrays (FPGAs), 462
Filter-bank matrix, 360–361
Finite difference methods, 172
Fixed grid approach, 276–277
“Flat” memory space, 49–50
Floating-point data representation, 132–134
excess encoding of E, 133–134
normalized representation of M, 132–133
Floating-point precision and accuracy validation, 326f
Fluid dynamics, 20–21, 452
FORTRAN, 111, 494
Forward propagation, 349, 349f, 350f
CUDA implementation, 355–358
Frame buffer memory, 6–8
Full connection layers, 347–348
Function calls within kernel functions, 449
Function declarations, CUDA C, 38
Fusion, 487

G

G80, 8, 485
Gangs, 416, 431, 433
Gather, 372, 372f
Gaussian elimination, 217, 309–310
algorithm, 142, 143f, 144f, 145
Gaussian filters, 149, 351
GeForce 8800 GTS, 485t
GeForce GTX 280, 485t
GeForce GTX 480, 485t, 486f
GeForce GTX 680, 342
GEneral Matrix to Matrix Multiplication (GEMM), 359, 360f
General-purpose programming interface, 6
General-Purpose Programming using GPU (GPGPU), 6
Generation 2 interface (Gen2 interface), 8
Generic algorithms, 482
Generic interfaces, overloading host/device routines, 498–499
Generic programming, 482–484
Generics, 475–477
get_global_id(0) function, 466
“Ghost cells”, 152
Giga floating-point operations per second (GFLOPS), 1, 72
“__global__” keyword, 34, 35
Global memory, 6–8, 27–28, 281–282
access, 453
in CUDA device, 78
Global memory access, enhanced, 453
Global memory bandwidth, 72, 104–112
analyzing data access pattern, 112
burst organization of modern DRAMs, 105
coalesced access pattern, 107f
coalescing hardware, 106
of CUDA device, 104–105
memory access patterns in C 2D arrays for coalescing, 106f
placing matrix elements into linear order, 106f
shared memory to enable coalescing, 110f
tiled algorithm, 111–112
tiled matrix multiplication kernel using shared memory, 111f
un-coalesced access pattern, 109f
GMAC, 4–5, 13
Gnu C Compiler (gcc), 205
Gradient backpropagation, 351
Graph data structure, 258
Graph search, 257–258
adjacency matrix representation, 259f
graph data structure, 258
graph with directional edges, 258f
optimizations, 270–273
parallel BFS function, 265–270
sequential BFS function, 262–265, 263f
sparse matrix representation, 259f
sparse representation, 260
Graphics API, 6
Graphics chips, 3–4, 6
Graphics Double Data Rate (GDDR), 6–8
Graphics processing unit (GPU), 2–3, 443–444
architecture of modern, 6–8
computing, 11
CUDA-enabled GPUs, 443–444, 446
data parallelism, 20–22
design philosophy, 4
floating-point arithmetic units, 5–6
IEEE Floating-Point Standard, 5–6
NVIDIA GTX480, 27–28
PCI-E Gen3, 8
Graphs, 257–258
Greyscale, 20–21
Grid, 33
Grid data structure, 341
Gridding approach, 307–308
Gridding computation, 307–308

H

Hadoop, 233
Halo cells, 152
tiled 1D convolution, 160–165
tiled 2D convolution, 166–172
Halo elements, 225–226
Hardware queues, 450
Hardware system, 372
Hardware trigonometry functions, 323–325
Heterogeneous computing cluster programming
See also Parallel programming
background, 388
collective communication, 408–409
CUDA-aware MPI, 409–410
overlapping computation and communication, 400–408, 401f
point-to-point communication, 393–400
programmer’s view of MPI processes, 388f
running example, 388–390
small example of memory layout, 390f
3D grid array, 390f
Heterogeneous parallel computing, 2–6
See also Data parallel computing
CPUs, 3f
GPUs, 3f, 4–5
many-thread trajectory, 2–3
memory bandwidth, 3–4
multicore trajectory, 2
NVIDIA, 6
practical form factors and easy accessibility, 5
Hierarchical parallel scan for arbitrary-length inputs, 189–192
Hierarchical queues, 271–272
Hierarchical scan for arbitrary length inputs, 189f
High-Bandwidth Memory (HBM), 6–8
High-performance computing (HPC), 13, 387, 414
High-performance parallel programs, 14
Higher-order stencil computation, 388–389
Histogram, 200
Host, 464–465
Host copy, 429–430
“__host__” keyword, 35, 38
Host/device interaction model, 444–449
host_data region, 437, 438f
Hypothetical example, 383–385
classical gridded MRI reconstruction from spiral scan data, 384f
least squares reconstruction of spiral scan data, 385f
sodium images, 384f

I

I/O devices, 444
Identity matrix, 144
IEEE Floating-Point Standard, 5–6
IEEE format, special bit patterns and precision in, 138–139
IEEE standard, 133, 134
IEEE-754 Floating-Point Standard, 132
If-then-else statement, 59
Image blur, 54–58
Implicit data copies, 522–523, 522f
Implicit ranges, 489–491, 490f
Inclusive scan operation, 176–177
Independent clause, 425
Installed base of processor, 5
Instruction Register (IR), 79
Instruction/execution divergence, 381–382
Integrated circuits, 261–262
maze routing in, 261f
Intel Pentium family, 1
Interleaved data distribution, 114–115
Interleaved partitioning, 206–207
Internal cells, 161
Internal tiles, 161
Interoperability with CUDA and libraries, 437–440
calling CUDA device kernels from OpenACC, 439–440
calling CUDA or libraries with OpenACC arrays, 437–438
CUDA pointers in, 438
deviceptr clause, 438f
host_data region, 438f
Interruptible kernels, 450–451
Intrinsic functions, 205b
Inverse Fast Fourier Transform (iFFT), 306–307
Inverse matrix, 144
iso_c_binding module, calling CUDA C via, 499–501
Iterative approaches, 217–218
Iterative solver, 228
Iterators, 477, 479–480

J

Jacobi iterative method, 388–389, 418, 419f, 432
code with loop tile clause, 433f
compiler feedback for, 423f
with data region, 427f
using parallel directive, 422f
Jagged Diagonal Storage format (JDS format), 227

K

k-space sampling trajectory, 306
Kahan’s summation algorithm, See Compensated summation algorithm
Kernel execution control
exception handling in kernel functions, 449–450
function calls within kernel functions, 449
hardware queues and dynamic parallelism, 450
interruptible kernels, 450–451
simultaneous execution of multiple kernels, 450
Kernel functions, 32–37, 77–78, 463
OpenCL, 466, 467f
“__kernel” keyword, 466
Kernel launch, 37–38
OpenCL, 466–469
overhead, 272–273
Kernel parallelism structure, 312–317
cmpMu kernel, 314f
loop fission, 314f
loop interchange, 316f, 317f
option of FHD kernel, 316f
version of FHD kernel, 312f
Kernels, 23
complex, 54–58
image blur kernel, 56f
loop directives and reduction operations, 501–502
Kernels directive, OpenACC, 419–421, 420f
and parallel directives comparison, 424–425
Kogge–Stone algorithm, 177–178
kernel for inclusive scan, 180f
parallel exclusive scan algorithm, 181f
parallel inclusive scan algorithm, 178f

L

Laplace equation, 418, 418f
LargeBin algorithm, 378
Last-level on-chip caches, 4
Latency, 207–209
hiding, 65
tolerance, 64–66, 66b
Latency-oriented design, 4
Launch
environment configuration, 283
pool size, 292
Least-squares reconstruction (LS reconstruction), 385
LeNet-5 design, 348, 348f
Lifetime, 82
of constant variable, 83
of shared variable, 83
Linear algebra functions, 51b
Linear algebra operation, 51b
Linear algorithms, 175
Linear Bezier curves, 288
Linear solvers, 142–146
iterative reconstruction algorithm, 308–309
Load balance, 273
Local memory, 282
Locality, 90, 380–382
Loop directive, 422–423
OpenACC, 430–431
Loop fission technique, 313
Loop interchange, 313
Loop optimizations, OpenACC, 430–432
adjusting loop parameters, 431f, 432f
loop directive specifying levels of parallelism, 431f
Loop parallelism, 36
Loop splitting technique, 313

M

Machine learning, 345, 346–347
ConvNets, 347–355
convolutional layer, 355–358
convolutional layer reduction to matrix multiplication, 359–364
cuDNN library, 364–367, 365t
Magnetic resonance imaging (MRI), 306, 371
application of MRI, 306
background, 306–308
blockIdx.x and threadIdx.x values, 315
Cartesian scan trajectories, 306–307
CG algorithm, 309–310
chunking k-space data, 320f
cmpMu() kernel, 314–315, 314f
computing FHD, 310–327, 311f, 312f
device, 383
experimental performance tuning, 326–327
final evaluation, 327–329
hardware trigonometry functions, 323–325
iterative reconstruction, 308–310
k-space elements, 318–319
k-space regions, 306
kernel parallelism structure, 312–317
loop fission or loop splitting, 313
loop interchange, 316f
M/MU_THREADS_PER_BLOCK blocks, 314–315
matrix–vector multiplication, 309, 310
memory bandwidth limitation, 317–323
non-Cartesian k-space sample trajectory, 308f
non-Cartesian scan trajectories, 307
physics principles behind MRI, 306
quasi-Bayesian estimation problem formulation, 309
ratio of floating-point operations, 311–312
scanner k-space trajectories, 307f
Managed memory, 446–447
Many-thread processors, 2–3
Map driving direction applications, 258
Map-reduce distributed computing frameworks, 233
Map-reduce frameworks, 231
Matrix multiplication, 73–77, 115f, 153–154, 345, 359, 374–375
actions of one thread block, 76f
convolutional layer reduction to, 359–364
example to execution, 75
execution example of matrixMulKernel, 76f
execution of for-loop, 75
execution speed of functions, 73
function generates unrolled X matrix, 362f
global memory accesses, 77
high-performance implementation, 363f
host code for invoking unroll kernel, 363f
implementation, 73
implementing forward path, 362f
kernel using one thread, 75f
matrixMulKernel, 77
using multiple blocks by tiling, 74f
tiling illustration, 84
Matrix–matrix multiplication, See Matrix multiplication
matrixMulKernel, 77
Matrix–vector multiplication (FHD), 309–327
determining kernel parallelism structure, 312–317
experimental performance tuning, 326–327
getting around memory bandwidth limitation, 317–323
hardware trigonometry functions, 323–325
Maze routing problem, 262
in integrated circuits, 261f
Memory
access throughput, 104–105
allocation and lifetime, 283–284
bound, 201
memory-bound programs, 72
space, 49b, 479–480
Memory access efficiency
effect, 77
importance, 72–73
Memory and data locality
boundary checks, 94–96
CUDA memory types, 77–84
importance of memory access efficiency, 72–73
matrix multiplication, 73–77
parallelism, memory as limiting factor to, 97–99
tiled matrix multiplication kernel, 90–94
tiling for reduced memory traffic, 84–90
Memory bandwidth, 3–4, 112–113, 270–271, 451–453
adjusting k-space data layout, 323f
chunking k-space data, 320f
effect of k-space data layout, 322f
limitation, 317–323
registers to reducing memory accesses, 319f
revised FHD kernel, 321f
Memory coalescing, 104, 206, 207f, 338–342
DCS kernel version 3, 340f
organizing threads and memory layout, 339f
reusing computation results among multiple grid points, 339f
version 2 of DCS kernel, 339f
Memory data visibility, 281–283
constant memory, 282
global memory, 281–282
local memory, 282
shared memory, 283
texture memory, 283
zero-copy memory, 282
Memory management, 283–285
errors and launch failures, 284
launch environment configuration, 283
memory allocation and lifetime, 283–284
nesting depth, 284
pending launch pool configuration, 284
Memory parallelism, 112–116
See also Data parallelism
banking improving utilization of data transfer bandwidth, 113f
channel, 112
channels and banks in DRAM systems, 112f
distributing array elements into channels and banks, 115f
distribution scheme, 114–115
M elements loaded by thread blocks, 116f
matrix multiplication, 115f
Merge sort
circular-buffer merge kernel, 249–254
co-rank function implementation, 236–241
Hadoop, 233
merge operation, 232f
ordered merge function, 231–232
parallel merge kernel, 241–242
parallelization approach, 234–236
sequential merge algorithm, 233–234
sorted vs. unsorted lists, 232f
tiled merge kernel, 242–249, 243f
merge_circular_buffer_kernel, 249
merge_tiled_kernel, 249
Message Passing Interface (MPI), 12–13, 374, 387, 391–393
barrier synchronization, 404–405
closing communication system, 391f
collective communication, 408–409
main program, 392f
MPI/CUDA programming, 387
MPI/OpenACC, 387
MPI/OpenCL, 387
MPI_Barrier() function, 404–405
MPI_Comm_rank() function, 391
MPI_Comm_size() function, 395–396
MPI_Recv() function, 393, 394f
MPI_Send() function, 393, 394f
MPI_Sendrecv() function, 405–406, 406f
overlapping computation and communication, 400–408
point-to-point communication, 393–400
Microprocessors, 1
Microscopes, 9
Modern compilers, 205
Molecular dynamics, 331
applications, 20–21, 372–373, 373f
simulation, 332
Molecular visualization and analysis
background, 332
memory coalescing, 338–342
simple kernel implementation, 333–337
single-thread CPU vs. CPU–GPU comparison, 343f
thread granularity adjustment, 337–338
Motivation, 477–478
Multidimensional arrays, 49–50
in CUDA Fortran, 496–497
Multidimensional data, threads mapping to, 47–54
host code, 48
linearized access to three-dimensional array, 54
multidimensional arrays, 49–50
1D, 2D, or 3D thread organizations, 47
row-major layout for 2D C array, 50f
source code of colorToGreyscaleConversion, 52f
2D thread grid to processing, 48f

N

Nesting depth, 284
Non-Cartesian MRI
application of MRI, 305
computing FHD, 310–327
final evaluation, 327–329
iterative reconstruction, 308–310
non-Cartesian k-space sample trajectory, 308f
non-Cartesian scan trajectories, 307
non-Cartesian trajectories, 384
scanner k-space trajectories, 307f
norm, C++ AMP type, 531
Normalized representation of M, 132–133
Not a Number (NaN), 138, 139
Numerical considerations
algorithm considerations, 140–142
arithmetic accuracy and rounding, 139–140
floating-point data representation, 132–134
linear solvers and numerical stability, 142–146
representable numbers, 134–138
special bit patterns and precision in IEEE format, 138–139, 138f
Numerical stability, 142–146
Numerically stable values, 142
Numerically unstable values, 142
NVIDIA C Compiler (NVCC), 23

O

On-chip memory, 88
1D parallel convolution, 153–156
boundary condition handling, 155f
input parameters, 154
kernel with boundary condition handling, 155f
mapping of threads to output elements, 154
Mask_Width (size of masks), 155
output element index, 154
variable P value, 155
Open Computing Language (OpenCL), 14, 19–20, 461
building OpenCL kernel, 472f
data access indexing in OpenCL and CUDA, 471f
data parallelism model, 462–464
DCS kernel version 3 NDRange configuration, 470f
development of, 462
device architecture, 464–465, 465f
device management and kernel launch, 466–469
dimensions and indices to CUDA, 464f
electrostatic potential map in, 469–473
host code for kernel launch and parameter passing, 472f
inner loop of OpenCL DCS kernel, 471f
kernel functions, 466, 467f
to managing devices, 467f
mapping between OpenCL and CUDA, 463f
mapping DCS NDRange to, 470f
parallel execution model, 463f
OpenACC
abstract machine model, 415f
advantage, 13
asynchronous computation and data, 434–435
calling CUDA device kernels from OpenACC, 439–440
calling CUDA or libraries with OpenACC arrays, 437–438
compiler output from example kernels code, 421f
and CUDA, 435–437
CUDA pointers in, 438
data directives, 425–430
deviceptr clause, 438f
directive format, 416–418, 417f
by example, 418–435
execution model, 414–416
future of, 440–441
GPU timeline of parallel loop code, 424f
host_data region, 438f
interoperability with CUDA and libraries, 437–440
kernels and parallel directives comparison, 424–425
kernels directive, 419–421, 420f
loop optimizations, 430–432
offloading execution model, 415f
parallel directive, 422–424
performance, 436
performance speed-up, 424f
portability, 435–436
regions, 417
routine directive, 432–434
simplicity, 436
terminology, 437f
OpenCL clEnqueueReadBuffer(), 473
Operand value, 78–80
Ordered merge operations, 231
Output interference, 202
Overlapping computation and communication, 400–408, 401f
device memory offsets, 404f

P

Padding, 221–224
example in ELL, 222f
hybrid approach to regulate padding, 224–227
parallel SpMV/ELL kernel, 223f
Page-locked memory buffer
See Pinned memory buffer
Parallel BFS function, 265–270
BFS host code function, 266f
block-level queue contents, 269f
kernel based on block-level privatized queues, 267f
Parallel computing
aggregation, 211–213
algorithm selection, 374–379
atom-centric arrangement, 371–372
atomic operation in cache memory, 210
atomic operations, 202–205
block partitioning vs. interleaved partitioning, 206–207
C function for calculating histogram, 201f
comparison of scalability and performance, 378
cutoff algorithms, 378
energy value of grid point, 371
goal of, 370
grid-centric arrangement, 371–372
grid-centric decomposition, 371–372
histogram, 200
hypothetical example, 383–385
issue with binning, 377
latency vs. throughput of atomic operations, 207–209
nonbonded force calculation, 372–373
parallel histogram computation, 199
privatization technique, 210–211
problem decomposition, 371–374
running time of three binned cutoff algorithms, 378, 378f
sequential tasks, 374
SPMD, shared memory and locality, 380–382
threading arrangement, 371–372
work efficiency, 377
Parallel device code, 23
Parallel directive, OpenACC, 422–424
and kernels directives comparison, 424–425
Parallel execution, 371–372
Parallel merge kernel, 241–242
Parallel programming, 12
Parallel programming languages and models, 12–14
Parallel reduction algorithm, 122
Parallel scan, 175
algorithm with 16-element input, 178–179
for arbitrary-length inputs, 189–192
background, 176–177
exclusive scan algorithm, 181f
implementation of iterative calculations, 180
inclusive scan operation, 176, 178f
kernel launch, 179
Kogge–Stone kernel for inclusive scan, 180f
as primitive operation, 177
sequential algorithm of computation, 177
simple, 177–181
single-pass scan for memory access efficiency, 192–194
work efficiency, 181–183
work-efficient, 183–189
Parallel SpMV using CSR, 219–221
Parallelism, 8–10
memory as limiting factor to, 97–99
Parallelization approach, 234–236
Pareto-Optimal-Curve-based method, 327
Pascal GPU architecture, 448, 451, 453
Peak SNR (PSNR), 324–325
Pending launch pool configuration, 284
Performance cliff, 126
Performance considerations
dynamic partitioning of resources, 125–127
global memory bandwidth, 104–112
memory parallelism, 112–116
thread granularity, 127–128
warps and SIMD hardware, 117–125
Phantom object, 324–325
Physical address spaces, 446
Pinned memory
allocation, 401–402
buffer, 402
Pivoting, 145, 146f
Pointers, CUDA, 84
Pointers model, 477
Pooling layer, See Subsampling layers
Portfolio management process, 370
Portland Group (PGI), 414, 493
#pragma keyword, 417
Precision in IEEE format, 138–139, 138f
Predefined variables, See Built-in variables
Prefix sum, 175
Privatization technique, 210–211, 257–258
Problem decomposition, 371–374
Processing elements (PEs), 464–465
Processing units, 81b
Processor cores, 1
Program Counter (PC), 79
Programming environment, 453–455
Programming interface for computing clusters, 388
overlapping computation and communication, 400–408
3D stencil computation, 388–390, 389f
Programmer productivity, 484
Projection imaging, 307
PSNR, See Peak SNR (PSNR)
PTX code, 508–509

Q

Q computation, 310–311
Quadratic Bezier curves, 288
Querying device properties, 61–64
Queues, 257–258
Quiet NaN, 139

R

Race condition, 203, 203f, 204f
Rank, 236
Read-modify-write, 202, 203, 208
Real-world performance, 485–486
Reduced memory access throughput, 126
Reduction, 423
Reduction algorithm, 121
Reduction tree, 177–178, 183
Reference count, 426–427
Register File, 78
Registers, 77–78
in CUDA, 81–82
to SM, 97
Representable numbers of floating-point format, 134–138
abrupt underflow format, 137f
algorithm considerations, 140–142
alignment shifting of operands, 140
arithmetic accuracy and rounding, 139–140
bit patterns, IEEE standard format, 138–139, 138f
3-bit unsigned integer format, 134f, 135f
denormalization format, 137f
denormalization of, 137–138
discrepancy between sequential algorithms and parallel algorithms, 141
Gaussian elimination procedure, 142, 143f, 144, 145, 146f
intervals in neighborhood of 0, 135–136
major intervals, 135–136
mantissa bits, 132
NaNs, 139
between negative infinity and positive infinity, 139
no-zero, abrupt underflow, and denorm formats, 135f
no-zero representation, 135f
precision of, 131
quiet NaNs, 139
reduction computation, 141
represent of 0, 135
SNaNs, 139
trend of increasing density, 136–137
Resource and capability queries, 62b
Resource assignment, 60–61
restrict(amp) specification, 519
restrict(cpu, amp), 519
restrict(cpu), 519
RGB color image representation, 21b
Root-mean-square (RMS), 327
Rounding, 139–140
Routine directive, OpenACC, 432–434, 433f
Routing software, 262
Row-major layout, 50
for 2D C array, 50f
Run-time API, 39–41

S

SAXPY function, 482–484
in CUDA C and Thrust, 483f
SAXPY routine, 495
Scalability, 15
Scalable parallel execution
CUDA thread organization, 43–47
image blur, 54–58
latency tolerance, 64–66, 66b
mapping threads to multidimensional data, 47–54
querying device properties, 61–64
resource assignment, 60–61
scalable parallel program, 19
synchronization and transparent scalability, 58–60
thread scheduling, 64–66
Scalar variables, 82
Scan blocks, 190
Scatter operations, 372, 372f
Scratchpad, 451–452
Scratchpad memory, 80–81
Self clause, 429–430
Semiconductor industry, 459
Sequential BFS function, 262–265, 263f
Sequential cutoff algorithm, 378
Sequential merge algorithm, 233–234
Sequential reduction algorithm, 121
Sequential SpMV/CSR loop, 218f, 220
shortcoming, 221
SpMV/ELL kernel code, 223
“__shared__” keyword, 83
Shared memory, 77–78, 126, 283, 380–382
in CUDA, 81–82
size in SM, 98
usage, 97
Signal-to-noise ratio (SNR), 306
Signaling NaNs (SNaNs), 139
Simple kernel implementation, 333–337
Simulation algorithm, 277
Single Instruction Multiple Data (SIMD), 8, 65, 117
executing threads of warp, 119
execution extending, 120
execution of revised algorithm, 124f, 125
placing 2D threads into linear order, 119f
simple sum reduction kernel, 122f, 123f
__syncthreads() statement, 122
warps implementation, 117–118
Single program multiple data (SPMD), 15, 32–33, 380–382
Single-pass scan for memory access efficiency, 192–194
Skirt cells, 161
SmallBin algorithm, 379
SmallBin-Overlap algorithm, 379
Social network, 259
graphs, 273
Sodium map of brain, 383–385
Sorting, 231
Sparse matrix computation, 215
background, 216–219
column index array, 216f
compressed storage, 217
coordinate (COO) format, 224–225
CSR storage format, 216, 216f
data padding and transposition, 221–224
dot product loop body, 223
elements of data, col_index, and row_index, 216–217
FLOPS rating, 229
format for storing, 215
Gaussian elimination, 217
hybrid approach to regulate padding, 224–227
hybrid ELL and COO method for SpMV, 225–226
iterative approach, 217–218
in JDS format, 227, 228f
JDS–ELL representation, 227, 228f
loop index iteration, 218, 219
matrix–vector multiplication and accumulation, 218f
parallel SpMV using CSR, 219–221
parallel SpMV/ELL, 223f
real-world problems, 215
in science and engineering problems, 216
sequential implementation of SpMV, 218
sequential loop, 218f
in solving linear system of N equations of N variables, 217
sorting and partitioning for regularization, 227–229
sparse matrix, 216, 216f, 279–280
SpMV computation code, 219
SpMV loop operating on, 219f
transposition of JDS-CSR representation, 228–229
Sparse Matrix–Vector (SpMV multiplication), 217–218
loop operating on sparse matrix, 219f
sequential loop implements SpMV/COO, 226f
Sparse representation, 260
Sparsely connected graph, 258–259
Special bit patterns in IEEE format, 138–139, 138f
Special function units (SFU), 323
Speeding up real applications, 10–11
Standard Template Library (STL), 475, 515–516
Statistical estimation methods
application of MRI, 305
computing FHD, 310–327
final evaluation, 327–329
iterative reconstruction, 308–310
non-Cartesian k-space sample trajectory, 308f
scanner k-space trajectories, 307f
Statistically optimal image reconstruction method, 308
Stencil computation, 149, 172
25-stencil computation, 389, 389f
Stored program model, 79
“Strategy I”, 202, 206
Stream-based scan, 193, 194
Streaming multiprocessors (SMs), 6–8, 60–61, 65, 71, 292, 464–465
executing threads in warp, 65
hardware resources (built-in registers) for, 61
registers to, 97
size of shared memory in, 98
thread block assignment to, 61f
unit of thread scheduling in, 64–65, 64f
Streaming processors (SPs), 6–8
Strip-mining, 93
Structure of Arrays approach (SoA approach), 488–489
Stub function, 31–32
Subsampling layers, 347–348, 350, 351f
Supercomputing applications, 9
Synchronization, 58–60, 285
depth, 285
Synchronous DRAM (SDRAM), 6–8
__syncthreads() function, 58, 59, 122, 180, 285, 463
System of linear equations, 142

T

Task parallelism, 20b
Tera floating-point operations per second (TFLOPS), 1
Tesla P100, 453
Texture memory, 283
Thread scheduling, 64–66
Thread-to-data mapping, 74
Thread(s), 24b, 81b
adjustment, 337–338
block partition in, 206
coarsening, 331
CUDA runtime system, 59–60, 60–61, 126
granularity, 127–128
in grid executing same kernel codes, 33f
mapping to multidimensional data, 47–54
multiple dimensions of, 118
and SIMD hardware, 119
threadIdx.x and threadIdx.y values, 118–119
threadIdx.x values with warp, 118
3D thread organizations, 47
2D thread grid, 48f
threadIdx.x, 34, 35–36, 48, 108, 111, 118, 122f, 161, 162, 166, 180, 181
Threading, 32–37
arrangement, 371–372, 372f
Three-dimensional grid (3D grid), 333–334
Throughput
of atomic operations, 207–209
computation, 451–453
throughput-oriented design, 4
Thrust parallel template library, 475
abstraction layer, 480f
array of structures data layout, 488–489, 489f
background, 475–477
benefits of abstraction, 484–486
best practices, 486–491
C++ function objects, 482b–483b
calling from CUDA Fortran, 509–513
counting iterator, 491
device_pointer_cast() function, 481
dynamic optimization, 486
features, 478–482
fill algorithm, 485
fusion, 487
generate, sort, and copy algorithms, 478
generic programming, 482–484
implicit ranges, 489–491, 490f
interfacing Thrust to CUDA C, 480, 480f, 481f
interoperability, 480–481, 481f
iterators and memory space, 479–480
kernel fusion, 487
motivation, 477–478
native CUDA C interoperability, 481
programmer productivity, 484
raw_pointer_cast() function, 480
real-world performance, 485–486
robustness, 484
salient features, 478
SAXPY functor, 482–484
saxpy_functor, 482–484
SNRM2, 488f
for solving complementary set of problems, 478
value of high-level, 478
vector containers, 478
Tile clause, 432
Tiled 1D convolution, 165–166
accessing N elements and ghost cells, 164f
with halo cells, 160–165, 160f
kernel using constant memory, 163f
Tiled 2D convolution
C type structure definition of image pixel element, 169f
with halo cells, 166–172
image array access reduction ratio, 172f
padded image format, 167f
row-major layout, 168f
starting element indices, 169f
2D convolution kernel, 170f, 171f
Tiled algorithms, 87–88, 88f, 93–94, 99, 111
Tiled execution, 528–530
Tiled matrix
See also Matrix multiplication
multiplication, 529f
multiplication algorithm, 88, 89f
multiplication kernel, 90–94
Tiled merge kernel, 242–249, 243f
See also Circular-buffer merge kernel
identifying block-level output and input subarrays, 244f
loading elements into shared memory, 245f
running example, 248f
threads merge, 246f
Tiles, 84
Tiling, 152, 153, 374–375, 380, 432
buffer, 253–254
efficiency, 152
for reduced memory traffic, 84–90
Traffic congestion, 86
Training, 345
application logic, 346
of ConvNets, 351
data sets, 354
images, 351
Transparent scalability, 58–60
Transposition, 221–224
example in ELL, 222f
parallel SpMV/ELL kernel, 223f
Trigonometry functions, 312
2D convolution kernel, 169
Two-dimensional array, 50
Two-dimensional slice (2D slice), 333–334, 334f

U

Unified Memory, 446–447
CPU code to CUDA code, 447f
Unified virtual address space (UVAS), 445
unorm, C++ AMP type, 531
Unstructured data directives, 429
data clauses for, 430f
Update device clause, 429–430

V

vecAdd function, 26f, 31, 517
complete version of host code in, 37f
vecAddKernel function, 34, 34f, 36f, 45, 48, 51, 54
Vector addition kernel function, 25–27, 34f
launch statement, 36f
pointers in C language, 26b–27b
revised vecAdd function, 26f
traditional vector addition C code example, 25f
Vector length, 416
Vectors, 416, 431, 433
Vertices, 257–258
Virtual address spaces, 446
Visual Molecular Dynamics (VMD), 332
von Neumann model, 79b
memory vs. registers, 79f
von Neumann report, 1–2

W

Warp-level queues (w-queues), 271–272
Warps
analyzing impact of control divergence, 121
approach to execution, 120
execution, 117–118
execution of revised algorithm, 124f, 125
placing 2D threads into linear order, 119f
scheduling, 65
simple sum reduction kernel, 122f, 123f
__syncthreads() statement, 122
thread blocks, 118
Work efficiency, 265
Work-efficient parallel scan, 175, 183–189
Brent–Kung kernel for inclusive scan, 186f
distribution of partial sums to positions, 184
index values, 185
minimal number of operations, 183–184
number of operations in distribution tree stage, 187
parallel inclusive scan algorithm, 183f
partial sums, 184f
reduction tree phase of, 185
scan kernel, 186, 186f
two-position addition, 184–185
Workers, 416, 431, 433

X

x86 instruction set, 2

Z

Zero-copy memory, 282, 445
Zero-overhead thread scheduling, 65