Subject Index
Note: Page numbers followed by b indicate boxes, f indicate figures, and t indicate tables.
A
Adaptive body biasing (ABB), 686–687
Affinity-based scheduling
  graph-based approach, 91–92
  irregular benchmarks, 92–93
  nonuniform data sharing, 91
  normalized L1 miss ratios, 93f, 94
  speedups of single-kernel runs, 93f
All-pair shortest path (APSP) problem
  matrix multiplication procedure, 180
    lazy minimum evaluation, 182
    repeated squaring technique, 182
    streaming block optimization, 182
  repeated squaring APSP algorithm, 180–181b
Ant colony optimization (ACO)
  constructive operator, 261
  maintenance scheduling problem, 261
  pheromone update operator, 261
  transit stop inspection, 261
Architecturally correct execution (ACE), 622–623
Architectural vulnerability factor (AVF)
  GPU’s massive vector register files, 624
  high-priority structures, 624
Arithmetic logic units (ALUs), 377, 632
Assist warps
  communication and control, 386
  dynamic feedback and throttling, 385–386
  mapping compression algorithms into, 387–393
  trigger and deployment, 385
Assist warp buffer (AWB), 384
Assist warp controller (AWC), 383–384
Augmented block Cimmino distributed (ABCD) algorithm
  linear least squares problem, 234–235
  original data storage format, 239
  partition of the system, 234f
Average bandwidth utilization, 397–398
B
Bandwidth optimizations in GPUs, 409
Basic Local Alignment Search Tool (BLAST), 132–133
Bellman-Ford algorithm, 175
  algorithm iterations, 179f
Biological sequence analysis
  DNA and protein sequence database, 128
  GPU solution as a filter, 141
    task vs. data-parallelism, 139
  graphics processing units, 128
  high-performance computing resources and techniques, 128
  pairwise sequence comparison
    bioinformatics problems, 129
    dynamic programming technique, 130–131
    global, local, and semiglobal alignments, 129–130f
    Myers-Miller (MM) algorithm, 130
    Needleman-Wunsch (NW) algorithm, 130
    Smith-Waterman (SW) algorithm, 130
Bitonic sort
  data dependency and thread allocation, 315f
  optimal parallel sorting, 314
Branch-and-Bound (B&B) algorithm, 250
  flow-shop scheduling problem, 256–257
  traveling salesman problem, 257
Breadth first search (BFS), 513, 514f
  coalesced read/write memory accesses, 172
  duplicate detection and correction strategy, 172
  dynamic virtual warps, 171
  single-block vs. multiblock kernel, 172
  vs. state-of-the-art BFS implementations, 172, 173t, 174f
  frontier-based parallel implementations
    computational complexity, 169
    CPU multicore hybrid approach, 170
    sparse graph matrices, 169
C
Central processing unit (CPU) multicore hybrid approach, 170
Coarse-grained communication and synchronization
  global barrier at the kernel level
    in-order kernel execution, 59, 59f
    out-of-order kernel execution, 59, 59f
  parallel scan algorithm, 66, 67f
  processing elements, 64–65
  software prefetching, 68, 69f
  thread divergence, 65, 66f
  average workload calculation, 63–64
  shared memory consistency, 62–63
  work donation algorithm, 63
Compressed row storage (CRS), 165–166
Compressed sparse row (CSR) format, 350–351
Compute-bound applications, 377
Compute Unified Device Architecture (CUDA), 240
  programmer/developer interface, 382–383
Constructive heuristics, 265
Control-theoretic gaming governor, 586
Cooperative thread arrays (CTAs), 376
Coordinate format (COO) SpMV, 351
Coordinate list (COO) sparse matrix, 166–167
Core and memory clock-frequency scaling
  default to 614 configuration
Core-Assisted Bottleneck Acceleration (CABA), 374
  bandwidth optimizations, 409
  effect on performance and bandwidth utilization, 398–399
  enabling different compression algorithms, 401–403
  handling interrupts and exceptions, 407
  profiling and instrumentation, 407–408
  redundant multithreading, 407
  selective cache compression with, 403–405
  speculative precomputation, 407
Correctness issues, GPU programming
  lack of progress guarantees, 10–11
  dynamic bug-finding tools, 13
  symbolic bug-finding tools, 13
CRISP performance predictor
  execution time prediction
    mean absolute percentage error, 496, 496f
  hardware mechanism and overhead, 490
  online energy saving framework, 492
  two load outstanding phases, 486–488
CRItical Stalled Path (CRISP), 483–484
D
Data compression algorithm
  initial setup and profiling, 396
  memory controller changes, 396
  compression ratio (CR), 327
  floating-point compression algorithms, 328–329
  massively parallel machines, 328
  technical constraints, 120
Data Placement and Migration Graph (DPMG), 115–116, 116f
Deep greedy switching (DGS), 264–265
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), 320–322
Deriving data access patterns
Detected unrecoverable errors (DUEs), 619
Diagonal format (DIA) SpMV, 350
Discrete memory machine (DMM), 311
Double-error correct, triple-error detect (DEC-TED) ECC, 630–631
Dynamic bandwidth allocation (DBA)
  hybrid configuration crossbar, 462
  synthetic and real application-based traffic patterns, 465–467
Dynamic bug-finding tools, 13
Dynamic mapping approaches
Dynamic power management (DPM), 582
Dynamic virtual warps, 171
Dynamic voltage and frequency scaling (DVFS), 582
  CPU-based performance models
    proportional scaling model, 474
  design space exploration, 602t
  energy and power savings, 471
  GPGPU program performance, 472, 473f
  complex stall classification, 481–482
  CRISP performance predictor (see CRItical Stalled Path (CRISP) performance predictor)
  high thread-level parallelism, 478
  homogeneity of computation, 478
  linear and machine-learning-based models, 482
  performance counter architectures, 472
  thread-level parallelism, 472
E
Early write termination (EWT) technique, 551–552
Edge-discover technique, 171
Energy and power considerations
  compute- and memory-bound codes, 510
  hardware optimizations, 509
  high-performance computing, 509
  regular and irregular codes, 510
  sample power profile, 511f
  source-code optimizations
    energy efficiency vs. performance, 530–533
    error correcting codes (ECC), 522–523
    lowest and highest settings, 530
Energy-delay product (EDP), 460, 461f
Error correcting codes (ECC), 522–523
F
Fast-bank aware register mapping mechanism (FBA-RM), 695–696
Fast-warp aware scheduling (FWAS) policy, 696–697
Filling-retreating scheme, 84, 87–88
Fine-grained communication and synchronization, 72
  memory consistency model, 73–74
  memory operation relationships, 75
  memory order parameters, 77–79
  memory scope parameters, 79
  release and acquire semantics, 77
  special atomic operations, 75–76
  stand-alone memory fence, 75–76
First-ready first-come first-serve (FR-FCFS) scheduling policy, 662
Flow-shop scheduling problem (FSP), 256–257
FPC algorithm, implementation of, 390–393
G
Gabow’s scaling algorithm, 176
General-purpose graphics processing units (GPGPUs), 23
  bandwidth selection policy vs. optimal bandwidth choices, 459, 459f
  baseline GPGPU-Sim configuration, 454, 455t
  dynamic bandwidth allocation
    hybrid configuration crossbar, 462
    synthetic and real application-based traffic patterns, 465–467
  overall performance improvement, 700–701
  register file bank reorganization, 692
  RF architecture, feasibility in, 694
  IPC degradation under VL-SB and RF-BRO
  manufacturing process variations, 650
  PV tolerant techniques, 687
  soft-error vulnerability analysis
    architecturally correct execution, 652
    architecture vulnerability factor, 652
    experimental methodologies, 654
  dynamic bandwidth allocation, 455
  GPU-memory communication, 455
  dynamic voltage and frequency scaling
    design space exploration, 602t
    Exynos 5 Octa (5422) HMPSoC, 598–599
    Odroid XU+E development platform, 598–599
    voltage-frequency settings, 599t
  state-of-the-art techniques
    AMD Trinity single-chip heterogeneous platform, 596
    workload partitioning algorithm, 597
  work-group size manipulation, 604–605
  Branch-and-Bound (B&B) algorithm
    flow-shop scheduling problem, 256–257
    traveling salesman problem, 257
  dynamic programming algorithm, 253–255
  signal processing and linear algebra, 248
  linear optimization problems, 249–250
  mixed-integer programming, 250
  research contributions, 248
  simplex method
    vs. exterior point method, 252
    interior point method, 253
    product form of the inverse, 252
    simplex tableau algorithm, 253
    steepest-edge heuristic, 252
  lack of progress guarantees, 10–11
  dot product kernel, 5–6, 5f
  warps and lock-step execution
  accelerated particle beam testing, 627–630
  architectural and program vulnerability analysis
    architecturally correct execution (ACE), 622–623
    cost-benefit analysis, 622
    hypothetical transient fault, 626
    reliability enhancing structures, 626
    SWIFI’s source-level software abstraction, 626
  radiation-induced soft errors
    detected unrecoverable errors (DUEs), 619
    permanent (hard) faults, 618
    silent data corruptions (SDC), 619
    soft-error rate (SER), 619
  load balancing and memory accesses
    semidynamic mapping techniques, 185–187
  multiphase mapping strategy, 190
  sequential implementation, 163
Graphics processing units (GPUs)
  baseline architecture, 376f
  bottlenecks in execution, 373
  capabilities and functionality, 307
  fixed pipeline functionality, 307–308
  high-level languages and template libraries, 307–308
  serial code and parallel code, 309
  single instruction multiple data (SIMD) model, 309
  streaming multiprocessor (SM) chip, 308
  generic programming strategies, 312–313
  NVIDIA Maxwell architecture, 548
  spin transfer-torque RAM (STT-RAM) technology
    early write termination (EWT) technique, 551–552
    L2 cache architecture (see two-part L2 cache)
    operation’s latency and energy, 551–552
    two-retention STT-RAM structure, 551–552
Greedy scheduling strategy, 83–84
H
Half precision data type, 12
Hardware-based management of threads, 381–382
Hardware-based SM partitioning framework, 99
Hardware reliability enhancements
  manufacturing process variations, 650
  pipeline structures protection, 632
Hash table managing algorithm, 172b
Helper threading
  programmer/developer interface, 382–383
  register file/shared memory allocation, 382
Heterogeneous computing system
Heterogeneous multiprocessor system-on-chips (HMPSoCs), power management
  abstract architectural block diagrams, 582
  control-theoretic gaming governor, 586
  default Linux Conservative governor, 595
  dynamic power management, 582
  dynamic voltage and frequency scaling, 582
  Odroid-XU+E with Exynos 5 Octa (5410) HMPSoC, 584, 585f
  power-performance trade-off, 588–589
  producer-consumer relationship, 583–584
  regression-based gaming governor, 587
  texture-directed gaming governor, 586
Hidden Markov models (HMMs)
  Plan7 profile HMM architecture, 135, 136f
  residue emission probabilities, 135
  transition probabilities, 135
Hierarchical memory machine (HMM), 311
High-performance computing (HPC) systems
  on-chip SRAM failures, 621
Hybrid format (HYB) SpMV, 351
I
Idle SM aware Full-RISE (IS-FRISE), 633
J
K
L
Lightweight performance model, 114–115
Load-store queue (LSQ) unit, 480
Los Alamos Neutron Science Center (LANSCE), 628–629
M
Magnetic tunnel junction (MTJ), 548–549
Massively parallel compression (MPC) algorithm
  compression and decompression
    double-precision data sets, 333, 333t
  double- and single-precision floating-point values, 330
  vs. floating-point compression algorithm, 330
  four chained components, 329f
  massively parallel machines, 328
Max-min ant system (MMAS), 262
Memory-bound applications, 377
Memory consistency models
  producer-consumer code, 73, 73t
  relaxed consistency, 74, 74t
  sequential consistency model, 73–74
Memory latency, in GPUs, 409
Mesh-based NoC links and switch, 450–451
Metaheuristics
  constructive operator, 261
  maintenance scheduling problem, 261
  pheromone update operator, 261
  transit stop inspection, 261
  constructive heuristics, 265
  deep greedy switching (DGS), 264–265
  design and implementation, 258
  multiobjective local search, 265
Microring resonator (MRR), 451
Myers-Miller (MM) algorithm, 130
N
Needleman-Wunsch dynamic programming algorithm, 204–205
Needleman-Wunsch (NW) algorithm, 130
Neural transformation, 418
  approximation techniques, 443
  input/output identification, 421
  topology selection and training, 423
  controlling quality trade-offs, 430–431
  cycle-accurate simulations, 434
  neurally enhanced SIMD lanes, 428–430
  neurally transformed warp, 428
  energy modeling and overhead, 435
  evaluation/training data sets, 433
  performance and energy benefits, 435–436
  vs. prior CPU neural acceleration, 441–442
  quality-control mechanism, 441
  instruction-set architecture
    GPU-specific challenges, 425
    streaming multiprocessors, 425
  neural network topology, 433
  reuse CPU neural accelerators, 420
NVIDIA’s CUDA programming model, 3–4
O
On-chip memory, unutilized, 379
  memory operation relationships, 75
  memory order parameters, 77–79
  memory scope parameters, 79
  release and acquire semantics, 77
  special atomic operations, 75–76
  stand-alone memory fence, 75–76
Open Computing Language (OpenCL), 3–4
  memory consistency model, 32
  kernel execution commands, 26–28
P
Pairwise sequence alignment
  database alignment problem, 199
  high-level parallelization strategy, 201–202
  parallel implementation, 200
  single-GPU parallelizations, 201–202
    BlockedAntidiagonal strategy, 205, 206f
    Needleman-Wunsch dynamic programming algorithm, 204–205
    recurrence equations rewriting, 204
      vs. BlockedAntidiagonal, 208
Pairwise sequence comparison
  bioinformatics problems, 129
  dynamic programming technique, 130–131
  global, local, and semiglobal alignments, 129–130f
  Myers-Miller (MM) algorithm, 130
  Needleman-Wunsch (NW) algorithm, 130
  Smith-Waterman (SW) algorithm, 130
Parallel thread execution (PTX) code, 88
Photonic network-on-chip (NoC)
  interconnect architecture model
    interconnects bandwidth, 452
    total energy consumption, 453
    total photonic interconnect energy, 454
  mesh-based NoC links and switch, 450–451
PORPLE
  benchmark description, 117t
  data placement and migration graph, 115–116
  deriving data access patterns
    lightweight performance model, 114–115
  memory latency description, 117, 117t
  memory specification through MSL
    architectural specifications, 108
    “upperLevels” and “lowerLevels”, 109
  runtime library activation, 106
Power management of mobile GPUs
  gaming application problems, 612
Power profiles
  expected, regular, and irregular profile comparison, 515–516
  idealized power profile, 512f
  single-source shortest paths (SSSP), 513, 514f
Processing elements (PEs), 64–65
Product Form of the Inverse (PFI), 252
Q
R
Recycling the streaming processors Idle time for Soft-Error detection (RISE), 633–634
Redundant multithreading (RMT)
  Dimitrov’s kernel-level replication, 637–638
  simulated low-latency intercore communication, 636
Register transfer language (RTL) models, 623
Regression-based gaming governor, 587
Relaxed consistency, 74, 74t
Request pending aware Full-RISE (RP-FRISE), 633
Round-robin warp scheduling, 85, 99
Running Average Power Limit (RAPL) functionality, 331
S
Semidynamic mapping techniques, 185–187
Sequence-profile comparison, 157t
  Hidden Markov models (HMMs)
    Plan7 profile HMM architecture, 135, 136f
    residue emission probabilities, 135
    transition probabilities, 135
  HMMER3
    hmmsearch tool implementation, MSV algorithm
      Araújo Neto and Moreano, 156
  HMMER2
    hmmsearch tool implementation, Viterbi algorithm
      sequence family and transfer information, 133
Sequential consistency model, 73–74
Shared memory allocation, helper threading, 382
Shortest paths computations
  using Dijkstra, better-scaling partitioned, 300
  betweenness centrality, 273
  decentralized APSP analysis, 295–296
  SSST query algorithm analysis, 296–297
  ex situ tropical product, 292
  diagonal and nondiagonal submatrices, 280–281
  distance matrix representation, 280, 281f
  in situ tropical product, 292
Silent data corruptions (SDC), 619
Simultaneous multithreading (SMT), 408
Single error correct, double error detect (SEC-DED), 624
Single-instruction multiple-data (SIMD), 64–65, 309, 599
Single-source shortest path (SSSP) problem, 513, 514f
  Bellman-Ford algorithm, 175
    algorithm iterations, 179f
  Gabow’s scaling algorithm, 176
Single write multiple read (SWMR) photonic crossbar, 461–462
Single-GPU parallelizations
  BlockedAntidiagonal strategy, 205, 206f
  Needleman-Wunsch dynamic programming algorithm, 204–205
  recurrence equations rewriting, 204
Smith-Waterman variations, 132
SnuCL
  extensions to OpenCL, 46–47
  limitations of OpenCL, 33–38
  evaluation methodology, 48
  scalability on large-scale GPU cluster, 51–53
  scalability on medium-scale GPU cluster, 50–51
  target cluster architecture, 24, 25f
  command handler thread, 42
  command scheduler thread, 40
  consistency management, 45
  detecting memory objects, 45–46
  minimizing copying overhead, 43–44
  synchronization commands, 42
Soft-error rate (SER), 619
Soft-error vulnerability analysis
  architecturally correct execution, 652
  architecture vulnerability factor, 652
  experimental methodologies, 654
  GPGPU microarchitecture structures
    first dynamically dead (FDD) instructions, 654–655
    transitively dynamically dead (TDD) instructions, 654–655, 655f
  microarchitecture-level soft-error vulnerability
Software-level scheduling, 96–99
Software-level task scheduling
  affinity-based scheduling
    graph-based approach, 91–92
    irregular benchmarks, 92–93
    nonuniform data sharing, 91
    normalized L1 miss ratios, 93f, 94
    speedups of single-kernel runs, 93f
  filling-retreating scheme, 84, 87–88
  greedy scheduling strategy, 83–84
    nondeterministic across runs, 85
    oblivious to nonuniform data sharing, 86
    serializing multikernel co-runs, 86
  performance problems, 88–89
  SM-based task selection, 84
  average normalized turnaround time, 95–96, 97f
  multiprogram co-runs, 94–96
  software approaches, 96–99
  temporal scheduling, 83–84
Software reliability enhancements
  dual-modular redundancy, 635
  dynamic-reliability management, 640–643
  redundant multithreading (RMT)
    Dimitrov’s kernel-level replication, 637–638
    simulated low-latency intercore communication, 636
  Sphere of Replication, 635
  symptom-based fault detection, 640–643
Spin transfer-torque RAM (STT-RAM) technology
  early write termination (EWT) technique, 551–552
  L2 cache architecture (see two-part L2 cache)
  operation’s latency and energy, 551–552
  two-retention STT-RAM structure, 551–552
Sorting
  data dependency and thread allocation, 315f
  optimal parallel sorting, 314
Sparse matrix-vector multiplication (SpMV)
  cop20K_A and Scircuit, 367
  mac_econ_fwd500 and webbase-1M, 367
  compressed sparse row (CSR) format, 350–351
  coordinate format (COO), 351
  diagonal format (DIA), 350
  representation formats of, 353f
Streaming processors (SPs), 201
Symbolic bug-finding tools, 13
T
Texture-directed gaming governor, 586
Thread communication and synchronization
  built-in atomic functions
    microbenchmark kernels, 70
  dynamic programming recurrence, 222
  three-dimensional matrix partitioning, 221, 222f
  GPU computational strategy, 223
Tool support, GPU correctness issues, 12
  dynamic bug-finding tools, 13
  symbolic bug-finding tools, 13
Traveling salesman problem (TSP), 257
Tree-structured minimum-finder (TMF) units, 566
Two-part L2 cache
  detailed structural parameters, 569t
  dynamic write threshold detection
    total cache write cost, 565
  execution time and IPC values, 570, 571f
  latencies, energies, and utilization, 567f
  LR and HR cache retention times, 567–568
  LR cache size and structure, 567
  parallel vs. sequential search mechanism, 573
  power and performance comparison, 572, 573f
  rewrite access interval distribution, 557f
U
Unified memory machine (UMM), 311
V
Virtual warps load balancing, 184–185
W
Warped-dual-modular redundancy (DMR), 632–633
Weak memory consistency model
Work-item coalescing regions (WCRs), 39