Subject Index

Note: Page numbers followed by b indicate boxes, f indicate figures, and t indicate tables.

A

Adaptive body biasing (ABB) 686–687
Adjacency lists 165–166
Adjacency matrix 163–165
Affinity-based scheduling 
cluster enlargement 92
graph-based approach 91–92
graph partitioning 92
irregular benchmarks 92–93
nonuniform data sharing 91
normalized L1 miss ratios 93f, 94
seed selection 92
sorted list creation 92
speedups of single-kernel runs 93f
All-pairs shortest path (APSP) problem 
Floyd-Warshall algorithm 
memory coalescing 182
phases 181
pseudocode 181, 181b
Johnson algorithm 180
matrix multiplication procedure 180
lazy minimum evaluation 182
repeated squaring technique 182
streaming block optimization 182
repeated squaring APSP algorithm 180–181b
Ant colony algorithms 
ACO implementation 260
constructive operator 261
2D table 260–261
maintenance scheduling problem 261
max-min ant system 262
pheromone update operator 261
prefix-sum algorithm 261
transit stop inspection 261
Architecturally correct execution (ACE) 622–623
Architectural vulnerability factor (AVF) 
vs. CPU AVF analysis 624
CPU-GPU architecture 625
GPU’s massive vector register files 624
high-priority structures 624
Little’s law 623
Arithmetic logic units (ALUs) 377, 632
Assist warp 374
communication and control 386
dynamic feedback and throttling 385–386
execution 385
mapping compression algorithms into 387–393
trigger and deployment 385
Assist warp buffer (AWB) 384
Assist warp controller (AWC) 383–384
Assist warp store (AWS) 383, 384, 385f, 393–394
Assist warp table (AWT) 385, 385f
Augmented block Cimmino distributed (ABCD) algorithm 
appending matrix 236
augmentation 233, 235
augmented matrix 235
boundary padding 
augmented matrix 240, 241f
branch divergence 240, 244–245
CUDA core 240
Givens rotation 240, 241, 242f
speedup 244–245
coalesced memory access 
coalesced format transformation 239–240, 239f
original data storage format 239
speedup 243–244
thread-block 238–240
vs. CPU implementation 242–243
future aspects 245
linear least square problem 234–235
parallel algorithms 233–234
partition of the system 234f
QR method 237–238
sparse storage format 238, 239f
Average bandwidth utilization 397–398

B

Bandwidth optimizations in GPUs 409
Basic Local Alignment Search Tool (BLAST) 132–133
BLASTN 133
BLASTP 133
evaluation 133
extension 133
Mega-BLAST 133
seeding 132–133
BDI compression algorithm 387–393
Bellman-Ford algorithm 175
algorithm iterations 179f
frontier-based 178, 178b
performance 180
relax operations 177–178
sequential 177b
speedups comparison 178, 179f
Biological sequence analysis 
algorithms 128
DNA and protein sequence database 128
evolutionary history 127
GPU design aspects 
GPU solution as a filter 141
memory hierarchy 140–141
sequence database sorting 139–140
task vs. data-parallelism 139
graphics processing units 128
high-performance computing resources and techniques 128
pairwise sequence comparison 
affine gap model 130
bioinformatics problems 129
dynamic programming technique 130–131
global, local, and semiglobal alignments 129–130f
Gotoh algorithm 130
linear gap model 130
metric GCUPS 131
Myers-Miller (MM) algorithm 130
Needleman-Wunsch (NW) algorithm 130
scoring 129–130
sequence alignment 129
Smith-Waterman (SW) algorithm 130
sequence-profile comparison  See (Sequence-profile comparison)
Bitonic sort algorithms 
bitonic merge 313b
bitonic sequence 313
bitonic split 314b
data dependency and thread allocation 315f
GPU implementation 315
optimal parallel sorting 314
Block search 188–189
Boundary padding 
augmented matrix 240, 241f
branch divergence 240, 244–245
CUDA core 240
Givens rotation 240–241, 242f
speedup 244–245
Branch-and-Bound (B&B) algorithm 250
flow-shop scheduling problem 256–257
Knapsack problem 255–256
traveling salesman problem 257
Breadth-first search (BFS) 513, 514f
BFS-4K 
coalesced read/write memory accesses 172
duplicate detection and correction strategy 172
dynamic parallelism 171
dynamic virtual warps 171
edge-discover 171
exclusive prefix-sum 170
single-block vs. multiblock kernel 172
vs. state-of-the-art BFS implementations 172, 173t, 174f
frontier-based parallel implementations 
computational complexity 169
CPU multicore hybrid approach 170
CSR representation 169
edge parallelism 169
filtering steps 168–169
frontier propagation 168
sparse graph matrices 169
virtual warp size 169
workload imbalance 170
sequential algorithm 167, 167b
Bucketing method 176–177

C

Central processing unit (CPU) multicore hybrid approach 170
Checkerboard-based NoC 450–451
Coalesced expansion 190
Coalesced memory access 
coalesced format transformation 239–240, 239f
original data storage format 239
speedup 243–244
thread-block 238–240
Coarse-grained communication and synchronization 
execution phases 58
global barrier at the kernel level 
custom function 60–61
data loading 59
generic kernel 60–61
initialization cost 59
in-order kernel execution 59, 59f
limitations 61–62
out-of-order kernel execution 59, 59f
skeletal host code 59–60
work-group suspension 59
implicit barrier 
AMD Southern Islands GPUs 64–65, 65f
compute-wavefronts 67–68
computing cores 64–65
DMA-wavefronts 67–68
intrablock scan 66–67, 68f
intrawarp scan 66–67, 67f
parallel scan algorithm 66, 67f
processing elements 64–65
SIMD unit 64–65
software prefetching 68, 69f
thread divergence 65, 66f
local barrier 
average workload calculation 63–64
deadlock 63
execution phases 62
shared memory consistency 62–63
task assignment 63–64
work donation algorithm 63
local memory 58
Compressed row storage (CRS) 165–166
Compressed sparse row (CSR) format 350–351
Compression, CABA for 374–375
Compute-bound applications 377
Compute Unified Device Architecture (CUDA) 240
applications 378f
programmer/developer interface 382–383
Computing cores 64–65
Constructive heuristics 265
Control-theoretic gaming governor 586
Cooperative thread arrays (CTAs) 376
Coordinate format (COO) SpMV 351
Coordinate list (COO) sparse matrix 166–167
Core and memory clock-frequency scaling 
default to 614 configuration 
compute-bound CUDA SDK 520–521
core speed 519–520
LonestarGPU 520
memory-bound codes 520
runtime changes 519
614 to 324 configuration 521–522
Core-Assisted Bottleneck Acceleration (CABA) 374
applications of 386
bandwidth optimizations 409
compression 409–410
for compression 374–375
data compression 386–396
design of 381–383
effect on performance and bandwidth utilization 398–399
enabling different compression algorithms 401–403
energy, effect on 399–401
framework 380–386, 383f
generality of 375
goals and challenges 380–381
handling interrupts and exceptions 407
helper threading 408–409
high-level block diagram 383–384
leverages 376
mapping BDI to 388–389
mechanism 385–386
memoization 406
memory bandwidth 403
memory latency 409
methodology 397–398
optimizations 404
prefetching 406–408
profiling and instrumentation 407–408
redundant multithreading 407
results 398–405
selective cache compression with 403–405
speculative precomputation 407
Correctness issues, GPU programming 4
data races 8–10
floating-point accuracy 11–12, 11f
lack of progress guarantees 10–11
tool support 12
dynamic bug-finding tools 13
symbolic bug-finding tools 13
verification tools 13–14
C-Pack algorithm 391–392, 392b
CRISP performance predictor 
compute/store path portion 484–486
critical stalled path 483–484
energy savings 490–492, 496–501
execution time prediction 
breakdown 494f
high scaling factor 494–496
lm kernel 494–496, 497f
mean absolute percentage error 496, 496f
STALL 494
target frequencies 494, 495f
generality 501–502
GPGPU configuration 490–492, 491t
hardware mechanism and overhead 490
LCP portion 484
leading load 486
loads and computations 486–488
miss model 486
online energy saving framework 492
online software 493
parameterizing 488–489
Rodinia benchmark 492
simplified 501
STALL model 486
static and dynamic power 492–493
static power 492
TMemory computation 487f
two outstanding load phases 486–488
CRItical Stalled Path (CRISP) 483–484

D

Data compression algorithm 
CABA 386–396
BDI compression algorithm 387–393
compression mechanism 395–396
C-Pack algorithm 391–392
decompression mechanism 393–395
FPC algorithm 390–393
initial setup and profiling 396
memory controller changes 396
compression ratio (CR) 327
data model and coder 328
floating-point compression algorithms 328–329
inverse model 328–329
massively parallel machines 328
Data placement 
GPU programs 120
PORPLE  See (PORPLE system)
static analysis 120–121
technical constraints 120
Data Placement and Migration Graph (DPMG) 115–116, 116f
Deep greedy switching (DGS) 264–265
Dennard scaling 417
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) 320–322
Deriving data access patterns 
placement-agnostic code 112–114
reuse distance model 111–112
Detected unrecoverable errors (DUEs) 619
Diagonal format (DIA) SpMV 350
Dijkstra algorithm 175
Direct search 187–188
Discrete memory machine (DMM) 311
Double-error correct, triple-error detect (DEC-TED) ECC 630–631
Dynamic bandwidth allocation (DBA) 
d-HetPNoC 462–465
hybrid configuration crossbar 462
mechanism 462
SWMR photonic crossbar 461–462
synthetic and real application-based traffic patterns 465–467
Dynamic bug-finding tools 13
Dynamic mapping approaches 
block search 188–189
direct search 187–188
local warp search 188
two phase search 189–190
Dynamic parallelism 171
Dynamic power management (DPM) 582
Dynamic virtual warps 171
Dynamic voltage and frequency scaling (DVFS) 582
categories 472
CPU-based performance models 
analytical model 474–475
critical path 475
empirical model 474
execution time 474–476, 477f
leading load 475
miss model 475
proportional scaling model 474
sampling model 474
stall time 475
CPU cores 471–472
CPU program performance 472, 473f
2DCONV and SYR2K applications 602–604, 603f
design space exploration 602t
energy and power savings 471
energy-efficiency 603–604
GPGPU program performance 472, 473f
in GPGPUs 
complex stall classification 481–482
core clock frequency scaling 476–478f
CRISP performance predictor (see CRItical Stalled Path (CRISP) performance predictor) 
frequency range 476–478
GPU analytical model 482
high thread-level parallelism 478
homogeneity of computation 478
L1 cache miss 482
linear and machine-learning-based models 482
memory/computation overlap 478–480
store stalls 480–481
memory/computation 472
performance counter architectures 472
thread-level parallelism 472

E

Early write termination (EWT) technique 551–552
Edge-discover technique 171
Edge list 166–167
ELLPACK format SpMV 351
Energy and power considerations 
algorithm implementation 516–518
arithmetic precision 518
benchmark programs 510–511
challenges 509
code optimizations 509–510
compute- and memory-bound codes 510
core and memory clock-frequency scaling  See (Core and memory clock-frequency scaling)
hardware optimizations 509
high-performance computing 509
input sensitivity 522, 523f
power profiling  See (Power profiling)
regular and irregular codes 510
sample power profile 511f
source-code optimizations 
effectiveness 527–530
energy efficiency vs. performance 530–533
error correcting codes (ECC) 522–523
lowest and highest settings 530
most biased optimizations 533–535
none vs. all optimizations 524–527
six code optimizations 523–524
system and compiler 511
Energy-delay product (EDP) 460, 461f
Error correcting codes (ECC) 522–523

F

Fast-bank aware register mapping mechanism (FBA-RM) 695–696
Fast-warp aware scheduling (FWAS) policy 696–697
Filling-retreating scheme 84, 87–88
Fine-grained communication and synchronization 72
memory consistency model 73–74
OpenCL 2.0 memory model 
memory operation relationships 75
memory order parameters 77–79
memory scope parameters 79
release and acquire semantics 77
special atomic operations 75–76
stand-alone memory fence 75–76
First-ready first-come first-served (FR-FCFS) scheduling policy 662
Flow-shop scheduling problem (FSP) 256–257
Floyd-Warshall algorithm 181, 181b
FPC algorithm, implementation of 390–393

G

Gabow’s scaling algorithm 176
General-purpose graphics processing units (GPGPUs) 23
bandwidth selection policy 
dynamic power 460
energy-delay-product 460, 461f
vs. optimal bandwidth choices 459, 459f
power consumption 460, 461f
baseline GPGPU-Sim configuration 454, 455t
BFS benchmark 458f
dynamic bandwidth allocation 
d-HetPNoC 462–465
hybrid configuration crossbar 462
mechanism 462
SWMR photonic crossbar 461–462
synthetic and real application-based traffic patterns 465–467
evaluations 
IPC improvement 698–700
overall performance improvement 700–701
frequency optimization 
implementation 692–693
modeling PV impact 688–690
register file bank reorganization 692
RF architecture, feasibility in 694
variable-latency subbanks 690–692
IPC degradation under VL-SB and RF-BRO 
FBA-RM 695–696
FWAS policy 696–697
register mapping 695
leukocyte benchmark 457, 458f
manufacturing process variations 650
memory accesses 457
programming models 651
to PVs 
ABB and gate sizing 686–687
contributions 687–688
FMAX 686
PV tolerant techniques 687
soft errors 650
soft-error vulnerability analysis 
architecturally correct execution 652
architecture vulnerability factor 652
experimental methodologies 654
GPGPU-SODA 653
Sim-SODA 652–653
speedup analysis 
dynamic bandwidth allocation 455
vs. flit size 455–457, 456f
GPU-memory communication 455
normalized IPC 456f
streaming multiprocessors 650–651
warp scheduler 651–652
Genetic algorithms (GA) 
Knapsack problems 260
operators 258
scheduling problem 259–260
traveling salesman problem 258–259
GKLEE 15–17
GPGPUs, power management 
concurrent execution 608–611
CPU and GPU partitioning 601–602, 602f
CPU estimation 606–608
CPU-GPU partitioning ratio 609–610
dynamic voltage and frequency scaling 
2DCONV and SYR2K applications 602–604, 603f
design space exploration 602t
energy-efficiency 603–604
GPU estimation 605–606
hardware environment 
CPU clusters 599
Exynos 5 Octa (5422) HMPSoC 598–599
GPU L2 cache 599
Odroid XU+E development platform 598–599
voltage-frequency settings 599t
memory contention 608–611
OpenCL background 
applications 599–600
compute units 600
CUDA 600
host code segment 600
NDRange 600
runtime software 600–601
work-item 600
workload partitioning 601–602
open problems 613
state-of-the-art techniques 
AMD Trinity single-chip heterogeneous platform 596
FreeOCL 597
image processing 596–597
kernel code 597
OpenCL framework 597
workload partitioning algorithm 597
work-group size manipulation 604–605
GPU computing 
advantages 247
Branch-and-Bound (B&B) algorithm 
flow-shop scheduling problem 256–257
Knapsack problem 255–256
traveling salesman problem 257
dynamic programming algorithm 253–255
metaheuristics  See (Metaheuristics)
NVIDIA GPUs 247–248, 248t
operations research 
linear optimization problems 249–250
linear program 249
metaheuristics 250
mixed-integer programming 250
optimization problem 249, 250
research contributions 248
signal processing and linear algebra 248
simplex algorithm 
vs. exterior point method 252
interior point method 253
linear programming 251, 252t
matrix operations 251–252
product form of the inverse 252
simplex tableau algorithm 253
steepest-edge heuristic 252
GPU programming 
barrier synchronization 7
correctness issues 4
data races 8–10
floating-point accuracy 11–12, 11f
lack of progress guarantees 10–11
tool support 12
dot product kernel 5–6, 5f, 8
evolution 3–4
future aspects 18–19
memory spaces 6–7, 6f
parallel computing 3–4
threads 5–6, 6f
canonical schedule 15
data race 14
deterministic 14
two-thread reduction 15
warps and lock-step execution 7
GPU reliability research 
accelerated particle beam testing 627–630
architectural and program vulnerability analysis 
architecturally correct execution (ACE) 622–623
architectural vulnerability factor  See (Architectural vulnerability factor (AVF))
cost-benefit analysis 622
PVF concept 623–624
fault-injection 
bit-flip 625–626
GPU-Qin 626–627
hypothetical transient fault 626
reliability enhancing structures 626
SASSI and SASSIFI 627
SWIFI’s source-level software abstraction 626
hardware reliability enhancements  See (Hardware reliability enhancements)
radiation-induced soft errors 
detected unrecoverable errors (DUEs) 619
permanent (hard) faults 618
silent data corruptions (SDC) 619
soft-error rate (SER) 619
transient faults 618–619
software reliability enhancements  See (Software reliability enhancements)
GPUVerify tool 17
Graph algorithms 
adjacency lists 165–166
adjacency matrix 163–165
breadth-first search  See (Breadth-first search (BFS))
coalesced expansion 190
data representations 163, 164t
edge list 166–167
iterated searches 191–194
load balancing and memory accesses 
dynamic mapping approaches 187–190
semidynamic mapping techniques 185–187
static mapping techniques 183–185
multiphase mapping strategy 190
sequential implementation 163
Graphics processing units (GPUs) 
application 376
baseline architecture 376f
bottlenecks in execution 373
CUDA memory model 310
CUDA programming model 
capabilities and functionality 307
computation models 310–312
CUDA core 308–309
dynamic parallelism 308
fixed pipeline functionality 307–308
global memory 308
high-level languages and template libraries 307–308
latency 308
serial code and parallel code 309
single instruction multiple data (SIMD) model 309
streaming multiprocessor (SM) chip 308
threadIdx and blockIdx 309–310
current vs. write pulse 550, 551f
data types 547
D-cache 547
generic programming strategies 312–313
hardware architectures 354–355
L2 cache 547–548
L1 data cache 
read/write policy diagram 547–548, 547f
size 544–545
memory hierarchy 547, 547f, 551
NVIDIA Maxwell architecture 548
on-chip GPU caches 546
programming model 355
Racetrack memory 549–550, 549f
ReRAM memory element 548
sorting algorithms  See (Sorting algorithms)
spin-transfer torque RAM (STT-RAM) technology 
basic structure 548–549
DRAM and PCM parts 552
early write termination (EWT) technique 551–552
ferromagnetic layers 548–549
in GPU register file 552–553
L2 cache architecture (see two-part L2 cache) 
limitations 545
limited write endurance 548–549
operation’s latency and energy 551–552
two-retention STT-RAM structure 551–552
write working set (WWS) 545–546
stream multiprocessors 546–547
STT-RAM technology 548–549
thread-blocks 546–547
Greedy scheduling strategy 83–84

H

Half-precision data type 12
Hardware acceleration 419–420
Hardware-based management of threads 381–382
Hardware-based SM partitioning framework 99
Hardware reliability enhancements 
manufacturing process variations 650
pipeline structures protection 632
soft errors 650
SRAM structures protection 630–632
underutilization 
Argus-G 634
GPU economics 634–635
RISE 633–634
warped-DMR 632–633
Hash table managing algorithm 172b
Helper threading 373–374, 408–409
goals and challenges 380–381
hardware-based management 381–382
programmer/developer interface 382–383
register file/shared memory allocation 382
Heterogeneous computing system 
accelerators 23
computational units 23
CUDA 23
OpenCL 23
Heterogeneous multiprocessor system-on-chips (HMPSoCs), power management 
abstract architectural block diagrams 582
ARM Mali-760 GPU 581–582
average power consumption 584–585, 585f
component frequency 590–591
control-theoretic gaming governor 586
CPU-GPU relationship 587–588
default Linux Conservative governor 595
dynamic power management 582
dynamic voltage and frequency scaling 582
FPS bottleneck 591
FPS prediction errors 594, 595t
gaming bottlenecks 589–590
gaming governor 
goal of 593
meeting FPS 593–594
samples 593
saving power 594
GPU processing 583
Linux and Android OS 584
Odroid-XU+E with Exynos 5 Octa (5410) HMPSoC 584, 585f
PC-class CPU-GPU systems 582–583, 583f
power-performance trade-off 588–589
producer-consumer relationship 583–584
regression-based gaming governor 587
resource bottlenecks 591–592
system-on-chip (SoC) 581
texture-directed gaming governor 586
utilization model 592
Hidden Markov models (HMMs) 
HMMER 134–135
Pfam database 134
Plan7 profile HMM architecture 135, 136f
residue emission probabilities 135
score and alignment 134
sequence family 135
topology 135
transition probabilities 135
Hierarchical memory machine (HMM) 311
High-performance computing (HPC) systems 
coprocessors 619–620
soft errors 
DRAM failures 620–621
on-chip SRAM failures 621
SER rates 620
SRAM faults 621
Hybrid format (HYB) SpMV 351

I

Idle SM aware Full-RISE (IS-FRISE) 633
Intrablock scan 66–67, 68f
Intrawarp scan 66–67, 67f

J

Johnson algorithm 180

K

KLEE tool 15
K20 Power Tool 331, 511

L

Layered algorithm 221–223
Leading load 475
Lightweight performance model 114–115
Load-store queue (LSQ) unit 480
Local warp search 188
Los Alamos Neutron Science Center (LANSCE) 628–629

M

Magnetic tunnel junction (MTJ) 548–549
Massively parallel compression (MPC) algorithm 
best algorithm 337
BIT component 334
component count 342–346
components 334–335
compression and decompression 
energy 342, 343t
compression ratio (CR) 327, 338, 339t
data model and coder 328
data sets 
double-precision data sets 333, 333t
MPI messages 332
numeric simulations 332–333
observational data 333
derivation 337
DIMn component 334
double- and single-precision floating-point values 330
vs. floating-point compression algorithm 330
four chained components 329f
individual best algorithms 336–337
INV component 334
LNVns component 334
LNVnx component 334
massively parallel machines 328
NUL component 334
RLE component 334
stages of 330
system and compilers 331
throughput and energy 331–332
ZE component 334
Max-min ant system (MMAS) 262
Memoization, CABA 406
Memory-bound applications 377
Memory coalescing 7, 243–244
Memory consistency model 
multicore system 73
producer-consumer code 73, 73t
relaxed consistency 74, 74t
sequential consistency model 73–74
Memory latency, in GPUs 409
Merge sort 317–319
Mesh-based NoC links and switch 450–451
Metaheuristics 
ant colony algorithms 
ACO implementation 260
constructive operator 261
2D table 260–261
maintenance scheduling problem 261
max-min ant system 262
pheromone update operator 261
prefix-sum algorithm 261
transit stop inspection 261
constructive heuristics 265
deep greedy switching (DGS) 264–265
definition 257
design and implementation 258
genetic algorithms 
Knapsack problems 260
operators 258
scheduling problem 259–260
traveling salesman problem 258–259
multiobjective local search 265
tabu search 262–264
Microring resonator (MRR) 451
Miss model 475
Moore’s law 417
Myers-Miller (MM) algorithm 130

N

Near-Far pile strategies 176–177
Needleman-Wunsch dynamic programming algorithm 204–205
Needleman-Wunsch (NW) algorithm 130
Neural acceleration 418–419
Neural transformation 418
annotations 431
approximation techniques 443
compilation workflow 
code generation 424
code observation 423
input/output identification 421
topology selection and training 423
controlling quality trade-offs 430–431
cycle-accurate simulations 434
Dennard scaling 417
design and integration 
MLPs 426
neurally enhanced SIMD lanes 428–430
neurally transformed warp 428
objectives 425
SIMD lanes 426–427
energy modeling and overhead 435
evaluation/training data sets 433
experimental results 
accelerator’s speed 439
area overhead 437
further improvements 437–438
off-chip bandwidth 440–441
performance and energy benefits 435–436
vs. prior CPU neural acceleration 441–442
quality-control mechanism 441
GPU applications 431
hardware acceleration 419–420
instruction-set architecture 
GPU-specific challenges 425
streaming multiprocessors 425
vector instructions 424–425
Moore’s law 417
neural network topology 433
quality metrics 433–434
reuse CPU neural accelerators 420
safe programming interface 420–421
NVIDIA’s CUDA programming model 3–4

O

On-chip memory, unutilized 379
OpenCL 2.0 memory model 
memory operation relationships 75
memory order parameters 77–79
memory scope parameters 79
release and acquire semantics 77
special atomic operations 75–76
stand-alone memory fence 75–76
Open Computing Language (OpenCL) 3–4
advantages 24
code portability 23
execution model 28, 29f
ICD 32–33, 33f
limitations 24
memory consistency model 32
memory model 29–30
platform model 26–28, 27f
application 26, 27f
kernel execution commands 26–28
memory objects 28
synchronization 30–32
Optical couplers 451

P

Pairwise sequence alignment 
ChunkedAlignment1 210–211, 214–216
advantage 216
running time 215t, 216t
sequence sizes 216
vs. StripedScore 216
ChunkedAlignment2 211
advantage 211
phases 211
running time 217t, 218f
computational efficiency 199–200
database alignment problem 199
DNA sequences 199
high-level parallelization strategy 201–202
local alignments 199
memory requirements 211–212
parallel implementation 200
scoring matrix 201–202
single-GPU parallelizations 201–202
Smith-Waterman algorithm 
BlockedAntidiagonal strategy 205, 206f
data dependency 205f
Gotoh’s variant 202–203
Needleman-Wunsch dynamic programming algorithm 204–205
pseudocode 203–204b
recurrence equations rewriting 204
score matrices 203
StripedAlignment 209–210, 213–219
StripedScore 
antidiagonals 206–207
vs. BlockedAntidiagonal 208
computational time 207–208
global memory 206–207
lenQuery 212–216
PerotRecurrence 212–213
running time 212–216, 212t, 213f, 214t
speedup ranges 212–213
substitution matrix 208
Pairwise sequence comparison 
affine gap model 130
Basic Local Alignment Search Tool 132–133
bioinformatics problems 129
BLAST and FASTA 131
dynamic programming technique 130–131
global, local, and semiglobal alignments 129–130f
Gotoh algorithm 130
linear gap model 130
metric GCUPS 131
Myers-Miller (MM) algorithm 130
Needleman-Wunsch (NW) algorithm 130
scoring 129–130
sequence alignment 129
Smith-Waterman (SW) algorithm 130
Parallel RAM (PRAM) model 311–312
Parallel thread execution (PTX) code 88
Partial-RISE 633–634
Photonic network-on-chip (NoC) 
architectures 451
bandwidth and latency 450–451
checkerboard-based NoC 450–451
interconnect architecture model 
energy per flit 454
flit width 452–453
interconnects bandwidth 452
power consumption 453–454
single photonic waveguide 452–453
total energy consumption 453
total photonic interconnect energy 454
laser sources 451
mesh-based NoC links and switch 450–451
microring resonator 451
optical couplers 451
photodetectors 451–452
silicon waveguides 451
Placement-agnostic code 112–114
PORPLE system 
adaptivity 107
benchmark description 117t
irregular benchmarks 118, 119f
regular benchmarks 118–120, 119f
Rodinia benchmark suite 117–118
rule-based approach 118
SHOC benchmark 117–118
best data placements 
algorithms 115
data placement and migration graph 115–116
components of 106–107
deriving data access patterns 
placement-agnostic code 112–114
PORPLE-C 111
reuse distance model 111–112
extensibility 107
generality 107
lightweight performance model 114–115
memory latency description 117, 117t
memory specification through MSL 
architectural specifications 108
blockSize 109
concurrencyFactor 109
design 108–109
name and ID 109
serialCondition 109
serialization condition 109–110
shareScope 109
Tesla M2075 110, 110f
“upperLevels” and “lowerLevels” 109
portability 107
runtime library activation 106
user-transparency 107
Power management of mobile GPUs 
gaming application problems 612
Power profiling 
expected, regular, and irregular profile comparison 515–516
idealized power profile 512f
irregular codes 
Barnes-Hut (BH) 515, 515f
breadth-first search (BFS) 513, 514f
data dependencies 513–514
single-source shortest paths (SSSP) 513, 514f
regular codes 512–513, 513f
Prefetching, CABA 406–408
Prefix-sum procedure 170
Processing elements (PEs) 64–65
Product Form of the Inverse (PFI) 252

Q

Quicksort 319–320

R

Racetrack memory 549–550, 549f
Radix sort 316–317
Recycling the streaming processors’ Idle time for Soft-Error detection (RISE) 633–634
duplication 669
error coverage 681–682
FULL-RISE 
idle aware Full-RISE 675–676
implementation 676–678
request pending 671–675
RMT technique 670
parallel computing 669
PARTIAL-RISE 678–681
sensitivity analysis 684–686
technique’s effectiveness 682–684
Redundant multithreading (RMT) 
Dimitrov’s kernel-level replication 637–638
intergroup RMT 639–640
intragroup RMT 638–639
output comparison 636
R-Naive 636–637
R-Scatter 637
R-Thread 637
simulated low-latency intercore communication 636
Register transfer language (RTL) models 623
Regression-based gaming governor 587
Relaxed consistency 74, 74t
Request pending aware Full-RISE (RP-FRISE) 633
ReRAM memory element 548
Reuse distance model 111–112
Round-robin warp scheduling 85, 99
Running Average Power Limit (RAPL) functionality 331

S

Scalable sorting 320, 321f
Semidynamic mapping techniques 185–187
Sequence alignment 
GPU architecture 201, 202f
Sequence-profile comparison 157t
Hidden Markov models (HMMs) 
HMMER 134–135
Pfam database 134
Plan7 profile HMM architecture 135, 136f
residue emission probabilities 135
score and alignment 134
sequence family 135
topology 135
transition probabilities 135
HMMER2 hmmsearch tool implementation, Viterbi algorithm 
Du et al. 151–152
Ferraz and Moreano 153–154
Ganesan et al. 153
Horn et al. 150–151
Walters et al. 152
Yao et al. 152–153
HMMER3 hmmsearch tool implementation, MSV algorithm 
Araújo Neto and Moreano 156
Cheng and Butler 155
Li et al. 154–155
MSV algorithm 137–138
sequence family and transfer information 133
statistical features 134
Viterbi algorithm 136–137
Sequential consistency model 73–74
Shared memory allocation, helper threading 382
Shortest paths computations 
APSP problem 274
centralized master/slave approach 282–285, 284f
decentralized approach 285–288, 288f
with real edge weights 297–300
using Dijkstra, better-scaling partitioned 300
betweenness centrality 273
computation complexities 
centralized APSP analysis 293–295
decentralized APSP analysis 295–296
SSST query algorithm analysis 296–297
Dijkstra algorithm 275–277
ex situ tropical product 292
Floyd-Warshall algorithm 274–275
graph partitioning 
diagonal and nondiagonal submatrices 280–281
distance matrix representation 280, 281f
METIS library 280
implementation challenges 277–278
related work 278–279
in situ tropical product 292
SPSP problem 274
preprocessing mode 289–290
query mode 290–291
using Dijkstra 302–303
SSSP problem 274
Silent data corruptions (SDC) 619
Silicon waveguides 451
Simultaneous multithreading (SMT) 408
Single-error correct, double-error detect (SEC-DED) 624
Single-event upsets (SEUs) 618–619
Single-instruction multiple-data (SIMD) 64–65, 309, 599
Single-instruction multiple-thread (SIMT) 375–376, 454, 543
Single-source shortest path (SSSP) problem 513, 514f
Bellman-Ford algorithm 175
algorithm iterations 179f
frontier-based 178, 178b
performance 180
relax operations 177–178
sequential 177b
speedups comparison 178, 179f
bucketing method 176–177
Dijkstra algorithm 175
Δ-stepping algorithm 176–177
Gabow’s scaling algorithm 176
Near-Far pile strategies 176–177
workfront sweep 176–177
Single write multiple read (SWMR) photonic crossbar 461–462
Smith-Waterman (SW) algorithm 130, 199–200
BlockedAntidiagonal strategy 205, 206f
data dependency 205f
Gotoh’s variant 202–203
Needleman-Wunsch dynamic programming algorithm 204–205
pseudocode 203–204b
recurrence equations rewriting 204
score matrices 203
similarity matrix 131
Smith-Waterman variations 132
SnuCL 
Cluster 35, 37f, 38, 40–42
CPU 35, 37f, 38–40, 38t
extensions to OpenCL 46–47
limitations of OpenCL 33–38
performance evaluation 
evaluation methodology 48
scalability on large-scale GPU cluster 51–53
scalability on medium-scale GPU cluster 50–51
Single 35, 37f, 40
target cluster architecture 24, 25f
SnuCL Cluster 35, 37f, 38
command handler thread 42
command message 42
command scheduler thread 40
host thread 40
memory management 
consistency management 45
detecting memory objects 45–46
memory commands 44–45
minimizing copying overhead 43–44
space allocation 43
synchronization commands 42
Soft-error rate (SER) 619
Soft-error vulnerability analysis 
architecturally correct execution 652
architecture vulnerability factor 652
experimental methodologies 654
GPGPU microarchitecture structures 
ACE percentage 654–655, 655f
first dynamically dead (FDD) instructions 654–655
streaming multiprocessor level 657–659, 658f
transitively dynamically dead (TDD) instructions 654–655, 655f
GPGPU-SODA 653
microarchitecture-level soft-error vulnerability 
dynamic warp formation 660–662
number of threads per SM 662–666
off-chip memory access 662, 663f
warp scheduling policy 666–667
Sim-SODA 652–653
Software-level scheduling 96–99
Software-level task scheduling 
affinity-based scheduling 
cluster enlargement 92
graph-based approach 91–92
graph partitioning 92
irregular benchmarks 92–93
nonuniform data sharing 91
normalized L1 miss ratios 93f, 94
seed selection 92
sorted list creation 92
speedups of single-kernel runs 93f
filling-retreating scheme 84, 87–88
greedy scheduling strategy 83–84
hardware approaches 99
hardware scheduler 83–84
greedy scheduling 85
nondeterministic across runs 85
oblivious to nonuniform data sharing 86
round-robin style 85
serializing multikernel co-runs 86
implementation 
global index array 88–89
macros 90
performance problems 88–89
pseudocode 88–89, 88f
optimizations 84
parallelism control 91
SM-based task selection 84
correctness issues 87
task mapping 86–87
thread block 86, 87f
SM partitioning 
average normalized turnaround time 95–96, 97f
multiprogram co-runs 94–96
overlapped execution 95
sampling approach 94
software approaches 96–99
spatial scheduling 83–84
temporal scheduling 83–84
Software reliability enhancements 
dual-modular redundancy 635
dynamic-reliability management 640–643
hidden error problem 635
redundant execution 635–636
redundant multithreading (RMT) 
Dimitrov’s kernel-level replication 637–638
intergroup RMT 639–640
intragroup RMT 638–639
output comparison 636
R-Naive 636–637
R-Scatter 637
R-Thread 637
simulated low-latency intercore communication 636
Sphere of Replication 635
symptom-based fault detection 640–643
Sorting algorithms 
Bitonic sort 
bitonic merge 313b
bitonic sequence 313
bitonic split 314b
data dependency and thread allocation 315f
GPU implementation 315
optimal parallel sorting 314
comparisons 322–323
large data sets 320–322
Merge sort 317–319
Quicksort 319–320
Radix sort 316–317
scalable sorting 320, 321f
Warpsort 319–320
Sparse matrix-vector multiplication (SpMV) 
adaptive runtime system 
atmosmodd matrix 363–366
auto-tuner 358–359
computational power 366f
cop20K_A and Scircuit 367
crankseg-2 matrices 363–366
effective bandwidth 366f
mac_econ_fwd500 and webbase-1M 367
MaxNNZ and AvgNNZ 363–366
processing phase 358–359
training phase 359
BCCOO format 353
BELLPACK format 353
compressed sparse row (CSR) format 350–351
coordinate format (COO) 351
diagonal format (DIA) 350
ELLPACK format 351
hybrid format (HYB) 351
nd24k input matrix 350
optimization principles 
block size 356, 360
memory 357–358, 363, 364f, 365f
register 357, 360–363
representation formats of 353f
Sparse storage format 238, 239f
Spin-transfer torque RAM (STT-RAM) technology 
basic structure 548–549
DRAM and PCM parts 552
early write termination (EWT) technique 551–552
ferromagnetic layers 548–549
in GPU register file 552–553
L2 cache architecture (see two-part L2 cache) 
limitations 545
limited write endurance 548–549
operation’s latency and energy 551–552
two-retention STT-RAM structure 551–552
write working set (WWS) 545–546
Stall time 475
Static mapping techniques 183–185
Streaming multiprocessors (SMs) 375–376, 425
Streaming processors (SPs) 201
Symbolic bug-finding tools 13
System-on-chip (SoC) 581

T

Tabu search 262–264
Texture-directed gaming governor 586
Thread-blocks 376
Thread communication and synchronization 
built-in atomic functions 
access patterns 69–70, 71t
data race 68–69
microbenchmark kernels 70
mutual exclusion 57
ordering 57
Three-sequence alignment 
algorithm 218–221
layered algorithm 
alignment computation 227–229
chunk values 221–222
computational time 223
dynamic programming recurrence 222
LAYERED-BT1 224–225
LAYERED-BT2 225
LAYERED-BT3 226–227
score computation 227
shared memory 222–223
three-dimensional matrix partitioning 221, 222f
problem 200–201
sloped algorithm 
alignment computation 227–229
analysis 224
GPU computational strategy 223
score computation 227
Tool support, GPU correctness issues 12
dynamic bug-finding tools 13
symbolic bug-finding tools 13
verification tools 13–14
Traveling salesman problem (TSP) 257
Tree-structured minimum-finder (TMF) units 566
Tridiagonal systems 
parallel algorithms 233–234
Two-part L2 cache 
cache-insensitive 568
cache-sensitive 568
detailed structural parameters 569t
dynamic energy consumption 571–572
dynamic write threshold detection 
mechanism 565–566
threshold analysis 563–565
total cache write cost 565
workload cost 563, 564f
write cost per block 565
eviction policy 560–561
execution time and IPC values 570, 571f
GPGPU-Sim configurations 570, 570t
GPGPU-Sim 3.2.1 simulator 568–570
inter- and intraset write variations 553–555, 554f
latencies, energies, and utilization 567f
lifetime evaluation 572–573
LR and HR cache retention times 567–568
LR cache size and structure 567
monitoring mechanism 558–559
parallel vs. sequential search mechanism 573
power and performance comparison 572, 573f
refreshing mechanism 561–563
rewrite access interval distribution 557f
search mechanism 559–560
STT-RAM cell retention time 553–555t
temporal and spatial distribution 553, 555, 555f
TWWS features 547f, 555–558
Two phase search 189–190

U

Unified memory machine (UMM) 311

V

Verification tools 13–14
Virtual warp load balancing 184–185

W

Warped dual-modular redundancy (warped-DMR) 632–633
Warpsort 319–320
Weak memory consistency model 7
Work-item coalescing regions (WCRs) 39