Subject Index

Note: Page numbers followed by b indicate boxes, f indicate figures, and t indicate tables.

A

Adaptive body biasing (ABB) 686–687
Adjacency lists 165–166
Adjacency matrix 163–165
Affinity-based scheduling 
cluster enlargement 92
graph-based approach 91–92
graph partitioning 92
irregular benchmarks 92–93
nonuniform data sharing 91
normalized L1 miss ratios 93f, 94
seed selection 92
sorted list creation 92
speedups of single-kernel runs 93f
All-pairs shortest path (APSP) problem 
Floyd-Warshall algorithm 
memory coalescing 182
phases 181
pseudocode 181, 181b
Johnson algorithm 180
matrix multiplication procedure 180
lazy minimum evaluation 182
repeated squaring technique 182
streaming block optimization 182
repeated squaring APSP algorithm 180–181b
Ant colony algorithms 
ACO implementation 260
constructive operator 261
2D table 260–261
maintenance scheduling problem 261
max-min ant system 262
pheromone update operator 261
prefix-sum algorithm 261
transit stop inspection 261
Architecturally correct execution (ACE) 622–623
Architectural vulnerability factor (AVF) 
vs. CPU AVF analysis 624
CPU-GPU architecture 625
GPU’s massive vector register files 624
high-priority structures 624
Little’s law 623
Arithmetic logic units (ALUs) 377, 632
Assist warp 374
communication and control 386
dynamic feedback and throttling 385–386
execution 385
mapping compression algorithms into 387–393
trigger and deployment 385
Assist warp buffer (AWB) 384
Assist warp controller (AWC) 383–384
Assist warp store (AWS) 383, 384, 385f, 393–394
Assist warp table (AWT) 385, 385f
Augmented block Cimmino distributed (ABCD) algorithm 
appending matrix 236
augmentation 233, 235
augmented matrix 235
boundary padding 
augmented matrix 240, 241f
branch divergence 240, 244–245
CUDA core 240
Givens rotation 240, 241, 242f
speedup 244–245
coalesced memory access 
coalesced format transformation 239–240, 239f
original data storage format 239
speedup 243–244
thread-block 238–240
vs. CPU implementation 242–243
future aspects 245
linear least square problem 234–235
parallel algorithms 233–234
partition of the system 234f
QR method 237–238
sparse storage format 238, 239f
Average bandwidth utilization 397–398

B

Bandwidth optimizations in GPUs 409
Basic Local Alignment Search Tool (BLAST) 132–133
BLASTN 133
BLASTP 133
evaluation 133
extension 133
Mega-BLAST 133
seeding 132–133
BDI compression algorithm 387–393
Bellman-Ford algorithm 175
algorithm iterations 179f
frontier-based 178, 178b
performance 180
relax operations 177–178
sequential 177b
speedups comparison 178, 179f
Biological sequence analysis 
algorithms 128
DNA and protein sequence database 128
evolutionary history 127
GPU design aspects 
GPU solution as a filter 141
memory hierarchy 140–141
sequence database sorting 139–140
task vs. data-parallelism 139
graphics processing units 128
high-performance computing resources and techniques 128
pairwise sequence comparison 
affine gap model 130
bioinformatics problems 129
dynamic programming technique 130–131
global, local, and semiglobal alignments 129–130f
Gotoh algorithm 130
linear gap model 130
metric GCUPS 131
Myers-Miller (MM) algorithm 130
Needleman-Wunsch (NW) algorithm 130
scoring 129–130
sequence alignment 129
Smith-Waterman (SW) algorithm 130
sequence-profile comparison  See (Sequence-profile comparison)
Bitonic sort algorithms 
bitonic merge 313b
bitonic sequence 313
bitonic split 314b
data dependency and thread allocation 315f
GPU implementation 315
optimal parallel sorting 314
Block search 188–189
Boundary padding 
augmented matrix 240, 241f
branch divergence 240, 244–245
CUDA core 240
Givens rotation 240–241, 242f
speedup 244–245
Branch-and-Bound (B&B) algorithm 250
flow-shop scheduling problem 256–257
Knapsack problem 255–256
traveling salesman problem 257
Breadth-first search (BFS) 513, 514f
BFS-4K 
coalesced read/write memory accesses 172
duplicate detection and correction strategy 172
dynamic parallelism 171
dynamic virtual warps 171
edge-discover 171
exclusive prefix-sum 170
single-block vs. multiblock kernel 172
vs. state-of-the-art BFS implementations 172, 173t, 174f
frontier-based parallel implementations 
computational complexity 169
CPU multicore hybrid approach 170
CSR representation 169
edge parallelism 169
filtering steps 168–169
frontier propagation 168
sparse graph matrices 169
virtual warp size 169
workload imbalance 170
sequential algorithm 167, 167b
Bucketing method 176–177

C

Central processing unit (CPU) multicore hybrid approach 170
Checkerboard-based NoC 450–451
Coalesced expansion 190
Coalesced memory access 
coalesced format transformation 239–240, 239f
original data storage format 239
speedup 243–244
thread-block 238–240
Coarse-grained communication and synchronization 
execution phases 58
global barrier at the kernel level 
custom function 60–61
data loading 59
generic kernel 60–61
initialization cost 59
in-order kernel execution 59, 59f
limitations 61–62
out-of-order kernel execution 59, 59f
skeletal host code 59–60
work-group suspension 59
implicit barrier 
AMD Southern Islands GPUs 64–65, 65f
compute-wavefronts 67–68
computing cores 64–65
DMA-wavefronts 67–68
intrablock scan 66–67, 68f
intrawarp scan 66–67, 67f
parallel scan algorithm 66, 67f
processing elements 64–65
SIMD unit 64–65
software prefetching 68, 69f
thread divergence 65, 66f
local barrier 
average workload calculation 63–64
deadlock 63
execution phases 62
shared memory consistency 62–63
task assignment 63–64
work donation algorithm 63
local memory 58
Compressed row storage (CRS) 165–166
Compressed sparse row (CSR) format 350–351
Compression, CABA for 374–375
Compute-bound applications 377
Compute Unified Device Architecture (CUDA) 240
applications 378f
programmer/developer interface 382–383
Computing cores 64–65
Constructive heuristics 265
Control-theoretic gaming governor 586
Cooperative thread arrays (CTAs) 376
Coordinate format (COO) SpMV 351
Coordinate list (COO) sparse matrix 166–167
Core and memory clock-frequency scaling 
default to 614 configuration 
compute-bound CUDA SDK 520–521
core speed 519–520
LonestarGPU 520
memory-bound codes 520
runtime changes 519
614 to 324 configuration 521–522
Core-Assisted Bottleneck Acceleration (CABA) 374
applications of 386
bandwidth optimizations 409
compression 409–410
for compression 374–375
data compression 386–396
design of 381–383
effect on performance and bandwidth utilization 398–399
enabling different compression algorithms 401–403
energy, effect on 399–401
framework 380–386, 383f
generality of 375
goals and challenges 380–381
handling interrupts and exceptions 407
helper threading 408–409
high-level block diagram 383–384
leverages 376
mapping BDI to 388–389
mechanism 385–386
memoization 406
memory bandwidth 403
memory latency 409
methodology 397–398
optimizations 404
prefetching 406–408
profiling and instrumentation 407–408
redundant multithreading 407
results 398–405
selective cache compression with 403–405
speculative precomputation 407
Correctness issues, GPU programming 4
data races 8–10
floating-point accuracy 11–12, 11f
lack of progress guarantees 10–11
tool support 12
dynamic bug-finding tools 13
symbolic bug-finding tools 13
verification tools 13–14
C-Pack algorithm 391–392, 392b
CRISP performance predictor 
compute/store path portion 484–486
critical stalled path 483–484
energy savings 490–492, 496–501
execution time prediction 
breakdown 494f
high scaling factor 494–496
lm kernel 494–496, 497f
mean absolute percentage error 496, 496f
STALL 494
target frequencies 494, 495f
generality 501–502
GPGPU configuration 490–492, 491t
hardware mechanism and overhead 490
LCP portion 484
leading load 486
loads and computations 486–488
miss model 486
online energy saving framework 492
online software 493
parameterizing 488–489
Rodinia benchmark 492
simplified 501
STALL model 486
static and dynamic power 492–493
static power 492
TMemory computation 487f
two outstanding load phases 486–488
CRItical Stalled Path (CRISP) 483–484

D

Data compression algorithm 
CABA 386–396
BDI compression algorithm 387–393
compression mechanism 395–396
C-Pack algorithm 391–392
decompression mechanism 393–395
FPC algorithm 390–393
initial setup and profiling 396
memory controller changes 396
compression ratio (CR) 327
data model and coder 328
floating-point compression algorithms 328–329
inverse model 328–329
massively parallel machines 328
Data placement 
GPU programs 120
PORPLE  See (PORPLE system)
static analysis 120–121
technical constraints 120
Data Placement and Migration Graph (DPMG) 115–116, 116f
Deep greedy switching (DGS) 264–265
Dennard scaling 417
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) 320–322
Deriving data access patterns 
placement-agnostic code 112–114
reuse distance model 111–112
Detected unrecoverable errors (DUEs) 619
Diagonal format (DIA) SpMV 350
Dijkstra algorithm 175
Direct search 187–188
Discrete memory machine (DMM) 311
Double-error correct, triple-error detect (DEC-TED) ECC 630–631
Dynamic bandwidth allocation (DBA) 
d-HetPNoC 462–465
hybrid configuration crossbar 462
mechanism 462
SWMR photonic crossbar 461–462
synthetic and real application-based traffic patterns 465–467
Dynamic bug-finding tools 13
Dynamic mapping approaches 
block search 188–189
direct search 187–188
local warp search 188
two phase search 189–190
Dynamic parallelism 171
Dynamic power management (DPM) 582
Dynamic virtual warps 171
Dynamic voltage and frequency scaling (DVFS) 582
categories 472
CPU-based performance models 
analytical model 474–475
critical path 475
empirical model 474
execution time 474–476, 477f
leading load 475
miss model 475
proportional scaling model 474
sampling model 474
stall time 475
CPU cores 471–472
CPU program performance 472, 473f
2DCONV and SYR2K applications 602–604, 603f
design space exploration 602t
energy and power savings 471
energy-efficiency 603–604
GPGPU program performance 472, 473f
in GPGPUs 
complex stall classification 481–482
core clock frequency scaling 476–478f
CRISP performance predictor (see CRItical Stalled Path (CRISP) performance predictor) 
frequency range 476–478
GPU analytical model 482
high thread-level parallelism 478
homogeneity of computation 478
L1 cache miss 482
linear and machine-learning-based models 482
memory/computation overlap 478–480
store stalls 480–481
memory/computation 472
performance counter architectures 472
thread-level parallelism 472

E

Early write termination (EWT) technique 551–552
Edge-discover technique 171
Edge list 166–167
ELLPACK format SpMV 351
Energy and power considerations 
algorithm implementation 516–518
arithmetic precision 518
benchmark programs 510–511
challenges 509
code optimizations 509–510
compute- and memory-bound codes 510
core and memory clock-frequency scaling  See (Core and memory clock-frequency scaling)
hardware optimizations 509
high-performance computing 509
input sensitivity 522, 523f
power profiling  See (Power profiling)
regular and irregular codes 510
sample power profile 511f
source-code optimizations 
effectiveness 527–530
energy efficiency vs. performance 530–533
error correcting codes (ECC) 522–523
lowest and highest settings 530
most biased optimizations 533–535
none vs. all optimizations 524–527
six code optimizations 523–524
system and compiler 511
Energy-delay product (EDP) 460, 461f
Error correcting codes (ECC) 522–523

F

Fast-bank aware register mapping mechanism (FBA-RM) 695–696
Fast-warp aware scheduling (FWAS) policy 696–697
Filling-retreating scheme 84, 87–88
Fine-grained communication and synchronization 72
memory consistency model 73–74
OpenCL 2.0 memory model 
memory operation relationships 75
memory order parameters 77–79
memory scope parameters 79
release and acquire semantics 77
special atomic operations 75–76
stand-alone memory fence 75–76
First-ready first-come first-served (FR-FCFS) scheduling policy 662
Flow-shop scheduling problem (FSP) 256–257
Floyd-Warshall algorithm 181, 181b
FPC algorithm, implementation of 390–393

G

Gabow’s scaling algorithm 176
General-purpose graphics processing units (GPGPUs) 23
bandwidth selection policy 
dynamic power 460
energy-delay-product 460, 461f
vs. optimal bandwidth choices 459, 459f
power consumption 460, 461f
baseline GPGPU-Sim configuration 454, 455t
BFS benchmark 458f
dynamic bandwidth allocation 
d-HetPNoC 462–465
hybrid configuration crossbar 462
mechanism 462
SWMR photonic crossbar 461–462
synthetic and real application-based traffic patterns 465–467
evaluations 
IPC improvement 698–700
overall performance improvement 700–701
frequency optimization 
implementation 692–693
modeling PV impact 688–690
register file bank reorganization 692
RF architecture, feasibility in 694
variable-latency subbanks 690–692
IPC degradation under VL-SB and RF-BRO 
FBA-RM 695–696
FWAS policy 696–697
register mapping 695
leukocyte benchmark 457, 458f
manufacturing process variations 650
memory accesses 457
programming models 651
to PVs 
ABB and gate sizing 686–687
contributions 687–688
FMAX 686
PV tolerant techniques 687
soft errors 650
soft-error vulnerability analysis 
architecturally correct execution 652
architecture vulnerability factor 652
experimental methodologies 654
GPGPU-SODA 653
Sim-SODA 652–653
speedup analysis 
dynamic bandwidth allocation 455
vs. flit size 455–457, 456f
GPU-memory communication 455
normalized IPC 456f
streaming multiprocessors 650–651
warp scheduler 651–652
Genetic algorithms (GA) 
Knapsack problems 260
operators 258
scheduling problem 259–260
traveling salesman problem 258–259
GKLEE 15–17
GPGPUs, power management 
concurrent execution 608–611
CPU and GPU partitioning 601–602, 602f
CPU estimation 606–608
CPU-GPU partitioning ratio 609–610
dynamic voltage and frequency scaling 
2DCONV and SYR2K applications 602–604, 603f
design space exploration 602t
energy-efficiency 603–604
GPU estimation 605–606
hardware environment 
CPU clusters 599
Exynos 5 Octa (5422) HMPSoC 598–599
GPU L2 cache 599
Odroid XU+E development platform 598–599
voltage-frequency settings 599t
memory contention 608–611
OpenCL background 
applications 599–600
compute units 600
CUDA 600
host code segment 600
NDRange 600
runtime software 600–601
work-item 600
workload partitioning 601–602
open problems 613
state-of-the-art techniques 
AMD Trinity single-chip heterogeneous platform 596
FreeOCL 597
image processing 596–597
kernel code 597
OpenCL framework 597
workload partitioning algorithm 597
work-group size manipulation 604–605
GPU computing 
advantages 247
Branch-and-Bound (B&B) algorithm 
flow-shop scheduling problem 256–257
Knapsack problem 255–256
traveling salesman problem 257
dynamic programming algorithm 253–255
metaheuristics  See (Metaheuristics)
NVIDIA GPUs 247–248, 248t
operations research 
linear optimization problems 249–250
linear program 249
metaheuristics 250
mixed-integer programming 250
optimization problem 249, 250
research contributions 248
signal processing and linear algebra 248
simplex algorithm 
vs. exterior point method 252
interior point method 253
linear programming 251, 252t
matrix operations 251–252
product form of the inverse 252
simplex tableau algorithm 253
steepest-edge heuristic 252
GPU programming 
barrier synchronization 7
correctness issues 4
data races 8–10
floating-point accuracy 11–12, 11f
lack of progress guarantees 10–11
tool support 12
dot product kernel 5–6, 5f, 8
evolution 3–4
future aspects 18–19
memory spaces 6–7, 6f
parallel computing 3–4
threads 5–6, 6f
canonical schedule 15
data race 14
deterministic 14
two-thread reduction 15
warps and lock-step execution 7
GPU reliability research 
accelerated particle beam testing 627–630
architectural and program vulnerability analysis 
architecturally correct execution (ACE) 622–623
architectural vulnerability factor  See (Architectural vulnerability factor (AVF))
cost-benefit analysis 622
PVF concept 623–624
fault-injection 
bit-flip 625–626
GPU-Qin 626–627
hypothetical transient fault 626
reliability enhancing structures 626
SASSI and SASSIFI 627
SWIFI’s source-level software abstraction 626
hardware reliability enhancements  See (Hardware reliability enhancements)
radiation-induced soft errors 
detected unrecoverable errors (DUEs) 619
permanent (hard) faults 618
silent data corruptions (SDC) 619
soft-error rate (SER) 619
transient faults 618–619
software reliability enhancements  See (Software reliability enhancements)
GPUVerify tool 17
Graph algorithms 
adjacency lists 165–166
adjacency matrix 163–165
breadth-first search  See (Breadth-first search (BFS))
coalesced expansion 190
data representations 163, 164t
edge list 166–167
iterated searches 191–194
load balancing and memory accesses 
dynamic mapping approaches 187–190
semidynamic mapping techniques 185–187
static mapping techniques 183–185
multiphase mapping strategy 190
sequential implementation 163
Graphics processing units (GPUs) 
application 376
baseline architecture 376f
bottlenecks in execution 373
CUDA memory model 310
CUDA programming model 
capabilities and functionality 307
computation models 310–312
CUDA core 308–309
dynamic parallelism 308
fixed pipeline functionality 307–308
global memory 308
high-level languages and template libraries 307–308
latency 308
serial code and parallel code 309
single instruction multiple data (SIMD) model 309
streaming multiprocessor (SM) chip 308
threadIdx and blockIdx 309–310
current vs. write pulse 550, 551f
data types 547
D-cache 547
generic programming strategies 312–313
hardware architectures 354–355
L2 cache 547–548
L1 data cache 
read/write policy diagram 547–548, 547f
size 544–545
memory hierarchy 547, 547f, 551
NVIDIA Maxwell architecture 548
on-chip GPU caches 546
programming model 355
Racetrack memory 549–550, 549f
ReRAM memory element 548
sorting algorithms  See (Sorting algorithms)
spin-transfer torque RAM (STT-RAM) technology 
basic structure 548–549
DRAM and PCM parts 552
early write termination (EWT) technique 551–552
ferromagnetic layers 548–549
in GPU register file 552–553
L2 cache architecture (see two-part L2 cache) 
limitations 545
limited write endurance 548–549
operation’s latency and energy 551–552
two-retention STT-RAM structure 551–552
write working set (WWS) 545–546
stream multiprocessors 546–547
STT-RAM technology 548–549
thread-blocks 546–547
Greedy scheduling strategy 83–84

H

Half-precision data type 12
Hardware acceleration 419–420
Hardware-based management of threads 381–382
Hardware-based SM partitioning framework 99
Hardware reliability enhancements 
manufacturing process variations 650
pipeline structures protection 632
soft errors 650
SRAM structures protection 630–632
underutilization 
Argus-G 634
GPU economics 634–635
RISE 633–634
warped-DMR 632–633
Hash table managing algorithm 172b
Helper threading 373–374, 408–409
goals and challenges 380–381
hardware-based management 381–382
programmer/developer interface 382–383
register file/shared memory allocation 382
Heterogeneous computing system 
accelerators 23
computational units 23
CUDA 23
OpenCL 23
Heterogeneous multiprocessor system-on-chips (HMPSoCs), power management 
abstract architectural block diagrams 582
ARM Mali-760 GPU 581–582
average power consumption 584–585, 585f
component frequency 590–591
control-theoretic gaming governor 586
CPU-GPU relationship 587–588
default Linux Conservative governor 595
dynamic power management 582
dynamic voltage and frequency scaling 582
FPS bottleneck 591
FPS prediction errors 594, 595t
gaming bottlenecks 589–590
gaming governor 
goal of 593
meeting FPS 593–594
samples 593
saving power 594
GPU processing 583
Linux and Android OS 584
Odroid-XU+E with Exynos 5 Octa (5410) HMPSoC 584, 585f
PC-class CPU-GPU systems 582–583, 583f
power-performance trade-off 588–589
producer-consumer relationship 583–584
regression-based gaming governor 587
resource bottlenecks 591–592
system-on-chip (SoC) 581
texture-directed gaming governor 586
utilization model 592
Hidden Markov models (HMMs) 
HMMER 134–135
Pfam database 134
Plan7 profile HMM architecture 135, 136f
residue emission probabilities 135
score and alignment 134
sequence family 135
topology 135
transition probabilities 135
Hierarchical memory machine (HMM) 311
High-performance computing (HPC) systems 
coprocessors 619–620
soft errors 
DRAM failures 620–621
on-chip SRAM failures 621
SER rates 620
SRAM faults 621
Hybrid format (HYB) SpMV 351

I

Idle SM aware Full-RISE (IS-FRISE) 633
Intrablock scan 66–67, 68f
Intrawarp scan 66–67, 67f

J

Johnson algorithm 180

K

KLEE tool 15
K20 Power Tool 331, 511

L

Layered algorithm 221–223
Leading load 475
Lightweight performance model 114–115
Load-store queue (LSQ) unit 480
Local warp search 188
Los Alamos Neutron Science Center (LANSCE) 628–629

M

Magnetic tunnel junction (MTJ) 548–549
Massively parallel compression (MPC) algorithm 
best algorithm 337
BIT component 334
component count 342–346
components 334–335
compression and decompression 
energy 342, 343t
compression ratio (CR) 327, 338, 339t
data model and coder 328
data sets 
double-precision data sets 333, 333t
MPI messages 332
numeric simulations 332–333
observational data 333
derivation 337
DIMn component 334
double- and single-precision floating-point values 330
vs. floating-point compression algorithm 330
four chained components 329f
individual best algorithms 336–337
INV component 334
LNVns component 334
LNVnx component 334
massively parallel machines 328
NUL component 334
RLE component 334
stages of 330
system and compilers 331
throughput and energy 331–332
ZE component 334
Max-min ant system (MMAS) 262
Memoization, CABA 406
Memory-bound applications 377
Memory coalescing 7, 243–244
Memory consistency model 
multicore system 73
producer-consumer code 73, 73t
relaxed consistency 74, 74t
sequential consistency model 73–74
Memory latency, in GPUs 409
Merge sort 317–319
Mesh-based NoC links and switch 450–451
Metaheuristics 
ant colony algorithms 
ACO implementation 260
constructive operator 261
2D table 260–261
maintenance scheduling problem 261
max-min ant system 262
pheromone update operator 261
prefix-sum algorithm 261
transit stop inspection 261
constructive heuristics 265
deep greedy switching (DGS) 264–265
definition 257
design and implementation 258
genetic algorithms 
Knapsack problems 260
operators 258
scheduling problem 259–260
traveling salesman problem 258–259
multiobjective local search 265
tabu search 262–264
Microring resonator (MRR) 451
Miss model 475
Moore’s law 417
Myers-Miller (MM) algorithm 130

N

Near-Far pile strategies 176–177
Needleman-Wunsch dynamic programming algorithm 204–205
Needleman-Wunsch (NW) algorithm 130
Neural acceleration 418–419
Neural transformation 418
annotations 431
approximation techniques 443
compilation workflow 
code generation 424
code observation 423
input/output identification 421
topology selection and training 423
controlling quality trade-offs 430–431
cycle-accurate simulations 434
Dennard scaling 417
design and integration 
MLPs 426
neurally enhanced SIMD lanes 428–430
neurally transformed warp 428
objectives 425
SIMD lanes 426–427
energy modeling and overhead 435
evaluation/training data sets 433
experimental results 
accelerator’s speed 439
area overhead 437
further improvements 437–438
off-chip bandwidth 440–441
performance and energy benefits 435–436
vs. prior CPU neural acceleration 441–442
quality-control mechanism 441
GPU applications 431
hardware acceleration 419–420
instruction-set architecture 
GPU-specific challenges 425
streaming multiprocessors 425
vector instructions 424–425
Moore’s law 417
neural network topology 433
quality metrics 433–434
reuse CPU neural accelerators 420
safe programming interface 420–421
NVIDIA’s CUDA programming model 3–4

O

On-chip memory, unutilized 379
OpenCL 2.0 memory model 
memory operation relationships 75
memory order parameters 77–79
memory scope parameters 79
release and acquire semantics 77
special atomic operations 75–76
stand-alone memory fence 75–76
Open Computing Language (OpenCL) 3–4
advantages 24
code portability 23
execution model 28, 29f
ICD 32–33, 33f
limitations 24
memory consistency model 32
memory model 29–30
platform model 26–28, 27f
application 26, 27f
kernel execution commands 26–28
memory objects 28
synchronization 30–32
Optical couplers 451

P

Pairwise sequence alignment 
ChunkedAlignment1 210–211, 214–216
advantage 216
running time 215t, 216t
sequence sizes 216
vs. StripedScore 216
ChunkedAlignment2 211
advantage 211
phases 211
running time 217t, 218f
computational efficiency 199–200
database alignment problem 199
DNA sequences 199
high-level parallelization strategy 201–202
local alignments 199
memory requirements 211–212
parallel implementation 200
scoring matrix 201–202
single-GPU parallelizations 201–202
Smith-Waterman algorithm 
BlockedAntidiagonal strategy 205, 206f
data dependency 205f
Gotoh’s variant 202–203
Needleman-Wunsch dynamic programming algorithm 204–205
pseudocode 203–204b
recurrence equations rewriting 204
score matrices 203
StripedAlignment 209–210, 213–219
StripedScore 
antidiagonals 206–207
vs. BlockedAntidiagonal 208
computational time 207–208
global memory 206–207
lenQuery 212–216
PerotRecurrence 212–213
running time 212–216, 212t, 213f, 214t
speedup ranges 212–213
substitution matrix 208
Pairwise sequence comparison 
affine gap model 130
Basic Local Alignment Search Tool 132–133
bioinformatics problems 129
BLAST and FASTA 131
dynamic programming technique 130–131
global, local, and semiglobal alignments 129–130f
Gotoh algorithm 130
linear gap model 130
metric GCUPS 131
Myers-Miller (MM) algorithm 130
Needleman-Wunsch (NW) algorithm 130
scoring 129–130
sequence alignment 129
Smith-Waterman (SW) algorithm 130
Parallel RAM (PRAM) model 311–312
Parallel thread execution (PTX) code 88
Partial-RISE 633–634
Photonic network-on-chip (NoC) 
architectures 451
bandwidth and latency 450–451
checkerboard-based NoC 450–451
interconnect architecture model 
energy per flit 454
flit width 452–453
interconnects bandwidth 452
power consumption 453–454
single photonic waveguide 452–453
total energy consumption 453
total photonic interconnect energy 454
laser sources 451
mesh-based NoC links and switch 450–451
microring resonator 451
optical couplers 451
photodetectors 451–452
silicon waveguides 451
Placement-agnostic code 112–114
PORPLE system 
adaptivity 107
benchmark description 117t
irregular benchmarks 118, 119f
regular benchmarks 118–120, 119f
Rodinia benchmark suite 117–118
rule-based approach 118
SHOC benchmark 117–118
best data placements 
algorithms 115
data placement and migration graph 115–116
components of 106–107
deriving data access patterns 
placement-agnostic code 112–114
PORPLE-C 111
reuse distance model 111–112
extensibility 107
generality 107
lightweight performance model 114–115
memory latency description 117, 117t
memory specification through MSL 
architectural specifications 108
blockSize 109
concurrencyFactor 109
design 108–109
name and ID 109
serialCondition 109
serialization condition 109–110
shareScope 109
Tesla M2075 110, 110f
“upperLevels” and “lowerLevels” 109
portability 107
runtime library activation 106
user-transparency 107
Power management of mobile GPUs 
gaming application problems 612
Power profiling 
expected, regular, and irregular profile comparison 515–516
idealized power profile 512f
irregular codes 
Barnes-Hut (BH) 515, 515f
breadth-first search (BFS) 513, 514f
data dependencies 513–514
single-source shortest paths (SSSP) 513, 514f
regular codes 512–513, 513f
Prefetching, CABA 406–408
Prefix-sum procedure 170
Processing elements (PEs) 64–65
Product Form of the Inverse (PFI) 252

Q

Quicksort 319–320

R

Racetrack memory 549–550, 549f
Radix sort 316–317
Recycling the streaming processors’ Idle time for Soft-Error detection (RISE) 633–634
duplication 669
error coverage 681–682
FULL-RISE 
idle aware Full-RISE 675–676
implementation 676–678
request pending 671–675
RMT technique 670
parallel computing 669
PARTIAL-RISE 678–681
sensitivity analysis 684–686
technique’s effectiveness 682–684
Redundant multithreading (RMT) 
Dimitrov’s kernel-level replication 637–638
intergroup RMT 639–640
intragroup RMT 638–639
output comparison 636
R-Naive 636–637
R-Scatter 637
R-Thread 637
simulated low-latency intercore communication 636
Register transfer language (RTL) models 623
Regression-based gaming governor 587
Relaxed consistency 74, 74t
Request pending aware Full-RISE (RP-FRISE) 633
ReRAM memory element 548
Reuse distance model 111–112
Round-robin warp scheduling 85, 99
Running Average Power Limit (RAPL) functionality 331

S

Scalable sorting 320, 321f
Semidynamic mapping techniques 185–187
Sequence alignment 
GPU architecture 201, 202f
Sequence-profile comparison 157t
Hidden Markov models (HMMs) 
HMMER 134–135
Pfam database 134
Plan7 profile HMM architecture 135, 136f
residue emission probabilities 135
score and alignment 134
sequence family 135
topology 135
transition probabilities 135
HMMER2 hmmsearch tool implementation, Viterbi algorithm 
Du et al. 151–152
Ferraz and Moreano 153–154
Ganesan et al. 153
Horn et al. 150–151
Walters et al. 152
Yao et al. 152–153
HMMER3 hmmsearch tool implementation, MSV algorithm 
Araújo Neto and Moreano 156
Cheng and Butler 155
Li et al. 154–155
MSV algorithm 137–138
sequence family and transfer information 133
statistical features 134
Viterbi algorithm 136–137
Sequential consistency model 73–74
Shared memory allocation, helper threading 382
Shortest paths computations 
APSP problem 274
centralized master/slave approach 282–285, 284f
decentralized approach 285–288, 288f
with real edge weights 297–300
using Dijkstra, better-scaling partitioned 300
betweenness centrality 273
computation complexities 
centralized APSP analysis 293–295
decentralized APSP analysis 295–296
SSST query algorithm analysis 296–297
Dijkstra algorithm 275–277
ex situ tropical product 292
Floyd-Warshall algorithm 274–275
graph partitioning 
diagonal and nondiagonal submatrices 280–281
distance matrix representation 280, 281f
METIS library 280
implementation challenges 277–278
related work 278–279
in situ tropical product 292
SPSP problem 274
preprocessing mode 289–290
query mode 290–291
using Dijkstra 302–303
SSSP problem 274
Silent data corruptions (SDC) 619
Silicon waveguides 451
Simultaneous multithreading (SMT) 408
Single-error correct, double-error detect (SEC-DED) 624
Single-event upsets (SEUs) 618–619
Single-instruction multiple-data (SIMD) 64–65, 309, 599
Single-instruction multiple-thread (SIMT) 375–376, 454, 543
Single-source shortest path (SSSP) problem 513, 514f
Bellman-Ford algorithm 175
algorithm iterations 179f
frontier-based 178, 178b
performance 180
relax operations 177–178
sequential 177b
speedups comparison 178, 179f
bucketing method 176–177
Dijkstra algorithm 175
Δ-stepping algorithm 176–177
Gabow’s scaling algorithm 176
Near-Far pile strategies 176–177
workfront sweep 176–177
Single write multiple read (SWMR) photonic crossbar 461–462
Smith-Waterman (SW) algorithm 130, 199–200
BlockedAntidiagonal strategy 205, 206f
data dependency 205f
Gotoh’s variant 202–203
Needleman-Wunsch dynamic programming algorithm 204–205
pseudocode 203–204b
recurrence equations rewriting 204
score matrices 203
similarity matrix 131
Smith-Waterman variations 132
SnuCL 
Cluster 35, 37f, 38, 40–42
CPU 35, 37f, 38–40, 38t
extensions to OpenCL 46–47
limitations of OpenCL 33–38
performance evaluation 
evaluation methodology 48
scalability on large-scale GPU cluster 51–53
scalability on medium-scale GPU cluster 50–51
Single 35, 37f, 40
target cluster architecture 24, 25f
SnuCL Cluster 35, 37f, 38
command handler thread 42
command message 42
command scheduler thread 40
host thread 40
memory management 
consistency management 45
detecting memory objects 45–46
memory commands 44–45
minimizing copying overhead 43–44
space allocation 43
synchronization commands 42
Soft-error rate (SER) 619
Soft-error vulnerability analysis 
architecturally correct execution 652
architecture vulnerability factor 652
experimental methodologies 654
GPGPU microarchitecture structures 
ACE percentage 654–655, 655f
first dynamically dead (FDD) instructions 654–655
streaming multiprocessor level 657–659, 658f
transitively dynamically dead (TDD) instructions 654–655, 655f
GPGPU-SODA 653
microarchitecture-level soft-error vulnerability 
dynamic warp formation 660–662
number of threads per SM 662–666
off-chip memory access 662, 663f
warp scheduling policy 666–667
Sim-SODA 652–653
Software-level scheduling 96–99
Software-level task scheduling 
affinity-based scheduling 
cluster enlargement 92
graph-based approach 91–92
graph partitioning 92
irregular benchmarks 92–93
nonuniform data sharing 91
normalized L1 miss ratios 93f, 94
seed selection 92
sorted list creation 92
speedups of single-kernel runs 93f
filling-retreating scheme 84, 87–88
greedy scheduling strategy 83–84
hardware approaches 99
hardware scheduler 83–84
greedy scheduling 85
nondeterministic across runs 85
oblivious to nonuniform data sharing 86
round-robin style 85
serializing multikernel co-runs 86
implementation 
global index array 88–89
macros 90
performance problems 88–89
pseudocode 88–89, 88f
optimizations 84
parallelism control 91
SM-based task selection 84
correctness issues 87
task mapping 86–87
thread block 86, 87f
SM partitioning 
average normalized turnaround time 95–96, 97f
multiprogram co-runs 94–96
overlapped execution 95
sampling approach 94
software approaches 96–99
spatial scheduling 83–84
temporal scheduling 83–84
Software reliability enhancements 
dual-modular redundancy 635
dynamic-reliability management 640–643
hidden error problem 635
redundant execution 635–636
redundant multithreading (RMT) 
Dimitrov’s kernel-level replication 637–638
intergroup RMT 639–640
intragroup RMT 638–639
output comparison 636
R-Naive 636–637
R-Scatter 637
R-Thread 637
simulated low-latency intercore communication 636
Sphere of Replication 635
symptom-based fault detection 640–643
Sorting algorithms 
Bitonic sort 
bitonic merge 313b
bitonic sequence 313
bitonic split 314b
data dependency and thread allocation 315f
GPU implementation 315
optimal parallel sorting 314
comparisons 322–323
large data sets 320–322
Merge sort 317–319
Quicksort 319–320
Radix sort 316–317
scalable sorting 320, 321f
Warpsort 319–320
Sparse matrix-vector multiplication (SpMV) 
adaptive runtime system 
atmosmodd matrix 363–366
auto-tuner 358–359
computational power 366f
cop20K_A and Scircuit 367
crankseg-2 matrices 363–366
effective bandwidth 366f
mac_econ_fwd500 and webbase-1M 367
MaxNNZ and AvgNNZ 363–366
processing phase 358–359
training phase 359
BCCOO format 353
BELLPACK format 353
compressed sparse row (CSR) format 350–351
coordinate format (COO) 351
diagonal format (DIA) 350
ELLPACK format 351
hybrid format (HYB) 351
nd24k input matrix 350
optimization principles 
block size 356, 360
memory 357–358, 363, 364f, 365f
register 357, 360–363
representation formats of 353f
Sparse storage format 238, 239f
Spin-transfer torque RAM (STT-RAM) technology 
basic structure 548–549
DRAM and PCM parts 552
early write termination (EWT) technique 551–552
ferromagnetic layers 548–549
in GPU register file 552–553
L2 cache architecture (see two-part L2 cache) 
limitations 545
limited write endurance 548–549
operation’s latency and energy 551–552
two-retention STT-RAM structure 551–552
write working set (WWS) 545–546
Stall time 475
Static mapping techniques 183–185
Streaming multiprocessors (SMs) 375–376, 425
Streaming processors (SPs) 201
Symbolic bug-finding tools 13
System-on-chip (SoC) 581

T

Tabu search 262–264
Texture-directed gaming governor 586
Thread-blocks 376
Thread communication and synchronization 
built-in atomic functions 
access patterns 69–70, 71t
data race 68–69
microbenchmark kernels 70
mutual exclusion 57
ordering 57
Three-sequence alignment 
algorithm 218–221
layered algorithm 
alignment computation 227–229
chunk values 221–222
computational time 223
dynamic programming recurrence 222
LAYERED-BT1 224–225
LAYERED-BT2 225
LAYERED-BT3 226–227
score computation 227
shared memory 222–223
three-dimensional matrix partitioning 221, 222f
problem 200–201
sloped algorithm 
alignment computation 227–229
analysis 224
GPU computational strategy 223
score computation 227
Tool support, GPU correctness issues 12
dynamic bug-finding tools 13
symbolic bug-finding tools 13
verification tools 13–14
Traveling salesman problem (TSP) 257
Tree-structured minimum-finder (TMF) units 566
Tridiagonal systems 
parallel algorithms 233–234
Two-part L2 cache 
cache-insensitive 568
cache-sensitive 568
detailed structural parameters 569t
dynamic energy consumption 571–572
dynamic write threshold detection 
mechanism 565–566
threshold analysis 563–565
total cache write cost 565
workload cost 563, 564f
write cost per block 565
eviction policy 560–561
execution time and IPC values 570, 571f
GPGPU-Sim configurations 570, 570t
GPGPU-Sim 3.2.1 simulator 568–570
inter- and intraset write variations 553–555, 554f
latencies, energies, and utilization 567f
lifetime evaluation 572–573
LR and HR cache retention times 567–568
LR cache size and structure 567
monitoring mechanism 558–559
parallel vs. sequential search mechanism 573
power and performance comparison 572, 573f
refreshing mechanism 561–563
rewrite access interval distribution 557f
search mechanism 559–560
STT-RAM cell retention time 553–555t
temporal and spatial distribution 553, 555, 555f
TWWS features 547f, 555–558
Two phase search 189–190

U

Unified memory machine (UMM) 311

V

Verification tools 13–14
Virtual warp load balancing 184–185

W

Warped dual-modular redundancy (warped-DMR) 632–633
Warpsort 319–320
Weak memory consistency model 7
Work-item coalescing regions (WCRs) 39