Index

Note: Page numbers followed by f indicate figures and t indicate tables.

A

Accelerated processing units (APUs) 26, 27, 37, 37f, 38f
A10-7850K APU 37, 37f
AMD FX-8350 CPU 
AMD Piledriver architecture 191
barrier operations 189
fine-grained synchronization 188
high-level design 187, 188f
local memory 191, 192f
runtime, thread creation 188
work-group execution, x86 architecture 189, 190f
work-groups scheduling 188, 189f
work-item stack data storage 190
AMD Radeon HD 6970 GPU architecture 25, 27f
AMD Radeon R9 290X GPU 
architecture 34, 35f
control flow management 197
hardware threads 192
high-level diagram 193, 193f
ISA divergent execution 198, 199
queuing mechanism 194
SCC register 199
SIMD unit 196, 197f
threading and memory system 194
three-dimensional graphics 192, 193
unit microarchitecture 196, 197f
wavefront scheduling 193, 194
Aparapi 292, 293
API-level debugging 244
Application programming interface (API) 12, 43, 291

B

Bag-of-words (BoW) model 
document classification 213
image classification 213, 214f
feature generation algorithm 213, 214f
histogram builder 214, 215, 217
image clustering 214
natural language processing 213
Blocking memory operations 111
BoW model  See Bag-of-words (BoW) model
Breadth-first search (BFS) algorithm 133
Broadcast functions 129

C

C++ Accelerated Massive Parallelism (C++ AMP) 
address space deduction 265
array_view 250f, 251
binomial functions 268, 270
Clamp compiler 
compilation flow 259
components 254
SPIR code 254, 260
data movement optimization 267
data-parallel algorithms 250
features 249
Lambda 251
mapping 254, 255t
C++ mangling rules 255
functor version 255, 256f
host and device code 250f, 255, 258t
host code implementation 258, 259f
missing parts 255
trampoline 256, 257f
parallel_for_each construct 252
captured variables 252
compute domain 252
extent object 252
functors 252
restrict(amp) modifier 253
tiling 
address space qualifier 264
compute domain division 263
explicit approach 263
implicit approach 263
vector addition 250, 250f, 251, 251f
Callback functions 114
Central processing unit (CPU) 16, 77
ARM Cortex 30
Atom 30, 31
EPIC 31, 32
FX-8350 
AMD Piledriver architecture 191
barrier operations 189
C code compilation and execution 187
fine-grained synchronization 188
high-level design 187, 188f
local memory 191, 192f
runtime, thread creation 188
work-group execution, x86 architecture 189, 190f
work-groups scheduling 188, 189f
work-item stack data storage 190
Haswell 31
low-power 30
mainstream desktop 31
Niagara 32, 32f
OpenMP parallelization 216
Puma 25, 26f, 30, 37
sequential implementation 215
SPARC T-series 32, 32f, 33
Steamroller 25, 26f, 31, 37, 37f
Chunking 10
Clang block syntax 136, 137, 138
Coalesced memory accesses 218, 219f, 220f
Coalescing technique 196
Coarse-grained buffer SVM 159, 159t, 262, 262t, 263
CodeXL 
AMD 
main modes, operation 231
Microsoft Visual Studio plug-in 232
stand-alone application 232
debugging 
API-level 244
heterogeneous environment 243
high-level overview 243, 243f
OpenCL compute kernels 244, 245
KernelAnalyzer 
Analysis view 242, 242f
offline compiler and analysis tool 238
session explorer, analysis mode 239, 240f
statistics and ISA views 239, 241f
profiling 
Application Timeline Trace/GPU performance 232
CPU performance 237
GPU kernel performance counters 236, 237f
Host API Trace View 235, 235f
OpenCL application trace 233, 234f
profiler mode 232
session explorer 233, 233f
summary pages 236
Command barriers 113
Command markers 113
Command queues 47, 279, 280, 297
Concurrent and parallel programs 7, 8f
Constant memory 61, 93, 175, 225

D

Data race 166, 166f
Data sharing and synchronization 11
Debugging 
CodeXL  See (CodeXL, debugging)
printf function 246
Device-side command-queues 
advantages 133
creation specifications 135
fork-join parallelism 132, 133f
kernel enqueuing 
block syntax 136, 137, 138
dynamic memory allocation 139
event dependencies 141
NDRange parameter 134
nested parallelism 132, 133, 133f
Device-side enqueuing 49, 133
block syntax 136, 137, 138
dynamic local memory allocation 139
event dependencies 141
flag operation 134
NDRange parameter 134
Device-side memory model 
constant memory 175
generic address space 178
global memory 
buffers 168
image objects  See (Image objects)
pipes 173
local memory 175
memory ordering and scope 
acquire 182
acquire-release 182
atomic operations 183
fences 185
relaxed 181
release 182
sequential 182
unit/multiple devices 183
work-group 183
work-item 182
memory spaces 163, 164f
private memory 178
synchronization 
atomic operations 166
barrier function 165
hierarchy of consistency 164
Divide-and-conquer methods 2
Document classification 213

E

Execution model 
command-queues 47
context 45
definition 42
device-side enqueuing 49
events 48

F

Feature generation algorithm 213, 214f
Feature histogram 
OpenMP parallelization 216
sequential implementation 215
Fences 185
Fine-grained buffer SVM 159t, 160, 262, 262t
Fine-grained parallelism 10
Fine-grained system SVM 159t, 161, 262
First in first out (FIFO)  See Pipes

G

Global memory 60, 168, 201, 203f, 204f, 205f
access pattern 204
buffers 168
byte selection 204, 204f
image objects  See (Image objects)
vs. local memory 209
memory bandwidth 202
nonconsecutive element access 202, 203f
off-chip memory 204, 205f
performance 202
pipes 173
utilization 201, 202
Graphics processing units (GPUs) 3, 16, 77
AMD Radeon R9 290X architecture 34, 35f
AMD Radeon HD 6970 GPU architecture 25, 27f
AMD Radeon HD 7970 62, 62f, 223, 225, 227
handheld 33
kernel performance counters 236, 237f
NVIDIA GeForce GTX 780 architecture 34, 36f

H

Hardware-controlled multithreading 196
Hardware trade-offs 
APUs 27
cache hierarchies 28
CPUs 16
frequency and limitations 17
GPU 16
memory systems 28
multicore architectures 25, 26f, 27f
SIMD 21, 21f
SoCs 26, 27
superscalar execution 18, 18f
thread parallelism 
SMT 22, 23, 23f, 24f
temporal multithreading 24, 25f
vector processing 21, 21f
VLIW 19, 20f
Haskell OpenCL (HOpenCL) 
advantage and disadvantage 293, 294
execution environment 295
buffers 297
command queues 297
contexts 296
kernels 298
program object 298
vector addition 298
module structure 294
platform and devices 296
reference counting 295
Heterogeneous computing 
concurrent and parallel systems 7, 8f
GPU 3
message-passing communication 9
OpenCL 12
parallel computing  See (Parallel computing/parallelism)
parallelism and concurrency 3
threads and shared memory 8
Histogram 
computation 75, 76
CPU 77
GPUs 77
host program 80
kernel 78, 79
local histogram 77
256-bit image 75, 76f
Histogram builder 214
coalesced memory accesses 218, 219f
GPU performance 227, 227t, 228t
moving cluster centroids to constant memory 225
moving SURF features to local memory 223
naive GPU implementation 217
OpenMP parallelization 216
sequential implementation 215
vectorizing computation 221
Host-side memory model 
allocation options 
accelerated processing units 156
data migration 154
explicit data transfer 149, 151f, 152
flags and host_ptr 149
flag parameter 155, 157
global memory 149
mapping 156, 158f
runtime concept 149, 150f
shared-memory system 155
synchronous execution 153
unmapping 157, 158f
zero-copy data 155
buffers 144
images 144, 145
pipes 147

I

Image classification 
BoW model 213
feature generation algorithm 213, 214f
histogram builder 214
coalesced memory accesses 218, 219f
GPU performance 227, 227t, 228t
moving cluster centroids to constant memory 225
moving SURF features to local memory 223
naive GPU implementation 217
OpenMP parallelization 216
sequential implementation 215
vectorizing computation 221
image clustering 214
Image clustering 214
Image convolution algorithm 
blurring filter 91, 92f
C++ API 95
embossing filter 91, 92f
host API signature 94
host program 96
Image2D and ImageFormat constructors 95
image memory objects 93
input and output images 95
OpenCL kernel 93
serial implementation 91
visual representation 91, 91f
Image objects 
device-side memory model 
coordinates 169, 170
filtering modes 171
samplerless read functions 171
sampler object 170
Z-order mapping 172, 172f
host-side memory model 144, 145
Image rotation 83, 83f
C pseudocode 84
host program 87
image objects 86
image sampler 85
implementation 84
pixel-based addressing 85
remainder work-groups 86

K

KernelAnalyzer 
Analysis view 242, 242f
benefits 238
offline compiler and analysis tool 238
session explorer, analysis mode 239, 240f
statistics and ISA views 239, 241f
Kernel execution model 
barrier operation 125
broadcast functions 129
built-in kernels 132
C function 121
native kernels 130, 131f
NDRange parameter 121
parallel primitive function 129
predicate evaluation function 128
SIMD execution 122
synchronization 124, 125
Kernel programming model 
compilation and argument handling 53, 54f
definition 42
execution 55
vector addition 
algorithm 50, 50f
NDRange 52, 52f
work-item 51, 53

L

Level 2 (L2) cache system 31, 195
Local data shares (LDS) 194, 195, 200, 209f, 210f
Local memory 61
AMD FX-8350 CPU 191, 192f
software-managed cache 
benefits 206
cache support 205
eight-bank LDS 209, 209f
kernel parameterization 210
padding addition, bank conflicts removal 209, 210f
programmer-controlled scratchpad memory 205
single work-group prefix sum 206, 207
16-element array, data flow 207, 208f

M

Memory checking 245
Memory model 
address spaces 62
vs. AMD Radeon HD 7970 GPU 62, 62f
data transfer commands 59
definition 42
memory objects 
buffers 57
images 57
pipes 58
memory regions 60
constant memory 61
global memory 60
local memory 61
private memory 61
Message-passing communication 9
Message Passing Interface (MPI) 9
Morton-order mapping 172, 172f
Multiple command-queues 118
OpenCL layout 118
parallel execution 120
pipelined execution 119, 119f

N

Natural language processing 213
Nbody application 234f, 235, 235f
Nested parallelism 132, 133, 133f
node-webcl module 286, 287

O

OpenCL application trace 
Application Timeline View 234, 234f
application trace profile file 233
high-level structure 234
OpenCL kernel 
binary 238
debugging 
breakpoints 245, 246f
Multi-Watch window 245, 247f
performance evaluation 239
OpenCL mapping 
AMD FX-8350 CPU 
AMD Piledriver architecture 191
barrier operations 189
C code compilation and execution 187
fine-grained synchronization 188
high-level design 187, 188f
local memory 191, 192f
runtime, thread creation 188
work-group execution, x86 architecture 189, 190f
work-groups scheduling 188, 189f
work-item stack data storage 190
AMD Radeon R9 290X GPU 
control flow management 197
hardware threads 192
high-level diagram 193, 193f
ISA divergent execution 198, 199
queuing mechanism 194
resource allocation 200
SCC register 199
SIMD unit 196, 197f
threading and memory system 194
three-dimensional graphics 192, 193
unit microarchitecture 196, 197f
wavefront scheduling 193, 194
global memory 
access pattern 204
byte selection 204, 204f
vs. local memory 209
memory bandwidth 202
nonconsecutive element access 202, 203f
off-chip memory 204, 205f
performance 202
return data 202, 203f
utilization 201, 202
local memory, software-managed cache 
benefits 206
cache support 205
eight-bank LDS 209, 209f
vs. global memory 209
hardware utilization 210
kernel parameterization 210
padding addition, bank conflicts removal 209, 210f
programmer-controlled scratchpad memory 205
single work-group prefix sum 206, 207
16-element array, data flow 207, 208f
Open Computing Language (OpenCL) 
features 75, 76t
heterogeneous computing 1, 2, 12
reporting compilation errors 107
runtime APIs 
copying input data 65
copying output data 66
creating and compiling program 65
creating buffer 64
creating command-queue per device 64
creating context 64
discovering platform and devices 63
kernel execution 65
kernel extraction 65
processing steps 63
releasing resources 66
vector addition 66
OpenMP parallelization 216
Out-of-order command-queues 116, 279

P

Packets 58, 99, 147
Parallel computing/parallelism 
chunking 10
and concurrent programs 7, 8f
data sharing and synchronization 11
fine-grained parallelism 10
multiplying elements of array 4, 4f
reduction 5, 6f
shared virtual memory 11
SIMD 10, 11
task-level parallelism 5, 6f
task parallelism 5, 5f
Parallel primitive functions 129
PCI Express bus 195f, 196
Performance improvement 225, 229
Pipes 58, 147, 173
Platform model 
abstract architecture 44
API 43
CLInfo program 45, 46f
compute units 43, 43f
definition 42
Predicate evaluation functions 128
Private memory 61, 178
Producer-consumer kernels 
create pipe 101, 102
host program 102
kernel implementations 100
pipes 99, 99f, 100
Profiling 
CodeXL  See (CodeXL, profiling)
command states 230t, 231
developer tools, commands tracking 229
event synchronization 231
PyOpenCL 291

Q

Queuing model 
barriers and markers 113
blocking memory operations 111
callback functions 114
events and commands 112, 114
for multiple devices  See (Multiple command-queues)
out-of-order queues 116
synchronization points 111
user events 115

R

Relaxed consistency model 121, 148, 180, 181
Remainder work-groups 53, 86, 123
Reservation identifiers 174
Resource allocation 200
Runtime and execution model 
kernel model 
barrier operation 125
broadcast functions 129
built-in kernels 132
C function 121
native kernels 130, 131f
NDRange parameter 121
parallel primitive function 129
predicate evaluation function 128
SIMD execution 122
synchronization 124, 125
queuing model 
barriers and markers 113
blocking memory operations 111
callback functions 114
events and commands 112, 114
for multiple devices  See (Multiple command-queues)
nested parallelism  See (Device-side command-queues)
out-of-order queues 116
synchronization points 111
user events 115

S

Scatter-gather methods 2
Sequential consistency model 181, 182
Shared virtual memory (SVM) 
address space 261
coarse-grained buffer 159, 159t, 262, 262t
fine-grained buffer 159t, 160, 262, 262t
fine-grained system 159t, 161, 262
Simultaneous multithreading (SMT) 22, 23f, 31
Single instruction multiple data (SIMD) 
AMD Radeon HD 6970 GPU 25, 27f
AMD Radeon R9 290X 34, 35f, 196
NVIDIA GeForce GTX 780 34, 36f
vs. SPMD model 11
vector processing 21, 21f
Single program, multiple data (SPMD) model 11, 13
Standard Portable Intermediate Representation (SPIR) 254, 260, 261f
Systems-on-chip (SoC) 26, 37

T

Temporal multithreading 24, 25f
Thread parallelism 
SMT 22, 23, 23f, 24f
temporal multithreading 24, 25f
Transpose 219, 220f, 227, 227t

V

Vector addition 
C API 66
CUDA C API 71
C++ wrapper 69
POSIX threads 51
serial C implementation 50
Vectorizing computation 221
Very long instruction word (VLIW) 19, 20f, 21, 27f, 31

W

WebCL 
applications 
image processing and ray tracing 282
kernel code 285
texture usage, argument 284
two textured triangles 283, 284f
implementations 288
interoperability, WebGL 282
programming 
command queues 279, 280
object-oriented nature, JavaScript 273
objects 274, 275, 275f
various numerical types 276, 277t
WebCL image 276
WebGL texture 283
server 286
specification 288
synchronization 281

Z

Z-order mapping 172, 172f