Index
Note: Page numbers followed by f indicate figures and t indicate tables.
A
AMD Piledriver architecture, 191
fine-grained synchronization, 188
runtime, thread creation, 188
work-group execution, x86 architecture, 189, 190f
work-item stack data storage, 190
AMD Radeon HD 6970 GPU architecture, 25, 27f
control flow management, 197
ISA divergent execution, 198, 199
threading and memory system, 194
three-dimensional graphics, 192, 193
Application programming interface (API), 12, 43, 291
B
document classification, 213
feature generation algorithm, 213, 214f
natural language processing, 213
Blocking memory operations, 111
Breadth-first search (BFS) algorithm, 133
C
C++ Accelerated Massive Parallelism (C++ AMP)
address space deduction, 265
data movement optimization, 267
data-parallel algorithms, 250
parallel_for_each construct, 252
restrict(amp) modifier, 253
address space qualifier, 264
compute domain division, 263
Central processing unit (CPU), 16, 77
AMD Piledriver architecture, 191
C code compilation and execution, 187
fine-grained synchronization, 188
runtime, thread creation, 188
work-group execution, x86 architecture, 189, 190f
work-item stack data storage, 190
OpenMP parallelization, 216
sequential implementation, 215
main modes, operation, 231
Microsoft Visual Studio plug-in, 232
stand-alone application, 232
heterogeneous environment, 243
OpenCL compute kernels, 244, 245
offline compiler and analysis tool, 238
session explorer, analysis mode, 239, 240f
Application Timeline Trace/GPU performance, 232
GPU kernel performance counters, 236, 237f
Concurrent and parallel programs, 8f
D
Data sharing and synchronization, 11
Device-side command-queues
creation specifications, 135
dynamic memory allocation, 139
Device-side enqueuing, 49, 133
dynamic local memory allocation, 139
generic address space, 178
memory ordering and scope
unit/multiple devices, 183
hierarchy of consistency, 164
Divide-and-conquer methods
Document classification, 213
E
F
Feature generation algorithm, 213, 214f
OpenMP parallelization, 216
sequential implementation, 215
Fine-grained parallelism, 10
First in first out (FIFO). See Pipes
G
nonconsecutive element access, 202, 203f
Graphics processing units (GPUs), 16, 77
AMD Radeon R9 290X architecture, 34, 35f
AMD Radeon HD 6970 GPU architecture, 25, 27f
kernel performance counters, 236, 237f
NVIDIA GeForce GTX 780 architecture, 34, 36f
H
Hardware-controlled multithreading, 196
frequency and limitations, 17
superscalar execution, 18, 18f
temporal multithreading, 24, 25f
vector processing, 21, 21f
advantage and disadvantage, 293, 294
execution environment, 295
concurrent and parallel systems, 8f
message-passing communication
parallelism and concurrency
threads and shared memory
coalesced memory accesses, 218, 219f
moving cluster centroids to constant memory, 225
moving SURF features to local memory, 223
naive GPU implementation, 217
OpenMP parallelization, 216
sequential implementation, 215
vectorizing computation, 221
accelerated processing units, 156
synchronous execution, 153
I
feature generation algorithm, 213, 214f
coalesced memory accesses, 218, 219f
moving cluster centroids to constant memory, 225
moving SURF features to local memory, 223
naive GPU implementation, 217
OpenMP parallelization, 216
sequential implementation, 215
vectorizing computation, 221
Image convolution algorithm
Image2D and ImageFormat constructors, 95
input and output images, 95
visual representation, 91, 91f
samplerless read functions, 171
host-side memory model, 144, 145
pixel-based addressing, 85
K
offline compiler and analysis tool, 238
session explorer, analysis mode, 239, 240f
parallel primitive function, 129
predicate evaluation function, 128
compilation and argument handling, 53, 54f
L
Level 2 (L2) cache system, 31, 195
kernel parameterization, 210
padding addition, bank conflicts removal, 209, 210f
programmer-controlled scratchpad memory, 205
single work-group prefix sum, 206, 207
16-element array, data flow, 207, 208f
M
vs. AMD Radeon HD 7970 GPU, 62, 62f
data transfer commands, 59
Message-passing communication
Message Passing Interface (MPI)
Multiple command-queues, 118
N
Natural language processing, 213
O
Application Timeline View, 234, 234f
application trace profile file, 233
performance evaluation, 239
AMD Piledriver architecture, 191
C code compilation and execution, 187
fine-grained synchronization, 188
runtime, thread creation, 188
work-group execution, x86 architecture, 189, 190f
work-item stack data storage, 190
control flow management, 197
ISA divergent execution, 198, 199
threading and memory system, 194
three-dimensional graphics, 192, 193
nonconsecutive element access, 202, 203f
local memory, software-managed cache
kernel parameterization, 210
padding addition, bank conflicts removal, 209, 210f
programmer-controlled scratchpad memory, 205
single work-group prefix sum, 206, 207
16-element array, data flow, 207, 208f
Open Computing Language (OpenCL)
heterogeneous computing, 12
reporting compilation errors, 107
creating and compiling program, 65
creating command-queue per device, 64
discovering platform and devices, 63
OpenMP parallelization, 216
Out-of-order command-queues, 116, 279
P
Parallel computing/parallelism
and concurrent programs, 8f
data sharing and synchronization, 11
fine-grained parallelism, 10
multiplying elements of array, 4f
task-level parallelism, 6f
Parallel primitive functions, 129
Performance improvement, 225, 229
Predicate evaluation functions, 128
Producer-consumer kernels
kernel implementations, 100
developer tools, commands tracking, 229
event synchronization, 231
Q
blocking memory operations, 111
synchronization points, 111
R
Reservation identifiers, 174
Runtime and execution model
parallel primitive function, 129
predicate evaluation function, 128
blocking memory operations, 111
synchronization points, 111
S
Sequential consistency model, 181, 182
Shared virtual memory (SVM)
Simultaneous multithreading (SMT), 22, 23f, 31
Single instruction multiple data (SIMD)
AMD Radeon HD 6970 GPU, 25, 27f
NVIDIA GeForce GTX 780, 34, 36f
vector processing, 21, 21f
Single program, multiple data (SPMD) model, 11, 13
Standard Portable Intermediate Representation (SPIR), 254, 260, 261f
Systems-on-chip (SoC), 26, 37
T
Temporal multithreading, 24, 25f
temporal multithreading, 24, 25f
V
serial C implementation, 50
Vectorizing computation, 221
W
image processing and ray tracing produce, 282
texture usage, argument, 284
interoperability, WebGL, 282
object-oriented nature, JavaScript, 273
Z