Index

Note: Page numbers followed by f indicate figures and t indicate tables.

Accelerated processing units (APUs) 26, 27, 37, 37f, 38f

A10-7850K APU 37, 37f

AMD FX-8350 CPU

AMD Piledriver architecture 191

barrier operations 189

fine-grained synchronization 188

high-level design 187, 188f

local memory 191, 192f

runtime, thread creation 188

work-group execution, x86 architecture 189, 190f

work-groups scheduling 188, 189f

work-item stack data storage 190

AMD Radeon HD 6970 GPU architecture 25, 27f

AMD Radeon R9 290X GPU

architecture 34, 35f

control flow management 197

hardware threads 192

high-level diagram 193, 193f

ISA divergent execution 198, 199

queuing mechanism 194

SCC register 199

SIMD unit 196, 197f

threading and memory system 194

three-dimensional graphics 192, 193

unit microarchitecture 196, 197f

wavefront scheduling 193, 194

Aparapi 292, 293

API See Application programming interface (API)

API-level debugging 244

Application programming interface (API) 12, 43, 291

APUs See Accelerated processing units (APUs)

Bag-of-words (BoW) model

document classification 213

image classification 213, 214f

feature generation algorithm 213, 214f

histogram builder 214, 215, 217

image clustering 214

natural language processing 213

BFS algorithm See Breadth-first search (BFS) algorithm

Blocking memory operations 111

BoW model See Bag-of-words (BoW) model

Breadth-first search (BFS) algorithm 133

Broadcast functions 129

C++ Accelerated Massive Parallelism (C++ AMP)

address space deduction 265

array_view 250f, 251

binomial functions 268, 270

Clamp compiler

compilation flow 259

components 254

SPIR code 254, 260

data movement optimization 267

data-parallel algorithms 250

features 249

Lambda 251

mapping 254, 255t

C++ mangling rules 255

functor version 255, 256f

host and device code 250f, 255, 258t

host code implementation 258, 259f

missing parts 255

trampoline 256, 257f

parallel_for_each construct 252

captured variables 252

compute domain 252

extent object 252

functors 252

restrict(amp) modifier 253

SVM See (Shared virtual memory (SVM))

tiling

address space qualifier 264

compute domain division 263

explicit approach 263

implicit approach 263

vector addition 250, 250f, 251, 251f

Callback functions 114

C++ AMP See C++ Accelerated Massive Parallelism (C++ AMP)

Central processing unit (CPU) 16, 77

ARM Cortex 30

Atom 30, 31

EPIC 31, 32

FX-8350

AMD Piledriver architecture 191

barrier operations 189

C code compilation and execution 187

fine-grained synchronization 188

high-level design 187, 188f

local memory 191, 192f

runtime, thread creation 188

work-group execution, x86 architecture 189, 190f

work-groups scheduling 188, 189f

work-item stack data storage 190

Haswell 31

low-power 30

mainstream desktop 31

Niagara 32, 32f

OpenMP parallelization 216

Puma 25, 26f, 30, 37

sequential implementation 215

SPARC T-series 32, 32f, 33

Steamroller 25, 26f, 31, 37, 37f

Chunking 10

Clang block syntax 136, 137, 138

Coalesced memory accesses 218, 219f, 220f

Coalescing technique 196

Coarse-grained buffer SVM 159, 159t, 262, 262t, 263

CodeXL

AMD

main modes, operation 231

Microsoft Visual Studio plug-in 232

stand-alone application 232

debugging

API-level 244

heterogeneous environment 243

high-level overview 243, 243f

OpenCL compute kernels 244, 245

KernelAnalyzer

Analysis view 242, 242f

offline compiler and analysis tool 238

session explorer, analysis mode 239, 240f

statistics and ISA views 239, 241f

profiling

Application Timeline Trace/GPU performance 232

CPU performance 237

GPU kernel performance counters 236, 237f

Host API Trace View 235, 235f

OpenCL application trace 233, 234f

profiler mode 232

session explorer 233, 233f

summary pages 236

Command barriers 113

Command markers 113

Command queues 47, 279, 280, 297

Concurrent and parallel programs 7, 8f

Constant memory 61, 93, 175, 225

CPU See Central processing unit (CPU)

Data race 166, 166f

Data sharing and synchronization 11

Debugging

CodeXL See (CodeXL, debugging)

printf function 246

Device-side command-queues

advantages 133

creation specifications 135

fork-join parallelism 132, 133f

kernel enqueuing

block syntax 136, 137, 138

dynamic memory allocation 139

event dependencies 141

NDRange parameter 134

nested parallelism 132, 133, 133f

Device-side enqueuing 49, 133

block syntax 136, 137, 138

dynamic local memory allocation 139

event dependencies 141

flag operation 134

NDRange parameter 134

Device-side memory model

constant memory 175

generic address space 178

global memory

buffers 168

image objects See (Image objects)

pipes 173

local memory 175

memory ordering and scope

acquire 182

acquire-release 182

atomic operations 183

fences 185

relaxed 181

release 182

sequential 182

unit/multiple devices 183

work-group 183

work-item 182

memory spaces 163, 164f

private memory 178

synchronization

atomic operations 166

barrier function 165

hierarchy of consistency 164

Divide-and-conquer methods 2

Document classification 213

Execution model

command-queues 47

context 45

definition 42

device-side enqueuing 49

events 48

Feature generation algorithm 213, 214f

Feature histogram

OpenMP parallelization 216

sequential implementation 215

Fences 185

Fine-grained buffer SVM 159t, 160, 262, 262t

Fine-grained parallelism 10

Fine-grained system SVM 159t, 161, 262

First in first out (FIFO) See Pipes

Global memory 60, 168, 201, 203f, 204f, 205f

access pattern 204

buffers 168

byte selection 204, 204f

image objects See (Image objects)

vs. local memory 209

memory bandwidth 202

nonconsecutive element access 202, 203f

off-chip memory 204, 205f

performance 202

pipes 173

utilization 201, 202

Graphics processing units (GPUs) 3, 16, 77

AMD Radeon R9 290X architecture 34, 35f

AMD Radeon HD 6970 GPU architecture 25, 27f

AMD Radeon HD 7970 62, 62f, 223, 225, 227

handheld 33

kernel performance counters 236, 237f

NVIDIA GeForce GTX 780 architecture 34, 36f

Hardware-controlled multithreading 196

Hardware trade-offs

APUs 27

cache hierarchies 28

CPUs 16

frequency and limitations 17

GPU 16

memory systems 28

multicore architectures 25, 26f, 27f

SIMD 21, 21f

SoCs 26, 27

superscalar execution 18, 18f

thread parallelism

SMT 22, 23, 23f, 24f

temporal multithreading 24, 25f

vector processing 21, 21f

VLIW 19, 20f

Haskell OpenCL (HOpenCL)

advantage and disadvantage 293, 294

execution environment 295

buffers 297

command queues 297

contexts 296

kernels 298

program object 298

vector addition 298

module structure 294

platform and devices 296

reference counting 295

Heterogeneous computing

concurrent and parallel systems 7, 8f

GPU 3

message-passing communication 9

OpenCL 12

parallel computing See (Parallel computing/parallelism)

parallelism and concurrency 3

threads and shared memory 8

Histogram

computation 75, 76

CPU 77

GPUs 77

host program 80

kernel 78, 79

local histogram 77

256-bit image 75, 76f

Histogram builder 214

coalesced memory accesses 218, 219f

GPU performance 227, 227t, 228t

moving cluster centroids to constant memory 225

moving SURF features to local memory 223

naive GPU implementation 217

OpenMP parallelization 216

sequential implementation 215

vectorizing computation 221

Host-side memory model

allocation options

accelerated processing units 156

data migration 154

explicit data transfer 149, 151f, 152

flags and host_ptr 149

flag parameter 155, 157

global memory 149

mapping 156, 158f

runtime concept 149, 150f

shared-memory system 155

synchronous execution 153

unmapping 157, 158f

zero-copy data 155

buffers 144

images 144, 145

pipes 147

Image classification

BoW model 213

feature generation algorithm 213, 214f

histogram builder 214

coalesced memory accesses 218, 219f

GPU performance 227, 227t, 228t

moving cluster centroids to constant memory 225

moving SURF features to local memory 223

naive GPU implementation 217

OpenMP parallelization 216

sequential implementation 215

vectorizing computation 221

image clustering 214

Image clustering 214

Image convolution algorithm

blurring filter 91, 92f

C++ API 95

embossing filter 91, 92f

host API signature 94

host program 96

Image2D and ImageFormat constructors 95

image memory objects 93

input and output images 95

OpenCL kernel 93

serial implementation 91

visual representation 91, 91f

Image objects

device-side memory model

coordinates 169, 170

filtering modes 171

samplerless read functions 171

sampler object 170

Z-order mapping 172, 172f

host-side memory model 144, 145

Image rotation 83, 83f

C pseudocode 84

host program 87

image objects 86

image sampler 85

implementation 84

pixel-based addressing 85

remainder work-groups 86

KernelAnalyzer

Analysis view 242, 242f

benefits 238

offline compiler and analysis tool 238

session explorer, analysis mode 239, 240f

statistics and ISA views 239, 241f

Kernel execution model

barrier operation 125

broadcast functions 129

built-in kernels 132

C function 121

native kernels 130, 131f

NDRange parameter 121

parallel primitive function 129

predicate evaluation function 128

SIMD execution 122

synchronization 124, 125

Kernel programming model

compilation and argument handling 53, 54f

definition 42

execution 55

vector addition

algorithm 50, 50f

NDRange 52, 52f

work-item 51, 53

LDS See Local data shares (LDS)

Level 2 (L2) cache system 31, 195

Local data shares (LDS) 194, 195, 200, 209f, 210f

Local memory 61

AMD FX-8350 CPU 191, 192f

software-managed cache

benefits 206

cache support 205

eight-bank LDS 209, 209f

kernel parameterization 210

padding addition, bank conflicts removal 209, 210f

programmer-controlled scratchpad memory 205

single work-group prefix sum 206, 207

16-element array, data flow 207, 208f

Memory checking 245

Memory model

address spaces 62

vs. AMD Radeon HD 7970 GPU 62, 62f

data transfer commands 59

definition 42

memory objects

buffers 57

images 57

pipes 58

memory regions 60

constant memory 61

global memory 60

local memory 61

private memory 61

Message-passing communication 9

Message Passing Interface (MPI) 9

Morton-order mapping 172, 172f

MPI See Message Passing Interface (MPI)

Multiple command-queues 118

OpenCL layout 118

parallel execution 120

pipelined execution 119, 119f

Natural language processing 213

Nbody application 234f, 235, 235f

Nested parallelism 132, 133, 133f

node-webcl module 286, 287

OpenCL See Open Computing Language (OpenCL)

OpenCL application trace

Application Timeline View 234, 234f

application trace profile file 233

high-level structure 234

OpenCL kernel

binary 238

debugging

breakpoints 245, 246f

Multi-Watch window 245, 247f

performance evaluation 239

OpenCL mapping

AMD FX-8350 CPU

AMD Piledriver architecture 191

barrier operations 189

C code compilation and execution 187

fine-grained synchronization 188

high-level design 187, 188f

local memory 191, 192f

runtime, thread creation 188

work-group execution, x86 architecture 189, 190f

work-groups scheduling 188, 189f

work-item stack data storage 190

AMD Radeon R9 290X GPU

control flow management 197

hardware threads 192

high-level diagram 193, 193f

ISA divergent execution 198, 199

queuing mechanism 194

resource allocation 200

SCC register 199

SIMD unit 196, 197f

threading and memory system 194

three-dimensional graphics 192, 193

unit microarchitecture 196, 197f

wavefront scheduling 193, 194

global memory

access pattern 204

byte selection 204, 204f

vs. local memory 209

memory bandwidth 202

nonconsecutive element access 202, 203f

off-chip memory 204, 205f

performance 202

return data 202, 203f

utilization 201, 202

local memory, software-managed cache

benefits 206

cache support 205

eight-bank LDS 209, 209f

vs. global memory 209

hardware utilization 210

kernel parameterization 210

padding addition, bank conflicts removal 209, 210f

programmer-controlled scratchpad memory 205

single work-group prefix sum 206, 207

16-element array, data flow 207, 208f

Open Computing Language (OpenCL)

features 75, 76t

heterogeneous computing 1, 2, 12

reporting compilation errors 107

runtime APIs

copying input data 65

copying output data 66

creating and compiling program 65

creating buffer 64

creating command-queue per device 64

creating context 64

discovering platform and devices 63

kernel execution 65

kernel extraction 65

processing steps 63

releasing resources 66

vector addition 66

OpenMP parallelization 216

Out-of-order command-queues 116, 279

Packets 58, 99, 147

Parallel computing/parallelism

chunking 10

and concurrent programs 7, 8f

data sharing and synchronization 11

fine-grained parallelism 10

multiplying elements of array 4, 4f

reduction 5, 6f

shared virtual memory 11

SIMD 10, 11

task-level parallelism 5, 6f

task parallelism 5, 5f

Parallel primitive functions 129

PCI Express bus 195f, 196

Performance improvement 225, 229

Pipes 58, 147, 173

Platform model

abstract architecture 44

API 43

CLInfo program 45, 46f

compute units 43, 43f

definition 42

Predicate evaluation functions 128

Private memory 61, 178

Producer-consumer kernels

create pipe 101, 102

host program 102

kernel implementations 100

pipes 99, 99f, 100

Profiling

CodeXL See (CodeXL, profiling)

command states 230t, 231

developer tools, commands tracking 229

event synchronization 231

PyOpenCL 291

Queuing model

barriers and markers 113

blocking memory operations 111

callback functions 114

events and commands 112, 114

for multiple devices See (Multiple command-queues)

out-of-order queues 116

synchronization points 111

user events 115

Relaxed consistency model 121, 148, 180, 181

Remainder work-groups 53, 86, 123

Reservation identifiers 174

Resource allocation 200

Runtime and execution model

kernel model

barrier operation 125

broadcast functions 129

built-in kernels 132

C function 121

native kernels 130, 131f

NDRange parameter 121

parallel primitive function 129

predicate evaluation function 128

SIMD execution 122

synchronization 124, 125

queuing model

barriers and markers 113

blocking memory operations 111

callback functions 114

events and commands 112, 114

for multiple devices See (Multiple command-queues)

nested parallelism See (Device-side command-queues)

out-of-order queues 116

synchronization points 111

user events 115

Scatter-gather methods 2

Sequential consistency model 181, 182

Shared virtual memory (SVM)

address space 261

coarse-grained buffer 159, 159t, 262, 262t

fine-grained buffer 159t, 160, 262, 262t

fine-grained system 159t, 161, 262

SIMD See Single instruction multiple data (SIMD)

Simultaneous multithreading (SMT) 22, 23f, 31

Single instruction multiple data (SIMD)

AMD Radeon HD 6970 GPU 25, 27f

AMD Radeon R9 290X 34, 35f, 196

NVIDIA GeForce GTX 780 34, 36f

vs. SPMD model 11

vector processing 21, 21f

Single program, multiple data (SPMD) model 11, 13

SMT See Simultaneous multithreading (SMT)

SoC See Systems-on-chip (SoC)

SPMD model See Single program, multiple data (SPMD) model

Standard Portable Intermediate Representation (SPIR) 254, 260, 261f

SVM See Shared virtual memory (SVM)

Systems-on-chip (SoC) 26, 37

Temporal multithreading 24, 25f

Thread parallelism

SMT 22, 23, 23f, 24f

temporal multithreading 24, 25f

Transpose 219, 220f, 227, 227t

Vector addition

C API 66

CUDA C API 71

C++ wrapper 69

POSIX threads 51

serial C implementation 50

Vectorizing computation 221

Very long instruction word (VLIW) 19, 20f, 21, 27f, 31

WebCL

applications

image processing and ray tracing 282

kernel code 285

texture usage, argument 284

two textured triangles 283, 284f

implementations 288

interoperability, WebGL 282

programming

command queues 279, 280

object-oriented nature, JavaScript 273

objects 274, 275, 275f

various numerical types 276, 277t

WebCL image 276

WebGL texture 283

server 286

specification 288

synchronization 281

Z-order mapping 172, 172f
