Index

Symbols

-- (pre-decrement) unary operator, 131

- (subtract) operator, 124–126

?: (ternary selection) operator, 129

- or -- (unary) operators, 131

| or || (or) operators, 127–128

+ (addition) operator, 124–126

+ or ++ (post-increment) unary operator, 131

!= (not equal) operator, 127

== (equal) operator, 127

% (remainder) operator, 124–126

& or && (and) operators, 127–128

* (multiply) operator, 124–126

^ (exclusive or) operator, 127–128

^^ (exclusive) operator, 128

~ (not) operator, 127–128

> (greater than) operator, 127

>= (greater than or equal) operator, 127

>> (right shift) operator, 129–130

Numbers

0 value, 64–65, 68

2D composition, in DFT, 457–458

64-bit integers, embedded profile, 385–386

754 formats, IEEE floating-point arithmetic, 34

A

accelerator devices

defined, 69

tiled and packetized sparse matrix design, 523, 534

access qualifiers

as keywords in OpenCL C, 141

overview of, 140–141

reference guide, 576

add (+) arithmetic operator, 124–126

address space qualifiers

casting between address spaces, 139–140

constant, 137–138

global, 136

as keywords in OpenCL C, 141

local, 138–139

overview of, 135–136

private, 139

reference guide, 554

supported, 99

addressing mode, sampler objects, 282, 292–295

ALL_BUILD project, Visual Studio, 43

AltiVec Technology Programming Interface Manual, 111–113

AMD

generating project in Linux, 40–41

generating project in Windows, 40–41

storing binaries in own format, 233

and (& or &&) operators, 127–128

Apple

initializing contexts for OpenGL interoperability, 338

querying number of platforms, 64

storing binaries in own format, 233

application data types, 103–104

ARB_cl_event extension, OpenGL, 349–350

architecture diagram, OpenCL device, 577

arguments

context, 85

device, 68

enqueuing commands, 313

gaussian_kernel(), 296–297

kernel function restrictions, 146

reference guide for kernel, 548

setting kernel, 55–57, 237–240

arithmetic operators

overview of, 124–126

post- and pre-increment (++ and --) unary, 131

symbols, 123

unary (+ and -), 131

arrays

parallelizing Dijkstra’s algorithm, 412–414

representing sparse matrix with binary data, 516

as_type(), 121–123

as_typen(), 121–123

ASCII File, representing sparse matrix, 516–517

assignment (=) operator, 124, 132

async copy and prefetch functions, 191–195, 570

ATI Stream SDK

generating project in Linux and Eclipse, 44–45

generating project in Visual Studio, 42–44

generating project in Windows, 40

querying and selecting platform, 65–66

querying context for devices, 89

querying devices, 70

atomic built-in functions

embedded profile options, 387

overview of, 195–198

reference guide, 568–569

__attribute__ keyword, kernel qualifier, 133–134

attributes, specifying type, 555

automatic load balancing, 20

B

barrier synchronization function, 190–191

batches

executing cloth simulation on GPU, 433–441

SpMV implementation, 518

behavior description, optional extension, 144

bilinear sampling object, optical flow, 476

binaries, program

creating, 235–236

HelloBinaryWorld example, 229–230

HelloWorld.cl (NVIDIA) example, 233–236

overview of, 227–229

querying and storing, 230–232

binary data arrays, sparse matrix, 516

bit field numbers, 147

bitwise operators, 124, 127–128

blocking enqueue calls, and callbacks, 327

blocking_read, executing kernel, 56

bool, rank order of, 113

border color, built-in functions, 209–210

bracket() operator, C++ Wrapper API, 370–371

buffers and sub-buffers

computing Dijkstra’s algorithm, 415

copying, 274–276

copying from image to, 299, 303–304

creating, 249–256

creating from OpenGL, 339–343

creating kernel and memory objects, 377–378

direct translation of matrix multiplication into OpenCL, 502

executing Vector Add kernel, 377–378, 381

mapping, 276–279

in memory model, 21

Ocean application, 451

OpenCL/OpenGL sharing APIs, 446–448, 578

overview of, 247–248

querying, 257–259

reading and writing, 259–274

reference guide, 544–545

building program objects

reference guide, 546–547

using clBuildProgram(). see clBuildProgram()

built-in data types

other, 108–109

reference guide, 550–552

scalar, 99–101

vector, 102–103

built-in functions

async copy and prefetch, 191–195

atomic, 195–198, 387, 568–569

border color, 209–210

common, 172–175, 559

floating-point constant, 162–163

floating-point pragma, 162

geometric, 175–177, 563–564

image read and write, 201–206, 572–573

integer, 168–172, 557–558

math, 153–161, 560–563

miscellaneous vector, 199–200, 571

overview of, 149

querying image information, 214–215

relational, 175, 178–181, 564–567

relative error as ulps, 163–168

samplers, 206–209

synchronization, 190–191

vector data load and store, 181–189

work-item, 150–152, 557

writing to image, 210–213

Bullet Physics SDK. see cloth simulation in Bullet Physics SDK

bytes, and vector data types, 102

C

C++ Wrapper API

defined, 369

exceptions, 371–374

Ocean application overview, 451

overview of, 369–371

C++ Wrapper API, Vector Add example

choosing device and creating command-queue, 375–377

choosing platform and creating context, 375

creating and building program object, 377

creating kernel and memory objects, 377–378

executing Vector Add kernel, 378–382

structure of OpenCL setup, 374–375

C99 language

OpenCL C derived from, 32–33, 97

OpenCL C features added to, 99

callbacks

creating OpenCL contexts, 85

event objects. see clSetEventCallback()

events impacting execution on host, 324–327

placing profiling functions inside, 331–332

steps in Ocean application, 451

capacitance, of multicore chips, 4–5

case studies

cloth simulation. see cloth simulation in Bullet Physics SDK

Dijkstra’s algorithm. see Dijkstra’s algorithm, parallelizing

image histogram. see image histograms

matrix multiplication. see matrix multiplication

optical flow. see optical flow

PyOpenCL. see PyOpenCL

simulating ocean. see Ocean simulation, with FFT

Sobel edge detection filter, 407–410

casts

explicit, 116

implicit conversions between vectors and, 111–113

clEnqueueNDRangeKernel(), 251, 255

clCreateSampler(), 292–295

CL_COMPLETE value, command-queue, 311

CL_CONTEXT_DEVICES, C++ Wrapper API, 376

cl_context_properties fields, initializing contexts, 338–339

CL_DEVICE_IMAGE_SUPPORT property, clGetDeviceInfo(), 386–387

CL_DEVICE_IMAGE3D_MAX_WIDTH property, clGetDeviceInfo(), 386–387

CL_DEVICE_MAX_COMPUTE_UNITS, 506–509

CL_DEVICE_TYPE_GPU, 502

__CL_ENABLE_EXCEPTIONS preprocessor macro, 372

cl_image_format, 285, 287–291

cl_int clFinish(), 248

cl_int clWaitForEvents(), 248

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query, 243–244

CL_KERNEL_WORK_GROUP_SIZE query, 243–244

cl_khr_gl_event extension, 342, 348

cl_khr_gl_sharing extension, 336–337, 342

cl_map_flags, clEnqueueMapBuffer(), 276–277

cl_mem object, creating images, 284

CL_MEM_COPY_FROM_HOST_PTR, 377–378

cl_mem_flags, clCreateBuffer(), 249–250

CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR memory type, 55

CL_MEM_READ_WRITE, 308

CL_MEM_USE_HOST_PTR, 377–378

cl_int error values, C++ Wrapper API, 371

cl_platform, 370–371

CL_PROFILING_COMMAND_END, 502

CL_PROFILING_COMMAND_START, 502

CL_QUEUE_PROFILING_ENABLE flag, 328

CL_QUEUE_PROFILING_ENABLE property, 502

CL_QUEUED value, command-queue, 311

CL_RUNNING value, command-queue, 311

CL_SUBMITTED value, command-queue, 311

CL_SUCCESS return value, clBuildProgram(), 220

__CL_USER_OVERRIDE_ERROR_STRINGS preprocessor macro, 372

classes, C++ Wrapper API hierarchy, 369–370

clBarrier(), 313–316

clBuffer(), 54

cl::Buffer(), 377–378, 381

clBuildProgram()

build options, 546–547

building program object, 219–220, 222

creating program from binary, 234–236

floating-point options, 224

miscellaneous options, 226–227

optimization options, 225–226

preprocessor build options, 223–224

querying program objects, 237

reference guide, 546

cl::CommandQueue::enqueueMapBuffer(), 379, 381

cl::CommandQueue::enqueueUnmapMemObject(), 379, 382

cl::Context(), 375

cl::Context::getInfo(), 376

clCreateBuffer()

creating buffers and sub-buffers, 249–251

creating memory objects, 54–55

direct translation of matrix multiplication into OpenCL, 502

reference guide, 544

setting kernel arguments, 239

clCreateCommandQueue(), 51–52, 543

clCreateContext(), 84–87, 541

clCreateContextFromType()

creating contexts, 84–85

querying context for associated devices, 88

reference guide, 541

clCreateEventFromGLsyncKHR()

explicit synchronization, 349

reference guide, 579

synchronization between OpenCL/OpenGL, 350–351

clCreateFromD3D10BufferKHR(), 580

clCreateFromD3D10Texture2DKHR(), 580

clCreateFromD3D10Texture3DKHR(), 580

clCreateFromGL*(), 335, 448

clCreateFromGLBuffer(), 339–343, 578

clCreateFromGLRenderbuffer()

creating memory objects from OpenGL, 341

reference guide, 578

sharing with OpenCL, 346–347

clCreateFromGLTexture2D(), 341, 578

clCreateFromGLTexture3D(), 341, 578

clCreateImage2D()

creating 2D image from file, 284–285

creating image objects, 283–284

reference guide, 573–574

clCreateImage3D(), 283–284, 574

clCreateKernel()

creating kernel objects, 237–238

reference guide, 547

setting kernel arguments, 239–240

clCreateKernelsInProgram(), 240–241, 547

clCreateProgram(), 221

clCreateProgramWithBinary()

creating programs from binaries, 228–229

HelloBinaryWorld example, 229–230

reference guide, 546

clCreateProgramWithSource()

creating and building program object, 52–53

creating program object from source, 218–219, 222

reference guide, 546

clCreateSampler(), 292–294, 576

clCreateSubBuffer(), 253–256, 544

clCreateUserEvent()

generating events on host, 321–322

how to use, 323–324

reference guide, 549

clEnqueueAcquireD3D10ObjectsKHR(), 580

clEnqueueAcquireGLObjects()

creating OpenCL buffers from OpenGL buffers, 341–342

explicit synchronization, 349

implicit synchronization, 348–349

reference guide, 579

clEnqueueBarrier()

function of, 316–317

ordering constraints between commands, 313

reference guide, 549

clEnqueueCopyBuffer(), 275–276, 545

clEnqueueCopyBufferToImage()

copying from buffer to image, 303–305

defined, 299

reference guide, 574

clEnqueueCopyImage()

copy image objects, 302–303

defined, 299

reference guide, 575

clEnqueueCopyImageToBuffer()

copying from image to buffer, 303–304

defined, 299

reference guide, 574

clEnqueueMapBuffer()

mapping buffers and sub-buffers, 276–278

moving data to and from buffer, 278–279

reference guide, 545

clEnqueueMapImage()

defined, 299

mapping image objects into host memory, 305–308

reference guide, 574

clEnqueueMarker(), 314–317, 549

clEnqueueMarker()

defining synchronization points, 314

function of, 315–317

clEnqueueNativeKernel(), 548

clEnqueueNDRangeKernel()

events and command-queues, 312

executing kernel, 56–57

reference guide, 548

work-items, 150

clEnqueueReadBuffer()

reading buffers, 260–261, 268–269

reading results back from kernel, 48, 56–57

reference guide, 544

clEnqueueReadBufferRect(), 269–272, 544

clEnqueueReadImage()

defined, 299–301

mapping image results to host memory pointer, 307–308

reference guide, 575

clEnqueueReleaseD3D10ObjectsKHR(), 580

clEnqueueReleaseGLObjects()

implicit synchronization, 348–349

reference guide, 579

releasing objects acquired by OpenCL, 341–342

synchronization between OpenCL/OpenGL, 351

clEnqueueTask(), 150, 548

clEnqueueUnmapMapImage(), 305–306

clEnqueueUnmapMemObject()

buffer mapping no longer required, 277–278

moving data to and from buffer, 278–279

reference guide, 545

releasing image data, 308

clEnqueueWaitForEvents(), 314–317, 549

clEnqueueWriteBuffer()

reference guide, 544

writing buffers, 259–260, 267

clEnqueueWriteBufferRect(), 272–273, 544–545

clEnqueueWriteImage()

defined, 299

reference guide, 575

writing images from host to device memory, 301–302

cles_khr_int64 extension string, embedded profile, 385–386

clFinish()

creating OpenCL buffers from OpenGL buffers, 342–343

OpenCL/OpenGL synchronization with, 348

OpenCL/OpenGL synchronization without, 351

preprocessor error macro for, 327

reference guide, 549

clFlush()

preprocessor error macro for, 327

reference guide, 549

using callbacks with events, 327

cl.get_platforms(), PyOpenCL, 493

clGetCommandQueueInfo(), 543

clGetContextInfo()

HelloWorld example, 50–51

querying context properties, 86–87

querying list of associated devices, 88

reference guide, 542

clGetDeviceIDs()

convolution signal example, 91

querying devices, 68–69

translation of matrix multiplication into OpenCL, 502

clGetDeviceIDsFromD3D10KHR(), 542

clGetDeviceInfo()

determining images supported, 290

embedded profile, 384

matrix multiplication, 506–509

querying context for associated devices, 88

querying device information, 70–78

querying embedded profile device support for images, 386–387

querying for OpenGL sharing extension, 336–337

reference guide, 542–543, 579

steps in OpenCL usage, 83

clGetEventInfo(), 319–320, 549

clGetEventProfilingInfo()

direct translation of matrix multiplication, 502

errors, 329–330

extracting timing data, 328

placing profiling functions inside callbacks, 332

profiling information and return types, 329

reference guide, 549

clGetGLContextInfoKHR(), 579

clGetGLObjectInfo(), 347–348, 578

clGetGLTextureInfo(), 578

clGetImageInfo(), 286

clGetKernelInfo(), 242–243, 548

clGetKernelWorkGroupInfo(), 243–244, 548

clGetMemObjectInfo()

querying buffers and sub-buffers, 257–259

querying image object, 286

reference guide, 545

clGetPlatformIDs()

querying context for associated devices, 88

querying platforms, 63–64

reference guide, 542

clGetPlatformInfo()

embedded profile, 384

querying and selecting platform, 65–67

reference guide, 542

clGetProgramBuildInfo()

creating and building program object, 52–53

detecting build error, 220–221, 222

direct translation of matrix multiplication, 502

reference guide, 547

clGetProgramInfo()

getting program binary back from built program, 227–228

reference guide, 547

clGetSamplerInfo(), 294–295, 576

clGetSupportedImageFormats(), 291, 574

clGetXXInfo(), use of in this book, 70

CLK_GLOBAL_MEM_FENCE value, barrier functions, 190–191

CLK_LOCAL_MEM_FENCE value, barrier functions, 190–191

cl::Kernel(), 378

cl::Kernel::setArg(), 378

cloth simulation in Bullet Physics SDK

adding OpenGL interoperation, 446–448

executing on CPU, 431–432

executing on GPU, 432–438

introduction to, 425–428

optimizing for SIMD computation and local memory, 441–446

overview of, 425

of soft body, 429–431

two-layered batching, 438–441

cl::Program(), 377

clReleaseCommandQueue(), 543

clReleaseContext(), 89, 542

clReleaseEvent(), 318–319, 549

clReleaseKernel(), 244–245, 548

clReleaseMemObject()

reference guide, 545

release buffer object, 339

release image object, 284

clReleaseProgram(), 236, 546

clReleaseSampler(), 294, 576

clRetainCommandQueue(), 543

clRetainContext(), 89, 541

clRetainEvent(), 318, 549

clRetainKernel(), 245, 548

clRetainMemObject(), 339, 545

clRetainProgram(), 236–237, 546

clRetainSampler(), 576

clSetEventCallback()

events impacting execution on host, 325–326

placing profiling functions inside callbacks, 331–332

reference guide, 549

clSetKernelArg()

creating buffers and sub-buffers, 250, 255

executing kernel, 55–56

executing Vector Add kernel, 378

matrix multiplication using local memory, 509–511

reference guide, 548

sampler declaration fields, 577

setting kernel arguments, 56, 237–240

thread safety and, 241–242

clSetMemObjectDestructorCallback(), 545

clSetUserEventStatus()

generating events on host, 322

how to use, 323–324

reference guide, 549

clUnloadCompiler(), 237, 547

clWaitForEvents(), 323–324, 549

CMake tool

generating project in Linux and Eclipse, 44–45

generating project in Visual Studio, 42–44

installing as cross-platform build tool, 40–41

Mac OS X and Code::Blocks, 40–41

cmake-gui, 42–44

Code::Blocks, 41–42

color, cloth simulation

executing on GPU, 433–438

in two-layered batching, 438–441

color images. see image histograms

comma operator (,), 131

command-queue

acquiring OpenGL objects, 341–342

as core of OpenCL, 309–310

creating, 50–52

creating after selecting set of devices, 377

creating in PyOpenCL, 493

defining consistency of memory objects on, 24

direct translation of matrix multiplication into OpenCL, 502

events and, 311–317

executing kernel, 56–57

in execution model, 18–21

execution of Vector Add kernel, 378, 380

OpenCL runtime reference guide, 543

runtime API setting up, 31–32

transferring image objects to, 299–300

common functions, 172–175

compiler

directives for optional extensions, 143–145

unloading OpenCL, 547

component selection syntax, vectors, 106–107

components, vector data type, 106–108

compute device, platform model, 12

compute units, platform model, 12

concurrency, 7–8

exploiting in command-queues, 310

kernel execution model, 14

parallel algorithm limitations, 28–29

conditional operator, 124, 129

const type qualifier, 141

constant (__constant) address space qualifier, 137–138, 141

constant memory

device architecture diagram, 577

memory model, 21–23

contexts

allocating memory objects against, 248

choosing platform and creating, 375

convolution signal example, 89–97

creating, 49–50, 84–87

creating in PyOpenCL, 492–493

defining in execution model, 17–18

incrementing and decrementing reference count, 89

initializing for OpenGL interoperability, 338–339

OpenCL platform layer, 541–542

overview of, 83

querying properties, 85–87

steps in OpenCL, 84

convergence, simulating soft body, 430

conversion

embedded profile device support rules, 386–387

explicit, 117–121, 132

vector component, 554

convert_int(), explicit conversions, 118

convolution signal example, 89–97

coordinate mode, sampler objects, 282, 292–295

copy

buffers and sub-buffers, 274–276, 545

image objects, 302–305, 308, 575

costArray:, Dijkstra’s algorithm, 413–414, 415–417

CPUs

executing cloth simulation on, 431–432

heterogeneous future of multicore, 4–7

matrix multiplication and performance results, 511–513

SpMV implementation, 518–519

CreateCommandQueue(), 50–51

CreateContext(), 49–50, 375

CreateMemObjects(), 54–55

CSR format, sparse matrix, 517

D

DAG (directed acyclic graph), command-queues and, 310

data load and store functions, vectors, 181–189

data structure, Dijkstra’s algorithm, 412–414

data types

explicit casts, 116–117

explicit conversions, 117–121

implicit type conversions, 110–115

reference guide for supported, 550–552

reinterpreting data as other, 121–123

reserved as keywords in OpenCL C, 141

scalar. see scalar data types

specifying attributes, 555

vector. see vector data types

data-parallel programming model

overview of, 8–9

parallel algorithm limitations, 28–29

understanding, 25–27

writing kernel using OpenCL C, 97–99

decimation kernel, optical flow, 474

declaration fields, sampler, 577

default device, 69

#define preprocessor directive, 142, 145

denormalized numbers, 34, 388

dense matrix, 499

dense optical flow, 469

derived types, OpenCL C, 109–110

design, for tiled and packetized sparse matrix, 523–524

device_type argument, querying devices, 68

devices

architecture diagram, 577

choosing first available, 50–52

convolution signal example, 89–97

creating context in execution model, 17–18

determining profile support by, 390

embedded profile for hand held, 383–385

executing kernel on, 13–17

execution of Vector Add kernel, 380

full profile for desktop, 383

in platform model, 12

querying, 67–70, 78–83, 375–377, 542–543

selecting, 70–78

steps in OpenCL, 83–84

DFFT (discrete fast Fourier transform), 453

DFT. see discrete Fourier transform (DFT), Ocean simulation

Dijkstra’s algorithm, parallelizing

graph data structures, 412–414

kernels, 414–417

leveraging multiple compute devices, 417–423

overview of, 411–412

dimensions, image object, 282

Direct3D, interoperability with. see interoperability with Direct3D

directed acyclic graph (DAG), command-queues and, 310

directional edge detector filter, Sobel, 407–410

directories, sample code for this book, 41

DirectX Shading Language (HLSL), 111–113

discrete fast Fourier transform (DFFT), 453

discrete Fourier transform (DFT), Ocean simulation

avoiding local memory bank conflicts, 463

determining 2D composition, 457–458

determining local memory needed, 462

determining sub-transform size, 459–460

determining work-group size, 460

obtaining twiddle factors, 461–462

overview of, 457

using images, 463

using local memory, 459

distance(), geometric functions, 175–176

divide (/) arithmetic operator, 124–126

doublen, vector data load and store, 181

DRAM, modern multicore CPUs, 6–7

dynamic libraries, OpenCL program vs., 97

E

early exit, optical flow algorithm, 483

Eclipse, generating project in, 44–45

edgeArray:, Dijkstra’s algorithm, 412–414

“Efficient Sparse Matrix-Vector Multiplication on CUDA” (Bell and Garland), 517

embedded profile

64-bit integers, 385–386

built-in atomic functions, 387

determining device supporting, 390

full profile vs., 383

images, 386–387

mandated minimum single-precision floating-point capabilities, 387–389

OpenCL programs for, 35–36

overview of, 383–385

platform queries, 65

__EMBEDDED_PROFILE__ macro, 390

enumerated type

rank order of, 113

specifying attributes, 555

enumerating, list of platforms, 66–67

equal (==) operator, 127

equality operators, 124, 127

error codes

C++ Wrapper API exceptions, 371–374

clBarrier(), 313

clCreateUserEvent(), 321–322

clEnqueueMarker(), 314

clEnqueueWaitForEvents(), 314–315

clGetEventProfilingInfo(), 329–330

clGetProgramBuildInfo(), 220–221

clRetainEvent(), 318

clSetEventCallback(), 326

clWaitForEvents(), 323

table of, 57–61

ERROR_CODE value, command-queue, 311

.even suffix, vector data types, 107–108

event data types, 108, 147–148

event objects

OpenCL/OpenGL sharing APIs, 579

overview of, 317–320

reference guide, 549–550

event_t async_work_group_copy(), 192, 332–333

event_t async_work_group_strided_copy(), 192, 332–333

events

command-queues and, 311–317

defined, 310

event objects. see event objects

generating on host, 321–322

impacting execution on host, 322–327

inside kernels, 332–333

from outside OpenCL, 333

overview of, 309–310

profiling using, 327–332

in task-parallel programming model, 28

exceptions

C++ Wrapper API, 371–374

execution of Vector Add kernel, 379

exclusive (^^) operator, 128

exclusive or (^) operator, 127–128

execution model

command-queues, 18–21

contexts, 17–18

defined, 11

how kernel executes on OpenCL device, 13–17

overview of, 13

parallel algorithm limitations, 28–29

explicit casts, 116–117

explicit conversions, 117–121, 132

explicit kernel, SpMV, 519

explicit memory fence, 570–571

explicit model, data parallelism, 26–27

explicit synchronization, 349

exponent, half data type, 101

expression, assignment operator, 132

extensions, compiler directives for optional, 143–145

F

fast Fourier transform (FFT). see Ocean simulation, with FFT

fast_ variants, geometric functions, 175

FBO (frame buffer object), 347

file, creating 2D image from, 284–285

filter mode, sampler objects, 282, 292–295

float channels, 403–406

float data type, converting, 101

float images, 386

float type, math constants, 556

floating-point arithmetic system, 33–34

floating-point constants, 162–163

floating-point data types, 113, 119–121

floating-point options

building program object, 224–225

full vs. embedded profiles, 387–388

floating-point pragmas, 143, 162

floatn, vector data load and store functions, 181, 182–186

fma, geometric functions, 175

formats, image

embedded profile, 387

encapsulating information on, 282

mapping OpenGL texture to OpenCL image, 346

overview of, 287–291

querying list of supported, 574

reference guide for supported, 576

formats, of program binaries, 227

FP_CONTRACT pragma, 162

frame buffer object (FBO), 347

FreeImage library, 283, 284–285

FreeSurfer. see Dijkstra’s algorithm, parallelizing

FFT (fast Fourier transform). see Ocean simulation, with FFT

full profile

built-in atomic functions, 387

determining profile support by device, 390

embedded profile as strict subset of, 383–385

mandated minimum single-precision floating-point capabilities, 387–389

platform queries, 65

querying device support for images, 386–387

function qualifiers

overview of, 133–134

reference guide, 554

reserved as keywords, 141

functions. see built-in functions

G

Gaussian filter, 282–283, 295–299

Gauss-Seidel iteration, 432

GCC compiler, 111–113

general-purpose GPU (GPGPU), 10, 29

gentype

barrier functions, 191–195

built-in common functions, 173–175

integer functions, 168–171

miscellaneous vector functions, 199–200

vector data load and store functions, 181–189

work-items, 153–161

gentyped

built-in common functions, 173–175

built-in geometric functions, 175–176

built-in math functions, 155–156

defined, 153

gentypef

built-in geometric functions, 175–177

built-in math functions, 155–156, 160–161

defined, 153

gentypei, 153, 158

gentypen, 181–182, 199–200

geometric built-in functions, 175–177, 563–564

get_global_id(), data-parallel kernel, 98–99

getInfo(), C++ Wrapper API, 375–377

gl_object_type parameter, query OpenGL objects, 347–348

glBuildProgram(), 52–53

glCreateFromGLTexture2D(), 344–345

glCreateFromGLTexture3D(), 344–345

glCreateSyncFromCLeventARB(), 350–351

glDeleteSync() function, 350

GLEW toolkit, 336

glFinish()

creating OpenCL buffers from OpenGL buffers, 342

OpenCL/OpenGL synchronization with, 348

OpenCL/OpenGL synchronization without, 351

global (__global) address space qualifier, 136, 141

global index space, kernel execution model, 15–16

global memory

device architecture diagram, 577

matrix multiplication, 507–509

memory model, 21–23

globalWorkSize, executing kernel, 56–57

GLSL (OpenGL Shading Language), 111–113

GLUT toolkit, 336, 450–451

glWaitSync(), synchronization, 350–351

GMCH (graphics/memory controller), 6–7

gotos, irreducible control flow, 147

GPGPU (general-purpose GPU), 10, 29

GPU (graphics processing unit)

advantages of image objects. see image objects

defined, 69

executing cloth simulation on, 432–438

leveraging multiple compute devices, 417–423

matrix multiplication and performance results, 511–513

modern multicore CPUs as, 6–7

OpenCL implementation for NVIDIA, 40

optical flow performance, 484–485

optimizing for SIMD computation and local memory, 441–446

querying and selecting, 69–70

SpMV implementation, 518–519

tiled and packetized sparse matrix design, 523–524

tiled and packetized sparse matrix team, 524

two-layered batching, 438–441

graph data structures, parallelizing Dijkstra’s algorithm, 412–414

graphics. see also images

shading languages, 111–113

standards, 30–31

graphics processing unit. see GPU (graphics processing unit)

graphics/memory controller (GMCH), 6–7

grayscale images, applying Sobel OpenCL kernel to, 409–410

greater than (>) operator, 127

greater than or equal (>=) operator, 127

H

half data type, 101–102

half_ functions, 153

half-float channels, 403–406

half-float images, 386

halfn, 181, 182–186

hand held devices, embedded profile for. see embedded profile

hardware

mapping program onto, 9–11

parallel computation as concurrency enabled by, 8

SpMV kernel, 519

SpMV multiplication, 524–538

hardware abstraction layer, 11, 29

hardware linear interpolation, optical flow algorithm, 480

hardware scheduling, optical flow algorithm, 483

header structure, SpMV, 522–523

height map, Ocean application, 450

HelloWorld sample

checking for errors, 57–61

choosing device and creating command-queue, 50–52

choosing platform and creating context, 49–50

creating and building program object, 52–53

creating kernel and memory objects, 54–55

downloading sample code, 39

executing kernel, 55–57

Linux and Eclipse, 44–45

Mac OS X and Code::Blocks, 41–42

Microsoft Windows and Visual Studio, 42–44

overview of, 39, 45–48

prerequisites, 40–41

heterogeneous platforms, 4–7

.hi suffix, vector data types, 107–108

high-level loop, Dijkstra’s algorithm, 414–417

histogram. see image histograms

histogram_partial_image_rgba_unorm8 kernel, 400

histogram_partial_results_rgba_unorm8 kernel, 400–402

histogram_sum_partial_results_unorm8 kernel, 400

HLSL (DirectX Shading Language), 111–113

host

calls to enqueue histogram kernels, 398–400

creating, writing and reading buffers and sub-buffers, 262–268

device architecture diagram, 577

events impacting execution on, 322–327

execution model, 13, 17–18

generating events on, 321–322

kernel execution model, 13

matrix multiplication, 502–505

platform model, 12

host memory

memory model, 21–23

reading image back to, 300–301

reading image from device to, 299–300

reading region of buffer into, 269–272

writing region into buffer from, 272–273

hybrid programming models, 29

I

ICC compiler, 111–113

ICD (installable client driver) model, 49, 375

IDs, kernel execution model, 14–15

IEEE standards, floating-point arithmetic, 33–34

image channel data type, image formats, 289–291

image channel order, image formats, 287–291

image data types, 108–109, 147

image difference, optical flow algorithm, 472

image functions

border color, 209–210

querying image information, 214–215

read and write, 201–206

samplers, 206–209

writing to images, 210–213

image histograms

additional optimizations to parallel, 400–402

computing, 393–395, 403–406

overview of, 393

parallelizing, 395–400

image objects

copy between buffer objects and, 574

creating, 283–286, 573–574

creating in OpenCL from OpenGL textures, 344–347

Gaussian filter example, 282–283

loading to in PyOpenCL, 493–494

mapping and unmapping, 305–308, 574

memory model, 21

OpenCL and, 30

OpenCL C functions for working with, 295–299

OpenCL/OpenGL sharing APIs, 578

overview of, 281–282

querying, 575

querying list of supported formats, 574

querying support for device images, 291

read, write, and copy, 575

specifying image formats, 287–291

transferring data, 299–308

image pyramids, optical flow algorithm, 472–479

image3d_t type, embedded profile, 386

ImageFilter2D example, 282–291, 488–492

images

access qualifiers for read-only or write-only, 140–141

describing motion between. see optical flow

DFT, 463

embedded profile device support for, 386–387

formats. see formats, image

as memory objects, 247

read and write built-in functions, 572–573

Sobel edge detection filter for, 407–410

supported by OpenCL C, 99

Image.tostring() method, PyOpenCL, 493–494

implicit kernel, SpMV, 518–519

implicit model, data parallelism, 26

implicit synchronization, OpenCL/OpenGL, 348–349

implicit type conversions, 110–115

index space, kernel execution model, 13–14

INF (infinity), floating-point arithmetic, 34

inheritance, C++ API, 369

initialization

Ocean application overview, 450–451

OpenCL/OpenGL interoperability, 338–340

parallelizing Dijkstra’s algorithm, 415

in-order command-queue, 19–20, 24

input vector, SpMV, 518

installable client driver (ICD) model, 49, 375

integer built-in functions, 168–172, 557–558

integer data types

arithmetic operators, 124–126

explicit conversions, 119–121

rank order of, 113

relational and equality operators, 127

intellectual property, program binaries protecting, 227

interoperability with Direct3D

acquiring/releasing Direct3D objects in OpenCL, 361–363

creating memory objects from Direct3D buffers/textures, 357–361

initializing context for, 354–357

overview of, 353

processing D3D vertex data in OpenCL, 366–368

processing Direct3D texture in OpenCL, 363–366

reference guide, 579–580

sharing overview, 353–354

interoperability with OpenGL

cloth simulation, 446–448

creating OpenCL buffers from OpenGL buffers, 339–343

creating OpenCL image objects from OpenGL textures, 344–347

initializing OpenCL context for, 338–339

optical flow algorithm, 483–484

overview of, 335

querying for OpenGL sharing extension, 336–337

querying information about OpenGL objects, 347–348

reference guide, 577–579

sharing overview, 335–336

synchronization, 348–351

irreducible control flow, restrictions, 147

iterations

executing cloth simulation on CPU, 431–432

executing cloth simulation on GPU, 434–435

pyramidal Lucas-Kanade optical flow, 472

simulating soft body, 429–431

K

kernel attribute qualifiers, 134–135

kernel execution commands, 19–20

kernel objects

arguments and object queries, 548

creating, 547–548

creating, and setting kernel arguments, 237–241

executing, 548

managing and querying, 242–245

out-of-order execution of memory object command and, 549

overview of, 237

program objects vs., 217–218

thread safety, 241–242

__kernel qualifier, 133–135, 141, 217

kernels

applying Phillips spectrum, 453–457

constant memory during execution of, 21

creating, writing and reading buffers/sub-buffers, 262

creating context in execution model, 17–18

creating memory objects, 54–55, 377–378

in data-parallel programming model, 25–27

data-parallel version of, 97–99

defined, 13

in device architecture diagram, 577

events inside, 332–333

executing and reading result, 55–57

executing Ocean simulation application, 463–468

executing OpenCL device, 13–17

executing Sobel OpenCL, 407–410

executing Vector Add kernel, 381

in execution model, 13

leveraging multiple compute devices, 417–423

in matrix multiplication program, 501–509

parallel algorithm limitations, 28–29

parallelizing Dijkstra’s algorithm, 414–417

programming language and, 32–34

in PyOpenCL, 495–497

restrictions in OpenCL C, 146–148

in task-parallel programming model, 27–28

in tiled and packetized sparse matrix, 518–519, 523

keywords, OpenCL C, 141

Khronos, 29–30

L

learning OpenCL, 36–37

left shift (<<) operator, 129–130

length(), geometric functions, 175–177

less than (<) operator, 127

less than or equal (<=) operator, 127

library functions, restrictions in OpenCL C, 147

links

cloth simulation using two-layered batching, 438–441

executing cloth simulation on CPU, 431–432

executing cloth simulation on GPU, 433–438

introduction to cloth simulation, 426–428

simulating soft body, 429–431

Linux

generating project in, 44–45

initializing contexts for OpenGL interoperability, 338–339

OpenCL implementation in, 41

.lo suffix, vector data types, 107–108

load balancing

automatic, 20

in parallel computing, 9

loading, program binaries, 227

load/store functions, vector data, 567–568

local (__local) address space qualifier, 138–139, 141

local index space, kernel execution model, 15

local memory

device architecture diagram, 577

discrete Fourier transform, 459, 462–463

FFT kernel, 464

memory model, 21–24

optical flow algorithm, 481–482

optimizing in matrix multiplication, 509–511

SpMV implementation, 518–519

localWorkSize, executing kernel, 56–57

logical operators

overview of, 128

symbols, 124

unary not (!), 131

Lucas-Kanade. see pyramidal Lucas-Kanade optical flow algorithm

luminosity histogram, 393

lvalue, assignment operator, 132

M

Mac OS X

OpenCL implementation in, 40

using Code::Blocks, 41–42

macros

determining profile support by device, 390

integer functions, 172

OpenCL C, 145–146

preprocessor directives and, 555

preprocessor error, 372–374

mad, geometric functions, 175

magnitudes, wave, 454

main() function, HelloWorld OpenCL kernel and, 44–48

mandated minimum single-precision floating-point capabilities, 387–389

mantissa, half data type, 101

mapping

buffers and sub-buffers, 276–279

C++ classes to OpenCL C type, 369–370

image data, 305–308

image to host or memory pointer, 299

OpenGL texture to OpenCL image formats, 346

markers, synchronization point, 314

maskArray:, Dijkstra’s algorithm, 412–414, 415

masking off operation, 121–123

mass/spring model, for soft bodies, 425–427

math built-in functions

accuracy for embedded vs. full profile, 388

floating-point constant, 162–163

floating-point pragma, 162

overview of, 153–161

reference guide, 560–563

relative error as ulps in, 163–168

math constants, reference guide, 556

math intrinsics, program build options, 547

math_ functions, 153

Matrix Market (MM) exchange format, 517–518

matrix multiplication

basic algorithm, 499–501

direct translation into OpenCL, 501–505

increasing amount of work per kernel, 506–509

overview of, 499

performance results and optimizing original CPU code, 511–513

sparse matrix-vector. see sparse matrix-vector multiplication (SpMV)

using local memory, 509–511

memory access flags, 282–284

memory commands, 19

memory consistency, 23–24, 191

memory latency, SpMV, 518–519

memory model, 12, 21–24

memory objects

buffers and sub-buffers as, 247–248

creating context in execution model, 17–18

creating kernel and, 54–55, 377–378

matrix multiplication and, 502

in memory model, 21–24

out-of-order execution of kernels and, 549

querying to determine type of, 258–259

runtime API managing, 32

mesh

executing cloth simulation on CPU, 431–432

executing cloth simulation on GPU, 433

introduction to cloth simulation, 425–428

simulating soft body, 429–431

two-layered batching, 438–441

MFLOPS, 512–513

Microsoft Windows

generating project in Visual Studio, 42–44

OpenCL implementation in, 40

OpenGL interoperability, 338–339

mismatch vector, optical flow algorithm, 472

MM (Matrix Market) exchange format, 517–518

multicore chips, power-efficiency of, 4–5

multiplication

matrix. see matrix multiplication

sparse matrix-vector. see sparse matrix-vector multiplication (SpMV)

multiply (*) arithmetic operator, 124–126

N

n suffix, 181

names, reserved as keywords, 141

NaN (Not a Number), floating-point arithmetic, 34

native kernels, 13

NDRange

data-parallel programming model, 25

kernel execution model, 14–16

matrix multiplication, 502, 506–509

task-parallel programming model, 27

normalize(), geometric functions, 175–176

not (~) operator, 127–128

not equal (!=) operator, 127

NULL value, 64–65, 68

num_entries, 64, 68

numeric indices, built-in vector data types, 107

numpy, PyOpenCL, 488, 496–497

NVIDIA GPU Computing SDK

generating project in Linux, 41

generating project in Linux and Eclipse, 44–45

generating project in Visual Studio, 42

generating project in Windows, 40

OpenCL/OpenGL interoperability, 336

O

objects, OpenCL/OpenGL sharing API, 579

Ocean simulation, with FFT

FFT kernel, 463–467

generating Phillips spectrum, 453–457

OpenCL DFT. see discrete Fourier transform (DFT), Ocean simulation

overview of, 449–453

transpose kernel, 467–468

.odd suffix, vector data types, 107–108

OpenCL, introduction

conceptual foundation of, 11–12

data-parallel programming model, 25–27

embedded profile, 35–36

execution model, 13–21

graphics, 30–31

heterogeneous platforms of, 4–7

kernel programming language, 32–34

learning, 36–37

memory model, 21–24

other programming models, 29

parallel algorithm limitations, 28–29

platform API, 31

platform model, 12

runtime API, 31–32

software, 7–10

summary review, 34–35

task-parallel programming model, 27–28

understanding, 3–4

OpenCL C

access qualifiers, 140–141

address space qualifiers, 135–140

built-in functions. see built-in functions

derived types, 109–110

explicit casts, 116–117

explicit conversions, 117–121

function qualifiers, 133–134

functions for working with images, 295–299

implicit type conversions, 110

kernel attribute qualifiers, 134–135

as kernel programming language, 32–34

keywords, 141

macros, 145–146

other data types supported by, 108–109

overview of, 97

preprocessor directives, 141–144

reinterpreting data as another type, 121–123

restrictions, 146–148

scalar data types, 99–102

type qualifiers, 141

vector data types, 102–108

vector operators. see vector operators

writing data-parallel kernel using, 97–99

OPENCL EXTENSION directive, 143–145

OpenGL

interoperability between OpenCL and. see interoperability with Direct3D; interoperability with OpenGL

Ocean application, 450–453

OpenCL and graphics standards, 30

reference guide for sharing APIs, 577–579

synchronization between OpenCL and, 333

OpenGL Shading Language (GLSL), 111–113

operands, vector literals, 105

operators, vector. see vector operators

optical flow

application of texture cache, 480–481

early exit and hardware scheduling, 483

efficient visualization with OpenGL interop, 483–484

performance, 484–485

problem of, 469–479

sub-pixel accuracy with hardware linear interpolation, 480

understanding, 469

using local memory, 481–482

optimization options

clBuildProgram(), 225–226

partial image histogram, 400–402

program build options, 546

“Optimizing Power Using Transformations” (Chandrakasan et al.), 4–5

“Optimizing Sparse Matrix-Vector Multiplication on GPUs” (Baskaran and Bordawekar), 517

optional extensions, compiler directives for, 143–145

or (|) operator, 127–128

or (||) operator, 128

out-of-order command-queue

automatic load balancing, 20

data-parallel programming model, 24

execution model, 20

reference guide, 549

task-parallel programming model, 28

output, creating 2D image for, 285–286

output vector, SpMV, 518

overloaded function, vector literal as, 104–105

P

packets

optimizing sparse matrix-vector multiplication, 538–539

tiled and packetized sparse matrix, 519–522

tiled and packetized sparse matrix design, 523–524

tiled and packetized sparse matrix team, 524

pad to 128-boundary, tiled and packetized sparse matrix, 523–524

parallel algorithm limitations, 28–29

parallel computation

as concurrency enabled by software, 8

of image histogram, 395–400

image histogram optimizations, 400–402

parallel programming, using models for, 8

parallelism, 8

param_name values, querying platforms, 64–65

partial histograms

computing, 395–397

optimizing by reducing number of, 400–402

summing to generate final histogram, 397–398

partitioning workload, for multiple compute devices, 417–423

Patterns for Parallel Programming (Mattson), 20

performance

heterogeneous future of, 4–7

leveraging multiple compute devices, 417–423

matrix multiplication results, 511–513

optical flow algorithm and, 484–485

soft body simulation and, 430–431

sparse matrix-vector multiplication and, 518, 524–538

using events for profiling, 327–332

using matrix multiplication for high. see matrix multiplication

PEs (processing elements), platform model, 12

phillips function, 455–457

Phillips spectrum generation, 453–457

platform API, 30–31

platform model, 11–12

platforms

choosing, 49–50

choosing and creating context, 375

convolution signal example, 89–97

embedded profile, 383–385

enumerating and querying, 63–67

querying and displaying specific information, 78–83

querying list of devices associated with, 68

reference guide, 541–543

steps in OpenCL, 83–84

pointer data types, implicit conversions, 111

post-increment (++) unary operator, 131

power

efficiency of specialized core, 5–6

of multicore chips, 4–5

#pragma directives, OpenCL C, 143–145

predefined identifiers, not supported, 147

prefetch functions, 191–195, 570

pre-increment (--) unary operator, 131

preprocessor build options, 223–224

preprocessor directives

OpenCL C, 141–142

program object build options, 546–547

reference guide, 555

preprocessor error macros, C++ Wrapper API, 372–374

private (__private) address space qualifier, 139, 141

private memory, 21–23, 577

processing elements (PEs), platform model, 12

profiles

associated with platforms, 63–67

commands for events, 327–332

embedded. see embedded profile

reference guide, 549

program objects

build options, 222–227

creating and building, 52–53, 377

creating and building from binaries, 227–236

creating and building from source code, 218–222

creating and building in PyOpenCL, 494–495

creating context in execution model, 17–18

kernel objects vs., 217–218

managing and querying, 236–237

reference guide, 546–547

runtime API creating, 32

programming language. see also OpenCL C; PyOpenCL, 32–34

programming models

data-parallel, 25–27

defined, 12

other, 29

parallel algorithm limitations, 28–29

task-parallel, 27–28

properties

device, 70

querying context, 85–87

PyImageFilter2D, PyOpenCL, 488–492

PyOpenCL

context and command-queue creation, 492–493

creating and building program, 494–495

introduction to, 487–488

loading to image object, 493–494

overview of, 487

PyImageFilter2D code, 488–492

reading results, 496

running PyImageFilter2D example, 488

setting kernel arguments/executing kernel, 495–496

pyopencl v0.92+, 488

pyopencl.create_some_context(), 492

pyramidal Lucas-Kanade optical flow algorithm, 469, 471–473

Python, using OpenCL in. see PyOpenCL

Python Image Library (PIL), 488, 493–494

Q

qualifiers

access, 140–141

address space, 135–140

function, 133–134

kernel attribute, 134–135

type, 141

queries

buffer and sub-buffer, 257–259, 545

device, 542–543

device image support, 291

event object, 319–320

image object, 214–215, 286, 575

kernel, 242–245, 548

OpenCL/OpenGL sharing APIs, 578

OpenGL objects, 347–348

platform, 63–66, 542–543

program object, 241–242, 547

storing program binary and, 230–232

supported image formats, 574

R

R, G, B color histogram

computing, 393–395, 403–406

optimizing, 400–402

overview of, 393

parallelizing, 395–400

rank order, usual arithmetic conversions, 113–115

read

buffers and sub-buffers, 259–268, 544

image back to host memory, 300–301

image built-in functions, 201–206, 298, 572–573

image from device to host memory, 299–300

image objects, 575

memory objects, 248

results in PyOpenCL, 496–497

read_imagef(), 298–299

read-only qualifier, 140–141

read-write qualifier, 141

recursion, not supported in OpenCL C, 147

reference counts

buffers and sub-buffers, 256

contexts, 89

event objects, 318

regions, memory model, 21–23

relational built-in functions, 175, 178–181, 564–567

relational operators, 124, 127

relaxed consistency model, memory objects, 24

remainder (%) arithmetic operator, 124–126

render buffers, 346–347, 578

rendering of height map, Ocean application, 450

reserved data types, 550–552

restrict type qualifier, 141

restrictions, OpenCL C, 146–148

return type, kernel function restrictions, 146

RGB images, applying Sobel OpenCL kernel to, 409

RGBA-formatted image, loading in PyOpenCL, 493–494

right shift (>>) operator, 129–130

rounding mode modifier

explicit conversions, 119–121

vector data load and store functions, 182–189

_rte suffix, 183, 187

runCLSimulation(), 451–457

runtime API, 30–32, 543

S

sampler data types

determining border color, 209–210

functions, 206–209

restrictions in OpenCL C, 108–109, 147

sampler objects. see also image objects

creating, 292–294

declaration fields, 577

functions of, 282

overview of, 281–282

reference guide, 576–577

releasing and querying, 294–295

_sat (saturation) modifier, explicit conversions, 119–120

SaveProgramBinary(), creating programs, 230–231

scalar data types

creating vectors with vector literals, 104–105

explicit casts of, 116–117

explicit conversions of, 117–121

half data type, 101–102

implicit conversions of, 110–111

integer functions, 172

reference guide, 550

supported by OpenCL C, 99–101

usual arithmetic conversions with, 113–115

vector operators with. see vector operators

scalar_add(), writing data-parallel kernel, 97–98

754 formats, IEEE floating-point arithmetic, 34

sgentype

integer functions, 172

relational functions, 181

shape matching, soft bodies, 425

sharing APIs, OpenCL/OpenGL, 577–579

shift operators, 124, 129–130

shuffle, illegal usage of, 214

shuffle2, illegal usage of, 214

sign, half data type, 101

SIMD (Single Instruction Multiple Data) model, 26–27, 465

simulation

cloth. see cloth simulation in Bullet Physics SDK

ocean. see Ocean simulation, with FFT

Single Instruction Multiple Data (SIMD) model, 26–27, 465

Single Program Multiple Data (SPMD) model, 26

single-source shortest-path graph algorithm. see Dijkstra’s algorithm, parallelizing

64-bit integers, embedded profile, 385–386

sizeof operator, 131–132

slab, tiled and packetized sparse matrix, 519

Sobel edge detection filter, 407–410

soft bodies

executing cloth simulation on CPU, 431–432

executing cloth simulation on GPU, 432–438

interoperability with OpenGL, 446–448

introduction to cloth simulation, 425–428

simulating, 429–431

software, parallel, 7–10

solveConstraints, cloth simulation on GPU, 435

solveLinksForPosition, cloth simulation on GPU, 435

source code

creating and building programs from, 218–222

program binary as compiled version of, 227

sparse matrix-vector multiplication (SpMV)

algorithm, 515–517

defined, 515

description of, 518–519

header structure, 522–523

optional team information, 524

other areas of optimization, 538–539

overview of, 515

tested hardware devices and results, 524–538

tiled and packetized design, 523–524

tiled and packetized representation of, 519–522

specify type attributes, 555

SPMD (Single Program Multiple Data) model, 26

SpMV. see sparse matrix-vector multiplication (SpMV)

storage

image layout, 308

sparse matrix formats, 517

strips, tiled and packetized sparse matrix, 519

struct type

restrictions on use of, 109–110, 146

specifying attributes, 555

sub-buffers. see buffers and sub-buffers

sub-pixel accuracy, optical flow algorithm, 480

subregions, of memory objects, 21

subtract (-) arithmetic operator, 124–126

sub-transform size, DFT, 459–460

suffixes, vector data types, 107–108

synchronization

commands, 19–21

computing Dijkstra’s algorithm with kernel, 415–417

explicit memory fence, 570–571

functions, 190–191

OpenCL/OpenGL, 342, 348–351

primitives, 248

synchronization points

defining when enqueuing commands, 312–315

in out-of-order command-queue, 24

T

T1 to T3 data types, rank order of, 114

task-parallel programming model

overview of, 9–10

parallel algorithm limitations, 28–29

understanding, 27–28

team information, tiled and packetized sparse matrix, 524

ternary selection (?:) operator, 129

Tessendorf, Jerry, 449, 454

tetrahedra, soft bodies, 425–428

texture cache, optical flow algorithm, 480–482

texture objects, OpenGL. see also image objects

creating image objects in OpenCL from, 344–347

Ocean application creating, 451

OpenCL/OpenGL sharing APIs, 578

querying information about, 347–348

thread safety, kernel objects, 241–242

tiled and packetized sparse matrix

defined, 515

design considerations, 523–524

header structure of, 522–523

overview of, 519–522

SpMV implementation, 517–518

timing data, profiling events, 328

traits, C++ template, 376

transpose kernel, simulating ocean, 467–468

twiddle factors, DFT

FFT kernel, 464–466

obtaining, 461–462

using local memory, 463

2D composition, in DFT, 457–458

two-layered batching, cloth simulation, 438–441

type casting, vector component, 554

type qualifiers, 141

U

ugentype, 168–169, 181

ugentypen, 214–215

ulp values, 163–168

unary operators, 124, 131–132

union type, specifying attributes, 555

updatingCostArray:, Dijkstra’s algorithm, 413–417

usual arithmetic conversions, 113–115

V

vadd() kernel, Vector Add kernel, 378

variable-length arrays, not supported in OpenCL C, 147

variadic macros and functions, not supported in OpenCL C, 147

VBO (vertex buffer object), 340–344, 446–448

vbo_cl_mem, creating VBO in OpenGL, 340–341

Vector Add example. see C++ Wrapper API, Vector Add example

vector data types

application, 103–104

built-in, 102–103

components, 106–108, 552–554

data load and store functions, 181–189

explicit casts, 116–117

explicit conversions, 117–121

implicit conversions between, 110–113

literals, 104–105

load/store functions reference, 567–568

miscellaneous built-in functions, 199–200, 571

operators. see vector operators

optical flow algorithm, 470–472

reference guide, 550

supported by OpenCL C, 99

usual arithmetic conversions with, 113–115

vector literals, 104–105

vector operators

arithmetic operators, 124–126

assignment operator, 132

bitwise operators, 127–128

conditional operator, 129

logical operators, 128

overview of, 123–124

reference guide, 554

relational and equality operators, 127

shift operators, 129–130

unary operators, 131–132

vertex buffer object (VBO), 340–344, 446–448

vertexArray:, Dijkstra’s algorithm, 412–414

vertical filtering, optical flow, 474

vertices

introduction to cloth simulation, 425–428

simulating soft body, 429–431

Visual Studio, generating project in, 42–44

vload_half(), 101, 182, 567

vload_halfn(), 182, 567

vloada_half(), 185–186, 568

vloadn(), 181, 567

void return type, kernel functions, 146

void wait_group_events(), 193, 332–333

volatile type qualifier, 141

voltage, multicore chip, 4–5

vstore_half()

half data type, 101

reference guide, 568

vector store functions, 183, 187

vstore_halfn(), 184, 186–188, 568

vstorea_halfn(), 186, 188–189, 568

vstoren(), 182, 567

VSTRIDE, FFT kernel, 464

W

wave amplitudes, 454

weightArray:, Dijkstra’s algorithm, 412–414

Windows. see Microsoft Windows

work-group barrier, 25–27

work-groups

data-parallel programming model, 25–27

global memory for, 21

kernel execution model, 14–16

local memory for, 21, 23

SpMV implementation, 518

tiled and packetized sparse matrix team, 524

work-items

barrier functions, 190–191

built-in functions, 557

data-parallel programming model, 25–27

functions, 150–152

global memory for, 21

kernel execution model, 13–15

local memory for, 23

mapping get_global_id to, 98–99

matrix multiplication, 501–509

private memory for, 21

task-parallel programming model, 27

write

buffers and sub-buffers, 259–268, 544–545

image built-in functions, 210–213, 298–299, 572–573

image from host to device memory, 301–302

image objects, 575

memory objects, 248

write_imagef(), 298–299

write-only qualifier, 140–141

Z

0 value, 64–65, 68
