-- (pre-decrement) unary operator, 131
- (subtract) operator, 124–126
?: (ternary selection) operator, 129
- or -- (unary) operators, 131
| or || (or) operators, 127–128
+ (addition) operator, 124–126
+ or ++ (post-increment) unary operator, 131
!= (not equal) operator, 127
== (equal) operator, 127
% (remainder) operator, 124–126
& or && (and) operators, 127–128
* (multiply) operator, 124–126
^ (exclusive or) operator, 127–128
^^ (exclusive) operator, 128
~ (not) operator, 127–128
> (greater than) operator, 127
>= (greater than or equal) operator, 127
>> (right shift) operator, 129–130
2D composition, in DFT, 457–458
64-bit integers, embedded profile, 385–386
754 formats, IEEE floating-point arithmetic, 34
accelerator devices
defined, 69
tiled and packetized sparse matrix design, 523, 534
access qualifiers
as keywords in OpenCL C, 141
overview of, 140–141
reference guide, 576
add (+) arithmetic operator, 124–126
address space qualifiers
casting between address spaces, 139–140
constant, 137–138
global, 136
as keywords in OpenCL C, 141
local, 138–139
overview of, 135–136
private, 139
reference guide, 554
supported, 99
addressing mode, sampler objects, 282, 292–295
ALL_BUILD project, Visual Studio, 43
AltiVec Technology Programming Interface Manual, 111–113
AMD
generating project in Linux, 40–41
generating project in Windows, 40–41
storing binaries in own format, 233
and (& or &&) operators, 127–128
Apple
initializing contexts for OpenGL interoperability, 338
querying number of platforms, 64
storing binaries in own format, 233
application data types, 103–104
ARB_cl_event extension, OpenGL, 349–350
architecture diagram, OpenCL device, 577
arguments
context, 85
device, 68
enqueuing commands, 313
gaussian_kernel(), 296–297
kernel function restrictions, 146
reference guide for kernel, 548
setting kernel, 55–57, 237–240
arithmetic operators
overview of, 124–126
post- and pre-increment (++ and --) unary, 131
symbols, 123
unary (+ and -), 131
arrays
parallelizing Dijkstra’s algorithm, 412–414
representing sparse matrix with binary data, 516
as_type(), 121–123
as_typen(), 121–123
ASCII File, representing sparse matrix, 516–517
assignment (=) operator, 124, 132
async copy and prefetch functions, 191–195, 570
ATI Stream SDK
generating project in Linux and Eclipse, 44–45
generating project in Visual Studio, 42–44
generating project in Windows, 40
querying and selecting platform, 65–66
querying context for devices, 89
querying devices, 70
atomic built-in functions
embedded profile options, 387
overview of, 195–198
reference guide, 568–569
__attribute__ keyword, kernel qualifier, 133–134
attributes, specifying type, 555
automatic load balancing, 20
barrier synchronization function, 190–191
batches
executing cloth simulation on GPU, 433–441
SpMV implementation, 518
behavior description, optional extension, 144
bilinear sampling object, optical flow, 476
binaries, program
creating, 235–236
HelloBinaryWorld example, 229–230
HelloWorld.cl (NVIDIA) example, 233–236
overview of, 227–229
querying and storing, 230–232
binary data arrays, sparse matrix, 516
bit field numbers, 147
bitwise operators, 124, 127–128
blocking enqueue calls, and callbacks, 327
blocking_read, executing kernel, 56
bool, rank order of, 113
border color, built-in functions, 209–210
bracket() operator, C++ Wrapper API, 370–371
buffers and sub-buffers
computing Dijkstra’s algorithm, 415
copying, 274–276
copying from image to, 299, 303–304
creating, 249–256
creating from OpenGL, 339–343
creating kernel and memory objects, 377–378
direct translation of matrix multiplication into OpenCL, 502
executing Vector Add kernel, 377–378, 381
mapping, 276–279
in memory model, 21
Ocean application, 451
OpenCL/OpenGL sharing APIs, 446–448, 578
overview of, 247–248
querying, 257–259
reading and writing, 259–274
reference guide, 544–545
building program objects
reference guide, 546–547
using clBuildProgram(). see clBuildProgram()
built-in data types
other, 108–109
reference guide, 550–552
scalar, 99–101
vector, 102–103
built-in functions
async copy and prefetch, 191–195
border color, 209–210
floating-point constant, 162–163
floating-point pragma, 162
image read and write, 201–206, 572–573
miscellaneous vector, 199–200, 571
overview of, 149
querying image information, 214–215
relational, 175, 178–181, 564–567
relative error as ulps, 163–168
samplers, 206–209
synchronization, 190–191
vector data load and store, 181–189
writing to image, 210–213
Bullet Physics SDK. see cloth simulation in Bullet Physics SDK
bytes, and vector data types, 102
C++ Wrapper API
defined, 369
exceptions, 371–374
Ocean application overview, 451
overview of, 369–371
C++ Wrapper API, Vector Add example
choosing device and creating command-queue, 375–377
choosing platform and creating context, 375
creating and building program object, 377
creating kernel and memory objects, 377–378
executing Vector Add kernel, 378–382
structure of OpenCL setup, 374–375
C99 language
OpenCL C derived from, 32–33, 97
OpenCL C features added to, 99
callbacks
creating OpenCL contexts, 85
event objects. see clSetEventCallback()
events impacting execution on host, 324–327
placing profiling functions inside, 331–332
steps in Ocean application, 451
capacitance, of multicore chips, 4–5
case studies
cloth simulation. see cloth simulation in Bullet Physics SDK
Dijkstra’s algorithm. see Dijkstra’s algorithm, parallelizing
image histogram. see image histograms
matrix multiplication. see matrix multiplication
optical flow. see optical flow
PyOpenCL. see PyOpenCL
simulating ocean. see Ocean simulation, with FFT
Sobel edge detection filter, 407–410
casts
explicit, 116
implicit conversions between vectors and, 111–113
clEnqueueNDRangeKernel(), 251, 255
clCreateSampler(), 292–295
CL_COMPLETE value, command-queue, 311
CL_CONTEXT_DEVICES, C++ Wrapper API, 376
cl_context_properties fields, initializing contexts, 338–339
CL_DEVICE_IMAGE_SUPPORT property, clGetDeviceInfo(), 386–387
CL_DEVICE_IMAGE3D_MAX_WIDTH property, clGetDeviceInfo(), 386–387
CL_DEVICE_MAX_COMPUTE_UNITS, 506–509
CL_DEVICE_TYPE_GPU, 502
__CL_ENABLE_EXCEPTIONS preprocessor macro, 372
cl_int clFinish(), 248
cl_int clWaitForEvents(), 248
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE query, 243–244
CL_KERNEL_WORK_GROUP_SIZE query, 243–244
cl_khr_gl_event extension, 342, 348
cl_khr_gl_sharing extension, 336–337, 342
cl_map_flags, clEnqueueMapBuffer(), 276–277
cl_mem object, creating images, 284
CL_MEM_COPY_HOST_PTR, 377–378
cl_mem_flags, clCreateBuffer(), 249–250
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR memory type, 55
CL_MEM_READ_WRITE, 308
CL_MEM_USE_HOST_PTR, 377–378
cl_int error values, C++ Wrapper API, 371
cl_platform, 370–371
CL_PROFILING_COMMAND_END, 502
CL_PROFILING_COMMAND_START, 502
CL_QUEUE_PROFILING_ENABLE flag, 328
CL_QUEUE_PROFILING_ENABLE property, 502
CL_QUEUED value, command-queue, 311
CL_RUNNING value, command-queue, 311
CL_SUBMITTED value, command-queue, 311
CL_SUCCESS return value, clBuildProgram(), 220
__CL_USER_OVERRIDE_ERROR_STRINGS preprocessor macro, 372
classes, C++ Wrapper API hierarchy, 369–370
clBarrier(), 313–316
clBuffer(), 54
clBuildProgram()
build options, 546–547
building program object, 219–220, 222
creating program from binary, 234–236
floating-point options, 224
miscellaneous options, 226–227
optimization options, 225–226
preprocessor build options, 223–224
querying program objects, 237
reference guide, 546
cl::CommandQueue::enqueueMapBuffer(), 379, 381
cl::CommandQueue::enqueueUnmapObj(), 379, 382
cl::Context(), 375
cl::Context::getInfo(), 376
clCreateBuffer()
creating buffers and sub-buffers, 249–251
creating memory objects, 54–55
direct translation of matrix multiplication into OpenCL, 502
reference guide, 544
setting kernel arguments, 239
clCreateCommandQueue(), 51–52, 543
clCreateContextFromType()
creating contexts, 84–85
querying context for associated devices, 88
reference guide, 541
clCreateEventFromGLsyncKHR()
explicit synchronization, 349
reference guide, 579
synchronization between OpenCL/OpenGL, 350–351
clCreateFromD3D10BufferKHR(), 580
clCreateFromD3D10Texture2DKHR(), 580
clCreateFromD3D10Texture3DKHR(), 580
clCreateFromGLBuffer(), 339–343, 578
clCreateFromGLRenderbuffer()
creating memory objects from OpenGL, 341
reference guide, 578
sharing with OpenCL, 346–347
clCreateFromGLTexture2D(), 341, 578
clCreateFromGLTexture3D(), 341, 578
clCreateImage2D()
creating 2D image from file, 284–285
creating image objects, 283–284
reference guide, 573–574
clCreateImage3D(), 283–284, 574
clCreateKernel()
creating kernel objects, 237–238
reference guide, 547
setting kernel arguments, 239–240
clCreateKernelsInProgram(), 240–241, 547
clCreateProgram(), 221
clCreateProgramWithBinary()
creating programs from binaries, 228–229
HelloBinaryWorld example, 229–230
reference guide, 546
clCreateProgramWithSource()
creating and building program object, 52–53
creating program object from source, 218–219, 222
reference guide, 546
clCreateSampler(), 292–294, 576
clCreateSubBuffer(), 253–256, 544
clCreateUserEvent()
generating events on host, 321–322
how to use, 323–324
reference guide, 549
clEnqueueAcquireD3D10ObjectsKHR(), 580
clEnqueueAcquireGLObjects()
creating OpenCL buffers from OpenGL buffers, 341–342
explicit synchronization, 349
implicit synchronization, 348–349
reference guide, 579
clEnqueueBarrier()
function of, 316–317
ordering constraints between commands, 313
reference guide, 549
clEnqueueCopyBuffer(), 275–276, 545
clEnqueueCopyBufferToImage()
copying from buffer to image, 303–305
defined, 299
reference guide, 574
clEnqueueCopyImage()
copy image objects, 302–303
defined, 299
reference guide, 575
clEnqueueCopyImageToBuffer()
copying from image to buffer, 303–304
defined, 299
reference guide, 574
clEnqueueMapBuffer()
mapping buffers and sub-buffers, 276–278
moving data to and from buffer, 278–279
reference guide, 545
clEnqueueMapImage()
defined, 299
mapping image objects into host memory, 305–308
reference guide, 574
clEnqueueMarker(), 314–317, 549
clEnqueueMarker()
defining synchronization points, 314
function of, 315–317
clEnqueueNativeKernel(), 548
clEnqueueNDRangeKernel()
events and command-queues, 312
executing kernel, 56–57
reference guide, 548
work-items, 150
clEnqueueReadBuffer()
reading buffers, 260–261, 268–269
reading results back from kernel, 48, 56–57
reference guide, 544
clEnqueueReadBufferRect(), 269–272, 544
clEnqueueReadImage()
defined, 299–301
mapping image results to host memory pointer, 307–308
reference guide, 575
clEnqueueReleaseD3D10ObjectsKHR(), 580
clEnqueueReleaseGLObjects()
implicit synchronization, 348–349
reference guide, 579
releasing objects acquired by OpenCL, 341–342
synchronization between OpenCL/OpenGL, 351
clEnqueueUnmapMapImage(), 305–306
clEnqueueUnmapMemObject()
buffer mapping no longer required, 277–278
moving data to and from buffer, 278–279
reference guide, 545
releasing image data, 308
clEnqueueWaitForEvents(), 314–317, 549
clEnqueueWriteBuffer()
reference guide, 544
clEnqueueWriteBufferRect(), 272–273, 544–545
clEnqueueWriteImage()
defined, 299
reference guide, 575
writing images from host to device memory, 301–302
cles_khr_int64 extension string, embedded profile, 385–386
clFinish()
creating OpenCL buffers from OpenGL buffers, 342–343
OpenCL/OpenGL synchronization with, 348
OpenCL/OpenGL synchronization without, 351
preprocessor error macro for, 327
reference guide, 549
clFlush()
preprocessor error macro for, 327
reference guide, 549
using callbacks with events, 327
cl.get_platforms(), PyOpenCL, 493
clGetCommandQueueInfo(), 543
clGetContextInfo()
HelloWorld example, 50–51
querying context properties, 86–87
querying list of associated devices, 88
reference guide, 542
clGetDeviceIDs()
convolution signal example, 91
querying devices, 68–69
translation of matrix multiplication into OpenCL, 502
clGetDeviceIDsFromD3D10KHR(), 542
clGetDeviceInfo()
determining images supported, 290
embedded profile, 384
matrix multiplication, 506–509
querying context for associated devices, 88
querying device information, 70–78
querying embedded profile device support for images, 386–387
querying for OpenGL sharing extension, 336–337
steps in OpenCL usage, 83
clGetEventInfo(), 319–320, 549
clGetEventProfilingInfo()
direct translation of matrix multiplication, 502
errors, 329–330
extracting timing data, 328
placing profiling functions inside callbacks, 332
profiling information and return types, 329
reference guide, 549
clGetGLContextInfoKHR(), 579
clGetGLObjectInfo(), 347–348, 578
clGetGLTextureInfo(), 578
clGetImageInfo(), 286
clGetKernelInfo(), 242–243, 548
clGetKernelWorkGroupInfo(), 243–244, 548
clGetMemObjectInfo()
querying buffers and sub-buffers, 257–259
querying image object, 286
reference guide, 545
clGetPlatformIDs()
querying context for associated devices, 88
querying platforms, 63–64
reference guide, 542
clGetPlatformInfo()
embedded profile, 384
querying and selecting platform, 65–67
reference guide, 542
clGetProgramBuildInfo()
creating and building program object, 52–53
detecting build error, 220–221, 222
direct translation of matrix multiplication, 502
reference guide, 547
clGetProgramInfo()
getting program binary back from built program, 227–228
reference guide, 547
clGetSamplerInfo(), 294–295, 576
clGetSupportedImageFormats(), 291, 574
clGetXXInfo(), use of in this book, 70
CLK_GLOBAL_MEM_FENCE value, barrier functions, 190–191
CLK_LOCAL_MEM_FENCE value, barrier functions, 190–191
cl::Kernel(), 378
cl::Kernel::setArg(), 378
cloth simulation in Bullet Physics SDK
adding OpenGL interoperation, 446–448
executing on CPU, 431–432
executing on GPU, 432–438
introduction to, 425–428
optimizing for SIMD computation and local memory, 441–446
overview of, 425
of soft body, 429–431
two-layered batching, 438–441
cl::Program(), 377
clReleaseCommandQueue(), 543
clReleaseEvent(), 318–319, 549
clReleaseKernel(), 244–245, 548
clReleaseMemObject()
reference guide, 545
release buffer object, 339
release image object, 284
clRetainCommandQueue(), 543
clRetainProgram(), 236–237, 546
clRetainSampler(), 576
clSetEventCallback()
events impacting execution on host, 325–326
placing profiling functions inside callbacks, 331–332
reference guide, 549
clSetKernelArg()
creating buffers and sub-buffers, 250, 255
executing kernel, 55–56
executing Vector Add kernel, 378
matrix multiplication using local memory, 509–511
reference guide, 548
sampler declaration fields, 577
setting kernel arguments, 56, 237–240
thread safety and, 241–242
clSetMemObjectDestructorCallback(), 545
clSetUserEventStatus()
generating events on host, 322
how to use, 323–324
reference guide, 549
clWaitForEvents(), 323–324, 549
CMake tool
generating project in Linux and Eclipse, 44–45
generating project in Visual Studio, 42–44
installing as cross-platform build tool, 40–41
Mac OS X and Code::Blocks, 40–41
cmake-gui, 42–44
Code::Blocks, 41–42
color, cloth simulation
executing on GPU, 433–438
in two-layered batching, 438–441
color images. see image histograms
comma operator (,), 131
command-queue
acquiring OpenGL objects, 341–342
as core of OpenCL, 309–310
creating, 50–52
creating after selecting set of devices, 377
creating in PyOpenCL, 493
defining consistency of memory objects on, 24
direct translation of matrix multiplication into OpenCL, 502
events and, 311–317
executing kernel, 56–57
in execution model, 18–21
execution of Vector Add kernel, 378, 380
OpenCL runtime reference guide, 543
runtime API setting up, 31–32
transferring image objects to, 299–300
common functions, 172–175
compiler
directives for optional extensions, 143–145
unloading OpenCL, 547
component selection syntax, vectors, 106–107
components, vector data type, 106–108
compute device, platform model, 12
compute units, platform model, 12
concurrency, 7–8
exploiting in command-queues, 310
kernel execution model, 14
parallel algorithm limitations, 28–29
conditional operator, 124, 129
const type qualifier, 141
constant (__constant) address space qualifier, 137–138, 141
constant memory
device architecture diagram, 577
memory model, 21–23
contexts
allocating memory objects against, 248
choosing platform and creating, 375
convolution signal example, 89–97
creating in PyOpenCL, 492–493
defining in execution model, 17–18
incrementing and decrementing reference count, 89
initializing for OpenGL interoperability, 338–339
OpenCL platform layer, 541–542
overview of, 83
querying properties, 85–87
steps in OpenCL, 84
convergence, simulating soft body, 430
conversion
embedded profile device support rules, 386–387
vector component, 554
convert_int(), explicit conversions, 118
convolution signal example, 89–97
coordinate mode, sampler objects, 282, 292–295
copy
buffers and sub-buffers, 274–276, 545
image objects, 302–305, 308, 575
costArray:, Dijkstra’s algorithm, 413–414, 415–417
CPUs
executing cloth simulation on, 431–432
heterogeneous future of multicore, 4–7
matrix multiplication and performance results, 511–513
SpMV implementation, 518–519
CreateCommandQueue(), 50–51
CreateMemObjects(), 54–55
CSR format, sparse matrix, 517
DAG (directed acyclic graph), command-queues and, 310
data load and store functions, vectors, 181–189
data structure, Dijkstra’s algorithm, 412–414
data types
explicit casts, 116–117
explicit conversions, 117–121
implicit type conversions, 110–115
reference guide for supported, 550–552
reinterpreting data as other, 121–123
reserved as keywords in OpenCL C, 141
scalar. see scalar data types
specifying attributes, 555
vector. see vector data types
data-parallel programming model
overview of, 8–9
parallel algorithm limitations, 28–29
understanding, 25–27
writing kernel using OpenCL C, 97–99
decimation kernel, optical flow, 474
declaration fields, sampler, 577
default device, 69
#define preprocessor directive, 142, 145
dense matrix, 499
dense optical flow, 469
derived types, OpenCL C, 109–110
design, for tiled and packetized sparse matrix, 523–524
device_type argument, querying devices, 68
devices
architecture diagram, 577
choosing first available, 50–52
convolution signal example, 89–97
creating context in execution model, 17–18
determining profile support by, 390
embedded profile for hand held, 383–385
executing kernel on, 13–17
execution of Vector Add kernel, 380
full profile for desktop, 383
in platform model, 12
querying, 67–70, 78–83, 375–377, 542–543
selecting, 70–78
steps in OpenCL, 83–84
DFFT (discrete fast Fourier transform), 453
DFT. see discrete Fourier transform (DFT), Ocean simulation
Dijkstra’s algorithm, parallelizing
graph data structures, 412–414
kernels, 414–417
leveraging multiple compute devices, 417–423
overview of, 411–412
dimensions, image object, 282
Direct3D, interoperability with. see interoperability with Direct3D
directed acyclic graph (DAG), command-queues and, 310
directional edge detector filter, Sobel, 407–410
directories, sample code for this book, 41
DirectX Shading Language (HLSL), 111–113
discrete fast Fourier transform (DFFT), 453
discrete Fourier transform (DFT), Ocean simulation
avoiding local memory bank conflicts, 463
determining 2D composition, 457–458
determining local memory needed, 462
determining sub-transform size, 459–460
determining work-group size, 460
obtaining twiddle factors, 461–462
overview of, 457
using images, 463
using local memory, 459
distance(), geometric functions, 175–176
divide (/) arithmetic operator, 124–126
doublen, vector data load and store, 181
DRAM, modern multicore CPUs, 6–7
dynamic libraries, OpenCL program vs., 97
early exit, optical flow algorithm, 483
Eclipse, generating project in, 44–45
edgeArray:, Dijkstra’s algorithm, 412–414
“Efficient Sparse Matrix-Vector Multiplication on CUDA” (Bell and Garland), 517
embedded profile
64-bit integers, 385–386
built-in atomic functions, 387
determining device supporting, 390
full profile vs., 383
images, 386–387
mandated minimum single-precision floating-point capabilities, 387–389
OpenCL programs for, 35–36
overview of, 383–385
platform queries, 65
__EMBEDDED_PROFILE__ macro, 390
enumerated type
rank order of, 113
specifying attributes, 555
enumerating, list of platforms, 66–67
equal (==) operator, 127
error codes
C++ Wrapper API exceptions, 371–374
clBarrier(), 313
clCreateUserEvent(), 321–322
clEnqueueMarker(), 314
clEnqueueWaitForEvents(), 314–315
clGetEventProfilingInfo(), 329–330
clGetProgramBuildInfo, 220–221
clRetainEvent(), 318
clSetEventCallback(), 326
clWaitForEvents(), 323
table of, 57–61
ERROR_CODE value, command-queue, 311
.even suffix, vector data types, 107–108
event data types, 108, 147–148
event objects
OpenCL/OpenGL sharing APIs, 579
overview of, 317–320
reference guide, 549–550
event_t async_work_group_copy(), 192, 332–333
event_t async_work_group_strided_copy(), 192, 332–333
events
command-queues and, 311–317
defined, 310
event objects. see event objects
generating on host, 321–322
impacting execution on host, 322–327
inside kernels, 332–333
from outside OpenCL, 333
overview of, 309–310
profiling using, 327–332
in task-parallel programming model, 28
exceptions
C++ Wrapper API, 371–374
execution of Vector Add kernel, 379
exclusive (^^) operator, 128
exclusive or (^) operator, 127–128
execution model
command-queues, 18–21
contexts, 17–18
defined, 11
how kernel executes on OpenCL device, 13–17
overview of, 13
parallel algorithm limitations, 28–29
explicit casts, 116–117
explicit conversions, 117–121, 132
explicit kernel, SpMV, 519
explicit memory fence, 570–571
explicit model, data parallelism, 26–27
explicit synchronization, 349
exponent, half data type, 101
expression, assignment operator, 132
extensions, compiler directives for optional, 143–145
fast Fourier transform (FFT). see Ocean simulation, with FFT
fast_ variants, geometric functions, 175
FBO (frame buffer object), 347
file, creating 2D image from, 284–285
filter mode, sampler objects, 282, 292–295
float channels, 403–406
float data type, converting, 101
float images, 386
float type, math constants, 556
floating-point arithmetic system, 33–34
floating-point constants, 162–163
floating-point data types, 113, 119–121
floating-point options
building program object, 224–225
full vs. embedded profiles, 387–388
floating-point pragmas, 143, 162
floatn, vector data load and store functions, 181, 182–186
fma, geometric functions, 175
formats, image
embedded profile, 387
encapsulating information on, 282
mapping OpenGL texture to OpenCL image, 346
overview of, 287–291
querying list of supported, 574
reference guide for supported, 576
formats, of program binaries, 227
FP_CONTRACT pragma, 162
frame buffer object (FBO), 347
FreeImage library, 283, 284–285
FreeSurfer. see Dijkstra’s algorithm, parallelizing
FFT (fast Fourier transform). see Ocean simulation, with FFT
full profile
built-in atomic functions, 387
determining profile support by device, 390
embedded profile as strict subset of, 383–385
mandated minimum single-precision floating-point capabilities, 387–389
platform queries, 65
querying device support for images, 386–387
function qualifiers
overview of, 133–134
reference guide, 554
reserved as keywords, 141
functions. see built-in functions
Gaussian filter, 282–283, 295–299
Gauss-Seidel iteration, 432
GCC compiler, 111–113
general-purpose GPU (GPGPU), 10, 29
gentype
barrier functions, 191–195
built-in common functions, 173–175
integer functions, 168–171
miscellaneous vector functions, 199–200
vector data load and store functions, 181–189
work-items, 153–161
gentyped
built-in common functions, 173–175
built-in geometric functions, 175–176
built-in math functions, 155–156
defined, 153
gentypef
built-in geometric functions, 175–177
built-in math functions, 155–156, 160–161
defined, 153
geometric built-in functions, 175–177, 563–564
get_global_id(), data-parallel kernel, 98–99
getInfo(), C++ Wrapper API, 375–377
gl_object_type parameter, query OpenGL objects, 347–348
glBuildProgram(), 52–53
glCreateFromGLTexture2D(), 344–345
glCreateFromGLTexture3D(), 344–345
glCreateSyncFromCLeventARB(), 350–351
glDeleteSync() function, 350
GLEW toolkit, 336
glFinish()
creating OpenCL buffers from OpenGL buffers, 342
OpenCL/OpenGL synchronization with, 348
OpenCL/OpenGL synchronization without, 351
global (__global) address space qualifier, 136, 141
global index space, kernel execution model, 15–16
global memory
device architecture diagram, 577
matrix multiplication, 507–509
memory model, 21–23
globalWorkSize, executing kernel, 56–57
GLSL (OpenGL Shading Language), 111–113
glWaitSync(), synchronization, 350–351
GMCH (graphics/memory controller), 6–7
gotos, irreducible control flow, 147
GPGPU (general-purpose GPU), 10, 29
GPU (graphics processing unit)
advantages of image objects. see image objects
defined, 69
executing cloth simulation on, 432–438
leveraging multiple compute devices, 417–423
matrix multiplication and performance results, 511–513
modern multicore CPUs as, 6–7
OpenCL implementation for NVIDIA, 40
optical flow performance, 484–485
optimizing for SIMD computation and local memory, 441–446
querying and selecting, 69–70
SpMV implementation, 518–519
tiled and packetized sparse matrix design, 523–524
tiled and packetized sparse matrix team, 524
two-layered batching, 438–441
graph data structures, parallelizing Dijkstra’s algorithm, 412–414
graphics. see also images
shading languages, 111–113
standards, 30–31
graphics processing unit. see GPU (graphics processing unit)
graphics/memory controller (GMCH), 6–7
grayscale images, applying Sobel OpenCL kernel to, 409–410
greater than (>) operator, 127
greater than or equal (>=) operator, 127
half data type, 101–102
half_ functions, 153
half-float channels, 403–406
half-float images, 386
hand held devices, embedded profile for. see embedded profile
hardware
mapping program onto, 9–11
parallel computation as concurrency enabled by, 8
SpMV kernel, 519
SpMV multiplication, 524–538
hardware abstraction layer, 11, 29
hardware linear interpolation, optical flow algorithm, 480
hardware scheduling, optical flow algorithm, 483
header structure, SpMV, 522–523
height map, Ocean application, 450
HelloWorld sample
checking for errors, 57–61
choosing device and creating command-queue, 50–52
choosing platform and creating context, 49–50
creating and building program object, 52–53
creating kernel and memory objects, 54–55
downloading sample code, 39
executing kernel, 55–57
Linux and Eclipse, 44–45
Mac OS X and Code::Blocks, 41–42
Microsoft Windows and Visual Studio, 42–44
prerequisites, 40–41
heterogeneous platforms, 4–7
.hi suffix, vector data types, 107–108
high-level loop, Dijkstra’s algorithm, 414–417
histogram. see image histograms
histogram_partial_image_rgba_unorm8 kernel, 400
histogram_partial_results_rgba_unorm8 kernel, 400–402
histogram_sum_partial_results_unorm8 kernel, 400
HLSL (DirectX Shading Language), 111–113
host
calls to enqueue histogram kernels, 398–400
creating, writing and reading buffers and sub-buffers, 262–268
device architecture diagram, 577
events impacting execution on, 322–327
generating events on, 321–322
kernel execution model, 13
matrix multiplication, 502–505
platform model, 12
host memory
memory model, 21–23
reading image back to, 300–301
reading image from device to, 299–300
reading region of buffer into, 269–272
writing region into buffer from, 272–273
hybrid programming models, 29
ICC compiler, 111–113
ICD (installable client driver) model, 49, 375
IDs, kernel execution model, 14–15
IEEE standards, floating-point arithmetic, 33–34
image channel data type, image formats, 289–291
image channel order, image formats, 287–291
image data types, 108–109, 147
image difference, optical flow algorithm, 472
image functions
border color, 209–210
querying image information, 214–215
read and write, 201–206
samplers, 206–209
writing to images, 210–213
image histograms
additional optimizations to parallel, 400–402
overview of, 393
parallelizing, 395–400
image objects
copy between buffer objects and, 574
creating in OpenCL from OpenGL textures, 344–347
Gaussian filter example, 282–283
loading to in PyOpenCL, 493–494
mapping and unmapping, 305–308, 574
memory model, 21
OpenCL and, 30
OpenCL C functions for working with, 295–299
OpenCL/OpenGL sharing APIs, 578
overview of, 281–282
querying, 575
querying list of supported formats, 574
querying support for device images, 291
read, write, and copy, 575
specifying image formats, 287–291
transferring data, 299–308
image pyramids, optical flow algorithm, 472–479
image3d_t type, embedded profile, 386
ImageFilter2D example, 282–291, 488–492
images
access qualifiers for read-only or write-only, 140–141
describing motion between. see optical flow
DFT, 463
embedded profile device support for, 386–387
formats. see formats, image
as memory objects, 247
read and write built-in functions, 572–573
Sobel edge detection filter for, 407–410
supported by OpenCL C, 99
Image.tostring() method, PyOpenCL, 493–494
implicit kernel, SpMV, 518–519
implicit model, data parallelism, 26
implicit synchronization, OpenCL/OpenGL, 348–349
implicit type conversions, 110–115
index space, kernel execution model, 13–14
INF (infinity), floating-point arithmetic, 34
inheritance, C++ API, 369
initialization
Ocean application overview, 450–451
OpenCL/OpenGL interoperability, 338–340
parallelizing Dijkstra’s algorithm, 415
in-order command-queue, 19–20, 24
input vector, SpMV, 518
installable client driver (ICD) model, 49, 375
integer built-in functions, 168–172, 557–558
integer data types
arithmetic operators, 124–126
explicit conversions, 119–121
rank order of, 113
relational and equality operators, 127
intellectual property, program binaries protecting, 227
interoperability with Direct3D
acquiring/releasing Direct3D objects in OpenCL, 361–363
creating memory objects from Direct3D buffers/textures, 357–361
initializing context for, 354–357
overview of, 353
processing D3D vertex data in OpenCL, 366–368
processing Direct3D texture in OpenCL, 363–366
reference guide, 579–580
sharing overview, 353–354
interoperability with OpenGL
cloth simulation, 446–448
creating OpenCL buffers from OpenGL buffers, 339–343
creating OpenCL image objects from OpenGL textures, 344–347
initializing OpenCL context for, 338–339
optical flow algorithm, 483–484
overview of, 335
querying for OpenGL sharing extension, 336–337
querying information about OpenGL objects, 347–348
reference guide, 577–579
sharing overview, 335–336
synchronization, 348–351
irreducible control flow, restrictions, 147
iterations
executing cloth simulation on CPU, 431–432
executing cloth simulation on GPU, 434–435
pyramidal Lucas-Kanade optical flow, 472
simulating soft body, 429–431
kernel attribute qualifiers, 134–135
kernel execution commands, 19–20
kernel objects
arguments and object queries, 548
creating, 547–548
creating, and setting kernel arguments, 237–241
executing, 548
managing and querying, 242–245
out-of-order execution of memory object command and, 549
overview of, 237
program objects vs., 217–218
thread safety, 241–242
__kernel qualifier, 133–135, 141, 217
kernels
applying Phillips spectrum, 453–457
constant memory during execution of, 21
creating, writing and reading buffers/sub-buffers, 262
creating context in execution model, 17–18
creating memory objects, 54–55, 377–378
in data-parallel programming model, 25–27
data-parallel version of, 97–99
defined, 13
in device architecture diagram, 577
events inside, 332–333
executing and reading result, 55–57
executing Ocean simulation application, 463–468
executing on OpenCL device, 13–17
executing Sobel OpenCL, 407–410
executing Vector Add kernel, 381
in execution model, 13
leveraging multiple compute devices, 417–423
in matrix multiplication program, 501–509
parallel algorithm limitations, 28–29
parallelizing Dijkstra’s algorithm, 414–417
programming language and, 32–34
in PyOpenCL, 495–497
restrictions in OpenCL C, 146–148
in task-parallel programming model, 27–28
in tiled and packetized sparse matrix, 518–519, 523
keywords, OpenCL C, 141
Khronos, 29–30
learning OpenCL, 36–37
left shift (<<) operator, 129–130
length(), geometric functions, 175–177
less than (<) operator, 127
less than or equal (<=) operator, 127
library functions, restrictions in OpenCL C, 147
links
cloth simulation using two-layered batching, 438–441
executing cloth simulation on CPU, 431–432
executing cloth simulation on GPU, 433–438
introduction to cloth simulation, 426–428
simulating soft body, 429–431
Linux
generating project in, 44–45
initializing contexts for OpenGL interoperability, 338–339
OpenCL implementation in, 41
.lo suffix, vector data types, 107–108
load balancing
automatic, 20
in parallel computing, 9
loading, program binaries, 227
load/store functions, vector data, 567–568
local (__local) address space qualifier, 138–139, 141
local index space, kernel execution model, 15
local memory
device architecture diagram, 577
discrete Fourier transform, 459, 462–463
FFT kernel, 464
memory model, 21–24
optical flow algorithm, 481–482
optimizing in matrix multiplication, 509–511
SpMV implementation, 518–519
localWorkSize, executing kernel, 56–57
logical operators
overview of, 128
symbols, 124
unary not (!), 131
Lucas-Kanade. see pyramidal Lucas-Kanade optical flow algorithm
luminosity histogram, 393
lvalue, assignment operator, 132
Mac OS X
OpenCL implementation in, 40
using Code::Blocks, 41–42
macros
determining profile support by device, 390
integer functions, 172
OpenCL C, 145–146
preprocessor directives and, 555
preprocessor error, 372–374
mad, geometric functions, 175
magnitudes, wave, 454
main() function, HelloWorld OpenCL kernel and, 44–48
mandated minimum single-precision floating-point capabilities, 387–389
mantissa, half data type, 101
mapping
buffers and sub-buffers, 276–279
C++ classes to OpenCL C type, 369–370
image data, 305–308
image to host or memory pointer, 299
OpenGL texture to OpenCL image formats, 346
markers, synchronization point, 314
maskArray, Dijkstra’s algorithm, 412–414, 415
masking off operation, 121–123
mass/spring model, for soft bodies, 425–427
math built-in functions
accuracy for embedded vs. full profile, 388
floating-point constant, 162–163
floating-point pragma, 162
overview of, 153–161
reference guide, 560–563
relative error as ulps in, 163–168
math constants, reference guide, 556
math intrinsics, program build options, 547
math_ functions, 153
Matrix Market (MM) exchange format, 517–518
matrix multiplication
basic algorithm, 499–501
direct translation into OpenCL, 501–505
increasing amount of work per kernel, 506–509
overview of, 499
performance results and optimizing original CPU code, 511–513
sparse matrix-vector. see sparse matrix-vector multiplication (SpMV)
using local memory, 509–511
memory access flags, 282–284
memory commands, 19
memory consistency, 23–24, 191
memory latency, SpMV, 518–519
memory objects
buffers and sub-buffers as, 247–248
creating context in execution model, 17–18
creating kernel and, 54–55, 377–378
matrix multiplication and, 502
in memory model, 21–24
out-of-order execution of kernels and, 549
querying to determine type of, 258–259
runtime API managing, 32
mesh
executing cloth simulation on CPU, 431–432
executing cloth simulation on GPU, 433
introduction to cloth simulation, 425–428
simulating soft body, 429–431
two-layered batching, 438–441
MFLOPS, 512–513
Microsoft Windows
generating project in Visual Studio, 42–44
OpenCL implementation in, 40
OpenGL interoperability, 338–339
mismatch vector, optical flow algorithm, 472
MM (Matrix Market) exchange format, 517–518
multicore chips, power-efficiency of, 4–5
multiplication
matrix. see matrix multiplication
sparse matrix-vector. see sparse matrix-vector multiplication (SpMV)
multiply (*) arithmetic operator, 124–126
n suffix, 181
names, reserved as keywords, 141
NaN (Not a Number), floating-point arithmetic, 34
native kernels, 13
NDRange
data-parallel programming model, 25
kernel execution model, 14–16
matrix multiplication, 502, 506–509
task-parallel programming model, 27
normalize(), geometric functions, 175–176
not (~) operator, 127–128
not equal (!=) operator, 127
numeric indices, built-in vector data types, 107
NVIDIA GPU Computing SDK
generating project in Linux, 41
generating project in Linux and Eclipse, 44–45
generating project in Visual Studio, 42
generating project in Windows, 40
OpenCL/OpenGL interoperability, 336
objects, OpenCL/OpenGL sharing API, 579
Ocean simulation, with FFT
FFT kernel, 463–467
generating Phillips spectrum, 453–457
OpenCL DFT. see discrete Fourier transform (DFT), Ocean simulation
overview of, 449–453
transpose kernel, 467–468
.odd suffix, vector data types, 107–108
OpenCL, introduction
conceptual foundation of, 11–12
data-parallel programming model, 25–27
embedded profile, 35–36
execution model, 13–21
graphics, 30–31
heterogeneous platforms of, 4–7
kernel programming language, 32–34
learning, 36–37
memory model, 21–24
other programming models, 29
parallel algorithm limitations, 28–29
platform API, 31
platform model, 12
runtime API, 31–32
software, 7–10
summary review, 34–35
task-parallel programming model, 27–28
understanding, 3–4
OpenCL C
access qualifiers, 140–141
address space qualifiers, 135–140
built-in functions. see built-in functions
derived types, 109–110
explicit casts, 116–117
explicit conversions, 117–121
function qualifiers, 133–134
functions for working with images, 295–299
implicit type conversions, 110
kernel attribute qualifiers, 134–135
as kernel programming language, 32–34
keywords, 141
macros, 145–146
other data types supported by, 108–109
overview of, 97
preprocessor directives, 141–144
reinterpreting data as another type, 121–123
restrictions, 146–148
scalar data types, 99–102
type qualifiers, 141
vector data types, 102–108
vector operators. see vector operators
writing data-parallel kernel using, 97–99
OPENCL EXTENSION directive, 143–145
OpenGL
interoperability between OpenCL and. see interoperability with Direct3D; interoperability with OpenGL
Ocean application, 450–453
OpenCL and graphics standards, 30
reference guide for sharing APIs, 577–579
synchronization between OpenCL, 333
OpenGL Shading Language (GLSL), 111–113
operands, vector literals, 105
operators, vector. see vector operators
optical flow
application of texture cache, 480–481
early exit and hardware scheduling, 483
efficient visualization with OpenGL interop, 483–484
performance, 484–485
problem of, 469–479
sub-pixel accuracy with hardware linear interpolation, 480
understanding, 469
using local memory, 481–482
optimization options
clBuildProgram(), 225–226
partial image histogram, 400–402
program build options, 546
“Optimizing Power Using Transformations” (Chandrakasan et al.), 4–5
“Optimizing Sparse Matrix-Vector Multiplication on GPUs” (Baskaran and Bordawekar), 517
optional extensions, compiler directives for, 143–145
or (|) operator, 127–128
or (||) operator, 128
out-of-order command-queue
automatic load balancing, 20
data-parallel programming model, 24
execution model, 20
reference guide, 549
task-parallel programming model, 28
output, creating 2D image for, 285–286
output vector, SpMV, 518
overloaded function, vector literal as, 104–105
packets
optimizing sparse matrix-vector multiplication, 538–539
tiled and packetized sparse matrix, 519–522
tiled and packetized sparse matrix design, 523–524
tiled and packetized sparse matrix team, 524
pad to 128-boundary, tiled and packetized sparse matrix, 523–524
parallel algorithm limitations, 28–29
parallel computation
as concurrency enabled by software, 8
of image histogram, 395–400
image histogram optimizations, 400–402
parallel programming, using models for, 8
parallelism, 8
param_name values, querying platforms, 64–65
partial histograms
computing, 395–397
optimizing by reducing number of, 400–402
summing to generate final histogram, 397–398
partitioning workload, for multiple compute devices, 417–423
Patterns for Parallel Programming (Mattson), 20
performance
heterogeneous future of, 4–7
leveraging multiple compute devices, 417–423
matrix multiplication results, 511–513
optical flow algorithm and, 484–485
soft body simulation and, 430–431
sparse matrix-vector multiplication and, 518, 524–538
using events for profiling, 327–332
using matrix multiplication for high. see matrix multiplication
PEs (processing elements), platform model, 12
phillips function, 455–457
Phillips spectrum generation, 453–457
platform API, 30–31
platform model, 11–12
platforms
choosing, 49–50
choosing and creating context, 375
convolution signal example, 89–97
embedded profile, 383–385
enumerating and querying, 63–67
querying and displaying specific information, 78–83
querying list of devices associated with, 68
reference guide, 541–543
steps in OpenCL, 83–84
pointer data types, implicit conversions, 111
post-increment (++) unary operator, 131
power
efficiency of specialized core, 5–6
of multicore chips, 4–5
#pragma directives, OpenCL C, 143–145
predefined identifiers, not supported, 147
prefetch functions, 191–195, 570
pre-increment (--) unary operator, 131
preprocessor build options, 223–224
preprocessor directives
OpenCL C, 141–142
program object build options, 546–547
reference guide, 555
preprocessor error macros, C++ Wrapper API, 372–374
private (__private) address space qualifier, 139, 141
processing elements (PEs), platform model, 12
profiles
associated with platforms, 63–67
commands for events, 327–332
embedded. see embedded profile
reference guide, 549
program objects
build options, 222–227
creating and building, 52–53, 377
creating and building from binaries, 227–236
creating and building from source code, 218–222
creating and building in PyOpenCL, 494–495
creating context in execution model, 17–18
kernel objects vs., 217–218
managing and querying, 236–237
reference guide, 546–547
runtime API creating, 32
programming language. see also OpenCL C; PyOpenCL, 32–34
programming models
data-parallel, 25–27
defined, 12
other, 29
parallel algorithm limitations, 28–29
task-parallel, 27–28
properties
device, 70
querying context, 85–87
PyImageFilter2D, PyOpenCL, 488–492
PyOpenCL
context and command-queue creation, 492–493
creating and building program, 494–495
introduction to, 487–488
loading to image object, 493–494
overview of, 487
PyImageFilter2D code, 488–492
reading results, 496
running PyImageFilter2D example, 488
setting kernel arguments/executing kernel, 495–496
pyopencl v0.92+, 488
pyopencl.create_some_context(), 492
pyramidal Lucas-Kanade optical flow algorithm, 469, 471–473
Python, using OpenCL in. see PyOpenCL
Python Image Library (PIL), 488, 493–494
qualifiers
access, 140–141
address space, 135–140
function, 133–134
kernel attribute, 134–135
type, 141
queries
buffer and sub-buffer, 257–259, 545
device, 542–543
device image support, 291
event object, 319–320
image object, 214–215, 286, 575
OpenCL/OpenGL sharing APIs, 578
OpenGL objects, 347–348
storing program binary and, 230–232
supported image formats, 574
R, G, B color histogram
optimizing, 400–402
overview of, 393
parallelizing, 395–400
rank order, usual arithmetic conversions, 113–115
read
buffers and sub-buffers, 259–268, 544
image back to host memory, 300–301
image built-in functions, 201–206, 298, 572–573
image from device to host memory, 299–300
image objects, 575
memory objects, 248
results in PyOpenCL, 496–497
read_imagef(), 298–299
read-only qualifier, 140–141
read-write qualifier, 141
recursion, not supported in OpenCL C, 147
reference counts
buffers and sub-buffers, 256
contexts, 89
event objects, 318
regions, memory model, 21–23
relational built-in functions, 175, 178–181, 564–567
relational operators, 124, 127
relaxed consistency model, memory objects, 24
remainder (%) arithmetic operator, 124–126
rendering of height map, Ocean application, 450
reserved data types, 550–552
restrict type qualifier, 141
restrictions, OpenCL C, 146–148
return type, kernel function restrictions, 146
RGB images, applying Sobel OpenCL kernel to, 409
RGBA-formatted image, loading in PyOpenCL, 493–494
right shift (>>) operator, 129–130
rounding mode modifier
explicit conversions, 119–121
vector data load and store functions, 182–189
runCLSimulation(), 451–457
sampler data types
determining border color, 209–210
functions, 206–209
restrictions in OpenCL C, 108–109, 147
sampler objects. see also image objects
creating, 292–294
declaration fields, 577
functions of, 282
overview of, 281–282
reference guide, 576–577
releasing and querying, 294–295
_sat (saturation) modifier, explicit conversions, 119–120
SaveProgramBinary(), creating programs, 230–231
scalar data types
creating vectors with vector literals, 104–105
explicit casts of, 116–117
explicit conversions of, 117–121
half data type, 101–102
implicit conversions of, 110–111
integer functions, 172
reference guide, 550
supported by OpenCL C, 99–101
usual arithmetic conversions with, 113–115
vector operators with. see vector operators
scalar_add(), writing data-parallel kernel, 97–98
754 formats, IEEE floating-point arithmetic, 34
sgentype
integer functions, 172
relational functions, 181
shape matching, soft bodies, 425
sharing APIs, OpenCL/OpenGL, 577–579
shuffle, illegal usage of, 214
shuffle2, illegal usage of, 214
sign, half data type, 101
SIMD (Single Instruction Multiple Data) model, 26–27, 465
simulation
cloth. see cloth simulation in Bullet Physics SDK
ocean. see Ocean simulation, with FFT
Single Instruction Multiple Data (SIMD) model, 26–27, 465
Single Program Multiple Data (SPMD) model, 26
single-source shortest-path graph algorithm. see Dijkstra’s algorithm, parallelizing
64-bit integers, embedded profile, 385–386
sizeof operator, 131–132
slab, tiled and packetized sparse matrix, 519
Sobel edge detection filter, 407–410
soft bodies
executing cloth simulation on CPU, 431–432
executing cloth simulation on GPU, 432–438
interoperability with OpenGL, 446–448
introduction to cloth simulation, 425–428
simulating, 429–431
software, parallel, 7–10
solveConstraints, cloth simulation on GPU, 435
solveLinksForPosition, cloth simulation on GPU, 435
source code
creating and building programs from, 218–222
program binary as compiled version of, 227
sparse matrix-vector multiplication (SpMV)
algorithm, 515–517
defined, 515
description of, 518–519
header structure, 522–523
optional team information, 524
other areas of optimization, 538–539
overview of, 515
tested hardware devices and results, 524–538
tiled and packetized design, 523–524
tiled and packetized representation of, 519–522
specify type attributes, 555
SPMD (Single Program Multiple Data) model, 26
SpMV. see sparse matrix-vector multiplication (SpMV)
storage
image layout, 308
sparse matrix formats, 517
strips, tiled and packetized sparse matrix, 519
struct type
restrictions on use of, 109–110, 146
specifying attributes, 555
sub-buffers. see buffers and sub-buffers
sub-pixel accuracy, optical flow algorithm, 480
subregions, of memory objects, 21
subtract (-) arithmetic operator, 124–126
sub-transform size, DFT, 459–460
suffixes, vector data types, 107–108
synchronization
commands, 19–21
computing Dijkstra’s algorithm with kernel, 415–417
explicit memory fence, 570–571
functions, 190–191
primitives, 248
synchronization points
defining when enqueuing commands, 312–315
in out-of-order command-queue, 24
T1 to T3 data types, rank order of, 114
task-parallel programming model
overview of, 9–10
parallel algorithm limitations, 28–29
understanding, 27–28
team information, tiled and packetized sparse matrix, 524
ternary selection (?:) operator, 129
tetrahedra, soft bodies, 425–428
texture cache, optical flow algorithm, 480–482
texture objects, OpenGL. see also image objects
creating image objects in OpenCL from, 344–347
Ocean application creating, 451
OpenCL/OpenGL sharing APIs, 578
querying information about, 347–348
thread safety, kernel objects, 241–242
tiled and packetized sparse matrix
defined, 515
design considerations, 523–524
header structure of, 522–523
overview of, 519–522
SpMV implementation, 517–518
timing data, profiling events, 328
traits, C++ template, 376
transpose kernel, simulating ocean, 467–468
twiddle factors, DFT
FFT kernel, 464–466
obtaining, 461–462
using local memory, 463
2D composition, in DFT, 457–458
two-layered batching, cloth simulation, 438–441
type casting, vector component, 554
type qualifiers, 141
ugentypen, 214–215
ulp values, 163–168
union type, specifying attributes, 555
updatingCostArray, Dijkstra’s algorithm, 413–417
usual arithmetic conversions, 113–115
vadd() kernel, Vector Add kernel, 378
variable-length arrays, not supported in OpenCL C, 147
variadic macros and functions, not supported in OpenCL C, 147
VBO (vertex buffer object), 340–344, 446–448
vbo_cl_mem, creating VBO in OpenGL, 340–341
Vector Add example. see C++ Wrapper API, Vector Add example
vector data types
application, 103–104
built-in, 102–103
data load and store functions, 181–189
explicit casts, 116–117
explicit conversions, 117–121
implicit conversions between, 110–113
literals, 104–105
load/store functions reference, 567–568
miscellaneous built-in functions, 199–200, 571
operators. see vector operators
optical flow algorithm, 470–472
reference guide, 550
supported by OpenCL C, 99
usual arithmetic conversions with, 113–115
vector literals, 104–105
vector operators
arithmetic operators, 124–126
assignment operator, 132
bitwise operators, 127–128
conditional operator, 129
logical operators, 128
overview of, 123–124
reference guide, 554
relational and equality operators, 127
shift operators, 129–130
unary operators, 131–132
vertex buffer object (VBO), 340–344, 446–448
vertexArray, Dijkstra’s algorithm, 412–414
vertical filtering, optical flow, 474
vertices
introduction to cloth simulation, 425–428
simulating soft body, 429–431
Visual Studio, generating project in, 42–44
void return type, kernel functions, 146
void wait_group_events(), 193, 332–333
volatile type qualifier, 141
voltage, multicore chip, 4–5
vstore_half()
half data type, 101
reference guide, 568
vector store functions, 183, 187
vstore_halfn(), 184, 186–188, 568
vstorea_halfn(), 186, 188–189, 568
VSTRIDE, FFT kernel, 464
wave amplitudes, 454
weightArray, Dijkstra’s algorithm, 412–414
Windows. see Microsoft Windows
work-group barrier, 25–27
work-groups
data-parallel programming model, 25–27
global memory for, 21
kernel execution model, 14–16
SpMV implementation, 518
tiled and packetized sparse matrix team, 524
work-items
barrier functions, 190–191
built-in functions, 557
data-parallel programming model, 25–27
functions, 150–152
global memory for, 21
kernel execution model, 13–15
local memory for, 23
mapping get_global_id to, 98–99
matrix multiplication, 501–509
private memory for, 21
task-parallel programming model, 27
write
buffers and sub-buffers, 259–268, 544–545
image built-in functions, 210–213, 298–299, 572–573
image from host to device memory, 301–302
image objects, 575
memory objects, 248
write_imagef(), 298–299
write-only qualifier, 140–141