25.3. Algorithms, Implementation, and Evaluation
Otherwise, as in our other examples, each discrete wavelength to be modeled is characterized by appropriate scattering parameters and phase function. The Lattice Boltzmann model is run once for each parameter set, and the results are combined in a post-processing (ray-tracing) step.
Initial conditions include zero photon density in all directions at all interior nodes in the lattice. For each node in the lattice boundary, directions that have positive dot products with the external light source direction will receive initial photon densities that are proportional to these dot products. These directional photon densities on boundary nodes are fixed for the duration of the execution. Any flow (movement of photon density) into a boundary node is ignored, as is any flow out of a boundary node that goes off lattice.
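To make the boundary treatment concrete, the following host-side sketch shows one way these initial values might be computed. It is not the authors' code: the names direction[] (the 19 lattice direction vectors), light[] (the external light source direction), and f[] (a host-side flow array, here in site-major order with DIRECTIONS varying fastest) are assumptions introduced only for illustration.

void init_densities(float *f, const int *direction, const float light[3])
{
    int i, j, k, m;
    for (i = 0; i < WIDTH; i++)
      for (j = 0; j < HEIGHT; j++)
        for (k = 0; k < DEPTH; k++) {
          int boundary = (i == 0 || i == WIDTH-1 || j == 0 || j == HEIGHT-1 ||
                          k == 0 || k == DEPTH-1);
          for (m = 0; m < DIRECTIONS; m++) {
            float d = 0.0f;
            if (boundary) {
              /* dot product of lattice direction m with the light source direction */
              d = direction[3*m+0]*light[0] + direction[3*m+1]*light[1]
                + direction[3*m+2]*light[2];
              if (d < 0.0f) d = 0.0f;   /* only aligned directions receive photon density */
            }
            /* zero photon density in all directions at interior nodes */
            f[((i*HEIGHT + j)*DEPTH + k)*DIRECTIONS + m] = d;
          }
        }
}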
25.3.1. An OpenCL Implementation
The fundamental update in Eq. 25.2 must be synchronous. Each lattice site must read 19 values and write 19 values at each time step, but the layout is particularly well suited to GPU architectures in that there is no overlap in sites read or sites written between or among sites. We have implemented kernels for the update in both CUDA and OpenCL for the NVIDIA GTX 480, and we find, after optimizations of both, essentially no speed difference. Thus, in the interest of greater portability, we present the OpenCL solution here.
We use two copies of the directional densities and ping-pong between them as we advance the time step, t. Experience indicates that a final time of n × τ, where n is twice the longest edge dimension of the lattice, will suffice to achieve near-equilibrium values.
Both the collision matrix, Ω, and the per-site density of the medium are constant for the duration of the computation. The Ω matrix, which is 19 × 19, easily fits in constant memory, and so we can take advantage of the caching effects there. The medium density is one float per lattice site, which is too large for constant memory, but each lattice site reads only one medium density value per update, and that value can be stored in a register.
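The host-side setup for Ω is not shown in the listings that follow; a minimal sketch, assuming the matrix has already been computed into a host array omega_host and that context and err are the usual OpenCL context handle and error variable, might be:

/* omega_host is assumed to already hold the 19 x 19 collision matrix */
cl_mem omega_ocl = clCreateBuffer(context,
                                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  DIRECTIONS*DIRECTIONS*sizeof(float),
                                  omega_host, &err);
/* omega is the fourth kernel argument (index 3) in the kernels below */
clSetKernelArg(mykrn_update, 3, sizeof(cl_mem), (void *)&omega_ocl);

Declaring the corresponding kernel parameter with the __constant qualifier, as in the final kernel below, lets reads of Ω be served from the cached constant space.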
In addition to storage for Ω and storage for the per-site medium density, GPU-side storage includes two buffers for the flows,
cl_mem focl[2]; /* each holds WIDTH*HEIGHT*DEPTH*DIRECTIONS flow values */
and a single integer array
int dist[WIDTH*HEIGHT*DEPTH*DIRECTIONS];
which holds, for each lattice site and each direction, the flow-array index where exiting flow should be placed. The values for dist[] can be calculated and stored by the CPU during initialization.
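One possible CPU-side initialization of dist[] is sketched below. It is an illustration rather than the authors' code; it assumes the 19 lattice direction vectors are available in a host array direction[], places the "sink" (discussed later) one element past the end of the flow arrays, and uses the "natural" site-major layout discussed later, with DIRECTIONS varying fastest.

void init_dist(int *dist, const int *direction)
{
    const int SINK = WIDTH*HEIGHT*DEPTH*DIRECTIONS;  /* assumed: the extra "sink" slot */
    int i, j, k, m;
    for (i = 0; i < WIDTH; i++)
      for (j = 0; j < HEIGHT; j++)
        for (k = 0; k < DEPTH; k++)
          for (m = 0; m < DIRECTIONS; m++) {
            int oi = i + direction[3*m+0];
            int oj = j + direction[3*m+1];
            int ok = k + direction[3*m+2];
            int src = ((i*HEIGHT + j)*DEPTH + k)*DIRECTIONS + m;
            if (oi < 0 || oi >= WIDTH || oj < 0 || oj >= HEIGHT || ok < 0 || ok >= DEPTH)
                dist[src] = SINK;   /* exiting flow that would go off lattice is discarded */
            else
                dist[src] = ((oi*HEIGHT + oj)*DEPTH + ok)*DIRECTIONS + m;
          }
}

With the rearranged storage introduced later, the same loop applies with only the index arithmetic changed.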
On the CPU side, the entire update, using kernel mykrn_update, command queue mycq, and lattice dimension LDIM, is then given by:
void run_updates()
{
    int t, from = 0;
    cl_event wait[1];

    for (t = 0; t < FINAL_TIME; t++, from = 1-from) {
        /* read from focl[from], write to focl[1-from] */
        clSetKernelArg(mykrn_update, 0, sizeof(cl_mem), (void *)&focl[from]);
        clSetKernelArg(mykrn_update, 1, sizeof(cl_mem), (void *)&focl[1-from]);
        clEnqueueNDRangeKernel(mycq, mykrn_update, 3, 0, ws, lws, 0, 0, &wait[0]);
        clWaitForEvents(1, wait);
    }
    /* after the loop, focl[from] holds the most recently written flows */
    clEnqueueReadBuffer(mycq, focl[from], CL_TRUE, 0, LDIM*sizeof(float), &f[0],
                        0, NULL, NULL);
    return;
}
The work size, ws, is a three-dimensional integer vector with value (WIDTH, HEIGHT, DEPTH), the dimensions of the grid. The local work size, lws, which specifies threads per scheduled block, is (1, 1, DEPTH), for reasons to be specified shortly.
A naive but conceptually straightforward kernel is then given by
__kernel void update(__global float* from, __global float* to,
                     __global int* dist, __global float* omega,
                     __global float* density)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    int m, n;
    float medium_density, new_flow;

    medium_density = density(i,j,k);
    for (m = 0; m < DIRECTIONS; m++) {
        new_flow = 0.0f;
        for (n = 0; n < DIRECTIONS; n++)
            new_flow += omega(m,n)*from(i,j,k,n);
        to[dist(i,j,k,m)] = from(i,j,k,m) + new_flow*medium_density;
    }
}
The from and to arguments hold the addresses of the two copies of the flow values. Macros from(i,j,k,n), dist(i,j,k,m), and omega(m,n) simply provide convenient indexing into the arrays of the same names, which are necessarily one-dimensional in GPU storage. Both the from and to arrays hold an extra index, a “sink,” and if the exiting flow from any site in any direction would be off lattice, the value of the dist(i,j,k,m) index is this “sink.”
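These macros are not listed here; one plausible set of definitions, assuming the “natural” site-major layout discussed below with DIRECTIONS varying fastest, would be:

#define from(i,j,k,n)   from[((((i)*HEIGHT+(j))*DEPTH+(k))*DIRECTIONS+(n))]
#define dist(i,j,k,m)   dist[((((i)*HEIGHT+(j))*DEPTH+(k))*DIRECTIONS+(m))]
#define omega(m,n)      omega[((m)*DIRECTIONS+(n))]
#define density(i,j,k)  density[(((i)*HEIGHT+(j))*DEPTH+(k))]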
This kernel is attractive in its simplicity, and it offers zero control flow divergence because there are no conditionals. Nevertheless, although the kernel will produce correct results, its performance advantage over a CPU implementation is modest. For the 256 × 128 × 128 array used in the first example of the next section, a CPU implementation running on an Intel i7 X980 3.33 GHz processor required 19.5 minutes. The OpenCL implementation with this kernel required 2.8 minutes. With kernel modifications, the total execution time can be reduced to 6 seconds, of which 1 second is CPU initialization and 5 seconds are GPU execution.
The first important modification is to avoid reloading the values from(i,j,k,n) from global device memory. Instead, we force storage into registers by declaring 19 local variables, float f00, f01, …, f18. We then load each directional value once, unroll the interior loop, and use these local variables instead of from(i,j,k,n).
The second important modification is to rearrange storage to facilitate coalescing of memory requests. Because we have WIDTH × HEIGHT × DEPTH lattice sites, and each of these has 19 DIRECTIONS, it is natural to access linear storage as
from[((((i)*HEIGHT+(j))*DEPTH+(k))*DIRECTIONS+(m))]
but this leads to relatively poor performance. Instead, since DEPTH is usually a large power of 2, we interchange DEPTH and DIRECTIONS so that the DEPTH index varies most rapidly in linear storage. Because the threads of a work-group differ only in their k (DEPTH) index, adjacent threads then access adjacent words of global memory, and their requests can be coalesced:
from[((((i)*HEIGHT+(j))*DIRECTIONS+(m))*DEPTH+(k))]
This rearrangement applies to both copies of the photon density values, focl[2], and the integer dist[] array.
The third modification, though least important in terms of performance gained, is probably the most surprising: performance improves when we add conditionals and extra computation to this kernel! In particular, if we eliminate the dist[] array, instead pass only the 19 directions, and calculate the indices for the exiting photon density on the fly, performance improves by a small amount. The conditionals and the index computations, which are required to avoid indices that would be off lattice, are less expensive than accesses to the global dist[] array. Because flow divergence then occurs only between boundary nodes and interior nodes, and we place each DEPTH stripe in a single thread block using local work size (1, 1, DEPTH), the effect of divergence is minimal.
The final kernel is then given by
#define nindex(i,j,k)   (((i)*HEIGHT+(j))*DEPTH+(k))
#define bindex(i,j,k)   ((((i)*HEIGHT+(j))*DIRECTIONS*DEPTH)+(k))
#define dindex(i,j,k,m) ((((i)*HEIGHT+(j))*DIRECTIONS+(m))*DEPTH+(k))
#define inbounds(q,r,s) ((q > 0)&&(q < (WIDTH - 1))&&(r > 0)&&(r < (HEIGHT - 1))&&(s > 0)&&(s < (DEPTH - 1)))

__kernel void update(__global float* from, __global float* to,
                     __constant int* direction, __constant float* omega,
                     __global float* density)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    const int k = get_global_id(2);
    const int lnindex = nindex(i,j,k);
    const int lbindex = bindex(i,j,k);
    int m;
    float cloud_density, new_flow;
    float f00, f01, f02, f03, …, f18;
    int outi, outj, outk;

    cloud_density = density[lnindex];
    f00 = from[lbindex+0*DEPTH];
    f01 = from[lbindex+1*DEPTH];
    f02 = from[lbindex+2*DEPTH];
    f03 = from[lbindex+3*DEPTH];
    …
    f18 = from[lbindex+18*DEPTH];

    for (m = 0; m < DIRECTIONS; m++) {
        outi = i + direction[3*m+0];
        outj = j + direction[3*m+1];
        outk = k + direction[3*m+2];
        if (inbounds(outi,outj,outk)) {
            new_flow = 0.0f;
            new_flow += omega(m,0)*f00;
            new_flow += omega(m,1)*f01;
            new_flow += omega(m,2)*f02;
            new_flow += omega(m,3)*f03;
            …
            new_flow += omega(m,18)*f18;
            to[dindex(outi,outj,outk,m)] = from[lbindex+m*DEPTH] + new_flow*cloud_density;
        }
    }
}
Profiling this kernel shows 100% coalesced global memory reads and writes and an occupancy of 0.5. Transfer from/to global memory is measured at 67 GB/sec, a significant portion of the 115 GB/sec available bandwidth for the GTX 480 as reported by the oclBandwidthTest utility. Note that this figure does not include transfer from (cached) constant memory, which is measured at a significantly higher rate, 718 GB/sec.
On devices of lower compute capability, this kernel can generate a significant number of uncoalesced global memory writes. Nevertheless, with a simple kernel modification we can force these writes to be coalesced as well. Because we use a local work size vector of (1, 1, DEPTH), we can collect DEPTH output values, one for each outk, in a local (shared) memory array of size DEPTH, synchronize the threads using a barrier() call, and then write the entire DEPTH stripe in order. The final kernel, revised for this lower compute capability, is then:
#define nindex(i,j,k)   (((i)*HEIGHT+(j))*DEPTH+(k))
#define bindex(i,j,k)   ((((i)*HEIGHT+(j))*DIRECTIONS*DEPTH)+(k))
#define dindex(i,j,k,m) ((((i)*HEIGHT+(j))*DIRECTIONS+(m))*DEPTH+(k))
#define inbounds(q,r,s) ((q > 0)&&(q < (WIDTH - 1))&&(r > 0)&&(r < (HEIGHT - 1))&&(s > 0)&&(s < (DEPTH - 1)))

__kernel void update(__global float* from, __global float* to,
                     __constant int* direction, __constant float* omega,
                     __global float* density)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    const int k = get_global_id(2);
    const int lnindex = nindex(i,j,k);
    const int lbindex = bindex(i,j,k);
    __local float getout[DEPTH];
    float cloud_density, new_flow;
    float f00, f01, f02, f03, …, f18;
    int m, outi, outj, outk;

    cloud_density = density[lnindex];
    f00 = from[lbindex+0*DEPTH];
    f01 = from[lbindex+1*DEPTH];
    f02 = from[lbindex+2*DEPTH];
    …
    f18 = from[lbindex+18*DEPTH];

    for (m = 0; m < DIRECTIONS; m++) {
        /* stage this thread's output for direction m in local memory */
        outk = k + direction[3*m+2];
        if (outk > 0 && outk < (DEPTH - 1)) {
            new_flow = 0.0f;
            new_flow += omega(m,0)*f00;
            new_flow += omega(m,1)*f01;
            new_flow += omega(m,2)*f02;
            …
            new_flow += omega(m,18)*f18;
            getout[outk] = from[lbindex+m*DEPTH] + new_flow*cloud_density;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        /* write the DEPTH stripe in order: thread k writes the value destined for depth index k */
        outi = i + direction[3*m+0];
        outj = j + direction[3*m+1];
        if (inbounds(outi,outj,k))
            to[dindex(outi,outj,k,m)] = getout[k];
        barrier(CLK_LOCAL_MEM_FENCE);  /* keep getout intact until all threads have read it */
    }
}
Testing this modified kernel on a Quadro FX 5600 (compute capability 1.0) shows that it entirely eliminates the uncoalesced writes generated by the original kernel, and execution time improves by slightly more than 25%. On the GTX 480, where there were no uncoalesced writes to begin with, the modified kernel provides no improvement; in fact, the overhead of the added conditional and barrier synchronization makes it slightly slower than the original.