List of Figures xvii
12.11 Gantt chart showing a comparison of integration and I/O
performance/activity of the parallelize over seeds P_T and P_H
versions for one of the benchmark runs. ........... 283
12.12 Performance comparison of the P_H and P_T variants of the
parallelize over blocks algorithm. ................. 284
12.13 Gantt chart showing a comparison of integration, I/O,
MPI_Send, and MPI_Recv performance/activity of the parallelize
over blocks P_T and P_H versions for one of the benchmark
runs. ............................... 285
13.1 Contouring of two trillion cells, visualized with VisIt on
Franklin using 32,000 cores. .................. 294
13.2 Plots of execution time for the I/O, contouring, and rendering
phases of the trillion cell visualizations over six supercomputing
environments. ......................... 296
13.3 Contouring of replicated data (one trillion cells total), visual-
ized with VisIt on Franklin using 16,016 cores. ........ 299
13.4 Rendering of an isosurface from a 321 million cell Denovo
simulation, produced by VisIt using 12,270 cores of JaguarPF. 301
13.5 Volume rendering of data from a 321 million cell Denovo sim-
ulation, produced by VisIt using 12,270 cores on JaguarPF. 302
13.6 Volume rendering of one trillion cells, visualized by VisIt on
JaguarPF using 16,000 cores. ................. 303
14.1 Comparison of Gaussian and bilateral smooth applied to a
synthetic, noisy dataset. ..................... 312
14.2 Three different 3D memory access patterns have markedly different
performance characteristics on a many-core GPU platform. ...... 313
14.3 Using GPU-specific features can produce a 2× performance
gain. ............................... 315
14.4 Filter performance has a 7.5× variation depending upon the
settings of tunable algorithmic parameters. ........... 315
14.5 Chart showing how filter runtime performance on the GPU
varies as a function of CUDA thread block size. ....... 316
14.6 Parallel ray casting volume rendering performance measures
on the GPU include absolute runtime, and L2 cache miss
rates. ............................... 321
14.7 Examples showing how different transfer functions produce
differing visible and performance results in parallel volume
rendering. ............................ 323
14.8 Performance gains on the GPU using Z-ordered memory increase
with increased concurrency. ................ 324