algorithm, which causes divergence among CUDA threads. A thread block
is executed in a single-instruction-multiple-thread (SIMT) fashion in which
warps of thirty-two threads are executed across four clock cycles in subsets of
eight threads that share a common instruction. If those eight threads do not
share a common instruction, such as when conditionals cause branching code
paths, the threads diverge and are executed serially.
This situation occurs frequently in this algorithm. For example, suppose
a thread block owns a region of the image that only partially covers the data
volume. Some of the threads in that block exit immediately due to the empty-
space skipping optimization in the algorithm, while the other threads proceed
to cast rays through the volume. However, the threads that do proceed with
ray casting may trace rays of different lengths, which also causes divergence
and load imbalance.
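To make the two divergence sources concrete, the following is a minimal CUDA sketch of a ray casting kernel. All names, helpers, and constants here are hypothetical stand-ins for illustration, not the study’s actual code; the helper bodies are stubs so the sketch compiles.

#define STEP_SIZE 0.01f  // assumed integration step size

// Hypothetical helpers, stubbed so the sketch compiles; a real renderer
// would perform a ray-box intersection and sample/composite the volume.
__device__ bool intersectVolume(int x, int y, float* tNear, float* tFar)
{
    *tNear = 0.0f;
    *tFar  = ((x + y) % 7) * 0.1f;  // stand-in: ray lengths vary per pixel
    return *tFar > 0.0f;            // stand-in: some rays miss the volume
}

__device__ float4 integrateSample(const float* volume, int x, int y,
                                  float t, float4 color)
{
    color.w += 0.01f;  // stand-in for sampling and front-to-back compositing
    return color;
}

__global__ void raycastKernel(const float* volume, float4* image,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Divergence source 1: threads whose rays miss the data volume exit
    // immediately (empty-space skipping) while their warp-mates continue.
    float tNear, tFar;
    if (!intersectVolume(x, y, &tNear, &tFar)) {
        image[y * width + x] = make_float4(0.f, 0.f, 0.f, 0.f);
        return;
    }

    // Divergence source 2: rays have different lengths, so threads in the
    // same warp execute this loop a different number of times.
    float4 color = make_float4(0.f, 0.f, 0.f, 0.f);
    for (float t = tNear; t < tFar; t += STEP_SIZE) {
        color = integrateSample(volume, x, y, t, color);
    }
    image[y * width + x] = color;
}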
Since a warp must be scheduled across at least four clock cycles, using fewer
than four threads per thread block guarantees under-utilization. Configura-
tions with fewer than four threads per block were therefore excluded from the
parameter sweep. The study shows empirically that the sweet spot for a thread block
size is 16 or 32 threads, depending on the memory ordering and whether ERT
is enabled. Many block sizes with 16 threads perform well, even though this
number is less than the warp size of 32 threads, indicating the complex inter-
action of the CUDA runtime and warp scheduler in handling the branching
for this particular algorithm and problem. It is also likely that larger thread
blocks exhibit greater load imbalance, because the variation in ray lengths
tends to increase with block size.
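A sweep like this can be run empirically by timing the same kernel under each candidate launch configuration. The following host-side sketch, reusing the hypothetical raycastKernel above, illustrates one way to do it with CUDA events; error checking is omitted for brevity, and the set of block sizes is an assumption.

#include <cstdio>

void sweepBlockSizes(const float* d_volume, float4* d_image,
                     int width, int height)
{
    const int dims[] = {1, 2, 4, 8, 16, 32};
    for (int bx : dims) {
        for (int by : dims) {
            if (bx * by < 4)    continue;  // excluded: guaranteed under-utilization
            if (bx * by > 1024) continue;  // exceeds the hardware block-size limit
            dim3 block(bx, by);
            dim3 grid((width + bx - 1) / bx, (height + by - 1) / by);

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start);
            raycastKernel<<<grid, block>>>(d_volume, d_image, width, height);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("block %2dx%-2d : %8.3f ms\n", bx, by, ms);
            cudaEventDestroy(start);
            cudaEventDestroy(stop);
        }
    }
}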
Surprisingly, the small thread blocks that display the worst performance
also exhibit the fewest L2 cache misses (see Figure 14.6). Note that the
converse is not true: the best-performing block sizes do not show the most L2
cache misses. Instead, L2 cache misses appear to rise uniformly with the total
number of threads in a block, leading to the same diagonal striping as seen in
the runtime plot. The study suggests that achieving the best performance on
the GPU is a trade-off between using enough threads to saturate a warp and
using few enough threads to maintain good cache utilization. The study
also finds that, as in the CPU tests, L2 cache misses are systematically
lower when using the Z-ordered memory layout on the GPU because of its
improved spatial locality.
Interestingly, the NVIDIA CUDA Programming Guide [24] says: “The ef-
fect of execution configuration on performance for a given kernel call gener-
ally depends on the kernel code. Experimentation is therefore recommended.”
These experiments show a wide variation in performance depending upon
thread block size. While such variation is not all that surprising, the amount
of variation—as much as 265%—is somewhat unexpected, as is the fact that
the optimal block size for one problem is not the same as for another problem
when run on the same platform.
FIGURE 14.7: Two different transfer functions have different benefits from
early ray termination. They also yield images that accentuate different features
of the underlying data set. (a) Transfer function “A”; (b) transfer function
“B.” The performance tests in this study use transfer function “A.” Image
source: Bethel and Howison, 2012 [3].
14.3.3.2 Algorithmic Optimization: Early Ray Termination
The study includes evaluating the performance impact of an algorithmic
optimization called early ray termination (ERT). In ERT, a conditional within
the inner-most loop that iterates along an individual ray tests whether the
ray has reached full opacity, meaning that any further calculations would not
contribute to the final pixel color. The effect of this optimization depends
both on the input data and on the transfer function that determines how data
values map to color and opacity. For the study’s specific data set and transfer
function, the authors saw approximately 10% fewer integration steps when
ERT was enabled, and this reduction was directly reflected in the reported
runtimes: in the cases where ERT was enabled, there was a 10–15% improve-
ment in absolute runtime.
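In code, ERT amounts to a single opacity test inside the integration loop. Here is a hedged sketch, reusing the hypothetical integrateSample helper and STEP_SIZE constant from the earlier kernel sketch; the 0.99 threshold is an assumed value, and front-to-back compositing is assumed.

__device__ float4 integrateRayWithERT(const float* volume, int x, int y,
                                      float tNear, float tFar)
{
    const float ERT_THRESHOLD = 0.99f;  // assumed "full opacity" cutoff
    float4 color = make_float4(0.f, 0.f, 0.f, 0.f);
    for (float t = tNear; t < tFar; t += STEP_SIZE) {
        color = integrateSample(volume, x, y, t, color);
        // Early ray termination: once accumulated opacity is (nearly) 1,
        // further samples cannot visibly change the pixel, so stop. Note
        // that the test itself adds a branch on every step, which is why
        // ERT can cost performance on semi-transparent scenes.
        if (color.w >= ERT_THRESHOLD) break;
    }
    return color;
}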
To demonstrate the relationship between the transfer function and the
benefits of using ERT, the study included an additional test with a “shallower”
transfer function (see Fig. 14.7b), under which rays did not penetrate as deeply into the
volume. With ERT enabled on the shallower transfer function, the algorithm
performed 19.7% fewer integration steps and ran 19.1% faster. The reduced
runtime reflects the reduced amount of computational work.
Even though the use of ERT made the problem run faster, there may be
a qualitative cost associated with the reduced runtime. The image rendered
with transfer function “B” shows significant qualitative differences in many
areas. The “interior” features of the data set that are visualized may or may
not be appropriate according to the application.
Since the benefits of ERT depend highly on the scene characteristics and
[Figure 14.8 plot: “Performance Gain of Z-order over Array-order”; percent
faster versus concurrency for Intel Nehalem, AMD MagnyCours, and NVIDIA
Fermi.]
FIGURE 14.8: The performance gains (averaged across thread block size,
with ERT) from using Z-ordered memory instead of array-ordered increase
with concurrency. The gains are most notable on the NVIDIA/Fermi, where
the number of threads per block corresponds to the level of “per-processor”
concurrency. Image source: Bethel and Howison, 2012 [3].
transfer function, it is impossible to say with certainty how much ERT will
help algorithm performance. Earlier work investigating the use of ERT in ray
casting volume rendering [18] reports a 300% improvement in runtime for an
opaque volume, but a 30% performance penalty when run on a semi-
transparent volume. The penalty is attributed to the extra overhead of evaluat-
ing the conditional in the inner integration loop. The conditional can have an
adverse impact on performance, particularly for GPU implementations where
divergent code paths are costly. The effects—benefits or drawbacks—of ERT
are highly variable. There are combinations of data sets and transfer functions
where ERT has no effect, and others where it leads to a dramatic reduction
in the number of integration steps.
14.3.3.3 Algorithmic Optimization: Z-Ordered Memory
The study shows that configurations using the Z-order memory layout
outperform their array-order counterparts on all platforms and at all concur-
rency levels, and that the performance gains increase with concurrency
(see Fig. 14.8). At higher concurrency, the benefits of Z-ordering
are most likely due to the larger penalties for non-contiguous access and cache
misses in shared-memory systems that service many cores, such as the AMD/-
MagnyCours and NVIDIA/Fermi. While the Intel/Nehalem and AMD/Mag-
nyCours show modest gains from Z-ordered memory, ranging from 3% to 29%,
Performance Optimization and Auto-Tuning 325
the NVIDIA/Fermi exhibits a gain of 146% at 64-way concurrency, that is,
64 threads per thread block.
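Z-ordering replaces the usual row-major (array-order) index with a Morton index formed by interleaving the bits of the coordinates, so that points that are close in 3D stay close in memory. Below is a minimal sketch of the standard bit-interleaving technique, assuming power-of-two dimensions of at most 1024 per axis; the source text does not give the study’s implementation, and real codes often use lookup tables instead.

// Spread the low 10 bits of v so there are two zero bits between each
// original bit: ...cba  ->  ..c..b..a
__host__ __device__ inline unsigned int part1By2(unsigned int v)
{
    v &= 0x000003FFu;
    v = (v | (v << 16)) & 0xFF0000FFu;
    v = (v | (v << 8))  & 0x0300F00Fu;
    v = (v | (v << 4))  & 0x030C30C3u;
    v = (v | (v << 2))  & 0x09249249u;
    return v;
}

// Interleave the x, y, z bits: nearby (x, y, z) samples map to nearby
// linear indices, which is what improves spatial locality during traversal.
__host__ __device__ inline unsigned int mortonIndex(unsigned int x,
                                                    unsigned int y,
                                                    unsigned int z)
{
    return part1By2(x) | (part1By2(y) << 1) | (part1By2(z) << 2);
}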
The improved locality of Z-ordered memory access, and the correspondingly
lower L2 miss rates, become increasingly beneficial as more cores access mem-
ory through a shared-memory controller and subsystem. The NVIDIA/Fermi,
whose memory subsystem must service 14 multi-processors (each with many
thread blocks in flight), realizes the biggest performance gains from Z-ordering.
While the slower block configurations improve somewhat with Z-ordering, the
best configurations improve greatly, leading to a larger variation for Z-ordering
than array-ordering.
In practice, since most data sets are not stored in a Z-order layout, there
would be a cost associated with reordering data. That cost is not included
in the study. However, in the case where multiple visualization images would
be generated from the same data set, that one-time cost is amortized across
many renderings.
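That one-time reordering cost amounts to a single pass over the volume. A hedged host-side sketch, using the mortonIndex helper above and assuming a cubic, power-of-two volume so the Morton indices are dense:

// One-time reordering pass from array (row-major) order into Z-order.
// Its O(n) cost is amortized when many images are rendered from the
// same reordered volume.
void reorderToZOrder(const float* src, float* dst, unsigned int dim)
{
    for (unsigned int z = 0; z < dim; ++z)
        for (unsigned int y = 0; y < dim; ++y)
            for (unsigned int x = 0; x < dim; ++x)
                dst[mortonIndex(x, y, z)] = src[(z * dim + y) * dim + x];
}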
14.3.4 Lessons Learned
This optimization study looks at two different dimensions of the perfor-
mance optimization problem on three different platforms: What values for
tunable algorithmic parameters produce the best performance? Are there al-
gorithmic optimizations that might make a difference in performance? And
how, and why, do these performance characteristics vary on different multi-
core CPU and many-core GPU platforms?
The study shows there is about a 2.5× performance gain to be had through
careful selection of a tunable algorithmic parameter: the thread block/image
tile size. This variation reflects differences in load balance and in better or
worse use of the memory hierarchy. This magnitude of performance gain
appears to be consistent across all three platforms.
The memory layout algorithmic optimization has a positive effect on per-
formance, but the amount of performance gain varies depending on platform
and on the settings of the tunable algorithmic parameter. The study shows
performance gains ranging from 3–29% on multi-core CPU platforms, and up
to 146% on a many-core GPU platform. The large difference in performance
between these two platforms reflects the difference in performance character-
istics in the memory subsystems.
The benefit of the other algorithmic optimization—early ray termination—
depends completely on the data or problem. In some problem configurations,
the extra cost of the conditional to implement the ERT optimization incurs a
performance penalty. In other configurations, ERT can result in a substantial
performance gain. These penalties or gains seem to be largely independent of
memory layout or algorithmic tuning.
14.4 Conclusion
The main point of these studies was to explore the relationship between
tunable algorithmic parameters and known algorithmic optimizations, and
also, to explore the resulting impact on performance for two different visual-
ization algorithms on modern multi-core and many-core platforms. The studies
show that the algorithmic parameters that produce the best performance vary
from problem to problem and platform to platform, often in a non-obvious
way. The GPU volume rendering study results emphasize this conclusion: op-
timal performance occurs at what appears to be a crossover point between
memory cache utilization and thread warp size. By empirically measuring the
performance of different algorithmic parameter settings, the study’s authors
were able to find the best performing configurations. This approach is used
both successfully and widely in the computational science community
[17, 9, 7].
Some algorithms, like ray casting volume rendering with perspective pro-
jection, use an unstructured, or irregular, memory access pattern. Each ray
will, in effect, execute a different traversal path through memory. The set of
paths and the number of computational steps required for each ray are a func-
tion of runtime parameters, such as the viewpoint, data set, and color transfer
function. Other visualization algorithms exhibit similar memory access char-
acteristics, like parallel streamline computation [5], and can benefit from this
performance optimization methodology. It may be difficult, if not impossible,
to develop a performance model for such algorithms—the ones with unpre-
dictable memory access patterns. An auto-tuning methodology is, therefore, a
promising approach for finding the optimal settings of tunable algorithmic
parameters.
The performance improvements to be realized can have tangible impacts:
the results of the volume rendering study provided the basis for adjusting the
tunable algorithmic parameters for a set of extreme-concurrency runs [14, 15]
that require millions of CPU hours. By finding and using optimal settings for
the tunable algorithmic parameters, those runs, in effect, saved millions of
additional CPU hours that would otherwise have been spent executing the
application in a non-optimal configuration.