algorithm, which causes divergence among CUDA threads. A thread block
is executed in a single-instruction-multiple-thread (SIMT) fashion in which
warps of thirty-two threads are executed across four clock cycles in subsets of
eight threads that share a common instruction. If those eight threads do not
share a common instruction, such as when conditionals cause branching code
paths, the threads diverge and are executed serially.
This situation occurs frequently in this algorithm. For example, suppose
a thread block owns a region of the image that only partially covers the data
volume. Some of the threads in that block exit immediately due to the empty-
space skipping optimization in the algorithm, while the other threads proceed
to cast rays through the volume. Moreover, the threads that do proceed with
ray casting may have rays of different lengths, which causes further
divergence and load imbalance.
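The branching structure described here can be sketched as follows. This is an illustrative kernel written for this discussion, not the study's actual code; the step-count test and the sampling expression are placeholders.

#include <cuda_runtime.h>

__global__ void castRays(const float *volume, float *image,
                         int width, int height, int maxSteps)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Empty-space skipping: threads whose rays miss the volume exit at once,
    // while their warp-mates keep going, so the warp diverges immediately.
    int numSteps = (x < width / 4) ? 0 : 1 + (x + y) % maxSteps;   // placeholder test
    if (numSteps == 0) { image[y * width + x] = 0.0f; return; }

    // Rays of different lengths keep the warp resident until its longest ray
    // finishes; the early-termination test adds a further branch.
    float opacity = 0.0f;
    for (int s = 0; s < numSteps && opacity < 0.99f; ++s)
        opacity += 0.001f * volume[(x + y + s) % (width * height)]; // placeholder sample
    image[y * width + x] = opacity;
}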
Since a warp must be scheduled across at least four clock cycles, using fewer
than four threads per thread block guarantees under-utilization. Configurations
with fewer than four threads per block were therefore excluded from the parameter
sweep. The study shows empirically that the sweet spot for a thread block
size is 16 or 32 threads, depending on the memory ordering and whether ERT
(early ray termination) is enabled. Many block sizes with 16 threads perform well, even though this
number is less than the warp size of 32 threads, indicating the complex inter-
action of the CUDA runtime and warp scheduler in handling the branching
for this particular algorithm and problem. It is also likely that larger thread
blocks exhibit greater load imbalance because the variation in ray lengths
tends to increase with block size.
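As an illustration only (the shapes and the filter below are assumptions made for this sketch, not the study's actual sweep), a host-side enumeration of candidate 2-D block shapes might look like this:

#include <cstdio>

int main()
{
    // Candidate 2-D block shapes; shapes with fewer than four threads are
    // skipped because they guarantee under-utilization, as noted above.
    int shapes[][2] = { {1, 1}, {2, 1}, {2, 2}, {4, 2}, {4, 4},
                        {8, 2}, {8, 4}, {16, 1}, {16, 2}, {32, 1} };
    for (auto &s : shapes) {
        int threads = s[0] * s[1];
        if (threads < 4) continue;
        printf("sweep candidate: %d x %d (%d threads)\n", s[0], s[1], threads);
        // ... launch the ray-casting kernel with dim3(s[0], s[1]) and record its runtime
    }
    return 0;
}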
Surprisingly, the small thread blocks that display the worst performance
also exhibit the fewest L2 cache misses (see Figure 14.6). Note that the converse
is not true: the best-performing block sizes do not show the most L2
cache misses. Instead, L2 cache misses appear to rise uniformly with the total
number of threads in a block, leading to the same diagonal striping as seen in
the runtime plot. The study suggests that achieving the best performance on
the GPU is a trade-off between using enough threads to saturate a warp and
using few enough threads to maintain good cache utilization. The
study also finds that, as in the CPU tests, there are systematically fewer
L2 cache misses when using the Z-ordered memory layout on the GPU, because of the
improved spatial locality.
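For reference, a Z-order (Morton) index interleaves the bits of the x, y, and z coordinates so that samples that are close in 3D stay close in memory. The helper below is a generic sketch of that mapping, not the chapter's implementation:

__host__ __device__ unsigned int mortonIndex3D(unsigned int x,
                                               unsigned int y,
                                               unsigned int z)
{
    unsigned int index = 0;
    for (unsigned int bit = 0; bit < 10; ++bit) {   // supports volumes up to 1024^3
        // Interleave one bit from each coordinate per iteration.
        index |= ((x >> bit) & 1u) << (3 * bit);
        index |= ((y >> bit) & 1u) << (3 * bit + 1);
        index |= ((z >> bit) & 1u) << (3 * bit + 2);
    }
    return index;
}

Under such a layout, a row-major fetch like volume[(z * ny + y) * nx + x] would instead become volume[mortonIndex3D(x, y, z)], so consecutive samples along a ray tend to land in nearby cache lines.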
Interestingly, the NVIDIA CUDA Programming Guide [24] says: “The ef-
fect of execution configuration on performance for a given kernel call gener-
ally depends on the kernel code. Experimentation is therefore recommended.”
These experiments show a wide variation in performance depending upon
thread block size. While such variation isn’t all that surprising, the amount
of variation—as much as 265%—is somewhat unexpected, as is the fact that
the optimal block size for one problem is not the same as for another problem
when run on the same platform.
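In that spirit, a minimal timing harness of the kind the guide recommends might look like the following. The kernel is a stand-in for real work, the block shapes are arbitrary examples, and error checking is omitted for brevity.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelUnderTest(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;                   // stand-in for the real work
}

static float timeLaunch(dim3 block, float *out, int n)
{
    dim3 grid((n + block.x - 1) / block.x);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernelUnderTest<<<grid, block>>>(out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    float *out;
    cudaMalloc(&out, n * sizeof(float));
    printf("16-thread blocks: %.3f ms\n", timeLaunch(dim3(16), out, n));
    printf("32-thread blocks: %.3f ms\n", timeLaunch(dim3(32), out, n));
    cudaFree(out);
    return 0;
}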