cesses. In the case of the depth-row algorithm, each of the simultaneously
executing threads accesses source and destination data that differs by one
address location, i.e., contiguous blocks of memory. The other approaches,
in contrast, access source and destination memory in non-contiguous chunks.
The performance penalty for non-contiguous memory accesses is apparent in
Figure 14.2b.
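To make the distinction concrete, the following sketch contrasts the two access patterns in CUDA. It is an illustrative kernel pair, not the algorithm studied here: in the first kernel, adjacent threads touch adjacent addresses, as in the depth-row ordering; in the second, adjacent threads are separated by a stride, as in the non-contiguous orderings.

__global__ void copyContiguous(const float *src, float *dst, int n)
{
    // Adjacent threads read and write adjacent addresses, so the
    // hardware can service the accesses as contiguous blocks of memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

__global__ void copyStrided(const float *src, float *dst, int n, int stride)
{
    // Adjacent threads are separated by stride elements, so each group of
    // threads touches non-contiguous chunks of source and destination memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;
    if (i < n)
        dst[j] = src[j];
}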
This result shows that simply altering how an algorithm iterates through
memory can have a significant impact on performance. Furthermore, the result
will likely vary from platform to platform, depending on the characteristics
of the memory subsystem: different platforms have different memory prefetch
strategies, different sizes of memory caches, and so forth.
14.2.3.2 Device-Specific Feature: Constant Versus Global Memory
for Filter Weights
Another design consideration is how to take advantage of device-specific
features such as device memory types and memory speeds. The NVIDIA GPU
presents multiple types of memory, some slower and uncached, others faster
and cached. According to the NVIDIA CUDA Programming Guide [23], the amount
of global (slower, uncached) and texture (slower, cached) memory varies as a
function of specific customer options on the graphics card. Under CUDA v1.0
and higher, there are 64KB of constant memory and 16KB of shared memory that
reside on-chip and are visible to all threads. Generally speaking, device
memory (global and texture memory) has higher latency and lower bandwidth
than on-chip memory (constant and shared memory).
The authors wondered if storing, rather than recomputing, portions of the
problem in on-chip, high-speed memory (a device-specific feature) would offer
any performance advantage. The portions of the problem they stored, rather
than recomputed, were the filter weights, which are essentially a discretization
of a 3D Gaussian and a 1D Gaussian function.
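The sketch below illustrates how precomputed filter weights might be placed in CUDA constant memory and referenced from a kernel, versus reading them from global memory. The names, the filter size, and the simple one-dimensional convolution are illustrative assumptions, not the authors' implementation.

#define FILTER_SIZE 25                 // illustrative number of filter taps

// On-chip, cached constant memory: read-only on the device and limited
// to 64KB under CUDA v1.0 and higher.
__constant__ float c_weights[FILTER_SIZE];

// Host side (done once):
//   cudaMemcpyToSymbol(c_weights, h_weights, FILTER_SIZE * sizeof(float));

__global__ void filterConstant(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < FILTER_SIZE; ++k) {
        int j = i + k - FILTER_SIZE / 2;
        if (j >= 0 && j < n)
            sum += c_weights[k] * in[j];   // weights served from the constant cache
    }
    out[i] = sum;
}

// The same kernel with the weights passed as a pointer into global memory;
// each read of g_weights goes out to (slower, uncached) device memory.
__global__ void filterGlobal(const float *in, float *out,
                             const float *g_weights, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < FILTER_SIZE; ++k) {
        int j = i + k - FILTER_SIZE / 2;
        if (j >= 0 && j < n)
            sum += g_weights[k] * in[j];
    }
    out[i] = sum;
}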
The performance question is then: “How is performance impacted by hav-
ing all of the filter weights resident in on-chip rather than in device memory?”
The results, shown in Figure 14.3, indicate that the code runs about 2×
faster when the filter weights are resident in high-speed, on-chip constant
memory than when the weights reside in the device's (slower) global memory.
This result is not surprising given the different latency and bandwidth
characteristics of the two memory subsystems.
14.2.3.3 Tunable Algorithmic Parameter: Thread Block Size
In this particular problem, the tunable algorithmic parameter is the size
and shape of the CUDA thread block. The study’s objective is to find the
combination of block size parameters that produces optimal performance.
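As a sketch of what such a sweep might look like, the host code below launches one and the same kernel with several candidate block shapes. The trivial kernel and the specific shapes are assumptions for illustration, not the kernels or configurations studied here.

__global__ void touch(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] += 1.0f;
}

void sweepBlockShapes(float *d_data, int width, int height)
{
    // Candidate block shapes; the product of the dimensions must stay
    // within the device's threads-per-block limit.
    dim3 shapes[] = { dim3(32, 1), dim3(64, 4), dim3(32, 8), dim3(16, 16) };
    for (int c = 0; c < 4; ++c) {
        dim3 block = shapes[c];
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        touch<<<grid, block>>>(d_data, width, height);
        cudaDeviceSynchronize();   // time each shape and keep the fastest
    }
}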
A relatively simple example, shown in Figure 14.4, reveals that there is