algorithm, which causes divergence among CUDA threads. A thread block
is executed in a single-instruction-multiple-thread (SIMT) fashion in which
warps of thirty-two threads are executed across four clock cycles in subsets of
eight threads that share a common instruction. If those eight threads do not
share a common instruction, such as when conditionals cause branching code
paths, the threads diverge and are executed serially.
This situation occurs frequently in this algorithm. For example, suppose
a thread block owns a region of the image that only partially covers the data
volume. Some of the threads in that block exit immediately due to the empty-
space skipping optimization in the algorithm, while the other threads proceed
to cast rays through the volume. Moreover, the threads that do proceed with
ray casting may have rays of different lengths, which causes further
divergence and load imbalance.
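The branching structure described here can be sketched as follows. This is an illustrative kernel written for this discussion, not the study's actual code; the step-count test and the sampling expression are placeholders.

#include <cuda_runtime.h>

__global__ void castRays(const float *volume, float *image,
                         int width, int height, int maxSteps)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Empty-space skipping: threads whose rays miss the volume exit at once,
    // while their warp-mates keep going, so the warp diverges immediately.
    int numSteps = (x < width / 4) ? 0 : 1 + (x + y) % maxSteps;   // placeholder test
    if (numSteps == 0) { image[y * width + x] = 0.0f; return; }

    // Rays of different lengths keep the warp resident until its longest ray
    // finishes; the early-termination test adds a further branch.
    float opacity = 0.0f;
    for (int s = 0; s < numSteps && opacity < 0.99f; ++s)
        opacity += 0.001f * volume[(x + y + s) % (width * height)]; // placeholder sample
    image[y * width + x] = opacity;
}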
Since a warp must be scheduled across at least four clock cycles, using fewer
than four threads per thread block guarantees under-utilization. Configurations
with fewer than four threads per block were therefore excluded from the parameter
sweep. The study shows empirically that the sweet spot for a thread block
size is 16 or 32 threads, depending on the memory ordering and whether ERT
(early ray termination) is enabled. Many block sizes with 16 threads perform well, even though this
number is less than the warp size of 32 threads, indicating the complex inter-
action of the CUDA runtime and warp scheduler in handling the branching
for this particular algorithm and problem. It is also likely that larger thread
blocks exhibit greater load imbalance because the variation in ray lengths
tends to increase with block size.
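As an illustration only (the shapes and the filter below are assumptions made for this sketch, not the study's actual sweep), a host-side enumeration of candidate 2-D block shapes might look like this:

#include <cstdio>

int main()
{
    // Candidate 2-D block shapes; shapes with fewer than four threads are
    // skipped because they guarantee under-utilization, as noted above.
    int shapes[][2] = { {1, 1}, {2, 1}, {2, 2}, {4, 2}, {4, 4},
                        {8, 2}, {8, 4}, {16, 1}, {16, 2}, {32, 1} };
    for (auto &s : shapes) {
        int threads = s[0] * s[1];
        if (threads < 4) continue;
        printf("sweep candidate: %d x %d (%d threads)\n", s[0], s[1], threads);
        // ... launch the ray-casting kernel with dim3(s[0], s[1]) and record its runtime
    }
    return 0;
}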
Surprisingly, the small thread blocks that display the worst performance
also exhibit the fewest L2 cache misses (see Figure 14.6). Note that the converse
is not true: the best-performing block sizes do not show the most L2
cache misses. Instead, L2 cache misses appear to rise uniformly with the total
number of threads in a block, leading to the same diagonal striping as seen in
the runtime plot. The study suggests that achieving the best performance on
the GPU is a trade-off between using enough threads to saturate a warp and
using few enough threads to maintain good cache utilization. The
study also finds that, as in the CPU tests, there are systematically fewer
L2 cache misses when using the Z-ordered memory layout on the GPU, because of the
improved spatial locality.
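For reference, a Z-order (Morton) index interleaves the bits of the x, y, and z coordinates so that samples that are close in 3D stay close in memory. The helper below is a generic sketch of that mapping, not the chapter's implementation:

__host__ __device__ unsigned int mortonIndex3D(unsigned int x,
                                               unsigned int y,
                                               unsigned int z)
{
    unsigned int index = 0;
    for (unsigned int bit = 0; bit < 10; ++bit) {   // supports volumes up to 1024^3
        // Interleave one bit from each coordinate per iteration.
        index |= ((x >> bit) & 1u) << (3 * bit);
        index |= ((y >> bit) & 1u) << (3 * bit + 1);
        index |= ((z >> bit) & 1u) << (3 * bit + 2);
    }
    return index;
}

Under such a layout, a row-major fetch like volume[(z * ny + y) * nx + x] would instead become volume[mortonIndex3D(x, y, z)], so consecutive samples along a ray tend to land in nearby cache lines.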
Interestingly, the NVIDIA CUDA Programming Guide [24] says: “The ef-
fect of execution configuration on performance for a given kernel call gener-
ally depends on the kernel code. Experimentation is therefore recommended.”
These experiments show a wide variation in performance depending upon
thread block size. While such variation isn’t all that surprising, the amount
of variation—as much as 265%—is somewhat unexpected, as is the fact that
the optimal block size for one problem is not the same as for another problem
when run on the same platform.
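In that spirit, a minimal timing harness of the kind the guide recommends might look like the following. The kernel is a stand-in for real work, the block shapes are arbitrary examples, and error checking is omitted for brevity.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelUnderTest(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;                   // stand-in for the real work
}

static float timeLaunch(dim3 block, float *out, int n)
{
    dim3 grid((n + block.x - 1) / block.x);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernelUnderTest<<<grid, block>>>(out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    float *out;
    cudaMalloc(&out, n * sizeof(float));
    printf("16-thread blocks: %.3f ms\n", timeLaunch(dim3(16), out, n));
    printf("32-thread blocks: %.3f ms\n", timeLaunch(dim3(32), out, n));
    cudaFree(out);
    return 0;
}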