(a) Runtimes normalized by the maximum highlight the poorest performing configurations.
(b) Runtimes normalized by the minimum highlight the best performing configurations.
FIGURE 14.5 Visualization of performance data collected by varying the number and size of the
GPU thread blocks for the 3D bilateral filter, shown at three different filter sizes, r = {1, 5, 11}.
In (a), the performance data (normalized to the maximum value) highlight the poorest performing
configurations; the red and yellow isocontours are close to the viewer. In (b), the performance data
(normalized to the minimum value) highlight the best performing configurations, which appear as
the cone-shaped red/yellow isocontours. Image source: Bethel, 2009.
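
The two normalizations named in the caption can be sketched as follows. This is a minimal illustration, assuming the literal reading in which each runtime is divided by the maximum (or minimum) runtime of the sweep; the function names and the sample runtimes are hypothetical and are not taken from the figure.

```python
def normalize_by_max(runtimes):
    """Divide by the slowest runtime, so the poorest configuration maps to 1.0."""
    worst = max(runtimes)
    return [t / worst for t in runtimes]


def normalize_by_min(runtimes):
    """Divide by the fastest runtime, so the best configuration maps to 1.0."""
    best = min(runtimes)
    return [t / best for t in runtimes]


# Hypothetical runtimes in seconds for four thread block configurations.
runtimes = [4.0, 2.0, 1.0, 8.0]

by_max = normalize_by_max(runtimes)  # slowest entry becomes 1.0
by_min = normalize_by_min(runtimes)  # fastest entry becomes 1.0

print(by_max)
print(by_min)
```

Under this reading, values near 1.0 mark the poorest configurations in the first case and the best configurations in the second, so an isocontour drawn near 1.0 picks out a different region of the parameter space in each view.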
                     Array Order                       Z Order
             1     2     4     8    16       1     2     4     8    16
NoERT   1  2.92  1.61  1.00  0.77  0.74    3.00  1.59  0.90  0.58  0.44
        2  1.64  0.98  0.71  0.63  0.68    1.58  0.86  0.51  0.35  0.30
        4  1.04  0.70  0.59  0.61  0.74    0.89  0.51  0.32  0.26  0.33
        8  0.81  0.63  0.61  0.68  0.73    0.56  0.34  0.25  0.27  0.32
       16  0.83  0.71  0.72  0.68  0.71    0.42  0.30  0.27  0.27  0.32
ERT     1  2.69  1.48  0.92  0.70  0.66    2.77  1.46  0.83  0.53  0.40
        2  1.51  0.90  0.64  0.57  0.60    1.46  0.79  0.47  0.32  0.27
        4  0.95  0.64  0.53  0.54  0.65    0.82  0.46  0.29  0.23  0.29
        8  0.73  0.56  0.54  0.60  0.63    0.51  0.31  0.23  0.24  0.28
       16  0.74  0.62  0.62  0.59  0.61    0.38  0.27  0.24  0.24  0.28
(a) Runtime (s)
                     Array Order                       Z Order
             1     2     4     8    16       1     2     4     8    16
NoERT   1    38    39    50   115   481      27    25    25    28    42
        2    32    36    57   238   632      18    17    17    22    67
        4    52    59   179   517   907      14    14    16    43   240
        8   139   243   527   839   873      14    17    43   182   202
       16   636   724   909   827   834      29    63   176   166   166
ERT     1    33    33    43    95   401      24    23    22    24    35
        2    25    29    48   193   530      16    15    15    19    56
        4    38    47   144   429   773      12    12    14    37   198
        8   106   192   436   711   741      12    14    37   153   163
       16   506   596   765   696   701      24    53   148   137   130
(b) L2 cache misses (millions)
FIGURE 14.6 Parallel ray casting volume rendering performance measures on the NVIDIA/Fermi
GPU include absolute runtime (a) and L2 cache miss rates (b), averaged over ten views for different
thread block sizes. Gray boxes indicate thread blocks with too few threads to fill a warp of
execution. Surprisingly, the best performing configurations do not correspond to the best use of the
memory hierarchy on that platform. Image source: Bethel and Howison, 2012.
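
The caption's last observation can be checked directly against the tabulated values. The sketch below uses only the NoERT, array-order panel of Figure 14.6 and looks up the thread block configuration with the lowest runtime and the one with the fewest L2 cache misses. The helper name `argmin_2d` is hypothetical, and because the extracted figure does not say which thread block dimension the rows and columns represent, results are reported simply as (row label, column label).

```python
SIZES = [1, 2, 4, 8, 16]  # labels along each thread block dimension

# NoERT, array-order panel of Figure 14.6(a): runtime in seconds.
RUNTIME_S = [
    [2.92, 1.61, 1.00, 0.77, 0.74],
    [1.64, 0.98, 0.71, 0.63, 0.68],
    [1.04, 0.70, 0.59, 0.61, 0.74],
    [0.81, 0.63, 0.61, 0.68, 0.73],
    [0.83, 0.71, 0.72, 0.68, 0.71],
]

# Same panel of Figure 14.6(b): L2 cache misses in millions.
L2_MISSES_M = [
    [38, 39, 50, 115, 481],
    [32, 36, 57, 238, 632],
    [52, 59, 179, 517, 907],
    [139, 243, 527, 839, 873],
    [636, 724, 909, 827, 834],
]


def argmin_2d(grid):
    """Return the (row label, column label) of the smallest entry in the grid."""
    _, i, j = min(
        (value, i, j)
        for i, row in enumerate(grid)
        for j, value in enumerate(row)
    )
    return SIZES[i], SIZES[j]


fastest = argmin_2d(RUNTIME_S)
fewest_misses = argmin_2d(L2_MISSES_M)

print("lowest runtime at block", fastest)
print("fewest L2 misses at block", fewest_misses)
print("same configuration:", fastest == fewest_misses)
```

For this one panel, the two lookups land on different configurations, which is consistent with the caption. The sketch does not account for the gray boxes (blocks with fewer than the 32 threads that make up a warp on NVIDIA GPUs), so it illustrates the comparison rather than reproducing the authors' analysis.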