228 High Performance Visualization
Wh^ĞƋƵĞŶƟĂůWƌŽŐƌĂŵdžĞĐƵƟŽŶ
'WhWĂƌĂůůĞů
džĞĐƵƟŽŶ
ŽŵƉƵƚĞ
WƌŽŐƌĂŵϭ
ŽŵƉƵƚĞ
WƌŽŐƌĂŵϮ
ŽŵƉƵƚĞ
WƌŽŐƌĂŵŶ
...
'WhWĂƌĂůůĞů
džĞĐƵƟŽŶ
'WhWĂƌĂůůĞů
džĞĐƵƟŽŶ
'WhWĂƌĂůůĞů
džĞĐƵƟŽŶ
PCIe PCIe
PCIe
PCIe PCIe PCIe
'WhDĞŵŽƌLJZĞƐŽƵƌĐĞƐ;dĞdžƚƵƌĞƐ͕ŽŶƐƚĂŶƚDĞŵŽƌLJ͕ƵīĞƌKďũĞĐƚƐͿ
ŽŵƉƵƚĞ
WƌŽŐƌĂŵϬ
FIGURE 11.2: GPGPU architecture with multiple compute programs called
in sequence from the CPU, but running each in parallel on the GPU.
Once the kernel is deployed, the GPU executes the code in parallel with full read
and write access to arbitrary locations in GPU memory. Furthermore, input
and output data of the kernels can be transferred over the PCIe bus between
system memory and GPU resources.
11.2.3 GPGPU Programming Languages
In the beginning of general purpose computing on graphics hardware
(GPGPU), plenty of research focused on the development of data structures
and special shader algorithms to solve non-graphics related problems [87]. At
that time, expert knowledge in graphics and shader programming was essen-
tial for developing hardware-accelerated code. To overcome this limitation,
new abstract programming interfaces were developed.
The Compute Unified Device Architecture (CUDA), developed by
NVIDIA [84], implements a GPGPU architecture and allows access to the
memory and the parallel computational elements of CUDA-capable GPUs
with a high-level API. The runtime environment is available for Windows,
Linux, and Mac OS X platforms and does not require any graphics library. The
programming of compute kernels is achieved with an extension to the C/C++
programming language or alternatively with a Fortran binding. In contrast to
shaders, CUDA kernels have access to arbitrary addresses in GPU memory
and still benefit from cached texture reading with hardware-accelerated inter-
polation. Furthermore, fast shared memory is available for thread cooperation
or for the implementation of a user-managed cache. The addition of double
precision floating-point arithmetic and the development of additional high-
level libraries are essential capabilities for general purpose computing. Some
of the high-level libraries include: cuFFT for fast Fourier transforms, cuBLAS
for basic linear algebra, or cuSPARSE for sparse matrix operations. A compre-
hensive introduction to CUDA programming is provided in the textbooks by
Sanders and Kandrot [101] and by Kirk and Hwu [53]. By now, applications
using CUDA cover a wide range of domains including, but not limited to,
computational fluid dynamics, molecular dynamics, and computational finance.
(For a good overview of practical applications, see GPU Computing Gems [46].)
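The host-side pattern described above — transfer input over PCIe, launch a kernel across many threads, transfer results back — can be sketched as follows. This is a plain C++ emulation in which a loop stands in for the parallel thread grid; the CUDA equivalents (`__global__`, `cudaMalloc`, `cudaMemcpy`, the `<<<blocks, threads>>>` launch) are only noted in comments, and all names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kernel body: in CUDA this would be a __global__ function, and `i`
// would come from blockIdx.x * blockDim.x + threadIdx.x.
void saxpy_kernel(int i, float a, const float* x, float* y) {
    y[i] = a * x[i] + y[i];   // each logical "thread" updates one element
}

// Host-side flow. In CUDA: cudaMalloc, cudaMemcpy over PCIe,
// kernel<<<blocks, threads>>> launch, cudaMemcpy back.
std::vector<float> run_saxpy(float a, const std::vector<float>& x,
                             const std::vector<float>& y) {
    // (1) "Transfer" the input to device memory: here just a host copy.
    std::vector<float> d_x = x, d_y = y;
    // (2) Launch: one independent logical thread per element.
    for (std::size_t i = 0; i < d_x.size(); ++i)
        saxpy_kernel(static_cast<int>(i), a, d_x.data(), d_y.data());
    // (3) Transfer the result back to system memory.
    return d_y;
}
```

The essential point is that the kernel body is ordinary C/C++ operating on one data element, while the runtime maps it onto thousands of hardware threads.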
A limitation of CUDA is vendor dependency in the choice of GPUs. There-
fore, the Khronos group [51], a non-profit industry consortium, made an effort
to develop an independent standard for programming heterogeneous comput-
ing devices including various kinds of GPUs and multi-core CPUs. The Open
Computing Language (OpenCL) provides parallel computing using task-based
and data-based parallelism. The OpenCL syntax is also based on the C pro-
gramming language and the architecture is very similar to CUDA, allowing
an easy transfer of knowledge. As OpenCL is an open specification, every
processor vendor can decide to implement the standard for their products.
Details on OpenCL programming and thorough code examples can be found
in Munshi et al.’s OpenCL Programming Guide [80].
11.3 GPU-Accelerated Volume Rendering
After having reviewed the fundamental architecture of current GPUs, this
section discusses GPU-based algorithms for direct volume rendering (DVR).
First, the section will summarize the basic techniques to introduce the prin-
ciples of hardware-accelerated rendering. Subsequently, the section examines
more advanced algorithms from the latest research before moving on to dis-
tributed volume rendering, which focuses on the scalability on GPU-cluster
systems.
11.3.1 Basic GPU Techniques
Chapter 4 provided an introduction to the basic techniques of direct vol-
ume rendering. The first class of GPU-based algorithms that is discussed is
an object-order approach based on 2D or 3D texturing. The basic principle of
these algorithms is to render a stack of polygon slices and to utilize texture
mapping for assigning color and opacity values from the transfer function. For
the final image, the rendered slices are composited using alpha blending.
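The compositing step can be sketched as a minimal C++ model of the alpha-blending stage, assuming each slice fragment already carries the color and opacity assigned by the transfer function (struct and function names are illustrative):

```cpp
#include <cassert>
#include <vector>

// Color and opacity of one slice fragment, as assigned by the
// transfer function.
struct RGBA { float r, g, b, a; };

// One back-to-front "over" blend step, as configured in the
// alpha-blending stage: dst = src * a_src + dst * (1 - a_src).
RGBA blendOver(const RGBA& src, const RGBA& dst) {
    float t = 1.0f - src.a;
    return { src.r * src.a + dst.r * t,
             src.g * src.a + dst.g * t,
             src.b * src.a + dst.b * t,
             src.a         + dst.a * t };
}

// Composite a stack of slices, ordered back to front, over a background.
RGBA compositeSlices(const std::vector<RGBA>& backToFront, RGBA bg) {
    for (const RGBA& slice : backToFront)
        bg = blendOver(slice, bg);
    return bg;
}
```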
11.3.1.1 2D Texture-Based Rendering
Early GPU-based approaches relied on 2D textures and bilinear interpo-
lation. In this case, the volumetric data set is stored in three object-aligned
stacks of 2D texture slices, one for each major axis. This is necessary to allow
interactive rotation of the data set. Depending on the viewing direction, one
of these stacks is chosen for rendering so that the angle between the slice nor-
mal and the viewing ray is minimized. Once the correct stack is determined,
the proxy geometry is rendered back-to-front and texture sampling can be
implemented in a fragment shader. The main advantages of this algorithm are
a simple implementation and high performance because graphics hardware is
highly optimized for 2D texturing and bilinear interpolation. However, there
are severe drawbacks concerning image quality and data scalability. Although
sampling artifacts can be reduced by using multi-texturing and trilinear in-
terpolation [95], flickering still remains when switching between the stacks.
Furthermore, valuable graphics memory is wasted because three copies of
the same volume data must be stored, a problem overcome by 3D texture-
based methods.
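The stack selection described above reduces to picking the dominant component of the viewing direction, since the slice normals of the three stacks are the major axes; a minimal sketch (names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Pick which of the three object-aligned slice stacks (x, y, or z) to
// render: the one whose slice normal is most parallel to the viewing
// direction, i.e., the dominant component of the view vector.
// Returns 0, 1, or 2 for the x-, y-, or z-aligned stack.
int selectStack(float vx, float vy, float vz) {
    float ax = std::fabs(vx), ay = std::fabs(vy), az = std::fabs(vz);
    if (ax >= ay && ax >= az) return 0; // x-aligned stack
    if (ay >= az)             return 1; // y-aligned stack
    return 2;                           // z-aligned stack
}
```

The flickering mentioned above occurs exactly when the dominant component changes during rotation and the renderer switches stacks.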
11.3.1.2 3D Texture-Based Rendering
The introduction of hardware support for 3D textures [14, 21] enabled
a new class of slicing algorithms. The volume data is stored in a 3D texture
representing a uniform grid. View-aligned polygons are rendered for generating
fragments and 3D texture coordinates are used for trilinear resampling of data
values. Gradients can be either precomputed for each grid point or calculated
on-the-fly to provide local illumination [117, 29]. In a similar way, Westermann
and Ertl [126] presented diffuse shading of texture-based isosurfaces.
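A precomputed gradient for local illumination is typically obtained by central differences at each grid point; a minimal C++ sketch, assuming a dense scalar array in x-fastest order (names are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Central-difference gradient at grid point (i, j, k). The result is
// typically normalized and used as a surface normal for local (e.g.,
// Phong) illumination. `vol` holds nx*ny*nz scalars, x varying fastest.
std::array<float, 3> gradientCD(const std::vector<float>& vol,
                                int nx, int ny, int nz,
                                int i, int j, int k) {
    auto at = [&](int x, int y, int z) {
        // clamp indices at the volume border
        x = std::max(0, std::min(nx - 1, x));
        y = std::max(0, std::min(ny - 1, y));
        z = std::max(0, std::min(nz - 1, z));
        return vol[(std::size_t)z * ny * nx + (std::size_t)y * nx + x];
    };
    return { (at(i + 1, j, k) - at(i - 1, j, k)) * 0.5f,
             (at(i, j + 1, k) - at(i, j - 1, k)) * 0.5f,
             (at(i, j, k + 1) - at(i, j, k - 1)) * 0.5f };
}
```

On-the-fly variants evaluate the same differences in the shader from six extra texture fetches instead of storing a gradient volume.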
In contrast to the uniform grid assumed by 3D textures, real-world volume
data is not always structured this way. The projected tetrahedra algorithm
was adapted to 3D texturing by Röttger et al. [100] for rendering
unstructured data, occurring in
computational fluid dynamics, for example. Independent of the grid structure,
high frequency components in the transfer function can cause severe
artifacts because of an improper sampling rate. To avoid expensive oversam-
pling, preintegration was introduced [58, 30] to handle high frequencies in the
transfer function [8], occurring when isosurfaces or similar localized features
were classified with sharp peaks in the transfer function. Assuming a piecewise
linear reconstruction, the contribution of a ray segment can be precomputed
accurately and the result is stored in a lookup table, which is later used at
runtime.
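The construction of such a lookup table can be sketched as follows — opacity only, with the data value assumed to vary linearly from the front sample to the back sample of each segment; the extinction function, table resolution, and all names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Build a 2D preintegration table T[sf][sb] holding the opacity of a
// ray segment whose data value varies linearly from front value sf to
// back value sb. `tau` maps a normalized data value in [0, 1] to an
// extinction coefficient, `n` is the table resolution, `d` the segment
// length, and `subSamples` the numerical integration resolution.
std::vector<std::vector<float>>
buildPreintegrationTable(const std::function<float(float)>& tau,
                         int n, float d, int subSamples = 64) {
    std::vector<std::vector<float>> table(n, std::vector<float>(n));
    for (int f = 0; f < n; ++f) {
        for (int b = 0; b < n; ++b) {
            float sf = f / float(n - 1), sb = b / float(n - 1);
            float integral = 0.0f;
            for (int t = 0; t < subSamples; ++t) {
                float u = (t + 0.5f) / subSamples;   // position in segment
                integral += tau(sf + (sb - sf) * u); // linear data model
            }
            integral *= d / subSamples;
            table[f][b] = 1.0f - std::exp(-integral); // segment opacity
        }
    }
    return table;
}
```

At runtime, a ray caster then indexes this table (stored as a 2D texture) with the entry and exit data values of each segment instead of oversampling the transfer function.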
The rendering of large data sets was addressed by Weiler et al. [123], who
developed a multiresolution approach based on 3D textures to achieve an
adaptive level of detail. In numerical simulations, even larger amounts of 4D
volume data are generated. Lum et al. [64] employed compression and hardware
texturing to render time-varying data sets efficiently.
Simple lighting models often do not exhibit sufficient details when explor-
ing volume data. However, solving full radiative transfer is usually too expen-
sive in visualization. Therefore, Kniss et al. [56] developed an approximate
solution, based on half-angle slicing, which incorporates effects such as volu-
metric shadows and forward-directed scattering. Furthermore, the exploration
of volume data is also inhibited by occlusion. Weiskopf et al. [125] presented a
method for volume clipping using arbitrary geometries to unveil hidden details
in the data set. Depending on the transfer function, significant areas of the
volume can be transparent or opaque, leading to many samples that do not
contribute to the final image. Li et al. [62] show how empty space skipping
and occlusion clipping can be adapted to texture-based volume rendering.
11.3.1.3 Ray Casting
The second fundamental DVR algorithm is GPU-based ray casting, an
image-order technique that directly solves the discrete volume rendering equa-
tion by tracing eye rays through the volume. The resampled data values are
mapped to optical properties based on the transfer function. With the develop-
ment of advanced shader capabilities, image-order techniques soon dominated
over the slice-based methods. The most prominent form of ray casting [61] was
introduced to GPU-based volume rendering by Röttger et al. [99], followed by
an alternative implementation by Krüger and Westermann [59]. At a similar
time, Weiler et al. [122] adapted Garrity’s [33] ray propagation algorithm
for GPU-based traversal of tetrahedral meshes. Up until then, all implemen-
tations relied on multiple rendering passes; but, as soon as shaders allowed
branching operations, Stegmaier et al. [108] developed a single-pass algorithm
in a fragment shader.
Ray casting can be advantageous compared to slice-based rendering be-
cause acceleration techniques like empty space skipping, early ray termination,
or adaptive sampling are easily facilitated. Similar methods can also be used
for rendering isosurfaces on the GPU. Hadwiger et al. [40] demonstrated how
ray casting can be used to refine intersections with discrete isosurfaces. The in-
troduction of GPGPU programming languages further simplified GPU-based
ray casting. Maršálek et al. [72] presented an optimized CUDA implementation.
Compared to shader implementations, the generation of eye rays does
not require the rendering of proxy geometry. Instead, ray origin and direction
are calculated directly from camera parameters for each pixel in the compute
kernel.
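The inner loop of such a single-pass ray caster — front-to-back compositing with early ray termination — can be sketched as follows, assuming the optical properties have already been resampled along the ray (names are illustrative):

```cpp
#include <cassert>
#include <vector>

// Optical properties of one ray sample, after transfer-function lookup.
struct RGBA { float r, g, b, a; };

// Front-to-back compositing of one eye ray. The ray is terminated
// early once the accumulated opacity is nearly saturated, because
// further samples cannot contribute visibly to the pixel.
RGBA castRay(const std::vector<RGBA>& samples, float alphaCutoff = 0.99f) {
    RGBA acc{0.0f, 0.0f, 0.0f, 0.0f};
    for (const RGBA& s : samples) {
        float w = (1.0f - acc.a) * s.a;  // remaining transparency
        acc.r += w * s.r;
        acc.g += w * s.g;
        acc.b += w * s.b;
        acc.a += w;
        if (acc.a >= alphaCutoff) break; // early ray termination
    }
    return acc;
}
```

In a compute kernel, one such loop runs per pixel, with the sample positions derived from the per-pixel ray origin and direction.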
For further reading about DVR and GPU-accelerated visualization meth-
ods, see the books by Engel et al. [29] and Weiskopf [124].
11.3.2 Advanced GPU Algorithms
The majority of state-of-the-art DVR algorithms are based on ray casting,
but despite the growing computational power of GPUs, sampling remains a
major bottleneck. High-quality images require thorough data reconstruction
to avoid visible artifacts, leading to a bandwidth-limited problem.
Knoll et al. [57] introduced a peak finding algorithm to render isosurfaces
arising from sharp spikes in the transfer function. Usually, post-classified
rendering requires very high sampling rates to achieve sufficient quality.
While preintegration solves many deficiencies with fewer samples, it scales
opacity improperly, depending on the data topology. Therefore, Knoll et
al. [57] developed
an algorithm that analyzes the transfer function, searching for peaks in a pre-
processing step. Similar to preintegration, a 2D lookup table is constructed,
storing possible isovalues for a set of data segments in a 2D texture. During
ray casting, the entry and exit values of each ray segment are used to look up whether
it contains an isosurface, avoiding expensive over-sampling. In addition, color
and opacity of the isosurface are not scaled because classification only depends
on the isovalue from the lookup table.
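The per-segment test can be sketched as follows, with a sorted list of peak isovalues standing in for the 2D lookup texture — a simplified model of the published algorithm, with illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Peak finding during ray casting: instead of oversampling, test
// whether the data range spanned by a ray segment contains one of the
// isovalues extracted from the transfer-function peaks in a
// preprocessing step. `isovalues` must be sorted ascending.
// If a hit is found, the isosurface intersection can then be refined
// inside [entry, exit] only where it is actually needed.
bool segmentHitsIsosurface(float entry, float exit,
                           const std::vector<float>& isovalues,
                           float* hit = nullptr) {
    float lo = std::min(entry, exit), hi = std::max(entry, exit);
    auto it = std::lower_bound(isovalues.begin(), isovalues.end(), lo);
    if (it != isovalues.end() && *it <= hi) {
        if (hit) *hit = *it;
        return true;
    }
    return false;
}
```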
Preintegration assumes a linear function between two samples, but uni-
form sampling with trilinear interpolation usually leads to nonlinear behav-
ior. Under these circumstances, voxel-based sampling leads to piecewise cubic
polynomials along a ray [67]. Ament et al. [4] exploited this observation and
developed a CUDA-based algorithm to reconstruct Newton polynomials effi-
ciently with four trilinear samples in each cell, allowing a piecewise analytic
representation of the scalar field. The linearization of the cubic polynomials
with respect to their local extrema guarantees crack-free rendering in con-
junction with preintegration. The problem of scaled opacity is addressed by a
modified visualization model, allowing the user to classify volume data inde-
pendent of its topology and its dimensions in the spatial domain.
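The per-cell reconstruction can be sketched with Newton's divided differences over four equidistant samples along the segment — a simplified, one-dimensional stand-in for the published CUDA implementation (names are illustrative):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Reconstruct the cubic polynomial of the scalar field along a ray
// segment from four equidistant (e.g., trilinear) samples taken at
// t = 0, 1/3, 2/3, 1, using Newton's divided differences, and
// evaluate it at parameter t in [0, 1].
float evalCubicNewton(const std::array<float, 4>& s, float t) {
    const float h = 1.0f / 3.0f;  // sample spacing
    // divided-difference coefficients of the Newton form
    float d1 = (s[1] - s[0]) / h;
    float d2 = (s[2] - s[1]) / h;
    float d3 = (s[3] - s[2]) / h;
    float c0 = s[0];
    float c1 = d1;
    float c2 = (d2 - d1) / (2.0f * h);
    float c3 = ((d3 - d2) / (2.0f * h) - c2) / (3.0f * h);
    // Horner-style evaluation of the Newton form
    return c0 + t * (c1 + (t - h) * (c2 + (t - 2.0f * h) * c3));
}
```

Once the polynomial coefficients are known, local extrema can be found analytically, which is what enables the crack-free linearization described above.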
The previous approach achieved bandwidth reduction under the assump-
tion of trilinear reconstruction. However, higher-order techniques further push
bandwidth requirements. Lee et al. [60] introduced a GPU-accelerated algo-
rithm to approximate the scalar field between uniform samples with third-
order Catmull-Rom splines [15] to provide virtual samples by evaluating the
polynomial functions arithmetically. The control points of the splines are sam-
pled with tricubic B-spline texture filtering, by using the technique from Sigg
and Hadwiger [106] with eight trilinear texture fetches. Compared to full tricu-
bic 4× oversampling, the computational evaluation leads to an improved per-
formance of about 2.5×–3.3× at comparable rendering quality. In addition to
intensity reconstruction, the approach can also be used to calculate virtual
samples for gradient estimation.
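The evaluation of a virtual sample from four neighboring control values can be sketched with the standard Catmull-Rom formula — a one-dimensional simplification of the tricubic case (names are illustrative):

```cpp
#include <cassert>

// Third-order Catmull-Rom spline through control values p0..p3,
// evaluated arithmetically at t in [0, 1] to produce a virtual sample
// between p1 (t = 0) and p2 (t = 1), instead of fetching additional
// texture samples.
float catmullRom(float p0, float p1, float p2, float p3, float t) {
    return 0.5f * ((2.0f * p1) +
                   (-p0 + p2) * t +
                   (2.0f * p0 - 5.0f * p1 + 4.0f * p2 - p3) * t * t +
                   (-p0 + 3.0f * p1 - 3.0f * p2 + p3) * t * t * t);
}
```

The spline interpolates its inner control points, so the virtual samples agree with the original data values at the grid positions.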
The quality of volume shading strongly depends on gradient estimation,
especially in the surrounding area of specular lobes. Also, preintegration with
normals requires at least four additional dimensions in the lookup table, which
may not be feasible. Guetat et al. [38] introduced nonlinear gradient interpola-
tion for preintegration without suffering from excessive memory consumption.
Their first contribution is the development of an energy conserving interpola-
tion scheme for gradients between two ray samples, guaranteeing normalized
gradients and hence conservation of specular components along the entire seg-
ment. In addition, the authors were able to reduce dependency of the normals
from preintegration. With their approach, it only takes four two-dimensional
lookup tables to implement high-quality shading with preintegration.
The previously discussed GPU methods focus on improving performance
and rendering quality. However, accelerated DVR is a valuable
tool for interactive exploration of volume data, for example, a user should
be able to examine and classify relevant features of a data set. Tradition-
ally, the transfer function serves this purpose, but it can be a cumbersome
task to find proper parameters. Moreover, standard DVR may not always be
the best choice to visualize volume data. Bruckner and Gröller [13] presented
maximum intensity difference accumulation (MIDA), a hybrid visualization
model combining characteristics from DVR and maximum intensity projection (MIP).