228 High Performance Visualization
Wh^ĞƋƵĞŶƟĂůWƌŽŐƌĂŵdžĞĐƵƟŽŶ
'WhWĂƌĂůůĞů
džĞĐƵƟŽŶ
ŽŵƉƵƚĞ
WƌŽŐƌĂŵϭ
ŽŵƉƵƚĞ
WƌŽŐƌĂŵϮ
ŽŵƉƵƚĞ
WƌŽŐƌĂŵŶ
...
'WhWĂƌĂůůĞů
džĞĐƵƟŽŶ
'WhWĂƌĂůůĞů
džĞĐƵƟŽŶ
'WhWĂƌĂůůĞů
džĞĐƵƟŽŶ
PCIe PCIe
PCIe
PCIe PCIe PCIe
'WhDĞŵŽƌLJZĞƐŽƵƌĐĞƐ;dĞdžƚƵƌĞƐ͕ŽŶƐƚĂŶƚDĞŵŽƌLJ͕ƵīĞƌKďũĞĐƚƐͿ
ŽŵƉƵƚĞ
WƌŽŐƌĂŵϬ
FIGURE 11.2: GPGPU architecture with multiple compute programs called
in sequence from the CPU, but running each in parallel on the GPU.
Once the kernel is deployed, the GPU executes the code in parallel with full read
and write access to arbitrary locations in GPU memory. Furthermore, input
and output data of the kernels can be transferred over the PCIe bus between
system memory and GPU resources.
11.2.3 GPGPU Programming Languages
In the beginning of general purpose computing on graphics hardware
(GPGPU), plenty of research focused on the development of data structures
and special shader algorithms to solve non-graphics related problems [87]. At
that time, expert knowledge in graphics and shader programming was essen-
tial for developing hardware-accelerated code. To overcome this limitation,
new abstract programming interfaces were developed.
The Compute Unified Device Architecture (CUDA), developed by
NVIDIA [84], implements a GPGPU architecture and allows access to the
memory and the parallel computational elements of CUDA-capable GPUs
with a high-level API. The runtime environment is available for Windows,
Linux, and Mac OS X platforms and does not require any graphics library. The
programming of compute kernels is achieved with an extension to the C/C++
programming language or alternatively with a Fortran binding. In contrast to
shaders, CUDA kernels have access to arbitrary addresses in GPU memory
and still benefit from cached texture reading with hardware-accelerated inter-
polation. Furthermore, fast shared memory is available for thread cooperation
or for the implementation of a user-managed cache. The addition of double
precision floating-point arithmetic and the development of additional high-
level libraries are essential capabilities for general purpose computing. Some
of the high-level libraries include: cuFFT for fast Fourier transforms, cuBLAS
for basic linear algebra, or cuSPARSE for sparse matrix operations. A compre-
hensive introduction to CUDA programming is provided in the textbooks by
Sanders and Kandrot [101] and by Kirk and Hwu [53]. By now, applications
using CUDA cover a wide range of domains including, but not limited to,
computational fluid dynamics, molecular dynamics, and computational finance.
(For a good overview of practical applications, see GPU Computing Gems [46].)
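The host-side pattern described above — transfer input over PCIe, launch a kernel across many threads, transfer results back — can be sketched as follows. This is a plain C++ emulation in which a loop stands in for the parallel thread grid; the CUDA equivalents (`__global__`, `cudaMalloc`, `cudaMemcpy`, the `<<<blocks, threads>>>` launch) are only noted in comments, and all names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kernel body: in CUDA this would be a __global__ function, and `i`
// would come from blockIdx.x * blockDim.x + threadIdx.x.
void saxpy_kernel(int i, float a, const float* x, float* y) {
    y[i] = a * x[i] + y[i];   // each logical "thread" updates one element
}

// Host-side flow. In CUDA: cudaMalloc, cudaMemcpy over PCIe,
// kernel<<<blocks, threads>>> launch, cudaMemcpy back.
std::vector<float> run_saxpy(float a, const std::vector<float>& x,
                             const std::vector<float>& y) {
    // (1) "Transfer" the input to device memory: here just a host copy.
    std::vector<float> d_x = x, d_y = y;
    // (2) Launch: one independent logical thread per element.
    for (std::size_t i = 0; i < d_x.size(); ++i)
        saxpy_kernel(static_cast<int>(i), a, d_x.data(), d_y.data());
    // (3) Transfer the result back to system memory.
    return d_y;
}
```

The essential point is that the kernel body is ordinary C/C++ operating on one data element, while the runtime maps it onto thousands of hardware threads.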
A limitation of CUDA is vendor dependency in the choice of GPUs. There-
fore, the Khronos group [51], a non-profit industry consortium, made an effort
to develop an independent standard for programming heterogeneous comput-
ing devices including various kinds of GPUs and multi-core CPUs. The Open
Computing Language (OpenCL) provides parallel computing using task-based
and data-based parallelism. The OpenCL syntax is also based on the C pro-
gramming language and the architecture is very similar to CUDA, allowing
an easy transfer of knowledge. As OpenCL is an open specification, every
processor vendor can decide to implement the standard for their products.
Details on OpenCL programming and thorough code examples can be found
in Munshi et al.’s OpenCL Programming Guide [80].
11.3 GPU-Accelerated Volume Rendering
After having reviewed the fundamental architecture of current GPUs, this
section discusses GPU-based algorithms for direct volume rendering (DVR).
First, the section will summarize the basic techniques to introduce the prin-
ciples of hardware-accelerated rendering. Subsequently, the section examines
more advanced algorithms from the latest research before moving on to dis-
tributed volume rendering, which focuses on the scalability on GPU-cluster
systems.
11.3.1 Basic GPU Techniques
Chapter 4 provided an introduction to the basic techniques of direct vol-
ume rendering. The first class of GPU-based algorithms that is discussed is
an object-order approach based on 2D or 3D texturing. The basic principle of
these algorithms is to render a stack of polygon slices and to utilize texture
mapping for assigning color and opacity values from the transfer function. For
the final image, the rendered slices are composited using alpha blending.
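The compositing step can be sketched as a minimal C++ model of the alpha-blending stage, assuming each slice fragment already carries the color and opacity assigned by the transfer function (struct and function names are illustrative):

```cpp
#include <cassert>
#include <vector>

// Color and opacity of one slice fragment, as assigned by the
// transfer function.
struct RGBA { float r, g, b, a; };

// One back-to-front "over" blend step, as configured in the
// alpha-blending stage: dst = src * a_src + dst * (1 - a_src).
RGBA blendOver(const RGBA& src, const RGBA& dst) {
    float t = 1.0f - src.a;
    return { src.r * src.a + dst.r * t,
             src.g * src.a + dst.g * t,
             src.b * src.a + dst.b * t,
             src.a         + dst.a * t };
}

// Composite a stack of slices, ordered back to front, over a background.
RGBA compositeSlices(const std::vector<RGBA>& backToFront, RGBA bg) {
    for (const RGBA& slice : backToFront)
        bg = blendOver(slice, bg);
    return bg;
}
```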
11.3.1.1 2D Texture-Based Rendering
Early GPU-based approaches relied on 2D textures and bilinear interpo-
lation. In this case, the volumetric data set is stored in three object-aligned
stacks of 2D texture slices, one for each major axis. This is necessary to allow
interactive rotation of the data set. Depending on the viewing direction, one
of these stacks is chosen for rendering so that the angle between the slice nor-
mal and the viewing ray is minimized. Once the correct stack is determined,
the proxy geometry is rendered back-to-front and texture sampling can be
implemented in a fragment shader. The main advantages of this algorithm are
a simple implementation and high performance because graphics hardware is
highly optimized for 2D texturing and bilinear interpolation. However, there
are severe drawbacks concerning image quality and data scalability. Although
sampling artifacts can be reduced by using multi-texturing and trilinear in-
terpolation [95], flickering still remains when switching between the stacks.
Furthermore, valuable graphics memory is wasted because three copies of
the same volume data must be stored, a problem overcome by 3D texture-
based methods.
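The stack selection described above reduces to picking the dominant component of the viewing direction, since the slice normals of the three stacks are the major axes; a minimal sketch (names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Pick which of the three object-aligned slice stacks (x, y, or z) to
// render: the one whose slice normal is most parallel to the viewing
// direction, i.e., the dominant component of the view vector.
// Returns 0, 1, or 2 for the x-, y-, or z-aligned stack.
int selectStack(float vx, float vy, float vz) {
    float ax = std::fabs(vx), ay = std::fabs(vy), az = std::fabs(vz);
    if (ax >= ay && ax >= az) return 0; // x-aligned stack
    if (ay >= az)             return 1; // y-aligned stack
    return 2;                           // z-aligned stack
}
```

The flickering mentioned above occurs exactly when the dominant component changes during rotation and the renderer switches stacks.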
11.3.1.2 3D Texture-Based Rendering
The introduction of hardware support for 3D textures [14, 21] enabled
a new class of slicing algorithms. The volume data is stored in a 3D texture
representing a uniform grid. View-aligned polygons are rendered for generating
fragments and 3D texture coordinates are used for trilinear resampling of data
values. Gradients can be either precomputed for each grid point or calculated
on-the-fly to provide local illumination [117, 29]. In a similar way, Westermann
and Ertl [126] presented diffuse shading of texture-based isosurfaces.
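A precomputed gradient for local illumination is typically obtained by central differences at each grid point; a minimal C++ sketch, assuming a dense scalar array in x-fastest order (names are illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Central-difference gradient at grid point (i, j, k). The result is
// typically normalized and used as a surface normal for local (e.g.,
// Phong) illumination. `vol` holds nx*ny*nz scalars, x varying fastest.
std::array<float, 3> gradientCD(const std::vector<float>& vol,
                                int nx, int ny, int nz,
                                int i, int j, int k) {
    auto at = [&](int x, int y, int z) {
        // clamp indices at the volume border
        x = std::max(0, std::min(nx - 1, x));
        y = std::max(0, std::min(ny - 1, y));
        z = std::max(0, std::min(nz - 1, z));
        return vol[(std::size_t)z * ny * nx + (std::size_t)y * nx + x];
    };
    return { (at(i + 1, j, k) - at(i - 1, j, k)) * 0.5f,
             (at(i, j + 1, k) - at(i, j - 1, k)) * 0.5f,
             (at(i, j, k + 1) - at(i, j, k - 1)) * 0.5f };
}
```

On-the-fly variants evaluate the same differences in the shader from six extra texture fetches instead of storing a gradient volume.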
In contrast to the uniform grid assumed by 3D textures, real-world volume
data is not always structured this way. The projected tetrahedra algorithm
was adapted to 3D texturing by Röttger et al. [100] for rendering
unstructured data, occurring in
computational fluid dynamics, for example. Independent of the grid structure,
high frequency components in the transfer function can cause severe
artifacts because of an improper sampling rate. To avoid expensive oversam-
pling, preintegration was introduced [58, 30] to handle high frequencies in the
transfer function [8], occurring when isosurfaces or similar localized features
were classified with sharp peaks in the transfer function. Assuming a piecewise
linear reconstruction, the contribution of a ray segment can be precomputed
accurately and the result is stored in a lookup table, which is later used at
runtime.
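The construction of such a lookup table can be sketched as follows — opacity only, with the data value assumed to vary linearly from the front sample to the back sample of each segment; the extinction function, table resolution, and all names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Build a 2D preintegration table T[sf][sb] holding the opacity of a
// ray segment whose data value varies linearly from front value sf to
// back value sb. `tau` maps a normalized data value in [0, 1] to an
// extinction coefficient, `n` is the table resolution, `d` the segment
// length, and `subSamples` the numerical integration resolution.
std::vector<std::vector<float>>
buildPreintegrationTable(const std::function<float(float)>& tau,
                         int n, float d, int subSamples = 64) {
    std::vector<std::vector<float>> table(n, std::vector<float>(n));
    for (int f = 0; f < n; ++f) {
        for (int b = 0; b < n; ++b) {
            float sf = f / float(n - 1), sb = b / float(n - 1);
            float integral = 0.0f;
            for (int t = 0; t < subSamples; ++t) {
                float u = (t + 0.5f) / subSamples;   // position in segment
                integral += tau(sf + (sb - sf) * u); // linear data model
            }
            integral *= d / subSamples;
            table[f][b] = 1.0f - std::exp(-integral); // segment opacity
        }
    }
    return table;
}
```

At runtime, a ray caster then indexes this table (stored as a 2D texture) with the entry and exit data values of each segment instead of oversampling the transfer function.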
The rendering of large data sets was addressed by Weiler et al. [123], who
developed a multiresolution approach based on 3D textures to achieve an
adaptive level of detail. In numerical simulations, even larger amounts of 4D
volume data are generated. Lum et al. [64] employed compression and hardware
texturing to render time-varying data sets efficiently.
Simple lighting models often do not exhibit sufficient details when explor-
ing volume data. However, solving full radiative transfer is usually too expen-
sive in visualization. Therefore, Kniss et al. [56] developed an approximate
solution, based on half-angle slicing, which incorporates effects such as volu-
metric shadows and forward-directed scattering. Furthermore, the exploration
of volume data is also inhibited by occlusion. Weiskopf et al. [125] presented a
method for volume clipping using arbitrary geometries to unveil hidden details
in the data set. Depending on the transfer function, significant areas of the
volume can be transparent or opaque, leading to many samples that do not
contribute to the final image. Li et al. [62] show how empty space skipping
and occlusion clipping can be adapted to texture-based volume rendering.
11.3.1.3 Ray Casting
The second fundamental DVR algorithm is GPU-based ray casting, an
image-order technique that directly solves the discrete volume rendering equa-
tion by tracing eye rays through the volume. The resampled data values are
mapped to optical properties based on the transfer function. With the develop-
ment of advanced shader capabilities, image-order techniques soon dominated
over the slice-based methods. The most prominent form of ray casting [61] was
introduced to GPU-based volume rendering by Röttger et al. [99], followed by
an alternative implementation by Krüger and Westermann [59]. At a similar
time, Weiler et al. [122] adapted Garrity’s [33] ray propagation algorithm
for GPU-based traversal of tetrahedral meshes. Up until then, all implemen-
tations relied on multiple rendering passes; but, as soon as shaders allowed
branching operations, Stegmaier et al. [108] developed a single-pass algorithm
in a fragment shader.
Ray casting can be advantageous compared to slice-based rendering be-
cause acceleration techniques like empty space skipping, early ray termination,
or adaptive sampling are easily facilitated. Similar methods can also be used
for rendering isosurfaces on the GPU. Hadwiger et al. [40] demonstrated how
ray casting can be used to refine intersections with discrete isosurfaces. The in-
troduction of GPGPU programming languages further simplified GPU-based
ray casting. Maršálek et al. [72] presented an optimized CUDA implementation.
Compared to shader implementations, the generation of eye rays does
not require the rendering of proxy geometry. Instead, ray origin and direction
are calculated directly from camera parameters for each pixel in the compute
kernel.
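The inner loop of such a single-pass ray caster — front-to-back compositing with early ray termination — can be sketched as follows, assuming the optical properties have already been resampled along the ray (names are illustrative):

```cpp
#include <cassert>
#include <vector>

// Optical properties of one ray sample, after transfer-function lookup.
struct RGBA { float r, g, b, a; };

// Front-to-back compositing of one eye ray. The ray is terminated
// early once the accumulated opacity is nearly saturated, because
// further samples cannot contribute visibly to the pixel.
RGBA castRay(const std::vector<RGBA>& samples, float alphaCutoff = 0.99f) {
    RGBA acc{0.0f, 0.0f, 0.0f, 0.0f};
    for (const RGBA& s : samples) {
        float w = (1.0f - acc.a) * s.a;  // remaining transparency
        acc.r += w * s.r;
        acc.g += w * s.g;
        acc.b += w * s.b;
        acc.a += w;
        if (acc.a >= alphaCutoff) break; // early ray termination
    }
    return acc;
}
```

In a compute kernel, one such loop runs per pixel, with the sample positions derived from the per-pixel ray origin and direction.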
For further reading about DVR and GPU-accelerated visualization meth-
ods, see the books by Engel et al. [29] and Weiskopf [124].
11.3.2 Advanced GPU Algorithms
The majority of state-of-the-art DVR algorithms are based on ray casting,
but despite the growing computational power of GPUs, sampling remains a
major bottleneck. High-quality images require thorough data reconstruction
to avoid visible artifacts, leading to a bandwidth-limited problem.
Knoll et al. [57] introduced a peak finding algorithm to render isosurfaces
arising from sharp spikes in the transfer function. Usually, post-classified
rendering requires very high sampling rates to achieve sufficient quality.
While preintegration solves many deficiencies with fewer samples, it scales
opacity improperly, depending on the data topology. Therefore, Knoll et
al. [57] developed
an algorithm that analyzes the transfer function, searching for peaks in a pre-
processing step. Similar to preintegration, a 2D lookup table is constructed,
storing possible isovalues for a set of data segments in a 2D texture. During
ray casting, the entry and exit values of each ray segment are used to look up whether
it contains an isosurface, avoiding expensive over-sampling. In addition, color
and opacity of the isosurface are not scaled because classification only depends
on the isovalue from the lookup table.
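The per-segment test can be sketched as follows, with a sorted list of peak isovalues standing in for the 2D lookup texture — a simplified model of the published algorithm, with illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Peak finding during ray casting: instead of oversampling, test
// whether the data range spanned by a ray segment contains one of the
// isovalues extracted from the transfer-function peaks in a
// preprocessing step. `isovalues` must be sorted ascending.
// If a hit is found, the isosurface intersection can then be refined
// inside [entry, exit] only where it is actually needed.
bool segmentHitsIsosurface(float entry, float exit,
                           const std::vector<float>& isovalues,
                           float* hit = nullptr) {
    float lo = std::min(entry, exit), hi = std::max(entry, exit);
    auto it = std::lower_bound(isovalues.begin(), isovalues.end(), lo);
    if (it != isovalues.end() && *it <= hi) {
        if (hit) *hit = *it;
        return true;
    }
    return false;
}
```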
Preintegration assumes a linear function between two samples, but uni-
form sampling with trilinear interpolation usually leads to nonlinear behav-
ior. Under these circumstances, voxel-based sampling leads to piecewise cubic
polynomials along a ray [67]. Ament et al. [4] exploited this observation and
developed a CUDA-based algorithm to reconstruct Newton polynomials effi-
ciently with four trilinear samples in each cell, allowing a piecewise analytic
representation of the scalar field. The linearization of the cubic polynomials
with respect to their local extrema guarantees crack-free rendering in con-
junction with preintegration. The problem of scaled opacity is addressed by a
modified visualization model, allowing the user to classify volume data inde-
pendent of its topology and its dimensions in the spatial domain.
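The per-cell reconstruction can be sketched with Newton's divided differences over four equidistant samples along the segment — a simplified, one-dimensional stand-in for the published CUDA implementation (names are illustrative):

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Reconstruct the cubic polynomial of the scalar field along a ray
// segment from four equidistant (e.g., trilinear) samples taken at
// t = 0, 1/3, 2/3, 1, using Newton's divided differences, and
// evaluate it at parameter t in [0, 1].
float evalCubicNewton(const std::array<float, 4>& s, float t) {
    const float h = 1.0f / 3.0f;  // sample spacing
    // divided-difference coefficients of the Newton form
    float d1 = (s[1] - s[0]) / h;
    float d2 = (s[2] - s[1]) / h;
    float d3 = (s[3] - s[2]) / h;
    float c0 = s[0];
    float c1 = d1;
    float c2 = (d2 - d1) / (2.0f * h);
    float c3 = ((d3 - d2) / (2.0f * h) - c2) / (3.0f * h);
    // Horner-style evaluation of the Newton form
    return c0 + t * (c1 + (t - h) * (c2 + (t - 2.0f * h) * c3));
}
```

Once the polynomial coefficients are known, local extrema can be found analytically, which is what enables the crack-free linearization described above.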
The previous approach achieved bandwidth reduction under the assump-
tion of trilinear reconstruction. However, higher-order techniques further push
bandwidth requirements. Lee et al. [60] introduced a GPU-accelerated algo-
rithm to approximate the scalar field between uniform samples with third-
order Catmull-Rom splines [15] to provide virtual samples by evaluating the
polynomial functions arithmetically. The control points of the splines are sam-
pled with tricubic B-spline texture filtering, by using the technique from Sigg
and Hadwiger [106] with eight trilinear texture fetches. Compared to full tricu-
bic 4× oversampling, the computational evaluation leads to an improved per-
formance of about 2.5×–3.3× at comparable rendering quality. In addition to
intensity reconstruction, the approach can also be used to calculate virtual
samples for gradient estimation.
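The evaluation of a virtual sample from four neighboring control values can be sketched with the standard Catmull-Rom formula — a one-dimensional simplification of the tricubic case (names are illustrative):

```cpp
#include <cassert>

// Third-order Catmull-Rom spline through control values p0..p3,
// evaluated arithmetically at t in [0, 1] to produce a virtual sample
// between p1 (t = 0) and p2 (t = 1), instead of fetching additional
// texture samples.
float catmullRom(float p0, float p1, float p2, float p3, float t) {
    return 0.5f * ((2.0f * p1) +
                   (-p0 + p2) * t +
                   (2.0f * p0 - 5.0f * p1 + 4.0f * p2 - p3) * t * t +
                   (-p0 + 3.0f * p1 - 3.0f * p2 + p3) * t * t * t);
}
```

The spline interpolates its inner control points, so the virtual samples agree with the original data values at the grid positions.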
The quality of volume shading strongly depends on gradient estimation,
especially in the surrounding area of specular lobes. Also, preintegration with
normals requires at least four additional dimensions in the lookup table, which
may not be feasible. Guetat et al. [38] introduced nonlinear gradient interpola-
tion for preintegration without suffering from excessive memory consumption.
Their first contribution is the development of an energy conserving interpola-
tion scheme for gradients between two ray samples, guaranteeing normalized
gradients and hence conservation of specular components along the entire seg-
ment. In addition, the authors were able to reduce dependency of the normals
from preintegration. With their approach, it only takes four two-dimensional
lookup tables to implement high-quality shading with preintegration.
The previously discussed GPU methods focus on improving performance
and rendering quality. However, accelerated DVR is a valuable
tool for interactive exploration of volume data, for example, a user should
be able to examine and classify relevant features of a data set. Tradition-
ally, the transfer function serves this purpose, but it can be a cumbersome
task to find proper parameters. Moreover, standard DVR may not always be
the best choice to visualize volume data. Bruckner and Gröller [13] presented
maximum intensity difference accumulation (MIDA), a hybrid visualization
model combining characteristics from DVR and maximum intensity projection (MIP).