the incoming portion of the image with the same part of the image that it
already owns. By continually swapping, all processes remain busy throughout
all rounds. In each round, neighbors are chosen twice as far apart, and image
portions exchanged are half as large as in the previous round. This is why
binary-swap is also called a distance-doubling and vector-halving communica-
tion algorithm in some contexts.

FIGURE 5.4: Example of the binary-swap algorithm for four processes and two rounds.
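To make the distance-doubling, vector-halving pattern concrete, the following is a minimal Python sketch (illustrative only, not taken from any of the cited implementations) that computes, for one process, its partner and the image fraction it retains in each round. It assumes a power-of-two process count and an image divided along a single axis.

```python
def binary_swap_schedule(rank, num_procs):
    """Simulate the binary-swap communication pattern for one process.

    Assumes num_procs is a power of two.  Returns, for each round, the
    partner rank and the (start, length) of the image region this rank
    keeps, expressed as fractions of the full image.
    """
    assert num_procs > 0 and num_procs & (num_procs - 1) == 0, \
        "binary-swap requires a power-of-two process count"
    start, length = 0.0, 1.0      # this rank initially holds the whole image
    distance = 1                  # partner distance doubles every round
    group_size = num_procs
    schedule = []
    while group_size > 1:
        partner = rank ^ distance     # XOR flips the bit separating the pair
        length /= 2.0                 # vector halving: keep half of the piece
        if rank & distance:           # upper member of the pair keeps the
            start += length           # second half of the current piece
        schedule.append((partner, start, length))
        distance *= 2                 # distance doubling
        group_size //= 2
    return schedule

if __name__ == "__main__":
    for r in range(4):
        print(r, binary_swap_schedule(r, 4))
```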
Since the mid-1990s when direct-send and binary-swap were first pub-
lished, numerous variations and optimizations to the basic algorithms have
appeared. The basic ideas are to reduce active image regions using the spa-
tial locality and sparseness present in many scientific visualization images, to
better load balance after such reduction, and to keep network and computing
resources appropriately loaded through scheduling. Some examples of each of
these ideas are highlighted below.
Run-length encoding of images achieves lossless compression [1], and using
bounding boxes to identify the nonzero pixels is another way to reduce the
active image size [15]. These optimizations can minimize both communication
and computation costs.
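As a rough illustration of the bounding-box approach to shrinking the active region, the following sketch computes the smallest rectangle containing all nonempty pixels. It assumes an RGBA image stored as a NumPy array with zero alpha marking empty pixels; both details are assumptions for illustration, not specifics from the cited papers.

```python
import numpy as np

def active_bounding_box(image):
    """Return (row0, row1, col0, col1) bounding the pixels with nonzero alpha,
    or None if the image is entirely empty.  image is an (H, W, 4) RGBA array."""
    mask = image[..., 3] > 0                  # nonzero-alpha pixels are "active"
    rows = np.flatnonzero(mask.any(axis=1))   # rows containing any active pixel
    cols = np.flatnonzero(mask.any(axis=0))   # columns containing any active pixel
    if rows.size == 0:
        return None                           # nothing to composite or send
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

# A process would send only image[r0:r1, c0:c1] plus the offset (r0, c0),
# shrinking both the message size and the compositing work on the receiver.
```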
Load balancing via scan line interleaving [29] assigns individual scan lines
to processes so that each process is assigned numerous disconnected scan lines
from the entire image space. In this way, active pixels are distributed more
evenly among processes, and workload is better balanced. The drawback is
that the resulting image must be rearranged once it is composited, which can
be expensive for large images.
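A minimal sketch of the interleaving idea, assuming a simple round-robin rule (the cited work may use a different assignment):

```python
def interleaved_scanlines(height, num_procs):
    """Assign scan line y to process y % num_procs (round-robin interleaving),
    so each process receives lines spread over the whole image."""
    assignment = {p: [] for p in range(num_procs)}
    for y in range(height):
        assignment[y % num_procs].append(y)
    return assignment

# Example: with 4 processes and a 1080-line image, process 0 composites
# lines 0, 4, 8, ..., which spreads dense and sparse regions more evenly,
# but the final image must be re-interleaved after compositing.
```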
The SLIC [28] algorithm combines direct-send with active-pixel encod-
ing, scan-line interleaving, and scheduling of operations. Spans of compositing
tasks are assigned to processes in an interleaved fashion. Another way to sched-
ule processes is to assign them different tasks. This is the approach taken by
Ramakrishnan et al. [26], who scheduled some processes to perform rendering
while others were assigned to compositing. The authors presented an optimal
linear-time algorithm to compute this schedule.
Image compositing has also been combined with parallel rendering for
tiled displays. The IceT library (see Chap. 17) from Moreland et al. [19]
performs sort-last rendering on a per-tile basis. Within each display tile, the
processors that contributed image content to that tile perform either direct-
send or binary-swap compositing. Although the tile feature of IceT is not used
much in practice, IceT has become a production-quality library that offers a
robust suite of image compositing algorithms to scientific visualization tools.
Both ParaView (see Chap. 18 and [32]) and VisIt (see Chap. 16 and [9]) use
the IceT library for image compositing.
5.2.3 Image Compositing Hardware
While this chapter primarily studies the evolution of software compositing
algorithms, it is worth noting that hardware solutions to the image com-
positing problem exist as well. Some of these have been made commercially
available on smaller clusters, but as the interconnects and graphics hardware
on visualization and HPC machines have improved over time, it has become
more cost-effective to use these general-purpose machines for parallel render-
ing and image compositing rather than purchasing dedicated hardware for
these tasks.
Sepia [16] is one example of a parallel rendering system that included PCI-
connected FPGA boards for image composition and display. Lightning-2 [27]
is a hardware system that received images from the DVI outputs of graphics
cards, composited the images, and mapped them to sections of a large tiled
display. Muraki et al. [20] described an eight-node Linux cluster equipped with
dedicated volume rendering and image composition hardware. In a more recent
system, the availability of programmable network processing units accelerated
image compositing across 512 rendering nodes [24].
5.3 Recent Advances
Although the classic image composition algorithms and optimizations have
been used for the past fifteen years, new processor and interconnect advances
such as direct memory access and multi-dimensional network topology present
new opportunities to improve the state of the art in image compositing. The
next few sections explain the relationship between direct-send and binary-
swap through a tree-based representation and how it is used to develop more
general algorithms that combine both techniques. Two recent algorithms, 2-3
swap and radix-k, are presented as examples of more general communication
patterns that can exploit new hardware.
5.3.1 2-3 Swap
Yu et al. [33] extended binary-swap compositing to nonpower-of-two num-
bers of processors with an algorithm they called 2-3 swap. One goal of this
algorithm is to combine the flexibility of direct-send with the scalability of
binary-swap, and the authors accomplished this by recognizing that direct-
send and binary-swap are related and can be combined into a single algorithm.
FIGURE 5.5: Example of the 2-3 swap algorithm for seven processes in two
rounds. Group size is two or three in the first round, and between two and
five in the second round.
Any natural number greater than one can be expressed as the sum of
twos and threes, and this property is used to construct the first-round group
assignment in the 2-3 swap algorithm. The algorithm proceeds to execute a
sequence of rounds with group sizes that range from two to five, and multiple
group sizes can be present within the same round. The number of rounds, r,
is equal to ⌊log₂ p⌋, where p is the number of processes; unlike in the
binary-swap algorithm, p need not be a power of two.
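The following sketch shows one possible way to build such a first-round grouping; the specific rule (use a single group of three when the count is odd, twos otherwise) is an assumption for illustration and is not necessarily the assignment used by Yu et al.

```python
def two_three_groups(num_procs):
    """Split ranks 0..num_procs-1 into contiguous first-round groups of size
    2 or 3.  Any integer >= 2 is a sum of twos and threes: use one group of
    three when the count is odd, and groups of two for the rest."""
    assert num_procs >= 2
    sizes = []
    remaining = num_procs
    if remaining % 2 == 1:        # an odd count needs exactly one group of 3
        sizes.append(3)
        remaining -= 3
    sizes.extend([2] * (remaining // 2))
    groups, start = [], 0
    for s in sizes:
        groups.append(list(range(start, start + s)))
        start += s
    return groups

print(two_three_groups(7))   # [[0, 1, 2], [3, 4], [5, 6]]
```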
An example of 2-3 swap using seven processes is shown in Figure 5.5. In
the first round, shown on the left, processes form groups in either twos or
threes, as the name 2-3 swap suggests, and execute a direct-send within each
group. (Direct-send is the same as binary-swap when the group size is two.)
In the second round, shown on the right, the image pieces are simply divided
into a direct-send assignment, with each process owning 1/p, or 1/7 in this
example, of the image. By assigning which 1/7 each process owns, however,
Yu et al. proved that the maximum number of processes in a group is five,
avoiding the contention in the ordinary direct-send. Indeed, Figure 5.5 shows
that process P5 receives messages from four other processes, while all other
processes receive messages from only two or three other processes.
5.3.2 Radix-k
The next logical step in combining and generalizing direct-send and binary-
swap is to allow more combinations of rounds and group sizes. To see how
this is done, let k_i represent the number of processes in a communication
group in round i. The k-values for all rounds can be written as the vector
k = [k_1, k_2, ..., k_r], where the number of rounds is denoted by r. Within
each group, a direct-send is performed. The total number of processes is p.

FIGURE 5.6: Example of the radix-k algorithm for twelve processes, factored into two rounds of k = [4, 3].

More than a convenient notation, this terminology makes clear the
relationship among the previous algorithms. Direct-send is now defined as
r = 1 and k = [p]; binary-swap is defined as r = log₂ p and k = [2, 2, ..., 2].
Just as 2-3 swap was an incremental step in combining direct-send and
binary-swap by allowing k-values that are either 2 or 3, it is natural to ask
whether other combinations of r and k are possible. The radix-k algorithm [22]
answers this question by allowing any factorization of p into factors
k_1, k_2, ..., k_r such that ∏_{i=1}^{r} k_i = p. In radix-k, all groups in
round i are the same size, k_i.
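A small sketch of this notation, using a hypothetical helper named validate_radix_k (the name is an assumption for illustration) that simply checks that a k-vector multiplies out to p:

```python
from math import prod

def validate_radix_k(p, k):
    """Check that the k-vector is a valid radix-k factorization of p,
    i.e. the product of the per-round group sizes equals p."""
    if prod(k) != p:
        raise ValueError(f"product of {k} is {prod(k)}, not {p}")
    return k

# The earlier algorithms expressed in the same notation:
direct_send = validate_radix_k(12, [12])          # r = 1, k = [p]
binary_swap = validate_radix_k(16, [2, 2, 2, 2])  # r = log2 p, k = [2, 2, ..., 2]
mixed       = validate_radix_k(12, [4, 3])        # the example in Figure 5.6
```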
Figure 5.6 shows an example of radix-k for p = 12 and k = [4, 3]. The
processes are drawn in a 4 × 3 rectangular layout to identify the rounds and
groups clearly. In this example, the rows on the left side form groups in the
first round, while the columns on the right side form second-round groups. A
convenient way to think about forming groups in each round is to envision
the process space as an r-dimensional virtual lattice, where the size in each
dimension is the k-value for that round. This is the convention followed in
Figure 5.6 for two rounds drawn in two dimensions.
The outermost rectangles in the figure represent the image held by each
process at the start of the algorithm. During round i, the current image
piece is further divided into k_i pieces, so that the image pieces grow smaller
with each round. The image pieces are shown as highlighted boxes in each
round.
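The sketch below builds the per-round groups from such a virtual lattice and reports how the pieces shrink. The mapping of ranks onto the lattice (dimension 0 varying fastest) is an assumed convention for illustration and may differ from actual implementations.

```python
from itertools import product
from math import prod

def radix_k_groups(k):
    """For each round i of radix-k, form the groups of ranks that share every
    lattice coordinate except coordinate i, on an r-dimensional virtual
    lattice of shape k (dimension 0 assumed to vary fastest)."""
    r = len(k)
    strides = [prod(k[:i]) for i in range(r)]   # stride of each lattice dimension

    def rank_of(coords):
        return sum(c * s for c, s in zip(coords, strides))

    rounds = []
    for i in range(r):
        groups = []
        other = [range(k[d]) for d in range(r) if d != i]
        for fixed in product(*other):
            coords = list(fixed[:i]) + [0] + list(fixed[i:])
            group = []
            for c in range(k[i]):
                coords[i] = c                   # vary only this round's dimension
                group.append(rank_of(coords))
            groups.append(group)
        rounds.append(groups)
    return rounds

k = [4, 3]
for i, groups in enumerate(radix_k_groups(k)):
    # each piece shrinks by a factor of k_i in round i
    print(f"round {i}: pieces are 1/{prod(k[:i + 1])} of the image, groups = {groups}")
```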
Selecting different parameters can lead to many options; for the example
above, other possible choices of k are [12], [6, 2], [2, 6], [3, 4], or [2, 2, 3],
to name a few. With a judicious selection of k, higher compositing rates are
attained when the underlying hardware offers multiple communication links
and the ability to perform communication and computation simultaneously.
Even when hardware support for increased parallelism is not available, or the
image size or number of processes dictates that binary-swap or direct-send is
the best approach, those algorithms remain valid radix-k configurations.
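For a given p, the candidate k-vectors are exactly the ordered factorizations of p into factors of at least two. A brief enumeration sketch (illustrative only):

```python
def ordered_factorizations(p, min_factor=2):
    """Enumerate all ordered factorizations of p into factors >= min_factor.
    Each result is a candidate k-vector for radix-k."""
    if p == 1:
        return [[]]
    results = []
    for f in range(min_factor, p + 1):
        if p % f == 0:
            for rest in ordered_factorizations(p // f, min_factor):
                results.append([f] + rest)
    return results

print(ordered_factorizations(12))
# [[2, 2, 3], [2, 3, 2], [2, 6], [3, 2, 2], [3, 4], [4, 3], [6, 2], [12]]
```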
5.3.3 Optimizations
Kendall et al. [11] extended radix-k to include active-pixel encoding and
compression, and they showed that such optimizations benefit radix-k more
than its predecessors, because the choice of k-values is configurable. Hence,
when message size is decreased by active-pixel identification and encoding,
k-values can be increased and performance can be further enhanced. Their
implementation encodes nonempty pixel regions, based on the bounding box
information, into two separate buffers: one for alternating counts of empty
and nonempty pixels, and the other for the actual pixel values. This way,
new subsets of the image can be taken by reassigning pointers rather than
copying pixels, and images remain encoded throughout all of the composit-
ing rounds. Non-overlapping regions are copied from one round to the next
without performing the blending operation.
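A simplified sketch of this two-buffer encoding, assuming 1D RGBA pixel arrays and zero alpha as the emptiness test (both assumptions for illustration; the encoding in the cited implementation may differ in detail):

```python
import numpy as np

def encode_active_pixels(pixels, empty_alpha=0.0):
    """Encode a 1D array of RGBA pixels into two buffers: 'counts' holds
    alternating run lengths of empty and nonempty pixels (starting with an
    empty run, possibly of length 0), and 'values' holds only the nonempty
    pixel data, packed contiguously."""
    active = pixels[:, 3] != empty_alpha
    counts = []
    i, n = 0, len(pixels)
    current_active = False            # runs alternate, starting with "empty"
    while i < n:
        run = 0
        while i < n and active[i] == current_active:
            run += 1
            i += 1
        counts.append(run)
        current_active = not current_active
    values = pixels[active]           # only nonempty pixels are stored or sent
    return np.array(counts), values
```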
A set of empirical tests was performed to determine a table of target k-
values for different image sizes, numbers of processes, and architectures at both
the Argonne and Oak Ridge Leadership Computing Facilities. The platforms
tested were IBM Blue Gene/P, Cray XT, and two graphics clusters. With this
table, radix-k can look up the closest entry for a given image size, system
size, and architecture, and automatically factor the number of processes into
k-values as close to the target value as possible.
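A toy version of this factoring step might look like the following greedy sketch, which repeatedly picks the divisor closest to the target k-value; the real heuristic and the measured target table are not reproduced here.

```python
def factor_near_target(p, target_k):
    """Greedily factor p into k-values as close to target_k as possible.
    A simple illustrative heuristic, not the tuned procedure from the text."""
    k = []
    remaining = p
    while remaining > 1:
        # choose the divisor of 'remaining' (>= 2) closest to target_k
        divisors = [d for d in range(2, remaining + 1) if remaining % d == 0]
        best = min(divisors, key=lambda d: abs(d - target_k))
        k.append(best)
        remaining //= best
    return k

print(factor_near_target(12, 4))    # [4, 3]
print(factor_near_target(1024, 8))  # [8, 8, 8, 2] with this greedy rule
```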
Moreland et al. [18] deployed and evaluated optimizations in a production
framework, rather than in isolated tests: IceT serves as both the test and the
production framework. The advantages of this approach are that the tests
represent real workloads and improvements are ready for use in a production
environment sooner. These improvements include minimizing pixel copying
through compositing order and scanline interleaving, and a new telescoping
algorithm for non-power-of-two numbers of processes that can further im-
prove radix-k. A final advantage of using IceT for these improvements is that
IceT provides unified and reproducible benchmarks that other researchers can
repeat.
One of the improvements was devising a compositing order that minimizes
pixel copying. The usual, accumulative order causes nonoverlapping pixels to
be copied up to k_i − 1 times in round i, whereas tree methods incur only
log₂ k_i copy operations. Pixel copying can be reduced further by using a novel
image interlacing algorithm. Rather than interleaving partitions according to