6.3.5 Algorithm Analysis
The efficiency of the three algorithms is compared by applying each algorithm
to several representative data sets and seed point distribution scenarios
and by measuring various aspects of performance. Because each algorithm
parallelizes differently over the particles and blocks, it is insightful to
analyze not only the total runtime but also other key metrics that are
directly impacted by the choice of parallelization strategy.
One overall metric of the analysis is the total runtime, or wall-clock time,
of the algorithm. This metric captures the total time-to-solution of each
algorithm, including IC computation, I/O, and communication. However, this
metric alone is not fine-grained enough to capture or convey the performance
characteristics of an algorithm on a given data set with a given initial seed
set. The analysis of communication, I/O, and block management gives more
fine-grained insight into performance efficiency.
Communication is a difficult metric to report and analyze since all communi-
cation in these algorithms is asynchronous. However, the time required to
post send and receive operations, together with the associated communication
management, measures the impact of the parts of the parallelization that
involve communication.
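The measurement described above can be sketched as a small timing wrapper. This is an illustrative sketch only: the `CommTimer` class and the `post` method are hypothetical names, and `operation` stands in for whatever non-blocking send or receive call (e.g., an MPI Isend/Irecv) the algorithm actually posts.

```python
import time


class CommTimer:
    """Accumulate the time spent posting asynchronous communication
    operations, without timing their later completion.

    Hypothetical sketch: `operation` is whatever non-blocking
    send/receive call the algorithm uses; only the cost of posting
    it is charged to the communication metric.
    """

    def __init__(self):
        self.total = 0.0  # seconds spent posting operations

    def post(self, operation, *args):
        start = time.perf_counter()
        request = operation(*args)  # post the asynchronous operation
        self.total += time.perf_counter() - start
        return request  # the caller waits on / completes it later
```

Because only the posting cost is accumulated, the metric isolates the management overhead of the communication strategy from the overlapped transfer time itself.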
To measure the impact of I/O upon parallelization, the time spent reading
blocks from disk is recorded, as well as the number of times blocks are loaded
and purged. Because not all of the blocks will fit into memory, an LRU cache,
with a user-defined upper bound, is implemented to handle block purging. The
efficiency of this aspect of the algorithm is measured by a ratio, the block
efficiency E: the number of blocks loaded minus the number of blocks purged,
divided by the number of blocks loaded,
E = \frac{B_L - B_P}{B_L}. (6.1)
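The LRU cache and the bookkeeping behind Eq. 6.1 can be sketched as follows. This is a minimal illustration, not the book's implementation: the class and method names are invented, and `read_block` stands in for whatever disk-read routine the algorithm uses.

```python
from collections import OrderedDict


class BlockCache:
    """LRU cache for data blocks with a user-defined capacity.

    Counts blocks loaded (B_L) and purged (B_P) so that the block
    efficiency E = (B_L - B_P) / B_L of Eq. 6.1 can be computed
    after a run. Illustrative sketch only.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # block_id -> block data
        self.loads = 0              # B_L: blocks read from disk
        self.purges = 0             # B_P: blocks evicted

    def get(self, block_id, read_block):
        """Return the block, loading it (and purging the LRU block
        if the cache is full) when it is not already resident."""
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # mark most recently used
            return self.cache[block_id]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)    # purge least recently used
            self.purges += 1
        block = read_block(block_id)          # e.g., read from disk
        self.cache[block_id] = block
        self.loads += 1
        return block

    def efficiency(self):
        """Block efficiency E of Eq. 6.1."""
        return (self.loads - self.purges) / self.loads if self.loads else 1.0
```

With a capacity of 2 and the access sequence 0, 1, 2, 0, block 0 must be reloaded after being purged, giving four loads and two purges, so E = 0.5; an algorithm that never purges a block it later needs again keeps E at 1.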
For the astrophysics data set, the wall-time graph in Figure 6.5a shows
the relative time performance of the three algorithms, with the hybrid
parallelization algorithm outperforming both seed parallelization and data
parallelization, for both spatially sparse and dense initial seed points.
However, even at 512 processors, the difference between hybrid and data
parallelization for the spatially sparse seed point set is only a factor of
3.8, so understanding the parallelization requires analyzing other metrics
as well.
An examination of the total time spent in I/O, as shown in Figure 6.5b, is
particularly informative. In this graph, the hybrid algorithm performs very
close to the ideal, exemplified by the data parallelization algorithm. Though
seed parallelization performs close to the hybrid approach in wall-clock
time, it spends an order of magnitude more time in I/O for both seed point
initial conditions.
Next, the block efficiency (see Eq. 6.1) is shown in Figure 6.5c for all three
algorithms. The data parallelization performs ideally, loading each block once