performance was not as good in these experiments, because its
default stripe count was four. In contrast, Franklin's default stripe count
of two was better suited to the I/O pattern, which read ten separate
compressed files per task. Smaller stripe counts often benefit file-per-
task I/O: the files were usually small enough (tens of MB) that
they did not contain many stripes, and spreading them thinly over
many I/O servers increases contention.
• Because the data was stored on disk in a compressed format, the I/O
load was unequal across the tasks. The reported I/O times measure the
elapsed time from a file open to a barrier reached after all the tasks
finished reading (see the sketch after this list). Because of this load
imbalance, I/O time did not scale linearly from 16,000 to 32,000 cores
on Franklin and JaguarPF.
• The Dawn machine had the slowest clock speed (850 MHz), which was
reflected in its contouring and rendering times.
• Some variation in the observations could not be explained by slow clock
speeds, interconnects, or I/O servers:
– Regarding Franklin's increase in rendering time from 16,000 to 32,000
cores: seven to ten network links had failed that day and had to be
statically re-routed, resulting in suboptimal network performance.
Rendering algorithms are “all reduce” type operations that are very
sensitive to bisection bandwidth, which was affected by this issue.
– The experimenters concluded Juno’s slow rendering time was sim-
ilarly due to a network problem.
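
To make the timing methodology concrete, the following minimal sketch shows a
file-per-task read loop of the kind used in these experiments, timed from the
first file open until a barrier after every task has finished reading. It is
illustrative only: the file naming scheme, buffer size, and error handling are
assumptions, not the experiments' actual harness.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch of the file-per-task timing: each task reads its own
 * gzipped files with plain fopen/fread, then all tasks meet at a barrier.
 * The reported I/O time is the elapsed time from just before the first
 * open until the barrier completes, so tasks with more compressed data
 * lengthen the measured time for everyone. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int    files_per_task = 10;       /* ten files per task, per the text */
    const size_t buf_bytes      = 1 << 22;  /* 4 MB read buffer (assumed size)  */
    char *buf = malloc(buf_bytes);

    double t0 = MPI_Wtime();
    for (int i = 0; i < files_per_task; i++) {
        char path[256];
        /* hypothetical naming scheme; the real file layout is not shown here */
        snprintf(path, sizeof(path), "part_%05d_%02d.gz", rank, i);
        FILE *f = fopen(path, "rb");
        if (!f)
            continue;                        /* sketch only: skip missing files */
        while (fread(buf, 1, buf_bytes, f) == buf_bytes)
            ;                                /* pull in the compressed bytes    */
        fclose(f);                           /* decompression would follow later */
    }
    MPI_Barrier(MPI_COMM_WORLD);             /* wait for the slowest reader     */
    double io_time = MPI_Wtime() - t0;

    if (rank == 0)
        printf("reported I/O time: %.2f s\n", io_time);

    free(buf);
    MPI_Finalize();
    return 0;
}
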
13.2.2 Varying over I/O Pattern
This variant was designed to understand the effects of different I/O pat-
terns. It compared collective and noncollective I/O patterns on Franklin for
a one trillion cell upsampled data set. In the noncollective test, each task
performed ten pairs of fopen and fread calls on independent gzipped files
without any coordination among tasks. In the collective test, all tasks
synchronously called MPI_File_open once, then called MPI_File_read_at_all
ten times on a shared file (each read call corresponded to a different piece
of the data set).
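A minimal sketch of the collective pattern follows. It is illustrative rather
than the experiments' code: the shared file name, the 4 MB piece size, and the
offset arithmetic are placeholder assumptions, not the trillion-cell data
set's actual layout.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch of the collective pattern: every task opens one shared
 * file once, then participates in ten collective reads, each targeting a
 * different, task-specific region of the file. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    const int        reads_per_task = 10;       /* per the text                 */
    const MPI_Offset piece_len      = 1 << 22;  /* 4 MB per read (assumed size) */
    char *buf = malloc((size_t)piece_len);

    MPI_File fh;
    /* one synchronous, collective open of the shared file (name is a placeholder) */
    MPI_File_open(MPI_COMM_WORLD, "shared_dataset.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    for (int i = 0; i < reads_per_task; i++) {
        /* hypothetical offset formula: contiguous pieces, ten per task */
        MPI_Offset offset =
            ((MPI_Offset)rank * reads_per_task + i) * piece_len;
        MPI_Status status;
        /* collective read: the library's two-phase implementation may
         * aggregate these requests onto a subset of nodes behind the scenes */
        MPI_File_read_at_all(fh, offset, buf, (int)piece_len,
                             MPI_BYTE, &status);
    }

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
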
An underlying collective buffering, or "two-phase," algorithm in Cray's
MPI-IO implementation aggregated the read requests onto a subset of 48 nodes
(matching the file's stripe count of 48) that coordinated the low-level I/O
workload, dividing it into 4 MB stripe-aligned fread calls. As the 48
aggregator nodes filled their read buffers, they shipped the data via message
passing to its final destination among the 16,016 tasks. A different number
of tasks was used for each scheme (16,000 versus 16,016), because the collec-
tive communication scheme could not use an arbitrary number of tasks; the