Chapter 15
The Path to Exascale
Sean Ahern
Oak Ridge National Laboratory & University of Tennessee, Knoxville
15.1 Introduction ...................................................... 331
15.2 Future System Architectures ..................................... 332
15.3 Science Understanding Needs at the Exascale ................... 335
15.4 Research Directions .............................................. 338
15.4.1 Data Processing Modes .................................. 338
15.4.1.1 In Situ Processing ......................... 338
15.4.1.2 Post-Processing Data Analysis ............ 339
15.4.2 Visualization and Analysis Methods .................... 341
15.4.2.1 Support for Data Processing Modes ....... 341
15.4.2.2 Topological Methods ....................... 342
15.4.2.3 Statistical Methods ........................ 343
15.4.2.4 Adapting to Increased Data Complexity .. 343
15.4.3 I/O and Storage Systems ................................ 344
15.4.3.1 Storage Technologies for the Exascale ..... 345
15.4.3.2 I/O Middleware Platforms ................. 346
15.5 Conclusion and the Path Forward ............................... 346
References .......................................................... 349
The hardware and system architectural changes that will occur over the next
decade, as high performance computing (HPC) enters the exascale era, will be
dramatic and disruptive. Not only are scientific simulations forecast to grow
by many orders of magnitude, but the current methods by which HPC
systems are programmed and data are stored are also not expected to survive into
the exascale. Most of the algorithms outlined in this book have been designed
for the petascale—not the exascale—and simply increasing concurrency is
insufficient to meet the challenges posed by exascale computing. Changing
the fundamental methods by which scientific understanding is obtained from
HPC simulations is daunting. This chapter explores some research directions
for addressing these formidable challenges.
15.1 Introduction
In February 2011, the Department of Energy Office of Advanced Scientific
Computing Research convened a workshop to explore the problem of scientific
understanding of data from HPC at the exascale. The goal of the workshop
was to identify the research directions that the data management, analysis,
and visualization community must take to enable scientific discovery for HPC
at this extreme scale (1 exaflop = 1 quintillion floating point calculations per
second). Projections from the international TOP500 list [9] place the avail-
ability of the first exascale computers at around 2018–2019.
Extracting scientific insight from large HPC facilities is of crucial impor-
tance for the United States and the world. The scientific simulations that run
on supercomputers are only half of the “science”; scientific advances are made
only once the data produced by the simulations is processed into output that
is understandable by a scientist. As mathematician Richard Hamming said,
“The purpose of computing is insight, not numbers” [17]. It is precisely the
visualization and analysis community that provides the algorithms, research,
and tools to enable that critical insight.
The hardware and software changes that will occur as HPC enters the
exascale era will be dramatic and disruptive. Scientific simulations are forecast
to grow by many orders of magnitude, and the methods by which current HPC
systems are programmed and data are extracted are not expected to survive into
the exascale. Changing the fundamental methods by which scientific understanding
is obtained from HPC simulations is a daunting task. Dramatic changes in
concurrency will force a reformulation of existing algorithms and workflows and
a reconsideration of how to provide the best scalable techniques for scientific
understanding. Specifically, the changes are expected to affect concurrency,
memory hierarchies, GPU and other accelerator processing, communication
bandwidth, and, finally, I/O.
This chapter provides an overview of the February 2011 workshop [1],
which examines potential research directions for the community as computing
leaves the petascale era and enters the uncharted exascale era.
15.2 Future System Architectures
For most of the history of scientific computation, Moore's Law [28] has
predicted the doubling of transistors per unit of area and cost every 18 months,
a trend reflected in increased scalar floating point performance and increased
processor core counts, while a fairly standard balance has been maintained
among memory, I/O bandwidth, and CPU performance. However, the exascale
will usher in an age of significant imbalances between system components—
sometimes by several orders of magnitude. These imbalances will necessitate
a significant transformation in how all scientists use HPC resources.
Although extrapolation of current hardware architectures allows for a fairly
strong prediction that one exaflop will be reached by some worldwide compu-
tational resource by the year 2020, the particulars of the system architecture
are not yet assured. Currently, there are two schools of thought: multi-core and
many-core. The multi-core design primarily uses traditional multi-core proces-
sors (albeit increased beyond the petascale) with a large number of system
nodes. The many-core design is more disruptive in nature, using a large num-
ber of processors akin to today’s GPUs with very wide vector execution. Al-
though some blend of the two designs could emerge, the trends with many-core
and multi-core may be usefully extrapolated. Table 15.1 (from The Scientific
Grand Challenges Workshop: Architectures and Technology for Extreme Scale
Computing [34]) shows potential multi-core and many-core exascale system
architectures.
System Parameter        2011        "2018" Multi-core   "2018" Many-core    Factor Change
System Peak FLOPS       2 PF/s      1 EF/s              1 EF/s              500
System Power            6 MW        20 MW               20 MW               3
Total System Memory     0.3 PB      32–64 PB            32–64 PB            100–200
Total I/O Capacity      15 PB       300–1,000 PB        300–1,000 PB        20–67
Total I/O Bandwidth     0.2 TB/s    20–60 TB/s          20–60 TB/s          10–30
Total Concurrency       225K        1B × 10             1B × 100            40,000–400,000
Node Performance        125 GF      1 TF                10 TF               8–80
Node Memory Bandwidth   25 GB/s     400 GB/s            4 TB/s              16–160
Node Concurrency        12          1,000               10,000              83–830
Network Bandwidth       1.5 GB/s    100 GB/s            1,000 GB/s          66–660
System Size (nodes)     18,700      1,000,000           100,000             50–500
TABLE 15.1: Expected exascale architecture parameters and comparison with
current hardware. From The Scientific Grand Challenges Workshop: Architec-
tures and Technology for Extreme Scale Computing.
An examination of the figures in Table 15.1 leaves one strong conclusion: a
machine capable of one exaflop of raw compute capability will not simply be a
petascale machine scaled in each parameter by a factor of 1,000. The primary
reason the scaling will not be consistent across all parameters is the need to
control the power consumption of such a machine. A factor-of-1,000 increase
in power consumption is simply not sustainable because of the commensurate
increase in cost. In considering how to exploit an exascale
machine, certain implications become clear.
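To make the power constraint concrete, the following sketch computes the energy efficiency implied by the Table 15.1 figures. The calculation is an illustration derived from the table's peak-performance and power numbers, not a value stated in the workshop report.

    # Energy efficiency implied by Table 15.1 (illustrative arithmetic only).
    petascale_flops = 2e15    # 2 PF/s peak (2011 system)
    petascale_power = 6e6     # 6 MW
    exascale_flops = 1e18     # 1 EF/s peak (projected "2018" system)
    exascale_power = 20e6     # 20 MW

    eff_2011 = petascale_flops / petascale_power   # ~0.33 GFLOPS/W
    eff_exa = exascale_flops / exascale_power      # ~50 GFLOPS/W

    print(f"2011:     {eff_2011 / 1e9:5.2f} GFLOPS/W")
    print(f"Exascale: {eff_exa / 1e9:5.2f} GFLOPS/W")
    print(f"Required efficiency gain: ~{eff_exa / eff_2011:.0f}x")

A roughly 150-fold improvement in floating point operations per watt is why the other system parameters cannot simply scale in lockstep with peak performance.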
One implication is that the total concurrency of scientific simulations will
increase by a factor of 40,000–400,000, but available system memory will in-
crease by only a factor of 100. Consequently, the memory available per thread
of execution will fall from around 1 GB to 0.1–1 MB—a factor of roughly 1,000 less
than in petascale systems. This disparity is likely to change how the systems
are programmed, with memory management becoming a much more dominant
concern in the system architecture. Current application codes will no longer be able to
exploit simple weak scaling as machines grow in size. Instead, codes, including
algorithms for scientific data understanding, will have to greatly increase their
exploitation of on-node parallelism using very scarce memory resources.
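A minimal sketch of the per-thread memory arithmetic follows, using only the ranges from Table 15.1; the pairing of the ranges is illustrative, but the result lands at the reduction of roughly three orders of magnitude described above.

    # Per-thread memory from Table 15.1 (illustrative arithmetic only).
    mem_2011 = 0.3e15                      # 0.3 PB total system memory
    threads_2011 = 225e3                   # 225K total concurrency

    mem_exa = (32e15, 64e15)               # 32-64 PB total system memory
    threads_exa = (1e9 * 10, 1e9 * 100)    # 1B x 10 (multi-core) to 1B x 100 (many-core)

    per_thread_2011 = mem_2011 / threads_2011   # ~1.3 GB per thread
    worst = min(mem_exa) / max(threads_exa)     # ~0.3 MB per thread
    best = max(mem_exa) / min(threads_exa)      # ~6.4 MB per thread

    print(f"2011:     ~{per_thread_2011 / 1e9:.1f} GB per thread")
    print(f"Exascale: ~{worst / 1e6:.1f} to {best / 1e6:.1f} MB per thread")

Depending on how the ranges are paired, this gives a few tenths of a megabyte to a few megabytes per thread, roughly three orders of magnitude below today's gigabyte-per-thread budget.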
The second implication is that, primarily for reasons of energy efficiency,
the locality of data will become much more important. On an exascale-class
architecture, the most expensive operation from a power perspective will be
moving data. The further the data is moved, the more expensive the process
will be. Therefore, approaches that maximize locality as much as possible and
pay close attention to data movement are likely to be the most successful.
Although this is also the case at the petascale, it will become a much more
dominant factor at the exascale. In addition to locality between nodes, it will
be essential to pay attention to on-node locality, as the memory hierarchy
is likely to be deeper. The importance of locality also implies that global
synchronization will be very expensive, and the level of effort required to
manage varying levels of synchronization will be high.
The final implication is that the growth of the external secondary
storage system on an exascale machine, in both capacity and bandwidth, will
be dramatically less than the growth in floating point operations per second
(FLOPS) and concurrency. The relative decrease in I/O will have dramatic
impacts upon the way data is moved off the HPC system. The relative
performance of an I/O system can often be judged by measuring how long it will
take to “checkpoint” the entire machine, that is, write the entire contents of
the memory to the secondary storage system (spinning disk). Over the past
15 years of HPC, that time has grown steadily, from 5 minutes in 1997 to
over 26 minutes in 2008 [7]. Extrapolations to the exascale vary between 40
and 100 minutes. Clearly, it will no longer be practical to quickly “dump”
the current state of a simulation to disk for later analysis by a visualization
tool. Both storing and analyzing simulation results are thus likely to require
entirely new approaches. One immediate conclusion is that much analysis of
simulation data is likely to be performed in situ with the simulation to min-
imize communication and I/O bandwidth to secondary storage. (See 15.4.1.1
for a more in-depth discussion.)
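The checkpoint argument can be reproduced with a simple ratio of the Table 15.1 memory and I/O bandwidth figures; the result is a lower bound (it assumes the full I/O bandwidth is achieved) but already lands in the tens of minutes.

    # Full-memory checkpoint time = total memory / total I/O bandwidth.
    # Figures are taken from Table 15.1; this is a lower-bound illustration.

    def checkpoint_minutes(memory_bytes, bandwidth_bytes_per_sec):
        return memory_bytes / bandwidth_bytes_per_sec / 60.0

    # 2011-class system: 0.3 PB of memory, 0.2 TB/s of aggregate I/O bandwidth.
    print(f"2011:     ~{checkpoint_minutes(0.3e15, 0.2e12):.0f} minutes")

    # Projected exascale system: 32-64 PB of memory, 20-60 TB/s of I/O bandwidth.
    best = checkpoint_minutes(32e15, 60e12)
    worst = checkpoint_minutes(64e15, 20e12)
    print(f"Exascale: ~{best:.0f} to {worst:.0f} minutes")

The 2011 figure of roughly 25 minutes matches the measured checkpoint times cited above; once realistic achieved bandwidth is factored in, the exascale range is consistent with the 40–100 minute extrapolations.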
One bright note may partially mitigate these issues: new non-volatile memory
technologies are emerging that may facilitate some of the dramatic changes
needed for I/O optimization. Probably the best
known of these is NAND flash, because of its ubiquity in consumer electronics
devices such as music players, phones, and cameras. The last few years have
seen the first exploration of its use in HPC systems [6]. Because it enables sig-
nificant improvements in read and write latency, non-volatile memory holds
the promise of improving relative I/O bandwidth for small I/O operations.
Adding a non-volatile memory device to the nodes of an HPC system provides
the underlying hardware resource necessary to both improve checkpoint per-
formance and provide a fast “swap” capability for these memory-constrained
nodes.
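A rough sketch of why node-local non-volatile memory helps: its aggregate bandwidth scales with node count rather than with the shared file system. The 1 GB/s per-device write bandwidth below is an assumed, illustrative figure, not one drawn from the text.

    # Aggregate node-local NVM bandwidth vs. the shared I/O system (illustrative).
    nvm_write_bw = 1e9                      # ASSUMED: ~1 GB/s write bandwidth per NVM device
    node_counts = (100_000, 1_000_000)      # many-core and multi-core node counts (Table 15.1)

    for nodes in node_counts:
        aggregate = nodes * nvm_write_bw
        print(f"{nodes:>9,} nodes -> ~{aggregate / 1e12:,.0f} TB/s aggregate NVM bandwidth")

    # Compare with the 20-60 TB/s projected for the shared parallel file system:
    # checkpoints staged first to node-local NVM can complete substantially faster
    # and then be drained to the parallel file system asynchronously.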
Current visual data analysis and exploration platforms, such as VisIt [20]
and ParaView [32], are well positioned to exploit current memory-parallel plat-
forms and can scale adequately to today’s leadership computing platforms [7].
But, given the new prominence of hybrid architectures in HPC, as well as
the push toward very large core counts in the next several years, it is critical
that the analysis community redesign existing visualization and analysis
algorithms, and the production tools in which they are delivered, for much better
utilization of GPUs and multi-core systems. Some early work toward this goal
has begun; see Chapter 12 and [26, 22, 29].
Heterogeneous architectures, such as GPUs, FPGAs, and other accelerators
(e.g., Intel's Many Integrated Core, or MIC, architecture), are beginning
to show promise for significant time savings over traditional CPUs for cer-
tain analysis algorithms. Future analysis approaches must take these emerg-
ing computational architectures into account even though there are substan-
tial programmatic challenges in exploiting them. Scalable algorithms must
consider hybrid cores, nonuniform memory, deeper memory hierarchies, and
billion-way parallelism. MPI is likely to be inadequate at billion-way con-
currency and does not provide portability to these new classes of computing
architectures (e.g., GPUs, accelerators).
Transient hardware failures are expected to become much more common
as core count goes up and system complexity increases. Consequently, like
every other area of computer science, visualization and analysis may need to
adopt some measure of fault-tolerant computing. Increasing core count is also
likely to require changes in programming models.
15.3 Science Understanding Needs at the Exascale
Research leading toward data understanding and visualization at the exa-
scale cannot be done in a vacuum; it must have a firm grounding in the ex-
pressed needs of the computational science communities. Eight reports from
the Scientific Grand Challenges Workshop Series [40, 3, 43, 35, 31, 13, 33, 27]
present a cross section of the grand challenges of their science domains. They
contain the recommendations of domain-specific workshops for addressing the
most significant technical barriers to scientific discovery in each field. Each
report found that scientific breakthroughs are expected as a result of the
dramatic expansion of computational power; but each also identified specific
challenges to visualization, analysis, and data management. Some domains
have certain unique needs, but a number of cross-cutting and common themes
span the science areas:
• The widening gap between system I/O and computational capacity dramatically
  constrains science applications, requiring new methods that enable analysis
  and visualization to be performed while data is still resident in memory.

• Every scientific domain will produce more data than ever before, whether