Chapter 15
The Path to Exascale
Sean Ahern
Oak Ridge National Laboratory & University of Tennessee, Knoxville
15.1 Introduction ...................................................... 331
15.2 Future System Architectures ..................................... 332
15.3 Science Understanding Needs at the Exascale ................... 335
15.4 Research Directions .............................................. 338
15.4.1 Data Processing Modes .................................. 338
15.4.1.1 In Situ Processing ......................... 338
15.4.1.2 Post-Processing Data Analysis ............ 339
15.4.2 Visualization and Analysis Methods .................... 341
15.4.2.1 Support for Data Processing Modes ....... 341
15.4.2.2 Topological Methods ....................... 342
15.4.2.3 Statistical Methods ........................ 343
15.4.2.4 Adapting to Increased Data Complexity .. 343
15.4.3 I/O and Storage Systems ................................ 344
15.4.3.1 Storage Technologies for the Exascale ..... 345
15.4.3.2 I/O Middleware Platforms ................. 346
15.5 Conclusion and the Path Forward ............................... 346
References .......................................................... 349
The hardware and system architectural changes that will occur over the next
decade, as high performance computing (HPC) enters the exascale era, will be
dramatic and disruptive. Not only are scientific simulations forecast to grow
by many orders of magnitude, but the current methods by which HPC
systems are programmed and data are stored are also not expected to survive into
the exascale. Most of the algorithms outlined in this book have been designed
for the petascale—not the exascale—and simply increasing concurrency is
insufficient to meet the challenges posed by exascale computing. Changing
the fundamental methods by which scientific understanding is obtained from
HPC simulations is daunting. This chapter explores some research directions
for addressing these formidable challenges.
15.1 Introduction
In February 2011, the Department of Energy Office of Advanced Scientific
Computing Research convened a workshop to explore the problem of scientific
understanding of data from HPC at the exascale. The goal of the workshop
was to identify the research directions that the data management, analysis,
and visualization community must take to enable scientific discovery for HPC
at this extreme scale (1 exaflop = 1 quintillion floating point calculations per
second). Projections from the international TOP500 list [9] place the avail-
ability of the first exascale computers at around 2018–2019.
Extracting scientific insight from large HPC facilities is of crucial impor-
tance for the United States and the world. The scientific simulations that run
on supercomputers are only half of the “science”; scientific advances are made
only once the data produced by the simulations is processed into output that
is understandable by a scientist. As mathematician Richard Hamming said,
“The purpose of computing is insight, not numbers” [17]. It is precisely the
visualization and analysis community that provides the algorithms, research,
and tools to enable that critical insight.
The hardware and software changes that will occur as HPC enters the
exascale era will be dramatic and disruptive. Scientific simulations are forecast
to grow by many orders of magnitude, and the methods by which current HPC
systems are programmed and data are extracted are not expected to survive into
the exascale. Changing the fundamental methods by which scientific understanding
is obtained from HPC simulations is a daunting task. Dramatic changes in
concurrency will force a reformulation of existing algorithms and workflows and
a reconsideration of how to provide the best scalable techniques for scientific
understanding. Specifically, the changes are expected to affect concurrency,
memory hierarchies, GPU and other accelerator processing, communication
bandwidth, and, finally, I/O.
This chapter provides an overview of the February 2011 workshop [1],
which examines potential research directions for the community as computing
leaves the petascale era and enters the uncharted exascale era.
15.2 Future System Architectures
For most of the history of scientific computation, Moore's Law [28] has
predicted the doubling of transistors per unit of area and cost every 18 months,
a trend reflected in increased scalar floating point performance and increased
processor core counts, while a fairly standard balance has been maintained
among memory, I/O bandwidth, and CPU performance. However, the exascale
will usher in an age of significant imbalances between system components—
sometimes by several orders of magnitude. These imbalances will necessitate
a significant transformation in how all scientists use HPC resources.
Although extrapolation of current hardware architectures allows for a fairly
strong prediction that one exaflop will be reached by some worldwide compu-
tational resource by the year 2020, the particulars of the system architecture
are not yet assured. Currently, there are two schools of thought: multi-core and
many-core. The multi-core design primarily uses traditional multi-core proces-
sors (albeit increased beyond the petascale) with a large number of system
nodes. The many-core design is more disruptive in nature, using a large num-
ber of processors akin to today’s GPUs with very wide vector execution. Al-
though some blend of the two designs could emerge, the trends with many-core
and multi-core may be usefully extrapolated. Table 15.1 (from The Scientific
Grand Challenges Workshop: Architectures and Technology for Extreme Scale
Computing [34]) shows potential multi-core and many-core exascale system
architectures.
System Parameter        2011        "2018" Multi-core   "2018" Many-core    Factor Change
System Peak FLOPS       2 PF/s      1 EF/s              1 EF/s              500
System Power            6 MW        20 MW               20 MW               3
Total System Memory     0.3 PB      32–64 PB            32–64 PB            100–200
Total I/O Capacity      15 PB       300–1,000 PB        300–1,000 PB        20–67
Total I/O Bandwidth     0.2 TB/s    20–60 TB/s          20–60 TB/s          10–30
Total Concurrency       225K        1B × 10             1B × 100            40,000–400,000
Node Performance        125 GF      1 TF                10 TF               8–80
Node Memory Bandwidth   25 GB/s     400 GB/s            4 TB/s              16–160
Node Concurrency        12          1,000               10,000              83–830
Network Bandwidth       1.5 GB/s    100 GB/s            1,000 GB/s          66–660
System Size (nodes)     18,700      1,000,000           100,000             50–500
TABLE 15.1: Expected exascale architecture parameters and comparison with
current hardware. From The Scientific Grand Challenges Workshop: Architec-
tures and Technology for Extreme Scale Computing.
An examination of the figures in Table 15.1 leaves one strong conclusion: a
machine capable of one exaflop of raw compute capability will not simply be a
petascale machine scaled in each parameter by a factor of 1,000. The primary
reason the scaling will not be consistent across all parameters is the need to
control the power consumption of such a machine. A factor-of-1,000 increase
in power consumption is simply not sustainable because of the commensurate
increase in cost. In considering how to exploit an exascale
machine, certain implications become clear.
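To make the power constraint concrete, the following sketch computes the energy efficiency implied by the Table 15.1 figures. The calculation is an illustration derived from the table's peak-performance and power numbers, not a value stated in the workshop report.

    # Energy efficiency implied by Table 15.1 (illustrative arithmetic only).
    petascale_flops = 2e15    # 2 PF/s peak (2011 system)
    petascale_power = 6e6     # 6 MW
    exascale_flops = 1e18     # 1 EF/s peak (projected "2018" system)
    exascale_power = 20e6     # 20 MW

    eff_2011 = petascale_flops / petascale_power   # ~0.33 GFLOPS/W
    eff_exa = exascale_flops / exascale_power      # ~50 GFLOPS/W

    print(f"2011:     {eff_2011 / 1e9:5.2f} GFLOPS/W")
    print(f"Exascale: {eff_exa / 1e9:5.2f} GFLOPS/W")
    print(f"Required efficiency gain: ~{eff_exa / eff_2011:.0f}x")

A roughly 150-fold improvement in floating point operations per watt is why the other system parameters cannot simply scale in lockstep with peak performance.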
One implication is that the total concurrency of scientific simulations will
increase by a factor of 40,000–400,000, but available system memory will in-
crease by only a factor of 100. Consequently, the memory available per thread
of execution will fall from around 1 GB to 0.1–1 MB—a factor of roughly 1,000 less
than in petascale systems. This disparity is likely to change how the systems
are programmed, with memory management becoming a much more dominant
concern in the system architecture. Current application codes will no longer be able to
exploit simple weak scaling as machines grow in size. Instead, codes, including
algorithms for scientific data understanding, will have to greatly increase their
exploitation of on-node parallelism using very scarce memory resources.
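A minimal sketch of the per-thread memory arithmetic follows, using only the ranges from Table 15.1; the pairing of the ranges is illustrative, but the result lands at the reduction of roughly three orders of magnitude described above.

    # Per-thread memory from Table 15.1 (illustrative arithmetic only).
    mem_2011 = 0.3e15                      # 0.3 PB total system memory
    threads_2011 = 225e3                   # 225K total concurrency

    mem_exa = (32e15, 64e15)               # 32-64 PB total system memory
    threads_exa = (1e9 * 10, 1e9 * 100)    # 1B x 10 (multi-core) to 1B x 100 (many-core)

    per_thread_2011 = mem_2011 / threads_2011   # ~1.3 GB per thread
    worst = min(mem_exa) / max(threads_exa)     # ~0.3 MB per thread
    best = max(mem_exa) / min(threads_exa)      # ~6.4 MB per thread

    print(f"2011:     ~{per_thread_2011 / 1e9:.1f} GB per thread")
    print(f"Exascale: ~{worst / 1e6:.1f} to {best / 1e6:.1f} MB per thread")

Depending on how the ranges are paired, this gives a few tenths of a megabyte to a few megabytes per thread, roughly three orders of magnitude below today's gigabyte-per-thread budget.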
The second implication is that, primarily for reasons of energy efficiency,
the locality of data will become much more important. On an exascale-class
architecture, the most expensive operation from a power perspective will be
moving data. The further the data is moved, the more expensive the process
will be. Therefore, approaches that maximize locality as much as possible and
pay close attention to data movement are likely to be the most successful.
Although this is also the case at the petascale, it will become a much more
dominant factor at the exascale. In addition to locality between nodes, it will
be essential to pay attention to on-node locality, as the memory hierarchy
is likely to be deeper. The importance of locality also implies that global
synchronization will be very expensive, and the level of effort required to
manage varying levels of synchronization will be high.
The final implication is that the growth of the external secondary
storage system on an exascale machine, in both capacity and bandwidth, will
be dramatically less than the growth in floating point operations per second
(FLOPS) and concurrency. The relative decrease in I/O will have dramatic
impacts upon the way data is moved off the HPC system. The relative
performance of an I/O system can often be judged by measuring how long it will
take to “checkpoint” the entire machine, that is, write the entire contents of
the memory to the secondary storage system (spinning disk). Over the past
15 years of HPC, that time has grown steadily, from 5 minutes in 1997 to
over 26 minutes in 2008 [7]. Extrapolations to the exascale vary between 40
and 100 minutes. Clearly, it will no longer be practical to quickly “dump”
the current state of a simulation to disk for later analysis by a visualization
tool. Both storing and analyzing simulation results are thus likely to require
entirely new approaches. One immediate conclusion is that much analysis of
simulation data is likely to be performed in situ with the simulation to min-
imize communication and I/O bandwidth to secondary storage. (See 15.4.1.1
for a more in-depth discussion.)
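The checkpoint argument can be reproduced with a simple ratio of the Table 15.1 memory and I/O bandwidth figures; the result is a lower bound (it assumes the full I/O bandwidth is achieved) but already lands in the tens of minutes.

    # Full-memory checkpoint time = total memory / total I/O bandwidth.
    # Figures are taken from Table 15.1; this is a lower-bound illustration.

    def checkpoint_minutes(memory_bytes, bandwidth_bytes_per_sec):
        return memory_bytes / bandwidth_bytes_per_sec / 60.0

    # 2011-class system: 0.3 PB of memory, 0.2 TB/s of aggregate I/O bandwidth.
    print(f"2011:     ~{checkpoint_minutes(0.3e15, 0.2e12):.0f} minutes")

    # Projected exascale system: 32-64 PB of memory, 20-60 TB/s of I/O bandwidth.
    best = checkpoint_minutes(32e15, 60e12)
    worst = checkpoint_minutes(64e15, 20e12)
    print(f"Exascale: ~{best:.0f} to {worst:.0f} minutes")

The 2011 figure of roughly 25 minutes matches the measured checkpoint times cited above; once realistic achieved bandwidth is factored in, the exascale range is consistent with the 40–100 minute extrapolations.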
One bright note may partially mitigate these issues: new non-volatile memory
technologies are emerging that may facilitate some of the dramatic changes
needed for I/O optimization. Probably the best
known of these is NAND flash, because of its ubiquity in consumer electronics
devices such as music players, phones, and cameras. The last few years have
seen the first exploration of its use in HPC systems [6]. Because it enables sig-
nificant improvements in read and write latency, non-volatile memory holds
the promise of improving relative I/O bandwidth for small I/O operations.
Adding a non-volatile memory device to the nodes of an HPC system provides
the underlying hardware resource necessary to both improve checkpoint per-
formance and provide a fast “swap” capability for these memory-constrained
nodes.
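A rough sketch of why node-local non-volatile memory helps: its aggregate bandwidth scales with node count rather than with the shared file system. The 1 GB/s per-device write bandwidth below is an assumed, illustrative figure, not one drawn from the text.

    # Aggregate node-local NVM bandwidth vs. the shared I/O system (illustrative).
    nvm_write_bw = 1e9                      # ASSUMED: ~1 GB/s write bandwidth per NVM device
    node_counts = (100_000, 1_000_000)      # many-core and multi-core node counts (Table 15.1)

    for nodes in node_counts:
        aggregate = nodes * nvm_write_bw
        print(f"{nodes:>9,} nodes -> ~{aggregate / 1e12:,.0f} TB/s aggregate NVM bandwidth")

    # Compare with the 20-60 TB/s projected for the shared parallel file system:
    # checkpoints staged first to node-local NVM can complete substantially faster
    # and then be drained to the parallel file system asynchronously.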
Current visual data analysis and exploration platforms, such as VisIt [20]
and ParaView [32], are well positioned to exploit current memory-parallel plat-
forms and can scale adequately to today’s leadership computing platforms [7].
But, given the new prominence of hybrid architectures in HPC, as well as
the push toward very large core counts in the next several years, it is critical
that the analysis community redesign existing visualization and analysis
algorithms, and the production tools in which they are delivered, for much better
utilization of GPUs and multi-core systems. Some early work toward this goal
has begun; see Chapter 12 and [26, 22, 29].
Heterogeneous architectures, such as GPUs, FPGAs, and other accelerators
(e.g., Intel's Many Integrated Core, or MIC, architecture), are beginning
to show promise for significant time savings over traditional CPUs for cer-
tain analysis algorithms. Future analysis approaches must take these emerg-
ing computational architectures into account even though there are substan-
tial programmatic challenges in exploiting them. Scalable algorithms must
consider hybrid cores, nonuniform memory, deeper memory hierarchies, and
billion-way parallelism. MPI is likely to be inadequate at billion-way con-
currency and does not provide portability to these new classes of computing
architectures (e.g., GPUs, accelerators).
Transient hardware failures are expected to become much more common
as core count goes up and system complexity increases. Consequently, like
every other area of computer science, visualization and analysis may need to
adopt some measure of fault-tolerant computing. Increasing core count is also
likely to require changes in programming models.
15.3 Science Understanding Needs at the Exascale
Research leading toward data understanding and visualization at the exa-
scale cannot be done in a vacuum; it must have a firm grounding in the ex-
pressed needs of the computational science communities. Eight reports from
the Scientific Grand Challenges Workshop Series [40, 3, 43, 35, 31, 13, 33, 27]
present a cross section of the grand challenges of their science domains. They
contain the recommendations of domain-specific workshops for addressing the
most significant technical barriers to scientific discovery in each field. Each
report found that scientific breakthroughs are expected as a result of the
dramatic expansion of computational power; but each also identified specific
challenges to visualization, analysis, and data management. Some domains
have certain unique needs, but a number of cross-cutting and common themes
span the science areas:
• The widening gap between system I/O and computational capacity dramatically
  constrains science applications, requiring new methods that enable analysis
  and visualization to be performed while data is still resident in memory.

• Every scientific domain will produce more data than ever before, whether