Computational Performance Evaluation of a Limited Area Meteorological Model by using the Earth Simulator

G. Ceci (a), R. Mella (a), P. Schiano (a), K. Takahashi (b), H. Fuchigami (b)

(a) Informatics and Computing Department, Italian Aerospace Research Center (CIRA), via Maiorise, I-81043 Capua (CE), Italy, http://www.cira.it
(b) Multiscale Climate Simulation Project Research, Earth Simulator Center/Japan Agency for Marine-Earth Science and Technology, 3173-25 Showa-machi, Kanazawa-ku, Yokohama, Kanagawa 236-0001, Japan, http://www.es.jamstec.go.jp

Key words

parallel performance

numerical discretization

Navier-Stokes equations

INTRODUCTION

The work described in this paper comes from an agreement signed between ESC/JAMSTEC (Earth Simulator Center/Japan Agency for Marine-Earth Science and Technology) [1] and CIRA (Italian Aerospace Research Center) [2] to study hydrogeological (landslides, floods, etc.) and local meteorological phenomena.

It is well known that meteorological models are computationally demanding and require both accurate and efficient numerical methods on high performance parallel computers. Such needs are strictly and directly related to the high resolution of the model. This paper focuses on the computational features, parallel performance and scalability on the Earth Simulator (ES) supercomputer of the GCRM (Global Cloud Resolving Model, regional version) meteorological model, currently developed at ESC by the Multiscale Climate Simulation Project Research Group.

The paper is organized as follows. In Section 1 the meteorological model is described, emphasizing above all its numerical and computational features. In Section 2 the ES system and the types of parallelism available on it are presented. In Section 3 details of the computational experiments and the evaluation criteria are discussed. Finally, in Sections 4 and 5 the computational results are presented and discussed, respectively.

1 THE GLOBAL/REGIONAL NON-HYDROSTATIC ATMOSPHERIC MODEL

The model is based on the non-hydrostatic, fully compressible, three-dimensional Navier-Stokes equations in flux form; the prognostic variables are density perturbation, pressure perturbation, the three components of momentum and temperature [9][10][11].

The equations are cast in spherical coordinates and discretized on an Arakawa-C/Lorenz grid using finite difference methods; the grid has a horizontal resolution of 5.5 km and 32 vertical levels in a terrain-following σ-coordinate. A time-splitting procedure is applied to integrate fast and slow waves separately in order to improve computational efficiency [3][4][5], following the Skamarock-Wicker formulation [6][7]; in particular, it employs the Runge-Kutta-Gill method (2nd, 3rd and 4th order) for the advection tendencies (large time step) and a forward-backward scheme for the fast modes (small time step).
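To make the time-splitting structure concrete, the following Python sketch (not the ESC Fortran90 code) shows how a three-stage Runge-Kutta large step can drive forward small-step updates of the fast modes, in the spirit of the Skamarock-Wicker scheme; the tendency routines, stage fractions and step counts are illustrative placeholders.

```python
import numpy as np

def slow_tendency(state):
    """Placeholder for the advection/physics tendencies (large-time-step terms)."""
    return -0.1 * state                     # illustrative only

def fast_update(state, tendency, dt_small, n_small):
    """Placeholder forward-backward-style integration of the fast modes,
    driven by the frozen slow tendency (fast-mode terms omitted in this toy)."""
    for _ in range(n_small):
        state = state + dt_small * tendency
    return state

def rk3_split_step(state, dt_large, n_small):
    """One split-explicit step: three RK stages, each re-evaluating the slow
    tendency and re-integrating the fast modes from the start of the step."""
    state0 = state.copy()
    for fraction in (1.0 / 3.0, 0.5, 1.0):          # RK3 stage lengths
        stage_dt = fraction * dt_large
        tend = slow_tendency(state)                 # slow tendency at current stage state
        steps = max(1, int(round(n_small * fraction)))
        state = fast_update(state0.copy(), tend, stage_dt / steps, steps)
    return state

u = np.ones(16)
u = rk3_split_step(u, dt_large=30.0, n_small=6)
```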

The regional version is nested in a global or larger regional model (coarse grid) by 1-way interactive nesting. A Davies-like sponge/relaxation boundary condition [8] is used to allow the atmospheric interaction between the interior domain (regional domain on the fine grid) and the external domain (global/larger regional model on the coarse grid). Strictly speaking, the computational domain (fine grid) is composed of a halo area (outside the physical domain) and a sponge area (inside the physical domain); in the former the values of the prognostic variables are defined by interpolation from the coarser grid or by a Neumann condition, while in the latter the prognostic variables are relaxed toward the coarse-grid values. Three types of sponge functions are implemented (hyperbolic, trigonometric and linear); a minimal sketch of the relaxation is given after Table 1. More details about the model are available in [9] and a summary of the features discussed above is given in Table 1.

Table 1

Numerical features of the GCRM meteorological model (limited area version)

Equation system:                Non-hydrostatic fully compressible Navier-Stokes equations (flux form)
Prognostic variables:           Density perturbation, pressure perturbation, three components of momentum and temperature
Grid system:                    Equations cast in spherical coordinates and discretized on an Arakawa-C/Lorenz grid by finite difference methods
Horizontal resolution
and vertical levels:            Horizontal resolution 5.5 km; 32 vertical levels in terrain-following σ-coordinate
Time integration:               HE-VI. Time splitting with Runge-Kutta-Gill 2nd/3rd/4th order for advection tendencies (large time step) and forward-backward scheme for fast modes (small time step)
Boundary conditions (lateral):  Davies-like sponge/relaxation boundary condition
Boundary conditions (upper):    Rayleigh friction

2 THE ES SYSTEM

The ES is a distributed-memory parallel system made up of 640 NEC SX-6 nodes; each node is a shared-memory vector-parallel MIMD (Multiple Instruction Multiple Data) computer with 8 CPUs. The peak performance is about 40 TFLOPS (64 GFLOPS per node), the total main memory is 10 TB, and it is currently one of the most powerful supercomputers in the world [12].

The operating system is SUPER-UX, which is based on Unix System V; suitable languages and libraries (FORTRAN90/ES, C++/ES, MPI/ES, OpenMP) are provided to achieve high performance parallel computing. In particular, two types of parallelism can be implemented: hybrid and flat. In the hybrid case, parallelism among nodes (inter-node parallelism) is achieved by MPI (Message Passing Interface) or HPF (High Performance Fortran), while within each node (intra-node parallelism) it is achieved by microtasking or OpenMP. In the flat case, both intra- and inter-node parallelism are achieved by HPF or MPI; consequently the ES is viewed as being made up of 8×640 processors. Roughly speaking, in the first case a program is subdivided into n MPI tasks (one per node) and the intra-node parallelism is performed by OpenMP or microtasking, while in the second case a program is subdivided into n MPI tasks running on n processors. A detailed description of the ES is reported in [1].
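The difference between the two modes can be summarized by how a job is mapped onto the hardware; the small, hypothetical Python helper below only reproduces the bookkeeping stated above (8 CPUs per node).

```python
def es_task_layout(n_nodes, mode="hybrid", cpus_per_node=8):
    """Return (number of MPI tasks, threads per task) for a job on n_nodes ES nodes.
    hybrid: one MPI task per node, intra-node parallelism by microtasking/OpenMP threads.
    flat:   one MPI task per CPU, no intra-node threading."""
    if mode == "hybrid":
        return n_nodes, cpus_per_node
    if mode == "flat":
        return n_nodes * cpus_per_node, 1
    raise ValueError("mode must be 'hybrid' or 'flat'")

print(es_task_layout(64, "hybrid"))  # (64, 8)  -> 64 MPI tasks x 8 threads = 512 CPUs
print(es_task_layout(64, "flat"))    # (512, 1) -> 512 MPI tasks
```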

3 IMPLEMENTATION DETAILS AND COMPUTATIONAL EXPERIMENTS

The programming language is Fortran90 and the code follows the SPMD (Single Program Multiple Data) paradigm. The parallel computation of the non-hydrostatic atmospheric model, which has been developed at ESC, is implemented with a hybrid parallelization model: intra- and inter-node parallelism are performed by microtasking and by MPI, respectively.

The selected test refers to an intense precipitation event that occurred during November 2002 over North-Western Italy.

To study the scalability of the code (i.e., how performance changes with the problem size or the number of processors), experiments have been carried out keeping the problem size constant and varying the number of processors. The sizes along the x-, y- and z-directions of the considered computational domains are 600×1152×32 (small domain), 1200×1152×32 (medium domain, twice as large) and 1200×2304×32 (large domain, four times as large). In order to reduce the communication cost, a block-wise processor grid is adopted in the horizontal for all the performed tests; consequently each processor works on a sub-domain containing all 32 vertical levels. The number of processors varies between 16 and 512 (that is, between 2 and 64 ES nodes); in all cases the number of processors is increased by powers of two.
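A minimal sketch of this block-wise decomposition (illustrative Python, not the production code) is the following; it splits the horizontal grid over a px×py processor grid while every process keeps the full set of 32 vertical levels.

```python
def block_decompose(nx, ny, px, py):
    """Split an nx x ny horizontal grid over a px x py processor grid (block-wise);
    every process keeps the full vertical column. Returns the local (lx, ly) sizes,
    distributing any remainder over the first processes along each direction."""
    def split(n, p):
        base, rem = divmod(n, p)
        return [base + (1 if i < rem else 0) for i in range(p)]
    return [(lx, ly) for lx in split(nx, px) for ly in split(ny, py)]

# e.g. the large domain (1200 x 2304) on a 2 x 64 processor grid (128 processors):
local_sizes = block_decompose(1200, 2304, 2, 64)
print(local_sizes[0])   # (600, 36): each process holds a 600 x 36 x 32 sub-domain
```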

To evaluate the runtime performance, suitable compilation options and environment variables have been set. In particular, the performance analysis tool ftrace has been used to collect detailed information on each called procedure (both subroutines and functions); we emphasize that this option should only be used for tuning, since it introduces overhead. The collected information includes elapsed/user/system times, vector instruction execution time, MOPS (millions of operations per second), MFLOPS (millions of floating-point operations per second), VLEN (average vector length), vector operation ratio, MIPS (millions of instructions per second) and so on. The criteria to achieve the best performance are: the vector time has to be close to the user time; the MFLOPS should obviously be as high as possible; VLEN, being the average vector length processed by the CPUs, should be as close as possible to the hardware vector length (256 on the NEC SX-6), so the best performance is obtained on long loops; and the vector operation ratio (%) should be close to 100. This information has been taken into account for each performed experiment.
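These criteria can be read as a simple checklist; the Python sketch below is not part of ftrace and uses assumed tolerance values, it merely screens the collected figures against the rules of thumb just listed.

```python
def check_vector_performance(user_time, vector_time, vlen, vector_op_ratio,
                             hardware_vlen=256):
    """Screen ftrace-style figures against the tuning criteria discussed in the text.
    The tolerance thresholds are illustrative assumptions, not official values."""
    return {
        "vector time close to user time": vector_time >= 0.9 * user_time,
        "long average vector length":     vlen >= 0.8 * hardware_vlen,
        "high vector operation ratio":    vector_op_ratio >= 99.0,
    }

print(check_vector_performance(user_time=100.0, vector_time=97.0,
                               vlen=240.0, vector_op_ratio=99.4))
```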

Scalability has been evaluated in terms of speed-up and efficiency. Since the program cannot be run sequentially, both speed-up and efficiency have been calculated using the elapsed time obtained by running the parallel program with the minimum number of processors. In particular, speed-up has been evaluated, following Amdahl [13], in the form:

S_p = T_2 / T_p

where T_2 and T_p are the wall clock times measured on 2 nodes (16 processors) and on p nodes (8p processors), respectively.
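Using this definition, speed-up and efficiency relative to the 2-node run can be computed directly from the elapsed times of Table 2; the example below (illustrative Python) uses the maximum elapsed times of the small domain, and the resulting efficiencies are close to the values reported in Table 3.

```python
def speedup_and_efficiency(times_by_procs, base_procs=16):
    """Speed-up S_p = T_base / T_p and efficiency E_p = S_p / (p / base_procs),
    both relative to the smallest run (here 2 nodes = 16 processors)."""
    t_base = times_by_procs[base_procs]
    out = {}
    for p, t in sorted(times_by_procs.items()):
        s = t_base / t
        out[p] = (s, 100.0 * s / (p / base_procs))
    return out

# maximum elapsed times (s) of the 600x1152x32 domain, taken from Table 2
small_domain = {16: 1098.845, 32: 568.918, 64: 293.462, 128: 166.260, 256: 92.698}
for p, (s, e) in speedup_and_efficiency(small_domain).items():
    print(f"{p:4d} procs: speed-up {s:5.2f}, efficiency {e:5.1f}%")
```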

4 COMPUTATIONAL RESULTS

A feature verified in all performed tests is the good load balance among MPI processes: as shown in Table 2, the minimum value of the elapsed time is always very close to the maximum one.

Table 2

Minimum and maximum real times

domain size      # nodes   # processors   min (s)     max (s)
600×1152×32          2          16        1098.735    1098.845
                     4          32         568.688     568.918
                     8          64         292.654     293.462
                    16         128         165.918     166.260
                    32         256          91.712      92.698
1200×1152×32         4          32        1117.389    1117.858
                     8          64         575.914     576.783
                    16         128         339.519     341.198
                    32         256         170.163     172.292
                    64         512         100.435     101.677
1200×2304×32         8          64        1419.989    1420.469
                    16         128         733.698     733.945
                    32         256         387.153     387.458
                    64         512         215.181     216.717


The computational results were analyzed according to the criteria discussed in the previous section. Results concerning elapsed times and speed-up are shown in Figures 2, 3 and 4 for the small (600×1152×32), medium (1200×1152×32) and large (1200×2304×32) domains, respectively. The efficiency of each run is reported in Table 3.

Figure 2: Elapsed time (s) and speed-up vs. number of processors on domain 600×1152×32
Figure 3: Elapsed time (s) and speed-up vs. number of processors on domain 1200×1152×32
Figure 4: Elapsed time (s) and speed-up vs. number of processors on domain 1200×2304×32

Table 3

Efficiency (%) on each domain, obtained by changing the number of processors along the x- and y-directions; the reference run of each domain (minimum number of processors) has no efficiency entry.

Domain 600×1152×32            Domain 1200×1152×32           Domain 1200×2304×32
Procs x   Procs y   Eff. (%)  Procs x   Procs y   Eff. (%)  Procs x   Procs y   Eff. (%)
   1        16        -
   1        32       96.5        2        16        -
   1        64       93.8        2        32       96.88       2        32        -
   1       128       82.5        2        64       82.04       2        64       96.76
   1       256       74.1        2       128       81.01       2       128       91.63
                                 2       256       68.68       2       256       81.79


The performance analysis also concerns the sustained GFLOPS: the sustained peak performance obtained on the small, medium and large domains is shown in Figure 5, and a comparison of the sustained peak performance with the theoretical one is given in Table 6.

Figure 5: Sustained peak performance (GFLOPS) vs. number of processors

Table 6

Ratio of sustained to theoretical peak performance (%) on each domain, varying the number of processors

Domain 600×1152×32                   Domain 1200×1152×32                  Domain 1200×2304×32
Procs x   Procs y   Sust./theor.(%)  Procs x   Procs y   Sust./theor.(%)  Procs x   Procs y   Sust./theor.(%)
   1        16        51.50
   1        32        49.89             2        16        50.79
   1        64        48.75             2        32        49.36             2        32        51.21
   1       128        43.27             2        64        42.14             2        64        49.70
   1       256        39.97             2       128        42.48             2       128        47.33
                                        2       256        36.59             2       256        42.98


5 CONCLUSIONS

The results shown in Section 4 are very satisfactory and promising for future developments.

In all cases, the ftrace tool shows that the MFLOPS achieved in each routine are roughly 50% of the theoretical peak performance and that the vector operation ratio is very close to 100%. The small gap between the minimum and maximum real times testifies to a high granularity and a good load balance among processors: the differences range between 0.11 and 2.13 seconds.

The values obtained in all performed tests show a speed-up growing nearly linearly and an efficiency that remains close to or above 70%. On each domain, speed-up and efficiency degrade as the number of nodes increases, since each subdomain becomes too small.

On the other hand, as shown above, the sustained performance is generally around or above 40% of the theoretical peak performance of the processors used.

Moreover, the hybrid parallelization (MPI processes + OpenMP threads) has been compared with the flat one (only MPI processes), and in the hybrid case an efficiency gain of up to 25% has been achieved.

REFERENCES

[1] E.S.C. Earth Simulator Center. http://www.es.jamstec.go.jp/esc/eng/index.html.

[2] C.I.R.A. Italian Aerospace Research Center. http://www.cira.it/.

[3] Klemp JB, Wilhelmson R. The simulation of three-dimensional convective storm dynamics. J. Atmos. Sci. 1978;35:1070–1096.

[4] Skamarock WC, Klemp JB. The stability of time-split numerical methods for the hydrostatic and the nonhydrostatic elastic equations. Mon. Wea. Rev. 1992;120:2109–2127.

[5] Skamarock WC, Klemp JB. Efficiency and accuracy of the Klemp-Wilhelmson time-splitting technique. Mon. Wea. Rev. 1994;122:2623–2630.

[6] Wicker LJ, Skamarock WC. Time-splitting methods for elastic models using forward time schemes. Mon. Wea. Rev. 2002;130:2088–2097.

[7] Wicker LJ, Skamarock WC. A time-splitting scheme for the elastic equations incorporating second-order Runge-Kutta time differencing. Mon. Wea. Rev. 1998;126:1992–1999.

[8] Davies HC. A lateral boundary formulation for multi-level prediction models. Quart. J. R. Met. Soc. 1976;102:405–408.

[9] Takahashi K, Peng X, Komine K, Ohdaira M, Abe Y, Sugimura T, Goto K, Fuchigami H, Yamada M, Watanabe K. Development of Non-hydrostatic Coupled Ocean-Atmosphere Simulation Code on the Earth Simulator. In: Proceedings of the 7th International Conference on High Performance Computing and Grid in Asia Pacific Region. IEEE Computer Society, Ohmiya; 2004:487–495.

[10] Takahashi K, Peng X, Komine K, Ohdaira M, Goto K, Yamada M, Fuchigami H, Sugimura T. Non-hydrostatic atmospheric GCM development and its computational performance. In: Zwieflhofer W, Mozdzynski G, eds. Use of High Performance Computing in Meteorology. World Scientific; 2005:50–62.

[11] Takahashi K. Development of Holistic Climate Simulation codes for a nonhydrostatic atmosphere-ocean coupled system. Annual Report of the Earth Simulator Center. 2006. http://www.es.jamstec.go.jp/esc/images/annual2004/index.html.

[12] TOP500 Supercomputer Sites. http://www.top500.org/.

[13] Amdahl GM. Validity of the single-processor approach to achieving large scale computing capabilities. In: AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18–20). Reston, Va.: AFIPS Press; 1967:483–485.
