List of Figures

2.1      The application areas targeted by the HPCS program are bound by the HPCC tests in the memory access locality space.

2.2      HPCS program benchmarks and performance targets.

2.3      Detailed description of the HPCC component tests (A, B, C – matrices, a, b, c, x, z – vectors, α, β – scalars, T – array of 64-bit integers).

2.4      Testing scenarios of the HPCC components.

2.5      Sample Kiviat diagram of results for three different interconnects that connect the same processors.

2.6      Sample interpretation of the HPCC results.

2.7      Historical trends for winners of the performance category of the HPCC Awards for Global HPL.

2.8      Historical trends for winners of the performance category of the HPCC Awards for Global RandomAccess.

2.9      Historical trends for winners of the performance category of the HPCC Awards for Global FFT.

2.10    Historical trends for winners of the performance category of the HPCC Awards for Global EP-STREAM-Triad.

3.1      Tracking power density with Moore’s law over time.

3.2      ASHRAE heat-density projection.

3.3      Energy efficiency statistics across Green500 releases.

3.4      Energy vs. performance efficiency across Green500 releases.

3.5      Power projections for exascale systems.

4.1      Architecture of the Tera facility.

4.2      Tera 100.

4.3      Example of a 4-billion cell computation done on 4096 cores of Tera 100, visualized using our LOVE package.

4.4      Molecular dynamics simulation of the production of a shock-induced tantalum scale.

4.5      Overview of the different Tera supercomputers.

4.6      The bullx B505 hybrid blade architecture.

4.7      Tera 100 network topology.

4.8      The software stack of the Tera 100 supercomputer.

4.9      Architectural overview of Arcane.

4.10    Hercule automatically handles the computational domains of the different codes through its dictionary-based mechanism.

4.11    Photos illustrating the density of our storage systems.

4.12    The Tera facility floor plan.

4.13    Electricity distribution.

4.14    Cooling things down.

4.15    Water distribution pipes.

4.16    The contributions to the PUE.

4.17    Bull’s cool cabinet door principle.

4.18    A BullX fat node. The UCM is under the label AC/DC.

4.19    Core usage on Tera 100.

4.20    Wait time classes for starting a batch job.

4.21    Time spent for useful computations or lost by failed runs for different codes.

4.22    Instantaneous production on the Lustre filesystems.

4.23    Electrical consumption of different parts of Tera 100.

4.24    Volume stored in HPSS by a group of users.

5.1      Configuration of Mole-9.7 supercomputer.

5.2      Configuration of Mole-8.7 supercomputer.

5.3      Overview of the Mole-8.5 supercomputing system.

5.4      The Mole-8.5 system at IPE, CAS.

5.5      Linpack results of Mole-8.5 with different numbers of nodes.

5.6      Certificate for Mole-8.5 from Green500 in 2011.

5.7      Bandwidth performance between two random nodes.

5.8      Latency performance between two random nodes.

5.9      Global architecture of the Mole-8.5 system.

5.10    Aspects of server rack and single node.

5.11    Physical layout of a Mole-8.5 system (network connection is partly shown for clarity).

5.12    Logical layout of the components in a typical Mole-8.5 node.

5.13    Internal structure of a typical computing node in Mole-8.5.

5.14    Layout of the Mole-8.5 visualization system.

5.15    High-resolution visualization of the velocity field of a DNS simulation of a gas-solid system with 1 million solid particles.

5.16    Quasi real-time simulation of the particle flow in a rotating drum on the display array.

5.17    Configuration of the DDN storage of Mole-8.5 (provided by DDN).

5.18    Performance of the DDN storage of Mole-8.5.

5.19    General algorithmic platform for discrete simulation.

5.20    General purpose particle simulation algorithm on multiple GPUs.

5.21    Snapshots from the simulation of the industrial scale rotary drum.

5.22    Snapshots of the two-dimensional DNS of gas-solid suspension.

5.23    Snapshots of the three-dimensional DNS of gas-solid suspension.

5.24    Structure of the simulated influenza virion with a quarter of the outer sphere moved over to show the interior details. Components are shown in different colors for better visualization. Radius of gyration of exterior structure of the virion as a function of simulation time.

6.1      Programming the ENIAC at the Ballistics Research Laboratory.

6.2      The major components of the DoD High Performance Computing Modernization Program (HPCMP).

6.3      Pressure contours on an F-16C configured with tip AIM-9 missiles at Mach 0.9.

6.4      Comparisons among F-16C CFD computations (Kestrel and Cobalt, full-scale) with LEF = 0 degrees vs. performance data with LEF = 0 degrees for C_L, C_D, and C_m, Mach 0.9.

6.5      Molecular dynamics simulated response of 10-nm dia. polycrystalline-grained silicon carbide subject to tensile failure. Only atoms along fracture surfaces or grain boundaries are shown, with black and gray atoms denoting under-coordinated or dislocated atoms, respectively.

6.6      Historical overview of HPCMP supercomputers (1993-2012).

6.7      The configuration of the ERDC DSRC is representative of the computation and storage architecture of all HPCMP supercomputing centers. Machines and services in this diagram are depicted as blocks; only the Cray XE6 discussed earlier is shown in detail.

6.8      ERDC DSRC supercomputing facility during the later stages of construction.

6.9      A look under the raised floor showing the three-level assignment of infrastructure support systems in the ERDC supercomputing facility.

6.10    Overview of the HPCMP supercomputer acquisition process.

7.1      HP Proliant SL390 node.

7.2      MAGMA software stack.

7.3      MAGMA 1.1 supported functionality.

7.4      Performance of LU – LAPACK on multicore vs MAGMA on hybrid GPU-CPU system.

7.5      Algorithms as DAGs for multicore and hybrid systems.

7.6      LU factorization on a multiGPU–multicore Keeneland system node.

7.7      Composition of DAGs – execution traces on a 48 core system with synchronization (top) vs. with DAG composition (bottom).

7.8      Measuring inter-thread data sharing.

7.9      Lynx: a dynamic editing tool for PTX.

7.10    Shadowfax: the assemblies concept.

7.11    Shadowfax: the components of the system.

7.12    Shadowfax: cluster throughput with various hardware slices using LAMMPS.

7.13    CPU and GPU results from a 250k particle Gay-Berne LAMMPS simulation on KIDS.

7.14    KIDS percent utilization/month.

7.15    KIDS workload distribution by numbers of nodes/job.

7.16    KIDS workload distribution by percentage of jobs at number of nodes.

7.17    KIDS workload distribution by percentage of machine at number of nodes.

7.18    Performance of SHOC Stencil2D benchmark on KIDS, CUDA, and OpenCL versions.

8.1      JUGENE at Jülich Supercomputing Centre.

8.2      JSC supercomputer network connections.

8.3      Blue Gene/P hardware packaging.

8.4      Logical setup of Blue Gene/P compute ASIC.

8.5      Blue Gene/P node.

8.6      Blue Gene/P nodecard.

8.7      JUGENE network structure. Blue Gene/P I/O nodes are connected to the GPFS fileservers through four Force10 switches.

8.8      Detail (2048×2048 cores) of the communication matrix and the full latency profile (logarithm of the number of messages versus the latency in µs), obtained with the LinkTest benchmark on the full system (72 racks, 294,912 compute cores).

8.9      Computational time on JUGENE granted to national projects, sorted by scientific area.

8.10    Workload of JUGENE (in %) between May 2010 and July 2012.

8.11    The fusion process of three alpha particles (helium-4 nuclei) in massive, hot stars leads to carbon-12.

8.12    Model of crystallization of AIST alloy in a DVD.

8.13    Light hadron spectrum of QCD.

8.14    Parallelization scheme: three dimensions are mapped to the torus hardware, here depicted by arrows pointing from and to the neighboring nodes (boxes), and the 4th dimension uses the 4 cores inside a Blue Gene/P node.

8.15    Left: The Blue Gene/P DMA communication controller. Right: The Blue Gene/P DMA FIFO.

8.16    Top: Performance of the Wilson kernel in a strong scaling analysis (64³×144 lattice). Bottom: The same strong scaling analysis, but this time showing the total performance of the single-precision kernel. Close to 350 TFlops are reached when running on 72 racks.

8.17    LLview job monitoring shows the allocation of the 72 racks in physical and logical view, as well as a list of running jobs, the overall usage statistics of the last days, and a prediction of the jobs to be processed next.

8.18    The left panel shows the hierarchy of measured metrics. The middle panel shows the distribution of the selected metric over the call tree of the program. Finally, the right panel shows the distribution of the selected metric at the selected call site over the machine topology. The screen shot shows the result of a trace analysis experiment with the ASC sweep3D benchmark running on all 294,912 cores of JUGENE.

8.19    Execution times of all routines, 1024 processors arranged on a 32×32 grid.

8.20    JUST filesystems.

8.21    JSC building at Forschungszentrum Jülich.

8.22    Old and new machine hall of the Jülich Supercomputing Centre.

8.23    New machine hall of the Jülich Supercomputing Centre during construction phase in 2003.

8.24    Raised floor in the new machine hall with power cables and water pipes for the Blue Gene/P installation.

8.25    JSC machine room floor plan with JUGENE, JUST, and JUROPA/HPCFF.

9.1      Eight high-performance Synergistic Processing Elements (SPEs) plus one Power Processing Element (PPE) are in a Cell processor.

9.2      A Roadrunner triblade compute node composed of one LS21 and two QS22 blades with four independent PCIe x8 links to the two Opteron chips on the LS21 blade with two independent HyperTransport links.

9.3      Roadrunner is composed of 17 Connected Unit (CU) groupings, each comprising 180 triblade compute nodes plus 12 I/O gateway nodes.

9.4      Comparison of Array of Structures (AoS) and Structure of Arrays (SoA) data layouts.

9.5      Single-speckle LPI calculations using 16 CUs of Roadrunner (11,520 ranks), nearly the full system.

9.6      A snapshot taken from a 3D VPIC LPI simulation at SRS pulse saturation of an f/4 laser beam, showing bending of iso-surfaces of EPW electric field across the speckle.

9.7      Open boundary simulations for neutral sheet geometry feature two types of secondary instabilities within the electron layer: an electromagnetic kink wave and flux rope formation.

9.8      Open boundary simulations for guide field geometry feature highly elongated electron current layers that are unstable to flux rope formation over a wide range of angles.

9.9      Simulations were performed on Roadrunner with geometry and boundary conditions relevant to the MRX experiment including the influence of weak Coulomb collisions.

9.10    Founder effects lead to entire clades sharing characteristics.

9.11    A phylogeny of about 10,000 HIV sequences colored by the study subject that was used to implement phylogenetic correction on the observed correlation between genotype and phenotype.

9.12    A series of high-resolution transmission electron microscope (HRTEM) images showing the contact formation, retraction, and rupture processes of two gold tips.

9.13    Early stage of a ParRep simulation of the stretching of a silver nanowire on Roadrunner at a temperature of 300 K and a retraction velocity of 10⁻⁵ m/s.

9.14    Late stage of a ParRep simulation of the stretching of a silver nanowire on Roadrunner at a temperature of 300 K and a retraction velocity of 10⁻⁵ m/s.

9.15    Ejecta particle formation by fragmentation of an expanding jet produced when a shock wave reaches the free surface of copper, which has a sinusoidal profile.

9.16    Incipient spall failure in a copper bicrystal (shocked left-to-right) by homogeneous void nucleation along the (vertical) plane of maximum tensile stress.

9.17    Incipient spall failure in a copper bicrystal by heterogeneous void nucleation along the (horizontal) grain boundaries.

9.18    Schematic of the reacting turbulence simulations.

9.19    Entropy field in reacting compressible turbulence shows the rich phenomenology of the flame turbulence interaction.

9.20    Dark matter halos from one of the large Roadrunner simulations, with 1/64 of the total (750 Mpc/h)³ volume displayed.

9.21    The redshift space flux correlation function, ζ_F, as a function of the comoving distance, r, measured in Mpc/h.

10.1    ALE3D.

10.2    Numerical Schlieren.

10.3    Blue Gene/Q system hierarchy.

10.4    Blue Gene/Q compute chip die photograph.

10.5    A depiction of the different components in the Blue Gene/Q system software stack and their open source status for the I/O and Compute nodes.

10.6    A depiction of the different components in the Blue Gene/Q system software stack and their open source status for the Service/Login nodes.

10.7    PAMI - Parallel Active Messaging Interface.

10.8    Overall view of the Sequoia storage and viz system.

10.9    Overall view of the Mira storage and viz system.

10.10  The effect of light from early galaxies on the gases filling the universe.

10.11  Process tertiary piping loop and connections installed in the mechanical room.

10.12  Custom underfloor power distribution unit.

10.13  Underfloor infrastructure zoned for efficiency.

10.14  Components of the liquid cooling.

10.15  Pedestal replacement.

10.16  Deterministic mode run times.

10.17  Probabilistic mode run times.

11.1    “Strela” computer installed at MSU computing center in 1956.

11.2    The first computing cluster at RCC MSU, 2000.

11.3    “Lomonosov” supercomputer, 2012.

11.4    MSU supercomputing center’s user priorities.

11.5    Blade module PCB.

11.6    The original DDR3 memory module.

11.7    Motherboard heat sink.

11.8    TB2 chassis backplane.

11.9    “Lomonosov” supercomputer chassis.

11.10  Wiring the equipment in the computer’s standard hardware cabinet.

11.11  QDR InfiniBand switches for the blade system.

11.12  Management module.

11.13  Barrier synchronization and global interrupt network topology.

11.14  Uninterruptible power supply.

11.15  Chillers on the outside ground.

11.16  Water accumulator tanks.

11.17  3D-model of “Lomonosov” supercomputer equipment placement within the building.

11.18  Lustre parallel filesystem schematic.

11.19  Number of jobs queued by day (“Lomonosov”).

12.1    The Pleiades supercomputer, located within the NASA Advanced Supercomputing facility at Ames Research Center, Moffett Field, California.

12.2    Hypercube building block.

12.3    I/O fabric: two 9-switch tori provide connectivity for the Lustre servers and the hyperwall.

12.4    The hyperwall can show a single image across all 128 screens or display multiple configurations of data on individual screens.

12.5    Schematic layout of the 15,000 sq. ft. main computer floor housing the Pleiades supercomputer at the NAS facility.

12.6    Westmere scaling results for the six applications used to define the SBU charging rates for Pleiades.

12.7    Pleiades compute capacity and utilization growth over the life of the system.

12.8    Pleiades storage capacity and utilization growth over the life of the system.

12.9    Navier-Stokes simulation of a UH-60 Blackhawk helicopter rotor in high-speed forward flight, using AMR and the Spalart-Allmaras/DES turbulence model.

12.10  Contour plots of pressures on the vehicle surface and on a cutting plane for two early candidate crew and cargo SLS designs at four different Mach numbers.

12.11  Visualization of surface current speed for a global ocean simulation carried out with 1/16-degree horizontal grid spacing on Pleiades by the ECCO project, capturing the swirling flows of tens of thousands of ocean currents.

12.12  Kepler’s science data processing pipeline.

13.1    Blue Waters project components.

13.2    Comparison of the major algorithmic approaches by science area.

13.3    A high-level view of the Blue Waters system and subsystems.

13.4    Comparison of Interlagos operating modes.

13.5    Cray XE6 compute node.

13.6    Compute blade: architectural diagram.

13.7    Blue Waters I/O subsystem.

13.8    The estimated cost difference between RAIT and mirroring data on tapes.

13.9    A logical diagram of the Blue Waters hardware components.

13.10  Blue Waters software ecosystem.

13.11  Layers of communication software.

13.12  National Petascale Computational Facility at the UIUC campus.

14.1    Kraken.

14.2    Six-core AMD Opteron™ processor diagram.

14.3    Cray XT3, XT4, and XT5 compute and XT service nodes.

14.4    XT administrative components.

14.5    Kraken utilization before and after bimodal scheduling.

14.6    Kraken login types.

14.7    Lustre layout.

14.8    Kraken Lustre configuration.

14.9    Discipline utilization of Kraken.

14.10  Kraken was created to provide a platform for scientists to scale codes to petascale computers. As seen here, codes are initially ported to Kraken and tested in the small queue before being scaled up to take advantage of Kraken's large core count.

14.11  The overwhelming majority of compute hours are consumed by jobs that request fewer than 512 compute nodes.

14.12  Kraken’s uptime continues to increase with the addition of hardware and software patches.

14.13  Phil Andrews — the first NICS project director.

15.1    Cumulative core-hour usage by application.

15.2    Cumulative core-hour usage by job size.

16.1    Blacklight, in PSC’s machine room.

16.2    PSC system history.

16.3    The role of electron physics in development of turbulent magnetic reconnection in collisionless plasmas.

16.4    Each Blacklight compute node consists of two Intel Nehalem-EX 8-core processors running at 2.27 GHz, a UV Hub, 128 GB of memory, and four NUMAlink™ 5 connections to the rest of the system.

16.5    A “quad” consists of two compute nodes and has four NL5 links to a fat tree.

16.6    Lower levels of the 4-plane NL5 fat-tree interconnect, showing half of the full system.

16.7    Upper-level, fat-tree, full-system (32 TB) interconnect. Each high-level line represents 32 NL5 paths.

16.8    PSC’s Data Supercell delivers a high-performance, cost-effective approach to long-term, petascale storage.

17.1    DCM of Gordon, Kraken, and Ranger.

17.2    Gordon compute and I/O node racks.

17.3    Gordon architecture.

17.4    High-level view of vSMP node built from 16 compute nodes and one I/O node.

17.5    4x4x4 torus of switches.

17.6    Compute and I/O node network topology.

17.7    Data Oasis parallel filesystem.

17.8    The full Gordon system consists of 21 racks, including 16 compute node racks, four I/O node racks, and one service rack.

17.9    Gordon project timeline.

17.10  Breakdown of time into I/O and computation on the execution timeline.

17.11  I/O time per file using disks and flash.

17.12  Screenshot of Linux top command on vSMP node during single-core Velvet run.

17.13  Screenshot of vsmpstat command on vSMP node during single-core Velvet run.

17.14  Event counts for Velvet run on vSMP node. Profiling information obtained using vsmpprof and display generated using the logpar post-processing tool.

17.15  CPU utilization for Velvet run on vSMP node. Profiling information obtained using vsmpprof and display generated using the logpar post-processing tool.

17.16  FFTW and PTRANS performance.

17.17  Random ring bandwidth and latency.

17.18  Abaqus benchmarks for large memory simulation of the response of trabecular bone to mechanical stress.

17.19  Assembly of human chromosome 19 from brain and blood samples using Velvet. The results compare performance on a Gordon vSMP node and on UCSD's PDAF (Petascale Data Analysis Facility), a large physically shared memory system.

17.20  Number of Gordon allocations by domain.

17.21  Service units (compute core hours) allocated on Gordon by domain.

17.22  SDSC Founding Management Team.

18.1    History of CSCS Cray XT and XE Series flagship systems.

18.2    System allocation by application names for 2011 (approximate values).

18.3    System usage by domains on the CSCS Cray platforms (from 2006 to 2011).

18.4    CP2K with hybrid MPI/OpenMP programming exceeding ideal scaling up to 32,768 cores on a Cray XT5 platform.

18.5    Bulldozer micro-architecture and processor setup.

18.6    Layout of the 4-way NUMA domain of a Cray XE6 node with the default MPI and thread binding onto 32 processor cores.

18.7    Comparison of the Cray Gemini networking technology with its predecessor.

18.8    Layout of the physical topology in terms of the basic building blocks: blade, chassis, and racks.

18.9    Inter- and intra-node performance on the Cray XE6 platform with Gemini interconnect as compared to a Cray XT5 platform with SeaStarII interconnect.

18.10  Performance of the cachebench benchmark highlighting the impact of the different compilers for utilizing the Interlagos cache and memory hierarchies.

18.11  Performance of a multithreaded DGEMM using Cray libsci on a Cray XE6 Interlagos node with different numbers of threads and thread placement schemes.

18.12  A utility called xtnodestat can be used to find out the mapping of user jobs onto the physical resources.

18.13  MPI bandwidth results demonstrating the impact of NUMA placements.

18.14  Scratch and project file systems storage setup and integration of data analytics and visualization resource.

18.15  CSCS’ new building, data center, and offices.

18.16  Machine room with supercomputers in the foreground and cooling islands for additional hardware in the background.

19.1    Profile of the number of users.

19.2    Profile of resource usage.

19.3    Structure of TH-1A system.

19.4    TH-1A system.

19.5    Architecture of compute node.

19.6    Structure of NRC.

19.7    Structure of NIC.

19.8    TH-1A interconnection network.

19.9    Architecture of TH-1A storage system.

19.10  Architecture of HPUC.

19.11  Hybrid programming model for the TH-1A system.

19.12  Linpack result on TH-1A.

19.13  Software architecture of JASMIN.

20.1    The old computer room that housed TSUBAME1.

20.2    The TSUBAME1.0 supercomputing grid cluster.

20.3    NVIDIA Tesla S1070 GPU retrofitted to TSUBAME1; NVIDIA Tesla M2050 GPU used in TSUBAME2.0.

20.4    Fiber optic cables are aggregated into the large InfiniBand core switch fabric from the lower-tier edge switches.

20.5    Individual compute nodes in TSUBAME2.0 are 1/4th the size of TSUBAME1 nodes, despite being 10 times more computationally powerful.

20.6    Various storage servers in TSUBAME2.0 for Lustre, GPFS, NFS, CIFS, etc. Each disk enclosure contains 60 units of 2-terabyte SATA HDDs. TSUBAME2's SL8500 tape drive system, with over 10,000 tape slots, is located in a separate building. TSUBAME2.0 embeds two or more SSD drives of 60 GB or 120 GB on each node, depending on the node memory capacity.

20.7    TSUBAME2.0 system configuration.

20.8    Despite the fact that the performance boost is more than 30 times compared to TSUBAME1.2, the space required for installation has been reduced.

20.9    Cooling: modular cooling system.

20.10  The long road from TSUBAME1.0 to 2.0 (Part Two): TSUBAME2.0 Linpack execution output.

20.11  November 2011 Green500 Special Award for “Greenest Production Supercomputer in the World.”

20.12  The top-ranking machines of the November 2010 TOP500 and Green500 and their corresponding rankings on the other list.

20.13  ASUCA benchmarking on TSUBAME2.0.

20.14  ASUCA benchmarking on TSUBAME2.0.

20.15  Snapshot of the dendritic solidification growth.

20.16  ASUCA in real operation, describing a typhoon.

20.17  PPI network prediction.

20.18  Interstellar atomic hydrogen turbulence.

20.19  Illustration of the FMO-NMR calculation concept.

20.20  Simulation vs. experiment for the dam-break problem.

20.21  Images of a material microstructure.

20.22  Solidification process of an alloy observed at SPring-8. Solidification growth simulated by the phase-field model using GPUs.

20.23  Geometry of the simulated coronary arteries, with the underlying level of red blood cells embedded in the lattice Boltzmann mesh.

21.1    Block diagram of computation node.

21.2    System view of HA-PACS base cluster and computation node blade.

21.3    Hierarchical structure from quarks to nuclei.

21.4    Expected QCD phase diagram on T-µ plane with T the temperature and µ the quark chemical potential.

21.5    Targets of molecular dynamics (MD) simulations in the computational bioscience fields from macromolecule systems to small proteins.

21.6    Performance of GPU accelerated MD using NAMD@HA-PACS.

21.7    Difference between the original GPU-to-GPU communication among nodes (through CPU memory and the network card) and the Tightly Coupled Accelerators architecture with PEACH2 technology.

21.8    Block diagram of PEACH2 chip.

21.9    Photo of PEACH2 board.

21.10  An example code segment of XMP.

21.11  An example code segment of XMP-dev.

22.1    System architecture at Argonne.

22.2    Photos of the Magellan system at Argonne.

22.3    System architecture at NERSC.

22.4    Photos of the Magellan system at NERSC.

22.5    Features of cloud computing that are of interest to scientific users.

22.6    SPEC CPU benchmark results.

22.7    Performance of PARATEC and MILC plotted on a log-log scale as a function of core count using several different interconnect technologies and/or protocols.

23.1    Selected key events in HPC, grid, and cloud computing.

23.2    Network infrastructure of FutureGrid.

23.3    FutureGrid software architecture.

23.4    Google Trends shows the popularity of OpenStack rising significantly.

23.5    Architecture of RAIN to conduct image provisioning on IaaS and baremetal.

23.6    Image registration.

23.7    The concept of raining allows images to be provisioned dynamically on baremetal as well as in virtualized environments, enabling performance experiments between virtualized and nonvirtualized infrastructures.

23.8    Using a public and private cloud for read mapping.
