Book Description

Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high-performance computing and the optimization of hybrid programs. These programs typically combine distributed-memory and shared-memory programming models, using the Message Passing Interface (MPI) for communication between processes and OpenMP for multithreading. The goal is high performance at low power consumption on enterprise-class workstations and compute clusters.

The book focuses on optimization for clusters built on Intel® Xeon processors, but the optimization methodologies also apply to the Intel® Xeon Phi™ coprocessor and to heterogeneous clusters that mix both architectures. In addition to the tutorial and reference content, the authors address and refute many myths and misconceptions surrounding the topic. The text is enriched by descriptions of real-life situations.

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. About ApressOpen
  5. Dedication
  6. Contents at a Glance
  7. Contents
  8. About the Authors
  9. About the Technical Reviewers
  10. Acknowledgments
  11. Foreword
  12. Introduction
  13. Chapter 1: No Time to Read This Book?
    1. Using Intel MPI Library
    2. Using Intel Composer XE
    3. Tuning Intel MPI Library
      1. Gather Built-in Statistics
      2. Optimize Process Placement
      3. Optimize Thread Placement
    4. Tuning Intel Composer XE
      1. Analyze Optimization and Vectorization Reports
      2. Use Interprocedural Optimization
    5. Summary
    6. References
  14. Chapter 2: Overview of Platform Architectures
    1. Performance Metrics and Targets
      1. Latency, Throughput, Energy, and Power
      2. Peak Performance as the Ultimate Limit
      3. Scalability and Maximum Parallel Speedup
      4. Bottlenecks and a Bit of Queuing Theory
      5. Roofline Model
    2. Performance Features of Computer Architectures
      1. Increasing Single-Threaded Performance: Where You Can and Cannot Help
      2. Process More Data with SIMD Parallelism
      3. Distributed and Shared Memory Systems
    3. HPC Hardware Architecture Overview
      1. A Multicore Workstation or a Server Compute Node
      2. Coprocessor for Highly Parallel Applications
      3. Group of Similar Nodes Form an HPC Cluster
      4. Other Important Components of HPC Systems
    4. Summary
    5. References
  15. Chapter 3: Top-Down Software Optimization
    1. The Three Levels and Their Impact on Performance
      1. System Level
      2. Application Level
      3. Microarchitecture Level
    2. Closed-Loop Methodology
      1. Workload, Application, and Baseline
      2. Iterating the Optimization Process
    3. Summary
    4. References
  16. Chapter 4: Addressing System Bottlenecks
    1. Classifying System-Level Bottlenecks
      1. Identifying Issues Related to System Condition
      2. Characterizing Problems Caused by System Configuration
    2. Understanding System-Level Performance Limits
      1. Checking General Compute Subsystem Performance
      2. Testing Memory Subsystem Performance
      3. Testing I/O Subsystem Performance
    3. Characterizing Application System-Level Issues
      1. Selecting Performance Characterization Tools
      2. Monitoring the I/O Utilization
      3. Analyzing Memory Bandwidth
    4. Summary
    5. References
  17. Chapter 5: Addressing Application Bottlenecks: Distributed Memory
    1. Algorithm for Optimizing MPI Performance
    2. Comprehending the Underlying MPI Performance
      1. Recalling Some Benchmarking Basics
      2. Gauging Default Intranode Communication Performance
      3. Gauging Default Internode Communication Performance
      4. Discovering Default Process Layout and Pinning Details
      5. Gauging Physical Core Performance
    3. Doing Initial Performance Analysis
      1. Is It Worth the Trouble?
    4. Getting an Overview of Scalability and Performance
      1. Learning Application Behavior
      2. Choosing Representative Workload(s)
      3. Balancing Process and Thread Parallelism
      4. Doing a Scalability Review
      5. Analyzing the Details of the Application Behavior
    5. Choosing the Optimization Objective
      1. Detecting Load Imbalance
    6. Dealing with Load Imbalance
      1. Classifying Load Imbalance
      2. Addressing Load Imbalance
    7. Optimizing MPI Performance
      1. Classifying the MPI Performance Issues
      2. Addressing MPI Performance Issues
      3. Mapping Application onto the Platform
      4. Tuning the Intel MPI Library
      5. Optimizing Application for Intel MPI
    8. Using Advanced Analysis Techniques
      1. Automatically Checking MPI Program Correctness
      2. Comparing Application Traces
      3. Instrumenting Application Code
      4. Correlating MPI and Hardware Events
    9. Summary
    10. References
  18. Chapter 6: Addressing Application Bottlenecks: Shared Memory
    1. Profiling Your Application
      1. Using VTune Amplifier XE for Hotspots Profiling
      2. Hotspots for the HPCG Benchmark
      3. Compiler-Assisted Loop/Function Profiling
    2. Sequential Code and Detecting Load Imbalances
    3. Thread Synchronization and Locking
    4. Dealing with Memory Locality and NUMA Effects
    5. Thread and Process Pinning
      1. Controlling OpenMP Thread Placement
      2. Thread Placement in Hybrid Applications
    6. Summary
    7. References
  19. Chapter 7: Addressing Application Bottlenecks: Microarchitecture
    1. Overview of a Modern Processor Pipeline
      1. Pipelined Execution
      2. Out-of-order vs. In-order Execution
      3. Superscalar Pipelines
      4. SIMD Execution
      5. Speculative Execution: Branch Prediction
      6. Memory Subsystem
      7. Putting It All Together: A Final Look at the Sandy Bridge Pipeline
      8. A Top-down Method for Categorizing the Pipeline Performance
    2. Intel Composer XE Usage for Microarchitecture Optimizations
      1. Basic Compiler Usage and Optimization
      2. Using Optimization and Vectorization Reports to Read the Compiler’s Mind
      3. Optimizing for Vectorization
      4. Dealing with Disambiguation
      5. Dealing with Branches
      6. When Optimization Leads to Wrong Results
    3. Analyzing Pipeline Performance with Intel VTune Amplifier XE
    4. Summary
    5. References
  20. Chapter 8: Application Design Considerations
    1. Abstraction and Generalization of the Platform Architecture
      1. Types of Abstractions
      2. Levels of Abstraction and Complexities
      3. Raw Hardware vs. Virtualized Hardware in the Cloud
    2. Questions about Application Design
      1. Designing for Performance and Scaling
      2. Designing for Flexibility and Performance Portability
      3. Understanding Bounds and Projecting Bottlenecks
      4. Data Storage or Transfer vs. Recalculation
      5. Total Productivity Assessment
    3. Summary
    4. References
  21. Index