
Book Description

Solaris™ Application Programming is a comprehensive guide to optimizing the performance of applications running in your Solaris environment. From the fundamentals of system performance to using analysis and optimization tools to their fullest, this wide-ranging resource shows developers and software architects how to get the most from Solaris systems and applications.

Whether you’re new to performance analysis and optimization or an experienced developer searching for the most efficient ways to solve performance issues, this practical guide gives you the background information, tips, and techniques for developing, optimizing, and debugging applications on Solaris.

The text begins with a detailed overview of the components that affect system performance. It then explains the many developer tools included with the Solaris OS and the Sun Studio compiler, before taking you beyond the basics with practical, real-world examples. In addition, you will learn how to use the rich set of developer tools to identify performance problems, accurately interpret output from the tools, and choose the smartest, most efficient approach to correcting specific problems and achieving maximum system performance.

Coverage includes

  • A discussion of the chip multithreading (CMT) processors from Sun and how they change the way that developers need to think about performance

  • A detailed introduction to the performance analysis and optimization tools included with the Solaris OS and Sun Studio compiler

  • Practical examples for using the developer tools to their fullest, including informational tools, compilers, floating point optimizations, libraries and linking, performance profilers, and debuggers

  • Guidelines for interpreting tool analysis output

  • Optimization, including hardware performance counter metrics and source code optimizations

  • Techniques for improving application performance using multiple processes or multiple threads

  • An overview of hardware and software components that affect system performance, including coverage of SPARC and x64 processors

Table of Contents

    1. Copyright
    2. Preface
      1. About This Book
      2. Goals and Assumptions
      3. Chapter Overview
      4. Acknowledgments
    3. I. Overview of the Processor
      1. 1. The Generic Processor
        1. 1.1. Chapter Objectives
        2. 1.2. The Components of a Processor
        3. 1.3. Clock Speed
        4. 1.4. Out-of-Order Processors
        5. 1.5. Chip Multithreading
        6. 1.6. Execution Pipes
          1. 1.6.1. Instruction Latency
          2. 1.6.2. Load/Store Pipe
          3. 1.6.3. Integer Operation Pipe
          4. 1.6.4. Branch Pipe
          5. 1.6.5. Floating-Point Pipe
        7. 1.7. Caches
        8. 1.8. Interacting with the System
          1. 1.8.1. Bandwidth and Latency
          2. 1.8.2. System Buses
        9. 1.9. Virtual Memory
          1. 1.9.1. Overview
          2. 1.9.2. TLBs and Page Size
        10. 1.10. Indexing and Tagging of Memory
        11. 1.11. Instruction Set Architecture
      2. 2. The SPARC Family
        1. 2.1. Chapter Objectives
        2. 2.2. The UltraSPARC Family
          1. 2.2.1. History of the SPARC Architecture
          2. 2.2.2. UltraSPARC Processors
        3. 2.3. The SPARC Instruction Set
          1. 2.3.1. A Guide to the SPARC Instruction Set
          2. 2.3.2. Integer Registers
          3. 2.3.3. Register Windows
          4. 2.3.4. Floating-Point Registers
        4. 2.4. 32-bit and 64-bit Code
        5. 2.5. The UltraSPARC III Family of Processors
          1. 2.5.1. The Core of the CPU
          2. 2.5.2. Communicating with Memory
          3. 2.5.3. Prefetch
          4. 2.5.4. Blocking Load on Data Cache Misses
          5. 2.5.5. UltraSPARC III-Based Systems
          6. 2.5.6. Total Store Ordering
        6. 2.6. UltraSPARC T1
        7. 2.7. UltraSPARC T2
        8. 2.8. SPARC64 VI
      3. 3. The x64 Family of Processors
        1. 3.1. Chapter Objectives
        2. 3.2. The x64 Family of Processors
        3. 3.3. The x86 Processor: CISC and RISC
        4. 3.4. Byte Ordering
        5. 3.5. Instruction Template
        6. 3.6. Registers
        7. 3.7. Instruction Set Extensions and Floating Point
        8. 3.8. Memory Ordering
    4. II. Developer Tools
      1. 4. Informational Tools
        1. 4.1. Chapter Objectives
        2. 4.2. Tools That Report System Configuration
          1. 4.2.1. Introduction
          2. 4.2.2. Reporting General System Information (prtdiag, prtconf, prtpicl, prtfru)
          3. 4.2.3. Enabling Virtual Processors (psrinfo and psradm)
          4. 4.2.4. Controlling the Use of Processors through Processor Sets or Binding (psrset and pbind)
          5. 4.2.5. Reporting Instruction Sets Supported by Hardware (isalist)
          6. 4.2.6. Reporting TLB Page Sizes Supported by Hardware (pagesize)
          7. 4.2.7. Reporting a Summary of SPARC Hardware Characteristics (fpversion)
        3. 4.3. Tools That Report Current System Status
          1. 4.3.1. Introduction
          2. 4.3.2. Reporting Virtual Memory Utilization (vmstat)
          3. 4.3.3. Reporting Swap File Usage (swap)
          4. 4.3.4. Reporting Process Resource Utilization (prstat)
          5. 4.3.5. Listing Processes (ps)
          6. 4.3.6. Locating the Process ID of an Application (pgrep)
          7. 4.3.7. Reporting Activity for All Processors (mpstat)
          8. 4.3.8. Reporting Kernel Statistics (kstat)
          9. 4.3.9. Generating a Report of System Activity (sar)
          10. 4.3.10. Reporting I/O Activity (iostat)
          11. 4.3.11. Reporting Network Activity (netstat)
          12. 4.3.12. The snoop Command
          13. 4.3.13. Reporting Disk Space Utilization (df)
          14. 4.3.14. Reporting Disk Space Used by Files (du)
        4. 4.4. Process- and Processor-Specific Tools
          1. 4.4.1. Introduction
          2. 4.4.2. Timing Process Execution (time, timex, and ptime)
          3. 4.4.3. Reporting System-Wide Hardware Counter Activity (cpustat)
          4. 4.4.4. Reporting Hardware Performance Counter Activity for a Single Process (cputrack)
          5. 4.4.5. Reporting Bus Activity (busstat)
          6. 4.4.6. Reporting on Trap Activity (trapstat)
          7. 4.4.7. Reporting Virtual Memory Mapping Information for a Process (pmap)
          8. 4.4.8. Examining the Command-Line Arguments Passed to a Process (pargs)
          9. 4.4.9. Reporting the Files Held Open by a Process (pfiles)
          10. 4.4.10. Examining the Current Stack of a Process (pstack)
          11. 4.4.11. Tracing Application Execution (truss)
          12. 4.4.12. Exploring User Code and Kernel Activity with dtrace
        5. 4.5. Information about Applications
          1. 4.5.1. Reporting Library Linkage (ldd)
          2. 4.5.2. Reporting the Type of Contents Held in a File (file)
          3. 4.5.3. Reporting Symbols in a File (nm)
          4. 4.5.4. Reporting Library Version Information (pvs)
          5. 4.5.5. Examining the Disassembly of an Application, Library, or Object (dis)
          6. 4.5.6. Reporting the Size of the Various Segments in an Application, Library, or Object (size)
          7. 4.5.7. Reporting Metadata Held in a File (dumpstabs, dwarfdump, elfdump, dump, and mcs)
      2. 5. Using the Compiler
        1. 5.1. Chapter Objectives
        2. 5.2. Three Sets of Compiler Options
        3. 5.3. Using -xtarget=generic on x86
        4. 5.4. Optimization
          1. 5.4.1. Optimization Levels
          2. 5.4.2. Using the -O Optimization Flag
          3. 5.4.3. Using the -fast Compiler Flag
          4. 5.4.4. Specifying Architecture with -fast
          5. 5.4.5. Deconstructing -fast
          6. 5.4.6. Performance Optimizations in -fast (for the Sun Studio 12 Compiler)
        5. 5.5. Generating Debug Information
          1. 5.5.1. Debug Information Flags
          2. 5.5.2. Debug and Optimization
        6. 5.6. Selecting the Target Machine Type for an Application
          1. 5.6.1. Choosing between 32-bit and 64-bit Applications
          2. 5.6.2. The Generic Target
          3. 5.6.3. Specifying Cache Configuration Using the -xcache Flag
          4. 5.6.4. Specifying Code Scheduling Using the -xchip Flag
          5. 5.6.5. The -xarch Flag and -m32/-m64
        7. 5.7. Code Layout Optimizations
          1. 5.7.1. Introduction
          2. 5.7.2. Crossfile Optimization
          3. 5.7.3. Mapfiles
          4. 5.7.4. Profile Feedback
          5. 5.7.5. Link-Time Optimization
        8. 5.8. General Compiler Optimizations
          1. 5.8.1. Prefetch Instructions
          2. 5.8.2. Enabling Prefetch Generation (-xprefetch)
          3. 5.8.3. Controlling the Aggressiveness of Prefetch Insertion (-xprefetch_level)
          4. 5.8.4. Enabling Dependence Analysis (-xdepend)
          5. 5.8.5. Handling Misaligned Memory Accesses on SPARC (-xmemalign/-dalign)
          6. 5.8.6. Setting Page Size Using -xpagesize=<size>
        9. 5.9. Pointer Aliasing in C and C++
          1. 5.9.1. The Problem with Pointers
          2. 5.9.2. Diagnosing Aliasing Problems
          3. 5.9.3. Using Restricted Pointers in C and C++ to Reduce Aliasing Issues
          4. 5.9.4. Using the -xalias_level Flag to Specify the Degree of Pointer Aliasing
          5. 5.9.5. -xalias_level for C
          6. 5.9.6. -xalias_level=any in C
          7. 5.9.7. -xalias_level=basic in C
          8. 5.9.8. -xalias_level=weak in C
          9. 5.9.9. -xalias_level=layout in C
          10. 5.9.10. -xalias_level=strict in C
          11. 5.9.11. -xalias_level=std in C
          12. 5.9.12. -xalias_level=strong in C
          13. 5.9.13. -xalias_level in C++
          14. 5.9.14. -xalias_level=simple in C++
          15. 5.9.15. -xalias_level=compatible in C++
        10. 5.10. Other C- and C++-Specific Compiler Optimizations
          1. 5.10.1. Enabling the Recognition of Standard Library Routines (-xbuiltin)
        11. 5.11. Fortran-Specific Compiler Optimizations
          1. 5.11.1. Aligning Variables for Optimal Layout (-xpad)
          2. 5.11.2. Placing Local Variables on the Stack (-xstackvar)
        12. 5.12. Compiler Pragmas
          1. 5.12.1. Introduction
          2. 5.12.2. Specifying Alignment of Variables
          3. 5.12.3. Specifying a Function’s Access to Global Data
          4. 5.12.4. Specifying That a Function Has No Side Effects
          5. 5.12.5. Specifying That a Function Is Infrequently Called
          6. 5.12.6. Specifying a Safe Degree of Pipelining for a Particular Loop
          7. 5.12.7. Specifying That a Loop Has No Memory Dependencies within a Single Iteration
          8. 5.12.8. Specifying the Degree of Loop Unrolling
        13. 5.13. Using Pragmas in C for Finer Aliasing Control
          1. 5.13.1. Asserting the Degree of Aliasing between Variables
          2. 5.13.2. Asserting That Variables Do Alias
          3. 5.13.3. Asserting Aliasing with Nonpointer Variables
          4. 5.13.4. Asserting That Variables Do Not Alias
          5. 5.13.5. Asserting No Aliasing with Nonpointer Variables
        14. 5.14. Compatibility with GCC
      3. 6. Floating-Point Optimization
        1. 6.1. Chapter Objectives
        2. 6.2. Floating-Point Optimization Flags
          1. 6.2.1. Mathematical Optimizations in -fast
          2. 6.2.2. IEEE-754 and Floating Point
          3. 6.2.3. Vectorizing Floating-Point Computation (-xvector)
          4. 6.2.4. Vectorizing Computation Using SIMD Instructions (-xvector=simd) (x64 Only)
          5. 6.2.5. Subnormal Numbers
          6. 6.2.6. Flushing Subnormal Numbers to Zero (-fns)
          7. 6.2.7. Handling Values That Are Not-a-Number
          8. 6.2.8. Enabling Floating-Point Expression Simplification (-fsimple)
          9. 6.2.9. Elimination of Comparisons
          10. 6.2.10. Elimination of Unnecessary Calculation
          11. 6.2.11. Reordering of Calculations
          12. 6.2.12. Kahan Summation Formula
          13. 6.2.13. Hoisting of Divides
          14. 6.2.14. Honoring of Parentheses at Levels of Floating-Point Simplification
          15. 6.2.15. Effect of -fast on errno
          16. 6.2.16. Specifying Which Floating-Point Events Cause Traps (-ftrap)
          17. 6.2.17. The Floating-Point Exception Flags
          18. 6.2.18. Floating-Point Exceptions in C99
          19. 6.2.19. Using Inline Template Versions of Floating-Point Functions (-xlibmil)
          20. 6.2.20. Using the Optimized Math Library (-xlibmopt)
          21. 6.2.21. Do Not Promote Single-Precision Values to Double Precision (-fsingle for C)
          22. 6.2.22. Storing Floating-Point Constants in Single Precision (-xsfpconst for C)
        3. 6.3. Floating-Point Multiply Accumulate Instructions
        4. 6.4. Integer Math
          1. 6.4.1. Other Integer Math Opportunities
        5. 6.5. Floating-Point Parameter Passing with SPARC V8 Code
      4. 7. Libraries and Linking
        1. 7.1. Introduction
        2. 7.2. Linking
          1. 7.2.1. Overview of Linking
          2. 7.2.2. Dynamic and Static Linking
          3. 7.2.3. Linking Libraries
          4. 7.2.4. Creating a Static Library
          5. 7.2.5. Creating a Dynamic Library
          6. 7.2.6. Specifying the Location of Libraries
          7. 7.2.7. Lazy Loading of Libraries
          8. 7.2.8. Initialization and Finalization Code in Libraries
          9. 7.2.9. Symbol Scoping
          10. 7.2.10. Library Interposition
          11. 7.2.11. Using the Debug Interface
          12. 7.2.12. Using the Audit Interface
        3. 7.3. Libraries of Interest
          1. 7.3.1. The C Runtime Library (libc and libc_psr)
          2. 7.3.2. Memory Management Libraries
          3. 7.3.3. libfast
          4. 7.3.4. The Performance Library
          5. 7.3.5. STLport4
        4. 7.4. Library Calls
          1. 7.4.1. Library Routines for Timing
          2. 7.4.2. Picking the Most Appropriate Library Routines
          3. 7.4.3. SIMD Instructions and the Media Library
          4. 7.4.4. Searching Arrays Using VIS Instructions
      5. 8. Performance Profiling Tools
        1. 8.1. Introduction
        2. 8.2. The Sun Studio Performance Analyzer
        3. 8.3. Collecting Profiles
        4. 8.4. Compiling for the Performance Analyzer
        5. 8.5. Viewing Profiles Using the GUI
        6. 8.6. Caller–Callee Information
        7. 8.7. Using the Command-Line Tool for Performance Analysis
        8. 8.8. Interpreting Profiles
        9. 8.9. Interpreting Profiles from UltraSPARC III/IV Processors
        10. 8.10. Profiling Using Performance Counters
        11. 8.11. Interpreting Call Stacks
        12. 8.12. Generating Mapfiles
        13. 8.13. Generating Reports on Performance Using spot
        14. 8.14. Profiling Memory Access Patterns
        15. 8.15. er_kernel
        16. 8.16. Tail-Call Optimization and Debug
        17. 8.17. Gathering Profile Information Using gprof
        18. 8.18. Using tcov to Get Code Coverage Information
        19. 8.19. Using dtrace to Gather Profile and Coverage Information
        20. 8.20. Compiler Commentary
      6. 9. Correctness and Debug
        1. 9.1. Introduction
        2. 9.2. Compile-Time Checking
          1. 9.2.1. Introduction
          2. 9.2.2. Compile-Time Checking for C Source Code
          3. 9.2.3. Checking of C Source Code Using lint
          4. 9.2.4. Source Processing Options Common to the C and C++ Compilers
          5. 9.2.5. C++
          6. 9.2.6. Fortran
        3. 9.3. Runtime Checking
          1. 9.3.1. Bounds Checking
          2. 9.3.2. watchmalloc
          3. 9.3.3. Debugging Options under Other mallocs
          4. 9.3.4. Runtime Array Bounds Checking in Fortran
          5. 9.3.5. Runtime Stack Overflow Checking
          6. 9.3.6. Memory Access Error Detection Using discover
        4. 9.4. Debugging Using dbx
          1. 9.4.1. Debug Compiler Flags
          2. 9.4.2. Debug and Optimization
          3. 9.4.3. Debug Information Format
          4. 9.4.4. Debug and OpenMP
          5. 9.4.5. Frame Pointer Optimization on x86
          6. 9.4.6. Running the Debugger on a Core File
          7. 9.4.7. Example of Debugging an Application
          8. 9.4.8. Running an Application under dbx
        5. 9.5. Locating Optimization Bugs Using ATS
        6. 9.6. Debugging Using mdb
    5. III. Optimization
      1. 10. Performance Counter Metrics
        1. 10.1. Chapter Objectives
        2. 10.2. Reading the Performance Counters
        3. 10.3. UltraSPARC III and UltraSPARC IV Performance Counters
          1. 10.3.1. Instructions and Cycles
          2. 10.3.2. Data Cache Events
          3. 10.3.3. Instruction Cache Events
          4. 10.3.4. Second-Level Cache Events
          5. 10.3.5. Cycles Lost to Cache Miss Events
          6. 10.3.6. Example of Cache Access Metrics
          7. 10.3.7. Synthetic Metrics for Latency
          8. 10.3.8. Synthetic Metrics for Memory Bandwidth Consumption
          9. 10.3.9. Prefetch Cache Events
          10. 10.3.10. Comparison of Performance Counters with and without Prefetch
          11. 10.3.11. Write Cache Events
          12. 10.3.12. Cycles Lost to Processor Stall Events
          13. 10.3.13. Branch Misprediction
          14. 10.3.14. Memory Controller Events
        4. 10.4. Performance Counters on the UltraSPARC IV and UltraSPARC IV+
          1. 10.4.1. Introduction
          2. 10.4.2. UltraSPARC IV+ L3 Cache
          3. 10.4.3. Memory Controller Events
        5. 10.5. Performance Counters on the UltraSPARC T1
          1. 10.5.1. Hardware Performance Counters
          2. 10.5.2. UltraSPARC T1 Cycle Budget
          3. 10.5.3. Performance Counters at the Core Level
          4. 10.5.4. Calculating System Bandwidth Consumption
        6. 10.6. UltraSPARC T2 Performance Counters
        7. 10.7. SPARC64 VI Performance Counters
        8. 10.8. Opteron Performance Counters
          1. 10.8.1. Introduction
          2. 10.8.2. Instructions
          3. 10.8.3. Instruction Cache Events
          4. 10.8.4. Data Cache Events
          5. 10.8.5. TLB Events
          6. 10.8.6. Branch Events
          7. 10.8.7. Stall Cycles
      2. 11. Source Code Optimizations
        1. 11.1. Overview
        2. 11.2. Traditional Optimizations
          1. 11.2.1. Introduction
          2. 11.2.2. Loop Unrolling and Pipelining
          3. 11.2.3. Loop Peeling, Fusion, and Splitting
          4. 11.2.4. Loop Interchange and Tiling
          5. 11.2.5. Loop Invariant Hoisting
          6. 11.2.6. Common Subexpression Elimination
          7. 11.2.7. Strength Reduction
          8. 11.2.8. Function Cloning
        3. 11.3. Data Locality, Bandwidth, and Latency
          1. 11.3.1. Bandwidth
          2. 11.3.2. Integer Data
          3. 11.3.3. Storing Streams
          4. 11.3.4. Manual Prefetch
          5. 11.3.5. Latency
          6. 11.3.6. Copying and Moving Memory
        4. 11.4. Data Structures
          1. 11.4.1. Structure Reorganizing
          2. 11.4.2. Structure Prefetching
          3. 11.4.3. Considerations for Optimal Performance from Structures
          4. 11.4.4. Matrices and Accesses
          5. 11.4.5. Multiple Streams
        5. 11.5. Thrashing
          1. 11.5.1. Summary
          2. 11.5.2. Data TLB Performance Counter
        6. 11.6. Reads after Writes
        7. 11.7. Store Queue
          1. 11.7.1. Stalls
          2. 11.7.2. Detecting Store Queue Stalls
        8. 11.8. If Statements
          1. 11.8.1. Introduction
          2. 11.8.2. Conditional Moves
          3. 11.8.3. Misaligned Memory Accesses on SPARC Processors
        9. 11.9. File Handling in 32-bit Applications
          1. 11.9.1. File Descriptor Limits
          2. 11.9.2. Handling Large Files in 32-bit Applications
    6. IV. Threading and Throughput
      1. 12. Multicore, Multiprocess, Multithread
        1. 12.1. Introduction
        2. 12.2. Processes, Threads, Processors, Cores, and CMT
        3. 12.3. Virtualization
        4. 12.4. Horizontal and Vertical Scaling
        5. 12.5. Parallelization
        6. 12.6. Scaling Using Multiple Processes
          1. 12.6.1. Multiple Processes
          2. 12.6.2. Multiple Cooperating Processes
          3. 12.6.3. Parallelism Using MPI
        7. 12.7. Multithreaded Applications
          1. 12.7.1. Parallelization Using Pthreads
          2. 12.7.2. Thread Local Storage
          3. 12.7.3. Mutexes
          4. 12.7.4. Using Atomic Operations
          5. 12.7.5. False Sharing
          6. 12.7.6. Memory Layout for a Threaded Application
        8. 12.8. Parallelizing Applications Using OpenMP
        9. 12.9. Using OpenMP Directives to Parallelize Loops
        10. 12.10. Using the OpenMP API
        11. 12.11. Parallel Sections
          1. 12.11.1. Setting Stack Sizes for OpenMP
        12. 12.12. Automatic Parallelization of Applications
        13. 12.13. Profiling Multithreaded Applications
        14. 12.14. Detecting Data Races in Multithreaded Applications
        15. 12.15. Debugging Multithreaded Code
        16. 12.16. Parallelizing a Serial Application
          1. 12.16.1. Example Application
          2. 12.16.2. Impact of Optimization on Serial Performance
          3. 12.16.3. Profiling the Serial Application
          4. 12.16.4. Unrolling the Critical Loop
          5. 12.16.5. Parallelizing Using Pthreads
          6. 12.16.6. Parallelization Using OpenMP
          7. 12.16.7. Auto-Parallelization
          8. 12.16.8. Load Balancing with OpenMP
          9. 12.16.9. Sharing Data between Threads
          10. 12.16.10. Sharing Variables between Threads Using OpenMP
    7. V. Concluding Remarks
      1. 13. Performance Analysis
        1. 13.1. Introduction
        2. 13.2. Algorithms and Complexity
        3. 13.3. Tuning Serial Code
        4. 13.4. Exploring Parallelism
        5. 13.5. Optimizing for CMT Processors