0%

An Expert Guide to Software Performance Optimization

From mobile and cloud apps to video games to driverless vehicle control, more and more software is time-constrained: It must deliver reliable results seamlessly, consistently, and virtually instantaneously. If it doesn't, customers are unhappy--and sometimes lives are put at risk. When complex software underperforms or fails, software engineers need to identify and address the root causes. This is difficult and, historically, few tools have been available to help.

In Understanding Software Dynamics, performance expert Richard L. Sites tackles the problem head on, offering expert methods and advanced tools for understanding complex, time-constrained software dynamics, improving reliability and troubleshooting challenging performance problems.

Sites draws on several decades of experience pioneering software performance optimization, as well as extensive experience teaching graduate-level developers. He introduces principles and techniques for use in any environment, from embedded devices to datacenters, illuminating them with examples based on x86 or ARM processors running Linux and linked by Ethernet. He also guides readers through building and applying a powerful, new, extremely low-overhead open-source software tool, KUtrace, to precisely trace executions on every CPU core. Using insights gleaned from this tool, readers can apply nuanced solutions--not merely brute-force techniques such as turning off caches or cores.

  • Measure and address issues associated with CPUs, memory, disk/SSD, networks, and their interactions

  • Fix programs that are always too slow, and those that sometimes lag for no apparent reason

  • Design useful observability, logging, and time-stamping capabilities into your code

  • Reason more effectively about performance data to see why reality differs from expectations

  • Identify problems such as excess execution, slow instruction execution, waiting for resources, and software locks

Understanding Software Dynamics will be valuable to experienced software professionals, including application and OS developers, hardware and system architects, real-time system designers, and game developers, as well as advanced students.

Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.

Table of Contents

  1. Cover Page
  2. About This eBook
  3. Halftitle Page
  4. Title Page
  5. Copyright Page
  6. Dedication
  7. Contents at a Glance
  8. Contents
  9. Foreword
  10. Preface
  11. Acknowledgments
  12. About the Author
  13. Part I: Measurement
    1. Chapter 1. My Program Is Too Slow
    2. 1.1 Datacenter Context
    3. 1.2 Datacenter Hardware
    4. 1.3 Datacenter Software
    5. 1.4 Long-Tail Latency
    6. 1.5 Thought Framework
    7. 1.6 Order-of-Magnitude Estimates
    8. 1.7 Why Are Transactions Slow?
    9. 1.8 The Five Fundamental Resources
    10. 1.9 Summary
    11. Chapter 2. Measuring CPUs
    12. 2.1 How We Got Here
    13. 2.2 Where Are We Now?
    14. 2.3 Measuring the Latency of an add Instruction
    15. 2.4 Straight-Line Code Fail
    16. 2.5 Simple Loop, Loop Overhead Fail, Optimizing Compiler Fail
    17. 2.6 Dead Variable Fail
    18. 2.7 Better Loop
    19. 2.8 Dependent Variables
    20. 2.9 Actual Execution Latency
    21. 2.10 More Nuance
    22. 2.11 Summary
    23. Exercises
    24. Chapter 3. Measuring Memory
    25. 3.1 Memory Timing
    26. 3.2 About Memory
    27. 3.3 Cache Organization
    28. 3.4 Data Alignment
    29. 3.5 Translation Lookaside Buffer Organization
    30. 3.6 The Measurements
    31. 3.7 Measuring Cache Line Size
    32. 3.8 Problem: N+1 Prefetching
    33. 3.9 Dependent Loads
    34. 3.10 Non-random Dynamic Random-Access Memory
    35. 3.11 Measuring Total Size of Each Cache Level
    36. 3.12 Measuring Cache Associativity of Each Level
    37. 3.13 Translation Buffer Time
    38. 3.14 Cache Underutilization
    39. 3.15 Summary
    40. Exercises
    41. Chapter 4. CPU and Memory Interaction
    42. 4.1 Cache Interaction
    43. 4.2 Simple Matrix Multiply Dynamics
    44. 4.3 Estimates
    45. 4.4 Initialization, Cross-Checking, and Observing
    46. 4.5 Initial Results
    47. 4.6 Faster Matrix Multiply, Transpose Method
    48. 4.7 Faster Matrix Multiply, Subblock Method
    49. 4.8 Cache-Aware Computation
    50. 4.9 Summary
    51. Exercises
    52. Chapter 5. Measuring Disk/SSD
    53. 5.1 About Hard Disks
    54. 5.2 About SSDs
    55. 5.3 Software Disk Access and On-Disk Buffering
    56. 5.4 How Fast Is a Disk Read?
    57. 5.5 A Little Back-of-the-Envelope Calculation
    58. 5.6 How Fast Is a Disk Write?
    59. 5.7 Results
    60. 5.8 Reading from Disk
    61. 5.9 Writing to Disk
    62. 5.10 Reading from SSD
    63. 5.11 Writing to SSD
    64. 5.12 Multiple Transfers
    65. 5.13 Summary
    66. Exercises
    67. Chapter 6. Measuring Networks
    68. 6.1 About Ethernet
    69. 6.2 About Hubs, Switches, and Routers
    70. 6.3 About TCP/IP
    71. 6.4 About Packets
    72. 6.5 About Remote Procedure Calls (RPCs)
    73. 6.6 Slop
    74. 6.7 Observing Network Traffic
    75. 6.8 Sample RPC Message Definition
    76. 6.9 Sample Logging Design
    77. 6.10 Sample Client-Server System Using RPCs
    78. 6.11 Sample Server Program
    79. 6.12 Spinlocks
    80. 6.13 Sample Client Program
    81. 6.14 Measuring One Sample Client-Server RPC
    82. 6.15 Postprocessing RPC Logs
    83. 6.16 Observations
    84. 6.17 Summary
    85. Exercises
    86. Chapter 7. Disk and Network Database Interaction
    87. 7.1 Time Alignment
    88. 7.2 Multiple Clients
    89. 7.3 Spinlocks
    90. 7.4 Experiment 1
    91. 7.5 On-Disk Database
    92. 7.6 Experiment 2
    93. 7.7 Experiment 3
    94. 7.8 Logging
    95. 7.9 Understanding Transaction Latency Variation
    96. 7.10 Summary
    97. Exercises
  14. Part II: Observation
    1. Chapter 8. Logging
    2. 8.1 Observation Tools
    3. 8.2 Logging
    4. 8.3 Basic Logging
    5. 8.4 Extended Logging
    6. 8.5 Timestamps
    7. 8.6 RPC IDs
    8. 8.7 Log File Formats
    9. 8.8 Managing Log Files
    10. 8.9 Summary
    11. Chapter 9. Aggregate Measures
    12. 9.1 Uniform vs. Bursty Event Rates
    13. 9.2 Measurement Intervals
    14. 9.3 Timelines
    15. 9.4 Further Summarizing of Timelines
    16. 9.5 Histogram Time Scales
    17. 9.6 Aggregating Per-Event Measurements
    18. 9.7 Patterns of Values Over Time
    19. 9.8 Update Intervals
    20. 9.9 Example Transactions
    21. 9.10 Conclusion
    22. Chapter 10. Dashboards
    23. 10.1 Sample Service
    24. 10.2 Sample Dashboards
    25. 10.3 Master Dashboard
    26. 10.4 Per-Instance Dashboards
    27. 10.5 Per-Server Dashboards
    28. 10.6 Sanity Checks
    29. 10.7 Summary
    30. Exercises
    31. Chapter 11. Other Existing Tools
    32. 11.1 Kinds of Observation Tools
    33. 11.2 Data to Observe
    34. 11.3 top Command
    35. 11.4 /proc and /sys Pseudofiles
    36. 11.5 time Command
    37. 11.6 perf Command
    38. 11.7 oprofile, CPU Profiler
    39. 11.8 strace, System Calls
    40. 11.9 ltrace, CPU C Library Calls
    41. 11.10 ftrace, CPU Trace
    42. 11.11 mtrace, Memory Malloc/Free
    43. 11.12 blktrace, Disk Trace
    44. 11.13 tcpdump and Wireshark, Network Trace
    45. 11.14 locktrace, Critical Section Locks
    46. 11.15 Offered Load, Outbound Calls, and Transaction Latency
    47. 11.16 Summary
    48. Exercises
    49. Chapter 12. Traces
    50. 12.1 Tracing Advantages
    51. 12.2 Tracing Disadvantages
    52. 12.3 The Three Starting Questions
    53. 12.4 Example: Early Program Counter Trace
    54. 12.5 Example: Per-Function Counts and Time
    55. 12.6 Case Study: Per-Function Trace of Gmail
    56. 12.7 Summary
    57. Chapter 13. Observation Tool Design Principles
    58. 13.1 What to Observe
    59. 13.2 How Frequently and For How Long?
    60. 13.3 How Much Overhead?
    61. 13.4 Design Consequences
    62. 13.5 Case Study: Histogram Buckets
    63. 13.6 Designing Data Display
    64. 13.7 Summary
  15. Part III: Kernel-User Trace
    1. Chapter 14. KUtrace: Goals, Design, Implementation
    2. 14.1 Overview
    3. 14.2 Goals
    4. 14.3 Design
    5. 14.4 Implementation
    6. 14.5 Kernel Patches and Module
    7. 14.6 Control Program
    8. 14.7 Postprocessing
    9. 14.8 A Note on Security
    10. 14.9 Summary
    11. Chapter 15. KUtrace: Linux Kernel Patches
    12. 15.1 Trace Buffer Data Structures
    13. 15.2 Raw Traceblock Format
    14. 15.3 Trace Entries
    15. 15.4 IPC Trace Entries
    16. 15.5 Timestamps
    17. 15.6 Event Numbers
    18. 15.7 Nested Trace Entries
    19. 15.8 Code
    20. 15.9 Packet Tracing
    21. 15.10 AMD/Intel x86-64 Patches
    22. 15.11 Summary
    23. Exercises
    24. Chapter 16. KUtrace: Linux Loadable Module
    25. 16.1 Kernel Interface Data Structures
    26. 16.2 Module Load/Unload
    27. 16.3 Initializing and Controlling Tracing
    28. 16.4 Implementing Trace Calls
    29. 16.5 Insert1
    30. 16.6 InsertN
    31. 16.7 Switching to a New Traceblock
    32. 16.8 Summary
    33. Chapter 17. KUtrace: User-Mode Runtime Control
    34. 17.1 Controlling Tracing
    35. 17.2 Standalone kutrace_control Program
    36. 17.3 The Underlying kutrace_lib Library
    37. 17.4 The Control Interface to the Loadable Module
    38. 17.5 Summary
    39. Chapter 18. KUtrace: Postprocessing
    40. 18.1 Postprocessing Details
    41. 18.2 The rawtoevent Program
    42. 18.3 The eventtospan Program
    43. 18.4 The spantotrim Program
    44. 18.5 The spantospan Program
    45. 18.6 The samptoname_k and samptoname_u Programs
    46. 18.7 The makeself Program
    47. 18.8 KUtrace JSON Format
    48. 18.9 Summary
    49. Chapter 19. KUtrace: Display of Software Dynamics
    50. 19.1 Overview
    51. 19.2 Region 1, Controls
    52. 19.3 Region 2, Y-axis
    53. 19.4 Region 3, Timelines
    54. 19.5 Region 4, IPC Legend
    55. 19.6 Region 5, X-axis
    56. 19.7 Region 6, Save/Restore
    57. 19.8 Secondary Controls
    58. 19.9 Summary
  16. Part IV: Reasoning
    1. Chapter 20. What to Look For
    2. 20.1 Overview
    3. Chapter 21. Executing Too Much
    4. 21.1 Overview
    5. 21.2 The Program
    6. 21.3 The Mystery
    7. 21.4 Exploring and Reasoning
    8. 21.5 Mystery Understood
    9. 21.6 Summary
    10. Chapter 22. Executing Slowly
    11. 22.1 Overview
    12. 22.2 The Program
    13. 22.3 The Mystery
    14. 22.4 Floating-Point Antagonist
    15. 22.5 Memory Antagonist
    16. 22.6 Mystery Understood
    17. 22.7 Summary
    18. Chapter 23. Waiting for CPU
    19. 23.1 The Program
    20. 23.2 The Mystery
    21. 23.3 Exploring and Reasoning
    22. 23.4 Mystery 2
    23. 23.5 Mystery 2 Understood
    24. 23.6 Bonus Mystery
    25. 23.7 Summary
    26. Exercises
    27. Chapter 24. Waiting for Memory
    28. 24.1 The Program
    29. 24.2 The Mystery
    30. 24.3 Exploring and Reasoning
    31. 24.4 Mystery 2: Access to a Page Table
    32. 24.5 Mystery 2 Understood
    33. 24.6 Summary
    34. Exercises
    35. Chapter 25. Waiting for Disk
    36. 25.1 The Program
    37. 25.2 The Mystery
    38. 25.3 Exploring and Reasoning
    39. 25.4 Reading 40MB
    40. 25.5 Reading Sequential 4KB Blocks
    41. 25.6 Reading Random 4KB Blocks
    42. 25.7 Writing and Sync of 40MB on SSD
    43. 25.8 Reading 40MB on SSD
    44. 25.9 Two Programs Accessing Two Files at Once
    45. 25.10 Mysteries Understood
    46. 25.11 Summary
    47. Exercises
    48. Chapter 26. Waiting for Network
    49. 26.1 Overview
    50. 26.2 The Programs
    51. 26.3 Experiment 1
    52. 26.4 Experiment 1 Mystery
    53. 26.5 Experiment 1 Exploring and Reasoning
    54. 26.6 Experiment 1 What About the Time Between RPCs?
    55. 26.7 Experiment 2
    56. 26.8 Experiment 3
    57. 26.9 Experiment 4
    58. 26.10 Mysteries Understood
    59. 26.11 Bonus Anomaly
    60. 26.12 Summary
    61. Chapter 27. Waiting for Locks
    62. 27.1 Overview
    63. 27.2 The Program
    64. 27.3 Experiment 1: Long Lock Hold Times
    65. 27.4 Mysteries in Experiment 1
    66. 27.5 Exploring and Reasoning in Experiment 1
    67. 27.6 Experiment 2: Fixing Lock Capture
    68. 27.7 Experiment 3: Fixing Lock Contention via Multiple Locks
    69. 27.8 Experiment 4: Fixing Lock Contention via Less Locked Work
    70. 27.9 Experiment 5: Fixing Lock Contention via RCU for Dashboard
    71. 27.10 Summary
    72. Chapter 28. Waiting for Time
    73. 28.1 Periodic Work
    74. 28.2 Timeouts
    75. 28.3 Timeslicing
    76. 28.4 Inline Execution Delays
    77. 28.5 Summary
    78. Chapter 29. Waiting for Queues
    79. 29.1 Overview
    80. 29.2 Request Distribution
    81. 29.3 Queue Structure
    82. 29.4 Worker Tasks
    83. 29.5 Primary Task
    84. 29.6 Dequeue
    85. 29.7 Enqueue
    86. 29.8 Spinlock
    87. 29.9 The “Work” Routine
    88. 29.10 Simple Examples
    89. 29.11 What Could Possibly Go Wrong?
    90. 29.12 CPU Frequency
    91. 29.13 Complex Examples
    92. 29.14 Waiting for CPUs: RPC Log
    93. 29.15 Waiting for CPUs: KUtrace
    94. 29.16 PlainSpinLock Flaw
    95. 29.17 Root Cause
    96. 29.18 PlainSpinLock Fixed: Observability
    97. 29.19 Load Balancing
    98. 29.20 Queue Depth: Observability
    99. 29.21 Spin at the End
    100. 29.22 One More Flaw
    101. 29.23 Cross-Checking
    102. 29.24 Summary
    103. Exercises
    104. Chapter 30. Recap
    105. 30.1 What You Learned
    106. 30.2 What We Haven’t Covered
    107. 30.3 Next Steps
    108. 30.4 Summary (for the Entire Book)
  17. Appendix A. Sample Servers
    1. A.1 Sample Server Hardware
    2. A.2 Connecting the Servers
  18. Appendix B. Trace Entries
    1. B.1 Fixed-Length Trace Entries
    2. B.2 Variable-Length Trace Entries
    3. B.3 Event Numbers
  19. Glossary
  20. References
  21. Index
3.144.93.73