Table of Contents for Implementing Service Level Objectives
by Alex Hidalgo
Foreword
Preface
You Don’t Have to Be Perfect
How to Read This Book
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
I. SLO Development
1. The Reliability Stack
Service Truths
The Reliability Stack
Service Level Indicators
Service Level Objectives
Error Budgets
What Is a Service?
Example Services
Things to Keep in Mind
SLOs Are Just Data
SLOs Are a Process, Not a Project
Iterate Over Everything
The World Will Change
It’s All About Humans
Summary
2. How to Think About Reliability
Reliability Engineering
Past Performance and Your Users
Implied Agreements
Making Agreements
A Worked Example of Reliability
How Reliable Should You Be?
100% Isn’t Necessary
Reliability Is Expensive
How to Think About Reliability
Summary
3. Developing Meaningful Service Level Indicators
What Meaningful SLIs Provide
Happier Users
Happier Engineers
A Happier Business
Caring About Many Things
A Request and Response Service
Measuring Many Things by Measuring Only a Few
A Written Example
Something More Complex
Measuring Complex Service User Reliability
Another Written Example
Business Alignment and SLIs
Summary
4. Choosing Good Service Level Objectives
Reliability Targets
User Happiness
The Problem of Being Too Reliable
The Problem with the Number Nine
The Problem with Too Many SLOs
Service Dependencies and Components
Service Dependencies
Service Components
Reliability for Things You Don’t Own
Open Source or Hosted Services
Measuring Hardware
Choosing Targets
Past Performance
Basic Statistics
Metric Attributes
Percentile Thresholds
What to Do Without a History
Summary
5. How to Use Error Budgets
Error Budgets in Practice
To Release New Features or Not?
Project Focus
Examining Risk Factors
Experimentation and Chaos Engineering
Load and Stress Tests
Blackhole Exercises
Purposely Burning Budget
Error Budgets for Humans
Error Budget Measurement
Establishing Error Budgets
Decision Making
Error Budget Policies
Summary
II. SLO Implementation
6. Getting Buy-In
Engineering Is More than Code
Key Stakeholders
Engineering
Product
Operations
QA
Legal
Executive Leadership
Making It So
Order of Operation
Common Objections and How to Overcome Them
Your First Error Budget Policy (and Your First Critical Test)
Lessons Learned the Hard Way
Summary
7. Measuring SLIs and SLOs
Design Goals
Flexible Targets
Testable Targets
Freshness
Cost
Reliability
Organizational Constraints
Common Machinery
Centralized Time Series Statistics (Metrics)
Structured Event Databases (Logging)
Common Cases
Latency-Sensitive Request Processing
Low-Lag, High-Throughput Batch Processing
Mobile and Web Clients
The General Case
Other Considerations
Integration with Distributed Tracing
SLI and SLO Discoverability
Summary
8. SLO Monitoring and Alerting
Motivation: What Is SLO Alerting, and Why Should You Do It?
The Shortcomings of Simple Threshold Alerting
A Better Way
How to Do SLO Alerting
Choosing a Target
Error Budgets and Response Time
Error Budget Burn Rate
Rolling Windows
Putting It Together
Troubleshooting with SLO Alerting
Corner Cases
SLO Alerting in a Brownfield Setup
Parting Recommendations
Summary
9. Probability and Statistics for SLIs and SLOs
On Probability
SLI Example: Availability
SLI Example: Low QPS
On Statistics
Maximum Likelihood Estimation
Maximum a Posteriori
Bayesian Inference
SLI Example: Queueing Latency
Batch Latency
SLI Example: Durability
Further Reading
Summary
10. Architecting for Reliability
Example System: Image-Serving Service
Architectural Considerations: Hardware
Architectural Considerations: Monolith or Microservices
Architectural Considerations: Anticipating Failure Modes
Architectural Considerations: Three Types of Requests
Systems and Building Blocks
Quantitative Analysis of Systems
Instrumentation! The System Also Needs Instrumentation!
Architectural Considerations: Hardware, Revisited
SLOs as a Result of System SLIs
The Importance of Identifying and Understanding Dependencies
Summary
11. Data Reliability
Data Services
Designing Data Applications
Users of Data Services
Setting Measurable Data Objectives
Data and Data Application Reliability
Data Properties
Data Application Properties
System Design Concerns
Data Application Failures
Other Qualities
Data Lineage
Summary
12. A Worked Example
Dogs Deserve Clothes
How a Service Grows
The Design of a Service
SLIs and SLOs as User Journeys
Customers: Finding and Browsing Products
Other Services as Users: Buying Products
Internal Users
Platforms as Services
Summary
III. SLO Culture
13. Building an SLO Culture
A Culture of No SLOs
Strategies for Shifting Culture
Path to a Culture of SLOs
Getting Buy-in
Prioritizing SLO Work
Implementing Your SLO
What Will Your SLIs Be?
What Will Your SLOs Be?
Using Your SLO
Iterating on Your SLO
Determining When Your SLOs Are Good Enough
Advocating for Others to Use SLOs
Summary
14. SLO Evolution
SLO Genesis
The First Pass
Listening to Users
Periodic Revisits
Usage Changes
Increased Utilization Changes
Decreased Utilization Changes
Functional Utilization Changes
Dependency Changes
Service Dependency Changes
Platform Changes
Dependency Introduction or Retirement
Failure-Induced Changes
User Expectation and Requirement Changes
User Expectation Changes
User Requirement Changes
Tooling Changes
Measurement Changes
Calculation Changes
Intuition-Based Changes
Setting Aspirational SLOs
Identifying Incorrect SLOs
Listening to Users (Redux)
Paying Attention to Failures
How to Change SLOs
Revisit Schedules
Summary
15. Discoverable and Understandable SLOs
Understandability
SLO Definition Documents
Phraseology
Discoverability
Document Repositories
Discoverability Tooling
SLO Reports
Dashboards
Summary
16. SLO Advocacy
Crawl
Do Your Research
Prepare Your Sales Pitch
Create Your Supporting Artifacts
Run Your First Training and Workshop
Implement an SLO Pilot with a Single Service
Spread Your Message
Learn How to Handle Challenges
Walk
Work with Early Adopters to Implement SLOs for More Services
Celebrate Achievements and Build Confidence
Create a Library of Case Studies
Scale Your Training Program by Adding More Trainers
Scale Your Communications
Run
Share Your Library of SLO Case Studies
Create a Community of SLO Experts
Continuously Improve
Summary
17. Reliability Reporting
Basic Reporting
Counting Incidents
Severity Levels
The Problem with Mean Time to X
SLOs for Basic Reporting
Advanced Reporting
SLO Status
Error Budget Status
Summary
A. SLO Definition Template
SLO Definition: Service Name
Service Overview
SLIs and SLOs
Rationale
Revisit Schedule
Error Budget Policy
External Links
B. Proofs for Chapter 9
Theorem 1
Proof
Theorem 2
Proof
Theorem 3
Proof
Theorem 4
Proof
Theorem 5
Proof
Theorem 6
Proof
Theorem 7
Proof
Index