Index
A
- access, Other Qualities
- accuracy, Accuracy-Accuracy, Validity
- AIOps (Artificial intelligence operations)-style approach, Picking an SLO number is something a human should do
- alert
- alert attachment, SLO Alerting in a Brownfield Setup
- alert fatigue, Alert fatigue and fog of war
- Allspaw, John, Incidents are unique
- Amazon S3, Durability
- Amazon Web Services, Architectural Considerations: Hardware
- Apache Flink, Low-Lag, High-Throughput Batch Processing
- Apache Kafka, Low-Lag, High-Throughput Batch Processing
- Apache Spark, Low-Lag, High-Throughput Batch Processing
- approvers, Approvers-Approvers
- architecture
- Artificial intelligence operations (AIOps)-style approach, Picking an SLO number is something a human should do
- Artificial Intelligence, A Modern Approach, Further Reading
- asynchronous request, Asynchronous requests
- auto-remediation, Error Budgets and Response Time
- availability
- average, Expected value (see mean)
- (see also expected value)
- Azure, Durability
B
- Backblaze, Measuring Hardware
- baseline, Rolling Windows
- batch job, Batch jobs, Batch Latency
- batch process (see batch request)
- batch request, Batch requests-Batch requests
- bathtub curve, SLI Example: Durability
- Bayes’ theorem, Bayes’ theorem-Bayes’ theorem, Bayesian Inference-Bayesian Inference
- Bell, Gordon, Architecting for Reliability
- Bentley, Jon, Architecting for Reliability
- Bernoulli trial, Coin interlude-Coin interlude, SLI Example: Low QPS, Modeling events with the Poisson distribution, Proof
- Beyer, Betsy, Purposely Burning Budget, A Better Way
- bias, definition of, Coin interlude
- binomial distribution, SLI Example: Low QPS, SLI Example: Low QPS
- (see also geometric distribution, negative binomial distribution)
- binomial theorem, Proof
- binomial trial, SLI Example: Low QPS, SLI Example: Low QPS
- birth-death process (see M/M/1 queue)
- black swan event, How to Think About Reliability, Error budget burn policies
- blackhole exercise, Blackhole Exercises
- brownfield, How to Do SLO Alerting, SLO Alerting in a Brownfield Setup, Parting Recommendations
- Bruce, Andrew, Further Reading
- Bruce, Peter, Further Reading
- Bugzilla, Scale Your Communications
- burn rate, Rolling Windows
C
- cache, Architectural Considerations: Hardware, Revisited, The Importance of Identifying and Understanding Dependencies, The Design of a Service
- (see also capacity cache, latency cache)
- caching layer, Turning hard dependencies into soft dependencies, Project Focus
- calendar-bound window, Rolling versus calendar-bound windows-Rolling versus calendar-bound windows
- CAP theorem, Consistency
- capacity cache, Architectural Considerations: Hardware, Revisited
- cardinality, The five Ms, TSDBs and our design goals
- (see also distinct combinations of values)
- Cauchy distribution, Expected value
- chaos engineering, Experimentation and Chaos Engineering
- (see also blackhole exercise, load test)
- Chubby, Purposely Burning Budget
- collaboration, Collaboration-based training-Collaboration-based training
- common vulnerabilities and exposures (CVEs), Counting Incidents
- communication, Scale Your Communications-Scale Your Communications
- completeness, Completeness-Completeness
- complex distributed system, Complexity and failure in distributed systems
- complex system
- comprehensiveness (see completeness)
- compute platform, Compute platforms
- confidence interval, The highest density interval
- consistency, Consistency-Consistency
- container platforms, Platforms as Services-SLO: Container platform
- content delivery network (CDN), How a Service Grows, A worked reporting example
- continuous probability distribution, The exponential distribution
- correctness, Validity
- CPU usage
- credible interval, The highest density interval
- (see also highest density interval (HDI))
- cumulative distribution function (CDF), Variance, percentiles, and the cumulative distribution function-Variance, percentiles, and the cumulative distribution function, Proof
- customer research, Listening to Users
- customer service, A Happier Business, Listening to Users (Redux)
D
- dashboards, Lessons Learned the Hard Way, Dashboards-Dashboards, SLO Status-SLO Status
- Data Analysis with Open Source Tools, Further Reading
- data application, Designing Data Applications
- data application properties, Data Application Properties-Robustness
- data conformance (see validity)
- data processing pipeline, Data processing pipelines
- data properties, Data and Data Application Reliability-Durability
- data quality, Data Properties
- data reliability
- database and storage system, Databases and storage systems
- Davidovič, Štěpán, A Better Way
- DDoS attack, Incidents are unique, A worked reporting example
- (see also volumetric attack)
- dependency, Dependency Changes-Dependency Introduction or Retirement
- dependency math, Dependency math-Dependency math
- Design Patterns, Architecting for Reliability
- Designing Data-Intensive Applications, Scalability
- deviations, Ranges-Ranges
- DevOps, How to Think About Reliability
- discoverability, Discoverability-Dashboards
- distinct combinations of values, TSDBs and our design goals
- distributed denial of service attack, Incidents are unique
- distribution, Statistical distribution support-Statistical distribution support, Other Qualities
- distribution tail, Expected value
- documentation, Create Your Supporting Artifacts-Training
- Doing Bayesian Data Analysis, Further Reading
- downtime, How to Think About Reliability, The Problem of Being Too Reliable, Availability, Resilience
- Drucker, Peter, What do your company executives and business partners care about?
- Dunning and Ertl’s t-digest algorithm, Statistical distribution support
- durability
E
- Engineering and the Design and Operation of Manufacturing Systems, Architecting for Reliability
- engineering team, Engineering-Engineering, Order of Operation, Engineering
- error budget deficit, Events-based error budget math, Decision Making
- error budget recovery, Events-based error budget math
- error budget surplus, Events-based error budget math, Decision Making
- error budgets
  - alert and, Error Budgets and Response Time-Error Budget Burn Rate
  - approaches and, Error Budgets, How to Use Error Budgets
  - benefits of, Operations
  - burning, Purposely Burning Budget
  - decision making and, Decision Making, Exhausting your error budget-Using surplus error budget
  - definition of, The Reliability Stack
  - establishing, Establishing Error Budgets-Establishing Error Budgets
  - events-based, Establishing Error Budgets-Events-based error budget math
  - experimentation and, Experimentation and Chaos Engineering-Experimentation and Chaos Engineering
  - policies and, Error Budget Policies-Error budget exceeded policies, Your First Error Budget Policy (and Your First Critical Test), Error budget policy
  - projects and, To Release New Features or Not?-Project Focus
  - reporting, Error Budget Status-Error Budget Status
  - risk factors of, Examining Risk Factors, Examining Risk Factors
  - time-based, Establishing Error Budgets, Time-based error budget math-Time-based error budget math, Error Budget Status
- error injection, Experimentation and Chaos Engineering
- error rate
- error ratio rate, Latency-Sensitive Request Processing, Latency-Sensitive Request Processing
- errors, Data Application Failures-Data Application Failures
- events, definition of, Sample spaces
- Ewaschuk, Rob, A Better Way
- executive leadership, Executive Leadership-Executive Leadership, Leadership-Leadership
- expectation, Expected value
- (see also expected value)
- expected value, Expected value-Expected value
- exponential distribution, The exponential distribution-The exponential distribution, SLI Example: Durability, Proof
F
- failure domain, Lessons Learned the Hard Way, Architectural Considerations: Hardware
- failure mode, Architectural Considerations: Anticipating Failure Modes
- failures, Failure-Induced Changes, Paying Attention to Failures
- fault tolerance, Resilience
- faults, Data Application Failures-Data Application Failures
- feature freeze, No new features (feature freeze)-No new features (feature freeze)
- flexible targets, Flexible Targets, Statistical distribution support, TSDBs and our design goals, Structured event databases and our design goals
- flood attack, Incidents are unique
- fog of war, Alert fatigue and fog of war
- Fowler, Susan, Architectural Considerations: Monolith or Microservices
- freshness, Freshness-Freshness, TSDBs and our design goals, Structured event databases and our design goals, Freshness-Freshness
G
- Gamma, Erich, Architecting for Reliability
- Gaussian distribution, Expected value
- (see also normal distribution)
- geometric distribution, SLI Example: Low QPS
- Gershwin, Stanley B., Architecting for Reliability
- Google, Making Agreements, Summary, Statistical distribution support
- Google Cloud Platform, Durability
- Google Docs, Document Repositories
- granularity, Accuracy
- greenfield, How to Do SLO Alerting-How to Do SLO Alerting
H
- hard dependency, Service Dependencies-Turning hard dependencies into soft dependencies, Purposely Burning Budget
- hardware
- high dynamic range (HDR) histograms, Statistical distribution support
- highest density interval (HDI), The highest density interval-The highest density interval
- histograms, Statistical distribution support, Coin interlude, SLI Example: Low QPS-SLI Example: Low QPS
- (see also high dynamic range (HDR) histograms, latency histogram)
- Hopper, Grace, Data Services
- hosted services, Open Source or Hosted Services
- Hyrum’s law, Making Agreements
L
- Large-Scale Cluster Management at Google with Borg, Architectural Considerations: Hardware
- last-in first-out (LIFO) queue, Decreasing latency
- latency
  - client-side, Poor proxies for user experience-Poor proxies for user experience
  - CPU usage and, Poor proxies for user experience-Poor proxies for user experience
  - distribution, Statistical distribution support
  - histogram, Statistical distribution support
  - measuring, Measuring Complex Service User Reliability, Single-team component services, Percentile Thresholds, Establishing Error Budgets
  - performance and, Performance
  - prediction, Decreasing latency
  - queueing, SLI Example: Queueing Latency-Variance, percentiles, and the cumulative distribution function
  - rate, Latency-Sensitive Request Processing
  - response and, Latency-Sensitive Request Processing, Other Services as Users: Buying Products-Other Services as Users: Buying Products
- latency cache, Architectural Considerations: Hardware, Revisited
- latency-sensitive request processing, Latency-Sensitive Request Processing-Latency-Sensitive Request Processing
- law of conditional probability, Proof, Proof
- law of large numbers, Expected value
- law of total probability, Proof
- legal team, Legal-Legal, Order of Operation, Legal
- library of case studies, Create a Library of Case Studies, Share Your Library of SLO Case Studies
- load balancer, Architectural Considerations: Anticipating Failure Modes
- load test, Load and Stress Tests, Error budget burn policies
- log lines, SLO: Business data analysis
- logging, Structured Event Databases (Logging)
- long tail, Percentiles, Percentile Thresholds, Percentile Thresholds
- lookahead, Rolling Windows
- low-lag, high-throughput batch processing, Low-Lag, High-Throughput Batch Processing
M
- M/M/1 queue, Decreasing latency-Decreasing latency
- M/M/c queue, Adding capacity
- Majors, Charity, Percentile Thresholds
- MAP estimator, Maximum a Posteriori, The relationship between MLE and MAP, Bayesian Inference
- MapReduce, Low-Lag, High-Throughput Batch Processing, Completeness
- Markdown, Document Repositories
- Markovian, Decreasing latency
- max value, The five Ms
- maximum a posteriori, Maximum a Posteriori, The relationship between MLE and MAP
- maximum likelihood estimation (MLE), Maximum Likelihood Estimation, The relationship between MLE and MAP
- mean, The five Ms, Expected value
- (see also expected value)
- mean time between failures (MTBF), Quantitative Analysis of Systems
- mean time to <something> (MTTX), The Problem with Mean Time to X, The Problem with Mean Time to X-Incidents are unique, Incidents are unique, SLOs for Basic Reporting, A worked reporting example
- mean time to detect (MTTD), Quantitative Analysis of Systems
- mean time to mitigate (MTTM), Quantitative Analysis of Systems
- mean time to repair (MTTR), Architectural Considerations: Hardware
- mean time to resolution (MTTR), Means aren’t always meaningful
- median, The five Ms-The five Ms, Median
- message queue, Low-Lag, High-Throughput Batch Processing
- metric attributes, Metric Attributes
- metrics system, A Written Example-A Written Example, Centralized Time Series Statistics (Metrics), Measurement Changes-Calculation Changes
- (see also time series database (TSDB))
- microservice
- min value, The five Ms
- mobile and web clients, Mobile and Web Clients-Mobile and Web Clients
- mode, The five Ms
- monolith, Architectural Considerations: Monolith or Microservices
- MTBF (mean time between failures), Quantitative Analysis of Systems
- MTTD (mean time to detect), Quantitative Analysis of Systems
- MTTM (mean time to mitigate), Quantitative Analysis of Systems
- MTTR (mean time to repair), Architectural Considerations: Hardware
- MTTR (mean time to resolution), Means aren’t always meaningful
- multidimensional probability distribution, Proof
- multimodal dataset, The five Ms
- multiple comparison problem, The Problem with Too Many SLOs
- multiple-team component services, Multiple-team component services
- Murphy, Niall, Show the human impact of the current situation
- mutability, Other Qualities
N
- negative binomial distribution, SLI Example: Low QPS
- nested request processing (see latency-sensitive request processing)
- nines, The Problem with the Number Nine-The Problem with the Number Nine, Percentile Thresholds, Putting It Together, Corner Cases, What can you do?, Increased Utilization Changes
- Non-Abstract Large System Design (NALSD), Architecting for Reliability, Architectural Considerations: Hardware
- nonhomogeneous Poisson process, SLI Example: Durability
- normal distribution, Expected value
- (see also Gaussian distribution)
- Norvig, Peter, Further Reading
O
- Objective and Key Result (OKR), SLOs Are a Process, Not a Project
- observability
  - approaches to, A Better Way
  - definition of, Complexity and failure in distributed systems
  - monitoring, Common Machinery, Low-Lag, High-Throughput Batch Processing, Mobile and Web Clients, The General Case, Run the old and new in parallel
  - system, Troubleshooting with SLO Alerting, Parting Recommendations
- Office 365, Document Repositories
- OKR (see Objective and Key Result (OKR))
- open source software, Open Source or Hosted Services-Open Source or Hosted Services, SLO: Internal wiki-SLO: Internal wiki
- open source software (OSS), Centralized Time Series Statistics (Metrics)
- OpenTelemetry, Latency-Sensitive Request Processing
- operational underload, The Problem of Being Too Reliable
- operations team, Operations, Order of Operation, Operations
- opportunity cost, Cost
- order of operations, Order of Operation-Order of Operation
- OSS (see open source software (OSS))
- outliers, The five Ms, Percentiles, Mobile and Web Clients
- ownership, Ownership-Ownership
P
- parameter, Bayes’ theorem
- PDF (see probability density function (PDF))
- percentile thresholds, Percentile Thresholds-Percentile Thresholds
- percentiles, Percentiles
- performance, Reliability Engineering, Mobile and Web Clients, Show the human impact of the current situation, Architecting for Reliability, Example System: Image-Serving Service, Architectural Considerations: Anticipating Failure Modes, Performance
- Philosophy of Alerting, A Better Way
- phraseology, Phraseology
- platform, Compute platforms, Platform Changes-Platform Changes
- (see also container platforms, compute platform)
- PMF, SLI Example: Low QPS
- (see also probability mass function)
- pod, Platforms as Services-SLO: Container platform
- point estimator, Bayesian Inference
- Poisson distribution, Modeling events with the Poisson distribution-Modeling events with the Poisson distribution
- Poisson process, Modeling events with the Poisson distribution, The exponential distribution, SLI Example: Durability, SLI Example: Durability, Proof
- (see also nonhomogeneous Poisson process)
- polyglot persistence, Designing Data Applications
- posterior, Maximum a Posteriori, Bayes’ theorem
- posterior distribution, Bayesian Inference
- PR (pull request), Error Budgets for Humans
- Practical Statistics for Data Scientists, Further Reading
- PRD (product requirement document), Product
- precision, Systems and Building Blocks, Accuracy
- prior, Using MAP, Using MAP (see prior probability)
- prior probability, Bayes’ theorem
- privacy (see security)
- probability, Probability and Statistics for SLIs and SLOs-On Probability
- probability density function (PDF), The exponential distribution
- probability distribution, SLI Example: Low QPS, Expected value
- (see also expected value)
- probability mass function (PMF), SLI Example: Low QPS
- prober, What can you do?
- product management team, Product-Product, Order of Operation, Product-Product
- product requirement document (PRD), Product
- Production-Ready Microservices, Architectural Considerations: Monolith or Microservices
- Programming Pearls, Architecting for Reliability
- Prometheus, Measurement Changes
- proofs, Theorem 1-Proof
- pull request (PR), Error Budgets for Humans
- Push on Green model, Architecting for Reliability
R
- random variables, Coin interlude
- range, Ranges-Ranges
- recoverability (see resilience)
- Reduce Toil Through Better Alerting, A Better Way
- redundancy, Other Qualities
- reliability
  - concepts of, How Reliable Should You Be?-How to Think About Reliability
  - costs of, Reliability Is Expensive-Reliability Is Expensive
  - definition of, Caring About Many Things
  - hardware and, Architectural Considerations: Hardware
  - problems, The Problem of Being Too Reliable
  - reporting, Reliability Reporting-Basic Reporting, SLOs for Basic Reporting-Advanced Reporting
  - service and, Service Truths, Reliability Engineering-Implied Agreements, Business Alignment and SLIs, User Happiness
  - utilization changes and, Increased Utilization Changes-Functional Utilization Changes
  - worked example of, A Worked Example of Reliability-A Worked Example of Reliability
- reliability burndown, SLO Status
- reliability engineering, How to Think About Reliability-Reliability Engineering
- Reliability Stack, The Reliability Stack, Service Level Indicators, Error Budgets, Developing Meaningful Service Level Indicators, How to Use Error Budgets
- reliability targets, Reliability Targets-The Problem with the Number Nine, Beyond just hardware
- remote procedure call (RPC), Request and response APIs
- reporting
- request and response API, Request and response APIs-Data processing pipelines, A Request and Response Service-A Request and Response Service, Quantity, SLO: Business data analysis
- requests (see asynchronous request, batch request, synchronous request)
- resilience, Resilience-Resilience
- resolution, Resolution-Quality
- retention horizon, Structured event databases and our design goals
- retrospective meetings, Error Budgets for Humans
- revisits, Periodic Revisits, Functional Utilization Changes, Dependency Introduction or Retirement, Tooling Changes, Revisit Schedules, Definition status, Revisit schedule
- RFC 2119, Error budget burn policies
- robustness, Robustness-Robustness
- rolling window, Rolling versus calendar-bound windows, Rolling Windows-Rolling Windows
- RPC (see remote procedure call (RPC))
- Russell, Stuart, Further Reading
S
- SaaS, Centralized Time Series Statistics (Metrics), Mobile and Web Clients, Architectural Considerations: Hardware
- sample, The five Ms
- sample space, Sample spaces
- sampling, Other Qualities
- scalability, Scalability-Scalability
- scalars, Aggregate analysis
- scale, The exponential distribution
- Search as a Service (SaaS), The Reliability Stack
- security, Security-Security
- Security as a Service (SaaS), The Reliability Stack
- Seeking SRE, Show the human impact of the current situation
- Serra, James, Designing Data Applications
- service
- service components, Service Components-Single-team component services
- (see also multiple-team component services, single-team component services)
- service dependency, Service Dependencies-Dependency math
- (see also hard dependency, soft dependency)
- service failure, Choosing Good Service Level Objectives
- service level agreements (SLAs)
- service level indicators (SLIs)
  - approaches, What Meaningful SLIs Provide, The General Case
  - benefits of, Developing Meaningful Service Level Indicators-A Happier Business, Legal
  - complications with, Data processing pipelines, Databases and storage systems, Iterate Over Everything
  - definition of, The Reliability Stack-Service Level Indicators
  - determiners, Low-Lag, High-Throughput Batch Processing
  - durability, SLI Example: Durability-SLI Example: Durability
  - meaningful, What Meaningful SLIs Provide-A Written Example
  - measuring, Measuring Many Things by Measuring Only a Few, Past Performance, What Will Your SLIs Be?, How to Change SLOs
- service level objectives (SLOs)
  - adoption lessons, Lessons Learned the Hard Way
  - alerting, Alerting (see alert, threshold alert)
  - approaches, Things to Keep in Mind-It’s All About Humans, Reliability Engineering-Reliability Engineering, Making Agreements, Caring About Many Things, Business Alignment and SLIs, Owners and stakeholders, Strategies for Shifting Culture, SLOs for Basic Reporting, Advanced Reporting
  - benefits of, Engineering-Legal, Product, Data Application Failures
  - buy-in for, Engineering Is More than Code-Order of Operation, Getting Buy-in-Assign it, Prepare Your Sales Pitch
  - changes to, How to Change SLOs-Revisit Schedules
  - culture of, Path to a Culture of SLOs-Advocating for Others to Use SLOs
  - definition of, The Reliability Stack-Service Level Objectives, Quantitative Analysis of Systems
  - definition templates, SLO Definition: Service Name-External Links
  - document, Start with a document, SLO Definition Documents-Phraseology, Document Repositories
  - example services and, Web services-Hardware and the network
  - goals, Design Goals-Organizational Constraints
  - implementation strategies, Latency-Sensitive Request Processing-The General Case, Assign it-Exhausting your error budget, The First Pass-Periodic Revisits
    - (see also latency-sensitive request processing; low-lag, high-throughput batch processing; mobile and web clients)
  - measuring, What is important to measure?-What is important to measure?
  - objections to, Common Objections and How to Overcome Them-QA, Summary
  - problems with, The Problem with Too Many SLOs
  - reports, SLO Reports
  - targets, But I am big enough!, Choosing Targets, Percentiles, Flexible Targets
    - (see also flexible targets, testable targets)
- silver bullets, No new features (feature freeze)
- single-team component services, Single-team component services
- Site Reliability Engineering (book), Purposely Burning Budget, Do Your Research
- Site Reliability Engineering (SRE), Things to Keep in Mind, How to Think About Reliability, Reliability for Things You Don’t Own, Architecting for Reliability
- Site Reliability Workbook, The, Rolling Windows, Architecting for Reliability, Do Your Research
- SLO Advocate
- slow burn problem, Error Budgets and Response Time-Error Budgets and Response Time
- soft dependency, Service Dependencies-Turning hard dependencies into soft dependencies
- Software as a Service (SaaS), The Reliability Stack
- span, Latency-Sensitive Request Processing
- specification, Architecting for Reliability
- standard deviation, Variance, percentiles, and the cumulative distribution function
- statistical approaches, Basic Statistics-Percentiles
- statistics, The five Ms, Probability and Statistics for SLIs and SLOs
- Stockholm syndrome, SLO Alerting in a Brownfield Setup
- Storage as a Service (SaaS), The Reliability Stack
- stress test, Load and Stress Tests, Error budget burn policies
- structured event database, Structured Event Databases (Logging)-Structured event databases and our design goals
- structured logging data, Cost
- (see also structured event database)
- SYN flood attack, Incidents are unique
- synchronous request, Synchronous requests-Synchronous requests
- system failures, Choosing Good Service Level Objectives
- systems architect, Architecting for Reliability, Systems and Building Blocks
- systems engineering, Architecting for Reliability
T
- telemetry, The Reliability Stack, Single-team component services, Mobile and Web Clients, Complexity and failure in distributed systems, Completeness
- testable targets, Testable Targets, Statistical distribution support, TSDBs and our design goals, Structured event databases and our design goals
- thaw tax, No new features (feature freeze)
- threshold alert
- throughput
- Tilbrook, D., Run the old and new in parallel
- time, Other Qualities
- time series data, Cost, Centralized Time Series Statistics (Metrics)-TSDBs and our design goals
- time series database (TSDB), Centralized Time Series Statistics (Metrics)-TSDBs and our design goals
- time windows, Rolling versus calendar-bound windows-Choosing a time window
- (see also calendar-bound window, rolling window)
- timeliness (see freshness)
- tooling, Tooling Changes-Calculation Changes, Discoverability Tooling
- training, Training-Learn How to Handle Challenges, Scale Your Training Program by Adding More Trainers
- transactional API, Service Dependency Changes
- trials, definition of, Sample spaces, Independence
- Trusted Platform Modules (TPM), Integrity