Index

A

abandonment expense (AbEx), Project Operating Expense and Abandonment Expense
access control, Early Intervention and Education Through Evangelism
accommodations, for on-call personnel, Accommodations
active learning, Active Teaching and Learning-A Call to Action: Ditch the Boring Slides
- basics, Active Learning
- costs of failing to learn, The Costs of Failing to Learn
- Incident Manager card game, Active Learning Example: Incident Manager (a Card Game)-Active Learning Example: Incident Manager (a Card Game)
- learning habits of effective SRE teams, Learning Habits of Effective SRE Teams-Postmortems
- postmortems and, Postmortems
- production meetings and, Production Meetings
- SRE Classroom, Active Learning Example: SRE Classroom
- Wheel of Misfortune game, Active Learning Example: Wheel of Misfortune
activism (see social activism)
address resolution protocol (ARP) tables, Technical learnings
adopt-to-buy abandonment scenario, Project Operating Expense and Abandonment Expense
advocate phase of SRE execution, Phase 3: Advocates/Partners
Affordable Care Act, Elegy for Complex Systems
Agilent Technologies, Introducing SRE in Large Enterprises-Closing Thoughts
alarming, Observability and Alarming
alerts, On-Call and Alerting
Allspaw, John, SRE Cognitive Work, Introduction-Conclusion
Almeida, Daniel Prata, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
AlphaGo, From Chess to Go: How Deep Can We Dive?
Amaro, Ricardo, Introduction to Machine Learning for SRE, Why Use Machine Learning for SRE?-Success Stories
Amazon Glacier, Offline storage
Amazon Web Services (AWS), Self-Service Is More Than a Button
Andersen, Kurt, SRE as a Success Culture, SRE as a Success Culture-Focus on the Details of Success
Anderson, Brian, Origin Story
antipatterns, SRE Antipatterns-So, That’s It, Then?
APIs
- of service-level middleware, APIs of Service-Level Middleware
- reporting as first step towards building SLIs, Polling API informs SLIs
- third-party integrations and, Logging
application errors, Application errors
Application Operations (AppOps) teams, Production Engineering at Facebook-Production Engineering at Facebook
assurance windows, Why Set Goals?
- (see also time quanta)
automation
- and operator fatigue, Operator Fatigue
- and reliability, Reliability
- as team player in SRE work, Focus on Making Automation a Team Player in SRE Work
- data durability engineering, Automation-Reliability
- privacy engineering and, Automation
- SRE teams and, Provisioning, Change Management, and Velocity
- testing at Google, Pattern 1: Birth of Automated Testing at Google-Pattern 1: Birth of Automated Testing at Google
- third-party integrations, Automation
- window of vulnerability and, Window of Vulnerability
availability
- as SLO, Availability-Transactions over Time Quanta
- time quanta as metric, Time Quanta
- tracking availability level, Tracking Availability Level-Tracking Availability Level
- transactions as metric, Transactions
- transactions over time quanta, Transactions over Time Quanta
Avenet, Julien, Replies

B

backhauling a request, Routing requests in the application
backlogs, Unify Backlogs and Protect Capacity
backpressure, Special Knowledge About Complex Systems
backpropagation, A neural network from scratch
backups
- data durability engineering and, Backups
- logical, Full and incremental logical backups
- physical, Full physical backups
  - (see also recovery/recoverability of data)
Bainbridge, Lisanne
- on automation, Why Should We Care About Practitioner Cognition?
- on human operators and advanced systems, SRE Cognitive Work
base image, Building the Base Image
Bayesian inference, What Is Machine Learning?
Beamish, Alex, Replies
Beck, Kent, Economic Pillars of Complexity
benefits, for job applicants with mental disorders, Benefits-Benefits
Beyer, Betsy, The Intersection of Reliability and Privacy, The Intersection of Reliability and Privacy-Conclusion
biases, in job interviewing, Biases
Bisset, Blake, SRE Antipatterns, SRE Antipatterns-So, That’s It, Then?
black boxes
- running like a service, Running the Black Box Like a Service
- SLIs on, SLIs on black boxes
blame
- assigning/avoiding, in reviews of social activism, Charlottesville in review: assigning and avoiding blame
- in cross-team postmortems, Resolving Cross-Team Reliability Issues by Using Postmortems
blameless retrospectives
- in social activism, Beyond culpability: building capacity instead of assigning blame
- SRE/social activism parallels, Principles 3 and 4 (blameless retrospectives and psychological safety)
  - (see also postmortems)
Bland, Mike, Pattern 1: Birth of Automated Testing at Google, Pattern 1: Birth of Automated Testing at Google
Blank-Edelman, David
- Context Versus Control in SRE, Context Versus Control in SRE-Context Versus Control in SRE
- DevOps and SRE, Background-Replies
- Production Engineering at Facebook, Production Engineering at Facebook-Production Engineering at Facebook
Blew, Aaron, Replies
blind resume review, Biases
blue/green deployment
- about, Continuous Integration/Continuous Deployment with Confidence
- rolling release vs., Disadvantages
Boggs, Grace Lee, The Long Tail: Turning Action into Change
Bootcamp, at Facebook, Production Engineering at Facebook-Production Engineering at Facebook
bot attacks, Case Study: WAF/Bot Mitigation
bottlenecks, Establishing SLAs for Internal Components
breakpoints, Establishing SLAs for Internal Components
Brummel, Janna, Replies
build/buy/adopt decision, Build, Buy, or Adopt?-Project Operating Expense and Abandonment Expense
- assessing project considerations, Acknowledge Reality
- core competencies and, Is this a core competency?
- deciding which option to take, Make a Decision
- determining importance of an integration, Establish Importance
- integration timeline, Integration timeline?
- LinkedIn case study, Project Operating Expense and Abandonment Expense
- project operating expense and abandonment expense, Project Operating Expense and Abandonment Expense
- stakeholder identification, Identify Stakeholders
Burns, Robert, Before, During, After
business case, for SRE
- preparing, Prepare the business case: personalize and evaluate the cost of having engineering resources responsible for reliability
- presenting, Presenting the Business Case
buy-to-adopt abandonment scenario, Project Operating Expense and Abandonment Expense
buy-to-buy abandonment scenario, Project Operating Expense and Abandonment Expense

C

calibration problem, The Calibration Problem-Incidents Are Opportunities for Collective Recalibration
- addressing, Address the Calibration Problem
- incidents and recalibration, Incidents Trigger Individual Recalibration
- incidents and specific calibration problems, Incidents Are Opportunities for Collective Recalibration
- mental models, Mental Models
Campbell, Laine, Database Reliability Engineering, Database Reliability Engineering-Making the Case for DBRE
CAMS (Culture, Automation, Measurement and Sharing), Replies
Canahuati, Pedro, Production Engineering at Facebook, Production Engineering at Facebook-Production Engineering at Facebook
canary rollout method, Tracking Availability Level
canarying, On-Call Is Emergency Medicine Instead of Ward Medicine
capacity planning, Capacity Planning and Demand Forecasting
capacity, team, Unify Backlogs and Protect Capacity
capital expenditure (CapEx), Make a Decision, Get Rid of as Many Handoffs as Possible, Capacity Planning and Demand Forecasting
cascading failure, Decoherence and Cascading Failure
catalytic role SREs, Phase 4: Catalytic
certificate authorities (CAs), Establish Importance
Chakrabarti, Saunak Jai, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
Challenger space shuttle disaster, Always in a State of Partial Failure
champions, for SRE, Start having conversations with leaders and champions in the organization
change management
- parallels between social activism and technology industry, The Long Tail: Turning Action into Change-Activism and Change Within a Company
- SRE teams and, Provisioning, Change Management, and Velocity
chaos engineering, In the Beginning, There Was Chaos-Conclusion
- advanced principles, Advanced Principles
- and Chaos Kong, Chaos Goes Big
- and Economic Pillars of Complexity, Economic Pillars of Complexity
- FAQs, Frequently Asked Questions
- inherent problems with complex systems, The Problem with Systems-The Problem with Systems
- navigating complexity for safety, Navigating Complexity for Safety
- origins of, Beginning Chaos
Chaos Kong, Chaos Goes Big
Charlottesville, Virginia, demonstrations/counterdemonstrations (2017) (see social activism)
Check, Martin, Using Incident Metrics to Improve SRE at Scale, Using Incident Metrics to Improve SRE at Scale-Learnings: TL;DR
choke points, Tooling, Antipattern 10: Design Chokepoints
Churchill, Winston, Frequently Asked Questions
CIA, training games used by, Active Learning
Clarke, Arthur C., The Awakening of Applied AI
classification schemes, Classification schemes
cloud services, Chaos Goes Big
code review, documentation as part of, Require Docs as Part of Code Review
cognitive hacks, Cognitive hacks
cognitive overload, Cognitive overload
cognitive work, Introduction-Conclusion
- activities of SREs during incidents, What Do SRE People Do?
- calibration problem, The Calibration Problem-Incidents Are Opportunities for Collective Recalibration
- critical decisions made under uncertainty and time pressure, Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
- human performance in modern complex systems, Human Performance in Modern Complex Systems: The Main Themes
- importance of understanding practitioner cognition, Why Should We Care About Practitioner Cognition?-Human Performance in Modern Complex Systems: The Main Themes
- intersections between operations and social activism, Intersections Between Operations and Social Activism-Conclusion
- managing coordination costs, Managing the Costs of Coordination
- observations on SRE cognitive work around incidents, Observations on SRE Cognitive Work Around Incidents-SREs Are Cognitive Agents Working in a Joint Cognitive System
- opportunities to mitigate worst aspects of incident, Every Incident Could Have Been Worse
- repairs to functional systems, Repairs to Functional Systems
- sacrifice decisions and uncertainty, Sacrifice Decisions Take Place Under Uncertainty
- special knowledge about complex systems, Special Knowledge About Complex Systems
- SREs as cognitive agents working in a joint cognitive system, SREs Are Cognitive Agents Working in a Joint Cognitive System
Columbia space shuttle disaster, Always in a State of Partial Failure
communication
- for teams, Make your communication clear and your expectations explicit
- third-party integrations and, Communication
compensation
- for on-call, Compensation
- job applicants with mental disorders, Compensation
- novelty priority inversion and, Novelty Priority Inversion
complex systems, Elegy for Complex Systems-To Get Involved
- as always in state of partial failure, Always in a State of Partial Failure
- community for, To Get Involved
- decoherence and cascading failure, Decoherence and Cascading Failure
- defining characteristics, The Problem with Systems
- Economic Pillars of Complexity, Economic Pillars of Complexity
- human performance in, Human Performance in Modern Complex Systems: The Main Themes
- incidents as inevitable in, Incidents Will Continue
- inherent problems with, The Problem with Systems-The Problem with Systems
- inseparability of computer and human systems, The Computer and Human Systems Cannot Be Separated
- novelty priority inversion, Novelty Priority Inversion
- overhead of coordination, Nobody Anticipates the Overhead of Coordination
- special knowledge about, Special Knowledge About Complex Systems
- tooling for management of, Approaching Operations as an Engineering Problem
configuration management
- immutable infrastructure and, Release Engineering
containers, The Deployment Platform
Content Delivery Network (CDN), Testing and staging, Uses for RUM
context propagation, Thin Libraries and Context Propagation
context, control vs., Context Versus Control in SRE-Context Versus Control in SRE
continuous delivery/deployment
- championing, Championing CD
- database reliability engineering and, Continuous Delivery: From Development to Production-Tools
- deployment, Deployment-Championing CD
- immutable infrastructure and, Continuous Integration/Continuous Deployment with Confidence
- third-party integrations and, Testing and staging
continuous integration
- immutable infrastructure and, Continuous Integration/Continuous Deployment with Confidence
- third-party integrations and, Testing and staging
contract termination, Decommissioning
control plane, Configuration Management (Control Plane Versus Data Plane)
convolutional neural networks, How and When Should We Apply Neural Networks?
Conway's law, Getting Buy-In, The Computer and Human Systems Cannot Be Separated
Conway, Mel, The Computer and Human Systems Cannot Be Separated
Cook, Richard, SRE Cognitive Work, Introduction-Conclusion
coordination overhead
- in complex systems, Nobody Anticipates the Overhead of Coordination
- managing, Managing the Costs of Coordination
costs of incidents, Incidents Will Impose Costs-Incidents Will Impose Costs
Cowling, James, Engineering for Data Durability, Engineering for Data Durability-Conclusion
crises
- and persons with mental disorders, Leaving
- as scarcity of time/attention, The corollary to trust is forgiveness
crisis management, Managing Crisis: Responding When Things Break Down-The corollary to trust is forgiveness
cross-domain failures, Introducing Production Engineering-Introducing Production Engineering
cross-functional teams, Get Rid of as Many Handoffs as Possible-Get Rid of as Many Handoffs as Possible
cross-team reliability, Resolving Cross-Team Reliability Issues by Using Postmortems
cultural fit interviews, Cultural interviews
culture (see success culture, SRE as)

D

dashboards, Real-Time Dashboards: The Bread and Butter of SRE
data durability, engineering for, Engineering for Data Durability-Conclusion
- automation, Automation-Reliability
- backups, Backups
- estimating durability, Estimating durability-Estimating durability
- freshness, Freshness
- isolation, Isolation-Operational isolation
- protection, Protection-Recovery
- real-world strategies, Real-World Durability-Reliability
- recovery, Recovery
- replication, Replication Is Table Stakes-Estimating durability
- replication techniques, Replication-Estimating durability
- restoration, Restoration
- safeguards, Safeguards
- testing, Testing
- verification, Verification-Watching the Watchers
- window of vulnerability, Window of Vulnerability
- zero-errors system, The Power of Zero
data loss
- application errors and, Application errors
- detection of, Building Block 1: Detection-Operating system and hardware errors
- infrastructure services and, Infrastructure services
- operating system/hardware errors, Operating system and hardware errors
- user error and, User error
data plane, Configuration Management (Control Plane Versus Data Plane)
database reliability engineering, Database Reliability Engineering-Making the Case for DBRE
- anatomy of a recovery strategy, Anatomy of a Recovery Strategy-Championing Recovery Reliability
- best practices and standards, Best practices and standards
- collaboration, Collaboration
- considerations for recovery, Considerations for Recovery
- continuous delivery and, Continuous Delivery: From Development to Production-Tools
- culture of, A Culture of Database Reliability Engineering
- data protection, Protect the Data
- deployment of CD, Deployment-Championing CD
- documentation, Architecture
- educating developers, Education and Collaboration-Tools
- guiding principles, Guiding Principles of the Database Reliability Engineer-Databases Are Not Special
- impact analysis, Impact Analysis
- making the case for, Making the Case for DBRE
- migration patterns, Migration Patterns-Rollback testing
- migration testing, Migration testing
- migrations and versioning, Migrations and Versioning
- organization's data model, Data model
- pet vs. cattle servers, Databases Are Not Special
- recoverability, Recoverability-Championing Recovery Reliability
- rollback testing, Rollback testing
- self-service for scale, Self-Service for Scale
- tools for, Tools
de-provisioning, Provisioning, Change Management, and Velocity
Debois, Patrick, SRE Patterns Loved by DevOps, SRE Patterns Loved by DevOps People Everywhere-Conclusion
decision trees, Why Now? What Changed for Us?, Decision trees-Decision trees
decision-making, uncertainty and, Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
decoherence, Decoherence and Cascading Failure
decommissioning, Decommissioning
Deep Blue, From Chess to Go: How Deep Can We Dive?
deep reinforcement learning, From Chess to Go: How Deep Can We Dive?
DeepMind, From Chess to Go: How Deep Can We Dive?, Success Stories
degradation, graduated, Graduated degradation
demand forecasting, Capacity Planning and Demand Forecasting
dependencies, external, Understanding External Dependencies-Understanding External Dependencies
deployment process, immutable infrastructure and, Deploying Applications
dev owners, The dev owner role
development teams
- embedding SRE teams into, The Embedded SRE
- interaction with SRE teams, Choose SRE for the Right Reasons
DevOps
- and enterprise operations model–SRE transition, Leverage Existing Enthusiasm for DevOps
- and SRE origins, Where Did SRE Come From?
- automated testing at Google, Pattern 1: Birth of Automated Testing at Google-Pattern 1: Birth of Automated Testing at Google
- creating shared source code repository, Pattern 3: Create a Shared Source Code Repository-Pattern 3: Create a Shared Source Code Repository
- defined, Leverage Existing Enthusiasm for DevOps
- DevOps and SRE, Background-Replies
- favorite SRE patterns, SRE Patterns Loved by DevOps People Everywhere-Conclusion
Dickerson, Mikey, Production Engineering at Facebook
- Elegy for Complex Systems, Elegy for Complex Systems-To Get Involved
disaster planning, Disaster planning
Disk Scrubber, Disk Scrubber
Distributed Denial-of-Service (DDoS) attack, Case Study: WAF/Bot Mitigation
DNS (Domain Name System)
- as external dependency, Understanding External Dependencies
- at Spotify, Lightening the manual load
- routing requests, Routing requests with DNS
Docker images, Deploying Applications
documentation
- archiving/deleting unnecessary docs, Ruthlessly Prune Your Docs
- best practices for, Doing Docs Better: Best Practices-Recognize and Reward Documentation
- communicating value of, Communicating the Value of Documentation-Communicating the Value of Documentation
- database reliability engineering and, Architecture
- Envoy, Operational learnings
- functional requirements for, Functional Requirements for SRE Documentation-Defining success metrics
- Google and, The Google Experience: g3doc and EngPlay-Integrations are key to adoption
- integrating into engineering workflow, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
- integrations as key to adoption, Integrations are key to adoption
- markup language for, Pick the simplest markup language that supports your needs
- of policies, Policies
- playbooks, Playbooks
- postmortems, Postmortems
- quality characteristics, Defining Quality: What Do Good Docs Look Like?-Defining success metrics
- recognizing/rewarding, Recognize and Reward Documentation
- requiring as part of code review, Require Docs as Part of Code Review
- service overviews, Service overviews
- setting realistic quality standards, Better > Best: Set Realistic Standards for Quality
- SLAs, SLAs
- source control and, Where possible, documentation should live in source control, alongside its associated code
- templates for, Create Templates for Each Documentation Type
Doherty, Mike, Replies
downtime
- direct impact, Direct impact
- indirect impact, Indirect impact
- third parties, When They’re Down, You’re Down-Indirect impact
Dropbox, data durability engineering at, Engineering for Data Durability-Conclusion
Drucker, Peter, Achieving Business Success Through Promises (Service Levels)
durability (see data durability, engineering for)
dynamic configuration API, Configuration Management (Control Plane Versus Data Plane)

E

ecommerce, scriptable load balancers and, Case Study: Checkout Queue-Case Study: Checkout Queue
Economic Pillars of Complexity (EPC), Economic Pillars of Complexity
Edge Side Includes (ESI), Project Operating Expense and Abandonment Expense
Edwards, Damon, Clearing the Way for SRE in the Enterprise, Clearing the Way for SRE in the Enterprise-Join the Movement
Ek, Daniel, Driving the Paradigm Shift
Eklund, Jeff, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
Elastic Compute Cloud (EC2), Self-Service Is More Than a Button
embedded SREs, The Embedded SRE, Getting Buy-In
emergency response, Incident Management and Emergency Response
emotion, engineering problems and, Orienting to a Data-Driven Approach
EngPlay, The Google Experience: g3doc and EngPlay-Integrations are key to adoption
Ensor, Phil S., Silos Get in the Way
enterprise operations model–SRE transition, Clearing the Way for SRE in the Enterprise-Join the Movement
- DevOps and, Leverage Existing Enthusiasm for DevOps
- error budgets, Error Budgets-Error Budgets
- Lean manufacturing concepts applied to, Start by Leaning on Lean-Start by Leaning on Lean
- minimizing handoffs, Get Rid of as Many Handoffs as Possible-Get Rid of as Many Handoffs as Possible
- psychological safety/human factors, Psychological Safety and Human Factors
- replacing handoffs with self-service, Replace Remaining Handoffs with Self-Service-Operations as a Service
- silos, queues, and tickets, Silos, Queues, and Tickets-Ticket-Driven Request Queues Are Expensive
- steps in clearing obstacles to, Take Action Now-Psychological Safety and Human Factors
- toil as enemy of SRE, Toil, the Enemy of SRE-Toil, the Enemy of SRE
- toil in the enterprise, Toil in the Enterprise
- toil limits in, Toil Limits
- unifying backlogs and protecting capacity, Unify Backlogs and Protect Capacity
Envoy
- development learnings, Development learnings
- operation of, Operating Envoy at Lyft-Technical learnings
- operational learnings, Operational learnings
- origin and development of, The Origin and Development of Envoy at Lyft
- technical learnings, Technical learnings
Equal Employment Opportunity Commission, U.S. (EEOC), Application
error budgets
- defined, Tracking Availability Level
- in enterprise operations model–SRE transition, Error Budgets-Error Budgets
eventually consistent service discovery, Eventually Consistent Service Discovery
Ewald, Michael, Replies
exclusion backlash, Exclusion backlash
exit interviews, Leaving
expectations
- ambiguous/imaginary, Imaginary expectations
- for teams, Make your communication clear and your expectations explicit

F

Facebook, production engineering at, Production Engineering at Facebook-Production Engineering at Facebook
failure
- cascading, Decoherence and Cascading Failure
- complex systems as always in state of partial failure, Always in a State of Partial Failure
- mean time to (see mean time to failure)
failure domains, isolating, Isolated failure domains
failure recovery, Failure Recovery
Farley, Thomas, Sacrifice Decisions Take Place Under Uncertainty
Farmer, Andrew, Replies
Fast Properties, Context Versus Control in SRE
fatigue, of operator, Operator Fatigue
FBAR (Facebook Auto Remediation), Production Engineering at Facebook
Fernandez, Manuel, Replies
Fields, James Alex, Preparing for the worst: handling terror at Unite the Right
firefighting phase of SRE execution, Phase 1: Firefighting/Reactive
first-class citizens, third parties as, Third Parties as First-Class Citizens-Closing Thoughts
flexible scheduling, Working Conditions, Flexible schedules
Fong, Andrew, Interviewing Site Reliability Engineers, Interviewing Site Reliability Engineers-Final Thoughts on Interviewing SREs
Fong-Jones, Liz, Intersections Between Operations and Social Activism, Intersections Between Operations and Social Activism-Conclusion
Ford, Henry, Introducing SRE in Large Enterprises, Why Should We Care About Practitioner Cognition?
forgiveness, trust and, The corollary to trust is forgiveness
Forster, E. M., Active Teaching and Learning
frameworks, privacy engineering and, Frameworks
frequentist estimators, What Is Machine Learning?
Fukushima Daiichi nuclear disaster, Every Incident Could Have Been Worse
functional quality, of documentation, Defining Quality: What Do Good Docs Look Like?
funnel (hiring process), The Funnel
- onsite interview, The Onsite Interview
- phone screens, Phone Screens
- take-home questions, Take-Home Questions

G

g3doc, The Google Experience: g3doc and EngPlay-Integrations are key to adoption
games, for active learning, Active Learning-Active Learning Example: Incident Manager (a Card Game)
gamma distribution, On Evaluating SLOs
Garza, Alicia, Beyond culpability: building capacity instead of assigning blame
gatekeeper phase of SRE execution, Phase 2: Gatekeepers
Gillies, Aaron, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
Go (game), From Chess to Go: How Deep Can We Dive?
goalies, Introducing the goalie role
goals, for SRE teams, Setting goals and defining metrics of success
Going to the Gemba, Start by Leaning on Lean
Golden Path, Prelude
Goldfuss, Alice, We don’t need another hero
Gollapalli, Sriram, Introducing SRE in Large Enterprises, Introducing SRE in Large Enterprises-Closing Thoughts
Google
- automated testing at, Pattern 1: Birth of Automated Testing at Google-Pattern 1: Birth of Automated Testing at Google
- data center cooling management with AI, Success Stories
- DevOps–SRE relationship, Replies
- documentation at, Do Docs Better: Integrating Documentation into the Engineering Workflow, The Google Experience: g3doc and EngPlay-Integrations are key to adoption
- launch and handoff readiness review, Pattern 2: Launch and Handoff Readiness Review at Google-Pattern 2: Launch and Handoff Readiness Review at Google
- shared source code repository at, Pattern 3: Create a Shared Source Code Repository-Pattern 3: Create a Shared Source Code Repository
- SRE teams embedded with software engineering teams, Identify Stakeholders
- Wheel of Misfortune training tool, Active Learning Example: Wheel of Misfortune
Google model of cross-functional teams, Get Rid of as Many Handoffs as Possible
Google Web Server (GWS), Pattern 1: Birth of Automated Testing at Google-Pattern 1: Birth of Automated Testing at Google
Gorcenski, Emily, Intersections Between Operations and Social Activism, Intersections Between Operations and Social Activism-Conclusion
gradient descent, A neural network from scratch
graduated degradation, Graduated degradation
Gustavsson, Niklas, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
Gwartz, Jason, Replies

H

hack-a-months, Production Engineering at Facebook
HAL 9000, Why Use Machine Learning for SRE?, The Awakening of Applied AI
Hale, Jefferson, Replies
Hand, Jason, Replies
handoff readiness review (HRR), Pattern 2: Launch and Handoff Readiness Review at Google-Pattern 2: Launch and Handoff Readiness Review at Google
handoffs
- minimizing, Get Rid of as Many Handoffs as Possible-Get Rid of as Many Handoffs as Possible
- replacing with self-service, Replace Remaining Handoffs with Self-Service-Operations as a Service
HAProxy, Scriptable Load Balancers: The New Kid on the Block
hardware errors, Operating system and hardware errors
healthcare.gov, Elegy for Complex Systems
Heckman, Tim, Replies
heroics
- negative consequences of, We don’t need another hero
Heyer, Heather, Preparing for the worst: handling terror at Unite the Right, Charlottesville in review: assigning and avoiding blame
hiring (see job application/hiring process)
histograms
- percentiles vs., Where Percentiles Fall Down (and Histograms Step Up)
- SLOs and, Histograms-Where Percentiles Fall Down (and Histograms Step Up)
Hopper, Grace, Approaching Operations as an Engineering Problem
Horowitz, Jonah, Immutable Infrastructure and SRE, Immutable Infrastructure and SRE-Disadvantages
human error, Antipattern 4: Root Cause = Human Error-Antipattern 4: Root Cause = Human Error
human resources, The Primary Indicator of a Successful Team-Operations teams are bad at estimating their level of psychological safety
human-factors research, Beyond Burnout
Humble, Jez, SRE Patterns Loved by DevOps, SRE Patterns Loved by DevOps People Everywhere-Conclusion

I

immutable infrastructure, Immutable Infrastructure and SRE-Disadvantages
- base image construction, Building the Base Image
- continuous integration/continuous deployment with confidence, Continuous Integration/Continuous Deployment with Confidence
- defined, Scalability, Reliability, and Performance
- deploying applications, Deploying Applications
- disadvantages of, Disadvantages
- failure recovery, Failure Recovery
- faster startup times, Faster Startup Times
- known state, Known State
- multiregion operations, Multiregion Operations
- release engineering, Release Engineering
- scalability, reliability, and performance, Scalability, Reliability, and Performance
- security, Security
- simplicity, Simpler Operations
impact analysis, Impact Analysis
impact monitoring, Monitoring
incentive structure
- for SRE buy-in, Getting Buy-In
- novelty priority inversion and, Novelty Priority Inversion
incident analysis, What Can You Do?
incident command, Principles 1 and 2 (interfaces and incident command)
Incident Manager (card game), Active Learning Example: Incident Manager (a Card Game)-Active Learning Example: Incident Manager (a Card Game)
incident metrics, Using Incident Metrics to Improve SRE at Scale-Learnings: TL;DR
- improving SRE at scale with, Using Incident Metrics to Improve SRE at Scale-Learnings: TL;DR
- real-time dashboards, Real-Time Dashboards: The Bread and Butter of SRE
- repair debt, Repair Debt
- reviewing, Metrics Review: If a Metric Falls in the Forest…
- surrogate metrics, Surrogate Metrics
- time to detect, The Virtuous Cycle to the Rescue: If You Don’t Measure It…
- time to engage, The Virtuous Cycle to the Rescue: If You Don’t Measure It…
- time to fix, The Virtuous Cycle to the Rescue: If You Don’t Measure It…
- virtual repair debt, Virtual Repair Debt: Exorcising the Ghost in the Machine
- virtuous cycle, The Virtuous Cycle to the Rescue: If You Don’t Measure It…-The Virtuous Cycle to the Rescue: If You Don’t Measure It…
incident response/management
- activities of SREs during incidents, What Do SRE People Do?
- calibration problem, The Calibration Problem-Incidents Are Opportunities for Collective Recalibration
- classification schemes, Classification schemes
- cognitive work and, Introduction-Conclusion
- formal role assignments, Formal role assignments
- managing coordination costs, Managing the Costs of Coordination
- observations on SRE cognitive work around incidents, Observations on SRE Cognitive Work Around Incidents-SREs Are Cognitive Agents Working in a Joint Cognitive System
- opportunities to mitigate worst aspects of incident, Every Incident Could Have Been Worse
- repairs to functional systems, Repairs to Functional Systems
- sacrifice decisions and uncertainty, Sacrifice Decisions Take Place Under Uncertainty
- special knowledge about complex systems, Special Knowledge About Complex Systems
- SRE teams and, Incident Management and Emergency Response
- SREs as cognitive agents working in a joint cognitive system, SREs Are Cognitive Agents Working in a Joint Cognitive System
incidents
- addressing the calibration problem, Address the Calibration Problem
- and collective recalibration, Incidents Are Opportunities for Collective Recalibration
- and individual recalibration, Incidents Trigger Individual Recalibration
- and specific calibration problems, Incidents Are Opportunities for Collective Recalibration
- automation as team player in SRE work, Focus on Making Automation a Team Player in SRE Work
- building a corpus of cases, Build a Corpus of Cases
- changing patterns of, Incident Patterns Will Change
- costs imposed by, Incidents Will Impose Costs-Incidents Will Impose Costs
- harvesting value of, What Should Happen Next?-Address the Calibration Problem
- inevitability in complex systems, Incidents Will Continue
Index Scanner, Index Scanner
infrastructure services, Infrastructure services
infrastructure, immutable (see immutable infrastructure)
insourcing, growing teams via, Growing the team: insource or outsource?
integration monitoring, Monitoring
integrations, documentation and, Integrations are key to adoption
intelligent agents, What Is Machine Learning?
interrupt work, project work vs., We love interrupts and the torrents of information
interviewing (job interviews), Interviewing Site Reliability Engineers-Final Thoughts on Interviewing SREs
- advice for hiring managers, Advice for Hiring Managers
- and persons with mental disorders, Interviewing
- basics, Interviewing 101-The Funnel
- biases in, Biases
- funnel basics, The Funnel
- funnels, SRE Funnels-Final Thoughts on Interviewing SREs
- industry vs. university candidate profiles, Industry Versus University
- onsite interview, The Onsite Interview
- phone screens, Phone Screens
- selling candidates on your organization, Selling candidates
- take-home questions, Take-Home Questions
- walking away from a candidate, Walking away
IPython, Installing Python, IPython, and Jupyter Notebook
isolation
- and data durability, Isolation-Operational isolation
- logical, Logical isolation
- of failure domains, Isolated failure domains
- operational, Operational isolation-Operational isolation
- physical, Physical isolation

J

Jansson, Mattias, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
job application/hiring process
- as funnel process, The Funnel
- benefits for applicants with mental disorders, Benefits-Benefits
- compensation and applicants with mental disorders, Compensation
- for persons with mental disorders, Interviewing
- interviewing SREs (see interviewing)
- onboarding packets, Onboarding
job duties, for persons with mental disorders, Job Duties
job function, on-call as, Application
job postings, Application
Joblint, Application
Johnston, Bennie, Replies
joint cognitive system (JCS), SREs Are Cognitive Agents Working in a Joint Cognitive System, Focus on Making Automation a Team Player in SRE Work
Jones, Matt, Replies
Jupyter Notebook, Installing Python, IPython, and Jupyter Notebook

K

Kaizen (continuous improvement), Start by Leaning on Lean-Start by Leaning on Lean
Kanban, Unify Backlogs and Protect Capacity
Kanwar, Pranay, Replies
Kasparov, Garry, From Chess to Go: How Deep Can We Dive?
Kata method, Start by Leaning on Lean-Start by Leaning on Lean
Kelvin, Lord (William Thomson), From SysAdmin to SRE in 8,963 Words, Where Did SRE Come From?
Key Performance Indicators (KPIs), Where Did SRE Come From?, Monitoring, Metrics, and KPIs
Kim, Gene, SRE Patterns Loved by DevOps, SRE Patterns Loved by DevOps People Everywhere-Conclusion
Kissner, Lea, The General Landscape of Privacy Engineering
Klein, Matt, SRE throughout the development cycle
- on service meshes, The Service Mesh: Wrangler of Your Microservices?-The Future of the Service Mesh
Knight Capital, Sacrifice Decisions Take Place Under Uncertainty
known-knowns, software failure and, Underlying Assumptions Driving On-Call for Engineers
known-unknowns, software failure and, Underlying Assumptions Driving On-Call for Engineers
Kobayashi Maru, Active Learning Example: Wheel of Misfortune
Koen, Brian, Approaching Operations as an Engineering Problem
Kriegsspiel, Active Learning
Kubrick, Stanley, The Awakening of Applied AI

L

Lafeldt, Mathias, Approaching Operations as an Engineering Problem
Lamott, Anne, Better > Best: Set Realistic Standards for Quality
large enterprises, Introducing SRE in Large Enterprises-Closing Thoughts
- defining current state, Defining Current State-To establish a roadmap for what products SRE will be responsible for, survey the current infrastructure landscape
- defining SRE for, Defining SRE-Defining SRE
- DevOps–SRE relationship, Replies
- identifying/educating stakeholders, Identifying and Educating Stakeholders
- implementing the SRE team, Implementing the SRE Team-Defining the role of supporting divisions
- introducing SRE into, Introducing SRE-Closing Thoughts
- lessons learned from process of introducing SRE, Lessons Learned
- preparing business case for SRE, Prepare the business case: personalize and evaluate the cost of having engineering resources responsible for reliability
- presenting business case for SRE, Presenting the Business Case
- sample implementation roadmap, Sample Implementation Roadmap
launch readiness review (LRR), Pattern 2: Launch and Handoff Readiness Review at Google-Pattern 2: Launch and Handoff Readiness Review at Google
leaders, as advocates for SRE, Start having conversations with leaders and champions in the organization
Lean manufacturing movement, Start by Leaning on Lean-Start by Leaning on Lean
learning, active (see active learning)
Lee, Francis, Beyond culpability: building capacity instead of assigning blame
Legeza, Vladimir, From SysAdmin to SRE in 8,963 Words, From SysAdmin to SRE in 8,963 Words-Conclusion
LGBTQ+ inclusivity, Mental Disorders Are Missing from the Diversity Conversation, Benefits
Lightweight Directory Access Protocol (LDAP), Understanding External Dependencies
Limoncelli, Thomas A.
- on DevOps–SRE relationship, Replies
- on LRR, Pattern 2: Launch and Handoff Readiness Review at Google
- on shared source repository, Pattern 3: Create a Shared Source Code Repository
LinkedIn, Testing and staging, Uses for RUM
load balancers (see scriptable load balancers)
logging, Logging
logical backups, Full and incremental logical backups
logical isolation, Logical isolation
long short-term memory (LSTM) networks, Why Now? What Changed for Us?
Looney, John, Psychological Safety in SRE, The Primary Indicator of a Successful Team-Operations teams are bad at estimating their level of psychological safety
Lund, Tanner, Replies
Lutner, Sean, Replies
Lyft
- operation of Envoy at, Operating Envoy at Lyft-Technical learnings
- origin and development of Envoy, The Origin and Development of Envoy at Lyft

M

machine learning, Why Use Machine Learning for SRE?-Success Stories
- AI background, The Awakening of Applied AI
- basics, What Is Machine Learning?-Why Now? What Changed for Us?
- current SRE environment and, Why Now? What Changed for Us?
- decision trees, Decision trees-Decision trees
- defined, What Is Machine Learning?
- enterprise IT areas affected by, Success Stories
- human–machine games, From Chess to Go: How Deep Can We Dive?-From Chess to Go: How Deep Can We Dive?
- modern definition of learning in terms of machine, What Do We Mean by Learning?
- neural networks, What Are Neural Networks?-Popular Libraries for Neural Networks
- on-call substitute, Counterarguments
- practical examples, Practical Machine Learning Examples-Time series: server requests waiting
- Python/IPython/Jupyter Notebook installation, Installing Python, IPython, and Jupyter Notebook
- reasons for company to use, Why Use Machine Learning for SRE?-Success Stories
- reasons to use, Why Use Machine Learning for SRE?
- Spotify and, The Future: Speed at Scale, Safely
- SRE problems addressed by, Some SRE Problems Machine Learning Can Help Solve
- TensorFlow and TensorBoard, Using TensorFlow and TensorBoard-Using TensorFlow and TensorBoard
- time series: server requests waiting, Time series: server requests waiting-Time series: server requests waiting
- training a neural network from scratch, A neural network from scratch-A neural network from scratch
MacNamara, Ríona, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
Mangot, Dave, Replies
Markdown, The Google Experience: g3doc and EngPlay, Pick the simplest markup language that supports your needs
market-oriented teams, Get Rid of as Many Handoffs as Possible
Markov model, Estimating durability
markup language, Pick the simplest markup language that supports your needs
Maslow's Hierarchy of Needs, Production Engineering at Facebook
Master Service Agreements (MSAs), Negotiating SLAs with vendors
McDuffee, Keith, Replies
McEniry, Chris, Replies
Mean Time Between Failures (MTBF), Always in a State of Partial Failure
Mean Time to Detect (MTTD), Clearing the Way for SRE in the Enterprise
Mean Time to Failure (MTTF)
- in Markov chain, Estimating durability
- inappropriate optimization of, Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)-Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)
Mean Time to Recovery (MTTR), Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)-Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)
Mean Time to Repair (MTTR), Always in a State of Partial Failure
- defined, Clearing the Way for SRE in the Enterprise
- window of vulnerability and, Window of Vulnerability
Mediratta, Bharat, Pattern 1: Birth of Automated Testing at Google
Meickle, James, Beyond Burnout, Beyond Burnout-Mental Disorder Resources
Menchaca, Joaquin, Replies
mental disorders, persons with
- and diversity conversation, Mental Disorders Are Missing from the Diversity Conversation
- benefits for, Benefits-Benefits
- business environment, Sanity Isn’t a Business Requirement
- compensation in job application process, Compensation
- crisis and, Leaving
- defined, Defining Mental Disorders
- importance of detailed job postings, Application
- inclusivity as beneficial to all, Inclusivity for Anyone Helps Everyone
- ineffectiveness of common workplace strategies towards, Thoughts and Prayers Aren’t Scalable
- interviewing for job, Interviewing
- job duties, Job Duties
- leaving a job, Leaving
- on-call work and, Application
- onboarding packets, Onboarding
- pro-inclusion patterns/antipatterns, Full-Stack Inclusivity-Mental Disorder Resources
- promotion, Promotion
- resources, Mental Disorder Resources
- training, Training
- working conditions, Working Conditions-Working Conditions
- workplace inclusivity and, Beyond Burnout-Mental Disorder Resources
mental health, Beyond Burnout-Mental Disorder Resources
- crises, Leaving
- defined, Beyond Burnout
mental models, Mental Models
mentorship programs, Training
Mercereau, Jonathan, Working with Third Parties Shouldn’t Suck, Working with Third Parties Shouldn’t Suck-Closing Thoughts
Messeri, Eran, Pattern 1: Birth of Automated Testing at Google, Pattern 3: Create a Shared Source Code Repository
metrics, Orienting to a Data-Driven Approach, Setting goals and defining metrics of success, Monitoring, Metrics, and KPIs
- (see also incident metrics)
Miasnikoŭ, Stas, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
Michel, Drew, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
microservices
- context vs. control at Netflix, Context Versus Control in SRE-Context Versus Control in SRE
- current state of microservice networking, Current State of Microservice Networking
- service mesh and (see service mesh)
mid-sized organizations (see SoundCloud, SRE at)
migration cost, Project Operating Expense and Abandonment Expense
- (see also abandonment expense)
migrations, database reliability engineering and, Migration Patterns-Rollback testing
Mineiro, Luis, Replies
Mitchell, Tom, What Do We Mean by Learning?
money (see compensation)
monitoring
- SRE teams and, Monitoring, Metrics, and KPIs
- third-party integrations, Monitoring-Uses for RUM
monolithic architecture
- disadvantages of, Ready to Get Rid of the Monolith?
- service mesh vs., Ready to Get Rid of the Monolith?
Moraes, Gleicon, Replies
Morrison, David, Make respect part of your team’s culture
Most Favored Customer (MFC) clauses, Decommissioning
multiregion operations, immutable infrastructure and, Multiregion Operations
Murphy, Niall Richard
- Against On-Call, Against On-Call: A Polemic-Conclusion
- Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
Mushero, Steve, Replies

N

Netflix
- chaos engineering at, In the Beginning, There Was Chaos-Conclusion
- context vs. control in SRE, Context Versus Control in SRE-Context Versus Control in SRE
Netflix model (cross-functional teams), Get Rid of as Many Handoffs as Possible
Network Operations Center (NOC), Antipattern 1: Site Reliability Operations
neural networks, What Are Neural Networks?-Popular Libraries for Neural Networks
- artificial neurons and, Neurons and Neural Networks
- datasets for, What Kinds of Data Can We Use?
- popular libraries for, Popular Libraries for Neural Networks
- training from scratch, A neural network from scratch-A neural network from scratch
- when to apply, How and When Should We Apply Neural Networks?
New York Stock Exchange outage (2015), Sacrifice Decisions Take Place Under Uncertainty
nginScript, Scriptable Load Balancers: The New Kid on the Block
Nolan, Laura, Active Teaching and Learning, Active Teaching and Learning-A Call to Action: Ditch the Boring Slides
normal distribution, On Evaluating SLOs
nosocomial automation, Focus on Making Automation a Team Player in SRE Work
Nukala, Shylaja, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation

O

O'Reilly, Tim
- on complex systems, Approaching Operations as an Engineering Problem
- on SRE, Where Did SRE Come From?
Obama, Barack, Elegy for Complex Systems
object stores, Object stores
object-relational mapping (ORM), Cognitive overload
observability, service mesh and, Current State of Microservice Networking, Observability and Alarming
on-call
- accommodations for on-call personnel, Accommodations
- alternatives to current approach, Actual Solutions-Cognitive hacks
- arguments for keeping current system, Counterarguments
- at Spotify, On-Call and Alerting
- cognitive hacks, Cognitive hacks
- cost to humans, The Cost to Humans of Doing On-Call-We don’t need another hero
- emergency medicine/SRE differences, Differences with SRE
- emergency medicine/SRE parallels, Parallels with SRE
- emergency medicine/ward medicine distinction, On-Call Is Emergency Medicine Instead of Ward Medicine-On-Call Is Emergency Medicine Instead of Ward Medicine
- exclusion backlash and, Exclusion backlash
- flexible schedules and, Flexible schedules
- improving on-the-job performance, Improving On-the-Job Performance
- industry-wide compensation model, Compensation
- need for fundamental change in approach to, We Need a Fundamental Change in Approach-A Union of the Two
- negative consequences of heroism, We don’t need another hero
- opt-out policy, Exclusion backlash
- pager fatigue, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- pair on-call, Cognitive hacks
- persons with mental disorders and, Application
- prioritization, Prioritization-Exclusion backlash
- production engineering team at Facebook, Production Engineering at Facebook
- psychological safety and, On-call and operations
- rationale for, The Rationale for On-Call-Counterarguments
- reasons to end, Against On-Call: A Polemic-Conclusion
- recovery time immediately after on-call shift, Recovery
- Strong-Anti-On-Call position, Strong-Anti-On-Call
- Strong/Weak-Anti-On-Call position, A Union of the Two
- training, Training
- triage function, First, Do No Harm
- underlying assumptions, Underlying Assumptions Driving On-Call for Engineers-Underlying Assumptions Driving On-Call for Engineers
- Weak-Anti-On-Call position, Weak-Anti-On-Call
onboarding packets, Onboarding
onsite interview, The Onsite Interview
- coding and system questions, Coding and system questions
- cultural fit interviews, Cultural interviews
- deep dives and architecture questions, Deep dives and architecture questions
OpenResty, Scriptable Load Balancers: The New Kid on the Block
operating system errors, data loss and, Operating system and hardware errors
operational debt, Repair Debt
operational expenditures (OpEx), Make a Decision, Get Rid of as Many Handoffs as Possible, Capacity Planning and Demand Forecasting
operational isolation, Operational isolation-Operational isolation
operations
- approaching as engineering problem, Approaching Operations as an Engineering Problem-Approaching Operations as an Engineering Problem
- intersections between social activism and, Intersections Between Operations and Social Activism-Conclusion
- Spotify's early focus on, Prelude, Bringing Scalability and Reliability to the Forefront
- Spotify's growth and, Prelude
Operations as a Service (OaaS), Operations as a Service
operations organization (see enterprise operations model–SRE transition)
operator fatigue, Operator Fatigue
ops owners, The ops owner role
Ops Teams
- at Spotify, Spawning Off Internal Office Support
Ops-in-Squads
- balancing squad autonomy with tech stack consistency, Autonomy Versus Consistency: 2015–2017-Key Learnings
- benefits, Benefits
- expansion of concept, Driving the Paradigm Shift
- rollout, Building on Trust
- Spotify's use of, Introducing Ops-in-Squads: 2013–2015-Key Learnings
- trade-offs, Trade-Offs
outsourcing, Growing the team: insource or outsource?
- (see also third parties)
overhead of coordination, Nobody Anticipates the Overhead of Coordination

P

pager fatigue, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
pair on-call, Cognitive hacks
Papert, Seymour, Active Learning
Pareto distribution, On Evaluating SLOs
partner phase of SRE execution, Phase 3: Advocates/Partners
patterns
- automated testing at Google, Pattern 1: Birth of Automated Testing at Google-Pattern 1: Birth of Automated Testing at Google
- creating shared source code repository, Pattern 3: Create a Shared Source Code Repository-Pattern 3: Create a Shared Source Code Repository
- DevOps and, SRE Patterns Loved by DevOps People Everywhere-Conclusion
  - (see also antipatterns)
- launch and handoff readiness review at Google, Pattern 2: Launch and Handoff Readiness Review at Google-Pattern 2: Launch and Handoff Readiness Review at Google
- of incidents, Incident Patterns Will Change
- pro-inclusion of employees with mental disorders, Full-Stack Inclusivity-Mental Disorder Resources
Paul, Paula, Replies
PayPal, Replies
Pentagon building, Nobody Anticipates the Overhead of Coordination
percentiles, histograms vs., Where Percentiles Fall Down (and Histograms Step Up)
performance analysis, Performance Analysis and Optimization
performance optimization, Performance Analysis and Optimization
phone screening, Phone Screens
Photon, Active Learning Example: SRE Classroom
physical backups, Full physical backups
physical isolation, Physical isolation
playbooks, documentation of, Playbooks
Poblador i Garcia, David, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
policies, documentation of, Policies
political organizing (see social activism)
postmortems
- as active learning opportunity, Postmortems
- documentation of, Postmortems
- resolving cross-team reliability issues with, Resolving Cross-Team Reliability Issues by Using Postmortems
- social activism, Writing Our Own History: Making Sense of What Went Down-Beyond culpability: building capacity instead of assigning blame
Potvin, Rachel, Pattern 3: Create a Shared Source Code Repository
privacy engineering, The Intersection of Reliability and Privacy-Conclusion
- automation and, Automation
- commonalities with SRE, Privacy and SRE: Common Approaches-Early Intervention and Education Through Evangelism
- default behavior for shared architectures, Default behavior for shared architectures
- differences from SRE, Nuances, Differences, and Trade-Offs
- early intervention and education through evangelism, Early Intervention and Education Through Evangelism-Early Intervention and Education Through Evangelism
- efficient and deliberate problem solving, Efficient and Deliberate Problem Solving
- frameworks and, Frameworks
- goals of, The General Landscape of Privacy Engineering-The General Landscape of Privacy Engineering
- intersection of reliability and privacy, The Intersection of Reliability and Privacy
- questions for evaluating products/services, The General Landscape of Privacy Engineering
- root causing, Find and address root causes
- team relationship management, Relationship Management
- toil reduction with, Reducing Toil-Frameworks
Probability Density Function (PDF), On Evaluating SLOs
procurement teams, Integration timeline?
product-oriented teams, Get Rid of as Many Handoffs as Possible
production engineering (PE), Production Engineering at Facebook-Production Engineering at Facebook
- centralized reporting structure for, Production Engineering at Facebook-Production Engineering at Facebook
- implementing a PE organization outside of Facebook, Production Engineering at Facebook-Production Engineering at Facebook
- key traits of successful engineers, Production Engineering at Facebook-Production Engineering at Facebook
- limited on-call status of, Production Engineering at Facebook
- organizational model for, Production Engineering at Facebook-Production Engineering at Facebook
- origins at Facebook, Production Engineering at Facebook-Production Engineering at Facebook
- relationships with other teams at Facebook, Production Engineering at Facebook
- stages of team/service development, Production Engineering at Facebook
- team creation, Production Engineering at Facebook
production meetings, as active learning opportunity, Production Meetings
production readiness, Integration timeline?
project operating expense (PrOpEx), Project Operating Expense and Abandonment Expense
project-based funding, Get Rid of as Many Handoffs as Possible
Prometheus, Developers’ Productivity and Health Versus the Pager
promotion of persons with mental disorders, Promotion
Proof-of-Concept (PoC), Make a Decision
provisioning, Provisioning, Change Management, and Velocity
psychological safety
- avoiding ambiguous expectations, Imaginary expectations
- avoiding information overload, We love interrupts and the torrents of information
- building into your team, How to Build Psychological Safety into Your Own Team-Operations teams are bad at estimating their level of psychological safety
- clear communication/explicit expectations, Make your communication clear and your expectations explicit
- intersections between operations and social activism, Intersections Between Operations and Social Activism-Conclusion
- on-call rotations and pager fatigue, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- on-call work, On-call and operations, The Cost to Humans of Doing On-Call-We don’t need another hero
- operations teams vs. other engineering teams, Why are operations teams more likely to feel unsafe than other engineering teams?
- operations teams' difficulty in estimating level of, Operations teams are bad at estimating their level of psychological safety
- publicizing/celebrating team successes, Make it obvious when your team is doing well
- respect as part of team culture, Make respect part of your team’s culture
- space for people to take chances, Make space for people to take chances
- SRE cognitive work, Introduction-Conclusion
- SRE teams and, Orienting to a Data-Driven Approach
- SRE/social activism parallels, Principles 3 and 4 (blameless retrospectives and psychological safety)
- successful teams and, The Primary Indicator of a Successful Team-Operations teams are bad at estimating their level of psychological safety
- ways to make teams feel safe, Make your team feel safe
Python, Installing Python, IPython, and Jupyter Notebook

Q

quality of service (QoS), Why Set Goals?
queues, ticket-driven, Silos Get in the Way-Silos Get in the Way

R

Rabenstein, Björn, How to Apply SRE Principles Without Dedicated SRE Teams, How to Apply SRE Principles Without Dedicated SRE Teams-Further Reading
Rampke, Matthias, How to Apply SRE Principles Without Dedicated SRE Teams, How to Apply SRE Principles Without Dedicated SRE Teams-Further Reading
Rasmussen, Jens, Navigating Complexity for Safety
Rau, Vivek, Toil, the Enemy of SRE
reactive phase of SRE execution, Phase 1: Firefighting/Reactive
real-time dashboards, Real-Time Dashboards: The Bread and Butter of SRE
Real-User Monitoring (RUM)
- SLIs informed by, RUM informs SLIs
- third-party integrations and, Indirect impact, Uses for RUM
recovery time (see Mean Time to Recovery (MTTR))
recovery/recoverability of data
- championing recovery reliability, Championing Recovery Reliability
- considerations for, Considerations for Recovery
- database reliability engineering and, Recoverability-Championing Recovery Reliability
- detection of data loss/corruption, Building Block 1: Detection-Operating system and hardware errors
- diverse storage, Building Block 2: Diverse Storage
- full physical backups, Full physical backups
- incremental physical backups, Incremental physical backups
- logical backups, Full and incremental logical backups
- object stores, Object stores
- testing, Building Block 4: Testing
- varied toolbox for, Building Block 3: A Varied Toolbox
recurrent neural networks, How and When Should We Apply Neural Networks?
Redundant Array of Independent Disks (RAID), Window of Vulnerability
redundant systems, Redundant systems
Reinertsen, Donald G., Ticket-Driven Request Queues Are Expensive
release engineering, Release Engineering
remote work, Working Conditions
Rendell, Mark, Replies
repair debt, Repair Debt
replication techniques
- data durability, engineering for, Replication-Estimating durability
- estimating durability, Estimating durability-Estimating durability
reporting APIs, Polling API informs SLIs, Uses for RUM, Logging
Republican National Convention protests (2008), Principles of Organizing
request pausing, Case Study: Intermission-Case Study: Intermission
request queues, Silos Get in the Way-Silos Get in the Way
respect, team culture and, Make respect part of your team’s culture
resumes, blind review of, Biases
roles, formal assignment of, Formal role assignments
rollbacks, Rollback testing
rolling release, blue/green release vs., Disadvantages
root access, Bringing Scalability and Reliability to the Forefront
root causing
- privacy engineering, Find and address root causes
Root, Lynn, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
Rosenthal, Casey, In the Beginning, There Was Chaos, In the Beginning, There Was Chaos-Conclusion
Rother, Mike, Start by Leaning on Lean
routing, shard-aware (see shard-aware routing)
Russek, Johannes, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
Ryan, Andrew, Production Engineering at Facebook

S

sacrifice decisions, Sacrifice Decisions Take Place Under Uncertainty
St. Paul Principles, Principles of Organizing, The corollary to trust is forgiveness
Samuel, Arthur, From Chess to Go: How Deep Can We Dive?
scaling, Using Incident Metrics to Improve SRE at Scale-Learnings: TL;DR
scheduling, flexible, Working Conditions, Flexible schedules
Schlossnagle, Theo, The Art and Science of the Service-Level Objective, The Art and Science of the Service-Level Objective-Parting Thought: Looking at SLOs Upside Down
Schwartz, Mark, Antipattern 15: Ungainly Governance
scriptable load balancers, Scriptable Load Balancers-Looking to the Future and Further Reading
- about, Scriptable Load Balancers: The New Kid on the Block-Why Scriptable Load Balancers?
- advantages of, Why Scriptable Load Balancers?
- checkout queue case study, Case Study: Checkout Queue-Case Study: Checkout Queue
- defined, Scriptable Load Balancers
- future issues, Looking to the Future and Further Reading
- harnessing the potential of, Harnessing Potential
- maintaining resiliency, Avoiding Disaster-Case Study: Checkout Queue
- problems solved by, Making the Difficult Easy-Case Study: Intermission
- request pausing, Case Study: Intermission-Case Study: Intermission
- routing requests with, Routing requests with a scriptable load balancer
- service-level middleware, Service-Level Middleware-Case Study: WAF/Bot Mitigation
- shard-aware routing, Shard-Aware Routing-Routing requests with a scriptable load balancer
- state and, Getting Clever with State
security
- immutable infrastructure and, Security
- self-service and, Self-Service Helps SREs in Multiple Ways
- third-party vendors and, Establish Importance
self-service
- and OaaS, Operations as a Service
- benefits to SREs, Self-Service Helps SREs in Multiple Ways
- database reliability engineering and, Self-Service for Scale
- maximizing effectiveness of, Self-Service Is More Than a Button
- replacing handoffs with, Replace Remaining Handoffs with Self-Service-Operations as a Service
separation of duties, Self-Service Helps SREs in Multiple Ways
servers as cattle vs. servers as pets, Databases Are Not Special
service discovery, Eventually Consistent Service Discovery
service meshes, The Service Mesh: Wrangler of Your Microservices?-The Future of the Service Mesh
- configuration management, Configuration Management (Control Plane Versus Data Plane)
- context propagation, Thin Libraries and Context Propagation
- control plane vs. data plane, Configuration Management (Control Plane Versus Data Plane)
- current state of microservice networking, Current State of Microservice Networking
- eventually consistent service discovery system, Eventually Consistent Service Discovery
- in practice, The Service Mesh in Practice-Technical learnings
- monolithic architecture vs., Ready to Get Rid of the Monolith?
- observability and alarming, Observability and Alarming
- operation of Envoy at Lyft, Operating Envoy at Lyft-Technical learnings
- origin and development of Envoy at Lyft, The Origin and Development of Envoy at Lyft
- sidecar performance implications, Sidecar Performance Implications
- sidecar proxy, Service Mesh to the Rescue-The Benefits of a Sidecar Proxy
- thin libraries, Thin Libraries and Context Propagation
service overviews, documentation of, Service overviews
Service Pyramid, at Facebook, Production Engineering at Facebook
Service-Level Agreements (SLAs)
- business priorities and changes in, Dealing with Corner Cases
- corner cases and, Dealing with Corner Cases-Dealing with Corner Cases
- documentation of, SLAs
- error budgets and, Error Budgets
- negotiating with vendors, Negotiating SLAs with vendors
- nontechnical solutions in SysAdmin–SRE transition, Nontechnical Solutions
- progression in service-level execution, Progression in Service-Level Execution
- SLOs and, Service-Level Objective, Why Set Goals?
- success culture and, Achieving Business Success Through Promises (Service Levels)
- SysAdmin–SRE transition and, SLA, Establishing SLAs for Internal Components-Establishing SLAs for Internal Components
- tracking availability level for, Tracking Availability Level-Tracking Availability Level
Service-Level Indicators (SLIs)
- black boxes, SLIs on black boxes
- real-time data and, Real-time data informs SLIs
- reporting APIs, Polling API informs SLIs
- RUM and, RUM informs SLIs
- synthetic monitoring, Synthetic monitoring informs SLIs
- SysAdmin–SRE transition and, Service-Level Indicator
- third-party services and, Service-Level Indicators, Service-Level Objectives, and SLAs-Negotiating SLAs with vendors
service-level middleware
- APIs of, APIs of Service-Level Middleware
- scriptable load balancers, Service-Level Middleware-Case Study: WAF/Bot Mitigation
- WAF/Bot mitigation case study, Case Study: WAF/Bot Mitigation
Service-Level Objectives (SLOs), The Art and Science of the Service-Level Objective-Parting Thought: Looking at SLOs Upside Down
- aligning performance standards with customers' needs, Parting Thought: Looking at SLOs Upside Down
- availability, Availability-Transactions over Time Quanta
- data recovery strategies and, Considerations for Recovery
- error budgets and, Error Budgets
- evaluating, On Evaluating SLOs
- goal setting and, Why Set Goals?
- histograms and, Histograms-Where Percentiles Fall Down (and Histograms Step Up)
- percentiles vs. histograms, Where Percentiles Fall Down (and Histograms Step Up)
- SysAdmin–SRE transition and, Service-Level Objective
- third-party services and, SLOs
service-oriented teams, Get Rid of as Many Handoffs as Possible
Shannon, Adam, Replies
shard-aware routing
- routing queries in the application, Routing queries in the application
- routing requests in the application, Routing requests in the application
- routing requests with a scriptable load balancer, Routing requests with a scriptable load balancer
- routing requests with DNS, Routing requests with DNS
Sharpe, Jeremy, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
Shopify, Case Study: Checkout Queue-Case Study: Checkout Queue
shopping websites, Case Study: Checkout Queue-Case Study: Checkout Queue
Short, Chris, Replies
Shoup, Randy, Pattern 3: Create a Shared Source Code Repository
sidecar proxy, Service Mesh to the Rescue-The Benefits of a Sidecar Proxy, Sidecar Performance Implications
Siegrist, John, Replies
Sigmoid function, A neural network from scratch-A neural network from scratch
silos
- enterprise operations model–SRE transition and, Silos Get in the Way-Silos Get in the Way
- Kata and, Start by Leaning on Lean
- mismatches and, Silos Get in the Way
- missed SLOs and, Antipattern 17: Tossing Your API Over the Firewall
- Spotify and, Unintentional specialization and misalignment
Single Points of Failure (SPOFs), Beginning Chaos, Isolated failure domains
Single-Page Application (SPA) frameworks, Direct impact
Sinjakli, Chris, Replies
site up, as SRE goal, Keeping the Site Up-Graduated degradation
- graduated degradation, Graduated degradation
- isolating failure domains, Isolated failure domains
- redundant systems, Redundant systems
SLA inversion, Avoiding Disaster
Slicer, Routing requests with a scriptable load balancer
snapshots, Offline storage
social activism
- assigning/avoiding blame in reviews of, Charlottesville in review: assigning and avoiding blame
- building capacity instead of assigning blame, Beyond culpability: building capacity instead of assigning blame
- crisis management, Managing Crisis: Responding When Things Break Down-The corollary to trust is forgiveness
- forgiveness as corollary to trust, The corollary to trust is forgiveness
- intersections between operations and, Intersections Between Operations and Social Activism-Conclusion
- planning stage, Creating the Perfect Plan
- postmortems, Writing Our Own History: Making Sense of What Went Down-Beyond culpability: building capacity instead of assigning blame
- principles of organizing, Principles of Organizing
- software engineering as analogous to, Before, During, After
- turning action into change, The Long Tail: Turning Action into Change-Activism and Change Within a Company
Soundcloud, SRE at, How to Apply SRE Principles Without Dedicated SRE Teams-Further Reading
- deployment platform, The Deployment Platform
- embedded SREs, The Embedded SRE
- failure of team approach, SREs to the Rescue! (and How They Failed)-The Embedded SRE
- getting buy-in, Getting Buy-In-Getting Buy-In
- implementation details, Some Implementation Details-Getting Buy-In
- need to adjust approach to circumstances, A Matter of Scale in Terms of Headcount
- on-call rotations and pager fatigue, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- postmortems for resolving cross-team reliability issues, Resolving Cross-Team Reliability Issues by Using Postmortems
- Production Engineering team, Introducing Production Engineering-Introducing Production Engineering
- uniform infrastructure/tooling vs. autonomy/innovation, Uniform Infrastructure and Tooling Versus Autonomy and Innovation-Uniform Infrastructure and Tooling Versus Autonomy and Innovation
- you build it, you run it approach, You Build It, You Run It-Introducing Production Engineering
source control, documentation in, Where possible, documentation should live in source control, alongside its associated code
source-code repository, Google, Pattern 3: Create a Shared Source Code Repository-Pattern 3: Create a Shared Source Code Repository
space shuttle disasters, Always in a State of Partial Failure
speed at scale, safely (s3), The Future: Speed at Scale, Safely
Spotify, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- balancing squad autonomy with tech stack consistency, Autonomy Versus Consistency: 2015–2017-Key Learnings
- beta and release period, Prelude-Key Learnings
- bringing scalability and reliability to the forefront, Bringing Scalability and Reliability to the Forefront
- challenges in ops team scaling, A System That Didn’t Scale: 2012-Key Learnings
- deployment time slots, Blessed Deployment Time Slots
- dev owner role, The dev owner role
- engineering culture, SRE Without SRE: The Spotify Case Study
- excessive server growth, Forming Bad Habits
- forensics, Creating Detectives
- formalizing core services, Formalizing Core Services
- future of SRE at, The Future: Speed at Scale, Safely-The Future: Speed at Scale, Safely
- goalie role, Introducing the goalie role
- growth and early success (2010), The Curse of Success: 2010-Key Learnings
- growth-related challenges (2011), Pets and Cattle, and Agile: 2011-Breaking Those Bad Habits
- interruptions, Interruptions
- lead time issues, Long lead times
- lightening backend engineers' manual load, Lightening the manual load
- limits to manual deployments, Manual Work Hits a Cliff
- moving away from Ops-owner approach, Building on Trust-Building on Trust
- new ownership model, A New Ownership Model
- on-call and alerting, On-Call and Alerting
- operations focus in early history, Tabula Rasa: 2006–2007
- ops owner role, The ops owner role
- Ops-in-Squads (2013–2015), Introducing Ops-in-Squads: 2013–2015-Key Learnings
- reorganization of dev teams/op teams, Breaking Those Bad Habits
- splitting ops team into Production Ops and Internal IT Ops, Spawning Off Internal Office Support
- unintentional specialization/misalignment, Unintentional specialization and misalignment
SRE (generally)
- antipatterns (see antipatterns)
- defining for larger enterprises, Defining SRE-Defining SRE
  - (see also enterprise operations model–SRE transition)
- DevOps and, SRE Patterns Loved by DevOps People Everywhere-Conclusion
- origins of, Where Did SRE Come From?-Where Did SRE Come From?
- phases of execution, Phases of SRE Execution-Complications of Differing Phases
- success culture and (see success culture, SRE as)
- without teams (see Soundcloud) (see Spotify)
SRE Classroom (workshop), Active Learning Example: SRE Classroom
SRE teams (see teams)
staging of third-party integrations, Testing and staging
stakeholder identification
- in build/buy/adopt decision, Identify Stakeholders
- in large enterprises, Identifying and Educating Stakeholders
state, immutable infrastructure and, Known State
Stolarsky, Emil, Scriptable Load Balancers, Scriptable Load Balancers-Looking to the Future and Further Reading
Stone, Luke, So, You Want to Build an SRE Team?, So, You Want to Build an SRE Team?-Making a Decision About SRE
storage
- and database recovery strategy, Building Block 2: Diverse Storage
- object storage, Object storage
- offline, Offline storage
- online, high-performance, Online, high-performance storage
- online, low-performance, Online, low-performance storage
Storage Watcher, Storage Watcher
structural quality, of documentation, Defining Quality: What Do Good Docs Look Like?-Defining Quality: What Do Good Docs Look Like?
Suarez Ordoñez, Santiago, Replies
success culture, SRE as, SRE as a Success Culture-Focus on the Details of Success
- advocate/partner phase of SRE execution, Phase 3: Advocates/Partners
- approaching operations as engineering problem, Approaching Operations as an Engineering Problem-Approaching Operations as an Engineering Problem
- business success through promises (service levels), Achieving Business Success Through Promises (Service Levels)
- capacity planning/demand forecasting, Capacity Planning and Demand Forecasting
- catalytic stage of SRE execution, Phase 4: Catalytic
- complications of differing phases of SRE execution, Complications of Differing Phases
- critical enabling functions of SRE, Critical Enabling Functions of SRE-Provisioning, Change Management, and Velocity
- empowering teams to do the right thing, Empowering Teams to “Do the Right Thing”
- firefighting/reactive phase of SRE execution, Phase 1: Firefighting/Reactive
- focusing on details of success, Focus on the Details of Success
- gatekeeper phase of SRE execution, Phase 2: Gatekeepers
- incident management/emergency response, Incident Management and Emergency Response
- key values for SRE, Key Values for SRE-Progression in Service-Level Execution
- monitoring, metrics, and KPIs, Monitoring, Metrics, and KPIs
- origins of SRE, Where Did SRE Come From?-Where Did SRE Come From?
- performance analysis/optimization, Performance Analysis and Optimization
- phases of SRE execution, Phases of SRE Execution-Complications of Differing Phases
- progression in service-level execution, Progression in Service-Level Execution
- provisioning/change management/velocity, Provisioning, Change Management, and Velocity
- site up, Keeping the Site Up-Graduated degradation
surrogate metrics, Surrogate Metrics
synthetic monitoring
- SLIs and, Synthetic monitoring informs SLIs
- third-party integrations and, Indirect impact, Uses for synthetic monitoring
SysAdmin–SRE transition, From SysAdmin to SRE in 8,963 Words-Conclusion
- availability level tracking for SLAs, Tracking Availability Level-Tracking Availability Level
- concerns of Site Reliability Engineers vs. those of System Administrators, From SysAdmin to SRE in 8,963 Words
- corner cases and SLAs, Dealing with Corner Cases-Dealing with Corner Cases
- establishing SLAs for internal components, Establishing SLAs for Internal Components-Establishing SLAs for Internal Components
- external dependencies, Understanding External Dependencies-Understanding External Dependencies
- key steps in, Conclusion
- nontechnical solutions for SLAs, Nontechnical Solutions
- SLAs in, SLA
- SLIs in, Service-Level Indicator
- SLOs in, Service-Level Objective
- terminology for, Clarifying Terminology-Service-Level Objective
SysOps, A Matter of Scale in Terms of Headcount
- code deploys and, Closing the Loop: Take Your Own Pager
- Production Engineering team and, Introducing Production Engineering-Introducing Production Engineering
systems (see complex systems)

T

tail latency, Sidecar Performance Implications
taking chances, space for people to, Make space for people to take chances
tape backup, Offline storage
Taylor, Frederick Winslow, Why Should We Care About Practitioner Cognition?
teaching, active (see active learning)
teams
- avoiding ambiguous expectations, Imaginary expectations
- avoiding cognitive overload, Cognitive overload
- avoiding information overload, We love interrupts and the torrents of information
- building, So, You Want to Build an SRE Team?-Making a Decision About SRE
- building psychological safety into, How to Build Psychological Safety into Your Own Team-Operations teams are bad at estimating their level of psychological safety
- capacity planning/demand forecasting, Capacity Planning and Demand Forecasting
- choosing SRE for right reasons, Choose SRE for the Right Reasons-Choose SRE for the Right Reasons
- clear communication/explicit expectations, Make your communication clear and your expectations explicit
- commitment to SRE, Commitment to SRE
- control-based models, Context Versus Control in SRE
- cross-functional, Get Rid of as Many Handoffs as Possible-Get Rid of as Many Handoffs as Possible
- defining the role of supporting divisions, Defining the role of supporting divisions
- empowering to do the right thing, Empowering Teams to “Do the Right Thing”
- Facebook PE team creation, Production Engineering at Facebook
- five pillars of practice, Critical Enabling Functions of SRE-Provisioning, Change Management, and Velocity
- goal-setting and metrics, Setting goals and defining metrics of success
- incident management/emergency response, Incident Management and Emergency Response
- insource/outsource approaches to growth, Growing the team: insource or outsource?
- large enterprises and, Implementing the SRE Team-Defining the role of supporting divisions
- learning habits of effective SRE teams, Learning Habits of Effective SRE Teams-Postmortems
- making a decision about SRE, Making a Decision About SRE
- monitoring, metrics, and KPIs, Monitoring, Metrics, and KPIs
- orienting to a data-driven approach, Orienting to a Data-Driven Approach
- performance analysis/optimization, Performance Analysis and Optimization
- provisioning/change management/velocity, Provisioning, Change Management, and Velocity
- psychological safety for, The Primary Indicator of a Successful Team-Operations teams are bad at estimating their level of psychological safety
- publicizing/celebrating successes of, Make it obvious when your team is doing well
- respect as part of culture, Make respect part of your team’s culture
- rotation of engineering team members into SRE team, Insourcing experienced talent: rotating engineering team members
- scaling to company size, A Matter of Scale in Terms of Headcount
- space for people to take chances, Make space for people to take chances
- SRE in the development cycle, SRE throughout the development cycle
- ways to make teams feel safe, Make your team feel safe
TensorBoard, Using TensorFlow and TensorBoard-Using TensorFlow and TensorBoard
TensorFlow, Using TensorFlow and TensorBoard-Using TensorFlow and TensorBoard
termination, contract, Decommissioning
third parties
- build/buy/adopt decision, Build, Buy, or Adopt?-Project Operating Expense and Abandonment Expense
- direct impact of downtime, Direct impact
- downtime, When They’re Down, You’re Down-Indirect impact
- first-class citizens, Third Parties as First-Class Citizens-Closing Thoughts
- growing teams via, Growing the team: insource or outsource?
- indirect impact of downtime, Indirect impact
- LinkedIn case study, Project Operating Expense and Abandonment Expense
- negotiating SLAs with vendors, Negotiating SLAs with vendors
- running the black box like a service, Running the Black Box Like a Service
- SLIs, SLOs, SLAs, Service-Level Indicators, Service-Level Objectives, and SLAs-Negotiating SLAs with vendors
- SLOs, SLOs
- working with, Working with Third Parties Shouldn’t Suck-Closing Thoughts
third-party integrations
- automation, Automation
- communication, Communication
- contract termination, Decommissioning
- decommissioning, Decommissioning
- disaster planning, Disaster planning
- LinkedIn case study, Testing and staging
- logging, Logging
- monitoring, Monitoring-Uses for RUM
- playbook for, Playbook: From Staging to Production-Closing Thoughts
- reporting APIs, Logging
- synthetic monitoring, Uses for synthetic monitoring
- testing and staging, Testing and staging
- tooling, Tooling
Three Mile Island nuclear disaster, Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted, Every Incident Could Have Been Worse
throttling, Getting Clever with State
ticket-driven request queues, Silos Get in the Way-Silos Get in the Way
time quanta, Time Quanta
time to detect (TTD), The Virtuous Cycle to the Rescue: If You Don’t Measure It…, Surrogate Metrics
time to engage (TTE), The Virtuous Cycle to the Rescue: If You Don’t Measure It…, Surrogate Metrics
time to fix (TTF), The Virtuous Cycle to the Rescue: If You Don’t Measure It…
Todd, Chad, Replies
toil
- as enemy of SRE, Toil, the Enemy of SRE-Toil, the Enemy of SRE
- defined, Toil, the Enemy of SRE
- engineering work vs., Toil, the Enemy of SRE-Toil, the Enemy of SRE
- enterprise operations model–SRE transition, Toil Limits
- privacy engineering and, Reducing Toil-Frameworks
- self-service capabilities and, Self-Service Helps SREs in Multiple Ways
tooling, third-party integrations and, Tooling
Toyota Production System, Start by Leaning on Lean
training
- for persons with mental disorders, Training
- on-call, Training
transactions, as availability metric, Transactions
transgender inclusivity, Benefits
Treat, Robert, Replies
Treynor Sloss, Benjamin, Leverage Existing Enthusiasm for DevOps, SRE Patterns Loved by DevOps People Everywhere
triage, First, Do No Harm
trust, forgiveness as corollary to, The corollary to trust is forgiveness
2001: A Space Odyssey (movie), The Awakening of Applied AI

U

unknown-unknowns, software failure and, Underlying Assumptions Driving On-Call for Engineers, On-Call Is Emergency Medicine Instead of Ward Medicine
US Digital Service, Elegy for Complex Systems
user error, data loss and, User error

V

vacation time, Benefits
van Zijll, Robin, Replies
velocity of change, Provisioning, Change Management, and Velocity
vendor lock-in, Project Operating Expense and Abandonment Expense
verification
- coverage, Verification Coverage-Storage Watcher
- data durability engineering and, Verification-Watching the Watchers
- testing the verification system, Watching the Watchers
- zero-errors system, The Power of Zero
virtual repair debt, Virtual Repair Debt: Exorcising the Ghost in the Machine
virtuous cycle, The Virtuous Cycle to the Rescue: If You Don’t Measure It…-The Virtuous Cycle to the Rescue: If You Don’t Measure It…
visual analysis, Start by Leaning on Lean

W

wages (see compensation)
Watson, Coburn, Context Versus Control in SRE, Context Versus Control in SRE-Context Versus Control in SRE
Wheel of Misfortune (game), Active Learning Example: Wheel of Misfortune
Willis, John, SRE Patterns Loved by DevOps, SRE Patterns Loved by DevOps People Everywhere-Conclusion
window of vulnerability, Window of Vulnerability
Woods' Theorem, Mental Models
work-life balance, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- (see also psychological safety)

Y

you build it, you run it
- at Soundcloud, You Build It, You Run It-Introducing Production Engineering
- deployment platform, The Deployment Platform
- Production Engineering team and, Introducing Production Engineering-Introducing Production Engineering
- SysOps and code deploys, Closing the Loop: Take Your Own Pager
Yust, Amber, The Intersection of Reliability and Privacy, The Intersection of Reliability and Privacy-Conclusion

Z

zero-errors systems, The Power of Zero
Zhang, Yichun agentzh, Scriptable Load Balancers: The New Kid on the Block
Zwieback, Dave, Approaching Operations as an Engineering Problem
