Index
A
- abandonment expense (AbEx), Project Operating Expense and Abandonment Expense
- access control, Early Intervention and Education Through Evangelism
- accommodations, for on-call personnel, Accommodations
- active learning, Active Teaching and Learning-A Call to Action: Ditch the Boring Slides
- activism (see social activism)
- address resolution protocol (ARP) tables, Technical learnings
- adopt-to-buy abandonment scenario, Project Operating Expense and Abandonment Expense
- advocate phase of SRE execution, Phase 3: Advocates/Partners
- Affordable Care Act, Elegy for Complex Systems
- Agilent Technologies, Introducing SRE in Large Enterprises-Closing Thoughts
- alarming, Observability and Alarming
- alerts, On-Call and Alerting
- Allspaw, John, SRE Cognitive Work, Introduction-Conclusion
- Almeida, Daniel Prata, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- AlphaGo, From Chess to Go: How Deep Can We Dive?
- Amaro, Ricardo, Introduction to Machine Learning for SRE, Why Use Machine Learning for SRE?-Success Stories
- Amazon Glacier, Offline storage
- Amazon Web Services (AWS), Self-Service Is More Than a Button
- Andersen, Kurt, SRE as a Success Culture, SRE as a Success Culture-Focus on the Details of Success
- Anderson, Brian, Origin Story
- antipatterns, SRE Antipatterns-So, That’s It, Then?
- APIs
- application errors, Application errors
- Application Operations (AppOps) teams, Production Engineering at Facebook-Production Engineering at Facebook
- assurance windows, Why Set Goals?
- automation
  - and operator fatigue, Operator Fatigue
  - and reliability, Reliability
  - as team player in SRE work, Focus on Making Automation a Team Player in SRE Work
  - data durability engineering, Automation-Reliability
  - privacy engineering and, Automation
  - SRE teams and, Provisioning, Change Management, and Velocity
  - testing at Google, Pattern 1: Birth of Automated Testing at Google-Pattern 1: Birth of Automated Testing at Google
  - third-party integrations, Automation
  - window of vulnerability and, Window of Vulnerability
- availability
- Avenet, Julien, Replies
B
- backhauling a request, Routing requests in the application
- backlogs, Unify Backlogs and Protect Capacity
- backpressure, Special Knowledge About Complex Systems
- backpropagation, A neural network from scratch
- backups
- Bainbridge, Lisanne
- base image, Building the Base Image
- Bayesian inference, What Is Machine Learning?
- Beamish, Alex, Replies
- Beck, Kent, Economic Pillars of Complexity
- benefits, for job applicants with mental disorders, Benefits-Benefits
- Beyer, Betsy, The Intersection of Reliability and Privacy, The Intersection of Reliability and Privacy-Conclusion
- biases, in job interviewing, Biases
- Bisset, Blake, SRE Antipatterns, SRE Antipatterns-So, That’s It, Then?
- black boxes
- blame
- blameless retrospectives
- Bland, Mike, Pattern 1: Birth of Automated Testing at Google, Pattern 1: Birth of Automated Testing at Google
- Blank-Edelman, David
- Blew, Aaron, Replies
- blind resume review, Biases
- blue/green deployment
- Boggs, Grace Lee, The Long Tail: Turning Action into Change
- Bootcamp, at Facebook, Production Engineering at Facebook-Production Engineering at Facebook
- bot attacks, Case Study: WAF/Bot Mitigation
- bottlenecks, Establishing SLAs for Internal Components
- breakpoints, Establishing SLAs for Internal Components
- Brummel, Janna, Replies
- build/buy/adopt decision, Build, Buy, or Adopt?-Project Operating Expense and Abandonment Expense
- Burns, Robert, Before, During, After
- business case, for SRE
- buy-to-adopt abandonment scenario, Project Operating Expense and Abandonment Expense
- buy-to-buy abandonment scenario, Project Operating Expense and Abandonment Expense
C
- calibration problem, The Calibration Problem-Incidents Are Opportunities for Collective Recalibration
- Campbell, Laine, Database Reliability Engineering, Database Reliability Engineering-Making the Case for DBRE
- CAMS (Culture, Automation, Measurement and Sharing), Replies
- Canahuati, Pedro, Production Engineering at Facebook, Production Engineering at Facebook-Production Engineering at Facebook
- canary rollout method, Tracking Availability Level
- canarying, On-Call Is Emergency Medicine Instead of Ward Medicine
- capacity planning, Capacity Planning and Demand Forecasting
- capacity, team, Unify Backlogs and Protect Capacity
- capital expenditure (CapEx), Make a Decision, Get Rid of as Many Handoffs as Possible, Capacity Planning and Demand Forecasting
- cascading failure, Decoherence and Cascading Failure
- catalytic role SREs, Phase 4: Catalytic
- certificate authorities (CAs), Establish Importance
- Chakrabarti, Saunak Jai, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- Challenger space shuttle disaster, Always in a State of Partial Failure
- champions, for SRE, Start having conversations with leaders and champions in the organization
- change management
- chaos engineering, In the Beginning, There Was Chaos-Conclusion
- Chaos Kong, Chaos Goes Big
- Charlottesville, Virginia, demonstrations/counterdemonstrations (2017) (see social activism)
- Check, Martin, Using Incident Metrics to Improve SRE at Scale, Using Incident Metrics to Improve SRE at Scale-Learnings: TL;DR
- choke points, Tooling, Antipattern 10: Design Chokepoints
- Churchill, Winston, Frequently Asked Questions
- CIA, training games used by, Active Learning
- Clarke, Arthur C., The Awakening of Applied AI
- classification schemes, Classification schemes
- cloud services, Chaos Goes Big
- code review, documentation as part of, Require Docs as Part of Code Review
- cognitive hacks, Cognitive hacks
- cognitive overload, Cognitive overload
- cognitive work, Introduction-Conclusion
  - activities of SREs during incidents, What Do SRE People Do?
  - calibration problem, The Calibration Problem-Incidents Are Opportunities for Collective Recalibration
  - critical decisions made under uncertainty and time pressure, Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
  - human performance in modern complex systems, Human Performance in Modern Complex Systems: The Main Themes
  - importance of understanding practitioner cognition, Why Should We Care About Practitioner Cognition?-Human Performance in Modern Complex Systems: The Main Themes
  - intersections between operations and social activism, Intersections Between Operations and Social Activism-Conclusion
  - managing coordination costs, Managing the Costs of Coordination
  - observations on SRE cognitive work around incidents, Observations on SRE Cognitive Work Around Incidents-SREs Are Cognitive Agents Working in a Joint Cognitive System
  - opportunities to mitigate worst aspects of incident, Every Incident Could Have Been Worse
  - repairs to functional systems, Repairs to Functional Systems
  - sacrifice decisions and uncertainty, Sacrifice Decisions Take Place Under Uncertainty
  - special knowledge about complex systems, Special Knowledge About Complex Systems
  - SREs as cognitive agents working in a joint cognitive system, SREs Are Cognitive Agents Working in a Joint Cognitive System
- Columbia space shuttle disaster, Always in a State of Partial Failure
- communication
- compensation
- complex systems, Elegy for Complex Systems-To Get Involved
  - as always in state of partial failure, Always in a State of Partial Failure
  - community for, To Get Involved
  - decoherence and cascading failure, Decoherence and Cascading Failure
  - defining characteristics, The Problem with Systems
  - Economic Pillars of Complexity, Economic Pillars of Complexity
  - human performance in, Human Performance in Modern Complex Systems: The Main Themes
  - incidents as inevitable in, Incidents Will Continue
  - inherent problems with, The Problem with Systems-The Problem with Systems
  - inseparability of computer and human systems, The Computer and Human Systems Cannot Be Separated
  - novelty priority inversion, Novelty Priority Inversion
  - overhead of coordination, Nobody Anticipates the Overhead of Coordination
  - special knowledge about, Special Knowledge About Complex Systems
  - tooling for management of, Approaching Operations as an Engineering Problem
- configuration management
- containers, The Deployment Platform
- Content Delivery Network (CDN), Testing and staging, Uses for RUM
- context propagation, Thin Libraries and Context Propagation
- context, control vs., Context Versus Control in SRE-Context Versus Control in SRE
- continuous delivery/deployment
- continuous integration
- contract termination, Decommissioning
- control plane, Configuration Management (Control Plane Versus Data Plane)
- convolutional neural networks, How and When Should We Apply Neural Networks?
- Conway's law, Getting Buy-In, The Computer and Human Systems Cannot Be Separated
- Conway, Mel, The Computer and Human Systems Cannot Be Separated
- Cook, Richard, SRE Cognitive Work, Introduction-Conclusion
- coordination overhead
- costs of incidents, Incidents Will Impose Costs-Incidents Will Impose Costs
- Cowling, James, Engineering for Data Durability, Engineering for Data Durability-Conclusion
- crises
- crisis management, Managing Crisis: Responding When Things Break Down-The corollary to trust is forgiveness
- cross-domain failures, Introducing Production Engineering-Introducing Production Engineering
- cross-functional teams, Get Rid of as Many Handoffs as Possible-Get Rid of as Many Handoffs as Possible
- cross-team reliability, Resolving Cross-Team Reliability Issues by Using Postmortems
- cultural fit interviews, Cultural interviews
- culture (see success culture, SRE as)
D
- dashboards, Real-Time Dashboards: The Bread and Butter of SRE
- data durability, engineering for, Engineering for Data Durability-Conclusion
  - automation, Automation-Reliability
  - backups, Backups
  - estimating durability, Estimating durability-Estimating durability
  - freshness, Freshness
  - isolation, Isolation-Operational isolation
  - protection, Protection-Recovery
  - real-world strategies, Real-World Durability-Reliability
  - recovery, Recovery
  - replication, Replication Is Table Stakes-Estimating durability
  - replication techniques, Replication-Estimating durability
  - restoration, Restoration
  - safeguards, Safeguards
  - testing, Testing
  - verification, Verification-Watching the Watchers
  - window of vulnerability, Window of Vulnerability
  - zero-errors system, The Power of Zero
- data loss
- data plane, Configuration Management (Control Plane Versus Data Plane)
- database reliability engineering, Database Reliability Engineering-Making the Case for DBRE
  - anatomy of a recovery strategy, Anatomy of a Recovery Strategy-Championing Recovery Reliability
  - best practices and standards, Best practices and standards
  - collaboration, Collaboration
  - considerations for recovery, Considerations for Recovery
  - continuous delivery and, Continuous Delivery: From Development to Production-Tools
  - culture of, A Culture of Database Reliability Engineering
  - data protection, Protect the Data
  - deployment of CD, Deployment-Championing CD
  - documentation, Architecture
  - educating developers, Education and Collaboration-Tools
  - guiding principles, Guiding Principles of the Database Reliability Engineer-Databases Are Not Special
  - impact analysis, Impact Analysis
  - making the case for, Making the Case for DBRE
  - migration patterns, Migration Patterns-Rollback testing
  - migration testing, Migration testing
  - migrations and versioning, Migrations and Versioning
  - organization's data model, Data model
  - pet vs. cattle servers, Databases Are Not Special
  - recoverability, Recoverability-Championing Recovery Reliability
  - rollback testing, Rollback testing
  - self-service for scale, Self-Service for Scale
  - tools for, Tools
- de-provisioning, Provisioning, Change Management, and Velocity
- Debois, Patrick, SRE Patterns Loved by DevOps, SRE Patterns Loved by DevOps People Everywhere-Conclusion
- decision trees, Why Now? What Changed for Us?, Decision trees-Decision trees
- decision-making, uncertainty and, Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
- decoherence, Decoherence and Cascading Failure
- decommissioning, Decommissioning
- Deep Blue, From Chess to Go: How Deep Can We Dive?
- deep reinforcement learning, From Chess to Go: How Deep Can We Dive?
- DeepMind, From Chess to Go: How Deep Can We Dive?, Success Stories
- degradation, graduated, Graduated degradation
- demand forecasting, Capacity Planning and Demand Forecasting
- dependencies, external, Understanding External Dependencies-Understanding External Dependencies
- deployment process, immutable infrastructure and, Deploying Applications
- dev owners, The dev owner role
- development teams
- DevOps
- Dickerson, Mikey, Production Engineering at Facebook
- disaster planning, Disaster planning
- Disk Scrubber, Disk Scrubber
- Distributed Denial-of-Service (DDoS) attack, Case Study: WAF/Bot Mitigation
- DNS (Domain Name System)
- Docker images, Deploying Applications
- documentation
  - archiving/deleting unnecessary docs, Ruthlessly Prune Your Docs
  - best practices for, Doing Docs Better: Best Practices-Recognize and Reward Documentation
  - communicating value of, Communicating the Value of Documentation-Communicating the Value of Documentation
  - database reliability engineering and, Architecture
  - Envoy, Operational learnings
  - functional requirements for, Functional Requirements for SRE Documentation-Defining success metrics
  - Google and, The Google Experience: g3doc and EngPlay-Integrations are key to adoption
  - integrating into engineering workflow, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
  - integrations as key to adoption, Integrations are key to adoption
  - markup language for, Pick the simplest markup language that supports your needs
  - of policies, Policies
  - playbooks, Playbooks
  - postmortems, Postmortems
  - quality characteristics, Defining Quality: What Do Good Docs Look Like?-Defining success metrics
  - recognizing/rewarding, Recognize and Reward Documentation
  - requiring as part of code review, Require Docs as Part of Code Review
  - service overviews, Service overviews
  - setting realistic quality standards, Better > Best: Set Realistic Standards for Quality
  - SLAs, SLAs
  - source control and, Where possible, documentation should live in source control, alongside its associated code
  - templates for, Create Templates for Each Documentation Type
- Doherty, Mike, Replies
- downtime
- Dropbox, data durability engineering at, Engineering for Data Durability-Conclusion
- Drucker, Peter, Achieving Business Success Through Promises (Service Levels)
- durability (see data durability, engineering for)
- dynamic configuration API, Configuration Management (Control Plane Versus Data Plane)
E
- ecommerce, scriptable load balancers and, Case Study: Checkout Queue-Case Study: Checkout Queue
- Economic Pillars of Complexity (EPC), Economic Pillars of Complexity
- Edge Side Includes (ESI), Project Operating Expense and Abandonment Expense
- Edwards, Damon, Clearing the Way for SRE in the Enterprise, Clearing the Way for SRE in the Enterprise-Join the Movement
- Ek, Daniel, Driving the Paradigm Shift
- Eklund, Jeff, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- Elastic Compute Cloud (EC2), Self-Service Is More Than a Button
- embedded SREs, The Embedded SRE, Getting Buy-In
- emergency response, Incident Management and Emergency Response
- emotion, engineering problems and, Orienting to a Data-Driven Approach
- EngPlay, The Google Experience: g3doc and EngPlay-Integrations are key to adoption
- Ensor, Phil S., Silos Get in the Way
- enterprise operations model–SRE transition, Clearing the Way for SRE in the Enterprise-Join the Movement
  - DevOps and, Leverage Existing Enthusiasm for DevOps
  - error budgets, Error Budgets-Error Budgets
  - Lean manufacturing concepts applied to, Start by Leaning on Lean-Start by Leaning on Lean
  - minimizing handoffs, Get Rid of as Many Handoffs as Possible-Get Rid of as Many Handoffs as Possible
  - psychological safety/human factors, Psychological Safety and Human Factors
  - replacing handoffs with self-service, Replace Remaining Handoffs with Self-Service-Operations as a Service
  - silos, queues, and tickets, Silos, Queues, and Tickets-Ticket-Driven Request Queues Are Expensive
  - steps in clearing obstacles to, Take Action Now-Psychological Safety and Human Factors
  - toil as enemy of SRE, Toil, the Enemy of SRE-Toil, the Enemy of SRE
  - toil in the enterprise, Toil in the Enterprise
  - toil limits in, Toil Limits
  - unifying backlogs and protecting capacity, Unify Backlogs and Protect Capacity
- Envoy
- Equal Employment Opportunity Commission, U.S. (EEOC), Application
- error budgets
- eventually consistent service discovery, Eventually Consistent Service Discovery
- Ewald, Michael, Replies
- exclusion backlash, Exclusion backlash
- exit interviews, Leaving
- expectations
F
- Facebook, production engineering at, Production Engineering at Facebook-Production Engineering at Facebook
- failure
- failure domains, isolating, Isolated failure domains
- failure recovery, Failure Recovery
- Farley, Thomas, Sacrifice Decisions Take Place Under Uncertainty
- Farmer, Andrew, Replies
- Fast Properties, Context Versus Control in SRE
- fatigue, of operator, Operator Fatigue
- FBAR (Facebook Auto Remediation), Production Engineering at Facebook, Production Engineering at Facebook
- Fernandez, Manuel, Replies
- Fields, James Alex, Preparing for the worst: handling terror at Unite the Right
- firefighting phase of SRE execution, Phase 1: Firefighting/Reactive
- first-class citizens, third parties as, Third Parties as First-Class Citizens-Closing Thoughts
- flexible scheduling, Working Conditions, Flexible schedules
- Fong, Andrew, Interviewing Site Reliability Engineers, Interviewing Site Reliability Engineers-Final Thoughts on Interviewing SREs
- Fong-Jones, Liz, Intersections Between Operations and Social Activism, Intersections Between Operations and Social Activism-Conclusion
- Ford, Henry, Introducing SRE in Large Enterprises, Why Should We Care About Practitioner Cognition?
- forgiveness, trust and, The corollary to trust is forgiveness
- Forster, E. M., Active Teaching and Learning
- frameworks, privacy engineering and, Frameworks
- frequentist estimators, What Is Machine Learning?
- Fukushima Daiichi nuclear disaster, Every Incident Could Have Been Worse
- functional quality, of documentation, Defining Quality: What Do Good Docs Look Like?
- funnel (hiring process), The Funnel
G
- g3doc, The Google Experience: g3doc and EngPlay-Integrations are key to adoption
- games, for active learning, Active Learning-Active Learning Example: Incident Manager (a Card Game)
- gamma distribution, On Evaluating SLOs
- Garza, Alicia, Beyond culpability: building capacity instead of assigning blame
- gatekeeper phase of SRE execution, Phase 2: Gatekeepers
- Gillies, Aaron, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
- Go (game), From Chess to Go: How Deep Can We Dive?
- goalies, Introducing the goalie role
- goals, for SRE teams, Setting goals and defining metrics of success
- Going to the Gemba, Start by Leaning on Lean
- Golden Path, Prelude
- Goldfuss, Alice, We don’t need another hero
- Gollapalli, Sriram, Introducing SRE in Large Enterprises, Introducing SRE in Large Enterprises-Closing Thoughts
- Google
- Google model of cross-functional teams, Get Rid of as Many Handoffs as Possible
- Google Web Server (GWS), Pattern 1: Birth of Automated Testing at Google-Pattern 1: Birth of Automated Testing at Google
- Gorcenski, Emily, Intersections Between Operations and Social Activism, Intersections Between Operations and Social Activism-Conclusion
- gradient descent, A neural network from scratch
- graduated degradation, Graduated degradation
- Gustavsson, Niklas, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- Gwartz, Jason, Replies
H
- hack-a-months, Production Engineering at Facebook
- HAL 9000, Why Use Machine Learning for SRE?, The Awakening of Applied AI
- Hale, Jefferson, Replies
- Hand, Jason, Replies
- handoff readiness review (HRR), Pattern 2: Launch and Handoff Readiness Review at Google-Pattern 2: Launch and Handoff Readiness Review at Google
- handoffs
- HAProxy, Scriptable Load Balancers: The New Kid on the Block
- hardware errors, Operating system and hardware errors
- healthcare.gov, Elegy for Complex Systems
- Heckman, Tim, Replies
- heroics
- Heyer, Heather, Preparing for the worst: handling terror at Unite the Right, Charlottesville in review: assigning and avoiding blame
- hiring (see job application/hiring process)
- histograms
- Hopper, Grace, Approaching Operations as an Engineering Problem
- Horowitz, Jonah, Immutable Infrastructure and SRE, Immutable Infrastructure and SRE-Disadvantages
- human error, Antipattern 4: Root Cause = Human Error-Antipattern 4: Root Cause = Human Error
- human resources, The Primary Indicator of a Successful Team-Operations teams are bad at estimating their level of psychological safety
- human-factors research, Beyond Burnout
- Humble, Jez, SRE Patterns Loved by DevOps, SRE Patterns Loved by DevOps People Everywhere-Conclusion
I
- immutable infrastructure, Immutable Infrastructure and SRE-Disadvantages
  - base image construction, Building the Base Image
  - continuous integration/continuous deployment with confidence, Continuous Integration/Continuous Deployment with Confidence
  - defined, Scalability, Reliability, and Performance
  - deploying applications, Deploying Applications
  - disadvantages of, Disadvantages
  - failure recovery, Failure Recovery
  - faster startup times, Faster Startup Times
  - known state, Known State
  - multiregion operations, Multiregion Operations
  - release engineering, Release Engineering
  - scalability, reliability, and performance, Scalability, Reliability, and Performance
  - security, Security
  - simplicity, Simpler Operations
- impact analysis, Impact Analysis
- impact monitoring, Monitoring
- incentive structure
- incident analysis, What Can You Do?
- incident command, Principles 1 and 2 (interfaces and incident command)
- Incident Manager (card game), Active Learning Example: Incident Manager (a Card Game)-Active Learning Example: Incident Manager (a Card Game)
- incident metrics, Using Incident Metrics to Improve SRE at Scale-Learnings: TL;DR
  - improving SRE at scale with, Using Incident Metrics to Improve SRE at Scale-Learnings: TL;DR
  - real-time dashboards, Real-Time Dashboards: The Bread and Butter of SRE
  - repair debt, Repair Debt
  - reviewing, Metrics Review: If a Metric Falls in the Forest…
  - surrogate metrics, Surrogate Metrics
  - time to detect, The Virtuous Cycle to the Rescue: If You Don’t Measure It…
  - time to engage, The Virtuous Cycle to the Rescue: If You Don’t Measure It…
  - time to fix, The Virtuous Cycle to the Rescue: If You Don’t Measure It…
  - virtual repair debt, Virtual Repair Debt: Exorcising the Ghost in the Machine
  - virtuous cycle, The Virtuous Cycle to the Rescue: If You Don’t Measure It…-The Virtuous Cycle to the Rescue: If You Don’t Measure It…
- incident response/management
  - activities of SREs during incidents, What Do SRE People Do?
  - calibration problem, The Calibration Problem-Incidents Are Opportunities for Collective Recalibration
  - classification schemes, Classification schemes
  - cognitive work and, Introduction-Conclusion
  - formal role assignments, Formal role assignments
  - managing coordination costs, Managing the Costs of Coordination
  - observations on SRE cognitive work around incidents, Observations on SRE Cognitive Work Around Incidents-SREs Are Cognitive Agents Working in a Joint Cognitive System
  - opportunities to mitigate worst aspects of incident, Every Incident Could Have Been Worse
  - repairs to functional systems, Repairs to Functional Systems
  - sacrifice decisions and uncertainty, Sacrifice Decisions Take Place Under Uncertainty
  - special knowledge about complex systems, Special Knowledge About Complex Systems
  - SRE teams and, Incident Management and Emergency Response
  - SREs as cognitive agents working in a joint cognitive system, SREs Are Cognitive Agents Working in a Joint Cognitive System
- incidents
  - addressing the calibration problem, Address the Calibration Problem
  - and collective recalibration, Incidents Are Opportunities for Collective Recalibration
  - and individual recalibration, Incidents Trigger Individual Recalibration
  - and specific calibration problems, Incidents Are Opportunities for Collective Recalibration
  - automation as team player in SRE work, Focus on Making Automation a Team Player in SRE Work
  - building a corpus of cases, Build a Corpus of Cases
  - changing patterns of, Incident Patterns Will Change
  - costs imposed by, Incidents Will Impose Costs-Incidents Will Impose Costs
  - harvesting value of, What Should Happen Next?-Address the Calibration Problem
  - inevitability in complex systems, Incidents Will Continue
- Index Scanner, Index Scanner
- infrastructure services, Infrastructure services
- infrastructure, immutable (see immutable infrastructure)
- insourcing, growing teams via, Growing the team: insource or outsource?
- integration monitoring, Monitoring
- integrations, documentation and, Integrations are key to adoption
- intelligent agents, What Is Machine Learning?
- interrupt work, project work vs., We love interrupts and the torrents of information
- interviewing (job interviews), Interviewing Site Reliability Engineers-Final Thoughts on Interviewing SREs
  - advice for hiring managers, Advice for Hiring Managers
  - and persons with mental disorders, Interviewing
  - basics, Interviewing 101-The Funnel
  - biases in, Biases
  - funnel basics, The Funnel
  - funnels, SRE Funnels-Final Thoughts on Interviewing SREs
  - industry vs. university candidate profiles, Industry Versus University
  - onsite interview, The Onsite Interview
  - phone screens, Phone Screens
  - selling candidates on your organization, Selling candidates
  - take-home questions, Take-Home Questions
  - walking away from a candidate, Walking away
- IPython, Installing Python, IPython, and Jupyter Notebook
- isolation
J
- Jansson, Mattias, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- job application/hiring process
- job duties, for persons with mental disorders, Job Duties
- job function, on-call as, Application
- job postings, Application
- Joblint, Application
- Johnston, Bennie, Replies
- joint cognitive system (JCS), SREs Are Cognitive Agents Working in a Joint Cognitive System, Focus on Making Automation a Team Player in SRE Work
- Jones, Matt, Replies
- Jupyter Notebook, Installing Python, IPython, and Jupyter Notebook
K
- Kaizen (continuous improvement), Start by Leaning on Lean-Start by Leaning on Lean
- Kanban, Unify Backlogs and Protect Capacity
- Kanwar, Pranay, Replies
- Kasparov, Garry, From Chess to Go: How Deep Can We Dive?
- Kata method, Start by Leaning on Lean-Start by Leaning on Lean
- Kelvin, Lord (William Thomson), From SysAdmin to SRE in 8,963 Words, Where Did SRE Come From?
- Key Performance Indicators (KPIs), Where Did SRE Come From?, Monitoring, Metrics, and KPIs
- Kim, Gene, SRE Patterns Loved by DevOps, SRE Patterns Loved by DevOps People Everywhere-Conclusion
- Kissner, Lea, The General Landscape of Privacy Engineering
- Klein, Matt, SRE throughout the development cycle
- Knight Capital, Sacrifice Decisions Take Place Under Uncertainty
- known-knowns, software failure and, Underlying Assumptions Driving On-Call for Engineers
- known-unknowns, software failure and, Underlying Assumptions Driving On-Call for Engineers
- Kobayashi Maru, Active Learning Example: Wheel of Misfortune
- Koen, Brian, Approaching Operations as an Engineering Problem
- Kriegsspiel, Active Learning
- Kubrick, Stanley, The Awakening of Applied AI
L
- Lafeldt, Mathias, Approaching Operations as an Engineering Problem
- Lamott, Anne, Better > Best: Set Realistic Standards for Quality
- large enterprises, Introducing SRE in Large Enterprises-Closing Thoughts
  - defining current state, Defining Current State-To establish a roadmap for what products SRE will be responsible for, survey the current infrastructure landscape
  - defining SRE for, Defining SRE-Defining SRE
  - DevOps–SRE relationship, Replies
  - identifying/educating stakeholders, Identifying and Educating Stakeholders
  - implementing the SRE team, Implementing the SRE Team-Defining the role of supporting divisions
  - introducing SRE into, Introducing SRE-Closing Thoughts
  - lessons learned from process of introducing SRE, Lessons Learned
  - preparing business case for SRE, Prepare the business case: personalize and evaluate the cost of having engineering resources responsible for reliability
  - presenting business case for SRE, Presenting the Business Case
  - sample implementation roadmap, Sample Implementation Roadmap
- launch readiness review (LRR), Pattern 2: Launch and Handoff Readiness Review at Google-Pattern 2: Launch and Handoff Readiness Review at Google
- leaders, as advocates for SRE, Start having conversations with leaders and champions in the organization
- Lean manufacturing movement, Start by Leaning on Lean-Start by Leaning on Lean
- learning, active (see active learning)
- Lee, Francis, Beyond culpability: building capacity instead of assigning blame
- Legeza, Vladimir, From SysAdmin to SRE in 8,963 Words, From SysAdmin to SRE in 8,963 Words-Conclusion
- LGBTQ+ inclusivity, Mental Disorders Are Missing from the Diversity Conversation, Benefits
- Lightweight Directory Access Protocol (LDAP), Understanding External Dependencies
- Limoncelli, Thomas A.
- LinkedIn, Testing and staging, Uses for RUM
- load balancers (see scriptable load balancers)
- logging, Logging
- logical backups, Full and incremental logical backups
- logical isolation, Logical isolation
- long short-term memory (LSTM) networks, Why Now? What Changed for Us?
- Looney, John, Psychological Safety in SRE, The Primary Indicator of a Successful Team-Operations teams are bad at estimating their level of psychological safety
- Lund, Tanner, Replies
- Lutner, Sean, Replies
- Lyft
M
- machine learning, Why Use Machine Learning for SRE?-Success Stories
  - AI background, The Awakening of Applied AI
  - basics, What Is Machine Learning?-Why Now? What Changed for Us?
  - current SRE environment and, Why Now? What Changed for Us?
  - decision trees, Decision trees-Decision trees
  - defined, What Is Machine Learning?
  - enterprise IT areas affected by, Success Stories
  - human–machine games, From Chess to Go: How Deep Can We Dive?-From Chess to Go: How Deep Can We Dive?
  - modern definition of learning in terms of machine, What Do We Mean by Learning?
  - neural networks, What Are Neural Networks?-Popular Libraries for Neural Networks
  - on-call substitute, Counterarguments
  - practical examples, Practical Machine Learning Examples-Time series: server requests waiting
  - Python/IPython/Jupyter Notebook installation, Installing Python, IPython, and Jupyter Notebook
  - reasons for company to use, Why Use Machine Learning for SRE?-Success Stories
  - reasons to use, Why Use Machine Learning for SRE?
  - Spotify and, The Future: Speed at Scale, Safely
  - SRE problems addressed by, Some SRE Problems Machine Learning Can Help Solve
  - TensorFlow and TensorBoard, Using TensorFlow and TensorBoard-Using TensorFlow and TensorBoard
  - time series: server requests waiting, Time series: server requests waiting-Time series: server requests waiting
  - training a neural network from scratch, A neural network from scratch-A neural network from scratch
- MacNamara, Ríona, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
- Mangot, Dave, Replies
- Markdown, The Google Experience: g3doc and EngPlay, Pick the simplest markup language that supports your needs
- market-oriented teams, Get Rid of as Many Handoffs as Possible
- Markov model, Estimating durability
- markup language, Pick the simplest markup language that supports your needs
- Maslow's Hierarchy of Needs, Production Engineering at Facebook
- Master Service Agreements (MSAs), Negotiating SLAs with vendors
- McDuffee, Keith, Replies
- McEniry, Chris, Replies
- Mean Time Between Failures (MTBF), Always in a State of Partial Failure
- Mean Time to Detect (MTTD), Clearing the Way for SRE in the Enterprise
- Mean Time to Failure (MTTF)
- Mean Time to Recovery (MTTR), Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)-Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)
- Mean Time to Repair (MTTR), Always in a State of Partial Failure
- Mediratta, Bharat, Pattern 1: Birth of Automated Testing at Google
- Meickle, James, Beyond Burnout, Beyond Burnout-Mental Disorder Resources
- Menchaca, Joaquin, Replies
- mental disorders, persons with
- and diversity conversation, Mental Disorders Are Missing from the Diversity Conversation
- benefits for, Benefits-Benefits
- business environment, Sanity Isn’t a Business Requirement
- compensation in job application process, Compensation
- crisis and, Leaving
- defined, Defining Mental Disorders
- importance of detailed job postings, Application
- inclusivity as beneficial to all, Inclusivity for Anyone Helps Everyone
- ineffectiveness of common workplace strategies towards, Thoughts and Prayers Aren’t Scalable
- interviewing for job, Interviewing
- job duties, Job Duties
- leaving a job, Leaving
- on-call work and, Application
- onboarding packets, Onboarding
- pro-inclusion patterns/antipatterns, Full-Stack Inclusivity-Mental Disorder Resources
- promotion, Promotion
- resources, Mental Disorder Resources
- training, Training
- working conditions, Working Conditions-Working Conditions
- workplace inclusivity and, Beyond Burnout-Mental Disorder Resources
- mental health, Beyond Burnout-Mental Disorder Resources
- mental models, Mental Models
- mentorship programs, Training
- Mercereau, Jonathan, Working with Third Parties Shouldn’t Suck, Working with Third Parties Shouldn’t Suck-Closing Thoughts
- Messeri, Eran, Pattern 1: Birth of Automated Testing at Google, Pattern 3: Create a Shared Source Code Repository
- metrics, Orienting to a Data-Driven Approach, Setting goals and defining metrics of success, Monitoring, Metrics, and KPIs
- (see also incident metrics)
- Miasnikoŭ, Stas, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
- Michel, Drew, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- microservices
- mid-sized organizations (see Soundcloud, SRE at)
- migration cost, Project Operating Expense and Abandonment Expense
- (see also abandonment expense)
- migrations, database reliability engineering and, Migration Patterns-Rollback testing
- Mineiro, Luis, Replies
- Mitchell, Tom, What Do We Mean by Learning?
- money (see compensation)
- monitoring
- monolithic architecture
- Moraes, Gleicon, Replies
- Morrison, David, Make respect part of your team’s culture
- Most Favored Customer (MFC) clauses, Decommissioning
- multiregion operations, immutable infrastructure and, Multiregion Operations
- Murphy, Niall Richard
- Mushero, Steve, Replies
N
- Netflix
- Netflix model (cross-functional teams), Get Rid of as Many Handoffs as Possible
- Network Operations Center (NOC), Antipattern 1: Site Reliability Operations
- neural networks, What Are Neural Networks?-Popular Libraries for Neural Networks
- New York Stock Exchange outage (2015), Sacrifice Decisions Take Place Under Uncertainty
- nginScript, Scriptable Load Balancers: The New Kid on the Block
- Nolan, Laura, Active Teaching and Learning, Active Teaching and Learning-A Call to Action: Ditch the Boring Slides
- normal distribution, On Evaluating SLOs
- nosocomial automation, Focus on Making Automation a Team Player in SRE Work
- Nukala, Shylaja, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
O
- O'Reilly, Tim
- Obama, Barack, Elegy for Complex Systems
- object stores, Object stores
- object-relational mapping (ORM), Cognitive overload
- observability, service mesh and, Current State of Microservice Networking, Observability and Alarming
- on-call
- accommodations for on-call personnel, Accommodations
- alternatives to current approach, Actual Solutions-Cognitive hacks
- arguments for keeping current system, Counterarguments
- at Spotify, On-Call and Alerting
- cognitive hacks, Cognitive hacks
- cost to humans, The Cost to Humans of Doing On-Call-We don’t need another hero
- emergency medicine/SRE differences, Differences with SRE
- emergency medicine/SRE parallels, Parallels with SRE
- emergency medicine/ward medicine distinction, On-Call Is Emergency Medicine Instead of Ward Medicine-On-Call Is Emergency Medicine Instead of Ward Medicine
- exclusion backlash and, Exclusion backlash
- flexible schedules and, Flexible schedules
- improving on-the-job performance, Improving On-the-Job Performance
- industry-wide compensation model, Compensation
- need for fundamental change in approach to, We Need a Fundamental Change in Approach-A Union of the Two
- negative consequences of heroism, We don’t need another hero
- opt-out policy, Exclusion backlash
- pager fatigue, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- pair on-call, Cognitive hacks
- persons with mental disorders and, Application
- prioritization, Prioritization-Exclusion backlash
- production engineering team at Facebook, Production Engineering at Facebook
- psychological safety and, On-call and operations
- rationale for, The Rationale for On-Call-Counterarguments
- reasons to end, Against On-Call: A Polemic-Conclusion
- recovery time immediately after on-call shift, Recovery
- Strong-Anti-On-Call position, Strong-Anti-On-Call
- Strong/Weak-Anti-On-Call position, A Union of the Two
- training, Training
- triage function, First, Do No Harm
- underlying assumptions, Underlying Assumptions Driving On-Call for Engineers-Underlying Assumptions Driving On-Call for Engineers
- Weak-Anti-On-Call position, Weak-Anti-On-Call
- onboarding packets, Onboarding
- onsite interview, The Onsite Interview
- OpenResty, Scriptable Load Balancers: The New Kid on the Block
- operating system errors, data loss and, Operating system and hardware errors
- operational debt, Repair Debt
- operational expenditures (OpEx), Make a Decision, Get Rid of as Many Handoffs as Possible, Capacity Planning and Demand Forecasting
- operational isolation, Operational isolation-Operational isolation
- operations
- Operations as a Service (OaaS), Operations as a Service
- operations organization (see enterprise operations model–SRE transition)
- operator fatigue, Operator Fatigue
- ops owners, The ops owner role
- Ops Teams
- Ops-in-Squads
- outsourcing, Growing the team: insource or outsource?
- overhead of coordination, Nobody Anticipates the Overhead of Coordination
P
- pager fatigue, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- pair on-call, Cognitive hacks
- Papert, Seymour, Active Learning
- Pareto distribution, On Evaluating SLOs
- partner phase of SRE execution, Phase 3: Advocates/Partners
- patterns
- Paul, Paula, Replies
- PayPal, Replies
- Pentagon building, Nobody Anticipates the Overhead of Coordination
- percentiles, histograms vs., Where Percentiles Fall Down (and Histograms Step Up)
- performance analysis, Performance Analysis and Optimization
- performance optimization, Performance Analysis and Optimization
- phone screening, Phone Screens
- Photon, Active Learning Example: SRE Classroom
- physical backups, Full physical backups
- physical isolation, Physical isolation
- playbooks, documentation of, Playbooks
- Poblador i Garcia, David, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- policies, documentation of, Policies
- political organizing (see social activism)
- postmortems
- Potvin, Rachel, Pattern 3: Create a Shared Source Code Repository
- privacy engineering, The Intersection of Reliability and Privacy-Conclusion
- automation and, Automation
- commonalities with SRE, Privacy and SRE: Common Approaches-Early Intervention and Education Through Evangelism
- default behavior for shared architectures, Default behavior for shared architectures
- differences from SRE, Nuances, Differences, and Trade-Offs
- early intervention and education through evangelism, Early Intervention and Education Through Evangelism-Early Intervention and Education Through Evangelism
- efficient and deliberate problem solving, Efficient and Deliberate Problem Solving
- frameworks and, Frameworks
- goals of, The General Landscape of Privacy Engineering-The General Landscape of Privacy Engineering
- intersection of reliability and privacy, The Intersection of Reliability and Privacy
- questions for evaluating products/services, The General Landscape of Privacy Engineering
- root causing, Find and address root causes
- team relationship management, Relationship Management
- toil reduction with, Reducing Toil-Frameworks
- Probability Density Function (PDF), On Evaluating SLOs
- procurement teams, Integration timeline?
- product-oriented teams, Get Rid of as Many Handoffs as Possible
- production engineering (PE), Production Engineering at Facebook-Production Engineering at Facebook
- production meetings, as active learning opportunity, Production Meetings
- production readiness, Integration timeline?
- project operating expense (PrOpEx), Project Operating Expense and Abandonment Expense
- project-based funding, Get Rid of as Many Handoffs as Possible
- Prometheus, Developers’ Productivity and Health Versus the Pager
- promotion of persons with mental disorders, Promotion
- Proof-of-Concept (PoC), Make a Decision
- provisioning, Provisioning, Change Management, and Velocity
- psychological safety
- avoiding ambiguous expectations, Imaginary expectations
- avoiding information overload, We love interrupts and the torrents of information
- building into your team, How to Build Psychological Safety into Your Own Team-Operations teams are bad at estimating their level of psychological safety
- clear communication/explicit expectations, Make your communication clear and your expectations explicit
- intersections between operations and social activism, Intersections Between Operations and Social Activism-Conclusion
- on-call rotations and pager fatigue, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- on-call work, On-call and operations, The Cost to Humans of Doing On-Call-We don’t need another hero
- operations teams vs. other engineering teams, Why are operations teams more likely to feel unsafe than other engineering teams?
- operations teams' difficulty in estimating level of, Operations teams are bad at estimating their level of psychological safety
- publicizing/celebrating team successes, Make it obvious when your team is doing well
- respect as part of team culture, Make respect part of your team’s culture
- space for people to take chances, Make space for people to take chances
- SRE cognitive work, Introduction-Conclusion
- SRE teams and, Orienting to a Data-Driven Approach
- SRE/social activism parallels, Principles 3 and 4 (blameless retrospectives and psychological safety)
- successful teams and, The Primary Indicator of a Successful Team-Operations teams are bad at estimating their level of psychological safety
- ways to make teams feel safe, Make your team feel safe
- Python, Installing Python, IPython, and Jupyter Notebook
R
- Rabenstein, Björn, How to Apply SRE Principles Without Dedicated SRE Teams, How to Apply SRE Principles Without Dedicated SRE Teams-Further Reading
- Rampke, Matthias, How to Apply SRE Principles Without Dedicated SRE Teams, How to Apply SRE Principles Without Dedicated SRE Teams-Further Reading
- Rasmussen, Jens, Navigating Complexity for Safety
- Rau, Vivek, Toil, the Enemy of SRE
- reactive phase of SRE execution, Phase 1: Firefighting/Reactive
- real-time dashboards, Real-Time Dashboards: The Bread and Butter of SRE
- Real-User Monitoring (RUM)
- recovery time (see Mean Time to Recovery (MTTR))
- recovery/recoverability of data
- championing recovery reliability, Championing Recovery Reliability
- considerations for, Considerations for Recovery
- database reliability engineering and, Recoverability-Championing Recovery Reliability
- detection of data loss/corruption, Building Block 1: Detection-Operating system and hardware errors
- diverse storage, Building Block 2: Diverse Storage
- full physical backups, Full physical backups
- incremental physical backups, Incremental physical backups
- logical backups, Full and incremental logical backups
- object stores, Object stores
- testing, Building Block 4: Testing
- varied toolbox for, Building Block 3: A Varied Toolbox
- recurrent neural networks, How and When Should We Apply Neural Networks?
- Redundant Array of Independent Disks (RAID), Window of Vulnerability
- redundant systems, Redundant systems
- Reinertsen, Donald G., Ticket-Driven Request Queues Are Expensive
- release engineering, Release Engineering
- remote work, Working Conditions
- Rendell, Mark, Replies
- repair debt, Repair Debt
- replication techniques
- reporting APIs, Polling API informs SLIs, Uses for RUM, Logging
- Republican National Convention protests (2008), Principles of Organizing
- request pausing, Case Study: Intermission-Case Study: Intermission
- request queues, Silos Get in the Way-Silos Get in the Way
- respect, team culture and, Make respect part of your team’s culture
- resumes, blind review of, Biases
- roles, formal assignment of, Formal role assignments
- rollbacks, Rollback testing
- rolling release, blue/green release vs., Disadvantages
- root access, Bringing Scalability and Reliability to the Forefront
- root causing
- Root, Lynn, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- Rosenthal, Casey, In the Beginning, There Was Chaos, In the Beginning, There Was Chaos-Conclusion
- Rother, Mike, Start by Leaning on Lean
- routing, shard-aware (see shard-aware routing)
- Russek, Johannes, SRE Without SRE, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- Ryan, Andrew, Production Engineering at Facebook
S
- sacrifice decisions, Sacrifice Decisions Take Place Under Uncertainty
- St. Paul Principles, Principles of Organizing, The corollary to trust is forgiveness
- Samuel, Arthur, From Chess to Go: How Deep Can We Dive?
- scaling, Using Incident Metrics to Improve SRE at Scale-Learnings: TL;DR
- scheduling, flexible, Working Conditions, Flexible schedules
- Schlossnagle, Theo, The Art and Science of the Service-Level Objective, The Art and Science of the Service-Level Objective-Parting Thought: Looking at SLOs Upside Down
- Schwartz, Mark, Antipattern 15: Ungainly Governance
- scriptable load balancers, Scriptable Load Balancers-Looking to the Future and Further Reading
- about, Scriptable Load Balancers: The New Kid on the Block-Why Scriptable Load Balancers?
- advantages of, Why Scriptable Load Balancers?
- checkout queue case study, Case Study: Checkout Queue-Case Study: Checkout Queue
- defined, Scriptable Load Balancers
- future issues, Looking to the Future and Further Reading
- harnessing the potential of, Harnessing Potential
- maintaining resiliency, Avoiding Disaster-Case Study: Checkout Queue
- problems solved by, Making the Difficult Easy-Case Study: Intermission
- request pausing, Case Study: Intermission-Case Study: Intermission
- routing requests with, Routing requests with a scriptable load balancer
- service-level middleware, Service-Level Middleware-Case Study: WAF/Bot Mitigation
- shard-aware routing, Shard-Aware Routing-Routing requests with a scriptable load balancer
- state and, Getting Clever with State
- security
- self-service
- separation of duties, Self-Service Helps SREs in Multiple Ways
- servers as cattle vs. servers as pets, Databases Are Not Special
- service discovery, Eventually Consistent Service Discovery
- service meshes, The Service Mesh: Wrangler of Your Microservices?-The Future of the Service Mesh
- configuration management, Configuration Management (Control Plane Versus Data Plane)
- context propagation, Thin Libraries and Context Propagation
- control plane vs. data plane, Configuration Management (Control Plane Versus Data Plane)
- current state of microservice networking, Current State of Microservice Networking
- eventually consistent service discovery system, Eventually Consistent Service Discovery
- in practice, The Service Mesh in Practice-Technical learnings
- monolithic architecture vs., Ready to Get Rid of the Monolith?
- observability and alarming, Observability and Alarming
- operation of Envoy at Lyft, Operating Envoy at Lyft-Technical learnings
- origin and development of Envoy at Lyft, The Origin and Development of Envoy at Lyft
- sidecar performance implications, Sidecar Performance Implications
- sidecar proxy, Service Mesh to the Rescue-The Benefits of a Sidecar Proxy
- thin libraries, Thin Libraries and Context Propagation
- service overviews, documentation of, Service overviews
- Service Pyramid, at Facebook, Production Engineering at Facebook
- Service-Level Agreements (SLAs)
- business priorities and changes in, Dealing with Corner Cases
- corner cases and, Dealing with Corner Cases-Dealing with Corner Cases
- documentation of, SLAs
- error budgets and, Error Budgets
- negotiating with vendors, Negotiating SLAs with vendors
- nontechnical solutions in SysAdmin–SRE transition, Nontechnical Solutions
- progression in service-level execution, Progression in Service-Level Execution
- SLOs and, Service-Level Objective, Why Set Goals?
- success culture and, Achieving Business Success Through Promises (Service Levels)
- SysAdmin–SRE transition and, SLA, Establishing SLAs for Internal Components-Establishing SLAs for Internal Components
- tracking availability level for, Tracking Availability Level-Tracking Availability Level
- Service-Level Indicators (SLIs)
- service-level middleware
- Service-Level Objectives (SLOs), The Art and Science of the Service-Level Objective-Parting Thought: Looking at SLOs Upside Down
- aligning performance standards with customers' needs, Parting Thought: Looking at SLOs Upside Down
- availability, Availability-Transactions over Time Quanta
- data recovery strategies and, Considerations for Recovery
- error budgets and, Error Budgets
- evaluating, On Evaluating SLOs
- goal setting and, Why Set Goals?
- histograms and, Histograms-Where Percentiles Fall Down (and Histograms Step Up)
- percentiles vs. histograms, Where Percentiles Fall Down (and Histograms Step Up)
- SysAdmin–SRE transition and, Service-Level Objective
- third-party services and, SLOs
- service-oriented teams, Get Rid of as Many Handoffs as Possible
- Shannon, Adam, Replies
- shard-aware routing
- Sharpe, Jeremy, Do Docs Better, Do Docs Better: Integrating Documentation into the Engineering Workflow-Communicating the Value of Documentation
- Shopify, Case Study: Checkout Queue-Case Study: Checkout Queue
- shopping websites, Case Study: Checkout Queue-Case Study: Checkout Queue
- Short, Chris, Replies
- Shoup, Randy, Pattern 3: Create a Shared Source Code Repository
- sidecar proxy, Service Mesh to the Rescue-The Benefits of a Sidecar Proxy, Sidecar Performance Implications
- Siegrist, John, Replies
- Sigmoid function, A neural network from scratch-A neural network from scratch
- silos
- Single Points of Failure (SPOFs), Beginning Chaos, Isolated failure domains
- Single-Page Application (SPA) frameworks, Direct impact
- Sinjakli, Chris, Replies
- site up, as SRE goal, Keeping the Site Up-Graduated degradation
- SLA inversion, Avoiding Disaster
- Slicer, Routing requests with a scriptable load balancer
- snapshots, Offline storage
- social activism
- assigning/avoiding blame in reviews of, Charlottesville in review: assigning and avoiding blame
- building capacity instead of assigning blame, Beyond culpability: building capacity instead of assigning blame
- crisis management, Managing Crisis: Responding When Things Break Down-The corollary to trust is forgiveness
- forgiveness as corollary to trust, The corollary to trust is forgiveness
- intersections between operations and, Intersections Between Operations and Social Activism-Conclusion
- planning stage, Creating the Perfect Plan
- postmortems, Writing Our Own History: Making Sense of What Went Down-Beyond culpability: building capacity instead of assigning blame
- principles of organizing, Principles of Organizing
- software engineering as analogous to, Before, During, After
- turning action into change, The Long Tail: Turning Action into Change-Activism and Change Within a Company
- Soundcloud, SRE at, How to Apply SRE Principles Without Dedicated SRE Teams-Further Reading
- deployment platform, The Deployment Platform
- embedded SREs, The Embedded SRE
- failure of team approach, SREs to the Rescue! (and How They Failed)-The Embedded SRE
- getting buy-in, Getting Buy-In-Getting Buy-In
- implementation details, Some Implementation Details-Getting Buy-In
- need to adjust approach to circumstances, A Matter of Scale in Terms of Headcount
- on-call rotations and pager fatigue, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- postmortems for resolving cross-team reliability issues, Resolving Cross-Team Reliability Issues by Using Postmortems
- Production Engineering team, Introducing Production Engineering-Introducing Production Engineering
- uniform infrastructure/tooling vs. autonomy/innovation, Uniform Infrastructure and Tooling Versus Autonomy and Innovation-Uniform Infrastructure and Tooling Versus Autonomy and Innovation
- you build it, you run it approach, You Build It, You Run It-Introducing Production Engineering
- source control, documentation in, Where possible, documentation should live in source control, alongside its associated code
- source-code repository, Google, Pattern 3: Create a Shared Source Code Repository-Pattern 3: Create a Shared Source Code Repository
- space shuttle disasters, Always in a State of Partial Failure
- speed at scale, safely (s3), The Future: Speed at Scale, Safely
- Spotify, SRE Without SRE: The Spotify Case Study-The Future: Speed at Scale, Safely
- balancing squad autonomy with tech stack consistency, Autonomy Versus Consistency: 2015–2017-Key Learnings
- beta and release period, Prelude-Key Learnings
- bringing scalability and reliability to the forefront, Bringing Scalability and Reliability to the Forefront
- challenges in ops team scaling, A System That Didn’t Scale: 2012-Key Learnings
- deployment time slots, Blessed Deployment Time Slots
- dev owner role, The dev owner role
- engineering culture, SRE Without SRE: The Spotify Case Study
- excessive server growth, Forming Bad Habits
- forensics, Creating Detectives
- formalizing core services, Formalizing Core Services
- future of SRE at, The Future: Speed at Scale, Safely-The Future: Speed at Scale, Safely
- goalie role, Introducing the goalie role
- growth and early success (2010), The Curse of Success: 2010-Key Learnings
- growth-related challenges (2011), Pets and Cattle, and Agile: 2011-Breaking Those Bad Habits
- interruptions, Interruptions
- lead time issues, Long lead times
- lightening backend engineers' manual load, Lightening the manual load
- limits to manual deployments, Manual Work Hits a Cliff
- moving away from Ops-owner approach, Building on Trust-Building on Trust
- new ownership model, A New Ownership Model
- on-call and alerting, On-Call and Alerting
- operations focus in early history, Tabula Rasa: 2006–2007
- ops owner role, The ops owner role
- Ops-in-Squads (2013-2015), Introducing Ops-in-Squads: 2013–2015-Key Learnings
- reorganization of dev teams/op teams, Breaking Those Bad Habits
- splitting ops team into Production Ops and Internal IT Ops, Spawning Off Internal Office Support
- unintentional specialization/misalignment, Unintentional specialization and misalignment
- SRE (generally)
- SRE Classroom (workshop), Active Learning Example: SRE Classroom
- SRE teams (see teams)
- staging of third-party integrations, Testing and staging
- stakeholder identification
- state, immutable infrastructure and, Known State
- Stolarsky, Emil, Scriptable Load Balancers, Scriptable Load Balancers-Looking to the Future and Further Reading
- Stone, Luke, So, You Want to Build an SRE Team?, So, You Want to Build an SRE Team?-Making a Decision About SRE
- storage
- Storage Watcher, Storage Watcher
- structural quality, of documentation, Defining Quality: What Do Good Docs Look Like?-Defining Quality: What Do Good Docs Look Like?
- Suarez Ordoñez, Santiago, Replies
- success culture, SRE as, SRE as a Success Culture-Focus on the Details of Success
- advocate/partner phase of SRE execution, Phase 3: Advocates/Partners
- approaching operations as engineering problem, Approaching Operations as an Engineering Problem-Approaching Operations as an Engineering Problem
- business success through promises (service levels), Achieving Business Success Through Promises (Service Levels)
- capacity planning/demand forecasting, Capacity Planning and Demand Forecasting
- catalytic stage of SRE execution, Phase 4: Catalytic
- complications of differing phases of SRE execution, Complications of Differing Phases
- critical enabling functions of SRE, Critical Enabling Functions of SRE-Provisioning, Change Management, and Velocity
- empowering teams to do the right thing, Empowering Teams to “Do the Right Thing”
- firefighting/reactive phase of SRE execution, Phase 1: Firefighting/Reactive
- focusing on details of success, Focus on the Details of Success
- gatekeeper phase of SRE execution, Phase 2: Gatekeepers
- incident management/emergency response, Incident Management and Emergency Response
- key values for SRE, Key Values for SRE-Progression in Service-Level Execution
- monitoring, metrics, and KPIs, Monitoring, Metrics, and KPIs
- origins of SRE, Where Did SRE Come From?-Where Did SRE Come From?
- performance analysis/optimization, Performance Analysis and Optimization
- phases of SRE execution, Phases of SRE Execution-Complications of Differing Phases
- progression in service-level execution, Progression in Service-Level Execution
- provisioning/change management/velocity, Provisioning, Change Management, and Velocity
- site up, Keeping the Site Up-Graduated degradation
- surrogate metrics, Surrogate Metrics
- synthetic monitoring
- SysAdmin–SRE transition, From SysAdmin to SRE in 8,963 Words-Conclusion
- availability level tracking for SLAs, Tracking Availability Level-Tracking Availability Level
- concerns of Site Reliability Engineers vs. those of System Administrators, From SysAdmin to SRE in 8,963 Words
- corner cases and SLAs, Dealing with Corner Cases-Dealing with Corner Cases
- establishing SLAs for internal components, Establishing SLAs for Internal Components-Establishing SLAs for Internal Components
- external dependencies, Understanding External Dependencies-Understanding External Dependencies
- key steps in, Conclusion
- nontechnical solutions for SLAs, Nontechnical Solutions
- SLAs in, SLA
- SLIs in, Service-Level Indicator
- SLOs in, Service-Level Objective
- terminology for, Clarifying Terminology-Service-Level Objective
- SysOps, A Matter of Scale in Terms of Headcount
- systems (see complex systems)
T
- tail latency, Sidecar Performance Implications
- taking chances, space for people to, Make space for people to take chances
- tape backup, Offline storage
- Taylor, Frederick Winslow, Why Should We Care About Practitioner Cognition?
- teaching, active (see active learning)
- teams
- avoiding ambiguous expectations, Imaginary expectations
- avoiding cognitive overload, Cognitive overload
- avoiding information overload, We love interrupts and the torrents of information
- building, So, You Want to Build an SRE Team?-Making a Decision About SRE
- building psychological safety into, How to Build Psychological Safety into Your Own Team-Operations teams are bad at estimating their level of psychological safety
- capacity planning/demand forecasting, Capacity Planning and Demand Forecasting
- choosing SRE for right reasons, Choose SRE for the Right Reasons-Choose SRE for the Right Reasons
- clear communication/explicit expectations, Make your communication clear and your expectations explicit
- commitment to SRE, Commitment to SRE
- control-based models, Context Versus Control in SRE
- cross-functional, Get Rid of as Many Handoffs as Possible-Get Rid of as Many Handoffs as Possible
- defining the role of supporting divisions, Defining the role of supporting divisions
- empowering to do the right thing, Empowering Teams to “Do the Right Thing”
- Facebook PE team creation, Production Engineering at Facebook
- five pillars of practice, Critical Enabling Functions of SRE-Provisioning, Change Management, and Velocity
- goal-setting and metrics, Setting goals and defining metrics of success
- incident management/emergency response, Incident Management and Emergency Response
- insource/outsource approaches to growth, Growing the team: insource or outsource?
- large enterprises and, Implementing the SRE Team-Defining the role of supporting divisions
- learning habits of effective SRE teams, Learning Habits of Effective SRE Teams-Postmortems
- making a decision about SRE, Making a Decision About SRE
- monitoring, metrics, and KPIs, Monitoring, Metrics, and KPIs
- orienting to a data-driven approach, Orienting to a Data-Driven Approach
- performance analysis/optimization, Performance Analysis and Optimization
- provisioning/change management/velocity, Provisioning, Change Management, and Velocity
- psychological safety for, The Primary Indicator of a Successful Team-Operations teams are bad at estimating their level of psychological safety
- publicizing/celebrating successes of, Make it obvious when your team is doing well
- respect as part of culture, Make respect part of your team’s culture
- rotation of engineering team members into SRE team, Insourcing experienced talent: rotating engineering team members
- scaling to company size, A Matter of Scale in Terms of Headcount
- space for people to take chances, Make space for people to take chances
- SRE in the development cycle, SRE throughout the development cycle
- ways to make teams feel safe, Make your team feel safe
- TensorBoard, Using TensorFlow and TensorBoard-Using TensorFlow and TensorBoard
- TensorFlow, Using TensorFlow and TensorBoard-Using TensorFlow and TensorBoard
- termination, contract, Decommissioning
- third parties
- build/buy/adopt decision, Build, Buy, or Adopt?-Project Operating Expense and Abandonment Expense
- direct impact of downtime, Direct impact
- downtime, When They’re Down, You’re Down-Indirect impact
- first-class citizens, Third Parties as First-Class Citizens-Closing Thoughts
- growing teams via, Growing the team: insource or outsource?
- indirect impact of downtime, Indirect impact
- LinkedIn case study, Project Operating Expense and Abandonment Expense
- negotiating SLAs with vendors, Negotiating SLAs with vendors
- running the black box like a service, Running the Black Box Like a Service
- SLIs, SLOs, SLAs, Service-Level Indicators, Service-Level Objectives, and SLAs-Negotiating SLAs with vendors
- SLOs, SLOs
- working with, Working with Third Parties Shouldn’t Suck-Closing Thoughts
- third-party integrations
- automation, Automation
- communication, Communication
- contract termination, Decommissioning
- decommissioning, Decommissioning
- disaster planning, Disaster planning
- LinkedIn case study, Testing and staging
- logging, Logging
- monitoring, Monitoring-Uses for RUM
- playbook for, Playbook: From Staging to Production-Closing Thoughts
- reporting APIs, Logging
- synthetic monitoring, Uses for synthetic monitoring
- testing and staging, Testing and staging
- tooling, Tooling
- Three Mile Island nuclear disaster, Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted, Every Incident Could Have Been Worse
- throttling, Getting Clever with State
- ticket-driven request queues, Silos Get in the Way-Silos Get in the Way
- time quanta, Time Quanta
- time to detect (TTD), The Virtuous Cycle to the Rescue: If You Don’t Measure It…, Surrogate Metrics
- time to engage (TTE), The Virtuous Cycle to the Rescue: If You Don’t Measure It…, Surrogate Metrics
- time to fix (TTF), The Virtuous Cycle to the Rescue: If You Don’t Measure It…
- Todd, Chad, Replies
- toil
- as enemy of SRE, Toil, the Enemy of SRE-Toil, the Enemy of SRE
- defined, Toil, the Enemy of SRE
- engineering work vs., Toil, the Enemy of SRE-Toil, the Enemy of SRE
- enterprise operations model–SRE transition, Toil Limits
- privacy engineering and, Reducing Toil-Frameworks
- self-service capabilities and, Self-Service Helps SREs in Multiple Ways
- tooling, third-party integrations and, Tooling
- Toyota Production System, Start by Leaning on Lean
- training
- transactions, as availability metric, Transactions
- transgender inclusivity, Benefits
- Treat, Robert, Replies
- Treynor Sloss, Benjamin, Leverage Existing Enthusiasm for DevOps, SRE Patterns Loved by DevOps People Everywhere
- triage, First, Do No Harm
- trust, forgiveness as corollary to, The corollary to trust is forgiveness
- 2001: A Space Odyssey (movie), The Awakening of Applied AI
V
- vacation time, Benefits
- van Zijll, Robin, Replies
- velocity of change, Provisioning, Change Management, and Velocity
- vendor lock-in, Project Operating Expense and Abandonment Expense
- verification
- virtual repair debt, Virtual Repair Debt: Exorcising the Ghost in the Machine
- virtuous cycle, The Virtuous Cycle to the Rescue: If You Don’t Measure It…-The Virtuous Cycle to the Rescue: If You Don’t Measure It…
- visual analysis, Start by Leaning on Lean
W
- wages (see compensation)
- Watson, Coburn, Context Versus Control in SRE, Context Versus Control in SRE-Context Versus Control in SRE
- Wheel of Misfortune (game), Active Learning Example: Wheel of Misfortune
- Willis, John, SRE Patterns Loved by DevOps, SRE Patterns Loved by DevOps People Everywhere-Conclusion
- window of vulnerability, Window of Vulnerability
- Woods' Theorem, Mental Models
- work-life balance, Developers’ Productivity and Health Versus the Pager-Developers’ Productivity and Health Versus the Pager
- (see also psychological safety)