Book Description

Site Reliability Engineering is an outgrowth of the "always-on" world of online services. Initiated at Google more than a decade ago, SRE helps many of today’s sites run effectively and reliably, even as those sites continuously introduce new features. This ebook explains how site reliability engineers work with development teams to meet the fundamental customer expectation that online services can and must be reliable.

Kurt Andersen from LinkedIn and Craig Sebenik from Split explain how new and existing organizations can successfully adopt SRE processes for managing—and learning from—a host of system incidents. Complete with case studies from Google, LinkedIn, and Slack, this ebook explores SRE and provides guiding principles for implementing the role in your organization.

  • Understand how your company’s culture, capabilities, and organizational structure affect the way you approach SRE
  • Learn how SREs strive to maintain system components in a resilient, predictable, consistent, repeatable, and measured state
  • Examine differences between SRE and other operational roles, such as DevOps and classical ops
  • Learn approaches for adopting SRE in either a new or existing organization
  • Explore key patterns and anti-patterns for introducing SRE in your organization

Table of Contents

  1. Defining “SRE”
    1. Digging Into the Terms in These Definitions
      1. Production Feedback Loops
      2. Data-Informed
      3. Appropriate Level (of Reliability)
      4. Sustainable
      5. Reliability-Focused Engineering Work
      6. Continuous Improvement
      7. Organizational Model
    2. Where Did SRE Come From?
      1. Case Study: SRE at Google
    3. What’s the Relationship Between SRE and DevOps?
    4. How Do I Get My Company to “Do SRE”?
  2. Understanding the SRE Role
    1. Culture/Capabilities/Configuration
      1. Culture
      2. Capabilities
      3. Configuration
    2. Distinguishing SRE from Other Operational Models
      1. SRE Versus “Classical” SysAdmin
      2. SRE Versus “Classical” Ops
    3. SRE for Internal Services
      1. SRE for Backend or Platform Services
      2. SRE for Databases (DBRE)
      3. SRE for Security
      4. SRE for Internal IT?
  3. Implementing SRE
    1. Hierarchy of Reliability
    2. Starting a New Organization with SRE
      1. Case Study: SRE at Slack
    3. Introducing SRE into an Existing Organization
    4. Overlap Between Greenfield and Brownfield
      1. Case Study: LinkedIn
  4. Economic Trends Relating to the SRE Profession
  5. Patterns and Antipatterns of SRE
    1. This IS NOT SRE
    2. This IS SRE
  A. Further Reading