Chapter 5. Patterns and Antipatterns of SRE

This IS NOT SRE

There are many ways that an attempt to implement SRE practices and teams can go wrong. You can find more on Twitter and in Chapter 23 of Seeking SRE, but here are some key problems to avoid:

  • Changing the name of an existing team (usually “ops”) to “SRE” without making the organizational adjustments required to enable that team to do meaningful development work

  • Using the SRE team to shield devs from the pain of how their services really function in production

  • Failing to contain interrupts

  • Attempting to do SRE project work without the same support (such as project managers and technical writers) that any other dev team would have. Because SREs spend only 50% of their time on project work, we contend that support structures are even more important for SRE teams to make effective use of their development time

  • Valuing (perhaps simply through call-out recognition) incident response heroics over prudent design and preventative planning

  • Implementing processes or systems that slow down the delivery of value to customers without incontrovertible benefit

  • Building a “gatekeeper” team that functions as a chokepoint

  • Setting static or ill-considered SLOs

  • Thinking that SRE is a point solution to a particular problem rather than a fundamental cultural shift

This IS SRE

Hearkening back to the beginning:

SRE is an organizational model for running reliable online services by teams that are chartered to do reliability-focused engineering work.

As a discipline, SRE is devoted to helping an organization sustainably achieve the appropriate level of reliability for its services by implementing and continually improving data-informed production feedback loops to balance availability, performance, and agility.

Does it make sense for your company to commit heavily to reliability and pursue the implementation of SRE? Only you and the other leaders in your company can answer that question. Some companies will be at a size where a distinct SRE organization or team just does not fit, but the principles can still be put in place to provide a foundation for the future.

As with any new methodology or cultural shift, implementing SRE will take time, grit, and humility as you adjust to changing circumstances, but the payoff will be an institutionalized commitment to the importance of the user’s interaction with your site, service, system, or other “online stuff.” Over time, as the SRE team(s) consistently represent reliability and operability concerns and actively contribute to the product codebase to improve reliability, feature developers will learn to factor these concerns into their plans as they develop new features. At that point, SREs will be able to shift their impact to a deeper and wider level, making next month’s problems different from today’s.

Our hope is that this brief introduction to Site Reliability Engineering has given you a working understanding of the what and how of SRE. Many resources are available if you want to dive into greater detail. We’ve listed some of the best starting points for further reading in Appendix A.
