Preface

This report discusses how to train Site Reliability Engineers, or SREs. Before we go any further, we’d like to clarify the term “SRE.” “SRE” means a variety of things:

  • Site Reliability Engineer or a Site Reliability Engineering team, based on the context (singular, SRE, or plural, SREs)

  • Site Reliability Engineering concepts, discipline, or way of thinking (SRE)

  • Belonging to an SRE individual, team, or way of thinking (SRE’s or SREs’)

Ben Treynor Sloss, the founder of Site Reliability Engineering at Google, describes SRE, or the Site Reliability Engineering discipline, as what happens when “you ask a software engineer to design an operations function.” The traditional systems administration model of software management in production requires an organization to scale the number of operators as the service increases in size and complexity.¹ SRE, by contrast, lets the number of people grow sublinearly with the scale of the services they support. It does this by applying proactive engineering solutions to eliminate toil and other repetitive tasks that add no value.

We assume that you’re familiar with the concepts in the Site Reliability Engineering book. As we were growing our SRE department, we noticed how difficult it was to get new hires up to speed. You might be surprised to find that technical skills are not necessarily the most important skills to have. Without a doubt, troubleshooting is an important part of incident management, but we also show that growing the students’ confidence, explaining the importance of good relations, and encouraging clear communications with other SREs and dev teams are essential to bringing your students up to speed.

In this report, we share our experience ramping up new SREs, but we also look at other use cases. For example, we have talked with several smaller organizations that are successful in ramping people up to do SRE (or SRE-like) functions.

Training should be purposefully designed, not thrown together without thought. Therefore, we also discuss the theory behind the training design. You need to know what your training needs are and who your audience is. Set clear learning objectives and build your training content around them. We’ve seen that making SRE training hands-on is extremely important for building the confidence of the students who ultimately go on-call² for a production service.

Finally, when teaching how to “SRE,” we should implement the practices of SRE while administering the program: that is, “SRE” your SRE training program. In other words, monitor the results and be willing to adjust the training program if the monitoring shows it is necessary. We show that just as SRE has a hierarchy of needs, SRE training has its own hierarchy of needs, which follows SRE’s.

While much of this report focuses on the specific experience of Google SRE, we aim to present best practices and lessons learned over the past several years, which can be applied to organizations that are at varying points along the spectrum in terms of size and maturity.

Acknowledgments

The authors would like to thank Google SRE EDU team members past and present for shaping the program and our ways of working, including David Butts, Ben Weaver, Laura Baum, Brad Lipinski, Andrew Widdowson, Betsy Beyer, and Rob Shanley. We’d also like to acknowledge the small army of volunteers who teach for SRE EDU and those who volunteered cycles to make our ‘breakable’ photos service a reality.

Jennifer Petoff: Thank you to Phil Beevers for timely review and feedback and to Nat Welch and Steve McGhee for giving their perspective on training practices that are important for organizations working to adopt the SRE model.

JC van Winkel: Many thanks to SRE leadership who gave the proposers of SRE EDU their trust, have supported the team through the years, and helped us build on our initial success.

Preston Yoshioka: Thank you to Evan Jernagan and Moira Gagen for input and mentorship.

1 For more context, see https://oreil.ly/53yK0.

2 On-call involves being available to address production issues during both working and nonworking hours: https://oreil.ly/hQmCp
