Chapter 3. Case Studies

Let’s take a look now at how SRE training has been done in practice. We discuss training activities in place for organizations along a spectrum, from very large to very small. We use Google as an example of a large SRE organization; for the medium and smaller organizations, we look at other companies.

Training in a Large Organization

Google’s SRE training program provides a case study of one possible way to implement such a program at a large organization.

When Google renamed its “production team” to Site Reliability Engineering in 2003, the team members were experienced software engineers tasked with “keeping Google running.” These software engineers had deep knowledge of the systems Google was using. The number of different systems Google was running was limited; it was more or less possible to know most of the internals.

As Google grew, and systems grew increasingly specialized, we needed more Site Reliability Engineers (SREs). Instead of transferring experienced Google software engineers into SRE, Google began directly hiring SREs. Although Google had a handful of classes to train new software engineering hires, we didn’t have any SRE-specific training. The newly hired engineers joining SRE had to “grok SRE the hard way.”

In 2014, a couple of SREs began discussing how difficult it was to onboard new SREs. As a result, Google SRE founded a team specifically geared toward SRE education. Initially, this team concentrated mostly on new hires. Over time, the team also began organizing classes for experienced SREs who needed to learn a new technology.

Google has many different SRE teams, each dedicated to one or more services, because there is a limit to the amount of state that any one engineer or team can retain. Different services can have completely different characteristics (for example, a batch-oriented service compared to a streaming-oriented service). They also make use of many different supporting subsystems, such as widely differing database systems. There are many differences between the services, but there is also a lot of common ground, such as the Google infrastructure for networking, the Borg cluster management system, and so on (see Chapter 2 of Site Reliability Engineering). Therefore, it makes sense to split out these common subjects from the team-specific subjects. Google’s SRE EDU team is responsible for acquainting students with SRE culture and the common infrastructure. Individual teams are responsible for team-specific training (see the section “Team-specific training” in this chapter).

Large distributed systems have many moving parts, each with peculiarities that only a few specialists know well. It’s impossible for any one person to know everything about all of that infrastructure, so it’s not surprising that knowledge about such large distributed systems is itself distributed. For new hires, it can be frustrating to discover that, contrary to their previous job, where they knew about all the machinery, they now know only a small part. It’s easy for new hires to fall victim to imposter syndrome (see Chapter 2, “Fight imposter syndrome”).

Because the Google SRE team is both large and growing rapidly (see the upper-right quadrant of Figure 2-4), we designed and launched a full life cycle training program. This ensures that new people ramp up confidently on their team, while also providing continuing education opportunities for experienced SREs to build new skills or to switch teams and support a different service.

Stages of Training

For Google SRE, onboarding begins the second week after hire, when all administrative details have been handled and new hires have been introduced to Google’s general culture and procedures.

Orientation

Orientation is the program for new hires who have not yet had any Google-specific technical training. We teach them about SRE principles, practices, and culture, as well as some technical aspects that are general to Google.

Legacy orientation

For a decade, Google had no structured education for new SRE hires. One could say we were following the “sink or swim” model discussed earlier. Apart from haphazard team-specific materials and random classes taught at irregular intervals, there was no SRE-specific training for new people.

In 2015, Google SRE formed an education team called SRE EDU. To ramp itself up as fast as possible, the team first made an inventory of all available classes (and all the different versions of those classes). This led to a curriculum that packed 11 classes, each one to two hours long, into a week. The students were happy, but there were frequent remarks that learning would be more effective if there were more hands-on classes. We quickly and effectively launched a minimum viable training curriculum and then used survey feedback (a form of monitoring that we discuss in Chapter 5) to find opportunities to innovate and improve.

Current state of SRE orientation at Google

After onboarding at the local office, we have students travel to one of our three hubs for orientation, where they receive a week of SRE-oriented training. The training concentrates on SRE culture, tools and systems, and applications.

For the culture aspect, we have a number of classes sandwiched between the more technical classes. The following are the important themes in these cultural classes:

  • Failure always happens—it is a logical consequence of Google’s size. Therefore, we must embrace failure and use redundancy and resilience to fight the effects of the failures that do occur.

  • Toil is something we want to eliminate as much as possible. Our main approach is to automate. Also, after an outage, investigating to find the root cause and fixing it lowers both repeat outages and toil.

  • Creating a good social fabric is important. This comprises having a good relationship with the developers, using blameless postmortems after outages, and having only actionable alerts. Also, having diverse opinions matters; this helps fight tunnel vision, which can prolong the duration of outages.

For the more technical parts of the curriculum, we have changed our approach drastically. Adult learners acquire new knowledge most effectively by doing and applying what they’ve just seen. Therefore, we went from a model built around a number of slide deck–based classes to a model with fewer (and less detailed) classes combined with specific, hands-on practice. In particular, we created a photo upload service called the Breakage Service, discussed in more detail in Chapter 4.

Wherever relevant, the technical classes refer to our photo service as an example, so the students see right away how that works in real life. They immediately use the tools they have just heard about and investigate how the parts of the service fit together. For example, when we teach about Remote Procedure Calls (RPCs) at Google, the students also look at diagnostic consoles showing the RPCs sent from the frontend servers to the backends.
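As a concrete (and purely hypothetical) illustration of the kind of per-RPC bookkeeping such a diagnostic console might show, the following sketch records latency and error counts for an outgoing call. The service and method names are invented for illustration; this does not reflect Google’s internal tooling.

    import time
    from collections import defaultdict

    # Hypothetical per-RPC statistics, keyed by backend method name.
    rpc_stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0.0})

    def traced_rpc(method):
        """Decorator that records latency and error counts for an outgoing RPC."""
        def wrap(fn):
            def call(*args, **kwargs):
                start = time.monotonic()
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    rpc_stats[method]["errors"] += 1
                    raise
                finally:
                    rpc_stats[method]["calls"] += 1
                    rpc_stats[method]["total_ms"] += (time.monotonic() - start) * 1000
            return call
        return wrap

    @traced_rpc("PhotoBackend.GetThumbnail")  # invented method name
    def get_thumbnail(photo_id):
        time.sleep(0.01)  # stand-in for the real backend call
        return f"thumbnail-for-{photo_id}"

    def diagnostic_console():
        """Print a crude per-RPC summary, like a student might inspect."""
        for method, s in rpc_stats.items():
            avg = s["total_ms"] / s["calls"] if s["calls"] else 0.0
            print(f"{method}: {s['calls']} calls, {s['errors']} errors, {avg:.1f} ms avg")

    get_thumbnail(42)
    diagnostic_console()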

We’ve found that a day and a half after orientation begins, the students are able to correctly triage, mitigate, and find a resolution for their first outage. Once, after the first breakage exercise, a student correctly gave a four-sentence summary of what had just happened. When asked what they would have thought if they’d heard those four sentences only a day before, their eyes went wide, their jaw dropped, and they said, “I would not have understood a word of what I just said.”

From the surveys we held, we found that on a seven-point scale from –3 (not very confident) through 0 (neutral) to +3 (very confident), the students on average rated their confidence nearly two points higher than they did before orientation. To further illustrate how SRE EDU orientation raises confidence: 89% of participants reported at least a one-point increase in confidence, and 29% reported at least a three-point increase. Figure 3-1 shows the shift in confidence in a histogram of survey responses.

Figure 3-1. Histogram of survey responses of self-reported confidence.
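As an aside, the arithmetic behind such survey numbers is straightforward. The following sketch computes the average shift and the share of respondents with at least a one- or three-point increase from paired before/after ratings; the data here is made up and only mirrors the shape of the survey, not its actual results.

    # Hypothetical paired survey responses on the -3..+3 confidence scale.
    before = [-2, -1, 0, -3, 1, -1, 0, -2]
    after  = [ 1,  1, 2,  0, 2,  2, 1,  1]

    shifts = [a - b for a, b in zip(after, before)]
    mean_shift = sum(shifts) / len(shifts)
    at_least_one = sum(s >= 1 for s in shifts) / len(shifts)
    at_least_three = sum(s >= 3 for s in shifts) / len(shifts)

    print(f"average shift: {mean_shift:+.1f} points")
    print(f">= 1 point increase: {at_least_one:.0%}")
    print(f">= 3 point increase: {at_least_three:.0%}")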

Takeaways

We found it beneficial to move from a class-only model to a model centered around real-life, hands-on troubleshooting. We saw much better student participation, happier students, and above all, a rise in students’ confidence that they’d be able to do their SRE jobs. We strongly suggest making your onboarding classes hands-on, with troubleshooting exercises that are as close to the real world as possible. This might be more difficult to do when your organization is smaller (you would probably not build a complete application stack for your learning environment), but it’s still worth it. We look at how this is done in “SRE Training in Smaller Organizations”, later in this chapter.

Training of instructors

A training model that centers more on the experience than on actual technical knowledge requires that the instructors understand the philosophy behind the program.

We’ve noticed in the past that giving students the full depth of information about our systems leads to information overload, and the students don’t remember the information afterward. However, only a minimal amount of detail is needed to do the breakage exercises. Therefore, we have deliberately limited the depth of the class material to teach only what is necessary to get through the breakage exercises. This shift in training method is sometimes difficult for instructors, who are often subject matter experts who volunteer to teach. The instructors often want to share a lot of exciting details about their subject, and they might even disagree with how much material is left out, or with which specific details are omitted. However, when the instructors see how students apply what they’ve been taught while doing the exercises and resolving the deliberate breakages, they better understand why the curriculum is set up the way it is.

To help instructors see our reasoning behind the idea of less depth in the curriculum, we run regular “Train the Trainer” sessions. We have two different sessions: one for the class material, to give the background and reasoning behind the classes, and another for the story that ties them together.

The latter Train the Trainer class concentrates on how to be a facilitator (or teaching assistant) during the breakage exercises. We’ve found that these sessions are valued and in high demand among our instructors. In the “breakage” part of Train the Trainer, we have the future facilitators be on the “receiving end” of the breakages, asking them, before the exercise starts, not just to resolve the issue but also to observe how the Train the Trainer instructors behave. We have multiple breakage exercises, and in later exercises the Train the Trainer instructor is more and more hands-off. We discuss this tactic and why it’s used (as the new SRE students gain more confidence, we keep more distance). After a number of these exercises, we have the Train the Trainer students practice running a breakage exercise themselves. This sometimes poses a challenge, as the students have to act as if they don’t know what is wrong during a breakage exercise they actually solved earlier.

In the same way that the entire curriculum is there to give new SRE hires confidence, the Train the Trainer program is there to give facilitators and TAs the confidence they need to run a class. How we implemented the Train the Trainer program is described in detail in Chapter 4, Instructional Design Principles.

Having volunteers teach and facilitate is the only way we could scale our program to the size it is now, with a team of only seven people running 14 classes every month, in three different locations. To date, more than 3,000 students have gone through Google’s SRE EDU orientation.

Team-specific training

After students have gone through foundational SRE training, they must learn the team specifics. The teams all have different ways of ramping up their new SREs on the service for which they are responsible. Google uses a company-wide checklist for SRE teams to work through, though the extent to which each team uses the checklist differs. Teams use several techniques, which we discuss in the following sections.

Documentation

One way new team members learn about the systems they will be responsible for is through documentation. Documentation should describe the following:

  • The systems the team is responsible for

  • The procedures they follow; for example, for cluster turnups and turndowns

  • A playbook/runbook that describes how to handle certain outages (see the section “Maintaining Playbooks” in Chapter 8 of The Site Reliability Workbook)

This helps all new team members understand the systems the team is involved with. Some teams also have specific onboarding documents that are then referred to from the onboarding checklist.

New team members should be encouraged to fix any inconsistencies they encounter in the material. If your documentation is checked into your version control system and reviewed, this is easy and safe to do. Having new team members verify the documentation helps keep it up to date; they are in a unique position to notice that the documentation no longer describes how the system is actually set up. Unfortunately, documentation is “slightly stale” more often than not.

In-person and recorded classes

Team members can teach classes about the systems, perhaps as impromptu one-offs or as deliberate ramp-up summits (if there are multiple new team members). If you record the classes delivered at the summits, they can also be used by new arrivals who join the team later. Even more than with documentation, there is a risk that the recordings go out of date, because systems are continuously in flux. Although documentation can be updated, this is much more difficult to do for recorded videos.

Whiteboarding sessions

Whiteboarding sessions are also useful. Here, a team member who is the “owner” of a specific subsystem explains how that subsystem works, any weak spots it might have, and what the plans are for its future. Having these sessions not only helps other team members understand more about these subsystems, it also works as a forcing function for people to explain the subsystem in a clear and succinct way. These sessions can also be recorded for later consumption (with the same staleness risks).

Teach-back sessions

A teach-back is a special kind of whiteboarding session during which new team members are asked to prepare a session on a specific subsystem and then, drawing on their experience from reading documentation, skimming through configuration, and looking at jobs running in production, explain what they think the system does. At Google, we’ve found that the new team members usually do a really good job, and any misunderstandings are quickly resolved. Many times, such teach-back sessions uncover aspects of the subsystem that the non-experts on that subsystem did not yet know about. These sessions are of value not only to new team members, but to the team as a whole.

Mentoring

It’s important that the team assigns the new hire a team member as a mentor, as a first point of contact for questions related to the team’s systems and the company’s infrastructure in general. We also advise new hires to have a mentor outside the team, for specific issues not related to the team—someone who can give a point of view without the consequences of day-to-day work. For example, “What can I expect from my manager?” or “How does your team use tool XYZ?”

Going on-call

A mid-term goal of team ramp-up is preparing the new team member to go on-call. Depending on the team the new SRE is on, this takes anywhere from three months to a year.

Classes

At Google, after about six weeks, we offer newly hired SREs a set of classes that prepare students for going on-call in a general sense. Of course, there are no team specifics in these classes. We have two classes that cover the mechanics of being on-call (a little bit about the tools, and mostly about the procedures around incident management). As previously discussed, we also have two more “soft skills”–oriented classes. The first is about stress management during incidents, when the stakes can be very high and keeping your cool is important. We want to teach this without scaring the students and having a negative influence on their confidence. This class was codesigned with an aviation consultant—stress management in the cockpit is vital!

The second soft-skills class is about proper handoffs and escalation—how to communicate when roping in help from other people. This teaches people the following:

  • How to prevent misunderstanding by using very explicit communications

  • How to behave, escalate, and ask questions during stressful situations, in such a way that the communication goes as smoothly as possible

Powerups

Some teams (most notably those with highly critical systems) use a mechanism called powerups. This works just like in video games, in which a powerup gives a player certain abilities after achieving an intermediate goal. In Google’s case, after a new team member has demonstrated that they’ve acquired the knowledge and skills needed for the “next level,” they are granted more permissions to manipulate the systems they are responsible for. There are usually multiple tiers. Having these powerups not only ensures that the people on the team who have permissions to administer the systems also have the skills to do so, it gives new team members the confidence that they actually are “at the next level.”
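To make the idea concrete, here is a minimal sketch of how tiered powerups might be modeled. The milestone names, permissions, and cumulative-tier logic are all hypothetical; this is not Google’s actual access-control tooling.

    # Hypothetical powerup tiers: each level grants extra permissions once
    # the new team member has demonstrated the associated skills.
    TIERS = [
        {"level": 1, "requires": {"completed_onboarding_doc"}, "grants": {"read_dashboards"}},
        {"level": 2, "requires": {"passed_teach_back"},        "grants": {"push_config_canary"}},
        {"level": 3, "requires": {"completed_shadow_oncall"},  "grants": {"drain_cluster", "page_ack"}},
    ]

    def granted_permissions(achievements: set[str]) -> set[str]:
        """Return all permissions unlocked by the milestones achieved so far.
        Tiers are cumulative: a higher tier only unlocks if all lower tiers did."""
        perms: set[str] = set()
        for tier in TIERS:
            if tier["requires"] <= achievements:
                perms |= tier["grants"]
            else:
                break  # stop at the first tier whose requirements are unmet
        return perms

    print(granted_permissions({"completed_onboarding_doc", "passed_teach_back"}))
    # prints the level 1 and 2 permissions: read_dashboards, push_config_canary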

Shadowing

An important part of ramping people up to be on-call is shadowing an experienced team member who is on-call. The engagement here varies from “looking over the shoulder” of the on-caller to an almost reverse situation in which the new team member drives the investigation, with the experienced team member only supervising and encouraging. Of course, this depends on the urgency of the outage and on how involved the shadower is.

Ongoing education

In large environments like Google, systems are continuously in flux. Systems are frequently deprecated, and new systems emerge to replace them (for example, in 2011, Google replaced its internal cluster file system, GFS, with Colossus, a major undertaking with consequences for almost all teams within Google). Most of these changes lead to extra work for service owners. Therefore, we encourage people to create classes about newly emerging systems, and we encourage subject matter experts to teach them regularly.

Once a year, we organize an “Annual Week of Education” (AWE) during which people worldwide are invited to (create and) teach classes. Because we have offices around the world, this amounts to classes being taught almost around the clock, Monday through Friday, for that week. People can attend these classes locally, where they are being taught, but also remotely from other offices through video conferencing. Of course, these classes are recorded, so people who are in an inconvenient time zone can still attend. In the past few years, we have had more than 70 classes during AWE, with 50% of the classes presented being newly developed. Following the best practices described in the section “Managing SRE Training Materials”, we have materials curated by subject matter experts. We monitor their age and set dates by which a subject matter expert must verify that the material is still fresh enough (or the material is removed). We created a catalog of the material for discoverability, which seems to work well; though, as always, there is room for improvement.
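As a rough illustration of this kind of freshness monitoring, the following sketch flags materials whose verification deadline has passed. The catalog format, field names, owners, and dates are invented for this example.

    from datetime import date

    # Hypothetical catalog entries: each class has an owner and a date by which
    # a subject matter expert must re-verify that the material is still fresh.
    catalog = [
        {"title": "Intro to Borg scheduling", "owner": "sme-a", "verify_by": date(2024, 3, 1)},
        {"title": "Colossus for SREs",        "owner": "sme-b", "verify_by": date(2026, 1, 15)},
    ]

    def stale_materials(entries, today=None):
        """Return entries whose verification deadline has passed."""
        today = today or date.today()
        return [e for e in entries if e["verify_by"] < today]

    for entry in stale_materials(catalog):
        print(f"STALE: {entry['title']} (owner: {entry['owner']}) - verify or remove")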

Several of our SRE sites also organize regular, ongoing education classes taught by subject matter experts. This is done either on demand or on a regular cadence; for example, 10 classes in a week, every quarter. Because the SRE EDU core team does not have the cycles to cover all of these sites, and the local SREs know the training needs of their site, we have SRE EDU site contacts in all the SRE sites who are responsible for local, ongoing education activities, with scaled support from the SRE EDU core team.

Summary

In a large SRE organization with a very wide scope of technologies in use, instilling confidence is the most important aspect of education. This confidence is best gained through hands-on exercises that are very close to real-life production systems. Generic ramp-up material is best created by a centralized group, but team-specific information should be created and delivered by the teams themselves.

After initial ramp-up, continuing education is also important. We suggest that these training efforts be done in a distributed way, with subject matter experts creating the trainings. If recordings of these trainings are made, they can easily be distributed within the organization. However, as always, care must be taken that the freshness of these materials is monitored, that stale materials are refreshed, and that classes are removed whenever they are stale beyond repair.

SRE Training in Smaller Organizations

In the previous sections, we’ve seen how Google delivers content in ways that are tailored for training diverse audiences (out-of-school new hires, senior engineer new hires, in-house transfers, etc.). We discussed how Google focuses on hands-on training to help establish a foundation of knowledge, muscle memory, and expert intuition, all of which are necessary for our SREs to succeed in their daily tasks. Although the discussion focused on how large companies like Google can implement training, the approaches can be scaled down and modified for companies of any size.

Of course, for an organization with, say, a single SRE team, it might not be feasible to create a training program like the one described in “Training in a Large Organization”. For smaller organizations, we look at companies that apply or advocate SRE practices without necessarily calling themselves SRE. This could include software engineers who care about reliability, DevOps practitioners, and so on. Many of the principles that Google uses can still be applied in these organizations. For example, it’s important that students get a chance to practice what they just learned in a safe environment. With only a few students per year, it’s not cost effective to create classes and have instructors teach them, so smaller organizations spend more time on self-study, mentoring, and shadowing.

Applying What They’ve Learned

The key point—people usually learn better when applying what they have just learned—still holds true for a smaller organization, but it’s more challenging to set up an environment at a small scale. Even so, it’s not impossible, as shown in the following examples.

The Swiss company AdNovum has consultants who install, configure, and troubleshoot its “NEVIS” security suite at customer sites. AdNovum trains its new hires using reading materials and self-study exercises that are set up on virtual machines (VMs). For these exercises, the students install and configure the software and verify that it’s working correctly. Then, they run a script that breaks a specific part of the system. That’s when their troubleshooting exercise begins: the students must find the root cause of the problem. Because this is in a VM, in a learning environment, students don’t need to be afraid of breaking something critical—it’s easy to reset the situation by spinning up the VM from scratch, or by using the “UNDO” option provided by the script. Creating a safe environment helps the student gain confidence. Because the student has access to the breaking script, it might be a good idea to obfuscate it so that they aren’t tempted to see what’s broken by looking at what the script does.
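To show how simple such a break/undo mechanism can be, here is a hypothetical sketch of a breakage script that saves a known-good copy of a config file before corrupting it. The file path and misconfigurations are invented; this is not AdNovum’s actual tooling.

    import random
    import shutil

    CONFIG = "/etc/nevis/service.conf"   # hypothetical path inside the training VM
    BACKUP = CONFIG + ".known-good"

    # Each breakage is a (known-good text, broken text) substitution.
    BREAKAGES = [
        ("listen_port = 8443", "listen_port = 8444"),   # wrong port
        ("tls_enabled = true", "tls_enabled = false"),  # TLS silently disabled
    ]

    def break_something():
        """Save a known-good copy, then introduce one random misconfiguration."""
        shutil.copyfile(CONFIG, BACKUP)
        old, new = random.choice(BREAKAGES)
        with open(CONFIG) as f:
            contents = f.read()
        with open(CONFIG, "w") as f:
            f.write(contents.replace(old, new))

    def undo():
        """Restore the known-good configuration (the script's UNDO option)."""
        shutil.copyfile(BACKUP, CONFIG)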

Another example comes from one of our authors, who previously worked at a small teaching company in the Netherlands, AT Computing. During network classes, AT Computing uses multiple VMs on a desktop computer to offer exercises on troubleshooting routing and firewall problems in the virtual network. The company makes the VMs as small as possible so that many of them can be booted on the desktop, which allows for elaborate virtual networks. This setup is used for a Linux network management class, a firewall class, and a DNS management class.

Finally, let’s take a look at Yelp. In their SREcon Americas 2019 talk, Chie Shu, Dorothy Jung, and Wenting Wang describe how they surveyed their fellow engineers at Yelp about on-call readiness. Almost 65% of the respondents answered that they did not feel ready to go on-call. They then introduced a wargame that lets new SREs simulate an incident in a safe environment. Their work includes a template that is used to run the exercise with different participants; it describes how to introduce the incident (in a nonproduction environment) so that others can play the wargame. Players take on various roles, like investigator, communicator, and commander. Again, we see that creating an almost-real environment that is safe to “play” in and can be broken on command lets new SREs practice with the tools they need for troubleshooting.
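A wargame template of this kind can be captured in a small data structure that a facilitator walks through. The scenario, roles, and timed “injects” below are invented for illustration and are not Yelp’s actual template.

    # Hypothetical wargame template: the facilitator introduces the scenario in a
    # nonproduction environment, assigns roles, and reveals "injects" over time.
    wargame = {
        "scenario": "Photo uploads failing for ~10% of users in the staging cluster",
        "roles": ["incident commander", "investigator", "communicator"],
        "injects": [
            (0,  "Pager fires: elevated 5xx rate on the upload endpoint."),
            (10, "A teammate mentions a config push went out an hour ago."),
            (20, "Error logs show connection timeouts to the thumbnail backend."),
        ],
        "expected_mitigation": "Roll back the config push, then verify error rates recover.",
    }

    def run(template):
        print(f"Scenario: {template['scenario']}")
        print(f"Roles to assign: {', '.join(template['roles'])}")
        for minute, inject in template["injects"]:
            print(f"[t+{minute:02d}m] {inject}")

    run(wargame)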

Company X

Now let’s look at an example of how another company (let’s call it “Company X”) has implemented its SRE training program. Company X developed its program independently from Google, and we discuss why its choices make sense for a company of its size and organizational maturity.

Company X employs nearly 1,000 developers, eight of whom are SREs. It hires SREs instead of more software developers for a reason that is common in the industry: developers tend to treat available resources (network, computing power) as infinite and never-failing. SREs are comfortable with the uncertainty, ambiguity, and unpredictability of modern distributed systems and focus on reliability issues that are not generally top-of-mind for developers.

The onboarding process for all engineers, including SREs, consists of about 40 hours of classroom time spread out over their first few weeks. Each training module is one to two hours long and covers essential infrastructure: dev practices and tools, monitoring systems, service architectures, and so on. One of the most intensive modules (three hours) is “Incident Response and Postmortems.” All engineers need this incident response training, not just SREs, because any engineer might be called upon to help mitigate an issue during an incident. A bigger company could afford a dedicated incident response team, but at Company X, all engineers are expected to play a part, in one way or another, in a coordinated response.

Each engineer is expected to walk through the entire process during their first three months of work. That gives them plenty of time to attend the courses they are asked to complete, work on the practical exercises they have been assigned, and meet the people on the teams they’ll be working with. The training process overlaps with regular work during that time because Company X does not have a dedicated training team to engage full-time with new engineers during the first few weeks or months of their tenure.

With only eight members, the SRE team is small enough that each new individual receives a highly customized onboarding document of about 10 pages. The document contains personalized steps based on the new hire’s expected role and past experience.

Most of the onboarding material is not SRE-specific, and only a couple of the self-study guides provided by the SRE team are targeted at developing SRE-specific practices. During the roughly two weeks the new hire is expected to work through these, they are paired with an onboarding mentor who answers any questions they might have.

The SREs go through the onboarding process and get ready to go on-call by performing drill exercises, using nonproduction environments that mimic the actual systems they will be dealing with, similar to the process described in the “Current state of SRE orientation at Google” section of this chapter.

Company X also performs disaster recovery (DR) drills, once per year, that test services across the entire company. These tests are performed over the course of a single day, supervised by senior members of the engineering team.

In addition to DR drills, engineers participate in sessions called “Wheel of Misfortune,” during which they work through invented scenarios together in a role-playing format. These sessions tend to be arranged ad hoc.

Readiness

There is no formal process to certify that a new hire is ready for on-call, other than the SRE feeling that they are ready (and their teammates agreeing). New SREs generally shadow an on-call shift (with a seasoned engineer steering the wheel) before taking the lead role. Because of the limited number of SREs, it is infeasible for Company X to have an established reverse-shadow process. Given the small size of the team, and in order to provide an escalation path and production support, at least one other team member is expected to be easily reachable, 24/7, when new engineers are on-call.

Continuous Development

Finally, the training and development of new hires does not stop after they are fully onboarded and assume the role and responsibilities of a regular SRE. Each individual has an annual, self-directed budget of $2,000 for professional development, such as attending conferences.

Conclusion

In this chapter, we looked at specific case studies of training at large, medium, and smaller organizations. We discussed the stages of training, including orientation, team-specific training, and going on-call, and we considered examples of how training could be adapted for the size of the organization. In the next chapter, we examine how to implement such training.
