Chapter 27. Psychological Safety in SRE

Note

This work was previously published by Intercom and in ;login: magazine before being reworked specifically for an SRE audience.

The Primary Indicator of a Successful Team

When I worked for Google as an SRE, I was lucky enough to travel around the world with a group called “Team Development.” Our mission was to design and deliver team-building courses to teams who wanted to work better together. Our work was based on research later published as Project Aristotle. It found that the primary indicator of a successful team wasn’t tenure, seniority, or salary levels, but psychological safety.

Think of a team you work with closely. How strongly do you agree with these five statements?

  1. If I take a chance and screw up, it will be held against me.

  2. Our team has a strong sense of culture, and it’s difficult for new people to join.

  3. My team is slow to offer help to people who are struggling.

  4. Using my unique skills and talents comes second to the objectives of the team.

  5. It’s uncomfortable to have open honest conversations about our team’s sensitive issues.

Teams that score high on statements like these can be deemed “unsafe”: unsafe to innovate, unsafe to resolve conflict, unsafe to admit they need help. Unsafe teams can deliver for short periods of time, provided they can focus on goals and ignore interpersonal problems. Eventually, unsafe teams underperform or shatter because they resist change.

Let me highlight the impact an unsafe team can have on an individual, as seen through the eyes of an imaginary, capable, and enthusiastic new college graduate.

This imaginary graduate—we’ll call her Karen—read about a low-level locking optimization for distributed databases and realized it applied to the service for which her team was on call. Test results showed a 15% CPU saving! She excitedly rolled it out to production. Changes to the database configuration file didn’t go through the usual code-review process, and, unfortunately, her change caused a hard lockup of the database. There was a brief but total website outage. Thankfully, her more experienced colleagues spotted the problem and rolled back the change within 10 minutes. Being professionals, the team discussed the incident at its weekly postmortem meeting.

1. “If I take a chance and screw up, it will be held against me”

At the meeting, the engineering director asserted that causing downtime by chasing small optimizations was unacceptable. Karen was described as “irresponsible” in front of the team. The team suggested ways to ensure that it wouldn’t happen again. Unlike Karen, the director soon forgot about this interaction.

Karen would never try to innovate without explicit permission again.

2. “Our team has a strong sense of culture, and it’s hard for new people to join”

The impact on Karen was magnified because no one stood up for her. No one pointed out the lack of code reviews on the database configuration. No one highlighted the difference between one irresponsible act and labeling someone as irresponsible. The team was proud of its system’s reliability, so defending its reputation was more important than defending a new hire.

Karen learned that her team, and her manager, didn’t have her back.

3. “My team is slow to offer help to people who are struggling”

Karen was new to being on call for a “production” system, so she had no formal training in incident management, production hygiene, or troubleshooting distributed systems. Her team was mostly made up of people with decades of experience who had never needed training or new-hire documentation. They didn’t need playbooks. There were no signals that it was OK for a new graduate to spend time learning these skills. There were certainly no explicit offers of help, other than an initial whiteboarding session that spent more time on how the system used to work than on how it worked today.

Karen was terrified of being left with the pager. She didn’t understand how she passed the hiring process and frequently wondered why she hadn’t been fired yet. We call this impostor syndrome.

4. “Using my unique skills and talents comes second to the goals of the team”

Karen’s background was in algorithms, data structures, and distributed computing. She realized the existing system had design flaws and could never handle load spikes. The team had always blamed the customers for exceeding their contracted rates, which is like a parent blaming their infant for eating dirt. Karen rightly expected that her nonoperations background would be a benefit to the team. It’s not always clear whether a problem will require understanding a database schema, Ruby debugging, C++ performance knowledge, product knowledge, or people skills.

Karen proposed a new design based on technology she’d used during her internship. Her coworkers were unfamiliar with the new technology and immediately considered it too risky. Karen dropped her proposal without discussion. She wanted to write code and build systems, not have pointless arguments.

5. “It’s uncomfortable to have open, honest conversations about our team’s sensitive issues”

When a large customer traffic spike made the product unavailable for a number of hours, the CEO demanded a meeting with the operations team. Many details were discussed; Karen explained that the existing design could never deal with such spikes, and then mentioned her own design. Her director reminded her that her design had already been turned down at an Engineering Review, and then promised the CEO the team could improve the existing design.

Karen discussed the meeting with one of her teammates afterward. She expressed dismay that the director couldn’t see that his design was the root cause of their problems. The teammate shrugged and pointed out that the team had delivered a really good service for the past five years and thus had no interest in arguing about alternate designs with the director.

Karen left work early to look for a new job. The company didn’t miss her when she left. After all, she was “reckless, whiny, and had a problem with authority.” They didn’t reflect on the design that would have saved the company from repeated outages that caused a customer exodus.

How to Build Psychological Safety into Your Own Team

What is special about operations that drives away so many promising engineers and leaves others achieving less than their potential?

We know that success requires a strong sense of culture, shared understandings, and common values. We need to balance that respect for our culture with an openness to change it as needed. A team—initially happy to work from home—needs to colocate if they take on interns. Teams—proud that every engineer is on call for their service—might need to professionalize around a smaller team of operations-focused engineers as the potential production impact of an outage grows.

We need to be thoughtful about how we balance work that people love with work the company needs to get done. Good managers are proactive about transferring out an engineer who cannot make progress on a team’s workload due to a mismatch in interest or skills. Great managers expand their team’s remit to make better use of the engineers they have, so they feel their skills and talents are valued. Engineers whose skills go unused grow frustrated. Engineers who are ill-equipped to succeed at assigned work will feel set up to fail.

Make respect part of your team’s culture

It’s difficult to give 100% if you spend mental energy pretending to be someone else. We need to make sure people can be themselves by saying something whenever we witness disrespect. David Morrison, Australia’s Chief of Army, captured this sentiment perfectly in his “The standard you walk past is the standard you accept” speech.

Being thoughtless about people’s feelings and experiences can shut them down. Here are some examples in which I’ve personally intervened:

  • Someone welcomed a new female PM to the team on her first day but made the assumption that she wasn’t technical. The team member used baby words to explain a service, as an attempt at humor. I immediately highlighted that the new PM had a PhD in computer science. No harm was intended, and the speaker expressed embarrassment that their attempt at fun could be taken any other way. It’s sometimes hard to distinguish unconscious bias from innocence.

  • In a conversation about people’s previous positions, one person mentioned they had worked for a no-longer-successful company. A teammate mocked this person for being “brave enough” to admit it. I pointed out that mocking people is unprofessional and unwelcome, and everyone present became aware of a line that hadn’t been visible previously.

  • A quiet, bright engineer was consistently talked over by extroverts in meetings. I pointed out to the “loud” people that we were missing an important viewpoint by not ensuring everyone spoke up. Everyone became more self-aware, though I had to stress in 1:1s that I expected all senior people to speak up.

It’s essential to challenge a lack of respect immediately, politely, and in front of everyone who heard the disrespect. It would have been wonderful had someone reminded Karen’s director, in front of the group, that Karen wasn’t irresponsible, that the outage wasn’t a big deal, and that the team should improve its test coverage. Imagine how grateful Karen would have been had a senior engineer at the Engineering Review offered to work on her design with her to make it more acceptable to the team. Improve people’s ideas rather than discount them.

Make space for people to take chances

Some companies talk of 20% time. Intercom, where I worked, had “buffer” weeks between some of its six-week sprints. People often took that chance to scratch an itch that had been bothering them, without affecting the external commitments the team had made. Creating an expectation that everyone on the team has permission to innovate, while encouraging the entire team to go off-piste, sends a powerful message.

Be careful that “innovation time” doesn’t become the only time people are allowed to take chances. I worked with one company in the automotive industry that considered “innovation time” to be 2:30 PM on Tuesdays! Ideas that people think are worthy should be given airtime at team design reviews, not just dismissed. Use those reviews as an opportunity to share context on why an idea that seems good isn’t appropriate.

Make it obvious when your team is doing well

One engineer describes his experience of on-call as “being like the maintenance crew at the fairground. No one notices our work, until there is a horrible accident.” Make sure people across the organization notice when your team is succeeding; let team members send out announcements of their success. Don’t let senior people hog the limelight during all-hands meetings.

I love how my team writes goals on Post-it notes at our daily standups and weekly goal meetings. These visible marks of incremental success can be cheered as they are moved to the “done” pile. We can also celebrate glorious failure!

Many years ago, when I was running one of Google’s storage SRE teams, we were halfway through a three-year project to replace the old Google File System. Through a confluence of bad batteries, firmware bugs, poor tooling, untested software, an aggressive rollout schedule, and two power cuts, we lost an entire storage cell for a number of hours. Though all services would have had storage in other availability zones, the team spent three long days and three long nights rebuilding the cluster. After it was done, the team members—and I—were dejected. Demoralized. Defeated. An amazing manager (who happened to be visiting our office) realized I was down and pointed out that we’d just learned more about our new storage stack in those three days than we had in the previous three months. He reckoned a celebration was in order.

I bought some cheap sparkling wine from the local supermarket, and along with another manager, took over a big conference room for a few hours. Each time someone wrote something they learned on the whiteboard, we toasted them. The team that left that room was utterly different from the one that entered it.

I’m sure Karen would have loved being appreciated for uncovering the team’s weak test coverage of noncode changes and its undocumented uptime-above-all-else culture.

Make your communication clear and your expectations explicit

Rather than yelling at an engineering team each time it has an outage, help the engineers build tools to measure what counts as an outage, a Service-Level Objective (SLO) that shows how they are doing against that measure, and a culture in which they use the gap between objective and reality to choose the most impactful work.
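As a sketch of what such tooling gives a team: once outages are measured against an SLO, the remaining error budget tells everyone at a glance how much room is left for risk. A minimal illustration, with an invented 99.9% target and invented request counts:

```python
# Hypothetical sketch: how far through its error budget a team is.
# The SLO target and the request counts below are invented numbers.

def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    allowed_failures = total_requests * (1 - slo_target)
    return 1 - (failed_requests / allowed_failures)

# 10M requests against a 99.9% SLO allows ~10,000 failures;
# 4,000 failures leaves roughly 60% of the budget for risky work.
budget = error_budget_remaining(0.999, 10_000_000, 4_000)
print(f"{budget:.0%} of error budget remaining")
```

A team with budget to spare can take chances on launches and experiments; a team in the red focuses on reliability, and nobody has to yell.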

When discussing failures, people need to feel safe to share all relevant information, with the understanding that they will be judged not on how they fail, but how their handling of failures improved their team, their product, and their organization as a whole. Teams with operational responsibilities need to come together and discuss outages and process failures. It’s essential to approach these as fun learning opportunities, not root-cause obsessed witch hunts.

I’ve seen a team paralyzed, trying to decide whether to ship an efficiency win that would increase end-user latency by 20%. A short conversation with the product team resulted in updates to the SLO, detailing “estimated customer attrition due to different latency levels” and the impact each would have on the company’s bottom line. Anyone on the team could then see in seconds that low latency was far more important than hardware costs, so the team drastically overprovisioned instead.
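The comparison that team made reduces to a few lines of arithmetic. A hypothetical sketch, in which the attrition estimates, revenue, and savings figures are all invented for illustration:

```python
# Hypothetical sketch of the latency-versus-cost decision described above.
annual_revenue = 50_000_000    # company revenue, $/year (invented)
hardware_savings = 400_000     # value of the efficiency win, $/year (invented)

# Product team's estimated customer attrition per added-latency level (invented).
attrition_by_latency = {0.10: 0.01, 0.20: 0.05}

revenue_lost = annual_revenue * attrition_by_latency[0.20]
if revenue_lost > hardware_savings:
    print(f"Losing ${revenue_lost:,.0f}/year to save "
          f"${hardware_savings:,.0f}/year is a bad trade; keep latency low.")
```

Writing the numbers down, even rough ones, turns a paralyzing argument into a comparison anyone can check.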

If you expect someone to do something for you, ask for a specific commitment—“When might this be done?”—rather than assuming that everyone agrees on its urgency. Trust can be destroyed by missed commitments; this is why a primary responsibility of management is to set context, to provide structure and clarity.

Karen would have enjoyed a manager who told her in advance that the team considered reliability sacred, and asked her to work on reliability improvements, rather than optimizations.

Make your team feel safe

If you are inspired to make your team feel more psychologically safe, there are a few things you can do today:

  • Give your team a short survey (like the questions posed at the beginning of this chapter), and share the results with your team. Perhaps get a trusted person from outside the team to do 1:1s with them, promising to summarize and anonymize the feedback.

  • Discuss what “safety” means to your team; see if they’ll share when they felt “unsafe,” because safety means different things to different people—it can mean having confidence to speak up, having personal connections with the team, or feeling trained and competent to succeed at their job.

  • Build a culture of respect and clear communication, starting with your actions.
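The survey in the first suggestion needs nothing fancier than a script that averages anonymous responses per statement. A minimal sketch using the five statements from the start of this chapter; the 1-to-5 Likert scale, the sample responses, and the 3.5 discussion threshold are assumptions:

```python
# Hypothetical sketch: scoring the five-statement survey from this chapter.
# 1 = strongly disagree, 5 = strongly agree; high averages suggest "unsafe".
from statistics import mean

STATEMENTS = [
    "If I take a chance and screw up, it will be held against me",
    "Our team has a strong sense of culture, and it's difficult to join",
    "My team is slow to offer help to people who are struggling",
    "Using my unique skills comes second to the objectives of the team",
    "It's uncomfortable to discuss our team's sensitive issues openly",
]

# Each inner list is one anonymous respondent's answers, in statement order.
responses = [
    [4, 2, 3, 5, 4],
    [5, 3, 2, 4, 5],
    [3, 2, 4, 4, 3],
]

for i, statement in enumerate(STATEMENTS):
    avg = mean(r[i] for r in responses)
    flag = "  <- discuss with the team" if avg >= 3.5 else ""
    print(f"{avg:.1f}  {statement}{flag}")
```

Sharing per-statement averages, never individual answers, keeps the promise of anonymity while showing the team where to start the conversation.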

Treat psychological safety as a key business metric, as important as revenue, cost of sales, or uptime. This will feed into your team’s effectiveness, productivity, and staff retention and any other business metric you value. Don’t optimize for it exclusively—if someone who feels unsafe quits your team, your metrics go up, which does not indicate success!

Why are operations teams more likely to feel unsafe than other engineering teams?

Let’s unpick the melange of personality quirks and organizational necessities that puts operations teams, and specifically SRE, at risk.

We love interrupts and the torrents of information

Humans suck at multitasking. Trying to do multiple things at once either doubles the time it takes to complete a task or doubles the mistakes.1 A team expected to make progress on project work while remaining available for interrupt work (tickets, on-call, walk-ups) is destined to fail. And yet, operations attracts people who like being distracted by novel events. Do one thing at a time. “Timebox” inbound communications as well as interrupt time.

Operations teams are expected to manage risk and uncertainty for their organizations. We build philosophies for reasoning about risk and strategies for coping with bad outcomes: defense in depth, playbooks, incident management, escalation policies, and so on. When humans are exposed to uncertainty, the resultant “information gap” creates a hunger for information, often exaggerated past the point of utility.2 This can lead to information overload in the shape of ludicrously ornate and hard-to-understand dashboards, torrents of email, alerts, and automatically filed bugs. We all know engineers who have hundreds of bugs assigned to them that they cannot possibly ever fix, yet who refuse to mark them “Won’t Fix.” Another pathology is subscribing to developer mailing lists in order to be aware of every change being made to the system. Our love of novelty blinds us to the lack of value in information we cannot act on.

Admit that most information is not actionable; be brutal with your bugs, your mail filters, and your open chat apps. Tell your team that it’s OK to assume anything urgent will page; any other work can be picked up after they finish tasks.

On-call and operations

The stress of on-call drives people away from operations roles. Curiously, 24/7 shifts are not the problem. The real problem is underpopulated on-call teams, working long, frequent shifts. The more time people spend on call, the more likely they are to suffer from depression and anxiety.3 The expectation of having to act is more stressful than acting itself.4 It’s one thing to accept that on-call is part of a job; it’s another thing altogether to tell your five-year-old daughter that you can’t bring her to the playground.

We can mitigate this stress by ensuring on-call rotations of no fewer than six people, with comp time for those facing significant response-time expectations or curtailment of their personal lives. Compensate teams based on the time they spend expecting to work, not the time they spend working. Per-shift on-call payments (or comp time)—as opposed to folding on-call compensation into the base compensation package—show that employers value the sacrifice of their employees’ personal time. Incident management training or frequent “Wheel of Misfortune” drills can also reduce stress by increasing people’s confidence. Ensure that on-call engineers prioritize finding someone to fix a problem when multiple incidents happen concurrently.5
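Some quick arithmetic shows why a six-person floor matters for a 24/7 rotation; the one-week shift length and the team sizes here are illustrative assumptions:

```python
# Illustrative sketch: how much of the year each engineer carries the
# pager in a 24/7, one-week-shift rotation of a given size.

def on_call_fraction(rotation_size: int) -> float:
    """Fraction of time each engineer spends on call."""
    return 1 / rotation_size

for size in (3, 4, 6, 8):
    weeks_per_year = 52 / size
    print(f"{size} engineers: {on_call_fraction(size):.0%} of the time, "
          f"~{weeks_per_year:.0f} weeks of pager duty per year")
```

With four or fewer people, each engineer spends at least a quarter of their life expecting to work, which is exactly where the research cited above suggests stress accumulates.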

Cognitive overload

Operations teams support software written by much larger teams. I know a team of 65 SREs that supports software written by 3,500 software engineers. Teams faced with supporting software written in multiple languages, with different underlying technologies and frameworks, spend a huge amount of time trying to understand the system and consequently have less time to improve it.

To reduce complexity, software engineers deploy more and more abstractions. Abstractions can be like quicksand. Object/relational mappers (ORMs) are a wonderful example of a tool that makes a developer’s life easier by reducing the time spent thinking about database schemas. But because an ORM obviates the need to understand the underlying schema, developers stop considering how their changes affect production performance. Operations now needs to understand both the ORM layer and why it impacts the database.

Monolithic designs are often easier to develop and extend than microservices. There can be valid business reasons to avoid duplicating sensitive or complex code, and monoliths have simpler orchestration configuration. However, because monolithic architectures attract heterogeneous traffic classes and cost profiles, they are a nightmare for operations teams to troubleshoot or capacity-plan.

Most of us understand that onboarding new, evolving software strains an operations team. We ignore the burden of mature, “stable” services. There is rarely any glamorous work to be done on such services, but the team still needs to understand them. The extra care required not to impair mature services while iterating on newer ones must be accounted for during time-and-effort estimation.

Ensure teams document the impact of cognitive load on development velocity. It has a direct and serious impact on the reliability of the software, the morale and well-being of the operations team, and the long-term success of the organization.

Imaginary expectations

Good operations teams take pride in their work. When there is ambiguity around expectations of a service, we will err on the side of caution and do more work than needed. Do we consider all of our services to be as important as each other? Are there some we can drop to “best effort”? Do we really need to fix all bugs logged against our team, or can we say, “Sorry, that’s not our team’s focus”? Are our Service-Level Agreements (SLAs) worded well enough that the entire team knows where their effort is best directed on any given day? Do we start our team meeting with the team’s most important topics, or do we blindly follow process?

Ensure that there are no magic numbers in your alerts and SLAs. If your team is being held to account for something, verify that there is a good reason that everyone agrees with and understands.

Operations teams are bad at estimating their level of psychological safety

Finally, I’ll leave you with a thought: people who are good at operations are bad at recognizing psychologically unsafe situations. We consider occasionally stressful on-call “normal,” and don’t feel it getting worse until we burn out. Overemphasizing acts of sacrifice at work normalizes sacrifice and turns it into an expectation.6 The curiosity that allows us to be creative drives us to information overload. Despite being realistic about how terrible everything is, we stay strongly optimistic that the systems, software, and people we work with will get better.

I’ve conducted surveys of deeply troubled teams for which every response seemed to indicate everything was wonderful. I’d love to hear from people who have experience uncovering such cognitive dissonance in engineers. After all these years, I’m still surprised when I uncover it.

Bio

John Looney is a production engineer at Facebook, managing a data center provisioning team. Before that, he helped build a modern SaaS-based infrastructure platform for Intercom, one of the fastest-growing technology companies in the world. Before that, he was a full-stack SRE at Google who did everything from rack design and data center automation through ads-serving, stopping at GFS, Borg, and Colossus along the way. He wrote a chapter of the SRE book on automation and is on the steering committee for USENIX SRECon.

1 Paul Atchley, “You Can’t Multitask, So Stop Trying”, Harvard Business Review.

2 George Loewenstein, “The Psychology of Curiosity”, Psychological Bulletin.

3 Anne-Marie Nicol and Jackie Botterill, “On-call work and health: a review”, Environ Health.

4 J. Dettmers et al., “Extended work availability and its relation with start-of-day mood and cortisol”.

5 Dave O’Connor, “Bad Machinery”, SREcon15 EU.

6 Emily Gorcenski, “The Cult(ure) of Strength”, SREcon.
