Chapter 13. Building an SLO Culture

This book so far has explained the importance of SLOs, how to implement them, and even how to get various departments on board with them. If you’re an engineering team of one, that might be enough for you to go and start making the world a more reliable place. For the rest of us, there’s more work to be done.

It’s one thing to understand and live by these principles yourself, but it’s another to spread these ideas throughout your organization and get others working alongside you. That means having a team interested in using error budgets and having discussions about implementing feature freezes, and it means being able to rely on the systems managed by external teams. SLOs are most powerful when everyone is following the process and invested in building reliable systems. It will be easier to iterate on your systems and improve their reliability if the systems they depend on are doing the same. If you are working alone, it’s going to be a continual battle of priorities.

This can be one of the most difficult challenges in getting SLOs off the ground. Adding measurements, monitoring, and alerts can sometimes be done in a couple of days, but changing how your organization works takes more than a package install and configuration files. Luckily, you’re not the first to venture out on this journey.

While Chapter 6 discussed how to convince your organization to adopt an SLO-based approach, this chapter aims to guide you through the actual motions and steps of building an SLO culture within your team and beyond.

A Culture of No SLOs

Alerts are going off. Something about a job failing? You stop working to look into it, but it seems like the job retried and succeeded. Back to work. Another alert fires for a different job. Probably just a flaky database connection. You ignore it. Two hours later a customer is complaining that your service isn’t returning any data. Hasn’t been for a while. They’ve been asking around to figure out why and finally found your team. You look into it, and find that a job failed and never got restarted. You run it again and tell your customer to check back in an hour. They confirm the problem is fixed. You go back to cranking out your feature work.

We often don’t know when our services are working or broken. Alerts are in place, but it’s unclear if they are firing unnecessarily or catching major issues. Responsiveness is a function of how quickly customers complain. Our systems can’t be trusted and we instead expend our energy trying to make sense of them all. Whatever time isn’t spent fighting fires goes to building unreliable features as quickly as possible, which inevitably just furthers our users’ frustrations.

Or, maybe a different scenario: our system works great. Test coverage is maintained at 100% and all changes are rigorously tested by QA teams over the course of multiple days. An engineer is on call and available at all times: middle of the night, weekends, holidays. All alerts are triaged and resolved within minutes. Extra time is taken to ensure changes are made with zero downtime. It’s an expensive operation to maintain—but users have never complained about outages. However, they do wonder why simple tweaks and changes take weeks to get out the door. We pour energy into keeping our systems perfect. We delay features until they are flawless. Failure is not an option, and we pay heavily to live by that standard.

There is a balance at play between shipping new features and supporting existing features. The two preceding examples are extreme, but they highlight the issue with investing all your energy into one or the other. If we don’t invest in reliability, our systems break down and we lose our users. If we overinvest in reliability, we never make progress in developing the features our users need. SLOs are a gauge for balancing these opposing priorities. A culture of SLOs helps us become more intentional in how we strike that balance.

Maybe you aren’t at either extreme, but no matter where you are, SLOs are essential to finding that balance. Creating a culture of SLOs will help you pinpoint where you are on the spectrum and be more intentional about where you end up. SLOs are data to help your team set more meaningful goals.

So, where do we start?

Strategies for Shifting Culture

It might be tempting to think that getting your teams to adopt SLOs will simply involve showing up at a team meeting and saying, “Let’s develop SLOs!” At which point everyone will return to their desks and perfectly instrument their services in a day. If that approach works for you—congrats!

More likely, however, it will involve tons of conversations and pushing through friction as you rewire your team to take an SLO-based approach. Here are some things to keep in mind as you are going through this process:

Start small
If this is new for your team, it doesn’t make sense to try and change everything overnight. Success isn’t measuring every corner of your system, achieving 99.99% availability, and having weekly outage simulations. Generally it’s best to bring these practices to your team gradually. Start with a single SLO set to whatever your current level of reliability is, and work your way up to multiple SLOs trending toward the desired level of reliability.
Be patient
It may take a while for SLOs to catch on. Don’t be discouraged if change doesn’t happen overnight. Continue discussing these tools and concepts with your team, experimenting with ideas, and trending toward better monitoring and reliability agreements. Look for small wins as you go through this process. Those might be verbal commitments to prioritize SLOs, receiving time to work on SLOs, a colleague asking a question about SLIs, a month of SLI data getting collected, or a meeting with stakeholders to agree on reliability standards. There are lots of small steps that will happen along the way. Use those steps as motivation to carry on as opposed to getting discouraged by how much there is left to go. This book has a lot of information in it, and it might feel like you aren’t succeeding until it’s all been implemented. That isn’t the case.
Engage with your team
At the end of the day, SLOs are there to help your team—engaging with them throughout the process is how you’ll be able to ensure you’re creating a process that everyone is on board with. It might not be intuitive at first how these practices will help your team (“We won’t be able to ship anything when we exceed our error budget?!”), but go back to the agreed principles and work from there. At the end of the day, this is about making users and developers happy—so keep working together to get to that place.
Reflect as you go
The whole reason to adopt an SLO-based approach is to get you to a place where you and your users are happy with the reliability of your application. So, continually reflect on where you are and what you can do to get to that final state. Think about whether the changes you’re making are getting you closer to that goal. Don’t have any monitoring? Add some so you can see how your system is behaving. Notice that a feature is frequently broken but nobody cares? Maybe that’s a sign it’s no longer needed.

With that aside, let’s break down what paving the way to a culture of SLOs looks like.

Path to a Culture of SLOs

Maybe you’ll bring these principles to your team and everyone will be on board right away. More likely, these ideas are going to seem counterintuitive to some, or people will agree in principle but push back when it comes time to actually prioritize the work. Here’s a high-level overview of the work required to move your team toward an SLO approach:

1. Get buy-in.
Communicate how SLOs work and get everyone in agreement that they provide value.
2. Prioritize SLO work.
Get the work on your roadmap, assign it to one or more people, and make it a priority.
3. Implement your SLOs.
Decide what SLIs to track, how to monitor them, and what level of reliability you want to provide, and learn how you’re performing against those targets.
4. Use your SLOs.
Decide as a team how to alert on your SLOs, how to use your error budget, and how to inform work using your SLOs.
5. Iterate on your SLOs.
Discuss what is and isn’t working, add/remove/adjust your SLIs/SLOs, and continually revisit your SLOs to check that they reflect your stakeholders’ needs.
6. Advocate for others to use SLOs.
Use what you’ve learned to educate others about the benefits of SLOs.

Let’s talk about each of these steps in a little more detail, and how they move you toward making SLOs a part of your culture.

Getting Buy-in

Before anything can happen, people need to be in agreement about the value of SLOs. If your team doesn’t value reliability, it’s going to be hard for you to justify creating SLOs. Sure, you can go rogue and start trying to change things on the side, but what’s the point if people aren’t going to want it?

How much buy-in you need to get will depend on your situation. If you’re on a pretty autonomous team that gets to pick and choose its priorities, maybe just getting your team on board will be enough. If you’re working on a project with more stakeholders and dependencies, there are likely going to be more people to convince before everyone is in agreement about the importance of SLO-based approaches (QA, engineers, executives, and so on).

Note that getting buy-in doesn’t mean everyone is now brainstorming SLIs and signing up to do the work—it just means they agree SLOs should be a priority and would be happy to see someone do that work. Chapter 6 is all about getting buy-in, so if you’re feeling uncertain about who the stakeholders might be and how to convey the importance of SLOs to them, go take a look there for an in-depth explanation. For the purposes of this chapter, I’ll just mention it as a critical step in the SLO culture–building process.

Prioritizing SLO Work

You’ve achieved consensus that reliability is your most important feature and that SLOs will help you build reliable systems. People seem to be on board, but when you arrive at your next planning meeting the feature work is once again at center stage and nobody is talking about reliability.

This brings us to the next stage of the process: prioritization. Many will agree that SLOs sound like a good idea, but when faced with what to work on next they’ll revert to whatever the previous priorities were. Shifting the culture means catching these moments of relapse and getting into the practice of reasserting the importance of SLOs.

Once you have buy-in, look for an opportunity to make the work a priority. If you have a major launch in a week maybe that won’t be today, but eventually it needs to happen. It’s also important to decide who will be responsible for this work. Without accountability it can be easy for people to go back to whatever work was previously their priority. When it comes to who does the work, there are generally two scenarios.

Do it yourself

The easiest way to make something a priority is to make it your priority. Offer to do the work necessary to get an SLO defined for your service. Explain to your team why it is important to you:

  • Our customers are complaining, and I think it’s important we show progress toward making the system more reliable. SLOs will help ensure we do that.

  • I’ve wasted three hours this week responding to false alarms. I want to work on SLOs so I can be confident in our alerts and respond to issues more efficiently.

It’s difficult to argue what’s right for every member of the team, but if you’re able to frame the importance of SLOs in terms of your own values it becomes a lot harder to brush your argument aside:

Team member: I agree SLOs are important, but don’t we want to get these bug fixes out?

You: Do we know how often these bugs cause issues? I haven’t heard any complaints about them. It’s not clear to me that these issues are important because we don’t have any data on how they impact our users. I think it makes more sense to have an SLO for the feature so we can learn about the bugs that are having the biggest impact on our users and address those.

Figure out how SLOs will be a benefit compared to your team’s competing interests, and communicate that. Unless there’s a hard deadline for something else approaching, you should be able to make the case that SLOs are overdue.

Having read this book, you will likely be the most knowledgeable on the subject and the most driven to make the move to an SLO culture. Leading by example and making the work your priority will signal to others that you’re committed to making this change. Working on the initial SLOs will also create a baseline for others to refer to when taking on SLO work in the future. Going through the process yourself will help you understand more of the intricacies of SLOs and make you a better advocate for SLO culture down the road.

Assign it

Maybe you’re not involved with day-to-day development, and implementing SLOs is not aligned with your responsibilities. In that case, you’ll need to assign the work.

Ideally, your buy-in conversations will have helped you identify a person who might be a good fit for this job. If people have bought in they’ve agreed this is important work, so it shouldn’t be a big jump to convince them it’s important enough for them to work on. Communicate the importance of this work and how it’s going to improve things for the both of you.

Document the work that needs to get done and make someone responsible for completing it. Depending on your process, that might mean any number of things: creating tickets with detailed descriptions of the work that needs to get done, pulling the tasks into your sprint, adding SLOs to your OKRs or roadmap. Whatever it may be, make this work visible as part of your process, and find someone to be responsible for it.

Once someone has agreed to do the work, be available to coach them and support them along the way. If someone is new to SLO work, they may get stuck or confused and be tempted to switch back to picking up feature work. Make sure the SLO work is encouraged and supported. Creating an environment where people feel motivated and equipped to work on SLOs is the goal. That is a much bigger step toward creating a culture of SLOs than any single SLO being completed.

Implementing Your SLO

Your team is excited about SLOs, and the time has been set aside to define one. It’s time to get it done.

This book is full of information on how to implement SLOs, but for the purposes of this chapter I’ll just mention what I consider to be an ideal start for an initial SLO. Your first SLO is going to be important in a few respects. First, it demonstrates to your team the benefits of SLOs: improved prioritization, clearer agreements on feature versus reliability work, time to work on system stability, etc. Second, it makes the idea of an SLO tangible and provides a concrete example of how to implement one. As mentioned before, it’s usually a good idea to start with something simple so you have something digestible to point to as an example for others.

The cost of those benefits is additional work to reach consensus on what you should be aiming for. Your initial SLO will likely involve the following:

  • Creating an SLO document

  • Debating what is most important to measure

  • Debating the best SLIs to measure with

  • Implementing monitoring and tracking your SLIs

  • Debating your SLO target

Start with a document

Chapter 15 talks about the concept of an SLO document. This is a great place to start your SLO implementation process. A document will allow you to identify stakeholders and reviewers for your SLO and get everyone referring to a single source of truth.

This document will formally specify a list of approvers and stakeholders who will sign off when they are satisfied with your SLO. You will find a template for the SLO document in Appendix A.

What is important to measure?

As described throughout Part I of this book, your users decide what “reliable” means for your service. When deciding what to measure, think about what your users care about most and where your service is failing to meet those expectations. Focusing on these areas will strengthen your SLO—it’s hard to debate the importance of something that your users are currently complaining about.

  • Is your service too slow? Does it produce incorrect data? Does it frequently crash? Is it inconsistently available?

  • What parts of their experience are your users the least satisfied with?

Based on the answers to those questions, you might choose an SLO around latency, or data quality, or error rates, or availability.

Or maybe you’re on the flip side—your service works great, but you want more room to experiment and experience failure.

  • Where does it feel like you are exerting excess effort? Are you trying to launch all features with zero downtime? Are you feeling pressured to respond to and resolve all errors with no delay? Are you trying to avoid bugs at all costs by having all features go through a rigorous manual verification process?

  • Where would it be nice to have more flexibility?

Based on those answers you might choose an SLO around availability or error rates.

Put yourself in the shoes of your users and come up with a proposal for your team based on your findings.

What Will Your SLIs Be?

Knowing that most of your user complaints are around latency or error rates is great, but it’s time to get more specific.

Is there a certain feature that has a lot of errors? How do you detect an error? What does “slow” mean to your users?

Narrowing down on these things will help you understand what parts of your service need to be instrumented to measure the error rate or latency. Maybe you add a middleware to your API that increments a counter every time there’s a 500 response. Maybe you add timers to your code that measure the time certain functionality takes. Maybe you query your database to see how long jobs for the past week took and continually report that statistic.
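As a rough illustration of the counter and timer ideas above, here is a minimal Python sketch. It is not tied to any particular web framework: `instrumented`, `metrics`, `latencies`, and `get_report` are hypothetical names invented for this example, and a real service would export these numbers to its monitoring system rather than keep them in memory.

```python
import time
from collections import Counter

# Hypothetical in-process metrics store (an assumption for this sketch);
# a real service would export these values to its monitoring system.
metrics = Counter()
latencies = []

def instrumented(handler):
    """Wrap a request handler so every call is counted and timed."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = 500  # treated as a server error if the handler raises
        try:
            status = handler(*args, **kwargs)
            return status
        finally:
            latencies.append(time.perf_counter() - start)
            metrics["requests_total"] += 1
            if status >= 500:
                metrics["requests_5xx"] += 1
    return wrapper

@instrumented
def get_report(ok=True):
    # Stand-in for real work; returns an HTTP-style status code.
    return 200 if ok else 500

def error_rate():
    """Fraction of requests so far that ended in a 5xx response."""
    total = metrics["requests_total"]
    return metrics["requests_5xx"] / total if total else 0.0
```

Wrapping the handlers your users care about most gives you two raw signals, an error ratio and a list of request durations, which is enough to start discussing candidate SLIs.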

Whatever it is, you need to think of a way to develop meaningful SLIs that capture the problem. If you don’t have much infrastructure in place for monitoring, the task of tracking this data and making it easily exposed might seem like a burden. It doesn’t need to be fancy, though—start with command-line tools and manual querying if that will get things off the ground. Build a little application that uses your business-critical APIs and measures the time they take. Get creative.

Once you’ve determined the data you can expose and how to combine it into an SLI, add that to your document. At this point it can be good to check with your team and your users to see if your plans for the SLO and SLI are correct.

What Will Your SLOs Be?

It might be tempting to get into a philosophical debate about the true needs of your users and the ideal SLO. The problem with that is that you may be setting your sights too high and end up failing in the end. If your team hasn’t thought too much about reliability until now, a simple solution is to just measure your service for a couple of weeks and determine your SLO based on whatever you see as the current level of reliability. Sometimes maintaining a consistent level of reliability can be a challenge, so starting with some realistic baseline and sticking to that can be your first step.
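One way to turn a couple of weeks of measurements into a starting target is sketched below. The function name `baseline_slo` and the choice to round down are assumptions made for illustration; the only idea taken from the text is that the first target should reflect what the service already achieves.

```python
import math

def baseline_slo(good_events, total_events, decimals=1):
    """Suggest a starting SLO target (in percent) from observed data.

    Rounds the observed success ratio *down*, so the suggested target
    is one the service has already been meeting.
    """
    if total_events <= 0:
        raise ValueError("need at least one event to compute a baseline")
    observed = 100.0 * good_events / total_events
    factor = 10 ** decimals
    return math.floor(observed * factor) / factor
```

For example, a service that answered 987,654 of 1,000,000 requests successfully over the measurement period would get a suggested starting target of 98.7%.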

You can also think of tiered SLOs. Maybe your application is only needed during the work week, and won’t be used on the weekends. That could allow you to say that you only want to be 10% available on the weekends, but 95% available during the week. The world is your oyster.
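A tiered target like this can be expressed in a few lines. The sketch below uses the weekday and weekend numbers from the example above; the names `target_for` and `met_target` are hypothetical, and the rule that a day with no traffic counts as meeting its target is an assumption.

```python
from datetime import date

# Targets taken from the example in the text; adjust for your own service.
WEEKDAY_TARGET = 95.0
WEEKEND_TARGET = 10.0

def target_for(day: date) -> float:
    """Return the target (in percent) that applies on `day`."""
    # date.weekday(): Monday is 0, Sunday is 6
    return WEEKEND_TARGET if day.weekday() >= 5 else WEEKDAY_TARGET

def met_target(day: date, good: int, total: int) -> bool:
    """True if the measured success ratio met that day's target."""
    if total == 0:
        return True  # assumption: no traffic means nothing was violated
    return 100.0 * good / total >= target_for(day)
```

Keeping the tiers in one small function makes the agreement explicit and easy to revisit when usage patterns change.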

It can also be helpful to think of your unavailability in the context of a time range. A day of downtime each week is very different from a day of downtime each month. That time also feels very different when it arrives as a single day-long incident as opposed to 24 separate hour-long incidents.
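The arithmetic behind that comparison is simple enough to script. The two helper functions below are hypothetical names for this sketch, and the 30-day month is an assumption made to keep the numbers round.

```python
def allowed_downtime_minutes(target_percent, window_days):
    """Minutes of downtime an availability target permits in a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target_percent / 100.0)

def availability_percent(downtime_hours, window_days):
    """Availability implied by a given amount of downtime in a window."""
    window_hours = window_days * 24
    return 100.0 * (1 - downtime_hours / window_hours)
```

A day of downtime each week works out to roughly 85.7% availability, while a day of downtime in a 30-day month is roughly 96.7%. Going the other way, a 99.9% target over 30 days permits about 43.2 minutes of downtime. Neither function says anything about how that downtime is distributed, which is why the single-incident versus many-incidents question still needs a conversation.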

Whatever you choose, make sure it’s something your team is willing to stick to. Commit to it. Monitor how your SLO performs, ship features when you have excess error budget, and seek out needed reliability work when your error budget is exhausted. If you select an SLO and then everyone shrugs when it’s violated, that could be an indication that your team hasn’t fully bought into the process or your SLO is wrong. If alerts are going off and being ignored, it might mean that people don’t really care. When you are interested in having your team adopt SLOs, it might be tempting to think that alerts being ignored is a sign of them not being bought-in enough. In reality, it probably means that your SLO is more stringent than they care about.

Using Your SLO

You’ve outlined your SLO, had it approved, and implemented all the technical details. It’s time to put it to use!

Now that you’ve agreed on what the reliability target is, there are a few more things to agree upon with your team:

  • How should the SLO tie into alerting?

  • What do you want to do when you run out of error budget?

  • What will you do with your additional error budget?

All of these things will need to be decided on through a conversation with your team, but there are many best practices described in this book that you can bring to the table. Be willing to advocate for the best practices while also being open minded about the concerns of your team. It’s better to have everyone following a less-than-optimal strategy than to be following the “best practices” all on your own.

Alerting

Given that you now have a metric that correlates to the happiness of your users, it makes sense to raise alerts based on that. In the beginning, you may want to be conservative with your alerting. Many find alerts distracting and lose trust when alerts fire when they shouldn’t. Since you’re in the middle of experimenting and have limited goodwill to spare, alert when things seem legitimately concerning. If this ever results in issues being detected too late, use that as justification to adjust your alerts to better catch rapid burn-through of your error budget.
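One common way to alert on rapid burn-through is to compare the current error ratio with the ratio the SLO allows. The sketch below illustrates that idea only; the function names and the default threshold of 10 are assumptions for this example, not recommendations from this chapter. Chapter 8 covers how to choose real thresholds and windows.

```python
def burn_rate(error_ratio, slo_target_percent):
    """How fast the error budget is being spent.

    1.0 means the budget would last exactly the SLO window;
    higher values mean it runs out proportionally sooner.
    """
    budget_ratio = 1 - slo_target_percent / 100.0
    if budget_ratio <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return error_ratio / budget_ratio

def should_alert(error_ratio, slo_target_percent, threshold=10.0):
    """Alert only when the budget is burning `threshold` times too fast."""
    # The default threshold is hypothetical; tune it with your team.
    return burn_rate(error_ratio, slo_target_percent) >= threshold
```

With a 99% target, an error ratio of 0.5% burns the budget at half the sustainable rate and stays quiet, while an error ratio of 20% burns it about twenty times too fast and would alert. Starting with a high threshold matches the conservative approach described above; you can lower it if issues are detected too late.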

Because your SLIs map to a user’s experience, you should find these alerts to be better at detecting real issues than other alerts you had in place in the past. Eventually you may be able to disable those other alerts and get a better pulse on the health of your system.

Give alerts time and attention. If alerts are going off when they shouldn’t, debug the issue. Figure out ways to make metrics less noisy and systems more stable. SLOs should protect you from being bombarded by alerts, not create more noise.

There’s lots more that can be said about SLO monitoring and alerting. Chapter 8 covers this topic in depth.

Exhausting your error budget

If you’ve developed your SLO correctly, you’ve defined clearly what is expected by your users. Breaking those expectations too often will wear on your users’ trust, so exhausting your error budget warrants a response. Ideally this will be a team effort, and all engineers will be accountable for keeping services up and running. If only one person is in charge of maintenance, it doesn’t incentivize others to take reliability seriously.

What happens when you exhaust your error budget right before a feature deadline? Talk about such situations and how to make the necessary trade-offs. Perhaps you have a “thaw tax” that requires additional time be spent on reliability after the launch, as discussed in Chapter 6. Perhaps the response will vary based on the severity of the issue. Maybe small error budget overages can be overlooked, but severe outages require work on reliability to be completed before any further feature work can happen. Remember that it’s always the discussions that matter most.

That being said, exhausting your well-defined error budget should usually trigger some type of shift to reliability work. Use those moments to start a discussion about what is wrong and what can be done to address it. You should take as much time as needed to fix issues and improve the system until your error budget surplus returns. If you have multiple services it might also make sense to refrain from feature work across all of them until things return to a level of reliability the team is comfortable with.
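To make the conversation concrete, it can help to write down how remaining budget maps to a response. The sketch below is one hypothetical policy: the function names, the labels, and the 10% "small overage" cutoff are all assumptions for illustration, and your team should agree on its own values.

```python
def budget_remaining(good, total, slo_target_percent):
    """Fraction of the error budget left: 1.0 untouched, <= 0 exhausted."""
    allowed_bad = total * (1 - slo_target_percent / 100.0)
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad

def next_focus(remaining, small_overage=-0.1):
    """Map remaining budget to a team response (cutoffs are hypothetical)."""
    if remaining > 0:
        return "feature work"
    if remaining >= small_overage:
        # Budget is exhausted, but only just: worth a discussion.
        return "discuss: small overage"
    return "reliability work"
```

With a 99% target over 10,000 requests, 50 failures leave about half the budget and feature work continues; 105 failures are a small overage to talk through; 200 failures call for a shift to reliability work. The code is the least important part here. What matters is that the cutoffs were agreed on before the budget ran out.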

If you find your applications are breaking SLOs and there’s a lack of urgency to repair the situation, it might be a sign that you need to make some adjustments. Perhaps you aren’t measuring the right things, and your SLIs need to change. Perhaps there needs to be a conversation with the team about how the SLO process is working to ensure people are still on board. Perhaps your SLO is higher than users really care about. For whatever reason, that level of reliability isn’t important to the team or to your users, so it’s best to adjust to remain more realistic about what you want to maintain.

Using surplus error budget

Your team is amazing, and you never break your SLOs—great! What do you want to do with that excess budget?

Maybe there’s a large migration that would be much easier if all systems went down. Maybe you want to do chaos testing and introduce random errors to see what happens. Maybe you want to try a new implementation of an algorithm. There are all sorts of options!

Using up your excess error budget is the fun part of SLOs. It’s a rare thing to have permission to fail. Lots of things become easier when you’re able to accept a certain level of failure. That big risky change that might break everything? Ship it! You can learn so much from pushing things straight to production.

If the thought of doing this scares you, that might be another opportunity to use your error budget. Spend it developing infrastructure to quickly roll back broken changes, or experiment with shipping bugs to see how well you’re protecting against failure. Practice recovering from controlled failures so you are better prepared for real outages.

Iterating on Your SLO

Chapter 14 covers how to iterate on your SLO in great detail, but here are some starting points.

SLOs are a process, not a project. As such, your first SLO is likely not going to be great. It will be way too ambitious, not ambitious enough, or not what customers want. You’ll realize your SLIs are poorly implemented and incorrect. Your monitoring hasn’t been correctly configured, and you aren’t collecting the data you thought you would. A dependency goes down all the time, and you aren’t in control.

All sorts of things can go wrong. But that’s okay! We’re allowed to fail.

Decide with your team a cadence for reevaluating your SLO. Take turns reexamining it so that everyone gets exposure to working with the SLO. At the start this should be done at least once a month, with longer gaps as you gain more confidence in your SLO.

When reevaluating your SLO, look at things holistically:

  • Are your SLIs correctly identifying issues?

  • Is your monitoring correctly reporting the state of your system?

  • Is your SLO actually aligned to what your customers care about?

  • Is the SLO giving your team more confidence in the system or becoming a distraction?

There are all sorts of things to think about.

Iterate on your practices, too. Are people happy with the alerts? Does it feel like reliability is being worked on when it’s needed? Is this process helping you or becoming dogmatic? Are you finding you can ship features faster? Are these metrics useful?

Your goal is to build a culture of SLOs, not build out SLOs. If something is not working, it’s worth taking a pause to figure out why. Steamrolling concerns will get you nowhere in getting people on board with SLOs.

Don’t think that you’ll be able to book off a week, knock out some SLOs, and call it a day. Solid SLOs are built from continual iteration and evolution with the service you’re building. As such, don’t get hung up on making things perfect. At the start, cut some corners and make risky bets if need be. You can always patch things up down the road if they aren’t working.

Determining When Your SLOs Are Good Enough

At what point do you stop? What is good enough?

As we’ve said before, your SLOs are not a project, so they’ll never be “good enough.” That being said, SLOs should eventually get you to the point where your service is just reliable enough to make your users happy. You should be able to detect issues before floods of users come complaining to you. You should have enough error budget that features can be deployed without overthinking every potential consequence. You should be able to do fun and interesting experiments with your error budget. You should be able to develop systems that expect and manage failure. You should have systems that self-heal and are easy to roll back. You should end up with a system that is reliable.

Chapter 14 is a deep dive on SLO evolution and will cover these topics in more depth.

Advocating for Others to Use SLOs

After you’ve established an SLO for one service, you’ll have more solid footing for applying these concepts elsewhere.

Maybe your SLO will help identify poor reliability of a downstream dependency that’s impacting your ability to meet your goals. Advocate for that service to implement an SLO so that you can provide better service to your users.

Maybe your SLO will reduce the number of alerts your team needs to respond to. Look at your other projects generating endless alerts and see where these learnings could be applied there.

Maybe your SLO has helped you realize you can ship much faster than you have been. Look at which projects you’re nervous to deploy and see if SLOs can help increase your velocity.

Maybe your SLO has increased team satisfaction—people are happier not getting paged for every issue. See if any other projects despised by the team could use a more realistic set of expectations.

Or maybe the effects will be something else entirely. Whatever the results might be, look for ways to communicate that value and get others on board.

Some of these things can seem like bold statements to make. Who are you to say that everybody on your team is better off after implementing SLOs? In those cases, talk about your own experience. Do you feel more confident in your projects and happier with what you’re working on? Do you think there are better things to be done?

A great way to keep SLOs top of mind is to build the conversation into your team dynamics. Schedule some time in your weekly meetings to discuss the state of various SLOs. Ask others how their services are striving to be more reliable and hold them accountable for making reliability a first-class citizen in their applications.

Making the shift to SLOs starts on your team, but there’s no need to stop there. Chapter 16 talks more deeply about how to shift this conversation from your team to your organization at large.

Summary

In this chapter we discussed the importance of building a culture of reliability. Reliability is the number-one feature of your service and not worth leaving to chance. Creating a culture of reliability is about being intentional about what reliability means for your team and efficiently delivering that to your users.

The math and theory behind SLOs are thoroughly discussed in the rest of this book; here we focused on the conversations required to get SLOs prioritized and make them part of your team’s process. We discussed getting buy-in from the team, assigning the work to implement SLIs, deciding on targets for SLOs, using your SLOs, iterating on your SLOs, and finally spreading the SLO culture beyond a single team/project. Hopefully this has made clearer the challenges you’ll need to overcome as you shift to an SLO process and the ways to respond to roadblocks as they arise. Although SLIs and monitoring can be added to a service with a bit of code, reprogramming how your team thinks about reliability will likely be a larger endeavor.

This chapter should also remind you that at the end of the day, SLOs are about people. Creating a culture of SLOs is about making your users and your team happier. The process has been validated; your job is to figure out how to make this process fit in with your existing team dynamics. Should you encounter resistance, listen to your stakeholders until you can figure out a way forward that works for everyone.

SLOs are a process, not a project. They won’t stick overnight, but hopefully the content in this chapter has given you a better sense of how to circle back and iterate on these approaches until things begin to click.
