Chapter 15. Discoverable and Understandable SLOs

An SLO-based approach to reliability works best when everyone is on the same page. You need to ensure that everyone involved within your organization buys into the process in much the same way. Each team should feel free to adapt the system to how they work, what their services are like, and what kinds of data and measurements they have available to them. However, they still need to ensure they’re using terminology and definitions that are consistent with everyone else, so that members of other teams can easily understand them. This is important because when things become too divergent, you lose one of the primary benefits of SLOs: the ability for others to quickly and intuitively discover how you’re defining the reliability of your service, as well as its current state.

In addition to making sure that your SLO definitions are consistent, you need to make sure that others can discover them in the first place. A beautifully crafted, detailed overview of the definitions and statuses of your systems’ SLOs means nothing to others if they can’t find it.

This chapter focuses on best practices and other tips for ensuring your SLOs are as discoverable and understandable as possible.

Understandability

When defining SLOs, you need to make sure they are understandable by anyone who might be involved. This starts with your own team. The SLIs, SLOs, and error budget policies you define need to be agreed upon by everyone responsible for managing your service, and you can’t have proper agreement if they don’t all understand exactly what is being said.

Additionally, you need your SLOs to be understood by engineers on other teams, who might either be direct human users of your service or be operating services that depend on yours. They need to know how your service is performing, and that isn’t possible if they don’t understand how you’re choosing to measure things (how to spread this kind of understanding is covered in depth in Chapters 13 and 16).

The data about your service’s reliability needs to be usable by people well beyond engineering teams, too. This isn’t always easy—in some organizations these sorts of measurements are considered guarded data that doesn’t warrant being shared. However, if you can push for a culture change, everyone will get along better.

For example, product teams need to know:

  • That operational teams are trying to understand things from a user perspective

  • How reliable a service is aiming to be

  • How reliable a service has recently been, or currently is

The leadership chain needs to understand these same things, or it won’t be able to allocate resources to the places that need them most. Your customer support teams likewise must be able to internalize SLO-related definitions and data so that they can properly communicate reliability status to paying customers. These same teams will likely be the first to know if your SLO is out of alignment with what your customers expect, so they also need to have a way to communicate this back to engineering teams.

Consistency is invaluable to ensuring that your SLOs are understandable by everyone. This is where templatized SLO definition documents come into play. You can find an example SLO definition template in Appendix A.

SLO Definition Documents

SLO definition documents are incredibly important to having discoverable and understandable SLOs. Even though (as discussed in Chapter 14) SLOs can and should change over time, they also need to be formalized at any specific point in time. An SLO definition document is the clearest and easiest way of accomplishing this: you need to make sure that your SLOs are formalized in writing somewhere, and by using templates you ensure that they’re written down in the same way. This makes it easy for everyone to follow them. Let’s outline some of the most important parts of what such a document needs to contain.

Ownership

The first thing people should be able to understand about your SLO definition is who is responsible for it, and who calls the shots when it comes to determining future work based upon error budget status. Ownership of an SLO is an important part of how the entire process works, since someone needs to be in charge of making the decisions that SLOs give you the data to make. For services owned by a single engineering team, the owner should likely be the entire team, since individual people might transfer teams or even leave the company entirely. However, there is nothing necessarily wrong with choosing a single primary stakeholder. In fact, this is the recommended process to follow for services that encompass many other services, because there isn’t often a single team responsible in these situations.

In such cases, you might have to bring in ownership at higher levels. For services that represent many aspects of an organization, or perhaps even your entire customer-facing product, the correct owner should be a director, a vice president of engineering, or even the CTO or CEO themselves. Whoever the owner is needs to be able to mandate change. If the SLO in question is for your entire customer-facing product, that has to be someone near the top of your org chart.

At the end of the day, someone has to be responsible for how things are being measured, how decisions are made, and how resources are allocated—but this doesn’t have to be a single person, or even a single team. Ownership of SLO definitions should be assigned to the most appropriate people, even if they aren’t directly working on the operations or contributing code to the service at all. Leadership should not be excluded from these conversations.

Note

SLOs are about having better data to help you make better decisions. If you’re establishing targets for your entire company to follow, there is nothing wrong with identifying leadership to be responsible for reacting to this data. Just make sure they’re aware of that role and on board with the concept. If you need to shift an entire engineering organization’s aims for some period of time, you need an owner who has the power to make this happen. But, of course, you can’t just sign someone up without them understanding what this all entails. Chapter 6 covers how to have those conversations and get buy-in.

By establishing formal ownership, you are not just communicating who is responsible for making decisions about the reliability of a service; you’re also letting people know who to reach out to in case they have qualms or disagreements about the SLO. If the owner of the SLO is not the author of the definition document or the team responsible for the service, that should be made clear. You should also provide some indication of how to contact the owner. This could be anything from an individual’s email address or a team mailing list to a ticket queue or a chat channel. Use whatever communication channels are most commonly used at your organization.

Approvers

All SLOs should have a number of approvers, and these should be sourced from across all stakeholders. A single person or a single team’s take on measuring reliability is, from the perspective of the service’s users, often insufficient. Having a diversity of voices involved in setting objectives and determining how to measure them is vital—what might seem important about a service will often differ greatly from the engineering team, to the operational team, to the product team, to customer support, and so on.

An approver is different from an owner in a few important ways.1 An approver is someone who has reason to care about how an SLI/SLO/error budget is defined, but is not someone who can mandate a change in focus if the error budget is being exceeded by too much. Additionally, they’re often a stakeholder, but are likely not in the direct management chain of the team at the base of the service in question. This doesn’t always have to be true, of course—sometimes you’re setting SLOs for an entire company’s product.

Although the exact details of who needs to be involved with the approval process of an SLO definition should vary depending on the nature and size of the service itself, here are some best practices to follow:

  • No SLO definition should ever be approved by just a single person or team. Keep in mind other teams, such as product and customer support.

  • You should always ensure that at the very least you have a senior engineer or two from outside of the responsible team involved with the process. Individual teams can become blind to how others depend on them.

  • If your service is depended upon by many services and teams, you should have representatives from a reasonable number of those teams. For example, if three other services depend on yours directly, it makes sense to find approvers from the teams responsible for those three services; however, if you are in charge of the database platform that every microservice at your company relies upon, you probably don’t need to find approvers from every single team.

Finally, ensure that a certain number of members of the team responsible for this service have approved the definition as well. This might seem obvious, but you need to make sure that your list of stakeholders includes the people that directly take care of the system on a daily basis, through operational and code contributions alike.

Your definition document should include a list of all the approvers and an indication of whether and when they have signed off on the current version of the SLO definition.

Definition status

An important aspect of your definition to communicate is its status—not the status of your SLO or error budget, but of the definition itself. For everything to be as understandable as possible, you should have a number of dates documented.

The first date you may want to include is the original proposal date. This lets people know how long this service has been managed with an SLO-based approach and provides an indication of the maturity of your SLO.

A very important date to provide is the last-updated date. As we’ve discussed at great length throughout this book, SLOs are always evolving and changing, and it’s important that stakeholders know when you last revisited the various aspects of your SLO definition. This date should be updated every time a revisit occurs, even if no changes to the definition document are made. This allows people to discover not only the last time things may have changed, but the last time they were reviewed.

Tip

Ensure that previous versions of your definitions are still available to people as well, either through explicit mentions of older versions and changes made in your rationale section or, at the least, via a code repository history.

In addition to or instead of the original proposal date, you may want to provide an approval or implementation date. This tells people when the ideas behind the objectives were actually put into practice and allows them to understand how long the measurements have been taking place (bearing in mind that this may have begun some time after the initial proposal date).

Finally, you should mention the next revisit date. All SLOs should be revisited at some interval, so stakeholders need to know not only when the last review of your SLO definition was held, but also when the next one will be conducted.
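
To make the dates above concrete, here is a minimal sketch of how they might be tracked if your definitions live alongside code. The class name, field names, and the 90-day interval are all hypothetical choices for illustration, not part of any standard template.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class DefinitionStatus:
    """Dates describing the SLO definition itself, not the SLO's performance."""
    proposed: date                      # original proposal date
    last_reviewed: date                 # updated on every revisit, changed or not
    next_revisit: date                  # when the next review is due
    implemented: Optional[date] = None  # when measurement actually began

    def record_review(self, on: date, interval_days: int) -> None:
        """Record a revisit and schedule the next one."""
        self.last_reviewed = on
        self.next_revisit = on + timedelta(days=interval_days)


# Hypothetical example values.
status = DefinitionStatus(
    proposed=date(2024, 1, 10),
    implemented=date(2024, 2, 1),
    last_reviewed=date(2024, 2, 1),
    next_revisit=date(2024, 5, 1),
)
status.record_review(on=date(2024, 5, 1), interval_days=90)
print(status.last_reviewed, status.next_revisit)
```

Note that `record_review` moves the last-reviewed date forward whether or not the definition text changed, which matches the idea that a review with no edits still counts as a review.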

Service overview

A simple summary of what your service actually does is an important part of your definition document. You shouldn’t assume that everyone interested in your SLO definition or status necessarily knows exactly what it is that your service does or how it is architected. This section does not have to be lengthy, but it should be comprehensive enough to allow anyone from any department of your company to understand the basics of the purpose and functionality of the service being described. A paragraph or two with accompanying links to more in-depth documentation is recommended here.

An example summary could be something as simple as: “This microservice provides an interface for processing user logins. It is used by all customer-facing services and validates login information stored in an encrypted database.” If you’re responsible for the database behind this service as well, you should expound on this and add some additional details. For example: “This database additionally stores login history and metadata of customer transactions. Due to GDPR laws we are careful about our retention period and only hold data for 28 days.”

Be detailed enough to ensure you’re not overlooking any important service features (especially those related to compliance or security), but succinct enough that any visitor can quickly understand the functionality and purpose of the service in question.

SLO definitions and status

It probably comes as no surprise that SLO definition documents have to contain the definitions of the SLOs themselves. In this section, you need to describe in the simplest fashion exactly what your SLIs are, what your SLOs are, what your target percentages are, and so forth. While you should focus on using simple language so that people not familiar with your service can understand what you’re measuring and how, it is also often useful to have a second definition that is more technical.

For example, for each SLO, you might have one section written in easy-to-understand language, and then a second one that outlines the exact query or math you’re using to measure the results. The former allows people to understand what you’re measuring, and the latter allows people to understand how you’re measuring it. People who just care about what your definition is can focus on the text description, whereas people interested in the exact mechanics involved can focus on the second part. An example of the sentence-based definition could be something like, “We will serve 200 HTTP responses within 500 ms to 99.9% of incoming requests,” while the query-based definition might say something like sum(http_requests{status="200"}) / sum(http_requests).
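
Note that the example query above counts only status codes, while the sentence-based definition also includes a latency threshold. As a rough sketch of what the full measurement implies, the following Python computes the ratio of good requests using both conditions. The `Request` record and the sample data are assumptions made up for illustration; your real measurement would come from your own monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned
    latency_ms: float  # time taken to serve the response


def sli_ratio(requests, max_latency_ms=500.0):
    """Fraction of requests that were good: a 200 served within the threshold."""
    if not requests:
        return None  # no traffic, so the SLI is undefined rather than 0 or 1
    good = sum(
        1 for r in requests
        if r.status == 200 and r.latency_ms <= max_latency_ms
    )
    return good / len(requests)


# Hypothetical sample data.
sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(200, 910.0),  # right status, but too slow
    Request(500, 35.0),   # fast, but an error
]
ratio = sli_ratio(sample)
print(f"SLI: {ratio:.1%}, meets 99.9% target: {ratio >= 0.999}")
```

Whatever form your technical definition takes, the point is that it should measure the same thing the plain-language definition promises.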

You should also provide a link in this section that allows people to discover the current status of your SLO and your error budget. In the best possible scenario this information should be presented in a dashboard, or something similar (we’ll discuss building SLO dashboards and how to discover SLO and error budget status in more detail in “Dashboards”). The important part for building your template is that this should be a link to a dynamic data source, unless you can embed dynamic data directly in your document system. Avoid relying on a human trying to keep this information up to date in the document itself.

Rationale

Your SLO definition should also include an explanation of why you’ve chosen the SLIs and SLOs that you have. This is a very important data point for people on other teams, and one that the owners of an SLO definition may never think to provide. You can stave off many questions by documenting your decision-making process for others. Ensure that people not only know what your definitions look like today, but have some idea of how they came to be.

Revisit schedule

You should lay out clearly what your revisit schedule is, and explain why you’ve chosen the schedule you have. Is this a new definition that will require frequent review? Is it an established measurement that is trusted and only needs to be revisited quarterly, or even yearly? Make sure your justification is sound and your rationale is clear.

Error budget policy

As we discussed in detail in Chapter 5, having an error budget policy can help jumpstart the conversation that examining SLO and error budget status necessarily implies. This section is likely going to be much more free-form than other sections of your document, because it’s going to be much more specific to your exact service and team. You don’t need to try to plan out every possible scenario here, but this is a great place to lay some foundations. For example, you might say things like:

  • If 50% of the error budget has been depleted, the team will have a meeting to determine how to proceed.

  • If the error budget has been entirely exhausted, the team will dedicate its next sprint to the reliability backlog.
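
If you want tooling to surface these suggestions next to your error budget status, the two example rules above could be sketched as a simple lookup. The function name and the "no action needed" default are assumptions added for illustration; the thresholds are just the ones from the example.

```python
def policy_action(budget_consumed: float) -> str:
    """Map the fraction of error budget consumed to a suggested action.

    0.0 means the budget is untouched; 1.0 or more means it is exhausted.
    """
    if budget_consumed >= 1.0:
        return "dedicate the next sprint to the reliability backlog"
    if budget_consumed >= 0.5:
        return "hold a team meeting to determine how to proceed"
    return "no action needed"  # assumed default, not part of the example policy


for consumed in (0.2, 0.5, 1.3):
    print(f"{consumed:.0%} consumed: {policy_action(consumed)}")
```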

Remember that you don’t have to adhere to whatever you define here; the idea is just to have some suggestions to look to when having discussions about project priority. When issues come up, it can be tremendously helpful to have some sort of starting point mapped out.

Note

This doesn’t mean that you can’t have strict rules, of course. Perhaps your SLO is tied very directly to an SLA, in which case you might not have a choice but to pivot people to defending the error budget. Or perhaps you’ve been at this for a very long time and trust your numbers very much, and you have evidence that having two engineers spend their next sprint focused on reliability issues makes sense when you’ve exceeded your error budget. Strict rules are fine if they work for you, but they’re not mandatory to implement the philosophies at play.

Phraseology

Phraseology is important to creating understandable service level objectives. The basic idea behind phraseology is that words have different meanings when they’re used next to other words, or when they’re used in the same way often enough, and knowing these things can help you communicate ideas better to certain populations. Using the concepts behind phraseology, you can make your SLO definitions more understandable.

Using a definition template is the first step. If every SLO someone encounters is defined in the same manner, they’ll be more easily understood the more of them that person reads. But you can go a step further by also ensuring that people use the same terms and phrases to indicate the same things. For example, “SLO” or “service level objective” is generally understood to mean the entire defined system, from SLI to error budget window, while “SLO target percentage” is understood to be just the actual number that represents what kind of reliability you’re aiming for.

Whether you develop a glossary for people in your organization to refer to is entirely up to you. In my experience those kinds of lists end up being ignored as time passes, so the effort does not always pay off in the long run. The main thing is to be aware of how humans think about words, phrases, and acronyms, and build your templates in a way that promotes a unified language.

Discoverability

Once you can trust that your SLO definitions are as understandable as possible, you also need to make sure that they’re discoverable. No matter how well written your document is, it’s no help to anyone if they don’t know that it exists, or can’t find it. Additionally, interested parties need to be able to discover the status of your SLOs, not just their documentation.

Document Repositories

The best way to ensure SLO definitions are easily discoverable is to ensure they all live in the same place. Having a centralized repository for all SLO documents at your company or organization is the most painless way to do this. How to set this repository up will depend a lot on how you handle the rest of your internal documentation, but we’ll outline a few good options here.

The simplest way to handle this is to just create a folder or a tree of folders in a documentation system like Google Docs or Office 365. If you have a lot of SLOs, you might want to organize them by department, product, or team; however, unless you trust the search functionality available to you, you should also have a single document somewhere that lists every single definition. While this might seem unwieldy, there is a benefit to allowing someone to search for a text string in a single location to find what they’re looking for. You shouldn’t always assume that everyone will know who owns every service; in fact, one of the biggest problems at tech companies in general is that no one knows exactly who owns what. Additionally, team and organization names are prone to change, while service names often stay the same over their life cycle.2

A slightly better option is something like a wiki system. Whether you choose to use the wiki as a repository that links to documents that live elsewhere or actually make the definition documents available there is up to you. In either case, wikis have been built from the ground up to facilitate discoverability by having the built-in concept of spaces to contain related material. As we discuss in further detail in Chapter 16, you’re going to want a lot of documentation about how SLOs work for your organization, and a wiki-like system is great for ensuring your document repository is available right next to all of your other SLO-specific documentation.

Documentation-as-code is another great option, especially if you use it to feed into a wiki-like system. Templates can be a little more difficult to use with markup languages like Markdown, but the history you get from storing your documentation in a code repository can be excellent. Additionally, assuming your actual SLO configs, metrics definitions, and monitoring queries also live in a repository, you can ensure the two are linked in a way that guarantees your documentation does not drift from what your tooling is actually measuring and calculating.

Do be aware, however, that not everyone at your company or in your organization may be comfortable with things like code repositories or Markdown as a text formatting language. Try to have a low barrier to entry for everything instead of relying on state-of-the-art engineering solutions.

Discoverability Tooling

Even better than using something like a wiki system by itself, you could develop custom tooling to keep your repository up to date. Today, almost any documentation system you might use has an API you can interact with. You can use this to build tools to automatically scan through your documents and populate the repository landing page itself. Writing some software is often the best way to ensure there’s no drift between reality and what has been documented.3
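
As a minimal sketch of this kind of tooling, the following assumes your definitions are stored as Markdown files in a directory tree and that the first non-empty line of each file is its title. Both of those are assumptions for the sake of the example; a real version would talk to whatever API your documentation system provides.

```python
import tempfile
from pathlib import Path


def build_index(root: Path) -> list[str]:
    """Return one landing-page line per definition file found under root."""
    lines = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        # Use the first non-empty line as the title; fall back to the file name.
        title = next(
            (ln.strip("# ").strip() for ln in text.splitlines() if ln.strip()),
            path.stem,
        )
        lines.append(f"{title} -> {path.relative_to(root).as_posix()}")
    return lines


# Build a throwaway directory with two hypothetical definitions and index it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "payments").mkdir()
    (root / "payments" / "checkout-latency.md").write_text(
        "# Checkout latency SLO\n", encoding="utf-8"
    )
    (root / "login-availability.md").write_text(
        "# Login availability SLO\n", encoding="utf-8"
    )
    for line in build_index(root):
        print(line)
```

Because the index is generated from the files themselves, a new definition shows up on the landing page without anyone having to remember to list it.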

In addition to making sure that your SLO measurements are in sync with your documentation, and ensuring that all of your SLO documents are discoverable by programmatically aggregating them in the same location, this kind of tooling can provide you with other benefits. If you have a system that already scans through your SLO definitions, you can have it analyze the data contained within them as well.

For example, you could have your tooling autopopulate your repository with dates like when an SLO definition was last updated, or when the next revisit date is. This can allow stakeholders to quickly glean this information at a glance. You could also use this to automatically mark documents as past due or even open a ticket in a team’s queue if they’ve missed their revisit date. You can set SLOs for the freshness of your SLO definitions!
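
A rough sketch of the past-due check might look like the following. The dictionary keys, service names, and dates are hypothetical; in practice the records would come from whatever scans your definition documents.

```python
from datetime import date


def overdue_definitions(definitions, today):
    """Return (service, days_overdue) pairs, most overdue first."""
    late = [
        (d["service"], (today - d["next_revisit"]).days)
        for d in definitions
        if d["next_revisit"] < today
    ]
    return sorted(late, key=lambda item: item[1], reverse=True)


# Hypothetical records, as a scanner might extract them.
definitions = [
    {"service": "login", "next_revisit": date(2024, 3, 1)},
    {"service": "checkout", "next_revisit": date(2024, 6, 15)},
    {"service": "search", "next_revisit": date(2024, 4, 20)},
]
for service, days in overdue_definitions(definitions, today=date(2024, 5, 1)):
    print(f"{service}: revisit overdue by {days} days")
```

The same list could just as easily drive a "past due" label on the landing page or a ticket in the owning team's queue.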

SLO Reports

Creating a centralized repository for your SLO definitions will help with discoverability, but you can also proactively take steps to increase awareness of SLOs, their definitions, and their status. It’s a great idea to regularly report on the definitions and statuses of the most important SLOs in your organization. This could take the form of anything from a weekly or monthly email to a scheduled visit to a repository or dashboard during meetings. Chapter 17 discusses how to report on SLOs in much greater detail, but don’t forget that reporting on them is a way to make them discoverable in the first place.

Dashboards

In addition to reports, you should have dashboards that help people discover various aspects of your SLOs. In a way, you should view these dashboards as a second repository, since they should also be grouped together in some way.

SLO dashboards don’t have to be complicated, and they really only have to consist of a few items in order to be effective. Before we look at a visual example with real data, let’s talk about the components that should exist on dashboards such as these.

The first thing your dashboards need to show is the current status—that is to say, “Is this service at this moment meeting its objectives or not?” This is most often best represented by a binary value or string, and not a graph. People should be able to glance at the dashboard and immediately know what the current state of the world is. Most large SaaS or cloud platform companies, like Google, Amazon, and Microsoft, have these sorts of status pages for their external customers to discover, and many different vendors exist that will host status pages for you.

Note

A status page is a prime candidate to be handled by a vendor, because if your own infrastructure is having problems you want to ensure you can still publicly post your service status somewhere. It can be pretty embarrassing if you’re down so badly that you can’t even let anyone know that you’re experiencing problems.

Next, you should have an SLI violations graph. This allows people to discover the exact moments when your measurements determined that your service was unreliable versus when it was reliable. Having this kind of data presented over time can make it easier to correlate SLI violations with other events.

Another key feature of a useful SLO dashboard is a “burndown” graph. This shows how reliable you’ve been in terms of percentage over time. Closely related is your error budget status. A panel showing how much error budget you have remaining—whether positive or negative—allows stakeholders to quickly ascertain the state of your service. I personally think it’s a great idea to represent both of these values as a line graph as well as with a single numeric value showing where you stand at this exact moment in time.
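
To illustrate the data behind an error budget panel, here is a small sketch that turns a daily count of "bad" minutes into a remaining-budget series. The function name and the sample numbers are made up, and the budget is rounded to whole minutes to keep the arithmetic easy to follow.

```python
from itertools import accumulate


def budget_burndown(budget_minutes, bad_minutes_per_day):
    """Remaining error budget after each day; negative once it is overspent."""
    spent = accumulate(bad_minutes_per_day)
    return [budget_minutes - s for s in spent]


# Hypothetical daily totals of minutes spent in violation of the SLI.
daily_bad_minutes = [0, 2, 0, 5, 1]
remaining = budget_burndown(43, daily_bad_minutes)

print("line graph series:", remaining)
print("single value right now:", remaining[-1])
```

The full list is what you would plot as a line, and the last element is the single number to show beside it.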

Finally, these dashboards need to include links to the SLO definition documents in question. People should be able to discover the specifics of what is being presented on the graphs and panels in front of them without having to do much research.

Now that we’ve outlined what sorts of items you might have on an SLO dashboard, let’s look at an actual visual example in Figure 15-1.

Figure 15-1. An example SLO dashboard for a healthy service

This screenshot shows some example output measuring the latency of responses to human-generated requests to a website. The target SLO percentage for this service is 99.9%, and you’ll notice that it has been operating reliably for some time. The service uses a 30-day error budget window, which means you get around 43 minutes of error budget per month. Knowing that you’re sitting at 37 minutes remaining is a good indication that you probably don’t need to allocate any new resources to improving the latency of this service. You could quickly use this kind of dashboard in a weekly sync or something similar to make that case.
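
The "around 43 minutes" figure follows directly from the target and the window. A quick sketch of the arithmetic, using a hypothetical helper function:

```python
def error_budget_minutes(target_percent: float, window_days: int) -> float:
    """Minutes of allowed unreliability for a target over a time window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


budget = error_budget_minutes(99.9, 30)
remaining = 37  # minutes remaining, as quoted in the text above
consumed = (budget - remaining) / budget

print(f"budget: {budget:.1f} minutes")
print(f"consumed so far: {consumed:.0%}")
```

A 30-day window contains 43,200 minutes, and 0.1% of that is 43.2 minutes.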

Figure 15-2 shows what such a dashboard might look like for an unreliable service.

Figure 15-2. An SLO dashboard showing an unreliable service

Instead of measuring response latency, this SLO is concerned with the start time associated with Kubernetes pods. In this case the target is 99.95%, and it looks like things haven’t been going so great recently. The measurements are currently in violation of their targets, and the error budget (just shy of 22 minutes per 30 days in this case) has been completely depleted and looks to be continuing to drop. With an 11-minute deficit incurred over the course of almost two entire weeks, this probably isn’t a complete emergency, but it’s likely that one of two things needs to happen:

  1. Assign engineers to figure out what can be done about improving startup time.

  2. Reassess whether your SLI measurements and SLO target are reasonable.

Humans are often great at spotting trends in a visual manner, which is why dashboards are so popular for many uses across the industry. You should make sure that all of your data is discoverable via nonvisual means as well, of course—you should always be thinking about accessibility—but the benefits of having SLO status dashboards are innumerable for many organizations. If SLOs are about providing you with better data to make better decisions, then you need to ensure that this data is presented in a format that the humans involved can use to make those decisions.

Summary

Reliability requires people to know what’s going on, and SLOs provide a clear, customer-centric picture that speaks a thousand words. This is true whether these people are on the team responsible for a service, at the highest levels of the leadership chain, or anywhere in between. You can simplify all of this by doing just a few things: creating intuitive and informative SLO definition templates; ensuring that these filled-out templates are discoverable; using tooling to validate the status of various SLOs; and creating reports and dashboards that make it easy for stakeholders to discover not just how things are defined, but also the current and historical reliability status of a service.

1 A good way to think about these sorts of differences is by using a responsibility matrix. I personally like RACI, which is a methodology that helps you define who is Responsible, Accountable, Consulted, and Informed about any particular project. As always, use what works for you, but it can be useful to have a defined system in place to assign and define these sorts of responsibilities.

2 Of course, service names sometimes do change, and sometimes multiple services are combined into a new one. The point is just that you need to be aware of these sorts of changes and try to be proactive in preventing them from confusing people down the road.

3 You can even define an SLO for this tooling!
