Chapter 3. Developing Meaningful Service Level Indicators

The single most important aspect of adopting an SLO-based approach to reliability doesn’t even involve SLOs at all. SLIs are the most vital part of this entire process and system. There are several reasons for this, all of which boil down to the fact that human happiness is the ultimate end goal of using SLO-based approaches to running reliable services.1 You can make lots of lives better if you take the time to develop meaningful SLIs.

SLIs are the foundation of the Reliability Stack. You can’t have reasonable SLO targets or useful error budgets if your SLIs aren’t meaningful. The entire stack becomes useless if it hasn’t been built upon something solid. You want to be able to have meaningful discussions and make meaningful decisions with the data an SLO-based approach can give you, and you won’t have good data to use if the measurements at the bottom of your stack aren’t good ones. Remember we defined a meaningful SLI as “a metric that tells you how your service is operating from the perspective of your users” in Chapter 1. It is this kind of meaningfulness that needs to travel up through the rest of your stack so you can make the most meaningful data-driven decisions with the data you produce.

Furthermore, your service isn’t reliable if your users don’t think it is. While many services only directly have users that are other services, there are still humans involved at every step along the way. Even if you’re responsible for just a single microservice that is depended upon by just a single other service, those two services both still have humans that maintain them and are responsible for them.2 Taking a step back and measuring what your users need you to measure is an important step to understanding reliability. Even if you never end up with SLO target percentages or decision making based upon error budgets, you can still think about your users.

Sometimes you might say that the user is simply wrong in terms of what they expect from a service. You might have financial or technological constraints they aren’t aware of, or you may have miscommunicated what it is that your service intends to do. This is an unfortunate reality, but even here meaningful SLIs can help you. By exposing how you’re measuring the reliability of your service, you can communicate to users what they can expect. This could come with an implicit promise that you’ll try to meet their needs at some point, or it could simply communicate to your users that they are trying to fit a square peg into a round hole. In either case, you’re helping humans have a better understanding of the state of the world and the direction it might be headed in.

At some level all computer systems exist to perform some sort of task for people, and you can help ensure the operators, the maintainers, and the users of these systems have a better experience and view of things if your system measurements are built from the ground up considering what they need.

This chapter talks about the concepts behind developing meaningful SLIs. Chapters 7 and 9 discuss the technology and math behind measuring them.

What Meaningful SLIs Provide

Though some of the benefits of meaningful SLIs may seem self-evident, we haven’t really spent much time talking about why they can be so important to you, your users, and your business. Measuring things from your users’ perspective might be an intuitive thing to aim for, and adopting such an approach pays off in several concrete ways.

Happier Users

It’s not a difficult argument to make that positioning your users as your primary focus may mean that your users end up happier. It can be easy to get caught up in the day-to-day, quarterly OKRs, or various other business needs while forgetting about what it is that your service actually needs to do. Some engineering teams are very far removed from the users of their services, to the point that it can be easy to forget that users even exist at all. The people writing the code or running the operations for software services haven’t traditionally been closely involved with conversations that involve the needs of users, whether these are internal or external.

By shifting your thinking away from what you think your service needs to do and toward what the users of the service need it to do—and encouraging others to do the same—you can build and observe better telemetry about these user needs. It doesn’t matter if the users are represented by a single other internal service maintained by a team across the hall, or if there are hundreds of thousands of external, paying customers—by making any kind of user the focus of how you think about your service’s reliability, you’re almost certainly going to make those users happier. You’re now thinking about them, which is a great first step.

Happier Engineers

Figuring out how to properly measure user-focused SLIs is often more complicated than just relying on traditional system resource metrics. This means that being able to expose or collect the metrics you do need might require nontrivial amounts of code to be written. It might mean standing up entire new services that exist only to help determine what this telemetry looks like. But even if you initially have to ask for more work to be performed, at the end of the SLI adoption process you’ll likely end up with happier engineers, as well.

One of the primary ways that this can happen is that you can stop alerting on almost everything else you have alerted on in the past. You should absolutely still continue to collect metrics and other telemetry about the basic status of the services you run, such as the state of the jobs that back your service, their error rates and stack traces, their response latencies, and even their resource utilization. These are all incredibly important pieces of data to have access to, but they’re not always good pieces of data to alert or page off of.

If you can develop meaningful SLIs, the only reason you have to wake someone up at 03:00 is when that SLI isn’t performing correctly. It doesn’t matter how many errors are in your logs, what latencies your database queries are currently experiencing, or how many of your jobs are currently crash-looping. If your service is still able to perform its job reliably as determined by comprehensive SLIs, then those are all problems that can wait until normal work hours—if they even have to be addressed at all.

Tip

Develop good SLIs so your engineers have a single thing to point at in order to determine what constitutes an emergency.

A Happier Business

Your business could have many goals, or just one. Your business may not even be a business in the traditional capitalist sense. Maybe your analogue to what we’re talking about is an open source project, or something provided not for profit or entirely for free. Regardless of how it’s defined, however, there is always some other actor involved besides the people writing and maintaining the service and the people using the service—the people determining what the service should be in the first place.3

Developing meaningful SLIs will make your business happier. Primarily, this is because your engineering organizations can now better align with your product, business, and QA organizations. Remember that at a philosophical level, an SLI is essentially a user journey, which is often just a key performance indicator (KPI), which is often similar to an interface test. Different organizations may place different weights on these, and they’ll certainly use different language to describe them, but it’s important to remember that all of these efforts work better together.

If you can bring about alignment across many organizations about what should be measured, you can also generate new and more meaningful data that allows you to better communicate across these organizations. With a representative SLI you can power an SLO. Then you can use the status of that SLO to report to the product, business, executive, and other teams in your company about exactly how reliable a service has been. Chapter 17 discusses how to do this in great depth.

Developing meaningful SLIs will not only make your business units happier, but can also improve the very health of your business. By making sure you are thinking about your customers—no matter who they may be—from the ground up, you can also make sure that your customers are as happy as they can be. Happy customers and happy employees mean a happy business.

Caring About Many Things

In order to develop meaningful SLIs, it turns out that you often have to care about many things. To best leverage an SLO-based approach to reliability, you need to make sure that you’re measuring a lot of different levels of interaction with your service. Individual metrics about things like system-level resources or error rates are not always enough, even if that’s how you’ve been measuring or monitoring your service in the past. Things could be worse than you realize if you’re just measuring error rates, since those don’t matter if requests aren’t even reaching your service in the first place. Alternatively, the situation could be better than you realize if your focus is on high memory or CPU usage, since those things don’t matter to your users if requests are completing successfully at a high rate anyway.

In many organizations, setting up monitoring for a service is a one-and-done task. You make sure you have system resource metrics, maybe some HTTP response code counts, some access logs, a place to store stack traces, and so forth. Then you build some dashboards you can look at when you get paged, which can happen for any number of situations, none of which may actually map onto what your service needs to do for its users. Chances are you should instead be looking at the complex interactions between the many different components of your systems.

Another example is that people often conflate reliability, availability, and uptime, even though they’re all entirely different things. A service can be running yet not be available to your users. Additionally, it could be available but not operating in a reliable manner. These are all important measures of your service, and you should care about all of them.

Note

There are a few terms that people frequently blend together when talking about how computer services operate. The three most common of these will benefit from concrete definitions to avoid confusion:

Uptime
A measurement of the time an individual service is actually running on a platform
Availability
A measurement of the time an individual service is actually able to respond to requests by users
Reliability
A measurement of the time an individual service is actually able to perform the duties it was designed to do
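As a rough illustration of how these three measurements can diverge, the sketch below computes each of them from the same set of per-minute samples. The `MinuteSample` record and its three boolean fields are hypothetical names invented for this example; how you would actually collect such samples depends entirely on your platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteSample:
    """One hypothetical observation of a service during a single minute."""
    process_running: bool    # the service was running on its platform
    answered_requests: bool  # users could reach it and get a response
    did_its_job: bool        # responses were what the service was designed to deliver


def fraction(samples, predicate):
    """Return the share of samples (0.0 to 1.0) for which predicate is true."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if predicate(s)) / len(samples)


def summarize(samples):
    """Compute uptime, availability, and reliability over the same window."""
    return {
        "uptime": fraction(samples, lambda s: s.process_running),
        "availability": fraction(
            samples, lambda s: s.process_running and s.answered_requests
        ),
        "reliability": fraction(
            samples,
            lambda s: s.process_running and s.answered_requests and s.did_its_job,
        ),
    }


# Ten minutes: always running, unreachable for 2 of them, wrong results for 1 more.
window = (
    [MinuteSample(True, True, True)] * 7
    + [MinuteSample(True, False, False)] * 2
    + [MinuteSample(True, True, False)] * 1
)
summary = summarize(window)
```

In this made-up window the service is up for all ten minutes, available for eight, and reliable for only seven, which is why a dashboard that tracks uptime alone can look perfectly healthy while users are having a bad time.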

In addition, you have to care about many people. These include the people responsible for actually operating your service, the product managers that help define your services, the customer operations staff that have to handle end-user inquiries and complaints, and those end users themselves.

We’ve barely scratched the surface here, but if you want to think about your service from your users’ point of view, you have to care about many things. Luckily, it turns out there are some simple logic tricks you can employ to narrow down what you ultimately need for meaningful SLIs. That is to say, you can care about many things by caring about just a few. Let’s look at a simple example.

A Request and Response Service

Imagine that you’re responsible for a simple request and response API. It doesn’t really matter what this service does, but it has to accept calls from across the network, and then return some data based upon these calls. How do you start to think about the reliability of such a simple service? How do you begin to put yourself into the shoes of your users? Let’s compile a list of questions you need to answer in order to determine the reliability of this service.

The first thing you have to care about is if your service is actually up. If your service isn’t up and running, it clearly can’t do what your users need it to do. So, you should probably have some kind of metric that tells you if things are actually running. If your service is in a crash loop or has been accidentally stopped, it almost certainly can’t do what it needs to be doing. So, the first question on your list is, “Is the service up?”

Asking if the service is up is a fine starting point; however, that doesn’t really matter to your users if the service isn’t also available. Being available and being up might seem like very similar things, but there are important differences. You could have 100 instances of your service all running, but if they aren’t actually available for your users to communicate with, they’re not really doing much at all. So now you have to ask yourself two questions: “Is the service up?” and “Is the service available?”

You can now answer two questions that your users need you to answer, but even if you’re both up and available, it doesn’t matter if you can’t send a response in a timely manner. Exactly what a response time needs to look like depends a lot on what it is that your service does, but no matter what that is, it needs to be in a time frame that’s acceptable to your users. Now you have three questions to ask: “Is the service up, is it available, and is it responsive?”

With these three things under your belt it might seem like you’re getting close to developing a meaningful SLI—but there are still a few more things that are vitally important for the reliability of even the most simple request and response service. For example, even if the service is up, available, and responsive to the user, it’s not a useful service if it only returns errors. Even if those errors are delivered consistently and quickly, too high an error rate cannot be a feature of a reliable service. Add, “Is the service returning an acceptable number of good responses?” to the list.

You’ve now established four things to care about: is the service up, is it available, is it responsive, and is it returning an acceptable number of good responses? Your telemetry needs to be able to report on all four of these, or you can’t be sure your service is being reliable. But even if you can ensure all of that, it doesn’t matter much if the service can’t provide responses in an understandable data format. If someone performs a request call against the API and they receive a response in a timely manner and free of errors, they still are not going to be happy if they can’t understand what that response is. For example, if the user is expecting a response in JSON but gets a protobuf instead, that response isn’t going to be very useful to them. Therefore, you need to add a fifth item to the list of things you need to care about: is the service up, is it available, is it responsive, does it return an acceptable number of good responses, and are these responses in the correct format?

If the service is up, available, responsive, and sending good responses in the correct format, you’re probably pretty close to ensuring the reliability of a request and response API. Except none of that matters if the data being returned is from yesterday when today’s data is expected. So now you also need to ask yourself, “Is the correct data being returned?”
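To make the "yesterday's data" case concrete, here is a minimal sketch of a freshness check. It assumes the response payload is a dictionary carrying an ISO-formatted `as_of` date; that field name and the payload shape are hypothetical and stand in for whatever your own service uses to describe the data it returns.

```python
from datetime import date


def is_correct_data(payload, expected_date):
    """Return True only if the payload carries data for the date the user asked for.

    `payload` is assumed to be a dict with an ISO-formatted "as_of" field;
    both the field name and the shape are hypothetical.
    """
    as_of = payload.get("as_of")
    if as_of is None:
        return False
    try:
        return date.fromisoformat(as_of) == expected_date
    except (TypeError, ValueError):
        # A malformed date cannot be the correct data either.
        return False


today = date(2024, 5, 2)
stale = {"as_of": "2024-05-01", "items": [1, 2, 3]}
fresh = {"as_of": "2024-05-02", "items": [1, 2, 3]}
```

Both example payloads would pass an error-rate check and a format check; only the freshness comparison tells them apart.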

Note

We could continue this thought experiment for a while, but these initial six questions are a great place to start. Conduct this same sort of experiment for your own services.

Measuring Many Things by Measuring Only a Few

It may seem like there are quite a number of things you need to care about in order to determine the reliability of even a simple service. In one way, this is true: you need to be collecting at least six metrics so that they can be analyzed and examined when needed. However, not all of these metrics help inform a meaningful SLI directly, even if they’re all important things to know about. Even if you have many things to measure, you can often get to meaningful SLIs by measuring only a few of them.

Here is the list of things you know you should care about:

  • Is the service up?

  • Is the service available?

  • Is the service responsive?

  • Are there enough good responses when compared to errors?

  • Are the responses in the correct data format?

  • Are the payloads of the responses the data actually being requested?

Looking at this list again, it turns out that while there should probably be metrics for all of these, you only need a few of them in order to inform a meaningful SLI.

To explain how this works, let’s start with the question at the bottom of this list: “Are the payloads of the responses the data actually being requested?” It turns out that if you can figure out a way to measure this, you’re also measuring “Are the responses in the correct data format?” From a user’s perspective, you can’t possibly be receiving the correct data if the data isn’t formatted in the way you expect it to be.

And if you know that the data is both the data you need and in the correct format, you can also be sure that the responses you’re receiving are good, and not just errors. If you’re receiving responses at all, you also know that the service is both up and available. You’ve just gone from measuring one thing to measuring five.

It might be the case that you’ll have to calculate latency via a secondary metric, but even if that’s true, you’re still measuring six things by measuring only two of them.
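One way to picture this collapse from six questions to two measurements is the sketch below. It assumes an HTTP-style API that returns JSON and a test request whose correct answer is known in advance; the function names, the `expected` argument, and the latency threshold are all illustrative assumptions and not part of any particular tool.

```python
import json


def payload_is_correct(status_code, body, expected):
    """Check the one thing that implies most of the others.

    A response can only carry the expected payload if the service was up,
    reachable, returned a non-error status, and used the expected format.
    `expected` is a hypothetical known-good answer for a test request.
    """
    if status_code is None:           # no response at all: not up or not available
        return False
    if not 200 <= status_code < 300:  # an error instead of a good response
        return False
    try:
        decoded = json.loads(body)    # a response in the wrong format fails here
    except (TypeError, ValueError):
        return False
    return decoded == expected


def request_is_good(status_code, body, latency_ms, expected, threshold_ms):
    """Combine the payload check with the secondary latency measurement."""
    return payload_is_correct(status_code, body, expected) and latency_ms <= threshold_ms
```

A single `True` from `request_is_good` answers all six questions at once, while a `False` tells you something is wrong for the user even before you know which of the six it was. The other metrics remain useful for working that out.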

A Written Example

What might the description of an SLI for measuring a service like the one we were discussing earlier look like? In simple English, it might read:

The 95th percentile of requests to our service will be responded to with the correct data within 400 ms.4

This is simple and understandable by most people, which is an incredibly important part of having meaningful SLIs. Chapter 15 goes into much more detail about this, but it is crucial that your written and defined SLIs and SLOs are easily understood by any and all stakeholders, even if the underlying process that delivers the data is complex. Chapter 7 covers how to actually measure SLIs at a technical level.

This is also a great description because it informs a binary state. As written, this can result in an outcome that is either true or is not: “Yes, this service is currently doing what it needs to be doing,” or “No, this service is currently not doing what it needs to be doing.” When measuring things that are either true or not, trivial math can be used to convert the results into a percentage, which makes it much easier to set an SLO target percentage later.
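The "trivial math" can be as small as the following sketch, which treats each request as a true-or-false outcome and reports the share of good ones. The request tuples are made-up sample data, and the 400 ms threshold is simply the figure from the example sentence above.

```python
def sli_percentage(outcomes):
    """Convert a sequence of true/false outcomes into a percentage of good events."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no events observed in this window")
    good = sum(1 for ok in outcomes if ok)
    return 100.0 * good / len(outcomes)


# Each tuple is (returned_correct_data, latency_ms) for one hypothetical request.
requests = [(True, 120), (True, 380), (True, 410), (False, 90), (True, 200)]

# A request is good only if the data was correct AND it arrived within 400 ms.
outcomes = [correct and latency_ms <= 400 for correct, latency_ms in requests]
percentage = sli_percentage(outcomes)
```

Here three of the five requests are good, so the SLI for this tiny window is 60%. Note that the third request returned correct data but was too slow, and the fourth was fast but wrong; both count as bad from the user's point of view.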

But now you might be thinking, “My service is much more complicated than just a simple request and response API. How can I measure that?”

Something More Complex

Our thought experiment involving a request and response API gave us a great starting point for some of the most important aspects of developing meaningful SLIs. By taking a step back and ensuring you’re measuring the things your users actually care about, instead of looking only at all the intermediary metrics that might exist, you’re well on your way to ensuring your service can be made more reliable for those users. But most modern services aren’t just a single API. So, let’s define something a little more complex.

Imagine you are in charge of a retail website. Like before, the details don’t really matter. It’s not of particular consequence for our discussion what this website sells; it’s just a starting point for describing a service that is itself composed of many other services and components.

Figure 3-1 shows a very oversimplified example of what a retail website’s architecture might look like. It’s missing many components you might find in the real world, but we wouldn’t gain much from trying to architect an entire solution here. We just need an example of a service that is composed of more than a single microservice.

Figure 3-1. A basic and oversimplified retail website architecture

The setup here is pretty simple. At the center we have a web app that acts as our site server. This is where user requests enter, and where the corresponding web pages are constructed for rendering in the user’s browser. This site server web app is likely the busiest part of the infrastructure, so it’s distributed across multiple instances. It has to talk to a database in order to get accurate data about what to actually display for a user, and it also relies on a caching service layer to better handle serving frequently accessed assets such as the front page, images that exist on most pages such as the company logo, and information about the most popular items for sale.

Because there are multiple instances of this site server, there also needs to be some kind of load balancing solution. This load balancer sits in front of the web app and routes user requests from the internet to the site server app.

Behind the primary web app live a number of microservices. One handles things like user information, which also then requires a database that stores data like usernames, password hashes, and payment and shipping information.

There is also a shopping cart microservice, since something needs to be responsible for keeping track of what a user wants to purchase. This service probably relies on a database too, but it also likely has some kind of cache setup since shopping cart information is often more transient than other bits of data about a user, and that cache will ensure quicker responses as well.

The final piece of our simple retail website is a payment gateway microservice. Handling sensitive information like credit card numbers and actually charging them is a thing most people don’t want to have to think about, so the primary purpose of this microservice is to interact with a third-party vendor that does the payment processing for us.

Now that we have this basic architecture defined, how might we think about the ways it needs to be reliable?

Measuring Complex Service User Reliability

This more complicated service has many more components, so we have many more things we need to measure. Let’s go through some possible user interactions with this retail website service, and see how they might inform a meaningful SLI.

Just like with our simple web API example, the first thing you need to care about is if your service is both up and available. No matter what your service looks like or how it is architected, it won’t do much for your users if they can’t access it. Similarly, the service needs to respond in a timely manner, send an acceptable number of good responses compared to errors, and include the correct data in its responses. What “correct” data for a service like this looks like is where things start to diverge a bit.

Measuring correct data becomes more difficult for a more complex service for two separate but related reasons: there are more ways to interact with multicomponent systems, and user requests have to travel through more logical routes in order for the service to respond appropriately to these interactions. To start with, Table 3-1 identifies a number of these paths and the components such requests will have to interact with.

Table 3-1. Some example service interactions
Interaction              Components
Visit home page          Load balancers, web app, cache
Browse items             Load balancers, web app, database
Add/remove cart item     Load balancers, web app, cart service, cache, database
Edit shipping address    Load balancers, web app, user service, database
Purchase item            Load balancers, web app, payment gateway, third-party payment vendor

While all of these interactions rely on the load balancers (and the network in general) and the primary site serving app, every user journey follows a different logical service path. You’ll have to make sure you develop a way to measure all of these different interactions, and all of them will require different ways of determining whether the response is correct or not.
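The interactions in Table 3-1 can also be written down as data, which makes it easy to ask questions such as "what does every journey depend on?" or "which journeys break if this component does?" The following is only a sketch; the helper function names are invented for illustration, and the component names are copied from the table.

```python
# Table 3-1 expressed as a mapping from user interaction to the components it touches.
INTERACTIONS = {
    "Visit home page": {"load balancers", "web app", "cache"},
    "Browse items": {"load balancers", "web app", "database"},
    "Add/remove cart item": {
        "load balancers", "web app", "cart service", "cache", "database",
    },
    "Edit shipping address": {"load balancers", "web app", "user service", "database"},
    "Purchase item": {
        "load balancers", "web app", "payment gateway", "third-party payment vendor",
    },
}


def shared_components(interactions):
    """Components that every interaction depends on."""
    return set.intersection(*interactions.values())


def journeys_affected_by(component, interactions):
    """Interactions whose service path includes the given component."""
    return sorted(name for name, parts in interactions.items() if component in parts)
```

Calling `shared_components(INTERACTIONS)` returns only the load balancers and the web app, matching the observation above, while `journeys_affected_by("cache", INTERACTIONS)` shows that a cache problem touches the home page and the shopping cart but not browsing or purchasing.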

You can likely get a lot of the way there by watching responses from the site server that aren’t errors, and there is nothing wrong with using this as a starting point as it’s a metric you likely already have available to you. But in order to get even better insight, you’ll have to measure in a way that follows the entire logical service path from start to finish. As an example, let’s imagine what it’s like to attempt to log in to this service.

To determine if users are reliably able to log in, we could look at something like the error rate between the web app and the user microservice, but this metric doesn’t really mean much if most user login requests aren’t reaching the user microservice in the first place. It also might not mean much if users are able to log in, but they’re logged in as the wrong user. It might be a good idea for the team responsible for the user microservice to be monitoring login success/failure, and maybe even using it as the basis for their own SLI, in order to determine how they are doing in terms of what other internal teams need. However, that’s not a very representative or meaningful basis for an SLI for the external customers actually trying to buy things on this site.
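A small worked example shows how far apart the internal view and the customer's view can be. All of the counts below are invented for illustration, and the variable names are hypothetical.

```python
def success_ratio(successes, attempts):
    """Share of attempts that succeeded, as a value between 0.0 and 1.0."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


# Hypothetical counts for one window of login attempts.
logins_at_edge = 1000           # login requests that arrived at the load balancers
logins_reaching_user_svc = 200  # requests the web app managed to forward
logins_succeeded = 198          # requests the user microservice answered without error

# What the team that owns the user microservice sees.
internal_view = success_ratio(logins_succeeded, logins_reaching_user_svc)

# What external customers experience.
edge_view = success_ratio(logins_succeeded, logins_at_edge)
```

With these numbers the user microservice reports a 99% success rate while fewer than 20% of the customers who tried to log in actually could. Both figures are accurate; they simply answer different questions for different users.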

Note

This chapter is focused on stopping what you’re currently doing so you can make sure that you’re examining all of the things that are required for a service to be reliable, so our examples have often been broad. However, it might also be the case that your service is very simple and only relied upon by a small number of other services. In a case like that, there is nothing wrong with having a simple SLI that only measures a few things. The point, as always, is just to make sure that you’re thinking about what your users need. If all they need is responses that aren’t errors, then that might be all you need, as well.

That being said, measuring the error rate between the site server web app and the user microservice isn’t the worst way to start, because it does already encompass a different layer as well. Unless some strange error handling has been set up, looking at errors at that location in the service mesh should also expose errors when the user service tries to query or update the database. So, at least a few things are being measured at the same time. This could very well be a great starter SLI, especially since it’s entirely possible you already have the data you need to power this sort of measurement.

But it’s nowhere close to representing the entire user experience. If you want to have an SLI for your actual human customers that encompasses how they experience logging into your site, it has to start at the edge. (Or even beyond: remember that these users have internet connections that will not operate reliably at all times, as well!) You need to be able to say that requests that hit your load balancers are received, not rejected, and properly forwarded to a working instance of your web app, which then knows what to do with them. The web app should properly forward each request to the user microservice, which is able to talk to the database and verify that the username and password combination is correct. The user microservice then relays this information back to the site server web app, which is able to present a page that can be rendered correctly in the user’s browser of choice. That is how you make sure you’re measuring many things by measuring only a few. Complex systems don’t make the actual measurement easy, but it shouldn’t be too difficult to figure out what you should be trying to measure.

Our example up to this point doesn’t include user journeys like adding things to the shopping cart or being able to pay, check out, and receive an order, but the concepts for those are the same. These sorts of user journeys are complex, and it would be next to impossible to accurately measure every single possible user interaction or data flow. But the closer you can get to what your users are actually doing and what they actually need, the more meaningful and representative your SLIs are going to be.

Note

Just as your services can never be perfect, your measurements of them can never be perfect either. Things like latency or error rates are often easy to measure and calculate, especially when you only have a single component at play. As your systems get deeper and more complex, measuring everything correctly gets more difficult at a rate that is almost certainly steeper than linear. For many services it might make more sense to just ensure that each component is performing reliably enough in relation to the others, as opposed to capturing the entire user journey with a single measurement. You can use these individual measurements to inform an approximate view of the entire flow, and often that is more than good enough. Meaningful SLIs are about capturing data in a way that you might not be doing today, and it is of the utmost importance that this is done from a user’s perspective, but you don’t have to entirely mimic their interactions to do so.
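One simple way to turn per-component measurements into an approximate view of a whole journey is to multiply their success ratios. The sketch below does that under two loudly stated assumptions: the components are used in series, and their failures are independent. Real systems only approximate both, and the ratios shown are invented for illustration.

```python
from math import prod


def approximate_journey_success(component_ratios):
    """Estimate end-to-end success for a request that must pass every component.

    Assumes failures are independent and that the components are used in
    series, which real systems only approximate.
    """
    ratios = list(component_ratios)
    if not ratios:
        raise ValueError("need at least one component")
    if any(not 0.0 <= r <= 1.0 for r in ratios):
        raise ValueError("ratios must be between 0.0 and 1.0")
    return prod(ratios)


# Hypothetical per-component success ratios for the login path.
login_path = {
    "load balancers": 0.9999,
    "web app": 0.999,
    "user service": 0.999,
    "database": 0.9995,
}
estimate = approximate_journey_success(login_path.values())
```

With these figures the estimate comes out at roughly 99.74%, which is lower than any single component's ratio. That gap is worth keeping in mind when each team sets targets for its own component in isolation.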

Another Written Example

What might the description for how to measure a slightly more complex service such as this one look like? In simple English, it might read like this:

When clients external to our network provide a valid username and password combination, the site will reload in a logged-in state.

Again, this is a simple sentence that most people should be able to understand, even if the underlying technology that measures this is nontrivial and complex. Measuring this sort of thing is not always going to be easy. It might require a lot of custom code, standing up black box monitors in remote data centers, or very clever math using the white box metrics you already have. Even better is introducing a service tracing solution that allows you to measure every actual incoming user request instead of generating artificial ones (Chapter 7 covers how to do this kind of measurement in detail). But no matter how complicated the underlying approaches are, the best SLIs are always able to be used to write easy-to-explain sentences.
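A black box monitor for this SLI could be structured along the lines of the sketch below. It deliberately does not use any real HTTP library: `submit_login` is a hypothetical callable you would supply that drives the login form from outside your network, and `PageResult` is an invented record of what it observed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PageResult:
    """What a hypothetical probe client observed after submitting the login form."""
    status_code: int
    logged_in_as: Optional[str]  # username shown on the reloaded page, if any


def login_probe(
    submit_login: Callable[[str, str], PageResult],
    username: str,
    password: str,
) -> bool:
    """Return True if the site reloaded in a logged-in state for the right user."""
    try:
        result = submit_login(username, password)
    except Exception:
        # Any failure to complete the journey counts against the SLI.
        return False
    return result.status_code == 200 and result.logged_in_as == username
```

Passing the client in as an argument keeps the probe's judgment (did the right user end up logged in?) separate from the mechanics of making the request, so the same check could be fed by synthetic probes or by data extracted from traces of real requests.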

Business Alignment and SLIs

Once you take the time to step back and think about how to measure your service from your users’ perspective, you’re well on your way to adopting an SLO-based approach to reliability. There is even a fair argument to be made that this single step gets you most of the way through the story this book is trying to tell. One reason is that this step is already more or less expected by other organizations in your business, even if no one else has noticed it yet.

Engineers want to work on fun new projects and write new code more than they want to clean up technical debt or fix bugs to make things more reliable; however, if those bugs pile up enough, they’ll never actually have any time to work on those new projects. Product managers and the business want to ship new features and might not like seeing an entire month or quarter dedicated to just making things more reliable; however, if you let reliability out of your sight for too long, those new features may never work at all. It turns out that it’s likely that all of these groups already want meaningful SLIs that inform reliability—they just think about things in a slightly different manner.

For example, let’s think about our SLI measurement from the previous section. If you were to put that sentence in front of a product manager, they might say: “Sure, but that’s not an SLI, that’s a user journey.” They very well may already have a document literally titled “User Journeys” that has your exact SLI defined in slightly different language. And if you took that SLI description and put it in front of the business aspect of your organization, they might say: “Sure, but that’s not an SLI, that’s a KPI.” Taking this a step further, if you described your SLI to your QA or test engineering team, they might respond with: “We agree, but that’s not an SLI, it’s an interface test.”

Chances are that many people at your company or in your organization are already entirely aligned in terms of what aspects of your service need to be measured and how important those are to users. It’s likely just that the language you’ve all been using doesn’t line up exactly. We’ll cover how to ensure everyone is on the same page in later chapters, but as you start your journey in developing SLIs, make sure you’re thinking about how various parts of your organization are thinking about these same things, because they probably already are.

Summary

Service level indicators are the single most important part of an SLO-based approach to reliability. You can use SLIs even if you don’t have SLOs, and you have to have SLIs before you can have SLOs.

No matter what your service does, it isn’t doing much if it isn’t doing what its users need it to do, and the only way you can make sure that’s true is by measuring things from that angle. In some cases, simple service level metrics might be all you need. Maybe you can measure reliability just from locally reported error rates. Or perhaps you can use the throughput of various components of a pipeline to determine if things are running well enough. Sometimes you might have to write a brand new service that exists only to explore other services and see what they look like from the outside. But no matter which of these things might be true for your own service, they all require some moments of reflection as you stop thinking about what you need and instead place what your users need first.

Achieving everything outlined in this book isn’t easy. It can take a lot of time, but you also might be surprised at how quickly you can ship out your first SLI/SLO, and how addictive it can be to continue that process. You might need to convince a lot of people of the usefulness of this approach. You might need to spend a lot of time developing documents, tooling, and workshops. You might have to ask people to think about things in a different way than they’ve ever done in the past. But don’t let any of that discourage you. Get started with your first SLI today, then refine it over time. Pick an SLO target, and see how you perform against it. You don’t need buy-in from everyone to get started, and you can often see value quickly by tackling the low-hanging fruit yourself.

All that said, eventually you’ll want to get buy-in from the entire organization to best utilize SLO-based approaches. Chapter 6 covers strategies for how to do that. In the meantime, even if you can’t or haven’t yet reached all of your goals, you can always develop meaningful SLIs. Thinking about your users first is never a bad idea.

1 There are, unfortunately, many computer-based services that do harm to our world. Weapons systems, mass surveillance systems, machine learning algorithms biased against nonwhite people, and many more do little to improve the human condition while they remain reliable. The ethical concerns about how to address all of this are out of the scope of this book, but important enough to mention here.

2 There are definitely cases where services get lost in the shuffle and suddenly no one knows who is responsible for them, but that just leads to humans having to scramble around trying to figure stuff out, so people are still involved.

3 At smaller organizations, the people who write the code, the people who maintain it, the people who use it, and the people who determine what it should be may all be the same people, or even the same individual. The general philosophies as described here all still stand.

4 Measuring “correct data” can be incredibly difficult and sometimes fraught with problems, and it is often the case that you are measuring things “well enough” even if you don’t get all the way to this point. This example tries to cover all of the bases, but don’t worry if this isn’t a realistic possibility for measuring your service.
