Chapter 51. When SLOs Attack: Pathological SLOs and How to Fix Them

Narayan Desai

SLOs are a wonderfully intuitive concept: a quantitative contract that describes expected service behavior. These are often used to build feedback loops that prioritize reliability, communicate expected behavior when taking on a new dependency, and synchronize priorities across teams when problems occur, among other use cases.

However, SLOs are built on an implicit model of service behavior, with a raft of simplifying assumptions that don’t universally hold, such as the independence of requests, the even distribution of errors, and the equality of all requests. When these assumptions fail, SLO rules of thumb fall apart on real-world services. Understanding where and how they break down is critical, because those are the cases in which SLOs inadvertently send us in the wrong direction.

Consider error budgets: a number or percentage of failures over a time interval. These errors could occur in a short period or at a low rate over a long time. They could be distributed across all users or focused on a few. Individual users could have low or 100% error rates. All these factors color how outages will be perceived and what kinds of effects they have on users.
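To make the point concrete, here is a minimal sketch in which two hypothetical outage patterns spend exactly the same error budget yet look completely different to individual users. The traffic figures, the 99.9% target, and the helper names are illustrative assumptions, not taken from any real service.

```python
def error_budget(total_requests: int, slo_target: float) -> int:
    """Number of failed requests the SLO tolerates in the window."""
    return round(total_requests * (1 - slo_target))


def per_user_error_rates(requests_per_user, errors_per_user):
    """Map each user to the fraction of their own requests that failed."""
    return {
        user: errors_per_user.get(user, 0) / count
        for user, count in requests_per_user.items()
    }


# Hypothetical window: 1,000 users sending 1,000 requests each.
requests = {f"user{i}": 1000 for i in range(1000)}

# Scenario A: the failures are spread evenly, one per user.
spread = {user: 1 for user in requests}
# Scenario B: the same number of failures all land on a single user.
focused = {"user0": 1000}

budget = error_budget(sum(requests.values()), slo_target=0.999)

for name, errors in (("spread", spread), ("focused", focused)):
    rates = per_user_error_rates(requests, errors)
    affected = sum(1 for rate in rates.values() if rate > 0)
    print(
        f"{name}: errors={sum(errors.values())} of budget={budget}, "
        f"users affected={affected}, worst user error rate={max(rates.values()):.1%}"
    )
```

Both scenarios consume the whole budget and report the same aggregate error rate, but in the second one a single user sees every request fail while everyone else sees none.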

Further, how should we deal with catastrophes? Many service providers try to incorporate bad days into their SLO promises; however, some bad days are very, very bad days. The result is a compromise that serves poorly in both fair weather and foul: neither the steady state nor the catastrophe is well described.

When things go really poorly, and multiple periods of error budget are consumed, what then? As appealing as the prospect of freezing a service for years may be, it rarely serves the best interests of either users or service providers.
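A rough back-of-the-envelope sketch shows how quickly a single bad day can consume many budget periods. The 99.99% availability target, the 30-day period, and the eight-hour outage are hypothetical numbers chosen only for illustration.

```python
def budget_periods_consumed(outage_minutes, slo_target, period_minutes=30 * 24 * 60):
    """How many periods' worth of error budget a single outage uses up."""
    budget_minutes = (1 - slo_target) * period_minutes
    return outage_minutes / budget_minutes


# Hypothetical: a 99.99% target over 30-day periods allows about 4.3 bad minutes.
periods = budget_periods_consumed(outage_minutes=8 * 60, slo_target=0.9999)
print(f"periods of budget consumed: {periods:.0f} (about {periods / 12:.1f} years)")
```

Under these assumptions one eight-hour outage uses roughly 111 monthly budgets, a little over nine years' worth, which is why a strict freeze-until-repaid policy stops being useful guidance.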

Similarly, because error budgets often incorporate tail risks, they describe a worse-than-99th-percentile (P99+) experience rather than the typical one. Hence, spending an error budget aggressively is a good way to deliver a consistently bad experience to customers.

Mismatches between SLOs and the average experience customers want can also lead to disagreements between service providers and their customers. Customers tend to expect the experience they received yesterday, even if that was a positive outlier on a service-wide basis, and changes to this behavior tend to break architectures that were built around it.

At the end of the day, SLOs are about quantifying delivered service, setting appropriate expectations, and changing tactics when things aren’t going well. All of these activities are crucial to deliver trustworthy services. So what can we do to fix SLOs?

  1. Use different methods to describe discrete aspects of SLOs. Have steady-state error rate SLOs to measure transient error rates, but use bad-minute type SLOs to characterize major outages. Measure the frequency and severity of major outages and communicate them.

  2. Measure and store per-customer SLI data to determine the experience individual customers are having and whether errors are evenly distributed.

  3. Don’t exercise error budgets unless your SLOs actually approximate the service you want to deliver to customers. For some services, spending the budget may never be the right thing to do.

  4. Embrace the ambiguity of many SLO measures; our services are rich, and a single aggregated measure of service goodness isn’t possible. This approach leaves room for nuanced situational awareness and a variety of directions that can be used to improve user experience.

  5. Set SLOs at actual customer-desired behavior in steady state, not incorporating tail risk.
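As one possible reading of the first guideline, the sketch below separates per-minute data into bad minutes (major outages) and everything else (steady state), then reports each on its own. The 5% bad-minute threshold, the `Minute` record, and the sample day of traffic are assumptions made for illustration; a real service would pick its own definitions.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    total: int   # requests served during this minute
    errors: int  # requests that failed during this minute


def summarize(minutes, bad_minute_threshold=0.05):
    """Return (bad_minute_count, steady_state_error_rate).

    A minute is "bad" when its error rate exceeds the threshold. Bad
    minutes are counted separately and left out of the steady-state rate.
    """
    bad_minutes = 0
    steady_total = 0
    steady_errors = 0
    for m in minutes:
        rate = m.errors / m.total if m.total else 0.0
        if rate > bad_minute_threshold:
            bad_minutes += 1
        else:
            steady_total += m.total
            steady_errors += m.errors
    steady_rate = steady_errors / steady_total if steady_total else 0.0
    return bad_minutes, steady_rate


# Hypothetical day: 1,430 ordinary minutes plus a 10-minute major outage.
day = [Minute(total=1000, errors=1)] * 1430 + [Minute(total=1000, errors=500)] * 10

bad, steady = summarize(day)
aggregate = sum(m.errors for m in day) / sum(m.total for m in day)
print(f"bad minutes: {bad}")
print(f"steady-state error rate: {steady:.3%}")
print(f"single aggregated error rate: {aggregate:.3%}")
```

Reported separately, the steady-state rate stays at the level customers see on a normal day, and the outage shows up as a count of bad minutes instead of being averaged into one blended figure.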

With these guidelines, not only can we make SLOs less pathological, but we also get a set of metrics that we can use to improve our services in focused ways, and so deliver reliable services with well-quantified behavior.
