Service-level indicators

In Chapter 1, Monitoring Fundamentals, we introduced the notion of What to measure, discussing Google's Four Golden Signals, as well as the USE and RED methodologies. Building upon that knowledge, we can start to define service-level indicators (SLIs), which reflect a given service's performance and availability. Constructing the queries that generate SLIs is one of the most common and useful patterns of PromQL usage.

Let's look at an example of an SLI. The typical definition is the number of good events divided by the number of valid events; in this case, we want to know what percentage of requests served by Prometheus complete in 100 ms or less, which makes it a latency SLI. First, we need to gather information about how many requests are being served under that threshold; thankfully, we can rely on an already available histogram-typed metric called prometheus_http_request_duration_seconds_bucket. As we already know, this type of metric has buckets represented by the le label (less than or equal), so we just match the elements under 100 ms (0.1 s), like so:

prometheus_http_request_duration_seconds_bucket{le="0.1"} 

While ratios are typically the base unit in these types of calculations, for this example we want a percentage, so we must divide the matched elements by the total number of requests made (prometheus_http_request_duration_seconds_count) and multiply that result by 100. These two instant vectors can't be divided directly due to the mismatch of the le label, so we must ignore it, as follows:

prometheus_http_request_duration_seconds_bucket{le="0.1"} / ignoring (le) prometheus_http_request_duration_seconds_count * 100

This gives us an instant vector with information per endpoint and per instance, with the sample value set to the percentage of requests answered below 100 ms since the service started on each instance (remember that _bucket is a counter). That's a good start, but not quite what we're after: we want the SLI for the service as a whole, not for each instance or each endpoint. It's also more useful to calculate it over a rolling window instead of averaging an indeterminate amount of data; as more data is collected, the average becomes smoother and harder to move. To fix these issues, we need to apply rate() to the counters over a time window to get a rolling average, and then aggregate away instances and endpoints using sum(). This way, we don't need to ignore the le label either, as it is also discarded during aggregation. Let's put it all together:

sum by (job, namespace, service) (
  rate(prometheus_http_request_duration_seconds_bucket{le="0.1"}[5m])
) /
sum by (job, namespace, service) (
  rate(prometheus_http_request_duration_seconds_count[5m])
) * 100

Building a service-level objective (SLO) for our service now becomes trivial, as we just need to compare the result against the target percentage we're aiming to achieve using a comparison operator. This makes for an excellent condition to define as an alert.
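As a rough sketch of what that alert could look like, the following Prometheus alerting rule fires when the SLI drops below the objective; note that the 95% target, the rule and group names, the for duration, and the severity label are all illustrative choices, not values prescribed by this example:

# Hypothetical alerting rule: the 95% objective and names below are placeholders
groups:
- name: slo.rules
  rules:
  - alert: LatencySLOBreach
    expr: |
      sum by (job, namespace, service) (
        rate(prometheus_http_request_duration_seconds_bucket{le="0.1"}[5m])
      ) /
      sum by (job, namespace, service) (
        rate(prometheus_http_request_duration_seconds_count[5m])
      ) * 100 < 95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Less than 95% of requests served under 100 ms

Because the comparison operator filters out elements that don't satisfy the condition, the expression only returns samples (and therefore only triggers the alert) while the percentage is below the objective.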
