Sum-of-rates versus rate-of-sums

The previous point might seem obvious, but as queries grow more complex, it becomes easy to make mistakes. A common example is taking the rate of a sum of counters instead of summing rates. The rate function expects a counter, but a sum of counters is effectively a gauge, because it goes down whenever one of the underlying counters resets. When graphed, this shows up as seemingly random spikes: rate treats any decrease as a counter reset, so the combined value of the counters that did not reset is counted as a huge delta between zero and the current value. The following diagram shows this in action. G1 and G2 are two counters, one of which (G2) had a reset. G3 shows the expected aggregate result, produced by summing the rate of each counter. G4 shows what the sum of counters 1 and 2 looks like. G5 represents how the rate function would interpret G4 as a counter (the sudden increase being the difference between 0 and the point where the decrease happened). Finally, G6 shows what the rate of the sum of counters would look like, with the erroneous spike appearing where G2's counter reset happened:

Figure 7.11: Approximation of what rate of sums and sum of rates look like
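To put rough numbers on the spike, here is a small worked example. All values are hypothetical and chosen only for illustration: two counters are sampled 60 seconds apart, the first going from 1000 to 1060 and the second from 5000 to 30 after restarting from zero. The sketch also ignores the extrapolation that rate applies at the edges of the window, so the figures are approximate.

```latex
\begin{align*}
% Each counter is rated separately; the reset only affects the second term,
% where the increase is counted from 0 up to the post-reset value of 30.
\text{sum of rates} &= \frac{1060 - 1000}{60\,\text{s}} + \frac{30 - 0}{60\,\text{s}}
                     = 1 + 0.5 = 1.5 \text{ req/s} \\
% The summed series drops from 6000 to 1090, which is read as a reset of the
% whole total, so the increase is counted from 0 up to 1090.
\text{rate of sums} &= \frac{1090 - 0}{60\,\text{s}} \approx 18.2 \text{ req/s}
\end{align*}
```

Under these assumptions, the rate of the sum is roughly twelve times the real request rate for that window, even though traffic did not change. This is the kind of spike shown in G6.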

An example of how to properly do this in PromQL might be:

sum(rate(http_requests_total[5m]))

Making this mistake was a bit harder in past versions of Prometheus, because giving rate a range vector of sums required either a recording rule or a manual sum of range vectors. Unfortunately, as of Prometheus 2.7.0, subqueries make it possible to ask for the sum of counters over a time window, effectively creating a range vector from that result. Taking the rate of such a result produces incorrect values and should be avoided. So, in short, always apply aggregations after taking rates, never the other way around.
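As a sketch of what the two forms look like side by side, the following queries reuse the http_requests_total metric from the earlier example. The grouping by the job label in the last query is an assumption made for illustration; any label present on the series could be used instead.

```promql
# Anti-pattern: the subquery ([5m:]) turns the summed counters into a range
# vector, so rate() reads any drop in the sum as a reset of the whole total.
rate(sum(http_requests_total)[5m:])

# Preferred: take the rate of each series first, then aggregate.
sum(rate(http_requests_total[5m]))

# The same ordering applies when keeping a dimension (job is assumed here).
sum by (job) (rate(http_requests_total[5m]))
```

The first query is syntactically valid and returns a result, which is what makes the mistake easy to miss. The difference usually only shows up when one of the underlying counters resets.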
