Delays on alerting

In the previous topics, we talked about the three states that an alert goes through, but there is more to consider when calculating the total time an alert takes to start firing. First, there's the scrape interval (30 seconds in our example, although the scrape and evaluation intervals should generally be the same for clarity). Next comes the rule evaluation interval (globally defined as 1 minute in our case). Finally, there's the 1 minute defined in the alert rule's for clause. Adding these together, in the worst-case scenario it can take up to 2 minutes and 30 seconds for this alert to be considered firing. The next figure illustrates this example situation:

Figure 9.7: Alert delay visualized
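
The arithmetic behind this worst case can be sketched in a few lines of Python. This is only an illustration of the additive model described above, not anything Prometheus itself exposes; the function name worst_case_alert_delay and its parameter names are hypothetical, and the values are the ones from our example.

```python
from datetime import timedelta


def worst_case_alert_delay(scrape_interval, evaluation_interval, for_duration):
    """Upper bound on the time for an alert to reach firing (hypothetical helper).

    Assumes the simple additive model from the text: the condition appears
    just after a scrape, the new sample lands just after a rule evaluation,
    and the alert then stays pending for the whole `for` duration.
    """
    return scrape_interval + evaluation_interval + for_duration


# Values taken from the example in this section.
delay = worst_case_alert_delay(
    scrape_interval=timedelta(seconds=30),
    evaluation_interval=timedelta(minutes=1),
    for_duration=timedelta(minutes=1),
)

print(delay)  # 0:02:30
```

Shortening any of the three values lowers this upper bound, usually at the cost of more load on Prometheus or noisier alerts.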

All of these delays are on the Prometheus side alone. The external service that processes the alert may have constraints of its own, which can make the overall delay until a notification is sent even longer.

Before Prometheus 2.4.0, the pending and firing states did not persist across restarts, which could delay alerting even further. This was solved by introducing a new metric, ALERTS_FOR_STATE, which stores the alert states. You can find the release notes for Prometheus 2.4.0 at https://github.com/prometheus/prometheus/releases/tag/v2.4.0.