Previously, we took a look at the Prometheus configuration file; we'll now move on to the provided alerting rules example, which we can see in the following snippet:
vagrant@prometheus:~$ cat /etc/prometheus/alerting_rules.yml
groups:
- name: alerting_rules
  rules:
  - alert: NodeExporterDown
    expr: up{job="node"} != 1
    for: 1m
    labels:
      severity: "critical"
    annotations:
      description: "Node exporter {{ $labels.instance }} is down."
      link: "https://example.com"
Let's look at the NodeExporterDown alert definition more closely. We can split the configuration into five distinct sections: alert, expr, for, labels, and annotations. We'll now go over each one of these in the next table:
| Section | Description | Mandatory |
| --- | --- | --- |
| alert | The alert name to use | Yes |
| expr | The PromQL expression to evaluate | Yes |
| for | How long the condition must hold before the alert fires; defaults to 0 | No |
| labels | User-defined key/value pairs attached to the alert | No |
| annotations | User-defined key/value pairs carrying contextual information | No |
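Annotations such as description can reference alert labels through templates like {{ $labels.instance }}. Prometheus expands these with Go templating; the following is just an illustrative Python sketch that mimics the simple label-lookup case, so the function name and regex are our own, not part of Prometheus:

```python
import re

def expand_annotation(template, labels):
    """Toy expansion of Prometheus-style {{ $labels.<name> }} references.

    Prometheus itself uses Go templating; this mimics only the plain
    label-lookup case for illustration.
    """
    return re.sub(
        r"\{\{\s*\$labels\.(\w+)\s*\}\}",
        lambda m: labels.get(m.group(1), ""),
        template,
    )

# Labels as they would appear on a firing NodeExporterDown alert
labels = {"instance": "prometheus:9100", "job": "node"}
print(expand_annotation("Node exporter {{ $labels.instance }} is down.", labels))
# Node exporter prometheus:9100 is down.
```

This is why the description annotation in the rule file becomes a concrete message once the alert fires.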
The NodeExporterDown rule will only trigger when the up metric with the job="node" selector is not 1 for more than one minute, which we'll now test by stopping the Node Exporter service:
vagrant@prometheus:~$ sudo systemctl stop node-exporter
vagrant@prometheus:~$ sudo systemctl status node-exporter
...
Mar 05 20:49:40 prometheus systemd[1]: Stopping Node Exporter...
Mar 05 20:49:40 prometheus systemd[1]: Stopped Node Exporter.
By stopping the service, we're forcing an alert to become active, which takes it through three different states:
| Order | State | Description |
| --- | --- | --- |
| 1 | Inactive | Not yet pending or firing |
| 2 | Pending | Not yet active long enough to become firing |
| 3 | Firing | Active for more than the defined for clause threshold |
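The state transitions above follow directly from the rule's condition and its for duration. As a rough sketch (the function and variable names here are ours, not Prometheus internals), the logic can be modeled like this:

```python
from datetime import datetime, timedelta

FOR_DURATION = timedelta(minutes=1)  # the rule's `for: 1m`

def alert_state(condition_true, pending_since, now):
    """Derive the alert state from the rule condition and how long it has held.

    condition_true: whether `up{job="node"} != 1` currently evaluates to true
    pending_since:  when the condition first became true (None if it hasn't)
    """
    if not condition_true or pending_since is None:
        return "inactive"
    if now - pending_since >= FOR_DURATION:
        return "firing"
    return "pending"

start = datetime(2019, 3, 5, 20, 49, 40)
print(alert_state(False, None, start))                          # inactive
print(alert_state(True, start, start + timedelta(seconds=30)))  # pending
print(alert_state(True, start, start + timedelta(minutes=2)))   # firing
```

Note that real rule evaluation happens at the configured evaluation interval, so the transition to firing occurs on the first evaluation after the for duration has elapsed, not at the exact instant.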
Going to the /alerts endpoint on the Prometheus server web interface, we can visualize the three different states for the NodeExporterDown alert. First, the alert is inactive, as we can see in the following figure:
Then, we can see the alert in a pending state. This means that the alert condition has been triggered, but Prometheus will keep checking on each evaluation cycle whether it still holds, until the for duration has passed. The next figure illustrates the pending state; notice that the Show annotations tick box is selected, which expands the alert annotations:
Finally, we can see the alert transition to firing. This means that the alert has been active for longer than the duration defined by the for clause, in this case, 1 minute:
When an alert starts firing, Prometheus sends a JSON payload to the configured alerting service endpoint, in our case the alertdump service, which is configured to log to the /vagrant/cache/alerting.log file. This makes it very easy to understand what kind of information is being sent, which we can validate as follows:
vagrant@prometheus:~$ cat /vagrant/cache/alerting.log
[
{
"labels": {
"alertname": "NodeExporterDown",
"dc": "dc1",
"instance": "prometheus:9100",
"job": "node",
"prom": "prom1",
"severity": "critical"
},
"annotations": {
"description": "Node exporter prometheus:9100 is down.",
"link": "https://example.com"
},
"startsAt": "2019-03-04T21:51:15.04754979Z",
"endsAt": "2019-03-04T21:58:15.04754979Z",
"generatorURL": "http://prometheus:9090/graph?g0.expr=up%7Bjob%3D%22node%22%7D+%21%3D+1&g0.tab=1"
}
]
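Since the payload is plain JSON, it is straightforward to consume programmatically. The following is a small hedged sketch, parsing a trimmed-down copy of the payload above with Python's standard json module to produce a one-line summary per alert (the summary format is our own choice):

```python
import json

# A trimmed-down copy of the payload Prometheus sent to the alerting endpoint
payload = """
[
  {
    "labels": {
      "alertname": "NodeExporterDown",
      "instance": "prometheus:9100",
      "job": "node",
      "severity": "critical"
    },
    "annotations": {
      "description": "Node exporter prometheus:9100 is down.",
      "link": "https://example.com"
    },
    "startsAt": "2019-03-04T21:51:15.04754979Z"
  }
]
"""

# Print one summary line per alert in the batch
for alert in json.loads(payload):
    print(f'{alert["labels"]["alertname"]} ({alert["labels"]["severity"]}): '
          f'{alert["annotations"]["description"]}')
# NodeExporterDown (critical): Node exporter prometheus:9100 is down.
```

Note that the payload is an array: Prometheus batches alerts, so a single POST to the alerting endpoint may carry several alerts at once.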
Now that we've seen how to configure some alerting rules and validated what Prometheus is sending to the configured alerting system, let's explore how to enrich those alerts with contextual information by using labels and annotations.