Escalating once a threshold is reached

We know how to perform an action if a threshold is reached, such as the temperature being too high, the available disk space being too low, or a web server not working. We can send a message, open a ticket in a tracker, run a custom script, or execute a command on a remote machine. But all of these are simple if-then sequences; if it's this problem, then do this. Quite often, the severity of the problem depends on how long the problem persists. For example, a connection loss to a branch office that lasts a couple of minutes might not be critical, but it's still worth noting down and emailing IT staff. The inability to reach a branch office for five minutes is quite important, and, at this point, we would like to open a ticket in the help desk system and send an SMS to IT staff. After 20 minutes of the problem not being fixed, we would email an IT manager. Let's look at what tools Zabbix provides to enable such gradual activities and configure a simple example.

In the frontend, perform the following operations:

Navigate to Configuration | Actions and click on Disabled next to the Test action in the Status column to enable this action
Then, click on Enabled next to the SNMP action
Now, click on Test action in the Name column

Currently, this action sends a single email to the user Admin whenever a problem occurs. Let's extend this situation:

Our first user, Admin, will be notified for the first five minutes after the problem happens, at one-minute intervals. After that, they would be notified every five minutes until the problem is resolved.
advanced_user is lower-level management who would like to receive a notification if a problem is not resolved within five minutes.
monitoring_user is a higher-level manager who should be notified in 20 minutes if the problem is still not resolved, and if it has not yet been acknowledged.

While these times would be longer in real life, in this instance, we are interested in seeing escalation in action.

Now, we are ready to configure escalations. Switch to the Operations tab.

Looking at the operations list, we can see that it currently contains only a single operation—sending an email message to the Admin user immediately and only once, which is indicated by the Steps Details column having only the first step listed:

The first change we would like to perform is to make sure that Admin receives notifications every minute for the first five minutes after the problem happens. Before we modify that, though, we should change the default operation step duration, which, by default, is 3600, and cannot be lower than 60 seconds. Looking at our requirements, two factors affect the possible step length:

The lowest time between two repeated alerts, in our case, 1 minute.
The greatest common divisor of the starting times of delayed alerts. In our case, the delayed alerts were required at 5 and 20 minutes, so the greatest common divisor is 5 minutes.

Normally, you would set the default step duration to the greatest common divisor of both of these factors. Here, that would be 60 seconds, but we may also override step duration inside an operation. Let's see how that can help us have a simpler escalation process.
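The arithmetic behind that choice can be checked with a few lines of Python. This is only a sketch of the calculation described above; the variable names are ours and are not anything Zabbix exposes.

```python
from functools import reduce
from math import gcd

# Requirements from the text, expressed in seconds.
repeat_interval = 60                 # shortest gap between repeated alerts
delayed_starts = [5 * 60, 20 * 60]   # delayed alerts at 5 and 20 minutes

# Greatest common divisor of the delayed start times.
start_gcd = reduce(gcd, delayed_starts)

# Greatest common divisor of both factors: the longest step duration
# that still lines up with every requirement.
step_duration = gcd(repeat_interval, start_gcd)

print(start_gcd, step_duration)
```

The first value is 300 seconds (5 minutes) and the second is 60 seconds, which matches the reasoning above.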

Enter 300 in the Default operation step duration field. That's five minutes expressed in seconds, and entering 5m should give the same result. Now, let's make sure that Admin receives a notification every minute for the first five minutes. Click on Edit in the Action operations block.

Notice how the operation details also have a Step duration field. This allows us to override the action level step duration for each operation. We have an action level step duration of 300 seconds, but these steps should be performed at one-minute intervals, so enter 60 in the Step duration field. The two Steps fields denote the step this operation should start and end with. Step 1 means immediately, thus, the first field satisfies us. On the other hand, it currently sends the message only once, but we want to pester our administrator for five minutes. In the Steps fields, enter 6 in the second field.

Step 1 runs right away, at 0 minutes, step 2 runs at one minute, and so on, so step 6 runs 5 minutes after the problem happened. Sending messages for 5 minutes results in six messages in total, because we send a message at both the beginning and the end of this period.

The final result should look like this:

If it does, click on Update in the Operations block, not the button at the bottom yet. Now, on to the next task: Admin must receive notifications every five minutes after that, until the problem is resolved.

We have to figure out what values to incorporate in the Steps field. We want this operation to kick in after five minutes, but notification at five minutes is already covered by the first operation, so we are probably aiming for 10 minutes. But which step should we use for 10 minutes? Let's try to create a timeline. We have a single operation currently set that overrides the default period. After that, the default period starts working, and even though we currently have no operations assigned, we can calculate when further steps would be taken:

Step	Operation	Interval (seconds)	Time passed
1	Send message to user `Admin`	Operation, 60	0
2	Send message to user `Admin`	Operation, 60	1 minute
3	Send message to user `Admin`	Operation, 60	2 minutes
4	Send message to user `Admin`	Operation, 60	3 minutes
5	Send message to user `Admin`	Operation, 60	4 minutes
6	Send message to user `Admin`	Operation, 60	5 minutes
7	None	Default, 300	6 minutes
8	None	Default, 300	11 minutes

Operation step duration overrides periods for the steps included in it. If an operation spans steps 5-7, it overrides periods 5-6, 6-7, and 7-8. If an operation is at step 3 only, it overrides period 3-4.
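To double-check a timeline like the one above without filling in a table by hand, we can model the rule in a small script. This is a sketch under our own assumptions: the function name and the tuple layout are invented for illustration, `None` stands for "use the action default", and when several operations cover a step, the smallest custom duration is taken.

```python
from typing import List, Optional, Tuple

# (first_step, last_step, custom_duration_in_seconds or None for the default)
Operation = Tuple[int, int, Optional[int]]


def step_start_times(default_duration: int,
                     operations: List[Operation],
                     steps: int) -> List[int]:
    """Return the start time, in seconds, of steps 1..steps."""
    starts = [0]  # step 1 runs as soon as the problem is detected
    for step in range(1, steps):
        custom = [d for first, last, d in operations
                  if d is not None and first <= step <= last]
        # A custom duration on this step overrides the period
        # between this step and the next one.
        duration = min(custom) if custom else default_duration
        starts.append(starts[-1] + duration)
    return starts


# Default step duration of 300 seconds and one operation on steps 1-6
# that overrides it with 60 seconds.
timeline = step_start_times(300, [(1, 6, 60)], steps=8)
for step, start in enumerate(timeline, start=1):
    print(f"step {step}: {start // 60} min")
```

With these inputs, step 7 lands at 6 minutes and step 8 at 11 minutes, the same values as in the table.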

We wanted to have 10 minutes, but it looks like this is not possible with this particular configuration—our first operation puts step 7 at 6 minutes, and reverting to the default intervals puts step 8 at 11 minutes. To override interval 6-7, we would have to define some operation at step 7, but we do not want to do that. Is there a way to configure it in the desired manner? This should be feasible by observing the following:

Click on Edit in the Operations column and change the second Steps field to 5, and then click on Update in the Operation details block. Do not click on the main Update button at the bottom.
Now, click on New in the Operations block. Let's configure the simple things first.

Click on Add in the Send to Users section in the Operation details block, and click on Admin in the resulting popup. With the first operation updated, let's model the final few steps again:

Step	Operation	Interval (seconds)	Time passed
...	...	...	...
5	Send message to user `Admin`	Operation, 60	4 minutes
6	None	Default, 300	5 minutes
7	None	Default, 300	10 minutes
8	None	Default, 300	15 minutes

With the latest modifications, it looks like we can send a message after 10 minutes have passed—that would be step 7, but we actually removed message sending at step 6, after 5 minutes. The good news is that if we now add another operation to start at step 6, that finishes the first five-minute sending cycle and then keeps on sending a message every 5 minutes. Perfect!

Enter 6 in the first Steps field. We want this operation to continue until the problem is resolved, so 0 goes in the second Steps field. Once complete, click on the Add control at the bottom of the Operation details block.

We can see that Zabbix helpfully calculated the time when the second operation should start, which allows us to quickly spot errors in our calculations. There are no errors here; the second operation starts at 5 minutes, as desired.

With that covered, our lower-level manager, advanced_user, must be notified after five minutes, but only once. That means another operation, as follows:

Click on New in the Operations block.
Click on Add in the Send to Users section and, in the popup, click on advanced_user in the Alias column.
The single message should be simple. We know that step 6 happens after five minutes have elapsed, so let's enter 6 in both Steps fields, and then click on Add at the bottom of the Operation details block. Again, the Start in column shows that this step will be executed after five minutes, as expected.

If two escalation operations overlap steps, and one of them has a custom interval and the other uses the default, the custom interval will be used for the overlapping steps. If both operations have a custom interval defined, the smallest interval is used for the overlapping steps.
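The overlap rule is compact enough to express as a single function. The following is a sketch of the rule as stated above, not Zabbix code; the function name is hypothetical, and `None` is our own marker for an operation that uses the default interval.

```python
from typing import Iterable, Optional


def effective_step_duration(default_duration: int,
                            custom_durations: Iterable[Optional[int]]) -> int:
    """Pick the interval for a step shared by several operations."""
    customs = [d for d in custom_durations if d is not None]
    # Any custom interval beats the default; the smallest custom one wins.
    return min(customs) if customs else default_duration


print(effective_step_duration(300, [60, None]))    # one custom, one default
print(effective_step_duration(300, [120, 60]))     # two custom intervals
print(effective_step_duration(300, [None, None]))  # both use the default
```

The three calls cover the cases described: a custom interval against the default, two custom intervals, and no custom interval at all.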

We are now left with the final task—notifying the higher-level manager after 20 minutes, and only if the problem has not been acknowledged. As before, click on New in the operations block, and then click on Add in the Send to Users section. In the popup, click on monitoring_user in the Alias column. Now, let's continue with our planned step table:

Step	Operation	Interval (seconds)	Time passed
...	...	...	...
7	None	Default, 300	10 minutes
8	None	Default, 300	15 minutes
9	None	Default, 300	20 minutes

Since steps just continue with the default period, this shows us that step 9 is the correct one. As we want only a single notification here, enter 9 in both of the Steps fields.
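The same lookup, from a desired delay back to a step number, can be done programmatically. The sketch below assumes the configuration built so far (a default of 300 seconds, with steps 1-5 overridden to 60 seconds); the function name and the `overrides` dictionary are ours.

```python
from itertools import count
from typing import Dict, Optional


def step_for_delay(target_seconds: int,
                   default_duration: int,
                   overrides: Dict[int, int]) -> Optional[int]:
    """Return the step that starts exactly at target_seconds, if any."""
    elapsed = 0
    for step in count(1):
        if elapsed == target_seconds:
            return step
        if elapsed > target_seconds:
            return None  # no step falls exactly on the requested delay
        # Move on to the next step, using the override if one is set.
        elapsed += overrides.get(step, default_duration)


# Steps 1-5 use a 60-second interval; everything else uses the default.
overrides = {step: 60 for step in range(1, 6)}

print(step_for_delay(20 * 60, 300, overrides))
print(step_for_delay(7 * 60, 300, overrides))
```

The first call returns 9, agreeing with the table. The second returns `None`, because with this configuration no step starts at 7 minutes.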

It is not required to fill all steps with operations. Some steps in-between can be skipped if the planned schedule so requires.

An additional requirement was to notify this user only if the problem has not been acknowledged.

To add such a restriction, execute the following:

Click on New in the Conditions area.
The Operation condition block is displayed, and the default setting already has Not Ack chosen, so click on Add in the Operation condition block. The form layout can be a bit confusing here, so make sure not to click on Add in the Operation details block instead. While we're almost done, there's one more thing we can do to make this notification less confusing for upper management.

Currently, everybody receives the same message—some trigger information and the last values of items that are being referenced in triggers. Item values might not be that interesting to the manager, hence we can try omitting them from those messages. Untick the Default message checkbox and notice how we can customize the subject and message for a specific operation.

For the message, remove everything that goes below the Trigger URL line. For the manager, it might also be useful to know who was notified and when. Luckily, there's another helpful macro, {ESC.HISTORY}. Let's modify this message by adding an empty line and then this macro. Here's what the final result for this operation should look like:

It's all fine, so click on Add at the bottom of the Operation details block. We can now review action operations and verify that each operation starts when it should.

Everything seems to match the specification. Let's switch to the Recovery operations tab and, similar to the SNMP action, change the Recovery subject to Resolved: {TRIGGER.NAME}. This time, we wanted to avoid Resolved: OK:, opting for a single mention that everything is now fine. Add the users in the recovery operation. We can finally click on Update. With this notification setup in place, let's break something. On Another host, execute the following command:

$ rm /tmp/testfile

It will take a short time for Zabbix to notice this problem and fire away the first email to the Admin user. This email won't be that different from the ones we received before. But now let's be patient and wait a further 20 minutes. During this time, the Admin user will receive more messages. What we are really interested in is the message content in the email to the monitoring_user. Once you receive this message, look at what it contains:

Trigger: Testfile is missing
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL:
    
Problem started: 2016.04.15 15:05:25 Age: 20m
1. 2016.04.15 15:05:27 message sent        Email [email protected] "Zabbix Administrator (Admin)"
2. 2016.04.15 15:06:27 message sent        Email [email protected] "Zabbix Administrator (Admin)"
3. 2016.04.15 15:07:27 message sent        Email [email protected] "Zabbix Administrator (Admin)"
4. 2016.04.15 15:08:27 message sent        Email [email protected] "Zabbix Administrator (Admin)"
5. 2016.04.15 15:09:27 message sent        Email [email protected] "Zabbix Administrator (Admin)"
6. 2016.04.15 15:10:27 message failed        "advanced user (advanced_user)" No media defined for user "advanced user (advanced_user)"
6. 2016.04.15 15:10:27 message sent        Email [email protected] "Zabbix Administrator (Admin)"
7. 2016.04.15 15:15:28 message sent        Email [email protected] "Zabbix Administrator (Admin)"
8. 2016.04.15 15:20:28 message sent        Email [email protected] "Zabbix Administrator (Admin)"

As in all other notifications, the time here will use the local time on the Zabbix server.

It now contains a lot more information than just what happened; the manager has also received a detailed list of who was notified of the problem. The Admin user has received many notifications, but advanced_user has not received the notification because their email address is not configured. Either this user or the Zabbix administrators will have to fix that. And, in this case, the issue is escalated to the monitoring_user only if nobody has acknowledged the problem before, which means nobody has even looked into it.

The current setup would cancel escalation to the management user if the problem is acknowledged. We may create a delayed escalation by adding yet another operation that sends a message to the management user at some later step, but does so without an acknowledgement condition. If the problem is acknowledged, the first operation to the management user would be skipped, but the second one would always work. If the problem is not acknowledged at all, the management user would get two notifications.
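To see how the two operations to the management user interact, here is a small model of the condition. It is an illustration only: the class, the field names, and the later step number 12 are hypothetical, and the acknowledgment state is assumed not to change during the escalation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Operation:
    recipient: str
    step: int
    require_unacknowledged: bool = False  # models the "Not Ack" condition


def notified_steps(operations: List[Operation], acknowledged: bool) -> List[int]:
    """Steps at which a message is actually sent."""
    return [op.step for op in operations
            if not (op.require_unacknowledged and acknowledged)]


operations = [
    Operation("monitoring_user", step=9, require_unacknowledged=True),
    Operation("monitoring_user", step=12),  # hypothetical later step, no condition
]

print(notified_steps(operations, acknowledged=True))
print(notified_steps(operations, acknowledged=False))
```

When the problem is acknowledged, only the unconditional operation fires; when it is not, the management user gets both messages.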

If we look carefully at the prefixed numbers, they are not sequential numbers of entries in the history; they are actually the escalation step numbers. That gives us a quick overview of which notifications happened at the same time, without comparing timestamps. The Email string is the name of the media type that's used for this notification.
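Because each history line starts with the step number, grouping entries by step is straightforward. The sketch below assumes the line layout seen in the message above and uses a few of its lines, trimmed to the leading fields; the regular expression is our own guess at that layout, not a documented format.

```python
import re
from collections import defaultdict

# A few history lines from the message above, trimmed to the leading fields.
HISTORY = """\
5. 2016.04.15 15:09:27 message sent
6. 2016.04.15 15:10:27 message failed
6. 2016.04.15 15:10:27 message sent
7. 2016.04.15 15:15:28 message sent
"""

# Step number, timestamp (date and time), and delivery status.
LINE = re.compile(r"^(\d+)\. (\S+ \S+) message (sent|failed)")

by_step = defaultdict(list)
for line in HISTORY.splitlines():
    match = LINE.match(line)
    if match:
        step, timestamp, status = match.groups()
        by_step[int(step)].append(status)

for step, statuses in sorted(by_step.items()):
    print(step, statuses)
```

Step 6 shows up with two entries, one failed and one sent, which is how we can tell that two notifications were attempted at the same escalation step.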

Let's fix this problem now; on Another host, execute the following command:

$ touch /tmp/testfile

In a short while, two email messages should be sent—one to the Admin user and one to monitoring_user. As these are recovery messages, they will both have our custom subject:

Resolved: Testfile is missing

Our test action had escalation thresholds that are too short for most real-life situations. If lengthening these meant creating an action from scratch, that would be very inconvenient. Let's see how easily we can adapt the existing one. In the frontend, navigate to Configuration | Actions, click on Test action in the Name column, and then switch to the Operations tab. We might want to make the following changes, assuming that this is not a critical problem and does not warrant a quick response, unless it has been there for half an hour:

Increase the interval between the additional repeated messages that the Admin user receives
Increase the delay before the messages to the advanced_user and monitoring_user are sent
Start sending messages to the Admin user after the problem has been there for 30 minutes

In the next few steps, be careful not to click on the Update button too early as that will discard the modifications in the operation that we are currently editing.

Let's start by changing the Default operation step duration to 1800 (30 minutes). Then, click on Edit in the Action column next to the first entry (currently spanning steps 1-5). In its properties, set the Steps fields to 2 and 6, and then click on the Update control in the operation details block.

For both operations that start at step 6, change that to step 7. For the operation that has 6 in both of the Steps fields, change both occurrences the same way as before, and again, be careful not to click on the Update button yet.

The final result should look like this:

If it does, click on that Update button.

The first change for the default operation step spaced all steps out, except the ones that were overridden in the operation properties. That mostly achieved our goals to space out notifications to the Admin user and delay notifications to the two other users. By changing the first step in the first operation from 1 to 2, we achieved two goals. The interval between steps 1 and 2 went back to the default interval for the action (as we excluded step 1 from the operation that did the overriding with 60 seconds), and no message was sent to the Admin user right away. Additionally, we moved the end step a bit further so that the total number of messages the Admin user would receive at one-minute intervals would not change. That resulted in some further operations not being so nicely aligned to the 5-minute boundary, so we moved them to step 7. Let's compare this to the previous configuration:
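One way to make that comparison concrete is to compute the start time of every step for both configurations. This is a sketch based on the step-duration rule described earlier; the helper and the operation labels are ours, and the values are what that rule predicts rather than output taken from Zabbix.

```python
from typing import Dict


def start_times(default_duration: int,
                overrides: Dict[int, int],
                steps: int) -> Dict[int, int]:
    """Map each step number to its start time in seconds."""
    starts = {1: 0}
    for step in range(1, steps):
        starts[step + 1] = starts[step] + overrides.get(step, default_duration)
    return starts


# Before: default 300 s, steps 1-5 overridden to 60 s.
before = start_times(300, {s: 60 for s in range(1, 6)}, steps=9)
# After: default 1800 s, steps 2-6 overridden to 60 s.
after = start_times(1800, {s: 60 for s in range(2, 7)}, steps=9)

# (label, first step before, first step after)
operations = [
    ("Admin, one-minute messages", 1, 2),
    ("Admin, repeated messages", 6, 7),
    ("advanced_user", 6, 7),
    ("monitoring_user", 9, 9),
]
for label, old_step, new_step in operations:
    print(f"{label}: {before[old_step] // 60} min -> {after[new_step] // 60} min")
```

Under these assumptions, the first message to Admin moves from 0 to 30 minutes, the operations at step 7 start at 35 minutes instead of 5, and the message to monitoring_user moves from 20 to 95 minutes.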

This allows us to easily scale notifications and escalations up from a testing configuration to something more appropriate to the actual situation, as well as adapting quickly to changing requirements. Let's create another problem. On Another host, execute the following command:

$ rm /tmp/testfile

Wait for the trigger to fire and for a couple of emails to arrive for the Admin user, and then solve the problem:

$ touch /tmp/testfile

That should send a recovery email to the Admin user soon. Hey, wait ..., why for that user only? Zabbix only sends recovery notifications to users who have received problem notifications. As the problem did not get escalated for the management user to receive the notification, that user was not informed about resolving the problem either. A similar thing actually happened with advanced_user, who did not have media assigned. As the notification was not sent when the event was escalated (because no email address was configured), Zabbix did not even try to send a recovery message to that user. No matter how many problem messages were sent to a user, only a single recovery message will be sent per action.
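The rule for recovery messages can be summarized in a few lines. This is a sketch of the behavior described above, with a made-up log format: each entry is a user name plus a flag saying whether the problem message was actually delivered.

```python
from typing import List, Tuple


def recovery_recipients(problem_log: List[Tuple[str, bool]]) -> List[str]:
    """Users who get a recovery message, one per user."""
    recipients: List[str] = []
    for user, delivered in problem_log:
        # Only users who actually received a problem message qualify,
        # and each of them is listed once no matter how many they got.
        if delivered and user not in recipients:
            recipients.append(user)
    return recipients


log = [
    ("Admin", True),
    ("Admin", True),
    ("advanced_user", False),  # no media defined, so delivery failed
    ("Admin", True),
]
print(recovery_recipients(log))
```

Here only Admin is returned: advanced_user never received a problem message, and monitoring_user was never reached by the escalation.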

So, in this case, if the Admin user resolved or acknowledged the issue before monitoring_user received an email about the problem, monitoring_user would receive neither the message about the problem, nor the one about resolving it.

As we can see, escalations are fairly flexible and allow you to combine many operations when responding to an event. We could imagine one fairly long and complex escalation sequence of a web server going down to proceed as follows:

Email the administrator
Send an SMS to admin
Open a report at the help desk system
Email management
Send an SMS to management
Restart Apache
Reboot the server
Power cycle the entire server room

Well, the last one might be a bit over the top, but we can indeed construct a fine-grained stepping up of reactions and notifications about problems.
