In this recipe, we'll learn how to arrange a Nagios Core configuration such that after a certain number of repetitions, notifications for problems on hosts or services are escalated to another contact, instead of (or in addition to) the normally defined contact. This is done by defining a separate object type called a host or service escalation.
This kind of setup could be useful for alerting more senior networking staff to an unsolved problem that a less experienced person is struggling to fix. It can also function as a "safety valve", ensuring that problem notifications for hosts eventually do reach someone else if the problems remain unfixed.
You should have a Nagios Core 4.0 or newer server, with at least one host or service configured already, and at least two contact groups: one for the first few notifications, and one for the escalations. You should understand how notifications are generated and sent to the contacts and contact_groups for hosts or services.
We'll use the example of a host called sparta.example.net, which normally sends notifications to a group called ops. We'll arrange for all the notifications after the fourth one to also be sent to a contact group called emergency.
We can configure an escalation for our host or service as follows:
1. Change to the objects configuration directory for Nagios Core. The default is /usr/local/nagios/etc/objects. If you've put the definition for your host in a different file, move to that directory instead.
    # cd /usr/local/nagios/etc/objects
2. Make sure the host is defined with its usual contact group and notification settings:
    define host {
        use                    linux-server
        host_name              sparta.example.net
        alias                  sparta
        address                192.0.2.21
        contact_groups         ops
        notification_period    24x7
        notification_interval  10
    }
3. Add the following hostescalation object:
    define hostescalation {
        host_name              sparta.example.net
        contact_groups         ops,emergency
        first_notification     5
        last_notification      0
        notification_interval  10
    }
4. Validate the configuration and reload Nagios Core:
    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # /etc/init.d/nagios reload
With this done, when problems are encountered with the host that generate notifications, all the notifications beyond the fourth one will be sent to both the ops contact group and the emergency contact group, expanding the number of people or systems contacted, and making it more likely that the problem is actually addressed or fixed. Perhaps someone on the ops team has misplaced their pager.
The configuration added in the preceding section is best thought of as a special case or override for a particular host, in that it specifies a range of notifications that should be sent to a different set of contact groups. It can be broken down as follows:
- host_name: The same value of host_name given for the host in its definition. We specified sparta.example.net.
- contact_groups: The contact groups to which the notifications meeting this special case should be sent. We specify both the emergency group and the ops group, separated by a comma, so that the matching notifications go to both.
- first_notification: The count of the first notification that should match this escalation. We chose the fifth notification.
- last_notification: The count of the last notification that should match this escalation. We set this to zero, which means that all notifications from first_notification onward should be sent to the nominated contacts or contact groups. The notifications will not stop until they are manually turned off, or the problem is fixed.
- notification_interval: Like the host and service directives of the same name, this specifies how long Nagios Core should wait before sending new notifications if the host remains in a problematic state. Here, we've chosen ten minutes.
Individual contacts can also (or instead) be specified with the contacts directive, rather than contact_groups.
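As a minimal sketch of the contacts directive in an escalation, the following variant of the recipe's escalation notifies one named person alongside the ops group. The contact name jsmith is hypothetical, and is assumed to be defined elsewhere in the configuration as a contact object.

```cfg
# Sketch only: "jsmith" is a hypothetical contact, assumed to be defined
# elsewhere in the configuration with a "define contact" block.
define hostescalation {
    host_name              sparta.example.net
    contact_groups         ops        ; keep notifying the usual group
    contacts               jsmith     ; plus one individual contact
    first_notification     5
    last_notification      0
    notification_interval  10
}
```

Both directives can appear in the same escalation, as here, or contacts can be used on its own.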
The preceding escalation continues sending notifications both to the original ops group and to the members of the emergency group. It's generally a good idea to do it this way rather than sending notifications only to the escalated group, because the point of escalations is to increase the reach of the notifications when a problem is not being dealt with, rather than merely trying to contact a different group of people instead.
This principle applies to stacking escalations as well; if we had a contact group with all our contacts in it, perhaps named everyone, we could define a second escalation that from the tenth notification onwards goes to every single contact:
    define hostescalation {
        host_name              sparta.example.net
        contact_groups         everyone
        first_notification     10
        last_notification      0
        notification_interval  10
    }
Just as we can specify multiple host escalations, it's also fine for the ranges of notifications to overlap so that more than one escalation applies.
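To illustrate overlapping ranges, here is a sketch of a further escalation that could sit alongside the recipe's escalation for sparta.example.net. The group name managers is hypothetical. Its range of notifications 3 to 7 overlaps the recipe's range of 5 onward, so notifications 5, 6, and 7 would match both escalations and go to the contact groups named in both.

```cfg
# Sketch only: "managers" is a hypothetical contact group, assumed to be
# defined elsewhere with a "define contactgroup" block.
define hostescalation {
    host_name              sparta.example.net
    contact_groups         ops,managers
    first_notification     3      ; starts before the recipe's escalation
    last_notification      7      ; stops after the seventh notification
    notification_interval  10
}
```

The ops group is repeated here so that notifications 3 and 4, which match only this escalation, still reach the usual group.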
With a little arithmetic, you can arrange escalations such that they work after a host or service has been in a problematic state for a certain period of time. For example, the escalation we specified in the recipe will apply about 40 minutes after the first notification for the problem: the host's notification_interval specifies that Nagios Core should wait 10 minutes between resending notifications, and the fifth notification follows the first by four such intervals.
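The same arithmetic can be run in reverse to choose first_notification for a target delay. As a sketch, assume the host's notification_interval is 10 minutes, the first notification goes out as soon as the problem is confirmed, and intervals are measured in minutes (the default interval_length of 60 seconds). Escalating roughly one hour into a problem then means matching from the seventh notification:

```cfg
# Target delay: 60 minutes; host notification_interval: 10 minutes.
# Notification n is sent about (n - 1) * 10 minutes after the first one,
# so 60 / 10 + 1 = 7 is the first notification to escalate.
define hostescalation {
    host_name              sparta.example.net
    contact_groups         ops,emergency
    first_notification     7
    last_notification      0
    notification_interval  10
}
```

If the host uses a different interval, or delays its first notification, the count needs adjusting accordingly.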
Service escalations work much the same way as host escalations do; the difference is that you need to specify the service by its service_description as well as its host name. Everything else works the same way. An escalation for a service check called HTTP running on sparta.example.net that does the same thing as the previous escalation would look like this:
    define serviceescalation {
        host_name              sparta.example.net
        service_description    HTTP
        contact_groups         ops,emergency
        first_notification     5
        last_notification      0
        notification_interval  10
    }