In this recipe, we'll learn how to configure Nagios Core to send notifications about problems with a host or service only after a check has been repeated a certain number of times and has failed each time.
This can be an ideal arrangement for non-critical hosts that occasionally have "blips" or short outages and only become problematic if they remain down after repeated checks. It prevents sending notifications on the first check, only sending them if the problem turns out to be consistent.
You should have a Nagios Core 4.0 or newer server, with at least one host configured already. We'll use the example of sparta.example.net, a host defined in its own file. We'll arrange for it to send us notifications only after a total of five failed host checks.
We can configure the number of failed checks to be tolerated before sending a notification as follows:
Change to the objects configuration directory for Nagios Core. The default is /usr/local/nagios/etc/objects. If you've put the definition for your host in a different file, move to that directory instead.
# cd /usr/local/nagios/etc/objects
Edit the file containing the definition for the host, which may look similar to this:
define host {
use linux-server
host_name sparta.example.net
alias sparta
address 192.0.2.21
notification_period 24x7
}
Add or edit the max_check_attempts directive, setting its value to 5:
define host {
use linux-server
host_name sparta.example.net
alias sparta
address 192.0.2.21
notification_period 24x7
max_check_attempts 5
}
Validate the configuration and reload the Nagios Core server:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# /etc/init.d/nagios reload
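The two commands above are best run as a pair, with the reload skipped if validation fails. The following is a minimal Python sketch of that idea, not part of Nagios Core itself. The function name validate_then_reload and its run parameter are hypothetical, and the sketch assumes that the validation command exits with a non-zero status when it finds errors.

```python
import subprocess

# Paths taken from the commands in this recipe; adjust them for your installation.
NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"
INIT_SCRIPT = "/etc/init.d/nagios"


def validate_then_reload(run=subprocess.run):
    """Reload Nagios Core only if the configuration check succeeds.

    `run` is injectable so the logic can be exercised without a Nagios
    installation. It must accept a list of arguments and return an
    object with a `returncode` attribute.
    """
    check = run([NAGIOS_BIN, "-v", NAGIOS_CFG])
    if check.returncode != 0:
        # Assumption: a non-zero exit status means validation failed.
        return False
    reload_result = run([INIT_SCRIPT, "reload"])
    return reload_result.returncode == 0
```

Calling validate_then_reload() as root on the monitoring server would run the real commands; it returns True only if both steps exit with status 0.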
With this done, Nagios Core will not send a notification for the host entering the DOWN state until the check has been attempted a total of five times and has failed each time. In the following screenshot, note that even though four checks have already taken place and failed, no notification has been sent, and the state is listed as SOFT:
The SOFT state shows that while Nagios Core has flagged the host as DOWN, it will retry the check until it exhausts its max_check_attempts, at which point it will flag the host as being in a HARD DOWN state and send a notification.
However, if the host were to come back up before the next check, the state would change back to UP, without ever having sent a notification.
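The SOFT and HARD behavior described above can be sketched as a small simulation. This is an illustrative model only, not Nagios Core's actual code: the function name simulate_checks is hypothetical, and the model ignores recovery notifications, renotification, and the UNREACHABLE state.

```python
def simulate_checks(results, max_check_attempts=5):
    """Return a list of (state, notification_sent) pairs, one per check.

    `results` is an iterable of booleans, where True means the check passed.
    """
    timeline = []
    failures = 0  # consecutive failed checks so far
    for passed in results:
        if passed:
            # Any successful check resets the failure counter.
            failures = 0
            timeline.append(("UP", False))
            continue
        failures += 1
        if failures < max_check_attempts:
            # Still retrying: the problem is SOFT and nobody is notified.
            timeline.append(("SOFT DOWN", False))
        else:
            # Notify only on the check that first reaches the HARD state.
            timeline.append(("HARD DOWN", failures == max_check_attempts))
    return timeline


# Four failures followed by a recovery: no notification is ever sent.
blip = simulate_checks([False, False, False, False, True])

# Five consecutive failures: the fifth check triggers the notification.
outage = simulate_checks([False] * 5)
```

In this model, blip ends with ("UP", False) and contains no notification, while the last entry of outage is ("HARD DOWN", True).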
The preceding configuration alters the max_check_attempts directive for the host to specify the number of checks that need to have failed before Nagios Core flags the host as being in a HARD DOWN or UNREACHABLE state and generates a notification event.
The process for changing the maximum number of attempts before a notification for a service is identical; we add the same directive and value to the definition for the service:
define service {
use generic-service
host_name sparta.example.net
service_description HTTP
check_command check_http
max_check_attempts 5
}
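To see which definitions in a file set max_check_attempts directly, a short script can read the object definitions. The following is a rough sketch under simplifying assumptions: the function name parse_objects is hypothetical, the regular expression handles only simple define blocks like the ones in this recipe, and values inherited from templates through the use directive are not resolved.

```python
import re

# Matches "define <type> { ... }" blocks; a simplification of the real format.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_objects(text):
    """Return a list of (object_type, directives) pairs found in `text`."""
    objects = []
    for obj_type, body in DEFINE_RE.findall(text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ";" comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((obj_type, directives))
    return objects


SAMPLE = """
define service {
    use                  generic-service
    host_name            sparta.example.net
    service_description  HTTP
    check_command        check_http
    max_check_attempts   5
}
"""

for obj_type, directives in parse_objects(SAMPLE):
    # Objects without the directive get their value from a template or default.
    print(obj_type, directives.get("max_check_attempts", "(not set here)"))
```

For the sample definition, the loop prints the object type followed by the value 5.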
The time between successive retry checks of a host or service before any notification is sent can also be customized with the retry_interval directive. The value is counted in units of the interval_length setting in nagios.cfg, which defaults to 60 seconds, so by default the interval is in minutes. If we wanted to configure a two-minute wait between the retry checks, we could add this directive to the host or service:
define host {
...
retry_interval 2
}
define service {
...
retry_interval 2
}
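Taken together, max_check_attempts and retry_interval determine roughly how long a problem must persist before a notification goes out. The following back-of-the-envelope sketch assumes the default interval_length of 60 seconds and ignores check execution time and scheduling latency; the function name minutes_until_hard_state is hypothetical.

```python
def minutes_until_hard_state(max_check_attempts, retry_interval, interval_length=60):
    """Approximate minutes from the first failed check to the HARD state.

    The first failure is attempt 1, so only the remaining attempts are
    separated by `retry_interval` units of `interval_length` seconds.
    """
    retries = max_check_attempts - 1
    seconds = retries * retry_interval * interval_length
    return seconds / 60


# Five attempts with a two-minute retry interval, as configured in this recipe.
delay = minutes_until_hard_state(max_check_attempts=5, retry_interval=2)
print(delay)  # 8.0
```

Under these assumptions, the host in this recipe would need to stay down for about eight minutes after the first failed check before a notification is sent.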