Specifying the number of failed checks before notification

In this recipe, we'll learn how to configure Nagios Core to send notifications about problems with a host or service only after a check has been repeated a certain number of times and has failed each time.

This can be an ideal arrangement for non-critical hosts that occasionally have "blips" or short outages and only become problematic if they remain down after repeated checks. It prevents sending notifications on the first check, only sending them if the problem turns out to be consistent.

Getting ready

You should have a Nagios Core 4.0 or newer server, with at least one host configured already. We'll use the example of sparta.example.net, a host defined in its own file. We'll arrange for it to send us notifications only after a total of five failed host checks.

How to do it...

We can configure the number of failed checks to be tolerated before sending a notification as follows:

  1. Change to the objects configuration directory for Nagios Core. The default is /usr/local/nagios/etc/objects. If you've put the definition for your host in a file in a different directory, move to that directory instead.
    # cd /usr/local/nagios/etc/objects
    
  2. Edit the file containing your host definition and find the definition within it. It may look something like this:
    define host {
        use                  linux-server
        host_name            sparta.example.net
        alias                sparta
        address              192.0.2.21
        notification_period  24x7
    }
  3. Add or edit the value for max_check_attempts to 5:
    define host {
        use                  linux-server
        host_name            sparta.example.net
        alias                sparta
        address              192.0.2.21
        notification_period  24x7
        max_check_attempts   5
    }
  4. Validate the configuration and reload the Nagios Core server:
    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # /etc/init.d/nagios reload
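
The validate-then-reload step can be wrapped so that the reload only happens when validation succeeds. The following is a minimal sketch, not part of Nagios Core: the function name reload_if_valid is hypothetical, the paths are the ones used in this recipe, and it assumes that nagios -v exits with a non-zero status when the configuration has errors.

```python
import subprocess

# Paths taken from this recipe; adjust them for your installation.
NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"
INIT_SCRIPT = "/etc/init.d/nagios"


def reload_if_valid(validate_cmd, reload_cmd):
    """Run validate_cmd, then run reload_cmd only if validation exited with 0.

    Hypothetical helper: both arguments are argument lists as accepted by
    subprocess.run. Returns True if the reload command was run.
    """
    check = subprocess.run(validate_cmd, capture_output=True, text=True)
    if check.returncode != 0:
        # Show the validator's output so the errors can be fixed.
        print(check.stdout)
        return False
    # check=True raises CalledProcessError if the reload itself fails.
    subprocess.run(reload_cmd, check=True)
    return True


# Example call (run as root on the Nagios Core server):
# reload_if_valid([NAGIOS_BIN, "-v", NAGIOS_CFG], [INIT_SCRIPT, "reload"])
```

Passing the commands in as arguments keeps the helper independent of where Nagios Core is installed.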
    

With this done, Nagios Core will not send a notification for the host entering the DOWN state until the check has been attempted a total of five times and has failed each time. In the following screenshot, note that even though four checks have already taken place and failed, no notification has been sent, and the state is listed as SOFT:

[Screenshot: host status after four failed checks, showing a SOFT DOWN state with no notification sent]

The SOFT state shows that while Nagios Core has flagged the host as DOWN, it will retry the check until it exhausts its max_check_attempts, at which point it will flag the host as being in a HARD DOWN state and send a notification.

However, if the host were to come back up before the next check, the state would change back to UP, without ever having sent a notification:

[Screenshot: host status returned to UP, with no notification sent]

How it works...

The preceding configuration alters the max_check_attempts directive for the host to specify the number of checks that need to have failed before the host flags a HARD DOWN or UNREACHABLE state and generates a notification event.
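
The SOFT and HARD behavior described above can be modeled with a short simulation. This is a simplified sketch for illustration only, not Nagios Core's actual scheduling code: the names CheckOutcome and simulate are hypothetical, only UP and DOWN are modeled, and repeat notifications for a problem that stays in a HARD state are ignored.

```python
from dataclasses import dataclass


@dataclass
class CheckOutcome:
    attempt: int      # current check attempt number
    state: str        # "UP" or "DOWN"
    state_type: str   # "SOFT" or "HARD"
    notify: bool      # whether this check would trigger a notification


def simulate(results, max_check_attempts):
    """Model a sequence of check results (True = check passed)."""
    outcomes = []
    state, state_type, attempt = "UP", "HARD", 1
    for ok in results:
        if ok:
            # A recovery notification only makes sense if the problem had
            # already reached a HARD state. Soft recoveries are not modeled
            # separately here; UP is always reported as HARD.
            notify = state == "DOWN" and state_type == "HARD"
            state, state_type, attempt = "UP", "HARD", 1
        else:
            was_hard_down = state == "DOWN" and state_type == "HARD"
            if state == "UP":
                attempt = 1          # first failure after being UP
            elif state_type == "SOFT":
                attempt += 1         # a retry that failed again
            state = "DOWN"
            if attempt >= max_check_attempts:
                state_type = "HARD"
                # Notify once, on the transition into the HARD state.
                notify = not was_hard_down
            else:
                state_type = "SOFT"
                notify = False
        outcomes.append(CheckOutcome(attempt, state, state_type, notify))
    return outcomes


# Five consecutive failures with max_check_attempts set to 5:
for outcome in simulate([False] * 5, max_check_attempts=5):
    print(outcome)
```

With five failures, only the fifth outcome is HARD with notify set to True. Replacing the last result with True, as in simulate([False] * 4 + [True], 5), yields no notification at all, which matches the recovery case described earlier.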

The process for changing the maximum number of attempts before a notification for a service is identical; we add the same directive and value to the definition for the service:

define service {
    use                  generic-service
    host_name            sparta.example.net
    service_description  HTTP
    check_command        check_http
    max_check_attempts   5
}

There's more...

The time between successive retry checks of a host or service, before any notification is sent, can also be customized with the retry_interval directive. Intervals are measured in units of the interval_length setting in nagios.cfg, which defaults to 60 seconds, so by default the value is in minutes. If we wanted to configure a two-minute wait between the retry checks, we could add this directive to the host or service:

define host {
    ...
    retry_interval  2
}

define service {
    ...
    retry_interval  2
}
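
Together, max_check_attempts and retry_interval determine roughly how long a problem must persist before a notification goes out. The following back-of-the-envelope sketch uses a hypothetical function name and assumes that each retry is scheduled exactly one retry_interval after the previous check. It ignores check execution time and scheduling latency.

```python
def seconds_until_hard_state(max_check_attempts, retry_interval, interval_length=60):
    """Approximate delay between the first failed check and the HARD state.

    The first failure counts as attempt 1, so only
    (max_check_attempts - 1) retries are still needed after it.
    interval_length is the number of seconds per interval unit
    (60 by default in nagios.cfg).
    """
    retries_needed = max_check_attempts - 1
    return retries_needed * retry_interval * interval_length


# Values from this recipe: five attempts, two-minute retry interval.
print(seconds_until_hard_state(5, 2))  # 480 seconds, i.e. 8 minutes
```

With the values used in this recipe, a notification would be sent about eight minutes after the first failed check, provided every retry also fails.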

See also

  • Specifying how frequently to check a host or service, Chapter 3, Working with Checks and States
  • Managing brief outages with flapping, Chapter 3, Working with Checks and States