Managing brief outages with flapping

In this recipe, we'll learn how to use Nagios Core's flap detection to avoid sending excessive notifications when a host or service changes state too frequently. This is useful when a host or service keeps switching between the OK, WARNING, and CRITICAL states. Nagios Core looks at the last 21 checks; if the percentage of state changes among them is too high, it suppresses further notifications and adds a comment to the host or service, showing that it is flapping.

Flap detection is normally enabled in Nagios Core, and it is part of the recommended generic-host host template and generic-service service template. It's therefore likely to be enabled already on most servers, and we only need to check that it's still working.

Getting ready

You should have a Nagios Core 4.0 or newer server with at least one host and one service configured already. You should also have access to a working web interface for the Nagios Core server. It would be helpful if you are monitoring a test service that you can bring up and down to trigger the flap detection.

To understand the basics of how flap detection works, you should be familiar with the way hosts and services change state as a result of their checks, and with the different states that hosts and services can be in.

How to do it...

We can check whether flap detection is enabled for our Nagios Core server, our hosts, and our services as follows:

  1. Change to the configuration directory for Nagios Core. The default location is /usr/local/nagios/etc:
    # cd /usr/local/nagios/etc
    
  2. Edit the nagios.cfg file.
    # vi nagios.cfg
    
  3. Look for an existing definition for the enable_flap_detection directive and verify that it is set to 1:
    enable_flap_detection=1
  4. If it was not set to 1 and we've just changed it, we will probably also need to disable the use_retained_program_state directive in the same file, at least temporarily, so that the previously retained setting does not override the new value:
    use_retained_program_state=0
  5. Open the file that defines our particular hosts or services. We should verify that at least one of the following is the case:

    The host or service inherits from a template that has the flap_detection_enabled directive set to 1. For example, both the generic-host and generic-service templates defined by default in /usr/local/nagios/etc/objects/templates.cfg do this.

    The host or service itself has the flap_detection_enabled directive set to 1 in its own definition.

    In the latter case, the configuration for the host or service might look as follows:

    define host {
        ...
        flap_detection_enabled 1
    }
    define service {
        ...
        flap_detection_enabled 1
    }
  6. If any of the preceding configuration was changed, validate the new configuration and reload the Nagios Core server:
    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # /etc/init.d/nagios reload
    
  7. In the web interface, verify that for each host or service for which flap detection is wanted, the Flap Detection field in the details for that host or service reads ENABLED.
  8. When you are done with the preceding steps, if a host or service changes its state too frequently within 21 checks, it will be flagged as flapping and will appear with a custom icon in the views of that host or service.
  9. A comment is also placed on the host or service, explaining what happened for the benefit of anyone who views it in the web interface and who might be wondering why the notifications have stopped.
  10. There will also be an indicator in the details for the host or service showing whether it is currently flapping.
  11. For hosts or services that don't have flap detection enabled, this particular field simply reads N/A, and the Flap Detection field below it will display DISABLED.
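
The check in steps 3 and 4 can also be scripted. The following is a minimal Python sketch, not part of Nagios Core: it parses name=value lines from text in the style of nagios.cfg and reports the two directives discussed above. The function name read_main_directives and the sample text are assumptions made for illustration; in practice the text would be read from /usr/local/nagios/etc/nagios.cfg.

```python
def read_main_directives(config_text):
    """Parse 'name=value' lines from nagios.cfg-style text into a dict.

    If a directive appears more than once, the last value wins, which is
    fine for single-valued directives such as enable_flap_detection.
    """
    directives = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines, which start with '#'.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            directives[name.strip()] = value.strip()
    return directives


# Hypothetical excerpt standing in for the real nagios.cfg file.
SAMPLE_CONFIG = """
# Flap detection
enable_flap_detection=1
use_retained_program_state=0
"""

directives = read_main_directives(SAMPLE_CONFIG)
flap_detection_on = directives.get("enable_flap_detection") == "1"
print("enable_flap_detection:", directives.get("enable_flap_detection"))
print("use_retained_program_state:", directives.get("use_retained_program_state"))
print("flap detection enabled globally:", flap_detection_on)
```

This only covers the global switch in nagios.cfg; the per-host and per-service flap_detection_enabled directive from step 5 still has to be checked in the object definitions or their templates.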

How it works...

The logic behind flap detection is actually quite complex. For our purposes, it suffices to say that flap detection is based on whether a host or service has changed its state too often within its last 21 checks, with the thresholds expressed as a percentage of the possible state changes.

This is discussed in great detail, including the formulas that are used to determine the flapping state, in the Nagios Core 4.0 documentation, which is available online at http://nagios.sourceforge.net/docs/nagioscore/4/en/flapping.html.
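
To make the idea concrete, here is a simplified Python sketch of the calculation. It is not the formula Nagios Core uses: the real logic gives recent state changes more weight than older ones, while this sketch counts every change equally. The function names, the example history, the threshold values of 5.0 and 20.0, and the exact comparisons at the thresholds are all assumptions made for illustration.

```python
HISTORY_LENGTH = 21  # number of most recent check results considered


def percent_state_change(history):
    """Return the unweighted percentage of state changes in the history.

    Simplification: every change counts equally here, whereas Nagios Core
    weights recent changes more heavily than older ones.
    """
    recent = history[-HISTORY_LENGTH:]
    if len(recent) != HISTORY_LENGTH:
        raise ValueError("need at least %d check results" % HISTORY_LENGTH)
    # 21 results leave room for at most 20 changes between neighbours.
    changes = sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)
    return 100.0 * changes / (HISTORY_LENGTH - 1)


def update_flapping(was_flapping, percent, low_threshold, high_threshold):
    """Two-threshold rule: start flapping at the high threshold, stop at
    the low one, and otherwise keep the previous flapping state."""
    if percent >= high_threshold:
        return True
    if percent <= low_threshold:
        return False
    return was_flapping


# Hypothetical history of 21 service states containing 7 state changes.
history = (
    ["OK"] * 3 + ["CRITICAL", "OK"] + ["CRITICAL"] * 4 + ["OK"] * 3
    + ["WARNING"] * 4 + ["OK"] * 3 + ["CRITICAL"] * 2
)

percent = percent_state_change(history)
print(percent)  # 7 changes out of 20 possible: 35.0
print(update_flapping(False, percent, low_threshold=5.0, high_threshold=20.0))
```

Using two thresholds means a host or service that has been flagged as flapping stays flagged until its rate of change drops well below the level that triggered the flag, rather than toggling around a single cutoff.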

There's more...

A common cause of flapping is checks that are too stringent. For example, if you check that a busy shared web server responds in less than 50 ms, the check might pass and fail repeatedly while the server is busy, without accurately reflecting whether the service is doing its job. In this case, it would be appropriate to loosen the check's own thresholds, for example by raising the response time limits, so that it isn't quite so ready to flag a WARNING or CRITICAL state over things that aren't actually very worrisome. Flap detection can help diagnose these sorts of cases.

We can also enable or disable flap detection for a host or service via the web interface. In the details screen for a host, an item labeled Enable/Disable flap detection for this host is available under Host Commands; in the details screen for a service, there's a corresponding item labeled Enable/Disable flap detection for this service under Service Commands.

These may be useful when we want to turn flap detection on or off for a particular host or service temporarily, perhaps because under certain circumstances it is or is not appropriate to use the feature. For a permanent setup and for clarity, it would be best to include it explicitly in the configuration as shown in this recipe.
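
The same temporary change can also be made with Nagios Core's external commands ENABLE_HOST_FLAP_DETECTION, DISABLE_HOST_FLAP_DETECTION, ENABLE_SVC_FLAP_DETECTION, and DISABLE_SVC_FLAP_DETECTION. The following Python sketch only builds the command line and prints it. The helper name flap_detection_command and the host and service names are hypothetical. To take effect, the line would have to be appended to the external command file, which is assumed here to be at the default /usr/local/nagios/var/rw/nagios.cmd.

```python
import time


def flap_detection_command(enable, host_name, service_description=None, now=None):
    """Build one external command line that toggles flap detection.

    Without a service description the command targets the host itself;
    with one, it targets that service on the host.
    """
    action = "ENABLE" if enable else "DISABLE"
    # External commands are prefixed with a Unix timestamp in brackets.
    timestamp = int(time.time() if now is None else now)
    if service_description is None:
        return "[%d] %s_HOST_FLAP_DETECTION;%s\n" % (timestamp, action, host_name)
    return "[%d] %s_SVC_FLAP_DETECTION;%s;%s\n" % (
        timestamp, action, host_name, service_description)


# Hypothetical host and service names, for illustration only.
print(flap_detection_command(False, "web01"), end="")
print(flap_detection_command(True, "web01", "HTTP"), end="")
```

As with the web interface items, a change made this way is a runtime change, so the explicit configuration shown in this recipe remains the better place for a permanent setting.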

See also

  • The Adjusting flapping percentage thresholds for a service section in this chapter
  • Tolerating a certain number of failed checks, Chapter 4, Configuring Notifications
  • Adding comments on hosts or services in a web interface, Chapter 7, Using the Web Interface