In this recipe, we'll learn how to use Nagios Core's state flapping detection and handling to avoid sending excessive notifications when a host or service changes its state too frequently. This is useful when a host or service keeps switching between the OK, WARNING, and CRITICAL states. Nagios Core looks at the last 21 checks; if the percentage of state changes among them is too high, it suppresses further notifications and adds a comment to the host or service, showing that it is flapping.
Flap detection is normally enabled by default in Nagios Core, and it is part of the recommended generic-host host template and generic-service service template. It's therefore likely that it's already enabled on most servers, and we only need to check that it's still working.
You should have a Nagios Core 4.0 or newer server with at least one host and one service configured already. You should also have access to a working web interface for the Nagios Core server. It would be helpful if you are monitoring a test service that you can bring up and down to trigger the flap detection.
You should be familiar with the way that hosts and services change states, as a result of their checks, and the different states corresponding to hosts and services, to understand the basics of how flap detection works.
We can check whether flap detection is enabled for our Nagios Core server, our hosts, and our services as follows:
Change directory to /usr/local/nagios/etc:
    # cd /usr/local/nagios/etc
Edit the nagios.cfg file:
    # vi nagios.cfg
Look for the enable_flap_detection directive and verify that it is set to 1:
    enable_flap_detection=1
If it was not already set to 1, then after we've changed it, we will probably also need to at least temporarily disable the use_retained_program_state directive in the same file:
    use_retained_program_state=0
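As a quicker alternative to opening the file in an editor, the two global directives can be listed with grep. This is a minimal sketch: the function name show_flap_settings is hypothetical, and the default path is simply the one used in this recipe.

```shell
# Hypothetical helper: print the flap-related global directives from a
# Nagios Core main configuration file. Commented-out lines are not matched.
show_flap_settings() {
    # Assumption: the configuration lives at the path used in this recipe
    # unless another path is passed as the first argument.
    cfg=${1:-/usr/local/nagios/etc/nagios.cfg}
    grep -E '^[[:space:]]*(enable_flap_detection|use_retained_program_state)[[:space:]]*=' "$cfg"
}
```

Calling show_flap_settings with no arguments reads the default file; passing a path checks a different copy, such as a backup or a staging configuration.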
For each host or service that should use flap detection, check that at least one of the following is true:
The host or service inherits from a template that has the flap_detection_enabled directive set to 1. For example, both the generic-host and generic-service templates defined by default in /usr/local/nagios/etc/objects/templates.cfg do this.
The host or service itself has the flap_detection_enabled directive set to 1 in its own definition.
In the latter case, the configuration for the host or service might look as follows:
    define host {
        ...
        flap_detection_enabled  1
    }
    define service {
        ...
        flap_detection_enabled  1
    }
Validate the configuration and reload the Nagios Core server:
    # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    # /etc/init.d/nagios reload
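The validation and reload commands can be chained so that a broken configuration is never reloaded. The following is a sketch only: the function name safe_reload and the three variables are hypothetical, and their defaults are just the paths used in this recipe.

```shell
# Assumed locations, taken from this recipe; override them if yours differ.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios}

# Hypothetical helper: reload only when the configuration check exits with 0.
safe_reload() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        "$NAGIOS_INIT" reload
    else
        echo "Configuration check failed; not reloading." >&2
        return 1
    fi
}
```

Running safe_reload as root then either reloads the server or leaves the running configuration untouched and returns a non-zero status.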
The logic behind flap detection is actually quite complex. For our purposes, it suffices to say that it is based on whether a host or service has changed its state too often within its last 21 checks, with the thresholds usually expressed as a percentage of state change.
This is discussed in great detail, including the formulas that are used to determine the flapping state, in the Nagios Core 4.0 documentation, which is available online at http://nagios.sourceforge.net/docs/nagioscore/4/en/flapping.html.
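To give a feel for the percentage involved, here is a rough sketch of a weighted state-change calculation. It rests on assumptions drawn from my reading of that documentation, not on the Nagios Core source: 21 stored states give 20 possible changes, and each change is weighted linearly from 0.75 for the oldest to 1.25 for the newest. The function name flap_percent is hypothetical.

```shell
# Sketch: weighted percentage of state changes in a check history.
# Argument: space-separated states, oldest first (for example "0 0 2 0 ...").
# Assumption: weights rise linearly from 0.75 (oldest) to 1.25 (newest).
flap_percent() {
    echo "$1" | awk '{
        slots = NF - 1                      # number of possible state changes
        if (slots < 1) { print "0.00"; exit }
        total = 0
        for (i = 2; i <= NF; i++) {
            if ($i != $(i - 1)) {
                # Weight of this change; a single slot gets a neutral weight.
                w = (slots > 1) ? 0.75 + (i - 2) * 0.5 / (slots - 1) : 1
                total += w
            }
        }
        printf "%.2f\n", total * 100 / slots
    }'
}
```

With this weighting, a history that changes state on every check comes out at 100 percent, and one recent change counts for more than one old change. Nagios Core compares the resulting figure against the configured low and high flap thresholds to decide whether the host or service has started or stopped flapping.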
A common cause of flapping is that checks are too stringent. As an example, if you are checking that a busy shared web server's response time is less than 50 ms, then while the server is busy, checks might pass and fail without accurately reflecting whether the service is doing its job. In this case, it would be appropriate to loosen the check's thresholds so that it isn't quite so ready to flag a WARNING or CRITICAL state over things that aren't actually very worrisome. Flap detection can help diagnose these sorts of cases.
We can also enable or disable flap detection for a host or service via the web interface. In the details screen for a host, a menu item under Host Commands is labeled Enable/Disable flap detection for this host; in the details screen for a service, there's a corresponding item under Service Commands labeled Enable/Disable flap detection for this service.
These may be useful when we want to turn flap detection on or off for a particular host or service temporarily, perhaps because under certain circumstances it is or is not appropriate to use the feature. For a permanent setup and for clarity, it would be best to include it explicitly in the configuration as shown in this recipe.