Chapter 3. Working with Checks and States

In this chapter, we will cover the following recipes:

  • Specifying how frequently to check a host or service
  • Changing thresholds for PING RTT and packet loss
  • Changing thresholds for disk usage
  • Scheduling downtime for a host or service
  • Managing brief outages with flapping
  • Adjusting flapping percentage thresholds for a service

Introduction

Once hosts and services are configured in Nagios Core, its behavior is primarily dictated by the checks it makes to ensure that they are operating as expected, and by the state it assigns to each host or service as a result of those checks.

How often it's appropriate to check hosts and services, and on what basis it's appropriate to flag one as problematic, depends very much on the nature of the service and how important it is that it runs all the time. If a host on the other side of the world is being checked with PING and its round trip time exceeds 100 ms during busy periods, this may not be a cause for concern at all, and perhaps is not worth flagging as a WARNING state, let alone a CRITICAL one.

However, if the same host were on the local network, where a round trip time of less than 10 ms would be expected, a round trip time of more than 100 ms could well be a grave cause for concern, perhaps signaling a packet storm or another problem with the local network. In such cases, we would want to notify the appropriate administrators immediately. Similarly, for services such as web servers, we may not be concerned by a response time of more than a second for a page on a busy, budget shared web host. However, if a corporate website or a dedicated colocation customer's site responded that slowly, it might be important to notify the web server administrator.
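The idea that the same measurement can mean different things for different hosts can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not how Nagios Core or its plugins are implemented; the names `PingThresholds` and `classify_rtt` and the threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PingThresholds:
    """Warning and critical limits for round trip time, in milliseconds."""
    warning_ms: float
    critical_ms: float


def classify_rtt(rtt_ms: float, limits: PingThresholds) -> str:
    """Map a measured round trip time to a state name for the given limits."""
    if rtt_ms >= limits.critical_ms:
        return "CRITICAL"
    if rtt_ms >= limits.warning_ms:
        return "WARNING"
    return "OK"


# Hypothetical limits: generous for a distant host, strict for a local one.
REMOTE = PingThresholds(warning_ms=300.0, critical_ms=600.0)
LOCAL = PingThresholds(warning_ms=10.0, critical_ms=100.0)

if __name__ == "__main__":
    # The same 120 ms measurement is judged against each set of limits.
    for label, limits in (("remote", REMOTE), ("local", LOCAL)):
        print(label, classify_rtt(120.0, limits))
```

With these assumed limits, a 120 ms round trip is OK for the remote host but CRITICAL for the local one, which is the distinction the per-host thresholds in this chapter are meant to capture.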

Therefore, not all hosts and services are equal. Nagios Core provides several ways to define their behavior more precisely:

  • How often a host or service should be checked with its appropriate check_command
  • How bad a check's results have to be before a WARNING or CRITICAL problem is flagged, if at all
  • When a host or service is scheduled to be down, so that Nagios Core knows not to expect it to operate during that period, often for upgrades or other maintenance
  • Whether to automatically tolerate flapping, that is, hosts and services that seem to go up and down a lot

This chapter uses some common problems involving these behaviors as examples of how to configure them.
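To make the notion of flapping more concrete, the following Python sketch measures how often a series of check results changes state and applies a pair of low and high thresholds to decide whether the host or service counts as flapping. It is a deliberately simplified model: every state change is counted equally, and the function names and threshold values are assumptions made for this example rather than part of Nagios Core.

```python
from typing import Sequence


def percent_state_change(states: Sequence[str]) -> float:
    """Percentage of adjacent check results whose state differs."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def update_flapping(is_flapping: bool, percent: float,
                    low: float, high: float) -> bool:
    """Start flapping above the high threshold, stop below the low one."""
    if not is_flapping and percent > high:
        return True
    if is_flapping and percent < low:
        return False
    # Between the thresholds, the previous flapping status is kept.
    return is_flapping


if __name__ == "__main__":
    # Hypothetical history of recent check results, oldest first.
    history = ["OK", "CRITICAL", "OK", "OK", "CRITICAL", "OK"]
    percent = percent_state_change(history)
    print(percent, update_flapping(False, percent, low=20.0, high=50.0))
```

Using two thresholds instead of one means that a value hovering around a single limit does not itself cause the flapping status to toggle repeatedly.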
