Chapter 3.  Configuring Nagios

The previous chapter described how to set up and configure Nagios. Now that our Nagios system is up and running, we can move on to add hosts and services that should be monitored.

In this chapter, we will cover the following points:

  • Configuring Nagios
  • Understanding macro definitions
  • Configuring hosts and host groups
  • Configuring services and service groups
  • Configuring commands and time periods
  • Configuring contacts and contact groups
  • Verifying configuration
  • Understanding notifications
  • Templates and object inheritance

Configuring Nagios

In this chapter, you will learn about other ways in which the Nagios status can be checked as well as how Nagios itself can be managed.

Nagios configuration is stored in a separate directory. Usually it's either in /etc/nagios or /usr/local/etc/nagios. If you have followed the steps for manual installation, it will be in the /etc/nagios directory.

The default installation creates a sample host called localhost and a few services. We will now create additional hosts and services, and create a more robust directory structure to manage all of the objects.

The main Nagios configuration file is called nagios.cfg; it is the file that Nagios loads at startup.

Its syntax is simple: a line beginning with # is a comment, and every line of the form <parameter>=<value> sets a value. In some cases, a parameter may be repeated (for example, to specify additional files or directories to read).

The following is a sample of the Nagios main configuration file:

# log file to use 
log_file=/var/nagios/nagios.log 
# object configuration directory 
cfg_dir=/etc/nagios/objects 
# storage information 
resource_file=/etc/nagios/resource.cfg 
status_file=/var/nagios/status.dat 
status_update_interval=10 
(...) 
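The syntax rules described above can be sketched as a small parser. This is only an illustration, not how Nagios itself reads the file; the function name parse_main_config and the second cfg_dir path are hypothetical.

```python
from collections import defaultdict

def parse_main_config(text):
    """Parse nagios.cfg-style text into {parameter: [values, ...]}."""
    options = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a <parameter>=<value> line; ignored in this sketch
        # A parameter may be repeated, so every value is kept in a list.
        options[name.strip()].append(value.strip())
    return dict(options)

sample = """\
# log file to use
log_file=/var/nagios/nagios.log
# object configuration directories (second path is hypothetical)
cfg_dir=/etc/nagios/objects
cfg_dir=/etc/nagios/extra
status_update_interval=10
"""

config = parse_main_config(sample)
print(config["cfg_dir"])
```

Keeping a list per parameter is a deliberate choice here, because options such as cfg_dir may legitimately appear more than once.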

The main configuration file needs to define a log file to use, and this has to be the first option in the file. It also sets various parameters that tune the behavior and performance of Nagios. The following are some of the commonly used options:

| Option | Description |
|---|---|
| log_file | Specifies the log file to use; defaults to [localstatedir]/nagios.log |
| cfg_file | Specifies a configuration file to read for object definitions; may be specified multiple times |
| cfg_dir | Specifies a configuration directory in which all files should be read for object definitions; may be specified multiple times |
| resource_file | File that stores additional macro definitions; [sysconfdir]/resource.cfg |
| temp_file | Path to a temporary file that is used for temporary data; defaults to [localstatedir]/nagios.tmp |
| lock_file | Path to a file that is used for synchronization; defaults to [localstatedir]/nagios.lock |
| temp_path | Path to where Nagios can create temporary files; defaults to /tmp |
| status_file | Path to a file that stores the current status of all hosts and services; defaults to [localstatedir]/status.dat |
| status_update_interval | Specifies how often (in seconds) the status file should be updated; defaults to 10 |
| nagios_user | User to run the daemon as |
| nagios_group | Group to run the daemon as |
| command_file | Path to the external command file that is used by other processes to control the Nagios daemon; defaults to [localstatedir]/rw/nagios.cmd |
| use_syslog | Whether Nagios should log messages to syslog as well as to the Nagios log file; defaults to 1 (enabled) |
| state_retention_file | Path to a file that stores state information across shutdowns; defaults to [localstatedir]/retention.dat |
| retention_update_interval | How often (in seconds) the retention file should be updated; defaults to 60 |
| service_check_timeout | Number of seconds after which a service check is assumed to have failed; defaults to 60 |
| host_check_timeout | Number of seconds after which a host check is assumed to have failed; defaults to 30 |
| event_handler_timeout | Number of seconds after which an event handler is terminated; defaults to 30 |
| notification_timeout | Number of seconds after which a notification attempt is assumed to have failed; defaults to 30 |
| enable_environment_macros | Whether Nagios should pass all macros to plugins as environment variables; defaults to 1 (enabled) |
| interval_length | Specifies the number of seconds in a "unit interval"; defaults to 60, which means that an interval is one minute; changing this option is not recommended, as it may lead to undesirable behavior |

For a complete list of accepted parameters, refer to the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/.

The resource_file option names a file that stores user variables. These hold additional information that can be accessed in all object definitions. They usually contain sensitive data, because they can only be used in object definitions and their values cannot be read from the web interface. This makes it possible to hide passwords for various sensitive services from Nagios administrators without proper privileges. There can be up to 256 such macros, named $USER1$, $USER2$ ... $USER256$. The $USER1$ macro defines the path to the Nagios plugins and is commonly used in check command definitions.

The cfg_file and cfg_dir options specify where object definitions should be read from. The first option names a single file to read; the second names a directory, in which all files with the .cfg extension are read, including those in all of its child directories. Each file may contain different types of objects. The next section describes each type of definition that Nagios uses.
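The lookup rule for cfg_file and cfg_dir can be sketched as follows. This is an illustration under assumptions: the function name collect_cfg_files is hypothetical, and the sorting is only there to make the sketch deterministic; it says nothing about the order in which Nagios itself reads files.

```python
import os
import tempfile

def collect_cfg_files(cfg_files, cfg_dirs):
    """Return explicit cfg_file entries plus every *.cfg file under each cfg_dir."""
    found = list(cfg_files)
    for directory in cfg_dirs:
        # os.walk descends into all child directories.
        for root, dirs, files in os.walk(directory):
            dirs.sort()  # deterministic traversal for this sketch only
            for name in sorted(files):
                if name.endswith(".cfg"):
                    found.append(os.path.join(root, name))
    return found

# Demonstration on a throwaway directory tree (all names are hypothetical).
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "hosts", "web"))
    for rel in ("hosts/localhost.cfg", "hosts/web/www01.cfg", "hosts/notes.txt"):
        open(os.path.join(base, *rel.split("/")), "w").close()
    files = collect_cfg_files([], [os.path.join(base, "hosts")])
    relative = [os.path.relpath(p, base).replace(os.sep, "/") for p in files]

print(relative)
```

Note that notes.txt is skipped because only files with the .cfg extension are considered.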

One of the first things that needs to be planned is how your Nagios configuration should be stored. In order to create a configuration that will be maintainable as your IT infrastructure changes, it is worth investing some time to plan out how you want your host definitions set up and how that could be easily placed in a configuration file structure. Throughout this book, various approaches to make your configuration maintainable are discussed. It's also recommended to set up a small Nagios system to get a better understanding of the Nagios configuration before proceeding to larger setups.

Sometimes, it is best to have the configuration grouped into directories by the locations in which hosts and/or services are. In other cases, it might be best to keep the definitions of all servers with a similar functionality in one directory.

A good directory layout makes it much easier to control the Nagios configuration; for example, it lets you disable in bulk all objects related to a particular part of the IT infrastructure. Even though it is recommended to use downtimes, it is sometimes useful to just remove all entries from the Nagios configuration.

Throughout the configuration examples in this book, we will use the following directory structure. A separate directory is used for each object type, and similar objects are grouped within a single file. For example, all command definitions are stored in the commands/ subdirectory, and each host definition is stored in a hosts/<hostname>.cfg file.

For Nagios to read the configuration from these directories, edit your main Nagios configuration file (/etc/nagios/nagios.cfg), remove all the cfg_file and cfg_dir entries, and add the following ones:

cfg_dir=/etc/nagios/commands 
cfg_dir=/etc/nagios/timeperiods 
cfg_dir=/etc/nagios/contacts 
cfg_dir=/etc/nagios/contactgroups 
cfg_dir=/etc/nagios/hosts 
cfg_dir=/etc/nagios/hostgroups 
cfg_dir=/etc/nagios/services 
cfg_dir=/etc/nagios/servicegroups 

The next step is to create the directories by executing the following commands:

root@ubuntu:~# cd /etc/nagios
root@ubuntu:/etc/nagios# mkdir commands timeperiods contacts contactgroups hosts hostgroups services servicegroups

In order to use default Nagios plugins, copy the default Nagios command definition file /etc/nagios/objects/commands.cfg to /etc/nagios/commands/default.cfg. Also, make sure that the following options are set as follows in your nagios.cfg file:

check_external_commands=1
interval_length=60
accept_passive_service_checks=1
accept_passive_host_checks=1

If any of these options is set to a different value, change it; if an option is not present at all, add it to the end of the file. After making these changes in the Nagios setup, you can move on to the next sections and prepare a working configuration for your Nagios installation.
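As a rough aid, the check described above could be scripted. The following is a minimal sketch, assuming the four option values listed earlier; the function name find_mismatches and the sample text are hypothetical, and the script only reports differences rather than editing nagios.cfg.

```python
REQUIRED = {
    "check_external_commands": "1",
    "interval_length": "60",
    "accept_passive_service_checks": "1",
    "accept_passive_host_checks": "1",
}

def find_mismatches(config_text, required=REQUIRED):
    """Return {option: current value or None} for options that are missing or differ."""
    current = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep:
            current[name.strip()] = value.strip()
    return {
        name: current.get(name)
        for name, expected in required.items()
        if current.get(name) != expected
    }

# Hypothetical excerpt of a nagios.cfg file.
sample = "check_external_commands=0\ninterval_length=60\n"
problems = find_mismatches(sample)
print(problems)
```

A value of None in the result means the option is absent and should be appended to the file.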

Understanding macro definitions

The ability to use macro definitions is one of the key features of Nagios. They offer a lot of flexibility in object and command definitions. Nagios also provides custom macro definitions, which make it easier to use object templates for specifying parameters common to a group of similar objects.

All command definitions can use macros. Macro definitions allow parameters from other objects, such as hosts, services, and contacts, to be referenced so that a command does not need to have everything passed as an argument. Each macro invocation begins and ends with a $ sign.

A typical example is the HOSTADDRESS macro, which references the address field from the host object. All host definitions provide the value of the address parameter.

The following is a sample host and command definition:

  define host{ 
    host_name        somemachine 
    address          10.0.0.1 
    check_command    check-host-ssh 
    } 
 
  define command{ 
    command_name     check-host-ssh 
    command_line     $USER1$/check_ssh -H $HOSTADDRESS$ 
    } 

For this example, the following command will be invoked by Nagios to perform the check:

/opt/nagios/plugins/check_ssh -H 10.0.0.1

This check will validate whether it is possible to connect to the SSH service on the said machine. This is a simple and effective way to check machines that run an SSH service, particularly where ICMP echo (ping) requests are blocked by firewalls. Other ways to test connectivity for hosts and services are described in more detail in Chapter 6, Using the Nagios Plugins.

Both $USER1$ and $HOSTADDRESS$ are substituted before the command is run. $HOSTADDRESS$ is replaced with the address of the host, and $USER1$ expands to the path to the Nagios plugins directory. The latter is a macro definition that references data contained in the file named by the resource_file configuration directive.

Even though it is not required for the USER1 macro to point to the plugins directory, all standard command definitions that come with Nagios use this macro, so it is not recommended that you change it.
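The substitution described above can be sketched in a few lines. This is an illustration only, not Nagios' own macro processor; the function name expand_macros is hypothetical, and leaving unknown macros untouched is a choice made for this sketch.

```python
import re

# A macro invocation is a name wrapped in $ signs, for example $HOSTADDRESS$.
MACRO_PATTERN = re.compile(r"\$([A-Z0-9_]+)\$")

def expand_macros(command_line, macros):
    """Replace every $NAME$ in command_line with macros[NAME]."""
    def replace(match):
        name = match.group(1)
        # Unknown macros are left as-is in this sketch.
        return macros.get(name, match.group(0))
    return MACRO_PATTERN.sub(replace, command_line)

macros = {
    "USER1": "/opt/nagios/plugins",  # from the resource file
    "HOSTADDRESS": "10.0.0.1",       # from the host's address directive
}
result = expand_macros("$USER1$/check_ssh -H $HOSTADDRESS$", macros)
print(result)
```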

Some of the macro definitions are listed in the following table:

| Macro | Description |
|---|---|
| HOSTNAME | The short, unique name of the host; maps to the host_name directive in the host object |
| HOSTADDRESS | The IP address or hostname of the host; maps to the address directive in the host object |
| HOSTDISPLAYNAME | An alternate display name of the host; maps to the display_name directive in the host object |
| HOSTSTATE | The current state of the host (one of UP, DOWN, UNREACHABLE) |
| HOSTGROUPNAMES | Short names of all host groups a host belongs to, separated by a comma |
| LASTHOSTCHECK | The date and time of the last check of the host, as a Unix timestamp (number of seconds since 1970-01-01) |
| LASTHOSTSTATE | The last known state of the host (one of UP, DOWN, UNREACHABLE) |
| SERVICEDESC | Description of the service; maps to the service_description directive in the service object |
| SERVICESTATE | The current state of the service (one of OK, WARNING, UNKNOWN, CRITICAL) |
| SERVICEGROUPNAMES | Short names of all service groups a service belongs to, separated by a comma |
| CONTACTNAME | Short, unique name of the contact; maps to the contact_name directive in the contact object |
| CONTACTALIAS | Description of the contact; maps to the alias directive in the contact object |
| CONTACTEMAIL | The e-mail address of the contact; maps to the email directive in the contact object |
| CONTACTGROUPNAMES | Short names of all contact groups a contact belongs to, separated by a comma |

This table is not complete and only covers commonly used macro definitions. A complete list of available macros can be found in the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/

All macro definitions need to be prefixed and suffixed with a $ sign, for example $HOSTADDRESS$ to refer to the HOSTADDRESS macro definition.

Another interesting feature is on-demand macro definitions. These are macros that can reference any other object; when one is found in a command definition, it is parsed and substituted accordingly.

These macros accept one or more arguments inside the macro definition name, each passed after a colon. This is mainly used to read specific values, not related to the current object. In order to read the contact e-mail for the user jdoe, regardless of who the current contact person is, the macro would be $CONTACTEMAIL:jdoe$, which means getting a CONTACTEMAIL macro definition in the context of the jdoe contact.
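The lookup behind an on-demand macro can be sketched as follows. This is a simplified illustration, not Nagios' implementation; the function name resolve_on_demand, the contact data, and the e-mail address are hypothetical, and only two contact macros are handled.

```python
# Which contact directive each supported macro reads (subset, for illustration).
CONTACT_FIELDS = {"CONTACTEMAIL": "email", "CONTACTALIAS": "alias"}

def resolve_on_demand(macro, contacts):
    """Resolve a macro such as $CONTACTEMAIL:jdoe$ against a dict of contacts."""
    body = macro.strip("$")
    # The part after the colon names the object to read the value from.
    name, _, target = body.partition(":")
    return contacts[target][CONTACT_FIELDS[name]]

contacts = {"jdoe": {"alias": "John Doe", "email": "jdoe@example.com"}}
email = resolve_on_demand("$CONTACTEMAIL:jdoe$", contacts)
print(email)
```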

Nagios also offers custom macro definitions. These work in a way that administrators can define additional attributes in each type of an object, and that macro can then be used inside a command.

This is used to store additional parameters related to an object; for example, you can store a MAC address in a host definition and use it in certain types of host checks.

Simply start a directive inside an object with an underscore and write its name in uppercase. It can then be referenced in one of the following ways, based on the object type in which it is defined:

$_HOST<variable>$ - for directives defined within a host object 
$_SERVICE<variable>$ - for directives defined within a service object 
$_CONTACT<variable>$ - for directives defined within a contact object 

This can be used for any type of command to refer to a custom attribute of an object such as the following:

  define host{ 
    host_name        somemachine 
    address          10.0.0.1 
    _MAC             12:12:12:12:12:12 
    check_command    check-host-by-mac 
    } 

This defines a MAC address for a host that can be used inside commands.

A corresponding check command that uses this attribute inside a check is as follows:

  define command{ 
    command_name     check-host-by-mac 
    command_line     $USER1$/check_hostmac -H $HOSTADDRESS$ -m $_HOSTMAC$ 
    } 

$_HOSTMAC$ will be replaced with the value of the _MAC custom directive from the host for which the check is running.

It is also a good idea to prefix the custom variables with two underscores so that the actual reference also includes an underscore as shown here:

  define host{ 
    host_name        somemachine 
    address          10.0.0.1 
    __MAC            12:12:12:12:12:12 
    check_command    check-host-by-mac 
    } 

Then the MAC address can be referenced as $_HOST_MAC$ (rather than $_HOSTMAC$ from the earlier example) and is more readable when reading the configuration files.
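The naming rule for custom macros, including the two-underscore convention, can be expressed as a small helper. This is a sketch for illustration; the function name custom_macro_name is hypothetical.

```python
PREFIXES = {"host": "_HOST", "service": "_SERVICE", "contact": "_CONTACT"}

def custom_macro_name(object_type, directive):
    """Return the macro that references a custom directive of the given object type."""
    if not directive.startswith("_"):
        raise ValueError("custom directives must start with an underscore")
    # Only the first underscore marks the directive as custom;
    # any further underscores remain part of the variable name.
    variable = directive[1:].upper()
    return "$" + PREFIXES[object_type] + variable + "$"

print(custom_macro_name("host", "_MAC"))   # single underscore
print(custom_macro_name("host", "__MAC"))  # two underscores
```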

A majority of standard macro definitions are exported to check commands as environment variables so that plugins can access them like any other variable. Environment variable names are the same as macro names, but prefixed with NAGIOS_; for example, HOSTADDRESS is passed as the NAGIOS_HOSTADDRESS environment variable. For security reasons, the $USERn$ variables are not passed to commands as environment variables. It is also not possible to query on-demand or custom macro definitions.
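From a plugin's point of view, reading such a variable could look like the sketch below. The helper name read_nagios_macro is hypothetical, and a plain dictionary stands in for the real process environment so that the example runs anywhere.

```python
import os

def read_nagios_macro(name, environ=None):
    """Return the value of a macro exported as NAGIOS_<name>, or None if absent."""
    environ = os.environ if environ is None else environ
    return environ.get("NAGIOS_" + name)

# Stand-in for the environment a check command might receive.
fake_env = {"NAGIOS_HOSTADDRESS": "10.0.0.1"}

address = read_nagios_macro("HOSTADDRESS", fake_env)
user1 = read_nagios_macro("USER1", fake_env)  # $USERn$ macros are not exported
print(address, user1)
```

A plugin should therefore treat a missing variable as a normal case and fall back to command-line arguments.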

Configuring hosts

Hosts are objects that describe machines that should be monitored—either physical hardware or virtual machines. A host consists of a short name, a descriptive name, and an IP address or host name.

Host definition also specifies when and how the system should be monitored, as well as who will be contacted regarding any problem related to this host. It also describes how often the host should be checked, how retrying the checks should be handled, and details regarding how the notifications about problems should be sent out.

A sample definition of a host is as follows:

  define host{ 
    host_name                       linuxbox01 
    hostgroups                      linuxservers 
    alias                           Linux Server 01 
    address                         10.0.2.15 
    check_command                   check-host-alive 
    check_interval                  10 
    retry_interval                  1 
    max_check_attempts              5 
    check_period                    24x7 
    contact_groups                  linux-admins 
    notification_interval           30 
    notification_period             24x7 
    notification_options            d,u,r 
    } 

The preceding example defines a Linux box that will use the check-host-alive command to make sure it is up and running. The test will be performed every 10 minutes and after 5 failed tests it will assume that the host is down. If it is down, a notification will be sent out every 30 minutes.

The following table lists common directives that can be used to describe hosts; items in bold are required when specifying a host:

| Option | Description |
|---|---|
| host_name | The short, unique name of the host |
| alias | The descriptive name of the host |
| address | An IP address or a fully qualified domain name of the host; it is recommended to use an IP address, as all tests will fail if DNS servers are down |
| parents | The list of all parent hosts on which this host depends, separated by a comma; these are usually the switches and routers to which this host is directly connected |
| hostgroups | The list of all host groups this host should be a member of, separated by a comma |
| check_command | The short name of the command that should be used to test if the host is alive; if the command returns an OK state, the host is assumed to be up; otherwise, it is assumed to be down |
| check_interval | Specifies how often a check should be performed; the value is in minutes |
| retry_interval | Specifies how many minutes to wait before retesting if the host is up |
| max_check_attempts | Specifies how many times a test needs to report that a host is down before it is assumed to be down by Nagios |
| check_period | Specifies the name of the time period during which tests of whether the host is up should be performed |
| contacts | The list of all contacts that should receive notifications related to host state changes, separated by a comma; at least one contact or contact group needs to be specified for each host |
| contact_groups | The list of all contact groups that should receive notifications related to host state changes, separated by a comma; at least one contact or contact group needs to be specified for each host |
| first_notification_delay | Specifies the number of minutes before the first notification related to a host being down is sent out |
| notification_interval | Specifies the number of minutes before each subsequent notification related to a host being down is sent out |
| notification_period | Specifies the time period during which notifications related to host states should be sent out |
| notification_options | Specifies which notification types for host states should be sent, separated by a comma; should be one or more of the following: d (the host DOWN state), u (the host UNREACHABLE state), r (host recovery, UP state), f (the host starts and stops flapping), s (scheduled downtime starts or ends) |

For a complete list of accepted parameters, refer to the Nagios documentation available at:

http://library.nagios.com/library/products/nagioscore/manuals/

By default, Nagios assumes all host states to be UP. If the check_command option is not specified for a host, then its state will always be set to UP. When the command to perform host checks is specified, then scheduled checks will take place regularly and the host state will be monitored using the check_interval value as the number of minutes between checks.

Nagios uses soft and hard state logic to handle host states. If a host state has changed from UP to DOWN since the last hard state, Nagios assumes that the host is in the soft DOWN state and retries the test, waiting retry_interval minutes between tests. Once the result has been the same max_check_attempts times, Nagios assumes that the DOWN state is a hard state. The same mechanism applies to DOWN to UP transitions. Notifications are only sent if a host is in a hard state. This means that a temporary failure that only occurred for a single test will not cause a notification to be sent if max_check_attempts is set to a number higher than 1.
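The soft and hard state logic can be modelled with a short function. This is a deliberately simplified sketch of the description above, not Nagios' scheduler: it ignores timing, the UNREACHABLE state, and other details, and the function name classify_results is hypothetical.

```python
def classify_results(results, max_check_attempts):
    """Label each check result ("UP" or "DOWN") as a SOFT or HARD state."""
    hard_state = "UP"     # hosts are assumed to be UP initially
    pending_state = None  # state currently being retried, if any
    attempts = 0
    labelled = []
    for result in results:
        if result == hard_state:
            # Same as the last hard state: nothing needs to be confirmed.
            pending_state, attempts = None, 0
            labelled.append((result, "HARD"))
            continue
        # The result differs from the last hard state: count attempts.
        if result == pending_state:
            attempts += 1
        else:
            pending_state, attempts = result, 1
        if attempts >= max_check_attempts:
            # Confirmed often enough: the new state becomes the hard state.
            hard_state = result
            pending_state, attempts = None, 0
            labelled.append((result, "HARD"))
        else:
            labelled.append((result, "SOFT"))
    return labelled

history = classify_results(["UP", "DOWN", "UP", "DOWN", "DOWN", "DOWN"], 3)
for state, kind in history:
    print(state, kind)
```

In this run, the single failure in the second check stays soft, so it would not trigger a notification; only the third consecutive failure produces a hard DOWN state.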

The host object parents directive is used to define the topology of the network. Usually, this directive points to a switch, router, or any other device that is responsible for forwarding network packets. The host is assumed to be unreachable if the parent host is currently in the hard DOWN state. For example, if a router is down, then all machines accessed via it are considered unreachable and no tests will be performed on these hosts.

If your network consists of servers connected via switches and routers to a different network, then the switch will be the parent for all servers in the local network, as well as for the local router. The parent of the router on the other side of the link would be the local router.

The following diagram shows the actual network infrastructure and how Nagios hosts should be configured in terms of parents for each element of the network:

[Diagram: actual network topology (left) and the corresponding parent host setup (right)]

In the preceding diagram, the actual network topology is shown on the left and the parent host setup for each machine is shown on the right. Each arrow represents a mapping from a host to its parent host.

There is no need to define a parent for hosts that are directly on the network with your Nagios server. So, in this case, switch 1 should not have a parent host set.
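The effect of the parents directive on a failed host can be sketched as follows. This is a simplified model under assumptions: the host names are hypothetical, and treating a host with several parents as unreachable only when all of them are unavailable is an assumption of this sketch.

```python
def classify_failed_host(host, parents, unavailable):
    """Return "UNREACHABLE" or "DOWN" for a host whose check has failed.

    parents maps each host to its list of parent hosts; unavailable is the
    set of hosts currently considered DOWN or UNREACHABLE.
    """
    host_parents = parents.get(host, [])
    # Assumption: every parent must be unavailable for the host to be unreachable.
    if host_parents and all(p in unavailable for p in host_parents):
        return "UNREACHABLE"
    return "DOWN"

# Hypothetical topology: switch1 sits on the same network as the Nagios
# server, so it has no parent.
parents = {
    "server1": ["switch1"],
    "router1": ["switch1"],
    "router2": ["router1"],
    "remote1": ["router2"],
}

print(classify_failed_host("remote1", parents, {"router2"}))
print(classify_failed_host("server1", parents, set()))
```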

Some devices, such as switches, cannot easily be checked to see whether they are down. However, it is still a good idea to describe them as part of your topology. In that case, you might use functionality such as scheduled downtime to keep track of when the device is going to be offline, or mark it as down manually. This helps in determining other problems, because Nagios will not check any hosts that have a device currently scheduled for downtime, such as the router, somewhere along their path. This way, you will not receive multiple notifications caused by the scheduled downtime.

Check and notification periods specify the time periods during which checks of the host state and notifications are to be performed. These can be specified so that different hosts are monitored at different times.

It is also possible to record that a host is down without notifying anyone about it. This can be done by specifying a notification_period that tells Nagios when notifications may be sent out. No notifications will be sent outside this time period.

A typical example is a server that is only required during business hours and has a daily maintenance window between 10 P.M. and 4 A.M. You can set up Nagios so as not to monitor host availability outside business hours, or you can make Nagios monitor it, but without notifying that it is actually down. If monitoring is not done at all, then Nagios will perform fewer operations during this period. In the second case, it is possible to gather statistics on how much of the maintenance window is used, which can be used if changes to the window need to be made.

Configuring host groups

Nagios offers hostgroup objects, each of which is a group of one or more machines. This allows managing hosts or adding services to groups of hosts more efficiently.

A host might be a member of more than one host group. Usually, grouping is done by the type of machines, the location they are in, or the role of the machine. Each host group has a unique short name used to identify it, a descriptive name, and one or more hosts that are members of this group.

The following are examples of host group definitions that define two groups of hosts and a third group that combines both of them:

  define hostgroup{ 
    hostgroup_name                 linux-servers 
    alias                          Linux servers 
    members                        linuxbox01,linuxbox02 
    } 
 
  define hostgroup{ 
    hostgroup_name                 aix-servers 
    alias                          AIX servers 
    members                        aixbox1,aixbox2 
    } 
 
  define hostgroup{ 
    hostgroup_name                 unix-servers 
    alias                          UNIX servers 
    hostgroup_members              linux-servers,aix-servers 
    } 
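How hostgroup_members combines groups can be sketched with a small recursive function. This is an illustration only; the function name expand_hostgroup and the dictionary layout are hypothetical, while the group and host names are taken from the definitions above.

```python
def expand_hostgroup(name, hostgroups, _seen=None):
    """Return the set of hosts in a group, following hostgroup_members."""
    seen = set() if _seen is None else _seen
    if name in seen:
        return set()  # guard against groups that include each other
    seen.add(name)
    group = hostgroups[name]
    hosts = set(group.get("members", []))
    for child in group.get("hostgroup_members", []):
        hosts |= expand_hostgroup(child, hostgroups, seen)
    return hosts

hostgroups = {
    "linux-servers": {"members": ["linuxbox01", "linuxbox02"]},
    "aix-servers": {"members": ["aixbox1", "aixbox2"]},
    "unix-servers": {"hostgroup_members": ["linux-servers", "aix-servers"]},
}

unix_hosts = sorted(expand_hostgroup("unix-servers", hostgroups))
print(unix_hosts)
```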

The following table contains directives that can be used to describe host groups; items in bold are required when specifying a host group:

| Option | Description |
|---|---|
| hostgroup_name | The short, unique name of the host group |
| alias | The descriptive name of the host group |
| members | The list of all hosts that should be members of this group, separated by a comma |
| hostgroup_members | The list of all other host groups whose members should also be members of this group, separated by a comma |

Host groups can also be used when defining services or dependencies.

For example, it is possible to tell Nagios that all Linux servers should have their SSH services monitored and all AIX servers should have telnet accepting connections.

It is also possible to define dependencies between hosts. They are, in a way, similar to the parent-host relationship, but dependencies offer more complex configuration options. Nagios will only issue host and service checks if all the hosts that a host depends on are currently up. More details on dependencies can be found in Chapter 7, Advanced Configuration.

For the purpose of this book, we will define at least one host in our Nagios configuration directory structure. To be able to monitor the local server on which the Nagios installation is running, we will need to add its definition to the /etc/nagios/hosts/localhost.cfg file:

  define host{ 
    host_name                       localhost 
    alias                           Localhost 
    address                         127.0.0.1 
    check_command                   check-host-alive 
    check_interval                  5 
    retry_interval                  1 
    max_check_attempts              5 
    check_period                    24x7 
    contact_groups                  admins 
    notification_interval           60 
    notification_period             24x7 
    notification_options            d,u,r 
    } 

Although Nagios does not require a naming convention, it is a good practice to use the hostname as the name of the file. To make sure that Nagios monitoring works, it is also a good idea to set the address to a valid IP address of a local machine, such as 127.0.0.1, as stated in the preceding code, or the IP address in your network if it is static.

If you are planning on monitoring other servers as well, you will want to add them too; the recommended approach is to put each object definition in its own file.

Configuring services

Services are objects that describe a functionality that a particular host provides. This can be virtually anything: network servers such as NFS or FTP, or resources such as storage space or CPU load.

A service is always tied to the host on which it is running. It is also identified by its description, which needs to be unique within a particular host. A service also defines when and how Nagios should check if it is running properly and how to notify the people responsible for this service. A short example of a web server that is defined on the localhost machine created earlier is as follows:

  define service{ 
    host_name                      localhost 
    service_description            www 
    check_command                  check_http 
    check_interval                 10 
    check_period                   24x7 
    retry_interval                 3 
    max_check_attempts             3 
    notification_interval          30 
    notification_period            24x7 
    notification_options           w,c,u,r 
    contact_groups                 admins 
    } 

This definition tells Nagios to monitor that the web server is working correctly every 10 minutes. The recommended file for this definition is /etc/nagios/services/localhost-www.cfg. With services, a good approach is to use <host>-<servicename> as the name of the file if a single host or host group is being set up for monitoring.

The following table lists the common directives that can be used to describe a service; items in bold are required when specifying a service:

| Option | Description |
|---|---|
| host_name | The short names of the hosts on which the service is running; multiple host names should be separated by a comma |
| hostgroup_name | The short names of the host groups on which the service is running; multiple host group names should be separated by a comma |
| service_description | The description of the service that is used to uniquely identify services running on a host |
| servicegroups | The list of all service groups of which this service should be a member, separated by a comma |
| check_command | The short name of the command that should be used to test if the service is running |
| check_interval | Specifies how often a check should be performed; the value is in minutes |
| retry_interval | Specifies how many minutes to wait before retesting whether the service is working |
| max_check_attempts | Specifies how many times a test needs to report that a service is down before it is assumed to be down by Nagios |
| check_period | Specifies the name of the time period during which tests of whether the service is working should be performed |
| contacts | The list of all contacts that should receive notifications related to service state changes, separated by a comma; at least one contact or contact group needs to be specified for each service |
| contact_groups | The list of all contact groups that should receive notifications related to service state changes, separated by a comma; at least one contact or contact group needs to be specified for each service |
| first_notification_delay | Specifies the number of minutes before the first notification related to a service state change is sent out |
| notification_interval | Specifies the number of minutes before each subsequent notification related to a service not working correctly is sent out |
| notification_period | Specifies the time period during which notifications related to service states should be sent out |
| notification_options | Specifies which notification types for service states should be sent, separated by a comma; should be one or more of the following: w (the service WARNING state), u (the service UNKNOWN state), c (the service CRITICAL state), r (the service recovery, back to OK, state), f (the service starts and stops flapping), s (scheduled downtime starts or ends) |

For a complete list of accepted parameters, refer to the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/.

Nagios requires that at least one service is defined for every host, and that at least one service is defined overall for it to run. That is why we will now create a sample service in our configuration directory structure. For this purpose, we'll monitor the SSH protocol.

In order to monitor whether the SSH server is running on the machine where Nagios is installed, we will need to add its definition to the /etc/nagios/services/localhost-ssh.cfg file:

  define service{ 
    host_name                       localhost 
    service_description             ssh 
    check_command                   check_ssh 
    check_interval                  5 
    retry_interval                  1 
    max_check_attempts              3 
    check_period                    24x7 
    contact_groups                  admins 
    notification_interval           60 
    notification_period             24x7 
    notification_options            w,c,u,r 
    } 

If you are planning on monitoring other services as well, you will want to add a similar definition for each of them.
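
As an illustration, the following sketch adds a second service to the same host and uses the notification directives described in the preceding table. The service name www, the file it would be placed in (for example, /etc/nagios/services/localhost-www.cfg), and all of the chosen values are assumptions made for this example; check_http refers to the command defined later in this chapter:

  define service{
    host_name                       localhost
    service_description             www
    check_command                   check_http
    check_interval                  5
    retry_interval                  1
    max_check_attempts              3
    check_period                    24x7
    contact_groups                  admins
    first_notification_delay        10      ; wait 10 minutes before the first notification
    notification_interval           60      ; repeat every 60 minutes while the problem persists
    notification_period             24x7
    notification_options            c,r,f   ; CRITICAL, recovery, and flapping only
    }

With these values, WARNING and UNKNOWN states would not generate notifications for this service, even though they would still be visible in the web interface.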

In many cases the same services (such as SSH) should be monitored on multiple hosts. It is possible to define a service once and add it to multiple hosts or even specify host groups. The items should be separated using a comma, as shown here:

  define service{ 
    hostgroup_name                 linux-servers 
    host_name                      localhost,aix01 
    service_description            SSH 
    (...) 
    } 

It is also possible to specify hosts for which checks will not be performed by prefixing the host or host group with an exclamation mark (!). This is useful if a service is present on all hosts in a group except for a specific box. To specify that SSH should be checked on all Linux servers except for linuxbox01, as well as on the aix01 machine, a service definition similar to the following has to be created:

  define service{ 
    hostgroup_name                 linux-servers 
    host_name                      !linuxbox01,aix01 
    service_description            SSH 
    (...) 
    } 

Services may be configured to be dependent on one another, similar to hosts. In this case, Nagios will only perform checks on a service if all of the services it depends on are working correctly. More details on dependencies can be found in Chapter 7, Advanced Configuration.

Configuring service groups

Service objects can be grouped in a way similar to hosts. This makes services more convenient to manage and helps when checking service reports in the Nagios web interface. Service groups can also be used to configure dependencies more conveniently.

The following table describes the attributes that can be used to define a service group; items in bold are required when specifying a service group:

| Option | Description |
|---|---|
| servicegroup_name | The short, unique name of the service group |
| alias | The descriptive name of the service group |
| members | The list of all hosts and services that should be members of this group, separated by a comma |
| servicegroup_members | The list of all other service groups whose members should also be members of this group, separated by a comma |

The format of the members directive of a service group object is one or more <host>,<service> pairs.

An example of a service group is shown here:

  define servicegroup{ 
    servicegroup_name     databaseservices 
    alias                 All services related to databases 
    members               linuxbox01,mysql,linuxbox01,pgsql,aix01,db2 
    } 

This service group consists of the mysql and pgsql services on the linuxbox01 host and the db2 service on the aix01 machine. It is uniquely identified by its name, databaseservices.
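
The servicegroup_members directive from the table can be sketched in the same way. In the following example, the group name criticalservices and its alias are hypothetical; the group lists the ssh service on localhost directly and also pulls in every member of databaseservices without naming those services again:

  define servicegroup{
    servicegroup_name     criticalservices
    alias                 Services that need immediate attention
    members               localhost,ssh
    servicegroup_members  databaseservices
    }

Nesting groups this way means that a service added to databaseservices later should also show up in criticalservices without further changes.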

It is possible to specify the groups that a service should be a member of inside the service definition itself. To do this, list those groups in the servicegroups directive of the service definition. It is also possible to define an empty service group and have the service definitions specify which service groups they belong to. Observe the following example:

  define servicegroup{ 
    servicegroup_name     databaseservices 
    alias                 All services related to databases 
    } 
 
  define service{ 
    host_name             linuxbox01 
    service_description   mysql 
    check_command         check_ssh 
    servicegroups         databaseservices 
    } 

In most cases, this approach is easier to maintain. Listing the service groups inside each service definition, and defining the service groups themselves without explicit members, is especially convenient when the same service is created on multiple hosts, for example by using host groups, as explained earlier in this chapter.
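
As a minimal sketch of this combination, the following definition attaches a service to a host group and assigns it to a service group in one place. It assumes that the linux-servers host group from the earlier example and the empty databaseservices group exist; (...) stands for the remaining directives, such as the check command and notification settings:

  define service{
    hostgroup_name        linux-servers
    service_description   mysql
    servicegroups         databaseservices
    (...)
    }

Every host in linux-servers then gets its own mysql service, and each of those services becomes a member of databaseservices without the group definition having to list them.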

Configuring commands

Command definitions describe how host/service checks should be done. They are also used to specify how notifications about problems or event handlers should work.

Commands defined in Nagios tell it how to perform checks, such as what commands to run to check whether a database is working properly, how to check whether SSH, SMTP, or FTP servers are working properly, or whether the DHCP server is assigning IP addresses correctly.

Commands are also used in notifications to let users know of issues, or to try to recover from a problem automatically.

Nagios makes no distinction between commands provided by the Nagios plugins project and custom commands either created by a third party or written by you. Since the plugin interface is very straightforward, it is very easy to create your own checks.

Note

Chapter 13, Programming Nagios, talks about writing custom commands to perform tasks such as monitoring custom protocols and communicating with installed applications.

Commands are defined in a manner similar to other objects in Nagios. A command definition has two parameters, namely, name and command line. The first parameter is a name that is then used for defining checks and notifications. The second parameter is an actual command that will be run along with all parameters.

Commands are used by hosts and services. They define what to run when making sure a host or service is working properly. A check command is identified by its unique name.

When referenced from other object definitions, a command can also take additional arguments, with an exclamation mark used as the delimiter. Commands with arguments use the syntax command_name[!arg1][!arg2][!arg3][...].

A command name is often the same as the plugin that it runs, but it can be different. The command line includes macro definitions (such as $HOSTADDRESS$). Check commands also use $ARG1$, $ARG2$ ... $ARG32$ macros if a check command for a host or service passed additional arguments. The following is an example that defines a command to ping a host to make sure that it is working properly; it does not use any arguments:

  define command{ 
    command_name     check-host-alive 
    command_line     $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5 
    } 

A very short host definition that would use this check command could be similar to the one shown here:

  define host{ 
    host_name        somemachine 
    address          10.0.0.1 
    check_command    check-host-alive 
    } 

Such a check is usually done as part of the host checks. This allows Nagios to make sure that a machine is working properly if it responds to ICMP requests. Commands can also accept arguments, which offers a more flexible way of defining checks. A definition accepting arguments could be as follows:

  define command{ 
    command_name     check-host-alive-limits 
    command_line     $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5 
    } 

The corresponding host definition is as follows:

  define host{ 
    host_name        othermachine 
    address          10.0.0.2 
    check_command    check-host-alive-limits!3000.0,80%!5000.0,100% 
    } 

The following is another example that sets up a check command for a previously defined service:

  define command{ 
    command_name     check_http 
    command_line     $USER1$/check_http -H $HOSTADDRESS$ 
    } 

This check can be used when defining a service to be monitored by Nagios. Our Nagios configuration includes the default Nagios plugin definitions that we have previously copied into /etc/nagios/commands/default.cfg.
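
Arguments work the same way for service checks. The following is a sketch only: the command name check_http_port and the service name www-8080 are hypothetical, and it assumes a web server listening on port 8080. The -p option of the check_http plugin sets the port to connect to, and (...) stands for the remaining service directives:

  define command{
    command_name     check_http_port
    command_line     $USER1$/check_http -H $HOSTADDRESS$ -p $ARG1$
    }

  define service{
    host_name             localhost
    service_description   www-8080
    check_command         check_http_port!8080
    (...)
    }

Here the value after the exclamation mark is substituted for $ARG1$, so the same command definition can be reused for web servers on different ports.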

Note

Chapter 6, Using the Nagios Plugins, covers the standard Nagios plugins along with sample command definitions.

Configuring time periods

Time periods are definitions of dates and times during which an action should be performed or specific people should be notified. These describe ranges of days and times and can be reused across various operations. A time period definition consists of a name that uniquely identifies it in Nagios as well as a description. It also contains one or more days or dates along with time spans that define when a time period is valid.

A typical example of a time period would be working hours, which define that a valid time to perform an action is from Monday to Friday during business hours. Another example is weekends, meaning Saturday and Sunday, all day long. The following is a sample time period for working hours:

  define timeperiod{ 
    timeperiod_name  workinghours 
    alias            Working Hours, from Monday to Friday 
    monday           09:00-17:00 
    tuesday          09:00-17:00 
    wednesday        09:00-17:00 
    thursday         09:00-17:00 
    friday           09:00-17:00 
    } 

This particular example tells Nagios that a valid time to perform an action is from every Monday to Friday between 9 A.M. and 5 P.M. Each entry in a time period contains information on the date or weekday. It also contains a range of hours. Nagios first checks if the current date matches any of the dates specified. If it does, then it compares whether the current time matches the time ranges specified for that particular date.

There are multiple ways that a date can be specified. Depending on what type of date it is, one definition might take precedence over another. For example, a definition for December 24 is more important than a generic definition that on every weekday an action should be performed between 9 A.M. and 5 P.M.

Possible date types are mentioned here:

  • Calendar date: For example, 2015-11-01, which means November 1, 2015; Nagios accepts dates in the YYYY-MM-DD format
  • Date recurring every year: For example, July 4, which means the 4th of July every year
  • Specific day within a month: For example, day 14, which means the 14th day of every month
  • Specific weekday along with offset in a month: For example, Monday 1 September, which means the first Monday in September; Monday -1 May would mean the last Monday in May
  • Specific weekday in all months: For example, Monday 1, which means the first Monday of every month
  • Weekday: For example, Monday, which means all Mondays

The preceding list is ordered by the precedence that Nagios gives to the different date types, from highest to lowest. This means that a date recurring every year will always be used in preference to an entry describing what should be done every Monday.
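
To make the date types more concrete, the following sketch uses one entry of each type in a single time period. The period name, alias, and all time ranges are made up for this illustration:

  define timeperiod{
    timeperiod_name      workinghours-special
    alias                Working hours with date-specific overrides
    2015-12-24           09:00-12:00   ; calendar date
    july 4               10:00-14:00   ; date recurring every year
    day 14               09:00-15:00   ; 14th day of every month
    monday 1 september   12:00-17:00   ; first Monday in September
    monday 1             10:00-17:00   ; first Monday of every month
    monday               09:00-17:00   ; every Monday
    }

On the first Monday in September, three of these entries match. Following the order described above, the monday 1 september entry would be the one that applies, so the period would be valid from 12:00 to 17:00 on that day.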

In order to be able to correctly configure all objects, we will now create some standard time periods that will be used in the configuration. The following example periods will be used in the remaining sections of this chapter, and it is recommended that you put them in the /etc/nagios/timeperiods/default.cfg file:

  define timeperiod{ 
    timeperiod_name      workinghours 
    alias                Working Hours, from Monday to Friday 
    monday               09:00-17:00 
    tuesday              09:00-17:00 
    wednesday            09:00-17:00 
    thursday             09:00-17:00 
    friday               09:00-17:00 
    } 
 
  define timeperiod{ 
    timeperiod_name      weekends 
    alias                Weekends all day long 
    saturday             00:00-24:00 
    sunday               00:00-24:00 
    } 
 
  define timeperiod{ 
    timeperiod_name      24x7 
    alias                24 hours a day 7 days a week 
    monday               00:00-24:00 
    tuesday              00:00-24:00 
    wednesday            00:00-24:00 
    thursday             00:00-24:00 
    friday               00:00-24:00 
    saturday             00:00-24:00 
    sunday               00:00-24:00 
    } 

The last time period is also used by the SSH service defined earlier. This way, monitoring the SSH server will be done all the time.

It is also possible to define multiple periods of time by separating them with a comma, as shown here:

  define timeperiod{ 
    timeperiod_name      workinghours 
    alias                Working Hours, excluding lunch break 
    monday               09:00-13:00,14:00-17:00 
    tuesday              09:00-13:00,14:00-17:00 
    wednesday            09:00-13:00,14:00-17:00 
    thursday             09:00-13:00,14:00-17:00 
    friday               09:00-13:00,14:00-17:00 
    } 

It is also possible to have one time period excluded whenever another period is active, as shown here:

  define timeperiod{ 
    timeperiod_name      first-mondays 
    alias                First Mondays of each month 
    monday 1 january     00:00-24:00 
    monday 1 february    00:00-24:00 
    monday 1 march       00:00-24:00 
    monday 1 april       00:00-24:00 
    monday 1 may         00:00-24:00 
    monday 1 june        00:00-24:00 
    monday 1 july        00:00-24:00 
    monday 1 august      00:00-24:00 
    monday 1 september   00:00-24:00 
    monday 1 october     00:00-24:00 
    monday 1 november    00:00-24:00 
    monday 1 december    00:00-24:00 
    } 
 
  define timeperiod{ 
    timeperiod_name      workinghours-without-first-monday 
    alias                Working Hours, without first Monday of each month 
    monday               09:00-17:00 
    tuesday              09:00-17:00 
    wednesday            09:00-17:00 
    thursday             09:00-17:00 
    friday               09:00-17:00 
    exclude              first-mondays 
    } 

The second time period will include all working days except for the first Monday of each month.
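
The "specific weekday in all months" date type listed earlier suggests a more compact way of writing the first of these periods. As a sketch, assuming that date type behaves as described, the twelve per-month entries could be collapsed into a single line:

  define timeperiod{
    timeperiod_name      first-mondays
    alias                First Mondays of each month
    monday 1             00:00-24:00
    }

The workinghours-without-first-monday period would not need to change, because it refers to first-mondays only by name.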

Configuring contacts

Contacts define people who can either be owners of specific machines or people who should be contacted in case of problems. Depending on how your organization might contact people in case of problems, a definition of a contact may vary a lot. A contact consists of a unique name, a descriptive name, one or more e-mail addresses, and pager numbers. Contact definitions can also contain additional data specific to how a person can be contacted.

A basic contact definition is shown here, and specifies the unique contact name, an alias, and the contact information. It also specifies event types that the person should receive and time periods during which notifications should be sent:

  define contact{ 
    contact_name                   jdoe 
    alias                          John Doe 
    email                          jdoe@example.com 
    contactgroups                  admins,nagiosadmin 
    host_notification_period       workinghours 
    service_notification_period    workinghours 
    host_notification_options      d,u,r 
    service_notification_options   w,u,c,r 
    host_notification_commands     notify-host-by-email 
    service_notification_commands  notify-service-by-email 
    } 

The contactgroups line defines that this user is a member of the admins and nagiosadmin groups, which we'll define later in this chapter. Contact groups work in a way similar to host and service groups: either a contact defines the groups it belongs to, or a contact group definition specifies the users that belong to the group.

We will now create a similar file in /etc/nagios/contacts, setting values for contact_name, alias, and email based on your username, full name, and e-mail address. The recommended name for the file is based on contact_name.

The following table describes all available directives when defining a contact; items in bold are required when specifying a contact:

| Option | Description |
|---|---|
| contact_name | The short, unique name of the contact |
| alias | The descriptive name of the contact; usually, this is the full name of the person |
| contactgroups | The list of all contact groups of which this user should be a member, separated by a comma |
| host_notifications_enabled | Specifies whether this person should receive notifications regarding the host state |
| host_notification_period | Specifies the name of the time period that should be used to determine the time during which a person should receive notifications regarding the host state |
| host_notification_commands | Specifies one or more commands that should be used to notify the person of a host state, separated by a comma |
| host_notification_options | Specifies host states about which the user should be notified, separated by a comma; should be one or more of the following: d (the host DOWN state); u (the host UNREACHABLE state); r (the host recovery, UP state); f (the host starts and stops flapping); s (scheduled downtime starts or ends); n (the person will not receive any host notifications) |
| service_notifications_enabled | Specifies whether this person should receive notifications regarding the service state |
| service_notification_period | Specifies the name of the time period that should be used to determine the time during which a person should receive notifications regarding the service state |
| service_notification_commands | Specifies one or more commands that should be used to notify the person of a service state, separated by a comma |
| service_notification_options | Specifies service states about which the user should be notified, separated by a comma; should be one or more of the following: w (the service WARNING state); u (the service UNKNOWN state); c (the service CRITICAL state); r (the service recovery, OK state); f (the service starts and stops flapping); n (the person will not receive any service notifications) |
| email | Specifies the e-mail address of the contact |
| pager | Specifies the pager number of the contact; it can also be an e-mail to the pager gateway |
| address1 ... address6 | Six additional addresses that can be specified for the contact; these can be anything, based on how the notification commands will use these fields |
| can_submit_commands | Specifies whether the user is allowed to execute commands from the Nagios web interface |
| retain_status_information | Specifies whether the status-related information about this person is retained across restarts |
| retain_nonstatus_information | Specifies whether the non-status information about this person should be retained across restarts |

For a complete list of accepted parameters, refer to the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/
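
As a sketch of how some of the less common directives fit together, the following contact only receives service notifications, is reached through a pager gateway, and is not allowed to submit commands from the web interface. The contact name, the pager address, and the notify-service-by-pager command are hypothetical and would have to be replaced with values and command definitions that exist in your configuration:

  define contact{
    contact_name                   oncall
    alias                          On-call engineer
    pager                          5551234@pager.example.com   ; hypothetical pager gateway address
    host_notifications_enabled     0
    service_notifications_enabled  1
    host_notification_period       24x7
    service_notification_period    24x7
    host_notification_options      n
    service_notification_options   c,r
    host_notification_commands     notify-host-by-email
    service_notification_commands  notify-service-by-pager     ; hypothetical command
    can_submit_commands            0
    }

The enabled and can_submit_commands directives take 0 or 1 as their value.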

Contacts are also mapped to the users that log into the Nagios web interface. This means that all operations done via the interface will be logged as that particular user, and that the web interface will use the access granted to particular contact objects when evaluating if an operation should be allowed. The contact_name field from a contact object maps to usernames in the Nagios web interface.

Configuring contact groups

Similar to other object types, contacts can also be grouped. Usually, grouping is used to keep track of which users are responsible for which tasks, and groups map to the job responsibilities of particular people. It also makes it possible to define the people who should be responsible for handling problems during specific time periods, so that Nagios automatically contacts the right people for the time at which a problem occurs.

A sample definition of a contact group is as follows:

  define contactgroup{ 
    contactgroup_name              linux-admins 
    alias                          Linux Administrators 
    members                        jdoe,asmith 
    } 

This group can also be referenced as the contact group of a host and a service, for example, the linuxbox01 host and a WWW service. In that case, both the jdoe and asmith contacts will receive information on the status of that host and service.

The following table is a complete list of directives that can be used to describe contact groups; items in bold are required when specifying a contact group:

| Option | Description |
|---|---|
| contactgroup_name | The short, unique name of the contact group |
| alias | The descriptive name of the contact group |
| members | The list of all contacts that should be members of this group, separated by a comma |
| contactgroup_members | The list of all other contact groups whose members should also be members of this group, separated by a comma |

The members of a contact group can be specified either in the contact group definition or by using the contactgroups directive in a contact definition. It is also possible to combine both methods: some of the members can be specified in the contact group definition and others can be specified in their contact object definition.
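
The contactgroup_members directive can be sketched as follows. The group name all-admins and its alias are hypothetical; the group lists one contact directly and includes the members of two other groups, regardless of whether those members were assigned in the group definition or through the contactgroups directive of a contact:

  define contactgroup{
    contactgroup_name              all-admins
    alias                          All administrators
    members                        asmith
    contactgroup_members           linux-admins,admins
    }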

Contacts are used to specify who will be contacted if a status changes for one or more hosts or services. Nagios accepts both contacts and contact groups in host and service definitions. This makes it possible to make either specific people or entire groups responsible for particular machines or services.

It is also possible to specify different people or groups for handling host-related problems and service-related problems, for example, hardware administrators for handling host problems and system administrators for handling service issues.
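
A minimal sketch of such a split is shown here. The hardware-admins group is hypothetical and would need its own contactgroup definition; (...) stands for the remaining host and service directives:

  define host{
    host_name             linuxbox01
    contact_groups        hardware-admins
    (...)
    }

  define service{
    host_name             linuxbox01
    service_description   mysql
    contact_groups        linux-admins
    (...)
    }

With these definitions, problems with the machine itself would be reported to hardware-admins, while problems with the mysql service would be reported to linux-admins.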

In order for our previously created user jdoe to work properly, we need to define the admins and nagiosadmin groups in the /etc/nagios/contactgroups/admins.cfg file:

  define contactgroup{ 
    contactgroup_name              admins 
    alias                          System administrators 
    } 
 
  define contactgroup{ 
    contactgroup_name              nagiosadmin 
    alias                          Nagios administrators 
    } 

Verifying the configuration

At this point, our configuration file should be ready for use. We can now verify that all of the configuration statements are correct and that Nagios would start correctly with our configuration. We can do this by running the nagios command with the -v option.

The -v option will try to load the configuration and all of the objects into Nagios and validate that they are defined properly. This is meant to detect any configuration errors or other issues that would prevent Nagios from starting with the configuration file; this is especially useful in order to test configuration before restarting Nagios, as restarting Nagios with an invalid configuration will cause it to stop working in most cases.

For example, the following is the output of checking a valid configuration file:

root@ubuntu:~# /opt/nagios/bin/nagios -v /etc/nagios/nagios.cfg
Nagios Core 4.1.1
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-19-2015
License: GPL
Website: http://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...
Running pre-flight check on configuration data...
Checking services...
        Checked 1 services.
Checking hosts...
        Checked 1 hosts.
Checking host groups...
        Checked 1 host groups.
Checking service groups...
        Checked 1 service groups.
Checking contacts...
        Checked 1 contacts.
Checking contact groups...
        Checked 1 contact groups.
Checking commands...
        Checked 24 commands.
Checking time periods...
        Checked 3 time periods.
Checking for circular paths...
        Checked 1 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 3 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors:   0
Things look okay - No serious problems were detected during the pre-flight check

The preceding output indicates a correct configuration file.

If there are errors, the message will indicate the problem, as shown in the following output:

root@ubuntu:~# /opt/nagios/bin/nagios -v /etc/nagios/nagios.cfg
Nagios Core 4.1.1
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-19-2015
License: GPL
Website: http://www.nagios.org
Reading configuration data...
   Read main config file okay...
Error: Contactgroup 'admin' is not defined anywhere
Error: Could not add contactgroup 'admin' to service (config file '/etc/nagios/services/localhost-www.cfg', starting on line 1)
   Error processing object config files!
***> One or more problems was encountered while processing the config files...
     Check your configuration file(s) to ensure that they contain valid
     directives and data defintions.  If you are upgrading from a previous
     version of Nagios, you should be aware that some variables/definitions
     may have been removed or modified in this version.  Make sure to read
     the HTML documentation regarding the config files, as well as the
     'Whats New' section to find out what has changed.

The preceding example indicates that the contactgroup value admin is not valid for a service defined in the /etc/nagios/services/localhost-www.cfg file.

Note

It is always recommended that you verify the Nagios configuration file after making changes to ensure that it does not prevent Nagios from functioning properly.

Even though the /etc/init.d/nagios script prevents restarting when the configuration is incorrect, an invalid configuration would still cause Nagios not to start after a system restart.

Understanding notifications

Notifications are meant to let people know that something is either wrong or has returned to the normal way of functioning. This is a very important functionality in Nagios, and configuring notifications correctly might seem a bit tricky in the beginning.

When and how notifications are sent out is configured as part of the contact configuration. Each contact has configuration directives for when notifications can be sent out and how the contact wants to be reached. Contacts also contain contact details, such as a telephone number, e-mail address, and Jabber/MSN address. Each host and service is configured with information about when notifications about it should be sent and who should be contacted. Nagios then combines all this information in order to notify people of changes in the status.

Notifications may be sent out in any of the following situations:

  • Host has changed its state to the DOWN or UNREACHABLE state; a notification is sent out after the first_notification_delay number of minutes specified in the corresponding host object
  • Host remains in the DOWN or UNREACHABLE state; a notification is sent out every notification_interval number of minutes specified in the corresponding host object
  • Host recovers to an UP state; a notification is sent out immediately and only once
  • Host starts or stops flapping; a notification is sent out immediately
  • Host remains flapping; a notification is sent out every notification_interval number of minutes specified in the corresponding host object
  • Service has changed its state to the WARNING, CRITICAL, or UNKNOWN states; a notification is sent out after the first_notification_delay number of minutes specified in the corresponding service object
  • Service remains in the WARNING, CRITICAL, or UNKNOWN states; a notification is sent out every notification_interval number of minutes specified in the corresponding service object
  • Service recovers to an OK state; a notification is sent out immediately and only once
  • Service starts or stops flapping; a notification is sent out immediately
  • Service remains flapping; a notification is sent out every notification_interval number of minutes specified in the corresponding service object

If one of these conditions occurs, Nagios starts evaluating whether information about it should be sent out and to whom.

The first check is to compare the current date and time against the notification time period, which is taken from the notification_period directive of the current host or service definition. A notification will be sent out only if the time period includes the current date and time.

Next, a list of users based on the contacts and contact_groups fields is created. Based on all members of all groups and included groups, as well as all contacts directly bound with the current host or service, a complete list of users is made.

Each of the matched users is then checked to see whether they should be notified about the current event. In this case, each user's time period is also checked to see whether it includes the current date and time. The host_notification_period or service_notification_period directive is used, depending on whether the notification is for the host or the service.

For host notifications, the host_notification_options directive for each contact is also used to determine whether that particular person should be contacted; for example, different users might be contacted about an unreachable host than about a host that is actually down. For service notifications, the service_notification_options directive is used to check whether each user should be notified about the issue.

If all of these criteria have been met, then Nagios will send a notification to this user. It will now use commands specified in the host_notification_commands and service_notification_commands directives.
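
The way these checks combine can be sketched with a shortened host definition and a variation of the jdoe contact. The values are chosen for illustration only, and (...) stands for the remaining directives:

  define host{
    host_name                  linuxbox01
    contact_groups             admins
    notification_period        24x7           ; the host allows notifications at any time
    notification_options       d,u,r
    (...)
    }

  define contact{
    contact_name               jdoe
    contactgroups              admins
    host_notification_period   workinghours   ; jdoe is only notified during working hours
    host_notification_options  d,r            ; and only about DOWN and recovery
    (...)
    }

If linuxbox01 goes DOWN on a Saturday, the host's own time period allows the notification, but jdoe's time period does not, so jdoe would not be notified. If the host becomes UNREACHABLE on a weekday morning, both time periods match, but the event is filtered out by jdoe's host_notification_options.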

It is possible to specify multiple commands that will be used for notifications, so it is possible to set up Nagios so that it sends both e-mail as well as a message on an instant messaging or chat system such as XMPP or Slack.
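
As a sketch, a contact that is notified over two channels could look as follows. The notify-host-by-xmpp and notify-service-by-xmpp command names are hypothetical; they would have to be defined as command objects that call whatever tool your messaging system provides, and (...) stands for the remaining contact directives:

  define contact{
    contact_name                   jdoe
    host_notification_commands     notify-host-by-email,notify-host-by-xmpp
    service_notification_commands  notify-service-by-email,notify-service-by-xmpp
    (...)
    }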

Nagios also offers escalations that allow sending e-mails to other people when a problem has not been resolved for too long. This can be used to propagate problems to the higher management or teams that might be affected by unresolved problems. It is a very powerful mechanism and is split between the host and service-based escalations. This functionality is described in more detail in Chapter 8, Notifications and Events.
