The previous chapter described how to set up and configure Nagios. Now that our Nagios system is up and running, we can move on to add hosts and services that should be monitored.
In this chapter, we will cover the main Nagios configuration file, macro definitions, and the definitions of hosts, host groups, services, service groups, and commands.
Nagios configuration is stored in a separate directory. Usually it's either in /etc/nagios or /usr/local/etc/nagios. If you have followed the steps for manual installation, it will be in the /etc/nagios directory.
The default installation creates a sample host called localhost and a few services. We will now create additional hosts and services, and create a more robust directory structure to manage all of the objects.
The main Nagios configuration file is called nagios.cfg, and it is the file that is loaded during Nagios startup.
Its syntax is simple: a line beginning with # is a comment, and all lines in the form of <parameter>=<value> set a value. In some cases, a parameter might be repeated (such as when specifying additional files or directories to read).
The following is a sample of the Nagios main configuration file:
    # log file to use
    log_file=/var/nagios/nagios.log
    # object configuration directory
    cfg_dir=/etc/nagios/objects
    # storage information
    resource_file=/etc/nagios/resource.cfg
    status_file=/var/nagios/status.dat
    status_update_interval=10
    (...)
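The syntax rules above can be sketched in a few lines of Python. This is only an illustration of the file format, not how Nagios itself parses the file; the function name parse_main_config and the sample paths are assumptions made for the example. Repeated parameters are collected into a list.

```python
from collections import defaultdict


def parse_main_config(text):
    """Parse nagios.cfg-style text into {parameter: [values, ...]}."""
    options = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (lines beginning with '#').
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so that values may contain '='.
        name, separator, value = line.partition("=")
        if not separator:
            continue  # not a <parameter>=<value> line; ignored in this sketch
        options[name.strip()].append(value.strip())
    return dict(options)


# Hypothetical sample content; cfg_dir is repeated on purpose.
sample = """\
# log file to use
log_file=/var/nagios/nagios.log
# object configuration directories
cfg_dir=/etc/nagios/hosts
cfg_dir=/etc/nagios/services
status_update_interval=10
"""

config = parse_main_config(sample)
print(config["cfg_dir"])
```

Keeping every value in a list makes repeated options such as cfg_dir and single-valued options such as log_file look the same to the caller.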
The main configuration file needs to define a log file to use and that has to be passed as the first option in the file. It also configures various Nagios parameters that allow tuning its behavior and performance. The following are some of the commonly used options:
Option | Description |
---|---|
log_file | Specifies the log file to use; has a default value |
cfg_file | Specifies the configuration file to read for object definitions; might be specified multiple times |
cfg_dir | Specifies the configuration directory where all files in it should be read for object definitions; might be specified multiple times |
resource_file | File that stores additional macro definitions |
temp_file | Path to a temporary file that is used for temporary data; has a default value |
lock_file | Path to a file that is used for synchronization; has a default value |
temp_path | Path to where Nagios can create temporary files; has a default value |
status_file | Path to a file that stores the current status of all hosts and services; has a default value |
status_update_interval | Specifies how often (in seconds) the status file should be updated |
nagios_user | User to run the daemon |
nagios_group | Group to run the daemon |
command_file | Specifies the path to the external command file that is used by other processes to control the Nagios daemon; has a default value |
use_syslog | Whether Nagios should log messages to syslog as well as to the Nagios log file; has a default value |
state_retention_file | Path to a file that stores state information across shutdowns; has a default value |
retention_update_interval | How often (in seconds) the retention file should be updated; has a default value |
service_check_timeout | After how many seconds a service check should be assumed to have failed; has a default value |
host_check_timeout | After how many seconds a host check should be assumed to have failed; has a default value |
event_handler_timeout | After how many seconds an event handler should be terminated; has a default value |
notification_timeout | After how many seconds a notification attempt should be assumed to have failed; has a default value |
enable_environment_macros | Whether Nagios should pass all macros to plugins as environment variables; has a default value |
interval_length | Specifies the number of seconds a "unit interval" is; has a default value |
For a complete list of accepted parameters, refer to the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/.
The Nagios resource_file option defines a file to store user variables. This file can be used to store additional information that can be accessed in all object definitions. These usually contain sensitive data as they can only be used in object definitions and it is not possible to read their values from the web interface. This makes it possible to hide passwords of various sensitive services from Nagios administrators without proper privileges. There can be up to 256 macros, named $USER1$, $USER2$ ... $USER256$. The $USER1$ macro defines the path to Nagios plugins and is commonly used in check command definitions.
The cfg_file and cfg_dir options are used to specify files that should be read for object definitions. The first option specifies a single file to read, and the second one specifies a directory in which all files with the .cfg extension are read, including those in all child directories. Each file may contain different types of objects. The next section describes each type of definition that Nagios uses.
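To make the cfg_dir behavior concrete, the following Python sketch collects every .cfg file below a directory, descending into child directories. It illustrates the rule described above rather than reproducing Nagios code; the function name find_cfg_files and the temporary directory tree are made up for the example.

```python
import os
import tempfile


def find_cfg_files(cfg_dir):
    """Return all *.cfg files under cfg_dir, including child directories."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(cfg_dir):
        for name in filenames:
            if name.endswith(".cfg"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


# Build a throwaway directory tree to demonstrate the walk.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "hosts", "linux"))
    for relative in ("hosts/localhost.cfg",
                     "hosts/linux/linuxbox01.cfg",
                     "hosts/notes.txt"):
        with open(os.path.join(root, *relative.split("/")), "w") as handle:
            handle.write("# placeholder\n")
    cfg_names = [os.path.relpath(path, root).replace(os.sep, "/")
                 for path in find_cfg_files(root)]

print(cfg_names)
```

Only the two .cfg files are reported; notes.txt is skipped because of its extension.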
One of the first things that needs to be planned is how your Nagios configuration should be stored. In order to create a configuration that will be maintainable as your IT infrastructure changes, it is worth investing some time to plan out how you want your host definitions set up and how that could be easily placed in a configuration file structure. Throughout this book, various approaches to make your configuration maintainable are discussed. It's also recommended to set up a small Nagios system to get a better understanding of the Nagios configuration before proceeding to larger setups.
Sometimes, it is best to have the configuration grouped into directories by the locations in which hosts and/or services are. In other cases, it might be best to keep the definitions of all servers with a similar functionality in one directory.
A good directory layout makes it much easier to control the Nagios configuration; for example, to disable all objects related to a particular part of the IT infrastructure at once. Even though it is recommended to use downtimes, it is sometimes useful to just remove all entries from the Nagios configuration.
Throughout all the configuration examples in this book, we will use the following directory structure: a separate directory is used for each object type, and similar objects are grouped within a single file. For example, all command definitions are stored in the commands/ subdirectory, and each host definition is stored in its own hosts/<hostname>.cfg file.
For Nagios to read the configuration from these directories, edit your main Nagios configuration file (/etc/nagios/nagios.cfg), remove all the cfg_file and cfg_dir entries, and add the following ones:

    cfg_dir=/etc/nagios/commands
    cfg_dir=/etc/nagios/timeperiods
    cfg_dir=/etc/nagios/contacts
    cfg_dir=/etc/nagios/contactgroups
    cfg_dir=/etc/nagios/hosts
    cfg_dir=/etc/nagios/hostgroups
    cfg_dir=/etc/nagios/services
    cfg_dir=/etc/nagios/servicegroups
The next step is to create the directories by executing the following commands:
    root@ubuntu:~# cd /etc/nagios
    root@ubuntu:/etc/nagios# mkdir commands timeperiods contacts contactgroups hosts hostgroups services servicegroups
In order to use default Nagios plugins, copy the default Nagios command definition file /etc/nagios/objects/commands.cfg to /etc/nagios/commands/default.cfg. Also, make sure that the following options are set as follows in your nagios.cfg file:

    check_external_commands=1
    interval_length=60
    accept_passive_service_checks=1
    accept_passive_host_checks=1
If any of these options is set to a different value, change it; if an option is not currently present, add it to the end of the file. After making these changes in the Nagios setup, you can move on to the next sections and prepare a working configuration for your Nagios installation.
The ability to use macro definitions is one of the key features of Nagios. They offer a lot of flexibility in object and command definitions. Nagios also provides custom macro definitions, which make it easier to use object templates for specifying parameters common to a group of similar objects.
All command definitions can use macros. Macro definitions allow parameters from other objects, such as hosts, services, and contacts, to be referenced so that a command does not need to have everything passed as an argument. Each macro invocation begins and ends with a $ sign.
A typical example is the HOSTADDRESS macro, which references the address field from the host object. All host definitions provide the value of the address parameter.
The following is a sample host and command definition:
    define host{
        host_name       somemachine
        address         10.0.0.1
        check_command   check-host-ssh
    }

    define command{
        command_name    check-host-ssh
        command_line    $USER1$/check_ssh -H $HOSTADDRESS$
    }
For this example, the following command will be invoked by Nagios to perform the check:
/opt/nagios/plugins/check_ssh -H 10.0.0.1
This check will validate whether it is possible to connect to the SSH service on the said machine. This is a simple and effective way to check machines that have the SSH service present, as SSH access is rarely blocked by firewalls. Other ways to test connectivity for hosts and services are described in more detail in Chapter 6, Using the Nagios Plugins.
Both $USER1$ and $HOSTADDRESS$ will be substituted appropriately. The USER1 macro was expanded to the path of the Nagios plugins directory; it references data contained in the file that is passed as the resource_file configuration directive.
Even though it is not required for the USER1 macro to point to the plugins directory, all standard command definitions that come with Nagios use this macro, so it is not recommended that you change it.
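The substitution step can be pictured with a short Python sketch. It is a simplified model, not the Nagios implementation: the function name expand_macros is hypothetical, the macro values are taken from the example above, and leaving unknown macros untouched is a choice made for this sketch only.

```python
import re

# Values assumed for the example: USER1 would come from the resource file
# and HOSTADDRESS from the address directive of the host object.
macros = {
    "USER1": "/opt/nagios/plugins",
    "HOSTADDRESS": "10.0.0.1",
}


def expand_macros(command_line, values):
    """Replace each $NAME$ with its value; unknown macros are left as-is."""
    def substitute(match):
        return values.get(match.group(1), match.group(0))
    return re.sub(r"\$([A-Za-z0-9_:]+)\$", substitute, command_line)


expanded = expand_macros("$USER1$/check_ssh -H $HOSTADDRESS$", macros)
print(expanded)
```

The result is the same command line that was shown earlier for the somemachine host.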
Some of the macro definitions are listed in the following table:
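Macro | Description |
---|---|
HOSTNAME | The short, unique name of the host; maps to the host_name directive in the host object |
HOSTADDRESS | The IP or hostname of the host; maps to the address directive in the host object |
HOSTALIAS | Description of the host; maps to the alias directive in the host object |
HOSTSTATE | The current state of the host (one of UP, DOWN, or UNREACHABLE) |
HOSTGROUPNAMES | Short names of all host groups a host belongs to, separated by a comma |
LASTHOSTCHECK | The date and time of the last check of the host, as a Unix timestamp (number of seconds since 1970-01-01) |
LASTHOSTSTATE | The last known state of the host (one of UP, DOWN, or UNREACHABLE) |
SERVICEDESC | Description of the service; maps to the service_description directive in the service object |
SERVICESTATE | The current state of the service (one of OK, WARNING, UNKNOWN, or CRITICAL) |
SERVICEGROUPNAMES | Short names of all service groups a service belongs to, separated by a comma |
CONTACTNAME | Short, unique name of the contact; maps to the contact_name directive in the contact object |
CONTACTALIAS | Description of the contact; maps to the alias directive in the contact object |
CONTACTEMAIL | The e-mail address of the contact; maps to the email directive in the contact object |
CONTACTGROUPNAMES | Short names of all contact groups a contact belongs to, separated by a comma |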
This table is not complete and only covers commonly used macro definitions. A complete list of available macros can be found in the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/.
All macro definitions need to be prefixed and suffixed with a $ sign, for example $HOSTADDRESS$ to refer to the HOSTADDRESS macro definition.
Another interesting functionality is on-demand macro definitions. These are macros that allow the referencing of any other object and if found in a command definition, will be parsed and substituted accordingly.
These macros accept one or more arguments inside the macro definition name, each passed after a colon. This is mainly used to read specific values that are not related to the current object. In order to read the contact e-mail for the user jdoe, regardless of who the current contact person is, the macro would be $CONTACTEMAIL:jdoe$, which means getting the CONTACTEMAIL macro definition in the context of the jdoe contact.
Nagios also offers custom macro definitions. These work in a way that administrators can define additional attributes in each type of an object, and that macro can then be used inside a command.
This is used to store additional parameters related to an object; for example, you can store a MAC address in a host definition and use it in certain types of host checks.
Simply start a directive inside an object with an underscore and write its name in uppercase. It can then be referenced in one of the following ways, based on the object type it is defined in:

    $_HOST<variable>$ - for directives defined within a host object
    $_SERVICE<variable>$ - for directives defined within a service object
    $_CONTACT<variable>$ - for directives defined within a contact object
This can be used for any type of command to refer to a custom attribute of an object such as the following:
    define host{
        host_name       somemachine
        address         10.0.0.1
        _MAC            12:12:12:12:12:12
        check_command   check-host-by-mac
    }
This defines a MAC address for a host that can be used inside commands.
A corresponding check command that uses this attribute inside a check is as follows:
    define command{
        command_name    check-host-by-mac
        command_line    $USER1$/check_hostmac -H $HOSTADDRESS$ -m $_HOSTMAC$
    }

$_HOSTMAC$ will be replaced with the value of the _MAC custom directive from the host the check is running for.
It is also a good idea to prefix the custom variables with two underscores so that the actual reference also includes an underscore as shown here:
    define host{
        host_name       somemachine
        address         10.0.0.1
        __MAC           12:12:12:12:12:12
        check_command   check-host-by-mac
    }

Then the MAC address can be referenced as $_HOST_MAC$ (rather than $_HOSTMAC$ from the earlier example), which is more readable in the configuration files.
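The naming rule for custom macros is mechanical, so it can be written down as a small Python helper. The function custom_macro is hypothetical and exists only to show how the directive name and the object type combine into the macro name.

```python
def custom_macro(object_type, directive):
    """Build the macro that refers to a custom directive.

    object_type is "host", "service", or "contact"; directive is the name
    as written in the object definition, including its leading underscore.
    """
    if not directive.startswith("_"):
        raise ValueError("custom directives must start with an underscore")
    # Drop the first underscore; the rest is appended to the object prefix.
    return "$_{}{}$".format(object_type.upper(), directive[1:].upper())


print(custom_macro("host", "_MAC"))   # $_HOSTMAC$
print(custom_macro("host", "__MAC"))  # $_HOST_MAC$
```

The second call shows why the double underscore helps: the extra underscore survives and separates the object type from the variable name.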
A majority of standard macro definitions are exported to check commands as environment variables so that the plugins can access them as any other variable. Environment variable names are the same as macros, but are prefixed with NAGIOS_; for example, HOSTADDRESS is passed as the NAGIOS_HOSTADDRESS environment variable. For security reasons, the $USERn$ variables are not passed to commands as environment variables. It is also not possible to query on-demand or custom macro definitions.
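From the point of view of a plugin, reading such a value is an ordinary environment lookup. The following Python sketch assumes that Nagios is configured to export macros to the environment; the function name and the simulated environment are made up for the example.

```python
import os


def host_address_from_environment(environ=None):
    """Return the host address exported by Nagios, or None when absent."""
    if environ is None:
        environ = os.environ
    return environ.get("NAGIOS_HOSTADDRESS")


# Simulate the environment that a check command might receive.
simulated = {"NAGIOS_HOSTADDRESS": "10.0.0.1"}
address = host_address_from_environment(simulated)
print(address)
```

Returning None instead of raising an error lets a plugin fall back to a command-line argument when the variable is not set.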
Hosts are objects that describe machines that should be monitored—either physical hardware or virtual machines. A host consists of a short name, a descriptive name, and an IP address or host name.
Host definition also specifies when and how the system should be monitored, as well as who will be contacted regarding any problem related to this host. It also describes how often the host should be checked, how retrying the checks should be handled, and details regarding how the notifications about problems should be sent out.
A sample definition of a host is as follows:
    define host{
        host_name               linuxbox01
        hostgroups              linux-servers
        alias                   Linux Server 01
        address                 10.0.2.15
        check_command           check-host-alive
        check_interval          10
        retry_interval          1
        max_check_attempts      5
        check_period            24x7
        contact_groups          linux-admins
        notification_interval   30
        notification_period     24x7
        notification_options    d,u,r
    }

The preceding example defines a Linux box that will use the check-host-alive command to make sure it is up and running. The test will be performed every 10 minutes and after 5 failed tests it will assume that the host is down. If it is down, a notification will be sent out every 30 minutes.
The following is a table of common directives that can be used to describe hosts; items in bold are required when specifying a host:

Option | Description |
---|---|
**host_name** | The short, unique name of the host |
**alias** | The descriptive name of the host |
**address** | An IP address or a fully qualified domain name of the host; it is recommended to use an IP address as all tests will fail if DNS servers are down |
parents | The list of all parent hosts on which this host depends, separated by a comma; this is usually one or more switches and routers to which this host is directly connected |
hostgroups | The list of all host groups this host should be a member of, separated by a comma |
check_command | The short name of the command that should be used to test if the host is alive; if the command returns an OK state, the host is assumed to be up; it is assumed to be down otherwise |
check_interval | Specifies how often a check should be performed; the value is in minutes |
retry_interval | Specifies how many minutes to wait before retesting if the host is up |
**max_check_attempts** | Specifies how many times a test needs to report that a host is down before it is assumed to be down by Nagios |
**check_period** | Specifies the name of the time period during which tests of whether the host is up should be performed |
**contacts** | The list of all contacts that should receive notifications related to host state changes, separated by a comma; at least one contact or contact group needs to be specified for each host |
**contact_groups** | The list of all contact groups that should receive notifications related to host state changes, separated by a comma; at least one contact or contact group needs to be specified for each host |
first_notification_delay | Specifies the number of minutes before the first notification related to a host being down is sent out |
**notification_interval** | Specifies the number of minutes before each next notification related to a host being down is sent out |
**notification_period** | Specifies time periods during which notifications related to host states should be sent out |
notification_options | Specifies which notification types for host states should be sent, separated by a comma; should be one or more of the following: d (the host DOWN state), u (the host UNREACHABLE state), r (host recovery, that is, the UP state), f (the host starts or stops flapping), s (scheduled downtime starts or ends) |
For a complete list of accepted parameters, refer to the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/.
By default, Nagios assumes all host states to be UP. If the check_command option is not specified for a host, then its state will always be set to UP. When the command to perform host checks is specified, then scheduled checks will take place regularly and the host state will be monitored using the check_interval value as the number of minutes between checks.
Nagios uses a soft and hard state logic to handle host states. Therefore, if a host state has changed from UP to DOWN since the last hard state, then Nagios assumes that the host is in the soft DOWN state and performs retries of the test, waiting for retry_interval minutes between each test. Once the result is the same after max_check_attempts tests, Nagios assumes that the DOWN state is a hard state. The same mechanisms apply for DOWN to UP transitions. Notifications are also only sent if a host is in a hard state. This means that a temporary failure that only occurred for a single test will not cause a notification to be sent if max_check_attempts was set to a number higher than 1.
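The soft and hard state logic can be modeled with a small Python generator. This is a deliberately simplified sketch of the behavior described above, not the actual Nagios scheduler: the function name track_host_state is hypothetical, timing (check_interval and retry_interval) is ignored, and only the counting of consecutive differing results is shown.

```python
def track_host_state(results, max_check_attempts, initial="UP"):
    """Yield (state, state_type, notify) for each check result."""
    hard_state = initial
    soft_state = initial
    attempts = 0
    for result in results:
        if result == hard_state:
            # The result matches the last hard state: nothing to confirm.
            soft_state, attempts = result, 0
            yield result, "HARD", False
            continue
        # Count consecutive results that differ from the hard state.
        attempts = attempts + 1 if result == soft_state else 1
        soft_state = result
        if attempts >= max_check_attempts:
            hard_state, attempts = result, 0
            yield result, "HARD", True  # notify only on a hard state change
        else:
            yield result, "SOFT", False


checks = ["UP", "DOWN", "UP", "DOWN", "DOWN", "DOWN"]
history = list(track_host_state(checks, max_check_attempts=3))
for entry in history:
    print(entry)
```

In this run, the single DOWN result in the second check stays soft and triggers no notification; only the third consecutive DOWN result at the end becomes a hard state.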
The host object parents directive is used to define the topology of the network. Usually, this directive points to a switch, router, or any other device that is responsible for forwarding network packets. The host is assumed to be unreachable if the parent host is currently in the hard DOWN state. For example, if a router is down, then all machines accessed via it are considered unreachable and no tests will be performed on these hosts.
If your network consists of servers connected via switches and routers to a different network, then the switch would be the parent of all servers in the local network as well as of the local router. The parent of the router on the other side of the link would be the local router.
The following diagram shows the actual network infrastructure and how Nagios hosts should be configured in terms of parents for each element of the network:
In the preceding diagram, the actual network topology is shown on the left and parent hosts setup for each of the machine is shown on the right. Each arrow represents mapping from a host to a parent host.
There is no need to define a parent for hosts that are directly on the network with your Nagios server. So, in this case, switch 1 should not have a parent host set.
Some devices, such as switches, cannot easily be checked to see whether they are down. However, it is still a good idea to describe them as part of your topology. In that case, you might use functionality such as scheduled downtime to keep track of when the device is going to be offline, or mark it as down manually. This helps in determining other problems because Nagios will not check any hosts that have a device currently scheduled for downtime somewhere along their path. This way, you will not receive multiple notifications caused by the scheduled downtime.
Check and notification periods specify the time periods during which checks for host state and notifications are to be performed. These can be specified so that different hosts can be monitored at different times.
It is also possible to have Nagios record that a host is down without notifying anyone about it. This can be done by specifying a notification_period that tells Nagios when a notification should be sent out. No notifications will be sent out outside this time period.
A typical example is a server that is only required during business hours and has a daily maintenance window between 10 P.M. and 4 A.M. You can set up Nagios so as not to monitor host availability outside business hours, or you can make Nagios monitor it, but without notifying that it is actually down. If monitoring is not done at all, then Nagios will perform fewer operations during this period. In the second case, it is possible to gather statistics on how much of the maintenance window is used, which can be used if changes to the window need to be made.
Nagios offers hostgroup objects, each of which is a group of one or more machines. This allows managing hosts, or adding services to groups of hosts, more efficiently.
A host might be a member of more than one host group. Usually, grouping is done by the type of machines, the location they are in, or the role of the machine. Each host group has a unique short name used to identify it, a descriptive name, and one or more hosts that are members of this group.
The following are the examples of host group definitions that define groups of hosts and a group that combines both groups:
    define hostgroup{
        hostgroup_name      linux-servers
        alias               Linux servers
        members             linuxbox01,linuxbox02
    }

    define hostgroup{
        hostgroup_name      aix-servers
        alias               AIX servers
        members             aixbox1,aixbox2
    }

    define hostgroup{
        hostgroup_name      unix-servers
        alias               UNIX servers
        hostgroup_members   linux-servers,aix-servers
    }
The following table contains directives that can be used to describe host groups; items in bold are required when specifying a host group:
Option | Description |
---|---|
**hostgroup_name** | The short, unique name of the host group |
**alias** | The descriptive name of the host group |
members | The list of all hosts that should be a member of this group, separated by a comma |
hostgroup_members | The list of all other host groups whose members should also be members of this group, separated by a comma |
Host groups can also be used when defining services or dependencies.
For example, it is possible to tell Nagios that all Linux servers should have their SSH services monitored and all AIX servers should have telnet accepting connections.
It is also possible to define dependencies between hosts. They are, in a way, similar to the parent-host relationship, but dependencies offer more complex configuration options. Nagios will only issue host and service checks if all dependent hosts are currently up. More details on dependencies can be found in Chapter 7, Advanced Configuration.
For the purpose of this book, we will define at least one host in our Nagios configuration directory structure. To be able to monitor the local server that the Nagios installation is running on, we will need to add its definition into the /etc/nagios/hosts/localhost.cfg file:

    define host{
        host_name               localhost
        alias                   Localhost
        address                 127.0.0.1
        check_command           check-host-alive
        check_interval          5
        retry_interval          1
        max_check_attempts      5
        check_period            24x7
        contact_groups          admins
        notification_interval   60
        notification_period     24x7
        notification_options    d,u,r
    }
Although Nagios does not require a naming convention, it is a good practice to use the hostname as the name of the file. To make sure that Nagios monitoring works, it is also a good idea to set the address to a valid IP address of the local machine, such as 127.0.0.1, as stated in the preceding code, or the IP address in your network if it is static.
If you are planning on monitoring other servers as well, you will want to add them; the recommended approach is to keep each object definition in its own file.
Services are objects that describe a functionality that a particular host provides. This can be virtually anything—network servers such as NFS or FTP, resources such as storage space, or CPU load.
A service is always tied to the host that it is running on. It is also identified by its description, which needs to be unique within a particular host. A service also defines when and how Nagios should check if it is running properly and how to notify the people responsible for this service. A short example of a web server that is defined on the localhost machine created earlier is as follows:

    define service{
        host_name               localhost
        service_description     www
        check_command           check_http
        check_interval          10
        check_period            24x7
        retry_interval          3
        max_check_attempts      3
        notification_interval   30
        notification_period     24x7
        notification_options    w,c,u,r
        contact_groups          admins
    }

This definition tells Nagios to check every 10 minutes that the web server is working correctly. The recommended file for this definition is /etc/nagios/services/localhost-www.cfg. With services, a good approach is to use <host>-<servicename> as the name of the file if a single host or host group is being set up for monitoring.
The following table lists the common directives that can be used to describe a service; items in bold are required when specifying a service:

Option | Description |
---|---|
host_name | The short name of the hosts on which the service is running; when specifying multiple hosts, the names should be separated by a comma |
hostgroup_name | The short name of the host groups that the service is running on; when specifying multiple host groups, the names should be separated by a comma |
service_description | The description of the service that is used to uniquely identify services running on a host |
servicegroups | The list of all service groups of which this service should be a member, separated by a comma |
check_command | The short name of the command that should be used to test if the service is running |
check_interval | Specifies how often a check should be performed; the value is in minutes |
retry_interval | Specifies how many minutes to wait before retesting whether the service is working |
**max_check_attempts** | Specifies how many times a test needs to report that a service is down before it is assumed to be down by Nagios |
**check_period** | Specifies the name of the time period during which tests of whether the service is working should be performed |
**contacts** | The list of all contacts that should receive notifications related to service state changes, separated by a comma; at least one contact or contact group needs to be specified for each service |
**contact_groups** | The list of all contact groups that should receive notifications related to service state changes, separated by a comma; at least one contact or contact group needs to be specified for each service |
first_notification_delay | Specifies the number of minutes before the first notification related to a service state change is sent out |
**notification_interval** | Specifies the number of minutes before each next notification related to a service not working correctly is sent out |
**notification_period** | Specifies time periods during which notifications related to service states should be sent out |
notification_options | Specifies which notification types for service states should be sent, separated by a comma; should be one or more of the following: w (the service WARNING state), u (the service UNKNOWN state), c (the service CRITICAL state), r (service recovery, that is, the OK state), f (the service starts or stops flapping), s (scheduled downtime starts or ends) |
For a complete list of accepted parameters, refer to the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/.
Nagios requires that at least one service be defined for every host, and that at least one service be defined overall for it to run. That is why we will now create a sample service in our configuration directory structure. For this purpose, we'll monitor the SSH protocol.
In order to monitor whether the SSH server is running on the Nagios installation, we will need to add its definition into the /etc/nagios/services/localhost-ssh.cfg file:

    define service{
        host_name               localhost
        service_description     ssh
        check_command           check_ssh
        check_interval          5
        retry_interval          1
        max_check_attempts      3
        check_period            24x7
        contact_groups          admins
        notification_interval   60
        notification_period     24x7
        notification_options    w,c,u,r
    }

If you are planning on monitoring other services as well, you will want to add a definition for each of them.
In many cases the same services (such as SSH) should be monitored on multiple hosts. It is possible to define a service once and add it to multiple hosts or even specify host groups. The items should be separated using a comma, as shown here:
    define service{
        hostgroup_name          linux-servers
        host_name               localhost,aix01
        service_description     SSH
        (...)
    }

It is also possible to specify hosts for which checks will not be performed by prefixing the host or host group with an exclamation mark (!), for example when a service is present on all hosts in a group except for a specific box. To specify that SSH should be checked on all Linux servers except for linuxbox01, as well as on the aix01 machine, a service definition similar to the following has to be created:

    define service{
        hostgroup_name          linux-servers
        host_name               !linuxbox01,aix01
        service_description     SSH
        (...)
    }
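How the hostgroup_name and host_name directives combine, including exclusions, can be sketched in Python. This is an illustration of the rule described above and not Nagios code; the function resolve_hosts and the group membership are assumptions made for the example.

```python
def resolve_hosts(hostgroup_names, host_names, hostgroups):
    """Return the sorted hosts a service applies to, honoring '!' exclusions."""
    included, excluded = set(), set()
    for group in hostgroup_names.split(","):
        group = group.strip()
        if not group:
            continue
        # A leading '!' moves the whole group to the excluded set.
        target = excluded if group.startswith("!") else included
        target.update(hostgroups[group.lstrip("!")])
    for host in host_names.split(","):
        host = host.strip()
        if not host:
            continue
        target = excluded if host.startswith("!") else included
        target.add(host.lstrip("!"))
    return sorted(included - excluded)


# Hypothetical group membership, using names from this chapter.
hostgroups = {"linux-servers": ["linuxbox01", "linuxbox02"]}
targets = resolve_hosts("linux-servers", "!linuxbox01,aix01", hostgroups)
print(targets)
```

With the membership assumed here, the service ends up on aix01 and linuxbox02, while linuxbox01 is left out.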
Services may be configured to be dependent on one another, similar to hosts. In this case, Nagios will only perform checks on a service if all dependent services are working correctly. More details on dependencies can be found in Chapter 7, Advanced Configuration.
Service objects can be grouped similarly to hosts. This can be used to manage services more conveniently. It also helps when checking service reports on the Nagios web interface. Service groups are also used to configure dependencies in a more convenient way.
The following table describes attributes that can be used to define a group; items in bold are required when specifying a service group:

Option | Description |
---|---|
**servicegroup_name** | The short, unique name of the service group |
**alias** | The descriptive name of the service group |
members | The list of all hosts and services that should be a member of this group, separated by a comma |
servicegroup_members | The list of all other service groups whose members should also be members of this group, separated by a comma |

The format of the members directive of a service group object is one or more <host>,<service> pairs.
An example of a service group is shown here:
    define servicegroup{
        servicegroup_name   databaseservices
        alias               All services related to databases
        members             linuxbox01,mysql,linuxbox01,pgsql,aix01,db2
    }

This service group consists of the mysql and pgsql services on the linuxbox01 host and db2 on the aix01 machine. It is uniquely identified by its name, databaseservices.
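Because the members value is a flat, comma-separated list, it helps to see how it splits into pairs. The following Python sketch is only an illustration of the format; the function name parse_members is hypothetical.

```python
def parse_members(members):
    """Split a servicegroup members value into (host, service) pairs."""
    fields = [field.strip() for field in members.split(",")]
    if len(fields) % 2 != 0:
        raise ValueError("members must contain an even number of fields")
    # Fields alternate: host, service, host, service, ...
    return list(zip(fields[0::2], fields[1::2]))


pairs = parse_members("linuxbox01,mysql,linuxbox01,pgsql,aix01,db2")
print(pairs)
```

Each host name is repeated for every service listed on it, which is why linuxbox01 appears twice in the value.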
It is possible to specify the groups that a service should be a member of inside the service definition itself. To do this, list the groups it should be a member of in the servicegroups directive of the service definition. It is also possible to define an empty service group and have the service definitions specify to which service groups they belong. Observe the following example:

    define servicegroup{
        servicegroup_name       databaseservices
        alias                   All services related to databases
    }

    define service{
        host_name               linuxbox01
        service_description     mysql
        check_command           check_ssh
        servicegroups           databaseservices
    }
In most cases, this approach is easier to maintain. Having a list of service groups that each service is a member of inside its definition and a definition of the service group without services explicitly listed makes it easier to manage, especially when creating services on multiple hosts, such as by using host groups, as explained earlier in this chapter.
Command definitions describe how host/service checks should be done. They are also used to specify how notifications about problems or event handlers should work.
Commands defined in Nagios tell how it can perform checks, such as what commands to run to check if a database is working properly, how to check if SSH, SMTP, or FTP servers are properly working, or if the DHCP server is assigning IP addresses correctly.
Commands are also used in notifications to let users know of issues, or to try to recover from a problem automatically.
Nagios makes no distinction between commands provided by the Nagios plugins project and custom commands either created by a third party or written by you. Since its interface is very straightforward, it is very easy to create your own checks.
Chapter 13, Programming Nagios, talks about writing custom commands to perform tasks such as monitoring custom protocols and communicating with installed applications.
Commands are defined in a manner similar to other objects in Nagios. A command definition has two parameters, namely, name and command line. The first parameter is a name that is then used for defining checks and notifications. The second parameter is an actual command that will be run along with all parameters.
Commands are used by hosts and services. They define what to run when making sure a host or service is working properly. A check command is identified by its unique name.
When a command is referenced from other object definitions, it can also take additional arguments, which are separated by an exclamation mark. A command with arguments has the syntax command_name[!arg1][!arg2][!arg3][...].
A command name is often the same as the plugin that it runs, but it can be different. The command line includes macro definitions (such as $HOSTADDRESS$). Check commands also use the $ARG1$, $ARG2$ ... $ARG32$ macros if a check command for a host or service passed additional arguments. The following is an example that defines a command to ping a host to make sure that it is working properly; it does not use any arguments:
    define command{
        command_name  check-host-alive
        command_line  $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
    }
A very short host definition that would use this check command could be similar to the one shown here:
    define host{
        host_name      somemachine
        address        10.0.0.1
        check_command  check-host-alive
    }
Such a check is usually done as part of the host checks. It allows Nagios to treat a machine as working properly if it responds to ICMP requests. Commands can also accept arguments, which offers a more flexible way of defining checks. A definition accepting parameters could be as follows:
    define command{
        command_name  check-host-alive-limits
        command_line  $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
    }
The corresponding host definition is as follows:
    define host{
        host_name      othermachine
        address        10.0.0.2
        check_command  check-host-alive-limits!3000.0,80%!5000.0,100%
    }
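To make the argument passing more concrete, the following Python sketch imitates how a check_command value could be turned into the final command line. It is a simplified model, not Nagios code: it ignores escaped exclamation marks, replaces unknown macros with an empty string, and the $USER1$ value of /opt/nagios/plugins is an assumption made only for this example.

```python
import re


def expand_command(check_command, commands, macros):
    """Split 'name!arg1!arg2' and substitute $MACRO$ references."""
    name, *args = check_command.split("!")
    values = dict(macros)
    # The first argument becomes $ARG1$, the second $ARG2$, and so on.
    for index, arg in enumerate(args, start=1):
        values[f"ARG{index}"] = arg

    def substitute(match):
        return values.get(match.group(1), "")

    return re.sub(r"\$([A-Z0-9_]+)\$", substitute, commands[name])


commands = {
    "check-host-alive-limits":
        "$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5",
}
# Assumed macro values; $USER1$ normally comes from the resource file.
macros = {"USER1": "/opt/nagios/plugins", "HOSTADDRESS": "10.0.0.2"}

print(expand_command("check-host-alive-limits!3000.0,80%!5000.0,100%",
                     commands, macros))
```

With these assumed values, the sketch produces the same kind of command line as the argument-free check-host-alive definition, but with the thresholds taken from the host definition.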
The following is another example that sets up a check command for a previously defined service:
    define command{
        command_name  check_http
        command_line  $USER1$/check_http -H $HOSTADDRESS$
    }
This check can be used when defining a service to be monitored by Nagios. Our Nagios configuration includes the default Nagios plugin definitions that we have previously copied into /etc/nagios/commands/default.cfg.
Chapter 6, Using the Nagios Plugins, covers the standard Nagios plugins along with sample command definitions.
Time periods are definitions of dates and times during which an action should be performed or specific people should be notified. These describe ranges of days and times and can be reused across various operations. A time period definition consists of a name that uniquely identifies it in Nagios as well as a description. It also contains one or more days or dates along with time spans that define when a time period is valid.
A typical example of a time period would be working hours, which define that a valid time to perform an action is from Monday to Friday during business hours. Another time period could be weekends, which means Saturday and Sunday, all day long. The following is a sample time period for working hours:
    define timeperiod{
        timeperiod_name  workinghours
        alias            Working Hours, from Monday to Friday
        monday           09:00-17:00
        tuesday          09:00-17:00
        wednesday        09:00-17:00
        thursday         09:00-17:00
        friday           09:00-17:00
    }
This particular example tells Nagios that a valid time to perform an action is from every Monday to Friday between 9 A.M. and 5 P.M. Each entry in a time period contains information on the date or weekday. It also contains a range of hours. Nagios first checks if the current date matches any of the dates specified. If it does, then it compares whether the current time matches the time ranges specified for that particular date.
There are multiple ways that a date can be specified. Depending on what type of date it is, one definition might take precedence over another. For example, a definition for December 24 is more important than a generic definition that on every weekday an action should be performed between 9 A.M. and 5 P.M.
Possible date types are mentioned here:
- 2015-11-01, which means November 1, 2015; Nagios accepts dates in the YYYY-MM-DD format
- July 4, which means the 4th of July
- day 14, which means the 14th day of every month
- Monday 1 September, which means the first Monday in September; Monday -1 May would mean the last Monday in May
- Monday 1, which means the first Monday of every month
- Monday, which means all Mondays
The list shows the types in the order in which Nagios uses the different date types. This means that a date recurring every year will always be used prior to an entry describing what should be done every Monday.
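The precedence of date types can be sketched in Python by giving each kind of entry a rank, where a lower rank means the entry is considered first. This is only a classification of the notations listed above; the function name and the ranking numbers are assumptions made for this illustration and do not come from Nagios itself.

```python
import re

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def date_type_rank(entry):
    """Return 0 for the most specific date type and 5 for a plain weekday."""
    words = entry.lower().split()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", entry.strip()):
        return 0  # calendar date, such as 2015-11-01
    if len(words) == 2 and words[0] in MONTHS:
        return 1  # day of a specific month, such as July 4
    if len(words) == 2 and words[0] == "day":
        return 2  # day of every month, such as day 14
    if len(words) == 3 and words[0] in WEEKDAYS and words[2] in MONTHS:
        return 3  # nth weekday of a specific month
    if len(words) == 2 and words[0] in WEEKDAYS:
        return 4  # nth weekday of every month
    if len(words) == 1 and words[0] in WEEKDAYS:
        return 5  # every such weekday
    raise ValueError(f"unrecognized date entry: {entry!r}")


entries = ["Monday", "day 14", "2015-11-01", "Monday 1", "July 4",
           "Monday 1 September"]
print(sorted(entries, key=date_type_rank))
```

Sorting the sample entries by rank reproduces the order of the list above.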
In order to be able to correctly configure all objects, we will now create some standard time periods that will be used in the configuration. The following example periods will be used in the remaining sections of this chapter, and it is recommended that you put them in the /etc/nagios/timeperiods/default.cfg file:
    define timeperiod{
        timeperiod_name  workinghours
        alias            Working Hours, from Monday to Friday
        monday           09:00-17:00
        tuesday          09:00-17:00
        wednesday        09:00-17:00
        thursday         09:00-17:00
        friday           09:00-17:00
    }
    define timeperiod{
        timeperiod_name  weekends
        alias            Weekends all day long
        saturday         00:00-24:00
        sunday           00:00-24:00
    }
    define timeperiod{
        timeperiod_name  24x7
        alias            24 hours a day 7 days a week
        monday           00:00-24:00
        tuesday          00:00-24:00
        wednesday        00:00-24:00
        thursday         00:00-24:00
        friday           00:00-24:00
        saturday         00:00-24:00
        sunday           00:00-24:00
    }
The last time period is also used by the SSH service defined earlier. This way, monitoring the SSH server will be done all the time.
It is also possible to define multiple periods of time by separating them with a comma, as shown here:
    define timeperiod{
        timeperiod_name  workinghours
        alias            Working Hours, excluding lunch break
        monday           09:00-13:00,14:00-17:00
        tuesday          09:00-13:00,14:00-17:00
        wednesday        09:00-13:00,14:00-17:00
        thursday         09:00-13:00,14:00-17:00
        friday           09:00-13:00,14:00-17:00
    }
It is also possible to have one time period excluded whenever another period is active, as shown here:
    define timeperiod{
        timeperiod_name  first-mondays
        alias            First Mondays of each month
        monday 1 january    00:00-24:00
        monday 1 february   00:00-24:00
        monday 1 march      00:00-24:00
        monday 1 april      00:00-24:00
        monday 1 may        00:00-24:00
        monday 1 june       00:00-24:00
        monday 1 july       00:00-24:00
        monday 1 august     00:00-24:00
        monday 1 september  00:00-24:00
        monday 1 october    00:00-24:00
        monday 1 november   00:00-24:00
        monday 1 december   00:00-24:00
    }
    define timeperiod{
        timeperiod_name  workinghours-without-first-monday
        alias            Working Hours, without first Monday of each month
        monday           09:00-17:00
        tuesday          09:00-17:00
        wednesday        09:00-17:00
        thursday         09:00-17:00
        friday           09:00-17:00
        exclude          first-mondays
    }
The second time period will include all working days except for the first Monday of each month.
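The way comma-separated ranges and the exclude directive interact can be sketched in Python as follows. This is a minimal model that only understands plain weekday entries, so instead of first-mondays it uses a hypothetical lunch period as the excluded one; the dictionary layout and function names are assumptions made for this illustration.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"]


def parse_ranges(spec):
    """Turn '09:00-13:00,14:00-17:00' into (start, end) minutes of the day."""
    ranges = []
    for part in spec.split(","):
        start, end = part.strip().split("-")
        start_h, start_m = map(int, start.split(":"))
        end_h, end_m = map(int, end.split(":"))
        ranges.append((start_h * 60 + start_m, end_h * 60 + end_m))
    return ranges


def is_active(name, periods, moment):
    """Check whether the named period covers the given datetime."""
    period = periods[name]
    # An excluded period that is active switches this period off.
    if any(is_active(other, periods, moment)
           for other in period.get("exclude", [])):
        return False
    spec = period["days"].get(WEEKDAYS[moment.weekday()])
    if spec is None:
        return False
    minute = moment.hour * 60 + moment.minute
    return any(start <= minute < end for start, end in parse_ranges(spec))


# Hypothetical periods: working hours with a lunch break excluded.
periods = {
    "workinghours": {"days": {day: "09:00-17:00" for day in WEEKDAYS[:5]},
                     "exclude": ["lunch"]},
    "lunch": {"days": {day: "13:00-14:00" for day in WEEKDAYS[:5]}},
}
print(is_active("workinghours", periods, datetime(2015, 11, 2, 10, 0)))
```

Supporting entries such as monday 1 january would require an additional date-matching step before the time ranges are compared, which is left out here to keep the sketch short.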
Contacts define people who can either be owners of specific machines or people who should be contacted in case of problems. Depending on how your organization might contact people in case of problems, a definition of a contact may vary a lot. A contact consists of a unique name, a descriptive name, one or more e-mail addresses, and pager numbers. Contact definitions can also contain additional data specific to how a person can be contacted.
A basic contact definition is shown here, and specifies the unique contact name, an alias, and the contact information. It also specifies event types that the person should receive and time periods during which notifications should be sent:
    define contact{
        contact_name                   jdoe
        alias                          John Doe
        email                          [email protected]
        contactgroups                  admins,nagiosadmin
        host_notification_period       workinghours
        service_notification_period    workinghours
        host_notification_options      d,u,r
        service_notification_options   w,u,c,r
        host_notification_commands     notify-host-by-email
        service_notification_commands  notify-service-by-email
    }
The contactgroups line defines that this user is a member of the admins and nagiosadmin groups, which we'll define later in this chapter. Contact groups work similarly to host and service groups: either a contact defines the groups it belongs to, or a contact group definition specifies the users that belong to the group.
We will now create a similar file in /etc/nagios/contacts, setting values for contact_name, alias, and email based on your username, full name, and e-mail address. The recommended name for the file is based on contact_name.
The following table describes all available directives when defining a contact; items in bold are required when specifying a contact:
Option | Description |
---|---|
contact_name | The short, unique name of the contact |
alias | The descriptive name of the contact; usually, this is the full name of the person |
contactgroups | The list of all contact groups of which this user should be a member, separated by a comma |
host_notifications_enabled | Specifies whether this person should receive notifications regarding the host state |
host_notification_period | Specifies the name of the time period that should be used to determine the time during which a person should receive notifications regarding the host state |
host_notification_commands | Specifies one or more commands that should be used to notify the person of a host state, separated by a comma |
host_notification_options | Specifies host states about which the user should be notified, separated by a comma; should be one or more of the following: d (host DOWN state), u (host UNREACHABLE state), r (host recovery, that is, the UP state), f (host starts and stops flapping), s (scheduled downtime starts or ends), n (the person will not receive any host notifications) |
service_notifications_enabled | Specifies whether this person should receive notifications regarding the service state |
service_notification_period | Specifies the name of the time period that should be used to determine the time during which a person should receive notifications regarding the service state |
service_notification_commands | Specifies one or more commands that should be used to notify the person of a service state, separated by a comma |
service_notification_options | Specifies service states about which the user should be notified, separated by a comma; should be one or more of the following: w (service WARNING state), u (service UNKNOWN state), c (service CRITICAL state), r (service recovery, that is, the OK state), f (service starts and stops flapping), s (scheduled downtime starts or ends), n (the person will not receive any service notifications) |
email | Specifies the e-mail address of the contact |
pager | Specifies the pager number of the contact; it can also be an e-mail to the pager gateway |
address1 to address6 | Six additional addresses that can be specified for the contact; these can be anything, based on how the notification commands will use these fields |
can_submit_commands | Specifies whether the user is allowed to execute commands from the Nagios web interface |
retain_status_information | Specifies whether the status-related information about this person is retained across restarts |
retain_nonstatus_information | Specifies whether the non-status information about this person should be retained across restarts |
For a complete list of accepted parameters, refer to the Nagios documentation available at http://library.nagios.com/library/products/nagioscore/manuals/
Contacts are also mapped to the users that log into the Nagios web interface. This means that all operations done via the interface will be logged as that particular user, and that the web interface will use the access granted to particular contact objects when evaluating if an operation should be allowed. The contact_name field from a contact object maps to usernames in the Nagios web interface.
Similar to other object types, contacts can also be grouped. Usually, grouping is used to keep a list of which users are responsible for which tasks, and it maps to the job responsibilities of particular people. It also makes it possible to define the people who should be responsible for handling problems during specific time periods, so that Nagios automatically contacts the right people for the time at which a problem occurred.
A sample definition of a contact group is as follows:
    define contactgroup{
        contactgroup_name  linux-admins
        alias              Linux Administrators
        members            jdoe,asmith
    }
This group is also used when defining the contacts for the linuxbox01 host and the WWW service. This means that both the jdoe and asmith contacts will receive information on the status of this host and service.
The following table is a complete list of directives that can be used to describe contact groups, items in bold are required when specifying a contact group:
Option | Description |
---|---|
contactgroup_name | The short, unique name of the contact group |
alias | The descriptive name of the contact group |
members | The list of all contacts that should be a member of this group, separated by a comma |
contactgroup_members | The list of all other contact groups whose members should also be members of this group, separated by a comma |
The members of a contact group can either be specified in the contact group definition or by using the contactgroups directive in a contact definition. It is also possible to combine both methods: some of the members can be specified in the contact group definition and others can be specified in their contact object definition.
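The following Python sketch shows one way to picture how the final member list of a contact group could be assembled from the three sources described above: the members directive, the contactgroups directive of individual contacts, and nested groups listed in contactgroup_members. It is an illustration under these assumptions rather than the actual Nagios implementation, and the bwhite contact is hypothetical.

```python
def resolve_group(name, groups, contacts, _seen=None):
    """Collect all contacts that end up as members of the named group."""
    seen = _seen if _seen is not None else set()
    if name in seen:  # guard against groups that include each other
        return set()
    seen.add(name)
    group = groups[name]
    # Members listed directly in the contact group definition.
    members = set(group.get("members", []))
    # Contacts that name this group in their own contactgroups directive.
    members |= {contact for contact, their_groups in contacts.items()
                if name in their_groups}
    # Members inherited from groups listed in contactgroup_members.
    for nested in group.get("contactgroup_members", []):
        members |= resolve_group(nested, groups, contacts, seen)
    return members


groups = {
    "linux-admins": {"members": ["jdoe", "asmith"]},
    "admins": {"contactgroup_members": ["linux-admins"]},
}
# Maps each contact to the value of its contactgroups directive.
contacts = {"jdoe": ["admins", "nagiosadmin"], "bwhite": ["admins"]}

print(sorted(resolve_group("admins", groups, contacts)))
```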
Contacts are used to specify who will be contacted if a status changes for one or more hosts or services. Nagios accepts both contacts and contact groups in host and service definitions. This makes it possible to make either specific people or entire groups responsible for particular machines or services.
It is also possible to specify different people or groups for handling host-related problems and service-related problems; for example, hardware administrators for handling host problems and system administrators for handling service issues.
In order for our previously created user jdoe to work properly, we need to define the admins and nagiosadmin groups in the /etc/nagios/contactgroups/admins.cfg file:
    define contactgroup{
        contactgroup_name  admins
        alias              System administrators
    }
    define contactgroup{
        contactgroup_name  nagiosadmin
        alias              Nagios administrators
    }
At this point, our configuration file should be ready for use. We can now verify that all of the configuration statements are correct and that Nagios would start correctly with our configuration. We can do this by running the nagios command with the -v option.
With the -v option, Nagios tries to load the configuration and all of the objects and validates that they are defined properly. This is meant to detect any configuration errors or other issues that would prevent Nagios from starting with the configuration file. It is especially useful for testing the configuration before restarting Nagios, as restarting Nagios with an invalid configuration will cause it to stop working in most cases.
For example, the following is an output of checking a valid configuration file:
    root@ubuntu:~# /opt/nagios/bin/nagios -v /etc/nagios/nagios.cfg
    Nagios Core 4.1.1
    Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
    Copyright (c) 1999-2009 Ethan Galstad
    Last Modified: 08-19-2015
    License: GPL
    Website: http://www.nagios.org
    Reading configuration data...
       Read main config file okay...
       Read object config files okay...
    Running pre-flight check on configuration data...
    Checking services...
        Checked 1 services.
    Checking hosts...
        Checked 1 hosts.
    Checking host groups...
        Checked 1 host groups.
    Checking service groups...
        Checked 1 service groups.
    Checking contacts...
        Checked 1 contacts.
    Checking contact groups...
        Checked 1 contact groups.
    Checking commands...
        Checked 24 commands.
    Checking time periods...
        Checked 3 time periods.
    Checking for circular paths...
        Checked 1 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 3 timeperiods
    Checking global event handlers...
    Checking obsessive compulsive processor commands...
    Checking misc settings...
    Total Warnings: 0
    Total Errors: 0
    Things look okay - No serious problems were detected during the pre-flight check
The preceding output indicates a correct configuration file.
If there are errors, the message will indicate the problem, as shown in the following output:
    root@ubuntu:~# /opt/nagios/bin/nagios -v /etc/nagios/nagios.cfg
    Nagios Core 4.1.1
    Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
    Copyright (c) 1999-2009 Ethan Galstad
    Last Modified: 08-19-2015
    License: GPL
    Website: http://www.nagios.org
    Reading configuration data...
       Read main config file okay...
    Error: Contactgroup 'admin' is not defined anywhere
    Error: Could not add contactgroup 'admin' to service (config file '/etc/nagios/services/localhost-www.cfg', starting on line 1)
       Error processing object config files!
    ***> One or more problems was encountered while processing the config files...
         Check your configuration file(s) to ensure that they contain valid
         directives and data defintions. If you are upgrading from a previous
         version of Nagios, you should be aware that some variables/definitions
         may have been removed or modified in this version. Make sure to read
         the HTML documentation regarding the config files, as well as the
         'Whats New' section to find out what has changed.
The preceding example indicates that the contactgroup value admin is not valid for a service defined in the /etc/nagios/services/localhost-www.cfg file.
It is always recommended that you verify the Nagios configuration file after making changes to ensure that it does not prevent Nagios from functioning properly.
Even though the /etc/init.d/nagios script refuses to restart Nagios when the configuration is incorrect, an invalid configuration left in place would still prevent Nagios from starting after a system restart.
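Verification can also be automated, for example as a step that runs before any restart. The following Python sketch wraps the nagios -v call and collects the lines that start with Error:. It assumes that nagios -v exits with a non-zero status when the configuration is invalid; the function name and the injectable run parameter are hypothetical conveniences for this example.

```python
import subprocess


def verify_config(nagios_bin, config_file, run=subprocess.run):
    """Run 'nagios -v' and return (is_valid, list_of_error_lines)."""
    result = run([nagios_bin, "-v", config_file],
                 capture_output=True, text=True)
    # Keep only the lines that report a concrete configuration error.
    errors = [line.strip() for line in result.stdout.splitlines()
              if line.strip().startswith("Error:")]
    # Assumption: a non-zero exit status means the pre-flight check failed.
    return result.returncode == 0, errors


# Example usage with the paths used in this chapter:
#   valid, errors = verify_config("/opt/nagios/bin/nagios",
#                                 "/etc/nagios/nagios.cfg")
#   if not valid:
#       print("\n".join(errors))
```

The run parameter exists only so that the function can be exercised without a Nagios installation; in normal use it can be left at its default.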
Notifications are meant to let people know that something is either wrong or has returned to the normal way of functioning. This is a very important functionality in Nagios, and configuring notifications correctly might seem a bit tricky in the beginning.
When and how notifications are sent out is configured as part of the contact configuration. Each contact has configuration directives for when notifications can be sent out and how they want to be contacted. Contacts also contain information about contact details, such as telephone number, e-mail address, and Jabber/MSN address. Each host and service is configured with when information about it should be sent and who should be contacted. Nagios then combines all this information in order to notify people of changes in the status.
Notifications may be sent out in any of the following situations:
- The host changes to a DOWN or UNREACHABLE state; a notification is sent out after the first_notification_delay number of minutes specified in the corresponding host object
- The host remains in a DOWN or UNREACHABLE state; a notification is sent out every notification_interval number of minutes specified in the corresponding host object
- The host returns to an UP state; a notification is sent out immediately and only once
- [...] notification_interval number of minutes specified in the corresponding host object
- The service changes to a WARNING, CRITICAL, or UNKNOWN state; a notification is sent out after the first_notification_delay number of minutes specified in the corresponding service object
- The service remains in a WARNING, CRITICAL, or UNKNOWN state; a notification is sent out every notification_interval number of minutes specified in the corresponding service object
- The service returns to an OK state; a notification is sent out immediately and only once
- [...] notification_interval number of minutes specified in the corresponding service object
If one of these conditions occurs, Nagios starts evaluating whether information about it should be sent out and to whom.
The first check is to compare the current date and time against the notification time period, which is taken from the notification_period field of the current host or service definition. A notification will be sent out only if the time period includes the current date and time.
Next, a list of users based on the contacts and contact_groups fields is created. Based on all members of all groups and included groups, as well as all contacts directly bound with the current host or service, a complete list of users is made.
Each of the matched users is then checked to see whether they should be notified about the current event. Each user's own time period is checked to see whether it includes the current date and time. The host_notification_period or service_notification_period directive is used, depending on whether the notification is for a host or a service.
For host notifications, the host_notification_options directive of each contact is also used to determine whether that particular person should be contacted; for example, different users might be contacted about an unreachable host than about a host that is actually down. For service notifications, the service_notification_options parameter is used to check whether each user should be notified about this issue.
If all of these criteria have been met, then Nagios will send a notification to this user. It will now use commands specified in the host_notification_commands and service_notification_commands directives.
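The checks described above can be summarized in a short Python sketch. It is a simplified model of the decision only, not the Nagios implementation: the function name, the dictionary layout, and the is_active callback (which answers whether a named time period covers the current moment) are assumptions made for this illustration.

```python
def contacts_to_notify(kind, state, obj, contacts, groups, is_active):
    """Return (contact, commands) pairs for a 'host' or 'service' event."""
    # Step 1: the object's own notification period must be active.
    if not is_active(obj["notification_period"]):
        return []
    # Step 2: collect direct contacts and members of all contact groups.
    candidates = set(obj.get("contacts", []))
    for group in obj.get("contact_groups", []):
        candidates |= set(groups[group])
    # Step 3: apply each contact's own period and notification options.
    selected = []
    for name in sorted(candidates):
        contact = contacts[name]
        if not is_active(contact[f"{kind}_notification_period"]):
            continue
        options = [o.strip() for o in
                   contact[f"{kind}_notification_options"].split(",")]
        if state not in options:
            continue
        # Step 4: these commands would be run to deliver the notification.
        commands = contact[f"{kind}_notification_commands"].split(",")
        selected.append((name, commands))
    return selected


contacts = {
    "jdoe": {"host_notification_period": "workinghours",
             "host_notification_options": "d,u,r",
             "host_notification_commands": "notify-host-by-email"},
    "asmith": {"host_notification_period": "24x7",
               "host_notification_options": "d,r",
               "host_notification_commands": "notify-host-by-email"},
}
groups = {"linux-admins": ["jdoe", "asmith"]}
host = {"notification_period": "24x7", "contact_groups": ["linux-admins"]}

# Outside working hours, only the 24x7 period is active.
print(contacts_to_notify("host", "d", host, contacts, groups,
                         lambda period: period == "24x7"))
```

In this example, a host going down outside working hours reaches only asmith, because the notification period of jdoe is not active at that time.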
It is possible to specify multiple commands that will be used for notifications, so it is possible to set up Nagios so that it sends both e-mail as well as a message on an instant messaging or chat system such as XMPP or Slack.
Nagios also offers escalations that allow sending e-mails to other people when a problem has not been resolved for too long. This can be used to propagate problems to the higher management or teams that might be affected by unresolved problems. It is a very powerful mechanism and is split between the host and service-based escalations. This functionality is described in more detail in Chapter 8, Notifications and Events.