So far, we have seen how to define service checks for localhost. But the real use case of a monitoring server like Icinga is to monitor an entire infrastructure, not to deploy Icinga on each host you want to monitor. This chapter covers ways to monitor remote servers from a single Icinga instance, using a configuration similar to the one we used for localhost monitoring, with slight modifications.
There are several ways of monitoring the remote servers in our infrastructure, depending on our needs and the services we want to monitor.
The type of check can be configured on a per-service basis using the active_checks_enabled and passive_checks_enabled directives in the service object definition.
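For example, a mostly passive service, one that accepts submitted results but is never polled, could be defined as follows (a sketch; the service name and the check_dummy fallback command are illustrative):

```
define command {
    command_name    check_dummy
    command_line    $USER1$/check_dummy $ARG1$
}

define service {
    use                     generic-service
    host_name               server1.example.org
    service_description     Log Errors
    check_command           check_dummy!0   ; fallback only, never run actively
    active_checks_enabled   0               ; do not initiate this check
    passive_checks_enabled  1               ; accept externally submitted results
}
```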
Both active and passive checks have their appropriate use cases, and it is important to determine which type fits yours. Active checking is generally recommended for network services such as HTTP, IMAP, and so on, while passive checking is recommended for services that are long running or whose results are generated by internal events on the host; for example, a log monitor that finds errors in a logfile would submit a CRITICAL result to Icinga.
Further, we will have a look at the tools required for each type of check.
The monitoring server initiates the checks at specific intervals, and the service status is set according to the return value of the check plugin. There are several ways to retrieve the status of a service, depending on the kind of service it is.
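The return-value convention is simple: a plugin prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The following sketch illustrates this convention with a hypothetical threshold check (the function name and values are illustrative, not a real plugin):

```shell
# Sketch of the plugin exit-code convention used by Icinga:
# 0 = OK, 1 = WARNING, 2 = CRITICAL.
check_value() {
    value=$1; warn=$2; crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value"; return 1
    else
        echo "OK - value is $value"; return 0
    fi
}

status=0
check_value 85 80 90 || status=$?   # prints "WARNING - value is 85"
echo "exit code: $status"           # prints "exit code: 1"
```

Icinga only looks at the exit code to set the service state; the printed line becomes the status text shown in the web interface.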
There are broadly two types of services: public services and private services. We will look at both of them in this section.
Publicly available services are those accessible over the network, either the internal network or the Internet; essentially, ones that can be checked by establishing a network connection and optionally making a sample request. Examples include HTTP, FTP, SSH, IMAP, SMTP, and MySQL Server.
If, for example, we want to monitor the HTTP, SSH, and IMAP services on server1.example.org, which is a remote host other than the monitoring server itself, the host and service configuration would look like the following.
Icinga's default set of configuration comes with a linux-server host template, which is defined in templates.cfg. Following is the configuration for a few service checks. The first one checks HTTP by issuing a GET / request on port 80:

define command {
    command_name    check_http
    command_line    $USER1$/check_http -I $HOSTADDRESS$
}

define service {
    use                     generic-service
    host_name               server1.example.org
    service_description     HTTP
    check_command           check_http
}
define command {
    command_name    check_ssh
    command_line    $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
}

define service {
    use                     generic-service
    host_name               server1.example.org
    service_description     SSH
    check_command           check_ssh
}
define command {
    command_name    check_imap
    command_line    $USER1$/check_imap -H $HOSTADDRESS$ $ARG1$
}

define service {
    use                     generic-service
    host_name               server1.example.org
    service_description     IMAP
    check_command           check_imap
}
We can play around with the command-line arguments that check plugins such as check_http provide; for example, warning/critical threshold values for response time, and so on. We will cover the check plugins in detail in the next chapter.
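For instance, check_http accepts -w and -c switches for warning and critical response-time thresholds in seconds. A variant of the earlier command could look like this (the command name check_http_timed and the threshold values are illustrative):

```
define command {
    command_name    check_http_timed
    command_line    $USER1$/check_http -I $HOSTADDRESS$ -w 5 -c 10
}
```

A service using this command would go WARNING if the response takes longer than 5 seconds and CRITICAL beyond 10 seconds.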
The Icinga service must be reloaded whenever we update any of these configuration files for the changes to take effect.
$ sudo service icinga reload
The reload/restart command verifies the entire configuration for syntax and semantic errors, and reports any errors found. It is recommended, as a general practice, to always do a configuration check before reloading/restarting Icinga.
$ sudo service icinga show-errors
The preceding command verifies the Icinga configuration and shows the errors, if any.
Private services include various system resource and performance checks, such as checks for free disk space, CPU load, memory usage, number of processes, and so on. Such information is not available over the network and has to be acquired using an intermediate agent that can provide it on request. The following sections cover such agents.
These agents can also be used to check public services that are not accessible from the Icinga server, or where the check has a different purpose. For example, to test the reachability of a web server running on server1 from server2 (both of which are different from the Icinga server), running the HTTP check for server1 from the Icinga server would not serve the purpose; instead, one of these agents running on server2 would report the reachability of server1 back to Icinga.
The simplest way to get information from a Linux server is to SSH into the remote server and run a command or script there. The nagios-plugins package provides ready-to-use plugins for such purposes (check_load, check_disk, and so on); we used these in our localhost monitoring setup. So, we can define a command that SSHes to the server, runs one of these plugins, and returns the output, which determines the status of the service check on the monitoring server. We also need to ensure that the nagios-plugins package is installed on the remote server, so that all the check plugins are available to execute over SSH. Let's look at a configuration for a disk space check on server1.example.org:
define command {
    command_name    check_by_ssh
    command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C 'PATH=$PATH:/usr/lib64/nagios/plugins $ARG1$'
}

define service {
    use                     generic-service
    host_name               server1.example.org
    service_description     Disk
    check_command           check_by_ssh!check_disk -w 20% -c 10%
}
Icinga runs and executes checks as the user configured using the icinga_user directive in icinga.cfg. So, we need to make sure that SSH keys are generated for this user on the monitoring server and added to authorized_keys on the remote server(s), so that check_by_ssh can execute without prompting for a password.
To generate SSH keys for the icinga user, use the following command:
$ su icinga -c 'ssh-keygen -t rsa'
Keep pressing Enter to accept the default values for the presented parameters. This will generate the SSH public key (~/.ssh/id_rsa.pub) and the SSH private key (~/.ssh/id_rsa) in the .ssh directory inside the home folder of the icinga user.
It is necessary to put the public key in ~/.ssh/authorized_keys in the home folder of the icinga user on the remote host. You will have to make sure that the icinga user exists on the remote host. This gives the icinga user on the Icinga server SSH access as the icinga user on the remote host.
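Assuming the remote host already has the icinga user and OpenSSH's ssh-copy-id utility is available on the monitoring server, the key can be installed and verified as follows (hostnames as in the earlier examples):

```
$ su icinga -c 'ssh-copy-id icinga@server1.example.org'
$ su icinga -c 'ssh icinga@server1.example.org echo ok'
```

If the second command prints ok without prompting for a password, check_by_ssh will work for that host.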
We appended /usr/lib64/nagios/plugins to PATH so that the check_by_ssh command object can be reused to run other plugins over SSH, without having to give the full path in the command every time.
NRPE is an add-on that is deployed on the remote hosts to execute check plugins on them. It is similar in spirit to using SSH: the NRPE daemon has to be running on the remote server, and its configuration maps a command_name to a command_executable (with arguments). When Icinga executes the check_nrpe check, it sends the NRPE command name that we specify in the service definition to the NRPE agent (daemon) on the remote server, which executes the corresponding command line and returns the output and exit code.
There are pros and cons to using NRPE instead of SSH. SSH gives us more flexibility, since any desired command or script can be run over it, while NRPE has the overhead of defining the command-name to command-executable mapping and other required configuration. On the other hand, SSH increases the load on the monitoring server when there are a large number of checks: each execution of a check requires opening an SSH connection, executing the plugin, and closing the connection, which is a considerable overhead. Command and service object definitions similar to the earlier ones can be used to execute checks over NRPE. Following is an example of the NRPE daemon configuration (usually /etc/nrpe.cfg):
# command[<command_name>]=<command_line>
command[check_users]=/usr/lib64/nagios/plugins/check_users -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -c 40%
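On the Icinga server side, a service using the check_load mapping above could be a minimal sketch like the following (if a check_nrpe command object is already defined elsewhere in your configuration, reuse that object instead of redefining it):

```
define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

define service {
    use                     generic-service
    host_name               server1.example.org
    service_description     Load
    check_command           check_nrpe!check_load
}
```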
More information on installation and configuration of NRPE can be found at http://docs.icinga.org/latest/en/nrpe.html.
While the preceding methods are best suited for Linux servers, they are not supported on Windows servers. For this purpose, there is an agent called NSClient++. It is the Windows replacement for Linux's NRPE daemon, although it is cross-platform and available for Linux too. The same check_nrpe plugin can be used to run commands on remote Windows servers: the plugin contacts the NSClient++ agent and asks for the status of one of the commands made available by the agent. The list of available commands and their usage can be found in the agent's documentation.
For example, if we have a Windows server with the hostname server2.example.org, the host definition can be as follows:
define host {
    use         windows-server
    host_name   server2.example.org
    alias       Example server 2
    address     172.16.143.23
    hostgroups  windows    ; just an example
}
Icinga already provides a windows-server host template, found in templates.cfg. Following is what the Icinga configuration for checking CPU load would look like (NSClient++ supports a command called CheckCPU):
define command {
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -u -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}

define service {
    use                     generic-service
    host_name               server2.example.org
    service_description     CPU
    check_command           check_nrpe!CheckCPU!warn=80 crit=90
}
We need a working NSClient++ deployed on the remote Windows server(s) for the preceding configuration to work. Have a look at http://docs.icinga.org/latest/en/monitoring-windows.html#installwindowsagent for the same. Make sure proper whitelisting is done on the Windows server side to allow check_nrpe to talk to the agent. The list of commands supported by NSClient++ is available at http://www.nsclient.org/nscp/wiki/CheckCommands.
SNMP agents on routers and switches can be used to monitor services on those devices. Monitoring network devices mostly consists of checking network traffic and open ports. A wide range of such values is available via SNMP; they are exposed as Object Identifiers (OIDs), and each OID has a value associated with it. For example, the OID sysUpTime.0 gives the uptime of the device. An example of a host definition for a switch is as follows:
define host {
    use         generic-switch
    host_name   switch1.example.org
    alias       HP 12504 AC switch
    address     192.168.32.58
    hostgroups  switches    ; just an example
}
An example of a service check for getting the uptime is as follows:
define command {
    command_name    check_snmp
    command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -o $ARG1$ $ARG2$
}

define service {
    use                     generic-service
    host_name               switch1.example.org
    service_description     SNMP
    check_command           check_snmp!sysUpTime.0
}
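The check_snmp plugin usually also needs authorization, typically an SNMPv1/v2c community string passed with the -C switch; the second argument of the command object above can carry it (the community string public is illustrative):

```
define service {
    use                     generic-service
    host_name               switch1.example.org
    service_description     Uptime
    check_command           check_snmp!sysUpTime.0!-C public
}
```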
Readers are advised to read up on how to use SNMP to retrieve various kinds of values, which can then be used for monitoring. The check_snmp plugin contacts the device and queries the values via the SNMP agent after supplying the relevant authorization. The snmpwalk command can be used to get the list of OIDs available on a particular device:
$ snmpwalk -m ALL -v1 -c public switch1.example.org system
The previous command gives you a list of all OIDs and their values as reported by the switch device.