In this chapter, we will look at common system health checks (private services) for Windows and Linux servers and for network devices, how to configure them, and how to set up parent-child relationships between Icinga hosts and services.
The nagios-plugins package provides check plugins for many common services and system resources. More plugins for specific use cases can be found online; Nagios Exchange (http://exchange.nagios.org/directory/Plugins) and MonitoringExchange (http://www.monitoringexchange.org/) are two good sources. Some plugins are even provided as distribution packages. The nagios-plugins packages must be installed on both the Icinga server and the servers to be monitored: on the Icinga server to run checks for public services, and on the monitored hosts to run checks through an agent. Most plugins are installed in /usr/lib/nagios/plugins or /usr/lib64/nagios/plugins, depending on the machine's architecture.
In this chapter, we will take into consideration the following hostgroups:
The linux hostgroup is used for all Linux servers:
    define hostgroup {
        hostgroup_name  linux
        alias           Linux servers
    }
The windows hostgroup is used for all Windows servers:
    define hostgroup {
        hostgroup_name  windows
        alias           Windows servers
    }
The switches hostgroup is used for all switches (the checks would also apply to routers):
    define hostgroup {
        hostgroup_name  switches
        alias           Network switches
    }
Any publicly available service (accessible from the Icinga monitoring server over an internal network or the Internet) can be monitored easily, irrespective of the operating system of the server hosting it, provided that the Icinga server is properly whitelisted in firewalls and other security software. We covered the monitoring of public services in the previous chapter, so this chapter covers the monitoring of common private services (such as CPU load, disk space, and number of processes) for different operating systems.
Let's add an example Linux server's host object to the linux hostgroup:
    define host {
        use         linux-server
        host_name   server1.example.org
        address     192.168.32.56
        hostgroups  all,linux
    }
Similarly, we add the hostgroups directive to all our Linux servers' host definitions.
We will use check_by_ssh to perform the private service checks. The common command definition is as follows:
    define command {
        command_name    check_by_ssh
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C 'PATH=$PATH:/usr/lib64/nagios/plugins $ARG1$'
    }
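As a rough illustration of how this generic command might be used, the sketch below counts processes on the Linux hosts by passing a complete plugin invocation as $ARG1$. The check_procs plugin ships with nagios-plugins; the service description and the thresholds (250 and 400 processes) are assumptions chosen for illustration, not recommendations.

```
# Hypothetical example: description and thresholds are illustrative only
define service {
    use                     generic-service
    hostgroup_name          linux
    service_description     Processes
    check_command           check_by_ssh!check_procs -w 250 -c 400
}
```

Everything after the first ! is handed to the remote shell as a single command, so any plugin found in the remote PATH set by the command definition can be run this way.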
The common service checks for Linux servers include:
Although SSH is a publicly available service, its check is worth mentioning here because all the check_by_ssh checks rely on the SSH service, so it is a good idea to place a check on SSH itself.
Here is a quick look at the SSH check (note that it uses check_ssh, not check_by_ssh):
    define command {
        command_name    check_ssh
        command_line    $USER1$/check_ssh -H $HOSTADDRESS$
    }
    define service {
        use                     generic-service
        hostgroup_name          linux
        service_description     SSH
        check_command           check_ssh
    }
This service check generates alerts, visible on the web interface, if there is a problem with SSH access to the server. In that case, all checks relying on SSH will subsequently start to fail as well, that is, generate alerts.
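Because every check_by_ssh check fails when SSH itself is down, one way to avoid a flood of redundant alerts is to make those checks depend on the SSH service. The following is a minimal sketch using a servicedependency object. It assumes a service named Load exists on server1.example.org (such as the load check defined in this chapter), and the failure criteria shown are an assumption to be adapted to your setup.

```
# Sketch: skip execution and notifications of Load while SSH is UNKNOWN or CRITICAL
define servicedependency {
    host_name                       server1.example.org
    service_description             SSH
    dependent_host_name             server1.example.org
    dependent_service_description   Load
    execution_failure_criteria      u,c
    notification_failure_criteria   u,c
}
```

With such a dependency in place, a broken SSH connection should raise one alert for SSH rather than one alert per dependent check.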
The check_load plugin is provided in the standard plugin directory mentioned earlier. It takes warning (-w) and critical (-c) load thresholds and returns the corresponding exit status based on the system load averages reported for the past 1, 5, and 15 minutes. The service definition is as follows:
    define service {
        use                     generic-service
        hostgroup_name          linux
        service_description     Load
        check_command           check_by_ssh!check_load -w 1,7,11 -c 2,10,15
    }
Alternatively, we can define a wrapper command object, check_load_by_ssh, so that it can be reused in service definitions for hosts that are not in the linux hostgroup:
    define command {
        command_name    check_load_by_ssh
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C '$USER1$/check_load -w $ARG1$ -c $ARG2$ $ARG3$'
    }
    define service {
        use                     generic-service
        hostgroup_name          linux
        service_description     Load
        check_command           check_load_by_ssh!1,7,11!2,10,15
    }
This service check will give:
We can also divide the load averages by the number of CPU cores using the -r switch of check_load, and set the warning and critical thresholds against these per-core load averages.
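As a sketch of the per-core variant, the service below passes -r through the third argument ($ARG3$) of the check_load_by_ssh command defined earlier. The service description and the threshold values are assumptions for illustration; with -r, a value of 1.0 roughly corresponds to every core being fully busy.

```
# Hypothetical example: per-core thresholds are illustrative only
define service {
    use                     generic-service
    hostgroup_name          linux
    service_description     Load per core
    check_command           check_load_by_ssh!0.7,0.6,0.5!0.9,0.8,0.7!-r
}
```

Per-core thresholds have the advantage that the same service definition can be applied to hosts with different numbers of cores.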
The check_disk plugin is available as part of the standard nagios-plugins package. It allows us to set WARNING and CRITICAL thresholds for free disk space, either as an absolute amount of disk space or as a percentage.
    define command {
        command_name    check_disk_by_ssh
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ $ARG3$'
    }
    define service {
        use                     generic-service
        hostgroup_name          linux
        service_description     Disk
        check_command           check_disk_by_ssh!20%!10%
    }
This would generate:
The plugin also provides the -W/--iwarning and -K/--icritical switches to check the percentage of free inodes. We can also restrict the check to a particular path or partition (--path or --partition), and the --mountpoint switch makes the plugin report the mount point instead of the partition in its output.
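To illustrate these options, the sketch below reuses the check_disk_by_ssh command and passes the extra switches through its third argument ($ARG3$). The /var path, the service description, and the inode thresholds are assumptions for illustration only. Long option names are used to keep the example unambiguous.

```
# Hypothetical example: checks free space and free inodes on /var only
define service {
    use                     generic-service
    hostgroup_name          linux
    service_description     Disk /var
    check_command           check_disk_by_ssh!20%!10%!--iwarning=20% --icritical=10% --path=/var
}
```

The thresholds are placed before --path on the resulting command line, since check_disk applies thresholds to the paths that follow them.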