Chapter 4. Monitoring Infrastructure, Network Services, and System Health

In this chapter, we will look at common techniques for checking system health (private services) on Windows and Linux servers and on network devices, along with their configuration. We will also set up parent-child relationships between Icinga hosts and services.

The nagios-plugins package provides check plugins for many common monitoring tasks. More plugins for specific use cases can be found online; Nagios Exchange (http://exchange.nagios.org/directory/Plugins) and MonitoringExchange (http://www.monitoringexchange.org/) are two useful sources, and some plugins are also available as distribution packages. The nagios-plugins package must be installed on both the Icinga server and the servers to be monitored: on the Icinga server to run checks for public services, and on the monitored hosts to run checks through an agent. Most of the plugins are installed in /usr/lib/nagios/plugins or /usr/lib64/nagios/plugins, depending on the machine's architecture.

In this chapter, we will take into consideration the following hostgroups:

  • The linux hostgroup is used for all Linux servers.
    define hostgroup {
        hostgroup_name      linux
        alias               Linux servers
    }
  • The windows hostgroup is used for all Windows servers.
    define hostgroup {
        hostgroup_name      windows
        alias               Windows servers
    }
  • The switches hostgroup is used for all switches (the checks would also apply to routers).
    define hostgroup {
        hostgroup_name      switches
        alias               Network switches
    }

Any publicly available service (accessible from the Icinga monitoring server over an internal network or the Internet) can be monitored easily, irrespective of the operating system of the server hosting it, provided that the Icinga server is whitelisted in firewalls and other security software. We covered the monitoring of public services in the previous chapter, so this chapter covers the monitoring of common private services (such as CPU load, disk space, and number of processes) for different operating systems.

Linux servers

Let's add an example Linux server's host object to the linux hostgroup.

define host {
    use             linux-server
    host_name       server1.example.org
    address         192.168.32.56
    hostgroups      all,linux
}

Similarly, we add the hostgroups directive to the host definitions of all our Linux servers.

We will use check_by_ssh to perform the private service checks. The common command definition is as follows:

# A literal dollar sign is written as $$ so that it is not parsed as a macro delimiter.
define command {
    command_name    check_by_ssh
    command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C 'PATH=$$PATH:/usr/lib64/nagios/plugins $ARG1$'
}

The common service checks for Linux servers include:

  • The Secure Shell (SSH) check
  • The load check
  • The disk check

The Secure Shell (SSH) check

Although SSH is a publicly available service, it is important to mention its check here because all the check_by_ssh checks rely on the SSH service, so it is a good idea to place a check on SSH itself.

Here is a quick look at the SSH check (note that it is check_ssh, not check_by_ssh):

define command {
    command_name    check_ssh
    command_line    $USER1$/check_ssh -H $HOSTADDRESS$
}
define service {
    use                     generic-service
    hostgroup_name          linux
    service_description     SSH
    check_command           check_ssh
}

This service check would generate alerts, which we will notice on the web interface, if there were a problem with getting SSH access to the server. All checks relying on SSH would subsequently start to fail as well, that is, generate alerts.

The load check

The check_load plugin is installed in the standard plugin directory mentioned earlier. It takes warning (-w) and critical (-c) load thresholds and returns the corresponding exit status based on the system load averages reported for the past 1, 5, and 15 minutes. The service definition is as follows:

define service {
    use                  generic-service
    hostgroup_name       linux
    service_description  Load
    check_command        check_by_ssh!check_load -w 1,7,11 -c 2,10,15
}
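
The way a check_command value is combined with a command definition can be sketched in Python. This is a simplified illustration, not Icinga's actual macro processor: the function name expand_command is hypothetical, the PATH prefix from the command definition is left out for brevity, and the value of $USER1$ is an assumed plugin directory.

```python
import re

def expand_command(check_command, command_line, macros):
    """Expand a check_command value against a command_line template."""
    # "name!arg1!arg2" splits into the command name and positional arguments.
    name, *args = check_command.split("!")
    values = dict(macros)
    for i, arg in enumerate(args, start=1):
        values[f"ARG{i}"] = arg
    # Replace each $NAME$; in this sketch, unset macros expand to an empty string.
    expanded = re.sub(r"\$([A-Z0-9_]+)\$",
                      lambda m: values.get(m.group(1), ""),
                      command_line)
    return name, expanded

# Assumed values for illustration only.
template = "$USER1$/check_by_ssh -H $HOSTADDRESS$ -C '$ARG1$'"
macros = {"USER1": "/usr/lib64/nagios/plugins", "HOSTADDRESS": "192.168.32.56"}

name, line = expand_command("check_by_ssh!check_load -w 1,7,11 -c 2,10,15",
                            template, macros)
print(line)
```

Everything after the first ! becomes $ARG1$, so the whole remote command, including its switches, is passed to check_by_ssh as a single argument.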

Alternatively, we can define a wrapper command object, check_load_by_ssh, so that the check can be reused in service definitions for hosts that are not in the linux hostgroup.

define command {
    command_name            check_load_by_ssh
    command_line            $USER1$/check_by_ssh -H $HOSTADDRESS$ -C '$USER1$/check_load -w $ARG1$ -c $ARG2$ $ARG3$'
}

define service {
    use                     generic-service
    hostgroup_name          linux
    service_description     Load
    check_command           check_load_by_ssh!1,7,11!2,10,15
}

This service check will give:

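  • CRITICAL when the 1-, 5-, or 15-minute load average is more than 2, 10, or 15, respectively
  • WARNING when the 1-, 5-, or 15-minute load average is more than 1, 7, or 11, respectively
  • OK when the load averages are no more than 1, 7, and 11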

We can also divide the load averages by the number of CPU cores using the -r switch in check_load, and set warning or critical thresholds on these load averages.
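
The threshold logic described above can be sketched in Python. This is an illustration only, not the plugin's source: the names parse_triplet and load_state are hypothetical, the sketch assumes that a state is raised as soon as any one of the three averages exceeds its threshold, and the -r switch is modelled as dividing each average by the CPU count.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard plugin exit codes

def parse_triplet(text):
    """Turn a threshold string such as "1,7,11" into three floats."""
    values = tuple(float(part) for part in text.split(","))
    if len(values) != 3:
        raise ValueError("expected three comma-separated values")
    return values

def load_state(averages, warn, crit, cpus=1):
    """Return the exit code for the 1-, 5-, and 15-minute load averages."""
    # Assumption: per-CPU mode simply divides each average by the core count.
    scaled = [avg / cpus for avg in averages]
    if any(avg > limit for avg, limit in zip(scaled, crit)):
        return CRITICAL
    if any(avg > limit for avg, limit in zip(scaled, warn)):
        return WARNING
    return OK

warn = parse_triplet("1,7,11")
crit = parse_triplet("2,10,15")
print(load_state((0.4, 0.6, 0.5), warn, crit))          # 0 (OK)
print(load_state((1.5, 3.0, 3.0), warn, crit))          # 1 (WARNING)
print(load_state((6.0, 6.0, 6.0), warn, crit, cpus=4))  # 1 (WARNING)
```

The last call shows why per-CPU scaling matters: a load of 6 on a four-core machine is only 1.5 per core, which crosses the 1-minute warning threshold but not the critical one.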

The disk check

The check_disk plugin is available as part of the standard nagios-plugins package. It allows us to set WARNING and CRITICAL thresholds for the free disk space, either as an absolute amount of disk space or as a percentage.

define command {
    command_name            check_disk_by_ssh
    command_line            $USER1$/check_by_ssh -H $HOSTADDRESS$ -C '$USER1$/check_disk -w $ARG1$ -c $ARG2$ $ARG3$'
}

define service {
    use                     generic-service
    hostgroup_name          linux
    service_description     Disk
    check_command           check_disk_by_ssh!20%!10%
}

This would generate:

  • CRITICAL when less than 10 percent of the disk space is free
  • WARNING when less than 20 percent of the disk space is free
  • OK when 20 percent or more of the disk space is free

The plugin also provides the -W/--iwarning and -K/--icritical switches to check the percentage of free inodes. We can also restrict the check to a particular mount point or block device with the -p switch (--path or --partition).
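
The percentage thresholds used in the Disk service can be sketched in Python. This is a simplified illustration and not how check_disk is implemented: the function name disk_state is hypothetical, and the sketch ignores details such as blocks reserved for root.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard plugin exit codes

def disk_state(free_bytes, total_bytes, warn_pct, crit_pct):
    """Return the exit code for a filesystem given free-space thresholds."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK

GIB = 1024 ** 3
# Thresholds match check_disk_by_ssh!20%!10% from the service definition.
print(disk_state(50 * GIB, 100 * GIB, 20, 10))  # 0 (OK)
print(disk_state(15 * GIB, 100 * GIB, 20, 10))  # 1 (WARNING)
print(disk_state(5 * GIB, 100 * GIB, 20, 10))   # 2 (CRITICAL)
```

To try this against a real filesystem, the free and total values could be taken from shutil.disk_usage(path) in the Python standard library.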
