In the last two chapters we installed and configured Riemann and Graphite. We got an introduction to Riemann and how it manages and indexes events. We also saw how to integrate Riemann servers in multiple environments. Finally, we saw how to send events from Riemann to Graphite and then graph them in Grafana.
In this chapter we’re going to add another building block to our monitoring framework by collecting host-based data and sending it to Riemann. Host-based monitoring provides the basic information about our hosts and their performance. We can then collect and combine this data with the application data that we’ll learn to collect in Chapter 9.
We’ll aim to implement host monitoring that:

Runs locally on every host.
Collects a core set of host performance data.
Can be extended for specific use cases, like databases or application servers.
Sends the collected data on to Riemann.
To satisfy these needs we’re going to look at a tool called collectd.
The collectd daemon, which acts as a monitoring collection agent, will do the local monitoring and send the data to Riemann. It will run locally on our hosts and selectively monitor and collect data from a variety of components.
We’ve chosen collectd because it is high performance and reliable. The collectd daemon has been around for about ten years, it’s written in C for performance, and is well tested and widely used. It’s also open source and licensed under a mix of the MIT and GPLv2 licenses.
The daemon uses a modular design: a central core and an integrated plugin system. The core of collectd is small and provides basic filtering and routing for the collected data. The collection, storage, and transmission of data is handled by plugins that can be enabled individually.
For the collection of data, collectd uses plugins called read plugins. They can collect information like CPU performance, memory, or application-specific metrics. This data is then passed into collectd’s core functions.
The data can then be filtered or routed to write plugins. The write plugins allow data to be stored locally—for example, writing to files—or to send that data across the network to other services. In our case, they’ll allow us to send that data to Riemann.
There are a large collection of default plugins shipped with collectd, as well as a variety of community-contributed and developed plugins you can add to it. The collectd daemon also supports running plugins you have written yourself.
This diagram shows our collectd architecture.
Our applications and services are connected via collectd’s read plugins. These read plugins send events to a write plugin that will send events onto Riemann.
In this chapter we’re focusing on using collectd for host-based monitoring, but we’re also going to use it later in the book for collecting service and application data. Using collectd will allow us to run a single agent locally on our host and use it to send our data into our Riemann event router.
Before we jump into installing and configuring collectd, let’s discuss what we might like to collect on our hosts. We’re going to focus on collecting basic data that will show us the core performance of our hosts. We’re going to configure a generic collection of monitoring data on all our hosts, and we’ll add additional monitoring for specific use cases. For example, we would install our base monitoring on all hosts but add specific monitoring for a database or application server.
Our basic set of monitoring will include:

CPU usage.
Memory usage.
Load.
Swap usage.
Disk space.
Network interface and protocol traffic.
Process counts and state.
This base set of data will allow us to identify host performance issues or provide sufficient supplemental data for fault diagnosis of application issues.
At this point some of you may be wondering, “But didn’t we say earlier that we should focus on application and business events and metrics?” We’re indeed going to focus on those application and business events and metrics, but they aren’t the full story. When we have a fault or need to further diagnose a performance issue we often need to drill down to more granular data. The data we’re gathering here will supplement our application and business events and metrics and allow us to diagnose and identify system-level issues that cause application problems.
We’ll install collectd on every host and configure it to collect our metrics and send them to Riemann. We’ll walk through the installation process on both Ubuntu and Red Hat family operating systems and provide you with appropriate configuration management resources to do the installation automatically.
We’ll install version 5.5 of collectd on Ubuntu from the collectd project’s own Launchpad PPA repository and via the apt-get command.
Add the collectd repository configuration to your host:
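A minimal sketch of adding the repository follows. The PPA name here is an assumption; check the collectd project’s documentation for the current repository details.

$ sudo apt-get install -y software-properties-common
$ sudo add-apt-repository ppa:collectd/collectd-5.5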
Then update and install collectd.
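For example:

$ sudo apt-get update
$ sudo apt-get install -y collectd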
The collectd daemon and its associated dependencies will now be installed. We’ll test it is installed and working by running the collectd
binary.
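For example (depending on your distribution the binary may live in /usr/sbin/):

$ collectd -h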
The -h
flag will output collectd’s flags and command line help.
We’ll install collectd on Red Hat using the EPEL repository.
Let’s add the EPEL repository now.
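On CentOS this is a single package install; on Red Hat Enterprise Linux you may instead need to install the EPEL release RPM directly from the Fedora project. A sketch for CentOS:

$ sudo yum install -y epel-release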
Now we’ll use the yum
command to install the collectd
package. We’ll also install the write_riemann
plugin, which we’ll use to send our events to Riemann, and the protobuf-c
package it requires.
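A sketch of the install, assuming EPEL’s package naming for the write_riemann plugin:

$ sudo yum install -y collectd collectd-write_riemann protobuf-c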
On newer Red Hat family releases, such as Fedora 22 and later, the yum command has been replaced with the dnf command. The syntax is otherwise unchanged.
The collectd daemon and its associated dependencies will now be installed. We’ll test it is installed by running the collectd
binary.
The -h
flag will output collectd’s flags and command line help.
There are a variety of options for installing collectd on your hosts via configuration management.
You can find a Chef cookbook for collectd at:
You can find Puppet modules for collectd at:
You can find an Ansible role for collectd at:
You can also find Docker containers at:
And a Vagrant image at:
Now that we have the collectd daemon installed we need to configure it. We’re going to enable some plugins to gather our data and then configure a plugin to send the events to Riemann.
The collectd daemon is configured using a configuration file located at:

/etc/collectd/collectd.conf on Ubuntu.
/etc/collectd.conf on Red Hat.

The collectd package on both distributions installs a default configuration file. We’re not going to use that file—instead, we’re going to build our own configuration.
The collectd configuration is divided into three main concerns:

Global settings for the daemon.
Plugin loading.
Plugin configuration.
Let’s look at our initial configuration file.
# Global settings
Interval 2
CheckThresholds true
WriteQueueLimitHigh 5000
WriteQueueLimitLow 5000
# Plugin loading
LoadPlugin logfile
LoadPlugin threshold
# Plugin Configuration
<Plugin "logfile">
LogLevel "info"
File "/var/log/collectd.log"
Timestamp true
</Plugin>
Include "/etc/collectd.d/*.conf"
First, we need to set two global configuration options in our collectd configuration. The first configuration option, Interval
, sets the collectd daemon’s check interval. This is the resolution at which collectd collects data. It defaults to 10 seconds. We’re going to make our resolution more granular and move it to two second intervals. This should either match or be slightly larger than the lowest retention period set in Graphite to ensure the collection periods sync up correctly. Remember that in Chapter 4 we installed Graphite and configured its retentions in the /etc/carbon/storage-schemas.conf
configuration file.
We could set the Interval to one second, but Riemann’s one-second precision sometimes means duplicate metrics are generated when Riemann rounds the time to the nearest second.
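As a refresher, the retention definition we created in Chapter 4 looked something like this (the exact retention periods here are illustrative):

[default]
pattern = .*
retentions = 1s:24h, 10s:7d, 1m:30d, 10m:1y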
You can see our shortest resolution in the storage schema is one second. Our collectd Interval
should be the same or more than that resolution.
The second global option, CheckThresholds
, can be set to true
or false
, and controls how state is set for collected data. The collectd daemon can check collected data for thresholds, marking data with metrics that exceed those thresholds as being in a warning or failed state. We’re not going to set any specific thresholds right now, but we’re turning on the functionality because it creates a default state—shown in the :state
field in a Riemann event—of ok
that we can later use in Riemann.
The WriteQueueLimitHigh
and WriteQueueLimitLow
options control the queue size of write plugins. This protects us from memory issues if a write plugin is slow, for example if the network times out. You can set the limits using the WriteQueueLimitHigh
and WriteQueueLimitLow
options. Each option takes a number, which is the number of metrics in the queue. If there are more metrics than the value of the WriteQueueLimitHigh
option then any new metrics will be dropped. If there are fewer than WriteQueueLimitLow
metrics in the queue, all new metrics will be enqueued.
If the number of metrics currently in the queue is between the two thresholds the metric is potentially dropped, with a probability that is proportional to the number of metrics in the queue. This is a bit random for us so we set both options to 5000
. This sets an absolute threshold. If more than 5000
metrics are in the queue then incoming metrics will be dropped. This protects the memory on the host running collectd.
Next, we’ll enable some plugins for collectd. Use the LoadPlugin
command to load each plugin.
LoadPlugin logfile
<Plugin "logfile">
LogLevel "info"
File "/var/log/collectd.log"
Timestamp true
</Plugin>
LoadPlugin threshold
The first plugin, logfile
, tells collectd to log its output to a log file.
We’ve configured the plugin using a <Plugin>
block. Each <Plugin>
block specifies the plugin it is providing configuration for, here <Plugin "logfile">
. It also contains a series of configuration options for that plugin. In this case, we’ve specified the LogLevel
option, which tells collectd how verbose to make our logging. We’ve chosen a middle ground of info
, which logs some operational and error information, but skips a lot of debugging output. We’ve also specified the File
option to tell collectd where to log this information, choosing /var/log/collectd.log
.
Lastly, we’ve set the Timestamp
option to true
which adds a timestamp to any log output collectd generates.
We’ve put the logfile plugin and its configuration first in the collectd.conf file so that if something goes wrong with collectd we’re likely to catch it in the log.
We’ll also need to load the threshold
plugin. The threshold
plugin is linked to the CheckThresholds
setting we configured in the global options above. This plugin enables collectd’s threshold checking logic. We’re going to use it to help identify when things go wrong on our host.
Finally, we’ll set one last global configuration option, Include "/etc/collectd.d/*.conf"
. The Include
option specifies a directory to hold additional collectd configuration. Any file ending in .conf
will be added to our collectd configuration. We’re going to use this capability to make our configuration easier to manage. With snippets included via the Include
directory, we can easily manage collectd with configuration management tools like Puppet, Chef, or Ansible.
You’ll note that we’ve put the Include
option at the end of the configuration file. This is because collectd loads configuration in top-down order. We want our included files to be loaded last.
Let’s create the directory for the files we wish to include now (it already exists on some distributions).
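For example:

$ sudo mkdir -p /etc/collectd.d/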
We’re now going to load and configure a series of plugins to collect data. We’re going to make use of the following plugins:
cpu - Collects CPU usage data.
memory - Collects memory usage data.
df - Collects disk usage data, much like the df command.
swap - Collects swap usage data.
interface - Collects network interface data.
protocols - Collects network protocol data.
load - Collects load average data.
processes - Collects process counts and state.

These plugins provide data for the basic state of most Linux hosts. We’re going to configure each plugin in a separate file and store them in the /etc/collectd.d/
directory we’ve specified as the value of the Include
option. This separation allows us to individually manage each plugin and lends itself to management with a configuration management tool like Puppet, Chef, or Ansible.
Now let’s configure each plugin.
The first plugin we’re going to configure is the cpu
plugin. The cpu
plugin collects CPU performance metrics on our hosts. By default, the cpu
plugin emits CPU metrics in Jiffies: the number of ticks since the host booted. We’re going to also send something a bit more useful: percentages.
First, we’re going to create a file to hold our plugin configuration. We’ll put it into the /etc/collectd.d
directory. As a result of the Include
option in the collectd.conf
configuration file, collectd will load this file automatically.
Let’s now populate this file with our configuration.
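A sketch of the resulting configuration, assuming we’ve called the file /etc/collectd.d/cpu.conf:

LoadPlugin cpu
<Plugin "cpu">
  ValuesPercentage true
  ReportByCpu false
</Plugin>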
We load the plugin using the LoadPlugin
option. We then specify a <Plugin>
block to configure our plugin. We’ve specified two options. The first, ValuesPercentage
, set to true
, will tell collectd to also send all CPU metrics as percentages. The second option, ReportByCpu
, set to false
, will aggregate all CPU cores on the host into a single metric. We like to do this for simplicity’s sake. If you’d prefer your host to report CPU performance per core, you can change this to true
.
Next let’s configure the memory
plugin. This plugin collects information on how much RAM is free and used on our host. By default the memory
plugin returns metrics in bytes used or free. Often this isn’t useful because we don’t know how much memory a specific host might have. If we return the value as a percentage instead, we can more easily use this metric to determine if we need to take some action on our host. So we’re going to do the same percentage conversion for our memory metrics.
Let’s start by creating a file to hold the configuration for the memory
plugin.
Now let’s populate this file with our configuration.
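A sketch of the configuration, assuming the file is /etc/collectd.d/memory.conf:

LoadPlugin memory
<Plugin "memory">
  ValuesPercentage true
</Plugin>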
We first load the plugin and then add a <Plugin>
block to configure it. We set the ValuesPercentage
option to true
to make the memory
plugin also emit metrics as percentages.
The df
plugin collects disk space metrics, including space used, on mount points and devices. Like the memory
plugin, it outputs in bytes by default—so let’s configure it to emit metrics in percentages, instead. That will make it easier to determine if a mount or device has a disk space issue.
Let’s start by creating a configuration file.
And now we’ll populate this file with our configuration.
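A sketch of the configuration, assuming the file is /etc/collectd.d/df.conf:

LoadPlugin df
<Plugin "df">
  MountPoint "/"
  ValuesPercentage true
</Plugin>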
We first load the plugin and then configure it. The ValuesPercentage
option tells the df
plugin to also emit metrics with percentage values. We’ve also specified a mount point we’d like to monitor via the MountPoint
option. This option allows us to specify which mount points to collect disk space metrics on. We’ve only specified the /
(“root”) mount point. If we wanted or needed to, we could add additional mount points to the configuration now. To monitor a mount point called /data
we would add:
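We’d add another MountPoint entry inside the <Plugin "df"> block:

MountPoint "/data"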
We can also specify devices using the Device option.
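For example (the device name here is illustrative):

Device "/dev/sda1"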
Or we can monitor all filesystems and mount points on the host.
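A sketch of that configuration, with no MountPoint or Device entries selected:

<Plugin "df">
  ValuesPercentage true
  IgnoreSelected true
</Plugin>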
The IgnoreSelected
option, when set to true
, tells the df
plugin to ignore all configured mounts points or devices and monitor every mount point and device.
Let’s now configure the swap
plugin. This plugin collects a variety of metrics on the state of swap on your hosts. Like other plugins we’ve seen, it returns metric values measuring swap used in bytes. We again want to make those easier to consume by emitting them as percentage values.
Let’s first create a file to hold its configuration.
Now we’ll populate this file with our configuration.
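A sketch of the configuration, assuming the file is /etc/collectd.d/swap.conf:

LoadPlugin swap
<Plugin "swap">
  ValuesPercentage true
</Plugin>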
We’ve only specified one option, ValuesPercentage
, and set it to true
. This will cause the swap
plugin to emit metrics using percentage values.
Now let’s configure our interface
plugin. The interface
plugin collects data on our network interfaces and their performance.
We’re not going to add any configuration to the interface
plugin by default so we’re just going to load it.
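A sketch of the configuration, assuming the file is /etc/collectd.d/interface.conf:

LoadPlugin interface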
Without configuration, the interface
plugin collects metrics on all interfaces on the host. If we wanted to limit this to one or more interfaces, instead, that can be specified via configuration using the Interface
option.
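For example (the interface name here is illustrative):

<Plugin "interface">
  Interface "eth0"
</Plugin>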
Other times we might want to exclude specific interfaces—for example, the loopback interface—from our monitoring.
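A configuration along these lines would do that:

<Plugin "interface">
  Interface "lo"
  IgnoreSelected true
</Plugin>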
Here we’ve specified the loopback, lo
, interface with the Interface
option. We’ve added the IgnoreSelected
option and set it to true
. This will tell the interface
plugin to monitor all interfaces except the lo
interface.
Also collecting data about our host’s network performance is the protocols
plugin. The protocols
plugin collects data on our network protocols running on the host and their performance.
We’re not going to add any configuration to the protocols
plugin by default so we’re just going to load it.
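A sketch, assuming the file is /etc/collectd.d/protocols.conf:

LoadPlugin protocols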
We’re not going to add any configuration for the next plugin, load
. We’re just going to load it. This plugin collects load metrics on our host.
Let’s create a file for the load
plugin.
And configure it to load the load
plugin.
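Assuming the file is /etc/collectd.d/load.conf, it contains a single directive:

LoadPlugin load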
Lastly, we’re going to configure the processes
plugin. The processes
plugin monitors both processes broadly on the host—for example, the number of active and zombie processes—but it can also be focused on individual processes to get more details about them. We’re going to create a file to hold our processes
plugin configuration.
We’re then going to specifically monitor the collectd
process itself and set a default threshold for any monitored processes. The processes
plugin provides insight into the performance of specific processes. We’ll also use it to make sure specific processes are running—for example, ensuring collectd is running.
LoadPlugin processes
<Plugin "processes">
Process "collectd"
</Plugin>
<Plugin "threshold">
<Plugin "processes">
<Type "ps_count">
DataSource "processes"
FailureMin 1
</Type>
</Plugin>
</Plugin>
In some cases you might want to monitor a process (or any other plugin) at a longer interval than the global interval of 2 seconds we configured. You can override the global Interval
by setting a new interval when loading the plugin. To do this we convert the LoadPlugin
directive into a block like so:
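A sketch of that block for the processes plugin:

<LoadPlugin processes>
  Interval 10
</LoadPlugin>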
You can read more about this in the LoadPlugin configuration section in the collectd configuration documentation.
Here inside our new <LoadPlugin>
block we’ve specified a new Interval
directive with a collection period of 10
seconds.
We continue by configuring the processes
plugin inside the <Plugin>
block. It can take two configuration options: Process
and ProcessMatch
. The Process
option matches a specific process by name, for example the collectd
process as we’ve defined here. The ProcessMatch
option matches one or more processes via a regular expression. Each regular expression is tied to a label, like so:
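The general form is a label followed by a regular expression:

ProcessMatch "label" "regular expression"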
Let’s take a look at ProcessMatch
in action.
<Plugin "processes">
ProcessMatch "carbon-cache" "python.+carbon-cache"
ProcessMatch "carbon-relay" "python.+carbon-relay"
</Plugin>
This example would match all of the carbon-cache
and carbon-relay
processes we configured in Chapter 4 under labels. For example, the regular expression python.+carbon-cache
would match all Carbon Cache daemons running on a host and group them under the label carbon-cache
.
In another example, if we wished to monitor the Riemann server running on our riemanna
, riemannb
, and riemannmc
hosts, we’d configure the following ProcessMatch
regular expression.
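A sketch of that configuration; the label riemann here is an assumption:

ProcessMatch "riemann" "riemann.jar:\sriemann.bin"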
If we run a ps
command on one of the Riemann hosts to check if monitoring for riemann.jar:\sriemann.bin
will find the Riemann process we’ll see:
$ ps aux | grep 'riemann.jar:\sriemann.bin'
riemann 14998 33.8 28.4 6854420 2329212 ? Sl Apr21 5330:50 java -Xmx4096m -Xms4096m -XX:NewRatio=1 -XX:PermSize=128m -XX:MaxPermSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config
We see here that our grep
matches riemann.jar:\sriemann.bin
, hence collectd will be able to monitor the Riemann server process.
The output from the processes
plugin for specific processes provides us with detailed information on:

The CPU time the process is consuming.
The memory, including resident set size, the process is using.
Page faults and disk I/O.
The number of threads.
The number of processes running under that name.
Now let’s look at how we’re going to configure some process thresholds using some of this data.
The last metric, the number of processes by that name, is useful for determining whether the process itself has failed. We’re going to use it to monitor processes to determine if they are running. This brings us to the second piece of configuration in our processes.conf
file.
. . .
<Plugin "threshold">
<Plugin "processes">
<Type "ps_count">
DataSource "processes"
FailureMin 1
</Type>
</Plugin>
</Plugin>
This new <Plugin>
block configures the threshold
plugin we loaded as part of our global configuration. You can specify one or more thresholds for every plugin you use. If a threshold is breached then collectd will generate failure events—for example, if we were monitoring the RSyslog daemon and collectd stopped detecting the process, we’d receive a notification much like:
[2016-01-07 04:51:15] Notification: severity = FAILURE, host = graphitea, plugin = processes, plugin_instance = rsyslogd, type = ps_count, message = Host graphitea, plugin processes (instance rsyslogd) type ps_count: Data source "processes" is currently 0.000000. That is below the failure threshold of 1.000000.
The plugin we’re going to use for sending events to Riemann is also aware of these thresholds and will send them onto Riemann where we can use them to detect services that have failed.
To configure our threshold, we must first define which collectd plugin the threshold will apply to using a new <Plugin>
block. In this case we’re defining thresholds for the processes
plugin. We then need to tell the threshold
plugin which metric generated by this plugin to use for our threshold. For the processes
plugin this metric is called ps_count
, which is the number of processes running under a specific name. We define the metric to be used with the <Type>
block.
Inside the <Type>
block we specify two options: DataSource
and FailureMin
. The DataSource
option references the type of data we’re collecting. Each collectd metric can be made up of multiple types of data. To keep track of these data types, the collectd daemon has a central registry of all data known to it, called the type database. It’s located in a database file called types.db
, which was created when we installed collectd. On most distributions it’s in the /usr/share/collectd/
directory. If we peek inside the types.db
file to find the ps_count
metric we will see:
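The entry looks something like this:

ps_count                processes:GAUGE:0:1000000, threads:GAUGE:0:1000000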
Note that the ps_count
metric is made up of two data sources, both gauges (see Chapter 2 for information on what a gauge is):
processes - which can range from 0 to 1000000.
threads - which can range from 0 to 1000000.

This means ps_count
can be used to either track the number of processes or the number of threads. In our case we care about how many processes are running.
The second option, FailureMin
, is the threshold itself. The FailureMin
option specifies the minimum number of processes required before triggering a failure event. We’ve specified 1
. So, if there is 1
process running, collectd will be happy. If the process count drops below 1
, then a failure event will be generated and sent on to Riemann.
The threshold
plugin supports several types of threshold:
FailureMin - Generates a failure event if the metric falls below the minimum.
WarningMin - Generates a warning event if the metric falls below the minimum.
FailureMax - Generates a failure event if the metric exceeds the maximum.
WarningMax - Generates a warning event if the metric exceeds the maximum.

If you combine failure and warning thresholds you can create dual-tier event notifications, for example:
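A sketch of a dual-tier threshold using the ps_count type:

<Type "ps_count">
  DataSource "processes"
  WarningMin 3
  FailureMin 1
</Type>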
Here, if the metric drops below 3
then a warning event is triggered. If it drops below 1
then a failure event is triggered. We use this to vary the response to a threshold breach—for example, a warning might not merit immediate action but the failure would.
If the threshold breach is corrected then collectd will generate a “things are okay” event, letting us know that things are back to normal. This will also be sent to Riemann and we could use it to automatically resolve an incident or generate a notification.
The threshold we’ve just defined is global for the processes
plugin. This means any process we watch with the processes
plugin will be expected to have a minimum of one running process. Currently, we’re only managing one process, collectd
; if it drops to zero, it likely means collectd has failed. But if we were to define other processes to watch, which we’ll do in subsequent chapters, then collectd would trigger events for these if the number of processes was to drop below 1
.
But sometimes normal operation for a service will be to have more than one running process. To address this we customize thresholds for specific processes. Let’s say normal operation for a specific process meant two or more processes should be running, for example:
<Plugin "processes">
ProcessMatch "carbon-cache" "python.+carbon-cache"
ProcessMatch "carbon-relay" "python.+carbon-relay"
</Plugin>
<Plugin "threshold">
<Plugin "processes">
Instance "carbon-cache"
<Type "ps_count">
DataSource "processes"
WarningMin 2
FailureMin 1
</Type>
Instance "carbon-relay"
<Type "ps_count">
DataSource "processes"
FailureMin 1
</Type>
</Plugin>
</Plugin>
Here we have our carbon-cache
and carbon-relay
process matching configuration. We know that there should be two carbon-cache
processes, and we configure the threshold
plugin to check that. We’ve added a new directive, Instance
, which we’ve set to the ProcessMatch
label, carbon-cache
. The Instance
directive tells the threshold
plugin that this configuration is specifically for monitoring our carbon-cache
processes.
We’ve also set both the WarningMin
and FailureMin
thresholds. If the number of carbon-cache
processes drops below 2
then a warning event will trigger. If the number drops below 1
then a failure event will trigger.
There are a number of other options available for the threshold plugin. You can read about them on the collectd wiki’s threshold configuration page.
If we were to stop the carbon-cache
daemon now on one of our Graphite hosts, we’d see the following event in the /var/log/collectd.log
log file.
We could then use this event in Riemann to trigger a notification and let us know something was wrong. We’ll see how to do that later in this chapter. First though, we need our metrics to be sent onto our last plugin, write_riemann
, which does the actual sending of data to Riemann.
A similar failure event would be generated if we stopped the carbon-relay process.
Next, we need to configure the write_riemann
plugin. This is the plugin that will write our metrics to Riemann. We’re going to create our write_riemann
configuration in a file in the /etc/collectd.d/
directory.
First, let’s create a file to hold the plugin’s configuration.
Now let’s load and configure the plugin.
LoadPlugin write_riemann
<Plugin "write_riemann">
<Node "riemanna">
Host "riemanna.example.com"
Port "5555"
Protocol TCP
CheckThresholds true
StoreRates false
TTLFactor 30.0
</Node>
Tag "collectd"
</Plugin>
We first use the LoadPlugin
directive to load the write_riemann
plugin. Then, inside the <Plugin>
block, we specify <Node>
blocks. Each <Node>
block has a name and specifies a connection to a Riemann instance. We’ve called our only node riemanna
. Each node must have a unique name.
Each connection is configured with the Host
, Port
, and Protocol
options. In this case we’re installing on a host in the Production A environment, and we’re connecting to the corresponding Riemann server in that environment: riemanna.example.com
. We’re using the default port of 5555
and the TCP protocol. To ensure your collectd events reach Riemann you’ll need to ensure that TCP port 5555
is open between your hosts and the Riemann server. We’re using TCP rather than UDP to provide a stronger guarantee of delivery.
The CheckThresholds
option configures the plugin to apply any thresholds we set with the threshold plugin we enabled earlier, attaching the resulting state to the events sent to Riemann.
Setting the StoreRates
option to false
configures the plugin to not convert any counters to rates. The default, true
, turns any counters into rates rather than incremental integers. As we’re going to use counters a lot (see especially Chapter 9 on application metrics) we want to use them rather than rates.
The last option in the Node
block is TTLFactor
. This contributes to setting the default TTL set on events sent to Riemann. It’s a factor in the TTL calculation:
Interval x TTLFactor = Event TTL
This takes the Interval
we set earlier, in our case 2
, and the TTLFactor, and multiplies them, producing the value of the :ttl
field in the Riemann event that will get set when events are sent to Riemann. Our calculation would look like:
2 x 30.0 = 60
This sets the value of the :ttl
field to 60
seconds, matching the default TTL we set using the default
function in Chapter 3. This means collectd events will have a time to live of 60 seconds if we choose to index them in Riemann.
If we want to send metrics to two Riemann instances we’ll add a second Node
block, like so:
<Plugin "write_riemann">
<Node "riemanna">
Host "riemanna.example.com"
Port "5555"
Protocol TCP
CheckThresholds true
TTLFactor 30.0
</Node>
<Node "riemannb">
Host "riemannb.example.com"
Port "5555"
Protocol TCP
CheckThresholds true
TTLFactor 30.0
</Node>
Tag "collectd"
</Plugin>
You can see we’ve added a second <Node>
block for the riemannb.example.com
server. This potentially allows us to send data to multiple Riemann hosts, creating some redundancy in delivering events.
This is also how we could handle sharding and partitioning if we needed to scale our Riemann deployment. Hosts can be divided into classes and categories and directed at specific Riemann servers. For example, we could create servers like riemanna1
or riemanna2
, etc. With a configuration management tool it is easy to configure collectd (or other collectors) to use a specific Riemann server.
Lastly, we’ve also specified the Tag
option. The Tag
option adds a string as a tag to our Riemann events, populating the :tags
field of an event with that string. You can specify multiple tags by specifying multiple Tag
options.
You can find full details of the plugin’s options in the write_riemann documentation at the collectd wiki.
Now that we have a basic collectd configuration we can proceed to add that configuration to all of our hosts. The best way to do this— indeed, the best way to install and manage collectd overall—is to use a configuration management tool like Puppet, Chef, or Ansible. The collectd configuration file and the per-plugin configuration files we’ve just created lend themselves to management with most configuration management tools’ template engines. This makes it centrally managed and easier to update across many hosts.
Now that we have configured the collectd daemon we enable it to run at boot and start it. On Ubuntu we enable collectd and start it like so:
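A sketch for a SysV-init based Ubuntu release; on newer, systemd-based releases use the systemctl commands shown for Red Hat below:

$ sudo update-rc.d collectd defaults
$ sudo service collectd start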
On Red Hat we enable collectd and start it like so:
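A sketch assuming a systemd-based release such as CentOS or Red Hat Enterprise Linux 7:

$ sudo systemctl enable collectd
$ sudo systemctl start collectd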
We can then check the /var/log/collectd.log log file to confirm collectd is running.
[2015-08-18 20:46:01] Initialization complete, entering read-loop.
Once the collectd daemon is configured and running, events should start streaming into Riemann. Let’s log in to riemanna.example.com
and look in the /var/log/riemann/riemann.log
log file to see what some of those events look like. Remember, our Riemann instance is currently configured to output all events to a log file:
(streams
  (default :ttl 60
    ; Index all events immediately.
    index

    ; Send all events to the log file.
    #(info %)

    (where (service #"^riemann.*")
      graph
      downstream))))
The #(info %)
function will send all incoming events to the log file. Let’s look in the log file now to see events generated by collectd from a selection of plugins.
{:host tornado-web1, :service df-root/df_complex-free, :state ok, :description nil, :metric 2.7197804544E10, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance free, :type df_complex, :plugin_instance root, :plugin df}
{:host tornado-web1, :service load/load/shortterm, :state ok, :description nil, :metric 0.05, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name shortterm, :ds_type gauge, :type load, :plugin load}
{:host tornado-web1, :service memory/memory-used, :state ok, :description nil, :metric 4.5678592E7, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance used, :type memory, :plugin memory}
Let’s look at some of the fields of the events that we’re going to work with in Riemann.
Field | Description
---|---
host | The hostname, e.g. tornado-web1.
service | The service or metric.
state | A string describing state, e.g. ok, warning, critical.
time | The time of the event.
tags | Tags from the Tag option in the write_riemann plugin.
metric | The value of the collectd metric.
ttl | Calculated from the Interval and TTLFactor.
type | The type of data. For example, CPU data.
type_instance | A sub-type of data. For example, idle CPU time.
plugin | The collectd plugin that generated the event.
ds_name | The DataSource name from the types.db file.
ds_type | The type of DataSource. For example, a gauge.
The collectd daemon has constructed events that contain a :host
field with the host name we’re collecting from. In our case all three events are from the tornado-web1
host. We also have a :service
field containing the name of each metric collectd is sending.
Importantly, when we send metrics to Graphite, it will construct the metric name from a combination of these two fields. So the combination of :host tornado-web1
and :service memory/memory-used
becomes tornado-web1.memory.memory-used
in Graphite. Or :host tornado-web1
and :service load/load/shortterm
becomes tornado-web1.load.load.shortterm
.
Our collectd events also have a :state
field. The value of this field is controlled by the threshold checking we enabled with the threshold
plugin above. If no specific threshold has been set or triggered for a collectd metric then the :state
field will default to a value of ok
.
Also familiar from Chapter 3 should be the :ttl
field which sets the time to live for Riemann events in the index. Here the :ttl
is 60 seconds, controlled by multiplying our Interval
and TTLFactor
values as we discovered when we configured collectd.
The :description
field is empty, but we get some descriptive data from the :type
, :type_instance
, and :plugin
fields which tell us what type of metric it is, the more granular type instance, and the name of the plugin that collected it, respectively. For example, one of our events is generated from the memory
plugin with a :type
of memory
and a :type_instance
of used
. Combined, these provide a description of what data we’re specifically collecting.
Also available is the :ds_name
field which is a short name or title for the data source. Next is the :ds_type
field which specifies the data source type. This tells us what type of data this source is—in the case of all these events, a type of gauge
. The collectd daemon has four data source types:

GAUGE - A value stored as-is, which can increase or decrease, like memory usage.
COUNTER - A continuously increasing value from which a rate of change is derived, wrapping around when it overflows.
DERIVE - Like a counter, a value from which a rate of change is derived, but one that is also allowed to decrease.
ABSOLUTE - A value that is reset on every read and divided by the interval to produce a rate.

You can see all of the data types known to collectd in the types.db file in the /usr/share/collectd/ directory.
Lastly, the :metric
and the :time
fields hold the actual value of the metric and the time it was collected, respectively.
Now that we’ve got events flowing from collectd to Riemann, we want to send them the step further and onto Graphite so we can graph them. We’re going to use a tagged
stream to select the collectd events. The tagged
stream selects all events with a specific tag. Our collectd events have acquired a tag, collectd
, via the Tag
directive in the write_riemann
configuration. We can match on this tag and then send the events to Graphite using the graph
var we created in Chapter 4.
Let’s see our new streams.
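A sketch of the addition, sitting inside our existing streams:

(tagged "collectd"
  graph)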
Our tagged
stream grabs all events tagged with collectd
. We then added the graph
var we created in Chapter 4 so that all these events will be sent on to Graphite.
Now let’s see what the metrics arriving at Graphite look like. If we look at the /var/log/carbon/creates.log log file on our graphitea.example.com host we should see new metrics being created. Most of them look much like this:
. . .
productiona.hosts.tornado-web1.interface-eth0/if_packets/rx
productiona.hosts.tornado-web1.interface-eth0/if_errors/rx
productiona.hosts.tornado-web1.cpu-percent/cpu-steal
productiona.hosts.tornado-web1.df-root/df_complex-used
productiona.hosts.tornado-web1.interface-lo/if_packets/rx
productiona.hosts.tornado-web1.processes/ps_state-paging
productiona.hosts.tornado-web1.processes/ps_state-zombies
productiona.hosts.tornado-api1.df-dev/df_complex-used
productiona.hosts.tornado-api1.df-run-shm/df_complex-reserved
productiona.hosts.tornado-api1.df-dev/df_complex-reserved
productiona.hosts.tornado-api1.interface-eth0/if_octets/rx
productiona.hosts.tornado-api1.df-run-lock/df_complex-reserved
productiona.hosts.tornado-api1.interface-lo/if_octets/rx
productiona.hosts.tornado-api1.df-run-lock/df_complex-free
productiona.hosts.tornado-api1.df-run/df_complex-reserved
productiona.hosts.tornado-api1.processes/ps_state-blocked
. . .
We see that our productiona.hosts.
prefix has been prepended to the metric name thanks to the configuration we added in Chapter 4. We also see the name of the host from which the metrics are being collected, here tornado-web1 and tornado-api1
. Each metric name is a varying combination of the collectd plugin data structure: plugin name, plugin instance, and type. For example:
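Taking one of the metric names from the creates.log output above:

productiona.hosts.tornado-api1.processes/ps_state-blocked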
This metric combines the processes
plugin name; the type, ps_state; and the type instance, blocked.
You can see that a number of the metrics have fairly complex naming patterns. To make it easier to use them we’re going to try to simplify that a little. This is entirely optional but it is something we do to make building graphs a bit faster and easier. If you don’t see the need for any refactoring you can skip to the next section.
There are several places we could adjust the names of our metrics: inside collectd itself, using the Chains construct in the collectd daemon, or centrally in Riemann as the events pass through.

We’re going to use the last option, Riemann, because it’s a nice central collection point for metrics. We’re going to make use of a neat bit of code written by Pierre-Yves Ritschard, a well-known member of the collectd and Riemann communities. The code takes the incoming collectd metrics and rewrites their :service
fields to make them easier to understand.
Let’s first create a file to hold our new code on our Riemann server.
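For example:

$ sudo mkdir -p /etc/riemann/examplecom/etc
$ sudo touch /etc/riemann/examplecom/etc/collectd.clj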
Here we’ve created /etc/riemann/examplecom/etc/collectd.clj
. Let’s populate this file with our borrowed code.
(ns examplecom.etc.collectd
  (:require [clojure.tools.logging :refer :all]
            [riemann.streams :refer :all]
            [clojure.string :as str]))

(def default-services
  [{:service #"^load/load/(.*)$" :rewrite "load $1"}
   {:service #"^cpu/percent-(.*)$" :rewrite "cpu $1"}
   . . .
   {:service #"^interface-(.*)/if_(errors|packets|octets)/(tx|rx)$" :rewrite "nic $1 $3 $2"}])

(defn rewrite-service-with
  [rules]
  (let [matcher (fn [s1 s2] (if (string? s1) (= s1 s2) (re-find s1 s2)))]
    (fn [{:keys [service] :as event}]
      (or
       (first
        (for [{:keys [rewrite] :as rule} rules
              :when (matcher (:service rule) service)]
          (assoc event :service
                 (if (string? (:service rule))
                   rewrite
                   (str/replace service (:service rule) rewrite)))))
       event))))

(def rewrite-service
  (rewrite-service-with default-services))
This looks complex but what’s happening is actually pretty simple. We first define a namespace: examplecom.etc.collectd
. We then require three libraries. The first is Clojure’s string functions from clojure.string
. When we require this namespace we’ve used a new directive called :as
. This creates an alias for the namespace, here str
. This allows us to refer to functions inside the library by this alias: str/replace
. If we remember Chapter 3, refer lets you use names from other namespaces without having to fully qualify them, and :as
lets you use a shorter name for a namespace when you’re writing out a fully qualified name.
We also require the clojure.tools.logging
namespace which provides access to some logging functions, for example the info
function we used earlier in the book. Lastly, we require the riemann.streams
namespace which provides access to Riemann’s streams, for example the where
stream.
Next we’ve created a var called default-services
with the def
statement. This is a series of regular expression maps. Each line does a rewrite of a specific :service
field like so:
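Taking the CPU rule from the default-services var above as an example:

{:service #"^cpu/percent-(.*)$" :rewrite "cpu $1"}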
We first specify the content of the :service
field that we want to match, here that’s #"^cpu/percent-(.*)$"
. This will grab all of the collectd CPU metrics. We capture the final portion of the metric namespace. In the next element, :rewrite
, we specify how the final metrics will look—here that’s cpu $1
, with $1
being the output we captured in our initial regular expression. What’s this actually look like? Well, the :service
field of one of our CPU metrics is: cpu-percent/cpu-steal
. Our regular expression will match the steal
portion of the current metric name and rewrite the service to: cpu steal
. Graphite’s metric names treat spaces as dots, so when it hits Graphite that will be converted into the full path of:
productiona.hosts.tornado-web1.cpu.steal
The rules in our code right now handle the basic set of metrics we’re collecting in this chapter. You can easily add new rules to cover other metrics you’re collecting.
Next is a new function called rewrite-service-with
. This contains the magic that actually does the rewrite. It takes a list of rules, examines incoming events, grabs any events with a :service
field that matches any of our rules, and then rewrites the :service
field using the Clojure string function replace
. It’s reasonably complex but you’re not likely to ever need to change it. We’re not going to step through it in any detail.
Lastly, we have a final var called rewrite-service
. This is what actually runs the rewrite-service-with
function and passes it the rules in the default-services
var.
Let’s rewrite our original tagged
filter in /etc/riemann/riemann.config
to send events through our rewrite function. Here’s our original stanza.
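As a sketch, based on the stream we added earlier in this chapter:

(tagged "collectd"
  graph)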
Let’s replace it with:
(require '[examplecom.etc.collectd :refer :all])
. . .
(tagged "collectd"
(smap rewrite-service graph))
We first add a require
to load the functions in our
examplecom.etc.collectd
namespace. We’re still using the same tagged
filter to grab all events tagged with collectd
, but we’re passing them to a new stream called smap
. The smap
stream is a streaming map. It’s highly useful for transforming events. In this case the smap
stream says: “send any incoming events into the rewrite-service
var, process them, and then send them onto the graph
var”.
This will turn a metric like:
productiona.hosts.tornado-web1.load/load/shortterm
Into:
productiona.hosts.tornado-web1.load.shortterm
This makes the overall metric much easier to parse and use, especially when we start to build graphs and dashboards.
In this chapter we saw how to collect the base level of data across our hosts: CPU, memory, disk, and related data. To do this we installed and configured collectd. We configured collectd to use plugins to collect a wide variety of data on our hosts and direct them to Riemann for any processing and checks, and to forward them onto Graphite for longer-term storage.
In the next chapter we’ll look at making use of our collectd metrics in Riemann and in Graphite and Grafana.