In the last two chapters we installed and configured Riemann and Graphite. We got an introduction to Riemann and how it manages and indexes events. We also saw how to integrate Riemann servers in multiple environments. Finally, we saw how to send events from Riemann to Graphite and then graph them in Grafana.
In this chapter we’re going to add another building block to our monitoring framework by collecting host-based data and sending it to Riemann. Host-based monitoring provides the basic information about our hosts and their performance. We can then collect and combine this data with the application data that we’ll learn to collect in Chapter 9.
We’ll aim to implement host monitoring that:

Runs locally on every host.
Collects a core set of host performance data.
Can be extended for specific use cases, like databases or application servers.
Sends the collected data on to Riemann.
To satisfy these needs we’re going to look at a tool called collectd.
The collectd daemon, which acts as a monitoring collection agent, will do the local monitoring and send the data to Riemann. It will run locally on our hosts and selectively monitor and collect data from a variety of components.
We’ve chosen collectd because it is high performance and reliable. The collectd daemon has been around for about ten years, it’s written in C for performance, and is well tested and widely used. It’s also open source and licensed under a mix of the MIT and GPLv2 licenses.
The daemon uses a modular design: a central core and an integrated plugin system. The core of collectd is small and provides basic filtering and routing for the collected data. The collection, storage, and transmission of data is handled by plugins that can be enabled individually.
For the collection of data, collectd uses plugins called read plugins. They can collect information like CPU performance, memory, or application-specific metrics. This data is then passed into collectd’s core functions.
The data can then be filtered or routed to write plugins. The write plugins allow data to be stored locally—for example, writing to files—or to send that data across the network to other services. In our case, they’ll allow us to send that data to Riemann.
There are a large collection of default plugins shipped with collectd, as well as a variety of community-contributed and developed plugins you can add to it. The collectd daemon also supports running plugins you have written yourself.
This diagram shows our collectd architecture.
Our applications and services are connected via collectd’s read plugins. These read plugins send events to a write plugin that will send events onto Riemann.
In this chapter we’re focusing on using collectd for host-based monitoring, but we’re also going to use it later in the book for collecting service and application data. Using collectd will allow us to run a single agent locally on our host and use it to send our data into our Riemann event router.
Before we jump into installing and configuring collectd, let’s discuss what we might like to collect on our hosts. We’re going to focus on collecting basic data that will show us the core performance of our hosts. We’re going to configure a generic collection of monitoring data on all our hosts, and we’ll add additional monitoring for specific use cases. For example, we would install our base monitoring on all hosts but add specific monitoring for a database or application server.
Our basic set of monitoring will include:

CPU usage.
Memory usage.
Load.
Swap usage.
Disk space.
Network interface and protocol traffic.
Process counts and state.
This base set of data will allow us to identify host performance issues or provide sufficient supplemental data for fault diagnosis of application issues.
At this point some of you may be wondering, “But didn’t we say earlier that we should focus on application and business events and metrics?” We’re indeed going to focus on those application and business events and metrics, but they aren’t the full story. When we have a fault or need to further diagnose a performance issue we often need to drill down to more granular data. The data we’re gathering here will supplement our application and business events and metrics and allow us to diagnose and identify system-level issues that cause application problems.
We’ll install collectd on every host and configure it to collect our metrics and send them to Riemann. We’ll walk through the installation process on both Ubuntu and Red Hat family operating systems and provide you with appropriate configuration management resources to do the installation automatically.
We’ll install version 5.5 of collectd on Ubuntu from the collectd project’s own Launchpad PPA repository and via the apt-get command.
Add the collectd repository configuration to your host:
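A minimal sketch of adding the repository follows. The PPA name here is an assumption; check the collectd project’s documentation for the current repository details.

$ sudo apt-get install -y software-properties-common
$ sudo add-apt-repository ppa:collectd/collectd-5.5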
Then update and install collectd.
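For example:

$ sudo apt-get update
$ sudo apt-get install -y collectd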
The collectd daemon and its associated dependencies will now be installed. We’ll test it is installed and working by running the collectd
binary.
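For example (depending on your distribution the binary may live in /usr/sbin/):

$ collectd -h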
The -h
flag will output collectd’s flags and command line help.
We’ll install collectd on Red Hat using the EPEL repository.
Let’s add the EPEL repository now.
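On CentOS this is a single package install; on Red Hat Enterprise Linux you may instead need to install the EPEL release RPM directly from the Fedora project. A sketch for CentOS:

$ sudo yum install -y epel-release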
Now we’ll use the yum
command to install the collectd
package. We’ll also install the write_riemann
plugin, which we’ll use to send our events to Riemann, and the protobuf-c
package it requires.
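A sketch of the install, assuming EPEL’s package naming for the write_riemann plugin:

$ sudo yum install -y collectd collectd-write_riemann protobuf-c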
On newer Red Hat family releases, such as Fedora 22 and later, the yum command has been replaced with the dnf command. The syntax is otherwise unchanged.
The collectd daemon and its associated dependencies will now be installed. We’ll test it is installed by running the collectd
binary.
The -h
flag will output collectd’s flags and command line help.
There are a variety of options for installing collectd on your hosts via configuration management.
You can find a Chef cookbook for collectd at:
You can find Puppet modules for collectd at:
You can find an Ansible role for collectd at:
You can also find Docker containers at:
And a Vagrant image at:
Now that we have the collectd daemon installed we need to configure it. We’re going to enable some plugins to gather our data and then configure a plugin to send the events to Riemann.
The collectd daemon is configured using a configuration file located at:

/etc/collectd/collectd.conf on Ubuntu.
/etc/collectd.conf on Red Hat.

The collectd package on both distributions installs a default configuration file. We’re not going to use that file—instead, we’re going to build our own configuration.
The collectd configuration is divided into three main concerns:

Global settings for the daemon.
Plugin loading.
Plugin configuration.
Let’s look at our initial configuration file.
# Global settings
Interval 2
CheckThresholds true
WriteQueueLimitHigh 5000
WriteQueueLimitLow 5000
# Plugin loading
LoadPlugin logfile
LoadPlugin threshold
# Plugin Configuration
<Plugin "logfile">
LogLevel "info"
File "/var/log/collectd.log"
Timestamp true
</Plugin>
Include "/etc/collectd.d/*.conf"
First, we need to set two global configuration options in our collectd configuration. The first configuration option, Interval
, sets the collectd daemon’s check interval. This is the resolution at which collectd collects data. It defaults to 10 seconds. We’re going to make our resolution more granular and move it to two second intervals. This should either match or be slightly larger than the lowest retention period set in Graphite to ensure the collection periods sync up correctly. Remember that in Chapter 4 we installed Graphite and configured its retentions in the /etc/carbon/storage-schemas.conf
configuration file.
We could set the Interval to one second, but Riemann’s one-second precision sometimes means duplicate metrics are generated when Riemann rounds the time to the nearest second.
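As a refresher, the retention definition we created in Chapter 4 looked something like this (the exact retention periods here are illustrative):

[default]
pattern = .*
retentions = 1s:24h, 10s:7d, 1m:30d, 10m:1y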
You can see our shortest resolution in the storage schema is one second. Our collectd Interval
should be the same or more than that resolution.
The second global option, CheckThresholds
, can be set to true
or false
, and controls how state is set for collected data. The collectd daemon can check collected data for thresholds, marking data with metrics that exceed those thresholds as being in a warning or failed state. We’re not going to set any specific thresholds right now, but we’re turning on the functionality because it creates a default state—shown in the :state
field in a Riemann event—of ok
that we can later use in Riemann.
The WriteQueueLimitHigh
and WriteQueueLimitLow
options control the queue size of write plugins. This protects us from memory issues if a write plugin is slow, for example if the network times out. You can set the limits using the WriteQueueLimitHigh
and WriteQueueLimitLow
options. Each option takes a number, which is the number of metrics in the queue. If there are more metrics than the value of the WriteQueueLimitHigh
option then any new metrics will be dropped. If there are fewer than WriteQueueLimitLow
metrics in the queue, all new metrics will be enqueued.
If the number of metrics currently in the queue is between the two thresholds the metric is potentially dropped, with a probability that is proportional to the number of metrics in the queue. This is a bit random for us so we set both options to 5000
. This sets an absolute threshold. If more than 5000
metrics are in the queue then incoming metrics will be dropped. This protects the memory on the host running collectd.
Next, we’ll enable some plugins for collectd. Use the LoadPlugin
command to load each plugin.
LoadPlugin logfile
<Plugin "logfile">
LogLevel "info"
File "/var/log/collectd.log"
Timestamp true
</Plugin>
LoadPlugin threshold
The first plugin, logfile
, tells collectd to log its output to a log file.
We’ve configured the plugin using a <Plugin>
block. Each <Plugin>
block specifies the plugin it is providing configuration for, here <Plugin "logfile">
. It also contains a series of configuration options for that plugin. In this case, we’ve specified the LogLevel
option, which tells collectd how verbose to make our logging. We’ve chosen a middle ground of info
, which logs some operational and error information, but skips a lot of debugging output. We’ve also specified the File
option to tell collectd where to log this information, choosing /var/log/collectd.log
.
Lastly, we’ve set the Timestamp
option to true
which adds a timestamp to any log output collectd generates.
We’ve put the logfile plugin and its configuration first in the collectd.conf file so that if something goes wrong with collectd we’re likely to catch it in the log.
We’ll also need to load the threshold
plugin. The threshold
plugin is linked to the CheckThresholds
setting we configured in the global options above. This plugin enables collectd’s threshold checking logic. We’re going to use it to help identify when things go wrong on our host.
Finally, we’ll set one last global configuration option, Include "/etc/collectd.d/*.conf"
. The Include
option specifies a directory to hold additional collectd configuration. Any file ending in .conf
will be added to our collectd configuration. We’re going to use this capability to make our configuration easier to manage. With snippets included via the Include
directory, we can easily manage collectd with configuration management tools like Puppet, Chef, or Ansible.
You’ll note that we’ve put the Include
option at the end of the configuration file. This is because collectd loads configuration in top-down order. We want our included files to be loaded last.
Let’s create the directory for the files we wish to include now (it already exists on some distributions).
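For example:

$ sudo mkdir -p /etc/collectd.d/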
We’re now going to load and configure a series of plugins to collect data. We’re going to make use of the following plugins:
cpu - Collects CPU usage data.
memory - Collects memory usage data.
df - Collects disk usage data, much like the df command.
swap - Collects swap usage data.
interface - Collects network interface data.
protocols - Collects network protocol data.
load - Collects load average data.
processes - Collects process counts and state.

These plugins provide data for the basic state of most Linux hosts. We’re going to configure each plugin in a separate file and store them in the /etc/collectd.d/
directory we’ve specified as the value of the Include
option. This separation allows us to individually manage each plugin and lends itself to management with a configuration management tool like Puppet, Chef, or Ansible.
Now let’s configure each plugin.
The first plugin we’re going to configure is the cpu
plugin. The cpu
plugin collects CPU performance metrics on our hosts. By default, the cpu
plugin emits CPU metrics in Jiffies: the number of ticks since the host booted. We’re going to also send something a bit more useful: percentages.
First, we’re going to create a file to hold our plugin configuration. We’ll put it into the /etc/collectd.d
directory. As a result of the Include
option in the collectd.conf
configuration file, collectd will load this file automatically.
Let’s now populate this file with our configuration.
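A sketch of the resulting configuration, assuming we’ve called the file /etc/collectd.d/cpu.conf:

LoadPlugin cpu
<Plugin "cpu">
  ValuesPercentage true
  ReportByCpu false
</Plugin>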
We load the plugin using the LoadPlugin
option. We then specify a <Plugin>
block to configure our plugin. We’ve specified two options. The first, ValuesPercentage
, set to true
, will tell collectd to also send all CPU metrics as percentages. The second option, ReportByCpu
, set to false
, will aggregate all CPU cores on the host into a single metric. We like to do this for simplicity’s sake. If you’d prefer your host to report CPU performance per core, you can change this to true
.
Next let’s configure the memory
plugin. This plugin collects information on how much RAM is free and used on our host. By default the memory
plugin returns metrics in bytes used or free. Often this isn’t useful because we don’t know how much memory a specific host might have. If we return the value as a percentage instead, we can more easily use this metric to determine if we need to take some action on our host. So we’re going to do the same percentage conversion for our memory metrics.
Let’s start by creating a file to hold the configuration for the memory
plugin.
Now let’s populate this file with our configuration.
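A sketch of the configuration, assuming the file is /etc/collectd.d/memory.conf:

LoadPlugin memory
<Plugin "memory">
  ValuesPercentage true
</Plugin>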
We first load the plugin and then add a <Plugin>
block to configure it. We set the ValuesPercentage
option to true
to make the memory
plugin also emit metrics as percentages.
The df
plugin collects disk space metrics, including space used, on mount points and devices. Like the memory
plugin, it outputs in bytes by default—so let’s configure it to emit metrics in percentages, instead. That will make it easier to determine if a mount or device has a disk space issue.
Let’s start by creating a configuration file.
And now we’ll populate this file with our configuration.
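A sketch of the configuration, assuming the file is /etc/collectd.d/df.conf:

LoadPlugin df
<Plugin "df">
  MountPoint "/"
  ValuesPercentage true
</Plugin>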
We first load the plugin and then configure it. The ValuesPercentage
option tells the df
plugin to also emit metrics with percentage values. We’ve also specified a mount point we’d like to monitor via the MountPoint
option. This option allows us to specify which mount points to collect disk space metrics on. We’ve only specified the /
(“root”) mount point. If we wanted or needed to, we could add additional mount points to the configuration now. To monitor a mount point called /data
we would add:
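We’d add another MountPoint entry inside the <Plugin "df"> block:

MountPoint "/data"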
We can also specify devices using the Device option.
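For example (the device name here is illustrative):

Device "/dev/sda1"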
Or we can monitor all filesystems and mount points on the host.
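A sketch of that configuration, with no MountPoint or Device entries selected:

<Plugin "df">
  ValuesPercentage true
  IgnoreSelected true
</Plugin>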
The IgnoreSelected
option, when set to true
, tells the df
plugin to ignore all configured mounts points or devices and monitor every mount point and device.
Let’s now configure the swap
plugin. This plugin collects a variety of metrics on the state of swap on your hosts. Like other plugins we’ve seen, it returns metric values measuring swap used in bytes. We again want to make those easier to consume by emitting them as percentage values.
Let’s first create a file to hold its configuration.
Now we’ll populate this file with our configuration.
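A sketch of the configuration, assuming the file is /etc/collectd.d/swap.conf:

LoadPlugin swap
<Plugin "swap">
  ValuesPercentage true
</Plugin>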
We’ve only specified one option, ValuesPercentage
, and set it to true
. This will cause the swap
plugin to emit metrics using percentage values.
Now let’s configure our interface
plugin. The interface
plugin collects data on our network interfaces and their performance.
We’re not going to add any configuration to the interface
plugin by default so we’re just going to load it.
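A sketch of the configuration, assuming the file is /etc/collectd.d/interface.conf:

LoadPlugin interface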
Without configuration, the interface
plugin collects metrics on all interfaces on the host. If we wanted to limit this to one or more interfaces, instead, that can be specified via configuration using the Interface
option.
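For example (the interface name here is illustrative):

<Plugin "interface">
  Interface "eth0"
</Plugin>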
Other times we might want to exclude specific interfaces—for example, the loopback interface—from our monitoring.
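A configuration along these lines would do that:

<Plugin "interface">
  Interface "lo"
  IgnoreSelected true
</Plugin>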
Here we’ve specified the loopback, lo
, interface with the Interface
option. We’ve added the IgnoreSelected
option and set it to true
. This will tell the interface
plugin to monitor all interfaces except the lo
interface.
Also collecting data about our host’s network performance is the protocols
plugin. The protocols
plugin collects data on our network protocols running on the host and their performance.
We’re not going to add any configuration to the protocols
plugin by default so we’re just going to load it.
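A sketch, assuming the file is /etc/collectd.d/protocols.conf:

LoadPlugin protocols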
We’re not going to add any configuration for the next plugin, load
. We’re just going to load it. This plugin collects load metrics on our host.
Let’s create a file for the load
plugin.
And configure it to load the load
plugin.
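Assuming the file is /etc/collectd.d/load.conf, it contains a single directive:

LoadPlugin load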
Lastly, we’re going to configure the processes
plugin. The processes
plugin monitors both processes broadly on the host—for example, the number of active and zombie processes—but it can also be focused on individual processes to get more details about them. We’re going to create a file to hold our processes
plugin configuration.
We’re then going to specifically monitor the collectd
process itself and set a default threshold for any monitored processes. The processes
plugin provides insight into the performance of specific processes. We’ll also use it to make sure specific processes are running—for example, ensuring collectd is running.
LoadPlugin processes
<Plugin "processes">
Process "collectd"
</Plugin>
<Plugin "threshold">
<Plugin "processes">
<Type "ps_count">
DataSource "processes"
FailureMin 1
</Type>
</Plugin>
</Plugin>
In some cases you might want to monitor a process (or any other plugin) at a longer interval than the global interval of 2 seconds we configured. You can override the global Interval
by setting a new interval when loading the plugin. To do this we convert the LoadPlugin
directive into a block like so:
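A sketch of that block for the processes plugin:

<LoadPlugin processes>
  Interval 10
</LoadPlugin>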
You can read more about this in the LoadPlugin configuration section in the collectd configuration documentation.
Here inside our new <LoadPlugin>
block we’ve specified a new Interval
directive with a collection period of 10
seconds.
We continue by configuring the processes
plugin inside the <Plugin>
block. It can take two configuration options: Process
and ProcessMatch
. The Process
option matches a specific process by name, for example the collectd
process as we’ve defined here. The ProcessMatch
option matches one or more processes via a regular expression. Each regular expression is tied to a label, like so:
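The general form is a label followed by a regular expression:

ProcessMatch "label" "regular expression"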
Let’s take a look at ProcessMatch
in action.
<Plugin "processes">
ProcessMatch "carbon-cache" "python.+carbon-cache"
ProcessMatch "carbon-relay" "python.+carbon-relay"
</Plugin>
This example would match all of the carbon-cache
and carbon-relay
processes we configured in Chapter 4 under labels. For example, the regular expression python.+carbon-cache
would match all Carbon Cache daemons running on a host and group them under the label carbon-cache
.
In another example, if we wished to monitor the Riemann server running on our riemanna
, riemannb
, and riemannmc
hosts, we’d configure the following ProcessMatch
regular expression.
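A sketch of that configuration; the label riemann here is an assumption:

ProcessMatch "riemann" "riemann.jar:\sriemann.bin"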
If we run a ps
command on one of the Riemann hosts to check if monitoring for riemann.jar:\sriemann.bin
will find the Riemann process we’ll see:
$ ps aux | grep 'riemann.jar:\sriemann.bin'
riemann 14998 33.8 28.4 6854420 2329212 ? Sl Apr21 5330:50 java -Xmx4096m -Xms4096m -XX:NewRatio=1 -XX:PermSize=128m -XX:MaxPermSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow -cp /usr/share/riemann/riemann.jar: riemann.bin start /etc/riemann/riemann.config
We see here that our grep
matches riemann.jar:\sriemann.bin
, hence collectd will be able to monitor the Riemann server process.
The output from the processes
plugin for specific processes provides us with detailed information on:

The CPU time the process is consuming.
The memory, including resident set size, the process is using.
Page faults and disk I/O.
The number of threads.
The number of processes running under that name.
Now let’s look at how we’re going to configure some process thresholds using some of this data.
The last metric, the number of processes by that name, is useful for determining whether the process itself has failed. We’re going to use it to monitor processes to determine if they are running. This brings us to the second piece of configuration in our processes.conf
file.
. . .
<Plugin "threshold">
<Plugin "processes">
<Type "ps_count">
DataSource "processes"
FailureMin 1
</Type>
</Plugin>
</Plugin>
This new <Plugin>
block configures the threshold
plugin we loaded as part of our global configuration. You can specify one or more thresholds for every plugin you use. If a threshold is breached then collectd will generate failure events—for example, if we were monitoring the RSyslog daemon and collectd stopped detecting the process, we’d receive a notification much like:
[2016-01-07 04:51:15] Notification: severity = FAILURE, host = graphitea, plugin = processes, plugin_instance = rsyslogd, type = ps_count, message = Host graphitea, plugin processes (instance rsyslogd) type ps_count: Data source "processes" is currently 0.000000. That is below the failure threshold of 1.000000.
The plugin we’re going to use for sending events to Riemann is also aware of these thresholds and will send them onto Riemann where we can use them to detect services that have failed.
To configure our threshold, we must first define which collectd plugin the threshold will apply to using a new <Plugin>
block. In this case we’re defining thresholds for the processes
plugin. We then need to tell the threshold
plugin which metric generated by this plugin to use for our threshold. For the processes
plugin this metric is called ps_count
, which is the number of processes running under a specific name. We define the metric to be used with the <Type>
block.
Inside the <Type>
block we specify two options: DataSource
and FailureMin
. The DataSource
option references the type of data we’re collecting. Each collectd metric can be made up of multiple types of data. To keep track of these data types, the collectd daemon has a central registry of all data known to it, called the type database. It’s located in a database file called types.db
, which was created when we installed collectd. On most distributions it’s in the /usr/share/collectd/
directory. If we peek inside the types.db
file to find the ps_count
metric we will see:
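The entry looks something like this:

ps_count                processes:GAUGE:0:1000000, threads:GAUGE:0:1000000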
Note that the ps_count
metric is made up of two data sources, both gauges (see Chapter 2 for information on what a gauge is):
processes - which can range from 0 to 1000000.
threads - which can range from 0 to 1000000.

This means ps_count
can be used to either track the number of processes or the number of threads. In our case we care about how many processes are running.
The second option, FailureMin
, is the threshold itself. The FailureMin
option specifies the minimum number of processes required before triggering a failure event. We’ve specified 1
. So, if there is 1
process running, collectd will be happy. If the process count drops below 1
, then a failure event will be generated and sent on to Riemann.
The threshold
plugin supports several types of threshold:
FailureMin - Generates a failure event if the metric falls below the minimum.
WarningMin - Generates a warning event if the metric falls below the minimum.
FailureMax - Generates a failure event if the metric exceeds the maximum.
WarningMax - Generates a warning event if the metric exceeds the maximum.

If you combine failure and warning thresholds you can create dual-tier event notifications, for example:
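A sketch of a dual-tier threshold using the ps_count type:

<Type "ps_count">
  DataSource "processes"
  WarningMin 3
  FailureMin 1
</Type>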
Here, if the metric drops below 3
then a warning event is triggered. If it drops below 1
then a failure event is triggered. We use this to vary the response to a threshold breach—for example, a warning might not merit immediate action but the failure would.
If the threshold breach is corrected then collectd will generate a “things are okay” event, letting us know that things are back to normal. This will also be sent to Riemann and we could use it to automatically resolve an incident or generate a notification.
The threshold we’ve just defined is global for the processes
plugin. This means any process we watch with the processes
plugin will be expected to have a minimum of one running process. Currently, we’re only managing one process, collectd
; if it drops to zero, it likely means collectd has failed. But if we were to define other processes to watch, which we’ll do in subsequent chapters, then collectd would trigger events for these if the number of processes was to drop below 1
.
But sometimes normal operation for a service will be to have more than one running process. To address this we customize thresholds for specific processes. Let’s say normal operation for a specific process meant two or more processes should be running, for example:
<Plugin "processes">
ProcessMatch "carbon-cache" "python.+carbon-cache"
ProcessMatch "carbon-relay" "python.+carbon-relay"
</Plugin>
<Plugin "threshold">
<Plugin "processes">
Instance "carbon-cache"
<Type "ps_count">
DataSource "processes"
WarningMin 2
FailureMin 1
</Type>
Instance "carbon-relay"
<Type "ps_count">
DataSource "processes"
FailureMin 1
</Type>
</Plugin>
</Plugin>
Here we have our carbon-cache
and carbon-relay
process matching configuration. We know that there should be two carbon-cache
processes, and we configure the threshold
plugin to check that. We’ve added a new directive, Instance
, which we’ve set to the ProcessMatch
label, carbon-cache
. The Instance
directive tells the threshold
plugin that this configuration is specifically for monitoring our carbon-cache
processes.
We’ve also set both the WarningMin
and FailureMin
thresholds. If the number of carbon-cache
processes drops below 2
then a warning event will trigger. If the number drops below 1
then a failure event will trigger.
There are a number of other options available for the threshold plugin. You can read about them on the collectd wiki’s threshold configuration page.
If we were to stop the carbon-cache
daemon now on one of our Graphite hosts, we’d see the following event in the /var/log/collectd.log
log file.
We could then use this event in Riemann to trigger a notification and let us know something was wrong. We’ll see how to do that later in this chapter. First though, we need our metrics to be sent onto our last plugin, write_riemann
, which does the actual sending of data to Riemann.
A similar failure event would be generated if we stopped the carbon-relay process.
Next, we need to configure the write_riemann
plugin. This is the plugin that will write our metrics to Riemann. We’re going to create our write_riemann
configuration in a file in the /etc/collectd.d/
directory.
First, let’s create a file to hold the plugin’s configuration.
Now let’s load and configure the plugin.
LoadPlugin write_riemann
<Plugin "write_riemann">
<Node "riemanna">
Host "riemanna.example.com"
Port "5555"
Protocol TCP
CheckThresholds true
StoreRates false
TTLFactor 30.0
</Node>
Tag "collectd"
</Plugin>
We first use the LoadPlugin
directive to load the write_riemann
plugin. Then, inside the <Plugin>
block, we specify <Node>
blocks. Each <Node>
block has a name and specifies a connection to a Riemann instance. We’ve called our only node riemanna
. Each node must have a unique name.
Each connection is configured with the Host
, Port
, and Protocol
options. In this case we’re installing on a host in the Production A environment, and we’re connecting to the corresponding Riemann server in that environment: riemanna.example.com
. We’re using the default port of 5555
and the TCP protocol. To ensure your collectd events reach Riemann you’ll need to ensure that TCP port 5555
is open between your hosts and the Riemann server. We’re using TCP rather than UDP to provide a stronger guarantee of delivery.
The CheckThresholds
option configures the plugin to apply any thresholds we set with the threshold plugin we enabled earlier, attaching the resulting state to the events sent to Riemann.
Setting the StoreRates
option to false
configures the plugin to not convert any counters to rates. The default, true
, turns any counters into rates rather than incremental integers. As we’re going to use counters a lot (see especially Chapter 9 on application metrics) we want to use them rather than rates.
The last option in the Node
block is TTLFactor
. This contributes to setting the default TTL set on events sent to Riemann. It’s a factor in the TTL calculation:
Interval x TTLFactor = Event TTL
This takes the Interval
we set earlier, in our case 2
, and the TTLFactor, and multiplies them, producing the value of the :ttl
field in the Riemann event that will get set when events are sent to Riemann. Our calculation would look like:
2 x 30.0 = 60
This sets the value of the :ttl
field to 60
seconds, matching the default TTL we set using the default
function in Chapter 3. This means collectd events will have a time to live of 60 seconds if we choose to index them in Riemann.
If we want to send metrics to two Riemann instances we’ll add a second Node
block, like so:
<Plugin "write_riemann">
<Node "riemanna">
Host "riemanna.example.com"
Port "5555"
Protocol TCP
CheckThresholds true
TTLFactor 30.0
</Node>
<Node "riemannb">
Host "riemannb.example.com"
Port "5555"
Protocol TCP
CheckThresholds true
TTLFactor 30.0
</Node>
Tag "collectd"
</Plugin>
You can see we’ve added a second <Node>
block for the riemannb.example.com
server. This potentially allows us to send data to multiple Riemann hosts, creating some redundancy in delivering events.
This is also how we could handle sharding and partitioning if we needed to scale our Riemann deployment. Hosts can be divided into classes and categories and directed at specific Riemann servers. For example, we could create servers like riemanna1
or riemanna2
, etc. With a configuration management tool it is easy to configure collectd (or other collectors) to use a specific Riemann server.
Lastly, we’ve also specified the Tag
option. The Tag
option adds a string as a tag to our Riemann events, populating the :tags
field of an event with that string. You can specify multiple tags by specifying multiple Tag
options.
You can find full details of the plugin’s options in the write_riemann documentation at the collectd wiki.
Now that we have a basic collectd configuration we can proceed to add that configuration to all of our hosts. The best way to do this— indeed, the best way to install and manage collectd overall—is to use a configuration management tool like Puppet, Chef, or Ansible. The collectd configuration file and the per-plugin configuration files we’ve just created lend themselves to management with most configuration management tools’ template engines. This makes it centrally managed and easier to update across many hosts.
Now that we have configured the collectd daemon we enable it to run at boot and start it. On Ubuntu we enable collectd and start it like so:
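A sketch for a SysV-init based Ubuntu release; on newer, systemd-based releases use the systemctl commands shown for Red Hat below:

$ sudo update-rc.d collectd defaults
$ sudo service collectd start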
On Red Hat we enable collectd and start it like so:
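A sketch assuming a systemd-based release such as CentOS or Red Hat Enterprise Linux 7:

$ sudo systemctl enable collectd
$ sudo systemctl start collectd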
We can then check the /var/log/collectd.log log file to confirm collectd is running.
[2015-08-18 20:46:01] Initialization complete, entering read-loop.
Once the collectd daemon is configured and running, events should start streaming into Riemann. Let’s log in to riemanna.example.com
and look in the /var/log/riemann/riemann.log
log file to see what some of those events look like. Remember, our Riemann instance is currently configured to output all events to a log file:
(streams
  (default :ttl 60
    ; Index all events immediately.
    index

    ; Send all events to the log file.
    #(info %)

    (where (service #"^riemann.*")
      graph
      downstream))))
The #(info %)
function will send all incoming events to the log file. Let’s look in the log file now to see events generated by collectd from a selection of plugins.
{:host tornado-web1, :service df-root/df_complex-free, :state ok, :description nil, :metric 2.7197804544E10, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance free, :type df_complex, :plugin_instance root, :plugin df}
{:host tornado-web1, :service load/load/shortterm, :state ok, :description nil, :metric 0.05, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name shortterm, :ds_type gauge, :type load, :plugin load}
{:host tornado-web1, :service memory/memory-used, :state ok, :description nil, :metric 4.5678592E7, :tags [collectd], :time 1425240301, :ttl 60.0, :ds_index 0, :ds_name value, :ds_type gauge, :type_instance used, :type memory, :plugin memory}
Let’s look at some of the fields of the events that we’re going to work with in Riemann.
Field | Description
---|---
host | The hostname, e.g. tornado-web1.
service | The service or metric.
state | A string describing state, e.g. ok, warning, critical.
time | The time of the event.
tags | Tags from the Tag option in the write_riemann plugin.
metric | The value of the collectd metric.
ttl | Calculated from the Interval and TTLFactor.
type | The type of data. For example, CPU data.
type_instance | A sub-type of data. For example, idle CPU time.
plugin | The collectd plugin that generated the event.
ds_name | The DataSource name from the types.db file.
ds_type | The type of DataSource. For example, a gauge.
The collectd daemon has constructed events that contain a :host
field with the host name we’re collecting from. In our case all three events are from the tornado-web1
host. We also have a :service
field containing the name of each metric collectd is sending.
Importantly, when we send metrics to Graphite, it will construct the metric name from a combination of these two fields. So the combination of :host tornado-web1
and :service memory/memory-used
becomes tornado-web1.memory.memory-used
in Graphite. Or :host tornado-web1
and :service load/load/shortterm
becomes tornado-web1.load.load.shortterm
.
Our collectd events also have a :state
field. The value of this field is controlled by the threshold checking we enabled with the threshold
plugin above. If no specific threshold has been set or triggered for a collectd metric then the :state
field will default to a value of ok
.
Also familiar from Chapter 3 should be the :ttl
field which sets the time to live for Riemann events in the index. Here the :ttl
is 60 seconds, controlled by multiplying our Interval
and TTLFactor
values as we discovered when we configured collectd.
The :description
field is empty, but we get some descriptive data from the :type
, :type_instance
, and :plugin
fields which tell us what type of metric it is, the more granular type instance, and the name of the plugin that collected it, respectively. For example, one of our events is generated from the memory
plugin with a :type
of memory
and a :type_instance
of used
. Combined, these provide a description of what data we’re specifically collecting.
Also available is the :ds_name
field which is a short name or title for the data source. Next is the :ds_type
field which specifies the data source type. This tells us what type of data this source is—in the case of all these events, a type of gauge
. The collectd daemon has four data source types:

GAUGE - A value stored as-is, which can increase or decrease, like memory usage.
COUNTER - A continuously increasing value from which a rate of change is derived, wrapping around when it overflows.
DERIVE - Like a counter, a value from which a rate of change is derived, but one that is also allowed to decrease.
ABSOLUTE - A value that is reset on every read and divided by the interval to produce a rate.

You can see all of the data types known to collectd in the types.db file in the /usr/share/collectd/ directory.
Lastly, the :metric
and the :time
fields hold the actual value of the metric and the time it was collected, respectively.
Now that we’ve got events flowing from collectd to Riemann, we want to send them the step further and onto Graphite so we can graph them. We’re going to use a tagged
stream to select the collectd events. The tagged
stream selects all events with a specific tag. Our collectd events have acquired a tag, collectd
, via the Tag
directive in the write_riemann
configuration. We can match on this tag and then send the events to Graphite using the graph
var we created in Chapter 4.
Let’s see our new streams.
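A sketch of the addition, sitting inside our existing streams:

(tagged "collectd"
  graph)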
Our tagged
stream grabs all events tagged with collectd
. We then added the graph
var we created in Chapter 4 so that all these events will be sent on to Graphite.
Now let’s see what the metrics arriving at Graphite look like. If we look at the /var/log/carbon/creates.log log file on our graphitea.example.com host we should see new metrics being created. Most of them look much like this:
. . .
productiona.hosts.tornado-web1.interface-eth0/if_packets/rx
productiona.hosts.tornado-web1.interface-eth0/if_errors/rx
productiona.hosts.tornado-web1.cpu-percent/cpu-steal
productiona.hosts.tornado-web1.df-root/df_complex-used
productiona.hosts.tornado-web1.interface-lo/if_packets/rx
productiona.hosts.tornado-web1.processes/ps_state-paging
productiona.hosts.tornado-web1.processes/ps_state-zombies
productiona.hosts.tornado-api1.df-dev/df_complex-used
productiona.hosts.tornado-api1.df-run-shm/df_complex-reserved
productiona.hosts.tornado-api1.df-dev/df_complex-reserved
productiona.hosts.tornado-api1.interface-eth0/if_octets/rx
productiona.hosts.tornado-api1.df-run-lock/df_complex-reserved
productiona.hosts.tornado-api1.interface-lo/if_octets/rx
productiona.hosts.tornado-api1.df-run-lock/df_complex-free
productiona.hosts.tornado-api1.df-run/df_complex-reserved
productiona.hosts.tornado-api1.processes/ps_state-blocked
. . .
We see that our productiona.hosts.
prefix has been prepended to the metric name thanks to the configuration we added in Chapter 4. We also see the name of the host from which the metrics are being collected, here tornado-web1 and tornado-api1
. Each metric name is a varying combination of the collectd plugin data structure: plugin name, plugin instance, and type. For example:
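Taking one of the metric names from the creates.log output above:

productiona.hosts.tornado-api1.processes/ps_state-blocked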
This metric combines the processes
plugin name; the type, ps_state; and the type instance, blocked.
You can see that a number of the metrics have fairly complex naming patterns. To make it easier to use them we’re going to try to simplify that a little. This is entirely optional but it is something we do to make building graphs a bit faster and easier. If you don’t see the need for any refactoring you can skip to the next section.
There are several places we could adjust the names of our metrics: inside collectd itself, using the Chains construct in the collectd daemon, or centrally in Riemann as the events pass through.

We’re going to use the last option, Riemann, because it’s a nice central collection point for metrics. We’re going to make use of a neat bit of code written by Pierre-Yves Ritschard, a well-known member of the collectd and Riemann communities. The code takes the incoming collectd metrics and rewrites their :service
fields to make them easier to understand.
Let’s first create a file to hold our new code on our Riemann server.
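For example:

$ sudo mkdir -p /etc/riemann/examplecom/etc
$ sudo touch /etc/riemann/examplecom/etc/collectd.clj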
Here we’ve created /etc/riemann/examplecom/etc/collectd.clj
. Let’s populate this file with our borrowed code.
(ns examplecom.etc.collectd
  (:require [clojure.tools.logging :refer :all]
            [riemann.streams :refer :all]
            [clojure.string :as str]))

(def default-services
  [{:service #"^load/load/(.*)$" :rewrite "load $1"}
   {:service #"^cpu/percent-(.*)$" :rewrite "cpu $1"}
   . . .
   {:service #"^interface-(.*)/if_(errors|packets|octets)/(tx|rx)$" :rewrite "nic $1 $3 $2"}])

(defn rewrite-service-with
  [rules]
  (let [matcher (fn [s1 s2] (if (string? s1) (= s1 s2) (re-find s1 s2)))]
    (fn [{:keys [service] :as event}]
      (or
       (first
        (for [{:keys [rewrite] :as rule} rules
              :when (matcher (:service rule) service)]
          (assoc event :service
                 (if (string? (:service rule))
                   rewrite
                   (str/replace service (:service rule) rewrite)))))
       event))))

(def rewrite-service
  (rewrite-service-with default-services))
This looks complex but what’s happening is actually pretty simple. We first define a namespace: examplecom.etc.collectd
. We then require three libraries. The first is Clojure’s string functions from clojure.string
. When we require this namespace we’ve used a new directive called :as
. This creates an alias for the namespace, here str
. This allows us to refer to functions inside the library by this alias: str/replace
. If we remember Chapter 3, refer lets you use names from other namespaces without having to fully qualify them, and :as
lets you use a shorter name for a namespace when you’re writing out a fully qualified name.
We also require the clojure.tools.logging
namespace which provides access to some logging functions, for example the info
function we used earlier in the book. Lastly, we require the riemann.streams
namespace which provides access to Riemann’s streams, for example the where
stream.
Next we’ve created a var called default-services
with the def
statement. This is a series of regular expression maps. Each line does a rewrite of a specific :service
field like so:
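Taking the CPU rule from the default-services var above as an example:

{:service #"^cpu/percent-(.*)$" :rewrite "cpu $1"}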
We first specify the content of the :service
field that we want to match, here that’s #"^cpu/percent-(.*)$"
. This will grab all of the collectd CPU metrics. We capture the final portion of the metric namespace. In the next element, :rewrite
, we specify how the final metrics will look—here that’s cpu $1
, with $1
being the output we captured in our initial regular expression. What’s this actually look like? Well, the :service
field of one of our CPU metrics is: cpu-percent/cpu-steal
. Our regular expression will match the steal
portion of the current metric name and rewrite the service to: cpu steal
. Graphite’s metric names treat spaces as dots, so when it hits Graphite that will be converted into the full path of:
productiona.hosts.tornado-web1.cpu.steal
The rules in our code right now handle the basic set of metrics we’re collecting in this chapter. You can easily add new rules to cover other metrics you’re collecting.
Next is a new function called rewrite-service-with
. This contains the magic that actually does the rewrite. It takes a list of rules, examines incoming events, grabs any events with a :service
field that matches any of our rules, and then rewrites the :service
field using the Clojure string function replace
. It’s reasonably complex but you’re not likely to ever need to change it. We’re not going to step through it in any detail.
Lastly, we have a final var called rewrite-service
. This is what actually runs the rewrite-service-with
function and passes it the rules in the default-services
var.
Let’s rewrite our original tagged
filter in /etc/riemann/riemann.config
to send events through our rewrite function. Here’s our original stanza.
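As a sketch, based on the stream we added earlier in this chapter:

(tagged "collectd"
  graph)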
Let’s replace it with:
(require '[examplecom.etc.collectd :refer :all])
. . .
(tagged "collectd"
(smap rewrite-service graph))
We first add a require
to load the functions in our
examplecom.etc.collectd
namespace. We’re still using the same tagged
filter to grab all events tagged with collectd
, but we’re passing them to a new stream called smap
. The smap
stream is a streaming map. It’s highly useful for transforming events. In this case the smap
stream says: “send any incoming events into the rewrite-service
var, process them, and then send them onto the graph
var”.
This will turn a metric like:
productiona.hosts.tornado-web1.load/load/shortterm
Into:
productiona.hosts.tornado-web1.load.shortterm
This makes the overall metric much easier to parse and use, especially when we start to build graphs and dashboards.
In this chapter we saw how to collect the base level of data across our hosts: CPU, memory, disk, and related data. To do this we installed and configured collectd. We configured collectd to use plugins to collect a wide variety of data on our hosts and direct them to Riemann for any processing and checks, and to forward them onto Graphite for longer-term storage.
In the next chapter we’ll look at making use of our collectd metrics in Riemann and in Graphite and Grafana.