Chapter 12. Writing Exporters

Sometimes you will not be able either to add direct instrumentation to an application or to find an existing exporter that covers it. This leaves you with having to write an exporter yourself. The good news is that exporters are relatively easy to write. The hard part is figuring out what the metrics exposed by applications mean. Units are often unknown, and documentation, if it exists at all, can be vague. In this chapter you will learn how to write exporters.

Consul Telemetry

I’m going to write a small exporter for Consul to demonstrate the process. While we already saw version 0.3.0 of the Consul exporter in “Consul”, that version is missing metrics from the newly added telemetry API.1

While you can write exporters in any programming language, the majority are written in Go, and that is the language I will use here. However, you will find a small number of exporters written in Python, and an even smaller number in Java.

If your Consul is not running, start it again following the instructions in Example 8-6. If you visit http://localhost:8500/v1/agent/metrics you will see the JSON output that you will be working with, which is similar to Example 12-1. Conveniently, Consul provides a Go library that you can use, so you don’t have to worry about parsing the JSON yourself.

Example 12-1. An abbreviated example output from a Consul agent’s metrics output
{
  "Timestamp": "2018-01-31 14:42:10 +0000 UTC",
  "Gauges": [
    {
        "Name": "consul.autopilot.failure_tolerance",
        "Value": 0,
        "Labels": {}
    }
  ],
  "Points": [],
  "Counters": [
    {
        "Name": "consul.raft.apply",
        "Count": 1,
        "Sum": 1, "Min": 1, "Max": 1, "Mean": 1, "Stddev": 0,
        "Labels": {}
    }
  ],
  "Samples": [
    {
        "Name": "consul.fsm.coordinate.batch-update",
        "Count": 1,
        "Sum": 0.13156799972057343,
        "Min": 0.13156799972057343, "Max": 0.13156799972057343,
        "Mean": 0.13156799972057343, "Stddev": 0,
        "Labels": {}
    }
  ]
}

You are in luck: Consul has already split out the counters and gauges for you.2 The Samples also look like their Count and Sum could be used in a summary metric. Looking at all the Samples again, I have a suspicion that they are tracking latency. Digging through the documentation confirms that they are timers, which means a Prometheus summary (see “The Summary”). The timers are all in milliseconds, so you can convert them to seconds.3 While the JSON has a field for labels, none are used, so you can ignore it. Aside from that, the only other thing you need to do is ensure any invalid characters in the metric names are sanitised.

You now know the logic you need to apply to the metrics that Consul exposes, so you can write your exporter as in Example 12-2.

Example 12-2. consul_metrics.go, an exporter for Consul metrics written in Go
package main

import (
  "log"
  "net/http"
  "regexp"

  "github.com/hashicorp/consul/api"
  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
  up = prometheus.NewDesc(
    "consul_up",
    "Was talking to Consul successful.",
    nil, nil,
  )
  invalidChars = regexp.MustCompile("[^a-zA-Z0-9:_]")
)

type ConsulCollector struct {
}

// Implements prometheus.Collector.
func (c ConsulCollector) Describe(ch chan<- *prometheus.Desc) {
  ch <- up
}

// Implements prometheus.Collector.
func (c ConsulCollector) Collect(ch chan<- prometheus.Metric) {
  consul, err := api.NewClient(api.DefaultConfig())
  if err != nil {
    ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 0)
    return
  }

  metrics, err := consul.Agent().Metrics()
  if err != nil {
    ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 0)
    return
  }
  ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 1)

  for _, g := range metrics.Gauges {
    name := invalidChars.ReplaceAllLiteralString(g.Name, "_")
    desc := prometheus.NewDesc(name, "Consul metric "+g.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        desc, prometheus.GaugeValue, float64(g.Value))
  }

  for _, c := range metrics.Counters {
    name := invalidChars.ReplaceAllLiteralString(c.Name, "_")
    desc := prometheus.NewDesc(name+"_total", "Consul metric "+c.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        desc, prometheus.CounterValue, float64(c.Count))
  }

  for _, s := range metrics.Samples {
    // All samples are times in milliseconds; we convert them to seconds below.
    name := invalidChars.ReplaceAllLiteralString(s.Name, "_") + "_seconds"
    countDesc := prometheus.NewDesc(
        name+"_count", "Consul metric "+s.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        countDesc, prometheus.CounterValue, float64(s.Count))
    sumDesc := prometheus.NewDesc(
        name+"_sum", "Consul metric "+s.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        sumDesc, prometheus.CounterValue, s.Sum/1000)
  }
}

func main() {
  c := ConsulCollector{}
  prometheus.MustRegister(c)
  http.Handle("/metrics", promhttp.Handler())
  log.Fatal(http.ListenAndServe(":8000", nil))
}

If you have a working Go development environment you can run the exporter with:

go get -d -u github.com/hashicorp/consul/api
go get -d -u github.com/prometheus/client_golang/prometheus
go run consul_metrics.go

If you visit http://localhost:8000/metrics you will see metrics like:

# HELP consul_autopilot_failure_tolerance Consul metric
    consul.autopilot.failure_tolerance
# TYPE consul_autopilot_failure_tolerance gauge
consul_autopilot_failure_tolerance 0
# HELP consul_raft_apply_total Consul metric consul.raft.apply
# TYPE consul_raft_apply_total counter
consul_raft_apply_total 1
# HELP consul_fsm_coordinate_batch_update_seconds_count Consul metric
    consul.fsm.coordinate.batch-update
# TYPE consul_fsm_coordinate_batch_update_seconds_count counter
consul_fsm_coordinate_batch_update_seconds_count 1
# HELP consul_fsm_coordinate_batch_update_seconds_sum Consul metric
    consul.fsm.coordinate.batch-update
# TYPE consul_fsm_coordinate_batch_update_seconds_sum counter
consul_fsm_coordinate_batch_update_seconds_sum 0.00013156799972057343

That’s all well and good, but how does the code work? I’ll walk you through it in the next section.

Custom Collectors

With direct instrumentation the client library takes in instrumentation events and tracks the values of the metrics over time. Client libraries provide the counter, gauge, summary, and histogram metrics for this, which are all examples of collectors. At scrape time each collector in a registry is collected, which is to say, asked for its metrics. These metrics will then be returned by the scrape of /metrics. Counters and the other three standard metric types only ever return one metric family.

If, rather than using direct instrumentation, you want to provide metrics from some other source, you use a custom collector, which is any collector that is not one of the standard four. Custom collectors can return any number of metric families. Collection happens on every scrape of /metrics, and each collection is a consistent snapshot of the metrics from a collector.

In Go your collectors must implement the prometheus.Collector interface. That is to say, your collector must be an object with Describe and Collect methods that have the correct signatures.
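
For reference, the interface as defined in the prometheus package is just these two methods:

type Collector interface {
  Describe(chan<- *Desc)
  Collect(chan<- Metric)
}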

The Describe method returns a description of the metrics it will produce, in particular the metric name, label names, and help string. The Describe method is called at registration time, and is used to avoid duplicate metric registration. There are two types of metrics an exporter can have: those whose names and labels are known in advance, and those that are only determined at scrape time. In this example, consul_up is known in advance, so you can create its Desc once with NewDesc and provide it via Describe. All the other metrics are generated dynamically at scrape time, so they cannot be included:

var (
  up = prometheus.NewDesc(
    "consul_up",
    "Was talking to Consul successful.",
    nil, nil,
  )
)
// Implements prometheus.Collector.
func (c ConsulCollector) Describe(ch chan<- *prometheus.Desc) {
  ch <- up
}
Tip

The Go client requires that at least one Desc is provided by Describe. If all your metrics are dynamic, you can provide a dummy Desc to work around this.
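
For example, a minimal sketch of such a dummy Desc might look like the following; MyCollector and the metric name are hypothetical:

type MyCollector struct{}

var dummyDesc = prometheus.NewDesc(
  "dummy", "A dummy metric to satisfy Describe.", nil, nil,
)

// Implements prometheus.Collector.
func (c MyCollector) Describe(ch chan<- *prometheus.Desc) {
  // No metrics are known in advance, so only the dummy Desc is sent.
  ch <- dummyDesc
}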

At the core of a custom collector is the Collect method. In this method you fetch all the data you need from the application instance you are working with, munge it as needed, and then send the metrics back to the client library. Here you need to connect to Consul and then fetch its metrics. If an error occurs, consul_up is returned as 0; otherwise, once you know that the collection is going to be successful, it is returned as 1. A metric that is only sometimes returned is difficult4 to deal with in PromQL; having consul_up allows you to alert on issues talking to Consul, so you’ll know that something is awry.

To return consul_up, prometheus.MustNewConstMetric is used to provide a sample for just this scrape. It takes its Desc, type, and value:

// Implements prometheus.Collector.
func (c ConsulCollector) Collect(ch chan<- prometheus.Metric) {
  consul, err := api.NewClient(api.DefaultConfig())
  if err != nil {
    ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 0)
    return
  }

  metrics, err := consul.Agent().Metrics()
  if err != nil {
    ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 0)
    return
  }
  ch <- prometheus.MustNewConstMetric(up, prometheus.GaugeValue, 1)

There are three possible value types: GaugeValue, CounterValue, and UntypedValue. Gauge and Counter you already know; Untyped is for cases where you are not sure whether a metric is a counter or a gauge. Untyped is not possible with direct instrumentation, but it is not unusual for the type of metrics from other monitoring and instrumentation systems to be unclear and impractical to determine.
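
For example, inside a Collect method you might emit an untyped sample like this; the metric name and value are made up:

// A metric whose type could not be determined from the source system.
desc := prometheus.NewDesc(
    "some_unclear_metric", "A metric of unknown type.", nil, nil)
ch <- prometheus.MustNewConstMetric(desc, prometheus.UntypedValue, 42)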

Now that you have the metrics from Consul, you can process the gauges. Invalid characters in the metric name, such as dots and hyphens, are converted to underscores. A Desc is created on the fly, and immediately used in a MustNewConstMetric:

  for _, g := range metrics.Gauges {
    name := invalidChars.ReplaceAllLiteralString(g.Name, "_")
    desc := prometheus.NewDesc(name, "Consul metric "+g.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        desc, prometheus.GaugeValue, float64(g.Value))
  }

Processing of counters is similar, except that a _total suffix is added to the metric name:

  for _, c := range metrics.Counters {
    name := invalidChars.ReplaceAllLiteralString(c.Name, "_")
    desc := prometheus.NewDesc(name+"_total", "Consul metric "+c.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        desc, prometheus.CounterValue, float64(c.Count))
  }

The contents of metrics.Samples are more complicated. While the samples map to a Prometheus summary, the Go client does not currently support creating one with MustNewConstMetric. Instead, you can emulate it using two counters. _seconds is appended to the metric name, and the sum is divided by a thousand to convert from milliseconds to seconds:

  for _, s := range metrics.Samples {
    // All samples are times in milliseconds; we convert them to seconds below.
    name := invalidChars.ReplaceAllLiteralString(s.Name, "_") + "_seconds"
    countDesc := prometheus.NewDesc(
        name+"_count", "Consul metric "+s.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        countDesc, prometheus.CounterValue, float64(s.Count))
    sumDesc := prometheus.NewDesc(
        name+"_sum", "Consul metric "+s.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        sumDesc, prometheus.CounterValue, s.Sum/1000)
  }
Warning

s.Sum here is a float64, but you must be careful when doing division with integers to ensure you don’t unnecessarily lose precision. If sum were an integer, float64(sum)/1000 would convert to floating point first and then divide, which is what you want. On the other hand, float64(sum/1000) would perform integer division first, discarding the fractional part before the conversion to floating point.
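
A quick illustration of the difference:

package main

import "fmt"

func main() {
  sum := int64(12345)
  fmt.Println(float64(sum) / 1000) // 12.345: converts first, then divides.
  fmt.Println(float64(sum / 1000)) // 12: integer division truncates the remainder.
}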

Finally, the custom collector object is instantiated and registered with the default registry, in the same way you would one of the direct instrumentation metrics:

  c := ConsulCollector{}
  prometheus.MustRegister(c)

Exposition is performed in the usual way, which you already saw in “Go”:

  http.Handle("/metrics", promhttp.Handler())
  log.Fatal(http.ListenAndServe(":8000", nil))

This is, of course, a simplified example. In reality you would have some way to configure which Consul server to talk to, such as a command-line flag, rather than depending on the client’s default. You would also reuse the client between scrapes, and allow the various authentication options of the client to be specified.
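
A sketch of what that configuration might look like, reusing the client between scrapes; the flag name and the collector field are assumptions, and the standard library flag package would also need to be imported:

var consulServer = flag.String("consul.server", "localhost:8500",
  "Address of the Consul agent to talk to.")

type ConsulCollector struct {
  consul *api.Client
}

func main() {
  flag.Parse()
  config := api.DefaultConfig()
  config.Address = *consulServer
  consul, err := api.NewClient(config)
  if err != nil {
    log.Fatal(err)
  }
  prometheus.MustRegister(ConsulCollector{consul: consul})
  http.Handle("/metrics", promhttp.Handler())
  log.Fatal(http.ListenAndServe(":8000", nil))
}

Collect would then use the stored client rather than creating a new one on every scrape.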

Note

The min, max, mean, and stddev were discarded as they are not very useful. You can calculate a mean using the sum and count. min, max, and stddev, on the other hand, are not aggregatable and you don’t know over what time period they were measured.

As the default registry is being used, go_ and process_ metrics are included in the result. These provide you with information about the performance of the exporter itself, and are useful for detecting issues such as file descriptor leaks via process_open_fds. This saves you from having to scrape the exporter separately for these metrics.

The only time you might not use the default registry for an exporter is when writing a Blackbox/SNMP-style exporter, where some interpretation of URL parameters needs to be performed, as collectors have no access to the URL parameters of a scrape. In that case, you would also scrape the exporter’s own /metrics in order to monitor the exporter itself.

For comparison, the equivalent exporter written using Python 3 is shown in Example 12-3. It is largely the same as the one written in Go; the only notable difference is that a SummaryMetricFamily is available to represent a summary, instead of emulating one with two separate counters. The Python client does not have as many sanity checks as the Go client, so you need to be a little more careful with it.

Example 12-3. consul_metrics.py, an exporter for Consul metrics written in Python 3
import json
import re
import time
from urllib.request import urlopen

from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily
from prometheus_client.core import SummaryMetricFamily, REGISTRY
from prometheus_client import start_http_server


def sanitise_name(s):
    return re.sub(r"[^a-zA-Z0-9:_]", "_", s)

class ConsulCollector(object):
  def collect(self):
    out = urlopen("http://localhost:8500/v1/agent/metrics").read()
    metrics = json.loads(out.decode("utf-8"))

    for g in metrics["Gauges"]:
      yield GaugeMetricFamily(sanitise_name(g["Name"]),
          "Consul metric " + g["Name"], g["Value"])

    for c in metrics["Counters"]:
      yield CounterMetricFamily(sanitise_name(c["Name"]) + "_total",
          "Consul metric " + c["Name"], c["Count"])

    for s in metrics["Samples"]:
      yield SummaryMetricFamily(sanitise_name(s["Name"]) + "_seconds",
          "Consul metric " + s["Name"],
          count_value=s["Count"], sum_value=s["Sum"] / 1000)

if __name__ == '__main__':
  REGISTRY.register(ConsulCollector())
  start_http_server(8000)
  while True:
    time.sleep(1)

Labels

In the preceding example you only saw metrics without labels. To provide labels you need to specify the label names in Desc and then the values in MustNewConstMetric.

To expose a metric with the time series example_gauge{foo="bar", baz="small"} and example_gauge{foo="quu", baz="far"}, you could do the following with the Go client library:

func (c MyCollector) Collect(ch chan<- prometheus.Metric) {
  desc := prometheus.NewDesc(
    "example_gauge",
    "A help string.",
    []string{"foo", "baz"}, nil,
  )
  ch <- prometheus.MustNewConstMetric(
    desc, prometheus.GaugeValue, 1, "bar", "small")
  ch <- prometheus.MustNewConstMetric(
    desc, prometheus.GaugeValue, 2, "quu", "far")
}

Each time series is provided individually, and the registry takes care of combining all the time series belonging to the same metric family in the /metrics output.

Warning

The help strings of all metrics with the same name must be identical. Providing differing Descs will cause the scrape to fail.

The Python client works a little differently: you assemble the metric family and then return it. While that may sound like more work, it usually comes out to the same amount of effort in practice:

class MyCollector(object):
  def collect(self):
    mf = GaugeMetricFamily("example_gauge", "A help string.",
        labels=["foo", "baz"])
    mf.add_metric(["bar", "small"], 1)
    mf.add_metric(["quu", "far"], 2)
    yield mf

Guidelines

While direct instrumentation tends to be reasonably black and white, writing exporters tends to be murky and involve engineering tradeoffs. Do you want to spend a lot of ongoing effort to produce perfect metrics, or do something that’s good enough and requires no maintenance? Writing exporters is more of an art than a science.

You should try to follow the metric naming practices, in particular avoiding the _count, _sum, _total, _bucket, and _info suffixes unless the time series belongs to a metric type that is meant to have them.

It is often not possible or practical to determine whether a bunch of metrics are gauges, counters, or a mix of the two. In cases where there is a mix you should mark them as untyped rather than using gauge or counter, which would be incorrect. If a metric is a counter, don’t forget to add the _total suffix.

Where practical you should try to provide units for your metrics, and at the very least try to ensure that the units are in the metric name. Having to determine what the units are from metrics as in Example 12-1 is not fun for anyone, so you should try to remove this burden from the users of your exporter. Seconds and bytes are always preferred.

In terms of using labels in exporters, there are a few gotchas to look out for. As with direct instrumentation, cardinality is a concern for exporters, for the reasons discussed in “Cardinality”. Metrics with high churn in their labels should be avoided.

Labels should create a partition across a metric, and if you take a sum or average across a metric it should be meaningful, as discussed in “When to Use Labels”. In particular, you should look out for any time series that are just totals of all the other values in a metric, and remove them. If you are ever unsure as to whether a label makes sense when writing an exporter then it is safest not to use one, though keep in mind “Table Exception”. As with direct instrumentation, you should not apply a label such as env="prod" to all metrics coming from your exporter, as that is what target labels are for, as discussed in “Target Labels”.

It is best to expose raw metrics to Prometheus, rather than doing calculations on the application side. For example, there is no need to expose a 5-minute rate when you have a counter, as you can use the rate function to calculate a rate over any period you like. Similarly with ratios, drop them in favour of the numerator and denominator. If you have a percentage without its constituent numerator and denominator, at the least convert it to a ratio.5
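
For example, inside a Collect method you might expose the raw counts and leave the division to PromQL; the metric names, and where hits and requests come from, are hypothetical:

// hits and requests would be parsed from the application's own metrics.
hitsDesc := prometheus.NewDesc(
    "myapp_cache_hits_total", "Cache hits.", nil, nil)
requestsDesc := prometheus.NewDesc(
    "myapp_cache_requests_total", "Cache requests.", nil, nil)
ch <- prometheus.MustNewConstMetric(
    hitsDesc, prometheus.CounterValue, hits)
ch <- prometheus.MustNewConstMetric(
    requestsDesc, prometheus.CounterValue, requests)

A PromQL expression such as rate(myapp_cache_hits_total[5m]) / rate(myapp_cache_requests_total[5m]) then gives you the hit ratio over any period you like.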

Beyond multiplication and division to standardise units, you should avoid math in exporters, as processing raw data in PromQL is preferred. Race conditions between metrics instrumentation events can lead to artifacts, particularly when you subtract one metric from another. Addition of metrics for the purposes of reducing cardinality can be okay, but if they’re counters, make sure there will not be spurious resets due to some of them disappearing.

Some metrics are not particularly useful given how Prometheus is intended to be used. Many applications expose metrics such as machine RAM, CPU, and disk. You should not expose machine-level metrics in your exporter, as that is the responsibility of the Node exporter.6 Minimums, maximums, and standard deviations cannot be sanely aggregated so should also be dropped.

You should plan on running one exporter per application instance,7 and fetch metrics synchronously for each scrape without any caching. This keeps the responsibilities of service discovery and scrape scheduling with Prometheus. Be aware that concurrent scrapes can happen.
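
If your collector does share state between scrapes, a mutex is a simple way to stay safe; this sketch assumes a reused Consul client as discussed earlier, and needs the sync package imported:

type ConsulCollector struct {
  mu     sync.Mutex
  consul *api.Client
}

func (c *ConsulCollector) Collect(ch chan<- prometheus.Metric) {
  // Serialise collections so concurrent scrapes don't race on shared state.
  c.mu.Lock()
  defer c.mu.Unlock()
  // Fetch and emit metrics as before.
}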

Just as Prometheus adds a scrape_duration_seconds metric when performing a scrape, you may also add a myexporter_scrape_duration_seconds metric for how long it takes your exporter to pull the data from its application. This helps in performance debugging, as you can see whether it’s the application or your exporter that is getting slow. Additional metrics such as the number of metrics processed can also be helpful.
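
A sketch of what that might look like for the Consul exporter; the metric name follows the myexporter_ convention from the text but is otherwise an assumption, and the time package would need to be imported:

var scrapeDuration = prometheus.NewDesc(
  "consul_exporter_scrape_duration_seconds",
  "Time taken to collect metrics from Consul.",
  nil, nil,
)

func (c ConsulCollector) Collect(ch chan<- prometheus.Metric) {
  start := time.Now()
  // Fetch and emit the Consul metrics as before.
  ch <- prometheus.MustNewConstMetric(
    scrapeDuration, prometheus.GaugeValue, time.Since(start).Seconds())
}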

It can make sense for you to add direct instrumentation to exporters, in addition to the custom collectors that provide their core functionality. For example, the CloudWatch exporter has a cloudwatch_requests_total counter tracking the number of API calls it makes, as each API call costs money. But this is usually only something that you will see with Blackbox/SNMP-style exporters.

Now that you know how to get metrics out of both your applications and third-party code, in the next chapter I’ll start covering PromQL, which allows you to work with these metrics.

1 These metrics will likely be in the 0.4.0 version of the Consul exporter.

2 Just because something is called a counter does not mean it is a counter. For example, Dropwizard has counters that can go down, so depending on how the counter is used in practice it may be a counter, gauge, or untyped in Prometheus terms.

3 If only some of the Samples were timers, you would have to choose between exposing them as-is or maintaining a list of which metrics are latencies and which weren’t.

4 See “or operator”.

5 And check that it is actually a ratio/percentage; it’s not unknown for metrics to confuse the two.

6 Or WMI exporter for Windows users.

7 Unless writing a Blackbox/SNMP-style exporter, which is rare.
