Sometimes you will be able neither to add direct instrumentation to an application nor to find an existing exporter that covers it. This leaves you with having to write an exporter yourself. The good news is that exporters are relatively easy to write. The hard part is figuring out what the metrics exposed by an application mean. Units are often unknown, and documentation, if it exists at all, can be vague. In this chapter you will learn how to write exporters.
I’m going to write a small exporter for Consul to demonstrate the process. While we already saw version 0.3.0 of the Consul exporter in “Consul”, that version is missing metrics from the newly added telemetry API.1
While you can write exporters in any programming language, the majority are written in Go, and that is the language I will use here. However, you will find a small number of exporters written in Python, and an even smaller number in Java.
If your Consul is not running, start it again following the instructions in Example 8-6. If you visit http://localhost:8500/v1/agent/metrics you will see the JSON output that you will be working with, which is similar to Example 12-1. Conveniently, Consul provides a Go library that you can use, so you don’t have to worry about parsing the JSON yourself.
{
  "Timestamp": "2018-01-31 14:42:10 +0000 UTC",
  "Gauges": [
    {
      "Name": "consul.autopilot.failure_tolerance",
      "Value": 0,
      "Labels": {}
    }
  ],
  "Points": [],
  "Counters": [
    {
      "Name": "consul.raft.apply",
      "Count": 1,
      "Sum": 1,
      "Min": 1,
      "Max": 1,
      "Mean": 1,
      "Stddev": 0,
      "Labels": {}
    }
  ],
  "Samples": [
    {
      "Name": "consul.fsm.coordinate.batch-update",
      "Count": 1,
      "Sum": 0.13156799972057343,
      "Min": 0.13156799972057343,
      "Max": 0.13156799972057343,
      "Mean": 0.13156799972057343,
      "Stddev": 0,
      "Labels": {}
    }
  ]
}
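If you ever need to consume this JSON without the Consul library, its structure maps naturally onto Go structs with encoding/json. The types below are a hypothetical sketch for illustration, not the actual types from the github.com/hashicorp/consul/api package:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical structs mirroring the shape of the /v1/agent/metrics
// JSON; the real Consul api package provides its own equivalents.
type gaugeValue struct {
	Name  string
	Value float64
}

type sampledValue struct {
	Name  string
	Count int
	Sum   float64
}

type agentMetrics struct {
	Gauges   []gaugeValue
	Counters []sampledValue
	Samples  []sampledValue
}

func parseAgentMetrics(data []byte) (agentMetrics, error) {
	var m agentMetrics
	err := json.Unmarshal(data, &m)
	return m, err
}

func main() {
	data := []byte(`{"Gauges":[{"Name":"consul.autopilot.failure_tolerance","Value":0}],
	  "Samples":[{"Name":"consul.fsm.coordinate.batch-update","Count":1,"Sum":0.13}]}`)
	m, err := parseAgentMetrics(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Gauges[0].Name, m.Samples[0].Sum)
}
```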
You are in luck that Consul has split out the counters and gauges for you.2 The Samples also look like you can use the Count and Sum in a summary metric.
Looking at all the Samples again, I have a suspicion that they are tracking latency. Digging through the documentation confirms that they are timers, which means a Prometheus summary (see "The Summary"). The timers are also all in milliseconds, so we can convert them to seconds.3 While the JSON has a field for labels, none are used, so you can ignore that. Aside from that, the only other thing you need to do is ensure any invalid characters in the metric names are sanitised.
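The sanitisation step can be sketched on its own with the standard regexp package; for example, consul.fsm.coordinate.batch-update becomes consul_fsm_coordinate_batch_update:

```go
package main

import (
	"fmt"
	"regexp"
)

// Any character that is not valid in a Prometheus metric name is
// replaced with an underscore.
var invalidChars = regexp.MustCompile("[^a-zA-Z0-9:_]")

func sanitiseName(name string) string {
	return invalidChars.ReplaceAllLiteralString(name, "_")
}

func main() {
	fmt.Println(sanitiseName("consul.fsm.coordinate.batch-update"))
	// consul_fsm_coordinate_batch_update
}
```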
You now know the logic you need to apply to the metrics that Consul exposes, so you can write your exporter as in Example 12-2.
package main

import (
    "log"
    "net/http"
    "regexp"

    "github.com/hashicorp/consul/api"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    up = prometheus.NewDesc(
        "consul_up",
        "Was talking to Consul successful.",
        nil,
        nil,
    )
    invalidChars = regexp.MustCompile("[^a-zA-Z0-9:_]")
)

type ConsulCollector struct {
}

// Implements prometheus.Collector.
func (c ConsulCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- up
}

// Implements prometheus.Collector.
func (c ConsulCollector) Collect(ch chan<- prometheus.Metric) {
    consul, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        ch <- prometheus.MustNewConstMetric(
            up, prometheus.GaugeValue, 0)
        return
    }
    metrics, err := consul.Agent().Metrics()
    if err != nil {
        ch <- prometheus.MustNewConstMetric(
            up, prometheus.GaugeValue, 0)
        return
    }
    ch <- prometheus.MustNewConstMetric(
        up, prometheus.GaugeValue, 1)

    for _, g := range metrics.Gauges {
        name := invalidChars.ReplaceAllLiteralString(g.Name, "_")
        desc := prometheus.NewDesc(name,
            "Consul metric "+g.Name, nil, nil)
        ch <- prometheus.MustNewConstMetric(
            desc, prometheus.GaugeValue, float64(g.Value))
    }

    for _, c := range metrics.Counters {
        name := invalidChars.ReplaceAllLiteralString(c.Name, "_")
        desc := prometheus.NewDesc(name+"_total",
            "Consul metric "+c.Name, nil, nil)
        ch <- prometheus.MustNewConstMetric(
            desc, prometheus.CounterValue, float64(c.Count))
    }

    for _, s := range metrics.Samples {
        // All samples are times in milliseconds, we convert them to seconds below.
        name := invalidChars.ReplaceAllLiteralString(s.Name, "_") + "_seconds"
        countDesc := prometheus.NewDesc(name+"_count",
            "Consul metric "+s.Name, nil, nil)
        ch <- prometheus.MustNewConstMetric(
            countDesc, prometheus.CounterValue, float64(s.Count))
        sumDesc := prometheus.NewDesc(name+"_sum",
            "Consul metric "+s.Name, nil, nil)
        ch <- prometheus.MustNewConstMetric(
            sumDesc, prometheus.CounterValue, s.Sum/1000)
    }
}

func main() {
    c := ConsulCollector{}
    prometheus.MustRegister(c)
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8000", nil))
}
If you have a working Go development environment you can run the exporter with:
go get -d -u github.com/hashicorp/consul/api
go get -d -u github.com/prometheus/client_golang/prometheus
go run consul_metrics.go
If you visit http://localhost:8000/metrics you will see metrics like:
# HELP consul_autopilot_failure_tolerance Consul metric consul.autopilot.failure_tolerance
# TYPE consul_autopilot_failure_tolerance gauge
consul_autopilot_failure_tolerance 0
# HELP consul_raft_apply_total Consul metric consul.raft.apply
# TYPE consul_raft_apply_total counter
consul_raft_apply_total 1
# HELP consul_fsm_coordinate_batch_update_seconds_count Consul metric consul.fsm.coordinate.batch-update
# TYPE consul_fsm_coordinate_batch_update_seconds_count counter
consul_fsm_coordinate_batch_update_seconds_count 1
# HELP consul_fsm_coordinate_batch_update_seconds_sum Consul metric consul.fsm.coordinate.batch-update
# TYPE consul_fsm_coordinate_batch_update_seconds_sum counter
consul_fsm_coordinate_batch_update_seconds_sum 1.3156799972057343e-01
That’s all well and good, but how does the code work? In the next section I’ll show you how.
With direct instrumentation the client library takes in instrumentation events and tracks the values of the metrics over time. Client libraries provide the counter, gauge, summary, and histogram metrics for this, which are all examples of collectors. At scrape time each collector in a registry is collected, which is to say, asked for its metrics. These metrics will then be returned by the scrape of /metrics. Counters and the other three standard metric types only ever return one metric family.
If rather than using direct instrumentation you want to provide metrics from some other source, you use a custom collector, which is any collector that is not one of the standard four. Custom collectors can return any number of metric families. Collection happens on every single scrape of /metrics, where each collection is a consistent snapshot of the metrics from a collector.
In Go your collectors must implement the prometheus.Collector interface. That is to say, the collectors must be objects with Describe and Collect methods with a specific signature.
The Describe method returns a description of the metrics it will produce, in particular the metric name, label names, and help string. The Describe method is called at registration time, and is used to avoid duplicate metric registration. There are two types of metrics an exporter can have: ones where it knows the names and labels in advance, and ones where they are only determined at scrape time. In this example, consul_up is known in advance, so you can create its Desc once with NewDesc and provide it via Describe. All the other metrics are generated dynamically at scrape time, so cannot be included:
var (
    up = prometheus.NewDesc(
        "consul_up",
        "Was talking to Consul successful.",
        nil,
        nil,
    )
)

// Implements prometheus.Collector.
func (c ConsulCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- up
}
The Go client requires that at least one Desc is provided by Describe. If all your metrics are dynamic, you can provide a dummy Desc to work around this.
At the core of a custom collector is the Collect method. In this method you fetch all the data you need from the application instance you are working with, munge it as needed, and then send the metrics back to the client library. Here you need to connect to Consul and then fetch its metrics. If an error occurs, consul_up is returned as 0; otherwise, once we know that the collection is going to be successful, it is returned as 1. A metric that is only returned sometimes is difficult4 to deal with in PromQL; having consul_up allows you to alert on issues talking to Consul so you'll know that something is awry. To return consul_up, prometheus.MustNewConstMetric is used to provide a sample for just this scrape. It takes its Desc, type, and value:
// Implements prometheus.Collector.
func (c ConsulCollector) Collect(ch chan<- prometheus.Metric) {
    consul, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        ch <- prometheus.MustNewConstMetric(
            up, prometheus.GaugeValue, 0)
        return
    }
    metrics, err := consul.Agent().Metrics()
    if err != nil {
        ch <- prometheus.MustNewConstMetric(
            up, prometheus.GaugeValue, 0)
        return
    }
    ch <- prometheus.MustNewConstMetric(
        up, prometheus.GaugeValue, 1)
There are three possible value types: GaugeValue, CounterValue, and UntypedValue. Gauge and Counter you already know; Untyped is for cases where you are not sure whether a metric is a counter or a gauge. Untyped is not possible with direct instrumentation, but it is not unusual for the type of metrics from other monitoring and instrumentation systems to be unclear and impractical to determine.
Now that you have the metrics from Consul, you can process the gauges. Invalid characters in the metric name, such as dots and hyphens, are converted to underscores. A Desc is created on the fly, and immediately used in a MustNewConstMetric:
for _, g := range metrics.Gauges {
    name := invalidChars.ReplaceAllLiteralString(g.Name, "_")
    desc := prometheus.NewDesc(name,
        "Consul metric "+g.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        desc, prometheus.GaugeValue, float64(g.Value))
}
Processing of counters is similar, except that a _total suffix is added to the metric name:
for _, c := range metrics.Counters {
    name := invalidChars.ReplaceAllLiteralString(c.Name, "_")
    desc := prometheus.NewDesc(name+"_total",
        "Consul metric "+c.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        desc, prometheus.CounterValue, float64(c.Count))
}
The contents of metrics.Samples are more complicated. While the samples are a Prometheus summary, the Go client does not currently support those for MustNewConstMetric. Instead, you can emulate one using two counters. _seconds is appended to the metric name, and the sum is divided by a thousand to convert from milliseconds to seconds:
for _, s := range metrics.Samples {
    // All samples are times in milliseconds, we convert them to seconds below.
    name := invalidChars.ReplaceAllLiteralString(s.Name, "_") + "_seconds"
    countDesc := prometheus.NewDesc(name+"_count",
        "Consul metric "+s.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        countDesc, prometheus.CounterValue, float64(s.Count))
    sumDesc := prometheus.NewDesc(name+"_sum",
        "Consul metric "+s.Name, nil, nil)
    ch <- prometheus.MustNewConstMetric(
        sumDesc, prometheus.CounterValue, s.Sum/1000)
}
s.Sum here is a float64, but you must be careful when doing division with integers to ensure you don't unnecessarily lose precision. If sum were an integer, float64(sum)/1000 would convert to floating point first and then divide, which is what you want. On the other hand, float64(sum/1000) would first divide the integer value by 1000, losing three digits of precision.
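The difference is easy to demonstrate with a hypothetical integer sum of 131567 milliseconds:

```go
package main

import "fmt"

func main() {
	var sum int64 = 131567 // a hypothetical sum, in milliseconds

	// Convert first, then divide: the fractional part survives.
	fmt.Println(float64(sum) / 1000) // 131.567

	// Divide first: integer division truncates before the conversion.
	fmt.Println(float64(sum / 1000)) // 131
}
```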
Finally, the custom collector object is instantiated and registered with the default registry, in the same way you would register one of the direct instrumentation metrics:
c := ConsulCollector{}
prometheus.MustRegister(c)
Exposition is performed in the usual way, which you already saw in “Go”:
http.Handle("/metrics", promhttp.Handler())
log.Fatal(http.ListenAndServe(":8000", nil))
This is, of course, a simplified example. In reality you would have some way to configure the Consul server to talk to, such as a command-line flag, rather than depending on the client’s default. You would also reuse the client between scrapes, and allow the various authentication options of the client to be specified.
The min, max, mean, and stddev were discarded as they are not very useful. You can calculate a mean using the sum and count. min, max, and stddev, on the other hand, are not aggregatable, and you don't know over what time period they were measured.
As the default registry is being used, go_ and process_ metrics are included in the result. These provide you with information about the performance of the exporter itself, and are useful to detect issues such as file descriptor leaks using process_open_fds. This saves you from having to scrape the exporter separately for these metrics.
The only time you might not use the default registry for an exporter is when writing a Blackbox/SNMP style exporter, where some interpretation of URL parameters needs to be performed as collectors have no access to URL parameters for a scrape. In that case, you would also scrape the /metrics of the exporter in order to monitor the exporter itself.
For comparison, the equivalent exporter written using Python 3 is shown in Example 12-3. This is largely the same as the one written in Go; the only notable difference is that a SummaryMetricFamily is available to represent a summary, instead of emulating it with two separate counters. The Python client does not have as many sanity checks as the Go client, so you need to be a little more careful with it.
import json
import re
import time
from urllib.request import urlopen

from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily
from prometheus_client.core import SummaryMetricFamily, REGISTRY
from prometheus_client import start_http_server


def sanitise_name(s):
    return re.sub(r"[^a-zA-Z0-9:_]", "_", s)


class ConsulCollector(object):
    def collect(self):
        out = urlopen("http://localhost:8500/v1/agent/metrics").read()
        metrics = json.loads(out.decode("utf-8"))

        for g in metrics["Gauges"]:
            yield GaugeMetricFamily(sanitise_name(g["Name"]),
                "Consul metric " + g["Name"], g["Value"])

        for c in metrics["Counters"]:
            yield CounterMetricFamily(sanitise_name(c["Name"]) + "_total",
                "Consul metric " + c["Name"], c["Count"])

        for s in metrics["Samples"]:
            yield SummaryMetricFamily(sanitise_name(s["Name"]) + "_seconds",
                "Consul metric " + s["Name"],
                count_value=s["Count"], sum_value=s["Sum"] / 1000)


if __name__ == '__main__':
    REGISTRY.register(ConsulCollector())
    start_http_server(8000)
    while True:
        time.sleep(1)
In the preceding example you only saw metrics without labels. To provide labels you need to specify the label names in Desc and then the values in MustNewConstMetric. To expose a metric with the time series example_gauge{foo="bar", baz="small"} and example_gauge{foo="quu", baz="far"} you could do:
func (c MyCollector) Collect(ch chan<- prometheus.Metric) {
    desc := prometheus.NewDesc(
        "example_gauge",
        "A help string.",
        []string{"foo", "baz"},
        nil,
    )
    ch <- prometheus.MustNewConstMetric(
        desc, prometheus.GaugeValue, 1, "bar", "small")
    ch <- prometheus.MustNewConstMetric(
        desc, prometheus.GaugeValue, 2, "quu", "far")
}
with the Go Prometheus client library. You provide each time series individually, and the registry takes care of combining all the time series belonging to the same metric family together in the /metrics output.
The help strings of all metrics with the same name must be identical. Providing differing Descs will cause the scrape to fail.
The Python client works a little differently; you assemble the metric family and then return it. While that may sound like more effort, it usually works out to be the same level of effort in practice:
class MyCollector(object):
    def collect(self):
        mf = GaugeMetricFamily(
            "example_gauge",
            "A help string.",
            labels=["foo", "baz"])
        mf.add_metric(["bar", "small"], 1)
        mf.add_metric(["quu", "far"], 2)
        yield mf
While direct instrumentation tends to be reasonably black and white, writing exporters tends to be murky and involve engineering tradeoffs. Do you want to spend a lot of ongoing effort to produce perfect metrics, or do something that’s good enough and requires no maintenance? Writing exporters is more of an art than a science.
You should try to follow the metric naming practices, in particular avoiding the _count, _sum, _total, _bucket, and _info suffixes unless the time series is part of a metric that is meant to contain such a time series.
It is often not possible or practical to determine whether a set of metrics are gauges, counters, or a mix of the two. In cases where there is a mix, you should mark them as untyped rather than using gauge or counter, which would be incorrect. If a metric is a counter, don't forget to add the _total suffix.
Where practical you should try to provide units for your metrics, and at the very least try to ensure that the units are in the metric name. Having to determine what the units are from metrics as in Example 12-1 is not fun for anyone, so you should try to remove this burden from the users of your exporter. Seconds and bytes are always preferred.
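Conversions like the millisecond handling earlier in this chapter are best kept in small helpers. The function names below are hypothetical, and the binary interpretation of kilobytes is an assumption you would confirm against the application's documentation:

```go
package main

import "fmt"

// Hypothetical unit-conversion helpers for an exporter. Prometheus
// convention prefers base units: seconds rather than milliseconds,
// and bytes rather than kilobytes.
func millisecondsToSeconds(ms float64) float64 {
	return ms / 1000
}

// Assumes binary kilobytes (1024 bytes); some applications mean 1000.
func kilobytesToBytes(kb float64) float64 {
	return kb * 1024
}

func main() {
	fmt.Println(millisecondsToSeconds(250)) // 0.25
	fmt.Println(kilobytesToBytes(4))        // 4096
}
```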
In terms of using labels in exporters there are a few gotchas to look out for. As with direct instrumentation, cardinality is also a concern for exporters for the same reasons that were discussed in “Cardinality”. Metrics with high churn in their labels should be avoided.
Labels should create a partition across a metric, and if you take a sum or average across a metric it should be meaningful, as discussed in "When to Use Labels". In particular, you should look out for any time series that are just totals of all the other values in a metric, and remove them. If you are ever unsure whether a label makes sense when writing an exporter, it is safest not to use one, though keep in mind "Table Exception". As with direct instrumentation, you should not apply a label such as env="prod" to all metrics coming from your exporter, as that is what target labels are for, as discussed in "Target Labels".
It is best to expose raw metrics to Prometheus, rather than doing calculations on the application side. For example, there is no need to expose a 5-minute rate when you have a counter, as you can use the rate function to calculate a rate over any period you like. Similarly with ratios: drop them in favour of the numerator and denominator. If you have a percentage without its constituent numerator and denominator, at the least convert it to a ratio.5
Beyond multiplication and division to standardise units, you should avoid math in exporters, as processing raw data in PromQL is preferred. Race conditions between metrics instrumentation events can lead to artifacts, particularly when you subtract one metric from another. Addition of metrics for the purposes of reducing cardinality can be okay, but if they’re counters, make sure there will not be spurious resets due to some of them disappearing.
Some metrics are not particularly useful given how Prometheus is intended to be used. Many applications expose metrics such as machine RAM, CPU, and disk. You should not expose machine-level metrics in your exporter, as that is the responsibility of the Node exporter.6 Minimums, maximums, and standard deviations cannot be sanely aggregated so should also be dropped.
You should plan on running one exporter per application instance,7 and fetch metrics synchronously for each scrape without any caching. This keeps the responsibilities of service discovery and scrape scheduling with Prometheus. Be aware that concurrent scrapes of your exporter can happen.
Just as Prometheus adds a scrape_duration_seconds metric when performing a scrape, you may also add a myexporter_scrape_duration_seconds metric for how long it takes your exporter to pull the data from its application. This helps in performance debugging, as you can see whether it's the application or your exporter that is getting slow. Additional metrics, such as the number of metrics processed, can also be helpful.
It can make sense for you to add direct instrumentation to exporters, in addition to the custom collectors that provide their core functionality. For example, the CloudWatch exporter has a cloudwatch_requests_total counter tracking the number of API calls it makes, as each API call costs money. But this is usually only something that you will see with Blackbox/SNMP-style exporters.
Now that you know how to get metrics out of both your applications and third-party code, in the next chapter I’ll start covering PromQL, which allows you to work with these metrics.
1 These metrics will likely be in the 0.4.0 version of the Consul exporter.
2 Just because something is called a counter does not mean it is a counter. For example, Dropwizard has counters that can go down, so depending on how the counter is used in practice it may be a counter, gauge, or untyped in Prometheus terms.
3 If only some of the Samples were timers, you would have to choose between exposing them as-is or maintaining a list of which metrics are latencies and which aren't.
4 See “or operator”.
5 And check that it is actually a ratio/percentage; it’s not unknown for metrics to confuse the two.
6 Or WMI exporter for Windows users.
7 Unless writing a Blackbox/SNMP-style exporter, which is rare.