Chapter 8. Service Discovery

Thus far you’ve had Prometheus find what to scrape using static configuration via static_configs. This is fine for simple use cases,1 but having to manually keep your prometheus.yml up to date as machines are added and removed would get annoying, particularly if you were in a dynamic environment where new instances might be brought up every minute. This chapter will show you how you can let Prometheus know what to scrape.

You already know where all of your machines and services are, and how they are laid out. Service discovery (SD) enables you to provide that information to Prometheus from whichever database you store it in. Prometheus supports many common sources of service information, such as Consul, Amazon’s EC2, and Kubernetes out of the box. If your particular source isn’t already supported, you can use the file-based service discovery mechanism to hook it in. This could be by having your configuration management system, such as Ansible or Chef, write the list of machines and services they know about in the right format, or a script running regularly to pull it from whatever data source you use.

Knowing what your monitoring targets are, and thus what should be scraped, is only the first step. Labels are a key part of Prometheus (see Chapter 5), and assigning target labels to targets allows them to be grouped and organised in ways that make sense to you. Target labels allow you to aggregate targets performing the same role, that are in the same environment, or are run by the same team.

As target labels are configured in Prometheus rather than in the applications and exporters themselves, this allows your different teams to have label hierarchies that make sense to them. Your infrastructure team might care only about which rack and PDU2 a machine is on, while your database team would care that it is the PostgreSQL master for their production environment. If you had a kernel developer who was investigating a rarely occurring problem, they might just care which kernel version was in use.

Service discovery and the pull model allow all these views of the world to coexist, as each of your teams can run their own Prometheus with the target labels that make sense to them.

Service Discovery Mechanisms

Service discovery is designed to integrate with the machine and service databases that you already have. Out of the box, Prometheus 2.2.1 has support for Azure, Consul, DNS, EC2, GCE, OpenStack, File, Kubernetes, Marathon, Nerve, Serverset, and Triton service discovery in addition to the static discovery you have already seen.

Service discovery isn’t just about you providing a list of machines to Prometheus, or monitoring. It is a more general concern that you will see across your systems; applications need to find their dependencies to talk to, and hardware technicians need to know which machines are safe to turn off and repair. Accordingly, you should not only have a raw list of machines and services, but also conventions around how they are organised and their lifecycles.

A good service discovery mechanism will provide you with metadata. This may be the name of a service, its description, which team owns it, structured tags about it, or anything else that you may find useful. Metadata is what you will convert into target labels, and generally the more metadata you have, the better.

A full discussion of service discovery is beyond the scope of this book. If you haven’t gotten around to formalising your configuration management and service databases yet, Consul tends to be a good place to start.

Static

You have already seen static configuration in Chapter 2, where targets are provided directly in the prometheus.yml. It is useful if you have a small and simple setup that rarely changes. This might be your home network, a scrape config that is only for a local Pushgateway, or even Prometheus scraping itself as in Example 8-1.

Example 8-1. Using static service discovery to have Prometheus scrape itself
scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
      - localhost:9090

If you are using a configuration management tool such as Ansible, you could have its templating system write out a list of all the machines it knows about to have their Node exporters scraped, such as in Example 8-2.

Example 8-2. Using Ansible’s templating to create targets for the Node exporter on all machines
scrape_configs:
 - job_name: node
   static_configs:
    - targets:
{% for host in groups["all"] %}
      - {{ host }}:9100
{% endfor %}

In addition to providing a list of targets, a static config can also provide labels for those targets in the labels field. If you find yourself needing this, then file SD, covered in “File”, tends to be a better approach.

The plural in static_configs indicates that it is a list, and you can specify multiple static configs in one scrape config, as shown in Example 8-3. While there is not much point to doing this for static configs, it can be useful with other service discovery mechanisms if you want to talk to multiple data sources. You can even mix and match service discovery mechanisms within a scrape config, though that is unlikely to result in a particularly understandable configuration.

Example 8-3. Two monitoring targets are provided, each in its own static config
scrape_configs:
 - job_name: node
   static_configs:
    - targets:
       - host1:9100
    - targets:
       - host2:9100

The same applies to scrape_configs, a list of scrape configs in which you can specify as many as you like. The only restriction is that the job_name must be unique.

File

File service discovery, usually referred to as file SD, does not use the network. Instead, it reads monitoring targets from files you provide on the local filesystem. This allows you to integrate with service discovery systems that Prometheus doesn’t support out of the box, or to handle cases where Prometheus can’t quite do what you need with the metadata available.

You can provide files in either JSON or YAML formats. The file extension must be .json for JSON, and either .yml or .yaml for YAML. You can see a JSON example in Example 8-4, which you would put in a file called filesd.json. You can have as many or as few targets as you like in a single file.

Example 8-4. filesd.json with three targets
[
  {
    "targets": [ "host1:9100", "host2:9100" ],
    "labels": {
      "team": "infra",
      "job": "node"
    }
  },
  {
    "targets": [ "host1:9090" ],
    "labels": {
      "team": "monitoring",
      "job": "prometheus"
    }
  }
]

JSON

The JSON format is not perfect. One issue you will likely encounter here is that the last item in a list or hash cannot have a trailing comma. I would recommend using a JSON library to generate JSON files rather than trying to do it by hand.
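As a minimal sketch of that advice, the following Python uses the standard json module to produce the content of Example 8-4. The build_file_sd helper and the shape of its argument are made up for illustration; only the output format comes from file SD.

```python
import json


def build_file_sd(groups):
    """Render target groups as file SD JSON.

    groups is a list of (targets, labels) pairs; this shape is only an
    assumption made for the sketch.
    """
    payload = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    # json.dumps takes care of quoting and never emits a trailing comma.
    return json.dumps(payload, indent=2)


file_sd_json = build_file_sd([
    (["host1:9100", "host2:9100"], {"team": "infra", "job": "node"}),
    (["host1:9090"], {"team": "monitoring", "job": "prometheus"}),
])
print(file_sd_json)
```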

Configuration in Prometheus uses file_sd_configs in your scrape config as shown in Example 8-5. Each file SD config takes a list of filepaths, and you can use globs in the filename.3 Paths are relative to Prometheus’s working directory, which is to say the directory you start Prometheus in.

Example 8-5. prometheus.yml using file service discovery
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'

Usually you would not provide metadata for use with relabelling when using file SD, but rather the ultimate target labels you would like to have.

If you visit http://localhost:9090/service-discovery in your browser4 and click on show more, you will see Figure 8-1, with both job and team labels from filesd.json.5 As these are made up targets, the scrapes will fail, unless you actually happen to have a host1 and host2 on your network.

Figure 8-1. Service discovery status page showing three discovered targets from file SD

Providing the targets with a file means it could come from templating in a configuration management system, a daemon that writes it out regularly, or even from a web service via a cronjob using wget. Changes are picked up automatically using inotify, so it would be wise to ensure file changes are made atomically using rename, similarly to how you did in “Textfile Collector”.
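One way to make such an atomic update is sketched below in Python. The write_atomically name is hypothetical, and the sketch assumes the writer and Prometheus share a local filesystem.

```python
import os
import tempfile


def write_atomically(path, content):
    """Write content to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file is created beside the destination so the rename
    # stays within one filesystem. Its .tmp suffix also keeps it from
    # matching a glob such as '*.json'.
    # Note: mkstemp creates files readable only by their owner, so you may
    # need to adjust permissions if Prometheus runs as a different user.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A cronjob or daemon would call write_atomically("filesd.json", new_content) each time it regenerates the target list.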

Consul

Consul service discovery is a service discovery mechanism that uses the network, as almost all mechanisms do. If you do not already have a service discovery system within your organisation, Consul is one of the easier ones to get up and running. Consul has an agent that runs on each of your machines, and these gossip amongst themselves. Applications talk only to the local agent on a machine. Some number of agents are also servers, providing persistence and consistency.

To try it out, you can set up a development Consul agent by following Example 8-6. If you wish to use Consul in production, you should follow the official Getting Started guide.

Example 8-6. Setting up a Consul agent in development mode
hostname $ wget https://releases.hashicorp.com/consul/1.0.2/consul_1.0.2_linux_amd64.zip
hostname $ unzip consul_1.0.2_linux_amd64.zip
hostname $ ./consul agent -dev

The Consul UI should now be available in your browser on http://localhost:8500/. Consul has a notion of services, and in the development setup has a single service, which is Consul itself. Next, run a Prometheus with the configuration in Example 8-7.

Example 8-7. prometheus.yml using Consul service discovery
scrape_configs:
 - job_name: consul
   consul_sd_configs:
    - server: 'localhost:8500'

Go to http://localhost:9090/service-discovery in your browser and you will see Figure 8-2, showing that the Consul service discovery has discovered a single target with some metadata, which became a target with instance and job labels. If you had more agents and services, they would also show up here.

Figure 8-2. Service discovery status page showing one discovered target, its metadata, and target labels from Consul

Consul does not expose a /metrics, so the scrapes from your Prometheus will fail. But it does still provide enough to find all your machines running a Consul agent, which should therefore also be running a Node exporter that you can scrape. I will look at how in “Relabelling”.

Tip

If you want to monitor Consul itself, there is a Consul exporter.

EC2

Amazon’s Elastic Compute Cloud, more commonly known as EC2, is a popular provider of virtual machines. It is one of several cloud providers that Prometheus allows you to use out of the box for service discovery.

To use it you must provide Prometheus with credentials to use the EC2 API. One way you can do this is by setting up an IAM user with the AmazonEC2ReadOnlyAccess policy6 and providing the access key and secret key in the configuration file, as shown in Example 8-8.

Example 8-8. prometheus.yml using EC2 service discovery
scrape_configs:
 - job_name: ec2
   ec2_sd_configs:
    - region: <region>
      access_key: <access key>
      secret_key: <secret key>

If you aren’t already running some, start at least one EC2 instance in the EC2 region you have configured Prometheus to look at. If you go to http://localhost:9090/service-discovery in your browser, you can see the discovered targets and the metadata extracted from EC2. __meta_ec2_tag_Name="My Display Name", for example, is the Name tag on this instance, which is the name you will see in the EC2 Console (Figure 8-3).

You may notice that the instance label is using the private IP. This is a sensible default as it is presumed that Prometheus will be running beside what it is monitoring. Not all EC2 instances have public IPs, and there are network charges for talking to an EC2 instance’s public IP.

Figure 8-3. Service discovery status page showing one discovered target, its metadata, and target labels from EC2

You will find that service discovery for other cloud providers is broadly similar, but the configuration required and metadata returned vary.

Relabelling

As seen in the preceding examples of service discovery mechanisms, the targets and their metadata can be a little raw. You could integrate with file SD and provide Prometheus with exactly the targets and labels you want, but in most cases you won’t need to. Instead, you can tell Prometheus how to map from metadata to targets using relabelling.

Tip

Many characters, such as periods and asterisks, are not valid in Prometheus label names, so will be sanitised to underscore in service discovery metadata.

In an ideal world you will have service discovery and relabelling configured so that new machines and applications are picked up and monitored automatically. In the real world it is not unlikely that as your setup matures it will get sufficiently intricate that you have to regularly update the Prometheus configuration file, but by then you will likely also have the infrastructure where that is only a minor hurdle.

Choosing What to Scrape

The first thing you will want to configure is which targets you actually want to scrape. If you are part of one team running one service, you don’t want your Prometheus to be scraping every target in the same EC2 region.

Continuing on from Example 8-5, what if you just wanted to monitor the infrastructure team’s machines? You can do this with the keep relabel action, as shown in Example 8-9. The regex is applied to the values of the labels listed in source_labels (joined by a semicolon), and if the regex matches, the target is kept. As there is only one action here, this results in all targets with team="infra" being kept.

But for a target with a team="monitoring" label, the regex will not match, and the target will be dropped.

Note

Regular expressions in relabelling are fully anchored, meaning that the pattern infra will not match fooinfra or infrabar.

Example 8-9. Using a keep relabel action to only monitor targets with a team="infra" label
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: infra
      action: keep
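To make the anchoring concrete, here is a rough Python model of a single keep action. It is a sketch rather than how Prometheus is implemented: Prometheus uses RE2 regular expressions, and Python's re.fullmatch merely stands in for the full anchoring. The keeps helper is a made-up name.

```python
import re


def keeps(labels, source_labels, regex):
    """Return True if a keep action with this regex would keep the target."""
    # Missing labels count as empty strings; values are joined with ';'.
    value = ";".join(labels.get(name, "") for name in source_labels)
    # fullmatch models the anchoring: the whole value must match.
    return re.fullmatch(regex, value) is not None


print(keeps({"team": "infra"}, ["team"], "infra"))       # True
print(keeps({"team": "monitoring"}, ["team"], "infra"))  # False
print(keeps({"team": "infrabar"}, ["team"], "infra"))    # False
```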

You can have multiple relabel actions in a relabel_configs; all of them will be processed in order unless either a keep or drop action drops the target. For example, Example 8-10 will drop all targets, as a label cannot have both infra and monitoring as a value.

Example 8-10. Two relabel actions requiring contradictory values for the team label
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: infra
      action: keep
    - source_labels: [team]
      regex: monitoring
      action: keep

To allow multiple values for a label you would use | (the pipe symbol) for the alternation operator, which is a fancy way of saying one or the other. Example 8-11 shows the right way to keep only targets for either the infrastructure or monitoring teams.

Example 8-11. Using | to allow one label value or another
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: infra|monitoring
      action: keep

In addition to the keep action that drops targets that do not match, you can also use the drop action to drop targets that do match. You can also provide multiple labels in source_labels; their values will be joined with a semicolon.7 If you don’t want to scrape the Prometheus jobs of the monitoring team, you can combine these as in Example 8-12.

Example 8-12. Using multiple source labels
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [job, team]
      regex: prometheus;monitoring
      action: drop
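The ordering and joining behaviour can be modelled in a few lines of Python. This is only an illustrative sketch with a hypothetical survives helper, using Python's re module in place of the RE2 engine that Prometheus uses. It replays the rules of Example 8-10 and Example 8-12 against the two label sets from filesd.json.

```python
import re


def survives(labels, rules):
    """Apply keep/drop rules in order; False as soon as the target is dropped."""
    for rule in rules:
        # Source label values are joined with a semicolon before matching.
        value = ";".join(labels.get(n, "") for n in rule["source_labels"])
        matched = re.fullmatch(rule["regex"], value) is not None
        if rule["action"] == "keep" and not matched:
            return False
        if rule["action"] == "drop" and matched:
            return False
    return True


targets = [
    {"job": "node", "team": "infra"},
    {"job": "prometheus", "team": "monitoring"},
]

contradictory = [  # as in Example 8-10
    {"source_labels": ["team"], "regex": "infra", "action": "keep"},
    {"source_labels": ["team"], "regex": "monitoring", "action": "keep"},
]
drop_monitoring_prometheus = [  # as in Example 8-12
    {"source_labels": ["job", "team"], "regex": "prometheus;monitoring",
     "action": "drop"},
]

print([t["team"] for t in targets if survives(t, contradictory)])
# []
print([t["team"] for t in targets if survives(t, drop_monitoring_prometheus)])
# ['infra']
```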

How you use relabelling is up to you. You should define some conventions. For example, EC2 instances should have a team tag with the name of the team that owns it, or all production services should have a production tag in Consul. Without conventions every new service will require special handling for monitoring, which is probably not the best use of your time.

If your service discovery mechanism includes health checking of some form, do not use this to drop unhealthy instances. Even when an instance is reporting as unhealthy it could be producing useful metrics, particularly around startup and shutdown.

Note

Prometheus needs to have a target for each of your individual application instances. Scraping through load balancers will not work, as you can hit a different instance on each scrape, which could, for example, make counters appear to go backwards.

Target Labels

Target labels are labels that are added to the labels of every time series returned from a scrape. They are the identity of your targets,8 and accordingly they should not generally vary over time as might be the case with version numbers or machine owners.

Every time your target labels change, the labels of the scraped time series change too, and with them the identities of those time series. This will cause discontinuities in your graphs, and can cause issues with rules and alerts.

So what does make a good target label? You have already seen job and instance, target labels that all targets have. It is also common to add target labels for the broader scope of the application, such as whether it is in development or production, its region and datacenter, and which team manages it. Labels for structure within your application can also make sense, for example, if there is sharding.

Target labels ultimately allow you to select, group, and aggregate targets in PromQL. For example, you might want alerts for development to be handled differently to production, to know which shard of your application is the most loaded, or which team is using the most CPU time.

But target labels come with a cost. While it is quite cheap to add one more label in terms of resources, the real cost comes when you are writing PromQL. Every additional label is one more you need to keep in mind for every single PromQL expression you write. For example, if you were to add a host label which was unique per target, that would violate the expectation that only instance is unique per target, which could break all of your aggregation that used without(instance). This is discussed further in Chapter 14.

As a rule of thumb your target labels should be a hierarchy, with each one adding additional distinctiveness. For example, you might have a hierarchy where regions contain datacenters that contain environments that contain services that contain jobs that contain instances. This isn’t a hard and fast rule; you might plan ahead a little and have a datacenter label even if you only have one datacenter today.9

For labels that the application knows about but that don’t make sense as target labels, such as version numbers, you can expose them using info metrics as discussed in “Info”.

If you find that you want every target in a Prometheus to share some labels such as region, you should instead use external_labels for them as discussed in “External Labels”.

Replace

So how do you use relabelling to specify your target labels? The answer is the replace action. The replace action allows you to copy labels around, while also applying regular expressions.

Continuing on from Example 8-5, let’s say that the monitoring team was renamed to the monitor team and you can’t change the file SD input yet so you want to use relabelling instead. Example 8-13 looks for a team label that matches the regular expression monitoring (which is to say, the exact string monitoring), and if it finds it, puts the replacement value monitor in the team label.

Example 8-13. Using a replace relabel action to replace team="monitoring" with team="monitor"
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: monitoring
      replacement: monitor
      target_label: team
      action: replace

That’s fairly simple, but in practice having to specify replacement label values one by one would be a lot of work for you. Let’s say it turns out that the problem was the ing in monitoring, and you wanted relabelling to strip any trailing “ings” in team names. Example 8-14 does this by applying the regular expression (.*)ing, which matches all strings that end with ing and puts the start of the label value in the first capture group. The replacement value consists of that first capture group, which will be placed in the team label.

Example 8-14. Using a replace relabel action to remove a trailing ing from the team label
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: [team]
      regex: '(.*)ing'
      replacement: '${1}'
      target_label: team
      action: replace

If one of your targets does not have a label value that matches, such as team="infra", then the replace action has no effect on that target, as you can see in Figure 8-4.

Figure 8-4. The ing is removed from monitoring, while the infra targets are unaffected
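If it helps to see the mechanics, the replace action can be modelled roughly in Python as below. This is a sketch under assumptions: the replace function is a made-up name, Python's re module stands in for RE2, and Python spells the first capture group \g<1> where Prometheus uses ${1}.

```python
import re


def replace(labels, source_labels, regex, replacement, target_label):
    """Model of the replace action; returns a new label dict."""
    value = ";".join(labels.get(n, "") for n in source_labels)
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: the target is left untouched
    result = dict(labels)
    new_value = match.expand(replacement)
    if new_value:
        result[target_label] = new_value
    else:
        # A label with an empty value is the same as not having it.
        result.pop(target_label, None)
    return result


strip_ing = dict(source_labels=["team"], regex="(.*)ing",
                 replacement=r"\g<1>", target_label="team")

print(replace({"team": "monitoring"}, **strip_ing))  # {'team': 'monitor'}
print(replace({"team": "infra"}, **strip_ing))       # {'team': 'infra'}
```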

A label with an empty value is the same as not having that label, so if you wanted to you could remove the team label using Example 8-15.

Example 8-15. Using a replace relabel action to remove the team label
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
      - '*.json'
   relabel_configs:
    - source_labels: []
      regex: '(.*)'
      replacement: '${1}'
      target_label: team
      action: replace

Note

All labels beginning with __ are discarded at the end of relabelling for target labels, so you don’t need to do this yourself.

Since performing a regular expression against the whole string, capturing it, and using it as the replacement is common, these are all defaults. Thus you can omit them,10 and Example 8-16 will have the same effect as Example 8-15.

Example 8-16. Using the defaults to remove the team label succinctly
scrape_configs:
 - job_name: file
   file_sd_configs:
    - files:
       - '*.json'
   relabel_configs:
    - source_labels: []
      target_label: team

Now that you have more of a sense of how the replace action works, let’s look at a more realistic example. Example 8-7 produced a target with port 80, but it’d be useful if you could change that to port 9100 where the Node exporter is running. In Example 8-17 I take the address from Consul and append :9100 to it, placing it in the __address__ label.

Example 8-17. Using the IP from Consul with port 9100 for the Node exporter
scrape_configs:
 - job_name: node
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_address]
      regex: '(.*)'
      replacement: '${1}:9100'
      target_label: __address__

Tip

If relabelling produces two identical targets from one of your scrape configs, they will be deduplicated automatically. So if you have many Consul services running on each machine, only one target per machine would result from Example 8-17.

job, instance, and __address__

In the preceding examples you may have noticed that there was an instance target label, but no matching instance label in the metadata. So where did it come from? The answer is that if your target has no instance label, it is defaulted to the value of the __address__ label.

instance along with job are two labels your targets will always have, job being defaulted from the job_name configuration option. The job label indicates a set of instances that serve the same purpose, and will generally all be running with the same binary and configuration.11 The instance label identifies one instance within a job.
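These defaults can be summarised with a small Python model. It is a simplification for illustration only: finalize_labels is a hypothetical helper, the address is made up, and in Prometheus itself the job label is set before relabelling runs, while the instance default and the discarding of __ labels happen afterwards.

```python
def finalize_labels(labels, job_name):
    """Model of the defaults that turn relabelled labels into target labels."""
    result = dict(labels)
    # job defaults from the scrape config's job_name.
    result.setdefault("job", job_name)
    # instance defaults to the address Prometheus will connect to.
    if not result.get("instance"):
        result["instance"] = result["__address__"]
    # Labels beginning with __ are discarded from the target labels.
    return {k: v for k, v in result.items() if not k.startswith("__")}


target_labels = finalize_labels(
    {"__address__": "192.168.1.5:9100", "__meta_consul_node": "host1"},
    job_name="node",
)
print(target_labels)
```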

The __address__ is the host and port your Prometheus will connect to when scraping. While it provides a default for the instance label, it is separate so you can have a different value for it. For example, you may wish to use the Consul node name in the instance label, while leaving the address pointing to the IP address, as in Example 8-18. This is a better approach than adding an additional host, node, or alias label with a nicer name, as it avoids adding a second label unique to each target, which would cause complications in your PromQL.

Example 8-18. Using the IP from Consul with port 9100 as the address, with the node name in the instance label
scrape_configs:
 - job_name: consul
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_address]
      regex: '(.*)'
      replacement: '${1}:9100'
      target_label: __address__
    - source_labels: [__meta_consul_node]
      regex: '(.*)'
      replacement: '${1}:9100'
      target_label: instance

Tip

Prometheus will perform DNS resolution on the __address__, so one way you can have more readable instance labels is by providing host:port rather than ip:port.

Labelmap

The labelmap action is different from the drop, keep, and replace actions you have already seen in that it applies to label names rather than label values.

Where you might find this useful is if the service discovery you are using already has a form of key-value labels, and you would like to use some of those as target labels. This might be to allow configuration of arbitrary target labels, without having to change your Prometheus configuration every time there is a new label.

EC2’s tags, for example, are key-value pairs. You might have an existing convention to have the name of the service go in the service tag and its semantics align with what the job label means in Prometheus. You might also declare a convention that any tags prefixed with monitor_ will become target labels. For example, an EC2 tag of monitor_foo=bar would become a Prometheus target label of foo="bar". Example 8-19 shows this setup, using a replace action for the job label and a labelmap action for the monitor_ prefix.

Example 8-19. Use the EC2 service tag as the job label, with all tags prefixed with monitor_ as additional target labels
scrape_configs:
 - job_name: ec2
   ec2_sd_configs:
    - region: <region>
      access_key: <access key>
      secret_key: <secret key>
   relabel_configs:
    - source_labels: [__meta_ec2_tag_service]
      target_label: job
    - regex: __meta_ec2_tag_monitor_(.*)
      replacement: '${1}'
      action: labelmap
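A rough Python model of labelmap follows, assuming EC2 tags arrive as __meta_ec2_tag_<key> labels, the same form used for the service tag. The labelmap function and the sample tag values are invented for the sketch, and Python's \g<1> plays the role of ${1}.

```python
import re


def labelmap(labels, regex, replacement):
    """Model of labelmap: copy values to names derived from matching names."""
    result = dict(labels)
    for name, value in labels.items():
        # The regex is matched against label names, not label values.
        match = re.fullmatch(regex, name)
        if match is not None:
            result[match.expand(replacement)] = value
    return result


discovered = {
    "__meta_ec2_tag_service": "api",
    "__meta_ec2_tag_monitor_foo": "bar",
    "__meta_ec2_tag_costcenter": "42",
}
mapped = labelmap(discovered, r"__meta_ec2_tag_monitor_(.*)", r"\g<1>")
print(mapped["foo"])           # bar
print("costcenter" in mapped)  # False
```

Only the namespaced tag becomes a new label; the unrelated costcenter tag is left as metadata.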

But you should be wary of blindly copying all labels in a scenario like this, as it is unlikely that Prometheus is the only consumer of metadata such as this within your overall architecture. For example, a new cost center tag might be added to all of your EC2 instances for internal billing reasons. If that tag automatically became a target label due to a labelmap action, that would change all of your target labels and likely break graphing and alerting. Thus, using either well-known names (such as the service tag here) or clearly namespaced names (such as monitor_) is wise.

Lists

Not all service discovery mechanisms have key-value labels or tags; some just have a list of tags, with the canonical example being Consul’s tags. While Consul is the most likely place that you will run into this, there are various other places where a service discovery mechanism must somehow convert a list into key-value metadata such as the EC2 subnet ID.12

This is done by joining the items in the list with a comma and using the now-joined items as a label value. A comma is also put at the start and the end of the value, to make writing correct regular expressions easier.

As an example, say a Consul service had dublin and prod tags. The __meta_consul_tags label could have the value ,dublin,prod, or ,prod,dublin, as tags are unordered. If you wanted to only scrape production targets you would use a keep action as shown in Example 8-20.

Example 8-20. Keeping only Consul services with the prod tag
scrape_configs:
 - job_name: node
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex:  '.*,prod,.*'
      action: keep
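The value of the surrounding commas is easier to see with a quick experiment. The Python below is a sketch in which tags_label is a hypothetical helper reproducing the joining just described, and preprod is a made-up tag used as a near miss.

```python
import re


def tags_label(tags, separator=","):
    """Join a list of tags with a separator at the start and end as well."""
    return separator + separator.join(tags) + separator


for tags in (["dublin", "prod"], ["prod", "dublin"], ["preprod"]):
    value = tags_label(tags)
    print(value, re.fullmatch(r".*,prod,.*", value) is not None)
# ,dublin,prod, True
# ,prod,dublin, True
# ,preprod, False
```

Because every tag is delimited by commas on both sides, one pattern works wherever the tag appears in the list and does not match tags that merely contain prod.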

Sometimes you will have tags which are only the value of a key-value pair. You can convert such values to labels, but you need to know the potential values. Example 8-21 shows how a tag indicating the environment of a target can be converted into an env label.

Example 8-21. Using prod, staging, and dev tags to fill an env label
scrape_configs:
 - job_name: node
   consul_sd_configs:
    - server: 'localhost:8500'
   relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex:  '.*,(prod|staging|dev),.*'
      target_label: env

Tip

With sophisticated relabelling rules you may find yourself needing a temporary label to put a value in. The __tmp prefix is reserved for this purpose.

How to Scrape

You now have targets with their target labels and the __address__ to connect to. There are some additional things you may wish to configure, such as a path other than /metrics or client authentication.

Example 8-22 shows some of the more common options you can use. As these change over time, check the documentation for the most up-to-date settings.

Example 8-22. A scrape config showing several of the available options
scrape_configs:
 - job_name: example
   consul_sd_configs:
    - server: 'localhost:8500'
   scrape_timeout: 5s
   metrics_path: /admin/metrics
   params:
     foo: [bar]
   scheme: https
   tls_config:
     insecure_skip_verify: true
   basic_auth:
     username: brian
     password: hunter2

metrics_path is only the path of the URL, and if you tried to put /metrics?foo=bar, for example, it would get escaped to /metrics%3Ffoo=bar. Instead, any URL parameters should be placed in params, though you usually only need this for federation and the classes of exporters that include the SNMP and Blackbox exporters. It is not possible to add arbitrary headers, as that would make debugging more difficult. If you need flexibility beyond what is offered, you can always use a proxy server with proxy_url to tweak your scrape requests.
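The difference between a path and a query string can be illustrated with Python's urllib. This is only an analogy for what happens to metrics_path, not Prometheus code, and the safe="/=" argument is chosen purely to mirror the escaping described above.

```python
from urllib.parse import quote, urlencode

# Treating "?foo=bar" as part of the path escapes the question mark.
escaped_path = quote("/metrics?foo=bar", safe="/=")
print(escaped_path)  # /metrics%3Ffoo=bar

# Parameters given separately are encoded into a real query string.
params = {"foo": ["bar"]}
query = urlencode(params, doseq=True)
print("/admin/metrics?" + query)  # /admin/metrics?foo=bar
```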

scheme can be http or https, and with https you can provide additional options including the key_file and cert_file if you wish to use TLS client authentication. insecure_skip_verify allows you to disable validation of a scrape target’s TLS cert, which is not advisable security-wise.

Aside from TLS client authentication, HTTP Basic Authentication and HTTP Bearer Token Authentication are offered via basic_auth and bearer_token. The bearer token can also be read from a file, rather than from the configuration, using bearer_token_file. As the bearer tokens and basic auth passwords are expected to contain secrets, they will be masked on the status pages of Prometheus so that you don’t accidentally leak them.

In addition to overriding the scrape_timeout in a scrape config, you can also override the scrape_interval, but in general you should aim for a single scrape interval in a Prometheus for sanity.

Of these scrape config settings, the scheme, path, and URL parameters are available to you and can be overridden via relabelling, using the label names __scheme__, __metrics_path__, and __param_<name>. If there are multiple URL parameters with the same name, only the first is available. It is not possible to relabel other settings for reasons varying from sanity to security.
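As a mental model, you can think of the scrape URL as being assembled from these labels. The Python below is a simplified sketch with a hypothetical scrape_url helper and a made-up address; it ignores repeated parameters and everything else that a real scrape involves.

```python
from urllib.parse import urlencode, urlunsplit


def scrape_url(labels):
    """Assemble a scrape URL from a target's special labels (a model only)."""
    params = {
        name[len("__param_"):]: value
        for name, value in labels.items()
        if name.startswith("__param_")
    }
    return urlunsplit((
        labels.get("__scheme__", "http"),
        labels["__address__"],
        labels.get("__metrics_path__", "/metrics"),
        urlencode(params),
        "",  # no fragment
    ))


print(scrape_url({
    "__scheme__": "https",
    "__address__": "host1:9100",
    "__metrics_path__": "/admin/metrics",
    "__param_foo": "bar",
}))
# https://host1:9100/admin/metrics?foo=bar
```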

Service discovery metadata is not considered security sensitive13 and will be accessible to anyone with access to the Prometheus UI. As secrets can only be specified per scrape config, it is recommended that any credentials you use are made standard across your services.

metric_relabel_configs

In addition to relabelling being used for its original purpose of mapping service discovery metadata to target labels, relabelling has also been applied to other areas of Prometheus. One of those is metric relabelling: relabelling applied to the time series scraped from a target.

The keep, drop, replace, and labelmap actions you have already seen can all be used in metric_relabel_configs as there are no restrictions on which relabel actions can be used where.14

Tip

To help you remember which is which, relabel_configs occurs when figuring out what to scrape, while metric_relabel_configs happens after the scrape has occurred.

There are two cases where you might use metric relabelling: when dropping expensive metrics and when fixing bad metrics. While it is better to fix such problems at the source, it is always good to know that you have tactical options while the fix is in progress.

Metric relabelling gives you access to the time series after it is scraped but before it is written to storage. The keep and drop actions can be applied to the __name__ label (discussed in “Reserved Labels and __name__”) to select which time series you actually want to ingest. If, for example, you discovered that the http_request_size_bytes15 metric of Prometheus had excessive cardinality and was causing performance issues, you could drop it as shown in Example 8-23. The metric is still transferred over the network and parsed, but this approach can offer you some breathing room.

Example 8-23. Dropping an expensive metric using metric_relabel_configs
scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
       - localhost:9090
   metric_relabel_configs:
    - source_labels: [__name__]
      regex: http_request_size_bytes
      action: drop

The labels are also available to metric relabelling. As mentioned in “Cumulative Histograms”, you can drop certain buckets (but not +Inf) of histograms and still be able to calculate quantiles. Example 8-24 shows this with the prometheus_tsdb_compaction_duration_seconds histogram in Prometheus.

Example 8-24. Dropping histogram buckets to reduce cardinality
scrape_configs:
 - job_name: prometheus
   static_configs:
    - targets:
       - localhost:9090
   metric_relabel_configs:
    - source_labels: [__name__, le]
      regex: 'prometheus_tsdb_compaction_duration_seconds_bucket;(4|32|256)'
      action: drop

Note

metric_relabel_configs only applies to metrics that you scrape from the target. It does not apply to metrics like up, which are about the scrape itself, and which will have only the target labels.

You could also use metric_relabel_configs to rename metrics, rename labels, or even extract labels from metric names.
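These three uses could be sketched roughly as follows. All of the metric and label names here (old_requests_total, hostname, memory_pool_heap_bytes, and so on) are invented for illustration. Relabel regexes are fully anchored, so each rule only touches series whose name matches exactly.

```yaml
scrape_configs:
 - job_name: example_app           # hypothetical job name
   static_configs:
    - targets:
       - localhost:8000
   metric_relabel_configs:
    # Rename a metric by rewriting __name__.
    - source_labels: [__name__]
      regex: old_requests_total
      target_label: __name__
      replacement: app_requests_total
    # Rename a label: copy hostname to host, then blank out hostname,
    # which removes it.
    - source_labels: [hostname]
      target_label: host
    - target_label: hostname
      replacement: ''
    # Extract a label from the metric name: memory_pool_heap_bytes
    # becomes memory_pool_bytes{pool="heap"}.
    - source_labels: [__name__]
      regex: 'memory_pool_(.+)_bytes'
      target_label: pool
      replacement: '${1}'
    - source_labels: [__name__]
      regex: 'memory_pool_(.+)_bytes'
      target_label: __name__
      replacement: memory_pool_bytes
```

The order of the last two rules matters: the pool label has to be extracted before the name it is extracted from is rewritten.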

labeldrop and labelkeep

There are two further relabel actions that are unlikely ever to be required for target relabelling, but that can come up in metric relabelling. Sometimes exporters can be overly enthusiastic in the labels they apply, or confuse instrumentation labels with target labels and return what they think should be the target labels in a scrape. The replace action can only deal with labels whose names you know in advance, which sometimes isn’t the case.

This is where labeldrop and labelkeep come in. Similar to labelmap, they apply to label names rather than to label values. Instead of copying labels, labeldrop and labelkeep remove labels. Example 8-25 uses labeldrop to drop all labels with a given prefix.

Example 8-25. Dropping all scraped labels that begin with node_
scrape_configs:
 - job_name: misbehaving
   static_configs:
    - targets:
       - localhost:1234
   metric_relabel_configs:
    - regex: 'node_.*'
      action: labeldrop

When you have to use these actions, prefer using labeldrop where practical. With labelkeep you need to list every single label you want to keep, including __name__, le, and quantile.
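To show why labelkeep is the more awkward of the two, here is a sketch of what it might look like. The method and code labels stand in for whichever instrumentation labels you actually want. The job and instance target labels are listed as well, on the assumption that they are already attached to the series by the time metric relabelling runs.

```yaml
scrape_configs:
 - job_name: misbehaving
   static_configs:
    - targets:
       - localhost:1234
   metric_relabel_configs:
    # Every label not matched by this regex is removed, so the list has to
    # cover the metric name, histogram and summary labels, target labels,
    # and any instrumentation labels you want to keep (method and code are
    # hypothetical examples).
    - regex: '__name__|le|quantile|job|instance|method|code'
      action: labelkeep
```

Forgetting an entry in this list silently removes that label from every series, which is why labeldrop is usually the safer choice.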

Label Clashes and honor_labels

While labeldrop can be used when an exporter incorrectly presumes it knows what labels you want, there is a small set of exporters where the exporter does know the labels you want. For example, metrics in the Pushgateway should not have an instance label, as was mentioned in “Pushgateway”, so you need some way of not having the Pushgateway’s instance target label apply.

But first let’s look at what happens when there is a target label with the same name as an instrumentation label from a scrape. To avoid misbehaving applications interfering with your target label setup, it is the target label that wins. If you had a clash on the job label, for example, the instrumentation label would be renamed to exported_job.

If instead you want the instrumentation label to win and override the target label, you can set honor_labels: true in your scrape config. This is the one place in Prometheus where an empty label is not the same thing as a missing label. If a scraped metric explicitly has an instance="" label, and honor_labels: true is configured, the resultant time series will have no instance label. This technique is used by the Pushgateway.
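A minimal sketch of a Pushgateway scrape config using this setting follows. The target assumes the Pushgateway is running locally on port 9091.

```yaml
scrape_configs:
 - job_name: pushgateway
   # Labels on the pushed metrics, such as job and an explicitly empty
   # instance, take precedence over this target's own labels.
   honor_labels: true
   static_configs:
    - targets:
       - localhost:9091            # assumed Pushgateway address
```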

Aside from the Pushgateway, honor_labels can also come up when ingesting metrics from other monitoring systems if you do not follow the recommendation in Chapter 11 to run one exporter per application instance.

Tip

If you want more fine-grained control for handling clashing target and instrumentation labels, you can use metric_relabel_configs to adjust the labels before the metrics are added to the storage. Handling of label clashes and honor_labels is performed before metric_relabel_configs.
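As one possible sketch of such control, the following leaves honor_labels at its default so that target labels win in general, but lets the instrumentation label win for a single label. The job name, target, and env label are hypothetical. The example relies on the clash handling described above having already renamed the scraped env label to exported_env.

```yaml
scrape_configs:
 - job_name: mixed_labels          # hypothetical job name
   static_configs:
    - targets:
       - localhost:8000
      labels:
        env: prod                  # hypothetical target label
   metric_relabel_configs:
    # If the scrape carried its own env label, it is now exported_env.
    # Copy it over the target label, but only when it is non-empty.
    - source_labels: [exported_env]
      regex: '(.+)'
      target_label: env
    # Remove exported_env so that only env remains.
    - target_label: exported_env
      replacement: ''
```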

Now that you understand service discovery, you’re ready to look at monitoring containers and how service discovery can be used with Kubernetes.

1 My home Prometheus uses a hardcoded static configuration, for example, as I only have a handful of machines.

2 The Power Distribution Unit, part of the electrical system in a datacenter. PDUs usually feed a group of racks with electricity, and knowing the CPU load on each machine could be useful to ensure each PDU can provide the power required.

3 You cannot, however, put globs in the directory, so a/b/*.json is fine, a/*/file.json is not.

4 This endpoint was added in Prometheus 2.1.0. On older versions you can hover over the Labels on the Targets page to see the metadata.

5 job_name is only a default, which I’ll look at further in “Duplicate Jobs”. The other __ labels are special and will be covered in “How to Scrape”.

6 Only the EC2:DescribeInstances permission is needed, but policies are generally easier for you to set up initially.

7 You can override the character used to join with the separator field.

8 It is possible for two of your targets to have the same target labels, with other settings different, but this should be avoided because metrics such as up will clash.

9 On the other hand, don’t try to plan too far in advance. It’s not unusual that, as your architecture changes over the years, your target label hierarchy will need to change with it. Predicting exactly how it will change is usually impossible. Consider, for example, if you were moving from a traditional datacenter setup to a provider like EC2, which has availability zones.

10 You could also omit source_labels: []. I left it in here to make it clearer that the label was being removed.

11 A job could potentially be further divided into shards with another label.

12 An EC2 instance can have multiple network interfaces, each of which could be in different subnets.

13 Nor are the service discovery systems typically designed to hold secrets.

14 Which is not to say that all relabel actions make sense in all relabel contexts.

15 In Prometheus 2.3.0 this metric was changed to a histogram and renamed to prometheus_http_response_size_bytes.
