We went through a couple of strategies for vertically sharding Prometheus, but there's a problem we still haven't addressed: scaling requirements tied to a single job. Imagine that you have a job with tens of thousands of scrape targets inside one datacenter, and there isn't a logical way to split it any further. In this type of scenario, your best bet is to shard horizontally, spreading the same job across multiple Prometheus servers. The following diagram provides an example of this type of sharding:
To accomplish this, we must rely on the hashmod relabeling action. The hashmod action sets target_label to the modulus of a hash of the concatenated source_labels values; each Prometheus shard then uses a keep action to retain only the targets whose hash modulus matches its own shard number. We can see this configuration in action in our test environment in both shard01 and shard02, effectively sharding the node job. Let's go through the following configuration, which can be found at /etc/prometheus/prometheus.yml:
...
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['shard01:9100', 'shard02:9100', 'global:9100']
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2       # Because we're using 2 shards
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: 0         # Starts at 0, so this is the first
        action: keep
...
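To build an intuition for how targets end up on each shard, the hashing step can be sketched in Python. This is a minimal illustration, not Prometheus code: the helper name is ours, and it assumes (per the Prometheus source) that hashmod takes the MD5 digest of the joined source label values and interprets its last 8 bytes as a big-endian integer before applying the modulus. The shard each target lands on is deterministic but not obvious from the target name, which is why we don't print expected assignments here.

```python
import hashlib

def hashmod(value: str, modulus: int) -> int:
    """Illustrative sketch of Prometheus's hashmod relabel action:
    MD5 of the label value, last 8 bytes as a big-endian integer,
    reduced modulo `modulus`."""
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[8:], "big") % modulus

# The three targets from the example configuration, hashed across 2 shards.
# A shard configured with `regex: 0` keeps only targets where this returns 0.
targets = ["shard01:9100", "shard02:9100", "global:9100"]
for target in targets:
    print(f"{target} -> shard {hashmod(target, 2)}")
```

Because the hash only depends on the target's \_\_address\_\_, every shard computes the same assignment independently from the same static target list, and no coordination between the Prometheus servers is needed.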
In the following screenshot, we can see the /service-discovery endpoint from the shard01 and shard02 Prometheus instances side by side. The result of the hashmod action allowed us to split the node exporter job across both instances, as shown:
Few deployments reach the scale at which this type of sharding is needed, but it's good to know that Prometheus supports it out of the box.