Thanos ecosystem

Besides the Thanos querier and sidecar, which we covered previously, there are a few other components in the Thanos ecosystem. All of these components live in the same binary and are run by invoking different sub-commands, enumerated below (a brief invocation sketch follows the list):

  • query: Commonly known as querier, it's a daemon that's responsible for fanning out queries to configured StoreAPI endpoints and deduplicating the results
  • sidecar: A daemon that exposes a StoreAPI endpoint for accessing the data of a local Prometheus instance and ships that instance's TSDB blocks to object storage
  • store: A daemon which acts as a gateway for remote storage, exposing a StoreAPI endpoint
  • compact: Commonly known as compactor, this daemon is responsible for compacting blocks that are available in object storage and creating new downsampled time series
  • bucket: A command-line tool that can verify, recover, and inspect data stored in object storage
  • receive: Known as receiver, it's a daemon that accepts remote writes from Prometheus instances, exposing pushed data through a StoreAPI endpoint, and can ship blocks to object storage
  • rule: Commonly known as ruler, it's a daemon that evaluates Prometheus rules (recording and alerting) against remote StoreAPI endpoints, exposes its own StoreAPI to make evaluation results available for querying, ships results to object storage, and connects to an Alertmanager cluster to send alerts
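As a quick illustration of the single-binary model, the following sketch starts a querier and a sidecar as sub-commands of the same thanos binary. The addresses, paths, and endpoint names here are placeholder assumptions, not values from this chapter's test environment:

    # Start a querier that fans out to two StoreAPI endpoints
    # (the gRPC addresses are hypothetical placeholders):
    thanos query \
        --http-address 0.0.0.0:10902 \
        --store sidecar-a:10901 \
        --store store-gateway:10901

    # Start a sidecar next to a local Prometheus instance,
    # shipping its TSDB blocks to object storage:
    thanos sidecar \
        --prometheus.url http://localhost:9090 \
        --tsdb.path /var/lib/prometheus \
        --objstore.config-file /etc/thanos/objstore.yml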

You can find all the source code and installation files for Thanos at https://github.com/improbable-eng/thanos.

Together, these components address several challenges:

  • Global view: Querying every Prometheus instance from the same place, while aggregating and deduplicating the returned time series (see the querier sketch after this list).
  • Downsampling: Querying months or even years of data is a problem if samples come at full resolution; by automatically creating downsampled data, queries that span large time periods become feasible.
  • Rules: Enabling global alerting and recording rules that mix metrics from different Prometheus shards.
  • Long-term retention: By leveraging object storage, it delegates durability, reliability, and scalability concerns of storage to outside the monitoring stack.
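As an example of how the global view copes with replicated shards, the querier can be told which external label identifies replicas, so that series from replicas are merged at query time. A minimal sketch, assuming each Prometheus replica carries an external label literally named replica and that the store endpoints below exist:

    # Treat series that differ only in the "replica" external
    # label as duplicates and merge them in query results
    # (label name and endpoints are assumptions):
    thanos query \
        --query.replica-label replica \
        --store shard-a-replica-1:10901 \
        --store shard-a-replica-2:10901

Deduplication happens at query time only; the data stored by each replica is left untouched.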

While we'll be providing a glimpse of how Thanos tackles these challenges, our main focus will be on its long-term storage aspect.

Storage-wise, the Thanos project settled on object storage for long-term data. Most cloud providers offer this service, with the added benefit of backing it with service-level agreements (SLAs): the top providers typically advertise 99.999999999% (eleven nines) durability and 99.99% availability for object storage. If you run on-premises infrastructure, there are also options available, such as Swift (the OpenStack component that provides object storage APIs) or the MinIO project, which we use in this chapter's test environment. Most of these on-premises solutions share the same characteristic: they expose APIs modeled after the well-known AWS S3 API, since so many tools support it. Moreover, object storage from cloud providers is typically a very cost-effective solution.
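To make the object storage integration concrete, the following sketch shows the kind of YAML bucket configuration that Thanos components consume through their --objstore.config-file flag, pointed at an S3-compatible backend such as MinIO. The file path, endpoint, bucket name, and credentials are hypothetical placeholders, not the values used in this chapter's test environment:

    # /etc/thanos/objstore.yml -- bucket configuration for an
    # S3-compatible backend such as MinIO; every value below is
    # a hypothetical placeholder:
    type: S3
    config:
      bucket: thanos
      endpoint: minio:9000
      access_key: ACCESS_KEY_PLACEHOLDER
      secret_key: SECRET_KEY_PLACEHOLDER
      insecure: true

The same file can be handed to the sidecar, store, compact, rule, and receive sub-commands, which is what allows them all to read from and write to the same bucket.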

The following diagram provides a simple overview of the core components that are needed to achieve long-term retention of Prometheus time series data using Thanos:

Figure 14.2: High-level Thanos long-term storage architecture

As we can see in the previous diagram, only a subset of the Thanos components is needed to tackle this challenge. In the following topics, we'll go over each of these components and expand on its role in the overall design.
