Local storage

Prometheus' out-of-the-box storage solution for time series data is simply local storage. It is simpler to understand and simpler to manage: the database lives in a single directory, which is easy to back up, restore, or destroy if needed. By avoiding clustering, Prometheus ensures sane behavior when facing network partitions; you don't want your monitoring system to fail when you need it the most. High availability is commonly achieved by simply running two Prometheus instances with the same configuration, each with its own database. This storage solution, however, does not cover all use cases, and has a few shortcomings:

  • It's not durable – in a container orchestration deployment, the collected data will disappear when the container is rescheduled (the old container's data is destroyed and the new one starts afresh) unless persistent volumes are used, while in a VM deployment, the data will only be as durable as the local disk.
  • It's not horizontally scalable – using local storage means that your dataset can only be as big as the disk space you can make available to the instance.
  • It wasn't designed for long-term retention, even though, with the right metric criteria and cardinality control, commodity storage will go a long way.
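On that last point, retention on local storage is bounded explicitly through command-line flags. The following is only a sketch with placeholder values, not recommendations (and note that the size-based retention flag was still experimental at the time of writing):

```shell
# Illustrative only: keep at most 15 days of data, and at most 50GB of
# blocks on disk, whichever limit is reached first.
prometheus \
  --storage.tsdb.path=/var/lib/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB
```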

These shortcomings are the result of trade-offs that are made to ensure that small and medium deployments (which are by far the most common use cases) work great while also making advanced and large-scale use cases possible. Alerting and dashboards, in the context of day-to-day operational monitoring or troubleshooting ongoing incidents, only require a couple of weeks of data at most.

Before going all out for a remote metric storage system for long-term retention, we might consider managing local storage through the use of TSDB admin API endpoints, that is, snapshot and delete_series. These endpoints help keep local storage under control. As we mentioned in Chapter 5, Running a Prometheus Server, the TSDB administration API is not available by default; Prometheus needs to be started with the --web.enable-admin-api flag so that the API is enabled.

In this chapter's test environment, you can try using these endpoints and evaluate what they aim to accomplish. By connecting to the prometheus instance, we can validate that the TSDB admin API has been enabled, and look up the local storage path by using the following command:

vagrant@prometheus:~$ systemctl cat prometheus
...
--storage.tsdb.path=/var/lib/prometheus/data
--web.enable-admin-api
...

Issuing an HTTP POST to the /api/v1/admin/tsdb/snapshot endpoint will trigger a new snapshot that will store the available blocks in a snapshots directory. Snapshots are made using hard links, which makes them very space-efficient for as long as Prometheus still keeps those blocks. The following instructions illustrate how everything is processed:

vagrant@prometheus:~$ curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
{"status":"success","data":{"name":"20190501T155805Z-55d3ca981623fa5b"}}

vagrant@prometheus:~$ ls /var/lib/prometheus/data/snapshots/
20190501T155805Z-55d3ca981623fa5b

You can then back up the snapshot directory, which can be used as the TSDB storage path for another Prometheus instance through --storage.tsdb.path when that historical data is required for querying. Note that --storage.tsdb.retention.time might need to be adjusted to cover the age of your data, as Prometheus might otherwise start deleting blocks that fall outside the retention period.
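As a sketch of that restore path, using the snapshot name from the previous output: the destination directory, the long retention time, and the alternate listen port here are all assumptions for illustration, not required values:

```shell
# Hypothetical values; adjust to your environment.
SNAPSHOT="20190501T155805Z-55d3ca981623fa5b"
SRC="/var/lib/prometheus/data/snapshots/${SNAPSHOT}"
DEST="/tmp/prometheus-restore"

# Copy the snapshot's blocks into a fresh data directory.
mkdir -p "${DEST}"
if [ -d "${SRC}" ]; then
  cp -r "${SRC}/." "${DEST}/"
fi

# Serve the historical data from a second instance; a generous
# retention time keeps old blocks from being deleted on startup.
if command -v prometheus >/dev/null; then
  prometheus \
    --storage.tsdb.path="${DEST}" \
    --storage.tsdb.retention.time=10y \
    --web.listen-address=:9091
fi
```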

This, of course, will not prevent the growth of the TSDB. To manage this aspect, we can employ the /api/v1/admin/tsdb/delete_series endpoint, which is useful for weekly or even daily maintenance. It operates by means of an HTTP POST request with a set of match selectors that mark all matching time series for deletion, optionally restricting the deletion to a given time window if a time range is also sent. The following table provides an overview of the URL parameters in question:

URL parameter         Description

match[]=<selector>    One or more match selectors, for example,
                      match[]={__name__=~"go_.*"} (deletes all metrics
                      whose name starts with go_)

start=<timestamp>     Start time for the deletion, in RFC 3339 or Unix
                      format (optional; defaults to the earliest
                      possible time)

end=<timestamp>       End time for the deletion, in RFC 3339 or Unix
                      format (optional; defaults to the latest possible
                      time)

After the POST request is performed, an HTTP 204 is returned. This will not immediately free up disk space, as it will have to wait until the next Prometheus compaction event. You can force this cleanup by requesting the clean_tombstones endpoint, as exemplified in the following instructions:

vagrant@prometheus:~$ curl -X POST -w "%{http_code}\n" --globoff 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~"go_.*"}'
204

vagrant@prometheus:~$ curl -X POST -w "%{http_code}\n" http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
204
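For finer-grained cleanup, the delete_series endpoint also accepts the start and end parameters from the table above. The following sketch only builds and prints the request URL so that nothing is deleted until you issue the curl line yourself; the host, selector, and time window are assumptions:

```shell
# Hypothetical values; adjust host, selector, and window to your setup.
PROM="http://localhost:9090"
SELECTOR='{__name__=~"go_.*"}'
START="2019-05-01T00:00:00Z"   # RFC 3339; Unix timestamps also work
END="2019-05-02T00:00:00Z"

# Compose the request URL with the deletion restricted to the window.
URL="${PROM}/api/v1/admin/tsdb/delete_series?match[]=${SELECTOR}&start=${START}&end=${END}"
echo "${URL}"

# Issue it with: curl -X POST --globoff "${URL}"
# (--globoff stops curl from interpreting the {} and [] in the selector)
```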

This knowledge should help you keep local storage under control and avoid stepping into complex and time-consuming alternatives when the concern is mostly around scalability.
