Dead man's switch alerts

Imagine that you have a set of Prometheus instances using an Alertmanager cluster for alerting. For some reason, a network partition happens between these two services. Even though each Prometheus instance detects that it can no longer reach any of the Alertmanager instances, it has no means of sending a notification about it.

In this situation, no alerts about the issue will ever be sent:

Figure 11.13: Network partition between Prometheus and Alertmanager

The original concept of a dead man's switch refers to a mechanism that activates when it stops being triggered or pressed. This concept has been adopted in the software world in several ways; for our purpose, we can achieve it by creating an alert that is always firing, and therefore constantly sending out notifications, and then checking whether those notifications ever stop. This way, we exercise the full alerting path from Prometheus, through Alertmanager, to the notification provider, and ultimately to the recipient of the notifications, so that we can verify end-to-end connectivity and service availability. This, of course, goes against all we know about alert fatigue: we wouldn't want to be constantly receiving pages or emails about an always-firing alert, so something other than a person has to receive these notifications and watch for their absence. You can build your own custom service with watchdog timers for this, but then you'll be in a situation where you need to monitor that service as well. Ideally, you should leverage a third party so that you mitigate the risk of this service suffering from the same outage that is preventing notifications from going out.

For this, there's a service built around the dead man's switch type of alert, and it's curiously named Dead Man's Snitch (deadmanssnitch.com). This is a third-party provider, outside of your infrastructure, that receives your always-firing notification via email or webhook and, in turn, issues a page, Slack message, or webhook if that notification stops arriving for more than a configurable amount of time. This setup mitigates the problems we presented previously: even if the entire datacenter goes up in flames, you'll still be paged!
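To make the receiving side of this mechanism concrete, here is a minimal Python sketch of a watchdog timer that trips when heartbeats stop arriving. It is an illustration only and not how Dead Man's Snitch is implemented: the class name DeadMansSwitch, its methods heartbeat and is_tripped, and the timeout_seconds parameter are all hypothetical names chosen for this example. Timestamps are passed in explicitly so that the logic is easy to test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadMansSwitch:
    """Hypothetical watchdog: trips when no heartbeat arrives in time."""

    timeout_seconds: float
    # Timestamp of the most recent heartbeat; None until the first one arrives.
    last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record that the always-firing notification was received at `now`."""
        self.last_heartbeat = now

    def is_tripped(self, now: float) -> bool:
        """Return True if the last heartbeat is older than the timeout."""
        if self.last_heartbeat is None:
            # Assumption: the switch is not armed until the first heartbeat.
            return False
        return now - self.last_heartbeat > self.timeout_seconds


if __name__ == "__main__":
    import time

    switch = DeadMansSwitch(timeout_seconds=300)
    switch.heartbeat(time.monotonic())
    # A real service would run this check periodically and page on True.
    print(switch.is_tripped(time.monotonic()))
```

In this sketch, each notification from the always-firing alert would call heartbeat, and a periodic check of is_tripped would decide whether to page someone. Running that check outside your own infrastructure is what keeps it from being affected by the same outage.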

The full configuration guide for integrating Dead Man's Snitch with VictorOps and PagerDuty can be found at https://help.victorops.com/knowledge-base/victorops-dead-mans-snitch-integration/ and https://www.pagerduty.com/docs/guides/dead-mans-snitch-integration-guide/, respectively.