Top-down alerting by leveraging custom rules

In Chapter 4, IT Operational Analytics and Root Cause Analysis, we asked "what percentage of the data that you collect is being paid attention to?" Often, a realistic answer is likely <10% and maybe even <1%. This is because the traditional approach to making data proactive is to start from scratch and then build up threshold or rules-based alerts over time. This can be a daunting and tedious task that requires upfront knowledge (or at least a guess) as to what the expected behavior of each time series should be. Then, once the alerts have been configured, there can be an extended tuning process to balance alert sensitivity against annoying false positives. Additionally, there may be metrics whose unusual behavior could never be caught with a static threshold.

Combine this challenge with scale: with 10 metrics per server and 100 servers, there are 1,000 individual metrics. Creating individual alerts for each of these is impractical.

However, a single ML job could be created against this data in less than 1 minute. ML's self-learning on historical data, which also takes very little time, will minimize false positives by adapting to the natural characteristics of each time series independently. 
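To make the "single job" idea concrete, here is a minimal sketch of what one job configuration covering every server metric might look like. It assumes the data is indexed in a long format, with one document per reading; the field names value, metric_name, host, and @timestamp, the 15m bucket span, and the helper build_job_config are all hypothetical choices for illustration. The resulting dictionary is the kind of body that would be sent to the ML create job API.

```python
import json


def build_job_config(value_field="value",
                     metric_field="metric_name",
                     host_field="host",
                     bucket_span="15m"):
    """Build the body of one anomaly detection job.

    A single detector is split twice (by metric name and by host), so each
    (host, metric) pair is modeled as its own time series. All field names
    are assumptions about how the data is indexed.
    """
    return {
        "description": "All server metrics in one job (illustrative)",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                {
                    "function": "mean",
                    "field_name": value_field,
                    "by_field_name": metric_field,
                    "partition_field_name": host_field,
                }
            ],
            "influencers": [host_field, metric_field],
        },
        "data_description": {"time_field": "@timestamp"},
    }


job = build_job_config()
print(json.dumps(job, indent=2))
```

The point of the sketch is that the 1,000 series are covered by one detector definition rather than 1,000 alert definitions; the split fields do the fan-out.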

Once the ML job has created results, the user can easily inspect the anomalies that ML finds and judge their usefulness (or wait for downstream consumers of the anomalies to do so). If any of the anomalies are deemed not useful, custom rules (https://www.elastic.co/guide/en/elastic-stack-overview/6.4/ml-rules.html) allow the user to inject their own domain knowledge and customize how anomalies are determined (for example, never alert on a CPU anomaly if the value is still less than 90%).
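As a rough sketch of the CPU example, the following shows how such a rule might be attached to a detector. The rule layout (actions, conditions, applies_to, operator, value) follows the custom rules syntax of the 6.4 release as best I understand it, so check it against the linked documentation for your version. The field name cpu_utilization and the helper rule_skips_result are hypothetical; the helper only mimics the rule locally to show its intent and is not how ML evaluates rules internally.

```python
import json
import operator

# Skip the result when the actual value is below 90 (percent).
cpu_rule = {
    "actions": ["skip_result"],
    "conditions": [
        {"applies_to": "actual", "operator": "lt", "value": 90.0}
    ],
}

# Detector carrying the rule; "cpu_utilization" is an assumed field name.
detector = {
    "function": "high_mean",
    "field_name": "cpu_utilization",
    "custom_rules": [cpu_rule],
}

_OPERATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}


def rule_skips_result(rule, actual):
    """Local illustration only: would this rule suppress a result?

    Supports only conditions that apply to the actual value, and returns
    True when the rule's action is skip_result and every condition holds.
    """
    if "skip_result" not in rule["actions"]:
        return False
    for condition in rule["conditions"]:
        if condition["applies_to"] != "actual":
            raise ValueError("this sketch only handles 'actual' conditions")
        compare = _OPERATORS[condition["operator"]]
        if not compare(actual, condition["value"]):
            return False
    return True


print(json.dumps(detector, indent=2))
print(rule_skips_result(cpu_rule, 75.0))  # CPU at 75%: suppressed
print(rule_skips_result(cpu_rule, 95.0))  # CPU at 95%: still reported
```

With this rule in place, an unusual jump in CPU from 20% to 75% would no longer produce a result, while the same kind of jump to 95% still would.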

This top-down, rather than bottom-up, approach is much faster and provides more proactive coverage of the data.
