After discussing deduplication and automated baselining in previous chapters, we will now move up the AIOps maturity ladder to anomaly detection, which provides a big leap toward proactive operations. This chapter explains anomaly detection and how it is useful for IT operations.
Anomaly Detection Overview
Anomaly detection is the process of identifying data points that are unusual or unexpected. Regular events about CPU, memory, swap, disk, and so on are normal for operations, but an "application down" or firewall event represents an unusual scenario. The goal of anomaly detection is to identify such unusual scenarios (what we call outliers) within a dataset that differ significantly from other observations. Although anomaly detection can be performed with all three types of machine learning algorithms, it is implemented extensively on unlabeled data by clustering it into groups. In IT operations, executing a service improvement plan is a regular exercise, and operations teams do not have a specific target to predict. Rather, they need to analyze a lot of data, observe similarities, and group similar items together to understand anomalies and formulate recommendations. Chapter 5 briefly discussed the clustering algorithms available in unsupervised machine learning for anomaly detection; K-means clustering is one of the simplest and most popular of these, and it is the one we will explore in this chapter.
K-Means Algorithms
Here, K represents the number of distinct clusters that are created around centroids. There are multiple ways to determine the centroids from a dataset, and this section uses the Elbow method in our implementation to determine the appropriate number of clusters. Three centroids are detected in the dataset and are marked as red dots in Figure 8-2. Clustering events in this way supports several IT operations use cases, including the following:
Root-cause analysis: Groups events that are related to an issue or outage to find how an issue starts unfolding
Performance analysis: Groups events related to a specific application or service to analyze its performance and related bottlenecks
Service improvement: Groups events that create noise and analyzes what monitoring configurations or baselines should be updated at the source to reduce noise and improve service quality
Capacity planning: Groups events that represent hot spots that need timely intervention and resolution by a capacity planning process
From an AIOps perspective, K-means is probably the simplest clustering algorithm that can detect anomalies in a wide range of scenarios. In this algorithm, first a centroid is chosen for every cluster, and then data points are assigned to a cluster based on their distance from the centroids. There are multiple methods to calculate distance, such as Minkowski, Manhattan, Euclidean, Hamming, etc.
Clustering is an iterative process that optimizes the positions of the centroids against which the distances are calculated. For example, the majority of organizations receive thousands of events related to performance or availability, but they rarely receive security alerts related to DDoS/DoS attacks. Not only the frequency but also the alert text will differ. In this case, you can apply K-means clustering to the event messages so that related messages are grouped over a time scale, leaving the security alerts aside as an anomaly.
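As a minimal sketch of this idea (the feature values and their meanings are illustrative assumptions, not data from a real event stream), K-means can separate a dense group of routine performance events from a rare, distant point:

import numpy as np
from sklearn.cluster import KMeans

# Illustrative event feature vectors, e.g., [encoded severity, metric value];
# the numbers are made up purely to show how a distant point lands in its own cluster.
events = np.array([
    [0.9, 70], [1.0, 72], [0.8, 68], [1.1, 75],   # routine CPU/memory events
    [0.9, 69], [1.0, 71],
    [9.5, 5],                                      # rare security-style event
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(events)
print(kmeans.labels_)           # the rare point falls into its own cluster
print(kmeans.cluster_centers_)  # centroids of the two groups

Note that scikit-learn's KMeans uses Euclidean distance; the other distance measures mentioned above require different implementations.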
Let’s consider an implementation scenario where there are multiple alerts in the system and you want to analyze alert messages to segregate frequently occurring events as well as anomalies and then try to determine the root cause of events. For this implementation, you will work with text data; hence, you have to use the subdomain of AI called natural language processing (NLP). You will be using NLP to analyze the text by tokenizing it and getting the relevance of tokens and their similarities in different events to create clusters.
Natural Language ToolKit (nltk): This is a suite of libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning of text to process human language data and use it for further analysis in statistical or ML systems.
sklearn: This is a machine learning library that provides the implementation of various algorithms such as K-means, SVM, random forests, gradient boosting, etc. This library allows you to use these algorithms on data rather than making a huge effort to manually implement them and then use them on data.
Gensim: This is an important library for topic modeling, which is a statistical model for text mining and for discovering abstract topics that are hidden in a collection of text. Topics can be defined as a repeating pattern of co-occurring terms in a text corpus. For example, switch, interface, router, and bandwidth are terms that will frequently occur together and hence can be grouped under the topic of network. This is not a rule-based approach that uses regular expressions or dictionary-based keyword searching techniques; it is an unsupervised ML algorithm to find related words and create clusters of text. A minimal environment sketch for these libraries follows this list.
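The examples in this chapter assume an environment along the following lines (package names as published on PyPI; exact versions are left to the reader):

# Shell: pip install nltk scikit-learn gensim pandas matplotlib
import nltk                                                   # tokenization, stopwords, English words corpus
import pandas as pd                                           # loading and analyzing the event file
from sklearn.feature_extraction.text import TfidfVectorizer  # vectorization
from sklearn.cluster import KMeans                            # clustering
import gensim                                                 # topic modeling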
Source: Contains the device hostname/IP
AlertTime: Provides the event occurrence time
AlertDescription: Provides the detailed event message
AlertClass: Provides the class of the event
AlertType: Provides the type of the event
AlertManager: Provides a tool that generates events
Sample Events from Input File
[Table: sample events from the input file, showing the Source, AlertTime, AlertDescription, AlertClass, AlertType, and AlertManager columns for a handful of events]
Next, start applying the NLP algorithm. First, we need to download the English words dictionary from nltk. You then need to add domain-specific words, such as app, HTTP, and CPU, as these may be missing from the English words dictionary. These domain-specific words are important for creating clusters efficiently.
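A sketch of this step might look like the following (the corpus identifiers are standard nltk names; the domain word list, the tokenize helper, and the df DataFrame with its AlertDescription column are assumptions based on the input file described earlier):

import nltk
from nltk.corpus import words, stopwords
from nltk.tokenize import word_tokenize

nltk.download('words')       # English words dictionary
nltk.download('stopwords')   # common stopwords to filter out
nltk.download('punkt')       # tokenizer models

# Extend the dictionary with domain-specific terms that matter for clustering
english_words = set(words.words())
english_words.update({'app', 'http', 'cpu', 'db', 'url', 'ping'})

stop_words = set(stopwords.words('english'))

def tokenize(message):
    tokens = word_tokenize(message.lower())
    # keep only known English or domain words and drop stopwords
    return [t for t in tokens if t in english_words and t not in stop_words]

# df is the DataFrame loaded from the input file; AlertDescription holds the event text
# df['tokens'] = df['AlertDescription'].apply(tokenize)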
For each event sentence, we have the relevant tokens for analysis. Now we need to map these text tokens from the vocabulary to a corresponding vector of real numbers so that we can apply statistical algorithms. This process is called vectorization.
TF: This considers how frequently a word appears in an event message. Because messages are not all the same length, a token in a long message may occur more often than a token in a shorter message.
IDF: This is based on the fact that less frequent tokens are more informative and important. So if a token appears frequently across multiple event messages, like high, it is less informative than a token that appears in only a few messages, like down.
Readers can get more details on TF-IDF at https://en.wikipedia.org/wiki/Tf–idf.
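With scikit-learn, this vectorization step could be sketched as follows, reusing the tokenize helper and the df DataFrame assumed earlier:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=True)
X = vectorizer.fit_transform(df['AlertDescription'])   # sparse matrix: events x tokens

print(X.shape)                                   # (number_of_events, vocabulary_size)
print(vectorizer.get_feature_names_out()[:10])   # sample of the learned vocabulary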
Now we have the final list of tokens on which we can apply the K-means algorithm.
To apply the K-means algorithm, first we need to determine what will be the ideal value of K. For this we will be using the Elbow method, which is a heuristic used in determining the number of clusters in a dataset.
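A minimal sketch of the Elbow method, assuming the TF-IDF matrix X from the previous step: fit K-means for a range of K values and plot the inertia (the within-cluster sum of squared distances); the point where the curve flattens into an "elbow" suggests a suitable K.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()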
Now you can see that the top keywords representing different issues have been separated into distinct clusters.
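One way to inspect the result, assuming the X matrix and vectorizer from earlier and a K value chosen from the elbow plot, is to print the highest-weighted tokens nearest to each centroid:

k = 3   # illustrative; use the value suggested by your elbow plot
model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)

terms = vectorizer.get_feature_names_out()
order = model.cluster_centers_.argsort()[:, ::-1]   # token indices sorted by weight per cluster
for i in range(k):
    top_terms = [terms[idx] for idx in order[i, :5]]
    print(f"Cluster {i}: {top_terms}")

# Attach the labels back to the events for further analysis
# df['cluster'] = model.labels_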
With the help of ML and NLP capabilities, the algorithm discovered and clustered useful information from 1,000+ events, and that was without writing any static rules or using any topology-related details.
Let’s also analyze these clusters together over a time scale to uncover more insights.
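A sketch of such a time-scale view with pandas, assuming the df DataFrame, its AlertTime column, and the fitted model from the earlier steps:

import pandas as pd
import matplotlib.pyplot as plt

df['AlertTime'] = pd.to_datetime(df['AlertTime'])
df['cluster'] = model.labels_

# Count events per cluster in hourly buckets and plot them on a shared time axis
counts = (df.set_index('AlertTime')
            .groupby('cluster')
            .resample('1H')
            .size()
            .unstack(level=0, fill_value=0))
counts.plot(figsize=(10, 4), title='Events per cluster over time')
plt.show()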
There is consistent noise from CPU utilization events, which are represented in cluster 0. This is feedback to tune the thresholds of the CPU monitoring parameter; alternatively, you can submit a recommendation to the capacity planning team to increase the CPU capacity.
The algorithm automatically created three clusters, one for each application, each primarily containing alerts related to that specific application only. These clusters can be analyzed for incident or problem management–related tasks. They provide a lot of visibility to the application teams and are immensely helpful in executing service improvement programs.
This anomaly detection use case also makes it extremely efficient to detect potentially critical events, such as security events, within the enormous pile of events and to initiate automated actions, such as invoking a kill switch to minimize the impact. You have seen how K-means clustering detected noise and can generate recommendations for the problem and capacity management processes.
As per best practices, security alerts should not be integrated with the operations event console. Security alerts are exclusively for the eyes of security experts and not for everyone. Ideally, there should be a separate SIEM console for the security operations team to monitor.
The number of clusters plays a crucial role in the K-means algorithm’s efficiency.
K-means clustering is affected by the presence of noise or a high number of outliers in the dataset; centroids get stretched to incorporate the outliers or noise.
A higher number of dimensions (parameters or variables) has a negative impact on the efficiency of the algorithm.
Along with anomaly detection, another common use case of AIOps is correlation and association, which will be discussed next.
Correlation and Association
Correlation is a statistical term that means identifying any relationship or association between two or more entities. From an AIOps perspective, correlation is used to determine any dependency or relation between entities and cluster them together for efficient analysis. We covered multiple regression algorithms that determine the relationship between multiple entities to establish correlation in Chapter 5.
However, establishing correlation at the event layer is a bit difficult, as multiple events from varied sources arrive at varying time intervals. A time-based sequence or pattern of events is quite rare at the event layer. One of the algorithms that works efficiently at the event layer to determine correlation and association is DBSCAN, which is particularly useful in scenarios where data points are close to each other (dense) and a considerable amount of noise or outliers is present.
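A minimal DBSCAN sketch with scikit-learn (the feature construction is an assumption; one common choice is to combine an event's arrival time with an encoded source or message feature so that temporally and logically close events form dense regions):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: [seconds since start of window, encoded source id]
features = np.array([
    [10, 1], [12, 1], [15, 2], [18, 2],    # burst of related alerts
    [600, 7],                              # isolated alert, likely noise
    [900, 1], [905, 2], [910, 3],          # another burst
])

X_scaled = StandardScaler().fit_transform(features)
db = DBSCAN(eps=0.75, min_samples=2).fit(X_scaled)
print(db.labels_)   # -1 marks noise/outliers; other labels are dense clusters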
Topology-Based Correlation
The accuracy of clustering is one of the most important KPIs of an event management system that drives the efficiency of the engine. To improve the accuracy of clustering and the determination of the root cause, you need to provide some topology context. But the CMDB and discovery are challenges in themselves, and having a 100 percent accurate CMDB with 100 percent discovery and modeling of infrastructure is a tough task.
Network Topology Correlation
This correlation uses the network connectivity diagram as the topology to perform correlation. For example, if a core switch connecting hundreds of servers and devices to its interfaces is down, then alerts from this switch are causal, and all other alerts from the underlying network devices and servers as well as applications running over them are considered impacted.
Whenever alerts are generated, the event management algorithm will use this data to determine relationships, perform correlation, and determine the causal and impacted CIs. This method depends on static information, and thus it has its limitations and challenges. Storing a high volume of topology data can severely impact the performance of the event management tool.
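A hedged sketch of the idea: given a stored parent-child topology map and the set of currently alerting devices, everything downstream of a failed parent can be marked as impacted while the parent's own alert is treated as causal (the device names and the topology dictionary are illustrative assumptions):

# Illustrative topology: parent device -> devices directly connected below it
topology = {
    "core-switch-1": ["access-switch-1", "access-switch-2"],
    "access-switch-1": ["server-a", "server-b"],
    "access-switch-2": ["server-c"],
}

def downstream(device, topo):
    """Collect every device reachable below the given device."""
    result = set()
    stack = [device]
    while stack:
        node = stack.pop()
        for child in topo.get(node, []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result

active_alerts = {"core-switch-1", "server-a", "server-c"}

impacted = downstream("core-switch-1", topology) & active_alerts
causal = active_alerts - impacted
print("Causal:", causal)       # {'core-switch-1'}
print("Impacted:", impacted)   # {'server-a', 'server-c'}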
From an AIOps perspective, it is recommended to offload this correlation to the network monitoring layer, which has a native capability to scan subnet ranges, discover devices, and configure them in availability monitoring using ICMP or SNMP. Network monitoring tools have real-time visibility into the availability of all CIs and their topology and can dynamically update the topology based on changes in the network. A network monitoring tool should perform this correlation using the detected topology, filtering out impacted CI alerts and forwarding causal alerts to the event management layer. If needed, the network monitoring tool can be configured to forward the impacted CI alerts with informational severity, whereas causal alerts can be forwarded with critical severity.
In cases of an open source network monitoring tool with limited to no capability for native topology-based correlation, organizations can still import the required topology data in the event management layer and perform correlation as discussed earlier.
Application Topology Correlation
This correlation leverages the application architecture with the relationships between application components. This can be as simple as a vCenter farm running on multiple hypervisors in a cluster and hosting various VMs. It can also be a more complex application architecture for a business application, such as an insurance claim processing system running on multiple nodes hosting web servers, DB servers, EDI systems, and SAP systems. When a fault occurs in such an application, event management may receive alerts such as the following:
High request in application queue from application servers
High response time of an application URL from web servers
Timeout alert from synthetic monitoring from the load balancer
At this point, many other unrelated alerts may also be received by event management. In this case, all systems are up and running, but the issue is at the application level. Here, application-level correlation helps in isolating the fault. However, discovering and maintaining application topology comes with its own challenges:
Discovering application topology requires domain knowledge to determine the patterns that need to be detected during discovery, and discovery tool SMEs don't usually possess such vast domain knowledge.
Information is federated and available inside different tools managed by multiple teams.
There are challenges in getting clearance to read application logs/services/processes. Logs usually contain PII/PHI/confidential data, which raises a red flag with the security and application teams. In the absence of a discovered application topology, machine learning algorithms can still learn application relationships from the event data itself in several ways:
Learn by analyzing the similarity of hostnames/IP addresses, as servers belonging to one application usually share a common hostname prefix or common subnet details.
Learn by analyzing the arrival time of the events and detecting patterns from it.
Learn by analyzing the message text of events, such as whether an event belongs to the availability or performance category, or whether a common application/service name appears in events that arrive within a short interval of each other.
Learn by analyzing the event class and source. Data from multiple operational management databases (OMDBs) like vCenter, SAP, SCCM, etc., is quite useful in this learning, as these are the sources where different CIs are managed. For example, the vCenter database provides the mapping of VMs to the ESX farm, where the event source will be vCenter and the class will be Virtualization. Similarly, SAP can provide application mapping details, where the source will be SAP System and the class will be Application. The algorithm can use this data to correlate events and find out whether the issue is due to the underlying hypervisor or due to application performance.
Like network topology correlation, you can store the application topology data for critical applications at the event management layer and let the system learn from it. Unless the application is running on the cloud or on SDI, the application topology usually does not get updated as frequently as the network topology data, so it can be synced on a monthly or quarterly basis. For cloud and SDI-based applications, it is comparatively easy to use native APIs and update the topology data as soon as a change happens.
Assume the database process running on the dummy-db-1 server goes down; it will generate a database alert as well as an application queue alert from dummy-app-1 and a URL response time alert from dummy-web-1. Using the previous data, the event management system can create clusters, correlate these alerts, and highlight which one is causal and which ones are impacted.
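A sketch of that correlation, assuming a small application-to-server mapping stored at the event management layer (the hostnames follow the dummy-* naming used above, and the tier ordering, with the database as the most upstream tier, is an illustrative assumption):

# Application service model: tiers ordered from most upstream (causal candidate) downward
app_model = {
    "claims-app": ["dummy-db-1", "dummy-app-1", "dummy-web-1"],
}

alerts = [
    {"source": "dummy-web-1", "message": "URL response time high"},
    {"source": "dummy-app-1", "message": "Application queue depth high"},
    {"source": "dummy-db-1", "message": "Database process down"},
]

for app, tiers in app_model.items():
    related = [a for a in alerts if a["source"] in tiers]
    if related:
        # Treat the alert from the most upstream tier as causal and the rest as impacted
        related.sort(key=lambda a: tiers.index(a["source"]))
        causal, impacted = related[0], related[1:]
        print(f"{app}: causal -> {causal['source']} ({causal['message']})")
        for a in impacted:
            print(f"{app}: impacted -> {a['source']} ({a['message']})")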
The availability of application topology or service models can definitely increase the accuracy of the algorithms, but it is not a must-have requirement for AIOps. Both supervised and unsupervised machine learning algorithms show a lot of potential and can help greatly in performing application topology correlation.
Summary
This chapter covered the important use cases in AIOps around anomaly detection using K-means clustering. You also used NLP techniques such as stopword removal and TF-IDF to make sense of textual data. You learned about other techniques such as application dependency mapping and the CMDB and how they are relevant in AIOps. In the next chapter, we will cover how to set up AIOps as a practice in an organization.