© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
N. Sabharwal, G. Bhardwaj, Hands-on AIOps, https://doi.org/10.1007/978-1-4842-8267-0_8

8. AIOps Use Case: Anomaly Detection

Navin Sabharwal and Gaurav Bhardwaj
New Delhi, India
 

After discussing deduplication and automated baselining in previous chapters, we will now move up the AIOps maturity ladder to discuss anomaly detection, which provides a big leap toward proactiveness. This chapter explains anomaly detection and how it is useful for IT operations.

Anomaly Detection Overview

Anomaly detection is the process of identifying data points that are unusual or unexpected. Regular events of the CPU, memory, swap memory, disk, etc., are normal to operations, but an “application down” or firewall event represents an unusual scenario. The goal of anomaly detection is to identify such unusual scenarios (what we call outliers) within a dataset that differ significantly from other observations. Though the task of detecting an anomaly can be performed by all three types of machine learning algorithms, it is implemented extensively on unlabeled data by clustering the data into various groups. In IT operations, the execution of a service improvement plan is a regular exercise, and operations do not have a specific target to predict. Rather, they need to analyze lots of data, observe similarities, and club similar data points together into different groups to understand anomalies and formulate recommendations. We briefly discussed the different clustering algorithms available in unsupervised machine learning for anomaly detection in Chapter 5. K-means clustering is one of the simplest and most popular of these algorithms, and we will explore it in this chapter.

K-Means Algorithms

As discussed in Chapter 5, K-means is a centroid-based clustering algorithm that is extensively used in data mining to detect patterns and group similar data points into a cluster. Mathematically, if there is a dataset {x1, . . . , xN} consisting of N data points, as shown in Figure 8-1, then our goal is to partition the dataset into some number K of clusters.

A scatterplot of the N data points. The horizontal axis values are 0, 200, 400, 600, 800, and 1000, and the vertical axis values are 0, 5000, 10000, 15000, 20000, and 25000.

Figure 8-1

N data points in space

Here, K represents the number of distinct clusters that can be created around centroids. There are several ways to choose K, and our implementation in this section uses the Elbow method to determine the appropriate number of clusters. Three centroids are detected in the dataset, and they are marked as red dots in Figure 8-2.

The K-means algorithm tries to determine the similarity of data points with centroids and accordingly cluster the data points around these centroids, as shown in Figure 8-2.

A scatterplot showing the data points clustered around their centroids. The horizontal axis values are 0, 200, 400, 600, 800, and 1000, and the vertical axis values are 0, 5000, 10000, 15000, 20000, and 25000.

Figure 8-2

Clustering performed by the K-means algorithm

The IT operations team performs the task of manually filtering hundreds of events and selecting related events in a group for various tasks such as the following:
  • Root-cause analysis: Groups events that are related to an issue or outage to find how an issue starts unfolding

  • Performance analysis: Groups events related to a specific application or service to analyze its performance and related bottlenecks

  • Service improvement: Groups events that create noise and analyzes what monitoring configurations or baselines should be updated at the source to reduce noise and improve service quality

  • Capacity planning: Groups events that represent hot spots that need timely intervention and resolution by a capacity planning process

From an AIOps perspective, K-means is probably the simplest clustering algorithm that can detect anomalies in a wide range of scenarios. In this algorithm, a centroid is first chosen for every cluster, and then data points are grouped into clusters based on their distance from the centroids. There are multiple methods to calculate distance, such as Minkowski, Manhattan, Euclidean, and Hamming.
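To make the distance measures named above concrete, here is a minimal sketch in plain Python. The function names and the sample point and centroid are illustrative assumptions, not part of the chapter's notebook. Manhattan and Euclidean distances are written as the order-1 and order-2 cases of the Minkowski distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    # Minkowski distance with p = 1 (sum of absolute differences)
    return minkowski(a, b, 1)

def euclidean(a, b):
    # Minkowski distance with p = 2 (straight-line distance)
    return minkowski(a, b, 2)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical data point and centroid
point = [1.0, 2.0]
centroid = [4.0, 6.0]
print(manhattan(point, centroid))        # 7.0
print(euclidean(point, centroid))        # 5.0
print(hamming("cpu high", "cpu down"))   # 4
```

Hamming distance compares sequences position by position, so it suits categorical or fixed-length encoded data, whereas the other three apply to numeric vectors such as the TF-IDF vectors used later in this chapter.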

Clustering is an iterative process that optimizes the position of the centroids from which the distances are calculated. For example, the majority of organizations receive thousands of events related to performance or availability, but rarely do they receive security alerts related to DDoS/DoS attacks. Not only the frequency but also the alarm message text will differ. In this case, you can apply K-means clustering to event messages to group related messages over a time scale, leaving security alerts aside as anomalies.
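The iterative assign-and-update loop can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions (Euclidean distance, initial centroids picked at random from the data, hypothetical two-dimensional points); the implementation later in this chapter relies on the KMeans class from sklearn instead.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance; enough for finding the nearest centroid
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initial centroids: k distinct data points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                       # no point changed cluster: converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

# Hypothetical points forming two well-separated groups
points = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9.5], [8.5, 9]]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

The loop stops as soon as an iteration leaves every point in the same cluster, which is the usual convergence criterion for this algorithm.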

Let’s consider an implementation scenario where there are multiple alerts in the system and you want to analyze alert messages to segregate frequently occurring events as well as anomalies and then try to determine the root cause of events. For this implementation, you will work with text data; hence, you have to use the subdomain of AI called natural language processing (NLP). You will be using NLP to analyze the text by tokenizing it and getting the relevance of tokens and their similarities in different events to create clusters.

To begin, you need to import some libraries. In this implementation, you will be using NLP-related libraries.
  • Natural Language ToolKit (nltk): This is a suite of libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning of text to process human language data and use it for further analysis in statistical or ML systems.

  • sklearn: This is a machine learning library that provides the implementation of various algorithms such as K-means, SVM, random forests, gradient boosting, etc. This library allows you to use these algorithms on data rather than making a huge effort to manually implement them and then use them on data.

  • Gensim: This is an important library for topic modeling, which is a statistical technique for text mining and discovering abstract topics that are hidden in a collection of text. Topics can be defined as a repeating pattern of co-occurring terms in a text corpus. For example, switch, interface, router, and bandwidth are terms that will occur together frequently and hence can be grouped under the topic of network. This is not a rule-based approach that uses regular expressions or dictionary-based keyword searching techniques. It is an unsupervised ML algorithm to find a bunch of related words and create clusters of text.

You can download this code from https://github.com/dryice-devops/AIOps/blob/main/Ch-8_Anomaly%20Detection.ipynb.
# Libraries for data handling and mathematical calculations
import pandas as pd
import numpy as np
# Libraries for NLP text processing
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.manifold import TSNE
import gensim
# Libraries for plotting graphs and charts
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from IPython.display import clear_output
import time
Load the file containing some sample events.
raw_alerts = pd.read_csv(r'event_dump.csv')
Let’s understand the data by checking the columns and their sample values.
raw_alerts.info()
As shown in Figure 8-3, the dataset contains 1,051 entries (indexed 0 to 1050; the header row of the file is not counted), each with six columns.

Output of raw_alerts.info(). The object is of class pandas.core.frame.DataFrame with a RangeIndex of 1051 entries, 0 to 1050. The headers are Column, Non-Null Count, and Dtype. There are six columns in total: Source, AlertTime, AlertDescription, AlertClass, AlertType, and AlertManager.

Figure 8-3

Columns in a dataset

Next let’s check the sample events in the data file.
raw_alerts.head().transpose()
Table 8-1 shows the sample events from the data input file, which contains the following columns:
  • Source: Contains the device hostname/IP

  • AlertTime: Provides the event occurrence time

  • AlertDescription: Provides the detailed event message

  • AlertClass: Provides the class of the event

  • AlertType: Provides the type of the event

  • AlertManager: Identifies the tool that generated the event

Table 8-1

Sample Events from Input File

The table shows five sample events as columns, with one row for each of the six fields: Source, AlertTime, AlertDescription, AlertClass, AlertType, and AlertManager.

Check which unique values of AlertClass are present in the data.
raw_alerts['AlertClass'].unique()
As shown in Figure 8-4, the input dataset contains six different event classes that represent performance and availability issues related to the operating system, network, applications, and configuration changes. These cover most common types of events that get generated in any operations.

Output of raw_alerts['AlertClass'].unique(). The result is an array of dtype object with six values: Operating System, Network Performance, Transaction Monitoring, Network Availability, MERAKI, and Configuration.

Figure 8-4

Unique values in the Class column

Next, start applying the NLP algorithm. First, we need to download the English words dictionary from nltk. You also need to add domain-specific words that may be missing from the English words dictionary, such as app, HTTP, and CPU. These domain-specific words are important for efficiently creating clusters.

Now apply preprocessing to the event message, which includes removing special characters and punctuation to create tokens. Stopwords, a set of commonly used words such as is and are, carry little useful information and hence can be safely removed. The remaining tokens then pass through the stemming and lemmatization process (explained in Chapter 4) to reduce them to more meaningful base forms.
nltk.download('words')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer
# English vocabulary used to filter tokens (built once, not on every call)
words = set(nltk.corpus.words.words())
# Domain-specific terms that are missing from the English dictionary
domain_terms = set(["cpu","interface","application","failure",
                    "https","outage", "synthetic","timeout","utc",
                    "www","simulation","simulated","http",
                    "response","app","network","emprecord",
                    "global_hr","pyroll","employee_lms","demoapp1",
                    "emp_logistics_summary","demo","down", "tcp" ,
                    "connect","emp_tsms","payroll_sap_gts",
                    "demoapp3","high","state"])
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))
def preprocess(text):
    # Tokenize, drop stopwords, keep known words only, then lemmatize and stem
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            if token.lower() in words or token.lower() in domain_terms:
                result.append(lemmatize_stemming(token))
    return result
Next, pass event messages through NLP preprocessing.
# Preprocess each alert description and join its tokens back into one string
event_msg = [" ".join(preprocess(msg))
             for msg in raw_alerts['AlertDescription']]

For each event sentence, we have the relevant tokens for analysis. Now we need to map these text tokens from the vocabulary to a corresponding vector of real numbers so that we can apply statistical algorithms. This process is called vectorization.

You will be using TF-IDF for the vectorization process. TF-IDF is a statistical measure to calculate the relevance of a token. It consists of two concepts: term frequency (TF) and inverse document frequency (IDF).
  • TF: This considers how frequently the word appears in an event message. Because messages are not all the same length, a token may occur more often in a long message than in a short one simply because of the length, so the count is typically normalized by the length of the message.

  • IDF: This is based on the fact that less frequent tokens are more informative and important. So if a token appears frequently in multiple event messages, like high, then it’s not critical compared to the tokens that are not frequent in multiple messages like down.
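The two concepts can be illustrated with a small hand calculation. The sketch below uses the plain textbook formulation (term count divided by message length, multiplied by the logarithm of the number of messages over the number of messages containing the token) on three hypothetical, already tokenized messages. The TfidfVectorizer class from sklearn that is used next applies a smoothed IDF and normalizes each vector by default, so its numbers will differ from these, although the ranking idea is the same.

```python
import math

# Hypothetical, already tokenized event messages
messages = [
    ["cpu", "utilization", "high"],
    ["memory", "utilization", "high"],
    ["interface", "down"],
]

def term_frequency(token, message):
    # Share of the message's tokens that are this token
    return message.count(token) / len(message)

def inverse_document_frequency(token, corpus):
    # Tokens found in fewer messages get a larger value
    containing = sum(1 for message in corpus if token in message)
    return math.log(len(corpus) / containing)

def tf_idf(token, message, corpus):
    return term_frequency(token, message) * inverse_document_frequency(token, corpus)

# "high" appears in two messages, "down" in only one
print(round(tf_idf("high", messages[0], messages), 3))
print(round(tf_idf("down", messages[2], messages), 3))
```

In this toy corpus, down receives a higher weight than high because it appears in fewer messages, which matches the intuition described for IDF.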

Readers can get more details on TF-IDF at https://en.wikipedia.org/wiki/Tf–idf.

We will be using TF-IDF to vectorize the data in the following:
tf_idf_vectorizor = TfidfVectorizer(stop_words = 'english',
                             max_features = 10000)
tf_idf = tf_idf_vectorizor.fit_transform(event_msg)
tf_idf_norm = normalize(tf_idf)
tf_idf_array = tf_idf_norm.toarray()

Now we have the final TF-IDF matrix, with one numeric vector per event, on which we can apply the K-means algorithm.

To apply the K-means algorithm, first we need to determine what will be the ideal value of K. For this we will be using the Elbow method, which is a heuristic used in determining the number of clusters in a dataset.
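The elbow is usually read off the graph by eye, but the same idea can be approximated numerically by looking for the value of K after which the improvement in score drops the most. The following is a rough sketch of that heuristic only; the function name and the scores are illustrative assumptions and are not results from the chapter's dataset.

```python
def find_elbow(cluster_counts, scores):
    """Return the K with the sharpest drop in improvement between neighbors."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(scores) - 1):
        gain_before = scores[i] - scores[i - 1]   # improvement reaching this K
        gain_after = scores[i + 1] - scores[i]    # improvement from one more cluster
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = cluster_counts[i], bend
    return best_k

# Hypothetical scores (higher is better) for K = 3 .. 11
cluster_counts = list(range(3, 12))
illustrative_scores = [-250, -200, -150, -100, -50, -44, -38, -31, -25]
print(find_elbow(cluster_counts, illustrative_scores))   # 7
```

A heuristic like this is sensitive to noisy scores, so it is best treated as a cross-check on the visual reading rather than a replacement for it.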

Let’s use this method to plot a graph of scores and the number of clusters.
# Candidate values of K
number_clusters = range(3, 12)
kmeans = [KMeans(n_clusters=i, max_iter = 600) for i in number_clusters]
# score() returns the negative of the within-cluster sum of squared distances,
# so values closer to zero indicate tighter clusters
score = [model.fit(tf_idf_array).score(tf_idf_array) for model in kmeans]
plt.plot(number_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Method')
plt.show()
In the graph in Figure 8-5, the score improves steeply up to seven clusters and flattens afterward, which suggests that seven clusters fit our dataset. So, let’s set the value of K to 7 and execute the K-means algorithm.

An elbow graph of score versus number of clusters. The number of clusters ranges from 3 to 11, and the scores range from negative 250 to 0. The line goes up and to the right from (3, negative 250) to (7, negative 50) and then flattens toward (11, negative 25). All values are approximate.

Figure 8-5

Elbow graph on dataset

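# Work on a copy so that raw_alerts stays unchanged
df_temp = raw_alerts.copy()
no_cluster = 7
kmeans = KMeans(n_clusters=no_cluster, max_iter=600)
fitted = kmeans.fit(tf_idf_array)
print("Top terms per cluster:")
# Sort the terms of each centroid by descending weight
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
labels = {}
# get_feature_names_out() requires scikit-learn 1.0 or later
terms = tf_idf_vectorizor.get_feature_names_out()
for i in range(no_cluster):
    print("Cluster %d:" % i)
    # Use the highest-weighted term as the label of the cluster
    labels[i] = str(terms[order_centroids[i, 0]])
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])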

Now you can see the top keywords of each cluster, with each cluster representing a different kind of issue.

Let’s understand each cluster in more detail by plotting the top issues that are clustered in each cluster.
preds = kmeans.predict(tf_idf_array)
df_temp['cluster'] = preds
df_temp['AlertTime'] = pd.to_datetime(df_temp['AlertTime'])
def get_top_features_cluster(tf_idf_array, prediction, n_feats):
    # For each cluster, return the n_feats terms with the highest mean TF-IDF score
    labels = np.unique(prediction)
    # get_feature_names_out() requires scikit-learn 1.0 or later
    features = tf_idf_vectorizor.get_feature_names_out()
    dfs = []
    for label in labels:
        id_temp = np.where(prediction==label)
        x_means = np.mean(tf_idf_array[id_temp], axis = 0)
        sorted_means = np.argsort(x_means)[::-1][:n_feats]
        best_features = [(features[i], x_means[i]) for i in sorted_means]
        df = pd.DataFrame(best_features, columns = ['features', 'score'])
        dfs.append(df)
    return dfs
def plotWords(dfs, n_feats):
    # Draw one horizontal bar chart per cluster, each in its own figure
    for i in range(0, len(dfs)):
        plt.figure(figsize=(8, 4))
        plt.title(("Most Common Words in Cluster {}".format(i)),
                  fontsize=10, fontweight='bold')
        sns.barplot(x = 'score' , y = 'features', orient = 'h' ,
                    data = dfs[i][:n_feats])
        plt.show()
n_feats = 20
dfs = get_top_features_cluster(tf_idf_array, preds, n_feats)
plotWords(dfs, 10)
Figure 8-6 shows the alert terms (features) that are grouped in cluster 0 and their significance, or weightage, in it. This cluster primarily consists of CPU-related alerts, along with other less significant alerts. The CPU utilization alerts in this cluster are not related to any application-related alerts, which indicates a strong probability that CPU alerts generate a lot of noise, and you may need to tune the CPU utilization parameter.

A horizontal bar graph represents C P U issues. The vertical axis denotes features and the horizontal axis denotes scores. The features are C P U, critic, util, state, w w w, failur, emp t s m, employee Im, emp record and environ. The values of the respective bars across the features are 0.62, 0.59, 0.36, 0.36, 0, 0, 0, 0, 0, and 0 approximate.

Figure 8-6

Cluster 0 represents CPU issues

Cluster 1, as shown in Figure 8-7, represents issues that are primarily related to the application Global HR. Also, this cluster indicates a potential relationship between the Global HR application issue and a memory issue and will help the IT operations teams in the troubleshooting and remediation process. This is quite an important insight that gets automatically detected by ML algorithms without writing any rule.

A horizontal bar graph represents global H R application issues. The horizontal axis denotes scores and vertical axis denotes features. The features are global h r, warn, memori, util, state, w w w, failur, em, emp t s m and employee I m. The value of the respective bars across the features are 0.7, 0.45, 0.45, 0.25, and 0.25 approximately.

Figure 8-7

Cluster 1 represents Global HR application issues

Similar to cluster 1, cluster 2 in Figure 8-8 represents issues that are primarily related to another application, Payroll, which is also impacted by memory-related issues. Alerts in this cluster should be used by the application team to understand how to improve the application's performance and reduce application alerts.

A horizontal bar graph represents payroll application issues. The vertical axis denotes features and the horizontal axis denotes scores. The features are pyrol, warn, memori, util, state, www, failur, em, emp t s m, employee I m. The values of the respective bars across the features are 0.71, 0.42, 0.42, 0.25, 0.25, 0, 0, 0, 0, and 0 approximately.

Figure 8-8

Cluster 2 represents Payroll application issues

Cluster 3 in Figure 8-9 represents alerts related to the application Employee Record, which needs to be analyzed and fixed. This is also an important input to the problem management team because they need to provide a long-term resolution for this issue.

A horizontal bar graph of employee record application issues. The vertical axis denotes features and the horizontal axis denotes scores. The features are emprecord, warn, memori, util, state, www, failur, em, emp t s m and employee I m. The values of the respective bars across the features are 0.71, 0.42, 0.42, 0.25, and 0.25 approximately.

Figure 8-9

Cluster 3 represents Employee Record application issues

Cluster 4 in Figure 8-10 represents issues related to network interface utilization that needs to be analyzed by the network team. AIOps can use automation to create a single ticket for the network team against all the alerts that are clustered here. This single ticket will avoid overloading the ticketing system with duplicate incidents.

A horizontal bar graph of network interface issues. The vertical axis denotes features and the horizontal axis denotes score. The features are interfac, warn, util, state, w w w, failur, em, emp t s m, employee I m and emprecord. The values of the bars across the features are 0.81, 0.42, 0.25, 0.25, 0, 0, 0, 0, 0, and 0 approximately.

Figure 8-10

Cluster 4 represents network interface–related issues

Cluster 5 in Figure 8-11 represents memory-related issues, and we have also observed in previous clusters that memory-related alerts are clustered along with application alerts. This indicates that multiple applications might be using common infrastructure such as a common virtualized infrastructure, which is nearing capacity. Since different clusters are capturing different issues automatically, analyzing them together significantly improves the resolution time.

A horizontal bar graph of memory utilization issues. The vertical axis denotes features and the horizontal axis denotes score. The features are memori, critic, util, state, www, font, emp tsm, employee Im, emprecord and environ. The values of the bar across the vertical axis are 0.61, 0.58, 0.35, 0.35, 0, 0, 0, 0, 0, and 0 approximately.

Figure 8-11

Cluster 5 represents memory utilization–related issues

Cluster 6 in Figure 8-12 is the most interesting because it detected a potential outage. As shown in Figure 8-12, cluster 6 consists of various simulation, transaction, and application-related alerts that indicate an issue at the infrastructure level and not necessarily at the application level. With traditional IT ops methods, finding the root cause in such data would have been difficult, but the ML model created this cluster, which detected a potential relationship and pointed toward the root cause.

A horizontal bar graph of a potential outage. The vertical axis denotes features and the horizontal axis denotes scores. The features are simul, session, client, page, transact, affect, web, payroll sap g t, applic and app. The values of the bars across the vertical axis are 0.26, 0.19, 0.19, 0.09, 0.09, 0.09, 0.09, 0.08, 0.06 and 0.06 approximately.

Figure 8-12

Cluster 6 represents an outage

Let’s observe all the events that are present in this specific cluster, cluster 6.
df_appcluster=df_temp[df_temp['cluster'] == 6 ]
df_appcluster.info()
As shown in Figure 8-13, there are 43 events that got clustered together in cluster 6.

This is K-mean algorithms dataset. In that image the headers are column, Non-Null count and data type. Where at the top class pandas dot core dot frame dot Data Frame. Int 64 Index 43 entries, from 31 to 1050. There are 7 columns in total. Source, Alert Time, Alert Description, Alert Class, Alert Type, Alert Manager and Cluster.

Figure 8-13

Cluster 6 events data

Let’s observe the events in this cluster.
df_appcluster[['AlertTime','AlertClass','AlertType','AlertDescription']]
On closely reviewing the events in cluster 6, you can see that five events arrived on July 8 at almost the same time, about 19:30, indicating an outage. As shown in Figure 8-14, one of the five events reports a network interface being down, which makes it the probable cause, because immediately after this event there are events for failed simulated transactions for the application Global HR.

A table has four columns and seven rows, columns heading are Alert Time, Alert Class, Alert Type and Alert Description. In this tables column five rows are marked by a red box, which are 706, 707, 708, 709 and 710.

Figure 8-14

Events in cluster 6

With the help of ML and NLP capabilities, the algorithm discovered and clustered useful information from 1,000+ events, and that was without writing any static rules or using any topology-related details.

Let’s also analyze these clusters together over a time scale to uncover more insights.

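First let’s print the labels that the algorithm has detected based on the events grouped in each cluster.
print("Cluster Labels are: ", labels)
Cluster Labels are:  {0: 'pyrol', 1: 'cpu', 2: 'interfac', 3: 'global_hr', 4: 'emprecord', 5: 'memori', 6: 'simul'}
Cluster numbers depend on the random initialization of K-means, so the numbering in this output can differ from the numbering used in the figures unless a fixed random_state is set. Now let’s observe these clusters over a timeline to explore any useful details.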
df_temp.plot(x='AlertTime', y='cluster', lw=0, marker='s', color ="#eb5634",
 figsize=(12,8),rot = 90,  alpha = 0.5,fontsize = 15,grid=True, legend=False,
 ylabel="Cluster Number")
From Figure 8-15, there are a few important observations you can make:
  • There is a consistently high level of noise from the CPU utilization events that are represented in cluster 0. This is feedback to tune the thresholds of the CPU monitoring parameter. Alternatively, you can submit a recommendation to the capacity planning team to increase the CPU capacity.

  • The algorithm automatically created three clusters, one for each application, which primarily contains alerts related to that specific application only. These clusters can be analyzed for incident or problem management–related tasks. These clusters provide a lot of visibility to the applications teams and are immensely helpful in executing service improvement programs.

A timeline plot of the K-means clusters. The horizontal axis denotes the alert time, from 2021-06-01 to 2021-08-01, and the vertical axis denotes the cluster number, from 0 to 6.

Figure 8-15

Timeline analysis of cluster

This AIOps use case of anomaly detection also makes it efficient to detect potentially critical events, such as security events, in the enormous pile of events and to initiate automated actions, such as invoking a kill switch, to minimize the impact. You have seen how K-means clustering detected noise and can generate recommendations for problem and capacity management processes.

Note

As per best practices, security alerts should not be integrated with the operations event console. Security alerts are exclusively for the eyes of security experts and not for everyone. Ideally, there should be a separate SIEM console for the security operations team to monitor.

Though a K-means algorithm is simple to implement even with a very large dataset, it has a few challenges from an AIOps perspective.
  • The number of clusters plays a crucial role in the K-means algorithm’s efficiency.

  • K-means clustering is affected by the presence of noise or a high number of outliers in the dataset. Centroids get stretched to incorporate the outliers or noise.

  • A higher number of dimensions (parameters or variables) has a negative impact on the efficiency of the algorithm.
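The sensitivity to outliers follows directly from the fact that a centroid is the mean of its cluster members. A one-dimensional sketch with hypothetical CPU utilization readings shows how a single extreme value stretches the centroid away from the bulk of the data.

```python
def centroid(values):
    # In one dimension, the K-means centroid of a cluster is simply its mean
    return sum(values) / len(values)

# Hypothetical CPU utilization readings forming one tight cluster
cpu_readings = [40, 42, 41, 39, 43]
# The same cluster with a single extreme reading added
with_outlier = cpu_readings + [400]

print(centroid(cpu_readings))    # 41.0
print(centroid(with_outlier))    # about 100.8
```

With the outlier included, the centroid no longer lies near any of the regular readings, which is why outliers are often filtered or handled separately before clustering.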

Along with anomaly detection, another common use case of AIOps is correlation and association, which will be discussed next.

Correlation and Association

Correlation is a statistical term that means identifying any relationship or association between two or more entities. From an AIOps perspective, correlation is used to determine any dependency or relation between entities and cluster them together for efficient analysis. We covered multiple regression algorithms that determine the relationship between multiple entities to establish correlation in Chapter 5.

However, establishing correlation at the event layer is a bit difficult because multiple events from varied sources arrive at varying time intervals. A time-based sequence or pattern of events is quite rare at the event layer. One of the algorithms that works efficiently at the event layer to determine correlation and association is DBSCAN. It is particularly useful in scenarios where data points are close to each other (dense) and a considerable amount of noise or outliers is present.
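The density idea behind DBSCAN can be sketched in plain Python. This is a simplified illustration, not the chapter's implementation: the parameter names eps and min_samples follow common convention, and the event times (minutes since the start of an observation window) are hypothetical. Points with at least min_samples neighbors within eps form dense clusters, and points that belong to no dense region are labeled as noise.

```python
NOISE = -1

def region_query(points, index, eps):
    """Indices of all points within eps of points[index], including itself."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[index], q)) <= eps ** 2]

def dbscan(points, eps, min_samples):
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already processed
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE             # not dense enough (may become a border point)
            continue
        cluster += 1                      # start a new cluster from this core point
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors) # core point: keep growing the cluster
    return labels

# Hypothetical event times in minutes: two bursts and one isolated event
event_times = [(0,), (1,), (2,), (3,), (60,), (61,), (62,), (200,)]
print(dbscan(event_times, eps=2, min_samples=3))
# [0, 0, 0, 0, 1, 1, 1, -1]
```

Unlike K-means, this approach does not need the number of clusters in advance, and the isolated event is reported as noise instead of being forced into a cluster.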

Topology-Based Correlation

The accuracy of clustering is one of the most important KPIs of an event management system that drives the efficiency of the engine. To improve the accuracy of clustering and the determination of the root cause, you need to provide some topology context. But the CMDB and discovery are challenges in themselves, and having a 100 percent accurate CMDB with 100 percent discovery and modeling of infrastructure is a tough task.

Considering the various challenges with the CMDB and discovery, topology correlation can still be implemented for network topology correlation as well as for application topology correlation. Consider the sample topology shown in Figure 8-16. Connections marked in black represent the network topology, whereas connections marked in red represent the application topology. We will use this sample topology to discuss correlation for network topology and application topology.

A chart has one main branch and ten sub branches. The main branch is network switch -1, In that image the red arrow denotes application Topology and the black arrow denote Network Topology.

Figure 8-16

Sample topology diagram

Network Topology Correlation

This correlation uses the network connectivity diagram as the topology to perform correlation. For example, if a core switch connecting hundreds of servers and devices to its interfaces is down, then alerts from this switch are causal, and all other alerts from the underlying network devices and servers as well as applications running over them are considered impacted.

The IT operations team can then focus on the causal CI, the network switch, instead of manually filtering all the alerts generated by this outage. To arrive at the causal event, this correlation needs to determine the starting position and the depth of hierarchy to compare when building the topology. Since storing the entire topology or discovering everything may not be practically feasible, the next best approach is to store a subset of topology information for critical devices or applications at the event management layer and update it every 24 hours (or weekly if the environment is not very dynamic), either through automated discovery/scripts or manually. The network topology for the sample topology in Figure 8-16 can be stored as key-value pairs at the event manager layer, as shown here:
{"network_topology": [
      {"child-ci": "dummy-app-1", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "dummy-app-2", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "dummy-web-1", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "dummy-web-2", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "dummy-db-1", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "dummy-db-2", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "dummy-srv-1", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "dummy-srv-2", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "dummy-srv-3", "parent-ci": "ntwrk-switch-1"},
      {"child-ci": "ntwrk-lb-1", "parent-ci": "ntwrk-switch-1"}
]}

Whenever alerts are generated, the event management algorithm will use this data to determine relationships, perform correlation, and determine the causal and impacted CIs. This method is dependent on the static information, and thus it has its limitations and challenges. Storing a high volume of topology data can severely impact the event management tool performance.
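A minimal sketch of how stored parent-child pairs could be used to separate causal and impacted CIs is shown next. The function name correlate and the list of alerting CIs are illustrative assumptions, and only a subset of the sample topology is included. An alerting CI is treated as impacted when its parent is also alerting, and as causal otherwise.

```python
# Subset of the stored parent-child pairs from the sample topology
network_topology = [
    {"child-ci": "dummy-app-1", "parent-ci": "ntwrk-switch-1"},
    {"child-ci": "dummy-db-1", "parent-ci": "ntwrk-switch-1"},
    {"child-ci": "ntwrk-lb-1", "parent-ci": "ntwrk-switch-1"},
]

def correlate(alerting_cis, topology):
    """Split alerting CIs into causal and impacted lists using parent links."""
    parent_of = {link["child-ci"]: link["parent-ci"] for link in topology}
    alerting = set(alerting_cis)
    causal, impacted = [], []
    for ci in alerting_cis:
        if parent_of.get(ci) in alerting:
            impacted.append(ci)   # its parent is alerting too
        else:
            causal.append(ci)     # no alerting parent, so treat it as a cause
    return causal, impacted

# Hypothetical set of CIs that currently have open alerts
causal, impacted = correlate(
    ["dummy-app-1", "ntwrk-switch-1", "dummy-db-1"], network_topology)
print(causal)     # ['ntwrk-switch-1']
print(impacted)   # ['dummy-app-1', 'dummy-db-1']
```

This sketch looks only one level up the hierarchy. A deeper topology would require walking the parent chain up to the configured depth.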

From an AIOps perspective, it’s recommended to offload this correlation to the network monitoring layer, which has a native capability to scan the subnet ranges, discover devices, and configure them in availability monitoring using ICMP or SNMP. Network monitoring tools have real-time visibility into the availability of all CIs and their topology, and they can dynamically update both based on changes in the network. A network monitoring tool should perform this correlation using the detected topology, filtering out impacted CI alerts and forwarding causal alerts to the event management layer. If needed, the network monitoring tool can be configured to forward the impacted CI alerts with an informational severity, whereas a causal alert can be forwarded with a critical severity.

In cases of an open source network monitoring tool with limited to no capability for native topology-based correlation, organizations can still import the required topology data in the event management layer and perform correlation as discussed earlier.

Application Topology Correlation

This correlation leverages the application architecture with a relationship between application components. This can be as simple as a vCenter farm running on multiple hypervisors in a cluster and hosting various VMs. This can also be a more complex application architecture for a business application like an insurance claim processing system running on multiple nodes hosting web servers, DB servers, EDI systems, and SAP systems.

Unlike network topology correlation, where only the availability of systems is considered, application correlation considers both the availability and the performance of applications. This correlation needs not only the details of the network topology but also the application blueprints and key performance indicators. Consider an example where an alert is received about database processes being down, which in turn causes alerts for the following:
  • High request in application queue from application servers

  • High response time of an application URL from web servers

  • Timeout alert from synthetic monitoring from the load balancer

At this point in time, many other unrelated alerts may also be received at the event management layer. In this case, all systems are up and running, but the issue is at the application level. Here the application-level correlation helps in isolating faults.

Application architecture or blueprints ideally should be available (or discovered) and stored in the CMDB as service models. But organizations face lots of challenges in maintaining these blueprints because of the following three most common reasons:
  • Discovering application topology requires domain knowledge to determine the pattern that needs to be detected in discovery. Discovery tool SMEs don’t possess such vast domain knowledge.

  • Information is federated and available inside different tools managed by multiple teams.

  • There are challenges in getting clearance to read application logs/services/processes. Logs usually contain PII/PHI/confidential data, which causes a red flag to be raised by the security and application teams.

From an AIOps perspective, whenever alerts related to an application arrive, algorithms should leverage the application topology to perform the necessary correlation. Given these challenges with the CMDB, AIOps algorithms can be used to learn patterns and dependencies and automatically derive the topology. In our sample topology example, algorithms can use the following features to understand the topology and perform application correlation:
  • Learn by analyzing the similarity of hostnames/IP addresses, as servers belonging to one application usually share a common prefix in the hostname or common subnet details.

  • Learn by analyzing the arrival time of the events and detecting patterns from it.

  • Learn by analyzing the message text of the events: whether an event belongs to the availability category or the performance category, and whether a common application/service name appears in events that arrive together within a short interval.

  • Learn by analyzing the event class and source. Data from multiple operational management databases (OMDBs) like vCenter, SAP, SCCM, etc., is quite useful in this learning, as these are the sources where the different CIs are managed. For example, the vCenter database provides the mapping of VMs to the ESX farm, where the event source will be vCenter and the class will be Virtualization. Similarly, SAP can provide application mapping details, where the source will be SAP System and the class will be Application. The algorithm can use this data to correlate events and find out whether the issue is due to the underlying hypervisor or due to application performance.

  • Like network topology correlation, you can store the application topology data for critical applications at the event management layer and let the system learn from it. Unless the application is running on the cloud or on SDI, the application topology usually does not get updated as frequently as the network topology data, so it can be synced on a monthly or quarterly basis. For cloud and SDI-based applications, it is comparatively easy to use native APIs and update the topology data as soon as a change happens.

{"application_topology": [
      {"child-ci": "dummy-db-1", "parent-ci": "", "application": "dummy_app", "kpi": [{"type": "process", "name": "dummy-process"}]},
      {"child-ci": "dummy-db-2", "parent-ci": "", "application": "dummy_app", "kpi": [{"type": "process", "name": "dummy-process"}]},
      {"child-ci": "dummy-app-1", "parent-ci": "dummy-db-1", "application": "dummy_app", "kpi": [{"type": "service", "name": "app_service"}]},
      {"child-ci": "dummy-app-2", "parent-ci": "dummy-db-2", "application": "dummy_app", "kpi": [{"type": "service", "name": "app_service"}]},
      {"child-ci": "dummy-web-1", "parent-ci": "ntwrk-lb-1", "application": "dummy_app", "kpi": [{"type": "url", "name": "dummy-url.com"}]},
      {"child-ci": "dummy-web-2", "parent-ci": "ntwrk-lb-1", "application": "dummy_app", "kpi": [{"type": "url", "name": "dummy-url.com"}]},
      {"child-ci": "dummy-web-1", "parent-ci": "dummy-app-1", "application": "dummy_app", "kpi": [{"type": "url", "name": "dummy-url.com"}]},
      {"child-ci": "dummy-web-2", "parent-ci": "dummy-app-2", "application": "dummy_app", "kpi": [{"type": "url", "name": "dummy-url.com"}]}
]}

Assume that the database process running on the dummy-db-1 server goes down. It will generate a database alert, as well as an application queue alert from dummy-app-1 and a URL response time alert from dummy-web-1. Using the previous data, the event management system can create clusters and correlate these alerts, as well as highlight which one is causal and which ones are impacted.
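The scenario above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the event management system's actual algorithm: the function name correlate is hypothetical, the alert input is reduced to a list of CI names, and only the topology edges relevant to dummy-db-1 are included. A CI may have several parents here (dummy-web-1 depends on both the load balancer and dummy-app-1), so the sketch walks every alerting parent.

```python
from collections import defaultdict

# Subset of the sample application topology, same child-ci/parent-ci shape.
APP_TOPOLOGY = [
    {"child-ci": "dummy-db-1", "parent-ci": ""},
    {"child-ci": "dummy-app-1", "parent-ci": "dummy-db-1"},
    {"child-ci": "dummy-web-1", "parent-ci": "dummy-app-1"},
    {"child-ci": "dummy-web-1", "parent-ci": "ntwrk-lb-1"},
]


def correlate(alerting_cis, topology):
    """Cluster alerting CIs under the causal CI they depend on.

    Returns {causal_ci: [impacted_ci, ...]}. A CI is causal when none of
    its parents is alerting; otherwise it joins the cluster of every
    causal CI reachable through its alerting parents.
    """
    parents = defaultdict(set)
    for edge in topology:
        if edge["parent-ci"]:  # an empty parent means "no dependency"
            parents[edge["child-ci"]].add(edge["parent-ci"])
    alerting = set(alerting_cis)

    def root_causes(ci, seen):
        # Follow only parents that are alerting and not yet visited.
        up = [p for p in parents[ci] if p in alerting and p not in seen]
        if not up:
            return {ci}
        roots = set()
        for parent in up:
            roots |= root_causes(parent, seen | {ci})
        return roots

    clusters = defaultdict(set)
    for ci in alerting:
        for root in root_causes(ci, frozenset()):
            members = clusters[root]  # creates the cluster if missing
            if root != ci:
                members.add(ci)
    return {root: sorted(members) for root, members in clusters.items()}


if __name__ == "__main__":
    alerts = ["dummy-db-1", "dummy-app-1", "dummy-web-1", "dummy-srv-9"]
    print(correlate(alerts, APP_TOPOLOGY))
```

In this run, dummy-db-1 is reported as causal with dummy-app-1 and dummy-web-1 as impacted, while the unrelated dummy-srv-9 alert (a made-up CI name) stays in its own cluster.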

The availability of application topology or service models can definitely increase the accuracy of algorithms, but it is not a must-have requirement for AIOps. Both supervised and unsupervised machine learning algorithms have shown a lot of potential and can help considerably in performing application topology correlation.
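Where no topology data is available, the hostname-similarity and arrival-time features listed earlier can still be used to group events. The sketch below is a simple rule-based stand-in for that idea, not a trained machine learning model. It assumes that the application prefix is the part of the hostname before the first hyphen, that events are dictionaries with hypothetical host and time keys, and that a five-minute window counts as a "short interval"; all three are illustrative choices.

```python
from datetime import datetime, timedelta


def hostname_prefix(hostname):
    # Assumption: "dummy-db-1" and "dummy-web-1" share the prefix "dummy".
    return hostname.split("-")[0]


def group_events(events, window=timedelta(minutes=5)):
    """Group events that share a hostname prefix and arrive close together.

    An event joins the latest group for its prefix when it arrives within
    `window` of that group's most recent event; otherwise it starts a new group.
    """
    groups = []
    latest_group = {}  # prefix -> index of the most recent group for it
    for event in sorted(events, key=lambda e: e["time"]):
        prefix = hostname_prefix(event["host"])
        idx = latest_group.get(prefix)
        if idx is not None and event["time"] - groups[idx][-1]["time"] <= window:
            groups[idx].append(event)
        else:
            groups.append([event])
            latest_group[prefix] = len(groups) - 1
    return groups


if __name__ == "__main__":
    t0 = datetime(2022, 1, 1, 10, 0)
    events = [
        {"host": "dummy-db-1", "time": t0},
        {"host": "dummy-app-1", "time": t0 + timedelta(minutes=1)},
        {"host": "ntwrk-lb-1", "time": t0 + timedelta(minutes=1)},
        {"host": "dummy-web-1", "time": t0 + timedelta(minutes=2)},
        {"host": "dummy-db-2", "time": t0 + timedelta(hours=1)},
    ]
    for group in group_events(events):
        print([e["host"] for e in group])
```

Groups produced this way could then be used as candidate clusters or as input features for the supervised or unsupervised algorithms mentioned above.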

Summary

This chapter covered the important use cases in AIOps around anomaly detection using K-means clustering. You also used NLP techniques such as stopword removal and TF-IDF to make sense of textual data. You learned about other techniques such as application dependency mapping and the CMDB and how they are relevant in AIOps. In the next chapter, we will cover how to set up AIOps as a practice in an organization.
