© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
N. Sabharwal, G. Bhardwaj, Hands-on AIOps, https://doi.org/10.1007/978-1-4842-8267-0_6

6. AIOps Use Case: Deduplication

Navin Sabharwal1   and Gaurav Bhardwaj1
(1)
New Delhi, India
 

In the previous chapters, you learned about the various machine learning techniques and how they are used in general. In this chapter, you will see the implementation of a specific use case of deduplication in AIOps along with practical scenarios where it is used in operations and business.

Environment Setup

In this section, you’ll learn how to set up the environment to run the sample code demonstrating various AIOps use cases. You have three options to set up an environment.
  • Download Python from https://www.python.org/downloads/ and install Python directly with the installer. Other packages need to be installed explicitly on top of Python.

  • Use Anaconda, which is a Python distribution made for large-scale data processing, predictive analytics, and scientific computing. The Anaconda distribution is the easiest way to get started with Python coding, and it works on Linux, Windows, and macOS. It can be downloaded from https://www.anaconda.com/distribution/.

  • Use cloud services. This is the simplest of all the options but needs Internet connectivity. Cloud-hosted notebook services such as Microsoft Azure Notebooks and Google Colaboratory are popular choices.

We will be using the second option and will set up an environment using the Anaconda distribution.

Software Installation

Download the Anaconda Python distribution from https://www.anaconda.com/distribution/, as shown in Figure 6-1.


Figure 6-1

Downloading Anaconda

After downloading the file, trigger the setup to initiate the installation process.
  1. Click Continue.

  2. On the Read me step, click Continue.

  3. Accept the agreement, and click Continue.

  4. Select the installation location, and click Continue.

  5. It will take about 500MB of space on the computer. Click Install.

  6. Once the Anaconda application gets installed, click Close and proceed to the next step to launch the application.

Launch Application

After installing Anaconda, open a command-line terminal and type jupyter notebook.

This will launch the Jupyter server listening on port 8888. Usually the default browser opens automatically, or you can log in to the application by opening any web browser and navigating to http://localhost:8888/, as shown in Figure 6-2.

Click New ➤ Python 3 to create a blank notebook.


Figure 6-2

Jupyter Notebook UI

A blank notebook will launch in a new window. You can type commands in a cell and click the Run button to execute them. To test that the environment is working, type print("Hello World") in a cell and execute it.

Python Notebooks are used in the book for demonstrating the AIOps use cases of deduplication, dynamic baselining, and anomaly detection; they are available at https://github.com/dryice-devops/AIOps.git. You can download the source code to follow along with the exercises in this chapter.

The environment is now ready to develop and test algorithmic models for multiple AIOps use cases. Before getting into model development, let’s understand a few terms that are going to be used to analyze and measure the performance of models.

Performance Analysis of Models

In practice, multiple models need to be created to make predictions, and it is important to quantify the accuracy of their predictions so that the best-performing model can be selected for production use.

In supervised machine learning, a set of actual data points is split into training data and test data. The model learns from the training data and makes predictions that are validated against both the training and test data. The delta between the actual values and the predicted values is called the error. The objective is to select the model with the minimum error.
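As a minimal sketch of this workflow (assuming scikit-learn is available; the data here is synthetic and purely illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y is roughly 2*x plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + rng.normal(0, 1, size=100)

# Separate the data points into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train on the training data and predict on the test data
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# The delta between actual and predicted values is the error
errors = y_test - y_pred
print(errors[:5])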

There are multiple methods to analyze errors and measure the accuracy of predictions. The following are the most commonly used methods.

Mean Square Error/Root Mean Square Error

In this approach, the error is first calculated for each data point as the difference between the actual result value and the predicted result value. These errors then need to be combined to obtain the total error of a specific model. But errors can be positive or negative on different data points, and simply adding them would let them cancel each other out, giving an incorrect result. To overcome this issue, mean square error (MSE) squares each calculated error so that positive and negative error values cannot cancel each other, and then takes the mean of all squared error values to determine the overall accuracy.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

  • $n$: number of data points

  • $Y_i$: actual value at a specific test data point

  • $\hat{Y}_i$: predicted value at a specific test data point

There is another variant of MSE where the square root of MSE is performed to provide the root mean square error (RMSE) value. The lower the MSE/RMSE, the higher the accuracy in prediction and the better the model.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2}{n}}$$

Both MSE and RMSE use the mean as the baseline, and that is their biggest weakness, because the mean is inherently sensitive to outliers. In simple terms, assume there are four data points provided to model A and model B. Table 6-1 lists the errors obtained on each data point for each model.
Table 6-1

Sample Data and Errors in ML Models

Data Point | Error in Model A | Error in Model B

D1 | 0.45 | 0

D2 | 0.27 | 0

D3 | 0.6 | 0

D4 (Outlier) | 0.95 | 1.5

Model B appears to be the better model because it predicted with 100 percent accuracy on every data point except one, an outlier on which it gave a large error. Model A, on the other hand, gave small errors on every data point.

But if we compare the mean squared errors of the two models, model A ends up with a lower MSE than model B, which gives an incorrect interpretation of model accuracy.
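To make this concrete, here is a quick sketch (NumPy assumed) that computes the MSE of both models directly from the error values in Table 6-1:

import numpy as np

errors_a = np.array([0.45, 0.27, 0.6, 0.95])  # Model A errors from Table 6-1
errors_b = np.array([0.0, 0.0, 0.0, 1.5])     # Model B errors from Table 6-1

print("MSE Model A:", np.mean(errors_a ** 2))  # about 0.38
print("MSE Model B:", np.mean(errors_b ** 2))  # about 0.56, inflated by the single outlier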

Mean Absolute Error

Similar to MSE, in the mean absolute error (MAE) approach the error is first calculated as the difference between the actual result value and the predicted result value, but rather than squaring the error value, MAE takes its absolute value. Finally, the mean of all absolute error values is taken to determine the overall accuracy.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|$$

Mean Absolute Percentage Error

One of the biggest challenges with MAE, MSE, and RMSE is that they produce absolute values that can be used to compare the accuracy of multiple models against each other, but there is no way to judge the effectiveness of an individual model in isolation because there is no defined scale against which to compare that absolute value. The output of these methods can take any value; there is no fixed range.

Instead of calculating the absolute error, MAPE measures accuracy as a percentage by first dividing the error by the actual value at each data point and then taking the mean of these ratios.

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|$$

MAPE is a useful metric and one of the most commonly used methods to calculate accuracy because the output lies between 0 and 100, enabling intuitive analysis of the model’s performance. It works best if there are no extremes or outliers and no zeros in data point values.
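As a small sketch (NumPy assumed, with made-up actual and predicted values), MAE and MAPE can be computed directly from arrays:

import numpy as np

actual = np.array([100.0, 80.0, 120.0, 90.0])      # illustrative actual values
predicted = np.array([110.0, 75.0, 118.0, 102.0])  # illustrative predicted values

mae = np.mean(np.abs(actual - predicted))                    # absolute error, same units as the data
mape = 100 * np.mean(np.abs((actual - predicted) / actual))  # error as a percentage of the actual values

print("MAE:", mae)
print("MAPE:", mape)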

Root Mean Squared Log Error

One of the challenges with RMSE is the effect of outliers on the overall RMSE value. To minimize (and sometimes nullify) the effect of these outliers, there is another method called root mean squared log error (RMSLE), a variant of RMSE that uses the logarithm to calculate the relative error between the predicted and actual values instead of the absolute error.

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log\left(\hat{Y}_i + 1\right) - \log\left(Y_i + 1\right)\right)^2}$$

In cases where there is a huge difference between the actual and predicted values, the error becomes large and is penalized heavily by RMSE but handled well by RMSLE. Also note that when either the predicted value or the actual value is zero, the log of zero is undefined; that is why 1 is added to both the predicted and actual values to avoid this situation.
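The following sketch (NumPy assumed, values made up for illustration) uses np.log1p, which computes log(x + 1), to contrast RMSE and RMSLE on data spanning very different scales:

import numpy as np

actual = np.array([0.0, 10.0, 100.0, 1000.0])   # includes a zero and a large value
predicted = np.array([1.0, 12.0, 90.0, 1200.0])

rmse = np.sqrt(np.mean((predicted - actual) ** 2))
rmsle = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

print("RMSE:", rmse)    # dominated by the error on the largest data point
print("RMSLE:", rmsle)  # relative error on a log scale, far less affected by that point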

Coefficient of Determination (R2 Score)

In statistics, R2 is defined as the “proportion of the variation in the dependent variable that is predictable from the independent variable(s).” It measures the strength of the relationship between the model and the dependent variable on a convenient 0–100 percent scale.
  • 0 percent represents a model that does not explain any of the variations in the response variable around its mean.

  • 100 percent represents a model that explains all the variation in the response variable around its mean.

After fitting a linear regression model, it is important to determine how well the model fits the data. Usually, the larger the R2, the better the regression model fits your observations; however, there are caveats that are beyond the scope of this book.
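As a brief sketch (assuming scikit-learn is available, with illustrative data), R2 can be obtained after fitting a linear regression model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Illustrative data: y grows roughly linearly with x
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.8, 18.3, 19.9])

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
print("R2 score:", r2)  # close to 1 because the model explains most of the variation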

Having understood the various ways to measure the accuracy of models, we will now proceed with our first AIOps use case, which is deduplication of events.

Deduplication

Deduplication is one of the most fundamental features of the event management function; it reduces noise while processing millions of events arriving from the multiple tools that monitor the underlying infrastructure and applications. Consider a simple and common scenario: a batch processing job is triggered, causing CPU utilization to cross the warning threshold of 70 percent, and within 15 minutes a backup starts, pushing CPU utilization to 90 percent and generating a critical event. This event is repeated for every poll that the monitoring system makes against the server, and once the scheduled job is over, CPU utilization returns to its normal state. Within this period, however, it has generated four events with different timestamps, and all of these events represent the same issue of high CPU utilization on one server. There is a plethora of such operational scenarios, both simple and complex, that generate duplicate events.

As shown in Figure 6-3, the deduplication function manages this complexity by validating the uniqueness of each new event against the existing events in the system, dropping it if a match is found while incrementing the counter of the original alert and updating its timestamp to that of the latest occurrence. This ensures that the event console is not cluttered with the same event multiple times.

(Illustration: rows of events containing the duplicate items A, B, C, and D in the original data are collapsed by the deduplication function into a single instance of each.)

Figure 6-3

Deduplication function

To implement the deduplication function, a unique key needs to be defined that identifies a unique issue and its context. This unique key is created dynamically from information carried in the event, such as the hostname, source, service, location, class, etc. A rule or algorithm can then be configured to match this unique key against all events in the system and perform the deduplication. Please note that this AIOps function does not need machine learning and can be implemented with a simple rule-based deduplication system.

Let’s observe the practical implementation of the deduplication function. This implementation will be reading alerts from an input file.

In our code, we will be using Pandas, a powerful Python library for analyzing and processing the alerts, along with SQLite, a lightweight disk-based database used to store the alerts after processing. Unlike client-server database management systems, SQLite doesn't require a separate server process and allows access to the database using a nonstandard variant of the SQL query language. SQLite is popular as an embedded database for local/client storage in software such as web browsers. SQLite stores the entire database (definitions, tables, indices, and the data itself) as a single cross-platform file on the host machine.

After reading alerts from an input file, a unique key called EMUniqueKey will be created dynamically to determine the uniqueness of each issue and drive the deduplication function. At the end of the code, we will determine how many alerts were genuine and how many were duplicates.

In this code we will be using the Python libraries mentioned in Table 6-2.
Table 6-2

Python Libraries Used in the Code

Library Name | Purpose

matplotlib | A data visualization library containing various functions to create charts and graphs for analysis. It supports multiple platforms and hence works on many operating systems and graphics backends.

Pandas | Extensively used for data manipulation tasks. It is built on top of two core Python libraries: matplotlib for data visualization and NumPy for performing mathematical operations.

sqlite3 | Provides a DB-API 2.0 interface for SQLite databases.

Open a new Jupyter notebook or load the already provided notebook from the GitHub download. Download the notebook from https://github.com/dryice-devops/AIOps/blob/main/Ch-6_Deduplication.ipynb.

Let’s start by importing the required libraries.
import pandas as pd
import sqlite3
from matplotlib import pyplot as plt
Now read data from file and perform descriptive analysis to understand it.
raw_events = pd.read_excel("Alert-Bank.xlsx")
Let’s find how many rows and columns we have in a dataset.
raw_events.shape
In the output, as shown in Figure 6-4, we have 1,051 total sample events in the alert bank, and each event has nine columns (event slots).

Out[2]: (1051, 9)

Figure 6-4

Count of data points in input file

Let’s list all columns with their data types and the number of non-null values in each column.
raw_events.info()
As per the output shown in Figure 6-5, each column has 1,051 non-null values, which means there are no null values in any column. If a column did contain null values, it would need to be processed before proceeding further. Usually, if the count of null values is small, the rows containing them can be removed completely; otherwise, the null values can be replaced with the column's mean or median value.

(Output of raw_events.info(): the nine columns Host, AlertTime, AlertDescription, AlertClass, AlertType, AlertManager, Source, Status, and Severity, each with 1,051 non-null values; dtypes: datetime64[ns](1), int64(2), object(6); memory usage: 74.0+ KB.)

Figure 6-5

Column details of the input dataset
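As a hedged illustration of the null handling just described (not needed here, since this alert bank has no nulls), the typical pandas approach would look like this:

# Drop rows that contain any null values (appropriate when only a few rows are affected)
clean_events = raw_events.dropna()

# Or replace nulls in a numeric column with its median (or mean) value
# raw_events['Status'] = raw_events['Status'].fillna(raw_events['Status'].median())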

Next, it is important to understand the values contained in these slots. For that, we execute the head() command to check the values of the first five rows, along with transpose() for better visibility of the data. In the output snippet shown in Table 6-3, we can see all nine fields of the first few events from the alert bank.
raw_events.head().transpose()
Table 6-3

Output Snippet Showing Sample Events

(Output snippet: the first events from the alert bank shown transposed, with one column per event and one row per field: Host, AlertTime, AlertDescription, AlertClass, AlertType, AlertManager, Source, Status, and Severity.)

Based on the output, we can observe that each event contains the following details:
  • Host: The CI (configuration item) from which the event was generated.

  • AlertTime: The timestamp at which the event was generated on the CI.

  • AlertDescription: Details of the issue for which the event was generated.

  • AlertClass: The class to which the event belongs, such as the operating system.

  • AlertType: The type to which the event belongs, such as server.

  • AlertManager: The manager that captured the issue and generated the event. It's usually an underlying monitoring tool such as Nagios, Zabbix, etc.

  • Source: The source for which the issue was detected, such as CPU, memory, etc.

  • Status: The event status, which is a numerical value.

  • Severity: The event severity, which is again a numerical value.

For the purposes of this example, we have assigned a unique code to the different values in the Status field as well as the Severity field. The values assigned to the event status are shown in Table 6-4, and the values assigned to the event severity are shown in Table 6-5. These mappings vary a lot across different tools.
Table 6-4

Event Status Code Description

Status Value | Meaning

1 | Open state of event

2 | Acknowledge state of event

3 | Resolved state of event

Table 6-5

Event Severity Code Description

Severity Value | Meaning

1 | Critical severity of event

2 | Warning severity of event

3 | Information severity of event

Let's list the unique values of the Status column in the dataset.
raw_events['Status'].unique()
As per the output in Figure 6-6, there are three distinct event status values present in the input file.

Out[9]: array([3, 1, 2])

Figure 6-6

Unique values of the Status column

Based on the analysis done so far, we can conclude that each event has a specific AlertTime, was generated from the device identified in Host with an event source captured in Source, and belongs to a specific category defined by AlertClass and AlertType. The event was captured by the monitoring tool named in AlertManager, with its severity and status carried as enumerated values in Severity and Status. The issue details are provided in AlertDescription.

For the purpose of the deduplication example, let's name the deduplication key EMUniqueKey and create it as a combination of the Host, Source, AlertType, AlertManager, and AlertClass fields concatenated with the separator ::, as shown in Figure 6-7. Then let's add it to our dataset for each event.

EMUniqueKey = Host (e.g., 10.10.10.1) :: Source (e.g., Tablespace) :: AlertType (e.g., Oracle) :: AlertManager (e.g., OEM) :: AlertClass (e.g., Database)

Figure 6-7

Unique key composition

raw_events['EMUniqueKey'] = (raw_events['Host'].str.strip() + "::"
                             + raw_events['Source'].str.strip() + "::"
                             + raw_events['AlertType'].str.strip() + "::"
                             + raw_events['AlertManager'].str.strip() + "::"
                             + raw_events['AlertClass'].str.strip())
raw_events['EMUniqueKey'] = raw_events['EMUniqueKey'].str.lower()
raw_events = raw_events.sort_values(by="AlertTime")
raw_events["AlertTime"] = raw_events["AlertTime"].astype(str)

EMUniqueKey is a generic identifier that uniquely identifies the underlying issues to provide the required context for the purpose of performing deduplication.

Let’s observe the data after adding the unique identifier in the dataset.
raw_events.head().transpose()
As observed in the output in Table 6-6, we have a new field added called EMUniqueKey in the dataset for each event.
Table 6-6

Updated Dataset After Adding a Unique Identifier

(Output snippet: the transposed dataset now shows EMUniqueKey as an additional row beneath the original nine fields.)

Now we are all set to perform the deduplication function on the given dataset.

Before we proceed, it is important to note that we could read and process the event fields by referring to them with positional indexes (0, 1, 2, etc.), but that would limit scalability if new fields need to be added later. To avoid the scalability issue, we will use a dictionary datatype so that we can refer to each event field by its name, as shown in the following function:
def transform_dictionary(cursor, row):
    # Map each column name from the cursor description to its value in the row
    key_value = {}
    for idx, col in enumerate(cursor.description):
        key_value[col[0]] = row[idx]
    return key_value

Now values will be stored as a key-value pair, and we can directly refer to any event field with its name.
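For instance, in a small throwaway sketch (not part of the chapter's notebook), setting a connection's row_factory to transform_dictionary lets query results be accessed by field name:

con = sqlite3.connect(':memory:')        # temporary in-memory database, for illustration only
con.row_factory = transform_dictionary
cur = con.cursor()
cur.execute("CREATE TABLE demo (Host text, Severity text)")
cur.execute("INSERT INTO demo VALUES ('USPRDPAYROLAPP01', '1')")
row = cur.execute("SELECT * FROM demo").fetchone()
print(row['Host'])                       # referenced by field name instead of row[0]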

Let's define a class to contain a few critical functions for SQLite. First, a connection with the database is created in the initialization function __init__.
class Database:
    # Create the DB connection used to store and process events
    def __init__(self):
        self._con = sqlite3.connect('de_duplicationv5.db')
        self._con.row_factory = transform_dictionary
        self.cur = self._con.cursor()
        self.createTable()
    def getConn(self):
        return self._con

The function getConn returns the database connection.

Next, create the required tables in the database. Though we need only one table, dedup, to store the final deduplicated events, we will also create a table called archive to store the duplicate events. The archive table will be used to quantify the benefits of the deduplication function.
def createTable(self):
        self.cur.execute('''CREATE TABLE if not exists dedup
                (AlertTime date, EMUniqueKey text, Host text,
                AlertDescription text, AlertClass text, AlertType text,
                AlertManager text, Source text, Status text,
                Severity text, Count real)''')
        self.cur.execute('''CREATE TABLE if not exists archive
                (AlertTime date, EMUniqueKey text, Host text,
                AlertDescription text, AlertClass text, AlertType text,
                AlertManager text, Source text, Status text,
                Severity text)''')
After reading the events, each one needs to be inserted either into the dedup table or into the archive table, depending on whether it's a new event or a duplicate. The function insert accepts the table name and the event values (as a dictionary) and dynamically builds the INSERT statement from the event's field names and values before executing it.
    def insert(self, table, values):
        # Build the column names, placeholders, and values from the event dictionary
        columns = []
        placeholders = []
        value = []
        for key in values:
            columns.append(key)
            placeholders.append("?")
            value.append(values[key])
        query = "Insert into {} ({}) values ({})".format(
            table, ",".join(columns), ",".join(placeholders))
        self.cur.execute(query, tuple(value))
We need the function execute to run database queries, as well as the function update to update an event's count, severity, and status in the dedup table whenever a repeated occurrence arrives for an event that is already present in the system in an open state.
    def execute(self, query):
        self.cur.execute(query)
    def update(self, query, values):
        # print(query)
        return self.cur.execute(query, values)
Finally, we need supporting functions to fetch record(s), commit transactions, and close the connection.
    def fetchOne(self, query):
        self.cur.execute(query)
        return self.cur.fetchone()
    def fetchAll(self, query):
        self.cur.execute(query)
        return self.cur.fetchall()
    def commit(self):
        self._con.commit()
    def close(self):
        self._con.close()
Now let's start reading and processing the event data iteratively from the event bank. In an AIOps solution, this processing happens as part of real-time stream processing rather than batch processing.
db = Database()
for item in raw_events.iterrows():
    # read one event as a dictionary
    data = dict(item[1])
    print("Input Data", data)
    # look for an open event with the same unique key
    dedupData = db.fetchOne("Select * from dedup where EMUniqueKey='{}' "
                            "and Status != 3".format(data["EMUniqueKey"]))
    if dedupData:
        # duplicate: increase the count, update the original, and archive the new event
        count = dedupData['Count'] + 1
        query = ("Update dedup set Count=?, AlertDescription=?, "
                 "Severity=?, Status=? where EMUniqueKey=? and Status=?")
        db.update(query, (count, data['AlertDescription'],
                          data["Severity"], data["Status"],
                          data["EMUniqueKey"], dedupData['Status']))
        db.insert("archive", data)
        db.commit()
    else:
        # new issue: insert into the dedup table with a count of 1
        data['Count'] = 1
        db.insert("dedup", data)
        db.commit()

In the preceding code, each event, including the EMUniqueKey that was generated earlier from the event slots, is converted to a dictionary and stored in the data variable. The dedup table is then checked to find out whether there is any open event with the same EMUniqueKey. If a matching event is found in the dedup table, the new event is a duplicate; the original event in the dedup table is updated with the new event's description, severity, and status, and the count field of the older event is increased by 1.

This duplicate event is then stored in the archive table for later analysis. If no matching event is found in the dedup table, the new incoming event represents a new issue and is stored as a new entry in the dedup table.

After processing all events in the alert bank, let’s find out how much noise (duplicate events) was filtered by the deduplication function. We will compare the events present in the archive table and dedup table and then use the Python library matplotlib to plot the pie chart for this analysis.
df_dedup = pd.read_sql("select * from dedup" , Database().getConn())
df_archive = pd.read_sql("select * from archive", Database().getConn())
fig, ax = plt.subplots()
source = ["Actual-Issues", "Duplicate Event - Noise"]
colors = ['green','yellow']
values = [len(df_dedup), len(df_archive)]
ax.pie(values, labels=source, colors=colors, autopct='%.1f%%', shadow=True)
plt.show()
Based on the pie chart in Figure 6-8, the deduplication function has filtered out 70.6 percent of the events as duplicates, leaving only 29.4 percent as actual events for the IT operations team. This is a considerable reduction in the volume of events to be processed, while the useful information carried by the new duplicate events (updated severity, status, and count) is retained.

(Pie chart: Actual Issues, 29.4 percent; Duplicate Event Noise, 70.6 percent.)

Figure 6-8

Deduplication function noise removal

Let's observe which CIs are the top noise makers in the data, as shown in Figure 6-9.
df_dedup = df_dedup.sort_values(by="Count", ascending=False)
df_dedup[:10].plot(kind="bar", x="Host", y="Count")
plt.title("Top 10 hosts with most de-duplications", y=1.02);
Figure 6-9 shows the top 10 CIs that generate the most events in the environment; these need to be analyzed by the problem management and capacity management teams.

(Bar graph of deduplication counts per host: Payroll_SAP_GTS tops the list with 8, followed by hosts such as USPRDPAYROLAPP01, AUPRDPAYROLWEB03, and UKDEVPAYROLDBA02 with 5 each.)

Figure 6-9

Top 10 noisy hosts

This completes our first use case of deduplication in AIOps.

Summary

In this chapter, we covered how to set up the environment using Anaconda for running the use cases. We covered our first use case, which was deduplication. This use case does not use any machine learning but relies on rule-based correlation to deduplicate events. In the next chapter, we will cover another important use case in AIOps: automated baselining.
