In the previous chapters, you learned about the various machine learning techniques and how they are used in general. In this chapter, you will see the implementation of a specific use case of deduplication in AIOps along with practical scenarios where it is used in operations and business.
Environment Setup
- Download Python from https://www.python.org/downloads/ and install it directly with the installer. Other packages then need to be installed explicitly on top of Python.
- Use Anaconda, which is a Python distribution made for large-scale data processing, predictive analytics, and scientific computing requirements. The Anaconda distribution is the easiest way to perform Python coding, and it works on Linux, Windows, and macOS. It can be downloaded from https://www.anaconda.com/distribution/.
- Use cloud services. This is the simplest of all the options but needs Internet connectivity. Cloud services such as Microsoft Azure Notebooks and Google Colaboratory are popular and available at the following links:
Microsoft Azure Notebooks: https://notebooks.azure.com/
Google Colaboratory: https://colab.research.google.com
We will be using the second option and will set up an environment using the Anaconda distribution.
Software Installation
1. Click Continue.
2. On the Read Me step, click Continue.
3. Accept the license agreement, and click Continue.
4. Select the installation location, and click Continue.
5. The installation will take about 500MB of space on the computer. Click Install.
6. Once the Anaconda application gets installed, click Close and proceed to the next step to launch the application.
Launch Application
After installing Anaconda, open the command-line terminal and type jupyter notebook.
This will launch the Jupyter server listening on port 8888. Usually, a pop-up window with the default browser will automatically open, or you can log in to the application by opening any web browser and opening the URL http://localhost:8888/, as shown in Figure 6-2.
A blank notebook will launch in a new window. You can just type commands in a cell and click the Run button to execute them. To test that the environment is working perfectly, type a print("Hello World") statement in the cell and execute it.
Python Notebooks are used in the book for demonstrating the AIOps use cases of deduplication, dynamic baselining, and anomaly detection; they are available at https://github.com/dryice-devops/AIOps.git. You can download the source code to follow along with the exercises in this chapter.
The environment is now ready to develop and test algorithmic models for multiple AIOps use cases. Before getting into model development, let’s understand a few terms that are going to be used to analyze and measure the performance of models.
Performance Analysis of Models
In practice, multiple models need to be created to make predictions, and it is important to quantify the accuracy of their prediction so that the best-performing model can be selected for production use.
In supervised ML, a set of actual data points is separated into training data and test data. The model learns from the training data and makes predictions that get validated against both the training data and the test data. The delta between actual values and predicted values is called the error. The objective is to select the model with the minimum error.
There are multiple methods to analyze errors and measure the accuracy of predictions. The following are the most commonly used methods.
Mean Square Error/Root Mean Square Error
In this approach, the error is first calculated for each data point by subtracting the actual result value from the predicted result value. These errors need to be added together to calculate the total error of a specific model. But errors can be positive or negative on different data points, and adding them directly would let them cancel each other out, giving an incorrect result. To overcome this issue, mean square error (MSE) squares each calculated error so that positive and negative error values cannot cancel each other out, and then takes the mean of all squared error values to determine the overall accuracy.
MSE = (1/n) × Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

where:
- n = number of data points
- Yᵢ = actual value at a specific test data point
- Ŷᵢ = predicted value at a specific test data point
There is another variant of MSE where the square root of MSE is performed to provide the root mean square error (RMSE) value. The lower the MSE/RMSE, the higher the accuracy in prediction and the better the model.
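As a minimal sketch (assuming NumPy is available), MSE and RMSE can be computed directly from the formula above:

```python
import numpy as np

def mse(actual, predicted):
    """Mean square error: the mean of squared differences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

def rmse(actual, predicted):
    """Root mean square error: the square root of MSE."""
    return np.sqrt(mse(actual, predicted))

# Hypothetical values, purely for illustration
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(mse(y_true, y_pred))   # 0.875
print(rmse(y_true, y_pred))
```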
Sample Data and Errors in ML Models
Data Point | Error in Model A | Error in Model B
---|---|---
D1 | 0.45 | 0
D2 | 0.27 | 0
D3 | 0.6 | 0
D4 (Outlier) | 0.95 | 1.5
Model B appears to be the better model because it predicted with 100 percent accuracy except on one data point, which was an outlier and gave a high error. On the other hand, model A gave small errors on almost every data point.
But if we compare the models using MSE, model A (MSE ≈ 0.38) scores lower than model B (MSE ≈ 0.56), because squaring amplifies model B's single outlier error, which gives an incorrect interpretation of model accuracy.
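Aggregating the errors from the table numerically makes the effect visible; this short sketch uses the table's values directly:

```python
errors_a = [0.45, 0.27, 0.6, 0.95]  # Model A: small error on every point
errors_b = [0.0, 0.0, 0.0, 1.5]     # Model B: perfect except one outlier

def mean_error(errors):
    """Plain mean of the (absolute) errors."""
    return sum(errors) / len(errors)

def mean_squared_error(errors):
    """Mean of the squared errors (MSE)."""
    return sum(e ** 2 for e in errors) / len(errors)

# By plain mean, model B looks better (0.375 vs. 0.5675) ...
print(mean_error(errors_a), mean_error(errors_b))
# ... but MSE ranks model A better (0.384475 vs. 0.5625),
# because squaring amplifies B's single outlier error
print(mean_squared_error(errors_a), mean_squared_error(errors_b))
```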
Mean Absolute Error
Similar to MSE, in the mean absolute error (MAE) approach the error is first calculated as the difference between the actual result value and the predicted result value, but rather than squaring the error value, MAE takes its absolute value. Finally, the mean of all absolute error values is taken to determine the overall accuracy.
Mean Absolute Percentage Error
One of the biggest challenges of MAE, MSE, and RMSE is that they produce absolute values: they can be used to compare the accuracies of multiple models, but there is no way to validate the effectiveness of an individual model because there is no defined scale against which to judge the absolute value. The output of these methods can take any value; there is no bounded range.
Instead of calculating the absolute error, MAPE measures accuracy as a percentage by first dividing the absolute error by the actual value for each data point and then taking the mean of these ratios.
MAPE is a useful metric and one of the most commonly used methods to calculate accuracy because the output lies between 0 and 100, enabling intuitive analysis of the model’s performance. It works best if there are no extremes or outliers and no zeros in data point values.
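Both metrics can be sketched in a few lines (assuming NumPy; note that MAPE divides by the actual values, so the sketch assumes no zeros among them):

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: the mean of absolute differences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted))

def mape(actual, predicted):
    """Mean absolute percentage error; assumes no zeros in actual."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Hypothetical values, purely for illustration
y_true = [100.0, 200.0, 150.0]
y_pred = [110.0, 190.0, 150.0]
print(mae(y_true, y_pred))   # 6.666...
print(mape(y_true, y_pred))  # 5.0 (percent)
```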
Root Mean Squared Log Error
One of the challenges with RMSE is the effect of outliers on the overall RMSE value. To minimize (and sometimes nullify) the effect of these outliers, there is another method called root mean squared log error (RMSLE), a variant of RMSE that uses the logarithm to calculate the relative error between the predicted and actual values instead of the absolute error.
In cases where there is a huge difference between the actual and predicted values, the error becomes large, resulting in a heavy penalty from RMSE but being handled well by RMSLE. Also note that when either the predicted value or the actual value is zero, the log of zero is undefined; that's why 1 is added to both the predicted and actual values to avoid such situations.
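A minimal RMSLE sketch, using NumPy's log1p (which computes log(1 + x) and thus handles zero values exactly as described above):

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared log error: RMSE on log(1 + x) values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((np.log1p(actual) - np.log1p(predicted)) ** 2))

# Relative error: a 10 percent miss gives a similar RMSLE at any scale
print(rmsle([100.0], [110.0]), rmsle([10000.0], [11000.0]))
# Zero values are safe because of the +1 inside log1p
print(rmsle([0.0], [0.0]))  # 0.0
```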
Coefficient of Determination (R² Score)

The R² score represents the percentage of the variation in the response variable that is explained by the model:
- 0 percent represents a model that does not explain any of the variation in the response variable around its mean.
- 100 percent represents a model that explains all the variation in the response variable around its mean.
After fitting a linear regression model, it is important to determine how well the model fits the data. Usually, the larger the R2, the better the regression model fits your observations; however, there are caveats that are beyond the scope of this book.
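The definition above (variance explained around the mean) can be sketched as 1 minus the ratio of the residual sum of squares to the total sum of squares (assuming NumPy):

```python
import numpy as np

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)          # residual sum of squares
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)    # total variation around mean
    return 1 - ss_res / ss_tot

print(r2_score([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0 (explains all variation)
print(r2_score([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # 0.0 (no better than the mean)
```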
Having understood the various ways to measure accuracy of models, we will now proceed with our first AIOps use case, which is a deduplication of events.
Deduplication
To implement the deduplication function, a unique key needs to be defined that identifies each distinct issue and its context. This unique key gets created dynamically using information coming in the event, such as the hostname, source, service, location, class, etc. A rule or algorithm can be configured to match this unique key across all events in the system and perform the deduplication function. Please note that this AIOps function does not need machine learning and can be done simply using a rule-based deduplication system.
Let’s observe the practical implementation of the deduplication function. This implementation will be reading alerts from an input file.
In our code, we will be using Pandas, a powerful Python library, to analyze and process alerts, along with the SQLite DB, a lightweight disk-based database, to store alerts after processing. Unlike client–server database management systems, SQLite doesn't require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. SQLite is popular as an embedded database for local/client storage in software such as web browsers. SQLite stores the entire database (definitions, tables, indices, and the data itself) as a single cross-platform file on a host machine.
After reading alerts from an input file, a unique key of EMUniqueKey will be dynamically created to determine the uniqueness of the issue and execute the deduplication function. At the end of the code, we will determine how many alerts were genuine alerts in our code and how many were duplicates.
Python Libraries Used in the Implementation
Library Name | Purpose
---|---
matplotlib | A data visualization library containing various functions to create charts and graphs for analysis. It supports multiple platforms and hence works on many operating systems and graphics backends.
Pandas | Extensively used for data manipulation tasks. It is built on top of two core Python libraries: matplotlib for data visualization and NumPy for mathematical operations.
sqlite3 | Provides a DB-API 2.0 interface for SQLite databases.
Open a new Jupyter notebook or load the already provided notebook from the GitHub download. Download the notebook from https://github.com/dryice-devops/AIOps/blob/main/Ch-6_Deduplication.ipynb.
Output Snippet Showing Sample Events
Host: Mentions the CI (configuration item) from which the event was generated.
AlertTime: Mentions the timestamp when the event was generated at the CI.
AlertDescription: Mentions details about the issue for which the event was generated.
AlertClass: Mentions the class to which the event belongs, such as the operating system.
AlertType: Mentions the type to which the event belongs, such as the server.
AlertManager: Mentions the manager that captured the issue and generated the event. It's usually an underlying monitoring tool such as Nagios, Zabbix, etc.
Source: Mentions the source for which the issue was detected, such as the CPU, memory, etc.
Status: Mentions the event status, which is a numerical value.
Severity: Mentions the event severity, which is again a numerical value.
Event Status Code Description
Status Value | Meaning
---|---
1 | Open state of event
2 | Acknowledged state of event
3 | Resolved state of event
Event Severity Code Description
Severity Value | Meaning
---|---
1 | Critical severity of event
2 | Warning severity of event
3 | Information severity of event
Based on the analysis done, we can conclude that each event has a specific AlertTime, was generated by the device mentioned in Host from the event source captured in Source, and belongs to a specific category defined by AlertClass and AlertType. The event was captured by the monitoring tool mentioned in AlertManager, with Severity and Status recorded as enumerated values. Issue details are provided in AlertDescription.
EMUniqueKey is a generic identifier that uniquely identifies the underlying issues to provide the required context for the purpose of performing deduplication.
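A minimal sketch of how such a key can be derived with Pandas follows. The field names match the dataset described above, but the sample rows, the choice of key fields, the separator, and the concatenation order are illustrative assumptions; the book's notebook may differ:

```python
import pandas as pd

# Hypothetical sample events; the real notebook reads these from an input file
events = pd.DataFrame([
    {"Host": "srv01", "AlertClass": "OS", "AlertType": "Server",
     "AlertManager": "Nagios", "Source": "CPU"},
    {"Host": "srv01", "AlertClass": "OS", "AlertType": "Server",
     "AlertManager": "Nagios", "Source": "CPU"},
    {"Host": "srv02", "AlertClass": "OS", "AlertType": "Server",
     "AlertManager": "Zabbix", "Source": "Memory"},
])

# Concatenate the context fields to form a unique identifier per issue
key_fields = ["Host", "AlertClass", "AlertType", "AlertManager", "Source"]
events["EMUniqueKey"] = events[key_fields].agg(":".join, axis=1)

print(events["EMUniqueKey"].nunique())  # 2 unique issues out of 3 events
```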
Updated Dataset After Adding a Unique Identifier
A table of output deduplication with 3 columns and 10 rows. The columns are 0, 1, 2, and 3. The heading of the rows is Host, Alert time, Alert description, Alert class, alert class, alert type, Alert manager, Source, Status, Severity, and E M Unique Key. |
Now we are all set to perform the deduplication function on the given dataset.
Each event's values will now be stored as key-value pairs, so we can refer to any event field directly by its name.
The function getConn will create a connection with the database.
In the previous code, EMUniqueKey is first generated dynamically from the event fields and stored in the data variable (a DataFrame). Then the dedup table is checked to find whether there is any open event with the same EMUniqueKey. If a matching event is found in the dedup table, the new event is a duplicate event. Hence, the original event in the dedup table gets updated with the new event's description, severity, and status, and the count field of the older event gets incremented by 1.
The duplicate event then gets stored in the archive for later analysis. If no matching event is found in the dedup table, the new incoming event represents a new issue and gets stored as a new entry in the dedup table.
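This flow can be sketched with sqlite3 as follows. The schema, table names (dedup, archive), column names, and helper functions here are illustrative assumptions rather than the notebook's exact code (the count field is named dup_count to avoid clashing with SQL's COUNT):

```python
import sqlite3

def get_conn(path=":memory:"):
    """Create a DB connection; an in-memory DB is used here for illustration."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS dedup (
        EMUniqueKey TEXT PRIMARY KEY, AlertDescription TEXT,
        Severity INTEGER, Status INTEGER, dup_count INTEGER)""")
    conn.execute("""CREATE TABLE IF NOT EXISTS archive (
        EMUniqueKey TEXT, AlertDescription TEXT,
        Severity INTEGER, Status INTEGER)""")
    return conn

def process_event(conn, event):
    """Deduplicate one event: update + archive if already seen, else insert."""
    row = conn.execute("SELECT dup_count FROM dedup WHERE EMUniqueKey = ?",
                       (event["EMUniqueKey"],)).fetchone()
    if row:  # duplicate: refresh the original, bump its count, archive the copy
        conn.execute("""UPDATE dedup SET AlertDescription = ?, Severity = ?,
                        Status = ?, dup_count = dup_count + 1
                        WHERE EMUniqueKey = ?""",
                     (event["AlertDescription"], event["Severity"],
                      event["Status"], event["EMUniqueKey"]))
        conn.execute("INSERT INTO archive VALUES (?, ?, ?, ?)",
                     (event["EMUniqueKey"], event["AlertDescription"],
                      event["Severity"], event["Status"]))
        return "duplicate"
    conn.execute("INSERT INTO dedup VALUES (?, ?, ?, ?, 1)",
                 (event["EMUniqueKey"], event["AlertDescription"],
                  event["Severity"], event["Status"]))
    return "new"

conn = get_conn()
e = {"EMUniqueKey": "srv01:OS:Server:Nagios:CPU",
     "AlertDescription": "CPU above 90%", "Severity": 1, "Status": 1}
print(process_event(conn, e))  # new
print(process_event(conn, e))  # duplicate
```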
This completes our first use case of deduplication in AIOps.
Summary
In this chapter, we covered how to set up the environment using Anaconda for running the use cases. We covered our first use case, which was deduplication. This use case does not use any machine learning but relies on rule-based correlation to deduplicate events. In the next chapter, we will cover another important use case in AIOps: automated baselining.