© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
N. Sabharwal, G. Bhardwaj, Hands-on AIOps, https://doi.org/10.1007/978-1-4842-8267-0_7

7. AIOps Use Case: Automated Baselining

Navin Sabharwal1   and Gaurav Bhardwaj1
(1)
New Delhi, India
 

We are continuing to discuss specific use cases of AIOps, and in this chapter we explain and implement automated baselining, which is one of the most important and frequently used features in AIOps.

Automated Baselining Overview

In traditional monitoring tools, a baseline is static and rule-based and is set at predetermined levels. An example of a baseline could be a CPU utilization threshold set at, say, 95 percent. An event gets triggered when the CPU utilization of a server goes above this threshold. The challenge with this approach is that baselines or thresholds cannot be static. Consider a scenario where an application runs a CPU-intensive data processing job at 2 a.m. local time and causes 99 percent CPU utilization until the job ends, say, at 2:30 a.m. This is the usual behavior of the application, but because of the static baseline threshold, an event gets generated. This alarm doesn’t need any intervention, as it will automatically close once the job is completed and the CPU utilization returns to normal. There are many such false positive events in operations due to this static baselining of thresholds.

Automated baselining helps in such scenarios because it considers the dynamic behavior of systems. It analyzes historical performance data to detect the real performance issues that need attention and eliminates the noise by adjusting the baseline threshold. In the given scenario, the system would automatically raise the threshold for the 2 a.m. to 2:30 a.m. window and would not generate an event for the scheduled job. However, if the same machine exhibits a CPU spike at, say, 9 a.m., it would generate an event.
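The idea of a time-aware threshold can be illustrated with a minimal sketch in plain Python. It assumes a very simple rule (mean plus three standard deviations per hour of day) and uses made-up history values; the function names `hourly_baselines` and `is_anomalous` are hypothetical and are not part of any monitoring tool.

```python
from statistics import mean, stdev

def hourly_baselines(history, k=3.0):
    """Return {hour: threshold}, where threshold = mean + k * stdev of that hour's samples."""
    by_hour = {}
    for hour, cpu in history:
        by_hour.setdefault(hour, []).append(cpu)
    return {hour: mean(values) + k * stdev(values) for hour, values in by_hour.items()}

def is_anomalous(hour, cpu, baselines):
    """Flag a reading only if it exceeds the learned threshold for that hour."""
    return cpu > baselines[hour]

# Hypothetical history: 14 days of (hour, cpu_percent) samples.
history = []
for day in range(14):
    jitter = (day % 3) - 1               # small day-to-day variation: -1, 0, +1
    history.append((2, 97.0 + jitter))   # nightly batch job around 2 a.m.
    history.append((9, 40.0 + jitter))   # normal business load at 9 a.m.

baselines = hourly_baselines(history)
print("2 a.m. reading of 99% anomalous?", is_anomalous(2, 99.0, baselines))
print("9 a.m. reading of 99% anomalous?", is_anomalous(9, 99.0, baselines))
```

With this rule, the same 99 percent reading is treated as normal at 2 a.m. and as an anomaly at 9 a.m. The models discussed later in the chapter replace this crude per-hour statistic with a learned forecast.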

It is important to note that the dynamic baseline of a threshold will work a bit differently in a microservices architecture where the underlying infrastructure will scale up based on increased utilization or load. These aspects of application architecture and infrastructure utilization need to be considered for dynamic baselining.

Noise in operations causes the following inefficiencies:
  • Increases volume of events for operations to manage, which in turn increases time and effort of operation teams.

  • Clutters the operator console with false positive events, which leads to missing important qualified and actionable events.

  • Causes failures in automated diagnosis and remediation. Since the event itself is false, the automated resolution would be triggered unnecessarily.

Automated baselining reduces noise and related inefficiencies. Automated baselining can be achieved by leveraging supervised machine learning techniques that learn from past data and predict dynamic thresholds based on daily, weekly, monthly, and yearly seasonality of the data. From an AIOps perspective, there are three prominent algorithms that are core to implementing automated baselining. We will explore these approaches in this chapter starting with regression algorithms.

Regression

This class of algorithms is used to determine the relationship between multiple variables and predict the value of a target variable. For example, based on the historical analysis of marketing spend, predict the next month’s sales revenue. Linear regression primarily belongs to the statistics domain but is extensively used by machine learning for predictive modeling and is widely useful in multiple domains such as predicting stock prices, sales to price ratio of product, housing prices, etc. There are multiple variants of the regression algorithm, but linear regression is one of the most-used regression algorithms in machine learning. Figure 7-1 lists some of the important regression algorithms available, and we will be discussing linear regression in the next section.

A circle is labeled as types of regression. The regression types given on the right are as follows: linear regression, polynomial regression, support vector regression, decision tree regression, random forest regression, ridge regression, lasso regression, and logistic regression.

Figure 7-1

Regression algorithms

Linear Regression

This algorithm is applicable where the dataset shows a linear relationship between input and output variables and the output variable has continuous values, such as percentage utilization of CPU, etc.

In linear regression, input variables, referred to as independent variables or features, from the dataset get chosen to predict the values of the output variable, referred to as the dependent variable or target value. Mathematically, the linear regression model establishes a relationship between the dependent variable (Y) and one or more independent variables (Xi) using a best-fit straight line (called the regression line) and is represented by the equation Y = a + bXi + e, where a is the intercept, b is the slope of the line, and e is the error in prediction from the actual value. When there are multiple independent variables, it’s called multiple linear regression.

Let’s first understand simple linear regression that has only one independent variable.

In Figure 7-2’s graph, there is a set of actual data points that exhibit a linear relationship, and an imaginary straight line runs somewhere in the middle of these data points.

A graph of dependent variable y versus independent variable X. Eight vertical lines, regression line with a dot on tip denotes actual data, joins at the top and bottom of the inclined line between y-intercept a, and y equals a plus bX plus epsilon. Two horizontal and vertical lines, delta X and delta y make a triangle with the regression line.

Figure 7-2

Simple linear regression

This imaginary straight line is the regression line, which consists of predicted data points. The distance between each actual data point and the line is called an error. Multiple lines can be drawn that pass through these data points, so the algorithm finds the best line by calculating the distances to all points and minimizing the total error. Whichever line gives the smallest error becomes the regression line and provides the best prediction. From an AIOps perspective, linear regression provides a predictive capability based on correlation, but it does not show causation.
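The search for the best line can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the error is measured as the sum of squared vertical distances (ordinary least squares), for which the intercept a and slope b have a closed-form solution. The data points and function names are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = y_mean - b * x_mean    # intercept
    return a, b

def sum_squared_error(xs, ys, a, b):
    """Total squared vertical distance between the points and the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points that are roughly linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 5.9, 8.2, 9.8]

a, b = fit_simple_linear_regression(xs, ys)
best_error = sum_squared_error(xs, ys, a, b)
shifted_error = sum_squared_error(xs, ys, a + 0.5, b)   # any other line does worse
print("intercept:", a, "slope:", b)
print("best error:", best_error, "error of a shifted line:", shifted_error)
```

Any other candidate line, such as the shifted one above, produces a larger total error than the fitted regression line.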

Now let’s understand the implementation of linear regression to forecast the database response time (in seconds) based on the memory utilization (MB) and CPU utilization (percent) of the server. Here we have two independent variables, namely, CPU and memory, so we will be using multiple linear regression models to predict the database response time based on the specific utilization value of CPU and memory.

You can download the code from https://github.com/dryice-devops/AIOps/blob/main/Ch-7_AutomatedBaselinig-Regression.ipynb.

Let’s begin by exporting data for the parameters CPU Utilization, Memory Utilization, and DB Response Time from the file https://github.com/dryice-devops/AIOps/blob/main/data.xlsx, which is shown in Table 7-1.
Table 7-1

CPU and Memory Utilization Mapping with DB Response Time

A table for C P U and memory utilization mapping has three columns with headers D B response time, C P U usage, and memory utilization M B.

Along with Pandas and Matplotlib, we will be using the additional Python libraries listed here:
  • NumPy: This is the most widely used library in Python development to perform mathematical operations on matrices, linear algebra, and Fourier transform.

  • Seaborn: This is another data visualization library that is based on the Matplotlib library and primarily used for generating informative statistical graphics.

  • Sklearn: This is one of the most important and widely used libraries for machine learning–related development as it contains various tools for modeling and predictions.

Let’s begin writing code by importing the required libraries and functions, as shown here:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
%matplotlib inline
import matplotlib as mpl
Now read the performance data from the input file and observe the top five values in the dataset.
df = pd.read_excel("data.xlsx")
df.head()
Figure 7-3 shows the sample values for CPU and memory utilization and the corresponding database response times.

A table for sample values for C P U and memory has three columns with headers D B response time, C P U usage, and memory utilization M B.

Figure 7-3

CPU and memory utilization mapping with DB response time

In the dataset, the values of the parameters DBResponseTime, CPUUsage, and MemoryUtilizationMB are stored in three different columns. Let’s find out how many data points are present in the dataset for analysis and prediction.
df.shape
As shown in Figure 7-4, there are a total of 24 rows in the dataset providing values for each of the three independent and dependent variables.

Out[3]: (24, 3)

Figure 7-4

Count of data points in DataSet

One of the important conditions for the applicability of linear regression algorithms is that there must exist a linear relationship between the input and output data points. Let’s first find out if the data satisfies this condition by plotting a graph between DB Response Time and CPU Usage.
sns.jointplot(x=df['DBResponseTime'], y=df['CPUUsage'],
              data=df, kind='reg')
As per the graph in Figure 7-5, there exists a linear relationship between CPU Utilization and DB Response Time.

A graph of C P U usage versus D B response time, where a line with scatter plots is graphed. The scatter plot has a tight pattern with an upward slope on either side of the line between (2.2, 6.8) and (6.5, 11) approximately. The vertical and horizontal bars with a curve line are graphed on the top and right windows, respectively.

Figure 7-5

Impact on DB ResponseTime due to CPU utilization

Similarly, let’s validate the linear relationship between DB Response Time and Memory Usage.
sns.jointplot(x=df['DBResponseTime'], y=df['MemoryUtilizationMB'],
              data=df, kind='reg')
Figure 7-6 also shows that a linear relationship exists between DB Response Time and Memory Usage, which satisfies the condition of applying the linear regression algorithm.

A graph of memory utilization M B versus D B response time, where a line with scatter plots is graphed. The scatter plot has a tight pattern with an upward slope on either side of the line between (2.2, 980) and (6.5, 1800) approximately. The vertical and horizontal bars with a curve line are graphed on the top and right windows, respectively.

Figure 7-6

Impact on DB ResponseTime due to memory utilization

Next, you need to extract the data points of the independent variables (CPU and Memory usage) in X and the dependent variable (DB Response Time) in Y.
X = df[['MemoryUtilizationMB','CPUUsage']]
Y = df['DBResponseTime']
As discussed in supervised learning, the entire labeled dataset needs to be split into training data for learning purposes and testing data for validating the accuracy or quality of learning done by the model. In our example, we use a ratio of 80 percent to 20 percent for splitting the dataset between training and testing data.
x_train, x_test, y_train, y_test = train_test_split(X, Y,
                        test_size = 0.2, random_state = 42)
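As a quick sanity check on the split, the following sketch computes how many rows land in each set. It assumes that scikit-learn rounds the test share up and gives the remainder to training, which is consistent with the five predictions shown later in Figure 7-7. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(24, 0.2)
print("training rows:", n_train, "testing rows:", n_test)
```

With only 24 rows, the test set is very small, so any quality metric calculated on it should be read as indicative rather than conclusive.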
At this stage, we will invoke LinearRegression from the Sklearn library and apply the training data to this linear regression model by calling its fit method, which creates the regression equation and learns from the data.
LR = LinearRegression()
# fitting the training data
LR.fit(x_train,y_train)
This model can now be used to predict values on the test data.
y_prediction =  LR.predict(x_test)
y_prediction
Based on the CPU and Memory Utilization values present in the test data, our model has forecasted DB Response Time Values, as shown in the output in Figure 7-7.

Out[8]: array([3.60063185, 5.02933512, 2.49967556, 5.33846418, 3.92287079])

Figure 7-7

Forecasted database response time on test dataset

Let’s compare the predicted values from the ML model with the actual observed values on the test dataset.
indices = np.arange(len(y_prediction))
width = 0.20
# Plotting
plt.bar(indices, y_prediction, width=width)
# Offsetting by width to shift the bars to the right
plt.bar(indices + width, y_test, width=width)
plt.xticks(ticks=indices)
plt.ylabel("DB Response Time")
plt.xlabel("Test Dataset ")
plt.title("Predicted vs. Actual")
valuesType=['Predicted','Actual']
plt.legend(valuesType,loc=2)
plt.show()
As observed in the output in Figure 7-8, predicted values are very close to the actual observed value. This indicates that the model is performing well.

A double bar graph of predicted versus actual, where the vertical axis represents D B response time, and the horizontal axis represents the test dataset. The values across the test data set are as follows. Predicted: 3.6, 5, 2.3, 5.2 and 3.9. Actual: 3.3, 4.9, 2, 5.4, and 4. All the values are approximate.

Figure 7-8

Accuracy analysis of DB Response Time prediction on test data

Let’s evaluate the performance of the model by calculating its R2 score.
score=r2_score(y_test,y_prediction)
print("r2 score is ", score)
print('Root Mean Squared Error is =',np.sqrt(mean_squared_error(y_test,
 y_prediction)))
r2 score is  0.9679128701220477
Root Mean Squared Error is = 0.2102141626503609

As per the output, the r2 score is 0.9679, which means that the model explains around 97 percent of the variation in the database response time on the test data for the given CPU and memory utilization values, which is quite good. Based on the complexity and requirements, this model can be further expanded to include other independent variables such as the number of transactions, user connections, available disk space, etc. This makes the model usable in production for capacity planning, upgrades, or migration-related tasks that improve DB response time, which will eventually improve the application performance and user experience.
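To make the two metrics concrete, here is a minimal sketch of how the R2 score and the root mean squared error are defined, written in plain Python. The observed and predicted values are hypothetical and are not the chapter’s test data.

```python
import math

def r2(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target variable."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Hypothetical observed and predicted response times (seconds).
observed = [3.0, 5.0, 2.0, 5.5, 4.0]
predicted = [3.2, 4.9, 2.3, 5.4, 3.9]

print("R2:", r2(observed, predicted))
print("RMSE:", rmse(observed, predicted))
```

An R2 of 1 means the predictions match the observations exactly, while an R2 of 0 means the model does no better than always predicting the mean of the observed values.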

The linear regression model, which makes predictions by correlating the dependent variable with one or more independent variables, has a few limitations. The first big challenge is that correlations between variables are independent of any time series associated with them. Linear regression cannot predict the value of the dependent variable at a “specific time” with given values of independent variables. Second, prediction is entirely based on the independent variables without considering values of the dependent variable that were already observed or predicted in the recent past. To overcome such challenges, you can use a time-series model, which we will discuss in the next section.

Note

Technically, time can also be treated as an independent variable in linear regression, but that is not the right approach. Time-series models should be used instead of linear regression models in such scenarios.

Time-Series Models

To model scenarios that are based on time series and forecast what values the data is going to take in the future at a specified time interval, you can leverage time-series modeling. Let’s understand a few important terms required in time-series modeling.

Time-Series Data

Any data that is collected at regular time intervals is called time-series data. Data collection frequency can be anything like hourly, daily, weekly, etc., as long as it is happening at regular intervals. The following are common examples of time-series data:
  • Stock price data: The maximum and minimum prices of stock are usually maintained at a daily level. It is then further analyzed on a monthly or yearly time frame to make business decisions.

  • Calls in call center: The call volume at every hour is maintained. It is then analyzed to determine the peak hour load on a monthly or yearly time frame from a resourcing perspective.

  • Sales data: Sales data is important for any business, and usually monthly sales figures are maintained to forecast future sales or profitability of the organization.

  • Website hits: Website hit volume is usually maintained every five to ten minutes to determine the load on the application, detect any potential security threats, or scale the infrastructure to meet business requirements.

Every time-series dataset will have a date or time column along with one or more data columns. If there is only one data column, then it’s called a univariate time series, and if there are multiple data columns, then it’s called a multivariate time series. Figure 7-9 is an example of a univariate time series with the daily average CPU utilization of a server.

A table of C P U utilization time series has two columns titled time stamp and C P U utilization Y.

Figure 7-9

CPU utilization time series

For the purpose of analysis, a univariate time series may need to be converted into multivariate by adding some additional supporting columns. With the help of TimeStamp, we have added another column, Day of the week, and from Figure 7-10 we can observe that there is an unusual spike in CPU utilization on weekends as compared to weekdays.

A table of time series analysis of C P U utilization has three columns titled time stamp, C P U utilization, and day of the week. The fourth and fifth rows are highlighted.

Figure 7-10

TimeSeries analysis of CPU utilization
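Deriving the Day of the week column from the timestamp can be sketched with the standard library alone. In this illustration, the first two readings (8.33 and 8.82) are the values quoted in this chapter for 7/1/20 and 7/2/20; the remaining readings are hypothetical, and the timestamp format is an assumption.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# (timestamp, cpu_utilization); values after the first two are hypothetical.
rows = [
    ("7/1/20 2:00 PM", 8.33),
    ("7/2/20 2:00 PM", 8.82),
    ("7/3/20 2:00 PM", 9.10),
    ("7/4/20 2:00 PM", 21.50),
    ("7/5/20 2:00 PM", 22.40),
]

enriched = []
for stamp, cpu in rows:
    parsed = datetime.strptime(stamp, "%m/%d/%y %I:%M %p")
    day_name = DAY_NAMES[parsed.weekday()]     # weekday(): Monday is 0
    enriched.append((stamp, cpu, day_name))

for record in enriched:
    print(record)
```

Once the day name is available as a column, weekend and weekday readings can be grouped and compared directly.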

Stationary Time Series

One of the most important statistical properties of a time series is stationarity, where the mean and the variance of the time series do not change over time, as shown in Figure 7-11. In simple terms, a stationary time series can have different values with different timestamps, but the underlying logic or method to generate those values remains constant and does not change.

A graph of stationary time series has a high-frequency wave graphed on a horizontal line between two parallel dashed lines.

Figure 7-11

Sample stationary time series

Lag Variable

A lag variable is the value of a dependent variable shifted back in time. Lag-1 (Yt-1) represents the value of the dependent variable Yt one unit of time earlier, Lag-2 (Yt-2) represents the value of Yt two units of time earlier, and so on.

Consider the earlier example of the time series of CPU utilization at specific timestamps. In this example, Lag-1 (Yt-1) at 7/3/20 2:00 PM will be the value of Yt at 7/2/20 2:00 PM, which is 8.82, and Lag-2 (Yt-2) will be the value of Yt at 7/1/20 2:00 PM, which is 8.33, as shown in Figure 7-12.

A table of lag variables has four columns titled time stamp, C P U utilization Y subscript t, Lag 1 Y subscript t minus 1, and Lag 2 Y subscript t minus 2.

Figure 7-12

Lag variable and its values

Mathematically, if there exists a relationship or correlation between Yt and its lagged values Yt-1, Yt-2, etc., then a model can be created to detect the pattern and predict future values of the same series using its lagged values. For example, assume that CPU utilization rises to around 100 percent every Saturday; then, for daily data, the value at Lag-7 (Yt-7) can be considered for prediction.
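Lag columns are simply shifted copies of the original series. The following minimal sketch builds Lag-1 and Lag-2 in plain Python; the first two readings are the values quoted in this chapter, the third is hypothetical, and the helper name `lag` is an assumption.

```python
def lag(series, k):
    """Return the series shifted back by k steps; the first k entries have no value."""
    if not 0 < k <= len(series):
        raise ValueError("k must be between 1 and the length of the series")
    return [None] * k + list(series[:-k])

# Daily CPU utilization for 7/1/20, 7/2/20, and 7/3/20 (last value is hypothetical).
cpu = [8.33, 8.82, 9.10]

lag_1 = lag(cpu, 1)
lag_2 = lag(cpu, 2)

for y, y1, y2 in zip(cpu, lag_1, lag_2):
    print(y, y1, y2)
```

For the third row (7/3/20), Lag-1 holds the previous day’s value and Lag-2 holds the value from two days earlier, which is the layout of the lag table described above.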

The next step is to determine the strength of the relationship between Yt and its lags and how many lags have a statistically significant relationship with Yt that we can use in the prediction model. That’s where ACF and PACF plots help, and we will be discussing these next.

ACF and PACF

Both ACF and PACF are statistical methods used by researchers to understand the temporal dynamics of an individual time series, which basically means detecting correlation over time, often called autocorrelation. A detailed discussion on ACF and PACF is beyond the scope of this book, so we will limit the discussion to the basics of ACF and PACF needed to use them in AIOps model development.

ACF is a statistical term that represents the autocorrelation function and is used to measure the correlation of time-series values Yt with its own lags Yt-1, Yt-2, Yt-3, etc. This correlation of various lags can be visualized using an ACF plot on a scale of 1 to -1, where 1 represents a perfect positive relationship and -1 represents a perfect negative relationship.

As shown in Figure 7-13, the y-axis represents the correlation value, and the x-axis represents the lags. At Lag 0 the series is compared with itself, so it is always 100 percent correlated, and that’s why the Lag 0 correlation value in the ACF plot reaches 1. The ACF plot is useful in detecting the pattern in the correlated data.

An A C F plot where the y-axis represents the correlation value and the x-axis represents the lags from 0 to 7. The dashed lines on either side of the x-axis indicate significance. The first value of the series starts from 1 and decreases gradually with values 8, 7, 5, negative 4, negative 5, negative 7, and negative 8, approximately.

Figure 7-13

Sample ACF plot

There is a red line shown on both sides in the ACF plot called the significance line, which creates a range or band referred to as the error band. The lags that cross this line have a statistically significant relationship with variable Y. In the sample ACF plot, three lags cross this line, so up to Lag 3 there is a statistically significant relationship, and anything within this band is not statistically significant.
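The values behind an ACF plot can be computed directly. The following is a minimal sketch in plain Python, assuming the common large-sample approximation of ±1.96/√N for the significance line; plotting libraries may draw the band differently. The daily series with a weekend spike is hypothetical.

```python
import math

def acf(series, max_lag):
    """Autocorrelation of the series with its own lags 0..max_lag."""
    n = len(series)
    mu = sum(series) / n
    c0 = sum((x - mu) ** 2 for x in series)
    return [
        sum((series[t] - mu) * (series[t - k] - mu) for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]

def significance_bound(n, z=1.96):
    """Approximate 95 percent significance line for a series of n observations."""
    return z / math.sqrt(n)

# Hypothetical daily CPU utilization: low on weekdays, high on weekends, for 8 weeks.
series = [10.0, 10.0, 10.0, 10.0, 10.0, 50.0, 50.0] * 8

rho = acf(series, max_lag=7)
bound = significance_bound(len(series))
significant_lags = [k for k in range(1, 8) if abs(rho[k]) > bound]

print("ACF:", [round(r, 3) for r in rho])
print("significance bound:", round(bound, 3))
print("significant lags:", significant_lags)
```

Because the pattern repeats every seven days, the correlation at Lag 7 is strongly positive and well outside the band.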

PACF is another statistical term that represents the partial autocorrelation function, which is similar to an ACF plot except that it considers the strength of the relationship between the variable and lag at a specific time after removing the effects of all intermediate lags.

For example, in the sample ACF plot in Figure 7-13, it is visible that there is a significant correlation up to three lags, but the PACF plot in Figure 7-14 shows how strong the correlation of Yt-3 alone with Yt is after removing any impact of Yt-1 and Yt-2. PACF can be considered a filtered version of ACF.

The P A C F plot where the y-axis represents the correlation value and the x-axis represents the lags from 0 to 7. The dashed lines on either side of the x-axis indicate significance. The first value of the series starts from 1 and decreases gradually with values 6, 3, 2, negative 2, negative 2, negative 4.5, and negative 8, approximately.

Figure 7-14

Sample PACF plot
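The “filtering” of intermediate lags can be sketched with the Durbin-Levinson recursion, one standard way to derive partial autocorrelations from autocorrelations. This is an illustrative plain-Python version and not necessarily the method a plotting library uses internally. The input autocorrelations are hypothetical and follow the pattern 0.6, 0.6², 0.6³, which is what a series driven only by its previous value would produce.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations for lags 0..K from autocorrelations rho[0..K] (rho[0] == 1)."""
    max_lag = len(rho) - 1
    pacf = [1.0]
    prev = []   # coefficients of the model fitted with k-1 lags
    for k in range(1, max_lag + 1):
        numerator = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        denominator = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = numerator / denominator          # partial autocorrelation at lag k
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

# Hypothetical ACF values that decay geometrically.
rho = [1.0, 0.6, 0.36, 0.216]
partial = pacf_from_acf(rho)
print([round(p, 6) for p in partial])
```

Although the ACF values stay well above zero at lags 2 and 3, the partial autocorrelations at those lags are effectively zero: once Lag 1 is accounted for, the deeper lags add no information of their own.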

Now we are all set to implement two of the most commonly used time-series models, ARIMA and SARIMA, for noise reduction by predicting the appropriate monitoring threshold of parameters. This is one of the most common use cases in an AIOps implementation.

ARIMA

ARIMA is one of the foundational time-series forecasting techniques; it is popular and has been used extensively for decades. The concept was introduced by George Box and Gwilym Jenkins in their book Time Series Analysis: Forecasting and Control, first published in 1970 and revised in 1976. They defined a systematic method, called Box-Jenkins analysis, to identify, fit, and use an autoregressive (AR) technique integrated (I) with a moving average (MA) technique, resulting in an ARIMA time-series model.

ARIMA leverages the pattern and information contained in a time series to predict its future values. So, is it possible for ARIMA to forecast any time series? How about predicting the price of your favorite stocks using ARIMA based on their historical prices? Technically, forecasting can be done, but it might not be accurate at all. For example, the price of a stock is controlled by multiple external factors such as the company’s quarterly results, government regulations, weather conditions, competitors’ prices, etc. Without considering all such factors, ARIMA cannot provide meaningful predictions. It is important to note that ARIMA should be used only for those time series that exhibit the following properties:
  • It should be nonseasonal and either stationary or convertible to a stationary series through differencing.

  • Past values should have a sufficient pattern (not just a random white noise) and information to predict its future values.

  • It should be of medium to long length with at least 50 observations.

Model Development

The ARIMA model has three components.

Differencing (d)

As discussed earlier, the ARIMA model needs the series to be stationary, but then the challenges are as follows:
  • How to attain stationarity in a series

  • How to validate if the series becomes stationary or not

There are two approaches to change nonstationary series into stationary.
  • Perform an appropriate transformation, such as a log transformation or square root transformation, to change a nonstationary series into a stationary series. This approach is relatively complex.

  • A more reliable and widely used approach is to perform differencing on a series to convert it into a stationary one. In this approach, the difference between consecutive values in a series is calculated, and then the resultant series is checked for stationarity.

One of the most used statistical tests to determine whether a time series is stationary is the augmented Dickey-Fuller test (ADF test), which is a type of unit root test that determines how strongly a time series is defined by a trend. In simple terms, the ADF test evaluates a null hypothesis (H0), which claims that the time series has some time-dependent structure and hence is nonstationary. If this null hypothesis is rejected, then the time series is considered stationary. The ADF test calculates a p-value, and if it is less than 0.05 (5 percent), then the null hypothesis is rejected. Interested readers can explore more about the internal functioning and mathematics behind unit root tests and ADF tests online.

We need to continue performing differencing and validating the results via an ADF test until we get a near-stationary series with a defined mean. The number of times differencing is performed to make the time series stationary is represented by the model variable d.
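Differencing itself is a small operation. The sketch below applies it d times in plain Python to two hypothetical series: one with a straight-line trend and one with a curved trend. The stationarity check that would follow each round (for example, the ADF test) is not shown here.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: each pass replaces Y[t] with Y[t] - Y[t-1]."""
    result = list(series)
    for _ in range(d):
        result = [result[i] - result[i - 1] for i in range(1, len(result))]
    return result

# Hypothetical series with a linear trend: one round of differencing removes it.
linear_trend = [10, 12, 14, 16, 18]
print(difference(linear_trend, d=1))

# Hypothetical series with a quadratic trend: two rounds are needed.
quadratic_trend = [1, 4, 9, 16, 25]
print(difference(quadratic_trend, d=1))
print(difference(quadratic_trend, d=2))
```

Each round of differencing shortens the series by one observation, which is one more reason not to difference more often than necessary.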

It is important to note that the series should not be over-differenced. An over-differenced series is still stationary, but the model parameters are affected and the prediction accuracy suffers. One way to check for over-differencing is that the Lag 1 autocorrelation should not become too negative.

Autoregression or AR (p)

This component, known as autoregression, shows that the dependent variable Yt can be calculated as a function of its own lags Yt-1, Yt-2, and so on.

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \varepsilon_t$$

Here,

α is the constant term (intercept).

β is the coefficient of lag with a subscript indicating its position.

ε is the error or white noise.
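The autoregressive relationship can be written as a short function. This is a minimal sketch with hypothetical coefficients; in practice α and the β values are estimated by the model from the training data, and the function name `ar_value` is an assumption.

```python
def ar_value(alpha, betas, history, error=0.0):
    """Compute Y_t from its own lags.

    history is ordered oldest to newest; betas[0] multiplies the most recent value
    (Y_{t-1}), betas[1] multiplies Y_{t-2}, and so on.
    """
    p = len(betas)
    recent_first = history[-p:][::-1]
    return alpha + sum(b * y for b, y in zip(betas, recent_first)) + error

# Hypothetical AR(2) model: alpha = 1.0, beta_1 = 0.5, beta_2 = 0.2.
history = [3.0, 4.0, 6.0]      # ..., Y_{t-2} = 4.0, Y_{t-1} = 6.0
forecast = ar_value(1.0, [0.5, 0.2], history)
print("forecast:", forecast)    # 1.0 + 0.5*6.0 + 0.2*4.0
```

When forecasting, the error term is unknown and is taken as zero, so the forecast depends only on the intercept and the weighted lags.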

We already discussed that the PACF plot shows the partial autocorrelation between the series and its lag after excluding the effect of the intermediate lags. We can use the PACF plot to find the last lag that crosses the significance line, which can be considered a good estimate of the AR order. This value is represented by the model variable p as the order of the AR model.

Based on the sample PACF plot in Figure 7-15, we can observe that lag 1 seems to be the last lag crossing the significance line. So, we can set p as 1 and represent it as AR(1).

The P A C F plot where the y-axis represents the correlation value and the x-axis represents the lags from 0 to 7. The dashed lines on either side of the x-axis indicate significance. The first value of the series starts from 1 and decreases gradually with values 6, 4, 3.5, negative 2, negative 3, negative 4, and negative 8, approximately.

Figure 7-15

PACF plot

Moving Average or MA (q)

This component, known as moving average, shows that the dependent variable Yt can be calculated as a function of the weighted average of the forecast errors in previous lags.

$$Y_t = \alpha + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_2 \varepsilon_{t-2} + \dots + \phi_q \varepsilon_{t-q}$$

Here,

α is usually the mean of the series.

ε is the error from the autoregressive models at respective lags.
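The moving average relationship can be sketched in the same way as the autoregressive one, except that the inputs are past forecast errors rather than past values. The coefficients and errors below are hypothetical, and the function name `ma_value` is an assumption.

```python
def ma_value(alpha, phis, errors, current_error=0.0):
    """Compute Y_t from past forecast errors.

    errors is ordered oldest to newest; phis[0] multiplies the most recent error
    (e_{t-1}), phis[1] multiplies e_{t-2}, and so on.
    """
    q = len(phis)
    recent_first = errors[-q:][::-1]
    return alpha + current_error + sum(ph * e for ph, e in zip(phis, recent_first))

# Hypothetical MA(2) model: alpha (series mean) = 10.0, phi_1 = 0.4, phi_2 = 0.1.
past_errors = [1.0, -2.0]       # e_{t-2} = 1.0, e_{t-1} = -2.0
forecast = ma_value(10.0, [0.4, 0.1], past_errors)
print("forecast:", forecast)     # 10.0 + 0.4*(-2.0) + 0.1*1.0
```

The forecast starts from the mean of the series and is corrected by how far recent forecasts missed their targets.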

MA can detect trends and patterns in time-series data. You can use the ACF plot to determine the value of model variable q, which represents how many moving average terms are required to remove any autocorrelation in a stationary series. From the ACF plot in Figure 7-16, we can observe that there is a cut-off after lags 2 and 6. So, we can try q = 2, MA(2), or q = 6, MA(6).

An A C F plot where the y-axis represents the correlation value and the x-axis represents the lags from 0 to 7. The dashed lines on either side of the x-axis indicate significance. The first value of the series starts from 1 and decreases gradually with values 8, 7, 5, negative 4, negative 5, negative 7, and negative 8, approximately.

Figure 7-16

ACF plot

Both the AR and MA equations need to be integrated (I or d) to create the ARIMA model, which is represented as ARIMA (p,d,q).

ARIMA acts as a foundation for various other autoregressive time-series models as follows:
  • SARIMA: If there is a repetitive pattern at a regular time interval, then it’s referred to as seasonality, and the SARIMA model is used where S stands for seasonality. We will explore this more in the next section.

  • ARIMAX: If the series depends on external factors, then the ARIMAX model is used, where X stands for external factors.

  • SARIMAX: If along with seasonality there is a strong influence of external factors, then the SARIMAX model is used.

One of the big limitations of ARIMA is its inability to consider the seasonality in time-series data, which is common in real-world scenarios such as exceptionally high sales figures during the Thanksgiving and Christmas holiday period every year, high expenditure during the first week of every month, high footfall in shopping malls during the weekend, and so on. Such seasonality is considered in another algorithm, called SARIMA, which is a variant of the ARIMA model.

SARIMA

A cyclical pattern at a regular time interval is referred to as seasonality and should be considered in a prediction model to improve the accuracy of prediction. That’s how the ARIMA model is modified to develop SARIMA, which considers seasonality in time-series data and supports univariate time-series data with a seasonal component.

Along with the three ARIMA variables (p, d, and q), SARIMA needs four new components (m, P, D, and Q).
  • p: Autoregression (AR) order of time series.

  • d: Difference (I) order of time series.

  • q: Moving average (MA) order of time series.

  • m (or lag): Seasonal factor, the number of time steps in a single seasonal period. For example, for a yearly seasonal cycle, m = 12 with monthly data, m = 4 with quarterly data, and m = 52 with weekly data.

  • P: Seasonal autoregressive order.

  • D: Seasonal difference order.

  • Q: Seasonal moving average order.

The final notation for an SARIMA model is specified as SARIMA (p,d,q)(P,D,Q)m.
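The role of the seasonal period m and the seasonal difference order D can be illustrated with a minimal sketch in plain Python. It assumes daily data with a weekly cycle (m = 7), which is an example chosen for illustration, and the series values are hypothetical.

```python
def seasonal_difference(series, m, D=1):
    """Apply seasonal differencing D times: each pass replaces Y[t] with Y[t] - Y[t-m]."""
    result = list(series)
    for _ in range(D):
        result = [result[i] - result[i - m] for i in range(m, len(result))]
    return result

# Hypothetical daily CPU utilization with a weekend spike, repeated for 3 weeks.
weekly_pattern = [10, 10, 10, 10, 10, 50, 50]
series = weekly_pattern * 3

deseasonalized = seasonal_difference(series, m=7, D=1)
print(deseasonalized)
```

Subtracting the value from one full season earlier removes the repeating weekly pattern completely in this idealized series, leaving only zeros. Real data would leave a residual series for the nonseasonal p, d, and q terms to model.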

Implementation of ARIMA and SARIMA

Let’s perform a sample implementation of the ARIMA and SARIMA models on CPU utilization data to predict its thresholds.

Begin by importing the required Python libraries.
# Library for Data Manipulation
import pandas as pd
import numpy as np
# Library for Data Visualization
import matplotlib.pyplot as plt
# Library for Time Series Models
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.seasonal import seasonal_decompose
# Library for Machine Learning
from sklearn.metrics import mean_squared_log_error
# Library for tracking time
from datetime import datetime
Read the CPU utilization data from a CSV file into a pandas data frame and check the data types, column names, and count of values present in the imported data.
cpu_util_raw_data= pd.read_csv("cpu_utilization-data.csv")
cpu_util_raw_data.info()
As shown in Figure 7-17, there are two columns, date_time and cpu_utilization, and a total of 397 data points (excluding the header row) are available for analysis.

Output of cpu_util_raw_data.info(): class pandas.core.frame.DataFrame; RangeIndex of 397 entries, 0 to 396; two data columns listed with their non-null count and dtype; dtypes float64(1) and object(1); memory usage of 6.3+ KB.

Figure 7-17

Overview of data in data file

Set the data type of column date_time to DateTime, set it as an index to sort the data points in a data frame, and plot the CPU utilization on a time scale, as shown in Figure 7-18.
total_records = cpu_util_raw_data.count()['cpu_utilization']
cpu_util_raw_data['date_time'] = pd.to_datetime(cpu_util_raw_data['date_time'])
cpu_util_raw_data.set_index('date_time',inplace=True)
cpu_util_raw_data = cpu_util_raw_data.sort_values(by="date_time")
fig = plt.figure(figsize =(10, 5))
cpu_util_raw_data['cpu_utilization'].plot()

A graph of C P U utilization implementation, where the vertical axis is marked as data points and the horizontal axis is marked as date underscore time. A high-frequency wave is graphed from point 4 and ends at 55 approximately. The high-frequency wave has the highest point at 88 and the lowest point at 0.

Figure 7-18

CPU utilization on time scale

It is important to check for outliers in the data using a box plot, because the presence of outliers impacts the quality of predictions. If there is a high number of outliers in the data, they need to be either processed or removed from the dataset based on the domain requirements.
#Global variables
NOISE = False
MAPE = True
TEST_DATA_SIZE = 0.2 # fraction of the data reserved for testing (20 percent)
SEASONAL = False
def getOutliers(data, col):
    Q3 = data[col].quantile(0.75)
    Q1 = data[col].quantile(0.25)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    return lower_limit, upper_limit
lower, upper = getOutliers(cpu_util_raw_data, 'cpu_utilization')
outliers_count = cpu_util_raw_data[(cpu_util_raw_data['cpu_utilization']<lower) | (cpu_util_raw_data['cpu_utilization']>upper)].count()['cpu_utilization']
outlier_percentage = ((outliers_count / total_records) * 100)
if outlier_percentage > 20:
    NOISE = True
print(f"Outlier Percentage in data: {outlier_percentage}%")
#Render box plot
cpu_util_raw_data.boxplot('cpu_utilization', figsize=(10,10))
zero_values = cpu_util_raw_data[(cpu_util_raw_data['cpu_utilization']==0)].count()['cpu_utilization']
zero_values_percentage = ((zero_values / total_records) * 100)
if zero_values_percentage > 20:
    MAPE = False
print("Zero Value percentage in data: ", zero_values_percentage)
Based on Figure 7-19, there are no outliers and no “zero” values, so let’s continue with the given dataset.

A box plot for outlier analysis. The vertical axis shows data points, and the horizontal axis denotes CPU utilization. The interquartile range spans roughly 19 to 59, with the median at about 40. The whiskers extend from about 1 to about 93.

Figure 7-19

Box plot analysis for outliers

Let’s check for the stationarity of a time series using an ADF test. Based on an ADF test, if the p-value comes out to more than 0.05, it means the time series is nonstationary and needs to be differenced using the diff() function and again checked with an ADF test for stationarity.
diff_count = 0
differencing_order = {
    1: lambda x: x['cpu_utilization'].diff(),
    2: lambda x: x['cpu_utilization'].diff().diff(),
    3: lambda x: x['cpu_utilization'].diff().diff().diff(),
    4: lambda x: x['cpu_utilization'].diff().diff().diff().diff(),
    5: lambda x: x['cpu_utilization'].diff().diff().diff().diff().diff()
}
while True:
    if diff_count == 0:
        adftestresult = adfuller(cpu_util_raw_data['cpu_utilization'].dropna())
    else:
        adftestresult = adfuller(differencing_order[diff_count](cpu_util_raw_data).dropna())
    print('#' * 60)
    print('ADF Statistic: %f' % adftestresult[0])
    print('p-value: %f' % adftestresult[1])
    print(f'ADF Test Result: The time series is {"non-" if adftestresult[1] >= 0.05 else ""}stationary')
    print('#' * 60)
    if adftestresult[1] < 0.05 or diff_count >= len(differencing_order):
        break
    diff_count += 1
print("Differencing order to make data stationary: ",diff_count)

ADF test output. Original series: ADF Statistic: -2.5335679, p-value: 0.107493, ADF Test Result: The time series is non-stationary. After first-order differencing: ADF Statistic: -15.396157, p-value: 0.000000, ADF Test Result: The time series is stationary. Final line: Differencing order to make data stationary: 1

The ADF result on the differenced time series shows a p-value of effectively 0, which confirms that the data series is now stationary. This can be visualized by plotting both the original series and the differenced series on a time scale graph, as shown in Figure 7-20.
fig, ax = plt.subplots(figsize=(10,8), dpi=100)
#  Differencing
ax.plot(cpu_util_raw_data.cpu_utilization[:], label='Original Series')
ax.plot(cpu_util_raw_data.cpu_utilization.diff(1), label='1st Order Differencing')
ax.set_title('1st Order Differencing')
ax.legend(loc='upper left', fontsize=10)
plt.legend(loc='upper left', fontsize=10)
plt.suptitle('CPU Usage - Time Series Dataset', fontsize=16)
plt.show()

A graph of time-series dataset of C P U usage, where the vertical axis is marked as data points and the horizontal axis is marked as date underscore time. Two high-frequency waves of original series and 1st order differencing are graphed from point 0 in January 2020 and end at points 50 in January 2021 and 0 in January 2021, respectively.

Figure 7-20

Time-series stationarity check

As we have done differencing only one time, we can set the value of the differencing (d) component of ARIMA as d=1.
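
As a small illustration of what d=1 means, the following sketch (plain Python, hypothetical values, helper names chosen only for this example) computes a first-order difference by hand and then reverses it with a cumulative sum. This reverse step is conceptually how a model fitted on differenced data can still produce forecasts on the original scale.

```python
def difference(series):
    """First-order difference: y[t] - y[t - 1]."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def integrate(first_value, diffs):
    """Undo first-order differencing by cumulative summation."""
    levels = [first_value]
    for d in diffs:
        levels.append(levels[-1] + d)
    return levels

# Hypothetical CPU utilization readings with an upward drift
levels = [40, 42, 45, 44, 48, 51]
diffs = difference(levels)
restored = integrate(levels[0], diffs)

print(diffs)
print(restored == levels)
```

The differenced series has one value fewer than the original, which is why the listings in this chapter call dropna() after diff().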

Next let’s plot the ACF and PACF graph to get the value of the AR (p) and MA (q) components.
plt.rcParams.update({'figure.figsize':(7, 4), 'figure.dpi':120})
plot_pacf(cpu_util_raw_data.cpu_utilization.diff().dropna());
plot_acf(cpu_util_raw_data.cpu_utilization.diff().dropna());
In the PACF graph in Figure 7-21, the partial autocorrelation crosses the significance band up to lag 6, so the AR component can initially be set to p=6 for modeling.

A P A C F graph correlation value versus lags from 0 to 25. The shaded bar on the horizontal axis indicates significance. The values of the series that cross the significance bar are 1, negative 0.35, negative 0.22, negative 0.30, negative 0.20, negative 0.13, negative 0.23, negative 0.12, negative 0.11, and negative 0.14 approximately.

Figure 7-21

PACF graph of time series

In the ACF graph in Figure 7-22, the autocorrelation cuts off sharply after lag 2 and crosses the significance band again at lag 7, so we can try the MA component as q=2 or q=7 for modeling.

An A C F graph correlation value versus lags from 0 to 25. The shaded bar on the horizontal axis indicates significance. The series values that cross the significance bar are 1, negative 0.38, negative 0.15, 0.20, negative 0.11, 0.11, and negative 0.15 approximately.

Figure 7-22

ACF graph of time series
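
For readers who want to see what the ACF plot computes, here is a minimal sketch in plain Python of the sample autocorrelation at a given lag. The band of ±1.96/√n is a common large-sample approximation and is an assumption of this sketch; the band drawn by plot_acf may be computed differently. The series is synthetic and, unlike the chapter's plot, is not differenced.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((y - mean) ** 2 for y in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

def significance_bound(n):
    """Approximate 95 percent band for a white-noise series of length n."""
    return 1.96 / math.sqrt(n)

# Synthetic daily series that repeats the same weekly pattern ten times
pattern = [30, 32, 35, 33, 31, 60, 65]
series = [pattern[t % 7] for t in range(70)]

bound = significance_bound(len(series))
significant_lags = [k for k in range(1, 15)
                    if abs(autocorrelation(series, k)) > bound]
print(round(autocorrelation(series, 7), 3), significant_lags)
```

A strong autocorrelation at lag 7 in daily data, as in this synthetic series, is the kind of signal that hints at weekly seasonality.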

Next split the data into training and test data to be used for the ARIMA (6,1,2) model. Out of 397 data points, let’s use the first 300 (~75 percent) for training and the rest for testing, and plot them on a time scale, as shown in Figure 7-23.
# Plot train and test data
train_data = cpu_util_raw_data['cpu_utilization'][:300]
test_data = cpu_util_raw_data['cpu_utilization'][300:]
fig, ax = plt.subplots(figsize=(10,5), dpi=100)
ax.plot(train_data, label='Train Data Points')
ax.plot(test_data, label='Test Data Points')
ax.set_title('CPU Usage Timeline')
ax.legend(loc='upper left', fontsize=10)
plt.legend(loc='upper left', fontsize=10)
plt.suptitle('Test-Train Data Splits', fontsize=16)
plt.show()

A graph of test train data splits of C P U usage, where the vertical axis is marked as data points and the horizontal axis is marked as date underscore time. Two high-frequency waves of train data points and test data points start from 4 in January 2020 and November 2020 and end at points 70 in November 2020 and 50 in January 2021, respectively.

Figure 7-23

Time-series data split in test data and training data

Let’s create the ARIMA (6,1,2) model using the train data.
manual_arima_model = ARIMA(train_data, order=(6,1,2))
manual_arima_model_fit = manual_arima_model.fit()
print(manual_arima_model_fit.summary())
Figure 7-24 shows the output from the ARIMA model on the train data.

A report of SARIMAX results with the header fields Dep. Variable, Model, Date, Time, Sample, Covariance Type, No. Observations, Log Likelihood, AIC, BIC, and HQIC at the top. A coefficient table is in the middle, and at the bottom are further outputs with the headers Ljung-Box, Prob, Heteroskedasticity, Prob, and Jarque-Bera.

Figure 7-24

ARIMA model output result on training data

Before applying this model on the test data, the ARIMA model needs to be checked for accuracy with the chosen combination of component values, p,d,q. There are two key statistical measures to compare the relative quality of the different models.
  • Akaike information criterion (AIC): This measures the goodness of fit of the model while penalizing the number of estimated parameters, which discourages overfitting. The lower the value of AIC, the better the model. The current ARIMA (6,1,2) model has an AIC value of 2113.

  • Bayesian information criterion (BIC): Like AIC, the BIC rewards goodness of fit, but its penalty on the number of parameters also grows with the number of samples in the training dataset used for fitting. Here also a model with a lower BIC is preferred. The current ARIMA (6,1,2) model has a BIC value of 2150.

Based on PACF and ACF charts, multiple values can be tried to lower the AIC and BIC values. For now let’s proceed with ARIMA (6,1,2).
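
The two criteria can be written out directly from their standard definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations, and ln(L) the log-likelihood. The sketch below uses hypothetical log-likelihoods and parameter counts purely for illustration; they are not taken from the models fitted in this chapter.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return num_params * math.log(num_obs) - 2 * log_likelihood

# Hypothetical candidates: model A fits slightly better but uses more parameters.
model_a = {"llf": -1050.0, "k": 9}
model_b = {"llf": -1051.0, "k": 7}
n_obs = 300

for name, m in (("A", model_a), ("B", model_b)):
    print(name, aic(m["llf"], m["k"]), round(bic(m["llf"], m["k"], n_obs), 2))
```

In this made-up comparison, the simpler model B wins on both criteria even though its log-likelihood is slightly lower, and the gap is wider for BIC because its per-parameter penalty, ln(n), exceeds 2 once n is larger than about 7.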

Let’s plot the actual values in a training dataset with the model’s predicted value to analyze how close the actual and model predicted values are.
# The statsmodels ARIMA results object already predicts on the original (level) scale,
# so no typ argument is needed
prediction_manual = manual_arima_model_fit.predict(dynamic=False)
fig, ax = plt.subplots(figsize=(10,5), dpi=100)
ax.plot(prediction_manual, label='Prediction')
ax.plot(train_data, label='Training Data')
ax.legend(loc='upper left', fontsize=10)
plt.legend(loc='upper left', fontsize=10)
plt.suptitle('Actual VS Predicted CPU Utilization on Train Data', fontsize=16)
plt.show()
As observed in Figure 7-25, the model predictions are quite close to the actual values, indicating that the model fits the training data well.

A graph of actual versus predicted C P U utilization on train data, where the vertical axis denotes data points, and the horizontal axis denotes date underscore time. Two high-frequency waves of prediction and training data are graphed. Both waves start at point 70 in January 2020 and end at point 60 in January 2021.

Figure 7-25

Prediction analysis on training data

Now let’s validate the model performance on the test data.
# Forecast over the test horizon with a 95% confidence interval
forecast = manual_arima_model_fit.get_forecast(len(test_data))
manual_arima_fc = forecast.predicted_mean
manual_arima_conf = forecast.conf_int(alpha=0.05)
# Use .values so each series is relabeled with the test index rather than realigned;
# realignment would produce NaN if the forecast index differs from the test index
manual_arima_fc_series = pd.Series(manual_arima_fc.values, index=test_data.index)
manual_arima_lower_series = pd.Series(manual_arima_conf["lower cpu_utilization"].values,
 index=test_data.index)
manual_arima_upper_series = pd.Series(manual_arima_conf["upper cpu_utilization"].values,
 index=test_data.index)
plt.figure(figsize=(15,10), dpi=100)
plt.plot(train_data, label='Training Data')
plt.plot(test_data, label='Actual Data')
plt.plot(manual_arima_fc_series, label='Forecast Data')
# Shade the confidence band around the forecast
plt.fill_between(manual_arima_lower_series.index, manual_arima_lower_series,
 manual_arima_upper_series, color='gray', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)
plt.show()
Figure 7-26 shows that the ARIMA (6,1,2) model performance on the test dataset is not good and its predictions are deteriorating over time.

A graph of prediction analysis, data points versus date underscore time. Three high-frequency waves of training data, actual data, and forecast data start from 70 in January 2020, 55 in November 2020 and 70 in November 2020 and end at points 70 in November 2020, 55 in January 2021, and 55 in January 2021, respectively.

Figure 7-26

Prediction analysis on test data

Let’s quantify the model performance with the metrics defined earlier.
def forecast_accuracy(forecast, actual):
    mape = np.mean(np.abs(forecast - actual)/np.abs(actual)*100)
    mae = np.mean(np.abs(forecast - actual))
    rmse = np.mean((forecast - actual)**2)**.5
    rmsle = np.sqrt(mean_squared_log_error(actual, forecast))
    return({'MAPE : ':mape, 'MAE : ': mae,
             'RMSE : ':rmse, 'RMSLE : ':rmsle
           })
forecast_accuracy(manual_arima_fc, test_data)

Out[16]: {'MAPE : ': 24.071440955749733, 'MAE : ': 11.286266474840719, 'RMSE : ': 13.201605001517883, 'RMSLE : ': 0.2593143147566868}

Based on the performance metrics, the model predictions are about 76 percent (MAPE = ~24 percent) accurate.
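
The relationship between MAPE and the accuracy figure quoted above can be checked with a small worked example. This is a sketch in plain Python with made-up numbers; the explicit guard against zero actual values is an assumption added for illustration, reflecting that MAPE is undefined when an actual value is 0.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is 0")
    errors = [abs(f - a) / abs(a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)

# Hypothetical CPU utilization values and forecasts
actual = [50, 40, 80, 100]
forecast = [45, 44, 60, 110]

mape_value = mape(actual, forecast)
accuracy = 100 - mape_value
print(f"MAPE: {mape_value:.2f}%, accuracy: {accuracy:.2f}%")
```

The individual percentage errors here are 10, 10, 25, and 10 percent, so the MAPE is their mean and the accuracy is the remainder up to 100 percent, which is the same reading applied to the ARIMA (6,1,2) result.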

In practical scenarios, multiple values of ARIMA components need to be used to arrive at a model with maximum accuracy in predictions. Manually performing this task requires a lot of time and expertise in statistical techniques. Also, it becomes practically impossible to create models for a large number of entities such as servers, parameters, stocks, etc.

Approaches called auto ARIMA automate the search for suitable values of the components p, d, and q. In Python there is a library called pmdarima that provides the function auto_arima, which tries various values of the ARIMA/SARIMA components during model creation and searches for the best model, the one with the lowest information criterion (AIC by default).

Let’s try auto ARIMA to find the best component values for the ARIMA model from the given training dataset. In the auto_arima call, we currently set seasonal to False to explore more about ARIMA. This value will be True for the SARIMA model to consider seasonality. We also provide the maximum values of p and q that auto ARIMA can explore to obtain the best model. Let’s proceed with max_p and max_q as 15 to limit the complexity of the model.
# auto_arima is provided by the pmdarima package (pip install pmdarima)
import pmdarima as pm
# Fit auto_arima on train set
auto_arima_model = pm.auto_arima(train_data, start_p = 1, start_q = 1,
 max_p = 15, max_q = 15,
 seasonal = False,
 d = None, trace = True,
 error_action ='ignore',
 suppress_warnings = True,
 stepwise = True)
# To print the summary
auto_arima_model.summary()
As per the auto ARIMA result shown in Figure 7-27, the best model is identified as ARIMA (6,1,0) with marginal improvement in the AIC and BIC values as compared to the previous ARIMA (6,1,2) model.

The auto ARIMA trace shows the stepwise search performed to minimize AIC. The SARIMAX results are given below it with the header fields Dep. Variable, Model, Date, Time, Sample, Covariance Type, No. Observations, Log Likelihood, AIC, BIC, and HQIC.

Figure 7-27

Prediction analysis on test data

Next validate the performance of ARIMA (1,1,1) on the test data.
# np.asarray keeps positional (0..n-1) indexing whether predict returns an array or a Series
auto_arima_predictions = pd.Series(np.asarray(auto_arima_model.predict(n_periods=len(test_data))))
actuals = test_data.reset_index(drop = True)
auto_arima_predictions.plot(legend = True,label = "ARIMA Predictions",
 xlabel = "Index",ylabel = "CPU Utilization",
 figsize=(10, 7))
actuals.plot(legend = True, label = "Actual");
forecast_accuracy(np.array(auto_arima_predictions), test_data)
As observed in Figure 7-28 and in the resulting performance metrics, the accuracy of the ARIMA (1,1,1) model degrades even further.

A graph of C P U utilization versus the index. A frequency wave and a curve line are graphed. The frequency wave denotes actual starts from (0, 70) and ends at (40, 95). The curve line denotes ARIMA prediction starts from (0, 65) and ends at (95, 68). The points are approximate.

Figure 7-28

Prediction analysis on test data for ARIMA (1,1,1)

Clearly, the ARIMA model is not going to work in this scenario, and we need to look for seasonality in the data using the seasonal_decompose function.
seasonality_check = seasonal_decompose(cpu_util_raw_data['cpu_utilization'],
 model='additive',extrapolate_trend='freq')
seasonality_check.plot()
plt.show()
Based on the graph in Figure 7-29, we can observe seasonality in the data, which implies we should use the SARIMA model.

Four graphs. Graph 1, C P U utilization versus date-time where a high-frequency wave is graphed. Graph 2, trend versus date-time where a low-frequency wave is graphed. Graph 3, seasonal versus date-time where a high-frequency wave with the same height is graphed. Graph 4, resid versus date-time where a scatter plot is graphed.

Figure 7-29

Time series seasonality check

Let’s explore the SARIMA model using auto ARIMA with seasonality enabled. Because we have daily data points and the seasonality graph suggests a weekly pattern, we set the SARIMA seasonal period to m = 7.
# Seasonal - fit stepwise auto-ARIMA
sarima_model = pm.auto_arima(train_data, start_p=1, start_q=1,
                         test='adf',
                         max_p=15, max_q=15, m=7,
                         start_P=0, seasonal=True,
                         d=None, D=1, trace=True,
                         error_action='ignore',
                         suppress_warnings=True,
                         stepwise=True)
sarima_model.summary()
As shown in Figure 7-30, auto ARIMA tried various combinations and detected SARIMA (3,0,2)(0,1,1)[7] as the best-fit model, with considerable improvement in the AIC value compared to the previous ARIMA models.

The auto ARIMA execution result on the train data shows the stepwise search performed to minimize AIC. The SARIMAX results are given below it with the header fields Dep. Variable, Model, Date, Time, Sample, Covariance Type, No. Observations, Log Likelihood, AIC, BIC, and HQIC.

Figure 7-30

Auto Arima model execution result on train data

Let’s validate the results of this new SARIMA model.
# np.asarray keeps positional (0..n-1) indexing so the curves share the same x-axis
sarima_predictions = pd.Series(np.asarray(sarima_model.predict(n_periods=len(test_data))))
actuals = test_data.reset_index(drop = True)
sarima_predictions.plot(legend = True,label = "SARIMA Predictions",xlabel = "Index", ylabel = "CPU Utilization", figsize=(10, 7))
auto_arima_predictions.plot(legend = True,label = "ARIMA Predictions")
actuals.plot(legend = True, label = "Actual")
forecast_accuracy(np.array(sarima_predictions), test_data)
After considering the seasonality, the prediction accuracy improved to about 62 percent as the MAPE dropped to about 38 percent. This improvement in accuracy can also be observed in Figure 7-31, which plots the predictions of the ARIMA and SARIMA models against the actual test data.

A graph of C P U utilization versus the index. Two frequency waves and a curve line are graphed. The frequency waves denote SARIMA and actual. The actual starts around (0, 70) and ends at (40, 95) and SARIMA starts around (0, 65) and ends at (45, 95). The curve line denotes ARIMA prediction starts around (0, 65) and ends at (95, 68).

Figure 7-31

Time-series prediction by ARIMA/SARIMA on test data

This model can be used to generate future predictions to determine an appropriate baseline dynamically rather than going with a static global baseline. This approach drastically reduces the noise in large environments.
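
As a rough illustration of how such predictions could drive alerting, the sketch below compares observed values against a per-time-step upper bound (for example, the upper confidence limit of a forecast) instead of a single static threshold. All numbers and function names are hypothetical; this is one possible way to use the forecast, not a prescribed implementation.

```python
def dynamic_alerts(observed, upper_bound):
    """Indexes where the observation exceeds the predicted upper bound."""
    return [t for t, (value, limit) in enumerate(zip(observed, upper_bound))
            if value > limit]

def static_alerts(observed, threshold):
    """Indexes where the observation exceeds one fixed threshold."""
    return [t for t, value in enumerate(observed) if value > threshold]

# Hypothetical readings: index 2 is a scheduled batch job (expected to be high),
# index 5 is an unexpected spike during normal hours.
observed = [20, 25, 99, 30, 22, 90, 28]
upper_bound = [40, 40, 100, 45, 40, 60, 40]

print("dynamic:", dynamic_alerts(observed, upper_bound))
print("static :", static_alerts(observed, 95))
```

With these values, the static threshold of 95 raises an event for the expected batch job and misses the unexpected spike, while the dynamic baseline does the opposite.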

This model can also be used for one of the most common use cases around the autoscaling of infrastructure, especially when there is seasonality in the utilization. The SARIMA (or SARIMAX) model can be used to analyze historical time-series data along with seasonality and any external factor to predict the utilization and accordingly scale up or down infrastructure for a specific duration and provide potential cost savings.

Automated Baselining in APM and SecOps

Monitoring tools related to SecOps and application monitoring have different characteristics: most of the time their monitored parameter values remain very low, almost touching zero, which produces an imbalanced dataset and hence incorrect predictions. In these situations, it makes more sense to compare spikes or unusually high utilization values (called anomalies) with predicted values to calculate the MASE ratio and tune the ML model accordingly, rather than relying on the usual low values of the parameters. Anomalies are values that do not follow a regular pattern and show sudden spikes (up or down). We will discuss more about these in Chapter 8.
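
Assuming MASE here refers to the standard mean absolute scaled error, the ratio can be sketched as the forecast's mean absolute error divided by the mean absolute error of a naive "previous value" forecast on the training data. Unlike MAPE, it stays defined when actual values are zero, which suits parameters that sit near zero most of the time. The numbers and the helper name below are hypothetical.

```python
def mase(train, actual, forecast):
    """Mean absolute scaled error relative to a naive one-step forecast."""
    naive_mae = sum(abs(train[t] - train[t - 1])
                    for t in range(1, len(train))) / (len(train) - 1)
    forecast_mae = sum(abs(a - f)
                       for a, f in zip(actual, forecast)) / len(actual)
    return forecast_mae / naive_mae

# Hypothetical low-valued metric with an occasional spike
train = [0, 1, 0, 1, 0]
actual = [0, 8, 0, 0]
forecast = [0, 6, 1, 0]

mase_value = mase(train, actual, forecast)
print(mase_value)
```

A value below 1 means the model's errors are smaller on average than those of the naive forecast; a value above 1 suggests the model needs tuning.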

These predicted thresholds for security-related parameters can be applied on policies that monitor the protected segment comprising the IP address range, port, and protocol that define services, VLAN numbers, or MPLS tags. An appropriate policy will dynamically detect attacks on different protected segments and trigger qualified alarms rather than noise.

Implementing a monitoring solution with an automated baselining feature brings in immediate benefits by enabling the operations team to quickly identify outages, exceptions, and cyberattacks rather than wasting time on noise. But organizations often face some operational challenges in adopting dynamic thresholding, which we will be discussing next.

Challenges with Dynamic Thresholding

Though dynamic thresholding looks promising, adoption is challenging because of the fear of missing critical alerts. Especially during the initial phases, when the AIOps system has just started the learning process, organizations have many doubts about the accuracy of predictions. Also, the majority of open source and native monitoring tools don’t have this capability. Implementing AIOps-based dynamic thresholding involves the use of multiple algorithms and techniques and requires data covering a longer duration to be able to analyze the seasonality and patterns.

Summary

In this chapter, we covered one of the important use cases in AIOps, which is automated baselining. We covered various types of regression algorithms that can be used for this purpose. We covered hands-on implementation of the use case using multiple algorithms such as linear regression, ARIMA, and SARIMA. In the next chapter, we will cover various anomaly detection algorithms and how they can be used in AIOps.
