N. Sabharwal, G. Bhardwaj, Hands-on AIOps, https://doi.org/10.1007/978-1-4842-8267-0_7

7. AIOps Use Case: Automated Baselining

Navin Sabharwal¹ and Gaurav Bhardwaj¹

(1)

New Delhi, India

We are continuing to discuss specific use cases of AIOps, and in this chapter we explain and implement automated baselining, which is one of the most important and frequently used features in AIOps.

Automated Baselining Overview

In traditional monitoring tools, a baseline is static and rule-based and is set at predetermined levels. An example of a baseline could be CPU utilization, which is set at, say, 95 percent. An event gets triggered when the CPU utilization of a server goes above this threshold. The challenge with this approach is that baselines or thresholds cannot be static. Consider a scenario where an application runs a CPU-intensive data processing job at 2 a.m. local time and causes 99 percent CPU utilization until the job ends, say, at 2:30 a.m. This is the usual behavior of the application, but because of the static baseline threshold, an event gets generated. This alarm doesn’t need any intervention, as it will automatically close once the job completes and the CPU utilization returns to normal. There are many such false positive events in operations due to this static baselining of thresholds.

Automated baselining helps in such scenarios because it considers the dynamic behavior of systems. It analyzes historical performance data to detect the real performance issues that need attention and eliminates the noise by adjusting the baseline threshold. In the given scenario, the system would automatically raise the threshold between 2 a.m. and 2:30 a.m., so the routine job would not generate an event. However, if the same machine exhibits a CPU spike at, say, 9 a.m., it would generate an event.
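The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular AIOps product: it assumes hourly CPU samples, groups the history by hour of day, and sets each hour's threshold to the mean plus three standard deviations. The function names, the synthetic history, and the three-sigma rule are all assumptions made for this example.

```python
import numpy as np

def hourly_thresholds(hours, cpu, n_sigma=3.0):
    """Return one threshold per hour of day (0-23): mean + n_sigma * std."""
    hours = np.asarray(hours)
    cpu = np.asarray(cpu, dtype=float)
    thresholds = np.empty(24)
    for h in range(24):
        samples = cpu[hours == h]
        thresholds[h] = samples.mean() + n_sigma * samples.std()
    return thresholds

def is_anomalous(hour, value, thresholds):
    """True when the value exceeds the learned threshold for that hour."""
    return value > thresholds[hour]

# Synthetic history (assumed): 30 days of hourly samples, about 40 percent CPU
# normally and about 99 percent during the 2 a.m. batch job.
rng = np.random.default_rng(0)
hist_hours = np.tile(np.arange(24), 30)
hist_cpu = np.where(hist_hours == 2, 99.0, 40.0) + rng.normal(0, 2, hist_hours.size)

thresholds = hourly_thresholds(hist_hours, hist_cpu)
print(is_anomalous(2, 99.0, thresholds))  # the nightly job stays within its baseline
print(is_anomalous(9, 99.0, thresholds))  # the same value at 9 a.m. is flagged
```

A per-hour baseline like this captures only daily seasonality; the time-series models discussed later in the chapter handle trends and longer seasonal cycles.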

It is important to note that the dynamic baseline of a threshold will work a bit differently in a microservices architecture where the underlying infrastructure will scale up based on increased utilization or load. These aspects of application architecture and infrastructure utilization need to be considered for dynamic baselining.

Noise in operations causes the following inefficiencies:

Increases volume of events for operations to manage, which in turn increases time and effort of operation teams.
Clutters the operator console with false positive events, which leads to missing important qualified and actionable events.
Causes failures in automated diagnosis and remediation. Since the event itself is false, the automated resolution would be triggered unnecessarily.

Automated baselining reduces noise and related inefficiencies. Automated baselining can be achieved by leveraging supervised machine learning techniques that learn from past data and predict dynamic thresholds based on daily, weekly, monthly, and yearly seasonality of the data. From an AIOps perspective, there are three prominent algorithms that are core to implementing automated baselining. We will explore these approaches in this chapter starting with regression algorithms.

Regression

This class of algorithms is used to determine the relationship between multiple variables and predict the value of a target variable, for example, predicting the next month’s sales revenue based on a historical analysis of marketing spend. Linear regression originated in statistics but is used extensively in machine learning for predictive modeling in many domains, such as predicting stock prices, the sales-to-price ratio of a product, and housing prices. There are multiple variants of the regression algorithm, but linear regression is one of the most used in machine learning. Figure 7-1 lists some of the important linear regression algorithms available, and we will be discussing linear regression in the next section.

Linear Regression

This algorithm is applicable where the dataset shows a linear relationship between input and output variables and the output variable has continuous values, such as percentage utilization of CPU, etc.

In linear regression, input variables, referred to as independent variables or features, from the dataset get chosen to predict the values of the output variable, referred to as the dependent variable or target value. Mathematically, the linear regression model establishes a relationship between the dependent variable (Y) and one or more independent variables (Xi) using a best-fit straight line (called the regression line) and is represented by the equation Y = a + bXi + e, where a is the intercept, b is the slope of the line, and e is the error in prediction from the actual value. When there are multiple independent variables, it’s called multiple linear regression.

Let’s first understand simple linear regression that has only one independent variable.

In Figure 7-2’s graph, there is a set of actual data points that exhibit a linear relationship, and an imaginary straight line runs somewhere in the middle of these data points.

This imaginary straight line is the regression line, which consists of predicted data points. The distance between each actual data point and the line is called an error. Multiple lines can be drawn through these data points, so the algorithm finds the best line by calculating the distances to all points and minimizing the total error. Whichever line gives the smallest error becomes the regression line and provides the best prediction. From an AIOps perspective, linear regression provides a predictive capability based on correlation, but it does not show causation.
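The search for the line with the smallest error can be sketched as follows. The sketch assumes the usual least-squares criterion (the sum of squared vertical distances), for which the slope and intercept have a closed-form solution. The function names and the sample values are illustrative assumptions and are not taken from the chapter's dataset.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b for y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

def sum_squared_error(x, y, a, b):
    """Total squared vertical distance between the points and the line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - (a + b * x)) ** 2))

# Illustrative data (assumed): CPU utilization in percent vs. response time in seconds
cpu = [20, 40, 60, 80]
resp = [1.1, 1.9, 3.2, 3.8]

a, b = fit_line(cpu, resp)
best = sum_squared_error(cpu, resp, a, b)
# Any other line, for example one with a slightly different slope, has a larger error
other = sum_squared_error(cpu, resp, a, b + 0.01)
print(a, b, best, other)
```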

Now let’s understand the implementation of linear regression to forecast the database response time (in seconds) based on the memory utilization (MB) and CPU utilization (percent) of the server. Here we have two independent variables, namely, CPU and memory, so we will be using multiple linear regression models to predict the database response time based on the specific utilization value of CPU and memory.

You can download the code from https://github.com/dryice-devops/AIOps/blob/main/Ch-7_AutomatedBaselinig-Regression.ipynb.

Let’s begin by obtaining the data for the parameters CPU Utilization, Memory Utilization, and DB Response Time from the file https://github.com/dryice-devops/AIOps/blob/main/data.xlsx, a sample of which is shown in Table 7-1.

Table 7-1

CPU and Memory Utilization Mapping with DB Response Time

[Table with three columns: DB response time, CPU usage, and memory utilization (MB)]

Along with Pandas and Matplotlib, we will be using the additional Python libraries listed here:

NumPy: This is the most widely used library in Python development to perform mathematical operations on matrices, linear algebra, and Fourier transform.
Seaborn: This is another data visualization library that is based on the Matplotlib library and primarily used for generating informative statistical graphics.
Sklearn: This is one of the most important and widely used libraries for machine learning–related development as it contains various tools for modeling and predictions.

Let’s begin writing code by importing the required libraries and functions, as shown here:

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_absolute_error

from sklearn.metrics import r2_score

from sklearn.metrics import mean_squared_error

%matplotlib inline

import matplotlib as mpl

import matplotlib.pyplot as plt

Now read the performance data from the input file and observe the top five values in the dataset.

df = pd.read_excel("data.xlsx")

df.head()

Figure 7-3 shows the sample values for CPU and memory utilization and the corresponding database response times.

Figure 7-3
CPU and memory utilization mapping with DB response time

In the dataset we have the values of the parameters DBResponseTime, CPUUsage, and MemoryUtilizationMB as three different columns. Let’s find out how many data points are present in the dataset for analysis and prediction.

df.shape

As shown in Figure 7-4, there are a total of 24 rows in the dataset, each providing values for the three variables (two independent and one dependent).

Figure 7-4
Count of data points in DataSet

One of the important conditions for the applicability of linear regression algorithms is that there must exist a linear relationship between the input and output data points. Let’s first find out if the data satisfies this condition by plotting a graph between DB Response Time and CPU Usage.

sns.jointplot(x=df['DBResponseTime'], y=df['CPUUsage'],

data=df, kind='reg')

As per the graph in Figure 7-5, there exists a linear relationship between CPU Utilization and DB Response Time.

Figure 7-5
Impact on DB ResponseTime due to CPU utilization

Similarly, let’s validate the linear relationship between DB Response Time and Memory Usage.

sns.jointplot(x=df['DBResponseTime'], y=df['MemoryUtilizationMB'],

data=df, kind='reg')

Figure 7-6 also shows that a linear relationship exists between DB Response Time and Memory Usage, which satisfies the condition of applying the linear regression algorithm.

Figure 7-6
Impact of DB ResponseTime due to memory utilization

Next, you need to extract the data points of the independent variables (CPU and Memory usage) into X and the dependent variable (DB Response Time) into Y.

X = df[['MemoryUtilizationMB','CPUUsage']]

Y = df['DBResponseTime']

As discussed in supervised learning, the entire labeled dataset needs to be split into training data for learning purposes and testing data for validating the accuracy or quality of the learning done by the model. In our example, we split the dataset in a ratio of 80 percent training data to 20 percent testing data.

x_train, x_test, y_train, y_test = train_test_split(X, Y,

test_size = 0.2, random_state = 42)

At this stage, we will invoke LinearRegression from the Sklearn library and fit the model to the training data (by calling the fit method) so that it learns the coefficients of the regression equation.

LR = LinearRegression()

# fitting the training data

LR.fit(x_train,y_train)

This model can now be used to predict values on the test data.

y_prediction = LR.predict(x_test)

y_prediction

Based on the CPU and Memory Utilization values present in the test data, our model has forecasted DB Response Time Values, as shown in the output in Figure 7-7.

Out[8]: array([3.60063185, 5.02933512, 2.49967556, 5.33846418, 3.92287079])

Figure 7-7

Forecasted database response time on test dataset

Let’s compare the predicted values from the ML model with the actual observed values on the test dataset.

indices = np.arange(len(y_prediction))

width = 0.20

# Plotting

plt.bar(indices, y_prediction, width=width)

# Offsetting by width to shift the bars to the right

plt.bar(indices + width, y_test, width=width)

plt.xticks(ticks=indices)

plt.ylabel("DB Response Time")

plt.xlabel("Test Dataset ")

plt.title("Predicted vs. Actual")

valuesType=['Predicted','Actual']

plt.legend(valuesType,loc=2)

plt.show()

As observed in the output in Figure 7-8, the predicted values are very close to the actual observed values. This indicates that the model is performing well.

Figure 7-8
Accuracy analysis of DB Response Time prediction on test data

Let’s evaluate the performance of the model by calculating its R2 score.

score=r2_score(y_test,y_prediction)

print("r2 score is ", score)

print('Root Mean Squared Error is =',np.sqrt(mean_squared_error(y_test,

y_prediction)))

r2 score is 0.9679128701220477

Root Mean Squared Error is = 0.2102141626503609

As per the output, the r2 score value is 0.9679, which means that the model explains around 97 percent of the variance in the database response time on the test data for the given CPU and memory utilization, which is quite good. The root mean squared error of about 0.21 indicates that predictions are typically off by roughly 0.2 seconds. Based on the complexity and requirements, this model can be further expanded to include other independent variables such as the number of transactions, user connections, available disk space, etc. This makes the model usable in production for capacity planning, upgrades, or migration-related tasks that improve DB response time, which will eventually improve the application performance and user experience.
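To make the two metrics concrete, the following sketch computes them by hand. It assumes the standard definitions: the r2 score is one minus the ratio of the residual sum of squares to the total sum of squares, and the root mean squared error is the square root of the mean squared residual. The helper name and the sample values are assumptions for illustration and are not the chapter's test data.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Compute the r2 score and the root mean squared error from first principles."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return float(r2), float(rmse)

# Illustrative values (assumed): actual vs. predicted response times in seconds
actual = [2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 3.0, 4.0, 4.5]

r2, rmse = r2_and_rmse(actual, predicted)
print(r2, rmse)
```

Here the residual sum of squares is 0.5 and the total sum of squares is 5.0, so the r2 score is 0.9: the predictions account for 90 percent of the variation in the actual values.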

The linear regression model, which predicts by correlating the dependent variable with one or more independent variables, has a few limitations. The first big challenge is that the correlations between variables are independent of any time series associated with them: linear regression cannot predict the value of the dependent variable at a “specific time” with given values of independent variables. Second, prediction is entirely based on the independent variables without considering the values of the dependent variable from the recent past. To overcome such challenges, you can use a time-series model, which we will discuss in the next section.

Note

Technically, time can also be used as an independent variable in a linear regression, but this is not the right approach. Time-series models should be used instead of linear regression models in such scenarios.

Time-Series Models

To model scenarios that are based on time series and forecast what values that data is going to take into the future at a specified time interval, you can leverage time-series modeling. Let’s understand a few important terms required in time-series modeling.

Time-Series Data

Any data that is collected at regular time intervals is called time-series data. Data collection frequency can be anything like hourly, daily, weekly, etc., as long as it is happening at regular intervals. The following are common examples of time-series data:

Stock price data: The maximum and minimum prices of stock are usually maintained at a daily level. It is then further analyzed on a monthly or yearly time frame to make business decisions.
Calls in call center: The call volume at every hour is maintained. It is then analyzed to determine the peak hour load on a monthly or yearly time frame from a resourcing perspective.
Sales data: Sales data is important for any business, and usually monthly sales figures are maintained to forecast future sales or profitability of the organization.
Website hits: Website hit volume is usually maintained every five to ten minutes to determine the load on the application, detect any potential security threats, or scale the infrastructure to meet business requirements.

Every time-series data will have a date or time column along with one or more data columns. If there is only one data column, then it’s called a univariate time series, and if there are multiple data columns, then it’s called a multivariate time series. Figure 7-9 is the example of a univariate time series with the daily average CPU utilization of a server.

For the purpose of analysis, a univariate time series may need to be converted into multivariate by adding some additional supporting columns. With the help of TimeStamp, we have added another column, Day of the week, and from Figure 7-10 we can observe that there is an unusual spike in CPU utilization on weekends as compared to weekdays.

Figure 7-10
TimeSeries analysis of CPU utilization
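Deriving a supporting column from the timestamp can be sketched with Pandas as follows. The column names and the utilization values are assumptions made for illustration; only the technique (extracting the day of the week from a datetime column) reflects the text.

```python
import pandas as pd

# Illustrative univariate series (values are made up): daily average CPU utilization
cpu = pd.DataFrame({
    "TimeStamp": pd.date_range("2020-07-01", periods=14, freq="D"),
    "CPUUtilization": [8.3, 8.8, 9.1, 21.5, 22.0, 8.5, 8.9,
                       8.4, 8.7, 9.0, 20.8, 21.9, 8.6, 8.2],
})

# Derive supporting columns from the timestamp
cpu["DayOfWeek"] = cpu["TimeStamp"].dt.day_name()
cpu["IsWeekend"] = cpu["TimeStamp"].dt.dayofweek >= 5  # Saturday=5, Sunday=6

# Compare the average utilization on weekends and weekdays
weekend_avg = cpu.loc[cpu["IsWeekend"], "CPUUtilization"].mean()
weekday_avg = cpu.loc[~cpu["IsWeekend"], "CPUUtilization"].mean()
print(cpu)
print(weekend_avg, weekday_avg)
```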

Stationary Time Series

One of the most important statistical properties of a time series is stationarity, where the mean and the variance of the time series do not change over time, as shown in Figure 7-11. In simple terms, a stationary time series can have different values with different timestamps, but the underlying logic or method to generate those values remains constant and does not change.

Lag Variable

A lag variable is the dependent variable shifted back in time. Lag-1 (Y_t-1) represents the value of the dependent variable Y_t one unit of time earlier, Lag 2 (Y_t-2) represents its value two units of time earlier, and so on.

Consider the earlier example of time series on CPU utilization at specific timestamps. In this example, Lag-1 (Y_t-1) at 7/3/20 2:00 PM will be the value of Y_t at 7/2/20 2:00 PM, which is 8.82, and Lag 2 (Y_t-2) will have value of Y_t at 7/1/20 2:00 PM, which is 8.33, as shown in Figure 7-12.

Mathematically, if there exists a relationship or correlation between Y_t and its lagged values Y_t-1, Y_t-2, etc., then a model can be created to detect the pattern and predict future values of the same series using its lagged values. For example, assume that CPU utilization rises to around 100 percent every Saturday; then the value at Lag-7 (Y_t-7) can be considered for prediction.
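Lag variables are straightforward to build with the Pandas shift method, as the following sketch shows. The first two utilization values are the ones quoted in the example above; the remaining values and the column names are assumptions added for illustration.

```python
import pandas as pd

# Daily CPU utilization at 2:00 PM; the first two values come from the text,
# the remaining values are made up for illustration
y = pd.Series(
    [8.33, 8.82, 9.10, 8.95],
    index=pd.to_datetime(["2020-07-01 14:00", "2020-07-02 14:00",
                          "2020-07-03 14:00", "2020-07-04 14:00"]),
    name="Y_t",
)

lags = pd.DataFrame({
    "Y_t": y,
    "Lag_1": y.shift(1),   # value one time step earlier
    "Lag_2": y.shift(2),   # value two time steps earlier
})
print(lags)
```

The first rows of the lag columns are empty (NaN) because no earlier observation exists for them; such rows are usually dropped before modeling.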

The next step is to determine the strength of the relationship between Y_t and its lags and how many lags have statistically significant relationship with Y_t that we can use in the prediction model. That’s where ACF and PACF plots help, and we will be discussing these next.

ACF and PACF

Both ACF and PACF are statistical methods used by researchers to understand the temporal dynamics of an individual time series, which basically means detecting correlation over time, often called autocorrelation. A detailed discussion on ACF and PACF is beyond the scope of this book, so we will limit the discussion to the basics needed to use them in AIOps model development.

ACF is a statistical term that represents the autocorrelation function and is used to measure the correlation of time-series values Y_t with its own lags Y_t-1, Y_t-2, Y_t-3, etc. This correlation of various lags can be visualized using an ACF plot on a scale of 1 to -1, where 1 represents a perfect positive relationship and -1 represents a perfect negative relationship.

As shown in Figure 7-13, the y-axis represents the correlation value, and the x-axis represents the lags. A series is always perfectly correlated with itself, which is why the Lag 0 correlation value in the ACF plot reaches 1. The ACF plot is useful in detecting the pattern in the correlated data.

There is a red line shown on both sides in the ACF plot called the significance line, which is creating a range or band referred to as the error band. The lags that cross this line have a statistically significant relationship with variable Y. In a sample ACF plot, three lags are crossing this line, so up to Lag 3, there is a statistically significant relationship, and anything within this band is not statistically significant.
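The following sketch computes a sample autocorrelation function directly and compares each lag against a significance band. It assumes the common approximation of plus or minus 1.96 divided by the square root of the number of observations for a 95 percent band; plotting libraries may compute the band differently. The function name and the synthetic weekly series are assumptions for illustration.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    y = np.asarray(series, dtype=float)
    y = y - y.mean()
    denom = np.sum(y ** 2)
    return np.array([np.sum(y[k:] * y[:y.size - k]) / denom
                     for k in range(max_lag + 1)])

# Synthetic series (assumed): a weekly pattern repeated for 20 weeks plus noise
rng = np.random.default_rng(1)
week = np.array([10, 10, 10, 10, 10, 30, 30], dtype=float)
series = np.tile(week, 20) + rng.normal(0, 1, 140)

acf_values = sample_acf(series, max_lag=14)
band = 1.96 / np.sqrt(series.size)   # approximate 95 percent significance band
significant_lags = [k for k in range(1, 15) if abs(acf_values[k]) > band]
print(np.round(acf_values, 2))
print(significant_lags)
```

Because the synthetic series repeats every seven steps, the autocorrelation at lag 7 is strongly positive and lies well outside the band.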

PACF is another statistical term that represents the partial autocorrelation function, which is similar to an ACF plot except that it considers the strength of the relationship between the variable and lag at a specific time after removing the effects of all intermediate lags.

For example, in the sample ACF plot in Figure 7-14, there is a significant correlation up to three lags, but the PACF plot shows how strong the correlation of Y_t-3 alone with Y_t is after removing any impact of Y_t-1 and Y_t-2. PACF can be considered a filtered version of ACF.

Now we are all set to implement the two most commonly used time-series models, ARIMA and SARIMA, for noise reduction by predicting the appropriate monitoring threshold of parameters. This is one of the most common use cases in an AIOps implementation.

ARIMA

ARIMA is one of the foundational time-series forecasting techniques; it is popular and has been used extensively for decades. The concept was first introduced by George Box and Gwilym Jenkins in 1970 in a book titled Time Series Analysis: Forecasting and Control. They defined a systematic method, called Box-Jenkins analysis, to identify, fit, and use an autoregressive (AR) technique integrated (I) with a moving average (MA) technique, resulting in an ARIMA time-series model.

ARIMA leverages the pattern and information contained in a time series to predict its future values. So, can ARIMA forecast any time series? How about predicting the price of your favorite stocks using ARIMA considering its historical prices? Technically, forecasting can be done, but it might not be accurate at all. For example, the price of a stock is controlled by multiple external factors such as the company’s quarterly results, government regulations, weather conditions, competitors’ prices, etc. Without considering all such factors, ARIMA cannot provide meaningful predictions. It is important to note that ARIMA should be used only for those time series that exhibit the following properties:

It should be nonseasonal and stationary, or it should become stationary after differencing.
Past values should have a sufficient pattern (not just a random white noise) and information to predict its future values.
It should be of medium to long length with at least 50 observations.

Model Development

The ARIMA model has three components.

Differencing (d)

As discussed earlier, the ARIMA model needs the series to be stationary, but then the challenges are as follows:

How to attain stationarity in a series
How to validate if the series becomes stationary or not

There are two approaches to change nonstationary series into stationary.

Perform an appropriate transformation, such as a log transformation or square root transformation, to change a nonstationary series into a stationary series. This approach is relatively complex.
A more reliable and widely used approach is to perform differencing on a series to convert it into stationary. In this approach, the difference between consecutive values in a series gets calculated, and then the resultant series is validated whether or not it is stationary.
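Differencing can be illustrated with a small example. The series below is an assumed, perfectly linear trend; one round of differencing turns it into a constant series, so no further differencing is needed.

```python
import numpy as np

# Illustrative series (assumed): CPU utilization growing by 2 points per step
series = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])

first_diff = np.diff(series, n=1)    # y[t] - y[t-1]
second_diff = np.diff(series, n=2)   # difference of the differences

print(first_diff)
print(second_diff)
```

Each round of differencing shortens the series by one value. The Pandas diff() method performs the same calculation but keeps the original length by placing NaN in the first position.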

One of the most used statistical tests to determine whether a time series is stationary is the augmented Dickey-Fuller test (ADF test), which is a type of unit root test that determines how strongly a time series is defined by a trend. In simple terms, the ADF test tests a null hypothesis (H0) that the time series has a unit root, that is, some time-dependent structure, and hence is nonstationary. If this null hypothesis is rejected, the time series is considered stationary. The ADF test calculates a p-value, and if it is less than 0.05 (5 percent), the null hypothesis is rejected. Interested readers can explore more about the internal functioning and mathematics behind unit root tests and ADF tests online.

We need to continue performing differencing and validating the results via an ADF test until we get a near-stationary series with a defined mean. The number of times differencing is performed to make the time series stationary is represented by the model variable d.

It is important not to over-difference the series. An over-differenced series is still stationary, but the model parameters are affected and the accuracy of the predictions suffers. One sign of over-differencing is a strongly negative autocorrelation at Lag 1.

Autoregression or AR (p)

This component, known as autoregression, shows that the dependent variable Y_t can be calculated as a function of its own lags Y_t-1, Y_t-2, and so on.

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + \varepsilon_t$$

Here,

α is the constant term (intercept).

β is the coefficient of lag with a subscript indicating its position.

ε is the error or white noise.
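The autoregression equation can be turned into a one-step forecast as sketched below. The function name, the coefficients, and the history values are hypothetical and chosen only to make the arithmetic easy to follow; in practice the coefficients are estimated when the model is fitted.

```python
def ar_forecast(alpha, betas, history):
    """One-step AR(p) forecast: alpha + beta_1*Y_{t-1} + ... + beta_p*Y_{t-p}.

    history holds past values in time order, most recent last.
    The error term is unknown in advance, so its expected value (0) is used.
    """
    p = len(betas)
    if len(history) < p:
        raise ValueError("need at least p past values")
    recent_first = history[::-1][:p]          # Y_{t-1}, Y_{t-2}, ..., Y_{t-p}
    return alpha + sum(b * y for b, y in zip(betas, recent_first))

# Hypothetical AR(2) coefficients and CPU history, chosen only for illustration
forecast = ar_forecast(alpha=1.0, betas=[0.5, 0.25], history=[8.0, 12.0, 10.0])
print(forecast)
```

With these values, the forecast is 1.0 + 0.5 × 10.0 + 0.25 × 12.0 = 9.0.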

We already discussed that the PACF plot shows the partial autocorrelation between the series and its lag after excluding the effect from the intermediate lags. We can use the PACF plot to find the last lag that crosses the significance line, which is a good initial estimate of the order of the AR model. This value is represented by the model variable p.

Based on the sample PACF plot in Figure 7-15, we can observe that lag 1 seems to be the last lag crossing the significance line. So, we can set p as 1 and represent it as AR(1).

Moving Average or MA (q)

This component, known as moving average, shows that the dependent variable Y_t can be calculated as a function of the weighted average of the forecast errors in previous lags.

$$Y_t = \alpha + \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_2 \varepsilon_{t-2} + \cdots + \phi_q \varepsilon_{t-q}$$

α is usually the mean of the series.

ε is the error from the autoregressive models at respective lags.
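The moving average equation can be sketched in the same way as a one-step forecast built from past forecast errors. The function name, the weights, and the error values are hypothetical and serve only to illustrate the calculation.

```python
def ma_forecast(alpha, phis, past_errors):
    """One-step MA(q) forecast: alpha + phi_1*e_{t-1} + ... + phi_q*e_{t-q}.

    past_errors holds earlier forecast errors in time order, most recent last.
    The current error e_t is unknown in advance, so its expected value (0) is used.
    """
    q = len(phis)
    if len(past_errors) < q:
        raise ValueError("need at least q past errors")
    recent_first = past_errors[::-1][:q]      # e_{t-1}, e_{t-2}, ..., e_{t-q}
    return alpha + sum(phi * e for phi, e in zip(phis, recent_first))

# Hypothetical MA(2) weights and past forecast errors, chosen only for illustration
forecast = ma_forecast(alpha=10.0, phis=[0.5, 0.25], past_errors=[-1.0, 2.0, 4.0])
print(forecast)
```

With these values, the forecast is 10.0 + 0.5 × 4.0 + 0.25 × 2.0 = 12.5.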

MA can detect trends and patterns in time-series data. You can use the ACF plot to determine the value of model variable q, which represents how many moving average terms are required to remove any autocorrelation in a stationary series. From the ACF plot in Figure 7-16, we can observe that there is a cut-off after lags 2 and 6. So, we can try q = 2, that is, MA(2), or q = 6, that is, MA(6).

Both the AR and MA equations need to be integrated (I or d) to create the ARIMA model, which is represented as ARIMA (p,d,q).

ARIMA acts as a foundation for various other autoregressive time-series models as follows:

SARIMA: If there is a repetitive pattern at a regular time interval, then it’s referred to as seasonality, and the SARIMA model is used where S stands for seasonality. We will explore this more in the next section.
ARIMAX: If the series depends on external factors, then the ARIMAX model is used, where X stands for external factors.
SARIMAX: If along with seasonality there is a strong influence of external factors, then the SARIMAX model is used.

One of the big limitations of ARIMA is its inability to consider the seasonality in time-series data, which is common in real-world scenarios: exceptionally high sales figures during the Thanksgiving and Christmas holiday period every year, high expenditure during the first week of every month, high footfall in shopping malls during weekends, and so on. Such seasonality is considered in another algorithm, called SARIMA, which is a variant of the ARIMA model.

SARIMA

A cyclical pattern at a regular time interval is referred to as seasonality and should be considered in a prediction model to improve the accuracy of prediction. That’s how the ARIMA model is modified to develop SARIMA, which considers seasonality in time-series data and supports univariate time-series data with a seasonal component.

Along with ARIMA variables, SARIMA needs four new components.

p: Autoregression (AR) order of time series.
d: Difference (I) order of time series.
q: Moving average (MA) order of time series.
m (or lag): Seasonal factor, the number of time steps in a single seasonal period. For example, for yearly seasonality, m = 12 if the data is monthly, m = 4 if the data is quarterly, and m = 52 if the data is weekly.
P: Seasonal autoregressive order.
D: Seasonal difference order.
Q: Seasonal moving average order.

The final notation for an SARIMA model is specified as SARIMA (p,d,q)(P,D,Q)m.
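The role of the seasonal factor m can be illustrated with seasonal differencing, the operation that the seasonal difference order D refers to: each value is compared with the value one full seasonal period earlier. The sketch below assumes daily data with a weekly cycle, so m = 7; the function name and the synthetic series are assumptions for illustration.

```python
import numpy as np

def seasonal_difference(series, m):
    """Subtract the value from one seasonal period (m steps) earlier."""
    y = np.asarray(series, dtype=float)
    return y[m:] - y[:-m]

# Synthetic daily series (assumed): the same weekly pattern repeated four times
week = np.array([10, 10, 10, 10, 10, 30, 30], dtype=float)
series = np.tile(week, 4)

deseasonalized = seasonal_difference(series, m=7)
print(deseasonalized)
```

Because the weekly pattern repeats exactly, one seasonal difference removes it completely and leaves a series of zeros. Real data would leave a residual that the nonseasonal terms then model.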

Implementation of ARIMA and SARIMA

Let’s perform a sample implementation of the ARIMA and SARIMA models on CPU utilization data to predict its thresholds.

Begin by importing the required Python libraries.

# Library for Data Manipulation

import pandas as pd

import numpy as np

# Library for Data Visualization

import matplotlib.pyplot as plt

# Library for Time Series Models

from statsmodels.tsa.stattools import adfuller

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

from statsmodels.tsa.arima.model import ARIMA

from statsmodels.tsa.stattools import acf

from statsmodels.tsa.seasonal import seasonal_decompose

# Library for Machine Learning

from sklearn.metrics import mean_squared_log_error

# Library for Track time

from datetime import datetime

Read data on CPU utilization from a CSV in a Pandas data frame and check information about the data types, column names, and count of values present in imported data.

cpu_util_raw_data= pd.read_csv("cpu_utilization-data.csv")

cpu_util_raw_data.info()

As shown in Figure 7-17, there are two columns, date_time and cpu_utilization, and a total of 396 data points (first row is the header) are available for analysis.

Figure 7-17
Overview of data in data file

Set the data type of column date_time to DateTime, set it as an index to sort the data points in a data frame, and plot the CPU utilization on a time scale, as shown in Figure 7-18.

total_records = cpu_util_raw_data.count()['cpu_utilization']

cpu_util_raw_data['date_time'] = pd.to_datetime(cpu_util_raw_data['date_time'])

cpu_util_raw_data.set_index('date_time',inplace=True)

cpu_util_raw_data = cpu_util_raw_data.sort_values(by="date_time")

fig = plt.figure(figsize =(10, 5))

cpu_util_raw_data['cpu_utilization'].plot()

Figure 7-18
CPU utilization on time scale

It is important to check for the outliers in the data using a box plot. The presence of outliers impacts the quality of predictions. If there are a high number of outliers in data, then they need to be either processed or removed from the dataset based on the domain requirements.

# Global variables
NOISE = False
MAPE = True
TEST_DATA_SIZE = 0.2  # fraction of the data reserved for testing
SEASONAL = False

def getOutliers(data, col):
    # Outlier limits based on the interquartile range (IQR)
    Q3 = data[col].quantile(0.75)
    Q1 = data[col].quantile(0.25)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    return lower_limit, upper_limit

lower, upper = getOutliers(cpu_util_raw_data, 'cpu_utilization')
outliers_count = cpu_util_raw_data[(cpu_util_raw_data['cpu_utilization']<lower) | (cpu_util_raw_data['cpu_utilization']>upper)].count()['cpu_utilization']
outlier_percentage = ((outliers_count / total_records) * 100)
if outlier_percentage > 20:
    NOISE = True
print(f"Outlier Percentage in data: {outlier_percentage}%")

# Render box plot
cpu_util_raw_data.boxplot('cpu_utilization', figsize=(10,10))

zero_values = cpu_util_raw_data[(cpu_util_raw_data['cpu_utilization']==0)].count()['cpu_utilization']
zero_values_percentage = ((zero_values / total_records) * 100)
if zero_values_percentage > 20:
    MAPE = False
print("Zero Value percentage in data: ", zero_values_percentage)

Based on Figure 7-19, there are no outliers and no “zero” values, so let’s continue with the given dataset.

Figure 7-19
Box plot analysis for outliers
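For readers who want to see the interquartile range rule at work on a dataset that does contain an outlier, here is a small worked sketch. The readings are made up for illustration, and the helper uses NumPy percentiles rather than the Pandas quantile method used above; with default settings both interpolate in the same way.

```python
import numpy as np

def iqr_limits(values):
    """Lower and upper outlier limits using the 1.5 * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Illustrative CPU utilization readings (made up), with one extreme value
readings = np.array([10, 12, 13, 14, 15, 16, 18, 95], dtype=float)

lower, upper = iqr_limits(readings)
outliers = readings[(readings < lower) | (readings > upper)]
outlier_percentage = outliers.size / readings.size * 100
print(lower, upper, outliers, outlier_percentage)
```

Here the quartiles are 12.75 and 16.5, so the limits are 7.125 and 22.125, and only the reading of 95 falls outside them.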

Let’s check for the stationarity of a time series using an ADF test. Based on an ADF test, if the p-value comes out to more than 0.05, it means the time series is nonstationary and needs to be differenced using the diff() function and again checked with an ADF test for stationarity.

diff_count = 0
differencing_order = {
    1: lambda x: x['cpu_utilization'].diff(),
    2: lambda x: x['cpu_utilization'].diff().diff(),
    3: lambda x: x['cpu_utilization'].diff().diff().diff(),
    4: lambda x: x['cpu_utilization'].diff().diff().diff().diff(),
    5: lambda x: x['cpu_utilization'].diff().diff().diff().diff().diff()
}

# Repeat the ADF test, increasing the differencing order until the series is stationary
while True:
    if diff_count == 0:
        adftestresult = adfuller(cpu_util_raw_data['cpu_utilization'].dropna())
    else:
        adftestresult = adfuller(differencing_order[diff_count](cpu_util_raw_data).dropna())
    print('#' * 60)
    print('ADF Statistic: %f' % adftestresult[0])
    print('p-value: %f' % adftestresult[1])
    print(f'ADF Test Result: The time series is {"non-" if adftestresult[1] >= 0.05 else ""}stationary')
    print('#' * 60)
    if adftestresult[1] < 0.05 or diff_count >= len(differencing_order):
        break
    diff_count += 1

print("Differencing order to make data stationary: ",diff_count)

The ADF result on the differenced time series shows a p-value that rounds to 0, which confirms that the series is now stationary. This can be visualized by plotting both the original series and the differenced series on a time scale graph, as shown in Figure 7-20.

fig, ax = plt.subplots(figsize=(10,8), dpi=100)

# Differencing

ax.plot(cpu_util_raw_data.cpu_utilization[:], label='Original Series')

ax.plot(cpu_util_raw_data.cpu_utilization.diff(1), label='1st Order Differencing')

ax.set_title('1st Order Differencing')

ax.legend(loc='upper left', fontsize=10)

plt.suptitle('CPU Usage - Time Series Dataset', fontsize=16)

plt.show()

Figure 7-20
Time-series stationarity check

As we have done differencing only one time, we can set the value of the differencing (d) component of ARIMA as d=1.
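To make the effect of differencing concrete, here is a minimal plain-Python sketch that does not depend on pandas. The helper name difference and the sample series are illustrative assumptions. Unlike pandas diff(), which keeps a leading NaN, this helper simply returns a shorter list.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: y'[t] = y[t] - y[t-1]."""
    result = list(series)
    for _ in range(order):
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


# A series with a linear trend becomes constant after one difference
linear_trend = [3, 5, 7, 9, 11]
print(difference(linear_trend))              # [2, 2, 2, 2]

# A quadratic trend needs two rounds of differencing
quadratic_trend = [1, 4, 9, 16, 25]
print(difference(quadratic_trend, order=2))  # [2, 2, 2]
```

The number of rounds needed before the series looks stationary is what the d component of ARIMA records.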

Next let’s plot the ACF and PACF graph to get the value of the AR (p) and MA (q) components.

plt.rcParams.update({'figure.figsize':(7, 4), 'figure.dpi':120})

plot_pacf(cpu_util_raw_data.cpu_utilization.diff().dropna());

plot_acf(cpu_util_raw_data.cpu_utilization.diff().dropna());

In the PACF graph in Figure 7-21, the lags up to lag 6 cross the significance line, so the AR component can initially be set to p=6 for modeling.

In the ACF graph in Figure 7-22, there is a sharp cut-off at lag 2, and lag 7 also crosses the significance line, so we can try the MA component as q=2 or q=7 for modeling.
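The idea of a lag "crossing the significance line" can be sketched without any plotting library. The following example computes the sample autocorrelation for a few lags and compares each value against a common approximate 95 percent band of ±1.96/√n. The function name sample_acf and the alternating test series are assumptions for illustration. The sketch covers only the ACF, not the PACF, and the band drawn by a plotting function may be computed differently.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance_sum = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - lag] for t in range(lag, n)) / variance_sum
        for lag in range(max_lag + 1)
    ]


# Hypothetical series that flips sign at every step
series = [1, -1] * 10
acf_values = sample_acf(series, max_lag=3)

# Approximate 95% significance band for a white-noise series
band = 1.96 / math.sqrt(len(series))

for lag, value in enumerate(acf_values):
    marker = "significant" if abs(value) > band else "within band"
    print(f"lag {lag}: {value:+.2f} ({marker})")
```

Lag 0 is always 1 because a series is perfectly correlated with itself, which is why ACF charts are read from lag 1 onward.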

Next, split the data into training and test sets to be used for the ARIMA (6,1,2) model. Out of 396 data points, let’s use the first 300 (~75 percent) for training and the rest for testing, and plot them on a time scale, as shown in Figure 7-23.

# Plot train and test data

train_data = cpu_util_raw_data['cpu_utilization'][:300]

test_data = cpu_util_raw_data['cpu_utilization'][300:]

fig, ax = plt.subplots(figsize=(10,5), dpi=100)

ax.plot(train_data, label='Train Data Points')

ax.plot(test_data, label='Test Data Points')

ax.set_title('CPU Usage Timeline')

ax.legend(loc='upper left', fontsize=10)

plt.suptitle('Test-Train Data Splits', fontsize=16)

plt.show()

Figure 7-23
Time-series data split in test data and training data

Let’s create the ARIMA (6,1,2) model using the train data.

manual_arima_model = ARIMA(train_data, order=(6,1,2))

manual_arima_model_fit = manual_arima_model.fit()

print(manual_arima_model_fit.summary())

Figure 7-24 shows the output from the ARIMA model on the train data.

Figure 7-24
ARIMA model output result on training data

Before applying this model on the test data, the ARIMA model needs to be checked for accuracy with the chosen combination of component values, p,d,q. There are two key statistical measures to compare the relative quality of the different models.

Akaike information criterion (AIC): This measures the goodness of fit of the model while penalizing the number of estimated parameters, so a model that fits equally well with fewer parameters scores better. The lower the value of AIC, the better the model. The current ARIMA (6,1,2) model has an AIC value of 2113.
Bayesian information criterion (BIC): Like AIC, BIC penalizes the number of parameters, but its penalty also grows with the number of samples in the training dataset used for fitting. Here also a model with a lower BIC is preferred. The current ARIMA (6,1,2) model has a BIC value of 2150.

Based on PACF and ACF charts, multiple values can be tried to lower the AIC and BIC values. For now let’s proceed with ARIMA (6,1,2).
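The trade-off behind both criteria can be seen from their standard definitions, AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L), where k is the number of estimated parameters, n is the number of observations, and L is the maximized likelihood. The sketch below applies them to two made-up candidate fits; the log-likelihood values, parameter counts, and model names are hypothetical and are not taken from the models in this chapter.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return num_params * math.log(num_obs) - 2 * log_likelihood


# Hypothetical fits on the same 50 observations (values are made up)
candidates = {
    "small model": {"log_likelihood": -100.0, "num_params": 3},
    "large model": {"log_likelihood": -99.5, "num_params": 6},
}
num_obs = 50

scores = {
    name: (aic(c["log_likelihood"], c["num_params"]),
           bic(c["log_likelihood"], c["num_params"], num_obs))
    for name, c in candidates.items()
}
for name, (a, b) in scores.items():
    print(f"{name}: AIC={a:.1f}, BIC={b:.1f}")

best_by_aic = min(scores, key=lambda name: scores[name][0])
print("Preferred by AIC:", best_by_aic)
```

In this example the larger model fits slightly better, but not by enough to pay for its three extra parameters, so both criteria prefer the smaller one.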

Let’s plot the actual values in a training dataset with the model’s predicted value to analyze how close the actual and model predicted values are.

prediction_manual=manual_arima_model_fit.predict(dynamic=False,typ='levels')

fig, ax = plt.subplots(figsize=(10,5), dpi=100)

ax.plot(prediction_manual, label='Prediction')

ax.plot(train_data, label='Training Data')

ax.legend(loc='upper left', fontsize=10)

plt.suptitle('Actual VS Predicted CPU Utilization on Train Data', fontsize=16)

plt.show()

As observed in Figure 7-25, the model predictions are quite close to the actual values, indicating that the model fits the training data quite well.

Figure 7-25
Prediction analysis on training data

Now let’s validate the model performance on the test data.

# Forecast with 95% confidence interval

# One forecast step per test data point, rather than a hard-coded horizon

forecast = manual_arima_model_fit.get_forecast(steps=len(test_data))

manual_arima_fc = forecast.predicted_mean

manual_arima_conf = forecast.conf_int(alpha=0.05)

manual_arima_fc_series = pd.Series(manual_arima_fc, index=test_data.index)

manual_arima_lower_series = pd.Series(manual_arima_conf["lower cpu_utilization"],

                                      index=manual_arima_conf.index)

manual_arima_upper_series = pd.Series(manual_arima_conf["upper cpu_utilization"],

                                      index=manual_arima_conf.index)

plt.figure(figsize=(15,10), dpi=100)

plt.plot(train_data, label='Training Data')

plt.plot(test_data, label='Actual Data')

plt.plot(manual_arima_fc_series, label='Forecast Data')

plt.fill_between(manual_arima_lower_series.index, manual_arima_lower_series,

                 manual_arima_upper_series, color='gray', alpha=.15)

plt.title('Forecast vs Actuals')

plt.legend(loc='upper left', fontsize=8)

plt.show()

Figure 7-26 shows that the ARIMA (6,1,2) model performance on the test dataset is not good and its predictions are deteriorating over time.

Figure 7-26
Prediction analysis on test data

Let’s quantify the model performance with the metrics defined earlier.

def forecast_accuracy(forecast, actual):

    mape = np.mean(np.abs(forecast - actual)/np.abs(actual)*100)

    mae = np.mean(np.abs(forecast - actual))

    rmse = np.mean((forecast - actual)**2)**.5

    rmsle = np.sqrt(mean_squared_log_error(actual, forecast))

    return({'MAPE : ':mape, 'MAE : ': mae,

            'RMSE : ':rmse, 'RMSLE : ':rmsle

            })

forecast_accuracy(manual_arima_fc, test_data)

Based on the performance metrics, the model predictions are about 76 percent (MAPE = ~24 percent) accurate.
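The relationship between MAPE and the quoted accuracy figure (accuracy ≈ 100 − MAPE) is easier to follow with small numbers. The sketch below recomputes MAPE, MAE, and RMSE in plain Python on three made-up data points; the function name accuracy_metrics and the values are assumptions for illustration. It also shows why the earlier zero-value check matters: MAPE divides by the actual value, so it is undefined when an actual value is zero.

```python
import math


def accuracy_metrics(forecast, actual):
    """Return MAPE (percent), MAE, and RMSE for paired forecasts and actuals."""
    if any(a == 0 for a in actual):
        # MAPE divides by the actual value, so zeros make it undefined
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [f - a for f, a in zip(forecast, actual)]
    n = len(errors)
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n * 100
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return {"MAPE": mape, "MAE": mae, "RMSE": rmse}


# Hypothetical CPU utilization values (percent)
actual = [40, 50, 80]
forecast = [44, 45, 80]

metrics = accuracy_metrics(forecast, actual)
print(metrics)
print(f"Approximate accuracy: {100 - metrics['MAPE']:.1f}%")
```

Here the absolute percentage errors are 10, 10, and 0 percent, so MAPE is about 6.7 percent and the approximate accuracy is about 93.3 percent.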

In practical scenarios, multiple values of ARIMA components need to be used to arrive at a model with maximum accuracy in predictions. Manually performing this task requires a lot of time and expertise in statistical techniques. Also, it becomes practically impossible to create models for a large number of entities such as servers, parameters, stocks, etc.

There are automated approaches, commonly called auto ARIMA, that search for good values of the components p, d, and q. In Python, the pmdarima library provides the auto_arima function, which tries various values of the ARIMA/SARIMA components during model creation and searches for the model with the lowest information criterion value, such as AIC or BIC.

Let’s try auto ARIMA to find the best component values for the ARIMA model from the given training dataset. For now we set seasonal to False to explore plain ARIMA further; it will be set to True for the SARIMA model to account for seasonality. We also provide the maximum values of p and q that auto ARIMA can explore to obtain the best model. Let’s proceed with max_p and max_q as 15 to limit the complexity of the model.

# Fit auto_arima on train set

auto_arima_model = pm.auto_arima(train_data, start_p = 1, start_q = 1,

                                 max_p = 15, max_q = 15,

                                 seasonal = False,

                                 d = None, trace = True,

                                 error_action ='ignore',

                                 suppress_warnings = True,

                                 stepwise = True)

# To print the summary

auto_arima_model.summary()

As per the auto ARIMA result shown in Figure 7-27, the best model is identified as ARIMA (6,1,0) with marginal improvement in the AIC and BIC values as compared to the previous ARIMA (6,1,2) model.

Figure 7-27
Prediction analysis on test data

Next, validate the performance of the model selected by auto ARIMA on the test data.

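# Drop any index returned by predict() so the series aligns positionally with actuals

auto_arima_predictions = pd.Series(np.asarray(auto_arima_model.predict(n_periods=len(test_data))))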

actuals = test_data.reset_index(drop = True)

auto_arima_predictions.plot(legend = True,label = "ARIMA Predictions",

                            xlabel = "Index",ylabel = "CPU Utilization",

                            figsize=(10, 7))

actuals.plot(legend = True, label = "Actual");

forecast_accuracy(np.array(auto_arima_predictions), test_data)

As observed in Figure 7-28 and in the performance metrics obtained, there is even further degradation in the accuracy with the model selected by auto ARIMA.

Figure 7-28
Prediction analysis on test data for ARIMA (1,1,1)

Clearly, a nonseasonal ARIMA model does not work well in this scenario, and we need to check for seasonality in the data using the seasonal_decompose function from statsmodels.

seasonality_check = seasonal_decompose(cpu_util_raw_data['cpu_utilization'],

                                       model='additive',extrapolate_trend='freq')

seasonality_check.plot()

plt.show()

Based on the graph in Figure 7-29, we can observe seasonality in the data, which implies we should use the SARIMA model.

Figure 7-29
Time series seasonality check

Let’s explore the SARIMA model using auto ARIMA with seasonality enabled. As we have daily data points and based on the seasonality graph, it seems there is weekly seasonality, so we will be setting the value of the SARIMA component as m = 7.

# Seasonal - fit stepwise auto-ARIMA

sarima_model = pm.auto_arima(train_data, start_p=1, start_q=1,

                             test='adf',

                             max_p=15, max_q=15, m=7,

                             start_P=0, seasonal=True,

                             d=None, D=1, trace=True,

                             error_action='ignore',

                             suppress_warnings=True,

                             stepwise=True)

sarima_model.summary()

As shown in Figure 7-30, auto ARIMA tried various combinations and detected SARIMA (3,0,2)(0,1,1)[7] as the best-fit model, with considerable improvement in the AIC value compared to the previous ARIMA models.

Figure 7-30
Auto Arima model execution result on train data
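The seasonal part of this model, (0,1,1)[7], includes one round of seasonal differencing (D=1) at a lag of m=7. As a minimal illustration of what that step does, the plain-Python sketch below subtracts from each value the value observed one period earlier. The helper name seasonal_difference and the weekly pattern are assumptions made up for this example.

```python
def seasonal_difference(series, period):
    """Seasonal differencing: y'[t] = y[t] - y[t - period]."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical daily CPU utilization with an identical weekly pattern,
# plus a steady increase of 1 percentage point per week
weekly_pattern = [30, 32, 31, 33, 35, 60, 65]
series = [value + week for week in range(4) for value in weekly_pattern]

deseasonalized = seasonal_difference(series, period=7)
print(deseasonalized)  # every element is 1: the weekly shape is gone
```

After the seasonal difference only the week-over-week change remains, which is the kind of series the nonseasonal AR and MA terms are then left to model.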

Let’s validate the results of this new SARIMA model.

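# Drop any index returned by predict() so the series aligns positionally with actuals

sarima_predictions = pd.Series(np.asarray(sarima_model.predict(n_periods=len(test_data))))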

actuals = test_data.reset_index(drop = True)

sarima_predictions.plot(legend = True,label = "SARIMA Predictions",xlabel = "Index", ylabel = "CPU Utilization", figsize=(10, 7))

auto_arima_predictions.plot(legend = True,label = "ARIMA Predictions")

actuals.plot(legend = True, label = "Actual")

forecast_accuracy(np.array(sarima_predictions), test_data)

After accounting for seasonality, the prediction accuracy improved to about 62 percent, as MAPE fell to about 38 percent. This improvement in accuracy can also be observed in Figure 7-31, which plots the predictions of the ARIMA and SARIMA models against the actual test data.

Figure 7-31
Time-series prediction by ARIMA/SARIMA on test data

This model can be used to generate future predictions to determine an appropriate baseline dynamically rather than going with a static global baseline. This approach drastically reduces the noise in large environments.
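As a sketch of how such predictions might be turned into a baseline, the example below compares each observation with its own time-specific upper limit instead of one fixed threshold. The function names, the readings, and the upper_bounds list (which stands in for the upper confidence limit of a forecast) are all hypothetical values chosen to mirror the 2 a.m. batch job scenario from the chapter overview.

```python
def dynamic_breaches(actuals, upper_bounds):
    """Return indexes where the observed value exceeds its own threshold."""
    return [i for i, (value, limit) in enumerate(zip(actuals, upper_bounds))
            if value > limit]


def static_breaches(actuals, threshold):
    """Return indexes where the observed value exceeds one fixed threshold."""
    return [i for i, value in enumerate(actuals) if value > threshold]


# Hypothetical hourly readings (index = hour of day, values in percent).
# upper_bounds stands in for the upper confidence limit of a forecast.
actuals      = [20, 25, 99, 30, 28, 27, 26, 30, 35, 97]
upper_bounds = [40, 40, 100, 45, 40, 40, 40, 45, 50, 60]

print("Static 95% threshold:", static_breaches(actuals, 95))
print("Dynamic baseline:    ", dynamic_breaches(actuals, upper_bounds))
```

The static threshold raises events at hours 2 and 9, while the dynamic baseline tolerates the expected spike at hour 2 and raises an event only for the unexpected spike at hour 9.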

This model can also be used for one of the most common use cases around the autoscaling of infrastructure, especially when there is seasonality in the utilization. The SARIMA (or SARIMAX) model can be used to analyze historical time-series data along with seasonality and any external factor to predict the utilization and accordingly scale up or down infrastructure for a specific duration and provide potential cost savings.
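A very simplified view of how predicted utilization could drive a scaling decision is sketched below. It assumes that load is spread evenly across instances and scales linearly with their number, which real workloads may not satisfy. The function name required_instances, the 60 percent target, and the forecast values are hypothetical.

```python
import math


def required_instances(predicted_util_pct, current_instances,
                       target_util_pct=60, min_instances=1):
    """Instances needed so the predicted load runs at the target utilization."""
    needed = math.ceil(current_instances * predicted_util_pct / target_util_pct)
    return max(min_instances, needed)


# Hypothetical forecast of average utilization (percent) for the next 4 days,
# expressed relative to the current fleet of 4 instances
predicted_utilization = [30, 45, 90, 60]
plan = [required_instances(p, current_instances=4) for p in predicted_utilization]
print(plan)  # [2, 3, 6, 4]
```

Scaling down to two instances on the quiet day and up to six on the busy day is where the potential cost savings and the protection against saturation would come from.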

Automated Baselining in APM and SecOps

Monitoring tools for SecOps and application performance monitoring (APM) have different characteristics: most of the time their monitored parameter values remain very low, almost touching zero, which creates an imbalanced dataset and leads to incorrect predictions. In these situations, it makes more sense to compare the spikes or unusually high utilization values (called anomalies) with the predicted values to calculate the mean absolute scaled error (MASE) ratio and tune the ML model accordingly, rather than relying on the usual low values of the parameters. Anomalies are values that do not follow the regular pattern and show sudden spikes (up or down). We will discuss these further in Chapter 8.
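One reason MASE suits near-zero metrics is that, unlike MAPE, it does not divide by the actual values. It divides the forecast error by the error of a naive "next value equals previous value" forecast on the training data. The sketch below follows that standard definition; the function name mase and the sample values are assumptions for illustration.

```python
def mase(actual, forecast, training):
    """Mean absolute scaled error relative to a naive one-step forecast."""
    # In-sample error of the naive forecast "next value = previous value"
    naive_errors = [abs(curr - prev) for prev, curr in zip(training, training[1:])]
    naive_mae = sum(naive_errors) / len(naive_errors)
    forecast_mae = sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)
    return forecast_mae / naive_mae


# Hypothetical metric that is zero most of the time, with occasional spikes
training = [0, 0, 1, 0, 0, 5, 0, 0]
actual = [0, 4, 0]
forecast = [0, 2, 0]

score = mase(actual, forecast, training)
print(f"MASE: {score:.3f}")  # below 1 means better than the naive forecast
```

The ratio stays defined even though most actual values are zero, provided the training series is not completely constant.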

These predicted thresholds for security-related parameters can be applied to policies that monitor the protected segment, comprising the IP address range, port, and protocol that define services, VLAN numbers, or MPLS tags. An appropriate policy will dynamically detect attacks on different protected segments and trigger qualified alarms rather than noise.

Implementing a monitoring solution with an automated baselining feature brings in immediate benefits by enabling the operations team to quickly identify outages, exceptions, and cyberattacks rather than wasting time on noise. But organizations often face some operational challenges in adopting dynamic thresholding, which we will be discussing next.

Challenges with Dynamic Thresholding

Though dynamic thresholding looks promising, there is a challenge in adoption due to fear of missing critical alerts. Especially during the initial phases when the AIOps system has just started the learning process, organizations raise a lot of doubt over the accuracy of prediction. Also, the majority of open source and native monitoring tools don’t have this capability. Implementing AIOps-based dynamic thresholding involves the use of multiple algorithms and techniques and requires data for a longer duration to be able to analyze the seasonality and patterns.

Summary

In this chapter, we covered one of the important use cases in AIOps, which is automated baselining. We covered various types of regression algorithms that can be used for this purpose. We covered hands-on implementation of the use case using multiple algorithms such as linear regression, ARIMA, and SARIMA. In the next chapter, we will cover various anomaly detection algorithms and how they can be used in AIOps.
