We are continuing to discuss specific use cases of AIOps, and in this chapter we explain and implement automated baselining, which is one of the most important and frequently used features in AIOps.
Automated Baselining Overview
In traditional monitoring tools, a baseline is static and rule-based and is set at predetermined levels. An example of a baseline could be CPU utilization, which is set at, say, 95 percent. An event gets triggered when the CPU utilization of a server goes above this threshold. The challenge with this approach is that baselines or thresholds cannot be static. Consider a scenario where an application runs a CPU-intensive data processing job at 2 a.m. local time and causes 99 percent CPU utilization until the job ends, say, at 2:30 a.m. This is the usual behavior of the application, but because of the static baseline threshold, an event gets generated. This alarm doesn't need any intervention, as it will automatically close once the job gets completed and the CPU utilization returns to normal. There are many such false positive events in operations due to this static baselining of thresholds.
Automated baselining helps in such scenarios because it considers the dynamic behavior of systems. It analyzes historical performance data to detect the real performance issues that need attention and eliminates the noise by adjusting the baseline threshold. In the given scenario, the system would automatically increase the threshold and not generate the event until after 2:30 a.m. However, if the same machine exhibits a CPU spike at, say, 9 a.m., it would generate an event.
It is important to note that the dynamic baseline of a threshold will work a bit differently in a microservices architecture where the underlying infrastructure will scale up based on increased utilization or load. These aspects of application architecture and infrastructure utilization need to be considered for dynamic baselining.
Such false positive events hurt operations in several ways:
Increases the volume of events for operations to manage, which in turn increases the time and effort of operations teams.
Clutters the operator console with false positive events, which leads to missing important qualified and actionable events.
Causes failures in automated diagnosis and remediation. Since the event itself is false, the automated resolution would be triggered unnecessarily.
Automated baselining reduces noise and related inefficiencies. Automated baselining can be achieved by leveraging supervised machine learning techniques that learn from past data and predict dynamic thresholds based on daily, weekly, monthly, and yearly seasonality of the data. From an AIOps perspective, there are three prominent algorithms that are core to implementing automated baselining. We will explore these approaches in this chapter starting with regression algorithms.
Regression
Linear Regression
This algorithm is applicable where the dataset shows a linear relationship between input and output variables and the output variable has continuous values, such as percentage utilization of CPU, etc.
In linear regression, input variables, referred to as independent variables or features, from the dataset get chosen to predict the values of the output variable, referred to as the dependent variable or target value. Mathematically, the linear regression model establishes a relationship between the dependent variable (Y) and one or more independent variables (Xi) using a best-fit straight line (called the regression line) and is represented by the equation Y = a + bXi + e, where a is the intercept, b is the slope of the line, and e is the error in prediction from the actual value. When there are multiple independent variables, it’s called multiple linear regression.
Let’s first understand simple linear regression that has only one independent variable.
This imaginary straight line is the regression line, which consists of predicted data points. The distance between each actual data point and the line is called an error. Multiple lines can be drawn through these data points, so the algorithm finds the best line by calculating the distances to all points and minimizing this error. Whichever line gives the smallest error becomes the regression line and provides the best prediction. From an AIOps perspective, linear regression provides a predictive capability based on correlation, but it does not show causation.
Now let’s understand the implementation of linear regression to forecast the database response time (in seconds) based on the memory utilization (MB) and CPU utilization (percent) of the server. Here we have two independent variables, namely, CPU and memory, so we will be using multiple linear regression models to predict the database response time based on the specific utilization value of CPU and memory.
You can download the code from https://github.com/dryice-devops/AIOps/blob/main/Ch-7_AutomatedBaselinig-Regression.ipynb.
CPU and Memory Utilization Mapping with DB Response Time
[Table with three columns: DB response time (seconds), CPU usage (percent), and memory utilization (MB).]
NumPy: This is the most widely used library in Python development to perform mathematical operations on matrices, linear algebra, and Fourier transform.
Seaborn: This is another data visualization library that is based on the Matplotlib library and primarily used for generating informative statistical graphics.
Sklearn: This is one of the most important and widely used libraries for machine learning–related development as it contains various tools for modeling and predictions.
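As a minimal sketch of this approach, the following builds a multiple linear regression with sklearn. It uses synthetic data in place of the notebook's dataset, and the value ranges and coefficients are illustrative assumptions only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic example: DB response time driven by CPU (%) and memory (MB).
# The linear relationship and noise level below are assumptions for illustration.
rng = np.random.default_rng(42)
cpu = rng.uniform(20, 95, 100)          # CPU utilization in percent
mem = rng.uniform(500, 4000, 100)       # memory utilization in MB
resp = 0.02 * cpu + 0.001 * mem + rng.normal(0, 0.2, 100)  # response time (s)

# Two independent variables -> multiple linear regression
X = np.column_stack([cpu, mem])
model = LinearRegression().fit(X, resp)

# R-squared indicates how well the fitted line explains the observed values
print("r2 score:", round(r2_score(resp, model.predict(X)), 4))

# Predict response time for a hypothetical 80% CPU / 3000 MB scenario
print("predicted response time:", model.predict([[80, 3000]])[0])
```

Because the synthetic data is generated from a linear relationship, the R-squared score here will be high; real monitoring data is noisier and typically needs the additional variables mentioned in the text.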
As per the output, the R2 score is 0.9679, which means that the model predicts the database response time with around 97 percent accuracy for a given scenario of CPU and memory utilization, which is quite good. Based on complexity and requirements, this model can be further expanded to include other independent variables such as the number of transactions, user connections, and available disk space. That makes the model usable in production for capacity planning, upgrades, or migration-related tasks to improve DB response time, which eventually improves application performance and user experience.
The linear regression model, which correlates the dependent variable with one or more independent variables to make predictions, has a few limitations. The first big challenge is that the correlations between variables are independent of any time series associated with them: linear regression cannot predict the value of the dependent variable at a specific time given values of the independent variables. Second, the prediction is based entirely on the independent variables, without considering the values of the dependent variable that were predicted in the recent past. To overcome these challenges, you can use a time-series model, which we discuss in the next section.
Technically, time could be treated as just another independent variable in a linear regression, but that approach is not appropriate; time-series models should be used instead of linear regression models in such scenarios.
Time-Series Models
To model scenarios that are based on time series and forecast the values that the data will take at specified time intervals in the future, you can leverage time-series modeling. Let's understand a few important terms used in time-series modeling.
Time-Series Data
Stock price data: The maximum and minimum prices of stock are usually maintained at a daily level. It is then further analyzed on a monthly or yearly time frame to make business decisions.
Calls in call center: The call volume at every hour is maintained. It is then analyzed to determine the peak hour load on a monthly or yearly time frame from a resourcing perspective.
Sales data: Sales data is important for any business, and usually monthly sales figures are maintained to forecast future sales or profitability of the organization.
Website hits: Website hit volume is usually maintained every five to ten minutes to determine the load on the application, detect any potential security threats, or scale the infrastructure to meet business requirements.
Stationary Time Series
A time series is called stationary when its statistical properties, such as mean and variance, remain constant over time. Most time-series models assume stationarity, so a nonstationary series typically needs to be transformed (for example, by differencing, discussed later in this chapter) before modeling.
Lag Variable
A lag variable is the dependent variable shifted back in time. Lag-1 (Yt-1) represents the value of the dependent variable Yt one unit of time earlier, Lag-2 (Yt-2) holds the value of Yt two units of time earlier, and so on.
Mathematically, if there exists a relationship or correlation between Yt and its lagged values Yt-1, Yt-2, etc., then a model can be created to detect the pattern and predict future values of the series using its lagged values. For example, if CPU utilization spikes to around 100 percent every Saturday, then the value at Lag-7 (Yt-7) can be considered for prediction.
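For illustration, lag variables can be constructed with the pandas shift function on a hypothetical daily CPU series (the values below are made up to show a weekly spike):

```python
import pandas as pd

# Hypothetical daily CPU utilization with a spike every 7th day
cpu = pd.Series([60, 62, 61, 63, 65, 64, 98] * 4)
df = pd.DataFrame({"y": cpu})

df["lag_1"] = df["y"].shift(1)  # Yt-1: value one day earlier
df["lag_7"] = df["y"].shift(7)  # Yt-7: value one week earlier

# The first rows contain NaN because no earlier values exist for those lags
print(df.dropna().head(3))
```

With such columns in place, a model can relate Yt to its lags; here lag_7 lines up each day with the same weekday one week before, capturing the weekly spike.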
The next step is to determine the strength of the relationship between Yt and its lags, and how many lags have a statistically significant relationship with Yt that we can use in the prediction model. That's where ACF and PACF plots help, and we discuss them next.
ACF and PACF
Both ACF and PACF are statistical methods used to understand the temporal dynamics of an individual time series, which basically means detecting correlation over time, often called autocorrelation. A detailed discussion of ACF and PACF is beyond the scope of this book, so we will limit the discussion to the basics needed to use them in AIOps model development.
ACF is a statistical term that represents the autocorrelation function and is used to measure the correlation of time-series values Yt with its own lags Yt-1, Yt-2, Yt-3, etc. This correlation of various lags can be visualized using an ACF plot on a scale of 1 to -1, where 1 represents a perfect positive relationship and -1 represents a perfect negative relationship.
A red line shown on both sides of the ACF plot is called the significance line; together the two lines create a band referred to as the error band. Lags that cross this line have a statistically significant relationship with the variable Y. In the sample ACF plot, three lags cross this line, so up to Lag 3 there is a statistically significant relationship; anything within the band is not statistically significant.
PACF is another statistical term that represents the partial autocorrelation function, which is similar to an ACF plot except that it considers the strength of the relationship between the variable and lag at a specific time after removing the effects of all intermediate lags.
Now we are all set to implement the two most commonly used time-series models, ARIMA and SARIMA, for noise reduction by predicting appropriate monitoring thresholds for parameters. This is one of the most common use cases in an AIOps implementation.
ARIMA
ARIMA is one of the foundational time-series forecasting techniques; it is popular and has been extensively used for quite a while now. The concept was first introduced by George Box and Gwilym Jenkins in 1976 in the book Time Series Analysis: Forecasting and Control. They defined a systematic method, called Box-Jenkins analysis, to identify, fit, and use an autoregressive (AR) technique integrated (I) with a moving average (MA) technique, resulting in an ARIMA time-series model.
For an ARIMA model, the time series should satisfy a few conditions:
It should be stationary, or nonseasonal.
Past values should have a sufficient pattern (not just a random white noise) and information to predict its future values.
It should be of medium to long length with at least 50 observations.
Model Development
The ARIMA model has three components.
Differencing (d)
Two questions arise here: how to attain stationarity in a series, and how to validate whether the series has become stationary.
One approach is to apply a transformation, such as a log or square-root transformation, to convert a nonstationary series into a stationary one, but this is relatively complex.
A more reliable and widely used approach is differencing: the difference between consecutive values in the series is calculated, and the resultant series is then validated to check whether it is stationary.
One of the most used statistical tests to determine whether a time series is stationary is the augmented Dickey-Fuller (ADF) test, a type of unit root test that determines how strongly a time series is defined by a trend. In simple terms, the ADF test evaluates a null hypothesis (H0) that the time series has some time-dependent structure and hence is nonstationary. If this null hypothesis gets rejected, the time series is stationary. The ADF test calculates a p-value, and if it is less than 0.05 (5 percent), the null hypothesis gets rejected. Interested readers can explore the internal functioning and mathematics behind unit root tests and the ADF test online.
We need to continue differencing and validating the result via the ADF test until we get a near-stationary series with a defined mean. The number of times differencing is performed to make the time series stationary is represented by the model variable d.
It is important that the series does not get over-differenced; in that case we still get a stationary series, but the model parameters get affected, and prediction accuracy suffers. One way to check for over-differencing is that the lag-1 autocorrelation of the differenced series should not become too negative.
Autoregression or AR (p)
This component, known as autoregression, shows that the dependent variable Yt can be calculated as a function of its own lags Yt-1, Yt-2, and so on.
Yt = α + β1Yt-1 + β2Yt-2 + … + βpYt-p + ε

Here,
α is the constant term (intercept).
β is the coefficient of lag with a subscript indicating its position.
ε is the error or white noise.
We already discussed that the PACF plot shows the partial autocorrelation between the series and its lags after excluding the effect of the intermediate lags. We can use the PACF plot to determine the last lag that crosses the significance line, which can be considered a good estimate of the AR order. This value is represented by the model variable p.
Moving Average or MA (q)
This component, known as moving average, shows that the dependent variable Yt can be calculated as a function of the weighted average of the forecast errors in the previous lags:

Yt = α + εt + φ1εt-1 + φ2εt-2 + … + φqεt-q

Here,
α is usually the mean of the series.
ε is the error from the autoregressive models at respective lags.
Both the AR and MA equations need to be integrated (I, via d orders of differencing) to create the ARIMA model, which is represented as ARIMA (p,d,q).
ARIMA has a few notable variants.
SARIMA: If there is a repetitive pattern at a regular time interval, it's referred to as seasonality, and the SARIMA model is used, where S stands for seasonality. We will explore this more in the next section.
ARIMAX: If the series depends on external factors, then the ARIMAX model is used, where X stands for external factors.
SARIMAX: If along with seasonality there is a strong influence of external factors, then the SARIMAX model is used.
One of the big limitations of ARIMA is its inability to consider seasonality in time-series data, which is common in real-world scenarios: exceptionally high sales during the Thanksgiving and Christmas holiday period every year, high expenditure during the first week of every month, heavy foot traffic in shopping malls during the weekend, and so on. Such seasonality is handled by another algorithm, called SARIMA, which is a variant of the ARIMA model.
SARIMA
A cyclical pattern at a regular time interval is referred to as seasonality and should be considered in a prediction model to improve the accuracy of prediction. The ARIMA model is therefore modified to develop SARIMA, which considers seasonality in time-series data and supports univariate time-series data with a seasonal component.
p: Autoregression (AR) order of time series.
d: Difference (I) order of time series.
q: Moving average (MA) order of time series.
m (or lag): Seasonal factor, the number of time steps in a single seasonal period. For example, monthly data with yearly seasonality gives m = 12, quarterly data gives m = 4, and weekly data gives m = 52.
P: Seasonal autoregressive order.
D: Seasonal difference order.
Q: Seasonal moving average order.
The final notation for a SARIMA model is specified as SARIMA (p,d,q)(P,D,Q)m.
Implementation of ARIMA and SARIMA
Let’s perform a sample implementation of the ARIMA and SARIMA models on CPU utilization data to predict its thresholds.
As we have done differencing only one time, we can set the value of the differencing (d) component of ARIMA as d=1.
Akaike information criterion (AIC): This validates the model in terms of goodness of fit while penalizing model complexity (the number of estimated parameters). The lower the AIC value, the better the model performance. The current ARIMA (6,1,2) model has an AIC value of 2113.
Bayesian information criterion (BIC): In addition to what AIC considers, BIC factors in the number of samples in the training dataset used for fitting, applying a stronger complexity penalty. Here too, a model with a lower BIC is preferred. The current ARIMA (6,1,2) model has a BIC value of 2150.
Based on PACF and ACF charts, multiple values can be tried to lower the AIC and BIC values. For now let’s proceed with ARIMA (6,1,2).
Based on the performance metrics, the model predictions are about 76 percent (MAPE = ~24 percent) accurate.
In practical scenarios, multiple values of ARIMA components need to be used to arrive at a model with maximum accuracy in predictions. Manually performing this task requires a lot of time and expertise in statistical techniques. Also, it becomes practically impossible to create models for a large number of entities such as servers, parameters, stocks, etc.
There are ML approaches, collectively called Auto ARIMA, that learn optimal values of the components p, d, and q. In Python, the pmdarima library provides the function auto_arima, which tries various values of the ARIMA/SARIMA components during model creation and searches for the most optimal model with the lowest AIC and BIC values.
This model can be used to generate future predictions to determine an appropriate baseline dynamically rather than going with a static global baseline. This approach drastically reduces the noise in large environments.
This model can also be used for one of the most common use cases around the autoscaling of infrastructure, especially when there is seasonality in the utilization. The SARIMA (or SARIMAX) model can be used to analyze historical time-series data along with seasonality and any external factor to predict the utilization and accordingly scale up or down infrastructure for a specific duration and provide potential cost savings.
Automated Baselining in APM and SecOps
Monitoring tools related to SecOps and application monitoring have different characteristics: the majority of the time, their monitored parameter values remain very low, almost touching zero, causing an imbalanced dataset and hence incorrect predictions. In these situations, it makes more sense to compare spikes or unusually high utilization values (called anomalies) with predicted values to calculate the MASE ratio and tune the ML model accordingly, rather than leveraging the usual low values of the parameters. Anomalous values are the ones that do not follow a regular pattern and show sudden spikes (up or down). We will discuss these in more detail in Chapter 8.
These predicted thresholds for security-related parameters can be applied on policies that monitor the protected segment comprising the IP address range, port, and protocol that define services, VLAN numbers, or MPLS tags. An appropriate policy will dynamically detect attacks on different protected segments and trigger qualified alarms rather than noise.
Implementing a monitoring solution with an automated baselining feature brings in immediate benefits by enabling the operations team to quickly identify outages, exceptions, and cyberattacks rather than wasting time on noise. But organizations often face some operational challenges in adopting dynamic thresholding, which we will be discussing next.
Challenges with Dynamic Thresholding
Though dynamic thresholding looks promising, there is a challenge in adoption due to fear of missing critical alerts. Especially during the initial phases when the AIOps system has just started the learning process, organizations raise a lot of doubt over the accuracy of prediction. Also, the majority of open source and native monitoring tools don’t have this capability. Implementing AIOps-based dynamic thresholding involves the use of multiple algorithms and techniques and requires data for a longer duration to be able to analyze the seasonality and patterns.
Summary
In this chapter, we covered one of the important use cases in AIOps, which is automated baselining. We covered various types of regression algorithms that can be used for this purpose. We covered hands-on implementation of the use case using multiple algorithms such as linear regression, ARIMA, and SARIMA. In the next chapter, we will cover various anomaly detection algorithms and how they can be used in AIOps.