© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
N. Sabharwal, G. Bhardwaj, Hands-on AIOps, https://doi.org/10.1007/978-1-4842-8267-0_2

2. AIOps Architecture and Methodology

Navin Sabharwal and Gaurav Bhardwaj
New Delhi, India

In this chapter, you will learn about the technologies and components that make up the AIOps architecture, along with its implementation methodology and challenges. The chapter explores the key AIOps features of the Observe, Engage, and Act phases and the role of machine learning in IT operations.

AIOps Architecture

AIOps systems primarily consist of three core services of IT operations: enterprise monitoring, IT service management, and automation. The AIOps architecture provides technologies and methods for seamless integration between these three services and delivers a complete AIOps system. Figure 2-1 shows the AIOps platform and its applicability across various processes and functions in the three core services of the IT operations value chain, as defined by Gartner. Let’s delve deeper and understand the AIOps architecture.

The AIOps platform spans three core services: Observe (monitoring), covering events, metrics, traces, and topology; Engage (ITSM), covering incidents, dependencies, and changes; and Act (automation), covering scripts, runbooks, and ARA.

Figure 2-1

AIOps architecture

The Core Platform

The AIOps system is intended to ingest millions of data points that get generated at a rapid pace and need to be analyzed quickly along with historical data to deliver value. Big Data technologies coupled with machine learning algorithms provide the solution and form the core of the AIOps system.

Big Data

The AIOps platform ingests data from multiple sources, and thus the platform needs to be built on Big Data technologies. Traditional systems have been built on relational databases; however, in AIOps there are multiple data sources and a huge amount of data to be processed to arrive at meaningful insights. Big Data is data characterized by the five Vs: volume, velocity, variety, veracity, and value, as shown in Figure 2-2.

The five Vs of Big Data: volume, the sheer amount of data; velocity, the speed at which data arrives; variety, structured as well as unstructured data; veracity, data reliability and trust; and value, the business worth of the data.

Figure 2-2

Big Data definition

Volume

Volume is the core characteristic of Big Data. The amount of data in Big Data systems is much larger than the data handled by RDBMSs. As AIOps integrates data from various systems into a data warehouse, the volume of data becomes unmanageable without Big Data technology platforms at the core.

Velocity

The second V in Big Data technologies is velocity, which describes the speed at which data arrives for processing. Since AIOps deals with event data that has to be processed quickly and in real time, velocity is an important characteristic of the data processed by AIOps. The data needs to be processed at near real-time intervals so that the response from human or machine agents is immediate. Data from multiple sources is sent to AIOps systems at high velocity, and this requires an appropriate platform architecture to process the data at scale and with speed.

Variety

The third V of Big Data is variety. In AIOps there is a variety of data that needs processing: multiple monitoring and management systems produce events, logs, metrics, tickets, etc., in varying formats, and all of it must be processed in AIOps.

Veracity

The fourth V is veracity, which means that the data should be of high quality. You should have the right data and avoid pitfalls like missing data. The data being sent to AIOps systems needs to be accurate, complete, and clean so that the AIOps platform can process it as per the use cases.

Value

Value means that the data should have business value. The AIOps data is highly valuable as it translates to higher availability and visibility, and it forms the backbone of automation to reduce costs and provide better services to customers.

Machine Learning

Apart from Big Data, machine learning is the other core component of AIOps. Artificial intelligence and machine learning technologies are at the heart of AIOps. Traditional systems have been used for monitoring and event correlation; however, they were rule-based and could not efficiently and effectively derive insights or support the advanced use cases that become possible by leveraging machine learning.

AIOps platforms leverage the power of machine learning to analyze data being fed from various systems and to detect relationships between monitored entities and events in order to detect patterns and anomalies. We will be discussing the machine learning aspects of AIOps in detail in the upcoming chapters.

This data is then used to provide insights and analytics and arrive at root-cause alerts. AIOps platforms combine CMDBs, rule-based correlation, and unsupervised and supervised machine learning to achieve the end objective of finding out the root cause, and they attempt to provide predictive insights to find hidden issues, which may surface later. The core themes are the higher availability of systems, better insights and governance, and higher customer satisfaction scores.

Let’s understand how AIOps improves IT operations.

The Three Key Areas in AIOps

AIOps cuts across three key areas in IT operations, which are Observe, Engage, and Act.

Traditionally, some aspects of the Observe area fell under monitoring; now, with end-to-end visibility being the focus, it has matured into “observability.”

The second key area is Engage, which is part of the value stream and is related to IT service management and the functions of IT service management such as service desk, command center, and resolution groups as well as the ITSM processes such as incident management, change management, problem management, configuration management, capacity planning, and continual service improvement.

The third area is Act, which defines the technical function where technical teams resolve incidents, complete service requests, and orchestrate changes in the IT systems.

We will now delve deeper into each of these areas and see how AIOps impacts them.

Observe

Unlike traditional monitoring and event management tools, observability uses machine learning–driven functions and ensures there are no blind spots or gaps in meeting the enterprise monitoring needs of organizations, irrespective of whether they run monolithic applications on traditional physical or virtual infrastructure or modern applications on cloud-native or microservices architectures. Primarily, four processes are performed in this stage, as shown in Figure 2-3.

Observability using AIOps consists of four processes: data ingestion and integration; event processing (deduplication and suppression); event analytics (machine learning and rule-based correlation); and visualization, collaboration, and feedback.

Figure 2-3

Observability using AIOps

Data Ingestion

Data ingestion is the first important step in AIOps: all monitoring and management data is ingested into the AIOps system so that it is available for analysis. At times while implementing AIOps projects, it is observed that basic monitoring data is not in place even though the organization insists on going ahead with AIOps; in such situations, a fundamental discussion of how AIOps works and how machine learning algorithms depend completely on data comes to the rescue. Setting up the right monitoring can be spun off as a separate project under the program, while AIOps continues to integrate its data sources and move forward with the plan. When the entire data set is available, the algorithms are trained and tweaked to reflect the new data.

For event management, the following data is required:
  • Events: These are the events generated from various sources including operating systems, network devices, cloud computing platforms, applications, databases, and middleware platforms. All these platforms generate events, and they are captured through monitoring tools and then forwarded to the AIOps systems.

  • Metrics: These are performance metrics, which include infrastructure metrics such as CPU utilization, memory, and disk performance parameters, network utilization, and response time metrics. These also include application metrics such as response time of an application, page load times, the completion time of queries, etc. The metrics are collected in a periodic manner, say every five minutes, and the data is used for understanding the behavior of the system over a period of time. This data is also termed performance metrics.

  • Logs: Many systems maintain log files and provide data in logs. The log collection can be configured and tuned to log certain types of information. These log files are sent to AIOps systems to find patterns.

  • Traces: The applications use tracing mechanisms to provide information on a complete application transaction right from the users’ browser to the backend servers; these are logged using various mechanisms including widely used formats like OpenTracing. The trace data provides information on the end-to-end transaction, and it provides the path and time that each step of the transaction took. Any errors or performance issues in any component can be diagnosed by using the trace data.

Apart from the real-time data mentioned, AIOps tools need discovery and configuration data as well so that topology and relationship-based correlations can effectively work. This data ingestion may be done periodically rather than on a real-time basis.

Integration

For data ingestion to happen, there needs to be integrations available in AIOps platforms. AIOps platforms should support both push and pull integrations. In a push model, the monitoring tools or forwarders can forward the data to the AIOps engine. In a pull model, the AIOps tools have the capability to pull the data from various monitoring systems.

The other aspect of integration is real-time versus periodic data upload: event, metric, log, and trace data is integrated on a real-time basis since the operations teams need to take immediate action. The discovery and CMDB data can be integrated with the AIOps system periodically, say, daily or over the weekend as a batch job, or just after the CMDB synchronization jobs complete. This helps in keeping the AIOps data up-to-date on the topology and relationships. However, in cloud computing and software-defined environments, discovery and CMDB data is real time, and the integration is done on a real-time basis during provisioning and deprovisioning using infrastructure as code.

The adapters for integration to various monitoring and management systems are part of the AIOps system. These need to be configured to bring data from different systems into a single Big Data repository for data analytics and inference.
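As a minimal sketch of how a pull adapter might work, the following Python example polls a hypothetical monitoring REST endpoint and normalizes each raw event into a common schema before it lands in the central repository. The URL, field names, and mapping are illustrative assumptions, not any specific vendor’s API.

```python
import time

import requests  # third-party HTTP client: pip install requests

MONITOR_URL = "https://monitoring.example.com/api/events"  # hypothetical endpoint


def normalize(raw):
    """Map a source-specific event into a common AIOps schema (assumed fields)."""
    return {
        "source": raw.get("origin", "unknown"),
        "host": raw.get("hostname"),
        "severity": raw.get("sev", "info").lower(),
        "message": raw.get("msg", ""),
        "timestamp": raw.get("ts", time.time()),
    }


def poll_once(session, since):
    """Pull events newer than `since` from the source and normalize them."""
    resp = session.get(MONITOR_URL, params={"since": since}, timeout=10)
    resp.raise_for_status()
    return [normalize(e) for e in resp.json()]


# events = poll_once(requests.Session(), since=0)
```

A push integration inverts this flow: the monitoring tool or a forwarder posts events to an ingestion endpoint exposed by the AIOps platform.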

Event Suppression

Using the integrations and data ingestion, all the data from various monitoring and management tools is ingested into the AIOps solution. This voluminous data needs to be cleaned to reduce the noise. The first step is event suppression, where unwanted alerts are suppressed or eliminated from the system. Care should be taken that any data that is relevant for further processing or can provide information to the AIOps engine is not discarded during suppression. An example of suppression is the informational events in event and log sources; these events exist only for informational purposes and do not indicate an underlying problem in the system. Another type of data that may appear in ingested log files is successful executions; this data is again informational and not relevant for finding issues with the system. In some systems, events that are warnings may also be discarded during the event suppression phase; this decision needs to be taken by the domain experts.

Without event suppression, the AIOps system will be overloaded with data, which is not needed for further processing and analytics.
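To make the idea concrete, here is a minimal suppression sketch, assuming events carry `severity` and `message` fields; the rules shown (dropping informational events and successful-execution log lines) mirror the examples above, and in practice domain experts would define and tune them.

```python
SUPPRESSED_SEVERITIES = {"info", "informational"}                  # purely informational
SUPPRESSED_PATTERNS = ("completed successfully", "job succeeded")  # success log lines


def should_suppress(event):
    """Return True if the event carries no diagnostic value and can be dropped."""
    if event["severity"] in SUPPRESSED_SEVERITIES:
        return True
    message = event["message"].lower()
    return any(pattern in message for pattern in SUPPRESSED_PATTERNS)


events = [
    {"severity": "info", "message": "Backup job succeeded"},
    {"severity": "critical", "message": "Disk /var is 95% full"},
]
actionable = [e for e in events if not should_suppress(e)]  # keeps only the disk alert
```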

Event Deduplication

Once events are suppressed, the next step is event deduplication, an important step in the processing of AIOps data. In this step, duplicate events are clubbed together and deduplicated. As an example, if a system is down, the monitoring tool may send the same event every minute, with the same information but a different timestamp. The AIOps system takes this data, increments the counter of the original event, and updates the timestamp to reflect when the last event for this system occurred. This tells the AIOps system and the engineers which systems are down, since when, and the timestamp of the last data point. Deduplication ensures that the event console is not cluttered with multiple copies of an event and that the relevant information remains available.

Deduplication preserves the information sent by successive events; it remains in the system and is used for processing by the AIOps engine, but rather than being shown multiple times, it is aggregated in the console and database.

Without event deduplication, the event console becomes cluttered with the same event shown again and again.
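A minimal deduplication sketch, assuming each event has `host`, `message`, and `timestamp` fields: duplicates are collapsed into a single record whose counter is incremented and whose last-occurrence timestamp is updated, as described above.

```python
def deduplicate(events):
    """Collapse repeated events into one record with a count and last-seen time."""
    seen = {}
    for e in sorted(events, key=lambda ev: ev["timestamp"]):
        key = (e["host"], e["message"])        # identity of the underlying condition
        if key in seen:
            seen[key]["count"] += 1            # increment instead of adding a new row
            seen[key]["last_occurrence"] = e["timestamp"]
        else:
            seen[key] = dict(e, count=1,
                             first_occurrence=e["timestamp"],
                             last_occurrence=e["timestamp"])
    return list(seen.values())
```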

Rule-Based Correlation

AIOps systems use machine learning technologies to analyze the data, whereas traditional systems used rule-based correlation to decipher and analyze it. In AIOps systems, rule-based correlation still plays an important role, and there may be policies in an organization that need implementation through rule-based configuration rather than probabilistic machine learning models. An example is a rule that increases the severity of an event based on whether a system is development or production. This is achieved by looking up the CMDB to find the type of the device and then applying a rule or policy to upgrade or downgrade the severity of the event based on the system’s categorization. There are other important rules required in AIOps systems, like the “maintenance window,” where alerts are suppressed for the duration of any maintenance or patching activity. This reduces noise in the system and prevents the event console from showing alerts for systems that have been rebooted or shut down for maintenance activities.

Rule-based correlation also has a subtype called topology-based correlation. Topology and relationships between systems are used both in rule-based correlation and in machine learning–based correlation. In rule-based systems, the topology of systems and their relationships are used to suppress or correlate events. For example, if a switch is down, then all servers and infrastructure beyond the switch are unreachable from the monitoring systems. Topology-based correlation will flag these events, correlate all infrastructure events with the switch-down event, and flag the switch as the probable root cause for all of them.
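The sketch below illustrates both rule types under stated assumptions: a hypothetical CMDB lookup that upgrades severity for production systems, and a topology rule that marks everything downstream of a failed switch as a symptom of that switch. The device names and dictionary structures are invented for illustration.

```python
# Hypothetical CMDB fragments: device classification and network topology.
CMDB_CLASS = {"db01": "production", "dev07": "development"}
DOWNSTREAM = {"switch-3": {"db01", "app02", "web05"}}  # hosts reachable only via switch-3


def apply_severity_rule(event):
    """Upgrade the severity of minor events raised on production systems."""
    if CMDB_CLASS.get(event["host"]) == "production" and event["severity"] == "minor":
        event["severity"] = "major"
    return event


def correlate_switch_down(events):
    """Mark 'unreachable' symptoms downstream of a failed switch as correlated."""
    for cause in (e for e in events if e["message"] == "switch down"):
        cause["probable_cause"] = True         # flag the switch as the likely cause
        for e in events:
            if e["host"] in DOWNSTREAM.get(cause["host"], set()):
                e["correlated_to"] = cause["host"]  # a symptom, not the root cause
    return events
```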

Machine Learning–Based Correlation

The machine learning–based correlation that is provided by AIOps tools is what makes it AIOps. The other features that we discussed were available with traditional platforms, but machine learning–based capabilities are the ones that differentiate AIOps products from other event management and event correlation engines. Figure 2-4 shows various types of correlation that are performed by the AIOps engine.

The AIOps correlation engine performs several types of machine learning–driven analysis: anomaly detection, event correlation, root-cause analysis, and predictive analysis.

Figure 2-4

AIOps-driven correlation engine

In this section, we will take a deeper look at each of these types, beginning with anomaly detection.

Anomaly Detection

Anomaly detection is the process of identifying unexpected or rare events in the data. It is also termed outlier detection since it is about detecting outliers or rare events.

The outlier or anomaly events differ from the normal events. Anomaly detection algorithms try to identify these events from normal events and flag them as anomalies. Typically, events that are anomalies may point to some issues with the system. Anomaly detection techniques have been used in various use cases such as fraud detection in credit card and banking transactions, security systems to detect cybersecurity attacks, etc. With AIOps the same techniques and algorithms are now being used on the IT operations data.

Anomalies are not just about rare or outlier events; anomalies are also detected in AIOps systems for metric data such as network or system utilization parameters. In metric data, the anomalies are the bursts in utilization or activity, and these may point to some underlying causes that are flagged by the AIOps systems.

Anomaly detection techniques use unsupervised, supervised, and semisupervised machine learning to flag events as anomalies and to flag metric data when there are breaches of normal behavior.

Anomaly detection has advantages over traditional rule-based systems. Anomaly detection algorithms can detect seasonal variations in data and flag behavior as anomalous only after taking the seasonality of those variations into consideration. Metric data has high seasonality since application load and jobs running on IT infrastructure typically follow time-of-day patterns. Some jobs that run monthly also increase the utilization of systems but are not anomalies.

Anomaly detection can be done using a variety of algorithms available in AIOps systems. The AIOps teams can use these algorithms to fine-tune the implementation based on the type of data and the environment.

Thus, AIOps systems using anomaly detection are better suited to reduce noise in event or metric data, both by flagging the right events based on the seasonality of the data and by finding anomalous patterns that may be missed in rule-based systems. Together these capabilities give operations teams insight into what is happening in their environment so that they can take reactive and proactive steps to remediate problems or prevent them from occurring.
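As one illustrative approach to seasonality-aware anomaly detection, the sketch below computes a per-slot (for example, hour-of-day) baseline and flags points that deviate strongly from the mean of their own seasonal slot. Production AIOps engines use more sophisticated algorithms, but the principle is the same.

```python
import numpy as np


def seasonal_anomalies(values, period=24, threshold=3.0):
    """Flag points that deviate strongly from their own seasonal slot's baseline.

    values: metric samples at regular intervals (hourly assumed here).
    period: season length, e.g., 24 samples for a daily cycle.
    """
    values = np.asarray(values, dtype=float)
    slots = np.arange(len(values)) % period    # position of each sample in the season
    flags = np.zeros(len(values), dtype=bool)
    for s in range(period):
        idx = np.where(slots == s)[0]
        if idx.size < 2:                       # not enough history for this slot
            continue
        mu, sigma = values[idx].mean(), values[idx].std()
        if sigma == 0:
            continue                           # constant slot: nothing to flag
        z = np.abs(values[idx] - mu) / sigma   # z-score within the slot
        flags[idx[z > threshold]] = True
    return flags
```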

Event Correlation

Modern digital applications are all interconnected. Even traditional applications have been developed using distributed architecture where web servers, application servers, and database servers work in tandem to complete the application functionality. The infrastructure itself is distributed in network topologies with routers, switches, and firewalls routing the traffic from users of different locations to the main data center where the application is hosted.

The distributed application and infrastructure are monitored using a multiplicity of tools. Thus, an environment will have alerts coming in from network monitoring tools and from server, database, and platform monitoring, while the application itself logs events and traces. All this data needs correlation so that the noise can be eliminated and the causal event (the event causing the problem) is flagged and identified automatically. Where AIOps tools are not deployed, this activity is done by different subject-matter teams coming together over a call, termed the operations bridge or critical incident bridge, and analyzing all the systems and data to collectively identify the potential source of the problem in a distributed system. You can imagine the complexity and the time taken to go through all the data and arrive at a conclusion.

Event correlation also takes feeds from configuration management and change management systems, thus correlating these changes in systems to the events getting generated by monitoring tools. This helps in root-cause analysis as many incidents and problems arise after making a configuration change or patching an existing system. Change and configuration management data needs to be made available to the AIOps engine to correlate with event and performance data coming in from the monitoring systems. In fact, the first thing subject-matter experts look for is any changes that may have been made to the system recently that would have caused the incident.

Machine learning–based event correlation helps solve this problem by automatically grouping related alerts by correlating across various parameters so that the resolution group gets all the information at a single place. Event correlation is done using the topology and relationship information that is available in discovery and CMDB systems; it also uses timestamps and historical data to group events and provides insights to operations teams.
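A simplified sketch of time- and topology-based grouping, assuming a CMDB-derived `related` mapping of host dependencies; real correlation engines learn these groupings statistically, but the windowing idea is the same.

```python
def group_events(events, window=300, related=None):
    """Group events close in time (within `window` seconds) on related components.

    related: dict mapping a host to the set of hosts it depends on (from the CMDB).
    """
    related = related or {}
    groups = []
    for e in sorted(events, key=lambda ev: ev["timestamp"]):
        for g in groups:
            last = g[-1]
            close_in_time = e["timestamp"] - last["timestamp"] <= window
            linked = (e["host"] == last["host"]
                      or e["host"] in related.get(last["host"], set())
                      or last["host"] in related.get(e["host"], set()))
            if close_in_time and linked:
                g.append(e)
                break
        else:                                  # no matching group: start a new one
            groups.append([e])
    return groups
```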

With time and enough supervised learning data, the event correlation engine becomes more accurate. The event correlation engine feeds in the data to “root-cause analysis.” We will be covering root-cause analysis in greater detail in the next section. Without event correlation, it is not possible to do root-cause analysis or predictive analytics. Thus, event correlation is the first step to move forward with root-cause analysis and predictive analytics.

Root-Cause Analysis

Root-cause analysis is the most important module of AIOps; this is where operations realizes the most value. We have already seen, in event correlation, how modern distributed applications and infrastructure are spread out, how events are generated from various monitoring and management systems, and how this impacts operations’ ability to go through this large data set and find the root cause of the problem manually. With such complexity coming into operations, it is becoming impossible to do root-cause analysis without the aid of event correlation systems. Whether an organization uses a rule-based approach or an AIOps-based approach, without leveraging technology it is impossible to do root-cause analysis and remediation of issues and meet the service levels agreed with the business.

Identifying the root cause manually or with event correlation and automated analysis involves multiple teams from different IT domains coming together and analyzing the situation to arrive at a conclusion on what could be the problem. This also involves the need for collaboration and tools for collaboration so that different stakeholders are on a common platform and can effectively do “root-cause analysis.” Root-cause analysis with AIOps looks at all the data being ingested into the AIOps system and provides insights to the teams for faster and better root-cause identification. Since machine learning technologies are probabilistic in nature, root-cause analysis in AIOps parlance is also referred to as probable cause analysis; thus, it may throw multiple probable causes for a root cause with an attached confidence score. The higher the confidence score assigned to an event, the higher the probability that the AIOps engine has assigned to the event to be the root cause. Based on the probable cause, the operations teams can do a deep dive and arrive at the final root cause and mark it as such in the system.
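To illustrate how probable causes might be ranked with confidence scores, here is a deliberately naive sketch that combines a few heuristic signals: the earliest symptom in a correlated group, a topology flag, and a recent change on the host. The weights are invented for illustration; a real engine would learn them from data and operator feedback.

```python
def score_probable_causes(group, recent_changes=frozenset()):
    """Attach a naive confidence score to each event in a correlated group."""
    first_ts = min(e["timestamp"] for e in group)
    scored = []
    for e in group:
        score = 0.2                            # base probability for any event
        if e["timestamp"] == first_ts:
            score += 0.3                       # first symptom in the group
        if e.get("probable_cause"):
            score += 0.3                       # topology rule already flagged it
        if e["host"] in recent_changes:
            score += 0.2                       # correlates with a recent change
        scored.append((min(score, 1.0), e))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)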

Root-cause analysis leverages anomaly detection and event correlation techniques as well as supervised learning feedback, using both supervised and unsupervised techniques to arrive at the root cause.

Since IT data is vast and environment dependent, the feedback loop is an important aspect of root-cause analysis. There are vast amounts of tribal knowledge in the minds of the operations teams that may be undocumented and known only informally. When the operations teams collaborate and mark the root cause from the various probable causes that the system has generated, the AIOps system learns from the human operators and updates its model. Thus, the system can store the actions taken by operations teams and can recall the root cause from previous incidents. We will cover this in greater detail in the feedback section; however, it is important to understand that root-cause analysis has a dependency on human feedback, and without it, root-cause analysis accuracy may hit a ceiling.

The benefits of automated root-cause analysis are many. There is a marked improvement in the mean time to respond and mean time to resolution since the system makes the job of human operators easier by flagging the probable cause and eliminating noise from the system. Integrated with knowledge management and feedback loop from operators, root-cause analysis creates a robust and highly accurate system as its usage gets expanded over time.

There are certain limitations to root-cause analysis using machine learning technologies. Since it is probability based, there is no guarantee that the root cause identified is the actual root cause. Another limitation is that, unlike anomaly detection and predictive analytics, which can be done using data alone, root-cause analysis depends on human feedback. If there is no participation from the operations teams, root-cause analysis will remain at low levels of accuracy. Thus, the human element, feedback, and training of the AIOps engine are important parameters and limitations. Trying to do root-cause analysis with only unsupervised learning will not be effective: it will only be able to flag anomalies, and whether an anomaly is actually causing a system degradation or an incident may not be accurately determined by the AIOps engine. A further limitation comes from the nature of incidents in the IT domain: an incident may be unique, with a unique combination of events generated during it, and if it has not happened in the past, there is no precedent or data in the AIOps engine to use to arrive at a conclusion. Thus, novel incidents with previously unseen events are a challenge for current AIOps systems.

Thus, we cannot expect the current generation of AIOps systems to accurately pinpoint root cause without training by subject-matter experts and human operators. There are scenarios where the customers expect the AIOps engine to be a magic wand and automatically start finding the root cause and remediation of problems; however, deep learning and machine learning systems are dependent on labeled data and training, and without this training, the system is incapable of providing accurate results.

Since root-cause analysis is the most important and complex lever in the entire AIOps stack, it is important to pay utmost attention to its implementation and continued operations effectiveness. As a future direction, AIOps tools can use multiple algorithms and an ensemble of algorithms to do root-cause analysis and provide better accuracy even with limited training data.

Root-cause analysis feeds into automation; without having this process step trained and generating accurate results, end-to-end automation and remediation are not possible. Once the root cause is identified, it feeds into the automation engine to resolve problems automatically and catapult the organization into the highest levels of maturity with autohealing.

Predictive Analysis

Predictive analysis brings in the predictive element to IT operations. This is an area where the customers have always expected predictive capabilities from IT operations systems but have not been able to achieve them. AIOps brings predictive analytics capabilities to IT operations and fulfills this unmet demand from the operations teams.

As the name suggests, predictive analytics means the ability to predict things in advance based on the data provided to AIOps systems. There are use cases where predictive analytics has a role to play in the IT operations space, so let’s take a look at some of them.

One important application of predictive analytics is in the area of performance management and capacity planning. With metric data being available to AIOps systems, it is possible for the AIOps engine to predict what the future utilization of these systems will be. The data around users accessing an application and the associated utilization of the system can be used to make predictions based on scenarios of how many users will hit the application and how much infrastructure will be needed to support these users. Regression techniques can be used to consider the current performance and workload of a system and predict the future utilization of the infrastructure. Being able to predict in advance the utilization helps IT operations teams to plan better for infrastructure capacity and instantiate new virtual machines or cloud instances to cater to the predicted demand. In microservices-based applications, new pods are spun automatically to cater to increased demand on infrastructure.

Predictive analytics uses regression techniques, which can take into consideration the seasonality of the data and provide accurate results. As an example, a backup or data processing job at the end or beginning of a month may be causing performance issues and errors in an application. Leveraging predictive analytics techniques can ensure that the AIOps system is able to forecast the utilization and the operations teams can take appropriate actions to increase the capacity during that period by spinning extra instances or by vertically scaling up the capacity so that there are no performance issues and incidents are avoided.
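A minimal sketch of a seasonality-aware forecast: fit a linear trend, compute the average residual for each seasonal slot, and project both forward. This is a toy stand-in for the regression techniques mentioned above, assuming hourly samples and a daily cycle.

```python
import numpy as np


def forecast_utilization(history, period=24, horizon=24):
    """Forecast = linear trend + average seasonal residual per slot.

    history: utilization samples at regular intervals (hourly assumed),
    covering at least a few full periods so every slot has data.
    """
    y = np.asarray(history, dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)                 # long-term trend
    residual = y - (slope * t + intercept)
    seasonal = np.array([residual[t % period == s].mean()  # hour-of-day effect
                         for s in range(period)])
    future_t = np.arange(len(y), len(y) + horizon)
    return slope * future_t + intercept + seasonal[future_t % period]
```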

Another use case based on metrics is trend finding, where the AIOps engine is able to spot a trend and an associated event at the end of a trend. Based on this association, it can predict in advance things like a system failure. An example of this would be a memory leak in an application that causes the memory utilization of a machine to keep increasing, thus following a trend. After consuming all the available memory, the system starts to consume disk space as memory is paged to the physical disk. After a while, the application starts to slow down and eventually crashes, causing a set of events. This pattern can be detected by the predictive analytics engine, and on observing a trend, the AIOps system can warn the operations teams of impending failure.

Another example along the same lines is faulty database connection code that does not release connections and after a while chokes the entire database, so the application starts to get connection failure alerts. The metric around database connections, when plotted, will form an increasing trend that can be deciphered by the AIOps system to forewarn the operations teams.
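A trend of this kind can be caught with something as simple as a fitted slope and a linear extrapolation toward a known capacity limit, as in the sketch below; the capacity value (say, the database’s maximum connection count) is an assumed input.

```python
import numpy as np


def warn_on_trend(samples, capacity):
    """Estimate how many samples remain before a climbing metric hits capacity.

    Returns None when there is no meaningful upward trend; otherwise a naive
    linear extrapolation of the remaining headroom.
    """
    y = np.asarray(samples, dtype=float)
    slope, _ = np.polyfit(np.arange(len(y)), y, 1)
    if slope <= 0:
        return None                    # flat or declining: no warning needed
    return max((capacity - y[-1]) / slope, 0.0)
```

For example, `warn_on_trend(connection_counts, capacity=500)` would estimate how many more samples remain before an assumed 500-connection limit is exhausted.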

Like other elements in the AIOps space, predictive analytics is also based on probability and thus may not be 100 percent accurate. Predictive analytics in the IT operations space looks at events and metrics to predict what could be a probable outcome and then alerts the operations teams.

Customers sometimes think of AIOps tools as a magic wand and expect them to predict and prevent all kinds of failures. This is impossible with the current state of the technology, as not all failures are predictable. Only failures that have underlying patterns, decipherable through trends or a sequence of events prior to the failure, can be caught by the AIOps engine. We have to be aware of the limitations of the system, configure it to its best capabilities, and not expect magic. Many failures are unpredictable in nature and happen at random. Devices and systems fail randomly without any forewarning, and predicting their failure accurately is not possible with the current state of the technology.

Predictive analytics can be simple, based on a single variable, or it can work on multiple variables and the correlation between them to arrive at a prediction. Predictive analytics systems can account for the seasonality of data in their models to arrive at predictions with a high level of accuracy.

Predictive analytics results in proactivity in operations and a higher availability of the system since the problem is remediated before it is able to impact the availability or response time of an application.

Visualization

Visualization is an important element of AIOps tools. Various types of views and dashboards are needed from an operations point of view.

The foremost visualization is the event console. The AIOps tools need to have an intuitive and easy-to-use event console. The event console is a grid view that has all the alerts that need action or analysis from the operations teams.

The following is important information that is available in the event console:
  • Event ID/alert ID

  • Description

  • First occurrence

  • Last occurrence

  • The number of times the event has occurred

  • Any associated incidents with the alert

  • Severity of the event

  • Whether it is a probable cause or not

  • Status, whether open or cleared

  • Event/alert history with associated actions and state changes

The event console typically color-codes events according to their severity and whether they are tagged as probable cause. Events that are correlated together are shown together in a consolidated console, enabling the operations teams to look at all correlated events associated with a probable cause in a single place and deliberate on the root cause.
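As a compact summary of the console fields listed above, here is a hypothetical record structure; the field names are illustrative, not any particular product’s schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConsoleEvent:
    """One row in the event console; field names are illustrative."""
    event_id: str
    description: str
    first_occurrence: float                 # epoch seconds of the first occurrence
    last_occurrence: float
    count: int = 1                          # times the event has occurred
    incident_id: Optional[str] = None       # associated ITSM incident, if any
    severity: str = "minor"
    probable_cause: bool = False
    status: str = "open"                    # "open" or "cleared"
    history: List[str] = field(default_factory=list)  # actions and state changes
```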

Beyond the event console, the AIOps console has other dashboards that provide aggregated and consolidated information, as shown here:
  • Event trends and patterns, with graphical plots of event trends

  • Top events across the environment

  • Top applications or infrastructure elements causing events

  • Information on event flood

  • Performance data plots for metrics

  • CMDB views/topology views

  • Historical data around events, alerts, and performance metrics

Apart from the previous views, the AIOps engine may also provide views and information on the performance of the AIOps engine itself.

Visualization in the AIOps engine thus comprises the event console, dashboards for real-time data, and reports for historical analysis; this drives the collaboration process, which is discussed next.

Collaboration

IT operations teams collaborate to find the root cause and to deliberate on ways of solving the problem. The Command Center function in IT operations is where collaboration between various teams happens over a bridge. A bridge is an online real-time call over communication channels like Microsoft Teams or the phone, where multiple teams interact and collaborate to look at the events and incidents and try to analyze and find the root cause of the issue at hand. Priority 1 incidents that are impacting an application or infrastructure mandate the opening of a P1 bridge where the required stakeholders from different technical domains collaborate.

In AIOps, the same process runs; however, there are a few differences. More and more teams are leveraging the built-in ChatOps features of AIOps tools where different team members can converse as well as run scripts to diagnose and resolve the problem.

Another change in AIOps is that rather than looking at different consoles and events, the entire team has access to the AIOps event console where consolidated and correlated events along with probable cause that are flagged by the AIOps system are available for the teams.

Views of the topology and relationships between the affected systems, or the systems under investigation, are also available from the AIOps console; thus, the teams don’t have to go to various systems to get the complete picture.

This results in speeding up the entire process of root-cause analysis, problem identification, and resolution of the problem.

Another important aspect of collaboration in AIOps is that this is no longer a collaboration between people only. The artificial intelligence system is a party to the entire collaboration and is storing the information in its records to be used as a learning tool and to be used in future incidents involving the same set of events or same probable cause. AIOps tools can bring up the past records and help the operations teams in referring to the accumulated knowledge from past collaborative analysis that was done. Thus, historical knowledge is not lost but accumulated for usage in the future.

Feedback

Feedback is the last step in our Observe process, but it’s perhaps the most important step. As you learned in earlier sections, root-cause or probable cause analysis is one of the most important use cases in AIOps, and the foundation of root-cause analysis is the continuous feedback on its accuracy and confidence scores provided by the operations teams. Every root cause identified by the AIOps engine is analyzed, and feedback is provided in the system by the operations teams. Thus, an incorrect root cause provided by the AIOps engine is marked as incorrect, and the correct ones are marked as correct. This data feed helps the AI engine to understand the environment and improve on its model. This data is the labeled data that is required for training the supervised learning system in AIOps. Once there is enough data with the AIOps engine on what events are root cause and which events are not root cause, it is able to better analyze and interpret the next set of events based on this learning. Thus, feedback is what powers continuous learning of the system. This enables the AI system to learn and improve its accuracy and confidence scores and achieve a level of accuracy where the data can be then used to initiate automation.

Typically, after a few months of running the AIOps system and providing correct feedback, the system’s accuracy and confidence scores reach a level where automation can be initiated from the AIOps engine for high-confidence probable-cause alerts, so that the entire journey from detecting a problem to taking corrective action through automation happens without human intervention. This concludes the Observe stage of the AIOps system. Let’s now move to the other core function under AIOps, which is Engage.

Engage

The Engage area is related to ITSM and its functions. It is an important piece in the AIOps space as it primarily deals with the processes and their execution by various functions and the metrics around process and people. The Engage piece deals with the service management data and hence is a repository of all the actions taking place in important ITSM functions like incident management, problem management, change management, configuration management, service level agreements, availability, and capacity management. Figure 2-5 illustrates this.

AIOps-driven service management comprises incident creation and task assignment; task analytics and agent analytics; process analytics and change risk analytics; and visualization, collaboration, and feedback.

Figure 2-5

AIOps-driven IT service management

Continual service improvement is an important lifecycle stage in ITSM, and that’s where most of the analytics is performed in AIOps. In Observe, the primary data includes events, metrics, logs, and traces, but here the primary data is around the activities being done in various processes. Workflows in Observe are more machine to machine; here the workflows involve the human element.

The data in Observe is mostly real time, but in Engage it is a mix of real time as well as on-demand analytics.

Let’s do a deep dive on this and understand its elements and stages.

Incident Creation

The Engage phase starts with the Observe phase creating an incident in the ITSM system. After probable cause analysis creates a qualified alert, the alert is sent to the ITSM system for creation of an incident. Incident creation needs various fields to be populated in the ITSM system so that the information is complete and helps the resolution teams in resolving the incident. The AIOps Observe tool and the ITSM tool are integrated to automatically create incidents in the ITSM system and have the fields in ITSM autopopulated from the information available in the Observe module. This includes the description of the alert and other related information as defined earlier in the Observe section.

If an alert is cleared in the Observe module, the AIOps engine automatically updates the incident in ITSM and marks it as cleared so that the incident can be closed. If the alert receives new triggered events, the engine keeps updating the incident in the ITSM module with the new information, alerting the operations teams.

There are scenarios where the events in the Observe console do not get autocleared if a problem is resolved. In those scenarios, there is a two-way integration where the ITSM system clears the alarm when an incident is closed so that the Event console reflects the accurate state of systems being monitored.

Task Assignment

In traditional systems, tasks are assigned to engineers by track leads based on the availability of the resources and the skill required to deliver a particular task. In modern AIOps-based systems, the task assignment is done through automation defined in the ITSM system or outside of it. The task assignment engine takes into consideration the availability of the resource in a particular shift, their skill level, the technology required to solve a particular task or incident, and the workload that is already with the resource. Based on these parameters, the ticket is assigned to an individual to work upon and update the progress until closure.

Task assignment is done using rule-based systems rather than machine learning, since matching the skills, experience level, workload, and availability of a resource to a task is a set of lookups that rule-based systems handle well.

However, natural language processing and text extraction–based systems can be used to extract the information from the incident and map it probabilistically to the right skill and thus aid the task assignment engine. The machine learning capabilities using text extraction help in automatic mapping of the task to the right skill rather than using a regular expression–based approach to look for keywords. Using or not using machine learning for this is entirely dependent on the scale, size, and complexity of the environment. For smaller environments, rule-based systems would work perfectly well, and leveraging machine learning here may not be needed. However, larger and more complex operations would require this capability to run efficient operations.
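A minimal sketch combining both ideas: keyword-based skill extraction (a simple stand-in for the NLP techniques mentioned) feeding a rule-based pick of the least-loaded engineer with the matching skill. The keyword map and roster are invented for illustration.

```python
SKILL_KEYWORDS = {                     # illustrative keyword-to-skill mapping
    "oracle": "dba", "tablespace": "dba",
    "router": "network", "bgp": "network",
    "pod": "kubernetes", "deployment": "kubernetes",
}

ENGINEERS = [                          # hypothetical roster for the current shift
    {"name": "asha", "skill": "dba", "open_tasks": 3},
    {"name": "liam", "skill": "network", "open_tasks": 1},
    {"name": "mei", "skill": "kubernetes", "open_tasks": 2},
]


def assign(incident_text):
    """Pick the least-loaded on-shift engineer whose skill matches the incident."""
    text = incident_text.lower()
    skill = next((s for k, s in SKILL_KEYWORDS.items() if k in text), None)
    candidates = [e for e in ENGINEERS if e["skill"] == skill]
    if not candidates:
        return None                    # no match: fall back to manual triage
    chosen = min(candidates, key=lambda e: e["open_tasks"])
    chosen["open_tasks"] += 1          # account for the new workload
    return chosen["name"]
```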

Task Analytics

Tasks assigned to individuals need to be analyzed; each task in the system generates data. Statistical analysis of these tasks provides insights into how the process and people are performing. Tasks can be analyzed for volume as well as for efficiency in terms of time taken at each step. Analyzing tasks yields important insights for running Six Sigma or Lean projects in the organization.

This is also used for assessing the accuracy of the assignment engine to see if the tasks are correctly assigned. If the tasks are not correctly assigned, the task will keep hopping between different teams, and this may indicate a problem with the assignment engine.

Agent Analytics

Similar to task analytics, the other important lever in ITSM is the agents or resources working on these tasks. Agent analytics analyzes the performance of human as well as automated agents on parameters such as accuracy, time taken to resolve issues, individual performance, and performance compared to baselines. This can flag issues with skills or availability of resources. This data is also useful for analyzing whether the assignment engine is assigning tasks correctly.

Change Analytics

Changes including patches, updates, upgrades, configuration changes, and release of new software into production are potential sources of incidents. What was working before may stop working after making a change. Thus, it is important to analyze changes that are happening in the infrastructure and application environment.

Change analytics includes areas where the impact of change can be assessed by using topology and configuration information. Change analytics also includes probabilistic analytics of risk to infrastructure and platforms because of changes. This may involve analyzing the data around topology, relationships between various components, the size and complexity of the change involved, and the associated historical data with these changes to arrive at a risk score for a particular change. The feedback score from technical evaluators and business approvers of change is also an important input to analyze the change and plan for its execution, keeping in mind the risks that it carries.

Process Analytics

We talked about analyzing key basic processes in ITSM including incident management and change management. However, all processes in ITSM need analytics particularly around the KPIs that are defined for each process.

As an example, change management has associated KPIs for changes implemented within a certain time, changes that caused outages and incidents, etc. Similarly, incident management has KPIs around response time and resolution time of an incident along with other process KPIs such as time taken to identify root cause, etc.

Service level management processes have KPIs around the SLAs that may be related to the response and resolution of priority-based problems. For example, all P1 incidents should be responded to within 5 minutes and resolved within 30 minutes with an SLA of 90 percent over a monthly cycle. This means that 90 percent of the P1 incidents should be responded to and resolved within the time defined, and this calculation is done on a monthly basis and reset at the beginning of each month.
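The SLA calculation in this example is straightforward to express in code; the sketch below assumes incidents carry response and resolution times in seconds and computes monthly attainment for P1 incidents.

```python
def sla_attainment(incidents, respond_limit=300, resolve_limit=1800):
    """Percentage of P1 incidents responded to and resolved within the limits.

    Times are in seconds: 300 s = 5 minutes, 1,800 s = 30 minutes, matching
    the example above; `incidents` is the list for one monthly cycle.
    """
    p1 = [i for i in incidents if i["priority"] == "P1"]
    if not p1:
        return 100.0
    met = sum(1 for i in p1
              if i["response_time"] <= respond_limit
              and i["resolution_time"] <= resolve_limit)
    return 100.0 * met / len(p1)
```

A result of 90.0 or higher for the month would mean the 90 percent SLA was met; the counter resets at the start of the next monthly cycle.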

All this process data is fed into the AIOps analytics engine to do statistical analysis of this data and analyze it for process improvement purposes. You can also use AIOps machine learning techniques like regression here to predict the future metrics based on the historical data; the regression techniques will take into consideration the seasonality variations and the data from the past to arrive at the future predicted values.

This data helps to better plan the resources and also feeds into process improvement initiatives.

Visualization

Since most of this data in ITSM systems is about people, process, and technology aspects, it is important that we have the right level of visualization and dashboarding technology available to make sense of this data and walk the path of continual service improvement.

There are various stakeholders who need access to this data, and their requirements are different; thus, the visualization layer needs to have role-based access and role-based views to facilitate the operations team.

There are service delivery managers, incident managers, command center leads, process analysts, consultants, and process owners. You also have the service level manager and the change and configuration managers, who are responsible for the SLAs with customers and for maintaining the CMDB, respectively. All these roles require the right level of insight and visualization into the relevant data to be able to manage their respective processes.

Visualization is also needed for business owners and application owners, and in the case of outsourcing engagements, there are views needed by the customer and service provider.

The right dashboarding and visualization tooling in AIOps is essential to get all the required data and insights, including insights generated by machine learning algorithms, to run operations with higher efficiency and maturity.

Collaboration

Just like Observe, collaboration is important in the Engage phase as well. In the Observe phase, the collaboration between teams happens on a bridge or using ChatOps to find the root cause of a problem. In the Engage phase, collaboration happens between various stakeholders to resolve the problem. Thus, various stakeholders collaborate on the tickets to bring the problem to closure. Unlike Observe, where multiple teams have to come together to analyze the issues, here it is a limited set of people sometimes restricted to a particular technology domain; at times it is only one individual who is working on the ticket to resolve the issue.

Collaboration also happens in the service request and change execution task; however, most of it is orchestrated through a rule-based system where completion of each task is sequentially assigned to the person who needs to do the job. There is a greater degree of collaboration required in change management as multiple teams may be involved in executing a complex and big change; the change management process manages these through a rule-based approach where the required stakeholders are brought together by the system at various stages of a change.

If the person responsible for executing a task is not able to finish it within the assigned time period, the system assigns or engages a higher-skilled resource to help complete the task in time. This is all done using a rule-based engine that keeps tabs on the time to complete a task and escalates after a set period of time expires.

Collaboration also happens in these processes on the visualization or dashboarding layer where different stakeholders can collaboratively look at the data, analyze it, and make decisions that require inputs from multiple teams or stakeholders.

Though largely rule based, there are aspects of ChatOps that can be used in the Engage step, where teams collaborate over incidents, problems, and changes in real time. This data is also stored for knowledge management.

Knowledge management is a key area in Engage, since ITSM systems are the primary repository of most of the information in IT service management. AIOps techniques such as natural language processing and text extraction come in handy to find the relevant information while resolving incidents and executing changes and service requests. The AIOps system can use information retrieval and search techniques to find the relevant information quickly and easily so that the operations teams can resolve incidents faster.

Feedback

The feedback in the Engage phase is generated through various mechanisms and in various processes. In the incident management process, the closure of an incident triggers feedback that is filled in by the impacted user; similarly, reopened incidents are a feedback mechanism for analytics. Feedback on failed changes or changes that had to be aborted and changes that were executed fully but caused incidents is an important input.

Service requests raised by users also trigger a feedback post-completion and form an input to analytics to understand how well the process is performing.

All this data is fed into the system to be visualized in the visualization layer and analyzed using analytics techniques.

Rather than acting as feedback to an algorithm, the feedback here is used mostly in data analytics for decision-making to improve the overall process.

The Engage ITSM systems orchestrate the entire process, and every step of the process is logged and updated in the Engage phase; however, the actual action performed is under the Act phase, which we will be discussing next.

Act

The Act phase is the actual technical execution of the task that includes execution of incident resolution, service request fulfilment, change execution, etc. Figure 2-6 offers a visualization of this phase.

AIOps-driven IT automation comprises automation recommendation and execution; change orchestration, service request fulfilment, and incident resolution; automation analytics; and visualization, collaboration, and feedback.

Figure 2-6

AIOps-driven IT automation

Thus, all technical tasks executed by the operations team come under this phase.

The completion of the AIOps journey happens with the Act layer; it is here that the incident is resolved, and the system is brought back to its normal condition. AIOps has benefits without this layer as well, where most of the diagnostic and analytics activities are covered under the Observe and Engage sections; however, extending AIOps to Act increases the benefits manifold as organizations are able to not just find problems quickly but are able to resolve them automatically without human intervention.

For the Act layer to work, it is essential that we have the Observe layer implemented and fine-tuned. Without the ability of the AIOps engine to detect anomalies and probable cause and trigger an action, it is not possible for the Act layer to resolve a problem.

Thus, the Act layer is integrated with the Engage and Observe layers to get its data feed and then acts on that feed to take resolution or other actions on the technical environment. The Observe layer uses the AIOps techniques described earlier to find the probable cause and then creates an incident in the ITSM system (the Engage layer); the automation of the Act layer can pick up these incidents from the Engage layer and resolve them automatically. To resolve incidents automatically, it needs to know how to resolve an incident, and it needs to understand the incident and the infrastructure on which it occurred. We will look at the various techniques used in AIOps, starting with the least complex but very effective technique of automation recommendation.

Automation Recommendation

The first step for resolution is to recommend which automation will resolve a particular problem. This can be done using a rule-based approach or a machine learning approach. In the rule-based approach, each type of probable cause is mapped with an automation, which is fired to resolve the probable cause. In machine learning AIOps approaches, this relationship is not fixed and is probabilistic.

Various techniques like natural language processing and text extraction are used to find the right automation for resolving a problem and then recommending that as a solution for the probable cause identified in the Observe layer.

The automations are generally static in the AIOps domain; however, there are newer technologies that use advanced machine learning techniques to club together runbooks that can be chained to resolve problems, thus creating new automations on the fly using machine learning. Tools like DryICE iAutomate provide these advanced features along with out-of-the-box runbooks and pretrained models to significantly enhance the automation recommendation capabilities.

Automation recommendations provide a confidence score for the automation. Low-risk tasks can be automatically mapped for execution of the recommendation, and any high-risk execution tasks can have a human in the middle approach where a human operator validates the recommendation before it is sent for execution.
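A minimal sketch of confidence-gated recommendation, assuming a static cause-to-runbook map with a risk label per runbook: low-risk runbooks execute automatically above a confidence threshold, while high-risk ones always require a human in the middle. The map, names, and threshold are illustrative.

```python
RUNBOOKS = {                                   # illustrative cause-to-runbook map
    "service down": ("restart_service", "low"),
    "disk full": ("clean_temp_files", "low"),
    "db connections exhausted": ("restart_app_pool", "high"),
}


def recommend(probable_cause, confidence, auto_threshold=0.9):
    """Return (runbook, mode), gating risky or low-confidence automations."""
    entry = RUNBOOKS.get(probable_cause)
    if entry is None:
        return None, "manual"                  # no known automation: human handles it
    runbook, risk = entry
    if risk == "high" or confidence < auto_threshold:
        return runbook, "approval_required"    # human-in-the-middle validation
    return runbook, "auto_execute"
```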

Automation Execution

Automation execution is the actual act of executing on the recommendation generated in the previous step. Thus, once a probable cause has been mapped with an automation, the automation is triggered to resolve the problem.

The execution layer can be built in the AIOps platform or can leverage existing automation tools available in the environment. The automation execution engine provides feedback to the AIOps tool on successful or unsuccessful execution results.

The automation can be triggered using a variety of tools including runbook automation tools, configuration management tools, provisioning tools, infrastructure as code tools, robotic process automation, and DevOps tools. Most organizations have multiple automation tools, and the relevant ones can get integrated as the automation execution arm to be fed by the automation recommendation engine.

Automation tasks can be simple ones, like running a PowerShell or shell script to reboot a system or restart services, or they can encompass more complex workflows that even involve spinning up new instances and infrastructure.
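As a deliberately simple sketch of the execution step, the example below runs a remediation command over SSH and reports the result back, which is the feedback the recommendation engine needs. Real environments would call a runbook automation, configuration management, or orchestration tool instead, and the commands here are illustrative.

```python
import subprocess


def execute_runbook(runbook, host, service=None):
    """Run a simple remediation over SSH and report the outcome.

    The command map is illustrative; production setups would delegate to a
    runbook-automation or configuration management tool instead.
    """
    commands = {
        "restart_service": f"sudo systemctl restart {service}",
        "clean_temp_files": "sudo find /tmp -mtime +7 -delete",
    }
    remote_cmd = commands.get(runbook)
    if remote_cmd is None:
        return {"runbook": runbook, "status": "unknown_runbook"}
    result = subprocess.run(["ssh", host, remote_cmd],
                            capture_output=True, text=True)
    return {
        "runbook": runbook,
        "host": host,
        "status": "success" if result.returncode == 0 else "failed",
        "detail": result.stderr.strip(),       # fed back to the AIOps engine
    }
```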

Most AIOps platforms today operate only at the Observe layer and do not include automation execution or recommendation in the toolset; however, tools like DryICE iAutomate provide automation recommendation and execution capabilities bundled with out-of-the-box workflows, enabling organizations to leapfrog to a higher level of maturity quickly.

Outside of the AIOps platforms, there are countless scenarios where it is possible to use AIOps disciplines as a trigger to an existing automation. Most organizations have Python or PowerShell scripts that can automate routine remediation workflows, such as rebooting a virtual machine. Expect these types of predefined automations to be a substantial portion of your intelligent automation portfolio, and reuse automation assets with AIOps solutions to increase the value of both the AIOps analysis and the automation development.

Automated execution can span different types of use cases including incident resolution, service request fulfillment, and change orchestration. Some of these are more suited to probabilistic scenarios, while others can use rule-based process workflows.

One should take care while triggering low-confidence recommendations, as the failure of an automation may lead to further problems. In such situations, it may be prudent to use other techniques such as diagnostic runbooks. Low-confidence automations can also mandate a doer-checker process whereby the actions of the system are evaluated by a human operator before being fired into the environment. In certain situations, a combination of a rule-based system with confidence-based recommendations works best, so an informed decision needs to be made based on the environment and the associated risk.

Incident Resolution

Incident resolution is one type of automated execution and the one that integrates most tightly with the Observe phase. The output of the Observe phase, a probable cause, becomes an input to the automation assessment; if an automation is available for that particular root cause, it can be triggered automatically or via a human-in-the-middle approach to resolve the problem.

Incident resolution is the primary area where probabilistic machine learning technologies in AIOps play a role: machine learning techniques for recommending an automation can be used effectively here and will provide better results than rule-based systems.

SR Fulfillment

Service request fulfillment is an area where users request particular services, which are logged into the ITSM system as service requests. A service request is then fulfilled as a series of tasks executed automatically, by human agents, or by a combination of both.

Since service request tasks are mostly deterministic and there is little ambiguity about how a task needs to be executed, the role for machine learning technologies is limited.

Service requests are fulfilled through a series of step-by-step processes termed tasks. At each stage of execution, the requester is kept updated on the progress of the request, and on completion the requester is notified of fulfillment along with a way to access the fulfilled deliverable.

Service requests can be for software systems, or they may be for hardware that needs to be physically delivered. For example, the delivery of a laptop to a new employee is a service request that goes through a physical fulfillment process and hence cannot be fully automated. Automation here is about integrating the service request system with procurement and ordering systems so that the request can be automatically forwarded to third-party partners or vendors, who take it up for fulfillment by shipping the laptop or other hardware.

Software deployment tasks for service requests are fully automated using software delivery platforms; these include things like deploying end-user applications such as Microsoft Office on laptops.

Service requests are initiated by the requester from a service catalog. The service catalog is similar to the shopping cart through which you order goods and services from online marketplaces such as Amazon.

The role of machine learning in this area is around cognitive virtual assistants, which provide an intuitive chat or voice interface to users rather than a web portal or catalog. This makes it easier for users to converse in natural language, arrive at the right catalog item, and order it, all from a chat interface. Cognitive virtual assistants are integrated in the Engage layer and raise a request after confirmation from the user. The virtual agents can also be used by the requester to track the progress of the request.

Cognitive virtual assistants internally use natural language processing and understanding along with various machine learning and deep learning technologies to decipher the intent of the user and provide appropriate responses.
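
As a toy illustration of intent detection, the following sketch trains a naive Bayes classifier on a handful of phrases. The training phrases, intent labels, and the choice of scikit-learn are illustrative assumptions; production assistants use far richer NLU models.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: user phrases mapped to intents.
phrases = [
    "I need a new laptop", "order a laptop for me",
    "install microsoft office", "deploy office on my machine",
    "what is the status of my request", "track my ticket",
]
intents = [
    "request_laptop", "request_laptop",
    "install_software", "install_software",
    "track_request", "track_request",
]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(phrases, intents)

# The detected intent would map to a catalog item in the Engage layer.
print(model.predict(["please order me a laptop"])[0])  # likely request_laptop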

The primary use case for cognitive virtual assistants is in the area of service requests; however, similar functionality and use cases are equally applicable in incident management and change management.

Change Orchestration

Similar to service request fulfillment, changes are granularly planned and comprise a series of tasks performed by different teams.

There are additional tasks around change review where a change advisory board comprising technical and business stakeholders reviews all aspects of the change and approves it.

There are other steps such as reviewing the change test plans, rollback plans, etc., and at each stage different stakeholders may be involved in the review and analysis of the change.

Once everything is reviewed and the change is ready to be executed as per schedule, the change is executed in a step-by-step fashion by various teams involved in the technical execution.

Thus, change orchestration is a well-scripted process that involves well-defined steps and tasks at each stage, and rule-based systems have been used to run this process for ages. Since change orchestration tasks, like service request tasks, are mostly deterministic and there is little ambiguity about how they need to be executed, the role for machine learning technologies is limited.

There are, however, a few areas where machine learning or analytics technologies can be used in change orchestration.

Change scheduling and conflict detection is one such area. A change involves infrastructure, platform, and application components, and it is scheduled for a particular time on a particular date. Analytics techniques can be used to find out whether other items are impacted by the change and whether there are conflicting or overlapping changes that affect connected devices or systems. This is done by overlaying topology and configuration data onto the change schedules; the change advisory board can use this information to better analyze the change and its impact, which may result in rescheduling changes in the case of conflicts.
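
Here is a minimal sketch, assuming hypothetical change records and a topology edge list, of how overlapping change windows on connected configuration items could be flagged for the change advisory board.

from datetime import datetime

# Hypothetical change records with their scheduled windows.
changes = [
    {"id": "CHG1001", "ci": "db-server-1",
     "start": datetime(2022, 5, 1, 22, 0), "end": datetime(2022, 5, 2, 2, 0)},
    {"id": "CHG1002", "ci": "app-server-1",
     "start": datetime(2022, 5, 1, 23, 0), "end": datetime(2022, 5, 2, 1, 0)},
]
# Hypothetical topology: app-server-1 depends on db-server-1.
topology = {("app-server-1", "db-server-1")}

def connected(ci_a, ci_b):
    return (ci_a, ci_b) in topology or (ci_b, ci_a) in topology

def overlapping(c1, c2):
    return c1["start"] < c2["end"] and c2["start"] < c1["end"]

conflicts = [
    (c1["id"], c2["id"])
    for i, c1 in enumerate(changes)
    for c2 in changes[i + 1:]
    if overlapping(c1, c2) and connected(c1["ci"], c2["ci"])
]
print(conflicts)  # [('CHG1001', 'CHG1002')]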

The other area is change risk analysis. Every change carries a risk to the application and infrastructure and can result in downtime. Predictive analytics techniques can estimate the risk of a change based on the components involved, the complexity of the change, and risk analysis data from previous similar changes. This predictive analytics component of AIOps provides additional information to the change advisory board and the technical execution teams.
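
A minimal sketch of such risk scoring might use logistic regression, as follows. The features (components touched, complexity, prior failures) and the tiny training set are fabricated purely for illustration; real models would draw on far richer change history.

from sklearn.linear_model import LogisticRegression

# Each row: [components touched, complexity (1-5), prior failed changes].
X = [[1, 1, 0], [2, 2, 0], [5, 4, 2], [8, 5, 3], [3, 2, 1], [7, 4, 2]]
y = [0, 0, 1, 1, 0, 1]  # 1 = a previous change like this caused an incident

model = LogisticRegression().fit(X, y)
risk = model.predict_proba([[6, 4, 1]])[0][1]
print(f"predicted risk of failure: {risk:.2f}")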

Cognitive virtual assistants can also be used in the change management process, enabling technical and process teams to collaborate and find information about a change through the assistants' intuitive NLP capabilities.

We have covered some of these elements in the change analytics section in Engage as well.

Automation Analytics

Automation analytics is an important area in the execution space. Though most analytics is relevant to the Engage phase, automation execution generates its own valuable data that needs analysis.

Some of the data generated by automation is used to further improve the accuracy and efficiency of the automation system; the rest is used for reporting and analytics on how automation is currently performing in the organization.

Generally, the following are important automation KPIs (a sketch of computing a few of them follows the list):
  • Automation coverage

  • Automation success rate

  • Automation failure rate

  • Most-used use cases

  • Least-used use cases

  • Failure cause analysis

  • Low automation areas
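
As referenced above, here is a minimal sketch of computing automation coverage and the success rate from execution records; the record format is a hypothetical assumption.

# Hypothetical execution records logged by the automation engine.
records = [
    {"use_case": "restart_service", "automated": True, "success": True},
    {"use_case": "restart_service", "automated": True, "success": True},
    {"use_case": "cleanup_disk", "automated": True, "success": False},
    {"use_case": "db_failover", "automated": False, "success": False},
]

automated = [r for r in records if r["automated"]]
coverage = len(automated) / len(records)           # automation coverage
success_rate = sum(r["success"] for r in automated) / len(automated)
print(f"coverage: {coverage:.0%}, success rate: {success_rate:.0%}")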

Visualization

Just like the visualization and dashboarding we covered in the Engage phase, the automation phase also logs its data into the system. The parameters listed earlier are important for automation teams to track, and they need to be visualized using dashboarding technologies to provide a bird's-eye view of how automation is performing and to run some level of analytics for continual service improvement.

Collaboration

Collaboration in automation is largely delivered through the Engage phase, since all activities are logged in the IT service management systems. Thus, collaboration between the various teams and people involved in incident, service request, and change orchestration happens in the Engage phase using ITSM tools.

However, with AIOps ChatOps, cognitive virtual agents become the core of collaboration: teams can interact with the right stakeholders and get the required data from the ITSM systems during the Act phase. Thus, real-time collaboration during the actual activities of the Act phase is done using ChatOps, where humans interface with other teams and with machines to analyze data and make appropriate decisions.

Feedback

Feedback from automated resolutions is an essential input into the AIOps system and is used to analyze the efficiency of automation execution. The success and failure of automation scripts are important learning data for the machine learning algorithms; this data helps the system improve its accuracy and adjust the confidence scores of the various automation engines and scripts.

Operators confirm the actions that the AIOps algorithms suggest, and thus human input trains the AIOps algorithms, which improve their models and adjust confidence scores based on this feedback.
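
A minimal sketch of this feedback loop, assuming a simple success/failure counter, shows how confidence scores can rise with successful executions and operator approvals and fall with failures or rejections.

class ConfidenceTracker:
    # Simple success/failure counter; the pseudo-counts below are an
    # illustrative prior, not a recommendation from any AIOps product.
    def __init__(self, successes=1, failures=1):
        self.successes = successes
        self.failures = failures

    def record(self, succeeded, operator_approved=True):
        # An operator rejection counts as a failure signal for learning.
        if succeeded and operator_approved:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def confidence(self):
        return self.successes / (self.successes + self.failures)

tracker = ConfidenceTracker()
for outcome in [True, True, True, False, True]:
    tracker.record(outcome)
print(f"confidence: {tracker.confidence:.2f}")  # rises toward autonomy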

Over time, the AIOps engine becomes better tuned to the environment it operates in by learning from the actions of human operators. The confidence scores of various resolutions become high enough to move them to fully autonomous mode, where the human in the loop is no longer needed for certain use cases. This fulfills the promise of AIOps to turn operations into NoOps by leveraging the autohealing capabilities of AIOps technologies. The expansion of the fully automated use-case library and its success rate are greatly affected by knowledge of the application blueprint and its relevance, or business impact level (BIL), which is where application discovery becomes important; we discuss it next.

Application Discovery and Insights

To manage business transactions’ key performance indicators (KPIs) and to guarantee business process service level agreements (SLAs), enterprises also need powerful full-stack analytics.

These analytics need to automatically map business transactions (such as orders, invoices, and so on) to their application services (web server, application server, databases, and so on) and to the supporting infrastructure (compute, network, and storage), as shown in Figure 2-7.

A flowchart with four layers maps business transactions to application services (such as app servers 1 and 2 and web services) and to the supporting infrastructure (such as LDAP servers 1 and 2 and a file server), with full-stack analytics including business impact, service impact, topology-driven correlation, and CMDB.

Figure 2-7

AIOps for business and service impact analysis

This must be done in real time across distributed, hybrid IT environments. Without it, enterprises are forced into extensive and complex troubleshooting exercises to triangulate hundreds of thousands, if not millions, of data points. The time required to do so can negatively impact the uptime and performance of key processes such as e-commerce and order to cash.
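
As an illustration of the kind of correlation involved, the following sketch walks a hypothetical dependency graph from a failing infrastructure component up to the business transactions it supports, mirroring the mapping in Figure 2-7. The component names and edges are assumptions made for the example.

from collections import defaultdict

# Hypothetical topology; edges point from a component to what depends on it.
depends_on_me = defaultdict(list)
for child, parent in [
    ("file-server", "app-server-1"),
    ("ldap-server-1", "app-server-1"),
    ("app-server-1", "order-service"),
    ("order-service", "order-to-cash"),  # a business transaction
]:
    depends_on_me[child].append(parent)

def impacted(component):
    # Walk upward from the failing component to everything it supports.
    seen, stack = set(), [component]
    while stack:
        node = stack.pop()
        for parent in depends_on_me[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(impacted("file-server"))  # {'app-server-1', 'order-service', 'order-to-cash'}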

Making Connections: The Value of Data Correlation

The app economy is upon us, and businesses of all stripes are moving to address it. In this age of digital transformation, businesses rely on applications to serve customers and improve operations. Businesses need to rapidly introduce new applications and adopt new technologies to become more agile, efficient, and responsive.

As part of these efforts, businesses are employing cloud-based solutions, software-centric and microservices architectures, and virtualization and containers. But these new architectures and technologies are creating challenges of their own.

Some business applications today are hosted on public clouds, and enterprises tend to have no, or very limited, visibility into those clouds.

Applications are increasingly deployed on virtual machines rather than physical servers, which adds more complexity.

Containers often exist alongside, or within, virtual machines, and both the use of containers and the number of containers themselves are proliferating quickly across enterprise IT environments.

Because this environment is very different from what came before, application performance tools created a decade or so ago no longer suffice. Tools that consider only the application, and not the underlying infrastructure, fall short. Modern tools must collect and correlate information about both the application and the underlying infrastructure, including data about application server performance, events, logs, transactions, and more. The compute, network, and storage resources involved in application delivery also need to be figured into the equation.

Summary

In this chapter, we covered the different layers of AIOps, i.e., Observe, Engage, and Act. We covered the three core foundation pillars of AIOps along with the core analytics techniques of event correlation, predictive analytics, anomaly detection, and root-cause analysis. Each function within the Observe, Engage, and Act layers was illustrated with real-life practical examples so that AIOps and its functions can be demystified and correlated with the on-the-ground activities of operations teams today. In the next chapter, we will cover the challenges that organizations face while deploying AIOps.
