© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
N. Sabharwal, G. Bhardwaj, Hands-on AIOps, https://doi.org/10.1007/978-1-4842-8267-0_1

1. What Is AIOps?

Navin Sabharwal (1) and Gaurav Bhardwaj (1)
(1) New Delhi, India

This chapter introduces artificial intelligence for IT operations (abbreviated as AIOps). As application and infrastructure landscapes transform rapidly and cloud-native technologies are adopted, organizations are finding it difficult to provide 24/7 operations that can scale and meet the needs of businesses that now want much higher availability and the agility to change based on customer and market feedback. This chapter also details the benefits that AIOps brings to the table and how it supports the digitization journey of enterprises.

Introduction to AIOps

AIOps is a buzzword in the operations world and was coined by Gartner in 2016. As mentioned, it means implementing artificial intelligence for IT operations. AIOps refers to a transformational approach to running operations using AI and machine learning technologies in various operations domains such as monitoring, observability, event correlation, service management, and automation. With the exponential growth seen in application and platform diversity, including the movement to microservices and cloud architectures, there is an enormous amount of data being generated in operations. The operations teams are overwhelmed with this vast amount of data and the diversity in applications, platforms, and infrastructures in the environment. Most enterprises today are rapidly migrating and adopting new technologies such as cloud and microservices architecture, and thus the rate of change in infrastructure and platforms is unlike anything seen before. The challenge in IT operations is to run steady-state operations without disruption and also support this agility and migration and bring new services into operations. These disruptions and changes are putting an enormous strain on the operations teams. Processes and systems that have worked in the past are not working anymore, and the new digitalized world with rapid changes both in applications and infrastructure is resulting in newer challenges. Thus, AIOps has evolved over the last few years as a potential solution to the operational challenges of the new model.

The huge amount of data generated by monitoring and observability systems is one of the sources fed into AIOps-based systems, where AI and machine learning techniques are used to make sense of the data and separate critical events from noise. This automates most of the tasks that were previously manual and relied on human judgment and tribal knowledge. Analytics techniques efficiently identify the events that are the root cause of a disruption to business operations, so that the groups responsible for resolving them can be notified immediately. Without AIOps, this process is difficult to run while technology changes at a rapid pace; relying on older systems and tribal knowledge would mean that operations are neither scalable nor predictable.

All of this is enabled by the emergence and maturity of artificial intelligence and machine learning technologies, which are the foundation of AIOps.

Artificial intelligence has transformed the way systems are developed and business processes are run. AI is everywhere from the image processing in your phone to recommendation engines on Amazon that provide you with new product recommendations based on your preferences. Face recognition and image beautification on phones are examples of applications that people are consuming every day without even knowing that artificial intelligence is powering these applications. Natural language processing advancements have transformed the way we interact with applications. Today voice assistants like Alexa, Siri, and Cortana are changing the way we communicate with content.

Information technology has leveraged these technologies to solve varying business problems in the areas of building recommendation systems, predictive systems, image recognition, voice recognition, text extraction, and natural language understanding systems.

However, when it comes to solving IT problems using artificial intelligence, enterprises and technology companies have yet to embrace this fully.

Finally, the AI technologies that IT teams have used to deliver exciting new applications for consumers and businesses are now finding their way into monitoring and managing the IT technologies. Thus, a new class of systems that are using algorithms to run IT operations was born.

AIOps is a term Gartner invented to describe a general trend of applying AI techniques to IT operations data sources to provide additional insights. AIOps is essentially a feature or set of features to collect, combine, and analyze data.

According to Gartner, “By 2023, 40% of DevOps teams will augment application and infrastructure monitoring tools with artificial intelligence for IT operations (AIOps) platform capabilities.” AIOps platforms are platforms that “utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations functions with proactive, personal and dynamic insight.”

Figure 1-1 defines the various areas in IT operations that are included in AIOps. These include monitoring, event analytics, predictive and recommendation systems, collaboration and engagement, and reporting and dashboarding technologies.

A chart of various layers of information technology in A I Ops. In the middle, there is a chart labeled Devops/operation. To the left, there are three icons Chatbox, robotic process automation, and B P M, with the heading Business services. There are three layers C M D B/discovery, security, and governance to the right.

Figure 1-1

AIOps in the IT operations landscape

AIOps covers various layers of information technology. From the network to the endpoints, everything in IT can use AIOps technologies to reap the benefits that AIOps provides.

Enterprise monitoring provides a real-time data feed to the AIOps system to perform ML-driven correlation and analysis using various techniques to detect patterns and anomalies, as well as perform causal-impact analysis. This is one of the most important stages as this analysis needs to consider both real-time streaming data as well as historical data to provide predictive recommendations or proactive remediations, which then get executed using IT process or runbook automation tools. The recommendations for resolution help the organizations in achieving end-to-end automation by resolving problems without human intervention.

The reporting and dashboard layer provides views for different IT teams and stakeholders to collaborate and manage incidents, capacity, change, and problems to further support business by providing KPIs and SLAs that are driven by insights and provide an element of predictive analytics to make operations more proactive.

AIOps systems leverage a Configuration Management Database (CMDB) to improve the quality of correlations and accuracy of predictions and recommendations, but organizations usually struggle in maintaining the accuracy of the CMDB and discovery data due to the ever-changing infrastructure landscape. With cloud computing, it is practically impossible to update the CMDB using traditional tools and processes. An AIOps system solves this problem by automatically populating the required missing data in the CMDB. This well-oiled engine of AIOps has to work within organizations’ security policies defined under their governance framework. Various compliance needs like GDPR, data classifications, etc., should be considered at each layer of the AIOps engine while they are being integrated or set up. Once AIOps systems get up and running, they learn patterns, anomalies, and behavior using data over a period of time. Gradually, based on maturity, the AIOps system gets consumed by other technology or business units such as ChatOps, robotic process automation, business process automation, etc. More complex business process workflow or chat responses can be triggered based on the AIOps system recommendations.

In software engineering, continuous integration and continuous deployment integrate the different activities of development, testing, and deployment of applications and share feedback for improvement. Similarly, AIOps provides seamless integration between various operations components and provides feedback for continuous service improvement.

The basic definition of AIOps is that it involves using artificial intelligence and machine learning to support all primary IT operations. As depicted in Figure 1-2, there are three layers in AIOps when it comes to event correlation.

The image depicts three layers of A I Ops. Data investigation has 4 sub-layers, logs, traces, metrics, and signals, each with 2 sub-layers. Data processing has 3 sub-layers analytics, machine learning, and N L P. Data representation has 4 sub-layers trend charts, box plot, service views, and heat maps. To the left, there are 18 applications.

Figure 1-2

Event correlation design with AIOps

Data Ingestion Layer

There are many heterogeneous entities in the infrastructure, and a comprehensive monitoring landscape consists of multiple tools and solutions to monitor them. The data ingestion layer is where data from different applications, platforms, and infrastructure layers is ingested using various integration mechanisms. Typical data that is ingested is in the form of events, logs, metrics, and traces. Popular mechanisms to ingest data are Representational State Transfer (REST), Simple Network Management Protocol (SNMP), and application programming interface (API) integrations.
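As a minimal sketch of what this layer has to do, the following Python maps payloads from two differently shaped sources into one common event record. The `Event` schema, the payload field names (`host`, `msg`, `epoch`, `agent_address`, and so on), and the severity vocabulary are assumptions made for illustration; they are not the format of any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Common record that downstream layers consume (hypothetical schema)."""
    source: str
    entity: str
    severity: str
    message: str
    timestamp: datetime


# Map each source's severity vocabulary onto one shared scale (assumed values).
SEVERITY_MAP = {"crit": "critical", "critical": "critical", "warn": "warning",
                "warning": "warning", "info": "info", "ok": "info"}


def from_rest_payload(payload: dict) -> Event:
    """Normalize a JSON body posted by a hypothetical REST-based monitor."""
    return Event(
        source="rest-monitor",
        entity=payload["host"].lower(),
        severity=SEVERITY_MAP.get(payload["severity"].lower(), "info"),
        message=payload["msg"],
        timestamp=datetime.fromtimestamp(payload["epoch"], tz=timezone.utc),
    )


def from_snmp_trap(trap: dict) -> Event:
    """Normalize an already-decoded SNMP trap (decoding is out of scope here)."""
    return Event(
        source="snmp",
        entity=trap["agent_address"].lower(),
        severity=SEVERITY_MAP.get(trap["level"].lower(), "info"),
        message=trap["description"],
        timestamp=datetime.fromtimestamp(trap["received_at"], tz=timezone.utc),
    )


events = [
    from_rest_payload({"host": "WEB-01", "severity": "CRIT",
                       "msg": "HTTP 5xx rate high", "epoch": 1_700_000_000}),
    from_snmp_trap({"agent_address": "10.0.0.7", "level": "warn",
                    "description": "linkDown on eth0", "received_at": 1_700_000_030}),
]
for e in events:
    print(e.timestamp.isoformat(), e.source, e.entity, e.severity, e.message)
```

Normalizing at the point of ingestion means the later layers only ever deal with one record shape, whatever tool produced the original data.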

Data Processing Layer

The data processing layer is the heart of an AIOps system; it is here that AI and machine learning techniques are used to process data and generate insights. Once the data is ingested into the AIOps system, the data processing layer uses machine learning and deep learning techniques to find anomalies in the data. It also uses the metric data to predict problems that may cause incidents and disrupt the business services. This layer forms the core of AIOps as far as event management is concerned.
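One simple way to find anomalies in metric data is to compare each new value with the recent history of the same metric. The sketch below uses a rolling z-score, which is only one of many possible techniques and is not necessarily what a given AIOps product uses; the window size, threshold, and sample CPU values are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=10, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization samples (percent) with one obvious spike.
cpu_percent = [41, 43, 42, 40, 44, 42, 41, 43, 42, 41, 42, 95, 43, 42]
print(rolling_zscore_anomalies(cpu_percent))  # index of the spike
```

A static threshold such as "alert above 90 percent" would also catch this spike, but the rolling comparison adapts to each metric's normal range without a hand-written rule per metric.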

Data Representation Layer

The data representation layer acts as the dashboarding layer where the results of the data processing layer are displayed using intuitive dashboards in various formats. The actionable data for resolution is also forwarded to external systems like ITSM so that resolver groups can act on the data and resolve the issues.

The goal here is to use the enormous amounts of data that IT systems are generating and to use AI and machine learning to make sense of that data to arrive at analytics and insights and use them to make the IT systems perform faster, better, and cheaper, and make them more resilient to failures.

AIOps helps humans to address the gap that exists in their abilities to service the needs of IT operations. It does not take away people from the operations roles but augments their capabilities to provide better on-time services leveraging AIOps.

Together humans and AI are able to deliver a level of service that neither can deliver individually. Figure 1-3 defines how AI and human agents work in collaboration to deliver better IT operations services and which functions belong where. The humongous, ever-increasing volume of events makes it impossible for a human to analyze them and then write static rules and policies. Challenges in the analysis process cascade to the various IT services that consume the analysis output, such as capacity planning, problem management, and incident management. AI and humans are intertwined and joined at the hip in delivering IT operations services in the AIOps model. AIOps takes on the majority of the time-consuming and complex tasks of data preprocessing, filtering, and analysis, thereby providing key insights to the experts for making well-informed decisions.

Two layers of Millions of data points. The layer, A.I. Agent, has six services: Filtering, trend analysis, pattern detection, anomaly detection, clustering, and automation. The layer, A I O Dryice A I Ops, has five services: Collaboration, root cause analysis, change management, service improvement, and capacity optimization & recommendation.

Figure 1-3

AIOps-driven collaboration between AI and humans

Through the application of AI/ML-powered data analysis and heuristics, engineers can reactively work on incidents with the aid of AIOps, which points them in the right direction and also provides them with the past data on such resolutions. AIOps is also used proactively to determine how to optimize application performance and infrastructure performance by analyzing the performance and capacity data.
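As a small illustration of proactive capacity analysis, the sketch below fits a straight line to daily utilization samples and estimates how many days remain before a threshold is crossed. A linear trend is a deliberate simplification, and the function name, the 90 percent threshold, and the sample data are assumptions for illustration only.

```python
def days_until_threshold(daily_usage_percent, threshold=90.0):
    """Fit a least-squares line to daily samples and estimate how many days
    after the last sample the trend reaches `threshold`.
    Returns None when usage is flat or falling."""
    n = len(daily_usage_percent)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage_percent) / n
    numerator = sum((x - x_mean) * (y - y_mean)
                    for x, y in enumerate(daily_usage_percent))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing_day = (threshold - intercept) / slope
    # Days are counted from the most recent sample; never report a negative.
    return max(0.0, crossing_day - (n - 1))


# Hypothetical disk utilization growing by two points per day.
disk_percent = [60, 62, 64, 66, 68]
print(days_until_threshold(disk_percent))
```

An estimate like this gives the capacity manager lead time to add capacity or clean up before the threshold is breached, rather than reacting to an alert after the fact.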

Application monitoring and AIOps embedded as part of the development lifecycle would aid the development teams to proactively find availability and performance issues with either the application or the deployment infrastructure and resolve them before the application is released to production.

Adopting AIOps allows enterprises to save money by ensuring optimal utilization of capacity while at the same time avoiding downtime. If something goes wrong, the engineers are able to bring up the systems much faster than using traditional tools.

AIOps is helping to automate mundane tasks that do not require IT operators while providing contextual information for developers to improve mean time to resolution (MTTR) and customer experience. Though proactivity is core to AIOps, it applies equally to reactive situations.

Businesses are using AIOps to solve different use cases. Figure 1-4 shows the most common use cases in AIOps. Organizations start with intelligent alerting where they can do basic root-cause analysis and then move to correlation so that the root cause between various systems can be identified. As organizations move up the maturity curve, features such as anomaly detection are configured so that the operations become more proactive than reactive. Enterprises at the top of the curve have been able to deploy self-healing and automated resolution technologies so that the detect-to-correct cycle is automated completely.

A horizontal line labeled as A I Ops use cases. Four dotted vertical lines across the horizontal line represent four organizations, intelligent alerting, cross entity correlation, anomaly & threat detection, and self-healing & automated resolution with a maturity curve that moves up across the vertical lines.

Figure 1-4

AIOps key use cases
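The top of this maturity curve, an automated detect-to-correct cycle, can be pictured as a lookup from a probable cause to a runbook, with escalation to a human when no runbook exists. The sketch below is a toy illustration: the cause types, the runbook functions, and the returned dictionaries are all hypothetical, and a real deployment would call an IT process or runbook automation tool instead of returning strings.

```python
def restart_service(entity):
    # Placeholder: a real runbook would invoke an automation tool here.
    return f"restarted service on {entity}"


def extend_volume(entity):
    # Placeholder for a storage-expansion runbook.
    return f"extended volume on {entity}"


# Hypothetical mapping from probable-cause type to remediation runbook.
RUNBOOKS = {"service_down": restart_service, "disk_full": extend_volume}


def detect_to_correct(probable_cause):
    """Run the matching runbook, or escalate to a human when none exists."""
    action = RUNBOOKS.get(probable_cause["type"])
    if action is None:
        return {"status": "escalated",
                "detail": "no runbook found; ticket raised for resolver group"}
    return {"status": "remediated", "detail": action(probable_cause["entity"])}


print(detect_to_correct({"type": "disk_full", "entity": "db-01"}))
print(detect_to_correct({"type": "memory_leak", "entity": "app-03"}))
```

Keeping the escalation path explicit reflects the collaboration between AI and humans described earlier: automation handles the known cases, and people handle the rest.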

DevOps and infrastructure operations teams have deployed many monitoring tools to get data for observability, and they are today swamped with too many events. Organizations have deployed various monitoring tools such as Nagios, Zabbix, ELK, Prometheus, the BMC stack, the Micro Focus stack, SolarWinds, Zenoss, Datadog, AppDynamics, Dynatrace, etc. In addition to these tools, enterprises use cloud-native monitoring tools like Azure Monitor and AWS CloudWatch to monitor cloud-native PaaS systems. All these monitoring systems are collecting huge amounts of data from an observability perspective. Many organizations monitor the entire stack from the network to the application. However, even with all these investments and multiple tools, organizations are struggling to get insights and actionable intelligence. The engineers are overloaded with false alerts and too many tickets to handle.

In the DevOps model, without technologies like AIOps, there are scenarios where the DevOps teams will get overwhelmed with alerts and on-call support. Bringing AIOps into the mix ensures that only actionable alerts are converted into incidents and flagged to the right teams for resolution. AIOps deployed on nonproduction systems helps to find the development and configuration issues and results in better collaboration between the development and operations teams. AIOps is instrumental in ensuring that the business services are not affected, and the right teams and resources are aligned for resolution.
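A very simple form of this filtering is to drop low-severity alerts and suppress repeats of the same alert within a time window, so that one underlying problem produces one incident. The sketch below shows the idea only; real AIOps platforms typically combine this with learned correlation. The alert fields (`ts`, `entity`, `check`, `severity`), the five-minute window, and the severity scale are assumptions for illustration.

```python
def actionable_alerts(alerts, window_seconds=300, min_severity="warning"):
    """Suppress low-severity alerts and repeats of the same (entity, check)
    seen within `window_seconds` of the last one that was kept."""
    rank = {"info": 0, "warning": 1, "critical": 2}
    last_kept = {}  # (entity, check) -> timestamp of the last kept alert
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if rank[alert["severity"]] < rank[min_severity]:
            continue  # below the severity floor
        key = (alert["entity"], alert["check"])
        previous = last_kept.get(key)
        if previous is not None and alert["ts"] - previous < window_seconds:
            continue  # duplicate inside the suppression window
        last_kept[key] = alert["ts"]
        kept.append(alert)
    return kept


# Hypothetical alert stream; `ts` is seconds since an arbitrary start.
alerts = [
    {"ts": 0, "entity": "db-01", "check": "disk", "severity": "critical"},
    {"ts": 60, "entity": "db-01", "check": "disk", "severity": "critical"},
    {"ts": 120, "entity": "web-01", "check": "cpu", "severity": "info"},
    {"ts": 400, "entity": "db-01", "check": "disk", "severity": "critical"},
]
incidents = actionable_alerts(alerts)
print([a["ts"] for a in incidents])
```

Of the four alerts in this example, only two would be raised as incidents: the repeat at 60 seconds is suppressed, and the informational alert is dropped.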

Many IT teams are not well equipped to cope with the changing demands of technology. With the cloud becoming all-pervasive, the entire IT landscape is changing fast, with hybrid and cloud-native technologies being used extensively in enterprises.

Where the core infrastructure and application landscape is being transformed, operations engineers don’t have adequate time to assess alerts and get to the root cause of a problem. In these situations, organizations carry a risk of unavailability and downtime.

Traditional IT monitoring and management solutions are unable to keep up with the changes in technology and depth of monitoring that are resulting in huge amounts of monitoring data being generated (see Figure 1-5). The ever-changing technology landscape means that log data and trace data are being generated at ever-increasing volumes, and it is not possible to define all rules in the monitoring systems. AIOps comes to the rescue by ingesting and analyzing all this data to make sense of it and create meaningful and relevant alerts so that the operations teams can focus on their core job of providing high availability and meeting their goals on their SLAs.

The image depicts a representation layer labeled as 'too much of data' with 25 icons, among which some of the icons are cloud, location, wifi, battery, message, lock, wifi router, and web browser.

Figure 1-5

Data explosion impacting traditional IT operations

Figure 1-6 shows the core functionality that AIOps tools can provide.

An image of a chart with the heading A I Ops. It has three tables labeled as Data ingestion, Analytics, and Automation. Data ingestion is further divided into 5 tools, metrics, traces, logs, topology, and configuration. Analytics has predictive analysis with two sub-divisions. Automation has 3 tools, diagnosis, triaging, and remediation.

Figure 1-6

AIOps tools core functionality

Ingestion of data: Data from various monitoring tools including metrics, traces, and logs is ingested, stored, and indexed for further processing. In addition, data from configuration management systems and topology data is also stored in the AIOps engine to provide correlation based on CMDB and topology relationships.

Analytics using machine learning: AIOps uses different types of approaches for analyzing this data to find patterns and anomalies. There are rule-based and machine learning approaches used in AIOps platforms to make sense of the ingested data. Some of the techniques that we are going to discuss in this book are statistical analysis using clustering, correlation, and classification; anomaly detection to detect anomalies in the event data; predictive analytics to find what may happen in the near future based on patterns; and topology-based and CMDB-based correlation. The idea is to convert all this event data into probable causal alerts that are the root cause for an issue so that the operations teams can focus on this and resolve the incident in a timely manner.
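To make topology-based correlation concrete, the following sketch takes a dependency map and a set of alerting entities and keeps only those alerting entities that have no alerting dependency beneath them, treating the rest as symptoms. This is a deliberately simple heuristic rather than the algorithm of any specific platform, and the topology, entity names, and function name are hypothetical.

```python
def probable_root_causes(depends_on, alerting):
    """Among alerting entities, keep those with no alerting dependency
    anywhere beneath them; the others are treated as symptoms."""
    alerting = set(alerting)

    def has_alerting_dependency(entity, seen):
        for dep in depends_on.get(entity, ()):
            if dep in seen:
                continue  # guard against cycles in the topology data
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return sorted(e for e in alerting if not has_alerting_dependency(e, {e}))


# Hypothetical CMDB-style relationships: each key depends on the listed items.
topology = {
    "checkout-app": ["payments-api", "web-lb"],
    "payments-api": ["orders-db"],
    "orders-db": ["san-01"],
}
alerting_now = ["checkout-app", "payments-api", "orders-db"]
print(probable_root_causes(topology, alerting_now))
```

Here three alerts collapse into one probable cause, the database, because the other two entities depend on it directly or indirectly. The quality of this result depends entirely on the accuracy of the topology data, which is why CMDB and discovery accuracy matter so much to AIOps.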

Automated diagnosis and remediation: Most AIOps tools today deliver functionality only up to the analysis stage. Automation is not part of the majority of AIOps toolsets. However, there are a few tools like DRYiCE iAutomate that apply the previous techniques to diagnosis and remediation as well, where the engine takes the probable cause as input, provides the remediation, and runs the remediation automatically. This results in automated healing and provides a complete end-to-end workflow. Let’s discuss the benefits of AIOps in detail.

AIOps Benefits

Enterprises that have deployed AIOps solutions have experienced transformational benefits. Some of them are as follows:
  • Higher availability of systems: This is one of the key reasons and benefits of AIOps that ensures continuous services and uninterrupted business. AIOps proved to be a potential game-changer, ensuring maximum availability in today’s hybrid infrastructure running containerized applications.

  • Reduction in human errors: Due to increasing complexity and the rate of change in the infrastructure and application landscape, the majority of outages happen due to human errors. This is another lever for AIOps system adoption because AIOps automates most of the repetitive and mundane tasks.

  • Better SLA compliance on mean time to repair: This is the target of any IT operations team and a genuine expectation from the business. AIOps system integration with ITSM functions makes it feasible by uncovering useful insights, finding patterns of issues, and enabling collaboration with automation solutions to resolve them quickly. All this means that the mean time to repair is reduced, which helps IT operations teams to not only meet but exceed the current SLAs.

  • Better automated detection of incidents: This is another key benefit of AIOps. An AIOps system eliminates a lot of waste by reducing the noise that gets created due to the creation of false-positive incidents. An AIOps system leads to the thorough analysis of events to qualify for the incident creation with appropriate severity. This saves IT operations teams’ time, which is wasted when chasing false positives.

  • Prediction and prevention of outages: AIOps leads to proactive operations, and outage prevention is an important KPI for measuring operations performance. The AIOps system generates intelligent recommendations that help IT operations to meet this objective.

  • Cost optimization: IT is still considered a cost by many organizations. A mature AIOps system drastically brings down operational costs. By offloading work to algorithms and freeing up the human resources to spend time and energy on value-adding items, organizations are better able to utilize their precious human resources.

  • Better visibility into the environment: AIOps not only enables IT operations to identify areas of improvement but also enables businesses to uncover new opportunities or take strategic decisions. As AIOps systems touch all IT functions, they are best suited to filter out the noise and provide relevant visibility of the IT estate being managed to stakeholders.

  • Reduced risk of operations: Risk management is a crucial domain in IT operations. By taking charge of the automated execution of tasks, reducing human errors, and enhancing analytics with AI-powered tools, an AIOps system greatly reduces operations risk, irrespective of whether it is related to security, disaster recovery (DR), or day-to-day operational tasks of incident management, change management, and problem management.

  • Automation benefits: Automation is a journey, but it often fails or does not deliver expected results when it works in silos. An AIOps system, on the other hand, enables the integration of core IT functions by providing end-to-end automation services.

  • Higher maturity of IT operations: AIOps’ continuous feedback provides visibility into gaps and challenges in processes, tools, and infrastructure. This leads IT operations from a reactive state to a mature proactive state.

  • Better visibility, governance, and control: Organizations often implement various event management and reporting tools for operational governance and control but often fail due to the dynamic nature of the infrastructure and the inability of the operations team to keep the systems updated. An AIOps system, on the other hand, can automatically detect and absorb such changes using algorithms and deliver the required visibility for governance and control.

  • Easier to move to the SRE and DevOps models: The AIOps system brings automation and maturity in IT processes and tools, thereby enabling the operations team to adopt SRE and a DevOps model.

  • More efficient use of infrastructure capacity: An AIOps system provides much more efficient and granular visibility into capacity utilization, enabling the capacity manager to perform demand-forecast and cost-benefit analysis in a much better and faster way.

  • Faster delivery of new services: An AIOps system eliminates waste, upskills the operations team, and brings maturity in processes and tools. This enables IT teams to support new initiatives and services.
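Several of these benefits are measured with simple KPIs. As one example, the SLA benefit above rests on mean time to repair, which can be computed as the average time between an incident being opened and being resolved. The sketch below assumes a hypothetical incident record with `opened` and `resolved` timestamps and an assumed SLA target of 90 minutes; the sample data is invented for illustration.

```python
from datetime import datetime


def mean_time_to_repair_minutes(incidents):
    """Average of (resolved - opened) across resolved incidents, in minutes.
    Incidents that are still open are excluded from the average."""
    durations = [
        (i["resolved"] - i["opened"]).total_seconds() / 60
        for i in incidents
        if i.get("resolved") is not None
    ]
    if not durations:
        raise ValueError("no resolved incidents to average")
    return sum(durations) / len(durations)


# Hypothetical incident records.
incidents = [
    {"opened": datetime(2022, 3, 1, 9, 0), "resolved": datetime(2022, 3, 1, 9, 45)},
    {"opened": datetime(2022, 3, 2, 14, 0), "resolved": datetime(2022, 3, 2, 16, 0)},
    {"opened": datetime(2022, 3, 3, 8, 0), "resolved": None},  # still open
]
mttr = mean_time_to_repair_minutes(incidents)
SLA_TARGET_MINUTES = 90  # assumed target, not taken from any real contract
print(f"MTTR = {mttr:.1f} min, within SLA: {mttr <= SLA_TARGET_MINUTES}")
```

Tracking this figure before and after an AIOps rollout is one straightforward way to check whether the expected improvement has actually materialized.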

Summary

In this chapter, we covered the challenges being faced by operations teams and how AIOps helps organizations overcome these challenges. We discovered AIOps and its various components and the capabilities that each of these components provides. We also listed the benefits organizations can expect when they deploy AIOps. In the next chapter, we will explore the AIOps architecture and methodology.
