This chapter introduces artificial intelligence for IT operations (abbreviated as AIOps). As application and infrastructure landscapes transform rapidly and cloud-native technologies are adopted, organizations are finding it difficult to provide 24/7 operations that can scale and meet the needs of businesses that now want much higher availability and the agility to change based on customer and market feedback. This chapter also details the benefits that AIOps brings and how it supports the digitization journey of enterprises.
Introduction to AIOps
AIOps is a buzzword in the operations world and was coined by Gartner in 2016. As mentioned, it means implementing artificial intelligence for IT operations. AIOps refers to a transformational approach to running operations using AI and machine learning technologies in various operations domains such as monitoring, observability, event correlation, service management, and automation. With the exponential growth seen in application and platform diversity, including the movement to microservices and cloud architectures, there is an enormous amount of data being generated in operations. The operations teams are overwhelmed with this vast amount of data and the diversity in applications, platforms, and infrastructures in the environment. Most enterprises today are rapidly migrating and adopting new technologies such as cloud and microservices architecture, and thus the rate of change in infrastructure and platforms is unlike anything seen before. The challenge in IT operations is to run steady-state operations without disruption and also support this agility and migration and bring new services into operations. These disruptions and changes are putting an enormous strain on the operations teams. Processes and systems that have worked in the past are not working anymore, and the new digitalized world with rapid changes both in applications and infrastructure is resulting in newer challenges. Thus, AIOps has evolved over the last few years as a potential solution to the operational challenges of the new model.
The huge amount of data generated by monitoring and observability systems is one of the sources fed into AIOps-based systems; AI and machine learning techniques are then used to make sense of the data and separate critical events from noise. This automates most of the tasks that were previously manual and relied on human judgment and tribal knowledge. The events that are the root cause of a disruption to business operations are identified efficiently using analytics techniques, and the groups responsible for resolving them are notified immediately. Without AIOps, this process is difficult to run when technology is changing at a rapid pace; relying on older systems and tribal knowledge would mean operations are neither scalable nor predictable.
All of this is enabled by the emergence and maturity of artificial intelligence and machine learning technologies, which are the foundation of AIOps.
Artificial intelligence has transformed the way systems are developed and business processes are run. AI is everywhere from the image processing in your phone to recommendation engines on Amazon that provide you with new product recommendations based on your preferences. Face recognition and image beautification on phones are examples of applications that people are consuming every day without even knowing that artificial intelligence is powering these applications. Natural language processing advancements have transformed the way we interact with applications. Today voice assistants like Alexa, Siri, and Cortana are changing the way we communicate with content.
Information technology has leveraged these technologies to solve varying business problems in the areas of building recommendation systems, predictive systems, image recognition, voice recognition, text extraction, and natural language understanding systems.
However, when it comes to solving IT problems using artificial intelligence, enterprises and technology companies have yet to embrace this fully.
Finally, the AI technologies that IT teams have used to deliver exciting new applications for consumers and businesses are now finding their way into monitoring and managing the IT technologies. Thus, a new class of systems that are using algorithms to run IT operations was born.
AIOps is a term Gartner invented to describe a general trend of applying AI techniques to IT operations data sources to provide additional insights. AIOps is essentially a feature or set of features to collect, combine, and analyze data.
According to Gartner, “By 2023, 40% of DevOps teams will augment application and infrastructure monitoring tools with artificial intelligence for IT operations (AIOps) platform capabilities.” AIOps platforms are platforms that “utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations functions with proactive, personal and dynamic insight.”
AIOps covers various layers of information technology. From the network to the endpoints, everything in IT can use AIOps technologies to reap the benefits that AIOps provides.
Enterprise monitoring provides a real-time data feed to the AIOps system to perform ML-driven correlation and analysis using various techniques to detect patterns and anomalies, as well as perform causal-impact analysis. This is one of the most important stages as this analysis needs to consider both real-time streaming data as well as historical data to provide predictive recommendations or proactive remediations, which then get executed using IT process or runbook automation tools. The recommendations for resolution help the organizations in achieving end-to-end automation by resolving problems without human intervention.
The reporting and dashboard layer provides views for different IT teams and stakeholders to collaborate and manage incidents, capacity, change, and problems to further support business by providing KPIs and SLAs that are driven by insights and provide an element of predictive analytics to make operations more proactive.
AIOps systems leverage a Configuration Management Database (CMDB) to improve the quality of correlations and accuracy of predictions and recommendations, but organizations usually struggle in maintaining the accuracy of the CMDB and discovery data due to the ever-changing infrastructure landscape. With cloud computing, it is practically impossible to update the CMDB using traditional tools and processes. An AIOps system solves this problem by automatically populating the required missing data in the CMDB. This well-oiled engine of AIOps has to work within organizations’ security policies defined under their governance framework. Various compliance needs like GDPR, data classifications, etc., should be considered at each layer of the AIOps engine while they are being integrated or set up. Once AIOps systems get up and running, they learn patterns, anomalies, and behavior using data over a period of time. Gradually, based on maturity, the AIOps system gets consumed by other technology or business units such as ChatOps, robotic process automation, business process automation, etc. More complex business process workflow or chat responses can be triggered based on the AIOps system recommendations.
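The idea of populating missing CMDB data can be illustrated with a minimal sketch. It assumes a hypothetical `ConfigItem` record and two in-memory lists, one holding what the CMDB already knows and one holding what discovery has found; a real AIOps platform would read both from its own data stores and would apply far richer matching than a comparison of identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigItem:
    """A hypothetical, minimal configuration item (CI) record."""
    ci_id: str
    ci_type: str
    environment: str


def find_missing_cis(cmdb, discovered):
    """Return discovered items whose IDs are absent from the CMDB, sorted by ID."""
    known_ids = {ci.ci_id for ci in cmdb}
    missing = [ci for ci in discovered if ci.ci_id not in known_ids]
    return sorted(missing, key=lambda ci: ci.ci_id)


# Illustrative data only; the names are made up for this sketch.
cmdb = [
    ConfigItem("vm-001", "virtual_machine", "prod"),
    ConfigItem("db-001", "database", "prod"),
]
discovered = [
    ConfigItem("vm-001", "virtual_machine", "prod"),
    ConfigItem("vm-002", "virtual_machine", "prod"),
    ConfigItem("lb-001", "load_balancer", "prod"),
]

missing = find_missing_cis(cmdb, discovered)
print([ci.ci_id for ci in missing])  # ['lb-001', 'vm-002']
```

The items returned here are candidates for insertion into the CMDB; whether they are added automatically or queued for review is a governance decision.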
Just as continuous integration and continuous deployment in software engineering integrate the development, testing, and deployment of applications and share feedback for improvement, AIOps provides seamless integration between the various operations components and provides feedback for continuous service improvement.
Data Ingestion Layer
There are many heterogeneous entities in the infrastructure, and a comprehensive monitoring landscape consists of multiple tools and solutions to monitor them. The data ingestion layer is where data from different applications, platforms, and infrastructure layers is ingested using various integration mechanisms. Typical data that is ingested is in the form of events, logs, metrics, and traces. Popular mechanisms to ingest data are Representational State Transfer (REST) application programming interfaces (APIs), Simple Network Management Protocol (SNMP), and other API integrations.
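Because each monitoring tool emits data in its own shape, an ingestion layer typically maps incoming payloads onto a common schema before any analysis happens. The following is a minimal sketch of that step. The `Record` type, the two payload layouts, and all field names are assumptions made for illustration and do not correspond to the format of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Record:
    """A hypothetical common schema for ingested data."""
    source: str
    kind: str  # "event", "log", "metric", or "trace"
    host: str
    timestamp: datetime
    body: dict


def from_metric_payload(payload):
    """Map an assumed metric payload (epoch seconds) to the common schema."""
    return Record(
        source=payload["tool"],
        kind="metric",
        host=payload["hostname"].lower(),
        timestamp=datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
        body={"name": payload["metric"], "value": float(payload["value"])},
    )


def from_event_payload(payload):
    """Map an assumed event payload (ISO 8601 timestamp) to the common schema."""
    return Record(
        source=payload["origin"],
        kind="event",
        host=payload["node"].lower(),
        timestamp=datetime.fromisoformat(payload["time"]),
        body={"severity": payload["severity"], "message": payload["msg"]},
    )


metric = from_metric_payload(
    {"tool": "metrics-agent", "hostname": "WEB-01", "ts": 1704067200,
     "metric": "cpu_percent", "value": "91.5"}
)
event = from_event_payload(
    {"origin": "event-console", "node": "web-01",
     "time": "2024-01-01T00:00:00+00:00",
     "severity": "critical", "msg": "CPU threshold exceeded"}
)

# After normalization, records from different tools can be compared directly.
print(metric.host == event.host, metric.timestamp == event.timestamp)
```

Normalizing host names and timestamps at ingestion time is what later allows events and metrics from different tools to be correlated on the same entity and time window.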
Data Processing Layer
The data processing layer is the heart of an AIOps system; it is here that AI and machine learning techniques are used to process data and generate insights. Once the data is ingested into the AIOps system, the data processing layer uses machine learning and deep learning techniques to find anomalies in the data. It also uses the metric data to predict problems that may cause incidents and disrupt the business services. This layer forms the core of AIOps as far as event management is concerned.
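To make the idea of finding anomalies in metric data concrete, here is a minimal sketch that flags recent values lying far from a historical baseline. It uses a simple z-score test as a stand-in for the machine learning and deep learning techniques a real AIOps platform would apply; the function name, the threshold of three standard deviations, and the sample values are all assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(history, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that deviate from the
    baseline built on `history` by more than `threshold` standard deviations."""
    baseline_mean = mean(history)
    baseline_std = stdev(history)
    if baseline_std == 0:
        # A perfectly flat baseline: any different value is treated as anomalous.
        return [(i, v) for i, v in enumerate(recent) if v != baseline_mean]
    anomalies = []
    for i, value in enumerate(recent):
        z_score = (value - baseline_mean) / baseline_std
        if abs(z_score) > threshold:
            anomalies.append((i, value))
    return anomalies


# Illustrative CPU utilization samples (percent); values are made up.
history = [41.0, 43.5, 40.2, 44.1, 42.7, 39.8, 43.0, 41.6]
recent = [42.9, 44.0, 97.3, 41.1]

print(find_anomalies(history, recent))  # [(2, 97.3)]
```

A static baseline like this ignores seasonality and trend, which is one reason production systems rely on learned models rather than a fixed threshold.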
Data Representation Layer
The data representation layer acts as the dashboarding layer where the results of the data processing layer are displayed using intuitive dashboards in various formats. The actionable data for resolution is also forwarded to external systems like ITSM so that resolver groups can act on the data and resolve the issues.
The goal here is to use the enormous amounts of data that IT systems are generating and to use AI and machine learning to make sense of that data to arrive at analytics and insights and use them to make the IT systems perform faster, better, and cheaper, and make them more resilient to failures.
AIOps helps humans to address the gap that exists in their abilities to service the needs of IT operations. It does not take away people from the operations roles but augments their capabilities to provide better on-time services leveraging AIOps.
Through the application of AI/ML-powered data analysis and heuristics, engineers can reactively work on incidents with the aid of AIOps, which points them in the right direction and also provides them with the past data on such resolutions. AIOps is also used proactively to determine how to optimize application performance and infrastructure performance by analyzing the performance and capacity data.
Application monitoring and AIOps embedded as part of the development lifecycle would aid the development teams to proactively find availability and performance issues with either the application or the deployment infrastructure and resolve them before the application is released to production.
Adopting AIOps allows enterprises to save money by ensuring optimal utilization of capacity while at the same time avoiding downtime. If something goes wrong, the engineers are able to bring up the systems much faster than using traditional tools.
AIOps is helping to automate mundane tasks that do not require IT operators while providing contextual information for developers to improve mean time to resolution (MTTR) and customer experience. Though proactivity is core to AIOps, it applies equally to reactive situations.
DevOps and infrastructure operations teams have deployed many monitoring tools to get data for observability, and they are today swamped with too many events. Organizations have deployed various monitoring tools such as Nagios, Zabbix, ELK, Prometheus, the BMC stack, the Micro Focus stack, SolarWinds, Zenoss, Datadog, AppDynamics, Dynatrace, etc. In addition to these tools, enterprises use cloud-native monitoring tools like Azure Monitor and AWS CloudWatch to monitor cloud-native PaaS systems. All these monitoring systems are collecting huge amounts of data from an observability perspective. Many organizations monitor the entire stack, from the network to the application. However, even with all these investments and multiple tools, organizations are struggling to get insights and actionable intelligence. The engineers are overloaded with false alerts and too many tickets to handle.
In the DevOps model, without technologies like AIOps, there are scenarios where the DevOps teams will get overwhelmed with alerts and on-call support. Bringing AIOps into the mix ensures that only actionable alerts are converted into incidents and flagged to the right teams for resolution. AIOps deployed on nonproduction systems helps to find the development and configuration issues and results in better collaboration between the development and operations teams. AIOps is instrumental in ensuring that the business services are not affected, and the right teams and resources are aligned for resolution.
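The step of converting only actionable alerts into incidents can be sketched as follows. This is a deliberately simple, rule-based illustration: alerts are grouped by host and check, and a group becomes an incident only if its worst severity reaches a chosen level. The severity names, the alert fields, and the grouping rule are assumptions; an AIOps platform would typically learn which alerts are actionable instead of using a fixed rule.

```python
from collections import defaultdict

# Assumed severity scale for this sketch.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def actionable_incidents(alerts, min_severity="critical"):
    """Collapse repeated alerts per (host, check) and keep only groups
    whose worst severity is at least `min_severity`."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["host"], alert["check"])].append(alert)

    incidents = []
    for key in sorted(groups):
        host, check = key
        members = groups[key]
        worst = max(members, key=lambda a: SEVERITY_RANK[a["severity"]])["severity"]
        if SEVERITY_RANK[worst] >= SEVERITY_RANK[min_severity]:
            incidents.append(
                {"host": host, "check": check,
                 "severity": worst, "alert_count": len(members)}
            )
    return incidents


# Illustrative alerts; hosts and checks are made up.
alerts = [
    {"host": "web-01", "check": "cpu", "severity": "warning"},
    {"host": "web-01", "check": "cpu", "severity": "critical"},
    {"host": "web-01", "check": "cpu", "severity": "critical"},
    {"host": "db-01", "check": "disk", "severity": "info"},
    {"host": "db-01", "check": "latency", "severity": "warning"},
]

incidents = actionable_incidents(alerts)
print(len(alerts), "alerts ->", len(incidents), "incident(s)")
```

In this example, five alerts collapse into a single incident for the on-call team, which is the kind of noise reduction described above.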
Many IT teams are not well equipped to cope with the changing demands of technology. With the cloud becoming all-pervasive, the entire IT landscape is changing fast, with hybrid and cloud-native technologies being used extensively in enterprises.
Where the core infrastructure and application landscape is being transformed, operations engineers don’t have adequate time to assess alerts and get to the root cause of a problem. In these situations, organizations carry a risk of unavailability and downtime.
Ingestion of data: Data from various monitoring tools including metrics, traces, and logs is ingested, stored, and indexed for further processing. In addition, data from configuration management systems and topology data is also stored in the AIOps engine to provide correlation based on CMDB and topology relationships.
Analytics using machine learning: AIOps uses different types of approaches for analyzing this data to find patterns and anomalies. There are rule-based and machine learning approaches used in AIOps platforms to make sense of the ingested data. Some of the techniques that we are going to discuss in this book are statistical analysis using clustering, correlation, and classification; anomaly detection to detect anomalies in the event data; predictive analytics to find what may happen in the near future based on patterns; and topology-based and CMDB-based correlation. The idea is to convert all this event data into probable causal alerts that are the root cause for an issue so that the operations teams can focus on this and resolve the incident in a timely manner.
Automated diagnosis and remediation: Most AIOps tools today focus on and deliver functionality only up to the analysis stage. Automation is not part of the majority of AIOps toolsets. However, there are a few tools like DRYiCE iAutomate that apply the previous techniques to diagnosis and remediation as well, where the engine takes the probable cause as input, provides the remediation, and runs the remediation automatically. This results in automated healing and provides a complete end-to-end workflow. Let’s discuss the benefits of AIOps in detail.
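Before turning to the benefits, the flow from topology-based correlation to remediation described above can be sketched in a few lines. The sketch assumes a hypothetical dependency map, treats an alerting component as a probable cause when none of its direct dependencies are alerting, and then looks up a runbook name for it. The component names, runbook names, and the rule itself are illustrative assumptions and do not describe how any specific product, including the one named above, works.

```python
def probable_causes(depends_on, alerting):
    """Return alerting nodes none of whose direct dependencies are alerting."""
    alerting = set(alerting)
    causes = []
    for node in sorted(alerting):
        upstream = depends_on.get(node, [])
        if not any(dep in alerting for dep in upstream):
            causes.append(node)
    return causes


def plan_remediation(causes, node_types, runbooks):
    """Pair each probable cause with a runbook name, or escalate if none fits.
    Nothing is executed here; the result is only a plan."""
    plan = []
    for node in causes:
        runbook = runbooks.get(node_types.get(node), "escalate_to_human")
        plan.append((node, runbook))
    return plan


# Hypothetical topology: each key depends on the components in its list.
depends_on = {
    "checkout-service": ["payments-api", "orders-db"],
    "payments-api": ["orders-db"],
    "orders-db": ["storage-array-1"],
    "storage-array-1": [],
}
node_types = {
    "checkout-service": "application",
    "payments-api": "application",
    "orders-db": "database",
    "storage-array-1": "storage",
}
# Hypothetical runbook catalog keyed by component type.
runbooks = {"database": "restart_database_service"}

alerting = ["checkout-service", "payments-api", "orders-db"]
causes = probable_causes(depends_on, alerting)
print(causes)  # ['orders-db']
print(plan_remediation(causes, node_types, runbooks))
```

Three alerting components reduce to one probable cause because the two applications both depend on the alerting database, while the database's own dependency is healthy.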
AIOps Benefits
Higher availability of systems: This is one of the key benefits of AIOps and a main reason for adopting it, as it helps ensure continuous services and uninterrupted business. AIOps is proving to be a potential game-changer in maximizing availability in today’s hybrid infrastructure running containerized applications.
Reduction in human errors: Due to increasing complexity and the rate of change in the infrastructure and application landscape, the majority of outages happen due to human errors. This is another lever for AIOps adoption because AIOps automates most of the repetitive and mundane tasks.
Better SLA compliance on mean time to repair: This is the target goal of any IT operations and a genuine expectation from the business. AIOps system integration with ITSM functions makes it feasible by uncovering useful insights, finding patterns of issues, and enabling collaboration with automation solutions to resolve them quickly. All this means that the mean time to repair is reduced and helps IT operations teams to not only meet but exceed the current SLAs.
Better automated detection of incidents: This is another key benefit of AIOps. An AIOps system eliminates a lot of waste by reducing the noise that gets created due to the creation of false-positive incidents. An AIOps system leads to the thorough analysis of events to qualify for the incident creation with appropriate severity. This saves IT operations teams’ time, which is wasted when chasing false positives.
Prediction and prevention of outages: AIOps leads to proactive operations, and outage prediction and prevention is an important KPI for measuring operations performance. The AIOps system generates intelligent recommendations that help IT operations to meet this objective.
Cost optimization: IT is still considered a cost by many organizations. A mature AIOps system drastically brings down operational costs. By offloading work to algorithms and freeing up the human resources to spend time and energy on value-adding items, organizations are better able to utilize their precious human resources.
Better visibility into the environment: AIOps not only enables IT operations to identify areas of improvement but also enables businesses to uncover new opportunities or take strategic decisions. As AIOps systems touch all IT functions, they are best suited to filter out the noise and provide relevant visibility of the IT estate being managed to stakeholders.
Reduced risk of operations: Risk management is a crucial domain in IT operations. An AIOps system that takes charge of the automated execution of tasks, reduces human errors, and provides enhanced analytics using AI-powered tools greatly reduces operations risk, irrespective of whether it is related to security, disaster recovery (DR), or the day-to-day operational tasks of incident management, change management, and problem management.
Automation benefits: Automation is a journey, but it often fails or does not deliver expected results when it works in silos. An AIOps system, on the other hand, enables the integration of core IT functions by providing end-to-end automation services.
Higher maturity of IT operations: AIOps’ continuous feedback provides visibility into gaps and challenges in processes, tools, and infrastructure. This leads IT operations from a reactive state to a mature proactive state.
Better visibility, governance, and control: Organizations often implement various event management and reporting tools for operational governance and control but often fail due to the dynamic nature of the infrastructure and the inability of the operations team to keep the systems updated. An AIOps system, on the other hand, can automatically detect and absorb such changes using algorithms and deliver the required visibility for governance and control.
Easier to move to the SRE and DevOps models: The AIOps system brings automation and maturity in IT processes and tools, thereby enabling the operations team to adopt SRE and a DevOps model.
More efficient use of infrastructure capacity: An AIOps system provides much more efficient and granular visibility into capacity utilization, enabling the capacity manager to perform demand-forecast and cost-benefit analysis in a much better and faster way.
Faster delivery of new services: An AIOps system eliminates waste, upskills the operations team, and brings maturity in processes and tools. This enables IT teams to support new initiatives and services.
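As a small illustration of the capacity benefit listed above, the following sketch fits a straight line to daily utilization samples and estimates how many days remain before a threshold is crossed. A least-squares line is only one simple forecasting choice, and the function names, the threshold, and the sample data are assumptions made for this example.

```python
def fit_line(values):
    """Least-squares fit of values against their index; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def days_until(values, threshold):
    """Estimate days from the last sample until the trend reaches `threshold`.
    Returns None when utilization is flat or falling."""
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (len(values) - 1))


# Illustrative daily disk utilization (percent); values are made up.
usage = [60.0, 62.0, 64.0, 66.0, 68.0]
print(days_until(usage, 80.0))  # 6.0
```

A capacity manager could use an estimate like this to schedule expansion ahead of demand, although real workloads rarely grow in a straight line.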
Summary
In this chapter, we covered the challenges being faced by operations teams and how AIOps helps organizations overcome these challenges. We discovered AIOps and its various components and the capabilities that each of these components provides. We also listed the benefits organizations can expect when they deploy AIOps. In the next chapter, we will explore the AIOps architecture and methodology.