© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
N. Sabharwal, G. Bhardwaj, Hands-on AIOps, https://doi.org/10.1007/978-1-4842-8267-0_4

4. AIOps Supporting SRE and DevOps

Navin Sabharwal (1) and Gaurav Bhardwaj (1)
(1) New Delhi, India

Just as AIOps is disrupting the technology and processes of IT operations, other transformational changes are happening in the IT operations space.

DevOps as a movement has gained momentum and now cuts across the development, infrastructure, and operations worlds. With cloud infrastructure and cloud-native applications becoming the norm, DevOps is expanding its coverage to include end-to-end workflows right from development to deployment. AIOps supports the DevOps model and promotes collaboration across the development and operations teams.

Similarly, site reliability engineering (SRE) is a closely related movement with a slightly different take on operations: it manages and operates applications and platforms with a focus on reliability, availability, and automation.

This chapter explains how AIOps, DevOps, and the SRE model are complementary and how one can underpin the DevOps and SRE services by using AIOps technologies.

Let’s begin with an overview of the SRE model and DevOps.

Overview of SRE and DevOps

AIOps is not the only discipline that is changing the way IT is run. DevOps, Agile, and SRE are also transforming IT operations. Agile and DevOps have driven a cultural shift in organizations: agility and speed of delivery, coupled with collaboration between development and operations, have benefited the enterprises that adopted them. Similarly, AIOps is changing the way the development and operations teams collaborate by using data and analytics to provide insights and knowledge that weren’t available earlier. With ChatOps and knowledge management, AIOps provides the technology foundation on which the development and operations teams collaborate. Together, AIOps and DevOps are taking organizations to even higher levels of automation and maturity.

AIOps is providing access to technologies that can bring data, data analytics, and machine intelligence together to make informed decisions and perform automated actions by collecting and analyzing data. Figure 4-1 shows the DevOps process from Plan to Deploy and the various AIOps functions that we discussed in earlier chapters.

An image of AIOps big data and machine learning functions that circulate under the plan, build, test, release, and monitor procedures. The icon to the left is labeled DevOps team, and the icon to the right is labeled customer.

Figure 4-1

AIOps and DevOps, better together

AIOps and DevOps should be adopted across the enterprise to reap their benefits; deploying them in departmental silos won’t move an organization up the maturity ladder. Though DevOps is generally implemented step by step, AIOps needs all the data in the monitoring and management space to correlate it well and provide insights and analytics.

AIOps eliminates noise and thus makes DevOps more effective. With AIOps in place, false alarms are reduced, which cuts the effort DevOps teams waste on analyzing false positives. Instead of multiple alerts from various systems, DevOps teams receive a single alert along with the probable root cause, so they can focus their energy on resolving the problem rather than deciphering the many alerts generated by one issue in an application or infrastructure.
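To make the idea of receiving one alert instead of many concrete, here is a minimal Python sketch that groups raw alerts for the same service when they arrive within a short time window. The Alert fields, the 300-second window, and the grouping rule are assumptions made for illustration; AIOps products typically correlate on much richer data, such as topology and learned patterns.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    source: str       # monitoring tool that raised the alert (illustrative field)
    service: str      # affected service or component (illustrative field)
    timestamp: float  # seconds since the epoch
    message: str


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts for the same service raised within window_seconds
    of the first alert in that service's most recent group."""
    groups: List[List[Alert]] = []
    latest_group: Dict[str, List[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_group.get(alert.service)
        if current is not None and alert.timestamp - current[0].timestamp <= window_seconds:
            # Same service, close in time: treat as part of the same issue.
            current.append(alert)
        else:
            # New service or too far apart: start a new consolidated group.
            current = [alert]
            latest_group[alert.service] = current
            groups.append(current)
    return groups
```

Each returned group can then be presented to the team as a single consolidated alert, with the individual alerts attached as supporting detail.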

Thus, AIOps supports DevOps and helps DevOps teams realize their vision of greater collaboration between Dev and Ops. By providing a common platform, AIOps breaks down the boundaries between Dev and Ops, eliminates waste for both teams, and lets them focus on raising the DevOps processes to a higher level of maturity, availability, and agility. AIOps is the foundation that supports SRE and DevOps, and both models can gain from the machine learning capabilities it provides, as depicted in Figure 4-2.

An image of a shaded circle with a brain labeled as A I Ops. To the left, there is a shaded box labeled DevOps that points towards the circle. To the right, there is a shaded box labeled S R E that points towards the circle.

Figure 4-2

AIOps collaborating between SRE and DevOps

In today’s world, where so many essential business tasks have become digitized, IT teams must deal with constant change while ensuring zero downtime.

In the modern digital IT operations world, DevOps teams work together, each on its own microservice. They are supported by SRE teams, or by individual site reliability engineers embedded in the DevOps team, whose primary role is to maintain the high availability of applications. The site reliability engineers analyze the operations feeds and provide insights that help the DevOps team improve the architecture and code of the applications.

The challenge for SREs is to improve stability, reliability, and availability across disparate systems while application teams are delivering new features at a rapid pace. To achieve their targets, site reliability engineers have to stay one step ahead of outages and resolve incidents quickly. Without AIOps tools, however, teams get overwhelmed by noise, and it becomes difficult to isolate the root cause and provide immediate analysis and recommendations.

Analyzing alert data manually is becoming increasingly impractical. Taking huge dumps of alert data and analyzing them in Excel and other BI tools no longer works, because the monitoring and management data is enormous and arrives in various formats. The SRE teams need to be able to remove the noise and focus on the alerts that point to the root cause of incidents.

With multiple and geographically dispersed teams, collaboration also becomes a challenge. How do SRE and DevOps teams that own different microservices, which are mashed up together to create an application, resolve incidents? Where do they get the data, the visualization, and the collaboration tools to run operations? AIOps comes to the rescue by giving the DevOps and SRE teams the tools and technologies to run operations efficiently: visualization, dashboards, topology, and configuration data, along with the alerts that are relevant to the issue at hand. Thus, AIOps provides a unique solution to address operational challenges.

As a result, SRE teams are adopting AIOps tools to help address these challenges, including the adoption of AIOps for incident analysis as well as remediation.

Here are a few questions that an enterprise should ask about its SRE operations to arrive at the need for AIOps:
  • Do you have effective collaboration tools for DevOps and SRE teams?

  • Does your organization use automation and tools to improve resiliency?

  • Are your SRE teams able to manage the SLAs and error budgets?

  • Are your SRE teams getting the right alerts, or are they overloaded with false positives?

  • Are your SREs able to quickly find root causes using automated mechanisms?

  • Are you using ChatOps and generating knowledge while running operations?

  • Are your SREs using automation for incident resolution and configuration changes?

Depending on the complexity of the environment, the maturity of processes, and the investments made in tools and solutions, the SRE principles that organizations have defined and implemented may vary. In the next section, we discuss best practices and SRE principles in relation to AIOps that can be widely adopted by organizations, and how AIOps supports the key principles of the SRE model.

SRE Principles and AIOps

Site reliability engineering has gained traction as a domain and skill in recent years. With application and infrastructure complexity increasing because of many architectural choices, availability and resilience are essential in both architecture and operations. The SRE model is built on the following principles; you will see how AIOps enables most of them.

Principle 1: Embracing Risk

Embracing risk means weighing the cost of improving reliability against the impact it has on customer satisfaction. No service can be reliable 100 percent of the time. There is also a cost trade-off with reliability; after a certain point, adding more reliability means doubling the cost or even increasing it multifold. As an example, supporting 99.999 percent availability can cost several times more than supporting 99.95 percent. Thus, there has to be a balance between the goal for reliability and the cost associated with it.
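The gap between these two targets is easier to appreciate as permitted downtime. The following Python sketch converts an availability percentage into the downtime it allows; the function name and the default 365-day period are assumptions chosen for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Downtime, in minutes, that an availability target permits over the period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_days * MINUTES_PER_DAY
```

Over a 365-day year, 99.95 percent availability permits roughly 263 minutes of downtime, while 99.999 percent permits only a little over 5 minutes. That is a fiftyfold reduction in tolerated downtime, which is why the cost of achieving it rises so steeply.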

SRE service level agreements around availability and response time, coupled with error budgets, support this principle: the SREs are free to manage availability within the SLA ranges. The SREs have a right to reject changes to applications or infrastructure if they are running low on error budget. Also, the SRE focus is on getting services up and running as quickly as possible, which may involve taking some level of risk with quick decision-making.

AIOps helps SREs with this principle by providing them with all the data to measure SLIs and SLOs and then aggregate them under SLAs. The AIOps tools enable quick resolution of incidents using automated mechanisms that the SREs can fire to resolve availability problems. AIOps models and analytics are probabilistic rather than deterministic; they carry an element of uncertainty and will never be 100 percent accurate, which aligns with the risk-embracing SRE model.

Principle 2: Service Level Objectives

The observability data forms the basis for service level indicators (SLIs), which provide data on things like availability and response time. These metrics are then aggregated under service level objectives (SLOs). A service level objective is set at the point below which customers would become dissatisfied with a service. The service level indicators and objectives will be different for different types of business requirements and users. As an example, we expect close to 100 percent availability from mission-critical applications or internet-scale applications like search and email. However, we may not have the same expectations from less critical systems like a timesheet application. Similarly, our requirements for response time from a mail or search engine application differ from those for an ERP application. Service level objectives take the business and customer context and apply it on top of the service level indicators to arrive at the objective or target that the SRE team is willing to live with. An example could be 99.95 percent availability.

Service level objectives apply to a timeframe. For example, an objective could be to meet the 99.95 percent availability goal on a monthly basis.

SLOs also leave room for an error budget. Whenever a failure or degradation affects the service, the error budget decreases. Thus, in the previous example, the available error budget is 0.05 percent.
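As a worked illustration of the error budget, the following Python sketch turns the monthly SLO into minutes of budget, tracks how much is left after recorded downtime, and applies a simple gate for approving changes. The 30-day month and the 20 percent remaining-budget threshold are assumptions for the example, not values prescribed by the SRE model.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, period_days: float = 30.0) -> float:
    """Total downtime, in minutes, that the SLO tolerates over the period."""
    return (100.0 - slo_percent) / 100.0 * period_days * MINUTES_PER_DAY


def remaining_budget_fraction(slo_percent: float, downtime_minutes: float,
                              period_days: float = 30.0) -> float:
    """Fraction of the error budget still unspent (0.0 when exhausted)."""
    budget = error_budget_minutes(slo_percent, period_days)
    if budget <= 0:
        return 0.0
    return max(0.0, (budget - downtime_minutes) / budget)


def change_allowed(slo_percent: float, downtime_minutes: float,
                   min_remaining_fraction: float = 0.2,
                   period_days: float = 30.0) -> bool:
    """Hypothetical gate: approve a change only while enough budget remains."""
    remaining = remaining_budget_fraction(slo_percent, downtime_minutes, period_days)
    return remaining >= min_remaining_fraction
```

For a 99.95 percent SLO over 30 days, the budget works out to about 21.6 minutes. With 10 minutes of downtime already recorded, more than half of the budget remains and the gate approves the change; with 20 minutes recorded, it does not.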

It is difficult to achieve granular SLIs and SLOs and measure them accurately without AIOps. With basic monitoring and fragmented monitoring tools working in silos, it is extremely difficult to calculate and arrive at application availability or business process availability. Thus, AIOps is essential to provide the correct data on the availability and response time of an application or business process to enable the SREs to have granular and correct data to arrive at the right SLIs and SLOs.

Principle 3: Eliminating Toil

Eliminating toil means reducing the amount of repetitive work a team must do.

This is an extremely important principle for SREs; it differentiates the SRE model from other operations models that do not have metrics to measure and eliminate toil. SREs carry targets for toil elimination, and for this there are two important elements. One is the ability to measure toil in the system, and the second is to eliminate toil rapidly so that the teams can focus on more value-adding work.

SREs eliminate toil by automating routine and standard work that does not need human intelligence for every transaction. Things like health checks, routine checks, reporting, automated monitoring, and autoremediation for known errors are some of the tasks that SREs work to automate.

Another way SREs eliminate toil is by creating guides or standard operating procedures so that the knowledge base is enriched and can be quickly searched and used whenever required.

AIOps comes to the rescue here in more ways than one. Though not many AIOps products support algorithmic or machine learning-based automation, a few upcoming products like iAutomate use machine learning technologies to manage automation work. AIOps products provide capabilities with which the SRE teams can use algorithms to estimate the amount of toil, and use analytics to identify, automate, and report on toil and to track automation and its benefits. The tools also come with built-in automations that the SREs can use to eliminate toil rapidly without having to create automations from scratch.
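A very simple way to picture toil measurement is to total the time spent on repetitive tickets and rank the categories that consume the most effort. The Python sketch below does this under stated assumptions: the Ticket fields and the repetitive flag are hypothetical, and a real AIOps product would infer repetitiveness from ticket data rather than rely on a manual tag.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Ticket:
    category: str         # type of work, e.g. "disk cleanup" (illustrative)
    minutes_spent: float  # effort logged against the ticket
    repetitive: bool      # assumed to be tagged manually or inferred elsewhere


def toil_summary(tickets: List[Ticket]) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the fraction of total effort spent on toil and the
    repetitive categories ranked by minutes spent (largest first)."""
    total = sum(t.minutes_spent for t in tickets)
    by_category: Dict[str, float] = defaultdict(float)
    for ticket in tickets:
        if ticket.repetitive:
            by_category[ticket.category] += ticket.minutes_spent
    toil = sum(by_category.values())
    ranked = sorted(by_category.items(), key=lambda item: item[1], reverse=True)
    fraction = toil / total if total else 0.0
    return fraction, ranked
```

The categories at the top of the ranked list are the natural first candidates for automation, and recomputing the fraction over time gives a rough measure of how much toil has been eliminated.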

Principle 4: Monitoring

Monitoring or observability is the key to getting data from systems and applications. Without monitoring and observability, the availability, performance, and response time of applications and business processes cannot be measured. Monitoring looks at events, metrics, logs, and traces to provide meaningful data that can be analyzed and used for reactive and proactive actions to provide higher availability.

Monitoring and observability tools provide the raw data for the AIOps tools that can then use this input and apply algorithms and machine learning techniques to provide deeper actionable insights on this data.

The most common metrics focused on for reliability are these four golden signals:
  • Latency: The time it takes for a service to respond to a request

  • Traffic: The amount of load a service is experiencing

  • Error rate: How often requests to the service fail

  • Saturation: How full the service’s resources are and how much longer they will last
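The four golden signals can be computed from fairly basic data. The following Python sketch derives them from a list of request samples collected over a time window; the RequestSample fields, the use of the 95th percentile for latency, and passing saturation in as a single utilization figure are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestSample:
    duration_ms: float  # time taken to serve the request
    succeeded: bool     # False if the request failed


def golden_signals(samples: List[RequestSample], window_seconds: float,
                   resource_utilization: float) -> Dict[str, float]:
    """Summarize latency, traffic, error rate, and saturation for one window.

    resource_utilization is a fraction between 0 and 1 for the most
    constrained resource, assumed to come from infrastructure monitoring.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    count = len(samples)
    if count == 0:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": resource_utilization}
    durations = sorted(s.duration_ms for s in samples)
    p95_index = math.ceil(0.95 * count) - 1  # nearest-rank percentile
    failures = sum(1 for s in samples if not s.succeeded)
    return {
        "latency_p95_ms": durations[p95_index],
        "traffic_rps": count / window_seconds,
        "error_rate": failures / count,
        "saturation": resource_utilization,
    }
```

Computing the signals per window produces a time series for each one, which is the kind of input that forecasting and anomaly detection work on.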

All these golden signals are analyzed by the AIOps tools to forecast whether something will break in the near future. The tools also remove noise and select actionable alerts and supporting data so that the teams can focus on the right alerts without wasting time finding the root cause. Advanced statistical techniques are used to create causality maps that show which component failure resulted in the failure of the system. Pattern matching techniques are used to find relevant information in log files, which is then provided to the site reliability engineers for deeper analysis.
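As a simple illustration of forecasting on a golden signal, the Python sketch below fits a straight line to recent saturation samples and estimates how long it will take to reach a threshold. A least-squares linear trend is a deliberately basic stand-in; the function name, the evenly spaced samples, and the threshold of 1.0 are assumptions, and AIOps products generally use more sophisticated models.

```python
from typing import Optional, Sequence


def hours_until_threshold(samples: Sequence[float], threshold: float = 1.0,
                          interval_hours: float = 1.0) -> Optional[float]:
    """Estimate hours until utilization reaches the threshold.

    samples are evenly spaced utilization readings (oldest first).
    Returns None when there is too little data or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Ordinary least-squares slope and intercept over sample indexes 0..n-1.
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples)) / denominator
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_now = intercept + slope * (n - 1)
    if fitted_now >= threshold:
        return 0.0
    return (threshold - fitted_now) / slope * interval_hours
```

With hourly disk utilization readings of 0.5, 0.6, 0.7, and 0.8, the fitted trend grows by 0.1 per hour, so the sketch estimates about two hours until the disk is full.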

AIOps tools greatly enhance the SRE function by providing them with the right information at the right time without having to dig through multiple records to arrive at a conclusion on the root cause. The dashboards created in AIOps provide topology, alerts, and knowledge articles so that the SREs can collaborate with development teams.

Using AIOps, the SRE and DevOps teams are able to visualize the entire architecture of an application using the topology and discovery data. This data is then overlaid with the events generated from individual layers and components. This gives a bird’s-eye view of the entire application landscape and makes it easier for SREs to use this data and visualization in both operations and architectural transformation decisions.

The AIOps tools also help the SRE teams understand the application from an infrastructure-up perspective by taking the data from infrastructure monitoring tools and overlaying it with real user-monitoring data from application monitoring tools. This helps the SRE teams narrow down the root cause, which can be at the end-user level or in the network, infrastructure, or software code. The visibility provided by the AIOps tools helps SREs arrive at the root cause much faster.

Root-cause analysis done using machine learning technologies aids in faster response and resolution and helps the SRE teams maintain a high level of availability and manage their SLAs. This results in increased productivity of the operations teams and higher customer satisfaction scores.

Principle 5: Automation

Automation means creating ways to complete repetitive tasks without human intervention. This helps free up teams for higher-value work.

This principle is an extension of the eliminating-toil principle, under which everything that creates repetitive work for the team is eliminated. Automation is the means of achieving that elimination.

The SREs work on many automation levers.
  • Incident response: The SREs respond to incidents that have impacted or are likely to impact SLOs. This is an important function for the SREs. The SREs automate incident response by using scripts and creating automation for simple scenarios. They can also leverage tools to deploy out-of-the-box automations for incident response.

  • Deployment: The SRE teams automate the deployment of monitoring and other applications as well, depending on how the function is structured in an organization.

  • Testing: This is an important element of SRE work, where automated testing for infrastructure and applications is used by SREs to find resilience issues. SREs use various methods and tools.

  • Communication: This is an essential ingredient in any operation. SREs use ChatOps tools to collaborate and communicate about issues and development work along with incidents and other actions, and tools like Slack are used by SREs for real-time communication.

AIOps tools help in the automation of monitoring, root-cause or probable-cause analysis, and remediation of incidents. The AIOps tools also provide embedded or integrated ChatOps to facilitate communication between SRE members and development teams on a real-time basis where both can have access to a common shared dashboard and alert data to better analyze the situation and take appropriate remedial action.
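The remediation of known errors can be pictured as a lookup from alert patterns to runbooks, with everything unrecognized escalated to a person. The following Python sketch shows that shape only; the patterns, the runbook names, and the returned action strings are all hypothetical and do not correspond to any particular product.

```python
import re
from typing import Optional

# Hypothetical mapping from alert-message patterns to runbook names.
KNOWN_REMEDIATIONS = [
    (re.compile(r"disk usage above \d+%", re.IGNORECASE), "clean_temp_files"),
    (re.compile(r"service .* not responding", re.IGNORECASE), "restart_service"),
]


def choose_remediation(alert_message: str) -> Optional[str]:
    """Return the runbook for a known error, or None if nothing matches."""
    for pattern, runbook in KNOWN_REMEDIATIONS:
        if pattern.search(alert_message):
            return runbook
    return None


def handle_alert(alert_message: str) -> str:
    """Decide on an action: run a known runbook or escalate to an SRE."""
    runbook = choose_remediation(alert_message)
    if runbook is None:
        return "escalate_to_sre"
    return f"run:{runbook}"
```

Keeping the mapping as data rather than code means that a new known error can be added by extending the list, and the escalation path guarantees that unfamiliar alerts still reach a human.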

Principle 6: Release Engineering

Release engineering means building and deploying software in a consistent, stable, repeatable way. The previous SRE principles are applied to releasing software.

Some of the activities done as part of release engineering are as follows:
  • Configuration management: Define the baseline configuration and ensure that the releases change the configuration based on defined processes and that the configuration changes are tracked. It also involves defining “desired state configuration” for systems so that there is no deviation from the approved configuration.

  • Testing: Implement continuous testing and automated testing to ensure the release meets the requirements and definition of done.

  • CI/CD and rapid deployment: Where possible, automate the release process by leveraging the DevOps principles of continuous integration, continuous delivery, and continuous deployment. This enables consistent, repeatable releases in an automated fashion and increases the agility by providing teams with a system and process to release very frequently in response to business needs or for bug fixing.

Since AIOps tools today focus mainly on operations, their role in the development process is to ensure that a release has the requisite operational aspects covered.

AIOps tools help in identifying configuration drift and unapproved changes to infrastructure or applications by providing analytics over the log data that is ingested into them. The tools facilitate continuous testing and deployment by providing insights into the availability and performance of the application and infrastructure in the development lifecycle. The AIOps tools support rapid deployment by providing before-and-after deployment comparisons when SREs and DevOps teams are using continuous deployment and deploying to production frequently. Without AIOps, these processes are manual and prone to errors; AIOps does the heavy lifting of providing the right data so that the SREs can focus on the core aspects of availability, resilience, and performance.
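At its core, configuration drift is the difference between the approved configuration and what is actually running. The Python sketch below compares a desired-state dictionary with an observed one and reports every deviation. It is a plain dictionary comparison rather than the log analytics described above, and the setting names and the placeholder for absent values are assumptions for illustration.

```python
from typing import Any, Dict

ABSENT = "<absent>"  # placeholder shown when a setting is missing on one side


def detect_drift(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return every setting whose actual value differs from the desired
    state, including settings that are missing or unexpected."""
    drift: Dict[str, Dict[str, Any]] = {}
    for key in sorted(set(desired) | set(actual)):
        want = desired.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = {"desired": want, "actual": have}
    return drift
```

The same comparison can be run on snapshots taken before and after a deployment to highlight exactly which settings a release changed.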

Principle 7: Simplicity

Simplicity means developing the least complex system that still performs as intended. The goals of simplicity and reliability go hand in hand. A simpler system is easier to monitor, repair, and improve.

AIOps Enabling Visibility in SRE and DevOps

SRE advocates a holistic, end-to-end approach to reliability; building models is a good way to gain insights into a system’s internal components.

AIOps tools bring simplicity to the architecture by ensuring that all observability data is integrated and presented in a dashboard. AIOps ties together all the monitoring tools and integrates the automation aspects into the mix. Using AIOps simplifies operations and integrates the Dev and Ops teams through collaboration tooling.

Culture

DevOps is the culture and mindset forging strong collaborative bonds between the software development and infrastructure operations teams. This culture is built upon the following pillars:
  • Constant collaboration and communication: AIOps provides the tools for DevOps and SRE teams to collaborate and communicate and provides ChatOps for real-time collaboration.

  • Gradual changes: AIOps gets well entrenched in the overall culture of DevOps and SRE, and the models evolve over time with the changes in applications, infrastructure, and data.

  • Shared end-to-end responsibility: AIOps tools enable shared end-to-end responsibility through better coordination across the development and operations lifecycle as well as promoting collaboration and communication.

  • Early problem-solving: AIOps provides for rapid resolution of incidents using automated root-cause analysis and intelligent runbook automation.

Automation of Processes

Automation of processes is the key goal of DevOps and SRE teams. Automating routine work and eliminating toil are achieved using AIOps tools, because the automated analysis of root cause, inputs to proactive problem management, and automated resolution of alerts free up the team’s time to take on higher-level activities.

Measurement of Key Performance Indicators (KPIs)

Measuring various metrics of a system allows for understanding what works well and what can be improved. AIOps tools provide in-depth reporting and analytics and help in creating service level objectives, service level indicators, and SLAs.

Sharing

AIOps tools enable sharing across the value stream by bringing the data from observability tools and providing insights to both development and operations teams. AIOps helps in knowledge sharing by providing knowledge management capabilities where teams can collaborate, create, and share content using the AIOps engine. AIOps enables the SRE teams to deliver faster and better and is a foundational technology on which SRE processes and teams are built.

Summary

In this chapter, we covered the SRE model and how AIOps enables the SRE teams to deliver on their promise. We also covered how AIOps enables true collaboration between the development and operations teams. We looked at various features provided by AIOps tools and how they support and enable SRE and DevOps principles. In the next chapter, we will cover the fundamentals of AI, machine learning, and deep learning.
