© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
N. Sabharwal, G. Bhardwaj, Hands-on AIOps, https://doi.org/10.1007/978-1-4842-8267-0_9

9. Setting Up AIOps

Navin Sabharwal and Gaurav Bhardwaj
New Delhi, India

This chapter provides best practices for the AIOps journey and gives guidance on setting up AIOps in your organization.

Having learned about AIOps, its techniques, and its benefits, now it is time to look at how to implement AIOps in an organization. Figure 9-1 describes the steps for setting up an AIOps practice in an organization.

Figure 9-1

AIOps establishment process. The figure shows ten steps: define the AIOps charter, build the AIOps team, define the landscape, define data sources and integration methods, install the AIOps engine, configure AIOps features, deploy service management features, deploy automation features, measure success, and celebrate and share success.

Let’s begin the AIOps journey by defining the AIOps charter.

Step 1: Write an AIOps Charter

AIOps is not just a technology change; it is a cultural change that involves changes to processes and people skills, so it needs commitment from executive management for it to be implemented successfully.

AIOps impacts various functions and teams including process teams, process owners, the command center, and IT operations. There may also be other initiatives in the organization such as site reliability engineering and DevOps that need alignment with the AIOps initiative.

A formal project charter that is approved by executive leadership is essential so that the AIOps project gets the required funding and attention. This will help get the buy-in from various teams that are going to be impacted by the project. Once the AIOps charter gets approved, the next task is to build the AIOps team.

Step 2: Build Your AIOps Team

The next step is to get a team together for this project. This needs a dedicated project manager and subject-matter experts with AIOps experience along with members from other teams such as monitoring, observability, process, ITSM tools, the command center, IT operations, DevOps, and SRE.

The core team will implement the AIOps system; however, since this function cuts across and integrates with various other functions, it is important to have a higher level of governance and reporting mechanism.

It is a cultural change, and in large enterprises with siloed hierarchies and vertically split functions, it is necessary to get representation from organizational change management and HR teams so that this is driven across various functions effectively and the people aspect of the change is handled through a process-oriented approach. At this stage, you need to start exploring and evaluating the scope and goals for an AIOps implementation, which we will be covering next.

Step 3: Define Your AIOps Landscape

Before embarking on any technology change, it is essential to define what the goals are and what you are trying to achieve. The team should go through the various areas in AIOps and identify what would work best for them.

The next step should be to either define a subset where AIOps will be implemented or define a phased approach to implementation where various functions and features will be rolled out in stages.

Thus, you can go ahead and deploy the complete suite of AIOps features for a specific business line or go step-by-step and implement module by module for the entire enterprise as a whole. The decision should be based on the organizational structure, team structure and size, scale, and complexity. For a highly siloed and large organization, it would be better to implement it in a business vertical, whereas a medium-sized organization can proceed to implement it for the entire organization.

The first step in the planning stage is to gather the data around the current implementation of various tools and technologies and the ITSM processes. This should include the following:
  • Current monitoring tools landscape

  • Current infrastructure and application landscape

  • Data sources for topology/CMDB

  • Service management tools

  • Processes

  • Command center function and procedures

  • Resolution groups and procedures

  • Current issues and challenges in monitoring and management

  • Rule-based policies in existence for event correlation

  • Automation coverage and tools

After gathering all of this data, it is necessary to have a tollgate at this stage. The analysis of this data will result in better understanding of the maturity of the organization in terms of monitoring and service management. It is possible that the analysis of this data may lead to another subproject where some of the basic elements of monitoring or service management need to be enhanced or changed so that the AIOps project gets the integrations and data for it to function.

If the analysis results in another subproject to enhance the monitoring and management tools or processes, it should be handled as a separate project under the same umbrella program since the owners for this project may be different. The AIOps project can continue on its journey while this monitoring enhancement project runs in parallel.

As part of the data gathering, you should gather data on the current KPIs so that before and after implementation of AIOps the KPIs can be compared. The following KPIs are important indicators for measuring the success of AIOps deployment and should be measured before implementation and on an ongoing basis after deploying AIOps.
  • Alert to incident ratio

  • Mean time to respond

  • Mean time to resolve

  • SLA metrics for P1 and P2

  • Time for closure of SRs

  • Percentage of SRs automatically resolved

  • Percentage of incidents automatically resolved

  • Availability of critical systems
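
Capturing the baseline can be as simple as a small script over exported alert and incident records. The following is a minimal sketch covering four of the KPIs above; the `Incident` fields and the `kpi_baseline` function are hypothetical names chosen for illustration, not the schema of any particular service management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical fields; map them to whatever your ITSM export provides.
    created: datetime
    responded: datetime
    resolved: datetime
    auto_resolved: bool = False


def kpi_baseline(alert_count: int, incidents: list[Incident]) -> dict[str, float]:
    """Compute baseline KPIs for one reporting period (durations in minutes)."""
    if not incidents:
        raise ValueError("at least one incident is needed for a baseline")
    n = len(incidents)
    minute = timedelta(minutes=1)
    respond = sum((i.responded - i.created) / minute for i in incidents) / n
    resolve = sum((i.resolved - i.created) / minute for i in incidents) / n
    auto = sum(1 for i in incidents if i.auto_resolved)
    return {
        "alert_to_incident_ratio": alert_count / n,
        "mean_time_to_respond_min": respond,
        "mean_time_to_resolve_min": resolve,
        "pct_incidents_auto_resolved": 100.0 * auto / n,
    }
```

Running the same function on the same kind of export after deployment gives figures that are directly comparable with the baseline.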

The next step in the AIOps journey is to define data sources and their integration.

Step 4: Define Integrations and Data Sources

After having collected the relevant data for the environment, you will have clarity on what integrations are required. The integrations will be broadly with the following systems:
  • Monitoring tools
    • SNMP (NetBackup, SAP Solman, etc.)

    • Syslog (config change alerts, UNIX kernel alerts, etc.)

    • APIs (Zabbix, vCenter, etc.)

    • Other database-based connectors

  • Service management tools using APIs

  • CMDB using APIs

  • Knowledge management sources using APIs

  • Automation tools using APIs

In general, you will find multiple monitoring tools in the environment. Some of them will support APIs, and the data can be ingested into AIOps using API-based connectors. Tools like vCenter provide APIs to check the status of the underlying virtualized infrastructure; others may send SNMP-based alerts. For example, SAP Solman triggers SNMP alerts for SAP resources. There are multiple options for getting the data into the AIOps engine; these should be evaluated and then implemented based on the best approach.

Typically, you will see network monitoring, server monitoring, and application monitoring tools in the environment along with specialized tools for monitoring backups, jobs, storage, and other OEM devices. All these need to be integrated with the AIOps engine so that all the monitoring data resides in the single engine for analytics and for its algorithms to be trained.
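
Bringing all monitoring data into a single engine usually means mapping each tool's payload onto one common event schema. The sketch below illustrates the idea for two source types; the raw key names, the severity vocabulary, and the `Event` and `normalize` names are illustrative assumptions and do not reflect the actual payload format of any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str    # which integration produced the alert
    host: str      # affected node, lowercased for consistent matching
    severity: str  # normalized to: critical, warning, info
    message: str


# Hypothetical per-tool severity words mapped onto one common scale.
SEVERITY_MAP = {
    "disaster": "critical", "high": "critical", "crit": "critical", "err": "critical",
    "average": "warning", "warning": "warning", "warn": "warning",
}


def normalize(source: str, raw: dict) -> Event:
    """Map a raw payload onto the common Event schema.

    The raw key names are assumptions, one set per source type.
    """
    if source == "api":
        host, sev, msg = raw["hostname"], raw["priority"], raw["description"]
    elif source == "syslog":
        host, sev, msg = raw["host"], raw["level"], raw["msg"]
    else:
        raise ValueError(f"no adapter for source {source!r}")
    # Unknown severity words fall back to the lowest level.
    return Event(source, host.lower(), SEVERITY_MAP.get(sev.lower(), "info"), msg)
```

Keeping one small adapter per source makes it easier to add a new monitoring tool later without touching the downstream analytics.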

From a service management perspective, there will be typically one tool, and the CMDB may be within the same toolset along with the KEDB and basic knowledge management. Most of the leading tools provide APIs, so data can be easily ingested using API-based connectors for these systems.

Automation tools may get integrated based on use cases that you are planning for AIOps and whether you plan a direct integration with AIOps or through service management tools.

Even before the AIOps engine gets trained, once integrations are complete, the organization will start seeing the benefits of a single-pane-of-glass view into operations with all alerts in a single console.

This stage concludes the important step in the AIOps journey where prerequisites get completed and you actually start the installation, deployment, and configuration steps, which will be covered in subsequent sections.

Step 5: Install and Configure the AIOps Engine

Once all the data sources and data have been identified, the next step is to start installing the AIOps engine. This depends on the tools and technology that you are choosing. Since these are complex systems, organizations generally take implementation services or professional services from partners who are well versed with the product and have the required expertise to implement these solutions. The core AIOps team can shadow the implementation and learn and gain expertise while the AIOps engine is getting implemented. Alternatively, if the required skills and expertise are available in house, the team can start building the solution themselves.

For large, complex deployments, it’s highly recommended to proceed with a phased approach to ensure minimal to zero disruption to the services or business. Trying to accomplish everything in minimal time needs good experience and the support of multiple stakeholders. In the first phase of implementation, install the base AIOps solution and select and integrate APM (transaction monitoring), platform, and network data sources based on the output from step 3 and step 4 as explained earlier. These integrations will provide visibility into the health of servers, databases, and network devices, which cover the majority of the organization’s estate.

This phase provides good learning and ample confidence for subsequent phases. Learning from phase 1 can then be used for improving the existing integrations as well as executing other data sources integrations like APM (deep dive), storage, backup, VMware, SAP Solman, job scheduling, etc., in subsequent phases. It is important to note that selecting data sources for integration in a specific phase depends on the criticality and urgency of the organization’s requirements.

Configuring the AIOps engine with all data source integrations as defined in step 3 and step 4 may take time depending upon their coverage and maturity and organizational processes. For example, vCenter is planned to be integrated, but not all ESX hosts and farms are configured in it, or an upgrade of storage, like EMC, comes during its integration with the AIOps engine. It is practically difficult to have detailed insight into all technology towers, especially in large organizations, and that’s why continuous feedback is important to tune the integrations and expand the coverage.

To reduce complications and have better time to value (TTV), SaaS platforms can be leveraged. On SaaS platforms, it would not involve installation, so you can proceed directly with configuring the solutions once the AIOps vendor onboards you onto the platform.

Once the AIOps engine is set up and configured, the next step is to set up and test the integrations. This involves getting the integration adapters configured and integrated. Each data source needs to be tested once the integration is set up to validate that the required data is being captured correctly in the AIOps engine.
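
One way to make this validation repeatable is to pull a sample of ingested events per data source and check that the fields the engine depends on are populated. The following is a minimal sketch; the field list and the `validate_source` name are assumptions for illustration, and the real required fields depend on the AIOps product in use.

```python
# Hypothetical minimum set of fields every ingested event should carry.
REQUIRED_FIELDS = ("source", "host", "severity", "message", "timestamp")


def validate_source(sample_events: list[dict]) -> dict:
    """Report which sample events from one data source miss required fields."""
    problems = {}
    for idx, event in enumerate(sample_events):
        missing = [f for f in REQUIRED_FIELDS if event.get(f) in (None, "")]
        if missing:
            problems[idx] = missing
    return {
        "checked": len(sample_events),
        # An empty sample is treated as a failure: no data arrived at all.
        "ok": bool(sample_events) and not problems,
        "problems": problems,
    }
```

Treating an empty sample as a failure catches the common case where the adapter is configured but no data is actually flowing.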

If you are directly integrating with automation, then the outbound integration to automation is also set up and configured.

At this stage, your AIOps system is set up and ready for validation by the operations teams where they can see all the events and alerts in a single console. You can proceed to configure roles, users, and access based on roles and provide them with the required console and dashboards.

At this stage, your AIOps engine acts like a central repository of all the data sources but doesn’t do any AIOps functionality and features for event management.

Step 6: Configure AIOps Features

The next step is to use the core AIOps features of the platform to configure the event management functionality. We have already covered the various layers in event management in earlier chapters; these need to be implemented and configured in the system. For large and complex deployments, this step happens in parallel with the integration phases of step 5. Functions such as deduplication, enrichment, topology correlation, application correlation, anomaly detection, etc., process the events coming from integrations with various data sources and select the qualified alerts for incident creation and subsequent automation of diagnosis and remediation. Again, continuous feedback is important while configuring the AIOps features and functions to tune the performance and accuracy of the AIOps engine. We have discussed in detail the implementation and best practices around these AIOps functions in previous chapters.
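
To make one of these functions concrete, the sketch below shows a simple time-window deduplication: repeated events for the same host and check are folded into one alert with a counter. The `Alert` fields and the 300-second default window are assumptions for illustration; commercial AIOps engines typically implement deduplication with their own keys and rules.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    host: str
    check: str
    timestamp: float  # seconds since epoch
    count: int = 1    # number of raw events folded into this alert


def deduplicate(events: list[Alert], window_s: float = 300.0) -> list[Alert]:
    """Collapse repeats of the same (host, check) seen within window_s of the previous one."""
    open_alerts: dict[tuple[str, str], Alert] = {}
    last_seen: dict[tuple[str, str], float] = {}
    result: list[Alert] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.host, ev.check)
        if key in open_alerts and ev.timestamp - last_seen[key] <= window_s:
            # Same problem is still firing: count it, do not raise a new alert.
            open_alerts[key].count += 1
        else:
            alert = Alert(ev.host, ev.check, ev.timestamp)
            open_alerts[key] = alert
            result.append(alert)
        last_seen[key] = ev.timestamp
    return result
```

The window is measured from the most recent repeat rather than from the first event, so a continuously firing check stays a single alert until it goes quiet for longer than the window.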

After configuring all the features, the system starts to learn from the data and the feedback being provided by the operations teams and forms a continuous loop where new data and new feedback provide new inputs to the AIOps engine to fine-tune itself and improve its coverage and accuracy.

Once you have set up everything and the system is up and running, you should monitor the system on an ongoing basis for accuracy and any drift that may happen in the data. During its lifetime there will be new integrations and new tools that may come up in the environment that need integration with the AIOps engine. These need to be handled as and when new monitoring capabilities are brought into the environment.
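
A lightweight way to watch for drift is to compare the engine's recent accuracy, as judged by operator feedback, with its earlier accuracy. The sketch below assumes you log one boolean per prediction (correct or not); the function name, window size, and tolerance are illustrative assumptions rather than recommended values.

```python
def accuracy_drift(outcomes: list[bool], window: int = 100, tolerance: float = 0.05) -> bool:
    """Return True when accuracy over the latest window has dropped by more
    than tolerance compared with the accuracy over all earlier outcomes."""
    if len(outcomes) < 2 * window:
        return False  # not enough history for a meaningful comparison
    recent = outcomes[-window:]
    earlier = outcomes[:-window]
    earlier_acc = sum(earlier) / len(earlier)
    recent_acc = sum(recent) / len(recent)
    return earlier_acc - recent_acc > tolerance
```

A flagged drift is a prompt to review recent changes in the environment, such as new tools or integrations, before retraining or retuning.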

Step 7: Deploy the Service Management Features

You can start using the capabilities of AIOps in the service management space next. As defined earlier in the Engage layer, the step-by-step approach can be implemented, and each function or feature can be rolled out as part of the plan. The detailed activities in AIOps service management will start with the integration of the Observe layer with incident creation and assignment, and the remaining features in the Engage layer will get implemented in order. An organization that prefers a highly cautious approach can proceed with selective automatic incident creation, such as creating automatic incidents only for production servers or only for a down event, etc., and then gradually expand automatic incident and auto-assignment for other types of qualified alerts, continuously improving both mean time to respond and mean time to repair.
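
The cautious, gradual expansion described above can be expressed as a small policy function. In this sketch the three rollout stages, the alert keys (`environment`, `severity`, `type`), and the function name are hypothetical; they simply illustrate starting with down events on production nodes and widening the scope over time.

```python
def should_auto_create_incident(alert: dict, rollout_stage: int) -> bool:
    """Decide whether a qualified alert may open an incident automatically.

    Stage 1: only 'down' events on production nodes.
    Stage 2: any critical alert on production nodes.
    Stage 3: every qualified alert.
    Any other stage: never create incidents automatically.
    """
    is_prod = alert.get("environment") == "production"
    if rollout_stage >= 3:
        return True
    if rollout_stage == 2:
        return is_prod and alert.get("severity") == "critical"
    if rollout_stage == 1:
        return is_prod and alert.get("type") == "down"
    return False
```

Keeping the stage as a single setting lets the team widen or roll back automatic incident creation without changing the rules themselves.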

Step 8: Deploy Automation Features

Once the AIOps engine has been configured for the Observe and Engage phases, the automation actions can be configured.

The automation engine needs to be configured in two phases. Phase 1 can be referred to as “human-assisted automation,” where the AIOps engine provides a probable cause that needs to be validated by the human agents and the automation engine provides a recommendation for resolution that also needs validation by a human agent.

Phase 2 is fully automated mode. You can proceed to it once two conditions are met: the AIOps Observe engine has run for a few weeks and its confidence score has reached a level where the probable cause can be considered the root cause, and the confidence score of the automation engine has also reached a threshold based on the feedback from human actions.
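
The switch from human-assisted to fully automated mode can be gated on the validation history of each runbook. The following is a minimal sketch; the `automation_mode` name, the minimum sample count, and the 0.95 threshold are assumptions for illustration and should be set according to your organization's risk appetite.

```python
def automation_mode(feedback: list[bool], min_samples: int = 50, threshold: float = 0.95) -> str:
    """Pick the operating mode for one runbook from its human validation history.

    feedback holds one entry per past recommendation: True if the human agent
    accepted it, False if the agent rejected it.
    """
    if len(feedback) < min_samples:
        return "human_assisted"  # not enough evidence yet
    confidence = sum(feedback) / len(feedback)
    return "fully_automated" if confidence >= threshold else "human_assisted"
```

Evaluating the gate per runbook, rather than for the system as a whole, matches the gradual approach in which automations move to full automation one at a time.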

From here on, the system will gradually add more root causes and automations to its repository to cover more areas under full automation mode.

Ensure that all the modules are monitored for accuracy and any drift that may happen is tracked and remediated for the accuracy levels to be maintained. New automation runbooks and tools may get deployed in the environment and will need integrations as and when they are available in the environment.

Once the AIOps system is set up and ready, it is time to observe and measure its value and benefits.

Step 9: Measure Success

Once the implementation has been rolled out, it is time to measure where you stand. In the initial stages of the project, you measured the initial KPIs when you were embarking on the AIOps journey. It is now time to measure these KPIs again and see where you stand after implementation. Remember, we had defined the following KPIs:
  • Alert to incident ratio: You should see a significant improvement in the alert-to-incident ratio. Because of the removal of noise in the system and probable cause analysis, false alerts that trigger an incident are eliminated or significantly reduced. This also results in a lower number of incidents in the incident management process since false or duplicate incidents are now suppressed by the AIOps engine.

  • Mean time to respond: The mean time to respond is significantly reduced since the automated analysis of events provides better input to the response teams and some of the response is automated by the system as well. This metric should see a marked improvement.

  • Mean time to resolve: This metric should also see significant reduction since the time to respond is reduced, and the time to resolve problems through automation is significantly lower. Less time is spent in trying different options, and the AIOps engine is able to guide the resolution teams to the right runbook for resolving the issues. This metric also impacts the availability of the systems, and you should see an improvement in the availability of the applications and infrastructure.

  • SLA metrics for P1 and P2: Since the SLAs are based on response and resolution, a marked improvement on these would translate to better scores on SLAs. In fact, the operations teams can go back and commit on better SLAs because of improvement in operations using AIOps.

  • Time for closure of SRs: Service requests get automated as a part of AIOps initiative, and thus the time required for closure of SRs is significantly reduced. There are also fewer errors since the SRs are handled automatically, thus avoiding human errors.

  • Percentage of SRs automatically resolved: Since automation resolves many SRs, this number sees significant increase through automation.

  • Percentage of incidents automatically resolved: Once automation has been established as part of AIOps toolset, then automated incidents significantly increase.

  • Availability of critical systems: A high degree of automation results in lower time to diagnose and resolve incidents, and this results in the higher availability of critical systems.
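
Comparing the two measurements is straightforward once both are captured in the same shape. The sketch below computes the percentage change of each KPI relative to its pre-implementation value; the `kpi_change` name and the dictionary layout are assumptions for illustration.

```python
def kpi_change(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Percentage change per KPI relative to the pre-implementation baseline.

    KPIs missing from the second measurement, or with a zero baseline,
    are skipped because a relative change is undefined for them.
    """
    changes = {}
    for name, base in before.items():
        if name in after and base != 0:
            changes[name] = round(100.0 * (after[name] - base) / base, 1)
    return changes
```

Note that the sign has to be read per KPI: a negative change is an improvement for mean time to resolve, whereas a positive change is an improvement for the percentage of incidents automatically resolved.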

Step 10: Celebrate and Share Success

Once you are done with all the layers, the final step in your AIOps journey is sharing your success with the business, leadership, and all the teams. Prepare a detailed case study leveraging the metrics that you set out to achieve and the actual results. Cover how you undertook the journey and what challenges were faced and how those were overcome. Share the knowledge with the AIOps community so that others can learn from your experiences. You can drop a note to us on your journey on AIOps as well at [email protected].

Next, let’s discuss some best practices and guidelines for AIOps implementation.

Guidelines on Implementing AIOps

The following are some guidelines when implementing AIOps.

Hype vs. Clarity

Do not undertake an AIOps project, or any other project for that matter, because it is fancy and hyped. There should be a definitive need and a use case for deploying AIOps. Clarity of need and purpose is essential for a successful AIOps journey.

Be Goal and KPI Driven

It is important that you measure your KPIs in the beginning and the end of the implementation. There are implementations that will fail because there is no clear idea on what the team wants to achieve. Having KPIs and knowing how they get impacted by the current project keeps everyone focused on the end outcome.

Expectations

AIOps applies extensive automation and statistical analysis to the events, performance metrics, logs, and trace data collected from monitoring tools to learn behaviors, identify anomalies, correlate alerts, reduce noise, and pinpoint root causes.

One has to understand and accept that machine learning technologies are probabilistic in nature, and thus they cannot be 100 percent accurate all the time. The idea is to progressively train them to a level of accuracy and confidence where the recommendations can be used by the IT operations teams in finding and resolving problems.

Time to Realize Benefits

Have realistic expectations for the time that machine learning needs to analyze data, build and train models, and begin providing insights, such as performance anomalies, grouped alerts, and root causes. For example, identifying weekly seasonality requires at least a couple of weeks of observation.
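
The weekly seasonality example can be illustrated with a simple baseline that averages a metric per weekday-and-hour slot and refuses to produce a result until enough history is available. This is a sketch only; the `weekly_baseline` name and the two-week minimum are assumptions, and production AIOps engines generally use more sophisticated seasonal models.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def weekly_baseline(
    samples: list[tuple[datetime, float]], min_weeks: int = 2
) -> dict[tuple[int, int], float]:
    """Average a metric per (weekday, hour) slot; refuse if history is too short."""
    if not samples:
        raise ValueError("no samples")
    times = [t for t, _ in samples]
    if max(times) - min(times) < timedelta(weeks=min_weeks):
        raise ValueError(f"need at least {min_weeks} weeks of data")
    slots = defaultdict(list)
    for t, value in samples:
        # weekday() is 0 for Monday through 6 for Sunday.
        slots[(t.weekday(), t.hour)].append(value)
    return {slot: mean(values) for slot, values in slots.items()}
```

With less than two weeks of data, each weekday-and-hour slot would rest on a single observation, so the baseline could not distinguish a recurring pattern from a one-off spike.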

Given the time and feedback, AIOps will provide IT operations with more accurate insights, allowing better decisions to be made. However, expecting AIOps to be a turnkey solution that automates everything on day one is an unrealistic expectation.

One Size Doesn’t Fit All

Every organization is unique, and you will have differences in infrastructure, application landscape, monitoring, and management tools, and hence there will be differences in approach, implementation, and the time to realize value. The size, scale, and complexity of an environment also have a bearing on the time taken to implement and realize value. However, we have seen that AIOps, if implemented correctly, provides rapid value realization and positive ROI.

Organizational Change Management

Familiarity with AIOps tools and processes is one part of the puzzle; however, a bigger piece is how to get buy-in from different teams that need to be part of this initiative; thus, organizational change management is a key factor in successful AIOps projects.

Plan Big, Start Small, and Iterate Fast

Rather than attempting to do everything in one massive undertaking, start small. That gives you the chance to learn from your accomplishments, validate and fine-tune your approach, acquire and build on capabilities, and achieve all-important quick wins for your organization. Taking on too much at once can lead to disappointments and may lead to poor business adoption.

Continually Improve

Once it’s deployed, you can continuously improve by extending the coverage and scope of AIOps and fine-tuning its algorithms to gain higher accuracy. AI is a dynamic and fast-evolving field, and newer technologies in AI are emerging that can impact how AIOps will be delivered in the near future.

The Future of AIOps

With advancements in AI and ML technology, AIOps systems will also get enhanced with more accurate predictions. Let’s understand the potential future of AIOps in enterprises.

Observability has a key role to play in DevOps and site reliability engineering. With new tools and techniques available for observability, the data available for analysis increases multifold. With AIOps and observability, the operations teams will have much higher visibility and greater control over their infrastructure and application landscape. After implementing AIOps in the observability area, deploying AI techniques to make sense of the event data, and helping the operations teams find the root cause of issues, the next step is to leverage these techniques to automate the resolution itself.

Tools such as iAutomate provide these capabilities where advanced machine learning techniques are used to take the probable cause as an input and apply AI to find the right automation and take automation in the operations area to the next level of maturity. Currently enterprises that are early adopters are using these technologies in the monitoring and observability domain and implementing automation to realize the complete benefits of end-to-end automation.

The next frontier for AIOps is the better integration of the development pipeline and using these technologies in the development area. AIOps should be able to help by automating the path from development to production, predicting the effect of deployment on production and responding to changes in the production environment.

With cloud computing and the adoption of microservices architecture, the resolution of root cause, especially the root causes that are related to capacity or performance of the infrastructure, would simply be to spin more containers or to launch more virtual machines. Thus, microservices architecture on one hand increases the complexity of the application by breaking it down into many services to map and manage; however, it eases the task of resolution for performance and capacity issues by providing an easy automation capability to spin up more containers quickly and handle the spike in workload.
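
Such a capacity resolution can be reduced to a proportional scaling rule: the replica count grows or shrinks by the ratio of observed to target utilization. The sketch below uses a proportional formula similar to the one documented for the Kubernetes Horizontal Pod Autoscaler; the function name, parameter names, and replica limits are assumptions for illustration.

```python
import math


def desired_replicas(
    current: int,
    observed_util: float,
    target_util: float,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Proportional scaling: choose a replica count that moves utilization toward target."""
    if current < 1 or target_util <= 0:
        raise ValueError("current must be >= 1 and target_util must be > 0")
    wanted = math.ceil(current * observed_util / target_util)
    # Clamp so a faulty metric cannot scale the service to zero or without bound.
    return max(min_replicas, min(max_replicas, wanted))
```

The upper limit matters in an automated setting, since a misleading metric would otherwise translate directly into unbounded infrastructure cost.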

AIOps and DevOps together solve a lot of problems. DevOps brings together the teams that were siloed earlier, while AIOps brings together data from multiple sources at a single place for insights and analytics. Both require a level of cultural change since they look at the entirety of the process as well as cut across multiple teams and processes. We will see much better integration of AIOps and DevOps in times to come with enterprises making the transition to AIOps by relooking at their DevOps processes and vice versa. Organizations will go on to create interfaces between the AIOps and DevOps processes and create procedures that cut across the two domains.

AIOps in its current form and shape will continue to be adopted by enterprises, and with better integration, DevOps organizations will start using the AI techniques in other areas beyond operations. AIOps is primarily focused on the operations world today; however, the tools and techniques and the algorithms are equally applicable in the development world. On one hand, AIOps data will be used as an input into the development processes for creating more resilient and optimized applications, and on the other hand new use cases will evolve for taking AIOps technology into the development world.

Similar to the operations world, where we have multiplicity of monitoring and management tools, the development world also has multiplicity of tools. The development tools that generate the development data are the Agile dashboards that track the pipeline of features to be built, the team statistics, the burndown charts, and the schedule of releases. Then there is data from testing tools that include functional testing, performance testing, code quality, and security and vulnerability testing. All this data is still analyzed by humans, and the decisions for deployment are made by the release and management teams. There are enterprises that have perfected the art of continuous delivery and continuous deployment; however, using AIOps technologies to go through the development data and provide insights and analytics similar to what is now available in the operations world will be the next frontier for AIOps technologies.

Some application performance management tools have started using these technologies to provide insights into the application code to find where the root causes of performance or availability issues are. Though most APM technologies are still rule or topology based, a few vendors have moved in this direction to bring machine learning and AI technologies to the area of application development and testing.

There will be better integration of the AIOps technologies in the development pipeline, wherein events and data generated through the pipeline are sent to the AIOps pipeline for analysis and automated actions. The next step is where AIOps is able to provide intelligent guidance on the code and configuration changes being made in various environments. This can be based on the data being generated by the pipeline. There would be integration with historical analysis of issues and bugs that were generated, and these can be a source of input to arrive at predictions on the current pipeline.

There are a lot of false positives generated in testing. The automated testing tools that look for code quality or perform static code analysis often generate false positives. AIOps can be used to eliminate these false positives. Regression testing can be automated and simplified through the use of machine learning technologies. Thus, AIOps technologies in the DevOps pipeline can reduce the amount of testing, help in automated testing, reduce false positives, and guide the team in focusing their time and energy on the most relevant pieces.

AIOps can be intelligently used in both preproduction and production systems to analyze the behavior of applications and impact of configuration changes on the application performance. These alerts can be configured to map the changes between the preproduction and production deployment to find any issues that may crop up because of configuration issues or because of changes in the workload pattern.

AIOps is already being used by advanced automation systems to find the best remediation available and to apply it automatically; however, future systems will also enable ChatOps-based collaboration between the developers and operations teams to enable knowledge sharing and just-in-time knowledge retrieval for bug fixing as well as fixing errors in production environments. Thus, AIOps will see its usage expand beyond operations into the entire DevSecOps value chain.

Though there has been work in the areas of predictive analytics, there will be further enhancements to the algorithms and techniques being used in AIOps to deliver ever-increasing use cases and also to improve the accuracy of the results. As more and more data is fed into the systems, the AIOps engines will become more powerful and provide ever deeper insights and analytics.

The core algorithms that involve lots of machine learning models today will evolve to use deep learning technologies that will provide better insight since the amount of data available to AIOps systems will increase, and neural network systems will become more accurate.

AIOps has had a great start, and organizations are seeing immense benefits of using these technologies today. It will go on to conquer new domains and areas beyond IT operations and eventually impact the entire IT value chain.

Summary

In this chapter, we covered a step-by-step approach to implementing AIOps in an organization. We also covered what to watch out for while implementing AIOps and how to avoid pitfalls. We gave guidance on a successful implementation of a project and ways to measure success as well as the important KPIs to measure the success of an AIOps project.

This brings us to the end of this book; we have covered AIOps in depth, beginning with the definition and areas that AIOps covers to a complete architecture of AIOps covering the Observe, Engage, and Act phases. We also covered various machine learning tools and techniques and algorithms that can be used in the AIOps domain. We gave you a detailed walk-through of the process that you can use to implement AIOps in your organization in a step-by-step manner. We sincerely hope that you enjoyed this book as much as we enjoyed authoring it. We look forward to your feedback and comments at [email protected].
