Chapter 19: Introducing AIOps in Multi-Cloud

AIOps stands for Artificial Intelligence for Operations, but what does it really mean? AIOps is still a rather new concept, but it can help to optimize your multi-cloud platform. It analyzes the health and behavior of workloads end-to-end – that is, right from the application's code all the way down to the underlying infrastructure. AIOps tooling helps in discovering issues and provides advice for optimization. The best part is that good AIOps tools do this cross-platform since they operate from the perspective of the application and even the business chain.

This chapter is an introduction to the concept of AIOps. The components of AIOps will be discussed, including data analytics, automation, and Machine Learning (ML). After completing this chapter, you will have a good understanding of how AIOps can help in optimizing cloud environments and how enterprises can get started with implementing AIOps. The chapter concludes with a brief overview of some market-leading tools in this space.

In this chapter, we're going to cover the following main topics:

  • Understanding the concept of AIOps
  • Optimizing cloud environments using AIOps
  • Exploring AIOps tools for multi-cloud

Understanding the concept of AIOps

AIOps combines big data analytics and ML to automatically investigate and remediate incidents that occur in the IT environment. AIOps systems learn how to correlate incidents between the various components in the environment by continuously analyzing all logging sources and the performance of assets within the entire IT landscape of an enterprise. They learn what the dependencies are inside and outside of IT systems.

Especially in the world of multi-cloud, where enterprises have systems in various clouds and still on-premises, gaining visibility over the full landscape is not easy. How would an engineer tell that the bad performance of a website that hosts its frontend in a specific cloud is caused by a bad query in a database that runs from a data lake in a different cloud?
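The kind of correlation described here can be illustrated with a very small sketch. It is not how any particular AIOps product works; it only assumes that events from different clouds carry comparable timestamps and groups those that occur close together in time. The Event type, the source names, and the 60-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since a shared reference point (assumption)
    source: str       # hypothetical source label, such as "cloud-a/frontend"
    message: str


def correlate(events, window_seconds=60.0):
    """Group events that follow each other within window_seconds.

    Only groups that span more than one source are returned, since those
    are the ones that hint at a cause in a different system.
    """
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return [g for g in groups if len({e.source for e in g}) > 1]


events = [
    Event(100.0, "cloud-b/database", "slow query on data lake"),
    Event(130.0, "cloud-a/frontend", "page latency above 2 s"),
    Event(900.0, "cloud-a/frontend", "certificate renewed"),
]

for group in correlate(events):
    print(" -> ".join(f"{e.source}: {e.message}" for e in group))
```

Time proximity alone is a weak signal. Real systems would combine it with knowledge of the dependencies between components, which is exactly what AIOps platforms learn over time.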

AIOps requires highly sophisticated systems, comprising the following components:

  • Data analytics: The system gathers data from various sources, including log files, system metrics, monitoring data, and also data from systems outside the actual IT environment, such as posts on forums and social media. A peak in incidents logged in the systems of the service desk may also be a source. AIOps systems will aggregate the data, look for trends and patterns, and compare these to known models. This way, AIOps is able to determine issues quickly and accurately.
  • ML: AIOps uses ML algorithms to learn from the data. It starts with a baseline that represents the normal behavior of systems, applications, and users. Because applications and the usage of data and systems might change over time, AIOps constantly evaluates new patterns and learns from them, teaching itself what the new normal behavior is and which events should create alerts. Based on the output of the algorithms, AIOps prioritizes events and alerts and starts remediating actions.
  • Automation: This is the heart of AIOps. If the system detects issues, unexpected changes, or abnormalities in behavior, it will prioritize and start remediation. It can only do that when the system is highly automated. From the analytics output and the algorithm, AIOps systems can determine what the best solution is to solve an issue. If a system runs out of memory because of peak usage, it can automatically increase the size of memory. Some AIOps systems may even be capable of predicting the peak usage and already start increasing the memory before the actual usage occurs, without any human intervention. Be aware that cloud engineers will have to allow this automated scaling in the cloud systems themselves.
  • Visualization: Although AIOps is fully automated and self-learning, engineers would want to have visibility of the system and its actions. For this, AIOps offers real-time dashboards and extensive ways of creating reports that will help in improving the architecture of systems. That's the only thing AIOps will not do: it will not change the architecture. Enterprises will still need cloud architects for that. The next section discusses how AIOps can help in improving cloud environments.
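The idea of a baseline that keeps adjusting to the new normal can be sketched in a few lines. This is a deliberately simple illustration, not the algorithm of any specific AIOps tool: it assumes a single metric, a rolling window of recent samples as the baseline, and a fixed threshold of three standard deviations. The function name, window size, and sample values are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return the indexes of samples that deviate strongly from a rolling baseline."""
    baseline = deque(maxlen=window)  # the most recent "normal" samples
    anomalies = []
    for index, value in enumerate(samples):
        if len(baseline) == window:
            mu = mean(baseline)
            sigma = stdev(baseline)
            # A perfectly flat baseline (sigma == 0) flags nothing in this sketch.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(index)
                continue  # keep the outlier out of the baseline
        baseline.append(value)  # normal samples gradually shift the baseline
    return anomalies


cpu_percent = [41, 43, 42, 44, 42, 43, 95, 42, 44, 43]
print(detect_anomalies(cpu_percent))  # prints [6]
```

Because normal samples keep entering the window, a slow change in usage moves the baseline along with it, while a sudden spike is flagged and left out. A production system would use far richer models and many correlated metrics.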

AIOps is a good extension of DevOps, where enterprises automate the delivery and deployment of systems. With AIOps, they can automate operations as well. Is there something after AIOps? Yes, it's called NoOps, or No IT Operations, where all operational activities are fully automated. The idea here is that teams can completely concentrate on development. All daily management routines on IT systems, such as system updates, bug fixing, scaling, and security operations, are taken over by automated systems. Although it's called NoOps, engineers will still be needed to set up the systems and implement the operations baseline.

Optimizing cloud environments using AIOps

AIOps has two major benefits. The first is the speed and accuracy in detecting anomalies and responding to them without human intervention. The second is capacity optimization. Most cloud providers offer some form of scale-out or scale-up mechanism driven by metrics, available natively within the platform. AIOps can optimize this scaling since it learns what thresholds are required, whereas the cloud provider requires engineers to define and hardcode them. Since the system is learning, it can help in predicting when and what resources are needed. The following diagram shows the evolution of operations, from descriptive to prescriptive. Most monitoring tools are descriptive, whereas AIOps is predictive:

Figure 19.1 – Evolution of monitoring to AIOps


Monitoring simply registers what's happening. With log analytics, companies can diagnose events and take remediation actions based on the outcomes of these analyses. This is all reactive, whereas AIOps is proactive and predictive. By analyzing data, it can predict the impact of changes. The last step is systems that are prescriptive: they are able to tell what should happen and already prepare systems for events, fully automated. Some very sophisticated AIOps systems can already do that, but generally, market analysts see this as the domain of NoOps.
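The difference between descriptive and predictive can be made concrete with a small sketch. It assumes equally spaced memory measurements and fits a straight line through them to estimate how many samples remain before a limit is reached. A linear trend is a simplification chosen for illustration; the function names and figures are hypothetical, and at least two samples are required.

```python
def fit_trend(values):
    """Least-squares slope and intercept for equally spaced samples."""
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_mean - slope * x_mean


def steps_until(values, limit):
    """Estimate how many future samples remain before the trend reaches limit.

    Returns None when usage is flat or falling, so no crossing is expected.
    """
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope  # sample index where the line hits limit
    return max(0.0, crossing - (len(values) - 1))


memory_gb = [10.0, 11.0, 12.0, 13.0, 14.0]
print(steps_until(memory_gb, limit=16.0))  # prints 2.0
```

A descriptive tool would only report the latest value of 14 GB. A predictive one uses the trend to act before the limit is hit, for example by scaling up two samples ahead of time.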

Enterprises are discovering AIOps because it helps them in optimizing the IT infrastructure. But how do companies start with AIOps? The following guidelines are recommended to successfully implement an AIOps strategy:

  • AIOps systems are learning systems: Enterprises will also have to learn how to work with these systems and interpret their analyses to get the best out of them. So, don't try to get the entire IT environment under AIOps in one go, but start with a small pilot and iterate from there.
  • Data is essential in AIOps: This should not only be data that comes from IT systems, but also business data. After all, the great benefit of AIOps is that it can take actions that are based on business data. If AIOps knows that certain products sell better at specific times of the year – which is business-driven data – it can take actions to optimize IT systems for that peak period. Also, if it turns out that systems are not used as expected, AIOps will be able to analyze the usage and correlate it with other events. In that way, AIOps can be a fantastic source for the business in becoming a truly data-driven organization. Businesses, therefore, absolutely need to be involved in the implementation of AIOps.
  • Most important in a successful implementation is to standardize: Throughout this book, it has been stressed that multi-cloud environments need to be implemented in a consistent way, meaning that infrastructure and configuration must be defined as code so that they can be deployed in a consistent manner to various cloud platforms. The code must be centrally managed from one repository, as much as possible. This will ensure that AIOps systems learn quickly what systems look like and how they should behave so that anomalies can be detected quickly.

Next, how does AIOps help in optimizing IT environments? As explained in the Understanding the concept of AIOps section, AIOps can best be seen as an extension to DevOps: it helps development in optimizing systems. The key is in testing. In the previous chapter, the principles of CI/CD were discussed. An important phase in CI/CD is testing. Typically, developers first test the functionality of one application with unit testing and then test its integration with other applications or systems. The problem is that developers can't test everything; for instance, they can't test against scenarios where system components in an IT chain change. These can be changes that theoretically should not have a major impact, but in real life do, or even trigger completely unexpected behavior.

AIOps can help in testing against real-life scenarios and take much more into consideration in terms of testing. AIOps will know what systems would be impacted when changes are applied to a certain system and also vice versa: what systems will respond to changes in terms of performance and stability. These can be systems that are hosted in different clouds or platforms; they can be part of the application chain.
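Knowing which systems are impacted by a change comes down to walking a dependency graph. The sketch below assumes that the dependencies are already known and stored as a simple mapping; in practice an AIOps platform would discover them. The system names and the topology are hypothetical.

```python
from collections import deque

# Hypothetical topology: each key depends on the systems in its list.
DEPENDS_ON = {
    "web-frontend": ["order-api"],               # cloud A
    "order-api": ["orders-db", "auth-service"],  # cloud A
    "reporting": ["orders-db"],                  # cloud B
    "orders-db": [],                             # cloud B
    "auth-service": [],                          # on-premises
}


def impacted_by(changed, depends_on):
    """Return every system that directly or indirectly depends on changed."""
    # Invert the mapping: for each system, who depends on it?
    dependents = {}
    for system, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(system)

    seen = set()
    queue = deque([changed])
    while queue:  # breadth-first walk up the chain
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in seen:
                seen.add(system)
                queue.append(system)
    seen.discard(changed)  # only relevant if the graph contains a cycle
    return seen


print(sorted(impacted_by("orders-db", DEPENDS_ON)))
# prints ['order-api', 'reporting', 'web-frontend']
```

A change to the database in one cloud reaches the frontend in another through the API in between, which is the kind of indirect impact that unit and integration tests in a single pipeline tend to miss.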

The problem of applications and systems that coexist and disproportionately consume shared resources is referred to as the noisy neighbor problem. AIOps will identify the neighbors, warn them of upcoming changes, and even take proactive measures to prevent the applications and systems from running into trouble. This goes beyond the unit and integration tests that are triggered by a CI/CD pipeline.

Today's multi-cloud environments are complex, with servers and services running in various clouds. Systems are connected over the network backbones of different cloud platforms, routing data over the enterprise's gateways, while continuously checking whether users and systems are still compliant with the applied security frameworks. There's a good chance something is missed when distributing applications across these environments.

AIOps can be used to improve the overall architecture. Architects will have much better insight into the environment and all the connections between applications and systems; this includes not only servers but also network and security devices. Next, AIOps will help in the distribution of applications across platforms and the scaling of infrastructure without impacting the neighbors, even if the neighbors are sitting on a different platform.

Exploring AIOps tools for multi-cloud

The market for AIOps is in its infancy, although market analysts expect that the use of AIOps will grow from around 5 percent of big enterprises currently to 30 percent in 2023 (refer to https://www.gartner.com/smarterwithgartner/how-to-get-started-with-aiops/). This explains why a lot of leading IT companies are investing heavily in AIOps. Vendors include big names such as IBM, Splunk, VMware, Moogsoft, Dynatrace, BMC, and ServiceNow. But there are a lot more tools that are certainly worth a look, such as Datadog, ExtraHop, FixStream, Grok, and StackState, just to name a few.

How does an enterprise choose the right tool? When an enterprise is working in multi-cloud, it needs AIOps that can handle multi-cloud. These are AIOps platforms that have APIs to the major cloud providers and can integrate with the monitoring solutions of these providers and third-party tools that enterprises have in the cloud environments. An example of such a platform is Splunk Enterprise, which collects, correlates, and analyzes data from IT infrastructure, applications, and security systems.

In essence, all of these tools work in layers. The layers are depicted in the following diagram:

Figure 19.2 – Layers of AIOps


Most AIOps systems combine a set of tools in the different layers into an AIOps platform that can handle the various aspects of AIOps.

Splunk is one of the platforms that have a wide variety of products that can support development and operations in an enterprise. The suite contains the following products, among others:

  • Splunk Cloud to manage infrastructure in any cloud.
  • Splunk User Behavior Analytics to detect threats and anomalies in behavior using ML.
  • Splunk Phantom for cross-platform security orchestration and automating specific solutions, such as Splunk Insights for AWS Cloud Monitoring. The latter is a solution that offers tools to migrate mission-critical workloads to AWS, monitor them, gain insights into costs, and keep track of security and compliance.

All these solutions come together in Splunk Enterprise.

Comparable solutions are ServiceNow, Dynatrace, and StackState. According to a Gartner report, available at https://www.gartner.com/en/documents/3971186 (be aware that a login is required to download reports from Gartner), StackState is a platform that will grow significantly in the coming years and might become one of the market leaders. ServiceNow has the Now Platform, connecting various solutions to get visibility of all IT systems in any environment to detect issues, automate workflows and responses, and manage security. Dynatrace works with Davis, the AI solution of Dynatrace. StackState follows the 4T model – time, topology, telemetry, and tracing. It visualizes changes in the full, cross-cloud environment so that operators can time travel and spot where changes have occurred. To do that, the system correlates infrastructure data from all layers, such as applications, databases, servers, operating systems, and traffic routing through network and security devices.

Key in all these solutions is that they auto-discover any changes in environments in real time and can predict the impact on any other component in the IT environment before events actually occur, including the impact of changes that are planned through CI/CD pipelines.

AIOps helps enterprises in becoming data-driven organizations. From the first chapter of this book, the message has been that IT – and IT architecture – is driven by business decisions. But business itself is driven by data: how fast does a market develop, where are the customers, what are the demands of these customers, and how can IT prepare for these demands? The agility to adapt to market changes is key in IT and that's exactly what cloud environments are for – that is, cloud systems can adapt quickly to changes. It becomes even faster when data drives the changes directly, without human interference. Data drives every decision.

That's the promise of AIOps. An organization that adopts the principle of becoming a data-driven enterprise must have access to vast amounts of data from a lot of different sources, inside and outside IT. It needs to embrace automation. But above all, it needs to trust and rely on sophisticated technology with data analytics, AI, and ML. That's a true paradigm shift for a lot of companies. It will only succeed when it's done in small steps. The good news is that companies already have a lot of business and IT data available that they can feed into AI and ML algorithms. So, they can get started.

Summary

AIOps is the new kid on the block. AIOps systems are complex systems that help organizations in detecting changes and anomalies in their IT environments and in predicting what impact these events might have on other components within their environments. AIOps systems can even predict this from planned changes coming from DevOps systems such as CI/CD pipelines. To be able to do that, AIOps makes use of big data analysis: it has access to a lot of different data sources, inside and outside IT environments. This data is analyzed and fed into algorithms: this is where AI and ML come in. AIOps systems learn so that they can actually predict future events.

AIOps systems are complex and require vast investments from vendors, and thus from companies that want to start working with AIOps. However, most organizations want to become more and more data-driven, meaning that data is driving all decisions. This makes a company more agile and faster in responding to market changes.

After completing this chapter, you should have a good understanding of the benefits as well as the complexity of AIOps. You should also be able to name a few of the market leaders in the field of AIOps. At the end of the day, it's all about being able to respond quickly to changes, but with minimum risk and keeping IT systems running at all times. That is what the final chapter of this book is about: site reliability engineering.

Questions

  1. AIOps correlates data from a lot of different systems, including IT systems that are not directly in the delivery chain of an application, but might be impacted by changes to that chain. What are these systems called in terms of AIOps definitions?
  2. Name at least two vendors of AIOps systems, recognized as such by market analysts.
  3. AIOps works in layers. Rate the following statement true or false: most AIOps systems have separate solutions for the layers that are combined in an AIOps platform.
  4. In terms of the level of automation, would you rate NoOps before or after AIOps?

Further reading

You can refer to the blog and video on AIOps at https://searchitoperations.techtarget.com/feature/Just-what-can-AI-in-IT-operations-accomplish.
