AIOps stands for Artificial Intelligence for IT Operations, but what does it really mean? AIOps is still a rather new concept, but it can help to optimize your multi-cloud platform. It analyzes the health and behavior of workloads end-to-end, that is, right from the application's code all the way down to the underlying infrastructure. AIOps tooling helps in discovering issues and provides advice for optimization. The best part is that good AIOps tools do this cross-platform, since they operate from the perspective of the application and even the business chain.
This chapter is an introduction to the concept of AIOps. The components of AIOps will be discussed, including data analytics, automation, and Machine Learning (ML). After completing this chapter, you will have a good understanding of how AIOps can help in optimizing cloud environments and how enterprises can get started with implementing AIOps. The chapter concludes with a brief overview of some market-leading tools in this space.
In this chapter, we're going to cover the following main topics:
AIOps combines analytics of big data and ML to automatically investigate and remediate incidents that occur in the IT environment. AIOps systems learn how to correlate incidents between the various components in the environment by continuously analyzing all logging sources and the performance of assets within the entire IT landscape of an enterprise. They learn what the dependencies are inside and outside of IT systems.
Especially in the world of multi-cloud, where enterprises have systems in various clouds and still on-premises, gaining visibility over the full landscape is not easy. How would an engineer tell that the bad performance of a website that hosts its frontend in a specific cloud is caused by a bad query in a database that runs from a data lake in a different cloud?
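The question above can be made concrete with a minimal sketch. It assumes the dependencies between components are already known and that each component has been flagged healthy or unhealthy; the component names, the `DEPENDS_ON` map, and the `probable_root_causes` helper are all hypothetical and not taken from any AIOps product. The idea is to walk from the symptom down its dependency chain and report the unhealthy components whose own dependencies are all healthy:

```python
from collections import deque

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "website-frontend (cloud A)": ["orders-api (cloud A)"],
    "orders-api (cloud A)": ["reporting-db (cloud B)"],
    "reporting-db (cloud B)": ["data-lake (cloud B)"],
    "data-lake (cloud B)": [],
}

# Hypothetical health flags, for example derived from logs and metrics.
UNHEALTHY = {
    "website-frontend (cloud A)",
    "orders-api (cloud A)",
    "reporting-db (cloud B)",
}


def probable_root_causes(symptom, depends_on, unhealthy):
    """Return unhealthy components reachable from the symptom
    whose own direct dependencies are all healthy."""
    seen = {symptom}
    queue = deque([symptom])
    causes = []
    while queue:
        node = queue.popleft()
        deps = depends_on.get(node, [])
        # An unhealthy node with only healthy dependencies is a likely origin.
        if node in unhealthy and not any(d in unhealthy for d in deps):
            causes.append(node)
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return causes


causes = probable_root_causes("website-frontend (cloud A)", DEPENDS_ON, UNHEALTHY)
print(causes)
```

In this toy landscape, the slow frontend is traced back to the database in the other cloud. A real AIOps system would have to discover the topology and the health signals itself, which is the hard part.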
AIOps requires highly sophisticated systems, comprising the following components:
AIOps is a good extension of DevOps, where enterprises automate the delivery, deployment, and operations of systems. With AIOps, they can take the automation of operations a step further. Is there something after AIOps? Yes, it's called NoOps, or No IT Operations, where all operational activities are fully automated. The idea here is that teams can completely concentrate on development. All daily management routines on IT systems, such as system updates, bug fixing, scaling, and security operations, are taken over by automated systems. Although it's called NoOps, engineers will still be needed to set up the systems and implement the operations baseline.
AIOps has two major benefits. The first is the speed and accuracy with which it detects anomalies and responds to them without human intervention. The second is capacity optimization. Most cloud providers natively offer some form of metrics-driven scale-out or scale-up mechanism. AIOps can optimize this scaling because it learns which thresholds are required, whereas the native mechanism requires engineers to define and hardcode them. Since the system is learning, it can help in predicting when and what resources are needed. The following diagram shows the evolution of operations, from descriptive to prescriptive. Most monitoring tools are descriptive, whereas AIOps is predictive:
Monitoring simply registers what's happening. With log analytics, companies can diagnose events and take remediation actions based on the outcomes of these analyses. This is all reactive, whereas AIOps is proactive and predictive: by analyzing data, it can predict the impact of changes. The last step is prescriptive systems, which can tell what should happen and prepare systems for events in advance, fully automated. Some very sophisticated AIOps systems can already do that, but generally, market analysts see this as the domain of NoOps.
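The difference between a hardcoded threshold and a learned one can be illustrated with a small sketch. It is a deliberately simple statistical stand-in, not the ML that real AIOps platforms apply: the threshold is the mean of past observations plus a few standard deviations, and the forecast is a naive linear extrapolation. The function names, the sample values, and the hardcoded limit of 80 are assumptions made for illustration only.

```python
from statistics import mean, stdev


def learned_threshold(history, k=3.0):
    """Upper bound learned from past observations: mean + k standard deviations."""
    return mean(history) + k * stdev(history)


def is_anomaly(value, history, k=3.0):
    """Flag a value that exceeds the learned upper bound."""
    return value > learned_threshold(history, k)


def forecast_next(history, window=3):
    """Naive prediction: last value plus the average of the last `window` steps."""
    recent = history[-(window + 1):]
    steps = [b - a for a, b in zip(recent, recent[1:])]
    return history[-1] + mean(steps)


# Hypothetical CPU utilization samples (percent) for one workload.
cpu_history = [40, 42, 41, 43, 42, 44]

HARDCODED_LIMIT = 80  # what an engineer might have configured by hand

new_sample = 60
print(new_sample > HARDCODED_LIMIT)         # descriptive rule: nothing to report
print(is_anomaly(new_sample, cpu_history))  # learned baseline: unusual for this workload
print(forecast_next(cpu_history))           # predictive: expected next value
```

A sample of 60 percent stays below the hand-set limit, yet it is far outside what this workload normally does, which is the kind of signal a learning system can pick up without anyone defining a threshold.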
Enterprises are discovering AIOps because it helps them in optimizing the IT infrastructure. But how do companies start with AIOps? The following guidelines are recommended to successfully implement an AIOps strategy:
Next, how does AIOps help in optimizing IT environments? As explained in the Understanding the concept of AIOps section, AIOps can best be seen as an extension to DevOps: it helps development in optimizing systems. The key is in testing. In the previous chapter, the principles of CI/CD were discussed. An important phase in CI/CD is testing. Typically, developers test against the functionality of one application first with unit testing and then integration with other applications or systems. The problem is that developers can't test everything; for instance, they can't test against scenarios where system components in an IT chain change. These can be changes that theoretically might not have a major impact, but in real life do or even trigger completely unexpected behavior.
AIOps can help in testing against real-life scenarios and take much more into consideration in terms of testing. AIOps will know what systems would be impacted when changes are applied to a certain system and also vice versa: what systems will respond to changes in terms of performance and stability. These can be systems that are hosted in different clouds or platforms; they can be part of the application chain.
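A minimal sketch of such an impact analysis is shown below. It assumes a known dependency map; the component names and the `impacted_by_change` helper are hypothetical. The map is inverted so that, starting from the changed component, every direct and indirect dependent can be collected:

```python
from collections import deque

# Hypothetical application chain: component -> components it depends on.
DEPENDS_ON = {
    "web-shop": ["payment-api", "catalog-api"],
    "payment-api": ["customer-db"],
    "catalog-api": ["customer-db", "search-index"],
    "customer-db": [],
    "search-index": [],
}


def impacted_by_change(changed, depends_on):
    """Return every component that directly or indirectly depends on `changed`."""
    # Invert the map: component -> components that depend on it.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    impacted = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


print(sorted(impacted_by_change("customer-db", DEPENDS_ON)))
print(sorted(impacted_by_change("search-index", DEPENDS_ON)))
```

The components in the map could just as well live in different clouds; the traversal does not care where a component is hosted, only how it is connected.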
The problem of applications and systems that share a platform, where one of them disproportionately consumes resources at the expense of the others, is referred to as the noisy neighbor problem. AIOps will identify the neighbors, warn them of upcoming changes, and even take proactive measures to prevent the applications and systems from running into trouble. This goes beyond the unit and integration tests that are triggered by a CI/CD pipeline.
Today's environments in multi-cloud are complex, with servers and services running in various clouds. Systems are connected over network backbones of different cloud platforms, routing data over the enterprise's gateways, yet continuously checking whether users and systems are still compliant against applied security frameworks. There's a good chance something is missed when distributing applications across these environments.
AIOps can be used to improve the overall architecture. Architects will have much better insight into the environment and all the connections between applications and systems; this includes not only servers but also network and security devices. Next, AIOps will help in the distribution of applications across platforms and the scaling of infrastructure without impacting the neighbors, even if the neighbors are sitting on a different platform.
The market for AIOps is in its infancy, although market analysts expect that the use of AIOps will grow from the current level of around 5 percent to 30 percent of big enterprises by 2023 (refer to https://www.gartner.com/smarterwithgartner/how-to-get-started-with-aiops/). This explains why a lot of leading IT companies are investing heavily in AIOps. Vendors include big names such as IBM, Splunk, VMware, Moogsoft, Dynatrace, BMC, and ServiceNow. But there are a lot more tools that are certainly worth a look, such as Datadog, ExtraHop, FixStream, Grok, and StackState, just to name a few.
How does an enterprise choose the right tool? When an enterprise is working in multi-cloud, it needs AIOps that can handle multi-cloud. These are AIOps platforms that have APIs to the major cloud providers and can integrate with the monitoring solutions of these providers and third-party tools that enterprises have in the cloud environments. An example of such a platform is Splunk Enterprise, which collects, correlates, and analyzes data from IT infrastructure, applications, and security systems.
In essence, all of these tools work in layers. The layers are depicted in the following diagram:
Most AIOps systems combine a set of tools in the different layers into an AIOps platform that can handle the various aspects of AIOps.
Splunk is one of the platforms that have a wide variety of products that can support development and operations in an enterprise. The suite contains the following products, among others:
All these solutions come together in Splunk Enterprise.
Comparable solutions are ServiceNow, Dynatrace, and StackState, the latter being recognized by market analysts as a platform that will grow significantly in the coming years and might become one of the market leaders, according to a Gartner report available at https://www.gartner.com/en/documents/3971186 (be aware that a login is required to download reports from Gartner). ServiceNow has the Now Platform, connecting various solutions to get visibility of all IT systems in any environment to detect issues, automate workflows and responses, and manage security. Dynatrace works with Davis, its AI solution. StackState follows the 4T model: time, topology, telemetry, and tracing. It visualizes changes in the full, cross-cloud environment so that operators can time travel and spot where changes have occurred. To do that, the system correlates infrastructure data from all layers, such as applications, databases, servers, operating systems, and traffic routing through network and security devices.
Key in all these solutions is that they auto-discover any changes in environments in real time and can predict the impact on any other component in the IT environment before events actually occur, including the impact of changes that are planned from CI/CD pipelines.
AIOps helps enterprises in becoming data-driven organizations. From the first chapter of this book, the message has been that IT – and IT architecture – is driven by business decisions. But business itself is driven by data: how fast does a market develop, where are the customers, what are the demands of these customers, and how can IT prepare for these demands? The agility to adapt to market changes is key in IT and that's exactly what cloud environments are for – that is, cloud systems can adapt quickly to changes. It becomes even faster when data drives the changes directly, without human interference. Data drives every decision.
That's the promise of AIOps. An organization that adopts the principle of becoming a data-driven enterprise must have access to vast amounts of data from a lot of different sources, inside and outside IT. It needs to embrace automation. But above all, it needs to trust and rely on sophisticated technology with data analytics, AI, and ML. That's a true paradigm shift for a lot of companies. It will only succeed when it's done in small steps. The good news is that companies already have a lot of business and IT data available that they can feed into AI and ML algorithms. So, they can get started.
AIOps is the new kid on the block. AIOps systems are complex systems that help organizations in detecting changes and anomalies in their IT environments and in predicting what impact these events might have on other components within their environments. AIOps systems can even predict this from planned changes coming from DevOps systems such as CI/CD pipelines. To be able to do that, AIOps makes use of big data analysis: it has access to a lot of different data sources, inside and outside IT environments. This data is analyzed and fed into algorithms: this is where AI and ML come in. AIOps systems learn so that they can actually predict future events.
AIOps platforms are complex systems that require vast investments from vendors and thus from companies that want to start working with AIOps. However, most organizations want to become more and more data-driven, meaning that data is driving all decisions. This makes a company more agile and faster in responding to market changes.
After completing this chapter, you should have a good understanding of the benefits as well as the complexity of AIOps. You should also be able to name a few of the market leaders in the field of AIOps. At the end of the day, it's all about being able to respond quickly to changes, but with minimum risk and keeping IT systems running at all times. That is what the final chapter of this book is about: site reliability engineering.
You can refer to the blog and video on AIOps at https://searchitoperations.techtarget.com/feature/Just-what-can-AI-in-IT-operations-accomplish.