This chapter will introduce you to the concepts of artificial intelligence (AI) and machine learning (ML). We will also learn about AIOps, why we need it, and how it is applied to IT operations, covering some of the areas where AIOps can be helpful. We will learn about the Amazon DevOps Guru AIOps tool and implement two use cases. We will deploy a serverless application, inject a failure, and then analyze the insights and remediation provided by Amazon DevOps Guru. Another use case will cover identifying anomalies in CPU, memory, and networking within an Amazon Elastic Kubernetes Service (EKS) cluster.
This chapter contains the following main sections:
To implement the solutions mentioned in this chapter, you will need an AWS account with DevOps Guru enabled.
The GitHub repository for this chapter can be found here:
https://github.com/PacktPublishing/Accelerating-DevSecOps-on-AWS/tree/main/chapter-10
AI is a hot topic everywhere and, unlike in the past, many people now genuinely understand what AI means and how it is applied in real-life scenarios and use cases. Before jumping into AIOps, we first need to set the context by understanding AI and ML.
AI is the broadest term and has been around for decades as a research topic. AI allows a computer to perform tasks that normally require human intelligence. ML is a subset of AI that solves specific tasks by learning from patterns and data and making predictions without being explicitly programmed. This phrase, without being explicitly programmed, is the key factor that opens up huge opportunities to transform how we do IT operations today. ML is one approach to AI that has recently become popular thanks to big data and cheap compute through cloud computing. Broadly, there are two types of ML algorithms:
Different types of input data are usually a better fit for one approach or the other and result in different end user experiences, but ultimately, supervised versus unsupervised is just an implementation detail. What impacts the IT operations team every day is a different distinction: the open box versus black box approach to ML.
A black box approach means an algorithm where you provide an input and get an output, but you don't know how that output was arrived at. In some cases, this has serious implications; for example, a credit card company might use ML to make credit decisions, and regulations may require it to be able to go back and explain to an applicant why the algorithm reached a certain decision.
An open box approach to ML is easy to explain, and for the IT operations team, this openness is an important property for improving adoption and outcomes. In brief, open box ML is explainable, editable, and insightful, as follows:
So, there are three factors that are important when adopting ML-based tools for IT operations: transparency, control, and trust.
AIOps is a term coined by Gartner; it originally meant algorithmic IT operations, but was later changed to artificial intelligence for IT operations. It is a label for the category of tools that apply ML capabilities to the IT operations space. There are three main categories in IT operations where we can apply ML:
The following diagram shows the workflow of AIOps:
But why do we need AIOps? Everything is working fine right now. What are the challenges that lead us to use AIOps-powered tools?
In the past, IT teams were very internally oriented: they mostly thought about what they could accomplish with the technology, how they could reduce and control cost structures, and how to organize all the processes within IT to be the most efficient organization.
But in the age of digital transformation, IT teams are more focused on where the business is going. Instead of thinking about cost reduction, they are focusing on revenue generation, and there are many changes happening throughout the business. Software development organizations are incorporating automated processes such as Infrastructure as Code (IaC) to spin up application infrastructure and CI/CD (Continuous Integration/Continuous Delivery or Deployment) to deliver new features quickly. We are also seeing big changes in infrastructure models with the public cloud and hybrid cloud.

All these changes are creating a seismic shift. Ten years ago, we had fewer tools and application delivery was slower; nowadays, we have a great many tools, and new features are released all the time. This creates a lot of noise, issues, and maintenance overhead, and we can't simply employ more and more people to resolve and maintain it all. So the question remains: what is happening to IT and IT operations, and what sort of changes are required in IT operations to support this large infrastructure, smooth application delivery, and its ongoing maintenance?
IT operations teams must keep up with a more dynamic IT landscape, including containers, serverless functions, microservices, cloud-native architectures, and public and hybrid cloud deployments. They must also support innovation and digital transformation while still maintaining legacy and traditional infrastructure, because IT operations is additive: previous responsibilities never go away. At this point, IT operations teams need to leverage automation and ML-based tools and capabilities to ensure that they can handle the scale and complexity.
IT operations teams are already using some tools such as rule-based engine automation, which works for now, but it just doesn't handle all scenarios. The rule-based tools work in the following manner:
But the problem is that, at scale, you end up with tens of thousands of those rules. Managing thousands of rules, and keeping each rule aligned with the subset of data it targets, becomes a problem of its own. At this juncture, ML can help.
As the data comes to your system, ML can learn about your environment. It can leverage all the institutional knowledge to handle the known knowns, but also leverage the intelligence in the system to be able to handle the unknown unknowns at a large scale. It can autonomously respond 24 hours a day, 365 days a year. Most importantly, if you are using open-box ML, you can control and modify the logic as well.
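The contrast between a hand-tuned rule and a learned baseline can be sketched in a few lines of Python. This is a toy illustration, not how any particular AIOps product works; the CPU figures and the three-sigma cut-off are invented for the example:

```python
# Toy illustration: a hand-tuned threshold rule vs. a simple learned baseline.
# The CPU figures and the three-sigma cut-off are invented for the example;
# real AIOps tools use far richer models than this.

def rule_based_alert(cpu_percent, threshold=90):
    """Classic rule engine: fire only when a hand-tuned threshold is crossed."""
    return cpu_percent > threshold

def learned_alert(history, new_value, sigmas=3):
    """Baseline approach: flag values far outside what this system normally does."""
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / len(history)
    std = variance ** 0.5
    if std == 0:
        return False
    return abs(new_value - mean) > sigmas * std

history = [40, 42, 41, 39, 43, 40, 41, 42]  # normal CPU% for this workload
print(rule_based_alert(55))        # False: 55% never trips the 90% rule
print(learned_alert(history, 55))  # True: 55% is abnormal for THIS system
```

Note how the rule misses the anomaly entirely, while the baseline flags it, because the baseline was learned from this system's own data rather than set by hand.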
Now that we understand how ML-based tools or AIOps tools help IT operations teams, let's discover how AIOps works.
AIOps tools comprise three fundamental technology components: big data, ML, and automation.

An AIOps tool uses a big data platform to aggregate siloed IT operations data in one place. This data can include historical performance, event data streamed from real-time operations, system logs and metrics, and network data.

AIOps then applies focused analytics and ML capabilities to separate significant event alerts from the noise. It uses techniques such as rule application and pattern matching to comb through the operational data, separating the signal from the noise and surfacing significant abnormal event alerts.

Next, it helps identify and propose solutions. It uses industry-specific or environment-specific algorithms to correlate abnormal events with other event data across environments, identify the cause of an outage or performance problem, and eventually suggest remedies.

AIOps also automates responses for real-time, proactive resolution. It automatically routes alerts and recommended solutions to the appropriate IT teams, or even creates response teams based on the nature of the problem and the solution. In many scenarios, AIOps uses the results from ML to trigger automatic system responses that address problems in real time, before IT teams are even aware of them.

Finally, AIOps learns continuously and independently in order to improve how it handles future problems. It uses the results of its analytics to adjust algorithms, or create new ones, to identify problems even earlier and recommend more effective solutions.
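As a rough illustration of the noise-reduction and correlation stages just described, here is a toy Python sketch. The event shape, service names, and 60-second correlation window are all invented for the example:

```python
# Toy sketch of the pipeline stages described above: take a stream of raw
# events and correlate related ones into a single incident. The event shape,
# service names, and 60-second window are invented for illustration.

def correlate(events, window=60):
    """Group events for the same service within `window` seconds into incidents."""
    incidents = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        for inc in incidents:
            if inc["service"] == ev["service"] and ev["ts"] - inc["last_ts"] <= window:
                inc["events"].append(ev)   # close in time: same incident
                inc["last_ts"] = ev["ts"]
                break
        else:
            # no nearby incident for this service: open a new one
            incidents.append({"service": ev["service"], "last_ts": ev["ts"], "events": [ev]})
    return incidents

events = [
    {"ts": 0,   "service": "checkout", "msg": "latency high"},
    {"ts": 30,  "service": "checkout", "msg": "5xx spike"},
    {"ts": 500, "service": "checkout", "msg": "latency high"},
]
print(len(correlate(events)))  # 2: the first two events collapse into one incident
```

A real AIOps platform applies far more sophisticated correlation (topology, causality, learned patterns), but the principle is the same: many raw events in, few actionable incidents out.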
AIOps encompasses a broad category of tools addressing the challenges that IT operations teams face today, such as the following:
There are some excellent companies providing effective AIOps tools, such as Splunk, Moogsoft, and BigPanda. AWS has recently announced its own AIOps service, Amazon DevOps Guru, which is at an early stage but improving rapidly. In the next section, we will learn more about Amazon DevOps Guru and how we can use it to improve IT operations.
AWS has been listening to customer feedback and acting on it by providing excellent managed services. One of the recently launched services is Amazon DevOps Guru, which is powered by ML to improve an application's operational performance and availability. It helps detect behaviors that deviate from normal operating patterns so that you can identify operational issues long before they impact your application.
DevOps Guru uses ML models informed by years of Amazon and AWS operational excellence to identify anomalous application behavior (for example, resource constraints, error rates, and increased latency) and raise critical issues before they cause outages or service disruptions. Once DevOps Guru identifies an issue, it automatically sends an alert along with a summary of the related anomalies, the likely root cause, the timestamp, and the location where the issue occurred. It also provides recommendations on how to remediate the issue.
DevOps Guru has the following main features:
DevOps Guru is a managed service, so you can enable it with a single click; it doesn't need any additional software to deploy and manage. Before performing any operations with DevOps Guru, let's get familiar with some terminology related to it:
In the next section, we will deploy a container-based application on an EKS cluster and detect the anomalies using DevOps Guru.
Because of the vast number of abstractions and layers of supporting infrastructure, observability in a container-centric system poses new challenges for operators. In many companies, hundreds of clusters and thousands of services, tasks, and pods operate concurrently. This section will demonstrate new features in Amazon DevOps Guru that help simplify and extend the operator's capabilities: anomalies are grouped by metric and container cluster for better context and easier access, and more Amazon CloudWatch Container Insights metrics are supported.
We will first deploy an EKS cluster using eksctl and then deploy the OpenTelemetry Collector to aggregate all the metrics and provide them to CloudWatch. Then, we will enable DevOps Guru for the EKS cluster resources.
Perform the following steps to enable DevOps Guru on EKS cluster resources:
$ eksctl create cluster --name=dgo-cluster --nodes=1
$ curl https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml | kubectl apply -f -
$ kubectl get pods -n aws-otel-eks
NAME                    READY   STATUS    RESTARTS   AGE
aws-otel-eks-ci-rgt4l   1/1     Running   0          1h
This way, we have enabled DevOps Guru on an EKS cluster. DevOps Guru will basically monitor and analyze the following metrics to detect the anomalies:
In the next section, we will deploy an application and inject a failure in it and see how DevOps Guru analyzes and provides the recommendation to resolve the anomalies.
In this section, we will deploy an application that we deployed in previous chapters, and then we will inject a failure and validate how DevOps Guru analyzes those anomalies and provides the recommendation to resolve the anomalies.
Perform the following steps:
$ curl -o starkapp.yaml https://raw.githubusercontent.com/nikitsrj/kube-app-golang/master/kube-deployment.yml
$ kubectl create -f starkapp.yaml
$ kubectl get svc web
NAME   TYPE           CLUSTER-IP   EXTERNAL-IP                                                                PORT(S)        AGE
web    LoadBalancer   10.100.1.2   a4ce7bd79683043d5ae122943d00d2c4-1894525087.us-east-1.elb.amazonaws.com   80:31536/TCP   85m
The following screenshot shows the application UI:
$ kubectl edit deployment db -o yaml
$ watch -n 1 curl http://<URLOFWEB>
$ watch kubectl get pods -A
The following screenshot shows the db pod status:
You can also check the application in the browser. The words in the application box will be missing:
We just saw how DevOps Guru caught anomalies in an insight and provided a recommendation. We can configure this step in CI/CD to have a strong AIOps solution alongside application delivery. In the next section, we will deploy a serverless application and enable DevOps Guru on the serverless stack.
In this section, we will deploy a serverless application using a CloudFormation template. This CloudFormation template will create a few resources, as follows:
This CloudFormation template has been cloned and modified from the AWS DevOps Guru sample repository provided by AWS; we will use it, with some modifications, only to create the application-related AWS resources. After that, we will enable DevOps Guru on the application stack.
Perform the following steps to deploy the stack:
$ sudo yum install jq -y
$ sudo yum install python36 -y
$ sudo python3 -m pip install --upgrade pip
$ pip3 install requests
$ git clone https://github.com/PacktPublishing/Accelerating-DevSecOps-on-AWS.git
$ cd Accelerating-DevSecOps-on-AWS/chapter-10
Resources:
  ShopsTableMonitorOper:
    Type: AWS::DynamoDB::Table
    Properties:
      KeySchema:
        - AttributeName: name
          KeyType: HASH
      AttributeDefinitions:
        - AttributeName: name
          AttributeType: S
      ProvisionedThroughput:
        ReadCapacityUnits: 1
        WriteCapacityUnits: 5
$ aws cloudformation create-stack --stack-name gdo-serverless-stack --template-body file://cfn-cartoon-code.yaml --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
The following command gives you the DynamoDB table name:
$ dynamoDBTableName=$(aws cloudformation list-stack-resources --stack-name gdo-serverless-stack | jq '.StackResourceSummaries[]|select(.ResourceType == "AWS::DynamoDB::Table").PhysicalResourceId' | tr -d '"')
$ sudo sed -i s/"<YOUR-DYNAMODB-TABLE-NAME>"/$dynamoDBTableName/g cartoon-shops-dynamodb-table.json
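To make the jq filter in the command above easier to follow, here is an equivalent sketch in Python against a hand-written sample payload. The resource IDs below are made up for illustration; the real PhysicalResourceId comes from your own stack:

```python
import json

# Sample payload shaped like `aws cloudformation list-stack-resources` output;
# the resource IDs here are made up for illustration.
payload = json.loads("""
{
  "StackResourceSummaries": [
    {"ResourceType": "AWS::Lambda::Function", "PhysicalResourceId": "list-fn"},
    {"ResourceType": "AWS::DynamoDB::Table",  "PhysicalResourceId": "ShopsTable-ABC123"}
  ]
}
""")

# Equivalent of:
#   jq '.StackResourceSummaries[]
#       | select(.ResourceType == "AWS::DynamoDB::Table")
#       | .PhysicalResourceId'
table_name = next(
    r["PhysicalResourceId"]
    for r in payload["StackResourceSummaries"]
    if r["ResourceType"] == "AWS::DynamoDB::Table"
)
print(table_name)  # ShopsTable-ABC123
```

In other words, the shell pipeline simply picks out the physical ID of the one DynamoDB table resource in the stack, which sed then substitutes into the seed-data file.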
$ aws dynamodb batch-write-item --request-items file://cartoon-shops-dynamodb-table.json
{
"UnprocessedItems": {}
}
$ aws cloudformation describe-stacks --stack-name gdo-serverless-stack --query 'Stacks[0].Outputs'
[
{
"OutputKey": "ListRestApiEndpointMonitorOper",
"OutputValue": "https://unuoirftn0.execute-api.us-east-1.amazonaws.com/prod/",
"ExportName": "ListRestApiEndpointMonitorOper"
},
{
"OutputKey": "CreateRestApiEndpointMonitorOper",
"OutputValue": "https://lq8d1zs36m.execute-api.us-east-1.amazonaws.com/prod/",
"ExportName": "CreateRestApiEndpointMonitorOper"
}
]
The following screenshot shows the application output in the browser:
Let DevOps Guru analyze the application for at least 2 hours to establish a baseline before injecting any failure; sometimes it takes 4-5 hours. It is very important for DevOps Guru to understand the normal working metrics of the application. In the next section, we will integrate DevOps Guru with Systems Manager OpsCenter.
Whenever an insight occurs within DevOps Guru, it is recommended that at least one engineer is aware of its status. With the integration of DevOps Guru and OpsCenter, the response team can easily be notified with an OpsItem whenever a new insight occurs in DevOps Guru.
AWS has made it simple when it comes to integration between the services. To enable DevOps Guru to create an OpsItem in OpsCenter, go to the DevOps Guru console, click on Settings, and click on the checkbox under Service: AWS Systems Manager:
Any new insight will automatically lead DevOps Guru to create an OpsItem in OpsCenter. In the next section, we will inject a failure and analyze the recommendation from DevOps Guru.
In the previous section, we enabled DevOps Guru on the gdo-serverless-stack serverless application. We also enabled the integration between DevOps Guru and Systems Manager OpsCenter. Now, let's inject some failure and see the insight.
Perform the following steps to inject the failure:
$ python3 sendAPIRequest.py
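The sendAPIRequest.py script in the repository generates load against the API. As a rough, hypothetical sketch of what such a load generator does (the function name and structure here are my own, not the actual script), it repeatedly calls the List endpoint and tallies the responses:

```python
import urllib.request

def send_requests(url, count, opener=urllib.request.urlopen):
    """Fire `count` GET requests at `url` and tally HTTP status codes.
    Hypothetical sketch -- not the actual sendAPIRequest.py from the repo."""
    statuses = {}
    for _ in range(count):
        try:
            with opener(url) as resp:
                code = resp.status
        except Exception:
            code = 0  # network errors and HTTP failures are counted under 0
        statuses[code] = statuses.get(code, 0) + 1
    return statuses

# Example usage (placeholder URL -- substitute your stack's List endpoint):
# print(send_requests("https://<your-api-id>.execute-api.us-east-1.amazonaws.com/prod/", 300))
```

With ReadCapacityUnits set to 1, a burst like this throttles the DynamoDB table, the Lambda functions return errors, and the API surfaces 5XX responses, which is exactly the anomaly DevOps Guru picks up.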
So far, we have seen that DevOps Guru created an insight because of a 5XX error; the reason was that the ReadCapacityUnits value for the DynamoDB table was 1. Let's change it to 5 and let DevOps Guru baseline for a while. Then, we will change ReadCapacityUnits back to 1, inject the traffic again, and see what sort of insight and recommendation DevOps Guru produces. Perform the following steps to change ReadCapacityUnits to 5 and inject traffic:
Resources:
  ShopsTableMonitorOper:
    Type: AWS::DynamoDB::Table
    Properties:
      KeySchema:
        - AttributeName: name
          KeyType: HASH
      AttributeDefinitions:
        - AttributeName: name
          AttributeType: S
      ProvisionedThroughput:
        ReadCapacityUnits: 5
        WriteCapacityUnits: 5
$ aws cloudformation update-stack --stack-name gdo-serverless-stack --template-body file:///$PWD/cfn-cartoon-code.yaml --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
$ python3 sendAPIRequest.py
So, you can see that DevOps Guru also analyzes the changes made before and after an insight and gives its recommendations accordingly.
DevOps Guru is still at an early phase, but AWS is continuously adding new features to the service. Recently, metrics such as database load and wait time were added to analyze RDS anomalies. DevOps Guru is a black box AIOps tool, but it offers many other strengths, such as seamless integration with other AWS services, which are good reasons to opt for it.
AIOps is an important emerging approach for IT operations teams to manage their activity. In this chapter, we looked at the need for AIOps tools and how they help revolutionize IT operations. We learned some important ML concepts, which also help you assess which AIOps tool to use based on the open box and black box approaches. We learned about the managed AIOps service Amazon DevOps Guru and enabled it on a containerized application. We also enabled DevOps Guru on a serverless application and integrated it with OpsCenter to create an OpsItem for each generated insight. After reading all the chapters of this book, you should be able to create successful, secure, and intelligent CI/CD pipelines for applications as well as infrastructure.