We're getting to the last section on basic operations, or BaseOps, in multi-cloud environments. This chapter is about monitoring your multi-cloud environment. How do we keep track of the health of our cloud resources? We will be using the native monitoring tools of different cloud providers, but we'll also explore some tools that provide a single-pane-of-glass view—one overall dashboard where we have a unified view of our entire environment.
In this chapter, we will first define the monitoring and management processes, before we take a look at the different tools of Azure, AWS, and Google Cloud Platform. We will also learn about what tools we can use to manage our environments. We will get a deeper understanding of what we should monitor in the seven layers of the Open Systems Interconnection (OSI) model and how we can consolidate monitoring data so that it becomes relevant to the business. Lastly, we will briefly explore two platforms that offer a single-pane-of-glass view: ServiceNow and BMC Helix.
In this chapter, we're going to cover the following topics:
This section of the book is all about BaseOps, and we will be looking at the Cloud Adoption Framework offered by the major cloud providers and hyperscalers—Azure, AWS, and Google Cloud Platform. These frameworks share a lot of basic principles. We can identify five domains in the frameworks:
In the previous chapters, we covered governance, service design, policies, implementing landing zones in different clouds, resilience, and performance. Lastly, we learned how we can automate the deployment and management of our environments. We now have one final thing to discuss in terms of BaseOps, and that's monitoring.
Monitoring is crucial in the professional management of IT systems. It doesn't make sense to monitor everything in your environment, but that's what a lot of companies still do: they simply turn monitoring on and then wait until the alerts come in. But in a larger enterprise estate in a multi-cloud environment, the overwhelming number of alerts would likely make it difficult to extract the most important warnings and critical events that need attention. As with everything that we have already discussed, we need a plan. That plan needs to come from the business requirements: they set the stage. These requirements define how we have to manage our systems in the cloud and what we should monitor.
It helps to think in layers, as we did when we discussed automation in Chapter 8, Defining Automation Tools and Processes. If we're talking about end-to-end monitoring, then we mean that we are monitoring from the perspective of the end user: this type of monitoring views all layers. The top layer is the business layer, where the end user sits and kicks off the business process. That can be any process, but typically it's a transaction on a frontend system, such as a web page. That transaction has to be processed by a function that is embedded in an application. This is the second layer: the application and the interfaces of the application to databases and other systems that it communicates with in order to process the transaction.
Applications, databases, and interfaces use infrastructure—virtual machines, a network, storage, or a container platform. This is the infrastructure layer, but Platform as a Service (PaaS) is also involved in this layer. Infrastructure consists of different components, something that we learned in the previous chapter. The final layer is the components or resources layer. The following diagram shows the conceptual model of the layers. You will find that it overlaps with figure 8.1 in Chapter 8, Defining Automation Tools and Processes:
For each of these layers, we can have separate monitoring tools. However, we want to have the end user's view. Resources might be up and running and performing well, but if something's wrong in the application layer, the end user will be confronted with a failing system—even though the monitoring of our resources tells us that everything is fine. It's a common joke in IT: all lights green, yet nothing works. Hence, we want event monitoring or the aforementioned end-to-end monitoring: monitoring systems that can correlate between the different layers and provide us with a full-stack view.
The main topics that we want to monitor in the stack are explained in the following sections.
Sometimes this is referred to as heartbeat monitoring, and basically, that's exactly what it is. We want to know whether the cloud platform on which we are hosting our environment is still there and healthy. Are systems and services still running correctly, for instance? Hyperscalers have dashboards that provide us with insights as to the status of their platforms. At https://status.aws.amazon.com/, we can check the status of all services in all regions of AWS, seeing whether they are running normally and whether services are suffering from incidents. Azure has a similar dashboard at https://status.azure.com/en-us/status. Google provides the health status of Google Cloud Platform at https://status.cloud.google.com/.
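Underneath such dashboards, the core idea of heartbeat monitoring is simple: a service that hasn't reported in within a timeout window is considered unhealthy. The following is a minimal sketch of that idea, with hypothetical service names and timestamps:

```python
HEARTBEAT_TIMEOUT = 60  # seconds without a heartbeat before a service counts as down

def check_heartbeats(last_seen, now, timeout=HEARTBEAT_TIMEOUT):
    """Return a dict of service -> 'healthy' or 'unhealthy' based on heartbeat age."""
    return {
        service: "healthy" if now - ts <= timeout else "unhealthy"
        for service, ts in last_seen.items()
    }

now = 1_000_000
last_seen = {"web-frontend": now - 10, "payment-api": now - 300}
print(check_heartbeats(last_seen, now))
# {'web-frontend': 'healthy', 'payment-api': 'unhealthy'}
```

Real platforms do this at enormous scale, per service and per region, but the health verdict on the status pages boils down to the same kind of liveness check.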
Now, if you check the health status of, for instance, Azure, via the dashboard, you will find that services in some regions are not reported. The explanation for this is very simple: Azure doesn't offer all services in every region. The following screenshot shows that CloudSimple and large instances for the large in-memory ERP system SAP HANA are not provided in Middle Eastern and African data centers of Azure:
The preceding screenshot shows an example of the Amazon Managed Streaming for Apache Kafka service in the Europe region of AWS, running in the data centers of Frankfurt, Ireland, London, and Paris.
Performance is related to health but has more to do with the responsiveness of systems. The responsiveness of a system determines its performance level. When is a system slow? To be able to answer that question, we have to define what acceptable performance is. It comes back to what the business requirements are. Say that the expectation is that if someone pays for an article on a website, the transaction must go through in seconds. Requesting a complex calculation regarding a star orbiting somewhere in outer space, though, might take some time. Accuracy of processing in both instances is important, but the actual processing times will differ greatly. Both indicators—the quality of the processing outcome and the processing time—are important in terms of defining cloud performance and determine how we should monitor these indicators.
As part of implementing the guardrails of the Cloud Adoption Framework, we have designed a model for Role-Based Access Control (RBAC). RBAC defines who is authorized to perform certain actions against specific conditions: who is allowed to do what, when, and how. It's tightly related to security, but governance monitoring does a bit more than that. It also monitors the usage of APIs within cloud environments and checks what processes are triggered to get access to systems. For example, if an administrator—or a system API calling for a system account—creates a new account, monitoring will follow the authorization path before the account gets acknowledged.
Security monitoring is more or less self-explanatory. This is about checking for vulnerabilities in our environments and getting alerts if a vulnerability is spotted. Security monitoring is about prevention and defense: preventing intrusion by viruses, malicious access, or traffic patterns trying to flood systems so that they eventually collapse, as in Distributed Denial of Service (DDoS) attacks. If someone or something manages to get through defensive lines, then mitigating actions should be triggered immediately to control the damage, for instance, automatically shutting down systems.
We want to know what we use in the cloud in terms of resources, virtual machines, network, storage, and services. For example, we can monitor the utilization of the processors (CPUs) that we use in our virtual machines. If we measure the utilization over a long period of time, we can analyze this data and see whether we need to adjust the sizing of the machines. If CPUs are under-utilized, we might want to downscale and save costs. Alternatively, if performance monitoring shows that systems respond slowly, we should analyze whether CPUs are being over-utilized. In that case, we should consider scaling our system out or up. These analytics will also reveal trends in the capacity we can expect to require in the near and more distant future.
Now that we have defined what we can and maybe even must monitor in cloud environments, it's time to explore the tooling that cloud providers have to offer. The question is, how can we consolidate all this data from monitoring systems into a single-pane-of-glass view? We will see how in the next sections.
In this section, we will first study the native monitoring that Azure, AWS, and Google Cloud Platform have to offer. After that, we will take a brief look at some other popular end-to-end monitoring systems that are on the market and are more cross-cloud.
Before we dive into the tools, we should get a high-level understanding of how monitoring works. Typically, these tools work with agents that collect data on the health and performance of resources. This is often raw data that is compiled into a more comprehensible format so it can be analyzed. From there, it gets visualized, for instance, in graphical presentations in dashboards that can be viewed from a console.
Monitoring can also lead to triggers: a system can suffer from malfunctions or other issues. In that case, the monitoring service will send out an alert that triggers a response. That response can be either to start a scaling process if systems run out of resources such as processor power or memory capacity, or an automated process to start a failover mechanism.
The following diagram shows a high-level overview of basic monitoring functionality:
Azure Monitor is available from the Azure portal at https://portal.azure.com. The key components of Azure Monitor are metrics and logs, the two types of data stores in Azure Monitor. This is the place where Azure Monitor collects all the data that it can retrieve from resources and services that are used in Azure. Metrics contain real-time data on the status of resources, and logs are collections of data that can be used to trace events. To analyze logs, you can use Log Analytics to get deeper insights into the performance of resources and services. Log Analytics is a separate module in Azure that is used to analyze the collected log data.
Azure Monitor collects a lot of data that can be immediately viewed from the dashboard in the Azure portal. Azure Monitor can monitor resources that have been deployed both within Azure and outside Azure, including other cloud platforms and on premises.
Next, there are additional services that can widen the monitoring scope of Azure Monitor. Two important services are Application Insights and Azure Monitor for containers. Application Insights is a separate service in Azure that monitors applications that are hosted in Azure, but it can be used for on-premises applications too. It keeps track of the operations of applications and reports failures in web application code and connected services.
We are moving from the more traditional environments of virtual machines to containers, and of course Azure Monitor is prepared for that. This service collects metrics on the usage of processors and memory for Kubernetes resources such as nodes, controllers, and the containers themselves. Azure Monitor for containers connects to Azure Kubernetes Services (AKS) for this. However, be aware that it only monitors the infrastructure that is used for containers. It does not monitor the contents of the container; for that, you will need different tooling.
In terms of the management of the environments in Azure, there's an offering that deserves to be discussed. In 2019, Microsoft launched Azure Lighthouse. The challenge that a lot of enterprises have is that they don't have just one tenant in a public cloud. Typically, an enterprise will have multiple tenants and subscriptions, just as they have multiple divisions, business groups, or delivery units. The problem they then face is monitoring and managing all these different tenants and subscriptions. For this, Lighthouse is the solution.
Lighthouse offers centralized management and monitoring across multiple tenants and subscriptions. It's basically the single pane of glass for all environments that an enterprise runs in the Azure cloud.
More information on Azure Monitor can be found at https://docs.microsoft.com/en-us/azure/azure-monitor/overview.
AWS CloudWatch is available through the CloudWatch console at https://console.aws.amazon.com/cloudwatch/. The key components are comparable to Azure Monitor's. AWS CloudWatch uses metrics and statistics, although the core is the metrics repository. Data from resources is collected in metrics and translated into graphical overviews that are presented on the console. CloudWatch collects a lot of metrics, from almost every kind of resource that AWS offers. Obviously, there are metrics on compute, storage, and networks in AWS, but also on services such as data and video streams with Kinesis and IoT environments in AWS, as well as machine learning with SageMaker.
CloudWatch integrates with a number of other services in AWS, of which Simple Notification Service (SNS) and Elastic Compute Cloud (EC2) auto-scaling are the most important ones to mention. SNS is basically the pub/sub messaging mechanism in AWS. It can trigger the auto-scaling process to increase the capacity of resources.
As with Azure, AWS will start collecting metrics as soon as we deploy our first account on the platform. A lot of AWS services automatically call the CloudWatch API to start monitoring. This applies to compute in EC2, storage in Elastic Block Store (EBS), and database instances in Relational Database Services (RDS). However, we still need to configure the monitoring in terms of how we want metrics to be presented, at what frequency, and what type of alerts we find to be valuable. In AWS CloudWatch, we can create alerts for specific items.
One service needs to be highlighted, since it's the solution for end-to-end monitoring in AWS: CloudWatch Synthetics. In Synthetics, we create canaries: scripts that simulate the processes an end user would execute in AWS.
In terms of management, AWS offers Control Tower. It's the centralized management console for all accounts that an enterprise enrolls in AWS, making sure that all of these accounts are consistent and compliant with the frameworks that the enterprise has to adhere to. The landing zone that we discussed in Chapter 6, Designing, Implementing, and Managing the Landing Zone, is part of Control Tower. It holds all organization units, users, accounts, and all the resources that are subject to compliance and security regulations within the enterprise and that are enforced to environments in AWS.
To keep track of all these items in AWS, the platform offers guardrails. We can have preventative and detective guardrails. These are rules that make sure that units, accounts, users, and resources are always consistent with the policies that have been set for environments. These rules can have different labels: mandatory, strongly recommended, and elective.
Control Tower comes with a dashboard so that the team that is concerned with the overall management of the systems has a single-pane-of-glass view encompassing all environments that have been deployed to AWS.
More information on AWS CloudWatch can be found at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html. More information on Control Tower can be found at https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html.
Lastly, we will look at monitoring in Google Cloud Platform, before we discuss some other important tools in terms of monitoring and managing container platforms.
Cloud monitoring in Google Cloud Platform is really easy. You don't need to install or configure anything: as soon as you start using systems or services in Google Cloud Platform, Cloud Monitoring starts collecting metrics. This includes fully managed services such as Cloud Run, Google's serverless container platform.
Cloud Monitoring in Google Cloud Platform has one more service to offer, and that is Cloud Run on GKE. GKE is Google Kubernetes Engine. Google offers a separate service to monitor the Kubernetes clusters where you host containers. The service is actually named Cloud Monitoring for Anthos, since it targets the Anthos stack of Google.
Anthos—a suite of Google services that runs on VMware vSphere or Hyper-V instances—helps in transforming legacy applications on virtual machines to containers so that they can run on Kubernetes. Anthos is a truly hybrid, multi-cloud environment, since you can manage the containers on Anthos from other clouds as well, such as AWS. It's again an indicator that Google Cloud Platform is much more focused on cloud-native and containerization than other major providers.
Cloud Monitoring is available from the console at https://console.cloud.google.com/getting-started. You don't need to configure the monitoring, but you can add custom metrics. This service uses OpenCensus, an open source monitoring library. Here we can add specific metric data that we want to retrieve from monitoring our resources in Google Cloud Platform. If we want to do so for container monitoring, we have to do the same with Knative metrics. Knative is a Kubernetes-based platform used to deploy and manage modern serverless workloads.
So, Google Cloud Platform's Cloud Monitoring is a fully integrated service that we don't need to configure. However, we can get more out of it if we add metrics with uptime checks and alerts.
Cloud Monitoring for Google Cloud Platform and AWS is part of Google Cloud's operations suite, which was formerly called Stackdriver. The operations suite also contains Cloud Logging, Error Reporting, and developer tools such as Cloud Debugger and Cloud Trace, which we discussed in Chapter 8, Defining Automation Tools and Processes.
More information on Cloud Monitoring in Google Cloud Platform can be found at https://cloud.google.com/run/docs/monitoring.
The world is moving more and more toward containers, and that's something that we haven't discussed in depth in terms of management. How can we manage cross-cloud and multi-cloud containers? One product that should be mentioned here is VMware's Tanzu, and particularly Tanzu Mission Control.
We have seen that all major cloud providers have their own implementation of container orchestration. They all use Kubernetes to launch and control cluster nodes to host containers, but they all have their own flavor: Azure has AKS, AWS has EKS, Google works with GKE, and on VMware platforms, we can use Pivotal Container Service (PKS). The major advantage of containers is that, because they don't rely on their own operating system, they can be truly cloud independent. But how do we control Kubernetes environments cross-cloud?
VMware monitors and operates the different Kubernetes environments on all these aforementioned clouds with Tanzu, and more specifically Tanzu Mission Control, which was introduced in 2019. We can attach Kubernetes environments in any cloud—Azure, AWS, or Google Cloud Platform, to name the big ones—to Tanzu, and from there we have centralized management of identity, centralized access to the clusters, centralized policy management so that all our clusters operate in the same way, and one monitoring system watching over all our clusters.
More information on Tanzu Mission Control can be found at https://tanzu.vmware.com/mission-control.
We have already introduced the term end-to-end monitoring. What does it mean? Typically, we mean that the monitoring is looking at systems from the end user's perspective. To understand this, we have to understand the OSI model. That model contains seven layers. The following diagram represents the model:
Now, let's explain what's happening in these layers in a bit more detail, to get a better understanding of what each layer represents. Just a note: these are technical layers and not the layers that we talked about in the first section of this chapter. There, we talked about three layers at a very high level: business, applications, and technical. That corresponds with The Open Group Architecture Framework (TOGAF) that we explored in Chapter 5, Successfully Managing an Enterprise Cloud Architecture; the OSI model is really about the technology stack:
What does end-to-end monitoring do? It follows a transaction from layer 7 all the way down to layer 1, retrieving metrics in all the layers it traverses. Typically, the monitoring mechanism will issue a transaction from layer 7 through the stack and measure the performance of this transaction. If the transaction fails, the monitoring mechanism can determine where it failed and why. If we go one step further than just the monitoring mechanism, then we can imagine that the monitoring mechanism will trigger processes to mitigate the failures: that's where automation comes in. Ultimately, we have systems that are able to predict and prevent failures because they actually learn from the information they receive from the monitoring agents. Then, we're talking about AIOps, something that we will cover in Chapter 19, Optimizing Multi-Cloud Environments with AIOps.
There's a wide range of products available when we're looking at end-to-end monitoring. It would take another book to name them all, but examples include Lakeside, Splunk, Datadog, and CheckMK. All these suites have products that target cloud environments, all from the end user's perspective. For instance, Lakeside offers SysTrack Cloud Edition for this; CheckMK is a popular open source monitoring environment for infrastructure and applications.
Splunk and Datadog are a bit different and are more in the league of AIOps. Splunk Cloud claims to be the monitoring environment that truly enables operational intelligence. Splunk Cloud is cross-cloud and works across business use cases. A use case could be anti-fraud, where we have to combine data from different sources to detect fraud. The monitoring tool of Splunk will collect data that might hint at fraud operations in cloud environments. The power is in the search engine that Splunk uses, the Search Processing Language (SPL). You can ask the monitoring system to correlate data from different systems to gain insight into a full chain of application delivery and its performance.
The big question is this: when is data from monitoring relevant to the business? It doesn't make sense to inform a business leader about the performance of CPUs in virtual machines, but it does make sense to inform them when system capacity is lacking and hindering the speed of processing transactions. In that case, the business might lose money since transactions might be processed too slowly or, worse, dropped because of timeout failures.
When is data relevant to a business? In short, data should enable business decisions. Deploying extra virtual machines or scaling out environments are not business decisions. These are technical decisions. A business decision would be to launch a new product at a given moment. In that case, we should know whether our environment is ready for that. From monitoring data, we should analyze how our systems have been performing with the existing product portfolio. Would systems have enough capacity to absorb extra traffic? Such questions and their answers drive architecture: if we find from monitoring data that systems are not ready from a technological point of view or are not expected to be able to absorb extra load, then we might have to re-architect systems.
One of the major pitfalls in monitoring is that a lot of companies treat it as reactive. Monitoring is, in that case, just a tool that starts alerting when systems fail. But by then, we are already far too late. The business might already be impacted on a large scale.
Before the end user starts to see that their requests are processed in a slower way, our monitoring system should alert about system components reaching certain capacity thresholds or interfaces that are suffering faults. We can do that by collecting a lot of data so that we know how systems respond under normal conditions, the baseline. Any deviation from those conditions will lead to an alert. These alerts can be proactive, so that we can adjust before something really breaks.
To summarize: the business is only interested in what happens at layer 7, the layer where the actual interactions between users and systems are. How quickly can end users access systems, how fast can transactions be processed, and how fast can new products be launched? To answer these questions, we have to collect a lot of data from our systems so that we know what the critical thresholds are and so that we can anticipate business demand.
The monitoring data must be easy to understand for business decision makers; for example, say that our current systems can hold an extra 10,000 visitors to the company's website per day. The rationale for such a statement should come from the monitoring data.
Monitoring is very important to get to the right decisions in development and operations, or DevOps. But monitoring is obviously also highly important in terms of financial reporting. That is part of Financial Operations, or FinOps, which is part of Chapter 13, Validating and Managing Bills.
We have come across the term single-pane-of-glass view a couple of times during this chapter. But what do we really mean by that? Typically, we mean that we have one console from which we can monitor and manage environments from multiple platforms. Imagine that we have cloud environments in Azure, AWS, and Google Cloud Platform. We might even have on-premises systems in privately owned data centers. If we want our system administrators to manage these environments, the chances are fairly high that they would need to log in to every single platform. For Azure, they would need to log in through the Azure portal, for AWS through the AWS portal, and so on. That is not very efficient.
The solution for this is to have one console where we can view the environments independently of the platform they run on and, even better, manage the environments from this single console. Imagine it like a Swiss Army knife: a tool that we can use for different purposes. Knife, screwdriver, scissors—but it's still only one tool.
Sounds fantastic, but the reality is that these tools are extremely complicated to develop and keep up to date with all the new features that are constantly being released on various cloud platforms. That one tool, providing the single pane of glass, will have to integrate with all the cloud platforms. APIs that enable this would have to be re-evaluated constantly. It's the reason why there are just a few tools on the market that can actually do this. Only two suites sit in the leader section of the Gartner Magic Quadrant: ServiceNow and BMC Helix. Of course, there are alternatives, such as Cherwell and Provance, but in the enterprise market, the share that ServiceNow and BMC Helix hold is dominant.
Both ServiceNow—Orlando being the most recent release at the time of writing—and BMC Helix provide a platform to perform IT services management, multi-cloud operations, multi-cloud cost management, the management of security and compliance policies in multi-cloud environments, and monitoring all in one suite. They integrate with the native tools of the major cloud platforms. For example, their APIs connect to Azure Monitor, CloudWatch, and Google's Cloud Monitoring to retrieve data from the cloud platforms and consolidate that data in the management platform integrated into ServiceNow or BMC Helix.
These suites have a large portfolio of modules that enterprises can use to run many services from one console. But there's one pitfall: these platforms will take care of all the essential services, but there will always be services that can't be viewed and managed from this one, integrated console—serverless functions, for instance. These multi-tool platforms will never capture every function for each cloud platform, which is absolutely not a shortcoming. It's simply something that we have to take into account.
In this chapter, we have learned what it takes to set up good monitoring by defining the monitoring process on different layers and by deciding what we should monitor in our environments. We have learned that it's better to have end-to-end monitoring in place, looking at systems the way the end user would experience the behavior of these systems.
We have studied the OSI model and have gained an understanding of how monitoring can retrieve data from the various layers. We have learned that we need to consolidate and interpret monitored data to make it valuable to a business, enabling it to be used to make business decisions. We now also should have an understanding of the concept of the single-pane-of-glass view.
We are now able to decide how we will monitor systems. We are also able to tell the difference between different monitoring systems and methods of monitoring. Lastly, we have learned about the various options that cloud providers offer and how we can use them.
This concludes our section on BaseOps. The next part of this book will address FinOps in multi-cloud environments. We will begin with license management in multi-cloud environments.