Chapter 12

Monitoring Strategies

Real-time monitoring is the new face of testing.

—Noah Sussman

Most cloud services are built to be always on, meaning the customer expects to be able to use the service 24 hours a day, 365 days a year. A considerable amount of engineering is required to build cloud services that provide the high levels of uptime, reliability, and scalability required to be always on. Even with a great architecture, it still takes a proactive monitoring strategy in order to meet the service level agreements (SLAs) required to deliver a system that does not go down. This chapter discusses strategies for monitoring cloud services.

Proactive vs. Reactive Monitoring

Many IT shops are accustomed to monitoring systems to detect failures. These shops track the consumption of memory, CPU, and disk space of servers and the throughput of the network to detect symptoms of system failures. Tools that ping URLs to check if websites are responding are very common, as well. All of these types of monitors are reactive. The tools tell us either that something is failing or that something is about to fail. Reactive monitoring focuses on detection. There should be a corresponding monitoring strategy for prevention.

The goal of proactive monitoring is to prevent failures. Prevention requires a different mind-set than detection. To prevent failures, we first must define what healthy system metrics look like. Once we define the baseline metrics for a healthy system, we must watch patterns to detect when data is trending toward an unhealthy system and fix the problem before our reactive monitors start sounding the warning bells. Combining both reactive and proactive monitoring is a best practice for implementing cloud services that must always be on. Proactive or preventive monitoring strives to find and resolve issues early, before they have a large impact on the overall system, and to increase the odds that issues are corrected before the customer is impacted.
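
To make the proactive half of that strategy concrete, here is a minimal sketch in Python. The metric, baseline, and thresholds are hypothetical; the point is simply to compare recent samples against a healthy baseline and raise a warning while there is still headroom, before a reactive alarm would ever fire.

  # Proactive check: warn when a metric trends away from its healthy baseline,
  # well before it crosses the hard (reactive) failure threshold.
  def proactive_check(samples, baseline, hard_limit, warn_ratio=0.8):
      """samples: recent measurements, oldest first (e.g., CPU utilization %)."""
      latest = samples[-1]
      trend = samples[-1] - samples[0]      # positive means the metric is climbing
      warn_level = baseline + warn_ratio * (hard_limit - baseline)
      if latest >= hard_limit:
          return "CRITICAL: reactive threshold breached"
      if latest >= warn_level and trend > 0:
          return "WARNING: trending toward failure, investigate now"
      return "OK"

  # Example: CPU utilization with a healthy baseline of 40% and a hard limit of 90%.
  print(proactive_check([42, 55, 63, 74, 82], baseline=40, hard_limit=90))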

What Needs to Be Monitored?

The purpose of monitoring is to verify that systems are behaving in line with expectations. Back in Chapter 11, “SLA Management,” we discussed that SLAs set an expectation between the cloud service provider and the cloud service consumer regarding the level of service that will be provided. To ensure that these SLAs are met, each SLA must be monitored, measured, and reported on. There are metrics-based SLAs such as response time and uptime, and there are SLAs focusing on processes around privacy, security, and regulations. Monitoring should cover all types of SLAs.

But SLAs are only part of the story. Many cloud-based services are distributed systems composed of many parts. Each part of the system is a potential point of failure and needs to be monitored. Different people within the organization may need different information about the system in order to ensure that the system functions properly. For example, a front-end developer might be concerned with page-load times, network performance, the performance of the application programming interfaces (APIs), and so forth. The database architects may want to see metrics about the database server in the areas of threads, cache, memory, and CPU utilization in addition to metrics about the SQL statements and their response times. The system administrators may want to see metrics such as requests per second (RPS), disk space capacity, and CPU and memory utilization. The product owners may want to see unique visits per day, new users, cost per user, and other business-related metrics.

All of these metrics provide insights to determine if the system is behaving correctly and if the system is causing the desired behaviors from the end users. A system can be running flawlessly from a technology standpoint, but if customer usage is consistently declining, there might be something drastically wrong in the areas of usability or product strategy. Metrics are also critical for assessing the success of each deployment. When a new version of software is deployed, it is critical to watch key system metrics and compare them against the baseline to see if the deployment has a negative impact on the overall system. For systems that use switches to turn features on and off, tracking metrics post-deployment can help discover when a switch is inadvertently set to the wrong value. This preventive measure allows a mistake to be fixed quickly before it becomes problematic. Without a preventive approach, a simple issue like an erroneous configuration setting might not be found until much later, when reporting shows a large delta in the data or, even worse, when the customers discover it first.
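
As a simple illustration of that post-deployment check, the sketch below compares a handful of key metrics captured after a release against their pre-release baselines and flags anything that drifted beyond an agreed tolerance. The metric names and the 20 percent tolerance are assumptions for the example, not prescriptions.

  # Compare post-deployment metrics against pre-deployment baselines and flag
  # anything that moved more than `tolerance` (relative change) in either direction.
  def deployment_regressions(baseline, current, tolerance=0.20):
      flagged = {}
      for name, before in baseline.items():
          after = current.get(name)
          if after is None or before == 0:
              continue
          change = (after - before) / before
          if abs(change) > tolerance:
              flagged[name] = round(change * 100, 1)   # percent change
      return flagged

  baseline = {"requests_per_sec": 1200, "error_rate": 0.5, "p95_page_load_ms": 900}
  current  = {"requests_per_sec": 1150, "error_rate": 1.4, "p95_page_load_ms": 1400}
  print(deployment_regressions(baseline, current))
  # {'error_rate': 180.0, 'p95_page_load_ms': 55.6}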

A number of categories should be monitored:

  • Performance
  • Throughput
  • Quality
  • Key performance indicators (KPIs)
  • Security
  • Compliance

Monitoring also occurs within the different layers of the cloud stack:

  • User layer
  • Application layer
  • Application stack layer
  • Infrastructure layer

In addition, there are three distinct domains that need to be monitored:

1. Cloud vendor environment
2. Cloud application environment
3. User experience

Let’s briefly touch on each one of these areas. The intent of this chapter is to give a broad overview of some basic metrics and best practices. For a more in-depth commentary on measuring metrics for scaling systems, I recommend Cal Henderson’s book, Building Scalable Websites, in which he explains how the team at Flickr scaled out the company’s famous photo-sharing website.

Monitoring Strategies by Category

There are many categories of information that can be monitored. In this chapter, we will discuss monitoring strategies for measuring performance, throughput, quality, KPIs, security, and compliance. Each company will have a unique set of categories that are relevant to its business model and the target application. The categories discussed here are the ones that are typical in any cloud application or service.

Performance

Performance is an important metric within each layer of the cloud stack. At the user layer, performance metrics track attributes about how the users interact with the system. Here are some examples of user performance metrics:

  • Number of new customers
  • Number of unique visitors per day
  • Number of page visits per day
  • Average time spent on site
  • Revenue per customer
  • Bounce rate (percent of users who leave without viewing pages)
  • Conversion rate (percent of users who perform desired action based on direct marketing)

The goal of these metrics is to measure the behavior of the customers using the system. If these numbers decrease drastically from the baseline numbers after a deployment, there is a good chance either that there is an issue with the new code or that the new features were not well received by the customers.

Sometimes the end user is not a person but another system. Similar metrics can be used to ensure that the system and its users are behaving in the expected manner.

  • Number of new users
  • Number of unique users per day
  • Number of calls per user per day
  • Average time per call
  • Revenue per user

In this case a user represents another system. If the expectation is that the number of users is fixed or static and the metric shows the number is decreasing, then there is likely a problem preventing those systems from getting access, or their requests are failing. If the number of users goes up, then there might be a security issue in which unauthorized accounts are gaining access. If the number of users is dynamic, then a decline in any of the metrics might be evidence that there are issues with the system.

At the application layer, performance measures how the system responds to the end user, whether that user is a person or another system. Here are some common performance metrics that are often tracked:

  • Page-load times
  • Uptime
  • Response time (APIs, reports, queries, etc.)

These metrics might be tracked and aggregated at different levels. For example, a system may be made up of a consumer-facing web page, a collection of APIs, an administrator portal for data management, and a reporting subsystem. It would be wise to track these metrics for each one of the four components separately because they likely all have unique performance requirements and SLAs. Also, if this system is being delivered as a Software as a Service (SaaS) solution to numerous clients, it would be wise to track these metrics uniquely by client, as well.
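
One lightweight way to keep those measurements separable is to tag every observation with the client and component it belongs to and aggregate on those dimensions. The sketch below uses a plain in-process aggregator; in practice a metrics service fills this role, and the client and component names shown are made up.

  from collections import defaultdict
  from statistics import mean

  # Record response times keyed by (client, component) so each combination can be
  # reported against its own SLA rather than averaged into one global number.
  observations = defaultdict(list)

  def record(client, component, response_ms):
      observations[(client, component)].append(response_ms)

  record("acme-corp", "api", 120)
  record("acme-corp", "api", 180)
  record("acme-corp", "reporting", 2400)
  record("globex", "api", 95)

  for (client, component), values in sorted(observations.items()):
      print(f"{client}/{component}: avg {mean(values):.0f} ms over {len(values)} calls")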

At the application stack layer, the metrics are similar, but instead of tracking the application performance, now we are tracking the performance of the underlying components of the application stack, such as the operating system, application server, database server, caching layer, and so on. Every component that makes up this layer needs to be monitored on every machine. If a MySQL database is made up of a master node with three slave nodes, each node needs to have a baseline established and needs to be tracked against its baseline. The same applies for the web servers. A 100-node web server farm needs each node to be monitored independently. At the same time, servers need to be monitored in clusters or groups to compute the metrics for a given customer. For example, if each customer has its own dedicated master and slave databases, the average response time and uptime are aggregations of the performance metrics for all of the servers in the cluster.

At the infrastructure layer, the metrics apply to the physical infrastructure, such as servers, networks, routers, and so on. Public Infrastructure as a Service (IaaS) providers will host a web page showing the health of their infrastructure, but they only give red, yellow, and green indicators, which indicate whether the services are functioning normally, are having issues, or are completely down.

Throughput

Throughput measures the average rate at which data moves through the system. Like performance, it is important to understand the throughput at each layer of the cloud stack, at each component of the system, and for each unique customer. At the user layer, throughput measures how many concurrent users or sessions the system is processing. At the application layer, throughput measures how much data the system can transmit from the application stack layer through the application layer to the end user. This metric is often measured in transactions per second (TPS), requests per second (RPS), or some business-related metric like click-throughs per second or page visits per second.
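
Throughput numbers like RPS are usually derived from a counter sampled over a window. The sketch below shows one naive way to compute requests per second from request timestamps; a production system would rely on its monitoring agent's counters instead, and the one-second window is an assumption.

  from collections import deque
  import time

  # Naive requests-per-second gauge: keep the timestamps of recent requests and
  # count how many fall inside the last `window` seconds.
  class ThroughputGauge:
      def __init__(self, window=1.0):
          self.window = window
          self.stamps = deque()

      def record_request(self, now=None):
          self.stamps.append(now if now is not None else time.monotonic())

      def requests_per_second(self, now=None):
          now = now if now is not None else time.monotonic()
          while self.stamps and now - self.stamps[0] > self.window:
              self.stamps.popleft()
          return len(self.stamps) / self.window

  gauge = ThroughputGauge()
  for _ in range(250):
      gauge.record_request()
  print(gauge.requests_per_second())   # ~250 if all arrived within the last second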

At the application stack layer, measuring throughput is critical in diagnosing issues within the system. If the TPS at the application layer is lower than normal, it is usually due to a reduction in throughput in one or more components within the application stack. Monitoring solutions like open source Nagios or SaaS products like New Relic are commonly used to gather various metrics on the application stack components. These tools allow administrators to set alerts and notifications when certain thresholds are met and provide analytics for spotting trends in the data. At the infrastructure layer, throughput measures the flow of data through the physical servers and other hardware and network devices.

Quality

Quality is a measure of both the accuracy of information and the impact of defects on the end user in the production environment. The key here is the emphasis on the production environment. Having 1,000 defects in the quality assurance or development environments is meaningless to an end user and to the SLAs of a system. It is the number of defects in production and the impacts they have on the applications and the end users that matter. One hundred defects in production might sound horrible, but if a majority of them have no or minimal impact on the end user, then they have less impact on the measurement of quality. I bring this up because I have seen too many companies use a quality metric to drive the wrong results. Quality should not be measured in bugs or defects. If it is, the team spends valuable time correcting many defects that do not have an impact on the overall health of the system and the end user’s perception. Instead, quality should focus on accuracy, the correctness of the data that is being returned to the end user; error rates, the frequency with which errors occur; deployment failure rates, the percentage of time deployments fail or have issues; and customer satisfaction, the perception of quality and service from the voice of the customer.

To measure quality, standardization of data collection is required. As was mentioned in Chapter 10, error codes, severity level, and log record formats should all be standardized and common error and logging APIs should be used to ensure that consistent data is sent to the central logging system. Automated reports and dashboards that mine the data from the logging system should generate all of the relevant key metrics, including quality, error rates, error types, and so forth. Thresholds should be set that cause alerts and notifications to be triggered when the quality metric reaches the alert threshold. Quality must be maintained at every layer within the cloud stack.
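
A standardized log record can be as simple as a JSON document with an agreed set of fields. The sketch below wraps Python’s standard logging module with a formatter that always emits the same fields (severity, component, error code); the field names and the component shown are illustrative, not a prescribed schema.

  import json, logging

  # Formatter that emits one JSON object per record with a fixed set of fields,
  # so the central logging system can parse every component's logs the same way.
  class JsonFormatter(logging.Formatter):
      def format(self, record):
          return json.dumps({
              "timestamp": self.formatTime(record),
              "severity": record.levelname,
              "component": getattr(record, "component", "unknown"),
              "error_code": getattr(record, "error_code", None),
              "message": record.getMessage(),
          })

  handler = logging.StreamHandler()
  handler.setFormatter(JsonFormatter())
  log = logging.getLogger("billing")
  log.addHandler(handler)
  log.error("payment gateway timeout",
            extra={"component": "billing-api", "error_code": "E1042"})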

At the user layer, quality measures the success and accuracy of user registration and access. If an unacceptable number of users fail the registration process, somebody must resolve the issue quickly. Sometimes the quality issue is not a defect but rather a usability issue. Users may require more training or the user interface may be too cumbersome or confusing. At the application layer, quality is in the eye of the beholder, also known as the end user. At this layer we are concerned with the defect types. Errors related to erroneous data, failed transactions, and 4xx- and 5xx-level HTTP response codes are typically the culprits that cause invalid results and unhappy customers. These errors must be tracked for each API and for each module within the system. At the application stack layer, errors need to be logged and tracked for each component, and the same applies to the physical infrastructure within the infrastructure layer.
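
As a small illustration, the sketch below tallies HTTP response codes per API endpoint and reports the share of 4xx/5xx responses, which is the error-rate number most dashboards track. The endpoint names and sample statuses are invented for the example.

  from collections import Counter

  # status_log: (endpoint, http_status) pairs pulled from the access log.
  status_log = [
      ("/api/v1/orders", 200), ("/api/v1/orders", 200), ("/api/v1/orders", 503),
      ("/api/v1/users", 200), ("/api/v1/users", 404), ("/api/v1/users", 200),
  ]

  totals, errors = Counter(), Counter()
  for endpoint, status in status_log:
      totals[endpoint] += 1
      if status >= 400:                     # count 4xx and 5xx responses as errors
          errors[endpoint] += 1

  for endpoint in totals:
      rate = 100.0 * errors[endpoint] / totals[endpoint]
      print(f"{endpoint}: {rate:.1f}% errors ({errors[endpoint]}/{totals[endpoint]})")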

KPIs

Key performance indicators are those metrics that tell us if the system is meeting the business goals. Some examples of KPIs are:

  • Revenue per customer
  • Revenue per hour
  • Number of incoming customer calls per day
  • Number of jobs completed per day
  • Site traffic
  • Shopping cart abandonment rate

KPIs are unique to each company’s business model. Each company invests in systems to achieve its business goals. Monitoring and measuring KPIs is a best practice for proactively detecting potential issues. Detecting KPIs trending in the wrong direction allows the team to proactively research root causes and potentially fix the issue(s) before too much damage is done. It is also important to detect when KPIs are trending in a positive direction so the team can figure out what the catalyst is and understand what drives the desired behaviors.

KPIs are measured at the application layer. Typically, the product team establishes what those key metrics are. IT teams often establish their own KPIs, as well. In Chapter 14, we will discuss how metrics are used to proactively monitor the health of the underlying architecture and deployment processes.

Security

Securing cloud-based systems can be quite a challenge. The methods deployed by cybercriminals and other people or systems that attack with malicious intent are very dynamic. A system that is very secure today can be exposed tomorrow as new and more complex threats are launched. To combat the dynamic nature of security threats, a system should proactively monitor all components for suspicious patterns. There are many good books that go into great detail about securing systems, and I’ll spare you the gory details. The point to get across in this book is that building security into a system is only part of the job. In Chapter 9, we discussed the PDP method, which stands for protection, detection, and prevention. Monitoring is one area where detection and protection take place. Monitoring security is a proactive approach that focuses on mining log files and discovering abnormal patterns that may indicate an unsuccessful or successful attempt at attacking the system.

As with the other metrics discussed in this chapter, security should be monitored at every layer of the cloud stack and at every component within each layer. Every component of a system typically requires some level of authentication in order for a user or a system to access it. Security monitoring should look at all failed authentication attempts for every component and detect if there is a certain user, system, or IP address that is constantly trying and failing to authenticate. Most attacks are attempted by unattended scripts, usually referred to as bots. These bots work their way into the system through some unsecure component and then run a series of other scripts that try to access any application or server that they can.
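
A minimal version of that detection is shown below: scan authentication events, count failures per source IP inside a monitoring window, and flag any source that exceeds a threshold. The event structure, the threshold, and the addresses are assumptions for the sketch; real systems would pull these events from the central logging system.

  from collections import Counter

  # Each event: (source_ip, succeeded). Hard-coded here to keep the sketch self-contained.
  auth_events = [
      ("203.0.113.7", False), ("203.0.113.7", False), ("203.0.113.7", False),
      ("203.0.113.7", False), ("203.0.113.7", False), ("198.51.100.4", True),
  ]

  FAILURE_THRESHOLD = 5     # failed attempts per monitoring window before we alert

  failures = Counter(ip for ip, ok in auth_events if not ok)
  suspects = [ip for ip, count in failures.items() if count >= FAILURE_THRESHOLD]
  for ip in suspects:
      print(f"ALERT: {ip} has {failures[ip]} failed logins; candidate for blacklisting")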

Once a bot is detected, administrators can blacklist its IP address to prevent it from doing any damage. The next step is prevention. How did the intruder gain access in the first place? Without detection, the only way to know that an outside threat has penetrated the system is when the threat accomplishes its objectives, which could be catastrophic, such as stealing sensitive data, destroying or corrupting files and systems, installing viruses and worms, consuming compute resources that impact system performance, and many other horrible scenarios. For systems that are required to pass security audits, it is mandatory to implement a PDP security strategy.

Compliance

Systems that fall under various regulatory constraints should implement a monitoring strategy for compliance. The goal of this strategy is to raise alerts when parts of the system are falling out of compliance. Compliance requires policies and procedures to be followed both within a system and within the business. Examples of policies that the business must follow are policies pertaining to running background checks on employees and restricting access to buildings. Policies pertaining to the system, such as restricting production access on a need-to-know basis, can be monitored within the system. Once again, a team can mine log files to track enforcement of policies. Many SaaS and open source tools have also entered the marketplace recently that allow policies to be defined within the tool, which then monitors their enforcement, raises alerts, and offers canned and ad hoc reporting.
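
One small example of mining logs for policy enforcement: the sketch below checks production access events against an approved need-to-know list and flags violations. The user names, hosts, and the policy itself are hypothetical.

  # Need-to-know policy check: anyone touching production who is not on the
  # approved list generates a compliance alert for follow-up.
  approved_production_users = {"oncall-engineer", "release-bot"}

  access_events = [
      {"user": "oncall-engineer", "action": "ssh", "host": "prod-db-01"},
      {"user": "summer-intern", "action": "ssh", "host": "prod-web-03"},
  ]

  for event in access_events:
      if event["user"] not in approved_production_users:
          print(f"COMPLIANCE ALERT: {event['user']} accessed {event['host']} "
                f"({event['action']}) without need-to-know approval")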

Monitoring is not a silver bullet. But without information and tools, systems are ticking time bombs waiting to go off at any minute. Monitoring allows people to learn about their systems. The best and most reliable systems are ones that are always changing and adapting to the environment around them. Whether it is tweaks to the code, the infrastructure, the product, or the customer experience, it takes insights provided by information to make the right changes in order to create the desired result. Without monitoring, a system is like a fish out of water.

Monitoring by Cloud Service Level

Now that we know what to monitor, let’s see how monitoring is accomplished within each cloud service model. As with everything else in the cloud, the further down the cloud stack you go, the more responsibility you take on. Starting with SaaS, there is very little, if anything, that the end user needs to do. The SaaS service is either up or down. If it is down or appears to be down, most SaaS solutions have a web page that shows the latest status, and they have a customer support web page and phone number to call. If the SaaS system is critical to the business, then the end user may want some kind of alert to be triggered when the service is down. Some SaaS vendors have a feature that allows end users to get alerts. If the SaaS tool does not have this feature, the end user can use a tool like Pingdom that pings the URL and alerts the appropriate people that the service is unavailable. Even with this alerting capability, with SaaS there is nothing the end user can do but wait until the vendor restores the service.
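
A homegrown version of that URL check can be as simple as the sketch below, which requests the status page and notifies someone when it cannot get a healthy response. The URL and the notification step are placeholders; a hosted service such as Pingdom does essentially the same thing from multiple locations.

  import urllib.request
  import urllib.error

  # Minimal availability probe: fetch the service URL and report whether it
  # answered with a healthy status code within the timeout.
  def service_is_up(url, timeout=5):
      try:
          with urllib.request.urlopen(url, timeout=timeout) as response:
              return 200 <= response.status < 400
      except (urllib.error.URLError, OSError):
          return False

  if not service_is_up("https://status.example-saas.com/"):
      # Placeholder for the real alert: email, SMS, chat webhook, paging system, etc.
      print("ALERT: SaaS provider appears to be down")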

In Chapter 13, “Disaster Recovery Planning,” we will discuss the idea of having a secondary SaaS solution in place in case the primary service goes down. For example, if an e-commerce site leverages a SaaS solution for processing online payments or for fulfillment and the service goes down, the e-commerce site could detect the failure and configure itself to switch over to its secondary provider until the service is recovered. The trigger for this event could be an alert message from the URL monitoring software.
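
A hedged sketch of that failover decision might look like the following, where the primary and secondary payment endpoints are made-up names and the health-check callable could be the service_is_up probe from the previous sketch.

  # Choose which payment provider to route traffic to, preferring the primary and
  # falling back to the secondary only while the primary's health check fails.
  PRIMARY = "https://payments.primary-provider.example/health"
  SECONDARY = "https://payments.secondary-provider.example/health"

  def select_payment_provider(is_up):
      """`is_up` is a health-check callable, e.g., service_is_up from the sketch above."""
      if is_up(PRIMARY):
          return PRIMARY
      if is_up(SECONDARY):
          return SECONDARY
      return None          # both down: queue orders or degrade gracefully

  # Example with a stubbed health check that pretends the primary is down.
  print(select_payment_provider(lambda url: url == SECONDARY))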

Public and private Platform as a Service (PaaS) solutions handle monitoring differently. With public PaaS, the vendor manages both the infrastructure layer and the application stack layer. The PaaS vendor supplies APIs to various monitoring and logging solutions that they integrate with. The application code that the consumer builds on top of the PaaS should leverage these APIs so that all logs go to the PaaS-provided central logging system (if that is desirable). The consumer can use its own monitoring tools or it can leverage the APIs of the monitoring tools that are integrated with the PaaS. Not all PaaS solutions have intrusion detection tools that are exposed to the end user. The thought process here is that the vendor owns that responsibility and the consumer should focus on its applications.

Private PaaS is more like IaaS. For both, the consumer must monitor the system down to the application stack layer. Like public PaaS, many private PaaS solutions have plug-ins for modern logging and monitoring solutions. For IaaS solutions, the logging and monitoring solutions must be installed and managed by the consumer. Companies building their own private clouds must also monitor the physical infrastructure and data center.


AEA Case Study: Monitoring Considerations
The Acme eAuctions (AEA) auction platform is made up of many components that support many actors. There are many points of failure that need to be monitored. Uptime, performance, reliability, security, and scalability are all important to the success of the platform. AEA will want to proactively monitor the platform to minimize any service interruptions, performance degradation, or security breaches. In order to protect the platform from the misuse of resources (intentional or unintentional) by the external partners, the partners’ resources will be throttled at predefined maximum levels, as sketched after this case study. Here is a short list of items that AEA determined that it must monitor:
  • Infrastructure—memory, disk, CPU utilization, bandwidth, and so on
  • Database—query performance, memory, caching, throughput, swap space, and the like
  • Application—transactions per second, page-load times, API response time, availability, and so forth
  • Access—external partner resource consumption
  • Security—repeated failed login attempts, unauthorized access
  • KPIs—financial metrics, transaction metrics, performance metrics
  • Costs—cloud cost optimization
AEA will need to use a variety of monitoring tools to satisfy these requirements. Some of these tools will mine the centralized log files to raise alerts, such as detecting repeated login failures from a single source (basic intrusion detection). There are both open source and commercial tools for monitoring infrastructure and databases. There are some great SaaS solutions, like New Relic, that can be configured to set performance, availability, and service level thresholds and alert the appropriate people when those metrics fall out of range. Another important tool is the cloud cost monitoring solution. It is easy to quickly provision cloud resources. The downside is that it is easy to run up the monthly infrastructure bill just as fast if a close eye is not kept on the costs.
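
The throttling of partner resources mentioned in the case study is often implemented with a token bucket per partner: each partner gets a refill rate and a burst capacity, and requests beyond that are rejected or queued. The rates and partner name below are invented for the sketch.

  import time

  # Token bucket per partner: `rate` tokens are added per second up to `capacity`;
  # each request spends one token, and requests with no token available are throttled.
  class TokenBucket:
      def __init__(self, rate, capacity):
          self.rate, self.capacity = rate, capacity
          self.tokens = capacity
          self.updated = time.monotonic()

      def allow(self):
          now = time.monotonic()
          self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
          self.updated = now
          if self.tokens >= 1:
              self.tokens -= 1
              return True
          return False

  # Hypothetical partner limits: 5 requests per second with bursts up to 10.
  partner_buckets = {"partner-a": TokenBucket(rate=5, capacity=10)}

  allowed = sum(partner_buckets["partner-a"].allow() for _ in range(15))
  print(f"{allowed} of 15 burst requests allowed; the rest were throttled")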

Understanding your monitoring requirements up front allows you to find monitoring solutions that can meet many of your overall needs. Companies that don’t take an enterprise approach to evaluating their monitoring needs often wind up with too many different tools, which makes it hard to piece together data from various unrelated systems. By looking across the enterprise, monitoring requirements can be satisfied by fewer tools and hopefully by tools that can be integrated with each other.

Summary

Monitoring is a critical component of any cloud-based system. A monitoring strategy should be put in place early on and continuously improved over time. There is no one monitoring tool that will meet all the needs of a cloud solution. Expect to leverage a combination of SaaS and open source solutions and possibly even some homegrown solutions to meet the full needs of the platform. Managing a cloud solution without a monitoring strategy is like driving down the highway at night with the lights off. You might make it home safe, but you might not!

Reference

Henderson, C. (2006). Building Scalable Websites. Cambridge, MA: O’Reilly.
