Chapter 15. Observability and Monitoring

Nothing is ever completely right aboard a ship.

William Langewiesche, The Outlaw Sea

In this chapter we’ll consider the question of observability and monitoring for cloud native applications. What is observability? How does it relate to monitoring? How do you do monitoring, logging, metrics, and tracing in Kubernetes?

What Is Observability?

Observability may not be a familiar term to you, though it’s becoming increasingly popular as a way to express the larger world beyond traditional monitoring. Let’s tackle monitoring first before we see how observability extends it.

What Is Monitoring?

Is your website working right now? Go check; we’ll wait. The most basic way to know whether all your applications and services are working as they should is to look at them yourself. But when we talk about monitoring in a DevOps context, we mostly mean automated monitoring.

Automated monitoring is checking the availability or behavior of a website or service, in some programmatic way, usually on a regular schedule, and usually with some automated way of alerting human engineers if there’s a problem. But what defines a problem?

Black-Box Monitoring

Let’s take the simple case of a static website; say, the blog that accompanies this book.

If it’s not working at all, it just won’t respond, or you’ll see an error message in the browser (we hope not, but nobody’s perfect). So the simplest possible monitoring check for this site is to fetch the home page and check the HTTP status code (200 indicates a successful request). You could do this with a command-line HTTP client such as httpie or curl; if the exit status from the client is nonzero, there was a problem fetching the website. (With curl, you’ll need the --fail flag so that HTTP error responses also produce a nonzero exit status.)

But suppose something went wrong with the web server configuration, and although the server is working and responding with HTTP 200 OK status, it is actually serving a blank page (or some sort of default or welcome page, or maybe the wrong site altogether). Our simple-minded monitoring check won’t complain at all, because the HTTP request succeeds. However, the site is actually down for users: they can’t read our fascinating and informative blog posts.

A more sophisticated monitoring check might look for some specific text on the page, such as Cloud Native DevOps. This would catch the problem of a misconfigured, but working, web server.
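
To make this concrete, here’s a minimal sketch of such a check in Python, using the requests library (the URL and the expected text are placeholders you’d replace with your own):

    import sys

    import requests

    URL = "https://example.com/"           # placeholder: your site's home page
    EXPECTED_TEXT = "Cloud Native DevOps"  # text that should appear on a healthy page

    def check_site(url, expected):
        """Return True if the page loads with HTTP 200 and contains the expected text."""
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException as err:
            print(f"Request failed: {err}")
            return False
        if resp.status_code != 200:
            print(f"Unexpected HTTP status: {resp.status_code}")
            return False
        if expected not in resp.text:
            print("Page loaded, but the expected text was not found")
            return False
        return True

    if __name__ == "__main__":
        # A nonzero exit status signals failure to whatever scheduler runs this check.
        sys.exit(0 if check_site(URL, EXPECTED_TEXT) else 1)

Run on a schedule by cron or by a monitoring system, a script like this covers both the status code check and the text-match check described above.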

Beyond static pages

You can imagine that more complex websites might need more complex monitoring. For example, if the site had a facility for users to log in, the monitoring check might also try to log in with a precreated user account and alert if the login fails. Or if the site had a search function, the check might fill in a text field with some search text, simulate clicking the search button, and verify that the results contain some expected text.
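
As a sketch of what such a check might look like, here is one way to simulate a login using Python’s requests library; the login URL, form field names, and success text are entirely hypothetical and would depend on your application:

    import requests

    LOGIN_URL = "https://example.com/login"  # hypothetical login endpoint
    CREDENTIALS = {"username": "monitor", "password": "s3cret"}  # precreated test account
    SUCCESS_TEXT = "Welcome back"            # hypothetical text shown after a successful login

    def check_login():
        """Simulate a user logging in and verify that the response looks like success."""
        with requests.Session() as session:
            resp = session.post(LOGIN_URL, data=CREDENTIALS, timeout=10)
            return resp.status_code == 200 and SUCCESS_TEXT in resp.text

    if __name__ == "__main__":
        print("login OK" if check_login() else "login FAILED")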

For simple websites, a yes/no answer to the question “Is it working?” may be sufficient. For cloud native applications, which tend to be more complex distributed systems, the question may turn into multiple questions:

  • Is my application available everywhere in the world? Or only in some regions?

  • How long does it take to load for most of my users?

  • What about users who may have slow download speeds?

  • Are all of the features of my website working as intended?

  • Are certain features working slowly or not at all, and how many users are affected?

  • If my application relies on a third-party service, what happens when that external service is faulty or unavailable?

  • What happens when my cloud provider has an outage?

It starts to become clear that in the world of monitoring cloud native distributed systems, not very much is clear at all.

The limits of black-box monitoring

However, no matter how complicated these checks get, they all fall into the same category of monitoring: black-box monitoring. Black-box checks, as the name suggests, observe only the external behavior of a system, without any attempt to observe what’s going on inside it.

Until a few years ago, black-box monitoring, as performed by popular tools such as Nagios, Icinga, Zabbix, Sensu, and Check_MK, was pretty much the state of the art. To be sure, having any kind of automated monitoring of your systems is a huge improvement on having none. But there are a few limitations of black-box checks:

  • They can only detect predictable failures (for example, a website not responding).

  • They only check the behavior of the parts of the system that are exposed to the outside.

  • They are passive and reactive; they only tell you about a problem after it’s happened.

  • They can answer the question “What’s broken?”, but not the more important question “Why?”

To answer the why? question, we need to move beyond black-box monitoring.

There’s a further issue with this kind of up/down test: what does up even mean?

What Does “Up” Mean?

In operations we’re used to measuring the resilience and availability of our applications in uptime, usually measured as a percentage. For example, an application with 99% uptime was unavailable for no more than 1% of the relevant time period. 99.9% uptime, referred to as three nines, translates to about nine hours downtime a year, which would be a good figure for the average web application. Four nines (99.99%) is less than an hour’s downtime per year, and five nines (99.999%) is about five minutes.
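
These figures follow from a simple calculation: yearly downtime is the unavailable fraction multiplied by the number of hours in a year (about 8,766):

    \[
    \text{downtime per year} = (1 - \text{availability}) \times 8766 \text{ hours}
    \]

So 99.9% availability allows roughly 8.8 hours of downtime per year, 99.99% about 53 minutes, and 99.999% about 5.3 minutes.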

So, the more nines the better, you might think. But looking at things this way misses an important point:

Nines don’t matter if users aren’t happy.

Charity Majors

Nines don’t matter if users aren’t happy

As the saying goes, what gets measured gets maximized. So you’d better be very careful what you measure. If your service isn’t working for users, it doesn’t matter what your internal metrics say: the service is down. There are lots of ways a service can be making users unhappy, even if it’s nominally up.

To take an obvious one, what if your website takes 10 seconds to load? It might work fine after that, but if it’s too slow to respond, it might as well be down completely. Users will just go elsewhere.

Traditional black-box monitoring might attempt to deal with this problem by defining a load time of, say, five seconds as up, treating anything over that as down, and generating an alert. But what if users are experiencing all sorts of different load times, from 2 seconds to 10 seconds? With a hard threshold like this, the service could be down for some users, but up for others. And what if load times are fine for users in North America, but the site is unusable from Europe or Asia?

Cloud native applications are never up

While you could go on refining ever more complex rules and thresholds to try to give an up/down answer about the status of the service, the truth is that the question itself is irredeemably flawed. Distributed systems like cloud native applications are never up; they exist in a constant state of partially degraded service.

This is an example of a class of problems called gray failures. Gray failures are, by definition, hard to detect, especially from a single point of view or with a single observation.

So while black-box monitoring may be a good place to start your observability journey, it’s important to recognize that you shouldn’t stop there. Let’s see if we can do better.

Logging

Most applications produce logs of some kind. Logs are a series of records, usually with timestamps to indicate when each record was written, and in what order. For example, a web server records each request in its logs, including information such as:

  • The URI requested

  • The IP address of the client

  • The HTTP status of the response

If the application encounters an error, it usually logs this fact, along with some information that may or may not be helpful for operators to figure out what caused the problem.

Often, logs from a wide range of applications and services will be aggregated into a central database (Elasticsearch, for example), where they can be queried and graphed to help with troubleshooting. Tools like Logstash and Kibana, or hosted services such as Splunk and Loggly, are designed to help you gather and analyze large volumes of log data.

The limits of logging

Logs can be useful, but they have their limitations too. The decision about what to log or not to log is taken by the programmer at the time the application is written. Therefore, like black-box checks, logs can only answer questions or detect problems that can be predicted in advance.

It can also be hard to extract information from logs, because every application writes logs in a different format, and operators often need to write customized parsers for each type of log record to turn it into usable numerical or event data.

Because logs have to record enough information to diagnose any conceivable kind of problem, they usually have a poor signal-to-noise ratio. If you log everything, it’s difficult and time-consuming to wade through hundreds of pages of logs to find the one error message you need. If you log only occasional errors, it’s hard to know what normal looks like.

Logs are hard to scale

Logs also don’t scale very well with traffic. If every user request generates a log line that has to be sent to the aggregator, you can end up using a lot of network bandwidth (which is thus unavailable to serve users), and your log aggregator can become a bottleneck.

Many hosted logging providers also charge by the volume of logs you generate, which is understandable, but unfortunate: it incentivizes you financially to log less information, and to have fewer users and serve less traffic!

The same applies to self-hosted logging solutions: the more data you store, the more hardware, storage, and network resources you have to pay for, and the more engineering time goes into merely keeping log aggregation working.

Is logging useful in Kubernetes?

We talked a little about how containers generate logs and how you can inspect them directly in Kubernetes, in “Viewing a Container’s Logs”. This is a useful debugging technique for individual containers.

If you do use logging, you should use some form of structured data, like JSON, which can be automatically parsed (see “The Observability Pipeline”) rather than plain-text records.
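
For example, here’s a minimal sketch of structured logging in Python, emitting one JSON object per line instead of free-form text (the field names are just illustrative):

    import json
    import logging
    import sys
    import time

    class JSONFormatter(logging.Formatter):
        """Render each log record as a single JSON object on one line."""
        def format(self, record):
            return json.dumps({
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JSONFormatter())
    logger = logging.getLogger("myapp")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # A log aggregator can parse this line without any custom regular expressions.
    logger.info("request served")

Each line is now machine-readable, so the aggregator can index fields like level and logger without any per-application parsing code.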

Centralized log aggregation (to an ELK stack, for example) can be useful with Kubernetes applications, but it’s certainly not the whole story. While there are some business use cases for centralized logging (audit and security requirements, for example, or customer analytics), logs can’t give us all the information we need for true observability.

For that, we need to look beyond logs, to something much more powerful.

Introducing Metrics

A more sophisticated way of gathering information about your services is to use metrics. As the name suggests, a metric is a numerical measure of something. Depending on the application, relevant metrics might include:

  • The number of requests currently being processed

  • The number of requests handled per minute (or per second, or per hour)

  • The number of errors encountered when handling requests

  • The average time it took to serve requests (or the peak time, or the 99th percentile)

It’s also useful to gather metrics about your infrastructure as well as your applications:

  • The CPU usage of individual processes or containers

  • The disk I/O activity of nodes and servers

  • The inbound and outbound network traffic of machines, clusters, or load balancers

Metrics help answer the why? question

Metrics open up a new dimension of monitoring beyond simply working/not working. Like the speedometer in your car, or the temperature scale on your thermometer, they give you numerical information about what’s happening. Unlike logs, metrics can easily be processed in all sorts of useful ways: drawing graphs, taking statistics, or alerting on predefined thresholds. For example, your monitoring system might alert you if the error rate for an application exceeds 10% for a given time period.

Metrics can also help answer the why? question about problems. For example, suppose users are experiencing long response times (high latency) from your app. You check your metrics, and you see that the spike in the latency metric coincides with a similar spike in the CPU usage metric for a particular machine or component. That immediately gives you a clue about where to start looking for the problem. The component may be wedged, or repeatedly retrying some failed operation, or its host node may have a hardware problem.

Metrics help predict problems

Also, metrics can be predictive: when things go wrong, it usually doesn’t happen all at once. Before a problem is noticeable to you or your users, an increase in some metric may indicate that trouble is on the way.

For example, the disk usage metric for a server may creep up and up over time, eventually reaching the point where the disk actually runs out of space and things start failing. If you alerted on that metric before it got into failure territory, you could prevent the failure from happening at all.

Some systems even use machine learning techniques to analyze metrics, detect anomalies, and reason about the cause. This can be helpful, especially in complex distributed systems, but for most purposes, simply having a way to gather, graph, and alert on metrics is plenty good enough.

Metrics monitor applications from the inside

With black-box checks, operators have to make guesses about the internal implementation of the app or service, and predict what kind of failures might happen and what effect this would have on external behavior. By contrast, metrics allow application developers to export key information about the hidden aspects of the system, based on their knowledge of how it actually works (and fails):

Stop reverse engineering applications and start monitoring from the inside.

Kelsey Hightower, Monitorama 2016

Tools like Prometheus, statsd, and Graphite, or hosted services such as Datadog, New Relic, and Dynatrace, are widely used to gather and manage metrics data.
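
To give a flavor of what this looks like in practice, here’s a brief sketch using the Prometheus Python client; the metric names are just examples, and a real application would record these values in its request-handling code:

    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    # Counters only ever go up; histograms record a distribution of observed values.
    REQUESTS_TOTAL = Counter("myapp_requests_total", "Total requests handled")
    ERRORS_TOTAL = Counter("myapp_errors_total", "Requests that resulted in an error")
    REQUEST_SECONDS = Histogram("myapp_request_duration_seconds", "Time spent serving requests")

    def handle_request():
        REQUESTS_TOTAL.inc()
        with REQUEST_SECONDS.time():               # records how long this block takes
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
            if random.random() < 0.05:             # simulate an occasional failure
                ERRORS_TOTAL.inc()

    if __name__ == "__main__":
        start_http_server(8000)  # exposes the metrics at http://localhost:8000/metrics
        while True:
            handle_request()

A Prometheus server can then scrape the /metrics endpoint, graph these values over time, and alert when, say, the ratio of errors to requests exceeds a threshold.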

We’ll talk much more in Chapter 16 about metrics, including what kinds you should focus on, and what you should do with them. For now, let’s complete our survey of observability with a look at tracing.

Tracing

Another useful technique in the monitoring toolbox is tracing. It’s especially important in distributed systems. While metrics and logs tell you what’s going on with each individual component of your system, tracing follows a single user request through its whole life cycle.

Suppose you’re trying to figure out why some users are experiencing very high latency for requests. You check the metrics for each of your system components: load balancer, ingress, web server, application server, database, message bus, and so on, and everything appears normal. So what’s going on?

When you trace an individual (hopefully representative) request from the moment the user’s connection is opened to the moment it’s closed, you’ll get a picture of how that overall latency breaks down for each stage of the request’s journey through the system.

For example, you may find that the time spent handling the request in each stage of the pipeline is normal, except for the database hop, which is 100 times longer than normal. Although the database is working fine and its metrics show no problems, for some reason the application server is having to wait a very long time for requests to the database to complete.

Eventually you track down the problem to excessive packet loss over one particular network link between the application servers and the database server. Without the request’s eye view provided by distributed tracing, it’s hard to find problems like this.

Some popular distributed tracing tools include Zipkin, Jaeger, and LightStep. Engineer Masroor Hasan has written a useful blog post describing how to use Jaeger for distributed tracing in Kubernetes.

The OpenTracing framework (part of the Cloud Native Computing Foundation) aims to provide a standard set of APIs and libraries for distributed tracing.
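
As a rough sketch of what instrumenting a request looks like with the OpenTracing Python API (the span and tag names here are illustrative, and by default the global tracer is a no-op; in a real deployment you’d configure a concrete tracer such as the Jaeger client):

    import time

    import opentracing  # the vendor-neutral tracing API

    def handle_request(user_id):
        tracer = opentracing.global_tracer()
        # One parent span covers the whole request...
        with tracer.start_active_span("handle_request") as scope:
            scope.span.set_tag("user.id", user_id)
            # ...and each stage gets a child span, so the finished trace shows
            # exactly where the time went.
            with tracer.start_active_span("db_query"):
                time.sleep(0.05)  # stand-in for a database call
            with tracer.start_active_span("render_response"):
                time.sleep(0.01)  # stand-in for rendering the response

    handle_request("user-123")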

Observability

Because the term monitoring means different things to different people, from plain old black-box checks to a combination of metrics, logging, and tracing, it’s becoming common to use observability as a catch-all term that covers all these techniques. The observability of your system is a measure of how well-instrumented it is, and how easily you can find out what’s going on inside it. Some people say that observability is a superset of monitoring, others that observability reflects a completely different mindset from traditional monitoring.

Perhaps the most useful way to distinguish these terms is to say that monitoring tells you whether the system is working, while observability prompts you to ask why it’s not working.

Observability is about understanding

More generally, observability is about understanding: understanding what your system does and how it does it. For example, if you roll out a code change that is designed to improve the performance of a particular feature by 10%, then observability can tell you whether or not it worked. If performance only went up a tiny bit, or worse, went down slightly, you need to revisit the code.

On the other hand, if performance went up 20%, the change succeeded beyond your expectations, and maybe you need to think about why your predictions fell short. Observability helps you build and refine your mental model of how the different parts of your system interact.

Observability is also about data. We need to know what data to generate, what to collect, how to aggregate it (if appropriate), what results to focus on, and how to query and display them.

Software is opaque

In traditional monitoring we have lots of data about the machinery: CPU loads, disk activity, network packets, and so on. But it’s hard to reason backwards from that about what our software is doing. To do that, we need to instrument the software itself:

Software is opaque by default; it must generate data in order to clue humans in on what it is doing. Observable systems allow humans to answer the question, “Is it working properly?”, and if the answer is no, to diagnose the scope of impact and identify what is going wrong.

Christine Spang (Nylas)

Building an observability culture

Even more generally, observability is about culture. It’s a key tenet of the DevOps philosophy to close the loop between developing code, and running it at scale in production. Observability is the primary tool for closing that loop. Developers and operations staff need to work closely together to instrument services for observability, and then figure out the best way to consume and act on the information it provides:

The goal of an observability team is not to collect logs, metrics or traces. It is to build a culture of engineering based on facts and feedback, and then spread that culture within the broader organization.

Brian Knox (DigitalOcean)

The Observability Pipeline

How does observability work, from a practical point of view? It’s common to have multiple data sources (logs, metrics, and so on) connected to various different data stores in a fairly ad hoc way.

For example, your logs might go to an ELK server, while metrics go to three or four different managed services, and traditional monitoring checks report to yet another service. This isn’t ideal.

For one thing, it’s hard to scale. The more data sources and stores you have, the more interconnections there are, and the more traffic over those connections. It doesn’t make sense to put engineering time into making all of those different kinds of connections stable and reliable.

Also, the more tightly integrated your systems become with specific solutions or providers, the harder it is to change them or to try out alternatives.

An increasingly popular way to address this problem is the observability pipeline:

With an observability pipeline, we decouple the data sources from the destinations and provide a buffer. This makes the observability data easily consumable. We no longer have to figure out what data to send from containers, VMs, and infrastructure, where to send it, and how to send it. Rather, all the data is sent to the pipeline, which handles filtering it and getting it to the right places. This also gives us greater flexibility in terms of adding or removing data sinks, and it provides a buffer between data producers and consumers.

Tyler Treat

An observability pipeline brings great advantages. Now, adding a new data source is just a matter of connecting it to your pipeline. Similarly, a new visualization or alerting service just becomes another consumer of the pipeline.

Because the pipeline buffers data, nothing gets lost. If there’s a sudden surge in traffic and an overload of metrics data, the pipeline will buffer it rather than drop samples.

Using an observability pipeline requires a standard metrics format (see “Prometheus”) and, ideally, structured logging from applications using JSON or some other sensible serialized data format. Instead of emitting raw text logs, and parsing them later with fragile regular expressions, start with structured data from the very beginning.
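
To illustrate the shape of the pattern (rather than any particular product), here’s a toy sketch: producers hand structured events to a buffer, and a single dispatcher fans them out to whatever sinks are currently configured:

    import json
    import queue
    import threading

    # The buffer decouples producers from consumers; a real pipeline would use
    # something durable such as Kafka or a hosted equivalent.
    events = queue.Queue(maxsize=10000)

    def emit(event):
        """Called by applications: just hand the structured event to the pipeline."""
        events.put(event)

    def stdout_sink(event):
        print(json.dumps(event))

    # Adding or removing a destination is just a matter of editing this list.
    sinks = [stdout_sink]

    def dispatcher():
        while True:
            event = events.get()
            for sink in sinks:
                sink(event)
            events.task_done()

    threading.Thread(target=dispatcher, daemon=True).start()

    emit({"type": "metric", "name": "requests_total", "value": 42})
    events.join()  # wait until the buffered event has been delivered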

Monitoring in Kubernetes

So now that we understand a little more about what black-box monitoring is and how it relates to observability in general, let’s see how it applies to Kubernetes applications.

External Black-Box Checks

As we’ve seen, black-box monitoring can only tell you that your application is down. But that’s still very useful information. All kinds of things could be wrong with a cloud native application, and it might still be able to serve some requests acceptably. Engineers can work on fixing internal problems like slow queries and elevated error rates, without users really being aware of an issue.

However, a more serious class of problems results in a full-scale outage; the application is unavailable or not working for the majority of users. This is bad for the users, and depending on the application, it may be bad for your business as well. In order to detect an outage, your monitoring needs to consume the service in the same way that a user would.

Monitoring mimics user behavior

For example, if it’s an HTTP service, the monitoring system needs to make HTTP requests to it, not just TCP connections. If the service just returns static text, monitoring can check the text matches some expected string. Usually, it’s a little bit more complicated than that, and as we saw in “Black-Box Monitoring”, your checks can be more complicated too.

In an outage situation, though, it’s quite likely that a simple text match will be sufficient to tell you the application is down. But making these black-box checks from inside your infrastructure (for example, in Kubernetes) isn’t enough. An outage can result from all sorts of problems and failures between the user and the outside edge of your infrastructure, including:

  • Bad DNS records

  • Network partitions

  • Packet loss

  • Misconfigured routers

  • Missing or bad firewall rules

  • Cloud provider outage

In all these situations, your internal metrics and monitoring might show no problems at all. Therefore, your top-priority observability task should be to monitor the availability of your services from some point external to your own infrastructure. There are many third-party services that can do this kind of monitoring for you (sometimes called monitoring as a service, or MaaS), including Uptime Robot, Pingdom, and Wormly.

Don’t build your own monitoring infrastructure

Most of these services have either a free tier, or fairly inexpensive subscriptions, and whatever you pay for them you should regard as an essential operating expense. Don’t bother trying to build your own external monitoring infrastructure; it’s not worth it. The cost of a year’s Pro subscription to Uptime Robot likely would not pay for a single hour of your engineers’ time.

Look for the following critical features in an external monitoring provider:

  • HTTP/HTTPS checks

  • TLS certificate checks (alerting if your certificate is invalid or expired)

  • Keyword matching (alerting when the keyword is missing, or when it’s present)

  • An API for automatically creating and updating checks

  • Alerts by email, SMS, webhook, or some other straightforward mechanism

Throughout this book we champion the idea of infrastructure as code, so it should be possible to automate your external monitoring checks with code as well. For example, Uptime Robot has a simple REST API for creating new checks, and you can automate it using a client library or command-line tool like uptimerobot.
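
For example, a monitoring-as-code sketch might look something like the following, assuming Uptime Robot’s v2 REST API and its newMonitor endpoint (check the current API documentation for the exact parameter names before relying on this):

    import os

    import requests

    API_URL = "https://api.uptimerobot.com/v2/newMonitor"  # assumed v2 endpoint

    def create_http_check(name, url):
        """Create a simple HTTP(S) check via the Uptime Robot API (sketch only)."""
        resp = requests.post(API_URL, data={
            "api_key": os.environ["UPTIMEROBOT_API_KEY"],  # keep secrets out of your code
            "format": "json",
            "type": 1,  # assumed to mean an HTTP(S) monitor
            "friendly_name": name,
            "url": url,
        }, timeout=10)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(create_http_check("Blog home page", "https://example.com/"))

A script like this can live in the same repository as the rest of your infrastructure code, so your external checks are versioned and reviewed along with everything else.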

It doesn’t matter which external monitoring service you use, so long as you use one. But don’t stop there. In the next section we’ll see what we can do to monitor the health of applications inside the Kubernetes cluster itself.

Internal Health Checks

Cloud native applications fail in complex, unpredictable, and hard-to-detect ways. Applications have to be designed to be resilient and degrade gracefully in the face of unexpected failures, but ironically, the more resilient they are, the harder it is to detect these failures by black-box monitoring.

To solve this problem, applications can, and should, do their own health checking. The developer of a particular feature or service is best placed to know what it needs to be healthy, and she can write code that checks this and exposes the results in a way that can be monitored from outside the container (like an HTTP endpoint).

Are users happy?

Kubernetes gives us a simple mechanism for applications to advertise their liveness or readiness, as we saw in “Liveness Probes”, so this is a good place to start. Usually, Kubernetes liveness or readiness probes are pretty simple; the application always responds “OK” to any requests. If it doesn’t respond, therefore, Kubernetes considers it to be down or unready.

However, as many programmers know from bitter experience, just because a program runs, doesn’t necessarily mean it works correctly. A more sophisticated readiness probe should ask “What does this application need to do its job?”

For example, if it needs to talk to a database, it can check that it has a valid and responsive database connection. If it depends on other services, it can check the services’ availability. (Because health checks are run frequently, though, they shouldn’t do anything too expensive that might affect serving requests from real users.)

Note that we’re still giving a binary yes/no response to the readiness probe. It’s just a more informed answer. What we’re trying to do is answer the question “Are users happy?” as accurately as possible.
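
Here’s a sketch of what such a health endpoint might look like, using Flask and a hypothetical database check; in the Pod spec, you’d point a readiness probe at the /healthz path:

    from flask import Flask

    app = Flask(__name__)

    def database_is_reachable():
        """Hypothetical dependency check; replace the body with a cheap real query."""
        try:
            # db.execute("SELECT 1")  # placeholder: keep this cheap, it runs frequently
            return True
        except Exception:
            return False

    @app.route("/healthz")
    def healthz():
        if database_is_reachable():
            return "OK", 200
        # A non-2xx response makes the readiness probe fail, so Kubernetes stops
        # routing traffic to this Pod until the dependency recovers.
        return "database unavailable", 503

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)

Returning 503 here is exactly the “I’m fine, but I can’t serve user requests at the moment” signal discussed in the next section.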

Services and circuit breakers

As you know, if a container’s liveness check fails, Kubernetes will restart it automatically, in an exponential backoff loop. This isn’t really that helpful in the situation where there’s nothing wrong with the container, but one of its dependencies is failing. The semantics of a failed readiness check, on the other hand, are “I’m fine, but I can’t serve user requests at the moment.”

In this situation, the container will be removed from any Services it’s a backend for, and Kubernetes will stop sending it requests until it becomes ready again. This is a better way to deal with a failed dependency.

Suppose you have a chain of 10 microservices, each of which depends on the next for some critical part of its work. The last service in the chain fails. The next-to-last service will detect this and start failing its readiness probe. Kubernetes will disconnect it, and the next service in line detects this, and so on up the chain. Eventually the frontend service will fail, and (hopefully) a black-box monitoring alert will be tripped.

Once the problem with the base service is fixed, or maybe cured by an automatic restart, all the other services in the chain will automatically become ready again in turn, without being restarted or losing any state. This is an example of what’s called a circuit breaker pattern. When an application detects a downstream failure, it takes itself out of service (via the readiness check) to prevent any more requests being sent to it until the problem is fixed.

Graceful degradation

While a circuit breaker is useful for surfacing problems as soon as possible, you should design your services to avoid having the whole system fail when one or more component services are unavailable. Instead, try to make your services degrade gracefully: even if they can’t do everything they’re supposed to, maybe they can still do some things.

In distributed systems, we have to assume that services, components, and connections will fail mysteriously and intermittently more or less all the time. A resilient system can handle this without failing completely.

Summary

There’s a lot to say about monitoring. We didn’t have space to say as much as we wanted to, but we hope this chapter has given you some useful information about traditional monitoring techniques, what they can do and what they can’t do, and how things need to change in a cloud native environment.

The notion of observability introduces us to a bigger picture than traditional log files and black-box checks. Metrics form an important part of this picture, and in the next and final chapter, we’ll take you on a deep dive into the world of metrics in Kubernetes.

Before turning the page, though, you might like to recall these key points:

  • Black-box monitoring checks observe the external behavior of a system, to detect predictable failures.

  • Distributed systems expose the limitations of traditional monitoring, because they’re never simply up or down: they exist in a constant state of partially degraded service. In other words, nothing is ever completely right aboard a ship.

  • Logs can be useful for post-incident troubleshooting, but they’re hard to scale.

  • Metrics open up a new dimension beyond simply working/not working, and give you continuous numerical time-series data on hundreds or thousands of aspects of your system.

  • Metrics can help you answer the why question, as well as identify problematic trends before they lead to outages.

  • Tracing records events with precise timing through the life cycle of an individual request, to help you debug performance problems.

  • Observability is the union of traditional monitoring, logging, metrics, and tracing, and all the other ways you can understand your system.

  • Observability also represents a shift toward a team culture of engineering based on facts and feedback.

  • It’s still important to check that your user-facing services are up, with external black-box checks, but don’t try to build your own: use a third-party monitoring service like Uptime Robot.

  • Nines don’t matter if users aren’t happy.
