In the previous chapter, we looked at serverless computing and its manifestations on Kubernetes. A lot of innovation happens in this space, and it is both super useful and fascinating to follow the evolution.
In this chapter, we’re going to talk about how to make sure your systems are up and running and performing correctly, and how to respond when they’re not. In Chapter 3, High Availability and Reliability, we discussed related topics. The focus here is on knowing what’s going on in your system and what practices and tools you can use.
There are many aspects to monitoring, such as logging, metrics, distributed tracing, error reporting, and alerting. Practices like auto-scaling and self-healing depend on monitoring to detect that there is a need to scale or to heal.
The topics we will cover in this chapter include:

Understanding observability and its components
Logging with Kubernetes
Recording metrics with the Kubernetes Metrics Server and Prometheus
Visualizing and alerting with Grafana
The Kubernetes community recognizes the importance of monitoring and has put a lot of effort into making sure Kubernetes has a solid monitoring story. The Cloud Native Computing Foundation (CNCF) is the de facto curator of cloud-native infrastructure projects. It has graduated twenty projects so far. Kubernetes was the first project to graduate, and three of the other early graduates (all graduated more than two years ago) are focused on monitoring: Prometheus, Fluentd, and Jaeger. This tells you that monitoring and observability are foundational for a large-scale Kubernetes-based system. Before we dive into the ins and outs of Kubernetes monitoring and specific projects and tools, we should get a better understanding of what monitoring is all about. A good framework for thinking about monitoring is how observable your system is.
Observability is a big word. What does it mean in practice? There are different definitions out there and big debates about how monitoring and observability are similar and different. I take the stance that observability is the property of the system that defines what we can tell about the state and behavior of the system right now and historically. In particular, we are interested in the health of the system and its components. Monitoring is the collection of tools, processes, and techniques that we use to increase the observability of the system.
There are different facets of information that we need to collect, record, and aggregate in order to get a good sense of what our system is doing. Those facets include logs, metrics, distributed traces, and errors. The monitoring or observability data is multidimensional and crosses many levels. Just collecting it doesn’t help much. We need to be able to query it, visualize it, and alert other systems when things go wrong. Let’s review the various components of observability.
Logging is a key monitoring tool. Every self-respecting long-running software must have logs. Logs capture timestamped events. They are critical for many applications like business intelligence, security, compliance, audits, debugging, and troubleshooting. It’s important to understand that a complicated distributed system will have different logs for different components, and extracting insights from logs is not a trivial undertaking.
There are several key attributes to logs: format, storage, and aggregation.
Logs may come in various formats. Plain text is very common and human-readable, but it requires a lot of work to parse and merge with other logs. Structured logs are better suited to large systems because they can be processed at scale. Binary logs make sense for systems that generate a lot of logs, as they are more space efficient, but they require custom tools and processing to extract their information.
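To make the contrast concrete, here is a minimal Python sketch (the event and field names are invented for illustration) that emits the same event as a plain-text line and as a structured JSON log line:

```python
import json
from datetime import datetime, timezone

def plain_log(level, message):
    # Human-readable, but merging and querying requires fragile string parsing.
    ts = datetime.now(timezone.utc).isoformat()
    return f"{ts} [{level}] {message}"

def structured_log(level, message, **fields):
    # Every field is addressable by name, so logs can be processed at scale.
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    return json.dumps(record)

print(plain_log("INFO", "user logged in"))
print(structured_log("INFO", "user logged in", user_id=42))
```

The structured variant costs a few more bytes per line, but a log aggregator can filter on `user_id` directly instead of running regular expressions over free text.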
Logs can be stored in memory, on the file system, in a database, in cloud storage, sent to a remote logging service, or any combination of these. In the cloud-native world, where software runs in containers, it’s important to pay special attention to where logs are stored and how to fetch them when necessary.
Questions like durability come to mind when containers can come and go. In Kubernetes, the standard output and standard error streams of containers are automatically logged and available, even when the pod terminates. But, issues like having enough space for logs and log rotation are always relevant.
In the end, the best practice is to send local logs to a centralized logging service that is designed to handle various log formats, persist them as necessary, and aggregate many types of logs in a way that can be queried and reasoned about.
Metrics measure some aspects of the system over time. Metrics are time series of numerical values (typically floating point numbers). Each metric has a name and often a set of labels that help later in slicing and dicing. For example, the CPU utilization of a node or the error rate of a service are metrics.
Metrics are much more economical than logs. They require a fixed amount of space per time period; unlike logs, their volume doesn't ebb and flow with incoming traffic.
Also, since metrics are numerical in nature, they don’t need parsing or transformations. Metrics can be easily combined and analyzed using statistical methods and serve as triggers for events and alerts.
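As a small illustration of combining metrics numerically, here is a sketch (with made-up counter values) that turns cumulative request and failure counters, like those a scraper would collect, into per-interval error rates:

```python
# Hypothetical cumulative counters sampled at fixed intervals,
# as (total_requests, failed_requests) pairs.
samples = [(100, 2), (250, 5), (430, 23), (600, 25)]

def error_rates(samples):
    # Counters only go up, so derive per-interval deltas first,
    # then divide failures by requests for each interval.
    rates = []
    for (t0, f0), (t1, f1) in zip(samples, samples[1:]):
        dt, df = t1 - t0, f1 - f0
        rates.append(df / dt if dt else 0.0)
    return rates

rates = error_rates(samples)
print(rates)  # per-interval error rates; the middle interval spikes to 10%
```

A spike like the 10% interval above is exactly the kind of derived value that can trigger an alert, with no parsing involved.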
A lot of metrics at different levels (node, container, process, network, and disk) are often collected for you automatically by the OS, cloud provider, or Kubernetes.
But you can also create custom metrics that map to high-level concerns of your system and can be configured with application-level policies.
Modern distributed systems often use microservice-based architecture, where an incoming request is bounced between multiple microservices, waits in queues, and triggers serverless functions. When you try to analyze errors, failures, data integrity issues, or performance issues, it is critical to be able to follow the path of a request. This is where distributed tracing comes in.
A distributed trace is a collection of spans and references. You can think of a trace as a directed acyclic graph (DAG) that represents a request’s traversal through the components of a distributed system. Each span records the time the request spent in a given component and references are the edges of the graph that connect one span to the following spans.
Here is an example:
Figure 13.1: The path of a sample distributed trace
Distributed tracing is indispensable for understanding complex distributed systems.
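To make the span/reference model concrete, here is a toy sketch (the span names and timings are invented) that models a trace as spans with parent references and computes per-span durations:

```python
# Invented spans: (span_id, parent_id, start_ms, end_ms).
# The parent references are the edges of the trace DAG.
spans = [
    ("frontend", None, 0, 120),
    ("auth", "frontend", 5, 25),
    ("orders", "frontend", 30, 110),
    ("db", "orders", 40, 100),
]

def span_durations(spans):
    # How long the request spent in each component.
    return {sid: end - start for sid, _, start, end in spans}

def children(spans, parent):
    # Follow the references (edges) out of a given span.
    return [sid for sid, pid, _, _ in spans if pid == parent]

durations = span_durations(spans)
print(durations["frontend"], children(spans, "frontend"))
```

Real tracing systems such as Jaeger store exactly this kind of structure and let you walk it visually, so you can see at a glance that most of the frontend's 120 ms was spent in the orders service, and most of that in the database.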
Error and exception reporting is sometimes done as part of logging. You definitely want to log errors, and looking at logs when things go wrong is a time-honored tradition. However, there are levels of capturing error information that go beyond logging. When an error occurs in one of your applications, it is useful to capture an error message, the location of the error in the code, and the stack trace. This is pretty standard and most programming languages can provide all this information, although stack traces are multi-line and don't fit well with line-based logs. A useful piece of additional information to capture is the local state at each level of the stack trace. This helps when a problem manifests in a central place, but local state, such as the number and size of entries in some list, points to the root cause.
A central error reporting service like Sentry or Rollbar provides a lot of value specific to errors beyond logging, such as rich error information, context, and user information.
OK. You’ve done a great job of collecting logs, defining metrics, tracing your requests, and reporting rich errors. Now, you want to figure out what your system, or parts of it, are doing. What is the baseline? How does traffic fluctuate throughout the day, the week, and on holidays? When the system is under stress, which parts are the most vulnerable?
In a complicated system that involves hundreds and thousands of services and data stores and integrates with external systems, you can’t just look at the raw log files, metrics, and traces.
You need to be able to combine a lot of information and build system health dashboards, visualize your infrastructure, and create business-level reports and diagrams.
You may get some of it (especially for infrastructure) automatically if you’re using cloud platforms. But you should expect to do some serious work around visualization and dashboards.
Dashboards are great for humans who want to get a broad view of the system and drill down to understand how it behaves. Alerting is all about detecting abnormal situations and triggering some action. Ideally, your system is self-healing and can recover on its own from most situations. But it should at least report the incident, so humans can review what happened at their leisure and decide if further action is needed.
Alerting can be integrated with emails, chat rooms, and on-call systems. It is often linked to metrics, and when certain conditions apply, an alert is raised.
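As a sketch of the idea, here is a toy evaluator (the thresholds and samples are invented) that raises an alert only when a metric breaches a threshold for several consecutive evaluations, similar in spirit to a rule with a duration clause:

```python
# Invented CPU utilization samples (fractions of a core).
cpu = [0.55, 0.91, 0.93, 0.95, 0.60]

def evaluate_alert(samples, threshold, sustained):
    # Fire only when the condition holds for `sustained` consecutive
    # evaluations, which avoids alerting on momentary blips.
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= sustained:
            return True
    return False

print(evaluate_alert(cpu, 0.9, 3))  # True: three consecutive breaches
```

Real alerting systems add grouping, routing, and silencing on top, but the core is the same: a condition over metrics, sustained for some duration.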
Now that we have covered, in general, the different elements involved in monitoring complex systems, let’s see how to do it with Kubernetes.
We need to consider carefully our logging strategy with Kubernetes. There are several types of logs that are relevant for monitoring purposes. Our workloads run in containers, of course, and we care about these logs, but we also care about the logs of Kubernetes components like the API server, kubelet, and container runtime.
In addition, chasing logs across multiple nodes and containers is a non-starter. The best practice is to use central logging (a.k.a. log aggregation). There are several options here, which we will explore soon.
Kubernetes stores the standard output and standard error of every container. They are available through the kubectl logs command.
Here is a pod manifest that prints the current date and time every 10 seconds:
apiVersion: v1
kind: Pod
metadata:
  name: now
spec:
  containers:
  - name: now
    image: g1g1/py-kube:0.3
    command: ["/bin/bash", "-c", "while true; do sleep 10; date; done"]
We can save it to a file called now-pod.yaml and create it:
$ k apply -f now-pod.yaml
pod/now created
To check out the logs, we use the kubectl logs command:
$ kubectl logs now
Sat Jan 4 00:32:38 UTC 2020
Sat Jan 4 00:32:48 UTC 2020
Sat Jan 4 00:32:58 UTC 2020
Sat Jan 4 00:33:08 UTC 2020
Sat Jan 4 00:33:18 UTC 2020
A few points about container logs: the kubectl logs command expects a pod name. If the pod has multiple containers, you need to specify the container name too:
$ k logs <pod name> -c <container name>
If a deployment or replica set creates multiple copies of the same pod, you can query the logs of all pods in a single call by using a shared label:
$ k logs -l <label>
If a container crashes for some reason, you can use the kubectl logs -p command to look at the logs from the crashed container.
If you run Kubernetes in a managed environment like GKE, EKS, or AKS, you won’t be able to access Kubernetes component logs directly, but this is expected. You’re not responsible for the Kubernetes control plane. However, the logs of control plane components (like the API server and cluster autoscaler), as well as node components (like the kubelet and container runtime), may be important for troubleshooting issues. Cloud providers often offer proprietary ways to access these logs.
Here are the standard control plane components and their log location if you run your own Kubernetes control plane:
/var/log/kube-apiserver.log
/var/log/kube-scheduler.log
/var/log/kube-controller-manager.log
The worker node components and their log locations are:
/var/log/kubelet.log
/var/log/kube-proxy.log
Note that on a systemd-based system, you’ll need to use journalctl to view the worker node logs.
Reading container logs is fine for quick-and-dirty troubleshooting of problems in a single pod. To diagnose and debug system-wide issues, we need centralized logging (a.k.a. log aggregation). All the logs from our containers should be sent to a central repository and made accessible for slicing and dicing using filters and queries.
When deciding on your central logging approach, there are several important decisions to make: how do you collect the logs, where do you store and manage them, and what information is safe to log? We will answer these questions in the following sections.
Logs are typically collected by an agent that runs close to the process generating them and makes sure to deliver them to the central logging service.
Let’s look at the common approaches.
In this approach, there is no log agent. It is the responsibility of each application container to send logs to the remote logging service. This is typically done through a client library. It is a high-touch approach and applications need to be aware of the logging target as well as be configured with proper credentials.
Figure 13.2: Direct logging
If you ever want to change your log collection strategy, it will require changes to each and every application (at least bumping to a new version of the library).
The node agent approach is best when you control the worker nodes, and you want to abstract away the act of log aggregation from your applications. Each application container can simply write to standard output and standard error and the agent running on each node will intercept the logs and deliver them to the remote logging service.
Typically, you deploy the node agent as a DaemonSet, so as nodes are added to or removed from the cluster, the log agent is always present without additional work.
Figure 13.3: Using a node agent for logging
The sidecar container approach is best when you don’t have control over your cluster nodes, or when you use some serverless computing infrastructure to deploy containers but don’t want to use the direct logging approach. The node agent approach is out of the question if you don’t control the node and can’t install the agent, but you can attach a sidecar container that collects the logs and delivers them to the central logging service. It is less efficient than the node agent approach, because each container needs its own logging sidecar, but it can be done at deployment time without requiring code changes or application knowledge.
Figure 13.4: Using a sidecar container for logging
Now that we covered the topic of log collection, let’s consider how to store and manage those logs centrally.
If your entire system is running in a single Kubernetes cluster, then cluster-level logging may be a great choice. You can install a central logging service like Grafana Loki, ElasticSearch, or Graylog in your cluster and enjoy a cohesive log aggregation experience without sending your log data elsewhere.
There are situations where in-cluster central logging doesn’t cut it for various reasons, such as running multiple clusters or needing logs to survive the loss of a cluster. In those cases, logging to a remote central location is the answer.
Logging to a remote central location can be done by all methods: direct logging, node agent logging, or sidecar logging. In all cases, the endpoint and credentials to the remote logging service must be provided, and the logging is done against that endpoint. In most cases, this will be done via a client library that hides the details from the applications. As for system-level logging, the common method is to collect all the necessary logs by a dedicated logging agent and forward them to the remote logging service.
OK. We can collect the logs and send them to a central logging service. If the central logging service is remote, you might need to be selective about which information you log.
For example, personally identifiable information (PII) and protected health information (PHI) are two categories of information that you probably shouldn’t log without making sure access to the logs is properly controlled.
It is common to redact or remove PII like user names and emails from log statements.
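Here is a minimal sketch of log redaction, assuming a simple regular expression is good enough for your data (real-world PII detection usually needs more careful patterns than this one):

```python
import re

# Illustrative pattern only; tune redaction rules to your actual log data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(line):
    # Replace anything that looks like an email address before the
    # line leaves the cluster for the remote logging service.
    return EMAIL.sub("[REDACTED-EMAIL]", line)

print(redact("login failed for alice@example.com from 10.0.0.7"))
```

Redaction like this typically runs in the log agent or in a processing pipeline stage, so individual applications don't need to be aware of it.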
Fluentd (https://www.fluentd.org) is an open source CNCF-graduated project. It is considered best in class in Kubernetes, and it can integrate with pretty much every logging backend you want. If you set up your own centralized logging solution, I recommend using Fluentd. Fluentd operates as a node agent. The following diagram shows how Fluentd can be deployed as a DaemonSet in a Kubernetes cluster:
Figure 13.5: Deploying Fluentd as a DaemonSet in a Kubernetes cluster
One of the most popular DIY centralized logging solutions is ELK, where E stands for ElasticSearch, L stands for LogStash, and K stands for Kibana. On Kubernetes, EFK (where Fluentd replaces LogStash) is very common.
Fluentd has a plugin-based architecture, so don’t feel limited to EFK. Fluentd doesn’t require a lot of resources, but if you really need a high-performance solution, Fluentbit (http://fluentbit.io/) is a pure forwarder that uses barely 450 KB of memory.
We have covered a lot of ground about logging. Let’s look at the next piece of the observability story, which is metrics.
Kubernetes has a Metrics API. It supports node and pod metrics out of the box. You can also define your own custom metrics.
A metric contains a timestamp, a usage field, and the time range in which the metric was collected (many metrics are accumulated over a time period). Here is the API definition for node metrics:
type NodeMetrics struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    Timestamp metav1.Time
    Window    metav1.Duration

    Usage corev1.ResourceList
}

// NodeMetricsList is a list of NodeMetrics.
type NodeMetricsList struct {
    metav1.TypeMeta
    // Standard list metadata.
    // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
    metav1.ListMeta

    // List of node metrics.
    Items []NodeMetrics
}
The usage field type is ResourceList, but it’s actually a map of a resource name to a quantity:
// ResourceList is a set of (resource name, quantity) pairs.
type ResourceList map[ResourceName]resource.Quantity
A Quantity represents a fixed-point number. It allows easy marshaling/unmarshaling in JSON and YAML, and provides accessors such as String() and Int64():
type Quantity struct {
    // i is the quantity in int64 scaled form, if d.Dec == nil
    i int64Amount

    // d is the quantity in inf.Dec form if d.Dec != nil
    d infDecAmount

    // s is the generated value of this quantity to avoid recalculation
    s string

    // Change Format at will. See the comment for Canonicalize for more details.
    Format
}
The Kubernetes Metrics Server implements the Kubernetes Metrics API.
You can deploy it with Helm:
$ helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
$ helm upgrade --install metrics-server metrics-server/metrics-server
Release "metrics-server" does not exist. Installing it now.
NAME: metrics-server
LAST DEPLOYED: Sun Oct 9 14:11:54 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
***********************************************************************
* Metrics Server *
***********************************************************************
Chart version: 3.8.2
App version: 0.6.1
Image tag: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
***********************************************************************
On minikube, you can just enable it as an add-on:
$ minikube addons enable metrics-server
▪ Using image k8s.gcr.io/metrics-server/metrics-server:v0.4.2
The 'metrics-server' addon is enabled
Note that at the time of writing, there was an issue with the Metrics Server on minikube, which was fixed in Kubernetes 1.27 (see https://github.com/kubernetes/minikube/issues/13969).
We will use a kind cluster to deploy the metrics-server.
After giving the Metrics Server a few minutes to collect some data, you can query it for node metrics:
$ k get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .
{
  "kind": "NodeMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metadata": {
        "name": "kind-control-plane",
        "creationTimestamp": "2022-10-09T21:24:12Z",
        "labels": {
          "beta.kubernetes.io/arch": "arm64",
          "beta.kubernetes.io/os": "linux",
          "kubernetes.io/arch": "arm64",
          "kubernetes.io/hostname": "kind-control-plane",
          "kubernetes.io/os": "linux",
          "node-role.kubernetes.io/control-plane": "",
          "node.kubernetes.io/exclude-from-external-load-balancers": ""
        }
      },
      "timestamp": "2022-10-09T21:24:05Z",
      "window": "20.022s",
      "usage": {
        "cpu": "115537281n",
        "memory": "47344Ki"
      }
    }
  ]
}
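The usage values above are quantity strings: CPU is reported in nanocores (the n suffix) and memory in kibibytes (the Ki suffix). Here is a rough sketch of decoding such strings into plain numbers (only a handful of suffixes are handled here; the real Kubernetes Quantity type supports more):

```python
# Decimal suffixes (n, u, m, k, M, G) and binary suffixes (Ki, Mi, Gi).
SUFFIXES = {"n": 1e-9, "u": 1e-6, "m": 1e-3,
            "k": 1e3, "M": 1e6, "G": 1e9,
            "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def parse_quantity(s):
    # Try the longest suffixes first so "Ki" wins over "k".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * SUFFIXES[suffix]
    return float(s)

cpu_cores = parse_quantity("115537281n")   # roughly 0.12 cores
memory_bytes = parse_quantity("47344Ki")   # roughly 48.5 MB
print(cpu_cores, memory_bytes)
```

This is how kubectl top can show the same node as 125m CPU: the raw nanocore value is just scaled into millicores for display.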
In addition, the kubectl top command gets its information from the Metrics Server:
$ k top nodes
NAME                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
kind-control-plane   125m         3%     46Mi            1%
We can also get metrics for pods:
$ k top pods -A
NAMESPACE            NAME                                          CPU(cores)   MEMORY(bytes)
default              metrics-server-554f79c654-hw2c7               4m           18Mi
kube-system          coredns-565d847f94-t8knf                      2m           12Mi
kube-system          coredns-565d847f94-wdqzx                      2m           14Mi
kube-system          etcd-kind-control-plane                       24m          28Mi
kube-system          kindnet-fvfs7                                 1m           7Mi
kube-system          kube-apiserver-kind-control-plane             43m          339Mi
kube-system          kube-controller-manager-kind-control-plane    18m          48Mi
kube-system          kube-proxy-svdc6                              1m           11Mi
kube-system          kube-scheduler-kind-control-plane             4m           21Mi
local-path-storage   local-path-provisioner-684f458cdd-24w88       2m           6Mi
The metrics server is also the source of performance information in the Kubernetes dashboard.
Prometheus (https://prometheus.io/) is yet another graduated CNCF open source project. It focuses on metrics collection and alert management. It has a simple yet powerful data model for managing time-series data and a sophisticated query language. It is considered best in class in the Kubernetes world. Prometheus lets you define recording rules that are fired at regular intervals and collect data from targets. In addition, you can define alerting rules that evaluate a condition and trigger alerts if the condition is satisfied.
It has several unique features compared to other monitoring solutions, such as a multi-dimensional data model, a pull-based collection model over HTTP, autonomous single-server nodes that don’t rely on distributed storage, and targets discovered via service discovery or static configuration.
The following diagram illustrates the entire system:
Figure 13.6: Prometheus architecture
Prometheus is a complex beast, as you can see. The best way to install it is by using the Prometheus Operator (https://github.com/prometheus-operator/). The kube-prometheus sub-project (https://github.com/prometheus-operator/kube-prometheus) installs the operator itself along with a lot of additional components, and configures them in a robust manner.
The first step is cloning the git repo:
$ git clone https://github.com/prometheus-operator/kube-prometheus.git
Cloning into 'kube-prometheus'...
remote: Enumerating objects: 17062, done.
remote: Counting objects: 100% (185/185), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 17062 (delta 135), reused 155 (delta 116), pack-reused 16877
Receiving objects: 100% (17062/17062), 8.76 MiB | 11.63 MiB/s, done.
Resolving deltas: 100% (11135/11135), done.
Next, the setup manifests install several CRDs and create a namespace called monitoring:
$ kubectl create -f manifests/setup
customresourcedefinition.apiextensions.k8s.io/alertmanagerconfigs.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com created
namespace/monitoring created
Now, we can install the manifests:
$ kubectl create -f manifests
...
The output is too long to display, but let’s examine what was actually installed. It turns out that it installed multiple deployments, StatefulSets, a DaemonSet, and many services:
$ k get deployments -n monitoring
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
blackbox-exporter     1/1     1            1           3m38s
grafana               1/1     1            1           3m37s
kube-state-metrics    1/1     1            1           3m37s
prometheus-adapter    2/2     2            2           3m37s
prometheus-operator   1/1     1            1           3m37s
$ k get statefulsets -n monitoring
NAME                READY   AGE
alertmanager-main   3/3     2m57s
prometheus-k8s      2/2     2m57s
$ k get daemonsets -n monitoring
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   1         1         1       1            1           kubernetes.io/os=linux   4m4s
$ k get services -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-main       ClusterIP   10.96.231.0     <none>        9093/TCP,8080/TCP            4m25s
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   3m35s
blackbox-exporter       ClusterIP   10.96.239.94    <none>        9115/TCP,19115/TCP           4m25s
grafana                 ClusterIP   10.96.80.116    <none>        3000/TCP                     4m24s
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP            4m24s
node-exporter           ClusterIP   None            <none>        9100/TCP                     4m24s
prometheus-adapter      ClusterIP   10.96.139.149   <none>        443/TCP                      4m24s
prometheus-k8s          ClusterIP   10.96.51.85     <none>        9090/TCP,8080/TCP            4m24s
prometheus-operated     ClusterIP   None            <none>        9090/TCP                     3m35s
prometheus-operator     ClusterIP   None            <none>        8443/TCP                     4m24s
This is a high-availability setup. As you can see, Prometheus itself is deployed as a StatefulSet with two replicas, and the alert manager is deployed as a StatefulSet with three replicas.
The deployments include the blackbox-exporter, Grafana for visualizing the metrics, kube-state-metrics to collect Kubernetes-specific metrics, the Prometheus adapter (a compatible replacement for the standard Kubernetes Metrics Server), and finally, the Prometheus Operator itself.
Prometheus has a basic web UI that you can use to explore its metrics. Let’s do port forwarding to localhost:
$ k port-forward -n monitoring statefulset/prometheus-k8s 9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Then, you can browse to http://localhost:9090, where you can select different metrics and view raw data or graphs:
Figure 13.7: Prometheus UI
Prometheus records an outstanding number of metrics (9090 in my current setup). The most relevant metrics on Kubernetes are those exposed by kube-state-metrics and the node exporter.
The Prometheus Operator already installs kube-state-metrics. It is a service that listens to Kubernetes events and exposes them through a /metrics HTTP endpoint in the format Prometheus expects. In other words, it is a Prometheus exporter.
This is very different from the Kubernetes metrics server, which is the standard way Kubernetes exposes metrics for nodes and pods and allows you to expose your own custom metrics too. The Kubernetes metrics server is a service that periodically queries Kubernetes for data and stores it in memory. It exposes its data through the Kubernetes Metrics API. The Prometheus adapter adapts the Kubernetes metrics server information and exposes it in Prometheus format.
The metrics exposed by kube-state-metrics are vast. Here is the list of metric groups, which is pretty massive on its own. Each group corresponds to a Kubernetes API object and contains multiple metrics:

CertificateSigningRequest Metrics
ConfigMap Metrics
CronJob Metrics
DaemonSet Metrics
Deployment Metrics
Endpoint Metrics
HorizontalPodAutoscaler Metrics
Ingress Metrics
Job Metrics
LimitRange Metrics
MutatingWebhookConfiguration Metrics
Namespace Metrics
NetworkPolicy Metrics
Node Metrics
PersistentVolume Metrics
PersistentVolumeClaim Metrics
PodDisruptionBudget Metrics
Pod Metrics
ReplicaSet Metrics
ReplicationController Metrics
ResourceQuota Metrics
Secret Metrics
Service Metrics
StatefulSet Metrics
StorageClass Metrics
ValidatingWebhookConfiguration Metrics
VerticalPodAutoscaler Metrics
VolumeAttachment Metrics

For example, here are the metrics collected for Kubernetes services:
kube_service_info
kube_service_labels
kube_service_created
kube_service_spec_type
kube-state-metrics collects node information from the Kubernetes API server, but this information is pretty limited. Prometheus comes with its own node exporter, which collects tons of low-level information about the nodes. Remember that Prometheus may be the de facto standard metrics platform on Kubernetes, but it is not Kubernetes-specific. For other systems that use Prometheus, the node exporter is super important. On Kubernetes, if you manage your own nodes, this information can be invaluable too.
Here is a small subset of the metrics exposed by the node exporter:
Figure 13.8: Node exporter metrics
The built-in metrics, node metrics, and Kubernetes metrics are great, but very often, the most interesting metrics are domain-specific and need to be captured as custom metrics. There are two ways to do this: instrument your application (or write an exporter) to expose metrics that Prometheus can scrape, or push metrics to the Prometheus Push gateway.
In my book Hands-On Microservices with Kubernetes (https://www.packtpub.com/product/hands-on-microservices-with-kubernetes/9781789805468), I provide a full-fledged example of how to implement your own exporter from a Go service.
The Push gateway is more appropriate if you already have a push-based metrics collector in place, and you just want to have Prometheus record those metrics. It provides a convenient migration path from other metrics collection systems to Prometheus.
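To give a feel for the exporter route, here is a hand-rolled sketch that renders a hypothetical application metric in the Prometheus text exposition format (in practice, you would use an official Prometheus client library rather than formatting this by hand):

```python
def render_metric(name, value, help_text, labels=None):
    # Produce one metric in the Prometheus text exposition format,
    # the format a /metrics endpoint serves to the scraper.
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (f"# HELP {name} {help_text}\n"
            f"# TYPE {name} gauge\n"
            f"{name}{label_str} {value}\n")

# Hypothetical business-level metric, not something Kubernetes provides.
print(render_metric("orders_in_flight", 7,
                    "Orders currently being processed.",
                    {"region": "us-east-1"}))
```

Serving this text over HTTP at /metrics and pointing a ServiceMonitor at it is all it takes for Prometheus to start scraping your custom metric.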
Collecting metrics is great, but when things go south (or ideally, before things go south), you want to get notified. In Prometheus, this is the job of the Alertmanager. You can define alerting rules as expressions based on metrics; when an expression evaluates to true, it triggers an alert.
Alerts can serve multiple purposes. They can be handled automatically by a controller that is responsible for mitigating specific problems, they can wake up a poor on-call engineer at 3 am, they can result in an email or group chat message, or any combination of these.
The Alertmanager lets you group similar alerts into a single notification, inhibit notifications when other alerts are already firing, and silence alerts. All those features are useful when a large-scale system is in trouble: the stakeholders are aware of the situation and don’t need repeated alerts or multiple variations of the same alert firing constantly while they troubleshoot and try to find the root cause.
One of the cool things about the Prometheus Operator is that it manages everything through CRDs. That includes all the rules, including the alerting rules:
$ k get prometheusrules -n monitoring
NAME AGE
alertmanager-main-rules 11h
grafana-rules 11h
kube-prometheus-rules 11h
kube-state-metrics-rules 11h
kubernetes-monitoring-rules 11h
node-exporter-rules 11h
prometheus-k8s-prometheus-rules 11h
prometheus-operator-rules 11h
Here is the NodeFilesystemAlmostOutOfSpace alert, which checks whether the available disk space for a filesystem on a node stays below a threshold for 30 minutes. Notice that there are two almost identical alerts: when the available space drops below 5%, a warning alert is triggered, and when it drops below 3%, a critical alert is triggered. Note the runbook_url field, which points to a page that explains more about the alert and how to mitigate the problem:
$ k get prometheusrules node-exporter-rules -n monitoring -o yaml | grep NodeFilesystemAlmostOutOfSpace -A 14
- alert: NodeFilesystemAlmostOutOfSpace
annotations:
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available space left.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
expr: |
(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 30m
labels:
severity: warning
- alert: NodeFilesystemAlmostOutOfSpace
annotations:
description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
has only {{ printf "%.2f" $value }}% available space left.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
summary: Filesystem has less than 3% space left.
expr: |
(
node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
and
node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
)
for: 30m
labels:
severity: critical
Alerts are very important, but there are cases where you want to visualize the overall state of your system or drill down into specific aspects. This is where visualization comes into play.
You’ve already seen the Prometheus expression browser, which can display your metrics as a graph or in table form. But we can do much better. Grafana (https://grafana.com) is an open source monitoring system that specializes in stunningly beautiful visualizations of metrics. It doesn’t store the metrics itself but works with many data sources, and Prometheus is one of them. Grafana has alerting capabilities too, although when working with Prometheus, you may prefer to rely on its Alertmanager.
The kube-prometheus stack installs Grafana and configures a large number of useful Kubernetes dashboards. Check out this beautiful dashboard of Kubernetes capacity:
Figure 13.9: Grafana dashboard
To access Grafana, type the following commands:
$ k port-forward -n monitoring deploy/grafana 3000
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000
Then you can browse to http://localhost:3000 and have some fun with Grafana. Grafana requires a username and password; the default credentials are admin for the user and admin for the password.
Here are some of the default dashboards that are configured when deploying Grafana via kube-prometheus:
Figure 13.10: Default Grafana Dashboards
As you can see, the list is pretty extensive, but you can define your own dashboards if you want. There are a lot of fancy visualizations you can create with Grafana, and I encourage you to explore it further. Grafana dashboards are stored as config maps. If you want to add a custom dashboard, just add a config map that contains your dashboard spec. A dedicated sidecar container watches for newly added config maps and will make sure your custom dashboard is added.
You can also add dashboards via the Grafana UI.
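For example, assuming the common sidecar convention of watching for config maps labeled grafana_dashboard: "1" (the label name is configurable, so check your sidecar's configuration), a minimal custom dashboard could look like this sketch:

```yaml
$ cat <<'EOF' | k apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the label the sidecar watches for (configurable)
data:
  my-dashboard.json: |
    {
      "title": "My Custom Dashboard",
      "panels": []
    }
EOF
```

The JSON payload here is deliberately empty; in practice, you would export a dashboard from the Grafana UI and paste its JSON model into the config map.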
If you like Prometheus and Grafana and you didn’t settle on a centralized logging solution yet (or if you’re unhappy with your current logging solution), then you should consider Grafana Loki (https://grafana.com/oss/loki/). Loki is an open source project for log aggregation inspired by Prometheus. Unlike most log aggregation systems, it doesn’t index the log contents, but rather a set of labels applied to each log stream. That makes it very efficient. It is still relatively new (the project started in 2018), so you should evaluate whether it fits your needs before deciding to adopt it. One thing is certain: Loki has excellent Grafana support.
Loki has several advantages over a solution like EFK when Prometheus is used as the metrics platform. In particular, the set of labels you use to tag your metrics will serve just as well to tag your logs. Also, using Grafana as a uniform visualization platform for both logs and metrics is very convenient.
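For example, Loki's LogQL query language selects log streams with the same label selector syntax Prometheus uses, followed by filter expressions. A sketch (the label values are just an illustration and depend on how your logs are labeled):

```
{namespace="monitoring", app="grafana"} |= "error"
```

This selects the log streams labeled with that namespace and app, and keeps only lines containing the string "error".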
We have dedicated a lot of time to discussing metrics on Kubernetes. Let’s talk about distributed tracing and the Jaeger project.
In a microservice-based system, every request may travel between multiple microservices calling each other, wait in queues, and trigger serverless functions. To debug and troubleshoot such systems, you need to be able to keep track of requests and follow them along their path.
Distributed tracing provides several capabilities that allow the developers and operators to understand their distributed systems:
Distributed tracing often requires the participation of the applications and services, which must instrument their endpoints. Since the microservices world is polyglot, multiple programming languages may be used. It makes sense to use a shared distributed tracing specification and framework that supports many programming languages. Enter OpenTelemetry.
OpenTelemetry (https://opentelemetry.io) is an API specification and a set of frameworks and libraries in different languages to instrument, collect, and export logs, metrics, and traces. It was born from the merger of the OpenCensus and OpenTracing projects in May 2019, and it is an incubating CNCF project. OpenTelemetry is supported by multiple products and has become a de-facto standard. It can collect data from a variety of open source and commercial sources. Check out the full list here: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver.
By using a product that complies with OpenTelemetry, you are not locked in, and you will work with an API that may be familiar to your developers.
There are instrumentation libraries for pretty much all the mainstream programming languages:
We will focus here on the tracing concept of OpenTelemetry and skip the logging and metrics concepts we covered earlier.
The two main concepts are Span and Trace.
A Span is the basic unit of work or operation. It has a name, a start time, and a duration. Spans can be nested if one operation starts another operation. Spans propagate with a unique ID and a context. A Trace is a directed acyclic graph of Spans that originate from the same request and share the same context. A Trace represents the execution path of a request throughout the system. The following diagram illustrates the relationship between a Trace and its Spans:
Figure 13.11: The relationship between Traces and Spans in OpenTelemetry
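To make the relationship concrete, here is a minimal Python sketch of the Span/Trace data model. These are my own toy classes for illustration only, not the real OpenTelemetry API:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    """A unit of work: a named operation with a start time and duration."""
    name: str
    trace_id: str                      # shared by every span in the same trace
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_time: float = field(default_factory=time.time)
    duration: float = 0.0

class Trace:
    """An acyclic graph of spans originating from the same request."""
    def __init__(self) -> None:
        self.trace_id: str = uuid.uuid4().hex
        self.spans: List[Span] = []

    def start_span(self, name: str, parent: Optional[Span] = None) -> Span:
        # A new span inherits the trace ID; nesting is recorded via parent_id.
        span = Span(name, self.trace_id, parent.span_id if parent else None)
        self.spans.append(span)
        return span

# A request that fans out to a nested operation:
trace = Trace()
root = trace.start_span("GET /checkout")
child = trace.start_span("charge-credit-card", parent=root)
assert child.trace_id == root.trace_id   # same trace (shared context)
assert child.parent_id == root.span_id   # nested span
```

The real libraries do much more (context propagation across process boundaries, exporters, sampling), but the core data model is essentially this.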
Now that we understand what OpenTelemetry is about, let’s take a look at the Jaeger project.
Jaeger (https://www.jaegertracing.io/) is yet another CNCF-graduated project, just like Fluentd and Prometheus. It completes the trinity of CNCF-graduated observability projects for Kubernetes. Jaeger was developed originally by Uber and quickly became the leading distributed tracing solution for Kubernetes.
There are other open-source distributed tracing systems like Zipkin (https://zipkin.io) and SigNoz (https://signoz.io). The inspiration for most of these systems (as well as Jaeger) is Google’s Dapper (https://research.google.com/pubs/pub36356.html). Cloud platforms provide their own tracers, like AWS X-Ray. There are also multiple commercial products in the space:
Jaeger’s strong points are:
Jaeger is a scalable system. It can be deployed as a single binary with all its components and store the data in memory, but also as a distributed system where spans and traces are stored in persistent storage.
Jaeger has several components that collaborate to provide a world-class distributed tracing experience. The following diagram illustrates the architecture:
Figure 13.12: Jaeger architecture
Let’s understand the purpose of each component.
Originally, Jaeger had its own client libraries that implemented the OpenTracing API in order to instrument a service or application for distributed tracing. Now, Jaeger recommends using the OpenTelemetry client libraries. The Jaeger client libraries have been retired.
The agent is deployed locally to each node. It listens to spans over UDP – which makes it pretty performant – batches them and sends them in bulk to the collector. This way, services don’t need to discover the collector or worry about connecting to it. Instrumented services simply send their spans to the local agent. The agent can also inform the client about sampling strategies.
The collector receives traces from all the agents. It is responsible for validating and transforming the traces. It then sends them to a data store directly, or to a Kafka instance, which enables asynchronous processing of traces.
The ingester indexes the traces for easy and performant queries later and stores them in the data store, which can be a Cassandra or Elasticsearch cluster.
The Jaeger query service is responsible for presenting a UI to query the traces and the spans that the collector put in storage.
That covers Jaeger’s architecture and its components. Let’s see how to install and work with it.
There are Helm charts to install Jaeger and the Jaeger operator:
$ helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
"jaegertracing" has been added to your repositories
$ helm search repo jaegertracing
NAME                            CHART VERSION   APP VERSION   DESCRIPTION
jaegertracing/jaeger            0.62.1          1.37.0        A Jaeger Helm chart for Kubernetes
jaegertracing/jaeger-operator   2.36.0          1.38.0        jaeger-operator Helm chart for Kubernetes
The Jaeger operator requires cert-manager, but doesn’t install it automatically. Let’s install it first:
$ helm repo add jetstack https://charts.jetstack.io
"jetstack" has been added to your repositories
$ helm install \
    cert-manager jetstack/cert-manager \
    --namespace cert-manager \
    --create-namespace \
    --version v1.9.1 \
    --set installCRDs=true
NAME: cert-manager
LAST DEPLOYED: Mon Oct 17 10:28:43 2022
NAMESPACE: cert-manager
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
cert-manager v1.9.1 has been deployed successfully!
In order to begin issuing certificates, you will need to set up a ClusterIssuer
or Issuer resource (for example, by creating a 'letsencrypt-staging' issuer).
More information on the different types of issuers and how to configure them
can be found in our documentation:
https://cert-manager.io/docs/configuration/
For information on how to configure cert-manager to automatically provision
Certificates for Ingress resources, take a look at the `ingress-shim`
documentation:
https://cert-manager.io/docs/usage/ingress/
Now, we can install the Jaeger operator into the observability namespace:
$ helm install jaeger jaegertracing/jaeger-operator \
    -n observability --create-namespace
NAME: jaeger
LAST DEPLOYED: Mon Oct 17 10:30:58 2022
NAMESPACE: observability
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
jaeger-operator is installed.
Check the jaeger-operator logs
export POD=$(kubectl get pods -l app.kubernetes.io/instance=jaeger -l app.kubernetes.io/name=jaeger-operator --namespace observability --output name)
kubectl logs $POD --namespace=observability
The deployment is called jaeger-jaeger-operator:
$ k get deploy -n observability
NAME READY UP-TO-DATE AVAILABLE AGE
jaeger-jaeger-operator 1/1 1 1 3m21s
Now, we can create a Jaeger instance using the Jaeger CRD. The operator watches for this custom resource and creates all the necessary resources. Here is the simplest possible Jaeger configuration. It uses the default AllInOne strategy, which deploys a single pod that contains all the components (agent, collector, query, ingester, and Jaeger UI) and uses in-memory storage. This is suitable for local development and testing purposes:
$ cat <<EOF | k apply -f -
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
  namespace: observability
EOF
jaeger.jaegertracing.io/simplest created
$ k get jaegers -n observability
NAME STATUS VERSION STRATEGY STORAGE AGE
simplest 5m54s
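The AllInOne strategy with in-memory storage is not meant for production, since all traces are lost on restart. As a sketch of a production-grade setup using the operator's production strategy with Elasticsearch storage (the Elasticsearch URL is a placeholder you would replace with your own cluster):

```yaml
$ cat <<EOF | k apply -f -
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
  namespace: observability
spec:
  strategy: production    # separate collector and query deployments
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch.example.com:9200
EOF
```

The production strategy deploys the collector and query services separately so they can be scaled independently.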
Let’s bring up the Jaeger UI:
$ k port-forward deploy/simplest 8080:16686 -n observability
Forwarding from 127.0.0.1:8080 -> 16686
Forwarding from [::1]:8080 -> 16686
Now, we can browse to http://localhost:8080 and see the Jaeger UI:
Figure 13.13: Jaeger UI
In the next chapter, Chapter 14, Utilizing Service Meshes, we will see more of Jaeger and how to use it specifically to trace requests going through the mesh. Now, let’s turn our attention to troubleshooting using all the monitoring and observability mechanisms we discussed.
Troubleshooting a complex distributed system is no picnic. Abstractions, separation of concerns, information hiding, and encapsulation are great during development, testing, and when making changes to the system. But when things go wrong, you need to cross all those boundaries and layers of abstraction from the user action in their app, through the entire stack, all the way to the infrastructure, crossing all the business logic, asynchronous processes, legacy systems, and third-party integrations. This is a challenge even with large monolithic systems, and even more so with microservice-based distributed systems. Monitoring will assist you, but let’s talk first about preparation, processes, and best practices.
When building a large system, developers work on their local machines (ignoring the cloud development environment here) and, eventually, the code is deployed to the production environment. But there are a few steps between those two extremes. Complex systems operate in an environment that is not easy to duplicate locally.
You should test changes to code or configuration in an environment that is similar to your production environment. This is your staging environment, where you should catch most problems that can’t be caught by the developer running tests locally in their development environment.
The software delivery process should accommodate the detection of bad code and configuration as early as possible. But, sometimes, bad changes will be detected only in production and cause an incident. You should have an incident management process in place as well, which typically involves reverting to the previous version of whatever component caused the issue and then trying to find the root cause by looking at logs, metrics, and traces – sometimes, by debugging in the staging environment too.
But, sometimes, the problem is not with your code or configuration. In the end, your Kubernetes cluster runs on nodes (yes, even if it’s managed), and those nodes can suffer many issues.
In Kubernetes’ conceptual model, the unit of work is the pod. However, pods are scheduled on nodes. When it comes to monitoring and the reliability of the infrastructure, the nodes are what require the most attention, as Kubernetes itself (the scheduler, replica sets, and horizontal pod autoscalers) takes care of the pods. The kubelet is aware of many issues on the nodes and it will update the API server. You can see the node status and if it’s ready using this command:
$ k describe no kind-control-plane | grep Conditions -A 6
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 21 Oct 2022 01:09:33 -0700 Mon, 17 Oct 2022 10:27:24 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 21 Oct 2022 01:09:33 -0700 Mon, 17 Oct 2022 10:27:24 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 21 Oct 2022 01:09:33 -0700 Mon, 17 Oct 2022 10:27:24 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 21 Oct 2022 01:09:33 -0700 Mon, 17 Oct 2022 10:27:52 -0700 KubeletReady kubelet is posting ready status
Note the last condition, Ready. This means that Kubernetes can schedule pending pods to this node.
But, there might be problems that the kubelet can’t detect. Some of the problems are:
We need another solution. Enter the node problem detector.
The node problem detector is a pod that runs on every node. It needs to solve a difficult problem. It must be able to detect various low-level problems across different environments, different hardware, and different operating systems. It must be reliable enough not to be affected itself (otherwise, it can’t report on problems), and it needs to have a relatively low overhead to avoid spamming the control plane. The source code is at https://github.com/kubernetes/node-problem-detector.
The most natural way to deploy the node problem detector is as a DaemonSet, so every node always has a node problem detector running on it. On Google’s GKE clusters, it runs as an add-on.
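On clusters where it doesn't come pre-installed, you can deploy it yourself. At the time of writing, the project repository contains an example DaemonSet manifest; verify the path against the current repository before relying on it:

```shell
$ k apply -f https://raw.githubusercontent.com/kubernetes/node-problem-detector/master/deployment/node-problem-detector.yaml
```

Once running, the detected problems show up as node conditions and events, which you can see with k describe node, just like the built-in conditions above.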
The problem with the node problem detector (pun intended) is that there are too many problems that it needs to handle. Trying to cram all of them into a single codebase can lead to a complex, bloated, and never-stabilizing codebase. The design of the node problem detector calls for the separation of the core functionality of reporting node problems to the master from the specific problem detection. The reporting API is based on generic conditions and events. The problem detection should be done by separate problem daemons (each in its own container). This way, it is possible to add and evolve new problem detectors without impacting the core node problem detector. In addition, the control plane may have a remedy controller that can resolve some node problems automatically, therefore implementing self-healing.
At this time, problem daemons are baked into the node problem detector binary, and they execute as Goroutines, so you don’t get the benefits of the loosely-coupled design just yet. In the future, each problem daemon will run in its own container.
In addition to problems with nodes, the other area where things can break down is networking. The various monitoring tools we discussed earlier can help us identify problems across the infrastructure, in our code, or with third-party dependencies.
Let’s talk about the various options in our toolbox, how they compare, and how to utilize them for maximal effect.
Dashboards are purely for humans. The idea of a good dashboard is to provide, at one glance, a lot of useful information about the state of the system or a particular component. There are many user experience elements to designing good dashboards, just like designing any user interface. Monitoring dashboards can cover a lot of data across many components, over long time periods, and may support drilling down into finer and finer levels of detail.
Alerts, on the other hand, periodically check certain conditions (often based on metrics) and, when triggered, can either result in automatic resolution of the cause of the alert or eventually notify a human, who will probably start the investigation by looking at some dashboards.
Self-healing systems can handle certain alerts automatically (or ideally, resolve the issue before an alert is even raised). Humans will typically be involved in troubleshooting. Even in cases where the system automatically recovered from a problem at some point, a human will review the actions the system took and verify that the current behavior, including the automatic recovery from problems, is adequate.
In many cases, severe problems (a.k.a. incidents) discovered by humans looking at dashboards (not scalable) or notified by alerts will require some investigation, remediation, and later, post-mortem. In all those stages, the next layer of monitoring comes into play.
Let’s understand where each of these tools excels and how to best combine their strengths to debug difficult problems. Let’s assume we have good test coverage and our business/domain logic code is by and large correct. We run into problems in the production environment. There could be several types of problems that happen only in production:
That’s quite a list and it’s probably not even complete. Typically, when something goes wrong, it is in response to some change. What kind of changes are we talking about? Here are a few:
Since there is such a broad spectrum of problems and causes, it is difficult to suggest a linear path to resolution. For example, if the failure caused an error, then looking at an error report might be the best starting point. But if the issue is that some action that was supposed to happen didn’t happen, then there is no error to look at. In this case, it might make sense to look at the logs and compare them to the logs from a previous successful request. In the case of infrastructure or scalability problems, metrics may give us the best initial insight.
The bottom line is that debugging distributed systems requires using multiple tools together in the pursuit of the ever-elusive root cause.
Of course, in distributed systems with lots of components and microservices, it is not even clear where to look. This is where distributed tracing shines and can help us narrow down and identify the culprit.
With distributed tracing in place, every request will generate a trace with a graph of spans. Jaeger samples 1 in 1,000 requests by default, so once in a blue moon an issue might escape it, but for persistent problems, we will be able to follow the path of a request, see how long each span takes, and, if the processing of a request bails out for some reason, it will be very easy to notice. At this point, you’re back to the logs, metrics, and errors to hunt for the root cause.
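The sampling rate is configurable. The Jaeger collector can serve sampling strategies to clients from a JSON strategies file; here is a sketch that keeps the 0.001 default but samples a hypothetical high-value service more aggressively (the service name is made up):

```json
{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.001
  },
  "service_strategies": [
    {
      "service": "checkout-service",
      "type": "probabilistic",
      "param": 0.1
    }
  ]
}
```

Per-service overrides like this let you trace critical paths densely without paying the storage cost of tracing everything.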
As you can see, troubleshooting problems in a complex system like Kubernetes is far from trivial. You need comprehensive observability in place including logging, metrics, and distributed tracing. You also need deep knowledge and understanding of your system to be able to configure, monitor, and mitigate issues quickly and reliably.
In this chapter, we covered the topics of monitoring, observability, and troubleshooting. We started with a review of the various aspects of monitoring: logs, metrics, error reporting, and distributed tracing. Then, we discussed how to incorporate monitoring capabilities into your Kubernetes cluster. We looked at several CNCF projects, like Fluentd for log aggregation, Prometheus for metrics collection and alert management, Grafana for visualization, and Jaeger for distributed tracing. Then, we explored troubleshooting large distributed systems. We realized how difficult it can be and why we need so many different tools to conquer the issues.
In the next chapter, we will take it to the next level and dive into service meshes. I’m super excited about service meshes because they take much of the complexity related to cloud-native microservice-based applications and externalize them outside of the microservices. That has a lot of real-world value.