The last chapter was pretty granular, and you might have found it a little bit theoretical. After all, nobody really needs to build their own Pods from scratch nowadays (unless you’re Facebook). Never fear, from here on out, we will start moving a little bit further up the stack.
In this chapter, we’ll dive a bit deeper into cgroups: the control structures that isolate resources from one another in the kernel. In the previous chapter, we actually implemented a simple cgroup boundary for a Pod that we made all by ourselves. This time around, we’ll create a “real” Kubernetes Pod and investigate how the kernel manages that Pod’s cgroup footprint. Along the way, we’ll go through some silly, but nevertheless instructive, examples of why cgroups exist. We’ll conclude with a look at Prometheus, the time-series metrics aggregator that has become the de facto standard for all metrics and observation platforms in the cloud native space.
The most important thing to keep in mind as you follow along in this chapter is that cgroups and Linux namespaces aren’t any kind of dark magic. They are really just ledgers maintained by the kernel that associate processes with IP addresses, memory allocations, and so on. Because it’s the kernel’s job to provide these resources to programs, it’s quite natural that these data structures are also managed by the kernel itself.
In the last chapter, we touched briefly on what happens when a Pod starts. Let’s zoom in a little on that scenario and look at what the kubelet actually needs to do to create a real Pod (figure 4.1). Note that our app stays idle until the pause container is created and our namespaces are set up within it. After that, the actual application we want to run finally starts.
Figure 4.1 shows us the states of various parts of the kubelet during the creation of a container. Every kubelet will have an installed CRI, responsible for running containers, and a CNI, responsible for giving containers IP addresses, and will run one or many pause containers (placeholders where the kubelet creates namespaces and cgroups for a container to run inside of). In order for an app to ultimately be ready for Kubernetes to begin load balancing traffic to it, several ephemeral processes need to run in a highly coordinated manner:
If the CNI plugin were to run before the pause container existed, there would be no network namespace for it to configure.
If there aren’t any resources available, the kubelet won’t finish setting up a place for a Pod to run, and nothing will happen.
Before every Pod runs, a pause container runs, which is the placeholder for the Pod’s processes.
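The ordering constraints above can be sketched as a toy dependency graph. This is not kubelet code, and the step names are invented for illustration; it just shows why the pause container must come first:

```python
# A toy ordering (invented step names, not kubelet internals) of the Pod
# startup dance: the pause container must exist before the CNI can wire up
# networking, and the app can start only after both.

deps = {
    "pause-container": [],
    "cni-network-setup": ["pause-container"],
    "app-container": ["pause-container", "cni-network-setup"],
}

def start_order(deps):
    """Return a start order in which every step follows its dependencies."""
    order, seen = [], set()
    def visit(step):
        for dep in deps[step]:
            if dep not in seen:
                visit(dep)
        if step not in seen:
            seen.add(step)
            order.append(step)
    for step in deps:
        visit(step)
    return order

print(start_order(deps))  # → ['pause-container', 'cni-network-setup', 'app-container']
```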
The reason we chose to illustrate this intricate dance in this chapter is to drive home the fact that programs need resources, and resources are finite: orchestrating resources is a complex, ordered process. The more programs we run, the more complex the intersection of these resource requests. Let’s look at a few example programs. Each of the following programs has different CPU, memory, and storage requirements:
Calculating Pi—Calculating Pi needs access to a dedicated core for continuous CPU usage.
Caching the contents of Wikipedia for fast look ups—Caching Wikipedia into a hash table for our Pi program needs little CPU, but it could call for about 100 GB or so of memory.
Backing up a 1 TB database—Backing up a database into cold storage for our Pi program needs essentially no memory, little CPU, and a large, persistent storage device, which can be a slow spinning disk.
If we have a single computer with 2 cores, 101 GB of memory, and 1.1 TB of storage, we could theoretically run all three programs at once, each with unrestricted access to the machine’s CPU, memory, and storage. The results could be:
The Pi program, if written incorrectly (if it wrote intermediate results to a persistent disk, for example) could eventually overrun our database storage.
The Wikipedia cache, if written incorrectly (if its hashing function was too CPU-intensive, for example) might prevent our Pi program from rapidly doing mathematical calculations.
The database program, if written incorrectly (if it did too much logging, for example) might prevent the Pi program from doing its job by hogging all of the CPU.
Instead of running all processes with complete access to all of our system’s (limited) resources, we could do the following (that is, if we have the ability to portion out our CPU, memory, and disk resources):
Run the Pi program with a dedicated core and 1 GB of memory
Run the Wikipedia caching with half a core and 99 GB of memory
Run the database backup program with 1 GB of memory and the remaining CPU with a dedicated storage volume not accessible by other apps
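To make the partitioning argument concrete, here is a small sketch (plain Python, not a Kubernetes API) that checks whether a set of allotments fits inside the example machine’s capacity. The Pi program’s allotment of one dedicated core and 1 GB of memory is an assumption for illustration:

```python
# A toy check (not a Kubernetes API) that per-program allotments fit inside
# the example machine's capacity: 2 cores, 101 GB memory, 1.1 TB storage.

capacity = {"cpu_cores": 2.0, "memory_gb": 101, "storage_tb": 1.1}

allotments = {
    "pi":         {"cpu_cores": 1.0, "memory_gb": 1,  "storage_tb": 0.0},  # assumed
    "wiki-cache": {"cpu_cores": 0.5, "memory_gb": 99, "storage_tb": 0.0},
    "db-backup":  {"cpu_cores": 0.5, "memory_gb": 1,  "storage_tb": 1.0},
}

def fits(capacity, allotments):
    """True if the summed allotments never exceed any capacity dimension."""
    for resource, limit in capacity.items():
        total = sum(a[resource] for a in allotments.values())
        if total > limit:
            return False
    return True

print(fits(capacity, allotments))  # → True
```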
So that this can be done in a predictable manner for all programs controlled by our OS, cgroups allow us to define hierarchically separated bins for memory, CPU, and other OS resources. All threads created by a program use the same pool of resources initially granted to the parent process. In other words, no one can play in someone else’s pool.
This is, in and of itself, the argument for having cgroups for Pods. In a Kubernetes cluster, you might be running 100 programs on a single computer, many of which are low-priority or entirely idle at certain times. If these programs reserve large amounts of memory, they make the cost of running such a cluster unnecessarily high. The creation of new nodes to provide memory to starving processes leads to administrative overhead and infrastructure costs that compound over time. Because the promise of containers (increased utilization of data centers) is largely predicated on being able to run smaller footprints per service, careful usage of cgroups is at the heart of running applications as microservices.
Every process in Linux can create one or more threads. A thread is an abstraction that programs can use to create multiple flows of execution that share the same memory. As an example, we can inspect the various independent scheduling threads in Kubernetes by using the ps -T command:
root@kind-control-plane:/# ps -ax | grep scheduler   ❶
  631 ?  Ssl  60:14 kube-scheduler --authentication-kubeconfig=/etc/kubernetes/...
root@kind-control-plane:/# ps -T 631   ❷
  PID  SPID TTY STAT  TIME COMMAND
  631   631 ?   Ssl   4:40 kube-scheduler --authentication-kube..
  631   672 ?   Ssl  12:08 kube-scheduler --authentication-kube..
  631   673 ?   Ssl   4:57 kube-scheduler --authentication-kube..
  631   674 ?   Ssl   4:31 kube-scheduler --authentication-kube..
  631   675 ?   Ssl   0:00 kube-scheduler --authentication-kube..
❶ Gets the PID of the Kubernetes scheduler Pod
❷ Finds the threads in the Pod
This query shows us parallel scheduler threads that share memory with one another. These threads have their own subprocess IDs, and to the Linux kernel, they are all just regular old processes. That said, they have one thing in common: a parent. We can investigate this parent/child relationship by using the pstree command in our kind cluster:
/# pstree -t -c | grep sched   ❶
|-containerd-sh-+-kube-scheduler-+-{kube-}   ❷
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
❶ The scheduler has the parent container shim, so it is run as a container.
❷ Every scheduler thread shares the same parent thread, the scheduler itself.
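The thread behavior we just observed with ps -T can be illustrated with a minimal program of our own. Every thread below shares the process’s memory and reports the same PID, just as the scheduler’s threads all belong to PID 631:

```python
# A minimal illustration of the thread/process relationship: threads created
# by a program share its memory and its PID. (What ps -T shows as separate
# SPIDs are still one process to the rest of the system.)

import os
import threading

shared = []  # memory visible to every thread in this process

def worker(n):
    shared.append((n, os.getpid()))  # every thread records the same PID

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

pids = {pid for _, pid in shared}
print(len(shared), len(pids))  # → 4 1
```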
One of the things that makes containers so popular is the fact that, especially in Linux, they don’t create an artificial boundary between a program and its host. Rather, they just allow for scheduling programs in a way that is lightweight and easier to manage than a VM-based isolation.
Now that you’ve seen a process hierarchy in action, let’s take a step back and ask what it really means to be a process. Back in our trusty kind cluster, we can run the following command to see who started this whole charade (look at the first few lines of systemd’s status log). Remember, our kind node (which we exec into in order to do all of this) is really just a Docker container; otherwise, the output of the following command might scare you a little:
root@kind-control-plane:/# systemctl status | head
kind-control-plane
    State: running
     Jobs: 0 queued
   Failed: 0 units
    Since: Sat 2020-06-06 17:20:42 UTC; 1 weeks 1 days
   CGroup: /docker/b7a49b4281234b317eab...9   ❶
           ├── init.scope
           │   ├── 1 /sbin/init
           └── system.slice
               ├── containerd.service   ❷
               │   ├── 126 /usr/local/bin/containerd
❶ This single cgroup is the parent of our kind node.
❷ The containerd service is a child of the Docker cgroup.
If you happen to have a regular Linux machine, you can run the same command there. The output gives you a more revealing answer:
State: running
Jobs: 0 queued
Failed: 0 units
Since: Thu 2020-04-02 03:26:27 UTC; 2 months 12 days
cgroup: /
├── docker
│   ├── ae17db938d5f5745cf343e79b8301be2ef7
│   │   ├── init.scope
│   │   │   └── 20431 /sbin/init
│   │   └── system.slice
And, under the system.slice, we’ll see
├── containerd.service
├── 3067 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id db70803e6522052e
├── 3135 /pause
In a standard Linux machine or in a kind cluster node, the root of all cgroups is /. If we really want to know what cgroup is the ultimate parent of all processes in our system, it’s the / cgroup that is created at startup. Docker itself is a child of this cgroup, and if we run a kind cluster, our kind nodes are children of this Docker process. If we run a regular Kubernetes cluster, we would likely not see a Docker cgroup at all; instead, we would see that containerd itself was a child of the systemd root process. If you have a handy Kubernetes node to ssh into, this might be a good follow-up exercise.
If we traverse down these trees far enough, we’ll find the available processes, including any process started by any container, in our entire OS. Note that the process IDs (PIDs), such as 3135 in the previous snippet, are actually high numbers if we inspect this information in our host machine. That is because the PID of a process outside of a container is not the same as the PID of a process inside a container. If you’re wondering why, recall how we used the unshare command in the first chapter to separate our process namespaces. This means that processes started by containers have no capacity to see, identify, or kill processes running in other containers. This is an important security feature of any software deployment.
You may also be wondering why there are pause processes. Each of our containerd-shim programs has a pause program that corresponds to it, which is initially used as a placeholder for the creation of our network namespace. The pause container also helps clean up processes and serves as a placeholder for our CRI to do some basic process bookkeeping, helping us to avoid zombie processes.
We now have a pretty good idea of what this scheduler Pod is up to: it has spawned several children, and most likely, it was created by Kubernetes because it’s a child of containerd, which is the container runtime that Kubernetes uses in kind. As a first look at how processes work, you can kill the scheduler process, and you’ll naturally see the scheduler and its subthreads come back to life. This resurrection is done by the kubelet itself, which has a /manifests directory. This directory tells the kubelet about a few processes that should always run, even before an API server is able to schedule containers. This, in fact, is how Kubernetes installs itself via a kubelet. The life cycle of a Kubernetes installation, which uses kubeadm (now the most common installation tool), looks something like this:
The kubelet has a manifests directory that includes the API server, scheduler, and controller manager.
The kubelet tells containerd (or whatever the container runtime is) to start running all the processes in the manifests directory.
Once the API server comes up, the kubelet connects to it and then runs any containers that the API server asks it to execute.
You might ask, “What does this all have to do with cgroups?” It turns out that the scheduler we’ve been spelunking is actually identified as a mirror Pod, and the cgroups it is assigned to are named using this identity. The reason it has this special identity is that the API server originally has no knowledge of the Pod, because it was created by the kubelet rather than scheduled through the API. To be a little less abstract, let’s poke around with the following Pod metadata and find its identity:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/config.hash: 155707e0c1919c0ad1
    kubernetes.io/config.mirror: 155707e0c19147c8   ❶
    kubernetes.io/config.seen: 2020-06-06T17:21:0
    kubernetes.io/config.source: file
  creationTimestamp: 2020-06-06T17:21:06Z
  labels:
❶ The mirror Pod ID of the scheduler
We use the mirror Pod ID of the scheduler for finding its cgroups. You can view the contents of these Pods by running an edit or get action against a control plane Pod (for example, kubectl edit pod -n kube-system kube-apiserver-calico-control-plane). Now, let’s see if we can find any cgroups associated with our scheduler process.
Using the PID we found earlier, we can ask the Linux kernel what cgroups exist for the scheduler. The output is pretty intimidating (something like that shown in the following). Don’t worry about the burstable folder; we will explain the burstable concept, which is a quality of service (QoS) class, later, when we look at some kubelet internals. In the meantime, a burstable Pod is generally one that doesn’t have hard usage limits. The scheduler is an example of a Pod that typically runs with the ability to use large bursts of CPU when necessary (for example, when 10 or 20 Pods need to be quickly scheduled to a node). Each of these entries has an extremely long identifier for the cgroup and Pod identity, like so:
13:name=systemd:/docker/b7a49b4281234b31
➥ b9/kubepods/burstable/pod155707e0c19147c../391fbfc..
➥ a08fc16ee8c7e45336ef2e861ebef3f7261d
The kernel is thus tracking all of these processes in the /proc location, and we can keep digging further to see what each process is getting in terms of resources. To see the entire listing of cgroups for process 631, we can cat its cgroup file, as the following shows. Note that we’ve abbreviated the extra-long IDs for readability:
root@kind-control-plane:/# cat /proc/631/cgroup
13:name=systemd:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
12:pids:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
11:hugetlb:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
10:net_prio:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
9:perf_event:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
8:net_cls:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
7:freezer:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
6:devices:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
5:memory:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
4:blkio:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
3:cpuacct:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
2:cpu:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
1:cpuset:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
0::/docker/b7a49b42.../system.slice/containerd.service
We’ll look inside these folders, one at a time, as follows. Don’t worry too much about the docker folder, though. Because we’re in a kind cluster, the docker folder is the parent of everything; but note that our containers are actually all running in containerd:
docker—The cgroup for Docker’s daemon running on our computer, which is essentially like a VM that runs a kubelet.
b7a49b42 . . .—The name of our Docker kind container. Docker creates this cgroup for us.
kubepods—A division of cgroups that Kubernetes puts aside for its Pods.
burstable—A special cgroup for Kubernetes that defines the quality of service the scheduler gets.
pod1557 . . .—Our Pod’s ID, which is reflected inside our Linux kernel as its own identifier.
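As a sketch of how you might pick these fields apart programmatically, each line of /proc/&lt;pid&gt;/cgroup splits into a hierarchy ID, a subsystem list, and the cgroup path. The shortened IDs below are placeholders, and parse_cgroup_line is our own helper, not a standard library call:

```python
# A hedged sketch of parsing one line of /proc/<pid>/cgroup into the pieces
# just described: hierarchy ID, subsystem(s), then the docker / kubepods /
# QoS class / pod ID hierarchy. The IDs here are shortened placeholders.

def parse_cgroup_line(line):
    hierarchy_id, subsystems, path = line.strip().split(":", 2)
    return int(hierarchy_id), subsystems, path.split("/")[1:]

line = "5:memory:/docker/b7a49b42/kubepods/burstable/pod1557/391f"
hid, subsys, parts = parse_cgroup_line(line)
print(subsys, parts)
# → memory ['docker', 'b7a49b42', 'kubepods', 'burstable', 'pod1557', '391f']
```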
At the time of the writing of this book, Docker has been deprecated in Kubernetes. You can think of the docker folder in the example, not as a Kubernetes concept, but rather as “the VM that runs our kubelet,” because kind itself is really just running one Docker daemon as a Kubernetes node and then putting a kubelet, containerd, and so on, inside this node. Thus, continue to repeat to yourself when exploring Kubernetes: kind itself does not use Docker to run containers. Instead, it uses Docker to make nodes and installs containerd as the container runtime inside those nodes.
We’ve now seen that every process Kubernetes creates (on a Linux machine) ultimately lands in the bookkeeping tables of the /proc directory. Now, let’s explore what these fields mean for a more traditional Pod: an NGINX container.
The scheduler Pod is a bit of a special case in that it runs on all clusters and isn’t something that you might directly want to tune or investigate. A more realistic scenario might be one wherein you want to confirm that the cgroups for an application you’re running (like NGINX) were created correctly. In order to try this out, you can create a Pod similar to our original pod.yaml, which runs the NGINX web server with resource requests. The specification for this portion of the Pod looks like the following (which is probably familiar to you):
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    resources:
      requests:
        cpu: "1"
        memory: 1G
In this case, the Pod defines a core count (1) and a memory request (1 GB). These both go into the cgroups defined under the /sys/fs directory, and the kernel enforces the cgroup rules. Remember, you need to ssh into your node to do this or, if you’re using kind, use docker exec -t -i 75 /bin/sh to access the shell for the kind node.
The result is that now your NGINX container runs with dedicated access to 1 core and 1 GB of memory. After creating this Pod, we can actually take a direct look at its cgroup hierarchy by traversing its cgroup information for the memory field (again running the ps -ax command to track it down). In doing so, we can see how Kubernetes really responds to the memory request we give it. We’ll leave it to you, the reader, to experiment with other such limits and see how the OS expresses them.
If we now look into our kernel’s memory tables, we can see that there is a marker for how much memory has been carved out for our Pod: about 1 GB. When we made the previous Pod, our underlying container runtime placed it in a cgroup with a limited amount of memory. This solves the exact problem we originally discussed in this chapter: isolating resources for memory and CPU.
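As a rough sketch of the arithmetic involved (this is not the official Kubernetes quantity parser, and the suffix table here is an assumption for illustration), the 1G in the Pod specification corresponds to a byte value like the one the kernel records in the Pod’s memory cgroup:

```python
# A sketch (assumed suffix table, not the official Kubernetes resource
# parser) of how a Pod's "memory: 1G" maps to the byte count the kernel
# tracks in the cgroup's memory accounting files.

SUFFIXES = {
    "k": 10**3, "M": 10**6, "G": 10**9,     # decimal suffixes
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,  # binary suffixes
}

def quantity_to_bytes(quantity):
    """Convert a quantity string like '1G' or '1Gi' to bytes."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * SUFFIXES[suffix]
    return int(quantity)

print(quantity_to_bytes("1G"))   # → 1000000000
print(quantity_to_bytes("1Gi"))  # → 1073741824
```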
Thus, the magic of Kubernetes isolation really can just be viewed on a Linux machine as regular old hierarchical distribution of resources, organized by a simple directory structure. There’s a lot of logic in the kernel to get this right, but it’s all easily accessible to anyone with the courage to peer under the covers.
We now know how to confirm that our cgroups are created correctly. But how do we test that the cgroups are being honored by our processes? It’s a well-known fact that container runtimes and the Linux kernel itself may have bugs when it comes to isolating things in the exact way we expect. For example, there are instances where the OS might allow a container to run above its allotted CPU allocation if the other processes aren’t starving for resources. Let’s run a simple process with the following code to test whether our cgroups are working properly:
$ cat /tmp/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: core-k8s
  labels:
    role: just-an-example
    app: my-example-app
    organization: friends-of-manning
    creator: jay
spec:
  containers:
    - name: an-old-name-will-do
      image: busybox:latest
      command: ['sleep', '1000']
      resources:
        limits:     ❶
          cpu: 2
        requests:   ❷
          cpu: 1
      ports:
        - name: webapp-port
          containerPort: 80
          protocol: TCP
❶ Ensures our Pod has plenty of opportunity to use lots of CPU
❷ Ensures our Pod won’t start until it has a full core of CPU to access
Now, we can exec into our Pod and run a (nasty) CPU usage command. We’ll see in the output of the top command that dd is consuming an entire core:
$ kubectl create -f pod.yaml
$ kubectl exec -t -i core-k8s /bin/sh   ❶
#> dd if=/dev/zero of=/dev/null         ❷
$ docker exec -t -i 75 /bin/sh
root@kube-control-plane# top            ❸
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+   COMMAND
91467 root 20  0 1292   4   0 R 99.7  0.0 0:35.89 dd
❶ Creates a shell into your container
❷ Consumes CPU with reckless abandon by running dd
❸ Runs the top command to measure CPU usage in our Docker kind node
What happens if we fence this same process and rerun this experiment? To test this, you can change the resources stanza to something like this:

resources:
  limits:
    cpu: "0.1"    ❶
  requests:
    cpu: "0.1"    ❷

❶ Limits CPU usage to .1 core as a maximum
❷ Reserves the whole .1 core, guaranteeing this CPU share
Let’s rerun the following command. In this second example, we can actually see a much less stressful scenario for our kind node taking place:
root@kube-control-plane# top   ❶
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+   COMMAND
93311 root 20  0 1292   4   0 R 10.3  0.0 0:03.61 dd
❶ This time only 10% of the CPU is used for the node.
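The arithmetic behind this number can be sketched as follows, assuming the default CFS period of 100 ms (the function name here is our own, not a kubelet API): a CPU limit of 0.1 core becomes a CFS quota, and top reports roughly quota divided by period as the percent CPU.

```python
# A sketch of how a CPU limit of 0.1 core maps to a CFS quota, assuming the
# default 100 ms CFS period. top then reports roughly quota/period as %CPU.

CFS_PERIOD_US = 100_000  # default CFS period: 100 ms, in microseconds

def cpu_limit_to_quota_us(cores):
    """E.g., cores=0.1 -> the microsecond quota written to cpu.cfs_quota_us."""
    return int(cores * CFS_PERIOD_US)

quota = cpu_limit_to_quota_us(0.1)
print(quota)                         # → 10000
print(100 * quota // CFS_PERIOD_US)  # → 10  (close to the 10.3% top reported)
```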
Earlier in this chapter, we glossed over the other cgroups, like blkio. To be sure, there are many different kinds of cgroups, and it’s worth understanding what they are, even though 90% of the time you will only be concerned with CPU and memory isolation for your containers.
At a lower level, clever use of the cgroup primitives listed in /sys/fs/cgroup exposes control knobs for managing how these resources are allocated to processes. Some such groups are not readily useful to a Kubernetes administrator. For example, the freezer cgroup assigns groups of related tasks to a single stoppable or freezable control point. This isolation primitive allows for efficient scheduling and descheduling of gang processes (and, ironically, some have criticized Kubernetes for its poor handling of this type of scheduling).
Another example is the blkio cgroup, a lesser-known resource that’s used to manage I/O. Looking into /sys/fs/cgroup, we can see all of the various quantifiable resources that can be allocated hierarchically in Linux:
$ ls -d /sys/fs/cgroup/*
/sys/fs/cgroup/blkio        /sys/fs/cgroup/freezer           /sys/fs/cgroup/perf_event
/sys/fs/cgroup/cpu          /sys/fs/cgroup/hugetlb           /sys/fs/cgroup/pids
/sys/fs/cgroup/cpuacct      /sys/fs/cgroup/memory            /sys/fs/cgroup/rdma
/sys/fs/cgroup/cpu,cpuacct  /sys/fs/cgroup/net_cls           /sys/fs/cgroup/systemd
/sys/fs/cgroup/cpuset       /sys/fs/cgroup/net_cls,net_prio  /sys/fs/cgroup/unified
/sys/fs/cgroup/devices      /sys/fs/cgroup/net_prio
You can read about the original intent of cgroups at http://mng.bz/vo8p. Some of the corresponding articles might be out of date, but they provide a lot of information about how cgroups have evolved and what they are meant to do. For advanced Kubernetes administrators, understanding how to interpret these data structures can be valuable when it comes to looking at different containerization technologies and how they affect your underlying infrastructure.
Now that you understand where cgroups come from, it is worth taking a look at how cgroups are used in a kubelet; namely, by the allocatable data structure. Looking at an example Kubernetes node (again, you can do this with your kind cluster), we can see the following stanza in the output from kubectl get nodes -o yaml:
...
allocatable:
  cpu: "12"
  ephemeral-storage: 982940092Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 32575684Ki
  pods: "110"
Do these settings look familiar? By now, they should. These resources are the amount of cgroup budget available for allocating resources to Pods. The kubelet calculates this by determining the total capacity on the node. It then deducts how much CPU bandwidth is required for itself as well as for the underlying node and subtracts this from the amount of allocatable resources for containers. The equations for these numbers are documented at http://mng.bz/4jJR and can be tuned with parameters, including --system-reserved and --kube-reserved. This value is then used by the Kubernetes scheduler to decide whether a Pod fits on this particular node.
Typically, you might set --kube-reserved and --system-reserved to half of a core each, leaving a 2-core node with ~1 core free to run workloads, because a kubelet is not an incredibly CPU-hungry process (except in times of burst scheduling or startup). At large scales, all of these numbers break down and depend on a variety of performance factors related to workload types, hardware types, network latency, and so on. When it comes to scheduling, we have the following equation, where system-reserved refers to the quantity of resources a healthy OS needs to run:
Allocatable = node capacity - kube-reserved - system-reserved
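As a sketch, with an assumed 16-core node and half-core reservations (illustrative numbers, not values from a real cluster):

```python
# The allocatable equation above, as a sketch. The 16-core node and the
# half-core reservations are assumed values for illustration only.

def allocatable(capacity, kube_reserved, system_reserved):
    """Allocatable = node capacity - kube-reserved - system-reserved."""
    return capacity - kube_reserved - system_reserved

print(allocatable(16.0, 0.5, 0.5))  # → 15.0
```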
For example, on a 16-core node that reserves half a core each for kube-reserved and system-reserved, the amount of allocatable CPU is 15 cores. To contextualize how all of this relates to a scheduled, running container:
The kubelet creates cgroups when you run Pods to bound their resource usage.
Your container runtime starts a process inside the cgroups, which guarantees the resource requests you gave it in the Pod specification.
systemd usually starts a kubelet, which broadcasts the total available resources to the Kubernetes API periodically.
systemd also typically starts your container runtime (containerd, CRI-O, or Docker).
When you start a kubelet, there is parenting logic embedded in it. This behavior is configured by a command-line flag (which you should leave enabled) that results in the kubelet itself being a top-level cgroup parent to its children’s containers. The previous equation calculates the total resource budget the kubelet can hand out to those cgroups; this is called the allocatable resource budget.
To understand this, we have to dive a little deeper into the specific cgroups that we saw earlier. Remember how our Pods resided under special folders, such as guaranteed and burstable? If we allowed our OS to swap inactive memory to disk, then an idle process might suddenly have slow memory allocation. This allocation would violate the guaranteed access to memory that Kubernetes provides users when defining Pod specifications and would make performance highly variable.
Because the scheduling of large amounts of processes in a predictable manner is more important than the health of any one process, swapping is disabled entirely on Kubernetes nodes. To avoid any confusion around this, Kubernetes installers such as kubeadm fail instantly if you bootstrap your kubelets on machines with swap enabled.
That said, there is a lot of subtlety at the container runtime level when it comes to resource usage such as memory. For example, cgroups differentiate between soft and hard limits as follows:
A process with soft memory limits has varying amounts of RAM over time, depending on the system load.
A process with hard memory limits is killed if it exceeds its memory limit for an extended period.
Note that Kubernetes relays an exit code and the OOMKilled status back to you in the cases where a process has to be killed for these reasons. You can increase the amount of memory allocated to a high-priority container to reduce the odds that a noisy neighbor causes problems on a machine. Let’s look at that next.
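A toy model of the soft-versus-hard distinction might look like the following. This is an invented simplification, not the kernel’s actual OOM-killer logic, and grace_ticks is a made-up parameter:

```python
# A toy model (NOT the kernel's actual OOM logic) of soft vs. hard limits:
# a hard-limited process is killed once usage stays over the limit past a
# grace period, while a soft limit is only reclaimed from under load.

def should_oom_kill(usage_bytes, limit_bytes, hard_limit, ticks_over, grace_ticks=3):
    """Kill only hard-limited processes that stay over their limit too long."""
    if not hard_limit:
        return False  # soft limits squeeze the process; they don't kill it
    return usage_bytes > limit_bytes and ticks_over > grace_ticks

print(should_oom_kill(2 * 2**30, 2**30, hard_limit=True, ticks_over=5))   # → True
print(should_oom_kill(2 * 2**30, 2**30, hard_limit=False, ticks_over=5))  # → False
```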
HugePages support was initially absent from Kubernetes because, at its inception, Kubernetes was a web-centric technology. As it moved into core data-center workloads, more subtle scheduling and resource allocation strategies became relevant. HugePages configuration allows a Pod to access memory pages larger than the Linux kernel’s default memory page size, which is usually 4 KB.
Memory, like CPU, can be allocated explicitly for Pods and is denoted using units for kilobytes, megabytes, and gigabytes (Ki, Mi, and Gi, respectively). Many memory-intensive applications, like Elasticsearch and Cassandra, support using HugePages. If a node supports HugePages and a 2048 KiB page size, it exposes a schedulable resource: hugepages-2Mi. In general, it is possible to schedule against HugePages in Kubernetes using a standard resources directive.
Transparent HugePages are an automatic optimization of HugePages that can have highly variable effects on Pods that need high performance. Depending on your hardware, you’ll want to disable them at the bootloader or OS level in some cases, especially for high-performance containers that need large, contiguous blocks of memory.
We’ve come full circle now. Remember how at the beginning of this chapter we looked at the /sys/fs directory and how it managed various resources for containers? The rigging of HugePages can be done in init containers if you can run these as root and mount /sys using a container to edit these files.
The configuration of HugePages can be toggled by merely writing files to and from the sys directory. For example, to turn off transparent HugePages, which might make a performance difference for you on specific OSs, you would typically run a command such as echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/enabled. If you need to set up HugePages in a specific way, you could do so entirely from a Pod specification as follows:
Declare a Pod, which presumably has specific performance needs based around HugePages.
Declare an init container with this Pod, which runs in privileged mode and mounts the /sys directory using the volume type of hostPath.
Have the init container execute any Linux-specific commands (such as the previous echo statement) as its only execution steps.
In general, init containers can be used to bootstrap certain Linux features that might be required for a Pod to run properly. But keep in mind that any time you mount a hostPath volume, you need special privileges on your cluster, which an administrator might not readily give you. Some distributions, such as OpenShift, deny hostPath volume mounts by default.
We’ve seen terms such as guaranteed and burstable throughout this chapter, but we haven’t defined these terms yet. To define these concepts, we first need to introduce QoS.
When you go to a fancy restaurant, you expect the food to be great, but you also expect the wait staff to be responsive. This responsiveness is known as quality of service or QoS. We hinted at QoS earlier when we looked at why swap is disabled in Kubernetes to guarantee the performance of memory access. QoS refers to the availability of resources at a moment’s notice. Any data center, hypervisor, or cloud has to make a tradeoff around resource availability for applications by
Guaranteeing that critical services stay up, but you’re spending lots of money because you have more hardware than you need
Spending little money and risking essential services going down
QoS allows you to walk the fine line of having many services performing suboptimally during peak times without sacrificing the quality of critical services. In practice, these critical services might be payment-processing systems, machine-learning or AI jobs that are costly to restart, or real-time communications processes that cannot be interrupted. Keep in mind that the eviction of a Pod is heavily dependent on how much above its resource limits it is. In general
Nicely-behaved applications with predictable memory and CPU usage are less likely to be evicted in times of duress than others.
Greedy applications are more likely to get killed during times of pressure when they attempt to use more CPU or memory than allocated by Kubernetes, unless these apps are in the Guaranteed class.
Applications in the BestEffort QoS class are highly likely to get killed and rescheduled in times of duress.
You might be wondering how we decide which QoS class to use. In general, you don’t directly decide this; instead, you influence this decision by determining whether your app needs guaranteed access to resources, using the resources stanza in your Pod specification. We’ll walk through this process in the following section.
Burstable, Guaranteed, and BestEffort are the three QoS classes that are created for you, depending on how you define a Pod. These settings can increase the number of containers that you can run on your cluster, where some may die off at times of high utilization and can be rescheduled later. It’s tempting to make global policies for how much CPU or memory you should allocate to end users but, be warned, rarely does one size fit all:
If all the containers on your system have a Guaranteed QoS, your ability to handle dynamic workloads with modulating resource needs is hampered.
If no containers on your servers have a Guaranteed QoS, then the kubelet can’t ensure that critical processes stay up.
The rules for QoS determination are as follows (the class is calculated and displayed in the status field of your Pod):
BestEffort Pods are those that have no CPU or memory requests. They are easily killed and displaced (and are likely to pop up on a new node) when resources are tight.
Burstable Pods are those that have memory or CPU requests but do not have limits defined for every container. These are less likely to be displaced than BestEffort Pods.
Guaranteed Pods are those in which every container has CPU and memory limits equal to its requests. These are less likely to be displaced than Burstable Pods.
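For comparison with the rules above, here is a sketch of a Pod spec that would land in the Guaranteed class (the container name and image are just placeholders); setting limits equal to requests for every container is what earns the class:

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "1"
        memory: 1G
      limits:        # equal to the requests, so the Pod is Guaranteed
        cpu: "1"
        memory: 1G
```

Dropping the limits stanza here would demote the Pod to Burstable; dropping requests as well would make it BestEffort.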
Let’s see this in action. Create a new Pod by running kubectl create ns qos; kubectl -n qos run myapp --image=nginx. Then, edit the Pod to include a container specification that states a request but does not define a limit. For example:
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    resources:
      requests:
        cpu: "1"
        memory: 1G
You will now see that when you run kubectl get pods -n qos -o yaml, a Burstable class has been assigned to the status field of your Pod, as the following code snippet shows. In crunch time, you might use this technique to ensure that the most critical processes for your business all have a Guaranteed or Burstable status.
hostIP: 172.17.0.3
phase: Running
podIP: 192.168.242.197
qosClass: Burstable
startTime: "2020-03-08T08:54:08Z"
We’ve looked at a lot of low-level Kubernetes concepts and mapped them to the OS in this chapter, but in the real world, you won’t be manually curating this data. Instead, for system metrics and overall trends, people typically aggregate container and system-level OS information in a single, time-series dashboard so that, in case of emergencies, they can ascertain the timescale of a problem and drill into it from various perspectives (application, OS, and so forth).
To conclude this chapter, we’ll up-level things a little bit and use Prometheus, the industry standard for monitoring cloud native applications as well as Kubernetes itself. We’ll look at how Pod resource usage can be quantified by direct inspection of cgroups. This has several advantages when it comes to end-to-end system visibility:
It can see sneaky processes that aren’t visible to Kubernetes but might overrun your cluster.
You can directly map resources that Kubernetes is aware of with kernel-level isolation tools, which might uncover bugs in the way your cluster is interacting with your OS.
It’s a great tool for learning more about how containers are implemented at scale by the kubelet and your container runtime.
Before we get into Prometheus, we need to talk about metrics. In theory, a metric is a quantifiable value of some sort; for example, how many cheeseburgers you ate in the last month. In the Kubernetes universe, the myriad of containers coming online and offline in a data center makes application metrics important for administrators as an objective and app-independent model for measuring the overall health of a data center’s services.
Sticking with the cheeseburger metaphor, you might have a collection of metrics that you can jot down in a journal. There are three fundamental types of metrics that we’ll concern ourselves with (histograms, gauges, and counters):
Gauges: Report a current value that can go up or down (e.g., how many requests per second you’re serving at any given time).
Histograms: Show bins of timing for different types of events (e.g., how many requests completed in under 500 ms).
Counters: Specify continuously increasing counts of events (e.g., how many total requests you’ve seen).
As a concrete example that might be a little closer to home, we can output Prometheus metrics about our daily calorie consumption. The following code snippet shows this output:

meals_today 2                       ❶
cheeseburger 1                      ❷
calories_total_bucket[le=1024] 1    ❸
calories_total_bucket[le=2048] 2

❶ The total number of meals you had today
❷ The number of cheeseburgers you’ve eaten today
❸ The number of calories you’ve had, binned into buckets of 2, 4, 8, 16, and so on, up to 2,048
You might publish the total number of meals once a day. This is known as a gauge, as it goes up and down and is updated periodically. The number of cheeseburgers you’ve eaten today would be a counter, which continually gets incremented over time. With the calories you’ve had, the metric says you had one meal with less than 1,024 calories. This gives you a discrete way to bin how much you ate without getting bogged down in details (anything above 2,048 is probably too much, and anything below 1,024 is most likely too little).
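To make the cumulative-bucket arithmetic concrete, here is a small stdlib-only Python sketch. The bucket bounds follow the text, but the observe helper is our own illustration, not a Prometheus API: it shows how a single observation lands in every le bucket at or above its value.

```python
import bisect

# Cumulative histogram buckets: an observation of N calories increments
# every bucket whose upper bound (le) is >= N.
BOUNDS = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

def observe(buckets, value):
    # Find the first bound large enough to hold the value, then increment
    # it and every larger bucket (Prometheus buckets are cumulative).
    start = bisect.bisect_left(BOUNDS, value)
    for bound in BOUNDS[start:]:
        buckets[bound] = buckets.get(bound, 0) + 1

buckets = {}
observe(buckets, 900)                 # one meal under 1,024 calories
print(buckets[1024], buckets[2048])   # prints: 1 1
```

Note that the 900-calorie meal shows up in both the le=1024 and le=2048 buckets, which is exactly why the journal entry above could report one meal under 1,024 calories.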
Note that buckets like this are commonly used to monitor etcd over long time periods. The number of writes taking longer than 1 second is important for predicting etcd outages. Over time, if we aggregated the daily journal entries that you made, you might be able to make some interesting correlations, as long as you logged the time of these metrics. For example:
meals_today 2
cheeseburger 50
salad 99
dinner 101
lunch 99
calories_total_bucket[le=512] 10
calories_total_bucket[le=1024] 40
calories_total_bucket[le=2048] 60
If you plotted these metrics on their own individual y-axes with the x-axis being time, you might be able to see that
Days when you ate cheeseburgers were inversely correlated with days you ate breakfast.
The number of cheeseburgers you’ve been eating is increasing steadily.
Metrics are important for containerized and cloud-based applications, but they need to be managed in a lightweight and decoupled manner. Prometheus gives us the tools to enable metrics at scale without creating any unnecessary boilerplate or frameworks that get in our way. It is designed to fulfill the following requirements:
Hundreds or thousands of different processes might publish similar metrics, which means that a given metric needs to support metadata labels to differentiate these processes.
Applications should publish metrics in a language-independent manner.
Applications should publish metrics without being aware of how those metrics are being consumed.
It should be easy for any developer to publish metrics for a service, regardless of the language they use.
Programmatically, if we were to journal our diet choices in the previous analogy, we would declare instances of cheeseburger, meals_today, and calories_total that would be of the type counter, gauge, and histogram, respectively. These types would be Prometheus API types, supporting operations that automatically store local values in memory, which could be scraped as plain text from a local endpoint. Typically, this is done by adding a Prometheus handler to a REST API server, and this handler serves only one meaningful endpoint: /metrics. To manage this data, we might use a Prometheus API client like so:
Periodically, to observe a value for how many meals_today we’ve had, as that is a Gauge API call
Periodically, to increment the cheeseburger counter right after lunch
Daily, to aggregate the value of calories_total, which can be fed in from a different data source
Over time, we could possibly correlate whether eating cheeseburgers related to higher total calorie consumption on a per-day basis, and we might also be able to tie in other metrics (for example, our weight) to these values. Although any time-series database could enable this, Prometheus, as a lightweight metrics engine, works well in containers because each process publishes its metrics in an independent, stateless way, and it’s emerged as the modern standard for adding metrics to any application.
There are Prometheus clients for all major programming languages. Thus, for any microservice, it is simple and cheap to journal the daily goings-on of various events as a Prometheus metric.
In this book, we focus on Prometheus because it is the de facto standard in the cloud native landscape, but we’ll try to convince you that it deserves this status with a simple, powerful example of how to quickly do a health check on the inner workings of your API server. As an example, you can take a look at whether requests for Pods have put a lot of strain on your Kubernetes API server by running the following commands in your terminal (assuming that you have your kind cluster up and running). In a separate terminal, run a kubectl proxy command, and then curl the API server’s metrics endpoint like so:
$ kubectl proxy                              ❶
$ curl localhost:8001/metrics | grep etcd    ❷
etcd_request_duration_seconds_bucket{op="get",type="*core.Pod",le="0.005"} 174
etcd_request_duration_seconds_bucket{op="get",type="*core.Pod",le="0.01"} 194
etcd_request_duration_seconds_bucket{op="get",type="*core.Pod",le="0.025"} 201
etcd_request_duration_seconds_bucket{op="get",type="*core.Pod",le="0.05"} 203
❶ Allows you to access the Kubernetes API server on localhost:8001
❷ curls the API server’s metrics endpoint
Anyone with a kubectl client can immediately use the curl command to ingest real-time metrics about the response times for a certain API endpoint. In the previous snippet, we can see that almost all get calls to the Pod’s API endpoint return in less than .025 seconds, which is generally considered reasonable performance. For the remainder of this chapter, we’ll set up a Prometheus monitoring system for your kind cluster from scratch.
We can use a Prometheus monitoring service to inspect the way cgroups and system resources are utilized under duress. The architecture of a Prometheus monitoring system (figure 4.2) on kind includes the following:
Note that, in general, a Prometheus master might be scraping metrics from many different sources, including API servers, hardware nodes, standalone databases, and even standalone applications. Not all services conveniently get aggregated into the Kubernetes API server, however. In this simple example, we want to look at how to use Prometheus for the specific purpose of monitoring cgroup resource usage on Kubernetes, and conveniently, we can do this by scraping data for all of our nodes directly from the API server. Also, note that our kind cluster in this example has only one node. Even if we had more nodes, we could still scrape all of this data directly from the API server by adding more target fields to our scrape YAML file (which we will introduce shortly).
We will launch Prometheus with the configuration file that follows, which we store as prometheus.yml:
$ mkdir ./data
$ ./prometheus-2.19.1.darwin-amd64/prometheus \
    --storage.tsdb.path=./data \
    --config.file=./prometheus.yml
The kubelet uses the cAdvisor library to monitor cgroups and to collect quantifiable data about them (for example, how much CPU and memory a particular Pod uses). Because you already know how to browse through cgroup filesystem hierarchies, reading the cAdvisor metrics collected by the kubelet will yield an “aha” moment for you (in terms of understanding how Kubernetes itself connects to the lower-level kernel resource accounting). To scrape these metrics, we’ll tell Prometheus to query the API server every 3 seconds like so:
global:
  scrape_interval: 3s
  evaluation_interval: 3s
scrape_configs:
- job_name: prometheus
  metrics_path: /api/v1/nodes/kind-control-plane/proxy/metrics/cadvisor    ❶
  static_configs:
  - targets: ['localhost:8001']    ❷
❶ The kind control plane node is the only node in our cluster.
❷ Add more nodes in our cluster or more things to scrape in subsequent jobs here.
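For instance, if our cluster had a second node (the node name here is hypothetical), the extra scrape job might look like the following, still proxied through the same API server endpoint:

```yaml
- job_name: prometheus-worker
  metrics_path: /api/v1/nodes/kind-worker/proxy/metrics/cadvisor  # hypothetical second node
  static_configs:
  - targets: ['localhost:8001']
```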
Real-world Prometheus configurations have to account for real-world constraints, including data size, security, and alerting protocols. Note that time-series databases are notoriously greedy when it comes to disk usage and that metrics can reveal a lot about your organization’s threat model. These concerns may not be important in your initial prototyping; as we noted earlier, it’s better to start by publishing your metrics at the application level and then add the complexity of managing a heavyweight Prometheus installation later. For our simple example, this is all we need to configure Prometheus to explore our cgroups.
Again, remember that the API server receives data from the kubelet periodically, which is why this strategy of scraping only one endpoint works. If this were not the case, we could collect this data directly from the kubelet itself or even run our own cAdvisor service. Now, let’s take a look at the container_cpu_user_seconds_total metric. We’ll make it spike by running the following command.
Warning This command immediately creates a lot of network and CPU traffic on your computer.
$ kubectl apply -f https://raw.githubusercontent.com/giantswarm/kube-stresscheck/master/examples/node.yaml
This command launches a series of resource-intensive containers that suck up network bandwidth, memory, and CPU cycles in your cluster. If you’re on a laptop, the stress containers produced by running this command will probably cause a lot of CPU spiking, and you might hear some fan noise.
In figure 4.3, you’ll see what our kind cluster looks like under duress. We’ll leave it as an exercise for you to map the various container cgroups and metadata (found by hovering your mouse over the Prometheus metrics) back to processes and containers that are running in your system. In particular, it’s worth looking at the following metrics to get a feel for CPU-level monitoring in Prometheus. Exploring these metrics, along with the hundreds of other metrics in your system when running your favorite workloads or containers, gives you a good way to create important monitoring and forensics protocols for your internal systems engineering pipelines:
Let’s look in more detail at the three metric types before closing up shop for this chapter, just for good measure. In figure 4.4, we compare the general topology of how these three metrics give you different perspectives on the same situation in your data center. Specifically, we can see that the gauge gives us a Boolean value indicating whether our cluster is up. Meanwhile, the histogram shows us fine-grained information on how requests are trending until we lose our application entirely. Finally, the counters show us the overall number of transactions leading to an outage:
The gauge readout would be most valuable to someone who might be on pager duty for application uptime.
The histogram readout may be most valuable to an engineer doing “day after” forensics on why a microservice went down for an extended time.
The counter metric would be a good way to determine how many successful requests were served before an outage. For example, in case of a memory leak, we might find that after a certain number of requests (say, 15,000 or 20,000), a web server predictably fails.
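To make these three perspectives concrete, here is one hedged PromQL query per metric type. The etcd metric names follow the snippet we curled earlier, and up is a gauge Prometheus synthesizes for every scrape target; adapt the label values to your own cluster:

```promql
# Gauge: is our scrape target currently up?
up{job="prometheus"}

# Histogram: 95th-percentile etcd get latency over the last 5 minutes
histogram_quantile(0.95,
  rate(etcd_request_duration_seconds_bucket{op="get"}[5m]))

# Counter: the rate of get requests served leading up to an outage
rate(etcd_request_duration_seconds_count{op="get"}[5m])
```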
It’s ultimately up to you to decide which metrics you want to use to make decisions, but in general, it’s good to keep in mind that your metrics should not just be a dumping ground for information. Rather, they should help you tell a story about how your services behave and interact with each other over time. Generic metrics are rarely useful for debugging intricate problems, so take the time to embed the Prometheus client into your applications and collect some interesting, quantifiable application metrics. Your administrators will thank you! We’ll look back at metrics again in our etcd chapter, so don’t worry—there will be more Prometheus to come!
A kubelet starts the scheduler process and mirrors it to the API server.
We can use simple containers to inspect how cgroups implement a memory limitation.
The kubelet uses QoS classes to fine-tune resource guarantees for the processes in your Pods.
We can use Prometheus to view real-time metrics of a cluster under duress.
Prometheus expresses three core metric types: gauges, histograms, and counters.