4 Using cgroups for processes in our Pods

This chapter covers

  • Exploring the basics of cgroups
  • Identifying Kubernetes processes
  • Learning how to create and manage cgroups
  • Using Linux commands to investigate cgroup hierarchies
  • Understanding cgroup v2 versus cgroup v1
  • Installing Prometheus and looking at Pod resource usage

The last chapter was pretty granular, and you might have found it a little bit theoretical. After all, nobody really needs to build their own Pods from scratch nowadays (unless you’re Facebook). Never fear, from here on out, we will start moving a little bit further up the stack.

In this chapter, we’ll dive a bit deeper into cgroups: the control structures that isolate resources from one another in the kernel. In the previous chapter, we actually implemented a simple cgroup boundary for a Pod that we made all by ourselves. This time around, we’ll create a “real” Kubernetes Pod and investigate how the kernel manages that Pod’s cgroup footprint. Along the way, we’ll go through some silly, but nevertheless instructive, examples of why cgroups exist. We’ll conclude with a look at Prometheus, the time-series metrics aggregator that has become the de facto standard for all metrics and observation platforms in the cloud native space.

The most important thing to keep in mind as you follow along in this chapter is that cgroups and Linux namespaces aren’t any kind of dark magic. They are really just ledgers maintained by the kernel that associate processes with IP addresses, memory allocations, and so on. Because it is the kernel’s job to provide these resources to programs, it makes sense that these data structures are managed by the kernel itself.

4.1 Pods are idle until the prep work completes

In the last chapter, we touched briefly on what happens when a Pod starts. Let’s zoom in a little bit on that scenario and look at what the kubelet actually needs to do to create a real Pod (figure 4.1). Note that our app is idle until the pause container is added to its namespace; only after that does our actual application start.

Figure 4.1 The processes involved in container startup

Figure 4.1 shows us the states of various parts of the kubelet during the creation of a container. Every kubelet will have an installed CRI, responsible for running containers, and a CNI, responsible for giving containers IP addresses, and will run one or many pause containers (placeholders where the kubelet creates namespaces and cgroups for a container to run inside of). In order for an app to ultimately be ready for Kubernetes to begin load balancing traffic to it, several ephemeral processes need to run in a highly coordinated manner:

  • If the CNI were to run before the Pod’s pause container exists, there would be no network namespace for it to configure.

  • If there aren’t any resources available, the kubelet won’t finish setting up a place for a Pod to run, and nothing will happen.

  • Before every Pod runs, a pause container runs, which is the placeholder for the Pod’s processes.

The reason we chose to illustrate this intricate dance in this chapter is to drive home the fact that programs need resources, and resources are finite: orchestrating resources is a complex, ordered process. The more programs we run, the more complex the intersection of these resource requests. Let’s look at a few example programs. Each of the following programs has different CPU, memory, and storage requirements:

  • Calculating Pi—Calculating Pi needs access to a dedicated core for continuous CPU usage.

  • Caching the contents of Wikipedia for fast lookups—Caching Wikipedia into a hash table for our Pi program needs little CPU, but it could call for about 100 GB of memory.

  • Backing up a 1 TB database—Backing up a database into cold storage for our Pi program needs essentially no memory, little CPU, and a large, persistent storage device, which can be a slow spinning disk.

If we have a single computer with 2 cores, 101 GB of memory, and 1.1 TB of storage, we could theoretically run all three programs, each with full access to the machine’s CPU, memory, and storage. The result could be

  • The Pi program, if written incorrectly (if it wrote intermediate results to a persistent disk, for example) could eventually overrun our database storage.

  • The Wikipedia cache, if written incorrectly (if its hashing function was too CPU-intensive, for example) might prevent our Pi program from rapidly doing mathematical calculations.

  • The database program, if written incorrectly (if it did too much logging, for example) might prevent the Pi program from doing its job by hogging all of the CPU.

Instead of running all processes with complete access to all of our system’s (limited) resources, we could do the following—that is, if we have the ability to portion out our CPU, memory, and disk resources:

  • Run the Pi process with 1 core and 1 KB of memory

  • Run the Wikipedia caching with half a core and 99 GB of memory

  • Run the database backup program with 1 GB of memory and the remaining CPU with a dedicated storage volume not accessible by other apps

So that this can be done in a predictable manner for all programs controlled by our OS, cgroups allow us to define hierarchically separated bins for memory, CPU, and other OS resources. All threads created by a program use the same pool of resources initially granted to the parent process. In other words, no one can play in someone else’s pool.
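To make this less abstract, here is a minimal sketch of creating and managing a cgroup by hand on a cgroup v1 host (the group name mypool is just an example; run these as root on a node that mounts the legacy cgroup filesystem):

# Create a new child cgroup under the CPU controller
mkdir /sys/fs/cgroup/cpu/mypool

# Allow this group half of one core: 50,000 out of every 100,000 microseconds
echo 50000  > /sys/fs/cgroup/cpu/mypool/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mypool/cpu.cfs_period_us

# Move the current shell (and anything it spawns) into the group
echo $$ > /sys/fs/cgroup/cpu/mypool/tasks

Any process launched from that shell now shares the half-core budget of mypool, which is exactly the kind of bookkeeping Kubernetes automates for every Pod.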

This is, in and of itself, the argument for having cgroups for Pods. In a Kubernetes cluster, you might be running 100 programs on a single computer, many of which are low-priority or entirely idle at certain times. If these programs reserve large amounts of memory, they make the cost of running such a cluster unnecessarily high. The creation of new nodes to provide memory to starving processes leads to administrative overhead and infrastructure costs that compound over time. Because the promise of containers (increased utilization of data centers) is largely predicated on being able to run smaller footprints per service, careful usage of cgroups is at the heart of running applications as microservices.

4.2 Processes and threads in Linux

Every process in Linux can create one or more threads. A thread is an execution abstraction that lets a program run several tasks that share the same memory as their parent process. As an example, we can inspect the various independent threads of the Kubernetes scheduler by using the ps -T command:

root@kind-control-plane:/# ps -ax | grep scheduler    
631 ?  Ssl 60:14 kube-scheduler
  --authentication-kubeconfig=/etc/kubernetes/...
 
root@kind-control-plane:/# ps -T 631
  PID  SPID TTY      STAT   TIME COMMAND
  631   631 ?        Ssl    4:40 kube-scheduler --authentication-kube..
  631   672 ?        Ssl   12:08 kube-scheduler --authentication-kube..
  631   673 ?        Ssl    4:57 kube-scheduler --authentication-kube..
  631   674 ?        Ssl    4:31 kube-scheduler --authentication-kube..
  631   675 ?        Ssl    0:00 kube-scheduler --authentication-kube..

Gets the PID of the Kubernetes scheduler Pod

Finds the threads in the Pod

This query shows us parallel scheduler threads that share memory with one another. These processes have their own subprocess IDs, and to the Linux kernel, they are all just regular old processes. That said, they have one thing in common: a parent. We can investigate this parent/child relationship by using the pstree command in our kind cluster:

/# pstree -t -c | grep sched                 
|-containerd-sh-+-kube-scheduler-+-{kube-}   
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}
|               |                |-{kube-}

The scheduler has the parent container shim, so it is run as a container.

Every scheduler thread shares the same parent thread, the scheduler itself.

containerd and Docker

We haven’t spent time contrasting containerd and Docker, but it’s good to note that our kind clusters are not running Docker as their container runtime. Instead, they use Docker to create nodes, and then every node uses containerd as a run time. Modern Kubernetes clusters do not typically run Docker as the container runtime for Linux for a variety of reasons. Docker was a great on-ramp for developers to run Kubernetes, but data centers require a lighter-weight container runtime solution that is more deeply integrated with the OS. Most clusters execute runC as the container runtime at the lowest level, where runC is called by containerd, CRI-O, or some other higher-level command-line executable that is installed on your nodes. This causes systemd to be the parent of your containers rather than the Docker daemon.

One of the things that makes containers so popular is the fact that, especially in Linux, they don’t create an artificial boundary between a program and its host. Rather, they just allow for scheduling programs in a way that is lightweight and easier to manage than a VM-based isolation.

4.2.1 systemd and the init process

Now that you’ve seen a process hierarchy in action, let’s take a step back and ask what it really means to be a process. Back in our trusty kind cluster, we ran the following command to see who started this whole charade (look at the first few lines of systemd’s status log). Remember, our kind node (which we exec into in order to do all of this) is really just a Docker container; otherwise, the output of the following command might scare you a little:

root@kind-control-plane:/# systemctl status | head
kind-control-plane
  State: running
   Jobs: 0 queued
 Failed: 0 units
  Since: Sat 2020-06-06 17:20:42 UTC; 1 weeks 1 days
 CGroup: /docker/b7a49b4281234b317eab...9               
         ├── init.scope
         │     ├── 1 /sbin/init
         └── system.slice
             ├── containerd.service                     
             │     ├── 126 /usr/local/bin/containerd

This single cgroup is the parent of our kind node.

The containerd service is a child of the Docker cgroup.

If you happen to have a regular Linux machine, you might see output like the following, which gives you a more revealing answer:

State: running
     Jobs: 0 queued
   Failed: 0 units
    Since: Thu 2020-04-02 03:26:27 UTC; 2 months 12 days
   cgroup: / 
           ├── docker
           │     ├── ae17db938d5f5745cf343e79b8301be2ef7
           │     │     ├── init.scope
           │     │     │     └── 20431 /sbin/init
           │     │     └── system.slice

And, under the system.slice, we’ll see

├── containerd.service
├──  3067 /usr/local/bin/containerd-shim-runc-v2
          -namespace k8s.io -id db70803e6522052e
├──  3135 /pause

In a standard Linux machine or in a kind cluster node, the root of all cgroups is /. If we really want to know what cgroup is the ultimate parent of all processes in our system, it’s the / cgroup that is created at startup. Docker itself is a child of this cgroup, and if we run a kind cluster, our kind nodes are the child of this Docker process. If we run a regular Kubernetes cluster, we would likely not see a Docker cgroup at all, but instead, we would see that containerd itself was a child of the systemd root process. If you have a handy Kubernetes node to ssh into, this might be a good follow-up exercise.

If we traverse down these trees far enough, we’ll find all of the processes, including any process started by any container, in our entire OS. Note that the process IDs (PIDs), such as 3135 in the previous snippet, are the host’s view of these processes; inside a container, the same process appears with a different (much lower) PID. If you’re wondering why, recall how we used the unshare command in the first chapter to separate our process namespaces. This means that processes started by containers have no capacity to see, identify, or kill processes running in other containers. This is an important security feature of any software deployment.
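If you’d like to see this PID isolation for yourself, the following is a minimal sketch using unshare on any Linux host (run it as root; the shell started here is just an example):

# Start a shell in a fresh PID namespace; --fork makes the shell a child of
# unshare, and --mount-proc remounts /proc so tools like ps see the new namespace
sudo unshare --pid --fork --mount-proc /bin/sh

# Inside the new namespace, the shell believes it is PID 1 and sees only
# its own descendants:
ps ax

From the host, that same shell shows up under its real, much higher PID, just like the container processes we inspected earlier.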

You may also be wondering why there are pause processes. Each of our containerd-shim programs has a pause program that corresponds to it, which is initially used as a placeholder for the creation of our network namespace. The pause container also helps clean up processes and serves as a placeholder for our CRI to do some basic process bookkeeping, helping us to avoid zombie processes.

4.2.2 cgroups for our process

We now have a pretty good idea of what this scheduler Pod is up to: it has spawned several children, and most likely, it was created by Kubernetes because it’s a child of containerd, which is the container runtime that Kubernetes uses in kind. As a first look at how these processes are managed, you can kill the scheduler process, and you’ll see it and its subthreads come back to life shortly afterward. This is done by the kubelet itself, which watches a manifests directory (we’ll list it shortly). This directory tells the kubelet about a few processes that should always run, even before an API server is able to schedule containers. This, in fact, is how Kubernetes installs itself via a kubelet. The life cycle of a Kubernetes installation, which uses kubeadm (now the most common installation tool), looks something like this:

  • The kubelet has a manifests directory that includes the API server, scheduler, and controller manager.

  • The kubelet is started by systemd.

  • The kubelet tells containerd (or whatever the container runtime is) to start running all the processes in the manifests directory.

  • Once the API server comes up, the kubelet connects to it and then runs any containers that the API server asks it to execute.
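You can see these static Pod manifests for yourself on a kind control plane node. The paths and file names shown here are the kubeadm defaults, so treat them as an assumption that may vary slightly in your cluster:

root@kind-control-plane:/# ls /etc/kubernetes/manifests
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

Deleting or editing one of these files causes the kubelet to stop or re-create the corresponding Pod, with no API server involved at all.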

Mirror Pods sneak up on the API server

The kubelet has a secret weapon: the /etc/kubernetes/manifests directory. This directory is continuously scanned, and when Pod specifications are put inside of it, the kubelet creates and runs them. Because these static Pods aren’t scheduled via the Kubernetes API server, the kubelet creates a read-only copy of each one in the API server so that it is aware of their existence. These copies of Pods created outside the knowledge of the Kubernetes control plane are known as mirror Pods.

Mirror Pods can be viewed by listing them like any other Pod, via kubectl get pods -A, but they are created and managed by a kubelet on an independent basis. This allows the kubelet, alone, to bootstrap an entire Kubernetes cluster that runs inside the Pods. Pretty sneaky!

You might ask, “What does this all have to do with cgroups?” It turns out that the scheduler we’ve been spelunking is actually identified as a mirror Pod, and the cgroups that it is assigned to are named using this identity. The reason it has this special identity is that the API server originally has no knowledge of the Pod because it was created by the kubelet. To be a little less abstract, let’s poke around with the following code and find its identity:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/config.hash: 155707e0c1919c0ad1
    kubernetes.io/config.mirror: 155707e0c19147c8      
    kubernetes.io/config.seen: 2020-06-06T17:21:0
    kubernetes.io/config.source: file
  creationTimestamp: 2020-06-06T17:21:06Z
  labels:

The mirror Pod ID of the scheduler

We use the mirror Pod ID of the scheduler for finding its cgroups. You can view the contents of these Pods by running an edit or get action against a control plane Pod (for example, kubectl edit Pod -n kube-system kube-apiserver-calico-control-plane). Now, let’s see if we can find any cgroups associated with our scheduler process by running the following:

$ cat /proc/631/cgroup

With this command, we used the PID we found earlier to ask the Linux kernel about what cgroups exist for the scheduler. The output is pretty intimidating (something like that shown in the following). Don’t worry about the burstable folder; we will explain the burstable concept, which is a quality of service or QoS class, later, when we look at some kubelet internals. In the meantime, a burstable Pod is generally one that doesn’t have hard usage limits. The scheduler is an example of a Pod that typically runs with the ability to use large bursts of CPU when necessary (for example, in an instance where 10 or 20 Pods need to be quickly scheduled to a node). Each of these entries has an extremely long identifier for the cgroup and Pod identity like so:

13:name=systemd:/docker/b7a49b4281234b31
 b9/kubepods/burstable/pod155707e0c19147c../391fbfc..
 a08fc16ee8c7e45336ef2e861ebef3f7261d

The kernel is thus tracking all of these processes in the /proc location, and we can keep digging further to see what each process is getting in terms of resources. To abbreviate the entire listing of cgroups for process 631, we can cat the cgroup file, as the following shows. Note that we’ve abbreviated the extra-long IDs for readability:

root@kind-control-plane:/# cat /proc/631/cgroup
 
13:name=systemd:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
12:pids:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
11:hugetlb:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
10:net_prio:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
9:perf_event:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
8:net_cls:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
7:freezer:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
6:devices:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
5:memory:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
4:blkio:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
3:cpuacct:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
2:cpu:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
1:cpuset:/docker/b7a49b42.../kubepods/burstable/pod1557.../391f...
0::/docker/b7a49b42.../system.slice/containerd.service

We’ll look inside these folders, one at a time, as follows. Don’t worry too much about the docker folder, though. Because we’re in a kind cluster, the docker folder is the parent of everything. But note that, actually, our containers are all running in containerd:

  • docker—The cgroup for Docker’s daemon running on our computer, which is essentially like a VM that runs a kubelet.

  • b7a49b42 . . .—The name of our Docker kind container. Docker creates this cgroup for us.

  • kubepods—A division of cgroups that Kubernetes puts aside for its Pods.

  • burstable—A special cgroup for Kubernetes that defines the quality of service the scheduler gets.

  • pod1557 . . .—Our Pod’s ID, which is reflected inside our Linux kernel as its own identifier.

At the time of this writing, Docker has been deprecated as a container runtime in Kubernetes. You can think of the docker folder in the example not as a Kubernetes concept, but rather as “the VM that runs our kubelet,” because kind itself is really just running one Docker container as a Kubernetes node and then putting a kubelet, containerd, and so on, inside this node. Thus, continue to repeat to yourself when exploring Kubernetes, “kind itself does not use Docker to run containers.” Instead, it uses Docker to make nodes and installs containerd as the container runtime inside those nodes.

We’ve now seen that every process Kubernetes (for a Linux machine) ultimately lands in the bookkeeping tables of the proc directory. Now, let’s explore what these fields mean for a more traditional Pod: the NGINX container.

4.2.3 Implementing cgroups for a normal Pod

The scheduler Pod is a bit of a special case in that it runs on all clusters and isn’t something that you might directly want to tune or investigate. A more realistic scenario might be one wherein you want to confirm that the cgroups for an application you’re running (like NGINX) were created correctly. In order to try this out, you can create a Pod similar to our original pod.yaml, which runs the NGINX web server with resource requests. The specification for this portion of the Pod looks like the following (which is probably familiar to you):

spec:
    containers:
    - image: nginx
      imagePullPolicy: Always
      name: nginx
      resources:
        requests:
          cpu: "1"
          memory: 1G

In this case, the Pod defines a core count (1) and a memory request (1 GB). These both go into the cgroups defined under the /sys/fs directory, and the kernel enforces the cgroup rules. Remember, you need to ssh into your node to do this or, if you’re using kind, use docker exec -t -i 75 /bin/sh to access the shell for the kind node.

The result is that now your NGINX container runs with dedicated access to 1 core and 1 GB of memory. After creating this Pod, we can actually take a direct look at its cgroup hierarchy by traversing its cgroup information for the memory field (again running the ps -ax command to track it down). In doing so, we can see how Kubernetes really responds to the memory request we give it. We’ll leave it to you, the reader, to experiment with other such limits and see how the OS expresses them.
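If you want to track the cgroup down yourself, the following is one possible sketch. The exact directory layout depends on your cgroup driver and container runtime, so treat the paths and placeholders as assumptions to adapt:

# Get the UID of the NGINX Pod; the kubepods cgroup directories embed this UID
$ kubectl get pod <your-nginx-pod> -o jsonpath='{.metadata.uid}'

# On the node, search the memory controller hierarchy for a matching directory
root@kind-control-plane:/# find /sys/fs/cgroup/memory -type d -name 'pod<uid>*'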

If we now look into our kernel’s memory tables, we can see that there is a marker for how much memory has been carved out for our Pod: about 1 GB. When we made the previous Pod, our underlying container runtime placed it in a cgroup with a limited amount of memory. This solves the exact problem we originally discussed in this chapter—isolating resources for memory and CPU:

$ sudo cat /sys/fs/cgroup/memory/docker/753../kubepods/pod8a58e9/d176../
    memory.limit_in_bytes
999997440

Thus, the magic of Kubernetes isolation really can just be viewed on a Linux machine as regular old hierarchical distribution of resources, organized by a simple directory structure. There’s a lot of logic in the kernel to “get this right,” but it’s all easily accessible to anyone with the courage to peer under the covers.

4.3 Testing the cgroups

We now know how to confirm that our cgroups are created correctly. But how do we test that the cgroups are being honored by our processes? It’s a well-known fact that container runtimes and the Linux kernel itself may have bugs when it comes to isolating things in the exact way we expect. For example, there are instances where the OS might allow a container to run above its allotted CPU allocation if the other processes aren’t starving for resources. Let’s run a simple process with the following code to test whether our cgroups are working properly:

$ cat /tmp/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: core-k8s
  labels:
    role: just-an-example
    app: my-example-app
    organization: friends-of-manning
    creator: jay
spec:
  containers:
    - name: an-old-name-will-do
      image: busybox:latest
      command: ['sleep', '1000']
      resources:
        limits:             
          cpu:  2
        requests:           
          cpu: 1
      ports:
        - name: webapp-port
          containerPort: 80
          protocol: TCP

Ensures our Pod has plenty of opportunity to use lots of CPU

Ensures our Pod won’t start until it has a full core of CPU to access

Now, we can execute into our Pod and run a (nasty) CPU usage command. Then, from a shell on our kind node, we’ll see in the output of top that CPU usage blows up:

$ kubectl create -f pod.yaml
$ kubectl exec -t -i core-k8s /bin/sh    
 
#> dd if=/dev/zero of=/dev/null          
$ docker exec -t -i 75 /bin/sh
 
root@kube-control-plane# top             
PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM   TIME+ COMMAND
91467 root    20   0    1292      4      0 R  99.7   0.0   0:35.89 dd

Creates a shell into your container

Consumes CPU with reckless abandon by running dd

Runs the top command to measure CPU usage in our Docker kind node

What happens if we fence this same process and rerun this experiment? To test this, you can change the resources stanza to something like this:

resources:
        limits:
          cpu:  .1    
        requests:
          cpu: .1     

Limits CPU usage to .1 core as a maximum

Reserves the whole .1 core, guaranteeing this CPU share

Let’s rerun the following command. In this second example, we can actually see a much less stressful scenario for our kind node taking place:

root@kube-control-plane# top           
PID USER      PR  NI   VIRT   RES   SHR S  %CPU  %MEM TIME+COMMAND
93311 root    20  0    1292   4     0   R  10.3  0.0  0:03.61 dd

This time, only about 10% of one CPU core is used on the node.
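To connect this back to the cgroup filesystem, you can check the CPU quota that was written for this container. This is a sketch; the placeholder path segments are hypothetical and follow the cgroup v1 layout we browsed earlier:

root@kind-control-plane:/# cd /sys/fs/cgroup/cpu/kubepods/burstable/pod<uid>/<container-id>
root@kind-control-plane:/# cat cpu.cfs_quota_us cpu.cfs_period_us
10000
100000

A quota of 10,000 out of every 100,000 microseconds is exactly the .1 core limit we set, which is why top settles at roughly 10%.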

4.4 How the kubelet manages cgroups

Earlier in this chapter we glossed over the other cgroups, like blkio. To be sure, there are many different kinds of cgroups, and it’s worth understanding what they are, even though, 90% of the time, CPU and memory isolation are all you’ll be concerned with for most containers.

At a lower level, clever use of the cgroup primitives listed in /sys/fs/cgroup exposes control knobs for managing how these resources are allocated to processes. Some such groups are not readily useful to a Kubernetes administrator. For example, the freezer cgroup assigns groups of related tasks to a single stoppable or freezable control point. This isolation primitive allows for efficient scheduling and descheduling of gang processes (and, ironically, some have criticized Kubernetes for its poor handling of this type of scheduling).

Another example is the blkio cgroup, a lesser-known resource that’s used to manage I/O. Looking into /sys/fs/cgroup, we can see all of the various quantifiable resources that can be allocated hierarchically in Linux:

$ ls /sys/fs/cgroup/
blkio        freezer           perf_event
cpu          hugetlb           pids
cpuacct      memory            rdma
cpu,cpuacct  net_cls           systemd
cpuset       net_cls,net_prio  unified
devices      net_prio

You can read about the original intent of cgroups at http://mng.bz/vo8p. Some of the corresponding articles might be out of date, but they provide a lot of information about how cgroups have evolved and what they are meant to do. For advanced Kubernetes administrators, understanding how to interpret these data structures can be valuable when it comes to looking at different containerization technologies and how they affect your underlying infrastructure.

4.5 Diving into how the kubelet manages resources

Now that you understand where cgroups come from, it is worth taking a look at how cgroups are used in a kubelet; namely, by the allocatable data structure. Looking at an example Kubernetes node (again, you can do this with your kind cluster), we can see the following stanza in the output from kubectl get nodes -o yaml:

...
    allocatable:
      cpu: "12"
      ephemeral-storage: 982940092Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 32575684Ki
      pods: "110"

Do these settings look familiar? By now, they should. These resources are the amount of cgroup budget available for allocating resources to Pods. The kubelet calculates this by determining the total capacity of the node and then subtracting the CPU, memory, and storage required for itself and for the underlying node. The equations for these numbers are documented at http://mng.bz/4jJR and can be toggled with parameters, including --system-reserved and --kube-reserved. This value is then used by the Kubernetes scheduler to decide whether to place a Pod on this particular node.

Typically, you might set --kube-reserved and --system-reserved to half of a core each, leaving a 2-core machine with roughly 1 core free to run workloads, because a kubelet is not an incredibly CPU-hungry process (except in times of burst scheduling or startup). At large scales, all of these numbers break down and depend on a variety of performance factors related to workload types, hardware types, network latency, and so on. As an equation, when it comes to scheduling, we have the following implementation (system-reserved refers to the quantity of resources a healthy OS needs to run):

Allocatable = node capacity - kube-reserved - system-reserved

As an example, if you have

  • 16 CPU cores of capacity on a node

  • 1 CPU core reserved for the kubelet and system processes

the amount of allocatable CPU is 15 cores. To contextualize how all of this relates to a scheduled, running container

  • The kubelet creates cgroups when you run Pods to bound their resource usage.

  • Your container runtime starts a process inside the cgroups, which guarantees the resource requests you gave it in the Pod specification.

  • systemd usually starts a kubelet, which broadcasts the total available resources to the Kubernetes API periodically.

  • systemd also typically starts your container runtime (containerd, CRI-O, or Docker).

When you start a kubelet, there is parenting logic embedded in it. This behavior is configured by a command-line flag (which you should leave enabled) that makes the kubelet itself a top-level cgroup parent of its children’s containers. The previous equation calculates the total amount of allocatable resources for a kubelet, called the allocatable resource budget.
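If you manage your own kubelets, these reservations can also be expressed in the kubelet’s configuration file rather than as flags. The following is a minimal sketch of a KubeletConfiguration; the specific quantities are example values, not recommendations:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Resources set aside for Kubernetes daemons (the kubelet, container runtime)
kubeReserved:
  cpu: 500m
  memory: 500Mi
# Resources set aside for the rest of the OS (systemd, sshd, and so on)
systemReserved:
  cpu: 500m
  memory: 500Mi
# Everything left over becomes the allocatable budget advertised to the scheduler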

4.5.1 Why can’t the OS use swap in Kubernetes?

To understand this, we have to dive a little deeper into the specific cgroups that we saw earlier. Remember how our Pods resided under special folders, such as guaranteed and burstable? If we allowed our OS to swap inactive memory to disk, then an idle process might suddenly have slow memory allocation. This allocation would violate the guaranteed access to memory that Kubernetes provides users when defining Pod specifications and would make performance highly variable.

Because the scheduling of large amounts of processes in a predictable manner is more important than the health of any one process, we disable swapping entirely on Kubernetes. To avoid any confusion around this, the Kubernetes installers, such as kubeadm, fail instantly if you bootstrap your kubelets on machines with swap enabled.

Why not enable swap?

In certain cases, thinly provisioning memory might benefit an end user (for example, it might allow you to pack containers on a system more densely). However, the semantic complexity associated with accommodating this type of memory facade isn’t proportionally beneficial to most users. The maintainers of the kubelet haven’t decided (yet) to support this more complex notion of memory, and such API changes are hard to make in a system such as Kubernetes, which is being used by millions of users.

Of course, like everything else in tech, this is rapidly evolving, and in Kubernetes 1.22, you’ll find that, in fact, there are ways you can run with swap memory enabled (http://mng.bz/4jY5). This is not recommended for most production deployments, however, because it would result in highly erratic performance characteristics for workloads.

That said, there is a lot of subtlety at the container runtime level when it comes to resource usage such as memory. For example, cgroups differentiate between soft and hard limits as follows:

  • A process with soft memory limits has varying amounts of RAM over time, depending on the system load.

  • A process with hard memory limits is killed if it exceeds its memory limit for an extended period.

Note that Kubernetes relays an exit code and the OOMKilled status back to you in the cases where a process has to be killed for these reasons. You can increase the amount of memory allocated to a high-priority container to reduce the odds that a noisy neighbor causes problems on a machine. Let’s look at that next.

4.5.2 Hack: The poor man’s priority knob

HugePages is a concept that initially was not supported in Kubernetes because Kubernetes was a web-centric technology at its inception. As it moved into core data-center use, more subtle scheduling and resource allocation strategies became relevant. HugePages configuration allows a Pod to access memory pages larger than the Linux kernel’s default page size, which is usually 4 KB.

Memory, like CPU, can be allocated explicitly for Pods and is denoted using binary units (Ki, Mi, and Gi). Many memory-intensive applications, like Elasticsearch and Cassandra, support using HugePages. If a node supports HugePages with a 2,048 KiB (2 MiB) page size, it exposes a schedulable resource named hugepages-2Mi. In general, it is possible to schedule against HugePages in Kubernetes using a standard resources directive as follows:

resources:
  limits:
    hugepages-2Mi: 100Mi
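To actually consume those pages, a container typically also mounts a hugetlbfs-backed volume. The following is a minimal sketch of the relevant Pod fragments; the volume name and mount path are examples, and whether your application needs the mount depends on how it allocates HugePages:

    volumeMounts:
    - name: hugepage
      mountPath: /hugepages      # the app maps files from this hugetlbfs mount
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages          # backs the volume with HugePages memory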

Transparent HugePages (THP) is a kernel feature that manages HugePages automatically, and it can have highly variable effects on Pods that need high performance. You’ll want to disable it in some cases, especially for high-performance containers that need large, contiguous blocks of memory; this is done at the bootloader or OS level, depending on your hardware.

4.5.3 Hack: Editing HugePages with init containers

We’ve come full circle now. Remember how at the beginning of this chapter we looked at the /sys/fs directory and how it managed various resources for containers? The rigging of HugePages can be done in init containers if you can run these as root and mount /sys using a container to edit these files.

The configuration of HugePages can be toggled by merely writing files to and from the sys directory. For example, to turn off transparent HugePages, which might make a performance difference for you on specific OSs, you would typically run a command such as echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/enabled. If you need to set up HugePages in a specific way, you could do so entirely from a Pod specification as follows:

  1. Declare a Pod, which presumably has specific performance needs based around HugePages.

  2. Declare an init container with this Pod, which runs in privileged mode and mounts the /sys directory using the volume type of hostPath.

  3. Have the init container execute any Linux-specific commands (such as the previous echo statement) as its only execution steps.

In general, init containers can be used to bootstrap certain Linux features that might be required for a Pod to run properly. But keep in mind that any time you mount a hostPath, you need special privileges on your cluster, which an administrator might not readily give you. Some distributions, such as OpenShift, deny hostPath volume mounts by default.
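Putting the three steps above together, a minimal sketch of such a Pod might look like the following. The image, mount path, and echo command are assumptions you would adapt to your own OS and hardware:

apiVersion: v1
kind: Pod
metadata:
  name: thp-tuned-app
spec:
  initContainers:
  - name: disable-thp                  # steps 2 and 3: privileged init container
    image: busybox:latest
    securityContext:
      privileged: true
    command: ['sh', '-c',
      "echo never > /host-sys/kernel/mm/transparent_hugepage/enabled"]
    volumeMounts:
    - name: sys
      mountPath: /host-sys             # the node's /sys, mounted via hostPath
  containers:
  - name: my-hugepages-app             # step 1: the app with performance needs
    image: busybox:latest
    command: ['sleep', '1000']
  volumes:
  - name: sys
    hostPath:
      path: /sys

Because the init container completes before the main container starts, the kernel setting is already in place by the time your application allocates memory.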

4.5.4 QoS classes: Why they matter and how they work

We’ve seen terms such as guaranteed and burstable throughout this chapter, but we haven’t defined these terms yet. To define these concepts, we first need to introduce QoS.

When you go to a fancy restaurant, you expect the food to be great, but you also expect the wait staff to be responsive. This responsiveness is known as quality of service or QoS. We hinted at QoS earlier when we looked at why swap is disabled in Kubernetes to guarantee the performance of memory access. QoS refers to the availability of resources at a moment’s notice. Any data center, hypervisor, or cloud has to make a tradeoff around resource availability for applications by

  • Guaranteeing that critical services stay up but spending lots of money, because you have more hardware than you need

  • Spending little money but risking essential services going down

QoS allows you to walk the fine line of having many services performing suboptimally during peak times without sacrificing the quality of critical services. In practice, these critical services might be payment-processing systems, machine-learning or AI jobs that are costly to restart, or real-time communications processes that cannot be interrupted. Keep in mind that the eviction of a Pod depends heavily on how far above its resource requests it is running. In general

  • Nicely-behaved applications with predictable memory and CPU usage are less likely to be evicted in times of duress than others.

  • Greedy applications are more likely to get killed during times of pressure when they attempt to use more CPU or memory than allocated by Kubernetes, unless these apps are in the Guaranteed class.

  • Applications in the BestEffort QoS class are highly likely to get killed and rescheduled in times of duress.

You might be wondering how we decide which QoS class to use. In general, you don’t directly decide this, and instead, you influence this decision by determining whether your app needs guaranteed access to resources by using the resource stanza in your Pod specification. We’ll walk through this process in the following section.

4.5.5 Creating QoS classes by setting resources

Burstable, Guaranteed, and BestEffort are the three QoS classes that are created for you, depending on how you define a Pod. These settings can increase the number of containers that you can run on your cluster, where some may die off at times of high utilization and can be rescheduled later. It’s tempting to make global policies for how much CPU or memory you should allocate to end users but, be warned, rarely does one size fit all:

  • If all the containers on your system have a Guaranteed QoS, your ability to handle dynamic workloads with modulating resources needs is hampered.

  • If no containers on your servers have a Guaranteed QoS, then a kubelet won’t be able to ensure that certain critical processes stay up.

The rules for QoS determination are as follows (these are calculated and displayed as a status field in your Pod):

  • BestEffort Pods are those that have no CPU or memory requests. They are easily killed and displaced (and are likely to pop up on a new node) when resources are tight.

  • Burstable Pods are those that have memory or CPU requests but do not meet the criteria for Guaranteed (for example, limits are not set for every resource, or the limits do not match the requests). These are less likely to be displaced than BestEffort Pods.

  • Guaranteed Pods are those where every container has both CPU and memory limits, and the requests (if set) equal those limits. These are less likely to be displaced than Burstable Pods.

Let’s see this in action. Create a new deployment by running kubectl create ns qos; kubectl -n qos run --image=nginx myapp. Then, edit the deployment to include a container specification that states a request but does not define a limit. For example:

spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources:
          requests:
            cpu: "1"
            memory: 1G

Now, when you run kubectl get pods -n qos -o yaml, you will see a Burstable class assigned to the status field of your Pod, as the following code snippet shows. In crunch time, you might use this technique to ensure that the most critical processes for your business all have a Guaranteed or Burstable status.

    hostIP: 172.17.0.3
    phase: Running
    podIP: 192.168.242.197
    qosClass: Burstable
    startTime: "2020-03-08T08:54:08Z"
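For comparison, if you set limits that match the requests for both CPU and memory, the Pod lands in the Guaranteed class instead. A minimal sketch of such a resources stanza (the values are just examples):

        resources:
          requests:
            cpu: "1"
            memory: 1G
          limits:
            cpu: "1"
            memory: 1G

Omitting the resources stanza entirely would instead land the Pod in the BestEffort class.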

4.6 Monitoring the Linux kernel with Prometheus, cAdvisor, and the API server

We’ve looked at a lot of low-level Kubernetes concepts and mapped them to the OS in this chapter, but in the real world, you won’t be manually curating this data. Instead, for system metrics and overall trends, people typically aggregate container and system-level OS information in a single, time-series dashboard so that, in case of emergencies, they can ascertain the timescale of a problem and drill into it from various perspectives (application, OS, and so forth).

To conclude this chapter, we’ll up-level things a little bit and use Prometheus, the industry standard for monitoring cloud native applications as well as Kubernetes itself. We’ll look at how Pod resource usage can be quantified by direct inspection of cgroups. This has several advantages when it comes to end-to-end system visibility:

  • It can see sneaky processes that aren’t visible to Kubernetes but that might overrun your cluster.

  • You can directly map resources that Kubernetes is aware of with kernel-level isolation tools, which might uncover bugs in the way your cluster is interacting with your OS.

  • It’s a great tool for learning more about how containers are implemented at scale by the kubelet and your container runtime.

Before we get into Prometheus, we need to talk about metrics. In theory, a metric is a quantifiable value of some sort; for example, how many cheeseburgers you ate in the last month. In the Kubernetes universe, the myriad of containers coming online and offline in a data center makes application metrics important for administrators as an objective and app-independent model for measuring the overall health of a data center’s services.

Sticking with the cheeseburger metaphor, you might have a collection of metrics that look something like the following code snippet, which you can jot down in a journal. There are three fundamental types of metrics that we’ll concern ourselves with—histograms, gauges, and counters:

  • Gauges: Indicate how many requests you get per second at any given time.

  • Histograms: Show bins of timing for different types of events (e.g., how many requests completed in under 500 ms).

  • Counters: Specify continuously increasing counts of events (e.g., how many total requests you’ve seen).

As a concrete example that might be a little closer to home, we can output Prometheus metrics about our daily calorie consumption. The following code snippet shows this output:

meals_today 2                              
cheeseburger 1                             
salad 1
dinner 1
lunch 1
calories_total_bucket{le="1024"} 1    

The total number of meals you had today

The number of cheeseburgers you’ve eaten today

The amount of calories you’ve had, binned into buckets of 2, 4, 8, 16, and so on, up to 2,048

You might publish the total number of meals once a day. This is known as a gauge, as it goes up and down and is updated periodically. The amount of cheeseburgers you’ve eaten today would be a counter, which continually gets incremented over time. With the amount of calories you’ve had, the metric says you had one meal with less than 1,024 calories. This gives you a discrete way to bin how much you ate without getting bogged down in details (anything above 2,048 is probably too much and anything below 1,024 is most likely too few).

Note that buckets like this are commonly used to monitor etcd over long time periods; the number of writes taking longer than 1 second is important for predicting etcd outages. Over time, if we aggregated the daily journal entries that you made, you might be able to make some interesting correlations, as long as you logged the time of these metrics. For example:

meals_today 2
cheeseburger 50
salad 99
dinner 101
lunch 99
 
calories_total_bucket{le="512"} 10
calories_total_bucket{le="1024"} 40
calories_total_bucket{le="2048"} 60

If you plotted these metrics on their own individual y-axes with the x-axis being time, you might be able to see that

  • Days where you ate cheeseburgers were inversely correlated with days you ate salad.

  • The amount of cheeseburgers you’ve been eating is increasing steadily.

4.6.1 Metrics are cheap to publish and extremely valuable

Metrics are important for containerized and cloud-based applications, but they need to be managed in a lightweight and decoupled manner. Prometheus gives us the tools to enable metrics at scale without creating any unnecessary boilerplate or frameworks that get in our way. It is designed to fulfill the following requirements:

  • Hundreds or thousands of different processes might publish similar metrics, which means that a given metric needs to support metadata labels to differentiate these processes.

  • Applications should publish metrics in a language-independent manner.

  • Applications should publish metrics without being aware of how those metrics are being consumed.

  • It should be easy for any developer to publish metrics for a service, regardless of the language they use.

Programmatically, if we were to journal our diet choices in the previous analogy, we would declare instances of cheeseburger, meals_today, and calories_total that would be of the type counter, gauge, and histogram, respectively. These types would be Prometheus API types, supporting operations that automatically store values in local memory, which can then be scraped as plain text from a local endpoint. Typically, this is done by adding a Prometheus handler to a REST API server, and this handler serves only one meaningful endpoint: /metrics. To manage this data, we might use a Prometheus API client like so:

  • Periodically, to observe a value for how many meals_today we’ve had, because that is a gauge

  • Periodically, to increment the cheeseburger counter right after lunch

  • Daily, to record observations into the calories_total histogram, which can be fed in from a different data source (see the sample output that follows)
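If we exposed these through a Prometheus client, the scraped /metrics output might look roughly like the following (a sketch in the Prometheus text exposition format; the metric names are our own invented examples):

# TYPE meals_today gauge
meals_today 2
# TYPE cheeseburger_total counter
cheeseburger_total 1
# TYPE calories_total histogram
calories_total_bucket{le="1024"} 1
calories_total_bucket{le="2048"} 1
calories_total_bucket{le="+Inf"} 1
calories_total_sum 850
calories_total_count 1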

Over time, we could possibly correlate whether eating cheeseburgers related to a higher total calorie consumption on a per-day basis, and we might also be able to tie in other metrics (for example, our weight) to these values. Although any time-series database could enable this, Prometheus, as a lightweight metrics engine, works well in containers because each process publishes its metrics independently and statelessly, and it has emerged as the modern standard for adding metrics to any application.

Don’t wait to publish metrics

Prometheus is often mistakenly thought of as a heavyweight system that needs to be centrally installed to be useful. Actually, it’s really just an open source counting tool and an API that can be embedded in any application. The fact that a Prometheus master can scrape and integrate this information is obviously central to that story, but it’s not a requirement to begin publishing and collecting metrics for your app.

Any microservice can publish metrics on an endpoint by importing a Prometheus client. Although your cluster may not consume these metrics, there’s no reason not to make these available on the container side, if for no other reason than that you can use this endpoint to manually inspect counts of various quantifiable aspects of your application, and you can spin up an ad hoc Prometheus master if you want to observe it in the wild.

There are Prometheus clients for all major programming languages. Thus, for any microservice, it is simple and cheap to journal the daily goings-on of various events as a Prometheus metric.

4.6.2 Why do I need Prometheus?

In this book, we focus on Prometheus because it is the de facto standard in the cloud native landscape, but we’ll try to convince you that it deserves this status with a simple, powerful example of how to quickly do a health check on the inner workings of your API server. As an example, you can take a look at whether requests for Pods have put a lot of strain on your Kubernetes API server by running the following commands in your terminal (assuming that you have your kind cluster up and running). In a separate terminal, run a kubectl proxy command, and then curl the API server’s metrics endpoint like so:

$ kubectl proxy                               
 
$> curl localhost:8001/metrics |grep etcd     
etcd_request_duration_seconds_bucket{op="get",type="*core.Pod",le="0.005"}
174
etcd_request_duration_seconds_bucket{op="get",type="*core.Pod",le="0.01"}
194
etcd_request_duration_seconds_bucket{op="get",type="*core.Pod",le="0.025"}
201
etcd_request_duration_seconds_bucket{op="get",type="*core.Pod",le="0.05"}
203

Allows you to access the Kubernetes API server on localhost:8001

curls the API server’s metrics endpoint

Anyone with a kubectl client can immediately use the curl command to ingest real-time metrics about the response times for a certain API endpoint. In the previous snippet, we can see that almost all etcd get requests for Pod objects complete in less than .025 seconds, which is generally considered reasonable performance. For the remainder of this chapter, we’ll set up a Prometheus monitoring system for your kind cluster from scratch.

4.6.3 Creating a local Prometheus monitoring service

We can use a Prometheus monitoring service to inspect the way cgroups and system resources are utilized under duress. The architecture of a Prometheus monitoring system (figure 4.2) on kind includes the following:

  • A Prometheus master

  • A Kubernetes API server that the master monitors

  • Many kubelets (in our case, 1), each a source of metric information for the API server to aggregate

Figure 4.2 Architecture of a Prometheus monitoring deployment

Note that, in general, a Prometheus master might be scraping metrics from many different sources, including API servers, hardware nodes, standalone databases, and even standalone applications. Not all services conveniently get aggregated into the Kubernetes API server for use, however. In this simple example, we want to look at how to use Prometheus for the specific purpose of monitoring cgroup resource usage on Kubernetes, and conveniently, we can do this by simply scraping data for all of our nodes directly from the API server. Also, note that our kind cluster in this example has only one node. Even if we had more nodes, we could still scrape all of this data directly from the API server by adding more scrape jobs (one per node) to our scrape YAML file (which we will introduce shortly).

We will launch Prometheus with the configuration file that follows shortly (stored as prometheus.yml), using these commands:

$ mkdir ./data
$ ./prometheus-2.19.1.darwin-amd64/prometheus \
      --storage.tsdb.path=./data --config.file=./prometheus.yml

The kubelet uses the cAdvisor library to monitor cgroups and to collect quantifiable data about them (for example, how much CPU and memory a Pod in a particular group uses). Because you already know how to browse through cgroup filesystem hierarchies, reading the cAdvisor metrics collected by a kubelet will yield an “aha” moment for you (in terms of understanding how Kubernetes itself connects to the lower-level kernel resource accounting). To scrape these metrics, we’ll tell Prometheus to query the API server (through the kubectl proxy we started earlier, listening on localhost:8001) every 3 seconds like so:

global:
  scrape_interval: 3s
  evaluation_interval: 3s
 
scrape_configs:
  - job_name: prometheus
    metrics_path:
      /api/v1/nodes/kind-control-plane/proxy/metrics/cadvisor    
    static_configs:
      - targets: ['localhost:8001']    

The kind control plane node is the only node in our cluster.

Add more nodes in our cluster or more things to scrape in subsequent jobs here.

Real-world Prometheus configurations have to account for real-world constraints. These include data size, security, and alerting protocols. Note that time-series databases are notoriously greedy when it comes to disk usage and that metrics can reveal a lot about a threat model for your organization. These may not be important in your initial prototyping, as we noted earlier, but it’s better to start by publishing your metrics at the application level and to then add the complexity of managing a heavy-weight Prometheus installation later. For our simple example, this will be all we need to configure Prometheus to explore our cgroups.

Again, remember that the API server receives data from the kubelet periodically, which is why this strategy of scraping only one endpoint works. If this was not the case, we could collect this data directly from the kubelet itself or even run our own cAdvisor service. Now, let’s take a look at the container_cpu_user_seconds_total metric. We’ll make it spike by running the following command.

Warning This command immediately creates a lot of network and CPU traffic on your computer.

$ kubectl apply -f \
  https://raw.githubusercontent.com/giantswarm/kube-stresscheck/master/examples/node.yaml

This command launches a series of resource-intensive containers that suck up network assets, memory, and CPU cycles in your cluster. If you’re on a laptop, the stress containers produced by running this command will probably cause a lot of CPU spiking, and you might hear some fan noise.

In figure 4.3, you’ll see what our kind cluster looks like under duress. We’ll leave it as an exercise for you to map the various container cgroups and metadata (found by hovering your mouse over the Prometheus metrics) back to processes and containers that are running in your system. In particular, it’s worth looking at the following metrics to get a feel for CPU-level monitoring in Prometheus. Exploring these metrics, along with the hundreds of other metrics in your system when running your favorite workloads or containers, gives you a good way to create important monitoring and forensics protocols for your internal systems engineering pipelines:

  • container_memory_usage_bytes

  • container_fs_writes_total

  • container_memory_cache
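To plot something like figure 4.3 yourself, you can type a query into the Prometheus expression browser. As a sketch, here is a rate query over the cAdvisor CPU counter; the label names and selector are assumptions you may need to adapt to your kubelet version:

sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="default"}[1m]))

This turns the ever-increasing per-container CPU counters into a per-Pod usage rate, which is usually what you actually want to look at during an incident.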

Figure 4.3 Plotting metrics in a busy cluster

4.6.4 Characterizing an outage in Prometheus

Let’s look in more detail at the three metric types before closing up shop for this chapter, just for good measure. In figure 4.4, we compare the general topology of how these three metrics give you different perspectives on the same situation in your data center. Specifically, we can see that the gauge gives us a Boolean value indicating whether our cluster is up. Meanwhile, the histogram shows us fine-grained information on how requests are trending until we lose our application entirely. Finally, the counters show us the overall number of transactions leading to an outage:

  • The gauge readout would be most valuable to someone who might be on pager duty for application up time.

  • The histogram readout may be most valuable to an engineer doing “day after” forensics on why a microservice went down for an extended time.

  • The counter metric would be a good way to determine how many successful requests were served before an outage. For example, in case of a memory leak, we might find that after a certain number of requests (say, 15,000 or 20,000), a web server predictably fails.

Figure 4.4 Comparing how gauge, histogram, and counter metrics look in the same scenario cluster

It’s ultimately up to you to decide which metrics you want to use to make decisions, but in general, it’s good to keep in mind that your metrics should not just be a dumping ground for information. Rather, they should help you tell a story about how your services behave and interact with each other over time. Generic metrics are rarely useful for debugging intricate problems, so take the time to embed the Prometheus client into your applications and collect some interesting, quantifiable application metrics. Your administrators will thank you! We’ll look back at metrics again in our etcd chapter, so don’t worry—there will be more Prometheus to come!

Summary

  • The kernel expresses cgroup limitations for containers.

  • The kubelet starts the scheduler process from a static manifest and mirrors it to the API server.

  • We can use simple containers to inspect how cgroups implement a memory limitation.

  • The kubelet uses QoS classes to nuance the resource guarantees for processes in your Pods.

  • We can use Prometheus to view real-time metrics of a cluster under duress.

  • Prometheus expresses three core metric types: gauges, histograms, and counters.
