9 Running Pods: How the kubelet works

This chapter covers

  • Learning what the kubelet does and how it’s configured
  • Connecting container runtimes and launching containers
  • Controlling the Pod’s life cycle
  • Understanding the CRI
  • Looking at the Go interfaces inside the kubelet and CRI

The kubelet is the workhorse of a Kubernetes cluster, and there can be thousands of kubelets in a production data center, as every node runs the kubelet. In this chapter, we’ll go through the internals of what the kubelet does and precisely how it uses the CRI (Container Runtime Interface) to run containers and manage the life cycle of workloads.

One of the kubelet’s jobs is to start and stop containers, and the CRI is the interface that the kubelet uses to interact with container runtimes. For example, containerd is categorized as a container runtime because it takes an image and creates a running container. The Docker engine is also a container runtime, but it has been deprecated by the Kubernetes community in favor of containerd, runC, or other runtimes.

Note We want to thank Dawn Chen for allowing us to interview her about the kubelet. Dawn is the original author of the kubelet binary and is currently one of the leads of the Node Special Interest Group for Kubernetes. This group maintains the kubelet codebase.

9.1 The kubelet and the node

At a high level, the kubelet is a binary, typically started by systemd. The kubelet runs on every node and acts as a node agent and Pod manager, but only for the local node. It monitors and maintains information about the server it runs on and, based on changes to the node, updates the Node object via calls to the API server.

Let’s start by looking at a Node object, which we get by executing kubectl get nodes <insert_node_name> -o yaml on a running cluster. The next few code blocks are snippets produced by the kubectl get nodes command. You can follow along by executing kind create cluster and running the kubectl commands. For example, kubectl get nodes -o yaml produces the following output, which is shortened for the sake of brevity:

kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket:
      /run/containerd/containerd.sock         
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  labels:
    beta.kubernetes.io/arch: amd64
    kubernetes.io/hostname: kind-control-plane
    node-role.kubernetes.io/master: ""
  name: kind-control-plane

The kubelet uses this socket to communicate with the container runtime.

The metadata in the Node object in this code tells us what its container runtime is and what Linux architecture it runs. The kubelet also interacts with the CNI provider. As we have mentioned in other chapters, the CNI provider’s job is to allocate IP addresses for Pods and to create the Pod’s network, which allows networking inside a Kubernetes cluster. The Node API object includes the CIDR (an IP address range) from which the Pods on that node get their addresses. Importantly, the node also gets an internal IP address of its own, which is necessarily different from anything in the Pod CIDR. The next source block displays part of the YAML produced by kubectl get node:

spec:
  podCIDR: 10.244.0.0/24

Now we get to the status portion of the definition. All Kubernetes API objects have spec and status fields:

  • spec—Defines an object’s specifications (what you want it to be)

  • status—Represents the current state of an object

The status stanza is the data that the kubelet maintains for the cluster, and it also includes a list of conditions that are heartbeat messages communicated to the API server. All additional system information is acquired automatically when the node starts. This status information is sent to the Kubernetes API server and is updated continually. The following code block displays part of the YAML produced by kubectl get node that shows the status fields:

status:
  addresses:
  - address: 172.17.0.2
    type: InternalIP
  - address: kind-control-plane
    type: Hostname

Further down in the YAML document, you’ll find the allocatable fields for this node. If you explore these fields, you’ll see that there is information about CPU and memory:

allocatable:
  ...
capacity:
  cpu: "12"
  ephemeral-storage: 982940092Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 32575684Ki
  pods: "110"

There are other fields available in a Node object, so we encourage you to look at the YAML for yourself when your nodes report back as you inspect them. You can have anywhere from 0–15,000 nodes (15,000 nodes is considered the current limit of nodes on a cluster due to endpoints and other metadata-intensive operations that become costly at scale). The information in the Node object is critical for things like scheduling Pods.
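
If you’d rather poke at the Node object programmatically, the following sketch uses the official Go client (client-go) to print a few of the fields we just looked at: the Pod CIDR, capacity, and the heartbeat conditions. It’s a minimal sketch, assuming a kubeconfig in the default location (which kind provides); the field names come straight from the Node API object.

package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumes a kubeconfig at the default path (for example, what kind creates).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The same data kubectl get nodes -o yaml shows, read via the API server.
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Println("node:", n.Name)
		fmt.Println("  podCIDR:", n.Spec.PodCIDR)
		fmt.Println("  cpu capacity:", n.Status.Capacity.Cpu().String())
		fmt.Println("  memory capacity:", n.Status.Capacity.Memory().String())
		for _, c := range n.Status.Conditions {
			// Conditions are the heartbeat data the kubelet reports.
			fmt.Printf("  condition %s=%s\n", c.Type, c.Status)
		}
	}
}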

9.2 The core kubelet

We know that the kubelet is a binary that is installed on every node, and we know that it is critical. Let’s get into the world of the kubelet and what it does. Nodes and kubelets are not useful without a container runtime, which they rely on to execute containerized processes. We’ll take a look at container runtimes next.

9.2.1 Container runtimes: Standards and conventions

Container images are, at their core, tarballs, and the kubelet needs well-defined APIs for the binaries that unpack and run them. This is where standards come into play. Two specifications, the CRI and the OCI, define the how and the what for the kubelet’s goal of running containers:

  • The CRI defines the how. These are the remote calls used to start, stop, and manage containers and images. Any container runtime fulfills this interface in one way or another as a remote service.

  • The OCI defines the what. This is the standard for container image formats. When you start or stop a container via a CRI implementation, you are relying on that container’s image format being standardized in a certain way. An OCI image is essentially a tarball that contains more tarballs, plus a metadata file (see the sketch after this list).
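
To make the what half of this split concrete, here is a small sketch that decodes an OCI image manifest using the Go types published with the OCI image specification (the github.com/opencontainers/image-spec module). The manifest.json path is an assumption for illustration; the point is that the “tarball of tarballs” is described by a well-defined, machine-readable manifest.

package main

import (
	"encoding/json"
	"fmt"
	"os"

	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

func main() {
	// Hypothetical path to a manifest extracted from an OCI image layout.
	raw, err := os.ReadFile("manifest.json")
	if err != nil {
		panic(err)
	}

	var manifest ocispec.Manifest
	if err := json.Unmarshal(raw, &manifest); err != nil {
		panic(err)
	}

	// The config descriptor points at image metadata (entrypoint, env, and so on).
	fmt.Println("config digest:", manifest.Config.Digest)

	// Each layer descriptor points at one of the inner tarballs.
	for i, layer := range manifest.Layers {
		fmt.Printf("layer %d: %s (%d bytes, %s)\n",
			i, layer.Digest, layer.Size, layer.MediaType)
	}
}

A CRI implementation reads roughly this metadata before it unpacks the layers and hands things off to a runtime such as runC.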

If you can, start a kind cluster so that you can walk through these examples with us. The kubelet’s core dependency, the CRI, must be provided as a startup argument to the kubelet or configured in an alternative manner. As an example of containerd’s configuration, you can look at /etc/containerd/config.toml inside a running kind cluster and observe the various configuration inputs, which include the hooks that define the CNI provider. For example:

# explicitly use v2 config format
version = 2
 
# set default runtime handler to v2, which has a per-pod shim
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
 
# setup a runtime with the magic name ("test-handler") for k8s
# runtime class tests ...
[plugins."io.containerd.grpc.v1.cri"
    .containerd.runtimes.test-handler]
  runtime_type = "io.containerd.runc.v2"

In the next example, we use kind to create a Kubernetes v1.20.2 cluster. Note that this output may vary between Kubernetes versions. To view the file on a kind cluster, run these commands:

$ kind create cluster                             
 
$ export KIND_CONTAINER=$(docker ps | grep kind | awk '{ print $1 }')
 
$ docker exec -it "$KIND_CONTAINER" /bin/bash     
 
root@kind-control-plane:/# cat /etc/containerd/config.toml

Creates a Kubernetes cluster

Finds the Docker container ID of the running kind container

Executes into the running container and gets an interactive command line

Displays the containerd configuration file

We’re not going to dive into container implementation details here. Still, you need to know that the container runtime is what the kubelet depends on under the hood. The kubelet takes a CRI provider, image registry, and runtime values as inputs, meaning that it can accommodate many different containerization implementations (VM containers, gVisor containers, and so on). If you are in the same shell running inside the kind container, you can execute the following command:

root@kind-control-plane:/# ps axu | grep /usr/bin/kubelet
root         653 10.6  3.6 1881872 74020 ?
   Ssl  14:36   0:22 /usr/bin/kubelet
   --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
   --kubeconfig=/etc/kubernetes/kubelet.conf
   --config=/var/lib/kubelet/config.yaml
   --container-runtime=remote
   --container-runtime-endpoint=unix:///run/containerd/containerd.sock
   --fail-swap-on=false --node-ip=172.18.0.2
   --provider-id=kind://docker/kind/kind-control-plane
   --fail-swap-on=false

This prints the list of configuration options and command-line flags provided to the kubelet running inside the kind container. These options are covered next; however, we do not cover all the options because there are a lot.

9.2.2 The kubelet configurations and its API

The kubelet is an integration point for a broad range of primitives in the Linux OS. Some of its data structures reveal the form and function of how it has evolved. The kubelet has well over 100 different command-line options in two different categories:

  • Options—Toggle the behavior of the low-level Linux functionality used with Kubernetes, such as rules related to maximum iptables usage or DNS configuration

  • Choices—Define the life cycle and health of the kubelet binary

The kubelet has numerous corner cases; for example, how it handles Docker versus containerd workloads, how it manages Linux versus Windows workloads, and so on. Each one of these corner cases may take weeks or even months to debate when it comes down to defining its specification. Because of this, it’s good to understand the structure of the kubelet’s codebase so that you can dig into it and provide yourself with some self-soothing in cases where you hit a bug or an otherwise unexpected behavior.

Note The Kubernetes v1.22 release introduced quite a few changes to the kubelet. Some of these changes included removal of in-tree storage providers, new security defaults via the --seccomp-default flag, the ability to rely on memory swapping (known as the NodeSwap feature), and memory QoS improvements. If you are interested in learning more about all the improvements in the Kubernetes v1.22 release, we highly recommend reading through http://mng.bz/2jy0. Relevant to this chapter, a recent bug in the kubelet can cause static Pod manifest changes to break long running Pods.

The kubelet.go file is the main entry point for the start of the kubelet binary. The cmd folder contains the definitions for the kubelet’s flags. (Take a look at http://mng.bz/REVK for the flags, CLI options, and definitions.) The following declares the KubeletFlags struct. This struct holds the CLI flags; the kubelet also has API-driven configuration values, which we’ll look at shortly:

// KubeletFlags contains configuration flags for the kubelet.
// A configuration field should go in KubeletFlags instead of the
// KubeletConfiguration if any of these are true:
// - its value will never or cannot safely be changed during
//   the lifetime of a node, or
// - its value cannot be safely shared between nodes at the
//   same time (e.g., a hostname);
//   the KubeletConfiguration is intended to be shared between nodes.
// In general, please try to avoid adding flags or configuration fields,
// we already have a confusingly large amount of them.
 
type KubeletFlags struct {

Previously, we had a code block where we grepped for /usr/bin/kubelet, and part of the result included --config=/var/lib/kubelet/config.yaml. The --config flag defines a configuration file. The following code block cats that configuration file:

$ cat /var/lib/kubelet/config.yaml    

Outputs the config.yaml file

The next code block shows the output of the cat command:

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
clusterDNS:
- 10.96.0.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionHard:
  imagefs.available: 0%
  nodefs.available: 0%
  nodefs.inodesFree: 0%
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageGCHighThresholdPercent: 100
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s

All of the kubelet API values are defined in the types.go file at http://mng.bz/wnJP. This file is an API data structure holding input configuration data for the kubelet. It defines many of the configurable aspects of the kubelet referenced via http://mng.bz/J1YV.

Note Although we reference Kubernetes version 1.20.2 in the URLs, keep in mind that while the exact code location may vary between releases, the API objects themselves change quite slowly.

Kubernetes API machinery is the mechanism, or standard, for how API objects are defined within Kubernetes and the Kubernetes code base.

You will notice in the types.go file that many low-level networking and process control knobs are sent directly to the kubelet as input. The following example shows the ClusterDNS configuration that you probably can relate to. It is important for a functioning Kubernetes cluster:

// ClusterDNS is a list of IP addresses for a cluster DNS server. If set,
// the kubelet will configure all containers to use this for DNS resolution
// instead of the host's DNS servers.
 
ClusterDNS []string

When a Pod is created, multiple files are also produced dynamically. One of those files is /etc/resolv.conf. It is used by the Linux networking stack to perform DNS lookups because the file defines DNS servers. We’ll see how to create this next.

9.3 Creating a Pod and seeing it in action

Run the following commands to create an NGINX Pod running on a Kubernetes cluster. Then, from a command line inside the Pod, you can cat the resolv.conf file, as shown in the next code block:

$ kubectl run nginx --image nginx              
 
$ kubectl exec -it nginx -- /bin/bash          
root@nginx:/# cat /etc/resolv.conf             
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

Starts the Pod

Executes into the shell of the running NGINX container

Uses cat to inspect the resolv.conf file

You can now see how the kubelet, when creating a Pod (as in the previous section), creates and mounts the resolv.conf file. Now your Pod can do a DNS lookup and, if you want, you can ping google.com. Other interesting structs in the types.go file include

  • ImageMinimumGCAge (for image garbage collection)—In long-running systems, images might fill up drive space over time.

  • KubeletCgroups (for Pod cgroup roots and drivers)—The ultimate upstream pool for Pod resources can be systemd, and this struct unifies the administration of all processes along with the administration of containers.

  • EvictionHard (for hard limits)—This struct specifies when Pods should be deleted, which is based on system load.

  • EvictionSoft (for soft limits)—This struct specifies how long the kubelet waits before evicting a greedy Pod.

These are just a few of the types.go file options; the kubelet has hundreds of permutations. All of these values are set via command-line options, default values, or YAML configuration files.
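
To see how the YAML fields from /var/lib/kubelet/config.yaml line up with the Go types, here is a minimal sketch, assuming you run it on a node provisioned by kubeadm or kind, that decodes the file into the published KubeletConfiguration type (from the k8s.io/kubelet module) and prints a few of the knobs we just discussed.

package main

import (
	"fmt"
	"os"

	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

func main() {
	// The same file the kubelet is started with via --config.
	raw, err := os.ReadFile("/var/lib/kubelet/config.yaml")
	if err != nil {
		panic(err)
	}

	// sigs.k8s.io/yaml converts YAML to JSON and then uses the struct's JSON tags,
	// which is how Kubernetes API types are normally decoded.
	var cfg kubeletv1beta1.KubeletConfiguration
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}

	fmt.Println("staticPodPath:", cfg.StaticPodPath)
	fmt.Println("clusterDNS:", cfg.ClusterDNS)
	fmt.Println("evictionHard:", cfg.EvictionHard)
	fmt.Println("imageMinimumGCAge:", cfg.ImageMinimumGCAge.Duration)
}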

9.3.1 Starting the kubelet binary

When a node starts, several events occur that ultimately lead to its availability as a scheduling target in a Kubernetes cluster. Note that the ordering of events is approximate due to changes in the kubelet codebase and the asynchrony of Kubernetes in general. Figure 9.1 shows the kubelet at startup. Looking at the steps in the figure, we notice that

  • Some simple sanity checks occur to make sure that Pods (containers) are runnable by the kubelet. (NodeAllocatable inputs are checked, which defines how much CPU and memory are allocated.)

  • The containerManager routine begins. This is the kubelet’s main event loop.

  • A cgroup is added; if necessary, it’s created with the setupNode function. The Scheduler and ControllerManager both “notice” that there is a new node in the system. They “watch” it via the API server so that the node can run processes that need homes (it can even run new Pods) and so that the control plane can ensure the node is not skipping its periodic heartbeats. If the kubelet skips heartbeats, the node eventually is removed from the cluster by the ControllerManager.

  • The deviceManager event loop starts. This takes external plugin devices into the kubelet. These devices are then sent as part of the continuous updates (mentioned in the previous step).

  • Logging, CSI, and device-plugin functionality are attached to the kubelet and registered.

Figure 9.1 The kubelet startup cycle

9.3.2 After startup: Node life cycle

In earlier versions of Kubernetes (before 1.17), the Node object was updated in a status loop every 10 seconds via the kubelet making a call to the API server. By design, the kubelet is a bit chatty with the API server because the control plane in a cluster needs to know whether nodes are healthy or not. If you watch a cluster starting, you will notice the kubelet binary attempting to communicate with the control plane, and it will keep retrying until the control plane is available. This control loop tolerates a control plane that isn’t available yet; the nodes are aware of this and simply keep trying. When the kubelet binary starts, it also configures the network layer, having the CNI provider create the proper networking features, such as a bridge, so that CNI networking can function.

9.3.3 Leasing and locking in etcd and the evolution of the node lease

To optimize the performance of large clusters and decrease network chattiness, Kubernetes versions 1.17 and later implement a specific API server endpoint for managing the kubelets via etcd’s leasing mechanism. etcd introduced the concept of a lease so that HA (highly available) components that might need failovers can rely on a central leasing and locking mechanism rather than implementing their own.

Anyone who has taken a computer science course on semaphores can identify with why the creators of Kubernetes did not want to rely on a myriad of home-grown locking implementations for different components. Two independent control loops maintain the kubelet’s state:

  • The NodeStatus object is updated by the kubelet every 5 minutes to tell the API server about its state. For example, if you reboot a node after upgrading its memory, you will see this update in the API server’s view of the NodeStatus object of the kubelet 5 minutes later. If you’re wondering how big this data structure is, run kubectl get nodes -o yaml on a large production cluster. You will likely see tens of thousands of lines of text amounting to at least 10 KB per node.

  • Independently, the kubelet updates a Lease object (which is quite tiny) every 10 seconds. These updates allow the controllers in the Kubernetes control plane to quickly notice when a node appears to have gone offline, without incurring the high cost of sending a large amount of status information (see the sketch after this list).
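
You can watch both control loops from the API side. The Lease objects live in the kube-node-lease namespace; the following sketch (client-go again, assuming a default kubeconfig) prints who holds each lease and when it was last renewed. Run it twice, about 10 seconds apart, and you will see the renew time move.

package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Node leases are tiny objects, one per node, in the kube-node-lease namespace.
	leases, err := clientset.CoordinationV1().Leases("kube-node-lease").
		List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, l := range leases.Items {
		holder := ""
		if l.Spec.HolderIdentity != nil {
			holder = *l.Spec.HolderIdentity
		}
		fmt.Printf("lease %s held by %s, last renewed %v\n",
			l.Name, holder, l.Spec.RenewTime)
	}
}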

9.3.4 The kubelet’s management of the Pod life cycle

After all of the preflight checks are complete, the kubelet starts a big sync loop: the containerManager routine. This routine handles the Pod’s life cycle, which consists of a control loop of actions. Figure 9.2 shows the Pod’s life cycle and the steps to managing a Pod:

  1. Starts the Pod life cycle

  2. Ensures the Pod can run on the node

  3. Sets up storage and networking (CNI)

  4. Starts the containers via CRI

  5. Monitors the Pod

  6. Performs restarts

  7. Stops the Pod

Figure 9.2 A kubelet’s Pod life cycle

Figure 9.3 illustrates the life of a container hosted on a Kubernetes node. As depicted in the figure

  1. A user or the replica set controller decides to create a Pod via the Kubernetes API.

  2. The scheduler finds the right home for the Pod (e.g., a host with the IP Address of 1.2.3.4).

  3. The kubelet on host 1.2.3.4 gets new data from its watch on the API server’s Pods, and it notices that it is not yet running the Pod.

  4. The Pod’s creation process starts.

  5. The pause container provides a sandbox where the requested container or containers will live, holding the Linux namespaces and IP address created for the Pod by the kubelet and the CNI (Container Network Interface) provider.

  6. The kubelet communicates with the container runtime, pulling the layers of a container, and runs the actual image.

  7. The NGINX container starts.

Figure 9.3 Pod creation

If something goes wrong, such as if the container dies or its health check fails, the Pod itself may get moved to a new node. This is known as rescheduling. We mentioned the pause container, which is a container that is used to create the Pod-shared Linux namespaces. We’ll cover the pause container later in this chapter.

9.3.5 CRI, containers, and images: How they are related

Part of the kubelet’s job is image management. You probably are familiar with this process if you have ever run docker rm $(docker ps -aq) or docker image prune on your laptop. Although the kubelet only concerns itself with running containers, to come to life, these containers ultimately rely on base images. These images are pulled from image registries. One such registry is Docker Hub.

A new layer on top of existing images creates a container. Commonly used images share the same layers, and these layers are cached by the container runtime running alongside the kubelet on the node. How long they stay cached is governed by the garbage collection facility in the kubelet itself. This functionality expires and deletes old images from the ever-growing local image cache, which is ultimately the kubelet’s job to maintain. This process optimizes container startup while preventing disks from getting flooded with images that are no longer utilized.
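
You can inspect that cache yourself by asking the CRI image service over gRPC, which is roughly what crictl images does. The following is a sketch, not the kubelet’s code, using the published CRI client stubs (the k8s.io/cri-api module, v1alpha2 as used around Kubernetes 1.20); run it on the node itself (for example, inside the kind container), because the socket is local.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The same socket the kubelet is pointed at with --container-runtime-endpoint.
	// The unix:// target scheme needs a reasonably recent grpc-go.
	conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
		grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	images := runtimeapi.NewImageServiceClient(conn)
	resp, err := images.ListImages(ctx, &runtimeapi.ListImagesRequest{})
	if err != nil {
		panic(err)
	}
	for _, img := range resp.Images {
		// These are the cached images that the kubelet's GC will eventually reap.
		fmt.Printf("%v (%d bytes)\n", img.RepoTags, img.Size_)
	}
}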

9.3.6 The kubelet doesn’t run containers: That’s the CRI’s job

A container runtime provides functionality associated with managing the containers you need to run from the kubelet. Remember, the kubelet itself can’t run containers on its own: it relies on something like containerd or runC under the hood. This reliance is managed via the CRI interface.

The chances are that, regardless of what Kubernetes release you are running, you have runC installed. You can run any image manually with runC. For example, run docker ps to list the containers that are running locally. You can also export a container’s filesystem as a tarball. In our case, we can do the following:

$  docker ps                                     
d32b87038ece kindest/node:v1.15.3
"/usr/local/bin/entr..." kind-control-plane
$ docker export d32b > /tmp/whoneedsdocker.tar   
$ mkdir /tmp/whoneedsdocker
$ cd /tmp/whoneedsdocker
$ tar xf /tmp/whoneedsdocker.tar                 
$ runc spec                                      

Gets the container ID

Exports the container’s filesystem to a tarball

Extracts the tarball

Starts runC

These commands create a config.json file. For example:

{
        "ociVersion": "1.0.1-dev",
        "process": {
                "terminal": true,
                "user": {
                        "uid": 0,
                        "gid": 0
                },
                "args": [
                  "sh"
                ]
          },
          "namespaces": [
              {
                "type": "pid"
              },
              {
                "type": "network"
              },
              {
                "type": "ipc"
              },
              {
                "type": "uts"
              },
              {
                "type": "mount"
              }
          ]
}

Typically, you will want to change the args section from sh, which is the default command created by runc spec, to something meaningful (such as python mycontainerizedapp.py). We omitted most of the boilerplate from the preceding config.json file, but we kept an essential part: the namespaces section.
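
That config.json is just a serialized form of the OCI runtime specification, which also ships Go types (the github.com/opencontainers/runtime-spec module). As a rough sketch of the edit described above, the following loads the spec, swaps in the hypothetical python command, lists the namespaces, and writes the file back. (In the full spec, the namespaces shown earlier live under the linux section.)

package main

import (
	"encoding/json"
	"fmt"
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	raw, err := os.ReadFile("config.json")
	if err != nil {
		panic(err)
	}

	var spec specs.Spec
	if err := json.Unmarshal(raw, &spec); err != nil {
		panic(err)
	}

	// Replace the default "sh" with the process we actually want to run.
	spec.Process.Args = []string{"python", "mycontainerizedapp.py"}

	// The namespaces listed in the spec are what each runC invocation joins.
	for _, ns := range spec.Linux.Namespaces {
		fmt.Println("namespace:", ns.Type)
	}

	out, err := json.MarshalIndent(&spec, "", "  ")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("config.json", out, 0o644); err != nil {
		panic(err)
	}
}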

9.3.7 Pause container: An “aha” moment

Every container in a Pod corresponds to a runC action. We therefore need a pause container, which precedes all of the containers. A pause container

  • Waits until a network namespace is available so all containers in a Pod can share a single IP and talk over 127.0.0.1

  • Pauses until a filesystem is available so all containers in a Pod can share data over emptyDir

Once the Pod is set up, each runC call takes the same namespace parameters. Although the kubelet does not run the containers, there is a lot of logic that goes into creating Pods, which the kubelet needs to manage. The kubelet ensures that Kubernetes has the networking and storage guarantees for containers. This makes it easy to run in distributed scenarios. Other tasks precede running a container, like pulling images, which we will walk through later in this chapter. First, we need to back up and look at the CRI so that we can understand the boundary between the container runtime and the kubelet a little more clearly.
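
The pause container itself is almost comically simple. The real one is a tiny C program, but as a toy Go illustration (not the actual pause binary), a pause-like process just keeps the Pod’s namespaces alive and sleeps until it is told to shut down:

package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Do nothing except keep the Pod's shared namespaces alive.
	// The kubelet (via the runtime) tears this process down when the Pod is deleted.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop
}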

9.4 The Container Runtime Interface (CRI)

The runC program is one small part of the puzzle when it comes to what Kubernetes needs for running containers at scale. The rest of the puzzle is primarily defined by the CRI, which abstracts runC along with other functionality to enable higher-order scheduling, image management, and container runtime functionality.

9.4.1 Telling Kubernetes where our container runtime lives

How do we tell Kubernetes where our CRI service is running? If you look inside a running kind cluster, you will see that the kubelet runs with the following two options:

--container-runtime=remote
--container-runtime-endpoint=/run/containerd/containerd.sock

The kubelet communicates via gRPC, a remote procedure call (RPC) framework, with the container runtime endpoint; containerd itself has a CRI plugin built into it. Remote means that Kubernetes uses the containerd socket as a minimal implementation of an interface to create and manage Pods and their life cycles. The CRI is a minimal interface that any container runtime can implement. It was mainly designed so that the community could quickly innovate on different container runtimes (other than Docker) and plug them into, or unplug them from, Kubernetes.

Note Although Kubernetes is modular in how it runs containers, it is still stateful. You cannot “hot” unplug a container runtime from a running Kubernetes cluster without also draining (and potentially removing) a node from a live cluster. This limitation is due to metadata and cgroups that the kubelet manages and creates.

Because the CRI is a gRPC interface, the container-runtime option in Kubernetes ideally should be defined as remote for newer Kubernetes distributions. The CRI describes all container creation through an interface, and like storage and networking, Kubernetes aims to move container runtime logic out of the Kubernetes core over time.
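
Because the endpoint is just a gRPC socket, you can talk to it with the same kind of generated client stubs the kubelet uses. Here is a minimal sketch (again k8s.io/cri-api, v1alpha2) that dials the containerd socket and asks the runtime to identify itself, roughly what crictl version does; as before, you would run this on the node where the socket lives.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
		grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ask the CRI provider which runtime is behind the socket.
	runtime := runtimeapi.NewRuntimeServiceClient(conn)
	version, err := runtime.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("runtime: %s %s (CRI %s)\n",
		version.RuntimeName, version.RuntimeVersion, version.RuntimeApiVersion)
}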

9.4.2 The CRI routines

The CRI consists of four high-level Go interfaces. Together, these unify all the core functionality Kubernetes needs to run containers. The CRI’s interfaces include

  • PodSandBoxManager—Creates the setup environment for Pods

  • ContainerRuntime—Starts, stops, and executes containers

  • ImageService—Pulls, lists, and removes images

  • ContainerMetricsGetter—Provides quantitative information about running containers

These interfaces provide pausing, pulling, and sandboxing functions. Kubernetes expects this functionality to be implemented by any remote CRI and invokes this functionality using gRPC.

9.4.3 The kubelet’s abstraction around CRI: The GenericRuntimeManager

The CRI’s functionality does not necessarily cover all the bases for a production container orchestration tool, such as garbage collecting old images, managing container logs, and dealing with the life cycle of image pulls and image pull backoffs. The kubelet provides a Runtime interface, implemented by kuberuntime.NewKubeGenericRuntimeManager, as a wrapper for any CRI provider (containerd, CRI-O, Docker, and so on). The runtime manager (inside of http://mng.bz/lxaM) manages all calls to the four core CRI interfaces. As an example, let’s see what happens when we create a new Pod:

imageRef, msg, err := m.imagePuller.EnsureImageExists(
    pod, container, pullSecrets, podSandboxConfig)       

containerID, err := m.runtimeService.CreateContainer(
    podSandboxID, containerConfig, podSandboxConfig)     

err = m.internalLifecycle.PreStartContainer(
    pod, container, containerID)                         

err = m.runtimeService.StartContainer(containerID)       

    events.StartedContainer, fmt.Sprintf(
    "Started container %s", container.Name))

Pulls the image

Creates the container’s cgroups without starting the container

Performs network or device configuration, which is cgroup- or namespace-dependent

Starts the container

You might wonder why we need a prestart hook in this code. A few common examples of where Kubernetes uses prestart hooks include certain networking plugins and GPU drivers, which need to be configured with cgroups-specific information before a GPU process starts.

9.4.4 How is the CRI invoked?

Several lines of code obfuscate the remote calls to the CRI in the previous code snippet, and we’ve removed a lot of the bloat. We will go through the EnsureImageExists function in detail in a few moments, but let’s first look at the way Kubernetes abstracts the low-level CRI functionality into the two main APIs that are internally utilized by the kubelet to work with containers.

9.5 The kubelet’s interfaces

In the source code of the kubelet, various Go interfaces are defined. The next few sections will walk through the interfaces in order to provide an overview of the inner workings of the kubelet.

9.5.1 The Runtime internal interface

The kubelet’s view of the CRI is broken into three parts: Runtime, StreamingRuntime, and CommandRunner. The KubeGenericRuntime interface (located at http://mng.bz/BMxg) is managed inside of Kubernetes, wrapping core functions in the CRI runtime. For example:

type KubeGenericRuntime interface {
 
    kubecontainer.Runtime               
    kubecontainer.StreamingRuntime      
    kubecontainer.CommandRunner         
 
}

Defines the interface that’s specified by a CRI provider

Defines functions to handle streaming calls (like exec/attach/port-forward)

Defines a function that executes the command in the container, returning the output

For vendors, this means that you first implement the Runtime interface and then the StreamingRuntime interface, because the Runtime interface describes most of the core functionality of Kubernetes (see http://mng.bz/1jXj and http://mng.bz/PWdn). The gRPC service clients are where you can get your head around how the kubelet interacts with the CRI. These clients are defined in the kubeGenericRuntimeManager struct. Specifically, the runtimeService internalapi.RuntimeService field interacts with the CRI provider.

Within the RuntimeService, we have the ContainerManager, and this is where the magic happens. This interface is part of the actual CRI definition. The function calls in the next code snippet allow the kubelet to use a CRI provider to start, stop, and remove containers:

// ContainerManager contains methods to manipulate containers managed
// by a container runtime. The methods are thread-safe.
 
type ContainerManager interface {
    // CreateContainer creates a new container in specified PodSandbox.
    CreateContainer(podSandboxID string, config
       *runtimeapi.ContainerConfig, sandboxConfig
        *runtimeapi.PodSandboxConfig) (string, error)
    // StartContainer starts the container.
    StartContainer(containerID string) error
    // StopContainer stops a running container.
    StopContainer(containerID string, timeout int64) error
    // RemoveContainer removes the container.
    RemoveContainer(containerID string) error
    // ListContainers lists all containers by filters.
    ListContainers(filter *runtimeapi.ContainerFilter)
       ([]*runtimeapi.Container, error)
    // ContainerStatus returns the status of the container.
    ContainerStatus(containerID string)
       (*runtimeapi.ContainerStatus, error)
    // UpdateContainerResources updates the cgroup resources
    // for the container.
    UpdateContainerResources(
       containerID string, resources *runtimeapi.LinuxContainerResources)
         error
    // ExecSync executes a command in the container.
    // If the command exits with a nonzero exit code, an error is returned.
    ExecSync(containerID string, cmd []string, timeout time.Duration)
            (stdout []byte, stderr []byte, err error)
    // Exec prepares a streaming endpoint to exe..., returning the address.
    Exec(*runtimeapi.ExecRequest) (*runtimeapi.ExecResponse, error)
    // Attach prepares a streaming endpoint to attach to
    // a running container and returns the address.
    Attach(req *runtimeapi.AttachRequest)
          (*runtimeapi.AttachResponse, error)
    // ReopenContainerLog asks runtime to reopen
    // the stdout/stderr log file for the container.
    // If it returns error, the new container log file MUST NOT
    // be created.
    ReopenContainerLog(ContainerID string) error
}
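
To see what a few of those methods look like on the wire, here is a hedged sketch that drives the CRI gRPC API directly: it creates a Pod sandbox, pulls an image, and then creates and starts a single container inside the sandbox. The names, image, and log paths are made up for illustration, and most error handling is pared down; the kubelet drives this same sequence of calls through the kubeGenericRuntimeManager.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Same socket as before; run this on the node.
	conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
		grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	runtime := runtimeapi.NewRuntimeServiceClient(conn)
	images := runtimeapi.NewImageServiceClient(conn)

	// 1. Create the sandbox (this is where the pause container comes from).
	sandboxConfig := &runtimeapi.PodSandboxConfig{
		Metadata: &runtimeapi.PodSandboxMetadata{
			Name: "demo", Uid: "demo-uid-1", Namespace: "default",
		},
		LogDirectory: "/tmp/demo-logs",
	}
	sandbox, err := runtime.RunPodSandbox(ctx,
		&runtimeapi.RunPodSandboxRequest{Config: sandboxConfig})
	if err != nil {
		panic(err)
	}

	// 2. Make sure the image is present (what EnsureImageExists does for the kubelet).
	image := &runtimeapi.ImageSpec{Image: "docker.io/library/nginx:latest"}
	if _, err := images.PullImage(ctx,
		&runtimeapi.PullImageRequest{Image: image}); err != nil {
		panic(err)
	}

	// 3. Create and start the container inside the sandbox.
	created, err := runtime.CreateContainer(ctx, &runtimeapi.CreateContainerRequest{
		PodSandboxId: sandbox.PodSandboxId,
		Config: &runtimeapi.ContainerConfig{
			Metadata: &runtimeapi.ContainerMetadata{Name: "nginx"},
			Image:    image,
			LogPath:  "nginx.log",
		},
		SandboxConfig: sandboxConfig,
	})
	if err != nil {
		panic(err)
	}
	if _, err := runtime.StartContainer(ctx,
		&runtimeapi.StartContainerRequest{ContainerId: created.ContainerId}); err != nil {
		panic(err)
	}
	fmt.Println("started container", created.ContainerId, "in sandbox", sandbox.PodSandboxId)
}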

9.5.2 How the kubelet pulls images: The ImageService interface

Lurking in the routines for a container runtime is the ImageService interface, which defines a few core methods: PullImage, GetImageRef, ListImages, and RemoveImage. The concept of pulling an image, which derives from Docker semantics, is part of the CRI specification. You can see its definition in the same file (runtime.go) as the other interfaces. Thus, every container runtime implements these functions:

// ImageService interfaces allows to work with image service.
type ImageService interface {
    PullImage(image ImageSpec, pullSecrets []v1.Secret,
                  podSandboxConfig *runtimeapi.PodSandboxConfig)
                  (string, error)
    GetImageRef(image ImageSpec) (string, error)
    // Gets all images currently on the machine.
    ListImages() ([]Image, error)
    // Removes the specified image.
    RemoveImage(image ImageSpec) error
    // Returns the image statistics.
    ImageStats() (*ImageStats, error)
}

A container runtime could call docker pull to pull an image. Similarly, this runtime could make a call to execute docker run to create a container. The container runtime, as you’ll recall, can be set on the kubelet when it starts, using the container-runtime-endpoint flag like so:

--container-runtime-endpoint=unix:///var/run/crio/crio.sock
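
On the wire, “pulling an image” is simply the PullImage RPC on the CRI image service. The following sketch (k8s.io/cri-api, v1alpha2, run on the node) pulls from a private registry by passing an AuthConfig inline; this is the kind of credential material the kubelet derives from an imagePullSecret, which is what the next section covers. The registry name and credentials are placeholders.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
		grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	images := runtimeapi.NewImageServiceClient(conn)
	resp, err := images.PullImage(ctx, &runtimeapi.PullImageRequest{
		Image: &runtimeapi.ImageSpec{Image: "my.secure.registry/container1:1.0"},
		Auth: &runtimeapi.AuthConfig{ // placeholder credentials
			Username: "my-name",
			Password: "1234",
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("pulled image ref:", resp.ImageRef)
}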

9.5.3 Giving ImagePullSecrets to the kubelet

Let’s make the connection between kubectl, the kubelet, and the CRI interface concrete. To do that, we will look at how you can provide information to the kubelet so that it can download images securely from a private registry. The following is a block of YAML that defines a Pod and a Secret. The Pod references a secure registry that requires login credentials, and the Secret stores the login credentials:

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: my.secure.registry/container1:1.0
  imagePullSecrets:
  - name: my-secret
---
apiVersion: v1
data:
  .dockerconfigjson: sojosaidjfwoeij2f0ei8f...
kind: Secret
metadata:
  creationTimestamp: null
  name: my-secret
  selfLink: /api/v1/namespaces/default/secrets/my-secret
type: kubernetes.io/dockerconfigjson

In the snippet, you need to generate the .dockerconfigjson value yourself. You can also generate the Secret interactively using kubectl itself like so:

$ kubectl create secret docker-registry my-secret
  --docker-server my.secure.registry
  --docker-username my-name --docker-password 1234
  --docker-email <your-email>

Or you can do this with the equivalent command, if you already have an existing Docker configuration JSON file:

$ kubectl create secret generic regcred
   --from-file=.dockerconfigjson=<path/to/.docker/config.json>
   --type=kubernetes.io/dockerconfigjson

This command takes an existing Docker configuration, puts it into the Secret’s .dockerconfigjson key, and the kubelet then uses that JSON payload when pulling images through the ImageService. More importantly, this flow ultimately calls the EnsureImageExists function. You can then run kubectl get secret -o yaml to view the Secret and copy the entire Secret value. Then use Base64 to decode it to see your Docker login token, which the kubelet uses.
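
Under the hood, kubectl create secret docker-registry just builds the Docker config JSON for you and base64-encodes it into the Secret’s .dockerconfigjson key. The following sketch reproduces that payload format (an auths map keyed by registry, with a base64-encoded user:password auth field); the credentials are the same placeholders used above.

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// dockerConfigJSON mirrors the layout of ~/.docker/config.json,
// which is what the .dockerconfigjson Secret key stores (base64-encoded).
type dockerConfigJSON struct {
	Auths map[string]dockerAuthEntry `json:"auths"`
}

type dockerAuthEntry struct {
	Username string `json:"username"`
	Password string `json:"password"`
	Email    string `json:"email,omitempty"`
	Auth     string `json:"auth"` // base64("username:password")
}

func main() {
	user, pass := "my-name", "1234" // placeholder credentials
	cfg := dockerConfigJSON{
		Auths: map[string]dockerAuthEntry{
			"my.secure.registry": {
				Username: user,
				Password: pass,
				Auth:     base64.StdEncoding.EncodeToString([]byte(user + ":" + pass)),
			},
		},
	}

	payload, err := json.Marshal(cfg)
	if err != nil {
		panic(err)
	}

	// This is the value that ends up under data[".dockerconfigjson"] in the Secret.
	fmt.Println(base64.StdEncoding.EncodeToString(payload))
}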

Now that you know how the Secret is used by the Docker daemon when pulling images, we will get back to looking at the plumbing in Kubernetes, which allows this functionality to work entirely via Secrets managed by Kubernetes. The key to all this is the ImageManager interface, which implements this functionality via an EnsureImageExists method. This method calls the PullImage function under the hood, if necessary, depending on the ImagePullPolicy defined on your Pod. The next code snippet sends the required pull Secrets:

type ImageManager interface {
    EnsureImageExists(pod *v1.Pod, container *v1.Container,
      pullSecrets []v1.Secret,
      podSandboxConfig *runtimeapi.PodSandboxConfig)
     (string, string, error)
}

The EnsureImageExists function receives the pull Secrets that you created in the YAML document earlier in this chapter. A pull is then executed securely by deserializing the dockerconfigjson value. Once the container runtime pulls the image, Kubernetes can move forward, starting the Pod.

9.6 Further reading

M. Crosby. “What is containerd?” Docker blog. http://mng.bz/Nxq2 (accessed 12/27/21).

J. Jackson. “GitOps: ‘Git Push’ All the Things.” http://mng.bz/6Z5G (accessed 12/27/21).

“How does copy-on-write in fork() handle multiple fork?” Stack Exchange documentation. http://mng.bz/Exql (accessed 12/27/21).

“Deep dive into Docker storage drivers.” YouTube video. https://www.youtube.com/watch?v=9oh_M11-foU (accessed 12/27/21).

Summary

  • The kubelet runs on every node and controls the life cycle of Pods on that node.

  • The kubelet interacts with the container runtime to start, stop, create, and delete containers.

  • Various kubelet behaviors (such as how long to wait before evicting Pods) can be configured via command-line flags or the kubelet’s configuration file.

  • When the kubelet starts, it runs various sanity checks on the node, creates cgroups, and starts various plugins, such as CSI.

  • The kubelet controls the life cycle of a Pod: starting the Pod, ensuring that it’s running, creating storage and networking, monitoring, performing restarts, and stopping Pods.

  • CRI defines the way that the kubelet interacts with the container runtime that is installed.

  • The kubelet is built from various Go interfaces. These include interfaces for CRI, image pulling, and the kubelet itself.
