Linux has evolved sandboxing and isolation techniques that protect it against current and future vulnerabilities. These sandboxes are sometimes called micro VMs.
These sandboxes combine parts of all previous container and VM approaches. You would use them to protect sensitive workloads and data, as they focus on rapid deployment and high performance on shared infrastructure.
In this chapter we’ll discuss different types of micro VMs that use virtual machines and containers together, to protect your running Linux kernel and userspace. The generic term sandboxing is used to cover the entire spectrum: all the technologies in this chapter combine software and hardware virtualisation and use Linux’s KVM (Kernel Virtual Machine), which is widely used to power VMs in public cloud services, including AWS and GCP.
You run a lot of workloads at BCTL, and you should remember that while these techniques may also protect against Kubernetes mistakes, all of your web-facing software and infrastructure is a more obvious place to defend first. Zero days and container breakouts are rare in comparison to misconfigurations.
Hardened runtimes are newer, and have fewer (generally less dangerous) CVEs than the kernel or more established container runtimes, so we’ll focus less on historical breakouts and more on the history of micro VM design and rationale.
You have two main reasons for isolating a workload or pod: it may have access to sensitive information and data, or it may be untrusted and potentially hostile to other users of the system:
A “sensitive” workload is one whose data or code is too important to permit unauthorised access to. This may include fraud detection systems, pricing engines, high-frequency trading algorithms, personally identifiable information (PII), financial records, passwords that may be reused in other systems, machine learning models, or an organisation’s “secret sauce”. Sensitive workloads are precious.
“Untrusted” workloads are those that may be dangerous to run. They may allow high-risk user input or run external software.
Examples of potentially untrusted workloads include:
VM workloads on a cloud provider’s hypervisor
CI/CD infrastructure subject to build-time supply chain attacks
Transcoding of complex files with potential parser errors
Untrusted workloads may also include software with published CVEs (Common Vulnerabilities and Exposures) or suspected zero-day vulnerabilities. If no patch is available and the workload is business-critical, isolating it further may decrease the potential impact of the vulnerability if it is exploited.
The threat to a host running untrusted workloads is the workload, or process, itself. By sandboxing a process and removing the system APIs available to it, the attack surface presented by the host to the process is decreased. Even if that process is compromised, the risk to the host is less.
BCTL allow users to upload files to import data and shipping manifests, so you have a risk that threat actors will try to upload badly formatted or malicious files to try to force exploitable software errors. The pods that run the batch transformation and processing workloads are a good candidate for sandboxing, as they are processing untrusted inputs as shown in Figure 3-1.
Any data supplied to an application by users can be considered untrusted. However, most input will be sanitised in some way (for example, by validating it against an integer or string type). Complex files like PDFs or videos cannot be sanitised in this way and rely upon the encoding libraries to be secure, which they sometimes are not.
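As a minimal sketch of the simple type validation described above, scalar inputs can be checked strictly before they are used. The field names (`quantity`, `port_code`) and their limits are hypothetical and are not taken from BCTL's systems. No equivalent one-line check exists for a PDF or a video, which is why those files end up relying on the parser.

```python
import re

# Hypothetical manifest fields, used only for illustration.
PORT_CODE = re.compile(r"[A-Z]{5}")
MAX_QUANTITY_DIGITS = 7


def parse_quantity(raw: str) -> int:
    """Accept only a plain ASCII decimal integer between 1 and 1,000,000."""
    if len(raw) > MAX_QUANTITY_DIGITS or not raw.isascii() or not raw.isdigit():
        raise ValueError("quantity must be a short decimal integer")
    value = int(raw)
    if not 1 <= value <= 1_000_000:
        raise ValueError("quantity out of range")
    return value


def parse_port_code(raw: str) -> str:
    """Accept only a five-letter upper-case code; reject everything else."""
    if PORT_CODE.fullmatch(raw) is None:
        raise ValueError("port code must be five upper-case letters")
    return raw
```

Anything that fails validation is rejected outright rather than repaired, which keeps the accepted input space small and easy to reason about.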
Your threat model may include:
An untrusted user input triggers a bug in a workload that an attacker uses to execute malicious code
A sensitive application is compromised and the attacker tries to exfiltrate data
A malicious user on a compromised node attempts to read memory of other processes on the host
New sandboxing code is less well tested, and may contain exploitable bugs
A container image build pulls malicious dependencies and code from unauthenticated external sources that may contain malware
Existing container runtimes come with some hardening by default, and Docker uses default seccomp and AppArmor profiles that drop a large number of unused system calls. These are not enabled by default in Kubernetes and must be enforced with admission control or PodSecurityPolicy.
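As a hedged illustration of opting a single pod in to the runtime's default seccomp profile, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON. It assumes a Kubernetes version that supports the `seccompProfile` field in `securityContext`; the pod name and image are hypothetical. It only configures one pod, so enforcing the setting across a cluster still needs admission control as described above.

```python
import json


def batch_pod_manifest(name: str, image: str) -> dict:
    """Build a Pod manifest that requests the runtime's default seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {
                # Pod-level setting: applies to every container in the pod.
                "seccompProfile": {"type": "RuntimeDefault"},
            },
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    manifest = batch_pod_manifest(
        "batch-transformer", "registry.example.com/batch-transformer:1.0"
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied in the same way as a YAML manifest.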
Now that we have an idea of the dangers to your systems, let’s take a step back. We’ll look at virtualisation: what it is, why we use containers, and how to combine the best bits of containers and VMs.
A major difference between a container and a VM is that containers exist on a shared host kernel. VMs boot a kernel every time they start, use hardware-assisted virtualisation, and have a more secure but traditionally slower runtime.
A common perception is that containers are optimised for speed and portability, and virtual machines sacrifice these features for more robust isolation from malicious behaviour and higher fault tolerance.
This perception is not entirely true. Both technologies share a lot of common code pathways in the kernel itself. Containers and virtual machines have evolved like co-orbiting stars — never fully able to escape each other’s gravity. Container runtimes are a form of kernel virtualisation. The OCI (Open Container Initiative) container image specifications have become the standardised atomic unit of container deployment.
Next-generation sandboxes combine container and virtualisation techniques (see Figure 3-2) to reduce workloads’ access to the kernel. They do this by emulating kernel functionality in user space or the isolated guest environment, thus reducing the attack surface that the host exposes to the process inside the sandbox. Well-defined interfaces can help to reduce complexity, minimising the opportunity for untested code paths. And, by integrating the sandboxes with containerd, they are also able to interact with OCI images and with a software proxy (“shim”) to connect two different interfaces, which can be used with orchestrators like Kubernetes.
These sandboxing techniques are especially relevant to public cloud providers, for which multi-tenancy and bin packing are highly lucrative. Aggressively multi-tenanted systems such as Google Cloud Functions and AWS Lambda are running “untrusted code as a service”, and this isolation software is born from cloud vendor security requirements to isolate serverless runtimes from other tenants. Multitenancy will be discussed in depth in the next chapter.
Cloud providers are using virtual machines as the atomic unit of compute, but of course they may also wrap the root virtual machine process in a container. Customers then use the virtual machine to run containers - inception.
Traditional virtualisation emulates a physical hardware architecture in software. Micro VMs emulate as small an API as possible, removing features like I/O devices and even system calls to ensure least privilege. Ultimately however, they are still running the same Linux kernel code to perform low-level program operations such as memory mapping and opening sockets - just with additional security abstractions to create a secure by default runtime. So even though VMs are not sharing as much of the kernel as containers do, some system calls must still be executed by the host kernel.
Software abstractions require CPU time to execute, and so virtualisation must always be a balance of security and performance. It is possible to add enough layers of abstraction and indirection that a process is considered “highly secure”, but it is unlikely that this ultimate security will result in a suitable user experience. Unikernels go in the other direction, tracing a program’s execution and then removing almost all kernel functionality except what the program has used. Observability and debuggability are perhaps the reasons that unikernels have not seen widespread adoption.
To understand the trade-offs and compromises inherent in each approach, it helps to compare the different types of virtualisation. Virtualisation has existed for a long time and has many variations.
Although virtual machines and associated technologies have existed since the late 1950s, a lack of hardware support in the 1990s led to their temporary demise. During this time “process virtual machines” became more popular, especially the Java Virtual Machine (JVM). In this chapter we are exclusively referring to system virtual machines: a form of virtualisation not tied to a specific programming language. Examples include KVM/QEMU, VMware, Xen, VirtualBox, etc.
Virtual machine research began in the 1960s to facilitate sharing large, expensive physical machines between multiple users and processes (see Figure 3-3). To share a physical host safely, some level of isolation must be enforced between tenants - and in case of hostile tenants, there should be much less access to the underlying system.
This is performed in hardware (the CPU), software (in the kernel, and userspace), or from cooperation between both layers, and allows many users to share the same large physical hardware. This innovation became the driving technology behind public cloud adoption: safe sharing and isolation for processes, memory, and the resources they require from the physical host machine.
The host machine is split into smaller isolated compute units, traditionally referred to as guests - see Figure 3-4. These guests interact with a virtualized layer above the physical host’s CPU and devices. That layer intercepts system calls and either proxies them to the host kernel or handles the request itself, doing the kernel’s job where possible. Full virtualisation (e.g. VMware) emulates hardware and boots a full kernel inside the guest. Operating-system-level virtualisation (e.g. a container) emulates the host’s kernel (i.e. using namespaces, cgroups, capabilities, and seccomp) so it can start a containerised process directly on the host kernel. Processes in containers share many of the kernel pathways and security mechanisms that processes in VMs execute.
To boot a kernel, a guest operating system will require access to a subset of the host machine’s functionality, including: BIOS routines, devices and peripherals (e.g. keyboard, graphical/console access, storage and networking), an interrupt controller and an interval timer, a source of entropy (for random number seeds), and the memory address space that it will run in.
Inside each guest virtual machine is an environment in which processes (or workloads) can run. The virtual machine itself is owned by a privileged parent process that manages its setup and interaction with the host, known as a virtual machine monitor or VMM (as in Figure 3-5). This has also been known as a hypervisor, but the distinction is blurred with more recent approaches so the original term VMM is preferred.
Linux has a built-in virtual machine manager called KVM that allows a host kernel to run virtual machines. Along with QEMU, which emulates physical devices and provides memory management to the guest (and can run by itself if necessary), an operating system can run fully emulated by the guest OS and by QEMU (as contrasted with the Xen hypervisor in Figure 3-6). This emulation narrows the interface between the VM and the host kernel and reduces the amount of kernel code the process inside the VM can reach directly. This provides a greater level of isolation from unknown kernel vulnerabilities.
Despite many decades of effort, “in practice no virtual machine is completely equivalent to its real machine counterpart” (source). This is due to the complexities of emulating hardware, and hopefully decreases the chance that we’re living in a simulation.
Like all things we try to secure, virtualisation must balance performance with security: decreasing the risk of running your workloads using the minimum possible number of extra checks at runtime. For containers, a shared host kernel is an avenue of potential container escape - Linux has a long heritage and monolithic codebase.
Linux is mainly written in the C language, which has classes of memory management and range checking vulnerabilities that have proven notoriously difficult to entirely eradicate. Many applications have experienced these exploitable bugs when subjected to fuzzers. This risk means we want to keep hostile code away from trusted interfaces in case they have zero day vulnerabilities. This is a pretty serious defensive stance - it’s about reducing any window of opportunity for an attacker that has access to zero day Linux vulnerabilities.
Google’s OSS-Fuzz was born from the swirling maelstrom around the Heartbleed OpenSSL bug, which may have been exploited in the wild for up to two years. Critical, internet-bolstering projects like OpenSSL are poorly funded and depend heavily on goodwill in the open source community, so finding these bugs before they are exploited is a vital step in securing critical software.
The sandboxing model defends against zero days through abstraction. It moves processes away from the Linux system call interface to reduce the chance of bad actors exploiting it, using an assortment of containers and capabilities, LSMs and kernel modules, hardware and software virtualisation, and dedicated drivers. Most recent sandboxes use a memory-safe language like Go or Rust, which makes their memory management safer than software programmed in C (which requires manual and potentially error-prone memory management).
Let’s further define what we mean by containers by looking at how they interact with the host kernel, as shown in Figure 3-7:
Containers talk directly to the host kernel, but the layers of LSMs, capabilities, and namespaces ensure they do not have full host kernel access. Conversely, instead of sharing one kernel, VMs use a guest kernel (a dedicated kernel running in a hypervisor). This means if the VM’s guest kernel is compromised, more work is required to break out of the hypervisor and into the host.
Containers are created by a low-level container runtime, and as users we talk to the high-level container runtime that controls it.
The diagram in Figure 3-8 shows the high-level interfaces, with the container managers on the left. Then Kubernetes, Docker, and Podman interact with their respective libraries and runtimes. These perform useful container management features including pushing and pulling container images, managing storage and network interfaces, and interacting with the low-level container runtime.
In the middle column of Figure 3-8 are the container runtimes that your Kubernetes cluster interacts with, while in the right column are the low-level runtimes responsible for starting and managing the container.
That low-level container runtime is directly responsible for starting and managing containers, interfacing with the kernel to create the namespaces and configuration, and finally starting the process in the container. It is also responsible for handling your process inside the container, and getting its system calls to the host kernel at runtime.
Linux was written with a core assumption: that the root user is always in the host namespace. While there were no other namespaces this assumption held true. But with the introduction of user namespaces (the last major kernel namespace to be completed) this changed: developing user namespaces required many changes to code concerning the root user.
User namespaces allow you to map users inside a container to other users on the host, so id 0 (root) inside the container can create files on a volume that from within the container look to be root-owned. But when you inspect the same volume from the host, they show up as owned by the user root was mapped to (e.g. user id 1000, or 110000, as shown in Figure 3-9). User namespaces are not enabled in Kubernetes, although work is underway to support them.
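The mapping described above can be sketched as a small translation function. This is an illustrative model only: it parses the three-column text format that the kernel exposes in /proc/<pid>/uid_map (ID inside the namespace, ID outside it, range length), and the example mapping, class, and function names are assumptions chosen to match the 110000 example in the text.

```python
from typing import List, NamedTuple, Optional


class IdMapEntry(NamedTuple):
    inside_start: int   # first ID as seen inside the user namespace
    outside_start: int  # first ID as seen from the parent (host) namespace
    length: int         # number of consecutive IDs covered by this entry


def parse_id_map(text: str) -> List[IdMapEntry]:
    """Parse the whitespace-separated three-column uid_map/gid_map format."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append(IdMapEntry(inside, outside, length))
    return entries


def to_host_id(entries: List[IdMapEntry], inside_id: int) -> Optional[int]:
    """Translate an ID inside the namespace to the ID the host sees."""
    for entry in entries:
        offset = inside_id - entry.inside_start
        if 0 <= offset < entry.length:
            return entry.outside_start + offset
    return None  # no mapping exists for this ID


# Hypothetical mapping: container IDs 0-65535 map to host IDs 110000-175535.
EXAMPLE_MAP = parse_id_map("0 110000 65536\n")
```

With this example mapping, root inside the container (ID 0) owns files that the host sees as belonging to user 110000, which is an unprivileged account there.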
Everything in Linux is a file, and files are owned by users. This makes user namespaces wide-reaching and complex, and they have been a source of privilege escalation bugs in previous versions of Linux:
CVE-2013-1858: user namespace & CLONE_FS. The clone system-call implementation in the Linux kernel before 3.8.3 does not properly handle a combination of the CLONE_NEWUSER and CLONE_FS flags, which allows local users to gain privileges by calling chroot and leveraging the sharing of the / directory between a parent process and a child process.
CVE-2014-4014: user namespace & chmod. The capabilities implementation in the Linux kernel before 3.14.8 does not properly consider that namespaces are inapplicable to inodes, which allows local users to bypass intended chmod restrictions by first creating a user namespace, as demonstrated by setting the setgid bit on a file with group ownership of root.
CVE-2015-1328: user namespace & OverlayFS (Ubuntu-only). The overlayfs implementation in the Linux kernel package before 3.19.0-21.21 in Ubuntu versions until 15.04 did not properly check permissions for file creation in the upper filesystem directory, which allowed local users to obtain root access by leveraging a configuration in which overlayfs is permitted in an arbitrary mount namespace.
CVE-2018-18955: user namespace & complex ID mapping. In the Linux kernel 4.15.x through 4.19.x before 4.19.2, map_write() in kernel/user_namespace.c allows privilege escalation because it mishandles nested user namespaces with more than 5 UID or GID ranges. A user who has CAP_SYS_ADMIN in an affected user namespace can bypass access controls on resources outside the namespace, as demonstrated by reading /etc/shadow. This occurs because an ID transformation takes place properly for the namespaced-to-kernel direction but not for the kernel-to-namespaced direction.
Containers are not inherently “insecure”, but as we saw in Chapter 2, they can leak some information about a host, and a root-owned container runtime is a potential exploitation path for a hostile process or container image.
Operations such as creating network adapters in the host network namespace, and mounting host disks, are historically root-only, which has made rootless containers harder to implement. Rootful container runtimes were the only viable option for the first decade of popularised container use.
Exploits that have abused this rootfulness include CVE-2019-5736, which replaces the runc binary from inside a container via /proc/self/exe, and CVE-2019-14271, which attacks the host from inside a container responding to docker cp.
Underlying concerns about a root-owned daemon can be assuaged by running rootless containers in “unprivileged user namespaces” mode: creating containers using a non-root user, within their own user namespace. This is supported in Docker 20.10 and Podman.
“Rootless” means the low-level container runtime process that creates the container is owned by an unprivileged user, and so container breakout via the process tree only escapes to a non-root user, nullifying some potential attacks.
Rootless containers introduce a hopefully less dangerous risk: user namespaces have historically been a rich source of vulnerabilities. The answer to whether it is riskier to run a root-owned daemon or user namespaces isn’t clear-cut, although any reduction of root privileges is likely to be the more effective security boundary. There have been more high-profile breakouts from root-owned Docker, but this may well be down to adoption and widespread use.
Rootless containers (without a root-owned daemon) provide a security boundary when compared to those with root-owned daemons. When code owned by the host’s root user is compromised by a malicious process, it can potentially read and write other users’ files, attack the network and its traffic, or install malware to the host.
The mapping of user identifiers (UIDs) in the guest to actual users on the host depends on the user mappings of the host user namespace, container user namespace, and rootless runtime, as shown in Figure 3-10.
User namespaces allow non-root users to pretend to be the host’s root user. The “root-in-userns” user can have a “fake” UID 0 and permission to create new namespaces (mount, net, uts, ipc), change the container’s hostname, and mount points.
This allows root-in-userns, which is unprivileged in the host namespace, to create new containers. To achieve this, additional work must be done: network connections into the host network namespace can only be created by the host’s root. For rootless containers, an unprivileged slirp4netns networking device (guarded by seccomp) is used to create a virtual network device.
Unfortunately, mounting remote file systems becomes difficult when the remote system, e.g. NFS home directories, does not understand the host’s user namespaces.
If you have a normal process creating files on an NFS share and not taking advantage of user-namespaced capabilities, everything works fine. The problem comes in when the root process inside the container needs to do something on the NFS share that requires special capability access. In that case, the remote kernel will not know about the capability and will most likely deny access.
While rootless Podman has SELinux support (and dynamic profile support via udica), rootless Docker does not support AppArmor yet, and for both runtimes CRIU (Checkpoint/Restore In Userspace, a feature to freeze running applications) is disabled.
Both rootless runtimes require configuration for some networking features: CAP_NET_BIND_SERVICE is required by the kernel to bind to ports below 1024 (historically considered a privileged boundary), and ping is not supported for users with high UIDs if the ID is not in /proc/sys/net/ipv4/ping_group_range (although this can be changed by host root). Host networking is not permitted (as it breaks the network isolation), cgroups v2 are functional but only when running under systemd, and cgroup v1 is not supported by either rootless implementation. There are more details in the docs for shortcomings of rootless Podman.
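Two of the networking constraints above can be expressed as simple checks. This is a sketch under stated assumptions: the function names are hypothetical, the 1024 threshold is the default boundary mentioned in the text, and the range string is expected in the two-integer form held in /proc/sys/net/ipv4/ping_group_range, interpreted here as an inclusive range of IDs.

```python
PRIVILEGED_PORT_LIMIT = 1024  # ports below this are treated as privileged


def needs_bind_capability(port: int) -> bool:
    """True if binding this port would need CAP_NET_BIND_SERVICE by default."""
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return port < PRIVILEGED_PORT_LIMIT


def id_can_ping(ping_group_range: str, ident: int) -> bool:
    """True if the ID falls inside the inclusive 'low high' range.

    A range whose low bound is above its high bound (for example "1 0")
    matches nothing, so no unprivileged ping sockets are allowed.
    """
    low, high = (int(field) for field in ping_group_range.split())
    return low <= ident <= high


if __name__ == "__main__":
    # Reads the live setting on a Linux host; the path is taken from the text.
    with open("/proc/sys/net/ipv4/ping_group_range") as handle:
        print(id_can_ping(handle.read(), 1000))
```

A rootless container that publishes port 8080 instead of 80 sidesteps the first restriction without any extra capability.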
Docker and Podman share similar performance and features as both use runc, although Docker has an established networking model that doesn’t support host networking in rootless mode, whereas Podman reuses Kubernetes’ CNI (container network interface) plugins for greater networking deployment flexibility.
Rootless containers decrease the risk of running your container images. Rootlessness prevents an exploit escalating to root via many host interactions (although some use of SETUID and SETGID binaries is often needed by software aiming to avoid running processes as root).
While rootless containers protect the host from the container, it may still be possible to read some data from the host, although an adversary will find this a lot less useful. Root capabilities are needed to interact with potential privilege escalation points including /proc, host devices, and the kernel interface, among others.
Throughout these layers of abstraction, system calls are still ultimately handled by software written in potentially unsafe C. Is the rootless runtime’s exposure to C-based system calls in the Linux kernel really that bad? Well, the C language powers the internet (and world?) and has done so for decades, but its lack of memory safety leads to the same critical bugs occurring over and over again. When the kernel, OpenSSL, and other critical software are written in C, we just want to move everything as far away from trusted kernel space as possible.
Whitesource suggests that C has accounted for 47% of all reported vulnerabilities in the last 10 years. This may largely be due to its proliferation and longevity, but highlights the inherent risk.
While “trimmed-down” kernels exist (like unikernels and rump kernels), many traditional and legacy applications are portable onto a container runtime without code modifications. To achieve this feat for a unikernel would require the application to be ported to the new reduced kernel. Containerising an application is a generally frictionless developer experience, which has contributed to the success of containers.
If a process can exploit the kernel, it can take over the system the kernel’s managing. This is a risk that bad actors like Captain Hashjack will attempt to exploit, and so cloud providers and hardware vendors have been pioneering different approaches to moving away from Linux system call interaction for the guest.
Linux containers are a lightweight form of isolation as they allow workloads to use kernel APIs directly, minimising the layers of abstraction. Sandboxes take a variety of other approaches, and generally use container techniques as well.
Linux’s Kernel Virtual Machine (KVM) is a module that allows the kernel to run as a hypervisor. It uses the processor’s hardware virtualisation commands and allows each “guest” to run a full Linux or Windows operating system in the virtual machine with private, virtualized hardware. A virtual machine differs from a container as the guest’s processes are running on their own kernel: container processes always share the host kernel.
Sandboxes take the best of virtualisation and container isolation and combine the two approaches to optimise for specific use cases.
gVisor and Firecracker (written in Go and Rust respectively) both operate on the premise that their statically typed system call proxying (between the workload/guest process and the host kernel) is more secure for consumption by untrusted workloads than the Linux kernel itself, and that performance is not significantly impacted.
gVisor starts a KVM virtual machine or operates in ptrace mode (using a debug ptrace system call to monitor and control its guest), and inside starts a user-space kernel, which proxies system calls down to the host using a “sentry” process. This trusted process reimplements 237 Linux system calls and only needs 53 host system calls to operate. It is constrained to that list of system calls by seccomp. It also starts a companion “file system interaction” side process called Gofer to prevent a compromised sentry process interacting with the host’s file system, and finally implements its own userspace networking stack to isolate it from bugs in the Linux TCP/IP stack.
Firecracker on the other hand, while also using KVM, starts a stripped down device emulator instead of implementing the heavyweight QEMU process to emulate devices (as traditional Linux virtual machines do). This reduces the host’s attack surface and removes unnecessary code, requiring 36 system calls itself to function.
And finally, at the other end of the spectrum shown in Figure 3-11, KVM/QEMU VMs emulate hardware and so provide a guest kernel and full device emulation, which increases startup times and memory footprint.
Virtualisation provides better hardware isolation through CPU integration, but is slower to start and run due to the abstraction layer between the guest and the underlying host.
Containers are lightweight and suitably secure for most workloads. They run in production for multinational organisations around the world.
But high sensitivity workloads and data need greater isolation. You can categorise workloads by risk:
Does this application access a sensitive or high-value asset?
Is this application able to receive untrusted traffic or input?
Have there been vulnerabilities or bugs in this application before?
If the answer to any of those is yes, you may want to consider a next-generation sandboxing technology to further isolate workloads.
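The three questions above amount to a simple "any yes" rule, sketched below. The `Workload` fields and the example workloads are hypothetical names chosen for illustration; a real assessment would weigh more context than three booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    accesses_sensitive_assets: bool   # sensitive or high-value asset?
    receives_untrusted_input: bool    # untrusted traffic or input?
    has_vulnerability_history: bool   # previous vulnerabilities or bugs?


def should_consider_sandbox(workload: Workload) -> bool:
    """A single 'yes' is enough to warrant a closer look at sandboxing."""
    return any(
        (
            workload.accesses_sensitive_assets,
            workload.receives_untrusted_input,
            workload.has_vulnerability_history,
        )
    )


# Hypothetical examples: a manifest importer that parses uploaded files,
# and an internal static documentation site.
importer = Workload("manifest-importer", False, True, False)
docs_site = Workload("internal-docs", False, False, False)
```

Here `should_consider_sandbox(importer)` is true because the importer handles uploaded files, while the documentation site answers no to all three questions.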
gVisor, Firecracker, and Kata Containers all take different approaches to virtual machine isolation, while sharing the aim of challenging the perception of slow startup time and high memory overhead.
Kata Containers is a container runtime that starts a VM and runs a container inside. It is widely compatible and can run Firecracker as a guest.
Figure 3-12 compares these sandboxes and some key features:
Each sandbox combines virtual machine and container technologies: some VMM process, a Linux kernel within the virtual machine, a Linux userspace in which to run the process once the kernel has booted, and some mix of kernel-based isolation (that is, container-style namespaces, cgroups, or seccomp) either within the VM, around the VMM, or some combination thereof.
Let’s have a closer look at each one.
Google’s gVisor was originally built to allow untrusted, customer-supplied workloads to run in App Engine on Borg, Google’s internal orchestrator and the progenitor of Kubernetes. It now protects Google Cloud products (App Engine standard environment, Cloud Functions, Cloud ML Engine, and Cloud Run), and it has been modified for GKE. It has the best Docker and Kubernetes integrations of this chapter’s sandboxing technologies.
To run the examples, the gVisor runtime binary must be installed on the host or worker node.
Docker supports pluggable container runtimes, and a simple docker run -it --runtime=runsc starts a gVisor sandboxed OCI container. Let’s have a look at what’s in /proc in a vanilla gVisor container to compare it with standard runc:
$ docker run -it --runtime=runsc [...] ls -lasp /proc/1
total 0
[...] 23 16:22 cwd -> /root
[...] 23 16:22 exe -> /usr/bin/coreutils
Removing special files from this directory prevents a hostile process from accessing the relevant feature in the underlying host kernel.
There are far fewer entries in /proc than in a runc container, as this diff shows:
$ diff -u <(docker run -t sublimino/hack ls -1 /proc/1) <(docker run -t --runtime=runsc sublimino/hack ls -1 /proc/1)
-arch_status
-attr
-autogroup
 auxv
-cgroup
-clear_refs
 cmdline
 comm
-coredump_filter
-cpu_resctrl_groups
-cpuset
 cwd
 environ
 exe
@@ -16,39 +8,17 @@
 fdinfo
 gid_map
 io
-limits
-loginuid
-map_files
 maps
 mem
 mountinfo
 mounts
-mountstats
 net
 ns
-numa_maps
-oom_adj
 oom_score
 oom_score_adj
-pagemap
-patch_state
-personality
-projid_map
-root
-sched
-schedstat
-sessionid
-setgroups
 smaps
-smaps_rollup
-stack
 stat
 statm
 status
-syscall
 task
-timens_offsets
-timers
-timerslack_ns
 uid_map
-wchan
The sentry process that simulates the Linux system call interface reimplements over 235 of the 350 system calls in Linux 5.3.11. This shows you a “masked” view of the /proc and /dev virtual filesystems. These filesystems have historically leaked the container abstraction by sharing information from the host (memory, devices, processes etc) so are an area of special concern.
Let’s look at system devices under /dev in gVisor and runc:
$ diff -u <(docker run -t sublimino/hack ls -1p /dev) <(docker run -t --runtime=runsc sublimino/hack ls -1p /dev)
-console
-core
 fd
 full
 mqueue/
+net/
 null
 ptmx
 pts/
We can see that the runsc gVisor runtime drops the console and core devices, but includes a /dev/net/tun device (under the net/ directory above) for its netstack networking stack, which also runs inside sentry. Netstack can be bypassed for direct host network access (at the cost of some isolation), or host networking disabled entirely for fully host-isolated networking (depending on the CNI or other network configured within the sandbox).
Apart from these giveaways, gVisor is kind enough to identify itself at boot time, which you can see in a container with dmesg:
$ docker run --runtime=runsc sublimino/hack dmesg
Feeding the init monster...
Committing treasure map to memory...
Checking naughty and nice process list...
Granting licence to kill(2)...
Creating process schedule...
Creating bureaucratic processes...
Checking naughty and nice process list...
Notably, the boot log does not reflect the real time it takes to start the container, and the quirky messages are randomised, so don’t rely on them for automation. If we time the process we can see it start faster than it claims:
$ time docker run --runtime=runsc sublimino/hack dmesg
Consulting tar man page...
Verifying that no non-zero bytes made their way into /dev/zero...
Synthesizing system calls...
Preparing for the zombie uprising...
Adversarially training Redcode AI...
Conjuring /dev/null black hole...
Accelerating teletypewriter to 9600 baud...
Checking naughty and nice process list...
Generating random numbers by fair dice roll...
Ready!
real 0m0.852s
user 0m0.021s
sys 0m0.016s
Unless an application running in a sandbox explicitly checks for these features of the environment, it will be unaware that it is in a sandbox. Your application makes the same system calls as it would to a normal Linux kernel, but the Sentry process intercepts the system calls as shown in Figure 3-13.
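To illustrate what such an explicit check might look like, here is a minimal heuristic sketch. It relies only on the /proc/1 differences visible in the diff earlier in this section; the chosen subset of entries and the function names are illustrative assumptions, and a different gVisor release may expose a different set of files, so a result either way proves little.

```python
import os
from typing import Iterable, Set

# A few of the entries listed for runc but not for runsc in the diff above.
# This subset is an assumption for illustration, not a stable interface.
ENTRIES_ABSENT_UNDER_RUNSC = frozenset(
    {"attr", "cgroup", "limits", "loginuid", "root", "sched"}
)


def absent_entries(proc_listing: Iterable[str]) -> Set[str]:
    """Return which of the tracked entries are missing from a listing."""
    return set(ENTRIES_ABSENT_UNDER_RUNSC - set(proc_listing))


def looks_like_masked_proc(proc_listing: Iterable[str]) -> bool:
    """True only if every tracked entry is missing from the listing."""
    return absent_entries(proc_listing) == ENTRIES_ABSENT_UNDER_RUNSC


if __name__ == "__main__":
    # Inspect the current process on a Linux host.
    print(looks_like_masked_proc(os.listdir("/proc/self")))
```

Most applications never perform a check like this, which is why they run unmodified inside the sandbox.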
Sentry prevents the application interacting directly with the host kernel, and has a seccomp profile that limits its possible host system calls. This helps prevent escalation in case a tenant breaks into Sentry and attempts to attack the host kernel.
Implementing a userspace kernel is a Herculean undertaking, and gVisor does not cover every system call. This means some applications are not able to run in gVisor, although in practice this doesn’t happen very often, and there are millions of workloads running on GCP under gVisor.
The Sentry has a side process called Gofer. It handles disks and devices, which are historically common VM attack vectors. Separating out these responsibilities increases your resistance to compromise; if Sentry has an exploitable bug, it can’t be used to attack the host’s devices directly because they’re all proxied through Gofer.
gVisor is written in Go to avoid security pitfalls that can plague kernels. Go is strongly typed, with built-in bounds checks, no uninitialized variables, no use-after-free bugs, no stack overflow bugs, and a built-in race detector. However, using Go has its challenges, and the runtime often introduces a little performance overhead.
The userspace kernel design also comes at the cost of some reduced application compatibility and a high per-system-call overhead. Of course, not all applications make a lot of system calls, so the impact depends on usage.
Application system calls are redirected to Sentry by a Platform Syscall Switcher, which intercepts the application when it tries to make system calls to the kernel. Sentry then makes the required system calls to the host for the containerised process, as shown in Figure 3-14. This proxying prevents the application from directly controlling system calls.
Sentry sits in a loop waiting for a system call to be generated by the application as shown in Figure 3-15.
It captures the system call with ptrace, handles it, and returns a response to the process (often without making the expected system call to the host). This simple model protects the underlying kernel from any direct interaction with the process inside the container.
The Platform Syscall Switcher, gVisor’s system call interceptor, has two modes: ptrace and KVM.
The ptrace (“process trace”) system call provides a mechanism for a parent process to observe and modify another process’s behaviour. PTRACE_SYSEMU forces the traced process to stop on entry to the next syscall, and gVisor is able to respond to it or proxy the request to the host kernel, going via Gofer if I/O is required.
The decreasing number of permitted calls shown in Figure 3-16 limits the exploitable interface of the underlying host kernel to 68 system calls, while the containerised application process believes it has access to all ~350 kernel calls.
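To make the interception model concrete, here is a toy Python simulation of the loop described above. It is a sketch under stated assumptions, not gVisor’s implementation: the class names, the tiny syscall tables, and the in-memory file store are all hypothetical, "read" is simplified to take a path, and no real ptrace or KVM calls are made.

```python
class Gofer:
    """Stand-in for the side process that proxies file I/O (hypothetical)."""

    def __init__(self, files):
        self._files = dict(files)

    def read(self, path):
        return self._files[path]


class Sentry:
    """Toy userspace kernel: every application syscall is handled here."""

    # Illustrative subset only; the text cites 68 permitted host calls.
    HOST_ALLOWLIST = {"futex", "write"}

    def __init__(self, gofer):
        self.gofer = gofer
        self.host_calls = []  # record of what actually reached the "host"

    def _host(self, name):
        # Models the seccomp profile that restricts Sentry itself.
        if name not in self.HOST_ALLOWLIST:
            raise PermissionError(f"host syscall {name!r} blocked")
        self.host_calls.append(name)

    def handle(self, name, *args):
        if name == "getpid":
            return 1  # answered from Sentry's own state, no host call
        if name == "read":
            return self.gofer.read(args[0])  # file I/O goes via Gofer
        if name == "write":
            self._host("write")  # proxied to the host via the allowlist
            return len(args[0])
        return -38  # -ENOSYS on Linux: not implemented by this toy kernel


sentry = Sentry(Gofer({"/etc/hostname": b"sandbox\n"}))
calls = [("getpid",), ("read", "/etc/hostname"), ("write", b"hi"), ("reboot",)]
for call in calls:
    print(call[0], "->", sentry.handle(*call))
print("host saw:", sentry.host_calls)
```

Of the four calls the application makes, only one reaches the simulated host, which is the property the narrowing in Figure 3-16 describes.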
Firecracker is a virtual machine monitor (VMM) that boots a dedicated VM for its guest using KVM. Instead of pairing KVM with QEMU for traditional device emulation, Firecracker implements its own memory management and device emulation. It has no BIOS (instead implementing the Linux Boot Protocol), no PCI support, and stripped-down, simple, virtualised devices: a single network device, a block I/O device, timer, clock, serial console, and a keyboard device that only simulates ctrl-alt-del to reset the VM, as shown in Figure 3-17.
The Firecracker VMM process that starts the guest virtual machine is in turn started by a jailer process. The jailer sets up the security configuration of the VMM sandbox (GID and UID assignment, network namespaces, chroot creation, and cgroup creation), then terminates and passes control to Firecracker, where seccomp is enforced around the KVM guest kernel and userspace that it boots.
Instead of using a second process for I/O like gVisor, Firecracker uses KVM’s virtio drivers to proxy from the guest to the host kernel, via the Firecracker VMM process (shown in Figure 3-18). When the Firecracker VM image starts, it boots into protected mode in the guest kernel, never running in real mode.
Firecracker is compatible with Kubernetes and OCI using the firecracker-containerd shim.
Firecracker invokes far less host kernel code than traditional LXC or gVisor once it has started, although they all touch similar amounts of kernel code to start their sandboxes.
Performance improvements are gained from an isolated memory stack, and lazily flushing data to the page cache instead of disk to increase filesystem performance.
Firecracker supports arbitrary Linux binaries but does not support generic Linux kernels. It was created for AWS’s Lambda service and forked from Google’s ChromeOS VMM, crosvm.
What makes crosvm unique is a focus on safety within the programming language and a sandbox around the virtual devices to protect the kernel from attack in case of an exploit in the devices.
Firecracker is a statically linked Rust binary that is compatible with Kata Containers, Weave Ignite, firekube, and firecracker-containerd. It provides soft allocation (not allocating memory until it’s actually used) for more aggressive “bin packing”, and so greater resource utilisation.
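A rough piece of arithmetic shows why soft allocation helps bin packing. The Python sketch below uses invented figures (a 64 GiB host, guests configured with 512 MiB that touch only 128 MiB), and vms_that_fit is a hypothetical helper, not a Firecracker API.

```python
def vms_that_fit(host_mib: int, per_vm_mib: int) -> int:
    """How many VMs fit if each one consumes per_vm_mib of host memory."""
    return host_mib // per_vm_mib


HOST_MIB = 64 * 1024   # assumed 64 GiB host
CONFIGURED_MIB = 512   # memory each guest is configured with (assumed)
TOUCHED_MIB = 128      # memory each guest actually touches (assumed)

# Eager allocation reserves the configured size up front.
eager = vms_that_fit(HOST_MIB, CONFIGURED_MIB)
# Soft allocation only backs pages once the guest uses them.
soft = vms_that_fit(HOST_MIB, TOUCHED_MIB)

print(f"eager allocation: {eager} VMs, soft allocation: {soft} VMs")
```

The gain depends on guests not all touching their full configured memory at the same time, so real deployments would leave headroom rather than pack to this theoretical limit.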
Finally, Kata Containers are lightweight VMs containing a container engine. They are highly optimized for running containers. They are also the oldest, and most mature, of the recent sandboxes, and grew from the Clear Containers project (based on Intel Clear Linux). Compatibility is wide, with support for most container orchestrators.
Grown from a combination of Intel Clear Containers and Hyper.sh RunV, Kata Containers “wraps” containers with a dedicated KVM virtual machine (seen in Figure 3-19) and device emulation from a pluggable back end: QEMU, QEMU-lite, NEMU (a custom stripped-down QEMU), or Firecracker. It is an OCI runtime, so it supports Kubernetes and does not require modification of container images.
The Kata Container runtime launches each container on a guest Linux kernel. Each Linux system is on its own hardware isolated VM (as you can see in Figure 3-20).
The kata-runtime process is the VMM, and the interface to the OCI runtime. kata-proxy handles I/O for the kata-agent (and therefore the application) using KVM’s virtio-serial, and multiplexes a command channel over the same connection. kata-shim is the interface to the container engine, handling container lifecycles, signals, and logs.
The guest is started using KVM and either QEMU or Firecracker. The project has forked QEMU twice to experiment with lightweight start times and has reimplemented a number of features back into QEMU, which is now preferred to NEMU (the most recent fork).
Inside the VM, QEMU boots an optimised kernel, and systemd starts the kata-agent process. Kata-agent, which uses libcontainer and so shares a lot of code with runc, manages the containers running inside the VM.
Networking is provided by integrating with CNI or Docker’s CNM, and a network namespace is created for each VM. Because of its networking model, the host network can’t be joined.
SELinux and AppArmor are not currently implemented (July 2020), and some OCI inconsistencies limit the Docker integration.
Rust is similar to Go in that it is memory safe (memory model, virtio, etc), but it is built atop a memory ownership model, which avoids whole classes of bugs including use after free, double free, and dangling pointer issues.
It has safe and simple concurrency and no garbage collector (which may incur some virtualisation overhead and latency), instead using build-time analysis to find segmentation faults and memory issues.
rust-vmm is a development toolkit for new VMMs as shown in Figure 3-21. It is a collection of building blocks (Rust packages, or “crates”) comprised of virtualisation components. These are well tested (and therefore better secured) and provide a simple, clean interface. For example, the vm-memory crate is a guest memory abstraction, providing a guest address, memory regions, and guest shared memory.
The project was born from ChromeOS’s cross-vm (crosvm), which was forked by Firecracker and subsequently abstracted into the “hypervisor from scratch” crates. This approach will enable the development of a plug-and-play hypervisor architecture.
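To illustrate what a guest memory abstraction involves, here is a toy model in Python. It is a hypothetical sketch of the idea (guest addresses resolved to offsets within memory regions), not the vm-memory crate’s API: the Region and GuestMemory names and methods are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    guest_base: int   # first guest physical address covered by this region
    data: bytearray   # host-side backing store for the region

    @property
    def guest_end(self) -> int:
        return self.guest_base + len(self.data)


class GuestMemory:
    """Toy guest memory: translates guest addresses into region offsets."""

    def __init__(self, regions):
        self.regions = sorted(regions, key=lambda r: r.guest_base)

    def _locate(self, addr: int, length: int):
        # An access must fall entirely inside one mapped region.
        for region in self.regions:
            if region.guest_base <= addr and addr + length <= region.guest_end:
                return region, addr - region.guest_base
        raise ValueError(f"guest range {addr:#x}+{length} is not mapped")

    def write(self, addr: int, payload: bytes) -> None:
        region, offset = self._locate(addr, len(payload))
        region.data[offset:offset + len(payload)] = payload

    def read(self, addr: int, length: int) -> bytes:
        region, offset = self._locate(addr, length)
        return bytes(region.data[offset:offset + length])


mem = GuestMemory([
    Region(0x0, bytearray(0x1000)),
    Region(0x10000, bytearray(0x1000)),
])
mem.write(0x10010, b"boot")
print(mem.read(0x10010, 4))
```

Rejecting any access that is unmapped or straddles a region boundary is the kind of check a VMM needs, because guest-supplied addresses are untrusted input.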
The degree of access and privilege that a guest process has to host features, or virtualised versions of them, impacts the attack surface available to an attacker in control of the guest process.
This new tranche of sandbox technologies is under active development. It’s code, and like all new code, is at risk of exploitable bugs. This is a fact of software, however, and is infinitely better than no new software at all!
It may be that these sandboxes are not yet a target for attackers. The level of innovation and baseline knowledge to contribute means the barrier to entry is set high.
From an administrator’s perspective, modifying or debugging applications within the sandbox becomes slightly more difficult, similar to the difference between bare metal and containerised processes. These difficulties are not insurmountable but require administrator familiarisation with the underlying runtime.
It is still possible to run privileged sandboxes that have elevated capabilities within the guest, and although the risks are fewer than for privileged containers, users should be aware that any reduction of isolation increases the risk of running the process inside the sandbox.
Kubernetes and Docker support running multiple container runtimes simultaneously; in Kubernetes, RuntimeClass is stable from 1.20 on. This means a Kubernetes worker node can host pods running under different container runtimes via the Container Runtime Interface (CRI), which greatly enhances workload separation.
With spec.template.spec.runtimeClassName you can target a sandbox for a Kubernetes workload via CRI. Docker is able to run any OCI compliant runtime (e.g. runsc), but the kubelet uses CRI. While Kubernetes has not yet distinguished between types of sandboxes, we can still set node affinity and tolerations so pods are scheduled onto nodes that have the relevant sandbox technology installed.
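The scheduling constraint can be sketched in Python as a transformation of a pod spec. This is an illustration under assumptions: the label key sandbox.example.com/runtime is hypothetical, the helper name is invented, and the sketch uses nodeSelector (the simplest node selection constraint) rather than a full nodeAffinity stanza.

```python
import copy


def pin_to_sandbox_nodes(pod_spec, label_key, label_value):
    """Return a copy of pod_spec that only schedules onto matching nodes."""
    spec = copy.deepcopy(pod_spec)
    # nodeSelector: only nodes carrying this label are eligible.
    spec.setdefault("nodeSelector", {})[label_key] = label_value
    # toleration: permit scheduling onto nodes tainted with the same key/value.
    spec.setdefault("tolerations", []).append({
        "key": label_key,
        "operator": "Equal",
        "value": label_value,
        "effect": "NoSchedule",
    })
    return spec


# Hypothetical pod spec and label, for illustration only.
base_spec = {"containers": [{"name": "app", "image": "nginx"}]}
pinned = pin_to_sandbox_nodes(base_spec, "sandbox.example.com/runtime", "gvisor")
print(pinned["nodeSelector"])
print(pinned["tolerations"])
```

This assumes the sandbox-capable nodes have been labelled, and optionally tainted, with the same key and value so that other workloads stay off them.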
To use a new CRI runtime in Kubernetes, create a non-namespaced RuntimeClass:
# The name the RuntimeClass will be referenced by
# RuntimeClass is a non-namespaced resource
# The name of the corresponding CRI configuration
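The shape of such a RuntimeClass can be sketched as follows. This Python snippet builds the object and prints it as JSON, which kubectl accepts in place of YAML. The name gvisor and the handler runsc are assumptions for illustration: the handler must match whatever name your CRI runtime configuration uses.

```python
import json

runtime_class = {
    # RuntimeClass lives in the node.k8s.io API group.
    "apiVersion": "node.k8s.io/v1",
    "kind": "RuntimeClass",
    # RuntimeClass is non-namespaced, so metadata carries no namespace.
    "metadata": {"name": "gvisor"},  # assumed name that pods will reference
    "handler": "runsc",              # assumed name of the CRI configuration
}

print(json.dumps(runtime_class, indent=2))
```

Saved to a file, the printed document could be created in the cluster with kubectl apply -f.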
Then reference the CRI runtime class in the pod definition:
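A minimal sketch of such a pod definition follows, expressed as a Python dictionary and printed as JSON (which kubectl accepts in place of YAML). It assumes a RuntimeClass named gvisor already exists in the cluster. The pod name, container name, and image are hypothetical.

```python
import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "sandboxed-nginx"},  # hypothetical pod name
    "spec": {
        # Must match the metadata.name of an existing RuntimeClass.
        "runtimeClassName": "gvisor",
        # Hypothetical workload; the image needs no modification.
        "containers": [{"name": "nginx", "image": "nginx"}],
    },
}

print(json.dumps(pod, indent=2))
```

For a Deployment the same field sits one level deeper, at spec.template.spec.runtimeClassName.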
This has started a new pod using gvisor. Remember that runsc (gVisor’s runtime component) must be installed on the node that the pod is scheduled on.
Generally, sandboxes are more secure and containers are less complex.
When running sensitive or untrusted workloads, you want to narrow the interface between a sandboxed process and the host. There are trade-offs: debugging a rogue process becomes much harder, and traditional tracing tools may not have good compatibility.
There is a general, minor performance overhead for sandboxes over containers (~50-200ms startup), which may be negligible for some workloads, and benchmarking is strongly encouraged. Options may also be limited by platform or nested virtualisation support.
As next generation runtimes have focused on stripping down legacy compatibility, they are very small and very fast to start up (compared to traditional VMs). They are not as fast as LXC or runc, but fast enough for FaaS providers to offer aggressive scale rates.
Traditional container runtimes like LXC and runc are faster to start as they run a process on an existing kernel. Sandboxes need to configure their own guest kernel, which leads to slightly longer start times.
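Since startup cost varies by workload and platform, it is worth measuring on your own infrastructure. The Python sketch below is one hypothetical way to do that: time_command is an invented helper, and the stand-in command simply starts a Python interpreter so that the example runs anywhere.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv `runs` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            check=True,  # fail loudly if the command itself fails
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in command so the sketch is runnable without Docker installed.
# To compare runtimes, swap in the commands from earlier in the chapter, e.g.
#   ["docker", "run", "--runtime=runsc", "sublimino/hack", "dmesg"]
samples = time_command([sys.executable, "-c", "pass"], runs=3)
print(f"median startup: {statistics.median(samples):.3f}s over {len(samples)} runs")
```

Comparing the median for the default runtime against the median for a sandboxed runtime gives a first estimate of the overhead. Using the median rather than the mean reduces the effect of a slow first run caused by image pulls or cold caches.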
Managed services are easiest to adopt, with gVisor in GKE and Firecracker in AWS Fargate. Both of them, and Kata, will run anywhere virtualisation is supported, and the future is bright with the rust-vmm library promising many more runtimes to keep valuable workloads safe.