Chapter 3. Virtual machines

Containers are often compared with Virtual Machines (VMs), especially in terms of the isolation that they offer. Let’s make sure you have a solid understanding of how VMs operate so that you can reason about the differences between them and containers. This will be particularly useful when you want to assess the security boundaries around your applications when they run in containers, or in different VMs. When you are discussing the relative merits of containers from a security perspective, understanding how they differ from VMs can be a useful tool.

The fundamental difference is that a VM runs an entire copy of an Operating System, including its kernel, whereas a container shares the host machine’s kernel. To understand what that means, you’ll need to know something about how virtual machines are created and managed by a Virtual Machine Monitor (VMM). And to set the scene for that, let’s start by thinking about what happens when a computer boots up.

Booting up a machine

Picture a physical server. It has some CPUs, memory and networking interfaces. When you first boot up the machine, an initial program runs that’s called the BIOS, or Basic Input Output System. It scans how much memory is available, identifies the network interfaces, and spots any other devices like displays, keyboards, attached storage devices and so on.

In practice, nowadays a lot of this functionality has been superseded by UEFI (Unified Extensible Firmware Interface), but for the sake of argument let’s just think of this as a modern BIOS.

Once the hardware has been enumerated, the system runs a bootloader, which loads and then runs the Operating System’s kernel code. Kernel code operates at a higher level of privilege than your application code. This privilege level allows it to interact with memory, network interfaces and so on, whereas applications, running in what’s called “user space”, can’t do this directly. Instead, user space code has to ask the kernel to do these kinds of activities on its behalf. (It makes these requests through system calls, which are covered in ???.) The kernel could be Linux, Windows, or some other kind of operating system.
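
To make that concrete, here is a minimal sketch of user space asking the kernel to do work on its behalf through a system call. It assumes an x86-64 Linux machine with gcc available, neither of which is specified in the text:

    /* hello_syscall.c: user space asking the kernel to act on its behalf.
     * Build with: gcc -o hello_syscall hello_syscall.c
     */
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* Issue the write system call directly. The CPU switches to kernel
         * mode, the kernel writes to file descriptor 1 (stdout) on our
         * behalf, and control then returns to user space. */
        const char msg[] = "hello from user space\n";
        syscall(SYS_write, 1, msg, sizeof(msg) - 1);
        return 0;
    }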

On an x86 processor, privilege levels are organised into Rings with Ring 0 being the most privileged, and Ring 3 being the least. In a regular set-up (without VMs), for most operating systems the kernel runs at Ring 0, and user space code runs at Ring 3, as shown in Figure 3-1.

Privilege rings
Figure 3-1. Privilege rings

Kernel code (like any code) runs on the CPU in the form of machine code instructions, and these instructions can include privileged instructions for accessing memory, starting CPU threads and so on. The details of everything that can and will happen while the kernel initializes are beyond the scope of this book, but essentially the goal is to mount the root filesystem, set up networking, and bring up any system daemons. (If you want to dive deeper there is a lot of great information on Linux kernel internals, including the bootstrap process, at https://github.com/0xAX/linux-insides.)

Once the kernel has finished its own initialization, it can start running programs in user space. The kernel is responsible for managing everything that those user space programs need. It starts, manages and schedules the CPU threads that these programs run in, and keeps track of these threads through its own data structures that represent processes. One important aspect of kernel functionality is memory management. The kernel assigns blocks of memory to each process and makes sure that processes can’t access each others’ memory blocks.

Enter the VMM

As you have just seen, in a regular set-up, the kernel manages the machine’s resources directly. In the world of virtual machines, a Virtual Machine Monitor does the first layer of resource management, splitting up the resources and assigning them to virtual machines. Each virtual machine gets a kernel of its own.

For each virtual machine that it manages, the VMM assigns some memory and CPU resources, sets up some virtual network interfaces and other virtual devices, and starts a guest kernel with access to these resources.

In a regular server, the BIOS gives the kernel the details of the resources available on the machine; in a virtual machine situation, the VMM divides up those resources and only gives each guest kernel the details of the subset that it is being given access to. From the perspective of the guest OS, it thinks it has direct access to physical memory and devices, but in fact it’s getting access to an abstraction provided by the VMM.

The VMM is responsible for making sure that the guest OS and its applications can’t breach the boundaries of the resources it has been allocated. For example, the guest operating system is assigned a range of memory on the host machine. If the guest somehow tries to access memory outside that range, this is forbidden.

There are two main forms of VMM, often called, not very imaginatively, Type 1 and Type 2. And there is a bit of a grey area between the two, naturally!

Type 1 VMM, or Hypervisors

In a regular system, the bootloader runs an operating system kernel like Linux or Windows. In a pure Type 1 virtual machine environment, a dedicated kernel-level VMM program runs instead.

Type 1 VMM
Figure 3-2. Type 1 Virtual Machine Monitor, also known as a Hypervisor

Type 1 VMMs are also known as “hypervisors”, and examples include Hyper-V, Xen and ESX/ESXi. As you can see in Figure 3-2, the hypervisor runs directly on the hardware (or “bare metal”) with no operating system underneath it.

By saying “kernel level” I mean that the hypervisor runs at Ring 0. (Well, that’s true until we consider hardware virtualization later in this chapter, but for now let’s just assume Ring 0). The guest OS kernel runs at Ring 1, as depicted in Figure 3-3, which means it has less privilege than the hypervisor.

Privilege rings for a hypervisor
Figure 3-3. Privilege rings used under a hypervisor

Type 2 VMM

When you run virtual machines on your laptop or desktop machine, perhaps through something like VirtualBox, they are “hosted” or Type 2 VMs. Your laptop might be running, say, MacOS, which is to say that it’s running a MacOS kernel. You install VirtualBox as a separate application, which then goes on to manage guest VMs that co-exist with your host operating system. Those guest VMs could be running Linux or Windows. Figure 3-4 shows how the guest OS and host OS coexist.

Type 2 VMM
Figure 3-4. Type 2 Virtual Machine Monitor

Consider that for a moment and think about what it means to run, say, Linux within MacOS. By definition this means there has to be a Linux kernel, and that has to be a different kernel from the host’s MacOS kernel.

The VMM application has user space components that you can interact with as a user, but it also installs privileged components allowing it to provide virtualization. You’ll see more about how this works later in this chapter.

As well as VirtualBox, examples of Type 2 VMMs include Parallels and QEMU.

Kernel-based Virtual Machines

I promised that there would be some blurred boundaries between type 1 and type 2. In type 1 the hypervisor runs directly on bare metal; in type 2 the VMM runs in user space on the host OS. What if you run a virtual machine manager within the kernel of the host OS?

This is exactly what happens with a Linux kernel module called KVM, or Kernel-based Virtual Machines.

KVM
Figure 3-5. KVM

Generally KVM is considered to be a Type 1 hypervisor because the guest OS doesn’t have to traverse the host OS, but this author’s view is that this categorisation is overly simplistic.

KVM is often used with QEMU (Quick Emulator), which I listed above as a Type 2 hypervisor. QEMU dynamically translates system calls from the guest OS into host OS system calls. It’s worth mentioning that QEMU can take advantage of hardware acceleration offered by KVM.
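
To give a flavor of how a user space VMM hands work to KVM, here is a minimal sketch that does nothing more than create an empty virtual machine. It assumes a Linux machine where /dev/kvm exists and the kernel headers are installed:

    /* kvm_demo.c: ask the KVM module to create an (empty) virtual machine.
     * Build with: gcc -o kvm_demo kvm_demo.c
     */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        if (kvm < 0) {
            perror("open /dev/kvm");
            return 1;
        }

        /* KVM exposes its API as ioctls on /dev/kvm. */
        printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

        /* Create a VM file descriptor. A real VMM such as QEMU would go on
         * to assign guest memory (KVM_SET_USER_MEMORY_REGION) and create
         * virtual CPUs (KVM_CREATE_VCPU) before running the guest. */
        int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);
        if (vmfd < 0) {
            perror("KVM_CREATE_VM");
            return 1;
        }
        printf("created VM file descriptor %d\n", vmfd);

        close(vmfd);
        close(kvm);
        return 0;
    }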

Whether Type 1, Type 2, or something in between, VMMs employ similar techniques to achieve virtualization. The basic idea is called “trap-and-emulate”, though as we’ll see, x86 processors provide some challenges in implementing this idea.

Trap-and-emulate

Some CPU instructions are privileged, meaning that they can only be executed in Ring 0; if they are attempted in a higher (less privileged) ring, this causes a trap. You can think of the trap as being like an exception in application software that triggers an error handler; a trap results in the CPU calling a handler in the Ring 0 code.

If the VMM runs at Ring 0, and the guest OS kernel code runs at a lower privilege, a privileged instruction run by the guest can invoke a handler in the VMM to emulate the instruction. In this way the VMM can ensure that the guest OSs can’t interfere with each other through privileged instructions.
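
As a rough user space analogy of trap-and-emulate (a minimal sketch, assuming an x86-64 Linux machine and gcc), the program below attempts a privileged instruction from Ring 3. The CPU traps, the kernel’s Ring 0 handler takes over, and the process receives a signal instead of halting the machine:

    /* trap_demo.c: a privileged instruction attempted outside Ring 0.
     * Build with: gcc -o trap_demo trap_demo.c
     */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void handler(int sig)
    {
        /* The kernel's Ring 0 fault handler turned the CPU trap into a
         * signal delivered to this process. */
        printf("caught signal %d: the privileged instruction trapped\n", sig);
        exit(0);
    }

    int main(void)
    {
        signal(SIGSEGV, handler);    /* hlt at Ring 3 raises a fault (#GP) */
        __asm__ volatile("hlt");     /* privileged: only legal in Ring 0   */
        printf("never reached\n");
        return 0;
    }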

Unfortunately, privileged instructions are only part of the story. CPU instructions that can affect the machine’s resources are known as sensitive. The VMM needs to handle these on behalf of the guest OS, because only the VMM has a true view of the machine’s resources. Another class of sensitive instructions behaves differently when executed in Ring 0 than in lower-privileged rings. A VMM needs to do something about these instructions too, because the guest OS code was written assuming the Ring 0 behavior.

If all sensitive instructions were privileged, this would make life relatively easy for VMM programmers, as they would just need to write trap handlers for all these sensitive instructions. Unfortunately, not all x86 sensitive instructions are also privileged, so VMMs need to use different techniques to handle them. Instructions that are sensitive but not privileged are considered to be “non-virtualizable”.
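
The classic x86 example is writing to the flags register with popf: clearing the interrupt flag is a sensitive operation, but in Ring 3 the CPU silently ignores the change rather than trapping, so a VMM gets no chance to intervene. Here is a minimal sketch demonstrating that behavior, assuming an x86-64 Linux machine and gcc:

    /* popf_demo.c: a sensitive but unprivileged instruction in action.
     * Build with: gcc -o popf_demo popf_demo.c
     */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t before, attempted, after;

        /* Read RFLAGS, try to clear the interrupt flag (bit 9), write the
         * modified value back, then read RFLAGS again. */
        __asm__ volatile("pushfq; popq %0" : "=r"(before));
        attempted = before & ~(1ULL << 9);
        __asm__ volatile("pushq %0; popfq" : : "r"(attempted) : "cc");
        __asm__ volatile("pushfq; popq %0" : "=r"(after));

        /* No trap occurs, and the flag is unchanged: the CPU quietly
         * ignored the request because we are not running in Ring 0. */
        printf("IF before: %llu, IF after attempted clear: %llu\n",
               (unsigned long long)((before >> 9) & 1),
               (unsigned long long)((after >> 9) & 1));
        return 0;
    }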

Handling non-virtualizable instructions

There are a few different techniques for handling these non-virtualizable instructions:

  • One option is “binary translation”. All the non-privileged, sensitive instructions in the guest OS are spotted and rewritten by the VMM in real time. This is complex, and newer x86 processors support hardware-assisted virtualization to simplify binary translation.

  • Another option is “paravirtualization”. Instead of modifying the guest OS on the fly, the guest OS is re-written to avoid the non-virtualizable set of instructions, effectively making system calls to the hypervisor. This is the technique used by the Xen hypervisor.

  • Hardware virtualization (such as Intel’s VT-x) allows hypervisors to run in a new, extra-privileged level known as VMX root mode, which is essentially Ring -1. This allows the VM guest OS kernels to run at Ring 0 (in VMX non-root mode), as they would if they were the host OS. (A quick way to check whether your CPU reports this capability is sketched just after this list.)
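
The sketch below (assuming an x86 machine and gcc, which provides cpuid.h) checks whether the processor reports Intel’s VT-x; AMD’s equivalent, AMD-V, is reported through a different CPUID leaf and isn’t checked here:

    /* vtx_check.c: does this CPU report Intel's VT-x hardware virtualization?
     * Build with: gcc -o vtx_check vtx_check.c
     */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            printf("CPUID leaf 1 not supported\n");
            return 1;
        }

        /* Bit 5 of ECX in CPUID leaf 1 indicates VMX (VT-x) support. */
        printf("VT-x (VMX) supported: %s\n", (ecx & (1u << 5)) ? "yes" : "no");
        return 0;
    }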

Note

If you would like to dig deeper into how virtualization works, Adams and Agesen provide a useful comparison and describe how hardware enhancements enable better performance.

Now that you have a picture of how Virtual Machines are created and managed, let’s consider what this means in terms of isolating one process, or application, from another.

Process isolation

Making sure that applications are safely isolated from each other is a primary security concern. If my application can read the memory that belongs to your application, I will have access to your data.

Physical isolation is the strongest form of isolation possible. If our applications are running on entirely separate physical machines, there is no way for my code to get access to the memory of yours.

As we have just discussed, the kernel is responsible for managing its user space processes, including assigning memory to each one. It’s up to the kernel to make sure that one application can’t access the memory assigned to another. If there is a bug in the way that the kernel manages memory, an attacker might be able to exploit that bug to access memory that they shouldn’t be able to reach. And while the kernel is extremely battle-tested, it’s also extremely large and complex, and still evolving. Even though we don’t know of significant flaws in kernel isolation at the time of writing, I wouldn’t advise you to bet against someone finding problems at some point in the future.

These flaws can come about due to increased sophistication in the underlying hardware. In recent years CPU manufacturers developed “speculative execution”, in which a processor runs ahead of the currently executing instruction and works out the results of a branch of code before it’s actually needed. This enables significant performance gains, but it also opened the door to the famous Spectre and Meltdown exploits.

You might be wondering why people consider hypervisors to give greater isolation to virtual machines than a kernel gives to its processes; after all, they are also managing memory and device access, and have a responsibility to keep virtual machines separate. It’s absolutely true that a hypervisor flaw could result in a serious problem with isolation between virtual machines. The difference is that hypervisors have a much, much simpler job. In a kernel, user space processes are allowed some visibility of each other; as a very simple example, you can run ps and see the running processes on the same machine. You can (given the right permissions) access information about those processes by looking in the /proc directory. You are allowed to deliberately share memory between processes through IPC and, well, shared memory. All these mechanisms, where one process is legitimately allowed to discover information about another, make the isolation weaker, because of the possibility of a flaw that allows this access in unexpected or unintended circumstances.
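
As a small illustration of that legitimate visibility (a minimal sketch, Linux assumed), the following program lists the process IDs of every other process on the machine simply by reading /proc - something a hypervisor never has to allow between virtual machines:

    /* list_procs.c: one process discovering its neighbors via /proc.
     * Build with: gcc -o list_procs list_procs.c
     */
    #include <ctype.h>
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        if (proc == NULL) {
            perror("opendir /proc");
            return 1;
        }

        /* Every directory under /proc with a numeric name corresponds to
         * the PID of a process running on this kernel. */
        struct dirent *entry;
        while ((entry = readdir(proc)) != NULL) {
            if (isdigit((unsigned char)entry->d_name[0]))
                printf("%s\n", entry->d_name);
        }

        closedir(proc);
        return 0;
    }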

There is no equivalent when running virtual machines: you can’t see one machine’s processes from another. There is less code required to manage memory, simply because the hypervisor doesn’t need to handle circumstances where machines might share memory - it’s just not something that virtual machines do. As a result, hypervisors are far smaller and simpler than full kernels. There are well over 20 million lines of code in the Linux kernel; in contrast, the Xen hypervisor is around 50,000 lines.

Where there is less code, and less complexity, there is a smaller attack surface, and the likelihood of an exploitable flaw is less. For this reason, virtual machines are considered to have strong isolation boundaries.

That said, virtual machine exploits are not unheard of. Tank, Aggarwal and Chaubey describe a taxonomy of the different types of attack, and NIST have published security guidelines for hardening virtualized environments.

Disadvantages of virtual machines

At this point you might be so convinced of the isolation advantages of virtual machines that you are wondering why people use containers at all! There are some disadvantages of VMs compared to containers:

  • Virtual machines have start-up times that are several orders of magnitude greater than a container’s. After all, starting a container simply means starting a new Linux process, rather than going through the whole boot-up and initialization of a VM. (However, Amazon’s Firecracker offers VMs with very fast start-up times, of the order of 100ms at the time of writing.)

  • Containers give developers a convenient ability to “build once, run anywhere” quickly and efficiently. It’s possible, but very slow, to build an entire machine image for a VM and run it on one’s laptop, but this technique hasn’t taken off in the developer community in the way containers have.

  • In today’s cloud environments, when you rent a virtual machine you have to specify its CPU and memory, and you pay for those resources regardless of how much is actually used by the application code running inside it.

  • Each virtual machine has the overhead of running a whole kernel. By sharing a kernel, containers can be very efficient in both resource use and performance.

When choosing whether to use VMs or containers, there are many trade-offs to be made between factors such as performance, price, convenience, risk, and the strength of security boundary required between different application workloads.

Container isolation compared to VM isolation

As you saw in [Link to Come], containers are simply Linux processes with a restricted view of the host machine. They are isolated from each other by the kernel through the mechanisms of namespaces, cgroups and chroot. These mechanisms were created specifically to provide isolation between processes. However, the simple fact that containers share the host’s kernel means that their basic isolation is weaker than that of VMs.
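
As a taste of those mechanisms (a minimal sketch, assuming a Linux system where unprivileged user namespaces are enabled), the program below gives itself a private hostname by moving into new user and UTS namespaces - the same kernel, but a restricted, per-process view of it:

    /* ns_demo.c: a restricted view of the host via namespaces.
     * Build with: gcc -o ns_demo ns_demo.c
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* A new user namespace grants the capabilities needed to create
         * the UTS namespace without running as root. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0) {
            perror("unshare");
            return 1;
        }

        /* This hostname change is visible only inside the new namespace;
         * the rest of the host keeps its original name. */
        if (sethostname("namespace-demo", strlen("namespace-demo")) != 0) {
            perror("sethostname");
            return 1;
        }

        char name[65];
        gethostname(name, sizeof(name));
        printf("hostname inside the new UTS namespace: %s\n", name);
        return 0;
    }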

However, all is not lost! You can apply additional security features and sandboxing to strengthen this isolation, which I will explain in [Link to Come]. There are also very effective security tools that take advantage of the fact that containers tend to encapsulate microservices, and I will cover these in [Link to Come].

You should now have a good grasp of what virtual machines are. You have learnt why the isolation between virtual machines is considered strong compared to container isolation, and why containers are not generally considered suitably secure for hard multi-tenancy environments. Understanding this difference is an important tool to have in your toolbox when discussing container security. In the next chapter you will see some examples where the weaker isolation of containers can easily be broken through misconfiguration.
