
1. Virtualization Basics

Shashank Mohan Jain, Bengaluru, India

This book explains the basics of virtualization and will help you create your own slimmed-down container framework, similar to Docker. Before we get into that process, we need to understand how the Linux kernel supports virtualization and how the evolution of the Linux kernel and of CPUs improved virtual machine performance, which in turn led to the creation of containerization technologies.

The intent of this chapter is to explain what a virtual machine is and what is happening under the hood. We also look into some of the basics of hypervisors, which make it possible to run a virtual machine in a system.

History of Virtualization

Prior to the virtualization era, the only way to get full physical servers provisioned was via IT. This was a costly and time-consuming process. One of the major drawbacks of this method was that the machine’s resources—like the CPU, memory, and disks—remained underutilized. To get around this, the notion of virtualization started to gain traction.

The history of virtualization goes back to the 1960s, when Jim Rymarczyk, a programmer with IBM, started virtualizing the IBM mainframe. IBM designed the CP-40 mainframe for internal usage. This system evolved into the CP-67, which used partition technology to run multiple applications at once. Then came UNIX, which allowed multiple programs to run on x86 hardware. Still, the problem of portability remained. In the early 1990s, Sun Microsystems came up with Java, which allowed the "write once, run anywhere" paradigm to spread its wings. A user could now write a program in Java that could run across a variety of hardware architectures. Java did this by introducing intermediary code (called bytecode), which could then be executed by a Java runtime on different hardware architectures. This was the advent of process-level virtualization, whereby the Java runtime environment virtualized the POSIX layer.

In the late 1990s, VMware stepped in and launched its own virtualization model. This involved virtualizing the actual hardware, like the CPU, memory, disks, and so on. It meant that on top of the VMware software (also called the hypervisor), we could run operating systems themselves (called guests). Developers were no longer restricted to running Java programs; they could run any program meant for the guest operating system. Around 2001, VMware launched the ESX and GSX servers. GSX was a Type 2 hypervisor, so it needed a host operating system like Windows to run guests. ESX was a Type 1 hypervisor, which allowed guest OSes to run directly on the hypervisor.

What Is Virtualization?

Virtualization provides abstraction on top of the actual resources we want to virtualize. The level at which this abstraction is applied changes the way that different virtualization techniques look.

At a higher level, there are two major virtualization techniques based on the level of abstraction.
  • Virtual machine (VM)-based

  • Container-based

Apart from these two virtualization techniques, there are others, such as unikernels, which are lightweight single-purpose VMs. IBM is currently attempting to run unikernels as processes with projects like Nabla. In this book, we look mainly at VM-based and container-based virtualization.

VM-Based Virtualization

The VM-based approach virtualizes the complete OS. The abstractions it presents to the VM are virtual devices, like virtual disks, virtual CPUs, and virtual NICs. In other words, it virtualizes the complete ISA (instruction set architecture), for example, the x86 ISA.

With virtual machines, multiple OSes can share the same hardware resources, with virtualized representations of each of the resources available to the VM. For example, the OS on the virtual machine (also called the guest) can continue to do I/O operations on a disk (in this case, it’s a virtual disk), thinking that it’s the only OS running on the physical hardware (also called the host), although in actuality, it is shared by multiple virtual machines as well as by the host OS.

VMs are very similar to other processes on the host OS. VMs execute in a hardware-isolated virtual address space and at a lower privilege level than the host OS. The primary difference between a process and a VM is the ABI (Application Binary Interface) exposed by the host. In the case of a process, the exposed ABI has constructs like network sockets, file descriptors (FDs), and so on, whereas with full-fledged OS virtualization, the ABI has a virtual disk, a virtual CPU, virtual network cards, and so on.

Container-Based Virtualization

This form of virtualization doesn’t abstract the hardware but uses techniques within the Linux kernel to isolate access paths for different resources. It carves out a logical boundary within the same operating system. As an example, we get a separate root file system, a separate process tree, a separate network subsystem, and so on.
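As a concrete illustration, the short Go sketch below launches a shell inside new UTS, PID, and mount namespaces, the kind of logical boundary containers are built from. It is a minimal sketch (assuming a Linux host and root privileges), not the slimmed-down container framework we build later in the book.

package main

import (
    "os"
    "os/exec"
    "syscall"
)

// Run /bin/sh inside new UTS, PID, and mount namespaces. The shell shares
// the host kernel but gets its own hostname and its own process tree.
func main() {
    cmd := exec.Command("/bin/sh")
    cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
    }
    if err := cmd.Run(); err != nil {
        panic(err)
    }
}

Inside that shell, changing the hostname does not affect the host, and the shell becomes PID 1 of its own process tree (visible once /proc is re-mounted inside the new mount namespace).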

Hypervisors

A special piece of software, called the hypervisor, is used to virtualize the OS. The hypervisor itself has two parts:
  • Virtual Machine Monitor (VMM): Used for trapping and emulating privileged instructions (which only the kernel of the operating system can normally execute).

  • Device model: Used for virtualizing the I/O devices.

Virtual Machine Monitor (VMM)

Since the hardware is not available directly to a virtual machine (although in some cases it can be), the VMM traps privileged instructions that access the hardware (like a disk or network card) and executes these instructions on behalf of the virtual machine.

The VMM has to satisfy three properties (Popek and Goldberg, 1974):
  • Isolation: Should isolate guests (VMs) from each other.

  • Equivalency: Should behave the same with or without virtualization. This means the majority (almost all) of the instructions run on the physical hardware without any translation.

  • Performance: Should perform as well as without any virtualization. This again means that the overhead of running a VM is minimal.

Some of the common functionalities of the VMM are as follows:
  • Does not allow the VM to access privileged state; that is, things like manipulating the state of certain host registers are not allowed from the VM. The VMM always traps and emulates those calls.

  • Handles exceptions and interrupts. If a network call (i.e., a request) is issued from within a virtual machine, it is trapped in the VMM and emulated. On receipt of a response over the physical network/NIC, the CPU generates an interrupt and delivers it to the virtual machine it's addressed to.

  • Handles CPU virtualization by running the majority of the instructions natively (within the virtual CPU of the VM) and only trapping for certain privileged instructions. This means the performance is almost as good as native code running directly on the hardware.

  • Handles memory-mapped I/O by mapping calls to the virtual device-mapped memory in the guest to the actual physical device-mapped memory. For this, the VMM has to control the physical memory mappings (Guest Physical memory to System Physical memory). More details are covered later in this chapter.

Device Model

The device model of the hypervisor handles I/O virtualization, again by trapping and emulating, and then delivers interrupts back to the specific virtual machine.

Memory Virtualization

One of the critical challenges with virtualization is how to virtualize the memory. The guest OS should behave the same as a non-virtualized OS. This means that the guest OS should at least be made to feel that it controls the memory.

In the case of virtualization, the guest OS cannot be given direct access to the physical memory. What this means is that the guest OS should not be able to manipulate the hardware page tables, as this can lead to the guest taking control of the physical system.

Before we delve into how this is tackled, a basic understanding of memory virtualization is needed, even in the context of normal OS and hardware interactions.

The OS provides its processes a virtual view of memory; any access to physical memory is intercepted and handled by a hardware component called the Memory Management Unit (MMU). The OS sets up the CR3 register (via a privileged instruction) and the MMU uses it to walk the page tables and determine the physical mapping. The OS also takes care of changing these mappings when physical memory is allocated and deallocated.

Now, in the case of virtualized guests, the behavior should be similar. The guest should not get direct access to the physical memory, but should be intercepted and handled by the VMM.

Basically, there are three memory abstractions involved when running a guest OS:
  • Guest Virtual memory: This is what the process running on the guest OS sees.

  • Guest Physical memory: This is what the guest OS sees.

  • System Physical memory: This is what the VMM sees.

There are two possible approaches to handle this:
  • Shadow page tables

  • Nested page tables with hardware support

Shadow Page Tables

In the case of shadow page tables, the Guest Virtual memory is mapped directly to the System Physical memory by the VMM. This improves performance by avoiding one additional layer of translation. But the approach has a drawback: when there is a change to the guest page tables, the shadow page tables need to be updated, which means a trap into the VMM. The VMM arranges for this by marking the guest page tables as read-only. That way, any attempt by the guest OS to write to them causes a trap, and the VMM can then update the shadow page tables.

Nested Page Tables with Hardware Support

Intel and AMD provided a solution to this problem via hardware extensions. Intel provides something called an Extended Page Table (EPT), which allows the MMU to walk two page tables.

The first walk is from the Guest Virtual to the Guest Physical memory and the second walk is from the Guest Physical to the System Physical memory. Since all this translation now happens in the hardware, there is no need to maintain shadow page tables. Guest page tables are maintained by the guest OS and the other page table is maintained by the VMM.
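To make the two walks concrete, the toy Go model below composes two single-level translation maps: one standing in for the guest page tables (Guest Virtual to Guest Physical) and one for the VMM-maintained table (Guest Physical to System Physical). The addresses, the flat maps, and the 4 KB page size are purely illustrative; real page tables are multi-level structures walked by the MMU.

package main

import "fmt"

const pageSize = 0x1000 // 4 KB pages (illustrative)

// translate looks up the page frame for addr in a one-level table and
// re-attaches the offset within the page.
func translate(table map[uint64]uint64, addr uint64) uint64 {
    page := addr &^ uint64(pageSize-1)
    offset := addr & uint64(pageSize-1)
    return table[page] | offset
}

func main() {
    // First walk: Guest Virtual -> Guest Physical (maintained by the guest OS).
    guestPT := map[uint64]uint64{0x40000000: 0x00007000}
    // Second walk: Guest Physical -> System Physical (maintained by the VMM, e.g., EPT).
    ept := map[uint64]uint64{0x00007000: 0x9f234000}

    gva := uint64(0x40000abc)
    gpa := translate(guestPT, gva)
    spa := translate(ept, gpa)
    fmt.Printf("GVA %#x -> GPA %#x -> SPA %#x\n", gva, gpa, spa)
}

With shadow page tables, the VMM pre-computes this composition (Guest Virtual straight to System Physical); with EPT, the hardware performs both walks itself.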

With shadow page tables, the TLB (translation lookaside buffer, a cache that is part of the MMU) needs to be flushed on a context switch, that is, when another VM is brought up. With EPT, the hardware introduces a VM identifier via the address space identifier, which means the TLB can hold mappings for different VMs at the same time. This is a performance boost.

CPU Virtualization

Before we look into CPU virtualization, it is worth understanding how protection rings are built into the x86 architecture. These rings allow the CPU to protect memory, control privileges, and determine which code executes at which privilege level.

The x86 architecture uses the concept of protection rings. The kernel runs in the most privileged mode, Ring 0, and user space (used for running processes) runs in Ring 3.

The hardware requires that all privileged instructions be executed in Ring 0. If any attempt is made to run a privileged instruction in Ring 3, the CPU generates a fault. The kernel has registered fault handlers and, based on the fault type, the corresponding handler is invoked. The handler does a sanity check on the fault and, if the check passes, handles the execution on behalf of the process. In the case of VM-based virtualization, the VM runs as a process on the host OS, so if a fault is not handled, the whole VM could be killed.

At a high level, privileged instruction execution from Ring 3 is controlled by a code segment register via the CPL (current privilege level) bits. All calls from Ring 3 are gated to Ring 0. As an example, a system call can be made by an instruction like syscall (from user space), which in turn sets the right CPL and executes the kernel code at a higher privilege level. Any attempt to directly call high-privilege code from the upper rings leads to a hardware fault.
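This gated transition is exactly what happens on every system call. The Go sketch below (illustrative only; normally you would just call the library wrapper) issues getpid through the raw syscall interface: the syscall instruction raises the privilege level to Ring 0, the kernel's handler runs, and the result is returned as the CPU drops back to Ring 3.

package main

import (
    "fmt"
    "syscall"
)

func main() {
    // RawSyscall executes the syscall instruction: the CPU switches from
    // Ring 3 to Ring 0, the kernel's getpid handler runs, and control
    // returns to user space (Ring 3) with the result.
    pid, _, errno := syscall.RawSyscall(syscall.SYS_GETPID, 0, 0, 0)
    if errno != 0 {
        panic(errno)
    }
    fmt.Println("pid from raw syscall:", pid)
}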

The same concept applies to a virtualized OS. In this case, the guest kernel is deprivileged to run in Ring 1 and the guest's processes run in Ring 3, while the VMM itself runs in Ring 0. With fully virtualized guests, any privileged instruction has to be trapped and emulated by the VMM. Over and above the privileged instructions, the sensitive instructions also need to be trapped and emulated.

Older versions of the x86 CPU are not virtualizable, which means not all sensitive instructions are privileged. Instructions like SGDT, SIDT, and others can be executed in Ring 1 without being trapped. This can be harmful when running a guest OS, as it could allow the guest to peek at the host kernel's data structures. This problem can be addressed in two ways:
  • Binary translation in the case of full virtualization

  • Paravirtualization in the case of XEN with hypercalls

Binary Translation in the Case of Full Virtualization

In this case, the guest OS is used without any changes. The instructions are trapped and emulated for the target environment. This causes a lot of performance overhead, as many instructions have to be trapped into the host/hypervisor and emulated.

Paravirtualization

To avoid the performance problems of binary translation under full virtualization, we use paravirtualization, wherein the guest knows that it is running in a virtualized environment and its interaction with the host is optimized to avoid excessive trapping. As an example, the device driver code is changed and split into two parts. One is the backend, which lives in the hypervisor, and the other is the frontend, which lives in the guest. The guest and host drivers communicate over ring buffers. The ring buffer is allocated from the guest memory. The guest can accumulate/aggregate data within the ring buffer and make one hypercall (a call to the hypervisor, also called a kick) to signal that the data is ready to be drained. This avoids excessive traps from the guest to the host and is a performance win.

In 2005, x86 finally became virtualizable. The CPU vendors introduced one more ring, called Ring -1, which is also known as VMX (Virtual Machine Extensions) root mode. The VMM runs in VMX root mode and the guests run in non-root mode.

This means that guests can run in Ring 0 and, for the majority of instructions, there is no trap. The privileged/sensitive instructions that guests need are executed by the VMM in root mode via a trap. We call these switches VM Exits (the VMM takes over instruction execution from the guest) and VM Entries (the guest regains control from the VMM).

Apart from this, a virtualizable CPU manages a data structure called the VMCS (VM control structure), which holds the state of the VM and its registers. The CPU uses this information during VM Entries and Exits. The VMCS structure is analogous to task_struct, the data structure used to represent a process. A VMCS pointer points to the currently active VMCS. When there is a trap to the VMM, the VMCS provides the state of all the guest registers, along with the reason for the exit, and so on.

The advantages of hardware-assisted virtualization are twofold:
  • No binary translation

  • No OS modification

The problem is that VM Entries and Exits are still heavy operations involving a lot of CPU cycles, as the complete VM state has to be saved and restored. Considerable work has gone into reducing the cycles spent on these entries and exits. Using paravirtualized drivers helps mitigate some of these performance concerns. The details are explained in the next section.
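As an aside, on Linux these hardware extensions are typically reached through the kernel's KVM interface, exposed as /dev/kvm and consumed by user-space VMMs such as QEMU. The Go sketch below (assuming a Linux host with KVM enabled) merely opens /dev/kvm and issues the KVM_GET_API_VERSION ioctl to confirm that hardware-assisted virtualization is available; a real VMM would go on to create a VM and vCPUs and service VM Exits in a run loop.

package main

import (
    "fmt"
    "syscall"
)

// KVM_GET_API_VERSION is _IO(0xAE, 0x00); it takes no argument and
// returns the KVM API version (12 on modern kernels).
const kvmGetAPIVersion = 0xAE00

func main() {
    fd, err := syscall.Open("/dev/kvm", syscall.O_RDWR|syscall.O_CLOEXEC, 0)
    if err != nil {
        fmt.Println("KVM not available:", err)
        return
    }
    defer syscall.Close(fd)

    version, _, errno := syscall.Syscall(syscall.SYS_IOCTL, uintptr(fd), kvmGetAPIVersion, 0)
    if errno != 0 {
        fmt.Println("ioctl failed:", errno)
        return
    }
    fmt.Println("KVM API version:", version)
}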

I/O Virtualization

There are generally two modes of I/O virtualization:
  • Full virtualization

  • Paravirtualization

Full Virtualization

With full virtualization, the guest does not know that it's running on a hypervisor, and the guest OS doesn't need any changes to run on one. Whenever the guest makes I/O calls, they are trapped into the hypervisor, and the hypervisor performs the I/O on the physical device.

Paravirtualization

In this case, the guest OS is made aware that it’s running in a virtualized environment and special drivers are loaded into the guest to take care of the I/O. The system calls for I/O are replaced with hypercalls.

Figure 1-1 shows the difference between paravirtualization and full virtualization.

Figure 1-1. Difference between full and paravirtualized drivers

In the paravirtualized scenario, the guest-side drivers are called the frontend drivers and the host-side drivers are called the backend drivers. Virtio is the virtualization standard for implementing paravirtualized drivers. The frontend network or I/O drivers of the guest are implemented based on the Virtio standard, and they are aware that they are running in a virtual environment. They work in tandem with the backend Virtio drivers of the hypervisor. This frontend/backend mechanism helps achieve high-performance network and disk operations and is the reason for most of the performance benefits enjoyed by paravirtualization.

As mentioned, the frontend drivers on the guests implement a common set of interfaces, as described by the Virtio standard. When an I/O call has to be made from the process in the guest, the process invokes the frontend driver API and the driver passes the data packets to the corresponding backend driver through the virtqueue (the virtual queue).
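One simple way to see these frontend devices from inside a paravirtualized guest is to list /sys/bus/virtio/devices, where the guest kernel exposes every Virtio device it has discovered. A minimal Go sketch, assuming a Linux guest with Virtio devices (on a machine without them, the directory is simply absent):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    const virtioBus = "/sys/bus/virtio/devices"
    // Each entry (virtio0, virtio1, ...) is a Virtio device the guest
    // kernel has discovered: network, block, console, balloon, and so on.
    devices, err := os.ReadDir(virtioBus)
    if err != nil {
        fmt.Println("no virtio bus visible:", err)
        return
    }
    for _, d := range devices {
        // The "device" attribute holds the Virtio device ID,
        // e.g., 0x0001 for network and 0x0002 for block.
        id, _ := os.ReadFile(filepath.Join(virtioBus, d.Name(), "device"))
        fmt.Printf("%s: device ID %s\n", d.Name(), strings.TrimSpace(string(id)))
    }
}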

The backend drivers can work in two ways:
  • They can use QEMU emulation, which means that QEMU emulates the device call via system calls from user space. In other words, the hypervisor lets the user-space QEMU program make the actual device calls.

  • They can use mechanisms like vhost, whereby the QEMU emulation is avoided and the hypervisor kernel makes the actual device call.

As mentioned, communication between frontend and backend Virtio drivers is done via the virtqueue abstraction. The virtqueue presents an API that allows buffers to be enqueued and dequeued. Depending on the type, a driver can use zero or more virtqueues. The network driver uses two virtqueues: one for transmitting packets and the other for receiving them. The Virtio block driver, on the other hand, uses only one virtqueue.

Consider this example of a network packet flow, where the guest wants to send some data over the network:
  1. The guest initiates a network packet write via the guest kernel.

  2. The paravirtualized (Virtio) drivers in the guest take those buffers and put them into the virtqueue (tx).

  3. The backend of the virtqueue is a worker thread, which receives the buffers.

  4. The buffers are then written to the tap device file descriptor. The tap device can be connected to a software bridge, like OVS or a Linux bridge.

  5. The other side of the bridge has a physical interface, which then takes the data out over the physical layer.
In this example, when a guest places the packets on the tx queue, it needs a mechanism to inform the host side that there are packets for handling. There is an interesting mechanism in Linux called eventfd that’s used to notify the host side that there are events. The host watches the eventfd for changes.

A similar mechanism is used to send packets back to the guest.
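A minimal Go sketch of the eventfd pattern (using golang.org/x/sys/unix): a goroutine stands in for the guest side posting a kick after it has queued buffers, and the main goroutine stands in for the host-side worker blocking on the file descriptor. In a real QEMU/KVM setup this wiring is done with ioeventfd/irqfd; the goroutines here are only stand-ins for illustration.

package main

import (
    "encoding/binary"
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    // An eventfd is a kernel-maintained 64-bit counter behind a file descriptor.
    efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
    if err != nil {
        panic(err)
    }
    defer unix.Close(efd)

    // "Guest" side: after placing buffers on the tx queue, kick the other
    // side by adding 1 to the counter with an 8-byte write.
    go func() {
        var kick [8]byte
        binary.LittleEndian.PutUint64(kick[:], 1)
        unix.Write(efd, kick[:])
    }()

    // "Host" side: block until the counter becomes non-zero, then read it
    // (which also resets it). A real backend would poll this fd among others.
    var buf [8]byte
    if _, err := unix.Read(efd, buf[:]); err != nil {
        panic(err)
    }
    fmt.Println("kick received, counter =", binary.LittleEndian.Uint64(buf[:]))
}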

As you saw in earlier sections, the hardware industry is catching up in the virtualization space and is providing more and more hardware support for virtualization, be it for the CPU and its instructions with VT-x (introducing a new ring) or for memory (extended page tables).

Similarly, for I/O virtualization, the hardware provides a mechanism called the I/O memory management unit (IOMMU), which is similar to the CPU's memory management unit but applies to memory accessed by devices. Device memory accesses are intercepted and mapped so that different guests map to different physical memory, and access is controlled by the IOMMU hardware. This provides the isolation needed for device access.

This feature can be used in conjunction with SRIOV (single root I/O virtualization), which allows an SRIOV-compatible device to be split into multiple virtual functions. The basic idea is to bypass the hypervisor in the data path and use a pass-through mechanism, wherein the VM communicates directly with the device. Details of SRIOV are beyond the scope of this book. Curious readers can follow these links for more about SRIOV:

https://blog.scottlowe.org/2009/12/02/what-is-sr-iov/

https://fir3net.com/Networking/Protocols/what-is-sr-iov-single-root-i-o-virtualization.html
