In this chapter, we touch upon an important aspect of Linux containers: Linux namespaces. Namespaces allow the kernel to provide isolation by restricting the visibility of kernel resources, such as mount points and network subsystems, among processes scoped to different namespaces.
Today, containers are the de facto mechanism for provisioning cloud software. They spin up quickly and have less overhead than a virtual machine. There are specific reasons behind these properties.
VM-based virtualization emulates the hardware and provides an OS as the abstraction. This means that the bulk of the OS code and the device drivers are loaded as part of the provisioning. Containers, on the other hand, virtualize the OS itself. This means that there are data structures within the kernel that facilitate this separation. Most of the time, it is not clear what is happening behind the covers.
The main kernel building blocks behind containers are:
Linux namespaces
cgroups
Layered file systems
A namespace is a logical isolation mechanism within the Linux kernel that controls visibility. The controls are defined at the process level, which means a namespace determines which kernel resources a process can see. Think of the Linux kernel as a guard protecting resources like OS memory, privileged CPU instructions, disks, and other resources that only the kernel should be able to access. Applications running in user space should access these resources only via a trap, in which case the kernel takes over control and executes the instructions on behalf of the user space application. As an example, an application that wants to access a file on a disk has to delegate the request through a system call, which traps into the kernel; the kernel then executes the request on behalf of the application.
Since there could be many user space applications running in parallel on a single Linux kernel, we need a way to provide isolation between them. By isolation, we mean a kind of sandboxing of each individual application, so that certain resources of the application are confined to that sandbox. As an example, we would like to have a file system sandbox, within which we could have our own view of the files. That way, multiple such sandboxes could run over the same Linux kernel without interfering with each other.
Such sandboxing is achieved through a specific data structure in the Linux kernel, called the namespace.
Namespace Types
In this section, we explain the different namespaces that exist within the Linux kernel and discuss how they are realized within the kernel.
UTS
This namespace allows a process to see a hostname that differs from the one in the global namespace.
PID
The processes within the PID namespace have a different process tree. They have an init process with PID 1. At the data structure level though, the processes belong to one global process tree, which is visible only at the host level. Tools like ps or direct usage of the /proc file system from within the namespace will list the processes and their related resources for the process tree within the namespace.
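The two views of the same process can be observed from user space. The following is a minimal sketch, assuming a Linux kernel that exposes the NStgid field in /proc/<pid>/status; the field lists the process ID in each nested PID namespace, outermost first. The function names are hypothetical and chosen only for illustration.

```python
import os


def pid_per_namespace(status_text):
    """Extract the NStgid values from the text of /proc/<pid>/status.

    Returns one PID per nested PID namespace, outermost first,
    or an empty list if the field is not present.
    """
    for line in status_text.splitlines():
        if line.startswith("NStgid:"):
            return [int(value) for value in line.split()[1:]]
    return []


def my_pids():
    """Return the NStgid list for the current process ([] if unavailable)."""
    try:
        with open("/proc/self/status") as status_file:
            return pid_per_namespace(status_file.read())
    except OSError:
        # Not Linux, or /proc is not mounted.
        return []


if __name__ == "__main__":
    print("getpid():", os.getpid())
    print("PID in each namespace:", my_pids())
```

On the host, the list would normally contain a single entry. For a process inside a nested PID namespace, inspected through the host's /proc, the list would contain the host-level PID followed by the PID that the process sees for itself.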
Mount
This is one of the most important namespaces. It controls which mount points a process should see. If a process is within a namespace, it will only see the mounts within that namespace.
Whenever a mount operation is invoked, a vfsmount structure is created, and the dentry of the mount point as well as the dentry of the mounted tree are populated. A dentry is a data structure that maps a filename to its inode.
Apart from mount, there is a bind mount, which allows a directory (instead of a device) to be mounted at a mount point. The process of bind mounting results in creating a vfsmount structure that points to the dentry of the directory.
Containers work on the concept of bind mounts. So, when a volume is created for a container, it’s actually a bind mount of a directory within the host to a mount point within the container’s file system. Since the mount happens within the mount namespace, the vfsmount structures are scoped to the mount namespace. This means that, by creating a bind mount of a directory, we can expose a volume within the namespace that’s holding the container.
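The mounts scoped to a mount namespace can be inspected through /proc/self/mountinfo, which lists only the mounts visible to the calling process. The sketch below parses that file, assuming the field layout documented in proc(5). The Mount tuple and function name are hypothetical. A root field other than / usually suggests that a subdirectory was mounted, as happens with a bind mount, although some file systems show this for other reasons.

```python
from typing import List, NamedTuple


class Mount(NamedTuple):
    mount_id: int
    parent_id: int
    root: str         # directory inside the file system that forms the root of this mount
    mount_point: str  # where it is attached, as seen by this mount namespace
    fstype: str
    source: str


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse the contents of /proc/<pid>/mountinfo into Mount tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        try:
            # Optional fields end at a lone "-" separator (index 6 or later).
            sep = fields.index("-", 6)
        except ValueError:
            continue  # malformed line
        if len(fields) < sep + 3:
            continue
        mounts.append(Mount(
            mount_id=int(fields[0]),
            parent_id=int(fields[1]),
            root=fields[3],
            mount_point=fields[4],
            fstype=fields[sep + 1],
            source=fields[sep + 2],
        ))
    return mounts


if __name__ == "__main__":
    try:
        with open("/proc/self/mountinfo") as mountinfo:
            for mount in parse_mountinfo(mountinfo.read()):
                marker = "(subtree)" if mount.root != "/" else ""
                print(mount.mount_point, mount.fstype, mount.root, marker)
    except OSError:
        print("mountinfo not available on this system")
```

Running the script on the host and again inside a container would show two different lists, because each process reads the mounts of its own mount namespace.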
Network
A network namespace gives a container a separate set of network subsystems. This means that a process within the network namespace sees its own network interfaces, routes, and iptables rules. This separates the container network from the host network. We will study this in more depth when we look at an example of the packet flow between two containers in different namespaces on the same host.
IPC
This namespace scopes IPC constructs such as POSIX message queues. Between two processes within the same namespace, IPC is enabled, but it will be restricted if two processes in two different namespaces try to communicate over IPC.
Cgroup
This namespace restricts the visibility of the cgroup file system to the cgroup the process belongs to. Without this restriction, a process could peek at the global cgroups via the /proc/self/cgroup hierarchy. This namespace effectively virtualizes the cgroup itself.
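To see what a process can learn about its cgroup, we can parse /proc/self/cgroup. The sketch below assumes the documented line format hierarchy-ID:controller-list:path; the function name and the sample paths in the comments are hypothetical. Inside a cgroup namespace, the path is reported relative to the root of that namespace, so a containerized process typically sees / instead of its full path on the host.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip malformed lines
        hierarchy_id, controllers, path = parts
        # cgroup v2 uses hierarchy 0 with an empty controller list.
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as cgroup_file:
            for entry in parse_proc_cgroup(cgroup_file.read()):
                print(entry)
    except OSError:
        print("/proc/self/cgroup not available on this system")
```

For example, a line such as 0::/user.slice/session-1.scope (a made-up host path) would be reported as hierarchy 0 with no named controllers, while the same process inside its own cgroup namespace would be expected to show 0::/.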
Apart from the namespaces mentioned here, as of the writing of this book, there is one more namespace under discussion within the Linux community: the time namespace.
Time
The time namespace has two main use cases:
Changing the date and time inside a container
Adjusting the clocks for a container restored from a checkpoint
The kernel provides access to several clocks: CLOCK_REALTIME, CLOCK_MONOTONIC, and CLOCK_BOOTTIME. The last two clocks are monotonic, but their start points are not well defined (currently it is the system startup time, but POSIX says “since an unspecified point in the past”) and are different on each system. When a container migrates from one node to another, all the clocks have to be restored to a consistent state. In other words, they have to continue running from the point at which they were dumped.
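The three clocks can be read from user space with the standard time module. This is a small sketch, with a hypothetical function name, that skips any clock the platform does not provide. It only reads the clocks and does not change them.

```python
import time


def read_clocks():
    """Return {clock_name: seconds} for the clocks discussed above.

    Clocks that are missing on this platform are left out.
    """
    clocks = {}
    for name in ("CLOCK_REALTIME", "CLOCK_MONOTONIC", "CLOCK_BOOTTIME"):
        clock_id = getattr(time, name, None)
        if clock_id is None:
            continue  # constant not defined on this platform
        try:
            clocks[name] = time.clock_gettime(clock_id)
        except OSError:
            continue  # clock not supported by the running kernel
    return clocks


if __name__ == "__main__":
    for name, value in read_clocks().items():
        print(f"{name:16s} {value:.3f}")
```

Comparing the output on two machines shows why migration is delicate: CLOCK_REALTIME is close on both, while the monotonic clocks differ because each machine started counting at its own point in the past.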
Now that you have a basic idea about namespaces, we can look at how data structures in the Linux kernel implement this separation for Linux containers.
Each task's task_struct points to an nsproxy structure, which holds the eight namespace data structures. The one missing from nsproxy is the user namespace, which is part of the cred data structure in the task_struct.
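The kernel structures themselves are not reachable from user space, but the namespaces a task belongs to are exposed as symbolic links under /proc/<pid>/ns. The following sketch, with a hypothetical function name, lists them for a process. Two processes are in the same namespace when the corresponding links carry the same identifier.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: identifier} read from /proc/<pid>/ns.

    Returns an empty dict if the directory cannot be read
    (not Linux, no such process, or insufficient permissions).
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in names:
        try:
            # Each entry is a symlink whose target looks like "uts:[4026531838]".
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


if __name__ == "__main__":
    for name, identifier in list_namespaces().items():
        print(f"{name:20s} {identifier}")
```

The listing includes the user namespace as well, even though the kernel tracks it through cred rather than nsproxy.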
There are three system calls that can be used to put tasks into specific namespaces: clone, unshare, and setns. The clone and setns calls result in creating an nsproxy object and then adding the specific namespaces needed for the task.
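As an illustration of one of these calls, the sketch below invokes unshare through ctypes in a forked child, so the parent process is left untouched. It moves the child into a new UTS namespace and changes the hostname there. Requesting a new user namespace at the same time is an assumption made here so that the example may work without root; on systems where unprivileged user namespaces are disabled, the function simply returns None. The function name is hypothetical, and the flag values are the ones defined in the kernel headers.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000   # new UTS (hostname) namespace
CLONE_NEWUSER = 0x10000000  # new user namespace (assumption: avoids needing root)


def hostname_in_new_uts_namespace(new_name="sandbox"):
    """Report the hostname a child sees after unshare() and sethostname().

    Returns None when the platform or its policy does not allow it.
    """
    if not hasattr(os, "fork"):
        return None
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: detach from the parent's namespaces, then rename the host.
        status = 1
        try:
            os.close(read_fd)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)
                os.write(write_fd, socket.gethostname().encode())
                status = 0
        except BaseException:
            pass
        finally:
            os._exit(status)  # never return into the parent's code path
    # Parent: read whatever the child reported, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return data.decode() or None


if __name__ == "__main__":
    print("host sees: ", socket.gethostname())
    print("child sees:", hostname_in_new_uts_namespace())
    print("host still sees:", socket.gethostname())
```

The hostname change stays confined to the child's UTS namespace, so the parent keeps seeing the original hostname.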
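Within the kernel, a network namespace is represented by the net structure, which carries the per-namespace networking state. This is how the iptables and routing rules are all scoped into the network namespace.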
Other data structures of relevance here are net_device (the kernel's representation of a network card or device) and sock (the kernel's representation of a socket). These two structures allow a device and a socket, respectively, to be scoped to a network namespace. Each of them can be part of only one namespace at a time. We can move a device to a different namespace via the iproute2 utility.
ip netns add testns: Adds a network namespace
ip netns del testns: Deletes the mentioned namespace
ip netns exec testns sh: Executes a shell within the testns namespace
Adding a Device to a Namespace
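A common way to add a device to a namespace is to create a veth pair and move one end (veth1) into the testns namespace. The other end (veth0) stays in the host namespace, so any traffic sent to veth0 ends up on veth1 in the testns namespace.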
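The veth arrangement described above can be sketched as follows. The helper builds the iproute2 commands for creating the pair and moving veth1 into testns, and by default only prints them. Running them for real requires root privileges and an existing testns namespace. The function names are hypothetical; the interface and namespace names are the ones used in the text.

```python
import subprocess


def veth_setup_commands(namespace="testns", host_end="veth0", ns_end="veth1"):
    """Return the iproute2 commands that wire a veth pair into a namespace."""
    return [
        # Create the pair; both ends start out in the current namespace.
        ["ip", "link", "add", host_end, "type", "veth", "peer", "name", ns_end],
        # Move one end into the target network namespace.
        ["ip", "link", "set", ns_end, "netns", namespace],
    ]


def apply_commands(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False (needs root)."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    apply_commands(veth_setup_commands())
```

Keeping the dry run as the default makes it possible to review the commands before touching the host's network configuration.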
Assume that we run an HTTP server in the testns namespace, which means the listener socket is scoped to the testns namespace, as explained previously for the sock data structure. A TCP packet addressed to the IP and port of the application within the testns namespace would therefore be delivered to the socket scoped within that namespace.
This is how the kernel virtualizes the operating system and various subsystems like networking, IPC, mounts, and so on.
Summary
In this chapter, we learned about Linux namespaces and how they facilitate isolation between user space-based applications. We also looked into how different Linux kernel data structures are used to realize the different namespaces. Going forward, we will look into how the Linux kernel provides resource limits to the different user space-based processes so that one process doesn’t hog the resources of the operating system.