This is the chapter where you’ll find out how containers really work!
As you’ll know if you have ever run docker exec -it <container> bash, a container looks a lot like a virtual machine from the inside. If you have shell access to a container and run ps, you can only see the processes that are running inside the container; it has its own network stack; it seems to have its own filesystem with a root directory that bears no relation to root on the host. You can run containers with limited resources, such as a restricted amount of memory or a fraction of the available CPUs. This all happens using the Linux features that we’re going to dive into in this chapter.
It’s important to realise that containers aren’t virtual machines, and in Chapter 3 we’ll take a look at the differences between these two types of isolation. And in order to do that, you’ll need to know how containers really work.
You’ll see how containers are built out of Linux constructs like namespaces and chroot, along with cgroups that were covered in Chapter 1. With an understanding of these constructs under your belt, you’ll have a feeling for how well protected your applications are when they run inside containers.
If cgroups control the resources that a process can use, namespaces control what it can see. By putting a process in a namespace, you can restrict the resources that are visible to that process.
The origins of namespaces date back to the Plan 9 operating system. At the time, most operating systems had a single “name space” of files. Unix systems allowed the mounting of file systems, but they would all be mounted into the same system-wide view of all filenames. In Plan 9, each process was part of a process group that had its own “name space” abstraction, the hierarchy of files (and file-like objects) that this group of processes could see. Each process group could mount its own set of file systems without seeing each other.
The first namespace was introduced to the Linux kernel in version 2.4.19 back in 2002. This was the mount namespace, and it provided functionality similar to that in Plan 9. Nowadays there are several different kinds of namespace supported by Linux:
Unix Timesharing System (UTS) - this sounds complicated, but to all intents and purposes this namespace is really just about the host and domain names for the system that a process is aware of.
Process IDs
Mount points
Network
User and group IDs
Inter-process communications (IPC)
Control groups (cgroups)
It’s possible that more resources will be namespaced in future revisions of the Linux kernel. For example, a namespace for time was under discussion for some years and was added in Linux 5.6.
A process is always in exactly one namespace of each type. When you start a Linux system it has a single namespace of each type, but as you’ll see, you can create additional namespaces and assign processes into them. You can easily see the namespaces on your machine using the lsns command.
vagrant@myhost:~$ lsns
        NS TYPE   NPROCS   PID USER    COMMAND
4026531835 cgroup      3 28459 vagrant /lib/systemd/systemd --user
4026531836 pid         3 28459 vagrant /lib/systemd/systemd --user
4026531837 user        3 28459 vagrant /lib/systemd/systemd --user
4026531838 uts         3 28459 vagrant /lib/systemd/systemd --user
4026531839 ipc         3 28459 vagrant /lib/systemd/systemd --user
4026531840 mnt         3 28459 vagrant /lib/systemd/systemd --user
4026531992 net         3 28459 vagrant /lib/systemd/systemd --user
This looks nice and neat, and there is one namespace for each of the types I mentioned above. Sadly, this is an incomplete picture! The man page for lsns tells us that it “reads information directly from the /proc filesystem and for non-root users it may return incomplete information”. Let’s see what you get when you run as root.
vagrant@myhost:~$ sudo lsns
        NS TYPE   NPROCS   PID USER            COMMAND
4026531835 cgroup     93     1 root            /sbin/init
4026531836 pid        93     1 root            /sbin/init
4026531837 user       93     1 root            /sbin/init
4026531838 uts        93     1 root            /sbin/init
4026531839 ipc        93     1 root            /sbin/init
4026531840 mnt        89     1 root            /sbin/init
4026531860 mnt         1    15 root            kdevtmpfs
4026531992 net        93     1 root            /sbin/init
4026532170 mnt         1 14040 root            /lib/systemd/systemd-udevd
4026532171 mnt         1   451 systemd-network /lib/systemd/systemd-networkd
4026532190 mnt         1   617 systemd-resolve /lib/systemd/systemd-resolved
The root user can see some additional mount namespaces, and there are a lot more processes visible to root than were visible to the non-root user. The point of showing you this is that when you use lsns, you should run it as root (or use sudo) to get the complete picture.
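Since lsns gets its information from /proc, you can also look at the namespace membership of a single process directly. The following is a minimal sketch, assuming a Linux system where /proc/<pid>/ns is available; the helper name show_namespaces is made up for illustration.

```shell
# Print the namespaces a process belongs to, one per line.
# Each entry in /proc/<pid>/ns is a symbolic link whose target names
# the namespace type and its inode number, e.g. uts:[4026531838].
show_namespaces() {
  pid=${1:-$$}   # default to the current shell
  for ns in /proc/"$pid"/ns/*; do
    printf '%-18s %s\n' "${ns##*/}" "$(readlink "$ns")"
  done
}

show_namespaces
```

The numbers in square brackets correspond to the NS column in the lsns output. Looking at a process that belongs to another user generally requires root, for the same reason that lsns gives an incomplete picture to non-root users.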
Let’s explore how you can use namespaces to create something that behaves like what we call a “container”.
The examples in this chapter use Linux shell commands to create a container. If you would like to try creating a container using the Go programming language, you will find instructions at github.com/lizrice/containers-from-scratch.
Let’s start with the namespace for the Unix Timesharing System (UTS). As mentioned above, this covers host and domain names. By putting a process in its own UTS namespace, you can change the hostname for this process independently of the hostname of the machine or virtual machine on which it’s running.
If you open a terminal on Linux, you can see the hostname:
vagrant@myhost:~$ hostname
myhost
Most (perhaps all?) container systems give each container a random ID. By default this ID is used as the hostname. You can see this by running a container and getting shell access. For example in Docker you could do the following:
vagrant@myhost:~$ docker run --rm -it --name hello ubuntu bash
root@cdf75e7a6c50:/$ hostname
cdf75e7a6c50
Incidentally, you can see in the example above that even if you give the container a name in Docker (here I specified --name hello), this isn’t used for the hostname of the container.
The container can have its own hostname because Docker created it with its own UTS namespace. You can explore the same thing by using the unshare command to create a process that has a UTS namespace of its own.
As it’s described on the man page (seen by running man unshare), unshare lets you “run a program with some namespaces unshared from the parent”. Let’s dig a little deeper into that description. When you “run a program”, the kernel creates a new process and executes the program in it. This is done from the context of a running process - the “parent” - and the new process will be referred to as the “child”. The word “unshare” means that rather than share namespaces of its parent, the child is going to be given its own.
Let’s give it a try. (You need to have root privileges to do this, hence the sudo at the start of the line.)
vagrant@myhost:~$ sudo unshare --uts sh
$ hostname
myhost
$ hostname experiment
$ hostname
experiment
$ exit
vagrant@myhost:~$ hostname
myhost
This runs a sh shell in a new process that has a new UTS namespace. Any programs you run inside the shell will inherit its namespaces. When you run the hostname command, it executes in the new UTS namespace that has been isolated from that of the host machine.
If you were to open another terminal window to the same host before the exit, you could confirm that the hostname hasn’t changed for the whole (virtual) machine. You can change the hostname on the host without affecting the hostname that the namespaced process is aware of, and vice versa.
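One way to confirm that two processes really are in different UTS namespaces is to compare their namespace links under /proc. This is a small sketch rather than a definitive tool; the function name same_namespace is hypothetical, and it assumes that /proc/<pid>/ns is readable for both processes.

```shell
# Succeeds if two processes are in the same namespace of the given type.
# Usage: same_namespace <pid1> <pid2> <type>   (type: uts, pid, mnt, net, ...)
same_namespace() {
  first=$(readlink "/proc/$1/ns/$3") || return 2
  second=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$first" = "$second" ]
}

# The current shell and its parent normally share a UTS namespace.
if same_namespace $$ $PPID uts; then
  echo "same UTS namespace"
else
  echo "different UTS namespace (or not readable)"
fi
```

If you run the comparison from inside the shell started by sudo unshare --uts sh, passing the process ID of a shell in another terminal as the second argument, you would expect the comparison to fail, because the two shells are in different UTS namespaces.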
This is a key component of the way containers work. Namespaces give them a set of resources (in this case the hostname) that are independent of the host machine, and of other containers. But we are still talking about a process that is being run by the same Linux kernel. This has security implications that I’ll discuss later in the chapter. For now, let’s look at another example of a namespace by seeing how you can give a container its own view of running processes.
If you run the ps command inside a Docker container, you can only see the processes running inside that container, and none of the processes running on the host.
vagrant@myhost:~$ docker run --rm -it --name hello ubuntu bash
root@cdf75e7a6c50:/$ ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 18:41 pts/0    00:00:00 bash
root        10     1  0 18:42 pts/0    00:00:00 ps -eaf
root@cdf75e7a6c50:/$ exit
vagrant@myhost:~$
This is achieved with the process ID namespace, which restricts the set of process IDs that are visible. Try running unshare again, but this time specifying that you want a new PID namespace with the --pid flag.
vagrant@myhost:~$ sudo unshare --pid sh
$ whoami
root
$ whoami
sh: 2: Cannot fork
$ whoami
sh: 3: Cannot fork
$ ls
sh: 4: Cannot fork
$ exit
vagrant@myhost:~$
This doesn’t seem very successful - it’s not possible to run any commands after the first whoami! But there are some interesting artefacts in this output.
The first process under sh seems to have worked OK, but every command after that fails due to an inability to fork. The errors are output in the form <command>: <number>: <message>, where the number is the shell’s count of the lines it has read rather than a process ID. What happened here is that the first whoami was the first process created in the new namespace, so it ran as process ID 1; once it exited, the namespace had lost its “init” process, and the kernel refuses to create further processes in a PID namespace in that state. That is a clue that the PID namespace is working in some fashion, in that the process ID numbering has restarted. But it’s pretty much useless if you can’t run more than one process!
There are clues to what the problem is in the description of the --fork flag in the man page for unshare:
“Fork the specified program as a child process of unshare rather than running it directly. This is useful when creating a new pid namespace.”
You can explore this by running ps to view the process hierarchy from a second terminal window:
vagrant@myhost:~$ ps fa
  PID TTY      STAT   TIME COMMAND
...
30345 pts/0    Ss     0:00 -bash
30475 pts/0    S      0:00  \_ sudo unshare --pid sh
30476 pts/0    S      0:00      \_ sh
The sh process is not a child of unshare; it’s a child of the sudo process.
Now try the same thing with the --fork parameter.
vagrant@myhost:~$ sudo unshare --pid --fork sh
$ whoami
root
$ whoami
root
This is progress, in that you can now run more than one command without running into the “Cannot fork” error. If you look at the process hierarchy again from a second terminal, you’ll see an important difference.
vagrant@myhost:~$ ps fa
  PID TTY      STAT   TIME COMMAND
...
30345 pts/0    Ss     0:00 -bash
30470 pts/0    S      0:00  \_ sudo unshare --pid --fork sh
30471 pts/0    S      0:00      \_ unshare --pid --fork sh
30472 pts/0    S      0:00          \_ sh
...
With the --fork parameter, the sh shell is running as a child of the unshare process, and you can successfully run as many different child commands as you choose within this shell.
Given that the shell is within its own process ID namespace, the results of running ps inside it might be surprising.
vagrant@myhost:~$ sudo unshare --pid --fork sh
$ ps
  PID TTY          TIME CMD
14511 pts/0    00:00:00 sudo
14512 pts/0    00:00:00 unshare
14513 pts/0    00:00:00 sh
14515 pts/0    00:00:00 ps
$ ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Mar27 ?        00:00:02 /sbin/init
root         2     0  0 Mar27 ?        00:00:00 [kthreadd]
root         3     2  0 Mar27 ?        00:00:00 [ksoftirqd/0]
root         5     2  0 Mar27 ?        00:00:00 [kworker/0:0H]
...many more lines of output about processes...
$ exit
vagrant@myhost:~$
As you can see, ps is still showing all the processes on the whole host, despite running inside a new process ID namespace. If you want the ps behavior that you would see in a Docker container, it’s not sufficient just to use a new process ID namespace, and the reason for this is included in the man page for ps:
“This ps works by reading the virtual files in /proc.”
Let’s take a look at the /proc directory to see what virtual files this is referring to. Your system will look similar, but not exactly the same, as it will be running a different set of processes.
vagrant@myhost:~$ ls /proc
1      14553  292    467        cmdline      modules
10     14585  3      5          consoles     mounts
1009   14586  30087  53         cpuinfo      mpt
1010   14664  30108  538        crypto       mtrr
1015   14725  30120  54         devices      net
1016   14749  30221  55         diskstats    pagetypeinfo
1017   15     30224  56         dma          partitions
1030   156    30256  57         driver       sched_debug
1034   157    30257  58         execdomains  schedstat
1037   158    30283  59         fb           scsi
1044   159    313    60         filesystems  self
1053   16     314    61         fs           slabinfo
1063   160    315    62         interrupts   softirqs
1076   161    34     63         iomem        stat
1082   17     35     64         ioports      swaps
11     18     3509   65         irq          sys
1104   19     3512   66         kallsyms     sysrq-trigger
1111   2      36     7          kcore        sysvipc
1175   20     37     72         keys         thread-self
1194   21     378    8          key-users    timer_list
12     22     385    85         kmsg         timer_stats
1207   23     392    86         kpagecgroup  tty
1211   24     399    894        kpagecount   uptime
1215   25     401    9          kpageflags   version
12426  26     403    966        loadavg      version_signature
125    263    407    acpi       locks        vmallocinfo
13     27     409    buddyinfo  mdstat       vmstat
14046  28     412    bus        meminfo      zoneinfo
14087  29     427    cgroups    misc
Every numbered directory in /proc corresponds to a process ID, and there is a lot of interesting information about a process inside its directory. For example, /proc/<pid>/exe is a symbolic link to the executable that’s being run inside this particular process, as you can see in the following example.
vagrant@myhost:~$ ps
  PID TTY          TIME CMD
28441 pts/1    00:00:00 bash
28558 pts/1    00:00:00 ps
vagrant@myhost:~$ ls /proc/28441
attr             fdinfo      numa_maps      smaps
autogroup        gid_map     oom_adj        smaps_rollup
auxv             io          oom_score      stack
cgroup           limits      oom_score_adj  stat
clear_refs       loginuid    pagemap        statm
cmdline          map_files   patch_state    status
comm             maps        personality    syscall
coredump_filter  mem         projid_map     task
cpuset           mountinfo   root           timers
cwd              mounts      sched          timerslack_ns
environ          mountstats  schedstat      uid_map
exe              net         sessionid      wchan
fd               ns          setgroups
vagrant@myhost:~$ ls -l /proc/28441/exe
lrwxrwxrwx 1 vagrant vagrant 0 Oct 10 13:32 /proc/28441/exe -> /bin/bash
Irrespective of the process ID namespace it’s running in, ps is going to look in /proc for information about running processes. In order to have ps return only the information about the processes inside the new namespace, there needs to be a separate copy of the /proc directory, where the kernel can write information about the namespaced processes. Given that /proc is a directory directly under root, this means changing the root directory.
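To make the dependence on /proc concrete, here is a rough imitation of ps written as a shell function. It is a sketch only: the name list_procs is made up, and it assumes that each /proc/<pid> directory contains a comm file holding the command name, which is the case on reasonably recent kernels.

```shell
# A very small imitation of ps: every numeric directory in /proc is a
# process ID, and /proc/<pid>/comm holds the name of its command.
list_procs() {
  for dir in /proc/[0-9]*; do
    # A process can exit between the glob and the read, so skip
    # entries that have already disappeared.
    [ -r "$dir/comm" ] || continue
    printf '%7s %s\n' "${dir#/proc/}" "$(cat "$dir/comm" 2>/dev/null)"
  done
}

list_procs
```

Whatever /proc directory this function can see determines what it reports, and that is equally true of the real ps.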
You can change the root directory in Linux with the chroot command. This effectively moves the root directory / to point to some other location within the file system. Once you have run a chroot command, you lose access to anything that was higher in the file hierarchy than your current root directory, since there is no way to go any higher than root.
The description in chroot’s man page reads as follows:
“Run COMMAND with root directory set to NEWROOT. […] If no command is given, run ${SHELL} -i (default: /bin/sh -i).”
From this you can see that chroot doesn’t just change the directory, but also runs a command, falling back to running a shell if you don’t specify a different command.
Create a new directory and try to chroot into it.
vagrant@myhost:~$ mkdir new_root
vagrant@myhost:~$ sudo chroot new_root
chroot: failed to run command ‘/bin/bash’: No such file or directory
vagrant@myhost:~$ sudo chroot new_root ls
chroot: failed to run command ‘ls’: No such file or directory
This doesn’t work! The problem is that once you are inside the new root directory, there is no bin directory inside this root, so it’s impossible to run the /bin/bash shell. Similarly, if you try to run the ls command, it’s not there. You’ll need the files for any commands you want to run to be available within the new root. This is exactly what happens in a “real” container: the container is instantiated from a container image, which encapsulates the filesystem that the container sees. If an executable isn’t present within that filesystem, the container won’t be able to find and run it.
Why not try running Alpine Linux within your container? Alpine is a fairly minimal Linux distribution designed for containers. You’ll need to start by downloading the filesystem.
vagrant@myhost:~$ mkdir alpine
vagrant@myhost:~$ cd alpine
vagrant@myhost:~/alpine$ curl -o alpine.tar.gz http://dl-cdn.alpinelinux.org/alpine/v3.10/releases/x86_64/alpine-minirootfs-3.10.0-x86_64.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2647k  100 2647k    0     0  16.6M      0 --:--:-- --:--:-- --:--:-- 16.6M
vagrant@myhost:~/alpine$ tar xvf alpine.tar.gz
At this point you have a copy of the Alpine filesystem inside the alpine directory you created. Remove the compressed version and move back to the parent directory.
vagrant@myhost:~/alpine$ rm alpine.tar.gz
vagrant@myhost:~/alpine$ cd ..
You can explore the contents of the filesystem with ls alpine, to see that it looks like the root of a Linux filesystem with directories like bin, lib, var, tmp and so on.
Now that you have the Alpine distribution unpacked, you can use chroot to move into the alpine directory, provided you supply a command that exists within that directory’s hierarchy.
It’s slightly more subtle than that, because the executable has to be in the new process’s path. This process inherits the parent’s environment, including the PATH environment variable. The bin directory within alpine has become /bin for the new process, and assuming that your regular path includes /bin, you can pick up the ls executable from that directory without specifying its path explicitly.
vagrant@myhost:~$ sudo chroot alpine ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
vagrant@myhost:~$
Notice that it is only the child process (in that example, the process that ran ls) that gets the new root directory. When that process finishes, control returns to the parent process. If you run a shell as the child process, it won’t complete immediately, which makes it easier to see the effects of changing the root directory.
vagrant@myhost:~$ sudo chroot alpine sh
/$ ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
/$ whoami
root
/$ exit
vagrant@myhost:~$
If you try to run the bash shell, it won’t work. This is because the Alpine distribution doesn’t include it, so it’s not present inside the new root directory. If you tried the same thing with the filesystem of a distribution like Ubuntu which does include bash, it would work.
To summarize, chroot literally “changes the root” for a process. After changing the root, the process (and its children) will only be able to access the files and directories that are lower in the hierarchy than the new root directory.
In addition to chroot, there is also a system call called pivot_root. This allows the containerized process to store information about its previous root before changing to a new one; this makes it possible to pivot back to the old root again. Whether chroot or pivot_root is used is an implementation detail; the key point is that a container needs to have its own root directory.
It’s possible for a process with root privileges to break out of a chroot (see, for example, https://warsang.ovh/prison-break-chroot/).
You have now seen how a container can be given its own root filesystem. I’ll discuss this further in [Link to Come], but now let’s see how having its own root filesystem allows the kernel to show a container just a restricted view of namespaced resources.
So far you have seen namespacing and changing the root as two separate things, but you can combine the two by running chroot in a new namespace.
vagrant@myhost:~$ sudo unshare --pid --fork chroot alpine sh
/$ ls
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
As discussed earlier in this chapter (see “Isolating process IDs”), giving the container its own root directory allows it to have a /proc directory that’s independent of /proc on the host. For this to be populated with process information, you will need to mount it as a pseudo-filesystem of type proc. With the combination of a process ID namespace and an independent /proc directory, ps will now show just the processes that are inside the process ID namespace.
/$ mount -t proc proc proc
/$ ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    6 root      0:00 ps
/$ exit
vagrant@myhost:~$
Success! It has been more complex than isolating the container’s hostname, but through the combination of creating a process ID namespace, changing the root directory, and mounting a pseudo-filesystem to handle process information, you can limit a container so that it only has a view of its own processes.
There are more namespaces left to explore. Let’s see the mount namespace next.
Typically you don’t want a container to have all the same file system mounts as its host. Giving the container its own mount namespace achieves this separation.
Here’s an example that creates a simple bind mount for a process with its own mount namespace.
vagrant@myhost:~$ sudo unshare --mount sh
$ mkdir source
$ touch source/HELLO
$ ls source
HELLO
$ mkdir target
$ ls target
$ mount --bind source target
$ ls target
HELLO
Once the bind mount is in place, the contents of the source directory are also available in target. If you look at all the mounts from within this process, there will probably be a lot of them, but the following command finds the target you created if you followed the example above.
$ findmnt target
TARGET               SOURCE                                              FSTYPE OPTIONS
/home/vagrant/target /dev/mapper/vagrant--vg-root[/home/vagrant/source] ext4   rw,relatime,errors=remount-ro,data=ordered
From the host’s perspective, this isn’t visible, which you can prove by running the same command from another terminal window and confirming that it doesn’t return anything.
Try running findmnt from within the mount namespace again, but this time without any parameters, and you will get a long list. You might be thinking that it seems wrong for a container to be able to see all the mounts on the host. This is a very similar situation to what you saw with the process ID namespace: the kernel uses the /proc/<PID>/mounts file to communicate information about mount points for each process. If you create a process with its own mount namespace, but it is using the host’s /proc directory, you’ll find that its /proc/<PID>/mounts file includes all the pre-existing host mounts. (You can simply cat this file to get a list of mounts.)
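As an illustration of what that file holds, the sketch below turns each of its lines into something resembling the output of the mount command. The function name list_mounts is made up for this example, and it assumes the usual layout of six whitespace-separated fields per line.

```shell
# Print the mount table that the kernel exposes for a process, in a
# format similar to the output of the mount command.
# Each line of /proc/<pid>/mounts has the fields:
#   source  mount-point  filesystem-type  options  dump  pass
list_mounts() {
  while read -r source target fstype options rest; do
    printf '%s on %s type %s (%s)\n' "$source" "$target" "$fstype" "$options"
  done < "/proc/${1:-self}/mounts"
}

list_mounts
```

Called with no argument it reads /proc/self/mounts, so it reports the mounts visible to the calling process; passing a process ID reads that process’s table instead.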
To get a fully-isolated set of mounts for the containerized process, you will need to combine creating a new mount namespace with a new root filesystem and a new proc mount, like this:
vagrant@myhost:~$ sudo unshare --mount chroot alpine sh
/$ mount -t proc proc proc
/$ mount
proc on /proc type proc (rw,relatime)
/$ mkdir source
/$ touch source/HELLO
/$ mkdir target
/$ mount --bind source target
/$ mount
proc on /proc type proc (rw,relatime)
/dev/sda1 on /target type ext4 (rw,relatime,data=ordered)
Alpine Linux doesn’t come with the findmnt command, so the example above uses mount with no parameters to generate the list of mounts. (If you are cynical about this change, try the earlier example with mount instead of findmnt to check that you get the same results.)
You may be familiar with the concept of mounting host directories into a container using docker run -v <host directory>:<container directory> .... To achieve this, after the root filesystem has been put in place for the container, the target container directory is created and then the source host directory gets bind mounted into that target. Because each container has its own mount namespace, host directories mounted like this are not visible from other containers.
If you create a mount that is visible to the host, it won’t automatically get cleaned up when your “container” process terminates. You will need to destroy these using umount. This also applies to the /proc pseudo-filesystems. They won’t do any particular harm but if you like to keep things tidy, you can remove them with umount proc. The system won’t let you unmount the final /proc used by the host.
The network namespace allows a container to have its own view of network interfaces and routing tables. When you create a process with its own network namespace, you can see it with lsns.
vagrant@myhost:~$ sudo lsns -t net
        NS TYPE NPROCS   PID USER    NETNSID NSFS COMMAND
4026531992 net      93     1 root unassigned      /sbin/init
vagrant@myhost:~$ sudo unshare --net bash
root@myhost:~$ lsns -t net
        NS TYPE NPROCS   PID USER    NETNSID NSFS COMMAND
4026531992 net      92     1 root unassigned      /sbin/init
4026532192 net       2 28586 root unassigned      bash
You might come across the ip netns command, but this is not much use to us here. Using unshare --net creates an anonymous network namespace, and these don’t appear in the output from ip netns list.
When you put a process into its own network namespace, it starts with just the loopback interface.
vagrant@myhost:~$ sudo unshare --net bash
root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
With nothing but a loopback interface, your container won’t be able to communicate. To give it a path to the outside world, you create a virtual ethernet interface - or strictly, a pair of virtual ethernet interfaces, metaphorically like a cable that connects your container namespace to the default network namespace.
In a second terminal window, as root, you can create a virtual ethernet pair by specifying the anonymous namespaces associated with their process IDs, like this:
root@myhost:~$ ip link add ve1 netns 28586 type veth peer name ve2 netns 1
ip link add indicates that you want to add a link
ve1 is the name of one “end” of the virtual ethernet “cable”
netns 28586 says that this end is “plugged in” to the network namespace associated with process ID 28586 (which is shown in the output from lsns -t net above)
type veth shows that this is a virtual ethernet pair
peer name ve2 gives the name of the other end of the “cable”
netns 1 specifies that this second end is “plugged in” to the network namespace associated with process ID 1.
The ve1 virtual ethernet interface is now visible from inside the “container” process:
root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 7a:8a:3f:ba:61:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0
The link is in “DOWN” state and needs to be brought up before it’s any use. Both ends of the connection need to be brought up.
Bring up the ve2 end on the host:
root@myhost:~$ ip link set ve2 up
And once you bring up the ve1 end in the container, the link should move to “UP” state:
root@myhost:~$ ip link set ve1 up
root@myhost:~$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ve1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 7a:8a:3f:ba:61:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::788a:3fff:feba:612c/64 scope link
       valid_lft forever preferred_lft forever
To send IP traffic there needs to be an IP address associated with each interface. In the container:
root@myhost:~$ ip addr add 192.168.1.100/24 dev ve1
And on the host:
root@myhost:~$ ip addr add 192.168.1.200/24 dev ve2
This will also have the effect of adding an IP route into the routing table in the container:
root@myhost:~$ ip route
192.168.1.0/24 dev ve1 proto kernel scope link src 192.168.1.100
As mentioned at the start of this section, the network namespace isolates both the interfaces and the routing table, so this routing information is independent of the IP routing table on the host. At this point the container can only send traffic to 192.168.1.0/24 addresses. You can test this out with a ping from within the container to the remote end:
root@myhost:~$ ping 192.168.1.100
PING 192.168.1.100 (192.168.1.100) 56(84) bytes of data.
64 bytes from 192.168.1.100: icmp_seq=1 ttl=64 time=0.355 ms
64 bytes from 192.168.1.100: icmp_seq=2 ttl=64 time=0.035 ms
^C
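Pulling the individual commands together, the whole wiring can be sketched as a small script. This is only an illustration under stated assumptions: the function names run and connect_namespace and the DRY_RUN switch are made up here, the host end of the pair is taken to be ve2, and the container-side commands are issued from the host using nsenter rather than typed into the container’s own shell. By default the script only prints the commands; running them for real needs root.

```shell
# Print each command, and execute it only when DRY_RUN is set to
# something other than 1 (hypothetical switch; defaults to printing).
run() {
  echo "+ $*"
  [ "${DRY_RUN:-1}" = 1 ] || "$@"
}

# Connect the network namespace of process $1 to the namespace of
# process 1 with a veth pair, then bring up and address both ends.
connect_namespace() {
  pid=$1
  run ip link add ve1 netns "$pid" type veth peer name ve2 netns 1
  run ip link set ve2 up                                   # host end
  run nsenter -t "$pid" -n ip link set ve1 up              # container end
  run nsenter -t "$pid" -n ip addr add 192.168.1.100/24 dev ve1
  run ip addr add 192.168.1.200/24 dev ve2
}

connect_namespace 28586
```

Printing the commands first is a deliberate choice for a sketch like this, since mistakes in network configuration on a real host can be awkward to undo. Setting DRY_RUN=0 before calling the function would execute the commands.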
We will dig further into networking and container network security in [Link to Come].
The user namespace allows processes to have their own view of user and group IDs. Much like process IDs, the users and groups still exist on the host, but they can have different IDs. The main benefit of this is that you can map the root ID of 0 within a container to some other non-root identity on the host. This is a huge advantage from a security perspective, since it allows software to run as root inside a container, but an attacker who escapes from the container to the host will have a non-root, unprivileged identity. As you’ll see in [Link to Come] it’s not hard to misconfigure a container to make it easy to escape to the host. With user namespaces, you’re not just one false move away from host takeover.
Generally speaking, you need to be root in order to create new namespaces, which is why the Docker daemon runs as root, but the user namespace is an exception.
vagrant@myhost:~$ unshare --user bash
nobody@myhost:~$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
nobody@myhost:~$ echo $$
31196
Inside the new user namespace the user has the nobody ID. You need to put in place a mapping between user IDs inside and outside the namespace. This mapping exists in /proc/<pid>/uid_map which you can edit as root (on the host). There are three fields in this file:
The lowest ID to map from the child process’s perspective
The lowest corresponding ID this should map to on the host
The number of IDs to be mapped
As an example, on my machine, the vagrant user has ID 1000. In order to have vagrant get assigned the root ID of 0 inside the child process, the first two fields are 0 and 1000. The last field can be 1 if you only want to map one ID (which may well be the case if you only want one user inside the container). Here’s the command I used to set up that mapping:
vagrant@myhost:~$ sudo echo '0 1000 1' > /proc/31196/uid_map
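The three fields can be read back and spelled out with a few lines of shell. The following is a sketch rather than a definitive tool; the function name describe_id_map is made up, and it accepts any file in the uid_map format so that you can also try it on a sample file.

```shell
# Describe each line of a uid_map (or gid_map) file.
# Fields: first ID inside the namespace, first ID outside, number of IDs.
describe_id_map() {
  while read -r inside outside count; do
    printf 'IDs %s-%s inside map to %s-%s outside\n' \
      "$inside" "$((inside + count - 1))" \
      "$outside" "$((outside + count - 1))"
  done < "$1"
}

# Mapping for the current process; in the initial user namespace this
# typically shows the whole ID range mapped to itself.
describe_id_map /proc/self/uid_map
```

For the mapping written above, the function would report that ID 0 inside the namespace maps to ID 1000 outside. Note that in a command of the form sudo echo ... > file, the redirection is performed by the invoking shell rather than by sudo; piping into sudo tee is a common alternative when the write itself needs root privileges.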
Immediately, inside its user namespace, the process has taken on the root identity. Don’t be put off by the fact that the bash prompt still says “nobody”; this doesn’t get updated unless you re-run the scripts that get run when you start a new shell (e.g. ~/.bash_profile).
nobody@myhost:~$ id
uid=0(root) gid=65534(nogroup) groups=65534(nogroup)
A similar mapping process is used to map the group(s) used inside the child process.
This process is now running with a large set of capabilities:
nobody@myhost:~$ capsh --print | grep Current
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+ep
As you saw in ???, these capabilities allow the process various permissions. When you create a new user namespace the kernel gives the process all these capabilities, so that the pseudo-root user inside the namespace is allowed to create other namespaces, set up networking and so on, fulfilling everything else required to make it a real container.
You can enable the use of user namespaces in Docker, but it’s not turned on by default because it is incompatible with a few things that Docker users might want to do. These will also affect you if you use user namespaces with other container runtimes.
User namespaces are incompatible with sharing a process ID or network namespace with the host.
Even if the process is running as root inside the container, it doesn’t really have full root privileges. It doesn’t for example have CAP_NET_BIND_SERVICE so it can’t bind to a low numbered port. (See for more information about Linux capabilities.)
When the containerized process interacts with a file, it will need appropriate permissions (for example, write access in order to modify the file). If the file is mounted from the host, it is the effective user ID on the host that matters. This is a good thing in terms of protecting the host files from unauthorized access from within a container, but it can be confusing if, say, what appears to be root inside the container is not permitted to modify a file.
As noted above, you don’t need to be root to create a new user namespace, but you will have a full set of capabilities in the new process. In fact, if you simultaneously create a process with several new namespaces, the user namespace will be created first so that you have the full capability set that permits you to create other namespaces.
vagrant@myhost:~$ unshare --uts bash
unshare: unshare failed: Operation not permitted
vagrant@myhost:~$ unshare --uts --user bash
nobody@myhost:~$
User namespaces allow an unprivileged user to effectively become root within the containerized process. This allows a normal user to run containers using a concept called rootless containers that we will cover in [Link to Come].
The general consensus is that user namespaces are a security benefit because fewer containers need to run as “real” root (that is, root from the host perspective). However, there have been a few vulnerabilities (for example CVE-2018-18955) directly related to privileges being incorrectly transformed when transitioning to or from a user namespace. The Linux kernel is a complex piece of software, and you should expect that people will find problems in it from time to time.
In Linux it’s possible to communicate between different processes by giving them access to a shared range of memory, or by using a shared message queue. The two processes need to be members of the same inter-process communications (IPC) namespace for them to have access to the same set of identifiers for these mechanisms. Generally speaking, you don’t want your containers to be able to access each other’s shared memory, so they are given their own IPC namespaces.
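One way to check whether two processes are members of the same IPC namespace is to compare the symbolic links under /proc/<pid>/ns/ipc, which the kernel presents in the form ipc:[inode]. The sketch below assumes a Linux host with /proc mounted; the helper names are invented for this example, and reading the link for another user’s process may need extra privileges.

```python
import os
import re


def parse_ns_link(link):
    """Split a namespace link such as 'ipc:[4026531839]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def ipc_namespace(pid="self"):
    """Return the inode number identifying a process's IPC namespace."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/ipc"))[1]


def share_ipc_namespace(pid_a, pid_b):
    """True if both processes would see the same set of IPC identifiers."""
    return ipc_namespace(pid_a) == ipc_namespace(pid_b)


if __name__ == "__main__":
    print("IPC namespace of this process:", ipc_namespace())
```

Two processes for which share_ipc_namespace returns False cannot refer to each other’s shared memory segments or message queues by identifier.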
You can see this in action by creating a shared memory block, and then viewing the current IPC status with ipcs:
$ ipcmk -M 1000
Shared memory id: 98307
$ ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        80         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2
0xad291bee 98307      ubuntu     644        1000       0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x000000a7 0          root       600        1
In this example the newly-created shared memory block (with its ID in the shmid column) appears as the last item in the “Shared Memory Segments” block. There are also some pre-existing IPC objects that had previously been created by root.
A process with its own IPC namespace does not see any of these IPC objects:
$ sudo unshare --ipc sh
$ ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
The last of the namespaces (at least, at the time of writing this book) is the cgroup namespace. This is a little bit like a chroot for the cgroup filesystem; it stops a process from seeing the cgroup configuration higher up the hierarchy of cgroup directories than its own cgroup.
Most namespaces were added by Linux kernel version 3.8, but the cgroup namespace was added later in version 4.6. If you’re using a relatively old distribution of Linux (such as Ubuntu 16.04) you won’t have support for this feature. You can check the kernel version on your Linux host by running uname -r.
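The version check can be scripted. This is a rough sketch that compares the release string reported by uname -r against 4.6; the function names are hypothetical. Treat the result as a hint only, since distributions sometimes backport features to older kernels. On a running system, the presence of /proc/self/ns/cgroup is a more direct signal.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a release string like '4.15.0-112-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_cgroup_namespaces(release):
    """True if the release is at least 4.6, when cgroup namespaces arrived."""
    return kernel_version(release) >= (4, 6)


if __name__ == "__main__":
    release = platform.release()  # the same string that `uname -r` prints
    print(release, has_cgroup_namespaces(release))
```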
You can see the cgroup namespace in action by comparing the contents of /proc/self/cgroup outside and then inside a cgroup namespace:
vagrant@myhost:~$ cat /proc/self/cgroup
12:cpu,cpuacct:/
11:cpuset:/
10:hugetlb:/
9:blkio:/
8:memory:/user.slice/user-1000.slice/session-51.scope
7:pids:/user.slice/user-1000.slice/session-51.scope
6:freezer:/
5:devices:/user.slice
4:net_cls,net_prio:/
3:rdma:/
2:perf_event:/
1:name=systemd:/user.slice/user-1000.slice/session-51.scope
0::/user.slice/user-1000.slice/session-51.scope
vagrant@myhost:~$
vagrant@myhost:~$ sudo unshare --cgroup bash
root@myhost:~# cat /proc/self/cgroup
12:cpu,cpuacct:/
11:cpuset:/
10:hugetlb:/
9:blkio:/
8:memory:/
7:pids:/
6:freezer:/
5:devices:/
4:net_cls,net_prio:/
3:rdma:/
2:perf_event:/
1:name=systemd:/
0::/
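Each line of /proc/self/cgroup has three colon-separated fields: a hierarchy ID, a comma-separated controller list (empty for the cgroup v2 entry), and the cgroup path as seen from the reader’s cgroup namespace. The sketch below parses that layout so the two outputs above can be compared programmatically; the function names are invented for this illustration.

```python
def parse_cgroup_file(text):
    """Parse /proc/<pid>/cgroup content into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice so that a path containing ':' stays intact.
        hierarchy_id, controllers, path = line.split(":", 2)
        controller_list = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), controller_list, path))
    return entries


def visible_parent_paths(text):
    """Return entries whose path is not '/', meaning the reader can see
    part of the hierarchy above the process's own cgroup."""
    return [entry for entry in parse_cgroup_file(text) if entry[2] != "/"]


if __name__ == "__main__":
    with open("/proc/self/cgroup") as cgroup_file:
        for entry in parse_cgroup_file(cgroup_file.read()):
            print(entry)
```

Applied to the first listing, visible_parent_paths returns the memory, pids, devices, name=systemd, and unified entries; applied to the listing taken inside the new cgroup namespace, it returns an empty list.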
You have now explored all the different types of namespace, and seen how they are used along with chroot to isolate a process’s view of its surroundings. Combine this with what you learned about cgroups in the previous chapter, and you should have a good understanding of everything that’s needed to make what we call a “container”.
Before moving on to the next chapter, it’s worth taking a look at a container from the perspective of the host it’s running on.
Although they are called containers, it might be more accurate to use the term “containerized processes”. A container is still a Linux process, running on the host machine - it just has a limited view of that host machine, and it only has access to a subtree of the file system and perhaps to a limited set of resources restricted by cgroups. Because it’s really just a process, it exists within the context of the host operating system, and it shares the host’s kernel.
Let’s end this chapter by examining in more detail the extent to which a containerized process is isolated from the host, and from other containerized processes on that host, by trying some experiments on a Docker container. Start a container process based on Ubuntu (or your favourite Linux distribution) and run a shell in it, and then run a long sleep in it as follows:
$ docker run --rm -it ubuntu bash
root@1551d24a$ sleep 1000
This example runs the sleep command for 1000 seconds, but note that the sleep command is running as a process inside the container. When you press enter at the end of the sleep command, this triggers Linux to clone a new process with a new process ID, and run the sleep executable within that process.
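The same sequence can be observed from a program instead of a shell. In this small sketch Python’s subprocess module stands in for the shell: it asks the kernel for a child process and runs the sleep executable in it. The function name run_sleep is made up for the example, and it assumes a sleep binary is on the PATH.

```python
import os
import subprocess


def run_sleep(seconds):
    """Start `sleep` as a child process; return (our PID, child PID, exit code)."""
    child = subprocess.Popen(["sleep", str(seconds)])
    exit_code = child.wait()  # block until the sleep process has finished
    return os.getpid(), child.pid, exit_code


if __name__ == "__main__":
    parent_pid, child_pid, exit_code = run_sleep(1)
    print(f"parent PID {parent_pid}, sleep PID {child_pid}, exit code {exit_code}")
```

The child always gets its own process ID, distinct from the parent’s, which is the ID you will be looking for with ps in the steps that follow.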
You can put the sleep process into the background (Ctrl-Z to pause the process, and bg %1 to put it in the background). Now run ps inside the container to see the same process from the container’s perspective.
me@myhost:~$ docker run --rm -it ubuntu bash
root@ab6ea36fce8e:/$ sleep 1000
^Z
[1]+  Stopped                 sleep 1000
root@ab6ea36fce8e:/$ bg %1
[1]+ sleep 1000 &
root@ab6ea36fce8e:/$ ps
  PID TTY          TIME CMD
    1 pts/0    00:00:00 bash
   10 pts/0    00:00:00 sleep
   11 pts/0    00:00:00 ps
root@ab6ea36fce8e:/$
While that sleep command is still running, open a second terminal into the same host and look at the same sleep process from the host’s perspective.
me@myhost:~$ ps -C sleep
  PID TTY          TIME CMD
30591 pts/0    00:00:00 sleep
The -C sleep parameter specifies that we are only interested in processes running the sleep executable.
The container has its own process ID namespace, so it makes sense that its processes will have low numbers, and that is indeed what you see when running ps in the container. From the host’s perspective, however, the sleep process has a different, high-numbered process ID. In the example above there is just one process, and it has ID 30591 on the host and 10 in the container. (The actual number will vary according to what else is, and has been, running on the same machine, but it’s likely to be a much higher number.)
In order to get a good understanding of containers and the level of isolation they provide, it’s really key to get to grips with the fact that although there are two different process IDs, they both refer to the same process. It’s just that from the host’s perspective it has a higher process ID number.
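On kernels from version 4.1 onwards, the host can confirm that the two IDs belong to one process by reading the NSpid field of /proc/<pid>/status, which lists the process ID in each nested PID namespace, outermost first. The following is a minimal sketch of reading that field; the function names are invented for this example, and the sample text is an illustrative fragment built from the IDs above, not captured output.

```python
def parse_nspid(status_text):
    """Return the PIDs listed in the NSpid field of /proc/<pid>/status.

    The first value is the PID in the reader's PID namespace; each later
    value is the PID in a more deeply nested namespace.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(value) for value in line.split()[1:]]
    return []


def nspid_of(pid):
    """Read and parse NSpid for a process, as seen from this namespace."""
    with open(f"/proc/{pid}/status") as status:
        return parse_nspid(status.read())


if __name__ == "__main__":
    # Illustrative fragment using the process IDs from the example above.
    sample = "Name:\tsleep\nPid:\t30591\nNSpid:\t30591\t10\n"
    print(parse_nspid(sample))  # [30591, 10]
```

Calling nspid_of(30591) on the host in the example would be expected to yield both numbers, whereas a process that is not in a nested PID namespace has a single entry.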
The fact that container processes are visible from the host is one of the fundamental differences between containers and virtual machines. An attacker who gets access to the host can observe and affect all the containers running on that host, especially if they have root access. And as you’ll see in [Link to Come], there are some remarkably easy ways you can inadvertently make it possible for an attacker to move from a compromised container onto the host.
Congratulations! By reaching the end of this chapter, you should now know what a container really is. You’ve seen the three essential Linux kernel mechanisms that are used to limit a process’s access to host resources:
Namespaces limit what the container process can see - for example, giving the container an isolated set of process IDs.
Changing the root limits the set of files and directories that the container can see.
Cgroups control the resources the container can access.
Now that you know how containers work, you might want to explore Jess Frazelle’s contained.af site to see just how effective they are. Will you be the person who breaks the containment?
In the next chapter we’ll contrast that with how virtual machines work, so that we can consider the relative strengths of the isolation between containers, and between VMs, especially through the lens of security.