In the previous chapter, we learned how to build a Docker image and the very basic steps required for running the resulting image within a container. In this chapter, we’ll first take a look at where containers came from and then dive deeper into containers and the Docker commands that control the overall configuration, resources, and privileges that your container receives.
You might be familiar with virtualization systems like VMware or Xen that allow you to run a complete Linux kernel and operating system on top of a virtualized layer, commonly called a hypervisor. This approach provides very strong isolation between virtual machines because each hosted kernel sits in separate memory space and has defined entry points into the actual hardware, either through another kernel or something that looks like hardware.
Containers are a fundamentally different approach where all containers share a single kernel and isolation is implemented entirely within that single kernel. This is called operating system virtualization. The libcontainer project gives a good, short definition of a container: “A container is a self-contained execution environment that shares the kernel of the host system and which is (optionally) isolated from other containers in the system.” The major advantages are around efficiency of resources because you don’t need a whole operating system for each isolated function. Since you are sharing a kernel, there is one less layer of indirection between the isolated task and the real hardware underneath. When a process is running inside a container, there is only a very thin shim that sits inside the kernel, rather than potentially calling up into a whole second kernel while bouncing in and out of privileged mode on the processor.
But the container approach means that you can only run processes that are compatible with the underlying kernel. Unlike hardware virtualization like that provided by VMware, for example, Windows applications cannot run inside a Linux container. So containers are best thought of as a Linux technology where, at least for now, you can run any of your favorite Linux applications or servers. When thinking of containers, you should try very hard to throw out what you might already know about virtual machines and instead conceptualize a container as a wrapper around a process that actually runs on the server.
It is often the case that a revolutionary technology is an older technology that has finally arrived in the spotlight. Technology goes in waves, and some of the ideas from the 1960s are back in vogue. Similarly, Docker is a new technology and it has an ease of use that has made it an instant hit, but it doesn’t exist in a vacuum. Much of what underpins Docker comes from work done over the last 30 years in a few different arenas: from a system call added to the Unix kernel in the late 1970s, to tooling built on modern Linux. It’s worth a quick tour through how we got to Docker because understanding that helps you place it within the context of other things you might be familiar with.
Containers are not a new idea. They are a way to isolate and encapsulate a part of the running system. The oldest technologies in that area were the first batch processing systems. You’d run a program for a while, then switch to run another program. There was isolation: you could make sure your program didn’t step on anyone else’s program. That’s all pretty crude now, but it’s the very first step on the road to Linux containers and Docker.
Most people would argue that the seeds for today’s containers were planted in 1979 with the addition of the chroot system call to Version 7 Unix. chroot restricts a process’s view of the underlying filesystem. The chroot system call is commonly used to protect the operating system from untrusted server processes like FTP, BIND, and Sendmail, which are publicly exposed and susceptible to compromise.
In the 1980s and 1990s, various Unix variants were created with mandatory access controls for security reasons.1 This meant you had tightly controlled domains running on the same Unix kernel. Processes in each domain had an extremely limited view of the system that precluded them from interacting across domains. A popular commercial version of Unix that implemented this idea was the Sidewinder firewall built on top of BSDI Unix. But this was not possible in most mainstream Unix implementations.
That changed in 2000 when FreeBSD 4.0 was released with a new command, called jail, which was designed to allow shared-environment hosting providers to easily and securely create a separation between their processes and those of their individual customers. FreeBSD jail expanded chroot’s capabilities and also restricted everything a process could do with the underlying system and with processes in other jails.
In 2004, Sun released an early build of Solaris 10, which included Solaris Containers, and later evolved into Solaris Zones. This was the first major commercial implementation of container technology and is still used today to support many commercial container implementations. In 2007, HP released Secure Resource Partitions for HP-UX, later renamed to HP-UX Containers; and finally, in 2008, Linux Containers (LXC) were released in version 2.6.24 of the Linux kernel. The phenomenal growth of Linux Containers across the community did not really start until 2013, with the inclusion of user namespaces in version 3.8 of the Linux kernel and the release of Docker one month later.
Companies that had to deal with scaling applications to the size of the Internet, with Google being a very early example, started pushing container technology in the early 2000s in order to facilitate distributing their applications across data centers full of computers. A few companies maintained their own patched kernels with container support for internal use. Google contributed some of its work to support containers into the mainline Linux kernel, as understanding about the broader need for these features began to increase in the Linux community.
In late 2013, months after the Docker announcement, Google released lmctfy, the open source version of the internal container engine it had been running for some years. By this time, Docker was already widely discussed in the press. It was the right combination of ease of use and enabling technology just at the right time. Other promising container engines, like CoreOS Rocket, have been released since, but Docker seems to have built up a head of steam that is currently powering it to the forefront.
If you haven’t heard about CoreOS Rocket, you might be wondering what it is. Rocket is an open source container runtime that CoreOS is designing to address what they see as serious deficiencies with the Docker approach to containerization and the supporting tool set. It is left as an exercise for the reader to determine whether the CoreOS approach and solution fits your needs.
Now let’s turn back to Docker and take a closer look at modern containers.
So far we’ve started containers using the handy docker run
command. But docker run
is really a convenience command that wraps two separate steps into one. The first thing it does is create a container from the underlying image. This is accomplished separately using the docker create
command. The second thing docker run
does is execute the container, which we can also do separately with the docker start
command.
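To make the two steps concrete, here is a minimal shell sketch that performs them by hand. The helper name run_in_two_steps is hypothetical (it is not a Docker command), and the usage line assumes a reachable Docker daemon and the ubuntu:latest image used elsewhere in this chapter. Because docker start does not attach to the container by default, the result is roughly what docker run -d would give you.

```shell
# Sketch: reproduce `docker run -d` as two explicit steps.
# run_in_two_steps is a hypothetical helper name, not part of Docker.
run_in_two_steps() {
    # Step 1: create the container; `docker create` prints the new container's ID.
    container_id=$(docker create "$@") || return 1
    # Step 2: start the container that was just created.
    docker start "$container_id"
}

# Usage (requires a running Docker daemon):
#   run_in_two_steps ubuntu:latest sleep 1000
```

Everything passed to the helper is handed to docker create unchanged, which mirrors the point that the creation step is where the container’s options are set.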
The docker create
and docker run
commands both contain all the options that pertain to how a container is initially set up. In Chapter 4, we demonstrated that with the docker run
command you could map network ports in the underlying container to the host using the -p
argument, and that -e
could be used to pass environment variables into the container.
This only just begins to touch on the array of things that you can configure when you first create a container. So let’s take a pass over some of the options that docker
supports.
Now let’s take a look at some of the ways we can tell Docker to configure our container when we create it.
When you create a container, it is built from the underlying image, but various command-line arguments can affect the final settings. Settings specified in the Dockerfile are always used as defaults, but you can override many of them at creation time.
By default, Docker randomly names your container by combining an adjective with the name of a famous person. This results in names like ecstatic-babbage and serene-albattani. If you want to give your container a specific name, you can do so using the --name
argument.
$ docker create --name="awesome-service" ubuntu:latest
As mentioned in Chapter 4, labels are key-value pairs that can be applied to Docker images and containers as metadata. When new Docker containers are created, they automatically inherit all the labels from their parent image.
It is also possible to add new labels to the containers so that you can apply metadata that might be specific to that single container.
$ docker run -d --name labels -l deployer=Ahmed -l tester=Asako ubuntu:latest sleep 1000
You can then search for and filter containers based on this metadata, using commands like docker ps.
$ docker ps -a -f label=deployer=Ahmed
CONTAINER ID  IMAGE          COMMAND       ...  NAMES
845731631ba4  ubuntu:latest  "sleep 1000"  ...  labels
You can use the docker inspect
command on the container to see all the labels that a container has.
$ docker inspect 845731631ba4
...
"Labels": {
    "deployer": "Ahmed",
    "tester": "Asako"
},
...
By default, when you start a container, Docker will copy certain system files on the host, including /etc/hostname, into the container’s configuration directory on the host,2 and then use a bind mount to link that copy of the file into the container. We can launch a default container with no special configuration like this:
$ docker run --rm -ti ubuntu:latest /bin/bash
This command uses the docker run command, which runs docker create and docker start in the background. Since we want to be able to interact with the container that we are going to create for demonstration purposes, we pass in a few useful arguments. The --rm argument tells Docker to delete the container when it exits, the -t argument tells Docker to allocate a pseudo-TTY, and the -i argument tells Docker that this is going to be an interactive session, and we want to keep STDIN open. The final argument in the command is the executable that we want to run within the container, which in this case is the ever useful /bin/bash.
If we now run the mount
command from within the resulting container, we will see something similar to this:
root@ebc8cf2d8523:/# mount
overlay on / type overlay (rw,relatime,lowerdir=...,upperdir=...,workdir...)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,mode=755)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,...,ptmxmode=666)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
/dev/sda9 on /etc/resolv.conf type ext4 (rw,relatime,data=ordered)
/dev/sda9 on /etc/hostname type ext4 (rw,relatime,data=ordered)
/dev/sda9 on /etc/hosts type ext4 (rw,relatime,data=ordered)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,...,ptmxmode=000)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,mode=755)
root@ebc8cf2d8523:/#
When you see any examples with a prompt that looks something like root@hostname, it means that you are running a command within the container instead of on the Docker host.
There are quite a few bind mounts in a container, but in this case we are interested in this one:
/dev/sda9 on /etc/hostname type ext4 (rw,relatime,data=ordered)
While the device number will be different for each container, the part we care about is that the mount point is /etc/hostname. This links the container’s /etc/hostname to the hostname file that Docker has prepared for the container, which by default contains the container’s ID and is not fully qualified with a domain name.
We can check this in the container by running the following:
root@ebc8cf2d8523:/# hostname -f
ebc8cf2d8523
root@ebc8cf2d8523:/# exit
Don’t forget to exit
the container shell so that we return to the Docker host when finished.
To set the hostname specifically, we can use the --hostname
argument to pass in a more specific value.
$ docker run --rm -ti --hostname="mycontainer.example.com" ubuntu:latest /bin/bash
Then, from within the container, we will see that the fully-qualified hostname is defined as requested.
root@mycontainer:/# hostname -f
mycontainer.example.com
root@mycontainer:/# exit
Just like /etc/hostname, the resolv.conf file is managed via a bind mount between the host and container.
/dev/sda9 on /etc/resolv.conf type ext4 (rw,relatime,data=ordered)
By default, this is an exact copy of the Docker host’s resolv.conf file. If we didn’t want this, we could use a combination of the --dns
and --dns-search
arguments to override this behavior in the container:
$ docker run --rm -ti --dns=8.8.8.8 --dns=8.8.4.4 --dns-search=example1.com --dns-search=example2.com ubuntu:latest /bin/bash
If you want to leave the search domain completely unset, then use --dns-search=.
Within the container, we would still see a bind mount, but the file contents would no longer reflect the host’s resolv.conf; instead, it now looks like this:
root@0f887071000a:/# more /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
search example1.com example2.com
root@0f887071000a:/# exit
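The relationship between the --dns and --dns-search arguments and the resulting file can be modeled with a few lines of shell. This is only an illustrative sketch of the file contents shown above, not Docker’s actual implementation; render_resolv_conf is a hypothetical helper name.

```shell
# render_resolv_conf "SERVERS" "SEARCH_DOMAINS"
# Hypothetical helper: prints resolv.conf-style text for a space-separated
# list of DNS servers and a space-separated list of search domains.
render_resolv_conf() {
    # One "nameserver" line per server (one per --dns argument).
    for server in $1; do
        echo "nameserver $server"
    done
    # A single "search" line listing every domain (one per --dns-search argument).
    if [ -n "$2" ]; then
        echo "search $2"
    fi
}

render_resolv_conf "8.8.8.8 8.8.4.4" "example1.com example2.com"
```

Running the sketch prints the same three lines that the container’s /etc/resolv.conf contains in the example.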
Another important piece of information that you can configure is the MAC address for the container.
Without any configuration, a container will receive a calculated MAC address that starts with the 02:42:ac:11 prefix.
If you need to specifically set this to a value, you can do this by running something similar to this:
$ docker run --rm -ti --mac-address="a2:11:aa:22:bb:33" ubuntu:latest /bin/bash
Normally you will not need to do that. But sometimes you want to reserve a particular set of MAC addresses for your containers in order to avoid other virtualization layers that use the same private block as Docker.
Be very careful when customizing the MAC address settings. It is possible to cause ARP contention on your network if two systems advertise the same MAC address. If you have a strong need to do this, try to keep your locally administered address ranges within some of the official ranges, like x2-xx-xx-xx-xx-xx, x6-xx-xx-xx-xx-xx, xA-xx-xx-xx-xx-xx, and xE-xx-xx-xx-xx-xx (with x being any valid hexadecimal character).
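The four ranges listed above share one property: in the first octet, the “locally administered” bit (0x02) is set and the multicast bit (0x01) is clear. A small shell sketch can check a candidate address before you pass it to --mac-address. The helper name is_locally_administered is hypothetical, and the sketch assumes a colon-separated address.

```shell
# Succeeds (exit status 0) when the address is a locally administered
# unicast MAC. is_locally_administered is a hypothetical helper name.
is_locally_administered() {
    first_octet=${1%%:*}    # hex digits before the first colon
    # Bit 0x02 set means locally administered; bit 0x01 clear means unicast.
    [ $(( 0x$first_octet & 3 )) -eq 2 ]
}

# The last address is an arbitrary example that falls outside the ranges.
for mac in a2:11:aa:22:bb:33 02:42:ac:11:00:02 00:11:22:33:44:55; do
    if is_locally_administered "$mac"; then
        echo "$mac is locally administered"
    else
        echo "$mac is not locally administered"
    fi
done
```

Both the address from the earlier example and Docker’s default 02:42 prefix pass the check, while the third address does not.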
There are times when the default disk space allocated to a container or its ephemeral nature is not appropriate for the job at hand and it is necessary to have storage that can persist between container deployments.
Mounting storage from the Docker host is not a generally advisable pattern because it ties your container to a particular Docker host for its persistent state. But for cases like temporary cache files or other semi-ephemeral states, it can make sense.
For the times when we need to do this, we can leverage the -v argument to mount filesystems from the host server into the container. In the following example, we are mounting /mnt/session_data to /data within the container:
$ docker run --rm -ti -v /mnt/session_data:/data ubuntu:latest /bin/bash
root@0f887071000a:/# mount | grep data
/dev/sda9 on /data type ext4 (rw,relatime,data=ordered)
root@0f887071000a:/# exit
If you have SELinux enabled on your Docker host, you may get a “Permission Denied” error when trying to mount a volume into your container. There are a few ways to handle this. The most direct method is to simply set the right context on the directory that you are trying to mount:
$ sudo chcon -Rt svirt_sandbox_file_t /var/lib/dhcpd
As of Docker 1.7, it is also possible to handle this directly from the Docker command line. If you are going to share a volume between containers, you can use the z
option to the volume mount. This is identical to the above chcon
command:
$ docker run -v /etc/dhcpd:/etc/dhcpd:z dhcpd
However, the best option is to actually utilize the Z
option to the volume mount command, which will set the directory with the exact MCS label (e.g., chcon … -l s0:c1,c2) that the container will be using. This provides for the best security and will only allow a single container to mount the volume:
$ docker run -v /etc/dhcpd:/etc/dhcpd:Z dhcpd
In the mount options, we can see that the filesystem was mounted read-write on /data as we expected.
The mount point in the container does not need to pre-exist for this command to work properly. However, the host mount point must exist. Auto creation of the host directory was deprecated in docker version 1.9.
If the container application is designed to write into /data, then this data will be visible on the host filesystem in /mnt/session_data and would remain available when this container was stopped and a new container started with the same volume mounted.
In Docker 1.5, a new argument was added that allows the root volume of your container to be mounted read-only so that processes within the container cannot write anything to the root filesystem. This prevents things like logfiles, which a developer was unaware of, from filling up the container’s allocated disk in production. When used in conjunction with a mounted volume, you can ensure that data is only written into expected locations.
In our previous example, we could accomplish this by simply adding --read-only=true
to the command.
$ docker run --rm -ti --read-only=true -v /mnt/session_data:/data ubuntu:latest /bin/bash
root@df542767bc17:/# mount | grep " / "
overlay on / type overlay (ro,relatime,lowerdir=...,upperdir=...,workdir=...)
root@df542767bc17:/# mount | grep data
/dev/sda9 on /data type ext4 (rw,relatime,data=ordered)
root@df542767bc17:/# exit
If we look closely at the mount options for the root directory, we will notice that it is mounted with the ro option, which makes it read-only. However, the /data mount (backed by /mnt/session_data on the host) is still mounted with the rw option so that our application can successfully write to the one volume to which we have designed it to write.
Sometimes it is necessary to make a directory like /tmp writeable, even when the rest of the container is read-only. In Docker 1.10, the --tmpfs argument was added to docker run, so that you can mount a tmpfs filesystem into the container. Any data in these tmpfs directories will be lost when the container is stopped. The following command example shows a container being launched with a tmpfs filesystem mounted at /tmp with the rw, noexec, nodev, nosuid, and size=256M mount options set:
$ docker run --rm -ti --read-only=true --tmpfs /tmp:rw,noexec,nodev,nosuid,size=256M ubuntu:latest /bin/bash
root@25b4f3632bbc:/# df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           256M     0  256M   0% /tmp
root@25b4f3632bbc:/# grep /tmp /etc/mtab
tmpfs /tmp tmpfs rw,seclabel,nosuid,nodev,noexec,relatime,size=262144k 0 0
root@25b4f3632bbc:/# exit
When people discuss the types of problems that you must often cope with when working in the cloud, the concept of the “noisy neighbor” is often near the top of the list. The basic problem this term refers to is that other applications, running on the same physical system as yours, can have a noticeable impact on your performance and resource availability.
Traditional virtual machines have the advantage that you can easily and very tightly control how much memory and CPU, among other resources, are allocated to the virtual machine. When using Docker, you must instead leverage the cgroup functionality in the Linux kernel to control the resources that are available to a Docker container. The docker create
command directly supports configuring CPU and memory restrictions when you create a container.
Constraints are applied at the time of container creation and exist for the life of the container. In most cases, if you need to change them, you need to create a new container from the same image with the new constraints, unless you manipulate the kernel cgroups directly under the /sys filesystem.
There is an important caveat here. While Docker supports CPU and memory limits, as well as swap limits, you must have these capabilities enabled in your kernel in order for Docker to take advantage of them. You might need to add these as command-line parameters to your kernel on startup. To figure out if your kernel supports these limits, run docker info
. If you are missing any support, you will get warning messages at the bottom, like:
WARNING: No swap limit support
The details regarding getting cgroup support configured for your kernel are distribution-specific, so you should consult the Docker documentation if you need help configuring things.
Docker thinks of CPU in terms of “cpu shares.” The computing power of all the CPU cores in a system is considered to be the full pool of shares. 1024 is the number that Docker assigns to represent the full pool. By configuring a container’s CPU shares, you can dictate how much time the container gets to use the CPU for. If you want the container to be able to use at most half of the computing power of the system, then you would allocate it 512 shares. Note that these are not exclusive shares, meaning that assigning all 1024 shares to a container does not prevent all other containers from running. Rather it’s a hint to the scheduler about how long each container should be able to run each time it’s scheduled. If we have one container that is allocated 1024 shares (the default) and two that are allocated 512, they will all get scheduled the same number of times. But if the normal amount of CPU time for each process is 100 microseconds, the containers with 512 shares will run for 50 microseconds each time, whereas the container with 1024 shares will run for 100 microseconds.
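The relative nature of CPU shares is easier to see with numbers. The following sketch uses a deliberately simplified proportional model: it assumes every listed container is CPU-bound at the same moment and divides CPU time by share count. The real scheduler’s behavior depends on actual load, and cpu_percent is a hypothetical helper name.

```shell
# cpu_percent SHARES ALL_SHARES...
# Hypothetical helper: prints the whole-number percentage of CPU time that a
# container with SHARES would receive if every container in ALL_SHARES
# (which should include SHARES itself) were busy at the same time.
cpu_percent() {
    mine=$1
    shift
    total=0
    for s in "$@"; do
        total=$(( total + s ))
    done
    echo $(( mine * 100 / total ))
}

# One container with 1024 shares and two with 512, as in the text above.
cpu_percent 1024 1024 512 512   # prints 50
cpu_percent 512 1024 512 512    # prints 25
```

Under this model, the container with 1024 shares gets twice the CPU time of each 512-share container, but only while all three are competing for the CPU.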
Let’s explore a little bit how this works in practice. For the following examples, we are going to use a new Docker image that contains the stress
command for pushing a system to its limits.
When we run stress
without any cgroup constraints, it will use as many resources as we tell it to. The following command creates a load average of around 5 by creating two CPU-bound processes, one I/O-bound process, and two memory allocation processes:
$ docker run --rm -ti progrium/stress --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
This should be a reasonable command to run on any modern computer system, but be aware that it is going to stress the host system, so don’t do this in a location that can’t take the additional load, or even a possible failure, due to resource starvation.
If you run the top
command on the Docker host, near the end of the two-minute run, you can see how the system is affected by the load created by the stress
program.
In the following code, we are running on a system with two CPUs.
$ top -bn1 | head -n 15
top - 20:56:36 up 3 min, 2 users, load average: 5.03, 2.02, 0.75
Tasks: 88 total, 5 running, 83 sleeping, 0 stopped, 0 zombie
%Cpu(s): 29.8 us, 35.2 sy, 0.0 ni, 32.0 id, 0.8 wa, 1.6 hi, 0.6 si, 0.0 st
KiB Mem: 1021856 total, 270148 used, 751708 free, 42716 buffers
KiB Swap: 0 total, 0 used, 0 free. 83764 cached Mem

PID USER PR NI   VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
810 root 20  0   7316    96    0 R 44.3  0.0 0:49.63 stress
813 root 20  0   7316    96    0 R 44.3  0.0 0:49.18 stress
812 root 20  0 138392 46936  996 R 31.7  4.6 0:46.42 stress
814 root 20  0 138392 22360  996 R 31.7  2.2 0:46.89 stress
811 root 20  0   7316    96    0 D 25.3  0.0 0:21.34 stress
  1 root 20  0 110024  4916 3632 S  0.0  0.5 0:07.32 systemd
  2 root 20  0      0     0    0 S  0.0  0.0 0:00.04 kthreadd
  3 root 20  0      0     0    0 S  0.0  0.0 0:00.11 ksoftirqd/0
If you want to run the exact same stress command again, with only half the amount of available CPU time, you can run it like this:
$ docker run --rm -ti --cpu-shares 512 progrium/stress --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
The --cpu-shares 512
is the flag that does the magic, allocating 512 CPU shares to this container. Note that the effect might not be noticeable on a system that is not very busy. That’s because the container will continue to be scheduled for the same time-slice length whenever it has work to do, unless the system is constrained for resources. So in our case, the results of a top
command on the host system will likely look exactly the same, unless you run a few more containers to give the CPU something else to do.
Unlike virtual machines, Docker’s cgroup-based constraints on CPU shares can have unexpected consequences. They are not hard limits; they are a relative limit, similar to the nice
command. An example is a container that is constrained to half the CPU shares, but is on a system that is not very busy. Because the CPU is not busy, the limit on the CPU shares would have only a limited effect because there is no competition in the scheduler pool. When a second container that uses a lot of CPU is deployed to the same system, suddenly the effect of the constraint on the first container will be noticeable. Consider this carefully when constraining containers and allocating resources.
It is also possible to pin a container to one or more CPU cores. This means that work for this container will only be scheduled on the cores that have been assigned to this container.
In the following example, we are running our stress container pinned to the first of two CPUs, with 512 CPU shares. Note that everything following the container image name is a parameter to the stress command, not to docker.
$ docker run --rm -ti --cpu-shares 512 --cpuset=0 progrium/stress --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
The --cpuset argument is zero-indexed, so your first CPU core is 0. If you tell Docker to use a CPU core that does not exist on the host system, you will get a Cannot start container error. On our two-CPU example host, you could test this by using --cpuset=0,1,2.
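Because the core list is zero-indexed, it is easy to be off by one. As a rough pre-flight check, the following sketch validates a comma-separated core list against a core count before you hand it to Docker. The helper name cpuset_is_valid is hypothetical, and the sketch handles only plain comma-separated indexes like the ones used in this chapter.

```shell
# cpuset_is_valid "0,1,2" CORE_COUNT
# Hypothetical helper: succeeds when every index in the comma-separated
# list refers to a core that exists on a host with CORE_COUNT cores.
cpuset_is_valid() {
    core_count=$2
    old_ifs=$IFS
    IFS=,
    set -- $1            # split the list on commas into positional parameters
    IFS=$old_ifs
    for core in "$@"; do
        # Cores are zero-indexed, so the highest valid index is core_count - 1.
        if [ "$core" -ge "$core_count" ]; then
            return 1
        fi
    done
    return 0
}

# On a two-CPU host, 0,1 is fine but 0,1,2 names a core that does not exist.
cpuset_is_valid 0,1 2   && echo "0,1 is valid on 2 cores"
cpuset_is_valid 0,1,2 2 || echo "0,1,2 is not valid on 2 cores"
```

On a Linux host, you could pass the output of getconf _NPROCESSORS_ONLN as the second argument instead of a fixed number.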
If we run top
again, we should notice that the percentage of CPU time spent in user space (us) is lower than it previously was, since we have restricted two CPU-bound processes to a single CPU.
%Cpu(s): 18.5 us, 22.0 sy, 0.0 ni, 57.6 id, 0.5 wa, 1.0 hi, 0.3 si, 0.0 st
When you use CPU pinning, additional CPU sharing restrictions on the container only take into account other containers running on the same set of cores.
In Docker Engine 1.7, support was added for the CPU CFS (Completely Fair Scheduler) within the Linux kernel. You can alter the CPU quota a given container has by setting the --cpu-quota flag to a valid value when launching the container with docker run.
We can control how much memory a container can access in a manner similar to constraining the CPU. There is, however, one fundamental difference: while constraining the CPU only impacts the application’s priority for CPU time, the memory limit is a hard limit. Even on an unconstrained system with 96 GB of free memory, if we tell a container that it may only have access to 24 GB, then it will only ever get to use 24 GB regardless of the free memory on the system. Because of the way the virtual memory system works on Linux, it’s possible to allocate more memory to a container than the system has actual RAM. In this case, the container will resort to using swap in the event that actual memory is not available, just like a normal Linux process.
Let’s start a container with a memory constraint by passing the -m
option to the docker run
command:
$ docker run --rm -ti -m 512m progrium/stress --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
When you use the -m option alone, you are setting both the amount of RAM and the amount of swap that the container will have access to. So here we’ve constrained the container to 512 MB of RAM and 512 MB of additional swap space. Docker supports b, k, m, or g, representing bytes, kilobytes, megabytes, or gigabytes, respectively. If your system somehow runs Linux and Docker and has multiple terabytes of memory, then unfortunately you’re going to have to specify it in gigabytes.
If you would like to set the swap separately or disable it altogether, then you need to also use the --memory-swap
option. The --memory-swap
option defines the total amount of memory and swap available to the container. If we rerun our previous command, like so:
$ docker run --rm -ti -m 512m --memory-swap=768m progrium/stress --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
Then we are telling the kernel that this container can have access to 512 MB of memory and 256 MB of additional swap space. Setting the --memory-swap option to the same value as -m prevents the container from using any swap at all, while setting it to -1 allows the container to use as much swap as the host has available.
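Since --memory-swap is a total rather than an additional amount, the swap allowance is the difference between the two values. The sketch below works through that arithmetic for -m 512m --memory-swap=768m. The helper names to_bytes and swap_allowance are hypothetical, and the sketch assumes the unit suffixes are multiples of 1024.

```shell
# to_bytes SIZE
# Hypothetical helper: converts a size such as 512m or 1g into bytes,
# treating k, m, and g as multiples of 1024 (an assumption of this sketch).
to_bytes() {
    number=${1%[bkmgBKMG]}    # strip a single trailing unit letter, if any
    case "$1" in
        *[kK]) echo $(( number * 1024 )) ;;
        *[mM]) echo $(( number * 1024 * 1024 )) ;;
        *[gG]) echo $(( number * 1024 * 1024 * 1024 )) ;;
        *)     echo "$number" ;;
    esac
}

# swap_allowance MEMORY MEMORY_SWAP
# Swap available to the container: the memory+swap total minus the memory limit.
swap_allowance() {
    echo $(( $(to_bytes "$2") - $(to_bytes "$1") ))
}

to_bytes 512m                # prints 536870912
swap_allowance 512m 768m     # prints 268435456, which is 256 MB
```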
Unlike CPU shares, memory is a hard limit! This is good because the constraint doesn’t suddenly have a noticeable effect on the container when another container is deployed to the system. But it does mean that you need to be careful that the limit closely matches your container’s needs because there is no wiggle room.
So, what happens if a container reaches its memory limit? Well, let’s give it a try by modifying one of our previous commands and lowering the memory significantly:
$ docker run --rm -ti -m 100m progrium/stress --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
Where all our other runs of the stress
container ended with the line:
stress: info: [1] successful run completed in 120s
We see that this run quickly fails with the line:
stress: FAIL: [1] (452) failed run completed in 1s
This is because the container tries to allocate more memory than it is allowed, and the Linux Out of Memory (OOM) killer is invoked and starts killing processes within the cgroup to reclaim memory. Since our container has only one running process, this kills the container.
Docker 1.10 has added features that allow you to tune and disable the Linux Out of Memory killer by using the --oom-kill-disable and the --oom-score-adj arguments to docker run. As of Docker 1.9, it is also possible to specifically limit the amount of kernel memory available to a container by using the --kernel-memory argument to docker run or docker create.
In Docker 1.7, support was added to apply some prioritization to a container’s use of block device I/O. This is managed by manipulating the default setting of the blkio.weight
cgroup attribute, which can have a value of 10 to 1000, and defaults to 500. The system will divide all of the available I/O between every process within a cgroup slice, with the assigned weights impacting how much I/O each individual process receives.
To set this weight on a container, you need to pass the --blkio-weight argument to your docker run command with a valid value.
To read more technical details about this kernel feature, take a look at the blkio-controller kernel documentation.
The release of Docker 1.10 introduced even more block I/O tuning features and added the docker update command, which can be used to dynamically adjust the resource limits of one or more containers. The following example shows how you could adjust the memory limit on two containers simultaneously:
$ docker update --memory="1024M" 6b785f78b75e 92b797f12af1
Another common way to limit resources available to a process in Unix is through the application of user limits. The following code is a list of the types of things that can usually be configured by setting soft and hard limits via the ulimit command:
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 5835
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Before the release of Docker 1.6, all containers inherited the ulimits of the Docker daemon. This is usually not appropriate because the Docker server requires more resources to perform its job than any individual container.
It is now possible to configure the Docker daemon with the default user limits that you want to apply to every container. The following command would tell the Docker daemon to start all containers with a soft limit of 50 open files and a hard limit of 150 open files:
$ sudo docker daemon --default-ulimit nofile=50:150
You can then override these ulimits on a specific container by passing in values using the --ulimit
argument.
$ docker run -d --ulimit nofile=150:300 nginx
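The soft:hard pairs passed to Docker follow the same rules as ordinary shell ulimits, so they can be explored without a container. The following is a minimal sketch, assuming the shell's current hard limit for open files is at least 150; the function name demo_nofile_limits is made up for this illustration, and the values mirror the nofile=50:150 example.

```shell
# The function body runs in a subshell, so the limits of the calling shell
# are left untouched. Assumes the current hard limit is at least 150.
demo_nofile_limits() (
  ulimit -Sn 50     # soft limit: the value that is actually enforced
  ulimit -Hn 150    # hard limit: the ceiling for the soft limit
  echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"

  ulimit -Sn 150    # a process may raise its soft limit up to the hard limit
  echo "raised_soft=$(ulimit -Sn)"

  # Raising the soft limit past the hard limit is rejected.
  if ! ulimit -Sn 151 2>/dev/null; then
    echo "soft limit cannot exceed hard limit"
  fi
)

demo_nofile_limits
```

The soft limit is set first because a hard limit cannot be lowered below the soft limit that is currently in effect.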
There are some additional advanced commands that can be used when creating containers, but this covers many of the more common use cases. The Docker client documentation lists all the available options and is kept current with each Docker release.
Earlier in the chapter we used the docker create
command to create our container. When we are ready to start the container, we can use the docker start
command.
Let’s say that we needed to run a copy of Redis, a common key-value store. We won’t really do anything with this Redis container, but it’s a long-lived process and serves as an example of something we might do in a real environment. We could first create the container using a command like the one shown here:
$ docker create -p 6379:6379 redis:2.8
Unable to find image 'redis:2.8' locally
30d39e59ffe2: Pull complete
...
868be653dea3: Pull complete
511136ea3c5a: Already exists
redis:2.8: The image you are pulling has been verified. Important: ...
Status: Downloaded newer image for redis:2.8
6b785f78b75ec2652f81d92721c416ae854bae085eba378e46e8ab54d7ff81d1
The command ends with the full hash that was generated for the container. However, if we didn’t know the full or short hash for the container, we could list all the containers on the system, whether they are running or not, using:
$ docker ps -a
CONTAINER ID  IMAGE                   COMMAND ...
6b785f78b75e  redis:2.8               "/entrypoint.sh redi ...
92b797f12af1  progrium/stress:latest  "/usr/bin/stress --v ...
We can then start the container with the following command:
$ docker start 6b785f78b75e
Most Docker commands will work with the full hash or a short hash. In the previous example, the full hash for the container is 6b785f78b75ec2652f81d92…bae085eba378e46e8ab54d7ff81d1, but the short hash that is shown in most command output is 6b785f78b75e. This short hash consists of the first 12 characters of the full hash.
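Since the short hash is only a prefix, it can be derived from the full hash with ordinary shell tools. The helper name short_id below is a hypothetical one chosen for this sketch; the full hash is the one printed by the docker create example above.

```shell
# short_id FULL_ID
# Prints the first 12 characters of a full container ID, which is the
# abbreviated form shown in most command output.
short_id() {
  printf '%.12s\n' "$1"
}

full_id=6b785f78b75ec2652f81d92721c416ae854bae085eba378e46e8ab54d7ff81d1
short_id "$full_id"
```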
To verify that it’s running, we can run:
$ docker ps
CONTAINER ID  IMAGE      COMMAND              ... STATUS        ...
6b785f78b75e  redis:2.8  "/entrypoint.sh redi ... Up 2 minutes  ...
In many cases, we want our containers to restart if they exit. Some containers are just very short-lived and come and go quickly. But for production applications, for instance, you expect them to be up after you’ve told them to run. We can tell Docker to do that on our behalf.
The way we tell Docker to do that is by passing the --restart
argument to the docker run
command. It takes one of four values: no
, always
, unless-stopped
, or on-failure:#
. If restart is set to no
, the container will never restart if it exits. If it is set to always
, then the container will restart whenever it exits, with no regard to the exit code. The unless-stopped
value behaves like always
, except that a container that was explicitly stopped is not started again when the Docker daemon restarts. If restart is set to on-failure:3
, then whenever the container exits with a nonzero exit code, Docker will try to restart the container three times before giving up.
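The on-failure behavior can be pictured as a small supervision loop. The following is only a sketch of the decision logic described above, not Docker's implementation: the function name run_on_failure is invented for this example, and the real daemon also waits between attempts.

```shell
# run_on_failure MAX_RESTARTS COMMAND [ARGS...]
# Runs COMMAND and reruns it after each nonzero exit, up to MAX_RESTARTS
# restarts. A zero exit status is never restarted.
run_on_failure() {
  max_restarts=$1
  shift
  restarts=0
  while :; do
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
      echo "exited 0, not restarting"
      return 0
    fi
    if [ "$restarts" -ge "$max_restarts" ]; then
      echo "giving up after $restarts restarts (last exit code $status)"
      return "$status"
    fi
    restarts=$((restarts + 1))
    echo "exit code $status, restart $restarts of $max_restarts"
  done
}

# A command that always fails is restarted three times and then abandoned.
run_on_failure 3 false || echo "final status: $?"
```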
We can see this in action by rerunning our last memory-constrained stress container without the --rm
argument, but with the --restart
argument.
$ docker run -ti --restart=on-failure:3 -m 100m progrium/stress --cpu 2 --io 1 --vm 2 --vm-bytes 128M --timeout 120s
In this example, we will see the output from the first run appear on the console before it dies. If we run a docker ps
immediately after the container dies, we will see that Docker is attempting to restart the container.
$ docker ps
... IMAGE                  ... STATUS                                ...
... progrium/stress:latest ... Restarting (1) Less than a second ago ...
It will continue to fail because we have not given it enough memory to function properly. After three attempts, Docker will give up and we will see the container disappear from the output of docker ps
.
Containers can be stopped and started at will. You might think that starting and stopping are analogous to pausing and resuming a normal process. It’s not quite the same, though. When stopped, the process is not paused; it actually exits. And when a container is stopped, it no longer shows up in the normal docker ps
output. On reboot, Docker will attempt to start all of the containers that were running at shutdown, using this same start mechanism, which is also useful when testing or for restarting a failed container. If we only want to suspend a container, we can pause it with docker pause
and unpause
, discussed later. But let’s stop our container now:
$ docker stop 6b785f78b75e
$ docker ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
Now that we have stopped the container, nothing is in the ps
list! We can start it back up with the container ID, but it would be really inconvenient to have to remember that. So docker ps
has an additional option (-a)
to show all containers, not just the running ones.
$ docker ps -a
CONTAINER ID  IMAGE      STATUS                    ...
6b785f78b75e  redis:2.8  Exited (0) 2 minutes ago  ...
That STATUS field now shows that our container exited with a status code of 0 (no errors). We can start it back up with all of the same configuration it had before:
$ docker start 6b785f78b75e
6b785f78b75e
$ docker ps -a
CONTAINER ID  IMAGE      ... STATUS         ...
6b785f78b75e  redis:2.8  ... Up 15 seconds  ...
Voila, our container is back up and running.
Remember that containers exist even when they are not started, which means that you can always restart a container without needing to recreate it. Although memory contents will have been lost, all of the container’s filesystem contents and metadata, including environment variables and port bindings, are saved and will still be in place when you restart the container.
We keep talking about the idea that containers are just a tree of processes that interact with the system in essentially the same way as any other process on the server. That means that we can send them Unix signals, which they can respond to. In the previous docker stop
example, we’re sending the container a SIGTERM
signal and waiting for the container to exit gracefully. Containers follow the same process group signal propagation that any other process group would receive on Linux.
A normal docker stop
sends a normal SIGTERM
signal to the process. If you want to force a container to be killed if it hasn’t stopped after a certain amount of time, you can use the -t
argument, like this:
$ docker stop -t 25 6b785f78b75e
This tells Docker to initially send a SIGTERM
signal as before, but then if the container has not stopped within 25 seconds, to send a SIGKILL
signal to forcefully kill it.
Although stop
is the best way to shut down your containers, there are times when it doesn’t work and we need to forcefully kill a container.
We saw what it looks like to use docker stop
to stop a container, but often if a process is misbehaving, you just want it to exit immediately.
We have docker kill
for that. It looks pretty much like docker stop
:
$ docker kill 6b785f78b75e
6b785f78b75e
A docker ps
now shows that the container is no longer running, as expected:
$ docker ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
Just because it was killed rather than stopped does not mean you can’t start it again, though. You can just issue a docker start
like you would for a nicely stopped container. Sometimes you might want to send another signal to a container, one that is not stop
or kill
. Like the Linux kill
command, docker kill
supports sending any Unix signal. Let’s say we wanted to send a USR1
signal to our container to tell it to do something like reconnect a remote logging session. We could do the following:
$ docker kill --signal=USR1 6b785f78b75e
6b785f78b75e
If our container actually did something with the USR1
signal, it would now do it. Since we’re just running a bash shell, though, it just continues on as if nothing happened. Try sending a HUP
signal, though, and see what happens. Remember that a HUP
is the signal that is sent when the terminal closes on a foreground process.
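To make the idea of a process responding to signals concrete, here is a small sketch that runs without Docker. It starts a child shell that installs handlers for USR1 and TERM and then signals itself, standing in for the signals that docker kill and docker stop would deliver to a container's main process. The function name signal_demo and the handler messages are invented for this illustration.

```shell
signal_demo() {
  sh -c '
    # Handlers that a containerized service might install.
    trap "echo received USR1, reconnecting logger" USR1
    trap "echo received TERM, shutting down; exit 0" TERM

    kill -USR1 $$    # comparable to: docker kill --signal=USR1 <container>
    kill -TERM $$    # comparable to the first signal sent by docker stop
    echo "not reached"
  '
}

signal_demo
```

The USR1 handler runs and the process carries on, while the TERM handler exits cleanly, so the final echo never runs. A process that installs no handler for a signal gets that signal's default behavior instead.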
Sometimes we really just want to stop our container as described above. But there are a number of times when we just don’t want our container to do anything for a while. That could be because we’re taking a snapshot of its filesystem to create a new image, or just because we need some CPU on the host for a while. If you’re used to normal Unix process handling, you might wonder how this actually works since containerized processes are just processes.
Pausing leverages the cgroups
freezer, which essentially just prevents your process from being scheduled until you unfreeze it. This will prevent the container from doing anything while maintaining its overall state, including memory contents. Unlike stopping a container, where the processes are made aware that they are stopping via the SIGTERM
signal, pausing a container doesn’t send any information to the container about its state change. That’s an important distinction. Several Docker commands use pausing and unpausing internally as well. Here’s how we pause a container:
$ docker pause 6b785f78b75e
If we look at the list of running containers, we will now see that the Redis container status is listed as (Paused).
$ docker ps
CONTAINER ID  IMAGE      ... STATUS                  ...
6b785f78b75e  redis:2.8  ... Up 36 minutes (Paused)  ...
Attempting to use the container in this paused state would fail. It’s present, but nothing is running. We can now resume the container using the docker unpause
command.
$ docker unpause 6b785f78b75e
6b785f78b75e
$ docker ps
CONTAINER ID  IMAGE      ... STATUS       ...
6b785f78b75e  redis:2.8  ... Up 1 second  ...
It’s back to running, and docker ps
correctly reflects the new state. Note that it shows “Up 1 second” now, which is when we unpaused it, not when it was last run.
After running all these commands to build images, create containers, and run them, we have accumulated a lot of image layers and container folders on our system.
We can list all the containers on our system using the docker ps -a
command and then delete any of the containers in the list, as follows:
$ docker ps -a
CONTAINER ID  IMAGE                   ...
92b797f12af1  progrium/stress:latest  ...
...
$ docker rm 92b797f12af1
We can then list all the images on our system using:
$ docker images
REPOSITORY       TAG     IMAGE ID      CREATED       VIRTUAL SIZE
ubuntu           latest  5ba9dab47459  3 weeks ago   188.3 MB
redis            2.8     868be653dea3  3 weeks ago   110.7 MB
progrium/stress  latest  873c28292d23  7 months ago  281.8 MB
We can then delete an image and all associated filesystem layers by running:
$ docker rmi 873c28292d23
If you try to delete an image that is in use by a container, you will get a Conflict, cannot delete error. You should stop and delete the container(s) first.
There are times, especially during development cycles, when it makes sense to completely clean off all the images or containers from your system. There is no built-in command for doing this, but with a little creativity it can be accomplished reasonably easily.
To delete all of the containers on your Docker host, you can use the following command:
$ docker rm $(docker ps -a -q)
And to delete all the images on your Docker host, this command will get the job done:
$ docker rmi $(docker images -q)
Newer versions of the docker ps
and docker images
commands both support a filter argument that can make it easy to fine-tune your delete commands for certain circumstances.
The exited filter matches a single exit code rather than a range of codes. For example, to remove all containers that exited with a status of 1, you can use this filter:
$ docker rm $(docker ps -a -q --filter 'exited=1')
And to remove all untagged images, you can type:
$ docker rmi $(docker images -q -f "dangling=true")
You can read the official Docker documentation to explore the filtering options. At the moment there are very few filters to choose from, but more will likely be added over time. And if you are really interested, Docker is an open source project, so they are always open to public code contributions.
It is also possible to make your own very creative filters by stringing together commands using pipes (|) and other similar techniques.
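As one example of such a pipeline, the sketch below picks out the IDs of containers that exited with any nonzero status. The function name nonzero_exited_ids is hypothetical, and the sample lines stand in for real output: the exit code 137 and the ID aaaaaaaaaaaa are made up. It assumes input lines of the form "ID STATUS", where an exited container's status begins with "Exited (N)".

```shell
# Reads "ID STATUS..." lines on stdin and prints the ID of every container
# whose status starts with "Exited (N)" where N is not 0.
nonzero_exited_ids() {
  awk '$2 == "Exited" && $3 != "(0)" { print $1 }'
}

# Sample input in place of a live Docker daemon (values are illustrative).
printf '%s\n' \
  '6b785f78b75e Exited (0) 2 minutes ago' \
  '92b797f12af1 Exited (137) 5 minutes ago' \
  'aaaaaaaaaaaa Up 3 hours' | nonzero_exited_ids

# Against a real daemon, assuming a docker ps that supports --format,
# the same filter could feed docker rm:
#   docker ps -a --format '{{.ID}} {{.Status}}' | nonzero_exited_ids | xargs docker rm
```

Keeping the filter in its own function makes it easy to check against sample text before pointing it at a command that deletes things.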
In the next chapter, we’ll do more exploration of what Docker brings to the table. For now it’s probably worth doing a little experimentation on your own. We suggest exercising some of the container control commands we covered here so that you’re familiar with the command-line options and the overall syntax. Try interacting with stopped or paused containers to see what you can see. Then when you’re feeling confident, head on into Chapter 6!