Now that you have a better understanding of Docker and its associated terminology, this chapter shows you how to convert your project into a containerized application using Docker. In this chapter, you learn what a Dockerfile is, including its syntax, and learn how to write a Dockerfile. With a better understanding of Dockerfiles, you can work toward the first step in writing a Dockerfile for the Newsbot app.
Dockerfile Primer
For traditionally deployed applications, building and packaging were often quite tedious. To automate the build and packaging of the application, people turned to different utilities, such as GNU Make, Maven, Gradle, and so on. Similarly, in the Docker world, a Dockerfile is an automated way to build your Docker images.
A Typical Dockerfile
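The book's original listing is not reproduced here, but a representative Dockerfile of the kind being discussed might look like the following sketch (the base image, file names, and values are illustrative):

```dockerfile
# Start from an official base image
FROM ubuntu:latest

# Record image metadata
LABEL maintainer="example@example.com"

# Install the runtime the application needs
RUN apt-get update && apt-get install -y python3

# Copy the application file from the build context into the image
COPY hello-world.py .

# Set a default environment variable
ENV NAME=Readers

# Command to run when a container starts from this image
CMD ["python3", "hello-world.py"]
```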
Looking at this Dockerfile, it's easy to see what we're telling the Docker Engine to build. However, don't let the simplicity fool you: Dockerfiles let you express complex build steps when generating your Docker images. When the docker build command is issued, Docker builds the image from the Dockerfile and a build context.
Build Context
A build context is a file or set of files available at a specific path or URL. To understand this better, say you have some supporting files that you need during a Docker image build—for instance, an application-specific config file that was generated earlier and needs to be part of the container.
The build command sets the context to the path or URL provided, uploading the available files to the Docker daemon and allowing it to build the image. The build context is not limited to a local path or repository URL: if you pass the URL of a remote tarball (i.e., a .tar file), the tarball is downloaded onto the Docker daemon and its contents are used as the build context.
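For illustration, the context can be the current directory, another path, or a remote location (the repository URL below is a placeholder):

```shell
# Use the current directory as the build context
docker build .

# Use a specific directory as the build context
docker build /path/to/app

# Use a remote Git repository as the build context
docker build https://github.com/example/example-repo.git
```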
If you place the Dockerfile in the root (/) directory and set that as the context, the entire contents of your hard disk will be transferred to the Docker daemon, so be careful about what you choose as the build context.
Dockerignore
You should now understand that the build context transfers the contents of the context directory to the Docker daemon during the build. Consider the case where the context directory contains many files and directories that are not relevant to the build process. Uploading them can cause a significant increase in network traffic. A .dockerignore file, much like .gitignore, allows you to define files that are exempt from being transferred during the build process.
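A .dockerignore file lists one pattern per line; a small sketch of what it might contain for a Python project:

```
# .dockerignore (illustrative patterns)
.git
__pycache__
*.pyc
venv/
```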
BuildKit
With the 18.09 release of the Docker Engine, Docker overhauled its container build system with BuildKit, which is now the default build system for Docker. For most users, BuildKit works exactly like the legacy build system. BuildKit produces new output for Docker image builds and, as a result, provides more detailed feedback about the build process.
Build Output When BuildKit Is Enabled
Switching Back to the Legacy Build Process
Unless you encounter any problems, I do not recommend switching back to the legacy build process. Stick to using Docker BuildKit. If you’re not seeing the new build output, ensure that you have updated to the latest version of Docker.
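If you do need the legacy builder temporarily, you can disable BuildKit for a single build by setting the DOCKER_BUILDKIT environment variable to 0:

```shell
# Disable BuildKit for this one build (legacy builder)
DOCKER_BUILDKIT=0 docker build .

# Explicitly enable BuildKit
DOCKER_BUILDKIT=1 docker build .
```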
Building Using Docker Build
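By default, docker build looks for a file named Dockerfile at the root of the build context; the -f flag lets you point at a different file (Dockerfile.debug below is just an illustrative name):

```shell
# Build using the Dockerfile in the current directory
docker build .

# Build using a specific Dockerfile
docker build -f Dockerfile.debug .
```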
Response from Docker Engine as it Builds the Dockerfile
Dockerfile for Python with an Invalid Instruction
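The original listing is not reproduced here; purely as a hypothetical illustration, a Dockerfile with a misspelled instruction, such as the one below, fails to build because Docker does not recognize the instruction:

```dockerfile
FROM python:3
COPY hello-world.py .
# "CMDD" is not a valid Dockerfile instruction, so the build fails
CMDD ["python", "hello-world.py"]
```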
You’ll get back to fixing this problem a little later in the chapter. For now, it’s time to look at some of the commonly used Dockerfile instructions and at tagging images.
Tags
A tag is a name that uniquely identifies a specific version of a Docker image. Tags are plain-text labels often used to identify specific details, such as the version, the base OS of the image, or the architecture of the Docker image. Tagging a Docker image gives you the flexibility to refer uniquely to a specific version, which makes it easier to roll back to previous versions of a Docker image if the current image is not working as expected.
If a tag is not specified, Docker applies a string called "latest" as the default tag. The "latest" tag is often the source of many problems, especially for new Docker users. Many believe that tagging an image "latest" means it is the latest version of the image and will always be updated to the newest version. This is not true: latest is simply a naming convention and has no special meaning.
I do not recommend using latest as a tag, especially for production workloads. During development, omitting the tag applies the "latest" tag to every build, so a build containing a breaking change silently overwrites the previous image. This makes rolling back to the previous version difficult unless you noted the SHA hash of the image. Using specific tags makes it easy to tell, at a glance, which version of the image a container is running. Specific tags also reduce the chance of a breaking change being propagated: if you tag an image as latest and it contains a breaking change or a bug, the next time a container crashes or restarts, it might pull the image with that breaking change or bug.
Adding a Tag When Building the Image
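Tags are applied at build time with the -t flag; for example (the image name is illustrative):

```shell
# Build the image and tag it as version 1.0
docker build -t sathyabhat/hello-world:1.0 .
```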
Dockerfile Instructions
FROM
WORKDIR
ADD
COPY
RUN
CMD
ENTRYPOINT
ENV
VOLUME
LABEL
EXPOSE
Let’s see what they do.
FROM
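The general syntax of the FROM instruction is:

```dockerfile
FROM <image> [AS <name>]
FROM <image>[:<tag>] [AS <name>]
FROM <image>[@<digest>] [AS <name>]
```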
Where <image> is the name of a valid Docker image from any public/private repository. As mentioned, if the tag is skipped, Docker will fetch the image tagged as latest.
WORKDIR
WORKDIR can be set multiple times in a Dockerfile and, if a relative directory succeeds a previous WORKDIR instruction, it will be relative to the previously set working directory. Let’s look at an example demonstrating this.
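First, a minimal sketch with a single, absolute WORKDIR (base image and directory chosen for illustration):

```dockerfile
FROM ubuntu:latest
# Set the working directory to /app
WORKDIR /app
# Print the working directory when a container starts
CMD pwd
```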
The Dockerfile fetches the latest tagged image from Ubuntu as the base image, sets the current working directory to /app, and runs the pwd command when the image is run. The pwd command prints the current working directory.
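Building and running that image prints /app. Now consider a variant in which every WORKDIR is relative (again, an illustrative sketch):

```dockerfile
FROM ubuntu:latest
# Each relative WORKDIR is resolved against the previous one
WORKDIR usr
WORKDIR src
WORKDIR app
CMD pwd
```

Running a container from this image prints /usr/src/app.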
Notice that you did not set any absolute working directory in the Dockerfile; the relative directories were appended to the default working directory (/).
ADD and COPY
At first glance, the ADD and COPY instructions seem to be the same—they allow you to transfer files from the host to the container's filesystem. COPY supports basic copying of files to the container, whereas ADD has support for features like tarball auto-extraction (i.e., Docker will automatically extract compressed files added from a local directory) and remote URL support (i.e., Docker will download the resources from a remote URL).
The ADD instruction is useful when you’re adding files from remote URLs or you have compressed files from the local filesystem that need to be automatically extracted into the container filesystem.
Docker recommends using COPY over ADD, especially when it’s a local file that’s being copied.
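Both instructions share the same basic form, a source from the build context (or a URL, for ADD) followed by a destination inside the image; a quick sketch (the file, user, and group names are illustrative):

```dockerfile
# Copy a file from the build context into the image
COPY requirements.txt /app/requirements.txt

# Copy with a specific owner instead of the default root user
COPY --chown=appuser:appgroup requirements.txt /app/requirements.txt
```

Some points to remember about both ADD and COPY: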
If the <destination> does not exist in the image, it will be created.
All new files/directories are created with UID and GID as 0—that is, as the root user. To change this, you can use the --chown flag.
If the files/directories contain special characters, they need to be escaped.
The <destination> can be an absolute or a relative path. In the case of relative paths, the path is interpreted relative to the working directory set by the WORKDIR instruction.
If the <destination> doesn’t end with a trailing slash, it will be considered a file and the contents of the <source> will be written into <destination>.
If the <source> is specified as a wildcard pattern, the <destination> must be a directory and must end with a trailing slash; otherwise, the build process will fail.
The <source> must be within the build context. It cannot be a file/directory outside of the build context because the first step of a Docker build process involves sending the context directory to the Docker daemon.
In the case of the ADD instruction:
If the <source> is a URL and the <destination> is not a directory and doesn’t end with a trailing slash, the file is downloaded from the URL and copied into <destination>.
If the <source> is a URL and the <destination> is a directory and ends with a trailing slash, the filename is inferred from the URL and the file is downloaded and copied to <destination>/<filename>.
If the <source> is a local tarball of a known compression format, the tarball is unpacked as a directory. Remote tarballs, however, are not uncompressed.
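For illustration, here is how ADD might be used with a remote URL and with a local tarball (the URL and file names are placeholders):

```dockerfile
# The file is downloaded from the URL into /app/ (remote files are not extracted)
ADD https://example.com/config/app-config.json /app/

# A local tarball from the build context is auto-extracted into /app/vendor/
ADD vendor-libs.tar.gz /app/vendor/
```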
RUN
The RUN instruction will execute any command during the build step of the container. This creates a new layer that is available for the next steps in the Dockerfile. It is important to note that the command following the RUN instruction runs only when the image is being built. The RUN instruction has no relevance when a container has started and is running.
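RUN comes in two forms, shell form and exec form; a quick sketch of each (package names are illustrative):

```dockerfile
# Shell form: the command is run via /bin/sh -c, so shell features work
RUN apt-get update && apt-get install -y curl

# Exec form: the command runs directly, without a shell
RUN ["apt-get", "install", "-y", "curl"]
```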
The shell form makes it possible to use shell variables, subcommands, command pipes, and command chains in the RUN instruction itself.
Unless you need to use shell features like chaining and redirection, it is recommended to use the exec form for the RUN instruction.
Layer Caching
The logs indicate that, instead of redownloading the layer for the base Ubuntu image, Docker uses the cached layer saved to disk. This applies to all the layers that are created, and Docker creates a new layer whenever it encounters a RUN, COPY, or ADD instruction. Getting the order of instructions right can greatly affect whether Docker can reuse layers. This not only improves image build speed, but also reduces container start times, since there are fewer layers to download.
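For example, installing all the required packages in a single RUN instruction (the package names here are illustrative) keeps them in one layer:

```dockerfile
FROM ubuntu:latest
# One RUN instruction, one layer: update the package index and install
# everything the application needs
RUN apt-get update && apt-get install -y \
    curl \
    nginx \
    vim
```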
This creates a single layer with the packages to be installed, and any change to any of the packages invalidates the cache and causes a new layer to be created with the updated packages. If you want to explicitly instruct Docker to avoid using the cache, pass the --no-cache flag to the docker build command.
CMD and ENTRYPOINT
Passing the executable every time just to override its parameters can be quite tedious. This is where the combination of ENTRYPOINT and CMD shines: you set ENTRYPOINT to the executable, while the parameters supplied by CMD can be overridden from the command line.
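A sketch of this pattern, using curl as the entrypoint (the default URL is just an example):

```dockerfile
FROM ubuntu:latest
RUN apt-get update && apt-get install -y curl
# The executable is fixed by ENTRYPOINT...
ENTRYPOINT ["curl", "-s"]
# ...while CMD supplies a default parameter that can be overridden at run time
CMD ["https://example.com"]
```

Running the image with no arguments fetches the default URL, while docker run <image> https://docs.docker.com overrides only the CMD parameter; the curl entrypoint stays in place.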
Of course, curl is just an example here. You can replace curl with any other program that accepts parameters (such as load-testing utilities, benchmarking utilities, etc.) and the combination of CMD and ENTRYPOINT makes it easy to distribute the image.
Commands for ENTRYPOINT/CMD Combinations
| | No ENTRYPOINT | ENTRYPOINT exec_entry p1_entry | ENTRYPOINT ["exec_entry", "p1_entry"] |
|---|---|---|---|
| No CMD | Error, not allowed | /bin/sh -c exec_entry p1_entry | exec_entry p1_entry |
| CMD ["exec_cmd", "p1_cmd"] | exec_cmd p1_cmd | /bin/sh -c exec_entry p1_entry | exec_entry p1_entry exec_cmd p1_cmd |
| CMD ["p1_cmd", "p2_cmd"] | p1_cmd p2_cmd | /bin/sh -c exec_entry p1_entry | exec_entry p1_entry p1_cmd p2_cmd |
| CMD exec_cmd p1_cmd | /bin/sh -c exec_cmd p1_cmd | /bin/sh -c exec_entry p1_entry | exec_entry p1_entry /bin/sh -c exec_cmd p1_cmd |
As mentioned earlier, you can specify RUN, CMD, and ENTRYPOINT in shell form and exec form. Which to use depends entirely on your requirements, but as a general guide:
In shell form, the command runs in a shell with the command as a parameter. This form provides a shell where shell variables, subcommands, command piping, and command chaining are possible.
In exec form, the command does not invoke a command shell. This means that normal shell processing (such as $VARIABLE substitution, piping, etc.) will not work.
A program started in shell form runs as a subcommand of /bin/sh -c. This means the executable does not run as PID 1 and does not receive UNIX signals. As a consequence, pressing Ctrl+C (or stopping the container, which sends a SIGTERM) does not forward the signal to the application, and it might not exit correctly.
ENV
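ENV accepts two forms; a quick sketch of each (the variable names are examples):

```dockerfile
# First form: everything after the key is treated as the value
ENV LOG_LEVEL debug

# Second form: key=value pairs; several variables can be set in one instruction
ENV LOG_LEVEL=debug LOG_LOCATION=/var/log/app
```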
In the first form, the entire string after the <key> is considered the value, including whitespace characters. Only one variable can be set per line in this form.
In the second form, multiple variables can be set at one time, with the equals (=) character assigning value to the key.
The environment variables set are persisted through the container runtime. They can be viewed using docker inspect.
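For example (the image name is a placeholder):

```shell
# View the environment variables baked into an image
docker inspect sathyabhat/env-example

# Or start an interactive shell in a container and check them directly
docker run -it sathyabhat/env-example sh
```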
Type exit to close the interactive terminal of the container.
VOLUME
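The instruction itself is a single line naming the mount point; for example:

```dockerfile
VOLUME /var/logs/nginx
```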
This tells Docker to mark the /var/logs/nginx directory as a mount point, with the data being mounted from the Docker host. When combined with the -v (volume) flag of the docker run command, this results in the data being persisted on the Docker host as a volume. The volume can then be backed up, moved, or transferred using Docker CLI commands. You will learn more about volumes in a later chapter of this book.
EXPOSE
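EXPOSE tells Docker which ports the container listens on at runtime; the syntax, with a couple of illustrative examples:

```dockerfile
EXPOSE <port> [<port>/<protocol>...]

# Examples
EXPOSE 80
EXPOSE 53/udp
```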
You can also specify whether the port listens on TCP or UDP, or declare it for both. If the protocol is not specified, Docker assumes TCP.
An EXPOSE instruction doesn’t publish the port. For the port to be published to the host, you need to use the -p flag with docker run to publish and map the ports.
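For instance (the image name is a placeholder):

```shell
# Publish container port 80 on host port 8080
docker run -p 8080:80 sathyabhat/nginx-example
```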
LABEL
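LABEL adds metadata to an image as key-value pairs; an illustrative sketch (the first key follows the reverse-DNS convention described below, and the values are placeholders):

```dockerfile
LABEL com.sathyasays.my-image.version="1.0"
LABEL description="An illustrative description of the image" \
      maintainer="example@example.com"
```

Docker recommends the following guidelines for label keys and values.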
- For keys:
Authors of third-party tools should prefix each key with reverse DNS notation of a domain owned by them: for example, com.sathyasays.my-image.
com.docker.*, io.docker.*, and org.dockerproject.* are reserved by Docker for internal use.
Label keys should begin and end with lowercase letters and should contain only lowercase alphanumeric characters and the period (.) and hyphen (-) characters. Consecutive hyphens and periods are not allowed.
The period (.) separates the namespace fields.
- For values:
Label values can contain any data type that can be represented as a string, including JSON, XML, YAML, and CSV.
Guidelines and Recommendations for Writing Dockerfiles
Containers should be ephemeral. Docker recommends that images generated by Dockerfiles be as ephemeral as possible. You should be able to stop, destroy, and restart the container at any point with minimal setup and configuration. The container should ideally not write data to its filesystem; any persistent data should be written to Docker volumes or to data storage managed outside the container (for example, an object store such as Amazon S3).
Keep the build context minimal. You read about build context earlier in this chapter. It’s important to keep the build context as minimal as possible to reduce the build times and the image size. This can be done by making effective use of the .dockerignore file.
Use multi-stage builds. Multi-stage builds help in drastically reducing the size of the image without having to write complicated scripts to transfer/keep the required artifacts. Multi-stage builds are described in the next section.
Skip unwanted packages. Having unwanted or nice-to-have packages increases the size of the image, introduces unwanted dependent packages, and increases the surface area for attacks.
Minimize the number of layers. While not as big a concern as it used to be, it's still important to reduce the number of layers in the image. As of Docker 1.10, only the RUN, COPY, and ADD instructions create layers. Keeping these instructions to a minimum, or combining multiple lines of the same instruction into one, reduces the number of layers and, ultimately, the size of the image.
Using Multi-Stage Builds
As of version 17.05, Docker supports multi-stage builds, which allow complex image builds to be performed without unnecessarily bloating the Docker image. Multi-stage builds are especially useful when you're building images of applications that require additional build-time dependencies that are not needed at runtime. The most common examples are applications written in languages such as Go or Java, where, prior to multi-stage builds, it was common to maintain two different Dockerfiles: one for the build and another for the release, along with scripts to move the artifacts from the build-time image to the runtime image.
With multi-stage builds, a single Dockerfile can be leveraged for both the build and the deploy images: the build stages can contain the build tools required for generating the binary or artifact, and in a later stage the artifact can be copied into the runtime image, considerably reducing its size. For a typical multi-stage build, a build stage has several layers: layers for installing the tools required to build the application, for generating the dependencies, and for building the application itself. In the final stage, the application built in the earlier stages is copied over, and only the layers of this final stage count toward the resulting image. The build-stage layers are discarded, drastically reducing the size of the final image.
Although this book doesn’t focus on multi-stage builds in detail, you will try an exercise on how to create a multi-stage build and see how much smaller using a slim image with multi-stage build makes the final image. More details about multi-stage builds are available on Docker’s website at https://docs.docker.com/develop/develop-images/multistage-build/.
Exercises
The start of the chapter introduced a simple Dockerfile that did not build due to syntax errors. In this exercise, you see how to fix that Dockerfile and add some of the instructions that you learned in this chapter.
Tip The source code and associated Dockerfile are available on the GitHub repo of the book, at https://github.com/Apress/practical-docker-with-python, in the source-code/chapter-4/exercise-1 directory.
Trying to build this will result in an error since hello-world.py is missing. Let’s fix the build error. To do this, you need to add a hello-world.py that reads an environment variable, NAME, and prints Hello, $NAME!. If the environment variable is not defined, it will print "Hello, World!".
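A minimal sketch of such a script (one of several ways to write it):

```python
# hello-world.py
import os

# Read the NAME environment variable, defaulting to "World" if it is not set
name = os.getenv("NAME", "World")
print(f"Hello, {name}!")
```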
Congrats! You’ve successfully written your first Dockerfile and built your first Docker image.
In this exercise, you will build two Docker images. The first image uses a standard build with python:3 as the base image, whereas the second image gives an overview of how multi-stage builds can be utilized.
Tip The source code and associated Dockerfile are available on the GitHub repo of the book at https://github.com/Apress/practical-docker-with-python, in the source-code/chapter-4/exercise-2/ directory.
Building the Docker Image Using a Standard Build
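The standard-build Dockerfile is along these lines (a sketch; the actual dependency comes from the requirements.txt in the exercise directory):

```dockerfile
FROM python:3
COPY requirements.txt .
RUN pip install -r requirements.txt
```

After building and tagging the image (here as sathyabhat/base-build), docker images reports the size shown in the following table.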
Repository | Tag | Image ID | Created | Size |
---|---|---|---|---|
sathyabhat/base-build | latest | 03191af | About a minute ago | 895MB |
The Docker image sits at a fairly hefty 895MB, even though you did not add any of your application code, just a dependency. Let’s rewrite it to a multi-stage build.
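A sketch of what the multi-stage Dockerfile might look like (the exact stage names and paths are assumptions; the shape of the build is what matters):

```dockerfile
# Stage 1: install the dependencies using the full python:3 image
FROM python:3 AS python-base
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Stage 2: copy only the installed packages into a slim runtime image
FROM python:3-slim
COPY --from=python-base /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
```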
The Dockerfile is different in that there are multiple FROM statements, signifying the different stages. In the first stage, you build the required packages using the python:3 image, which has the necessary build tools.
Repository | Tag | Image ID | Created | Size |
---|---|---|---|---|
sathyabhat/multistage-build | latest | 35c85a8497b5 | About a minute ago | 54.2MB |
In this exercise, you will write the Dockerfile for Newsbot, the Telegram chatbot project.
Tip The source code and associated Dockerfile are available on the GitHub repo of the book at https://github.com/Apress/practical-docker-with-python, in the source-code/chapter-4/exercise-3/ directory.
To containerize Newsbot, you need the following:
- A Docker image based on Python 3
- The project dependencies listed in requirements.txt
- An environment variable named NBT_ACCESS_TOKEN
The general approach to writing the Dockerfile is as follows:
1. Start with a proper base image.
2. Make a list of files required for the application.
3. Make a list of environment variables required for the application.
4. Copy the application files to the image using the COPY instruction.
5. Specify the environment variable with the ENV instruction (see the sketch after this list).
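Putting these steps together, a minimal sketch of Newsbot's Dockerfile might look like this (the working directory, the entrypoint script name, and the empty token value are assumptions):

```dockerfile
FROM python:3-alpine
WORKDIR /app
# Install the dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy the rest of the application code
COPY . .
# The Telegram bot token; supply the real value at build or run time
ENV NBT_ACCESS_TOKEN=""
CMD ["python", "newsbot.py"]
```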
Build and run the image; if you see Newsbot's startup logs, congratulations! Not only did you write the Dockerfile for Newsbot, but you also built it and ran it successfully.
Summary
In this chapter, you gained a better understanding of what a Dockerfile is by reviewing its syntax. You are now one step closer to mastering writing a Dockerfile for Newsbot.