Understanding the core concepts of Docker

Now it's time to unfold the topic further by introducing Docker. Docker builds on Linux container technology (it originally used LXC) and adds support for building, shipping, and running operating system images. To that end, it defines a layered image format, which makes it possible to pack the filesystem components necessary for running a specific application into a Docker image.

Although installing Docker is not necessary for this chapter, because it is already provided by the minikube package we will use later, Docker can easily be installed on different operating systems. Since Docker relies on functionality only present in the Linux kernel, it runs natively only on Linux (and only there will you see the performance benefits over virtual machines). You can still use it on macOS and Windows, but there a separate hypervisor runs Docker inside a Linux virtual machine in the background. On Ubuntu Linux, installation is a single command: sudo apt install docker.io.
Please have a look at the following link for instructions on installing Docker on other flavors of Linux and on other operating systems: https://www.docker.com/community-edition
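
As a quick sanity check after installation, the following commands verify that the Docker daemon is running and can pull and run images. This is just a minimal sketch; hello-world is a tiny test image provided on Docker Hub:

# Show the installed Docker client and server versions
sudo docker version
# Pull and run a tiny test image to confirm the daemon works end to end
sudo docker run --rm hello-world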

The advantage of this layered format is that changes to an image only result in the addition of a new layer on top of the existing ones. Therefore, a Docker image can easily be synchronized and kept up to date over a network or the internet, since only the changed layers have to be transferred. To create Docker images, Docker ships with a small build tool that builds them from so-called Dockerfiles.
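
You can inspect these layers yourself once an image has been pulled or built. As a small illustration (the image name ubuntu:16.04 is only used as an example here), docker history lists every layer of an image together with the instruction that created it and its size:

# Download the base image; each layer is pulled and cached separately
docker pull ubuntu:16.04
# List the layers of the image, newest first, with the creating instruction and size
docker history ubuntu:16.04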

The following listing shows such a Dockerfile for creating a single container Apache Spark cluster using the standalone cluster manager:

#This image is based on the ubuntu root image version 16.04
FROM ubuntu:16.04
#Update and install required packages
RUN apt-get update
RUN apt-get install -y curl wget python openssh-server sudo
#Install the JVM
RUN mkdir -p /usr/java/default
RUN curl -Ls 'http://download.oracle.com/otn-pub/java/jdk/8u102-b14/jdk-8u102-linux-x64.tar.gz' |tar --strip-components=1 -xz -C /usr/java/default/
#Install and configure ApacheSpark
RUN wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
RUN tar xvfz spark-2.0.0-bin-hadoop2.7.tgz
RUN chown -R 1000:1000 /spark-2.0.0-bin-hadoop2.7
RUN echo "SPARK_LOCAL_IP=127.0.0.1" > /spark-2.0.0-bin-hadoop2.7/conf/spark-env.sh
RUN groupadd -g 1000 packt
RUN useradd -g 1000 -u 1000 --shell /bin/bash packt
RUN usermod -a -G sudo packt
RUN mkdir /home/packt
RUN chown packt:packt /home/packt
RUN echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config
RUN echo "packt ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER packt
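#Generate an ssh key pair for the packt user and allow passwordless ssh to localhost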
RUN ssh-keygen -f /home/packt/.ssh/id_rsa -t rsa -N ''
RUN cp /home/packt/.ssh/id_rsa.pub /home/packt/.ssh/authorized_keys
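#Set JAVA_HOME and SPARK_HOME for the image and for interactive shells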
ENV JAVA_HOME=/usr/java/default/
ENV SPARK_HOME=/spark-2.0.0-bin-hadoop2.7/
RUN echo "export JAVA_HOME=/usr/java/default/" >> /home/packt/.bashrc
RUN echo "export SPARK_HOME=/spark-2.0.0-bin-hadoop2.7/" >> /home/packt/.bashrc
RUN echo ". ~/.bashrc" >> /home/packt/.bash_profile
#Allow external connections to the cluster
EXPOSE 8080
EXPOSE 8081

Let's have a look at the most important directives in order to understand the contents of the Dockerfile. Before we do that, it is important to note that everyone with an active internet connection is able to create a Docker image from a Dockerfile. Therefore, besides saving on transfer times, this brings an important security benefit: you know exactly what's inside an image and no longer have to trust the provider of the image blindly. Here are the most important directives used in this Dockerfile:

  • FROM: This is the start of every Dockerfile and tells Docker which (official or unofficial) root image this image should be based on. Here we build on Ubuntu 16.04, which is the latest Long Term Support (LTS) version of Ubuntu at the time of writing. It contains a minimal subset of components and basically provides us with an empty Ubuntu Linux installation. Of course, you can also base your image on other images you have created before.
  • RUN: Everything after the RUN directive is executed as a shell command during image creation. The important steps done here are:
    1. Installing components using the apt command.
    2. Downloading, extracting, and installing a JVM and Apache Spark.
    3. Creating users and groups.
    4. Configuring ssh.
  • USER: During image creation, we are running as the root user. If we want subsequent directives (and the container itself) to run as an unprivileged user instead, we can switch to it using the USER directive.
  • ENV: This directive allows us to set environment variables that are available system-wide during the runtime of the image.
  • EXPOSE: This directive tells Docker which ports of the container should be reachable from the outside at runtime. It acts as a sort of firewall declaration where, by default, all ports are closed to the outside. The actual port mapping is configured when the container is started and is not fixed in the image itself, as shown in the following example.
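
To make that last point concrete, here is a small sketch of how the exposed ports are published when a container is started (me/apachespark:2.0 is the tag we will build shortly, and the host ports chosen here are arbitrary):

# Map the exposed container ports 8080 and 8081 to the same ports on the host
docker run -p 8080:8080 -p 8081:8081 -it me/apachespark:2.0
# Alternatively, -P publishes all EXPOSEd ports on random free host ports
docker run -P -it me/apachespark:2.0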

If you now run the following command, a Docker image is created from the Dockerfile:

docker build -t me/apachespark:2.0 .
The trailing . specifies that the Docker daemon should pick up the Dockerfile from the current directory and use it as the build context for the image.
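
Once the build finishes, you can verify that the image exists locally and inspect the layers produced by the individual Dockerfile instructions (a quick check, assuming the build ran through without errors):

# List local images matching the repository name; the new tag should appear here
docker images me/apachespark
# Show the layer stack created by the individual Dockerfile instructions
docker history me/apachespark:2.0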

You can now start a Docker container from this image with the following command:

docker run --name spark -it me/apachespark:2.0
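
Once the container is running, the usual Docker commands can be used to work with it. This is a minimal sketch; the container name spark comes from the --name option above:

# List running containers
docker ps
# Open an additional shell inside the running container
docker exec -it spark /bin/bash
# Stop and remove the container when you are done
docker stop spark
docker rm spark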

But as we know, an Apache Spark cluster is built from multiple machines, not just one. In addition, it makes little sense to divide a single node into multiple Apache Spark containers, since we want to achieve exactly the opposite: Apache Spark makes multiple nodes behave like a single, big one. This is the very reason for using a data-parallel framework such as Apache Spark, to create bigger virtual compute resources out of small ones. So why does running Spark in containers make sense at all? To answer that, let's have a look at Kubernetes first.
