Chapter 8. Containers, Containers, Containers

Containers, as a concept, are not new technology. However, in recent years, there has been rapid adoption of new container-based technologies such as Docker, Garden, and Rocket. Many organizations regard containers as a key enabler for adopting technologies such as microservices-based architectures or continuous delivery pipelines. Therefore, containers have become a critical part of the digital transformation strategy of most companies.

Some companies I work with establish a mandate to adopt containers but cannot articulate a specific use case or problem that would be solved by adopting them. Others believe containers will help them deploy apps more quickly but cannot explain why, or they believe containers will provide better utilization but have not profiled their existing infrastructure utilization. Container technology absolutely can provide significant benefits, but it is essential to understand how those benefits are achieved in order to derive the most from the technology.

The Meaning of “Container”

Like the term “platform,” the term “container” is also overloaded. This chapter is not about traditional app server containers such as Tomcat; it is about OS-level containers such as runC.

What Is a Container?

Despite their huge popularity, there are still a lot of misconceptions about containers. Principally, in the Linux world at least, containers are not a literal entity; they are a logical construct. Strictly speaking, a container is nothing more than a controlled user process. Container technology typically takes advantage of OS kernel features to constrain and control various resources, and to isolate and secure the relevant containerized processes. In addition, the term “container” conflates various concepts, which adds to the confusion.

Containers have two key elements:

Container Images

These package a repeatable runtime environment (encapsulating your app and all of its dependencies, including the filesystem) in a format that is self-describing and portable, allowing images to be moved between container hosts. A container image is self-describing in that it specifies how it should be run, but it is not self-executable: it cannot run without a container management solution and a container runtime. However, regardless of the container contents, any compliant container runtime should be able to run the container image without requiring extra dependencies.

Container management

Often referred to as a container engine, the management layer typically uses OS kernel features (e.g., Linux kernel primitives such as control groups and namespaces) to run a container image in isolation, often within a shared kernel space. Container-management engines typically expose a management API for user interaction and utilize a backend container runtime, such as runC or runV, that is responsible for building and running an isolated containerized process.

Container Terminology

The challenge in discussing containers is that terminology can mean different things to different people, and it can be implementation-specific. I have tried to be as generic as possible in my descriptions, but it is important to note that terms like engine and backend can have meanings tied to a specific technology implementation.

Container images are created through a process known as containerization: packaging up a filesystem, runtime dependencies, and any other required technology artifacts to produce a single encapsulated binary image. You can then port these images around and run them in different container backends via the container API/management layer. The container backend implementation is host-specific; for example, the term Linux container refers to the Linux technology (originally based on LXC) for running containerized images. Currently, Linux containers are by far the most widely adopted container technology. For this reason, the rest of this chapter focuses on Linux containers to explain the fundamental container concepts.

Container Fervor

Why have containers become so popular so quickly? Containers offer three distinct advantages over traditional VMs:

  • Speed and efficiency because they are lightning fast to create

  • Greater resource consolidation, especially if you overcommit resources

  • App stack portability

Because containers use a slice of a prebuilt machine or VM,1 they are generally regarded as significantly faster to create than a new VM. Effectively, to create a new container, you simply fork and isolate a process.

Containers also allow for a greater degree of resource consolidation because you can run several container instances in isolation on a single host machine, managed by a single OS kernel.

In addition, containers have enabled a new era of app-stack portability because apps and dependencies developed to run in a container backend can easily be packaged into a single container image (usually containing a tarball with some additional metadata). You can then deploy and run container images in several different environments. Container images make it easy to efficiently ship deltas, and therefore moving whole images between different host machines becomes practical. App-stack portability is one of the key reasons why containers have become so popular. Container images have become a key enabler for trends such as DevOps and CD by enabling both the app artifacts and all of the related runtime dependencies to migrate unchanged, as a layered binary image, through a CI pipeline into production. This has provided a unified approach to delivering software into production as opposed to the old and defunct “it worked on my machine” approach. Chapter 9 discusses this unified approach and its various merits in further detail.

Increased deployment efficiency becomes paramount when deploying apps using a microservices architecture because there are more moving parts and more overall churn. Container-based infrastructure has therefore become a natural choice for teams using a microservices architecture.

Containers, as with all technology, are a means to an end. It is not the technology itself that is important; it is how you take advantage of it that is key. For example, we discussed using container images to propagate apps through a pipeline and into production. However, the pipeline itself can also effectively use containers. For example, Concourse CI controls pipeline inputs so that results are always repeatable. Rather than sharing state, every task runs in its own container, thus controlling its own dependencies. Containers are created and then destroyed every time a task is run, ensuring repeatability by avoiding build “pollution.” Concourse’s use of containers is a perfect example of how container technology can provide a clear, tangible benefit over more traditional approaches that suffer from build pollution due to a pattern of VM reuse.

Linux Containers

Linux containers provide a way of efficiently running an isolated user process. As just discussed, strictly speaking, Linux containers do not exist in a purely literal sense: there are no specific Linux kernel container features. Existing kernel features are combined to produce the behavior associated with containers, but Linux containers themselves remain a high-level abstract concept.

The essence of the container abstraction is to run processes in an isolated environment. Linux container technologies use lower-level Linux primitives and kernel features to produce container attributes such as the required process isolation. Other Unix OSs implement containers at the OS kernel level; for example, BSD jails or Solaris Zones.

How do containers differ from VMs? There are two key elements:

  • Where processes are run

  • What is actually run

Although the container backend (the runtime) can, technically, be backed by a VM, traditionally speaking, containers are fundamentally different in concept. A VM virtualizes the entire machine and then runs the kernel and device drivers that are then isolated to that VM. This approach provides superb isolation; however, historically, at least, VMs have been considered relatively slow and expensive to create. Containers, on the other hand, all share the same kernel within a host machine, with isolation achieved by using various kernel features to secure processes from one another. Creating a container amounts to forking a process within an existing host machine. This is orders of magnitude faster than instantiating a traditional VM.

Containers versus VMs

The container versus VM debate is blurring in many ways. Specialized, minimal, single-address-space machine images known as unikernels allow for fast VM instantiation. You can replace container backends like runC with VM equivalents such as runV. At the end of the day, the important concern is not what runs your containerized process, but that your process is being run with the appropriate isolation guarantees and resource constraints.

The core Linux primitives and kernel features that produce container attributes include the following:

  • Namespaces to enforce isolation between containers

  • Control groups to enforce resource sharing between containers

We look at both of these kernel features in more detail a little later in the chapter. It is worth keeping in mind that the typical Cloud Foundry developer or operator does not require a deep understanding of containers, because the Cloud Foundry platform handles the container creation and orchestration concerns. However, gaining a deeper understanding is valuable for both the curious and the security-minded operator.

Namespaces

Namespaces provide isolation. They offer a way of splitting kernel resources by wrapping global system resources in an abstraction. The process within a namespace sees only a subset (its own set) of those kernel resources. This produces the appearance of the namespaced process having its own isolated instance of the global resource. Changes to a resource governed by a namespace are visible only to other processes that are members of that namespace; they cannot be seen by processes outside of the namespace.

A process can be placed into the following namespaces:

PID

These processes can see only other processes running inside the same (or a child) namespace.

network

These processes have their own isolated view of the network stack.

cgroup

These processes have a virtualized view of their CGroups root directories.

mount

These processes have their own view of the mount table; mounts and unmounts do not affect processes in other namespaces.

uts

These processes have their own hostname and domain name.

ipc

These processes can communicate via system-level IPC (interprocess communication) only with other processes within the same namespace.

user

These processes have their own user and group IDs. Host users and groups are remapped to local users and groups within the user namespace.

Take, for example, the PID namespace. Processes are always structured as a tree with a single root parent process. A Linux host uses PID 1 for the root process. All the other processes are eventually parented (or grandparented) by PID 1.

A container’s first process—which might have PID 123 in the host—will appear as PID 1 within the container. All other processes in the container are similarly mapped. Processes that are not in the container’s PID namespace do not appear to exist to the namespaced process. Effectively, the namespaced process and the host have different views of the same PID. It is the kernel that provides this mapping.

Upon container creation, the container process is cloned into the newly created namespace. This cloned process becomes PID 1 in the new namespace. If that process then makes a kernel call asking, “What is my process ID?”, the kernel performs the mapping transparently. The process is unaware that it is running within a PID namespace.
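To make this concrete, here is a minimal Go sketch (illustrative only, not Cloud Foundry code) that clones a process into a new PID namespace. It assumes a Linux host and root privileges:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        if len(os.Args) > 1 && os.Args[1] == "child" {
            // Inside the new PID namespace, the kernel maps this
            // process to PID 1, whatever its PID is on the host.
            fmt.Println("child sees PID:", os.Getpid())
            return
        }
        // Re-execute this same binary as the "child", cloned into
        // a new PID namespace.
        cmd := exec.Command("/proc/self/exe", "child")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWPID,
        }
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }

Run as root, this prints child sees PID: 1, even though the host sees the same process under an ordinary host PID.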

As another example, take user namespaces. Consider a user-namespaced process that the host views as UID 4000. If that process is running as the root user within its namespace and asks the kernel, “What is my user ID?”, the namespaced response will be UID 0 (root). The kernel has explicitly mapped the namespaced process such that the process thinks it is root, while the host still knows the process as UID 4000. If the process attempts to open a file owned by the host’s root, the kernel checks permissions against the host filesystem using the host UID 4000, not the namespaced UID 0, and correctly denies the process access to that host file because of invalid permissions.
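The same mapping can be sketched in Go. The UidMappings field below maps UID 0 inside a new user namespace to the unprivileged host UID 4000 from the scenario above (the specific UID is hypothetical, and creating the mapping requires appropriate privileges):

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // Start a shell that believes it is root (UID 0); the host
        // continues to see it as the unprivileged UID 4000.
        cmd := exec.Command("/bin/sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags:  syscall.CLONE_NEWUSER,
            UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: 4000, Size: 1}},
            GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: 4000, Size: 1}},
        }
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }

Inside the shell, id reports uid=0 (root); on the host, ps shows the same shell running as UID 4000.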

Note

The preceding is just an illustration. With the abstraction of containers in Cloud Foundry, host files are not even visible to try to open. For container scenarios outside of Cloud Foundry, however, there can be some value in joining only a subset of namespaces. For example, a process might join another container’s network namespace but not its mount namespace. This allows an otherwise independent container image to share just the network of another container. This approach is known as a sidecar container (see the sketch that follows).
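For illustration, on Linux a process can join another process’s network namespace with the setns system call. A minimal Go sketch using the golang.org/x/sys/unix package (the target PID 1234 is hypothetical; root privileges are assumed):

    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Namespaces apply per OS thread, so pin this goroutine to
        // its thread before switching namespaces.
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()

        // Open the network namespace of the target process.
        fd, err := unix.Open("/proc/1234/ns/net", unix.O_RDONLY, 0)
        if err != nil {
            panic(err)
        }
        defer unix.Close(fd)

        // Join only the network namespace; the mount, PID, and other
        // namespaces are left untouched (the sidecar pattern).
        if err := unix.Setns(fd, unix.CLONE_NEWNET); err != nil {
            panic(err)
        }
        fmt.Println("now sharing the network namespace of PID 1234")
    }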

Security through namespaces

When running processes in a dedicated VM, the responsibility for sharing physical resources is pushed down to the hypervisor. Containers running on a single host, by contrast, rely on the kernel for security concerns, not an isolated hypervisor. Even though container isolation is achieved through namespaces, the namespaced processes still share the underlying resources.

Cloud Foundry provides multilayered security for containers. Principally, Garden—Cloud Foundry’s container creation and management API—does not allow processes to run as root (the host’s root). Garden can run two types of containers:

  • Privileged containers that have some root privileges (useful for testing Garden itself)

  • Unprivileged containers, secured as much as possible; for example, processes running as a pseudo root, not the host root

For tighter security, Cloud Foundry recommends that everything be run in unprivileged containers. A buildpack-built app will never run as root; it will always run as the Cloud Foundry–created user, vcap. In addition, for buildpack-built apps, Cloud Foundry uses a trusted, secure rootfs provided by the platform. The Cloud Foundry engineering team has hardened the rootfs to remove exploits that could enable a user to become root. Building containers from the same known and trusted rootfs becomes a powerful tool for release engineering, as discussed in “Why BOSH?”.

Isolating the use of the root user is not unique to containers; this is generally how multiuser systems are secured. The next layer of security is containerization itself. The act of containerization uses Linux namespaces to ensure each container has its own view of system resources. For example, when it comes to protecting app data, each container has its own view of its filesystem with no visibility of other containers’ files.

For Linux, conceptually, the kernel is unaware that a process is running in a container. (Remember, there is no actual container, only an isolated process.) Without the use of namespaces, any non-namespaced kernel call that provides access to a host resource could allow one process to directly affect another. Namespaces therefore provide additional security for processes running in a shared kernel.

A point to be aware of is that Docker containers can run as root and, therefore, if not mitigated, could potentially compromise the underlying host. To address this vulnerability, Cloud Foundry uses a namespaced root for Docker images, so a Docker container still cannot read arbitrary memory. Assuming that you have PID and user namespaces, you should never have the ability to read a random process’s RAM.

Data Locality

Data is generally regarded as the principal attack surface, as opposed to simply another app process. Therefore, it is critical to maintain data isolation between processes. Container technology uses namespaces to isolate system resources such as filesystem access. Arguably, containers are still less secure than a dedicated VM because VMs do not share memory at the OS level. The only way to access another process’s isolated memory would be to break out of the VM onto the host hypervisor. With containers, if a file descriptor is left open, that file descriptor is more exposed because container processes reside on the same OS. For the Cloud Foundry model, this should not be a major issue because, generally speaking, data is not stored on the local filesystem but rather in a backing service.

CGroups

Namespaces provide containers with an isolated view of the host; however, they are not responsible for enforcing resource allocation between containers. A Linux kernel feature known as control groups (CGroups) is used for resource control, accounting, and isolation (CPU, memory, disk I/O, network, etc.). CGroups enforce fair sharing of system resources such as CPU and RAM between processes. You can apply CGroups to a collection of processes; in our case, a namespaced set of container processes.
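As an illustration, with a cgroup v2 hierarchy mounted at /sys/fs/cgroup and the memory controller enabled, creating a group, capping its memory, and placing a process under that cap amounts to a few filesystem writes. A minimal Go sketch (the group name demo is hypothetical; root privileges are assumed):

    package main

    import (
        "os"
        "path/filepath"
        "strconv"
    )

    func main() {
        // A cgroup is just a directory in the cgroup v2 hierarchy.
        cg := "/sys/fs/cgroup/demo"
        if err := os.MkdirAll(cg, 0o755); err != nil {
            panic(err)
        }
        // Cap the group's memory at 256 MiB.
        if err := os.WriteFile(filepath.Join(cg, "memory.max"),
            []byte("256M"), 0o644); err != nil {
            panic(err)
        }
        // Move this process into the group; the kernel now enforces
        // the limit on it and on any children it forks.
        pid := []byte(strconv.Itoa(os.Getpid()))
        if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"),
            pid, 0o644); err != nil {
            panic(err)
        }
    }

Once the group is empty again, removing the directory tears it down.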

Disk Quotas

Resource limits (rlimits) define things such as how many files can be opened or how many processes can be run. However, the Linux kernel does not provide any rlimit for disk usage; therefore, disk quotas have been established. Disk quotas were originally based on user IDs. However, because in Docker a process can be root, that process could create a new user to get around disk quotas, given that every new user is provisioned with a new disk quota. This loophole allows disk usage to keep growing. As a result, today disk quotas tend to use a layered, copy-on-write filesystem on Linux. This allows Linux to scale to the storage that is available, with the ability to administer and manage the filesystem through a clean interface and a clear view of what is being used.

Filesystems

The filesystem image format comes either from a preexisting container image (e.g., if you use Docker) or is created by Cloud Foundry based on available stemcells. Cloud Foundry–created containers are known as trusted containers because of their use of a hardened rootfs. Trusted containers use a single-layer filesystem known as a stack. The stack works in combination with a buildpack to support apps running in containers.

Note

A stack is a prebuilt root filesystem (rootfs). Stacks support a specific OS; for example, Linux-based filesystems require /usr and /bin directories at their root. Cloud Foundry app machines (Diego Cells) can support multiple stacks.

Upon creation, every trusted container uses the same base image (or one of a small set of base images). The container manager in Cloud Foundry, Garden, can create multiple containers from its base images. Cloud Foundry uses a layered filesystem. If every newly created container required a new filesystem, around 700 MB would have to be copied from the base image. This would take multiple seconds or even minutes, even on a fast solid-state drive (SSD), and would waste storage unnecessarily. The layered approach instead instantiates a filesystem almost immediately, because the shared base layer provides a read-only view of the filesystem.

Here’s how it works (a code sketch of this layering follows the steps):

  1. On every Cell, there resides a tarball containing the container’s rootfs.

  2. When a Cell starts, it untars the rootfs (for argument’s sake, to /var/rootfs).

  3. When Garden creates a new container, it takes in a parameter called rootfs (passing in /var/rootfs) and imports the contents of that directory into a layered filesystem graph.

  4. For this new container and only this container, Garden then makes a layer on top of this rootfs; this layer is the resulting filesystem for this container.

  5. When a second container is created, Garden recognizes that the base layer is already in place and creates a sibling (a second layer) on the base rootfs for the new container.

  6. The host has a tree structure consisting of a single base rootfs for all containers, with each container having a layer on top of that base. All containers are therefore siblings of one another.

  7. On a write, the container copies the written data to its own layer above the base image.

  8. When the container does a read, it first checks the top level and then goes down for the read-only content.
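One common way to implement this layering on Linux is an overlay filesystem, which combines a shared, read-only lower layer with a per-container writable upper layer. A minimal Go sketch under those assumptions (all paths are hypothetical; root privileges are assumed):

    package main

    import "syscall"

    func main() {
        // lowerdir: the shared, read-only base rootfs (one per host).
        // upperdir: this container's private, writable layer.
        // workdir:  scratch space required by overlayfs.
        opts := "lowerdir=/var/rootfs," +
            "upperdir=/var/containers/c1/upper," +
            "workdir=/var/containers/c1/work"

        // The merged mount becomes the container's root filesystem:
        // reads fall through to the base; writes are copied up.
        if err := syscall.Mount("overlay", "/var/containers/c1/merged",
            "overlay", 0, opts); err != nil {
            panic(err)
        }
    }

Garden’s actual filesystem-graph management is more involved, but the copy-on-write behavior is the same idea.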

Untrusted containers (containers, such as Docker’s, that can run as a pseudo root) also often use a layered filesystem. A Docker container image is slightly more complex than just a single filesystem. Docker images are constructed from a Dockerfile, a script containing a set of instructions for building a Docker image, very similar to a Vagrantfile used for building VMs. Every instruction in the script becomes a layer stored on top of the previous layer (see the example that follows this paragraph). The Docker image can thereby build up multiple filesystem layers. In addition to the layered filesystem, Docker images usually also contain metadata such as environment variables and entry points. Cloud Foundry must apply a quota to the rootfs to stop users from pushing containers with, for example, 100 GB of MapReduce data.
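For example, each instruction in a hypothetical Dockerfile such as the following produces a layer (or, in the case of instructions like ENTRYPOINT, image metadata):

    # Base layer: the root filesystem
    FROM ubuntu:16.04

    # New layer: install the Java runtime
    RUN apt-get update && apt-get install -y openjdk-8-jre-headless

    # New layer: add the app artifact
    COPY app.jar /app/app.jar

    # Metadata: how the image should be run
    ENTRYPOINT ["java", "-jar", "/app/app.jar"]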

Container Implementation in Cloud Foundry

There is currently a degree of confusion and misinformation in the marketplace with respect to containers. This is largely because, as described earlier in “What Is a Container?”, the term “container” conflates various concepts. Moreover, the terminology used to describe container concepts tends to differ based on specific implementations. For example, when people refer to running containers, what they are really describing is the running of containers plus a number of other things such as package management, distribution, networking, and container orchestration. Containers are of limited value if considered in isolation because there are additional concerns surrounding them that must be addressed in order to run production workloads at scale; for example, container orchestration, clustering, resiliency, and security.

Cloud Foundry’s container manager, Garden, runs container processes using the runC backend runtime, which is a CLI tool for spawning and running containers.

When Cloud Foundry uses the Garden API to make a container (running a process within runC), a couple of additional things happen:

File/volume system management

Containers require a filesystem. Whether that comes from Cloud Foundry’s rootfs or a Docker image on Docker Hub or some other Docker registry, broadly speaking, the container engine sets up a volume on disk that a container can use as its root filesystem. This mechanism provides filesystem isolation. The filesystem is a path on a disk, and Garden imperatively tells the container runtime, “make this filesystem path the root filesystem of the container.”

Networking

runC has no opinions about the network. The runC API allows Garden to specify that a container should be run within a network namespace. Therefore, Cloud Foundry provides additional code that sets up network interfaces and assigns each container its own IP so that every container is addressable.

Why Garden?

Garden offers some key advantages when used with Cloud Foundry. First, because Garden uses runC for the container runtime, it allows both Docker images and buildpack-staged containers to be run in Cloud Foundry. Therefore, the use of Garden does not preclude the use of Docker images.

The primary reason why Garden is the right choice for Cloud Foundry is that good architecture (especially, complex distributed architecture) must support change. Garden provides a platform-neutral, lightweight container abstraction so that it can be backed by multiple backends. Currently, Cloud Foundry Garden supports both a Linux backend and a Windows backend. This allows Cloud Foundry to support a wider array of apps. For example, Windows-based .NET apps can run on Cloud Foundry along with other apps running in Linux containers such as .NET Core, Java, Go, and Ruby.

The Garden API contains a set of interfaces that each platform-specific backend must implement. These interfaces contain methods to perform the following actions (a hypothetical code sketch of such an interface follows the list):

  • Create/delete containers

  • Apply resource limits to containers

  • Open and attach network ports to containers

  • Copy files to and from containers

  • Run processes within containers, streaming back stdout and stderr data

  • Annotate containers with arbitrary metadata

  • Snapshot containers for zero-downtime redeploys
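To make the shape of such an API concrete, here is a hypothetical, heavily simplified Go sketch in the spirit of these interfaces. The type and method names are illustrative only; they are not the actual Garden API:

    // Package gardensketch is a hypothetical illustration, not the
    // real Garden API.
    package gardensketch

    import "io"

    type ContainerSpec struct {
        Handle     string            // unique container name
        RootFSPath string            // path to the root filesystem
        Limits     ResourceLimits    // resource limits to apply
        Properties map[string]string // arbitrary metadata annotations
    }

    type ResourceLimits struct {
        MemoryBytes uint64
        DiskBytes   uint64
        CPUShares   uint64
    }

    // Backend is what each platform-specific implementation
    // (e.g., Linux or Windows) would provide.
    type Backend interface {
        Create(spec ContainerSpec) (Container, error)
        Destroy(handle string) error
    }

    type Container interface {
        // Run a process in the container, streaming stdout/stderr.
        Run(path string, args []string, stdout, stderr io.Writer) (Process, error)
        // Map a host port to a container port.
        NetIn(hostPort, containerPort uint32) error
        // Copy files in and out of the container.
        StreamIn(dstPath string, tarStream io.Reader) error
        StreamOut(srcPath string) (io.ReadCloser, error)
    }

    type Process interface {
        Wait() (exitCode int, err error)
    }

The real Garden API is richer than this, but each platform-specific backend implements the same kind of contract.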

In Diego, the Garden API is currently implemented by the following:

  • Garden-runC (Garden backed by runC), which provides a Linux-specific implementation of a Garden interface.

  • Garden-Windows, which provides a Windows-specific implementation of a Garden interface.

OCI and runC

As discussed earlier in “Linux Containers”, Linux achieves container-like behavior through isolation, resource sharing, and resource limits. The first well-known Linux container implementation was LXC. Both Docker and Cloud Foundry’s first container manager, Warden, originally used LXC and then built container management capability on top. Both Docker and Cloud Foundry have since moved to employing runC as the technology that spawns and runs the container process.

runC is the reference implementation of the container runtime standard governed by the Open Container Initiative (OCI), formerly the Open Container Project (OCP). The OCI is an open-governance structure for creating open industry standards around container formats and runtimes.

OCI has standardized backend formats and provided the runC reference implementation, which many higher-level systems have now adopted (e.g., Garden and Docker). runC was established primarily through Docker pulling out the non-Docker-specific container creation library (libcontainer) for reuse.

Container implementation should not be a concern for the developer, because it is just an implementation detail. Developers are more productive when focusing on their apps, and Cloud Foundry makes it possible for them to keep their focus on the higher-level abstraction of apps and tasks. So, with this in mind, why focus on runC, which is a specific implementation? Why is runC important?

The answer is unification for the good of all! Container fervor is exploding. Various technologies (such as VMs and containers) are, in some cases, blending together, and in other cases (such as container orchestration), pulling in opposite directions. In this complex, emerging, and fast-moving space, standards serve to unite the common effort around a single goal while leaving enough room for differentiation, depending on the use case. The promise of containers as a source of lightweight app portability requires the establishment of certain standards around both format and runtime. These standards must do the following:

  • Not be bound to a specific product, project, or commercial vendor

  • Be portable across a wide variety of technologies and IaaS (including OSs, hardware, CPU architectures, etc.)

The OCI specification defines a bundle. A bundle is JSON that describes the container, stating, for example, the following (an abbreviated sketch appears after this list):

  • It should have this path as the rootfs.

  • It should have this process and be in these namespaces.
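Here is an abbreviated, hypothetical sketch of such a bundle description (the config.json of the OCI runtime specification); the values shown are illustrative:

    {
      "ociVersion": "1.0.0",
      "root": { "path": "rootfs", "readonly": false },
      "process": { "cwd": "/", "args": ["/bin/my-app"] },
      "hostname": "my-container",
      "linux": {
        "namespaces": [
          { "type": "pid" },
          { "type": "network" },
          { "type": "mount" },
          { "type": "user" }
        ]
      }
    }

Any OCI-compliant runtime should be able to take a bundle like this, plus the rootfs directory it points at, and run the container.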

It is the bundle that should become truly portable. Users should be able to push this OCI bundle to any number of OCI-compatible container environments and expect the container to always run as designed. The debate on the use of a specific container technology such as Garden versus Docker Engine is less important because both of these technologies support the same image format and use runC. These technologies are just an implementation detail that a platform user never sees nor ever needs to be concerned about. Most container orchestration tools are increasingly agnostic about what actually runs the container backend. For example, if you want to swap out runC for runV—a hypervisor-based runtime for OCI—this should not affect the running of your OCI container bundle.

Container Scale

Cloud Foundry is well suited for apps that can scale out horizontally via process-based scaling. Cloud Foundry can also accommodate apps that need to scale vertically to some extent (such as with increased memory or disk size); however, this is bounded by the size of your containers and, ultimately, the size of your container host.

One app should not consume all of the available RAM or disk space on a host machine. Generally speaking, apps that scale memory vertically do so because they hold some state data in memory. Apps that scale disk space vertically do so because they write data to a local disk. As a best practice, you should avoid holding excessive state in memory or writing user data to local disk, or you should look to minimize these and offload them to a dedicated backing service such as a cache or database wherever possible.

Container Technologies (and the Orchestration Challenge)

All mainstream container technologies allow you to deploy container images (at the moment, generally in the Docker image format, although other standards exist) in a fairly similar way. Therefore, as mentioned earlier, the key concern is not the standardized backend implementation, but the user experience of the container technology.

You should not run containers in isolation. Low-level container technology alone is insufficient for dealing with scale and production environment concerns. Here are a few examples:

  • If your container dies (along with your production app), who will notice, and how will it be restarted?

  • How can you ensure that your multiple container instances will be equally distributed across your VMs and AZs?

  • How will you set up and isolate your networks on demand to limit access to specific services?

For these reasons, container orchestration, container clustering, and various other container technologies have been established. As a side effect of the explosive growth in the container ecosystem over recent years, there is a growing list of container orchestration tools including Kubernetes, Docker Swarm, Amazon EC2 Container Service, and Apache Mesos. In addition, some companies still invest in in-house solutions for container orchestration.

Container orchestration is a vital requirement for running containers at scale. Orchestration tools handle the spawning and management of the various container processes within a distributed system. Container orchestration manages the container life cycle and allows for additional functions such as scheduling and container restarts. Container services provide a degree of control for the end user to determine how containers interact with other containers and backing services. Additionally, orchestration makes it possible for you to group containers into clusters so that they can scale to accommodate increased processing loads.

As discussed in Chapter 1 and “Do More”, Cloud Foundry provides the additional capabilities required for running containers in production at scale. At its heart, Cloud Foundry uses Diego as its container orchestration mechanism. (Chapter 6 reviewed Diego in detail.)

Summary

Containers have become immensely popular in recent years because they offer app deployment speed and resource efficiency through greater resource consolidation. They also allow app-stack portability across different environments. All of these benefits are appealing to DevOps cultures that desire to deliver software with velocity and avoid the “it worked on my machine” challenge. Containers become even more critical when working with microservices and CI pipelines, for which rapid, repeatable, lightweight deployment is essential.

The challenge with containers is that, because they have only recently become mainstream, there has been some confusion in the marketplace as to what container solution to adopt and how best to operationalize and orchestrate them in production. The key to obtaining benefits from containers is to move away from concerns surrounding the image format or the backend runtime and ensure that you approach containers at the right level: namely, the tooling and orchestration that allow you to run containers in production at scale.

Here’s how Cloud Foundry supports this approach:

  • Garden, a pluggable model for running different container images. Garden enables Cloud Foundry to support buildpack apps and Docker apps on Linux (via the runC backend) and Windows apps (via the Windows backend).

  • Diego, which handles the container life cycle and orchestration including scheduling, clustering, networking, and resiliency.

When all is said and done, as impressive as container technology is, containers are just an implementation detail and not a developer concern. The key consideration in taking advantage of containers is not the container itself but the user experience of container orchestration. Removing developer concern from container construction and refocusing attention back on apps is vital for app velocity. This is where Cloud Foundry along with its Diego runtime really excels. Chapter 9 looks at Cloud Foundry’s use of containers through buildpacks and Docker images.

1 The terms “VM” and “machine” are used interchangeably because containers can run in both environments.
