Container Storage Interface architectural overview
This chapter describes the architecture of the Kubernetes Container Storage Interface (CSI) and the IBM CSI driver for block storage systems.
Some relevant concepts in Kubernetes and OpenShift are introduced first because they help to explain the separation of duties and provide the background that is needed for deploying the driver.
If you are familiar with the Kubernetes architecture and components, you might start with 3.2, “Kubernetes Container Storage Interface” on page 36, which describes the driver’s integration into the platform.
Kubernetes and OpenShift, and IBM’s block storage CSI driver, are open source projects, and every implementation detail also can be found on the internet. For more information about online resources, see “Related publications” on page 135.
This chapter includes the following topics:
3.1, "Kubernetes"
3.2, "Kubernetes Container Storage Interface"
3.3, "IBM block storage CSI driver"
3.1 Kubernetes
This section describes some of the most relevant concepts of Kubernetes to help the reader understand the IBM block storage CSI driver and how it integrates "upwards" into an OpenShift environment and "downwards" with IBM's block storage products. These concepts, and how the respective components work together, give a basic understanding of how the driver is deployed and configured, and help with troubleshooting.
3.1.1 Kubernetes platform
Over the second half of the 2010s, Kubernetes evolved into the leading container orchestration system in the industry. All relevant cloud providers offer a Kubernetes or OpenShift service. The success of the platform seems to justify referring to it as the "operating system" of a cloud. As with a traditional operating system, it manages resources, which it presents in a consumable way to application developers who then build and run their applications on the platform.
The following basic design principles supported the success of the platform:
Containerize everything
All types of programs or workloads, whether internal to the platform or user applications, are put into containers. These containers are self-contained pieces with all of their dependencies packed into one binary object; that is, the image. Kubernetes orchestrates containers: the platform arranges for containers to run "somewhere" in a cluster and provides all needed connectivity, whether to each other, to the network, to storage, or to other back-end systems.
The Twelve-Factor App (for more information, see this website)
An application design methodology that leads to applications that can easily be deployed, scaled, and, most importantly, automatically managed. Kubernetes' internal components follow that principle, and user workloads that follow it are also a perfect fit for the platform. This methodology is a powerful driver for the development of applications that can be containerized.
Declarative approach
This paradigm contrasts with the procedural approach to which "traditional" system managers and administrators are accustomed. The platform (and everything the platform manages) is not driven by commands that "do something" until they come to an end, be it success or failure.
Instead, Kubernetes holds objects in its internal databases and watches these for changes. Users, administrators, internal and external processes interact with these database objects. After the platform detects a change, it does its best to have the real-world status meet the wanted status of the objects in the database. For example, a command for deploying an application (or the IBM block storage CSI driver) might return immediately and report success, although the platform is still busy combining all of the required pieces.
3.1.2 Control plane
Regarding Kubernetes as the operating system of a cloud, the kernel of this operating system is Kubernetes' control plane. The core components that let the platform manage itself are found here, as is the entry point to the platform.
The significant elements of the control plane are shown in Figure 3-1.
Figure 3-1 Kubernetes components overview
The control plane includes the following components:
API server (apiserver)
This component forms the front end of the control plane and the central point for all communication within the cluster. It offers a RESTful web service that all components in the cluster can use. Any component that must access the platform interacts with the API server, whether it is one of the GUIs that are included with OpenShift, a command-line client (such as the kubectl command), or a user application that wants to interact with Kubernetes objects.
etcd
The distributed and reliable key-value store that forms the foundational Kubernetes object database is called etcd. Anything that must be stored is stored in the etcd cluster, which is at the core of the control plane. Protection of this database is vital for the cluster. If it is lost, the cluster is gone.
For each object the platform can manage, a representation of it exists in the etcd database. Ensure that etcd is backed up or protected by etcd snapshots. For more information about etcd backup, see this web page.
The configuration information of the OpenShift Cluster is stored in this database.
Scheduler
This component makes the ultimate decision about where to run specific workloads. It considers the requirements for a workload and the available machines with their characteristics and allocates the best fit.
 
Controller-manager
This component can be seen as the workhorse in Kubernetes. It contains the built-in controllers, which are the processes that watch for changes on the database objects and perform the necessary action to have real-world status meet the wanted status in the objects. These built-in controllers include the following examples:
 – Node controller: This controller acts on changes to the machines that are forming the cluster. It registers new machines that are joining, or starts recovery actions if it detects a machine failure.
 – Workload controllers: This class of controllers implements the workload controller types that are described in 3.1.5, "Workload controllers" on page 24.
 – Endpoints controller: This controller is responsible for the connection between applications in the cluster and the network.
The cloud-controller-manager
This component often is under the responsibility of a cloud provider. It forms the interface to the underlying cloud infrastructure that differs between different providers, such as Amazon Web Services or the IBM Cloud.
The control plane components are containerized applications, as shown in Figure 3-1 on page 21.
It is good practice to use separate infrastructure or compute nodes for the control plane for high availability, security, and reliability reasons. For example, in the IBM Cloud Kubernetes cluster offering, the control plane of customer clusters is not accessible for customers themselves. Instead, it is managed by IBM and running on separate infrastructure. Access is possible through the API server only.
3.1.3 Nodes
Nodes are Kubernetes' abstraction of real computers, whether virtual or physical machines. As shown in Figure 3-1 on page 21, a node consists of the characteristic entities that are running on the operating system of a machine:
kubelet
This agent runs on every machine in the cluster. It interacts with the API server and the controllers in the control plane to fulfill its central tasks; that is, starting, stopping, monitoring, and deleting containerized workloads on the machine. These tasks also include setting up the running environment for such workloads.
kube-proxy
An application often is useless if it cannot provide its service over the network, or use other services within or outside the cluster. The kube-proxy on each node handles establishing network connectivity for the workloads that are running on the node.
Container runtime
The container runtime on a node can be considered as the motor of the platform. It loads the container images as the kubelet specifies them and runs them. Kubernetes can work with arbitrary runtimes if they fulfill the Kubernetes Container Runtime Interface (CRI); for example, Docker, CRI-O, containerd, or Kata Containers.
 
Note: These components run under the operating system's control on a machine. The kubelet and kube-proxy communicate with the platform through the apiserver, which in turn runs as a set of containers that are placed on the control plane nodes.
To build pools of nodes, the node objects in the Kubernetes database can be labeled (as is possible for other objects, too). By using label selectors, specific sets of nodes can be selected to, for example, run a certain workload. Another way to characterize particular nodes is to add annotations.
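The following fragment illustrates how labels and label selectors work together. It is a minimal sketch; the disktype label and the image reference are hypothetical and stand for whatever labels and images are used in a real cluster. A node is labeled (for example, with kubectl label node worker-0 disktype=ssd), and a pod specification then selects only nodes that carry that label through a nodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: labeled-workload
spec:
  nodeSelector:
    disktype: ssd                            # hypothetical label that a node must carry
  containers:
  - name: app
    image: registry.example.com/app:latest   # hypothetical image reference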
3.1.4 pods
pods are the smallest possible workload entity in Kubernetes. A pod can consist of one or many containers. At first consideration, this proposition seems to be confusing because Kubernetes is a container orchestrator. However, a pod is the complete workload that is scheduled to a node for execution, as shown in Figure 3-2. Kubernetes ensures that all containers that belong to the same pod are also scheduled to the same node.
Figure 3-2 Kubernetes pod
Also, all containers in a pod share the Linux network and Inter-Process Communication (IPC) namespaces. Therefore, all containers in a pod are bound to the same network interfaces and addresses, and they can communicate directly through IPC mechanisms, such as message queues, semaphores, or shared memory segments.
A pod can be configured so that its containers share the process namespace, which means that they see each other in the same process hierarchy. A running container resembles a process in the operating system: its main program has process ID 1 (PID 1) when viewed from within the container, and processes that the container spawns are visible in the container as children of PID 1. If containers share the process namespace, they also see each other's processes, each with its individual PID.
pods are volatile in the sense that they do not maintain state between instantiations. Whatever data a pod produces internally is lost and unrecoverable after it ends. Whenever a pod is scheduled and started, it begins at "day zero" again. Persistent storage must be provided externally, and the CSI driver is one of the ways to provide it.
No start, stop, or pause commands are available for pods. They can be created (run) or deleted only. This fact reinforces the declarative approach in Kubernetes; that is, it is the database object “pod” that is created, and the platform alone does what is needed to have that wanted set of containers run somewhere in the environment. The “create pod” API call returns when the creation of the database object is queued, which does not mean that the wanted workload is running.
Sidecars
The CSI design uses so-called sidecar containers, which are also relevant during the deployment of the IBM block storage CSI driver. The name of the pattern stems from the metaphor of mounting a sidecar to a motorcycle, the motorcycle being the main application. In the context of the CSI driver architecture, we see this pattern applied.
The sidecar pattern is a well-known design pattern in microservices development, and one of the simplest patterns that motivate the concept of pods. It assumes that a containerized main application provides the wanted basic service. Also, one or more “sidecar” containers exist that extend or improve the application container.
For example, imagine an application container that makes available an HTTP-based API. Assume that we want to extend the application with an HTTPS endpoint for secure communication. To do so, we add a sidecar container that handles the added protocol and forwards its incoming requests to the main application container.
The sidecar does not need to know about the specific information of the main application, and vice versa (the main application container does not need any modifications to handle HTTPS requests). Both of the containers are arranged in the same pod so they can directly communicate with each other without exposing any of their internal, potentially unsecured, communication.
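As a sketch of this pattern, the following pod specification combines a hypothetical main application container with a hypothetical TLS-terminating sidecar; both image references are placeholders. Because the two containers share the pod's network namespace, the sidecar can forward the decrypted requests to the main container over localhost:

apiVersion: v1
kind: Pod
metadata:
  name: my-app-with-tls-sidecar
spec:
  containers:
  - name: my-app                                   # main application container (HTTP on port 8080)
    image: registry.example.com/my-app:1.3.3       # hypothetical image reference
    ports:
    - containerPort: 8080
  - name: tls-proxy                                # sidecar: terminates HTTPS and forwards to localhost:8080
    image: registry.example.com/tls-proxy:latest   # hypothetical image reference
    ports:
    - containerPort: 8443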
3.1.5 Workload controllers
Creating and running single pods might seem enough for simple use cases, but the real power of the platform is found in its workload controllers. These controllers help describe specific characteristic workloads and let the platform enforce respective behavior.
One important aspect about the workload controllers is that they take complete ownership of the pods they start and delete. This behavior might confuse a Kubernetes novice. If an administrator deletes a pod that is controlled by a workload controller, the pod immediately is re-created according to the type of controller. This process can occur in parallel to the ending of the deleted pod.
However, this principle can be turned around to fix failing pods. It is often safe to delete a failing pod, or in the worst case, all pods that are managed by the same controller, because they are re-created from their initial state.
In the following sections, we cover the basic controllers as they are used in the context of the CSI driver.
ReplicaSet and deployment controller
In this section, we describe use cases for ReplicaSets and deployment controllers.
ReplicaSet
The typical use case for a ReplicaSet (an object controlled by the ReplicaSet controller) is a workload in which multiple instances of the same pod must run in parallel. A ReplicaSet is used when the workload is characterized by the following conditions:
The pod instances do not need to be distinguished, which means that they all provide the same service independently of each other.
The pods do not maintain state.
The first condition means that no pods with a special role exist; for example, one distinguished leader and several secondary instances.
The second condition means that none of the instances must preserve information between restarts. The ReplicaSet workload is a number of pods where the count of pods can be increased or decreased without affecting the application's function or internal status.
A typical use case for this type of workload is a media streaming service, which can be scaled up or down according to a request rate. The IBM block storage CSI operator is such a workload.
Deployment controller
The deployment controller's most important feature is that it adds a lifecycle concept to the ReplicaSet. For this reason, deployments are favored for the use case that is depicted here.
The deployment can be viewed as a workload controller that rolls out ReplicaSets. By doing so, it can roll out a new version of a pod and roll back a failing deployment. The deployment controller can implement different rollout strategies, such as scaling down the number of pods in an active ReplicaSet while in parallel scaling up the number of pods in a ReplicaSet with a new pod version.
In our example, my-app is our application, for which we have a container image my-app that is tagged at version 1.3.3. Our application pod needs to run only this single container image.
We want to scale our application to have two replicas active, and we do this with a deployment in which we also specify an update strategy of RollingUpdate.
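A minimal Deployment manifest that reflects this example might look as follows. The registry path is a placeholder; only the replica count, the update strategy, and the image tag 1.3.3 are taken from our example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.3.3   # hypothetical registry path; tag from our example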
After the my-app deployment object is created in the platform, the deployment controller detects no respective ReplicaSet exists and creates one, giving it a unique name with my-app- as a prefix.
The ReplicaSet controller then discovers that no respective pods exist for that ReplicaSet and creates the pods, again with unique names that are prefixed with the ReplicaSet name.
The scheduler then looks for nodes that can fulfill the respective resource requirements of the application. Eventually, the kubelets on the selected nodes spin up the pods by having the container runtime start the containers.
Figure 3-3 shows the final situation: two pods were created, one on node worker-0 and the other on worker-3. Both pods run the same image in their respective containers.
Figure 3-3 After Deployment is created
Assume that we built a new release of our application so a new my-app image is available, which is tagged as version 1.3.4. To deploy the new version, it is sufficient to update the image reference in the my-app deployment object for our new release.
The deployment controller then detects the change and starts the upgrade according to the Rolling Update strategy. It creates a ReplicaSet with the new image reference, and scales it up while it scales down the old one, which ends the pods.
Figure 3-4 shows an intermediate situation, where the ReplicaSet with version 1.3.3 is scaled down to one replica, and the new ReplicaSet with version 1.3.4 is scaled up to one replica.
Figure 3-4 Updated Deployment in progress
The decision about where the new pods are placed is made exclusively by the scheduler, based on the current use of the cluster. Therefore, one of our application pods might end up on a different worker than before (as shown in Figure 3-4 on page 26), or the scheduler might decide on the same worker again. The latter case is illustrated in Figure 3-5, which shows the final state after updating our image version in the deployment. The new pod, with our release 1.3.4, is on worker-0, where a pod with the 1.3.3 image was running earlier.
Figure 3-5 New release of “my-app” is deployed
DaemonSet
A DaemonSet workload is characterized as a pod that runs on all (or a distinguished set of) nodes. Therefore, whenever a new node enters the cluster, a pod is placed there. When a node exits from the cluster, the respective pod is not restarted elsewhere, as it would be with other workload controllers (see Figure 3-6 and Figure 3-7 on page 28).
Figure 3-6 DaemonSet before worker-3 joins the cluster
Figure 3-7 DaemonSet after worker-3 has joined the cluster
A typical use case for a DaemonSet is a monitoring application that monitors activity on all nodes in the cluster. Log collection on nodes is also a popular pattern for a DaemonSet. We see the node component of the IBM block storage CSI driver being deployed in a DaemonSet.
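A sketch of such a DaemonSet follows. The monitoring image is a placeholder; the important point is that no replica count is specified because the controller runs exactly one pod per matching node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-monitor
spec:
  selector:
    matchLabels:
      app: node-monitor
  template:
    metadata:
      labels:
        app: node-monitor
    spec:
      containers:
      - name: monitor
        image: registry.example.com/node-monitor:latest   # hypothetical monitoring image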
StatefulSet
A StatefulSet can be a complementing concept to the ReplicaSet. It manages a specific number of pods that are running the same code. However, a StatefulSet features the following set of conditions:
The pod instances must be identifiable
The distinguished pods maintain individual state across instantiations
Identifiable pods means that the pods, although running the same application, can have different roles. For example, a leader might be among them. The etcd cluster that is described in "Control plane" is an example of that use case.
If a pod from an etcd cluster is rescheduled to another node, its database also must go there and it must preserve the state it had on the former node.
Distinguishing the pods might also be an application requirement. While in a DaemonSet or a ReplicaSet the different pod instances do not know about each other, a StatefulSet allows the different instances to identify and communicate with each other.
Our example that is shown in Figure 3-8 on page 29 shows a deployed StatefulSet with three pods. Our specification also states that each of the distinct pod instances requires 10 GB of persistent storage to keep its state. We use symbolic names for the provisioned storage parts in our figures; in reality, they look a bit more cryptic. The pod names appear as shown.
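A StatefulSet specification that matches this example might look as follows (the image reference and mount path are placeholders). The volumeClaimTemplates section causes the platform to create an individual 10 GB PVC for each pod instance, and the pods receive the predictable names my-app-0, my-app-1, and my-app-2:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: my-app                              # assumed headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.3.3   # hypothetical image reference
        volumeMounts:
        - name: data
          mountPath: /var/lib/my-app               # hypothetical mount path for the state
  volumeClaimTemplates:                            # one 10 GB PVC per pod instance
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi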
We assume that the pods in the StatefulSet were deployed to the worker nodes worker-0, worker-2, and worker-3. On worker-1, other pods might be running that are out of the scope of our StatefulSet.
We also assume that node worker-2 quits the cluster for whatever reason, be it an operating system upgrade, hardware defect, or network issue. The platform notices the loss of that node and deletes the pod objects that were scheduled to the node.
Then, the StatefulSet controller can discover that one of the wanted replicas is missing, and it creates the pod again.
Figure 3-8 StatefulSet with three pod instances deployed to three nodes
The newly created pod then proceeds with the normal scheduling process in Kubernetes, which can result in placing the pod onto node worker-1, as shown in Figure 3-9.
Figure 3-9 Pod and storage rescheduled in a StatefulSet after worker-2 quits the clusters
In contrast to the ReplicaSet or DaemonSet, the StatefulSet controller manages the following special circumstances:
The exact pod instance that was lost (here, my-app-1) must be rescheduled.
The data that the pod uses must "move with the pod" to the node where the pod is now running.
When discussing the CSI driver in "Kubernetes Container Storage Interface" on page 36, we return to the respective challenges for the implementation of a persistent storage concept in a highly dynamic environment.
The users of persistent storage that is provisioned with the CSI driver are typical candidates for StatefulSet workloads.
3.1.6 Persistent storage
As shown in section "pods" on page 23, pods do not keep their state when they are restarted, whether because the platform decides to reschedule them or because an administrator or user deletes and re-creates them.
This principle alone does not allow the design of stateful applications that need to maintain state or to share persistent data among multiple pods. Kubernetes solves this problem with its concept of persistent storage.
Several objects and controllers come together to fulfill the need for storage. The architecture might seem confusingly complicated at first, but it is useful. It evolved over several Kubernetes releases into a robust design with a clear separation of duties, as the implementation of the CSI driver demonstrates.
The concepts around persistent storage are shown in Figure 3-10. The storage-related aspects are shown on the left side of the figure; the application, or pod-related entities, are shown on the right side.
Figure 3-10 Persistent storage overview
Volumes
At the pod level, volumes are the objects that provide persistent storage. The containers within the pod use volumes by mounting them into their file system hierarchy. These volumes are shown as the smaller orange disks close to the containers in Figure 3-10. Although these volumes are the object of an application's requirement, Kubernetes adds some wrapping concepts around these mountable volumes, as described next.
PersistentVolumeClaims
Kubernetes manages volumes with persistent storage through PersistentVolumeClaim (PVC) objects. This kind of object describes the characteristics of a specific piece of storage from an application's perspective. PVCs offer an abstraction of storage that is usable for pods. They are created together with the pods that want to use the storage, and their characteristics are determined by the application (a sketch of such a PVC follows this list):
Size of the storage: How much space does the pod need on the volume?
Access mode of the storage: Does it need to be read or written by one or many pods at the same time?
Volume mode: Does this volume contain a file system or is it a block device?
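The following sketch shows how these three characteristics appear in a PVC manifest. The name pvc-1 and the size are chosen to match our figure and are placeholders:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1
spec:
  accessModes:
  - ReadWriteOnce          # access mode: read and written by pods on one node at a time
  volumeMode: Filesystem   # volume mode: a file system rather than a raw block device
  resources:
    requests:
      storage: 10Gi        # size of the storage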
In Figure 3-10 on page 30, we see the light green disk symbols close to the pod. Although a PVC describes storage from a pod's perspective, PVCs also can exist independently of any pod because the storage is persistent. The data in that storage exists independently of whether it is actively used. This situation of having PVCs without pods referring to them is normal.
Consider the following points:
A PVC can be referenced by an arbitrary number of different pods: zero or more pods can use it at the same time for ReadOnlyMany (ROX) or ReadWriteMany (RWX) access, or different pods can refer to the same PVC at different points in time for ReadWriteOnce (RWO) access. For example, the same PVC can be used by one pod for data generation, while another pod can use the generated data later for analysis.
A pod can refer to a specific PVC that does not exist. This aspect of Kubernetes can be confusing: a pod that refers to a non-existent PVC can be created without issues. The respective CLI call returns successfully as soon as the pod object exists in the Kubernetes database. However, the pod cannot start until the PVC exists and is bound to a PersistentVolume, which we describe next.
PersistentVolumes
A PersistentVolume (PV) object (shown in Figure 3-10 on page 30 as the darker green disk symbols on the left) describes some real, available piece of persistent storage. In contrast to the PVC, the PV describes how a storage provider, or a storage back-end, looks at persistent storage.
At this point, the question often arises as to why two different concepts exist that essentially describe the same thing. The answer can be found in the declarative paradigm of the platform.
A PV declares something a storage provider can offer, while a PVC declares something a storage consumer wants. We can leave it to the platform that takes the responsibility to bring the things together. This functional separation allows an application developer to define the PVCs for the application independently of the details of the underlying available storage.
Likewise, a storage administrator does not need to "mount" or "assign" storage to any specific application. Both roles (application developer and storage administrator) can work fully decoupled in their respective domains, and let the platform put the pieces together.
How does the platform match PVs and PVCs? The PV and PVC must provide the same access mode, the PVC's size requirement must be less than or equal to the size that the PV provides, and the volume modes must be compatible. If the platform finds a matching PV for a newly created PVC, or, vice versa, a newly created PV satisfies the demands of an existing PVC, the PV and PVC are bound to each other.
The double-ended arrows between PV and PVC objects that are shown in Figure 3-10 on page 30 illustrate this concept. Bound and unbound PVs and PVCs are visible in Figure 3-10 on page 30, which is a natural state, as we explain next.
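For illustration, a statically pre-provisioned PV might look as follows. The driver name and the volumeHandle are hypothetical; they identify the piece of storage from the storage provider's point of view:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-1
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  csi:
    driver: csi.example.com    # hypothetical CSI driver name
    volumeHandle: vol-0001     # hypothetical identifier of the volume on the back-end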
Dynamic volume provisioning
In the simplest case, a storage administrator creates some consumable pieces of storage on the storage back-end systems and makes them available for the platform by creating respective PV definitions. Applications can then claim or allocate these pieces through PVCs. However, this approach has the following flaws:
Our storage administrator must create the storage pieces on the back-end system and then translate the "real" objects into their Kubernetes PV representation, which is a possibly error-prone process.
The storage pieces our administrator provides must fit the possible demands in the PVCs. They must accommodate the largest possible PVC size; however, small PVCs waste much storage space if only larger PVs are available.
Although sufficient storage is available on the back-end system, PVCs can remain unsatisfied because the platform ran out of PVs of the needed size.
To overcome these drawbacks, Kubernetes introduced the concept of dynamic volume provisioning. With dynamic provisioning, the platform takes the role of our storage administrator. However, instead of creating PVs in advance for a demand that might not exist, the platform watches the creation of PVCs and follows up by creating PVs.
StorageClass
When we look at dynamic volume provisioning, we observe that some automation must take over the storage administrator's role. Such automation is specific to a distinct storage back-end, which leads to the concept of a storage provisioner. A provisioner is code that can allocate storage on a specific back-end. Kubernetes provides an abstraction of such a provisioner in StorageClasses.
StorageClasses put together a reference to the provisioner plug-in (the code) and parameters that the plug-in needs to perform its operations. The storage administrator's role is no longer to preallocate storage and create PVs. Instead, the role is to define the available StorageClasses and to preallocate some capacity; for example, in a dedicated storage pool on the storage system.
It is left to the application developer to specify the StorageClass in the PVC for the application pods. The light green double-sided arrow that is shown in Figure 3-10 on page 30 connects pvc-1 with the storage back-end and shows how a StorageClass works for dynamic provisioning.
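The following sketch shows the shape of such a StorageClass. The provisioner name and the pool parameter are placeholders for whatever the concrete plug-in expects; a PVC selects this class by setting storageClassName: fast-storage:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: csi.example.com   # hypothetical reference to the provisioner plug-in
parameters:
  pool: pool-a                 # hypothetical parameter that the plug-in needs for its operations
reclaimPolicy: Delete          # delete the back-end volume when the PV is released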
Now that we have dynamic volume provisioning and StorageClasses, binding PVCs and PVs can be done by completing the following steps:
1. A user deploys an application that specifies pods and PVCs. The PVCs reference a StorageClass that the storage administrator provided and advertised to the users.
2. The platform detects new PVCs that include no matching PVs in their StorageClass and starts the StorageClass’ provisioner.
3. The provisioner allocates a matching piece of storage on the back-end and creates a respective PV for it.
4. The platform now finds matching PVs for the PVCs and binds them.
This simple model does not cover capacity management, which is out of scope here. The common solution is to monitor the capacity of the storage back-end systems and alert the storage administrators if capacity runs short.
Volume snapshots
In addition to creating and managing persistent storage for stateful pods, Kubernetes also allows snapshots to be taken of volumes that are in use by a pod. This data, which represents a state at a specific point, can then be used as initial content for new volumes. A typical use case for this scenario is database backups; for example, when preparing a bigger modification, such as a schema migration. Having a snapshot can support the development and verification of the migration, and allows rolling back if failures occur.
Similar to PVCs and PVs, Kubernetes separates the concepts into an application-driven view and a storage administration view. The respective objects are:
VolumeSnapshots
These are the analog of PVCs. Applications can make use of VolumeSnapshots without the need to understand the underlying storage back-end's details.
VolumeSnapshotContent
This is the analog of PVs. The storage back-end "knows" how to map VolumeSnapshotContent objects to the respective available options on the back-end.
As with the PVC/PV case, VolumeSnapshots and VolumeSnapshotContent objects belong to each other, although instead of "binding", the relationship is expressed through a "ready to use" property. After a VolumeSnapshot has a corresponding VolumeSnapshotContent, it can be used as the data source for the creation of new PVCs that refer to the VolumeSnapshot object. The bound PV's initial content then consists of the respective VolumeSnapshotContent.
VolumeSnapshots are available for Container Storage Interface (CSI) drivers only, which includes the IBM block storage CSI driver. It is an optional capability, which the IBM CSI driver supports starting with version 1.3 on OCP 4.4 or Kubernetes 1.17.
Similar to PVs and PVCs, a storage administrator can pre-provision VolumeSnapshotContent objects by taking snapshots of volumes on the storage system. Alternatively, a VolumeSnapshotContent object can be created dynamically by specifying the PVC from which the snapshot is to be taken.
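The following sketch shows a dynamically provisioned snapshot of pvc-1 and a new PVC that uses it as its data source. The VolumeSnapshotClass and StorageClass names are placeholders, and the API version might be v1beta1 on older clusters such as OCP 4.4 with Kubernetes 1.17:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pvc-1-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical VolumeSnapshotClass name
  source:
    persistentVolumeClaimName: pvc-1       # take the snapshot from this PVC
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-1-restored
spec:
  storageClassName: fast-storage           # hypothetical StorageClass name
  dataSource:
    name: pvc-1-snap                       # initial content comes from the snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi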
3.1.7 Application configuration
We covered the stateless nature of pods and how persistent data is made available to them. Another important aspect is how to bring configuration information to pods effectively.
Because we expect one or more pods to implement a meaningful application from more or less generally usable container images, we do not want to have to place the required configuration information into these images.
Think of such things as parameters that differ between production, test, or development releases of an application, or access tokens and other external references to back-end systems. Not only is it inefficient to include this information in images, it also is overkill to use persistent storage for these bits of information.
 
Kubernetes offers two handy concepts for configuration information. These concepts are described next.
ConfigMaps
With ConfigMaps, Kubernetes offers an effective way to pass configuration information to applications. ConfigMaps are simple key-value stores within the Kubernetes object database that keep related information together. Pods can then refer to ConfigMaps to set up individual attributes. ConfigMaps are useful for the following purposes:
Setting environment variables for containers in pods.
Providing command-line arguments for containers.
Providing configuration file content that can be used by containers.
Containers can also use the Kubernetes API programmatically to access data in a ConfigMap.
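The following sketch shows a ConfigMap with hypothetical keys and a pod that imports all of its keys as environment variables; the image reference is also a placeholder:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-config
data:
  LOG_LEVEL: "info"                              # hypothetical configuration key and value
  BACKEND_URL: "https://backend.example.com"     # hypothetical external reference
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: registry.example.com/my-app:1.3.3     # hypothetical image reference
    envFrom:
    - configMapRef:
        name: my-app-config                      # all keys become environment variables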
Secrets
Although ConfigMap content is a good fit for any clear-text information, it should not be used for information that is confidential. For this reason, Kubernetes offers Secrets.
Small pieces of confidential information (for example, usernames, login passwords, or API tokens) can be stored in Secrets. This information can be stored encrypted. Often, users do not want to have such information in clear text in a pod specification or part of container images.
Instead, a container can mount a Secret at some point in its file system hierarchy and then access the data through a file in this file system. A second possibility is to pass data from Secrets to container environment variables. Finally, a special use case for Secrets is image pull secrets, which are the access credentials that the kubelet passes to the container runtime so it can pull container images from protected image registries that require authentication.
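A sketch of the file-based access method follows. The Secret keys and values are placeholders; the pod mounts the Secret as files under /etc/credentials:

apiVersion: v1
kind: Secret
metadata:
  name: backend-credentials
type: Opaque
stringData:                      # hypothetical confidential values; stored base64-encoded by the platform
  username: admin
  password: change-me
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: registry.example.com/my-app:1.3.3   # hypothetical image reference
    volumeMounts:
    - name: creds
      mountPath: /etc/credentials              # each Secret key appears as a file in this directory
      readOnly: true
  volumes:
  - name: creds
    secret:
      secretName: backend-credentials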
3.1.8 Extension points in Kubernetes: The operator pattern
One of Kubernetes' strengths is its extensibility. By design, it offers various extension points which, according to the respective use case, allow flexible extensions to the platform. Such extensions can be more commands for the CLI, specific authentication or authorization for API requests, or more resources the platform can manage.
We concentrate on one specific way of extending the platform because it is the foundation of the value that is added with OpenShift, and the principle that is behind the CSI driver in general and the IBM block storage CSI driver in particular.
This method is the operator pattern. Its building blocks are CustomResources, their definitions and controllers, and the embracing concept of a Kubernetes operator.
Custom resources
Kubernetes allows its API to be extended with arbitrary resources. These resources are collections of user-defined objects in the Kubernetes database. Such custom resources (CRs) can be managed by the platform, as with the built-in pod resource, for example.
Custom resource objects can be added to or deleted from collections on the CLI, or programmatically from within the environment. The only task left to the CR designer is to make the resource's structure known to Kubernetes so it can suitably handle these objects.
A common way to do this is by using a CustomResourceDefinition (CRD). CRDs are themselves resources that are managed by Kubernetes. To add their own objects to Kubernetes, users write a CRD and hand it over to Kubernetes. From that moment on, the platform can also manage these objects.
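The following sketch shows the shape of such a CRD for a hypothetical Widget resource in the made-up API group example.com. After it is applied, the platform can store and serve Widget objects like any built-in resource:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com          # must be <plural>.<group>
spec:
  group: example.com                 # hypothetical API group
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              size:
                type: integer        # hypothetical field that a controller could act on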
Just having a collection of new objects in a CR is useful, but often the intention of new resources is to make some use of them. Just as the built-in pod resource carries the semantics of letting a set of containers run somewhere in the cluster, we expect CRs to also include some semantic aspect. It is here where resource controllers come into play.
Figure 3-11 shows how controllers fit into Kubernetes' programming model. A controller declares its interest in changes (add, update, and delete) to resources (regardless of whether they are built-in or CRs) with informers. It also provides suitable code that acts on these events. These actions can themselves change other resources, which is done through calls to the apiserver. These changes to resources occur in the etcd database and create events, which in turn cause their controllers to take action, and so on.
Figure 3-11 Basic principle of a resource controller
This illustration also explains why creating a pod with a valid pod specification through the CLI returns successfully immediately, although at that point it is unclear whether this pod can find a node in the cluster on which to run.
Several other controllers in the platform watch for the creation of pod objects and take actions. However, these actions might fail or time out. For example, the scheduler might not find a node with sufficient memory available. Because the scheduler cannot know whether this is a temporary problem, it postpones the decision, which leaves the pod in a "Pending" state.
Operators
We described the internal mechanics of the platform in a high-level overview, and discussed how Kubernetes helps us define CRs, and how controllers for CRs work. Putting the pieces together brings us to the operator pattern. An operator is a CRD and a controller for the defined CRs. However, the question arises as to why an operator is needed.
The driving force is automation: Imagine a scenario in which an application developer creates an application and now wants to deploy it in the cluster. The application can consist of several back-end pods that must cooperate on some shared storage, and some other pods that implement a user front-end.
One possible way of deploying the application can be to manually create the respective pod specifications and then manually start the CLI to create them individually. Alternatively, a deployment for the front end, and a StatefulSet for the backend can be a more effective way. However, what if another version of the application is needed in a test version, or must be deployed into a development environment?
It is here that the operator comes into play. Instead of manually deploying the application with its varying parameters and sizing in different versions, a CRD can be set up for the application as a whole. This CRD describes the mandatory and optional deployment parameters for the application. Creating a respective controller, which performs all the steps that our developer would otherwise complete manually, enables the application to be deployed automatically.
This pattern of a CRD together with a controller implementing the operational behavior for that type of resource is what Kubernetes calls an operator. The basic idea is that an operator performs the actions that a human operator does in an automated fashion. The OpenShift platform uses the operator pattern extensively to provide value regarding the operations of Kubernetes clusters.
3.2 Kubernetes Container Storage Interface
With Kubernetes 1.13, the CSI officially became a part of Kubernetes. Its primary goal is to decouple the implementation of container storage back-ends from the main Kubernetes code base.
Before CSI, the support for many storage back-end plug-ins required integration with the Kubernetes code base. As a result, even a bug fix in a plug-in's code required rebuilds of the entire platform. These plug-ins are often called "in-tree" because they are part of the Kubernetes code tree.
CSI aims at overcoming this limitation. By defining an interface between storage back-end providers and the Kubernetes platform, development can advance in a decoupled manner.
New storage back-ends can be implemented and tested independently from the Kubernetes platform, as can the CSI pieces on the Kubernetes side. The CSI specification was developed with arbitrary container orchestrators in mind. Therefore, it is general enough to be used by other platforms, such as Cloud Foundry.
The CSI specification, which defines the functions that a storage driver must implement, comes with some valuable assets. One is a deployment proposal, which describes how the deployment of a CSI driver for a specific storage back-end can be structured. The other is a set of containers that can readily be used to support simple integration with the Kubernetes platform, notably the Kubernetes side of CSI. This combination of assets underpins the intention to provide a general container orchestrator solution.
For more information, see the following GitHub resources:
3.2.1 Volume lifecycle
This section gives a short introduction to creating volumes and volume-to-node attachment.
Creating PVCs
One of the significant advantages of the CSI driver is its ability to dynamically provision volumes. A storage administrator no longer needs to pre-provision pieces of storage on the back-end system that might not fit the demand and that leave unused capacity behind. Next, we describe how the CSI driver implements this dynamic provisioning.
The starting point is that a PVC is created from the deployment of an application into the cluster. For dynamic provisioning, the PVC must specify, in addition to its size, volume mode, and access mode, a StorageClass (in our example, one that refers to a CSI driver). The StorageClass also contains the storage pool information and the Secrets. After such a PVC is created in the cluster, the following workflow is triggered:
1. Kubernetes’ persistent volume controller detects the creation of a PVC object and because it is not bound to a PV, the controller first tries to find a matching unbound PV.
2. If no PV is found, the controller checks whether the PVC specifies a StorageClass. If so, it delegates PV provisioning to the respective provisioner plug-in that is referenced in the StorageClass.
3. The provisioner allocates a suitable piece of storage on the back-end and creates a PV object from it in the database. As a side task, it also annotates the PVC with the PV’s name.
4. The persistent volume controller detects that a PV was created and because the new PV includes a reference to the requesting claim in it, the controller can bind PV and PVC.
When the requested PVC is in a “Bound” state, the respective volume can be accessed by a requesting pod. Once again, all of these operations (PVC creation, PV creation, and pod creation) occur asynchronously and potentially independently from each other.
Volume: Node attachment
Volume attachment, which means making a storage volume accessible on a node so that pods can use it, is a complex task. The attach/detach controller is the Kubernetes instance that manages these aspects. It watches events on many resources: pods, nodes, PVCs, PVs, VolumeAttachments, and CSI objects.
Pod updates are the most prominent triggering events, and the scheduler that places pods onto nodes is the primary source of these events. The attach/detach controller is a central instance and must know about all the defined volumes, nodes, and pods and their relationships in the cluster. It interfaces with all storage plug-ins, both in-tree plug-ins and external storage providers, where CSI is considered one of the latter.
What happens if a pod is placed on a node and its volumes must be attached to the node? Consider the following points:
Pod updates or additions with a changed node attribute cause the attach/detach controller to internally create a (pod, volume, node) tuple for each volume that the pod uses. The controller maintains these tuples in a private "desiredWorld" structure. It also holds a private "actualWorld" structure, in which it records the actual assignments.
Periodically, a reconciler loop in the controller scans the desiredWorld and the actualWorld to transform the actual status into the wanted status.
First, it triggers detachments for all (pod, volume, node) bindings that are found in actualWorld but not in desiredWorld. Then, it triggers attachments for all (pod, volume, node) bindings in desiredWorld but not in actualWorld.
The respective attachment or detachment operations are implemented in the storage plug-ins for the volumes in question. These operations run in asynchronous threads or processes. Therefore, the attach/detach controller also must manage timing conditions and multiple repetitions of the same request. For example, it does not trigger attachment for a volume if an attach operation is already in progress.
For the attachment of a volume, the storage plug-in creates a VolumeAttachment object and waits for the external-attacher to mark the volume as attached to the node.
After it is attached, the attach/detach controller no longer sees a difference between the desiredWorld and actualWorld status. The volume is now ready to be used by the pod on the targeted node.
Similarly, if a node quits the cluster or a pod ends, the desiredWorld status is updated by removing the respective references for the node or pod. Instead of attaching volumes, the storage plug-in runs detach operations.
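For illustration, a VolumeAttachment object for our example node might look as follows. The object name and the PV name are placeholders; the status section is filled in by the CSI external-attacher after the back-end operation succeeds:

apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-0a1b2c3d                 # hypothetical generated name
spec:
  attacher: block.csi.ibm.com        # the CSI driver that must perform the attachment
  nodeName: worker-0
  source:
    persistentVolumeName: pv-1       # hypothetical PV that is attached to the node
status:
  attached: true                     # set after the volume is attached on the back-end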
3.2.2 CSI driver deployment
In this section, we review the main CSI driver building blocks: the CSI controller and the CSI node (see Figure 3-12).
Figure 3-12 CSI components
CSI controller
The CSI controller's main responsibility is to provision and deprovision storage on the back-end systems. The CSI controller pod contains some sidecars that are responsible for the communication with the Kubernetes API server. They communicate with the controller container through the gRPC calls that are laid down in the CSI specification, namely the Controller service.
Next, we describe the major calls to the controller, and leave it up to the reader to consult the documentation for more information.
CreateVolume and DeleteVolume
These calls are often made by the external-provisioner. As we saw in "Volume lifecycle" on page 37, after a PVC that refers to a StorageClass for the CSI driver is created, the Kubernetes in-tree volume controller looks for a suitable PV, which generally does not exist yet.
If a PVC's StorageClass refers to a CSI-managed volume, the respective plug-in contacts the external-provisioner, which calls CreateVolume(). The CSI controller then carves an appropriate piece of storage out of the back-end, and on successful return from CreateVolume(), the respective PV object in the platform is created by the external-provisioner.
This process eventually allows Kubernetes' volume controller to bind the new PV with the PVC when it runs through its reconciliation loop. A DeleteVolume() call is available to have the CSI controller delete the provisioned storage and the external-provisioner delete the respective PV.
At this point, we have not yet discussed the origin of volume content. In addition to provisioning a plain volume without any content, two other options are available: if the back-end and driver support it, a volume can be created from another, existing volume, which results in a clone; or a volume can be created from a VolumeSnapshot.
For more information about CSI volume cloning and CSI volume snapshots, see this Red Hat documentation web page.
CreateSnapshot and DeleteSnapshot
The calls that deal with VolumeSnapshots do not differ much from the respective CreateVolume or DeleteVolume calls. They start respective procedures on the storage backend to create or remove snapshots from volumes on the storage backend.
ControllerPublishVolume and ControllerUnpublishVolume
After the Kubernetes scheduler decides to place a pod onto a specific node, the pod's volumes also must be made accessible on that compute node. The need to attach a volume is expressed in VolumeAttachment objects, which are watched for changes by the external-attacher sidecar in the CSI controller pod. If a VolumeAttachment appears or its node reference changes, the sidecar starts the respective calls against the CSI controller.
CSI node
The CSI node containers run on each compute node (or at least on those nodes where the respective storage back-end can be used), which makes a perfect case for deploying them in a Kubernetes DaemonSet.
A CSI node's main task is to bring the volumes from the storage back-end and the respective storage driver of the node operating system together. This process often includes mounting the storage volume to some mount point in the node operating system. After it is assured that the mount point is available and the correct volume is mounted there, the container runtime can finally mount the volume into the consuming container's file system hierarchy. However, this last step is out of the scope of this publication.
In Figure 3-12 on page 38, we see a node-driver-registrar sidecar that is collocated with the CSI node container in the CSI node pod. This sidecar allows the CSI node container to register with the kubelet on the node by providing the reference to the endpoint where its gRPC functions can be called.
After the connection is established through a UNIX domain socket, the node-driver-registrar generally no longer plays a role. From then on, the CSI node container does not communicate with the kubelet other than by responding to the gRPC calls.
NodeStageVolume and NodeUnstageVolume
For specific types of storage, it might be necessary to perform preparatory steps before some back-end volume can be made accessible on a node. The optional NodeStageVolume calls enable this accessibility. For example, in the IBM CSI node case, the NodeStageVolume call ensures that a volume that is intended to be mounted in NodePublishVolume is seen as a multipath device by the operating system.
NodePublishVolume and NodeUnpublishVolume
The NodePublishVolume is the final step in provisioning a volume from a CSI-managed storage backend to a node from where standard Kubernetes mechanisms can present it to a pod’s containers. These mechanisms and all other CSI Node procedures are directly called from the kubelet. The kubelet is the instance on a compute node that is responsible for taking pod specifications and bringing the respective containers to life on the container runtime.
This task can be accomplished successfully only if all of the requested volumes for a pod are available on the node so that they can be mounted by the containers. The CSI controller's ControllerPublishVolume procedure does what must be done to make a volume available on a node from the storage system's perspective. Now, with NodePublishVolume, the loop closes; that is, the node accesses the dynamically provisioned and attached storage.
CSI identity services
The CSI architecture is built on a plug-in pattern. Stubs in the platform are complemented with concrete implementations: the CSI controller and the CSI node. However, the platform needs information about the implemented features of these components. Therefore, the CSI components must offer some identity services. Some are common for both controller and node, and some are needed for the node or controller only.
Common identity services
These services must be implemented by both the controller and the node component. They provide information about the respective component to the platform. Their task is to identify the component in the platform (GetPluginInfo) and its capabilities (GetPluginCapabilities). In addition, a Probe function returns a success response when it is called so that the platform knows that the component is ready to communicate.
Component identity services
To determine the specific capabilities of the components, the controller and the node provide a ControllerGetCapabilities or a NodeGetCapabilities call, respectively, which informs the platform about the special features that the component supports. For example, the ability to provision volumes from a snapshot is one of the possible controller capabilities.
An important information call is available on the node component: NodeGetInfo. This call is used to create a unique node identification for the CSI environment. As the IBM block storage CSI driver shows, this call encodes the node name together with the identifiers that are needed to connect the node to the storage back-end. This information is important for the controller, which must provision storage for a specific node on a specific back-end.
3.3 IBM block storage CSI driver
Thus far, we covered the common CSI driver concept and how it integrates with Kubernetes as the container orchestrator of our choice. We also know that the CSI specification governs the communication between the Kubernetes-provided sidecars and the storage vendor-provided CSI controller and CSI node components.
In this section, we review the IBM block storage CSI driver’s specific parts. We see that it not only comprises the expected controller and node component, it also includes an operator. Because the IBM block storage CSI driver is an open source effort, see this web page for more information about the code.
3.3.1 IBM CSI Operator
In the “Extension points in Kubernetes: The operator pattern” on page 34, we explained the operator pattern. Here, we describe the CSI Operator that plays a central role in IBM block storage CSI driver deployments. For more information about its source code, see this web page.
IBMBlockCSI resource
As described in "Extension points in Kubernetes: The operator pattern" on page 34, the operator pattern generally brings together custom resources and a controller. For the IBM CSI Operator, this kind of resource is called IBMBlockCSI. If we consider a single instance of an IBMBlockCSI, we get a basic impression of what the CSI Operator is doing. An IBMBlockCSI object contains the following major parts (a sketch is shown after the list):
Controller specification
This specification details two important aspects of the IBM block storage CSI driver’s controller component: the container image that runs the controller code, and a node selection expression that further specifies the type of nodes onto which the controller pod can be scheduled.
Node specification
Similar to the controller specification, the node specification details the image to use and a node selection criterion. Although this specification is about the CSI node component, the node selector in it determines the suitable Kubernetes nodes on which to run the CSI node pods.
Sidecar specifications
The CSI controller and node use the community-provided sidecars. An IBMBlockCSI object also specifies which container images are used for these sidecars.
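The following is a simplified sketch of such an IBMBlockCSI object. The namespace, image repositories, and tags are placeholders, and the exact field names can differ between driver releases; the sample custom resources that ship with the operator are the authoritative reference:

apiVersion: csi.ibm.com/v1
kind: IBMBlockCSI
metadata:
  name: ibm-block-csi
  namespace: kube-system                                 # placeholder namespace
spec:
  controller:
    repository: ibmcom/ibm-block-csi-driver-controller   # controller image (placeholder path)
    tag: "1.3.0"                                         # placeholder tag
  node:
    repository: ibmcom/ibm-block-csi-driver-node         # node image (placeholder path)
    tag: "1.3.0"                                         # placeholder tag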
From these specifications, we can derive the operator's tasks: it watches the IBMBlockCSI resources and creates the respective workload controllers for the deployment of an IBM block storage CSI driver installation. Because the image specifications (their repository paths and image tags) are declared in the IBMBlockCSI resource, upgrading the driver can also be done in a declarative style; that is, an administrator modifies the image references in the resource, and the CSI Operator picks up the change and triggers new deployments of whatever is needed for the declared version.
The operator is deployed to the cluster through a Deployment workload controller. This process can be done through the OpenShift GUI or on the command line.
 
After the operator is established in the cluster, an administrator then creates a suitable IBMBlockCSI object that specifies the wanted controller and node references. The operator triggers all necessary steps to deploy the IBM block storage CSI driver with all its components into the environment.
Final steps for an administrator include defining StorageClasses that further specify how to access the storage backend, from which storage pool to provision appropriate volumes, and so on. The volumes can then be dynamically created on the back-end by creating PVCs for applications in the defined StorageClasses.
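As a hedged example, such a StorageClass might look as follows. The provisioner name block.csi.ibm.com matches the driver name that is shown in Example 3-1; the pool name, Secret name, and namespace are placeholders, and the exact parameter set should be taken from the driver's documentation:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibm-block-gold
provisioner: block.csi.ibm.com
parameters:
  pool: gold_pool                                                    # back-end storage pool (placeholder)
  csi.storage.k8s.io/provisioner-secret-name: storage-secret         # Secret with back-end credentials
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/controller-publish-secret-name: storage-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: default
  csi.storage.k8s.io/fstype: ext4                                    # file system type for Filesystem volumes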
3.3.2 IBM CSI controller
The IBM CSI controller can provision storage on three different IBM block storage provider families: DS8000, Spectrum Virtualize products (such as IBM SAN Volume Controller), and the A9000/R family of products. Although these different back-ends require different adaptation layers, the gRPC calls on the "CSI side" are the same, and only the StorageClass of a PVC determines which back-end to contact.
As of this writing, the IBM Controller component is implemented in Python. It consists of the set of gRPC calls that are required by the CSI Controller and CSI Identity service, but then uses different open source Python modules to communicate with the back-end storage systems (the pyds8k module for the IBM DS8000 family, the pysvc module for IBM Storwize® and IBM FlashSystem, and the pyxcli module for IBM XIV/IBM FlashSystem A9000/R back-ends).
A thin intermediate layer forms the connection between CSI functions and the back-end modules. For more information about these modules, see this web page.
3.3.3 IBM CSI node
Most of the functions that the IBM CSI node implements are described in "CSI node" on page 39. The IBM CSI node driver supports the NodePublishVolume and NodeUnpublishVolume calls. When called, the NodePublishVolume() function checks whether the access to the wanted volume is valid, whether the device node can be accessed, and whether it appears as a multipath device in the compute node operating system.
Another task is accomplished by the IBM CSI node on its first invocation on a node: it generates a unique identifier for the node to make itself visible and addressable for the IBM CSI controller. This identifier is called the nodeid, and the Kubernetes Node object is annotated with it, as shown in Example 3-1.
Example 3-1 Nodeid example
$ oc describe node worker-0 | egrep -a -A1 -i nodeid
csi.volume.kubernetes.io/nodeid: {"block.csi.ibm.com":"worker-0;iqn.1994-05.com.redhat:7c9b3e4fee9;c05076fff280987c:c05076fff2809e40"}
The oc command is the OpenShift command line client, and its describe operation gives a summary of the node information in the etcd database. By using the egrep command, the nodeid annotation can be filtered out.
The IBM block storage CSI driver provisions “file system” and “block” volumes, as described in “Creating a PersistentVolumeClaim” on page 80. The name of the product refers to the block storage nature of the storage back-end systems, not the driver’s capabilities.
3.3.4 CSI driver storage back-end communication
The instances that are communicating with the storage back-ends are distinct Python modules for each of the three supported storage families. As with all the other components of the IBM block storage CSI driver, these modules are open source.
For more information, see the following resources:
These client modules can be used standalone in Python programs. This ability allows users to examine the interaction with a storage back-end outside of the Kubernetes environment, for troubleshooting.
 