In the early days of container orchestration, the targeted workloads were usually stateless applications that used external systems to store state when necessary. The thinking was that containers are ephemeral, and orchestrating the backing storage needed to keep state consistent was difficult at best. Over time, the need for container-based workloads that keep state became a reality and, in select cases, those workloads can even be more performant. Kubernetes adapted over many iterations to allow not only storage volumes mounted into the pod, but also volumes managed directly by Kubernetes, an important component in orchestrating storage alongside the workloads that require it.
If the ability to mount an external volume into a container were enough, many more examples of stateful applications running at scale in Kubernetes would exist. The reality is that volume mounting is the easy component in the grand scheme of stateful applications. The majority of applications that require state to be maintained after node failure are complicated data-state engines such as relational database systems, distributed key/value stores, and complicated document management systems. This class of applications requires more coordination around how members of the clustered application communicate with one another, how the members are identified, and the order in which members appear in or disappear from the system.
This chapter focuses on best practices for managing state, from simple patterns such as saving a file to a network share, to complex data management systems like MongoDB, MySQL, or Kafka. There is a small section on a newer pattern for complex systems called Operators, which builds on Kubernetes primitives and allows business or application logic to be added as custom controllers that can make operating complex data management systems easier.
Not every workload that requires a way to maintain state needs to be a complex database or high throughput data queue service. Often, applications that are being moved to containerized workloads expect certain directories to exist and read and write pertinent information to those directories. The ability to inject data into a volume that can be read by containers in a pod is covered in Chapter 5; however, data mounted from ConfigMaps or secrets is usually read-only, and this section focuses on giving containers volumes that can be written to and will survive a container failure or, even better, a pod failure.
Every major container runtime, such as Docker, rkt, CRI-O, and even Singularity, allows for mounting volumes into a container that are mapped to an external storage system. At its simplest, external storage can be a memory location, a path on the container’s host, or an external filesystem such as NFS, Glusterfs, CIFS, or Ceph. Why would this be needed, you might wonder? A useful example is that of a legacy application that was written to log application-specific information to a local filesystem. There are many possible solutions including, but not limited to, updating the application code to log to stdout or stderr, using a sidecar container that can stream log data to an outside source via a shared pod volume, or using a host-based logging tool that can read a volume for both host logs and container application logs. The last scenario can be attained by using a volume mount in the container with a Kubernetes hostPath mount, as shown in the following:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-webserver
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-webserver
  template:
    metadata:
      labels:
        app: nginx-webserver
    spec:
      containers:
      - name: nginx-webserver
        image: nginx:alpine
        ports:
        - containerPort: 80
        volumeMounts:
        - name: hostvol
          mountPath: /usr/share/nginx/html
      volumes:
      - name: hostvol
        hostPath:
          path: /home/webcontent
Try to limit the use of volumes to pods requiring multiple containers that need to share data, for example adapter or ambassador type patterns. Use emptyDir for those types of sharing patterns.
Use hostPath when access to the data is required by node-based agents or services.
Try to identify any services that write their critical application logs and events to local disk and, if possible, change those to stdout or stderr and let a true Kubernetes-aware log-aggregation system stream the logs instead of leveraging the volume map.
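The sidecar sharing pattern recommended above can be sketched with an emptyDir volume shared between an application container and a log-streaming sidecar. This is a minimal, illustrative manifest; the legacy image name and log path are hypothetical stand-ins:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app
spec:
  containers:
  - name: app
    image: legacy-app:1.0   # hypothetical legacy image that logs to /var/log/app
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  - name: log-streamer
    image: busybox
    # Tail the application log file and stream it to stdout,
    # where a cluster-level log aggregator can pick it up.
    command: ["sh", "-c", "tail -n+1 -F /var/log/app/app.log"]
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  volumes:
  - name: logs
    emptyDir: {}   # shared scratch space that lives as long as the pod
```

Because emptyDir is tied to the pod's life cycle, both containers see the same files without any external storage dependency.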
The examples so far show basic volume mapping into a container in a pod, which is just a basic container engine capability. The real key is allowing Kubernetes to manage the storage backing the volume mounts. This allows for more dynamic scenarios where pods can live and die as needed, and the storage backing the pod will transition accordingly to wherever the pod may live. Kubernetes manages storage for pods using two distinct APIs, the PersistentVolume and PersistentVolumeClaim.
It is best to think of a PersistentVolume as a disk that will back any volumes that are mounted to a pod. A PersistentVolume will have a claim policy that will define the scope of life of the volume independent of the life cycle of the pod that uses the volume. Kubernetes can use either dynamic or statically defined volumes. To allow for dynamically created volumes, there must be a StorageClass defined in Kubernetes. PersistentVolumes can be created in the cluster of varying types and classes, and only when a PersistentVolumeClaim matches the PersistentVolume will it actually be assigned to a pod. The volume itself is backed by a volume plug-in. There are numerous plug-ins supported directly in Kubernetes, and each has different configuration parameters to adjust:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv001
  labels:
    tier: "silver"
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: nfs
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /tmp
    server: 172.17.0.2
PersistentVolumeClaims are a way to give Kubernetes a resource-requirement definition for storage that a pod will use. Pods reference the claim, and if a PersistentVolume that matches the claim request exists, Kubernetes will allocate that volume to that specific pod. At minimum, a storage request size and access mode must be defined, but a specific StorageClass can also be specified. Selectors can also be used so that only PersistentVolumes meeting certain criteria will be allocated:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  storageClassName: nfs
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tier: "silver"
The preceding claim will match the PersistentVolume created earlier because the storage class name, the selector match, the size, and the access mode are all equal. Kubernetes will match up the PersistentVolume with the claim and bind them together. Now, to use the volume, the pod.spec should just reference the claim by name, as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-webserver
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-webserver
  template:
    metadata:
      labels:
        app: nginx-webserver
    spec:
      containers:
      - name: nginx-webserver
        image: nginx:alpine
        ports:
        - containerPort: 80
        volumeMounts:
        - name: hostvol
          mountPath: /usr/share/nginx/html
      volumes:
      - name: hostvol
        persistentVolumeClaim:
          claimName: my-pvc
Instead of manually defining the PersistentVolumes ahead of time, administrators might elect to create StorageClass objects, which define the volume plug-in to use and any specific mount options and parameters that all PersistentVolumes of that class will use. This then allows the claim to be defined with the specific StorageClass to use, and Kubernetes will dynamically create the PersistentVolume based on the StorageClass parameters and options:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nfs
provisioner: cluster.local/nfs-client-provisioner
parameters:
  archiveOnDelete: "true"
Kubernetes also allows operators to create a default storage class using the DefaultStorageClass admission plug-in. If this has been enabled on the API server, then a default StorageClass can be defined, and any PersistentVolumeClaims that do not explicitly define a StorageClass will use the default. Some cloud providers include a default storage class that maps to the cheapest storage allowed by their instances.
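Marking a class as the cluster default is done with the storageclass.kubernetes.io/is-default-class annotation; claims that omit a storage class then bind to this class. A minimal sketch, reusing the NFS provisioner from the earlier example:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nfs
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: cluster.local/nfs-client-provisioner
```

Only one StorageClass should carry this annotation at a time; if several do, claim binding behavior becomes ambiguous.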
Often referred to as “Out-of-Tree” volume plug-ins, the Container Storage Interface (CSI) and FlexVolume enable storage vendors to create custom storage plug-ins without the need to wait for direct code additions to the Kubernetes code base like most volume plug-ins today.
The CSI and FlexVolume plug-ins are deployed on Kubernetes clusters as extensions by operators and can be updated by the storage vendors when needed to expose new functionality.
The CSI states its objective on GitHub as:
To define an industry standard Container Storage Interface that will enable storage vendors (SP) to develop a plug-in once and have it work across a number of container orchestration (CO) systems.
The FlexVolume interface has been the traditional method used to add additional features for a storage provider. It does require specific drivers to be installed on all of the nodes of the cluster that will use it, which basically becomes an executable installed on the hosts of the cluster. This last component is the main detractor to using FlexVolumes, especially in managed service providers, because access to the nodes is frowned upon and access to the masters is practically impossible. The CSI plug-in solves this by exposing the same functionality while being as easy to use as deploying a pod into the cluster.
Cloud native application design principles try to enforce stateless application design as much as possible; however, the growing footprint of container-based services has created the need for data storage persistence. The following best practices around storage in Kubernetes will help you design an effective approach to providing the required storage implementations to your applications:
If possible, enable the DefaultStorageClass admission plug-in and define a default storage class. Many times, Helm charts for applications that require PersistentVolumes default to a default storage class for the chart, which allows the application to be installed without too much modification.
When designing the architecture of the cluster, either on-premises or in a cloud provider, take into consideration zone and connectivity between the compute and data layers using the proper labels for both nodes and PersistentVolumes, and using affinity to keep the data and workload as close as possible. The last thing you want is a pod on a node in zone A trying to mount a volume that is attached to a node in zone B.
Consider very carefully which workloads require state to be maintained on disk. Can that be handled by an outside service like a database system or, if running in a cloud provider, by a hosted service that is API consistent with currently used APIs, say MongoDB or MySQL as a service?
Determine how much effort would be involved in modifying the application code to be more stateless.
While Kubernetes will track and mount the volumes as workloads are scheduled, it does not yet handle redundancy and backup of the data that is stored in those volumes. The CSI specification has added an API for vendors to plug in native snapshot technologies if the storage backend can support it.
Verify the proper life cycle of the data that volumes will hold. By default, the reclaim policy for dynamically provisioned PersistentVolumes is set to Delete, which removes the volume from the backing storage provider when the claim is deleted. Sensitive data, or data that can be used for forensic analysis, should use the Retain policy; for example, kubectl patch pv pv001 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}' changes the policy on an existing volume.
Contrary to popular belief, Kubernetes has supported stateful applications since its infancy, from MySQL, Kafka, and Cassandra to other technologies. Those pioneering days, however, were fraught with complexities and were usually only for small workloads, with lots of work required to get things like scaling and durability to work.
To fully grasp the critical differences, you must understand how a typical ReplicaSet schedules and manages pods, and how each could be detrimental to traditional stateful applications:
Pods in a ReplicaSet are scaled out and assigned random names when scheduled.
Pods in a ReplicaSet are scaled down in an arbitrary manner.
Pods in a ReplicaSet are never called directly through their name or IP address but through their association with a Service.
Pods in a ReplicaSet can be restarted and moved to another node at any time.
Pods in a ReplicaSet that have a PersistentVolume mapped are linked only by the claim, but any new pod with a new name can take over the claim if needed when rescheduled.
Those that have only cursory knowledge of cluster data management systems can immediately begin to see issues with these characteristics of ReplicaSet-based pods. Imagine a pod that has the current writable copy of the database just all of a sudden getting deleted! Pure pandemonium would ensue for sure.
Most neophytes to the Kubernetes world assume that StatefulSet applications are automatically database applications and therefore equate the two things. This could not be further from the truth, in the sense that Kubernetes has no sense of what type of application it is deploying. It does not know that your database system requires leader election processes, that it can or cannot handle data replication between members of the set, or, for that matter, that it is a database system at all. This is where StatefulSets come into play.
What StatefulSets do is make it easier to run application systems that expect more reliable node/pod behavior. If we look at the list of typical pod characteristics in a ReplicaSet, StatefulSets offer almost the complete opposite. The original spec, introduced back in Kubernetes version 1.3 as PetSets, answered some of the critical scheduling and management needs for stateful-type applications such as complex data management systems:
Pods in a StatefulSet are scaled out and assigned sequential names. As the set scales up, the pods get ordinal names, and by default a new pod must be fully online (pass its liveness and/or readiness probes) before the next pod is added.
Pods in a StatefulSet are scaled down in reverse sequence.
Pods in a StatefulSet can be addressed individually by name behind a headless Service.
Pods in a StatefulSet that require a volume mount must use a defined PersistentVolume template. Volumes claimed by pods in a StatefulSet are not deleted when the StatefulSet is deleted.
A StatefulSet specification looks very similar to a Deployment except for the Service declaration and the PersistentVolume template. The headless Service should be created first, which defines the Service that the pods will be addressed with individually. The headless Service is the same as a regular Service but does not do the normal load balancing:
apiVersion: v1
kind: Service
metadata:
  name: mongo
  labels:
    name: mongo
spec:
  ports:
  - port: 27017
    targetPort: 27017
  clusterIP: None   # This creates the headless Service
  selector:
    role: mongo
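With this headless Service in place, each pod in the set gains a stable DNS identity of the form <pod-name>.<service-name>. Assuming the default namespace and the standard cluster.local domain, the three mongo replicas would resolve as:

```
mongo-0.mongo.default.svc.cluster.local
mongo-1.mongo.default.svc.cluster.local
mongo-2.mongo.default.svc.cluster.local
```

This stable, per-pod addressing is what lets clustered data systems configure replica membership by name rather than by ephemeral IP.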
The StatefulSet definition will also look exactly like a Deployment with a few changes:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo
spec:
  serviceName: "mongo"
  replicas: 3
  selector:
    matchLabels:
      role: mongo
  template:
    metadata:
      labels:
        role: mongo
        environment: test
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: mongo
        image: mongo:3.4
        command:
        - mongod
        - "--replSet"
        - rs0
        - "--bind_ip"
        - 0.0.0.0
        - "--smallfiles"
        - "--noprealloc"
        ports:
        - containerPort: 27017
        volumeMounts:
        - name: mongo-persistent-storage
          mountPath: /data/db
      - name: mongo-sidecar
        image: cvallance/mongo-k8s-sidecar
        env:
        - name: MONGO_SIDECAR_POD_LABELS
          value: "role=mongo,environment=test"
  volumeClaimTemplates:
  - metadata:
      name: mongo-persistent-storage
    spec:
      storageClassName: "fast"
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 2Gi
StatefulSets have definitely been a major factor in making complex stateful data systems feasible workloads in Kubernetes. The only real issue is, as stated earlier, that Kubernetes does not really understand the workload that is running in the StatefulSet. All of the other complex operations, such as backups, failover, leader registration, new replica registration, and upgrades, need to happen quite regularly and will require some careful consideration when running as StatefulSets.
Early on in the growth of Kubernetes, CoreOS site reliability engineers (SREs) created a new class of cloud native software for Kubernetes called Operators. The original intent was to encapsulate the domain-specific knowledge of running a specific application into a custom controller that extends Kubernetes. Imagine building on the StatefulSet controller to be able to deploy, scale, upgrade, back up, and run general maintenance operations on Cassandra or Kafka. Some of the first Operators were created for etcd and Prometheus, which uses a time-series database to keep metrics over time. The proper creation, backup, and restore configuration of Prometheus or etcd instances can be handled by an Operator, and these become new Kubernetes-managed objects just like a pod or Deployment.
Until recently, Operators have been one-off tools created by SREs or by software vendors for their specific applications. In mid-2018, Red Hat created the Operator Framework, a set of tools including an SDK, a life cycle manager, and future modules that will enable features such as metering, marketplace, and registry type functions. Operators are not only for stateful applications, but because of their custom controller logic they are definitely more amenable to complex data services and stateful systems.
Operators are still an emerging technology in the Kubernetes space, but they are slowly taking a foothold with many data management system vendors, cloud providers, and SREs the world over who want to include some of the operational knowledge they have in running complex distributed systems in Kubernetes. Take a look at OperatorHub for an updated list of curated Operators.
Large distributed applications that require state and possibly complicated management and configuration operations benefit from Kubernetes StatefulSets and Operators. Operators are still evolving, but they have the backing of the community at large, so these best practices are based on current capabilities at the time of publication:
The decision to use StatefulSets should be taken judiciously, because stateful applications usually require much deeper management than the orchestrator can currently manage well (read the “Operators” section for the possible future answer to this deficiency in Kubernetes).
The headless Service for the StatefulSet is not automatically created and must be created at deployment time to properly address the pods as individual nodes.
When an application requires ordinal naming and dependable scaling, it does not always mean it requires the assignment of PersistentVolumes.
If a node in the cluster becomes unresponsive, any pods that are part of a StatefulSet are not automatically deleted; instead, they enter a Terminating or Unknown state after a grace period. The only ways to clear such a pod are removing the node object from the cluster, the kubelet beginning to work again and deleting the pod directly, or an operator force deleting the pod. The force delete should be the last option, and great care should be taken that the node that hosted the deleted pod does not come back online, because there would then be two pods with the same name in the cluster. You can use kubectl delete pod nginx-0 --grace-period=0 --force to force delete the pod.
Even after force deleting a pod, it might stay in an Unknown state, so a patch to the API server will delete the entry and cause the StatefulSet controller to create a new instance of the deleted pod: kubectl patch pod nginx-0 -p '{"metadata":{"finalizers":null}}'.
If you’re running a complex data system with some type of leader election or data-replication confirmation process, use a preStop hook to properly close connections, force leader election, or verify data synchronization before the pod is deleted as part of a graceful shutdown.
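A preStop hook is declared in the container's lifecycle block and runs before the container receives its termination signal. In this sketch, the step-down script name is a hypothetical placeholder for whatever drain or leader-handoff logic the data system needs:

```yaml
containers:
- name: mongo
  image: mongo:3.4
  lifecycle:
    preStop:
      exec:
        # Hypothetical script that steps down a primary and waits
        # for replication to catch up before the container exits.
        command: ["/bin/sh", "-c", "/scripts/graceful-stepdown.sh"]
```

Note that the hook must complete within terminationGracePeriodSeconds, so set that value high enough for the shutdown logic to finish.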
When the application that requires stateful data is a complex data management system, it might be worth a look to determine whether an Operator exists to help manage the more complicated life cycle components of the application. If the application is built in-house, it might be worth investigating whether it would be useful to package the application as an Operator to add additional manageability to the application. Look at the CoreOS Operator SDK for an example.
Most organizations look to containerize their stateless applications and leave the stateful applications as is. As more and more cloud native applications run in cloud provider Kubernetes offerings, data gravity becomes an issue. Stateful applications require much more due diligence, but the reality of running them in clusters has been accelerated by the introduction of StatefulSets and Operators. Mapping volumes into containers allows operators to abstract the storage-subsystem specifics away from application development. Managing a stateful application such as a database system in Kubernetes is still managing a complex distributed system and needs to be carefully orchestrated using the native Kubernetes primitives of pods, ReplicaSets, Deployments, and StatefulSets, but using Operators that have specific application knowledge built into them as Kubernetes-native APIs may help to elevate these systems into production-based clusters.