Chapter 10. Pod and Container Security

When it comes to pod security via the Kubernetes API, you have two main options at your disposal: PodSecurityPolicy and RuntimeClass. In this chapter, we review the purpose and use of each API and provide best practices for their use.

PodSecurityPolicy API


The PodSecurityPolicy API is under active development. As of Kubernetes 1.15, this API was in beta. Please visit the upstream documentation for the latest updates on the feature state.

This cluster-wide resource creates a single place to define and manage all of the security-sensitive fields found in pod specifications. Prior to the creation of the PodSecurityPolicy resource, cluster administrators and/or users would need to independently define individual SecurityContext settings for their workloads or enable bespoke admission controllers on the cluster to enforce some aspects of pod security.
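To make the per-workload approach concrete, here is a minimal sketch of a pod that sets several security-sensitive fields by hand through securityContext. The pod name, image, and numeric IDs are placeholders chosen for illustration and do not come from this chapter.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example                  # hypothetical name
spec:
  securityContext:                        # pod-level settings, applied to all containers
    runAsNonRoot: true
    runAsUser: 1000                       # placeholder UID
    fsGroup: 2000                         # placeholder GID applied to mounted volumes
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    securityContext:                      # container-level settings
      privileged: false
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
```

Without a central policy, every workload author has to remember to set these fields. PodSecurityPolicy lets the cluster administrator enforce them at admission instead.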

Does all of this sound too easy? PodSecurityPolicy is surprisingly difficult to implement effectively and will more often than not get turned off or evaded in other ways. We do, however, strongly suggest taking the time to fully understand PodSecurityPolicy because it’s one of the single most effective means to reduce your attack surface area by limiting what can run on your cluster and with what level of privilege.

Enabling PodSecurityPolicy

Along with the resource API, a corresponding admission controller must be enabled to enforce the conditions defined in the PodSecurityPolicy resource. This means that the enforcement of these policies happens at the admission phase of the request flow. To learn more about how admission controllers work, refer to Chapter 17.

It’s worth mentioning that PodSecurityPolicy is not widely available among public cloud providers and cluster operations tools. Where it is available, it generally ships as an opt-in feature.


Proceed with caution when enabling PodSecurityPolicy because it’s potentially workload blocking if adequate preparation isn’t done at the outset.

There are two main components that you need to complete in order to start using PodSecurityPolicy:

  1. Ensure that the PodSecurityPolicy API is enabled (this should already be done if you’re on a currently supported version of Kubernetes).

    You can confirm that this API is enabled by running kubectl get psp. As long as the response isn’t the server doesn't have a resource type "PodSecurityPolicies", you are OK to proceed.

  2. Enable the PodSecurityPolicy admission controller via the api-server flag --enable-admission-plugins.


If you are enabling PodSecurityPolicy on an existing cluster with running workloads, you must create all necessary policies, service accounts, roles, and role bindings before enabling the admission controller.
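As a rough illustration of step 2, the following fragment shows where the flag might live on a cluster that runs the API server as a static pod. The kubeadm-style layout (/etc/kubernetes/manifests/kube-apiserver.yaml) and the presence of NodeRestriction in the list are assumptions; your distribution may configure the API server differently.

```yaml
# Fragment of a kube-apiserver static pod manifest (assumed kubeadm layout).
# All other flags and fields are omitted for brevity.
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Append PodSecurityPolicy to whatever plugins are already enabled.
    - --enable-admission-plugins=NodeRestriction,PodSecurityPolicy
```

The flag takes a comma-separated list, so add PodSecurityPolicy to the existing value rather than replacing it.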

We also recommend the addition of the --use-service-account-credentials=true flag to kube-controller-manager, which will enable service accounts to be used for each individual controller within kube-controller-manager. This allows for more granular policy control even within the kube-system namespace. You can simply run the following command to determine whether the flag has been set. It demonstrates that there is indeed a service account per controller:

$ kubectl get serviceaccount -n kube-system | grep '.*-controller'
attachdetach-controller              1         6d13h
certificate-controller               1         6d13h
clusterrole-aggregation-controller   1         6d13h
cronjob-controller                   1         6d13h
daemon-set-controller                1         6d13h
deployment-controller                1         6d13h
disruption-controller                1         6d13h
endpoint-controller                  1         6d13h
expand-controller                    1         6d13h
job-controller                       1         6d13h
namespace-controller                 1         6d13h
node-controller                      1         6d13h
pv-protection-controller             1         6d13h
pvc-protection-controller            1         6d13h
replicaset-controller                1         6d13h
replication-controller               1         6d13h
resourcequota-controller             1         6d13h
service-account-controller           1         6d13h
service-controller                   1         6d13h
statefulset-controller               1         6d13h
ttl-controller                       1         6d13h

It’s extremely important to remember that having no PodSecurityPolicies defined will result in an implicit deny. This means that without a policy match for the workload, the pod will not be created.

Anatomy of a PodSecurityPolicy

To best understand how PodSecurityPolicy enables you to secure your pods, let’s work through an end-to-end example together. This will help solidify the order of operations from policy creation through use.

Before you continue, the following section requires that your cluster have PodSecurityPolicy enabled in order for it to work. To see how to enable it, refer to the previous section.


You should not enable PodSecurityPolicy on a live cluster without considering the warnings provided in the previous section. Proceed with caution.

Let’s first test the experience without making any changes or creating any policies. The following is a test workload that simply runs the trusty pause container in a Deployment (save this file as pause-deployment.yaml on your local filesystem for use throughout this section):

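apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-deployment
  namespace: default
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause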

After creating the Deployment with kubectl create -f pause-deployment.yaml, run the following command to verify that you have a Deployment and a corresponding ReplicaSet but no pod:

$ kubectl get deploy,rs,pods -l app=pause
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/pause-deployment   0/1     0            0           41s

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.extensions/pause-deployment-67b77c4f69   1         0         0       41s

If you describe the ReplicaSet, you can confirm the cause from the event log:

$ kubectl describe replicaset -l app=pause
Name:           pause-deployment-67b77c4f69
Namespace:      default
Selector:       app=pause,pod-template-hash=67b77c4f69
Labels:         app=pause
Annotations: 1
Controlled By:  Deployment/pause-deployment
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=pause
  Containers:
   pause:
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason        Age                  From                   Message
  ----     ------        ----                 ----                   -------
  Warning  FailedCreate  45s (x15 over 2m7s)  replicaset-controller  Error creating: pods "pause-deployment-67b77c4f69-" is forbidden: unable to validate against any pod security policy: []

This is because either no pod security policies are defined or the service account is not allowed to use a PodSecurityPolicy. You might have also noticed that the system pods in the kube-system namespace are probably still in the Running state. This is because those pods had already passed the admission phase before the admission controller was enabled. If an event restarted these pods, they would suffer the same fate as our test workload, given that no PodSecurityPolicy resources are defined:

replicaset-controller  Error creating: pods "pause-deployment-67b77c4f69-" is forbidden: unable to validate against any pod security policy: []

Let’s delete the test workload deployment:

$ kubectl delete deploy -l app=pause
deployment.extensions "pause-deployment" deleted

Now, let’s go fix this by defining pod security policies. For a complete list of policy settings, refer to the Kubernetes documentation. The following policies are basic variations of the examples provided in the Kubernetes documentation.

Call the first policy privileged, which we use to demonstrate how to allow privileged workloads. You can apply the following resources by using kubectl create -f <filename>:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities:
  - '*'
  volumes:
  - '*'
  hostNetwork: true
  hostPorts:
  - min: 0
    max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'

The next policy defines restricted access and will suffice for many workloads apart from those responsible for running Kubernetes cluster-wide services such as kube-proxy, located in the kube-system namespace:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
      - min: 1
        max: 65535
  readOnlyRootFilesystem: false

You can confirm that the policies have been created by running the following command:

$ kubectl get psp
NAME         PRIV    CAPS   SELINUX    RUNASUSER          FSGROUP     SUPGROUP    READONLYROOTFS   VOLUMES
privileged   true    *      RunAsAny   RunAsAny           RunAsAny    RunAsAny    false            *
restricted   false          RunAsAny   RunAsAny           MustRunAs   MustRunAs   false            configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim

Now that we have defined these policies, we need to grant the service accounts access to use these policies via Role-Based Access Control (RBAC).

First, create the following ClusterRole that allows access to use the restricted PodSecurityPolicy that we created in the previous step:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp-restricted
rules:
- apiGroups:
  - extensions
  resources:
  - podsecuritypolicies
  resourceNames:
  - restricted
  verbs:
  - use

Now, create the following ClusterRole that allows access to use the privileged PodSecurityPolicy we created in the previous step:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp-privileged
rules:
- apiGroups:
  - extensions
  resources:
  - podsecuritypolicies
  resourceNames:
  - privileged
  verbs:
  - use

We must now create a corresponding ClusterRoleBinding that grants the system:serviceaccounts group access to the psp-restricted ClusterRole. This group contains every service account in the cluster, including the kube-controller-manager controller service accounts:

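kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp-restricted
subjects:
- kind: Group
  name: system:serviceaccounts
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted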

Go ahead and create the test workload again. You can see that the pod is now up and running:

$ kubectl create -f pause-deployment.yaml
deployment.apps/pause-deployment created
$ kubectl get deploy,rs,pod
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/pause-deployment   1/1     1            1           10s

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.extensions/pause-deployment-67b77c4f69   1         1         1       10s

NAME                                    READY   STATUS    RESTARTS   AGE
pod/pause-deployment-67b77c4f69-4gmdn   1/1     Running   0          9s

Now update the test workload so that it violates the restricted policy. Setting privileged: true in the container’s securityContext should do the trick. Save this manifest as pause-privileged-deployment.yaml on your local filesystem and then create it by using kubectl create -f <filename>:

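apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-privileged-deployment
  namespace: default
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        securityContext:
          privileged: true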

Again, you can see that both the Deployment and the ReplicaSet have been created; however, the pod has not. You can find the details of why in the event log of the ReplicaSet:

$ kubectl create -f pause-privileged-deployment.yaml
deployment.apps/pause-privileged-deployment created
$ kubectl get deploy,rs,pods -l app=pause
NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/pause-privileged-deployment   0/1     0            0           37s

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.extensions/pause-privileged-deployment-6b7bcfb9b7   1         0         0       37s
$ kubectl describe replicaset -l app=pause
Name:           pause-privileged-deployment-6b7bcfb9b7
Namespace:      default
Selector:       app=pause,pod-template-hash=6b7bcfb9b7
Labels:         app=pause
Annotations: 1
Controlled By:  Deployment/pause-privileged-deployment
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=pause
  Containers:
   pause:
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason        Age                   From                   Message
  ----     ------        ----                  ----                   -------
  Warning  FailedCreate  78s (x15 over 2m39s)  replicaset-controller  Error creating: pods "pause-privileged-deployment-6b7bcfb9b7-" is forbidden: unable to validate against any pod security policy: [spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

The preceding example shows the exact reason why: Privileged containers are not allowed. Let’s delete the test workload deployment.

$ kubectl delete deploy pause-privileged-deployment
deployment.extensions "pause-privileged-deployment" deleted

So far, we’ve dealt only with cluster-level bindings. How about we allow the test workload access to the privileged policy by using a service account?

First, create a serviceaccount in the default namespace:

$ kubectl create serviceaccount pause-privileged
serviceaccount/pause-privileged created

Bind that service account to the permissive psp-privileged ClusterRole. Save this manifest as role-pause-privileged-psp-permissive.yaml on your local filesystem and then apply it by using kubectl apply -f <filename>:

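apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pause-privileged-psp-permissive
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-privileged
subjects:
- kind: ServiceAccount
  name: pause-privileged
  namespace: default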

Finally, update the test workload to use the pause-privileged service account. Then apply it to the cluster using kubectl apply:

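apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-privileged-deployment
  namespace: default
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        securityContext:
          privileged: true
      serviceAccountName: pause-privileged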

You can see that the pod is now able to use the privileged policy:

$ kubectl create -f pause-privileged-deployment.yaml
deployment.apps/pause-privileged-deployment created
$ kubectl get deploy,rs,pod
NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/pause-privileged-deployment   1/1     1            1           14s

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.extensions/pause-privileged-deployment-658dc5569f   1         1         1       14s

NAME                                               READY   STATUS    RESTARTS   AGE
pod/pause-privileged-deployment-658dc5569f-nslnw   1/1     Running   0          14s

You can see which PodSecurityPolicy was matched by using the following command:

$ kubectl get pod -l app=pause -o yaml | grep psp
      kubernetes.io/psp: privileged

PodSecurityPolicy Challenges

Now that you understand how to configure and use PodSecurityPolicy, it’s worth noting that there are quite a few challenges with using it in real-world environments. In this section, we describe things that we have experienced that make it challenging.

Reasonable default policies

The real power of PodSecurityPolicy is to enable the cluster administrator and/or user to ensure that their workloads meet a certain level of security. In practice, you might often overlook just how many workloads run as root, use hostPath volumes, or have other risky settings that force you to craft policies with security holes just to get the workloads up and running.
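One way to avoid opening a wide hole for such workloads is to scope the exception as narrowly as the API allows. The following is a sketch, not a recommended default: the policy name and the /var/log prefix are placeholders, and it assumes a workload that only needs read access to host logs.

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-hostpath-logs    # hypothetical name
spec:
  privileged: false
  allowPrivilegeEscalation: false
  volumes:
  - 'configMap'
  - 'secret'
  - 'emptyDir'
  - 'hostPath'                      # allowed, but only for the prefixes listed below
  allowedHostPaths:
  - pathPrefix: '/var/log'          # placeholder path
    readOnly: true                  # mounts under this prefix must be read-only
  runAsUser:
    rule: 'RunAsAny'                # concession for images that still run as root
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
```

Binding a policy like this to a single service account keeps the exception from becoming the default for everything else in the namespace.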

Lots of toil

Getting the policies just right is a large investment, especially where there is a large set of workloads already running on Kubernetes without PodSecurityPolicy enabled.

Are your developers interested in learning PodSecurityPolicy?

What would be the incentive for your developers to learn PodSecurityPolicy? Without a lot of up-front coordination and automation to make enabling PodSecurityPolicy a smooth transition, it’s very likely that PodSecurityPolicy won’t be adopted at all.

Debugging is cumbersome

It’s difficult to troubleshoot policy evaluation. For example, you might want to understand why your workload matched or didn’t match a specific policy. Tooling or logging to make that easy doesn’t exist at this stage.

Do you rely on artifacts outside your control?

Are you pulling images from Docker Hub or another public repository? Chances are they will violate your policies in some shape or form and will be out of your control to fix. Another common place is Helm charts: do they ship with the appropriate policies in place?

PodSecurityPolicy Best Practices

PodSecurityPolicy is complex and can be error prone. Refer to the following best practices before implementing PodSecurityPolicy on your clusters:

  • It all comes down to RBAC. Whether you like it or not, which PodSecurityPolicy a workload can use is determined by RBAC. It’s this relationship that exposes the shortcomings in your current RBAC policy design. We cannot stress enough how important it is to automate your RBAC and PodSecurityPolicy creation and maintenance. Specifically, locking down access to service accounts is the key to using policy.

  • Understand the policy scope. Determining how your policies will be laid out on your cluster is very important. Your policies can be cluster-wide, namespaced, or workload-specific in scope. There will always be workloads on your cluster that are part of the Kubernetes cluster operations that will need more permissive security privileges, so make sure that you have appropriate RBAC in place to stop unwanted workloads using your permissive policies.

  • Do you want to enable PodSecurityPolicy on an existing cluster? Use an open source tool such as kube-psp-advisor to generate policies based on your current resources. This is a great start. From there, you can hone your policies.
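To illustrate the namespaced scope mentioned in the list above, here is a sketch of a RoleBinding that lets every service account in one namespace use the psp-restricted ClusterRole defined earlier in this chapter. The namespace team-a and the binding name are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-restricted-team-a             # hypothetical name
  namespace: team-a                       # hypothetical namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted                    # ClusterRole created earlier in this chapter
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:team-a     # all service accounts in team-a
```

Because this is a RoleBinding rather than a ClusterRoleBinding, the grant applies only to pods created in team-a. Replacing the Group subject with a single ServiceAccount narrows it further to one workload.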

PodSecurityPolicy Next Steps

As demonstrated, PodSecurityPolicy is an extremely powerful API to assist in keeping your cluster secure, but it demands a high tax for use. With careful planning and a pragmatic approach, PodSecurityPolicy can be successfully implemented on any cluster. At the very least, it will keep your security team happy.

Workload Isolation and RuntimeClass

Container runtimes are still largely considered an insecure workload isolation boundary, and it is unclear whether the most common runtimes of today will ever be recognized as secure. The momentum and interest in the industry around Kubernetes has led to the development of different container runtimes that offer varying levels of isolation. Some are based on familiar and trusted technology stacks, whereas others are a completely new attempt to tackle the problem. Open source projects like Kata Containers, gVisor, and Firecracker tout the promise of stronger workload isolation. These specific projects are based on either nested virtualization (running a super lightweight virtual machine within a virtual machine) or system call filtering and servicing.

The introduction of these container runtimes that offer different workload isolation allows users to choose many different runtimes based on their isolation guarantees in the same cluster. For example, you could have trusted and untrusted workloads running in the same cluster in different container runtimes.

RuntimeClass was introduced into Kubernetes as an API to allow container runtime selection. Each RuntimeClass represents one of the supported container runtimes on the cluster, once the cluster administrator has configured it. As a Kubernetes user, you can select a specific runtime class for your workloads by setting runtimeClassName in the pod specification. Under the hood, the RuntimeClass designates a RuntimeHandler, which is passed to the Container Runtime Interface (CRI) to implement. Node labels or node taints can then be used in conjunction with nodeSelectors or tolerations to ensure that the workload lands on a node capable of supporting the desired RuntimeClass. Figure 10-1 demonstrates how a kubelet uses RuntimeClass when launching pods.

Figure 10-1. RuntimeClass flow diagram

The RuntimeClass API is under active development. For the latest updates on the feature state, visit the upstream documentation.
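For orientation, a RuntimeClass object that a cluster administrator might create could look like the following sketch. The handler value and the node label are hypothetical: the handler must match whatever name the CRI implementation on your nodes has been configured with, and the API version depends on your Kubernetes release.

```yaml
apiVersion: node.k8s.io/v1          # clusters older than 1.20 use node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: firecracker                 # the name pods reference via runtimeClassName
handler: firecracker                # hypothetical handler name from the node's CRI configuration
scheduling:                         # optional; available from Kubernetes 1.16
  nodeSelector:
    runtime: firecracker            # hypothetical label on nodes that support this runtime
```

RuntimeClass is a cluster-scoped resource, so it has no namespace. When the scheduling field is set, its nodeSelector is merged into the nodeSelector of any pod that uses the class.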

Using RuntimeClass

If a cluster administrator has set up different RuntimeClasses, you can use them simply by specifying runtimeClassName in the pod specification; for example:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  runtimeClassName: firecracker
  containers:
  - name: nginx
    image: nginx
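If only some nodes support a given runtime, the pod also needs to be steered toward them. The sketch below combines runtimeClassName with a nodeSelector and a toleration. The pod name, the runtime=firecracker label, and the matching taint are assumptions made for illustration and must correspond to how your nodes are actually labeled and tainted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-sandboxed             # hypothetical name
spec:
  runtimeClassName: firecracker
  nodeSelector:
    runtime: firecracker            # hypothetical node label
  tolerations:
  - key: runtime                    # hypothetical taint key on the dedicated nodes
    operator: Equal
    value: firecracker
    effect: NoSchedule
  containers:
  - name: nginx
    image: nginx
```

Tainting the dedicated nodes keeps ordinary workloads off them, while the nodeSelector keeps this pod from landing on a node that lacks the runtime.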

Runtime Implementations

Following are some open source container runtime implementations that offer different levels of security and isolation for your consideration. This list is intended as a guide and is by no means exhaustive:

CRI containerd

An API facade for container runtimes with an emphasis on simplicity, robustness, and portability.

CRI-O

A purpose-built, lightweight Open Container Initiative (OCI)-based implementation of a container runtime for Kubernetes.

Firecracker

Built on top of the Kernel-based Virtual Machine (KVM), this virtualization technology allows you to launch microVMs in nonvirtualized environments very quickly using the security and isolation of traditional VMs.

gVisor

An OCI-compatible sandbox runtime that runs containers with a new user-space kernel, which provides a low-overhead, secure, isolated container runtime.

Kata Containers

A community that’s building a secure container runtime that provides VM-like security and isolation by running lightweight VMs that feel and operate like containers.

Workload Isolation and RuntimeClass Best Practices

The following best practices will help you to avoid common workload isolation and RuntimeClass pitfalls:

  • Implementing different workload isolation environments via RuntimeClass will complicate your operational environment. Workloads might not be portable across container runtimes, given the nature of the isolation they provide. The matrix of supported features across runtimes can be difficult to understand and will lead to a poor user experience. We recommend having separate clusters, each with a single runtime, to avoid confusion, if possible.

  • Workload isolation doesn’t mean secure multitenancy. Even though you might have implemented a secure container runtime, this doesn’t mean that the Kubernetes cluster and APIs have been secured in the same fashion. You must consider the total surface area of Kubernetes end to end. Just because you have an isolated workload doesn’t mean that it cannot be modified by a bad actor via the Kubernetes API.

  • Tooling across different runtimes is inconsistent. You might have users who rely on container runtime tooling for debugging and introspection. Having different runtimes means that you might no longer be able to run docker ps to list running containers. This leads to confusion and complications when troubleshooting.

Other Pod and Container Security Considerations

In addition to PodSecurityPolicy and workload isolation, here are some other tools you may consider when determining how to handle pod and container security.

Admission Controllers

If you’re worried about diving into the deep end with PodSecurityPolicy, here are some options that offer a fraction of the functionality but might offer a viable alternative. You can use admission controllers such as DenyExecOnPrivileged and DenyEscalatingExec in conjunction with an admission webhook to add SecurityContext workload settings to achieve a similar outcome. For more information on admission control, refer to Chapter 17.

Intrusion and Anomaly Detection Tooling

We’ve covered security policies and container runtimes, but what happens when you want to introspect and enforce policy within the container runtime? There are open source tools that can do this and more. They operate by either listening to and filtering Linux system calls or by utilizing a Berkeley Packet Filter (BPF). One such tool is Falco. Falco is a Cloud Native Computing Foundation (CNCF) project that installs as a DaemonSet and allows you to configure and enforce policy during execution. Falco is just one approach. We encourage you to take a look at the tooling in this space to see what works for you.
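To give a feel for what such a runtime policy looks like, here is a minimal sketch of a Falco rule that raises an alert when a shell is started inside a container. The rule name and output text are made up for this example, and the condition is intentionally simplistic. Falco’s bundled rule set is the better starting point for real deployments.

```yaml
# Hypothetical custom rule, written in Falco's YAML rule format.
- rule: Shell started in container
  desc: Detect a shell process being executed inside any container
  condition: >
    evt.type = execve and evt.dir = < and
    container.id != host and proc.name in (bash, sh)
  output: "Shell started in container (user=%user.name container=%container.name cmdline=%proc.cmdline)"
  priority: WARNING
```

The condition filters on system call events, which is how the rule can observe behavior that admission-time policy such as PodSecurityPolicy never sees.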


Summary

In this chapter, we covered in depth both the PodSecurityPolicy and the RuntimeClass APIs with which you can configure a granular level of security for your workloads. We have also taken a look at some open source ecosystem tooling that you can use to monitor and enforce policy within the container runtime. We have provided a thorough overview for you to make an informed decision about providing the level of security that is best suited for your workload needs.

