When it comes to pod security via the Kubernetes API, you have two main options at your disposal: PodSecurityPolicy and RuntimeClass. In this chapter, we review the purpose and use of each API and provide best practices for their use.
The PodSecurityPolicy API is under active development. As of Kubernetes 1.15, this API was in beta. Please visit the upstream documentation for the latest updates on the feature state.
This cluster-wide resource creates a single place to define and manage
all of the security-sensitive fields found in pod specifications. Prior to
the creation of the PodSecurityPolicy resource, cluster administrators and/or
users would need to independently define individual SecurityContext
settings for their workloads or enable bespoke admission controllers on
the cluster to enforce some aspects of pod security.
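Before PodSecurityPolicy, each workload had to carry these settings itself. As a minimal sketch of that kind of per-pod SecurityContext configuration (the pod name and field values here are illustrative, not prescriptive):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # illustrative name
spec:
  securityContext:
    runAsNonRoot: true     # repeated by hand in every pod specification
  containers:
  - name: app
    image: k8s.gcr.io/pause
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
```

PodSecurityPolicy moves the definition and enforcement of settings like these into a single cluster-wide resource.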
Does all of this sound too easy? PodSecurityPolicy is surprisingly difficult to implement effectively, and more often than not it gets turned off or evaded in other ways. We do, however, strongly suggest taking the time to fully understand PodSecurityPolicy, because it’s one of the most effective means of reducing your attack surface: it limits what can run on your cluster and with what level of privilege.
Along with the resource API, a corresponding admission controller must be enabled to enforce the conditions defined in the PodSecurityPolicy resource. This means that the enforcement of these policies happens at the admission phase of the request flow. To learn more about how admission controllers work, refer to Chapter 17.
It’s worth mentioning that PodSecurityPolicy is not widely enabled by default among public cloud providers and cluster operations tools. In the cases for which it is available, it’s generally shipped as an opt-in feature.
Proceed with caution when enabling PodSecurityPolicy because it’s potentially workload blocking if adequate preparation isn’t done at the outset.
There are two main steps you need to complete in order to start using PodSecurityPolicy:
1. Ensure that the PodSecurityPolicy API is enabled (this should already be done if you’re on a currently supported version of Kubernetes). You can confirm that the API is enabled by running kubectl get psp. As long as the response isn’t the server doesn't have a resource type "PodSecurityPolicies", you are OK to proceed.

2. Enable the PodSecurityPolicy admission controller via the kube-apiserver flag --enable-admission-plugins.
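As a sketch, the flag looks like the following (exact plumbing varies with how your control plane is deployed; NodeRestriction is shown only as an example of a plugin you might already have enabled):

```shell
# Illustrative kube-apiserver invocation fragment; other required flags omitted.
kube-apiserver \
  --enable-admission-plugins=NodeRestriction,PodSecurityPolicy
```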
If you are enabling PodSecurityPolicy on an existing cluster with running workloads, you must create all necessary policies, service accounts, roles, and role bindings before enabling the admission controller.
We also recommend adding the --use-service-account-credentials=true flag to kube-controller-manager, which enables a separate service account for each individual controller within kube-controller-manager. This allows for more granular policy control, even within the kube-system namespace. You can run the following command to confirm that the flag has been set; it demonstrates that there is indeed a service account per controller:
$ kubectl get serviceaccount -n kube-system | grep '.*-controller'
attachdetach-controller                 1         6d13h
certificate-controller                  1         6d13h
clusterrole-aggregation-controller      1         6d13h
cronjob-controller                      1         6d13h
daemon-set-controller                   1         6d13h
deployment-controller                   1         6d13h
disruption-controller                   1         6d13h
endpoint-controller                     1         6d13h
expand-controller                       1         6d13h
job-controller                          1         6d13h
namespace-controller                    1         6d13h
node-controller                         1         6d13h
pv-protection-controller                1         6d13h
pvc-protection-controller               1         6d13h
replicaset-controller                   1         6d13h
replication-controller                  1         6d13h
resourcequota-controller                1         6d13h
service-account-controller              1         6d13h
service-controller                      1         6d13h
statefulset-controller                  1         6d13h
ttl-controller                          1         6d13h
To best understand how PodSecurityPolicy enables you to secure your pods, let’s work through an end-to-end example together. This will help solidify the order of operations from policy creation through use.
Before you continue, the following section requires that your cluster have PodSecurityPolicy enabled in order for it to work. To see how to enable it, refer to the previous section.
You should not enable PodSecurityPolicy on a live cluster without considering the warnings provided in the previous section. Proceed with caution.
Let’s first test the experience without making any changes or creating any policies. The following is a test workload that simply runs the trusty pause container in a Deployment (save this file as pause-deployment.yaml on your local filesystem for use throughout this section):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-deployment
  namespace: default
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause
By running the following command, you can verify that you have a Deployment and a corresponding ReplicaSet, but no pod:
$ kubectl get deploy,rs,pods -l app=pause
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/pause-deployment   0/1     0            0           41s

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.extensions/pause-deployment-67b77c4f69   1         0         0       41s
If you describe the ReplicaSet, you can confirm the cause from the event log:
$ kubectl describe replicaset -l app=pause
Name:           pause-deployment-67b77c4f69
Namespace:      default
Selector:       app=pause,pod-template-hash=67b77c4f69
Labels:         app=pause
                pod-template-hash=67b77c4f69
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/pause-deployment
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=pause
           pod-template-hash=67b77c4f69
  Containers:
   pause:
    Image:        k8s.gcr.io/pause
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason        Age                  From                   Message
  ----     ------        ----                 ----                   -------
  Warning  FailedCreate  45s (x15 over 2m7s)  replicaset-controller  Error creating: pods "pause-deployment-67b77c4f69-" is forbidden: unable to validate against any pod security policy: []
This is because either no pod security policies are defined or the service account is not allowed to use a PodSecurityPolicy. You might have also noticed that all of the system pods in the kube-system namespace are probably still in the Running state. This is because those requests have already passed the admission phase of the request flow. If an event were to restart these pods, they would suffer the same fate as our test workload, given that there are no PodSecurityPolicy resources defined:

replicaset-controller Error creating: pods "pause-deployment-67b77c4f69-" is forbidden: unable to validate against any pod security policy: []
Let’s delete the test workload deployment:
$ kubectl delete deploy -l app=pause
deployment.extensions "pause-deployment" deleted
Now, let’s go fix this by defining pod security policies. For a complete list of policy settings, refer to the Kubernetes documentation. The following policies are basic variations of the examples provided in the Kubernetes documentation.
Call the first policy privileged, which we use to demonstrate how to allow privileged workloads. You can apply the following resources by using kubectl create -f <filename>:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities:
  - '*'
  volumes:
  - '*'
  hostNetwork: true
  hostPorts:
  - min: 0
    max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
The next policy defines restricted access and will suffice for many
workloads apart from those responsible for running Kubernetes
cluster-wide services such as kube-proxy
, located in the kube-system
namespace:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    - min: 1
      max: 65535
  readOnlyRootFilesystem: false
You can confirm that the policies have been created by running the following command:
$ kubectl get psp
NAME         PRIV    CAPS   SELINUX    RUNASUSER   FSGROUP     SUPGROUP    READONLYROOTFS   VOLUMES
privileged   true    *      RunAsAny   RunAsAny    RunAsAny    RunAsAny    false            *
restricted   false          RunAsAny   RunAsAny    MustRunAs   MustRunAs   false            configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim
Now that we have defined these policies, we need to grant the service accounts access to use them via Role-Based Access Control (RBAC). First, create the following ClusterRole that allows access to use the restricted PodSecurityPolicy that we created in the previous step:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp-restricted
rules:
- apiGroups:
  - extensions
  resources:
  - podsecuritypolicies
  resourceNames:
  - restricted
  verbs:
  - use
Now, create the following ClusterRole that allows access to use the privileged PodSecurityPolicy we created in the previous step:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp-privileged
rules:
- apiGroups:
  - extensions
  resources:
  - podsecuritypolicies
  resourceNames:
  - privileged
  verbs:
  - use
We must now create a corresponding ClusterRoleBinding that grants the system:serviceaccounts group access to the psp-restricted ClusterRole. This group includes all of the kube-controller-manager controller service accounts:
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp-restricted
subjects:
- kind: Group
  name: system:serviceaccounts
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: psp-restricted
  apiGroup: rbac.authorization.k8s.io
Go ahead and create the test workload again. You can see that the pod is now up and running:
$ kubectl create -f pause-deployment.yaml
deployment.apps/pause-deployment created
$ kubectl get deploy,rs,pod
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/pause-deployment   1/1     1            1           10s

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.extensions/pause-deployment-67b77c4f69   1         1         1       10s

NAME                                    READY   STATUS    RESTARTS   AGE
pod/pause-deployment-67b77c4f69-4gmdn   1/1     Running   0          9s
Update the test workload deployment to violate the restricted policy; setting privileged: true in the container securityContext should do the trick. Save this manifest as pause-privileged-deployment.yaml on your local filesystem and then apply it by using kubectl apply -f <filename>:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-privileged-deployment
  namespace: default
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        securityContext:
          privileged: true
Again, you can see that both the Deployment and the ReplicaSet have been created; however, the pod has not. You can find the details of why in the event log of the ReplicaSet:
$ kubectl create -f pause-privileged-deployment.yaml
deployment.apps/pause-privileged-deployment created
$ kubectl get deploy,rs,pods -l app=pause
NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/pause-privileged-deployment   0/1     0            0           37s

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.extensions/pause-privileged-deployment-6b7bcfb9b7   1         0         0       37s

$ kubectl describe replicaset -l app=pause
Name:           pause-privileged-deployment-6b7bcfb9b7
Namespace:      default
Selector:       app=pause,pod-template-hash=6b7bcfb9b7
Labels:         app=pause
                pod-template-hash=6b7bcfb9b7
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/pause-privileged-deployment
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=pause
           pod-template-hash=6b7bcfb9b7
  Containers:
   pause:
    Image:        k8s.gcr.io/pause
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason        Age                   From                   Message
  ----     ------        ----                  ----                   -------
  Warning  FailedCreate  78s (x15 over 2m39s)  replicaset-controller  Error creating: pods "pause-privileged-deployment-6b7bcfb9b7-" is forbidden: unable to validate against any pod security policy: [spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
The preceding example shows the exact reason why: Privileged containers are not allowed. Let’s delete the test workload deployment:
$ kubectl delete deploy pause-privileged-deployment
deployment.extensions "pause-privileged-deployment" deleted
So far, we’ve dealt only with cluster-level bindings. Now let’s allow the test workload access to the privileged policy by using a service account. First, create a serviceaccount in the default namespace:
$ kubectl create serviceaccount pause-privileged
serviceaccount/pause-privileged created
Bind that serviceaccount to the permissive ClusterRole. Save this manifest as role-pause-privileged-psp-permissive.yaml on your local filesystem and then apply it by using kubectl apply -f <filename>:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: pause-privileged-psp-permissive
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-privileged
subjects:
- kind: ServiceAccount
  name: pause-privileged
  namespace: default
Finally, update the test workload to use the pause-privileged service account, and then apply it to the cluster using kubectl apply:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-privileged-deployment
  namespace: default
  labels:
    app: pause
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        securityContext:
          privileged: true
      serviceAccountName: pause-privileged
You can see that the pod is now able to use the privileged policy:
$ kubectl create -f pause-privileged-deployment.yaml
deployment.apps/pause-privileged-deployment created
$ kubectl get deploy,rs,pod
NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/pause-privileged-deployment   1/1     1            1           14s

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.extensions/pause-privileged-deployment-658dc5569f   1         1         1       14s

NAME                                               READY   STATUS    RESTARTS   AGE
pod/pause-privileged-deployment-658dc5569f-nslnw   1/1     Running   0          14s
Now that you understand how to configure and use PodSecurityPolicy, it’s worth noting that there are quite a few challenges to using it in real-world environments. In this section, we describe the challenges we have experienced firsthand.
The real power of PodSecurityPolicy is to enable the cluster administrator and/or
user to ensure that their workloads meet a certain level of security. In
practice, you might often overlook just how many workloads run as root, use
hostPath
volumes, or have other risky settings that force you to craft
policies with security holes just to get the workloads up and running.
Getting the policies just right is a large investment, especially where there is a large set of workloads already running on Kubernetes without PodSecurityPolicy enabled.
Will your developers want to learn PodSecurityPolicy? What would be the incentive for them to do so? Without a lot of up front coordination and automation to make enabling PodSecurityPolicy a smooth transition, it’s very likely that PodSecurityPolicy won’t be adopted at all.
It’s difficult to troubleshoot policy evaluation. For example, you might want to understand why your workload matched or didn’t match a specific policy. Tooling or logging to make that easy doesn’t exist at this stage.
Are you pulling images from Docker Hub or another public repository? Chances are they will violate your policies in some shape or form and will be out of your control to fix. Another common place is Helm charts: do they ship with the appropriate policies in place?
PodSecurityPolicy is complex and can be error prone. Refer to the following best practices before implementing PodSecurityPolicy on your clusters:
It all comes down to RBAC. Whether you like it or not, PodSecurityPolicy usage is determined by RBAC. It’s this relationship that exposes all of the shortcomings in your current RBAC policy design. We cannot stress enough how important it is to automate your RBAC and PodSecurityPolicy creation and maintenance. Specifically, locking down access to service accounts is the key to using policy effectively.
Understand the policy scope. Determining how your policies will be laid out on your cluster is very important. Your policies can be cluster-wide, namespaced, or workload-specific in scope. There will always be workloads on your cluster that are part of the Kubernetes cluster operations that will need more permissive security privileges, so make sure that you have appropriate RBAC in place to stop unwanted workloads using your permissive policies.
Do you want to enable PodSecurityPolicy on an existing cluster? Use this handy open source tool to generate policies based on your current resources. This is a great start. From there, you can hone your policies.
As demonstrated, PodSecurityPolicy is an extremely powerful API to assist in keeping your cluster secure, but it demands a high tax for use. With careful planning and a pragmatic approach, PodSecurityPolicy can be successfully implemented on any cluster. At the very least, it will keep your security team happy.
Container runtimes are still largely considered an insecure workload isolation boundary, and there is no clear path to whether the most common runtimes of today will ever be recognized as secure. Industry momentum and interest in Kubernetes have led to the development of different container runtimes that offer varying levels of isolation. Some are based on familiar and trusted technology stacks, whereas others are a completely new attempt to tackle the problem. Open source projects like Kata Containers, gVisor, and Firecracker tout the promise of stronger workload isolation. These specific projects are either based on nested virtualization (running a super lightweight virtual machine within a virtual machine) or on system call filtering and servicing.
The introduction of container runtimes that offer different levels of workload isolation allows users to choose among multiple runtimes, based on their isolation guarantees, within the same cluster. For example, you could have trusted and untrusted workloads running in the same cluster under different container runtimes.
RuntimeClass was introduced into Kubernetes as an API to allow container runtime
selection. It is used to represent one of the supported container
runtimes on the cluster when it has been configured by the cluster administrator.
As a Kubernetes user, you can define specific runtime classes for your
workloads by using the RuntimeClassName in the pod specification.
Under the hood, the RuntimeClass designates a RuntimeHandler, which is passed to the Container Runtime Interface (CRI) to implement. Node labels or node taints can then be used in conjunction with nodeSelectors or tolerations to ensure that the workload lands on a node capable of supporting the desired RuntimeClass. Figure 10-1 demonstrates how the kubelet uses RuntimeClass when launching pods.
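As an illustrative sketch of the API (the handler name, node label, and API version shown are assumptions that depend on your cluster configuration and Kubernetes release), a RuntimeClass and a pod selecting it might look like this:

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: sandboxed
handler: runsc            # illustrative: a gVisor handler configured in the CRI
scheduling:
  nodeSelector:
    runtime: gvisor       # illustrative label applied to capable nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: sandboxed
  containers:
  - name: app
    image: k8s.gcr.io/pause
```

The scheduling stanza ensures pods that request this RuntimeClass only land on nodes carrying the matching label.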
The RuntimeClass
API is under active development. For the latest updates on the feature state, visit the upstream documentation.
Following are some open source container runtime implementations that offer different levels of security and isolation for your consideration. This list is intended as a guide and is by no means exhaustive:

containerd
An API facade for container runtimes, with an emphasis on simplicity, robustness, and portability.

CRI-O
A purpose-built, lightweight Open Container Initiative (OCI)-based implementation of a container runtime for Kubernetes.

Firecracker
Built on top of the Kernel-based Virtual Machine (KVM), this virtualization technology allows you to launch microVMs in nonvirtualized environments very quickly, with the security and isolation of traditional VMs.

gVisor
An OCI-compatible sandbox runtime that runs containers with a new user-space kernel, which provides a low-overhead, secure, isolated container runtime.

Kata Containers
A community building a secure container runtime that provides VM-like security and isolation by running lightweight VMs that feel and operate like containers.
The following best practices will help you to avoid common workload isolation and RuntimeClass pitfalls:
Implementing different workload isolation environments via RuntimeClass will complicate your operational environment. Workloads might not be portable across different container runtimes, given the nature of the isolation each provides. The matrix of supported features across different runtimes can be complicated to understand and will lead to a poor user experience. If possible, we recommend having separate clusters, each with a single runtime, to avoid confusion.
Workload isolation doesn’t mean secure multitenancy. Even though you might have implemented a secure container runtime, this doesn’t mean that the Kubernetes cluster and APIs have been secured in the same fashion. You must consider the total surface area of Kubernetes end to end. Just because you have an isolated workload doesn’t mean that it cannot be modified by a bad actor via the Kubernetes API.
Tooling across different runtimes is inconsistent. You might have users who rely on
container runtime tooling for debugging and introspection. Having different runtimes means that you
might no longer be able to run docker ps
to list running containers. This leads to confusion
and complications when troubleshooting.
In addition to PodSecurityPolicy and workload isolation, here are some other tools you may consider when determining how to handle pod and container security.
If you’re worried about diving into the deep end with
PodSecurityPolicy, here are some options that offer a fraction of the functionality but might offer a viable alternative. You can use admission controllers such as DenyExecOnPrivileged
and
DenyEscalatingExec
in conjunction with an admission webhook to add
SecurityContext workload settings to achieve a similar
outcome. For more information on admission control, refer to Chapter 17.
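For illustration, these admission controllers are enabled with the same API server flag used for other plugins (a sketch; verify availability for your Kubernetes version, as these particular plugins have been marked deprecated upstream):

```shell
# Illustrative kube-apiserver invocation fragment; other required flags omitted.
kube-apiserver \
  --enable-admission-plugins=DenyEscalatingExec,DenyExecOnPrivileged
```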
We’ve covered security policies and container runtimes, but what happens when you want to introspect and enforce policy within the container runtime? There are open source tools that can do this and more. They operate either by listening to and filtering Linux system calls or by utilizing a Berkeley Packet Filter (BPF). One such tool is Falco, a Cloud Native Computing Foundation (CNCF) project that installs as a DaemonSet and allows you to configure and enforce policy during execution. Falco is just one approach; we encourage you to take a look at the tooling in this space to see what works for you.
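To give a flavor of what such runtime policy looks like, here is a simplified Falco-style rule (a sketch, not a production rule; consult the Falco rule documentation for the full syntax):

```yaml
- rule: Shell Spawned in Container
  desc: Detect an interactive shell launched inside a container
  condition: container.id != host and proc.name in (bash, sh, zsh)
  output: "Shell spawned in container (user=%user.name container=%container.name)"
  priority: WARNING
```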
In this chapter, we covered in depth both the PodSecurityPolicy and the RuntimeClass APIs with which you can configure a granular level of security for your workloads. We have also taken a look at some open source ecosystem tooling that you can use to monitor and enforce policy within the container runtime. We have provided a thorough overview for you to make an informed decision about providing the level of security that is best suited for your workload needs.