5 CNIs and providing the Pod with a network

This chapter covers

  • Defining the Kubernetes SDN in terms of the kube-proxy and CNI
  • Connecting between traditional SDN Linux tools and CNI plugins
  • Using open source technologies to govern the way CNIs operate
  • Exploring the Calico and Antrea CNI providers

Software-Defined Networking (SDN) traditionally manages load balancing, isolation, and security of VMs in the cloud, as well as in many on-premise data centers. SDNs are a convenience that eases the burden on system administrators, allowing reconfiguration of large data center networks every week, or maybe every day, as new VMs are created or destroyed. Fast-forwarding into the age of containers, the concept of SDN takes on a whole new meaning because our networks change constantly (every second, in a large Kubernetes cluster), and so it must, by definition, be automated by software. The Kubernetes network is entirely software-defined and is constantly in flux due to the ephemeral and dynamic nature of Kubernetes Pod and Service endpoints.

In this chapter, we’ll look at Pod-to-Pod networking and, in particular, how hundreds or thousands of containers on a given machine can have unique, cluster-routable IP addresses. Kubernetes delivers this functionality in a modular and extensible way by using a Container Network Interface (CNI) standard, which can be implemented by a broad range of technologies to give each Pod a unique routable IP address.

The CNI specification doesn’t specify the details of container networking

The CNI specification is a generic definition of the high-level operations needed to add a container to a network. Reading it too literally through the lens of full-blown Kubernetes CNI providers can cause a little difficulty at first. For example, some CNI plugins, such as the IPAM plugin (https://www.cni.dev/plugins/current/ipam/), are only responsible for finding a valid IP address for a container, while other CNI plugins, such as Antrea or Calico, operate at a higher level, delegating functionality as needed to other plugins. Some CNI plugins, in fact, do not actually attach a Pod to a network at all, but rather play a minute role in the broader “let’s add this container to a network” workflow. (With that in mind, the IPAM plugin is a good way to grok this concept.)

Keep in mind that any CNI plugin you’ll encounter in the wild is a snowflake that might operate at a different time in the overall progression of connecting your container to a network. Also, some CNI plugins are only meaningful in the context of the other plugins that reference them.

Let’s revisit our Pods from earlier and look back at their core networking requirement. As part of exploring this concept, we previously discussed the way iptables rules, IPVS (IP Virtual Server), and other network proxy implementations are managed by the kube-proxy. We also looked at various KUBE-SEP rules that tell the Linux kernel to “masquerade” traffic so that traffic leaving a container is postmarked as coming from a node, or to NAT traffic via a service IP. This traffic is then forwarded to a running Pod, which often might be on a different node in our cluster.

The kube-proxy is great for routing services to backend Pods and is usually the first piece of software-defined networking that users interact with. For example, when you first run and expose a simple Kubernetes application with a node port, you are accessing a Pod through a routing rule that’s created by the kube-proxy running on your Kubernetes nodes. The kube-proxy, however, is not particularly useful unless there is a robust Pod network on your cluster. This is because, ultimately, its only job is to map a service IP address into a Pod’s IP address. If that Pod’s IP address is not routable between two nodes, then the kube-proxy’s routing decisions do not result in an application that is usable to the end user. That is to say, a load balancer is only as reliable as its slowest endpoint.
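To see this service-IP-to-Pod-IP mapping concretely, you can compare a Service with its Endpoints object, which the control plane keeps in sync with the Pods the Service selects. The following is a minimal sketch using the kube-dns Service that exists in most clusters; your IPs will differ:

$ kubectl get svc kube-dns -n kube-system
# The Endpoints object lists the Pod IPs that the kube-proxy writes
# forwarding rules for.
$ kubectl get endpoints kube-dns -n kube-system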

The kpng project and the future of kube-proxy

As Kubernetes grows, the CNI landscape is expanding to implement the kube-proxy’s service routing functionality at the CNI level. This allows CNI providers like Antrea, Calico, and Cilium to provide high performance and extended feature sets for the Kubernetes service proxy (for example, monitoring and native integration with other load-balancing technologies).

To address the need for a “pluggable” network proxy that can retain some of the core logic from Kubernetes, while allowing vendors to extend other parts, the kpng project (https://github.com/kubernetes-sigs/kpng) was created and is incubating as a new kube-proxy alternative. It’s extremely modular and lives completely outside of the Kubernetes codebase. If you are interested in Kubernetes load-balancing services, it’s a great project to dig into and learn more about, but it is not ready for production workloads at the time of this writing.

As an example of an alternative CNI-provided network proxy that might someday be able to be fully implemented as a kpng extension, you can look at projects such as the Antrea proxy (currently a new feature in Antrea) that can be turned on or off, based on user preference. You’ll find more information at http://mng.bz/AxGQ.

5.1 Why we need software-defined networks in Kubernetes

The container networking puzzle can be defined as follows: given hundreds of Pods, some of which correspond to the same service, how can we consistently route traffic into and out of a cluster so that all traffic always goes to the right place, even if our Pods are moving? This is the obvious Day 2 operations problem facing anyone who has tried to run a non-Kubernetes container solution in production (for example, Docker). To solve this, Kubernetes gives us two fundamental networking tools:

  • The service proxy—Ensures that Pods can be load balanced behind Services with stable IPs, implementing the routing for Kubernetes Service objects

  • The CNI—Ensures that Pods can be reborn continually inside a network that is flat and easy to access from inside the cluster

At the heart of this solution is the Kubernetes Service object with the type ClusterIP. A ClusterIP service is a Kubernetes Service that is routable inside your Kubernetes cluster, but it is not accessible outside your cluster. It is a fundamental primitive on top of which other services can be built. It’s also a simple way for applications inside your cluster to access one another without needing to directly route to a Pod IP address (remember, Pod IPs can change if they move or die).
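As a quick illustration of that convenience, an application inside the cluster can reach another application through the Service’s stable DNS name or ClusterIP rather than a Pod IP. The following is a hedged sketch; the Service name my-service-1 is hypothetical, and the busybox image tag is just an example:

$ kubectl run tmp-shell --rm -it --image=busybox:1.36 --restart=Never -- \
    nslookup my-service-1.default.svc.cluster.local
# The name resolves to the ClusterIP, which stays stable even as the
# backing Pods are rescheduled and change IP addresses.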

As an example, if we create the same service three times in a kind cluster, we will see that it has three random IP addresses in the 10.96 IP space. To verify this, we can recreate the same three services by running kubectl create service clusterip my-service-1 --tcp="100:100" three times in a row (changing the name of my-service-1, of course). Afterward, we could list the service IPs like so:

$ kubectl get svc -o wide
svc-1 ClusterIP 10.96.7.53    80/TCP 48s app=MyApp
svc-2 ClusterIP 10.96.152.223 80/TCP 33s app=MyApp
svc-3 ClusterIP 10.96.43.92   80/TCP 5s  app=MyApp
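For reference, the three creation commands that produce a listing like the previous one might look like this minimal sketch (the names svc-1, svc-2, and svc-3 are chosen to match the output shown; any valid names work):

$ kubectl create service clusterip svc-1 --tcp="100:100"
$ kubectl create service clusterip svc-2 --tcp="100:100"
$ kubectl create service clusterip svc-3 --tcp="100:100"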

Similarly for Pods, we have a single network and subnet. We can see that new IP addresses are easily provisioned when making new Pods. Because our kind cluster already has two CoreDNS Pods running, we can check their IP addresses to confirm this:

$ kubectl get pods -A -o wide | grep coredns
kube-system coredns-74ff55c5b-nlxrs 1/1  Running 0 4d16h 192.168.71.1
 calico-control-plane <none> <none>
kube-system coredns-74ff55c5b-t4p6s 1/1  Running 0 4d16h 192.168.71.3
 calico-control-plane <none> <none>

We just saw the first important lesson of the Kubernetes SDN: Pod and Service IP addresses are managed for us and are in different IP subnets. This is (generally) a constant in almost any cluster we’ll encounter in the real world. In fact, if we do encounter a cluster where this is not the case, there is a chance that some other behavior of Kubernetes has been severely compromised, such as the ability of the kube-proxy to route traffic or the ability of the node to route Pod traffic.

The Kubernetes control plane charts the course for Pod and Service IP ranges

It’s a common misconception in Kubernetes that CNI providers are responsible for Service as well as Pod IP addresses. Actually, when you make a new ClusterIP Service, the Kubernetes control plane allocates a new IP from the CIDR you give the API server on startup as a command-line option (for example, --service-cluster-ip-range). Pod CIDRs, in turn, are carved out per node by the controller manager when the --allocate-node-cidrs option is set, and CNI providers often rely on these node CIDRs if specified. Thus, the CNI and the network proxy act at a highly localized level, carrying out the directives of the overall cluster configuration that is coordinated by the Kubernetes control plane.
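If you want to see these ranges in your own cluster, a rough sketch on a kubeadm-based cluster (such as kind) looks like the following; the node container name calico-control-plane is an example, and the flags may be set differently on managed clouds:

$ docker exec calico-control-plane \
    grep -- --service-cluster-ip-range /etc/kubernetes/manifests/kube-apiserver.yaml
# The controller manager's manifest carries the Pod-side settings:
$ docker exec calico-control-plane \
    grep -E -- '--allocate-node-cidrs|--cluster-cidr' /etc/kubernetes/manifests/kube-controller-manager.yaml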

5.2 Implementing the service side of the Kubernetes SDN: The kube-proxy

There are three primary types of Kubernetes Service API objects that we can create (as you likely know by now): ClusterIPs, NodePorts, and LoadBalancers. These Services define which backend Pod we’ll connect to by the use of labels. For example, in the previous cluster, we have ClusterIP services in our 10 subnet, and those Services route traffic to Pods in our 192 subnet. How does traffic destined for a service IP get routed into another subnet? It gets routed by the kube-proxy (or, more formally, the Kubernetes network or service proxy).

In the previous example, we ran kubectl create service clusterip my-service-1 --tcp="100:100" three times and got three Services of type ClusterIP. If we were to make these Services of type NodePort, then they would be reachable on the IP of any node in our entire cluster. If we were to make these Services a LoadBalancer type, then our cloud (if we were in a cloud) would provide an external IP, such as 35.1.2.3. This would be accessible on the wider internet or on a network outside our Pod, node, or service IP range, depending on the cloud provider.

Is the kube-proxy a proxy?

In the early days of Kubernetes, the kube-proxy itself opened a new Golang routine (goroutine) for each incoming request; thus, services were actually implemented by a userspace process that responded to traffic. The creation of the Kubernetes iptables proxy (and later, the IPVS proxy) and the Windows kernel proxy led to the kube-proxy being much more scalable and CPU-efficient.

Some such use cases for userspace proxying still exist but are few and far between. For example, VMware’s Tanzu Kubernetes Grid uses userspace proxying to support Windows clusters because it cannot rely on kernel-space proxying. This is due to a difference in architecture in the way that it uses Open vSwitch (OVS). In any case, the kube-proxy, in general, typically tells other proxying tools about Kubernetes endpoints, but it is usually not considered a proxy in the traditional sense.

Figure 5.1 shows the flow of traffic from a LoadBalancer into a Kubernetes cluster. It depicts how

  • The kube-proxy uses a low-level routing technology like iptables or IPVS to send traffic from services into and out of Pods.

  • We get an IP address from the outside world when we have a service of type LoadBalancer. This then routes into our internal service IP.

Figure 5.1 The flow of traffic from a LoadBalancer into a Kubernetes cluster

NodePort vs. ClusterIP services

NodePorts are Services in Kubernetes that are exposed on a port on every node, outside of the internal Pod network. They provide a primitive on which you can build a load balancer. For example, you might have a web app that serves on a ClusterIP of, say, 100.1.2.3:443.

If you want to access that app from outside your cluster, every node can forward traffic to this service from a NodePort. The value of a NodePort is random; for example, it might be something like 50491. Thus, you could access your web app on node_ip_1:50491, node_ip_2:50491, node_ip_3:50491, and so on.

If you are interested in more optimal ways to set up routing, such as setting a Service’s externalTrafficPolicy field, be aware that this might not work the same on all OSs and cloud types. Make sure to dig into the details if you decide to get fancy with service routing.
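As a sketch of what that looks like, the field lives in the Service spec (the name, selector, and port below are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: my-web-app
spec:
  type: NodePort
  selector:
    app: my-web-app
  ports:
  - port: 443
    targetPort: 443
  externalTrafficPolicy: Local   # only route to Pods on the node that received the traffic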

NodePorts are built on top of ClusterIP services. ClusterIP services have an internal IP address that (usually) does not overlap with your Pod network and that is allocated and tracked by your API server.

Reading kube-proxy’s iptables rules just for fun

If you are interested in seeing a fully annotated iptables configuration in a real cluster, you can look at the iptables-save-calico.md file at http://mng.bz/enV9. We put together this file as a way to see all iptables rules that normally might be output from a Kubernetes cluster running in the wild.

In particular, in this file we note that there are three main iptables tables, and the most important one for Kubernetes is the NAT table. This is where the highly dynamic ebb and flow of services and Pods takes its toll on large clusters. As mentioned in other parts of this book, there are tradeoffs between different kube-proxy configurations, but by far, the most commonly used proxy is the iptables kube-proxy.
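If you’d rather poke at a live cluster than read that file, a quick (hedged) way to dump the NAT table from a kind node is the following; the node container name calico-control-plane is an example:

$ docker exec calico-control-plane iptables-save -t nat | grep KUBE-SERVICES | head
# Each ClusterIP gets a KUBE-SVC-* chain, and each backend Pod gets a
# KUBE-SEP-* chain that DNATs traffic to the Pod IP.
$ docker exec calico-control-plane iptables-save -t nat | grep KUBE-SEP | head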

5.2.1 The kube-proxy’s data plane

The kube-proxy needs to be able to handle ongoing TCP traffic going to and from Pods that are backed by services. An IP packet has certain fundamental properties including the source and destination IP addresses. In a sophisticated network, these may get changed because a packet moves through a series of routers, and we consider a Kubernetes node (due to the kube-proxy) to be one such router. In general, the manipulation of a packet’s source or destination address is known as NAT (network address translation) and is a fundamental aspect of almost any network architecture solution at some level. SNAT and DNAT refer to the translation of source and destination IP addresses, respectively.
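To make SNAT and DNAT concrete, this is roughly the shape of the rules the kube-proxy writes in iptables mode (the chain name and IP here are made up for illustration; real rules also carry comments and chain hashes):

# DNAT: traffic sent to a service IP is rewritten to a backend Pod IP.
-A KUBE-SEP-EXAMPLE1 -p tcp -m tcp -j DNAT --to-destination 192.168.71.1:53
# SNAT (masquerade): marked traffic leaving the node is postmarked with the
# node's address so replies can find their way back.
-A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE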

The data plane of the kube-proxy can accomplish this task in a variety of ways, and this is specified to the kube-proxy by its mode configuration at startup. If we dig into the details, we find that the kube-proxy itself is organized into two separate control paths: server_windows.go and server_others.go (both located here: http://mng.bz/EWxl). The server_windows.go code path is compiled into a kube-proxy.exe binary and makes native calls to underlying Windows system APIs (such as the netsh command for the userspace proxy and the hcsshim and HCN [http://mng.bz/N6x2] containerization APIs for the Windows kernel proxy).

The more common case is that we run the kube-proxy on Linux. In this case, a different binary program (which is called kube-proxy) runs. This program doesn’t compile the Windows functionality into its code path. In the Linux scenario, we usually run the iptables proxy. In your kind clusters, the kube-proxy just runs in the default iptables mode. You can confirm this by looking at the configuration of the kube-proxy by running kubectl edit cm kube-proxy -n kube-system and looking at its mode field:

  • ipvs uses the kernel load balancer to write routing rules for services (Linux).

  • iptables uses the kernel firewall to write routing rules for services (Linux).

  • The userspace mode creates a process using a Golang go func worker that manually proxies traffic to a Pod (Linux).

  • The Windows kernel relies on the hcsshim and HCN APIs for load balancing, which is incompatible with OVS-related CNI implementations but works with other CNIs like Calico (similar to the Linux userspace option).

  • The Windows userspace also uses netsh for certain aspects of routing. This is useful for people who, for some reason, can’t use the regular Windows kernel’s APIs. Note that if you install an OVS extension on Windows, you may need to use the userspace proxy because the kernel’s HCN APIs do not work in the same way.
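A non-interactive way to check which mode your cluster is using is a quick grep of the ConfigMap (a minimal sketch; in kind, an empty mode field means the iptables default on Linux):

$ kubectl get cm kube-proxy -n kube-system -o yaml | grep -A1 'mode:'
# An empty value (mode: "") falls back to the iptables proxy on Linux.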

Note Throughout this book, we will mention the notion of informers, controllers, and Operators and how their behavior is not always uniformly implemented with respect to the configuration changes that occur. Although the network proxy is implemented with a Kubernetes controller, it doesn’t dynamically respond to configuration changes. Thus, if you want to play with your kind cluster to modify the way that service load balancing is done, you’ll need to edit the ConfigMap for the network proxy and then restart its DaemonSet. (If you want, you can do this by killing a Pod in your DaemonSet and then viewing the Pod’s logs as it is reborn. You should see the new kube-proxy mode.)
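As a rough sketch of that workflow (the label selector k8s-app=kube-proxy is what kubeadm-based clusters such as kind use; adjust it for your distribution):

$ kubectl edit cm kube-proxy -n kube-system        # change mode, for example to "ipvs"
$ kubectl delete pod -n kube-system -l k8s-app=kube-proxy
# The DaemonSet recreates the Pods; their logs should report the new mode.
$ kubectl logs -n kube-system -l k8s-app=kube-proxy | grep -i proxier | head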

The kube-proxy is, however, just one way to define how the Kubernetes SDN routes traffic. To be comprehensive, we can think of Kubernetes routing in three separate layers:

  • External load balancers or ingress/gateway routers—Forward traffic into a Kubernetes cluster.

  • The kube-proxy—Manages forwarding from Services to Pods. As you may know by now, the term proxy is a bit of a misnomer because, typically, the kube-proxy just maintains static routing rules that are implemented by a kernel or other data plane technology, such as iptables rules.

  • CNI providers—Route traffic to and from Pods regardless of whether we are accessing them through a service endpoint or directly (Pod-to-Pod networking).

Ultimately, a CNI provider (like the kube-proxy) also configures some kind of rule engine (such as a routing table) or an OVS switch to ensure that traffic between nodes or from the outside world can route to Pods. If you’re wondering why the technology for the kube-proxy is different from that of CNIs, you’re not alone! Many CNI providers are endeavoring to implement a full-blown kube-proxy themselves so that the kube-proxy from Kubernetes is no longer required.

5.2.2 What about NodePorts?

We’ve demonstrated the ClusterIP services in the first part of this chapter, but we haven’t yet looked at NodePort services. Let’s do that now by getting our feet wet and creating a new Kubernetes service. This will ultimately demonstrate how easy it is to add and modify load-balancing rules. For this example, let’s make a NodePort service that points to a CoreDNS container running inside a Pod in our cluster. We can quickly cobble one together by looking at the contents of kubectl get svc -o yaml kube-dns -n kube-system. We can then change the type of service from ClusterIP to NodePort like so:

# save the following file to my-nodeport.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9153"
    prometheus.io/scrape: "true"
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: CoreDNS
  name: kube-dns-2                 
  namespace: kube-system
spec:
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  - name: metrics
    port: 9153
    protocol: TCP
    targetPort: 9153
  selector:
    k8s-app: kube-dns
  sessionAffinity: None
  type: NodePort                   
status:
  loadBalancer: {}

Names the service kube-dns-2 to differentiate it from the already existing kube-dns service

Changes the type of this service to a NodePort

Now, if we run kubectl create -f my-nodeport.yaml, we’ll see that a random port was allocated for us. This is now forwarding traffic to CoreDNS for us:

$ kubectl get svc -o wide -A
kube-system   kube-dns     ClusterIP   10.96.0.10
              53/UDP,53/TCP,9153/TCP k8s-app=kube-dns
kube-system   kube-dns-2   NodePort    10.96.80.7
              53:30357/UDP,53:30357/TCP,9153:31588/TCP
              2m33s   k8s-app=kube-dns                  

Maps the random ports 30357 and 31588 to the service ports 53 and 9153

The random ports 30357 and 31588, mapped to ports 53 and 9153 of our DNS service Pods, are open on all the nodes of our cluster. That’s because all nodes are running the kube-proxy. These random ports were not allocated earlier when we created the ClusterIP services.

If you are feeling brave, we’ll leave it as an exercise for you to run iptables-save on your kind Docker nodes and fish out the handiwork that the kube-proxy has done to write rules for your newly created service IP addresses. (If you are interested in exercising NodePorts, you’ll enjoy our later chapter about how to install and test Kubernetes applications locally. There, we’ll create several services for testing the famous Guestbook application in Kubernetes.)
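If you do take up that exercise, a starting point might look like this sketch (the node container name and port are examples from this chapter; yours will differ):

$ docker exec calico-control-plane iptables-save -t nat | grep 30357
# Expect KUBE-NODEPORTS entries that jump to the KUBE-SVC chain for
# kube-dns-2, which in turn DNATs to the CoreDNS Pod IPs.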

Now that you’ve got a little bit of a refresher on how services ultimately plumb routing rules between internal Pod ports and the external world, let’s look at CNI providers. These provide the next layer below the service proxy in the overall Kubernetes SDN networking stack. Ultimately, all our service is really doing is routing traffic from 10.96.80.7 to the Pods that are living inside our cluster. How do these Pods get attached to a valid IP address, and how do they receive this traffic? The answer is . . . the CNI interface.

5.3 CNI providers

CNI providers implement the CNI specification (http://mng.bz/RENK), which defines a contract that allows container runtimes to request a working IP address for a process on startup. They also add other fancy features outside this specification (like implementing network policies or third-party network monitoring integrations). For example, VMware users will find that they can use Antrea as a CNI provider for free and plug it into things like VMware’s NSX platform for real-time container monitoring and logging features that some of the current open source CNI providers include. Although CNI providers, theoretically, only need to route Pod traffic, many provide extra bells and whistles. A quick rundown of the major, on-premise CNIs includes

  • Calico—A BGP-based CNI provider that creates Border Gateway Protocol (BGP) routing rules to implement the data plane. Calico additionally supports eBPF/XDP, IP-in-IP, and VXLAN routing options (for example, on Windows, it’s not uncommon to run Calico in VXLAN mode). As an advanced CNI, it has the ability to replace the kube-proxy, using technology similar to Cilium’s.

  • Antrea—An OVS data plane CNI provider that uses a bridge to route all Pod traffic. It is similar to Calico in that it has many advanced routing and network proxy replacement options (AntreaProxy).

  • Flannel—A bridge-based IP CNI provider that is no longer commonly used. It was one of the original major CNIs for production Kubernetes clusters.

  • Google, EC2, and NCP—These cloud-based CNIs use proprietary software to make cloud-aware traffic routing decisions. For example, they are capable of creating rules that route traffic directly between containers without needing to travel over node network paths.

  • Cilium—An eBPF-based CNI provider that uses modern Linux kernel APIs to route traffic without relying on iptables. This allows for faster and more secure IP communication between containers in some cases. Cilium uses its advanced data path tooling to provide a network proxy alternative as well.

  • KindNet—A simple CNI plugin that is used in kind clusters by default, but it’s only designed to work in simple clusters with only one subnet.

There are many other CNIs that might be specific to other vendors or open source technologies, as well as proprietary CNI providers for various cloud environments such as VMware, Azure, EKS, and so on. These proprietary CNIs only run inside a given vendor’s infrastructure and, thus, are less portable but often more performant or better integrated with cloud features. Some CNIs, like Calico and Antrea, provide both vendor-specific and vendor-neutral functionality (such as Tigera or NSX specific integrations, for example).

5.4 Diving into two CNI networking plugins: Calico and Antrea

Figure 5.2 shows how CNI networking works in the Calico and Antrea plugins. Both of these plugins accomplish the same end state using a series of routing rules and open source technologies. The CNI interface defines a few core functional aspects of any networking solution for containers, and all CNI plugins (BGP and OVS, for example) implement that functionality in different ways. As figure 5.2 shows, different CNIs use different underlying technology stacks.

Figure 5.2 CNI networking in the Calico and Antrea plugins

Is kube-proxy a requirement?

We talk about kube-proxy as being a requirement, but increasingly, network vendors are beginning to propose technologies such as Extended Berkeley Packet Filter (eBPF), provided by the Cilium CNI, or the OVS proxy, provided by the Antrea CNI, which shortcut the need for running kube-proxy. These typically borrow kube-proxy’s inner logic and attempt to reproduce and implement it in a way that uses a different underlying data plane. The majority of clusters at the time of this book’s publication, however, use the traditional iptables or Windows kernel proxy. Thus, we refer to the kube-proxy as a constant feature in a modern Kubernetes cluster. But look out on the horizon for fancy alternatives as the cloud native landscape expands.

5.4.1 The architecture of a CNI plugin

Both Calico and Antrea have similar architectures: a DaemonSet and a coordination container. To set these up, a CNI installation includes four steps (fully automated for you by your CNI provider, usually, so that this can be done in a snappy one-liner in simple Linux clusters):

  1. Install the kube-proxy because it’s likely that your CNI provider’s coordination controller will need the ability to query the Kubernetes API server. This is usually done for you ahead of time by any Kubernetes installer.

  2. Install a binary CNI program on the node (usually in a directory such as /opt/cni/bin) that can be called by the container runtime to create a Pod with a CNI-provided IP address.

  3. Deploy a DaemonSet to your cluster, where one container sets up networking primitives for its resident node. This DaemonSet does the previous install step for its host on startup.

  4. Deploy a coordination container to your cluster that either aggregates or proxies metadata from Kubernetes; for example, aggregating NetworkPolicy information in a single place so that it can easily be consumed and deduplicated by the DaemonSet Pods.

There’s no one mandated architecture for a CNI plugin, but the overall DaemonSet plus controller pattern is pretty robust. It is generally a good pattern to follow in Kubernetes for any agent-oriented process that is designed to be integrated with the Kubernetes API.

Note CNI providers give IP addresses to Pods, but a lot of the assumptions around how this process works were originally made in a way that is biased to the Linux OS. Thus, we’ll look at the Calico and Antrea CNI providers, but when doing this, you should note that the behavior of these CNIs varies across other OSs. For example, in Windows, both Calico and Antrea are not typically run as Pods but rather as Windows services using tools such as nssm. Currently, some of the more battle-hardened, open source CNIs for Kubernetes that support both Linux and Windows are Calico and Antrea, but there are many others as well.

The CNI specification is implemented by the binary program installed by our agent. In particular, it implements three fundamental CNI operations: ADD, DELETE, and CHECK, which are called when a container runtime such as containerd starts a new Pod or deletes one (see the sketch after the following list). Respectively, these operations

  • Add a container to a network

  • Delete a container from a network

  • Check that the container is properly set up
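To make this concrete, here is a rough, hand-run sketch of an ADD call against the reference bridge and host-local plugins (the paths, names, and subnet are illustrative; real runtimes set these environment variables and pass the configuration on stdin for you):

$ CNI_COMMAND=ADD CNI_CONTAINERID=example123 \
  CNI_NETNS=/var/run/netns/example123 CNI_IFNAME=eth0 \
  CNI_PATH=/opt/cni/bin \
  /opt/cni/bin/bridge <<'EOF'
{
  "cniVersion": "0.4.0",
  "name": "example-net",
  "type": "bridge",
  "bridge": "cni0",
  "ipam": { "type": "host-local", "subnet": "10.22.0.0/16" }
}
EOF
# On success, the plugin prints a JSON result containing the IP address it
# assigned to the container's interface.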

5.4.2 Let’s play with some CNIs

Finally, we get to do some hacking! Let’s start by installing a Calico CNI provider in our kind cluster. Calico uses Layer 3 routing (as opposed to bridging, which is a Layer 2 technology) to broadcast routes for Pods in a cluster. End users won’t generally notice this difference, but it’s an important distinction to administrators because some administrators might want to use Layer 3 concepts (like BGP peering) or Layer 2 concepts (like OVS-based traffic monitoring) for broader infrastructure design goals in their clusters:

  • BGP stands for Border Gateway Protocol, which is a Layer 3 routing technology used commonly in the overall internet.

  • OVS stands for Open vSwitch, which is a Linux kernel-based API for programming a switch inside your OS to create virtual IP addresses.

The first step in making our kind cluster is to disable its default CNI. Then we’ll recreate it from a YAML specification. For example:

$ cat << EOF > kind-Calico-conf.yaml
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha4
networking:
  disableDefaultCNI: true               
  podSubnet: 192.168.0.0/16             
nodes:                                  
- role: control-plane
- role: worker
EOF
$ kind create cluster --name=calico --config=./kind-Calico-conf.yaml

Disables the kind-net CNI

Divides the 192.168 subnet so that it’s orthogonal to our service subnet

Adds a second node to our cluster

The kindnet CNI is a minimal CNI designed for simple kind clusters. We disable it so we can use a full-featured CNI provider. All our Pods will be on a large swath of the 192.168 subnet. Calico divides this up for each node, and it should be orthogonal to our service subnet. Additionally, having a second node in our cluster helps us to understand how Calico separates local traffic from traffic destined for another node.

Setting a kind cluster up with a real CNI plugin is not significantly different from what we’ve already done. Once this cluster comes up, it’s worth pausing for a moment to see what happens when a CNI isn’t yet available: Pods that aren’t static Pods (that is, Pods not defined in the kubelet’s manifests directory) remain unschedulable. You’ll see this by running the following kubectl commands:

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   coredns-66bff467f8-86mgh   0/1     Pending   0          7m47s
kube-system   coredns-66bff467f8-nfzhz   0/1     Pending   0          7m47s
 
$ kubectl get nodes
NAME                   STATUS     ROLES    AGE    VERSION
Calico-control-plane   NotReady   master   2m4s   v1.18.2
Calico-worker          NotReady   <none>   85s    v1.18.2

5.4.3 Installing the Calico CNI provider

At this point, our CoreDNS Pods will not be able to start because the Kubernetes scheduler sees that all nodes are NotReady, as the previous commands show. (If that’s not the case, check whether a CNI provider is already up and running.) The nodes are NotReady because no CNI provider has been configured yet; a node’s CNI is configured once a CNI container writes a configuration file into /etc/cni/net.d on that node’s local filesystem. In order to get our cluster going, we’ll now install Calico:

$ wget https://docs.projectcalico.org/manifests/calico.yaml
$ kubectl create -f calico.yaml

Kubernetes security matters, most of the time

This book is focused on learning Kubernetes internals, but we don’t spend much time making every command “airtight.” The previous command, for example, pulls a manifest file from the internet and installs several containers in your cluster. Make sure that you are running these commands in a sandbox (such as kind) if you don’t fully understand their consequences!

Chapters 13 and 14 provide a guide to Pod and node security. Beyond that, if you are interested in application-centric security, projects such as https://sigstore.dev/ and https://github.com/bitnami-labs/sealed-secrets have evolved over time to address various security concerns around Kubernetes binaries, artifacts, manifests, and even secrets. If you are interested in implementing the convenient Kubernetes idioms used in this book in a more secure manner, it’s worth delving into these (and other) tools in the Kubernetes ecosystem. For more information on general Kubernetes security concepts, consult https://kubernetes.io/docs/concepts/security/ or feel free to join the Kubernetes security mailing list (http://mng.bz/QWz1).

The previous step creates two container types: a Calico-node Pod on each node and a Calico-kube-controllers Pod, which runs on an arbitrary node. Once these containers come up, your nodes should be in the Ready state, and you’ll also see that the CoreDNS Pods are now running:

$ kubectl get pods --all-namespaces
NAMESPACE            NAME
kube-system          Calico-kube-cntrlrs-57-m5       
kube-system          Calico-node-4mbc5               
kube-system          Calico-node-gpvxm
kube-system          coredns-66bff467f8-98t8j
kube-system          coredns-66bff467f8-m7lj5
kube-system          etcd-Calico-control-plane
kube-system          kube-apiserver-Calico-control-plane
kube-system          kube-controller-mgr
kube-system          kube-proxy-8q5zq
kube-system          kube-proxy-zgrjf
kube-system          kube-scheduler-Calico-control-plane
local-path-storage   local-path-provisioner-b5-fsr

Coordinates the Calico node containers

Sets up various BGP and IP routes on a per node basis

In this code example, the Calico-kube-controllers container coordinates the Calico-node containers. Each Calico-node container sets up various BGP and IP routes on a per-node basis for all containers running on a given node. There are two Calico-node Pods because we have two nodes.

Both Calico and Antrea mount what are known as hostPath volume types. The Calico-node Pod uses these hostPath mounts to place the CNI binary and its configuration onto the host (in /opt/cni/bin and /etc/cni/net.d, respectively). The kubelet, via the container runtime, calls this binary through the CNI API when an IP address is needed for a new Pod, and thus, it can be thought of as the installation mechanism for the host’s CNI provider. Remember that hostPath volume types are (most of the time) an anti-pattern, unless you are configuring low-level OS functionality such as CNI.
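You can see the result of this install step by poking at a node directly. Here is a minimal sketch on our kind cluster (the node container name is an example, and the exact file names depend on the Calico version):

$ docker exec calico-control-plane ls /opt/cni/bin
$ docker exec calico-control-plane ls /etc/cni/net.d
# Expect a calico binary alongside the stock plugins, plus a
# *-calico.conflist configuration that the kubelet's runtime reads.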

In figure 5.2, we looked at the DaemonSet functionality as an interface that both Calico and Antrea implement. Let’s take a look at what Calico created by running kubectl get ds -n kube-system. We’ll see that there is a Calico DaemonSet for running a CNI Pod on all nodes. When we run Antrea later, we’ll see a similar DaemonSet for the Antrea agent.

Because Linux CNI plugins usually shovel a CNI binary file into the host’s system path, we can think of CNI plugins as implementing a MountCniBinary method. This might not be a part of the formal CNI interface, but it will ultimately be a part of almost any CNI plugin you’ll see in the wild.

Great! We now have a CNI. Let’s take a look at what has been created for us by Calico by running docker exec to get into our nodes and poke around. After running docker exec -t -i <your kind node> /bin/bash, we can start looking at what routes have been created by Calico. For example:

root@Calico-control-plane:/# ip route
default via 172.18.0.1 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope
              link src 172.18.0.3
192.168.9.128/26 via 172.18.0.2 dev tunl0
              proto bird onlink               
blackhole 192.168.71.0/26 proto bird          
192.168.71.1 dev cali38312ba5f3c scope link
192.168.71.2 dev califcbd6ecdce5 scope link

Traffic destined to another node is identified based on its subnet.

Traffic not matched by this node but on the 71 subnet will be thrown away.

We can see that there are two IP addresses here: 192.168.71.1 and 71.2. These IP addresses are associated with two devices prefixed with the string cali that our Calico-node containers created. How do these devices work? We can see how they’re defined by running the ip a command:

root@Calico-control-plane:/# ip a | grep califc
5: califcbd6ecdce5@if4: <BROADCAST,MULTICAST,UP,LOWER_UP>
 mtu 1440 qdisc noqueue state UP group default

Now we can see that the node has a veth interface, with a recognizable cali prefix, for each Calico-managed Pod. Let’s watch the traffic flowing over one of these interfaces. For example:

root@Calico-control-plane:/# apt-get update -y;
 apt-get install tcpdump                         
root@Calico-control-plane:/# tcpdump -s 0
 -i cali38312ba5f3c -v | grep 192                
tcpdump: listening on cali38312ba5f3c, link-type EN10MB (Ethernet),
 capture size 262144 bytes
 
    10.96.0.1.443 > 192.168.71.1.59186: Flags [P.],
                    cksum 0x14d2 (incorrect -> 0x7189),
                    seq 520038628:520039301, ack 2015131286, win 502,
                    options [nop,nop,TS val 1110809235 ecr 1170831911],
                    length 673
    192.168.71.1.59186 > 10.96.0.1.443: Flags [.],
                    cksum 0x1231 (incorrect -> 0x9f10),
                    ack 673, win 502,
                    options [nop,nop,TS val 1170833141 ecr 1110809235],
                    length 0
    10.96.0.1.443 > 192.168.71.1.59186:
                    Flags [P.], cksum 0x149c (incorrect -> 0xa745),
                    seq 673:1292, ack 1, win 502,
                    options [nop,nop,TS val 1110809914 ecr 1170833141],
                    length 619
    192.168.71.1.59186 > 10.96.0.1.443:
                    Flags [.], cksum 0x1231 (incorrect -> 0x9757),
                    ack 1292, win 502,
                    options [nop,nop,TS val 1170833820 ecr 1110809914],
                    length 0
    192.168.71.1.59186 > 10.96.0.1.443:
                    Flags [P.], cksum 0x1254 (incorrect -> 0x362c),
                    seq 1:36, ack 1292, win 502,
                    options [nop,nop,TS val 1170833820 ecr 1110809914],
                    length 35
    10.96.0.1.443 > 192.168.71.1.59186:
                    Flags [.], cksum 0x1231 (incorrect -> 0x9734),
                    ack 36, win 502, options [nop,nop,TS val 1110809914
                    ecr 1170833820],
                    length 0

Installs tcpdump in the container

Runs tcpdump against the Calico device

In our code example, we can see traffic between the 71.1 Pod IP and 10.96.0.1 on port 443. The 10.96 subnet is the subnet of our Kubernetes Services, and 10.96.0.1 is the ClusterIP of the Kubernetes API service that our CNI-powered CoreDNS Pod talks to. The previous cali3831... device is something that is directly attached (like any other device) via an Ethernet cable (of sorts) to our node. This is known as a veth pair, wherein our containers themselves have one end of a virtual Ethernet cable (named cali3831...) directly plugged into them from our kubelet. This means anyone attempting to reach this device from our kubelet can easily do so.

Now, let’s go back and look at the IP route table we showed earlier. The dev entries are now clear. These correspond to routes that plug into our containers directly. But what about the blackhole and 192.168.9.128/26 routes? These routes correspond to

  • Containers that belong to another node (the 192.168.9.128/26 route)

  • Containers that belong to no node at all (the blackhole route)

This is BGP in action. Every node in our cluster that runs a Calico-node daemon has a range of IPs that are routed to it. As new nodes come up, these routes are added to our IP route table over time. If you run kubectl scale deployment coredns -n kube-system --replicas=6, you’ll find that all IP addresses come up in one of two different subnets:

  • Some Pods come up in the 192.168.9 subnet. These correspond to one of our nodes.

  • Other Pods come up in the 192.168.71 subnet. These correspond to the other node.

The more nodes you see in your cluster, the more subnets you’ll have. Each node has its own IP range, and your CNI provider uses that IP range to allocate the IP addresses of the Pods on a given node to avoid collisions of Pod IP addresses across nodes. This also is a performance optimization because there is no need for global coordination of Pod IP address space. Thus, we can see that Calico is managing IP address ranges for us by carving up IP pools for individual nodes and then coordinating these pools with the route tables in the Kernel.
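To watch that carving-up in action, you can scale CoreDNS and list the Pod IPs along with their nodes (a quick sketch; the exact IPs and Pod names will differ in your cluster):

$ kubectl scale deployment coredns -n kube-system --replicas=6
$ kubectl get pods -n kube-system -o wide | grep coredns
# Pods on one node land in one /26 block (192.168.71.x here), while Pods
# on the other node land in its block (192.168.9.x).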

5.4.4 Kubernetes networking with OVS and Antrea

To the casual user, Antrea and Calico appear to do the same thing: route traffic between containers on a multi-node cluster. However, there’s a lot of subtlety in how this is accomplished when you peek under the covers.

OVS is what Antrea uses to power its CNI capabilities. Unlike Calico’s BGP approach, it doesn’t use IP routing as the mechanism for moving traffic directly from node to node. Rather, it creates a bridge that runs locally on our Kubernetes node. This bridge is created using OVS. OVS is, literally, a software-defined switch (like the ones you buy at any computer store). OVS is then the interface between our Pods and the rest of the world when running Antrea.

The pros and cons between bridged (also known as Layer 2) and IP (also known as Layer 3) routing are beyond the scope of this book and are hotly debated among academics and software companies alike. In our case, we’ll just say that these are different technologies that both work quite well and can scale to handle thousands of Pods quite readily.

Let’s try making our kind cluster again, this time using Antrea as our CNI provider. First, delete your last cluster with kind delete cluster --name=calico, and then we’ll recreate it with the code snippet that follows:

$ cat << EOF > kind-Antrea-conf.yaml
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: 192.168.0.0/16
nodes:
- role: control-plane
- role: worker
EOF
$ kind create cluster --name=antrea --config=./kind-Antrea-conf.yaml

Once your cluster comes up, run

kubectl apply -f https://github.com/vmware-tanzu/Antrea/
 releases/download/v0.8.0/Antrea.yml -n kube-system

Then, run docker exec again and take a look at the IP situation in your kubelets. This time, we see that there are a few different interfaces created for us. Note that we omit the tun0 interface that you’ll see in both CNIs. This is the network interface where encapsulated traffic between nodes flows.

Interestingly, when we run ip route, we don’t see a new route for every Pod we have running. This is because OVS uses a bridge, and thus, the Ethernet cables still exist, but they are all plugged directly into our locally running OVS instance. Running the following command, we can see subnet logic in Antrea that is similar to what we saw earlier in Calico:

root@Antrea-control-plane:/# ip route
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3
192.168.0.0/24 dev Antrea-gw0 proto kernel scope link
               src 192.168.0.1                         
192.168.1.0/24 via 192.168.1.1 dev
               Antrea-gw0 onlink                       

Defines traffic destined for our local subnet by the 0.0 suffix

The Antrea gateway manages traffic destined to another subnet with the 1.0 suffix.

Now, to confirm this, let’s run the ip a command. This will show us all the different IP addresses that our machine understands:

$ docker exec -t -i ba133 /bin/bash
root@Antrea-control-plane:/# ip a
# ip a
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state
    DOWN group default qlen 1000
    link/ether 2e:24:a8:d8:a3:50 brd ff:ff:ff:ff:ff:ff
4: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc
    noqueue master ovs-system state
    UNKNOWN group default qlen 1000
    link/ether 76:82:e1:8b:d4:86 brd ff:ff:ff:ff:ff:ff
5: Antrea-gw0:<BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state
    UNKNOWN group default qlen 1000
    link/ether 02:09:36:d3:cf:a4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 brd 192.168.0.255 scope global Antrea-gw0
       valid_lft forever preferred_lft forever

One of the interesting things to note when we run the ip a command is that we can see several unfamiliar devices floating around. These include

  • genev_sys_6081—The interface for Geneve, which is the tunneling protocol Antrea uses

  • ovs-system—An OVS interface

  • Antrea-gw0—An Antrea interface that sends traffic to Pods

Antrea, unlike Calico, actually routes traffic to a gateway IP address, which is on the Pod subnet that uses the podCIDR of our cluster. Thus, the algorithm for how Antrea sets up Pod IP addresses for a given node is something like this:

  1. Allocates a subnet of Pod IP addresses to every node

  2. Allocates the first IP address in the subnet to an OVS switch for the given node

  3. Allocates all new Pods to the remaining free IP addresses in the subnet
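Step 1 of that algorithm relies on the per-node Pod CIDR, which you can read straight off the Node objects (a minimal sketch):

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Each node gets a /24 slice of the cluster's 192.168.0.0/16 podSubnet;
# Antrea assigns the .1 address of each slice to its gateway device.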

The routing table for such a cluster follows a pattern based on the order in which nodes come up. Note that each node receives traffic on an x.y.z.1 IP address (the first IP address in its allocated subnet). The way that subnets are calculated per node relies on both your implementation of Kubernetes and how your CNI provider logic works. In some CNIs, there might not be a distinct subnet for every node, but in general, this is an intuitive way for a CNI to manage IP addresses over time, so it’s pretty common.

Keep in mind that both Calico and Antrea create distinct subnets for a node’s Pod network, and from that subnet, Pods are provisioned IP addresses. If you ever need to debug a network path in a CNI, knowing which Pods are going to which nodes might help you to reason about which machines you should reboot, ssh into, or delete entirely, depending on your DevOps practices.

The following snippet shows us the antrea-gw0 device. This is the gateway IP address for all the Pods on your cluster:

192.168.0.0/24 dev Antrea-gw0 proto kernel scope
               link src 192.168.0.1                       
192.168.1.0/24 via 192.168.1.1 dev Antrea-gw0 onlink      
192.168.2.0/24 via 192.168.2.1 dev Antrea-gw0 onlink      

All local Pods go directly to the local Antrea-gw0 device.

Forwards Pods destined for the second node in your cluster to that OVS instance

Forwards Pods destined for the third node in your cluster to that OVS instance

Thus, we can see that in the bridged model for networking, there are a few differences between what sorts of devices are created:

  • There is no blackhole route, as it is handled by OVS.

  • The only routes that our kernel manages are for the Antrea gateway (Antrea-gw0) itself.

  • All of this node’s Pod traffic goes directly to the Antrea-gw0 device. There is no global routing to other devices as there is with the BGP protocol used by our Calico CNI.

5.4.5 A note on CNI providers and kube-proxy on different OSs

It’s worth noting here that the trick of using DaemonSets to manage host networking for Pods is a Linux-specific approach. In other OSs (Windows Kubernetes nodes, for instance), when running containerd, you actually need to install your CNI provider using a service manager, and the CNI provider runs as a host process. Although this may change in the future (again using Windows as an example, there is work underway to enable privileged containers for Windows Kubernetes nodes), it’s instructive to note that the Linux networking stack is ideally suited for the Kubernetes networking model. This is largely due to the architecture of cgroups, namespaces, and the concept of the Linux root user, which can run as a highly privileged process even while running in a container.

Although the complexity of Kubernetes networking may seem daunting at first because of the rapid evolution of service meshes, CNI, and network/server proxies, as long as you can understand the basic process of routing between Pods, the principles remain constant across many CNI implementations.

Summary

  • Kubernetes networking architecture has a lot of parallels with generic SDN concepts.

  • Antrea and Calico are both CNI providers that overlay a cluster network on a real network for Pods.

  • Basic Linux commands (like ip a) can be used to reason about how your Pods are networked.

  • CNI providers manage Pod networks typically in DaemonSets that run a privileged Linux container on each node.

  • Border Gateway Protocol (BGP) and Open vSwitch (OVS) are both CNI provider core technologies that solve the same fundamental problems of broadcasting and sharing overlay routing information for Pods.

  • Other OSs like Windows currently don’t have all of the same native conveniences for Pod networking that Linux does.
