Software-Defined Networking (SDN) traditionally manages load balancing, isolation, and security of VMs in the cloud, as well as in many on-premise data centers. SDNs are a convenience that eases the burden on system administrators, allowing reconfiguration of large data center networks every week, or maybe every day, when new VMs are created or destroyed. Fast-forwarding into the age of containers, the concept of SDN takes on a whole new meaning because our networks change constantly (every second, in a large Kubernetes cluster), and so they must, by definition, be automated by software. The Kubernetes network is entirely software-defined and is constantly in flux due to the ephemeral and dynamic nature of Kubernetes Pods and service endpoints.
In this chapter, we’ll look at Pod-to-Pod networking and, in particular, how hundreds or thousands of containers on a given machine can have unique, cluster-routable IP addresses. Kubernetes delivers this functionality in a modular and extensible way by using a Container Network Interface (CNI) standard, which can be implemented by a broad range of technologies to give each Pod a unique routable IP address.
Let’s revisit our Pods from earlier and look back at their core networking requirement. As part of exploring this concept, we previously discussed the way rules for iptables, nftables, IPVS (IP Virtual Server), and other network proxy implementations are managed by the kube-proxy. We also looked at various KUBE-SEP rules that tell the Linux kernel to “masquerade” traffic so that the traffic leaving a container is postmarked as coming from a node, or to NAT traffic sent to a service IP. This traffic is then forwarded to a running Pod, which often might be on a different node in our cluster.
The kube-proxy is great for routing services to backend Pods and is usually the first piece of software-defined networking that users interact with. For example, when you first run and expose a simple Kubernetes application with a node port, you are accessing a Pod through a routing rule that’s created by the kube-proxy running on your Kubernetes nodes. The kube-proxy, however, is not particularly useful unless there is a robust Pod network on your cluster. This is because, ultimately, its only job is to map a service IP address into a Pod’s IP address. If that Pod’s IP address is not routable between two nodes, then the kube-proxy’s routing decisions do not result in an application that is usable to the end user. That is to say, a load balancer is only as reliable as its slowest endpoint.
The container networking puzzle can be defined as follows: given hundreds of Pods, some of which correspond to the same service, how can we consistently route traffic into and out of a cluster so that all traffic always goes to the right place, even if our Pods are moving? This is the obvious Day 2 operations problem facing anyone who has tried to run a non-Kubernetes container solution in production (for example, Docker). To solve this, Kubernetes gives us two fundamental networking tools:
The service proxy—Ensures that Pods can be load balanced behind services with stable IPs and routes Kubernetes Service objects
The CNI—Ensures that Pods can be reborn continually inside a network that is flat and easy to access from inside the cluster
At the heart of this solution is the Kubernetes Service object with the type ClusterIP. A ClusterIP service is a Kubernetes Service that is routable inside your Kubernetes cluster, but it is not accessible outside your cluster. It is a fundamental primitive on top of which other services can be built. It’s also a simple way for applications inside your cluster to access one another without needing to directly route to a Pod IP address (remember, Pod IPs can change if they move or die).
As an example, if we create the same service three times in a kind cluster, we will see that it has three random IP addresses in the 10.96 IP space. To verify this, we can recreate the same three services by running kubectl create service clusterip my-service-1 --tcp="100:100" three times in a row (changing the name of my-service-1, of course). Afterward, we could list the service IPs like so:
$ kubectl get svc -o wide
svc-1   ClusterIP   10.96.7.53      80/TCP   48s   app=MyApp
svc-2   ClusterIP   10.96.152.223   80/TCP   33s   app=MyApp
svc-3   ClusterIP   10.96.43.92     80/TCP   5s    app=MyApp
Similarly for Pods, we have a single network and subnet. We can see that new IP addresses are easily provisioned when making new Pods. Because our kind cluster already has two CoreDNS Pods running, we can check their IP addresses to confirm this:
$ kubectl get pods -A -o wide | grep coredns
kube-system   coredns-74ff55c5b-nlxrs   1/1   Running   0   4d16h   192.168.71.1   calico-control-plane   <none>   <none>
kube-system   coredns-74ff55c5b-t4p6s   1/1   Running   0   4d16h   192.168.71.3   calico-control-plane   <none>   <none>
We just saw the first important lesson of the Kubernetes SDN: Pod and Service IP addresses are managed for us and are in different IP subnets. This is (generally) a constant in almost any cluster we’ll encounter in the real world. In fact, if we do encounter a cluster where this is not the case, there is a chance that some other behavior of Kubernetes has been severely compromised. This behavior may include the ability of the kube-proxy to route traffic or the ability of the node to route Pod traffic.
There are three primary types of Kubernetes Service API objects that we can create (as you likely know by now): ClusterIPs, NodePorts, and LoadBalancers. These Services define which backend Pods we’ll connect to by the use of labels. For example, in the previous cluster, we have ClusterIP services in our 10.96 subnet, and those Services route traffic to Pods in our 192.168 subnet. How does traffic destined for a service IP get routed into another subnet? It gets routed by the kube-proxy (or, more formally, the Kubernetes network or service proxy).
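To make the label-based mapping concrete, here is a minimal Python sketch of how a service selector could be resolved to backend Pod IPs. It is a toy model, not how the kube-proxy is implemented: the Pod and Service classes, the web-0 Pod, and the 10.96.0.0/16 service range are assumptions made for illustration, while the CoreDNS addresses come from the earlier output.

```python
import ipaddress
import random
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    ip: str
    labels: dict


@dataclass
class Service:
    name: str
    cluster_ip: str
    selector: dict


def endpoints_for(service, pods):
    """Return the Pods whose labels contain every key/value in the selector."""
    return [
        pod for pod in pods
        if all(pod.labels.get(k) == v for k, v in service.selector.items())
    ]


def pick_backend(service, pods, rng=random):
    """Pick one matching Pod at random, loosely like a load-balancing rule."""
    backends = endpoints_for(service, pods)
    if not backends:
        raise LookupError(f"no endpoints for {service.name}")
    return rng.choice(backends)


pods = [
    Pod("coredns-nlxrs", "192.168.71.1", {"k8s-app": "kube-dns"}),
    Pod("coredns-t4p6s", "192.168.71.3", {"k8s-app": "kube-dns"}),
    Pod("web-0", "192.168.71.9", {"app": "MyApp"}),  # hypothetical Pod
]
kube_dns = Service("kube-dns", "10.96.0.10", {"k8s-app": "kube-dns"})

service_net = ipaddress.ip_network("10.96.0.0/16")  # assumed service range
pod_net = ipaddress.ip_network("192.168.0.0/16")    # podSubnet used in this chapter

backend = pick_backend(kube_dns, pods)
print(kube_dns.cluster_ip, "->", backend.ip)
print("subnets overlap:", service_net.overlaps(pod_net))
```

The service IP and the chosen Pod IP live in different subnets, which is exactly the gap the service proxy bridges.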
In the previous example, we ran kubectl create service clusterip my-service-1 --tcp="100:100" three times and got three services of type ClusterIP. If we were to make these services of type NodePort, then they would also be reachable through the IP address of any node in our cluster. If we were to make these Services of type LoadBalancer, then our cloud (if we were in a cloud) would provide an external IP, such as 35.1.2.3. This would be accessible on the wider internet or on a network outside our Pod, node, or service IP range, depending on the cloud provider.
Figure 5.1 shows the flow of traffic from a LoadBalancer into a Kubernetes cluster. It depicts how
The kube-proxy uses a low-level routing technology like iptables or IPVS to send traffic from services into and out of Pods.
We get an IP address from the outside world when we have a service of type LoadBalancer. This then routes into our internal service IP.
The kube-proxy needs to be able to handle ongoing TCP traffic going to and from Pods that are backed by services. An IP packet has certain fundamental properties including the source and destination IP addresses. In a sophisticated network, these may get changed as a packet moves through a series of routers, and we consider a Kubernetes node (due to the kube-proxy) to be one such router. In general, the manipulation of a packet’s source or destination address is known as NAT (network address translation) and is a fundamental aspect of almost any network architecture solution at some level. SNAT and DNAT refer to the translation of source and destination IP addresses, respectively.
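The difference between DNAT and SNAT is easy to see in a small Python model. This is only a sketch: the Packet class is a made-up stand-in for a real IP packet, and the destination Pod address 192.168.9.130 is a hypothetical Pod on another node; the other addresses appear elsewhere in this chapter.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dport: int


def dnat(pkt, new_dst, new_dport):
    """Rewrite the destination, as when a service IP becomes a Pod IP."""
    return replace(pkt, dst=new_dst, dport=new_dport)


def snat(pkt, new_src):
    """Rewrite the source, as when traffic is masqueraded behind a node IP."""
    return replace(pkt, src=new_src)


# A Pod sends a DNS query to the kube-dns service IP.
original = Packet(src="192.168.71.1", dst="10.96.0.10", dport=53)

# DNAT: the service IP is swapped for a backend Pod IP (hypothetical address).
to_pod = dnat(original, "192.168.9.130", 53)

# SNAT: the source is swapped for the node IP, so replies return via the node.
masqueraded = snat(to_pod, "172.18.0.3")

for pkt in (original, to_pod, masqueraded):
    print(pkt)
```

Each function returns a new packet rather than mutating the old one, which makes it easy to compare the packet before and after each translation step.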
The data plane of the kube-proxy can accomplish this task in a variety of ways, and this is specified to the kube-proxy by its mode configuration at startup. If we dig into the details, we find that the kube-proxy itself is organized into two separate code paths: server_windows.go and server_others.go (both located here: http://mng.bz/EWxl). The server_windows.go code is compiled into a kube-proxy.exe binary and makes native calls to underlying Windows system APIs (such as the netsh command for the userspace proxy and the hcsshim and HCN [http://mng.bz/N6x2] containerization APIs for the Windows kernel proxy).
The more common case is that we run the kube-proxy on Linux. In this case, a different binary program (which is called kube-proxy) runs. This program doesn’t compile the Windows functionality into its code path. In the Linux scenario, we usually run the iptables proxy. In your kind clusters, the kube-proxy just runs in the default iptables mode. You can confirm this by looking at the configuration of the kube-proxy by running kubectl edit cm kube-proxy -n kube-system and looking at its mode field:
ipvs uses the kernel load balancer to write routing rules for services (Linux).
iptables uses the kernel firewall to write routing rules for services (Linux).
The userspace creates a process using a Golang go func worker that manually proxies traffic to a Pod (Linux).
The Windows kernel relies on the hcsshim and HCN APIs for load balancing, which is incompatible with OVS-related CNI implementations but works with other CNIs like Calico (similar to the Linux userspace option).
The Windows userspace also uses netsh for certain aspects of routing. This is useful for people who, for some reason, can’t use the regular Windows kernel’s APIs. Note that if you install an OVS extension on Windows, you may need to use the userspace proxy because the kernel’s HCN APIs do not work in the same way.
Note Throughout this book, we will mention the notion of informers, controllers, and Operators and how their behavior is not always uniformly implemented with respect to the configuration changes that occur. Although the network proxy is implemented with a Kubernetes controller, it doesn’t dynamically respond to configuration changes. Thus, if you want to play with your kind cluster to modify the way that service load balancing is done, you’ll need to edit the ConfigMap for the network proxy and then restart its DaemonSet. (If you want, you can do this by killing a Pod in your DaemonSet and then viewing the Pod’s logs as it is reborn. You should see the new kube-proxy mode.)
The kube-proxy is, however, just one way to define how the Kubernetes SDN routes traffic. To be comprehensive, we can think of Kubernetes routing in three separate layers:
External load balancers or ingress/gateway routers—Forward traffic into a Kubernetes cluster.
The kube-proxy—Manages forwarding between services and Pods. As you may know by now, the term proxy is a bit of a misnomer because, typically, the kube-proxy just maintains static routing rules that are implemented by a kernel or other data plane technology, such as iptables rules.
CNI providers—Route traffic to and from Pods regardless of whether we are accessing them through a service endpoint or directly (Pod-to-Pod networking).
Ultimately, a CNI provider (like the kube-proxy) also configures some kind of rule engine (such as a routing table) or an OVS switch to ensure that traffic between nodes or from the outside world can route to Pods. If you’re wondering why the technology for the kube-proxy is different from that of CNIs, you’re not alone! Many CNI providers are endeavoring to implement a full-blown kube-proxy themselves so that the kube-proxy from Kubernetes is no longer required.
We’ve demonstrated the ClusterIP services in the first part of this chapter, but we haven’t yet looked at NodePort services. Let’s do that now by getting our feet wet and creating a new Kubernetes service. This will ultimately demonstrate how easy it is to add and modify load-balancing rules. For this example, let’s make a NodePort service that points to a CoreDNS container running inside a Pod in our cluster. We can quickly cobble one together by looking at the contents of kubectl get svc -o yaml kube-dns -n kube-system. We can then change the type of service from ClusterIP to NodePort like so:
# save the following file to my-nodeport.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9153"
    prometheus.io/scrape: "true"
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: CoreDNS
  name: kube-dns-2          # ❶
  namespace: kube-system
spec:
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  - name: metrics
    port: 9153
    protocol: TCP
    targetPort: 9153
  selector:
    k8s-app: kube-dns
  sessionAffinity: None
  type: NodePort            # ❷
status:
  loadBalancer: {}
❶ Names the service kube-dns-2 to differentiate it from the already existing kube-dns service
❷ Changes the type of this service to a NodePort
Now, if we run kubectl create -f my-nodeport.yaml, we’ll see that random ports were allocated for us. These now forward traffic to CoreDNS for us:
$ kubectl get svc -o wide -A
kube-system   kube-dns     ClusterIP   10.96.0.10   53/UDP,53/TCP,9153/TCP   k8s-app=kube-dns
kube-system   kube-dns-2   NodePort    10.96.80.7   53:30357/UDP,53:30357/TCP,9153:31588/TCP   2m33s   k8s-app=kube-dns ❶
❶ Maps the random port 30357 to port 53 and the random port 31588 to port 9153
The random ports 30357 and 31588, mapped to ports 53 and 9153 of our DNS service Pods, are open on all the nodes of our cluster. That’s because all nodes are running the kube-proxy. These random ports were not allocated earlier when we created the ClusterIP services.
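The PORT(S) column packs the service port, node port, and protocol into one string, which can be hard to read at a glance. As a small illustration, the following Python sketch splits that column into its parts; the parse_ports helper is a made-up name for this example and is not part of kubectl.

```python
def parse_ports(column):
    """Split a PORT(S) value such as '53:30357/UDP' into port mappings."""
    mappings = []
    for entry in column.split(","):
        ports, protocol = entry.strip().split("/")
        if ":" in ports:
            # NodePort form: <service port>:<node port>
            port, node_port = (int(p) for p in ports.split(":"))
        else:
            # ClusterIP form: only the service port is listed
            port, node_port = int(ports), None
        mappings.append(
            {"port": port, "node_port": node_port, "protocol": protocol}
        )
    return mappings


node_port_svc = parse_ports("53:30357/UDP,53:30357/TCP,9153:31588/TCP")
cluster_ip_svc = parse_ports("53/UDP,53/TCP,9153/TCP")

for mapping in node_port_svc:
    print(mapping)
```

Running the same helper on the kube-dns ClusterIP row yields no node ports at all, which matches the observation that node ports are allocated only for NodePort services.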
If you are feeling brave, we’ll leave it as an exercise for you to run iptables-save on your kind Docker nodes and fish out the handiwork that the kube-proxy has done to write rules for your newly created service IP addresses. (If you are interested in exercising NodePorts, you’ll enjoy our later chapter about how to install and test Kubernetes applications locally. There, we’ll create several services for testing the famous Guestbook application in Kubernetes.)
Now that you’ve got a little bit of a refresher on how services ultimately plumb routing rules between internal Pod ports and the external world, let’s look at CNI providers. These provide the next layer below the service proxy in the overall Kubernetes SDN networking stack. Ultimately, all our service is really doing is routing traffic from 10.96.80.7 to the Pods that are living inside our cluster. How do these Pods get attached to a valid IP address, and how do they receive this traffic? The answer is . . . the CNI.
CNI providers implement the CNI specification (http://mng.bz/RENK), which defines a contract that allows container runtimes to request a working IP address for a process on startup. They also add other fancy features outside this specification (like implementing network policies or third-party network monitoring integrations). For example, VMware users will find that they can use Antrea as a CNI provider for free and plug it into things like VMware’s NSX platform for real-time container monitoring and logging features that some of the current open source CNI providers include. Although CNI providers, theoretically, only need to route Pod traffic, many provide extra bells and whistles. A quick rundown of the major on-premises CNIs includes
Calico—A BGP-based CNI provider that makes new Border Gateway Protocol (BGP) routing rules to implement the data plane. Calico additionally supports XDP, IP-in-IP, and VXLAN routing options (for example, on Windows, it’s not uncommon to run Calico in VXLAN mode). As an advanced CNI, it has the ability to replace the kube-proxy, using technology similar to Cilium’s.
Antrea—An OVS data plane CNI provider that uses a bridge to route all Pod traffic. It is similar to Calico in that it has many advanced routing and network proxy replacement options (AntreaProxy).
Flannel—A bridge-based IP CNI provider that is no longer commonly used. It was one of the original major CNIs for production Kubernetes clusters.
Google, EC2, and NCP—These cloud-based CNIs use proprietary software to make cloud-aware traffic routing decisions. For example, they are capable of creating rules that route traffic directly between containers without needing to travel over node network paths.
Cilium—An XDP-based CNI provider that uses modern Linux APIs to route traffic without requiring any kernel traffic management. This allows for faster and more secure IP communication between containers in some cases. Cilium uses its advanced data path tooling to provide a network proxy alternative as well.
KindNet—A simple CNI plugin that is used in kind clusters by default, but it’s only designed to work in simple clusters with only one subnet.
There are many other CNIs that might be specific to other vendors or open source technologies, as well as proprietary CNI providers for various cloud environments such as VMware, Azure, EKS, and so on. These proprietary CNIs only run inside a given vendor’s infrastructure and, thus, are less portable but often more performant or better integrated with cloud features. Some CNIs, like Calico and Antrea, provide both vendor-specific and vendor-neutral functionality (such as Tigera or NSX specific integrations, for example).
Figure 5.2 shows how CNI networking works in the Calico and Antrea plugins. Both of these plugins accomplish the same end state using a series of routing rules and open source technologies. The CNI interface defines a few core functional aspects of any networking solution for containers, and all CNI plugins (BGP and OVS, for example) implement that functionality in different ways. As figure 5.2 shows, different CNIs use different underlying technology stacks.
Both Calico and Antrea have similar architectures: a DaemonSet and a coordination container. To set these up, a CNI installation includes four steps (fully automated for you by your CNI provider, usually, so that this can be done in a snappy one-liner in simple Linux clusters):
Install the kube-proxy because it’s likely that your CNI provider’s coordination controller will need the ability to query the Kubernetes API server. This is usually done for you ahead of time by any Kubernetes installer.
Install a binary CNI program on the node (usually in a directory such as /opt/cni/bin) that can be called by the container runtime to create a Pod with a CNI-provided IP address.
Deploy a DaemonSet to your cluster, where one container sets up networking primitives for its resident node. This DaemonSet does the previous install step for its host on startup.
Deploy a coordination container to your cluster that either aggregates or proxies metadata from Kubernetes; for example, aggregating NetworkPolicy information in a single place so that it can easily be consumed and deduplicated by the DaemonSet Pods.
There’s no one mandated architecture for a CNI plugin, but the overall DaemonSet plus controller pattern is pretty robust. It is generally a good pattern to follow in Kubernetes for any agent-oriented process that is designed to be integrated with the Kubernetes API.
Note CNI providers give IP addresses to Pods, but a lot of the assumptions around how this process works were originally made in a way that is biased to the Linux OS. Thus, we’ll look at the Calico and Antrea CNI providers, but when doing this, you should note that the behavior of these CNIs varies across other OSs. For example, in Windows, both Calico and Antrea are not typically run as Pods but rather as Windows services using tools such as nssm. Currently, some of the more battle-hardened, open source CNIs for Kubernetes that support both Linux and Windows are Calico and Antrea, but there are many others as well.
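The CNI specification is implemented by the binary program installed by our agent. In particular, it implements three fundamental CNI operations: ADD, DELETE (spelled DEL in the specification), and CHECK, which are called when containerd starts a new Pod or deletes one. Respectively, these operations add a container to the network, remove it from the network, and check that its networking is still set up as expected.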
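To give a feel for the contract, here is a toy Python model of the three operations. It is only a sketch under loud assumptions: a real CNI plugin is an executable that receives CNI_COMMAND and CNI_CONTAINERID as environment variables, reads its network configuration from stdin, and wires up a network namespace. The ToyCniPlugin class, its in-memory lease table, and the pod-a and pod-b IDs are invented for illustration.

```python
import ipaddress


class ToyCniPlugin:
    """In-memory stand-in for a CNI binary; it configures nothing on the host."""

    def __init__(self, subnet):
        self._network = ipaddress.ip_network(subnet)
        self._free = iter(self._network.hosts())  # released IPs are not reused here
        self._leases = {}

    def handle(self, env):
        command = env["CNI_COMMAND"]
        container = env["CNI_CONTAINERID"]
        if command == "ADD":
            # Hand out the next free address (repeat ADDs return the same one).
            if container not in self._leases:
                self._leases[container] = next(self._free)
            address = f"{self._leases[container]}/{self._network.prefixlen}"
            return {"ips": [{"address": address}]}
        if command == "DEL":
            # Deleting an unknown container is not treated as an error.
            self._leases.pop(container, None)
            return {}
        if command == "CHECK":
            if container not in self._leases:
                raise RuntimeError(f"{container} has no address")
            return {}
        raise ValueError(f"unsupported CNI_COMMAND: {command}")


plugin = ToyCniPlugin("192.168.71.0/26")
first = plugin.handle({"CNI_COMMAND": "ADD", "CNI_CONTAINERID": "pod-a"})
second = plugin.handle({"CNI_COMMAND": "ADD", "CNI_CONTAINERID": "pod-b"})
plugin.handle({"CNI_COMMAND": "CHECK", "CNI_CONTAINERID": "pod-a"})
plugin.handle({"CNI_COMMAND": "DEL", "CNI_CONTAINERID": "pod-a"})
print(first, second)
```

After the final DEL, a CHECK for pod-a fails, which mirrors how a runtime can detect that a container's networking is no longer in place.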
Finally, we get to do some hacking! Let’s start by installing a Calico CNI provider in our kind cluster. Calico uses Layer 3 routing (as opposed to bridging, which is a Layer 2 technology) to broadcast routes for Pods in a cluster. End users won’t generally notice this difference, but it’s an important distinction to administrators because some administrators might want to use Layer 3 concepts (like BGP peering) or Layer 2 concepts (like OVS-based traffic monitoring) for broader infrastructure design goals in their clusters:
BGP stands for Border Gateway Protocol, which is a Layer 3 routing technology used commonly in the overall internet.
OVS stands for Open vSwitch, which is a Linux kernel-based API for programming a switch inside your OS to create virtual IP addresses.
The first step in making our kind cluster is to disable its default CNI. Then we’ll recreate it from a YAML specification. For example:
$ cat << EOF > kind-Calico-conf.yaml
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha4
networking:
  disableDefaultCNI: true       # ❶
  podSubnet: 192.168.0.0/16     # ❷
nodes:                          # ❸
- role: control-plane
- role: worker
EOF
$ kind create cluster --name=calico --config=./kind-Calico-conf.yaml
❶ Disables the default kind-net CNI so that we can install a real CNI provider
❷ Divides the 192.168 subnet so that it’s orthogonal to our service subnet
❸ Adds a second node to our cluster
The kind-net CNI is a minimal CNI that only works for a one node cluster. We disable it so we can use a real CNI provider. All our Pods will be on a large swath of the 192.168 subnet. Calico divides this up for each node, and it should be orthogonal to our service subnet. Additionally, having a second node in our cluster helps us to understand how Calico separates local traffic from traffic destined for another node.
Setting a kind cluster up with a real CNI plugin is not significantly different from what we’ve already done. Once this cluster comes up, it’s worth pausing for a moment to see what happens when a Pod’s CNI isn’t yet available. This leads to unschedulable Pods that aren’t defined in the kubelet/manifests directory. You’ll see this by running the following kubectl commands:
$ kubectl get pods --all-namespaces
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   coredns-66bff467f8-86mgh   0/1     Pending   0          7m47s
kube-system   coredns-66bff467f8-nfzhz   0/1     Pending   0          7m47s
$ kubectl get nodes
NAME                   STATUS     ROLES    AGE    VERSION
calico-control-plane   NotReady   master   2m4s   v1.18.2
calico-worker          NotReady   <none>   85s    v1.18.2
At this point, our CoreDNS Pods will not be able to start because the Kubernetes scheduler sees that all nodes are NotReady, as the previous commands show. If that’s not the case, check that your CNI provider is up and running. This state is determined based on the fact that the CNI provider hasn’t been set up yet. CNIs are configured once a CNI container writes a configuration file into /etc/cni/net.d on a kubelet’s local filesystem. In order to get our cluster going, we’ll now install Calico:
The previous step creates two container types: a calico-node Pod on each node and a calico-kube-controllers Pod, which runs on an arbitrary node. Once these containers come up, your nodes should be in the Ready state, and you’ll also see that the CoreDNS Pods are now running:
$ kubectl get pods --all-namespaces
NAMESPACE            NAME
kube-system          calico-kube-cntrlrs-57-m5 ❶
kube-system          calico-node-4mbc5 ❷
kube-system          calico-node-gpvxm
kube-system          coredns-66bff467f8-98t8j
kube-system          coredns-66bff467f8-m7lj5
kube-system          etcd-calico-control-plane
kube-system          kube-apiserver-calico-control-plane
kube-system          kube-controller-mgr
kube-system          kube-proxy-8q5zq
kube-system          kube-proxy-zgrjf
kube-system          kube-scheduler-calico-control-plane
local-path-storage   local-path-provisioner-b5-fsr
❶ Coordinates the Calico node containers
❷ Sets up various BGP and IP routes on a per node basis
In this code example, the kube controller container coordinates the Calico node containers. Each Calico node container sets up various BGP and IP routes on a per node basis for all containers running on a given node. There are two because we have two nodes.
Both Calico and Antrea mount what are known as hostPath volume types. The CNI binary for the Calico-node process then accesses this hostPath, which connects to /etc/cni/net.d/ on your kubelet. The kubelet uses this binary to call the CNI API when an IP address is needed for a new Pod, and thus, it can be thought of as the installation mechanism for the host’s CNI provider. Remember that hostPath volume types are (most of the time) an anti-pattern, unless you are configuring a low-level OS functionality such as CNI.
In figure 5.2, we looked at the DaemonSet functionality as an interface that both Calico and Antrea implement. Let’s take a look at what Calico created by running kubectl get ds -n kube-system. We’ll see that there is a Calico DaemonSet for running a CNI Pod on all nodes. When we run Antrea later, we’ll see a similar DaemonSet for the Antrea agent.
Because Linux CNI plugins usually shovel a CNI binary file into the host’s system path, we can think of CNI plugins as implementing a MountCniBinary method. This might not be a part of the formal CNI interface, but it will ultimately be a part of almost any CNI plugin you’ll see in the wild.
Great! We now have a CNI. Let’s take a look at what has been created for us by Calico by running docker exec to get into our nodes and poke around. After running docker exec -t -i <your kind node> /bin/bash, we can start looking at what routes have been created by Calico. For example:
root@calico-control-plane:/# ip route
default via 172.18.0.1 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3
192.168.9.128/26 via 172.18.0.2 dev tunl0 proto bird onlink ❶
blackhole 192.168.71.0/26 proto bird ❷
192.168.71.1 dev cali38312ba5f3c scope link
192.168.71.2 dev califcbd6ecdce5 scope link
❶ Traffic destined to another node is identified based on its subnet.
❷ Traffic not matched by this node but on the 71 subnet will be thrown away.
We can see that there are two IP addresses here: 192.168.71.1 and 192.168.71.2. These IP addresses are associated with two devices prefixed with the string cali that our calico-node containers created. How do these devices work? We can see how they’re defined by running the ip a command:
root@calico-control-plane:/# ip a | grep califc
5: califcbd6ecdce5@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
Now we can see that the node has an interface created for Calico-related Pods with a recognizable name. For example:
root@calico-control-plane:/# apt-get update -y; apt-get install tcpdump ❶
root@calico-control-plane:/# tcpdump -s 0 -i cali38312ba5f3c -v | grep 192 ❷
tcpdump: listening on cali38312ba5f3c, link-type EN10MB (Ethernet), capture size 262144 bytes
    10.96.0.1.443 > 192.168.71.1.59186: Flags [P.], cksum 0x14d2 (incorrect -> 0x7189), seq 520038628:520039301, ack 2015131286, win 502, options [nop,nop,TS val 1110809235 ecr 1170831911], length 673
    192.168.71.1.59186 > 10.96.0.1.443: Flags [.], cksum 0x1231 (incorrect -> 0x9f10), ack 673, win 502, options [nop,nop,TS val 1170833141 ecr 1110809235], length 0
    10.96.0.1.443 > 192.168.71.1.59186: Flags [P.], cksum 0x149c (incorrect -> 0xa745), seq 673:1292, ack 1, win 502, options [nop,nop,TS val 1110809914 ecr 1170833141], length 619
    192.168.71.1.59186 > 10.96.0.1.443: Flags [.], cksum 0x1231 (incorrect -> 0x9757), ack 1292, win 502, options [nop,nop,TS val 1170833820 ecr 1110809914], length 0
    192.168.71.1.59186 > 10.96.0.1.443: Flags [P.], cksum 0x1254 (incorrect -> 0x362c), seq 1:36, ack 1292, win 502, options [nop,nop,TS val 1170833820 ecr 1110809914], length 35
    10.96.0.1.443 > 192.168.71.1.59186: Flags [.], cksum 0x1231 (incorrect -> 0x9734), ack 36, win 502, options [nop,nop,TS val 1110809914 ecr 1170833820], length 0
❶ Installs tcpdump in the container
❷ Runs tcpdump against the Calico device
In our code example, we can see incoming traffic to the 192.168.71.1 IP address from the 10.96 subnet. This subnet is actually the subnet of our Kubernetes service for the CoreDNS container, which is the point where our DNS containers powered by our CNI are contacted from. The previous cali3831... device is something that is directly attached (like any other device) via an Ethernet cable (of sorts) to our node. This is known as a veth pair, wherein our containers themselves have one end of a virtual Ethernet cable (named cali3831) directly plugged into them from our kubelet. This means anyone attempting to reach this device from our kubelet can easily do so.
Now, let’s go back and look at the IP route table we showed earlier. The dev entries are now clear. These correspond to routes that plug into our containers directly. But what about the blackhole and 192.168.9.128/26 routes? These routes correspond to
Containers that belong to another node (the 192.168.9.128/26 route)
Containers that belong to no node at all (the blackhole route)
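One way to see why these routes behave as described is to replay the kernel's longest-prefix-match rule against the route table shown earlier. The Python sketch below is a simplified model: the ROUTES list is transcribed by hand from the ip route output, and the lookup function is a made-up helper, not a real routing API.

```python
import ipaddress

# Destination prefix and action, transcribed from the earlier `ip route` output.
ROUTES = [
    ("0.0.0.0/0", "via 172.18.0.1 dev eth0"),
    ("172.18.0.0/16", "dev eth0"),
    ("192.168.9.128/26", "via 172.18.0.2 dev tunl0"),
    ("192.168.71.0/26", "blackhole"),
    ("192.168.71.1/32", "dev cali38312ba5f3c"),
    ("192.168.71.2/32", "dev califcbd6ecdce5"),
]


def lookup(destination):
    """Return the action of the most specific route containing the address."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), action)
        for prefix, action in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    # The route with the longest prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


for ip in ("192.168.71.1", "192.168.71.5", "192.168.9.130", "8.8.8.8"):
    print(ip, "->", lookup(ip))
```

A Pod IP that exists locally matches its /32 route and goes straight to its cali device, while an unused address in the same /26 block falls through to the blackhole route and is dropped.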
This is BGP in action. Every node in our cluster that runs a calico-node daemon has a range of IPs that are routed to it. As new nodes come up, these routes are added to our IP route table over time. If you run kubectl scale deployment coredns -n kube-system --replicas=6, you’ll find that all IP addresses come up in one of two different subnets:
Some Pods come up in the 192.168.9 subnet. These correspond to one of our nodes.
Other Pods come up in the 192.168.71 subnet. These correspond to the other node.
The more nodes you have in your cluster, the more subnets you’ll have. Each node has its own IP range, and your CNI provider uses that IP range to allocate the IP addresses of the Pods on a given node to avoid collisions of Pod IP addresses across nodes. This also is a performance optimization because there is no need for global coordination of Pod IP address space. Thus, we can see that Calico is managing IP address ranges for us by carving up IP pools for individual nodes and then coordinating these pools with the route tables in the kernel.
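The arithmetic behind this carving is simple enough to sketch in Python. This is an illustration only: handing out blocks in sequential order is an assumption made for brevity, and it is not how Calico chooses blocks (the blocks we observed were 192.168.71.0/26 and 192.168.9.128/26).

```python
import ipaddress

pod_cidr = ipaddress.ip_network("192.168.0.0/16")  # podSubnet from our kind config
block_prefix = 26                                   # block size seen in the routes

# How many per-node blocks fit in the Pod CIDR, and how many IPs each holds.
block_count = 2 ** (block_prefix - pod_cidr.prefixlen)
block_size = 2 ** (32 - block_prefix)
print(block_count, "blocks of", block_size, "addresses")

# Assumption: hand out blocks sequentially, one per node.
nodes = ["calico-control-plane", "calico-worker"]
assignments = dict(zip(nodes, pod_cidr.subnets(new_prefix=block_prefix)))
for node, block in assignments.items():
    print(node, "->", block)

# The blocks observed earlier are valid, non-overlapping pieces of the Pod CIDR.
observed = [
    ipaddress.ip_network("192.168.71.0/26"),
    ipaddress.ip_network("192.168.9.128/26"),
]
print(all(block.subnet_of(pod_cidr) for block in observed))
print(observed[0].overlaps(observed[1]))
```

Because no two blocks overlap, each node can assign Pod IPs from its own block without consulting any other node.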
To the casual user, Antrea and Calico appear to do the same thing: route traffic between containers on a multi-node cluster. However, there’s a lot of subtlety in how this is accomplished when you peek under the covers.
OVS is what Antrea uses to power its CNI capabilities. Unlike BGP, it doesn’t use an IP address as the mechanism for routing directly from node to node as we saw with Calico. But, rather, it creates a bridge that runs locally on our Kubernetes node. This bridge is created using OVS. OVS is, literally, a software-defined switch (like the ones you buy at any computer store). OVS is then the interface between our Pods and the rest of the world when running Antrea.
The pros and cons between bridged (also known as Layer 2) and IP (also known as Layer 3) routing are beyond the scope of this book and are hotly debated among academics and software companies alike. In our case, we’ll just say that these are different technologies that both work quite well and can scale to handle thousands of Pods quite readily.
Let’s try making our kind cluster again, this time using Antrea as our CNI provider. First, delete your last cluster with kind delete cluster --name=calico, and then we’ll recreate it with the code snippet that follows:
$ cat << EOF > kind-Antrea-conf.yaml
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: 192.168.0.0/16
nodes:
- role: control-plane
- role: worker
EOF
$ kind create cluster --name=antrea --config=./kind-Antrea-conf.yaml
Once your cluster comes up, run
kubectl apply -f https://github.com/vmware-tanzu/antrea/releases/download/v0.8.0/antrea.yml -n kube-system
Then, run docker exec again and take a look at the IP situation on your nodes. This time, we see that there are a few different interfaces created for us. Note that we omit the tun0 interface that you’ll see in both CNIs. This is the network interface where encapsulated traffic between nodes flows.
Interestingly, when we run ip route, we don’t see a new route for every Pod we have running. This is because OVS uses a bridge, and thus, the Ethernet cables still exist, but they are all plugged directly into our locally running OVS instance. Running the following command, we can see subnet logic in Antrea that is similar to what we saw earlier in Calico:
root@antrea-control-plane:/# ip route
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3
192.168.0.0/24 dev antrea-gw0 proto kernel scope link src 192.168.0.1 ❶
192.168.1.0/24 via 192.168.1.1 dev antrea-gw0 onlink ❷
❶ Defines the route for traffic destined for our local subnet (the one with the 0.0 suffix)
❷ The Antrea gateway manages traffic destined for another subnet (the one with the 1.0 suffix).
Now, to confirm this, let’s run the ip a command. This will show us all the different IP addresses that our machine understands:
$ docker exec -t -i ba133 /bin/bash
root@antrea-control-plane:/# ip a
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 2e:24:a8:d8:a3:50 brd ff:ff:ff:ff:ff:ff
4: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether 76:82:e1:8b:d4:86 brd ff:ff:ff:ff:ff:ff
5: antrea-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 02:09:36:d3:cf:a4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 brd 192.168.0.255 scope global antrea-gw0
       valid_lft forever preferred_lft forever
One of the interesting things to note when we run the ip a command is that we can see several unfamiliar devices floating around. These include ovs-system, genev_sys_6081, and antrea-gw0, all of which appear in the preceding output.
Antrea, unlike Calico, actually routes traffic to a gateway IP address, which is on the Pod subnet that uses the podCIDR of our cluster. Thus, the algorithm for how Antrea sets up Pod IP addresses for a given node is something like this:
Allocates the first IP address in the subnet to an OVS switch for the given node
Allocates all new Pods to the remaining free IP addresses in the subnet
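The two steps above can be sketched in a few lines of Python. This is only a toy model of the idea, not Antrea's actual IPAM code: the NodeIpam class, its method names, and the Pod names are hypothetical, and we assume each node owns a /24 such as 192.168.0.0/24.

```python
import bisect
import ipaddress


class NodeIpam:
    """Toy per-node allocator (hypothetical): gateway first, then Pods."""

    def __init__(self, node_subnet: str):
        self.subnet = ipaddress.ip_network(node_subnet)
        hosts = iter(self.subnet.hosts())   # usable addresses, lowest first
        self.gateway = next(hosts)          # step 1: first IP goes to the OVS gateway
        self._free = list(hosts)            # step 2: the rest are available to Pods
        self.pods = {}

    def allocate(self, pod_name: str):
        """Hand the lowest free address in the node subnet to a new Pod."""
        if not self._free:
            raise RuntimeError(f"no free addresses left in {self.subnet}")
        ip = self._free.pop(0)
        self.pods[pod_name] = ip
        return ip

    def release(self, pod_name: str) -> None:
        """Return a deleted Pod's address to the free pool, keeping it sorted."""
        bisect.insort(self._free, self.pods.pop(pod_name))


if __name__ == "__main__":
    ipam = NodeIpam("192.168.0.0/24")
    print("gateway:", ipam.gateway)
    print("pod-a:", ipam.allocate("pod-a"))
    print("pod-b:", ipam.allocate("pod-b"))
```

With the 192.168.0.0/24 subnet from our cluster, the gateway lands on 192.168.0.1, which matches the address we saw on the antrea-gw0 device, and Pods are handed addresses from 192.168.0.2 onward.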
The routing table for such a cluster would follow a pattern where we order nodes so that they come up chronologically. Note that each node receives traffic on an x.y.z.1 IP address (the first address in its allocated subnet). The way that subnets are calculated per node relies on both your implementation of Kubernetes and how your CNI provider logic works. In some CNIs, there might not be a distinct subnet for every node, but in general, this is an intuitive way for a CNI to manage IP addresses over time, so it’s pretty common.
Keep in mind that both Calico and Antrea create a distinct subnet for each node’s Pod network, and from that subnet, Pods are provisioned IP addresses. If you ever need to debug a network path in a CNI, knowing which Pods are running on which nodes might help you to reason about which machines you should reboot, ssh into, or delete entirely, depending on your DevOps practices.
The following snippet shows us the antrea-gw0 device. This is the gateway IP address for all the Pods on your cluster:
192.168.0.0/24 dev antrea-gw0 proto kernel scope link src 192.168.0.1 ❶
192.168.1.0/24 via 192.168.1.1 dev antrea-gw0 onlink ❷
192.168.2.0/24 via 192.168.2.1 dev antrea-gw0 onlink ❸
❶ Traffic for all local Pods goes directly to the local antrea-gw0 device.
❷ Forwards traffic destined for Pods on the second node in your cluster to that node’s OVS instance
❸ Forwards traffic destined for Pods on the third node in your cluster to that node’s OVS instance
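To make the pattern in this routing table concrete, here is a minimal Python sketch that derives the same three routes from the cluster's podSubnet. It is an illustration under stated assumptions, not how Antrea computes routes: the node_routes function and its parameters are hypothetical, and we assume that nodes receive consecutive /24 subnets in the order they joined the cluster.

```python
import ipaddress


def node_routes(pod_cidr, node_count, local_index,
                gateway_dev="antrea-gw0", node_prefix=24):
    """Build `ip route`-style lines for one node's view of the Pod network.

    Assumption (hypothetical): node N owns the Nth /24 of pod_cidr, and the
    first address of each node subnet is that node's gateway.
    """
    cluster = ipaddress.ip_network(pod_cidr)
    subnets = cluster.subnets(new_prefix=node_prefix)
    routes = []
    for index, subnet in zip(range(node_count), subnets):
        gateway = subnet.network_address + 1   # the x.y.z.1 address
        if index == local_index:
            # Local Pods are reachable directly on the gateway device.
            routes.append(
                f"{subnet} dev {gateway_dev} proto kernel scope link src {gateway}"
            )
        else:
            # Remote Pods are reached via the other node's gateway address.
            routes.append(f"{subnet} via {gateway} dev {gateway_dev} onlink")
    return routes


if __name__ == "__main__":
    for line in node_routes("192.168.0.0/16", node_count=3, local_index=0):
        print(line)
```

Called with the 192.168.0.0/16 podSubnet from our kind configuration, three nodes, and the first node as the local one, the sketch reproduces the three route lines shown in the snippet above.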
Thus, we can see that in the bridged model for networking, there are a few differences in the sorts of devices and routes that are created:
The only routes that our kernel manages are for the Antrea gateway (antrea-gw0) itself.
All of the Pods’ traffic goes directly to the antrea-gw0 device. There is no global routing to other devices as is done with BGP, which is used by our Calico CNI.
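One way to check this claim is to replay the kernel's route selection by hand. The following Python sketch applies longest-prefix matching to the three routes from the ip route output earlier in this section. The ROUTES table and the lookup function are hypothetical names for illustration, and the sketch ignores everything the real kernel also considers (metrics, policy routing, and so on).

```python
import ipaddress

# (destination CIDR, device, next hop) copied from the `ip route` output above.
ROUTES = [
    ("172.18.0.0/16", "eth0", None),
    ("192.168.0.0/24", "antrea-gw0", None),
    ("192.168.1.0/24", "antrea-gw0", "192.168.1.1"),
]


def lookup(destination, routes=ROUTES):
    """Return (device, next_hop) for the most specific matching route."""
    dest = ipaddress.ip_address(destination)
    best = None
    for cidr, dev, via in routes:
        net = ipaddress.ip_network(cidr)
        # Keep the matching route with the longest prefix.
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev, via)
    if best is None:
        return None                     # no route in this toy table
    return best[1], best[2]


if __name__ == "__main__":
    for ip in ("192.168.0.7", "192.168.1.9", "172.18.0.2"):
        print(ip, "->", lookup(ip))
```

Any address in the Pod subnets, local or remote, resolves to the antrea-gw0 device. Only the next hop differs, and OVS takes over from there.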
It’s worth noting here that the trick of using DaemonSets to manage host networking for Pods is a Linux-specific approach. In other OSs (Windows Kubernetes nodes, for instance), when running containerd, you actually need to install your CNI provider using a service manager, and the CNI provider runs as a host process. Although this may change in the future (again using Windows as an example, there is work underway to enable privileged containers for Windows Kubernetes nodes), it’s instructive to note that the Linux networking stack is ideally suited for the Kubernetes networking model. This is largely due to the architecture of cgroups, namespaces, and the concept of the Linux root user, which can run as a highly privileged process even while running in a container.
Although the complexity of Kubernetes networking may seem daunting at first because of the rapid evolution of service meshes, CNI, and network/server proxies, as long as you can understand the basic process of routing between Pods, the principles remain constant across many CNI implementations.
Kubernetes networking architecture has a lot of parallels with generic SDN concepts.
Antrea and Calico are both CNI providers that overlay a cluster network on a real network for Pods.
Basic Linux commands (like ip a) can be used to reason about how your Pods are networked.
CNI providers typically manage Pod networks through DaemonSets that run a privileged Linux container on each node.
Border Gateway Protocol (BGP) and Open vSwitch (OVS) are both CNI provider core technologies that solve the same fundamental problems of broadcasting and sharing overlay routing information for Pods.
Other OSs like Windows currently don’t have all of the same native conveniences for Pod networking that Linux does.