10

Exploring Kubernetes Networking

In this chapter, we will examine the important topic of networking. Kubernetes as an orchestration platform manages containers/pods running on different machines (physical or virtual) and requires an explicit networking model. We will look at the following topics:

  • Understanding the Kubernetes networking model
  • Kubernetes network plugins
  • Kubernetes and eBPF
  • Kubernetes networking solutions
  • Using network policies effectively
  • Load balancing options

By the end of this chapter, you will understand the Kubernetes approach to networking and be familiar with the solution space for aspects such as standard interfaces, networking implementations, and load balancing. You will even be able to write your very own Container Networking Interface (CNI) plugin if you wish.

Understanding the Kubernetes networking model

The Kubernetes networking model is based on a flat address space. All pods in a cluster can directly see each other. Each pod has its own IP address. There is no need to configure any Network Address Translation (NAT). In addition, containers in the same pod share their pod’s IP address and can communicate with each other through localhost. This model is pretty opinionated, but once set up, it simplifies life considerably both for developers and administrators. It makes it particularly easy to migrate traditional network applications to Kubernetes. A pod represents a traditional node and each container represents a traditional process.

We will cover the following:

  • Intra-pod communication
  • Pod-to-service communication
  • External access
  • Lookup and discovery
  • DNS in Kubernetes

Intra-pod communication (container to container)

A running pod is always scheduled on one (physical or virtual) node. That means that all the containers run on the same node and can talk to each other in various ways, such as via the local filesystem, any IPC mechanism, or using localhost and well-known ports. There is no danger of port collision between different pods because each pod has its own IP address, and when a container in the pod uses localhost, it applies to the pod’s IP address only. So if container 1 in pod 1 connects to port 1234, on which container 2 in pod 1 listens, it will not conflict with another container in pod 2 running on the same node that also listens on port 1234. The only caveat is that if you expose ports to the host, you should be careful about pod-to-node affinity. This can be handled using several mechanisms, such as DaemonSets and pod anti-affinity.
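
To make this concrete, here is a minimal sketch of a pod whose two containers talk over localhost (the images and the port are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: localhost-demo
spec:
  containers:
  - name: web
    # serves HTTP on port 80 inside the pod's shared network namespace
    image: nginx:1.25
    ports:
    - containerPort: 80
  - name: sidecar
    # polls the web container via localhost every 5 seconds
    image: busybox:1.36
    command: ["sh", "-c", "while true; do wget -qO- http://localhost:80 > /dev/null; sleep 5; done"]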

Inter-pod communication (pod to pod)

Pods in Kubernetes are allocated a network-visible IP address (not private to the node). Pods can communicate directly without the aid of NAT, tunnels, proxies, or any other obfuscating layer. Well-known port numbers can be used for a configuration-free communication scheme. The pod’s internal IP address is the same as its external IP address that other pods see (within the cluster network; not exposed to the outside world). That means that standard naming and discovery mechanisms such as a Domain Name System (DNS) work out of the box.

Pod-to-service communication

Pods can talk to each other directly using their IP addresses and well-known ports, but that requires the pods to know each other’s IP addresses. In a Kubernetes cluster, pods can be destroyed and created constantly. There may also be multiple replicas of the same pod spec, each with its own IP address. The Kubernetes service resource provides a layer of indirection that is very useful because the service is stable even if the set of actual pods that responds to requests is ever-changing. In addition, you get automatic, highly available load balancing because the kube-proxy on each node takes care of redirecting traffic to the correct pod:

Figure 10.1: Internal load balancing using a service

External access

Eventually, some containers need to be accessible from the outside world. The pod IP addresses are not visible externally. The service is the right vehicle, but external access typically requires two redirects. For example, cloud provider load balancers are not Kubernetes-aware, so they can’t direct traffic for a particular service straight to a node that runs a pod able to process the request. Instead, the public load balancer just directs traffic to any node in the cluster, and the kube-proxy on that node will redirect it again to an appropriate pod if the current node doesn’t run the necessary pod.

The following diagram shows how the external load balancer just sends traffic to an arbitrary node, where the kube-proxy takes care of further routing if needed:

Figure 10.2: External load balancer sending traffic to an arbitrary node and the kube-proxy

Lookup and discovery

In order for pods and containers to communicate with each other, they need to find each other. There are several ways for containers to locate other containers or announce themselves, which we will discuss in the following subsections. Each approach has its own pros and cons.

Self-registration

We’ve mentioned self-registration several times. Let’s understand what it means exactly. When a container runs, it knows its pod’s IP address. Every container that wants to be accessible to other containers in the cluster can connect to some registration service and register its IP address and port. Other containers can query the registration service for the IP addresses and ports of all registered containers and connect to them. When a container is destroyed (gracefully), it will unregister itself. If a container dies ungracefully, then some mechanism needs to be established to detect that. For example, the registration service can periodically ping all registered containers, or the containers can be required periodically to send a keep-alive message to the registration service.

The benefit of self-registration is that once the generic registration service is in place (no need to customize it for different purposes), there is no need to worry about keeping track of containers. Another huge benefit is that containers can employ sophisticated policies and decide to unregister temporarily if they are unavailable based on local conditions; for example, if a container is busy and doesn’t want to receive any more requests at the moment. This sort of smart and decentralized dynamic load balancing can be very difficult to achieve globally without a registration service. The downside is that the registration service is yet another non-standard component that containers need to know about in order to locate other containers.

Services and endpoints

Kubernetes services can be considered standard registration services. Pods that belong to a service are registered automatically based on their labels. Other pods can look up the Endpoints object to find all the service’s pods, or, most of the time, simply send their messages to the service itself, which forwards them to one of the backing pods. Dynamic membership can be achieved using a combination of the replica count of deployments, health checks, readiness checks, and horizontal pod autoscaling.
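
For example, here is a minimal sketch of a service (the names and ports are illustrative) and how to inspect the endpoints that Kubernetes registers for it:

apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend   # every ready pod with this label is registered automatically
  ports:
  - port: 80
    targetPort: 8080

$ kubectl get endpoints backend

The output lists the IP:port pairs of all the ready pods that currently back the service.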

Loosely coupled connectivity with queues

What if containers can talk to each other without knowing their IP addresses and ports or even service IP addresses or network names? What if most of the communication can be asynchronous and decoupled? In many cases, systems can be composed of loosely coupled components that are not only unaware of the identities of other components but are also unaware that other components even exist. Queues facilitate such loosely coupled systems. Components (containers) listen to messages from the queue, respond to messages, perform their jobs, and post messages to the queue, such as progress messages, completion status, and errors. Queues have many benefits:

  • Easy to add processing capacity without coordination just by adding more containers that listen to the queue
  • Easy to keep track of the overall load based on the queue depth
  • Easy to have multiple versions of components running side by side by versioning messages and/or queue topics
  • Easy to implement load balancing as well as redundancy by having multiple consumers process requests in different modes
  • Easy to add or remove other types of listeners dynamically

The downsides of queues are the following:

  • You need to make sure that the queue provides appropriate durability and high availability so it doesn’t become a critical single point of failure (SPOF)
  • Containers need to work with the async queue API (could be abstracted away)
  • Implementing a request-response requires somewhat cumbersome listening on response queues

Overall, queues are an excellent mechanism for large-scale systems and they can be utilized in large Kubernetes clusters to ease coordination.

Loosely coupled connectivity with data stores

Another loosely coupled method is to use a data store (for example, Redis) to store messages and then other containers can read them. While possible, this is not the design objective of data stores, and the result is often cumbersome, fragile, and doesn’t have the best performance. Data stores are optimized for data storage and access and not for communication. That being said, data stores can be used in conjunction with queues, where a component stores some data in a data store and then sends a message to the queue saying that the data is ready for processing. Multiple components listen to the message and all start processing the data in parallel.

Kubernetes ingress

Kubernetes offers an ingress resource and controller that is designed to expose Kubernetes services to the outside world. You can do it yourself, of course, but many tasks involved in defining an ingress are common across most applications for a particular type of ingress, such as a web application, CDN, or DDoS protector. You can also write your own ingress objects.

The ingress object is often used for smart load balancing and TLS termination. Instead of configuring and deploying your own Nginx server, you can benefit from the built-in ingress controller. If you need a refresher, check out Chapter 5, Using Kubernetes Resources in Practice, where we discussed the ingress resource with examples.

DNS in Kubernetes

A DNS is a cornerstone technology in networking. Hosts that are reachable on IP networks have IP addresses. DNS is a hierarchical and decentralized naming system that provides a layer of indirection on top of IP addresses. This is important for several use cases, such as:

  • Load balancing
  • Dynamically replacing hosts with different IP addresses
  • Providing human-friendly names to well-known access points

DNS is a vast topic and a full discussion is outside the scope of this book. Just to give you a sense, there are tens of different RFC standards that cover DNS: https://en.wikipedia.org/wiki/Domain_Name_System#Standards.

In Kubernetes, the main addressable resources are pods and services. Each pod and service has a unique internal (private) IP address within the cluster. The kubelet configures the pods with a resolv.conf file that points them to the internal DNS server. Here is what it looks like:

$ k run -it bash --image g1g1/py-kube:0.3 -- bash
If you don't see a command prompt, try pressing enter.
root@bash:/#
root@bash:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

The nameserver IP address 10.96.0.10 is the address of the kube-dns service:

$ k get svc -n kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   19m

A pod’s hostname is, by default, just its metadata name. If you want pods to have a fully qualified domain name inside the cluster, you can create a headless service and also set the hostname explicitly, as well as a subdomain to the service name. Here is how to set up DNS for two pods called py-kube1 and py-kube2 with hostnames of trouble and trouble2, as well as a subdomain called maker, which matches the headless service:

apiVersion: v1
kind: Service
metadata:
  name: maker
spec:
  selector:
    app: py-kube
  clusterIP: None # headless service
---
apiVersion: v1
kind: Pod
metadata:
  name: py-kube1
  labels:
    app: py-kube
spec:
  hostname: trouble
  subdomain: maker
  containers:
  - image: g1g1/py-kube:0.3
    command:
      - sleep
      - "9999"
    name: trouble
---
apiVersion: v1
kind: Pod
metadata:
  name: py-kube2
  labels:
    app: py-kube
spec:
  hostname: trouble2
  subdomain: maker
  containers:
    - image: g1g1/py-kube:0.3
      command:
        - sleep
        - "9999"
      name: trouble

Let’s create the pods and service:

$ k apply -f pod-with-dns.yaml
service/maker created
pod/py-kube1 created
pod/py-kube2 created

Now, we can check the hostnames and the DNS resolution inside the pod. First, we will connect to py-kube2 and verify that its hostname is trouble2 and the fully qualified domain name (FQDN) is trouble2.maker.default.svc.cluster.local.

Then, we can resolve the FQDN of both trouble and trouble2:

$ k exec -it py-kube2 -- bash
root@trouble2:/# hostname
trouble2
root@trouble2:/# hostname --fqdn
trouble2.maker.default.svc.cluster.local
root@trouble2:/# dig +short trouble.maker.default.svc.cluster.local
10.244.0.10
root@trouble2:/# dig +short trouble2.maker.default.svc.cluster.local
10.244.0.9

To close the loop, let’s confirm that the IP addresses 10.244.0.10 and 10.244.0.9 actually belong to the py-kube1 and py-kube2 pods:

$ k get po -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
py-kube1   1/1     Running   0          10m   10.244.0.10   kind-control-plane   <none>           <none>
py-kube2   1/1     Running   0          18m   10.244.0.9    kind-control-plane   <none>           <none>

There are additional configuration options and DNS policies you can apply. See https://kubernetes.io/docs/concepts/services-networking/dns-pod-service.
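
For example, here is a hedged sketch of a pod that overrides the cluster DNS settings with the dnsPolicy and dnsConfig fields (the values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: custom-dns
spec:
  # "None" tells Kubernetes to ignore the cluster DNS settings
  # and use only what is specified in dnsConfig
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
    - 10.96.0.10
    searches:
    - default.svc.cluster.local
    options:
    - name: ndots
      value: "2"
  containers:
  - name: main
    image: g1g1/py-kube:0.3
    command:
    - sleep
    - "9999"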

CoreDNS

Earlier, we mentioned that the kubelet uses a resolv.conf file to configure pods by pointing them to the internal DNS server, but where is this internal DNS server hiding? You can find it in the kube-system namespace. The service is called kube-dns:

$ k describe svc -n kube-system kube-dns
Name:              kube-dns
Namespace:         kube-system
Labels:            k8s-app=kube-dns
                   kubernetes.io/cluster-service=true
                   kubernetes.io/name=CoreDNS
Annotations:       prometheus.io/port: 9153
                   prometheus.io/scrape: true
Selector:          k8s-app=kube-dns
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.96.0.10
IPs:               10.96.0.10
Port:              dns  53/UDP
TargetPort:        53/UDP
Endpoints:         10.244.0.2:53,10.244.0.3:53
Port:              dns-tcp  53/TCP
TargetPort:        53/TCP
Endpoints:         10.244.0.2:53,10.244.0.3:53
Port:              metrics  9153/TCP
TargetPort:        9153/TCP
Endpoints:         10.244.0.2:9153,10.244.0.3:9153
Session Affinity:  None
Events:            <none>

Note the selector k8s-app=kube-dns. Let’s find the pods that back this service:

$ k get po -n kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-64897985d-n4x5b   1/1     Running   0          97m
coredns-64897985d-nqtwk   1/1     Running   0          97m

The service is called kube-dns, but the pods have a prefix of coredns. Interesting. Let’s check the image the deployment uses:

$ k get deploy coredns -n kube-system -o jsonpath='{.spec.template.spec.containers[0]}' | jq .image
"k8s.gcr.io/coredns/coredns:v1.8.6"

The reason for this mismatch is that, initially, the default Kubernetes DNS server was called kube-dns. Then, CoreDNS replaced it as the mainstream DNS server due to its simplified architecture and better performance.
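
CoreDNS is configured through a Corefile, which on most clusters is stored in a ConfigMap next to the deployment. You can inspect it like this (the exact plugin list will vary by cluster):

$ k get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}'

A typical Corefile enables plugins such as kubernetes (which answers queries for cluster.local names), forward (which passes everything else to the upstream resolvers in /etc/resolv.conf), cache, and prometheus, which serves the 9153 metrics port we saw above.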

We have covered a lot of information about the Kubernetes networking model and its components. In the next section, we will cover the Kubernetes network plugins that implement this model with standard interfaces such as CNI and Kubenet.

Kubernetes network plugins

Kubernetes has a network plugin system since networking is so diverse and different people would like to implement it in different ways. Kubernetes is flexible enough to support any scenario. The primary network plugin is CNI, which we will discuss in depth. But Kubernetes also comes with a simpler network plugin called Kubenet. Before we go over the details, let’s get on the same page with the basics of Linux networking (just the tip of the iceberg). This is important because Kubernetes networking is built on top of standard Linux networking and you need this foundation to understand how Kubernetes networking works.

Basic Linux networking

Linux, by default, has a single shared network namespace. The physical network interfaces are all accessible in this namespace, but it can be divided into multiple logical namespaces, which is very relevant to container networking.

IP addresses and ports

Network entities are identified by their IP address. Servers can listen to incoming connections on multiple ports. Clients can connect (TCP) or send/receive data (UDP) to servers within their network.

Network namespaces

Network namespaces group a set of network devices such that they can reach other devices in the same namespace, but not devices in other namespaces, even if they are physically on the same network. Linking networks or network segments can be done via bridges, switches, gateways, and routing.

Subnets, netmasks, and CIDRs

A granular division of network segments is very useful when designing and maintaining networks. Dividing networks into smaller subnets with a common prefix is a common practice. These subnets are defined by bitmasks that represent the size of the subnet (how many hosts it can contain). For example, a netmask of 255.255.255.0 means that the first 3 octets are used for routing and only 256 (actually 254) individual hosts are available. The Classless Inter-Domain Routing (CIDR) notation is often used for this purpose because it is more concise, encodes more information, and also allows combining hosts from multiple legacy classes (A, B, C, D, E). For example, 172.27.15.0/24 means that the first 24 bits (3 octets) are used for routing.
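
For example, here is how a few common prefix lengths translate into netmasks and usable host counts (the /24 entry matches the example above):

10.0.0.0/8     -> netmask 255.0.0.0      -> 2^24 = 16,777,216 addresses (16,777,214 usable hosts)
10.244.0.0/16  -> netmask 255.255.0.0    -> 2^16 = 65,536 addresses (65,534 usable hosts)
172.27.15.0/24 -> netmask 255.255.255.0  -> 2^8  = 256 addresses (254 usable hosts)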

Virtual Ethernet devices

Virtual Ethernet (veth) devices are virtual network interfaces that are always created in pairs and act like a virtual cable between two endpoints. A common pattern is to move one end of the pair into a network namespace (and connect the other end to a bridge or a physical device), so that devices from other namespaces can’t reach it directly, even if, physically, they are on the same local network.

Bridges

Bridges connect multiple network segments to an aggregate network, so all the nodes can communicate with each other. Bridging is done at layer 2 (the data link) of the OSI network model.
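
Here is an illustrative set of standard iproute2 commands (device names and addresses are made up) that create a namespace, connect it to the host with a veth pair, and attach the host end to a bridge:

# create a namespace and a veth pair
ip netns add demo
ip link add veth-host type veth peer name veth-demo

# move one end into the namespace and give it an address
ip link set veth-demo netns demo
ip netns exec demo ip addr add 10.0.0.2/24 dev veth-demo
ip netns exec demo ip link set veth-demo up

# create a bridge and attach the host end of the pair to it
ip link add name br0 type bridge
ip link set br0 up
ip link set veth-host master br0
ip link set veth-host up

This is essentially what bridge-style CNI plugins do for every pod, as we will see later in this chapter.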

Routing

Routing connects separate networks, typically based on routing tables that instruct network devices how to forward packets to their destinations. Routing is done through various network devices, such as routers, gateways, switches, and firewalls, including regular Linux boxes.

Maximum transmission unit

The maximum transmission unit (MTU) determines how big packets can be. On Ethernet networks, for example, the MTU is 1,500 bytes. The bigger the MTU, the better the ratio between payload and headers, which is a good thing. The downside is that minimum latency increases because you have to wait for the entire packet to arrive and, furthermore, in case of failure, you have to retransmit the entire big packet.

Pod networking

Here is a diagram that describes the relationship between pod, host, and the global internet at the networking level via veth0:

Figure 10.3: Pod networking

Kubenet

Back to Kubernetes. Kubenet is a network plugin. It’s very rudimentary: it establishes a Linux bridge named cbr0 and creates a veth pair for each pod. The veth pair connects each pod to its host node using an IP address from the node’s pod IP address range. Kubenet is commonly used on cloud providers that set up routing rules for communication between nodes, or in single-node environments.

Requirements

The Kubenet plugin has the following requirements:

  • The node must be assigned a subnet to allocate IP addresses to its pods
  • The standard CNI bridge, lo, and host-local plugins must be installed at version 0.2.0 or higher
  • The kubelet must be executed with the --network-plugin=kubenet flag
  • The kubelet must be executed with the --non-masquerade-cidr=<clusterCidr> flag
  • The kubelet must be run with --pod-cidr or the kube-controller-manager must be run with --allocate-node-cidrs=true --cluster-cidr=<cidr>

Setting the MTU

The MTU is critical for network performance. Kubernetes network plugins such as Kubenet make their best efforts to deduce the optimal MTU, but sometimes they need help. If an existing network interface (for example, the docker0 bridge) sets a small MTU, then Kubenet will reuse it. Another example is IPsec, which requires lowering the MTU due to the extra overhead from IPsec encapsulation, but the Kubenet network plugin doesn’t take it into consideration. The solution is to avoid relying on the automatic calculation of the MTU and just tell the kubelet what MTU should be used for network plugins via the --network-plugin-mtu command-line switch that is provided to all network plugins. However, at the moment, only the Kubenet network plugin accounts for this command-line switch.
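
As a sketch, a kubelet configured for Kubenet with an explicit MTU might be started with flags along these lines (the values are illustrative, and these flags only apply to clusters that still use the legacy Kubenet plugin):

kubelet --network-plugin=kubenet \
        --network-plugin-mtu=1460 \
        --non-masquerade-cidr=10.0.0.0/8 \
        --pod-cidr=10.244.1.0/24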

The Kubenet network plugin is mostly around for backward compatibility reasons. The CNI is the primary network interface that all modern network solution providers implement to integrate with Kubernetes. Let’s see what it’s all about.

The CNI

The CNI is a specification, as well as a set of libraries, for writing network plugins to configure network interfaces in Linux containers. The specification evolved from the rkt network proposal. CNI is now an established industry standard, used well beyond Kubernetes. Some of the organizations that use CNI are:

  • Kubernetes
  • OpenShift
  • Mesos
  • Kurma
  • Cloud Foundry
  • Nuage
  • IBM
  • AWS EKS and ECS
  • Lyft

The CNI team maintains some core plugins, but there are a lot of third-party plugins too that contribute to the success of CNI. Here is a non-exhaustive list:

  • Project Calico: A layer 3 virtual network for Kubernetes
  • Weave: A virtual network to connect multiple Docker containers across multiple hosts
  • Contiv networking: Policy-based networking
  • Cilium: eBPF for containers
  • Flannel: Layer 3 network fabric for Kubernetes
  • Infoblox: Enterprise-grade IP address management
  • Silk: A CNI plugin for Cloud Foundry
  • OVN-kubernetes: A CNI plugin based on OVS and Open Virtual Networking (OVN)
  • DANM: Nokia’s solution for Telco workloads on Kubernetes

CNI plugins provide a standard networking interface for arbitrary networking solutions.

The container runtime

CNI defines a plugin spec for networking application containers, but the plugin must be plugged into a container runtime that provides some services. In the context of CNI, an application container is a network-addressable entity (has its own IP address). For Docker, each container has its own IP address. For Kubernetes, each pod has its own IP address and the pod is considered the CNI container, and the containers within the pod are invisible to CNI.

The container runtime’s job is to configure a network and then execute one or more CNI plugins, passing them the network configuration in JSON format.

The following diagram shows a container runtime using the CNI plugin interface to communicate with multiple CNI plugins:

Figure 10.4: Container runtime with CNI

The CNI plugin

The CNI plugin’s job is to add a network interface into the container network namespace and bridge the container to the host via a veth pair. It should then assign an IP address via an IP address management (IPAM) plugin and set up routes.

The container runtime (any CRI-compliant runtime) invokes the CNI plugin as an executable. The plugin needs to support the following operations:

  • Add a container to the network
  • Remove a container from the network
  • Report version

The plugin uses a simple command-line interface, standard input/output, and environment variables. The network configuration in JSON format is passed to the plugin through standard input. The other arguments are defined as environment variables:

  • CNI_COMMAND: Specifies the desired operation, such as ADD, DEL, or VERSION.
  • CNI_CONTAINERID: Represents the ID of the container.
  • CNI_NETNS: Points to the path of the network namespace file.
  • CNI_IFNAME: Specifies the name of the interface to be set up. The CNI plugin should use this name or return an error.
  • CNI_ARGS: Contains additional arguments passed in by the user during invocation. It consists of alphanumeric key-value pairs separated by semicolons, such as FOO=BAR;ABC=123.
  • CNI_PATH: Indicates a list of paths to search for CNI plugin executables. The paths are separated by an OS-specific list separator, such as ":" on Linux and ";" on Windows.

If the command succeeds, the plugin returns a zero exit code and the generated interfaces (in the case of the ADD command) are streamed to standard output as JSON. This low-tech interface is smart in the sense that it doesn’t require any specific programming language or component technology or binary API. CNI plugin writers can use their favorite programming language too.
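
To get a feel for the interface, here is a sketch of invoking a CNI plugin by hand, the same way a container runtime would (all paths and names are illustrative and assume the standard plugins are installed under /opt/cni/bin):

$ sudo CNI_COMMAND=ADD \
       CNI_CONTAINERID=example-container \
       CNI_NETNS=/var/run/netns/example-ns \
       CNI_IFNAME=eth0 \
       CNI_PATH=/opt/cni/bin \
       /opt/cni/bin/bridge < /etc/cni/net.d/10-mynet.conf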

The result of invoking the CNI plugin with the ADD command looks as follows:

{
  "cniVersion": "0.3.0",
  "interfaces": [              (this key omitted by IPAM plugins)
      {
          "name": "<name>",
          "mac": "<MAC address>", (required if L2 addresses are meaningful)
          "sandbox": "<netns path or hypervisor identifier>" (required for container/hypervisor interfaces, empty/omitted for host interfaces)
      }
  ],
  "ip": [
      {
          "version": "<4-or-6>",
          "address": "<ip-and-prefix-in-CIDR>",
          "gateway": "<ip-address-of-the-gateway>",     (optional)
          "interface": <numeric index into 'interfaces' list>
      },
      ...
  ],
  "routes": [                                           (optional)
      {
          "dst": "<ip-and-prefix-in-cidr>",
          "gw": "<ip-of-next-hop>"                      (optional)
      },
      ...
  ],
  "dns": {
    "nameservers": <list-of-nameservers>                (optional)
    "domain": <name-of-local-domain>                    (optional)
    "search": <list-of-additional-search-domains>       (optional)
    "options": <list-of-options>                        (optional)
  }
}

The input network configuration contains a lot of information: cniVersion, name, type, args (optional), ipMasq (optional), ipam, and dns. The ipam and dns parameters are dictionaries with their own specified keys. Here is an example of a network configuration:

{
  "cniVersion": "0.3.0",
  "name": "dbnet",
  "type": "bridge",
  // type (plugin) specific
  "bridge": "cni0",
  "ipam": {
    "type": "host-local",
    // ipam specific
    "subnet": "10.1.0.0/16",
    "gateway": "10.1.0.1"
  },
  "dns": {
    "nameservers": ["10.1.0.1"]
  }
}

Note that additional plugin-specific elements can be added. In this case, the bridge: cni0 element is a custom one that the specific bridge plugin understands.

The CNI spec also supports network configuration lists where multiple CNI plugins can be invoked in order.
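
Here is a sketch of such a configuration list (the names and values are illustrative) that chains the bridge plugin with the portmap plugin, which handles host port mappings:

{
  "cniVersion": "0.4.0",
  "name": "dbnet",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "ipam": {
        "type": "host-local",
        "subnet": "10.1.0.0/16"
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}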

That concludes the conceptual discussion of Kubernetes network plugins, which are built on top of basic Linux networking, allowing multiple network solution providers to integrate smoothly with Kubernetes.

Later in this chapter, we will dig into a full-fledged implementation of a CNI plugin. First, let’s talk about one of the most exciting prospects in the Kubernetes networking world – extended Berkeley Packet Filter (eBPF).

Kubernetes and eBPF

Kubernetes, as you know very well, is a very versatile and flexible platform. The Kubernetes developers, in their wisdom, avoided making many assumptions and decisions that could later paint them into a corner. For example, Kubernetes networking operates at the IP and DNS levels only. There is no concept of a network or subnets. Those are left for networking solutions that integrate with Kubernetes through very narrow and generic interfaces like CNI.

That opens the door to a lot of innovation because Kubernetes doesn’t constrain the choices of implementors.

Enter eBPF. It is a technology that allows running sandboxed programs safely in the Linux kernel without compromising the system’s security or requiring changes to the kernel itself or even loading kernel modules. These programs execute in response to events. This is a big deal for software-defined networking, observability, and security. Brendan Gregg calls it a Linux superpower.

The original BPF technology could be attached only to sockets for packet filtering (hence the name Berkeley Packet Filter). With eBPF, you can attach to additional objects, such as:

  • Kprobes
  • Tracepoints
  • Network schedulers or qdiscs for classification or action
  • XDP

Traditional Kubernetes routing is done by kube-proxy. It is a user-space process that runs on every node. It’s responsible for setting up iptables rules and does UDP, TCP, and SCTP forwarding as well as load balancing (based on Kubernetes services). At a large scale, kube-proxy becomes a liability. The iptables rules are processed sequentially, and the frequent user space to kernel space transitions are unnecessary overhead. It is possible to completely remove kube-proxy and replace it with an eBPF-based approach that performs the same function much more efficiently. We will discuss one of these solutions – Cilium – in the next section.

Here is an overview of eBPF:

Figure 10.5: eBPF overview

For more details, check out https://ebpf.io.

Kubernetes networking solutions

Networking is a vast topic. There are many ways to set up networks and connect devices, pods, and containers. Kubernetes can’t be opinionated about it. The high-level networking model of a flat address space for Pods is all that Kubernetes prescribes. Within that space, many valid solutions are possible, with various capabilities and policies for different environments. In this section, we’ll examine some of the available solutions and understand how they map to the Kubernetes networking model.

Bridging on bare-metal clusters

The most basic environment is a raw bare-metal cluster with just an L2 physical network. You can connect your containers to the physical network with a Linux bridge device. The procedure is quite involved and requires familiarity with low-level Linux network commands such as brctl, ip addr, ip route, ip link, and nsenter. If you plan to implement it, this guide can serve as a good start (search for the With Linux Bridge devices section): http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/.

The Calico project

Calico is a versatile virtual networking and network security solution for containers. Calico can integrate with all the primary container orchestration frameworks and runtimes:

  • Kubernetes (CNI plugin)
  • Mesos (CNI plugin)
  • Docker (libnetwork plugin)
  • OpenStack (Neutron plugin)

Calico can also be deployed on-premises or on public clouds with its full feature set. Calico’s network policy enforcement can be specialized for each workload and makes sure that traffic is controlled precisely and packets always go from their source to vetted destinations. Calico can automatically map network policy concepts from orchestration platforms to its own network policy. The reference implementation of Kubernetes’ network policy is Calico. Calico can be deployed together with Flannel, utilizing Flannel’s networking layer and Calico’s network policy facilities.

Weave Net

Weave Net is all about ease of use and zero configuration. It uses VXLAN encapsulation under the hood and micro DNS on each node. As a developer, you operate at a higher abstraction level. You name your containers and Weave Net lets you connect to them and use standard ports for services. That helps migrate existing applications into containerized applications and microservices. Weave Net has a CNI plugin for interfacing with Kubernetes (and Mesos). On Kubernetes 1.4 and higher, you can integrate Weave Net with Kubernetes by running a single command that deploys a Daemonset:

kubectl apply -f https://github.com/weaveworks/weave/releases/download/v2.8.1/weave-daemonset-k8s.yaml

The Weave Net pods on every node will take care of attaching any new pod you create to the Weave network. Weave Net supports the network policy API as well, providing a complete, yet easy-to-set-up solution.

Cilium

Cilium is a CNCF incubator project that is focused on eBPF-based networking, security, and observability (via its Hubble project).

Let’s take a look at the capabilities Cilium provides.

Efficient IP allocation and routing

Cilium allows a flat Layer 3 network that covers multiple clusters and connects all application containers. Host scope allocators can allocate IP addresses without coordination with other hosts. Cilium supports multiple networking models:

  • Overlay: This model utilizes encapsulation-based virtual networks that span across all hosts. It supports encapsulation formats like VXLAN and Geneve, as well as other formats supported by Linux. Overlay mode works with almost any network infrastructure as long as the hosts have IP connectivity. It provides a flexible and scalable solution.
  • Native routing: In this model, Kubernetes leverages the regular routing table of the Linux host. The network infrastructure must be capable of routing the IP addresses used by the application containers. Native Routing mode is considered more advanced and requires knowledge of the underlying networking infrastructure. It works well with native IPv6 networks, cloud network routers, or when using custom routing daemons.

Identity-based service-to-service communication

Cilium provides a security management feature that assigns a security identity to groups of application containers with the same security policies. This identity is then associated with all network packets generated by those application containers. By doing this, Cilium enables the validation of the identity at the receiving node. The management of security identities is handled through a key-value store, which allows for efficient and secure management of identities within the Cilium networking solution.

Load balancing

Cilium offers distributed load balancing for traffic between application containers and external services as an alternative to kube-proxy. This load balancing functionality is implemented using efficient hashtables in eBPF, providing a scalable approach compared to the traditional iptables method. With Cilium, you can achieve high-performance load balancing while ensuring efficient utilization of network resources.

When it comes to east-west load balancing, Cilium excels in performing efficient service-to-backend translation directly within the Linux kernel’s socket layer. This approach eliminates the need for per-packet NAT operations, resulting in lower overhead and improved performance.

For north-south load balancing, Cilium’s eBPF implementation is highly optimized for maximum performance. It can be seamlessly integrated with XDP (eXpress Data Path) and supports advanced load balancing techniques like Direct Server Return (DSR) and Maglev consistent hashing. This allows load balancing operations to be efficiently offloaded from the source host, further enhancing performance and scalability.
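
As a sketch, Cilium’s kube-proxy replacement is typically enabled at install time through Helm values similar to the following (the exact value names depend on the Cilium version – older releases use kubeProxyReplacement=strict, for example – and the API server address is a placeholder):

helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=<API_SERVER_IP> \
  --set k8sServicePort=6443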

Bandwidth management

Cilium implements bandwidth management through efficient Earliest Departure Time (EDT)-based rate-limiting with eBPF for egress traffic. This significantly reduces transmission tail latencies for applications.

Observability

Cilium offers comprehensive event monitoring with rich metadata. In addition to capturing the source and destination IP addresses of dropped packets, it also provides detailed label information for both the sender and receiver. This metadata enables enhanced visibility and troubleshooting capabilities. Furthermore, Cilium exports metrics through Prometheus, allowing for easy monitoring and analysis of network performance.

To further enhance observability, the Hubble observability platform provides additional features such as service dependency maps, operational monitoring, alerting, and comprehensive visibility into application and security aspects. By leveraging flow logs, Hubble enables administrators to gain valuable insights into the behavior and interactions of services within the network.

Cilium is a large project with a very broad scope. Here, we just scratched the surface. See https://cilium.io for more details.

There are many good networking solutions. Which network solution is the best for you? If you’re running in the cloud, I recommend using the native CNI plugin from your cloud provider. If you’re on your own, Calico is a solid choice, and if you’re adventurous and need to heavily optimize your network, consider Cilium.

In the next section, we will cover network policies that let you get a handle on the traffic in your cluster.

Using network policies effectively

The Kubernetes network policy is about managing network traffic to selected pods and namespaces. In a world of hundreds of microservices deployed and orchestrated, as is often the case with Kubernetes, managing networking and connectivity between pods is essential. It’s important to understand that it is not primarily a security mechanism. If an attacker can reach the internal network, they will probably be able to create their own pods that comply with the network policy in place and communicate freely with other pods. In the previous section, we looked at different Kubernetes networking solutions and focused on the container networking interface. In this section, the focus is on the network policy, although there are strong connections between the networking solution and how the network policy is implemented on top of it.

Understanding the Kubernetes network policy design

A network policy defines the communication rules for pods and other network endpoints within a Kubernetes cluster. It uses labels to select specific pods and applies whitelist rules to control traffic access to the selected pods. These rules complement the isolation policy defined at the namespace level by allowing additional traffic based on the defined criteria. By configuring network policies, administrators can fine-tune and restrict the communication between pods, enhancing security and network segmentation within the cluster.
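
A common starting point is a deny-all ingress policy for a namespace, which you then relax with more specific policies like the sample shown later in this section. Here is a minimal sketch (the namespace name reuses that later example):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: awesome-project
spec:
  # an empty podSelector matches every pod in the namespace
  podSelector: {}
  policyTypes:
  - Ingress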

Network policies and CNI plugins

There is an intricate relationship between network policies and CNI plugins. Some CNI plugins implement both network connectivity and a network policy, while others implement just one aspect, but they can collaborate with another CNI plugin that implements the other aspect (for example, Calico and Flannel).

Configuring network policies

Network policies are configured via the NetworkPolicy resource. You can define ingress and/or egress policies. Here is a sample network policy that specifies both ingress and egress:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: awesome-project
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
    - Ingress
    - Egress
  ingress:  
    - from:
        - namespaceSelector:
            matchLabels:
              project: awesome-project
        - podSelector:
            matchLabels:
              role: frontend
      ports:
       - protocol: TCP
         port: 6379
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24
      ports:
        - protocol: TCP
          port: 7777     

Implementing network policies

While the network policy API itself is generic and is part of the Kubernetes API, the implementation is tightly coupled to the networking solution. That means that on each node, there is a special agent or gatekeeper (Cilium implements it via eBPF in the kernel) that does the following:

  • Intercepts all traffic coming into the node
  • Verifies that it adheres to the network policy
  • Forwards or rejects each request

Kubernetes provides the facilities to define and store network policies through the API. Enforcing the network policy is left to the networking solution or a dedicated network policy solution that is tightly integrated with the specific networking solution.

Calico is a good example of this approach. Calico has its own networking solution and a network policy solution, which work together. In both cases, there is tight integration between the two pieces. The following diagram shows how the Kubernetes policy controller manages the network policies and how agents on the nodes execute them:

Figure 10.6: Kubernetes network policy management

In this section, we covered various networking solutions, as well as network policies, and we briefly discussed load balancing. However, load balancing is a wide subject and the next section will explore it.

Load balancing options

Load balancing is a critical capability in dynamic systems such as a Kubernetes cluster. Nodes, VMs, and pods come and go, but the clients typically can’t keep track of which individual entities can service their requests. Even if they could, it requires a complicated dance of managing a dynamic map of the cluster, refreshing it frequently, and handling disconnected, unresponsive, or just slow nodes. This so-called client-side load balancing is appropriate in special cases only. Server-side load balancing is a battle-tested and well-understood mechanism that adds a layer of indirection that hides the internal turmoil from the clients or consumers outside the cluster. There are options for external as well as internal load balancers. You can also mix and match and use both. The hybrid approach has its own particular pros and cons, such as performance versus flexibility. We will cover the following options:

  • External load balancer
  • Service load balancer
  • Ingress
  • HA Proxy
  • MetalLB
  • Traefik
  • Kubernetes Gateway API

External load balancers

An external load balancer is a load balancer that runs outside the Kubernetes cluster. There must be an external load balancer provider that Kubernetes can interact with to configure the external load balancer with health checks and firewall rules and get the external IP address of the load balancer.

The following diagram shows the connection between the load balancer (in the cloud), the Kubernetes API server, and the cluster nodes. The external load balancer has an up-to-date picture of which pods run on which nodes and it can direct external service traffic to the right pods:

Figure 10.7: The connection between the load balancer, the Kubernetes API server, and the cluster nodes

Configuring an external load balancer

The external load balancer is configured via the service configuration file or directly through kubectl. We use a service type of LoadBalancer instead of the default ClusterIP type, which only exposes the service inside the cluster. This depends on an external load balancer provider being properly installed and configured in the cluster.

Via manifest file

Here is an example service manifest file that accomplishes this goal:

apiVersion: v1
kind: Service
metadata:
  name: api-gateway
spec:
  type: LoadBalancer
  ports:
  - port:  80
    targetPort: 5000
  selector:
    svc: api-gateway
    app: delinkcious

Via kubectl

You may also accomplish the same result using a direct kubectl command:

$ kubectl expose deployment api-gateway --port=80 --target-port=5000 --name=api-gateway --type=LoadBalancer

The decision whether to use a service configuration file or kubectl command is usually determined by the way you set up the rest of your infrastructure and deploy your system. Manifest files are more declarative and more appropriate for production usage where you want a versioned, auditable, and repeatable way to manage your infrastructure. Typically, this will be part of a GitOps-based CI/CD pipeline.

Finding the load balancer IP addresses

The load balancer will have two IP addresses of interest. The internal IP address can be used inside the cluster to access the service. Clients outside the cluster will use the external IP address. It’s a good practice to create a DNS entry for the external IP address. It is particularly important if you want to use TLS/SSL, which requires stable hostnames. To get both addresses, use the kubectl describe service command. The IP field denotes the internal IP address and the LoadBalancer Ingress field denotes the external IP address:

$ kubectl describe services example-service
Name: example-service
Selector: app=example
Type: LoadBalancer
IP: 10.67.252.103
LoadBalancer Ingress: 123.45.678.9
Port: <unnamed> 80/TCP
NodePort: <unnamed> 32445/TCP
Endpoints: 10.64.0.4:80,10.64.1.5:80,10.64.2.4:80
Session Affinity: None
No events.
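
If you just need the external address (to create the DNS entry, for example), you can extract it directly with a JSONPath query; note that some cloud providers populate a hostname field instead of ip:

$ kubectl get service example-service \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
123.45.678.9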

Preserving client IP addresses

Sometimes, the service may be interested in the source IP address of the clients. Up until Kubernetes 1.5, this information wasn’t available. In Kubernetes 1.7, the capability to preserve the original client IP was added to the API.

Specifying original client IP address preservation

You need to configure the following two fields of the service spec:

  • service.spec.externalTrafficPolicy: This field determines whether the service should route external traffic to a node-local endpoint or a cluster-wide endpoint, which is the default. The Cluster option doesn’t reveal the client source IP and might add a hop to a different node, but spreads the load well. The Local option keeps the client source IP and doesn’t add an extra hop as long as the service type is LoadBalancer or NodePort. Its downside is it might not balance the load very well.
  • service.spec.healthCheckNodePort: This field is optional. If used, then the service health check will use this port number. The default is the allocated node port. It has an effect on services of the LoadBalancer type whose externalTrafficPolicy is set to Local.

Here is an example:

apiVersion: v1
kind: Service
metadata:
  name: api-gateway
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
  - port:  80
    targetPort: 5000
  selector:
    svc: api-gateway
    app: delinkcious

Understanding even external load balancing

External load balancers operate at the node level; while they direct traffic to a particular pod, the load distribution is done at the node level. That means that if your service has four pods, and three of them are on node A and the last one is on node B, then an external load balancer is likely to divide the load evenly between node A and node B.

This will have the 3 pods on node A handle half of the load (1/6 each) and the single pod on node B handle the other half of the load on its own. Weights may be added in the future to address this issue. You can avoid the issue of too many pods unevenly distributed between nodes by using pod anti-affinity or topology spread constraints.

Service load balancers

Service load balancing is designed for funneling internal traffic within the Kubernetes cluster and not for external load balancing. This is done by using a service type of ClusterIP. It is possible to expose a service directly via a pre-allocated port by using a service type of NodePort and using it as an external load balancer, but that requires curating all node ports across the cluster to avoid conflicts and might not be appropriate for production. Desirable features such as SSL termination and HTTP caching will not be readily available.
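
For completeness, here is a sketch of exposing the api-gateway service from the earlier example via NodePort (the nodePort value is illustrative and must fall within the cluster’s configured range, 30000-32767 by default):

apiVersion: v1
kind: Service
metadata:
  name: api-gateway
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 5000
    nodePort: 30080
  selector:
    svc: api-gateway
    app: delinkcious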

The following diagram shows how the service load balancer (the yellow cloud) can route traffic to one of the backend pods it manages (via labels of course):

Figure 10.8: The service load balancer routing traffic to a backend pod

Ingress

Ingress in Kubernetes is, at its core, a set of rules that allow inbound HTTP/S traffic to reach cluster services. In addition, some ingress controllers support the following:

  • Connection algorithms
  • Request limits
  • URL rewrites and redirects
  • TCP/UDP load balancing
  • SSL termination
  • Access control and authorization

Ingress is specified using an Ingress resource and is serviced by an ingress controller. The Ingress resource had been in beta since Kubernetes 1.1 and finally became GA in Kubernetes 1.19. Here is an example of an ingress resource that manages traffic into two services. The rules map the externally visible http://foo.bar.com/foo to the s1 service, and http://foo.bar.com/bar to the s2 service:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test
spec:
  ingressClassName: cool-ingress
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /foo
        pathType: Prefix
        backend:
          service:
            name: s1
            port:
              number: 80
      - path: /bar
        pathType: Prefix
        backend:
          service:
            name: s2
            port:
              number: 80

The ingressClassName field specifies an IngressClass resource, which contains additional information about the ingress. If it’s omitted, a default ingress class must be defined.

Here is what it looks like:

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  labels:
    app.kubernetes.io/component: controller
  name: cool-ingress
  annotations:
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  controller: k8s.io/ingress-nginx

Ingress controllers often require annotations to be added to the Ingress resource in order to customize its behavior.
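
For example, the Nginx ingress controller understands annotations like the one below (a hedged sketch; the annotation key is specific to ingress-nginx, and the backend reuses the s1 service from the earlier example):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rewrite-demo
  annotations:
    # strip the /app prefix before forwarding to the backend
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /app
        pathType: Prefix
        backend:
          service:
            name: s1
            port:
              number: 80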

The following diagram demonstrates how Ingress works:

Figure 10.9: Demonstration of ingress

There are two official ingress controllers right now in the main Kubernetes repository. One of them is an L7 ingress controller for GCE only, the other is a more general-purpose Nginx ingress controller that lets you configure the Nginx web server through a ConfigMap. The Nginx ingress controller is very sophisticated and brings a lot of features that are not available yet through the ingress resource directly. It uses the Endpoints API to directly forward traffic to pods. It supports Minikube, GCE, AWS, Azure, and bare-metal clusters. For more details, check out https://github.com/kubernetes/ingress-nginx.

However, there are many more ingress controllers that may be better for your use case, such as:

  • Ambassador
  • Traefik
  • Contour
  • Gloo

For even more ingress controllers, see https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/.

HAProxy

We discussed cloud provider external load balancers using the LoadBalancer service type, and internal service load balancing inside the cluster using ClusterIP. If we want a custom external load balancer, we can create a custom external load balancer provider and use LoadBalancer, or use the third service type, NodePort. HAProxy (High Availability Proxy) is a mature and battle-tested load balancing solution. It is considered one of the best choices for implementing external load balancing with on-premises clusters. This can be done in several ways:

  • Utilize NodePort and carefully manage port allocations
  • Implement a custom load balancer provider interface
  • Run HAProxy inside your cluster as the only target of your frontend servers at the edge of the cluster (load balanced or not)

You can use all these approaches with HAProxy. Regardless, it is still recommended to use ingress objects. The service-loadbalancer project is a community project that implemented a load balancing solution on top of HAProxy. You can find it here: https://github.com/kubernetes/contrib/tree/master/service-loadbalancer. Let’s look into how to use HAProxy in a bit more detail.

Utilizing the NodePort

Each service will be allocated a dedicated port from a predefined range. This is usually a high range, such as 30,000 and above, to avoid clashing with ports used by other applications. HAProxy will run outside the cluster in this case and will be configured with the correct port for each service. Then, it can forward traffic to any node, and Kubernetes, via the internal service load balancing, will route it to a proper pod (double load balancing). This is, of course, sub-optimal because it introduces another hop. The way to circumvent it is to query the Endpoints API, dynamically maintain the list of backend pods for each service, and forward traffic directly to the pods.

A custom load balancer provider using HAProxy

This approach is a little more complicated, but the benefit is that it is better integrated with Kubernetes and can make the transition to/from on-premises and the cloud easier.

Running HAProxy inside the Kubernetes cluster

In this approach, we use the internal HAProxy load balancer inside the cluster. There may be multiple nodes running HAProxy and they will share the same configuration to map incoming requests and load-balance them across the backend servers (the Apache servers in the following diagram):

Figure 10.10: Multiple nodes running HAProxy for incoming requests and to load-balance the backend servers

HAProxy also developed its own ingress controller, which is Kubernetes-aware. This is arguably the most streamlined way to utilize HAProxy in your Kubernetes cluster. Here are some of the capabilities you gain when using the HAProxy ingress controller:

  • Streamlined integration with the HAProxy load balancer
  • SSL termination
  • Rate limiting
  • IP whitelisting
  • Multiple load balancing algorithms: round-robin, least connections, URL hash, and random
  • A dashboard that shows the health of your pods, current request rates, response times, etc.
  • Traffic overload protection

MetalLB

MetalLB also provides a load balancer solution for bare-metal clusters. It is highly configurable and supports multiple modes such as L2 and BGP. I had success configuring it even for minikube. For more details, check out https://metallb.universe.tf.
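
As a sketch, a minimal MetalLB layer 2 setup consists of an address pool and an L2 advertisement (this uses the v1beta1 CRD-based configuration; the address range is illustrative and must belong to your network):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: demo-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.240-192.168.1.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: demo-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - demo-pool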

Traefik

Traefik is a modern HTTP reverse proxy and load balancer. It was designed to support microservices. It works with many backends, including Kubernetes, to manage its configuration automatically and dynamically. This is a game-changer compared to traditional load balancers. It has an impressive list of features:

  • It’s fast
  • Single Go executable
  • Tiny official Docker image: The solution provides a lightweight and official Docker image, ensuring efficient resource utilization.
  • REST API: It offers a RESTful API for easy integration and interaction with the solution.
  • Hot-reloading of configuration: Configuration changes can be applied dynamically without requiring a process restart, ensuring seamless updates.
  • Circuit breakers and retry: The solution includes circuit breakers and retry mechanisms to handle network failures and ensure robust communication.
  • Round-robin and rebalancer load balancers: It supports load balancing algorithms like round-robin and rebalancer to distribute traffic across multiple instances.
  • Metrics support: The solution provides various options for metrics collection, including REST, Prometheus, Datadog, statsd, and InfluxDB.
  • Clean AngularJS web UI: It offers a user-friendly web UI powered by AngularJS for easy configuration and monitoring.
  • Websocket, HTTP/2, and GRPC support: The solution is capable of handling Websocket, HTTP/2, and GRPC protocols, enabling efficient communication.
  • Access logs: It provides access logs in both JSON and Common Log Format (CLF) for monitoring and troubleshooting.
  • Let’s Encrypt support: The solution seamlessly integrates with Let’s Encrypt for automatic HTTPS certificate generation and renewal.
  • High availability with cluster mode: It supports high availability by running in cluster mode, ensuring redundancy and fault tolerance.

Overall, this solution offers a comprehensive set of features for deploying and managing applications in a scalable and reliable manner.

See https://traefik.io/traefik/ to learn more about Traefik.

Kubernetes Gateway API

Kubernetes Gateway API is a set of resources that model service networking in Kubernetes. You can think of it as the evolution of the ingress API. While there are no intentions to remove the ingress API, its limitations couldn’t be addressed by improving it, so the Gateway API project was born.

Where the ingress API consists of a single Ingress resource and an optional IngressClass, Gateway API is more granular and breaks the definition of traffic management and routing into different resources. Gateway API defines the following resources:

  • GatewayClass
  • Gateway
  • HTTPRoute
  • TLSRoute
  • TCPRoute
  • UDPRoute

Gateway API resources

The role of the GatewayClass is to define common configurations and behavior that can be used by multiple similar gateways.
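
For example, a minimal GatewayClass could look like the following sketch, named to match the gatewayClassName used later in this section; the controllerName is a hypothetical identifier, and each Gateway API implementation documents its own:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: GatewayClass
metadata:
  name: cool-gateway-class
spec:
  # Identifies the controller responsible for gateways of this class
  controllerName: example.com/cool-gateway-controller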

The role of the gateway is to define an endpoint and a collection of routes where traffic can enter the cluster and be routed to backend services. Under the hood, the gateway configures an underlying load balancer or proxy.

The role of the routes is to map specific requests that match the route to a specific backend service.

The following diagram demonstrates the resources and organization of Gateway API:

Figure 10.11: Gateway API resources

Attaching routes to gateways

Gateways and routes can be associated in different ways:

  • One-to-one: A gateway may have a single route from a single owner that isn’t associated with any other gateway
  • One-to-many: A gateway may have multiple routes associated with it from multiple owners
  • Many-to-many: A route may be associated with multiple gateways (each may have additional routes)
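
For example, in the many-to-many case a route simply lists several gateways in its parentRefs. The following sketch uses hypothetical names:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: shared-route
spec:
  # The same route is attached to two different gateways
  parentRefs:
  - kind: Gateway
    name: gateway-1
  - kind: Gateway
    name: gateway-2
  rules:
  - backendRefs:
    - name: shared-service
      port: 8080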

Gateway API in action

Let’s see how all the pieces of Gateway API fit together with a simple example. Here is a Gateway resource:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: cool-gateway
  namespace: ns1
spec:
  gatewayClassName: cool-gateway-class
  listeners:
  - name: cool-service
    port: 80
    protocol: HTTP
    allowedRoutes:
      kinds: 
        - kind: HTTPRoute
      namespaces:
        from: Selector
        selector:
          matchLabels:
            # This label is added automatically as of K8s 1.22
            # to all namespaces
            kubernetes.io/metadata.name: ns2

Note that the gateway is defined in namespace ns1, but it allows only HTTP routes that are defined in namespace ns2. Let’s see a route that attaches to this gateway:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: cool-route
  namespace: ns2
spec:
  parentRefs:
  - kind: Gateway
    name: cool-gateway
    namespace: ns1
  rules:
  - backendRefs:
    - name: cool-service
      port: 8080

The route cool-route is defined in namespace ns2 and is an HTTPRoute, so it satisfies the gateway’s allowedRoutes constraints. To close the loop, the route declares a parent reference to the cool-gateway gateway in namespace ns1.
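
Since the backendRef doesn’t specify a namespace, it refers to a Service named cool-service in the route’s own namespace (ns2). Such a Service could look like the following sketch; the selector and targetPort are assumptions for illustration:

apiVersion: v1
kind: Service
metadata:
  name: cool-service
  namespace: ns2
spec:
  selector:
    app: cool-app       # hypothetical label on the backend pods
  ports:
  - port: 8080          # the port referenced by the HTTPRoute backendRef
    targetPort: 8080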

See https://gateway-api.sigs.k8s.io to learn more about Gateway API.

Load balancing on Kubernetes is an exciting area. It offers many options for both north-south and east-west load balancing. Now that we have covered load balancing in detail, let’s dive deep into the CNI plugins and how they are implemented.

Writing your own CNI plugin

In this section, we will look at what it takes to actually write your own CNI plugin. First, we will look at the simplest plugin possible – the loopback plugin. Then, we will examine the plugin skeleton that implements most of the boilerplate associated with writing a CNI plugin.

Finally, we will review the implementation of the bridge plugin. Before we dive in, here is a quick reminder of what a CNI plugin is:

  • A CNI plugin is an executable
  • It is responsible for connecting new containers to the network, assigning unique IP addresses to them, and taking care of routing
  • From the CNI perspective, a container is a network namespace (in Kubernetes, a pod acts as the CNI container because its containers share a network namespace)
  • Network definitions are managed as JSON files, but are streamed to the plugin via standard input (no files are being read by the plugin)
  • Auxiliary information can be provided via environment variables
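
To make the last two points concrete, here is roughly what a network definition looks like as it is streamed to a plugin over standard input. This minimal configuration (the name is arbitrary) targets the loopback plugin we are about to examine:

{
    "cniVersion": "1.0.0",
    "name": "lo",
    "type": "loopback"
}

The runtime also sets environment variables such as CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, and CNI_IFNAME to tell the plugin which operation to perform and on which network namespace.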

First look at the loopback plugin

The loopback plugin simply brings up the loopback interface in the container’s network namespace. It is so simple that it doesn’t require any network configuration information. Most CNI plugins are implemented in Golang and the loopback CNI plugin is no exception. The full source code is available here: https://github.com/containernetworking/plugins/blob/master/plugins/main/loopback.

The containernetworking project on GitHub provides multiple packages with many of the building blocks necessary to implement CNI plugins. The plugin also uses the third-party netlink package for adding and removing interfaces and for setting IP addresses and routes. Let’s look at the imports of the loopback.go file first:

package main
import (
    "encoding/json"
    "errors"
    "fmt"
    "net"
    "github.com/vishvananda/netlink"
    "github.com/containernetworking/cni/pkg/skel"
    "github.com/containernetworking/cni/pkg/types"
    current "github.com/containernetworking/cni/pkg/types/100"
    "github.com/containernetworking/cni/pkg/version"
    "github.com/containernetworking/plugins/pkg/ns"
    bv "github.com/containernetworking/plugins/pkg/utils/buildversion"
)

Then, the plugin implements the cmdAdd, cmdCheck, and cmdDel commands, which are invoked when a container is added to the network, checked, or removed from it. Here is the add command, which does all the heavy lifting:

func cmdAdd(args *skel.CmdArgs) error {
    conf, err := parseNetConf(args.StdinData)
    if err != nil {
        return err
    }
    var v4Addr, v6Addr *net.IPNet
    args.IfName = "lo" // ignore config, this only works for loopback
    err = ns.WithNetNSPath(args.Netns, func(_ ns.NetNS) error {
        link, err := netlink.LinkByName(args.IfName)
        if err != nil {
            return err // not tested
        }
        err = netlink.LinkSetUp(link)
        if err != nil {
            return err // not tested
        }
        v4Addrs, err := netlink.AddrList(link, netlink.FAMILY_V4)
        if err != nil {
            return err // not tested
        }
        if len(v4Addrs) != 0 {
            v4Addr = v4Addrs[0].IPNet
            // sanity check that this is a loopback address
            for _, addr := range v4Addrs {
                if !addr.IP.IsLoopback() {
                    return fmt.Errorf("loopback interface found with non-loopback address %q", addr.IP)
                }
            }
        }
        v6Addrs, err := netlink.AddrList(link, netlink.FAMILY_V6)
        if err != nil {
            return err // not tested
        }
        if len(v6Addrs) != 0 {
            v6Addr = v6Addrs[0].IPNet
            // sanity check that this is a loopback address
            for _, addr := range v6Addrs {
                if !addr.IP.IsLoopback() {
                    return fmt.Errorf("loopback interface found with non-loopback address %q", addr.IP)
                }
            }
        }
        return nil
    })
    if err != nil {
        return err // not tested
    }
    var result types.Result
    if conf.PrevResult != nil {
        // If loopback has previous result which passes from previous CNI plugin,
        // loopback should pass it transparently
        result = conf.PrevResult
    } else {
        r := &current.Result{
            CNIVersion: conf.CNIVersion,
            Interfaces: []*current.Interface{
                &current.Interface{
                    Name:    args.IfName,
                    Mac:     "00:00:00:00:00:00",
                    Sandbox: args.Netns,
                },
            },
        }
        if v4Addr != nil {
            r.IPs = append(r.IPs, &current.IPConfig{
                Interface: current.Int(0),
                Address:   *v4Addr,
            })
        }
        if v6Addr != nil {
            r.IPs = append(r.IPs, &current.IPConfig{
                Interface: current.Int(0),
                Address:   *v6Addr,
            })
        }
        result = r
    }
    return types.PrintResult(result, conf.CNIVersion)
}

The core of this function is setting the interface name to lo (for loopback), entering the container’s network namespace, and bringing the lo link up, while verifying that any addresses assigned to it are loopback addresses. It supports both IPv4 and IPv6.

The del command does the opposite and is much simpler:

func cmdDel(args *skel.CmdArgs) error {
    if args.Netns == "" {
        return nil
    }
    args.IfName = "lo" // ignore config, this only works for loopback
    err := ns.WithNetNSPath(args.Netns, func(ns.NetNS) error {
        link, err := netlink.LinkByName(args.IfName)
        if err != nil {
            return err // not tested
        }
        err = netlink.LinkSetDown(link)
        if err != nil {
            return err // not tested
        }
        return nil
    })
    if err != nil {
        // If NetNs is passed down by the Cloud Orchestration Engine, or if cmdDel
        // is called multiple times, don't return an error if the device is already removed.
        // https://github.com/kubernetes/kubernetes/issues/43014#issuecomment-287164444
        _, ok := err.(ns.NSPathNotExistErr)
        if ok {
            return nil
        }
        return err
    }
    return nil
}

The main function simply calls the PluginMain() function of the skel package, passing the command functions. The skel package takes care of the CNI plugin’s command-line protocol and invokes the cmdAdd, cmdCheck, and cmdDel functions at the right time:

func main() {
    skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, bv.BuildString("loopback"))
}

Building on the CNI plugin skeleton

Let’s explore the skel package and see what it does under the covers. The PluginMain() entry point is responsible for invoking PluginMainWithError(), catching errors, printing them to standard output, and exiting:

func PluginMain(cmdAdd, cmdCheck, cmdDel func(_ *CmdArgs) error, versionInfo version.PluginInfo, about string) {
    if e := PluginMainWithError(cmdAdd, cmdCheck, cmdDel, versionInfo, about); e != nil {
        if err := e.Print(); err != nil {
            log.Print("Error writing error JSON to stdout: ", err)
        }
        os.Exit(1)
    }
}

The PluginMainWithError() function instantiates a dispatcher, sets it up with all the I/O streams and the environment, and invokes its internal pluginMain() method:

func PluginMainWithError(cmdAdd, cmdCheck, cmdDel func(_ *CmdArgs) error, versionInfo version.PluginInfo, about string) *types.Error {
    return (&dispatcher{
        Getenv: os.Getenv,
        Stdin:  os.Stdin,
        Stdout: os.Stdout,
        Stderr: os.Stderr,
    }).pluginMain(cmdAdd, cmdCheck, cmdDel, versionInfo, about)
}

Here, finally, is the main logic of the skeleton. It gets the cmd arguments from the environment (which include the configuration read from standard input), detects which cmd is invoked, and calls the appropriate plugin function (cmdAdd, cmdCheck, or cmdDel). It can also return version information:

func (t *dispatcher) pluginMain(cmdAdd, cmdCheck, cmdDel func(_ *CmdArgs) error, versionInfo version.PluginInfo, about string) *types.Error {
    cmd, cmdArgs, err := t.getCmdArgsFromEnv()
    if err != nil {
        // Print the about string to stderr when no command is set
        if err.Code == types.ErrInvalidEnvironmentVariables && t.Getenv("CNI_COMMAND") == "" && about != "" {
            _, _ = fmt.Fprintln(t.Stderr, about)
            _, _ = fmt.Fprintf(t.Stderr, "CNI protocol versions supported: %s\n", strings.Join(versionInfo.SupportedVersions(), ", "))
            return nil
        }
        return err
    }
    if cmd != "VERSION" {
        if err = validateConfig(cmdArgs.StdinData); err != nil {
            return err
        }
        if err = utils.ValidateContainerID(cmdArgs.ContainerID); err != nil {
            return err
        }
        if err = utils.ValidateInterfaceName(cmdArgs.IfName); err != nil {
            return err
        }
    }
    switch cmd {
    case "ADD":
        err = t.checkVersionAndCall(cmdArgs, versionInfo, cmdAdd)
    case "CHECK":
        configVersion, err := t.ConfVersionDecoder.Decode(cmdArgs.StdinData)
        if err != nil {
            return types.NewError(types.ErrDecodingFailure, err.Error(), "")
        }
        if gtet, err := version.GreaterThanOrEqualTo(configVersion, "0.4.0"); err != nil {
            return types.NewError(types.ErrDecodingFailure, err.Error(), "")
        } else if !gtet {
            return types.NewError(types.ErrIncompatibleCNIVersion, "config version does not allow CHECK", "")
        }
        for _, pluginVersion := range versionInfo.SupportedVersions() {
            gtet, err := version.GreaterThanOrEqualTo(pluginVersion, configVersion)
            if err != nil {
                return types.NewError(types.ErrDecodingFailure, err.Error(), "")
            } else if gtet {
                if err := t.checkVersionAndCall(cmdArgs, versionInfo, cmdCheck); err != nil {
                    return err
                }
                return nil
            }
        }
        return types.NewError(types.ErrIncompatibleCNIVersion, "plugin version does not allow CHECK", "")
    case "DEL":
        err = t.checkVersionAndCall(cmdArgs, versionInfo, cmdDel)
    case "VERSION":
        if err := versionInfo.Encode(t.Stdout); err != nil {
            return types.NewError(types.ErrIOFailure, err.Error(), "")
        }
    default:
        return types.NewError(types.ErrInvalidEnvironmentVariables, fmt.Sprintf("unknown CNI_COMMAND: %v", cmd), "")
    }
    return err
}

The loopback plugin is one of the simplest CNI plugins. Let’s check out the bridge plugin.

Reviewing the bridge plugin

The bridge plugin is more substantial. Let’s look at some key parts of its implementation. The full source code is available here: https://github.com/containernetworking/plugins/tree/main/plugins/main/bridge.

In the bridge.go file, the plugin defines a network configuration struct with the following fields:

type NetConf struct {
    types.NetConf
    BrName       string `json:"bridge"`
    IsGW         bool   `json:"isGateway"`
    IsDefaultGW  bool   `json:"isDefaultGateway"`
    ForceAddress bool   `json:"forceAddress"`
    IPMasq       bool   `json:"ipMasq"`
    MTU          int    `json:"mtu"`
    HairpinMode  bool   `json:"hairpinMode"`
    PromiscMode  bool   `json:"promiscMode"`
    Vlan         int    `json:"vlan"`
    MacSpoofChk  bool   `json:"macspoofchk,omitempty"`
    EnableDad    bool   `json:"enabledad,omitempty"`
    Args struct {
        Cni BridgeArgs `json:"cni,omitempty"`
    } `json:"args,omitempty"`
    RuntimeConfig struct {
        Mac string `json:"mac,omitempty"`
    } `json:"runtimeConfig,omitempty"`
    mac string
}

We will not cover what each parameter does and how it interacts with the other parameters due to space limitations. The goal is to understand the flow and have a starting point if you want to implement your own CNI plugin. The configuration is loaded from JSON via the loadNetConf() function. It is called at the beginning of the cmdAdd() and cmdDel() functions:

n, cniVersion, err := loadNetConf(args.StdinData, args.Args)
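
For reference, a network configuration that loadNetConf() might parse could look like the following sketch; the network name, bridge name, subnet, and the choice of the host-local IPAM plugin are assumptions for illustration:

{
    "cniVersion": "1.0.0",
    "name": "mynet",
    "type": "bridge",
    "bridge": "cni0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16"
    }
}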

Here is the core of cmdAdd(), which uses information from the network configuration to set up the bridge and a veth pair:

    br, brInterface, err := setupBridge(n)
    if err != nil {
        return err
    }
    netns, err := ns.GetNS(args.Netns)
    if err != nil {
        return fmt.Errorf("failed to open netns %q: %v", args.Netns, err)
    }
    defer netns.Close()
    hostInterface, containerInterface, err := setupVeth(netns, br, args.IfName, n.MTU, n.HairpinMode, n.Vlan)
    if err != nil {
        return err
    }

Later, the function handles the L3 mode with its multiple cases:

    // Assume L2 interface only
    result := &current.Result{
        CNIVersion: current.ImplementedSpecVersion,
        Interfaces: []*current.Interface{
            brInterface,
            hostInterface,
            containerInterface,
        },
    }
    if n.MacSpoofChk {
        ...
    }
    
    if isLayer3 {
        // run the IPAM plugin and get back the config to apply
        r, err := ipam.ExecAdd(n.IPAM.Type, args.StdinData)
        if err != nil {
            return err
        }
        // release IP in case of failure
        defer func() {
            if !success {
                ipam.ExecDel(n.IPAM.Type, args.StdinData)
            }
        }()
        // Convert whatever the IPAM result was into the current Result type
        ipamResult, err := current.NewResultFromResult(r)
        if err != nil {
            return err
        }
        result.IPs = ipamResult.IPs
        result.Routes = ipamResult.Routes
        result.DNS = ipamResult.DNS
        if len(result.IPs) == 0 {
            return errors.New("IPAM plugin returned missing IP config")
        }
        // Gather gateway information for each IP family
        gwsV4, gwsV6, err := calcGateways(result, n)
        if err != nil {
            return err
        }
        // Configure the container hardware address and IP address(es)
        if err := netns.Do(func(_ ns.NetNS) error {
            ...
        }); err != nil {
            return err
        }
        // check bridge port state
        retries := []int{0, 50, 500, 1000, 1000}
        for idx, sleep := range retries {
            ...
        }
        
        if n.IsGW {
            ...
        }
        if n.IPMasq {
            ...
        }
    } else {
        ...
    }

Finally, it updates the MAC address that may have changed and returns the results:

    // Refetch the bridge since its MAC address may change when the first
    // veth is added or after its IP address is set
    br, err = bridgeByName(n.BrName)
    if err != nil {
        return err
    }
    brInterface.Mac = br.Attrs().HardwareAddr.String()
    // Return an error requested by testcases, if any
    if debugPostIPAMError != nil {
        return debugPostIPAMError
    }
    // Use incoming DNS settings if provided, otherwise use the
    // settings that were already configured by the IPAM plugin
    if dnsConfSet(n.DNS) {
        result.DNS = n.DNS
    }
    success = true
    return types.PrintResult(result, cniVersion)

This is just part of the full implementation. There is also route setting and hardware IP allocation. If you plan to write your own CNI plugin, I encourage you to peruse the full source code, which is quite extensive, to get the full picture: https://github.com/containernetworking/plugins/tree/main/plugins/main/bridge.

Let’s summarize what we have learned.

Summary

In this chapter, we covered a lot of ground. Networking is a vast topic, as there are so many combinations of hardware, software, operating environments, and user skills. It is a complicated endeavor to come up with a comprehensive networking solution that is robust, secure, performant, and easy to maintain. For Kubernetes clusters, the cloud providers mostly solve these issues. But if you run on-premises clusters or need a tailor-made solution, you have a lot of options to choose from. Kubernetes is a very flexible platform, designed for extension. Networking in particular is highly pluggable.

The main topics we discussed were the Kubernetes networking model (a flat address space where pods can reach each other), how lookup and discovery work, the Kubernetes network plugins, various networking solutions at different levels of abstraction (a lot of interesting variations), using network policies effectively to control the traffic inside the cluster, the ingress and Gateway APIs, the spectrum of load balancing solutions, and, finally, we looked at how to write a CNI plugin by dissecting a real-world implementation.

At this point, you are probably overwhelmed, especially if you’re not a subject matter expert. However, you should have a solid grasp of the internals of Kubernetes networking, be aware of all the interlocking pieces required to implement a full-fledged solution, and be able to craft your own solution based on trade-offs that make sense for your system and your skill level.

In Chapter 11, Running Kubernetes on Multiple Clusters, we will go even bigger and look at running Kubernetes on multiple clusters with federation. This is an important part of the Kubernetes story for geo-distributed deployments and ultimate scalability. Federated Kubernetes clusters can exceed local limitations, but they bring a whole slew of challenges too.
