Chapter 2. Linux Networking

In order to understand the implementation of networking in Kubernetes, we need to understand the fundamentals of networking in Linux. Ultimately, Kubernetes is a complex management tool for Linux (or Windows!) machines, and this is hard to ignore while working with the Kubernetes network stack. This chapter provides an overview of the Linux networking stack, with a focus on areas of note in Kubernetes. If you are highly familiar with Linux networking and network management, you may want to skim or skip this chapter.

Remember to use man references

This chapter introduces many Linux programs. Manual (“man”) pages, accessible with man <program>, will provide much more detail.


Let's revisit our Go web server, which we used in Chapter 1 (reproduced in Example 2-1). This web server listens on port 8080, and returns "Hello" for HTTP requests to /.

Privileged Ports

Ports 1-1023 (also known as “well-known ports”) require root permission to bind to.

Programs should be given the fewest permissions necessary to function, which means that a typical web service should not run as the root user. Because of this, many programs listen on port 1024 or higher (in particular, port 8080 is a common choice for HTTP services). When possible, listen on a non-privileged port, and use infrastructure redirects (load balancer forwarding, Kubernetes Services, etc.) to forward traffic from an externally visible privileged port to a program listening on a non-privileged port.

This way, an attacker exploiting a possible vulnerability in your service will not have overly-broad permissions available to them.

Example 2-1. Minimal web server in Go
package main

import (
	"fmt"
	"net/http"
)

func hello(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprintf(w, "Hello")
}

func main() {
	http.HandleFunc("/", hello)
	http.ListenAndServe("0.0.0.0:8080", nil)
}

Suppose this program is running on a Linux server machine, and an external client makes a request to /. What happens on the server? To start, our program needs to listen on an address and port. Our program creates a socket for that address and port, and binds to it. The socket will receive requests addressed to the specified address and port: port 8080, with any IP address, in our case.

0.0.0.0 in IPv4, and [::] in IPv6, are wildcard addresses. They match all addresses of their respective protocol, and as such, listen on all available IP addresses when used for a socket binding.

This is useful to expose a service without prior knowledge of which IP addresses the machines running it will have. Most network-exposed services bind this way.

There are multiple ways to inspect sockets. For example, ls -lah /proc/<server proc>/fd will list a process's file descriptors, including its sockets. We will discuss some programs that can inspect sockets at the end of the chapter.
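
As a quick illustration, assuming our server runs with process ID 1100 (a made-up value for this sketch), the listing might look like the following, with sockets appearing as socket:[<inode>] links:

$ ls -lah /proc/1100/fd
lrwx------ 1 user user 64 Jan  1 00:00 0 -> /dev/null
lrwx------ 1 user user 64 Jan  1 00:00 1 -> /dev/null
lrwx------ 1 user user 64 Jan  1 00:00 2 -> /dev/null
lrwx------ 1 user user 64 Jan  1 00:00 3 -> 'socket:[17776380]'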

The kernel maps a given packet to a specific connection, and uses an internal state machine to manage the connection state. Like sockets, connections can be inspected through various tools, which we will discuss later in this chapter. Linux represents each connection with a file descriptor. Accepting a connection entails a notification from the kernel to our program, which is then able to stream content to and from the file descriptor.

Going back to our Go web server, we can use strace to show what the server is doing.

$ strace ./main
execve("./main", ["./main"], 0x7ebf2700 /* 21 vars */) = 0
brk(NULL)                               = 0x78e000
uname({sysname="Linux", nodename="raspberrypi", ...}) = 0
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f1d000
[Content cut]

Because strace captures all system calls made by our server, there is a lot of output. Let's reduce it to the relevant network syscalls. Key points are marked with numbered callouts, as the Go HTTP server performs many syscalls during startup.

openat(AT_FDCWD, "/proc/sys/net/core/somaxconn", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
epoll_create1(EPOLL_CLOEXEC)            = 4   (1)
    {u32=1714573248, u64=1714573248}}) = 0
fcntl(3, F_GETFL)                       = 0x20000 (flags O_RDONLY|O_LARGEFILE)
read(3, "128\n", 65536)                 = 4
read(3, "", 65532)                      = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, 0x20245b0) = 0
close(3)                                = 0
close(3)                                = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 3   (2)
setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
bind(3, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr),
    sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
setsockopt(5, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
bind(5, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6,
    "::ffff:", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
close(5)                                = 0
close(3)                                = 0
setsockopt(3, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0   (3)
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(3, {sa_family=AF_INET6, sin6_port=htons(8080), inet_pton(AF_INET6, "::", &sin6_addr),
    sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0   (4)
listen(3, 128)                          = 0
    u64=1714573248}}) = 0
getsockname(3, {sa_family=AF_INET6, sin6_port=htons(8080),
    inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0},
    [112->28]) = 0
accept4(3, 0x2032d70, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN
    (Resource temporarily unavailable)
epoll_wait(4, [], 128, 0)               = 0
epoll_wait(4,   (5)

1. Open a file descriptor for epoll, which the server will use to wait for events on its sockets.

2. Create a TCP socket for IPv6 connections.

3. Disable IPV6_V6ONLY on the socket. Now, it can listen on IPv4 and IPv6.

4. Bind the socket to listen on port 8080 (on all addresses).

5. Wait for a request.

Once the server has started, we see the output from strace pause on epoll_wait.

At this point, the server is listening on its socket, and waiting for the kernel to notify it about packets. When we make a request to our listening server, we see the hello message.

$ curl <ip>:8080/
Hello

Browsers May Cause Unpredictable strace Results

If you are trying to debug the fundamentals of a web server with strace, you will probably not want to use a web browser. Browsers may send additional requests or metadata to the server, resulting in additional work for the server, or they may not make the requests you expect. For example, many browsers try to request a favicon file automatically. They will also attempt to cache files, reuse connections, and do other things that make it harder to predict the exact network interaction. When simple or minimal reproduction matters, try using a tool like curl or telnet.

In strace, we see the following from our server process:

[{EPOLLIN, {u32=1714573248, u64=1714573248}}], 128, -1) = 1
accept4(3, {sa_family=AF_INET6, sin6_port=htons(54202), inet_pton(AF_INET6,
    "::ffff:", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0},
    [112->28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 5
    {u32=1714573120, u64=1714573120}}) = 0
getsockname(5, {sa_family=AF_INET6, sin6_port=htons(8080),
    inet_pton(AF_INET6, "::ffff:", &sin6_addr), sin6_flowinfo=htonl(0),
    sin6_scope_id=0}, [112->28]) = 0
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(5, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPINTVL, [180], 4) = 0
setsockopt(5, SOL_TCP, TCP_KEEPIDLE, [180], 4) = 0
accept4(3, 0x2032d70, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN
    (Resource temporarily unavailable)

After inspecting the socket, our server writes response data ("Hello" wrapped in the HTTP protocol) to the file descriptor. From there, the Linux kernel (and some other userspace systems) translates the response into packets, and transmits those packets back to our curl client.

To summarize what the server is doing when it receives a request:

  • Epoll returns, and causes the program to resume.

  • The server sees a connection from ::ffff:<client IP> (the client's IPv4 address, represented as an IPv4-mapped IPv6 address).

  • The server inspects the socket.

  • The server changes keepalive options: it turns keepalive on, and sets a 180-second interval between keepalive probes.
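
To tie these syscalls back to code, below is a minimal Go sketch of a server that applies the same keepalive settings we saw in the strace output (SO_KEEPALIVE enabled, with a 180-second period; on Linux, Go's SetKeepAlivePeriod drives both TCP_KEEPIDLE and TCP_KEEPINTVL). This is an illustration of the mechanism, not the exact implementation inside Go's http package:

package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// keepAliveListener enables keepalive probes on each accepted
// connection, mirroring the setsockopt calls in the strace output.
type keepAliveListener struct {
	*net.TCPListener
}

func (l keepAliveListener) Accept() (net.Conn, error) {
	conn, err := l.AcceptTCP()
	if err != nil {
		return nil, err
	}
	conn.SetKeepAlive(true)                    // SO_KEEPALIVE on
	conn.SetKeepAlivePeriod(180 * time.Second) // TCP_KEEPIDLE / TCP_KEEPINTVL
	return conn, nil
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintf(w, "Hello")
	})
	ln, err := net.ListenTCP("tcp", &net.TCPAddr{Port: 8080})
	if err != nil {
		panic(err)
	}
	http.Serve(keepAliveListener{ln}, nil)
}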

This is a bird's-eye view of networking in Linux, from an application developer's point of view. There's a lot more going on to make everything work. Next, we'll look in more detail at the parts of the networking stack that are particularly relevant for Kubernetes users.

The Network Interface

Computers use a network interface to communicate with the outside world. Network interfaces can be physical (e.g., an Ethernet network controller) or virtual. Virtual network interfaces do not correspond to physical hardware; they are abstract interfaces provided by the host or hypervisor.

IP addresses are assigned to network interfaces. A typical interface may have one IPv4 address and one IPv6 address, but multiple addresses can be assigned to the same interface.
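
For example, an additional address can be assigned to an interface with the ip command; this sketch assumes an interface named eth0, and the address is arbitrary:

# ip addr add 192.0.2.10/24 dev eth0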

Linux itself has a concept of a network interface, which can be physical (such as an Ethernet card and port) or virtual. If you run ifconfig, you will see a list of all network interfaces and their configurations (including IP addresses).


The ip command can also be used to inspect network interfaces.

The loopback interface is a special interface for same-host communication. 127.0.0.1 is the standard IP address for the loopback interface. Packets sent to the loopback interface will not leave the host, and processes listening on 127.0.0.1 will only be accessible to other processes on the same host. Note that making a process listen on 127.0.0.1 is not a security boundary. CVE-2020-8558 was a past Kubernetes vulnerability, in which kube-proxy rules allowed some remote systems to reach services bound to 127.0.0.1. The loopback interface is commonly abbreviated as lo.

Let’s look at a typical ifconfig output.

Example 2-2. Output From ifconfig On A Machine With One Physical Network Interface (ens4), And The Loopback Interface
$ ifconfig
ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1460
        inet 10.138.0.4  netmask 255.255.255.255  broadcast 10.138.0.4
        inet6 fe80::4001:aff:fe8a:4  prefixlen 64  scopeid 0x20<link>
        ether 42:01:0a:8a:00:04  txqueuelen 1000  (Ethernet)
        RX packets 5896679  bytes 504372582 (504.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9962136  bytes 1850543741 (1.8 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 352  bytes 33742 (33.7 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 352  bytes 33742 (33.7 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Container runtimes create a virtual network interface for each pod on a host, so the list would be much longer on a typical Kubernetes node. We’ll cover container networking in more detail in Chapter 3.

The Bridge Interface

The bridge interface (shown in Figure 2-1) allows system administrators to create multiple layer 2 networks on a single host. In other words, the bridge functions like a network switch between network interfaces on a host, seamlessly connecting them. Bridges allow pods, with their individual network interfaces, to interact with the broader network via the node's network interface.

Figure 2-1. Bridge Interface

Documentation for Linux bridging is available in the kernel networking documentation, and the bridge-utils sources are maintained in a git repository on kernel.org.

In Example 2-3, we demonstrate how to create a bridge device named br0, and attach a VETH device, veth, and a physical device, eth0.

Here’s how to create a bridge interface using ip:

Example 2-3. Creating Bridge interface and connecting veth pair
# # Add a new bridge interface named br0.
# ip link add br0 type bridge
# # Attach eth0 to our bridge.
# ip link set eth0 master br0
# # Attach veth to our bridge.
# ip link set veth master br0

Bridges can also be managed and created using the brctl command. Example 2-4 shows some options available with brctl.

Example 2-4. Brctl cli options
$ brctl
$ commands:
        addbr           <bridge>                add bridge
        delbr           <bridge>                delete bridge
        addif           <bridge> <device>       add interface to bridge
        delif           <bridge> <device>       delete interface from bridge
        setageing       <bridge> <time>         set ageing time
        setbridgeprio   <bridge> <prio>         set bridge priority
        setfd           <bridge> <time>         set bridge forward delay
        sethello        <bridge> <time>         set hello time
        setmaxage       <bridge> <time>         set max message age
        setpathcost     <bridge> <port> <cost>  set path cost
        setportprio     <bridge> <port> <prio>  set port priority
        show                                    show a list of bridges
        showmacs        <bridge>                show a list of mac addrs
        showstp         <bridge>                show bridge stp info
        stp             <bridge> <state>        turn stp on/off

The VETH (virtual Ethernet) device is a local Ethernet tunnel. VETH devices are created in pairs, as shown in Figure 2-1, where the pod sees an eth0 interface backed by one end of the VETH pair. Packets transmitted on one device in the pair are immediately received on the other device. When either device is down, the link state of the pair is down. Adding a bridge to Linux can be done with the brctl command or ip. Use a VETH configuration when namespaces need to communicate with the main host namespace, or between each other.

Here’s how to set up a VETH configuration:

Example 2-5. Veth Creation
# ip netns add net1
# ip netns add net2
# ip link add veth1 netns net1 type veth peer name veth2 netns net2

In Example 2-5, we show the steps to create two network namespaces (not to be confused with Kubernetes namespaces), net1 and net2, and a VETH pair, with veth1 assigned to namespace net1 and veth2 assigned to namespace net2. These two namespaces are connected by this VETH pair. Assign a pair of IP addresses, and you can ping and communicate between the two namespaces, as shown below.
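
Continuing Example 2-5, that last step might look like the following sketch (the 10.1.1.0/24 addresses are arbitrary choices for illustration):

# ip netns exec net1 ip addr add 10.1.1.1/24 dev veth1
# ip netns exec net1 ip link set veth1 up
# ip netns exec net2 ip addr add 10.1.1.2/24 dev veth2
# ip netns exec net2 ip link set veth2 up
# ip netns exec net1 ping -c 1 10.1.1.2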

Kubernetes uses this in concert with the CNI project to manage container network namespaces, interfaces, and IP addresses. We will cover more of this in Chapter 3.

Packet Handling in the Kernel

The Linux kernel is responsible for translating between packets and a coherent stream of data for programs. In particular, we will look at how the kernel handles connections, because routing and firewalling, both key to Kubernetes, rely heavily on Linux's underlying packet management.


Netfilter

Netfilter, included in Linux since kernel 2.3, is a critical component of packet handling. Netfilter is a framework of kernel hooks, which allow userspace programs to handle packets on behalf of the kernel. In short, a program registers for a specific netfilter hook, and the kernel calls that program on applicable packets. The program can tell the kernel to do something with the packet (like drop it), or it can send a modified packet back to the kernel. With this, developers can build normal programs that run in userspace and handle packets. Netfilter was created jointly with iptables, to separate kernel and userspace code.

Further Reading On Netfilter

The netfilter project site (netfilter.org) contains some excellent documentation on the design and use of both netfilter and iptables.

Netfilter has five hooks, shown in Table 2-1. Netfilter triggers each hook at a specific stage in a packet's journey through the kernel. Understanding netfilter's hooks is key to understanding iptables later in this chapter, as iptables directly maps its concept of chains to netfilter hooks.

Table 2-1. Netfilter hooks
Netfilter Hook       Iptables Chain Name   Description
NF_IP_PRE_ROUTING    PREROUTING            Triggers when a packet arrives from an external system.
NF_IP_LOCAL_IN       INPUT                 Triggers when a packet's destination IP address matches this machine.
NF_IP_FORWARD        FORWARD               Triggers for packets where neither the source nor the destination matches the machine's IP addresses (in other words, packets that this machine is routing on behalf of other machines).
NF_IP_LOCAL_OUT      OUTPUT                Triggers when a packet, originating from the machine, is leaving the machine.
NF_IP_POST_ROUTING   POSTROUTING           Triggers when any packet (regardless of origin) is leaving the machine.

Because netfilter triggers each hook during a specific phase of packet handling, and under specific conditions, we can visualize netfilter hooks with a flow diagram, as seen in Figure 2-2.

Figure 2-2. The possible flows of a packet through netfilter hooks.

We can infer from our flow diagram that only certain permutations of netfilter hook calls are possible for any given packet. For example, a packet originating from a local process will always trigger NF_IP_LOCAL_OUT hooks and then NF_IP_POST_ROUTING hooks. In particular, the flow of netfilter hooks for a packet depends on two things: whether the packet source is the host, and whether the packet destination is the host. Note that if a process sends a packet destined for the same host, it triggers the NF_IP_LOCAL_OUT and then the NF_IP_POST_ROUTING hooks before "reentering" the system and triggering the NF_IP_PRE_ROUTING and NF_IP_LOCAL_IN hooks.

In some systems, it is possible to spoof such a packet by writing a fake source address (i.e., spoofing that a packet has a source and destination address of 127.0.0.1). Linux will normally filter such a packet when it arrives at an external interface. More broadly, Linux filters packets when a packet arrives at an interface and the packet's source address does not exist on that network. A packet with an "impossible" source IP address is called a Martian packet. It is possible to disable filtering of Martian packets in Linux. However, doing so poses substantial risk if any services on the host assume that traffic from localhost is "more trustworthy" than external traffic. This can be a common assumption, such as when exposing an API or database to the host without strong authentication.

Kubernetes and spoofed Packet sources

Kubernetes has had at least one CVE, CVE-2020-8558, in which packets from another host, with the source IP address falsely set to 127.0.0.1, could access ports that should only have been accessible locally. Among other things, this meant that if a node in the Kubernetes control plane ran kube-proxy, other machines on the node's network could use "trust authentication" to connect to the apiserver, effectively owning the cluster.

This was not technically a case of Martian packets not being filtered, as offending packets would come from the loopback device, which is on the same network as 127.0.0.1. You can read the reported issue in the Kubernetes issue tracker.

Let's look at the netfilter hook order for various packet sources and destinations.

Table 2-2. Key netfilter packet flows
Packet source      Packet destination   Hooks (in order)
Local machine      Local machine        NF_IP_LOCAL_OUT, NF_IP_POST_ROUTING (then NF_IP_PRE_ROUTING, NF_IP_LOCAL_IN after the packet reenters via the loopback interface)
Local machine      External machine     NF_IP_LOCAL_OUT, NF_IP_POST_ROUTING
External machine   Local machine        NF_IP_PRE_ROUTING, NF_IP_LOCAL_IN
External machine   External machine     NF_IP_PRE_ROUTING, NF_IP_FORWARD, NF_IP_POST_ROUTING

Note that packets from the machine, to itself, will trigger NF_IP_LOCAL_OUT and NF_IP_POST_ROUTING, then “leave” the network interface. They will “reenter” and be treated like packets from any other source.

NAT (network address translation) only impacts local routing decisions in the NF_IP_PRE_ROUTING and NF_IP_LOCAL_OUT hooks (e.g., the kernel makes no routing decisions after a packet reaches the NF_IP_LOCAL_IN hook). We see this reflected in the design of iptables, where source and destination NAT can only be performed in specific hooks/chains.

Programs can register a hook by calling nf_register_net_hook (nf_register_hook prior to Linux 4.13) with a handling function. The hook will be called every time a packet matches. This is how programs like iptables integrate with netfilter, though you will likely never need to do this yourself.

There are several actions that a netfilter hook can trigger, based on the return value:

NF_ACCEPT
Continue packet handling.

NF_DROP
Drop the packet, without further processing.

NF_QUEUE
Pass the packet to a userspace program.

NF_STOLEN
Doesn't execute further hooks, and allows the handling program to take ownership of the packet.

NF_REPEAT
Make the packet "reenter" the hook and be reprocessed.

Hooks can also return mutated packets. This allows programs to do things such as reroute or masquerade packets, adjust packet TTLs, etc.


Conntrack

Conntrack is a component of netfilter used to track the state of connections to (and from) the machine. Connection tracking directly associates packets with a particular connection. Without connection tracking, the flow of packets is much more opaque. Conntrack can be a liability, or a valuable tool, or both, depending on how it is used. In general, conntrack is important on systems that handle firewalling or NAT.

Connection tracking allows firewalls to distinguish between responses and arbitrary packets. A firewall can be configured to allow inbound packets that are part of an existing connection, but disallow inbound packets that are not part of any connection. To give an example, a program could be allowed to make outbound connections and perform an HTTP request, without the remote server otherwise being able to send data or initiate inbound connections.

NAT relies on conntrack to function. Iptables exposes NAT as two types: SNAT (source NAT, where iptables rewrites the source address) and DNAT (destination NAT, where iptables rewrites the destination address). NAT is extremely common; the odds are overwhelming that your home router uses SNAT and DNAT to fan traffic between your public IPv4 address and the local address of each device on the network. With connection tracking, packets are automatically associated with their connection, and easily modified with the same SNAT/DNAT change. This enables consistent routing decisions, such as "pinning" a connection in a load balancer to a specific backend or machine. The latter example is highly relevant in Kubernetes, due to kube-proxy's implementation of Service load balancing via iptables. Without connection tracking, every packet would need to be deterministically remapped to the same destination, which isn't doable (suppose the list of possible destinations could change...).

Conntrack identifies connections by a tuple, composed of source address, source port, destination address, destination port, and L4 protocol. These five pieces of information are the minimal identifiers needed to identify any given L4 connection. All L4 connections have an address and port on each side of the connection; after all, the internet uses addresses for routing, and computers use port numbers for application mapping. The final piece, the L4 protocol, is present because a program will bind to a port in TCP or UDP mode (and binding to one does not preclude binding to the other). Conntrack refers to these connections as flows. A flow contains metadata about the connection and its state.

Conntrack stores flows in a hash table, shown in Figure 2-3, using the connection tuple as a key. The size of the keyspace is configurable. A larger keyspace requires more memory to hold the underlying array, but will result in fewer flows hashing to the same key and being chained in a linked list, leading to faster flow lookup times. The maximum number of flows is also configurable; a severe issue that can happen is that when conntrack runs out of space for connection tracking, new connections cannot be made. There are other configuration options too, such as the timeout for a connection. On a typical system, default settings will suffice. However, a system that experiences a huge number of connections will run out of space. If your host is directly exposed to the internet, overwhelming conntrack with short-lived or incomplete connections is an easy way to cause a denial of service (DoS).

Figure 2-3. The structure of conntrack flows

Conntrack's maximum size is normally set in /proc/sys/net/nf_conntrack_max, and the hash table size is normally set in /sys/module/nf_conntrack/parameters/hashsize.
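
For example, you can inspect these settings from the command line, and raise the maximum as root (the values below are illustrative and will vary between systems):

$ cat /proc/sys/net/nf_conntrack_max
131072
$ cat /sys/module/nf_conntrack/parameters/hashsize
65536
# echo 262144 > /proc/sys/net/nf_conntrack_max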

Conntrack entries contain a connection state, which is one of four states. It is important to note that, because conntrack is a layer 3 (network layer) tool, its states are distinct from layer 4 (protocol layer) states. Table 2-3 details the four states.

Table 2-3. Conntrack States

NEW
A valid packet is sent or received, with no response seen.
Example: TCP SYN received.

ESTABLISHED
Packets observed in both directions.
Example: TCP SYN received, and TCP SYN/ACK sent.

RELATED
An additional connection is opened, where metadata indicates that it is "related" to an original connection. Related connection handling is complex.
Example: an FTP program, with an ESTABLISHED connection, opens additional data connections.

INVALID
The packet itself is invalid, or does not properly match another conntrack connection state.
Example: TCP RST received, with no prior connection.

Although conntrack is built into the kernel, it may not be active on your system. Certain kernel modules must be loaded, and you must have relevant iptables rules (essentially, conntrack is normally not active if nothing needs it to be). Conntrack requires the kernel module nf_conntrack_ipv4 to be active. lsmod | grep nf_conntrack will show if the module is loaded, and sudo modprobe nf_conntrack will load it. You may also need to install the conntrack CLI in order to view conntrack’s state.

When conntrack is active, conntrack -L shows all current flows. Additional conntrack flags will filter which flows are shown.

Let's look at the anatomy of a conntrack flow, as displayed (the addresses below are example values):

tcp      6 431999 ESTABLISHED src=10.0.0.2 dst=10.0.0.1 sport=49431 dport=22 src=10.0.0.1 dst=10.0.0.2 sport=22 dport=49431 [ASSURED] mark=0 use=1

<protocol> <protocol number> <flow TTL> <flow state> <source ip> <dest ip> <source port> <dest port> [] <expected return packet>

The expected return packet is of the form <source ip> <dest ip> <source port> <dest port>. This is the identifier that we expect to see when the remote system sends a packet. Note that in our example, the source and destination values are reversed for both addresses and ports. This is often, but not always, the case. For example, if a machine is behind a router, packets destined for that machine will be addressed to the router, whereas packets from the machine will have the machine's address, not the router's address, as the source.

In the above example, as seen from machine 10.0.0.1, machine 10.0.0.2 has established a TCP connection from port 49431 to port 22 on 10.0.0.1. You may recognize this as an SSH connection, although conntrack is unable to show application-level behavior.

Tools like grep can be useful for examining conntrack state and gathering ad hoc statistics:

$ grep ESTABLISHED /proc/net/ip_conntrack | wc -l


Routing

When handling any packet, the kernel must decide where to send that packet. In most cases, the destination machine will not be within the same network. For example, suppose you are attempting to connect to a server on the internet from your personal computer. That server is not on your network; the best your computer can do is pass the packet to another host that is closer to being able to reach the destination. The route table serves this purpose, by mapping known subnets to a gateway IP address and interface. You can list known routes with route (or route -n to show raw IP addresses instead of hostnames). A typical machine will have a route for the local network, and a default route for 0.0.0.0/0 (all other addresses). Recall that subnets can be expressed as a CIDR (e.g., 10.0.0.0/24) or as an IP address and a mask (e.g., 10.0.0.0 and 255.255.255.0).

Below is a typical routing table for a machine on a local network, with access to the internet (the addresses are example values).

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.1.1     0.0.0.0         UG    303    0        0 eth0
192.168.1.0     0.0.0.0         255.255.255.0   U     303    0        0 eth0

In the above example, a request to 8.8.8.8 would be sent to 192.168.1.1, on the eth0 interface, because 8.8.8.8 is in the subnet described by the first rule (0.0.0.0/0, i.e., all addresses) and not in the subnet described by the second rule (192.168.1.0/24). Subnets are specified by the Destination and Genmask values.

Linux prefers to route packets by specificity (how "small" a matching subnet is), and then by weight ("Metric" in route output). Given our example, a packet addressed to 192.168.1.8 will always be handled by the second route, because 192.168.1.0/24 matches a smaller set of addresses than 0.0.0.0/0. If two routes have the same specificity, the route with the lower metric is preferred.
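
For example, the ip command can add a route and show which route the kernel would choose for a given destination (the addresses here continue our example network, and the output is trimmed):

# ip route add 192.168.2.0/24 via 192.168.1.254 dev eth0
$ ip route get 192.168.2.25
192.168.2.25 via 192.168.1.254 dev eth0 src 192.168.1.5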

Some CNI plugins make heavy use of the route table.

Now that we’ve covered some key concepts in how the Linux kernel handles packets, we can look at how higher-level packet and connection routing works.

High Level Routing

Linux has complex packet management abilities. Such tools allow Linux users to create firewalls, log traffic, route packets, and even implement load balancing. Kubernetes makes use of some of these tools to handle node and pod connectivity, as well as to manage Kubernetes Services. In this book, we will cover the three tools that are most commonly seen in Kubernetes. All Kubernetes setups will make some use of iptables, but there are many ways that Services can be managed. We will also cover IPVS (which has built-in support in kube-proxy), and eBPF, which is used by Cilium (a kube-proxy alternative).

We will call back to this section in Chapter 4, when we cover Services and kube-proxy.


Iptables

Iptables is a staple of Linux system administration, and has been for many years. Iptables can be used to create firewalls and audit logs, mutate and reroute packets, and even implement crude connection fan-out. Iptables uses netfilter, which allows iptables to intercept and mutate packets.

Iptables rules can become extremely complex. There are many tools that provide a simpler interface for managing iptables rules; for example, firewalls like ufw and firewalld. Kubernetes components (specifically kubelet and kube-proxy) generate iptables rules in the same fashion. Understanding iptables is important to understanding access and routing for pods and nodes in most clusters.

Linux Distros Are Replacing Iptables With Nftables

Most Linux distributions are replacing iptables with nftables, a similar but more performant tool built atop netfilter. Some distros already ship a version of iptables that is backed by nftables.

Kubernetes has many known issues with the iptables/nftables transition. We highly recommend not using an nftables-backed version of iptables for the foreseeable future.

There are three key concepts in iptables: tables, chains, and rules. They are hierarchical in nature: a table contains chains, and a chain contains rules.

Tables organize rules according to the type of effect they have. Iptables has a broad range of functionality, which tables group together. The three most commonly applicable tables are filter (for firewall-related rules), nat (for NAT-related rules), and mangle (for non-NAT packet-mutating rules). Iptables executes tables in a specific order, which we'll cover later on.

Chains contain a list of rules. When a packet triggers a chain, the rules in the chain are evaluated in order. Chains exist within a table, and organize rules according to netfilter hooks. There are five built-in, top-level chains, each of which corresponds to a netfilter hook (recall that netfilter was designed jointly with iptables). Therefore, the choice of which chain to insert a rule into dictates if/when the rule will be evaluated for a given packet.

Rules are a combination of a condition and an action (referred to as a target); for example, "if a packet is addressed to port 22, drop it." Iptables evaluates individual packets, although chains and tables dictate which packets a given rule will be evaluated against.

The specifics of table → chain → target execution are complex, and there are no end of fiendish diagrams available to describe the full state machine. Next, we’ll examine each portion in more detail.

Cross-Reference Iptables Concepts

It may help to refer back to earlier material, as you progress through this section. The designs of tables, chains, and rules are tightly intertwined, and it is hard to properly understand one without understanding the others.

Iptables Tables

A table in iptables maps to a particular capability set, where each table is "responsible" for a specific type of action. In more concrete terms, a table can only contain specific target types, and many target types can only be used in specific tables. Iptables has five tables, which are listed in Table 2-4.

Table 2-4. Iptables Tables

filter
The filter table handles acceptance and rejection of packets.

nat
The nat table is used to modify the source or destination IP addresses.

mangle
The mangle table can perform general-purpose editing of packet headers, but it is not intended for NAT. It can also "mark" the packet with iptables-only metadata.

raw
The raw table allows for packet mutation before connection tracking and other tables are handled. Its most common use is to disable connection tracking for some packets.

security
SELinux uses the security table for packet handling. It is not applicable on a machine that is not using SELinux.

We will not discuss the security table in more detail in this book, however if you use SELinux, you should be aware of its use.

Iptables executes tables in a particular order: raw, mangle, nat, filter. However, this order of execution is broken up by chains. Linux users generally accept the mantra of "tables contain chains," but it may feel misleading: the order of execution is chains first, then tables. So, for example, a packet will trigger raw PREROUTING, mangle PREROUTING, and nat PREROUTING, and will then trigger the tables within either the INPUT or FORWARD chain (depending on the packet). We'll cover this in more detail in the next section on chains, as we put more pieces together.

Iptables Chains

Iptables’ chains are a list of rules. When a packet triggers or passes through a chain, each rule is sequentially evaluated, until the packet matches a “terminating target” (such as DROP), or the packet reaches the end of the chain.

The built-in, "top-level" chains are PREROUTING, INPUT, FORWARD, OUTPUT, and POSTROUTING. These are powered by netfilter hooks; each chain corresponds to a hook. Table 2-5 shows the chain and hook pairs. There are also user-defined subchains, which exist to help organize rules.

Table 2-5. Iptables Chains, And Corresponding Netfilter Hooks
Iptables Chain   Netfilter Hook
PREROUTING       NF_IP_PRE_ROUTING
INPUT            NF_IP_LOCAL_IN
FORWARD          NF_IP_FORWARD
OUTPUT           NF_IP_LOCAL_OUT
POSTROUTING      NF_IP_POST_ROUTING

Returning to our diagram of netfilter hook ordering, we can infer the equivalent diagram of iptables chain execution and ordering, for a given packet (see Figure 2-4).

Figure 2-4. The possible flows of a packet through iptables chains.

Again, as with netfilter, there are only a handful of ways that a packet can traverse these chains (assuming the packet is not rejected or dropped along the way). Let's examine some routing scenarios, from the perspective of a single machine receiving and sending packets.

Table 2-6. Iptables Chains Executed, in Various Scenarios

An inbound packet, from another machine, destined for this machine.
Chains processed: PREROUTING, INPUT

An inbound packet, not destined for this machine (which this machine is routing).
Chains processed: PREROUTING, FORWARD, POSTROUTING

An outbound packet, originating locally, destined for another machine.
Chains processed: OUTPUT, POSTROUTING

A packet from a local program, destined for the same machine.
Chains processed: OUTPUT, POSTROUTING (then PREROUTING, INPUT as the packet reenters via the loopback interface)

Experimenting With Chain Execution

You can experiment with chain execution behavior on your own, using LOG rules. For example:

iptables -A OUTPUT -p tcp --dport 22 -j LOG --log-level info --log-prefix "ssh-output "

will log TCP packets to port 22 when they are processed by the OUTPUT chain, with the log prefix "ssh-output". Be aware that log volume can quickly become unwieldy; log on important hosts with care.

Recall that when a packet triggers a chain, iptables executes tables within that chain (specifically, the rules within each table) in the following order:

  1. Raw

  2. Mangle

  3. NAT

  4. Filter

Most tables do not contain all chains; however, the relative execution order remains the same. This is a design decision to reduce redundancy. For example, the raw table exists to manipulate packets "entering" iptables, and therefore only has PREROUTING and OUTPUT chains, in accordance with netfilter's packet flow.

Table 2-7. Which Chains (Rows) Are Present In Which Tables (Columns)
              raw   mangle   nat   filter
PREROUTING    X     X        X
INPUT               X        X     X
FORWARD             X              X
OUTPUT        X     X        X     X
POSTROUTING         X        X

You can list the chains that correspond to a table yourself, with iptables -L -t <table>.

$ iptables -L -t filter
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

There is a small caveat for the nat table: DNAT can only be performed in PREROUTING or OUTPUT, and SNAT can only be performed in INPUT or POSTROUTING.

To give an example, suppose we have an inbound packet destined for our host. The order of execution would be:

  1. PREROUTING

    1. raw

    2. mangle

    3. nat

  2. INPUT

    1. mangle

    2. nat

    3. filter

Now that we’ve learned about netfilter hooks, tables, and chains, let’s take one last look at the flow of a packet through iptables, shown in Figure 2-5.

Figure 2-5. The flow of a packet through iptables tables and chains. A circle denotes a table/hook combination that exists in iptables.

All iptables rules belong to a table and chain, the possible combinations of which are represented as dots in our flowchart. Iptables evaluates chains (and the rules in them, in order) based on the order of the netfilter hooks that a packet triggers. For a given chain, iptables evaluates that chain in each table it is present in (note that some chain/table combinations do not exist, such as filter/POSTROUTING). If we trace the flow of a packet originating from the local host, we see the following table/chain pairs evaluated, in order:

  1. raw/OUTPUT

  2. mangle/OUTPUT

  3. nat/OUTPUT

  4. filter/OUTPUT

  5. mangle/POSTROUTING

  6. nat/POSTROUTING

Subchains

The aforementioned chains are the top-level, or entry-point, chains. However, users can define their own subchains, and execute them with the JUMP target. Iptables executes such a chain in the same manner, target by target, until a terminating target matches. This can be useful for logical separation, or for reusing a series of targets that can be executed in more than one context (i.e., a similar motivation to organizing code into a function). Such organization of rules across chains can have a substantial impact on performance. Iptables is, effectively, running tens, hundreds, or thousands of if statements against every single packet that goes in or out of your system, and that has measurable impact on packet latency, CPU use, and network throughput. A well-organized set of chains reduces this overhead by eliminating effectively redundant checks or actions. However, iptables' performance given a Service with many pods is still a problem in Kubernetes, which makes other solutions with less iptables use (or none), such as IPVS or eBPF, more appealing.

Let’s look at creating new chains in Example 2-6.

Example 2-6. Sample Iptables Chain For SSH Firewalling
# Create incoming-ssh chain.
$ iptables -N incoming-ssh

# Allow packets from specific IPs (example addresses).
$ iptables -A incoming-ssh -s 10.0.0.1 -j ACCEPT
$ iptables -A incoming-ssh -s 10.0.0.2 -j ACCEPT

# Log the packet.
$ iptables -A incoming-ssh -j LOG --log-level info --log-prefix "ssh-failure "

# Drop packets from all other IPs.
$ iptables -A incoming-ssh -j DROP

# Evaluate the incoming-ssh chain for inbound TCP packets addressed to port 22.
$ iptables -A INPUT -p tcp --dport 22 -j incoming-ssh

This example creates a new chain, incoming-ssh, which is evaluated for any inbound TCP packet addressed to port 22. The chain allows packets from two specific IP addresses; packets from all other addresses are logged and dropped.

Filter chains end in a default action if no prior target matched, e.g., DROP the packet. Built-in chains default to ACCEPT if no policy is specified. iptables -P <chain> <target> sets a chain's default policy.
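
For example, a common pattern (shown here as a sketch, not a complete firewall) is to set a default DROP policy on INPUT, and then explicitly accept packets belonging to established connections, plus chosen services such as the incoming-ssh chain from Example 2-6:

# iptables -P INPUT DROP
# iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# iptables -A INPUT -p tcp --dport 22 -j incoming-ssh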

Iptables Rules

Rules have two parts: a match condition, and an action (called a target). The match condition describes a packet attribute. If the packet matches, the action will be executed; if the packet does not match, iptables moves on to check the next rule.

Match conditions check whether a given packet meets some criteria; for example, whether the packet has a specific source address. The order of operations from tables/chains is important to remember, as prior operations can impact the packet by mutating it, dropping it, or rejecting it. Table 2-8 shows some common match types.

Table 2-8. Some Common Iptables Match Types

Source
-s, --src, --source
Matches packets with the specified source address.

Destination
-d, --dest, --destination
Matches packets with the specified destination address.

Protocol
-p, --protocol
Matches packets with the specified protocol.

In Interface
-i, --in-interface
Matches packets that entered via the specified interface.

Out Interface
-o, --out-interface
Matches packets that are leaving via the specified interface.

State
-m state --state <states>
Matches packets from connections that are in one of the comma-separated states. This uses the conntrack states (NEW, ESTABLISHED, RELATED, INVALID).

Iptables Supports Match “Extensions”

Using -m or --match, iptables can use extensions for match criteria. Extensions range from nice-to-haves such as specifying multiple ports in a single rule (multiport), to more complex features such as eBPF interactions. man iptables-extensions contains more information.
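
For example, the multiport extension lets a single rule match several ports (the ports here are arbitrary):

# iptables -A INPUT -p tcp -m multiport --dports 22,80,443 -j ACCEPT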

There are two kinds of target actions: terminating and nonterminating. A terminating target will stop iptables from checking subsequent targets in the chain, essentially acting as a final decision. A nonterminating target will allow iptables to continue checking subsequent targets in the chain. ACCEPT, DROP, REJECT, and RETURN are all terminating targets. Note that ACCEPT and RETURN are terminating only within their chain. That is to say, if a packet hits an ACCEPT target in a subchain, the parent chain will resume processing, and could potentially DROP or REJECT the packet. Example 2-7 shows a set of rules that would reject packets to port 80, despite those packets matching an ACCEPT at one point. Some command output has been removed for simplicity.

Example 2-7. Rule Sequence Which Would REJECT Some Previously Accepted Packets
$ iptables -L --line-numbers
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination
1    accept-all  all  --  anywhere             anywhere
2    REJECT     tcp  --  anywhere             anywhere
    tcp dpt:80 reject-with icmp-port-unreachable

Chain accept-all (1 references)
num  target     prot opt source               destination
1    ACCEPT     all  --  anywhere             anywhere

Table 2-9 contains a summary of common target types, and their behavior.

Table 2-9. Common Iptables Target Types And Behavior

AUDIT (all tables)
Records data about accepted, dropped, or rejected packets.

ACCEPT (filter)
Allows the packet to continue, unimpeded and without further modification.

DNAT (nat)
Modifies the destination address.

DROP (filter)
Discards the packet. To an external observer, it will appear as though the packet was never received.

JUMP (all tables)
Executes another chain. Once that chain finishes executing, execution of the parent chain will continue.

LOG (all tables)
Logs the packet contents, via the kernel log.

MARK (all tables)
Sets a special integer for the packet, used as an identifier by netfilter. The integer can be used in other iptables decisions, and is not written to the packet itself.

MASQUERADE (nat)
Modifies the source address of the packet, replacing it with the address of a specified network interface. This is similar to SNAT, but does not require the machine's IP address to be known in advance.

REJECT (filter)
Discards the packet, and sends a rejection reason.

RETURN (all tables)
Stops processing the current chain (or subchain). Note that this is not a terminating target; if there is a parent chain, that chain will continue to be processed.

SNAT (nat)
Modifies the source address of the packet, replacing it with a fixed address. See also: MASQUERADE.

Each target type may have specific options, such as ports or log strings that apply to the rule. Table 2-10 shows some example commands and explanations.

Table 2-10. Iptables Target Command Examples
Command                                          Explanation

iptables -A INPUT -s 10.0.0.1 -j ACCEPT          Accepts an inbound packet if the source address is 10.0.0.1 (an example address).

iptables -A INPUT -p ICMP -j ACCEPT              Accepts all inbound ICMP packets.

iptables -A INPUT -p tcp --dport 443 -j ACCEPT   Accepts all inbound TCP packets to port 443.

iptables -A INPUT -p tcp --dport 22 -j DROP      Drops all inbound TCP packets to port 22.

Each rule belongs to both a table and a chain, which control when (if at all) iptables executes the rule's target for a given packet. Next, we'll put together what we've learned, and look at iptables commands in practice.

Practical Iptables

The iptables Program is IPv4-Only

There is a distinct but nearly identical program, ip6tables, for managing IPv6 rules. iptables and ip6tables rules are completely separate; e.g., dropping all packets to TCP 0.0.0.0:22 with iptables will not prevent connections to TCP [::]:22, and vice versa for ip6tables.

For simplicity, we will only refer to iptables and IPv4 addresses in this section.

Iptables Rules Don’t Persist

iptables rules aren’t persisted across restarts. iptables provides iptables-save and iptables-restore tools, which can be used manually or with simple automation to capture or reload rules. This is something that most firewall tools paper over, by automatically creating their own iptables rules every time the system starts.
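
For example, a minimal sketch of saving and restoring rules by hand (the path is a common Debian convention, not a requirement):

# iptables-save > /etc/iptables/rules.v4
# iptables-restore < /etc/iptables/rules.v4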

You can show iptables chains with iptables -L.

$ iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

--line-numbers shows numbers for each rule in a chain. This can be helpful when inserting or deleting rules. -I <chain> <line> inserts a rule at the specified line number, shifting the rule previously at that position down.
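
For example, a sketch of listing rules with their numbers, inserting a rule at position 1, and then deleting it (the source address is arbitrary):

$ iptables -L INPUT --line-numbers
# iptables -I INPUT 1 -s 10.0.0.1 -j ACCEPT
# iptables -D INPUT 1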

The typical format of a command to interact with iptables rules is:

iptables [-t table] {-A|-C|-D} chain rule-specification

where -A is for append, -C is for check, and -D is for delete.

Iptables can masquerade connections, making it appear that the packets came from the machine's own IP address. This is useful to present a simplified exterior to the outside world. A common use case is to provide a known host for traffic, as a security bastion, or to provide a predictable set of IP addresses to third parties. In Kubernetes, masquerading can make pods use their node's IP address, despite the fact that pods have unique IP addresses. This is necessary to communicate outside the cluster in many setups, where pods have internal IP addresses that cannot communicate directly with the internet. The MASQUERADE target is similar to SNAT; however, it does not require a source address to be known and specified in advance. Instead, it uses the address of a specified interface. This is slightly less performant than SNAT in cases where the new source address is static, as iptables must continuously fetch the address.

$ iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

Iptables can perform connection-level load balancing, or more accurately, connection fanout. This technique relies on DNAT rules, and random selection (to prevent every connection from being routed to the first DNAT target).

$ iptables -t nat -A OUTPUT -p tcp --dport 80 -d $FRONT_IP -m statistic \
    --mode random --probability 0.5 -j DNAT --to-destination $BACKEND1_IP:80
$ iptables -t nat -A OUTPUT -p tcp --dport 80 -d $FRONT_IP \
    -j DNAT --to-destination $BACKEND2_IP:80

In the above example, there is a 50% chance of routing to the first backend. Otherwise, the packet proceeds to the next rule, which is guaranteed to route the connection to the second backend. The math gets a little tedious when adding more backends: for an equal chance of routing to any of n backends, the kth rule must have a 1/(n - k + 1) probability of matching. If there were three backends, the probabilities would need to be 0.333 (that is, 1/3), 0.5, and 1.
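
Extending the example above to three backends (reusing the same hypothetical $FRONT_IP and $BACKEND*_IP variables), the rules would look like:

$ iptables -t nat -A OUTPUT -p tcp --dport 80 -d $FRONT_IP -m statistic \
    --mode random --probability 0.333 -j DNAT --to-destination $BACKEND1_IP:80
$ iptables -t nat -A OUTPUT -p tcp --dport 80 -d $FRONT_IP -m statistic \
    --mode random --probability 0.5 -j DNAT --to-destination $BACKEND2_IP:80
$ iptables -t nat -A OUTPUT -p tcp --dport 80 -d $FRONT_IP \
    -j DNAT --to-destination $BACKEND3_IP:80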

Chain KUBE-SVC-I7EAKVFJLYM7WH25 (1 references)
target     prot opt source               destination
KUBE-SEP-LXP5RGXOX6SCIC6C  all  --  anywhere             anywhere
    statistic mode random probability 0.25000000000
KUBE-SEP-XRJTEP3YTXUYFBMK  all  --  anywhere             anywhere
    statistic mode random probability 0.33332999982
KUBE-SEP-OMZR4HWUSCJLN33U  all  --  anywhere             anywhere
    statistic mode random probability 0.50000000000
KUBE-SEP-EELL7LVIDZU4CPY6  all  --  anywhere             anywhere

When Kubernetes uses iptables load balancing for a Service, it creates a chain like the above. If you look closely, you can see rounding errors in one of the probability numbers.

Using DNAT fan-out for load balancing has several caveats. It has no feedback about the load of a given backend, and it will always map application-level queries on the same connection to the same backend. Because the DNAT result lasts the lifetime of the connection, if long-lived connections are common, many downstream clients may stick to the same upstream backend if that backend is longer-lived than others. To give a Kubernetes example, suppose a gRPC service has only two replicas, and then additional replicas scale up. gRPC reuses the same HTTP/2 connection, so existing downstream clients (using the Kubernetes Service, and not gRPC load balancing) will stay connected to the initial two replicas, skewing the load profile among the gRPC backends. Because of this, many developers use a smarter client (such as gRPC's client-side load balancing), force periodic reconnects at the server and/or client, or use a service mesh to externalize the problem. We'll discuss load balancing in more detail in Chapters 4 and 5.

Although iptables is widely used in Linux, it can become slow in the presence of a huge number of rules, and offers very limited load balancing functionality. Next we’ll look at IPVS, an alternative that is more purpose-built for load balancing.


IPVS

IPVS (IP Virtual Server) is a Linux connection (L4) load balancer. Figure 2-6 shows a simple diagram of IPVS's role in routing packets.

Figure 2-6. IPVS

Iptables can do simple L4 load balancing by randomly routing connections, with the randomness shaped by the weights on individual DNAT rules. IPVS supports multiple load balancing modes (in contrast with iptables’ one), which are outlined in Table 2-11. This allows IPVS to spread load more effectively than iptables, depending on IPVS configuration and traffic patterns.

Table 2-11. IPVS Modes Supported In Kubernetes

Round-robin (rr)
Sends subsequent connections to the "next" host in a cycle. This increases the time between subsequent connections sent to a given host, compared to the random routing that iptables enables.

Least connection (lc)
Sends connections to the host that currently has the fewest open connections.

Destination hashing (dh)
Sends connections deterministically to a specific host, based on the connection's destination address.

Source hashing (sh)
Sends connections deterministically to a specific host, based on the connection's source address.

Shortest expected delay (sed)
Sends connections to the host with the lowest connections-to-weight ratio.

Never queue (nq)
Sends connections to any host with no existing connections, otherwise falling back to the "shortest expected delay" strategy.

IPVS supports three packet forwarding modes:

  1. NAT rewrites source and destination addresses.

  2. DR (direct routing) rewrites the MAC address of the data frame with the MAC address of the selected backend server, routing the packet directly at layer 2.

  3. IP tunneling encapsulates the IP datagram within another IP datagram, to route it to the backend server.

There are three aspects to consider regarding the issues with iptables as a load balancer:

Number of nodes in the cluster
Even though Kubernetes already supports 5,000 nodes as of release v1.6, kube-proxy with iptables is a bottleneck to scaling a cluster to 5,000 nodes. One example: with a NodePort Service in a 5,000-node cluster, if we have 2,000 services and each service has ten pods, this will cause at least 20,000 iptables records on each worker node, which can make the kernel quite busy.

Latency to add and remove rules
Inserting into and removing from such an extensive list is an intensive operation at scale. For example, the time spent to add one rule when there are 5,000 services (40,000 rules) is 11 minutes; with 20,000 services (160,000 rules), it is 5 hours.

Latency to access a service (routing latency)
Each packet must traverse the iptables list until a match is made.

IPVS also supports session affinity, which is exposed as an option in Services (Service.spec.sessionAffinity and Service.spec.sessionAffinityConfig). Repeated connections, within the session affinity time window, will route to the same host. This can be useful for scenarios such as minimizing cache misses. It can also make routing in any mode effectively stateful (by indefinitely routing connections from the same address to the same host), but the routing stickiness is less absolute in Kubernetes, where individual pods come and go.

To create a basic load balancer with two equally weighted destinations, run ipvsadm -A -t <address> -s <mode>. -A, -E, and -D add, edit, and delete virtual services, respectively. The lowercase counterparts, -a, -e, and -d, add, edit, and delete host backends, respectively. For example, with a virtual service at 1.1.1.1:80 and backends at 2.2.2.2 and 3.3.3.3 (arbitrary example addresses):

# ipvsadm -A -t 1.1.1.1:80 -s lc
# ipvsadm -a -t 1.1.1.1:80 -r 2.2.2.2 -m -w 100
# ipvsadm -a -t 1.1.1.1:80 -r 3.3.3.3 -m -w 100
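
Session affinity corresponds to persistence in ipvsadm itself; as a sketch, the -p flag (with a timeout in seconds) makes repeated connections from a client stick to the same backend:

# ipvsadm -E -t 1.1.1.1:80 -s lc -p 300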

You can list the IPVS hosts with -L. Each virtual server (a unique IP address and port combination) is shown, with its backends.

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  1.1.1.1:80 lc
  -> 2.2.2.2:80                   Masq    100    0          0
  -> 3.3.3.3:80                   Masq    100    0          0

-L supports multiple options, such as --stats to show additional connection statistics.


eBPF

eBPF is a programming system that allows special sandboxed programs to run in the kernel, without passing packets back and forth between kernel and userspace as we saw with netfilter and iptables.

Before eBPF, there was BPF. The Berkeley Packet Filter (BPF) is a technology used in the kernel, among other things, to analyze network traffic. BPF supports packet filtering, which allows a userspace process to supply a filter that specifies which packets it wants to inspect. One of BPF's use cases is tcpdump, shown in Figure 2-7. When you specify a filter in tcpdump, it compiles the filter into a BPF program and passes it to BPF. The techniques in BPF have since been extended to other processes and kernel operations.

Figure 2-7. TCP Dump

An eBPF program has direct access to syscalls. eBPF programs can directly watch and block syscalls, without the usual approach of adding kernel hooks from a userspace program. Because of its performance characteristics, eBPF is well suited for writing networking software.

Learn more

You can learn more about eBPF from the project's online documentation at ebpf.io.

In addition to socket filtering, other supported attach points in the kernel include:

Kprobes
Dynamic kernel tracing of internal kernel components.

Uprobes
Userspace tracing.

Tracepoints
Kernel static tracing. These are programmed into the kernel by developers, and are more stable than kprobes, which may change between kernel versions.

perf_events
Timed sampling of data and events.

XDP
Specialized eBPF programs that can go lower than kernel space, to access driver space and act directly on packets.

Let’s return to tcpdump as an example. Figure 2-8 shows a simplified rendition of tcpdump’s interactions with eBPF.

Figure 2-8. eBPF Example

Suppose we run tcpdump -i any with a filter expression such as port 8080 (an arbitrary example).

pcap_compile compiles the filter expression into a BPF program. The kernel then uses this BPF program to filter all packets that go through the network devices we specified (all of them, via -i any, in our case).

The kernel makes the matching data available to tcpdump via a map. Maps are key/value data structures used by BPF programs to exchange data.

There are many reasons to use eBPF with Kubernetes;

Performance (hashing table versus Iptables list)

For every service added to Kubernetes, the list of iptables rules have to be traversed grows exponentially. Because of the lack of incremental updates, the entire list of rules has to be replaced each time a new rule is added. This leads to a total duration of 5 hours to install the 160K iptables rules representing 20K Kubernetes services.


Using BPF we can gather Pod and container-level network statistics BPF socket filter is nothing new, but BPF socket filter per cgroup is. Introduced in Linux 4.10, cgroup-bpf allows attaching eBPF programs to cgroups. Once attached, the program is executed for all packets entering or exiting any process in the cgroup. Auditing kubectl-exec with eBPF - With eBPF, you can attach a program that would record any commands executed in the kubectl exec session and pass those commands to an userspace program that logs those events.


Security

Seccomp (secure computing) restricts which syscalls are allowed. Seccomp filters can be written in eBPF.


Runtime security

Projects such as Falco, an open source container-native runtime security tool, use eBPF.

The most common use of eBPF in Kubernetes is Cilium, a popular container network interface (CNI) and Service implementation. Cilium replaces kube-proxy, which writes iptables rules to map a Service’s IP address onto its corresponding pods.

Through eBPF, Cilium can intercept and route all packets directly in the kernel, which is faster and allows for application-level (L7) load balancing. We will cover kube-proxy in Chapter 4.

Network Troubleshooting Tools

Troubleshooting network-related issues with Linux is a complex topic, and could easily fill its own book. In this section, we will introduce some key troubleshooting tools, and the basics of their use (Table 2-12 is provided as a simple cheatsheet of tools and applicable use cases). Think of this section as a jumping-off point for common Kubernetes-related tool uses. Man pages, --help, and the internet can guide you further. There is substantial overlap in the tools that we describe, so you may find learning about some tools (or tool features) redundant. Some are better suited to a given task than others (for example, multiple tools will catch TLS errors but OpenSSL provides the richest debugging information). Exact tool use may come down to preference, familiarity, and availability.

Table 2-12. Cheatsheet of Common Debugging Cases And Tools
Case Tools

Checking connectivity

Traceroute, Ping, Telnet, Netcat

Port scanning


Checking DNS records

Dig, commands mentioned in “Checking connectivity”

Checking HTTP/1

Curl, Telnet, Netcat

Checking HTTPS

OpenSSL, Curl

Checking listening programs

Some networking tools that we describe likely won’t be pre-installed in your distro of choice, but all should be available through your distro’s package manager. We will sometimes use # Truncated in command output where we omitted text, to avoid examples becoming repetitive or overly long.

Security Warning

Before we get into tooling details, we need to talk about security. An attacker can utilize any tool listed here to explore and access additional systems. There are many strong opinions on this topic, but we consider it best practice to leave the fewest possible networking tools installed on a given machine.

An attacker may still be able to download tools themselves (e.g., by downloading a binary off the internet) or use the standard package manager (if they have sufficient permission). In most cases, you are simply introducing some additional friction prior to exploring and exploiting. However, in some cases you can reduce an attacker’s capabilities by not preinstalling networking tools.

Linux file permissions include something called the “setuid bit,” which is sometimes used by networking tools. If a file has the setuid bit set, executing that file causes it to be executed as the user who owns the file, rather than the current user. You can observe this by looking for an s rather than an x in the permission readout of a file:

$ ls -la /usr/bin/passwd
-rwsr-xr-x 1 root root 68208 May 28  2020 /usr/bin/passwd

This allows programs to expose limited, privileged capabilities (for example, passwd uses this ability to allow a user to update their password, without allowing arbitrary writes to the password file). A number of networking tools (ping, nmap, etc.) may use the setuid bit on some systems to send raw packets, sniff packets, etc. If an attacker downloads their own copy of a tool and cannot gain root privileges, they will be able to do less with that tool than if it were installed by the system with the setuid bit set.
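To see which setuid binaries are present on a given machine, you can search the filesystem; a sketch (results will differ from system to system):

$ find / -perm -4000 -type f 2>/dev/null
/usr/bin/passwd
# Truncated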

ping
Ping is a simple program that sends ICMP ECHO_REQUEST packets to networked devices. It is a common, simple way to test network connectivity from one host to another.

ICMP is its own IP protocol, separate from TCP and UDP. Kubernetes Services support TCP and UDP, but not ICMP. This means that pings to a Kubernetes Service will always fail. Instead, you will need to use telnet, or a higher-level tool such as curl, to check connectivity to a Service. Individual pods may still be reachable by ping, depending on your network configuration.

ICMP Reachability Doesn’t Guarantee Other Reachability

Firewalls and routing software are aware of ICMP packets, and can be configured to filter or route ICMP packets in specific ways. It is common, but not guaranteed (or necessarily advisable), to have very permissive rules for ICMP packets. Some network administrators, network software, or cloud providers will allow ICMP packets by default.

The basic use of ping is simply ping <address>. The address can be an IP address or a domain name. Ping will send a packet, wait, and report the status of that request when a response arrives or a timeout occurs.

By default, ping will send packets forever, and must be manually stopped (e.g., with ctrl-c). -c <count> will make ping perform a fixed number of pings, before shutting down. On shutdown, ping also prints a summary.

$ ping -c 2 <destination>
PING <destination> (<ip address>): 56 data bytes
64 bytes from <ip address>: icmp_seq=0 ttl=117 time=12.665 ms
64 bytes from <ip address>: icmp_seq=1 ttl=117 time=12.403 ms

--- <destination> ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 12.403/12.534/12.665/0.131 ms

Table 2-13 shows common ping options.

Table 2-13. Useful Ping Options
Option Description

-c <count>

Sends the specified number of packets. Exits after the final packet is received or times out.

-i <seconds>

Sets the wait interval between sending packets. Defaults to 1 second. Extremely low values are not recommended, as ping can flood the network.

-o

Exits after receiving one reply packet. Equivalent to -c 1.

-S <source address>

Uses the specified source address for the packet.

-W <milliseconds>

Sets the wait interval to receive a packet. If ping receives the packet later than the wait time, it will still count towards the final summary.

traceroute
Traceroute shows the network route taken from one host to another. This allows users to easily validate and debug the route taken (or where routing fails) from one machine to another.

Traceroute sends packets with specific IP time-to-live (TTL) values. Recall from Chapter 1 that each host that handles a packet decrements its TTL by 1, which limits the number of hosts that can handle a packet. When a host receives a packet and decrements the TTL to 0, it sends a TIME_EXCEEDED packet and discards the original packet. The TIME_EXCEEDED response packet contains the source address of the machine where the packet timed out. By starting with a TTL of 1, and raising the TTL by 1 for each subsequent packet, traceroute is able to get a response from each host along the route to the destination address.

Traceroute displays hosts line-by-line, starting with the first external machine. Each line contains the hostname (if available), IP address, and response time.

$ traceroute <destination>
traceroute to <destination> (<ip address>), 64 hops max, 52 byte packets
 1  router (<ip address>)  8.061 ms  2.273 ms  1.576 ms
 2  <host> (<ip address>)  2.037 ms  1.856 ms  1.835 ms
 3  <host> (<ip address>)  4.675 ms  7.179 ms  9.930 ms
 4  * * *
 5  <host> (<ip address>)  20.272 ms  8.142 ms  8.046 ms
 6  <host> (<ip address>)  14.715 ms  8.257 ms  12.038 ms
 7  <host> (<ip address>)  5.057 ms  4.963 ms  5.004 ms
 8  <host> (<ip address>)  5.560 ms  <host> (<ip address>)  6.396 ms  <host> (<ip address>)  5.729 ms
 9  * * *
10  <host> (<ip address>)  64.473 ms  10.008 ms  9.321 ms

If traceroute receives no response from a given hop before timing out, it prints a *. Some hosts may refuse to send a TIME_EXCEEDED packet, or a firewall along the way may prevent successful delivery.

Table 2-14 shows common traceroute options.

Table 2-14. Useful Traceroute Options
Option Syntax Description

First TTL

-f <TTL>, -M <TTL>

Set the starting IP TTL (default value: 1). Setting the TTL to n will cause traceroute to not report the first n-1 hosts en route to the destination.

Max TTL

-m <TTL>

Set the maximum TTL, i.e., the maximum number of hosts that traceroute will attempt to route through.


-P <protocol>

Send packets of the specified protocol (TCP, UDP, ICMP, and sometimes other options). UDP is the default.

Source Address

-s <address>

Specify the source IP address of outgoing packets.

Wait Time

-w <seconds>

Set the time to wait for a probe response.
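These options can be combined when the default UDP probes are filtered. A sketch that probes with TCP instead, using the BSD-style flags from the table above (Linux traceroute spells this -T):

$ sudo traceroute -P TCP <destination>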

dig
Dig is a DNS lookup tool. You can use it to make DNS queries from the command line, and display the results.

The general form of a dig command is dig [options] <domain>. By default, dig will display the CNAME, A, and AAAA records.

$ dig <domain>

; <<>> DiG 9.10.6 <<>> <domain>
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51818
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1452
;; QUESTION SECTION:
;<domain>.			IN	A

;; ANSWER SECTION:
<domain>.		<ttl>	IN	A	<ip address>

;; Query time: 12 msec
;; SERVER: 2600:1700:2800:7d4f:6238:e0ff:fe08:6a7b#53(2600:1700:2800:7d4f:6238:e0ff:fe08:6a7b)
;; WHEN: Mon Jul 06 00:10:35 PDT 2020
;; MSG SIZE  rcvd: 71

To display a particular type of DNS record, run dig <domain> <type> (or dig -t <type> <domain>). This is overwhelmingly the main use case for dig.

$ dig -t TXT <domain>

; <<>> DiG 9.10.6 <<>> -t TXT <domain>
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16443
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;<domain>.			IN	TXT

;; ANSWER SECTION:
<domain>.		3599	IN	TXT	"v=spf1 ~all"
<domain>.		3599	IN	TXT	"google-site-verification=oPORCoq9XU6CmaR7G_bV00CLmEz-wLGOL7SXpeEuTt8"

;; Query time: 49 msec
;; SERVER: 2600:1700:2800:7d4f:6238:e0ff:fe08:6a7b#53(2600:1700:2800:7d4f:6238:e0ff:fe08:6a7b)
;; WHEN: Sat Aug 08 18:11:48 PDT 2020
;; MSG SIZE  rcvd: 171

Table 2-15 shows common dig options.

Table 2-15. Useful Dig Options
Option Syntax Description

IPv4 Only


Use IPv4 only.

IPv6 Only


Use IPv6 only.

Source Address

-b <address>[#<port>]

Specify the source address for the DNS query. A port can optionally be included, preceded by #.


-p <port>

Specify the port to query, in case DNS is exposed on a nonstandard port. Default is 53, the DNS standard.


-q <domain>

The domain name to query. The domain name is usually specified as a positional argument.

Record Type

-t <type>

The DNS record type to query. The record type can alternatively be specified as a positional argument.

telnet

Telnet is both a network protocol and a tool for using that protocol. Telnet was once used for remote login, in a manner similar to SSH. SSH has become dominant due to its better security, but telnet is still extremely useful for debugging servers that use a text-based protocol. For example, with telnet, you can connect to an HTTP/1 server and manually make requests against it.

The basic syntax of telnet is telnet <address> <port>. This establishes a connection and provides an interactive command-line interface. Pressing Enter twice will send a command, which allows multiline commands to be written easily. Press ctrl-] to exit the session.

$ telnet <domain> 80
Connected to <domain>.
Escape character is '^]'.
> HEAD / HTTP/1.1
> Host: <domain>
HTTP/1.1 301 Moved Permanently
Cache-Control: public, max-age=0, must-revalidate
Content-Length: 0
Content-Type: text/plain
Date: Thu, 30 Jul 2020 01:23:53 GMT
Age: 2
Connection: keep-alive
Server: Netlify
X-NF-Request-ID: a48579f7-a045-4f13-af1a-eeaa69a81b2f-23395499

To make full use of telnet, you will need to understand how the application protocol that you are using works. Telnet is a classic tool to debug servers running HTTP, HTTPS, POP3, IMAP, and so on.

nmap
nmap is a port scanner, which allows you to explore and examine services on your network.

The general syntax of nmap is nmap [options] <target>, where target is a domain, IP address, or IP CIDR. Nmap’s default options will give a fast and brief summary of open ports on a host.

$ nmap <target>
Starting Nmap 7.80 ( ) at 2020-07-29 20:14 PDT
Nmap scan report for my-host (<ip address>)
Host is up (0.011s latency).
Not shown: 997 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
3000/tcp open  ppp
5432/tcp open  postgresql

Nmap done: 1 IP address (1 host up) scanned in 0.45 seconds

In the above example, nmap detects 3 open ports, and guesses which service is running on each port.

Use Nmap To Find Unnecessary Exposed Services

Because Nmap can quickly show you which services are accessible from a remote machine, it can be a quick and easy way to spot services that should not be exposed. For this reason, Nmap is a favorite tool of attackers.

Nmap has a dizzying number of options, which change the scan behavior, and level of detail provided. As with other commands, we will summarize some key options, but we highly recommend reading nmap’s help/man pages.

Table 2-16 shows common nmap options.

Table 2-16. Useful Nmap Options
Option Syntax Description

Additional detection


Enable OS detection, version detection, and more.



Increase the debugging level. Using multiple d's (e.g., -dd) increases the effect.

Increase Verbosity


Increase the command verbosity. Using multiple v's (e.g., -vv) increases the effect.

netstat
netstat can display a wide range of information about a machine’s network stack and connections.

$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0    164 my-host:ssh             laptop:50113            ESTABLISHED
tcp        0      0 my-host:50051           example-host:48760      ESTABLISHED
tcp6       0      0 2600:1700:2800:7d:54310 2600:1901:0:bae2::https TIME_WAIT
udp6       0      0 localhost:38125         localhost:38125         ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  13     [ ]         DGRAM                    8451     /run/systemd/journal/dev-log
unix  2      [ ]         DGRAM                    8463     /run/systemd/journal/syslog
[Cut for brevity]

Invoking netstat with no additional arguments will display all connected sockets on the machine. In our example, we see 3 TCP sockets, 1 UDP socket, and a multitude of Unix sockets. The output includes the address (IP address and port) on both sides of a connection.

We can use the -a flag to show all sockets (including listening ones), or -l to show only listening sockets.

$ netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0<port>         *               LISTEN
tcp        0      0<port>       *               LISTEN
tcp        0    172 my-host:ssh             laptop:50113            ESTABLISHED
[Content cut]

A common use of netstat is to check which process is listening on a specific port. To do that, we run sudo netstat -lp (-l for “listening” and -p for “program”). Sudo may be necessary for netstat to view all program information. The output for -l shows which address a service is listening on (e.g., or

We can use simple tools like grep to get a clear output from netstat, when we are looking for a specific result.

$ sudo netstat -lp | grep 3000
tcp6     0    0 [::]:3000       [::]:*       LISTEN     613/grafana-server

Table 2-17 shows common netstat options.

Table 2-17. Useful Netstat Commands
Option Syntax Description

Show all sockets

netstat -a

Show all sockets, not only open connections.

Show statistics

netstat -s

Shows networking statistics. By default, netstat shows stats from all protocols.

Show listening sockets

netstat -l

Shows sockets that are listening. This is an easy way to find running services.

Show TCP

netstat -t

The -t flag shows only TCP data. It can be used with other flags, e.g., -lt (show sockets listening with TCP).

Show UDP

netstat -u

The -u flag shows only UDP data. It can be used with other flags, e.g., -lu (show sockets listening with UDP).

netcat

Netcat is a multipurpose tool for making connections, sending data, or listening on a socket. It can be helpful as a way to “manually” run a server or client, in order to inspect what happens in greater detail. Netcat is arguably similar to telnet in this regard, though netcat is capable of far more.


nc is an alias for netcat on most systems.

Netcat can connect to a server when invoked as netcat <address> <port>. Netcat has an interactive stdin, which allows you to manually type data or pipe data to netcat. Very telnet-esque, so far.

$ echo -e "GET / HTTP/1.1
Host: localhost
" > cmd
$ nc localhost 80 < cmd
HTTP/1.1 302 Found
Cache-Control: no-cache
Content-Type: text/html; charset=utf-8
[Content cut]
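Netcat can also listen on a socket, acting as a simple server. A sketch (flag conventions differ between netcat variants; BSD netcat is shown here, while traditional netcat expects -l -p <port>):

$ nc -l 8080
# In a second terminal, connect and send a line of input;
# it will be printed by the listening netcat:
$ echo "hello" | nc localhost 8080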

openssl
OpenSSL powers a substantial chunk of the world’s HTTPS connections. Most heavy lifting with OpenSSL is done with language bindings, but it also has a CLI for operational tasks, and debugging. openssl can do things such as creating keys and certificates, signing certificates, and, most relevant to us, testing TLS/SSL connections. Many other tools, including ones outlined in this chapter, can test TLS/SSL connections. However, OpenSSL stands out for its feature-richness and level of detail.

Commands usually take the form openssl [sub-command] [arguments] [options]. openssl has a vast number of sub-commands (for example, openssl rand allows you to generate pseudorandom data). The list sub-command allows you to list capabilities, with some search options (e.g., openssl list --commands for commands). To learn more about individual sub-commands, you can check openssl <sub-command> --help or its man page (man openssl-<sub-command> or just man <sub-command>).
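For example, a common operational task is generating a private key and a self-signed certificate in one step with the req sub-command (the file names and subject here are arbitrary):

# Generate a new 2048-bit RSA key and a self-signed certificate valid for
# one year. -nodes leaves the key unencrypted, which is fine for testing only.
$ openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -subj "/CN=localhost" -keyout key.pem -out cert.pem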

openssl s_client -connect <host>:<port> will connect to a server and display detailed information about the server’s certificate. Here is the default invocation:

$ openssl s_client -connect <domain>:443
depth=2 O = Digital Signature Trust Co., CN = DST Root CA X3
verify return:1
depth=1 C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
verify return:1
depth=0 CN = <domain>
verify return:1
Certificate chain
0 s:CN = <domain>
i:C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
1 s:C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
i:O = Digital Signature Trust Co., CN = DST Root CA X3
Server certificate
[Content cut]
subject=CN = <domain>

issuer=C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3

No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
SSL handshake has read 3915 bytes and written 378 bytes
Verification: OK
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)

If you are using a self-signed CA, you can use -CAfile <path> to trust that CA. This will allow you to establish and verify connections against a self-signed certificate.
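That invocation looks like the following sketch (the host and file path here are hypothetical):

$ openssl s_client -connect <domain>:443 -CAfile ./my-ca.pem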

cURL
Curl is a data transfer tool that supports multiple protocols, notably HTTP and HTTPS.


wget is a similar tool to curl. Some distros or administrators may install it instead of curl.

Curl commands are of the form curl [options] <URL>. Curl prints the URL’s contents, and sometimes curl-specific messages to stdout. The default behavior is to make an HTTP GET request.

$ curl
<!doctype html>
    <title>Example Domain</title>
# Truncated

By default, curl does not follow redirects, such as HTTP 301s or protocol upgrades. The -L flag (or --location) will enable redirect following.

$ curl <url>
Redirecting to <url>

$ curl -L <url>
<!doctype html><html lang=en class=no-js><head>
# Truncated

Use the -X option to perform a specific HTTP verb, e.g., curl -X DELETE foo/bar to make a DELETE request.

You can supply data (for a POST, PUT, etc) in a few ways:

  • Urlencoded: -d "key1=value1&key2=value2"

  • JSON: -d '{"key1":"value1", "key2":"value2"}'

  • As a file in either format: -d @data.txt

The -H option adds an explicit header, although basic headers such as content-type are added automatically.

-H "Content-Type: application/x-www-form-urlencoded"

Here are some examples:

$ curl -d "key1=value1" -X PUT localhost:8080

$ curl -H "X-App-Auth: xyz" -d "key1=value1&key2=value2" -X POST https://localhost:8080/demo

Curl can help diagnose TLS issues. Just like a reputable browser, curl validates the certificate chain returned by HTTPS sites, checking against the host’s CA certificates.

Use specialized tools for TLS debugging

Curl can be of some help when debugging TLS issues, but more specialized tools such as openssl may be more helpful.

$ curl https://expired-tls-site
curl: (60) SSL certificate problem: certificate has expired
More details here:

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Like many programs, curl has a verbose flag, -v, which will print more information about the request and response. This is extremely valuable when debugging a layer-7 protocol such as HTTP.

$ curl https://expired-tls-site -v
*   Trying <ip address>...
* Connected to expired-tls-site (<ip address>) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS alert, certificate expired (557):
* SSL certificate problem: certificate has expired
* Closing connection 0
curl: (60) SSL certificate problem: certificate has expired
More details here:

# Truncated

Curl has many additional features that we have not covered, such as the ability to use timeouts, custom CA certs, custom DNS, and so on.
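For example, a sketch combining a few of these options (all paths, hosts, and addresses here are hypothetical):

# Fail after 5 seconds connecting (10 seconds total), trust a custom CA,
# and force the hostname to resolve to a specific address.
$ curl --connect-timeout 5 --max-time 10 \
    --cacert ./my-ca.pem \
    --resolve \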

This chapter has provided you with a whirlwind tour of networking in Linux. We focused primarily on concepts that are required to understand Kubernetes’ implementation, cluster setup constraints, and debugging Kubernetes-related networking problems (in workloads on Kubernetes, or Kubernetes itself). This chapter was by no means exhaustive, and you may find it valuable to learn more.


Next, we will start to look at containers in Linux, and how containers interact with the network.
