Chapter 2. The Linux Kernel

In “Why an Operating System at All?” we said that the main function of an operating system is to abstract over different hardware and provide us with an Application Programming Interface (API). Programming against this API allows us to write applications without having to worry about where and how they are executed. In a nutshell, the kernel provides such an API to programs.

In this chapter we discuss what the Linux kernel is and how you should think about it, both as a whole and in terms of its components. You will learn about the overall Linux architecture and the essential role the Linux kernel plays in it. One main takeaway of this chapter should be that while the kernel provides all the core functionality, on its own it is not the operating system but only one, admittedly very central, part of it.

First, we take a bird’s-eye view in “Linux Architecture”, looking at how the kernel fits in and interacts with the underlying hardware. Then, in “CPU Architectures”, we review the computational core, discussing different CPU architectures and how they relate to the kernel. Next, we zoom in on the individual kernel components in “Kernel Components” and discuss the API the kernel provides to the programs you can run. Finally, we look at how to customize and extend the Linux kernel in “Kernel Extensions”.

The purpose of this chapter is to equip you with the necessary terminology, make you aware of the interfacing between programs and the kernel, and give you a basic idea of what the functionality is. The chapter does not aim to turn you into a kernel developer or even a sysadmin who configures and compiles kernels. If you want to dive into that, however, I’ve put together some pointers at the end.

And now, let’s jump into the deep end: the Linux architecture, and the central role the kernel plays in this context.

Linux Architecture

On a high level the Linux architecture looks as depicted in Figure 2-1. There are three distinct layers you can group things into:

  • At the bottom you find the hardware: from CPUs to main memory to disk drives, network interfaces and peripheral devices such as keyboards and monitors.

  • The kernel itself, the focus of the rest of this chapter.

  • The user land: where the majority of apps are running, including operating system components such as shells (discussed in Chapter 3), utilities like ps or ssh, as well as graphical user interfaces such as X Window System-based desktops.

Our focus in this book is on the upper two layers of Figure 2-1, that is, the kernel and user land. The hardware layer, on the other hand, is something we only touch on in this and a few other chapters, where relevant.

The interfaces between the different layers are well defined and part of the Linux operating system package. Between the kernel and user land the interface is called system calls (syscalls for short) and we will explore this in detail in “Syscalls”.

The interface between the hardware and the kernel is, unlike the syscalls, not a single one. It consists of a collection of individual interfaces, usually grouped by hardware:

  1. The CPU interface, represented by architecture-specific code; see “CPU Architectures”.

  2. The interface with the main memory, covered in “Memory Management”.

  3. Network interfaces and drivers (wired and wireless), see also “Networking”.

  4. Filesystem and block devices driver interfaces, see “Filesystems”.

  5. Character devices, hardware interrupts, and device drivers, for input devices like keyboards, terminals and other I/O (“Device Drivers”).

Figure 2-1. A high-level view of the Linux architecture

As you can see, many of the things we usually consider part of the Linux operating system, such as a shell or utilities like grep, find, and ping, are in fact not part of the kernel but, very much like an app you download, part of user land.

On the topic of “user land”: you will often read or hear about user versus kernel mode. This effectively describes how privileged the access to hardware is and how restricted the available abstractions are.

In general, kernel mode means fast execution with limited abstraction, while user land (or user mode) means comparatively slower but safer and more convenient abstractions. Unless you are a kernel developer, you can almost always ignore kernel mode, since all your apps will run in user land. Knowing how to interact with the kernel (“Syscalls”), on the other hand, is vital and part of our considerations.

With this Linux architecture overview out of the way, let’s work our way up from the hardware.

CPU Architectures

Before we discuss the kernel components, let’s first review a basic concept: computer architectures or CPU families, two terms we will use interchangeably. The fact that Linux runs on a large number of different CPU architectures is arguably one of the reasons it is so popular.

Next to generic code and drivers, the Linux kernel contains architecture-specific code. This separation makes it possible to port Linux to new hardware quickly.

There are a number of ways to figure out what CPU your Linux system is running on; let’s have a look at a few of them in turn.

One way is to use dmidecode, a dedicated tool that interacts with the BIOS. If that doesn’t yield results, you can try the following (output shortened):

$ lscpu
Architecture:                x86_64 1
CPU op-mode(s):              32-bit, 64-bit
Byte Order:                  Little Endian
Address sizes:               40 bits physical, 48 bits virtual
CPU(s):                      4 2
On-line CPU(s) list:         0-3
Thread(s) per core:          1
Core(s) per socket:          4
Socket(s):                   1
NUMA node(s):                1
Vendor ID:                   GenuineIntel
CPU family:                  6
Model:                       60
Model name:                  Intel Core Processor (Haswell, no TSX, IBRS) 3
Stepping:                    1
CPU MHz:                     2592.094
...
1

The architecture we’re looking at here is x86_64.

2

It looks like there are four CPUs available.

3

The CPU model name is Intel Core Processor (Haswell).

In the previous command we saw that the CPU architecture was reported as x86_64 and the model as “Intel Core Processor (Haswell)”. We will learn how to decode this in a moment.

Another way to glean similar architecture information is to use cat /proc/cpuinfo or, if you’re really only interested in the architecture, to simply call uname -m.
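If you would rather query this from within a program, the uname(2) syscall (here used via its glibc wrapper) reports the same information; the following is a minimal C sketch:

#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname u;

    /* uname(2) fills in the kernel name, release, and hardware architecture */
    if (uname(&u) == -1) {
        perror("uname");
        return 1;
    }
    printf("kernel:       %s %s\n", u.sysname, u.release);
    printf("architecture: %s\n", u.machine); /* e.g., x86_64 or aarch64 */
    return 0;
}

Compiled with cc and run, it prints the same value that uname -m reports.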

Now that we have a handle on querying the architecture information on Linux, let’s see how to decode it.

x86 Architecture

x86 is an instruction set family originally developed by Intel and later licensed to AMD. Within the kernel, x64 refers to Intel 64-bit processors, and x86 stands for Intel 32-bit ones. Further, when you come across amd64, it refers to AMD 64-bit processors.

Nowadays you find the x86 CPU family mostly in desktops and laptops, but it is also widely used in servers. Specifically, x86 forms the basis of the public cloud. It is a powerful and widely available architecture; however, it is not very energy efficient. Partially due to its heavy reliance on out-of-order execution, it recently received a lot of attention around security issues such as Meltdown.

For further details, for example the Linux/x86 boot protocol or Intel and AMD specific background, see the x86 specific kernel documentation.

Arm Architecture

Arm is a more than 30-year-old family of Reduced Instruction Set Computing (RISC) architectures. RISC designs usually feature a large number of general-purpose CPU registers along with a small set of simple instructions that can be executed quickly.

Because the designers at Acorn—the original company behind Arm—focused from the get-go on minimal power consumption, you find Arm-based chips in a number of portable devices such as iPhones. They are also in most Android-based phones and in embedded systems in the IoT space, such as the Raspberry Pi.

Given that Arm-based CPUs are fast, cheap, and produce less heat than x86 chips, you shouldn’t be surprised to increasingly find them in the datacenter as well, for example, AWS Graviton. While simpler than x86, Arm is not immune to vulnerabilities, for example, Spectre.

For further details, see the Arm specific kernel documentation.

RISC-V Architecture

An up-and-coming player, RISC-V (pronounced “risk five”) is an open RISC standard that was originally developed by the University of California, Berkeley (UCB). As of 2021, a number of implementations exist, ranging from Alibaba Group and Nvidia to a range of startups such as SiFive. While exciting, this is a relatively new and not (yet) widely used CPU family; to get an idea how it looks and feels, you may want to research it a little (a good start is Shae Erisson’s article Linux on RISC-V).

For further details, see the RISC-V specific kernel documentation.

Now that you know the basics about CPU architectures let’s move on to the kernel components.

Kernel Components

Now that you have an idea of what the core compute units, the CPU architectures, are, it’s time to dive into the kernel. While the Linux kernel is a monolithic one—that is, all the components discussed are part of a single binary—there are functional areas in the code base that we can identify and ascribe dedicated responsibilities to.

As we’ve discussed in “Linux Architecture”, the kernel sits between the hardware and the apps you want to run. The main functional blocks you find in the kernel code base are as follows:

  • Process management such as starting a process based on an executable file, as discussed in “Process Management”.

  • Memory management, for example, allocating memory for a process, which we will review in “Memory Management”.

  • Networking, like managing network interfaces or providing the TCP/IP stack, as described in “Networking”.

  • Filesystems, supporting the creation and deletion of files, for example, as you can read up on in “Filesystems”.

  • Management of character devices and device drivers as per “Device Drivers”.

These functional components oftentimes come with interdependencies, and it’s a truly challenging task to make sure that the kernel developer motto “the kernel never breaks user land” holds true.

With that, let’s have a closer look at the kernel components.

Process Management

There are a number of process management-related parts in the kernel. Some of them deal with CPU architecture-specific things, such as interrupts, while others focus on the launching and scheduling of programs.

Before we get to Linux specifics, let’s note that, commonly, a process is the user-facing unit, based on an executable program (or binary). A thread, on the other hand, is a unit of execution in the context of a process. You might have come across the term multithreading, which means that a process has a number of parallel executions going on, potentially running on different CPUs.

With the general view out of the way, let’s see how Linux goes about it. From the largest to the smallest unit, Linux has:

Sessions

A session contains one or more process groups and represents a high-level, user-facing unit with an optional tty attached. The kernel identifies a session via a number called the session ID (SID).

Process groups

A process group contains one or more processes, with at most one process group in a session being the foreground process group. The kernel identifies a process group via a number called the process group ID (PGID).

Processes

A process is an abstraction that groups multiple resources (address space, one or more threads, sockets, etc.), which the kernel exposes to you via /proc/self for the current process. The kernel identifies a process via a number called the process ID (PID).

Threads

The kernel implements threads as processes. That is, there are no dedicated data structures representing threads. Rather, a thread is a process that shares certain resources (such as memory or signal handlers) with other processes. The kernel identifies a thread via a thread ID (TID) and a thread group ID (TGID); threads that share a TGID value belong to the same multi-threaded process (in user land; there are also kernel threads, but those are beyond our scope). The short C sketch after this list makes the TID/TGID relationship concrete.

Tasks

In the kernel there is a data structure called task_struct (defined in sched.h) that forms the basis for implementing processes and threads alike. This data structure captures scheduling-related information (discussed shortly), identifiers (such as PID and TGID), and signal handlers, as well as other information, including performance- and security-related data. In a nutshell, all of the above units are derived from and/or anchored in tasks; however, tasks are not exposed as such outside of the kernel.
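To make the thread versus process relationship a bit more tangible, here is a minimal C sketch (compile with cc -pthread) that starts a second thread. Both threads report the same PID, which is in fact the TGID, but different TIDs, obtained here via the raw gettid syscall:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Each thread prints the shared PID (the TGID) and its own TID. */
static void *report(void *arg)
{
    printf("%s: PID=%d TID=%ld\n", (const char *)arg,
           getpid(), syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    pthread_t t;

    report("main thread  ");
    if (pthread_create(&t, NULL, report, "second thread") != 0)
        return 1;
    pthread_join(t, NULL);
    return 0;
}

While such a program runs, you would also see both TIDs as subdirectories of /proc/<PID>/task/.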

We will see sessions, process groups, and processes in action and how to manage them in Chapter 6 as well as further on, in the context of containers, in Chapter 9.

Let’s see some of the above terms in action:

$ ps -j
PID    PGID   SID   TTY     TIME CMD
6756   6756   6756  pts/0   00:00:00 bash 1
6790   6790   6756  pts/0   00:00:00 ps 2
1

The bash shell process has PID, PGID, and SID of 6756. From ls -al /proc/6756/task/6756/ we can glean the task-level information.

2

The ps process has PID/PGID 6790 and the same SID as the shell.

We mentioned earlier that in Linux the task data structure has some scheduling-related information at the ready. This means that a process is, at any given time, in a certain state, as shown in Figure 2-2.

Figure 2-2. Linux process states

Different events cause state transitions. For example, a running process might transition to the waiting state when it carries out some I/O operation (such as reading from a file) and can’t proceed with execution (off CPU).

After this short part on process management, let’s have a closer look at a closely related topic: memory.

Memory Management

Virtual memory makes your system appear as if it has more memory than it physically has. In fact, every process gets a lot of (virtual) memory. The way it works is as follows: both physical and virtual memory are divided into fixed-length chunks we call pages.

Figure 2-3 shows the virtual address spaces of two processes, each with their own page tables. These page tables map virtual pages of the process into physical pages in main memory (aka RAM).

Figure 2-3. Virtual memory management overview

Multiple virtual pages can point to the same physical page via their respective process-level page tables. This is, in a sense, the core of memory management: how to effectively provide each process with the illusion that its pages actually exist in RAM while using the available space optimally.

Every time the CPU accesses a process’s virtual page, it would in principle have to translate the virtual address the process uses into the corresponding physical address. To speed up this process—which can involve multiple levels of page tables and hence be slow—modern CPU architectures support an on-chip lookup called the translation lookaside buffer (TLB). The TLB is effectively a small cache that, in case of a miss, causes the CPU to go via the process page table(s) to calculate the physical address of a page and update the TLB with it.

Traditionally, Linux has had a default page size of 4 KB, but since kernel v2.6.3 it also supports huge pages, to better support modern architectures and workloads. For example, 64-bit Linux allows you to use up to 128 TB of virtual address space per process, with approximately 64 TB of physical memory in total.
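To look at paging from a program’s perspective, here is a minimal C sketch. It queries the page size and maps one anonymous page via mmap(2); only when the page is actually written to does the kernel back the virtual page with a physical page frame:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Ask the kernel for the page size; traditionally 4 KB on x86_64. */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);

    /* Map one anonymous page into our virtual address space. */
    void *p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Writing to the page triggers a page fault; only now does the
       kernel back the virtual page with a physical page frame. */
    memset(p, 0x2A, (size_t)page_size);

    munmap(p, (size_t)page_size);
    return 0;
}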

OK, that was a lot of information; let’s now look at it from a more hands-on point of view. A very useful tool for figuring out memory-related information, such as how much RAM is available to you, is the /proc/meminfo interface:

$ grep MemTotal /proc/meminfo 1
MemTotal:        4014636 kB

$ grep VmallocTotal /proc/meminfo 2
VmallocTotal:   34359738367 kB

$ grep Huge /proc/meminfo 3
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
1

List details on physical memory (RAM); that’s 4 GB there.

2

List details on virtual memory; that’s a bit more than 34 TB there.

3

List huge pages info; here the huge page size is 2 MB.

With that, we move on to the next kernel function: networking.

Networking

One important function of the kernel is to provide networking functionality. Whether you want to browse the web or copy data to a remote system, you depend on the network.

The Linux network stack follows a layered architecture:

  • The sockets, which abstract communication.

  • The Transmission Control Protocol (TCP) for connection-oriented communication as well as User Datagram Protocol (UDP) for connection-less communication.

  • The Internet Protocol (IP) for addressing machines.

These three layers are all that the kernel takes care of. Application layer protocols such as HTTP or SSH are usually implemented in user land.
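Here is a minimal C sketch that touches all three layers from user land: it asks the kernel for a UDP socket, addresses a destination via IP, and hands over a datagram. The destination 127.0.0.1:9999 is just a made-up example; since UDP is connection-less, the send succeeds even if nothing is listening there:

#include <arpa/inet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Socket layer: ask the kernel for a UDP (datagram) socket. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1) {
        perror("socket");
        return 1;
    }

    /* IP layer: address the destination machine (here: localhost)
       and a hypothetical port 9999. */
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    /* UDP layer: hand a datagram to the kernel; connection-less means
       there is no handshake before sending. */
    const char msg[] = "ping";
    if (sendto(fd, msg, sizeof(msg), 0,
               (struct sockaddr *)&dst, sizeof(dst)) == -1)
        perror("sendto");

    close(fd);
    return 0;
}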

You can get an overview of your network interfaces using (output edited):

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode
   DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00
   brd 00:00:00:00:00:00
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state
   UP mode DEFAULT group default qlen 1000 link/ether 52:54:00:12:34:56
   brd ff:ff:ff:ff:ff:ff

Further, ip route provides you with routing information. Since we have a dedicated networking chapter (Chapter 7) where we will dive deep into the networking stack, the supported protocols, and typical operations, we’ll leave it at that and move on to the next kernel component: block devices and filesystems.

Filesystems

Linux uses filesystems to organize files and directories on storage devices such as hard disk drives (HDDs), solid-state drives (SSDs), or flash memory. There are many types of filesystems, such as ext4, btrfs, or NTFS, and you can have multiple instances of the same filesystem type in use.

The Virtual File System (VFS) was originally introduced to support multiple filesystem types and instances. The highest layer of the VFS provides a common API abstraction with functions such as open, close, read, and write. At the bottom of the VFS sit the filesystem abstractions, the plug-ins for the individual filesystems.
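Because the VFS provides this common API, the following minimal C sketch works unchanged no matter whether the underlying filesystem is ext4, btrfs, XFS, or tmpfs; the path /tmp/vfs-demo.txt is just an example:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char text[] = "hello, VFS\n";
    char buf[sizeof(text)];

    /* open/write/read/close all go through the VFS layer; the concrete
       filesystem is handled by the plug-in below it. */
    int fd = open("/tmp/vfs-demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    write(fd, text, strlen(text));

    lseek(fd, 0, SEEK_SET);              /* rewind to the start of the file */
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n >= 0) {
        buf[n] = '\0';
        printf("read back: %s", buf);
    }
    close(fd);
    return 0;
}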

We will go into greater detail concerning filesystems and file operations in Chapter 5.

Device Drivers

A driver is a bit of code that runs in the kernel to manage a device, which can be actual hardware—like a keyboard or a disk drive—or a pseudo device. Another interesting class of hardware is the graphics processing unit (GPU), which traditionally was used to accelerate graphics output and with it ease the load on the CPU. In the past years, GPUs have found a new use case in the context of machine learning, and hence they are no longer relevant exclusively in desktop environments.

The driver may be built statically into the kernel or it can be built as a kernel module (“Modules”) so that it can be dynamically loaded when needed.

Tip

If you’re interested in an interactive way to explore device drivers and how kernel components interact, check out the Linux kernel map.

The kernel driver model is complicated and out of scope for this book. However, here are a few hints on how to interact with it, just enough so that you know where to find what.
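As a small taste of how user land talks to a driver, the following C sketch asks the terminal (tty) driver, a character device, for the current window size using the ioctl syscall; run it in a terminal to get a sensible result:

#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct winsize ws;

    /* The terminal is a character device; its driver answers ioctl
       requests such as TIOCGWINSZ (get window size). */
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == -1) {
        perror("ioctl");
        return 1;
    }
    printf("terminal size: %d rows x %d columns\n", ws.ws_row, ws.ws_col);
    return 0;
}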

To get an overview on the devices on your Linux system, you can use:

$ ls -al /sys/devices/
total 0
drwxr-xr-x 15 root root 0 Aug 17 15:53 .
dr-xr-xr-x 13 root root 0 Aug 17 15:53 ..
drwxr-xr-x  6 root root 0 Aug 17 15:53 LNXSYSTM:00
drwxr-xr-x  3 root root 0 Aug 17 15:53 breakpoint
drwxr-xr-x  3 root root 0 Aug 17 17:41 isa
drwxr-xr-x  4 root root 0 Aug 17 15:53 kprobe
drwxr-xr-x  5 root root 0 Aug 17 15:53 msr
drwxr-xr-x 15 root root 0 Aug 17 15:53 pci0000:00
drwxr-xr-x 14 root root 0 Aug 17 15:53 platform
drwxr-xr-x  8 root root 0 Aug 17 15:53 pnp0
drwxr-xr-x  3 root root 0 Aug 17 15:53 software
drwxr-xr-x 10 root root 0 Aug 17 15:53 system
drwxr-xr-x  3 root root 0 Aug 17 15:53 tracepoint
drwxr-xr-x  4 root root 0 Aug 17 15:53 uprobe
drwxr-xr-x 18 root root 0 Aug 17 15:53 virtual

Further, you can use the following to list mounted devices:

$ mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
...
tmpfs on /run/snapd/ns type tmpfs (rw,nosuid,nodev,noexec,relatime,size=401464k,mode=755,inode64)
nsfs on /run/snapd/ns/lxd.mnt type nsfs (rw)

And with this we have covered the Linux kernel components and can move on to the interface between the kernel and user land.

Syscalls

Whether you sit in front of a terminal and type touch test.txt or one of your apps wants to download the contents of a file from a remote system, at the end of the day you ask Linux to turn a high-level instruction such as “create a file” or “read all bytes from address so-and-so” into a set of concrete, architecture-dependent steps. In other words, the service interface the kernel exposes and that user land entities call is the set of system calls, or syscalls for short.

Linux has hundreds of syscalls available—300+, depending on the CPU family. However, you and your programs don’t usually invoke these syscalls directly but via what we call the C standard library. The standard library provides wrapper functions and is available in various implementations such as glibc or musl.

These wrapper libraries perform an important task: they take care of the repetitive low-level handling of the execution of a syscall. System calls are implemented as software interrupts, causing an exception that transfers control to an exception handler. There are a number of steps to take care of every time a syscall is invoked, as depicted in Figure 2-4:

Figure 2-4. Syscall execution steps in Linux
  • Defined in syscall.h and architecture-dependent files, the kernel uses a so-called syscall table, effectively an array of function pointers in memory (stored in a variable called sys_call_table), to keep track of syscalls and their corresponding handlers.

  • The system_call() function acts like a syscall multiplexer: it first saves the hardware context on the stack, then performs checks (such as whether tracing is active), and then jumps to the function pointed to by the respective syscall number index in sys_call_table.

  • After the syscall completes with sysexit, the wrapper library restores the hardware context, and program execution resumes in user land.

Notable in the previous steps is the switching between kernel mode and user land mode, an operation that costs time.
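To see what the wrapper library does for you, the following minimal C sketch issues the same syscall twice: once via the familiar glibc wrapper and once directly via syscall(2), passing the syscall number yourself:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* The usual route: glibc's wrapper sets up the registers, traps into
       the kernel, and turns the result into a return value (or errno). */
    pid_t pid_wrapper = getpid();

    /* The manual route: invoke the syscall by its number yourself. */
    long pid_raw = syscall(SYS_getpid);

    printf("getpid() via glibc wrapper: %d\n", pid_wrapper);
    printf("getpid() via syscall(2):    %ld\n", pid_raw);
    return 0;
}

Both lines print the same PID; the only difference is who takes care of the low-level call setup.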

To better appreciate how syscalls look and feel in practice, let’s have a look at a concrete example. We will use strace to look behind the curtain—a tool useful for troubleshooting, for example, if you don’t have the source code of an app but want to learn what it does.

Let’s assume you wonder what syscalls are involved when you execute the innocent looking ls command. Here’s how you can find it out using strace:

$ strace ls 1
execve("/usr/bin/ls", ["ls"], 0x7ffe29254910 /* 24 vars */) = 0 2
brk(NULL)                           = 0x5596e5a3c000 3
...
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory) 4
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 5
...
read(3, "\177ELF\2\1\1\3"..., 832) = 832 6
...
1

With strace ls we ask strace to capture the syscalls that ls uses. Note that I edited the output, since strace spits out some 162 lines. Further, the output you see comes via stderr, so if you want to redirect it you have to use 2> here. Learn more about this in Chapter 3.

2

The syscall execve executes /usr/bin/ls, causing the shell process to be replaced.

3

The brk syscall is an outdated way to allocate memory; it’s safer and more portable to use malloc (a library function, not a syscall) instead.

4

The access syscall checks if the process is allowed to access a certain file.

5

The syscall openat opens the file /etc/ld.so.cache relative to a directory file descriptor (here the first argument, AT_FDCWD, which stands for the current working directory), using the flags O_RDONLY|O_CLOEXEC (last argument).

6

The read syscall reads 832 bytes (last argument) from a file descriptor (first argument, 3) into a buffer (second argument).

strace is useful to see exactly what syscalls have been called—in which order and with which arguments—effectively hooking into the live stream of events between user land and kernel. It’s also good for performance diagnostics. Let’s see where a curl command spends most of its time (output shortened):

$ strace -c  1
         curl -s https://mhausenblas.info > /dev/null 2
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 26.75    0.031965         148       215           mmap
 17.52    0.020935         136       153         3 read
 10.15    0.012124         175        69           rt_sigaction
  8.00    0.009561         147        65         1 openat
  7.61    0.009098         126        72           close
  ...
  0.00    0.000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00    0.119476         141       843        11 total
1

Use the -c option to generate overview stats of the syscalls used.

2

Discard all output of curl.

Interesting: the curl command here spends almost half of its time in the mmap and read syscalls, and the connect syscall takes 0.3 ms, which is not bad.

To get a feeling for the coverage, I’ve put together Table 2-1, which lists examples of widely used syscalls across kernel components as well as system-wide ones. You can look up the details of syscalls, including their parameters and return values, via section 2 of the man pages.

Table 2-1. Exemplary syscalls
Category Example syscalls

Process management

clone, fork, execve, wait, exit, getpid, setuid, setns, getrusage, capset, ptrace

Memory management

brk, mmap, munmap, mremap, mlock, mincore

Networking

socket, setsockopt, getsockopt, bind, listen, accept, connect, shutdown, recvfrom, recvmsg, sendto, sethostname, bpf

Filesystems

open, openat, close, mknod, rename, truncate, mkdir, rmdir, getcwd, chdir, chroot, getdents, link, symlink, unlink, umask, stat, chmod, utime, access, ioctl, flock, read, write, lseek, sync, select, poll, mount

Time

time, clock_settime, timer_create, alarm, nanosleep

Signals

kill, pause, signalfd, eventfd

Global

uname, sysinfo, syslog, acct, _sysctl, iopl, reboot

Now that you have a basic idea of the Linux kernel, its main components and interface, let’s move on to the question of how to extend it.

Kernel Extensions

In this section we focus on how to extend the kernel. In a sense, the content here is advanced and optional; in general, you won’t need it for your day-to-day work.

Note

Configuring and compiling your own Linux kernel is out of scope for this book. For information on how to do it, I recommend Linux Kernel in a Nutshell, written by Greg Kroah-Hartman, one of the main Linux kernel maintainers and project leads. He covers the entire range of tasks, from downloading the source code through configuration and installation steps to runtime kernel options.

Let’s start with something easy: how do you know what kernel version you’re using? You can use the following command to determine this:

$ uname -srm
Linux 5.11.0-25-generic x86_64

In my case, I’m using a relatively recent 5.11 kernel on an x86_64 machine; see also “x86 Architecture”.

Now that we know the kernel version, let’s address the question of how to extend the kernel out-of-tree, that is, without adding features to the kernel source code and then having to build it. For this kind of extension we can use modules, so let’s have a look at those.

Modules

In a nutshell, a module is a program that you can load into a kernel on demand. That is, you do not necessarily have to recompile the kernel and/or reboot the machine.
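To give you an idea of what a module looks like at the source level, here is a minimal sketch of a “hello world” kernel module written in C. Building it requires the kernel headers and a small kbuild Makefile (essentially obj-m += hello.o); loading the resulting hello.ko with insmod then makes the message show up in the kernel log (dmesg):

/* hello.c: a minimal loadable kernel module sketch */
#include <linux/init.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal hello world module");

static int __init hello_init(void)
{
    pr_info("hello: module loaded\n");
    return 0;   /* 0 means successful initialization */
}

static void __exit hello_exit(void)
{
    pr_info("hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);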

To list the available modules (output has been edited down, as on my system there are more than 1,000 of them):

$ find /lib/modules/$(uname -r) -type f -name '*.ko*'
/lib/modules/5.11.0-25-generic/kernel/ubuntu/ubuntu-host/ubuntu-host.ko
/lib/modules/5.11.0-25-generic/kernel/fs/nls/nls_iso8859-1.ko
/lib/modules/5.11.0-25-generic/kernel/fs/ceph/ceph.ko
/lib/modules/5.11.0-25-generic/kernel/fs/nfsd/nfsd.ko
...
/lib/modules/5.11.0-25-generic/kernel/net/ipv6/esp6.ko
/lib/modules/5.11.0-25-generic/kernel/net/ipv6/ip6_vti.ko
/lib/modules/5.11.0-25-generic/kernel/net/sctp/sctp_diag.ko
/lib/modules/5.11.0-25-generic/kernel/net/sctp/sctp.ko
/lib/modules/5.11.0-25-generic/kernel/net/netrom/netrom.ko

That’s great! But which modules did the kernel actually load? Let’s see (output shortened):

$ lsmod
Module                  Size  Used by
...
linear                 20480  0
crct10dif_pclmul       16384  1
crc32_pclmul           16384  0
ghash_clmulni_intel    16384  0
virtio_net             57344  0
net_failover           20480  1 virtio_net
ahci                   40960  0
aesni_intel           372736  0
crypto_simd            16384  1 aesni_intel
cryptd                 24576  2 crypto_simd,ghash_clmulni_intel
glue_helper            16384  1 aesni_intel

Note that the above info is also available via /proc/modules. This is thanks to the kernel exposing this information via a pseudo-filesystem interface; more on this topic in Chapter 6.

Want to learn more about a module or have a nice way to manipulate kernel modules? Then modprobe is your friend. For example, to list the dependencies:

$ modprobe --show-depends async_memcpy
insmod /lib/modules/5.11.0-25-generic/kernel/crypto/async_tx/async_tx.ko
insmod /lib/modules/5.11.0-25-generic/kernel/crypto/async_tx/async_memcpy.ko

Next up: an alternative, modern way to extend the kernel.

A Modern Way to Extend the Kernel: eBPF

An increasingly popular way to extend kernel functionality is eBPF. Originally known as Berkeley Packet Filter (BPF), nowadays the kernel project and technology is commonly referred to as eBPF (a term that no longer stands for anything).

Technically, eBPF is a feature of the Linux kernel, and you’ll need Linux kernel version 3.18 or above to benefit from it. It enables you to safely and efficiently extend Linux kernel functions by using the bpf syscall. eBPF is implemented as an in-kernel virtual machine using a custom 64-bit RISC instruction set.
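To give you an idea of what such a program looks like, here is a minimal sketch of an eBPF program written in restricted C. It assumes the libbpf headers are installed and would be compiled with clang -O2 -target bpf; once loaded and attached (for example with bpftool), it writes a line to the kernel trace buffer whenever a process calls the execve syscall:

/* execve_trace.bpf.c: minimal eBPF sketch (assumes libbpf's bpf_helpers.h) */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Attach to the tracepoint fired on entry of the execve syscall. */
SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(void *ctx)
{
    char comm[16];

    /* Helper functions like bpf_get_current_comm() form the (verified)
       API the kernel offers to eBPF programs. */
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_printk("execve called by %s", comm);
    return 0;
}

/* eBPF programs must declare a license; some helpers are GPL-only. */
char LICENSE[] SEC("license") = "GPL";

The resulting messages can be read from /sys/kernel/debug/tracing/trace_pipe.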

In Figure 2-5 you see a high-level overview taken from Brendan Gregg’s book Linux Extended BPF (eBPF) Tracing Tools:

Figure 2-5. eBPF overview in the Linux kernel

eBPF is already used in a number of places and for use cases such as:

  • In Kubernetes, as a CNI plugin to enable pod networking, for example in Cilium and Project Calico, as well as for service scalability.

  • For observability, such as Linux kernel tracing with iovisor/bpftrace, as well as in a clustered setup with Hubble (see Chapter 8).

  • As a security control, for example, to perform container runtime scanning with projects such as CNCF Falco.

  • For network load balancing, as in Facebook’s L4 load balancer library, katran.

In mid-2021 the Linux Foundation announced that Facebook, Google, Isovalent, Microsoft, and Netflix had joined together to create the eBPF Foundation, giving the eBPF project a vendor-neutral home. Stay tuned!

To dive deeper into the eBPF topic I recommend Matt Oswalt’s nice Introduction to eBPF. If you want to stay on top of things, have a look at ebpf.io.

Conclusion

The Linux kernel is the core of the Linux operating system, and no matter what distribution you are using or in whichever environment you use Linux, be it on your desktop or in the cloud, you should have a basic idea of its components and their functionality.

In this chapter we reviewed the overall Linux architecture, the role the kernel plays, and its interfaces. Most importantly, the kernel abstracts away the differences in hardware—CPU architectures and peripheral devices—and makes Linux very portable. The most important interface is the syscall interface, through which the kernel exposes its functionality, be it opening a file, allocating memory, or listing network interfaces.

We have also looked a bit at the inner workings of the kernel, including modules and eBPF as ways to extend the kernel functionality, as well as the lower-level functions offered by device drivers.

If you want to learn more about certain kernel aspects, the following resources should provide you with some starting points:

  1. General:

  2. Memory management:

  3. Device drivers:

  4. Syscalls:

Equipped with this knowledge, we are now ready to climb up the abstraction ladder a bit and move on to the primary user interface we consider in this book: the shell, both for manual usage and for automation through scripts.
