In “Why an Operating System at All?” we said that the main function of an operating system is to abstract over different hardware and provide us with an Application Programming Interface (API). Programming against this API allows us to write applications without having to worry about where and how they are executed. In a nutshell, the kernel provides such an API to programs.
In this chapter we discuss what the Linux kernel is and how you should be thinking about it as a whole as well as about its components. You will learn about the overall Linux architecture and the essential role the Linux kernel plays. One main takeaway of this chapter should be that while the kernel provides all the core functionality, on its own it is not the operating system but only an (admittedly very central) part of it.
First, we take a bird’s-eye view in “Linux Architecture”, looking at how the kernel fits in and interacts with the underlying hardware. Then, in “CPU Architectures” we review the computational core, discussing different CPU architectures and how they relate to the kernel. Next we zoom in on the individual kernel components in “Kernel Components” and discuss the API the kernel provides to the programs you can run. Finally, we look at how to customize and extend the Linux kernel in “Kernel Extensions”.
The purpose of this chapter is to equip you with the necessary terminology, make you aware of the interfacing between programs and the kernel, and give you a basic idea of what its functionality is. The chapter does not aim to turn you into a kernel developer or even a sysadmin configuring and compiling kernels. If you, however, want to dive into that, I’ve put together some pointers at the end.
And now, let’s jump into the deep end: the Linux architecture, and the central role the kernel plays in this context.
On a high level the Linux architecture looks as depicted in Figure 2-1. There are three distinct layers you can group things into:
At the bottom you find the hardware: from CPUs to main memory to disk drives, network interfaces and peripheral devices such as keyboards and monitors.
The kernel itself, the focus of the rest of this chapter.
The user land: where the majority of apps are running, including operating system components such as shells (discussed in Chapter 3), ssh, as well as graphical user interfaces such as X Window System-based desktops.
Our focus in this book is on the upper two layers of Figure 2-1, that is, the kernel and user land. The hardware layer, on the other hand, is something we only touch on in this and a few other chapters, where relevant.
The interfaces between the different layers are well defined and part of the Linux operating system package. Between the kernel and user land the interface is called system calls (syscalls for short) and we will explore this in detail in “Syscalls”.
The interface between the hardware and the kernel is, unlike the syscalls, not a single one. It consists of a collection of individual interfaces, usually grouped by hardware:
The CPU interface, represented by CPU architecture-specific code; see “CPU Architectures”.
The interface with the main memory, covered in “Memory Management”.
Network interfaces and drivers (wired and wireless), see also “Networking”.
Filesystem and block devices driver interfaces, see “Filesystems”.
Character devices, hardware interrupts, and device drivers, for input devices like keyboards, terminals and other I/O (“Device Drivers”).
As you can see, many of the things we usually consider part of the Linux operating system, such as the shell or utilities like ping, are, in fact, not part of the kernel but, very much like an app you download, part of user land.
On the topic of “user land”: you will often read or hear about user vs. kernel mode. These modes effectively differ in how privileged the access to hardware is and how restricted the available abstractions are.
In general, kernel mode means fast execution with limited abstraction, while user land mode means comparatively slower but safer and more convenient abstractions. Unless you are a kernel developer, you can almost always ignore kernel mode, since all your apps will run in user land. Knowing how to interact with the kernel (“Syscalls”), on the other hand, is vital and part of our considerations.
With this Linux architecture overview out of the way, let’s work our way up from the hardware.
Before we discuss the kernel components, let’s first review a basic concept: computer architectures, or CPU families, two terms we will use interchangeably. The fact that Linux runs on a large number of different CPU architectures is arguably one of the reasons it is so popular.
In addition to generic code and drivers, the Linux kernel contains architecture-specific code. This separation makes it possible to port Linux to new hardware quickly.
There are a number of ways to figure out what CPU your Linux system is running on; let’s have a look at a few of them in turn.
One way is to use a dedicated tool that interacts with the BIOS, called dmidecode. If this doesn’t yield results, you could try lscpu (output shortened):
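The following is an illustrative sketch of what lscpu might report; the exact fields and values depend on your machine:

$ lscpu
Architecture:        x86_64
CPU(s):              4
Model name:          Intel Core Processor (Haswell)
...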
The architecture we’re looking at here is x86_64.
It looks like there are four CPUs available.
The CPU model name is Intel Core Processor (Haswell).
In the previous command we saw that the CPU architecture was reported as x86_64 and the model as “Intel Core Processor (Haswell)”. We will learn more about how to decode this in a moment.
Another way to glean similar architecture information is using cat /proc/cpuinfo or, if you’re really only interested in the architecture, calling uname -m.
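For example, a quick check might look like this (output illustrative; your machine will report different values):

$ uname -m
x86_64
$ grep "model name" /proc/cpuinfo | head -1
model name      : Intel Core Processor (Haswell)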
Now that we have a handle on querying the architecture information on Linux, let’s see how to decode it.
x86 is an instruction set family originally developed by Intel and later licensed to AMD. Within the kernel, when you see x86_64, that refers to Intel 64-bit processors, while x86 stands for Intel 32-bit ones. Further, when you come across amd64, that refers to AMD 64-bit processors.
Even nowadays, you find the x86 CPU family mostly in desktops and laptops, but it is also widely used in servers. Specifically, x86 forms the basis of the public cloud. It is a powerful and widely available architecture; however, it is not very energy efficient. Partially due to its heavy reliance on out-of-order execution, it has recently received a lot of attention around security issues such as Meltdown.
For further details, for example the Linux/x86 boot protocol or Intel and AMD specific background, see the x86 specific kernel documentation.
Arm is a family of Reduced Instruction Set Computing (RISC) architectures that is more than 30 years old. RISC designs usually feature a large number of generic CPU registers along with a small set of instructions that can be executed quickly.
Because the designers at Acorn—the original company behind Arm—focused from the get-go on minimal power consumption, you find Arm-based chips in a number of portable devices such as iPhones. They are also in most Android-based phones and in embedded and IoT systems such as the Raspberry Pi.
Given that Arm-based CPUs are fast, cheap, and produce less heat than comparable x86 chips, you shouldn’t be surprised to increasingly find them in the datacenter as well, for example, AWS Graviton. While simpler than x86, Arm is not immune to vulnerabilities, for example, Spectre.
For further details, see the Arm specific kernel documentation.
An up-and-coming player, RISC-V (pronounced “risk five”) is an open RISC standard originally developed by the University of California, Berkeley (UCB). As of 2021, a number of implementations exist, ranging from Alibaba Group and Nvidia to a range of startups such as SiFive. While exciting, this is a relatively new and not (yet) widely used CPU family; to get an idea of how it looks and feels, you may want to research it a little (a good start is Shae Erisson’s article Linux on RISC-V).
For further details, see the RISC-V specific kernel documentation.
Now that you know the basics about CPU architectures, let’s move on to the kernel components.
Now that you have an idea of what the core compute units, the CPU architectures, are all about, it’s time to dive into the kernel. While the Linux kernel is a monolithic one—that is, all the components discussed are part of a single binary—there are functional areas in the code base that we can identify and to which we can ascribe dedicated responsibilities.
As we’ve discussed in “Linux Architecture”, the kernel sits between the hardware and the apps you want to run. The main functional blocks you find in the kernel code base are as follows:
Process management such as starting a process based on an executable file, as discussed in “Process Management”.
Memory management, for example, allocating memory for a process, which we will review in “Memory Management”.
Networking, like managing network interfaces or providing the TCP/IP stack, as described in “Networking”.
Filesystems, which you can read up on in “Filesystems”, supporting the creation and deletion of files, for example.
Management of character devices and device drivers as per “Device Drivers”.
These functional components oftentimes come with interdependencies, and it’s a truly challenging task to make sure that the kernel developer motto “kernel never breaks user land” holds true.
With that, let’s have a closer look at the kernel components.
There are a number of process management-related parts in the kernel. Some of them deal with CPU architecture-specific things such as interrupts, and others focus on the launching and scheduling of programs.
Before we get to Linux specifics, let’s note that, commonly, a process is the user-facing unit, based on an executable program (or binary). A thread, on the other hand, is a unit of execution in the context of a process. You might have come across the term multi-threading: this means that a process has a number of parallel executions going on, potentially running on different CPUs.
With the general view out of the way, let’s see how Linux goes about it. From the most coarse-grained to the smallest unit, Linux has:
Contains one or more process groups and represents a high-level user-facing unit with an optional tty attached. The kernel identifies a session via a number called the session ID (SID).
Contains one or more processes, with at most one process group in a session being the foreground process group. The kernel identifies a process group via a number called the process group ID (PGID).
A process is an abstraction that groups multiple resources (address space,
one or more threads, sockets, etc.), which the kernel exposes to you via
/proc/self for the current process. The kernel identifies a process via
a number that is called process ID (PID).
The kernel implements threads as processes. That is, there are no dedicated data structures representing threads. Rather, a thread is a process that shares certain resources (such as memory or signal handlers) with other processes. The kernel identifies a thread via a thread ID (TID) and a thread group ID (TGID), with the semantics that a shared TGID value means a multi-threaded process (in user land; there are also kernel threads, but that’s beyond our scope).
In the kernel there is a data structure called task_struct which forms the basis of implementing processes and threads alike. This data structure captures scheduling-related information (see below), identifiers (such as PID and TGID), signal handlers, as well as other information, including performance- and security-related data. In a nutshell, all of the above units are derived from and/or anchored in tasks; however, tasks are not exposed as such outside of the kernel.
We will see sessions, process groups, and processes in action and how to manage them in Chapter 6 as well as further on, in the context of containers, in Chapter 9.
Let’s see some of the above terms in action:
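One simple way to display these identifiers is ps with the -j (jobs format) flag; the following is a sketch whose values are illustrative and chosen to match the discussion below:

$ ps -j
  PID  PGID   SID TTY          TIME CMD
 6756  6756  6756 pts/0    00:00:00 bash
 6790  6790  6756 pts/0    00:00:00 ps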
The bash shell process has a PID, PGID, and SID of 6756. From ls -al /proc/6756/task/6756/ we can glean the task-level information.
The ps process has a PID/PGID of 6790 and the same SID as the shell.
We mentioned earlier on that in Linux the task data structure has some scheduling related information at the ready. This means that a process at any given time is in a certain state as shown in Figure 2-2.
Different events cause state transitions. For example, a running process might transition to the waiting state when it carries out some I/O operation (such as reading from a file) and can’t proceed with execution (off CPU).
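You can observe the current state of a process yourself via the STAT column of ps; for example (output illustrative), an idle interactive shell typically shows up as sleeping/waiting (S), with the trailing s indicating it is a session leader:

$ ps -o pid,stat,comm -p $$
  PID STAT COMMAND
 6756 Ss   bash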
After this short part on process management, let’s have a closer look at a very related topic: memory.
Virtual memory makes your system appear as if it has more memory than it physically has. In fact, every process gets a lot of (virtual) memory. The way it works is as follows: both physical and virtual memory are divided into fixed-length chunks we call pages.
Figure 2-3 shows the virtual address spaces of two processes, each with their own page tables. These page tables map virtual pages of the process into physical pages in main memory (aka RAM).
Multiple virtual pages can point to the same physical page via their respective process-level page tables. This is in a sense the core of memory management: how to effectively provide each process with the illusion that their page actually exists in RAM while using the existing space optimally.
Every time the CPU accesses a process’ virtual page, it would in principle have to translate the virtual address the process uses into the corresponding physical address. To speed up this process, which can be multi-level and hence slow, modern CPU architectures support an on-chip lookup called the translation lookaside buffer (TLB). The TLB is effectively a small cache that, in case of a miss, causes the CPU to go via the process page table(s) to calculate the physical address of a page and update the TLB with it.
Traditionally, Linux had a default page size of 4 KB, but since kernel v2.6.3 it supports huge pages, to better support modern architectures and workloads. For example, 64-bit Linux allows you to use up to 128 TB of virtual address space per process, with approximately 64 TB of physical memory in total.
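If you want to check the base page size your system uses, getconf will tell you; 4096 bytes, that is 4 KB, is the typical default on x86_64:

$ getconf PAGE_SIZE
4096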
OK, that was a lot of information, and also on the theoretical side; let’s have a look at it from a more practical point of view. A very useful tool to figure out memory-related information, such as how much RAM is available to you, is the /proc/meminfo interface:
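For example (values illustrative and shortened; they map to the items explained next):

$ grep MemTotal /proc/meminfo
MemTotal:        4014636 kB
$ grep VmallocTotal /proc/meminfo
VmallocTotal:   34359738367 kB
$ grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB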
List details on physical memory (RAM); that’s 4GB there.
List details on virtual memory; that’s a bit more than 34 TB there.
List huge pages info; apparently here the page size is 2MB.
With that, we move on to the next kernel function: networking.
One important function of the kernel is to provide networking functionality. No matter if you want to browse the Web, or if you want to copy data to a remote system, you depend on the network.
The Linux network stack follows a layered architecture:
The sockets, which abstract communication.
The Transmission Control Protocol (TCP) for connection-oriented communication as well as User Datagram Protocol (UDP) for connection-less communication.
The Internet Protocol (IP) for addressing machines.
These three layers are all that the kernel takes care of. The application layer protocols such as HTTP or SSH are, usually, implemented in user land.
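To see the socket layer in action you can, for example, list TCP and UDP sockets waiting for traffic with ss; the following is a hedged sketch, with addresses and ports purely illustrative:

$ ss -tuln
Netid State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
udp   UNCONN 0      0      127.0.0.53%lo:53    0.0.0.0:*
tcp   LISTEN 0      128    0.0.0.0:22          0.0.0.0:*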
You can get an overview of your network interfaces using ip link (output edited):
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
ip route provides you with routing information. Since we have a dedicated
networking chapter (Chapter 7) where we will dive deep into the networking stack,
the supported protocols, and typical operations, we keep it at this and move on to
the next kernel component, block devices and filesystems.
Linux uses filesystems to organize files and directories on storage devices such as hard disk drives (HDDs), solid-state drives (SSDs), or flash memory.
There are many types of filesystems, such as btrfs or NTFS, and you can have multiple instances of the same filesystem in use.
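To see which filesystem types your kernel currently knows about, you can consult /proc/filesystems; the following output is shortened and illustrative, since the list depends on your kernel configuration and loaded modules:

$ cat /proc/filesystems
nodev   sysfs
nodev   proc
nodev   tmpfs
        ext4
        btrfs
...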
Virtual File System (VFS) was originally introduced to support multiple filesystem types and instances. The highest layer in VFS provides a common API abstraction of functions such as open, close, read, and write. At the bottom of VFS are filesystem abstractions called plug-ins for the given filesystem.
We will go into greater detail concerning filesystems and file operations in Chapter 5.
A driver is a bit of code that runs in the kernel to manage a device, which can be actual hardware—like a keyboard or a disk drive—or a pseudo device. Another interesting class of hardware are Graphics Processing Units (GPUs), which traditionally were used to accelerate graphics output and thereby ease the load on the CPU. In the past years, GPUs have found a new use case in the context of machine learning, and hence they are no longer exclusively relevant in desktop environments.
The driver may be built statically into the kernel or it can be built as a kernel module (“Modules”) so that it can be dynamically loaded when needed.
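To see which modules are currently loaded, and where the module files for your running kernel live, you could use something along these lines (output shortened; the module names shown are illustrative):

$ lsmod | head -3
Module                  Size  Used by
fuse                  139264  3
xfs                  1515520  0
$ ls /lib/modules/$(uname -r)/kernel/
arch  crypto  drivers  fs  lib  mm  net  sound  ...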
If you’re interested in an interactive way to explore device drivers and how kernel components interact, check out the Linux kernel map.
The kernel driver model is complicated and out of scope for this book. However, here are a few hints on how to interact with it, just enough so that you know where to find what.
To get an overview of the devices on your Linux system, you can use:
$ ls -al /sys/devices/
total 0
drwxr-xr-x ... .
dr-xr-xr-x ... ..
drwxr-xr-x ... LNXSYSTM:00
drwxr-xr-x ... breakpoint
drwxr-xr-x ... isa
drwxr-xr-x ... kprobe
drwxr-xr-x ... msr
drwxr-xr-x ... pci0000:00
drwxr-xr-x ... platform
drwxr-xr-x ... pnp0
drwxr-xr-x ... software
drwxr-xr-x ... system
drwxr-xr-x ... tracepoint
drwxr-xr-x ... uprobe
Further, you can use the following to list mounted devices:
$ mount
sysfs on /sys type sysfs (...)
proc on /proc type proc (...)
devpts on /dev/pts type devpts (...)
...
tmpfs on /run/snapd/ns type tmpfs (...)
nsfs on /run/snapd/ns/lxd.mnt type nsfs (...)
And with this we have covered the Linux kernel components and can move on to the interface between the kernel and user land.
Whether you sit in front of a terminal and type touch test.txt or whether one of your apps wants to download the content of a file from a remote system, at the end of the day you ask Linux to turn a high-level instruction such as “create a file” or “read all bytes from address so-and-so” into a set of concrete, architecture-dependent steps. In other words, the service interface the kernel exposes and that user land entities call is the set of system calls, or syscalls for short.
Linux has hundreds of syscalls available (300+, depending on the CPU family). However, you and your programs don’t usually invoke these syscalls directly but via what we call the C standard library. The standard library provides wrapper functions and is available in various implementations, such as glibc or musl.
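You can verify this for a given binary by listing its dynamic library dependencies; if it links against libc.so.6 (glibc), syscalls are typically issued via those wrapper functions. Output shortened, paths and addresses illustrative:

$ ldd /bin/ls
        linux-vdso.so.1 (0x...)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)
        ...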
These wrapper libraries perform an important task: they take care of the repetitive low-level handling of the execution of a syscall. System calls are implemented as software interrupts, causing an exception that transfers control to an exception handler. There are a number of steps to take care of every time a syscall is invoked, as depicted in Figure 2-4:
Defined in syscall.h and architecture-dependent files, the kernel uses a so-called syscall table, effectively an array of function pointers in memory (stored in a variable called sys_call_table), to keep track of syscalls and their corresponding handlers.
The system_call() function acts like a syscall multiplexer: it first saves the hardware context on the stack, then performs checks (such as whether tracing is being performed), and then jumps to the function pointed to by the respective syscall number index in the sys_call_table.
After the syscall has completed with sysexit, the wrapper library restores the hardware context, and program execution resumes in user land.
Notable in the previous steps is the switching between kernel mode and user land mode, an operation that costs time.
OK, that was a little dry and theoretical, so to better appreciate how syscalls look and feel in practice, let’s have a look at a concrete example. We will use strace to look behind the curtain, a tool that is useful for troubleshooting, for example, if you don’t have the source code of an app but want to learn what it does.
Let’s assume you wonder what syscalls are involved when you execute the ls command. Here’s how you can find it out using strace:
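A heavily shortened, illustrative trace might look like this; the exact syscalls, paths, addresses, and directory contents will differ on your system:

$ strace ls
execve("/usr/bin/ls", ["ls"], 0x7ffd... /* 24 vars */) = 0
brk(NULL)                               = 0x5596...
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
...
write(1, "Desktop  Downloads\n", 19)    = 19
close(1)                                = 0
exit_group(0)                           = ?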