Kernel Synchronization - Part 1

As any developer familiar with programming in a multithreaded environment (or even a single-threaded one where multiple processes work on shared memory, or where interrupts are a possibility) is well aware, there is a need for synchronization whenever two or more threads (code paths in general) may race; that is, their outcome cannot be predicted. Pure code itself is never an issue as its permissions are read/execute (r-x); reading and executing code simultaneously on multiple CPU cores is not only perfectly fine and safe, but it's encouraged (it results in better throughput and is why multithreading is a good idea). However, the moment you're working on shared writeable data is the moment you need to start being very careful!

The discussions around concurrency and its control – synchronization – are varied, especially in the context of a complex piece of software such as the Linux kernel (its subsystems and related regions, such as device drivers), which is what we're dealing with in this book. Thus, for convenience, we will split this large topic into two chapters, this one and the next.

In this chapter, we will cover the following topics:

  • Critical sections, exclusive execution, and atomicity
  • Concurrency concerns within the Linux kernel
  • Mutex or spinlock? Which to use when
  • Using the mutex lock
  • Using the spinlock
  • Locking and interrupts

Let's get started!

Critical sections, exclusive execution, and atomicity

Imagine you're writing software for a multicore system (well, nowadays, it's typical that you will work on multicore systems, even on most embedded projects). As we mentioned in the introduction, running multiple code paths in parallel is not only safe, it's desirable (why spend those dollars otherwise, right?). On the other hand, concurrent (parallel and simultaneous) code paths within which shared writeable data (also known as shared state) is accessed in any manner is where you are required to guarantee that, at any given point in time, only one thread can work on that data at a time! This is really key; why? Think about it: if you allow multiple concurrent code paths to work in parallel on shared writeable data, you're literally asking for trouble: data corruption (a "race") can occur as a result.

What is a critical section?

A code path that can execute in parallel and that works on (reads and/or writes) shared writeable data (shared state) is called a critical section. Critical sections require protection from parallelism. Identifying and protecting critical sections from simultaneous execution is an implicit requirement for correct software that you – the designer/architect/developer – must handle.

A critical section is a piece of code that must run either exclusively; that is, alone (serialized), or atomically; that is, indivisibly, to completion, without interruption.

By exclusively, we're implying that at any given point in time, only one thread is running the code of the critical section; this is obviously required for data safety reasons.

This notion also brings up the important concept of atomicity: a single atomic operation is one that is indivisible. On any modern processor, two operations are considered to always be atomic; that is, they cannot be interrupted and will run to completion:

  • The execution of a single machine language instruction.
  • Reads or writes to an aligned primitive data type that is within the processor's word size (typically 32 or 64 bits); for example, reading or writing a 64-bit integer on a 64-bit system is guaranteed to be atomic. Threads reading that variable will never see an in-between, torn, or dirty result; they will either see the old or the new value.

So, if you have some lines of code that work upon shared (global or static) writeable data, it cannot – in the absence of any explicit synchronization mechanism – be guaranteed to run exclusively. Note that at times, running the critical section's code atomically, as well as exclusively, is required, but not all the time.

When the code of the critical section is running in a safe-to-sleep process context (such as typical file operations on a driver via a user app (open, read, write, ioctl, mmap, and so on), or the execution path of a kernel thread or workqueue), it might well be acceptable to not have the critical section being truly atomic. However, when its code is running in a non-blocking atomic context (such as a hardirq, tasklet, or softirq), it must run atomically as well as exclusively (we shall cover these points in more detail in the Mutex or spinlock? Which to use when section).

A conceptual example will help clarify things. Let's say that three threads (from user space app(s)) attempt to open and read from your driver more or less simultaneously on a multicore system. Without any intervention, they may well end up running the critical section's code in parallel, thus working on the shared writable data in parallel, thus very likely corrupting it! For now, let's look at a conceptual diagram to see how non-exclusive execution within a critical section's code path is wrong (we won't even talk about atomicity here):

Figure 6.1 – A conceptual diagram showing how a critical section code path is violated by having >1 thread running within it simultaneously

As shown in the preceding diagram, in your device driver, within its (say) read method, you have it run some code in order to perform its job (reading some data from the hardware). Let's take a more in-depth look at this diagram in terms of the data accesses being made at different points in time:

  • From time t0 to t1: None or only local variable data is accessed. This is concurrent-safe, with no protection required, and can run in parallel (since each thread has its own private stack).
  • From time t1 to t2: Global/static shared writeable data is accessed. This is not concurrent-safe; it's a critical section and thus must be protected from concurrent access. It should only contain code that runs exclusively (alone, exactly one thread at a time, serialized) and, perhaps, atomically.
  • From time t2 to t3: None or only local variable data is accessed. This is concurrent-safe, with no protection required, and can run in parallel (since each thread has its own private stack).

In this book, we assume that you are already aware of the need to synchronize critical sections; we will not discuss this particular topic any further. Those of you who are interested may refer to my earlier book, Hands-On System Programming with Linux (Packt, October 2018), which covers these points in detail (especially Chapter 15, Multithreading with Pthreads Part II – Synchronization).

So, knowing this, we can now restate the notion of a critical section while also mentioning when the situation arises (shown in square brackets and italics in the bullet points). A critical section is code that must run as follows:

  • (Always) Exclusively: Alone (serialized)
  • (When in an atomic context) Atomically: Indivisibly, to completion, without interruption

In the next section, we'll look at a classic scenario – the increment of a global integer.

A classic case – the global i ++

Think of this classic example: a global integer, i, is being incremented within a concurrent code path, one within which multiple threads of execution can simultaneously execute. A naive understanding of computer hardware and software will lead you to believe that this operation is obviously atomic. However, the reality is that modern hardware and software (the compiler and OS) are much more sophisticated than you may imagine, thus causing all kinds of invisible (to the app developer) performance-driven optimizations.

We won't attempt to delve into too much detail here, but the reality is that modern processors are extremely complex: among the many technologies they employ toward better performance are superscalar and super-pipelined execution (in order to execute multiple independent instructions and several parts of various instructions in parallel, respectively), on-the-fly instruction and/or memory reordering, caching memory in complex hierarchical on-CPU caches, and so on! We will delve into some of these details (including the closely related pitfall of false sharing) in Chapter 7, Kernel Synchronization – Part 2, in the Cache effects – false sharing and Memory barriers sections.

The paper What every systems programmer should know about concurrency by Matt Kline (April 2020, https://assets.bitbashing.io/papers/concurrency-primer.pdf) is superb and a must-read on this subject; do read it!

All of this makes for a situation that's more complex than it appears to be at first glance. Let's continue with the classic i ++:

static int i = 5;
[ ... ]
void foo(void)
{
    [ ... ]
    i ++;    // is this safe? yes, if truly atomic... but is it truly atomic??
}

Is this increment safe by itself? The short answer is no, you must protect it. Why? It's a critical section – we're accessing shared writeable data for a read and/or write operation. The longer answer is that it really depends on whether the increment operation is truly atomic (indivisible); if it is, then i ++ poses no danger in the presence of parallelism – if not, it does! So, how do we know whether i ++ is truly atomic or not? Two things determine this:

  • The processor's Instruction Set Architecture (ISA), which determines (among several things related to the processor at a low level) the machine instructions that execute at runtime.
  • The compiler.

If the ISA has the facility to employ a single machine instruction to perform an integer increment, and the compiler has the intelligence to use it, then it's truly atomic – it's safe and doesn't require locking. Otherwise, it's not safe and requires locking!

Try this out: navigate your browser to this wonderful compiler explorer website: https://godbolt.org/. Select C as the programming language and then, in the left pane, declare the global integer i and increment it within a function. Compile in the right pane with an appropriate compiler and compiler options. You'll see the actual machine code generated for the C high-level i ++; statement. If it's indeed a single machine instruction, then it will be safe; if not, you will require locking. By and large, you will find that you can't really tell: in effect, you cannot afford to assume things – you will have to assume it's unsafe by default and protect it! This can be seen in the following screenshot:

Figure 6.2 – Even with the latest stable gcc version but no optimization, the x86_64 gcc produces multiple instructions for the i ++

The preceding screenshot clearly shows this: the yellow background regions in the left- and right-hand panes are the C source and the corresponding assembly generated by the compiler, respectively (based on the x86_64 ISA and the compiler's optimization level). By default, with no optimization, i ++ becomes three machine instructions. This is exactly what we expect: it corresponds to the fetch (memory to register), the increment, and the store (register to memory)! Now, this is not atomic; it's entirely possible that, after one of the machine instructions executes, the control unit interferes and switches the instruction stream to a different point. This could even result in another process or thread being context switched in!

The good news is that with a quick -O2 in the Compiler options... window, i ++ becomes just one machine instruction – truly atomic! However, we can't predict these things in advance; one day, your code may execute on a fairly low-end ARM (RISC) system, increasing the chance that multiple machine instructions are required for i ++. (Worry not – we shall cover an optimized locking technology specifically for integers in the Using the atomic integer operators section.)

Modern languages provide native atomic operators; for C/C++, it's fairly recent (from 2011); the ISO C++11 and the ISO C11 standards provide ready-made and built-in atomic variables for this. A little googling will quickly reveal them to you. Modern glibc also makes use of them. As an example, if you've worked with signaling in user space, you will know to use the volatile sig_atomic_t data type to safely access and/or update an atomic integer within signal handlers. What about the kernel? In the next chapter, you'll learn about the Linux kernel's solution to this key issue. We'll cover this in the Using the atomic integer operators and Using the atomic bit operators sections.
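
As a quick illustration on the user space side, here's a minimal ISO C11 sketch (illustrative only – this snippet is not from the book's code base):

#include <stdatomic.h>

static atomic_int i = 5;    /* an ISO C11 atomic integer */

void foo(void)
{
    /* a truly atomic increment; no explicit lock required */
    atomic_fetch_add(&i, 1);
}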

The Linux kernel is, of course, a concurrent environment: multiple threads of execution run in parallel on multiple CPU cores. Not only that, but even on uni-processor (UP/single CPU) systems, the presence of hardware interrupts, traps, faults, exceptions, and software signals can cause data integrity issues. Needless to say, protecting against concurrency at required points in the code path is easier said than done; identifying and protecting critical sections using technologies such as locking as well as other synchronization primitives and technologies is absolutely essential, which is why this is the core subject matter of this chapter and the next.

Concepts – the lock

We require synchronization because, without any intervention, threads can concurrently execute critical sections where shared writeable data (shared state) is being worked upon. To defeat concurrency, we need to get rid of parallelism; we need to serialize code that's within the critical section – the place where the shared data is being worked upon (for reading and/or writing).

To force a code path to become serialized, a common technique is to use a lock. Essentially, a lock works by guaranteeing that precisely one thread of execution can "take" or own the lock at any given point in time. Thus, using a lock to protect a critical section in your code will give you what we're after – running the critical section's code exclusively (and perhaps atomically; more on this to come):

Figure 6.3 – A conceptual diagram showing how a critical section code path is honored, given exclusivity, by using a lock

The preceding diagram shows one way to fix the situation mentioned previously: using a lock to protect the critical section! How does the lock (and unlock) work, conceptually?

The basic premise of a lock is that whenever there is contention for it – that is, when multiple competing threads (say, n threads) attempt to acquire the lock (the LOCK operation) – exactly one thread will succeed. This is called the "winner" or the "owner" of the lock. It sees the lock API as a non-blocking call and thus continues to run happily – and exclusively – while executing the code of the critical section (the critical section is effectively the code between the lock and the unlock operations!). What happens to the n-1 "loser" threads? They (perhaps) see the lock API as a blocking call; they, to all practical effect, wait. Wait upon what? The unlock operation, of course, which is performed by the owner of the lock (the "winner" thread)! Once unlocked, the remaining n-1 threads now compete for the next "winner" slot; of course, exactly one of them will "win" and proceed forward; in the interim, the n-2 losers will now wait upon the (new) winner's unlock; this repeats until all n threads (finally and sequentially) acquire the lock.
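
In pseudocode, the pattern looks like this (purely conceptual – the actual kernel locking APIs are covered from the Using the mutex lock section onward):

LOCK(lock);    /* exactly one "winner" gets the lock; the n-1 "losers" wait here */
/* ... the critical section: work on the shared writeable data ... */
UNLOCK(lock);  /* done; one of the waiting "losers" becomes the next winner */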

Now, locking works, of course, but – and this should be quite intuitive – it results in (pretty steep!) overhead, as it defeats parallelism and serializes the execution flow! To help you visualize this situation, think of a funnel, with the narrow stem being the critical section where only one thread can fit at a time. All other threads get choked; locking creates bottlenecks:

Figure 6.4 – A lock creates a bottleneck, analogous to a physical funnel

Another oft-mentioned physical analogy is a highway with several lanes merging into one very busy – and choked with traffic – lane (a poorly designed toll booth, perhaps). Again, parallelism – cars (threads) driving in parallel with other cars in different lanes (CPUs) – is lost, and serialized behavior is required – cars are forced to queue one behind the other.

Thus, it is imperative that we, as software architects, try and design our products/projects so that locking is minimally required. While completely eliminating global variables is not practically possible in most real-world projects, optimizing and minimizing their usage is required. We shall cover more regarding this, including some very interesting lockless programming techniques, later.

Another really key point is that a newbie programmer might naively assume that performing reads on a shared writeable data object is perfectly safe and thus requires no explicit protection (with the exception of an aligned primitive data type that is within the processor's word size); this is untrue. This situation can lead to what's called dirty or torn reads, a situation where possibly stale data can be read as another writer thread is simultaneously writing while you are – incorrectly, without locking – reading the very same data item.

Since we're on the topic of atomicity, as we just learned, on a typical modern microprocessor, the only things guaranteed to be atomic are a single machine language instruction or a read/write to an aligned primitive data type within the processor bus's width. So, how can we mark a few lines of "C" code so that they're truly atomic? In user space, this isn't even possible (we can come close, but cannot guarantee atomicity).

How do you "come close" to atomicity in user space apps? You can always construct a user thread to employ a SCHED_FIFO policy and a real-time priority of 99. This way, when it wants to run, pretty much nothing besides hardware interrupts/exceptions can preempt it. (The old audio subsystem implementation heavily relied on this.)

In kernel space, we can write code that's truly atomic. How, exactly? The short answer is that we can use spinlocks! We'll learn about spinlocks in more detail shortly.

A summary of key points

Let's summarize some key points regarding critical sections. It's really important to go over these carefully, keep these handy, and ensure you use them in practice:

  • A critical section is a code path that can execute in parallel and that works upon (reads and/or writes) shared writeable data (also known as "shared state").
  • Because it works on shared writable data, the critical section requires protection from the following:
    • Parallelism (that is, it must run alone/serialized/in a mutually exclusive fashion)
    • When running in an atomic (interrupt) non-blocking context – atomically: indivisibly, to completion, without interruption. Once protected, you can safely access your shared state until you "unlock".
  • Every critical section in the code base must be identified and protected:
    • Identifying critical sections is critical! Carefully review your code and make sure you don't miss them.
    • Protecting them can be achieved via various technologies; one very common technique is locking (there's also lock-free programming, which we'll look at in the next chapter).
    • A common mistake is only protecting critical sections that write to global writeable data; you must also protect critical sections that read global writeable data; otherwise, you risk a torn or dirty read! To help make this key point clear, visualize an unsigned 64-bit data item being read and written on a 32-bit system; in such a case, the operation can't be atomic (two load/store operations are required). Thus, what if, while you're reading the value of the data item in one thread, it's being simultaneously written to by another thread!? The writer thread takes a "lock" of some sort but because you thought reading is safe, the lock isn't taken by the reader thread; due to an unfortunate timing coincidence, you can end up performing a partial/torn/dirty read (see the sketch just after this list)! We will learn how to overcome these issues by using various techniques in the coming sections and the next chapter.
    • Another deadly mistake is not consistently using the same lock to protect a given data item.
    • Failing to protect critical sections leads to a data race, a situation where the outcome – the actual value of the data being read/written – is "racy", which means it varies depending on runtime circumstances and timing – in other words, a bug. (A bug that, once in "the field", is extremely difficult to see, reproduce, determine the root cause of, and fix. We will cover some very powerful stuff to help you with this in the next chapter, in the Lock debugging within the kernel section; be sure to read it!)
  • Exceptions: You are safe (implicitly, without explicit protection) in the following situations:
    • When you are working on local variables. They're allocated on the private stack of the thread (or, in the interrupt context, on the local IRQ stack) and are thus, by definition, safe.
    • When you are working on shared writeable data in code that cannot possibly run in another context; that is, it's serialized by nature. In our context, the init and cleanup methods of an LKM qualify (they run exactly once, serially, on insmod and rmmod only).
    • When you are working on shared data that is truly constant and read-only (don't let C's const keyword fool you, though!).
  • Locking is inherently complex; you must carefully think, design, and implement this to avoid deadlocks. We'll cover this in more detail in the Locking guidelines and deadlocks section.
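
Here's a minimal sketch of the torn read scenario flagged in the list above – a 64-bit counter on a 32-bit system. All names here are fictional, and the spinlock APIs used are covered in the Using the spinlock section:

#include <linux/spinlock.h>

static u64 rx_bytes;               /* on 32-bit, a read/write takes two operations! */
static DEFINE_SPINLOCK(rx_lock);   /* the ONE lock that protects rx_bytes */

/* writer code path */
spin_lock(&rx_lock);
rx_bytes += len;       /* 'len' - bytes just received (illustrative) */
spin_unlock(&rx_lock);

/* reader code path: it must take the very same lock, else it risks a torn read */
spin_lock(&rx_lock);
cur = rx_bytes;        /* 'cur' - a local u64 snapshot (illustrative) */
spin_unlock(&rx_lock);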

Concurrency concerns within the Linux kernel

Recognizing critical sections within a piece of kernel code is of critical importance; how can you protect it if you can't even see it? The following are a few guidelines to help you, as a budding kernel/driver developer, recognize where concurrency concerns and thus critical sections may arise:

  • The presence of Symmetric Multi-Processor (SMP) systems (CONFIG_SMP)
  • The presence of a preemptible kernel
  • Blocking I/O
  • Hardware interrupts (on either SMP or UP systems)

These are critical points to understand, and we will discuss each in this section.

Multicore SMP systems and data races

The first point is pretty obvious; take a look at the pseudocode shown in the following screenshot:

Figure 6.5 – Pseudocode – a critical section within a (fictional) driver's read method; it's wrong as there's no locking

It's a similar situation to what we showed in Figures 6.1 and 6.3; it's just that here, we're showing the concurrency in terms of pseudocode. Clearly, from time t2 to time t3, the driver is working on some global shared writeable data, thus making this a critical section.

Now, visualize a system with, say, four CPU cores (an SMP system); two user space processes, P1 (running on, say, CPU 0) and P2 (running on, say, CPU 2), can concurrently open the device file and simultaneously issue a read(2) system call. Now, both processes will be concurrently executing the driver read "method", thus simultaneously working on shared writeable data! This (the code between t2 and t3) is a critical section, and since we are in violation of the fundamental exclusivity rule – critical sections must be executed by only a single thread at any point in time – we can very well end up corrupting the data, the application, or worse.

In other words, this is now a data race; depending on delicate timing coincidences, we may or may not generate an error (a bug). This very uncertainty – the delicate timing coincidence – is what makes finding and fixing errors like this extremely difficult (they can escape your testing effort).

This aphorism is all too unfortunately true: Testing can detect the presence of errors, not their absence. Adding to this, you're worse off if your testing fails to catch races (and bugs), allowing them free rein in the field.

You might feel that since your product is a small embedded system running on one CPU core (UP), this discussion regarding controlling concurrency (often, via locking) does not apply to you. We beg to differ: pretty much all modern products, if they haven't already, will move to multicore (in their next-generation phases, perhaps). More importantly, even UP systems have concurrency concerns, as we shall explore.

Preemptible kernels, blocking I/O, and data races

Imagine you're running your kernel module or driver on a Linux kernel that's been configured to be preemptible (that is, CONFIG_PREEMPT is on; we covered this topic in the companion guide Linux Kernel Programming, Chapter 10, The CPU Scheduler – Part 1). Consider that a process, P1, is running the driver's read method code in the process context, working on the global array. Now, while it's within the critical section (between time t2 and t3), what if the kernel preempts process P1 and context switches to another process, P2, which is just waiting to execute this very code path? It's dangerous, and again, a data race. This could well happen on even a UP system!

Another scenario that's somewhat similar (and again, could occur on either a single core (UP) or multicore system): process P1 is running through the critical section of the driver method (between time t2 and t3; again, see Figure 6.5). This time, what if, within the critical section, it hits a blocking call?

A blocking call is a function that causes the calling process context to be put to sleep, waiting upon an event; when that event occurs, the kernel will "wake up" the task, and it will resume execution from where it left off. This is also known as blocking on I/O and is very common; many APIs (including several user space library and system calls, as well as several kernel APIs) are blocking by nature. In such a case, process P1 is effectively context switched off the CPU and goes to sleep, which means that the code of schedule() runs and enqueues it onto a wait queue.

In the interim, before P1 gets switched back, what if another process, P2, is scheduled to run? What if that process is also running this particular code path? Think about it – by the time P1 is back, the shared data could have changed "underneath it", causing all kinds of errors; again, a data race, a bug!

Hardware interrupts and data races

Finally, envision this scenario: process P1 is, again, innocently running the driver's read method code; it enters the critical section (between time t2 and t3; again, see Figure 6.5). It makes some progress but then, alas, a hardware interrupt triggers (on the same CPU)! On the Linux OS, hardware (peripheral) interrupts have the highest priority; they preempt any code (including kernel code) by default. Thus, process (or thread) P1 will be at least temporarily shelved, thus losing the processor; the interrupt handling code will preempt it and run.

Well, you might be wondering, so what? Indeed, this is a completely commonplace occurrence! Hardware interrupts fire very frequently on modern systems, effectively (and literally) interrupting all kinds of task contexts (do a quick vmstat 3 on your shell; the column labeled in, under system, shows the number of hardware interrupts that fired on your system in the last second!). The key question to ask is this: is the interrupt handling code (either the hardirq top half or the so-called tasklet or softirq bottom half, whichever occurred) sharing and working upon the same shared writable data as the process context that it just interrupted?

If this is true, then, Houston, we have a problem – a data race! If not, then your interrupted code is not a critical section with respect to the interrupt code path, and that's fine. The fact is that the majority of device drivers do handle interrupt(s); thus, it is the driver author's (your!) responsibility to ensure that no global or static data – in effect, no critical sections – are shared between the process context and interrupt code paths. If they are (which does happen), you must somehow protect that data from data races and possible corruption.
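
As a preview of how such protection is typically achieved (the details follow in the Using the spinlock and Locking and interrupts sections), here's a minimal sketch with fictional names, where the very same lock protects data shared between the process and interrupt contexts:

#include <linux/spinlock.h>

static int drv_stats;               /* shared: process AND interrupt context */
static DEFINE_SPINLOCK(stats_lock);

/* in the driver's read method (process context) */
unsigned long flags;
spin_lock_irqsave(&stats_lock, flags);   /* take the lock AND disable interrupts */
drv_stats++;
spin_unlock_irqrestore(&stats_lock, flags);

/* in the hardirq handler (interrupts on this line are off while it runs) */
spin_lock(&stats_lock);
drv_stats--;
spin_unlock(&stats_lock);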

These scenarios might leave you feeling that protecting against these concurrency concerns is a really tall order; how exactly can you accomplish data safety in the face of critical sections existing, along with various possible concurrency concerns? Interestingly, the actual APIs are not hard to learn to use; again, we emphasize that recognizing critical sections is the key thing to do.

Again, the basics regarding how a lock (conceptually) works, locking guidelines (very important; we'll recap on them shortly), and the types of deadlocks and how to prevent them are all dealt with in my earlier book, Hands-On System Programming with Linux (Packt, October 2018). This book covers these points in detail in Chapter 15, Multithreading with Pthreads Part II – Synchronization.

Without further ado, let's dive into the primary synchronization technology that will serve to protect our critical sections – locking.

Locking guidelines and deadlocks

Locking, by its very nature, is a complex beast; it tends to give rise to complex interlocking scenarios. Not understanding it well enough can lead to both performance headaches and bugs – deadlocks, circular dependencies, interrupt-unsafe locking, and more. The following locking guidelines are key to ensuring correctly written code when using locking:

  • Locking granularity: The 'distance' between the lock and the unlock (in effect, the length of the critical section) should not be coarse (too long a critical section); it should be 'fine enough'. What does this mean? The points below explain this:
    • You need to be careful here. When you're working on large projects, keeping too few locks is a problem, as is keeping too many! Too few locks can lead to performance issues (as the same locks are repeatedly used and thus tend to be highly contended).
    • Having a lot of locks is actually good for performance, but not good for complexity control. This also leads to another key point to understand: with many locks in the code base, you should be very clear on which lock protects which shared data object. It's completely meaningless if you use, say, lockA to protect mystructX, but in a code path far away (perhaps an interrupt handler) you forget this and try and use some other lock, lockB, for protection when working on the same structure! Right now, these things might sound obvious, but (as experienced developers know), under sufficient pressure, even the obvious isn't always obvious!
    • Try and balance things out. In large projects, using one lock to protect one global (shared) data structure is typical. (Naming the lock variable well can become a big problem in itself! This is why we place the lock that protects a data structure within it as a member.)
  • Lock ordering is critical; locks must be taken in the same order throughout, and their order should be documented and followed by all the developers working on the project (annotating locks is useful too; more on this in the section on lockdep in the next chapter). Incorrect lock ordering often leads to deadlocks.
  • Avoid recursive locking as much as possible.
  • Take care to prevent starvation; verify that a lock, once taken, is indeed released "quickly enough".
  • Simplicity is key: Try to avoid complexity or over-design, especially with regard to complex scenarios involving locks.

On the topic of locking, the (dangerous) issue of deadlocks arises. A deadlock is the inability to make any progress; in other words, the app and/or kernel component(s) appear to hang indefinitely. While we don't intend to delve into the gory details of deadlocks here, I will quickly mention some of the more common types of deadlock scenarios that can occur:

  • Simple case, single lock, process context:
    • We attempt to acquire the same lock twice; this results in a self-deadlock.
  • Simple case, multiple (two or more) locks, process context  an example:
    • On CPU 0, thread A acquires lock A and then wants lock B.
    • Concurrently, on CPU 1, thread B acquires lock B and then wants lock A.
    • The result is a deadlock, often called the AB-BA deadlock (a sketch of it follows this list).
    • It can be extended; for example, the AB-BC-CA circular dependency (A-B-C lock chain) results in a deadlock.
  • Complex case, single lock, and process and interrupt contexts:
    • Lock A is held by a task running in the process context.
    • What if an interrupt now occurs on the same core and its handler attempts to take lock A? Deadlock is the result! Thus, a lock that is also acquired in the interrupt context must always be used with interrupts disabled. (How? We will look at this in more detail when we cover spinlocks.)
  • More complex cases, multiple locks, and process and interrupt (hardirq and softirq) contexts
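
Here's a minimal sketch of the AB-BA case mentioned above (the lock and thread names are fictional):

/* thread A, running on CPU 0 */
mutex_lock(&lockA);
mutex_lock(&lockB);   /* blocks: thread B holds lockB... */

/* ...while, concurrently, thread B, running on CPU 1 */
mutex_lock(&lockB);
mutex_lock(&lockA);   /* blocks: thread A holds lockA - the AB-BA deadlock! */

/* The fix: document a lock ordering - say, "always take lockA, then lockB" -
 * and have every code path abide by it. */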

In simpler cases, always following the lock ordering guideline is sufficient: always obtain and release locks in a well-documented order (we will provide an example of this in kernel code in the Using the mutex lock section). However, this can get very complex; complex deadlock scenarios can trip up even experienced developers. Luckily for us, lockdep – the Linux kernel's runtime lock dependency validator – can catch every single deadlock case! (Don't worry – we shall get there: we'll cover lockdep in detail in the next chapter.) When we cover spinlocks (in the Using the spinlock section), we'll come across process and/or interrupt context scenarios similar to the ones mentioned previously; the type of spinlock to use is made clear there.

With regard to deadlocks, a pretty detailed presentation on lockdep was given by Steve Rostedt at a Linux Plumber's Conference (back in 2011); the relevant slides are informative and explore both simple and complex deadlock scenarios, as well as how lockdep can detect them (https://blog.linuxplumbersconf.org/2011/ocw/sessions/153). 

 

Also, the reality is that not just deadlock, but even livelock situations, can be just as deadly! Livelock is essentially a situation similar to deadlock; it's just that the state of the participating tasks is running and not waiting. As an example, an interrupt "storm" can cause a livelock; modern network drivers mitigate this effect by switching off interrupts (under interrupt load) and resorting to a polling technique called New API (NAPI), switching interrupts back on when appropriate (well, it's more complex than that, but we'll leave it at that here).

Unless you've been living under a rock, you will know that the Linux kernel has two primary types of locks: the mutex lock and the spinlock. Actually, there are several more types, including other synchronization (and "lockless" programming) technologies, all of which will be covered in the course of this chapter and the next.

Mutex or spinlock? Which to use when

Learning the exact semantics of using the mutex lock and the spinlock is quite simple (with appropriate abstraction within the kernel API set, making it even easier for the typical driver developer or module author). The critical question in this situation is a conceptual one: what really is the difference between the two locks? More to the point, under which circumstances should you use which lock? You will learn the answers to these questions in this section.

Taking our previous driver read method's pseudocode (Figure 6.5) as a base example, let's say that three threads – tA, tB, and tC – are running in parallel (on an SMP system) through this code. We shall solve this concurrency issue, while avoiding any data races, by taking or acquiring a lock prior to the start of the critical section (time t2), and releasing the lock (unlocking) just after the end of the critical section code path (time t3). Let's take a look at the pseudocode once more, this time with locking to ensure it's correct:

Figure 6.6 – Pseudocode – a critical section within a (fictional) driver's read method; correct, with locking

When the three threads attempt to simultaneously acquire the lock, the system guarantees that only exactly one of them will get it. Let's say that tB (thread B) gets the lock: it's now the "winner" or "owner" thread. This means that threads tA and tC are the "losers"; what do they do? They wait upon the unlock! The moment the "winner" (tB) completes the critical section and unlocks the lock, the battle resumes between the previous losers; one of them will be the next winner and the process repeats.

The key difference between the two lock types – the mutex and the spinlock – is based on how the losers wait upon the unlock. With the mutex lock, the loser threads are put to sleep; that is, they wait by sleeping. The moment the winner performs the unlock, the kernel awakens the losers (all of them) and they run, again competing for the lock. (In fact, mutexes and semaphores are sometimes referred to as sleeplocks.)

With the spinlock, however, there is no question of sleeping; the losers wait by spinning upon the lock until it is unlocked. Conceptually, this looks as follows:

while (locked) ;

Note that this is only conceptual. Think about it a moment – this is actually polling. However, as a good programmer, you will understand that polling is usually considered a bad idea. Why, then, does the spinlock work this way? Well, it doesn't; it has only been presented in this manner for conceptual purposes. As you will soon understand, spinlocks only really have meaning on multicore (SMP) systems. On such systems, while the winner thread is away and running the critical section code, the losers wait by spinning on other CPU cores! In reality, at the implementation level, the code that's used to implement the modern spinlock is highly optimized (and arch-specific) and does not work by trivially "spinning" (for example, many spinlock implementations for ARM use the wait for event (WFE) machine language instruction, which has the CPU optimally wait in a low power state; see the Further reading section for several resources on the internal implementation of spinlocks).

Determining which lock to use in theory

How the spinlock is implemented is really not our concern here; the fact that the spinlock has a lower overhead than the mutex lock is of interest to us. How so? It's simple, really: for the mutex lock to work, the loser thread has to go to sleep. To do so, internally, the schedule() function gets called, which means the loser sees the mutex lock API as a blocking call! A call to the scheduler will ultimately result in the caller being context switched off the processor. Conversely, when the owner thread unlocks the lock, the loser thread(s) must be woken up; again, they will be context switched back onto the processor. Thus, the minimal "cost" of the mutex lock/unlock operation is the time it takes to perform two context switches on the given machine. (See the information box in the next section.) By looking at the preceding screenshot once more, we can determine a few things, including the time spent in the critical section (the "locked" code path); that is, t_locked = t3 - t2.

Let's say that t_ctxsw represents the time to context switch. As we've learned, the minimal cost of the mutex lock/unlock operation is 2 * t_ctxsw. Now, let's say that the following expression is true:

t_locked < 2 * t_ctxsw

In other words, what if the time spent within the critical section is less than the time taken for two context switches? In this case, using the mutex lock is just wrong, as this is far too much overhead; more time is being spent performing metawork than actual work – a phenomenon known as thrashing. (As a quick worked example, taking a context switch time of roughly 1.5 microseconds – see the information box that follows – any critical section shorter than about 3 microseconds spends more time on the two context switches than on the actual work.) It's precisely this use case – the presence of very short critical sections – that's often encountered on modern OSes such as Linux. So, in conclusion, for short non-blocking critical sections, using a spinlock is (far) superior to using a mutex lock.

Determining which lock to use in practice

So, operating under the t_locked < 2 * t_ctxsw "rule" might be great in theory, but hang on: are you really expected to precisely measure the context switch time and the time spent in the critical section of each and every case where one (a critical section) exists? No, of course not – that's pretty unrealistic and pedantic.

Practically speaking, think about it this way: the mutex lock works by having the loser threads sleep while waiting for the unlock; the spinlock does not (the losers "spin"). Let's recall one of our golden rules of the Linux kernel: the kernel cannot sleep (call schedule()) in any kind of atomic context. Thus, we can never use the mutex lock in an interrupt context, or indeed in any context where it isn't safe to sleep; using the spinlock, however, would be fine. (Remember, a blocking API is one that puts the calling context to sleep by calling schedule().) Let's summarize this:

  • Is the critical section running in an atomic (interrupt) context, or in the process context where it cannot sleep? Use the spinlock.
  • Is the critical section running in the process context, and is sleeping within the critical section necessary or possible? Use the mutex lock.

Of course, using the spinlock is considered lower overhead than using the mutex; thus, you can even use the spinlock in the process context (such as our fictional driver's read method), as long as the critical section does not block (sleep).
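
As a sketch of that last point (fictional names again; the actual APIs are detailed in the Using the spinlock section):

static DEFINE_SPINLOCK(slock);

/* in a process-context driver method */
spin_lock(&slock);
/* ... a short, guaranteed non-blocking, critical section ... */
spin_unlock(&slock);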

[1] The time taken for a context switch varies; it largely depends on the hardware and the OS quality. Recent (September 2018) measurements show that context switching time is in the region of 1.2 to 1.5 us (microseconds) on a pinned-down CPU, and around 2.2 us without pinning (https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/).

 

Both hardware and the Linux OS have improved tremendously, and because of that, so has the average context switching time. An old (December 1998) Linux Journal article determined that on an x86 class system, the average context switch time was 19 us (microseconds), and that the worst-case time was 30 us.

This brings up the question: how do we know if the code is currently running in a process or interrupt context? Easy: our PRINT_CTX() macro (within our convenient.h header) shows us this:

if (in_task()) {
    /* we're in process context (usually safe to sleep / block) */
} else {
    /* we're in an atomic or interrupt context (cannot sleep / block) */
}

Now that you understand which one mutex or spinlock to use and when, let's get into the actual usage. We'll begin with how to use the mutex lock!

Using the mutex lock

Mutexes are also called sleepable or blocking mutual exclusion locks. As you have learned, they are used in the process context if the critical section can sleep (block). They must not be used within any kind of atomic or interrupt context (top halves, bottom halves such as tasklets or softirqs, and so on), kernel timers, or even the process context where blocking is not allowed.

Initializing the mutex lock

A mutex lock "object" is represented in the kernel as a struct mutex data structure. Consider the following code:

#include <linux/mutex.h>
struct mutex mymtx;

To use a mutex lock, it must be explicitly initialized to the unlocked state. Initialization can be performed statically (declare and initialize the object) with the DEFINE_MUTEX() macro, or dynamically via the mutex_init() function (this is actually a macro wrapper over the __mutex_init() function).

For example, to declare and initialize a mutex object called mymtx, we can use DEFINE_MUTEX(mymtx);.

We can also do this dynamically. Why dynamically? Often, the mutex lock is a member of the (global) data structure that it protects (clever!). For example, let's say we have the following global context structure in our driver code (note that this code is fictional):

struct mydrv_priv {
    <member 1>
    <member 2>
    [...]
    struct mutex mymtx; /* protects access to mydrv_priv */
    [...]
};

Then, in your driver's (or LKM's) init method, do the following:

static int init_mydrv(struct mydrv_priv *drvctx)
{
    [...]
    mutex_init(&drvctx->mymtx);
    [...]
}

Keeping the lock variable as a member of the (parent) data structure it protects is a common (and clever) pattern that's used within Linux; this approach has the added benefit of avoiding namespace pollution and is unambiguous about which mutex protects which shared data item (a bigger problem than it might appear to be at first, especially in enormous projects such as the Linux kernel!).

Keep the lock protecting a global or shared data structure as a member within that data structure.

Correctly using the mutex lock

Typically, you can find very insightful comments within the kernel source tree. Here's a great one that neatly summarizes the rules you must follow to correctly use a mutex lock; please read this carefully:

// include/linux/mutex.h
/*
* Simple, straightforward mutexes with strict semantics:
*
* - only one task can hold the mutex at a time
* - only the owner can unlock the mutex
* - multiple unlocks are not permitted
* - recursive locking is not permitted
* - a mutex object must be initialized via the API
* - a mutex object must not be initialized via memset or copying
* - task may not exit with mutex held
* - memory areas where held locks reside must not be freed
* - held mutexes must not be reinitialized
* - mutexes may not be used in hardware or software interrupt
* contexts such as tasklets and timers
*
* These semantics are fully enforced when DEBUG_MUTEXES is
* enabled. Furthermore, besides enforcing the above rules, the mutex
* [ ... ]

As a kernel developer, you must understand the following:

  • A critical section causes the code path to be serialized, defeating parallelism. Due to this, it's imperative that you keep the critical section as short as possible. A corollary to this is lock data, not code.
  • Attempting to reacquire an already acquired (locked) mutex lock – which is effectively recursive locking – is not supported and will lead to a self-deadlock.
  • Lock ordering: This is a very important rule of thumb for preventing dangerous deadlock situations. In the presence of multiple threads and multiple locks, it is critical that the order in which locks are taken is documented and strictly followed by all the developers working on the project. The actual lock ordering itself isn't sacrosanct, but the fact that once it's been decided on it must be followed, is. While browsing through the kernel source tree, you will come across many places where the kernel developers ensure this is done, and they (usually) write a comment regarding this for other developers to see and follow. Here's a sample comment from the slab allocator code (mm/slub.c):
/*
* Lock order:
* 1. slab_mutex (Global Mutex)
* 2. node->list_lock
* 3. slab_lock(page) (Only on some arches and for debugging)
* [ ... ]

Now that we understand how mutexes work from a conceptual standpoint (and we understand their initialization), let's learn how to make use of the lock/unlock APIs. 

Mutex lock and unlock APIs and their usage

The actual locking and unlocking APIs for the mutex lock are as follows. The following code shows how to lock and unlock a mutex, respectively:

void __sched mutex_lock(struct mutex *lock);
void __sched mutex_unlock(struct mutex *lock);

(Ignore __sched here; it's just a compiler attribute that makes this function disappear from the WCHAN output, which shows up in procfs and with certain option switches to ps(1) (such as -l).)

Again, the comments within the source code in kernel/locking/mutex.c are very detailed and descriptive; I encourage you to take a look at this file in more detail. We've only shown some of its code here, which has been taken directly from the 5.4 Linux kernel source tree:

// kernel/locking/mutex.c
[ ... ]
/**
* mutex_lock - acquire the mutex
* @lock: the mutex to be acquired
*
* Lock the mutex exclusively for this task. If the mutex is not
* available right now, it will sleep until it can get it.
*
* The mutex must later on be released by the same task that
* acquired it. Recursive locking is not allowed. The task
* may not exit without first unlocking the mutex. Also, kernel
* memory where the mutex resides must not be freed with
* the mutex still locked. The mutex must first be initialized
* (or statically defined) before it can be locked. memset()-ing
* the mutex to 0 is not allowed.
*
* (The CONFIG_DEBUG_MUTEXES .config option turns on debugging
* checks that will enforce the restrictions and will also do
* deadlock debugging)
*
* This function is similar to (but not equivalent to) down().
*/
void __sched mutex_lock(struct mutex *lock)
{
    might_sleep();

    if (!__mutex_trylock_fast(lock))
        __mutex_lock_slowpath(lock);
}
EXPORT_SYMBOL(mutex_lock);

might_sleep() is a macro with an interesting debug property; on a debug kernel, it catches code that might sleep being executed in an atomic context! So, think about it: might_sleep(), which is the first line of code in mutex_lock(), implies that this code path should never be executed by anything that's in an atomic context, since it might sleep. This means that you should only use the mutex in the process context, when it's safe to sleep!
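
This also means that you can use might_sleep() as a debug annotation within your own potentially-blocking code paths; a quick sketch (the function name is illustrative):

/* a potentially-blocking helper routine in our (fictional) driver */
static void mydrv_helper(void)
{
    might_sleep();  /* on a debug kernel, complains loudly (stack trace!)
                     * if we're ever invoked from an atomic context */
    /* ... code that might block ... */
}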

A quick and important reminder: The Linux kernel can be configured with a large number of debug options; in this context, the CONFIG_DEBUG_MUTEXES config option will help you catch possible mutex-related bugs, including deadlocks. Similarly, under the Kernel Hacking menu, you will find a large number of debug-related kernel config options. We discussed this in the companion guide Linux Kernel Programming - Chapter 5, Writing Your First Kernel Module – LKMs Part 2. There are several very useful kernel configs with regard to lock debugging; we shall cover these in the next chapter, in the Lock debugging within the kernel section.

Mutex lock – via [un]interruptible sleep?

As usual, there's more to the mutex than what we've seen so far. You already know that a Linux process (or thread) cycles through various states of a state machine. On Linux, sleeping has two discrete states – an interruptible sleep and an uninterruptible sleep. A process (or thread) in an interruptible sleep is sensitive to, and will respond to, user space signals, whereas a task in an uninterruptible sleep is not sensitive to user signals.

In a human-interactive application with an underlying driver, as a general rule of thumb, you should typically put a process into an interruptible sleep (while it's blocking upon the lock), thus leaving it up to the end user as to whether to abort the application by pressing Ctrl + C (or some such mechanism involving signals). There is a design rule that's often followed on Unix-like systems: provide mechanism, not policy. Having said this, on non-interactive code paths, it's often the case that you must wait upon the lock indefinitely, with the semantic that a signal that's been delivered to the task should not abort the blocking wait. On Linux, the uninterruptible case turns out to be the most common one.

So, here's the thing: the mutex_lock() API always puts the calling task into an uninterruptible sleep. If this is not what you want, use the mutex_lock_interruptible() API to put the calling task into an interruptible sleep. There is one difference syntax-wise: the latter returns an integer value – 0 on success and -EINTR (remember the 0/-E return convention) on failure (due to signal interruption).
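
In practice, the return value must be checked; a typical usage sketch in a driver method (reusing our fictional drvctx example from earlier):

if (mutex_lock_interruptible(&drvctx->mymtx))
    return -ERESTARTSYS;  /* interrupted by a signal; returning -ERESTARTSYS
                           * here (a common driver convention) lets the kernel
                           * restart or fail the syscall as appropriate */
/* ... critical section ... */
mutex_unlock(&drvctx->mymtx);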

In general, using mutex_lock() is faster than using mutex_lock_interruptible(); use it when the critical section is short (thus pretty much guaranteeing that the lock is held for a short while, which is a very desirable characteristic).

The 5.4.0 kernel contains over 18,500 and just over 800 instances of calling the mutex_lock() and mutex_lock_interruptible() APIs, respectively; you can check this out via the powerful cscope(1) utility on the kernel source tree.

In theory, the kernel provides a mutex_destroy() API as well. This is the opposite of mutex_init(); its job is to mark the mutex as being unusable. It must only be invoked once the mutex is in the unlocked state, and once invoked, the mutex cannot be used. This is a bit theoretical because, on regular systems, it just reduces to an empty function; only on a kernel with CONFIG_DEBUG_MUTEXES enabled does it become actual (simple) code. Thus, we should use this pattern when working with the mutex, as shown in the following pseudocode:

DEFINE_MUTEX(...);        // init: initialize the mutex object
/* or */ mutex_init();
[ ... ]
/* critical section: perform the (mutex) locking, unlocking */
mutex_lock[_interruptible]();
<< ... critical section ... >>
mutex_unlock();
mutex_destroy(); // cleanup: destroy the mutex object

Now that you have learned how to use the mutex lock APIs, let's put this knowledge to use. In the next section, we will build on top of one of our earlier (poorly written – no protection!) "misc" drivers by employing the mutex object to lock critical sections as required.

Mutex locking – an example driver

We created a simple device driver code example in Chapter 1, Writing a Simple misc Character Device Driver; that is, ch1/miscdrv_rdwr. There, we wrote a simple misc class character device driver and used a user space utility program (ch12/miscdrv_rdwr/rdwr_drv_secret.c) to read and write a (so-called) secret from and to the device driver's memory.

However, what we glaringly (egregiously is the right word here!) failed to do in that code is protect shared (global) writeable data! This will cost us dearly in the real world. I urge you to take some time to think about this: isn't it viable that two (or three or more) user mode processes open the device file of this driver, and then concurrently issue various I/O reads and writes? Here, the global shared writable data (in this particular case, two global integers and the driver context data structure) could easily get corrupted.

So, let's learn from and correct our mistakes by making a copy of this driver (we will now call it ch12/1_miscdrv_rdwr_mutexlock/1_miscdrv_rdwr_mutexlock.c) and rewriting some portions of it. The key point is that we must use mutex locks to protect all critical sections. Instead of displaying the code here (it's in this book's GitHub repository at https://github.com/PacktPublishing/Linux-Kernel-Programming, after all, please do git clone it!), let's do something interesting: let's look at a "diff" (the differences – the delta generated by diff(1)) between the older unprotected version and the newer protected code version. The output here has been truncated:

$ pwd
<.../ch12/1_miscdrv_rdwr_mutexlock
$ diff -u ../../ch12/miscdrv_rdwr/miscdrv_rdwr.c miscdrv_rdwr_mutexlock.c > miscdrv_rdwr.patch
$ cat miscdrv_rdwr.patch
[ ... ]
+#include <linux/mutex.h> // mutex lock, unlock, etc
#include "../../convenient.h"
[ ... ]
-#define OURMODNAME "miscdrv_rdwr"
+#define OURMODNAME "miscdrv_rdwr_mutexlock"

+DEFINE_MUTEX(lock1); // this mutex lock is meant to protect the integers ga and gb
[ ... ]
+ struct mutex lock; // this mutex protects this data structure
};
[ ... ]

Here, we can see that in the newer safe version of the driver, we have declared and initialized a mutex variable called lock1; we shall use it to protect the (just for demonstration purposes) two global integers, ga and gb, within our driver. Next, importantly, we declared a mutex lock named lock within the "driver context" data structure; that is, drv_ctx. This will be used to protect any and all access to members of that data structure. It is initialized within the init code:

+     mutex_init(&ctx->lock);
+
+ /* Initialize the "secret" value :-) */
strscpy(ctx->oursecret, "initmsg", 8);
- dev_dbg(ctx->dev, "A sample print via the dev_dbg(): driver initialized ");
+ /* Why don't we protect the above strscpy() with the mutex lock?
+ * It's working on shared writable data, yes?
+ * Yes, BUT this is the init code; it's guaranteed to run in exactly
+ * one context (typically the insmod(8) process), thus there is
+ * no concurrency possible here. The same goes for the cleanup
+ * code path.
+ */

This detailed comment clearly explains why we don't need to lock/unlock around strscpy(). Again, this should be obvious, but local variables are implicitly private to each process context (as they reside in that process or thread's kernel mode stack) and therefore require no protection (each thread/process has a separate instance of the variable, so no one steps on anyone's toes!). Before we forget, the cleanup code path (which is invoked via the rmmod(8) process context) must destroy the mutexes:

-static void __exit miscdrv_rdwr_exit(void)
+static void __exit miscdrv_exit_mutexlock(void)
{
+ mutex_destroy(&lock1);
+ mutex_destroy(&ctx->lock);
misc_deregister(&llkd_miscdev);
}

Now, let's look at the diff of the driver's open method:

+
+ mutex_lock(&lock1);
+ ga++; gb--;
+ mutex_unlock(&lock1);
+
+ dev_info(dev, " filename: "%s" "
[ ... ]

This is where we manipulated the global integers, making this a critical section; unlike the previous version of this program, here, we do protect this critical section with the lock1 mutex. So, there it is: the critical section here is the code between the (mutex) lock and unlock operations – ga++; gb--;.

But (there's always a but, isn't there?), all is not well! Take a look at the printk function (dev_info()) following the mutex_unlock() line of code:

+ dev_info(dev, " filename: \"%s\"\n"
+          " wrt open file: f_flags = 0x%x\n"
+          " ga = %d, gb = %d\n",
+          filp->f_path.dentry->d_iname, filp->f_flags, ga, gb);

Does this look okay to you? No, look carefully: we are reading the value of the global integers, ga and gb. Recall the fundamentals: in the presence of concurrency (which is certainly a possibility here in this driver's open method), even reading shared writeable data without the lock is potentially unsafe. If this doesn't make sense to you, please think: what if, while one thread is reading the integers, another is simultaneously updating (writing) them; what then? This kind of situation is called a dirty read (or a torn read); we might end up reading stale data, and this must be protected against. (The fact is that this isn't really a great example of a dirty read as, on most processors, reading and writing single integer items does tend to be an atomic operation. However, we must not assume such things – we must simply do our job and protect every critical section.)

In fact, there's another similar bug-in-waiting: we have read data from the open file structure (the filp pointer) without bothering to protect it (indeed, the open file structure has a lock; we're supposed to use it! We shall do so later).

The precise semantics of how and when things such as dirty reads occur does tend to be very arch (machine)-dependent; nevertheless, our job as module or driver authors is clear: we must ensure that we protect all critical sections. This includes reads upon shared writable data.
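Before moving on, note that the straightforward (if heavier) fix is simply to hold the lock across the read as well; here's a minimal sketch, reusing this driver's lock1 mutex (dev, filp, ga, and gb are as in the driver code above):

/* Sketch: the simple (heavier) fix - hold the mutex across the read, too */
mutex_lock(&lock1);
dev_info(dev, " filename: \"%s\"\n"
         " wrt open file: f_flags = 0x%x\n"
         " ga = %d, gb = %d\n",
         filp->f_path.dentry->d_iname, filp->f_flags, ga, gb);
mutex_unlock(&lock1);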

For now, we shall just flag these as potential errors (bugs). We will take care of this in the Using the atomic integer operators section, in a more performance-friendly manner. Looking at the diff of the driver's read method reveals something interesting (ignore the line numbers shown here; they might change):

Figure 6.7 – The diff of the driver's read() method; see the usage of the mutex lock in the newer version

We have now used the driver context structure's mutex lock to protect the critical sections. The same goes for both the write and close (release) methods of the device driver (generate the patch for yourself and take a look).

Note that the user mode app remains unchanged, which means that to test the new, safer version, we must continue using the user mode app at ch12/miscdrv_rdwr/rdwr_test_secret.c. Running and testing code such as this driver code on a debug kernel, which contains various locking error and deadlock detection capabilities, is crucial (we'll return to these "debug" capabilities in the next chapter, in the Lock debugging within the kernel section).

In the preceding code, we took the mutex lock just before the copy_to_user() routine; that's fine. However, we only release it after dev_info(). Why not release it before this printk, thus shortening the critical section?

A closer look at dev_info() reveals why it's within the critical section. We are printing the values of three variables here: the number of bytes read (in the local variable secret_len) and the number of bytes "transmitted" and "received" by the driver (in ctx->tx and ctx->rx, respectively). secret_len is a local variable and does not require protection, but the other two are members of the global driver context structure and thus do require protection, even from (possibly dirty) reads.

The mutex lock – a few remaining points

In this section, we will cover a few additional points regarding mutexes.

Mutex lock API variants

First, let's take a look at a few variants of the mutex lock API; besides the interruptible variant (described in the Mutex lock via [un]interruptible sleep? section), we have the trylock, killable, and io variants.

The mutex trylock variant

What if you would like to implement a busy-wait semantic; that is, test for the availability of the (mutex) lock and, if available (meaning it's currently unlocked), acquire/lock it and continue with the critical section code path? If it's not available (that is, it's currently in the locked state), do not wait for the lock; instead, perform some other work and retry. In effect, this is a non-blocking mutex lock variant, called the trylock; the following flowchart shows how it works:

Figure 6.8 – The "busy wait" semantic, a non-blocking trylock operation

The API for this trylock variant of the mutex lock is as follows:

int mutex_trylock(struct mutex *lock);

This API's return value signifies what transpired at runtime:

  • A return value of 1 indicates that the lock has been successfully acquired.
  • A return value of 0 indicates that the lock is currently contended (locked).

Though it might sound tempting, do not attempt to use the mutex_trylock() API to figure out whether a mutex lock is in a locked or unlocked state; this is inherently "racy". Next, note that using this trylock variant in a highly contended lock path may well reduce your chances of acquiring the lock. The trylock variant has traditionally been used in deadlock prevention code that might need to back out of a certain lock ordering sequence and retry via another sequence (ordering).

Also, with respect to the trylock variant, even though the literature uses the phrase try and acquire the mutex atomically, it does not work in an atomic or interrupt context – it only works in the process context (as with any type of mutex lock). As usual, the lock must be released via mutex_unlock(), invoked by the owner context.
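Putting the preceding points together, here's a minimal, hedged sketch of the trylock semantic (the lock name and the do_other_work() helper are hypothetical, the latter standing in for useful lock-free work):

#include <linux/mutex.h>

static DEFINE_MUTEX(mylock);    /* a lock of our own, for illustration */

static int try_critical_work(void)
{
    int tries = 5;

    while (tries--) {
        if (mutex_trylock(&mylock)) {   /* returns 1 on success */
            /* ... critical section: work on the shared data ... */
            mutex_unlock(&mylock);  /* the owner releases, as always */
            return 0;
        }
        do_other_work();    /* hypothetical: useful lock-free work; then retry */
    }
    return -EAGAIN;     /* lock remained contended; let the caller decide */
}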

I suggest that you try working on the trylock mutex variant as an exercise. See the Questions section at the end of this chapter for an assignment!

The mutex interruptible and killable variants

As you have already learned, the mutex_lock_interruptible() API is used when the driver (or module) is willing to acknowledge any (user space) signal interrupting it; on interruption, the driver returns -ERESTARTSYS to tell the kernel VFS layer to perform signal handling (the user space system call will fail with errno set to EINTR). An example can be found in the module handling code in the kernel, within the delete_module(2) system call (which rmmod(8) invokes):

// kernel/module.c
[ ... ]
SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
        unsigned int, flags)
{
    struct module *mod;
    [ ... ]
    if (!capable(CAP_SYS_MODULE) || modules_disabled)
        return -EPERM;
    [ ... ]
    if (mutex_lock_interruptible(&module_mutex) != 0)
        return -EINTR;
    mod = find_module(name);
    [ ... ]
out:
    mutex_unlock(&module_mutex);
    return ret;
}

Notice how the API returns -EINTR on failure. (The SYSCALL_DEFINEn() macro becomes a system call signature; n signifies the number of parameters this particular system call accepts. Also, notice the capability check – unless you are running as root or have the CAP_SYS_MODULE capability (or module loading is completely disabled), the system call just returns a failure (-EPERM).)

If, however, your driver is only willing to be interrupted by fatal signals (those that will kill the user space context), then use the mutex_lock_killable() API (its signature is identical to that of the interruptible variant).
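As a hedged sketch of how these variants typically appear in a driver method (ctx and its lock member follow this chapter's driver-context pattern; the function name is ours):

static ssize_t read_method(struct file *filp, char __user *ubuf,
               size_t count, loff_t *off)
{
    if (mutex_lock_interruptible(&ctx->lock))
        return -ERESTARTSYS;    /* a signal interrupted the wait; let the VFS handle it */
    /* ... critical section: work on the driver context ... */
    mutex_unlock(&ctx->lock);
    /* With mutex_lock_killable(), only fatal signals would interrupt the wait */
    return count;
}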

The mutex io variant

The mutex_lock_io() API is identical in syntax to the mutex_lock() API; the only difference is that the kernel accounts the wait time of the loser thread(s) as time spent waiting for I/O (the code comment in kernel/locking/mutex.c:mutex_lock_io() clearly documents this; take a look). This can matter for accounting purposes.

You can find fairly exotic APIs such as mutex_lock[_interruptible]_nested() within the kernel, with the emphasis here being on the nested suffix. However, note that the Linux kernel discourages developers from using nested (or recursive) locking (as we mentioned in the Correctly using the mutex lock section). Also, these APIs are only compiled in the presence of the CONFIG_DEBUG_LOCK_ALLOC config option; in effect, the nested APIs were added to support the kernel lock validator mechanism. They should only be used in special circumstances (where a nesting level must be incorporated between instances of the same lock type).

In the next section, we will answer a typical FAQ: what's the difference between the mutex and semaphore objects? Does Linux even have a semaphore object? Read on to find out!

The semaphore and the mutex

The Linux kernel does provide a semaphore object, along with the usual operations you can perform on a (binary) semaphore:

  • A semaphore lock acquire via the down[_interruptible]() (and variant) APIs
  • A semaphore unlock via the up() API

In general, the semaphore is an older implementation, so it's advised that you use the mutex lock in its place.

An FAQ worth looking at, though, is this: what is the difference between a mutex and a semaphore? They appear to be conceptually similar, but are actually quite different:

  • A semaphore is a more generalized form of a mutex; a mutex can be acquired (and subsequently released, or unlocked) by exactly one context at a time, while a semaphore, with a count greater than one, can be acquired (and subsequently released) by multiple contexts concurrently.
  • A mutex is used to protect a critical section from simultaneous access, while a semaphore should be used as a mechanism to signal another waiting task that a certain milestone has been reached (typically, a producer task posts a signal via the semaphore object, which a consumer task is waiting to receive, in order to continue with further work); see the sketch just after this list.
  • A mutex has the notion of ownership of the lock and only the owner context can perform the unlock; there is no ownership for a binary semaphore.
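
To make the signaling use case concrete, here's a minimal, hedged sketch using the kernel's semaphore APIs (the names are ours):

#include <linux/semaphore.h>

static struct semaphore work_sem;

/* init code: a count of 0 means the consumer blocks until signaled */
static void demo_setup(void)
{
    sema_init(&work_sem, 0);
}

/* producer context: signal that a work item (a "milestone") is ready */
static void producer(void)
{
    /* ... make the work item available ... */
    up(&work_sem);
}

/* consumer (process context): block until the producer signals */
static int consumer(void)
{
    if (down_interruptible(&work_sem))
        return -ERESTARTSYS;    /* interrupted by a signal */
    /* ... consume the work item ... */
    return 0;
}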

Priority inversion and the RT-mutex

A word of caution when using any kind of locking is that you should carefully design and code to prevent the dreaded deadlock scenarios that could arise (more on this in the next chapter, in the The lock validator lockdep – catch locking issues early section).

Aside from deadlocks, there is another risky scenario that arises when using the mutex: that of priority inversion (again, we will not delve into the details in this book). Suffice it to say that the unbounded priority inversion case can be a deadly one; the end result is that the product's high(est) priority thread is kept off the CPU for too long.

As I covered in some detail in my earlier book, Hands-on System Programming with Linux, it's precisely this priority inversion issue that struck NASA's Mars Pathfinder robot, on the Martian surface no less, back in July 1997! See the Further reading section of this chapter for interesting resources about this, something that every software developer should be aware of!

The userspace Pthreads mutex implementation certainly has priority inheritance (PI) semantics available. But what about within the Linux kernel? For this, Ingo Molnar provided the PI-futex-based RT-mutex (a real-time mutex; in effect, a mutex extended to have PI capabilities. futex(2) is a sophisticated system call that provides a fast userspace mutex). These become available when the CONFIG_RT_MUTEXES config option is enabled. Quite similar to the "regular" mutex semantics, RT-mutex APIs are provided to initialize, (un)lock, and destroy the RT-mutex object. (This code has been merged into the mainline kernel from Ingo Molnar's -rt tree.) As far as actual usage is concerned, the RT-mutex is used to internally implement the PI futex (the futex(2) system call itself internally implements the userspace Pthreads mutex). Besides this, the kernel locking self-test code and the I2C subsystem use the RT-mutex directly.
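Purely for illustration, the core RT-mutex calls mirror their "regular" mutex counterparts; here's a minimal hedged sketch (it assumes CONFIG_RT_MUTEXES=y; the lock name is ours):

#include <linux/rtmutex.h>

static DEFINE_RT_MUTEX(my_rtlock);

static void rt_protected_work(void)
{
    rt_mutex_lock(&my_rtlock);
    /* ... critical section; a low-priority owner temporarily inherits the
     * priority of a higher-priority waiter (PI), avoiding unbounded
     * priority inversion ... */
    rt_mutex_unlock(&my_rtlock);
}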

Thus, for a typical module (or driver) author, these APIs are not going to be used very frequently. The kernel does provide some documentation on the internal design of the RT-mutex at https://www.kernel.org/doc/Documentation/locking/rt-mutex-design.rst (covering priority inversion, priority inheritance, and more).

Internal design

A word on the reality of the internal implementation of the mutex lock deep within the kernel fabric: Linux tries to implement a fast path approach when possible.

A fast path is the most optimized high-performance type of code path; for example, one with no locks and no blocking. The intent is to have code follow this fast path as far as possible. Only when it really isn't possible does the kernel fall back to a (possible) "mid path", and then a "slow path", approach; it still works but is slow(er).

This fast path is taken in the absence of contention for the lock (that is, the lock is in an unlocked state to begin with). So, the lock is locked with no fuss, pretty much immediately. If, however, the mutex is already locked, then the kernel typically uses a mid path optimistic spinning implementation, making it more of a hybrid (mutex/spinlock) lock type. If even this isn't possible, the "slow path" is followed – the process context attempting to get the lock may well enter the sleep state. If you're interested in its internal implementation, more details can be found within the official kernel documentation: https://www.kernel.org/doc/Documentation/locking/mutex-design.rst.

LDV (Linux Driver Verification) project: In the companion guide Linux Kernel Programming - Chapter 1, Kernel Workspace Setup, in the section The LDV – Linux Driver Verification – project, we mentioned that this project has useful "rules" with respect to various programming aspects of Linux modules (drivers, mostly) as well as the core kernel.

With regard to our current topic, here's one of the rules: Locking a mutex twice or unlocking without prior locking (http://linuxtesting.org/ldv/online?action=show_rule&rule_id=0032). It mentions the kinds of things you cannot do with the mutex lock (we have already covered this in the Correctly using the mutex lock section). The interesting thing here: you can see an actual example of a bug – a mutex lock double-acquire attempt, leading to (self) deadlock – in a kernel driver (as well as the subsequent fix).

Now that you've understood how to use the mutex lock, let's move on and look at the other very common lock within the kernel – the spinlock.

Using the spinlock

In the Mutex or spinlock? Which to use when section, you learned when to use the spinlock instead of the mutex lock and vice versa. For convenience, we have reproduced the key statements we provided previously here:

  • Is the critical section running in an atomic (interrupt) context, or in a process context where it cannot sleep? Use the spinlock.
  • Is the critical section running in a process context, and is sleeping within the critical section necessary? Use the mutex lock.

In this section, we shall assume that you've decided to use the spinlock.

Spinlock – simple usage

For all the spinlock APIs, you must include the relevant header file; that is, #include <linux/spinlock.h>.

Similar to the mutex lock, you must declare and initialize the spinlock to the unlocked state before use. The spinlock is an "object" that's declared via the typedef data type named spinlock_t (internally, it's a structure defined in include/linux/spinlock_types.h). It can be initialized dynamically via the spin_lock_init() macro:

spinlock_t lock;
spin_lock_init(&lock);

Alternatively, this can be performed statically (declared and initialized) with DEFINE_SPINLOCK(lock);.

As with the mutex, embedding a spinlock within the (global/static) data structure that it's meant to protect against concurrent access is typically a very good idea. As we mentioned earlier, this very idea is made use of within the kernel often; as an example, the data structure representing an open file in the Linux kernel is called struct file:

// include/linux/fs.h
struct file {
    [...]
    struct path                  f_path;
    struct inode                 *f_inode;  /* cached value */
    const struct file_operations *f_op;
    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t                   f_lock;
    [...]
    struct mutex                 f_pos_lock;
    loff_t                       f_pos;
    [...]
};

Check it out: for the file structure, the spinlock variable named f_lock protects the f_ep_links and f_flags members of the file data structure (the structure also has a mutex lock to protect another member: the file's current seek position, f_pos).
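As a hedged illustration of how such an embedded lock is used (this mirrors the pattern the kernel itself follows when updating f_flags, for example in the fcntl(2) F_SETFL handling; the function name is ours):

#include <linux/fs.h>
#include <linux/fcntl.h>

/* Sketch: update the open file's f_flags under its own spinlock */
static void set_nonblocking(struct file *filp)
{
    spin_lock(&filp->f_lock);
    filp->f_flags |= O_NONBLOCK;
    spin_unlock(&filp->f_lock);
}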

How do you actually lock and unlock the spinlock? There are quite a few variations on the API that are exposed by the kernel to us module/driver authors; the simplest form of the spin_[un]lock() APIs are as follows:

void spin_lock(spinlock_t *lock);
<< ... critical section ... >>
void spin_unlock(spinlock_t *lock);

Note that there is no spinlock equivalent of the mutex_destroy() API.
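Putting the simple form together, here's a minimal, hedged sketch (the names are ours):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(mylock); /* statically declared and initialized */
static int shared_count;        /* the shared writeable data being protected */

static void update_counter(void)
{
    spin_lock(&mylock);
    shared_count++;     /* the critical section: must never sleep/block! */
    spin_unlock(&mylock);
}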

Now, let's see the spinlock APIs in action!

Spinlock – an example driver

Similar to what we did with our mutex locking sample driver (in the Mutex locking – an example driver section), to illustrate the simple usage of a spinlock, we shall make a copy of our earlier ch12/1_miscdrv_rdwr_mutexlock driver as a starting template, placing it in a new kernel driver; that is, ch12/2_miscdrv_rdwr_spinlock. Again, we'll only show the relevant portions of the diff (the delta generated by diff(1)) between that driver and this one:

// location: ch12/2_miscdrv_rdwr_spinlock/
+#include <linux/spinlock.h>
[ ... ]
-#define OURMODNAME "miscdrv_rdwr_mutexlock"
+#define OURMODNAME "miscdrv_rdwr_spinlock"
[ ... ]
static int ga, gb = 1;
-DEFINE_MUTEX(lock1); // this mutex lock is meant to protect the integers ga and gb
+DEFINE_SPINLOCK(lock1); // this spinlock protects the global integers ga and gb
[ ... ]
+/* The driver 'context' data structure;
+ * all relevant 'state info' reg the driver is here.
*/
struct drv_ctx {
struct device *dev;
@@ -63,10 +66,22 @@
u64 config3;
#define MAXBYTES 128
char oursecret[MAXBYTES];
- struct mutex lock; // this mutex protects this data structure
+ struct mutex mutex; // this mutex protects this data structure
+ spinlock_t spinlock; // ...so does this spinlock
};
static struct drv_ctx *ctx;

This time, to protect the members of our drv_ctx global data structure, we have both the original mutex lock and a new spinlock. This is quite common: the mutex lock protects member usage in critical sections where blocking can occur, while the spinlock is used to protect members in critical sections where blocking (sleeping – recall that the mutex might sleep) cannot occur.

Of course, we must ensure that we initialize all the locks so that they're in the unlocked state. We can do this in the driver's init code (continuing with the patch output):

-   mutex_init(&ctx->lock);
+ mutex_init(&ctx->mutex);
+ spin_lock_init(&ctx->spinlock);

In the driver's open method, we replace the mutex lock with the spinlock to protect the increments and decrements of the global integers:

 * open_miscdrv_rdwr()
@@ -82,14 +97,15 @@

PRINT_CTX(); // displays process (or intr) context info

- mutex_lock(&lock1);
+ spin_lock(&lock1);
ga++; gb--;
- mutex_unlock(&lock1);
+ spin_unlock(&lock1);

Now, within the driver's read method, we use the spinlock instead of the mutex to protect some critical sections:

 static ssize_t read_miscdrv_rdwr(struct file *filp, char __user *ubuf, size_t count, loff_t  *off)
{
- int ret = count, secret_len;
+ int ret = count, secret_len, err_path = 0;
struct device *dev = ctx->dev;

- mutex_lock(&ctx->lock);
+ spin_lock(&ctx->spinlock);
secret_len = strlen(ctx->oursecret);
- mutex_unlock(&ctx->lock);
+ spin_unlock(&ctx->spinlock);

However, that's not all! Continuing with the driver's read method, carefully take a look at the following code and comment:

[ ... ]
@@ -139,20 +157,28 @@
* member to userspace.
*/
ret = -EFAULT;
- mutex_lock(&ctx->lock);
+ mutex_lock(&ctx->mutex);
+ /* Why don't we just use the spinlock??
+ * Because - VERY IMP! - remember that the spinlock can only be used when
+ * the critical section will not sleep or block in any manner; here,
+ * the critical section invokes the copy_to_user(); it very much can
+ * cause a 'sleep' (a schedule()) to occur.
+ */
if (copy_to_user(ubuf, ctx->oursecret, secret_len)) {
[ ... ]

When protecting data where the critical section has possibly blocking APIs – such as copy_to_user() – we must only use a mutex lock! (Due to lack of space, we haven't displayed more of the code diff here; we expect you to read through the spinlock sample driver code and try it out for yourself.)

Test – sleep in an atomic context

You have already learned that the one thing we must not do is sleep (block) in any kind of atomic or interrupt context. Let's put this to the test. As always, the empirical approach – where you test things for yourself rather than relying on others' experiences – is key!

How exactly can we test this? Easy: we shall use a simple integer module parameter, buggy, that, when set to 1 (the default value being 0), executes a code path within our spinlock's critical section that violates this rule. We shall invoke the schedule_timeout() API, which (as you learned in Chapter 5, Working with Kernel Timers, Threads, and Workqueues, in the Understanding how to use the *sleep() blocking APIs section) internally invokes schedule(); it's how we go to sleep in kernel space. Here's the relevant code:

// ch12/2_miscdrv_rdwr_spinlock/2_miscdrv_rdwr_spinlock.c
[ ... ]
static int buggy;
module_param(buggy, int, 0600);
MODULE_PARM_DESC(buggy,
"If 1, cause an error by issuing a blocking call within a spinlock critical section");
[ ... ]
static ssize_t write_miscdrv_rdwr(struct file *filp, const char __user *ubuf,
        size_t count, loff_t *off)
{
    int ret, err_path = 0;
    [ ... ]
    spin_lock(&ctx->spinlock);
    strscpy(ctx->oursecret, kbuf, (count > MAXBYTES ? MAXBYTES : count));
    [ ... ]
    if (1 == buggy) {
        /* We're still holding the spinlock! */
        set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(1*HZ); /* ... and this is a blocking call!
                                 * Congratulations! you've just engineered a bug */
    }
    spin_unlock(&ctx->spinlock);
    [ ... ]
}

Now, for the interesting part: let's test this (buggy) code path on two kernels: first, on our custom 5.4 "debug" kernel (the kernel where we have enabled several kernel debug configuration options, mostly from the Kernel Hacking menu in make menuconfig, as explained in the companion guide Linux Kernel Programming - Chapter 5, Writing Your First Kernel Module – LKMs Part 2), and second, on a generic distro (we usually run on Ubuntu) 5.4 kernel without any relevant kernel debug options enabled.

Testing on a 5.4 debug kernel

First of all, ensure you've built the custom 5.4 kernel and that all the required kernel debug config options are enabled (again, look back at the companion guide Linux Kernel Programming - Chapter 5, Writing Your First Kernel Module – LKMs Part 2, the Configuring a debug kernel section if you need to). Then, boot off your debug kernel (here, it's named 5.4.0-llkd-dbg). Now, build the driver (in ch12/2_miscdrv_rdwr_spinlock/) against this debug kernel (the usual make within the driver's directory should do this; you might find that, on the debug kernel, the build is noticeably slower!):

$ lsb_release -a 2>/dev/null | grep "^Description" ; uname -r
Description: Ubuntu 20.04.1 LTS
5.4.0-llkd-dbg
$ make
[ ... ]
$ modinfo ./miscdrv_rdwr_spinlock.ko
filename: /home/llkd/llkd_src/ch12/2_miscdrv_rdwr_spinlock/./miscdrv_rdwr_spinlock.ko
[ ... ]
description: LLKD book:ch12/2_miscdrv_rdwr_spinlock: simple misc char driver rewritten with spinlocks
[ ... ]
parm: buggy:If 1, cause an error by issuing a blocking call within a spinlock critical section (int)
$ sudo virt-what
virtualbox
kvm
$

As you can see, we're running our custom 5.4.0 "debug" kernel on our x86_64 Ubuntu 20.04 guest VM.

How do you know whether you're running on a virtual machine (VM) or on the "bare metal" (native) system? virt-what(1) is a useful little script that shows this (you can install it on Ubuntu with sudo apt install virt-what).

To run our test case, insert the driver into the kernel and set the buggy module parameter to 1. Invoking the driver's read method (via our user space app; that is, ch12/miscdrv_rdwr/rdwr_test_secret) isn't an issue, as shown here:

$ sudo dmesg -C
$ sudo insmod ./miscdrv_rdwr_spinlock.ko buggy=1
$ ../../ch12/miscdrv_rdwr/rdwr_test_secret
Usage: ../../ch12/miscdrv_rdwr/rdwr_test_secret opt=read/write device_file ["secret-msg"]
opt = 'r' => we shall issue the read(2), retrieving the 'secret' from the driver
opt = 'w' => we shall issue the write(2), writing the secret message <secret-msg>
(max 128 bytes)
$
$ ../../ch12/miscdrv_rdwr/rdwr_test_secret r /dev/llkd_miscdrv_rdwr_spinlock
Device file /dev/llkd_miscdrv_rdwr_spinlock opened (in read-only mode): fd=3
../../ch12/miscdrv_rdwr/rdwr_test_secret: read 7 bytes from /dev/llkd_miscdrv_rdwr_spinlock
The 'secret' is:
"initmsg"
$

Next, we issue a write(2) to the driver via the user mode app; this time, our buggy code path gets executed. As you saw, we issued a schedule_timeout() within a spinlock critical section (that is, between the lock and unlock). The debug kernel detects this as a bug and spawns (impressively large) debug diagnostics into the kernel log (note that bugs like this can quite possibly hang your system, so test this on a VM first):

Figure 6.9 – Kernel diagnostics being triggered by the "scheduling in atomic context" bug we've deliberately hit here

The preceding screenshot shows part of what transpired (follow along while viewing the driver code in ch12/2_miscdrv_rdwr_spinlock/2_miscdrv_rdwr_spinlock.c):

  1. First, we have our user mode app's process context (rdwr_test_secre; notice how the name is truncated to the first 16 characters, including the NULL byte), which enters the driver's write method; that is, write_miscdrv_rdwr(). This can be seen in the output of our useful PRINT_CTX() macro (we've reproduced this line here):
miscdrv_rdwr_spinlock:write_miscdrv_rdwr(): 004) rdwr_test_secre :23578 | ...0 /*  write_miscdrv_rdwr() */
  2. It copies in the new 'secret' from the user space writer process (24 bytes).
  3. It then "takes" the spinlock, enters the critical section, and copies this data to the oursecret member of our driver's context structure.
  4. After this, if (1 == buggy) { evaluates to true.
  5. Then, it calls schedule_timeout(), which is a blocking API (as it internally calls schedule()), triggering the bug, which is helpfully highlighted in red:
BUG: scheduling while atomic: rdwr_test_secre/23578/0x00000002
  6. The kernel now dumps a good deal of diagnostic output. Among the first things to be dumped is the call stack.

The call stack or stack backtrace (or "call trace") of the kernel mode stack of the process – here, it's our user space app, rdwr_test_secret, which is running our (buggy) driver's code in the process context – can be clearly seen in Figure 6.9. Each line following the Call Trace: header is essentially a call frame on the kernel mode stack.

As a tip, ignore the stack frames that begin with the ? symbol; they are literally questionable call frames, in all likelihood "leftovers" from previous stack usage in the same memory region. It's worth taking a small memory-related diversion here: this is how stack allocation really works; stack memory isn't allocated and freed on a per-call frame basis as that would be frightfully expensive. Only when a stack memory page is exhausted is a new one automatically faulted in! (Recall our discussions in the companion guide Linux Kernel Programming - Chapter 9, Kernel Memory Allocation for Module Authors Part 2, in the A brief note on memory allocations and demand paging section.) So, the reality is that, as code calls and returns from functions, the same stack memory page(s) tend to keep getting reused.

Not only that, but for performance reasons, the memory isn't wiped each time, leading to leftovers from previous frames often appearing. (They can literally "spoil" the picture. However, fortunately, the modern stack call frame tracing algorithms are usually able to do a superb job in figuring out the correct stack trace.)

Following the stack trace bottom-up (always read it bottom-up), we can see that, as expected, our user space write(2) system call (it often shows up as (something like) SyS_write or, on the x86, as __x64_sys_write, though not visible in Figure 6.9) invokes the kernel's VFS layer code (you can see vfs_write() here, which calls __vfs_write()), which further invokes our driver's write method; that is, write_miscdrv_rdwr()! This code, as we well know, invokes the buggy code path where we call schedule_timeout(), which, in turn, invokes schedule() (and __schedule()), causing the whole BUG: scheduling while atomic bug to trigger.

The format of this scheduling while atomic message comes from the following line of code, which can be found in kernel/sched/core.c:

printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n",
       prev->comm, prev->pid, preempt_count());

Interesting! Here, you can see that it printed the following string:

      BUG: scheduling while atomic: rdwr_test_secre/23578/0x00000002

After atomic:, it prints the process name, the PID, and then the value of the preempt_count() inline function – the preempt depth; the preempt depth is a counter that's incremented every time a lock is taken and decremented on every unlock. So, if it's positive, this implies that the code is within a critical or atomic section; here, it shows as the value 2.
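You can watch the preempt depth move for yourself; here's a hedged sketch (the lock and function names are ours; it requires a kernel with CONFIG_PREEMPT_COUNT enabled, as our debug kernel has):

#include <linux/spinlock.h>
#include <linux/preempt.h>
#include <linux/printk.h>

static DEFINE_SPINLOCK(slock);  /* an assumed lock, for illustration */

static void show_preempt_depth(void)
{
    pr_info("preempt depth before lock : %d\n", preempt_count());
    spin_lock(&slock);      /* taking the lock increments the depth */
    pr_info("preempt depth within lock : %d\n", preempt_count());
    spin_unlock(&slock);    /* releasing the lock decrements it */
    pr_info("preempt depth after unlock: %d\n", preempt_count());
}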

Note that this bug gets neatly served up during this test run precisely because the CONFIG_DEBUG_ATOMIC_SLEEP debug kernel config option is turned on. It's on because we're running a custom "debug kernel" (kernel version 5.4.0)! The config option details (you can interactively find and set this option in make menuconfig, under the Kernel Hacking menu) are as follows:

// lib/Kconfig.debug
[ ... ]
config DEBUG_ATOMIC_SLEEP
    bool "Sleep inside atomic section checking"
    select PREEMPT_COUNT
    depends on DEBUG_KERNEL
    depends on !ARCH_NO_PREEMPT
    help
      If you say Y here, various routines which may sleep will become very
      noisy if they are called inside atomic sections: when a spinlock is
      held, inside an rcu read side critical section, inside preempt disabled
      sections, inside an interrupt, etc...

Testing on a 5.4 non-debug distro kernel

As a contrasting test, we will now perform the very same thing on our Ubuntu 20.04 LTS VM, which we'll boot via its default generic 'distro' 5.4 Linux kernel that is typically not configured as a 'debug' kernel (here, the CONFIG_DEBUG_ATOMIC_SLEEP kernel config option hasn't been set).

First, we insert our (buggy) driver. Then, when we run our rdwr_test_secret process in order to write a new secret to the driver, the buggy code path gets executed. However, this time, the kernel neither crashes nor reports any issue at all (looking at the dmesg(1) output validates this):

$ uname -r
5.4.0-56-generic
$ sudo insmod ./miscdrv_rdwr_spinlock.ko buggy=1
$ ../../ch12/miscdrv_rdwr/rdwr_test_secret w /dev/llkd_miscdrv_rdwr_spinlock "passwdcosts500bucksdude"
Device file /dev/llkd_miscdrv_rdwr_spinlock opened (in write-only mode): fd=3
../../ch12/miscdrv_rdwr/rdwr_test_secret: wrote 24 bytes to /dev/llkd_miscdrv_rdwr_spinlock
$ dmesg
[ ... ]
[ 65.420017] miscdrv_rdwr_spinlock:miscdrv_init_spinlock(): LLKD misc driver (major # 10) registered, minor# = 56, dev node is /dev/llkd_miscdrv_rdwr
[ 81.665077] miscdrv_rdwr_spinlock:miscdrv_exit_spinlock(): miscdrv_rdwr_spinlock: LLKD misc driver deregistered, bye
[ 86.798720] miscdrv_rdwr_spinlock:miscdrv_init_spinlock(): VERMAGIC_STRING = 5.4.0-56-generic SMP mod_unload
[ 86.799890] miscdrv_rdwr_spinlock:miscdrv_init_spinlock(): LLKD misc driver (major # 10) registered, minor# = 56, dev node is /dev/llkd_miscdrv_rdwr
[ 130.214238] misc llkd_miscdrv_rdwr_spinlock: filename: "llkd_miscdrv_rdwr_spinlock"
wrt open file: f_flags = 0x8001
ga = 1, gb = 0
[ 130.219233] misc llkd_miscdrv_rdwr_spinlock: stats: tx=0, rx=0
[ 130.219680] misc llkd_miscdrv_rdwr_spinlock: rdwr_test_secre wants to write 24 bytes
[ 130.220329] misc llkd_miscdrv_rdwr_spinlock: 24 bytes written, returning... (stats: tx=0, rx=24)
[ 131.249639] misc llkd_miscdrv_rdwr_spinlock: filename: "llkd_miscdrv_rdwr_spinlock"
ga = 0, gb = 1
[ 131.253511] misc llkd_miscdrv_rdwr_spinlock: stats: tx=0, rx=24
$

We know that our write method has a deadly bug, yet it doesn't seem to fail in any manner! This is really bad; it's this kind of thing that can erroneously lead you to conclude that your code is just fine when there's actually a nasty bug silently lying in wait to pounce one fine day!

To help us investigate what exactly is going on under the hood, let's run our test app (the rdwr_test_secret process) once more, but this time via the powerful trace-cmd(1) tool (a very useful wrapper over the Ftrace kernel infrastructure); the following is its truncated output:

The Linux kernel's Ftrace infrastructure is the kernel's primary tracing infrastructure; it provides a detailed trace of pretty much every function that's executed in kernel space. Here, we are leveraging Ftrace via a convenient frontend: the trace-cmd(1) utility. These are indeed very powerful and useful debug tools; we've mentioned several others in the companion guide Linux Kernel Programming - Chapter 1, Kernel Workspace Setup, but unfortunately, the details are beyond the scope of this book. Check out the man pages to learn more.
$ sudo trace-cmd record -p function_graph -F ../../ch12/miscdrv_rdwr/rdwr_test_secret w /dev/llkd_miscdrv_rdwr_spinlock "passwdcosts500bucks"
$ sudo trace-cmd report -I -S -l > report.txt
$ sudo less report.txt
[ ... ]

The output can be seen in the following screenshot:

Figure 6.10 – A partial screenshot of the trace-cmd(1) report output

As you can see, the write(2) system call from our user mode app becomes, as expected, vfs_write(), which itself (after security checks) invokes __vfs_write(), which, in turn, invokes our driver's write method – the write_miscdrv_rdwr() function!

In the (large) Ftrace output stream, we can see that the schedule_timeout() function has indeed been invoked:

Figure 6.11 – A partial screenshot of the trace-cmd(1) report output, showing the (buggy!) calls to schedule_timeout() and schedule() within an atomic context

A few lines of output after schedule_timeout(), we can clearly see schedule() being invoked! So, there we have it: our driver has (deliberately, of course) performed something buggy – calling schedule() in an atomic context. But again, the key point here is that on this Ubuntu system, we are not running a "debug" kernel, which is why we have the following:

$ grep DEBUG_ATOMIC_SLEEP /boot/config-5.4.0-56-generic
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
$

This is why the bug isn't being reported! This proves the usefulness of running test cases and indeed performing kernel development on a "debug" kernel, a kernel with many debug features enabled. (As an exercise, if you haven't done so already, prepare a "debug" kernel and run this test case on it.)

Linux Driver Verification (LDV) project: In the companion guide Linux Kernel Programming - Chapter 1, Kernel Workspace Setup, in the section The LDV – Linux Driver Verification – project, we mentioned that this project has useful "rules" with respect to various programming aspects of Linux modules (drivers, mostly) as well as the core kernel.

With regard to our current topic, here's one of the rules: Usage of spin lock and unlock functions (http://linuxtesting.org/ldv/online?action=show_rule&rule_id=0039). It mentions key points with regard to the correct usage of spinlocks; interestingly, here, it shows an actual bug instance in a driver where a spinlock was attempted to be released twice – a clear violation of the locking rules, leading to an unstable system.

Locking and interrupts

So far, we have learned how to use the mutex lock and, for the spinlock, the basic spin_[un]lock() APIs. A few other API variations on the spinlock exist, and we shall examine the more common ones here.

To understand exactly why you may need other APIs for spinlocks, let's go over a scenario: as a driver author, you find that the device you're working on asserts a hardware interrupt; accordingly, you write the interrupt handler for it. Now, while implementing a read method for your driver, you find that you have a non-blocking critical section within it. This is easy to deal with: as you have learned, you should use a spinlock to protect it. Great! But what if, while in the read method's critical section, the device's hardware interrupt fires? As you're aware, hardware interrupts preempt anything and everything; thus, control will go to the interrupt handler code preempting the driver's read method.

The key question here: is this an issue? That answer depends both on what your interrupt handler and your read method were doing and how they were implemented. Let's visualize a few scenarios:

  • The interrupt handler (ideally) uses only local variables, so even if the read method were in a critical section, it really doesn't matter; the interrupt handling will complete very quickly and control will be handed back to whatever was interrupted (again, there's more to it than this; as you know, any existing bottom-half, such as a tasklet or softirq, may also need to execute). In other words, as such, there is really no race in this case.
  • The interrupt handler is working on (global) shared writeable data but not on the data items that your read method is using. Thus, again, there is no conflict and no race with the read code. What you should realize, of course, is that the interrupt code does have a critical section and that it must be protected (perhaps with another spinlock).
  • The interrupt handler is working on the same global shared writeable data that your read method is using. In this case, we can see that the potential for a race definitely exists, so we need locking!

Let's focus on the third case. Obviously, we should use a spinlock to protect the critical section within the interrupt handling code (recall that using a mutex is disallowed when we're in any kind of interrupt context). Also, unless we use the very same spinlock in both the read method and the interrupt handler's code path, they will not be protected at all! (Be careful when working with locks; take the time to think through your design and code in detail.)

Let's try and make this a bit more hands-on (with pseudocode for now): let's say we have a global (shared) data structure named gCtx; we're operating on it in both the read method as well as the interrupt handler (hardirq handler) within our driver. Since it's shared, it's a critical section and therefore requires protection; since we are running in an atomic (interrupt) context, we can't use a mutex, so we must use a spinlock instead (here, the spinlock variable is called slock). The following pseudocode shows some timestamps (t1, t2, ...) for this situation:

// Driver read method ; WRONG !
driver_read(...) << time t0 >>
{
[ ... ]
spin_lock(&slock);
<<--- time t1 : start of critical section >>
... << operating on global data object gCtx >> ...
spin_unlock(&slock);
<<--- time t2 : end of critical section >>
[ ... ]
} << time t3 >>

The following pseudocode is for the device driver's interrupt handler:

handle_interrupt(...)           << time t4; hardware interrupt fires!     >>
{
[ ... ]
spin_lock(&slock);
<<--- time t5: start of critical section >>
... << operating on global data object gCtx >> ...
spin_unlock(&slock);
<<--- time t6 : end of critical section >>
[ ... ]
} << time t7 >>

This can be summed up with the following diagram:

Figure 6.12 – Timeline – the driver's read method and hardirq handler run sequentially when working on global data; there are no issues here

Luckily, everything has gone well – "luckily" because the hardware interrupt fired after the read function's critical section completed. Surely we can't count on luck as the exclusive safety stamp of our product! The hardware interrupt is asynchronous; what if it fired at a less opportune time (for us) – say, while the read method's critical section is running between time t1 and t2? Well, isn't the spinlock going to do its job and protect our data?

At this point, the interrupt handler's code will attempt to acquire the same spinlock (&slock). Wait a minute – it cannot "get" it as it's currently locked! In this situation, it "spins", in effect waiting on the unlock. But how can it be unlocked? It cannot, and there we have it: a (self) deadlock.

Interestingly, the spinlock is more intuitive and makes sense on an SMP (multicore) system. Let's assume that the read method is running on CPU core 1; the interrupt can be delivered on another CPU core, say core 2. The interrupt code path will "spin" on the lock on CPU core 2, while the read method, on core 1, completes the critical section and then unlocks the spinlock, thus unblocking the interrupt handler. But what about on UP (uniprocessor, with only one CPU core)? How will it work then? Ah, so here's the solution to this conundrum: when "racing" with interrupts, regardless of uniprocessor or SMP, simply use the _irq variant of the spinlock API:

#include <linux/spinlock.h>
void spin_lock_irq(spinlock_t *lock);

The spin_lock_irq() API internally disables interrupts on the processor core that it's running on; that is, the local core. So, by using this API in our read method, interrupts will be disabled on the local core, thus making any possible "race" impossible via interrupts. (If the interrupt does fire on another CPU core, the spinlock technology will simply work as advertised, as discussed previously!)

The spin_lock_irq() implementation is pretty nested (as with most of the spinlock functionality), yet fast; down the line, it ends up invoking the local_irq_disable() and preempt_disable() macros, disabling both interrupts and kernel preemption on the local processor core that it's running on. (Disabling hardware interrupts has the (desirable) side effect of disabling kernel preemption as well.)

spin_lock_irq() pairs off with the corresponding spin_unlock_irq() API. So, the correct usage of the spinlock for this scenario (as opposed to what we saw previously) is as follows:

// Driver read method ; CORRECT !
driver_read(...) << time t0 >>
{
[ ... ]
spin_lock_irq(&slock);
<<--- time t1 : start of critical section >>
[now all interrupts + preemption on local CPU core are masked (disabled)]
... << operating on global data object gCtx >> ...
spin_unlock_irq(&slock);
<<--- time t2 : end of critical section >>
[ ... ]
} << time t3 >>

Before patting ourselves solidly on the back and taking the rest of the day off, let's consider another scenario. This time, on a more complex product (or project), it's quite possible that, among the several developers working on the code base, one has deliberately set the interrupt mask to a certain value, thus blocking some interrupts while allowing others. For the sake of our example, let's say that this has occurred earlier, at some point in time t0. Now, as we described previously, another developer (you!) comes along, and in order to protect a critical section within the driver's read method, uses the spin_lock_irq() API. Sounds correct, yes? Yes, but this API has the power to turn off (mask) all hardware interrupts (and kernel preemption, which we'll ignore for now) on the local CPU core. It does so by manipulating, at a low level, the (very arch-specific) hardware interrupt mask register. Let's say that setting a bit corresponding to an interrupt to 1 enables that interrupt, while clearing the bit (to 0) disables or masks it. Due to this, we may end up with the following scenario:

  • time t0: The interrupt mask is set to some value, say, 0x8e (10001110b), enabling some and disabling other interrupts. This is important to the project (here, for simplicity, we're assuming there's an 8-bit mask register).
    [... time elapses ...]
  • time t1: Just before entering the driver read method's critical section, call spin_lock_irq(&slock);. This API has the internal effect of clearing all the bits in the interrupt mask register to 0, thus disabling all interrupts (as we think we desire).
  • time t2: Now, hardware interrupts cannot fire on this CPU core, so we go ahead and complete the critical section. Once we're done, we call spin_unlock_irq(&slock);. This API has the internal effect of setting all the bits in the interrupt mask register to 1, reenabling all interrupts.

However, the interrupt mask register has now been wrongly "restored" to a value of 0xff (11111111b) – not the value 0x8e that the original developer wants, requires, and assumes! This can (and probably will) break something in the project.

The solution is quite straightforward: don't assume anything, simply save and restore the interrupt mask. This can be achieved with the following API pair:

#include <linux/spinlock.h>
unsigned long spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);

The first parameter to both the lock and unlock functions is the spinlock variable to use. The second parameter, flags, must be a local variable of the unsigned long type. This will be used to save and restore the interrupt mask:

spinlock_t slock;
spin_lock_init(&slock);
[ ... ]
driver_read(...)
{
    unsigned long flags;
    [ ... ]
    spin_lock_irqsave(&slock, flags);

    << ... critical section ... >>
    spin_unlock_irqrestore(&slock, flags);
    [ ... ]
}
To be pedantic, spin_lock_irqsave() is not an API, but a macro; we've shown it as an API for readability. Also, although the return value of this macro is not void, it's an internal detail (the flags parameter variable is updated here).

What about if a tasklet or a softirq (a bottom-half interrupt mechanism) has a critical section that "races" with your process-context code paths? In such situations, using the spin_lock_bh() routine is likely what's required since it can disable bottom halves on the local processor and then take the spinlock, thus safeguarding the critical section (similar to the way that spin_lock_irq[save]() protects the critical section in the process context by disabling hardware interrupts on the local core):

void spin_lock_bh(spinlock_t *lock);
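Here's a hedged sketch of the pattern (the names are ours): the process-context path takes the _bh variant, while the tasklet itself can use the plain lock, as another bottom half won't preempt it on the same CPU:

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(bh_lock);
static unsigned long shared_events;    /* data shared with the bottom half */

/* process-context path: disable bottom halves locally, then take the lock */
static void process_ctx_update(void)
{
    spin_lock_bh(&bh_lock);
    shared_events++;
    spin_unlock_bh(&bh_lock);
}

/* tasklet (bottom-half) path: the plain lock suffices here */
static void my_tasklet_fn(unsigned long data)
{
    spin_lock(&bh_lock);
    shared_events--;
    spin_unlock(&bh_lock);
}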

Of course, overhead does matter in highly performance-sensitive code paths (the network stack being a great example). Thus, use the simplest, lowest-overhead form of the spinlock wherever it suffices, reserving the more complex variants for when they're truly required. Having said that, though, there are certainly going to be occasions that demand the use of the stronger forms of the spinlock API. For example, on the 5.4.0 Linux kernel, this is an approximation of the number of usage instances of the different forms of the spinlock APIs we have seen: spin_lock(): over 9,400 usage instances; spin_lock_irq(): over 3,600; spin_lock_irqsave(): over 15,000; and spin_lock_bh(): over 3,700. (We don't draw any major inference from this; we simply wish to point out that the stronger forms of the spinlock APIs are quite widely used in the Linux kernel.)

Finally, let's provide a very brief note on the internal implementation of the spinlock: the under-the-hood implementation tends to be very arch-specific code, often composed of atomic machine language instructions that execute very fast on the microprocessor. On the popular x86[_64] architecture, for example, the spinlock ultimately boils down to an atomic test-and-set operation on a member of the spinlock structure (typically implemented via the cmpxchg machine language instruction). On ARM machines, as we mentioned earlier, it's often the wfe (Wait For Event) machine instruction, along with SEV (Set Event), at the heart of the implementation. (You will find resources regarding the internal implementation in the Further reading section.) Regardless, as a kernel or driver author, you should only use the exposed APIs (and macros) when working with spinlocks.
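To build intuition only, here's a toy test-and-set spin loop built upon the kernel's atomic operators; this is emphatically not the kernel's real implementation (which uses far more sophisticated, fair, arch-specific ticket/queued locks):

#include <linux/atomic.h>
#include <asm/processor.h>  /* cpu_relax() */

/* Toy lock: 0 = unlocked, 1 = locked */
typedef struct {
    atomic_t locked;
} toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *l)
{
    /* atomically swap in 1; keep spinning while the old value was 1 */
    while (atomic_cmpxchg(&l->locked, 0, 1) != 0)
        cpu_relax();    /* hint to the CPU that we're busy-waiting */
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
    atomic_set(&l->locked, 0);  /* a real unlock also needs release-ordering guarantees */
}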

Using spinlocks – a quick summary

Let's quickly summarize spinlocks:

  • Simplest, lowest overhead: Use the non-irq spinlock primitives, spin_lock()/spin_unlock(), when protecting critical sections in the process context (there's either no interrupts to deal with or there are interrupts, but we do not race with them at all; in effect, use this when interrupts don't come into play or don't matter).
  • Medium overhead: Use the irq-disabling (as well as kernel preemption disabling) versions, spin_lock_irq() / spin_unlock_irq(), when interrupts are in play and do matter (the process and interrupt contexts can "race"; that is, they share global data).
  • Strongest (relatively), high overhead: This is the safest way to use a spinlock. It does the same as the medium overhead, except it performs a save-and-restore on the interrupt mask via the spin_lock_irqsave() / spin_unlock_irqrestore() pair, so as to guarantee that the previous interrupt mask settings aren't inadvertently overwritten, which could happen with the previous case.

As we saw earlier, the spinlock – in the sense of "spinning" on the processor it's running on while awaiting the lock – is impossible on UP (how can you spin on the one CPU that's available while another thread runs simultaneously on the very same CPU?). Indeed, on UP systems, the only real effect of the spinlock APIs is that they can disable hardware interrupts and kernel preemption on the processor! On SMP (multicore) systems, however, the spinning logic actually comes into play, and thus the locking semantics work as expected. But hang on – this should not stress you, budding kernel/driver developer; in fact, the whole point is that you should simply use the spinlock APIs as described, and you'll never have to worry about UP versus SMP; the details of what is and isn't done are all hidden by the internal implementation.

Though this book is based on the 5.4 LTS kernel, a new feature was added to the 5.8 kernel from the Real-Time Linux (RTL, previously called PREEMPT_RT) project, which deserves a quick mention here: "local locks". While the main use case for local locks is for (hard) real-time kernels, they help with non-real-time kernels too, mainly for lock debugging via static analysis, as well as runtime debugging via lockdep (we cover lockdep in the next chapter). Here's the LWN article on the subject: https://lwn.net/Articles/828477/.

With this, we complete the section on spinlocks, an extremely common and key lock used in the Linux kernel by virtually all its subsystems, including drivers.

Summary

Congratulations on completing this chapter!

Understanding concurrency and its related concerns is absolutely critical for any software professional. In this chapter, you learned key concepts regarding critical sections, the need for exclusive execution within them, and what atomicity means. You then learned why we need to be concerned with concurrency while writing code for the Linux OS. After that, we delved into the actual locking technologies – mutex locks and spinlocks – in detail. You also learned which lock you should use and when. Finally, you learned how to handle concurrency concerns when hardware interrupts (and their possible bottom halves) are in play.

But we aren't done yet! There are many more concepts and technologies we need to learn about, which is just what we will do in the next, and final, chapter of this book. I suggest that you digest the content of this chapter well first by browsing through it, as well as the resources in the Further reading section and the exercises provided, before diving into the last chapter!

Questions

As we conclude, here is a list of questions for you to test your knowledge regarding this chapter's material: https://github.com/PacktPublishing/Linux-Kernel-Programming/tree/master/questions. You will find some of the questions answered in the book's GitHub repo: https://github.com/PacktPublishing/Linux-Kernel-Programming/tree/master/solutions_to_assgn.

Further reading

To help you delve deeper into the subject with useful materials, we provide a rather detailed list of online references and links (and at times, even books) in a Further reading document in this book's GitHub repository. The Further reading document is available here: https://github.com/PacktPublishing/Linux-Kernel-Programming/blob/master/Further_Reading.md.
