7.1 Overview
A concurrent program handles more than one task at a time. A familiar example is a web server that handles multiple client requests at the same time. Although concurrent programs can run even on a single-processor machine of bygone days, these programs should show a marked gain in performance by running on a multiprocessor machine: different tasks can be delegated to different processors. A parallel program in this sense is a concurrent program whose tasks can be handled literally at the same time because multiple processors are at hand.
The two traditional and still relevant approaches to concurrency are multiprocessing and multithreading. Applications such as web servers and database systems may mix the approaches and throw in acceleration techniques such as nonblocking I/O. Multiprocessing has a relatively long history and is still widespread. For example, early web servers supported concurrency through multiprocessing; even state-of-the-art web servers such as Nginx are multiprocessing systems.
Recall that a process is a program in execution and that each process has its own address space. Two processes could share a memory location, but this requires setup: shared memory is not the default. Separate address spaces are appealing to the programmer, who need not worry about memory-based race conditions when writing a multiprocessing program. A typical race condition arises when two or more operations, at least one of which is a write, could access the same memory location at the same time. Of interest now is that separate processes, by default, do not share access to any memory location, which rules out such a race condition.
What is the downside of multiprocessing? When the operating system preempts a not-yet-finished process, a process-level context switch occurs: the operating system gives the processor to another process for its execution. The preempted process must be scheduled again to complete its execution. A process-level context switch is expensive because the operating system may have to swap data structures such as page tables (virtual-to-physical address translators) between memory and disk; in any case, there is nontrivial bookkeeping to track the state of both the preempted and the newly executing process. It is hard to come up with an exact figure, but a process-level context switch takes about 5ms to 15ms (milliseconds), time that is not available for other tasks.
Recall too that a thread (short for thread of execution) is a sequence of executable instructions. Every process has at least one thread; a process with only one thread is single-threaded, and a process with more than one thread is multithreaded. Operating systems schedule threads to processors; to schedule a process is, in effect, to schedule one of its threads. On a multiprocessor machine, multiple threads from the same process can execute at the very same time. A thread-level context switch, which preempts one thread in a process for another in the same process, is not free, but the cost is very low: nanoseconds rather than milliseconds. Multithreading is efficient.
In a simplifying move, Linux systems turn process scheduling into thread scheduling by treating even a multithreaded process as if it were single-threaded. A multithreaded process with N threads then requires N scheduling actions to cover the threads. Threads within a multithreaded process remain related in that they share resources such as memory address space. Accordingly, Linux threads are sometimes described as lightweight processes, with the lightweight underscoring the sharing of resources among the threads within a process.
What is the downside of multithreading? Threads within a process have the same address space; hence, multithreaded programs are susceptible to memory-based race conditions. On a multiprocessor machine, for instance, one thread might try to read memory location N at the very instant that another thread is trying to write N. The outcome is indeterminate. The burden of preventing race conditions falls on the programmer, not the operating system. Multithreaded programs, especially ones with variables shared among the threads, are a challenge even for the experienced programmer.
7.2 Multiprocessing Through Process Forking
The standard library functions provide options for multiprocessing, but the fork function is the most explicit. The first code example covers the basics of a fork call using unnamed pipes; an earlier example (recall Listings 5-8 and 5-9) covered named pipes. A look at unnamed pipes from the command line serves as preparation.
A command line such as sleep 5 | echo "Hello, world!" connects two processes through an unnamed pipe. The greeting Hello, world! appears on the screen; then, after about five seconds, the command-line prompt returns, signaling that both the sleep and echo processes have exited. The pipe is closed automatically when the reader and writer terminate. There is multiprocessing here, but it does no useful work; instead, the example shows how an unnamed pipe works.
In normal usage, the writer process on the left writes bytes to the pipe, and the reader process on the right blocks until there are bytes to read. By closing the write end of the pipe before exiting, the writer process generates an end-of-stream condition. The reader process likewise closes the read end before exiting. Once the reader and the writer process exit, the pipe shuts down.
The preceding example is contrived because the sleep process does not write any bytes to the pipe and the echo process does not read any bytes from the pipe. Nonetheless, there is multiprocessing. The sleep process on the left does just that, and for five seconds. In the meanwhile, the echo process immediately writes its greeting to the screen because this process need not wait for bytes from the pipe. The echo process exits after printing its message. The sleep process then exits, the pipe goes away, and the command-line prompt reappears.
Introducing the fork function
The basicFork program (see Listing 7-1) opens with a call to the signal function. This is a precaution to prevent zombie processes, as clarified in an upcoming section. The int variable n is declared and initialized to 777. If the subsequent call to the library function fork succeeds, both the child and the parent process get their own separate copy of variable n; hence, each process manages different variables with the same name.
The library function fork tries to create a new process. If the attempt succeeds, the newly created process becomes the child of the original process, which is now a parent. The fork function returns an integer value; for portability, the recommended type is pid_t, where pid stands for process identifier. The tricky part of the fork call is that, if successful, it returns one value to the parent but a different value to the child. A short digression into the process id explains.
Every process has a nonnegative integer value as its identifier (pid). There is a library function getpid to retrieve the pid, and a related function getppid to retrieve the parent process identifier (ppid). Every process except the first has a ppid, which is guaranteed to be the same as the parent's pid.
If the fork call succeeds, it returns different values to the two processes:
- 0 to the child
- The child's pid to the parent
Once forked, the child process executes a copy of the very same code as the parent—the code that comes after the call to fork. Accordingly, a test is typically used (in this case, the if test) to distinguish between code intended for the child and code intended for the parent. In this example, the child executes the if block, printing 787; the parent executes the else block, printing 7770. The order in which the prints occur is indeterminate. If the program runs on a multiprocessor machine, this concurrent program can execute in a truly parallel fashion.
The basics of the fork function
The pipe function takes an int array of two elements as its single argument: the first element (index 0) is the file descriptor for read operations, and the second element (index 1) is the file descriptor for write operations.
The function returns -1 to signal failure and 0 to signal success.
Note that the pipe function creates an unnamed pipe, whereas the mkfifo function creates a named pipe.
The fork function is used to create the reader process, although this spawned process could have been the writer. The process that does the forking is the parent, and the forked process is the child. The child process, an almost exact duplicate of the parent, is said to inherit from the parent. For example, a forked child process inherits open file descriptors from the parent. Recall that once forked, the child process executes the very same code as the parent process, unless an if test or the equivalent is used to divide the code that each process executes. A closer look at the example clarifies.
A returned value of -1 indicates an error: the fork failed to spawn a child process. This could occur for various reasons, including a full process table. The process table is a data structure that the operating system maintains in tracking processes.
- If the fork call succeeds, it returns different values to the child and the parent processes:
0 is returned to the child.
The child’s process identifier (pid) is returned to the parent.
The else clause is thus for the parent to execute. Because the child process is the reader, it immediately closes the WriteEnd of the pipe; in a similar fashion, the parent process as the writer immediately closes the ReadEnd of the pipe. Both file descriptors are open because of the call to pipe. By closing one end of the pipe, each process exhibits the separation-of-concerns pattern.
The writer process then writes bytes to the pipe, and the reader process reads these bytes one at a time. When the writer process closes the pipe’s write end, an end-of-stream marker is sent to the reader, which responds by closing the pipe’s read end. At this point, the pipe closes down.
7.2.1 Safeguarding Against Zombie Processes
In the pipeUN program, the parent process writes a full string to the pipe and then waits for the child process to terminate with a call to the library function wait; the child reads the string byte by byte. The wait call is a precaution against creating a permanent zombie process: a zombie is a process that has terminated but still has an entry in the process table. If zombies are not reaped from the process table, this table can fill and thus prevent the forking of any other process. Although a forked child is largely independent of its parent process, the operating system does notify the parent when the child terminates. If a child terminates before its parent, and there is no safeguard against zombies, the child can remain a zombie until the parent itself exits.
In the pipeUN example, it is unpredictable whether the parent or the child will terminate first, and so the parent, the process being notified, makes the precautionary call to wait: if the child has already exited, the call has no effect; otherwise, the parent's execution is suspended until the child terminates. The wait function expects one argument, the address of an int variable that stores the exit code of the process being waited on. In this example, the argument NULL is used to keep things simple, but a parent process in general might implement different logic depending on the status code of a terminated child. There is also a waitpid function of three arguments, which allows for more granular control. The waitpid function is used in a forthcoming example.
The pipeUN program adopts another safeguard. The child calls library function _exit rather than exit: the former fast-tracks parent notification and so speeds up the reaping of a zombie entry. The parent process, by contrast, calls the regular exit function.
Recall that the basicFork program opens with a call to the signal function, instructing the system to ignore the SIGCHLD signal. The effect of this call is to automate the reaping of a zombie. Were this approach taken in the current example, the parent's call to wait would not be needed to safeguard against a zombie.
7.3 The exec Family of Functions
In the forking of a child process, the multiprocessing is obvious in that the parent process, which calls fork, continues to execute as well; indeed, the parent and the child execute the same code unless program logic explicitly controls which process executes which code. The typical approach, illustrated in the code examples so far, is to use an if-test to separate the code intended for the parent from the code intended for the child.
The functions in the exec family, mentioned several times already but not yet analyzed, work differently. All of the functions in the family do essentially the same thing, but their argument formats differ. For example, the execv function has an argument vector, implemented as a NULL-terminated array of strings. Other members of the family such as execle use an environment variable to pass information to the executing program. The next code example goes into the details.
Recall that a process is a program in execution, something dynamic. The executable program is stored somewhere, typically as a file on a local disk. To execute the program, the operating system first must load the file into memory. This in-memory representation of the process, read-only during process execution, is the process image.
The exec family of functions
Replaces the image of the process that calls an exec function with a new process image. This is described as overlaying one process image with another.
The new program, in this case cline, runs under the same pid as the original process, in this case execing: an exec call replaces the program, not the process.
The cline program expects command-line arguments, which are supplied in a NULL-terminated array of strings; the cline program simply prints the arguments to the standard output and then exits.
The first argument is the path to the executable as a string, in this case "./cline".
The second argument is an array of strings, including (by tradition) the name of the executable as the first element in this array. A NULL marks the end of the string array.
Any code following a successful execv call does not execute. Only the newly executed cline program runs to completion: the process image for the forked child indeed has been overlaid.
Immediately after the successful fork of the child process, print the child's pid value, which can be obtained with a call within the if block to the getpid function.
Amend the cline program to print its own pid, again using the library function getpid.
The two printed pid values should be the same, thereby confirming that the execed program cline is executing under the forked child’s pid. The code available on GitHub includes this experiment.
7.3.1 Process Id and Exit Status
The next program reviews the forking API, in particular the pid and ppid values for a child process, but also focuses on the information available about how a child process terminates. The exit status of a forked process is available, with convenient macros for extracting this status information. These macros belong to C's waiting API, whose principal functions are wait (one argument for ease of use) and waitpid (three arguments for fine-grained control). The example introduces the waitpid function.
Exit status
The first argument is the pid of the process on which to wait, in this case the child.
The second argument points to an int variable where the child’s exit or comparable status is stored.
The last argument consists of additional options, for instance, WNOHANG for return at once if no child has exited.
The first argument to waitpid (-1) means, in effect, any child of mine; the second argument is NULL instead of a pointer to an int variable to store the child’s exit status; and the third argument is NULL for no flags.
The child exits normally, with a nonnegative return value.
The child receives a signal such as SIGKILL (terminate immediately), which cannot be ignored, or SIGTERM (please terminate immediately), which can be ignored.
The child receives a SIGSTOP (stop executing: pause) signal, which cannot be ignored.
In this example, the child exits normally with a call to _exit. The WEXITSTATUS macro returns the low-order 8 bits of the child's 32-bit explicitly returned value, 0xaa11bb22 in hex. The macro thus extracts 0x22.
The exiting program also confirms that a child’s ppid is the same as its parent’s pid. In a sample run, this value was 2613, and the child’s pid was 2614. These values are not guaranteed to be consecutive, but it is a common pattern: the child’s pid is one greater than the parent’s.
7.4 Interprocess Communication Through Shared Memory
Although every process has its own address space, which ensures that processes do not share memory locations by default, processes can share memory. A standard library provides the appropriate functions. Shared memory is, like pipes, a mechanism for interprocess communication. A code example with two processes explores the details.
There are two separate libraries and APIs for shared memory: the legacy System V library and API, and the more recent POSIX pair. These APIs should never be mixed in a single application. The POSIX API is still in development and dependent upon the version of the operating system kernel, which impacts code portability. By default, the POSIX API implements shared memory as a memory-mapped file: for a shared memory segment, the system maintains a backing file with corresponding contents. Shared memory under POSIX can be configured without a backing file, but this may impact portability. My example uses the POSIX API with a backing file, which combines the benefits of memory access (speed) and file storage (persistence).
The shared memory example has two programs, named memwriter and memreader, and uses a semaphore to coordinate their access to the shared memory. Whenever shared memory comes into the picture with a writer, so does the risk of a memory-based race condition with indeterminate results; hence, the semaphore is used to coordinate (synchronize) access to the shared memory so that the writer and reader operations do not overlap.
Here is a review of how semaphores work as a synchronization mechanism. A general semaphore also is called a counting semaphore, as it has a value (typically initialized to zero) that can be incremented. Consider a shop that rents bicycles, with a hundred of them in stock, with a program that clerks use to do the rentals. Every time a bike is rented, the semaphore is incremented by one; when a bike is returned, the semaphore is decremented by one. Rentals can continue until the value hits 100 but then must halt until at least one bike is returned, thereby decrementing the semaphore to 99.
The memwriter program
allocates ByteSize bytes, in this case, a modest 512 bytes. The memwriter and memreader programs access the shared memory only, not the backing file. The system is responsible for synchronizing the shared memory and the backing file.
to get a pointer to the shared memory. (The memreader makes a similar call.) The pointer type caddr_t is a legacy core address type, in effect a char*. The memwriter uses the memptr for the later write operation, which uses the library strcpy function. At this point, the memwriter is ready for writing, but it first creates a semaphore to ensure exclusive access to the shared memory.
then the writing can proceed. The SemaphoreName (any unique nonempty name will do) identifies the semaphore in both the memwriter and the memreader. The initial value of zero gives the semaphore's creator (in this case, the memwriter) the right to proceed (in this case, to the write operation).
The memreader program
The first argument to mmap is NULL, which means that the system determines where to allocate the memory in virtual address space. It is possible (but tricky) to specify an address instead. The MAP_SHARED flag indicates that the allocated memory is shareable among processes, and the last argument (in this case, zero) means that the offset for the shared memory should be the first byte. The size argument specifies the number of bytes to be allocated (in this case, 512), and the protection argument indicates that the shared memory can be written and read.
removes the backing file. If the unlink statement is omitted, then the backing file persists after the program terminates.
Once the wait is over, the memreader reads the ASCII bytes from the shared memory, cleans up, and terminates.
The shared memory API includes operations explicitly to synchronize the shared memory segment and the backing file. These operations have been omitted from the example to reduce clutter and keep the focus on the memory-sharing and semaphore code.
The memwriter and memreader programs are likely to execute without inducing a race condition even if the semaphore code is removed: the memwriter creates the shared memory segment and writes immediately to it, and the memreader cannot even access the shared memory until the segment has been created. However, best practice requires that shared memory access be synchronized whenever a write operation is in the mix, and the semaphore API is important enough to be highlighted in a code example.
7.5 Interprocess Communication Through File Locking
A producer should gain an exclusive lock on the file before writing to the file. An exclusive lock can be held by one process at most, which rules out a race condition because no other process can access the file until the lock is released. (It is possible to lock only part of a file.)
A consumer should gain at least a shared lock on the file before reading from the file. Multiple readers can hold a shared lock at the same time, but no writer can access a file when even a single reader holds a shared lock. A shared lock promotes efficiency. If one process is just reading a file and not changing its contents, there is no reason to prevent other processes from doing the same. Writing, however, clearly demands exclusive access to a file, as a whole or just in part.
The producer program
makes the lock an exclusive (read-write) rather than a shared (read-only) lock. If the producer gains the lock, then no other process can write or read the file until the producer releases the lock, either explicitly with the appropriate call to fcntl or implicitly by closing the file. (When a process terminates, any open files are closed automatically, thereby releasing any locks.) The program then initializes the remaining fields. The chief effect is that the entire file is to be locked. The locking API, however, also allows just designated bytes to be locked. For example, if the file contains multiple text records, then a single record (or even part of a record) could be locked and the rest left unlocked.
tries to lock the file exclusively, checking whether the call succeeded. In general, the fcntl function returns -1 (hence, less than zero) to indicate failure. The second argument F_SETLK means that the call to fcntl does not block: the function returns immediately, either granting the lock or indicating failure. If the flag F_SETLKW (the W at the end is for wait) were used instead, the call to fcntl would block until gaining the lock was possible. In the calls to fcntl, the first argument fd is the file descriptor, the second argument specifies the action to be taken (in this case, F_SETLK for setting the lock), and the third argument is the address of the lock structure (in this case, &lock).
The consumer program
The F_GETLK operation specified in the fcntl call checks for a lock, in this case, an exclusive lock given as F_WRLCK in the first statement earlier. If the specified lock does not exist, then the fcntl call automatically changes the lock type field to F_UNLCK to indicate this fact. If the file is exclusively locked, the consumer terminates. (A more robust version of the program might have the consumer sleep a bit and try again several times.)
If the file is not currently locked, then the consumer tries to gain a shared (read-only) lock (F_RDLCK). To shorten the program, the F_GETLK call to fcntl could be dropped because the F_RDLCK call would fail if a read-write lock already were held by some other process. Recall that a read-only lock does prevent any other process from writing to the file but allows other processes to read from the file. In short, a shared lock can be held by multiple processes. After gaining a shared lock, the consumer program reads the bytes one at a time from the file, prints the bytes to the standard output, releases the lock, closes the file, and terminates.
The data shared through this interprocess communication is text: two lines from Shakespeare’s play Richard III . Yet the shared file’s contents could be voluminous, arbitrary bytes (e.g., a digitized movie), which makes file sharing an impressively flexible mechanism. The downside is that file access is relatively slow, whether the access involves reading or writing. As always, programming comes with trade-offs.
7.6 Interprocess Communication Through Message Queues
Earlier code examples highlighted pipes, both named and unnamed. Pipes of either type have strict FIFO behavior: the first byte written is the first byte read, the second byte written is the second byte read, and so forth. Message queues can behave in the same way but are flexible enough that byte chunks can be retrieved out of FIFO order.
- The payload, which is an array of bytes (char).
- A type, given as a positive integer value; types categorize messages for flexible retrieval.
Of the four messages shown, the one labeled 1 is at the front, that is, closest to the receiver. Next come two messages with label 2, and finally, a message labeled 3 at the back. If strict FIFO behavior were in play, then the messages would be received in the order 1-2-2-3. However, the message queue allows other retrieval orders. For example, the messages could be retrieved by the receiver in the order 3-2-1-2.
The header file queue.h
The message sender program
The message receiver program
7.7 Multithreading
A first multithreaded example
The first argument is a pointer to a pthread_t instance, in this case an element in the threads array.
The second argument specifies thread attributes. A value of NULL indicates that the default attributes should be used.
The third argument is a pointer to the thread’s start function, which the thread executes once the operating system starts the thread. A created thread automatically terminates when it returns from its start function. The start function can call other functions and do whatever else comes naturally to functions.
The fourth and last argument specifies what should be passed, as an argument, to the start function. In this case, the argument passed to the greet start function will be one of the values 1, 2, 3, and 4, which identify each of the created threads. The argument passed to the start function is always of the generic type void*, and NULL for no argument can be used.
All four of the created threads execute the same code, the body of the greet function, but no race condition arises. Arguments passed to a function, and local (auto or register) variables within the function, are thereby thread-safe because each thread gets its own copies. If a variable is neither extern nor static, then it represents a thread-safe memory location.
The pthread_create function returns 0 to signal success and a nonzero error code to signal failure. A successfully created thread is ready to be scheduled for execution on a processor.
The order of thread execution is indeterminate. Once the threads are created, the operating system takes over the scheduling, using whatever algorithm the host system employs. A pthread instance is a native thread under operating system control. By contrast, a green thread is under the control of a virtual machine. For example, early implementations of Java (before JDK 1.4) were required to support only green threads. If the multiT program is run several times, the output is likely to differ each time.
The Portable Operating System Interface is a family of standards from the IEEE Computer Society meant to encourage compatibility among operating systems. The multithreading examples use pthreads, where the p stands for POSIX.
7.7.1 A Thread-Based Race Condition
The next code example illustrates a race condition in a multithreaded program. The program later introduces a mechanism for coordinating thread execution, thereby preventing this race condition. A short depiction of a race condition follows.
1. Thread T1 gets 123 from its call to rand() and performs the addition. Assume that the sum of the two numbers, 123 + 1 = 124, is stored on the stack. Call this storage location temp1, which now holds 124.
2. Thread T2 gets 987 from its call to rand() and performs the addition. The sum 988 is stored in temp2, also on the stack.
3. Thread T2 performs the assignment, using the value from temp2: the value of n is updated to 988.
4. Thread T1 performs the assignment, using the value from temp1: the value of n changes to 124.
It is clear that improper interleaving of machine-level instructions has taken place. Thread T2 does its addition and assignment without interruption, which is the correct way to perform the two operations. By contrast, thread T1 does its addition, is delayed two ticks of the clock, and then finishes up with an assignment. By coming in last, thread T1 wins the race: the final value of variable n, 124, reflects only what thread T1 did, and what thread T2 did is effectively lost.
The two threads have interleaved their operations within a critical section. The outcome is, therefore, indeterminate and unpredictable.
7.7.2 The Miser/Spendthrift Race Condition
The forthcoming miserSpend program encourages a race condition by having two threads concurrently update a shared memory location, in this case the single static variable named account, which represents a shared bank account: both threads access the same account. A memory-based race condition requires contention for a shared memory location. Of course, the account variable could be extern rather than static without changing the program’s behavior.
On a multiprocessor machine, the miser and the spendthrift can execute in a truly parallel fashion. Because access to the account is uncoordinated, a race condition ensues, and the final value of account is indeterminate. Indeed, if the two threads increment and decrement a sufficient number of times (e.g., ten million apiece), it becomes highly unlikely that the account will have zero as its value at the end, or that the account will have the same value over multiple runs.
Creating, starting, and waiting on the miser and spendthrift threads
The first argument is the address of a pthread_t instance, in this case, of either the miser or the spendt variable.
The second argument, NULL, indicates that default thread properties are to be used.
The third argument is the address of the start function, either deposit (miser) or withdraw (spendthrift). Recall that each created thread terminates automatically when exiting its start function.
The fourth argument is the address of the argument passed to the start function, in this case the address of integer variable n, which is the number of times that each started thread should update the account.
The miser/spendthrift start functions
The second part of the saveSpend program (see Listing 7-14) has the two start functions for the created threads: deposit (miser) and withdraw (spendthrift). Each of these functions takes, as its single argument, the number of times to perform an account update, implemented as the update function: the deposit function calls update with 1 as the argument, whereas the withdraw function calls update with -1 as the argument.
With a command-line argument of 10M, a result of zero is highly unlikely.
Fixing the saveSpend program
A pthread_mutex_t variable named lock is added. There should be a single lock to ensure that the miser and the spendthrift contend for the same lock. The lock is static but could be extern as well.
The lock is used in the update function. To update the account, a thread first must grab the lock, expressed here as the condition of the if clause. The pthread_mutex_lock function returns 0 to signal that the lock has been grabbed.
Once a thread completes its update, the thread releases the lock so that another thread can try to grab it.
With these changes in place, the saveSpend program always prints 0 as the value of the account when the miser and spendthrift threads have terminated.
To execute a locked critical section, a thread first must grab the lock. After finishing the critical section, the thread should release the lock to enable some other thread to grab it and thus to safeguard against deadlock.
If multiple threads are contending for the lock, the implementation ensures that exactly one thread grabs it.
In general, a mutex such as a pthread_mutex_t does not guarantee fairness. For example, if two threads are contending for the lock, the mutex implementation does not guarantee that each thread will be successful half the time. However, the saveSpend program has other logic to ensure that the miser and the spendthrift threads execute the same number of times.
If the fixed saveSpend program is run with a sufficiently large loop count (e.g., 10,000,000) as the command-line argument, there will be noticeable slowdown compared to the original version of the program. There is a performance cost to mutual exclusion, which enforces single-threaded execution of a critical section; in this code example, the cost ensures that the saveSpend program runs correctly.
7.8 Deadlock in Multithreading
Deadlocking with threads
A code analysis shows what happened.
The main thread creates two threads: t1 and t2. Thread t1 is created first, and the output confirms that, in this run, t1 starts executing first, although the order of execution is in general indeterminate. There are two locks, lock1 and lock2, which protect resource, an int variable: thread t1 tries to set this variable to -9999, whereas thread t2 tries to set the variable to 1111. To set the variable, a thread must grab both locks.
Thread t1 has thread1 as its start function, and t2 has thread2 as its start function. In turn, these functions immediately call the grab_locks function, but with arguments in a different order. Recall that, in multithreading, each thread has its own copies of arguments and local variables.
1. Thread t1 succeeds in grabbing lock1 but fails in the attempt to grab lock2. After grabbing lock1, thread t1 sleeps for 100 microseconds, time enough, as it turns out, for thread t2 to grab lock2.
2. After grabbing lock2, thread t2 tries to grab lock1, which thread t1 already holds. At this point, t1 holds lock1 and t2 holds lock2.
3. The last two statements in the grab_locks function release the locks. However, neither thread can proceed to the release code without first grabbing a lock that the other thread already holds: deadlock.
Why is deadlock not certain in the deadlock program? On my desktop machine, no deadlock results if the usleep call is removed from the grab_locks function, or if the argument passed to usleep is sufficiently small. Even with the current usleep value of 100, it is possible that thread t1 grabs both locks before thread t2 even begins executing. It is also possible, on a multiprocessor machine, that thread t2 is scheduled on a faster processor than t1; as a result, t2 grabs both locks before t1 even begins executing. A thread that holds both locks can proceed to the release code, so no deadlock occurs. The deadlock program tries to cause deadlock, but even this requires some experimentation in setting the amount of time that thread t1 sleeps after grabbing the first lock.
The deadlock program tries to cause deadlock, but the real-world challenge is a concurrent program that, although designed not to deadlock, does so anyway. Modern database systems typically include at least a deadlock-detection module. In general, however, software systems neither detect, nor prevent, nor recover from deadlock. The burden thus falls on the programmer to write code that avoids deadlock.
7.9 SIMD Parallelism
The acronym SIMD was introduced in the mid-1960s as part of Flynn’s taxonomy for parallel computing. SIMD stands for single instruction, multiple data stream. Flynn’s taxonomy introduces other acronyms (e.g., MIMD for multiple instruction, multiple data stream) to describe additional approaches to parallel computation. This section focuses on SIMD parallelism.
Imagine integer values collected in an array and a code segment that doubles the value of each element. A conventional approach would be to loop over the array and, one element at a time, double each value. In a SIMD architecture, a single instruction would execute on each element in parallel. The serial or iterative computation gives way to a one-step parallel computation, with a boost in performance that is both intuitive and compelling.
The concurrent programs examined so far become truly parallel programs without any programmer intervention. If a multiprocessing or multithreading program happens to execute on a multiprocessor machine (now the norm), then the operating system transforms the concurrent program into a parallel one by scheduling processes/threads onto different processors. SIMD parallelism differs in that parallel instructions come into play. SIMD is thus a type of instruction-level parallelism, which requires underlying architectural support.
The appeal of SIMD parallelism is obvious. Even everyday applications regularly iterate over arrays, performing the same operation on each element. For an array of size N, this iterative approach requires that N instructions be executed in sequence. Assume, for simplicity, that each instruction requires one tick of the system clock. In this scenario, doubling the array elements takes N ticks. If the doubling can be done in a single SIMD instruction, the time required drops from N ticks to roughly one tick, although there is nontrivial overhead to set up the parallel operation.
For some time, computers have had devices tailored for SIMD. A graphics processing unit (GPU) is a case in point; indeed, the acronym GPGPU describes a GPU designed for general-purpose rather than just graphics-specific processing. There are various C libraries and entire frameworks devoted to putting such devices to use in SIMD processing. This section goes another way, focusing instead on how the standard C compilers are now able to use native SIMD instructions, in particular on modern Intel and AMD machines. (ARM Neon machines likewise support SIMD.)
A SIMD program in C
The attribute specifier has two underscores in front and in back. The specified attribute is vector_size, whose value is Length (defined as 8) multiplied by sizeof(double), which is typically 8 bytes. A doubleV8 instance is thereby defined as a vector of eight 8-byte floating-point values, which requires 512 bits in all.
The standard compilers now make SIMD programming straightforward in C itself, without any additional libraries or tools. Of course, the underlying architecture must support machine-level SIMD instructions. It is reasonable to expect that SIMD architectures will continue to improve and that the C compilers will continue to generate code that takes advantage of the evolving SIMD instruction sets and architectures.
7.10 What’s Next?
The next chapter covers miscellaneous topics to provide a better sense of the libraries available in C, both standard and third party. There is also a section on building software libraries from scratch. As usual, the code examples highlight the power and flexibility of C.
The forthcoming code examples cover regular expressions for pattern matching and data validation; assertions for enforcing conditions in code modules; locale management for internationalization; the compilation of C code into WebAssembly for high-performance web modules; signals for interprocess communication; and the building, deployment, and use (by both C and Python clients) of software libraries.