To control the execution of processes, the kernel must be able to suspend the execution of the process running on the CPU and resume the execution of some other process previously suspended. This activity goes variously by the names process switch, task switch, or context switch. The next sections describe the elements of process switching in Linux.
While each process can have its own address space, all processes have to share the CPU registers. So before resuming the execution of a process, the kernel must ensure that each such register is loaded with the value it had when the process was suspended.
The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called the hardware context. The hardware context is a subset of the process execution context, which includes all information needed for the process execution. In Linux, a part of the hardware context of a process is stored in the process descriptor, while the remaining part is saved in the Kernel Mode stack.
In the description that follows, we will assume the prev local variable refers to the process descriptor of the process being switched out and next refers to the one being switched in to replace it. We can thus define a process switch as the activity consisting of saving the hardware context of prev and replacing it with the hardware context of next. Since process switches occur quite often, it is important to minimize the time spent in saving and loading hardware contexts.
Old versions of Linux took advantage of the hardware support offered by the Intel architecture and performed a process switch through a far jmp instruction[20] to the selector of the Task State Segment Descriptor of the next process. While executing the instruction, the CPU performs a hardware context switch by automatically saving the old hardware context and loading a new one. But Linux 2.4 uses software to perform a process switch for the following reasons:
Step-by-step switching performed through a sequence of mov instructions allows better control over the validity of the data being loaded. In particular, it is possible to check the values of segmentation registers. This type of checking is not possible when using a single far jmp instruction.
The amount of time required by the old approach and the new approach is about the same. However, it is not possible to optimize a hardware context switch, while there might be room for improving the current switching code.
Process switching occurs only in Kernel Mode. The contents of all registers used by a process in User Mode have already been saved before performing process switching (see Chapter 4). This includes the contents of the ss and esp pair that specifies the User Mode stack pointer address.
The 80x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn’t use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:
When an 80x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS (see Chapter 4).
When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.
More precisely, when a process executes an in or out I/O instruction in User Mode, the control unit performs the following operations:
It checks the 2-bit IOPL field in the eflags register. If it is set to 3, the control unit executes the I/O instruction. Otherwise, it performs the next check.
It accesses the tr register to determine the current TSS, and thus the proper I/O Permission Bitmap.
It checks the bit of the I/O Permission Bitmap corresponding to the I/O port specified in the I/O instruction. If it is cleared, the instruction is executed; otherwise, the control unit raises a “General protection error” exception.
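The three checks above can be sketched as a small C function. This is only an illustrative model, not kernel or microcode source: the function name, the byte-array representation of the bitmap, and the `bitmap_bytes` parameter are assumptions made for the example, and it considers a single-byte port access only. IOPL is taken from bits 12 and 13 of eflags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EFLAGS_IOPL_SHIFT 12                       /* IOPL occupies bits 12-13 of eflags */
#define EFLAGS_IOPL_MASK  (3u << EFLAGS_IOPL_SHIFT)

/*
 * Hypothetical model: returns true if a User Mode in/out on `port` would be
 * executed, false if it would raise a general protection exception.
 * `io_bitmap` stands for the bitmap found through the tr register.
 */
bool user_io_allowed(uint32_t eflags, const uint8_t *io_bitmap,
                     size_t bitmap_bytes, uint16_t port)
{
    assert(io_bitmap != NULL);

    /* Check 1: with IOPL equal to 3 the instruction is executed directly. */
    if (((eflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT) == 3)
        return true;

    /* Assumption of this sketch: ports not covered by the bitmap are denied. */
    if ((size_t)(port / 8) >= bitmap_bytes)
        return false;

    /* Checks 2 and 3: a cleared bit grants access, a set bit denies it. */
    return (io_bitmap[port / 8] & (1u << (port % 8))) == 0;
}
```

Note the inverted sense of the bitmap: a cleared bit means the port may be addressed.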
The tss_struct structure describes the format of the TSS. As already mentioned in Chapter 2, the init_tss array stores one TSS for each different CPU on the system. At each process switch, the kernel updates some fields of the TSS so that the corresponding CPU’s control unit may safely retrieve the information it needs.
Each TSS has its own 8-byte Task State Segment Descriptor (TSSD). This Descriptor includes a 32-bit Base field that points to the TSS starting address and a 20-bit Limit field. The S flag of a TSSD is cleared to denote the fact that the corresponding TSS is a System Segment. The Type field is set to either 9 or 11 to denote that the segment is actually a TSS. In Intel’s original design, each process in the system should refer to its own TSS; the second least significant bit of the Type field is called the Busy bit; it is set to 1 if the process is being executed by a CPU, and to 0 otherwise. In the Linux design, there is just one TSS for each CPU, so the Busy bit is always set to 1.
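To make the field layout concrete, here is a minimal sketch that packs and unpacks the TSSD fields named above into a 64-bit value, following the standard 80x86 segment descriptor bit positions. The function and structure names are hypothetical, and the sketch deliberately ignores the other descriptor bits (such as the privilege level and present flag) that the text does not discuss.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the TSSD fields discussed in the text. */
struct tssd_fields {
    uint32_t base;   /* 32-bit Base: starting address of the TSS */
    uint32_t limit;  /* 20-bit Limit */
    unsigned type;   /* 4-bit Type: 9 or 11 for a TSS */
    unsigned s;      /* S flag: 0 for a System Segment */
    unsigned busy;   /* second least significant bit of Type */
};

uint64_t tssd_pack(uint32_t base, uint32_t limit, unsigned type)
{
    uint64_t d = 0;

    assert(limit <= 0xFFFFFu);
    assert(type == 9 || type == 11);

    d |= (uint64_t)(limit & 0xFFFFu);             /* Limit 15..0  -> bits 0-15  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;      /* Base 23..0   -> bits 16-39 */
    d |= (uint64_t)(type & 0xFu) << 40;           /* Type         -> bits 40-43 */
    /* The S flag (bit 44) is left cleared: this is a System Segment. */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;  /* Limit 19..16 -> bits 48-51 */
    d |= (uint64_t)(base >> 24) << 56;            /* Base 31..24  -> bits 56-63 */
    return d;
}

struct tssd_fields tssd_unpack(uint64_t d)
{
    struct tssd_fields f;

    f.base  = (uint32_t)((d >> 16) & 0xFFFFFFu) |
              ((uint32_t)((d >> 56) & 0xFFu) << 24);
    f.limit = (uint32_t)(d & 0xFFFFu) |
              ((uint32_t)((d >> 48) & 0xFu) << 16);
    f.type  = (unsigned)((d >> 40) & 0xFu);
    f.s     = (unsigned)((d >> 44) & 1u);
    f.busy  = (f.type >> 1) & 1u;
    return f;
}
```

Type 9 is binary 1001 and type 11 is binary 1011, so the two values differ only in the Busy bit.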
The TSSDs created by Linux are stored in the Global Descriptor Table (GDT), whose base address is stored in the gdtr register of each CPU. The tr register of each CPU contains the TSSD Selector of the corresponding TSS. The register also includes two hidden, nonprogrammable fields: the Base and Limit fields of the TSSD. In this way, the processor can address the TSS directly without having to retrieve the TSS address from the GDT.
At every process switch, the hardware context of the process being replaced must be saved somewhere. It cannot be saved on the TSS, as in the original Intel design, because we cannot make assumptions about when the process being replaced will resume execution and what CPU will execute it again.
Thus, each process descriptor includes a field called thread of type thread_struct, in which the kernel saves the hardware context whenever the process is being switched out.
As we shall see later, this data structure includes fields for most of the CPU registers, such as the general-purpose registers, the floating point registers, and so on.
A process switch may occur at just one well-defined point: the schedule() function (discussed at length in Chapter 11). Here, we are only concerned with how the kernel performs a process switch. Essentially, every process switch consists of two steps:
Switching the Page Global Directory to install a new address space; we’ll describe this step in Chapter 8.
Switching the Kernel Mode stack and the hardware context, which provides all the information needed by the kernel to execute the new process, including the CPU registers.
Again, we assume that prev points to the descriptor of the process being replaced, and next to the descriptor of the process being activated. As we shall see in Chapter 11, prev and next are local variables of the schedule() function.
The second step of the process switch is performed by the switch_to macro. It is one of the most hardware-dependent routines of the kernel, and it takes some effort to understand what it does.
First of all, the macro has three parameters called prev, next, and last. The actual invocation of the macro in schedule() is:
    switch_to(prev, next, prev);
You might easily guess the role of prev and next (they are just placeholders for the local variables prev and next), but what about the third parameter last? Well, the point is that in any process switch, three processes are involved, not just two.
Suppose the kernel decides to switch off process A and to activate process B. In the schedule() function, prev points to A’s descriptor and next points to B’s descriptor. As soon as the switch_to macro deactivates A, the execution flow of A freezes.
Later, when the kernel wants to reactivate A, it must switch off another process C (in general, this is different from B) by executing another switch_to macro with prev pointing to C and next pointing to A. When A resumes its execution flow, it finds its old Kernel Mode stack, so the prev local variable points to A’s descriptor and next points to B’s descriptor. The kernel, which is now executing on behalf of process A, has lost any reference to C.
The last parameter of the switch_to macro reinserts the address of C’s descriptor into the prev local variable. The mechanism exploits the state of registers during function calls. The first prev parameter corresponds to a CPU register, which is loaded with the content of the prev local variable when the macro starts. When the macro ends, it writes the content of the same register in the last parameter, namely, in the prev local variable. The macro ends only after A has been resumed, however, and the CPU register doesn’t change across the process switch, so A’s prev receives the value loaded when C started the macro: the address of C’s descriptor (as we shall see in Chapter 11, the scheduler checks whether C should be readily executed on another CPU).
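The three-process scenario can be modeled in plain C. The sketch below is not the real macro: the `struct task` and `struct cpu` types and the `model_switch_to()` function are hypothetical stand-ins, in which each task keeps its own copies of the `prev` and `next` locals (as if they lived on its Kernel Mode stack) and one `reg` field plays the role of the CPU register that survives the switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a task: prev/next mimic the locals of schedule()
 * as they are preserved on this task's own Kernel Mode stack. */
struct task {
    const char  *name;
    struct task *prev;
    struct task *next;
};

/* Hypothetical model of a CPU. */
struct cpu {
    struct task *current;  /* task currently owning the CPU */
    struct task *reg;      /* register that is not changed by the switch */
};

void model_switch_to(struct cpu *cpu, struct task *prev, struct task *next)
{
    assert(cpu != NULL && cpu->current == prev);

    /* Locals of schedule() as seen on prev's stack before it freezes. */
    prev->prev = prev;
    prev->next = next;

    cpu->reg = prev;       /* macro start: register <- prev */
    cpu->current = next;   /* stack switch: now running on next's stack */

    /* Macro end, executed on next's stack: last (next's own prev local)
     * receives the register, that is, the task just switched off. */
    next->prev = cpu->reg;
}
```

Running A to B, B to C, and then C to A leaves A with `prev` pointing to C and `next` still pointing to B, which is exactly the situation described above.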
Here is a description of what the switch_to macro does on an 80x86 microprocessor:
Saves the values of prev and next in the eax and edx registers, respectively:
    movl prev,%eax
    movl next,%edx
The eax and edx registers correspond to the prev and next parameters of the macro.
Saves another copy of prev in the ebx register; ebx corresponds to the last parameter of the macro:
    movl %eax,%ebx
Saves the contents of the esi, edi, and ebp registers in the prev Kernel Mode stack. They must be saved because the compiler assumes that they will stay unchanged until the end of switch_to:
    pushl %esi
    pushl %edi
    pushl %ebp
Saves the content of esp in prev->thread.esp so that the field points to the top of the prev Kernel Mode stack:
    movl %esp, 616(%eax)
The 616(%eax) operand identifies the memory cell whose address is the contents of eax plus 616.
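The displacement in such an operand is simply the byte offset of a field inside the process descriptor. The following sketch illustrates this with a made-up layout: `struct task_sketch`, its 608 bytes of padding, and the `read_disp()` helper are assumptions chosen only so that `thread.esp` lands at offset 616 as in the text. The real `task_struct` layout depends on the kernel version and configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical excerpt of the saved hardware context. */
struct thread_sketch {
    uint32_t esp0;
    uint32_t eip;
    uint32_t esp;
    uint32_t fs;
    uint32_t gs;
};

/* Hypothetical descriptor: the padding stands for all preceding fields. */
struct task_sketch {
    unsigned char other_fields[608];
    struct thread_sketch thread;
};

/* Models the operand disp(%reg): reads the 32-bit cell at address reg + disp. */
uint32_t read_disp(const void *reg, size_t disp)
{
    uint32_t value;

    assert(reg != NULL);
    memcpy(&value, (const unsigned char *)reg + disp, sizeof value);
    return value;
}
```

With this layout, `offsetof(struct task_sketch, thread.esp)` evaluates to 616, so `read_disp(task, 616)` returns the same value as `task->thread.esp`.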
Loads next->thread.esp in esp. From now on, the kernel operates on the Kernel Mode stack of next, so this instruction performs the actual process switch from prev to next. Since the address of a process descriptor is closely related to that of the Kernel Mode stack (as explained in Section 3.2.2 earlier in this chapter), changing the kernel stack means changing the current process:
    movl 616(%edx), %esp
Saves the address labeled 1 (shown later in this section) in prev->thread.eip. When the process being replaced resumes its execution, the process executes the instruction labeled as 1:
    movl $1f, 612(%eax)
On the Kernel Mode stack of next, the macro pushes the next->thread.eip value, which, in most cases, is the address labeled 1:
    pushl 612(%edx)
Jumps to the __switch_to() C function:
    jmp __switch_to
This function acts on the prev and next parameters that denote the former process and the new process. This function call is different from the average function call, though, because __switch_to() takes the prev and next parameters from the eax and edx registers (where we saw they were stored), not from the stack like most functions. To force the function to go to the registers for its parameters, the kernel uses the __attribute__ and regparm keywords, which are nonstandard extensions of the C language implemented by the gcc compiler. The __switch_to() function is declared in the include/asm-i386/system.h header file as follows:
    void __switch_to(struct task_struct *prev, struct task_struct *next) __attribute__((regparm(3)));
The __switch_to() function completes the process switch started by the switch_to macro. It includes extended inline assembly language code that makes for rather complex reading because the code refers to registers by means of special symbols:
Executes the code yielded by the unlazy_fpu() macro (see Section 3.3.4 later in this chapter) to optionally save the contents of the FPU, MMX, and XMM registers. As we shall see, there is no need to load the corresponding registers of next while performing the context switch:
    unlazy_fpu(prev);
Loads next->thread.esp0 in the esp0 field of the TSS relative to the current CPU so that any future privilege level change from User Mode to Kernel Mode automatically forces this address into the esp register:
    init_tss[smp_processor_id()].esp0 = next->thread.esp0;
The smp_processor_id() macro yields the index of the executing CPU.
Stores the contents of the fs and gs segmentation registers in prev->thread.fs and prev->thread.gs, respectively; the corresponding assembly language instructions are:
    movl %fs,620(%esi)
    movl %gs,624(%esi)
The esi register points to the prev->thread structure.
Loads the fs and gs segment registers with the values contained in next->thread.fs and next->thread.gs, respectively. This step logically complements the actions performed in the previous step. The corresponding assembly language instructions are:
    movl 12(%ebx),%fs
    movl 16(%ebx),%gs
The ebx register points to the next->thread structure. The code is actually more intricate, as an exception might be raised by the CPU when it detects an invalid segment register value. The code takes this possibility into account by adopting a “fix-up” approach (see Section 9.2.6).
Loads six debug registers[21] with the contents of the next->thread.debugreg array. This is done only if next was using the debug registers when it was suspended (that is, field next->thread.debugreg[7] is not 0). As we shall see in Chapter 20, these registers are modified only by writing in the TSS, so there is no need to save the corresponding registers of prev:
    if (next->thread.debugreg[7]) {
        loaddebug(&next->thread, 0);
        loaddebug(&next->thread, 1);
        loaddebug(&next->thread, 2);
        loaddebug(&next->thread, 3);
        /* no 4 and 5 */
        loaddebug(&next->thread, 6);
        loaddebug(&next->thread, 7);
    }
Updates the I/O bitmap in the TSS, if necessary. This must be done when either next or prev has its own customized I/O Permission Bitmap:
    if (next->thread.ioperm) {
        memcpy(init_tss[smp_processor_id()].io_bitmap,
               next->thread.io_bitmap, 128);
        init_tss[smp_processor_id()].bitmap = 104;
    } else if (prev->thread.ioperm)
        init_tss[smp_processor_id()].bitmap = 0x8000;
The customized I/O Permission Bitmap of a process is stored in a buffer pointed to by the thread.io_bitmap field of the process descriptor. If next has a customized bitmap, it is copied into the io_bitmap field of the TSS. Otherwise, if next doesn’t have it, the kernel checks whether prev defined such a bitmap. In this case, the bitmap must be invalidated.
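The decision logic can be exercised outside the kernel with a small stand-alone model. The structures below are hypothetical simplifications that keep only the fields mentioned in the text (`ioperm`, `io_bitmap`, and the TSS `bitmap` field); the constant names are invented for the sketch, while the values 128, 104, and 0x8000 are the ones used in the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IO_BITMAP_BYTES   128     /* bytes copied by the memcpy() in the text */
#define IO_BITMAP_OFFSET  104     /* value stored when the bitmap is valid */
#define INVALID_IO_BITMAP 0x8000  /* value stored to invalidate the bitmap */

/* Hypothetical, reduced versions of the kernel structures. */
struct thread_sketch {
    int     ioperm;                      /* nonzero: customized bitmap present */
    uint8_t io_bitmap[IO_BITMAP_BYTES];
};

struct tss_sketch {
    uint16_t bitmap;                     /* models the TSS bitmap field */
    uint8_t  io_bitmap[IO_BITMAP_BYTES];
};

void switch_io_bitmap(struct tss_sketch *tss,
                      const struct thread_sketch *prev,
                      const struct thread_sketch *next)
{
    assert(tss != NULL && prev != NULL && next != NULL);

    if (next->ioperm) {
        /* next has its own bitmap: install it and mark it valid. */
        memcpy(tss->io_bitmap, next->io_bitmap, IO_BITMAP_BYTES);
        tss->bitmap = IO_BITMAP_OFFSET;
    } else if (prev->ioperm) {
        /* Only prev had one: the stale copy must not be honored. */
        tss->bitmap = INVALID_IO_BITMAP;
    }
    /* Neither has a customized bitmap: the TSS is left untouched. */
}
```

The third case is the common one, which is why the test on `prev` is worthwhile: most switches skip both the copy and the store.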
Terminates. Like any other function, __switch_to() ends by means of a ret assembly language instruction, which loads the eip program counter with the return address stored into the stack. However, the __switch_to() function has been invoked simply by jumping into it. Therefore the ret assembly language instruction finds on the stack the address of the instruction shown in the following item and labeled 1, which was pushed by the switch_to macro. If next was never suspended before because it is being executed for the first time, the function finds the starting address of the ret_from_fork() function (see Section 3.4.1 later in this chapter).
Includes a few instructions that restore the contents of the esi, edi, and ebp registers. The first of these three instructions is labeled 1:
    1:  popl %ebp
        popl %edi
        popl %esi
Notice how these pop instructions refer to the kernel stack of the prev process. They will be executed when the scheduler selects prev as the new process to be executed on the CPU, thus invoking switch_to with prev as the second parameter. Therefore, the esp register points to prev’s Kernel Mode stack.
Copies the content of the ebx register (corresponding to the last parameter of the switch_to macro) into the prev local variable:
    movl %ebx,prev
As discussed earlier, the ebx register points to the descriptor of the process that has just been replaced.
Starting with the Intel 80486DX, the arithmetic floating-point unit (FPU) has been integrated into the CPU. The name mathematical coprocessor continues to be used in memory of the days when floating-point computations were executed by an expensive special-purpose chip. To maintain compatibility with older models, however, floating-point arithmetic functions are performed with ESCAPE instructions, which are instructions with a prefix byte ranging between 0xd8 and 0xdf. These instructions act on the set of floating point registers included in the CPU. Clearly, if a process is using ESCAPE instructions, the contents of the floating point registers belong to its hardware context.
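Because the eight values from 0xd8 to 0xdf share their upper five bits, recognizing an ESCAPE instruction takes a single mask and compare. The helper below is a hypothetical illustration, not code taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the byte falls in the ESCAPE range
 * 0xd8..0xdf, that is, if its upper five bits are 11011. */
bool is_escape_opcode(uint8_t first_byte)
{
    bool by_mask = (first_byte & 0xF8u) == 0xD8u;

    /* The mask test and the plain range test always agree. */
    assert(by_mask == (first_byte >= 0xD8u && first_byte <= 0xDFu));
    return by_mask;
}
```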
In later Pentium models, Intel introduced a new set of assembly language instructions into its microprocessors. They are called MMX instructions and are supposed to speed up the execution of multimedia applications. MMX instructions act on the floating point registers of the FPU. The obvious disadvantage of this architectural choice is that programmers cannot mix floating-point instructions and MMX instructions. The advantage is that operating system designers can ignore the new instruction set, since the same facility of the task-switching code for saving the state of the floating-point unit can also be relied upon to save the MMX state.
MMX instructions speed up multimedia applications because they introduce a single-instruction multiple-data (SIMD) pipeline inside the processor. The Pentium III model extends such SIMD capability: it introduces the SSE extensions (Streaming SIMD Extensions), which add facilities for handling floating-point values contained in eight 128-bit registers (the XMM registers). Such registers do not overlap with the FPU and MMX registers, so SSE and FPU/MMX instructions may be freely mixed. The Pentium 4 model introduces yet another feature: the SSE2 extensions, which are basically an extension of SSE supporting higher-precision floating-point values. SSE2 uses the same set of XMM registers as SSE.
The 80x86 microprocessors do not automatically save the FPU, MMX, and XMM registers in the TSS. However, they include some hardware support that enables kernels to save these registers only when needed. The hardware support consists of a TS (Task-Switching) flag in the cr0 register, which obeys the following rules:
Every time a hardware context switch is performed, the TS flag is set.
Every time an ESCAPE, MMX, SSE, or SSE2 instruction is executed when the TS flag is set, the control unit raises a “Device not available” exception (see Chapter 4).
The TS flag allows the kernel to save and restore the FPU, MMX, and XMM registers only when really needed. To illustrate how it works, suppose that a process A is using the mathematical coprocessor. When a context switch occurs, the kernel sets the TS flag and saves the floating point registers into the TSS of process A. If the new process B does not use the mathematical coprocessor, the kernel won’t need to restore the contents of the floating point registers. But as soon as B tries to execute an ESCAPE or MMX instruction, the CPU raises a “Device not available” exception, and the corresponding handler loads the floating point registers with the values saved in the TSS of process B.
Let’s now describe the data structures introduced to handle selective loading of the FPU, MMX, and XMM registers. They are stored in the thread.i387 subfield of the process descriptor, whose format is described by the i387_union union:
    union i387_union {
        struct i387_fsave_struct fsave;
        struct i387_fxsave_struct fxsave;
        struct i387_soft_struct soft;
    };
As you see, the field may store just one of three different types of data structures. The i387_soft_struct type is used by CPU models without a mathematical coprocessor; the Linux kernel still supports these old chips by emulating the coprocessor via software. We don’t discuss this legacy case further, however. The i387_fsave_struct type is used by CPU models with a mathematical coprocessor and, optionally, an MMX unit. Finally, the i387_fxsave_struct type is used by CPU models featuring SSE and SSE2 extensions.
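A union fits here because only one save format is in use on a given CPU, so a single area sized for the largest format is enough. The sketch below models this with raw byte arrays. The union and function names are hypothetical, the software-emulation member is omitted, and the sizes are those of the memory images written by the hardware instructions (108 bytes for fnsave in 32-bit protected mode, 512 bytes for fxsave); the kernel structures themselves may carry extra fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FSAVE_IMAGE_BYTES  108   /* image written by fnsave (32-bit protected mode) */
#define FXSAVE_IMAGE_BYTES 512   /* image written by fxsave */

/* Hypothetical stand-in for the save area: both images share one storage. */
union fpu_area_sketch {
    uint8_t fsave[FSAVE_IMAGE_BYTES];
    uint8_t fxsave[FXSAVE_IMAGE_BYTES];
};

/* Returns how many bytes of the area are meaningful on a given CPU. */
size_t fpu_image_bytes(bool cpu_has_sse)
{
    size_t n = cpu_has_sse ? FXSAVE_IMAGE_BYTES : FSAVE_IMAGE_BYTES;

    /* Whatever the format, the image fits in the shared area. */
    assert(n <= sizeof(union fpu_area_sketch));
    return n;
}
```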
The process descriptor includes two additional flags:
The PF_USEDFPU flag, which is included in the flags field. It specifies whether the process used the FPU, MMX, or XMM registers in the current execution run.
The used_math field. This flag specifies whether the contents of the thread.i387 subfield are significant. The flag is cleared (not significant) in two cases, shown in the following list.
When the process starts executing a new program by invoking an execve() system call (see Chapter 20). Since control will never return to the former program, the data currently stored in thread.i387 is never used again.
When a process that was executing a program in User Mode starts executing a signal handler procedure (see Chapter 10). Since signal handlers are asynchronous with respect to the program execution flow, the floating point registers could be meaningless to the signal handler. However, the kernel saves the floating point registers in thread.i387 before starting the handler and restores them after the handler terminates. Therefore, a signal handler is allowed to use the mathematical coprocessor, but it cannot carry on a floating-point computation started during the normal program execution flow.
As stated earlier, the __switch_to() function executes the unlazy_fpu macro, passing the process descriptor of the process being replaced as an argument. The macro checks the value of the PF_USEDFPU flag of prev. If the flag is set, prev has used an FPU, MMX, SSE, or SSE2 instruction in this run of execution; therefore, the kernel must save the relative hardware context:
    if (prev->flags & PF_USEDFPU)
        save_init_fpu(prev);
The save_init_fpu() function, in turn, executes the following operations:
Dumps the contents of the FPU registers in the process descriptor of prev and then re-initializes the FPU. If the CPU uses SSE/SSE2 extensions, it also dumps the contents of the XMM registers and re-initializes the SSE/SSE2 unit. A couple of powerful assembly language instructions take care of everything, either:
    asm volatile( "fxsave %0 ; fnclex" : "=m" (tsk->thread.i387.fxsave) );
if the CPU uses SSE/SSE2 extensions, or otherwise:
    asm volatile( "fnsave %0 ; fwait" : "=m" (tsk->thread.i387.fsave) );
Resets the PF_USEDFPU flag of prev:
    prev->flags &= ~PF_USEDFPU;
Sets the TS flag of cr0 by means of the stts() macro, which in practice yields the following assembly language instructions:
    movl %cr0, %eax
    orl $8,%eax
    movl %eax, %cr0
The contents of the floating point registers are not restored right after a process resumes execution. However, the TS flag of cr0 has been set by unlazy_fpu(). Thus, the first time the process tries to execute an ESCAPE, MMX, or SSE/SSE2 instruction, the control unit raises a “Device not available” exception, and the kernel (more precisely, the exception handler invoked by the exception) runs the math_state_restore() function:
    void math_state_restore()
    {
        asm("clts"); /* clear the TS flag of cr0 */
        if (current->used_math) {
            restore_fpu(current);
        } else {
            /* initialize the FPU unit */
            asm("fninit");
            /* and also the SSE/SSE2 unit, if present */
            if (cpu_has_xmm)
                load_mxcsr(0x1f80);
            current->used_math = 1;
        }
        current->flags |= PF_USEDFPU;
    }
Since the process is executing an FPU, MMX, or SSE/SSE2 instruction, this function sets the PF_USEDFPU flag. Moreover, the function clears the TS flag of cr0 so that further FPU, MMX, or SSE/SSE2 instructions executed by the process won’t trigger the “Device not available” exception. If the data stored in the thread.i387 field is valid, the restore_fpu() function loads the registers with the proper values. To do this, either the fxrstor or the frstor assembly language instruction is used, depending on whether the CPU supports SSE/SSE2 extensions. Otherwise, if the data stored in the thread.i387 field is not valid, the FPU/MMX unit is re-initialized and all its registers are cleared. To re-initialize the SSE/SSE2 unit, it is sufficient to load a value into its MXCSR control register, which is what the load_mxcsr(0x1f80) call does.
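The interplay between the TS flag, PF_USEDFPU, and used_math is easier to follow in a small simulation. The sketch below is a user-space model under loud assumptions: a single `double` stands in for the whole FPU/MMX/XMM register set, the structures and `model_` functions are hypothetical, and the "Device not available" exception is modeled as an ordinary function call made when the TS flag is found set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process state. */
struct fpu_task {
    bool   used_fpu;   /* models the PF_USEDFPU flag */
    bool   used_math;  /* models the used_math field */
    double saved;      /* models thread.i387 */
};

/* Hypothetical per-CPU state. */
struct fpu_cpu {
    bool   ts;         /* models the TS flag of cr0 */
    double reg;        /* stands in for all FPU/MMX/XMM registers */
    int    saves;      /* number of register dumps performed */
    int    restores;   /* number of register reloads performed */
};

/* Models unlazy_fpu(prev): save only if prev touched the FPU in this run. */
void model_unlazy_fpu(struct fpu_cpu *cpu, struct fpu_task *prev)
{
    assert(cpu != NULL && prev != NULL);
    if (prev->used_fpu) {
        prev->saved = cpu->reg;   /* dump the registers */
        cpu->reg = 0.0;           /* re-initialize the unit */
        prev->used_fpu = false;   /* reset PF_USEDFPU */
        cpu->ts = true;           /* stts() */
        cpu->saves++;
    }
}

/* Models math_state_restore(), run by the exception handler. */
void model_math_state_restore(struct fpu_cpu *cpu, struct fpu_task *cur)
{
    cpu->ts = false;              /* clts */
    if (cur->used_math) {
        cpu->reg = cur->saved;    /* saved data is significant: reload it */
        cpu->restores++;
    } else {
        cpu->reg = 0.0;           /* first use: start from a clean unit */
        cur->used_math = true;
    }
    cur->used_fpu = true;         /* set PF_USEDFPU */
}

/* Models a floating-point instruction that writes a register. */
void model_fpu_store(struct fpu_cpu *cpu, struct fpu_task *cur, double value)
{
    assert(cpu != NULL && cur != NULL);
    if (cpu->ts)                  /* "Device not available" exception */
        model_math_state_restore(cpu, cur);
    cpu->reg = value;
}

/* Models a floating-point instruction that reads a register. */
double model_fpu_load(struct fpu_cpu *cpu, struct fpu_task *cur)
{
    assert(cpu != NULL && cur != NULL);
    if (cpu->ts)
        model_math_state_restore(cpu, cur);
    return cpu->reg;
}
```

In this model, switching from a process that used the FPU to one that never touches it costs one save and no restore; the restore is paid only when the first process runs again and actually issues a floating-point instruction.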
[20] far jmp instructions modify both the cs and eip registers, while simple jmp instructions modify only eip.
[21] The 80x86 debug registers allow a process to be monitored by the hardware. Up to four breakpoint areas may be defined. Whenever a monitored process issues a linear address included in one of the breakpoint areas, an exception occurs.