9.2. The Kernel Dispatcher

The dispatcher is the kernel code segment that manages queues of runnable kernel threads, called dispatch queues; places the highest-priority runnable thread on a processor for execution; and manages the recalculation of thread priorities that are based on execution time, sleep time, and time spent on the queue waiting for a processor. The context switching of threads on and off processors is driven by the dispatcher code.

Early versions of Solaris 2 implemented a simple, global dispatch queue that was shared by all processors on a multiprocessor system. Scalability and dispatch latency, especially for real-time threads, were less than optimal because of the coarse-grained locking. A single scheduling lock for the dispatch queue needed to be acquired for a processor to insert or remove a kthread from the queue, which became a point of contention on busy multiprocessor systems. A new design was introduced in Solaris 2.3, and since that release, Solaris implements multiple dispatch queues, one for each processor, and a global queue for threads that run at a priority high enough to cause kernel preemption. Such threads are real-time and interrupt threads (although interrupt threads are not placed on a dispatch queue—they live on a linked list off every CPU structure). Solaris 2.5 changed the algorithm for thread selection (which thread gets execution time next). We go through the select and ratify process of thread selection in our discussion of the dispatcher algorithm.

Another change came with the introduction of processor sets in Solaris 2.6. Processor sets are configured through psrset(1M): some number of processors are grouped into a set, and processes and kernel threads can be bound to the set. The bound processes and threads are scheduled only on processors in the set; other threads and processes on the system that are not bound to the processor set will not be scheduled on processors in the set. Multiple processor sets can be configured, with multiple bindings. (Processor sets are also discussed in Chapter 1, “System Overview.”) The addition of processor sets added a new data structure, cpupart, and a modification to the kernel preempt dispatch queues. Prior to Solaris 2.6, a single kernel preempt queue maintained runnable RT class threads. Beginning in Solaris 2.6, there is a kernel preempt queue for each processor set, or one queue systemwide if processor sets have not been configured.

The dispatcher uses data stored in different areas of the kernel to maintain scheduling information such as priorities, thread execution time, dispatch queue utilization and state, thread wait and sleep time, per-processor queue data and state, etc. The dispatch tables described in the previous section are just one area of the kernel where scheduling-specific information is stored. In addition to the tables, a scheduling-class-specific data structure is linked to every kthread on the system. These structures, illustrated in Figure 9.3, are used by the dispatcher for maintaining bits of information on each thread's execution time, state, and priority.

The class-specific data structures are linked not only to the kthread but also to each other. That is, the class-specific structures are linked on a linked list maintained by the dispatcher code. Many of the fields in the class-specific structures have the same name and meaning across all three scheduling classes (TS, IA, and RT; there is no structure for the SYS class). Hence, we describe each structure in Table 9-3 by using a single row where a structure member has a common name and meaning for more than one scheduling class. The structures take the name of xxproc, where xx is ts, rt, or ia.

Table 9-3. Scheduling-Class-Specific Data Structure Members
tsproc       iaproc       rtproc       Meaning
ts_timeleft  ia_timeleft  rt_timeleft  Time remaining in thread's time quantum
ts_dispwait  ia_dispwait  N/A          Wall clock elapsed time since quantum began; not reset if thread is preempted
ts_cpupri    ia_cpupri    N/A          Kernel component of the user mode priority
ts_uprilim   ia_uprilim   N/A          User priority limit
ts_upri      ia_upri      N/A          User priority
ts_umdpri    ia_umdpri    N/A          User priority within class
ts_nice      ia_nice      N/A          Nice value
N/A          N/A          rt_pquantum  Time quantum for the thread
N/A          N/A          rt_pri       Priority within RT class
ts_flags     ia_flags     rt_flags     State flags
ts_boost     N/A          N/A          Priority boost value (IA class)
N/A          ia_pstatp    rt_pstatp    Pointer to pstat
N/A          ia_pprip     rt_pprip     Pointer to thread priority
N/A          ia_pflagp    rt_pflagp    Pointer to p_flag
N/A          ia_mode      N/A          IA on/off flag
ts_tp        ia_tp        rt_tp        Pointer back to kthread
ts_next      ia_next      rt_next      Forward link
ts_prev      ia_prev      rt_prev      Backward link

As we mentioned earlier, the kernel maintains doubly linked lists of the class-specific structures—separate lists for each class. Current implementations of Solaris, up to and including Solaris 7, do not actually use the iaproc for interactive class threads (it's declared in the sys/ia.h header file but not defined anywhere in the kernel code). Threads in the IA class link to a tsproc structure, and most of the class supporting code for interactive threads is handled by the TS routines. IA threads are distinguished from TS threads by a flag in the ts_flags field, the TSIA flag.

Maintaining the linked lists for the class structures greatly simplifies the dispatcher supporting code that updates different fields in the structures (such as time quantum) during the clock-driven dispatcher code. Since most Solaris systems have considerably more TS threads than any other scheduling class, the tsproc lists are managed differently from the RT list.

The kernel builds an array of 16 tsproc structures (array name ts_plisthead) that anchor up to 16 doubly linked lists of the tsproc structures systemwide. The code implements a hash function, based on the thread pointer, to determine which list to place a thread on, and each list is protected by its own kernel mutex (ts_list_lock is an array of 16 kernel mutexes). Implementing multiple linked lists in this way makes for faster traversal of all the tsproc structures in a running system, and the use of a lock per list allows multiple kernel routines to traverse different lists at the same time (see Figure 9.4).
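The hashing scheme is easy to picture with a small sketch. The following user-land approximation assumes a simple shift-and-mask hash of the thread pointer; the actual kernel hash may use a different shift, so treat this as illustrative only.

#include <stdint.h>

#define TS_LISTS 16                     /* 16 tsproc lists, 16 locks */

/*
 * Map a thread pointer to one of the 16 tsproc lists. The low-order
 * bits of a kernel pointer are alignment bits and carry little
 * information, so they are shifted away before masking (the shift
 * amount here is an assumption, not the kernel's actual value).
 */
static int
ts_list_hash(const void *tp)
{
        return (int)(((uintptr_t)tp >> 9) & (TS_LISTS - 1));
}

An update pass then needs to hold only the one mutex in the ts_list_lock[] array that guards the chosen list, which is what allows multiple kernel routines to traverse different lists concurrently.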

Figure 9.4. tsproc Structure Lists


9.2.1. Dispatch Queues

Every processor on the system has its own dispatch queue (actually, a queue of queues), and per-processor scheduling information is maintained in the data structures that describe the processors in the kernel (the CPU structures). Data structures make up the actual dispatch queues, which provide the linkage to the lists of runnable kthreads on the system and state information for each queue. Threads in the TS, IA, and SYS classes are placed on the per-processor dispatch queues.

A separate queue is maintained for threads that are runnable at a priority high enough to cause a kernel preemption. The Solaris kernel is, for the most part, preemptable, such that a processor executing in kernel (system) mode can be preempted so a higher-priority thread can run. Currently, the global dispatch priorities for RT class threads and interrupt threads are above SYS and can generate a kernel preemption. Since interrupt threads do not actually go on a dispatch queue, only unbound RT class threads are placed on the kernel preempt (kp_preempt) queue. There is one such queue systemwide on systems that do not have processor sets configured. A kernel preempt queue is created when a processor set is configured, so there is a per-processor set kp_preempt queue, plus one for the default processor set. Since the kernel will not allow every available processor on a system to be placed in a processor set, the number of kernel preempt queues on a system will always be the number of processor sets configured, plus 1. Without processor sets configured, there is one kp_preempt queue systemwide. We discuss preemption later in more detail.

The creation of the dispatch queues occurs at boot time during Solaris system initialization. A kernel dispatcher initialization routine, dispinit(), is called from the architecture-specific startup code; it builds the kernel preemption queue and the per-CPU dispatch queues. The kernel data structures that make up the complete dispatch queue picture are the cpu, disp_queue_info, _disp, and dispq structures, as shown in Figure 9.5. It is during dispatcher initialization that the default processor partition is created, which is where the kernel preempt queue (cp_kp_queue) is maintained. In Solaris 2.5.1, which did not have processor sets, the global kernel preempt queue was maintained in a separate disp_t structure, disp_kp_queue.

Figure 9.5. Solaris Dispatch Queues


A quick note on terminology before we proceed. A processor partition is essentially the same thing as a processor set. A user's view of a processor set is what the kernel references as a processor partition. A different abstraction exists for a possible future implementation, where some processor partitions are visible to the kernel but not visible to the user, so what users see as available processor sets may be different from the kernel's view of usable processor partitions. A default processor partition, which includes all available processors on the system, is created at boot time. As processor sets are created, processors configured into a user processor set are removed from the default partition and added to the configured set. (Processor sets are discussed in “Kernel Processor Control and Processor Sets”.)

The kernel retains at least one processor for the default partition and does not allow all available processors to be configured into user-created processor sets. Kernel threads that are part of the operating system, such as the daemon threads created at boot time (Chapter 2), pageout, NFS server threads, etc., will not run on a user processor set. That's why a default partition is created at boot time and at least one processor is retained for the default partition. In practice, it is always prudent to assess processor requirements before allocating processor resources by means of processor sets. A system with sufficient processing power for bound application processes may run poorly if CPU resources for the kernel are inadequate.

Several dispatcher-related variables are embedded in each cpu structure.

  • cpu_disp — An embedded _disp structure; the dispatch queue data for the processor, which contains the link to the actual per-processor queues.

  • cpu_runrun — A preemption flag indicating that the thread currently running on the CPU will be preempted before switching back to user mode. The user preemption flag.

  • cpu_kprunrun — Another preemption flag, kernel preemption, indicating that the thread should be preempted immediately or as soon as possible (e.g., return from an interrupt). The kernel preemption flag.

  • cpu_chosen_level — The priority at which the CPU was chosen for scheduling a kthread. Used by the dispatcher when selecting the next thread for execution.

  • cpu_dispthread — A pointer to the kthread selected for execution on the CPU.

  • cpu_thread_lock — A dispatcher lock on the current thread.

  • cpu_last_switch — A time value: the last time the CPU switched execution to a new thread.
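Pulling the list together, a condensed C sketch of the dispatcher-related cpu structure fields might look like the following. The field names come from the list above; the types, their ordering, and the placeholder disp_t are approximations, not the kernel's actual definition.

typedef struct disp { int d_pad; } disp_t; /* stand-in; a fuller disp_t sketch
                                              appears later in this section */
typedef struct kthread kthread_t;
typedef long long hrtime_t;                /* stand-in for the kernel time type */

struct cpu_sketch {
        disp_t     cpu_disp;          /* embedded dispatch queue data          */
        char       cpu_runrun;        /* user preemption pending               */
        char       cpu_kprunrun;      /* kernel preemption pending             */
        int        cpu_chosen_level;  /* priority at which CPU was chosen      */
        kthread_t *cpu_dispthread;    /* thread selected for execution         */
        char       cpu_thread_lock;   /* dispatcher lock on current thread     */
        hrtime_t   cpu_last_switch;   /* last time a new thread was switched on */
};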

The dispatcher queue structure embedded in the cpu structure is where the per-CPU dispatch queue is rooted. It links to the per-priority dispatch queues maintained for each processor. Within each processor dispatch queue, separate linked lists of runnable kthreads are maintained for every priority level; real-time threads are the exception. Unbound real-time threads are maintained on the global kernel preempt queue. If processor sets are configured, then a kernel preempt queue is maintained for each processor set.

Other data the dispatcher needs for queue management and thread scheduling maintained in the dispatch queue structure includes the following.

  • disp_lock — A synchronization lock to protect fields in the structure.

  • disp_npri — The number of priority levels in the queue.

  • disp_q — The pointer to the dispatch queue (pointer to a dispq_t structure).

  • disp_q_limit — Another pointer to a dispq_t structure, set to point to the end of the queue.

  • disp_qactmap — A bitmap indicating which queues actually have runnable threads in them. Using the bitmap in conjunction with disp_maxrunpri and disp_nrunnable, the dispatcher code can traverse the queues with extreme speed and efficiency, searching for the best runnable thread.

  • disp_maxrunpri — The priority of the highest-priority thread on the queue.

  • disp_max_unbound_pri — The priority of the highest-priority unbound thread on the queue.

  • disp_nrunnable — Total number of runnable threads on the queue.

  • disp_cpu — A pointer back to the cpu structure that owns this queue.

The dispatch queue entry itself is a small, three-member data structure, defined as a dispq_t structure. It contains a pointer to the first kthread on the queue (dq_first), a pointer to the last kthread on the queue (dq_last), and a count of kthreads on the queue (dq_sruncnt). Each per-priority queue maintains a count of threads on the queue, and the sum of all per-priority queues is represented in disp_nrunnable.
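The following C sketch combines the dispatch queue fields described above into the two structures. The field names are as listed; the lock and priority types are approximations of the kernel's.

typedef struct kthread kthread_t;
typedef unsigned char disp_lock_t;         /* approximated lock type     */
typedef short pri_t;                       /* approximated priority type */

typedef struct dispq {
        kthread_t *dq_first;               /* first thread on this priority queue */
        kthread_t *dq_last;                /* last thread on this priority queue  */
        int        dq_sruncnt;             /* runnable threads on this queue      */
} dispq_t;

typedef struct disp {
        disp_lock_t    disp_lock;             /* protects this structure         */
        int            disp_npri;             /* number of priority levels       */
        dispq_t       *disp_q;                /* array: one dispq_t per priority */
        dispq_t       *disp_q_limit;          /* end of the disp_q array         */
        unsigned long *disp_qactmap;          /* bitmap of nonempty queues       */
        pri_t          disp_maxrunpri;        /* highest priority on the queue   */
        pri_t          disp_max_unbound_pri;  /* highest unbound priority        */
        int            disp_nrunnable;        /* total runnable threads          */
        struct cpu    *disp_cpu;              /* back-pointer to owning CPU      */
} disp_t;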

Finally, there's the disp_queue_info structure. An array of these structures is created at boot time, one for each processor on the system. Each array member (disp_queue_info structure) contains a pointer to one per-processor dispatch queue (disp_t) embedded in the cpu structure. There are two additional disp_t pointers, olddispq and newdispq, and storage space for an old and new queue bitmap field, oldqactmap and newqactmap, for queue manipulation and maintenance. The number of global priorities in the queue is also maintained in oldnglobpris. As we move through the algorithmic flow of the dispatcher code, we'll see how these structures and members are used. Figure 9.5 provides the big picture.

9.2.2. Thread Priorities

The operating system uses a global priority scheme to assign priorities to kernel threads; the priorities are derived from a dispatch table according to the scheduling class the thread belongs to. As we saw, each scheduling class has a range of priorities assigned to it. As scheduling classes load into the system, the kernel must ensure that every priority level across all loaded scheduling classes has a unique priority number.

The range of priorities for the TS/IA class is 0–59, which corresponds to global priorities 0–59 since the TS/IA class is the lowest class on the system. SYS priorities range from 0–39 and are assigned global priorities 60–99 since the SYS class sits directly above the TS/IA class. If the RT class is not loaded, interrupt priorities, range 0–9, occupy global priorities 100–109. Thus, without the RT class loaded, there is a total of 110 (0–109) global priorities. When the RT class (range 0–59) loads, the interrupt thread priorities are bumped up to global priorities 160–169, and the RT global priority range is 100–159. This protocol is represented in Figure 9.2 and Table 9-1.
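The mapping is simple offset arithmetic. The helper functions below restate the ranges from the preceding paragraph for the case where the RT class is loaded; they are purely illustrative, since the kernel computes these values from the classes actually loaded rather than hardcoding them.

/* class-relative priority -> global priority, with the RT class loaded */
int ts_to_global(int pri)   { return pri;        /* TS/IA 0-59 -> 0-59    */ }
int sys_to_global(int pri)  { return 60 + pri;   /* SYS   0-39 -> 60-99   */ }
int rt_to_global(int pri)   { return 100 + pri;  /* RT    0-59 -> 100-159 */ }
int intr_to_global(int pri) { return 160 + pri;  /* INTR  0-9  -> 160-169 */ }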

During dispatcher initialization, the kernel calculates the maximum SYS global priority value and the total number of global priorities as scheduling classes are loaded. These values are recalculated if scheduling classes are loaded following initialization, for example, if the RT class is loaded sometime after the system has been booted. The calculated values are stored in the system var structure's v_nglobpris and v_maxsyspri fields.

# /etc/crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> v
.
.
v_nglobpris: 170
v_maxsyspri:  99
.
.

Newly created threads inherit their priority and scheduling class from the parent process (parent thread, actually—remember, nonthreaded processes have one LWP/kthread pair). Recall from the previous chapter that the forklwp() kernel routine calls lwp_create(), which in turn calls thread_create(). thread_create() is passed a thread state (TS_STOPPED—the initial state of a kthread) and a priority, which is established from the calling thread. The other applicable kthread structure members are initialized toward the end of the forklwp() code, including the allocation of the class-specific data structure (e.g., tsproc). Entry into the class-specific fork routine through CL_FORK() completes the setup of the dispatcher data for the new thread, initializing the class data structure for the thread and putting it on the linked list of class structures.
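CL_FORK() is one of a family of macros that indirect through a per-thread class-operations vector, so common kernel code can invoke class-specific behavior without knowing which class a thread belongs to. The sketch below conveys the idea only; the macro's actual argument list and the vector's layout in the kernel differ in detail.

typedef struct kthread kthread_t;

struct classfuncs {                    /* schematic; the real vector is larger */
        int (*cl_fork)(kthread_t *parent, kthread_t *child);
        /* ... cl_sleep, cl_wakeup, cl_tick, cl_setrun, and so on ... */
};

struct kthread {
        struct classfuncs *t_clfuncs;  /* per-class operations vector */
        /* ... many other fields ... */
};

#define CL_FORK(parent, child) \
        ((parent)->t_clfuncs->cl_fork((parent), (child)))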

The kernel thread structure fields relevant to thread priorities are t_pri, which is the actual global dispatcher priority for the thread, t_epri, which reflects an inherited priority, and t_kpri_req, which is a flag to indicate to the dispatcher that a SYS priority is requested for the thread. Other per-thread priority-related data is maintained in the class-specific data structure linked to the kthread: tsproc for TS/IA threads, rtproc for RT threads. For TS/IA threads, the kernel provides a facility for user-level priority adjustments, similar in concept to the traditional nice(1) command. Unix systems provided a nice(1) command that allowed users to adjust the priority of the processes they owned. Non-root users could only make their own process priorities worse (they could be nice to the other users of the system), and root could make priorities better with nice(1). Solaris supports nice(1) today for compatibility, but the proper way to apply user-level priorities to threads is to use the priocntl(1) command. Issuing priocntl(1) with the -l (lowercase l) flag displays the range of user priorities.

# priocntl -l
CONFIGURED CLASSES
==================

SYS (System Class)

TS (Time Sharing)
        Configured TS User Priority Range: -60 through 60

IA (Interactive)
        Configured IA User Priority Range: -60 through 60

RT (Real Time)
        Maximum Configured RT Priority: 59

The user priority ranges shown in the preceding priocntl(1) command output clearly do not reflect the range of priorities for the TS/IA class from a kernel perspective. The user priority facility allows a user to move a process or thread priority in a positive (better) or negative (worse) direction. The priority value specified on the command line can be as high as 60 (maximum potential priority) or as low as −60 (lowest possible priority). A priority value specified on a priocntl(1) command line does not become the process's or thread's actual global dispatch priority (as seen with ps(1), e.g., ps -Lc), but rather is used by the dispatcher code to adjust a kernel thread's priority. A user-supplied priority will alter the global dispatcher priority of the thread in one direction or another, depending on the value specified.

For RT class threads or processes, the valid range of priorities is 0–59; the priocntl(1) output shown above indicates a maximum value of 59. The RT class does not support the notion of user mode priorities. Priority tweaking of RT threads involves simply specifying a priority value in the RT class range; that value is then directly reflected in the global dispatch priority of the process. For example, the global priority range for RT is 100–159, where global priority 100 corresponds to RT priority 0, global priority 120 corresponds to RT priority 20, and so on. So, below, we used priocntl(1) to start a program in real time and specified a priority.

# priocntl -e -c RT -p 59 ./hog &
1016
# ps -Lc
   PID   LWP  CLS PRI TTY     LTIME CMD
  1016     1   RT 159 pts/2    0:03 hog
  1015     1   TS  48 pts/2    0:00 sh
  1017     1   TS  48 pts/2    0:00 ps

We specified a priority value of 59 on the command line. Note the actual global dispatch priority of the process we started is 159. This value results from the kernel turning the specified priority into a global dispatch priority. In the case of RT processes and threads, the kernel uses the user-supplied RT priority value as an index into the RT dispatch table to retrieve the corresponding global dispatch priority, as illustrated in Figure 9.6.

Figure 9.6. Setting RT Priorities


An invocation of the priocntl(1) command to execute a process as a real-time process defaults to the base RT class priority, which is 0, resulting in a global dispatch priority of 100—the value in the 0th entry of the real-time dispatch table (rt_dptbl). The example in Figure 9.6 specifies a value of 59, which the kernel will plug into the rt_pri field of the thread's rtproc structure and use as an index into the dispatch table, rt_dptbl. The corresponding value is used to set the thread's actual global dispatch priority, 159 in this example. A kernel thread_change_pri() function is called to set the dispatcher priority in the kthread's t_pri field.
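In code terms, the sequence reduces to a table lookup. The sketch below follows the structure and field names used in this chapter but is not the literal kernel source; the thread_change_pri() signature in particular is an assumption.

typedef struct kthread kthread_t;
struct rtproc_pri { int rt_pri; };          /* rt_pri field of rtproc         */
struct rt_dpent { int rt_globpri; int rt_quantum; };
extern struct rt_dpent rt_dptbl[60];        /* RT priorities 0-59             */
extern void thread_change_pri(kthread_t *, int, int);  /* assumed signature  */

void
set_rt_priority(kthread_t *tp, struct rtproc_pri *rtp, int user_pri)
{
        rtp->rt_pri = user_pri;             /* e.g., 59 from priocntl -p 59   */
        /* the default table maps RT priority 59 to global priority 159      */
        thread_change_pri(tp, rt_dptbl[rtp->rt_pri].rt_globpri, 0);
}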

User-settable priorities for RT threads are the easy case, which is why we talked about that first. The SYS class is even easier: user-level tweaking of SYS class thread priorities does not exist, period. Which brings us back to TS/IA class threads. First, a quick note before we move on. Kernel threads in the same process can run in different scheduling classes and at different priorities. A multithreaded process can have LWP/kthread pairs in the TS, IA, SYS, and RT classes. The priocntl(1) command, however, does not provide that level of granularity for control: it provides scheduling class and priority control only at the process level, not per LWP/kthread. The priocntl(2) system call has facilities for specifying an LWP ID versus a PID, so code can alter scheduling classes and priorities at the LWP/kthread level.

Several fields in the tsproc structure support the implementation of user mode priorities for TS threads: ts_umdpri, the user priority within the TS class range; ts_upri, the user priority value; ts_uprilim, a boundary (limit) on user priority values; and ts_cpupri, the kernel component of user priorities. ts_cpupri exists essentially to give the kernel some control over user-adjusted priorities (think of it as a throttle). It is used in conjunction with the user priority, ts_umdpri, to index into the TS dispatch table to set new values for both ts_cpupri and the thread's actual priority, t_pri. The ts_uprilim field exists to enable users to place their own throttle on user mode priorities.

The dispatcher uses these values in the kernel for boundary checking on user-supplied priorities and for adjusting a thread's priority as a result of the priocntl() call. Once user priorities are set, they remain in use for subsequent thread priority adjustments. ts_cpupri is set to 29 when the LWP/kthread is initialized; ts_upri and ts_uprilim are initialized to 0. Without an explicit priocntl(1) or priocntl(2) command issued on a process or kthread, the user mode priorities do not factor into the normal priority adjustments done during the execution life of a kernel thread. If an LWP/kthread has a user priority set and issues a fork(), then the new process LWP/kthread inherits the user-supplied values from the parent thread.

RT class thread priorities are fixed, as opposed to TS/IA thread priorities, which are adjusted in many places during the life of a thread: after initialization on return from a fork(), when a thread is made runnable by ts_setrun(), upon sleep and wakeup, upon return from a trap, during clock tick processing, prior to a context switch, and during a regular thread update that is executed through the callout queue feature from the clock interrupt handler. The actual priority adjustment is, in some cases, contingent on a conditional test in the code, for example, has a thread been waiting to get on a dispatch queue for an inordinate amount of time? We'll look at some examples to illustrate the mechanics of establishing a thread's priority, starting with TS and IA class threads.

Refer to Figure 9.7, which illustrates the following sequence.

Figure 9.7. Setting a Thread's Priority Following fork()


  1. After the creation and initialization of the LWP/kthread (fork()), a fork return function is called to set the kthread's priority and place it on a dispatch queue. The inherited ts_cpupri value indexes into the TS dispatch table.

  2. A new ts_cpupri value is set, based on the ts_tqexp value in the indexed table location.

  3. The ts_timeleft value is set on the basis of the allotted time quantum for the priority level.

  4. ts_dispwait is set to 0. A user mode priority is calculated, setting the ts_umdpri value, which is then used as an index into the TS/IA dispatch table to establish the thread's actual dispatch priority.

  5. t_pri is set from the ts_globpri (global priority) value in the indexed table location.

The calculation of the user mode priority (step 4 in the sequence), which is done whenever a TS/IA thread priority adjustment is required, is simple arithmetic, summing the ts_cpupri, ts_upri, and ts_boost values. ts_boost is used for IA class threads as a means of boosting the priority by a value of 10. ts_umdpri is set to the sum of ts_cpupri, ts_upri, and ts_boost and is then bounded: it cannot be greater than the maximum TS/IA class priority, 59, or less than 0.

The other relevant step taken in the fork return code is the setting of the ts_timeleft and ts_dispwait fields in the kthread's tsproc structure. ts_timeleft tracks execution time when the kthread is running on a processor and is decremented in the clock tick handler (if the thread is actually running on a processor, of course). When ts_timeleft counts down to 0, the kthread has used its time quantum and is context-switched off the processor. ts_dispwait tracks how much time the thread has spent sitting on a dispatch queue, waiting to get context-switched onto a processor for execution. For all threads on a dispatch queue, ts_dispwait is updated once a second by the ts_update() thread, which is run out of the callout queue from the clock interrupt handler (more on this in a minute).
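Condensing the five steps and the two timekeeping fields into one sketch gives the following. The ts_dpent row layout follows the dispatch table description in this chapter, and the 0 to 59 clamp comes from the user mode priority discussion above; this is a sketch, not the kernel source.

struct ts_dpent {                    /* one TS/IA dispatch table row (subset) */
        int ts_globpri;              /* global dispatch priority              */
        int ts_quantum;              /* time quantum, in clock ticks          */
        int ts_tqexp;                /* new ts_cpupri when quantum expires    */
};
extern struct ts_dpent ts_dptbl[60];

struct tsproc_sketch {
        int ts_cpupri, ts_upri, ts_boost, ts_umdpri;
        int ts_timeleft, ts_dispwait;
};

/* fork-return priority work for the child thread (steps 1-5 above);
 * returns the value destined for the kthread's t_pri */
int
ts_fork_return_pri(struct tsproc_sketch *tsp)
{
        tsp->ts_cpupri = ts_dptbl[tsp->ts_cpupri].ts_tqexp;     /* step 2 */
        tsp->ts_timeleft = ts_dptbl[tsp->ts_cpupri].ts_quantum; /* step 3 */
        tsp->ts_dispwait = 0;                                   /* step 4 */
        tsp->ts_umdpri = tsp->ts_cpupri + tsp->ts_upri + tsp->ts_boost;
        if (tsp->ts_umdpri > 59)
                tsp->ts_umdpri = 59;        /* clamp to TS/IA class maximum */
        if (tsp->ts_umdpri < 0)
                tsp->ts_umdpri = 0;
        return (ts_dptbl[tsp->ts_umdpri].ts_globpri);           /* step 5 */
}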

When the priority work has been done and t_pri in the kernel thread has a valid dispatch priority, the thread is ready to be placed on a dispatch queue; such placement is ultimately done by one of the dispatcher queue management functions, setfrontdq() or setbackdq(). How these functions work and whether a kthread ends up at the front or back of a queue are matters we'll get into in just a bit.

Once a kthread is off and running, it can go through potentially many state changes that determine how and where priority adjustments are made. A typical thread spends time executing, gets interrupted, gets preempted, issues a blocking system call causing the thread to be placed on a sleep queue, and so forth. Class-specific functions are called and act appropriately as dictated by the transition and the dispatcher-related values in the thread's tsproc structure for priority adjustment. For example, if a thread burns through its allotted time quantum (as set from the ts_quantum value in the dispatch table, Figure 9.7, step 3), the kthread is switched off the processor and given a new dispatch priority.

The scheduling-class-specific clock tick handler is called from the system clock interrupt handler for each CPU that has a noninterrupt thread running on it. When the ts_tick() function is entered, the ts_timeleft value is decremented and tested. If ts_timeleft reaches 0, the time quantum has expired for the kthread. This is the case for TS/IA threads that are not at a SYS class priority—threads running at a SYS class priority are not time sliced. Assuming an expired time quantum, the code tests to see if a scheduler activation has been turned on for the thread, in the form of preemption control.

Solaris provides an innovative facility, called preemption control, for stretching a thread's time quantum for a short time. Preemption control does not require altering priorities or dispatcher table values. Using a simple API— schedctl_init(3X), schedctl_start(3X), schedctl_stop(3X)—programmers can develop code that initializes and turns on a preemption control for a kthread. What this does is effectively give the kthread a few extra clock ticks of execution time on a processor, beyond its time quantum, before it is switched off. This feature addresses situations where a thread is holding a critical resource, such as a mutex lock (an application-level mutex lock, not a kernel mutex lock) and a few extra ticks of execution time will allow the kthread to complete its task and free the lock. Otherwise, if the thread is taken off the processor before releasing the resource, other threads that need the same resource will begin to block, waiting for the resource to be freed. This behavior requires that the thread holding the resource work its way up the dispatch queue for rescheduling.

The use of preemption control in application code is illustrated in the following fragment, which turns preemption control on and then off around the acquisition and freeing of a mutex lock. (schedctl_init(3X) returns a handle that is then passed to schedctl_start(3X) and schedctl_stop(3X).)

#include <synch.h>
#include <schedctl.h>

mutex_t mylock;
schedctl_t *scp;

mutex_init(&mylock, USYNC_THREAD, NULL);
scp = schedctl_init();          /* set up preemption control for this LWP */
.
.
schedctl_start(scp);            /* activation on */
mutex_lock(&mylock);

        /* do work while holding the mutex */

mutex_unlock(&mylock);
schedctl_stop(scp);             /* activation off */

Without the activation turned on to enable preemption control, there's a greater possibility that the kthread will be forced to surrender the processor before it is able to free the lock. Think of preemption control as a means of telling the dispatcher “I'm about to grab a critical resource, so grant me a little extra execution time.”

Getting back to the ts_tick() code, we've entered the main processing segment of ts_tick() because the kthread was not running at a SYS class priority and its time quantum has expired. If preemption control has been turned on for the kthread, it is allowed an extra couple of clock ticks to execute, no priority tweaks are done, and ts_tick() is finished with the thread. There is a limit to how many additional clock ticks a kthread with preemption control turned on will be given. If that limit has been exceeded, the kernel sets a flag such that the thread will get one more time slice and on the next pass through ts_tick(), the preemption control test will fail and normal tick processing will be done. In this way, the kernel does not allow the scheduler activation to keep the thread running indefinitely. If the requirement is that the thread must stay on the processor indefinitely, the RT class should be used.

Regular TS/IA tick processing involves adjusting the thread priority in the same way the priority is set in the fork return code (Figure 9.7). ts_cpupri is used as an index into the TS dispatch table and is assigned a new value that is based on ts_tqexp from the indexed location. The user mode priority is calculated, ts_dispwait is set to 0, and a new dispatcher priority is derived from the TS/IA dispatch table. The new priority is based on the global priority value in the table row corresponding to ts_umdpri, which is used as the dispatch table array index. What's different between the ts_tick() handler and the fork return scenario is that the thread's priority is not set directly in ts_tick(), but rather in a thread_change_pri() kernel function. A change in a thread's priority may warrant a change in its position on a queue; thread_change_pri() handles such a case. In the fork return, we are dealing with a new thread that has not yet been on a queue, so it's not an issue.
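A condensed sketch of this tick path follows, reusing the ts_dpent and tsproc_sketch types from the fork-return sketch earlier; again, the thread_change_pri() signature is an assumption, and the SYS-priority and preemption-control tests are omitted.

typedef struct kthread kthread_t;
extern void thread_change_pri(kthread_t *, int, int);   /* assumed signature */

static int
clamp59(int pri)
{
        return (pri > 59 ? 59 : (pri < 0 ? 0 : pri));
}

/* quantum-expiration path of ts_tick() for a thread not at a SYS priority
 * and not under preemption control */
void
ts_tick_sketch(kthread_t *t, struct tsproc_sketch *tsp)
{
        if (--tsp->ts_timeleft > 0)
                return;                       /* quantum not yet consumed */
        tsp->ts_cpupri = ts_dptbl[tsp->ts_cpupri].ts_tqexp;  /* worse value */
        tsp->ts_umdpri = clamp59(tsp->ts_cpupri + tsp->ts_upri + tsp->ts_boost);
        tsp->ts_dispwait = 0;
        thread_change_pri(t, ts_dptbl[tsp->ts_umdpri].ts_globpri, 0);
}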

By design for TS/IA class threads, the adjusted priority in the preceding scenario will end up being something worse than the thread was previously. If a thread consumed its entire time quantum, it gets a worse priority to allow other threads a better opportunity to obtain processor time. You can test this behavior by plugging in some actual values, walking through the adjustment sequence as we described it, and figuring out where the thread's priority lands after the thread uses a full time quantum.

Here's a quick example, assuming a simple case of no user mode priority.

  • A kthread starts with a ts_cpupri value of 29, which puts it right in the middle of the TS/IA priority range (which we're assuming is why the default value of 29 was chosen). The ts_tqexp value at table location 29 (row 29, counting from 0) is 18.

  • Thus, the new ts_cpupri value is 18, which also becomes the ts_umdpri value when the user mode priority is calculated—since ts_upri and ts_boost are zero values (no user priority, and it's a TS class thread, so no priority boost).

  • The global dispatch priority at index location 18 is 17, so the kthread's initial priority (t_pri) is 17, with a 120 clock tick time quantum (derived by using ts_cpupri as an index).

  • When 120 ticks are gone and ts_timeleft has clicked down to 0, we'll essentially repeat the steps to get the new priority.

  • ts_cpupri is now 18, the ts_tqexp value at the corresponding table location is 7, so the new ts_cpupri value is 7, which again becomes ts_umdpri after the user mode priority is calculated.

  • The global dispatch priority at array index 7 is 6, so the kthread's priority after it uses its time quantum goes from 17 to 6, a worse priority. Its time quantum is larger, at 160 clock ticks, so it will wait longer to run but will get more clock ticks when it does execute. The behavior is consistent with the targeted behavior of timeshare class threads. As threads get more execution time, their priority is worsened; as threads spend more time waiting to run, their priority is improved.
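The walk-through above can be replayed mechanically. The snippet below hardcodes just the three table rows quoted in the example (row 29's ts_tqexp, and the globpri/quantum/tqexp values for rows 18 and 7); consult the ts_dptbl on a live system for the full table.

#include <stdio.h>

/* {ts_globpri, ts_quantum, ts_tqexp} for the rows used in the example */
static struct { int globpri, quantum, tqexp; } ts_row[] = {
        [7]  = {  6, 160, -1 },     /* tqexp unused in this replay  */
        [18] = { 17, 120,  7 },
        [29] = { -1,  -1, 18 },     /* only ts_tqexp used at row 29 */
};

int
main(void)
{
        int cpupri = 29;                        /* default initial value */

        cpupri = ts_row[cpupri].tqexp;          /* fork return: 29 -> 18 */
        printf("t_pri=%d quantum=%d\n",
            ts_row[cpupri].globpri, ts_row[cpupri].quantum);  /* 17, 120 */

        cpupri = ts_row[cpupri].tqexp;          /* quantum used: 18 -> 7 */
        printf("t_pri=%d quantum=%d\n",
            ts_row[cpupri].globpri, ts_row[cpupri].quantum);  /* 6, 160  */
        return (0);
}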

The second case we'll look at centers around kthread sleep and wakeup. Specifics of the sleep queue implementation and wakeup mechanism are discussed in the next section, but we can still examine the priority component of those state changes without detracting from either discussion.

  • The ts_sleep() function handles the preparation of a TS/IA class thread for sleep. The code tests the t_kpri_req flag (kernel priority requested), which will be set if the thread is holding a kernel RW lock or an exclusive page lock on a memory page when the sleep function is invoked. If t_kpri_req is set, the thread priority, t_pri, is set to 60, the lowest SYS class priority, which puts it above all TS/IA class priorities.

  • The kthread's TSKPRI flag is set to indicate that the thread is at a SYS priority.

    If a kernel priority has not been requested for the thread, the ts_sleep() code tests the ts_dispwait value (how long the thread has been waiting on a dispatcher queue) against the ts_maxwait value in the dispatch table, as indexed by ts_umdpri, which will index the same row used to retrieve the global priority the last time it was set. The test looks something like this.

    if (kthread->tsproc.ts_dispwait > ts_dptbl[ts_umdpri].ts_maxwait)
    

    That condition is true if the thread has been waiting a relatively long time for processor cycles.

  • The priority is recomputed, as previously described, with one significant difference. The ts_cpupri value is set from the ts_slpret (sleep return priority) column in the TS/IA dispatch table, not the ts_tqexp column, as was the case in the previous example. The ts_slpret priorities are high, so an awakened thread is scheduled sooner than most other TS/IA threads sitting on the dispatch queues.

  • The final scenario in the ts_sleep() code is entered if the thread has not had a kernel priority requested and if the thread has not been waiting longer than ts_maxwait (the first two scenarios just described).

    If those two conditions are not true and if the thread is already at a SYS priority, the thread's priority is set back to a TS/IA class priority, with ts_umdpri used as an index into the TS/IA dispatch table and t_pri set to the corresponding global priority.

    If none of the three conditions is true—which means the thread is not at a SYS priority, is not required to be assigned a SYS priority, and has not been waiting an inordinate amount of time—then the ts_sleep() code essentially does nothing and simply returns to the caller without having altered any of the thread's dispatcher properties. The thread will get another shot at having its priority tweaked in the wakeup code.

  • The wakeup function places the kthread on a dispatch queue and adjusts the thread priority only if the thread wait test described previously is true; that is, the thread's ts_dispwait value was greater than ts_maxwait, again using ts_umdpri to retrieve the ts_maxwait value from the dispatch table.

    The wakeup code also resets the ts_dispwait value back to 0 after adjusting the thread's priority, before calling the dispatcher queue functions for queue insertion. The actual priority adjustment is done by use of the ts_slpret dispatch table value to set ts_cpupri, as was the case in the ts_sleep() code for a long-waiting thread.

    ts_timeleft is set to ts_quantum, and t_pri is assigned a dispatch priority after the user priority arithmetic is done, with the ts_umdpri value determining from which row of the dispatch table the global priority is retrieved, as shown in the sequence in Figure 9.8.

    Figure 9.8. Priority Adjustment with ts_slpret

  • The kernel runs a ts_update() routine once per second with the callout facility. ts_update() walks through a linked list of kernel threads and updates the thread's ts_dispwait value for TS/IA class threads (RT and SYS class threads are not updated).

    The update code alternates across the multiple linked lists of tsproc structures (see Figure 9.4), starting at ts_plisthead, traversing one list at a time. If the thread's updated ts_dispwait is greater than ts_maxwait, the thread's priority is bumped by the methods already described but with a different column in the dispatch table to reset ts_cpupri.

    ts_lwait in the dispatch table is fetched from the indexed row to set the new ts_cpupri value, the user mode priority is calculated, and ts_dispwait is reset to 0.

    The ts_update() function calls ts_change_priority(), which does a little extra work on behalf of the update process.

• If the thread is currently running on a processor (thread state is ONPROC), then a new global priority is set in t_pri: the ts_umdpri value derived in the ts_update() function is used to fetch the global priority from the dispatch table.

• If the thread's new priority is greater than the highest-priority thread sitting on the processor's dispatch queue (the dispatcher structure disp_maxrunpri, which is always updated when a thread is placed on a dispatch queue, is used for the test), then the thread has its ts_timeleft parameter reset from ts_quantum (in the dispatch table) and continues to run.

• If a higher-priority thread is sitting on the dispatch queue, the thread is forced to surrender the processor.

  • The other possible condition in ts_change_priority() is that the thread is not running on a processor, in which case thread_change_pri() is called to set the new priority. A thread in the SLEEP state (sitting on a sleep queue) will have its priority changed by a synchronization object-specific function.

    For example, if the thread is sleeping on a mutex lock, the mutex_change_pri() code is invoked to change the priority and reposition the thread on the sleep queue if necessary (more on this in “The Kernel Sleep/Wakeup Facility”). Otherwise, the thread must be sitting on a dispatch queue. The priority is set, and one of the dispatcher queue insertion functions is called to set the thread's position on the dispatch queue.

This entire process is illustrated in the pseudocode listed below.

ts_update()
        set list from ts_plisthead[] /* lists of tsproc structures */
        ts_update_list()
                while (not at the end of the current list)
                        if (thread is not in TS or IA class)
                                bail out
                        if (thread has preemption control turned on)
                                bail out
                        if (thread is not sitting on a dispatch queue)
                                set thread flags for post trap processing
                                bail out
                        kthread->tsproc.ts_cpupri = ts_dptbl[ts_cpupri].ts_lwait
                        calculate new user mode priority (ts_umdpri)
                        kthread->tsproc.ts_dispwait = 0
                        if (the thread's priority will change based on new ts_umdpri)
                                ts_change_priority()
                end while loop
ts_change_priority()
        if (thread is running on a processor) /* state is ONPROC */
                kthread.t_pri = ts_dptbl[ts_umdpri].ts_globpri
                if (thread's priority > max runnable priority on dispatch queue)
                        kthread->tsproc.ts_timeleft = ts_dptbl[ts_umdpri].ts_quantum
                else
                        set flag for back of dispatch queue
                        surrender cpu
        else /* thread is not running */
                set front dispatch queue flag if IA class thread
                thread_change_pri()
                if (thread was on a run queue)
                        kthread->tsproc.ts_timeleft = ts_dptbl[ts_umdpri].ts_quantum
                else
                        set dispatch back queue flag
thread_change_pri()
        if (thread is not on a sleep queue or run queue)
                set new priority and return
        if (thread is on a sleep queue)
                call synchronization object specific priority change code
        else
                take the thread off the dispatch queue
                set the new priority
                call a dispatcher queue insertion function /* put it back on a queue */

Having examined a couple of TS/IA-specific functions that occur at regular intervals—ts_tick() and ts_update(), both driven from the clock interrupt handler—we need to ensure that the distinction is clear between what each function is intended to do. ts_tick() is designed to process threads running on a processor. ts_update() is designed to process threads that are not running on a processor but rather are sitting on a dispatch queue or sleep queue. In case you got lost in some of the detail of the preceding discussions, we thought it a good idea to drive this salient point home.

As we pointed out earlier, IA class threads are, for the most part, processed by the TS class functions just described. The class-specific functions defined for the IA class include code to initialize the IA class and retrieve class-specific information and support code for setting and fetching class parameters. The ia_init() code is minimal, as most of the IA class work is done in the TS class code. ia_init() sets its scheduling class ID, a pointer to its class functions table, and its maximum global priority (which is 59, same as with TS). The user mode priority support functions, ia_parmsin(), ia_parmsset(), and ia_parmsget(), track the equivalent TS support code in flow and function.

Processes are put in the IA class by an IA class-specific routine, ia_set_process_group(), which is called from a STREAMS ioctl() (strioctl()—ioctls are I/O control calls, supported by all character device drivers) when a TTY is taken over by a new process group. It is rooted in the STREAMS-based terminal driver code in the kernel. If you were to boot your Solaris desktop system and not start a windowing system, you would not have any IA processes running (just TS and the SYS class Solaris daemons). When the windowing system starts, it takes over control of the “terminal,” which in the case of a desktop is a keyboard, mouse, and the graphics card interface to the monitor.

The takeover generates the set-process-group ioctl call, which ultimately calls the CL_SET_PROCESS_GROUP macro. This macro resolves to the ia_set_process_group() IA-class-specific function since the caller is an interactive process (the windowing system software sees to that). All the processes associated with the windowing system are put in the IA class. And since processes and threads inherit their scheduling class from the parent, newly created processes (terminal windows, applications) are also put in the IA class. IA class threads are given a priority boost value of 10, which is factored in to the thread's user mode priority when that calculation is done, that is, every time a thread's priority is adjusted. Recall that the user priority is the sum of ts_cpupri, ts_upri, and ts_boost. (ts_boost has a fixed value of 10 for IA class threads.)

Processing for an RT class thread is much simpler; since RT priorities are fixed, they do not get better or worse during the execution life of the thread. The only way the priority of an RT class thread changes is if it is explicitly changed as the result of a priocntl() command or system call. When a thread enters the RT scheduling class (rt_enterclass()) through priocntl(), the thread RT class priority, rt_pri in the rtproc structure, is set to the default of 0 unless a priority has been specified; in that case, rt_pri will reflect the requested priority (provided it falls within the valid range for the class). rt_pquantum (also in rtproc) is set to the rt_quantum value from the RT dispatch table corresponding to the row number, as indexed by the priority. For example, the default RT class priority of 0 defines a time quantum of 1000, which translates to 1000 milliseconds, or 100 clock ticks (see the description of RES in “Dispatch Tables”).

If the thread is currently running on a processor, the thread's global priority is set in t_pri, and the code looks at the dispatch queue to determine whether the current thread has a priority greater than the highest-priority thread sitting on a dispatch queue. If it does, rt_timeleft is set to rt_pquantum, and the thread is allowed to continue to run; otherwise, it is forced to surrender the processor. If the thread is not currently running, the thread_change_pri() code is called (the same function described in the TS/IA examples), and the thread is placed on the appropriate dispatch queue. The following pseudocode illustrates this sequence of events.

rt_enterclass()
        if (not superuser)
                return  /* must be root to use RT class */
        if (no user-supplied priority)
                rtproc->rt_pri = 0;
                rtproc->rt_pquantum = rt_dptbl[0].rt_quantum;
        else
                rtproc->rt_pri = user-supplied priority;
                if (user-supplied time quantum) /* -t flag in priocntl */
                        rtproc->rt_pquantum = user-supplied time quantum
                else
                        rtproc->rt_pquantum = rt_dptbl[rt_pri].rt_quantum;
        if (thread is running)
                kthread->t_pri = rt_dptbl[rt_pri].rt_globpri;
                if (thread priority > highest-priority thread on a queue)
                        rtproc->rt_timeleft = rtproc->rt_pquantum
                else
                        surrender the processor
        else /* thread is not running */
                thread_change_pri(new_priority)

Other housekeeping work—initializing the other members of rtproc and setting the pointers in rtproc and the kthread structure—done in rt_enterclass() is not shown above.

A thread will also enter the RT class if it is forked from another RT thread, because child threads inherit their scheduling class from the parent. The child thread inherits the parent priority and time quantum in the rt_fork() code. The execution order of parent and child threads in the RT class is different from that of TS/IA class threads. The child executes first, by design, when a TS/IA class thread is forked (this is especially true in the case of vfork(), as discussed in the previous chapter). RT child threads end up on a dispatch queue while the parent continues to run, unless, of course, the application code is written explicitly to do otherwise.

The rt_tick() code, called from the clock tick handler for an RT class thread, determines if the thread's time quantum has expired. If it has, the thread is forced to surrender the processor; otherwise, it continues. RT class threads decrement a time counter, rt_timeleft, in the rtproc structure to track clock ticks of processor time (as is the case with TS threads). A zero rt_timeleft means the time quantum has been used. It's possible to set an infinite time quantum for RT class threads. An rt_quantum value of −2 in the RT dispatch table is interpreted by the kernel as an infinite time quantum, in which case an RT class thread will continue to run unless it voluntarily surrenders the processor as a result of a sleep (e.g., issuing a blocking system call).

The RT code does not index the RT dispatch table to fetch the rt_quantum data every time a test is necessary. Once the RT thread is established, the time quantum from the RT table is stored in rt_pquantum in the thread's rtproc structure. Subsequent tests are done with rt_pquantum, which will get a new value should a priocntl() be issued on the thread, giving it a new priority and time quantum. Once an RT thread is initialized and begins execution, the priority never changes unless explicitly changed with priocntl().
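A minimal sketch of the tick test, assuming the rtproc fields described above and using the literal -2 as the infinite-quantum marker:

/* returns nonzero if the RT thread's quantum has expired */
struct rtproc_sketch { int rt_pquantum; int rt_timeleft; };

int
rt_quantum_expired(struct rtproc_sketch *rtp)
{
        if (rtp->rt_pquantum == -2)             /* infinite quantum */
                return (0);
        return (--rtp->rt_timeleft <= 0);       /* expired: surrender CPU */
}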

9.2.3. Dispatcher Functions

The dispatcher's primary functions are to decide which runnable thread gets executed next, to manage the context switching of threads on and off processors, and to provide a mechanism for inserting kthreads that become runnable into a dispatch queue. Support code for the initialization process (loading scheduling classes, setting up dispatch queues, etc.) and support of the dispadmin(1M) and priocntl(1) commands are also defined within the dispatcher subsystem.

The heart of the dispatcher is the disp() function, which selects the next thread for execution. disp() is called primarily from the context-switching function, swtch(), which is the main entry point into the dispatcher and which is called from many areas of the kernel. Queue insertion is handled by the setfrontdq() and setbackdq() code, which place a kthread on a dispatch queue according to the thread's priority. Whether the kthread is placed at the front or the back of the queue is determined before the queue insertion function is called; we'll take a look at how that is decided as we move through the function descriptions.

9.2.3.1. Dispatcher Queue Insertion

In the previous section on thread priorities, we mentioned queue insertion following the priority setting from several class-specific routines. When a thread is created, the fork return code calls the class-specific setrun() function after setting the thread's priority. The queue insertion code is invoked from setrun() as the final step in the thread creation process. The class wakeup functions call into the dispatcher to insert a newly awakened kernel thread, moving it from a sleep queue to a dispatch queue. The preemption code must have the preempted thread placed on a dispatch queue, and thread_change_pri() also calls setfrontdq() or setbackdq(). Figure 9.9 illustrates the execution phases of a typical kernel thread as it is moved to and from dispatch queues and sleep queues.

Figure 9.9. Kernel Thread Queue Insertion


The thread yield (xx_yield) scenario occurs only when a yield call is issued programmatically in application code through thr_yield(3T). Preemption, as we mentioned earlier, means a thread is involuntarily context-switched off a processor in favor of a higher-priority thread. Once the code segments shown in Figure 9.9 have completed the priority work, it's time for queue selection and insertion. Basically, four possible scenarios drive queue selection:

  • An unbound thread whose priority is less than kpreemptpri

  • An unbound thread whose priority is greater than or equal to kpreemptpri

  • A bound thread whose priority is less than kpreemptpri

  • A bound thread whose priority is greater than or equal to kpreemptpri

kpreemptpri, kernel preemption priority, is a global variable set by the kernel during initialization. The kernel sets the kpreemptpri value as scheduling classes are loaded, either during boot or when a class is loaded at some point while the system is running. By default, kpreemptpri is set to a value of 100, which is the maximum system priority (99) plus 1. Any thread with priority equal to or greater than kpreemptpri will cause a kernel preemption.

The basic, simplified logic of queue selection goes like this.

if (unbound AND priority < kpreemptpri)
        insert on dispatcher queue for t_cpu processor
if (unbound AND priority >= kpreemptpri)
        insert in cp_kp_queue /* systemwide kernel preempt queue */
if (bound AND priority < kpreemptpri)
        insert in dispatch queue of CPU thread is bound to
if (bound AND priority >= kpreemptpri)
        insert in dispatch queue of CPU thread is bound to

When a thread is switched onto a processor for execution, the thread's t_cpu pointer points to the CPU structure of the selected processor. Unbound TS/IA threads are placed on the dispatch queue of the processor they ran on last, in the hopes of hitting a warm processor cache and thus maximizing performance. This loose affinity maintained by the dispatcher has a time limit in the form of the kernel rechoose_interval parameter, which by default is 3 clock ticks. If more than rechoose_interval ticks have transpired since the thread last ran, the likelihood of the thread getting a cache hit has diminished, and the next available processor is selected. rechoose_interval can be set to a higher value in /etc/system and can provide a measurable performance improvement for some loads, especially those with many short-running threads on processors with relatively large caches. However, as with anything else, take great care if you alter the default value, and never do so on a production system without first testing extensively.
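As an example, a tuned entry might look like the following /etc/system line (the value 6 is purely illustrative; derive an appropriate value from your own testing):

* /etc/system: extend dispatcher loose affinity from 3 to 6 clock ticks
set rechoose_interval = 6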

Unbound RT threads are placed on the systemwide kernel preempt queue, cp_kp_queue. If processor sets are configured and the thread has been bound to a processor set, the cp_kp_queue for the processor set is used. If processor sets have been configured and the thread has not been bound to a set, the cp_kp_queue for the default partition is selected. Bound threads in any scheduling class, even RT class threads, are placed on the dispatch queue of the processor they've been bound to.

The insertion of a thread on a dispatch queue is accomplished with the setfrontdq() and setbackdq() routines for the per-processor dispatch queues, and setkpdq() for the kernel preemption queue (or queues). The algorithm for setfrontdq(), whose job it is to place a thread at the front of a dispatch queue, is represented in the following pseudocode.

setfrontdq()
if (thread is swapped out or on swap queue)
        call disp_swapped_setrun()
        return
if (uniprocessor system)
        selected_cpu = t_cpu
else if (MP system with 1 online CPU)
        selected_cpu = t_cpu
else if (thread is not bound)
        if (thread priority >= kpreemptpri)
                call setkpdq()
                return
        selected_cpu = t_cpu
        if (thread was running on the same partition that selected CPU belongs to)
                if (thread priority < highest-priority thread sitting on the queue)
                        selected_cpu = cpu_choose()
        else
                selected_cpu = disp_lowpri_cpu()
/*
* at this point, the processor and associated dispatch queue have been selected
*/
set thread state to TS_RUN /* thread is runnable */
increment runnable thread count on dispatch queue structure - disp_nrunnable
place thread on the processor's dispatch queue
set the disp_qactmap queue bitmap to reflect added thread
if (priority of thread placed on queue > disp_maxrunpri)
        dispatch_queue->disp_maxrunpri = thread's priority
        call cpu_resched();

The selection of a dispatch queue for a TS/IA thread is essentially the selection of a processor, since each processor has its own dispatch queue. That is why we reference selected_cpu in the preceding pseudocode. For a uniprocessor system or a multiprocessor system with one online processor, processor selection becomes trivial, which is why the code tests for those conditions up-front, immediately following the swapped thread test. A thread that is either swapped out or sitting on the swap queue needs to be reloaded before it can be placed on a dispatch queue.

To nudge the memory scheduler (PID 0), disp_swapped_setrun() will set one of two possible flags: either wake_sched or wake_sched_sec, depending on the priority of the thread. These flags are tested in the clock interrupt handler, where wake_sched is tested every clock interrupt and wake_sched_sec is tested every 100 clock interrupts, or once every second. Threads at a priority greater than 99 (highest SYS class priority) wake up the scheduler right away—the next clock interrupt (wake_sched); lower-priority threads result in wake_sched_sec getting set and thus wait a little longer.

The next test in setfrontdq() is for bound threads and uses t_bound_cpu in the thread structure. A bound thread will have its t_bound_cpu pointer pointing to the processor it is bound to. If the thread is not bound and its priority is greater than or equal to kpreemptpri, setkpdq() is called to place the thread on the kernel preempt queue, and setfrontdq() returns. Otherwise (not bound, priority below kpreemptpri), a processor is initially selected from the thread's t_cpu pointer (the loose affinity discussed earlier) and the processor partition test is done. Intuitively, it would seem that the processor referenced by the thread's t_cpu pointer must belong to the partition referenced by the same thread's t_cpupart pointer, but it's possible some processor set configuration changes were made since the thread last ran, so the test is necessary.

If the selected processor is in the same partition on which the thread last ran, and if the thread's priority is greater than or equal to the highest-priority thread sitting on the processor's dispatch queue, stick with the selected processor. Otherwise, call cpu_choose() and find another processor queue in the same partition on which to place the thread. If the partition test fails, meaning the partition that the selected processor belongs to is different from the partition on which the thread last ran, then setfrontdq() calls disp_lowpri_cpu() to find a processor in the thread's t_cpupart partition. This code deals with situations where processor set configuration changes were made since the thread was last placed on a dispatch queue.

The cpu_choose() routine looks for the best processor on which to put the thread. In cpu_choose(), the kernel compares the amount of time (clock ticks) that has transpired since the thread last ran against rechoose_interval. The thread's t_disp_time field is incremented (by the ts_update thread) in the clock tick handler if the thread is not on a queue, and the kernel compares this value against rechoose_interval. If t_disp_time is less than rechoose_interval, the thread is kept on the processor it ran on last. Otherwise, disp_lowpri_cpu() is called to find a processor. disp_lowpri_cpu() searches the linked list of online processors in the partition for the processor with the lowest-priority thread. Once found, the thread is inserted in the selected processor's dispatch queue.
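In outline, the affinity test reduces to a single comparison. A minimal sketch, assuming the fields described in the text and a simplified disp_lowpri_cpu() signature:

/* hedged sketch of cpu_choose(), affinity portion only */
static cpu_t *
cpu_choose(kthread_t *tp)
{
        if (tp->t_disp_time < rechoose_interval)
                return (tp->t_cpu);             /* cache is likely still warm */
        return (disp_lowpri_cpu(tp->t_cpu));    /* lowest-priority queue wins */
}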

Bound threads, as we mentioned earlier, are placed on the queue of the processor they are bound to; or, if bound to a processor set, a processor within the set is selected, applying the loose affinity rule with rechoose_interval. Once the processor dispatch queue has been selected, the thread's priority is used to determine the correct priority queue. Recall that the per-processor queues are a queue of queues, and within each processor queue are linked lists of threads on individual queues, one for each priority. The disp_nrunnable counter in the processor queue is incremented, as is the dq_sruncnt counter for the priority queue on which the thread is placed. The thread is inserted at the front of the selected priority queue (remember, we are going through setfrontdq()), and the disp_qactmap (dispatcher queue active map) bitmap is updated. The priority of the newly inserted thread is tested against the queue's disp_maxrunpri value. If it is greater, disp_maxrunpri is set to the thread's priority to reflect the highest-priority thread sitting on the queue, and cpu_resched() is called to determine if a preemption condition exists.

cpu_resched() checks the priority of the thread currently executing on the processor against the priority of the thread just inserted onto the processor's dispatch queue and tests for a user or kernel preemption. A user preemption means the inserted thread has a higher priority than the currently running thread but a priority lower than kpreemptpri. More succinctly, if the thread's priority is less than 100 but greater than that of the currently running thread, the code sets up a user preemption. A kernel preemption results when the thread's priority is greater than that of the currently running thread and greater than kpreemptpri. The cpu_resched() work is represented in the following pseudocode.

cpu_resched()
        if (CPU is NOT IDLE AND thread priority > current thread priority)
                if (thread priority >= upreemptpri AND cpu_runrun == 0)
                        set cpu_runrun in CPU structure
                        set t_astflag in currently executing thread
                        if (thread priority < kpreemptpri AND selected cpu is not CPU)
                                poke_cpu()
        if (thread priority >= kpreemptpri AND cpu_kprunrun == 0)
                        set cpu_kprunrun in CPU structure
                        if (selected cpu is not CPU)
                                poke_cpu()

A couple of additional points about cpu_resched(). First, on a multiprocessor system, the processor that owns the dispatch queue selected for the thread may be a different processor than the one currently executing the code being examined: the setfrontdq() routine. The uppercase CPU reference is a kernel macro that always resolves to the currently executing processor. Thus, the tests described above determine whether the selected processor is the current processor. If it is not, a cross-call is generated by poke_cpu() to get the attention of the selected processor and force it to enter the kernel through a cross-call trap handler so that the preemption will happen. If it is the same processor, the cross-call is not necessary, since the current processor is already in the kernel and will detect the preemption flag when it returns to the code that originated the setfrontdq() call.

The last thing that setfrontdq() does is to set the dispatch queue's disp_max_unbound_pri variable, which, as the name implies, maintains the priority value of the highest-priority unbound thread on the queue. If the newly inserted thread's priority is greater than the current disp_max_unbound_pri and the thread is not bound, the value will be updated. That done, the kernel setfrontdq() queue insertion process is completed. The thread is now sitting on a dispatch queue and will be scheduled on the basis of its priority and position in the queue when the dispatcher disp() and swtch() code executes next. We're going to walk through that process in just a moment.

The setbackdq() routine, which puts a thread at the back of a dispatch queue, is algorithmically similar to setfrontdq(), with a few differences. First, in setbackdq() the kernel attempts to maintain a balance in queue depths across processors, as shown in the sketch below. Once a processor has been selected with cpu_choose(), the number of runnable threads on that processor's queue at the corresponding thread priority is examined. If it is greater than MAX_RUNQ_DIFF, which has a value of 2, the kernel tries the next processor in the same partition. That behavior aside, setbackdq() is essentially the same as setfrontdq() except that, of course, the thread is inserted at the back of the selected queue.
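In sketch form, the balancing test looks like the following (cpu_next_part, linking processors within a partition, is an assumption here, as is the simplified cpu_choose() signature):

/* hedged sketch of the queue-depth balancing in setbackdq() */
cp = cpu_choose(tp);
if (cp->cpu_disp.disp_q[tpri].dq_sruncnt > MAX_RUNQ_DIFF)
        cp = cp->cpu_next_part;         /* try the next CPU in the partition */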

Whether setfrontdq() or setbackdq() is called from the various points in the kernel where queue insertion occurs is driven by factors such as how long a thread has been waiting to run, whether or not the thread is in the IA class, etc. IA class threads are put on the front of a dispatch queue for an additional edge in getting scheduled. A preempted thread with a scheduler activation is always placed at the front of a queue. RT class threads are always placed at the back of the kernel preempt queue. Threads that have waited a (relatively) long time to run, as determined by the thread's t_disp_time value, are placed at the front of a queue.

9.2.3.2. Thread Preemption

We talked a bit about thread preemption in the setfrontdq() code description and referred to it in the thread priority section. To complete the picture, we'll tie up some loose ends on the subject here. First, a quick review of what preemption is.

The kernel will preempt a thread running on a processor when a higher-priority thread is inserted onto a dispatch queue. The thread is effectively forced to reschedule itself and surrender the processor before having used up its time quantum. Two types of preemption conditions are implemented—a user preemption and a kernel preemption—distinguished by the priority level of the preempted thread, which drives how quickly the preemption will take place.

A user preemption occurs if a thread is placed on a dispatch queue and the thread has a higher priority than the thread currently running on the processor associated with the queue but has a lower priority than the minimum required for a kernel preemption. A kernel preemption occurs when a thread is placed on a dispatch queue with a priority higher than kpreemptpri, which is set to 100, representing the lowest global dispatch priority for an RT class thread. RT and interrupt threads have global priorities greater than kpreemptpri.

User preemption provides for higher-priority TS/IA threads getting processor time expediently. Kernel preemption is necessary for support of real-time threads. Traditional real-time support in Unix systems was built on a kernel with various preemption points, allowing a real-time thread to displace the kernel at a few well-defined preemptable places. The Solaris implementation goes the next step and implements a preemptable kernel with a few non-preemption points. In critical code paths, Solaris will temporarily disable kernel preemption for a short period and reenable it when the critical path has completed. Kernel preemption is disabled for very short periods in the thread_create() code, during the pause_cpus() routine, and in a few memory management (MMU) code paths, such as when a hardware address translation (HAT) is being set up.
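In code, the disable/enable pair brackets the critical path. A minimal sketch (kpreempt_disable() and kpreempt_enable() are the kernel interfaces; the body shown is illustrative):

kpreempt_disable();     /* increments curthread->t_preempt */
/* ... short critical path that must not be preempted ... */
kpreempt_enable();      /* decrements t_preempt; may trigger kpreempt() */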

Preemptions are flagged through fields in the per-processor cpu structure: cpu_runrun and cpu_kprunrun. cpu_runrun flags a user preemption; it is set when a thread inserted into a dispatch queue is a higher priority than the one running but a lower priority than kpreemptpri. cpu_kprunrun flags a kernel preemption. We saw in the cpu_resched() code one example of where these flags get set. The runrun flags can also get set in the following kernel routines.

  • cpupart_move_cpu(). When a processor set configuration is changed and a processor is moved from a processor set, the runrun flags are set to force a preemption so the threads running on the processor being moved can be moved to another processor in the set they've been bound to. Note that if only one processor is left in the set and there are bound threads, the processor set cannot be destroyed until any bound threads are first unbound.

  • cpu_surrender(). A thread is surrendering the processor it's running on, called from the TS/IA clock tick handler, sleep code, and trap return code. The RT class routines for enterclass, setting parameters, and clock tick handler can also call cpu_surrender(). Recall from the section on thread priorities that cpu_surrender() is called following a thread's priority change and a test to determine if preemption conditions exist. Entering cpu_surrender() means a preemption condition has been detected and is the first step in a kthread giving up a processor in favor of a higher-priority thread.

    Two other areas of the kernel that potentially call cpu_surrender() are the priority inheritance code and the processor support code that handles the binding of a thread to a processor. The conditions under which the priority inheritance code calls cpu_surrender() are the same as previously described, that is, a priority test determined that a preemption is warranted. The thread binding code will force a preemption through cpu_surrender() when a thread is bound to a processor in a processor set and the processor the thread is currently executing on is not part of the processor set the thread was just bound to. This is the only case when a preemption is forced that is not the result of a priority test.

    cpu_surrender() will set the cpu_runrun flag and will set cpu_kprunrun if the preemption priority is greater than kpreemptpri. On a multiprocessor system, if the processor executing the cpu_surrender() code is different from the processor that needs to preempt its thread, then a cross-call is sent to the processor that needs to be preempted, forcing it into a trap handler. At that point the runrun flags are tested. The other possible condition is one in which the processor executing the cpu_surrender() code is the same processor that must preempt the current thread, in which case it will test the runrun flags before returning to user mode; thus, the cross-call is not needed. In other words, the processor is already in the kernel by virtue of the fact that it is running the cpu_surrender() kernel routine, so a cross-call would be superfluous.
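A minimal sketch of the flag setting and cross-call decision just described (simplified; field names follow the text, and locking is omitted):

/* hedged sketch of cpu_surrender() */
void
cpu_surrender(kthread_t *tp)
{
        cpu_t *cp = tp->t_cpu;

        cp->cpu_runrun = 1;                     /* user preemption pending */
        if (tp->t_pri > kpreemptpri)
                cp->cpu_kprunrun = 1;           /* kernel preemption pending */
        if (cp != CPU)
                poke_cpu(cp->cpu_id);           /* cross-call the remote CPU */
}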

Once the preemption condition has been detected and the appropriate runrun flag has been set in the processor's CPU structure, the kernel must enter a code path that tests the runrun flags before the actual preemption occurs. This happens in different areas of the kernel for user versus kernel preemptions. User preemptions are tested for (cpu_runrun) when the kernel returns from a trap or interrupt handler. Kernel preemptions are tested for (cpu_kprunrun) when a dispatcher lock is released.

The trap code that executes after the main trap or interrupt handler has completed tests cpu_runrun, and if it is set, calls the kernel preempt() function. preempt() tests two conditions initially. If the thread is not running on a processor (thread state is not ONPROC) or if the thread's dispatch queue pointer is referencing a queue for a processor other than the processor currently executing, then no preemption is necessary and the code falls through and simply clears a dispatcher lock.

Consider the two test conditions. If the thread is not running (the first test), then obviously it does not need to be preempted. If the thread's t_disp_queue pointer is referencing a dispatch queue for a different processor (different from the processor currently executing the preempt() code), then clearly the thread has already been placed on another processor's queue, so that condition also obviates the need for a preemption.

If the conditions just described are not true, preempt() increments the LWP's lrusage structure nicsw counter, which counts the number of involuntary context switches. The processor's inv_switch counter is also incremented in the cpu_sysinfo structure, which counts involuntary context switches processor-wide, and the scheduling-class-specific preempt code is called. The per-processor counters are available with mpstat(1M), reflected in the icsw column. The LWP lrusage data is not readily available with a bundled command, but you can develop code that uses procfs and reads the data by means of /proc/<PID>/lwp/<LWPID>/lwpusage.
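For example, a small program along these lines reads the per-LWP usage data through procfs; this is a sketch, assuming a valid PID and LWP ID on the command line, with minimal error handling:

/* print the involuntary context switch count for one LWP */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <procfs.h>

int
main(int argc, char *argv[])
{
        char path[64];
        prusage_t pu;
        int fd;

        /* argv[1] = PID, argv[2] = LWP ID */
        (void) snprintf(path, sizeof (path),
            "/proc/%s/lwp/%s/lwpusage", argv[1], argv[2]);
        if ((fd = open(path, O_RDONLY)) == -1 ||
            read(fd, &pu, sizeof (pu)) != sizeof (pu)) {
                perror(path);
                return (1);
        }
        (void) printf("involuntary context switches: %lu\n",
            (ulong_t)pu.pr_ictx);
        return (0);
}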

The class-specific code for TS/IA threads prepares the thread for placement on a dispatch queue and calls either setfrontdq() or setbackdq() for actual queue insertion. ts_preempt() checks whether the thread is in kernel mode and whether the kernel-priority-requested flag (t_kpri_req) in the thread structure is set. If it is set, the thread's priority is set to the lowest SYS class priority (typically 60). The t_trapret and t_astflag kthread flags are set, causing the ts_trapret() function to run when the thread returns to user mode (from kernel mode). At that point, the thread's priority is set back to something in the TS/IA priority range. ts_preempt() tests for a scheduler activation on the thread. If an activation has been enabled and the thread has not avoided preemption beyond the threshold of two clock ticks and the thread is not in kernel mode, then the thread's priority is set to the highest user mode priority (59) and is placed at the front of a dispatch queue with setfrontdq().

If the thread's TSBACKQ flag is set, indicating the thread should be placed in the back of a dispatch queue with setbackdq(), the thread preemption is due to time-slice expiration. (Recall that ts_tick() will call cpu_surrender().) The thread's t_dispwait field is zeroed, and a new time quantum is set in ts_timeleft from the dispatch table (indexed with ts_cpupri, as previously discussed) prior to setbackdq() being called. Otherwise, if TSBACKQ is not set, a real preemption occurred (higher-priority thread became runnable) and the thread is placed at the front of a dispatch queue.

The rt_preempt() code is much less complex. If RTBACKQ is true, the preemption was due to a time quantum expiration (as was the case previously), and setbackdq() is called to place the thread at the back of a queue after setting the rt_timeleft value from rt_pquantum. Otherwise, the thread is placed at the front of a dispatch queue with setfrontdq().
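A minimal sketch of rt_preempt() as just described (field names follow the rtproc structure conventions from Table 9-3; locking is omitted):

/* hedged sketch of rt_preempt(), per the description above */
static void
rt_preempt(kthread_t *tp)
{
        rtproc_t *rtpp = (rtproc_t *)tp->t_cldata;

        if (rtpp->rt_flags & RTBACKQ) {
                rtpp->rt_timeleft = rtpp->rt_pquantum;  /* fresh time quantum */
                rtpp->rt_flags &= ~RTBACKQ;
                setbackdq(tp);
        } else {
                setfrontdq(tp);
        }
}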

The class-specific preempt code, once completed, returns to the generic preempt() routine, which then enters the dispatcher by calling swtch(). We look at the swtch() code in the next section.

Kernel preemption, as we said earlier, is detected when a dispatcher lock is released. It is also tested for in kpreempt_enable(), which reenables kernel preemption after kpreempt_disable() blocked preemptions for a short time. The goal is to have kernel preemptions detected and handled more expediently (with less latency) than user preemptions. A dispatcher lock is a special type of mutex lock that not only provides mutual exclusion semantics for access to a specific dispatch queue, but does so at a raised priority level, protecting the queue from access from an interrupt handler for a low-priority interrupt. Dispatcher locks are acquired and held at priority level 10 on SPARC systems, blocking all low-priority interrupts. Interrupts above level 10 are high priority, and handlers for high-priority interrupts are not permitted to execute code that may require entering the dispatcher (e.g., causing a sleep and context switch). Every dispatch queue has a disp_lock, maintained in the dispatch queue structure that describes the queue (see Figure 9.5).

Because the test for a kernel preemption, cpu_kprunrun, is put in the disp_lock_exit() code, the detection happens synchronously with respect to other thread scheduling and queue activity. The dispatcher locks are acquired and freed at various points in the dispatcher code, either directly through the dispatcher lock interfaces or indirectly through macro calls. For example, each kernel thread maintains a pointer to a dispatcher lock, which serves to lock the thread and the queue during dispatcher functions. The THREAD_LOCK and THREAD_UNLOCK macros use the dispatcher lock entry and exit functions. The key point is that a kernel preemption will be detected before a processor running a thread flagged for preemption completes a pass through the dispatcher.

When disp_lock_exit() is entered, it tests whether cpu_kprunrun is set; if so, then disp_lock_exit() calls kpreempt(). A clear cpu_kprunrun flag indicates a kernel preemption is not pending, so there is no need to call kpreempt().
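In outline (the helper name is hypothetical; the -1 argument selects the synchronous preemption case discussed later in this section):

/* hedged sketch of disp_lock_exit(), per the text */
void
disp_lock_exit(disp_lock_t *lp)
{
        if (CPU->cpu_kprunrun) {
                drop_lock_restore_spl(lp);      /* hypothetical helper */
                kpreempt(-1);                   /* synchronous preemption */
        } else {
                drop_lock_restore_spl(lp);
        }
}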

Kernel preemptions are handled by the kpreempt() code, represented here in pseudocode.

kpreempt()
        if (current_thread->t_preempt)
                do statistics
                return
        if (current_thread NOT running) OR (current_thread NOT on this CPU's queue)
                return
        if (current PIL >= LOCK_LEVEL)
                return
        block kernel preemption (increment current_thread->t_preempt)
        call preempt()
        enable kernel preemption (decrement current_thread->t_preempt)

The preceding pseudocode summarizes at a high level what happens in kpreempt(). Kernel threads have a t_preempt flag, which, if set, signifies that the thread is not to be preempted. This flag is set in some privileged threads, such as a processor's idle and interrupt threads. Kernel preemption is disabled by incrementing t_preempt in the current thread and is reenabled by decrementing t_preempt. kpreempt() tests t_preempt in the current thread; if it is set, kpreempt() increments some statistics counters and returns without performing a kernel preemption.

The second test is similar in logic to what happens in the preempt() code previously described. If the thread is not running or is not on the current processor's dispatch queue, there's no need to preempt. The third test checks the priority level of the processor. If we're running at a high PIL, we cannot preempt the thread, since it may be holding a spin lock. Preempting a thread holding a lock could result in a deadlock situation.

Any of the first three test conditions evaluating true will cause kpreempt() to return without actually doing a preemption. Assuming the kernel goes ahead with the preemption, kernel preemptions are disabled (to prevent nested kernel preemptions) and the preempt() function is called. Once preempt() completes, kernel preemption is enabled and kpreempt() is done.

The kernel statistical data maintained for kernel preemption events is not accessible with a currently available Solaris command. The data is maintained in a kpreempt_cnts (kernel preemption counts) data structure and can be interrogated with adb(1).

# adb -k /dev/ksyms /dev/mem
physmem fdde
kpreempt_cnts/D
kpreempt_cnts:
kpreempt_cnts:          0       <- Idle thread
kpreempt_cnts+4:        3       <- Interrupt thread
kpreempt_cnts+8:        0       <- Clock thread
kpreempt_cnts+0xc:      37      <- Blocked (t_preempt is set)
kpreempt_cnts+0x10:     32      <- Not on Proc
kpreempt_cnts+0x14:     0       <- Inv Switch
kpreempt_cnts+0x18:     15      <- Priority too High
kpreempt_cnts+0x1c:     35      <- Async Preempts
kpreempt_cnts+0x20:     593759  <- Sync Preempts

In the example, we inserted a short description of the event each counter correlates to. Most of the descriptions are comprehensible when examined in conjunction with the preceding text. The last two counters, async preempts and sync preempts, count each of the possible methods of kernel preemption. The kpreempt() function is passed one argument, asyncspl. For async preempts, asyncspl is a priority-level argument, and the kpreempt() code will raise the PIL, as dictated by the value passed. Sync preempts pass a −1 argument and do not change the processor's priority level.

Thread preemption is a relatively simple concept to understand, but as with many other things, it's easy to get lost in the details of implementation. Figure 9.10 summarizes thread preemption at a high level.

Figure 9.10. Thread Preemption Flow


In the case of both user and kernel preemption, the code ultimately executes the preempt() function, which as a last step, enters the dispatcher swtch() routine—the topic of the next section.

9.2.3.3. The Heart of the Dispatcher: swtch()

The kernel swtch() routine initiates the context switching of a thread off a processor, figures out which thread should run next, and context-switches the selected thread onto a processor for execution. It's called from many places within the operating system: in the class fork return function (a thread has just been created), from the idle thread (executed by processors if there are no runnable threads on a dispatch queue), by interrupt and trap handlers (to reenter the dispatcher), for thread sleep management, in kernel synchronization support code (mutexes, reader/writer locks, condition variables, etc.), and, of course, from the preempt() function. The various entry points to swtch() are listed in Table 9-4.

Table 9-4. Sources of Calls to swtch()
Kernel Subsystem    Kernel Function       Description
Dispatcher          idle                  Per-processor idle thread
                    preempt               Last phase of a preemption
Kthread             release_interrupt     Called from an interrupt thread
TS/IA class         ts_forkret            After kthread is created
Sleep/wakeup        cv_xxxx               Various condition variable functions
CPU                 force_migrate         Thread migration to another processor
                    cpu_pause             Processor state change to pause
Mutex               mutex_vector_enter    Mutex lock acquisition
RWlock              rw_enter_sleep        RW lock acquisition
Memory scheduler    sched                 PID 0
Semaphore           sema_p                Semaphore “p” operation
Signal              stop                  Thread stop function
Sleep/wakeup        slp_cv_wait           Thread to sleep state
Interrupt           intr_thread_exit      Exit of an interrupt handler

Entering the swtch() routine causes the cpu_sysinfo.pswtch counter to be incremented; it is reported by mpstat(1M) in the csw column and reflects the number of context switches per second on a per-processor basis.

The swtch() function first checks to see if the current thread is an interrupt thread. When interrupts happen, the thread stack is switched to that of an interrupt thread, from the linked list of interrupt threads in the processor's cpu structure. If swtch() was entered with an interrupt thread as the current thread, the kernel must restore the interrupted thread's state so it can be resumed. The interrupted thread is unpinned (a thread that has been preempted for an interrupt is considered pinned), and the kernel resume_from_interrupt() assembly routine is called to restore the state of the interrupted thread.

If the thread is not an interrupt thread, the code tests the handoff pointer in the kernel thread (t_handoff). A non-NULL t_handoff pointer indicates the processor should be handed off to another thread, in support of scheduler activations. (Scheduler activations were first introduced in Solaris 2.6 and are discussed in “Scheduler Activations”.)

Briefly, one of the goals of scheduler activations is to provide a framework for communication between the kernel dispatcher and the user-level threads library. Such communication improves the scheduling of user threads and reduces the latency of a runnable user thread getting an LWP if all the LWPs in a process are blocked. Recall from an earlier discussion that a user-level thread, that is, a thread created through a thr_create(3T) or pthread_create(3T) call, is mapped onto an LWP by the user thread's scheduler. If all the LWPs in a process are blocked and there are runnable user threads, the user thread's scheduler does not have the resource it needs, an LWP, to schedule a user thread. Prior to Solaris 2.6, the kernel implemented a signal mechanism, using SIGWAITING in conjunction with a signal handler for SIGWAITING, to increase the pool of available LWPs for the process.

With scheduler activations, a communication path is established between the kernel and the user threads library. If an activation has been established for the process, a pool of LWPs might be readily available, in which case the t_handoff pointer would be non-NULL. If that is the case, a kernel handoff function is called, and ultimately the kernel resume() code is entered to context-switch the kernel thread referenced with t_handoff onto the processor.

If the current thread is not an interrupt thread and a handoff is not required, swtch() calls the disp() function, which is the code segment that looks for the highest-priority thread to run, sets the thread's state to running (TS_ONPROC), and arranges for it to be switched onto a processor. At a high level, the disp() function searches the dispatch queues for the best-priority kernel thread, starting with the kernel preempt queue and then searching the queue of the current processor, that is, the processor executing the disp() code. If those searches come up blank, then the code searches the dispatch queues of other processors on a multiprocessor system, looking for a runnable kernel thread. If no threads are found on the dispatch queues, the processor executes its idle thread, which executes a tight loop, testing for runnable threads on each pass through the loop and entering swtch() if the run count is greater than 0.

The search for the highest-priority thread begins with the kernel preempt queue, as referenced by the current processor through its cpu_part structure, where the preempt queue is linked to cp_kp_queue. In this case, on a system with multiple processor partitions (user processor sets—recall there is a kernel preempt queue per processor set), the preempt queue for the processor partition that the executing processor belongs to is searched first. The cp_kp_queue search is represented in the following pseudocode.

kpq = pointer to kernel preempt queue
dq  = pointer to processor's dispatch queue
while ( (priority = kpq->disp_maxrunpri) >= 0 AND
        priority >= dq->disp_maxrunpri AND
        the current CPU is NOT offline AND
        (thread_pointer = disp_getbest(kpq)) != NULL )
                if (disp_ratify(thread_pointer, kpq) != NULL)
                        return (thread_pointer)

The preceding queue search loop validates the priority value according to the queue's disp_maxrunpri, which reflects the highest-priority thread sitting on the queue, makes sure the current processor is not offline, and calls the dispatcher disp_getbest() code to fetch the best-priority thread from the kernel preempt queue. disp_getbest() finds the highest-priority unbound thread, calls dispdeq() to have the thread removed from the dispatch queue, and returns the thread pointer back to disp(). If nothing is found, a NULL value is returned.

disp_getbest()
        dpq = dispatch queue pointer (cp_kp_queue in this example)
        priority = dpq->disp_max_unbound_pri
        if (priority == -1)
                return (NULL)
        queue = dpq->disp_q[priority]
        thread_pointer = queue->dq_first
        loop through linked list of threads on queue, skip bound threads
        if (no unbound threads)
                return (NULL)
        else
                thread_pointer = thread found
        dispdeq(thread_pointer)
        set thread t_disp_queue, processor's cpu_dispthread, thread state to ONPROC
        return (thread_pointer)

If an unbound thread is found in disp_getbest(), the thread is dequeued with dispdeq(), the thread's t_disp_queue pointer is set to reference the processor's dispatch queue (through the cpu_disp field of the cpu structure), the processor's cpu_dispthread pointer is set to the selected thread pointer, and the thread state is set to ONPROC.

dispdeq() deals with updating the dispatch queue data structures with the selected thread removed from the queue. It decrements disp_nrunnable, which is the total count for all the queues, and dq_sruncnt, which maintains the count of runnable threads at the same priority (refer to Figure 9.5). If the per-priority queue count, dq_sruncnt, is 0, then the queue bitmap is updated to reflect an empty queue. The disp_qactmap bitmap uses a set bit to reflect the presence of runnable threads on a per-priority queue; thus, the bit that corresponds to the zeroed queue is cleared. The disp_maxrunpri and disp_max_unbound_pri fields are also updated to reflect the new highest-priority thread on the queue if it is different from the thread that has just been removed from the queue.
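The bookkeeping reduces to a few counter and bitmap updates. A minimal sketch, where dp is the dispatch queue, dq the per-priority queue, and tpri the thread's priority (BT_CLEAR is used here as an illustrative bitmap macro):

/* hedged sketch of the dispdeq() queue updates described above */
dp->disp_nrunnable--;                   /* one less runnable thread overall */
if (--dq->dq_sruncnt == 0) {
        BT_CLEAR(dp->disp_qactmap, tpri);   /* queue at tpri is now empty */
        /* recompute disp_maxrunpri and disp_max_unbound_pri if needed */
}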

Once the thread selection has been made and the thread dequeued, the code returns to disp(), which calls disp_ratify() to ensure that the selected thread was, in fact, the best candidate to run next. The fine-grained locking used within the dispatcher routines allows for simultaneous changes to be made to the queues and the queue state by potentially many processors. For this reason, a select-and-ratify algorithm was chosen for implementation. The select phase of the algorithm now completed, disp_ratify() is entered to complete the ratify phase. The ratify code simply compares the priority of the selected thread to the disp_maxrunpri values of the processor and kernel preempt queue. If the selected thread priority is greater than maxrunpri, the selection is ratified and the context switch is done. If not, we reenter the code loop to find the best runnable thread. More precisely, if a higher-priority thread appears on the queue when disp_ratify() executes, the selected thread is placed back on the dispatch queue with a call to setfrontdq() and disp_ratify() returns NULL to disp().
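A minimal sketch of the ratify phase (simplified; the real code performs these checks under the appropriate dispatcher locks):

/* hedged sketch of disp_ratify(), per the description above */
static kthread_t *
disp_ratify(kthread_t *tp, disp_t *kpq)
{
        pri_t tpri = tp->t_pri;

        if (tpri < CPU->cpu_disp.disp_maxrunpri ||
            tpri < kpq->disp_maxrunpri) {
                setfrontdq(tp);         /* a better thread appeared; requeue */
                return (NULL);          /* caller reenters the select loop */
        }
        return (tp);                    /* selection ratified */
}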

If a thread is not found on the kernel preempt queue, then the per-processor queue disp_maxrunpri is tested. A value of −1 means that nothing is on the queue. In that case, the code searches the queues of the other processors on the system, beginning with the disp_getwork() code, which finds a processor with the highest-priority thread. Then, the code uses the disp_getbest() and disp_ratify() functions previously described.

If the current processor's disp_maxrunpri indicates runnable threads, the first thread from the highest-priority queue is removed, the queue data is updated (disp_nrunnable, dq_sruncnt, disp_qactmap, disp_max_unbound_pri, and disp_maxrunpri), the selection is ratified, and disp() returns the thread pointer to swtch().

If no work is found on any of the dispatch queues, the processor's idle thread is selected by setting the thread pointer to the cpu_idle_thread, referenced from the processor's cpu structure. The pointer to the idle thread is returned to the swtch() code.
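In outline, the idle loop is tiny; a sketch that omits the offline and power-management checks a real idle thread performs:

/* hedged sketch of the per-processor idle thread loop */
static void
idle(void)
{
        for (;;) {
                if (CPU->cpu_disp.disp_nrunnable > 0)
                        swtch();        /* runnable work appeared; dispatch */
        }
}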

Back in swtch(), with a thread pointer for the selected thread (or idle thread), the kernel resume() code is called to handle the switching of the thread on the processor. resume() is implemented in assembly language because the process of context switching requires low-level contact with processor hardware, for two reasons: to save the hardware context of the thread being switched off; and to set up the hardware registers and other context information so the new thread can begin execution.
