Chapter 2. CPUs

The central processing units (CPUs) are key resources in any computer system. Many modern systems from Sun boast numerous CPUs or virtual CPUs (which may be cores or hardware threads). The CPUs are shared by the applications on the system, according to a policy prescribed by the operating system and its scheduler (see Chapter 3 in Solaris Internals).

If the system becomes limited by its CPU resources, then application or kernel threads must wait on a queue to be scheduled on a processor, potentially degrading system performance. The time spent waiting on these queues, the length of the queues, and the utilization of the system's processors are important metrics for quantifying CPU-related performance bottlenecks. In addition, we can directly measure CPU utilization and wait states in various forms by using DTrace.

Tools for CPU Analysis

A number of different tools analyze CPU activity. The following summarizes both these tools and the topics covered in this section.

  • Utilization. Overall CPU utilization can be determined from the idle (id) field from vmstat, and the user (us) and system (sy) fields indicate the type of activity. Heavy CPU saturation is more likely to degrade performance than is CPU utilization.

  • Saturation. The run queue length from vmstat (kthr:r) can be used as a measure of CPU saturation, as can CPU latency time from prstat -m.

  • Load averages. These numbers, available from both the uptime and prstat commands, provide 1-, 5-, and 15-minute averages that combine both utilization and saturation measurements. This value can be compared to other servers if divided by the CPU count.

  • History. sar can be activated to record historical CPU activity. This data can identify long-term patterns; it also provides a reference for what CPU activity is “normal.”

  • Per-CPU utilization. mpstat lists statistics by CPU, to help identify application scaling issues should CPU utilization be unbalanced.

  • CPU by process. Commands such as ps and prstat can be used to identify CPU consumption by process.

  • Microstate accounting. High-resolution time counters track several states for user threads; prstat -m reports the results.

  • DTrace analysis. DTrace can analyze CPU consumption in depth and can measure events in minute detail.

Table 2.1 summarizes the tools covered in this chapter, cross-references them, and lists the origin of the data that each tool uses.

Table 2.1. Tools for CPU Analysis

Tool      Uses            Description                                           Reference
vmstat    Kstat           For an initial view of overall CPU behavior          2.2 and 2.12.1
psrinfo   Kstat           For physical CPU properties                          2.5
uptime    getloadavg()    For the load averages, to gauge recent CPU activity  2.6 and 2.12.2
sar       Kstat, sadc     For overall CPU behavior and dispatcher queue        2.7 and 2.12.1
                          statistics; sar also allows historical data
                          collection
mpstat    Kstat           For per-CPU statistics                               2.9
prstat    procfs          To identify process CPU consumption                  2.10 and 2.11
dtrace    DTrace          For detailed analysis of CPU activity, including     2.13, 2.14, and 2.15
                          scheduling events and dispatcher analysis

vmstat Tool

The vmstat tool provides a glimpse of the system’s behavior on one line and is often the first command you run to familiarize yourself with a system. It is useful here because it indicates both CPU utilization and saturation on one line.

$ vmstat 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr dd f0 s1 --   in   sy   cs us sy  id
 0 0 0 1324808 319448 1   2  2  0  0  0  0  0  0  0  0  403   21   54  0  1  99
 2 0 0 1318528 302696 480 6 371 0  0  0  0 73  0  0  0  550 5971  190 84 16   0
 3 0 0 1318504 299824 597 0 371 0  0  0  0 48  0  0  0  498 8529  163 81 19   0
 2 0 0 1316624 297904 3   0 597 0  0  0  0 91  0  0  0  584 2009  242 84 16   0
 2 0 0 1311008 292288 2   0 485 0  0  0  0 83  0  0  0  569 2357  252 77 23   0
 2 0 0 1308240 289520 2   0 749 0  0  0  0 107 0  0  0  615 2246  290 82 18   0
 2 0 0 1307496 288768 5   0 201 0  0  0  0 58  0  0  0  518 2231  210 79 21   0
...

The first line is the summary since boot, followed by samples every five seconds. vmstat reads its statistics from kstat, which maintains CPU utilization statistics for each CPU. The mechanics behind this are discussed in Section 2.12.

Two columns are of greatest interest in this example. On the far right is cpu:id, percent idle, which lets us determine how utilized the CPUs are; on the far left is kthr:r, the total number of threads on the ready-to-run queues, which is a measure of CPU saturation.

In this vmstat example, the idle time for the five-second samples was always 0, indicating 100% utilization. Meanwhile, kthr:r was mostly 2 and sustained, indicating modest saturation for this single-CPU server.

vmstat provides other statistics to describe CPU behavior in more detail, as listed in Table 2.2.

Table 2.2. CPU Statistics from the vmstat Command

Counter   Description
kthr
  r       Total number of runnable threads on the dispatcher queues; used as
          a measure of CPU saturation
faults
  in      Number of interrupts per second
  sy      Number of system calls per second
  cs      Number of context switches per second, both voluntary and
          involuntary
cpu
  us      Percent user time; time the CPUs spent processing user-mode threads
  sy      Percent system time; time the CPUs spent processing system calls on
          behalf of user-mode threads, plus the time spent processing kernel
          threads
  id      Percent idle; time the CPUs are waiting for runnable threads. This
          value can be used to determine CPU utilization

The following sections discuss CPU utilization and saturation in greater detail.

CPU Utilization

You can calculate CPU utilization from vmstat by subtracting id from 100 or by adding us and sy. Keep in mind the following points when considering CPU utilization.

  • 100% utilized may be fine—it can be the price of doing business.

  • When a Solaris system hits 100% CPU utilization, there is no sudden dip in performance; the performance degradation is gradual. Because of this, CPU saturation is often a better indicator of performance issues than is CPU utilization.

  • The measurement interval is important: 5% utilization sounds close to idle; however, for a 60-minute sample it may mean 100% utilization for 3 minutes and 0% utilization for 57 minutes. It is useful to have both short- and long-duration measurements.

  • A server running at 10% CPU utilization sounds as if 90% of the CPU is available for “free,” that is, as if it could be used without affecting the existing application. This isn’t quite true. When an application on a server with 10% CPU utilization wants the CPUs, they will almost always be available immediately. On a server with 100% CPU utilization, the same application will find that the CPUs are already busy, and it will need to preempt the currently running thread or wait to be scheduled. This can increase latency (discussed in more detail in Section 2.11).

CPU Saturation

The kthr:r metric from vmstat is useful as a measure for CPU saturation. However, since this is the total across all the CPU run queues, divide kthr:r by the CPU count for a value that can be compared with other servers.

Any sustained non-zero value is likely to degrade performance. The performance degradation is gradual (unlike the case with memory saturation, where it is rapid).

Interval time is still quite important. It is possible to see CPU saturation (kthr:r) while a CPU is idle (cpu:id). To understand how this is possible, either examine %runocc from sar -q or measure the run queues more accurately by using DTrace. You may find that the run queue is quite long for a short period of time, followed by idle time. Averaging over the interval gives both a non-zero run queue length and idle time.

psrinfo Command

To determine the number of processors in the system and their speed, use the psrinfo -v command. In Solaris 10, -vp prints additional information.

$ psrinfo -vp
The physical processor has 1 virtual processor (0)
  UltraSPARC-III+ (portid 0 impl 0x15 ver 0x23 clock 900 MHz)
The physical processor has 1 virtual processor (1)
  UltraSPARC-III+ (portid 1 impl 0x15 ver 0x23 clock 900 MHz)

uptime Command

The uptime command is a quick way to print the CPU load averages.[1]

$ uptime
 12:29am  up 274 day(s), 6 hr(s),  7 users,  load average: 2.00, 1.07, 0.46

The numbers are the 1-, 5-, and 15-minute load averages. They represent both utilization and saturation of the CPUs. Put simply, a value equal to your CPU count usually means 100% utilization; less than your CPU count is proportionally less than 100% utilization; and greater than your CPU count is a measure of saturation. To compare a load average between servers, divide the load average by the CPU count for a consistent metric.

The 1-, 5-, and 15-minute averages make it possible to identify recently increasing or decreasing CPU load. The previous uptime example demonstrates an increasing profile (2.00, 1.07, 0.46).

The calculation used for the load averages is often described as the average number of runnable and running threads, which is a reasonable description.[2] As an example, if a single CPU server averaged one running thread on the CPU and two on the dispatcher queue, then the load average would be 3.0. A similar load for a 32-CPU server would involve an average of 32 running threads plus 64 on the dispatcher queues, resulting in a load average of 96.0.

A load average consistently higher than your CPU count may mean degraded application performance. That said, CPU saturation is something that Solaris handles very well, so a server may run at some level of saturation without a noticeable effect on performance.

The system actually calculates the load averages by summing high-resolution user time, system time, and thread wait time, then processing this total to generate averages with exponential decay. Thread wait time measures CPU latency. The calculation no longer samples the length of the dispatcher queues, as it did in older Solaris releases. However, summing thread wait time produces an average that is usually (but not always) similar to averaging queue length anyway. For more details, see Section 2.12.2.

It is important not to become too obsessed with load averages: they condense a complex system into three numbers and should not be used for anything more than an initial approximation of CPU load.

sar Command

The system activity reporter (sar) can provide live statistics or can be activated to record historical CPU statistics. This can be of tremendous value because you may identify long-term patterns that you might have missed when taking a quick look at the system. Also, historical data provides a reference for what is “normal” for your system.

sar Default Output

The following example shows the default output of sar, which is also the -u option to sar. An interval of 1 second and a count of 5 were specified.

$ sar 1 5

SunOS titan 5.11 snv_16 sun4u    02/27/2006

03:20:42    %usr    %sys    %wio   %idle
03:20:43      82      17       0       1
03:20:44      92       8       0       0
03:20:45      91       9       0       0
03:20:46      94       6       0       0
03:20:47      93       7       0       0

Average       91       9       0       0

sar has printed the user (%usr), system (%sys), wait I/O (%wio), and idle (%idle) times. User, system, and idle are also printed by the vmstat command and are defined in Section 2.2. The following are some additional points.

  • %usr, %sys (user, system). A commonly expected ratio is 70% usr to 30% sys, but this depends on the application. Applications that use I/O heavily, for example, a busy Web server, can cause a much higher %sys because of a large number of system calls. Applications that spend their time processing userland code, for example, compression tools, can cause a higher %usr. Kernel-mode services, such as the NFS server, are %sys based. (A DTrace sketch for attributing %sys follows this list.)

  • %wio (wait I/O). This was supposed to be a measurement of the time spent waiting for I/O events to complete.[3] The way it was measured was not very accurate, resulting in inconsistent values and much confusion. This statistic has now been deliberately set to zero in Solaris 10.

  • %idle (idle). There are different schools of thought on percent idle. One holds that idle CPU equals wasted CPU cycles and should be put to use, especially when server consolidation solutions such as Solaris Zones are used. Another holds that some level of %idle is healthy (anywhere from 20% to 80%) because it leaves “head room” so that short increases in activity can be dispatched quickly.
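When %sys dominates and you want to attribute it, DTrace can sample kernel stacks directly. The following one-liner is a minimal sketch (our addition, not part of sar): the profile probe's arg0 is the kernel program counter, which is non-zero only while a CPU is executing kernel code. Idle-loop stacks also count as kernel time here; they are easy to recognize and ignore in the output.

# dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }'

A rate of 997 hertz is chosen rather than 1000 to avoid sampling in lockstep with the clock interrupt.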

sar -q

The -q option for sar provides statistics on the run queues (dispatcher queues).

$ sar -q 5 5

SunOS titan 5.11 snv_16 sun4u    02/27/2006

03:38:43 runq-sz %runocc swpq-sz %swpocc
03:38:48     0.0       0     0.0       0
03:38:53     1.0      80     0.0       0
03:38:58     1.6      99     0.0       0
03:39:03     2.4     100     0.0       0
03:39:08     3.4     100     0.0       0

Average      2.2      76     0.0       0

There are four fields.

  • runq-sz (run queue size). Equivalent to the kthr:r field from vmstat; can be used as a measure of CPU saturation.[4]

  • %runocc (run queue occupancy). Helps counter a danger of interval averaging: short bursts of activity can be averaged down to unnoticeable values. The run queue occupancy identifies whether short bursts of run queue activity occurred.[5]

  • swpq-sz (swapped-out queue size). Number of swapped-out threads. Swapping out threads is a last resort for relieving memory pressure, so this field will be zero unless there was a dire memory shortage.

  • %swpocc (swapped out occupancy). Percentage of time there were swapped out threads.

Capturing Historical Data

To activate sar to record statistics in Solaris 10, use svcadm enable sar.[6] The default is to take a one-second sample every hour, plus one every twenty minutes during business hours. This should be customized, because a one-second sample every hour isn’t terribly useful (the man page for sadc suggests the sample should be greater than five seconds). You can change it by placing an interval and a count after the sa1 lines in the crontab for the sys user (crontab -e sys); for example, sa1 600 6 takes six samples at ten-minute intervals.

Clock Tick Woes

At some point in a discussion on CPU statistics, it is obligatory to lament the inaccuracy of 100-hertz sampling: What if each sample coincided with idle time, misrepresenting the state of the server?

Once upon a time, CPU statistics were gathered every clock tick, or every hundredth of a second.[7] As CPUs became faster, it became increasingly possible for fleeting activity to occur between clock ticks, and such activity would not be measured correctly. Now Solaris uses microstate accounting, which takes high-resolution timestamps to measure CPU statistics for every event, producing extremely accurate statistics. See Section 2.10.3 in Solaris Internals.

If you look through the Solaris source, you will see high-resolution counters just about everywhere. Even code that expects clock tick measurements will often source the high-resolution counters instead. For example:

cpu_sys_stats_ks_update(kstat_t *ksp, int rw)
{
...
        csskd->cpu_ticks_idle.value.ui64 =
            NSEC_TO_TICK(csskd->cpu_nsec_idle.value.ui64);
        csskd->cpu_ticks_user.value.ui64 =
            NSEC_TO_TICK(csskd->cpu_nsec_user.value.ui64);
        csskd->cpu_ticks_kernel.value.ui64 =
            NSEC_TO_TICK(csskd->cpu_nsec_kernel.value.ui64);
...
                                                                See uts/common/os/cpu.c

In this code example, NSEC_TO_TICK converts from the microstate accounting value (which is in nanoseconds) to a ticks count. For more details on CPU microstate accounting, see Section 2.12.1.

While most counters you see in Solaris are highly accurate, sampling issues remain in a few minor places. In particular, the run queue length as seen from vmstat (kthr:r) is based on a sample that is taken every second. Running vmstat with an interval of 5 prints the average of five samples taken at one-second intervals. The following (somewhat contrived) example demonstrates the problem.

$ vmstat 2 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd s0 -- --   in   sy   cs us sy id
 0 0 23 1132672 198460 34 47 96 2  2  0 15  6  0  0  0  261  392  170  2  1 97
 0 0 45 983768 120944 1075 4141 0 0 0 0  0  0  0  0  0  355 2931  378  7 25 67
 0 0 45 983768 120944 955 3851 0 0 0  0  0  0  0  0  0  342 1871  279  4 22 73
 0 0 45 983768 120944 940 3819 0 0 0  0  0  0  0  0  0  336 1867  280  4 22 73
 0 0 45 983768 120944 816 3561 0 0 0  0  0  0  0  0  0  338 2055  273  5 20 75

$ uptime
  4:50am  up 14 day(s), 23:32,  4 users,  load average: 4.43, 4.31, 4.33

For this single CPU server, vmstat reports a run queue length of zero. However, the load averages (which are now based on microstate accounting) suggest considerable load. This was caused by a program that deliberately created numerous short-lived threads every second, such that the one-second run queue sample usually missed the activity.
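If you suspect this kind of short-lived thread activity on your own system, a quick check is to count thread creation by process name. This hedged one-liner uses the DTrace proc provider's lwp-create probe:

# dtrace -n 'proc:::lwp-create { @[execname] = count(); }'

A process creating hundreds of LWPs per second while vmstat reports an empty run queue would reproduce the mismatch shown above.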

The runq-sz from sar -q suffers from the same problem, as does %runocc (which for short-interval measurements defeats the purpose of %runocc).

These are all minor issues, and a valid workaround is to use DTrace, with which statistics can be created at any accuracy desired. Demonstrations of this are in Section 2.14.

mpstat Command

The mpstat command summarizes the utilization statistics for each CPU. The following output is an example from a system with 32 CPUs.

$ mpstat 1

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  279   267  112  106    7    7   85    0   219   85   2   0  13
  1    1   0  102    99    0  177    9   15  119    2   381   72   3   0  26
  2    2   0   75   130    0  238   19   11   98    5   226   70   3   0  28
  3    1   0   94    32    0   39    8    6   95    2   380   81   2   0  17
  4    1   0   70    75    0  128   11    9   96    1   303   66   2   0  32
  5    1   0   88    62    0   99    7   11   89    1   370   74   2   0  24
  6    4   0   78    47    0   85   24    6   67    8   260   86   2   0  12
  7    2   0   73    29    0   45   21    5   57    7   241   85   1   0  14
  8    2   0   94    21    0   17   10    3   73    0   392   86   2   0  12
  9    3   0   64   106    0  198   23   12   85    7   209   66   2   0  32
 10    2   0   23   164    0  331   20   13   85    7    89   17   2   0  81
 11    3   0   31   105    0  200   14   11   63    4   144   33   2   0  66
 12    0   0   47   129    0  260    3    8   86    0   248   33   2   0  65
 13    4   0   76    48    0   90   25    5   77    8   255   75   2   0  24
 14    5   0   39   130    0  275   17    9   83   10   158   36   2   0  62
 15    2   0   67    99    0  183   18    5  101    4   207   72   2   0  25
 16    3   0   79    61    0  103   20    2   89    3   252   83   2   0  15
 17    2   0   17    64    0  123    8    7   38    2    65   18   1   0  81
 18    4   0   86    22    0   21   21    0   98    4   283   98   2   0   0
 19    1   0   10    75    0  148    9    5   38    1    47    9   3   0  88
 20    1   0   49   108    1  208    4    6   85    0   252   29   2   0  69
 21    5   0   90    46    0   77   29    1   75    8   277   92   2   0   6
 22    2   0   56    98    0  186   15    5   71    3   176   59   2   0  39
 23    5   0   37   156    0  309   19    6   75    4   136   39   2   0  59
 24    0   0   32    51    0   97    2    3   32    1   198   15   1   0  83
 25    8   0   82    56    0  142   13    8   87   13   294   82   2   0  16
 26    2   0   66    40    0   75   12    3   66    2   237   73   2   0  25
 27    6   0   80    33    0   57   21    5   89    3   272   86   2   0  13
 28    1   0   97    35    0   56    7    3   94    2   369   76   2   0  22
 29    4   0   83    44    0   79   27    3   69    7   286   89   2   0   9
 30    1   0   95    41    1   69    8    4  105    1   382   80   2   0  18
 31    5   0   16    31    2   99    5    9   20   15    97    9   1   0  90
...

For each CPU, a set of event counts and utilization statistics is reported. The first output printed is the summary since boot. After vmstat is checked, the mpstat processor utilization metrics are often the next port of call to ascertain how busy the system CPUs are.

Processor utilization is reported by percent user (usr), system (sys), wait I/O (wt) and idle (idl) times, which have the same meanings as the equivalent columns from vmstat (Section 2.2) and sar (Section 2.7). The syscl field provides additional information for understanding why system time was consumed.

  • syscl (system calls). System calls per second. See Section 2.13 for an example of how to use DTrace to investigate the impact and cause of system call activity.

The scheduling-related statistics reported by mpstat are as follows.

  • csw (context switches). This field is the total of voluntary and involuntary context switches. Voluntary context switches occur when a thread performs a blocking system call, usually one performing I/O; the thread voluntarily sleeps until the I/O event completes.

  • icsw (involuntary context switches). This field displays the number of times threads were involuntarily taken off the CPU, either through expiration of their quantum or through preemption by a higher-priority thread. This number often indicates whether there were generally more runnable threads than physical processors. For further analysis, the DTrace sched probe dequeue fires when threads are taken off their dispatcher queue, as described in Section 2.15; a short sketch also follows this list.

  • migr (migrations of threads between processors). This field displays the number of times the OS scheduler moved a ready-to-run thread to an idle processor. If possible, the OS tries to keep a thread on the last processor on which it ran; if that processor is busy, the thread migrates. Migrations on traditional CPUs are bad for performance because they force a thread to pull its working set into cold caches, often at the expense of other threads.

  • intr (interrupts). This field indicates the number of interrupts taken on the CPU. These may be hardware- or software-initiated interrupts. See Section 3.11 in Solaris Internals for further information.

  • ithr (interrupts as threads). The number of interrupts that are converted to real threads, typically as a result of inbound network packets, blocking for a mutex, or a synchronization event. (High-priority interrupts won’t do this, and interrupts without mutex contention typically interrupt the running thread and complete without converting to a full thread). See Section 3.11 in Solaris Internals for further information.
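As mentioned in the icsw description above, DTrace can attribute involuntary context switches. The following sketch uses the sched provider's preempt probe, which fires immediately before the current thread is preempted, so counting by execname shows which processes are being kicked off CPU (treat the breakdown as approximate; quantum expirations and priority preemptions are not distinguished here):

# dtrace -n 'sched:::preempt { @[execname] = count(); }'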

The locking-related statistics reported by mpstat are as follows.

  • smtx (kernel mutexes). This field indicates the number of mutex contention events in the kernel. Mutex contention typically manifests itself first as busy spinning, which inflates system (%sys) time; the cost does not show up in smtx itself. More useful lock statistics are available through lockstat(1M) and the DTrace lockstat provider (see Section 9.3.5 and Chapter 17 in Solaris Internals).

  • srw (kernel reader/writer locks). This field indicates the number of reader/writer lock contention events in the kernel. Excessive reader/writer lock contention typically results in nonscaling performance and systems that are unable to use all the available CPU resources (the symptom is idle time). More useful lock statistics are available with lockstat(1M) and the DTrace lockstat provider (see Section 9.3.5 and Chapter 17 in Solaris Internals).
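For a quick first look at kernel lock contention before reaching for lockstat(1M), the DTrace lockstat provider can count blocking events by kernel stack. A minimal sketch:

# dtrace -n 'lockstat:::adaptive-block { @[stack()] = count(); }'

The adaptive-block probe fires when a thread blocks on an adaptive mutex; the hottest stacks indicate where in the kernel the contention originates.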

See Chapter 3 in Solaris Internals, particularly Section 3.8.1, for further information.

Who Is Using the CPU?

The prstat command was introduced in Solaris 8 to provide real-time process status in a meaningful way (it resembles top, the original freeware tool written by William LeFebvre). prstat uses procfs, the /proc file system, to fetch process details (see proc(4)), and the getloadavg() syscall to get load averages.

$ prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE       TIME  CPU PROCESS/NLWP
 25639 rmc      1613M   42M cpu22    0   10    0:33:10 3.1% filebench/2
 25655 rmc      1613M   42M cpu23    0   10    0:33:10 3.1% filebench/2
 25659 rmc      1613M   42M cpu30    0   10    0:33:11 3.1% filebench/2
 25644 rmc      1613M   42M cpu14    0   10    0:33:10 3.1% filebench/2
 25658 rmc      1613M   42M cpu16    0   10    0:33:10 3.1% filebench/2
 25636 rmc      1613M   42M cpu21    0   10    0:33:10 3.1% filebench/2
 25646 rmc      1613M   42M cpu15    0   10    0:33:10 3.1% filebench/2
 25661 rmc      1613M   42M cpu8     0   10    0:33:11 3.1% filebench/2
 25652 rmc      1613M   42M cpu20    0   10    0:33:09 3.1% filebench/2
 25647 rmc      1613M   42M cpu0     0   10    0:33:10 3.1% filebench/2
 25641 rmc      1613M   42M cpu27    0   10    0:33:10 3.1% filebench/2
 25656 rmc      1613M   42M cpu7     0   10    0:33:10 3.1% filebench/2
 25634 rmc      1613M   42M cpu11    0   10    0:33:11 3.1% filebench/2
 25637 rmc      1613M   42M cpu17    0   10    0:33:10 3.1% filebench/2
 25643 rmc      1613M   42M cpu12    0   10    0:33:10 3.1% filebench/2
 25648 rmc      1613M   42M cpu1     0   10    0:33:10 3.1% filebench/2
 25640 rmc      1613M   42M cpu26    0   10    0:33:10 3.1% filebench/2
 25651 rmc      1613M   42M cpu31    0   10    0:33:10 3.1% filebench/2
 25654 rmc      1613M   42M cpu29    0   10    0:33:10 3.1% filebench/2
 25650 rmc      1613M   42M cpu5     0   10    0:33:10 3.1% filebench/2
 25653 rmc      1613M   42M cpu10    0   10    0:33:10 3.1% filebench/2
 25638 rmc      1613M   42M cpu18    0   10    0:33:10 3.1% filebench/2
Total: 91 processes, 521 lwps, load averages: 29.06, 28.84, 26.68

The default output from the prstat command shows one line of output per process, including a value that represents recent CPU utilization. This value is pr_pctcpu from procfs; because it is a decayed average, it can reflect CPU utilization from before the prstat command was executed (see Section 2.12.3).

The system load averages indicate the demand and queuing for CPU resources, averaged over 1-, 5-, and 15-minute periods. They are the same numbers as printed by the uptime command (see Section 2.6). The output in our example shows a load average of 29 on a 32-CPU system. A load average that consistently exceeds the number of CPUs in the system is a typical sign of an overloaded system.

CPU Run Queue Latency

The microstate accounting system maintains accurate time counters for threads as well as CPUs. Thread-based microstate accounting tracks several meaningful states per thread in addition to user and system time, which include trap time, lock time, sleep time, and latency time. The process statistics tool, prstat, reports the per-thread microstates for user processes.

$ prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 25644 rmc       98 1.5 0.0 0.0 0.0 0.0 0.0 0.1   0  36 693   0 filebench/2
 25660 rmc       98 1.7 0.1 0.0 0.0 0.0 0.0 0.1   2  44 693   0 filebench/2
 25650 rmc       98 1.4 0.1 0.0 0.0 0.0 0.0 0.1   0  45 699   0 filebench/2
 25655 rmc       98 1.4 0.1 0.0 0.0 0.0 0.0 0.2   0  46 693   0 filebench/2
 25636 rmc       98 1.6 0.1 0.0 0.0 0.0 0.0 0.2   1  50 693   0 filebench/2
 25651 rmc       98 1.6 0.1 0.0 0.0 0.0 0.0 0.2   0  54 693   0 filebench/2
 25656 rmc       98 1.5 0.1 0.0 0.0 0.0 0.0 0.2   0  60 693   0 filebench/2
 25639 rmc       98 1.5 0.1 0.0 0.0 0.0 0.0 0.2   1  61 693   0 filebench/2
 25634 rmc       98 1.3 0.1 0.0 0.0 0.0 0.0 0.4   0  63 693   0 filebench/2
 25654 rmc       98 1.3 0.1 0.0 0.0 0.0 0.0 0.4   0  67 693   0 filebench/2
 25659 rmc       98 1.7 0.1 0.0 0.0 0.0 0.0 0.4   1  68 693   0 filebench/2
 25647 rmc       98 1.5 0.1 0.0 0.0 0.0 0.0 0.4   0  73 693   0 filebench/2
 25648 rmc       98 1.6 0.1 0.0 0.0 0.0 0.3 0.2   2  48 693   0 filebench/2
 25643 rmc       98 1.6 0.1 0.0 0.0 0.0 0.0 0.5   0  75 693   0 filebench/2
 25642 rmc       98 1.4 0.1 0.0 0.0 0.0 0.0 0.5   0  80 693   0 filebench/2
 25638 rmc       98 1.4 0.1 0.0 0.0 0.0 0.0 0.6   0  76 693   0 filebench/2
 25657 rmc       97 1.8 0.1 0.0 0.0 0.0 0.4 0.3   6  64 693   0 filebench/2
 25646 rmc       97 1.7 0.1 0.0 0.0 0.0 0.0 0.6   6  83 660   0 filebench/2
 25645 rmc       97 1.6 0.1 0.0 0.0 0.0 0.0 0.9   0  55 693   0 filebench/2
 25652 rmc       97 1.7 0.2 0.0 0.0 0.0 0.0 0.9   2 106 693   0 filebench/2
 25658 rmc       97 1.5 0.1 0.0 0.0 0.0 0.0 1.0   0  72 693   0 filebench/2
 25637 rmc       97 1.7 0.1 0.0 0.0 0.0 0.3 0.6   4  95 693   0 filebench/2
Total: 91 processes, 510 lwps, load averages: 28.94, 28.66, 24.39

By specifying the -m (show microstates) and -L (show per-thread) options, we can observe the per-thread microstates. These microstates represent a time-based summary broken into percentages for each thread. The columns USR through LAT sum to 100% of the time spent by each thread during the prstat sample. The important microstates for CPU utilization are USR, SYS, and LAT. The USR and SYS columns are the user and system time that this thread spent running on the CPU. The LAT (latency) column is the amount of time spent waiting for CPU. A non-zero number means there was some queuing for CPU resources. This is an extremely useful metric: we can use it to estimate the potential speedup for a thread if more CPU resources are added, assuming no other bottlenecks stand in the way. In our example, we can see that on average the filebench threads are waiting for CPU about 0.2% of the time, so we can conclude that CPU resources on this system are not constrained.
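The LAT percentages can be corroborated by measuring dispatcher queue wait times directly. The following script is a sketch closely modeled on the standard sched provider examples: it timestamps each thread as it is placed on a dispatcher queue and, when the thread is taken off, aggregates the wait time by CPU.

#!/usr/sbin/dtrace -s

sched:::enqueue
{
        /* timestamp this thread as it joins a dispatcher queue */
        start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::dequeue
/start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
        /* wait time in nanoseconds, aggregated per CPU */
        @[args[2]->cpu_id] = quantize(timestamp -
            start[args[0]->pr_lwpid, args[1]->pr_pid]);
        start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}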

Another example shows what we would observe when the system is CPU-resource constrained. In this example, we can see that on average each thread is waiting for CPU resource about 80% of the time.

$ prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 25765 rmc       22 0.3 0.1 0.0 0.0 0.0 0.0  77   0  42 165   0 filebench/2
 25833 rmc       22 0.3 0.3 0.0 0.0 0.0 0.0  77   0 208 165   0 filebench/2
 25712 rmc       20 0.2 0.1 0.0 0.0 0.0 0.0  80   0  53 132   0 filebench/2
 25758 rmc       20 0.3 0.1 0.0 0.0 0.0 0.0  80   0  84 148   0 filebench/2
 25715 rmc       20 0.3 0.1 0.0 0.0 0.0 0.0  80   0  56 132   0 filebench/2
 25812 rmc       19 0.2 0.1 0.0 0.0 0.0 0.0  81   0  50 132   0 filebench/2
 25874 rmc       19 0.2 0.0 0.0 0.0 0.0 0.0  81   0  22 132   0 filebench/2
 25842 rmc       19 0.2 0.2 0.0 0.0 0.0 0.0  81   1  92 132   0 filebench/2
 25732 rmc       19 0.2 0.1 0.0 0.0 0.0 0.0  81   0  54  99   0 filebench/2
 25714 rmc       18 0.3 0.1 0.0 0.0 0.0 0.0  81   0  84 165   0 filebench/2
 25793 rmc       18 0.3 0.1 0.0 0.0 0.0 0.0  81   0  30 132   0 filebench/2
 25739 rmc       18 0.3 0.3 0.0 0.0 0.0 0.0  81   0 150 115   0 filebench/2
 25849 rmc       18 0.3 0.0 0.0 0.0 0.0 0.0  81   1  19 132   0 filebench/2
 25788 rmc       18 0.2 0.1 0.0 0.0 0.0 0.0  81   0  77  99   0 filebench/2
 25760 rmc       18 0.2 0.0 0.0 0.0 0.0 0.0  82   0  26 132   0 filebench/2
 25748 rmc       18 0.3 0.1 0.0 0.0 0.0 0.0  82   0  58 132   0 filebench/2
 25835 rmc       18 0.3 0.1 0.0 0.0 0.0 0.0  82   0  65 132   0 filebench/2
 25851 rmc       18 0.2 0.1 0.0 0.0 0.0 0.0  82   0  87  99   0 filebench/2
 25811 rmc       18 0.3 0.2 0.0 0.0 0.0 0.0  82   0 129 132   0 filebench/2
 25767 rmc       18 0.2 0.1 0.0 0.0 0.0 0.0  82   1  25 132   0 filebench/2
 25740 rmc       18 0.3 0.2 0.0 0.0 0.0 0.0  82   0 118 132   0 filebench/2
 25770 rmc       18 0.2 0.1 0.0 0.0 0.0 0.0  82   0  68 132   0 filebench/2
Total: 263 processes, 842 lwps, load averages: 201.45, 192.26, 136.16

We can further investigate which threads are consuming CPU within each process by directing prstat to examine a specific process.

$ prstat -Lm -p 25691
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 25691 rmc       17 0.2 0.2 0.0 0.0 0.0 0.0  83   0  74  99   0 filebench/2
 25691 rmc      0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 filebench/1

The example shows that thread number two in the target process is using the most CPU and is spending 83% of its time waiting for CPU. We can look further at thread number two with the pstack command.

$ pstack 25691/2
25691:  shadow -a shadow -i 2 -s ffffffff10000000 -m /var/tmp/fbenchJDailY
-----------------  lwp# 2 / thread# 2  --------------------
 000000010001ae90 flowoplib_hog (30d40, ffffffff6518dc60, 30d40, ffffffff37352c88, 1,
2570d) + 68
 00000001000194a4 flowop_start (ffffffff37352c88, 0, 1, 0, 1, 1000888b0) + 408
 ffffffff7e7ccea0 _lwp_start (0, 0, 0, 0, 0, 0)

In this example, we’ve taken a snapshot of the stack of thread number two of our target process. At the time the snapshot was taken, we can see that the function flowop_start was calling flowoplib_hog. It’s sometimes worth taking several snapshots to see if a pattern is exhibited. DTrace can analyze this further.
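Rather than taking manual snapshots, DTrace can sample the target's user stacks so that the most frequent code paths rise to the top of the aggregation. A hedged one-liner (the PID is the one from the example above; press Ctrl-C to print results):

# dtrace -n 'profile-97 /pid == $target/ { @[ustack()] = count(); }' -p 25691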

CPU Statistics Internals

The following is a brief reference for how some of the CPU statistics are maintained by the kernel.

usr, sys, idl Times

The percent user, system and idle times printed by vmstat, sar, and mpstat are retrieved from kstat statistics. These statistics are updated by CPU microstate counters, which are kept in each CPU struct as cpu->cpu_acct[NCMSTATES]; these measure cumulative time in each CPU microstate as high-resolution time counters (hrtime_t). There are three CPU microstates, CMS_USER, CMS_SYSTEM, and CMS_IDLE (there is also a fourth, CMS_DISABLED, which isn’t used for microstate accounting).

These per CPU microstate counters are incremented by functions such as new_cpu_mstate() and syscall_mstate() from uts/common/os/msacct.c. When the CPU state changes, a timestamp is saved in cpu->cpu_mstate_start and the new state is saved in cpu->cpu_mstate. When the CPU state changes next, the current time is fetched (curtime) so that the elapsed time in that state can be calculated with curtime - cpu_mstate_start and then added to the appropriate microstate counter in cpu_acct[].

These microstates are then saved in kstat for each CPU as part of the cpu_sys_stats_ks_data struct defined in uts/common/os/cpu.c and are given the names cpu_nsec_user, cpu_nsec_kernel, and cpu_nsec_idle. Since user-land code expects these counters to be in terms of clock ticks, they are rounded down using NSEC_TO_TICK (see Section 2.8), and resaved in kstat with the names cpu_ticks_user, cpu_ticks_kernel, and cpu_ticks_idle.

Figure 2.1 summarizes the flow of data from the CPU structures to userland tools through kstat.

Figure 2.1. CPU Statistic Data Flow

The following code from cpu.c copies the cpu_acct[] values to kstat.

static int
cpu_sys_stats_ks_update(kstat_t *ksp, int rw)
{
..
        css = &cp->cpu_stats.sys;

        bcopy(&cpu_sys_stats_ks_data_template, ksp->ks_data,
            sizeof (cpu_sys_stats_ks_data_template));
        csskd->cpu_ticks_wait.value.ui64 = 0;
        csskd->wait_ticks_io.value.ui64 = 0;
        csskd->cpu_nsec_idle.value.ui64 = cp->cpu_acct[CMS_IDLE];
        csskd->cpu_nsec_user.value.ui64 = cp->cpu_acct[CMS_USER];
        csskd->cpu_nsec_kernel.value.ui64 = cp->cpu_acct[CMS_SYSTEM];
...

Note that cpu_ticks_wait is set to zero; this is the point in the code where wait I/O has been deprecated.

An older location for tick-based statistics is cpu->cpu_stats.sys, which is of type cpu_sys_stats_t, defined in /usr/include/sys/sysinfo.h, where the original tick counters of the style cpu_ticks_user are listed. The remaining statistics from cpu->cpu_stats.sys (for example, readch, writech) are copied directly into kstat’s cpu_sys_stats_ks_data.

Tools such as vmstat fetch the tick counters from kstat, which provides them under cpu:#:sys for each CPU. Although these counters use the term “ticks,” they are extremely accurate because they are rounded versions of the nsec counters, which in turn are copied from the CPU microstate counters. The mpstat command prints individual CPU statistics (Section 2.9), and the vmstat command aggregates statistics across all CPUs (Section 2.2).

Load Averages

The load averages that tools such as uptime print are retrieved with the getloadavg() system call, which returns them from a kernel array of signed ints called avenrun[]. They are actually maintained in a high-precision uint64_t array called hp_avenrun[] and then converted to avenrun[] to meet the original API. The code that maintains these arrays is in the clock() function in uts/common/os/clock.c and runs once per second. It involves the following.

The loadavg_update() function adds the user, system, and thread wait (latency) microstate accounting times together. This value is stored in an array within a struct loadavg_s, one of which exists for each CPU, for each CPU partition, and for the entire system. These arrays contain the last ten seconds of raw data. Then genloadavg() processes the CPU partition and system-wide arrays and returns the average for the last ten seconds. This value is fed to calcloadavg(), which applies exponential decay for the 1-, 5-, and 15-minute values, saving the results in hp_avenrun[] (or cp_hp_avenrun[] for the CPU partitions). hp_avenrun[] is then converted into avenrun[].

This means that the load averages are damped more than once: first through a rolling ten-second average, and then through exponential decay. Apart from the getloadavg() syscall, they are also available from kstat, where they are called avenrun_1min, avenrun_5min, and avenrun_15min. Running kstat -s avenrun* prints the raw unprocessed values, which must be divided by FSCALE (256, defined in sys/param.h) to produce the final load averages.

pr_pctcpu Field

The CPU field that prstat prints is pr_pctcpu, which is fetched by user-level tools from procfs. It is maintained for each thread as thread->t_pctcpu by the cpu_update_pct() function in common/os/msacct.c. This takes a high-resolution timestamp and calculates the elapsed time since the last measurement, which was stored in each thread’s t_hrtime. cpu_update_pct() is called by scheduling events, producing an extremely accurate measurement as this is based on events and not ticks. cpu_update_pct() is also called by procfs when a pr_pctcpu value is read, at which point every thread’s t_pctcpu is aggregated into pr_pctcpu.

The cpu_update_pct() function processes t_pctcpu as a decayed average by using two other functions: cpu_grow() and cpu_decay(). The behavior may be quite familiar: When a CPU-bound process begins, the reported CPU value is not immediately 100%; instead, it increases quickly at first and then slows, gradually reaching 100%. The algorithm has the following comment above the cpu_decay() function.

/*
 * Given the old percent cpu and a time delta in nanoseconds,
 * return the new decayed percent cpu:  pct * exp(-tau),
 * where 'tau' is the time delta multiplied by a decay factor.
 * We have chosen the decay factor (cpu_decay_factor in param.c)
 * to make the decay over five seconds be approximately 20%.
 *
...

This comment explains that t_pctcpu decays by approximately 20% over five seconds (and the same applies to pr_pctcpu).

User-level commands read pr_pctcpu by reading /proc/<pid>/psinfo for each process, which contains pr_pctcpu in a psinfo struct as defined in /usr/include/sys/procfs.h.

Using DTrace to Explain Events from Performance Tools

DTrace can be exploited to attribute the source of events noted in higher-level tools such as mpstat(1M). For example, if we see a significant amount of system time (%sys) and a high system call rate (syscl), then we might want to know who or what is causing those system calls.

# mpstat 2
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  117   0 1583   883  111 1487  593  150 6104   64 11108    7  92   0   1
  1  106   0  557   842    0 1804  694  150 6553   84 10684    6  93   0   1
  2  112   0  664   901    0 1998  795  143 6622   64 11227    6  93   0   1
  3   95   0  770  1035    0 2232  978  131 6549   59 11769    7  92   0   1
^C
# dtrace -n 'syscall:::entry { @[execname] = count(); }'
dtrace: description 'syscall:::entry ' matched 229 probes
^C

  inetd                                                              1
  svc.configd                                                        1
  fmd                                                                2
  snmpdx                                                             2
  utmpd                                                              2
  svc.startd                                                        13
  sendmail                                                          30
  snmpd                                                             36
  nscd                                                             105
  dtrace                                                          1311
  filebench                                                    3739725

Using the DTrace syscall provider, we can quickly identify which process is causing the most system calls. This dtrace one-liner counts system calls by process name. In this example, processes named filebench caused 3,739,725 system calls during the time the dtrace command was running.

We can then drill deeper by matching the syscall probe only when the exec name matches our investigation target, filebench, and counting the syscall name.

# dtrace -n 'syscall:::entry /execname == "filebench"/ { @[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 229 probes
^C

  lwp_continue                                                      4
  lwp_create                                                        4
  mmap                                                              4
  schedctl                                                          4
  setcontext                                                        4
  lwp_sigmask                                                       8
  nanosleep                                                        24
  yield                                                           554
  brk                                                            1590
  pwrite                                                        80795
  lwp_park                                                     161019
  read                                                         324159
  pread                                                        898401
  semsys                                                      1791717

Ah, so we can see that the semsys syscall is hot in this case. Let’s look at what is calling semsys by using the ustack() DTrace action.

# dtrace -n 'syscall::semsys:entry /execname == "filebench"/ { @[ustack()] = count();}'
dtrace: description 'syscall::semsys:entry ' matched 1 probe
^C
              libc.so.1`_syscall6+0x1c
              filebench`flowop_start+0x408
              libc.so.1`_lwp_start
            10793

              libc.so.1`_syscall6+0x1c
              filebench`flowop_start+0x408
              libc.so.1`_lwp_start
            10942

              libc.so.1`_syscall6+0x1c
              filebench`flowop_start+0x408
              libc.so.1`_lwp_start
            11084

We can now identify which system call, and then even obtain the hottest stack trace for accesses to that system call. We conclude by observing that the filebench flowop_start function is performing the majority of semsys system calls on the system.

DTrace Versions of runq-sz, %runocc

Existing tools often provide useful statistics, but not quite in the way that we want. For example, the sar command provides measurements for the length of the run queues (runq-sz), and a percent run queue occupancy (%runocc). These are useful metrics, but since they are sampled only once per second, their accuracy may not be satisfactory. DTrace allows us to revisit these measurements, customizing them to our liking.

runq-sz: DTrace can measure run queue length for each CPU and produce a distribution plot.

# dtrace -n 'profile-1000hz { @[cpu] = lquantize(curthread->t_cpu->cpu_disp->disp_nrunnable, 0, 64, 1); }'
dtrace: description 'profile-1000hz ' matched 1 probe
^C

        0
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     4665
               1 |@@@@                                     489
               2 |                                         41
               3 |                                         25
               4 |                                         4
               5 |                                         0

Rather than sampling once per second, this dtrace one-liner[8] samples at 1000 hertz. The example shows a single CPU system with some work queuing on its run queue, but not a great deal. A value of zero means no threads queued (no saturation); however, the CPU may still be processing a user or kernel thread (utilization).

What is actually measured by DTrace is the value of disp_nrunnable from the disp_t for the current CPU.

typedef struct _disp {
...
        pri_t           disp_maxrunpri; /* maximum run priority */
        pri_t           disp_max_unbound_pri;   /* max pri of unbound threads */

        volatile int    disp_nrunnable; /* runnable threads in cpu dispq */

        struct cpu      *disp_cpu;      /* cpu owning this queue or NULL */
} disp_t;
                                                             See /usr/include/sys/disp.h

%runocc: Measuring run queue occupancy is achieved in a similar fashion. disp_nrunnable is also used, but this time just to indicate the presence of queued threads.

#!/usr/sbin/dtrace -s
#pragma D option quiet

profile-1000hz
/curthread->t_cpu->cpu_disp->disp_nrunnable/
{
        @qocc[cpu] = count();
}

profile:::tick-1sec
{
        normalize(@qocc, 10);
        printf("\n%8s %8s\n", "CPU", "%runocc");
        printa("%8d %@8d\n", @qocc);
        clear(@qocc);
}

This script samples at 1000 hertz and uses a DTrace normalization of 10 to turn the 1000-count into a percentage. We ran this script on a busy 4-CPU server.

# ./runocc.d

     CPU  %runocc
       3       39
       1       49
       2       65
       0       97

     CPU  %runocc
       1        2
       3        8
       2       99
       0      100
...

Each CPU has an occupied run queue, especially CPU 0.

These examples of sampling activity at 1000 hertz are simple and possibly sufficiently accurate (certainly better than the original 1-hertz statistics). While DTrace can sample activity, it may be better suited to tracing activity, measuring nanosecond timestamps for each event. The sched provider exists to facilitate the tracing of scheduling events. With sched, runq-sz and %runocc can be measured with much higher accuracy.
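As a sketch of that approach, the following script maintains a per-CPU run queue length count at each enqueue and dequeue event rather than sampling it. It is illustrative, not polished: the counters start at zero, so queues that already contain threads when the script starts are undercounted at first.

#!/usr/sbin/dtrace -s

sched:::enqueue
{
        /* one more runnable thread on this CPU's dispatcher queue */
        qlen[args[2]->cpu_id]++;
        @[args[2]->cpu_id] = lquantize(qlen[args[2]->cpu_id], 0, 64, 1);
}

sched:::dequeue
/qlen[args[2]->cpu_id] > 0/
{
        qlen[args[2]->cpu_id]--;
}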

DTrace Probes for CPU States

The sched provider makes available probes related to CPU scheduling. Because CPUs are the one resource that all threads must consume, the sched provider is very useful for understanding systemic behavior. For example, using the sched provider, you can understand when and why threads sleep, run, change priority, or wake other threads.

As an example, one common question you might want answered is which CPUs are running threads and for how long. You can use the on-cpu and off-cpu probes to easily answer this question systemwide as shown in the following example.

#!/usr/sbin/dtrace -s

sched:::on-cpu
{
              self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
              @[cpu] = quantize(timestamp - self->ts);
              self->ts = 0;
}

The CPU overhead of tracing probe events is proportional to their frequency. The on-cpu and off-cpu probes fire for each context switch, so the CPU overhead increases as the rate of context switches per second increases. Compare this with the previous DTrace scripts, which sampled at 1000 hertz: their probe frequency is fixed. Either way, the CPU cost of running these scripts should be negligible.
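To gauge that frequency, and hence the tracing cost, before committing to a heavier script, a simple one-liner can report the systemwide context switch rate (our sketch, roughly equivalent to summing the csw column in mpstat):

# dtrace -n 'sched:::off-cpu { @ = count(); } tick-1sec { printa("switches/sec: %@d\n", @); clear(@); }'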

The following is an example of running this script.

# ./where.d
dtrace: script './where.d' matched 5 probes
^C

        0
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@                                       37
            8192 |@@@@@@@@@@@@@                            212
           16384 |@                                        30
           32768 |                                         10
           65536 |@                                        17
          131072 |                                         12
          262144 |                                         9
          524288 |                                         6
         1048576 |                                         5
         2097152 |                                         1
         4194304 |                                         3
         8388608 |@@@@                                     75
        16777216 |@@@@@@@@@@@@                             201
        33554432 |                                         6
        67108864 |                                         0

        1
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@                                        6
            8192 |@@@@                                     23
           16384 |@@@                                      18
           32768 |@@@@                                     22
           65536 |@@@@                                     22
          131072 |@                                        7
          262144 |                                         5
          524288 |                                         2
         1048576 |                                         3
         2097152 |@                                        9
         4194304 |                                         4
         8388608 |@@@                                      18
        16777216 |@@@                                      19
        33554432 |@@@                                      16
        67108864 |@@@@                                     21
       134217728 |@@                                       14
       268435456 |                                         0

The value is nanoseconds, and the count is the number of occasions on which a thread ran for that duration without leaving the CPU. The integer above each distribution plot is the CPU ID.

For CPU 0, a thread ran for between 8 and 16 microseconds on 212 occasions, shown by a small spike in the distribution plot. The other spike is for the 16- to 32-millisecond durations (which sounds like TS-class quanta; see Chapter 3 in Solaris Internals), for which threads ran 201 times.

The sched provider is discussed in Section 10.6.3.



[1] w -u prints the same line of output, perhaps not surprising since w is a hard link to uptime.

[2] This was the calculation, but now it has changed (see 2.12.2); the new way often produces values that resemble those of the old way, so the description still has some merit.

[3] Historically, this metric was useful on uniprocessor systems as a way of indicating how much time was spent waiting for I/O. On a multiprocessor system it’s not possible to make this simple approximation, which led to a significant amount of confusion (basically, if %wio was non-zero, then the only useful information that could be gleaned was that at least one thread somewhere was waiting for I/O). The magnitude of the %wio value is related more to how much time the system is idle than to waiting for I/O. You can get a more accurate waiting-for-I/O measure by measuring individual threads, and you can obtain it by using DTrace.

[4] sar seems to have a blind spot for a run queue size between 0.0 and 1.0.

[5] A value of 99% for short intervals is usually a rounding error. Another error can be due to drifting intervals and measuring the statistic after an extra update; this causes %runocc to be reported as over 100% (e.g., 119% for a 5-second interval).

[6] Pending bug 6302763; the description contains a workaround.

[7] In fact, once upon a time statistics were gathered every 60th of a second.

[8] This exists in the DTraceToolkit as dispqlen.d.
