In this chapter, we explore tools that can be used to monitor the performance of kernel subsystems, drivers, and other loadable kernel modules.
There are several tools available in the Solaris environment to measure and optimize the performance of kernel code and device drivers. The following tasks are the most common:
Identify the reason for high system time (mpstat %sys). We can use a kernel profile (DTrace or lockstat -I) or a trace (DTrace) to produce a ranked list of the system calls, functions, modules, drivers, or subsystems that are contributing to system time.
Identify the reason for nonscalability of a system call. Typically, our approach is to observe the wall clock time and CPU cycles of a code path as load is increased. We can use DTrace to identify both the CPU cycles and end-to-end wall clock time of a code path and quickly focus on the problem areas.
Understand the execution path of a subsystem to assist in diagnosis of a performance or functional problem. We can use DTrace to map the code's actual execution graph.
Identify the performance characteristics of a particular code path and optimize it. By measuring the CPU consumption of the code path, we can identify costly code or functions and make code-level improvements. The lockstat kernel profile can pinpoint CPU cycles down to individual instructions if required. DTrace can help us understand key performance factors for arbitrary code paths.
Identify the source of lock contention. We can use the lockstat(1M) utility and the DTrace lockstat provider to quantify and attribute lock contention to its source.
Examine interrupt statistics. We can use vmstat -i or intrstat (DTrace).
The lockstat command and DTrace can profile the kernel and so identify hot functions. We begin by discussing lockstat's kernel profiling function (the profile capability is buried inside the lock statistics tool). We then briefly mention how we would use DTrace. For a full description of how to use DTrace, refer to Chapter 10.
The lockstat utility contains a kernel profiling capability. By specifying the -I option, you instruct the lockstat utility to collect kernel function samples from a time-based profile interrupt, rather than from lock contention events. The following profile summarizes sampled instruction addresses and can optionally be reduced to function names or other specific criteria.
# lockstat -kIi997 sleep 10
Profiling interrupt: 10596 events in 5.314 seconds (1994 events/sec)
Count indv cuml rcnt nsec CPU+PIL Caller
-------------------------------------------------------------------------------
5122 48% 48% 1.00 1419 cpu[0] default_copyout
1292 12% 61% 1.00 1177 cpu[1] splx
1288 12% 73% 1.00 1118 cpu[1] idle
911 9% 81% 1.00 1169 cpu[1] disp_getwork
695 7% 88% 1.00 1170 cpu[1] i_ddi_splhigh
440 4% 92% 1.00 1163 cpu[1]+11 splx
414 4% 96% 1.00 1163 cpu[1]+11 i_ddi_splhigh
254 2% 98% 1.00 1176 cpu[1]+11 disp_getwork
27 0% 99% 1.00 1349 cpu[0] uiomove
27 0% 99% 1.00 1624 cpu[0] bzero
24 0% 99% 1.00 1205 cpu[0] mmrw
21 0% 99% 1.00 1870 cpu[0] (usermode)
9 0% 99% 1.00 1174 cpu[0] xcopyout
8 0% 99% 1.00 650 cpu[0] ktl0
6 0% 99% 1.00 1220 cpu[0] mutex_enter
5 0% 99% 1.00 1236 cpu[0] default_xcopyout
3 0% 100% 1.00 1383 cpu[0] write
3 0% 100% 1.00 1330 cpu[0] getminor
3 0% 100% 1.00 333 cpu[0] utl0
2 0% 100% 1.00 961 cpu[0] mmread
2 0% 100% 1.00 2000 cpu[0]+10 read_rtc
In the example, we use -I to request a kernel profile at 997 hertz (-i997) and to coalesce instruction addresses into function names (-k). If we didn't specify -k, we would see samples at instruction-level resolution, as function+offset.
In the next example, we request that stack backtraces be collected for each sample, to a depth of 10 (-s 10). With this option, lockstat prints a summary of each unique stack as sampled.
# lockstat -i997 -Iks 10 sleep 30
Profiling interrupt: 119800 events in 30.038 seconds (3988 events/sec)
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec CPU+PIL Hottest Caller
29919 25% 25% 0.00 5403 cpu[2] kcopy
nsec ------ Time Distribution ------ count Stack
1024 | 2 uiomove
2048 | 18 rdip
4096 | 25 ufs_read
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29853 fop_read
16384 | 21 pread64
sys_syscall32
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec CPU+PIL Hottest Caller
29918 25% 50% 0.00 5386 cpu[1] kcopy
nsec ------ Time Distribution ------ count Stack
4096 | 38 uiomove
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29870 rdip
16384 | 10 ufs_read
fop_read
pread64
sys_syscall32
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec CPU+PIL Hottest Caller
29893 25% 75% 0.00 5283 cpu[3] kcopy
nsec ------ Time Distribution ------ count Stack
1024 | 140 uiomove
2048 | 761 rdip
4096 |@ 1443 ufs_read
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 27532 fop_read
16384 | 17 pread64
sys_syscall32
-------------------------------------------------------------------------------
Locks are used in the kernel to serialize access to critical regions and data structures. If contention occurs around a lock, a performance problem or scalability limitation can result. Two main tools analyze lock contention in the kernel: lockstat(1M) and the DTrace lockstat provider.
Adaptive locks enforce mutual exclusion to a critical section and can be acquired in most contexts in the kernel. Because adaptive locks have few context restrictions, they constitute the vast majority of synchronization primitives in the Solaris kernel. These locks are adaptive in their behavior with respect to contention. When a thread attempts to acquire a held adaptive lock, it determines if the owning thread is currently running on a CPU. If the owner is running on another CPU, the acquiring thread spins. If the owner is not running, the acquiring thread blocks.
To observe adaptive locks, first consider the spin behavior. Locks that spin excessively burn CPU cycles, behavior that is manifested as high system time. If you notice high system time with mpstat(1M), spin locks might be a contributor. You can confirm the amount of system time that results from spinning lock contention by looking at the kernel function profile; spinning locks show up as mutex_* functions high in the profile. To identify which lock is spinning and which functions are causing the lock contention, use lockstat(1M) and the DTrace lockstat provider.
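As an illustrative sketch (not from the original text), a DTrace one-liner using the lockstat provider can aggregate adaptive-lock spin events by kernel stack, showing which code paths are contending:

```
# dtrace -n 'lockstat:::adaptive-spin { @[stack()] = count(); }'
```

Press Ctrl-C to end sampling; the hottest stacks print last, and their top frames usually name the subsystem responsible for the contention.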
Adaptive locks that block yield the CPU, and excessive blocking results in idle time and nonscalability. To identify which lock is blocking and which functions are causing the lock contention, again use lockstat(1M) and DTrace.
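As a sketch of that approach, the adaptive-block probe can attribute blocking time to code paths; arg1 of adaptive-block is documented as the blocking duration in nanoseconds:

```
# dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
```

Summing arg1 rather than counting events weights each stack by total time blocked, which is usually the more relevant cost.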
Threads cannot block in some kernel contexts, such as high-level interrupt context and any context that manipulates dispatcher state. This restriction prevents the use of adaptive locks in these contexts; spin locks are used instead to effect mutual exclusion to critical sections. As the name implies, the behavior of these locks in the presence of contention is to spin until the lock is released by the owning thread.
Locks that spin excessively burn CPU cycles, manifested as high system time. If you notice high system time with mpstat(1M), spin locks might be a contributor. You can confirm the amount of system time that results from spinning lock contention by looking at the kernel function profile; spinning locks show up as mutex_* functions high in the profile. To identify which lock is spinning and which functions are causing the lock contention, use lockstat(1M) and the DTrace lockstat provider.
Readers/writer locks enforce a policy of allowing multiple readers or a single writer—but not both—to be in a critical section. These locks are typically used for structures that are searched more frequently than they are modified and for which there is substantial time in the critical section. If critical section times are short, readers/writer locks implicitly serialize over the shared memory used to implement the lock, giving them no advantage over adaptive locks. See rwlock(9F) for more details on readers/writer locks.
Readers/writer locks that block yield the CPU, and excessive blocking results in idle time and nonscalability. To identify which lock is blocking and which functions are causing the lock contention, use lockstat(1M) and the DTrace lockstat provider.
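The same blocking-time sketch applies to readers/writer locks; assuming, per the lockstat provider documentation, that arg1 of rw-block is the blocking time in nanoseconds:

```
# dtrace -n 'lockstat:::rw-block { @[stack()] = sum(arg1); }'
```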
A thread lock is a special kind of spin lock that locks a thread in order to change thread state.
The lockstat command provides summary or detail information about lock events in the kernel. By default (without the -I option demonstrated previously), it provides a systemwide summary of lock contention events for the duration of a command that is supplied as an argument. For example, to make lockstat sample for 30 seconds, we often use sleep 30 as the command. Note that lockstat doesn't actually introspect the sleep command; it is only there to control the sample window.
We recommend starting with the -P option, which sorts by the product of the number of contention events and the cost of each event; this puts the most resource-expensive events at the top of the list.
# lockstat -P sleep 30
Adaptive mutex spin: 3486197 events in 30.031 seconds (116088 events/sec)
Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
1499963 43% 43% 0.00 84 pr_pidlock pr_p_lock+0x29
1101187 32% 75% 0.00 24 0xffffffff810cdec0 pr_p_lock+0x50
285012 8% 83% 0.00 27 0xffffffff827a9858 rdip+0x506
...
For each type of lock, the total number of events during the sample and the length of the sample period are displayed. For each record within the lock type, the following information is provided:
Count. The number of contention events for this lock.
indv. The percentage that this record contributes to the total sample set.
cuml. A cumulative percentage of samples contributing to the total sample set.
rcnt. Average reference count. This will always be 1 for exclusive locks (mutexes, spin locks, rwlocks held as writer) but can be greater than 1 for shared locks (rwlocks held as reader).
nsec or spin. The average duration of the contention event, in nanoseconds for block events or in spin cycles for spin events.
Lock. The address or symbol name of the lock object.
CPU+PIL. The CPU ID and the processor interrupt level at the time of the sample. For example, if CPU 4 is interrupted while at PIL 6, this is reported as cpu[4]+6.
Caller. The calling function and the instruction offset within the function.
To estimate the impact of a lock, multiply Count by the cost. For example, if a blocking event costs 48,944,759 ns on average and occurs 1,929 times in a 30-second window, we can assert that the lock is blocking threads for a total of about 94 seconds during that period. How can this be greater than 30 seconds? Multiple threads are blocking, so with overlapping blocking events the total blocking time can exceed the elapsed time of the sample.
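The arithmetic above can be sketched quickly; the event count and average cost are taken from the adaptive mutex block record shown in the output that follows:

```python
# Estimate total blocked time from a lockstat "Adaptive mutex block" record.
count = 1929             # contention events observed in the 30-second sample
avg_block_ns = 48944759  # average block time per event, in nanoseconds

total_block_s = count * avg_block_ns / 1e9
print(round(total_block_s, 1))  # about 94.4 seconds of (overlapping) blocking
```

Because the 94 seconds is a sum across many threads, it measures aggregate impact, not elapsed time.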
The full output from this example with the -P option follows.
# lockstat -P sleep 30
Adaptive mutex spin: 3486197 events in 30.031 seconds (116088 events/sec)
Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
1499963 43% 43% 0.00 84 pr_pidlock pr_p_lock+0x29
1101187 32% 75% 0.00 24 0xffffffff810cdec0 pr_p_lock+0x50
285012 8% 83% 0.00 27 0xffffffff827a9858 rdip+0x506
212621 6% 89% 0.00 29 0xffffffff827a9858 rdip+0x134
98531 3% 92% 0.00 103 0xffffffff9321d480 releasef+0x55
92486 3% 94% 0.00 19 0xffffffff8d5c4990 ufs_lockfs_end+0x81
89404 3% 97% 0.00 27 0xffffffff8d5c4990 ufs_lockfs_begin+0x9f
83186 2% 99% 0.00 96 0xffffffff9321d480 getf+0x5d
6356 0% 99% 0.00 186 0xffffffff810cdec0 clock+0x4e9
1164 0% 100% 0.00 141 0xffffffff810cdec0 post_syscall+0x352
294 0% 100% 0.00 11 0xffffffff801a4008 segmap_smapadd+0x77
279 0% 100% 0.00 11 0xffffffff801a41d0 segmap_getmapflt+0x275
278 0% 100% 0.00 11 0xffffffff801a48f0 segmap_smapadd+0x77
276 0% 100% 0.00 11 0xffffffff801a5010 segmap_getmapflt+0x275
276 0% 100% 0.00 11 0xffffffff801a4008 segmap_getmapflt+0x275
...
Adaptive mutex block: 3328 events in 30.031 seconds (111 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
1929 58% 58% 0.00 48944759 pr_pidlock pr_p_lock+0x29
263 8% 66% 0.00 47017 0xffffffff810cdec0 pr_p_lock+0x50
255 8% 74% 0.00 53392369 0xffffffff9321d480 getf+0x5d
217 7% 80% 0.00 26133 0xffffffff810cdec0 clock+0x4e9
207 6% 86% 0.00 227146 0xffffffff827a9858 rdip+0x134
197 6% 92% 0.00 64467 0xffffffff8d5c4990 ufs_lockfs_begin+0x9f
122 4% 96% 0.00 64664 0xffffffff8d5c4990 ufs_lockfs_end+0x81
112 3% 99% 0.00 164559 0xffffffff827a9858 rdip+0x506
Spin lock spin: 3491 events in 30.031 seconds (116 events/sec)
Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
2197 63% 63% 0.00 2151 turnstile_table+0xbd8 disp_lock_enter+0x35
314 9% 72% 0.00 3129 turnstile_table+0xe28 disp_lock_enter+0x35
296 8% 80% 0.00 3162 turnstile_table+0x888 disp_lock_enter+0x35
211 6% 86% 0.00 2032 turnstile_table+0x8a8 disp_lock_enter+0x35
127 4% 90% 0.00 856 turnstile_table+0x9f8 turnstile_interlock+0x171
114 3% 93% 0.00 269 turnstile_table+0x9f8 disp_lock_enter+0x35
44 1% 95% 0.00 90 0xffffffff827f4de0 disp_lock_enter_high+0x13
37 1% 96% 0.00 581 0xffffffff827f4de0 disp_lock_enter+0x35
...
Thread lock spin: 1104 events in 30.031 seconds (37 events/sec)
Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
487 44% 44% 0.00 1671 turnstile_table+0xbd8 ts_tick+0x26
219 20% 64% 0.00 1510 turnstile_table+0xbd8 turnstile_block+0x387
92 8% 72% 0.00 1941 turnstile_table+0x8a8 ts_tick+0x26
77 7% 79% 0.00 2037 turnstile_table+0xe28 ts_tick+0x26
74 7% 86% 0.00 2296 turnstile_table+0x888 ts_tick+0x26
36 3% 89% 0.00 292 cpu[0]+0xf8 ts_tick+0x26
27 2% 92% 0.00 55 cpu[1]+0xf8 ts_tick+0x26
11 1% 93% 0.00 26 cpu[3]+0xf8 ts_tick+0x26
10 1% 94% 0.00 11 cpu[2]+0xf8 post_syscall+0x556
The lockstat provider probes help you discern lock contention statistics or understand virtually any aspect of locking behavior. The lockstat(1M) command is actually a DTrace consumer that uses the lockstat provider to gather its raw data.
The lockstat provider makes available two kinds of probes: contention-event probes and hold-event probes.
Contention-event probes correspond to contention on a synchronization primitive; they fire when a thread is forced to wait for a resource to become available. Solaris is generally optimized for the noncontention case, so prolonged contention is not expected. Use these probes to aid your understanding of those cases in which contention does arise. Because contention is relatively rare, enabling contention-event probes generally doesn’t substantially affect performance.
Hold-event probes correspond to acquiring, releasing, or otherwise manipulating a synchronization primitive. These probes can answer arbitrary questions about the way synchronization primitives are manipulated. Because Solaris acquires and releases synchronization primitives very often (on the order of millions of times per second per CPU on a busy system), enabling hold-event probes has a much higher probe effect than does enabling contention-event probes. While the probe effect induced by enabling the probes can be substantial, it is not pathological, so you can enable them with confidence on production systems.
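For instance, a simple hold-event question—which processes acquire adaptive mutexes most often—can be answered with a one-liner like the following (an illustrative sketch, not from the original text):

```
# dtrace -n 'lockstat:::adaptive-acquire { @[execname] = count(); }'
```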
The lockstat provider makes available probes that correspond to the different synchronization primitives in Solaris; these primitives and the probes that correspond to them are discussed in Section 10.6.4.
The provider probes are as follows:
Adaptive lock probes. The four lockstat probes are adaptive-acquire, adaptive-block, adaptive-spin, and adaptive-release. They are shown for reference in Table 10.7. For each probe, arg0 contains a pointer to the kmutex_t structure that represents the adaptive lock.
Adaptive locks are much more common than spin locks. The following script displays totals for both lock types to provide data to support this observation.
lockstat:::adaptive-acquire
/execname == "date"/
{
        @locks["adaptive"] = count();
}

lockstat:::spin-acquire
/execname == "date"/
{
        @locks["spin"] = count();
}
If we run this script in one window and run a date(1) command in another, then when we terminate the DTrace script, we see the following output.
# dtrace -s ./whatlock.d
dtrace: script './whatlock.d' matched 5 probes
^C
spin 26
adaptive 2981
As this output indicates, over 99% of the locks acquired from running the date command are adaptive locks. It may be surprising that so many locks are acquired in doing something as simple as retrieving a date. The large number of locks is a natural artifact of the fine-grained locking required of an extremely scalable system like the Solaris kernel.
Spin lock probes. The three probes pertaining to spin locks are spin-acquire, spin-spin, and spin-release. They are shown in Table 10.8.
Thread locks. Thread lock hold events are available as spin lock hold-event probes (that is, spin-acquire and spin-release), but contention events have their own probe (thread-spin) specific to thread locks. The thread lock hold-event probe is described in Table 10.9.
Readers/writer lock probes. The probes pertaining to readers/writer locks are rw-acquire, rw-block, rw-upgrade, rw-downgrade, and rw-release. They are shown in Table 10.10. For each probe, arg0 contains a pointer to the krwlock_t structure that represents the readers/writer lock.
The profile provider in DTrace identifies hot functions by sampling the kernel stack activity.
# dtrace -n 'profile-997hz / arg0 != 0 / { @ks[stack()]=count() }'
dtrace: description 'profile-997hz ' matched 1 probe
^C
genunix'syscall_mstate+0x1c7
unix'sys_syscall32+0xbd
1
unix'bzero+0x3
procfs'pr_read_lwpusage_32+0x2f
procfs'prread+0x5d
genunix'fop_read+0x29
genunix'pread+0x217
genunix'pread32+0x26
unix'sys_syscall32+0x101
1
unix'kcopy+0x38
genunix'copyin_nowatch+0x48
genunix'copyin_args32+0x45
genunix'syscall_entry+0xcb
unix'sys_syscall32+0xe1
1
unix'sys_syscall32+0xae
1
unix'mutex_exit+0x19
ufs'rdip+0x368
ufs'ufs_read+0x1a6
genunix'fop_read+0x29
genunix'pread64+0x1d7
unix'sys_syscall32+0x101
2
unix'kcopy+0x2c
genunix'uiomove+0x17f
ufs'rdip+0x382
ufs'ufs_read+0x1a6
genunix'fop_read+0x29
genunix'pread64+0x1d7
unix'sys_syscall32+0x101
13
Another useful measure of kernel activity is the number of interrupts received. A device may be busy processing a flood of interrupts and consuming significant CPU time; this CPU time may not appear in the usual by-process view from prstat.
The -i option of the vmstat command obtains interrupt statistics.
$ vmstat -i
interrupt total rate
--------------------------------
clock 272636119 100
hmec0 726271 0
audiocs 0 0
fdc0 8 0
ecppc0 0 0
--------------------------------
Total 273362398 100
In this example, the hmec0 device received 726,271 interrupts. The rate is also printed, which for the clock interrupt is 100 hertz. This output can be handy, although the counters that vmstat currently uses are of type ulong_t, which may wrap and thus print incorrect values if a server is online for several months.
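To see why wrap is plausible, consider a 32-bit counter (a sketch; the actual width of ulong_t depends on the kernel's data model, and the 5000 interrupts/sec figure is a hypothetical busy device, not from the output above):

```python
# Days until a 32-bit interrupt counter wraps at a given interrupt rate.
def days_to_wrap(rate_hz, bits=32):
    return (2 ** bits) / rate_hz / 86400  # 86400 seconds per day

print(round(days_to_wrap(100)))   # clock at 100 Hz: about 497 days
print(round(days_to_wrap(5000)))  # a device at 5000 ints/sec: about 10 days
```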
The intrstat command, new in Solaris 10, uses DTrace. It measures the number of interrupts and, more importantly, the CPU time consumed servicing interrupts, by driver instance. This information is priceless and was extremely difficult to measure on previous versions of Solaris.
In the following example, we ran intrstat on an UltraSPARC 5 with a 360 MHz CPU and a 100 Mbit/sec interface while heavy network traffic was being received.
# intrstat 2
            device |      cpu0 %tim
------------------+----------------
             hme#0 |      2979 43.5

            device |      cpu0 %tim
------------------+----------------
             hme#0 |      2870 42.6
            uata#0 |         0  0.0
...
The hme0 instance consumed a whopping 43.5% of the CPU for the first 2-second sample. This value is huge, bearing in mind that the network stack of Solaris 10 is much faster than that of previous versions. Extrapolating, it seems unlikely that this server could ever drive a gigabit Ethernet card at full speed if one were installed.
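The extrapolation can be made concrete with back-of-the-envelope arithmetic, assuming (crudely) that interrupt CPU cost scales linearly with link speed:

```python
# If 100 Mbit/s of traffic costs 43.5% of one CPU in interrupt time,
# project the cost of saturating a 1 Gbit/s link under linear scaling.
cpu_pct_at_100mbit = 43.5
projected_pct_at_1gbit = cpu_pct_at_100mbit * (1000 / 100)
print(projected_pct_at_1gbit)  # 435.0 -- far more than one CPU can supply
```

In practice, interrupt coalescing and faster hardware would change the slope, but the single-CPU conclusion stands.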
The intrstat command should become a regular tool for the analysis of both kernel driver activity and CPU consumption, especially for network drivers.