Chapter 9. Kernel Monitoring

In this chapter, we explore tools that can be used to monitor the performance of kernel subsystems, device drivers, and other loadable kernel modules.

Tools for Kernel Monitoring

There are several tools available in the Solaris environment to measure and optimize the performance of kernel code and device drivers. The following tasks are the most common:

  • Identify the reason for high system time (mpstat %sys). We can use a kernel profile (DTrace or lockstat -I) or trace (DTrace) to produce a ranked list of system calls, functions, modules, drivers, or subsystems that are contributing to system time (a minimal one-liner sketch follows this list).

  • Identify the reason for nonscalability on behalf of a system call. Typically, our approach is to observe the wall clock time and CPU cycles of a code path as load is increased. We can use DTrace to identify both the CPU cycles and end-to-end wall clock time of a code path and quickly focus on the problem areas.

  • Understand the execution path of a subsystem to assist in diagnosis of a performance or functional problem. We can use DTrace to map the code’s actual execution graph.

  • Identify the performance characteristics and optimize a particular code path. By measuring the CPU consumption of the code path, we can identify costly code or functions and make code-level improvements. The lockstat kernel profile can pinpoint CPU cycles down to individual instructions if required. DTrace can help us understand key performance factors for arbitrary code paths.

  • Identify the source of lock contention. We can use the lockstat(1M) utility and DTrace lockstat provider to quantify and attribute lock contention to source.

  • Examine interrupt statistics. We can use vmstat -i or intrstat (DTrace).
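
As a minimal sketch of the first task in the preceding list, the following DTrace one-liner samples the kernel at 997 hertz and produces a ranked count of the kernel modules in which the CPUs were found executing (the sampling frequency and the choice to aggregate by module rather than by function are arbitrary illustrative choices):

# dtrace -n 'profile-997hz /arg0/ { @[mod(arg0)] = count(); }'

Here arg0 is the kernel program counter at the time of the sample, so the predicate restricts the aggregation to samples taken in kernel mode; mod() resolves each sampled address to its kernel module or driver.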

Profiling the Kernel and Drivers

The lockstat command and DTrace can profile the kernel and so identify hot functions. We begin by discussing lockstat’s kernel profile function (the profile capability is buried inside the lock statistics tool). We then briefly mention how we would use DTrace. For a full description of how to use DTrace, refer to Chapter 10.

Profiling the Kernel with lockstat -I

The lockstat utility contains a kernel profiling capability. By specifying the -I option, you instruct the lockstat utility to collect kernel function samples from a time-based profile interrupt rather than from lock contention events. The resulting profile summarizes sampled instruction addresses, which can optionally be coalesced into function names or reduced by other criteria.

# lockstat -kIi997 sleep 10
Profiling interrupt: 10596 events in 5.314 seconds (1994 events/sec)
Count indv cuml rcnt     nsec CPU+PIL                Caller
-------------------------------------------------------------------------------
 5122  48%  48% 1.00     1419 cpu[0]                 default_copyout
 1292  12%  61% 1.00     1177 cpu[1]                 splx
 1288  12%  73% 1.00     1118 cpu[1]                 idle
  911   9%  81% 1.00     1169 cpu[1]                 disp_getwork
  695   7%  88% 1.00     1170 cpu[1]                 i_ddi_splhigh
  440   4%  92% 1.00     1163 cpu[1]+11              splx
  414   4%  96% 1.00     1163 cpu[1]+11              i_ddi_splhigh
  254   2%  98% 1.00     1176 cpu[1]+11              disp_getwork
   27   0%  99% 1.00     1349 cpu[0]                 uiomove
   27   0%  99% 1.00     1624 cpu[0]                 bzero
   24   0%  99% 1.00     1205 cpu[0]                 mmrw
   21   0%  99% 1.00     1870 cpu[0]                 (usermode)
    9   0%  99% 1.00     1174 cpu[0]                 xcopyout
    8   0%  99% 1.00      650 cpu[0]                 ktl0
    6   0%  99% 1.00     1220 cpu[0]                 mutex_enter
    5   0%  99% 1.00     1236 cpu[0]                 default_xcopyout
    3   0% 100% 1.00     1383 cpu[0]                 write
    3   0% 100% 1.00     1330 cpu[0]                 getminor
    3   0% 100% 1.00      333 cpu[0]                 utl0
    2   0% 100% 1.00      961 cpu[0]                 mmread
    2   0% 100% 1.00     2000 cpu[0]+10              read_rtc

In the example, we use -I to request a kernel profile at 997 hertz (-i997) and to coalesce instruction addresses into function names (-k). If we didn’t specify -k, then we would see samples with instruction-level resolution, as function+offset.

In the next example, we request that stack backtraces be collected for each sample, to a depth of 10 (-s 10). With this option, lockstat prints a summary of each unique stack as sampled.

# lockstat -i997 -Iks 10 sleep 30
Profiling interrupt: 119800 events in 30.038 seconds (3988 events/sec)
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec CPU+PIL                Hottest Caller
29919  25%  25% 0.00     5403 cpu[2]                 kcopy

      nsec ------ Time Distribution ------ count     Stack
      1024 |                               2         uiomove
      2048 |                               18        rdip
      4096 |                               25        ufs_read
      8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  29853     fop_read
     16384 |                               21        pread64
                                                     sys_syscall32
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec CPU+PIL                Hottest Caller
29918  25%  50% 0.00     5386 cpu[1]                 kcopy

      nsec ------ Time Distribution ------ count     Stack
      4096 |                               38        uiomove
      8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  29870     rdip
     16384 |                               10        ufs_read
                                                     fop_read
                                                     pread64
                                                     sys_syscall32
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec CPU+PIL                Hottest Caller
29893  25%  75% 0.00     5283 cpu[3]                 kcopy

      nsec ------ Time Distribution ------ count     Stack
      1024 |                               140       uiomove
      2048 |                               761       rdip
      4096 |@                              1443      ufs_read
      8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@    27532     fop_read
     16384 |                               17        pread64
                                                     sys_syscall32
-------------------------------------------------------------------------------

Analyzing Kernel Locks

Locks are used in the kernel to serialize access to critical regions and data structures. If contention occurs around a lock, a performance problem or scalability limitation can result. Two main tools analyze lock contention in the kernel: lockstat(1M) and the DTrace lockstat provider.

Adaptive Locks

Adaptive locks enforce mutual exclusion to a critical section and can be acquired in most contexts in the kernel. Because adaptive locks have few context restrictions, they constitute the vast majority of synchronization primitives in the Solaris kernel. These locks are adaptive in their behavior with respect to contention. When a thread attempts to acquire a held adaptive lock, it determines if the owning thread is currently running on a CPU. If the owner is running on another CPU, the acquiring thread spins. If the owner is not running, the acquiring thread blocks.

To observe adaptive locks, first consider the spin behavior. Locks that spin excessively burn CPU cycles, behavior that is manifested as high system time. If you notice high system time with mpstat(1M), spin locks might be a contributor. You can confirm the amount of system time that results from spinning lock contention by looking at the kernel function profile; spinning locks show up as mutex_* functions high in the profile. To identify which lock is spinning and which functions are causing the lock contention, use lockstat(1M) and the DTrace lockstat provider.

Adaptive locks that block yield the CPU, and excessive blocking results in idle time and nonscalability. To identify which lock is blocking and which functions are causing the lock contention, again use lockstat(1M) and DTrace.
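
As a minimal sketch of what such a DTrace investigation can look like (the aggregation choices here are illustrative only), the following one-liners attribute adaptive-lock contention to the kernel stacks that encounter it. The second assumes, as documented for the lockstat provider, that arg1 of the adaptive-block probe carries the sleep time in nanoseconds.

# dtrace -n 'lockstat:::adaptive-spin { @[stack()] = count(); }'
# dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'

The first counts spin-contention events by kernel stack; the second sums the time threads spent blocked on adaptive locks, again keyed by stack, so the most expensive code paths appear last in the sorted output.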

Spin Locks

Threads cannot block in some kernel contexts, such as high-level interrupt context and any context that manipulates dispatcher state. In these contexts, this restriction prevents the use of adaptive locks. Spin locks are instead used to effect mutual exclusion to critical sections in these contexts. As the name implies, these locks spin in the presence of contention until the lock is released by the owning thread.

Locks that spin excessively burn CPU cycles, manifested as high system time. If you notice high system time with mpstat(1M), spin locks might be a contributor. You can confirm the amount of system time that results from spinning lock contention by looking at the kernel function profile; spinning locks show up as mutex_* functions high in the profile. To identify which lock is spinning and which functions are causing the lock contention, use lockstat(1M) and the DTrace lockstat provider.

Reader/Writer Locks

Readers/writer locks enforce a policy of allowing multiple readers or a single writer—but not both—to be in a critical section. These locks are typically used for structures that are searched more frequently than they are modified and for which there is substantial time in the critical section. If critical section times are short, readers/writer locks implicitly serialize over the shared memory used to implement the lock, giving them no advantage over adaptive locks. See rwlock(9F) for more details on readers/writer locks.

Reader/writer locks that block yield the CPU, and excessive blocking results in idle time and nonscalability. To identify which lock is blocking and which functions are causing the lock contention, use lockstat(1M) and the DTrace lockstat provider.
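
A minimal sketch along the same lines counts readers/writer lock blocking events by the kernel stack that blocked:

# dtrace -n 'lockstat:::rw-block { @[stack()] = count(); }'

The lockstat(1M) command reports the same events in its readers/writer blocking summaries.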

Thread Locks

A thread lock is a special kind of spin lock that locks a thread in order to change thread state.

Analyzing Locks with lockstat

The lockstat command provides summary or detail information about lock events in the kernel. By default (without the -I as previously demonstrated), it provides a systemwide summary for lock contention events for the duration of a command that is supplied as an argument. For example, to make lockstat sample for 30 seconds, we often use sleep 30 as the command. Note that lockstat doesn’t actually introspect the sleep command; it’s only there to control the sample window.

We recommend starting with the -P option, which sorts each lock-type summary by the product of the number of contention events and the cost of each event (Count × spin or Count × nsec); this puts the most resource-expensive events at the top of the list.

# lockstat -P sleep 30

Adaptive mutex spin: 3486197 events in 30.031 seconds (116088 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
1499963 43%  43% 0.00       84 pr_pidlock             pr_p_lock+0x29
1101187 32%  75% 0.00       24 0xffffffff810cdec0     pr_p_lock+0x50
285012   8%  83% 0.00       27 0xffffffff827a9858     rdip+0x506
...

For each type of lock, the total number of events during the sample and the length of the sample period are displayed. For each record within the lock type, the following information is provided:

  • Count. The number of contention events for this lock.

  • indv. The percentage that this record contributes to the total sample set.

  • cuml. A cumulative percentage of samples contributing to the total sample set.

  • rcnt. Average reference count. This will always be 1 for exclusive locks (mutexes, spin locks, rwlocks held as writer) but can be greater than 1 for shared locks (rwlocks held as reader).

  • nsec or spin. For block events, the average duration of the contention event in nanoseconds; for spin events, the average number of spins.

  • Lock. The address or symbol name of the lock object.

  • CPU+PIL. In profiling output (-I), the CPU ID and the processor interrupt level at the time of the sample, shown in place of Lock. For example, if CPU 4 is interrupted while at PIL 6, this is reported as cpu[4]+6.

  • Caller. The calling function and the instruction offset within the function.

To estimate the impact of a lock, multiply Count by the cost. For example, if a blocking event costs 48,944,759 ns on average and the event occurs 1,929 times in a 30-second window, we can assert that the lock is blocking threads for a total of about 94 seconds (1,929 × 48,944,759 ns ≈ 94 s) during that 30-second period. How can the total be greater than 30 seconds? Multiple threads are blocking, so because of overlapping blocking events, the total blocking time can be larger than the elapsed time of the sample.

The full output from this example with the -P option follows.

# lockstat -P sleep 30

Adaptive mutex spin: 3486197 events in 30.031 seconds (116088 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
1499963  43%  43% 0.00       84 pr_pidlock             pr_p_lock+0x29
1101187  32%  75% 0.00       24 0xffffffff810cdec0     pr_p_lock+0x50
285012   8%  83% 0.00       27 0xffffffff827a9858     rdip+0x506
212621   6%  89% 0.00       29 0xffffffff827a9858     rdip+0x134
98531   3%  92% 0.00      103 0xffffffff9321d480     releasef+0x55
92486   3%  94% 0.00       19 0xffffffff8d5c4990     ufs_lockfs_end+0x81
89404   3%  97% 0.00       27 0xffffffff8d5c4990     ufs_lockfs_begin+0x9f
83186   2%  99% 0.00       96 0xffffffff9321d480     getf+0x5d
 6356   0%  99% 0.00      186 0xffffffff810cdec0     clock+0x4e9
 1164   0% 100% 0.00      141 0xffffffff810cdec0     post_syscall+0x352
  294   0% 100% 0.00       11 0xffffffff801a4008     segmap_smapadd+0x77
  279   0% 100% 0.00       11 0xffffffff801a41d0     segmap_getmapflt+0x275
  278   0% 100% 0.00       11 0xffffffff801a48f0     segmap_smapadd+0x77
  276   0% 100% 0.00       11 0xffffffff801a5010     segmap_getmapflt+0x275
  276   0% 100% 0.00       11 0xffffffff801a4008     segmap_getmapflt+0x275
...
Adaptive mutex block: 3328 events in 30.031 seconds (111 events/sec)
Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
 1929  58%  58% 0.00 48944759 pr_pidlock             pr_p_lock+0x29
  263   8%  66% 0.00    47017 0xffffffff810cdec0     pr_p_lock+0x50
  255   8%  74% 0.00 53392369 0xffffffff9321d480     getf+0x5d
  217   7%  80% 0.00    26133 0xffffffff810cdec0     clock+0x4e9
  207   6%  86% 0.00   227146 0xffffffff827a9858     rdip+0x134
  197   6%  92% 0.00    64467 0xffffffff8d5c4990     ufs_lockfs_begin+0x9f
  122   4%  96% 0.00    64664 0xffffffff8d5c4990     ufs_lockfs_end+0x81
  112   3%  99% 0.00   164559 0xffffffff827a9858     rdip+0x506
Spin lock spin: 3491 events in 30.031 seconds (116 events/sec)
Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
 2197  63%  63% 0.00     2151 turnstile_table+0xbd8  disp_lock_enter+0x35
  314   9%  72% 0.00     3129 turnstile_table+0xe28  disp_lock_enter+0x35
  296   8%  80% 0.00     3162 turnstile_table+0x888  disp_lock_enter+0x35
  211   6%  86% 0.00     2032 turnstile_table+0x8a8  disp_lock_enter+0x35
  127   4%  90% 0.00      856 turnstile_table+0x9f8  turnstile_interlock+0x171
  114   3%  93% 0.00      269 turnstile_table+0x9f8  disp_lock_enter+0x35
   44   1%  95% 0.00       90 0xffffffff827f4de0     disp_lock_enter_high+0x13
   37   1%  96% 0.00      581 0xffffffff827f4de0     disp_lock_enter+0x35
...
Thread lock spin: 1104 events in 30.031 seconds (37 events/sec)
Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
  487  44%  44% 0.00     1671 turnstile_table+0xbd8  ts_tick+0x26
  219  20%  64% 0.00     1510 turnstile_table+0xbd8  turnstile_block+0x387
   92   8%  72% 0.00     1941 turnstile_table+0x8a8  ts_tick+0x26
   77   7%  79% 0.00     2037 turnstile_table+0xe28  ts_tick+0x26
   74   7%  86% 0.00     2296 turnstile_table+0x888  ts_tick+0x26
   36   3%  89% 0.00      292 cpu[0]+0xf8            ts_tick+0x26
   27   2%  92% 0.00       55 cpu[1]+0xf8            ts_tick+0x26
   11   1%  93% 0.00       26 cpu[3]+0xf8            ts_tick+0x26
   10   1%  94% 0.00       11 cpu[2]+0xf8            post_syscall+0x556

DTrace lockstat Provider

The probes made available by the lockstat provider let you observe lock contention statistics and understand virtually any aspect of locking behavior. The lockstat(1M) command is actually a DTrace consumer that uses the lockstat provider to gather its raw data.

The lockstat provider makes available two kinds of probes: contention-event probes and hold-event probes.

  • Contention-event probes correspond to contention on a synchronization primitive; they fire when a thread is forced to wait for a resource to become available. Solaris is generally optimized for the noncontention case, so prolonged contention is not expected. Use these probes to aid your understanding of those cases in which contention does arise. Because contention is relatively rare, enabling contention-event probes generally doesn’t substantially affect performance.

  • Hold-event probes correspond to acquiring, releasing, or otherwise manipulating a synchronization primitive. These probes can answer arbitrary questions about the way synchronization primitives are manipulated. Because Solaris acquires and releases synchronization primitives very often (on the order of millions of times per second per CPU on a busy system), enabling hold-event probes has a much higher probe effect than does enabling contention-event probes. While the probe effect induced by enabling the probes can be substantial, it is not pathological, so you can enable them with confidence on production systems.

The lockstat provider makes available probes that correspond to the different synchronization primitives in Solaris; these primitives and the probes that correspond to them are discussed in Section 10.6.4.

The provider probes are as follows:

  • Adaptive lock probes. The four lockstat probes are adaptive-acquire, adaptive-block, adaptive-spin, and adaptive-release. They are shown for reference in Table 10.7. For each probe, arg0 contains a pointer to the kmutex_t structure that represents the adaptive lock.

    Adaptive locks are much more common than spin locks. The following script displays totals for both lock types to provide data to support this observation.

    lockstat:::adaptive-acquire
    /execname == "date"/
    {
                  @locks["adaptive"] = count();
    }
    
    lockstat:::spin-acquire
    /execname == "date"/
    {
                  @locks["spin"] = count();
    }
    

    If we run this script in one window and run a date(1) command in another, then when we terminate the DTrace script, we see the following output.

    # dtrace -s ./whatlock.d
    dtrace: script './whatlock.d' matched 5 probes
    ^C
    spin                                                             26
    adaptive                                                       2981
    

    As this output indicates, over 99% of the locks acquired from running the date command are adaptive locks. It may be surprising that so many locks are acquired in doing something as simple as retrieving a date. The large number of locks is a natural artifact of the fine-grained locking required of an extremely scalable system like the Solaris kernel.

  • Spin lock probes. The three probes pertaining to spin locks are spin-acquire, spin-spin, and spin-release. They are shown in Table 10.8.

  • Thread locks. Thread lock hold events are available as spin lock hold-event probes (that is, spin-acquire and spin-release), but contention events have their own probe (thread-spin) specific to thread locks. The thread lock contention-event probe is described in Table 10.9.

  • Readers/writer lock probes. The probes pertaining to readers/writer locks are rw-acquire, rw-block, rw-upgrade, rw-downgrade, and rw-release. They are shown in Table 10.10. For each probe, arg0 contains a pointer to the krwlock_t structure that represents the readers/writer lock.

DTrace Kernel Profiling

The profile provider in DTrace identifies hot functions by sampling the kernel stack activity.

# dtrace -n 'profile-997hz / arg0 != 0 / { @ks[stack()]=count() }'
dtrace: description 'profile-997hz ' matched 1 probe
^C


              genunix'syscall_mstate+0x1c7
              unix'sys_syscall32+0xbd
                1

              unix'bzero+0x3
              procfs'pr_read_lwpusage_32+0x2f
              procfs'prread+0x5d
              genunix'fop_read+0x29
              genunix'pread+0x217
              genunix'pread32+0x26
              unix'sys_syscall32+0x101
                1

              unix'kcopy+0x38
              genunix'copyin_nowatch+0x48
              genunix'copyin_args32+0x45
              genunix'syscall_entry+0xcb
              unix'sys_syscall32+0xe1
                1

              unix'sys_syscall32+0xae
                1

              unix'mutex_exit+0x19
              ufs'rdip+0x368
              ufs'ufs_read+0x1a6
              genunix'fop_read+0x29
              genunix'pread64+0x1d7
              unix'sys_syscall32+0x101
                2

              unix'kcopy+0x2c
              genunix'uiomove+0x17f
              ufs'rdip+0x382
              ufs'ufs_read+0x1a6
              genunix'fop_read+0x29
              genunix'pread64+0x1d7
              unix'sys_syscall32+0x101
               13
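
Full stack aggregations such as this can be verbose. A lighter-weight variant of the same idea (again only a sketch) keys the aggregation on the sampled function name instead, producing a ranked hot-function list comparable to the lockstat -kI output shown earlier:

# dtrace -n 'profile-997hz /arg0 != 0/ { @[func(arg0)] = count(); }'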

Interrupt Statistics: vmstat -i

Another useful measure of kernel activity is the number of received interrupts. A device may be busy processing a flood of interrupts and consuming significant CPU time. This CPU time may not appear in the usual by-process view from prstat.

The -i option of the vmstat command obtains interrupt statistics.

$ vmstat -i
interrupt         total     rate
--------------------------------
clock         272636119      100
hmec0            726271        0
audiocs               0        0
fdc0                  8        0
ecppc0                0        0
--------------------------------
Total         273362398      100

In this example, the hmec0 device received 726,271 interrupts. The rate is also printed, which for the clock interrupt is 100 hertz. This output may be handy, although the counters that vmstat currently uses are of type ulong_t, which may wrap and thus print incorrect values if a server is online for several months.

Interrupt Analysis: intrstat

The intrstat command, new in Solaris 10, uses DTrace. It measures the number of interrupts and, more importantly, the CPU time consumed servicing interrupts, by driver instance. This information is priceless and was extremely difficult to measure on previous versions of Solaris.

In the following example we ran intrstat on an UltraSPARC 5 with a 360 MHz CPU and a 100 Mbits/sec interface while heavy network traffic was received.

# intrstat 2

      device |      cpu0 %tim
-------------+---------------
       hme#0 |      2979 43.5

      device |      cpu0 %tim
-------------+---------------
       hme#0 |      2870 42.6
      uata#0 |         0  0.0
...

The hme0 instance consumed a whopping 43.5% of the CPU for the first 2-second sample. This value is huge, bearing in mind that the network stack in Solaris 10 is much faster than in previous versions. Extrapolating, it seems unlikely that this server could ever drive a gigabit Ethernet card at full speed if one were installed.

The intrstat command should become a regular tool for the analysis of both kernel driver activity and CPU consumption, especially for network drivers.
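
For reference, intrstat gathers its data with DTrace. A rough sketch of the underlying idea, assuming the sdt interrupt-start and interrupt-complete probes and the dev_info argument that intrstat itself consumes (the argument type and the devi_node_name and devi_instance fields are assumptions about the kernel implementation, not a documented interface), might look like the following.

sdt:::interrupt-start
{
        self->ts = vtimestamp;
}

sdt:::interrupt-complete
/self->ts/
{
        /* attribute the on-CPU time of this interrupt to the device instance */
        this->devi = (struct dev_info *)arg0;
        @time[stringof(this->devi->devi_node_name),
            this->devi->devi_instance] = sum(vtimestamp - self->ts);
        self->ts = 0;
}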
