Chapter 4. Disk Behavior and Analysis

This chapter discusses the key factors used for understanding disk behavior and presents an overview of the analysis tools available.

Terms for Disk Analysis

The following terms are related to disk analysis; the list also summarizes topics covered in this section.

  • Environment. The first step in disk analysis is to know what the disks are—single disks or a storage array—and what their expected workload is: random, sequential, or otherwise.

  • Utilization. The percent busy value from iostat -x serves as a utilization value for disk devices. The calculation behind it is based on the time a device spends active. It is a useful starting point for understanding disk usage.

  • Saturation. The average wait queue length from iostat -x is a measure of disk saturation.

  • Throughput. The kilobytes/sec values from iostat -x can also indicate disk activity, and for storage arrays they may be the only meaningful metric that Solaris provides.

  • I/O rate. The number of disk transactions per second can be seen by means of iostat or DTrace. The number is interesting because each operation incurs a certain overhead. This term is also known as IOPS (I/O operations per second).

  • I/O sizes. You can calculate the size of disk transactions from iostat -x by using the (kr/s + kw/s) / (r/s + w/s) ratio, which gives average event size; or you can measure the size directly with DTrace. Throughput is usually improved when larger events are used.

  • Service times. The average wait queue and active service times can be printed from iostat -x. Longer service times are likely to degrade performance.

  • History. sar can be activated to archive historical disk activity statistics. Long-term patterns can be identified from this data, which also provides a reference for what statistics are “normal” for your disks.

  • Seek sizes. DTrace can measure the size of each disk head seek and present this data in a meaningful report.

  • I/O time. Measuring the time a disk spends servicing an I/O event is valuable because it takes into account various costs of performing an I/O operation: seek time, rotation time, and the time to transfer data. DTrace can fetch event time data.

Table 4.1 summarizes and cross-references tools used in this section.

Table 4.1. Tools for Disk Analysis

Tool

Uses

Description

Reference

iostat

Kstat

For extended disk device statistics

4.6

sar

Kstat, sadc

For disk device statistics and history data archiving

4.13

iotrace.d

DTrace

Simple script for events by device and file name

4.15.3

bites.d

DTrace

Simple script to aggregate disk event size

4.15.4

seeks.d

DTrace

Simple script to measure disk event seek size

4.15.5

files.d

DTrace

Simple script to aggregate size by file name

4.15.6

iotop

DTrace

For a disk statistics by-process summary

4.17.1

iosnoop

DTrace

For a trace of disk events, including process ID, times, block addresses, sizes, etc.

4.17.2

Random vs. Sequential I/O

We frequently use the terms random and sequential while discussing disk behavior. Random activity means the disk accesses blocks from random locations on disk, usually incurring a time penalty while the disk heads seek and the disk itself rotates. Sequential activity means the disk accesses blocks one after the other, that is, sequentially.

The following demonstrations compare random to sequential disk activity and illustrate why recognizing this behavior is important.

Demonstration of Sequential Disk Activity

While a dd command runs to request heavy sequential disk activity, we examine the output of iostat to see the effect. (The options and output of iostat are covered in detail in subsequent sections.)

# dd if=/dev/rdsk/c0d0s0 of=/dev/null bs=64k
$ iostat -xnz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.1    0.7   16.2   18.8  0.3  0.0  144.4    2.7   0   0 c0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.2   0   0 jupiter:vold(pid564)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b  device
  819.6    0.0 52453.3    0.0  0.0  1.0    0.0    1.2   1  97  c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  820.9    0.2 52535.2    1.6  0.0  1.0    0.0    1.2   1  97 c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  827.8    0.0 52981.2    0.0  0.0  1.0    0.0    1.2   1  97 c0d0
...

The disk was 97% busy, for which it delivered over 50 Mbytes/sec.

Demonstration of Random Disk Activity

Now for random activity, on the same system and the same disk. This time we use the filebench tool to generate a consistent and configurable workload.

filebench> load randomread
filebench> set $nthreads=64
filebench> run 600
 1089: 0.095: Random Read Version 1.8 05/02/17 IO personality successfully loaded
 1089: 0.096: Creating/pre-allocating files
 1089: 0.279: Waiting for preallocation threads to complete...
 1089: 0.279: Re-using file /filebench/bigfile0
 1089: 0.385: Starting 1 rand-read instances
 1090: 1.389: Starting 64 rand-thread threads
 1089: 4.399: Running for 600 seconds...

$ iostat -xnz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.0    0.7    8.6   18.8  0.3  0.0  154.2    2.8   0   0 c0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.2   0   0 jupiter:vold(pid564)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b  device
  291.6    0.2 1166.5    1.6  0.0  1.0    0.0    3.3   0  97  c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  290.0    0.0 1160.1    0.0  0.0  1.0    0.0    3.3   0  97 c0d0
...

This disk is also 97% busy, but this time it delivers around 1.2 Mbytes/sec. The random disk activity was over 40 times slower in terms of throughput. This is quite a significant difference.

Had we only been looking at disk throughput, then 1.2 Mbytes/sec may have been of no concern for a disk that can pull 50 Mbytes/sec; in reality, however, our 1.2 Mbytes/sec workload almost saturated the disk with activity. In this case, the percent busy (%b) measurement was far more useful, but for other cases (storage arrays), we may find that throughput has more meaning.

Storage Arrays

Larger environments often use storage arrays: These are usually hardware RAID along with an enormous frontend cache (256 Mbytes to 256+ Gbytes). Rather than the millisecond crawl of traditional disks, storage arrays are fast—often performing like an enormous hunk of memory. Reads and writes are served from the cache as much as possible, with the actual disks updated asynchronously.

If we are writing data to a storage array, Solaris considers it completed when the sd or ssd driver receives the completion interrupt. Storage arrays like to use writeback caching, which means the completion interrupt is sent as soon as the cache receives the data. The service time that iostat reports will be tiny because we did not measure a physical disk event. The data remains in the cache until the storage array flushes it to disk at some later time, based on algorithms such as Least Recently Used. Solaris can’t see any of this. Solaris metrics such as utilization may have little meaning; the best metric we do have is throughput—kilobytes written per second—which we can use to estimate activity.

In some situations the cache can switch to writethrough mode, such as in the event of a hardware failure (for example, the batteries die). Suddenly the statistics in Solaris change because writes now suffer a delay as the storage array waits for them to write to disk, before an I/O completion is sent. Service times increase, and utilization values such as percent busy may become more meaningful.

If we are reading data from a storage array, then at times delays occur as the data is read from disk. However, the storage array tries its best to serve reads from (its very large) cache, especially effective if prefetch is enabled and the workload is sequential. This means that usually Solaris doesn’t observe the disk delay, and again the service times are small and the percent utilizations have little meaning.

To actually understand storage array utilization, you must fetch statistics from the storage array controller itself. Of interest are cache hit ratios and array controller CPU utilization. The storage array may experience degraded performance as it performs other tasks, such as verification, volume creation, and volume reconstruction. How the storage array has been configured and its underlying volumes and other settings are also of great significance.

The one Solaris metric we can trust for storage arrays is throughput, the data read and written to it. That can be used as an indicator for activity. What happens beyond the cache and to the actual disks we do not know, although changes in average service times may give us a clue that some events are synchronous.

Sector Zoning

Sector zoning, also known as Multiple Zone Recording (MZR), is a disk layout strategy for optimal performance. A track on the outside edge of a disk can contain more sectors than one on the inside because a track on the outside edge has a greater length. Since the disk can read more sectors per rotation from the outside edge than the inside, data stored near the outside edge is faster. Manufacturers often break disks into zones of fixed sector per-track ratios, with the number of zones and ratios chosen for both performance and data density.

Data throughput on the outside edge may also be faster because many disk heads rest at the outside edge, resulting in reduced seek times for data blocks nearby.

A simple way to demonstrate the effect of sector zoning is to perform a sequential read across the entire disk. The following example shows the throughput at the start of the test (outside edge) and at the end of the test (inside edge).

# dd if=/dev/rdsk/c0t0d0s2 of=/dev/null bs=128k
$ iostat -xnz 10
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  104.0    0.0 13311.0    0.0  0.0  1.0    0.0    9.5   0  99 c0t0d0
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   71.1    0.0 9100.4    0.0  0.0  1.0    0.0   13.9   0  99 c0t0d0

Near the outside edge the speed was around 13 Mbytes/sec, while at the inside edge this has dropped to 9 Mbytes/sec. A common procedure that takes advantage of this behavior is to slice disks so that the most commonly accessed data is positioned near the outside edge.

Max I/O Size

An important characteristic when storage devices are configured is the maximum size of an I/O transaction. For sequential access, larger I/O sizes are better; for random access, I/O sizes should to be picked to match the workload. Your first step when configuring I/O sizes is to know your workload: DTrace is especially good at measuring this (see Section 4.15).

A maximum I/O transaction size can be set at a number of places:

  • maxphys. Disk driver maximum I/O size. By default this is 128 Kbytes on SPARC systems and 56 Kbytes on x86 systems. Some devices override this value if they can.

  • maxcontig. UFS maximum I/O size. Defaults to equal maxphys, it can be set during newfs(1M) and changed with tunefs(1M). UFS uses this value for read-ahead.

  • stripe width. Maximum I/O size for a logical volume (hardware RAID or software VM) configured by setting a stripe size (per-disk maximum I/O size) and choosing a number of disks. stripe width = stripe size × number of disks.

  • interlace. SVM stripe size.

Ideally, stripe width is an integer divisor of the average I/O transaction size; otherwise, there is a remainder. Remainders can reduce performance for a few reasons, including inefficient filling of cache blocks; and in the case of RAID5, remainders can compromise write performance by incurring the penalty of a read-modify-write or reconstruct-write operation.

The following is a quick demonstration to show maxphys capping I/O size on Solaris 10 x86.

# dd if=/dev/dsk/c0d0s0 of=/dev/null bs=1024k

$ iostat -xnz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    2.4    0.6   55.9   17.8  0.2  0.0   78.9    1.8   0   0 c0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.2   0   0 jupiter:vold(pid564)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  943.3    0.0 52822.6    0.0  0.0  1.3    0.0    1.4   3 100 c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  959.2    0.0 53716.1    0.0  0.0  1.3    0.0    1.4   3 100 c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  949.8    0.0 53186.3    0.0  0.0  1.2    0.0    1.3   3  96 c0d0
...

Although we requested 1024 Kbytes per transaction, the disk device delivered 56 Kbytes (52822 ÷ 943), which is the value of maxphys.

The dd command can be invoked with different I/O sizes while the raw (rdsk) device is used so that the optimal size for sequential disk access can be discovered.

iostat Utility

The iostat utility is the official place to get information about disk I/O performance, and it is a classic kstat(3kstat) consumer along with vmstat and mpstat. iostat can be run in a variety of ways.

In the following style, iostat provides single-line summaries for active devices.

$ iostat -xnz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b  device
    0.2    0.2    1.1    1.4  0.0  0.0    6.6    6.9   0   0  c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    7.7   0   0  c0t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    3.0   0   0  mars:vold(pid512)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b  device
  277.1    0.0 2216.4    0.0  0.0  0.6    0.0    2.1   0  58  c0t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b  device
   79.8    0.0  910.0    0.0  0.4  1.9    5.1   23.6  41  98  c0t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b  device
   87.0    0.0  738.5    0.0  0.8  2.0    9.4   22.4  65  99  c0t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b  device
   92.2    0.6  780.4    2.2  2.1  1.9   22.8   21.0  87  98  c0t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b  device
  101.4    0.0  826.6    0.0  0.8  1.9    8.0   19.0  46  99  c0t0d0
...

The first output is the summary since boot, followed by samples every five seconds. Some columns have been highlighted in this example. On the right is %b; this is percent busy and tells us disk utilization, [1] which we explain in the next section. In the middle is wait, the average wait queue length; it is a measure of disk saturation. On the left are kr/s and kw/s, kilobytes read and written per second, which tells us the current disk throughput.

In the iostat example, the first five-second sample shows a percent busy of 58%—fairly moderate utilization. For the following samples, we can see the average wait queue length climb to a value of 2.1, indicating that this disk was becoming saturated with requests.

The throughput in the example began at over 2 Mbytes/sec and fell to less than 1 Mbytes/sec. Throughput can indicate disk activity.

iostat provides other statistics that we discuss later. These utilization, saturation, and throughput metrics are a useful starting point for understanding disk behavior.

Disk Utilization

When considering disk utilization, keep in mind the following points:

  • Any level of disk utilization may degrade application performance because accessing disks is a slow activity—often measured in milliseconds.

  • Sometimes heavy disk utilization is the price of doing business; this is especially the case for database servers.

  • Whether a level of disk utilization actually affects an application greatly depends on how the application uses the disks and how the disk devices respond to requests. In particular, notice the following:

    • – An application may be using the disks synchronously and suffering from each delay as it occurs, or an application may be multithreaded or use asynchronous I/O to avoid stalling on each disk event.

    • – Many OS and disk mechanisms provide writeback caching so that although the disk may be busy, the application does not need to wait for writes to complete.

  • Utilization values are averages over time, and it is especially important to bear this in mind for disks. Often, applications and the OS access the disks in bursts: for example, when reading an entire file, when executing a new command, or when flushing writes. This can cause short bursts of heavy utilization, which may be difficult to identify if averaged over longer intervals.

  • Utilization alone doesn’t convey the type of disk activity—in particular, whether the activity was random or sequential.

  • An application accessing a disk sequentially may find that a heavily utilized disk often seeks heads away, causing what would have been sequential access to behave in a random manner.

  • Storage arrays may report 100% utilization when in fact they are able to accept more transactions. 100% utilization here means that Solaris believes the storage device is fully active during that interval, not that it has no further capacity to accept transactions. Solaris doesn’t see what really happens on storage array disks.

  • Disk activity is complex! It involves mechanical disk properties, buses, and caching and depends on the way applications use I/O. Condensing this information to a single utilization value verges on oversimplification. The utilization value is useful as a starting point, but it’s not absolute.

In summary, for simple disks and applications, utilization values are a meaningful measurement so we can understand disk behavior in a consistent way. However, as applications become more complex, the percent utilization requires careful consideration. This is also the case with complex disk devices, especially storage arrays, for which percent utilization may have little value.

While we may debate the accuracy of percent utilization, it still often serves its purpose as being a “useful starting point,” which is followed by other metrics when deeper analysis is desired (especially those from DTrace).

Disk Saturation

A sustained level of disk saturation usually means a performance problem. A disk at saturation is constantly busy, and new transactions are unable to preempt the currently active disk operation in the same way a thread can preempt the CPU. This means that new transactions suffer an unavoidable delay as they queue, waiting their turn.

Disk Throughput

Throughput is interesting as an indicator of activity. It is usually measured in kilobytes or megabytes per second. Sometimes it is of value when we discover that too much or too little throughput is happening on the disks for the expected application workload.

Often with storage arrays, throughput is the only statistic available from iostat that is accurate. Knowing utilization and saturation of the storage array’s individual disks is beyond what Solaris normally can see. To delve deeper into storage array activity, we must fetch statistics from the storage array controller.

iostat Reference

The iostat command can print a variety of different outputs, depending on which command-line options were used. Many of the standard options are listed below.[2]

  • -c. Print the standard system time percentages: us, sy, wt, id.

  • -d. Print classic fields: kps, tps, serv.

  • -D. “New” style output, print disk utilization with a decimal place.

  • -e. Print device error statistics.

  • -E. Print extended error statistics. Useful for quickly listing disk details.

  • -I. Print raw interval counts, rather than per second.

  • -l n. Limit number of disks printed to n. Useful when also specifying a disk.

  • -M. Print throughput in Mbytes/sec rather than Kbytes/sec.

  • -n. Use logical disk names rather than instance names.

  • -p. Print per partition statistics as well as per device.

  • -P. Print partition statics only.

  • -t. Print terminal I/O statistics.

  • -x. Extended disk statistics. This prints a line per device and provides the breakdown that includes r/s, w/s, kr/s, kw/s, wait, actv, svc_t, %w, and %b.

The default options of iostat are -cdt, which prints a summary of up to four disks on one line along with CPU and terminal I/O details. This is rarely used.[3]

Several new formatting flags crept in around Solaris 8:

  • -C. Report disk statistics by controller.

  • -m. For mounted partitions, print the mount point (useful with -p or -P).

  • -r. Display data in comma-separated format.

  • -s. Suppress state change messages.

  • -T d | u. Print timestamps in date (d) or UNIX time (u) format.

  • -z. Don’t print lines that contain all zeros.

People have their own favorite combination, in much the same way they form habits with the ls command. For small environments -xnmpz may be suitable, and for larger -xnMz. Always type iostat -E at some point to check for errors.

The man page for iostat suggests that iostat -xcnCXTdz [interval] is particularly useful for identifying problems.

Some of these options are demonstrated one by one in the next subsections. A demonstration for many of them at once is as follows.

$ iostat -xncel1 -Td c0t0d0 5
Sun Feb 19 18:01:24 2006
     cpu
 us sy wt id
  1  1  0 98
                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.3    0.2   1.9    1.4  0.0  0.0    6.3    7.0   0   0  0   0   0   0 c0t0d0s0 (/)
Sun Feb 19 18:01:29 2006
     cpu
 us sy wt id
  1 19  0 80
                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
  311.3    0.0 2490.2    0.0  0.0  0.8    0.0    2.7   0  84   0   0   0   0 c0t0d0
Sun Feb 19 18:01:34 2006
     cpu
 us sy wt id
  1 21  0 77
                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
  213.0   21.0 1704.1  105.8  1.0  1.1    4.3    4.5  19  83   0   0   0   0 c0t0d0
...

The output columns include the following:

  • wait. Average number of transactions queued and waiting

  • actv. Average number of transactions actively being serviced

  • wsvc_t. Average time a transaction spends on the wait queue

  • asvc_t. Average time a transaction is active or running

  • %w. Percent wait, based on the time that transactions were queued

  • %b. Percent busy, based on the time that the device was active

iostat Default

By default, iostat prints a summary since boot line.

$ iostat
   tty        dad1          sd1           nfs1           cpu
 tin tout kps tps serv  kps tps serv  kps tps serv   us sy wt id
   0    1   6   1   11    0   0    8    0   0    3    1  1  0 98

The output lists devices by their instance name across the top and provides details such as kilobytes per second (kps), transactions per second (tps), and average service time (serv). Also printed are standard CPU and tty[4] statistics such as percentage user (us), system (sy) and idle (id) time, and terminal in chars (tin) and out chars (tout).

We almost always want to measure what is happening now rather than some dim average since boot, so we specify an interval and an optional count.

$ iostat 5 2
   tty        dad1          sd1           nfs1           cpu
 tin tout kps tps serv  kps tps serv  kps tps serv   us sy wt id
   0    1   6   1   11    0   0    8    0   0    3    1  1  0 98
   0   39 342 253    3    0   0    0    0   0    0    4 18  0 79

Here the interval was five seconds with a count of two. The first line of output is printed immediately and is still the summary since boot. The second and last line is a five-second sample, showing that some disk activity was occurring on dad1.

iostat -D

The source code to iostat flags the default style of output as DISK_OLD. A DISK_ NEW is also defined[5] and is printed with the -D option.

$ iostat -D 5 2
    dad1          sd1           nfs1
rps wps util  rps wps util  rps wps util
  0   0  0.3    0   0  0.0    0   0  0.0
 72  32 74.9    0   0  0.0    0   0  0.0

Now we see reads per second (rps), writes per second (wps), and percent utilization (util). Notice that iostat now drops the tty and cpu summaries. We can see them if needed by using -t and -c. The reduced width of the output leaves room for more disks.

The following was run on a server with over twenty disks.

$ iostat -D 5 2
    sd0           sd1           sd6           sd30
rps wps util  rps wps util  rps wps util  rps wps util
  0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0
370  75 89.3    0   0  0.0    0   0  0.0    0   0  0.0

However, by default iostat prints only four disks, selected from the top four in an alphabetically sorted list of I/O devices.[6]

iostat -l n

Continuing the previous example, if we want to see more than four disks, we use the -l option. Here we use -l 6 so that six disks are printed.

$ iostat -Dl6 5 2
    sd0           sd1           sd6           sd30          sd31          sd32
rps wps util  rps wps util  rps wps util  rps wps util  rps wps util  rps wps util
  0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0
369   9 68.8    0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0

If we don’t like iostat’s choice of disks to monitor, we can specify them on the command line as with the following.

$ iostat -Dl6 sd30 sd31 sd32 sd33 sd34 sd35 5 2
    sd30          sd31          sd32          sd33          sd34          sd35
rps wps util  rps wps util  rps wps util  rps wps util  rps wps util  rps wps util
  0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0
  0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0    0   0  0.0

iostat -n

Often we don’t think in terms of device instance names. The -n option uses the familiar logical name for the device.

$ iostat -n 5 2
   tty       c0t0d0        c0t2d0     mars:vold(pid2     cpu
 tin tout kps tps serv  kps tps serv  kps tps serv   us sy wt id
   0    1   6   1   11    0   0    8    0   0    3    1  1  0 98
   0   39 260 168    4    0   0    0    0   0    0    6 22  0 72

iostat -x

Extended device statistics are printed with the -x option, making the output of iostat multiline.

$ iostat -x 5 2
                  extended device statistics
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
dad1         0.5    0.2    4.9    1.4  0.0  0.0   11.1   0   0
fd0          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd1          0.0    0.0    0.0    0.0  0.0  0.0    7.7   0   0
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    3.0   0   0
                  extended device statistics
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
dad1       109.6    0.0  165.8    0.0  0.0  0.6    5.6   0  61
fd0          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
sd1          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0

Now iostat is printing a line per device, which contains many of the statistics previously discussed. This includes percent busy (%b) and the average wait queue length (wait). Also included are reads and writes per second (r/s, w/s), kilobytes read and written per second (kr/s, kw/s), average active transactions (actv), average event service time (svc_t)—which includes both waiting and active times—and percent wait queue populated (%w).

The -x multiline output is much more frequently used than iostat’s original single-line output, which now seems somewhat antiquated.

iostat -p, -P

Per-partition (or “slice”) statistics can be printed with -p. iostat continues to print entire disk summaries as well, unless the -P option is used. The following demonstrates a combination of a few common options.

$ iostat -xnmPz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.5    0.2    4.8    1.4  0.0  0.0    5.2    6.7   0   0 c0t0d0s0 (/ )
    0.0    0.0    0.0    0.0  0.0  0.0    0.1   32.0   0   0 c0t0d0s1
    0.0    0.0    0.2    0.0  0.0  0.0    1.1    2.6   0   0 c0t0d0s3 (/extra1 )
    0.0    0.0    0.1    0.0  0.0  0.0    3.1    7.7   0   0 c0t0d0s4 (/extra2 )
    0.0    0.0    0.0    0.0  0.0  0.0   11.9   17.4   0   0 c0t0d0s5 (/extra3 )
    0.0    0.0    0.0    0.0  0.0  0.0   10.3   12.0   0   0 c0t0d0s6 (/extra4 )
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    3.0   0   0 mars:vold(pid512)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    9.6   88.9   69.0  187.6  3.4  1.9   34.2   19.8  61 100 c0t0d0s0 (/)
...

With the extended output (-x), a line is printed for each partition (-P), along with its logical name (-n) and mount point if available (-m). Lines with zero activity are not printed (-z). No count was given, so iostat will continue forever. In this example, only c0t0d0s0 was active.

iostat -e

Error statistics can be printed with the -e option.

$ iostat -xne
                             extended device statistics       ---- errors ---
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.5    0.2    5.0     1.4  0.0  0.0    5.0    6.6   0   0   0   0   0   0 c0t0d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 fd0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    7.7   0   0   0   0   1   1 c0t2d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    3.0   0   0   0   0   0   0 mars:vold

The errors are soft (s/w), hard (h/w), transport (trn), and a total (tot). The following are examples for each of these errors.

  • Soft disk error. A disk sector fails its CRC and needs to be reread.

  • Hard disk error. A disk sector continues to fail its CRC after being reread several times (usually 15) and cannot be read.

  • Transport error. One reported by the I/O bus.

iostat -E

All error statistics available can be printed with -E, which is also useful for discovering the existence of disks.

$ iostat -E
dad1      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST38420A         Revision: 3.05     Serial No: 7AZ04J9S
Size: 8.62GB <8622415872 bytes>
Media Error: 0 Device Not Ready: 0  No Device: 0 Recoverable: 0
Illegal Request: 0
sd1       Soft Errors: 0 Hard Errors: 0 Transport Errors: 1
Vendor:    LG    Product: CD-ROM CRD-8322B Revision: 1.05 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

This example shows a system with an 8.62 gigabyte disk (dad1, ST38420A) and a CD-ROM (sd1). Only one transport error on the CD-ROM device occurred.

Reading iostat

Previously we discussed the %b and wait fields of iostat’s extended output. Many more fields provide other insights into disk behavior.

Event Size Ratio

The extended iostat output includes per-second averages for the number of events and sizes, which are in the first four columns. To demonstrate them, we captured the following output while a find / command was also running.

$ iostat -xnz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.2    1.1    1.5  0.0  0.0    6.5    7.1   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    7.7   0   0 c0t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    3.0   0   0 mairs:vold(pid512)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  179.8    0.0  290.4    0.0  0.0  0.6    0.0    3.5   0  64 c0t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  227.0    0.0  351.8    0.0  0.0  0.8    0.0    3.7   0  83 c0t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  217.2    0.0  358.6    0.0  0.0  0.8    0.0    3.8   0  84 c0t0d0

Observe the r/s and kr/s fields when the disk was 83% busy. Let’s begin with the fact it is 83% busy and only pulling 351.8 Kbytes/sec; extrapolating from 83% to 100%, this disk would peak at a miserable 420 Kbytes/sec. Now, given that we know that this disk can be driven at over 12 Mbytes/sec, [7] running at a speed of 420 Kbytes/sec (3% of the maximum) is a sign that something is seriously amiss. In this case, it is likely to be caused by the nature of the I/O—heavy random disk activity caused by the find command (which we can prove by using DTrace).

Had we only been looking at volume (kr/s + kw/s), then a rate of 351.8 Kbytes/ sec may have incorrectly implied that this disk was fairly idle.

Another detail to notice is that there were on average 227 reads per second for that sample. There are certain overheads involved when asking a disk to perform an I/O event, so the number of IOPS (I/O operations per second) is useful to consider. Here we would add r/s and w/s.

Finally, we can take the value of kr/s and divide by r/s, to calculate the average read size: 351.8 Kbytes / 227 = 1.55 Kbytes. A similar calculation is used for the average write size. A value of 1.55 Kbytes is small but to be expected from the find command because it reads through small directory files and inodes.

Service Times

Three service times are available: wsvc_t, for the average time spent on the wait queue; asvc_t, for the average time spent active (sent to the disk device); and svc_t for wsvc_t plus asvc_t. iostat prints these in milliseconds.

The active service time is the most interesting; it is the time from when a disk device accepted the event to when it sent a completion interrupt. The source code behind iostat describes active time as “run” time. The following demonstrates small active service times caused by running dd on the raw device.

$ iostat -xnz 5
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  549.4    0.0 4394.8    0.0  0.0  1.0    0.0    1.7   0  95 c0t0d0

Next, we observe longer active service times while a find / runs.

$ iostat -xnz 5
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t   %w  %b device
   26.2   64.0  209.6  127.1  2.8  1.5   31.2   16.9   43  80 c0t0d0

From the previous discussion on event size ratios, we can see that a dd command pulling 4395 Kbytes/sec at 95% busy is using the disks in a better manner than a find / command pulling 337 Kbytes/sec (209.6 + 127.1) at 80% busy.

Now we can consider the average active service times, which have been highlighted (asvc_t). For the dd command, this was 1.7 ms, while for the find / command, it was much slower at 16.9 ms. Faster is better, so this statistic can directly describe average disk event behavior without any further calculation. It also helps to become familiar with what values are “good” or “bad” for your disks. Note:iostat(1M) does warn against believing service times for very idle disks.

Should the disk become saturated with requests, we may also see average wait queue times (wsvc_t). This indicates the average time penalty for disk events that have queued, and as such can help us understand the effects of saturation.

Lastly, disk service times are interesting from a disk perspective, but they do not necessarily equal application latency; that depends on what the file system is doing (caching, reading ahead). See Section 5.2, to continue the discussion of application latency from the FS.

iostat Internals

iostat is a consumer of kstat (the Kernel statistics facility, Chapter 11), which prints statistics for KSTAT_TYPE_IO devices. We can use the kstat(1M) command to see the data that iostat is using.

$ kstat -n dad1
module: dad                              instance: 1
name:   dad1                             class:     disk
        crtime                           1.718803613
        nread                            5172183552
        nwritten                         1427398144
        rcnt                             0
        reads                            509751
        rlastupdate                      1006817.75420951
        rlentime                         4727.596773858
        rtime                            3551.281310393
        snaptime                         1006817.75420951
        wcnt                             0
        wlastupdate                      1006817.75420951
        wlentime                         3681.523121192
        writes                           207061
        wtime                            492.453167341
$ kstat -n dad1,error
module: daderror                         instance: 1
name:   dad1,error                       class:    device_error
        No Device                        0
        Device Not Ready                 0
        Hard Errors                      0
        Illegal Request                  0
        Media Error                      0
        Model                            ST38420A        Revision
        Recoverable                      0
        Revision                         3.05
        Serial No                        7AZ04J9S        Size
        Size                             8622415872
        Soft Errors                      0
        Transport Errors                 0
        crtime                           1.718974829
        snaptime                         1006852.93847071

This shows a kstat object named dad1, which is of kstat_io_t and is well documented in sys/kstat.h. The dad1, error object is a regular kstat object.

A sample is below.

typedef struct kstat_io {
...
        hrtime_t wtime;         /* cumulative wait (pre-service) time */
        hrtime_t wlentime;      /* cumulative wait length*time product */
        hrtime_t wlastupdate;   /* last time wait queue changed */
        hrtime_t rtime;         /* cumulative run (service) time */
        hrtime_t rlentime;      /* cumulative run length*time product */
        hrtime_t rlastupdate;   /* last time run queue changed */
                                                                        See sys/kstat.h

Since kstat has already provided meaningful data, it is fairly easy for iostat to sample it, run some interval calculations, and then print it. As a demonstration of what iostat really does, the following is the code for calculating %b.

                /* % of time there is a transaction running */
                t_delta = hrtime_delta(old ? old->is_stats.rtime : 0,
                    new->is_stats.rtime);
                if (t_delta) {
                        r_pct = (double)t_delta;
                        r_pct /= hr_etime;
                        r_pct *= 100.0;
                                                                  See ...cmd/stat/iostat.c

The key statistic, is_stats.rtime, is from the kstat_io struct and is described as “cumulative run (service) time.” Since this is a cumulative counter, the old value of is_stats.rtime is subtracted from the new, to calculate the actual cumulative runtime since the last sample (t_delta). This is then divided by hr_etime—the total elapsed time since the last sample—and then multiplied by 100 to form a percentage.

This approach could be described as saying a service time of 1000 ms is available every one second. This provides a convenient known upper limit that can be used for percentage calculations. If 200 ms of service time was consumed in one second, then the disk is 20% busy. Consider using Kbytes/sec instead for our busy calculation; the upper limit would vary according to random or sequential activity, and determining it would be quite challenging.

How wait is calculated in the iostat.c source looks identical, this time with is_stats.wlentime. kstat.h describes this as “cumulative wait length × time product” and discusses when it is updated.

         * At each change of state (entry or exit from the queue),
         * we add the elapsed time (since the previous state change)
         * to the active time if the queue length was non-zero during
         * that interval; and we add the product of the elapsed time
         * times the queue length to the running length*time sum.
...
                                                                         See kstat.h

This method, known as a “Riemann sum,” allows us to calculate a proportionally accurate average wait queue length, based on the length of time at each queue length.

The comment from kstat.h also sheds light on how percent busy is calculated: At each change of disk state the elapsed time is added to the active time if there was activity. This sum of active time is the rtime used earlier.

For more information on these statistics and kstat, see Section 11.5.2.

sar -d

iostat is not the only kstat disk statistics consumer in Solaris; there is also the system activity reporter, sar. This is both a command (/usr/sbin/sar) and a background service (in the crontab for sys) that archives statistics over time and keeps them under /var/adm/sa. In Solaris 10 the service is called svc:/system/ sar:default. It can be enabled by svcadm enable sar.[8]

Gathering statistics over time can be especially valuable for identifying long-term patterns. Such statistics can also help identify what activity is “normal” for your disks and can highlight any change around the same time that performance problems were noticed. The disks may not misbehave the moment you analyze them with iostat.[9]

To demonstrate the disk statistics that sar uses, we can run it by providing an interval.

# sar -d 5

SunOS mars 5.11 snv_16 sun4u    02/21/2006

15:56:55   device        %busy   avque   r+w/s  blks/s  avwait  avserv

15:57:00   dad1             58     0.6     226    1090     0.0     2.7
           dad1,a           58     0.6     226    1090     0.0     2.7
           dad1,b            0     0.0       0       0     0.0     0.0
           dad1,c            0     0.0       0       0     0.0     0.0
           dad1,d            0     0.0       0       0     0.0     0.0
           dad1,e            0     0.0       0       0     0.0     0.0
           dad1,f            0     0.0       0       0     0.0     0.0
           dad1,g            0     0.0       0       0     0.0     0.0
           fd0               0     0.0       0       0     0.0     0.0
           nfs1              0     0.0       0       0     0.0     0.0
           sd1               0     0.0       0       0     0.0     0.0

The output of sar -d includes many fields that we have previously discussed, including percent busy (%busy), average wait queue length (avque), average wait queue time (avwait), and average service time (avserv). Since sar reads the same Kstats that iostat uses, the values reported should be the same.

sar -d also provides the total of reads + writes per second (r+w/s), and the number of 512 byte blocks per second (blk/s).[10]

The disk statistics from sar are among its most trustworthy. Be aware that sar is an old tool and that many parts of Solaris have changed since sar was written (file system caches, for example). Careful interpretation is needed to make use of the statistics that sar prints.

Some tools plot the sar output, [11] which affords a helpful way to visualize data. So long as we understand what the data really means.

Trace Normal Form (TNF) Tracing for I/O

The TNF tracing facility was added to Solaris 2.5 release. It provided various kernel debugging probes that could be enabled to measure thread activity, syscalls, paging, swapping, and I/O events. The I/O probes could answer questions that iostat and Kstat could not, such as which process was causing disk activity. The probes could measure details such as I/O size, block addresses, and event times.

TNF tracing wasn’t for the faint-hearted, and not many people learned how to interpret its terse output. A few tools based on TNF tracing were written, including the TAZ disk tool (Richard McDougall) and psio (Brendan Gregg).

For details on TNF tracing see tracing(3TNF) and tnf_kernel_probes(4).

DTrace supersedes TNF tracing, and is discussed in the next section. DTrace can measure the same events that TNF tracing did, but in an easy and programmable manner.

DTrace for I/O

DTrace was added to the Solaris 10 release; see Chapter 10 for a reference. DTrace can trace I/O events with ease by using the io provider, and tracing I/O with the io provider is often used as a demonstration of DTrace itself.

io Probes

The io provider supplies io:::start and io:::done probes, which for disk events represents the initiation and completion of physical I/O.

# dtrace -lP io
   ID   PROVIDER            MODULE                           FUNCTION NAME
   60         io           genunix                            biodone done
   61         io           genunix                            biowait wait-done
   62         io           genunix                            biowait wait-start
   71         io           genunix                     default_physio start
   72         io           genunix                      bdev_strategy start
   73         io           genunix                            aphysio start
  862         io               nfs                           nfs4_bio done
  863         io               nfs                           nfs3_bio done
  864         io               nfs                            nfs_bio done
  865         io               nfs                           nfs4_bio start
  866         io               nfs                           nfs3_bio start
  867         io               nfs                            nfs_bio start

In this example, we list the probes from the io provider. This provider also tracks NFS events, raw disk I/O events, and asynchronous disk I/O events.

The names for the io:::start and io:::done probes include the kernel function names. Disk events are likely to use the functions bdev_strategy and biodone, the same functions that TNF tracing probed. If you are writing DTrace scripts to match only one type of disk activity, then specify the function name. For example, io::bdev_strategy:start matches physical disk events.

The probes io:::wait-start and io:::wait-done trace the time when a thread blocks for I/O and begins to wait and the time when the wait has completed.

Details about each I/O event are provided by three arguments to these io probes. Their DTrace variable names and contents are as follows:

  • args[0]: struct bufinfo. Useful details from the buf struct

  • args[1]: struct devinfo. Details about the device: major and minor numbers, instance name, etc.

  • args[2]: struct fileinfo. Details about the file name, path name, file system, offset, etc.

Note that the io probes fire for all I/O requests to peripheral devices and for all file read and file write requests to an NFS server. However, requests for metadata from an NFS server, for example. readdir(3C), do not trigger io probes.

The io probes are documented in detail in Section 10.6.1.

I/O Size One-Liners

You can easily fetch I/O event details with DTrace. The following one-liner command tracks PID, process name, and I/O event size.

# dtrace -n 'io:::start { printf("%d %s %d",pid,execname,args[0]->b_bcount); }'
dtrace: description 'io:::start ' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0     72              bdev_strategy:start 418 nfsd 36864
  0     72              bdev_strategy:start 418 nfsd 36864
  0     72              bdev_strategy:start 418 nfsd 36864
  0     72              bdev_strategy:start 0 sched 512
  0     72              bdev_strategy:start 0 sched 1024
  0     72              bdev_strategy:start 418 nfsd 1536
  0     72              bdev_strategy:start 418 nfsd 1536
...

This command assumes that the correct PID is on the CPU for the start of an I/O event, which in this case is fine. When you use DTrace to trace PIDs, be sure to consider whether the process is synchronous with the event.

Tracing I/O activity as it occurs can generate many screenfuls of output. The following one-liner produces a simple summary instead, printing a report of PID, process name, and IOPS (I/O count). We match on io:genunix::start so that this script matches disk events and not NFS events.

# dtrace -n 'io:genunix::start { @[pid, execname] = count(); }'
dtrace: description 'io:genunix::start ' matched 3 probes
^C
    16585  find                                                             420
    16586  tar                                                             2812
    16584  dd                                                             22443

From the output, we can see that the dd command requested 22, 443 disk events, and find requested 420.

A More Elaborate Example

While one-liners can be handy, it is often more useful to write DTrace scripts. The following DTrace script uses the device, buffer, and file name information from the io probes.

#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
        printf("%10s %58s %2s %8s
", "DEVICE", "FILE", "RW", "Size");
}
io:::start
{
        printf("%10s %58s %2s %8d
", args[1]->dev_statname,
            args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W",
            args[0]->b_bcount);
}

When run, it provides a simple tracelike output showing the device, file name, read/write flag, and I/O size.

# ./iotrace.d

    DEVICE                                                       FILE RW      SIZE
     cmdk0                               /export/home/rmc/.sh_history  W      4096
     cmdk0                                 /opt/Acrobat4/bin/acroread  R      8192
     cmdk0                                 /opt/Acrobat4/bin/acroread  R      1024
     cmdk0                                 /var/tmp/wscon-:0.0-gLaW9a  W      3072
     cmdk0                           /opt/Acrobat4/Reader/AcroVersion  R      1024
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      8192
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      8192
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      4096
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      8192
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      8192

The way this script traces I/O events as they occur is similar to the way the Solaris snoop command traces network packets. An enhanced version of this script, called iosnoop, is discussed later in this chapter.

Since I/O events are generally “slow” (a few hundred per second, depending on activity), the CPU costs for tracing them with DTrace is minimal (often less than 0.1% CPU).

I/O Size Aggregation

The following short DTrace script makes for an incredibly useful tool; it is available in the DTraceToolkit as bitesize.d. It traces the requested I/O size by process and prints a report that uses the DTrace quantize aggregating function.

#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.
");
}
io:::start
{
        @size[pid, curpsinfo->pr_psargs] = quantize(args[0]->b_bcount);
}
dtrace:::END
{
        printf("%8s  %s
", "PID", "CMD");
        printa("%8d  %s
%@d
", @size);
}

The script was run while a find / command executed.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
     PID  CMD
   14818  find /

           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    2009
            2048 |                                         0
            4096 |                                         0
            8192 |@@@                                      180
           16384 |                                         0

The find command churned thorough directory files and inodes on disk, causing many small disk events. The distribution plot that DTrace has printed nicely conveys the disk behavior that find caused and is read as follows: 2009 disk events were between 1024 and 2047 bytes in size, and 180 disk events were between 8 Kbytes and 15.9 Kbytes. In summary, we measured find causing a storm of small disk events.

Such a large number of small events usually indicates random disk activity—a characteristic that DTrace can also accurately measure.

Finding the size of disk events alone can be quite valuable. To demonstrate this further, we ran the same script for a different workload. This time we used a tar command to archive the disk.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
8122  tar cf /dev/null /

           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@                   226
            2048 |@@                                       19
            4096 |@@                                       23
            8192 |@@@@@@@                                  71
           16384 |                                         3
           32768 |                                         1
           65536 |@                                        8
          131072 |@@@@@                                    52
          262144 |                                         0

While tar must work through many of the same directory files as find, it now also reads through file contents. There are now many events in the 128 to 255 Kbytes bucket because tar has encountered some large files.

And finally, we ran the script with a deliberately large sequential workload—add command with specific options.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
     PID  CMD
    8112  dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=128k

           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 246
          262144 |                                         0

We used the dd command to read 128-Kbyte blocks from the raw device, and that’s exactly what happened.

It is interesting to compare raw device behavior with that of the block device. In the following demonstration, we changed the rdsk to dsk and ran dd on a slice that contained a freshly mounted file system.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
    8169  dd if=/dev/dsk/c0t0d0s3 of=/dev/null bs=128k

           value  ------------- Distribution -------------  count
           32768 |                                          0
           65536 |                                          1
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1027
          262144 |                                          0

No difference there, except that when the end of the slice was reached, a smaller I/O event was issued.

This demonstration becomes interesting after the dd command has been run several times on the same slice. The distribution plot then looks like this.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
    8176 dd if=/dev/dsk/c0t0d0s3 of=/dev/null bs=128k

           value  ------------- Distribution ------------- count
            4096 |                                          0
            8192 |@@@@@@@@@@@@@                             400
           16384 |@@@                                       83
           32768 |@                                         29
           65536 |@@                                        46
          131072 |@@@@@@@@@@@@@@@@@@@@@@                    667
          262144 |                                          0

The distribution plot has become quite different, with fewer 128-Kbyte events and many 8-Kbyte events. What is happening is that the block device is reclaiming pages from the page cache and is at times going to disk only to fill in the gaps.

We next used a different DTrace one-liner to examine this further, summing the bytes read by two different invocations of dd: the first (PID 8186) on the dsk device and the second (PID 8187) on the rdsk device.

# dtrace -n 'io:::start { @[pid, args[1]->dev_statname] = sum(args[0]->b_bcount); }'
dtrace: description 'io:::start ' matched 6 probes
^C
     8186  dad1                                                       89710592
     8187  dad1                                                      134874112

The rdsk version read the full slice, 134, 874, 112 bytes. The dsk version read 89, 710, 592 bytes, 66.5%.

I/O Seek Aggregation

The following script can help identify random or sequential activity by measuring the seek distance for disk events and generating a distribution plot. The script is available in the DTraceToolkit as seeksize.d.

#!/usr/sbin/dtrace -s
#pragma D option quiet

self int last[dev_t];

dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.
");
}
io:genunix::start
/self->last[args[0]->b_edev] != 0/
{
        this->last = self->last[args[0]->b_edev];
        this->dist = (int)(args[0]->b_blkno - this->last) > 0 ?
            args[0]->b_blkno - this->last : this->last - args[0]->b_blkno;
        @size[args[1]->dev_statname] = quantize(this->dist);
}
io:genunix::start
{
        self->last[args[0]->b_edev] = args[0]->b_blkno +
            args[0]->b_bcount / 512;
}

Since the buffer struct is available to the io probes, we can examine the block address for each I/O event, provided as args[0]->b_blkno. This address is relative to the slice, so we must be careful to compare addresses only when the events are on the same slice, achieved in the script by matching on args[0]->b_edev.

We are assuming that we can trust the block address and that the disk device did not map it to something strange (or if it did, it was mapped proportionally). We are also assuming that the disk device isn’t using a frontend cache to initially avoid seeks altogether, as with storage arrays.

The following example uses this script to examine random activity that was generated with filebench.

# ./seeks.d
Tracing... Hit Ctrl-C to end.
^C

  cmdk0

           value  ------------- Distribution -------------  count
              -1 |                                          0
               0 |@@@@                                      174
               1 |                                          0
               2 |                                          0
               4 |                                          0
               8 |                                          2
              16 |@@                                        104
              32 |@@@                                       156
              64 |@@                                        98
             128 |@                                         36
             256 |@                                         39
             512 |@@                                        70
            1024 |@@                                        71
            2048 |@@                                        71
            4096 |@                                         55
            8192 |@                                         43
           16384 |@                                         63
           32768 |@@                                        91
           65536 |@@@                                       135
          131072 |@@@                                       159
          262144 |@@                                        107
          524288 |@@@@                                      183
         1048576 |@@@@                                      174
         2097152 |                                          0

And the following is for sequential activity from filebench.

# ./seeks.d
Tracing... Hit Ctrl-C to end.
^C

  cmdk0

           value  ------------- Distribution -------------  count
              -1 |                                          0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   27248
               1 |                                          0
               2 |                                          0
               4 |                                          0
               8 |                                          12
              16 |                                          141
              32 |                                          218
              64 |                                          118
             128 |                                          7
             256 |                                          81
             512 |                                          0

The difference is dramatic. For the sequential test most of the events incurred a zero length seek, whereas with the random test, the seeks were distributed up to the 1, 048, 576 to 2, 097, 151 bucket. The units are called disk blocks (not file system blocks), which are disk sectors (512 bytes).

I/O File Names

Sometimes knowing the file name that was accessed is of value. This is another detail that DTrace makes easily available through args[2]->fi_pathname, as demonstrated by the following script.

#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.
");
}
io:::start
{
        @files[pid, execname, args[2]->fi_pathname] = sum(args[0]->b_bcount);
}
dtrace:::END
{
        normalize(@files, 1024);
        printf("%6s %-12s %6s %s
", "PID", "CMD", "KB", "FILE");
        printa("%6d %-12.12s %@6d %s
", @files);
}

Running this script with several files of a known size on a newly mounted file system produces the following.

# ./files.d
Tracing... Hit Ctrl-C to end.
^C
   PID CMD              KB FILE
  5797 bash              1 /extra1
  8376 grep              8 /extra1/lost+found
  8376 grep             10 /extra1/testfile_size10k
  8376 grep             20 /extra1/testfile_size20k
  8376 grep             30 /extra1/testfile_size30k
  8376 grep             64 <none>
  8376 grep          10240 /extra1/testfile_size10m
  8376 grep          20480 /extra1/testfile_size20m
  8376 grep          30720 /extra1/testfile_size30m

Not only can we see that the sizes match the files (see the file names), we can also see that the bash shell has read one kilobyte from the /extra1 directory—no doubt reading the directory contents. The “<none>” file name occurs when file system blocks not related to a file are accessed.

Disk I/O Time

DTrace makes many I/O details available to us so that we can understand disk behavior. The previous examples measured I/O counts, I/O size, or seek distance, by disk, process, or file name. One measurement we haven’t discussed yet is disk response time.

The time consumed responding to a disk event takes into account seek time, rotation time, transfer time, controller time, and bus time, and as such is an excellent metric for disk utilization. It also has a known maximum: 1000 ms per second per disk. The trick is being able to measure it accurately.

We are already familiar with one disk time measurement: iostat’s percent busy (%b), which measures disk active time.

Measuring disk I/O time properly for storage arrays has become a complex topic, one that depends on the vendor and the storage array model. To cover each of them is beyond what we have room for here. Some of the following concepts may still apply for storage arrays, but many will need careful consideration.

Simple Disk Event

The time the disk spends satisfying a disk request is often called the service time or the active service time. Ideally, we would be able to read event timestamps from the disk controller itself so that we knew exactly when the heads were seeking, when the sectors were read, and so on. Instead, we have the bdev_strategy and biodone events from the driver presented to DTrace as io:::start and io:::done.

By measuring the time from the strategy (bdev_strategy) to the biodone, we have the driver’s view of response time; it’s the closest measurement available for the actual disk response time. In reality it includes a little extra time to arbitrate and send the request over the I/O bus, which in comparison to the disk time (which is usually measured in milliseconds) often is negligible. This is illustrated in Figure 4.1 for a simple disk event.

Visualizing a Single Disk Event

Figure 4.1. Visualizing a Single Disk Event

The algorithm to measure disk response time is then

time(disk response) = time(biodone) – time(strategy)

We could estimate the total I/O time for a process as a sum of all its disk response times; however, it’s not that simple. Modern disks allow multiple events to be sent to the disk, where they are queued. These events can be reordered by the disk so that events can be completed with a minimal sweep of the heads. The following example illustrates the multiple event problem.

Concurrent Disk Events

Let’s consider that five concurrent disk requests are sent at time = 0 and that they complete at times = 10, 20, 30, 40, and 50 ms, as is represented in Figure 4.2.

Measuring Concurrent Disk Event Times

Figure 4.2. Measuring Concurrent Disk Event Times

The disk is busy processing these events from time = 0 to 50 ms and so is busy for 50 ms. The previous algorithm gives disk response times of 10, 20, 30, 40, and 50 ms. The total would then be 150 ms, implying that the disk has delivered 150 ms of disk response time in only 50 ms. The problem is that we are overcounting response times; just adding them together assumes that the disk processes events one by one, which is not always the case.

Later in this section we measure actual concurrent disk events by using DTrace and then plot it (see Section 4.17.4), which shows that this scenario does indeed occur.

To improve the algorithm for measuring concurrent events, we could treat the end time of the previous disk event as the start time. Time would then be measured from one biodone to the next. That would work nicely for the previous illustration. It doesn’t work if disk events are sparse, such that the previous disk event was followed by a period of idle time. We would need to keep track of when the disk was idle to eliminate that problem.

More scenarios exist, too many to list here, that increase the complexity of our algorithm. To cut to the chase, we end up considering the following adaptive disk I/O time algorithm to be suitable for most situations.

Adaptive Disk I/O Time Algorithm

To cover simple, concurrent, sparse, and other types of events, we need to be a bit creative:

time(disk response) = MIN( time(biodone) – time(previous biodone, same dev), time(biodone) – time(previous idle -> strategy event, same dev) )

We achieve the tracking of idle -> strategy events by counting pending events and matching on a strategy event when pending == 0. Both previous times above refer to previous times on the same disk device. This covers all scenarios, and is the algorithm currently used by the DTrace tools in the next section.

In Figure 4.3, both concurrent and post-idle events are measured correctly.

Best Disk Response Times

Figure 4.3. Best Disk Response Times

There are some bizarre scenarios for which it could be argued that this algorithm is not perfect and that it is only an approximation. If we keep throwing scenarios at our disk algorithm and are fantastically lucky, we’ll end up with an elegant algorithm to cover everything in an obvious way. However, there is a greater chance that we’ll end up with an overly complex beastlike monstrosity and several contrived scenarios that still don’t fit.

So we consider this algorithm presented as sufficient, as long as we remember that at times it may only be a close approximation.

Other Response Times

Thread-response time is the response time that the requesting thread experiences. This can be measured from the moment that a read/write system call blocks to its completion, assuming the request made it to disk and wasn’t cached. This time includes other factors such as the time spent waiting on the run queue to be rescheduled and the time spent checking the page cache if used.

Application-response time is the time for the application to respond to a client event, often transaction oriented. Such a response time helps us understand why an application may respond slowly.

Time by Layer

The relationship between the response times is summarized in Figure 4.4, which depicts a typical sequence of events. This figure highlights both the different layers from which to consider response time and the terminology.

Relationship among Response Times

Figure 4.4. Relationship among Response Times

The sequence of events in Figure 4.4 is accurate for raw devices but is less meaningful for block devices. Reads on block devices often trigger read-ahead, which at times drives the disks asynchronously to the application reads; and writes often return from the cache and are later flushed to disk.

To understand the performance effect of response times purely from an application perspective, focus on thread and application response times and treat the disk I/O system as a black box. This leaves application latency as the most useful measurement, as discussed in Section 5.3.

DTraceToolkit Commands

The DTraceToolkit is a free collection of DTrace-based tools, some of which analyze disk behavior. We previously demonstrated cut-down versions of two of its scripts, bitesize.d and seeksize.d. Two of the most popular are iotop and iosnoop.

iotop Script

iotop uses DTrace to print disk I/O summaries by process, for details such as size (bytes) and disk I/O times. The following demonstrates the default output of iotop, which prints size summaries and refreshes the screen every five seconds.

# ./iotop
2006 Feb 13 13:38:21,  load: 0.35,  disk_r:  56615 Kb,  disk_w:    637 Kb

  UID    PID   PPID CMD              DEVICE  MAJ MIN D            BYTES
    0  27732  27703 find             cmdk0   102   0 R            38912
    0      0      0 sched            cmdk5   102 320 W           150016
    0      0      0 sched            cmdk2   102 128 W           167424
    0      0      0 sched            cmdk3   102 192 W           167424
    0      0      0 sched            cmdk4   102 256 W           167424
    0  27733  27703 bart             cmdk0   102   0 R         57897984
...

In the above output, the bart process read approximately 57 Mbytes from disk. Disk I/O time summaries can also be printed with -o, which uses the adaptive disk-response-time algorithm previously discussed. Here we demonstrate this with an interval of ten seconds.

# ./iotop -o 10
2006 Feb 13 13:39:19,  load: 0.38,  disk_r:  74885 Kb,  disk_w:   1345 Kb

  UID    PID   PPID CMD              DEVICE  MAJ MIN D         DISKTIME
    1    418      1 nfsd             cmdk3   102 192 W              362
    1    418      1 nfsd             cmdk4   102 256 W              382
    1    418      1 nfsd             cmdk5   102 320 W              460
    1    418      1 nfsd             cmdk2   102 128 W              534
    0      0      0 sched            cmdk5   102 320 W            20643
    0      0      0 sched            cmdk3   102 192 W            25500
    0      0      0 sched            cmdk4   102 256 W            31024
    0      0      0 sched            cmdk2   102 128 W            35166
    0  27732  27703 find             cmdk0   102   0 R           722951
    0  27733  27703 bart             cmdk0   102   0 R          8858818

Note that iotop prints totals, not per second values. In this example, we read 74, 885 Mbytes from disk during those ten seconds (disk_r), with the top process bart (PID 27733) consuming 8.8 seconds of disk time. For this ten-second interval, 8.8 seconds equates to a utilization value of 88%.

iotop can print %I/O utilization with the -P option; this percentage is based on 1000 ms of disk response time per second. The -C option can also be used to prevent the screen from being cleared and to instead provide a rolling output.

# ./iotop -CP 1
...
2006 Feb 13 13:40:34,  load: 0.36,  disk_r:   2350 Kb,  disk_w:   1026 Kb

  UID    PID   PPID CMD              DEVICE  MAJ MIN D   %I/O
    0      0      0 sched            cmdk0   102   0 R      0
    0      3      0 fsflush          cmdk0   102   0 W      1
    0  27743  27742 dtrace           cmdk0   102   0 R      2
    0      3      0 fsflush          cmdk0   102   0 R      8
    0      0      0 sched            cmdk0   102   0 W     14
    0  27732  27703 find             cmdk0   102   0 R     19
    0  27733  27703 bart             cmdk0   102   0 R     42
...

Figure 4.5 plots %I/O as find and bart read through /usr. This time bart causes heavier %I/O because there are bigger files to read and fewer directories for find to traverse.

find and bart Read through /usr

Figure 4.5. find and bart Read through /usr

Other options for iotop can be listed with -h (this is version 0.75):

# ./iotop -h
USAGE: iotop [-C] [-D|-o|-P] [-j|-Z] [-d device] [-f filename]
             [-m mount_point] [-t top] [interval [count]]

                -C      # don't clear the screen
                -D      # print delta times, elapsed, us
                -j      # print project ID
                -o      # print disk delta times, us
                -P      # print %I/O (disk delta times)
                -Z      # print zone ID
                -d device       # instance name to snoop
                -f filename     # snoop this file only
                -m mount_point  # this FS only
                -t top          # print top number only
   eg,
        iotop         # default output, 5 second samples
        iotop 1       # 1 second samples
        iotop -P      # print %I/O (time based)
        iotop -m /    # snoop events on filesystem / only
        iotop -t 20   # print top 20 lines only
        iotop -C 5 12 # print 12 x 5 second samples

These options including printing Zone and Project details.

iosnoop Script

iosnoop uses DTrace to monitor disk events in real time. The default output prints details such as PID, block address, and size. In the following example, a grep process reads several files from the /etc/default directory.

# ./iosnoop
  UID   PID D    BLOCK   SIZE       COMM PATHNAME
    0  1570 R   172636   2048       grep /etc/default/autofs
    0  1570 R   102578   1024       grep /etc/default/cron
    0  1570 R   102580   1024       grep /etc/default/devfsadm
    0  1570 R   108310   4096       grep /etc/default/dhcpagent
    0  1570 R   102582   1024       grep /etc/default/fs
    0  1570 R   169070   1024       grep /etc/default/ftp
    0  1570 R   108322   2048       grep /etc/default/inetinit
    0  1570 R   108318   1024       grep /etc/default/ipsec
    0  1570 R   102584   2048       grep /etc/default/kbd
    0  1570 R   102588   1024       grep /etc/default/keyserv
    0  1570 R   973440   8192       grep /etc/default/lu
...

The output is printed as the disk events complete.

To see a list of available options for iosnoop, use the -h option. The options include -o to print disk I/O time, using the adaptive disk-response-time algorithm previously discussed. The following is from iosnoop version 1.55.

# ./iosnoop -h
USAGE: iosnoop [-a|-A|-DeghiNostv] [-d device] [-f filename]
               [-m mount_point] [-n name] [-p PID]
       iosnoop          # default output
                -a      # print all data (mostly)
                -A      # dump all data, space delimited
                -D      # print time delta, us (elapsed)
                -e      # print device name
                -g      # print command arguments
                -i      # print device instance
                -N      # print major and minor numbers
                -o      # print disk delta time, us
                -s      # print start time, us
                -t      # print completion time, us
                -v      # print completion time, string
                -d device       # instance name to snoop
                -f filename     # snoop this file only
                -m mount_point  # this FS only
                -n name         # this process name only
                -p PID          # this PID only
   eg,
        iosnoop -v      # human readable timestamps
        iosnoop -N      # print major and minor numbers
        iosnoop -m /    # snoop events on filesystem / only

The block addresses printed are relative to the disk slice, so what may appear to be similar block addresses may in fact be on different slices or disks. The -N option can help ensure that we are examining the same slice since it prints major and minor numbers on which we can be match.

Plotting Disk Activity

Using the -t option for iosnoop prints the disk completion time in microseconds. In combination with -N, we can use this data to plot disk events for a process on one slice. Here we fetch the data for the find command, which contains the time (microseconds since boot) and block address. These are our X and Y coordinates. We check that we remain on the same slice (major and minor numbers) and then generate an X/Y plot.

# ./iosnoop -tN
TIME           MAJ MIN   UID   PID D    BLOCK   SIZE       COMM PATHNAME
1175384556358  102   0     0 27703 W  3932432   4096        ksh /root/.sh_history
1175384556572  102   0     0 27703 W     3826    512        ksh <none>
1175384565841  102   0     0 27849 R   198700   1024       find /usr/dt
1175384578103  102   0     0 27849 R   770288   3072       find /usr/dt/bin
1175384582354  102   0     0 27849 R   690320   8192       find <none>
1175384582817  102   0     0 27849 R   690336   8192       find <none>
1175384586787  102   0     0 27849 R   777984   2048       find /usr/dt/lib
1175384594313  102   0     0 27849 R   733880   1024       find /usr/dt/lib/amd64
...

We ran a find / command to generate random disk activity; the results are shown in Figure 4.6. As the disk heads seek to different block addresses, the position of the heads is plotted in red.

Plotting Disk Activity, a Random I/O Example

Figure 4.6. Plotting Disk Activity, a Random I/O Example

Are we really looking at disk head seek patterns? Not exactly. What we are looking at are block addresses for biodone functions from the block I/O driver. We aren’t using some X-ray vision to look at the heads themselves.

Now, if this is a simple disk device, then the block address probably relates to the disk head location.[12] But if this is a virtual device, say, a storage array, then block addresses could map to anything, depending on the storage layout. However, we could at least say that a large jump in block address probably means a seek at some point (although storage arrays will cache).

The block addresses do help us visualize the pattern of completed disk activity. But remember that “completed” means the block I/O driver thinks that the I/O event completed.

Plotting Concurrent Activity

Previously, we discussed concurrent disk activity and included a plot (Figure 4.2) to help us understand how these events may occur. Since DTrace can easily trace concurrent disk activity, we can include a plot of actual activity. The following DTrace script provides input for a spreadsheet. We match on a device by its major and minor numbers, then print timestamps as the first column and block addresses for strategy and biodone events in the remaining columns.

#!/usr/sbin/dtrace -s
#pragma D option quiet

io:genunix::start
/args[1]->dev_major == 102 && args[1]->dev_minor == 0/
{
        printf("%d,%d,
", timestamp/1000, args[0]->b_blkno);
}
io:genunix::done
/args[1]->dev_major == 102 && args[1]->dev_minor == 0/
{
        printf("%d,,%d
", timestamp/1000, args[0]->b_blkno);
}

The output of the DTrace script was plotted as Figure 4.7, with timestamps as X-coordinates.

Plotting Raw Driver Events: Strategy and Biodone

Figure 4.7. Plotting Raw Driver Events: Strategy and Biodone

Initially, we see many quick strategies between 0 and 200 µs, ending in almost a vertical line. This is then followed by slower biodones as the disk catches up at mechanical speeds.

DTraceTazTool

TazTool[13] was a GUI disk-analysis tool that used TNF tracing to monitor disk events. It was most notable for its unique disk-activity visualization, which made identifying disk access patterns trivial. This visualization included long-term patterns that would normally be difficult to identify from screenfuls of text.

This visualization technique is returning with the development of a DTrace version of taztool: DTraceTazTool. A screenshot of this tool is shown in Figure 4.8.

DTraceTazTool

Figure 4.8. DTraceTazTool

The first section of the plot measures a ufsdump of a file system, and the second measures a tar archive of the same file system, both times freshly mounted. We can see that the ufsdump command caused heavier sequential access (represented by dark stripes in the top graph and smaller seeks in the bottom graph) than did the tar command.

It is interesting to note that when the ufsdump command begins, disk activity can be seen to span the entire slice—ufsdump doing its passes.



[1] iostat -D prints the same statistic and calls it “util” or “percentage disk utilization.”

[2] Many of these were actually added in Solaris 2.6. The Solaris 2.5 Synopsis for iostat was /

usr/bin/iostat [ -cdDItx ] [ -l n ] [ disk . . . ] [ interval [ count ] ]

[3] If you would like to cling to the original single-line summaries of iostat, try iostat -cnDl99 1. Make your screen wide if you have many disks. Add a -P for some real entertainment.

[4] A throwback to when ttys were real teletypes, and service times were real service times.

[5] DISK_NEW” for iostat means sometime before Solaris 2.5.

[6] See cmd/stat/common/acquire.c: insert_into() scans a list of I/O devices, calling iodev_cmp() to decide placement. iodev_cmp() initially groups in the following order: controllers, disks/partitions, tapes, NFS, I/O paths, unknown. strcmp() is then used for alphabetical sorting.

[7] We know this from watching iostat while a simple dd test runs: dd if =/dev/rdsk/c0t0d0s0 of =/dev/null bs =128K.

[8] Pending bug 6302763.

[9] Some people do automate iostat to run at regular intervals and log the output. Having this sort of comparative data on hand during a crisis can be invaluable.

[10] It’s possible that sar was written before the kilobytes unit was conventional.

[11] Solaris 10 does ship with StarOffice 7, which can plot interactively.

[12] Even “simple” disks these days map addresses in firmware to an internal optimized layout; all we know is the image of the disk that its firmware presents. The classic example here is sector zoning, as discussed in Section 4.4.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.125.51