Due to the disparity in speed between storage and RAM, one of the first signs of distress that a DBA will observe is directly related to disk utilization. A badly written query, an unexpected batch-loading process, a forced checkpoint, overwhelmed write caches—the number of things that can ruin disk performance is vast.
The first step in tracking down the culprit(s) is to visualize the activity. The iostat utility is fairly coarse in that it does not operate at the process level. However, it does output storage activity by device and includes columns such as reads and writes per second, the size of the request queue, and how busy the device is compared to its maximum throughput.
This allows us to see the devices that are actually slow, busy, or overworked. Furthermore, we can combine this information with other methods of analysis to find the activity's source. For now, let's explore the tool itself.
As iostat is part of the sysstat package, we should ensure that the statistics-gathering elements are enabled. Debian, Mint, and Ubuntu users should modify the /etc/default/sysstat file and make sure that the ENABLED variable resembles this line:
ENABLED="true"
Red Hat, Fedora, CentOS, and Scientific Linux users should make sure that the SADC_OPTIONS variable in /etc/sysconfig/sysstat is set to the following:
SADC_OPTIONS="-d"
Once these changes are complete, restart the sysstat service with this command as a root-level user:
sudo service sysstat restart
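On Debian-style systems, the edit described above can also be scripted. The following sketch operates on a temporary copy of the file, since changing /etc/default/sysstat itself requires root privileges; the file layout is assumed to match the single-variable format shown above:

```shell
# Flip ENABLED to "true" with sed, demonstrated against a temporary copy.
# On a real system, run the sed command against /etc/default/sysstat as root.
cfg=$(mktemp)
printf 'ENABLED="false"\n' > "$cfg"
sed -i 's/^ENABLED=.*/ENABLED="true"/' "$cfg"
result=$(cat "$cfg")
echo "$result"
rm -f "$cfg"
```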
Generate some sample iostat output by following these steps:
1. Watch all devices, updated every second, with this command:
   iostat -d 1
2. Take ten samples in megabytes per second with this command:
   iostat -dm 1 10
3. Watch extended statistics for only the sda device with this command:
   iostat -dmx sda 1
The iostat utility has a rather unique method of interpreting command-line arguments. If no recognized disks are part of the command, it simply shows information about all of them. After devices, it checks for timing statistics. To get a second-by-second status, we specify 1 second as the final argument. By providing the -d argument, we remove CPU utilization from the report.
The default output rate of iostat is in kilobytes per second. Current hardware is often so fast that these results can be too large to compare easily, so we set the -m parameter in the second command to change the output to megabytes per second. We also take advantage of the fact that the last two parameters are related to timing. The first parameter specifies the interval, and the second is the number of samples. So, the second command takes 10 samples at the rate of one per second.
The last command adds two more elements. First, we place a disk device (sda) before the timing interval. We can list as many devices as we want, and iostat will restrict its output to those devices. This is especially helpful on servers with dozens of disk devices, where it is otherwise hard to isolate potential performance issues. Then, we include the -x argument, which lists extended statistics.
Without extended statistics, the output is not very useful. For example, watching the sda device for 1 second will normally look like this:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             806.59      3147.25      4742.86       5728       8632
The last two columns only list the cumulative activity for the sampling interval. This is of limited use. However, the first three columns display the number of transactions per second (tps) and how much data was either read from or written to that device per second. Depending on the hardware we purchased, we might actually know its limits regarding these measurements, so we have a basic idea of how busy it might be.
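If we do know the rated throughput of our hardware, even this basic output can drive a quick check. Here is a rough sketch that scans captured iostat output with awk and flags any device whose combined read and write rate exceeds a threshold; the 5000 kB/s limit is purely illustrative, so substitute your own device's rating:

```shell
# Flag devices in iostat -d output whose read+write rate exceeds 5000 kB/s.
# The sample text mimics the output shown above; in practice, pipe live
# iostat output through the same awk filter.
sample='Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             806.59      3147.25      4742.86       5728       8632'
busy=$(echo "$sample" | awk 'NR > 1 && ($3 + $4) > 5000 {
    printf "%s %.2f", $1, $3 + $4
}')
echo "$busy"
```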
If we enable extended statistics with the -x argument, we gain several extra fields, including the following:
- r/s: This column lists the number of reads per second from the device. This was previously aggregated into the tps field.
- w/s: This column shows the number of writes per second to the device. This was previously aggregated into the tps field.
- avgqu-sz: This column describes the number of requests in the disk's queue. If this gets very large, the disk will have trouble keeping up with requests.
- await: This column outlines the average time a request spends waiting in the queue and being serviced, in milliseconds. An overloaded disk will often have a very high value in this column, as it is unable to keep up with requests.
- r_await: This column details the average time read requests spend waiting in the queue and being serviced, in milliseconds. This helps isolate whether or not read activity is overloading the disk.
- w_await: This column depicts the average time write requests spend waiting in the queue and being serviced, in milliseconds. This helps isolate whether or not write activity is overloading the disk.
- %util: This column represents the percentage of time the device was busy servicing I/O requests. This is actually a function of the queue size and the average time spent waiting in the queue. It is also one of the most important metrics, if not the most important. If it is at or near 100 percent for long periods of time, we need to start analyzing the sources of I/O requests and think about upgrading our storage.

Our examples of iostat always include the -d argument to show only disk information. By default, it shows both CPU and disk measurements. The CPU data looks like this:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.38    0.00   16.67   11.46    0.00   62.50
This can be useful for analysis as well, though several other tools also provide this data. If we use the -c parameter instead of -d, we will see only the CPU statistics; no information about disk devices will be included in the output.
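Tying the extended statistics back to monitoring, the %util field described above lends itself to simple scripted checks. The following is a hedged sketch that flags any device busier than 90 percent; the sample line mimics iostat -dx output, and because exact column positions vary between sysstat versions, it reads %util from the last field rather than a fixed column:

```shell
# Warn when a device's %util (the last field of iostat -dx output) exceeds
# 90 percent. The sample text and 90% threshold are illustrative.
sample='Device: r/s w/s avgqu-sz await r_await w_await %util
sda 120.00 340.00 8.40 22.10 15.00 24.60 98.70'
alert=$(echo "$sample" | awk 'NR > 1 && $NF + 0 > 90 {
    printf "%s is %.1f%% busy", $1, $NF
}')
echo "$alert"
```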