Chapter 3. Troubleshooting Disk Subsystem Issues

SQL Server is a very I/O intensive application: it is constantly reading data from and writing data to disk. Good I/O throughput is essential for SQL Server performance and health. Unfortunately, many SQL Server installations are I/O bound, even with modern flash-based storage.

In this chapter, I will show you how to analyze and troubleshoot disk subsystem performance issues. You will learn how SQL Server processes I/O requests internally and how to identify possible bottlenecks throughout the entire I/O stack, at the SQL Server, OS, virtualization, and storage levels.

Next, I will talk about checkpoint process tuning, a common source of I/O bottlenecks in busy OLTP systems.

Finally, I will cover the most common I/O-related waits you may encounter in your system.

Anatomy of the SQL Server I/O Subsystem

SQL Server never works with data pages directly in database files. Every time a data page needs to be read or modified, SQL Server reads that page into memory and caches it in the buffer pool. Each page in the buffer pool is referenced by a buffer structure, sometimes simply called a buffer. It includes the page’s address in the data file, a pointer to the data page in memory, status information, and the page latching queue.

SQL Server uses latches to protect internal objects in memory, preventing their corruption when multiple threads modify them simultaneously. The two most common types of latch are exclusive, which blocks any access to the object, and shared, which allows simultaneous reads but prevents modifications of the object.

Conceptually, latches are similar to critical sections or mutexes in application development languages. We will talk about latches in detail in Chapter 10.

The location of data pages in the buffer pool does not represent the order in which they are stored in the database files. SQL Server, however, can efficiently locate a page in the buffer pool when needed. Every time SQL Server accesses a page there, it performs a logical read. When the page is not present in memory and needs to be read from disk, a physical read also occurs.
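You can observe the difference between logical and physical reads for an individual query with the SET STATISTICS IO option. A minimal sketch, assuming a hypothetical dbo.Orders table:

```sql
-- dbo.Orders is a hypothetical table; substitute any table in your database
SET STATISTICS IO ON;

SELECT COUNT(*) FROM dbo.Orders;

SET STATISTICS IO OFF;
```

The Messages tab will report logical reads, physical reads, and read-ahead reads for each table the query touches. Running the query twice usually shows physical reads dropping to zero on the second run, once the pages are cached in the buffer pool.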

When data needs to be modified, SQL Server changes the pages in the buffer pool, marking them as dirty, then writes log records to the transaction log file. It saves dirty pages to the data files asynchronously in the Checkpoint and, sometimes, the Lazy Writer processes. We’ll discuss both of those processes later in this chapter and transaction logs in Chapter 11. For now, remember that data modifications require SQL Server to read data pages from disk if they have not been already cached.

Now let’s look at how SQL Server works with I/O in more detail.

Scheduling and I/O

As you remember from chapter 2, SQL Server uses cooperative scheduling, with multiple workers running on CPUs in a rotating fashion. The workers voluntarily yield when their short quantum expires, allowing other workers to proceed. This model requires SQL Server to use asynchronous I/O as much as possible – a worker cannot simply block and wait until an I/O request completes, since that would prevent other workers from executing.

By default, all SQL Server schedulers handle I/O requests. You can override this behavior and bind I/O to specific CPUs by setting the affinity I/O mask. In theory, this may help improve I/O throughput in very busy OLTP systems; however, I rarely find it necessary. In most cases, you’ll achieve better results by performing optimizations and reducing CPU and I/O load.

You can read about affinity I/O masking in the Microsoft documentation.
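For illustration, here is a sketch of enabling the affinity I/O mask through sp_configure. The bitmask value is an example only – the mask shown binds I/O processing to CPUs 0 and 1. Note that changing this option requires a SQL Server restart to take effect.

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Bits 0 and 1 set: bind I/O to CPUs 0 and 1 (example value only)
EXEC sp_configure 'affinity I/O mask', 3;
RECONFIGURE;
```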

Every scheduler has a dedicated I/O queue. When a worker needs to perform an I/O operation, it creates an I/O request structure, puts it in the scheduler’s queue, and issues an asynchronous OS API I/O call. It does not wait until the request is completed; it either continues to run, doing other things, or suspends itself, moving to the SUSPENDED queue.

When a new worker starts to run on the scheduler (switching to the RUNNING state), it goes through the scheduler’s I/O queue. The I/O request structures contain enough information to check if the asynchronous OS API call has been completed, along with a pointer to a callback function that the worker calls to complete the I/O request in SQL Server.

I know that this sounds complicated – please bear with me and we’ll look at the details in the next section. The key things I’d like you to remember are:

  • All active schedulers handle I/O requests by default.
  • Most I/O requests in SQL Server use asynchronous OS API calls. This is true even for write-ahead logging – the worker that issues the COMMIT statement may be suspended until the log record is written to disk; however, the OS API write command will be executed asynchronously.
  • An I/O request may be completed by a different worker than the one that issued it.
  • You can see a list of pending I/O requests in the sys.dm_io_pending_io_requests view. The io_pending_ms_ticks column provides the duration of the request. The io_pending column indicates whether the OS API call has completed and the request is waiting for a worker to finish it. This may help you determine if request latency is being affected by CPU load in the system.
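A simple sketch of querying that view, with the longest-running requests first:

```sql
-- Pending I/O requests, longest-running first. io_pending = 0 means
-- the OS API call has completed and the request is waiting for a
-- worker to finalize it in SQL Server.
SELECT
	io_type
	,io_pending
	,io_pending_ms_ticks
	,scheduler_address
	,io_handle
FROM sys.dm_io_pending_io_requests
ORDER BY io_pending_ms_ticks DESC;
```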

Now, as promised, let’s look at that process again, with more concrete examples of reading data pages from disk.

Data Reads

When SQL Server needs to access a data page, it checks if the page already exists in the buffer pool. If it does not, the worker allocates a buffer for the page, protecting it with an exclusive latch. This prevents other workers from accessing the page until it is read – they will be blocked, waiting for the latch to be released.

Next, the worker creates the I/O request structure, puts it in the scheduler’s I/O queue, and initiates an OS API read request. Then it tries to acquire a shared latch on the buffer, which is blocked by the incompatible exclusive latch held there. The worker then suspends itself with a PAGEIOLATCH wait (Figure 3-1 illustrates that state).

Figure 3-1. Reading data page from disk – initiating read

When another worker switches to the RUNNING state, it checks to see if any I/O requests in the scheduler’s queue have been completed. If so, the worker calls the callback function to finalize the operation: this validates that the page is not corrupted and removes the exclusive latch from the buffer. The worker that submitted the I/O request can then resume and access the data page (Figure 3-2).

Figure 3-2. Reading data page from disk – completing the read

There are several errors that may occur during I/O requests. All of them are severe, and you need to set up alerts in the system for them.

  • Error 823 indicates that the OS I/O API call was not successful. This is often a sign of hardware issues.
  • Errors 605 and 824 indicate logical consistency issues with the data pages. When you encounter either of these errors, immediately check whether the database is corrupted, using the DBCC CHECKDB command. You may also encounter these errors with faulty I/O drivers, which can corrupt data pages during transfer.
  • Error 833 tells you that an I/O request (OS API call) took longer than 15 seconds to return. This is abnormal; check the health of the disk subsystem when you see this error.
  • Error 825 indicates that an I/O request failed and had to be retried in order to succeed. As with Error 833, check the health of the disk subsystem.
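The alerts mentioned above can be created with SQL Server Agent. Here is a sketch for error 825; repeat it for the other errors. The alert name is arbitrary, and operator notification setup is omitted:

```sql
EXEC msdb.dbo.sp_add_alert
	@name = N'Error 825: I/O Retry Required'
	,@message_id = 825
	,@severity = 0  -- must be 0 when @message_id is specified
	,@include_event_description_in = 1;
```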

When troubleshooting those errors, you can look for the details in your SQL Server error log (use the code from Listing 1-4) and system event log.

It is very common for SQL Server to read multiple data pages in a single I/O request. For example, it uses read-ahead logic, reading multiple data pages during scans. As a result, a query may perform thousands of logical reads with just a handful of physical reads. Another example is ramp-up reads, in which SQL Server reads a large number of pages on each I/O request, trying to fill the buffer pool quickly on startup.

Data Writes

SQL Server handles data writes very similarly to data reads. In most cases, those writes are done asynchronously using a scheduler’s I/O queues, as you just saw in the previous examples. Obviously, the callback function will be implemented differently in different I/O operations.

When you change some data in the database, SQL Server modifies data pages in the buffer pool, reading pages from disk if needed. It generates log records for the modifications and saves them to the transaction log. The transaction is not considered to have been committed until the log records are hardened on disk. While, technically, you can treat write-ahead logging as synchronous writes, SQL Server uses an asynchronous I/O pattern for log writes.

SQL Server writes modified data pages in user databases asynchronously during checkpoint. This process finds dirty data pages in the buffer pool and saves them to disk. It tries to minimize the number of disk requests by combining and writing adjacent modified pages together in a single I/O operation when possible.

Another SQL Server process, called lazy writer, periodically sweeps the buffer pool to remove data pages that have not been recently accessed, freeing up the memory. In normal circumstances, lazy writer skips dirty data pages; however, it may also write them to disk if there is memory pressure in the system.
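You can look at the current content of the buffer pool, including the dirty pages awaiting checkpoint or lazy writer, with a query similar to the sketch below. Be careful: this query may be expensive on servers with large amounts of memory.

```sql
-- Clean and dirty pages cached in the buffer pool, per database
SELECT
	DB_NAME(database_id) AS [Database]
	,SUM(CASE WHEN is_modified = 1 THEN 1 ELSE 0 END) AS [Dirty Pages]
	,COUNT(*) AS [Cached Pages]
	,CONVERT(DECIMAL(12,3), COUNT(*) * 8 / 1024.) AS [Cached Size MB]
FROM sys.dm_os_buffer_descriptors WITH (NOLOCK)
GROUP BY database_id
ORDER BY COUNT(*) DESC;
```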

There are, as always, some exceptions. For example, during a bulk import operation, SQL Server allocates a set of buffers in the buffer pool and reuses them, writing data to the database outside of checkpoint. This preserves the content of the buffer pool, so it isn’t flushed by massive data imports.

Checkpoint I/O may introduce issues on busy systems. I will talk about checkpoint tuning later in this chapter. But first, let’s take a holistic look at the entire storage subsystem.

The Storage Subsystem: A Holistic View

Troubleshooting slow I/O performance in SQL Server is not an easy task. I’ve seen many heated discussions between database and infrastructure teams. Database engineers generally complain about slow disk performance, while the storage engineers analyze the metrics from SAN devices with sub-millisecond latency and insist that all issues are on the SQL Server side. Neither team is right. They usually make the same mistake: oversimplifying the storage subsystem to just a couple of components. However, the storage subsystem isn’t that simple.

Figure 3-3 shows a very high-level diagram of the network-based storage subsystem, with many details missing. (It also references some troubleshooting tools. We’ll get to those, but don’t focus on them yet.) The point here is that bad I/O performance can be caused by any component, so you need to analyze all layers in the stack.

Figure 3-3. Storage Subsystem (Network-Based)

There is also the option of using direct-attached storage (DAS). In this configuration, the storage is either installed locally on the server (think of NVMe drives) or directly connected to it. This setup eliminates the network from the storage path and may provide better I/O performance in the system. On the downside, you lose the flexibility of external storage, where you can add space and perform maintenance on the fly, transparently to the server.

Every storage subsystem has a “tipping point” after which the latency of I/O requests will start to grow exponentially with increase in throughput and IOPS (I/O operations per second). For example, you may get a 1-millisecond response with an IOPS workload of 1,000 and a 3-millisecond response with an IOPS workload of 50,000. However, you might cross the tipping point at 100,000 IOPS and start to get double-digit or even triple-digit latency.

Every component in the stack will have its own tipping point. For example, low queue depth in the HBA adapter may lead to queueing on the controller level as the number of I/O requests increases. In this case SQL Server will suffer from high latency and poor I/O performance; however, all SAN metrics will be perfectly healthy, with no latency at all.

You can use the DiskSpd utility to test storage subsystem performance. That utility emulates SQL Server’s workload in the system. You can download it from GitHub.

As I’ve noted, you’ll need to look at all storage subsystem components when you troubleshoot bad I/O performance. Nevertheless, the place to start is analyzing overall storage latency and the amount of data SQL Server reads and writes. You can do this by looking at the sys.dm_io_virtual_file_stats view.

sys.dm_io_virtual_file_stats view

The sys.dm_io_virtual_file_stats view is the most important tool in SQL Server I/O performance troubleshooting. This view provides I/O statistics by database file, including the number of I/O operations, the amount of data read and written, and information about stalls, or the time it takes I/O requests to complete. (I use the terms latency and stalls interchangeably throughout this book.)

The data in this view is cumulative and is calculated from the time of the last SQL Server restart. Take two snapshots of the data and calculate the delta between them (Example 3-1 shows the code to do that). This code filters out database files with low I/O activity, since their metrics are usually skewed and not very useful.

Example 3-1. Using the sys.dm_io_virtual_file_stats view
CREATE TABLE #Snapshot
(
	database_id SMALLINT NOT NULL,
	file_id SMALLINT NOT NULL,
	num_of_reads BIGINT NOT NULL,
	num_of_bytes_read BIGINT NOT NULL,
	io_stall_read_ms BIGINT NOT NULL,
	num_of_writes BIGINT NOT NULL,
	num_of_bytes_written BIGINT NOT NULL,
	io_stall_write_ms BIGINT NOT NULL
);

INSERT INTO #Snapshot(database_id,file_id,num_of_reads,num_of_bytes_read
	,io_stall_read_ms,num_of_writes,num_of_bytes_written,io_stall_write_ms)
	SELECT database_id,file_id,num_of_reads,num_of_bytes_read
		,io_stall_read_ms,num_of_writes,num_of_bytes_written,io_stall_write_ms
	FROM sys.dm_io_virtual_file_stats(NULL,NULL)
OPTION (RECOMPILE);

-- Set test interval (1 minute). Use larger intervals in production
WAITFOR DELAY '00:01:00.000';

;WITH Stats(db_id, file_id, Reads, ReadBytes, Writes
	,WrittenBytes, ReadStall, WriteStall)
as
(
	SELECT
		s.database_id, s.file_id
		,fs.num_of_reads - s.num_of_reads
		,fs.num_of_bytes_read - s.num_of_bytes_read
		,fs.num_of_writes - s.num_of_writes
		,fs.num_of_bytes_written - s.num_of_bytes_written
		,fs.io_stall_read_ms - s.io_stall_read_ms
		,fs.io_stall_write_ms - s.io_stall_write_ms
	FROM
		#Snapshot s JOIN  sys.dm_io_virtual_file_stats(NULL, NULL) fs ON
			s.database_id = fs.database_id and s.file_id = fs.file_id
)
SELECT
	s.db_id AS [DB ID], d.name AS [Database]
	,mf.name AS [File Name], mf.physical_name AS [File Path]
	,mf.type_desc AS [Type], s.Reads 
	,CONVERT(DECIMAL(12,3), s.ReadBytes / 1048576.) AS [Read MB]
	,CONVERT(DECIMAL(12,3), s.WrittenBytes / 1048576.) AS [Written MB]
	,s.Writes, s.Reads + s.Writes AS [IO Count]
	,CONVERT(DECIMAL(5,2),100.0 * s.ReadBytes / 
			(s.ReadBytes + s.WrittenBytes)) AS [Read %]
	,CONVERT(DECIMAL(5,2),100.0 * s.WrittenBytes / 
			(s.ReadBytes + s.WrittenBytes)) AS [Write %]
	,s.ReadStall AS [Read Stall]
	,s.WriteStall AS [Write Stall]
	,CASE WHEN s.Reads = 0 
		THEN 0.000
		ELSE CONVERT(DECIMAL(12,3),1.0 * s.ReadStall / s.Reads) 
	END AS [Avg Read Stall] 
	,CASE WHEN s.Writes = 0 
		THEN 0.000
		ELSE CONVERT(DECIMAL(12,3),1.0 * s.WriteStall / s.Writes) 
	END AS [Avg Write Stall] 
FROM
	Stats s JOIN  sys.master_files mf WITH (NOLOCK) ON
		s.db_id = mf.database_id and
		s.file_id = mf.file_id
	JOIN  sys.databases d WITH (NOLOCK) ON 
		s.db_id = d.database_id  
WHERE -- Only display files with more than 20MB throughput
	(s.ReadBytes + s.WrittenBytes) > 20 * 1048576
ORDER BY
	s.db_id, s.file_id
OPTION (RECOMPILE);

Figure 3-4 shows the output from the view.

Figure 3-4. Sample output from sys.dm_io_virtual_file_stats

The goal is to keep stalls/latency metrics as low as possible. It is impossible to define thresholds that can be applied to all systems, but my rule of thumb is not to exceed 1- to 2-millisecond write stalls for transaction logs and 5- to 7-millisecond read and write stalls for data files when network storage is used. The latency should be even lower, in the sub-millisecond range, when you are using modern direct-attached drives.

Next, analyze throughput in the system. High stalls with low throughput usually indicate performance issues outside of SQL Server. Don’t forget to look at throughput across all files that share the same drive or controller. High throughput in some files may impact the metrics in others that share the same resource.

There is usually a correlation between throughput and stalls – the more data you are reading and writing, the higher latency you’ll have. This correlation is usually linear until you reach the tipping point, after which latency increases very quickly.

A large number of reads and read stalls in the data files is often accompanied by a significant percentage of PAGEIOLATCH waits and a low Page Life Expectancy performance counter value. This indicates that a large amount of data is constantly being read from disk. You need to understand why that is happening. In most cases, it’s due to nonoptimized queries that perform large scans, reading data from disk. We will talk about how to detect those queries in the next chapter.

Don’t discount the possibility, though, that the server is underprovisioned and doesn’t have enough memory to accommodate an active dataset. That is also entirely possible. In either case, adding extra memory may be a completely acceptable solution that will reduce I/O load and improve performance of the system. It is, obviously, not the best solution, but in many cases it’s easier and cheaper to use hardware to solve the problem.

In user databases, a large number of writes and write stalls in data files often indicates inefficient checkpoint configuration. You may get some improvements by tuning the checkpoint configuration, as I will show later in the chapter. In the longer term, you may need to analyze whether it is possible to reduce the number of data pages SQL Server writes to disk. Some ways to do this include removing unnecessary indexes; reducing page splits by changing FILLFACTOR and tuning the index maintenance strategy; decreasing the number of data pages by implementing data compression; and, potentially, refactoring the database schema and applications.

When you see large throughput and stalls in tempdb, identify what causes them. The three most common causes are version store activity, massive tempdb spills, and excessive usage of temporary objects. We will talk about these in Chapter 9.

Finally, you can also get an idea of I/O latency by analyzing resource wait time in PAGEIOLATCH and other I/O-related waits. This won’t give you detailed information on a per-file basis, but it may be a good metric when you look at systemwide I/O performance.
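A sketch of such a systemwide check, using the cumulative sys.dm_os_wait_stats view; subtracting signal wait time from total wait time leaves the resource (I/O) portion of each wait:

```sql
-- I/O-related waits accumulated since the last restart or
-- wait statistics reset
SELECT
	wait_type
	,waiting_tasks_count
	,wait_time_ms
	,wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats WITH (NOLOCK)
WHERE
	wait_type LIKE 'PAGEIOLATCH%' OR
	wait_type IN ('WRITELOG','IO_COMPLETION'
		,'ASYNC_IO_COMPLETION','WRITE_COMPLETION')
ORDER BY wait_time_ms DESC;
```

As with Example 3-1, you can snapshot this view twice and calculate the delta to evaluate waits over a specific time interval.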

Performance Counters and OS Metrics

The sys.dm_io_virtual_file_stats view provides useful and detailed information and points you in the right direction for further I/O troubleshooting, but it has one limitation: it averages data over the sampling interval.

This is completely acceptable when I/O latency is low. However, if latency numbers are high, you’ll want to determine if performance is generally slow or if the numbers have been skewed by some bursts in activity. You can do this by looking at the performance counters correlating SQL Server and disk metrics.

The troubleshooting process will vary slightly between Windows and Linux. In Windows, the simplest way to analyze the metrics is using the well-known PerfMon (Performance Monitor) utility. You can look at the SQL Server and I/O performance counters together and correlate data from them.

Table 3-1. I/O-Related Performance Counters

Physical Disk: Avg Disk Queue Length, Avg Disk Read Queue Length, Avg Disk Write Queue Length
Provide the average number of I/O requests (total, read, and write, respectively) queued during the sampling interval. Those numbers should be as low as possible. Spikes indicate that I/O requests are being queued at the OS level.

Physical Disk: Current Disk Queue Length
Gives you the size of the I/O request queue at the moment the metric was collected.

Physical Disk: Avg Disk sec/Transfer, Avg Disk sec/Read, Avg Disk sec/Write
Indicate average latency for disk operations during the sampling interval. These numbers are usually similar to the latency/stall metrics from the sys.dm_io_virtual_file_stats view when sampled over the same time period. However, because you typically measure sys.dm_io_virtual_file_stats over larger intervals, these counters will show you if I/O stalls were consistently high or if the data was affected by latency spikes.

Physical Disk: Disk Transfers/sec, Disk Reads/sec, Disk Writes/sec, Disk Bytes/sec, Disk Read Bytes/sec, Disk Write Bytes/sec
Display the number of I/O operations and throughput at the time of the reading. As with the latency counters, you can use them to analyze the uniformity of the disk workload.

Physical Disk: Avg Disk Bytes/Transfer, Avg Disk Bytes/Read, Avg Disk Bytes/Write
Show the average size of I/O requests, which can help you understand I/O patterns in the system.

SQL Server: Buffer Manager: Checkpoint pages/sec, Background writer pages/sec
Show the number of dirty pages written by the checkpoint process (with automatic and indirect checkpoints, respectively).

SQL Server: Buffer Manager: Lazy writes/sec
Provides the number of pages written by the lazy writer process.

SQL Server: Buffer Manager: Page reads/sec, Page writes/sec
Display the number of physical reads and writes.

SQL Server: Buffer Manager: Readahead pages/sec
Shows the number of pages read by the read-ahead process.

SQL Server: Databases: Log Bytes Flushed/sec, Log Flush Write Time (ms), Log Flushes/sec
Provide data about throughput, latency, and the number of write requests for transaction log writes. Use these counters to understand the uniformity of log generation when you troubleshoot high log write latency.

SQL Server: SQL Statistics: Batch Requests/sec and SQL Server: Databases: Transactions/sec
While these two counters are not I/O-related, they can be used to analyze spikes in system workload that may lead to bursts in I/O activity.

Usually, I start by looking at Avg Disk sec/Read and Avg Disk sec/Write latency counters, along with Avg Disk Queue Length. If I see any spikes in their values, I add SQL Server–specific counters to identify what processes may be leading to the bursts in activity.

Figure 3-5 illustrates one such example. You can see the correlation between Checkpoint pages/sec and high Avg Disk sec/Write and Avg Disk Queue Length values. This leads to the simple conclusion that the I/O subsystem cannot keep up with bursts of writes from the checkpoint process.

Figure 3-5. Checkpoint and disk queueing

Pay attention to other applications installed on the server – it is possible that they are responsible for I/O activity bursts or other issues.

Linux doesn’t offer the standard PerfMon utility; however, there are plenty of free and commercial monitoring tools available. You can also use tools like iostat, dstat, and iotop, which are included in major Linux distributions. They provide general disk performance metrics on a per-process or system level.

On the SQL Server side, you can access performance counters through the sys.dm_os_performance_counters view. Example 3-2 shows you how to do that.

Example 3-2. Using sys.dm_os_performance_counters view
CREATE TABLE #PerfCntrs
(
	collected_time DATETIME2(7) NOT NULL DEFAULT SYSDATETIME(),
	object_name SYSNAME NOT NULL,
	counter_name SYSNAME NOT NULL,
	instance_name SYSNAME NOT NULL,
	cntr_value BIGINT NOT NULL,
	PRIMARY KEY (object_name, counter_name, instance_name)
);

;WITH Counters(obj_name, ctr_name)
AS
(
	SELECT C.obj_name, C.ctr_name
	FROM 
	(
		VALUES
			('SQLServer:Buffer Manager','Checkpoint pages/sec')
			,('SQLServer:Buffer Manager','Background writer pages/sec')
			,('SQLServer:Buffer Manager','Lazy writes/sec')
			,('SQLServer:Buffer Manager','Page reads/sec')
			,('SQLServer:Buffer Manager','Page writes/sec')
			,('SQLServer:Buffer Manager','Readahead pages/sec')
			,('SQLServer:Databases','Log Flushes/sec') -- For all DBs
			,('SQLServer:Databases','Log Bytes Flushed/sec') -- For all DBs
			,('SQLServer:Databases','Log Flush Write Time (ms)') -- For all DBs
			,('SQLServer:Databases','Transactions/sec') -- For all DBs
			,('SQLServer:SQL Statistics','Batch Requests/sec') 
	) C(obj_name, ctr_name)
)
INSERT INTO #PerfCntrs(object_name,counter_name,instance_name,cntr_value)
	SELECT 
		pc.object_name, pc.counter_name, pc.instance_name, pc.cntr_value
	FROM 
		sys.dm_os_performance_counters pc WITH (NOLOCK) JOIN Counters c ON
			pc.counter_name = c.ctr_name AND pc.object_name = c.obj_name;

WAITFOR DELAY '00:00:01.000';

;WITH Counters(obj_name, ctr_name)
AS
(
	SELECT C.obj_name, C.ctr_name
	FROM 
	(
		VALUES
			('SQLServer:Buffer Manager','Checkpoint pages/sec')
			,('SQLServer:Buffer Manager','Background writer pages/sec')
			,('SQLServer:Buffer Manager','Lazy writes/sec')
			,('SQLServer:Buffer Manager','Page reads/sec')
			,('SQLServer:Buffer Manager','Page writes/sec')
			,('SQLServer:Buffer Manager','Readahead pages/sec')
			,('SQLServer:Databases','Log Flushes/sec') -- For all DBs
			,('SQLServer:Databases','Log Bytes Flushed/sec') -- For all DBs
			,('SQLServer:Databases','Log Flush Write Time (ms)') -- For all DBs
			,('SQLServer:Databases','Transactions/sec') -- For all DBs
			,('SQLServer:SQL Statistics','Batch Requests/sec') 
	) C(obj_name, ctr_name)
)
SELECT 
	pc.object_name, pc.counter_name, pc.instance_name
	,CASE pc.cntr_type
		WHEN 272696576 THEN 
			(pc.cntr_value - h.cntr_value) * 1000 / 
				DATEDIFF(MILLISECOND,h.collected_time,SYSDATETIME())
		WHEN 65792 THEN 
			pc.cntr_value
		ELSE NULL
	END as cntr_value
FROM 
	sys.dm_os_performance_counters pc WITH (NOLOCK) JOIN Counters c ON
		pc.counter_name = c.ctr_name AND pc.object_name = c.obj_name
	JOIN #PerfCntrs h ON
		pc.object_name = h.object_name AND
		pc.counter_name = h.counter_name AND
		pc.instance_name = h.instance_name
ORDER BY
	pc.object_name, pc.counter_name, pc.instance_name
OPTION (RECOMPILE);

You can also bring the sys.dm_io_virtual_file_stats view into the analysis, sampling its data and the performance counters together every second. The approach is the same one we just discussed – look at the correlation between disk latency and activity, evaluate the general performance of the I/O subsystem, and identify tipping points in the load.

Virtualization, HBA, and Storage Layers

There are several layers in the storage stack you may need to analyze in addition to the OS. They include virtualization, HBA/SCSI controller configuration, and the storage array itself.

I recommend working together with infrastructure and storage engineers during troubleshooting.

SQL Server mostly operates in shared environments. It shares storage and network infrastructure with other clients, and when virtualized, it runs on the same physical host with other VMs. As I said earlier in this book, when virtualization is being used, be sure to validate that the host is not overcommitted, which could lead to all sorts of performance issues.

Unless you have a very simple SQL Server setup that uses local storage, I/O requests will be serialized and sent over the network. There are two typical problems here: insufficient queue depth and noisy neighbors.

Insufficient queue depth
The first is insufficient queue depth somewhere in the I/O path. Unfortunately, the default queue depth may not be enough for a highly demanding I/O workload. You’ll need to check and potentially increase it in the datastore, vSCSI controller, and HBA adapter settings. The typical sign of insufficient queue depth is low latency on the storage combined with much higher latency in the VM and/or OS, with disk queueing present.
Noisy neighbors
The second problem is noisy neighbors. Multiple I/O-intensive VMs running on the same host may affect each other. Similarly, multiple high-throughput servers sharing the same network and storage may overload them. Unfortunately, troubleshooting a noisy neighbor problem is never easy; you need to analyze multiple components in the infrastructure to detect it.

A word of caution – storage arrays can handle a limited number of outstanding requests. Increasing queue depth on a busy server could increase the number of outstanding requests on the storage. You might shift the bottleneck from the server to the storage layer, especially if the storage serves requests from many busy systems.

The virtualization host and storage both expose throughput, IOPS, and latency metrics for analysis. On the virtualization layer, the metrics may vary based on the technology. For example, in Hyper-V you can use the regular disk performance counters on the host. In VMware, you can get the data from the esxtop utility. In either case, the troubleshooting approach is very similar to what we have already discussed. Look at the available metrics, correlate data from them, and detect the bottlenecks in the I/O path.

Finally, check the storage configuration. Storage vendors usually publish best practices for SQL Server workloads: they are a good starting point. Pay attention to the allocation unit size’s alignment with the RAID stripe size and partition offset, though.

For example, a 1024 MB partition offset, 4 KB disk block, 64 KB allocation unit, and 128 KB RAID stripes are perfectly aligned, with each I/O request served by a single disk. On the other hand, 96 KB RAID stripes will spread 64 KB allocation units across two disks, which leads to extra I/O requests and can seriously impact performance.

Again, it is always beneficial to work together with infrastructure and storage engineers. They are the subject matter experts and may help you to find the root cause of the problem faster than when you are working alone.

Finally, the best approach to get predictable performance in critical systems is to use a dedicated environment. Run SQL Server on dedicated hardware with direct-attached storage (DAS) to get the best performance possible.

Checkpoint Tuning

As we all know, SQL Server uses write-ahead logging. Transactions are considered to be committed only after the log records are hardened in the transaction logs. SQL Server does not need to save dirty data pages to disk at the same time – it can reapply the changes by replaying log records if needed.

The checkpoint process saves data pages to the data files. The main goal of checkpoint is to reduce recovery time in the event of a SQL Server crash or failover: the fewer changes that need to be replayed, the faster recovery will be. The maximum desired recovery time is controlled at either the server level or the database level. By default, both of them are 60 seconds.

Note

You should not consider the recovery target to be a hard value. In many cases, the database will recover much faster than that. It is also possible for bursts of activity and long running transactions to prolong recovery beyond the target time.

There are four different types of checkpoints:

Internal checkpoints
Internal checkpoints occur during some SQL Server operations, such as starting database backup or creating a database snapshot.
Manual checkpoints
Manual checkpoints occur, as the name indicates, when users trigger them with the CHECKPOINT command.
Automatic checkpoint
Historically, SQL Server used automatic checkpoints, with the recovery interval controlled at the server level. The checkpoint process wakes up once or a few times during each recovery interval and flushes dirty data pages to disk. Unfortunately, this approach can lead to bursts of data writes, which can be problematic in busy systems.
Indirect checkpoint
Starting with SQL Server 2012, you have another option: indirect checkpoint. With this method, SQL Server tries to balance the I/O load by executing checkpoints much more frequently – in some cases, even continuously. This helps to mitigate bursts of data writes, making the I/O load much more balanced. Use it instead of automatic checkpoint whenever possible. Indirect checkpoint is controlled on a per-database basis and is enabled by default in databases created in SQL Server 2016 and above. However, SQL Server does not enable indirect checkpoint automatically when you upgrade a SQL Server instance, or in SQL Server 2012 and 2014. You can do it manually by setting a recovery target at the database level with the ALTER DATABASE SET TARGET_RECOVERY_TIME command.
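A sketch of that command, assuming a hypothetical database named MyDatabase; a 60-second target matches the default for new databases in SQL Server 2016 and above:

```sql
ALTER DATABASE MyDatabase SET TARGET_RECOVERY_TIME = 60 SECONDS;

-- Check which databases still use automatic checkpoint
-- (target_recovery_time_in_seconds = 0)
SELECT name, target_recovery_time_in_seconds
FROM sys.databases
ORDER BY name;
```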

Let me show you an example from one system I worked with. A 1-minute sample of data from the sys.dm_io_virtual_file_stats view showed very high write latency for the data files. However, smaller samples (1 to 3 seconds) rarely showed any write activity at all.
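Since the counters in sys.dm_io_virtual_file_stats are cumulative since instance restart, interval samples are built by taking two snapshots and computing the deltas. A minimal sketch (adjust the WAITFOR delay to change the sample duration):

```sql
-- Snapshot cumulative write statistics for all database files.
SELECT database_id, file_id, num_of_writes, io_stall_write_ms
INTO #snapshot
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:01:00';  -- sample duration: 1 minute

-- Compute per-file writes and average write latency for the interval.
SELECT
    s2.database_id, s2.file_id,
    s2.num_of_writes - s1.num_of_writes AS writes,
    CASE WHEN s2.num_of_writes > s1.num_of_writes THEN
        (s2.io_stall_write_ms - s1.io_stall_write_ms) /
        (s2.num_of_writes - s1.num_of_writes)
    END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) s2
    JOIN #snapshot s1 ON
        s1.database_id = s2.database_id AND s1.file_id = s2.file_id;

DROP TABLE #snapshot;
```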

Figure 3-6 shows the data, with the 1-minute sample at the top and the 1-second sample at the bottom.

Figure 3-6. Sample sys.dm_io_virtual_file_stats with automatic checkpoint

This behavior led me to believe that the issue was related to checkpoint. I confirmed this hypothesis by looking at the Checkpoint pages/sec, Disk Writes/sec, and Avg Disk Queue Length performance counters. You can clearly see the burst of disk writes from the checkpoint process in Figure 3-5 earlier in the chapter, which shows a screenshot from PerfMon.

Although this instance ran SQL Server 2016, it used automatic checkpoint, because all databases had been upgraded from earlier versions of SQL Server. Enabling indirect checkpoint in the system immediately changed the I/O pattern, making it much more balanced.

You can see the performance counters in Figure 3-7. Notice that with indirect checkpoint, you should use Background writer pages/sec instead of the Checkpoint pages/sec counter.

Figure 3-7. Indirect checkpoint performance counters

Figure 3-8 shows the output from a 1-minute sample in the sys.dm_io_virtual_file_stats view. As you can see, latency went back to normal.

Figure 3-8. Sample sys.dm_io_virtual_file_stats with indirect checkpoint

Indirect checkpoints do not completely eliminate I/O bursts. You can still have them, especially if the system has some spikes in data modifications. However, they are less frequent than with automatic checkpoints.

You may also need to tune the recovery target to get the most balanced I/O load. In the case above, I got the best results with a 90-second target. Of course, high values may increase recovery time in the system.

I/O Waits

SQL Server uses several different wait types related to I/O operations. It is very common to see all of them present when the disk subsystem is not fast enough. Let’s look at five of the most common: ASYNC_IO_COMPLETION, IO_COMPLETION, WRITELOG, WRITE_COMPLETION, and PAGEIOLATCH.
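You can check the cumulative share of these waits since instance restart with a query against the sys.dm_os_wait_stats view. Note that this sketch computes each wait's percentage relative only to the listed I/O waits, not to all waits in the system:

```sql
-- Cumulative I/O-related wait statistics since instance restart.
SELECT
    wait_type, waiting_tasks_count, wait_time_ms,
    CONVERT(DECIMAL(5,2),
        100.0 * wait_time_ms / SUM(wait_time_ms) OVER()) AS pct_of_io_waits
FROM sys.dm_os_wait_stats WITH (NOLOCK)
WHERE
    wait_type IN
        ('ASYNC_IO_COMPLETION','IO_COMPLETION','WRITELOG','WRITE_COMPLETION')
    OR wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_time_ms DESC
OPTION (RECOMPILE);
```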

ASYNC_IO_COMPLETION waits

This wait type occurs when SQL Server waits for asynchronous I/O operations (read or write) for non-buffer pool pages to complete. Examples include:

  • Internal checkpoint when you start database backup or run DBCC CHECKDB
  • Reading GAM pages from data files
  • Reading data pages from database during database backup. (Unfortunately, this tends to skew the average wait time, making it harder to analyze.)

When I see significant presence of both ASYNC_IO_COMPLETION and PAGEIOLATCH waits in the system, I perform general I/O troubleshooting. If PAGEIOLATCH waits are not present, I look at how often ASYNC_IO_COMPLETION occurs. I may ignore that wait if its percentage is not very significant and disk latency is low.

IO_COMPLETION waits

The IO_COMPLETION wait type occurs during synchronous reads and writes in data files and during some read operations in the transaction log. A few examples:

  • Reading allocation map pages from the database
  • Reading the transaction log during database recovery
  • Writing data to tempdb during sort spills

When you see significant percentages of this wait in the system, perform general disk-performance troubleshooting. Pay specific attention to tempdb latency and throughput; in my experience, bad tempdb performance is the most common reason for this wait. We will talk more about tempdb troubleshooting in Chapter 9.

WRITELOG waits

As you can guess by the name, this wait occurs when SQL Server writes log records to the transaction log. It is normal to see this wait in any system; however, a large percentage may indicate a transaction-log bottleneck.

Look at average wait time and transaction log write latency in the sys.dm_io_virtual_file_stats view during troubleshooting. High numbers are impactful and may affect throughput in the system.
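A sketch of the latency check: the query below reports average write latency per transaction log file. Keep in mind that these numbers are cumulative since instance restart; for a point-in-time picture, sample deltas between two snapshots of the view instead.

```sql
-- Average cumulative write latency for transaction log files.
SELECT
    DB_NAME(vfs.database_id) AS [database],
    mf.name AS [log_file],
    vfs.num_of_writes,
    CASE WHEN vfs.num_of_writes > 0 THEN
        vfs.io_stall_write_ms / vfs.num_of_writes
    END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
    JOIN sys.master_files mf ON
        mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'LOG'
ORDER BY avg_write_latency_ms DESC;
```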

In addition to optimizing disk subsystem throughput, there are several other things you can do to reduce that wait. We will discuss them in Chapter 11.

WRITE_COMPLETION waits

This wait occurs during synchronous write operations in database and log files. In my experience, it is most common with database snapshots.

SQL Server maintains snapshot databases by persisting versions of data pages that existed at the time the snapshot was created. At checkpoints after the snapshot was created, SQL Server reads old copies of data pages from data files and saves them into the snapshot before saving dirty pages to disk. This can significantly increase the amount of I/O in the system.

When you see this wait in the system, check if there are database snapshots present. Remember that some internal processes, like DBCC CHECKDB, also create internal database snapshots.
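User-created snapshots are easy to find, since they reference their source database in sys.databases. Note that the internal snapshots created by DBCC CHECKDB are hidden and will not appear in this output:

```sql
-- List user-created database snapshots and their source databases.
SELECT
    name AS snapshot_name,
    DB_NAME(source_database_id) AS source_database,
    create_date
FROM sys.databases
WHERE source_database_id IS NOT NULL;
```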

When snapshots are present and their usage is legitimate, you may need to analyze how to improve disk performance to support them. In other cases, you may need to remove them from the system if storage cannot keep up.

PAGEIOLATCH waits

As you already know, PAGEIOLATCH waits occur when SQL Server reads data pages from disk. Those waits are very common and are present in any system. Technically, there are six such waits, but only three are typically present in the system:

PAGEIOLATCH_EX
Occurs when the worker wants to update the data page and is waiting for the page to be read from disk to the buffer pool.
PAGEIOLATCH_SH
Occurs when the worker wants to read the data page and is waiting for the page to be read from disk to the buffer pool.
PAGEIOLATCH_UP
Occurs when the worker wants to update a system page (for example, the allocation map) and is waiting for the page to be read from disk to the buffer pool.

Excessive amounts of PAGEIOLATCH waits show that SQL Server is constantly reading data from disk. This usually occurs under one of two conditions. The first is an underprovisioned SQL Server, where the active data does not fit into memory. The second, and more common, is the presence of nonoptimized queries that scan unnecessary data, flushing the contents of the buffer pool.

You can cross-check the data by looking at the Page Life Expectancy performance counter, which shows how long data pages stay in the buffer pool. As a baseline, you can generally use a value of 300 seconds per 4 GB of buffer pool memory: for example, 7,500 seconds on a server with a 100 GB buffer pool.

You can see the value of Page Life Expectancy in the PerfMon utility or with the sys.dm_os_performance_counters view, as shown in Listing 3-3. It also returns values for individual NUMA nodes in the system.

Listing 3-3. Getting Page Life Expectancy in the system
SELECT object_name, counter_name, instance_name, cntr_value AS [PLE(sec)]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE counter_name = 'Page life expectancy'
OPTION (RECOMPILE);

A large percentage of PAGEIOLATCH waits always requires troubleshooting. While it does not always introduce customer-facing problems, especially with low-latency flash-based disk arrays, the data growth may push the disk subsystem over the limit, which can become a problem that quickly affects the entire system.

You can reduce the impact of PAGEIOLATCH waits by upgrading the disk subsystem or adding more memory to the server. However, the best approach is reducing the amount of data to read from disk by detecting and optimizing inefficient queries. We’ll look at how to detect those queries in the next chapter.

Summary

SQL Server uses cooperative scheduling and, in the majority of cases, asynchronous I/O when it reads and writes data. By default, each scheduler has its own I/O queue and handles I/O in the system.

The sys.dm_io_virtual_file_stats view provides I/O throughput and latency metrics per database file. In a properly tuned system, the latency of transaction log writes should not exceed 1 to 2 milliseconds, and the latency of reads and writes to data files should not exceed 5 to 7 milliseconds with network-based storage and should be even lower with DAS.

Look at the entire I/O stack when troubleshooting bad I/O performance. The problem may be anywhere – in the OS, virtualization, network path, or storage layers.

In many cases, high I/O latency is introduced by bursts in I/O activity. Analyze and tune the checkpoint process – it is one of the most common offenders in busy systems.

In many cases, reducing disk activity will help you improve disk latency and system performance. Query optimization is one of the best ways to achieve that. We will look at how to detect non-optimized queries in the system in the next chapter.

Troubleshooting Checklist

Troubleshoot the following:

  • Analyze disk subsystem latency with the sys.dm_io_virtual_file_stats view
  • Check if high latency is caused by bursts in I/O activity by analyzing SQL Server and OS performance counters.
  • Review I/O metrics at the VM and storage levels, paying attention to noisy neighbors in your setup.
  • Check disk queue depth settings in the I/O stack.
  • Troubleshoot SQL Server checkpoint performance and switch to indirect checkpoints.
  • Troubleshoot log performance if you see significant WRITELOG waits (see Chapter 11).
  • Troubleshoot tempdb performance if you see significant IO_COMPLETION waits and high tempdb usage and latency (see Chapter 9).
  • Detect and optimize inefficient queries if you see high PAGEIOLATCH waits in the system.