WBThrottle and/or nr_requests

Filestore uses buffered I/O to write; this brings a number of advantages if the filestore journal is on faster media. Client requests are acknowledged as soon as they are written to the journal and are then flushed to the data disk at a later time by the standard writeback functionality in Linux. This allows spinning-disk OSDs to provide write latency similar to SSDs when writing in small bursts. The delayed writeback also allows the kernel to rearrange I/O requests to the disk, hopefully either coalescing them or allowing the disk heads to take a more optimal path across the platters. The net effect is that you can squeeze somewhat more I/O out of each disk than would be possible with direct or synchronous I/O.

However, a problem occurs when the amount of incoming writes to the Ceph cluster outstrips the capabilities of the underlying disks. In this scenario, the number of pending I/Os waiting to be written to disk can grow uncontrollably, and the resulting queue of I/Os can saturate both the disk and the Ceph queues. Read requests are particularly badly affected, as they get stuck behind potentially thousands of write requests, which may take several seconds to flush to the disk.

To combat this problem, Ceph has a writeback throttle mechanism built into filestore, called WBThrottle. It is designed to limit the amount of writeback I/O that can queue up and to start the flushing process earlier than it would naturally be triggered by the kernel. Unfortunately, testing has shown that the defaults may still not curtail this behavior enough to limit the impact on read latency. Tuning can reduce the write queue lengths so that reads are not affected as severely. However, there is a trade-off: by reducing the maximum number of writes allowed to queue up, you reduce the kernel's opportunity to maximize efficiency by reordering the requests. Give some thought to what is most important for your use case and workload, and tune to match it.

To control the writeback queue depth, you can either reduce the maximum number of outstanding I/Os using Ceph's WBThrottle settings, or lower the maximum number of outstanding requests at the block layer in the kernel. Both effectively control the same behavior; it is largely a matter of preference which way you implement the configuration.
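As a rough sketch of the first approach, the filestore WBThrottle options for XFS-backed OSDs can be lowered in ceph.conf; the option names below are the standard filestore settings, but the values are purely illustrative assumptions and should be tuned against your own hardware and workload:

[osd]
# Start flushing queued writeback to disk sooner (illustrative values only)
filestore_wbthrottle_xfs_ios_start_flusher = 100
filestore_wbthrottle_xfs_bytes_start_flusher = 10485760
filestore_wbthrottle_xfs_inodes_start_flusher = 100
# Hard limits on queued writeback before filestore blocks new writes (illustrative values only)
filestore_wbthrottle_xfs_ios_hard_limit = 1000
filestore_wbthrottle_xfs_bytes_hard_limit = 104857600
filestore_wbthrottle_xfs_inodes_hard_limit = 1000

The OSDs need to pick up the new values, either by restarting them or by injecting the settings at runtime, before the change takes effect.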

It should also be noted that the operation priorities in Ceph are more effective with a shorter queue at the disk level. By shortening the queue at the disk, the main queueing location moves up into Ceph, where it has more control over which I/O has priority. Consider the following example:

echo 8 > /sys/block/sda/queue/nr_requests
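A value written to nr_requests in this way does not survive a reboot. One way to reapply it automatically is with a udev rule; the following is only a sketch, with an arbitrary file name and a device match that you would narrow to just your OSD data disks:

# /etc/udev/rules.d/99-osd-nr-requests.rules (hypothetical file name)
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/nr_requests}="8"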

With the release of the Linux 4.10 kernel, a new feature was introduced which deprioritizes writeback I/O; this greatly reduces the impact of write starvation with Ceph and is worth investigating if running the 4.10 kernel is feasible.
