Latency

When carrying out benchmarks, you are ultimately measuring the end result of latency. All other benchmarking metrics, including IOPS, MBps, and even higher-level application metrics, are derived from the latency of each individual request.

IOPS is the number of I/O requests completed in a second; the latency of each request directly affects the achievable IOPS, as shown by this formula:

IOPS = 1 / average latency (in seconds)

An average latency of 2 milliseconds per request will result in roughly 500 IOPS assuming each request is submitted in a synchronous fashion:

1/0.002 = 500

MBps is simply the number of IOPS multiplied by the I/O size:

500 IOPS * 64 KB = 32,000 KBps (roughly 32 MBps)
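
These two formulas can be combined into a short calculation. The following is a minimal Python sketch (the function names are illustrative, not part of any benchmarking tool), assuming each request is submitted synchronously at a queue depth of 1:

# Illustrative helpers for the two formulas above, assuming each request
# is submitted synchronously at a queue depth of 1.

def iops_from_latency(avg_latency_seconds):
    # IOPS = 1 / average per-request latency in seconds
    return 1.0 / avg_latency_seconds

def throughput_kbps(iops, io_size_kb):
    # Throughput = IOPS * I/O size
    return iops * io_size_kb

iops = iops_from_latency(0.002)        # 2 ms per request -> 500 IOPS
print(iops)                            # 500.0
print(throughput_kbps(iops, 64))       # 32000.0 KBps, roughly 32 MBps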

It should be clear that when you are carrying out benchmarks, you are actually measuring the end result of latency. Therefore, any tuning you carry out should aim to reduce the end-to-end latency of each I/O request.

Before moving on to how to benchmark the various components of your Ceph cluster and the tuning options available for each, we first need to understand the sources of latency in a typical I/O request. Once each source of latency is broken down into its own category, it becomes possible to benchmark each one, so that both negative and positive tuning outcomes can be reliably tracked at each stage.

The following diagram shows an example Ceph write request with the main sources of latency:

Starting with the client, we can see that, on average, there is probably around 100 microseconds of latency for it to talk to the primary OSD. With 1G networking, this figure could be nearer to 1 millisecond. We can confirm this by using either ping or iperf to measure the round-trip delay between two nodes.
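
If ping or iperf is not to hand, a rough feel for the round-trip time can also be obtained by timing a TCP connection from Python. This is only a sketch: the host and port below are placeholders for one of your OSD nodes and any service listening on it, and a TCP connect adds a little overhead on top of the raw network round trip.

import socket
import time

HOST = "192.168.0.10"   # placeholder: an OSD node on your cluster network
PORT = 6800             # placeholder: any TCP port with a listening service

samples = []
for _ in range(20):
    start = time.perf_counter()
    # Establishing a TCP connection costs roughly one network round trip
    with socket.create_connection((HOST, PORT), timeout=1):
        pass
    samples.append((time.perf_counter() - start) * 1_000_000)

print("average round trip: %.0f microseconds" % (sum(samples) / len(samples)))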

From the previous formula, we can see that with 1G networking, even if there were no other sources of latency, the maximum synchronous write IOPS would be around 1,000 (1/0.001 = 1,000).

Although the client introduces some latency of its own, it is minimal compared to the other sources and so is not included in the diagram.

Next, the OSD, which runs the Ceph code, introduces latency as it processes the request. It is hard to put an exact figure on this, but it is affected by the speed of the CPU; a faster CPU with a higher clock frequency will run through the code path more quickly, reducing latency. As covered earlier in the book, the primary OSD sends the request on to the other two OSDs in the replica set. These are processed in parallel, so there is minimal increase in latency when going from 2x to 3x replicas, assuming the backend disks can cope with the load.

There is also an extra network hop between the primary and the replica OSDs, which introduces additional latency into each request.

Once the primary OSD has committed the request to its journal and has received acknowledgements from all the replica OSDs that they have done the same, it can send an acknowledgement back to the client, which can then submit its next I/O request.

Regarding the journal, the commit latency can vary depending on the type of media being used. NVMe SSDs will tend to service requests in the 10-20 microsecond range, whereas SATA/SAS-based SSDs will typically service requests in the 50-100 microsecond range. NVMe devices also tend to have a more consistent latency profile as the queue depth increases, making them ideal for cases where multiple disks share a single SSD as a journal. Hard drives trail far behind, with latencies measured in tens of milliseconds, although their latency remains fairly consistent as the I/O size increases.
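
To get a feel for the commit latency of a particular journal device, the following rough Python sketch times small synchronous writes. The path is a placeholder for a file on the device under test, and a purpose-built tool such as fio will give far more rigorous numbers; this is only an approximation of a journal commit.

import os
import time

# Placeholder: a file on the journal device under test. O_DSYNC forces each
# 4 KB write to reach stable storage before the call returns, roughly
# mimicking a journal commit.
PATH = "/mnt/journal-test/latency.dat"

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
buf = b"\0" * 4096
samples = []
for _ in range(1000):
    start = time.perf_counter()
    os.write(fd, buf)
    samples.append((time.perf_counter() - start) * 1_000_000)
os.close(fd)

print("average commit latency: %.0f microseconds" % (sum(samples) / len(samples)))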

It should be obvious that for small, high-performance workloads, hard drive latency would dominate the total latency figure, and so SSDs, preferably NVMe, should be used for the journal.

Overall, in a well-designed and tuned Ceph cluster, all of these parts combined should allow an average 4 KB write request to be serviced in around 500-750 microseconds.
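
As a sanity check, the rough per-component figures mentioned in this section can be added up. The values below are illustrative rather than measured, and the OSD processing figure in particular is an assumption, since it is hard to put an exact number on it:

# Rough, illustrative per-component latency estimates in microseconds.
# Journal commits on the primary and replica OSDs happen in parallel,
# so only one commit is counted on the critical path.
components_us = {
    "client to primary OSD (round trip)":   100,
    "primary to replica OSDs (round trip)": 100,
    "OSD request processing (assumed)":     300,
    "journal commit (NVMe SSD)":             20,
}

for name, usec in components_us.items():
    print("%-40s %4d us" % (name, usec))
print("%-40s %4d us" % ("estimated total", sum(components_us.values())))  # ~520 us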
