CPU

As Ceph is software-defined storage, its performance is heavily affected by the speed of the CPUs in the OSD nodes. Faster CPUs mean that the Ceph code can run faster and will spend less time processing each I/O request. The result is lower latency per I/O, which, if the underlying storage can cope, reduces the chance of the CPU becoming a bottleneck and gives higher overall performance. In Chapter 1, Planning for Ceph, it was advised that high-GHz processors should be preferred for performance reasons; however, there are additional concerns with high core count CPUs when they are over-specified for the job.

To understand why, we need to cover a brief history of CPU design. During the early 2000s, CPUs were all single-core designs that ran constantly at the same frequency and didn't support many low-power modes. As frequencies rose and core counts started to increase, it became apparent that not every core would be able to run at its maximum frequency all the time; the amount of heat generated by the CPU package was simply too great. Fast forward to today, and this still holds true: there is no such thing as a 4 GHz 20-core CPU; it would simply generate too much heat to be feasible.

However, the clever people who design CPUs came up with a solution that allows each core to run at a different frequency and also allows cores to power themselves down into deep sleep states. Both approaches lower the power and cooling requirements of the CPU, down to single-figure watts when idle. Modern CPUs have much lower base clock speeds, but with the ability for a certain number of cores to engage turbo mode, higher GHz are possible. There is normally a gradual decrease in the top turbo frequency as the number of active cores increases, to keep the heat output below a certain threshold. If a lightly threaded process is started, the CPU wakes up a couple of cores and speeds them up to a much higher frequency to get better single-threaded performance. In Intel CPUs, the different frequency levels are called P-states and the sleep levels are called C-states.

This all sounds like the perfect package: a CPU that consumes hardly any power when idle, and yet, when needed, can turbo boost a handful of cores to achieve high clock speeds. Unfortunately, as with most things in life, there is no such thing as a free lunch. There are some overheads to this approach that have a detrimental effect on latency-sensitive applications, Ceph being one of them.

There are two main problems with this approach that impact latency-sensitive applications. The first is that it takes time for a core to wake up from a sleep state: the deeper the sleep, the longer it takes to wake up, as the core has to reinitialize certain things before it is ready to be used. Here is a list of C-state exit latencies from an Intel E3-1200v5 CPU; older CPUs may fare slightly worse (the sketch after the list shows how to read these values on your own hardware):

  • POLL = 0 microsecond
  • C1-SKL = 2 microseconds
  • C1E-SKL = 10 microseconds
  • C3-SKL = 70 microseconds
  • C6-SKL = 85 microseconds
  • C7s-SKL = 124 microseconds
  • C8-SKL = 200 microseconds
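
If you want to check the exit latencies on your own hardware, the kernel exposes them through the cpuidle interface in sysfs. This is a minimal sketch, assuming a system using the intel_idle or acpi_idle driver; the state names will vary between CPU generations:

    # Print each C-state's name and exit latency (in microseconds) for core 0
    for state in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
        echo "$(cat $state/name): $(cat $state/latency) us"
    done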

We can see that, in the worst case, it may take a core up to 200 microseconds to wake up from its deepest sleep. When you consider that a single Ceph I/O may require several threads across several nodes to each wake up a CPU core, these exit latencies can really start to add up. While P-states, which affect the core frequency, don't impact performance quite as much as the C-state exit latencies, the core's frequency doesn't immediately ramp up to maximum as soon as it's in use. This means that under low utilization, the CPU cores may only be operating at a low GHz. This leads on to the second problem, which lies with the Linux scheduler.

Linux is aware of which cores are active and which C-state and P-state each core is in; it can fully control each core's behavior. Unfortunately, the Linux scheduler doesn't take any of this information into account and instead prefers to try and balance threads across cores evenly. What this means is that, at low utilization, all the CPU cores will spend the bulk of their time in their deepest C-states and will operate at a low frequency. During periods of low utilization, this can increase the latency of small I/Os by 4-5x, which is a significant impact.

Until Linux has a power-aware scheduler that takes into account which cores are already active and schedules threads onto them to reduce latency, the best approach is to force the CPU to only sleep down to a certain C-state and also force it to run at its highest frequency all the time. This does increase the power draw, although in the newest CPU models this penalty has been somewhat reduced. For this reason, it should be clear why it is recommended that you size your CPU to your workload: a 40-core server that is prevented from entering deeper C-states and is locked at its highest frequency will consume a lot of power.

To force Linux to only drop down to the C1 C-state, add this to your GRUB config:

    intel_idle.max_cstate=1
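
The exact procedure for adding a kernel parameter depends on your distribution. As a sketch, on a Debian- or Ubuntu-based system you would append it to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub and then regenerate the GRUB configuration before rebooting:

    # In /etc/default/grub, append the parameter to the existing line, for example:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1"

    # Then regenerate the GRUB configuration and reboot
    sudo update-grub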

Some Linux distributions have a performance mode that runs the CPUs at maximum frequency. However, the manual way to achieve this is to echo values into sysfs. Placing the following in /etc/rc.local will set all your cores to run at their maximum frequency on boot:

    echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
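
Alternatively, the same effect can usually be achieved by selecting the performance governor. This is a sketch assuming the cpupower utility is installed; with the intel_pstate driver the available governors are typically just performance and powersave:

    # Set the performance governor on all cores
    sudo cpupower frequency-set -g performance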

After you restart your OSD node, these changes should be in effect. Confirm this by running the following command:

    sudo cpupower monitor
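
The cpupower monitor output shows how much time each core spends in each C-state along with its current frequency. As a quick secondary check, you can also read the current frequency of each core straight from sysfs, assuming the standard cpufreq interface is present:

    # Current frequency of each core, in kHz
    grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq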

As mentioned earlier in the chapter, run a reference benchmark before making these changes and then run it again afterwards so that you can understand the gains made by this change.
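
As a purely illustrative example of such a comparison, a short small-block write test with rados bench will show the latency difference quite clearly; the pool name rbd here is an assumption, so substitute your own test pool or whichever benchmark you used earlier in the chapter:

    # 30-second, 4 KB write benchmark against a test pool (assumed to be called rbd)
    rados bench -p rbd 30 write -b 4096 -t 16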
