PG distributions

While not strictly a performance tuning option, ensuring an even PG distribution across your Ceph cluster is an essential task that should be undertaken during the early stages of your cluster's deployment. Because Ceph uses CRUSH to pseudo-randomly determine where to place data, it will not always balance PGs equally across every OSD. A Ceph cluster that is not balanced will be unable to take full advantage of its raw capacity, as the most oversubscribed OSD effectively becomes the capacity limit.

An unevenly balanced cluster also means that a higher number of requests will be targeted at the OSDs holding the most PGs. These OSDs will then place an artificial performance ceiling on the cluster, especially if the cluster is composed of spinning disk OSDs.

To rebalance PGs across a Ceph cluster, you simply have to reweight OSDs so that CRUSH adjusts how many PGs are stored on each. It's important to note that, by default, the reweight value of every OSD is 1.0, and you cannot reweight an underutilized OSD above 1.0 to increase its utilization. The only option is to decrease the reweight value of over-utilized OSDs, which will move PGs to the less utilized OSDs.

It is also important to understand the difference between the CRUSH weight of an OSD and its reweight value. The reweight value is used as an override to correct misplacement caused by the CRUSH algorithm. The reweight command only affects the OSD itself and does not affect the weight of the bucket (for example, the host) that it is a member of. It is also reset to 1.0 when the OSD restarts. While this can be frustrating, it's important to understand that any future modification to the cluster, be it increasing the number of PGs or adding additional OSDs, would likely have made any reweight value incorrect anyway. Therefore, reweighting OSDs should not be looked at as a one-time operation, but as something that is done continuously and adjusts to changes in the cluster, as sketched below.
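To illustrate the distinction, here is a minimal sketch; osd.3 and the weight values are purely illustrative:

    # Permanently change the CRUSH weight of osd.3 (normally set to reflect the disk size);
    # this also changes the weight of the host bucket that contains it
    ceph osd crush reweight osd.3 1.81898

    # Temporarily override placement for osd.3 only; this value is reset to 1.0
    # when the OSD restarts and does not change the host bucket's weight
    ceph osd reweight 3 0.95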

To reweight an OSD, use the following command:

    ceph osd reweight <osd number> <weight value 0.0-1.0>

Once executed, Ceph will start backfilling to move PGs to their newly assigned OSDs.
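Before reweighting, you can check the current utilization and PG count of each OSD to decide which ones to adjust. The OSD number and weight in this sketch are illustrative only:

    # List per-OSD utilization and PG counts to identify over-utilized OSDs
    ceph osd df

    # Reduce the override weight of an over-utilized OSD, for example osd.12 to 90%
    ceph osd reweight 12 0.9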

Of course, searching through all your OSDs to find those that need reweighting, and then running this command for each one, would be a very lengthy process. Luckily, there is another Ceph tool that can automate a large part of this process:

    ceph osd reweight-by-utilization <threshold> <max change> <number of OSDs>

This command will compare all the OSDs in your cluster and adjust the override weighting of the N most over-utilized OSDs that exceed the threshold value, where N is controlled by the last parameter. You can also limit the maximum change applied to each OSD by specifying the second parameter; 0.05, or 5%, is the normally recommended figure.

There is also a test-reweight-by-utilization command, which allows you to see what the command would do before running it.
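For example, a dry run followed by the real adjustment might look like this; the threshold of 120%, the 5% maximum change, and the limit of 10 OSDs are illustrative values, not a recommendation:

    # Dry run: show which OSDs would be reweighted and by how much
    ceph osd test-reweight-by-utilization 120 0.05 10

    # Apply the changes once the proposed adjustments look sensible
    ceph osd reweight-by-utilization 120 0.05 10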

While this command is safe to use, there are a number of things that should be taken into consideration before running it:

  • It has no concept of different pools on different OSDs. If, for example, you have an SSD tier and an HDD tier, the reweight-by-utilization command will still try to balance data across all OSDs. If your SSD tier is not as full as the HDD tier, the command will not work as expected. If you wish to balance OSDs confined to a single bucket, look into the script version of this command created by CERN.
  • It is possible to reweight the cluster to the point where CRUSH is unable to determine placement for some PGs. If recovery halts and one or more PGs are left in a remapped state, this is likely what has happened. Simply increase or reset the reweight values to fix it.

Once you are confident with the operation of the command, it is possible to schedule it via cron so that your cluster is kept in a more balanced state automatically.
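As a sketch, an entry such as the following in root's crontab would run the rebalance once a day at 01:00; the schedule and parameters are illustrative and should be tuned to your cluster:

    # Run reweight-by-utilization daily at 01:00 with a 5% maximum change across at most 10 OSDs
    0 1 * * * /usr/bin/ceph osd reweight-by-utilization 120 0.05 10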
