Splitting PGs

A filesystem has a limit on the number of files that can be stored in a directory before performance starts to degrade, particularly when it is asked to list the directory's contents:

  • Ceph stores millions of objects per disk, and each of these objects is simply a file, so it spreads them across a nested directory structure to limit the number of files placed in any single directory.
  • As the number of objects in the cluster increases, so does the number of files per directory.
  • When the number of files in these directories grows beyond a configured threshold, Ceph splits the directory into further subdirectories and migrates the objects into them (a sketch of the resulting on-disk layout follows this list).

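As an illustration, a filestore OSD's on-disk layout for one PG after a couple of splits might look roughly like the following sketch. The OSD path, PG ID, and directory names here are purely illustrative; the exact nesting depends on the object hashes:

    /var/lib/ceph/osd/ceph-0/current/2.1a_head/
        DIR_A/
            DIR_1/
                DIR_0/
                    <object files>
                DIR_1/
                    <object files>
                ...
                DIR_F/
                    <object files>

Each split pushes the objects one level deeper, so a long-lived, well-filled PG ends up with a noticeably deeper tree than a freshly created one.
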
This operation can carry a significant performance penalty when it occurs. Furthermore, XFS tries to place files that are in the same directory close together on the disk. When PG splitting occurs, the XFS filesystem can become fragmented, leading to further performance degradation.

By default, Ceph will split a PG's directories once they contain 320 files each. An 8 TB disk in a Ceph cluster configured with the recommended number of PGs per OSD will likely hold over 5,000 objects per PG. Such a PG will have gone through several split operations during its lifetime, resulting in a deeper and more complex directory structure.

As mentioned in the VFS cache pressure section, the kernel tries to cache dentries to avoid costly lookups. PG splitting increases the number of directories that need to be cached, and there may not be enough memory to cache them all, leading to poorer performance.

A common approach to this problem is to increase the allowed number of files in each directory by adjusting the following OSD configuration options:

filestore_split_multiple

filestore_merge_threshold

Together, these two options determine the threshold at which Ceph will split a PG's directory:

filestore_split_multiple * abs(filestore_merge_threshold) * 16

With the defaults of 2 and 10, this works out to the 320 files per directory mentioned earlier.
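
To confirm the values an OSD is currently running with, you can query its admin socket on the node that hosts it; osd.0 below is only an example ID:

    # Run on the host where the OSD lives; osd.0 is an example
    ceph daemon osd.0 config get filestore_split_multiple
    ceph daemon osd.0 config get filestore_merge_threshold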

Care should be taken, however. Although increasing the threshold will reduce how often PG splitting occurs and keep the directory structure simpler, when a split does occur, it will have far more objects to move. The greater the number of objects that need to be split, the greater the impact on performance, which may even lead to OSDs timing out. There is a trade-off between split frequency and split duration; the defaults may be slightly on the conservative side, especially with larger disks.

Doubling or tripling the split threshold can probably be done safely without too much concern; larger values should be tested with the cluster under I/O load before being put into production.
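
As a rough sketch only, assuming the split formula above, doubling the default threshold from 320 to 640 files per directory could be expressed in ceph.conf as follows. The values are examples, not a recommendation, and the OSDs need to be restarted to pick them up:

    [osd]
    # 4 * abs(10) * 16 = 640 files per directory before a split is triggered
    filestore_split_multiple = 4
    filestore_merge_threshold = 10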
