PG Splitting

A filesystem has a limit on the number of files that can be stored in a directory before performance starts to degrade when it is asked to list the contents. As Ceph stores millions of objects per disk, each of which is just a file, it splits the files across a nested directory structure to limit the number of files placed in each directory. As the number of objects in the cluster increases, so does the number of files per directory. When the number of files in these directories exceeds the limit, Ceph splits the directory into further subdirectories and migrates the objects into them. This operation can incur a significant performance penalty when it occurs. Furthermore, XFS tries to place files that live in the same directory close together on the disk, so when PG splitting occurs, fragmentation of the XFS filesystem can result, leading to further performance degradation.
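As a rough illustration (the PG ID and hash nibbles below are made up for the example), an object in a filestore OSD that has gone through several splits ends up nested several levels deep beneath the PG's _head directory:

    /var/lib/ceph/osd/ceph-0/current/0.1f_head/
        DIR_F/
            DIR_1/
                DIR_A/
                    <object files>

Each split pushes the objects one DIR_<hex> level deeper, so every lookup has one more directory to traverse.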

By default, Ceph will split a PG when it contains 320 objects. An 8 TB disk in a Ceph cluster configured with the recommended number of PGs per OSD will likely have over 5000 objects per PG. Such a PG will have gone through several split operations in its lifetime, resulting in a deeper and more complex directory structure.
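As a back-of-the-envelope check (assuming the default 4 MB RBD object size and roughly 200 PGs per OSD, which are assumptions for the example rather than figures from this chapter):

    8 TB / 4 MB per object      ≈ 2,000,000 objects per disk
    2,000,000 objects / 200 PGs ≈ 10,000 objects per PG

This is comfortably above the 5000-object figure quoted above.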

As mentioned previously in the VFS cache pressure section, the kernel tries to cache dentries to avoid costly lookups. PG splitting increases the number of directories that need to be cached, and there may not be enough memory to cache them all, leading to poorer performance.

A common approach to this problem is to increase the allowed number of files in each directory by adjusting two OSD configuration options:

filestore_split_multiple

filestore_merge_threshold

Together, these two options set the threshold at which Ceph will split a PG directory, according to the following formula:

filestore_split_multiple * abs(filestore_merge_threshold) * 16

With the default values of 2 and 10, this gives the 320-object threshold mentioned previously.
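As a minimal sketch, the following ceph.conf snippet (the values are illustrative only and should be tested as described below) would double the split threshold from the default 320 to 640 objects:

    [osd]
    filestore_split_multiple = 4
    filestore_merge_threshold = 10

Plugging these into the formula gives 4 * 10 * 16 = 640 objects per directory before a split is triggered.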

Care should be taken: although increasing the threshold will reduce the occurrences of PG splitting and the complexity of the directory structure, when a PG split does occur, it will have far more objects to split. The greater the number of objects that need to be split, the bigger the impact on performance, which may even lead to OSDs timing out. There is a trade-off between split frequency and split time; the defaults may be slightly on the conservative side, especially with larger disks.

Doubling or tripling the split threshold can probably be done safely without too much concern; greater values should be tested with the cluster under I/O load before the change is put into production.
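To confirm the values an OSD is currently running with, they can be read back through the admin socket, for example (assuming osd.0 is local to the node on which you run the commands):

    ceph daemon osd.0 config get filestore_split_multiple
    ceph daemon osd.0 config get filestore_merge_threshold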
