Placement groups

When a Ceph cluster receives requests for data storage, it splits the data into sections known as placement groups (PGs). Data is first broken down into a set of objects, and CRUSH then generates a placement group ID for each object based on a hash of the object name, the replication level, and the total number of placement groups in the system. A placement group is a logical collection of objects that are replicated across OSDs to provide reliability in the storage system. Depending on the replication level of your Ceph pool, each placement group is replicated and distributed over more than one OSD of the Ceph cluster. You can think of a placement group as a logical container holding multiple objects, with this logical container mapped to multiple OSDs. Placement groups are essential for the scalability and performance of a Ceph storage system.
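You can see this mapping in practice by asking the cluster which placement group and OSDs a given object name maps to. A minimal sketch, assuming a pool named data and an arbitrary, illustrative object name:

    # ceph osd map data my-object

The command prints the PG ID that CRUSH computes for that object name, along with the up and acting OSD sets of the resulting PG.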


Without placement groups, it would be difficult to manage and track tens of millions of objects that are replicated and spread over hundreds of OSDs. Managing every object individually would also carry a computational penalty; instead, the system manages placement groups, each containing numerous objects. This makes Ceph more manageable and less complex. Each placement group requires some system resources, CPU and memory, since every placement group has to manage multiple objects. The number of placement groups in a cluster should therefore be calculated meticulously. Usually, increasing the number of placement groups in your cluster reduces the per-OSD load, but the increase should always be done in a regulated way; 50 to 100 placement groups per OSD is recommended to avoid high resource utilization on an OSD node. As your data needs grow, you will need to scale your cluster up by adjusting the placement group count. When devices are added to or removed from the cluster, most of the placement groups remain in place; CRUSH manages the relocation of placement groups across the cluster.
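To get a rough feel for the PG-per-OSD ratio on a running cluster, you can compare the total PG and OSD counts; a minimal sketch (the exact output format varies between Ceph versions):

    # ceph pg stat
    # ceph osd stat

Since every PG is stored on several OSDs, the number of PGs per OSD is roughly the total number of PGs multiplied by the replication size and divided by the number of OSDs.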

Note

PGP is the total number of placement groups for placement purposes. This should be equal to the total number of placement groups.

Calculating PG numbers

Deciding on the correct number of placement groups is an essential step in building an enterprise-class Ceph storage cluster. The placement group count can improve or degrade storage performance to a certain extent.

The formula to calculate the total number of placement groups for a Ceph cluster is:

Total PGs = (Total_number_of_OSD * 100) / max_replication_count

This result must be rounded up to the nearest power of 2. For example, if a Ceph cluster has 160 OSDs and the replication count is 3, the total number of placement groups comes to 5333.3, and rounding this value up to the nearest power of 2 gives a final value of 8192 PGs.
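As a quick sketch of this arithmetic on the command line (the shell performs integer division, and the Python one-liner is just one convenient way to round up to the nearest power of 2, assuming Python is available):

    # echo $(( 160 * 100 / 3 ))
    5333
    # python -c 'n=5333; print(2 ** (n - 1).bit_length())'
    8192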

We should also calculate the total number of PGs per pool in the Ceph cluster. The formula for this is as follows:

Total PGs = ((Total_number_of_OSD * 100) / max_replication_count) / pool count

We will consider the same example that we used earlier: the total number of OSDs is 160, the replication level is 3, and the total number of pools is 3. Based on these values, the formula gives 1777.7; rounding it up to the nearest power of 2 gives 2048 PGs per pool.
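The same arithmetic on the command line, as a sketch:

    # echo $(( 160 * 100 / 3 / 3 ))
    1777
    # python -c 'n=1777; print(2 ** (n - 1).bit_length())'
    2048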

It's important to balance the number of PGs per pool with the number of PGs per OSD in order to reduce the variance across OSDs and avoid a slow recovery process.

Modifying PG and PGP

If you manage a Ceph storage cluster, you might need to change the PG and PGP counts for your pools at some point. Before proceeding with PG and PGP modification, let's understand what PGP is.

PGP stands for Placement Group for Placement purposes, and it should be kept equal to the total number of placement groups (pg_num). For a Ceph pool, if you increase the number of placement groups, that is, pg_num, you should also increase pgp_num to the same integer value as pg_num so that the cluster can start rebalancing. The underlying rebalancing mechanism can be understood in the following way.

The pg_num value defines the number of placement groups, which are mapped to OSDs. When pg_num is increased for any pool, every PG of this pool splits in half, but all of them remain mapped to their parent OSD. Up to this point, Ceph does not start rebalancing. When you then increase the pgp_num value for the same pool, PGs start to migrate from the parent to other OSDs, and cluster rebalancing starts. In this way, PGP plays an important role in cluster rebalancing. Now, let's learn how to change pg_num and pgp_num:

  1. Check the existing PG and PGP numbers:
    # ceph osd pool get data pg_num
    # ceph osd pool get data pgp_num
    
  2. Check the pool replication level by executing the following command, and look for the rep size value:
    # ceph osd dump | grep size
    
  3. Calculate the new placement group count for our setup using the formula from the previous section:
    Total OSDs = 9, replication level (rep size) = 2, pool count = 3

    Based on this formula, the placement group count for each pool comes to 150; rounding it up to the next power of 2 gives us 256.

  4. Modify the PG and PGP for the pool:
    # ceph osd pool set data pg_num 256
    # ceph osd pool set data pgp_num 256
    
  5. Similarly, modify the PG and PGP numbers for the metadata and rbd pools, as sketched after this list.
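A minimal sketch of step 5, assuming the default pool names metadata and rbd and the same target count of 256 calculated above:

    # ceph osd pool set metadata pg_num 256
    # ceph osd pool set metadata pgp_num 256
    # ceph osd pool set rbd pg_num 256
    # ceph osd pool set rbd pgp_num 256

You can rerun the ceph osd pool get commands from step 1 against each pool to confirm the new values.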

PG peering, up and acting sets

A Ceph OSD daemon performs a peering operation, which brings the OSDs that store a placement group into agreement about the state of all objects (and their metadata) in that PG. A Ceph storage cluster stores multiple copies of any object across multiple PGs, which are in turn stored on multiple OSDs. These OSDs are referred to as primary, secondary, tertiary, and so on. An acting set is the group of OSDs currently responsible for a PG. The primary OSD is the first OSD in the acting set and is responsible for the peering operation of each of its PGs with the secondary/tertiary OSDs; it is also the only OSD that accepts write operations from clients. An OSD that is up remains in the up set and the acting set. Once the primary OSD goes down, it is first removed from the up set, and the secondary OSD is then promoted to primary. Ceph recovers the PGs of the failed OSD onto a new OSD and adds it to the up and acting sets to ensure high availability.

In a Ceph cluster, an OSD can be the primary OSD for some PGs, while at the same time, it's the secondary or tertiary OSD for other PGs.

[Figure: an example PG whose acting set contains osd.24, osd.72, and osd.11]

In the preceding example, the acting set contains three OSDs (osd.24, osd.72, and osd.11). Out of these, osd.24 is the primary OSD, and osd.72 and osd.11 are the secondary and tertiary OSDs, respectively. Since osd.24 is the primary OSD, it takes care of the peering operation for all the PGs that are on these three OSDs. In this way, Ceph makes sure PGs are always available and consistent.
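To inspect the up and acting sets on your own cluster, you can use commands along the following lines; the PG ID 3.1f is only a placeholder, so substitute one of the IDs listed by ceph pg dump:

    # ceph pg dump
    # ceph pg map 3.1f
    # ceph pg 3.1f query

ceph pg dump lists every PG together with its up and acting sets, ceph pg map prints the mapping for a single PG, and ceph pg query shows the detailed peering state of that PG.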
