Monitoring Ceph OSDs

OSDs in a Ceph cluster are the workhorses; they perform all the work at the bottom layer and store the user data. Monitoring OSDs is a crucial task that requires a lot of attention, as there are many OSDs to monitor and take care of. The bigger your cluster, the more OSDs it will have, and the more rigorous monitoring it requires. Generally, a Ceph cluster hosts a lot of disks, so the chances of OSD failure are quite high. We will now focus on the Ceph commands for OSD monitoring.

OSD tree view

The tree view of OSDs is quite useful for checking OSD statuses, such as IN or OUT and UP or DOWN. It displays each node with all of its OSDs and their location in the CRUSH map. You can check the tree view of OSDs using the following command:

# ceph osd tree

This will display the OSD tree view of the cluster.
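As a rough illustration, the output for a small three-node cluster might look something like the following; the exact column layout varies between Ceph releases, and the IDs, weights, and hostnames shown here are hypothetical:

ID  WEIGHT   TYPE NAME            UP/DOWN  REWEIGHT
-1  0.08997  root default
-2  0.02999      host ceph-node1
 0  0.02999          osd.0        up       1.00000
-3  0.02999      host ceph-node2
 1  0.02999          osd.1        up       1.00000
-4  0.02999      host ceph-node3
 2  0.02999          osd.2        up       1.00000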

This command displays various pieces of useful information for Ceph OSDs, such as the weight, the UP/DOWN status, and the IN/OUT status. The output is formatted according to your Ceph CRUSH map. If you maintain a big cluster, this format will help you locate your OSDs and their hosting servers from a long list.

OSD statistics

To check OSD statistics, execute the following command:

# ceph osd stat

This command will help you get the OSD map epoch, the total OSD count, and their IN and UP statuses.
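A minimal sketch of what this returns, with a hypothetical map epoch and OSD counts matching the small cluster shown earlier:

osdmap e566: 3 osds: 3 up, 3 in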

To get detailed information about the Ceph cluster and OSD, execute the following command:

# ceph osd dump

This is a very useful command that will output the OSD map epoch and the pool details, including the pool ID, pool name, pool type (replicated or erasure), the CRUSH ruleset, and the number of placement groups.

This command will also display information such as OSD ID, status, weight, and a clean interval epoch for each OSD. This information is extremely helpful for cluster monitoring and troubleshooting.
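Because the dump can be quite long, it is often convenient to filter it with standard shell tools; for example, to look at a single OSD entry or only at the pool lines (osd.1 here is just an example ID):

# ceph osd dump | grep -i osd.1
# ceph osd dump | grep -i pool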

To display blacklisted clients, use the following command:

# ceph osd blacklist ls

Checking the CRUSH map

We can query a CRUSH map directly with the ceph osd commands. The CRUSH map command-line utilities can save a system administrator a lot of time compared to the conventional way of manually viewing and editing a CRUSH map.

To view the CRUSH map, execute the following command:

# ceph osd crush dump

To view CRUSH map rules, execute:

# ceph osd crush rule list

To view a detailed CRUSH rule, execute:

# ceph osd crush rule dump <crush_rule_name>
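For example, assuming your cluster still carries a default replicated rule (commonly named replicated_ruleset; the rule names on your cluster may differ, so check the output of the list command first), you would run:

# ceph osd crush rule list
# ceph osd crush rule dump replicated_ruleset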

If you are managing a large Ceph cluster with several hundred OSDs, it's sometimes difficult to find the location of a specific OSD in the CRUSH map. It's also difficult if your CRUSH map contains multiple bucket hierarchies. You can use ceph osd find to search for an OSD and its location in the CRUSH map:

# ceph osd find <Numeric_OSD_ID>
# ceph osd find 1
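The command returns a small JSON document; a sketch of its shape is shown below, with a hypothetical IP address and host name:

{ "osd": 1,
  "ip": "192.168.57.102:6800/4764",
  "crush_location": { "host": "ceph-node2",
      "root": "default"}}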

Monitoring placement groups

OSDs store placement groups, and placement groups contain objects. The overall health of a cluster depends largely on the placement groups. The cluster will remain in a HEALTH_OK status only if all the placement groups are in the active + clean state. If your Ceph cluster is not healthy, chances are that the placement groups are not active + clean. Placement groups can exhibit multiple states (a quick way to check which PGs are not yet active + clean is shown after the following list):

  • Peering: During peering, the OSDs that are in the acting set of a placement group, that is, the OSDs storing a replica of that placement group, come into agreement about the state of the objects and their metadata in the PG. Once peering is completed, the OSDs that store the PG agree about its current state.
  • Active: Once the peering operation is complete, Ceph makes the PG active. In the active state, the data in the PG is available in the primary PG and its replicas for I/O operations.
  • Clean: In the clean state, the primary and secondary OSDs have successfully peered, no PG has moved away from its correct location, and the objects are replicated the correct number of times.
  • Degraded: Once an OSD goes down, Ceph marks all the PGs assigned to that OSD as degraded. After the OSD comes back UP, it has to peer again to make the degraded PGs clean. If the OSD remains down and out for more than 300 seconds, Ceph recovers all the degraded PGs from their replica PGs to maintain the replication count. Clients can perform I/O even while PGs are in the degraded state. There is one more reason why a placement group can become degraded: when one or more objects inside a PG become unavailable. Ceph assumes that the object should be in the PG, but it is not actually available. In such cases, Ceph marks the PG as degraded and tries to recover the PG from its replica.
  • Recovering: When an OSD goes down, the contents of its placement groups fall behind the contents of their replica PGs on other OSDs. Once the OSD comes back UP, Ceph initiates a recovery operation on these PGs to bring them up to date with their replica PGs on other OSDs.
  • Remapped: Whenever there is a change in a PG's acting set, data migration happens from the old acting set's OSDs to the new acting set's OSDs. This operation can take some time, depending on the amount of data that has to be migrated to the new OSDs. During this time, the old primary OSD of the old acting set serves the client requests. As soon as the data migration completes, Ceph uses the new primary OSD from the new acting set.
  • Backfilling: As soon as a new OSD is added to the cluster, Ceph tries to rebalance the data by moving some PGs from other OSDs to the new OSD; this process is known as backfilling. Once backfilling is completed for a placement group, the OSD can participate in client I/O. Ceph performs backfilling smoothly in the background and makes sure not to overload the cluster.
  • Stale: Ceph OSDs report their statistics to the Ceph monitors every 0.5 seconds; if, for any reason, the primary OSD of a placement group's acting set fails to report its statistics to the monitors, or if other OSDs report the primary OSD as down, the monitors will consider those PGs as stale.
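While the cluster is working through these states, a quick way to see which placement groups are not yet active + clean, together with their current states, is to run:

# ceph health detail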

You can monitor placement groups using the commands explained here. The following is the command to get a placement group status:

# ceph pg stat

The output of the pg stat command will display a lot of information in a specific format:

vNNNN: X pgs: Y active+clean; R bytes data, U MB used, F GB / T GB avail

The variables here are:

  • vNNNN: This is the PG map version number
  • X: This is the total number of placement groups
  • Y: This is the number of PGs in the listed state (active+clean here)
  • R: This specifies the amount of data stored (before replication)
  • U: This specifies the raw storage used after replication
  • F: This is the remaining free capacity
  • T: This is the total capacity
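Putting these together, a sample line such as the following (the numbers are purely hypothetical) would mean that the PG map is at version 8206, all 192 PGs are active + clean, 1473 MB of data is stored, 4285 MB of raw space is used after replication, and 131 GB out of a total of 135 GB is still available:

v8206: 192 pgs: 192 active+clean; 1473 MB data, 4285 MB used, 131 GB / 135 GB avail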

To get a placement group list, execute:

# ceph pg dump

This command will generate a lot of essential information, such as the PG map version, PG ID, PG state, acting set, and acting set primary, with respect to placement groups. The output of this command can be huge depending on the number of PGs in your cluster.
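Because of this, it is common to redirect the dump to a file or filter it for the states you are interested in; for example, to save the full dump and then list only the PGs that are currently in a degraded state (degraded is just an example state string):

# ceph pg dump > /tmp/pg_dump.txt
# ceph pg dump | grep -i degraded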

To query a particular PG for detailed information, execute the following command, which uses the ceph pg <PG_ID> query syntax:

# ceph pg 2.7d query
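The query returns a detailed JSON document; a heavily trimmed sketch of its shape is shown below (the values are hypothetical, and the ... marks sections that have been left out):

{ "state": "active+clean",
  "up": [1, 0],
  "acting": [1, 0],
  ...
  "recovery_state": [ ... ]}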

To list stuck placement groups, execute the following command, which uses the ceph pg dump_stuck <unclean | inactive | stale> syntax:

# ceph pg dump_stuck unclean
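Similarly, you can use the other state keywords from the syntax above to list placement groups that are stuck inactive or stale:

# ceph pg dump_stuck inactive
# ceph pg dump_stuck stale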