OSDs in a Ceph cluster are the workhorses; they perform all the work at the bottom layer and store the user data. Monitoring OSDs is a crucial task that requires a lot of attention, as there are many OSDs to monitor and take care of. The bigger your cluster, the more OSDs it will have, and the more rigorous monitoring it requires. A Ceph cluster generally hosts a lot of disks, so the chances of an OSD failure are quite high. We will now focus on the Ceph commands for OSD monitoring.
The tree view of OSDs is quite useful for viewing OSD statuses such as IN or OUT and UP or DOWN. It displays each node with all its OSDs and their locations in the CRUSH map. You can check the tree view of OSDs using the following command:
# ceph osd tree
This will display the OSD tree for your cluster.
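The exact output depends on your CRUSH hierarchy and Ceph release; the following is an illustrative sample from a hypothetical three-node cluster with one OSD per node:
# id    weight  type name       up/down reweight
-1      3       root default
-2      1               host ceph-node1
0       1                       osd.0   up      1
-3      1               host ceph-node2
1       1                       osd.1   up      1
-4      1               host ceph-node3
2       1                       osd.2   up      1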
This command displays various useful pieces of information for Ceph OSDs, such as the weight, the UP/DOWN status, and the IN/OUT status. The output is neatly formatted as per your Ceph CRUSH map. If you maintain a big cluster, this format makes it easy to locate your OSDs and their hosting servers in a long list.
To check OSD statistics, use the following command; it will help you get the OSD map epoch, the total OSD count, and their IN and UP statuses:
# ceph osd stat
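For example, on a healthy cluster with nine OSDs, the output looks similar to the following illustrative sample, where e3187 is the OSD map epoch (the exact formatting varies between Ceph releases):
e3187: 9 osds: 9 up, 9 in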
To get detailed information about the Ceph cluster and OSD, execute the following command:
# ceph osd dump
This is a very useful command that will output the OSD map epoch, pool details, including the pool ID, pool name, and pool type (replicated or erasure), the CRUSH ruleset, and the number of placement groups.
This command will also display information such as OSD ID, status, weight, and a clean interval epoch for each OSD. This information is extremely helpful for cluster monitoring and troubleshooting.
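Since the full dump can be lengthy, you can filter out just the section you need with standard shell tools; for example, the following extracts only the pool definitions (the pool names shown will depend on your cluster):
# ceph osd dump | grep -i ^pool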
To display blacklisted clients, use the following command:
# ceph osd blacklist ls
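The blacklist can also be modified with the same command family; for example, to blacklist a client address for 600 seconds and then remove the entry again (the client address used here is hypothetical):
# ceph osd blacklist add 192.168.1.10:0/3011234 600
# ceph osd blacklist rm 192.168.1.10:0/3011234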
We can query the CRUSH map directly using the ceph osd crush commands. This command-line access to the CRUSH map can save a system administrator a lot of time compared to the manual way of viewing and editing a CRUSH map.
To view the CRUSH map, execute the following command:
# ceph osd crush dump
To view CRUSH map rules, execute:
# ceph osd crush rule list
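The output is a list of rule names; on a cluster created with default settings, it might look similar to this illustrative sample (your rule names will vary):
[
    "replicated_ruleset"]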
To view a detailed CRUSH rule, execute:
# ceph osd crush rule dump <crush_rule_name>
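For example, assuming a rule named replicated_ruleset exists on your cluster, the following command prints its definition as JSON, including fields such as the rule ID, rule name, minimum and maximum replica sizes, and the CRUSH steps:
# ceph osd crush rule dump replicated_ruleset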
If you are managing a large Ceph cluster with several hundred OSDs, it is sometimes difficult to find the location of a specific OSD in the CRUSH map, especially if your CRUSH map contains multiple bucket hierarchies. You can use the ceph osd find command to search for an OSD and its location in the CRUSH map:
# ceph osd find <Numeric_OSD_ID>
# ceph osd find 1
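The command returns the OSD's address and CRUSH location as JSON; the following is an illustrative sample in which the hostname and IP address are hypothetical (the exact fields vary by Ceph release):
{ "osd": 1,
  "ip": "192.168.1.102:6800/1788",
  "crush_location": { "host": "ceph-node2",
      "root": "default"}}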
OSDs store placement groups, and placement groups contain objects. The overall health of a cluster majorly depends on placement groups. The cluster will remain in a HEALTH_OK status only if all the placement groups are in the active + clean state. If your Ceph cluster is not healthy, chances are that the placement groups are not active + clean. Placement groups can exhibit multiple states, such as creating, peering, active, clean, degraded, recovering, backfilling, remapped, stale, and incomplete, and a placement group is often in a combination of these states at once.
You can monitor placement groups using the commands explained here. The following is the command to get a placement group status:
# ceph pg stat
The output of the pg stat command will display a lot of information in a specific format:
vNNNN: X pgs: Y active+clean; R bytes data, U MB used, F GB / T GB avail
The variables here are:
- vNNNN: The PG map version number
- X: The total number of placement groups
- Y: The number of placement groups in the active+clean state
- R: The amount of raw data stored
- U: The amount of storage used
- F: The amount of free storage capacity
- T: The total storage capacity of the cluster
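For example, on a small, healthy cluster, the output might look similar to this illustrative sample:
# ceph pg stat
v8752: 192 pgs: 192 active+clean; 5365 MB data, 16717 MB used, 393 GB / 409 GB avail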
To get a placement group list, execute:
# ceph pg dump
This command will generate a lot of essential information, such as the PG map version, PG ID, PG state, acting set, and acting set primary, with respect to placement groups. The output of this command can be huge depending on the number of PGs in your cluster.
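Because the output can run to thousands of lines, it is often convenient to redirect it to a file, or to request machine-readable output using the standard --format option of the ceph CLI (json-pretty support depends on your release):
# ceph pg dump > /tmp/pg_dump.txt
# ceph pg dump --format json-pretty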
To query a particular PG for detailed information, execute the following command, which has the ceph pg <PG_ID> query syntax:
# ceph pg 2.7d query
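Note that a PG ID is made up of the pool ID and the placement group suffix separated by a dot, so 2.7d refers to PG 7d in pool 2. The command returns a large JSON document; the following is a heavily truncated, illustrative sample:
{ "state": "active+clean",
  "up": [1, 0],
  "acting": [1, 0],
  ...
}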
To list stuck placement groups, execute the following command, which has the ceph pg dump_stuck <unclean | inactive | stale> syntax:
# ceph pg dump_stuck unclean
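Similarly, you can list the placement groups that are stuck in the other states:
# ceph pg dump_stuck inactive
# ceph pg dump_stuck stale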