In this recipe, we will learn the commands that are used to monitor the overall Ceph cluster. The steps are explained topic-wise as follows.
To check the health of your cluster, use the ceph command followed by health as the command option:
# ceph health
The output of this command is divided into several sections separated by semicolons. The first section of the output shows that your cluster is in the warning state, HEALTH_WARN, as 64 placement groups (PGs) are degraded. The second section shows that 1408 PGs are not clean, and the third section shows that the cluster is recovering 1 out of 5744 objects and is 0.017% degraded. If your cluster is healthy, you will receive the output HEALTH_OK.
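As an illustration, the semicolon-separated summary is easy to split programmatically. A minimal sketch, assuming a hypothetical HEALTH_WARN string in the style described above (the sample values are invented, not real cluster output):

```python
# Split a hypothetical `ceph health` summary into its sections.
# The sample string below is illustrative, not captured from a real cluster.
sample = ("HEALTH_WARN 64 pgs degraded; 1408 pgs stuck unclean; "
          "recovery 1/5744 objects degraded (0.017%)")

# The first word is the overall state; the rest is a ;-separated list.
status, _, detail = sample.partition(" ")
sections = [s.strip() for s in detail.split(";")]

print(status)            # overall state, e.g. HEALTH_WARN or HEALTH_OK
for s in sections:
    print("-", s)
```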
To find out more details about your cluster's health, use the ceph health detail command. This command lists all the PGs that are not active and clean; that is, all the PGs that are unclean, inconsistent, or degraded will be listed here with their details. If your cluster is healthy, you will receive the output HEALTH_OK.
You can monitor cluster events using the ceph command with the -w option. This command displays all cluster event messages, including information (INF), warning (WRN), and error (ERR), in real time. The output of this command is a continuous stream of live cluster changes; you can use Ctrl + C to return to the shell:
# ceph -w
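To give a feel for the message levels, here is a small sketch that groups hypothetical cluster-log lines by their [INF]/[WRN]/[ERR] tags. The log lines are made up for illustration; only their tag format follows the style described above:

```python
import re

# Hypothetical cluster-log lines in the style printed by `ceph -w`.
events = [
    "2017-01-01 10:00:00.000000 mon.0 [INF] pgmap v1234: 1408 pgs: 1408 active+clean",
    "2017-01-01 10:00:05.000000 mon.0 [WRN] 64 pgs degraded",
    "2017-01-01 10:00:10.000000 mon.0 [ERR] 1 scrub errors",
]

# Bucket each line by its severity tag.
by_level = {"INF": [], "WRN": [], "ERR": []}
for line in events:
    m = re.search(r"\[(INF|WRN|ERR)\]", line)
    if m:
        by_level[m.group(1)].append(line)

print({level: len(lines) for level, lines in by_level.items()})
```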
There are other options, such as --watch-debug, --watch-warn, and --watch-error, that can be used with the ceph command to watch only events of a particular severity.
To know your cluster's space utilization statistics, use the ceph command with the df option. This command will show the total cluster size, the available size, the used size, and the percentage used. It will also display pool information, such as the pool name, ID, utilization, and the number of objects in each pool:
# ceph df
The output is as follows:
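The percentage in that output is simply the used raw capacity over the total raw capacity. A quick sketch of the arithmetic, using made-up capacity figures:

```python
# Hypothetical raw capacity figures, in MB, mimicking the totals
# that `ceph df` reports; the numbers are invented for illustration.
total_mb = 921600   # total raw capacity (~900 GB)
used_mb = 5184      # raw space consumed

pct_used = round(used_mb / total_mb * 100, 2)
avail_mb = total_mb - used_mb
print(f"{used_mb} MB used / {total_mb} MB total = {pct_used}% RAW USED")
```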
Checking the cluster's status is the most common and frequent operation when managing a Ceph cluster. You can check the status of your cluster using the ceph command with status as the option:
# ceph status
Instead of the status subcommand, you can also use the shorter -s option:
# ceph -s
The following screenshot shows the status of our cluster:
This command will dump a lot of useful information for your Ceph cluster:
mdsmap: the epoch version and the mdsmap status.
osdmap: the epoch, the total OSD count, and the up and in counts.
pgmap: the version, the total number of PGs, the pool count, the capacity in use for a single copy, and the total number of objects. It also displays information about cluster utilization, including the used size, free size, and total size. Finally, it displays the PG status.

Usually, a Ceph cluster is deployed with more than one MON instance for high availability. Since there are multiple monitors, they must reach a quorum for the cluster to function properly.
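The quorum requirement is a simple majority of the monitors. A sketch of the arithmetic:

```python
# A quorum needs more than half of the monitors: floor(n/2) + 1.
def quorum_size(num_mons: int) -> int:
    return num_mons // 2 + 1

for n in (1, 3, 5):
    print(f"{n} monitors -> quorum of {quorum_size(n)}, "
          f"tolerates {n - quorum_size(n)} failure(s)")
```

This is why monitors are deployed in odd numbers: going from 3 to 4 monitors raises the quorum size without tolerating any additional failures.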
We will now focus on Ceph commands for MON monitoring. The steps will be explained topic-wise as follows:
To display the cluster's MON status and the MON map, use the ceph command with either the mon stat or the mon dump suboption:
# ceph mon stat
# ceph mon dump
The following screenshot displays the output of this command:
To maintain a quorum between Ceph MONs, the cluster must always have more than half of its monitors available. Checking the quorum status of a cluster is very useful when troubleshooting MONs. You can check the quorum status using the ceph command with the quorum_status subcommand:
# ceph quorum_status -f json-pretty
The following screenshot displays the output of this command:
The quorum status displays election_epoch, which is the election version number, and quorum_leader_name, which denotes the hostname of the quorum leader. It also displays the MON map epoch and the cluster ID. Each cluster monitor is assigned a rank. For I/O operations, clients first connect to the quorum leader monitor; if the leader MON is unavailable, the client connects to the next-ranked monitor.
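Since quorum_status emits JSON, it is easy to post-process. A minimal sketch that parses a trimmed-down, hypothetical quorum_status document (the field names follow the output described above; the hostnames and values are invented):

```python
import json

# Hypothetical, heavily trimmed `ceph quorum_status -f json-pretty` output.
raw = """
{
  "election_epoch": 8,
  "quorum": [0, 1, 2],
  "quorum_leader_name": "ceph-node1",
  "monmap": {
    "epoch": 3,
    "mons": [
      {"rank": 0, "name": "ceph-node1"},
      {"rank": 1, "name": "ceph-node2"},
      {"rank": 2, "name": "ceph-node3"}
    ]
  }
}
"""

status = json.loads(raw)
leader = status["quorum_leader_name"]
ranks = {m["name"]: m["rank"] for m in status["monmap"]["mons"]}
print(f"leader: {leader} (rank {ranks[leader]}), "
      f"{len(status['quorum'])}/{len(ranks)} mons in quorum")
```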
Monitoring OSDs is a crucial task and requires a lot of attention, as there are many OSDs to monitor and take care of. The bigger your cluster, the more OSDs it has and the more rigorous the monitoring it requires. Generally, Ceph clusters host a lot of disks, so the chances of facing an OSD failure are quite high.
We will now focus on Ceph commands for OSD monitoring. The steps will be explained topic-wise as follows:
The OSD tree view is quite useful for knowing OSD statuses such as IN or OUT and UP or DOWN. It displays each node with all its OSDs and their location in the CRUSH map. You can check the tree view of OSDs using the following command:
# ceph osd tree
This command displays various useful information for Ceph OSDs, such as the weight, the UP/DOWN status, and the IN/OUT status. The output is neatly formatted as per your Ceph CRUSH map. If you maintain a big cluster, this format makes it easy to locate your OSDs and their hosting servers in a long list.
To check OSD statistics, use # ceph osd stat; this command will help you get the OSD map epoch, the total OSD count, and their IN and UP statuses.
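The one-line summary that ceph osd stat prints follows a fixed pattern, so it can be unpacked with a regular expression. A sketch against a made-up sample line (the epoch and counts are invented, in the style osdmap eNNN: X osds: Y up, Z in):

```python
import re

# Hypothetical `ceph osd stat` one-liner; values are illustrative.
sample = "osdmap e107: 9 osds: 9 up, 9 in"

m = re.match(r"osdmap e(\d+): (\d+) osds: (\d+) up, (\d+) in", sample)
epoch, total, up, in_ = map(int, m.groups())
print(f"epoch {epoch}: {up}/{total} up, {in_}/{total} in")
```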
To get detailed information about the Ceph cluster and OSD, execute the following command:
# ceph osd dump
This is a very useful command that will output the OSD map epoch and the pool details, including the pool ID, pool name, pool type (that is, replicated or erasure), CRUSH ruleset, and PGs. It will also display information for each OSD, such as the OSD ID, status, weight, last clean interval epoch, and so on. All this information is extremely helpful for cluster monitoring and troubleshooting.
You can also add a client to the OSD blacklist to prevent it from connecting to OSDs, so that no heartbeat process can take place. This is mostly used to prevent a lagging metadata server from making bad changes to data on the OSDs. Usually, blacklists are maintained by Ceph itself and shouldn't need manual intervention, but it's good to know about them.
To display blacklisted clients, execute the following command:
# ceph osd blacklist ls
We can query the CRUSH map directly with the ceph osd crush commands. This command-line utility can save a system administrator a lot of time compared to the conventional way of decompiling the CRUSH map before viewing and editing it:
# ceph osd crush dump
# ceph osd crush rule list
# ceph osd crush rule dump <crush_rule_name>
The following figure displays the output of our query crush map:
If you are managing a large Ceph cluster with several hundred OSDs, it's sometimes difficult to find the location of a specific OSD in the CRUSH map. It's also difficult if your CRUSH map contains multiple bucket hierarchies. You can use ceph osd find to search for an OSD and its location in the CRUSH map:
# ceph osd find <Numeric_OSD_ID>
OSDs store PGs, and each PG contains objects. The overall health of a cluster depends largely on its PGs. The cluster will remain in a HEALTH_OK status only if all the PGs are in the active + clean state. If your Ceph cluster is not healthy, chances are that the PGs are not active + clean. A placement group can exhibit multiple states, and even combinations of states. The following are some of the states that a PG can be in:
Creating: The PG is being created.
Peering: The process of bringing all of the OSDs that store a PG into agreement about the state of all the objects in that PG, including their metadata.
Active: Once the peering operation is complete, Ceph lists the PG as active. In the active state, the data in the PG is available on the primary PG and its replicas for I/O operations.
Clean: A clean state means that the primary and secondary OSDs have successfully peered and no PG has moved away from its correct location. It also means that the PGs are replicated the correct number of times.
Down: This means that a replica with necessary data is down, so the PG is offline.
Degraded: Once an OSD is DOWN, Ceph changes the state of all the PGs assigned to that OSD to DEGRADED. After the OSD comes UP, it has to peer again to make the degraded PGs clean. If the OSD remains DOWN and out for more than 300 seconds, Ceph recovers all the degraded PGs from their replica PGs to maintain the replication count. Clients can perform I/O even when PGs are in the degraded state.
Recovering: When an OSD goes DOWN, the contents of the PGs on that OSD fall behind the contents of the replica PGs on other OSDs. Once the OSD comes UP, Ceph initiates a recovery operation on these PGs to bring them up to date with the replica PGs on other OSDs.
Backfilling: As soon as a new OSD is added to the cluster, Ceph tries to rebalance the data by moving some PGs from other OSDs to the new OSD; this process is known as backfilling. Once backfilling is complete for a PG, the OSD can participate in client I/O.
Remapped: Whenever there is a change in a PG's acting set, data migration happens from the old acting set OSDs to the new acting set OSDs. This operation can take some time, depending on the amount of data being migrated to the new OSD. During this time, the old primary OSD of the old acting set serves client requests. As soon as the data migration completes, Ceph uses the new primary OSD from the acting set.

An acting set refers to the group of OSDs responsible for a PG. The primary OSD is the first OSD in the acting set and is responsible for the peering operation of each PG with its secondary/tertiary OSDs. It also handles write operations from clients. An OSD that is up remains in the acting set. Once the primary OSD is down, it is first removed from the up set; the secondary OSD is then promoted to primary.

Stale: Ceph OSDs report their statistics to the Ceph monitors every 0.5 seconds. If the primary OSD of a PG's acting set fails to report its statistics to the monitors, or if other OSDs report the primary OSD down, the monitors will consider those PGs stale.

You can monitor PGs using the following commands:
# ceph pg stat

The output of the pg stat command will display a lot of information in a specific format: vNNNN: X pgs: Y active+clean; R MB data, U MB used, F GB / T GB avail.
Here, vNNNN is the PG map version number, X is the total number of PGs, Y is the number of PGs in the active+clean state, R is the amount of data stored, U is the capacity used, and F and T are the free and total capacities of the cluster.

To dump detailed PG information, execute the following command:

# ceph pg dump -f json-pretty
This command will generate a lot of essential information with respect to PGs, such as the PG map version, PG ID, PG state, acting set, acting set primary, and so on. The output of this command can be huge depending on the number of PGs in your cluster.
To query a particular PG for detailed information, execute ceph pg <PG_ID> query with the PG ID, for example:

# ceph pg 2.7d query
To list the stuck PGs, execute ceph pg dump_stuck <unclean | inactive | stale>, for example:

# ceph pg dump_stuck unclean
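Because the pg stat format described earlier is fixed, it can also be unpacked with a regular expression. A sketch against a made-up sample line (all the numbers are invented for illustration):

```python
import re

# Hypothetical `ceph pg stat` one-liner in the documented format:
# vNNNN: X pgs: Y active+clean; R MB data, U MB used, F GB / T GB avail
sample = ("v8421: 1408 pgs: 1408 active+clean; "
          "5184 MB data, 5184 MB used, 916 GB / 921 GB avail")

pattern = (r"v(\d+): (\d+) pgs: (\d+) active\+clean; "
           r"(\d+) MB data, (\d+) MB used, (\d+) GB / (\d+) GB avail")
version, total, clean, data_mb, used_mb, free_gb, total_gb = \
    map(int, re.match(pattern, sample).groups())

print(f"pgmap v{version}: {clean}/{total} pgs active+clean, "
      f"{free_gb} GB of {total_gb} GB available")
```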
Metadata servers (MDS) are used only for CephFS, which is not production-ready as of now. The metadata server has several states, such as UP, DOWN, ACTIVE, and INACTIVE. While monitoring an MDS, you should make sure that its state is UP and ACTIVE. The following commands will help you get information related to the Ceph MDS:

# ceph mds stat
# ceph mds dump