Chapter 8. Monitoring Your Ceph Cluster

Monitoring a Ceph cluster is one of the prime responsibilities of a Ceph storage administrator. It plays a vital role in troubleshooting the cluster and in fixing problems when the cluster is unhealthy.

In this chapter, we will cover the following topics:

  • Monitoring a Ceph cluster
  • Monitoring MON and MDS
  • Monitoring OSD and PG
  • Open source dashboards for Ceph such as Kraken, ceph-dash, and Calamari

Monitoring a Ceph cluster

Monitoring is one of the most important responsibilities of a storage administrator. System monitoring usually comes after cluster design, deployment, and service implementation. As a storage administrator, you need to keep an eye on your Ceph storage cluster and know what is going on at any given time. Regular and disciplined monitoring keeps you up to date with your cluster's health, and monitoring notifications give you extra time to take the necessary action before a service outage occurs. Monitoring a Ceph cluster is an everyday task that covers MON, OSD, MDS, and PG; storage provisioning services such as RBD, radosgw, and CephFS; and Ceph clients. Ceph comes with a rich set of native command-line tools and APIs to monitor these components. In addition, there are open source projects developed specifically to monitor Ceph clusters from a single-view GUI dashboard.

Monitoring has a wider scope that should not be limited to the software layer of Ceph. It should extend to the underlying infrastructure, including the hardware, networking, and other related systems that power your Ceph cluster. The manufacturers of these hardware systems usually provide a rich monitoring interface, which may or may not involve additional cost. We recommend that you use such tools for monitoring at the infrastructure level. Remember that the more stable your underlying infrastructure is, the better the results you can get out of your Ceph cluster. We will now focus on Ceph-based monitoring tools as well as some other open source projects for monitoring. Ceph comes with powerful CLI tools for cluster monitoring and troubleshooting; the main one is the ceph tool.

Checking cluster health

To check the health of your cluster, use the ceph command followed by health as the command option:

# ceph health

The output of this command will be divided into several sections separated by semicolons:

[Screenshot: output of the ceph health command]

The first section of the output shows that your cluster is in the warning state, HEALTH_WARN, because 64 placement groups (PGs) are degraded. The second section shows that 1,408 PGs are not clean, and the third section shows that cluster recovery is in progress for one out of 5,744 objects and that the cluster is 0.017 percent degraded. If your cluster is healthy, the output is simply HEALTH_OK.
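Because the summary is a single line of semicolon-separated sections that begins with a status keyword, it is easy to process in a script. The following is a minimal sketch, not an official Ceph utility: the function names split_health and health_exit_code are hypothetical, and the mapping of statuses to exit codes 0 to 3 follows a Nagios-style convention that we assume here for illustration.

```shell
#!/bin/sh
# Print each semicolon-separated section of a health summary on its own line.
# $1 = the summary string, for example the output of `ceph health`.
split_health() {
    printf '%s\n' "$1" | tr ';' '\n' | sed 's/^ *//'
}

# Map the leading status keyword to an exit code.
# The 0/1/2/3 mapping is an assumed, Nagios-style convention.
health_exit_code() {
    case "$1" in
        HEALTH_OK*)   echo 0 ;;
        HEALTH_WARN*) echo 1 ;;
        HEALTH_ERR*)  echo 2 ;;
        *)            echo 3 ;;   # unknown or empty summary
    esac
}

# Usage against a live cluster (requires the ceph CLI and a valid keyring):
#   summary=$(ceph health)
#   split_health "$summary"
#   exit "$(health_exit_code "$summary")"
```

With a healthy cluster, health_exit_code "HEALTH_OK" prints 0, so a cron job or monitoring agent can act on the code without parsing the rest of the line.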

To see the health details of your Ceph cluster, use the ceph health detail command. It lists every placement group that is not active and clean, that is, all PGs that are unclean, inconsistent, or degraded, together with their details. On a healthy cluster, the output is again HEALTH_OK. The following screenshot shows the health details of a Ceph cluster:

[Screenshot: output of the ceph health detail command]

Watching cluster events

You can monitor cluster events using the ceph command with the -w option. This command displays all cluster events, including INF (information), WRN (warning), and ERR (error) messages, in real time. It generates a continuous stream of live cluster changes; press Ctrl + C to return to the shell:

# ceph -w
[Screenshot: output of the ceph -w command]

There are other options as well that can be used with the ceph command to gather different types of event details:

  • --watch-debug: This is used to watch debug events
  • --watch-info: This is used to watch info events
  • --watch-sec: This is used to watch security events
  • --watch-warn: This is used to watch warning events
  • --watch-error: This is used to watch error events
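If you prefer to keep the full stream from ceph -w and narrow it down yourself, a small filter can do the job. The sketch below is illustrative only: filter_events is a hypothetical helper name, and it assumes that each event line contains its severity as the literal token INF, WRN, or ERR, as described above.

```shell
#!/bin/sh
# Read event lines on standard input and keep only warnings and errors.
# Assumes the severity appears in each line as the token WRN or ERR.
filter_events() {
    grep -E 'WRN|ERR'
}

# Usage against a live cluster (requires the ceph CLI):
#   ceph -w | filter_events
```

Filtering on the client side keeps one command for every severity mix, whereas the --watch-* options let the ceph tool select the events for you.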

Cluster utilization statistics

To view your cluster's space utilization statistics, use the ceph command with the df option. This command shows the total cluster size, the available size, the used size, and the percentage used. It also displays per-pool information such as the pool name, ID, utilization, and number of objects:

# ceph df
[Screenshot: output of the ceph df command]
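The used and total sizes reported by ceph df are enough to build a simple capacity alert. The following sketch shows only the arithmetic; pct_used and over_limit are hypothetical helper names, the sizes are assumed to be supplied in the same unit with a nonzero total, and how you extract them from the ceph df output is left open because the layout depends on your Ceph release.

```shell
#!/bin/sh
# Percentage of capacity in use, to one decimal place.
# $1 = used size, $2 = total size (same unit for both, total must be nonzero).
pct_used() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.1f\n", used * 100 / total }'
}

# Succeeds (exit status 0) when usage is at or above the limit.
# $1 = used size, $2 = total size, $3 = limit in percent.
over_limit() {
    awk -v used="$1" -v total="$2" -v limit="$3" \
        'BEGIN { exit !(used * 100 / total >= limit) }'
}

# Usage with sizes read from the ceph df output (values here are made up):
#   if over_limit 850 1000 80; then
#       echo "cluster is $(pct_used 850 1000)% full"
#   fi
```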

Checking the cluster status

Checking the cluster status is the most common and frequent operation when managing a Ceph cluster. You can check the status of your cluster using the ceph command and the status option. Instead of the status subcommand, you can also use a shorter version, -s, as an option:

# ceph status

Alternatively, you can use:

# ceph -s

The following screenshot shows the status of our cluster:

[Screenshot: output of the ceph -s command]

This command displays a lot of useful information about your Ceph cluster. The fields are as follows:

  • cluster: This represents the unique ID of the Ceph cluster.
  • health: This shows the cluster health.
  • monmap: This represents the monitor map epoch version, information, election epoch version, and quorum status.
  • mdsmap: This represents the mdsmap epoch version and status.
  • osdmap: This represents the osdmap epoch and the number of OSDs that are up and in.
  • pgmap: This shows the pgmap version, total number of PGs, pool count, and total objects. It also displays information about cluster utilization, including the used size, free size, and total size. Finally, it displays the PG status.

In order to view the real-time cluster status, you can use ceph status with Unix's watch command to get a continuous output:

# watch ceph -s
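The watch command refreshes the screen but keeps no history. When you want a timestamped record instead, a short polling loop is one option. This is a sketch under stated assumptions: log_status is a hypothetical helper, and the sample count and interval are arbitrary values that you should adapt to your environment.

```shell
#!/bin/sh
# Run a status command repeatedly and prefix each result with a UTC timestamp.
# $1 = number of samples, $2 = seconds between samples, rest = command to run.
log_status() {
    count=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$("$@")"
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Usage against a live cluster (requires the ceph CLI):
#   log_status 12 5 ceph health >> /tmp/ceph-health.log
```

The command to run is passed as arguments rather than hardcoded, so the same loop works for ceph health, ceph -s, or any other status command.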

Cluster authentication keys

Ceph relies on a strong, key-based authentication system. All cluster components must authenticate with their keys before they can interact with one another. As a Ceph administrator, you might need to check the list of keys managed by the cluster. You can use the ceph command with the auth list subcommand to get a list of all keys:

# ceph auth list

Note

To learn more about a command's operations, use help as a suboption, for instance, # ceph auth --help, and then use the command as directed by the help output.
