Replacing a failed disk drive

As a Ceph storage administrator, you will manage Ceph clusters with multiple physical disks. As the physical disk count of your cluster increases, the frequency of disk failures is also likely to increase, so replacing a failed disk drive can become a repetitive task. There is generally no need to worry if one or more disks fail in your Ceph cluster, because Ceph takes care of the data through its replication and high availability features. The replacement process relies on Ceph's data replication and consists of removing all the entries of the failed OSD from the cluster, including the CRUSH map, and then adding a new OSD in its place. We will now walk through the failed disk replacement process on ceph-node1 for osd.0.

First, check the status of your Ceph cluster. Since this cluster does not have any failed disks, its status will be HEALTH_OK:

# ceph status

Since we are demonstrating this exercise on virtual machines, we will forcibly fail a disk by powering off ceph-node1, detaching one of its disks, and powering the VM back on:

# VBoxManage controlvm ceph-node1 poweroff
# VBoxManage storageattach ceph-node1 --storagectl "SATA Controller" --port 1 --device 0 --type hdd --medium none
# VBoxManage startvm ceph-node1

In the following screenshot, you will notice that ceph-node1 contains a failed osd.0 that should be replaced:

[Screenshot: Replacing a failed disk drive]
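
You can also confirm this from the command line; ceph osd tree lists every OSD together with its up/down status, so osd.0 under ceph-node1 should now show up as down (the exact output depends on your CRUSH hierarchy):

# ceph osd tree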

Once the OSD is down, Ceph will mark it out of the cluster after some time; by default, this happens after 300 seconds. If it does not, you can mark it out manually:

# ceph osd out osd.0
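
The 300-second default comes from the mon_osd_down_out_interval setting. If you want to confirm the value on your own cluster, one way (assuming your monitor daemon is named mon.ceph-node1 and you run this on that node) is to query the running monitor:

# ceph daemon mon.ceph-node1 config show | grep mon_osd_down_out_interval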

Remove the failed OSD from the CRUSH map:

# ceph osd crush rm osd.0

Delete Ceph authentication keys for the OSD:

# ceph auth del osd.0 

Finally, remove the OSD from the Ceph cluster:

# ceph osd rm osd.0

[Screenshot: Replacing a failed disk drive]
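
At this point, the failed OSD should be completely gone from the cluster. If you want to double-check, the OSD count reported by ceph osd stat should have dropped by one, and ceph auth list should no longer show a key for osd.0:

# ceph osd stat
# ceph auth list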

Since one of your OSDs is unavailable, the cluster health will not be OK and the cluster will perform recovery. There is nothing to worry about; this is a normal Ceph operation.
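
If you would like to follow the recovery progress, ceph -w streams cluster status updates, including the recovery statistics, until you press Ctrl + C:

# ceph -w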

Now, you should physically replace the failed disk with a new disk on your Ceph node. These days, almost all server hardware and operating systems support disk hot swapping, so you will not require any downtime for the disk replacement. Since we are using a virtual machine, we need to power off the VM, add a new disk, and restart the VM. Once the disk is inserted, make a note of its OS device ID:

# VBoxManage controlvm ceph-node1 poweroff
# VBoxManage storageattach ceph-node1 --storagectl "SATA Controller" --port 1 --device 0 --type hdd --medium ceph-node1-osd1.vdi
# VBoxManage startvm ceph-node1
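
To note the OS device ID of the newly added disk, you can log in to ceph-node1 and list its block devices; the new disk (sdb in this exercise) should appear with no partitions on it. lsblk is one convenient way to check this:

# lsblk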

Run the following command to list the disks on the node; the new disk generally does not have any partitions:

# ceph-deploy disk list ceph-node1

Before adding the disk to the Ceph cluster, zap it; this destroys the disk's existing partition table and content so that Ceph can prepare it cleanly:

# ceph-deploy disk zap ceph-node1:sdb

Finally, create an OSD on the disk; since the OSD ID 0 is now free, Ceph will add the new OSD as osd.0:

# ceph-deploy --overwrite-conf osd create ceph-node1:sdb

Once the OSD is created, Ceph will perform a recovery operation and start moving placement groups from other OSDs to the new OSD. The recovery operation might take a while, after which your Ceph cluster will be HEALTH_OK again:

[Screenshot: Replacing a failed disk drive]
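
Once the recovery is over, you can also verify from the command line that the new OSD has joined as osd.0 and that the cluster is healthy again; osd.0 should now be shown as up under ceph-node1:

# ceph -s
# ceph osd tree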