Investigating asserts

Assertions are used in Ceph to ensure that during the execution of the code any assumptions that have been made about the operating environment remain true. These assertions are scattered throughout the Ceph code and are designed to catch any conditions that may go on to cause further problems if the code is not stopped.

If you trigger an assertion in Ceph, it's likely that some piece of data has an unexpected value. This may be caused by some form of corruption or by an unhandled bug.
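To make the idea concrete, here is a minimal sketch of an assertion guarding an assumption about stored data. Ceph itself is written in C++ and uses its own assert macros; this Python version only illustrates the principle, and the function name read_record and its parameters are hypothetical.

```python
def read_record(data: bytes, recorded_length: int) -> bytes:
    """Return the payload of a stored record (hypothetical helper)."""
    # The code that follows assumes the stored bytes match the length
    # recorded in the metadata. If they do not, something is corrupt or
    # a bug has not been handled, so stop here rather than carry on
    # with bad data.
    assert len(data) == recorded_length, (
        f"record length mismatch: got {len(data)}, expected {recorded_length}"
    )
    return data


# A consistent record passes straight through.
payload = read_record(b"abcd", 4)

# An inconsistent record stops execution with an AssertionError
# (unless Python is run with -O, which disables assert statements).
try:
    read_record(b"abc", 4)
except AssertionError as err:
    failure_message = str(err)
```

The point of stopping at the assertion is that the failure is reported close to where the bad value was first seen, instead of surfacing later as a harder-to-trace problem.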

If an OSD causes an assert and refuses to start anymore, the usual recommended approach would be to destroy the OSD, recreate it, and then let Ceph backfill objects back to it. If you have a reproducible failure scenario, then it is probably also worth filing a bug in the Ceph bug tracker.

As mentioned several times in this chapter, OSDs can fail either due to hardware faults or due to soft faults in the stored data or the OSD code. Soft faults are much more likely to affect multiple OSDs at once; if your OSDs have become corrupted due to a power outage, then it's highly likely that more than one OSD will be affected. In the case where multiple OSDs are failing with asserts and they are causing one or more PGs in the cluster to be offline, simply recreating the OSDs is not an option. The OSDs that are offline hold all three copies of the PG, so recreating them would make any form of recovery impossible and result in permanent data loss.

Before attempting the recovery techniques in this chapter, such as exporting and importing PGs, you should first investigate the asserts. Depending on your technical ability and how much downtime you can tolerate before you need to move on to other recovery steps, this investigation may not succeed. By reading the assert message and looking through the Ceph source code that it references, it may be possible to identify the cause of the assert. If so, a fix can be implemented in the Ceph code to stop the OSD from asserting. Don't be afraid to reach out to the community for help on these matters.
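As a starting point for that investigation, the source file and line number named by the assert can be pulled out of the OSD log. The following is a minimal Python sketch, assuming the assert shows up in the log as a line ending in `<source file>: <line>: FAILED assert(<condition>)` (or `ceph_assert(...)` in newer releases). Check this against your own logs, as the exact format may differ between versions. The function name find_asserts and the sample log lines are illustrative only.

```python
import re

# Assumed shape of an assert line in an OSD log, for example:
#   /build/ceph/src/osd/PGLog.cc: 1234: FAILED assert(a < b)
# This pattern is an assumption; adjust it if your logs look different.
ASSERT_RE = re.compile(
    r"(?P<file>\S+\.(?:cc|h|hpp|cpp)): (?P<line>\d+): "
    r"FAILED (?:ceph_)?assert\((?P<cond>.*)\)\s*$"
)


def find_asserts(log_lines):
    """Return a list of (source_file, line_number, condition) tuples."""
    hits = []
    for text in log_lines:
        match = ASSERT_RE.search(text)
        if match:
            hits.append(
                (match.group("file"), int(match.group("line")), match.group("cond"))
            )
    return hits


# Illustrative input only; these are not lines from a real cluster.
sample_log = [
    "-1 /build/ceph/src/osd/PGLog.cc: 1234: FAILED assert(a < b)",
    "0 osd.3 starting recovery",
    "-1 osd/OSD.cc: 99: FAILED ceph_assert(x == 1)",
]
sample_hits = find_asserts(sample_log)
```

To run it against a real log, pass an open file object instead of the sample list, for example `find_asserts(open(path))` where `path` points at the OSD's log file. Each tuple gives the file and line to open in the Ceph source tree for the matching release, along with the condition that failed.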

In some cases, the OSD corruption may be so severe that even the objectstore tool itself may assert when trying to read from the OSD. This will limit the recovery steps outlined in this chapter, and trying to fix the reason behind the assert might be the only option. By this point, however, it is likely that the OSD has sustained heavy corruption, and recovery may not be possible.
