Full OSDs

By default, Ceph will warn when an OSD's utilization approaches 85% (the near_full threshold), and it will stop write I/O to the OSD when it reaches 95% (the full threshold). If, for some reason, an OSD completely fills up to 100%, it is likely to crash and will refuse to come back online. An OSD that is above the 85% warning level will also refuse to participate in backfilling, so the recovery of the cluster may be impacted while OSDs are in a near full state.
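As a quick sketch, you can confirm both the current health state and the configured thresholds from the command line; the exact output format varies between Ceph releases:

```shell
# Check whether any OSDs have tripped the near_full or full thresholds.
ceph health detail

# The thresholds themselves are recorded in the OSD map
# (nearfull_ratio / full_ratio lines).
ceph osd dump | grep ratio
```

These commands require a running cluster and are read-only, so they are safe to run at any time.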

Before covering the troubleshooting steps for full OSDs, it is highly recommended that you monitor the capacity utilization of your OSDs, as described in the monitoring chapter. This will give you advance warning as OSDs approach the near_full warning threshold.
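Alongside dedicated monitoring, a quick ad hoc check of per-OSD utilization looks something like the following (output columns vary slightly by release):

```shell
# Per-OSD utilization; watch the %USE column for OSDs nearing 85%.
ceph osd df

# The same data arranged by CRUSH hierarchy, useful for spotting
# whether one host or rack is filling faster than the rest.
ceph osd df tree
```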

If you find yourself in a situation where your cluster is above the near_full warning threshold, you have two main options:

  1. Add some more OSDs.
  2. Delete some data.

However, in the real world, both of these options are either impossible or will take time, during which the situation can deteriorate. If the OSD is only at the near_full threshold, you can probably get things back on track by checking whether your OSD utilization is balanced and, if not, performing PG balancing; this was covered in more detail in the tuning chapter. The same applies to full OSDs: although you are unlikely to get them back below 85%, you can at least resume write operations.
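One way to rebalance without adding hardware is Ceph's built-in reweighting; a minimal sketch, where the OSD id and weight values are examples only:

```shell
# Dry run: show which OSDs would be reweighted if their utilization
# exceeds 110% of the cluster average, without changing anything.
ceph osd test-reweight-by-utilization 110

# Apply the reweighting once the dry run looks sensible.
ceph osd reweight-by-utilization 110

# Or manually lower the weight of a single overfull OSD
# (osd.3 and 0.9 are example values).
ceph osd reweight osd.3 0.9
```

Lowering an OSD's reweight value causes CRUSH to move some of its PGs elsewhere, so make sure the rest of the cluster has the spare capacity to receive them.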

If your OSDs have completely filled up, they will be offline and will refuse to start. Now you have an additional problem: if the OSDs will not start, then no matter what rebalancing or data deletion you carry out, it will not be reflected on the full OSDs, as they are offline. The only way to recover from this situation is to manually delete some PGs from the disk's file system so that the OSD can start.

The following steps should be undertaken for this:

  1. Make sure the OSD process is not running.
  2. Set nobackfill on the cluster to stop recovery from happening when the OSD comes back online.
  3. Find a PG that is in an active, clean, and remapped state and exists on the offline OSD.
  4. Delete this PG from the offline OSD.
  5. Hopefully you should now be able to restart the OSD.
  6. Delete data from the Ceph cluster or rebalance PGs.
  7. Remove nobackfill.
  8. Run a scrub and repair the PG you just deleted.
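The steps above can be sketched with the following commands. The OSD id (osd.0), its data path, and the PG id (1.2f) are placeholder examples only, and ceph-objectstore-tool flags vary between Ceph releases (newer versions require --force with --op remove):

```shell
# Step 2: stop backfill cluster-wide so recovery does not start
# the moment the OSD comes back online.
ceph osd set nobackfill

# Step 3: list the PGs present on the stopped OSD.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-pgs

# Step 4: export a copy of the chosen PG first, then remove it
# from the offline OSD.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 1.2f --op export --file /root/pg1.2f.backup
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 1.2f --op remove --force

# Step 5: start the OSD again.
systemctl start ceph-osd@0

# Step 7: re-enable backfill once space has been freed or rebalanced.
ceph osd unset nobackfill

# Step 8: scrub and repair the PG that was deleted.
ceph pg deep-scrub 1.2f
ceph pg repair 1.2f
```

Only delete a PG that is active and clean elsewhere in the cluster, and keep the exported copy until the repair completes; it is your only fallback if something goes wrong.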