K+M

The more erasure coding (M) shards you have, the more OSD failures you can tolerate and still successfully read data. Likewise, the ratio of K to M shards that each object is split into has a direct effect on the percentage of raw storage required for each object.
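
As a rule of thumb, the usable fraction of raw capacity is K / (K + M), and up to M OSD failures can be tolerated; the configurations below illustrate how this trade-off plays out.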

A 3+1 configuration will give you 75% usable capacity but only allows for a single OSD failure, and so would not be recommended. In comparison, a three-way replica pool only gives you 33% usable capacity.

A 4+2 configuration would give you 66% usable capacity and allows for two OSD failures. This is probably a good configuration for most people to use.
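
If a 4+2 configuration fits your cluster, something along the lines of the following will create an erasure code profile and a pool that uses it on recent Ceph releases. The profile and pool names here are only placeholders, and the placement group counts should be sized for your own cluster:

    # Create a 4+2 profile; crush-failure-domain=host keeps each shard on a separate host
    ceph osd erasure-code-profile set ec42profile k=4 m=2 crush-failure-domain=host
    # Check the settings that the profile ended up with
    ceph osd erasure-code-profile get ec42profile
    # Create an erasure-coded pool that uses the profile
    ceph osd pool create ecpool 64 64 erasure ec42profile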

At the other end of the scale, 18+2 would give you 90% usable capacity and still allows for two OSD failures. On the surface this sounds like an ideal option, but the greater total number of shards comes at a cost. A higher number of total shards has a negative impact on performance and also increases CPU demand. The same 4 MB object that would be stored as a single whole object in a replicated pool would now be split into 18 data shards of roughly 227 KB each, plus two coding shards of the same size, all of which have to be tracked and written to 20 different OSDs. Spinning disks deliver their best bandwidth, measured in MBps, with larger I/O sizes, but that bandwidth drastically tails off at smaller I/O sizes. These smaller shards generate a large amount of small I/O and place additional load on some clusters.
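
To put rough numbers on the shard sizes, each shard is simply the object size divided by K, with the coding shards the same size again; a quick bit of shell arithmetic compares a 4+2 and an 18+2 profile for a 4 MB (4,096 KB) object:

    # Shard size = object size / K data shards (the M coding shards are the same size)
    echo "4+2:  $((4096 / 4)) KB per shard"     # 1024 KB
    echo "18+2: $((4096 / 18)) KB per shard"    # roughly 227 KB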

Also, it's important not to forget that these shards need to be spread across different hosts according to the CRUSH map rules: no two shards belonging to the same object can be stored on the same host. Some clusters may simply not have a sufficient number of hosts to satisfy this requirement.
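
A quick sanity check before creating the pool is to count the host buckets in your CRUSH map; with a failure domain of host you need at least K + M of them, so 6 hosts for 4+2 and 20 hosts for 18+2. A rough way to do this is:

    # Rough check: count the host buckets that appear in the CRUSH tree
    ceph osd tree | grep -c host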

Reading back from these high-chunk-count pools is also a problem. Unlike a replicated pool, where Ceph can read just the requested data from any offset in an object, in an erasure-coded pool the shards from all of the OSDs storing the object have to be read before the read request can be satisfied. In the 18+2 example, this can massively amplify the number of disk read operations required, and average latency will increase as a result. This side effect tends to only cause a noticeable performance impact on pools that use a large number of shards. In some instances, a 4+2 configuration will actually see a performance gain compared with a replica pool as a result of splitting objects into shards: because the data is effectively striped over a number of OSDs, each OSD has to write less data, and there are no secondary and tertiary replicas to write.
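
If you want to measure this effect on your own hardware, a simple comparison with rados bench between a replicated pool and an erasure-coded pool can be revealing. The pool names below are only placeholders:

    # Write benchmark objects into each pool and keep them for the read test
    rados bench -p replpool 30 write --no-cleanup
    rados bench -p ecpool 30 write --no-cleanup
    # Sequential read benchmark; compare the bandwidth and average latency figures
    rados bench -p replpool 30 seq
    rados bench -p ecpool 30 seq
    # Remove the benchmark objects when finished
    rados -p replpool cleanup
    rados -p ecpool cleanup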
