Data protection and redundancy technologies have existed for many decades. One of the most popular methods for data reliability is replication. The replication method stores the same data multiple times in different physical locations. It performs well in terms of both performance and data reliability, but it increases the overall cost of a storage system: the TCO (total cost of ownership) of a replication-based system is very high.
This method requires double the amount of storage space to provide redundancy. For instance, if you are planning a storage solution for 1 PB of data with one replica (a replicated pool of size 2), you will require 2 PB of physical storage to store 1 PB of replicated data. In this way, the replication cost per gigabyte of storage increases significantly. You might ignore the storage cost for a small cluster, but imagine how the cost adds up if you build a hyper-scale data storage solution on a replicated storage backend.
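The replication arithmetic above can be sketched in a few lines of Python; the function name is illustrative, not part of any Ceph API:

```python
def replicated_capacity(usable_pb: float, copies: int) -> float:
    """Raw storage needed to keep `copies` full copies of the data."""
    return usable_pb * copies

# 1 PB of usable data kept as an original plus one replica (2 copies in total)
raw = replicated_capacity(1.0, 2)
print(raw)  # 2.0 PB of physical storage for 1 PB of usable data
```

Note how the raw capacity grows linearly with the number of copies, which is exactly why replication becomes expensive at hyper-scale.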
The erasure coding mechanism comes to the rescue in such scenarios. It is a mechanism used for data protection and data reliability that is fundamentally different from the replication method. It guarantees data protection by dividing each storage object into smaller chunks known as data chunks, expanding and encoding them with coding chunks, and finally storing all these chunks across different failure zones of a Ceph cluster.
The erasure coding feature was introduced in Ceph Firefly, and it is based on a mathematical function to achieve data protection. The entire concept revolves around the following equation:
n = k + m
The following points explain these terms and what they stand for:
k: the number of chunks the original object is divided into, also known as data chunks
m: the extra codes added to the data chunks to provide data protection, also known as coding chunks
n: the total number of chunks created after the erasure coding process
Based on the preceding equation, every object in an erasure-coded Ceph pool is stored as k+m chunks, and each chunk is stored on an OSD in the acting set. In this way, all the chunks of an object are spread across the entire Ceph cluster, providing a higher degree of reliability. Now, let's discuss some useful terms with respect to erasure coding:
Recovery: to recover, we require any k chunks out of the n chunks; thus, we can sustain the failure of any m chunks
Reliability level: the pool can tolerate the loss of up to m chunks
Encoding rate (r): r = k / n, where r < 1
Storage required: 1 / r
For instance, consider a Ceph pool with five OSDs that is created using the erasure code (3, 2) rule. Every object stored inside this pool will be divided into the following set of data and coding chunks:
n = k + m, that is, 5 = 3 + 2; hence n = 5, k = 3, and m = 2
So, every object will be divided into three data chunks, and two extra erasure-coded chunks will be added to it, making a total of five chunks that will be stored and distributed across five OSDs of the erasure-coded pool in a Ceph cluster. In the event of a failure, we need any three of the five chunks to reconstruct the original object. Thus, we can sustain the failure of any two OSDs, as the data can be recovered using the remaining three.
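Ceph's erasure code plugins implement Reed-Solomon-style codes that can tolerate the loss of any m chunks. A full Reed-Solomon implementation is beyond the scope of a short example, but the core idea of "split into data chunks, add computed coding chunks, rebuild a lost chunk from the survivors" can be illustrated with a deliberately simplified single-XOR-parity scheme (k data chunks, one coding chunk, so only one failure can be sustained); all names here are illustrative:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    """Split `data` into k equal data chunks and append one XOR parity chunk."""
    assert len(data) % k == 0, "pad the object so it splits evenly"
    size = len(data) // k
    chunks = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return chunks + [parity]          # n = k + 1 chunks in total

def recover(chunks, lost: int):
    """Rebuild the single missing chunk by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    out = survivors[0]
    for c in survivors[1:]:
        out = xor_bytes(out, c)
    return out

obj = b"ceph-object-data!!"           # 18 bytes, divisible by k = 3
stored = encode(obj, k=3)             # 3 data chunks + 1 coding chunk
rebuilt = recover(stored, lost=1)     # pretend the OSD holding chunk 1 died
assert rebuilt == stored[1]
```

With m=2 as in the chapter's example, Ceph uses a proper Reed-Solomon code rather than plain XOR, but the encode/spread/recover workflow is the same.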
Encoding rate (r) = 3 / 5 = 0.6 < 1
Storage required = 1 / r = 1 / 0.6 ≈ 1.67 times the original file
Suppose there is a data file of size 1 GB. To store this file in a Ceph cluster on an erasure-coded (3, 2) pool, you will need approximately 1.67 GB of storage space, which gives you file storage that can sustain two OSD failures.
In contrast, if the same file is stored on a replicated pool, then in order to sustain the failure of two OSDs, Ceph will need a pool of replica size 3, which eventually requires 3 GB of storage space to reliably store the 1 GB file. In this way, you can save over 40 percent of the storage cost by using the erasure coding feature of Ceph while getting the same reliability as with replication.
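The cost comparison can be checked with a short calculation; the helper names are illustrative only:

```python
def ec_raw_storage(data_gb: float, k: int, m: int) -> float:
    """Raw storage for an erasure-coded pool: data * (k + m) / k."""
    return data_gb * (k + m) / k

def replica_raw_storage(data_gb: float, size: int) -> float:
    """Raw storage for a replicated pool of the given replica size."""
    return data_gb * size

ec = ec_raw_storage(1.0, k=3, m=2)       # ~1.67 GB, survives 2 OSD failures
rep = replica_raw_storage(1.0, size=3)   # 3.0 GB, also survives 2 OSD failures
saving = 1 - ec / rep                    # ~0.44, i.e. over 40 percent saved
print(round(ec, 2), rep, round(saving, 2))
```

Both configurations tolerate two OSD failures, yet the erasure-coded pool needs roughly 1.67 GB of raw capacity against replication's 3 GB.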
Erasure-coded pools require less storage space than replicated pools; however, this saving comes at the cost of performance. The erasure coding process divides every object into multiple smaller data chunks, computes a few new coding chunks to mix with them, and finally stores all these chunks across different failure zones of the Ceph cluster. This mechanism requires more computational power from the OSD nodes. Moreover, at the time of recovery, decoding the data chunks also requires a lot of computation. As a result, you might find the erasure coding mechanism somewhat slower than replication. Erasure coding is largely use-case dependent, and you can get the most out of it based on your data storage requirements.
With erasure coding, you can store more data for less money. Cold storage can be a good use case for erasure coding, where read and write operations are infrequent: for example, large datasets where images and genomics data are stored for a long time without being read or written, or some kind of archival system where data is archived and not accessed frequently.
Usually, such low-cost cold-storage erasure-coded pools are tiered with faster replicated pools so that data is initially stored on the replicated pool; if the data is not accessed for a certain period (a few weeks), it is flushed to the low-cost erasure-coded pool, where performance is not a criterion.
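Such a tiering arrangement can be set up with Ceph's cache tiering commands. The following is a minimal sketch, assuming a faster replicated pool named hot-pool already exists to sit in front of EC-pool; the pool names and age thresholds are illustrative, and production tiers also need hit-set parameters tuned for the workload:

```shell
# Attach hot-pool as a cache tier in front of the erasure-coded EC-pool
ceph osd tier add EC-pool hot-pool
# Writeback mode: clients write to the fast tier, data flushes to EC-pool later
ceph osd tier cache-mode hot-pool writeback
# Redirect client traffic for EC-pool through the cache tier
ceph osd tier set-overlay EC-pool hot-pool
# Flush objects not modified for 10 minutes, evict those idle for 30 minutes
ceph osd pool set hot-pool cache_min_flush_age 600
ceph osd pool set hot-pool cache_min_evict_age 1800
```

With this in place, cold objects migrate to the cheap erasure-coded tier automatically, while hot data stays on the replicated pool.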
Erasure coding is implemented by creating Ceph pools of type erasure; each of these pools is based on an erasure code profile that defines the erasure coding characteristics. We will now create an erasure code profile and an erasure-coded pool based on this profile:
Create an erasure code profile named EC-profile, which will have the characteristics k=3 and m=2; these are the numbers of data and coding chunks, respectively. So, every object that is stored in the erasure-coded pool will be divided into 3 (k) data chunks, with 2 (m) additional coding chunks added to them, making a total of 5 (k + m) chunks. Finally, these 5 (k + m) chunks are spread across different OSD failure zones:
# ceph osd erasure-code-profile set EC-profile ruleset-failure-domain=osd k=3 m=2
# ceph osd erasure-code-profile ls
# ceph osd erasure-code-profile get EC-profile
Create a Ceph pool of type erasure, based on the erasure code profile that we created earlier:
# ceph osd pool create EC-pool 16 16 erasure EC-profile
Check the status of your newly created pool; you should find that the size of the pool is 5 (k + m), that is, an erasure size of 5. Hence, data will be written to five different OSDs:
# ceph osd dump | grep -i EC-pool
At this stage, we have completed setting up an erasure pool in a Ceph cluster. Now, we will deliberately try to break OSDs to see how the erasure pool behaves when OSDs are unavailable.
Bring down osd.4 and check the OSD map for EC-pool and object1. You should notice that osd.4 has been replaced by the random number 2147483647, which means that osd.4 is no longer available for this pool:
Similarly, bring down osd.5 and check the OSD map again:
# ssh ceph-node2 service ceph stop osd.5
# ceph osd map EC-pool object1
You should notice that osd.5 has also been replaced by the random number 2147483647, which means that osd.5 is no longer available for this pool. In this way, erasure coding provides reliability to Ceph pools while requiring less storage to deliver that reliability.
The erasure code feature benefits greatly from Ceph's robust architecture. When Ceph detects the unavailability of any failure zone, it starts its basic recovery operation. During recovery, erasure-coded pools rebuild themselves by decoding the failed chunks onto new OSDs, after which all the chunks become available automatically.
In the last two steps, we intentionally broke osd.4 and osd.5. After a while, Ceph started recovery and regenerated the missing chunks on different OSDs. Once the recovery operation is complete, check the OSD map for EC-pool and object1; you should see new OSD IDs such as osd.1 and osd.3, and thus the erasure-coded pool becomes healthy again without administrative input.
This is how Ceph and erasure coding make a great combination. The erasure coding feature for a storage system such as Ceph, which is scalable to the petabyte level and beyond, will definitely give a cost-effective, reliable way of data storage.