Filestore limitations

Filestore was originally designed as an object store to enable developers to test Ceph on their local machines. Due to its stability, it quickly became the standard object store and found itself in use in production clusters throughout the world.

Initially, the thought behind filestore was that the upcoming B-tree file system (btrfs), which offered transaction support, would allow Ceph to offload its atomicity requirements onto btrfs. Transactions allow an application to send a series of requests to btrfs and receive acknowledgement only once all of them have been committed to stable storage. Without transaction support, if there is an interruption halfway through a Ceph write operation, either the data or the metadata could be missing, or the two could be left out of sync with each other.

Unfortunately, the reliance on btrfs to solve these problems turned out to be a false hope, and several limitations were discovered. btrfs can still be used with filestore, but there are numerous known issues that can affect the stability of Ceph.

In the end, XFS turned out to be the best choice for use with filestore, but XFS had one major limitation: it didn't support transactions, meaning there was no way for Ceph to guarantee the atomicity of its writes. The solution was a write-ahead journal. All writes, including data and metadata, are first written to a journal residing on a raw block device; once the filesystem containing the data and metadata confirms that everything has been safely flushed to disk, the journal entries can be discarded.

A beneficial side effect of this is that when an SSD holds the journal for a spinning disk, it acts as a write-back cache, lowering write latency to that of the SSD. However, if the filestore journal resides on the same storage device as the data partition, throughput is at least halved. For spinning-disk OSDs this can lead to very poor performance, as the disk heads constantly move between two areas of the disk, even for sequential operations. Although SSD-based filestore OSDs don't suffer nearly the same penalty, their throughput is still effectively halved, since twice the amount of data has to be written. In either case this loss of performance is very undesirable, and in the case of flash it also wears the device faster, requiring more expensive, higher write-endurance flash. The following diagram shows how filestore and its journal interact with a block device; note that all data operations have to pass through both the filestore journal and the filesystem's journal.
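
To make the mechanism concrete, the following is a minimal sketch of write-ahead journalling in Python, using ordinary files in place of the raw journal device and the data partition (the file names and the record format are purely illustrative; the real filestore journal is far more involved):

    import os

    JOURNAL = "journal.bin"   # stand-in for the raw journal device
    DATA = "data.bin"         # stand-in for the filestore data partition

    for f in (JOURNAL, DATA):
        open(f, "ab").close()  # ensure both files exist for the demo

    def journaled_write(offset, payload):
        # 1. Append the whole write (metadata + data) to the journal and
        #    flush it to stable storage before acknowledging the client.
        with open(JOURNAL, "ab") as j:
            record = (offset.to_bytes(8, "big")
                      + len(payload).to_bytes(8, "big")
                      + payload)
            j.write(record)
            j.flush()
            os.fsync(j.fileno())  # durable: the write can now be acked

        # 2. Apply the same write to the data partition.
        with open(DATA, "r+b") as d:
            d.seek(offset)
            d.write(payload)
            d.flush()
            os.fsync(d.fileno())

        # 3. Only once the filesystem confirms the data is safely on disk
        #    can the journal entry be discarded (crudely truncated here).
        open(JOURNAL, "wb").close()

    journaled_write(0, b"hello ceph")

Note that the payload is written twice, once to the journal and once to the data file: this double write is precisely why co-locating the journal and the data on one device at least halves throughput.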

Additional challenges with filestore revolved around trying to control the underlying POSIX filesystem so that it performed and behaved in the way Ceph required. A large amount of work has been done over the years by filesystem developers to make filesystems intelligent and able to predict how an application might submit I/O. In Ceph's case, many of these optimizations interfere with what it is trying to instruct the filesystem to do, requiring further workarounds and adding complexity.

Object metadata is stored in a combination of filesystem extended attributes (XATTRs) and a LevelDB key-value store, which also resides on the OSD disk. LevelDB was chosen at the time of filestore's creation rather than RocksDB, as RocksDB was not yet available and LevelDB suited many of Ceph's requirements.
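
As an illustration, on a Linux filestore OSD you can inspect the XATTRs attached to an object file directly with Python's standard os module. The path below is hypothetical, as the actual layout under the OSD's current directory varies by pool and placement group:

    import os

    # Hypothetical object file inside a filestore OSD's data directory.
    obj = "/var/lib/ceph/osd/ceph-0/current/0.1_head/object1__head_ABC123__0"

    # List each extended attribute and the size of its value.
    for name in os.listxattr(obj):
        value = os.getxattr(obj, name)
        print(f"{name}: {len(value)} bytes")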

Ceph is designed to scale to petabytes of data and store billions of objects. However, because there are limits to the number of files you can reasonably store in a single directory, further workarounds were introduced. Objects are stored in a hierarchy of hashed directory names; when the number of files in one of these directories reaches a set limit, the directory is split a further level deep and the objects are moved into the new subdirectories. This improves the speed of object enumeration, but there is a trade-off: when these directory splits occur, they impact performance while the objects are moved into the correct directories. On larger disks, the increased number of directories also puts additional pressure on the VFS cache and can lead to further performance penalties for infrequently accessed objects.
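
The following sketch illustrates the idea of hash-based directory placement. It is not filestore's exact hashing scheme, and the threshold shown is illustrative; the real limits derive from the filestore_split_multiple and filestore_merge_threshold settings:

    import hashlib

    SPLIT_THRESHOLD = 320  # illustrative; real limits come from config

    def placement_dir(object_name, depth):
        """Place an object under one hashed subdirectory per level."""
        digest = hashlib.md5(object_name.encode()).hexdigest()
        # Each split adds another level taken from the next hex digit of
        # the object's hash, e.g. DIR_5/DIR_0/DIR_C at depth 3.
        return "/".join("DIR_" + c.upper() for c in digest[:depth])

    # When a directory exceeds the threshold, the hierarchy grows one
    # level deeper and the objects in it are migrated:
    print(placement_dir("rbd_data.12ab34cd", depth=1))
    print(placement_dir("rbd_data.12ab34cd", depth=3))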

As this book will cover in the performance tuning chapter, a major performance bottleneck with filestore occurs when XFS has to look up inodes and directory entries that are not currently cached in RAM. For scenarios where a large number of objects are stored per OSD, there is currently no real solution to this problem, and it's quite common to observe a Ceph cluster gradually slowing down as it fills up.
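
As a rough way of spotting this cache pressure on an OSD node, the Linux kernel exposes VFS cache counters under /proc. This small snippet (assuming a Linux host, with the counter layouts documented in proc(5)) prints the current dentry and inode cache occupancy:

    # Read the kernel's VFS cache counters; once the working set of
    # inodes and dentries no longer fits in RAM, filestore lookups start
    # hitting the disk instead of the cache.
    with open("/proc/sys/fs/dentry-state") as f:
        nr_dentry, nr_unused, *_ = map(int, f.read().split())
    print(f"dentries cached: {nr_dentry} (unused: {nr_unused})")

    with open("/proc/sys/fs/inode-nr") as f:
        nr_inodes, nr_free = map(int, f.read().split())
    print(f"inodes cached: {nr_inodes} (free: {nr_free})")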

Moving away from storing objects on a POSIX filesystem is really the only way to solve most of these problems.
