Filestore limitations

Filestore was originally designed as an object store to enable developers to test Ceph on their local machines. Because of its stability, it quickly became the standard object store and found itself in use in production clusters throughout the world.

Initially, the thought behind filestore was that the upcoming B-tree file system (btrfs), which offered transaction support, would allow Ceph to offload its atomicity requirements to btrfs. Transactions allow an application to send a series of requests to btrfs and receive acknowledgement only once all of them have been committed to stable storage. Without transaction support, if a Ceph write operation was interrupted halfway through, either the data or the metadata could be missing, or one could be out of sync with the other.
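
To illustrate the problem, here is a minimal Python sketch (the file layout is hypothetical, not Ceph's actual code) showing how a non-transactional sequence of writes can be torn by a crash:

import os

def write_object(data_path, meta_path, data, metadata):
    # step 1: the object data reaches the filesystem
    with open(data_path, "wb") as f:
        f.write(data)
        os.fsync(f.fileno())
    # -- a crash here leaves new data paired with stale metadata --
    # step 2: the object metadata reaches the filesystem
    with open(meta_path, "wb") as f:
        f.write(metadata)
        os.fsync(f.fileno())

With filesystem transactions, both steps would either commit together or not at all; without them, Ceph needs another mechanism to obtain the same guarantee.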

Unfortunately, the reliance on btrfs to solve these problems turned out to be a false hope, and several limitations were discovered. Btrfs can still be used with filestore, but there are numerous known issues that can affect the stability of Ceph.

In the end, it turned out that XFS was the best choice to use with filestore, but XFS had a major limitation of its own: it didn't support transactions, meaning that there was no way for Ceph to guarantee the atomicity of its writes. The solution to this was the write-ahead journal. All writes, including data and metadata, would first be written into a journal located on a raw block device. Once the filesystem containing the data and metadata confirmed that everything had been safely flushed to disk, the corresponding journal entries could be discarded. A beneficial side effect of this is that, when an SSD holds the journal for a spinning disk, it acts like a write-back cache, lowering the latency of writes to the speed of the SSD; however, if the filestore journal resides on the same storage device as the data partition, then throughput will be at least halved.
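
The mechanism can be summarized with a short Python sketch (a simplification under assumed semantics, not Ceph's actual C++ implementation):

import os

class Journal:
    def __init__(self, path):
        # O_DSYNC makes every journal append durable before the call
        # returns, which is what provides the atomicity guarantee
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)

    def append(self, entry):
        os.write(self.fd, entry)   # 1. persist the full write intent first

def journaled_write(journal, data_path, data):
    journal.append(data)
    with open(data_path, "wb") as f:
        f.write(data)              # 2. apply the write to the filesystem
        os.fsync(f.fileno())       # 3. wait for it to reach stable storage
    # 4. only now can the journal entry be trimmed and its space reused

If a crash occurs partway through, replaying the journal at startup reproduces any writes that were acknowledged but had not yet reached the filesystem.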

In the case of spinning-disk OSDs, this can lead to very poor performance, as the disk heads constantly move between two areas of the disk, even for sequential operations. Although filestore on SSD-based OSDs doesn't suffer nearly the same seek penalty, throughput is still effectively halved, because double the amount of data has to be written on account of the filestore journal. In either case, this loss of performance is very undesirable, and on flash drives the extra writes also wear the device faster, requiring more expensive write-endurance-rated flash. The following diagram shows how filestore and its journal interact with a block device. You can see that all data operations have to go through both the filestore journal and the filesystem's journal.
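
As a rough worked example of this halving (the numbers are purely illustrative): a disk capable of 150 MB/s of sequential writes has to absorb every client byte twice, once for the journal and once for the data, so client-visible throughput tops out at around 150 / 2 = 75 MB/s, and on a spinning disk the constant seeking between the journal and data areas drives it lower still.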

Additional challenges with filestore arose from developers trying to make the underlying POSIX filesystem perform and behave in the way that Ceph required. Filesystem developers have done a large amount of work over the years to make filesystems intelligent and to predict how an application might submit I/O. In Ceph's case, many of these optimizations interfere with what it's trying to instruct the filesystem to do, requiring further workarounds and adding complexity.

Object metadata is stored in a combination of filesystem extended attributes (XATTRs) and a LevelDB key-value store, which also resides on the OSD disk. LevelDB was chosen at the time of filestore's creation rather than RocksDB, as RocksDB wasn't yet available and LevelDB suited a lot of Ceph's requirements.
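
For example, the XATTRs that filestore sets on an object's backing file can be inspected directly from Python (the path below is hypothetical, and the attribute names you see will depend on the Ceph release):

import os

# hypothetical path to an object file inside an OSD's data directory
obj = "/var/lib/ceph/osd/ceph-0/current/1.0_head/some_object__head"

for name in os.listxattr(obj):       # Ceph attributes appear under user.ceph.*
    value = os.getxattr(obj, name)
    print(name, "->", len(value), "bytes")

Metadata that doesn't fit comfortably in XATTRs, along with other per-object key-value data, lives in the LevelDB store instead.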

Ceph is designed to scale to petabytes of data and store billions of objects. However, because there are limits to the number of files you can reasonably store in a single directory, further workarounds were introduced. Objects are stored in a hierarchy of hashed directory names; when the number of files in one of these directories reaches the set limit, the directory is split a level deeper and the objects are moved into the new subdirectories.
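
The following Python sketch conveys the idea (the hash choice, threshold, and layout are illustrative, not filestore's actual on-disk format):

import hashlib
import os
import shutil

SPLIT_THRESHOLD = 320   # hypothetical files-per-directory limit

def hashed_subdir(directory, obj_name):
    # one hex nibble of the object's hash names the next level down
    nibble = hashlib.md5(obj_name.encode()).hexdigest()[0]
    return os.path.join(directory, nibble)

def split_directory(directory):
    # rehash every object one level deeper and move it; this bulk
    # migration is the performance hit described in the text
    for name in os.listdir(directory):
        src = os.path.join(directory, name)
        if os.path.isfile(src):
            dst_dir = hashed_subdir(directory, name)
            os.makedirs(dst_dir, exist_ok=True)
            shutil.move(src, dst_dir)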

This splitting keeps per-directory file counts manageable and object enumeration fast, but there's a trade-off: when these directory splits occur, performance suffers while the objects are moved into their new directories. On larger disks, the increased number of directories also puts additional pressure on the VFS cache, which can cause further performance penalties for infrequently accessed objects.

As this book will cover in the chapter on performance tuning, a major performance bottleneck in filestore occurs when XFS has to start looking up inodes and directory entries that aren't currently cached in RAM. For scenarios where a large number of objects are stored per OSD, there is currently no real solution to this problem, and it's quite common for a Ceph cluster to gradually slow down as it fills up.

Moving away from storing objects on a POSIX filesystem is really the only way to solve most of these problems.
