ESXi hypervisor

A reasonably frequent requirement is to be able to export Ceph storage and consume it via VMware's ESXi hypervisor. ESXi supports both iSCSI block storage, which it formats with its own VMFS clustered filesystem, and file-based NFS storage. Both are fully functional and supported, so the choice is normally a matter of user preference, or of which protocol is best supported by the storage array in use.

When exporting Ceph storage to ESXi, there are a number of additional factors to take into account, both in how Ceph behaves as the storage provider and in the choice between iSCSI and NFS. This section of the chapter is therefore dedicated to explaining those factors.

The first thing to consider is that ESXi was developed with enterprise storage arrays in mind, and a number of its design decisions were made around the way these arrays operate. As discussed in the opening chapter, direct-attached, Fibre Channel, and iSCSI arrays have much lower latency than distributed network storage. With Ceph, an additional hop is required through the node acting as the NFS or iSCSI proxy; this often results in a write latency that is several times that of a good block storage array.

To assist with storage vendors' QoS attempts (ignoring VAAI accelerations for now), ESXi breaks up any clone or migration operation into smaller 64 KB I/Os, the reasoning being that a large number of parallel 64 KB I/Os are easier to schedule for disk time than large multi-MB I/Os, which would block disk operations for longer. Ceph, however, tends to favor larger I/O sizes, and so tends to perform worse when cloning or migrating VMs. Additionally, depending on the export method, Ceph may not provide read-ahead, which can harm sequential read performance.
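To get a feel for how this I/O pattern behaves on your own cluster, a quick fio test against a disposable RBD image can be illustrative. The following is only a sketch: it assumes fio was built with RBD support and that a scratch image called testimg already exists in the rbd pool, and the values are purely illustrative. Repeating the run with a larger block size, such as 4m, shows the difference in how Ceph handles the two patterns:

    # Simulate ESXi-style small, highly parallel writes against an RBD image
    fio --name=esxi-64k-clone-sim --ioengine=rbd --clientname=admin \
        --pool=rbd --rbdname=testimg --rw=write --bs=64k --iodepth=32 \
        --runtime=60 --time_based --group_reporting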

Another area in which care needs to be taken is in managing the impact of Ceph's PG locking. When an object stored in Ceph is accessed, the PG containing that object is locked to preserve data consistency, and all other I/Os to that PG must queue until the lock is released. In most scenarios this presents minimal issues; however, when exporting Ceph to ESXi, a number of ESXi behaviors can cause contention around this PG locking.

As mentioned previously, ESXi migrates VMs by submitting the I/O as 64 KB requests, and it tries to maintain a stream of 32 of these operations in parallel to keep performance acceptable. This causes issues when Ceph is the underlying storage, as a high percentage of these 64 KB I/Os will hit the same 4 MB object, meaning that the 32 parallel requests end up being processed almost serially. RBD striping can be used to try to ensure that these highly parallel but also highly localized I/Os are spread across a number of objects, although your mileage may vary. VAAI accelerations may help with some migration and cloning operations but, in some cases, they cannot be used and ESXi falls back to the default method.
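If you want to experiment with striping, it can be enabled when the RBD image is created. The following is a minimal sketch: the image name is hypothetical, the stripe unit and count are example values only, and you should confirm that every client that will access the image (librbd or a sufficiently recent kernel client) supports fancy striping:

    # Create a 100 GB image (size is in MB) whose data is striped in
    # 64 KB units across 16 objects, so adjacent small I/Os hit different objects
    rbd create rbd/esxi-lun0 --size 102400 --object-size 4M \
        --stripe-unit 65536 --stripe-count 16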

Aside from VM migration, if you are using a VMFS-over-iSCSI-over-RBD configuration, you can also experience PG lock contention when updating the VMFS metadata, which is stored in only a small area of the disk. The VMFS metadata is often updated heavily when growing a thinly provisioned VMDK or writing into snapshotted VM files, and PG lock contention can limit throughput when a number of VMs on the VMFS filesystem all try to update the metadata at once.

At the time of writing, the official Ceph iSCSI support disables RBD caching. For certain operations, the lack of read-ahead caching has a negative impact on I/O performance; this is especially noticeable when reading sequentially through VMDK files, such as when migrating a VM between datastores or removing snapshots.
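For comparison, a plain librbd client (a KVM guest, for example) can be tuned with the librbd read-ahead options shown below. This is only a sketch with illustrative values, and it will have no effect on the iSCSI gateway while that gateway forcibly disables RBD caching:

    [client]
        rbd cache = true
        # Read ahead up to 4 MB once sequential access is detected
        rbd readahead max bytes = 4194304
        rbd readahead trigger requests = 5
        # Keep read-ahead active rather than disabling it after the default 50 MB
        rbd readahead disable after bytes = 0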

Regarding HA support, at the time of writing, the official Ceph iSCSI support only uses implicit ALUA to manage the active iSCSI paths. This causes issues if an ESXi host fails over to another path while other hosts in the same vSphere cluster stay on the original path. The long-term solution is to switch to explicit ALUA, which allows the iSCSI initiator to control the active paths on the target, thereby ensuring that all hosts use the same path. The only current workaround that enables a full HA stack is to run just one VM per datastore.
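When diagnosing this behavior, it can be useful to check which path each ESXi host is actually using. The following is a sketch only, run from the ESXi shell, with the device identifier left as a placeholder:

    # List devices and the path selection policy in use on this host
    esxcli storage nmp device list
    # Show the state (active/standby) of each path for a specific device
    esxcli storage nmp path list --device naa.<device-id>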

The NFS–XFS–RBD configuration shares many of the PG lock contention issues of the iSCSI configuration, and additionally suffers from contention caused by the XFS journal. The XFS journal is a small circular buffer measured in tens of MBs, covering only a few underlying RADOS objects. As ESXi sends sync writes via NFS, parallel writes to XFS queue up waiting for journal writes to complete. Because XFS is not a distributed filesystem, extra steps need to be taken when building an HA solution to manage the mapping of the RBDs and the mounting of the XFS filesystems.
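To illustrate the extra layers involved, the following sketch shows the manual steps behind such an export; the image name, mount point, and client network are all hypothetical, and in a real HA deployment these steps would be driven by a cluster resource manager rather than run by hand:

    # Map the RBD image and put an XFS filesystem on it
    rbd map rbd/esxi-nfs0
    mkfs.xfs /dev/rbd/rbd/esxi-nfs0
    mkdir -p /mnt/esxi-nfs0
    mount /dev/rbd/rbd/esxi-nfs0 /mnt/esxi-nfs0

    # /etc/exports - ESXi sends sync NFS writes
    /mnt/esxi-nfs0 192.168.0.0/24(rw,sync,no_root_squash)

    # Re-read the exports table
    exportfs -ra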

Finally, we have the NFS and CephFS method. As CephFS is a filesystem, it can be exported directly, meaning that there is one less layer than in the other two methods. Additionally, as CephFS is a distributed filesystem, it can be mounted across multiple proxy nodes at the same time, meaning that there are two fewer clustered resources (the RBD mapping and the filesystem mount) to track and manage.
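As an example of the direct approach, an NFS-Ganesha export of CephFS needs little more than an export stanza using the Ceph FSAL. The following is a minimal sketch for ganesha.conf, assuming the Ceph FSAL package is installed; the export ID and pseudo path are illustrative:

    EXPORT
    {
        Export_Id = 100;
        Path = "/";
        Pseudo = "/cephfs";
        Access_Type = RW;
        Squash = No_Root_Squash;
        FSAL
        {
            Name = CEPH;
        }
    }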

It's also likely that a single CephFS filesystem will be exported via NFS, providing one large ESXi datastore, meaning that there is no need to worry about migrating VMs between datastores, as there is with RBDs. This greatly simplifies operation and works around many of the limitations discussed so far.

Although CephFS still requires metadata operations, these are parallelized far better than the metadata operations of XFS or VMFS, and so the impact on performance is minimal. The CephFS metadata pool can also be placed on flash storage to further increase performance. The way metadata updates are handled also greatly reduces the occurrence of PG lock contention, meaning that parallel performance on the datastore is not restricted.
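Moving the metadata pool onto flash can be done with a device-class-based CRUSH rule, assuming your SSD OSDs report the ssd device class. A sketch, with a rule name chosen here for illustration and the default metadata pool name assumed:

    # Create a replicated CRUSH rule that only selects OSDs of class ssd
    ceph osd crush rule create-replicated ssd-only default host ssd
    # Point the CephFS metadata pool at the new rule
    ceph osd pool set cephfs_metadata crush_rule ssd-only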

As mentioned previously in the NFS section, CephFS can be exported either directly via the Ganesha CephFS FSAL or by being mounted through the Linux kernel client and then re-exported. For performance reasons, mounting CephFS via the kernel and then exporting it is currently the preferred method.
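A minimal sketch of the kernel-mount approach is shown below; the monitor address, keyring details, and client network are all examples:

    # Mount CephFS using the kernel client
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

    # /etc/exports - export the CephFS mount to the ESXi hosts
    /mnt/cephfs 192.168.0.0/24(rw,sync,no_root_squash)

    # Activate the export
    exportfs -ra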

Before deciding which method best suits your environment, it is recommended that you investigate each one further and make sure that you are comfortable administering the resulting solution.
