Write Anywhere File Layout
This chapter describes Write Anywhere File Layout (WAFL), which is a file system designed specifically to work in a file server appliance. Our primary focus in this chapter is on the algorithms and data structures that WAFL uses to perform its I/O and to implement Snapshots (read-only clones of the active file system). WAFL uses a unique copy-on-write technique to minimize the disk space that Snapshots consume. This chapter also describes how WAFL uses Snapshots to eliminate the need for file system consistency checking after an unclean shutdown.
The file system requirements for a file server storage system are different from those for a general-purpose UNIX/Windows system because a file server storage system must be optimized for network file access and because it must be easy to use.
3.1 Introduction to Write Anywhere File Layout
An appliance is a device designed to perform a particular function. A recent trend in networking has been to provide common services using appliances instead of general purpose computers.
A new type of network appliance is the unified file server appliance. The N series, which is based on WAFL, started out as an NFS appliance. The requirements for a file system operating in an NFS appliance are different from those for a general-purpose file system: NFS access patterns are different from local access patterns, and the special-purpose nature of an appliance also affects the design. Write Anywhere File Layout (WAFL) is the file system used in all Network Appliance Corporation file servers.
WAFL was designed to meet four primary requirements:
It must provide fast NFS service.
It must support large file systems (tens of GB) that grow dynamically as disks are added.
It must provide high performance while supporting Redundant Array of Independent Disks (RAID).
It must restart quickly, even after an unclean shutdown due to power failure or system crash.
The requirement for fast NFS service is obvious, with WAFL's intended use in an NFS appliance. Support for large file systems simplifies system administration by allowing all disk space to belong to a single large partition. Large file systems make RAID desirable because the probability of disk failure increases with the number of disks. Large file systems require special techniques for fast restart because the file system consistency checks for normal UNIX file systems become unacceptably slow as file systems grow.
NFS and RAID both strain write performance: NFS because servers must store data safely before replying to NFS requests, and RAID because of the read-modify-write sequence it uses to maintain parity. These pressures led us to use non-volatile RAM to reduce NFS response time and a write-anywhere design that allows WAFL to write to disk locations that minimize RAID's write performance penalty. The write-anywhere design enables Snapshots, which in turn eliminate the requirement for time-consuming consistency checks after power loss or system failure.
In later iterations, unified protocol access, including file protocols (NFS, CIFS) and block protocols (FC, iSCSI), has been added. Still, WAFL remains the underlying file system structure for the N series.
3.2 Write Anywhere File Layout design
WAFL is a compatible file system optimized for network file access. It is unique in that it stores sufficient information to make it compatible with a number of different client environments (NFS, CIFS, HTTP, and so on) and is optimized to maximize the reading and writing of disk content while supplying it to various types of network clients.
In many ways, WAFL is similar to other UNIX file systems (UFS), such as the Berkeley Fast File System (FFS) and the TransArc Episode file system (Figure 3-1). WAFL is a block-based file system that uses inodes to describe files (an inode stores all information about a file, directory, or other file system object except its data and its name).
Figure 3-1 Write Anywhere File Layout comparison
3.2.1 WAFL overview
WAFL is a UNIX compatible file system optimized for network file access. In many ways WAFL is similar to other UNIX file systems such as the Berkeley Fast File System (FFS) and TransArc's Episode file system. WAFL is a block-based file system that uses inodes to describe files. It uses 4 KB blocks with no fragments.
Each WAFL inode contains 16 block pointers to indicate which blocks belong to the file. Unlike FFS, all the block pointers in a WAFL inode refer to blocks at the same level. Thus, inodes for files smaller than 64 KB use the 16 block pointers to point to data blocks. Inodes for files smaller than 64 MB point to indirect blocks, which point to actual file data. Inodes for larger files point to doubly indirect blocks. For very small files, data is stored in the inode itself in place of the block pointers.
Figure 3-2 illustrates inode space usage. Each inode contains 16 block pointers, meaning that a single inode can directly address a file smaller than or equal to 64 KB. If a file exceeds the 64 KB limit, indirect (metadata) blocks are used to point to the actual data, while the data of very small files is stored directly in the inode.
Figure 3-2 The inode space usage
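The size thresholds above follow from simple arithmetic, which the following Python sketch works through. The 16 pointers and the 4 KB block size come from the text. The number of pointers per indirect block (1024, which corresponds to 4-byte pointers) and the 64-byte inline limit are assumptions made for illustration, and the function name is hypothetical.

```python
BLOCK_SIZE = 4 * 1024        # 4 KB blocks, as stated in the text
POINTERS_PER_INODE = 16      # as stated in the text
POINTERS_PER_BLOCK = 1024    # assumption: 4-byte pointers in a 4 KB indirect block
INLINE_LIMIT = 64            # assumption: bytes of file data that fit in the inode


def indirection_levels(file_size: int) -> int:
    """Return -1 if the data fits in the inode, else the number of indirect levels."""
    if file_size <= INLINE_LIMIT:
        return -1
    # With direct pointers, 16 pointers x 4 KB = 64 KB can be addressed.
    capacity = POINTERS_PER_INODE * BLOCK_SIZE
    levels = 0
    # Each extra level multiplies the addressable size by the pointers per block,
    # because all 16 inode pointers refer to blocks at the same level.
    while file_size > capacity:
        capacity *= POINTERS_PER_BLOCK
        levels += 1
    return levels
```

Under these assumptions, one level of indirection covers 16 x 1024 x 4 KB = 64 MB, which matches the boundary between indirect and doubly indirect blocks given above.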
3.2.2 Metadata resides in files
Like Episode, WAFL stores metadata in files. WAFL's three metadata files are the inode file, which contains the inodes for the file system; the block-map file, which identifies free blocks; and the inode-map file, which identifies free inodes. The term map is used instead of bit map because these files use more than one bit for each entry. The block-map file's format is described in detail next (see Figure 3-3).
Figure 3-3 Metadata files with regular files underneath
Keeping metadata in files allows WAFL to write metadata blocks anywhere on disk. This is the origin of the name WAFL, which stands for Write Anywhere File Layout. The write-anywhere design allows WAFL to operate efficiently with RAID by scheduling multiple writes to the same RAID stripe whenever possible to avoid the 4-to-1 write penalty that RAID incurs when it updates just one block in a stripe.
Keeping metadata in files makes it easy to increase the size of the file system on the fly. When a new disk is added, the N series server automatically increases the sizes of the metadata files. The system administrator can increase the number of inodes in the file system manually if the default is too small. Finally, the write-anywhere design enables the copy-on-write technique used by Snapshots. For Snapshots to work, WAFL must be able to write all new data, including metadata, to new locations on disk, instead of overwriting the old data. If WAFL stored metadata at fixed locations on disk, this would not be possible.
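The 4-to-1 penalty can be illustrated by counting disk I/Os. The sketch below is a simplified model, not a description of the N series RAID implementation: it assumes a single parity disk per stripe, and the function names and the stripe width of four data disks are hypothetical.

```python
def small_write_ios(blocks_updated: int) -> int:
    """I/Os when each updated block lands in a different stripe.

    Read-modify-write per block: read old data, read old parity,
    write new data, write new parity.
    """
    return 4 * blocks_updated


def full_stripe_ios(data_disks: int) -> int:
    """I/Os when a whole stripe is written at once.

    Parity is computed from the new data alone, so no reads are needed:
    one write per data disk plus one parity write.
    """
    return data_disks + 1


DATA_DISKS = 4                                # hypothetical stripe width
scattered = small_write_ios(DATA_DISKS)       # four blocks in four different stripes
gathered = full_stripe_ios(DATA_DISKS)        # the same four blocks in one stripe
```

In this model, writing four blocks to four different stripes costs 16 I/Os, while gathering them into one stripe costs 5, which is why scheduling writes to the same stripe pays off.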
3.2.3 A tree of blocks
A WAFL file system is best thought of as a tree of blocks. At the root of the tree structure is the root inode, as shown in Figure 3-3. The root inode is a special inode that describes the inode file. The inode file contains the inodes that describe the rest of the files in the file system, including the block-map and inode-map files. The leaves of the tree are the data blocks of all the files.
Figure 3-4 shows a more detailed version of Figure 3-3. It illustrates that files are made up of individual blocks, and that large files have additional layers of indirection between the inode and the actual data blocks. In order for WAFL to boot, it must be able to find the root of this tree, so the only exception to the WAFL write-anywhere rule is that the block containing the root inode must reside at a fixed location on disk where WAFL can find it.
Figure 3-4 Detailed view of the WAFL tree of blocks
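The tree-of-blocks view explains why writing new data to new locations leaves older images of the file system intact. The following Python sketch models a generic copy-on-write tree under simplifying assumptions: the `Disk` class, the `cow_update` and `read_leaf` functions, and the block numbering are all hypothetical, and the fixed location of the root inode is not modeled.

```python
from typing import Dict, Tuple, Union

# A block holds either file data or a tuple of child block numbers.
Block = Union[bytes, Tuple[int, ...]]


class Disk:
    """Hypothetical disk that only ever writes to blocks not yet in use."""

    def __init__(self) -> None:
        self.blocks: Dict[int, Block] = {}
        self.next_free = 0

    def write_new(self, content: Block) -> int:
        bno = self.next_free
        self.next_free += 1
        self.blocks[bno] = content
        return bno


def cow_update(disk: Disk, root: int, path: Tuple[int, ...], data: bytes) -> int:
    """Replace the leaf reached by `path`; return the block number of the new root."""
    if not path:
        return disk.write_new(data)              # new leaf, old leaf untouched
    children = list(disk.blocks[root])
    children[path[0]] = cow_update(disk, children[path[0]], path[1:], data)
    return disk.write_new(tuple(children))       # new copy of this interior block


def read_leaf(disk: Disk, root: int, path: Tuple[int, ...]) -> Block:
    node = root
    for index in path:
        node = disk.blocks[node][index]
    return disk.blocks[node]


# Build a small tree: a root, two interior blocks, four data blocks.
disk = Disk()
leaves = [disk.write_new(d) for d in (b"A", b"B", b"C", b"D")]
left = disk.write_new((leaves[0], leaves[1]))
right = disk.write_new((leaves[2], leaves[3]))
old_root = disk.write_new((left, right))

# Update one leaf; only the blocks on the path to the root are rewritten.
new_root = cow_update(disk, old_root, (1, 0), b"C2")
```

After the update, `old_root` still describes the original contents and `new_root` describes the changed ones, while the untouched left subtree is shared by both. Only three new blocks are written: the leaf, its parent, and the root.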
3.3 File system consistency and non-volatile RAM
WAFL avoids the need for file system consistency checking after an unclean shutdown by creating a special Snapshot called a consistency point every few seconds. Unlike other Snapshots, a consistency point has no name, and it is not accessible through NFS. However, like all Snapshots, a consistency point is a completely self-consistent image of the entire file system. When WAFL restarts, it simply reverts to the most recent consistency point. This allows an N series server to reboot in about a minute even with 20 GB or more of data in its single partition.
Between consistency points, WAFL does write data to disk, but it writes only to blocks that are not in use, so the tree of blocks representing the most recent consistency point remains completely unchanged. WAFL processes hundreds or thousands of NFS requests between consistency points, so the on-disk image of the file system remains the same for many seconds until WAFL writes a new consistency point, at which time the on-disk image advances atomically to a new state that reflects the changes made by the new requests. Although this technique is unusual for a UNIX file system, it is well known for databases. Even in databases, it is unusual to write as many operations at one time as WAFL does in its consistency points.
WAFL uses non-volatile RAM (NVRAM) to keep a log of NFS requests it has processed since the last consistency point. (NVRAM is special memory with batteries that allow it to store data even when system power is off.) After an unclean shutdown, WAFL replays any requests in the log to prevent them from being lost. When an N series server shuts down normally, it creates one last consistency point after suspending NFS service. Thus, on a clean shutdown, the NVRAM does not contain any unprocessed NFS requests, and it is turned off to increase its battery life.
WAFL actually divides the NVRAM into two separate logs. When one log gets full, WAFL switches to the other log and starts writing a consistency point to store the changes from the first log safely on disk. WAFL schedules a consistency point every 10 seconds, even if the log is not full, to prevent the on-disk image of the file system from getting too far out of date.
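The two-log arrangement can be sketched as a small model. This is an illustration under stated assumptions, not the WAFL implementation: the class and method names are hypothetical, capacity is counted in requests rather than bytes, the flush to disk is synchronous, and the 10-second timer is represented only by the fact that `consistency_point` can be called at any time.

```python
from typing import List


class NvramLog:
    """Hypothetical model of an NVRAM request log split into two halves."""

    def __init__(self, half_capacity: int) -> None:
        self.half_capacity = half_capacity
        self.halves: List[List[str]] = [[], []]
        self.active = 0                  # index of the half receiving new requests
        self.consistency_points = 0

    def log_request(self, request: str) -> None:
        # When the active half is full, switch halves and flush the full one.
        if len(self.halves[self.active]) >= self.half_capacity:
            self.consistency_point()
        self.halves[self.active].append(request)

    def consistency_point(self) -> None:
        """Switch to the other half, then make the logged changes durable.

        Called when a half fills up; a timer could also call it periodically.
        """
        full = self.active
        self.active = 1 - full
        # A real system would write the changes to disk at this point while
        # new requests continue to arrive in the other half.
        self.halves[full].clear()
        self.consistency_points += 1

    def replay_after_crash(self) -> List[str]:
        """Requests logged since the last consistency point, oldest first."""
        # In this synchronous sketch the inactive half is always empty here.
        return list(self.halves[1 - self.active]) + list(self.halves[self.active])
```

For example, with a half capacity of two requests, logging three requests triggers one consistency point, and only the third request would need to be replayed after an unclean shutdown.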
Logging NFS requests to NVRAM has several advantages over the more common technique of using NVRAM to cache writes at the disk driver layer. Lyon and Sandberg describe the NVRAM write cache technique, which Legato's Prestoserve NFS accelerator uses.
Processing an NFS request and caching the resulting disk writes generally takes much more NVRAM than simply logging the information required to replay the request. For instance, to move a file from one directory to another, the file system must update the contents and inodes of both the source and target directories. In FFS, where blocks are 8 KB each, this uses 32 KB of cache space. WAFL uses about 150 bytes to log the information needed to replay a rename operation. Rename, with its factor of 200 difference in NVRAM usage, is an extreme case, but even for a simple 8 KB write, caching disk blocks consumes 8 KB for the data, 8 KB for the inode update, and, for large files, another 8 KB for the indirect block. WAFL logs just the 8 KB of data along with about 120 bytes of header information. With a typical mix of NFS operations, WAFL can store more than 1000 operations per megabyte of NVRAM.
Using NVRAM as a cache of unwritten disk blocks turns it into an integral part of the disk subsystem. An NVRAM failure can corrupt the file system in ways that fsck cannot detect or repair. If something goes wrong with WAFL's NVRAM, WAFL might lose a few NFS requests, but the on-disk image of the file system remains completely self-consistent. This is important because NVRAM is reliable, but not as reliable as a RAID disk array.
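The figures in this comparison can be checked with a few lines of arithmetic. The sketch below uses only the sizes quoted in the text; the variable names are illustrative.

```python
KB = 1024

# Rename: FFS caches the contents and inodes of both directories, 8 KB each.
rename_cache_bytes = 4 * 8 * KB                 # 32 KB
rename_log_bytes = 150                          # approximate WAFL log entry
rename_ratio = rename_cache_bytes / rename_log_bytes    # about 218

# 8 KB write to a large file: data, inode update, and indirect block.
write_cache_bytes = 3 * 8 * KB                  # 24 KB
write_log_bytes = 8 * KB + 120                  # data plus header
write_ratio = write_cache_bytes / write_log_bytes       # just under 3

# More than 1000 operations per megabyte implies an average log entry
# of less than about 1049 bytes across the operation mix.
max_avg_entry_bytes = (1024 * KB) / 1000
```

The rename ratio of about 218 is the source of the "factor of 200" figure, and even the plain 8 KB write needs nearly three times as much NVRAM when disk blocks are cached instead of requests being logged.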
A final advantage of logging NFS requests is that it improves NFS response times. To reply to an NFS request, a file system without any NVRAM must update its in-memory data structures, allocate disk space for new data, and wait for all modified data to reach disk. A file system with an NVRAM write cache does all the same steps, except that it copies modified data into NVRAM instead of waiting for the data to reach disk. WAFL can reply to an NFS request much more quickly because it need only update its in-memory data structures and log the request. It does not allocate disk space for new data or copy modified data to NVRAM.
3.4 Write allocation
Write performance is especially important for network file servers. Ousterhout observed that as read caches get larger at both the client and server, writes begin to dominate the I/O subsystem. This effect is especially pronounced with NFS, which allows very little client-side write caching. The result is that the disks on an NFS server might have five times as many write operations as reads.
WAFL's design was motivated largely by a desire to maximize the flexibility of its write allocation policies. This flexibility takes three forms:
WAFL can write any file system block (except the one containing the root inode) to any location on disk. In FFS, metadata, such as inodes and bit maps, is kept in fixed locations on disk. This prevents FFS from optimizing writes by, for example, putting both the data for a newly updated file and its inode right next to each other on disk. Because WAFL can write metadata anywhere on disk, it can optimize writes more creatively.
WAFL can write blocks to disk in any order. FFS writes blocks to disk in a carefully determined order so that fsck can restore file system consistency after an unclean shutdown. WAFL can write blocks in any order because the on-disk image of the file system changes only when WAFL writes a consistency point. The one constraint is that WAFL must write all the blocks in a new consistency point before it writes the root inode for the consistency point.
WAFL can allocate disk space for many NFS operations at once in a single write episode. FFS allocates disk space as it processes each NFS request. WAFL gathers up hundreds of NFS requests before scheduling a consistency point, at which time it allocates blocks for all requests in the consistency point at once. Deferring write allocation improves the latency of NFS operations by removing disk allocation from the processing path of the reply, and it avoids wasting time allocating space for blocks that are removed before they reach disk.
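The second and third points can be combined in a short sketch of what a consistency point has to do: allocate space once for everything that survived, write those blocks in any order, and write the root inode last. This is a simplified model rather than WAFL's allocator; the function name, the operation tuples, and the fixed root location of block 0 are hypothetical.

```python
from typing import List, Tuple

ROOT_LOCATION = 0   # hypothetical fixed location of the root inode block


def plan_consistency_point(
    pending_ops: List[Tuple[str, str]], free_blocks: List[int]
) -> List[Tuple[int, str]]:
    """Return (location, item) writes for one consistency point, root inode last."""
    live = {}
    for op, name in pending_ops:
        if op == "write":
            live[name] = None
        elif op == "delete":
            # Removed before reaching disk: no space is ever allocated for it.
            live.pop(name, None)
    # Allocate for all surviving items at once; their relative order is free.
    writes = [(free_blocks.pop(0), name) for name in live]
    # The one ordering constraint: the root inode is written after everything else.
    writes.append((ROOT_LOCATION, "root inode"))
    return writes


ops = [("write", "a"), ("write", "tmp"), ("delete", "tmp"), ("write", "b")]
free = [10, 11, 12]
plan = plan_consistency_point(ops, free)
```

In this example the temporary item never consumes a block, so one free block remains after planning, and the root inode write comes last so that the on-disk image changes only once all other blocks are in place.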
These features give WAFL extraordinary flexibility in its write allocation policies. The ability to schedule writes for many requests at once enables more intelligent allocation policies, and the fact that blocks can be written to any location and in any order allows a wide variety of strategies. It is easy to try new block allocation strategies without any change to WAFL's on-disk data structures.
The details of WAFL's write allocation policies are outside the scope of this chapter. In short, WAFL improves RAID performance by writing to multiple blocks in the same stripe; WAFL reduces seek time by writing blocks to locations that are near each other on disk; and WAFL reduces head contention when reading large files by placing sequential blocks in a file on a single disk in the RAID array. Optimizing write allocation is difficult because these goals often conflict.
3.5 Summary
WAFL was developed and became stable surprisingly quickly for a new file system. We attribute this stability in part to WAFL's use of consistency points. Processing file system requests is simple because WAFL updates only in-memory data structures and the NVRAM log. Consistency points eliminate ordering constraints for disk writes, which are a significant source of errors in most file systems. The code that writes consistency points is concentrated in a single file and interacts little with the rest of WAFL.
More importantly, it is much easier to develop high-quality, high-performance system software for an appliance than for a general-purpose operating system. Special purpose file systems also have difficulty achieving good performance and reliability because they are often hosted on general-purpose platforms, which limits their efficiencies and reliability.
Compared with a general-purpose file system, WAFL handles a regular and simple set of requests. A general-purpose file system receives requests from thousands of different applications with a wide variety of different access patterns, and new applications are added frequently. By contrast, WAFL receives requests only from the network-attached storage or SAN client modules of other systems that have been implemented following a strict regime of industry-developed protocol definitions. iSCSI, NFS, FTP, and HTTP all must function the same regardless of which platform they are running on because the protocol that they follow is well constructed. CIFS is only available from a single source, so it too is well constrained.
Of course, applications are the ultimate source of I/O requests, but the client code converts application requests into a regular pattern of network requests, and it filters out error cases before they reach the server. The small number of operations that WAFL supports makes it possible to define and test the entire range of inputs that it is expected to handle.
These advantages apply to any IBM System Storage N series system, not just to those used as file server appliances. Network-attached storage only makes sense for protocols that are well defined and widely used, but for such protocols, network-attached storage can provide important advantages over a general-purpose computer.
 