In this section we’ll briefly describe the enhanced filesystem that has evolved from Ext2, named Ext3. The new filesystem has been designed with two simple concepts in mind:
To be a journaling filesystem (see the next section)
To be, as much as possible, compatible with the old Ext2 filesystem
Ext3 achieves both goals very well. In particular, it is largely based on Ext2, so its data structures on disk are essentially identical to those of an Ext2 filesystem. As a matter of fact, if an Ext3 filesystem has been cleanly unmounted, it can be remounted as an Ext2 filesystem; conversely, creating a journal on an Ext2 filesystem and remounting it as an Ext3 filesystem is a simple, fast operation.
Thanks to the compatibility between Ext3 and Ext2, most descriptions in the previous sections of this chapter apply to Ext3 as well. Therefore, in this section, we focus on the new feature offered by Ext3: the journal.
As disks became larger, one design choice of traditional Unix filesystems (like Ext2) turned out to be inappropriate. As we know from Chapter 14, updates to filesystem blocks might be kept in dynamic memory for a long period of time before being flushed to disk. A dramatic event like a power-down failure or a system crash might thus leave the filesystem in an inconsistent state. To overcome this problem, each traditional Unix filesystem is checked before being mounted; if it has not been properly unmounted, a specific program executes an exhaustive, time-consuming check and fixes all the filesystem’s data structures on disk.
For instance, the Ext2 filesystem status is stored in the s_mount_state field of the superblock on disk. The e2fsck utility program is invoked by the boot script to check the value stored in this field; if it is not equal to EXT2_VALID_FS, the filesystem was not properly unmounted, and therefore e2fsck starts checking all disk data structures of the filesystem.
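The test on the mount state can be sketched in a few lines of C. This is a minimal user-space illustration, not the actual e2fsck code: the structure is a hypothetical, heavily reduced stand-in for the on-disk superblock, and the numeric value given to EXT2_VALID_FS is assumed to match the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value assumed to match the Ext2 headers: filesystem cleanly unmounted. */
#define EXT2_VALID_FS 0x0001

/* Hypothetical, heavily reduced stand-in for the on-disk superblock. */
struct sb_sketch {
    uint16_t s_mount_state;
};

/* Returns 1 if an exhaustive consistency check is required, 0 otherwise. */
static int needs_full_check(const struct sb_sketch *sb)
{
    assert(sb != NULL);
    return sb->s_mount_state != EXT2_VALID_FS;
}
```

Any state other than EXT2_VALID_FS triggers the full check, which is exactly the cost that journaling is meant to avoid.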
Clearly, the time spent checking the consistency of a filesystem depends mainly on the number of files and directories to be examined; therefore, it also depends on the disk size. Nowadays, with filesystems reaching hundreds of gigabytes, a single consistency check may take hours. The resulting downtime is unacceptable for any production environment or high-availability server.
The goal of a journaling filesystem is to avoid running time-consuming consistency checks on the whole filesystem by looking instead in a special disk area, named the journal, that contains the most recent disk write operations. Remounting a journaling filesystem after a system failure is a matter of a few seconds.
The idea behind Ext3 journaling is to perform any high-level change to the filesystem in two steps. First, a copy of the blocks to be written is stored in the journal; then, when the I/O data transfer to the journal is completed (in short, data is committed to the journal), the blocks are written in the filesystem. When the I/O data transfer to the filesystem terminates (data is committed to the filesystem), the copies of the blocks in the journal are discarded.
While recovering after a system failure, the e2fsck program distinguishes the following two cases:
The system failure occurred before a commit to the journal. Either the copies of the blocks relative to the high-level change are missing from the journal or they are incomplete; in both cases, e2fsck ignores them.
The system failure occurred after a commit to the journal. The copies of the blocks are valid, and e2fsck writes them into the filesystem.
In the first case, the high-level change to the filesystem is lost, but the filesystem state is still consistent. In the second case, e2fsck applies the whole high-level change, thus fixing any inconsistency due to unfinished I/O data transfers into the filesystem.
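The two recovery cases can be illustrated with a toy model. The following C sketch is not how e2fsck or the kernel represents the journal; every name in it (jentry, toy_log, toy_commit, toy_recover) is hypothetical, and a single byte stands in for a whole disk block. It only shows the rule: an uncommitted change is ignored, a committed one is replayed as a whole.

```c
#include <assert.h>

#define NBLOCKS 8   /* size of the toy filesystem, in blocks */
#define MAXREC  4   /* maximum block copies per journal entry */

/* One high-level change as stored in the toy journal. */
struct jentry {
    int  nrec;             /* number of block copies written so far */
    int  blocknr[MAXREC];  /* destination block inside the filesystem */
    char data[MAXREC];     /* one byte stands in for a whole block */
    int  committed;        /* set only once all copies are in the journal */
};

/* Step 1: store a copy of a block in the journal. */
static void toy_log(struct jentry *e, int blocknr, char data)
{
    assert(e->nrec < MAXREC && !e->committed);
    assert(blocknr >= 0 && blocknr < NBLOCKS);
    e->blocknr[e->nrec] = blocknr;
    e->data[e->nrec] = data;
    e->nrec++;
}

/* Mark the journal copy as complete (the commit to the journal). */
static void toy_commit(struct jentry *e)
{
    e->committed = 1;
}

/* Recovery: replay a committed entry, ignore an uncommitted one.
 * Returns the number of blocks rewritten into the filesystem. */
static int toy_recover(const struct jentry *e, char *fs)
{
    int i;

    if (!e->committed)
        return 0;                        /* first case: change is lost */
    for (i = 0; i < e->nrec; i++)
        fs[e->blocknr[i]] = e->data[i];  /* second case: reapply it all */
    return e->nrec;
}
```

Replaying a committed entry more than once is harmless in this model because it rewrites whole blocks with the same content.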
Don’t expect too much from a journaling filesystem; it ensures consistency only at the system call level. For instance, a system failure that occurs while you are copying a large file by issuing several write( ) system calls will interrupt the copy operation, so the duplicated file will be shorter than the original one.
Furthermore, journaling filesystems do not usually copy all blocks into the journal. In fact, each filesystem consists of two kinds of blocks: those containing the so-called metadata and those containing regular data. In the case of Ext2 and Ext3, there are six kinds of metadata: superblocks, block group descriptors, inodes, blocks used for indirect addressing (indirection blocks), data bitmap blocks, and inode bitmap blocks. Other filesystems may use different metadata.
Most journaling filesystems, like ReiserFS, SGI’s XFS, and IBM’s JFS, limit themselves to logging the operations affecting metadata. In fact, the log records of the metadata are sufficient to restore the consistency of the on-disk filesystem data structures. However, since operations on blocks of file data are not logged, nothing prevents a system failure from corrupting the contents of the files.
The Ext3 filesystem, however, can be configured to log the operations affecting both the filesystem metadata and the data blocks of the files. Since logging every kind of write operation leads to a significant performance penalty, Ext3 lets the system administrator decide what has to be logged; in particular, it offers three different journaling modes:
Journal
All filesystem data and metadata changes are logged into the journal. This mode minimizes the chance of losing the updates made to each file, but it requires many additional disk accesses. For example, when a new file is created, all its data blocks must be duplicated as log records. This is the safest and slowest Ext3 journaling mode.
Ordered
Only changes to filesystem metadata are logged into the journal. However, the Ext3 filesystem groups metadata and relative data blocks so that data blocks are written to disk before the metadata. This way, the chance of data corruption inside the files is reduced; for instance, any write access that enlarges a file is guaranteed to be fully protected by the journal. This is the default Ext3 journaling mode.
Writeback
Only changes to filesystem metadata are logged; this is the method found on the other journaling filesystems and is the fastest mode.
The journaling mode of the Ext3 filesystem is specified by an option of the mount system command. For instance, to mount an Ext3 filesystem stored in the /dev/sda2 partition on the /jdisk mount point with the “writeback” mode, the system administrator can type the command:
# mount -t ext3 -o data=writeback /dev/sda2 /jdisk
The Ext3 journal is usually stored in a hidden file named .journal located in the root directory of the filesystem.
The Ext3 filesystem does not handle the journal on its own; rather, it uses a general kernel layer named Journaling Block Device, or JBD. Right now, only Ext3 uses the JBD layer, but other filesystems might use it in the future.
The JBD layer is a rather complex piece of software. The Ext3 filesystem invokes the JBD routines to ensure that its subsequent operations don’t corrupt the disk data structures in case of system failure. However, JBD typically uses the same disk to log the changes performed by the Ext3 filesystem, and it is therefore vulnerable to system failures as much as Ext3. In other words, JBD must also protect itself from any system failure that could corrupt the journal.
Therefore, the interaction between Ext3 and JBD is essentially based on three fundamental units:
Log record
Describes a single update of a disk block of the journaling filesystem.
Atomic operation handle
Includes log records relative to a single high-level change of the filesystem; typically, each system call modifying the filesystem gives rise to a single atomic operation handle.
Transaction
Includes several atomic operation handles whose log records are marked valid for e2fsck at the same time.
A log record is essentially the description of a low-level operation that is going to be issued by the filesystem. In some journaling filesystems, the log record consists of exactly the span of bytes modified by the operation, together with the starting position of the bytes inside the filesystem. The JBD layer, however, uses log records consisting of the whole buffer modified by the low-level operation. This approach may waste a lot of journal space (for instance, when the low-level operation just changes the value of a bit in a bitmap), but it is also much faster because the JBD layer can work directly with buffers and their buffer heads.
Log records are thus represented inside the journal as normal blocks of data (or metadata). Each such block, however, is associated with a small tag of type journal_block_tag_t, which stores the logical block number of the block inside the filesystem and a few status flags.
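To make the idea of a tag concrete, here is a minimal C sketch of a structure holding the two pieces of information mentioned above. It is a simplified stand-in, not the kernel definition of journal_block_tag_t: the type and field names are hypothetical, and the "last tag" flag is an illustrative assumption about what one of the status flags might express.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag: marks the final tag in a group of log records. */
#define TAG_FLAG_LAST 0x8

/* Simplified stand-in for journal_block_tag_t. */
struct tag_sketch {
    uint32_t blocknr;  /* logical block number inside the filesystem */
    uint32_t flags;    /* status flags */
};

/* Build one tag per logged block; only the final tag gets TAG_FLAG_LAST. */
static void fill_tags(struct tag_sketch *tags, const uint32_t *blocks, int n)
{
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        tags[i].blocknr = blocks[i];
        tags[i].flags = (i == n - 1) ? TAG_FLAG_LAST : 0;
    }
}
```

Because each log record is a whole buffer, the tag is the only extra information needed to know where the block belongs in the filesystem.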
Later, whenever a buffer is being considered by the JBD, either because it belongs to a log record or because it is a data block that should be flushed to disk before the corresponding metadata block (in the “ordered” journaling mode), the kernel attaches a journal_head data structure to the buffer head. In this case, the b_private field of the buffer head stores the address of the journal_head data structure and the BH_JBD flag is set (see Section 13.4.4).
Any system call modifying the filesystem is usually split into a series of low-level operations that manipulate disk data structures.
For instance, suppose that Ext3 must satisfy a user request to append a block of data to a regular file. The filesystem layer must determine the last block of the file, locate a free block in the filesystem, update the data block bitmap inside the proper block group, store the logical number of the new block either in the file’s inode or in an indirect addressing block, write the contents of the new block, and finally, update several fields of the inode. As you see, the append operation translates into many lower-level operations on the data and metadata blocks of the filesystem.
Now, just imagine what could happen if a system failure occurred in the middle of an append operation, when some of the lower-level manipulations have already been executed while others have not. Of course, the scenario could be even worse, with high-level operations affecting two or more files (for example, moving a file from one directory to another).
To prevent data corruption, the Ext3 filesystem must ensure that each system call is handled in an atomic way. An atomic operation handle is a set of low-level operations on the disk data structures that correspond to a single high-level operation. When recovering from a system failure, the filesystem ensures that either the whole high-level operation is applied or none of its low-level operations is.
Any atomic operation handle is represented by a descriptor of type handle_t. To start an atomic operation, the Ext3 filesystem invokes the journal_start( ) JBD function, which allocates, if necessary, a new atomic operation handle and inserts it into the current transaction (see the next section). Since any low-level operation on the disk might suspend the process, the address of the active handle is stored in the journal_info field of the process descriptor. To notify that an atomic operation is completed, the Ext3 filesystem invokes the journal_stop( ) function.
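The "allocates, if necessary" behavior can be sketched in user-space C. This is an illustration under stated assumptions, not the JBD code: the structures are reduced stand-ins for handle_t and the process descriptor, the sketch_ function names are hypothetical, and the use of a reference counter to track nested starts is an assumption about how reuse of the active handle could be implemented.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced stand-in for handle_t. */
struct handle_sketch {
    int h_ref;  /* how many starts are still waiting for their stop */
};

/* Reduced stand-in for the process descriptor. */
struct task_sketch {
    struct handle_sketch *journal_info;  /* active handle, or NULL */
};

/* Start an atomic operation: reuse the active handle or allocate one. */
static struct handle_sketch *sketch_journal_start(struct task_sketch *task)
{
    struct handle_sketch *h = task->journal_info;

    if (h) {            /* nested call: reuse the active handle */
        h->h_ref++;
        return h;
    }
    h = calloc(1, sizeof(*h));
    if (!h)
        return NULL;    /* allocation failed */
    h->h_ref = 1;
    task->journal_info = h;
    return h;
}

/* Stop an atomic operation; the handle is released on the last stop. */
static void sketch_journal_stop(struct task_sketch *task)
{
    struct handle_sketch *h = task->journal_info;

    assert(h != NULL && h->h_ref > 0);
    if (--h->h_ref == 0) {
        task->journal_info = NULL;
        free(h);
    }
}
```

Keeping the handle address in the process descriptor is what lets a nested start find the handle again after the process has been suspended and resumed.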
For reasons of efficiency, the JBD layer manages the journal by grouping the log records that belong to several atomic operation handles into a single transaction. Furthermore, all log records relative to a handle must be included in the same transaction.
All log records of a transaction are stored in consecutive blocks of the journal. The JBD layer handles each transaction as a whole. For instance, it reclaims the blocks used by a transaction only after all data included in its log records is committed to the filesystem.
As soon as it is created, a transaction may accept log records of new handles. The transaction stops accepting new handles when either of the following occurs:
A fixed amount of time has elapsed, typically 5 seconds.
There are no free blocks left in the journal for a new handle.
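The two closing conditions translate into a simple predicate. The C sketch below is only an illustration: the structure and function names are hypothetical, time is counted in whole seconds, and the number of blocks a new handle would need is passed in by the caller as an assumption about how the space check could be expressed.

```c
#include <assert.h>

#define COMMIT_INTERVAL_SECS 5  /* typical value quoted in the text */

/* Hypothetical summary of the transaction that is accepting handles. */
struct txn_sketch {
    long start_time;           /* seconds, when the transaction was created */
    int  journal_free_blocks;  /* free blocks left in the journal */
};

/* Returns 1 if the transaction must stop accepting new handles.
 * blocks_needed is the journal space a new handle would require. */
static int must_close(const struct txn_sketch *t, long now, int blocks_needed)
{
    assert(now >= t->start_time);
    if (now - t->start_time >= COMMIT_INTERVAL_SECS)
        return 1;  /* the fixed amount of time has elapsed */
    if (t->journal_free_blocks < blocks_needed)
        return 1;  /* no room left in the journal for a new handle */
    return 0;
}
```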
A transaction is represented by a descriptor of type transaction_t. The most important field is t_state, which describes the current status of the transaction.
Essentially, a transaction can be:
Complete
All log records included in the transaction have been physically written onto the journal. When recovering from a system failure, e2fsck considers every complete transaction of the journal and writes the corresponding blocks into the filesystem. In this case, the t_state field stores the value T_FINISHED.
Incomplete
At least one log record included in the transaction has not yet been physically written to the journal, or new log records are still being added to the transaction. In case of system failure, the image of the transaction stored in the journal is likely not up to date. Therefore, when recovering from a system failure, e2fsck does not trust the incomplete transactions in the journal and skips them. In this case, the t_state field stores one of the following values:
T_RUNNING
Still accepting new atomic operation handles.
T_LOCKED
Not accepting new atomic operation handles, but some of them are still unfinished.
T_FLUSH
All atomic operation handles have finished, but some log records are still being written to the journal.
T_COMMIT
All log records of the atomic operation handles have been written to disk, and the transaction is marked as completed on the journal.
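The classification above can be summarized in code. The following C sketch uses the state names given in the text, but the numeric values of the enumeration are arbitrary and the two helper functions are hypothetical; it is a reading aid, not the JBD definition.

```c
#include <assert.h>

/* States named in the text; the numeric values are arbitrary here. */
enum txn_state { T_RUNNING, T_LOCKED, T_FLUSH, T_COMMIT, T_FINISHED };

/* Only a complete (T_FINISHED) transaction is replayed during recovery. */
static int replay_on_recovery(enum txn_state s)
{
    assert((int)s >= T_RUNNING && (int)s <= T_FINISHED);
    return s == T_FINISHED;
}

/* Only the running transaction accepts new atomic operation handles. */
static int accepts_new_handles(enum txn_state s)
{
    assert((int)s >= T_RUNNING && (int)s <= T_FINISHED);
    return s == T_RUNNING;
}
```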
At any given instant, the journal may include several transactions. Just one of them is in the T_RUNNING state: it is the active transaction that is accepting the new atomic operation handle requests issued by the Ext3 filesystem.
Several transactions in the journal might be incomplete because the buffers containing the relative log records have not yet been written to the journal.
A complete transaction is deleted from the journal only when the JBD layer verifies that all buffers described by its log records have been successfully written onto the Ext3 filesystem. Until then, its log records are in the journal, but some of the corresponding buffers may still have to be written onto the filesystem. The journal can therefore include several complete transactions in addition to the incomplete ones.
Let’s try to explain how journaling works with an example: the Ext3 filesystem layer receives a request to write some data blocks of a regular file.
As you might easily guess, we are not going to describe in detail every single operation of the Ext3 filesystem layer and of the JBD layer. There would be far too many issues to be covered! However, we describe the essential actions:
The service routine of the write( ) system call triggers the write method of the file object associated with the Ext3 regular file. For Ext3, this method is implemented by the generic_file_write( ) function, already described in Section 15.1.3.
The generic_file_write( ) function invokes the prepare_write method of the address_space object several times, once for every page of data involved in the write operation. For Ext3, this method is implemented by the ext3_prepare_write( ) function.
The ext3_prepare_write( ) function starts a new atomic operation by invoking the journal_start( ) JBD function. The handle is added to the active transaction. Actually, the atomic operation handle is created only when executing the first invocation of the journal_start( ) function. Following invocations verify that the journal_info field of the process descriptor is already set and use the referenced handle.
The ext3_prepare_write( ) function invokes the block_prepare_write( ) function already described in Chapter 15, passing to it the address of the ext3_get_block( ) function. Remember that block_prepare_write( ) takes care of preparing the buffers and the buffer heads of the file’s page.
When the kernel must determine the logical number of a block of the Ext3 filesystem, it executes the ext3_get_block( ) function. This function is actually similar to ext2_get_block( ), which is described in the earlier Section 17.6.5. A crucial difference, however, is that the Ext3 filesystem invokes functions of the JBD layer to ensure that the low-level operations are logged:
Before issuing a low-level write operation on a metadata block of the filesystem, the function invokes journal_get_write_access( ). Basically, this latter function adds the metadata buffer to a list of the active transaction. However, it must also check whether the metadata is included in an older incomplete transaction of the journal; in this case, it duplicates the buffer to make sure that the older transactions are committed with the old content.
After updating the buffer containing the metadata block, the Ext3 filesystem invokes journal_dirty_metadata( ) to move the metadata buffer to the proper dirty list of the active transaction and to log the operation in the journal.
Notice that metadata buffers handled by the JBD layer are not usually included in the dirty lists of buffers of the inode, so they are not written to disk by the normal disk cache flushing mechanisms described in Chapter 14.
If the Ext3 filesystem has been mounted in “journal” mode, the ext3_prepare_write( ) function also invokes journal_get_write_access( ) on every buffer touched by the write operation.
Control returns to the generic_file_write( ) function, which updates the page with the data stored in the User Mode address space and then invokes the commit_write method of the address_space object. For Ext3, this method is implemented by the ext3_commit_write( ) function.
If the Ext3 filesystem has been mounted in “journal” mode, the ext3_commit_write( ) function invokes journal_dirty_metadata( ) on every buffer of data (not metadata) in the page. This way, the buffer is included in the proper dirty list of the active transaction and not in the dirty list of the owner inode; moreover, the corresponding log records are written to the journal.
If the Ext3 filesystem has been mounted in “ordered” mode, the ext3_commit_write( ) function invokes the journal_dirty_data( ) function on every buffer of data in the page to insert the buffer in a proper list of the active transaction. The JBD layer ensures that all buffers in this list are written to disk before the metadata buffers of the transaction. No log record is written onto the journal.
If the Ext3 filesystem has been mounted in “ordered” or “writeback” mode, the ext3_commit_write( ) function executes the normal generic_commit_write( ) function described in Chapter 15, which inserts the data buffers in the list of the dirty buffers of the owner inode.
Finally, ext3_commit_write( ) invokes journal_stop( ) to notify the JBD layer that the atomic operation handle is closed.
The service routine of the write( ) system call terminates here. However, the JBD layer has not finished its work. Eventually, our transaction becomes complete when all its log records have been physically written to the journal. Then journal_commit_transaction( ) is executed.
If the Ext3 filesystem has been mounted in “ordered” mode, the journal_commit_transaction( ) function activates the I/O data transfers for all data buffers included in the list of the transaction and waits until all data transfers terminate.
The journal_commit_transaction( ) function activates the I/O data transfers for all metadata buffers included in the transaction (and also for all data buffers, if Ext3 was mounted in “journal” mode).
Periodically, the kernel activates a checkpoint activity for every complete transaction in the journal. The checkpoint basically involves verifying whether the I/O data transfers triggered by journal_commit_transaction( ) have successfully terminated. If so, the transaction can be deleted from the journal.
Of course, the log records in the journal never play an active role until a system failure occurs. Only in this case, in fact, does the e2fsck utility program scan the journal stored in the filesystem and reschedule all write operations described by the log records of the complete transactions.
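The way ext3_commit_write( ) treats data buffers in each journaling mode, as described in the steps above, can be condensed into a small dispatch function. The C sketch below is a summary aid only: the enumeration, the bitmask constants, and commit_write_actions( ) are hypothetical names, and each bit simply records which of the functions named in the text would be applied to a data buffer.

```c
#include <assert.h>

enum data_mode { MODE_JOURNAL, MODE_ORDERED, MODE_WRITEBACK };

/* Actions applied to each data buffer, encoded as a bitmask. */
#define ACT_JOURNAL_DIRTY_METADATA 0x1  /* log the data block itself */
#define ACT_JOURNAL_DIRTY_DATA     0x2  /* flush before the metadata */
#define ACT_GENERIC_COMMIT_WRITE   0x4  /* put on the inode's dirty list */

/* Returns the set of actions taken on data buffers for a given mode. */
static unsigned commit_write_actions(enum data_mode mode)
{
    switch (mode) {
    case MODE_JOURNAL:
        return ACT_JOURNAL_DIRTY_METADATA;
    case MODE_ORDERED:
        return ACT_JOURNAL_DIRTY_DATA | ACT_GENERIC_COMMIT_WRITE;
    case MODE_WRITEBACK:
        return ACT_GENERIC_COMMIT_WRITE;
    }
    assert(!"unknown journaling mode");
    return 0;
}
```

Seen this way, "ordered" mode is "writeback" mode plus one extra constraint on write ordering, which is why it logs nothing more than metadata.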