The storage of a file on disk
differs from the view the programmer has of the file in two ways:
blocks can be scattered around the disk (although the filesystem
tries hard to keep blocks sequential to improve access time), and
files may appear to a programmer to be bigger than they really are
because a program can introduce holes into them (through the
lseek( )
system call).
In this section, we explain how the Ext2 filesystem manages the disk space — how it allocates and deallocates inodes and data blocks. Two main problems must be addressed:
Space management must make every effort to avoid file fragmentation — the physical storage of a file in several, small pieces located in nonadjacent disk blocks. File fragmentation increases the average time of sequential read operations on the files, since the disk heads must be frequently repositioned during the read operation.[116] This problem is similar to the external fragmentation of RAM discussed in Section 7.1.7.
Space management must be time-efficient; that is, the kernel should be able to quickly derive from a file offset the corresponding logical block number in the Ext2 partition. In doing so, the kernel should limit as much as possible the number of accesses to addressing tables stored on disk, since each such intermediate access considerably increases the average file access time.
The
ext2_new_inode( )
function creates an Ext2 disk
inode, returning the address of the corresponding inode object (or
NULL
, in case of failure). It acts on two
parameters: the address dir
of the inode object
that refers to the directory into which the new inode must be
inserted and a mode
that indicates the type of
inode being created. The latter argument also includes an
MS_SYNCHRONOUS
flag that requires the current
process to be suspended until the inode is allocated. The function
performs the following actions:
Invokes new_inode( )
to allocate a new inode
object and initializes its i_sb
field to the
superblock address stored in dir->i_sb
.
Invokes down( )
on the s_lock
semaphore included in the parent superblock. As we know, the kernel
suspends the current process if the semaphore is already busy.
If the new inode is a directory, tries to place it so that directories are evenly scattered through partially filled block groups. In particular, allocates the new directory in the block group that has the maximum number of free blocks among all block groups that have a greater than average number of free inodes. (The average is the total number of free inodes divided by the number of block groups).
If the new inode is not a directory, allocates it in a block group having a free inode. The function selects the group by starting from the one that contains the parent directory and moving farther away from it; to be precise:
Performs a quick logarithmic search starting from the block group
that includes the parent directory dir
. The
algorithm searches log(n) block groups, where
n is the total number of block groups. The
algorithm jumps further ahead until it finds an available block group
— for example, if we call the number of the starting block
group i, the algorithm considers block groups
i mod (n),
i+1 mod
(n), i+1+2
mod (n),
i+1+2+4 mod
(n), etc.
If the logarithmic search failed in finding a block group with a free
inode, the function performs an exhaustive linear search starting
from the block group that includes the parent directory
dir
.
Invokes load_inode_bitmap( )
to get the inode
bitmap of the selected block group and searches for the first null
bit into it, thus obtaining the number of the first free disk inode.
Allocates the disk inode: sets the corresponding bit in the inode
bitmap and marks the buffer containing the bitmap as dirty. Moreover,
if the filesystem has been mounted specifying the
MS_SYNCHRONOUS
flag, invokes ll_rw_block( )
and waits until the write operation terminates (see
Section 12.4.2).
Decrements the bg_free_inodes_count
field of the
group descriptor. If the new inode is a directory, increments the
bg_used_dirs_count
field. Marks the buffer
containing the group descriptor as dirty.
Decrements the s_free_inodes_count
field of the
disk superblock and marks the buffer containing it as dirty. Sets the
s_dirt
field of the VFS’s
superblock object to 1.
Initializes the fields of the inode object. In particular, sets the
inode number i_no
and copies the value of
xtime.tv_sec
into i_atime
,
i_mtime
, and i_ctime
. Also
loads the i_block_group
field in the
ext2_inode_info
structure with the block group
index. Refer to Table 17-3 for the meaning of these
fields.
Inserts the new inode object into the hash table
inode_hashtable
and invokes
mark_inode_dirty( )
to move the inode object into
the superblock’s dirty inode list (see
Section 12.2.2).
Invokes up( )
on the s_lock
semaphore included in the parent superblock.
The
ext2_free_inode( )
function deletes a disk inode,
which is identified by an inode object whose address is passed as the
parameter. The kernel should invoke the function after a series of
cleanup operations involving internal data structures and the data in
the file itself. It should come after the inode object has been
removed from the inode hash table, after the last hard link referring
to that inode has been deleted from the proper directory and after
the file is truncated to 0 length to reclaim all its data blocks (see
Section 17.6.6 later in
this chapter). It performs the following actions:
Invokes down( )
on the s_lock
semaphore included in the parent superblock to get exclusive access
to the superblock object.
Invokes clear_inode( )
to perform the following
operations:
Invokes invalidate_inode_buffers( )
to remove the
dirty buffers that belong to the inode from its
i_dirty_buffers
and
i_dirty_data_buffers
lists (see Section 14.2.1).
If the I_LOCK
flag of the inode is set, some of
the inode’s buffers are involved in I/O data
transfers; the function suspends the current process until these I/O
data transfers terminate.
Invokes the clear_inode
method of the superblock
object, if defined; the Ext2 filesystem does not define it.
Sets the state of the inode to I_CLEAR
(the inode
object contents are no longer meaningful).
Computes the index of the block group containing the disk inode from the inode number and the number of inodes in each block group.
Invokes load_inode_bitmap( )
to get the inode
bitmap.
Increments the bg_free_inodes_count
field of the
group descriptor. If the deleted inode is a directory, decrements the
bg_used_dirs_count
field. Marks the buffer that
contains the group descriptor as dirty.
Increments the s_free_inodes_count
field of the
disk superblock and marks the buffer that contains it as dirty. Also
sets the s_dirt
field of the superblock object to
1.
Clears the bit corresponding to the disk inode in the inode bitmap
and marks the buffer that contains the bitmap as dirty. Moreover, if
the filesystem has been mounted with the
MS_SYNCHRONIZE
flag, invokes ll_rw_block( )
and waits until the write operation on the
bitmap’s buffer terminates.
Invokes up( )
on the s_lock
semaphore included in the parent superblock object.
Each nonempty regular file consists of a group of data blocks. Such blocks may be referred to either by their relative position inside the file (their file block number ) or by their position inside the disk partition (their logical block number, explained in Section 13.4.4).
Deriving the logical block number of the corresponding data block from an offset f inside a file is a two-step process:
Derive from the offset f the file block number — the index of the block that contains the character at offset f.
Translate the file block number to the corresponding logical block number.
Since Unix files do not include any control characters, it is quite easy to derive the file block number containing the f th character of a file: simply take the quotient of f and the filesystem’s block size and round down to the nearest integer.
For instance, let’s assume a block size of 4 KB. If f is smaller than 4,096, the character is contained in the first data block of the file, which has file block number 0. If f is equal to or greater than 4,096 and less than 8,192, the character is contained in the data block that has file block number 1, and so on.
This is fine as far as file block numbers are concerned. However, translating a file block number into the corresponding logical block number is not nearly as straightforward, since the data blocks of an Ext2 file are not necessarily adjacent on disk.
The Ext2 filesystem must therefore provide a method to store the connection between each file block number and the corresponding logical block number on disk. This mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside the inode. It also involves some specialized blocks that contain extra pointers, which are an inode extension used to handle large files.
The i_block
field in the disk inode is an array of
EXT2_N_BLOCKS
components that contain logical
block numbers. In the following discussion, we assume that
EXT2_N_BLOCKS
has the default value, namely 15.
The array represents the initial part of a larger data structure,
which is illustrated in Figure 17-5. As can be seen
in the figure, the 15 components of the array are of 4 different
types:
The first 12 components yield the logical block numbers corresponding to the first 12 blocks of the file—to the blocks that have file block numbers from 0 to 11.
The component at index 12 contains the logical block number of a block that represents a second-order array of logical block numbers. They correspond to the file block numbers ranging from 12 to b/4+11, where b is the filesystem’s block size (each logical block number is stored in 4 bytes, so we divide by 4 in the formula). Therefore, the kernel must look in this component for a pointer to a block, and then look in that block for another pointer to the ultimate block that contains the file contents.
The component at index 13 contains the logical block number of a block containing a second-order array of logical block numbers; in turn, the entries of this second-order array point to third-order arrays, which store the logical block numbers that correspond to the file block numbers ranging from b/4+12 to (b/4)2+(b/4)+11.
Finally, the component at index 14 uses triple indirection: the fourth-order arrays store the logical block numbers corresponding to the file block numbers ranging from (b/4)2+(b/4)+12 to (b/4)3+(b/4)2+(b/4)+11 upward.
In Figure 17-5, the number inside a block represents the corresponding file block number. The arrows, which represent logical block numbers stored in array components, show how the kernel finds its way to reach the block that contains the actual contents of the file.
Notice how this mechanism favors small files. If the file does not
require more than 12 data blocks, any data can be retrieved in two
disk accesses: one to read a component in the
i_block
array of the disk inode and the other to
read the requested data block. For larger files, however, three or
even four consecutive disk accesses may be needed to access the
required block. In practice, this is a worst-case estimate, since
dentry, buffer, and page caches contribute significantly to reduce
the number of real disk accesses.
Notice also how the block size of the filesystem affects the
addressing mechanism, since a larger block size allows the Ext2 to
store more logical block numbers inside a single block. Table 17-10 shows the upper limit placed on a
file’s size for each block size and each addressing
mode. For instance, if the block size is 1,024 bytes and the file
contains up to 268 kilobytes of data, the first 12 KB of a file can
be accessed through direct mapping and the remaining 13-268 KB can be
addressed through simple indirection. Files larger than 2 GB must be
opened on 32-bit architectures by specifying the
O_LARGEFILE
opening flag. In any case, the Ext2
filesystem puts an upper limit on the file size equal to 2 TB minus
4,096 bytes.
A file hole is a portion of a regular file that contains null characters and is not stored in any data block on disk. Holes are a long-standing feature of Unix files. For instance, the following Unix command creates a file in which the first bytes are a hole:
$ echo -n "X" | dd of=/tmp/hole bs=1024 seek=6
Now /tmp/hole
has 6,145 characters (6,144 null
characters plus an X character), yet the file occupies just one data
block on disk.
File holes were introduced to avoid wasting disk space. They are used extensively by database applications and, more generally, by all applications that perform hashing on files.
The Ext2 implementation of file holes is based on dynamic data block
allocation: a block is actually assigned to a file only when the
process needs to write data into it. The i_size
field of each inode defines the size of the file as seen by the
program, including the hole, while the i_blocks
field stores the number of data blocks effectively assigned to the
file (in units of 512 bytes).
In our earlier example of the dd
command, suppose
the /tmp/hole
file was created on an Ext2
partition that has blocks of size 4,096. The
i_size
field of the corresponding disk inode
stores the number 6,145, while the i_blocks
field
stores the number 8 (because each 4,096-byte block includes eight
512-byte blocks). The second element of the
i_block
array (corresponding to the block having
file block number 1) stores the logical block number of the allocated
block, while all other elements in the array are null (see Figure 17-6).
When the
kernel has to locate a block holding data for an Ext2 regular file,
it invokes the ext2_get_block( )
function. If the
block does not exist, the function automatically allocates the block
to the file. Remember that this function is invoked every time the
kernel issues a read or write operation on a Ext2 regular file (see
Section 15.1.1 and Section 15.1.3).
The ext2_get_block( )
function handles the data
structures already described in Section 17.6.3, and when necessary, invokes
the ext2_alloc_block( )
function to actually
search for a free block in the Ext2 partition.
To reduce file fragmentation, the Ext2 filesystem tries to get a new block for a file near the last block already allocated for the file. Failing that, the filesystem searches for a new block in the block group that includes the file’s inode. As a last resort, the free block is taken from one of the other block groups.
The Ext2 filesystem uses preallocation of data blocks. The file does
not get just the requested block, but rather a group of up to eight
adjacent blocks. The i_prealloc_count
field in the
ext2_inode_info
structure stores the number of
data blocks preallocated to a file that are still unused, and the
i_prealloc_block
field stores the logical block
number of the next preallocated block to be used. Any preallocated
blocks that remain unused are freed when the file is closed, when it
is truncated, or when a write operation is not sequential with
respect to the write operation that triggered the block
preallocation.
The ext2_alloc_block( )
function receives as
parameters a pointer to an inode object and a
goal
.
The goal is a logical block number that represents the preferred
position of the new block. The ext2_getblk( )
function sets the goal parameter according to the following
heuristic:
If the block that is being allocated and the previously allocated block have consecutive file block numbers, the goal is the logical block number of the previous block plus 1; it makes sense that consecutive blocks as seen by a program should be adjacent on disk.
If the first rule does not apply and at least one block has been previously allocated to the file, the goal is one of these blocks’ logical block numbers. More precisely, it is the logical block number of the already allocated block that precedes the block to be allocated in the file.
If the preceding rules do not apply, the goal is the logical block number of the first block (not necessarily free) in the block group that contains the file’s inode.
The ext2_alloc_block( )
function checks whether
the goal refers to one of the preallocated blocks of the file. If so,
it allocates the corresponding block and returns its logical block
number; otherwise, the function discards all remaining preallocated
blocks and invokes ext2_new_block( )
.
This latter function searches for a free block inside the Ext2 partition with the following strategy:
If the preferred block passed to ext2_alloc_block( )
, the goal, is free, and the function allocates the block.
If the goal is busy, the function checks whether one of the next 64 blocks after the preferred block is free.
If no free block is found in the near vicinity of the preferred block, the function considers all block groups, starting from the one including the goal. For each block group, the function does the following:
Looks for a group of at least eight adjacent free blocks.
If no such group is found, looks for a single free block.
The search ends as soon as a free block is found. Before terminating,
the ext2_new_block( )
function also tries to
preallocate up to eight free blocks adjacent to the free block found
and sets the i_prealloc_block
and
i_prealloc_count
fields of the disk inode to the
proper block location and number of blocks.
When a
process deletes a file or truncates it to 0 length, all its data
blocks must be reclaimed. This is done by ext2_truncate( )
, which receives the address of the
file’s inode object as its parameter. The function
essentially scans the disk inode’s
i_block
array to locate all data blocks and all
blocks used for the indirect addressing. These blocks are then
released by repeatedly invoking ext2_free_blocks( )
.
The ext2_free_blocks( )
function releases a group
of one or more adjacent data blocks. Besides its use by
ext2_truncate( )
, the function is invoked mainly
when discarding the preallocated blocks of a file (see the earlier
section Section 17.6.5). Its
parameters are:
inode
The address of the inode object that describes the file
block
The logical block number of the first block to be released
count
The number of adjacent blocks to be released
The function invokes down( )
on the
s_lock
superblock’s semaphore to
get exclusive access to the filesystem’s superblock,
and then performs the following actions for each block to be
released:
Gets the block bitmap of the block group, including the block to be released
Clears the bit in the block bitmap that corresponds to the block to be released and marks the buffer that contains the bitmap as dirty
Increments the bg_free_blocks_count
field in the
block group descriptor and marks the corresponding buffer as dirty
Increments the s_free_blocks_count
field of the
disk superblock, marks the corresponding buffer as dirty, and sets
the s_dirt
flag of the superblock object
If the filesystem has been mounted with the
MS_SYNCHRONOUS
flag set, invokes
ll_rw_block( )
and waits until the write operation
on the bitmap’s buffer terminates
Finally, the function invokes up( )
to release the
superblock’s s_lock
semaphore.
[116] Please note that fragmenting a file across block groups (A Bad Thing) is quite different from the not-yet-implemented fragmentation of blocks to store many files in one block (A Good Thing).
18.227.24.118