In this section we'll briefly describe the enhanced filesystem that has evolved from Ext2, named Ext3. The new filesystem has been designed with two simple concepts in mind:
· To be a journaling filesystem (see the next section)
· To be, as much as possible, compatible with the old Ext2 filesystem
Ext3 achieves both the goals very well. In particular, it is largely based on Ext2, so its data structures on disk are essentially identical to those of an Ext2 filesystem. As a matter of fact, if an Ext3 filesystem has been cleanly unmounted, it can be remounted as an Ext2 filesystem; conversely, creating a journal of an Ext2 filesystem and remounting it as an Ext3 filesystem is a simple, fast operation.
Thanks to the compatibility between Ext3 and Ext2, most descriptions in the previous sections of this chapter apply to Ext3 as well. Therefore, in this section, we focus on the new feature offered by Ext3 — "the journal."
As disks became larger, one design choice of traditional Unix filesystems (like Ext2) turns out to be inappropriate. As we know from Chapter 14, updates to filesystem blocks might be kept in dynamic memory for long period of time before being flushed to disk. A dramatic event like a power-down failure or a system crash might thus leave the filesystem in an inconsistent state. To overcome this problem, each traditional Unix filesystem is checked before being mounted; if it has not been properly unmounted, then a specific program executes an exhaustive, time-consuming check and fixes all filesystem's data structures on disk.
For instance, the Ext2 filesystem status is stored in the s_mount_state field of the superblock on disk. The e2fsck utility program is invoked by the boot script to check the value stored in this field; if it is not equal to EXT2_VALID_FS, the filesystem was not properly unmounted, and therefore e2fsck starts checking all disk data structures of the filesystem.
Clearly, the time spent checking the consistency of a filesystem depends mainly on the number of files and directories to be examined; therefore, it also depends on the disk size. Nowadays, with filesystems reaching hundreds of gigabytes, a single consistency check may take hours. The involved downtime is unacceptable for any production environment or high-availability server.
The goal of a journaling filesystem is to avoid running time-consuming consistency checks on the whole filesystem by looking instead in a special disk area that contains the most recent disk write operations named journal. Remounting a journaling filesystem after a system failure is a matter of few seconds.
The idea behind Ext3 journaling is to perform any high-level change to the filesystem in two steps. First, a copy of the blocks to be written is stored in the journal; then, when the I/O data transfer to the journal is completed (in short, data is committed to the journal), the blocks are written in the filesystem. When the I/O data transfer to the filesystem terminates (data is committed to the filesystem), the copies of the blocks in the journal are discarded.
While recovering after a system failure, the e2fsck program distinguishes the following two cases:
· The system failure occurred before a commit to the journal. Either the copies of the blocks relative to the high-level change are missing from the journal or they are incomplete; in both cases, e2fsck ignores them.
· The system failure occurred after a commit to the journal. The copies of the blocks are valid and e2fsck writes them into the filesystem.
In the first case, the high-level change to the filesystem is lost, but the filesystem state is still consistent. In the second case, e2fsck applies the whole high-level change, thus fixing any inconsistency due to unfinished I/O data transfers into the filesystem.
Don't expect too much from a journaling filesystem; it ensures consistency only at the system call level. For instance, a system failure that occurs while you are copying a large file by issuing several write( ) system calls will interrupt the copy operation, thus the duplicated file will be shorter than the original one.
Furthermore, journaling filesystems do not usually copy all blocks into the journal. In fact, each filesystem consists of two kinds of blocks: those containing the so-called metadata and those containing regular data. In the case of Ext2 and Ext3, there are six kinds of metadata: superblocks, group block descriptors, inodes, blocks used for indirect addressing (indirection blocks), data bitmap blocks, and inode bitmap blocks. Other filesystems may use different metadata.
Most journaling filesystems, like ReiserFS, SGI's XFS, and IBM's JFS, limit themselves to log the operations affecting metadata. In fact, metadata's log records are sufficient to restore the consistency of the on-disk filesystem data structures. However, since operations on blocks of file data are not logged, nothing prevents a system failure from corrupting the contents of the files.
The Ext3 filesystem, however, can be configured to log the operations affecting both the filesystem metadata and the data blocks of the files. Since logging every kind of write operation leads to a significant performance penalty, Ext3 lets the system administrator decide what has to be logged; in particular, it offers three different journaling modes:
Journal
All filesystem data and metadata changes are logged into the journal. This mode minimizes the chance of losing the updates made to each file, but it requires many additional disk accesses. For example, when a new file is created, all its data blocks must be duplicated as log records. This is the safest and slowest Ext3 journaling mode.
Ordered
Only changes to filesystem metadata are logged into the journal. However, the Ext3 filesystem groups metadata and relative data blocks so that data blocks are written to disk before the metadata. This way, the chance to have data corruption inside the files is reduced; for instance, any write access that enlarges a file is guaranteed to be fully protected by the journal. This is the default Ext3 journaling mode.
Writeback
Only changes to filesystem metadata are logged; this is the method found on the other journaling filesystems and is the fastest mode.
The journaling mode of the Ext3 filesystem is specified by an option of the mount system command. For instance, to mount an Ext3 filesystem stored in the /dev/sda2 partition on the /jdisk mount point with the "writeback" mode, the system administrator can type the command:
# mount -t ext3 -o data=writeback /dev/sda2 /jdisk
The Ext3 journal is usually stored in a hidden file named .journal located in the root directory of the filesystem.
The Ext3 filesystem does not handle the journal on its own; rather, it uses a general kernel layer named Journaling Block Device, or JBD. Right now, only Ext3 uses the JBD layer, but other filesystems might use it in the future.
The JBD layer is a rather complex piece of software. The Ext3 filesystem invokes the JBD routines to ensure that its subsequent operations don't corrupt the disk data structures in case of system failure. However, JBD typically uses the same disk to log the changes performed by the Ext3 filesystem, and it is therefore vulnerable to system failures as much as Ext3. In other words, JBD must also protect itself from any system failure that could corrupt the journal.
Therefore, the interaction between Ext3 and JBD is essentially based on three fundamental units:
Log record
Describes a single update of a disk block of the journaling filesystem.
Atomic operation handle
Includes log records relative to a single high-level change of the filesystem; typically, each system call modifying the filesystem gives rise to a single atomic operation handle.
Transaction
Includes several atomic operation handles whose log records are marked valid for e2fsck at the same time.
A log record is essentially the description of a low-level operation that is going to be issued by the filesystem. In some journaling filesystems, the log record consists of exactly the span of bytes modified by the operation, together with the starting position of the bytes inside the filesystem. The JBD layer, however, uses log records consisting of the whole buffer modified by the low-level operation. This approach may waste a lot of journal space (for instance, when the low-level operation just changes the value of a bit in a bitmap), but it is also much faster because the JBD layer can work directly with buffers and their buffer heads.
Log records are thus represented inside the journal as normal blocks of data (or metadata). Each such block, however, is associated with a small tag of type journal_block_tag_t, which stores the logical block number of the block inside the filesystem and a few status flags.
Later, whenever a buffer is being considered by the JBD, either because it belongs to a log record or because it is a data block that should be flushed to disk before the corresponding metadata block (in the "ordered" journaling mode), the kernel attaches a journal_head data structure to the buffer head. In this case, the b_private field of the buffer head stores the address of the journal_head data structure and the BH_JBD flag is set (see Section 13.4.4).
Any system call modifying the filesystem is usually split into a series of low-level operations that manipulate disk data structures.
For instance, suppose that Ext3 must satisfy a user request to append a block of data to a regular file. The filesystem layer must determine the last block of the file, locate a free block in the filesystem, update the data block bitmap inside the proper block group, store the logical number of the new block either in the file's inode or in an indirect addressing block, write the contents of the new block, and finally, update several fields of the inode. As you see, the append operation translates into many lower-level operations on the data and metadata blocks of the filesystem.
Now, just imagine what could happen if a system failure occurred in the middle of an append operation, when some of the lower-level manipulations have already been executed while others have not. Of course, the scenario could be even worse, with high-level operations affecting two or more files (for example, moving a file from one directory to another).
To prevent data corruption, the Ext3 filesystem must ensure that each system call is handled in an atomic way. An atomic operation handle is a set of low-level operations on the disk data structures that correspond to a single high-level operation. When recovering from a system failure, the filesystem ensures that either the whole high-level operation is applied or none of its low-level operations is.
Any atomic operation handle is represented by a descriptor of type handle_t. To start an atomic operation, the Ext3 filesystem invokes the journal_start( ) JBD function, which allocates, if necessary, a new atomic operation handle and inserts it into the current transactions (see the next section). Since any low-level operation on the disk might suspend the process, the address of the active handle is stored in the journal_info field of the process descriptor. To notify that an atomic operation is completed, the Ext3 filesystem invokes the journal_stop( ) function.
For reasons of efficiency, the JBD layer manages the journal by grouping the log records that belong to several atomic operation handles into a single transaction. Furthermore, all log records relative to a handle must be included in the same transaction.
All log records of a transaction are stored in consecutive blocks of the journal. The JBD layer handles each transaction as a whole. For instance, it reclaims the blocks used by a transaction only after all data included in its log records is committed to the filesystem.
As soon as it is created, a transaction may accept log records of new handles. The transaction stops accepting new handles when either of the following occurs:
· A fixed amount of time has elapsed, typically 5 seconds.
· There are no free blocks in the journal left for a new handle
A transaction is represented by a descriptor of type transaction_t. The most important field is t_state, which describes the current status of the transaction.
Essentially, a transaction can be:
Complete
All log records included in the transaction have been physically written onto the journal. When recovering from a system failure, e2fsck considers every complete transaction of the journal and writes the corresponding blocks into the filesystem. In this case, the i_state field stores the value T_FINISHED.
Incomplete
At least one log record included in the transaction has not yet been physically written to the journal, or new log records are still being added to the transaction. In case of system failure, the image of the transaction stored in the journal is likely not up to date. Therefore, when recovering from a system failure, e2fsck does not trust the incomplete transactions in the journal and skips them. In this case, the i_state field stores one of the following values:
T_RUNNING
Still accepting new atomic operation handles.
T_LOCKED
Not accepting new atomic operation handles, but some of them are still unfinished.
T_FLUSH
All atomic operation handles have finished, but some log records are still being written to the journal.
T_COMMIT
All log records of the atomic operation handles have been written to disk, and the transaction is marked as completed on the journal.
At any given instance, the journal may include several transactions. Just one of them is in the T_RUNNING state — it is the active transaction that is accepting the new atomic operation handle requests issued by the Ext3 filesystem.
Several transactions in the journal might be incomplete because the buffers containing the relative log records have not yet been written to the journal.
A complete transaction is deleted from the journal only when the JBD layer verifies that all buffers described by the log records have been successfully written onto the Ext3 filesystem. Therefore, the journal can include at most one incomplete transaction and several complete transactions. The log records of a complete transaction have been written to the journal but some of the corresponding buffers have yet to be written onto the filesystem.
Let's try to explain how journaling works with an example: the Ext3 filesystem layer receives a request to write some data blocks of a regular file.
As you might easily guess, we are not going to describe in detail every single operation of the Ext3 filesystem layer and of the JBD layer. There would be far too many issues to be covered! However, we describe the essential actions:
1. The service routine of the write( ) system call triggers the write method of the file object associated with the Ext3 regular file. For Ext3, this method is implemented by the generic_file_write( ) function, already described in Section 15.1.3.
2. The generic_file_write( ) function invokes the prepare_write method of the address_space object several times, once for every page of data involved by the write operation. For Ext3, this method is implemented by the ext3_prepare_write( ) function.
3. The ext3_prepare_write( ) function starts a new atomic operation by invoking the journal_start( ) JBD function. The handle is added to the active transaction. Actually, the atomic operation handle is created only when executing the first invocation of the journal_start( ) function. Following invocations verify that the journal_info field of the process descriptor is already set and use the referenced handle.
4. The ext3_prepare_write( ) function invokes the block_prepare_write( ) function already described in Chapter 15, passing to it the address of the ext3_get_block( ) function. Remember that block_prepare_write( ) takes care of preparing the buffers and the buffer heads of the file's page.
5. When the kernel must determine the logical number of a block of the Ext3 filesystem, it executes the ext3_get_block( ) function. This function is actually similar to ext2_get_block( ), which is described in the earlier section Section 17.6.5. A crucial difference, however, is that the Ext3 filesystem invokes functions of the JBD layer to ensure that the low-level operations are logged:
o Before issuing a low-level write operation on a metadata block of the filesystem, the function invokes journal_get_write_access( ). Basically, this latter function adds the metadata buffer to a list of the active transaction. However, it must also check whether the metadata is included in an older incomplete transaction of the journal; in this case, it duplicates the buffer to make sure that the older transactions are committed with the old content.
o After updating the buffer containing the metadata block, the Ext3 filesystem invokes journal_dirty_metadata( ) to move the metadata buffer to the proper dirty list of the active transaction and to log the operation in the journal.
Notice that metadata buffers handled by the JBD layer are not usually included in the dirty lists of buffers of the inode, so they are not written to disk by the normal disk cache flushing mechanisms described in Chapter 14.
6. If the Ext3 filesystem has been mounted in "journal" mode, the ext3_prepare_write( ) function also invokes journal_get_write_access( ) on every buffer touched by the write operation.
7. Control returns to the generic_file_write( ) function, which updates the page with the data stored in the User Mode address space and then invokes the commit_write method of the address_space object. For Ext3, this method is implemented by the ext3_commit_write( ) function.
8. If the Ext3 filesystem has been mounted in "journal" mode, the ext3_commit_write( ) function invokes journal_dirty_metadata( ) on every buffer of data (not metadata) in the page. This way, the buffer is included in the proper dirty list of the active transaction and not in the dirty list of the owner inode; moreover, the corresponding log records are written to the journal.
9. If the Ext3 filesystem has been mounted in "ordered" mode, the ext3_commit_write( ) function invokes the journal_dirty_data( ) function on every buffer of data in the page to insert the buffer in a proper list of the active transactions. The JBD layer ensures that all buffers in this list are written to disk before the metadata buffers of the transaction. No log record is written onto the journal.
10. If the Ext3 filesystem has been mounted in "ordered" or "writeback" mode, the ext3_commit_write( ) function executes the normal generic_commit_write( ) function described in Chapter 15, which inserts the data buffers in the list of the dirty buffers of the owner inode.
11. Finally, ext3_commit_write( ) invokes journal_stop( ) to notify the JBD layer that the atomic operation handle is closed.
12. The service routine of the write( ) system call terminates here. However, the JBD layer has not finished its work. Eventually, our transaction becomes complete when all its log records have been physically written to the journal. Then journal_commit_transaction( ) is executed.
13. If the Ext3 filesystem has been mounted in "ordered" mode, the journal_commit_transaction( ) function activates the I/O data transfers for all data buffers included in the list of the transaction and waits until all data transfers terminate.
14. The journal_commit_transaction( ) function activates the I/O data transfers for all metadata buffers included in the transaction (and also for all data buffers, if Ext3 was mounted in "journal" mode).
15. Periodically, the kernel activates a checkpoint activity for every complete transaction in the journal. The checkpoint basically involves verifying whether the I/O data transfers triggered by journal_commit_transaction( ) have successfully terminated. If so, the transaction can be deleted from the journal.
Of course, the log records in the journal never play an active role until a system failure occurs. Only in this case, in fact, does the e2fsck utility program scan the journal stored in the filesystem and reschedule all write operations described by the log records of the complete transactions.