As we have seen, in Version 2.4 of Linux, there is no substantial difference between accessing a regular file through the filesystem, accessing it by referencing its blocks on the underlying block device file, or even establishing a file memory mapping. There are, however, some highly sophisticated programs (self-caching applications) that would like to have full control of the whole I/O data transfer mechanism. Consider, for example, high-performance database servers: most of them implement their own caching mechanisms that exploit the peculiar nature of the queries to the database. For these kinds of programs, the kernel page cache doesn't help; on the contrary, it is detrimental for the following reasons:
· Lots of page frames are wasted to duplicate disk data already in RAM (in the user-level disk cache)
· The read( ) and write( ) system calls are slowed down by the redundant instructions that handle the page cache and the read-ahead; ditto for the paging operations related to the file memory mappings
· Rather than transferring the data directly between the disk and the user memory, the read( ) and write( ) system calls make two transfers: between the disk and a kernel buffer and between the kernel buffer and the user memory
Since block hardware devices must be handled through interrupts and Direct Memory Access (DMA), and this can be done only in Kernel Mode, some sort of kernel support is definitely required to implement self-caching applications.
Version 2.4 of Linux offers a simple way to bypass the page cache: direct I/O transfers. In each direct I/O transfer, the kernel programs the disk controller to transfer the data directly from/to pages belonging to the User Mode address space of a self-caching application.
As we know, any data transfer proceeds asynchronously. While it is in progress, the kernel may switch the current process, the CPU may return to User Mode, the pages of the process that raised the data transfer might be swapped out, and so on. This works just fine for ordinary I/O data transfers because they involve pages of the disk caches. Disk caches are owned by the kernel, cannot be swapped out, and are visible to all processes in Kernel Mode.
On the other hand, direct I/O transfers should move data within pages that belong to the User Mode address space of a given process. The kernel must take care that these pages are accessible by any process in Kernel Mode and that they are not swapped out while the data transfer is in progress. This is achieved thanks to the "direct access buffers."
A direct access buffer consists of a set of physical page frames reserved for direct I/O data transfers, which are mapped both by the User Mode Page Tables of a self-caching application and by the kernel Page Tables (the Kernel Mode Page Tables of each process). Each direct access buffer is described by a kiobuf data structure, whose fields are shown in Table 15-2.
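For reference, here is a condensed sketch of the descriptor's declaration, modeled on the <linux/iobuf.h> header of the 2.4 kernels; minor fields and the exact values of the KIO_STATIC_PAGES and KIO_MAX_SECTORS constants vary slightly among 2.4 releases:

    struct kiobuf {
        int nr_pages;                 /* number of mapped page frames */
        int array_len;                /* capacity of the maplist array */
        int offset;                   /* offset to the valid data in the first page */
        int length;                   /* number of valid bytes of data */
        struct page **maplist;        /* pointers to the page descriptors */
        unsigned int locked : 1;      /* set if the page frames are locked in RAM */
        struct page *map_array[KIO_STATIC_PAGES]; /* static storage for maplist */
        struct buffer_head *bh[KIO_MAX_SECTORS];  /* preallocated buffer heads */
        unsigned long blocks[KIO_MAX_SECTORS];    /* logical block numbers */
        atomic_t io_count;            /* block I/O operations yet to complete */
        int errno;                    /* error code of the I/O operation */
        void (*end_io)(struct kiobuf *);          /* completion method */
        wait_queue_head_t wait_queue; /* where processes wait for I/O completion */
    };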
Suppose a self-caching application wishes to directly access a file. As a first step, the application opens the file specifying the O_DIRECT flag (see Section 12.6.1). While servicing the open( ) system call, the dentry_open( ) function checks the value of this flag; if it is set, the function invokes alloc_kiovec( ), which allocates a new direct access buffer descriptor and stores its address into the f_iobuf field of the file object. Initially the buffer includes no page frames, so the nr_pages field of the descriptor stores the value 0. The alloc_kiovec( ) function, however, preallocates 1,024 buffer heads, whose addresses are stored in the bh array of the descriptor. These buffer heads ensure that the self-caching application is not blocked while directly accessing the file (recall that ordinary data transfers block if no free buffer heads are available). A drawback of this approach is that data transfers must be done in chunks of at most 512 KB.
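From the application's point of view, direct file access is straightforward. The following hypothetical user-space fragment (the file name, block size, and transfer size are arbitrary) shows how a self-caching application might open a file with O_DIRECT and read from it; as explained later, the file offset and the transfer size must be multiples of the file's block size, and in practice the user buffer should be block-aligned as well:

    #define _GNU_SOURCE                  /* makes <fcntl.h> define O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char *buf;
        int fd = open("datafile", O_RDONLY | O_DIRECT);
        if (fd < 0)
            return 1;
        /* allocate a buffer aligned to the (assumed) 4-KB block size */
        if (posix_memalign((void **) &buf, 4096, 65536) != 0)
            return 1;
        if (read(fd, buf, 65536) < 0)    /* bypasses the page cache */
            return 1;
        free(buf);
        close(fd);
        return 0;
    }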
Next, suppose the self-caching application issues a read( ) or write( ) system call on the file opened with O_DIRECT. As mentioned earlier in this chapter, the generic_file_read( ) and generic_file_write( ) functions check the value of the flag and handle the case in a special way. For instance, the generic_file_read( ) function executes a code fragment essentially equivalent to the following:
    if (filp->f_flags & O_DIRECT) {
        inode = filp->f_dentry->d_inode->i_mapping->host;
        /* nothing to do if no bytes are requested or the file
           pointer is at or beyond the end of the file */
        if (count == 0 || *ppos >= inode->i_size)
            return 0;
        /* never read past the end of the file */
        if (*ppos + count > inode->i_size)
            count = inode->i_size - *ppos;
        retval = generic_file_direct_IO(READ, filp, buf, count, *ppos);
        /* on success, advance the file pointer */
        if (retval > 0)
            *ppos += retval;
        UPDATE_ATIME(filp->f_dentry->d_inode);
        return retval;
    }
The function checks the current values of the file pointer, the file size, and the number of requested characters, and then invokes the generic_file_direct_IO( ) function, passing to it the READ operation type, the file object pointer, the address of the User Mode buffer, the number of requested bytes, and the file pointer. The generic_file_write( ) function is similar, but of course it passes the WRITE operation type to the generic_file_direct_IO( ) function.
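In other words, the helper's interface looks roughly like the following prototype (a sketch; the function is defined as a static helper in the memory management sources of the 2.4 kernels):

    /* rw is READ or WRITE, buf is the User Mode buffer, count the
       number of requested bytes, and offset the file pointer value */
    static ssize_t generic_file_direct_IO(int rw, struct file *filp,
                                          char *buf, size_t count,
                                          loff_t offset);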
The generic_file_direct_IO( ) function performs the following steps (a simplified skeleton of the whole function appears after the list):
1. Tests and sets the f_iobuf_lock lock in the file object. If it was already set, the direct access buffer descriptor stored in f_iobuf is already in use by a concurrent direct I/O transfer, so the function allocates a new direct access buffer descriptor and uses it in the following steps.
2. Checks that the file pointer offset and the number of requested characters are multiples of the block size of the file; returns -EINVAL if they are not.
3. Checks that the direct_IO method of the address_space object of the file (filp->f_dentry->d_inode->i_mapping) is defined; returns -EINVAL if it isn't.
4. Even if the self-caching application is accessing the file directly, there could be other applications in the system that access the file through the page cache. To avoid data loss, the disk image is synchronized with the page cache before starting the direct I/O transfer. The function flushes the dirty pages belonging to memory mappings of the file to disk by invoking the filemap_fdatasync( ) function (see the previous section).
5. Flushes to disk the dirty pages updated by write( ) system calls by invoking the fsync_inode_data_buffers( ) function, and waits until the I/O transfer terminates.
6. Invokes the filemap_fdatawait( ) function to wait until the I/O operations started in Step 4 complete (see the previous section).
7. Starts a loop, and divides the data to be transferred into chunks of 512 KB. For every chunk, the function performs the following substeps:
a. Invokes map_user_kiobuf( ) to establish a mapping between the direct access buffer and the portion of the user-level buffer corresponding to the chunk. To achieve this, the function:
1. Invokes expand_kiobuf( ) to allocate a new array of page descriptor addresses in case the array embedded in the direct access buffer descriptor is too small. This is not the case here, however, because the 129 entries in the map_array field suffice to map a chunk of 512 KB: with 4-KB pages, 128 entries cover the chunk itself, while the additional entry is required when the user buffer is not page-aligned.
2. Accesses all user pages in the chunk (allocating them when necessary by simulating Page Faults) and stores their addresses in the array pointed to by the maplist field of the direct access buffer descriptor.
3. Properly initializes the nr_pages, offset, and length fields, and resets the locked field to 0.
b. Invokes the direct_IO method of the address_space object of the file (explained next).
c. If the operation type was READ, invokes mark_dirty_kiobuf( ) to mark the pages mapped by the direct access buffer as dirty.
d. Invokes unmap_kiobuf( ) to release the mapping between the chunk and the direct access buffer, and then continues with the next chunk.
8. If the function allocated a temporary direct access buffer descriptor in Step 1, it releases it. Otherwise, it releases the f_iobuf_lock lock in the file object.
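Putting the steps together, the body of generic_file_direct_IO( ) is roughly equivalent to the following skeleton, simplified from the 2.4 sources (Step 1 and Step 8, which select and release the direct access buffer descriptor iobuf, are omitted, as is most error handling; rw, filp, buf, count, and offset are the parameters described earlier):

    struct inode *inode = filp->f_dentry->d_inode;
    struct address_space *mapping = inode->i_mapping;
    int blocksize_bits = inode->i_blkbits;
    int blocksize = 1 << blocksize_bits;
    int chunk_size = KIO_MAX_ATOMIC_IO << 10;    /* 512 KB */
    int progress = 0, iosize;
    ssize_t retval;

    /* Steps 2-3: the offset and count must be block-aligned, and the
       filesystem must provide a direct_IO method */
    if ((offset & (blocksize - 1)) || (count & (blocksize - 1)))
        return -EINVAL;
    if (!mapping->a_ops->direct_IO)
        return -EINVAL;

    /* Steps 4-6: synchronize the disk image with the page cache */
    retval = filemap_fdatasync(mapping);
    if (retval == 0)
        retval = fsync_inode_data_buffers(inode);
    if (retval == 0)
        retval = filemap_fdatawait(mapping);
    if (retval < 0)
        return retval;

    /* Step 7: transfer the data in chunks of at most 512 KB */
    while (count > 0) {
        iosize = count > chunk_size ? chunk_size : count;

        /* Step 7a: pin the user pages of the chunk and record them
           in the maplist array of the direct access buffer */
        retval = map_user_kiobuf(rw, iobuf, (unsigned long) buf, iosize);
        if (retval)
            break;

        /* Step 7b: let the filesystem carry out the block transfers */
        retval = mapping->a_ops->direct_IO(rw, inode, iobuf,
                      (offset + progress) >> blocksize_bits, blocksize);

        /* Step 7c: pages filled by a READ were written to by the
           kernel, so their page frames must be marked as dirty */
        if (rw == READ && retval > 0)
            mark_dirty_kiobuf(iobuf, retval);

        if (retval >= 0) {
            count -= retval;
            buf += retval;
            progress += retval;
        }

        /* Step 7d: unpin the user pages */
        unmap_kiobuf(iobuf);

        if (retval != iosize)
            break;
    }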
In almost all cases, the direct_IO method is a wrapper for the generic_direct_IO( ) function, passing it the address of the usual filesystem-dependent function that computes the position of the physical blocks on the block device (see Section 15.1.1 earlier in this chapter). This function executes the following steps:
1. For each block of the file portion corresponding to the current chunk, invokes the filesystem-dependent function to determine its logical block number, and stores this number in an entry of the blocks array in the direct access buffer descriptor. The 1,024 entries of the array suffice because the minimum block size in Linux is 512 bytes.
2. Invokes the brw_kiovec( ) function, which essentially calls the submit_bh( ) function on each block in the blocks array using the buffer heads stored in the bh array of the direct access buffer descriptor. The direct I/O operation is similar to a buffer or page I/O operation, but the b_end_io method of the buffer heads is set to the special function end_buffer_io_kiobuf( ) rather than to end_buffer_io_sync( ) or end_buffer_io_async( ) (see Section 13.4.8). This method updates the fields of the kiobuf data structure: it records any I/O error in the errno field, decreases the io_count field of outstanding transfers, and wakes up the process sleeping on the descriptor's wait queue when the count reaches zero. brw_kiovec( ) does not return until the I/O data transfers are completed.
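The completion mechanism itself is simple. The end_buffer_io_kiobuf( ) function is essentially equivalent to the following sketch, based on the 2.4 sources (the bookkeeping on the kiobuf is actually performed by a small helper, end_kio_request( ), inlined here for brevity):

    void end_buffer_io_kiobuf(struct buffer_head *bh, int uptodate)
    {
        struct kiobuf *kiobuf = bh->b_private;   /* owning direct access buffer */

        mark_buffer_uptodate(bh, uptodate);
        unlock_buffer(bh);

        /* record a failed block transfer in the descriptor */
        if (!uptodate && !kiobuf->errno)
            kiobuf->errno = -EIO;

        /* when the last outstanding transfer terminates, wake up the
           process sleeping in brw_kiovec( ) */
        if (atomic_dec_and_test(&kiobuf->io_count)) {
            if (kiobuf->end_io)
                kiobuf->end_io(kiobuf);
            wake_up(&kiobuf->wait_queue);
        }
    }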