14.1 The Page Cache

To avoid unnecessary disk accesses, the kernel never tries to read a page from disk without looking into the page cache and verifying that it does not already include the requested data. To take the maximum advantage from the page cache, searching into it should be a very fast operation.

The unit of information kept in the page cache is, of course, a whole page of data. A page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by a device number and a block number. Instead, a page in the page cache is identified by an address of a special data structure named address_space, and by an offset within the file (or whatever) referenced by the address_space data structure.

14.1.1 The address_space Object

Table 14-1 suggests that the Linux page cache speeds up several different kinds of I/O operations. In fact, the page cache may include the following types of pages:

· Pages containing data of regular files and directories of disk-based filesystems; in Chapter 15, we describe how the kernel handles read and write operations on them.

· Pages containing data of a memory-mapped file; see Chapter 15 for details.

· Pages containing data directly read from block device files (skipping the filesystem layer); as discussed in Chapter 15, the kernel handles them using the same set of functions as for pages containing data of regular files.

· Pages containing data of User Mode processes that have been swapped out on disk. As we shall see in Chapter 16, the kernel could be forced to keep in the page cache some pages whose contents have been already written on a swap area.

· Pages belonging to an Interprocess Communication (IPC) shared memory region; we describe IPC resources in Chapter 19.

So far, so good, but how is the kernel supposed to keep track of how every page in the page cache should be handled? For instance, suppose the kernel wishes to update the content of a page included in the page cache — reading the page contents from a regular file, from a directory, from a block device file, or from a swap area are quite different operations, and the kernel must execute the proper operation according to the type of page.

The key data structure that establishes the relationship between pages and methods that operate on the pages is the address_space object. Formally, each address_space object establishes a link between a generic kernel object (the so-called owner) and a set of methods that operate on the pages belonging to the owner.

As stated before, the page cache includes five kinds of pages, so a page may belong to five possible kinds of owners.

For instance, if a page belongs to a regular file that is stored in an Ext2 filesystem, the owner of the page is an inode object. The i_mapping field of this object points to an address_space object. In turn, the address_space object defines a set of methods that allow the kernel to act on the pages containing the data of our regular file.

Specifically, the address_space object includes the fields shown in Table 14-2.

Table 14-2. The fields of the address_space object
Type	Field	Description
`struct list_head`	`clean_pages`	List of owner's clean pages
`struct list_head`	`dirty_pages`	List of owner's nonlocked dirty pages
`struct list_head`	`locked_pages`	List of owner's locked dirty pages
`unsigned long`	`nrpages`	Total number of owner's pages
`struct` `address_space_operations *`	`a_ops`	Methods that operate on the owner's pages
`struct inode *`	`host`	Pointer to the owning inode
`struct vm_area_struct *`	`i_mmap`	List of memory regions for private memory mapping
`struct vm_area_struct *`	`i_mmap_shared`	List of memory regions for shared memory mapping
`spinlock_t`	`i_shared_lock`	Spin lock for the lists of memory regions
`int`	`gfp_mask`	Memory allocator flags for the owner's pages

The clean_pages, dirty_pages, and locked_pages fields represent the heads of three lists of page descriptors. Together, these lists include all pages that belong to the owner of the address_space object. We discuss the role of each list in the next section. The nrpages field stores the total number of pages inserted in the three lists.

Although the owner of the address_space object could be any generic kernel object, usually it is a VFS inode object. (After all, the page cache was introduced to speed up disk accesses!) In this case, the host field points to the inode that owns the address_space object.

The i_mmap, i_mmap_shared, i_shared_lock, and gfp_mask fields are used whenever the owner of the address_space object is an inode of a memory-mapped file. We discuss them in Section 15.2.1.

The most important field of the address_space object is a_ops, which points to a table of type address_space_operations containing the methods that define how the owner's pages are handled. These methods are shown in Table 14-3.

Table 14-3. The methods of the address_space object
Method	Description
`writepage`	Write operation (from the page to the owner's disk image)
`readpage`	Read operation (from the owner's disk image to the page)
`sync_page`	Start the I/O data transfer of already scheduled operations on the page
`prepare_write`	Prepare the write operation (used by disk-based filesystems)
`commit_write`	Complete the write operation (used by disk-based filesystems)
`bmap`	Get a logical block number from a file block index
`flushpage`	Prepare to delete the page from the owner's disk image
`releasepage`	Used by journaling filesystems to prepare the release of a page
`direct_IO`	Direct I/O transfer of the data of the page

The most important methods are readpage, writepage, prepare_write, and commit_write. We discuss them in Chapter 15. In most cases, the methods link the owner inode objects with the low-level drivers that access the physical devices. For instance, the function that implements the readpage method for an inode of a regular file "knows" how to locate the positions on the physical disk device of the blocks corresponding to any page of the file. In this chapter, however, we don't have to discuss the address_space methods further.

14.1.2 Page Cache Data Structures

The page cache uses the following main data structures:

A page hash table

This lets the kernel quickly derive the page descriptor address for the page associated with a specified address_space object and a specified offset (presumably, a file offset)

Page descriptor lists in the address_space object

This lets the kernel quickly retrieve all pages in a given state owned by a particular inode object (or other kernel object) referenced by an address_space object

Manipulation of the page cache involves adding and removing entries from these data structures, as well as updating the fields in all objects that reference the cached pages. The pagecache_lock spin lock protects the page cache data structures against concurrent accesses in multiprocessor systems.

14.1.2.1 The page hash table

When a process reads a large file, the page cache may become filled with pages related to that file. In such cases, scanning a long list of page descriptors to find the page that maps the required file portion could become a time-consuming operation.

For this reason, Linux uses a hash table of page descriptor pointers named page_hash_table. Its size depends on the amount of available RAM; for example, for systems having 128 MB of RAM, page_hash_table is stored in 32 page frames and includes 32,768 page descriptor pointers.

The page_hash macro uses the address of an address_space object and an offset value to derive the address of the corresponding entry in the hash table. As usual, chaining is introduced to handle entries that cause a collision: the next_hash and pprev_hash fields of the page descriptors are used to implement doubly circular lists of entries that have the same hash value. The page_cache_size variable specifies the number of page descriptors included in the collision lists of the page hash table (and therefore in the whole page cache).

The add_page_to_hash_queue( ) and remove_page_from_hash_queue( ) functions add an element into the hash table and remove an element from it, respectively.

14.1.2.2 The lists of page descriptors in the address_space object

As we have seen, the address_space object includes three lists of page descriptors, whose heads are stored in the clean_pages, dirty_pages, and locked_pages fields. The lists allow the kernel to quickly find all pages of a file (or whatever) in a specific state:

clean_pages

Includes the pages not locked and not dirty (the PG_locked and PG_dirty flags in the page descriptor are equal to 0). The PG_uptodate flag indicates whether the data in the pages is up to date. Typically, a page is not up to date when its contents have yet to be read from the corresponding image on disk.

dirty_pages

Includes the pages that contain up-to-date data, but whose image on disk have yet to be updated. The PG_uptodate and PG_dirty flags in the page descriptor are set, while the PG_locked flag is clear.

locked_pages

Includes the pages whose contents are being transferred to or from disk, so the pages cannot be currently accessed. The PG_locked flag is set.

The add_page_to_inode_queue( ) function is used to insert a page descriptor into the clean_pages list of an address_space object. Conversely, the remove_page_from_inode_queue( ) is used to remove a page descriptor from the list that is currently including it.^[1] The kernel moves a page descriptor from a list to another one whenever the page changes its state.

^[1] The names of these functions are inherited from the old Version 2.2 of the kernel.

14.1.2.3 Page descriptor fields related to the page cache

When a page is included in the page cache, some fields of the corresponding page descriptor have special meanings:

list

Depending on the state of the page, includes pointers for the next and previous elements in the doubly linked list of clean, dirty, or locked pages of the address_space object.

mapping

Points to the address_space object to which the page belongs. If the page does not belong to the page cache, this field is NULL.

index

When the page's owner is an inode object, specifies the position of the data contained in the page within the disk image. The value is in page-size units.

next_hash

Points to the next colliding page descriptor in the page hash list.

pprev_hash

Points to the next_hash field of the previous colliding page descriptor in the page hash list.

In addition, when a page is inserted into the page cache, the usage counter (count field) of the corresponding page descriptor is incremented. If the count field is exactly 1, the page belongs to the cache but is not being accessed by any process; it can thus be removed from the page cache whenever free memory becomes scarce, as described in Chapter 16.

14.1.3 Page Cache Handling Functions

The high-level functions that use the page cache involve finding, adding, and removing a page.

The find_get_page macro receives as parameters the address of an address_space object and an offset value. It uses the page_hash macro to derive the address of the hash table entry corresponding to the values of the parameters, and invokes the _ _find_get_page( ) function to search for the requested page descriptor in the proper collision list. In turn, _ _find_get_page( ) acquires the pagecache_lock spin lock, scans the list of entries that have the same hash value, then releases the spin lock. If the page is found, the function increments the count field of the corresponding page descriptor and returns its address; otherwise, it returns NULL.

The add_to_page_cache( ) function inserts a new page descriptor (whose address is passed as a parameter) in the page cache. This is achieved by performing the following operations:

1. Acquires the pagecache_lock spin lock.

2. Clears the PG_uptodate, PG_error, PG_dirty, PG_referenced, PG_arch_1, and PG_checked flags, and sets the PG_locked flag of the page frame to indicate that the page is locked and present in the cache, but not yet filled with data.

3. Increments the count field of the page descriptor.

4. Initializes the index field of the page descriptor with a value passed as a parameter, which specifies the position of the data contained in the page within the page's disk image.

5. Invokes add_page_to_inode_queue( ) to insert the page descriptor in the clean_pages list of an address_space object, whose address is passed as a parameter.

6. Invokes add_page_to_hash_queue( ) to insert the page descriptor in the hash table, using the address_space object address and the value of page's index field as hash keys.

7. Releases the pagecache_lock spin lock.

8. Invokes lru_cache_add( ) to add the page descriptor in the inactive list (see Chapter 16).

The find_or_create_page( ) function is similar to find_get_page; however, if the requested page is not in the cache, the function invokes alloc_page( ) to get a new page frame, then invokes add_to_page_cache( ) to insert the page descriptor in the page cache.

The remove_inode_page( ) function removes a page descriptor from the page cache. This is achieved by acquiring the pagecache_lock spin lock, invoking remove_page_from_inode_queue( ) and remove_page_from_hash_queue( ), and then releasing the spin lock.