16.5 Swapping Out Pages

Section 16.7, later in this chapter, explains what happens when pages are swapped out. As we indicated at the beginning of this chapter, swapping out pages is a last resort, and it is only one part of a general strategy to free memory that uses other tactics as well. In this section, we show how the kernel performs a swap out. This is achieved by a series of functions called in cascading fashion. Let's start with the functions at the higher level.

The swap_out( ) function acts on a single classzone parameter that specifies the memory zone from which pages should be swapped out (see Section 7.1.2). Two other parameters, priority and gfp_mask, are not used.

The swap_out( ) function scans existing memory descriptors and tries to swap out the pages referenced in each process's Page Tables. It terminates as soon as one of the following conditions occurs:

·         The function succeeds in releasing SWAP_CLUSTER_MAX page frames (by default, 32). A page frame is considered released when it is removed from the Page Tables of all processes that share it.

·         The function scans n memory descriptors, where n is the length of the memory descriptor list when the function starts.[5]

[5] The swap_out( ) function can block, so memory descriptors might appear and disappear on the list during a single invocation of the function.

To ensure that all processes are evenly penalized by swap_out( ), the function starts scanning the list from the memory descriptor that was last analyzed in the previous invocation; the address of this memory descriptor is stored in the swap_mm global variable.

For each memory descriptor mm to be considered, the swap_out( ) function increments the usage counter mm->mm_users, thus ensuring that the memory descriptor cannot disappear from the list while the swapping algorithm is working on it. Then, swap_out( ) invokes the swap_out_mm( ) function, passing to it the memory descriptor address mm, the memory zone classzone, and the number of page frames still to be released. Once swap_out_mm( ) returns, swap_out( ) decrements the usage counter mm->mm_users, and then decides whether it should analyze the next memory descriptor in the list or just terminate.

swap_out_mm( ) returns the number of pages it has released from the process that owns the memory descriptor. The swap_out( ) function uses this value to update a counter of how many pages have been released since the beginning of its execution; if the counter reaches the value SWAP_CLUSTER_MAX, swap_out( ) terminates.
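The scan loop just described can be sketched in user-space C. This is a minimal model, not kernel code: struct toy_mm, toy_swap_mm, toy_swap_out( ), and toy_swap_out_mm( ) are hypothetical stand-ins for the objects named in the text, locking and the classzone parameter are left out, and the rule used to advance the resume pointer is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SWAP_CLUSTER_MAX 32

/* Toy memory descriptor: only what the sketch needs. */
struct toy_mm {
    int mm_users;            /* usage counter */
    int swappable_pages;     /* hypothetical: pages this process can give up */
    struct toy_mm *next;     /* circular list of descriptors */
};

/* Plays the role of the swap_mm global: where the next scan resumes. */
struct toy_mm *toy_swap_mm;

/* Hypothetical stand-in for swap_out_mm(): releases at most `count` pages. */
int toy_swap_out_mm(struct toy_mm *mm, int count)
{
    int freed = mm->swappable_pages < count ? mm->swappable_pages : count;

    mm->swappable_pages -= freed;
    return freed;
}

/* Returns the number of pages released during this invocation. */
int toy_swap_out(int nr_descriptors)
{
    int released = 0;
    int n = nr_descriptors;  /* list length sampled when the scan starts */

    assert(toy_swap_mm != NULL);
    while (n-- > 0 && released < SWAP_CLUSTER_MAX) {
        struct toy_mm *mm = toy_swap_mm;

        mm->mm_users++;      /* keep the descriptor alive while scanning */
        released += toy_swap_out_mm(mm, SWAP_CLUSTER_MAX - released);
        mm->mm_users--;

        /* Assumption: move on only if this descriptor could not
         * satisfy the remaining quota. */
        if (released < SWAP_CLUSTER_MAX)
            toy_swap_mm = mm->next;
    }
    return released;
}
```

With three descriptors holding 10, 10, and 50 swappable pages, the first call takes 10 + 10 + 12 pages and leaves toy_swap_mm on the third descriptor, so the next call starts there rather than penalizing the first process again.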

The swap_out_mm( ) function scans the memory regions of the process that owns the memory descriptor mm passed as a parameter. Usually, the function starts analyzing the first memory region object in the mm->mmap list (remember that they are ordered by starting linear addresses). However, if mm is the memory descriptor that was analyzed last in the previous invocation of swap_out( ), swap_out_mm( ) does not restart from the first memory region, but from the memory region that includes the linear address last analyzed in the previous invocation. This linear address is stored in the swap_address field of the memory descriptor; if all memory regions of the process have been analyzed, then the field stores the conventional value TASK_SIZE.

For each memory region of the process that owns the memory descriptor mm, swap_out_mm( ) invokes the swap_out_vma( ) function, passing to it the number of pages yet to be released, the first linear address to analyze, the memory region object, and the memory descriptor. Again, swap_out_vma( ) returns the number of released pages belonging to the memory region. The loop of swap_out_mm( ) continues until either the requested number of pages is released or all memory regions are considered.
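The way swap_address lets a later scan resume in the middle of an address space can be illustrated with a small model. Everything prefixed with toy_ or TOY_ is hypothetical; in particular, the stand-in for swap_out_vma( ) pretends that every page it visits can be released, and the value chosen for TOY_TASK_SIZE (the usual 3 GB boundary on 80x86) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 0x1000UL
#define TOY_TASK_SIZE 0xC0000000UL   /* assumed "scan complete" marker */

struct toy_vma {
    unsigned long vm_start, vm_end;  /* region covers [vm_start, vm_end) */
    struct toy_vma *vm_next;         /* list ordered by starting address */
};

struct toy_mm {
    struct toy_vma *mmap;
    unsigned long swap_address;      /* where the previous scan stopped */
};

/* Hypothetical stand-in for swap_out_vma(): releases one page per step
 * and records in swap_address the first address not yet analyzed. */
int toy_swap_out_vma(struct toy_mm *mm, struct toy_vma *vma,
                     unsigned long start, int count)
{
    unsigned long addr = start;
    int freed = 0;

    while (addr < vma->vm_end && freed < count) {
        addr += TOY_PAGE_SIZE;
        freed++;
    }
    mm->swap_address = addr;
    return freed;
}

int toy_swap_out_mm(struct toy_mm *mm, int count)
{
    unsigned long address = mm->swap_address;
    struct toy_vma *vma = mm->mmap;
    int freed = 0;

    assert(count > 0);
    /* Skip the regions that lie entirely below the resume address. */
    while (vma != NULL && vma->vm_end <= address)
        vma = vma->vm_next;

    for (; vma != NULL && freed < count; vma = vma->vm_next) {
        if (address < vma->vm_start)
            address = vma->vm_start;
        freed += toy_swap_out_vma(mm, vma, address, count - freed);
    }
    if (freed < count)               /* ran out of regions */
        mm->swap_address = TOY_TASK_SIZE;
    return freed;
}
```

In this model the caller is expected to reset swap_address before scanning the same descriptor from the beginning again; that step is left out.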

The swap_out_vma( ) function checks that the memory region is swappable (e.g., the flag VM_RESERVED is cleared). It then starts a sequence in which it considers all entries in the process's Page Global Directory that refer to linear addresses in the memory region. For each such entry, the function invokes the swap_out_pgd( ) function, which in turn considers all entries in a Page Middle Directory corresponding to address intervals in the memory region. For each such entry, swap_out_pgd( ) invokes the swap_out_pmd( ) function, which considers all entries in a Page Table referencing pages in the memory region. For each such page, swap_out_pmd( ) invokes the try_to_swap_out( ) function, which finally attempts to swap out the page. As usual, this chain of function invocations breaks as soon as the requested number of released page frames is reached.
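The cascade and its early exit can be modeled with three nested loops. The sketch below is only illustrative: the toy_ types and functions are hypothetical, each table has just four entries, linear address ranges and the VM_RESERVED check are ignored, and the stand-in for try_to_swap_out( ) releases every present page it is given.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PTRS 4   /* entries per table, tiny on purpose */

typedef struct { int present; } toy_pte;
typedef struct { toy_pte *ptes; } toy_pmd;   /* NULL: no Page Table */
typedef struct { toy_pmd *pmds; } toy_pgd;   /* NULL: no Page Middle Directory */

/* Hypothetical stand-in for try_to_swap_out(): 1 if the page was released. */
int toy_try_to_swap_out(toy_pte *pte)
{
    if (!pte->present)
        return 0;
    pte->present = 0;
    return 1;
}

/* Scans one Page Table, stopping once `count` pages are released. */
int toy_swap_out_pmd(toy_pmd *pmd, int count)
{
    int freed = 0;

    if (pmd->ptes == NULL)
        return 0;
    for (int i = 0; i < TOY_PTRS && freed < count; i++)
        freed += toy_try_to_swap_out(&pmd->ptes[i]);
    return freed;
}

/* Scans one Page Middle Directory. */
int toy_swap_out_pgd(toy_pgd *pgd, int count)
{
    int freed = 0;

    if (pgd->pmds == NULL)
        return 0;
    for (int i = 0; i < TOY_PTRS && freed < count; i++)
        freed += toy_swap_out_pmd(&pgd->pmds[i], count - freed);
    return freed;
}

/* Scans the Page Global Directory entries of the toy region. */
int toy_swap_out_vma(toy_pgd *pgd_table, int count)
{
    int freed = 0;

    assert(pgd_table != NULL);
    for (int i = 0; i < TOY_PTRS && freed < count; i++)
        freed += toy_swap_out_pgd(&pgd_table[i], count - freed);
    return freed;
}
```

Each level passes down only the number of pages still to be released, so the whole chain unwinds as soon as the quota is met.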

16.5.1 The try_to_swap_out( ) Function

The try_to_swap_out( ) function attempts to free a given page frame, either discarding or swapping out its contents. The function returns the value 1 if it succeeds in releasing the page, and 0 otherwise. Remember that by "releasing the page," we mean that the references to the page frame are removed from the Page Tables of all processes that share the page. In this case, however, the page frame is not necessarily released to the buddy system; for instance, it could be referenced by the swap cache.

The parameters of the function are:

mm

Memory descriptor address

vma

Memory region object address

address

Initial linear address of the page

page_table

Address of the Page Table entry that maps address

page

Page descriptor address

classzone

The memory zone from which pages should be swapped out

The try_to_swap_out( ) function uses the Accessed and Dirty flags included in the Page Table entry. We stated in Section 2.4.1 that the Accessed flag is automatically set by the CPU's paging unit at every read or write access, while the Dirty flag is automatically set at every write access. These two flags offer a limited degree of hardware support that allows the kernel to use a primitive LRU replacement algorithm.
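To make the role of the two flags concrete, here is a small sketch of how software can test them in a 32-bit 80x86 Page Table entry, where Accessed is bit 5 and Dirty is bit 6. The toy_ helper names are hypothetical, and the update is not atomic as it would have to be in a real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_ACCESSED 0x020u   /* bit 5 of an 80x86 Page Table entry */
#define TOY_PAGE_DIRTY    0x040u   /* bit 6 */

/* Reports whether the page was accessed since the flag was last cleared,
 * then clears the flag so that the next check starts a new interval. */
int toy_test_and_clear_young(uint32_t *pte)
{
    int young;

    assert(pte != NULL);
    young = (*pte & TOY_PAGE_ACCESSED) != 0;
    *pte &= ~TOY_PAGE_ACCESSED;
    return young;
}

/* Reports whether the page was written to; the flag is left untouched. */
int toy_pte_dirty(uint32_t pte)
{
    return (pte & TOY_PAGE_DIRTY) != 0;
}
```

Clearing Accessed on every visit gives each page a second chance: only a page that stays untouched between two consecutive scans becomes a candidate for swapping out.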

try_to_swap_out( ) must recognize many different situations demanding different responses, but the responses all share many of the same basic operations. In particular, the function performs the following steps:

1.       Checks the Accessed flag of the page_table entry. If it is set, the page must be considered "young"; in this case, the function clears the flag, invokes mark_page_accessed( ) (see Section 16.7.2 later in this chapter), and returns 0. This check ensures that a page can be swapped out only if it was not accessed since the previous invocation of try_to_swap_out( ) on it.

2.       If the memory region is locked (VM_LOCKED flag set), invokes mark_page_accessed( ) on the page, and returns 0.

3.       If the PG_active flag in the page->flags field is set, the page is considered actively used and shouldn't be swapped out; the function returns 0.

4.       If the page does not belong to the memory zone specified by the classzone parameter, returns 0.

5.       Tries to lock the page; if it is already locked (PG_locked flag set), it is not possible to swap out the page because it is involved in an I/O data transfer; the function returns 0.

6.       At this point, the function knows that the page can be swapped out. Forces the value zero into the Page Table entry addressed by page_table and invokes flush_tlb_page( ) to invalidate the corresponding TLB entries.

7.       If the Dirty flag in the Page Table entry was set, invokes the set_page_dirty( ) function to set the PG_dirty flag in the page descriptor. Moreover, this function moves the page into the dirty_pages list of the address_space object referenced by page->mapping, if any, and marks the inode page->mapping->host as dirty (see Section 14.1.2.2).

8.       If the page belongs to the swap cache, it performs the following substeps:

a.       Gets the swapped-out page identifier from page->index.

b.       Invokes swap_duplicate( ) to verify whether the page slot index is valid and to increment the corresponding usage counter in swap_map.

c.       Stores the swapped-out page identifier in the Page Table entry addressed by page_table.

d.       Decrements the rss field of the memory descriptor mm.

e.       Unlocks the page.

f.        Decrements the page usage counter page->count.

g.       If the page is no longer referenced by any process, it returns 1; otherwise, it returns 0.[6]

[6] The check is easily done by looking at the value of the page->count usage counter. Of course, the function must consider that the counter is incremented when the page is inserted into the swap cache (or the page cache), and when there are buffers allocated on the page (i.e., when the page->buffers field is not null).

9.       Notice that the function does not have to allocate a new page slot, because the page frame has already been swapped out when scanning the Page Tables of some other process.

10.   The page is not inserted into the swap cache. Checks whether the page belongs to an address_space object (the page->mapping field is not null); in this case, the page belongs to a shared file memory mapping, so the function jumps to Step 8d to release the page frame, leaving the corresponding Page Table entry null.

Notice that the page frame reference of the process is released even if the page is not saved into a swap area. This is because the page has an image on disk, and the function has already triggered, if necessary, the update of this image in Step 7. Moreover, notice also that the page frame is not released to the buddy system because the page is still owned by the page cache (see Section 14.1.2.3).

11.   If the function reaches this point, the page is not inserted into the swap cache, and it does not belong to an address_space object. The function checks the status of the PG_dirty flag; if it is cleared, the function jumps to Step 8d to release the page frame, leaving the corresponding Page Table entry null.

There is no need to save the page contents on a swap area because the process never wrote into the page frame. The kernel recognizes this case because the PG_dirty flag is cleared, and this flag is never reset if the page has no image on disk or if it belongs to a private memory mapping. When the process accesses the same page again, the kernel handles the Page Fault through the demand paging technique (see Section 8.4.3); then the new page frame is filled with exactly the same data as that stored in this released page frame.

12.   If the function reaches this point, the page is not inserted into the swap cache, it does not have an image on disk, and it is dirty; here the function checks whether the page contains buffers (that is, it is a buffer page whose page->buffers field is not null). In this case, the function restores the original contents of the Page Table entry, unlocks the page, and returns 0.

How could the page host some buffers if it doesn't belong to an address_space object (that is, if it has no image on disk)? This might occur in rare circumstances, for instance, if the page maps a portion of a file that has just been truncated. In these cases, try_to_swap_out( ) does nothing.

13.   At this point, the page is not inserted into the swap cache, it does not have an image on disk, and it is dirty; the function must actually swap it out to a new page slot. It invokes the get_swap_page( ) function to allocate a free page slot in an active swap area. If there are none, it restores the original content of the Page Table entry, unlocks the page, and returns 0.

14.   Invokes add_to_swap_cache( ) to insert the page in the swap cache. The function might fail if another kernel control path is trying to swap in the page. As we shall see in the next section, this can happen even if the page slot is not referenced by any process. If add_to_swap_cache( ) fails, the function invokes swap_free( ) to release the page slot and restarts from Step 13 to allocate another one.

15.   Sets the PG_uptodate flag of the page.

16.   Invokes the set_page_dirty( ) function again (see Step 7 above) because add_to_swap_cache( ) resets the PG_dirty flag.

17.   Jumps to Step 8c to store the swapped-out page identifier in the Page Table entry and to release the page frame.
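The order in which the steps above rule out or select a page can be summarized as a decision function. This is a sketch under strong simplifications: struct toy_page_state, enum toy_outcome, and toy_classify( ) are hypothetical names, the function only classifies a snapshot of the relevant flags, and it does not model the Page Table update, page locking, usage counters, or the retry when insertion in the swap cache fails. Step numbers in the comments refer to the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Snapshot of the state the steps consult (all fields hypothetical). */
struct toy_page_state {
    bool pte_accessed;    /* Accessed flag in the Page Table entry */
    bool vma_locked;      /* VM_LOCKED set on the memory region */
    bool pg_active;       /* PG_active */
    bool in_classzone;    /* page belongs to the requested zone */
    bool pg_locked;       /* PG_locked already set by someone else */
    bool pte_dirty;       /* Dirty flag in the Page Table entry */
    bool pg_dirty;        /* PG_dirty */
    bool in_swap_cache;
    bool has_mapping;     /* page->mapping is not null */
    bool has_buffers;     /* page->buffers is not null */
    bool slot_available;  /* a free page slot could be allocated */
};

enum toy_outcome {
    TOY_KEEP_YOUNG, TOY_KEEP_LOCKED_VMA, TOY_KEEP_ACTIVE,
    TOY_KEEP_WRONG_ZONE, TOY_KEEP_BUSY, TOY_KEEP_BUFFERS, TOY_KEEP_NO_SLOT,
    TOY_REUSE_SLOT, TOY_DROP_FILE_PAGE, TOY_DROP_CLEAN, TOY_NEW_SLOT
};

enum toy_outcome toy_classify(struct toy_page_state *s)
{
    assert(s != NULL);
    if (s->pte_accessed) {                 /* Step 1: second chance */
        s->pte_accessed = false;
        return TOY_KEEP_YOUNG;
    }
    if (s->vma_locked)    return TOY_KEEP_LOCKED_VMA;   /* Step 2 */
    if (s->pg_active)     return TOY_KEEP_ACTIVE;       /* Step 3 */
    if (!s->in_classzone) return TOY_KEEP_WRONG_ZONE;   /* Step 4 */
    if (s->pg_locked)     return TOY_KEEP_BUSY;         /* Step 5 */

    if (s->pte_dirty)                      /* Step 7: record dirtiness */
        s->pg_dirty = true;

    if (s->in_swap_cache) return TOY_REUSE_SLOT;        /* Step 8 */
    if (s->has_mapping)   return TOY_DROP_FILE_PAGE;    /* Step 10 */
    if (!s->pg_dirty)     return TOY_DROP_CLEAN;        /* Step 11 */
    if (s->has_buffers)   return TOY_KEEP_BUFFERS;      /* Step 12 */
    if (!s->slot_available) return TOY_KEEP_NO_SLOT;    /* Step 13 */

    s->in_swap_cache = true;               /* Steps 14 through 17 */
    return TOY_NEW_SLOT;
}
```

A dirty anonymous page that was recently accessed is kept on the first call and gets a new page slot on the second, which mirrors the "not accessed since the previous invocation" rule of Step 1.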

The try_to_swap_out( ) function does not directly invoke rw_swap_page( ) to trigger the activation of the I/O data transfer. Rather, the function limits itself to inserting the page in the swap cache, if necessary, and to marking the page as dirty. However, we'll see in Section 16.7.4, later in this chapter, that the kernel periodically flushes the disk caches to disk by invoking the writepage methods of the address_space objects that own the dirty pages.

As mentioned in Section 16.3 earlier in this chapter, the address_space object of the pages that belong to the swap cache is a special object stored in swapper_space. Its writepage method is implemented by the swap_writepage( ) function, which executes the following steps:

1.       Checks whether the page is not included in the Page Tables of any process; in this case, it removes the page from the swap cache and releases the swap page slot.

2.       Otherwise, it invokes rw_swap_page( ) on the page, specifying the WRITE command (see Section 16.4.1 earlier in this chapter).
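The two cases handled by swap_writepage( ) can be pictured with a toy model. The names struct toy_swap_page and toy_swap_writepage( ) are hypothetical, a single counter stands in for "referenced by some process," and queuing the write merely sets a flag instead of invoking rw_swap_page( ).

```c
#include <assert.h>
#include <stdbool.h>

enum toy_wp_action { TOY_WP_DROPPED, TOY_WP_WRITE_STARTED };

/* Hypothetical view of a page that sits in the swap cache. */
struct toy_swap_page {
    int  process_refs;    /* processes still referring to the page */
    bool in_swap_cache;
    bool slot_in_use;     /* page slot reserved in the swap area */
    bool write_queued;    /* stands in for the WRITE request */
};

enum toy_wp_action toy_swap_writepage(struct toy_swap_page *p)
{
    assert(p->in_swap_cache);   /* the method is only called for such pages */

    if (p->process_refs == 0) {
        /* Step 1: nobody needs the data, so writing it would be wasted I/O. */
        p->in_swap_cache = false;
        p->slot_in_use = false;
        return TOY_WP_DROPPED;
    }
    /* Step 2: the contents must reach the swap area. */
    p->write_queued = true;
    return TOY_WP_WRITE_STARTED;
}
```

The first branch is what makes the deferred write worthwhile: a page whose last user has gone away between try_to_swap_out( ) and the flush never costs a disk transfer.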