The pages swapped out from memory are stored in a swap area, which may be implemented either as a disk partition of its own or as a file included in a larger partition. Several different swap areas may be defined, up to a maximum number specified by the MAX_SWAPFILES macro (usually set to 32).
Having multiple swap areas allows a system administrator to spread a lot of swap space among several disks so that the hardware can act on them concurrently; it also lets swap space be increased at runtime without rebooting the system.
Each swap area consists of a sequence of page slots: 4,096-byte blocks used to contain a swapped-out page. The first page slot of a swap area is used to persistently store some information about the swap area; its format is described by the swap_header union composed of two structures, info and magic. The magic structure provides a string that marks part of the disk unambiguously as a swap area; it consists of just one field, magic.magic, which contains a 10-character "magic" string. The magic structure essentially allows the kernel to unambiguously identify a file or a partition as a swap area; the text of the string depends on the swapping algorithm version: SWAP-SPACE for Version 1 or SWAPSPACE2 for Version 2. The field is always located at the end of the first page slot.
The info structure includes the following fields:
info.bootbits
Not used by the swapping algorithm; this field corresponds to the first 1,024 bytes of the swap area, which may store partition data, disk labels, and so on.
info.version
Swapping algorithm version.
info.last_page
Last page slot that is effectively usable.
info.nr_badpages
Number of defective page slots.
info.padding[125]
Padding bytes.
info.badpages[1]
Up to 637 numbers specifying the location of defective page slots.
The data stored in a swap area is meaningful as long as the system is on. When the system is switched off, all processes are killed, so all data stored by processes in swap areas is discarded. For this reason, swap areas contain very little control information; essentially, the swap area type and the list of defective page slots. This control information easily fits in a single 4 KB page.
Usually, the system administrator creates a swap partition when creating the other partitions on the Linux system, and then uses the mkswap command to set up the disk area as a new swap area. That command initializes the fields just described within the first page slot. Since the disk may include some bad blocks, the program also examines all other page slots to locate the defective ones. But executing the mkswap command leaves the swap area in an inactive state. Each swap area can be activated in a script file at system boot or dynamically after the system is running. An initialized swap area is considered active when it effectively represents an extension of the system RAM (see Section 16.2.3 later in this chapter).
Each active swap area has its own swap_info_struct descriptor in memory. The fields of the descriptor are illustrated in Table 16-1.
The flags field includes two overlapping subfields:
SWP_USED
1 if the swap area is active; 0 if it is nonactive.
SWP_WRITEOK
This 2-bit field is set to 3 if it is possible to write into the swap area and to 0 otherwise; since the least-significant bit of this field coincides with the bit used to implement SWP_USED, a swap area can be written only if it is active. The kernel is not allowed to write in a swap area when it is being activated or deactivated.
The swap_map field points to an array of counters, one for each swap area page slot. If the counter is equal to 0, the page slot is free; if it is positive, the page slot is filled with a swapped-out page (the exact meaning of positive values is discussed later in Section 16.3). If the counter has the value SWAP_MAP_MAX (equal to 32, 767), the page stored in the page slot is "permanent" and cannot be removed from the corresponding slot. If the counter has the value SWAP_MAP_BAD (equal to 32,768), the page slot is considered defective, and thus unusable.[1]
[1] "Permanent" page slots protect against overflows of swap_map counters. Without them, valid page slots could become "defective" if they are referenced too many times, thus leading to data losses. However, no one really expects that a page slot counter could reach the value 32,768. It's just a "belt and suspenders" approach.
The prio field is a signed integer that denotes the order in which the swap subsystem should consider each swap area. Swap areas implemented on faster disks should have a higher priority so they will be used first. Only when they are filled does the swapping algorithm consider lower-priority swap areas. Swap areas that have the same priority are cyclically selected to distribute swapped-out pages among them. As we shall see in Section 16.2.3, the priority is assigned when the swap area is activated.
The sdev_lock field is a spin lock that protects the descriptor against concurrent accesses in SMP systems.
The swap_info array includes MAX_SWAPFILES swap area descriptors. Of course, not all of them are necessarily used, only those having the SWP_USED flag set. Figure 16-1 illustrates the swap_info array, one swap area, and the corresponding array of counters.
The nr_swapfiles variable stores the index of the last array element that contains, or that has contained, a used swap area descriptor. Despite its name, the variable does not contain the number of active swap areas.
Descriptors of active swap areas are also inserted into a list sorted by the swap area priority. The list is implemented through the next field of the swap area descriptor, which stores the index of the next descriptor in the swap_info array. This use of the field as an index is different from most fields with the name next, which are usually pointers.
The swap_list variable, of type swap_list_t, includes the following fields:
head
Index in the swap_info array of the first list element.
next
Index in the swap_info array of the descriptor of the next swap area to be selected for swapping out pages. This field is used to implement a round-robin algorithm among maximum-priority swap areas with free slots.
The swaplock spin lock protects the list against concurrent accesses in multiprocessor systems.
The max field of the swap area descriptor stores the size of the swap area in pages, while the pages field stores the number of usable page slots. These numbers differ because pages does not take the first page slot and the defective page slots into consideration.
Finally, the nr_swap_pages variable contains the number of available (free and nondefective) page slots in all active swap areas, while the total_swap_pages variable contains the total number of nondefective page slots.
A swapped-out page is uniquely identified quite simply by specifying the index of the swap area in the swap_info array and the page slot index inside the swap area. Since the first page (with index 0) of the swap area is reserved for the swap_header union discussed earlier, the first useful page slot has index 1. The format of a swapped-out page identifier is illustrated in Figure 16-2.
The SWP_ENTRY(type,offset) macro constructs a swapped-out page identifier from the swap area index type and the page slot index offset. Conversely, the SWP_TYPE and SWP_OFFSET macros extract from a swapped-out page identifier the swap area index and the page slot index, respectively.
When a page is swapped out, its identifier is inserted as the page's entry into the Page Table so the page can be found again when needed. Notice that the least-significant bit of such an identifier, which corresponds to the Present flag, is always cleared to denote the fact that the page is not currently in RAM. However, at least one of the remaining 31 bits has to be set because no page is ever stored in slot 0 of swap area 0. It is therefore possible to identify three different cases from the value of a Page Table entry:
Null entry
The page does not belong to the process address space.
First 31 most-significant bits not all equal to 0, last bit equal to 0
The page is currently swapped out.
Least-significant bit equal to 1
The page is contained in RAM.
Notice that the maximum size of a swap area is determined by the number of bits available to identify a slot. On the 80 x 86 architecture, the 24 bits available limit the size of a swap area to 224 slots (that is, to 64 GB).
Since a page may belong to the address spaces of several processes (see the later section Section 16.3), it may be swapped out from the address space of one process and still remain in main memory; therefore, it is possible to swap out the same page several times. A page is physically swapped out and stored just once, of course, but each subsequent attempt to swap it out increments the swap_map counter.
The swap_duplicate( ) function is usually invoked while trying to swap out an already swapped-out page. It just verifies that the swapped-out page identifier passed as its parameter is valid and increments the corresponding swap_map counter. More precisely, it performs the following actions:
1. Uses the SWP_TYPE and SWP_OFFSET macros to extract the swap area number type and the page slot index offset from the parameter.
2. Checks whether the swap area is activated; if not, it returns 0 (invalid identifier).
3. Checks whether the page slot is valid and not free (its swap_map counter is greater than 0 and less than SWAP_MAX_BAD); if not, returns 0 (invalid identifier).
4. Otherwise, the swapped-out page identifier locates a valid page. Increments the swap_map counter of the page slot if it has not already reached the value SWAP_MAP_MAX.
5. Returns 1 (valid identifier).
Once a swap area is initialized, the superuser (or, more precisely, any user having the CAP_SYS_ADMIN capability, as described in Section 20.1.1) may use the swapon and swapoff programs to activate and deactivate the swap area, respectively. These programs use the swapon( ) and swapoff( ) system calls; we'll briefly sketch out the corresponding service routines.
The sys_swapon( ) service routine receives the following as parameters:
specialfile
This parameter points to the pathname (in the User Mode address space) of the device file (partition) or plain file used to implement the swap area.
swap_flags
This parameter consists of a single SWAP_FLAG_PREFER bit plus 15 bits of priority of the swap area (these bits are significant only if the SWAP_FLAG_PREFER bit is on).
The function checks the fields of the swap_header union that was put in the first slot when the swap area was created. The function performs these main steps:
1. Checks that the current process has the CAP_SYS_ADMIN capability.
2. Searches for the first element in the swap_info array of swap area descriptors that have the SWP_USED flag cleared, meaning that the corresponding swap area is inactive. If there is none, there are already MAX_SWAPFILES active swap areas, so the function returns an error code.
3. A descriptor for the swap area has been found. The function initializes the descriptor's fields (setting flags to SWP_USED, setting lowest_bit and highest_bit to 0, and so on). Moreover, if the descriptor's index is greater than nr_swapfiles, the function updates that variable.
4. If the swap_flags parameter specifies a priority for the new swap area, the function sets the prio field of the descriptor. Otherwise, it initializes the field with the lowest priority among all active swap areas minus 1 (thus assuming that the last activated swap area is on the slowest block device). If no other swap areas are already active, the function assigns the value -1.
5. Copies the string pointed to by the specialfile parameter from the User Mode address space.
6. Invokes path_init( ) and path_walk( ) to perform a pathname lookup on the string copied from the User Mode address space (see Section 12.5).
7. Stores the addresses of the dentry object and of the mounted filesystem descriptor returned by path_walk( ) in the swap_file and swap_vfsmnt fields of the swap area descriptor, respectively.
8. If the specialfile parameter identifies a block device file, the function performs the following substeps:
a. Stores the device number in the swap_device field of the descriptor.
b. Sets the block size of the device to 4 KB—that is, sets its blksize_size entry to PAGE_SIZE.
c. Initializes the block device driver by invoking the bd_acquire( ) and do_open( ) functions, described in Section 13.4.2.
9. Checks to make sure that the swap area was not already activated by looking at the address_space objects of the other active swap areas in swap_info (given an address q of a swap area descriptor, the corresponding address_space object is obtained by q->swap_file->d_inode->i_mapping). If the swap area is already active, it returns an error code.
10. Allocates a page frame and invokes rw_swap_page_nolock( ) (see Section 16.4 later in this chapter) to fill it with the swap_header union stored in the first page of the swap area.
11. Checks that the magic string in the last ten characters of the first page in the swap area is equal to SWAP-SPACE or to SWAPSPACE2 (there are two slightly different versions of the swapping algorithm). If not, the specialfile parameter does not specify an already initialized swap area, so the function returns an error code. For the sake of brevity, we'll suppose that the swap area has the SWAPSPACE2 magic string.
12. Initializes the lowest_bit and highest_bit fields of the swap area descriptor according to the size of the swap area stored in the info.last_page field of the swap_header union.
13. Invokes vmalloc( ) to create the array of counters associated with the new swap area and store its address in the swap_map field of the swap descriptor. Initializes the elements of the array to 0 or to SWAP_MAP_BAD, according to the list of defective page slots stored in the info.bad_pages field of the swap_header union.
14. Computes the number of useful page slots by accessing the info.last_page and info.nr_badpages fields in the first page slot.
15. Sets the flags field of the swap descriptor to SWP_WRITEOK, sets the pages field to the number of useful page slots, and updates the nr_swap_pages and total_swap_pages variables.
16. Inserts the new swap area descriptor in the list to which the swap_list variable points.
17. Releases the page frame that contains the data of the first page of the swap area and returns 0 (success).
The sys_swapoff( ) service routine deactivates a swap area identified by the parameter specialfile. It is much more complex and time-consuming than sys_swapon( ), since the partition to be deactivated might still contain pages that belong to several processes. The function is thus forced to scan the swap area and to swap in all existing pages. Since each swap in requires a new page frame, it might fail if there are no free page frames left. In this case, the function returns an error code. All this is achieved by performing the following major steps:
1. Checks that the current process has the CAP_SYS_ADMIN capability.
2. Copies the string pointed to by specialfile, and invokes path_init( ) and path_walk( ) to perform a pathname lookup.
3. Scans the list to which swap_list points and locates the descriptor whose swap_file field points to the dentry object found by the pathname lookup. If no such descriptor exists, an invalid parameter was passed to the function, so it returns an error code.
4. Otherwise, if the descriptor exists, checks that its SWP_WRITE flag is set; if not, returns an error code because the swap area is already being deactivated by another process.
5. Removes the descriptor from the list and sets its flags field to SWP_USED so the kernel doesn't store more pages in the swap area before this function deactivates it.
6. Subtracts the swap area size stored in the pages field of the swap area descriptor from the values of nr_swap_pages and total_swap_pages.
7. Invokes the try_to_unuse( ) function (see below) to successively force all pages left in the swap area into RAM and to correspondingly update the Page Tables of the processes that use these pages.
8. If try_to_unuse( ) fails in allocating all requested page frames, the swap area cannot be deactivated. Therefore, the function executes the following substeps:
a. Reinserts the swap area descriptor in the swap_list list and sets its flags field to SWP_WRITEOK (see Step 5)
b. Adds the content of the pages field to the nr_swap_pages and total_swap_pages variables (see Step 6)
c. Invokes path_release( ) to release the VFS objects allocated by path_walk( ) in Step 2.
d. Finally, returns an error code.
9. Otherwise, all used page slots have been successfully transferred to RAM. Therefore, the function executes the following substeps:
a. If specialfile identifies a block device file, releases the corresponding block device driver.
b. Invokes path_release( ) to release the VFS objects allocated by path_walk( ) in Step 2.
c. Releases the memory area used to store the swap_map array.
d. Invokes path_release( ) again because the VFS objects that refer to specialfile have been allocated by the path_walk( ) function invoked by sys_swapon( ) (see Step 6 in the previous section).
e. Returns 0 (success).
As stated previously, the try_to_unuse( ) function swaps in pages and updates all the Page Tables of processes that have swapped out pages. To that end, the function visits the address spaces of all kernel threads and processes, starting with the init_mm memory descriptor that is used as a marker. It is a time-consuming function that runs mostly with the interrupts enabled. Synchronization with other processes is therefore critical.
The try_to_unuse( ) function scans the swap_map array of the swap area. When the function finds a in-use page slot, it first swaps in the page, and then starts looking for the processes that reference the page. The ordering of these two operations is crucial to avoid race conditions. While the I/O data transfer is ongoing, the page is locked, so no process can access it. Once the I/O data transfer completes, the page is locked again by try_to_unuse( ), so it cannot be swapped out again by another kernel control path. Race conditions are also avoided because each process looks up the page cache before starting a swap in or swap out operation (see the later section Section 16.3). Finally, the swap area considered by try_to_unuse( ) is marked as nonwritable (SWP_WRITE flag is not set), so no process can perform a swap out on a page slot of this area.
However, try_to_unuse( ) might be forced to scan the swap_map array of usage counters of the swap area several times. This is because memory regions that contain references to swapped-out pages might disappear during one scan and later reappear in the process lists.
For instance, recall the description of the do_munmap( ) function (in Section 8.3.5): whenever a process releases an interval of linear addresses, do_munmap( ) removes from the process list all memory regions that include the affected linear addresses; later, the function reinserts the memory regions that have been only partially unmapped in the process list. do_munmap( ) takes care of freeing the swapped-out pages that belong to the interval of released linear addresses; however, it commendably doesn't free the swapped-out pages that belong to the memory regions that have to be reinserted in the process list.
Hence, try_to_unuse( ) might fail in finding a process that references a given page slot because the corresponding memory region is temporarily not included in the process list. To cope with this fact, try_to_unuse( ) keeps scanning the swap_map array until all reference counters are null. Eventually, the ghost memory regions referencing the swapped-out pages will reappear in the process lists, so try_to_unuse( ) will succeed in freeing all page slots.
Let's describe now the major operations executed by try_to_unuse( ). It executes a continuous loop on the reference counters in the swap_map array of the swap area passed as its parameter. For each reference counter, the function performs the following steps:
1. If the counter is equal to 0 (no page is stored there) or to SWAP_MAP_BAD, it continues with the next page slot.
2. Otherwise, it invokes the read_swap_cache_async( ) function (see Section 16.4 later in this chapter) to swap in the page. This consists of allocating, if necessary, a new page frame, filling it with the data stored in the page slot, and putting the page in the swap cache.
3. Waits until the new page has been properly updated from disk and locks it.
4. While the function was executing the previous step, the process could have been suspended. Therefore, it checks again whether the reference counter of the page slot is null; if so, it continues with the next page slot (this swap page has been freed by another kernel control path).
5. Invokes unuse_process( ) on every memory descriptor in the doubly linked list whose head is init_mm (see Section 8.2). This time-consuming function scans all Page Table entries of the process that owns the memory descriptor, and replaces each occurrence of the swapped-out page identifier with the physical address of the page frame. To reflect this move, the function also decrements the page slot counter in the swap_map array (unless it is equal to SWAP_MAP_MAX) and increments the usage counter of the page frame.
6. Invokes shmem_unuse( ) to check whether the swapped-out page is used for an IPC shared memory resource and to properly handle that case (see Section 19.3.5).
7. Checks the value of the reference counter of the page. If it is equal to SWAP_MAP_MAX, the page slot is "permanent." To free it, it forces the value 1 into the reference counter.
8. The swap cache might own the page as well (it contributes to the value of the reference counter). If the page belongs to the swap cache, it invokes the rw_swap_page( ) function to flush its contents on disk (if the page is dirty), invokes delete_from_swap_cache( ) to remove the page from the swap cache, and decrements its reference counter.
9. Sets the PG_dirty flag of the page descriptor and unlocks the page.
10. Checks the need_resched field of the current process; if it is set, it invokes schedule( ) to relinquish the CPU. Deactivating a swap area is a long job, and the kernel must ensure that the other processes in the system still continue to execute. The try_to_unuse( ) function continues from this step whenever the process is selected again by the scheduler.
11. Proceeds with the next page slot. starting at Step 1.
The function continues until every reference counter in the swap_map array is null. Recall that even if the function starts examining the next page slot, the reference counter of the previous page slot could still be positive. In fact, a "ghost" process could still reference the page, typically because some memory regions have been temporarily removed from the process list scanned in Step 5. Eventually, try_to_unuse( ) catches every reference. In the meantime, however, the page is no longer in the swap cache, it is unlocked, and a copy is still included in the page slot of the swap area being deactivated.
One might expect that this situation could lead to data loss. For instance, suppose that some "ghost" process accesses the page slot and starts swapping the page in. Since the page is no longer in the swap cache, the process fills a new page frame with the data read from disk. However, this page frame would be different from the page frames owned by the processes that are supposed to share the page with the "ghost" process.
This problem does not arise when deactivating a swap area because interference from a ghost process could happen only if a swapped-out page belongs to a private anonymous memory mapping.[2] In this case, the page frame is handled by means of the Copy on Write mechanism described in Chapter 8, so it is perfectly legal to assign different page frames to the processes that reference the page. However, the try_to_unuse( ) function marks the page as "dirty" (Step 9); otherwise, the try_to_swap_out( ) function might later drop the page from the Page Table of some process without saving it in an another swap area (see the later section Section 16.5).
[2] Actually, the page might also belong to an IPC shared memory region; Chapter 19 has a discussion of this case.
As we shall see later, when freeing memory, the kernel swaps out many pages in a short period of time. It is therefore important to try to store these pages in contiguous slots to minimize disk seek time when accessing the swap area.
A first approach to an algorithm that searches for a free slot could choose one of two simplistic, rather extreme strategies:
· Always start from the beginning of the swap area. This approach may increase the average seek time during swap-out operations because free page slots may be scattered far away from one another.
· Always start from the last allocated page slot. This approach increases the average seek time during swap-in operations if the swap area is mostly free (as is usually the case) because the handful of occupied page slots may be scattered far away from one another.
Linux adopts a hybrid approach. It always starts from the last allocated page slot unless one of these conditions occurs:
· The end of the swap area is reached
· SWAPFILE_CLUSTER (usually 256) free page slots were allocated after the last restart from the beginning of the swap area
The cluster_nr field in the swap_info_struct descriptor stores the number of free page slots allocated. This field is reset to 0 when the function restarts allocation from the beginning of the swap area. The cluster_next field stores the index of the first page slot to be examined in the next allocation.[3]
[3] As you may have noticed, the names of Linux data structures are not always appropriate. In this case, the kernel does not really "cluster" page slots of a swap area.
To speed up the search for free page slots, the kernel keeps the lowest_bit and highest_bit fields of each swap area descriptor up to date. These fields specify the first and the last page slots that could be free; in other words, any page slot below lowest_bit and above highest_bit is known to be occupied.
The scan_swap_map( ) function is used to find a free page slot in a given swap area. It acts on a single parameter, which points to a swap area descriptor and returns the index of a free page slot. It returns 0 if the swap area does not contain any free slots. The function performs the following steps:
1. It tries first to use the current cluster. If the cluster_nr field of the swap area descriptor is positive, it scans the swap_map array of counters starting from the element at index cluster_next and looks for a null entry. If a null entry is found, it decrements the cluster_nr field and goes to Step 4.
2. If this point is reached, either the cluster_nr field is null or the search starting from cluster_next didn't find a null entry in the swap_map array. It is time to try the second stage of the hybrid search. It reinitializes cluster_nr to SWAPFILE_CLUSTER and restarts scanning the array from the lowest_bit index that is trying to find a group of SWAPFILE_CLUSTER free page slots. If such a group is found, it goes to Step 4.
3. No group of SWAPFILE_CLUSTER free page slots exists. It restarts scanning the array from the lowest_bit index that is trying to find a single free page slot. If no null entry is found, it sets the lowest_bit field to the maximum index in the array, the highest_bit field to 0, and returns 0 (the swap area is full).
4. A null entry is found. Puts the value 1 in the entry, decrements nr_swap_pages, updates the lowest_bit and highest_bit fields if necessary, and sets the cluster_next field to the index of the page slot just allocated plus 1.
5. Returns the index of the allocated page slot.
The get_swap_page( ) function is used to find a free page slot by searching all the active swap areas. The function, which returns the index of a newly allocated page slot or 0 if all swap areas are filled, takes into consideration the different priorities of the active swap areas.
Two passes are necessary. The first pass is partial and applies only to areas that have a single priority; the function searches such areas in a round-robin fashion for a free slot. If no free page slot is found, a second pass is made starting from the beginning of the swap area list; during this second pass, all swap areas are examined. More precisely, the function performs the following steps:
1. If nr_swap_pages is null or if there are no active swap areas, returns 0.
2. Starts by considering the swap area pointed to by swap_list.next (recall that the swap area list is sorted by decreasing priorities).
3. If the swap area is active and not being deactivated, invokes scan_swap_map( ) to allocate a free page slot. If scan_swap_map( ) returns a page slot index, the function's job is essentially done, but it must prepare for its next invocation. Thus, it updates swap_list.next to point to the next swap area in the swap area list, if the latter has the same priority (thus continuing the round-robin use of these swap areas). If the next swap area does not have the same priority as the current one, the function sets swap_list.next to the first swap area in the list (so that the next search will start with the swap areas that have the highest priority). The function finishes by returning the identifier corresponding to the page slot just allocated.
4. Either the swap area is not writable, or it does not have free page slots. If the next swap area in the swap area list has the same priority as the current one, the function makes it the current one and goes to Step 3.
5. At this point, the next swap area in the swap area list has a lower priority than the previous one. The next step depends on which of the two passes the function is performing.
a. If this is the first (partial) pass, it considers the first swap area in the list and goes to Step 3, thus starting the second pass.
b. Otherwise, it checks if there is a next element in the list; if so, it considers it and goes to Step 3.
6. At this point the list is completely scanned by the second pass and no free page slot has been found; it returns 0.
The swap_free( ) function is invoked when swapping in a page to decrement the corresponding swap_map counter (see Table 16-1). When the counter reaches 0, the page slot becomes free since its identifier is no longer included in any Page Table entry. We'll see in the later section Section 16.3, however, that the swap cache counts as an owner of the page slot.
The function acts on a single entry parameter that specifies a swapped-out page identifier and performs the following steps:
1. Derives the swap area index and the offset page slot index from the entry parameter and gets the address of the swap area descriptor.
2. Checks whether the swap area is active and returns right away if it is not.
3. If the swap_map counter corresponding to the page slot being freed is smaller than SWAP_MAP_MAX, decrements it. Recall that entries that have the SWAP_MAP_MAX value are considered persistent (undeletable).
4. If the swap_map counter becomes 0, increments the value of nr_swap_pages and updates, if necessary, the lowest_bit and highest_bit fields of the swap area descriptor.