2.5 Paging in Linux

As we explained earlier in Section 2.4.5, Linux adopted a three-level paging model so paging is feasible on 64-bit architectures. Figure 2-11 shows the model, which defines three types of paging tables.

·         Page Global Directory

·         Page Middle Directory

·         Page Table

The Page Global Directory includes the addresses of several Page Middle Directories, which in turn include the addresses of several Page Tables. Each Page Table entry points to a page frame. The linear address is thus split into four parts. Figure 2-11 does not show the bit numbers because the size of each part depends on the computer architecture.

Figure 2-11. The Linux paging model

figs/ULK2_0211.gif
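As an illustration of the four-part split, the following sketch extracts each field from a 32-bit linear address. The field widths (10 bits for the Global Directory, 0 for the Middle Directory, 10 for the Table, 12 for the Offset) are a hypothetical choice matching a non-PAE 80 x 86 with the Middle Directory collapsed, not values stated in this section:

```c
#include <assert.h>

/* Hypothetical field widths (non-PAE 80 x 86, Middle Directory collapsed);
 * on other architectures the widths differ. */
#define OFFSET_BITS 12
#define TABLE_BITS  10
#define MIDDLE_BITS  0   /* collapsed on two-level hardware */
#define GLOBAL_BITS 10

/* Extract each part of a linear address. */
static unsigned long offset_part(unsigned long addr) {
    return addr & ((1UL << OFFSET_BITS) - 1);
}
static unsigned long table_part(unsigned long addr) {
    return (addr >> OFFSET_BITS) & ((1UL << TABLE_BITS) - 1);
}
static unsigned long global_part(unsigned long addr) {
    return (addr >> (OFFSET_BITS + TABLE_BITS + MIDDLE_BITS))
           & ((1UL << GLOBAL_BITS) - 1);
}
```

With these widths, the linear address 0xc0101abc decomposes into Global Directory entry 768, Table entry 0x101, and Offset 0xabc.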

Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear addresses into physical ones makes the following design objectives feasible:

·         Assign a different physical address space to each process, ensuring an efficient protection against addressing errors.

·         Distinguish pages (groups of data) from page frames (physical addresses in main memory). This allows the same page to be stored in a page frame, then saved to disk and later reloaded in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 16).

As we shall see in Chapter 8, each process has its own Page Global Directory and its own set of Page Tables. When a process switch occurs (see Section 3.3), Linux saves the cr3 control register in the descriptor of the process previously in execution and then loads cr3 with the value stored in the descriptor of the process to be executed next. Thus, when the new process resumes its execution on the CPU, the paging unit refers to the correct set of Page Tables.

What happens when this three-level paging model is applied to the Pentium, which uses only two types of Page Tables? Linux essentially eliminates the Page Middle Directory field by saying that it contains zero bits. However, the position of the Page Middle Directory in the sequence of pointers is kept so that the same code can work on 32-bit and 64-bit architectures. The kernel keeps a position for the Page Middle Directory by setting the number of entries in it to 1 and mapping this single entry into the proper entry of the Page Global Directory.

However, when Linux uses the Physical Address Extension (PAE) mechanism of the Pentium Pro and later processors, Linux's Page Global Directory corresponds to the 80 x 86's Page Directory Pointer Table, the Page Middle Directory to the 80 x 86's Page Directory, and Linux's Page Table to the 80 x 86's Page Table.

Mapping logical to linear addresses now becomes a mechanical task, although it is still somewhat complex. The next few sections of this chapter are a rather tedious list of functions and macros that retrieve information the kernel needs to find addresses and manage the tables; most of the functions are one or two lines long. You may want to just skim these sections now, but it is useful to know the role of these functions and macros because you'll see them often in discussions throughout this book.

2.5.1 The Linear Address Fields

The following macros simplify Page Table handling:

PAGE_SHIFT

Specifies the length in bits of the Offset field; when applied to 80 x 86 processors, it yields the value 12. Since all the addresses in a page must fit in the Offset field, the size of a page on 80 x 86 systems is 2^12 or the familiar 4,096 bytes; the PAGE_SHIFT of 12 can thus be considered the base-2 logarithm of the total page size. This macro is used by PAGE_SIZE to return the size of the page. Finally, the PAGE_MASK macro yields the value 0xfffff000 and is used to mask all the bits of the Offset field.

PMD_SHIFT

The total length in bits of the Middle Directory and Table fields of a linear address; in other words, the logarithm of the size of the area a Page Middle Directory entry can map. The PMD_SIZE macro computes the size of the area mapped by a single entry of the Page Middle Directory — that is, of a Page Table. The PMD_MASK macro is used to mask all the bits of the Offset and Table fields.

When PAE is disabled, PMD_SHIFT yields the value 22 (12 from Offset plus 10 from Table), PMD_SIZE yields 2^22 or 4 MB, and PMD_MASK yields 0xffc00000. Conversely, when PAE is enabled, PMD_SHIFT yields the value 21 (12 from Offset plus 9 from Table), PMD_SIZE yields 2^21 or 2 MB, and PMD_MASK yields 0xffe00000.

PGDIR_SHIFT

Determines the logarithm of the size of the area a Page Global Directory entry can map. The PGDIR_SIZE macro computes the size of the area mapped by a single entry of the Page Global Directory. The PGDIR_MASK macro is used to mask all the bits of the Offset, Table, and Middle Dir fields.

When PAE is disabled, PGDIR_SHIFT yields the value 22 (the same value yielded by PMD_SHIFT), PGDIR_SIZE yields 2^22 or 4 MB, and PGDIR_MASK yields 0xffc00000. Conversely, when PAE is enabled, PGDIR_SHIFT yields the value 30 (12 from Offset plus 9 from Table plus 9 from Middle Dir), PGDIR_SIZE yields 2^30 or 1 GB, and PGDIR_MASK yields 0xc0000000.

PTRS_PER_PTE, PTRS_PER_PMD, and PTRS_PER_PGD

Compute the number of entries in the Page Table, Page Middle Directory, and Page Global Directory. They yield the values 1,024, 1, and 1,024, respectively, when PAE is disabled, and the values 4, 512, and 512, respectively, when PAE is enabled.
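The relationships among these macros can be verified with a short sketch; the definitions below are illustrative restatements of the non-PAE values quoted above, not the kernel's own headers:

```c
#include <assert.h>

/* Non-PAE 80 x 86 values, restated for illustration. */
#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT)       /* 4,096 bytes */
#define PAGE_MASK   (~(PAGE_SIZE - 1))

#define PMD_SHIFT   22                        /* 12 (Offset) + 10 (Table) */
#define PMD_SIZE    (1UL << PMD_SHIFT)        /* 4 MB */
#define PMD_MASK    (~(PMD_SIZE - 1))

#define PGDIR_SHIFT 22                        /* same as PMD_SHIFT without PAE */
#define PGDIR_SIZE  (1UL << PGDIR_SHIFT)
```

Note that each SIZE macro is just 1 shifted left by the corresponding SHIFT, and each MASK is the complement of SIZE - 1.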

2.5.2 Page Table Handling

pte_t, pmd_t, and pgd_t describe the format of, respectively, a Page Table, a Page Middle Directory, and a Page Global Directory entry. They are 32-bit data types, except for pte_t, which is a 64-bit data type when PAE is enabled and a 32-bit data type otherwise. pgprot_t is another 32-bit data type that represents the protection flags associated with a single entry.

Four type-conversion macros — __pte( ), __pmd( ), __pgd( ), and __pgprot( ) — cast an unsigned integer into the required type. Four other type-conversion macros — pte_val( ), pmd_val( ), pgd_val( ), and pgprot_val( ) — perform the reverse casting from one of the four previously mentioned specialized types into an unsigned integer.
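The following sketch shows how these types and macros fit together; the definitions are illustrative non-PAE versions, not the kernel's own (which live in architecture-specific headers):

```c
#include <assert.h>

/* Wrapping each entry in a one-field struct (as the kernel does) gives
 * type safety: a pte_t cannot be silently mixed up with a pgd_t. */
typedef struct { unsigned long pte_low; } pte_t;
typedef struct { unsigned long pgd; } pgd_t;
typedef struct { unsigned long pgprot; } pgprot_t;

/* Cast an unsigned integer into the specialized type... */
#define __pte(x)      ((pte_t) { (x) })
#define __pgd(x)      ((pgd_t) { (x) })
#define __pgprot(x)   ((pgprot_t) { (x) })
/* ...and back again. */
#define pte_val(x)    ((x).pte_low)
#define pgd_val(x)    ((x).pgd)
#define pgprot_val(x) ((x).pgprot)
```

Round-tripping a value through the pair of macros, e.g. pte_val(__pte(0x1007)), returns the original integer.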

The kernel also provides several macros and functions to read or modify Page Table entries:

·         The pte_none( ), pmd_none( ), and pgd_none( ) macros yield the value 1 if the corresponding entry has the value 0; otherwise, they yield the value 0.

·         The pte_present( ), pmd_present( ), and pgd_present( ) macros yield the value 1 if the Present flag of the corresponding entry is equal to 1 — that is, if the corresponding page or Page Table is loaded in main memory.

·         The pte_clear( ), pmd_clear( ), and pgd_clear( ) macros clear an entry of the corresponding Page Table, thus forbidding a process to use the linear addresses mapped by the Page Table entry.

The macros pmd_bad( ) and pgd_bad( ) are used by functions to check Page Global Directory and Page Middle Directory entries passed as input parameters. Each macro yields the value 1 if the entry points to a bad Page Table — that is, if at least one of the following conditions applies:

·         The page is not in main memory (Present flag cleared).

·         The page allows only Read access (Read/Write flag cleared).

·         Either the Accessed or the Dirty flag is cleared (Linux always forces these flags to be set for every existing Page Table).

No pte_bad( ) macro is defined because it is legal for a Page Table entry to refer to a page that is not present in main memory, not writable, or not accessible at all. Instead, several functions are offered to query the current value of any of the flags included in a Page Table entry:

pte_read( )

Returns the value of the User/Supervisor flag (indicating whether the page is accessible in User Mode).

pte_write( )

Returns 1 if the Read/Write flag is set (indicating whether the page is writable).

pte_exec( )

Returns the value of the User/Supervisor flag (indicating whether the page is accessible in User Mode). Notice that pages on the 80 x 86 processor cannot be protected against code execution.

pte_dirty( )

Returns the value of the Dirty flag (indicating whether the page has been modified).

pte_young( )

Returns the value of the Accessed flag (indicating whether the page has been accessed).
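Each of these query functions reduces to a simple bit test against the hardware-defined flag bits of an 80 x 86 Page Table entry. A minimal sketch (illustrative reimplementations, not the kernel's own code):

```c
#include <assert.h>

/* 80 x 86 hardware-defined flag bits of a Page Table entry. */
#define _PAGE_PRESENT  0x001
#define _PAGE_RW       0x002
#define _PAGE_USER     0x004
#define _PAGE_ACCESSED 0x020
#define _PAGE_DIRTY    0x040

typedef struct { unsigned long pte_low; } pte_t;

/* Each query is just a bit test on the entry. Note that pte_read( )
 * and pte_exec( ) both test User/Supervisor, since the 80 x 86 has
 * no execute protection. */
static int pte_present(pte_t pte) { return (pte.pte_low & _PAGE_PRESENT) != 0; }
static int pte_read(pte_t pte)    { return (pte.pte_low & _PAGE_USER) != 0; }
static int pte_write(pte_t pte)   { return (pte.pte_low & _PAGE_RW) != 0; }
static int pte_dirty(pte_t pte)   { return (pte.pte_low & _PAGE_DIRTY) != 0; }
static int pte_young(pte_t pte)   { return (pte.pte_low & _PAGE_ACCESSED) != 0; }
```

For example, an entry with value 0x67 (Present, Read/Write, User/Supervisor, Accessed, and Dirty all set) answers yes to every query.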

Another group of functions sets the value of the flags in a Page Table entry:

pte_wrprotect( )

Clears the Read/Write flag

pte_rdprotect( ) and pte_exprotect( )

Clear the User/Supervisor flag

pte_mkwrite( )

Sets the Read/Write flag

pte_mkread( ) and pte_mkexec( )

Set the User/Supervisor flag

pte_mkdirty( ) and pte_mkclean( )

Set the Dirty flag to 1 and to 0, respectively, marking the page as modified or unmodified

pte_mkyoung( ) and pte_mkold( )

Set the Accessed flag to 1 and to 0, respectively, marking the page as accessed (young) or nonaccessed (old)

pte_modify(p,v)

Sets all access rights in a Page Table entry p to a specified value v

set_pte, set_pmd, and set_pgd

Write a specified value into a Page Table, a Page Middle Directory, or a Page Global Directory entry, respectively

The ptep_set_wrprotect( ) and ptep_mkdirty( ) functions are similar to pte_wrprotect( ) and pte_mkdirty( ), respectively, except that they act on pointers to a Page Table entry. The ptep_test_and_clear_dirty( ) and ptep_test_and_clear_young( ) functions also act on pointers and are similar to pte_mkclean( ) and pte_mkold( ), respectively, except that they return the old value of the flag.

Now come the macros that combine a page address and a group of protection flags into a page entry or perform the reverse operation of extracting the page address from a Page Table entry:

mk_pte

Accepts a linear address and a group of access rights as arguments and creates a Page Table entry.

mk_pte_phys

Creates a Page Table entry by combining the physical address and the access rights of the page.

pte_page( )

Returns the address of the descriptor of the page frame referenced by a Page Table entry (see Section 7.1.1).

pmd_page( )

Returns the linear address of a Page Table from its Page Middle Directory entry.

pgd_offset(p,a)

Receives as parameters a memory descriptor p (see Chapter 8) and a linear address a. The macro yields the address of the entry in a Page Global Directory that corresponds to the address a; the Page Global Directory is found through a pointer within the memory descriptor p. The pgd_offset_k( ) macro is similar, except that it refers to the master kernel Page Tables (see Section 2.5.5 later in this chapter).

pmd_offset(p,a)

Receives as a parameter a Page Global Directory entry p and a linear address a; it yields the address of the entry corresponding to the address a in the Page Middle Directory referenced by p.

pte_offset(p,a)

Similar to pmd_offset, but p is a Page Middle Directory entry and the macro yields the address of the entry corresponding to a in the Page Table referenced by p.
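On a non-PAE 80 x 86, where the Page Middle Directory is collapsed, the index arithmetic behind these offset macros reduces to the following sketch (the real macros also dereference the directory entries, which is omitted here):

```c
#include <assert.h>

#define PAGE_SHIFT   12
#define PGDIR_SHIFT  22
#define PTRS_PER_PTE 1024
#define PTRS_PER_PGD 1024

/* Index of the Page Global Directory entry for linear address a. */
static unsigned long pgd_index(unsigned long a) {
    return (a >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}
/* Index of the Page Table entry for linear address a. */
static unsigned long pte_index(unsigned long a) {
    return (a >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}
```

A full walk composes the two: pgd_index( ) selects the directory entry, which yields a Page Table, and pte_index( ) selects the entry within it.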

The last group of functions of this long list were introduced to simplify the creation and deletion of Page Table entries.

When two-level paging is used (and PAE is disabled), creating or deleting a Page Middle Directory entry is trivial. As we explained earlier in this section, the Page Middle Directory contains a single entry that points to the subordinate Page Table. Thus, the Page Middle Directory entry is the entry within the Page Global Directory too. When dealing with Page Tables, however, creating an entry may be more complex because the Page Table that is supposed to contain it might not exist. In such cases, it is necessary to allocate a new page frame, fill it with zeros, and add the entry.

If PAE is enabled, the kernel uses three-level paging. When the kernel creates a new Page Global Directory, it also allocates the four corresponding Page Middle Directories; these are freed only when the parent Page Global Directory is released.

As we shall see in Section 7.1, the allocations and deallocations of page frames are expensive operations. Therefore, when the kernel destroys a Page Table, it makes sense to add the corresponding page frame to a suitable memory cache. Linux 2.4.18 already includes some functions and data structures, such as pte_quicklist or pgd_quicklist, to implement such cache; however, the code is not mature and the cache is not used yet.

Now comes the last round of functions and macros. As usual, we'll stick to the 80 x 86 architecture.

pgd_alloc(m)

Allocates a new Page Global Directory by invoking the get_pgd_slow( ) function. If PAE is enabled, the latter function also allocates the four child Page Middle Directories. The argument m (the address of a memory descriptor) is ignored on the 80 x 86 architecture.

pmd_alloc(m,p,a)

Defined so three-level paging systems can allocate a new Page Middle Directory for the linear address a. If PAE is not enabled, the function simply returns the input parameter p — that is, the address of the entry in the Page Global Directory. If PAE is enabled, the function returns the address of the Page Middle Directory that was allocated when the Page Global Directory was created. The argument m is ignored.

pte_alloc(m,p,a)

Receives as parameters the address of a Page Middle Directory entry p and a linear address a, and returns the address of the Page Table entry corresponding to a. If the Page Middle Directory entry is null, the function must allocate a new Page Table. The page frame is allocated by invoking pte_alloc_one( ). If a new Page Table is allocated, the entry corresponding to a is initialized and the User/Supervisor flag is set. The argument m is ignored.

pte_free( ) and pgd_free( )

Release a Page Table. The pmd_free( ) function does nothing, since Page Middle Directories are allocated and deallocated together with their parent Page Global Directory.

free_one_pmd( )

Invokes pte_free( ) to release a Page Table and sets the corresponding entry in the Page Middle Directory to NULL.

free_one_pgd( )

Releases all Page Tables of a Page Middle Directory by using free_one_pmd( ) repeatedly. Then it releases the Page Middle Directory by invoking pmd_free( ).

clear_page_tables( )

Clears the contents of the Page Tables of a process by iteratively invoking free_one_pgd( ).

2.5.3 Reserved Page Frames

The kernel's code and data structures are stored in a group of reserved page frames. A page contained in one of these page frames can never be dynamically assigned or swapped to disk.

As a general rule, the Linux kernel is installed in RAM starting from the physical address 0x00100000 — i.e., from the second megabyte. The total number of page frames required depends on how the kernel is configured. A typical configuration yields a kernel that can be loaded in less than 2 MB of RAM.

Why isn't the kernel loaded starting with the first available megabyte of RAM? Well, the PC architecture has several peculiarities that must be taken into account. For example:

·         Page frame 0 is used by the BIOS to store the system hardware configuration detected during the Power-On Self-Test (POST); moreover, the BIOS of many laptops writes data on this page frame even after the system is initialized.

·         Physical addresses ranging from 0x000a0000 to 0x000fffff are usually reserved for BIOS routines and to map the internal memory of ISA graphics cards. This area is the well-known hole from 640 KB to 1 MB in all IBM-compatible PCs: the physical addresses exist but they are reserved, and the corresponding page frames cannot be used by the operating system.

·         Additional page frames within the first megabyte may be reserved by specific computer models. For example, the IBM ThinkPad maps the 0xa0 page frame into the 0x9f one.

In the early stage of the boot sequence (see Appendix A), the kernel queries the BIOS and learns the size of the physical memory. In recent computers, the kernel also invokes a BIOS procedure to build a list of physical address ranges and their corresponding memory types.

Later, the kernel executes the setup_memory_region( ) function, which fills a table of physical memory regions, shown in Table 2-1. Of course, the kernel builds this table on the basis of the BIOS list, if this is available; otherwise the kernel builds the table following the conservative default setup. All page frames with numbers from 0x9f (LOWMEMSIZE( )) to 0x100 (HIGH_MEMORY) are marked as reserved.

Table 2-1. Example of BIOS-provided physical addresses map

Start        End          Type
0x00000000   0x0009ffff   Usable
0x000f0000   0x000fffff   Reserved
0x00100000   0x07feffff   Usable
0x07ff0000   0x07ff2fff   ACPI data
0x07ff3000   0x07ffffff   ACPI NVS
0xffff0000   0xffffffff   Reserved

A typical configuration for a computer having 128 MB of RAM is shown in Table 2-1. The physical address range from 0x07ff0000 to 0x07ff2fff stores information about the hardware devices of the system written by the BIOS in the POST phase; during the initialization phase, the kernel copies such information into a suitable kernel data structure and then considers these page frames usable. Conversely, the physical address range from 0x07ff3000 to 0x07ffffff is mapped to ROM chips of the hardware devices. The physical address range starting from 0xffff0000 is marked as reserved because it is mapped by the hardware to the BIOS's ROM chip (see Appendix A). Notice that the BIOS may not provide information for some physical address ranges (in the table, the range is 0x000a0000 to 0x000effff). To be on the safe side, Linux assumes that such ranges are not usable.

To avoid loading the kernel into groups of noncontiguous page frames, Linux prefers to skip the first megabyte of RAM. Clearly, page frames not reserved by the PC architecture will be used by Linux to store dynamically assigned pages.

Figure 2-12 shows how the first 2 MB of RAM are filled by Linux. We have assumed that the kernel requires less than one megabyte of RAM (this is a bit optimistic).

Figure 2-12. The first 512 page frames (2 MB) in Linux 2.4

figs/ULK2_0212.gif

The symbol _text, which corresponds to physical address 0x00100000, denotes the address of the first byte of kernel code. The end of the kernel code is similarly identified by the symbol _etext. Kernel data is divided into two groups: initialized and uninitialized. The initialized data starts right after _etext and ends at _edata. The uninitialized data follows and ends up at _end.

The symbols appearing in the figure are not defined in Linux source code; they are produced while compiling the kernel.[3]

[3] You can find the linear address of these symbols in the file System.map, which is created right after the kernel is compiled.

2.5.4 Process Page Tables

The linear address space of a process is divided into two parts:

·         Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the process is in either User or Kernel Mode.

·         Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process is in Kernel Mode.

When a process runs in User Mode, it issues linear addresses smaller than 0xc0000000; when it runs in Kernel Mode, it is executing kernel code and the linear addresses issued are greater than or equal to 0xc0000000. In some cases, however, the kernel must access the User Mode linear address space to retrieve or store data.

The PAGE_OFFSET macro yields the value 0xc0000000; this is the offset in the linear address space of a process where the kernel lives. In this book, we often refer directly to the number 0xc0000000 instead.

The content of the first entries of the Page Global Directory that map linear addresses lower than 0xc0000000 (the first 768 entries with PAE disabled) depends on the specific process. Conversely, the remaining entries should be the same for all processes and equal to the corresponding entries of the kernel master Page Global Directory (see the following section).
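The figure of 768 entries follows directly from the macros introduced earlier; a quick check (illustrative arithmetic only, not kernel code):

```c
#include <assert.h>

#define PAGE_OFFSET 0xc0000000UL
#define PGDIR_SHIFT 22   /* non-PAE: each PGD entry maps 4 MB */

/* Number of Page Global Directory entries that map User Mode
 * linear addresses (those below PAGE_OFFSET). */
static unsigned long user_pgd_entries(void) {
    return PAGE_OFFSET >> PGDIR_SHIFT;
}
```

The remaining 1024 - 768 = 256 entries cover the kernel's fourth gigabyte and are shared by all processes.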

2.5.5 Kernel Page Tables

The kernel maintains a set of Page Tables for its own use, rooted at a so-called master kernel Page Global Directory. After system initialization, this set of Page Tables is never directly used by any process or kernel thread; rather, the highest entries of the master kernel Page Global Directory are the reference model for the corresponding entries of the Page Global Directories of every regular process in the system.

We explain how the kernel ensures that changes to the master kernel Page Global Directory are propagated to the Page Global Directories that are actually used by the processes in the system in Section 8.4.5.

We now describe how the kernel initializes its own Page Tables. This is a two-phase activity. In fact, right after the kernel image is loaded into memory, the CPU is still running in real mode; thus, paging is not enabled.

In the first phase, the kernel creates a limited 8 MB address space, which is enough for it to install itself in RAM.

In the second phase, the kernel takes advantage of all of the existing RAM and sets up the paging tables properly. The next sections examine how this plan is executed.

2.5.5.1 Provisional kernel Page Tables

A provisional Page Global Directory is initialized statically during kernel compilation, while the provisional Page Tables are initialized by the startup_32( ) assembly language function defined in arch/i386/kernel/head.S. We won't bother mentioning the Page Middle Directories anymore since they are equated to Page Global Directory entries. PAE support is not enabled at this stage.

The Page Global Directory is contained in the swapper_pg_dir variable, while the two Page Tables that span the first 8 MB of RAM are contained in the pg0 and pg1 variables.

The objective of this first phase of paging is to allow these 8 MB to be easily addressed both in real mode and protected mode. Therefore, the kernel must create a mapping from both the linear addresses 0x00000000 through 0x007fffff and the linear addresses 0xc0000000 through 0xc07fffff into the physical addresses 0x00000000 through 0x007fffff. In other words, the kernel during its first phase of initialization can address the first 8 MB of RAM by either linear addresses identical to the physical ones or 8 MB worth of linear addresses, starting from 0xc0000000.

The kernel creates the desired mapping by filling all the swapper_pg_dir entries with zeroes, except for entries 0, 1, 0x300 (decimal 768), and 0x301 (decimal 769); the latter two entries span all linear addresses between 0xc0000000 and 0xc07fffff. The 0, 1, 0x300, and 0x301 entries are initialized as follows:

·         The address field of entries 0 and 0x300 is set to the physical address of pg0, while the address field of entries 1 and 0x301 is set to the physical address of pg1.

·         The Present, Read/Write, and User/Supervisor flags are set in all four entries.

·         The Accessed, Dirty, PCD, PWT, and Page Size flags are cleared in all four entries.

The startup_32( ) assembly language function also enables the paging unit. This is achieved by loading the physical address of swapper_pg_dir into the cr3 control register and by setting the PG flag of the cr0 control register, as shown in the following equivalent code fragment:

movl $swapper_pg_dir-0xc0000000,%eax 
movl %eax,%cr3        /* set the page table pointer.. */ 
movl %cr0,%eax 
orl $0x80000000,%eax 
movl %eax,%cr0        /* ..and set paging (PG) bit */ 
2.5.5.2 Final kernel Page Table when RAM size is less than 896 MB

The final mapping provided by the kernel Page Tables must transform linear addresses starting from 0xc0000000 into physical addresses starting from 0.

The __pa macro is used to convert a linear address starting from PAGE_OFFSET to the corresponding physical address, while the __va macro does the reverse.
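A sketch of the two macros under the direct mapping described here (illustrative; the real definitions live in the kernel's architecture-specific headers and apply only to directly mapped addresses):

```c
#include <assert.h>

#define PAGE_OFFSET 0xc0000000UL

/* Linear-to-physical and physical-to-linear conversion for the
 * directly mapped region only: the mapping is a fixed offset. */
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))
```

For instance, the kernel code loaded at physical address 0x00100000 is reached through the linear address 0xc0100000.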

The kernel master Page Global Directory is still stored in swapper_pg_dir. It is initialized by the paging_init( ) function, which does the following:

1.       Invokes pagetable_init( ) to set up the Page Table entries properly

2.       Writes the physical address of swapper_pg_dir in the cr3 control register

3.       Invokes flush_tlb_all( ) to invalidate all TLB entries

The actions performed by pagetable_init( ) depend on both the amount of RAM present and on the CPU model. Let's start with the simplest case. Our computer has less than 896 MB[4] of RAM, 32-bit physical addresses are sufficient to address all the available RAM, and there is no need to activate the PAE mechanism (see Section 2.4.6 earlier in this chapter).

[4] The highest 128 MB of linear addresses are left available for several kinds of mappings (see Section 2.5.6 later in this chapter and Section 7.3). The kernel address space left for mapping the RAM is thus 1 GB - 128 MB = 896 MB.

The swapper_pg_dir Page Global Directory is reinitialized by a cycle equivalent to the following:

pgd = swapper_pg_dir + 768;
address = 0xc0000000;
while (address < end) {
    pe = _PAGE_PRESENT + _PAGE_RW + _PAGE_ACCESSED +
         _PAGE_DIRTY + _PAGE_PSE + _PAGE_GLOBAL + __pa(address);
    set_pgd(pgd, __pgd(pe));
    ++pgd;
    address += 0x400000;
}

The end variable stores the linear address in the fourth gigabyte corresponding to the end of usable physical memory. We assume that the CPU is a recent 80 x 86 microprocessor supporting 4 MB pages and "global" TLB entries. Notice that the User/Supervisor flags in all Page Global Directory entries referencing linear addresses above 0xc0000000 are cleared, thus denying processes in User Mode access to the kernel address space.
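As a sanity check on the cycle above (illustrative arithmetic, not kernel code): each iteration maps one 4 MB page, so mapping the full 896 MB window fills 224 entries, from entry 768 up to entry 991 of swapper_pg_dir:

```c
#include <assert.h>

/* Each Page Global Directory entry with the Page Size flag set
 * maps a 4 MB page. */
static unsigned long pgd_entries_for(unsigned long ram_bytes) {
    return ram_bytes / (4UL << 20);
}
```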

The identity mapping of the first 8 MB of physical memory built by the startup_32( ) function is required to complete the initialization phase of the kernel. When this mapping is no longer necessary, the kernel clears the corresponding Page Table entries by invoking the zap_low_mappings( ) function.

Actually, this description does not state the whole truth. As we shall see in Section 2.5.6 later in this chapter, the kernel also adjusts the entries of Page Tables corresponding to the "fix-mapped linear addresses."

2.5.5.3 Final kernel Page Table when RAM size is between 896 MB and 4096 MB

In this case, the RAM cannot be mapped entirely into the kernel linear address space. The best Linux can do during the initialization phase is to map a RAM window having a size of 896 MB into the kernel linear address space. If a program needs to address other parts of the existing RAM, some other linear address interval must be mapped to the required RAM. This implies changing the value of some Page Table entries. We'll defer discussing how this kind of dynamic remapping is done until Chapter 7.

To initialize the Page Global Directory, the kernel uses the same code as in the previous case.

2.5.5.4 Final kernel Page Table when RAM size is more than 4096 MB

Let's now consider kernel Page Table initialization for computers with more than 4 GB of RAM; more precisely, we deal with cases in which all of the following hold:

·         The CPU model supports Physical Address Extension (PAE).

·         The amount of RAM is larger than 4 GB.

·         The kernel is compiled with PAE support.

Although PAE handles 36-bit physical addresses, linear addresses are still 32-bit addresses. As in the previous case, Linux maps a 896-MB RAM window into the kernel linear address space; the remaining RAM is left unmapped and handled by dynamic remapping, as described in Chapter 7. The main difference with the previous case is that a three-level paging model is used, so the Page Global Directory is initialized as follows:

for (i = 0; i < 3; i++)
    set_pgd(swapper_pg_dir + i, __pgd(1 + __pa(empty_zero_page)));
pgd = swapper_pg_dir + 3;
address = 0xc0000000;
set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
while (address < 0xe8000000) {
    pe = _PAGE_PRESENT + _PAGE_RW + _PAGE_ACCESSED +
         _PAGE_DIRTY + _PAGE_PSE + _PAGE_GLOBAL + __pa(address);
    set_pmd(pmd, __pmd(pe));
    pmd++;
    address += 0x200000;
}
pgd_base[0] = pgd_base[3];

The kernel initializes the first three entries in the Page Global Directory corresponding to the user linear address space with the address of an empty page (empty_zero_page). The fourth entry is initialized with the address of a Page Middle Directory (pmd). The first 448 entries in the Page Middle Directory (there are 512 entries, but the last 64 are reserved for noncontiguous memory allocation) are filled with the physical address of the first 896 MB of RAM.
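The figures quoted above can be checked with a bit of arithmetic (illustrative, not kernel code): with PAE and large pages, each Page Middle Directory entry maps 2 MB, so 448 of the 512 entries cover exactly the 896 MB window, while the 64 entries kept aside for noncontiguous memory allocation correspond to 128 MB:

```c
#include <assert.h>

/* With PAE and large pages enabled, each Page Middle Directory
 * entry maps a 2 MB page. */
static unsigned long pmd_entries_for(unsigned long bytes) {
    return bytes / (2UL << 20);
}
```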

Notice that all CPU models that support PAE also support large 2 MB pages and global pages. As in the previous case, whenever possible, Linux uses large pages to reduce the number of Page Tables.

2.5.6 Fix-Mapped Linear Addresses

We saw that the initial part of the fourth gigabyte of kernel linear addresses maps the physical memory of the system. However, at least 128 MB of linear addresses are always left available because the kernel uses them to implement noncontiguous memory allocation and fix-mapped linear addresses.

Noncontiguous memory allocation is just a special way to dynamically allocate and release pages of memory, and is described in Section 7.3. In this section, we focus on fix-mapped linear addresses.

Basically, a fix-mapped linear address is a constant linear address like 0xfffffdf0 whose corresponding physical address can be set up in an arbitrary way. Thus, each fix-mapped linear address maps one page frame of the physical memory.

Fix-mapped linear addresses are conceptually similar to the linear addresses that map the first 896 MB of RAM. However, a fix-mapped linear address can map any physical address, while the mapping established by the linear addresses in the initial portion of the fourth gigabyte is linear (linear address X maps physical address X-PAGE_OFFSET).

Compared with variable pointers, fix-mapped linear addresses are more efficient. In fact, dereferencing a variable pointer requires one memory access more than dereferencing an immediate constant address. Moreover, checking the value of a variable pointer before dereferencing it is good programming practice; conversely, the check is never required for a constant linear address.

Each fix-mapped linear address is represented by an integer index defined in the enum fixed_addresses data structure:

enum fixed_addresses {
    FIX_APIC_BASE,
    FIX_IO_APIC_BASE_0,
    [...]
    __end_of_fixed_addresses
};

Fix-mapped linear addresses are placed at the end of the fourth gigabyte of linear addresses. The fix_to_virt( ) function computes the constant linear address starting from the index:

inline unsigned long fix_to_virt(const unsigned int idx)
{
    if (idx >= __end_of_fixed_addresses)
        __this_fixmap_does_not_exist( );
    return (0xffffe000UL - (idx << PAGE_SHIFT));
}

Let's assume that some kernel function invokes fix_to_virt(FIX_IO_APIC_BASE_0). Since the function is declared as "inline," the C compiler does not actually invoke fix_to_virt( ), but just inserts its code in the calling function; moreover, the check on the index value is never performed at runtime. In fact, FIX_IO_APIC_BASE_0 is a constant, so the compiler can cut away the if statement because its condition is false at compile time. Conversely, if the condition is true or the argument of fix_to_virt( ) is not a constant, an error occurs during the linking phase because the symbol __this_fixmap_does_not_exist is not defined anywhere. Eventually, the compiler computes 0xffffe000-(1<<PAGE_SHIFT) and replaces the fix_to_virt( ) function call with the constant linear address 0xffffd000.
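The compile-time arithmetic can be reproduced directly (a sketch of the same computation, performed outside the kernel):

```c
#include <assert.h>

#define PAGE_SHIFT 12

/* Same computation performed by fix_to_virt( ); the index check is
 * omitted here since the kernel resolves it at compile time. */
static unsigned long fix_to_virt_sketch(unsigned int idx) {
    return 0xffffe000UL - ((unsigned long)idx << PAGE_SHIFT);
}
```

Index 0 yields 0xffffe000 and index 1 yields 0xffffd000, matching the constant derived in the text.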

To associate a physical address with a fix-mapped linear address, the kernel uses the set_fixmap(idx,phys) and set_fixmap_nocache(idx,phys) functions. Both of them initialize the Page Table entry corresponding to the fix_to_virt(idx) linear address with the physical address phys; however, the second function also sets the PCD flag of the Page Table entry, thus disabling the hardware cache when accessing the data in the page frame (see Section 2.4.7 earlier in this chapter).

2.5.7 Handling the Hardware Cache and the TLB

Hardware caches and Translation Lookaside Buffers play a crucial role in boosting the performance of modern computer architectures. Several techniques are used by kernel developers to reduce the number of cache and TLB misses.

2.5.7.1 Handling the hardware cache

As mentioned earlier in this chapter, hardware caches are addressed by cache lines. The L1_CACHE_BYTES macro yields the size of a cache line in bytes. On Intel models earlier than the Pentium 4, the macro yields the value 32; on a Pentium 4, it yields the value 128.

To optimize the cache hit rate, the kernel considers the architecture in making the following decisions.

·         The most frequently used fields of a data structure are placed at low offsets within the data structure, so they can be cached in the same line.

·         When allocating a large set of data structures, the kernel tries to store each of them in memory so that all cache lines are used uniformly.

·         When performing a process switch, the kernel has a small preference for processes that use the same set of Page Tables as the previously running process (see Section 11.2.2).

2.5.7.2 Handling the TLB

As a general rule, any process switch implies changing the set of active Page Tables. Local TLB entries relative to the old Page Tables must be flushed; this is done automatically when the kernel writes the address of the new Page Global Directory into the cr3 control register. In some cases, however, the kernel succeeds in avoiding a TLB flush; these cases are listed here:

·         When performing a process switch between two regular processes that use the same set of Page Tables (see Section 11.2.2).

·         When performing a process switch between a regular process and a kernel thread. In fact, we'll see in Section 8.2.1 that kernel threads do not have their own set of Page Tables; rather, they use the set of Page Tables owned by the regular process that was scheduled last for execution on the CPU.

Besides process switches, there are other cases in which the kernel needs to flush some entries in a TLB. For instance, when the kernel assigns a page frame to a User Mode process and stores its physical address into a Page Table entry, it must flush any local TLB entry that refers to the corresponding linear address. On multiprocessor systems, the kernel must also flush the same TLB entry on the CPUs that are using the same set of Page Tables, if any.

To invalidate TLB entries, the kernel uses the following functions and macros:

__flush_tlb_one

Invalidates the local TLB entry for the page that includes the specified address.

flush_tlb_page

Invalidates, on all CPUs, the TLB entries for the page that includes the specified address. To do this, the kernel sends an Interprocessor Interrupt to the other CPUs (see Section 4.6.2).

local_flush_tlb and __flush_tlb

Flush the local TLB entries relative to all pages of the current process. To do this, the current value of the cr3 register is rewritten back into it. On Pentium Pro and later processors, only TLB entries of nonglobal pages (pages whose Global flag is clear) are invalidated.

flush_tlb

Flushes the TLB entries relative to the nonglobal pages of all current processes. Essentially, all CPUs receive an Interprocessor Interrupt that forces them to execute __flush_tlb.

flush_tlb_mm

Flushes the TLB entries relative to all nonglobal pages in a specified set of Page Tables (see Section 8.2). As we shall see in the next chapter, on multiprocessor systems, two or more CPUs might be executing processes that share the same set of Page Tables. On the 80 x 86 architecture, this function forces every CPU to invalidate all local TLB entries relative to nonglobal pages of the specified set of Page Tables.

flush_tlb_range

Invalidates all TLB entries of the nonglobal pages in a specified address range of a given set of Page Tables. On the 80 x 86 architecture, this macro is equivalent to flush_tlb_mm.

__flush_tlb_all

Flushes all local TLB entries (regardless of the Global flag settings of the Page Table entries in Pentium Pro and later processors). To do this, the kernel temporarily clears the PGE flag in cr4 and then writes into the cr3 register.

flush_tlb_all

Flushes all TLB entries in all CPUs (regardless of the Global flag settings of the Page Table entries in Pentium Pro and later processors). Essentially, all CPUs receive an Interprocessor Interrupt that forces them to execute __flush_tlb_all.

To avoid useless TLB flushing in multiprocessor systems, the kernel uses a technique called lazy TLB mode. The basic idea is that if several CPUs are using the same Page Tables and a TLB entry must be flushed on all of them, then the flush may, in some cases, be delayed on CPUs running kernel threads.

In fact, a kernel thread does not have its own set of Page Tables; rather, it makes use of the set of Page Tables belonging to a regular process. However, there is no need to invalidate a TLB entry that refers to a User Mode linear address, because no kernel thread accesses the User Mode address space.[5]

[5] By the way, the flush_tlb_all macro does not use the lazy TLB mode mechanism; it is usually invoked whenever the kernel modifies a Page Table entry relative to the Kernel Mode address space.

When some CPU starts running a kernel thread, the kernel sets it into lazy TLB mode. When requests are issued to clear some TLB entries, a CPU in lazy TLB mode does not flush the corresponding entries; however, it remembers that its current process is running on a set of Page Tables whose TLB entries for the User Mode addresses are invalid. As soon as the CPU in lazy TLB mode switches to a regular process with a different set of Page Tables, the hardware automatically flushes the TLB entries, and the kernel sets the CPU back in nonlazy TLB mode. However, if a CPU in lazy TLB mode switches to a regular process that owns the same set of Page Tables used by the previously running kernel thread, then any deferred TLB invalidation must be applied explicitly by the kernel, which does so by flushing all nonglobal TLB entries of the CPU.

Some extra data structures are needed to implement the lazy TLB mode. The cpu_tlbstate variable is a static array of NR_CPUS structures (one for every CPU in the system) consisting of an active_mm field pointing to the memory descriptor of the current process (see Chapter 8) and a state flag that can assume only two values: TLBSTATE_OK (non-lazy TLB mode) or TLBSTATE_LAZY (lazy TLB mode). Furthermore, each memory descriptor includes a cpu_vm_mask field that stores the indices of the CPUs that should receive Interprocessor Interrupts related to TLB flushing; this field is meaningful only when the memory descriptor belongs to a process currently in execution.

When a CPU starts executing a kernel thread, the kernel sets the state field of its cpu_tlbstate element to TLBSTATE_LAZY; moreover, the cpu_vm_mask field of the active memory descriptor stores the indices of all CPUs in the system, including the one that is entering lazy TLB mode. When another CPU wants to invalidate the TLB entries of all CPUs relative to a given set of Page Tables, it delivers an Interprocessor Interrupt to all CPUs whose indices are included in the cpu_vm_mask field of the corresponding memory descriptor.

When a CPU receives an Interprocessor Interrupt related to TLB flushing and verifies that it affects the set of Page Tables of its current process, it checks whether the state field of its cpu_tlbstate element is equal to TLBSTATE_LAZY. If so, the kernel refuses to invalidate the TLB entries and removes the CPU index from the cpu_vm_mask field of the memory descriptor. This has two consequences:

·         As long as the CPU remains in lazy TLB mode, it will not receive further Interprocessor Interrupts related to TLB flushing.

·         If the CPU switches to another process that is using the same set of Page Tables as the kernel thread that is being replaced, the kernel invokes local_flush_tlb to invalidate all nonglobal TLB entries of the CPU.