7.3 Noncontiguous Memory Area Management

We already know that it is preferable to map memory areas into sets of contiguous page frames, thus making better use of the cache and achieving lower average memory access times. Nevertheless, if the requests for memory areas are infrequent, it makes sense to consider an allocation schema based on noncontiguous page frames accessed through contiguous linear addresses. The main advantage of this schema is to avoid external fragmentation, while the disadvantage is that it is necessary to fiddle with the kernel Page Tables. Clearly, the size of a noncontiguous memory area must be a multiple of 4,096. Linux uses noncontiguous memory areas in several ways — for instance, to allocate data structures for active swap areas (see Section 16.2.3), to allocate space for a module (see Appendix B), or to allocate buffers to some I/O drivers.
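
For instance, a driver might use the facility as follows (a minimal sketch; the 64-KB size and the surrounding error handling are invented for the example):

unsigned char * buf; 

buf = vmalloc(64 * 1024);  /* 16 page frames, not necessarily contiguous */ 
if (!buf) 
    return -ENOMEM;  /* no free linear address range, or no page frames left */ 
/* buf can now be dereferenced like any directly mapped kernel buffer */ 
vfree(buf); 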

7.3.1 Linear Addresses of Noncontiguous Memory Areas

To find a free range of linear addresses, we can look in the area starting from PAGE_OFFSET (usually 0xc0000000, the beginning of the fourth gigabyte). Figure 7-7 shows how the fourth gigabyte linear addresses are used:

·         The beginning of the area includes the linear addresses that map the first 896 MB of RAM (see Section 2.5.4); the linear address that corresponds to the end of the directly mapped physical memory is stored in the high_memory variable.

·         The end of the area contains the fix-mapped linear addresses (see Section 2.5.6).

·         Starting from PKMAP_BASE (0xfe000000), we find the linear addresses used for the persistent kernel mapping of high-memory page frames (see Section 7.1.6 earlier in this chapter).

·         The remaining linear addresses can be used for noncontiguous memory areas. A safety interval of size 8 MB (macro VMALLOC_OFFSET) is inserted between the end of the physical memory mapping and the first memory area; its purpose is to "capture" out-of-bounds memory accesses. For the same reason, additional safety intervals of size 4 KB are inserted to separate noncontiguous memory areas.

Figure 7-7. The linear address interval starting from PAGE_OFFSET

The VMALLOC_START macro defines the starting address of the linear space reserved for noncontiguous memory areas, while VMALLOC_END defines its ending address.
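
On the 80x86, the two macros are defined essentially as follows (a sketch of the 2.4 i386 definitions; VMALLOC_END depends on whether high-memory support is compiled in):

#define VMALLOC_OFFSET (8 * 1024 * 1024)  /* the 8 MB safety interval */ 
#define VMALLOC_START (((unsigned long) high_memory + \ 
                        2 * VMALLOC_OFFSET - 1) & ~(VMALLOC_OFFSET - 1)) 
#ifdef CONFIG_HIGHMEM 
# define VMALLOC_END (PKMAP_BASE - 2 * PAGE_SIZE) 
#else 
# define VMALLOC_END (FIXADDR_START - 2 * PAGE_SIZE) 
#endif 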

7.3.2 Descriptors of Noncontiguous Memory Areas

Each noncontiguous memory area is associated with a descriptor of type struct vm_struct:

struct vm_struct { 
    unsigned long flags; 
    void * addr; 
    unsigned long size; 
    struct vm_struct * next; 
}; 

These descriptors are inserted in a simple list by means of the next field; the address of the first element of the list is stored in the vmlist variable. Accesses to this list are protected by means of the vmlist_lock read/write spin lock. The addr field contains the linear address of the first memory cell of the area; the size field contains the size of the area plus 4,096 (which is the size of the previously mentioned inter-area safety interval).
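
As an illustration, a hypothetical debugging function (not part of the kernel) could traverse the list as follows:

void dump_vm_areas(void)  /* hypothetical helper */ 
{ 
    struct vm_struct * tmp; 
    read_lock(&vmlist_lock); 
    for (tmp = vmlist; tmp; tmp = tmp->next) 
        printk("area at %p, %lu bytes (4,096 of them are the guard)\n", 
               tmp->addr, tmp->size); 
    read_unlock(&vmlist_lock); 
} 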

The get_vm_area( ) function creates new descriptors of type struct vm_struct; its parameter size specifies the size of the new memory area. The function is essentially equivalent to the following:

struct vm_struct * get_vm_area(unsigned long size, unsigned long flags) 
{ 
    unsigned long addr; 
    struct vm_struct **p, *tmp, *area; 
    area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL); 
    if (!area) 
        return NULL; 
    size += PAGE_SIZE;
    addr = VMALLOC_START;
    write_lock(&vmlist_lock); 
    for (p = &vmlist; (tmp = *p) ; p = &tmp->next) { 
        if (size + addr <= (unsigned long) tmp->addr) 
            break; 
        addr = tmp->size + (unsigned long) tmp->addr; 
        if (addr + size > VMALLOC_END) { 
            write_unlock(&vmlist_lock); 
            kfree(area); 
            return NULL; 
        } 
    } 
    area->flags = flags; 
    area->addr = (void *) addr; 
    area->size = size; 
    area->next = *p; 
    *p = area; 
    write_unlock(&vmlist_lock); 
    return area; 
} 

The function first calls kmalloc( ) to obtain a memory area for the new descriptor. It then scans the list of descriptors of type struct vm_struct looking for a free range of linear addresses that includes at least size+4096 addresses; if the scan reaches the end of the list without exhausting the address space, the new area is simply appended to it. If a suitable range exists, the function initializes the fields of the descriptor, links it into the list, and terminates by returning the address of the descriptor (its addr field holds the initial linear address of the new memory area). Otherwise, when addr + size exceeds VMALLOC_END, get_vm_area( ) releases the descriptor and returns NULL.
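
For example (all addresses invented), suppose the list contains a single area whose addr field is 0xc9000000 and whose size field is 20,480 (16 KB plus the 4-KB guard), and that get_vm_area( ) is invoked with size equal to 8,192. After the guard page is added, size becomes 12,288, and the scan proceeds as follows:

/* addr = VMALLOC_START = 0xc9000000 (invented value) 
 * 0xc9000000 + 12288 <= 0xc9000000 ?  no, the existing area is in the way 
 * addr = 0xc9000000 + 20480 = 0xc9005000 
 * end of list: the new area is linked in at linear address 0xc9005000 
 */ 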

7.3.3 Allocating a Noncontiguous Memory Area

The vmalloc( ) function allocates a noncontiguous memory area to the kernel. The parameter size denotes the size of the requested area. If the function is able to satisfy the request, it then returns the initial linear address of the new area; otherwise, it returns a NULL pointer:

void * vmalloc(unsigned long size) 
{ 
    void * addr; 
    struct vm_struct *area; 
    size = (size + PAGE_SIZE - 1) & PAGE_MASK; 
    area = get_vm_area(size, VM_ALLOC); 
    if (!area) 
        return NULL; 
    addr = area->addr; 
    if (vmalloc_area_pages((unsigned long) addr, size, 
                           GFP_KERNEL|__GFP_HIGHMEM, 0x63)) { 
        vfree(addr); 
        return NULL; 
    } 
    return addr; 
} 

The function starts by rounding up the value of the size parameter to a multiple of 4,096 (the page frame size). Then vmalloc( ) invokes get_vm_area( ), which creates a new descriptor; the addr field of the descriptor yields the initial linear address assigned to the memory area. The flags field of the descriptor is initialized with the VM_ALLOC flag, which means that the linear address range is going to be used for a noncontiguous memory allocation (we'll see in Chapter 13 that vm_struct descriptors are also used to remap memory on hardware devices). Then the vmalloc( ) function invokes vmalloc_area_pages( ) to request noncontiguous page frames and terminates by returning the initial linear address of the noncontiguous memory area.

The vmalloc_area_pages( ) function uses four parameters:

address

The initial linear address of the area.

size

The size of the area.

gfp_mask

The allocation flags passed to the buddy system allocator function. It is always set to GFP_KERNEL|__GFP_HIGHMEM.

prot

The protection bits of the allocated page frames. It is always set to 0x63, which corresponds to Present, Accessed, Read/Write, and Dirty.
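
The value 0x63 is just the OR of the corresponding page-table flags; using the names defined in include/asm-i386/pgtable.h:

/* 0x63 == _PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY 
 *          (0x001)        (0x002)    (0x020)          (0x040)     */ 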

The function starts by assigning the linear address of the end of the area to the end local variable:

end = address + size; 

The function then uses the pgd_offset_k macro to derive the entry in the master kernel Page Global Directory related to the initial linear address of the area; it then acquires the kernel Page Table spin lock:

dir = pgd_offset_k(address);
spin_lock(&init_mm.page_table_lock); 

The function then executes the following cycle:

while (address < end) { 
    pmd_t *pmd = pmd_alloc(&init_mm, dir, address); 
    ret = -ENOMEM;
    if (!pmd) 
        break; 
    if (alloc_area_pmd(pmd, address, end - address, gfp_mask, prot)) 
        break; 
    address = (address + PGDIR_SIZE) & PGDIR_MASK; 
    dir++; 
    ret = 0;
} 
spin_unlock(&init_mm.page_table_lock);
return ret;

In each cycle, it first invokes pmd_alloc( ) to create a Page Middle Directory for the new area and writes its physical address in the right entry of the kernel Page Global Directory. It then calls alloc_area_pmd( ) to allocate all the Page Tables associated with the new Page Middle Directory. Finally, it adds the constant 2^22 (the size of the range of linear addresses spanned by a single Page Middle Directory, that is, 4 MB) to the current value of address and advances dir to the next entry of the Page Global Directory.

The cycle is repeated until all Page Table entries referring to the noncontiguous memory area are set up.
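
For instance (invented addresses), mapping a 10-MB area starting at the 4-MB-aligned linear address 0xc9000000 takes three passes through the cycle:

/* end = 0xc9000000 + 0xa00000 = 0xc9a00000 
 * 1st pass: address = 0xc9000000, then rounded up to 0xc9400000 
 * 2nd pass: address = 0xc9400000, then 0xc9800000 
 * 3rd pass: address = 0xc9800000 (maps the last 2 MB), then 0xc9c00000 
 * 0xc9c00000 >= end, so the cycle terminates 
 */ 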

The alloc_area_pmd( ) function executes a similar cycle for all the Page Tables that a Page Middle Directory points to:

while (address < end) { 
    pte_t * pte = pte_alloc(&init_mm, pmd, address); 
    if (!pte) 
        return -ENOMEM; 
    if (alloc_area_pte(pte, address, end - address, gfp_mask, prot)) 
        return -ENOMEM; 
    address = (address + PMD_SIZE) & PMD_MASK; 
    pmd++; 
} 

The pte_alloc( ) function (see Section 2.5.2) allocates a new Page Table and updates the corresponding entry in the Page Middle Directory. Next, alloc_area_pte( ) allocates all the page frames corresponding to the entries in the Page Table. The value of address is increased by 2^22 (the size of the linear address interval spanned by a single Page Table, again 4 MB), and the cycle is repeated.

The main cycle of alloc_area_pte( ) is:

while (address < end) { 
    struct page * page; 
    spin_unlock(&init_mm.page_table_lock);
    page = alloc_page(gfp_mask);
    spin_lock(&init_mm.page_table_lock);
    if (!page) 
        return -ENOMEM; 
    set_pte(pte, mk_pte(page, prot)); 
    address += PAGE_SIZE; 
    pte++; 
} 

Each page frame is allocated through alloc_page( ). The physical address of the new page frame is written into the Page Table by the set_pte and mk_pte macros. The cycle is repeated after adding the constant 4,096 (the length of a page frame) to address.
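
On the 80x86 (without PAE), mk_pte essentially combines the page frame number with the protection bits; a sketch of the i386 definition:

/* The Page Table entry holds the physical address of the frame 
 * OR'ed with the protection bits.                              */ 
#define mk_pte(page, pgprot) \ 
    __pte((((page) - mem_map) << PAGE_SHIFT) | pgprot_val(pgprot)) 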

Notice that the Page Tables of the current process are not touched by vmalloc_area_pages( ). Therefore, when a process in Kernel Mode accesses the noncontiguous memory area, a Page Fault occurs, since the entries in the process's Page Tables corresponding to the area are null. However, the Page Fault handler checks the faulty linear address against the master kernel Page Tables (which are init_mm.pgd Page Global Directory and its child Page Tables; see Section 2.5.5). Once the handler discovers that a master kernel Page Table includes a non-null entry for the address, it copies its value into the corresponding process's Page Table entry and resumes normal execution of the process. This mechanism is described in Section 8.4.
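
A condensed sketch of this fix-up, modeled on the i386 Page Fault handler (the error paths and the Page Middle Directory level are simplified away):

int offset = __pgd_offset(address);              /* index of the PGD entry */ 
pgd_t * pgd = current->active_mm->pgd + offset;  /* process's entry        */ 
pgd_t * pgd_k = init_mm.pgd + offset;            /* master kernel's entry  */ 
if (!pgd_present(*pgd_k)) 
    goto bad_area;       /* no master entry either: the access is a bug    */ 
set_pgd(pgd, *pgd_k);    /* copy the master entry and resume the process   */ 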

7.3.4 Releasing a Noncontiguous Memory Area

The vfree( ) function releases noncontiguous memory areas. Its parameter addr contains the initial linear address of the area to be released. vfree( ) first scans the list pointed to by vmlist to find the address of the area descriptor associated with the area to be released:

write_lock(&vmlist_lock);
for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) { 
    if (tmp->addr == addr) { 
        *p = tmp->next; 
        vmfree_area_pages((unsigned long)(tmp->addr), tmp->size); 
        write_unlock(&vmlist_lock);
        kfree(tmp); 
        return; 
    } 
} 
write_unlock(&vmlist_lock);
printk("Trying to vfree(  ) nonexistent vm area (%p)\n", addr); 

The size field of the descriptor specifies the size of the area to be released. The area itself is released by invoking vmfree_area_pages( ), while the descriptor is released by invoking kfree( ).

The vmfree_area_pages( ) function takes two parameters: the initial linear address and the size of the area. It executes the following cycle to reverse the actions performed by vmalloc_area_pages( ):

dir = pgd_offset_k(address);
while (address < end) { 
    free_area_pmd(dir, address, end - address); 
    address = (address + PGDIR_SIZE) & PGDIR_MASK; 
    dir++; 
} 

In turn, free_area_pmd( ) reverses the actions of alloc_area_pmd( ) in the cycle:

while (address < end) { 
    free_area_pte(pmd, address, end - address); 
    address = (address + PMD_SIZE) & PMD_MASK; 
    pmd++; 
} 

Again, free_area_pte( ) reverses the activity of alloc_area_pte( ) in the cycle:

while (address < end) { 
    pte_t page = *pte; 
    pte_clear(pte); 
    address += PAGE_SIZE; 
    pte++; 
    if (pte_none(page)) 
        continue; 
    if (pte_present(page)) { 
        __free_page(pte_page(page)); 
        continue; 
    } 
    printk("Whee... Swapped out page in kernel page table\n"); 
} 

Each page frame assigned to the noncontiguous memory area is released by means of the buddy system __free_page( ) function. The corresponding entry in the Page Table is set to 0 by the pte_clear macro.

As for vmalloc( ), the kernel modifies the entries of the master kernel Page Global Directory and its child Page Tables (see Section 2.5.5), but it leaves unchanged the entries of the process Page Tables mapping the fourth gigabyte. This is fine because the kernel never reclaims Page Middle Directories and Page Tables rooted at the master kernel Page Global Directory.

For instance, suppose that a process in Kernel Mode accessed a noncontiguous memory area that was later released. The process's Page Global Directory entries are equal to the corresponding entries of the master kernel Page Global Directory, thanks to the mechanism explained in Section 8.4; they point to the same Page Middle Directories and Page Tables. The vmfree_area_pages( ) function clears only the entries of the Page Tables (without reclaiming the Page Tables themselves). Any further access by the process to the released noncontiguous memory area triggers a Page Fault because of the null Page Table entries. This time, however, the handler considers the access a bug, because the master kernel Page Tables do not include valid entries.