5.5 Examples of Race Condition Prevention

Kernel developers are expected to identify and solve the synchronization problems raised interleaving by kernel control paths. However, avoiding race conditions is a hard task because it requires a clear understanding of how the various components of the kernel interact. To give a feeling of what's really inside the kernel code, let's mention a few typical usages of the synchronization primitives defined in this chapter.

5.5.1 Reference Counters

Reference counters are widely used inside the kernel to avoid race conditions due to the concurrent allocation and releasing of a resource. A reference counter is just an atomic_t counter associated with a specific resource like a memory page, a module, or a file. The counter is atomically incremented when a kernel control path starts using the resource, and it is decremented when a kernel control path finishes using the resource. When the reference counter becomes zero, the resource is not being used, and it can be released if necessary.

5.5.2 The Global Kernel Lock

In earlier Linux kernel versions, a global kernel lock (also known as big kernel lock, or BKL) was widely used. In Version 2.0, this lock was a relatively crude spin lock, ensuring that only one processor at a time could run in Kernel Mode. The 2.2 kernel was considerably more flexible and no longer relied on a single spin lock; rather, a large number of kernel data structures were protected by specialized spin locks. The global kernel lock, on other hand, was still present because splitting a big lock into several smaller locks is not trivial — both deadlocks and race conditions must be carefully avoided. Several unrelated parts of the kernel code were still serialized by the global kernel lock.

Linux kernel Version 2.4 reduces still further the role of the global kernel lock. In the current stable version, the global kernel lock is mostly used to serialize accesses to the Virtual File System and avoid race conditions when loading and unloading kernel modules. The main progress with respect to the earlier stable version is that the networking transfers and file accessing (like reading or writing into a regular file) are no longer serialized by the global kernel lock.

The global kernel lock is a spin lock named kernel_flag. Every process descriptor includes a lock_depth field, which allows the same process to acquire the global kernel lock several times. Therefore, two consecutive requests for it will not hang the processor (as for normal spin locks). If the process does not want the lock, the field has the value -1. If the process wants it, the field value plus 1 specifies how many times the lock has been requested. The lock_depth field is crucial for interrupt handlers, exception handlers, and bottom halves. Without it, any asynchronous function that tries to get the global kernel lock could generate a deadlock if the current process already owns the lock.

The lock_kernel( ) and unlock_kernel( ) functions are used to get and release the global kernel lock. The former function is equivalent to:

if (++current->lock_depth == 0) 
    spin_lock(&kernel_flag); 

while the latter is equivalent to:

if (--current->lock_depth < 0) 
    spin_unlock(&kernel_flag); 

Notice that the if statements of the lock_kernel( ) and unlock_kernel( ) functions need not be executed atomically because lock_depth is not a global variable — each CPU addresses a field of its own current process descriptor. Local interrupts inside the if statements do not induce race conditions either. Even if the new kernel control path invokes lock_kernel( ), it must release the global kernel lock before terminating.

5.5.3 Memory Descriptor Read/Write Semaphore

Each memory descriptor of type mm_struct includes its own semaphore in the mmap_sem field (see Section 8.2). The semaphore protects the descriptor against race conditions that could arise because a memory descriptor can be shared among several lightweight processes.

For instance, let's suppose that the kernel must create or extend a memory region for some process; to do this, it invokes the do_mmap( ) function, which allocates a new vm_area_struct data structure. In doing so, the current process could be suspended if no free memory is available, and another process sharing the same memory descriptor could run. Without the semaphore, any operation of the second process that requires access to the memory descriptor (for instance, a Page Fault due to a Copy on Write) could lead to severe data corruption.

The semaphore is implemented as a read/write semaphore because some kernel functions, such as the Page Fault exception handler (see Section 8.4), need only to scan the memory descriptors.

5.5.4 Slab Cache List Semaphore

The list of slab cache descriptors (see Section 7.2.2) is protected by the cache_chain_sem semaphore, which grants an exclusive right to access and modify the list.

A race condition is possible when kmem_cache_create( ) adds a new element in the list, while kmem_cache_shrink( ) and kmem_cache_reap( ) sequentially scan the list. However, these functions are never invoked while handling an interrupt, and they can never block while accessing the list. Since the kernel is nonpreemptive, these functions cannot overlap on a uniprocessor system. However, this semaphore plays an active role in multiprocessor systems.

5.5.5 Inode Semaphore

As we shall see in Chapter 12, Linux stores the information on a disk file in a memory object called an inode. The corresponding data structure includes its own semaphore in the i_sem field.

A huge number of race conditions can occur during filesystem handling. Indeed, each file on disk is a resource held in common for all users, since all processes may (potentially) access the file content, change its name or location, destroy or duplicate it, and so on. For example, let's suppose that a process lists the files contained in some directory. Each disk operation is potentially blocking, and therefore even in uniprocessor systems, other processes could access the same directory and modify its content while the first process is in the middle of the listing operation. Or, again, two different processes could modify the same directory at the same time. All these race conditions are avoided by protecting the directory file with the inode semaphore.

Whenever a program uses two or more semaphores, the potential for deadlock is present because two different paths could end up waiting for each other to release a semaphore. Generally speaking, Linux has few problems with deadlocks on semaphore requests, since each kernel control path usually needs to acquire just one semaphore at a time. However, in a couple of cases, the kernel must get two semaphore locks. Inode semaphores are prone to this scenario; for instance, this occurs in the service routines of the rmdir( ) and the rename( ) system calls (notice that in both cases two different inodes are involved in the operation, so both semaphores must be taken). To avoid such deadlocks, semaphore requests are performed in address order: the semaphore request whose semaphore data structure is located at the lowest address is issued first.