We mentioned earlier in Section 4.6 that several tasks among those executed by the kernel are not critical: they can be deferred for a long period of time, if necessary. Remember that the interrupt service routines of an interrupt handler are serialized, and often there should be no occurrence of an interrupt until the corresponding interrupt handler has terminated. Conversely, the deferrable tasks can execute with all interrupts enabled. Taking them out of the interrupt handler helps keep kernel response time small. This is a very important property for many time-critical applications that expect their interrupt requests to be serviced in a few milliseconds.
Linux 2.4 answers such a challenge by using three kinds of deferrable and interruptible kernel functions (in short, deferrable functions[13]): softirqs, tasklets, and bottom halves. Although these three kinds of deferrable functions work in different ways, they are strictly correlated. Tasklets are implemented on top of softirqs, and bottom halves are implemented by means of tasklets. As a matter of fact, the term "softirq," which appears in the kernel source code, often denotes all kinds of deferrable functions.
[13] These are also called software interrupts, but we denote them as "deferrable functions" to avoid confusion with programmed exceptions, which are referred to as "software interrupts" in Intel manuals.
As a general rule, no softirq can be interrupted to run another softirq on the same CPU; the same rule holds for tasklets and bottom halves built on top of softirqs. On a multiprocessor system, however, several deferrable functions can run concurrently on different CPUs. The degree of concurrency depends on the type of deferrable function, as shown in Table 4-6.
Softirqs and bottom halves are statically allocated (i.e., defined at compile time), while tasklets can also be allocated and initialized at runtime (for instance, when loading a kernel module).
Many softirqs can always be executed concurrently on several CPUs, even if they are of the same type. Generally speaking, softirqs are re-entrant functions and must explicitly protect their data structures with spin locks.
Tasklets differ from softirqs because a tasklet is always serialized with respect to itself; in other words, a tasklet cannot be executed by two CPUs at the same time. However, different tasklets can be executed concurrently on several CPUs. Serializing the tasklet simplifies the life of device driver developers, since the tasklet function needs not to be re-entrant.
Finally, bottom halves are globally serialized. When one bottom half is in execution on some CPU, the other CPUs cannot execute any bottom half, even if it is of different type. This is a quite strong limitation, since it degrades the performances of the Linux kernel on multiprocessor systems. As a matter of fact, bottom halves continue to be supported by the kernel for compatibility reasons only, and device driver developers are expected to update their old drivers and replace bottom halves with tasklets. Therefore, it is likely that bottom halves will disappear in a future version of Linux.
In any case, deferrable functions must be executed serially. Any deferrable function cannot be interleaved with other deferrable functions on the same CPU.
Generally speaking, four kinds of operations can be performed on deferrable functions:
Defines a new deferrable function; this operation is usually done when the kernel initializes itself.
Activation
Marks a deferrable function as "pending" — to be run in the next round of executions of the deferrable functions. Activation can be done at any time (even while handling interrupts).
Masking
Selectively disables a deferrable function in such a way that it will not be executed by the kernel even if activated. We'll see in Section 5.3.11 that disabling deferrable functions is sometimes essential.
Execution
Executes a pending deferrable function together with all other pending deferrable functions of the same type; execution is performed at well-specified times, explained later in Section 4.7.1.
Activation and execution are somehow bound together: a deferrable function that has been activated by a given CPU must be executed on the same CPU. There is no self-evident reason suggesting that this rule is beneficial for system performances. Binding the deferrable function to the activating CPU could in theory make better use of the CPU hardware cache. After all, it is conceivable that the activating kernel thread accesses some data structures that will also be used by the deferrable function. However, the relevant lines could easily be no longer in the cache when the deferrable function is run because its execution can be delayed a long time. Moreover, binding a function to a CPU is always a potentially "dangerous" operation, since a CPU might end up very busy while the others are mostly idle.
Linux 2.4 uses a limited number of softirqs. For most purposes, tasklets are good enough and are much easier to write because they do not need to be re-entrant.
As a matter of fact, only the four kinds of softirqs listed in Table 4-7 are currently defined.
The index of a sofirq determines its priority: a lower index means higher priority because softirq functions will be executed starting from index 0.
The main data structure used to represent softirqs is the softirq_vec array, which includes 32 elements of type softirq_action. The priority of a softirq is the index of the corresponding softirq_action element inside the array. As shown in Table 4-7, only the first four entries of the array are effectively used. The softirq_action data structure consists of two fields: a pointer to the softirq function and a pointer to a generic data structure that may be needed by the softirq function.
The irq_stat array, already introduced in Section 4.6.1.2, includes several fields used by the kernel to implements softirqs (and also tasklets and bottom halves, which depend on softirqs). Each element of the array, corresponding to a given CPU, includes:
· A _ _softirq_pending field that points to a softirq_action structure (the pending softirq). This field may easily be accessed through the softirq_pending macro.
· A _ _local_bh_count field that disables the execution of the softirqs (as well as tasklets and bottom halves). This field may easily be accessed through the local_bh_count macro. If it is set to zero, the softirqs are enabled; alternatively, if the field stores a positive integer, the softirqs are disabled. The local_bh_disable macro increments the field, while the local_bh_enable macro decrements it. If the kernel invokes local_bh_disable twice, it must also call local_bh_enable twice to re-enable softirqs.[14]
[14] Better names for these two macros could be local_softirq_disable and local_softirq_enable. The actual names are vestiges of old kernel versions.
· A _ _ksoftirqd_task field that stores the process descriptor address of a ksoftirqd_CPUn kernel thread, which is devoted to the execution of deferrable functions. (There is one such thread per CPU, and the n in ksoftiqd_CPUn represents the CPU index, as described later in Section 4.7.1.1.) This field can be accessed through the ksoftirqd_task macro.
The open_softirq( ) function takes care of softirq initialization. It uses three parameters: the softirq index, a pointer to the softirq function to be executed, and a second pointer to a data structure that may be required by the softirq function. open_softirq( ) limits itself to initialize the proper entry of the softirq_vec array.
Softirqs are activated by invoking by the _ _cpu_raise_softirq macro, which receives as parameters the CPU number cpu and the softirq index nr, and sets the nrth bit of softirq_pending(cpu). The cpu_raise_softirq( ) function is similar to the _ _cpu_raise_softirq macro, except that it might also wake up the ksoftirqd_CPUn kernel thread.
Checks for pending softirqs are performed in a few points of the kernel code. Currently, this is done in the following cases (be warned that number and position of the softirq check points change both with the kernel version and with the supported hardware architecture):
· When the local_bh_enable macro re-enables the softirqs
· When the do_IRQ( ) function finishes handling an I/O interrupt
· When the smp_apic_timer_interrupt( ) function finishes handling a local timer interrupt (see Section 6.2.2)
· When one of the special ksoftirqd_CPUn kernel threads is awoken
· When a packet is received on a network interface card (see Chapter 18)
In each check point, the kernel reads softirq_pending(cpu); if this field is not null, the kernel invokes do_softirq( ) to execute the softirq functions. It performs the following actions:
1. Gets the logical number cpu of the CPU that executes the function.
2. Returns if local_irq_count(cpu) is not set to zero. In this case, do_softirq( ) is invoked while terminating a nested interrupt handler, and we know that deferrable functions must run outside of interrupt service routines.
3. Returns if local_bh_count(cpu) is not set to zero. In this case, all deferrable functions are disabled.
4. Saves the state of the IF flag and clears it to disable local interrupts.
5. Checks the softirq_pending(cpu) field of irq_stat. If no softirqs are pending, restores the value of the IF flag saved in the previous step, and then returns.
6. Invokes local_bh_disable(cpu ) to increment the local_bh_count(cpu) field of irq_stat. In this way, deferrable functions are effectively serialized on the CPU because any further invocation of do_softirq( ) returns without executing the softirq functions (see check at Step 3).
7. Executes the following loop:
8. pending = softirq_pending(cpu);
9. softirq_pending(cpu) = 0;
10. mask = 0;
11. do {
12. mask &= ~pending;
13. asm("sti");
14. for (i=0; pending; pending >>= 1, i++)
15. if (pending & 1)
16. softirq_vec[i].action(softirq_vec+i);
17. asm("cli");
18. pending = softirq_pending(cpu);
} while (pending & mask);
As you may see, the function stores the pending softirqs in the pending local variable, and then resets the softirq_pending(cpu) field to zero. In each iteration of the loop, the function:
a. Updates the mask local variable; it stores the indices of the softirqs that are already executed in this invocation of the do_softirq( ) function.
b. Enables local interrupts.
c. Executes the softirq functions of all pending softirqs (inner loop).
d. Disables local interrupts.
e. Reloads the pending local variable with the contents of the softirq_pending(cpu) field. An interrupt handler, or even a softirq function, could have invoked cpu_raise_softirq( ) while softirq functions were executing.
f. Performs another iteration of the loop if a softirq that has not been handled in this invocation of do_softirq( ) is activated.
19. Decrements the local_bh_count(cpu) field, thus re-enabling the softirqs.
20. Checks the value of the pending local variable. If it is not zero, a softirq that was handled in this invocation of do_softirq( ) is activated again. To trigger another execution of the do_softirq( ) function, the function wakes up the ksoftirqd_CPUn kernel thread.
21. Restores the status of IF flag (local interrupts enabled or disabled) saved in Step 4 and returns.
In recent kernel versions, each CPU has its own ksoftirqd_CPUn kernel thread (where n is the logical number of the CPU). Each ksoftirqd_CPUn kernel thread runs the ksoftirqd( ) function, which essentially executes the following loop:
for(;;) {
set_current_state(TASK_INTERRUPTIBLE);
schedule( );
/* now in TASK_RUNNING state */
while (softirq_pending(cpu)) {
do_softirq( );
if (current->need_resched)
schedule( );
}
}
When awoken, the kernel thread checks the softirq_pending(n) field and invokes, if necessary, do_softirq( ).
The ksoftirqd_CPUn kernel threads represent a solution for a critical trade-off problem.
Softirq functions may re-activate themselves; actually, both the networking softirqs and the tasklet softirqs do this. Moreover, external events, like packet flooding on a network card, may activate softirqs at very high frequency.
The potential for a continuous high-volume flow of softirqs creates a problem that is solved by introducing kernel threads. Without them, developers are essentially faced with two alternative strategies.
The first strategy consists of ignoring new softirqs that occur while do_softirq( ) is running. In other words, the do_softirq( ) function determines what softirqs are pending when the function is started, and then executes their functions. Next, it terminates without rechecking the pending softirqs. This solution is not good enough. Suppose that a softirq function is re-activated during the execution of do_softirq( ). In the worst case, the softirq is not executed again until the next timer interrupt, even if the machine is idle. As a result, softirq latency time is unacceptable for networking developers.
The second strategy consists of continuously rechecking for pending softirqs. The do_softirq( ) function keeps checking the pending softirqs and terminates only when none of them is pending. While this solution might satisfy networking developers, it can certainly upset normal users of the system: if a high-frequency flow of packets is received by a network card or a softirq function keeps activating itself, the do_softirq( ) function never returns and the User Mode programs are virtually stopped.
The ksoftirqd_CPUn kernel threads try to solve this difficult trade-off problem. The do_softirq( ) function determines what softirqs are pending and executes their functions. If an already executed softirq is activated again, the function wakes up the kernel thread and terminates (Step 9 in of do_softirq( )). The kernel thread has low priority, so user programs have a chance to run; but if the machine is idle, the pending softirqs are executed quickly.
Tasklets are the preferred way to implement deferrable functions in I/O drivers. As already explained, tasklets are built on top of two softirqs named HI_SOFTIRQ and TASKLET_SOFTIRQ. Several tasklets may be associated with the same softirq, each tasklet carrying its own function. There is no real difference between the two softirqs, except that do_softirq( ) executes HI_SOFTIRQ's tasklets before TASKLET_SOFTIRQ's tasklets.
Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec arrays, respectively. Both of them include NR_CPUS elements of type tasklet_head, and each element consists of a pointer to a list of tasklet descriptors. The tasklet descriptor is a data structure of type tasklet_struct, whose fields are shown in Table 4-8.
The state field of the tasklet descriptor includes two flags:
TASKLET_STATE_SCHED
When set, this indicates that the tasklet is pending (has been scheduled for execution); it also means that the tasklet descriptor is inserted in one of the lists of the tasklet_vec and tasklet_hi_vec arrays.
TASKLET_STATE_RUN
When set, this indicates that the tasklet is being executed; on a uniprocessor system this flag is not used because there is no need to check whether a specific tasklet is running.
Let's suppose you're writing a device driver and you want to use a tasklet: what has to be done? First of all, you should allocate a new tasklet_struct data structure and initialize it by invoking tasklet_init( ); this function receives as parameters the address of the tasklet descriptor, the address of your tasklet function, and its optional integer argument.
Your tasklet may be selectively disabled by invoking either tasklet_disable_nosync( ) or tasklet_disable( ). Both functions increment the count field of the tasklet descriptor, but the latter function does not return until an already running instance of the tasklet function has terminated. To re-enable your tasklet, use tasklet_enable( ).
To activate the tasklet, you should invoke either the tasklet_schedule( ) function or the tasklet_hi_schedule( ) function, according to the priority that you require for your tasklet. The two functions are very similar; each of them performs the following actions:
1. Checks the TASKLET_STATE_SCHED flag; if it is set, returns (the tasklet has already been scheduled)
2. Gets the logical number of the CPU that is executing the function
3. Saves the state of the IF flag and clears it to disable local interrupts
4. Adds the tasklet descriptor at the beginning of the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu]
5. Invokes cpu_raise_softirq( ) to activate either the TASKLET_SOFTIRQ softirq or the HI_SOFTIRQ softirq
6. Restores the value of the IF flag saved in Step 3 (local interrupts enabled or disabled)
Finally, let's see how your tasklet is executed. We know from the previous section that, once activated, softirq functions are executed by the do_softirq( ) function. The softirq function associated with the HI_SOFTIRQ softirq is named tasklet_hi_action( ), while the function associated with TASKLET_SOFTIRQ is named tasklet_action( ). Once again, the two functions are very similar; each of them:
1. Gets the logical number of the CPU that is executing the function.
2. Disables local interrupts, saving the previous state of the IF flag.
3. Stores the address of the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu] in the list local variable.
4. Puts a NULL address in tasklet_vec[cpu] or tasklet_hi_vec[cpu]; thus, the list of scheduled tasklet descriptors is emptied.
5. Enables local interrupts.
6. For each tasklet descriptor in the list pointed to by list:
a. In multiprocessor systems, checks the TASKLET_STATE_RUN flag of the tasklet. If it is set, a tasklet of the same type is already running on another CPU, so the function reinserts the task descriptor in the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu] and activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again. In this way, execution of the tasklet is deferred until other tasklets of the same type are running on other CPUs.
b. If the TASKLET_STATE_RUN flag is not set, the tasklet is not running on other CPUs. In multiprocessor systems, the function sets the flag so that the tasklet function cannot be executed on other CPUs.
c. Checks whether the tasklet is disabled by looking at the count field of the tasklet descriptor. If it is disabled, it reinserts the task descriptor in the list pointed to by tasklet_vec[cpu] or tasklet_hi_vec[cpu]; then the function activates the TASKLET_SOFTIRQ or HI_SOFTIRQ softirq again.
d. If the tasklet is enabled, clears the TASKLET_STATE_SCHED flag and executes the tasklet function.
Notice that, unless the tasklet function re-activates itself, every tasklet activation triggers at most one execution of the tasklet function.
A bottom half is essentially a high-priority tasklet that cannot be executed concurrently with any other bottom half, even if it is of a different type and on another CPU. The global_bh_lock spin lock is used to ensure that at most one bottom half is running.
Linux uses an array called the bh_base table to group all bottom halves together. It is an array of pointers to bottom halves and can include up to 32 entries, one for each type of bottom half. In practice, Linux uses about half of them; the types are listed in Table 4-9. As you can see from the table, some of the bottom halves are associated with hardware devices that are not necessarily installed in the system or that are specific to platforms besides the IBM PC compatible. But TIMER_BH, TQUEUE_BH, SERIAL_BH, and IMMEDIATE_BH still see widespread use. We describe the TQUEUE_BH and IMMEDIATE_BH bottom half later in this chapter and the TIMER_BH bottom half in Chapter 6.
The bh_task_vec array stores 32 tasklet descriptors, one for each bottom half. During kernel initialization, these tasklet descriptors are initialized in the following way:
for (i=0; i<32; ++i)
tasklet_init(bh_task_vec+i, bh_action, i);
As usual, before a bottom half is invoked for the first time, it must be initialized. This is done by invoking init_bh(n, routine), which inserts the routine address as the n th entry of bh_base. Conversely, remove_bh(n) removes the n th bottom half from the table.
Bottom-half activation is done by invoking mark_bh( ). Since bottom halves are high-priority tasklets, mark_bh(n) just reduces to tasklet_hi_schedule(bh_task_vec + n).
The bh_action( ) function is the tasklet function common to all bottom halves. It receives as a parameter the index of the bottom half and performs the following steps:
1. Gets the logical number of the CPU executing the tasklet function.
2. Checks whether the global_bh_lock spin lock has already been acquired. In this case, another CPU is running a bottom half. The function invokes mark_bh( ) to re-activate the bottom half and returns.
3. Otherwise, the function acquires the global_bh_lock spin lock so that no other bottom half can be executed in the system.
4. Checks that the local_irq_count field is set to zero (bottom halves are supposed to be run outside interrupt service routines), and that global interrupts are enabled (see Chapter 5). If one of these cases doesn't hold, the function releases the global_bh_lock spin lock and terminates.
5. Invokes the bottom half function stored in the proper entry of the bh_base array.
6. Releases the global_bh_lock spin lock and returns.
The motivation for introducing deferrable functions is to allow a limited number of functions related to interrupt handling to be executed in a deferred manner. This approach has been stretched in two directions:
· To allow not only a function that services an interrupt, but also a generic kernel function to be executed as a bottom half
· To allow several kernel functions, instead of a single one, to be associated with a bottom half
Groups of functions are represented by task queues, which are lists of tq_struct structures whose fields are shown in Table 4-10.
As we shall see in Chapter 13, I/O device drivers use task queues to require the execution of several related functions when a specific interrupt occurs.
The DECLARE_TASK_QUEUE macro allocates a new task queue, while queue_task( ) inserts a new function in a task queue. The run_task_queue( ) function executes all the functions included in a given task queue.
It's worth mentioning three particular task queues:
· The tq _immediate task queue, run by the IMMEDIATE_BH bottom half, includes kernel functions to be executed together with the standard bottom halves. The kernel invokes mark_bh( ) to activate the IMMEDIATE_BH bottom half whenever a function is added to the tq _immediate task queue. It is executed as soon as do_softirq( ) is invoked.
· The tq _timer task queue is run by the TQUEUE_BH bottom half, which is activated at every timer interrupt. As we'll see in Chapter 6, that means it runs about every 10 ms.
· The tq_context task queue is not associated with a bottom half, but it is run by the keventd kernel thread. The schedule_task( ) function adds a function to the task queue; its execution is deferred until the scheduler selects the keventd kernel thread as the next process to run.
The main advantage of tq_context, with respect to the other task queues based on deferrable functions, is that its functions can freely perform blocking operations. On the other hand, softirqs (and therefore tasklets and bottom halves) are similar to interrupt handlers in that kernel developers cannot make any assumption on the process that will execute the deferrable functions. From a practical point of view, this means that softirqs cannot perform blocking operations like accessing a file, acquiring a semaphore, or sleeping in a wait queue.
The price to pay is that, once scheduled for execution in tq_context, a function might be delayed for quite a long time interval.