Among its other time-related duties, the kernel must periodically collect several kinds of data used to:
· Check the CPU resource limit of the running processes
· Compute the average system load
· Profile the kernel code
The update_process_times( ) function (invoked either by the PIT's timer interrupt handler in uniprocessor systems or by the local timer interrupt handler in multiprocessor systems) updates some kernel statistics stored in the kstat variable of type kernel_stat; it then invokes update_one_process( ) to update some fields storing statistics that can be exported to user programs through the times( ) system call. In particular, a distinction is made between CPU time spent in User Mode and in Kernel Mode. The function performs the following actions (a simplified sketch follows the list):
1. Updates the per_cpu_utime field of current's process descriptor, which stores the number of ticks during which the process has been running in User Mode.
2. Updates the per_cpu_stime field of current's process descriptor, which stores the number of ticks during which the process has been running in Kernel Mode.
3. Invokes do_process_times( ), which checks whether the total CPU time limit has been reached; if so, it sends SIGXCPU and SIGKILL signals to current. Section 3.2.5 describes how the limit is controlled by the rlim[RLIMIT_CPU].rlim_cur field of each process descriptor.
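The accounting just described can be summarized by the following sketch, modeled on the 2.4 sources (there, per_cpu_utime and per_cpu_stime are per-CPU arrays indexed by the CPU number; the virtual and profiling interval timers updated at the same point are omitted here):

    /* Simplified sketch of the per-tick CPU time accounting. */
    static inline void do_process_times(struct task_struct *p,
                                        unsigned long user, unsigned long system)
    {
        unsigned long psecs;

        psecs  = (p->times.tms_utime += user);
        psecs += (p->times.tms_stime += system);
        if (psecs / HZ > p->rlim[RLIMIT_CPU].rlim_cur) {
            /* over the soft limit: send SIGXCPU once per second */
            if (!(psecs % HZ))
                send_sig(SIGXCPU, p, 1);
            /* over the hard limit: kill the process */
            if (psecs / HZ > p->rlim[RLIMIT_CPU].rlim_max)
                send_sig(SIGKILL, p, 1);
        }
    }

    void update_one_process(struct task_struct *p, unsigned long user,
                            unsigned long system, int cpu)
    {
        p->per_cpu_utime[cpu] += user;      /* ticks spent in User Mode   */
        p->per_cpu_stime[cpu] += system;    /* ticks spent in Kernel Mode */
        do_process_times(p, user, system);  /* enforce RLIMIT_CPU         */
    }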
Two additional fields called times.tms_cutime and times.tms_cstime are provided in the process descriptor to count the number of CPU ticks spent by the process's children in User Mode and Kernel Mode, respectively. For reasons of efficiency, these fields are not updated by update_one_process( ), but rather when the parent process queries the state of one of its children (see Section 3.5).
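As an illustration (user-space code, not from the kernel sources), the following program shows how all four counters reach a process through the times( ) system call; it is the wait( ) call that lets the parent fold the child's CPU time into tms_cutime and tms_cstime:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/times.h>
    #include <sys/wait.h>

    int main(void)
    {
        struct tms t;
        volatile long i;

        if (fork() == 0) {              /* child: burn some CPU time */
            for (i = 0; i < 100000000L; i++)
                ;
            exit(0);
        }
        wait(NULL);                     /* reaping the child updates the
                                           parent's tms_cutime/tms_cstime */
        times(&t);
        printf("self:     user=%ld system=%ld ticks\n",
               (long) t.tms_utime, (long) t.tms_stime);
        printf("children: user=%ld system=%ld ticks\n",
               (long) t.tms_cutime, (long) t.tms_cstime);
        return 0;
    }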
Any Unix kernel keeps track of how much CPU activity is being carried out by the system. These statistics are used by various administration utilities such as top. A user who enters the uptime command sees the statistics as the "load average" relative to the last minute, the last 5 minutes, and the last 15 minutes. On a uniprocessor system, a value of 0 means that there are no active processes (besides the swapper process 0) to run, while a value of 1 means that the CPU is 100 percent busy with a single process, and values greater than 1 mean that the CPU is shared among several active processes.[6]
[6] Linux includes in the load average all processes that are in TASK_RUNNING and TASK_UNINTERRUPTIBLE states. However, in normal conditions, there are few TASK_UNINTERRUPTIBLE processes, so a high load usually means that the CPU is busy.
The system load data is collected by the calc_load( ) function, which is invoked by update_times( ). This activity is therefore performed in the TIMER_BH bottom half. calc_load( ) counts the number of processes in the TASK_RUNNING or TASK_UNINTERRUPTIBLE state and uses this number to update the three load-average values.
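The three values reported by uptime are kept in a kernel array (avenrun[ ] in the sources) as fixed-point, exponentially decaying averages that are recomputed every 5 seconds. The following sketch is modeled on the 2.4 code; count_active_tasks( ) returns the number of TASK_RUNNING and TASK_UNINTERRUPTIBLE processes, scaled by FIXED_1:

    #define FSHIFT    11                /* bits of fractional precision    */
    #define FIXED_1   (1 << FSHIFT)     /* 1.0 in fixed-point notation     */
    #define LOAD_FREQ (5 * HZ)          /* recompute every 5 seconds       */
    #define EXP_1     1884              /* decay factor for the  1-min avg */
    #define EXP_5     2014              /* decay factor for the  5-min avg */
    #define EXP_15    2037              /* decay factor for the 15-min avg */

    unsigned long avenrun[3];           /* 1-, 5-, and 15-minute averages  */

    #define CALC_LOAD(load, exp, n) do {           \
            (load) *= (exp);                       \
            (load) += (n) * (FIXED_1 - (exp));     \
            (load) >>= FSHIFT;                     \
    } while (0)

    static inline void calc_load(unsigned long ticks)
    {
        static int count = LOAD_FREQ;
        unsigned long active;           /* active processes, in fixed-point */

        count -= ticks;
        if (count < 0) {
            count += LOAD_FREQ;
            active = count_active_tasks();
            CALC_LOAD(avenrun[0], EXP_1,  active);
            CALC_LOAD(avenrun[1], EXP_5,  active);
            CALC_LOAD(avenrun[2], EXP_15, active);
        }
    }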
Linux includes a minimalist code profiler used by Linux developers to discover where the kernel spends its time in Kernel Mode. The profiler identifies the hot spots of the kernel — the most frequently executed fragments of kernel code. Identifying the kernel hot spots is very important because they may point out kernel functions that should be further optimized.
The profiler is based on a very simple Monte Carlo algorithm: at every timer interrupt occurrence, the kernel determines whether the interrupt occurred in Kernel Mode; if so, the kernel fetches the value of the eip register before the interruption from the stack and uses it to discover what the kernel was doing before the interrupt. In the long run, the samples accumulate on the hot spots.
The x86_do_profile( ) function collects the data for the code profiler. It is invoked either by the do_timer_interrupt( ) function (that is, by the PIT's timer interrupt handler) in uniprocessor systems or by the smp_local_timer_interrupt( ) function (that is, by the local timer interrupt handler) in multiprocessor systems.
To enable the code profiler, the Linux kernel must be booted by passing as a parameter the string profile=N, where 2^N denotes the size of the code fragments to be profiled. The collected data can be read from the /proc/profile file. The counters are reset by writing into the same file; in multiprocessor systems, writing into the file can also change the sample frequency (see the earlier Section 6.2.2). However, kernel developers do not usually access /proc/profile directly; instead, they use the readprofile system command.
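The sampling step itself is quite short. The following sketch is modeled on the 2.4 x86_do_profile( ) code (the per-CPU mask check performed by the real function is omitted); prof_buffer, prof_shift, and prof_len are set up at boot time from the profile=N parameter, and _stext marks the start of the kernel text:

    static inline void x86_do_profile(unsigned long eip)
    {
        if (!prof_buffer)                  /* profiler not enabled */
            return;
        eip -= (unsigned long) &_stext;    /* offset within the kernel text   */
        eip >>= prof_shift;                /* index of the 2^N-byte fragment  */
        if (eip > prof_len - 1)            /* out-of-range samples fall into  */
            eip = prof_len - 1;            /*   the last counter              */
        atomic_inc((atomic_t *) &prof_buffer[eip]);
    }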
In multiprocessor systems, Linux offers yet another feature to kernel developers: a watchdog system, which can be quite useful for detecting kernel bugs that cause a system freeze. To activate this watchdog, the kernel must be booted with the nmi_watchdog parameter.
The watchdog is based on a clever hardware feature of multiprocessor motherboards: they can broadcast the PIT's timer interrupt as an NMI interrupt to all CPUs. Since NMI interrupts are not masked by the cli assembly language instruction, the watchdog can detect deadlocks even when interrupts are disabled.
As a consequence, once every tick, all CPUs, regardless of what they are doing, start executing the NMI interrupt handler; in turn, the handler invokes do_nmi( ). This function gets the logical number n of the CPU and then checks the nth entry of the apic_timer_irqs array. If the CPU is working properly, that entry is incremented at every tick by the local timer interrupt handler (see the earlier Section 6.2.2.2), so its value must differ from the value read at the previous NMI interrupt. If the counter has not been incremented, the local timer interrupt handler has not been executed in a whole tick. Not a good thing, you know.
When the NMI interrupt handler detects a CPU freeze, it rings all the bells: it logs scary messages in the system log files, dumps the contents of the CPU registers and of the kernel stack (kernel oops), and finally kills the current process. This gives kernel developers a chance to discover what's gone wrong.
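Putting the pieces together, the per-tick NMI check can be sketched as follows (modeled on the 2.4 nmi_watchdog_tick( ) function; the five-second tolerance before declaring a lockup and the helper names are taken from the i386 sources and should be read as illustrative):

    void nmi_watchdog_tick(struct pt_regs *regs)
    {
        int cpu = smp_processor_id();
        int sum = apic_timer_irqs[cpu];    /* local timer interrupts seen so far */

        if (last_irq_sums[cpu] == sum) {
            /* No local timer interrupt since the last NMI: the CPU may be
             * stuck with interrupts disabled. Tolerate a few seconds of
             * silence before declaring a lockup. */
            if (++alert_counter[cpu] == 5 * HZ) {
                printk("NMI Watchdog detected LOCKUP on CPU %d\n", cpu);
                show_registers(regs);      /* register and stack dump (oops) */
                do_exit(SIGSEGV);          /* kill the current process       */
            }
        } else {
            last_irq_sums[cpu] = sum;      /* the CPU is alive: remember the  */
            alert_counter[cpu] = 0;        /*   new count and reset the alarm */
        }
    }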