Several system calls have been introduced to allow processes to change their priorities and scheduling policies. As a general rule, users are always allowed to lower the priorities of their processes. However, if they want to modify the priorities of processes belonging to some other user or if they want to increase the priorities of their own processes, they must have superuser privileges.
The nice( )[8] system call allows processes to change their base priority. The integer value contained in the increment parameter is used to modify the nice field of the process descriptor. The nice Unix command, which allows users to run programs with modified scheduling priority, is based on this system call.
[8] Since this system call is usually invoked to lower the priority of a process, users who invoke it for their processes are "nice" to other users.
The sys_nice( ) service routine handles the nice( ) system call. Although the increment parameter may have any value, absolute values larger than 40 are trimmed down to 40. Traditionally, negative values correspond to requests for priority increments and require superuser privileges, while positive ones correspond to requests for priority decrements. In the case of a negative increment, the function invokes the capable( ) function to verify whether the process has a CAP_SYS_NICE capability. We discuss that function, together with the notion of capability, in Chapter 20. If the increment is positive, or if the process turns out to have the capability required to raise its priority, sys_nice( ) adds the value of increment to the nice field of current. If necessary, the value of this field is trimmed so that it is not less than -20 or greater than +19.
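The steps just described can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel source: the name model_nice, the has_cap_sys_nice flag (standing in for the capable( ) check), and the -EPERM return value are choices made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model of the sys_nice() steps described above.
 * *nice holds the caller's current nice value; has_cap_sys_nice stands in
 * for the kernel's capable(CAP_SYS_NICE) check. */
int model_nice(int *nice, int increment, int has_cap_sys_nice)
{
    int newval;

    assert(nice != NULL);

    /* Trim the increment so that its absolute value is at most 40. */
    if (increment < -40)
        increment = -40;
    if (increment > 40)
        increment = 40;

    /* A negative increment raises the priority and needs the capability. */
    if (increment < 0 && !has_cap_sys_nice)
        return -EPERM;   /* assumed error code for this sketch */

    /* Apply the increment, then keep the result within [-20, +19]. */
    newval = *nice + increment;
    if (newval < -20)
        newval = -20;
    if (newval > 19)
        newval = 19;
    *nice = newval;
    return 0;
}
```

Note that the increment is trimmed before it is applied, and the resulting nice value is trimmed again afterward, so even an extreme request ends up inside the legal range.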
The nice( ) system call is maintained for backward compatibility only; it has been replaced by the setpriority( ) system call described next.
The nice( ) system call affects only the process that invokes it. Two other system calls, denoted as getpriority( ) and setpriority( ), act on the base priorities of all processes in a given group. getpriority( ) returns 20 minus the lowest nice field value among all processes in a given group, that is, the highest priority among those processes; setpriority( ) sets the base priority of all processes in a given group to a given value.
The kernel implements these system calls by means of the sys_getpriority( ) and sys_setpriority( ) service routines. Both of them act essentially on the same group of parameters:
which
The value that identifies the group of processes; it can assume one of the following:
PRIO_PROCESS
Selects the processes according to their process ID (pid field of the process descriptor).
PRIO_PGRP
Selects the processes according to their process group ID (pgrp field of the process descriptor).
PRIO_USER
Selects the processes according to their user ID (uid field of the process descriptor).
who
The value of the pid, pgrp, or uid field (depending on the value of which) to be used for selecting the processes. If who is 0, its value is set to that of the corresponding field of the current process.
niceval
The new base priority value (needed only by sys_setpriority( )). It should range between -20 (highest priority) and +19 (lowest priority).
As stated before, only processes with a CAP_SYS_NICE capability are allowed to increase their own base priority or to modify that of other processes.
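From User Mode, the which and who parameters are passed through the standard getpriority( ) and setpriority( ) library functions. The following is a minimal sketch, assuming a Linux system with the GNU C library; the helper name lower_own_priority is hypothetical. It only raises the nice value (lowers the priority), which requires no special capability. Note that the C library wrapper returns the nice value itself, so a return value of -1 is legitimate and errno must be checked to detect errors.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the caller's nice value by delta (delta >= 0),
 * i.e. lower its priority. Stores the resulting nice value in *new_nice.
 * Returns 0 on success, -1 on failure. */
int lower_own_priority(int delta, int *new_nice)
{
    int cur, target;

    assert(new_nice != NULL);
    if (delta < 0)
        return -1;                /* this sketch never raises the priority */

    errno = 0;                    /* -1 may be a valid nice value */
    cur = getpriority(PRIO_PROCESS, 0);   /* who == 0: the calling process */
    if (cur == -1 && errno != 0)
        return -1;

    target = cur + delta;
    if (target > 19)
        target = 19;              /* lowest priority */

    if (setpriority(PRIO_PROCESS, 0, target) == -1)
        return -1;

    errno = 0;
    *new_nice = getpriority(PRIO_PROCESS, 0);
    if (*new_nice == -1 && errno != 0)
        return -1;
    return 0;
}
```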
As we saw in Chapter 9, system calls return a negative value only if some error occurred. For this reason, getpriority( ) does not return a normal nice value ranging between -20 and +19, but rather a nonnegative value ranging between 1 and 40.
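The conversion between nice values and the value returned by the system call can be sketched as follows. The function names are hypothetical, and returning -ESRCH for an empty group is an assumption of this model rather than something stated in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Map a nice value in [-20, +19] to the 1..40 value returned by the
 * system call, and back. */
int nice_to_syscall_value(int nice)  { return 20 - nice; }
int syscall_value_to_nice(int value) { return 20 - value; }

/* Hypothetical model of getpriority() over a group: 20 minus the lowest
 * nice value found, i.e. the highest priority in the group. */
int model_getpriority(const int *nices, size_t count)
{
    size_t i;
    int lowest;

    assert(nices != NULL || count == 0);
    if (count == 0)
        return -ESRCH;            /* assumed: no matching process */

    lowest = nices[0];
    for (i = 1; i < count; i++)
        if (nices[i] < lowest)
            lowest = nices[i];
    return 20 - lowest;
}
```

Because 20 minus a nice value between -20 and +19 always lies between 1 and 40, a negative return value can be reserved unambiguously for errors.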
We now introduce a group of system calls that allow processes to change their scheduling discipline and, in particular, to become real-time processes. As usual, a process must have a CAP_SYS_NICE capability to modify the values of the rt_priority and policy process descriptor fields of any process, including itself.
The sched_getscheduler( ) system call queries the scheduling policy currently applied to the process identified by the pid parameter. If pid equals 0, the policy of the calling process is retrieved. On success, the system call returns the policy for the process: SCHED_FIFO, SCHED_RR, or SCHED_OTHER. The corresponding sys_sched_getscheduler( ) service routine invokes find_process_by_pid( ), which locates the process descriptor corresponding to the given pid and returns the value of its policy field.
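A User Mode program can query its own policy as in the sketch below, which assumes a Linux system that provides the POSIX sched_getscheduler( ) function in <sched.h>. The helpers policy_name and query_policy are hypothetical names introduced for illustration; kernels newer than the one described here may report additional policies, which this sketch labels "unknown".

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>

/* Translate a policy constant into a printable name. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_OTHER: return "SCHED_OTHER";
    default:          return "unknown";
    }
}

/* Hypothetical helper: return the policy of process pid (0 means the
 * caller) and store its name in *name; return -1 on error. */
int query_policy(pid_t pid, const char **name)
{
    int policy;

    assert(name != NULL);
    policy = sched_getscheduler(pid);
    if (policy == -1)
        return -1;
    *name = policy_name(policy);
    return policy;
}
```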
The sched_setscheduler( ) system call sets both the scheduling policy and the associated parameters for the process identified by the parameter pid. If pid is equal to 0, the scheduler parameters of the calling process will be set.
The corresponding sys_sched_setscheduler( ) function checks whether the scheduling policy specified by the policy parameter and the new static priority specified by the param->sched_priority parameter are valid. It also checks whether the process has CAP_SYS_NICE capability or whether its owner has superuser rights. If everything is OK, it executes the following statements:
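p->policy = policy;                      /* install the new policy */
p->rt_priority = param->sched_priority;  /* and the new real-time priority */
if (task_on_runqueue(p))
    move_first_runqueue(p);              /* a runnable process moves to the head of the runqueue */
current->need_resched = 1;               /* request a reschedule */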
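The validity check performed before these statements can be modeled in user space. The sketch below is an assumption based on the priority ranges given later in this section (1 to 99 for real-time policies, 0 for SCHED_OTHER); the function name check_sched_params and the -EINVAL return value are hypothetical and not taken from the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical model of the policy/priority validity check.
 * Returns 0 if the pair is acceptable, -EINVAL otherwise. */
int check_sched_params(int policy, const struct sched_param *param)
{
    assert(param != NULL);

    switch (policy) {
    case SCHED_FIFO:
    case SCHED_RR:
        /* Real-time policies take a static priority from 1 to 99. */
        if (param->sched_priority < 1 || param->sched_priority > 99)
            return -EINVAL;
        return 0;
    case SCHED_OTHER:
        /* Conventional processes have no real-time priority. */
        return param->sched_priority == 0 ? 0 : -EINVAL;
    default:
        return -EINVAL;           /* unknown policy */
    }
}
```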
The sched_getparam( ) system call retrieves the scheduling parameters for the process identified by pid. If pid is 0, the parameters of the current process are retrieved. The corresponding sys_sched_getparam( ) service routine, as one would expect, finds the process descriptor pointer associated with pid, stores its rt_priority field in a local variable of type sched_param, and invokes copy_to_user( ) to copy it into the process address space at the address specified by the param parameter.
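Seen from User Mode, the call fills a sched_param structure supplied by the caller. The following is a minimal sketch, assuming a Linux system with the POSIX sched_getparam( ) function; the helper name query_rt_priority is hypothetical.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: store the real-time priority of process pid
 * (0 means the caller) in *prio. Returns 0 on success, -1 on error. */
int query_rt_priority(pid_t pid, int *prio)
{
    struct sched_param param;

    assert(prio != NULL);
    if (sched_getparam(pid, &param) == -1)
        return -1;
    *prio = param.sched_priority;   /* copied out by the kernel */
    return 0;
}
```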
The sched_setparam( ) system call is similar to sched_setscheduler( ). The difference is that sched_setparam( ) does not let the caller set the policy field's value.[9] The corresponding sys_sched_setparam( ) service routine is almost identical to sys_sched_setscheduler( ), but the policy of the affected process is never changed.
[9] This anomaly is caused by a specific requirement of the POSIX standard.
The sched_yield( ) system call allows a process to relinquish the CPU voluntarily without being suspended; the process remains in a TASK_RUNNING state, but the scheduler puts it at the end of the runqueue list. In this way, other processes that have the same dynamic priority have a chance to run. The call is used mainly by SCHED_FIFO processes.
The corresponding sys_sched_yield( ) service routine checks first if there is some process in the system that is runnable, other than the process executing the system call and the swapper kernel threads. If there is no such process, sched_yield( ) returns without performing any action because no process would be able to use the freed processor. Otherwise, the function executes the following statements:
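if (current->policy == SCHED_OTHER)
    current->policy |= SCHED_YIELD;   /* mark a conventional process as having yielded */
current->need_resched = 1;            /* request a reschedule */
spin_lock_irq(&runqueue_lock);
move_last_runqueue(current);          /* put the caller at the end of the runqueue */
spin_unlock_irq(&runqueue_lock);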
As a result, schedule( ) is invoked when returning from the sys_sched_yield( ) service routine (see Section 4.8), and the current process will most likely be replaced.
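A typical User Mode use of sched_yield( ) is a polling loop that gives other runnable processes a chance to run between checks. The sketch below is only an illustration; the helper name wait_for_flag and the bounded number of attempts are assumptions of this example.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical helper: poll *flag up to max_spins times, yielding the
 * CPU after each unsuccessful check. Returns 1 if the flag was seen
 * set, 0 otherwise. */
int wait_for_flag(volatile int *flag, int max_spins)
{
    int i;

    assert(flag != NULL);
    for (i = 0; i < max_spins; i++) {
        if (*flag)
            return 1;
        sched_yield();   /* stay runnable, but let other processes run */
    }
    return *flag != 0;
}
```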
The sched_get_priority_min( ) and sched_get_priority_max( ) system calls return, respectively, the minimum and the maximum real-time static priority value that can be used with the scheduling policy identified by the policy parameter.
The sys_sched_get_priority_min( ) service routine returns 1 if policy denotes a real-time policy (SCHED_FIFO or SCHED_RR), 0 otherwise.
The sys_sched_get_priority_max( ) service routine returns 99 (the highest priority) if policy denotes a real-time policy, 0 otherwise.
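A program that wants to pick a valid static priority can query both bounds first. This is a minimal sketch, assuming a system that provides the POSIX sched_get_priority_min( ) and sched_get_priority_max( ) functions; the prio_range structure and the helper name get_prio_range are hypothetical.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical container for the bounds of one scheduling policy. */
struct prio_range {
    int min;
    int max;
};

/* Fill *r with the static priority bounds of policy.
 * Returns 0 on success, -1 if the policy is not recognized. */
int get_prio_range(int policy, struct prio_range *r)
{
    int lo, hi;

    assert(r != NULL);
    lo = sched_get_priority_min(policy);
    hi = sched_get_priority_max(policy);
    if (lo == -1 || hi == -1)
        return -1;
    r->min = lo;
    r->max = hi;
    return 0;
}
```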
The sched_rr_get_interval( ) system call writes the Round Robin time quantum for the real-time process identified by the pid parameter into a structure stored in the User Mode address space. If pid is zero, the system call writes the time quantum of the current process.
The corresponding sys_sched_rr_get_interval( ) service routine invokes, as usual, find_process_by_pid( ) to retrieve the process descriptor associated with pid. It then converts the number of ticks derived from the nice field of the selected process descriptor into seconds and nanoseconds and copies the numbers into the User Mode structure.
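The conversion from ticks to seconds and nanoseconds can be sketched as follows. The tick rate MODEL_HZ of 100 ticks per second and the function name ticks_to_timespec are assumptions of this user-space model, not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MODEL_HZ 100UL   /* assumed tick rate: 100 ticks per second */

/* Hypothetical model: split a tick count into whole seconds plus the
 * remaining ticks expressed in nanoseconds. */
void ticks_to_timespec(unsigned long ticks, struct timespec *ts)
{
    assert(ts != NULL);
    ts->tv_sec  = (time_t)(ticks / MODEL_HZ);
    ts->tv_nsec = (long)((ticks % MODEL_HZ) * (1000000000UL / MODEL_HZ));
}
```

With this tick rate, each tick contributes 10,000,000 nanoseconds, so a quantum of 5 ticks corresponds to 0 seconds and 50,000,000 nanoseconds.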