When a file can be accessed by more than one process, a synchronization problem occurs. What happens if two processes try to write in the same file location? Or again, what happens if a process reads from a file location while another process is writing into it?
In traditional Unix systems, concurrent accesses to the same file location produce unpredictable results. However, Unix systems provide a mechanism that allows the processes to lock a file region so that concurrent accesses may be easily avoided.
The POSIX standard requires a file-locking mechanism based on the fcntl( ) system call. It is possible to lock an arbitrary region of a file (even a single byte) or to lock the whole file (including data appended in the future). Since a process can choose to lock just a part of a file, it can also hold multiple locks on different parts of the file.
This kind of lock does not keep out another process that is ignorant of locking. Like a critical region in code, the lock is considered "advisory" because it doesn't work unless other processes cooperate in checking the existence of a lock before accessing the file. Therefore, POSIX's locks are known as advisory locks.
Traditional BSD variants implement advisory locking through the flock( ) system call. This call does not allow a process to lock a file region, just the whole file.
Traditional System V variants provide the lockf( ) function, which is just an interface to fcntl( ). More importantly, System V Release 3 introduced mandatory locking: the kernel checks that every invocation of the open( ), read( ), and write( ) system calls does not violate a mandatory lock on the file being accessed. Therefore, mandatory locks are enforced even between noncooperative processes.[8] A file is marked as a candidate for mandatory locking by setting its set-group bit (SGID) and clearing the group-execute permission bit. Since the set-group bit makes no sense when the group-execute bit is off, the kernel interprets that combination as a hint to use mandatory locks instead of advisory ones.
[8] Oddly enough, a process may still unlink (delete) a file even if some other process owns a mandatory lock on it! This perplexing situation is possible because when a process deletes a file hard link, it does not modify its contents, but only the contents of its parent directory.
Whether processes use advisory or mandatory locks, they can use both shared read locks and exclusive write locks. Any number of processes may have read locks on some file region, but only one process can have a write lock on it at the same time. Moreover, it is not possible to get a write lock when another process owns a read lock for the same file region, and vice versa (see Table 12-18).
Table 12-18. Whether a lock is granted

| Current locks | Grant request for read lock? | Grant request for write lock? |
|---------------|------------------------------|-------------------------------|
| No lock       | Yes                          | Yes                           |
| Read lock     | Yes                          | No                            |
| Write lock    | No                           | No                            |
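The compatibility rules of Table 12-18 can be expressed as a small predicate. This is only an illustrative sketch: the enum and the function name are hypothetical and do not correspond to kernel identifiers.

```c
#include <stdbool.h>

/* Hypothetical names used only for this sketch. */
enum lock_kind { NO_LOCK, READ_LOCK, WRITE_LOCK };

/* Returns true if a request of kind `requested` may be granted while another
 * process holds a lock of kind `current` on the same file region. */
static bool lock_is_granted(enum lock_kind current, enum lock_kind requested)
{
    if (current == NO_LOCK || requested == NO_LOCK)
        return true;                      /* nothing held, or nothing asked */
    /* Only read locks are compatible with each other. */
    return current == READ_LOCK && requested == READ_LOCK;
}
```

For instance, `lock_is_granted(READ_LOCK, WRITE_LOCK)` is false, matching the "Read lock" row of the table.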
Linux supports all fashions of file locking: advisory and mandatory locks, as well as the fcntl( ), flock( ), and lockf( ) interfaces. However, lockf( ) is not a system call but just a library wrapper routine around fcntl( ), and therefore is not discussed here.
fcntl( )'s mandatory locks can be enabled and disabled on a per-filesystem basis using the MS_MANDLOCK flag (the mand option) of the mount( ) system call. The default is to switch off mandatory locking. In this case, fcntl( ) creates advisory locks. When the flag is set, fcntl( ) produces mandatory locks if the file has the set-group bit on and the group-execute bit off; it produces advisory locks otherwise.
In earlier Linux versions, the flock( ) system call produced only advisory locks, regardless of the MS_MANDLOCK mount flag. This is the expected behavior of the system call in any Unix-like operating system. In Linux 2.4, however, a special kind of flock( ) mandatory lock has been added to allow proper support for some proprietary network filesystem implementations. It is the so-called share-mode mandatory lock; when set, no other process may open the file in an access mode that conflicts with the lock. Use of this feature for native Unix applications is discouraged, because the resulting source code will be nonportable.
Another kind of flock( )-based mandatory lock, called a lease, has been introduced in Linux 2.4. When a process tries to open a file protected by a lease, it is blocked as usual. However, the process that owns the lease receives a signal. Once informed, it should first update the file so that its content is consistent, and then release the lease. If the owner does not do this within a well-defined time interval (tunable by writing a number of seconds into /proc/sys/fs/lease-break-time, usually 45 seconds), the lease is automatically removed by the kernel and the blocked process is allowed to continue.
Besides the checks in the read( ) and write( ) system calls, the kernel takes into consideration the existence of mandatory locks when servicing all system calls that could modify the contents of a file. For instance, an open( ) system call with the O_TRUNC flag set fails if any mandatory lock exists for the file.
A lock produced by fcntl( ) is of type FL_POSIX, while a lock produced by flock( ) is of type FL_FLOCK, FL_MAND (for share-mode locks), or FL_LEASE (for leases). The types of locks produced by fcntl( ) may safely coexist with those produced by flock( ), but neither one has any effect on the other. Therefore, a file locked through fcntl( ) does not appear locked to flock( ), and vice versa.
The following section describes the main data structure used by the kernel to handle file locks. The next two sections examine the differences between the two most common lock types: FL_POSIX and FL_FLOCK.
The file_lock data structure represents file locks; its fields are shown in Table 12-19. All file_lock data structures are included in a doubly linked list. The address of the first element is stored in file_lock_list, while the fields fl_nextlink and fl_prevlink store the addresses of the adjacent elements in the list.
All file_lock structures that refer to the same file on disk are collected in a singly linked list, whose first element is pointed to by the i_flock field of the inode object. The fl_next field of the file_lock structure specifies the next element in the list.
When a process tries to get an advisory or mandatory lock, it may be suspended until the previously allocated lock on the same file region is released. All processes sleeping on some lock are inserted into a wait queue, whose head is stored in the fl_wait field of the file_lock structure. Moreover, all processes sleeping on any file locks are inserted into a circular doubly linked list, whose head (first dummy element) is stored in the blocked_list variable; the fl_block field of the file_lock data structure stores the pointer to adjacent elements in the list.
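The per-inode list can be pictured with a much-reduced user-space model. This is a sketch under stated assumptions: only the names fl_next and i_flock come from the text, while the struct names, the is_write field, and the helper function are hypothetical. The real file_lock structure has many more fields.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reduced model of a lock: just the per-inode link and a lock type. */
struct file_lock_model {
    struct file_lock_model *fl_next;   /* next lock on the same inode */
    bool is_write;                     /* hypothetical: true for a write lock */
};

/* Reduced model of an inode: only the head of its lock list. */
struct inode_model {
    struct file_lock_model *i_flock;   /* first lock, or NULL if none */
};

/* Walks the singly linked list and counts the locks set on the inode. */
static size_t count_inode_locks(const struct inode_model *inode)
{
    size_t n = 0;
    for (const struct file_lock_model *fl = inode->i_flock; fl != NULL;
         fl = fl->fl_next)
        n++;
    return n;
}
```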
An FL_FLOCK lock is always associated with a file object and is thus maintained by a particular process (or clone processes sharing the same opened file). When a lock is requested and granted, the kernel replaces any other lock that the process is holding on the same file object.
This happens only when a process wants to change an already owned read lock into a write one, or vice versa. Moreover, when a file object is being freed by the fput( ) function, all FL_FLOCK locks that refer to the file object are destroyed. However, there could be other FL_FLOCK read locks set by other processes for the same file (inode), and they still remain active.
The flock( ) system call acts on two parameters: the fd file descriptor of the file to be acted upon and a cmd parameter that specifies the lock operation. A cmd parameter of LOCK_SH requires a shared lock for reading, LOCK_EX requires an exclusive lock for writing, and LOCK_UN releases the lock. If the LOCK_NB value is ORed to the LOCK_SH or LOCK_EX operation, the system call does not block; in other words, if the lock cannot be immediately obtained, the system call returns an error code. Note that it is not possible to specify a region inside the file—the lock always applies to the whole file.
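From user space, a nonblocking exclusive lock might be requested as sketched below. The helper names and the return-value convention are assumptions made for this example; flock( ), LOCK_EX, LOCK_NB, and LOCK_UN are the interfaces described above.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>   /* close( ) and unlink( ) for callers */

/* Hypothetical helper: opens (creating it if needed) a file used as a lock. */
static int open_lock_file(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Tries to take an exclusive lock on the whole file without blocking.
 * Returns 0 on success, 1 if a conflicting lock is held through another
 * file object, and -1 on any other error (errno is left set). */
static int try_exclusive_lock(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 0;
    return (errno == EWOULDBLOCK) ? 1 : -1;
}

/* Releases the lock held through fd. Returns 0 on success, -1 on error. */
static int release_lock(int fd)
{
    return flock(fd, LOCK_UN);
}
```

Because the lock belongs to the file object, two descriptors obtained from separate open( ) calls on the same file conflict with each other, even inside a single process.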
When the sys_flock( ) service routine is invoked, it performs the following steps:
1. Checks whether fd is a valid file descriptor; if not, returns an error code. Gets the address of the corresponding file object.
2. If the process has to acquire an advisory lock, checks that the process has both read and write permission on the open file; if not, returns an error code.
3. Invokes flock_lock_file( ), passing as parameters the file object pointer filp, the type of lock operation required (type), and a flag wait. This last parameter is set if the system call should block (LOCK_NB clear) and cleared otherwise (LOCK_NB set). This function performs, in turn, the following actions:
a. If the lock must be acquired, gets a new file_lock object and fills it with the appropriate lock operation.
b. Searches the list that filp->f_dentry->d_inode->i_flock points to. If an FL_FLOCK lock for the same file object is found and an unlock operation is required, removes the file_lock element from the inode list and the global list, wakes up all processes sleeping in the lock's wait queue, frees the file_lock structure, and returns.
c. Otherwise, searches the inode list again to verify that no existing FL_FLOCK lock conflicts with the requested one. There must be no FL_FLOCK write lock in the inode list, and moreover, there must be no FL_FLOCK lock at all if the process is requesting a write lock. However, a process may want to change the type of lock it already owns; this is done by issuing a second flock( ) system call. Therefore, the kernel always allows the process to change locks that refer to the same file object. If a conflicting lock is found and the LOCK_NB flag was specified, the function returns an error code; otherwise, it inserts the current process in the circular list of blocked processes and suspends it.
d. If no incompatibility exists, inserts the file_lock structure into the global lock list and the inode list, and then returns 0 (success).
4. Returns the return code of flock_lock_file( ).
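The conflict test of step 3c can be modeled in user space as follows. This is a simplified sketch, not the kernel's code: the structure and function names are hypothetical, and a plain pointer stands in for the identity of the file object.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reduced model of an FL_FLOCK lock on an inode's list. */
struct flock_model {
    const void *file;            /* identity of the owning file object */
    bool is_write;               /* true for LOCK_EX, false for LOCK_SH */
    struct flock_model *next;    /* next lock on the same inode */
};

/* Returns true if a request made through `file` conflicts with a lock
 * already on the list. Locks owned by the same file object are skipped,
 * because the requester is allowed to change its own lock. */
static bool flock_model_conflicts(const struct flock_model *list,
                                  const void *file, bool want_write)
{
    for (const struct flock_model *fl = list; fl != NULL; fl = fl->next) {
        if (fl->file == file)
            continue;
        if (fl->is_write || want_write)
            return true;         /* write locks exclude everything else */
    }
    return false;
}
```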
An FL_POSIX lock is always associated with a process and with an inode; the lock is automatically released either when the process dies or when a file descriptor is closed (even if the process opened the same file twice or duplicated a file descriptor). Moreover, FL_POSIX locks are never inherited by the child across a fork( ).
When used to lock files, the fcntl( ) system call acts on three parameters: the fd file descriptor of the file to be acted upon, a cmd parameter that specifies the lock operation, and an fl pointer to a flock data structure. Version 2.4 of Linux also defines a flock64 structure, which uses 64-bit fields for the file offset and length fields. In the following, we focus on the flock data structure, but the description is valid for flock64 too.
Locks of type FL_POSIX are able to protect an arbitrary file region, even a single byte. The region is specified by three fields of the flock structure. l_start is the initial offset of the region and is relative to the beginning of the file (if field l_whence is set to SEEK_SET), to the current file pointer (if l_whence is set to SEEK_CUR), or to the end of the file (if l_whence is set to SEEK_END). The l_len field specifies the length of the file region (or 0, which means that the region includes all potential writes past the current end of the file).
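As an illustration, the three region fields might be filled in as sketched below. The helper name lock_region is hypothetical; the example passes the structure to fcntl( ) with the nonblocking F_SETLK command and uses the standard F_RDLCK, F_WRLCK, and F_UNLCK lock types.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Applies a lock of the given type (F_RDLCK, F_WRLCK, or F_UNLCK) to the
 * region that starts at `start` relative to `whence` (SEEK_SET, SEEK_CUR,
 * or SEEK_END) and spans `len` bytes. A `len` of 0 extends the region to
 * the end of the file and beyond. Returns 0 on success, -1 on error. */
static int lock_region(int fd, int type, int whence, off_t start, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = (short)type;
    fl.l_whence = (short)whence;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl);
}
```

With this helper, `lock_region(fd, F_WRLCK, SEEK_SET, 0, 1)` write-locks the first byte only, while `lock_region(fd, F_WRLCK, SEEK_SET, 0, 0)` write-locks the whole file, including data appended later.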
The sys_fcntl( ) service routine behaves differently, depending on the value of the flag set in the cmd parameter:
F_GETLK
Determines whether the lock described by the flock structure conflicts with some FL_POSIX lock already obtained by another process. If it does, the flock structure is overwritten with the information about the existing lock.
F_SETLK
Sets the lock described by the flock structure. If the lock cannot be acquired, the system call returns an error code.
F_SETLKW
Sets the lock described by the flock structure. If the lock cannot be acquired, the system call blocks; that is, the calling process is put to sleep.
F_GETLK64, F_SETLK64, F_SETLKW64
Identical to the previous ones, but the flock64 data structure is used rather than flock.
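The difference between setting and querying a lock can be seen in a two-process sketch. All helper names are hypothetical. The parent takes a write lock with F_SETLK; the child, which does not inherit the lock across fork( ), asks with F_GETLK whether a write lock on an overlapping byte would conflict. When no conflict exists, F_GETLK reports it by setting l_type to F_UNLCK.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describes `len` bytes starting at offset `start` from the file start. */
static struct flock make_lock(short type, off_t start, off_t len)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fl;
}

/* F_GETLK: returns 1 and stores the owner's PID if another process holds
 * a conflicting lock, 0 if the lock could be placed, -1 on error. */
static int query_lock(int fd, short type, off_t start, off_t len, pid_t *owner)
{
    struct flock fl = make_lock(type, start, len);
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    if (fl.l_type == F_UNLCK)
        return 0;
    *owner = fl.l_pid;
    return 1;
}

/* Returns 1 if a child process sees the parent's write lock, 0 if it does
 * not, and -1 if the demonstration could not be set up. */
static int child_sees_parent_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    struct flock fl = make_lock(F_WRLCK, 0, 10);
    pid_t parent = getpid();
    pid_t child;
    int status = 0;

    if (fcntl(fd, F_SETLK, &fl) == -1 || (child = fork()) == -1) {
        close(fd);
        return -1;
    }
    if (child == 0) {
        pid_t owner = 0;
        int r = query_lock(fd, F_WRLCK, 5, 1, &owner);
        _exit((r == 1 && owner == parent) ? 0 : 1);
    }
    if (waitpid(child, &status, 0) == -1)
        status = -1;
    close(fd);                       /* closing also drops the parent's lock */
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```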
When sys_fcntl( ) acquires a lock, it performs the following:
1. Reads the flock structure from user space.
2. Gets the file object corresponding to fd.
3. Checks whether the lock should be a mandatory one and the file has a shared memory mapping (see Chapter 15). In this case, refuses to create the lock and returns the -EAGAIN error code; the file is already being accessed by another process.
4. Initializes a new file_lock structure according to the contents of the user's flock structure.
5. Terminates, returning an error code, if the file does not allow the access mode required by the type of the requested lock.
6. Invokes the lock method of the file operations, if defined.
7. Invokes the posix_lock_file( ) function, which executes the following actions:
a. Invokes posix_locks_conflict( ) for each FL_POSIX lock in the inode's lock list. The function checks whether the lock conflicts with the requested one. Essentially, there must be no FL_POSIX write lock for the same region in the inode list, and there may be no FL_POSIX lock at all for the same region if the process is requesting a write lock. However, locks owned by the same process never conflict; this allows a process to change the characteristics of a lock it already owns.
b. If a conflicting lock is found and fcntl( ) was invoked with the F_SETLK or F_SETLK64 flag, returns an error code. Otherwise, the current process should be suspended. In this case, invokes posix_locks_deadlock( ) to check that no deadlock condition is being created among processes waiting for FL_POSIX locks, and then inserts the current process in the circular list of blocked processes and suspends it.
c. As soon as the inode's lock list includes no conflicting lock, checks all the FL_POSIX locks of the current process that overlap the file region that the current process wants to lock, and combines and splits adjacent areas as required. For example, if the process requested a write lock for a file region that falls inside a read-locked wider region, the previous read lock is split into two parts covering the nonoverlapping areas, while the central region is protected by the new write lock. In case of overlaps, newer locks always replace older ones.
d. Inserts the new file_lock structure in the global lock list and in the inode list.
8. Returns the value 0 (success).
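The conflict rule of step 7a and the splitting of step 7c can be modeled in user space as sketched below. This is an assumption-laden illustration rather than the kernel's implementation: the structure, its fields, and the function names are hypothetical, and the owner is reduced to an integer identifier.

```c
#include <stdbool.h>

/* Reduced model of an FL_POSIX lock on a byte range. */
struct region_lock {
    long start;      /* first byte covered */
    long end;        /* last byte covered (inclusive) */
    bool is_write;   /* true for a write lock, false for a read lock */
    int owner;       /* hypothetical owner identifier, such as a PID */
};

static bool regions_overlap(const struct region_lock *a,
                            const struct region_lock *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Locks of the same owner never conflict; otherwise two overlapping locks
 * conflict as soon as one of them is a write lock. */
static bool model_conflict(const struct region_lock *held,
                           const struct region_lock *req)
{
    if (held->owner == req->owner || !regions_overlap(held, req))
        return false;
    return held->is_write || req->is_write;
}

/* Computes what is left of the owner's older lock `old` once the newer lock
 * `req` replaces the overlapping part. Writes up to two leftover pieces
 * into out[] and returns how many were written. */
static int model_split(const struct region_lock *old,
                       const struct region_lock *req,
                       struct region_lock out[2])
{
    int n = 0;

    if (!regions_overlap(old, req)) {
        out[n++] = *old;                 /* untouched */
        return n;
    }
    if (req->start > old->start) {       /* piece before the new lock */
        out[n] = *old;
        out[n].end = req->start - 1;
        n++;
    }
    if (req->end < old->end) {           /* piece after the new lock */
        out[n] = *old;
        out[n].start = req->end + 1;
        n++;
    }
    return n;
}
```

With a read lock on bytes 0 to 99 and a new write lock on bytes 40 to 59 from the same owner, model_split( ) leaves two read-locked pieces, 0 to 39 and 60 to 99, around the write-locked center.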