
2.9 Synchronizing Resource Access Across Processes on Unix

2.9.1 Problem

You want to ensure that two processes cannot simultaneously access the same resource, such as a segment of shared memory.

2.9.2 Solution

Use a lock file to signal that you are accessing the resource.

2.9.3 Discussion

Using a lock file to synchronize access to shared resources is not as simple as it sounds. Suppose that your program creates a lock file and then crashes. If this happens, the lock file will remain, and your program (as well as any other program that attempted to obtain the lock) will fail until someone manually removes the lock file. Obviously, this is undesirable. The solution is to store the process ID of the process holding the lock in the lock file. Other processes attempting to obtain the lock can then test to see whether the process holding the lock still exists. If it does not, the lock file is stale, it is safe to remove, and you can make another attempt to obtain the lock.

Unfortunately, this solution is still not a perfect one. What happens if another process is assigned the same ID as the one stored in the stale lock file? In that case, the lock appears to be held by a live process, so no process can obtain it until the unrelated process that was assigned the recycled ID terminates or someone manually removes the lock file. Fortunately, this case should not be encountered frequently.

As a result of solving the stale lock problem, a new problem arises: there is now a race condition between the time the check for the existence of the process holding the lock is performed and the time the lock file is removed. The solution to this problem is to attempt to reopen the lock file after writing the new one to make sure that the process ID in the lock file is the same as the locking process's ID. If it is, the lock is successfully obtained.

The function presented below, spc_lock_file( ), requires a single argument: the name of the file to be used as the lock file. You must store the lock file in a "safe" directory (see Recipe 2.4) on a local filesystem. Network filesystems—versions of NFS older than Version 3 in particular—may not necessarily support the O_EXCL flag to open( ). Further, because the ID of the process holding the lock is stored in the lock file and process IDs are not shared across machines, testing for the presence of the process holding the lock would be unreliable at best if the lock file were stored on a network filesystem.
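The reliance on O_EXCL can be illustrated in isolation. The following is a minimal sketch, not part of the recipe's code; the helper name try_create_exclusive( ) is hypothetical. It shows that when open( ) is called with both O_CREAT and O_EXCL, the call fails with EEXIST if the file already exists, so only one caller can ever create a given lock file.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from the recipe): returns 1 if this call created
 * the file, 0 if the file already existed, and -1 on any other error.
 */
static int try_create_exclusive(const char *path) {
  /* O_CREAT | O_EXCL makes "check for existence" and "create" one step. */
  int fd = open(path, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);

  if (fd == -1) return (errno == EEXIST) ? 0 : -1;
  close(fd);
  return 1;
}
```

Calling the helper twice with the same path should return 1 the first time and 0 the second, provided nothing removes the file in between.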

Three attempts are made to obtain the lock, with a pause of one second between attempts. If the lock still cannot be obtained after the third attempt, the function returns 0. If an error occurs while attempting to obtain the lock, it returns -1. If the lock is successfully obtained, it returns 1.

#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <errno.h>
#include <limits.h>
#include <signal.h>
   
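/* Read exactly nbytes from fd into buf.  Returns 1 on success, 0 on error.
 * Interrupted and short reads are retried.  A read() that returns 0 (end of
 * file) is retried as well, so this function does not return until nbytes
 * have been read or an error occurs.
 */
static int read_data(int fd, void *buf, size_t nbytes) {
  size_t  toread, nread = 0;
  ssize_t result;

  do {
    /* Never ask for more than read() can report in its return value. */
    if (nbytes - nread > SSIZE_MAX) toread = SSIZE_MAX;
    else toread = nbytes - nread;
    if ((result = read(fd, (char *)buf + nread, toread)) >= 0)
      nread += result;
    else if (errno != EINTR) return 0;
  } while (nread < nbytes);
  return 1;
}

/* Write exactly nbytes from buf to fd.  Returns 1 on success, 0 on error.
 * Interrupted and short writes are retried.
 */
static int write_data(int fd, const void *buf, size_t nbytes) {
  size_t  towrite, written = 0;
  ssize_t result;

  do {
    /* Never ask for more than write() can report in its return value. */
    if (nbytes - written > SSIZE_MAX) towrite = SSIZE_MAX;
    else towrite = nbytes - written;
    if ((result = write(fd, (const char *)buf + written, towrite)) >= 0)
      written += result;
    else if (errno != EINTR) return 0;
  } while (written < nbytes);
  return 1;
}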

The two functions read_data( ) and write_data( ) are helper functions that ensure that all the requested data is read or written. If the system calls for reading or writing are interrupted by a signal, they are retried. Because so little data is involved, it should normally be transferred in a single call, but that is not guaranteed. The helper functions therefore loop until the full amount has been read or written.

int spc_lock_file(const char *lfpath) {
  int   attempt, fd, result;
  pid_t pid;
   
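  /* Try three times, if we fail that many times, we lose */
  for (attempt = 0;  attempt < 3;  attempt++) {
    if ((fd = open(lfpath, O_RDWR | O_CREAT | O_EXCL, S_IRWXU)) == -1) {
      if (errno != EEXIST) return -1;
      /* The lock file already exists; find out which process holds it. */
      if ((fd = open(lfpath, O_RDONLY)) == -1) return -1;
      result = read_data(fd, &pid, sizeof(pid));
      close(fd);
      if (result) {
        if (pid == getpid()) return 1;  /* the lock file holds our ID */
        if (kill(pid, 0) == -1) {
          if (errno != ESRCH) return -1;
          /* The holder no longer exists, so the lock is stale.  Remove it
           * and retry without counting this pass as an attempt.
           */
          attempt--;
          unlink(lfpath);
          continue;
        }
      }
      sleep(1);
      continue;
    }

    /* We created the lock file.  Record our ID in it, then go around the
     * loop again to verify that the file still holds our ID.
     */
    pid = getpid();
    if (!write_data(fd, &pid, sizeof(pid))) {
      close(fd);
      return -1;
    }
    close(fd);
    attempt--;
  }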
   
  /* If we've made it to here, three attempts have been made and the lock could
   * not be obtained.  Return an error code indicating failure to obtain the
   * requested lock.
   */
  return 0;
}

The first step in attempting to obtain the lock is to try to create the lock file. If this succeeds, the caller's process ID is written to the file, the file is closed, and the loop is executed again. The loop counter is decremented first to ensure that at least one more iteration will always occur. The next time through the loop, creating the file should fail because the file now exists, but it won't necessarily do so: another process that was trying to take over a stale lock at the same time may have deleted the lock file out from under this process. If this happens, the whole process begins again.

If the lock file cannot be created, the lock file is opened for reading, and the ID of the process holding the lock is read from the file. read_data( ) does not return until a complete process ID has been read, so if another process has created the file but has not yet finished writing its ID, the caller waits until it has. (On a regular file, read( ) returns 0 at end-of-file rather than blocking, and the helper simply retries.) This leaves another race condition: if a process creates the lock file without ever writing any data to it, the wait never ends. It could be avoided by retrying the read only until a timeout expires and then treating the incomplete lock as stale. The situation could be caused by an attacker, or it could occur because the process is terminated at precisely the right time so that it doesn't get the chance to write its ID to the lock file.
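The timeout idea mentioned above might be sketched as follows. This is an illustration under stated assumptions rather than part of the recipe: the helper name read_pid_bounded( ), its max_wait parameter, and the one-second polling interval are all hypothetical. It rereads the lock file from the beginning on each pass and gives up once the allowed time has elapsed, leaving the caller to decide whether to treat the incomplete lock as stale.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the recipe): try to read a complete pid_t
 * from the lock file, polling once per second for at most max_wait seconds.
 * Returns 1 with *pid set, 0 if the file is still incomplete when the time
 * runs out, and -1 on error.
 */
static int read_pid_bounded(const char *path, pid_t *pid, unsigned max_wait) {
  unsigned waited;
  ssize_t  n;
  int      fd;

  if ((fd = open(path, O_RDONLY)) == -1) return -1;
  for (waited = 0;  ;  waited++) {
    /* Start from the beginning each time so a partial ID is never kept. */
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1) break;
    n = read(fd, pid, sizeof(*pid));
    if (n == (ssize_t)sizeof(*pid)) {
      close(fd);
      return 1;
    }
    if (n == -1 && errno != EINTR) break;
    if (waited >= max_wait) {
      close(fd);
      return 0;  /* still incomplete: the caller may treat it as stale */
    }
    sleep(1);
  }
  close(fd);
  return -1;
}
```

A return value of 0 only reports that no complete ID appeared in time. Whether the lock file should then be removed is a policy decision that this sketch leaves to the caller.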

Once the process ID is read from the lock file, kill( ) is called with a signal number of 0. No signal is actually delivered; only the existence and permission checks are performed. If the process does not exist, the call to kill( ) will return failure, and errno will be set to ESRCH. If this happens, the lock file is stale, and it can be removed. This is where the race condition discussed earlier occurs. The lock file is removed, the attempt counter is decremented, and the loop is restarted. Any other failure from kill( ), including EPERM (the process exists but belongs to another user), is treated as an error, and the function returns -1.
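The existence test can be pulled out into a small function to make its outcomes explicit. The sketch below is not the recipe's code: the name process_exists( ) is hypothetical, and it makes two choices that differ from spc_lock_file( ). It reports EPERM as "the process exists" instead of as an error, and it rejects IDs that are zero or negative, because kill( ) interprets those as process groups rather than as a single process.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper (not from the recipe): returns 1 if a process with
 * the given ID exists, 0 if it does not, and -1 if the check could not be
 * performed.
 */
static int process_exists(pid_t pid) {
  /* Zero and negative values address process groups, not one process. */
  if (pid <= 0) return -1;
  if (kill(pid, 0) == 0) return 1;   /* exists, and we may signal it */
  if (errno == EPERM) return 1;      /* exists, but we may not signal it */
  if (errno == ESRCH) return 0;      /* no such process: the lock is stale */
  return -1;
}
```

As the discussion of recycled process IDs points out, a result of 1 shows only that some process currently has that ID, not that it is the process that created the lock file.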

Between the time that kill( ) returns failure with an ESRCH error code and the time that unlink( ) is called to remove the lock file, another process could itself delete the stale lock file and begin creating a new one. If this happens, the call to unlink( ) removes the other process's new lock file rather than the stale one. The other process will still successfully write its process ID to the now deleted file and would assume that it has the lock. It does not have the lock, though, because the file it wrote to no longer exists under the lock file's name. For this reason, after the lock file is created, the process must attempt to read the lock file and compare process IDs. If the process ID in the lock file is the same as that of the process making the comparison, the lock was successfully obtained.
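spc_lock_file( ) performs this comparison implicitly by going around its loop one more time. As a rough sketch of the check on its own, the hypothetical helper lock_is_mine( ) below (not part of the recipe) reopens the lock file by name and compares the stored ID with the caller's. Opening by name matters: a descriptor kept from the original open( ) would still refer to the deleted file.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the recipe): returns 1 if the lock file
 * currently holds the caller's process ID, 0 if it holds a different ID or
 * is incomplete, and -1 if the file cannot be opened or read.
 */
static int lock_is_mine(const char *path) {
  pid_t   stored;
  ssize_t n;
  int     fd;

  /* Reopen by name so that we see whatever file is at that path now. */
  if ((fd = open(path, O_RDONLY)) == -1) return -1;
  n = read(fd, &stored, sizeof(stored));
  close(fd);
  if (n == -1) return -1;
  if (n != (ssize_t)sizeof(stored)) return 0;
  return stored == getpid();
}
```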

2.9.4 See Also

Recipe 2.4
