For the sake of brevity, we cannot discuss the implementation of all the VFS system calls listed in Table 12-1. However, it could be useful to sketch out the implementation of a few system calls, just to show how VFS's data structures interact.
Let's reconsider the example proposed at the beginning of this chapter: a user issues a shell command that copies the MS-DOS file /floppy/TEST to the Ext2 file /tmp/test. The command shell invokes an external program like cp, which we assume executes the following code fragment:
inf = open("/floppy/TEST", O_RDONLY, 0);
outf = open("/tmp/test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
do {
    len = read(inf, buf, 4096);
    write(outf, buf, len);
} while (len);
close(outf);
close(inf);
Actually, the code of the real cp program is more complicated, since it must also check for possible error codes returned by each system call. In our example, we just focus our attention on the "normal" behavior of a copy operation.
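To give an idea of what the extra checks might look like, here is a minimal sketch of the same copy loop with error handling. It is not the code of the real cp program; the helper names copy_fd and copy_file are hypothetical, and the paths are taken as parameters rather than hardcoded.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copy everything readable from descriptor `in`
 * to descriptor `out`. Returns 0 on success, -1 on error. */
int copy_fd(int in, int out)
{
    char buf[4096];

    for (;;) {
        ssize_t len = read(in, buf, sizeof buf);
        if (len == 0)
            return 0;                 /* end of file: nothing left to copy */
        if (len < 0) {
            if (errno == EINTR)
                continue;             /* interrupted before any data: retry */
            return -1;
        }
        /* write() may transfer fewer bytes than requested, so loop. */
        ssize_t done = 0;
        while (done < len) {
            ssize_t n = write(out, buf + done, (size_t)(len - done));
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += n;
        }
    }
}

/* Hypothetical wrapper mirroring the fragment above. */
int copy_file(const char *src, const char *dst)
{
    int inf = open(src, O_RDONLY, 0);
    if (inf < 0)
        return -1;
    int outf = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (outf < 0) {
        close(inf);
        return -1;
    }
    int rc = copy_fd(inf, outf);
    if (close(outf) < 0)              /* close() can also report an error */
        rc = -1;
    close(inf);
    return rc;
}
```

A call such as copy_file("/floppy/TEST", "/tmp/test") would then correspond to the example, returning 0 on success and -1 if any of the system calls fails.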
The open( ) system call is serviced by the sys_open( ) function, which receives as parameters the pathname filename of the file to be opened, some access mode flags flags, and a permission bit mask mode if the file must be created. If the system call succeeds, it returns a file descriptor—that is, the index assigned to the new file in the current->files->fd array of pointers to file objects; otherwise, it returns -1.
In our example, open( ) is invoked twice; the first time to open /floppy/TEST for reading (O_RDONLY flag) and the second time to open /tmp/test for writing (O_WRONLY flag). If /tmp/test does not already exist, it is created (O_CREAT flag) with exclusive read and write access for the owner (octal 0600 number in the third parameter).
Conversely, if the file already exists, it is rewritten from scratch (O_TRUNC flag). Table 12-17 lists all flags of the open( ) system call.
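The effect of these flags can be observed from User Mode. The following sketch opens a file with the same flags and mode used for /tmp/test and reports the file size seen right after the open; the helper name size_after_open is hypothetical and exists only for illustration.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open `path` the way the example opens /tmp/test
 * and return the file size observed right after the open (-1 on error). */
long size_after_open(const char *path)
{
    struct stat st;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long)st.st_size;
}
```

Whether the file already existed with some contents (O_TRUNC) or had to be created (O_CREAT), the reported size should be 0. For a newly created file, the permission bits are 0600 further restricted by the umask of the process, so no group or other access is ever granted.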
Let's describe the operation of the sys_open( ) function. It performs the following steps:
1. Invokes getname( ) to read the file pathname from the process address space.
2. Invokes get_unused_fd( ) to find an empty slot in current->files->fd. The corresponding index (the new file descriptor) is stored in the fd local variable.
3. Invokes the filp_open( ) function, passing as parameters the pathname, the access mode flags, and the permission bit mask. This function, in turn, executes the following steps:
a. Copies the access mode flags into namei_flags, but encodes the access mode flags O_RDONLY, O_WRONLY, and O_RDWR in the format expected by the pathname lookup functions (see the earlier Section 12.5).
b. Invokes open_namei( ), passing to it the pathname, the modified access mode flags, and the address of a local nameidata data structure. The function performs the lookup operation in the following manner:
- If O_CREAT is not set in the access mode flags, starts the lookup operation with the LOOKUP_PARENT flag not set. Moreover, the LOOKUP_FOLLOW flag is set only if O_NOFOLLOW is cleared, while the LOOKUP_DIRECTORY flag is set only if the O_DIRECTORY flag is set.
- If O_CREAT is set in the access mode flags, starts the lookup operation with the LOOKUP_PARENT flag set. Once the path_walk( ) function successfully returns, checks whether the requested file already exists. If not, allocates a new disk inode by invoking the create method of the parent inode.
The open_namei( ) function also executes several security checks on the file located by the lookup operation. For instance, the function checks whether the inode associated with the dentry object found really exists, whether it is a regular file, and whether the current process is allowed to access it according to the access mode flags. Also, if the file is opened for writing, the function checks that the file is not locked by other processes.
c. Invokes the dentry_open( ) function, passing to it the access mode flags and the addresses of the dentry object and the mounted filesystem object located by the lookup operation. In turn, this function:
1. Allocates a new file object.
2. Initializes the f_flags and f_mode fields of the file object according to the access mode flags passed to the open( ) system call.
3. Initializes the f_dentry and f_vfsmnt fields of the file object according to the addresses of the dentry object and the mounted filesystem object passed as parameters.
4. Sets the f_op field to the contents of the i_fop field of the corresponding inode object. This sets up all the methods for future file operations.
5. Inserts the file object into the list of opened files pointed to by the s_files field of the filesystem's superblock.
6. If the O_DIRECT flag is set, preallocates a direct access buffer (see Section 15.3).
7. If the open method of the file operations is defined, invokes it.
d. Returns the address of the file object.
4. Sets current->files->fd[fd] to the address of the file object returned by dentry_open( ).
5. Returns fd.
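The interplay between the descriptor table and the file object in the steps above can be sketched with a small User Mode model. This is not kernel code: all toy_ names are hypothetical, the pathname lookup and the security checks are left out, and the computation of f_mode assumes the Linux numeric values of O_RDONLY, O_WRONLY, and O_RDWR (0, 1, and 2).

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_OPEN 8                 /* hypothetical table size */

struct toy_file {
    int f_flags;                      /* flags as passed to open( ) */
    int f_mode;                       /* bit 0: readable, bit 1: writable */
    const char *f_name;               /* stands in for f_dentry and f_vfsmnt */
};

struct toy_files {
    struct toy_file *fd[TOY_NR_OPEN]; /* models current->files->fd */
};

/* Models step 2: return the lowest empty slot, or -1 if the table is full. */
int toy_get_unused_fd(const struct toy_files *files)
{
    for (int i = 0; i < TOY_NR_OPEN; i++)
        if (files->fd[i] == NULL)
            return i;
    return -1;
}

/* Models step 3c: allocate and initialize a file object. */
struct toy_file *toy_dentry_open(const char *name, int flags)
{
    struct toy_file *f = calloc(1, sizeof *f);
    if (f == NULL)
        return NULL;
    f->f_flags = flags;
    /* Access modes 0, 1, 2 become 1, 2, 3: separate read and write bits. */
    f->f_mode = (flags & O_ACCMODE) + 1;
    f->f_name = name;
    return f;
}

/* Models steps 2 to 5 of the description above. */
int toy_open(struct toy_files *files, const char *name, int flags)
{
    int fd = toy_get_unused_fd(files);
    if (fd < 0)
        return -1;
    struct toy_file *f = toy_dentry_open(name, flags);
    if (f == NULL)
        return -1;
    files->fd[fd] = f;                /* step 4: install the file object */
    return fd;                        /* step 5 */
}
```

Starting from an empty table, two consecutive toy_open( ) calls for /floppy/TEST and /tmp/test would yield descriptors 0 and 1 in this model. In a real process the two files usually get descriptors 3 and 4, because 0, 1, and 2 are normally already in use.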
Let's return to the code in our cp example. The open( ) system calls return two file descriptors, which are stored in the inf and outf variables. Then the program starts a loop: at each iteration, a portion of the /floppy/TEST file is copied into a local buffer (read( ) system call), and then the data in the local buffer is written into the /tmp/test file (write( ) system call).
The read( ) and write( ) system calls are quite similar. Both require three parameters: a file descriptor fd, the address buf of a memory area (the buffer containing the data to be transferred), and a number count that specifies how many bytes should be transferred. Of course, read( ) transfers the data from the file into the buffer, while write( ) does the opposite. Both system calls return either the number of bytes that were successfully transferred or -1 to signal an error condition.
A return value less than count does not mean that an error occurred. The kernel is always allowed to terminate the system call even if not all requested bytes were transferred, so the user application must check the return value and, if necessary, reissue the system call. Typically, a smaller value is returned when reading from a pipe or a terminal device, when reading past the end of the file, or when the system call is interrupted by a signal. The end-of-file (EOF) condition is easily recognized by a return value of 0 from read( ). This condition cannot be confused with an abnormal termination due to a signal, because if read( ) is interrupted by a signal before any data is read, an error occurs.
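A common way for an application to cope with short transfers is to reissue the system call in a loop. The following is a minimal sketch of that idea for read( ); the helper name read_full is hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until `count` bytes have
 * arrived, end of file is reached, or an error occurs.
 * Returns the number of bytes read (less than `count` only at EOF),
 * or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, (char *)buf + total, count - total);
        if (n == 0)
            break;                    /* EOF: return what we have so far */
        if (n < 0) {
            if (errno == EINTR)
                continue;             /* interrupted before any data: retry */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

With this helper, a caller reading from a pipe gets either all the requested bytes or whatever was left before the writer closed its end, and a return value of 0 still denotes end of file.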
The read or write operation always takes place at the file offset specified by the current file pointer (field f_pos of the file object). Both system calls update the file pointer by adding the number of transferred bytes to it.
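The update of the file pointer can be observed from User Mode by asking for the current offset with lseek( ) after a transfer. The sketch below assumes a regular file; the helper name offset_after_read is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path`, read up to `n` bytes (at most 64)
 * from its beginning, and return the resulting file offset (-1 on error). */
long offset_after_read(const char *path, size_t n)
{
    char buf[64];
    int fd;
    ssize_t got;
    off_t pos;

    if (n > sizeof buf)
        n = sizeof buf;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    got = read(fd, buf, n);
    pos = lseek(fd, 0, SEEK_CUR);     /* query the offset without moving it */
    close(fd);
    if (got < 0 || pos == (off_t)-1)
        return -1;
    return (long)pos;
}
```

For a 10-byte file, reading 4 bytes should leave the offset at 4, while asking for 64 bytes should leave it at 10, since only the bytes actually transferred are added to the file pointer.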
In short, sys_read( ) (the service routine of read( )) and sys_write( ) (the service routine of write( )) perform almost the same steps:
1. Invoke fget( ) to derive from fd the address file of the corresponding file object and increment the usage counter file->f_count.
2. Check whether the flags in file->f_mode allow the requested access (read or write operation).
3. Invoke locks_verify_area( ) to check whether there are mandatory locks for the file portion to be accessed (see Section 12.7 later in this chapter).
4. Invoke either file->f_op->read or file->f_op->write to transfer the data. Both functions return the number of bytes that were actually transferred. As a side effect, the file pointer is properly updated.
5. Invoke fput( ) to decrement the usage counter file->f_count.
6. Return the number of bytes actually transferred.
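The steps above can be modeled in User Mode with a few simplified structures. The following is only a sketch under stated assumptions, not the kernel's code: all mini_ names are hypothetical, the file contents live in memory, the usage counter is a plain integer, and the mandatory lock check of step 3 is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MINI_FMODE_READ  1
#define MINI_FMODE_WRITE 2

struct mini_file;

struct mini_file_operations {
    long (*read)(struct mini_file *, char *, size_t, long *);
};

struct mini_file {
    int f_count;                              /* usage counter */
    int f_mode;                               /* allowed access */
    long f_pos;                               /* file pointer */
    const struct mini_file_operations *f_op;  /* file methods */
    const char *data;                         /* in-memory file contents */
    size_t size;
};

/* In-memory read method: copy from data at *pos and advance *pos. */
long mini_mem_read(struct mini_file *file, char *buf, size_t count, long *pos)
{
    if ((size_t)*pos >= file->size)
        return 0;                             /* end of file */
    size_t left = file->size - (size_t)*pos;
    if (count > left)
        count = left;
    memcpy(buf, file->data + *pos, count);
    *pos += (long)count;                      /* side effect: update pointer */
    return (long)count;
}

long mini_sys_read(struct mini_file **fd_table, int nfds,
                   int fd, char *buf, size_t count)
{
    long ret;

    if (fd < 0 || fd >= nfds || fd_table[fd] == NULL)
        return -EBADF;
    struct mini_file *file = fd_table[fd];
    file->f_count++;                          /* step 1: models fget() */
    if (!(file->f_mode & MINI_FMODE_READ))    /* step 2: access check */
        ret = -EBADF;
    else if (file->f_op == NULL || file->f_op->read == NULL)
        ret = -EINVAL;
    else                                      /* step 4: call the method */
        ret = file->f_op->read(file, buf, count, &file->f_pos);
    file->f_count--;                          /* step 5: models fput() */
    return ret;                               /* step 6 */
}
```

A write counterpart would look the same except for testing the write bit and calling a write method. Note that the model returns negative error codes, as kernel service routines conventionally do, rather than -1.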
The loop in our example code terminates when the read( ) system call returns the value 0—that is, when all bytes of /floppy/TEST have been copied into /tmp/test. The program can then close the open files, since the copy operation has completed.
The close( ) system call receives as its parameter fd, which is the file descriptor of the file to be closed. The sys_close( ) service routine performs the following operations:
1. Gets the file object address stored in current->files->fd[fd]; if it is NULL, returns an error code.
2. Sets current->files->fd[fd] to NULL. Releases the file descriptor fd by clearing the corresponding bits in the open_fds and close_on_exec fields of current->files (see Chapter 20 for the Close on Execution flag).
3. Invokes filp_close( ), which performs the following operations:
a. Invokes the flush method of the file operations, if defined.
b. Releases any mandatory lock on the file.
c. Invokes fput( ) to release the file object.
4. Returns the error code of the flush method (usually 0).
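The first two steps have a visible consequence in User Mode: once a descriptor has been closed, its slot is empty, so closing it again fails. The following sketch illustrates this; the helper name errno_of_second_close is hypothetical, and the code assumes a single-threaded process, so that no other thread can reuse the descriptor between the two calls.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: close `fd` twice. Returns the errno value set by
 * the second close(), 0 if the second close() unexpectedly succeeded,
 * or -1 if the first close() failed. */
int errno_of_second_close(int fd)
{
    if (close(fd) < 0)
        return -1;
    errno = 0;
    if (close(fd) == 0)
        return 0;
    return errno;
}
```

For a valid descriptor the function is expected to return EBADF. Because the slot was released, a subsequent open( ) in the same process is free to hand out the same descriptor number again.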