12.5 Pathname Lookup

In this section, we illustrate how the VFS derives an inode from the corresponding file pathname. When a process must identify a file, it passes its file pathname to some VFS system call, such as open( ), mkdir( ), rename( ), or stat( ).

The standard procedure for performing this task consists of analyzing the pathname and breaking it into a sequence of filenames. All filenames except the last must identify directories.

If the first character of the pathname is / , the pathname is absolute, and the search starts from the directory identified by current->fs->root (the process root directory). Otherwise, the pathname is relative and the search starts from the directory identified by current->fs->pwd (the process-current directory).

Having in hand the inode of the initial directory, the code examines the entry matching the first name to derive the corresponding inode. Then the directory file that has that inode is read from disk and the entry matching the second name is examined to derive the corresponding inode. This procedure is repeated for each name included in the path.

The dentry cache considerably speeds up the procedure, since it keeps the most recently used dentry objects in memory. As we saw before, each such object associates a filename in a specific directory to its corresponding inode. In many cases, therefore, the analysis of the pathname can avoid reading the intermediate directories from the disk.

However, things are not as simple as they look, since the following Unix and VFS filesystem features must be taken into consideration:

·         The access rights of each directory must be checked to verify whether the process is allowed to read the directory's content.

·         A filename can be a symbolic link that corresponds to an arbitrary pathname; in this case, the analysis must be extended to all components of that pathname.

·         Symbolic links may induce circular references; the kernel must take this possibility into account and break endless loops when they occur.

·         A filename can be the mount point of a mounted filesystem. This situation must be detected, and the lookup operation must continue into the new filesystem.

Pathname lookup is performed by three functions: path_init( ), path_walk( ), and path_release( ). They are always invoked in this exact order.

The path_init( ) function receives three parameters:

name

A pointer to the file pathname to be resolved.

flags

The value of flags that represent how the looked-up file is going to be accessed. The flags are listed in Table 12-17 in the later section Section 12.6.1.[7]

[7] There is, however, a small difference in how the O_RDONLY, O_WRONLY, and O_RDWR flags are encoded. The bit at index 0 (lowest-order) of the flags parameter is set only if the file access requires read privileges; similarly, the bit at index 1 is set only if the file access requires write privileges. Conversely, for the open( ) system call, the value of the O_WRONLY flag is stored in the bit at index 0, while the O_RDWR flag is stored in the bit at index 1; thus, the O_RDONLY flag is true when both bits are cleared. Notice that it is not possible to specify in the open( ) system call that a file access does not require either read or write privileges; this makes sense, however, in a pathname lookup operation involving symbolic links.

nd

The address of a struct nameidata data structure.

The struct nameidata data structure is filled by path_walk( ) with data pertaining to the pathname lookup operation. The fields of this structure are shown in Table 12-14.

Table 12-14. The fields of the nameidata data structure

Type

Field

Description

struct dentry *

dentry

Address of the dentry object

struct vfs_mount *

mnt

Address of the mounted filesystem object

struct qstr

last

Last component of the pathname (used when the LOOKUP_PARENT flag is set)

unsigned int

flags

Lookup flags

int

last_type

Type of last component of the pathname (used when the LOOKUP_PARENT flag is set)

The dentry and mnt fields point respectively to the dentry object and the mounted filesystem object of the last resolved component in the pathname. Once path_walk( ) successfully returns, these two fields "describe" the file that is identified by the given pathname.

The flags field stores the value of some flags used in the lookup operation; they are listed in Table 12-15.

Table 12-15. The flags of the lookup operation

Macro

Description

LOOKUP_FOLLOW

If the last component is a symbolic link, interpret (follow) it.

LOOKUP_DIRECTORY

The last component must be a directory.

LOOKUP_CONTINUE

There are still filenames to be examined in the pathname (used only by NFS).

LOOKUP_POSITIVE

The pathname must identify an existing file.

LOOKUP_PARENT

Look up the directory including the last component of the pathname.

LOOKUP_NOALT

Do not consider the emulated root directory (always set for the 80 x 86 architecture).

The goal of the path_init( ) function consists of initializing the nameidata structure, which it does in the following manner:

1.       Sets the dentry field with the address of the dentry object of the directory where the pathname lookup operation starts. If the pathname is relative (it doesn't start with a slash), the field points to the dentry of the working directory (current->fs->pwd); otherwise it points to the dentry of the root directory of the process (current->fs->root).

2.       Sets the mnt field with the address of the mounted filesystem object relative to the directory where the pathname lookup operation starts: either current->fs->pwdmnt or current->fs->rootmnt, according to whether the pathname is relative or absolute.

3.       Initializes the flags field with the value of the flags parameter.

4.       Initializes the last_type field to LAST_ROOT.

Once path_init( ) initializes the nameidata data structure, the path_walk( ) function takes care of the lookup operation, and stores in the nameidata structure the pointers to the dentry object and mounted filesystem object relative to the last component of the pathname. The function also increments the usage counters of the objects referenced by nd->dentry and nd->mnt so that the caller function may safely access them once path_walk( ) returns. When the caller finishes accessing them, it invokes the third function of the set, path_release( ), which receives as a parameter the address of the nameidata data structure and decrements the two usage counters of nd->dentry and nd->mnt.

We are now ready to describe the core of the pathname lookup operation, namely the path_walk( ) function. It receives as parameters a pointer name to the pathname to be resolved and the address nd of the nameidata data structure. The function initializes to zero the total_link_count of the current process (see the later section Section 12.5.3), and then invokes link_path_walk( ). This latter function acts on the same two parameters of path_walk( ).

To make things a bit easier, we first describe what link_path_walk( ) does when LOOKUP_PARENT is not set and the pathname does not contain symbolic links (standard pathname lookup). Next, we discuss the case in which LOOKUP_PARENT is set: this type of lookup is required when creating, deleting, or renaming a directory entry, that is, during a parent pathname lookup. Finally, we explain how the function resolves the symbolic links.

12.5.1 Standard Pathname Lookup

When the LOOKUP_PARENT flag is cleared, link_path_walk( ) performs the following steps.

1.       Initializes the lookup_flags local variable with nd->flags.

2.       Skips any leading slash ( / ) before the first component of the pathname.

3.       If the remaining pathname is empty, returns the value 0. In the nameidata data structure, the dentry and mnt fields point to the object relative to the last resolved component of the original pathname.

4.       If the link_count field in the descriptor of the current process is positive, sets the LOOKUP_FOLLOW flag in the lookup_flags local variable (see Section 12.5.3).

5.       Executes a cycle that breaks name into components (the intermediate slashes are treated as filename separators); for each component found, the function:

a.       Retrieves the address of the inode object of the last resolved component from nd->dentry->d_inode.

b.       Checks that the permissions of the last resolved component stored into the inode allow execution (in Unix, a directory can be traversed only if it is executable). If the inode has a custom permission method, the function executes it; otherwise, it executes the vfs_permission( ) function, which examines the access mode stored in the i_mode inode field and the privileges of the running process.

c.       Considers the next component to be resolved. From its name, it computes a hash value for the dentry cache hash table.

d.       Skips any trailing slash ( / ) after the slash that terminates the name of the component to be resolved.

e.       If the component to be resolved is the last one in the original pathname, jump to Step 6.

f.        If the name of the component is "." (a single dot), continues with the next component ("." refers to the current directory, so it has no effect inside a pathname).

g.       If the name of the component is ". ." (two dots), tries to climb to the parent directory:

1.       If the last resolved directory is the process's root directory (nd->dentry is equal to current->fs->root and nd->mnt is equal to current->fs->rootmnt), continues with the next component.

2.       If the last resolved directory is the root directory of a mounted filesystem (nd->dentry is equal to nd->mnt->mnt_root), sets nd->mnt to nd->mnt->mnt_parent and nd->dentry to nd->mnt->mnt_mountpoint, and then restarts Step 5.g. (Recall that several filesystems can be mounted on the same mount point).

3.       If the last resolved directory is not the root directory of a mounted filesystem, sets nd->dentry to nd->dentry->d_parent and continues with the next component.

h.       The component name is neither "." nor ". .", so the function must look it up in the dentry cache. If the low-level filesystem has a custom d_hash dentry method, the function invokes it to modify the hash value already computed in Step 5.c.

i.         Invokes cached_lookup( ), passing as parameters nd->dentry, the name of the component to be resolved, the hash value, and the LOOKUP_CONTINUE flag, which specifies that this is not the last component of the pathname. The function invokes d_lookup( ) to search the dentry object of the component in the dentry cache. If cached_lookup( ) fails in finding the dentry in the cache, link_walk_path( ) invokes real_lookup( ) to read the directory from disk and create a new dentry object. In either case, we can assume at the end of this step that the dentry local variable points to the dentry object of the component name to be resolved in this cycle.

j.         Checks whether the component just resolved (dentry local variable) refers to a directory that is a mount point for some filesystem (dentry->d_mounted is set to 1). In this case, invokes lookup_mnt( ), passing to it dentry and nd->mnt, in order to get the address mounted of the child mounted filesystem object. Next, it sets dentry to mounted->mnt_root and nd->mnt to mounted. Then it repeats the whole step (several filesystems can be mounted on the same mount point).

k.       Checks whether the inode object dentry->d_inode has a custom follow_link method. If this is the case, the component is a symbolic link, which is described in the later section Section 12.5.3.

l.         Checks that dentry points to the dentry object of a directory (dentry->d_inode->i_op->lookup method is defined). If not, returns the error -ENOTDIR, because the component is in the middle of the original pathname.

m.     Sets nd->dentry to dentry and continues with the next component of the pathname.

6.       Now all components of the original pathname are resolved except the last one. If the pathname has a trailing slash, it sets the LOOKUP_FOLLOW and LOOKUP_DIRECTORY in the lookup_flags local variable to force interpretation of the last component as a directory name.

7.       Checks the value of the LOOKUP_PARENT flag in the lookup_flags variable. In the following, we assume that the flag is set to 0, and we postpone the opposite case to the next section.

8.       If the name of the last component is "." (a single dot), terminates the execution returning the value 0 (no error). In the nameidata structure that nd points to, the dentry and mnt fields refer to the objects relative to the next-to-last component of the pathname (any component "." has no effect inside a pathname).

9.       If the name of the last component is ". ." (two dots), tries to climb to the parent directory:

a.       If the last resolved directory is the process's root directory (nd->dentry is equal to current->fs->root and nd->mnt is equal to current->fs->rootmnt), terminates the execution returning the value 0 (no error). nd->dentry and nd->mnt refer to the objects relative to the next to the last component of the pathname—that is, to the root directory of the process.

b.       If the last resolved directory is the root directory of a mounted filesystem (nd->dentry is equal to nd->mnt->mnt_root), sets nd->mnt to nd->mnt->mnt_parent and nd->dentry to nd->mnt->mnt_mountpoint, and then restarts Step 5.j.

c.       If the last resolved directory is not the root directory of a mounted filesystem, sets nd->dentry to nd->dentry->d_parent, and terminates the execution returning the value 0 (no error). nd->dentry and nd->mnt refer to the objects relative to the next-to-last component of the pathname.

10.   The name of the last component is neither "." nor ". .", so the function must look it up in the dentry cache. If the low-level filesystem has a custom d_hash dentry method, the function invokes it to modify the hash value already computed in Step 5.c.

11.   Invokes cached_lookup( ), passing as parameters nd->dentry, the name of the component to be resolved, the hash value, and no flag (LOOKUP_CONTINUE is not set because this is the last component of the pathname). If cached_lookup( ) fails in finding the dentry in the cache, it also invokes real_lookup( ) to read the directory from disk and create a new dentry object. In either case, we can assume at the end of this step that the dentry local variable points to the dentry object of the component name to be resolved in this cycle.

12.   Checks whether the component just resolved (dentry local variable) refers to a directory that is a mount point for some filesystem (dentry->d_mounted is set to 1). In this case, invokes lookup_mnt( ), passing to it dentry and nd->mnt, in order to get the address mounted of the child mounted filesystem object. Next, it sets dentry to mounted->mnt_root and nd->mnt to mounted. Then it repeats the whole step (because several filesystems can be mounted on the same mount point).

13.   Checks whether LOOKUP_FOLLOW flag is set in lookup_flags and the inode object dentry->d_inode has a custom follow_link method. If this is the case, the component is a symbolic link that must be interpreted, as described in the later section Section 12.5.3.

14.   Sets nd->dentry with the value stored in the dentry local variable. This dentry object is "the result" of the lookup operation.

15.   Checks whether nd->dentry->d_inode is NULL. This happens when there is no inode associated with the dentry object, usually because the pathname refers to a nonexisting file. In this case:

a.       If either LOOKUP_POSITIVE or LOOKUP_DIRECTORY is set in lookup_flags, it terminates, returning the error code -ENOENT.

b.       Otherwise, it terminates returning the value 0 (no error). nd->dentry points to the negative dentry object created by the lookup operation.

16.   There is an inode associated with the last component of the pathname. If the LOOKUP_DIRECTORY flag is set in lookup_flags, checks that the inode has a custom lookup method—that is, it is a directory. If not, terminates returning the error code -ENOTDIR.

17.   Terminates returning the value 0 (no error). nd->dentry and nd->mnt refer to the last component of the pathname.

12.5.2 Parent Pathname Lookup

In many cases, the real target of a lookup operation is not the last component of the pathname, but the next-to-last one. For example, when a file is created, the last component denotes the filename of the not yet existing file, and the rest of the pathname specifies the directory in which the new link must be inserted. Therefore, the lookup operation should fetch the dentry object of the next-to-last component. For another example, unlinking a file identified by the pathname /foo/bar consists of removing bar from the directory foo. Thus, the kernel is really interested in accessing the file directory foo rather than bar.

The LOOKUP_PARENT flag is used whenever the lookup operation must resolve the directory containing the last component of the pathname, rather than the last component itself.

When the LOOKUP_PARENT flag is set, the path_walk( ) function also sets up the last and last_type fields of the nameidata data structure. The last field stores the name of the last component in the pathname. The last_type field identifies the type of the last component; it may be set to one of the values shown in Table 12-16.

Table 12-16. The values of the last_type field in the nameidata data structure

Value

Description

LAST_NORM

Last component is a regular filename

LAST_ROOT

Last component is "/ " (that is, the entire pathname is "/ ")

LAST_DOT

Last component is "."

LAST_DOTDOT

Last component is ". ."

LAST_BIND

Last component is a symbolic link into a special filesystem

The LAST_ROOT flag is the default value set by path_init( ) when the whole pathname lookup operation starts (see the description at the beginning of Section 12.5). If the pathname turns out to be just "/ ", the kernel does not change the initial value of the last_type field. The LAST_BIND flag is set by the follow_link inode object's method of symbolic links in special filesystems (see the next section).

The remaining values of the last_type field are set by link_path_walk( ) when the LOOKUP_PARENT flag is on; in this case, the function performs the same steps described in the previous section up to Step 7. From Step 7 onward, however, the lookup operation for the last component of the pathname is different:

1.       Sets nd->last to the name of the last component

2.       Initializes nd->last_type to LAST_NORM

3.       If the name of the last component is "." (a single dot), sets nd->last_type to LAST_DOT

4.       If the name of the last component is ". ." (two dots), sets nd->last_type to LAST_DOTDOT

5.       Terminates by returning the value 0 (no error)

As you can see, the last component is not interpreted at all. Thus, when the function terminates, the dentry and mnt fields of the nameidata data structure point to the objects relative to the directory that includes the last component.

12.5.3 Lookup of Symbolic Links

Recall that a symbolic link is a regular file that stores a pathname of another file. A pathname may include symbolic links, and they must be resolved by the kernel.

For example, if /foo/bar is a symbolic link pointing to (containing the pathname) ../dir, the pathname /foo/bar/file must be resolved by the kernel as a reference to the file /dir/file. In this example, the kernel must perform two different lookup operations. The first one resolves /foo/bar; when the kernel discovers that bar is the name of a symbolic link, it must retrieve its content and interpret it as another pathname. The second pathname operation starts from the directory reached by the first operation and continues until the last component of the symbolic link pathname has been resolved. Next, the original lookup operation resumes from the dentry reached in the second one and with the component following the symbolic link in the original pathname.

To further complicate the scenario, the pathname included in a symbolic link may include other symbolic links. You might think that the kernel code that resolves the symbolic links is hard to understand, but this is not true; the code is actually quite simple because it is recursive.

However, untamed recursion is intrinsically dangerous. For instance, suppose that a symbolic link points to itself. Of course, resolving a pathname including such a symbolic link may induce an endless stream of recursive invocations, which in turn quickly leads to a kernel stack overflow. The link_count field in the descriptor of the current process is used to avoid the problem: the field is incremented before each recursive execution and decremented right after. If the field reaches the value 5, the whole lookup operation terminates with an error code. Therefore, the level of nesting of symbolic links can be at most 5.

Furthermore, the total_link_count field in the descriptor of the current process keeps track of how many symbolic links (even nonnested) were followed in the original lookup operation. If this counter reaches the value 40, the lookup operation aborts. Without this counter, a malicious user could create a pathological pathname including many consecutive symbolic links that freezes the kernel in a very long lookup operation.

This is how the code basically works: once the link_path_walk( ) function retrieves the dentry object associated with a component of the pathname, it checks whether the corresponding inode object has a custom follow_link method (see Step 5.k and Step 13 in Section 12.5.1). If so, the inode is a symbolic link that must be interpreted before proceeding with the lookup operation of the original pathname.

In this case, the link_path_walk( ) function invokes do_follow_link( ), passing to it the address of the dentry object of the symbolic link and the address of the nameidata data structure. In turn, do_follow_link( ) performs the following steps:

1.       Checks that current->link_count is less than 5; otherwise, returns the error code -ELOOP

2.       Checks that current->total_link_count is less than 40; otherwise, returns the error code -ELOOP

3.       If the current->need_resched flag is set, invokes schedule( ) to give a chance to preempt the running process

4.       Increments current->link_count and current->total_link_count

5.       Updates the access time of the inode object associated with the symbolic link to be resolved

6.       Invokes the follow_link method of the inode, passing to it the addresses of the dentry object and of the nameidata data structure

7.       Decrements the current->link_count field

8.       Returns the error code returned by the follow_link method (0 for no error)

The follow_link method is a filesystem-dependent function that reads the pathname stored in the symbolic link from the disk. Having filled a buffer with the symbolic link's pathname, most follow_link methods end up invoking the vfs_follow_link( ) function and returning the value taken from it. In turn, the vfs_follow_link( ) does the following:

1.       Checks whether the first character of the symbolic link pathname is a slash; if so, the dentry and mnt fields of the nameidata data structure are set so they refer to the current process root directory.

2.       Invokes link_path_walk( ) to resolve the symbolic link pathname, passing to it the nameidata data structure.

3.       Returns the value taken from link_path_walk( ).

When do_follow_link( ) finally terminates, it returns the address of the dentry object referred to by the symbolic link to the original execution of link_path_walk( ). The link_path_walk( ) assigns this address to the dentry local variable, and then proceeds with the next step.