19.1 Pipes

Pipes are an interprocess communication mechanism that is provided in all flavors of Unix. A pipe is a one-way flow of data between processes: all data written by a process to the pipe is routed by the kernel to another process, which can thus read it.

In Unix command shells, pipes can be created by means of the | operator. For instance, the following statement instructs the shell to create two processes connected by a pipe:

$ ls | more 

The standard output of the first process, which executes the ls program, is redirected to the pipe; the second process, which executes the more program, reads its input from the pipe.

Note that the same results can also be obtained by issuing two commands such as the following:

$ ls > temp 
$ more < temp 

The first command redirects the output of ls into a regular file; then the second command forces more to read its input from the same file. Of course, using pipes instead of temporary files is usually more convenient due to the following reasons:

·         The shell statement is much shorter and simpler.

·         There is no need to create temporary regular files, which must be deleted later.

19.1.1 Using a Pipe

Pipes may be considered open files that have no corresponding image in the mounted filesystems. A process creates a new pipe by means of the pipe( ) system call, which returns a pair of file descriptors; the process may then pass these descriptors to its descendants through fork( ), thus sharing the pipe with them. The processes can read from the pipe by using the read( ) system call with the first file descriptor; likewise, they can write into the pipe by using the write( ) system call with the second file descriptor.

POSIX defines only half-duplex pipes, so even though the pipe( ) system call returns two file descriptors, each process must close one before using the other. If a two-way flow of data is required, the processes must use two different pipes by invoking pipe( ) twice.

Several Unix systems, such as System V Release 4, implement full-duplex pipes. In a full-duplex pipe, both descriptors can be written into and read from, thus there are two bidirectional channels of information. Linux adopts yet another approach: each pipe's file descriptors are still one-way, but it is not necessary to close one of them before using the other.

Let's resume the previous example. When the command shell interprets the ls|more statement, it essentially performs the following actions:

1.       Invokes the pipe( ) system call; let's assume that pipe( ) returns the file descriptors 3 (the pipe's read channel ) and 4 (the write channel ).

2.       Invokes the fork( ) system call twice.

3.       Invokes the close( ) system call twice to release file descriptors 3 and 4.

The first child process, which must execute the ls program, performs the following operations:

1.       Invokes dup2(4,1) to copy file descriptor 4 to file descriptor 1. From now on, file descriptor 1 refers to the pipe's write channel.

2.       Invokes the close( ) system call twice to release file descriptors 3 and 4.

3.       Invokes the execve( ) system call to execute the ls program (see Section 20.4). The program writes its output to the file that has file descriptor 1 (the standard output); i.e., it writes into the pipe.

The second child process must execute the more program; therefore, it performs the following operations:

1.       Invokes dup2(3,0) to copy file descriptor 3 to file descriptor 0. From now on, file descriptor 0 refers to the pipe's read channel.

2.       Invokes the close( ) system call twice to release file descriptors 3 and 4.

3.       Invokes the execve( ) system call to execute more. By default, that program reads its input from the file that has file descriptor 0 (the standard input); i.e., it reads from the pipe.

In this simple example, the pipe is used by exactly two processes. Because of its implementation, though, a pipe can be used by an arbitrary number of processes.[1] Clearly, if two or more processes read or write the same pipe, they must explicitly synchronize their accesses by using file locking (see Section 12.7.1) or IPC semaphores (see Section 19.3.3 later in this chapter).

[1] Since most shells offer pipes that connect only two processes, applications requiring pipes used by more than two processes must be coded in a programming language such as C.

Many Unix systems provide, besides the pipe( ) system call, two wrapper functions named popen( ) and pclose( ) that handle all the dirty work usually done when using pipes. Once a pipe has been created by means of the popen( ) function, it can be used with the high-level I/O functions included in the C library (fprintf( ), fscanf( ), and so on).

In Linux, popen( ) and pclose( ) are included in the C library. The popen( ) function receives two parameters: the filename pathname of an executable file and a type string specifying the direction of the data transfer. It returns the pointer to a FILE data structure. The popen( ) function essentially performs the following operations:

1.       Creates a new pipe by using the pipe( ) system call

2.       Forks a new process, which in turn executes the following operations:

a.       If type is r, duplicates the file descriptor associated with the pipe's write channel as file descriptor 1 (standard output); otherwise, if type is w, duplicates the file descriptor associated with the pipe's read channel as file descriptor 0 (standard input)

b.       Closes the file descriptors returned by pipe( )

c.       Invokes the execve( ) system call to execute the program specified by filename

3.       If type is r, closes the file descriptor associated with the pipe's write channel; otherwise, if type is w, closes the file descriptor associated with the pipe's read channel

4.       Returns the address of the FILE file pointer that refers to whichever file descriptor for the pipe is still open

After the popen( ) invocation, parent and child can exchange information through the pipe: the parent can read (if type is r) or write (if type is w) data by using the FILE pointer returned by the function. The data is written to the standard output or read from the standard input, respectively, by the program executed by the child process.

The pclose( ) function (which receives the file pointer returned by popen( ) as its parameter) simply invokes the wait4( ) system call and waits for the termination of the process created by popen( ).

19.1.2 Pipe Data Structures

We now have to start thinking again on the system call level. Once a pipe is created, a process uses the read( ) and write( ) VFS system calls to access it. Therefore, for each pipe, the kernel creates an inode object plus two file objects—one for reading and the other for writing. When a process wants to read from or write to the pipe, it must use the proper file descriptor.

When the inode object refers to a pipe, its i_pipe field points to a pipe_inode_info structure shown in Table 19-1.

Table 19-1. The pipe_inode_info structure

Type

Field

Description

struct wait_queue *

wait

Pipe/FIFO wait queue

char *

base

Address of kernel buffer

unsigned int

len

Number of bytes written into the buffer and yet to be read

unsigned int

start

Read position in kernel buffer

unsigned int

readers

Flag for (or number of) reading processes

unsigned int

writers

Flag for (or number of) writing processes

unsigned int

waiting_readers

Number of reading processes sleeping in the wait queue

unsigned int

waiting_writers

Number of writing processes sleeping in the wait queue

unsigned int

r_counter

Like readers, but used when waiting for a process that reads from the FIFO

unsigned int

w_counter

Like writers, but used when waiting for a process that writes into the FIFO

Besides one inode and two file objects, each pipe has its own pipe buffer—a single page frame containing the data written into the pipe and yet to be read. The address of this page frame is stored in the base field of the pipe_inode_info structure. The len field of the structure stores the number of bytes written into the pipe buffer that are yet to be read; in the following, we call that number the current pipe size.

The pipe buffer is circular and it is accessed both by reading and writing processes, so the kernel must keep track of two current positions in the buffer:

·         The offset of the next byte to be read, which is stored in the start field of the pipe_inode_info structure

·         The offset of the next byte to be written, which is derived from start and the pipe size (the len field of the structure)

To avoid race conditions on the pipe's data structures, the kernel prevents concurrent accesses to the pipe buffer through the use of the i_sem semaphore included in the inode object.

19.1.2.1 The pipefs special filesystem

A pipe is implemented as a set of VFS objects, which have no corresponding disk image. In Linux 2.4, these VFS objects are organized into the pipefs special filesystem to expedite their handling (see Section 12.3.1). Since this filesystem has no mount point in the system directory tree, users never see it. However, thanks to pipefs, the pipes are fully integrated in the VFS layer, and the kernel can handle them in the same way as named pipes or FIFOs, which truly exist as files recognizable to end users (see the later section Section 19.2).

The init_pipe_fs( ) function, typically executed during kernel initialization, registers the pipefs filesystem and mounts it (refer to the discussion in Section 12.4.1):

struct file_system_type pipe_fs_type;
root_fs_type.name = "pipefs";
root_fs_type.read_super = pipefs_read_super;
root_fs_type.fs_flags = FS_NOMOUNT;
register_filesystem(&pipe_fs_type);
pipe_mnt = do_kern_mount("pipefs", 0, "pipefs", NULL);

The mounted filesystem object that represents the root directory of pipefs is stored in the pipe_mnt variable.

19.1.3 Creating and Destroying a Pipe

The pipe( ) system call is serviced by the sys_pipe( ) function, which in turn invokes the do_pipe( ) function. To create a new pipe, do_pipe( ) performs the following operations:

1.       Invokes the get_pipe_inode( ) function, which allocates and initializes an inode object for the pipe in the pipefs filesystem. In particular, this function executes the following actions:

a.       Allocates a pipe_inode_info data structure and stores its address in the i_pipe field of the inode.

b.       Allocates a page frame for the pipe buffer and stores its starting address in the base field of the pipe_inode_info structure.

c.       Initializes the start, len, waiting_readers, and waiting_writers fields of the pipe_inode_info structure to 0.

d.       Initializes the r_counter and w_counter fields of the pipe_inode_info structure to 1.

2.       Sets the readers and writers fields of the pipe_inode_info structure to 1.

3.       Allocates a file object and a file descriptor for the read channel of the pipe, sets the flag field of the file object to O_RDONLY, and initializes the f_op field with the address of the read_ pipe_fops table.

4.       Allocates a file object and a file descriptor for the write channel of the pipe, sets the flag field of the file object to O_WRONLY, and initializes the f_op field with the address of the write_ pipe_fops table.

5.       Allocates a dentry object and uses it to link the two file objects and the inode object (see Section 12.1.1); then inserts the new inode in the pipefs special filesystem.

6.       Returns the two file descriptors to the User Mode process.

The process that issues a pipe( ) system call is initially the only process that can access the new pipe, both for reading and writing. To represent that the pipe has both a reader and a writer, the readers and writers fields of the pipe_inode_info data structure are initialized to 1. In general, each of these two fields is set to 1 only if the corresponding pipe's file object is still opened by a process; the field is set to 0 if the corresponding file object has been released, since it is no longer accessed by any process.

Forking a new process does not increase the value of the readers and writers fields, so they never rise above 1;[2] however, it does increase the value of the usage counters of all file objects still used by the parent process (see Section 3.4.1). Thus, the objects are not released even when the parent dies, and the pipe stays open for use by the children.

[2] As we'll see, the readers and writers fields act as counters instead of flags when associated with FIFOs.

Whenever a process invokes the close( ) system call on a file descriptor associated with a pipe, the kernel executes the fput( ) function on the corresponding file object, which decrements the usage counter. If the counter becomes 0, the function invokes the release method of the file operations (see Section 12.6.3 and Section 12.2.6).

According to whether the file is associated with the read or write channel, the release method is implemented by either pipe_read_release( ) or pipe_write_release( ); both functions invoke pipe_release( ), which sets either the readers field or the writers field of the pipe_inode_info structure to 0. The function checks whether both the readers and writers fields are equal to 0; in this case, it releases the page frame containing the pipe buffer. Otherwise, the function wakes up any processes sleeping in the pipe's wait queue so they can recognize the change in the pipe state.

19.1.4 Reading from a Pipe

A process wishing to get data from a pipe issues a read( ) system call, specifying the file descriptor associated with the pipe's reading end. As described in Section 12.6.2, the kernel ends up invoking the read method found in the file operation table associated with the proper file object. In the case of a pipe, the entry for the read method in the read_pipe_fops table points to the pipe_read( ) function.

The pipe_read( ) function is quite involved, since the POSIX standard specifies several requirements for the pipe's read operations. Table 19-2 summarizes the expected behavior of a read( ) system call that requests n bytes from a pipe that has a pipe size (number of bytes in the pipe buffer yet to be read) equal to p.

The system call might block the current process in two cases:

·         The pipe buffer is empty when the system call starts.

·         The pipe buffer does not include all requested bytes, and a writing process was previously put to sleep while waiting for space in the buffer.

Notice that the read operation can be nonblocking: in this case, it completes as soon as all available bytes (even none) are copied into the user address space.[3]

[3] Nonblocking operations are usually requested by specifying the O_NONBLOCK flag in the open( ) system call. This method does not work for pipes, since they cannot be opened. A process can, however, require a nonblocking operation on a pipe by issuing a fcntl( ) system call on the corresponding file descriptor.

Notice also that the value 0 is returned by the read( ) system call only if the pipe is empty and no process is currently using the file object associated with the pipe's write channel.

Table 19-2. Reading n bytes from a pipe

 

At least one writing process

No writing process

 

Blocking read

Nonblocking read

 

Pipe Size p

Sleeping writer

No sleeping writer

 

 

p = 0

Copy n bytes and return n, waiting for data when the pipe buffer is empty.

Wait for some data, copy it, and return its size.

Return -EAGAIN.

Return 0.

0 < p < n

 

Copy p bytes and return p: bytes are left in the pipe buffer.

p n

Copy n bytes and return n: p-n bytes are left in the pipe buffer.

The function performs the following operations:

1.       Acquires the i_sem semaphore of the inode.

2.       Determines whether the pipe size, which is stored into the len field of the pipe_inode_info structure, is 0. In this case, determines whether the function must return or whether the process must be blocked while waiting until another process writes some data in the pipe (see Table 19-2). The type of I/O operation (blocking or nonblocking) is specified by the O_NONBLOCK flag in the f_flags field of the file object. If the current process must be blocked, the function performs the following actions:

a.       Adds 1 to the waiting_readers field of the pipe_inode_info structure.

b.       Adds current to the wait queue of the pipe (the wait field of the pipe_inode_info structure).

c.       Releases the inode semaphore.

d.       Sets the process status to TASK_INTERRUPTIBLE and invokes schedule( ).

e.       Once awake, removes current from the wait queue, acquires again the i_sem inode semaphore, decrements the waiting_readers field, and then jumps back to Step 2.

3.       Copies the requested number of bytes (or the number of available bytes, if the buffer size is too small) from the pipe's buffer to the user address space.

4.       Updates the start and len fields of the pipe_inode_info structure.

5.       Invokes wake_up_interruptible( ) to wake up all processes sleeping on the pipe's wait queue.

6.       If not all requested bytes have been copied, there is at least one writing process currently sleeping (waiting_writers field greater than 0) and the read operation is nonblocking, so the function jumps back to Step 2.

7.       Releases the i_sem semaphore of the inode.

8.       Returns the number of bytes copied into the user address space.

19.1.5 Writing into a Pipe

A process wishing to put data into a pipe issues a write( ) system call, specifying the file descriptor for the writing end of the pipe. The kernel satisfies this request by invoking the write method of the proper file object; the corresponding entry in the write_pipe_fops table points to the pipe_write( ) function.

Table 19-3 summarizes the behavior, specified by the POSIX standard, of a write( ) system call that requested to write n bytes into a pipe having u unused bytes in its buffer. In particular, the standard requires that write operations involving a small number of bytes must be atomically executed. More precisely, if two or more processes are concurrently writing into a pipe, each write operation involving fewer than 4,096 bytes (the pipe buffer size) must finish without being interleaved with write operations of other processes to the same pipe. However, write operations involving more than 4,096 bytes may be nonatomic and may also force the calling process to sleep.

Table 19-3. Writing n bytes to a pipe

 

At least one reading process

 

Available buffer space u

Blocking write

Nonblocking write

No reading process

u<n4096

Wait until n-u bytes are freed, copy n bytes, and return n.

Return -EAGAIN.

Send SIGPIPE signal and return -EPIPE.

n>4096

Copy n bytes (waiting when necessary) and return n.

If u>0, copy u bytes and return u; else return -EAGAIN.

 

un

Copy n bytes and return n.

Moreover, each write operation to a pipe must fail if the pipe does not have a reading process (that is, if the readers field of the pipe's inode object has the value 0). In this case, the kernel sends a SIGPIPE signal to the writing process and terminates the write( ) system call with the -EPIPE error code, which usually leads to the familiar "Broken pipe" message.

The pipe_write( ) function performs the following operations:

1.       Acquires the i_sem semaphore of the inode.

2.       Checks whether the pipe has at least one reading process. If not, it sends a SIGPIPE signal to the current process, releases the inode semaphore, and returns an -EPIPE value.

3.       Checks whether the number of bytes to be written is within the pipe's buffer size:

a.       If so, the write operation must be atomic. Therefore, checks whether the buffer has enough free space to store all bytes to be written.

b.       Otherwise, if the number of bytes is greater than the buffer size, the operation can start as long as there is any free space at all. Therefore, the function checks for at least one free byte.

4.       If the buffer does not have enough free space and the write operation is nonblocking, releases the inode semaphore and returns the -EAGAIN error code.

5.       If the buffer does not have enough free space and the write operation is blocking, performs the following actions:

a.       Adds 1 to the waiting_writers field of the pipe_inode_info structure.

b.       Adds current to the wait queue of the pipe (the wait field of the pipe_inode_info structure).

c.       Releases the inode semaphore.

d.       Sets the process status to TASK_INTERRUPTIBLE and invokes schedule( ).

e.       Once awake, removes current from the wait queue, again acquires the inode semaphore, decrements the waiting_writers field, and then jumps back to Step 5.

6.       Now the pipe buffer has enough free space to either copy the requested number of bytes (if the write operation must be atomic) or copy at least one byte; notice that other writers cannot steal free space because this writer owns the inode semaphore. Copies the requested number of bytes (or the number of free bytes if the pipe size is too small) from the user address space to the pipe's buffer.

7.       Wakes up all processes sleeping on the pipe's wait queue.

8.       If the write operation was blocking and not all requested bytes were written in the pipe buffer, jumps back to Step 5. Notice that this case may occur only when the write operation is nonatomic; hence the current process remains blocked until one or more bytes of the pipe buffer are freed.

9.       Releases the inode semaphore.

10.   Returns the number of bytes written into the pipe's buffer.