IPC is an abbreviation for Interprocess Communication, and commonly refers to a set of mechanisms that allow a User Mode process to do the following:
· Synchronize itself with other processes by means of semaphores
· Send messages to other processes or receive messages from them
· Share a memory area with other processes
System V IPC first appeared in a development Unix variant called "Columbus Unix" and later was adopted by AT&T's System III. It is now found in most Unix systems, including Linux.
IPC data structures are created dynamically when a process requests an IPC resource (a semaphore, a message queue, or a shared memory region). An IPC resource is persistent: unless explicitly removed by a process, it is kept in memory and remains available until the system is shut down. An IPC resource may be used by any process, including those that do not share the ancestor that created the resource.
Since a process may require several IPC resources of the same type, each new resource is identified by a 32-bit IPC key, which is similar to the file pathname in the system's directory tree. Each IPC resource also has a 32-bit IPC identifier, which is somewhat similar to the file descriptor associated with an open file. IPC identifiers are assigned to IPC resources by the kernel and are unique within the system, while IPC keys can be freely chosen by programmers.
When two or more processes wish to communicate through an IPC resource, they all refer to the IPC identifier of the resource.
IPC resources are created by invoking the semget( ), msgget( ), or shmget( ) functions, depending on whether the new resource is a semaphore, a message queue, or a shared memory region.
The main objective of each of these three functions is to derive from the IPC key (passed as the first parameter) the corresponding IPC identifier, which is then used by the process for accessing the resource. If there is no IPC resource already associated with the IPC key, a new resource is created. If everything goes right, the function returns a positive IPC identifier; otherwise, it returns one of the error codes listed in Table 19-6.
Assume that two independent processes want to share a common IPC resource. This can be achieved in two possible ways:
· The processes agree on some fixed, predefined IPC key. This is the simplest case, and it works quite well for any complex application implemented by many processes. However, there's a chance that the same IPC key is chosen by another unrelated program. In this case, the IPC functions might be successfully invoked and still return the IPC identifier of the wrong resource.[5]
[5] The ftok( ) function attempts to create a new key from a file pathname and an 8-bit project identifier passed as parameters. It does not guarantee, however, a unique key number, since there is a small chance that it will return the same IPC key to two different applications using different pathnames and project identifiers.
· One process issues a semget( ), msgget( ), or shmget( ) function by specifying IPC_PRIVATE as its IPC key. A new IPC resource is thus allocated, and the process can either communicate its IPC identifier to the other process in the application[6] or fork the other process itself. This method ensures that the IPC resource cannot be used accidentally by other applications.
[6] This implies, of course, the existence of another communication channel between the processes not based on IPC.
The last parameter of the semget( ), msgget( ), and shmget( ) functions can include two flags. IPC_CREAT specifies that the IPC resource must be created, if it does not already exist; IPC_EXCL specifies that the function must fail if the resource already exists and the IPC_CREAT flag is set.
Even if the process uses the IPC_CREAT and IPC_EXCL flags, there is no way to ensure exclusive access to an IPC resource, since other processes may always refer to the resource by using its IPC identifier.
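As a minimal User Mode sketch of both conventions (the key value, permissions, and error handling are illustrative, and a message queue stands in for any resource type):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    int main(void)
    {
        /* Fixed, agreed-upon key (the value is arbitrary): create the
         * queue only if it does not already exist, and fail if another
         * application already owns this key. */
        int id = msgget((key_t)0x4ead, IPC_CREAT | IPC_EXCL | 0600);
        if (id < 0) {
            perror("msgget");
            return 1;
        }
        printf("IPC identifier: %d\n", id);

        /* Alternative: a private queue; its identifier must reach the
         * peer via fork( ) or some other communication channel. */
        int priv = msgget(IPC_PRIVATE, 0600);

        /* Tidy up the example resources. */
        if (priv >= 0)
            msgctl(priv, IPC_RMID, NULL);
        msgctl(id, IPC_RMID, NULL);
        return 0;
    }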
To minimize the risk of referencing the wrong resource, the kernel does not recycle IPC identifiers as soon as they become free. Instead, the IPC identifier assigned to a resource is almost always larger than the identifier assigned to the previously allocated resource of the same type. (The only exception occurs when the 32-bit IPC identifier overflows.) Each IPC identifier is computed by combining a slot usage sequence number relative to the resource type, an arbitrary slot index for the allocated resource, and an arbitrary value chosen in the kernel that is greater than the maximum number of allocatable resources. If we choose s to represent the slot usage sequence number, M to represent the upper bound on the number of allocatable resources, and i to represent the slot index, where 0 ≤ i < M, each IPC resource's ID is computed as follows:
IPC identifier = s × M + i
In Linux 2.4, the value of M is set to 32,768 (IPCMNI macro). The slot usage sequence number s is initialized to 0 and is incremented by 1 at every resource allocation. When s reaches a predefined threshold, which depends on the type of IPC resource, it restarts from 0.
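The computation and its inverse can be condensed into a few lines; the helper names below are illustrative, not the kernel's own:

    #define IPCMNI 32768   /* M: upper bound on allocatable resources */

    /* Derive the IPC identifier from the slot usage sequence number s
     * and the slot index i, with 0 <= i < IPCMNI. */
    static int build_ipc_id(int seq, int slot)
    {
        return seq * IPCMNI + slot;
    }

    /* Conversely, slot = id % IPCMNI and seq = id / IPCMNI, which is
     * how an identifier is mapped back to its slot. */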
Every type of IPC resource (semaphores, message queues, and shared memory areas) owns an ipc_ids data structure, which includes the fields shown in Table 19-7.
The size field stores the maximum number of allocatable IPC resources of the given type. The system administrator may increase this value for any resource type by writing into the /proc/sys/kernel/sem, /proc/sys/kernel/msgmni, and /proc/sys/kernel/shmmni special files, respectively.
The entries field points to an array of pointers to kern_ipc_perm data structures, one for every allocatable resource (the size field is also the size of the array). Each kern_ipc_perm data structure is associated with an IPC resource and contains the fields shown in Table 19-8. The uid, gid, cuid, and cgid fields store the user and group identifiers of the resource's creator and the user and group identifiers of the current resource's owner, respectively. The mode bit mask includes six flags, which store the read and write access permissions for the resource's owner, the resource's group, and all other users. IPC access permissions are similar to file access permissions described in Section 1.5.5, except that the Execute permission flag is not used.
The kern_ipc_perm data structure also includes a key field (which contains the IPC key of the corresponding resource) and a seq field (which stores the slot usage sequence number s used to compute the IPC identifier of the resource).
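In Linux 2.4, the descriptor looks essentially as follows (comments added):

    struct kern_ipc_perm {
        key_t         key;   /* IPC key of the resource */
        uid_t         uid;   /* owner's user identifier */
        gid_t         gid;   /* owner's group identifier */
        uid_t         cuid;  /* creator's user identifier */
        gid_t         cgid;  /* creator's group identifier */
        mode_t        mode;  /* permission bit mask */
        unsigned long seq;   /* slot usage sequence number */
    };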
The semctl( ), msgctl( ), and shmctl( ) functions may be used to handle IPC resources. The IPC_SET subcommand allows a process to change the owner's user and group identifiers and the permission bit mask in the ipc_perm data structure. The IPC_STAT and IPC_INFO subcommands retrieve some information concerning a resource. Finally, the IPC_RMID subcommand releases an IPC resource. Depending on the type of IPC resource, other specialized subcommands are also available.[7]
[7] Another IPC design flaw is that a User Mode process cannot atomically create and initialize an IPC semaphore, since these two operations are performed by two different IPC functions.
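As a brief User Mode illustration of the IPC_STAT and IPC_RMID subcommands applied to a message queue (error handling abridged):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    void inspect_and_remove(int id)
    {
        struct msqid_ds buf;

        /* IPC_STAT: copy the kernel's bookkeeping for the queue into buf. */
        if (msgctl(id, IPC_STAT, &buf) == 0)
            printf("owner uid: %d, messages queued: %lu\n",
                   (int)buf.msg_perm.uid, (unsigned long)buf.msg_qnum);

        /* IPC_RMID: release the resource. */
        if (msgctl(id, IPC_RMID, NULL) < 0)
            perror("msgctl(IPC_RMID)");
    }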
Once an IPC resource is created, a process may act on the resource by means of a few specialized functions. A process may acquire or release an IPC semaphore by issuing the semop( ) function. When a process wants to send or receive an IPC message, it uses the msgsnd( ) and msgrcv( ) functions, respectively. Finally, a process attaches and detaches an IPC shared memory region in its address space by means of the shmat( ) and shmdt( ) functions, respectively.
All IPC functions must be implemented through suitable Linux system calls. Actually, in the 80 x 86 architecture, there is just one IPC system call named ipc( ). When a process invokes an IPC function, let's say msgget( ), it really invokes a wrapper function in the C library. This in turn invokes the ipc( ) system call by passing to it all the parameters of msgget( ) plus a proper subcommand code—in this case, MSGGET. The sys_ipc( ) service routine examines the subcommand code and invokes the kernel function that implements the requested service.
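An abridged sketch of the dispatcher conveys the idea; the real routine handles many more subcommand codes and parameter combinations:

    /* Simplified sys_ipc( ) multiplexer; most cases omitted. */
    asmlinkage int sys_ipc(uint call, int first, int second, int third,
                           void *ptr, long fifth)
    {
        switch (call) {
        case SEMGET:
            return sys_semget(first, second, third);
        case MSGGET:
            return sys_msgget((key_t)first, second);
        case SHMGET:
            return sys_shmget(first, second, third);
        /* ... SEMOP, MSGSND, MSGRCV, SHMAT, and so on ... */
        default:
            return -EINVAL;
        }
    }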
The ipc( ) "multiplexer" system call is a legacy from older Linux versions, which included the IPC code in a dynamic module (see Appendix B). It did not make much sense to reserve several system call entries in the system_call table for a kernel component that could be missing, so the kernel designers adopted the multiplexer approach.
Nowadays, System V IPC can no longer be compiled as a dynamic module, and there is no justification for using a single IPC system call. As a matter of fact, Linux provides one system call for each IPC function on Hewlett-Packard's Alpha architecture and on Intel's IA-64.
IPC semaphores are quite similar to the kernel semaphores introduced in Chapter 5; they are counters used to provide controlled access to shared data structures for multiple processes.
The semaphore value is positive if the protected resource is available, and 0 if the protected resource is currently not available. A process that wants to access the resource tries to decrement the semaphore value; the kernel, however, blocks the process until the operation on the semaphore yields a nonnegative value. When a process relinquishes a protected resource, it increments its semaphore value; in doing so, any other process waiting for the semaphore is woken up.
Actually, IPC semaphores are more complicated to handle than kernel semaphores for two main reasons:
· Each IPC semaphore is a set of one or more semaphore values, not just a single value like a kernel semaphore. This means that the same IPC resource can protect several independent shared data structures. The number of semaphore values in each IPC semaphore must be specified as a parameter of the semget( ) function when the resource is being allocated. From now on, we'll refer to the counters inside an IPC semaphore as primitive semaphores. There are bounds both on the number of IPC semaphore resources (by default, 128) and on the number of primitive semaphores inside a single IPC semaphore resource (by default, 250); however, the system administrator can easily modify these bounds by writing into the /proc/sys/kernel/sem file.
· System V IPC semaphores provide a fail-safe mechanism for situations in which a process dies without being able to undo the operations that it previously issued on a semaphore. When a process chooses to use this mechanism, the resulting operations are called undoable semaphore operations. When the process dies, all of its IPC semaphores can revert to the values they would have had if the process had never started its operations. This can help prevent other processes that use the same semaphores from remaining blocked indefinitely as a consequence of the terminating process failing to manually undo its semaphore operations.
First, we'll briefly sketch the typical steps performed by a process wishing to access one or more resources protected by an IPC semaphore (a minimal code sketch follows the list):
1. Invokes the semget( ) wrapper function to get the IPC semaphore identifier, specifying as the parameter the IPC key of the IPC semaphore that protects the shared resources. If the process wants to create a new IPC semaphore, it also specifies the IPC_CREAT flag (or passes IPC_PRIVATE as the IPC key) and the number of primitive semaphores required (see Section 19.3.1 earlier in this chapter).
2. Invokes the semop( ) wrapper function to test and decrement all primitive semaphore values involved. If all the tests succeed, the decrements are performed, the function terminates, and the process is allowed to access the protected resources. If some semaphores are in use, the process is usually suspended until some other process releases the resources. The function receives as parameters the IPC semaphore identifier, an array of sembuf structures specifying the operations to be atomically performed on the primitive semaphores, and the number of such operations. Optionally, the process may specify the SEM_UNDO flag, which instructs the kernel to reverse the operations, should the process exit without releasing the primitive semaphores.
3. When relinquishing the protected resources, invokes the semop( ) function again to atomically increment all primitive semaphores involved.
4. Optionally, invokes the semctl( ) wrapper function, specifying the IPC_RMID command to remove the IPC semaphore from the system.
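Putting the four steps together, a minimal User Mode sketch might look like this (the key, the permissions, and the initial counter value are illustrative):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* On Linux, the caller must define union semun itself. */
    union semun { int val; struct semid_ds *buf; unsigned short *array; };

    int main(void)
    {
        /* Step 1: get or create an IPC semaphore containing one
         * primitive semaphore. */
        int id = semget((key_t)0x5e51, 1, IPC_CREAT | 0600);
        if (id < 0) { perror("semget"); return 1; }

        /* Set the counter to 1; as footnote [7] points out, creation
         * and initialization cannot be performed atomically. */
        union semun arg = { .val = 1 };
        if (semctl(id, 0, SETVAL, arg) < 0) { perror("semctl"); return 1; }

        /* Step 2: acquire - decrement primitive semaphore 0 by 1,
         * blocking if the value would become negative; SEM_UNDO makes
         * the operation undoable. */
        struct sembuf down = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
        if (semop(id, &down, 1) < 0) { perror("semop"); return 1; }

        /* ... access the protected resources here ... */

        /* Step 3: release - increment the counter back. */
        struct sembuf up = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
        semop(id, &up, 1);

        /* Step 4: optionally remove the IPC semaphore. */
        semctl(id, 0, IPC_RMID);
        return 0;
    }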
Now we can discuss how the kernel implements IPC semaphores. The data structures involved are shown in Figure 19-1. The sem_ids variable stores the ipc_ids data structure of the IPC semaphore resource type; its entries field is an array of pointers to sem_array data structures, one item for every IPC semaphore resource.
Formally, the array stores pointers to kern_ipc_perm data structures, but each structure is simply the first field of the sem_array data structure. All fields of the sem_array data structure are shown in Table 19-9.
The sem_base field points to an array of struct sem data structures, one for every IPC primitive semaphore. The latter data structure includes only two fields:
semval
The value of the semaphore's counter.
sempid
The PID of the last process that accessed the semaphore. This value can be queried by a process through the semctl( ) wrapper function.
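In Linux 2.4, the declaration is correspondingly small (comments added):

    struct sem {
        int semval;   /* current value of the semaphore's counter */
        int sempid;   /* PID of the last process that accessed it */
    };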
If a process aborts suddenly, it cannot undo the operations that it started (for instance, release the semaphores it reserved); so by declaring them undoable, the process lets the kernel return the semaphores to a consistent state and allow other processes to proceed. Processes can request undoable operations by specifying the SEM_UNDO flag in the semop( ) function.
Information to help the kernel reverse the undoable operations performed by a given process on a given IPC semaphore resource is stored in a sem_undo data structure. It essentially contains the IPC identifier of the semaphore and an array of integers representing the changes to the primitive semaphores' values caused by all undoable operations performed by the process.
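A simplified sketch of the sem_undo descriptor in Linux 2.4 follows (comments added; the two list pointers are discussed below):

    struct sem_undo {
        struct sem_undo *proc_next; /* next element in the per-process list */
        struct sem_undo *id_next;   /* next element in the per-semaphore list */
        int semid;                  /* IPC identifier of the semaphore resource */
        short *semadj;              /* one adjustment per primitive semaphore */
    };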
A simple example can illustrate how such sem_undo elements are used. Consider a process that uses an IPC semaphore resource containing four primitive semaphores. Suppose that it invokes the semop( ) function to increment the first counter by 1 and decrement the second by 2. If it specifies the SEM_UNDO flag, the integer in the first array element in the sem_undo data structure is decremented by 1, the integer in the second element is incremented by 2, and the other two integers are left unchanged. Further undoable operations on the IPC semaphore performed by the same process change the integers stored in the sem_undo structure accordingly. When the process exits, any nonzero value in that array corresponds to one or more unbalanced operations on the corresponding primitive semaphore; the kernel reverses these operations, simply adding the nonzero value to the corresponding semaphore's counter. In other words, the changes made by the aborted process are backed out while the changes made by other processes are still reflected in the state of the semaphores.
For each process, the kernel keeps track of all semaphore resources handled with undoable operations so that it can roll them back if the process unexpectedly exits. Furthermore, for each semaphore, the kernel has to keep track of all its sem_undo structures so it can quickly access them whenever a process uses semctl( ) to force an explicit value into a primitive semaphore's counter or to destroy an IPC semaphore resource.
The kernel is able to handle these tasks efficiently, thanks to two lists, which we denote as the per-process and the per-semaphore lists. The first list keeps track of all semaphores operated upon by a given process with undoable operations. The second list keeps track of all processes that are acting on a given semaphore with undoable operations. More precisely:
· The per-process list includes all sem_undo data structures corresponding to IPC semaphores on which the process has performed undoable operations. The semundo field of the process descriptor points to the first element of the list, while the proc_next field of each sem_undo data structure points to the next element in the list.
· The per-semaphore list includes all sem_undo data structures corresponding to the processes that performed undoable operations on the semaphore. The undo field of the sem_array data structure points to the first element of the list, while the id_next field of each sem_undo data structure points to the next element in the list.
The per-process list is used when a process terminates. The sem_exit( ) function, which is invoked by do_exit( ), walks through the list and reverses the effect of any unbalanced operation for every IPC semaphore touched by the process. By contrast, the per-semaphore list is mainly used when a process invokes the semctl( ) function to force an explicit value into a primitive semaphore. The kernel sets the corresponding element to 0 in the arrays of all sem_undo data structures referring to that IPC semaphore resource, since it would no longer make any sense to reverse the effect of previous undoable operations performed on that primitive semaphore. Moreover, the per-semaphore list is also used when an IPC semaphore is destroyed; all related sem_undo data structures are invalidated by setting the semid field to -1.[8]
[8] Notice that they are just invalidated and not freed, since it would be too costly to remove the data structures from the per-process lists of all processes.
The kernel associates a queue of pending requests with each IPC semaphore to identify processes that are waiting on one (or more) of the semaphores in the array. The queue is a doubly linked list of sem_queue data structures whose fields are shown in Table 19-10. The first and last pending requests in the queue are referenced, respectively, by the sem_pending and sem_pending_last fields of the sem_array structure. This last field allows the list to be handled easily as a FIFO; new pending requests are added to the end of the list so they will be serviced later. The most important fields of a pending request are nsops (which stores the number of primitive semaphores involved in the pending operation) and sops (which points to the array of structures describing the semaphore operations). The sleeper field stores the descriptor address of the sleeping process that requested the operation.
Figure 19-1 illustrates an IPC semaphore that has three pending requests. Two of them refer to undoable operations, so the undo field of the sem_queue data structure points to the corresponding sem_undo structure; the third pending request has a NULL undo field since the corresponding operation is not undoable.
Processes can communicate with one another by means of IPC messages. Each message generated by a process is sent to an IPC message queue, where it stays until another process reads it.
A message is composed of a fixed-size header and a variable-length text; it can be labeled with an integer value (the message type), which allows a process to selectively retrieve messages from its message queue.[9] Once a process has read a message from an IPC message queue, the kernel destroys the message; therefore, only one process can receive a given message.
[9] As we'll see, the message queue is implemented by means of a linked list. Since messages can be retrieved in an order different from "first in, first out," the name "message queue" is not appropriate. However, new messages are always put at the end of the linked list.
To send a message, a process invokes the msgsnd( ) function, passing the following as parameters (a short sketch follows the list):
· The IPC identifier of the destination message queue
· The size of the message text
· The address of a User Mode buffer that contains the message type immediately followed by the message text
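For example (the structure layout is the conventional one; the type, text, and buffer size are illustrative):

    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    /* User Mode buffer: the message type immediately followed by
     * the message text. */
    struct msgbuf_example {
        long mtype;       /* message type, must be positive */
        char mtext[128];  /* message text */
    };

    int send_example(int qid)
    {
        struct msgbuf_example m;
        m.mtype = 2;
        strcpy(m.mtext, "hello");
        /* The size parameter counts the text only, not the type field. */
        return msgsnd(qid, &m, strlen(m.mtext) + 1, 0);
    }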
To retrieve a message, a process invokes the msgrcv( ) function, passing to it:
· The IPC identifier of the IPC message queue resource
· The pointer to a User Mode buffer to which the message type and message text should be copied
· The size of this buffer
· A value t that specifies what message should be retrieved
If the value t is 0, the first message in the queue is returned. If t is positive, the first message in the queue whose type equals t is returned. Finally, if t is negative, the function returns the first message whose type is the lowest value among the types less than or equal to the absolute value of t.
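A matching sketch of the receiving side, reusing the struct msgbuf_example defined earlier (the value of t is illustrative):

    #include <sys/msg.h>

    long receive_example(int qid, struct msgbuf_example *m)
    {
        /* t = 0  : first message in the queue, whatever its type.
         * t = 2  : first message whose type equals 2.
         * t = -3 : first message with the lowest type <= 3. */
        long t = -3;
        if (msgrcv(qid, m, sizeof(m->mtext), t, 0) < 0)
            return -1;
        return m->mtype;
    }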
To avoid resource exhaustion, there are some limits on the number of IPC message queue resources allowed (by default, 16), on the size of each message (by default, 8,192 bytes), and on the maximum total size of the messages in a queue (by default, 16,384 bytes). As usual, however, the system administrator can tune these values by writing into the /proc/sys/kernel/msgmni, /proc/sys/kernel/msgmax, and /proc/sys/kernel/msgmnb files, respectively.
The data structures associated with IPC message queues are shown in Figure 19-2. The msg_ids variable stores the ipc_ids data structure of the IPC message queue resource type; its entries field is an array of pointers to msg_queue data structures—one item for every IPC message queue resource. Formally, the array stores pointers to kern_ipc_perm data structures, but each such structure is simply the first field of the msg_queue data structure. All fields of the msg_queue data structure are shown in Table 19-11.
The most important field is q_messages, which represents the head (i.e., the first dummy element) of a doubly linked circular list containing all messages currently in the queue.
Each message is broken into one or more dynamically allocated pages. The beginning of the first page stores the message header, which is a data structure of type msg_msg; its fields are listed in Table 19-12. The m_list field stores the pointers to the previous and next messages in the queue. The message text starts right after the msg_msg descriptor; if the message is longer than 4,072 bytes (the page size minus the size of the msg_msg descriptor), it continues on another page, whose address is stored in the next field of the msg_msg descriptor. The second page frame starts with a descriptor of type msg_msgseg, which just includes a next pointer storing the address of an optional third page, and so on.
When the message queue is full (either the maximum number of messages or the maximum total size has been reached), processes that try to enqueue new messages may be blocked. The q_senders field of the msg_queue data structure is the head of a list that includes the pointers to the descriptors of all blocked sending processes.
Even receiving processes may be blocked when the message queue is empty (or the process specified a type of message not present in the queue). The q_receivers field of the msg_queue data structure is the head of a list of msg_receiver data structures, one for every blocked receiving process. Each of these structures essentially includes a pointer to the process descriptor, a pointer to the msg_msg structure of the message, and the type of the requested message.
The most useful IPC mechanism is shared memory, which allows two or more processes to access some common data structures by placing them in an IPC shared memory region. Each process that wants to access the data structures included in an IPC shared memory region must add to its address space a new memory region (see Section 8.3), which maps the page frames associated with the IPC shared memory region. Such page frames can then be easily handled by the kernel through demand paging (see Section 8.4.3).
As with semaphores and message queues, the shmget( ) function is invoked to get the IPC identifier of a shared memory region, optionally creating it if it does not already exist.
The shmat( ) function is invoked to "attach" an IPC shared memory region to a process. It receives as its parameter the identifier of the IPC shared memory resource and tries to add a shared memory region to the address space of the calling process. The calling process can require a specific starting linear address for the memory region, but the address is usually unimportant, and each process accessing the shared memory region can use a different address in its own address space. The process's Page Tables are left unchanged by shmat( ). We describe later what the kernel does when the process tries to access a page that belongs to the new memory region.
The shmdt( ) function is invoked to "detach" an IPC shared memory region specified by its IPC identifier—that is, to remove the corresponding memory region from the process's address space. Recall that an IPC shared memory resource is persistent: even if no process is using it, the corresponding pages cannot be discarded, although they can be swapped out.
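Putting the three functions together (key, size, and error handling are illustrative):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        int id = shmget((key_t)0x5a3d, 4096, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        /* Attach at a kernel-chosen address; the Page Tables are not
         * touched until the region is actually accessed. */
        char *p = shmat(id, NULL, 0);
        if (p == (char *)-1) { perror("shmat"); return 1; }

        p[0] = 'x';                   /* first access: demand paging */

        shmdt(p);                     /* detach; the region persists */
        shmctl(id, IPC_RMID, NULL);   /* explicitly release it */
        return 0;
    }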
As for the other types of IPC resources, in order to avoid overuse of memory by User Mode processes, there are some limits on the allowed number of IPC shared memory regions (by default, 4,096), on the size of each segment (by default, 32 megabytes), and on the maximum total size of all segments (by default, 8 gigabytes). As usual, however, the system administrator can tune these values by writing into the /proc/sys/kernel/shmmni, /proc/sys/kernel/shmmax, and /proc/sys/kernel/shmall files, respectively.
The data structures associated with IPC shared memory regions are shown in Figure 19-3. The shm_ids variable stores the ipc_ids data structure of the IPC shared memory resource type; its entries field is an array of pointers to shmid_kernel data structures, one item for every IPC shared memory resource. Formally, the array stores pointers to kern_ipc_perm data structures, but each such structure is simply the first field of the shmid_kernel data structure. All fields of the shmid_kernel data structure are shown in Table 19-13.
The most important field is shm_file, which stores the address of a file object. This reflects the tight integration of IPC shared memory with the VFS layer in Linux 2.4. In particular, each IPC shared memory region is associated with a regular file belonging to the shm special filesystem (see Section 12.3.1).
Since the shm filesystem has no mount point in the system directory tree, no user can open and access its files by means of regular VFS system calls. However, whenever a process "attaches" a segment, the kernel invokes do_mmap( ) and creates a new shared memory mapping of the file in the address space of the process. Therefore, files that belong to the shm special filesystem have just one file object method, mmap, which is implemented by the shm_mmap( ) function.
As shown in Figure 19-3, a memory region that corresponds to an IPC shared memory region is described by a vm_area_struct object (see Section 15.2); its vm_file field points back to the file object of the special file, which in turn references a dentry object and an inode object. The inode number, stored in the i_ino field of the inode, is actually the slot index of the IPC shared memory region, so the inode object indirectly references the shmid_kernel descriptor.
As usual, for any shared memory mapping, page frames that belong to the IPC shared memory region are included in the page cache through an address_space object referenced by the i_mapping field of the inode (you might also refer to Figure 15-4).
The kernel has to be careful when swapping out pages included in shared memory regions, and the role of the swap cache is crucial (this topic was already discussed in Section 16.3).
As explained in Section 16.5.1, to swap out a page owned by an address_space object, the kernel essentially marks the page as dirty, thus triggering a data transfer to disk, and then removes the page from the process's Page Table. If the page belongs to a shared file memory mapping, eventually the page is no longer referenced by any process, and the shrink_cache( ) function will release it to the Buddy system (see Section 16.7.5). This is fine because the data in the page is just a duplicate of some data on disk.
However, pages of an IPC shared memory region map a special inode that has no image on disk. Moreover, an IPC shared memory region is persistent—that is, its pages must be preserved even when the segment is not attached to any process. Therefore, the kernel cannot simply discard the pages when reclaiming the corresponding page frames; rather, the pages have to be swapped out.
The try_to_swap_out( ) function includes no check for this special case, so a page belonging to the region is marked as dirty and removed from the process address space. Even the shrink_cache( ) function, which periodically prunes the least recently used pages from the page cache, has no check for this special case, so it ends up invoking the writepage method of the owner address_space object (see Section 16.7.5).
How, then, are IPC shared memory regions preserved when their pages are swapped out? Pages belonging to IPC shared memory regions implement the writepage method by means of a custom shmem_writepage( ) function, which essentially allocates a new page slot in a swap area and moves the page from the page cache to the swap cache (it's just a matter of changing the owner address_space object of the page). The function also stores the swapped-out page identifier in a shmem_inode_info structure embedded in the filesystem-specific portion of the inode object. Notice that the page is not immediately written onto the swap area: this is done when shrink_cache( ) is invoked again.
The pages added to a process by shmat( ) are dummy pages; the function adds a new memory region into a process's address space, but it doesn't modify the process's Page Tables. Moreover, as we have seen, pages of an IPC shared memory region can be swapped out. Therefore, these pages are handled through the demand paging mechanism.
As we know, a Page Fault occurs when a process tries to access a location of an IPC shared memory region whose underlying page frame has not been assigned. The corresponding exception handler determines that the faulty address is inside the process address space and that the corresponding Page Table entry is null; therefore, it invokes the do_no_page( ) function (see Section 8.4.3). In turn, this function checks whether the nopage method for the memory region is defined. That method is invoked, and the Page Table entry is set to the address returned from it (see also Section 15.2.4).
Memory regions used for IPC shared memory always define the nopage method. It is implemented by the shmem_nopage( ) function, which performs the following operations (sketched in pseudocode after the list):
1. Walks the chain of pointers in the VFS objects and derives the address of the inode object of the IPC shared memory resource (see Figure 19-3).
2. Computes the logical page number inside the segment from the vm_start field of the memory region descriptor and the requested address.
3. Checks whether the page is already included in the swap cache (see Section 16.3); if so, it terminates by returning its address.
4. Checks whether the shmem_inode_info embedded in the inode object stores a swapped-out page identifier for the logical page number. If so, it performs a swap-in operation by invoking swapin_readahead( ) (see Section 16.6.1), waits until the data transfer completes, and terminates by returning the address of the page.
5. Otherwise, the page is not stored in a swap area; therefore, the function allocates a new page from the Buddy system, inserts it into the page cache, and returns its address.
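In rough pseudocode, with every helper name other than shmem_nopage( ) invented for the sketch, the method reads as follows:

    /* Rough sketch of shmem_nopage( ); locking and details omitted. */
    struct page *shmem_nopage_sketch(struct vm_area_struct *vma,
                                     unsigned long address)
    {
        /* Step 1: walk the VFS objects back to the inode. */
        struct inode *inode = vma->vm_file->f_dentry->d_inode;

        /* Step 2: logical page number inside the segment. */
        unsigned long idx = (address - vma->vm_start) >> PAGE_SHIFT;

        /* Step 3: is the page already in the swap cache? */
        struct page *page = lookup_swap_cache_for(inode, idx);
        if (page)
            return page;

        /* Step 4: swapped out? Read it back in and wait. */
        if (swap_entry_for(inode, idx))
            return swap_in_and_wait(inode, idx);

        /* Step 5: new page from the Buddy system, inserted into
         * the page cache. */
        return alloc_page_into_page_cache(inode, idx);
    }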
The do_no_page( ) function sets the entry that corresponds to the faulty address in the process's Page Table so that it points to the page frame returned by the method.