We won't be able to discuss all system calls related to networking. However, we shall examine the basic ones, namely those needed to send a UDP datagram.
In most Unix-like systems, the User Mode code fragment that sends a datagram looks like the following:
int sockfd; /* socket descriptor */
struct sockaddr_in addr_local, addr_remote; /* IPv4 address descriptors */
const char mesg[] = "Hello, how are you?";
sockfd = socket(PF_INET, SOCK_DGRAM, 0);
addr_local.sin_family = AF_INET;
addr_local.sin_port = htons(50000);
addr_local.sin_addr.s_addr = htonl(0xc0a050f0); /* 192.160.80.240 */
bind(sockfd, (struct sockaddr *) &addr_local, sizeof(struct sockaddr_in));
addr_remote.sin_family = AF_INET;
addr_remote.sin_port = htons(49152);
inet_pton(AF_INET, "192.160.80.110", &addr_remote.sin_addr);
connect(sockfd, (struct sockaddr *) &addr_remote, sizeof(struct sockaddr_in));
write(sockfd, mesg, strlen(mesg)+1);
Obviously, this listing does not represent the complete source code of the program. For instance, we have not defined a main( ) function, we have omitted the proper #include directives for loading the header files, and we have not checked the return values of the system calls. However, the listing includes all network-related system calls issued by the program to send a UDP datagram.
Let's describe the system calls in the order the program uses them.
The socket( ) system call creates a new endpoint for a communication between two or more processes. In our example program, it is invoked in this way:
sockfd = socket(PF_INET, SOCK_DGRAM, 0);
The socket( ) system call returns a file descriptor. In fact, a socket is similar to an opened file because it is possible to read and write data on it by means of the usual read( ) and write( ) system calls.
The first parameter of the socket( ) system call represents the network architecture that will be used for the communication, as well as a particular network layer protocol adopted by the network architecture. The PF_INET macro denotes both the IPS architecture and Version 4 of the IP protocol (IPv4). Linux supports several different network architectures; a few of them are shown in Table 18-1 earlier in this chapter.
The second parameter of the socket( ) system call specifies the basic model of communication inside the network architecture. As we already know, the IPS architecture offers essentially two alternative models of communication:
SOCK_STREAM
Reliable, connection-oriented, stream-based communication implemented by the TCP transport protocol
SOCK_DGRAM
Unreliable, connection-less, datagram-based communication implemented by the UDP transport protocol
Moreover, the special SOCK_RAW value creates a socket that can be used to directly access the network layer protocol (in our case, the IPv4 protocol).
In general, a network architecture might offer other models of communication. For instance, SOCK_SEQPACKET specifies a reliable, connection-oriented, datagram-based communication, while SOCK_RDM specifies a reliable, connection-less, datagram-based communication; however, neither of them is available in the IPS.
The third parameter of the socket( ) system call specifies the transport protocol to be used in the communication; in general, for any model of communication, the network architecture might offer several different protocols. Passing the value 0 selects the default protocol for the specified communication model. Of course, when using the IPS, the value 0 selects the TCP transport protocol (IPPROTO_TCP) for the SOCK_STREAM model and the UDP protocol (IPPROTO_UDP) for the SOCK_DGRAM model. On the other hand, the SOCK_RAW model allows the programmer to specify any one of the network-layer service protocols of the IPS, for instance the Internet Control Message Protocol (IPPROTO_ICMP), the Exterior Gateway Protocol (IPPROTO_EGP), or the Internet Group Management Protocol (IPPROTO_IGMP).
The socket( ) system call is implemented by means of the sys_socket( ) service routine, which essentially performs three actions:
1. Allocates a descriptor for the new BSD socket (see Section 18.1.3 earlier in this chapter).
2. Initializes the new descriptor according to the specified network architecture, communication model, and protocol.
3. Allocates the first available file descriptor of the process and associates a new file object with that file descriptor and with the socket object.
Let's look more closely at the service routine of the socket( ) system call. After having allocated a new BSD socket, the function must initialize it according to the given network architecture, communication model, and protocol.
For every known network architecture, the kernel stores a pointer to an object of type net_proto_family in the net_families array. Essentially, the object just defines the create method, which is invoked whenever the kernel initializes a new socket of that network architecture.
The create method corresponding to the PF_INET architecture is implemented by inet_create( ). This function checks whether the communication model and the protocol specified as parameters of the socket( ) system call are compatible with the IPS network architecture; then it allocates and initializes a new INET socket and links it to the parent BSD socket.
Before terminating, the socket( )'s service routine allocates a new file object and a new dentry object for the sockfs's file of the socket; then it associates these objects with the process that raised the system call through a new file descriptor (see Section 12.2.6).
As far as the VFS is concerned, any file associated with a socket is in no way special. The corresponding dentry object and inode object are included in the dentry cache and in the inode cache, respectively. The process that created the socket can access the file by means of the system calls that act on already opened files — that is, the system calls that receive a file descriptor as a parameter. Of course, the file object methods are implemented by functions that operate on the socket rather than on the file.
As far as the User Mode process is concerned, however, the socket's file is somewhat peculiar. In fact, a process can never issue an open( ) system call on such a file because it never appears on the system directory tree (remember that the sockfs special filesystem has no visible mount point). For the same reason, it is not possible to remove a socket file through the unlink( ) system call: the inodes belonging to the sockfs filesystem are automatically destroyed by the kernel whenever the socket is closed (released).
Once the socket( ) system call completes, a new socket is created and initialized. It represents a new communication channel that can be identified by the following five elements: protocol, local IP address, local port number, remote IP address, and remote port number.
Only the "protocol" element has been set so far. Hence, the next action of the User Mode process consists of setting the "local IP address" and the "local port number." These two elements identify the process that is sending packets onto the socket so the receiving process on the remote machine can determine who is talking and where the answers should be sent.[3]
[3] Actually, when a process uses the UDP protocol, it can omit the invocation of the bind( ) system call. In this case, the kernel automatically assigns a local address and a local port number to the socket as soon as the program issues a connect( ) or listen( ) system call.
The corresponding instructions in our simple program are the following:
struct sockaddr_in addr_local;
addr_local.sin_family = AF_INET;
addr_local.sin_port = htons(50000);
addr_local.sin_addr.s_addr = htonl(0xc0a050f0); /* 192.160.80.240 */
bind(sockfd, (struct sockaddr *) &addr_local, sizeof(struct sockaddr_in));
The addr_local local variable is of type struct sockaddr_in and represents an IPS identifier for a socket. It includes three significant fields:
sin_family
The protocol family (AF_INET, AF_INET6, or AF_PACKET; this is the same as the macros in Table 18-1).
sin_port
The port number.
sin_addr
The network address. In the IPS architecture, it is composed of a single 32-bit field s_addr storing the IP address.
Therefore, our program sets the fields of the addr_local variable to the protocol family AF_INET, the port number 50,000, and the IP address 192.160.80.240. Notice how the dotted notation of the IP address is translated into a hexadecimal number.
In the 80x86 architecture, numbers are represented in the "little endian" format (the byte at the lower address is the least significant one), while the IPS architecture requires that numbers be represented in the "big endian" format (the byte at the lower address is the most significant one). Several functions, such as htons( ) and htonl( ), are used to ensure that data is sent in the network byte order; other functions, such as ntohs( ) and ntohl( ), ensure that received data is converted from the network to the host byte order.
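To make the translation concrete, the following sketch packs the four decimal fields of a dotted address into a 32-bit number and shows the byte sequence that htonl( ) produces in memory. The helper names ipv4_from_octets( ) and wire_bytes( ) are ours and exist only for this illustration.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Pack a.b.c.d into a host-order number: 192.160.80.240 -> 0xc0a050f0. */
uint32_t ipv4_from_octets(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/* Store in out[] the bytes of host_value as they lie in memory after
 * conversion to network byte order (most significant byte first). */
void wire_bytes(uint32_t host_value, uint8_t out[4])
{
    uint32_t net = htonl(host_value);

    memcpy(out, &net, sizeof(net));
}
```

Whatever the byte order of the host, wire_bytes(0xc0a050f0, out) leaves the bytes 0xc0, 0xa0, 0x50, 0xf0 in out[ ], in that order; on a big endian host htonl( ) simply returns its argument unchanged.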
The bind( ) system call receives as parameters the socket file descriptor and the address of addr_local. It also receives the length of the struct sockaddr_in data structure; in fact, bind( ) can be used for sockets of any network architecture, as well as for Unix sockets and any different type of socket that has addresses of different length.
The sys_bind( ) service routine copies the data of the addr_local variable into the kernel address space, retrieves the address of the BSD socket object (struct socket) that corresponds to the file descriptor, and invokes its bind method. In the IPS architecture, this method is implemented by the inet_bind( ) function.
The inet_bind( ) function performs essentially the following operations:
1. Invokes the inet_addr_type( ) function to check whether the IP address passed to the bind( ) system call corresponds to the address of some network card device of the host; if not, it returns an error code. However, the User Mode program may pass the special IP address INADDR_ANY (0.0.0.0), which essentially delegates to the kernel the task of assigning the IP sender address.
2. If the port number passed to the bind( ) system call is smaller than 1,024, checks whether the User Mode process has superuser privileges (this is the CAP_NET_BIND_SERVICE capability; see Section 20.1.1). However, the User Mode process may pass the value 0 as the port number; the kernel assigns a random, unused port number (see below).
3. Sets the rcv_saddr and saddr fields of the INET socket object with the IP address passed to the system call (the former field is used when looking in the routing table, while the latter is included in the header of outgoing packets). Usually, the fields hold the same value, except for special transmission modes like broadcast and multicasting.
4. Invokes the get_port protocol method of the INET socket object to check whether there already exists an INET socket for the transport protocol using the same local port number and IP address as the one being initialized. For IPv4 sockets using the UDP transport protocol, the method is implemented by the udp_v4_get_port( ) function. To speed up the lookup operation, the function uses a per-protocol hash table. Moreover, if the User Mode program specified a value of 0 for the port, the function assigns an unused number to the socket.
5. Stores the local port number in the sport field of the INET socket object.
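The two special values mentioned above, INADDR_ANY and port 0, can be tried out from User Mode. The following sketch binds a UDP socket with both of them and then uses getsockname( ) to read back the port number chosen by the kernel; bind_any_port( ) is a hypothetical helper name, not a library function.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a new UDP socket to INADDR_ANY and port 0, letting the kernel pick
 * the local port. On success, returns the socket descriptor and stores
 * the assigned port (host byte order) in *port; returns -1 on error. */
int bind_any_port(uint16_t *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(PF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(0);                 /* 0: any unused port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* 0.0.0.0: any local address */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}
```

A second UDP socket that tries to bind the same address and port while the first one is still open should be refused, which is the duplicate check described in step 4.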
The next operation of the User Mode process consists of setting the "remote IP address" and the "remote port number," so the kernel knows where datagrams written to the socket have to be sent. This is achieved by invoking the connect( ) system call.
It is important to observe that a User Mode program is in no way obliged to connect a UDP socket to a destination host. In fact, the program may use the sendto( ) and sendmsg( ) system calls to transmit datagrams over the socket, each time specifying the destination host's IP address and port number. Similarly, the program may receive datagrams from a UDP socket by invoking the recvfrom( ) and recvmsg( ) system calls. However, the connect( ) system call is required if the User Mode program transfers data on the socket by means of the read( ) and write( ) system calls.
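For comparison, here is a minimal sketch of the unconnected style, in which the destination is supplied with every datagram. The helpers send_one( ) and recv_one( ) are names of our own; they are thin wrappers around sendto( ) and recvfrom( ).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send one datagram to ip:port on a socket that was never connected.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_one(int fd, const char *ip, uint16_t port,
                 const void *buf, size_t len)
{
    struct sockaddr_in dest;

    memset(&dest, 0, sizeof(dest));
    dest.sin_family = AF_INET;
    dest.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dest.sin_addr) != 1)
        return -1;
    return sendto(fd, buf, len, 0, (struct sockaddr *)&dest, sizeof(dest));
}

/* Receive one datagram and report the sender's address in *from. */
ssize_t recv_one(int fd, void *buf, size_t len, struct sockaddr_in *from)
{
    socklen_t fromlen = sizeof(*from);

    return recvfrom(fd, buf, len, 0, (struct sockaddr *)from, &fromlen);
}
```

Since the sending socket in this style is typically not bound either, the kernel assigns it a local port on the first transmission; the receiver sees that port in the from structure.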
Since our program is going to use the write( ) system call to send its datagram, it invokes connect( ) to set up the destination of the message. The relevant instructions are:
struct sockaddr_in addr_remote;
addr_remote.sin_family = AF_INET;
addr_remote.sin_port = htons(49152);
inet_pton(AF_INET, "192.160.80.110", &addr_remote.sin_addr);
connect(sockfd, (struct sockaddr *) &addr_remote, sizeof(struct sockaddr_in));
The program initializes the addr_remote local variable by writing into it the IP address 192.160.80.110 and the port number 49,152. This is very similar to the initialization of the addr_local variable discussed in the previous section; however, this time the program invoked the inet_pton( ) library helper function to convert a string representing the IP address in dotted notation into a number in the network order format.
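The two ways of filling in an address are equivalent, as the following sketch suggests: it parses a dotted string with inet_pton( ) and converts the result back to a host-order number, so that it can be compared with a hexadecimal constant like the one passed to htonl( ) earlier. The helper name parse_ipv4( ) is ours.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/* Parse a dotted-decimal IPv4 address. Returns 1 and stores the address
 * in host byte order in *host_order, or returns 0 if text is not a valid
 * address (inet_pton() demands all four decimal fields). */
int parse_ipv4(const char *text, uint32_t *host_order)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, text, &addr) != 1)
        return 0;
    /* inet_pton() stores the address in network byte order. */
    *host_order = ntohl(addr.s_addr);
    return 1;
}
```

For the remote address of our program, parse_ipv4("192.160.80.110", &v) sets v to 0xc0a0506e, which differs from the local address 0xc0a050f0 only in the last byte.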
The connect( ) system call receives the same parameters as the bind( ) system call. It copies the data of the addr_remote variable into the kernel address space, retrieves the address of the BSD socket object (struct socket) corresponding to the file descriptor, and invokes its connect method. In the IPS architecture, this method is implemented by either the inet_dgram_connect( ) function for UDP or the inet_stream_connect( ) function for TCP.
Our simple program uses a UDP socket, so let's describe what the inet_dgram_connect( ) function does:
1. If the socket does not have a local port number, invokes inet_autobind( ) to automatically assign an unused value. In our case, the program issued a bind( ) system call before invoking connect( ), but an application using UDP is not really obliged to do so.
2. Invokes the connect method of the INET socket object.
The UDP protocol implements the INET socket's connect method by means of the udp_connect( ) function, which executes the following actions:
1. If the INET socket already has a destination host, removes it from the destination cache (which is the dst_cache field of the sock object; see Section 18.1.5 earlier in this chapter).
2. Invokes the ip_route_connect( ) function to establish a route to the host identified by the IP address passed as a parameter of connect( ). In turn, this function invokes ip_route_output_key( ) to search for an entry corresponding to the route in the route cache (see Section 18.1.6.2 earlier in this chapter). If the route cache does not include the desired entry, ip_route_output_key( ) invokes ip_route_output_slow( ) to look up a suitable entry in the FIB (see Section 18.1.6.1 earlier in this chapter). Let's assume that, once this step terminates, a route is found, so the address of a suitable rtable object is determined.
3. Initializes the daddr field of the INET socket object with the remote IP address found in the rtable object. Usually, it coincides with the IP address specified by the user as a parameter of the connect( ) system call.
4. Initializes the dport field of the INET socket object with the remote port number specified as a parameter of the connect( ) system call.
5. Puts the value TCP_ESTABLISHED in the state field of the INET socket object (when used by UDP, the flag indicates that the INET socket is "connected" to a destination host).
6. Sets the dst_cache entry of the sock object to the address of the dst_entry object embedded in the rtable object (see Section 18.1.5 earlier in this chapter).
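The outcome of these steps is visible from User Mode: after connect( ) returns, getpeername( ) reports the remote address and port recorded in the socket, and getsockname( ) reports the local ones, including a port assigned automatically when the program did not call bind( ). The sketch below shows only the connecting side; udp_connect( ) is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket and connect it to ip:port without binding first.
 * No packet is sent: for UDP, connect() only records the destination.
 * Returns the socket descriptor, or -1 on error. */
int udp_connect(const char *ip, uint16_t port)
{
    struct sockaddr_in remote;
    int fd = socket(PF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&remote, 0, sizeof(remote));
    remote.sin_family = AF_INET;
    remote.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &remote.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&remote, sizeof(remote)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

If no route to the destination exists, connect( ) fails and udp_connect( ) returns -1; this corresponds to the case, excluded in step 2, in which the route lookup finds nothing.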
Finally, our example program is ready to send messages to the remote host; it simply writes the data onto the socket:
write(sockfd, mesg, strlen(mesg)+1);
The write( ) system call triggers the write method of the file object associated with the sockfd file descriptor. For socket files, this method is implemented by the sock_write( ) function, which performs the following actions:
1. Determines the address of the socket object embedded in the file's inode.
2. Allocates and initializes a "message header"; namely, a msghdr data structure, which stores various control information.
3. Invokes the sock_sendmsg( ) function, passing to it the addresses of the socket object and the msghdr data structure. In turn, this function performs the following actions:
a. Invokes scm_send( ) to check the contents of the message header and allocate a scm_cookie (socket control message) data structure, storing into it a few fields of control information distilled from the message header.
b. Invokes the sendmsg method of the socket object, passing to it the addresses of the socket object, message header, and scm_cookie data structure.
c. Invokes scm_destroy( ) to release the scm_cookie data structure.
Since the BSD socket has been set up specifying the UDP protocol, the addresses of the socket object's methods are stored in the inet_dgram_ops table. In particular, the sendmsg method is implemented by the inet_sendmsg( ) function, which extracts the address of the INET socket stored in the BSD socket and invokes the sendmsg method of the INET socket.
Again, since the INET socket has been set up specifying the UDP protocol, the addresses of the sock object's methods are stored in the udp_prot table. In particular, the sendmsg method is implemented by the udp_sendmsg( ) function.
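The message header has a User Mode counterpart in the struct msghdr passed to the sendmsg( ) system call. As a rough analogy of what sock_write( ) prepares inside the kernel (not a copy of the kernel code), the following sketch sends a buffer on a connected socket by building the header by hand; write_via_sendmsg( ) is a name of our own.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send len bytes from buf on a connected socket, as write() would, but
 * going through an explicitly built message header. */
ssize_t write_via_sendmsg(int fd, const void *buf, size_t len)
{
    struct iovec iov;
    struct msghdr msg;

    iov.iov_base = (void *)buf;  /* sendmsg() only reads the data */
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = NULL;         /* no destination: the socket is connected */
    msg.msg_namelen = 0;
    msg.msg_iov = &iov;          /* a single data buffer */
    msg.msg_iovlen = 1;
    msg.msg_control = NULL;      /* no ancillary (control) data */
    msg.msg_controllen = 0;

    return sendmsg(fd, &msg, 0);
}
```

In our example program, write_via_sendmsg(sockfd, mesg, strlen(mesg)+1) would be expected to put the same datagram on the wire as the write( ) call.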
The udp_sendmsg( ) function receives as parameters the addresses of the sock object and the message header (msghdr data structure), and performs the following actions:
1. Allocates a udpfakehdr data structure, which contains the UDP header of the packet to be sent.
2. Determines the address of the rtable describing the route to the destination host from the dst_cache field of the sock object.
3. Invokes ip_build_xmit( ), passing to it the addresses of all relevant data structures, like the sock object, the UDP header, the rtable object, and the address of a UDP-specific function that constructs the packet to be transmitted.
The ip_build_xmit( ) function is used to transmit an IP datagram. It performs the following actions:
1. Invokes sock_alloc_send_skb( ) to allocate a new socket buffer together with the corresponding socket buffer descriptor (see Section 18.1.7 earlier in this chapter).
2. Determines the position inside the socket buffer where the payload shall go (the payload is placed near the end of the socket buffer, so its position depends on the payload size).
3. Writes the IP header on the socket buffer, leaving space for the UDP header.
4. Invokes either udp_getfrag_nosum( ) or udp_getfrag( ) to copy the data of the UDP datagram from the User Mode buffer; the latter function also computes, if required, the checksum of the data and of the UDP header (the UDP standard specifies that this checksum computation be optional).[4]
[4] You might wonder why the IP header is written in the socket buffer before the UDP header. Well, the UDP standard dictates that the checksum, if used, has to be computed on the payload, the UDP header, and the last 12 bytes of the IP header (including the source and destination IP addresses). The simplest way to compute the UDP checksum is thus to write the IP header before the UDP header.
5. Invokes the output method of the dst_entry object, passing to it the address of the socket buffer descriptor.
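The checksum mentioned in step 4 can be sketched in portable C. The code below is not the kernel's implementation; it follows the definition given in the UDP standard (RFC 768), where the sum covers a 12-byte pseudo-header (source address, destination address, a zero byte, the protocol number 17, and the UDP length) followed by the UDP header and the payload. All function names are hypothetical, and addresses are passed in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Add len bytes, read as big endian 16-bit words, to a running sum. */
static uint32_t sum_words(const uint8_t *p, size_t len, uint32_t acc)
{
    while (len > 1) {
        acc += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)                 /* odd trailing byte: pad with zero */
        acc += (uint32_t)p[0] << 8;
    return acc;
}

/* One's complement sum of pseudo-header, UDP header, and payload.
 * udp points to the UDP header; udp_len covers header plus payload. */
static uint16_t udp_sum(uint32_t saddr, uint32_t daddr,
                        const uint8_t *udp, uint16_t udp_len)
{
    const uint8_t pseudo[12] = {
        (uint8_t)(saddr >> 24), (uint8_t)(saddr >> 16),
        (uint8_t)(saddr >> 8),  (uint8_t)saddr,
        (uint8_t)(daddr >> 24), (uint8_t)(daddr >> 16),
        (uint8_t)(daddr >> 8),  (uint8_t)daddr,
        0, 17,                    /* zero byte, protocol number of UDP */
        (uint8_t)(udp_len >> 8), (uint8_t)udp_len
    };
    uint32_t acc = sum_words(pseudo, sizeof(pseudo), 0);

    acc = sum_words(udp, udp_len, acc);
    while (acc >> 16)             /* fold the carries back into 16 bits */
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)acc;
}

/* Value for the checksum field; that field must be zero while summing.
 * A computed value of 0 is sent as 0xffff, since 0 means "no checksum". */
uint16_t udp_checksum(uint32_t saddr, uint32_t daddr,
                      const uint8_t *udp, uint16_t udp_len)
{
    uint16_t csum = (uint16_t)~udp_sum(saddr, daddr, udp, udp_len);

    return csum ? csum : 0xffff;
}

/* Receiver side: with the checksum field filled in, the sum is 0xffff. */
int udp_checksum_ok(uint32_t saddr, uint32_t daddr,
                    const uint8_t *udp, uint16_t udp_len)
{
    return udp_sum(saddr, daddr, udp, udp_len) == 0xffff;
}
```

Because the pseudo-header includes both IP addresses, the checksum cannot be computed until the source and destination addresses of the packet are known.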
The output method of the dst_entry object invokes the function of the data link layer that writes the hardware header (and trailer, if required) of the packet in the buffer.
The output method of the IP's dst_entry object is usually implemented by the ip_output( ) function, which receives as a parameter the address skb of the socket buffer descriptor. In turn, this function essentially performs the following actions:
· Checks whether there is already a suitable hardware header descriptor in the cache by looking at the hh field of the skb->dst destination cache object (see Section 18.1.5 earlier in this chapter). If the field is not NULL, the cache includes the header, so it copies the hardware header into the socket buffer, and then invokes the hh_output method of the hh_cache object.
· Otherwise, if the skb->dst->hh field is NULL, the header must be prepared from scratch. Thus, the function invokes the output method of the neighbour object pointed to by the neighbour field of skb->dst, which is implemented by the neigh_resolve_output( ) function. To compose the header, the latter function invokes a suitable method of the net_device object relative to the network card device that shall transmit the packet, and then inserts the new hardware header in the cache.
Both the hh_output method of the hh_cache object and the output method of the neighbour object end up invoking the dev_queue_xmit( ) function.
The dev_queue_xmit( ) function takes care of queueing the socket buffer for later transmission. In general, network cards are slow devices, and at any given instant there can be many packets waiting to be transmitted. They are usually processed with a First-In, First-Out policy (hence the queue of packets), even if the Linux kernel offers several sophisticated packet scheduling algorithms to be used in high-performance routers. As a general rule, all network card devices define their own queue of packets waiting to be transmitted. Exceptions are virtual devices like the loopback device (lo) and the devices offered by various tunneling protocols, but we don't discuss these further.
A queue of socket buffers is implemented through a complex Qdisc object. Thanks to this data structure, the packet scheduling functions can efficiently manipulate the queue and quickly select the "best" packet to be sent. However, for the purpose of our simple description, the queue is just a list of socket buffer descriptors.
Essentially, dev_queue_xmit( ) performs the following actions:
1. Checks whether the driver of the network device (whose descriptor is stored in the dev field of the socket buffer descriptor) defines its own queue of packets waiting to be transmitted (the address of the Qdisc object is stored in the qdisc field of the net_device object).
2. Invokes the enqueue method of the corresponding Qdisc object to append the socket buffer to the queue.
3. Invokes the qdisc_run( ) function to ensure that the network device is actively sending the packets in the queue.
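Under the simplified view in which the queue is just a list of socket buffer descriptors, the enqueue and run steps can be modeled in a few lines of User Mode C. This is a toy model only: toy_skb, toy_qdisc, and the three functions are hypothetical names, not kernel data structures or interfaces, and "transmitting" a packet here merely records its identifier.

```c
#include <stddef.h>

/* Toy stand-in for a socket buffer descriptor. */
struct toy_skb {
    int id;
    struct toy_skb *next;
};

/* Toy stand-in for a Qdisc: a First-In, First-Out list. */
struct toy_qdisc {
    struct toy_skb *head;
    struct toy_skb *tail;
    size_t len;
};

/* Append a packet at the tail of the queue. */
void toy_enqueue(struct toy_qdisc *q, struct toy_skb *skb)
{
    skb->next = NULL;
    if (q->tail)
        q->tail->next = skb;
    else
        q->head = skb;
    q->tail = skb;
    q->len++;
}

/* Remove and return the packet at the head, or NULL if the queue is empty. */
struct toy_skb *toy_dequeue(struct toy_qdisc *q)
{
    struct toy_skb *skb = q->head;

    if (!skb)
        return NULL;
    q->head = skb->next;
    if (!q->head)
        q->tail = NULL;
    q->len--;
    skb->next = NULL;
    return skb;
}

/* Drain up to max packets, recording their ids in transmission order.
 * Returns the number of packets "sent". */
size_t toy_qdisc_run(struct toy_qdisc *q, int *sent_ids, size_t max)
{
    size_t n = 0;
    struct toy_skb *skb;

    while (n < max && (skb = toy_dequeue(q)) != NULL)
        sent_ids[n++] = skb->id;
    return n;
}
```

A real packet scheduler may pick a packet other than the oldest one; replacing toy_dequeue( ) with a different selection rule is where such a policy would go in this model.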
The chain of functions executed by the sys_write( ) system call service routine ends here. As you see, the final result consists of a new packet that is appended to the transmit queue of a network card device.
In the next section, we look at how our packet is processed by the network card.