This chapter is mostly focused on how the kernel handles the transmission of network packets. We have already glimpsed many crucial data structures of the networking code, so we will just give a brief description of the other side of the story; namely, how a network packet is received.
The main difference between transmitting and receiving is that the kernel cannot predict when a packet will arrive at a network card device. Therefore, the networking code that takes care of receiving the packets runs in interrupt handlers and deferrable functions.
Let's sketch a typical chain of events occurring when a packet carrying the right hardware address (card identifier) arrives at the network device.
1. The network device saves the packet in a buffer in the device's memory (the card usually keeps several packets at once in a circular buffer).
2. The network device raises an interrupt.
3. The interrupt handler allocates and initializes a new socket buffer for the packet.
4. The interrupt handler copies the packet from the device's memory to the socket buffer.
5. The interrupt handler invokes a function (such as the eth_type_trans( ) function for Ethernet and IEEE 802.3) to determine the protocol of the packet encapsulated in the data link frame.
6. The interrupt handler invokes the netif_rx( ) function to notify the Linux networking code that a new packet has arrived and should be processed.
Of course, the interrupt handler is specific to the network card device. Many device drivers try to be nice to the other devices in the system and move lengthy tasks, such as allocating a socket buffer or copying a packet, to deferrable functions.
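The six steps above can be mimicked in User Mode with a short sketch. It is not driver code: the toy_card ring, the toy_skb structure, and every toy_ function are hypothetical stand-ins invented for illustration, and the last step simply returns the buffer where a real handler would invoke netif_rx( ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_RING_SLOTS 4      /* hypothetical size of the card's circular buffer */
#define TOY_FRAME_MAX  1514   /* largest Ethernet frame without the trailing FCS */
#define TOY_ETH_HLEN   14     /* destination (6) + source (6) + type/length (2) */

/* Hypothetical stand-in for the memory on the network card. */
struct toy_card {
    uint8_t  slot[TOY_RING_SLOTS][TOY_FRAME_MAX];
    size_t   len[TOY_RING_SLOTS];
    unsigned head;            /* next slot filled by the "hardware" */
    unsigned tail;            /* next slot drained by the handler */
};

/* Hypothetical stand-in for a socket buffer. */
struct toy_skb {
    uint8_t  data[TOY_FRAME_MAX];
    size_t   len;
    uint16_t protocol;        /* protocol named by the data link header */
};

/* Step 1: the card stores a frame. Returns 0, or -1 if the ring is full. */
static int toy_card_store(struct toy_card *c, const uint8_t *frame, size_t len)
{
    assert(c != NULL && frame != NULL);
    if (len > TOY_FRAME_MAX || c->head - c->tail == TOY_RING_SLOTS)
        return -1;
    unsigned i = c->head % TOY_RING_SLOTS;
    memcpy(c->slot[i], frame, len);
    c->len[i] = len;
    c->head++;                /* step 2 would raise the interrupt here */
    return 0;
}

/* Step 5: read the type/length field that follows the two addresses.
 * Values of 0x0600 or more are protocol types; smaller values are
 * IEEE 802.3 lengths, which this sketch does not decode (it returns 0). */
static uint16_t toy_type_trans(const uint8_t *frame, size_t len)
{
    if (len < TOY_ETH_HLEN)
        return 0;
    uint16_t field = (uint16_t)((frame[12] << 8) | frame[13]);
    return field >= 0x0600 ? field : 0;
}

/* Steps 3 to 6: what the interrupt handler does for one stored frame. */
static struct toy_skb *toy_rx_interrupt(struct toy_card *c)
{
    assert(c != NULL);
    if (c->head == c->tail)
        return NULL;                              /* nothing was stored */
    unsigned i = c->tail % TOY_RING_SLOTS;
    c->tail++;                                    /* free the slot in any case */
    struct toy_skb *skb = malloc(sizeof *skb);    /* step 3: allocate */
    if (skb == NULL)
        return NULL;                              /* no memory: frame dropped */
    skb->len = c->len[i];
    memcpy(skb->data, c->slot[i], skb->len);      /* step 4: copy */
    skb->protocol = toy_type_trans(skb->data, skb->len);   /* step 5 */
    return skb;               /* step 6: a real handler calls netif_rx( ) */
}
```

A caller stores a frame with toy_card_store( ), then invokes toy_rx_interrupt( ) and releases the returned buffer with free( ).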
The netif_rx( ) function is the main entry point of the receiving code of the networking layer (above the network card device driver). The kernel uses a per-CPU queue for the packets that have been received from the network devices and are waiting to be processed by the various protocol stack layers. The function essentially appends the new packet to this queue and invokes cpu_raise_softirq( ) to schedule the activation of the NET_RX_SOFTIRQ softirq. (Remember that the same softirq can be executed concurrently on several CPUs, which is the reason for the per-CPU queue of received packets.)
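To make the idea of a per-CPU queue concrete, here is a minimal User Mode sketch. The names, the number of CPUs, the deliberately tiny queue limit, and the bit used for the pending flag are all assumptions made for illustration; the real function also disables local interrupts, which the sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS      4   /* hypothetical number of CPUs */
#define TOY_MAX_BACKLOG  2   /* hypothetical, deliberately tiny queue limit */
#define TOY_NET_RX_BIT   0   /* hypothetical bit for the receive softirq */

struct toy_pkt {
    struct toy_pkt *next;
    int id;
};

/* One queue and one pending mask per CPU, so no CPU touches another's data. */
struct toy_cpu {
    struct toy_pkt *head, *tail;
    unsigned        qlen;
    unsigned long   softirq_pending;
};

static struct toy_cpu toy_cpus[TOY_NR_CPUS];

/* Appends p to the queue of the given CPU and marks the receive softirq
 * as pending there. Returns 0 if queued, -1 if the packet is dropped. */
static int toy_netif_rx(unsigned cpu, struct toy_pkt *p)
{
    assert(cpu < TOY_NR_CPUS && p != NULL);
    struct toy_cpu *q = &toy_cpus[cpu];

    if (q->qlen >= TOY_MAX_BACKLOG)
        return -1;                       /* queue full: drop the packet */

    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;               /* append after the current last */
    else
        q->head = p;                     /* queue was empty */
    q->tail = p;
    q->qlen++;

    q->softirq_pending |= 1UL << TOY_NET_RX_BIT;   /* "raise" the softirq */
    return 0;
}
```

Because each CPU owns its queue, two CPUs that receive interrupts at the same time never contend for the same list.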
The NET_RX_SOFTIRQ softirq is implemented by the net_rx_action( ) function, which essentially executes the following operations:[5]
[5] We omit discussing several special cases, such as when the packet has to be quickly forwarded to another network card device or when the host is acting as a bridge that links two local area networks as if they were a single one.
1. Extracts the first packet from the queue. If the queue is empty, it terminates.
2. Determines the network layer protocol number encoded in the data link layer.
3. Invokes a suitable function of the network layer protocol.
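The three operations above amount to a dequeue-and-dispatch loop, which the following User Mode sketch imitates. The table of handlers and all toy_ names are hypothetical; only the two protocol numbers (0x0800 for IP and 0x0806 for ARP) are the standard Ethernet values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_frame {
    struct toy_frame *next;
    uint16_t protocol;        /* network layer protocol number */
};

typedef int (*toy_handler_t)(struct toy_frame *);

struct toy_ptype {
    uint16_t      protocol;
    toy_handler_t func;
};

/* Counters let a caller observe which handler ran. */
static int toy_ip_count, toy_arp_count;

static int toy_ip_rcv(struct toy_frame *f)  { (void)f; toy_ip_count++;  return 0; }
static int toy_arp_rcv(struct toy_frame *f) { (void)f; toy_arp_count++; return 0; }

/* Hypothetical table mapping protocol numbers to receive functions. */
static const struct toy_ptype toy_ptypes[] = {
    { 0x0800, toy_ip_rcv  },  /* IP  */
    { 0x0806, toy_arp_rcv },  /* ARP */
};

/* Drains the queue; returns how many packets reached a handler. */
static int toy_net_rx_action(struct toy_frame **queue)
{
    assert(queue != NULL);
    int delivered = 0;
    struct toy_frame *f;

    while ((f = *queue) != NULL) {       /* step 1: extract the first packet */
        *queue = f->next;
        f->next = NULL;

        size_t n = sizeof toy_ptypes / sizeof toy_ptypes[0];
        for (size_t i = 0; i < n; i++) {
            if (toy_ptypes[i].protocol == f->protocol) {   /* step 2 */
                toy_ptypes[i].func(f);                     /* step 3 */
                delivered++;
                break;
            }
        }
        /* A packet with an unknown protocol number is simply dropped. */
    }
    return delivered;                    /* queue empty: terminate */
}
```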
The corresponding function for the IP protocol is named ip_rcv( ), which essentially executes the following actions:
1. Checks the length and the checksum of the packet and discards it if it is corrupted or truncated.
2. Invokes ip_route_input( ), which initializes the destination cache (dst_entry field) of the socket buffer descriptor. To determine the route followed by the packet, the function looks the route up first in the route cache, and then in the FIB (if the route cache doesn't include a relevant entry). In this way, the kernel determines whether the packet must be forwarded to another host or simply passed to a protocol of the transport layer.
3. Checks to see whether any packet sniffing or other input policy is enforced. In the affirmative case, it handles the packet accordingly; we don't discuss these topics further.
4. Invokes the input method of the dst_entry object of the packet.
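The first of these actions, checking the length and the checksum, can be shown in a few lines of User Mode C. This is a sketch under simplifying assumptions rather than the kernel's code: the function names are hypothetical, and the checksum is computed byte by byte in the style of RFC 1071 instead of with the optimized routines a kernel would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement sum of big-endian words, folded to 16 bits. */
static uint16_t toy_csum_fold(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)p[0] << 8;      /* pad the odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Returns 1 if the IPv4 header looks sane, 0 if the packet should be
 * discarded as corrupted or truncated. len is the number of bytes
 * actually received, starting at the IP header. */
static int toy_ip_header_ok(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);
    if (len < 20)
        return 0;                        /* shorter than a minimal header */
    if ((pkt[0] >> 4) != 4)
        return 0;                        /* not IPv4 */

    size_t ihl = (size_t)(pkt[0] & 0x0f) * 4;    /* header length in bytes */
    if (ihl < 20 || ihl > len)
        return 0;

    size_t tot_len = (size_t)((pkt[2] << 8) | pkt[3]);
    if (tot_len < ihl || tot_len > len)
        return 0;                        /* truncated packet */

    /* A header whose checksum field is correct sums to all ones. */
    return toy_csum_fold(pkt, ihl) == 0xffff;
}
```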
If the packet has to be forwarded to another host, the input method is implemented by the ip_forward( ) function; otherwise, it is implemented by the ip_local_deliver( ) function. Let's follow the latter path.
The ip_local_deliver( ) function takes care of reassembling the original IP datagram, if the datagram has been fragmented along the way. Then the function reads the IP header and determines the type of transport protocol to which the packet belongs. If the transport protocol is TCP, the function ends up invoking tcp_v4_rcv( ); if the transport protocol is UDP, the function ends up invoking udp_rcv( ).
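The two decisions just described, choosing an input method from the route and then choosing a transport protocol from the IP header, can be sketched with a function pointer and a switch. Everything here is a simplification with hypothetical names: the "route lookup" is a single address comparison, and reassembly of fragments is left out. The protocol numbers 6 (TCP) and 17 (UDP) are the standard values of the IPv4 protocol field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_verdict { TOY_FORWARDED, TOY_TO_TCP, TOY_TO_UDP, TOY_DROPPED };

struct toy_ippkt {
    uint32_t daddr;           /* destination address, host byte order */
    uint8_t  protocol;        /* protocol field of the IP header */
    /* Plays the role of the input method of the destination cache entry. */
    enum toy_verdict (*input)(struct toy_ippkt *);
};

/* Stand-in for forwarding the packet to another host. */
static enum toy_verdict toy_forward(struct toy_ippkt *p)
{
    (void)p;
    return TOY_FORWARDED;
}

/* Stand-in for local delivery: demultiplex on the transport protocol. */
static enum toy_verdict toy_local_deliver(struct toy_ippkt *p)
{
    switch (p->protocol) {
    case 6:
        return TOY_TO_TCP;    /* the kernel would end up in tcp_v4_rcv( ) */
    case 17:
        return TOY_TO_UDP;    /* the kernel would end up in udp_rcv( ) */
    default:
        return TOY_DROPPED;   /* protocol not handled by this sketch */
    }
}

/* Grossly simplified "route lookup": local if the address is ours. */
static void toy_route_input(struct toy_ippkt *p, uint32_t local_addr)
{
    p->input = (p->daddr == local_addr) ? toy_local_deliver : toy_forward;
}

/* Resolves the route, then invokes whichever input method was chosen. */
static enum toy_verdict toy_ip_input(struct toy_ippkt *p, uint32_t local_addr)
{
    assert(p != NULL);
    toy_route_input(p, local_addr);
    return p->input(p);
}
```

The caller never tests whether the packet is local; it just invokes the method that the route lookup installed.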
Let's continue following the UDP path. The udp_rcv( ) function essentially executes the following actions:
1. Invokes the udp_v4_lookup( ) function to find the INET socket to which the UDP datagram has been sent (by looking at the port number inside the UDP header). The kernel keeps the INET sockets in a hash table so that the lookup operation is reasonably fast. If the UDP datagram is not associated with a socket, the function discards the packet and terminates.
2. Invokes udp_queue_rcv_skb( ), which in turn invokes sock_queue_rcv_skb( ), to append the packet to a queue of the INET socket (receive_queue field of the sock object) and to invoke the data_ready method of the sock object.
3. Releases the socket buffer and the socket buffer descriptor.
INET sockets implement the data_ready method by means of the sock_def_readable( ) function, which essentially wakes up any process sleeping in the socket's wait queue (listed in the sleep field of the sock object).
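The lookup, the queueing, and the data_ready notification fit together as in the following User Mode sketch. The hash function (the low bits of the port number), the table size, and all toy_ names are assumptions chosen for brevity, and "waking up a process" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HTABLE_SIZE 128   /* hypothetical size; must be a power of two */

struct toy_dgram {
    struct toy_dgram *next;
    uint16_t dport;           /* destination port from the UDP header */
};

struct toy_sock {
    struct toy_sock  *hash_next;          /* collision chain */
    uint16_t          port;
    struct toy_dgram *rq_head, *rq_tail;  /* receive queue */
    int               wakeups;            /* stands in for the wait queue */
    void            (*data_ready)(struct toy_sock *);
};

static struct toy_sock *toy_udp_hash[TOY_HTABLE_SIZE];

/* Default data_ready method: here it only records that a wakeup occurred. */
static void toy_def_readable(struct toy_sock *sk)
{
    sk->wakeups++;
}

/* Initializes sk and inserts it in the hash table under the given port. */
static void toy_udp_bind(struct toy_sock *sk, uint16_t port)
{
    assert(sk != NULL);
    unsigned b = port & (TOY_HTABLE_SIZE - 1);

    sk->port = port;
    sk->rq_head = sk->rq_tail = NULL;
    sk->wakeups = 0;
    sk->data_ready = toy_def_readable;
    sk->hash_next = toy_udp_hash[b];
    toy_udp_hash[b] = sk;
}

/* Walks one collision chain looking for the socket bound to port. */
static struct toy_sock *toy_udp_lookup(uint16_t port)
{
    struct toy_sock *sk = toy_udp_hash[port & (TOY_HTABLE_SIZE - 1)];

    while (sk != NULL && sk->port != port)
        sk = sk->hash_next;
    return sk;
}

/* Returns 0 if the datagram was queued, -1 if no socket wants it. */
static int toy_udp_rcv(struct toy_dgram *d)
{
    assert(d != NULL);
    struct toy_sock *sk = toy_udp_lookup(d->dport);

    if (sk == NULL)
        return -1;                       /* no socket: discard */

    d->next = NULL;
    if (sk->rq_tail != NULL)
        sk->rq_tail->next = d;
    else
        sk->rq_head = d;
    sk->rq_tail = d;

    sk->data_ready(sk);                  /* notify any sleeping reader */
    return 0;
}
```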
There is one final step to describe: what happens when a process reads from the BSD socket owning our INET socket. The read( ) system call triggers the read method of the file object associated with the socket's special file. This method is implemented by the sock_read( ) function, which in turn invokes the sock_recvmsg( ) function. The latter function is similar to sock_sendmsg( ) described earlier. Essentially, it invokes the recvmsg method of the BSD socket. In turn, this method (inet_recvmsg( )) invokes the recvmsg method of the INET socket; that is, either the tcp_recvmsg( ) or the udp_recvmsg( ) function.
Finally, the udp_recvmsg( ) function executes the following actions:
1. Invokes the skb_recv_datagram( ) function to extract the first packet from the receive_queue queue of the INET socket and return the address of the corresponding socket buffer descriptor. If the queue is empty, the function blocks the current process (unless the read operation is nonblocking).
2. If the UDP datagram carries a checksum, checks that the message has not been corrupted during transmission (actually, this step is performed at the same time as Step 3).
3. Copies the payload of the UDP datagram into the User Mode buffer.
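Verifying the checksum and copying the payload can share a single pass over the data, as this User Mode sketch suggests. It is only an illustration under stated assumptions: the names are hypothetical, the checksum covers the payload alone (a real UDP checksum also covers the UDP header and a pseudo-header built from the IP addresses), a stored value of zero means that the sender computed no checksum, and a destination buffer that is too short is treated as an error rather than as a truncated read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_udp_msg {
    uint16_t       check;     /* stored checksum; 0 means "not computed" */
    size_t         len;       /* payload length in bytes */
    const uint8_t *payload;
};

/* Copies len bytes from src to dst and, in the same loop, accumulates
 * the 16-bit one's-complement sum of the data. Returns the folded sum. */
static uint16_t toy_copy_and_sum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += (uint32_t)((src[i] << 8) | src[i + 1]);
    }
    if (i < len) {                       /* odd trailing byte, zero padded */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Returns the number of bytes copied, or -1 if the buffer is too small
 * or the checksum does not match. On a mismatch the caller must ignore
 * whatever was already written to ubuf. */
static long toy_udp_recvmsg(const struct toy_udp_msg *m,
                            uint8_t *ubuf, size_t ubuf_len)
{
    assert(m != NULL && ubuf != NULL);
    if (m->len > ubuf_len)
        return -1;

    uint16_t sum = toy_copy_and_sum(ubuf, m->payload, m->len);

    if (m->check != 0 && (uint16_t)~sum != m->check)
        return -1;                       /* corrupted in transit */
    return (long)m->len;
}
```

Touching each byte once matters because the payload has to travel through the CPU anyway on its way to the User Mode buffer.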