18.1 Main Networking Data Structures

In this section, we shall give a general idea of how Linux implements the lower layers of networking.

18.1.1 Network Architectures

A network architecture describes how a specific computer network is made. The architecture defines a set of layers, each of which should have a well-defined purpose; programs in each layer communicate by using a shared set of rules and conventions (a so-called protocol).

Generally speaking, Linux supports a large number of different network architectures; some of them are listed in Table 18-1.

Table 18-1. Some network architectures supported by Linux
Name	Network architecture and/or protocol family
`PF_APPLETALK`	Appletalk
`PF_BLUETOOTH`	Bluetooth
`PF_BRIDGE`	Multiprotocol bridge
`PF_DECnet`	DECnet
`PF_INET`	IPS's IPv4 protocol
`PF_INET6`	IPS's IPv6 protocol
`PF_IPX`	Novell IPX
`PF_LOCAL`, `PF_UNIX`	Unix domain sockets (local communication)
`PF_PACKET`	IPS's IPv4/IPv6 protocol low-level access
`PF_X25`	X25

IPS (Internet Protocol Suite) is the network architecture of the Internet, the well-known internetwork that collects hundreds of thousands of local computer networks all around the world. Sometimes it is also called the TCP/IP network architecture, from the names of the two main protocols that it defines.

18.1.2 Network Interface Cards

A network interface card (NIC) is a special kind of I/O device that does not have a corresponding device file. Essentially, a network card places outgoing data on a line going to remote computer systems and receives packets from those systems into kernel memory.

Starting with BSD, all Unix systems assign a different symbolic name to each network card included in the computer; for instance, the first Ethernet card gets the eth0 name. However, the name does not correspond to any device file and has no corresponding inode in the system directory tree.

Instead of using the filesystem, the system administrator has to set up a relationship between the device name and a network address. Therefore, as we shall see in Section 18.2, BSD Unix introduced a new group of system calls, which has become the standard programming model for network devices.

18.1.3 BSD Sockets

Generally speaking, any operating system must define a common Application Programming Interface (API) between the User Mode program and the networking code. The Linux networking API is based on BSD sockets. They were introduced in Berkeley's Unix 4.1cBSD and are available in almost all Unix-like operating systems, either natively or by means of a User Mode helper library.^[1]

^[1] An alternative API between User Mode programs and networking code is provided by the Transport Layer Interface (TLI), introduced by System V Release 3.0. In general, TLI is implemented as a User Mode library that uses the STREAMS I/O subsystem. As mentioned in Section 1.1, the Linux kernel does not implement the STREAMS I/O subsystem.

A socket is a communication endpoint — the terminal gate of a channel that links two processes. Data is pushed on a terminal gate, and after some delay, shows up at the other gate. The communicating processes may be on different computers; it's up to the kernel's networking code to forward the data between the two endpoints.
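The "two gates of a channel" picture can be tried from User Mode with a pair of connected local sockets. The following is a minimal sketch, not kernel code: the helper name echo_through_socketpair is made up for this example, and both endpoints live in the same process only to keep the example short.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write msg into one end of a connected socket
 * pair and read it back from the other end.
 * Returns the number of bytes received, or -1 on error. */
ssize_t echo_through_socketpair(const char *msg, char *out, size_t outlen)
{
    int sv[2];
    ssize_t n;

    assert(msg != NULL && out != NULL);

    /* PF_UNIX + SOCK_STREAM yields two already-connected endpoints. */
    if (socketpair(PF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Push the data on one gate ... */
    if (write(sv[0], msg, strlen(msg)) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* ... and collect it at the other gate. */
    n = read(sv[1], out, outlen);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

With two processes on different hosts, the calls used to create and connect the endpoints change, but the write-on-one-side, read-on-the-other pattern stays the same.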

Linux implements BSD sockets as files that belong to the sockfs special filesystem (see Section 12.3.1). More precisely, for every new BSD socket, the kernel creates a new inode in the sockfs special filesystem. The attributes of the BSD socket are stored in a socket data structure, which is an object included in the filesystem-specific u.socket_i field of the sockfs's inode.

The most important fields of the BSD socket object are:

inode

Points to the sockfs's inode object

file

Points to the file object of the sockfs's file

state

Stores the connection status of the socket: SS_FREE (not allocated), SS_UNCONNECTED (not yet connected), SS_CONNECTING (in process of connecting), SS_CONNECTED (connected), SS_DISCONNECTING (in process of disconnecting).

ops

Points to a proto_ops data structure, which stores the methods of the socket object; they are listed in Table 18-2. Most of the methods refer to system calls that operate on sockets. Each network architecture implements the methods by means of its own functions; hence, the same system call acts differently according to the networking architecture to which the target socket belongs.

Table 18-2. The methods of the BSD socket object
Method	Description
`release`	Close the socket
`bind`	Assign a local address (a name)
`connect`	Either establish a connection (TCP) or assign a remote address (UDP)
`socketpair`	Create a pair of sockets for two-way data flow
`accept`	Wait for connection requests
`getname`	Return the local address
`ioctl`	Implement `ioctl( )`'s commands
`listen`	Initialize the socket to accept connection requests
`shutdown`	Close a half or both halves of a full-duplex connection
`setsockopt`	Set the value of the socket flags
`getsockopt`	Get the value of the socket flags
`sendmsg`	Send a packet on the socket
`recvmsg`	Receive a packet from the socket
`mmap`	File memory-mapping (not used by network sockets)
`sendpage`	Copy data directly from/to a file (`sendfile( )` system call)

sk

Points to the low-level struct sock socket descriptor (see the next section).
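To make the role of the ops field concrete, here is a heavily cut-down User Mode sketch of a socket object whose connect operation is dispatched through a table of methods. Only the state names and the proto_ops idea come from the text; the two "families", their functions, and sock_connect are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Connection states listed for the state field. */
typedef enum {
    SS_FREE = 0,        /* not allocated */
    SS_UNCONNECTED,     /* not yet connected */
    SS_CONNECTING,      /* in process of connecting */
    SS_CONNECTED,       /* connected */
    SS_DISCONNECTING    /* in process of disconnecting */
} socket_state;

struct socket;

/* Cut-down method table: only one of the methods of Table 18-2. */
struct proto_ops {
    int (*connect)(struct socket *sock);
};

/* Cut-down BSD socket object; the inode and file pointers are omitted. */
struct socket {
    socket_state state;
    const struct proto_ops *ops;
    void *sk;           /* low-level descriptor, opaque in this sketch */
};

/* Hypothetical connection-oriented family: connecting takes time. */
static int stream_like_connect(struct socket *sock)
{
    sock->state = SS_CONNECTING;
    return 0;
}

/* Hypothetical family whose connect completes at once. */
static int instant_connect(struct socket *sock)
{
    sock->state = SS_CONNECTED;
    return 0;
}

static const struct proto_ops stream_like_ops = { stream_like_connect };
static const struct proto_ops instant_ops = { instant_connect };

/* What a connect-style entry point would do: dispatch through ops. */
int sock_connect(struct socket *sock)
{
    assert(sock != NULL && sock->ops != NULL);
    if (sock->state != SS_UNCONNECTED || sock->ops->connect == NULL)
        return -1;
    return sock->ops->connect(sock);
}
```

The caller never names a network architecture: the same sock_connect call leaves the two sockets in different states because each one carries its own ops pointer.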

18.1.4 INET Sockets

INET sockets are data structures of type struct sock. Any BSD socket that belongs to the IPS network architecture stores the address of an INET socket in the sk field of the socket object.

INET sockets are required because the socket objects (describing BSD sockets) include only fields that are meaningful to all network architectures. However, the kernel must also keep track of several other bits of information for any socket of every specific network architecture. For instance, in each INET socket, the kernel records the local and remote IP addresses, the local and remote port numbers, the relative transport protocol, the queue of packets that were received from the socket, the queue of packets waiting to be sent to the socket, and several tables of methods that handle the packets traveling on the socket. These attributes are stored, together with many others, in the INET socket.

The INET socket object also defines some methods specific to the type of transport protocol adopted (TCP or UDP). The methods are stored in a data structure of type proto and are listed in Table 18-3.

Table 18-3. The methods of the INET socket object
Method	Description
`close`	Close the socket
`connect`	Either establish a connection or assign a remote address
`disconnect`	Relinquish an established connection
`accept`	Wait for connection request
`ioctl`	Implement `ioctl( )`'s commands
`init`	INET socket object constructor
`destroy`	INET socket object destructor
`shutdown`	Close a half or both halves of a full-duplex connection
`setsockopt`	Set the value of the socket flags
`getsockopt`	Get the value of the socket flags
`sendmsg`	Send a packet on the socket
`recvmsg`	Receive a packet from the socket
`bind`	Assign a local address (a name)
`backlog_rcv`	Callback function invoked when receiving a packet
`hash`	Add the INET socket to the per-protocol hash table
`unhash`	Remove the INET socket from the per-protocol hash table
`get_port`	Assign a port number to the INET socket

As you may notice, many methods replicate the methods of the BSD socket object (Table 18-2). Actually, a BSD socket method usually invokes the corresponding INET socket method, if it is defined.
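The delegation from a BSD socket method to the corresponding INET socket method can be sketched as follows. This is an illustration under simplifying assumptions, not the kernel's code: the structures keep only the fields needed here, and bsd_release, demo_close, and demo_proto are invented names.

```c
#include <assert.h>
#include <stddef.h>

struct sock;

/* Cut-down transport method table (compare Table 18-3). */
struct proto {
    const char *name;
    void (*close)(struct sock *sk);
};

/* Cut-down INET socket. */
struct sock {
    const struct proto *prot;   /* transport-specific methods */
    int closed;                 /* set by the close method in this sketch */
};

/* Cut-down BSD socket. */
struct socket {
    struct sock *sk;
};

/* Hypothetical transport-level close method. */
static void demo_close(struct sock *sk)
{
    sk->closed = 1;
}

static const struct proto demo_proto = { "demo", demo_close };

/* BSD-level release: forward to the transport's close, if it is defined. */
int bsd_release(struct socket *sock)
{
    struct sock *sk;

    assert(sock != NULL);
    sk = sock->sk;
    if (sk == NULL)
        return 0;               /* nothing attached, nothing to do */
    if (sk->prot != NULL && sk->prot->close != NULL)
        sk->prot->close(sk);
    sock->sk = NULL;            /* detach the INET socket */
    return 0;
}
```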

The sock object includes no fewer than 80 fields; many of them are pointers to other objects, tables of methods, or other data structures that deserve a detailed description by themselves. Rather than including a boring list of field names, we introduce a few fields of the sock object whenever we encounter them in the rest of the chapter.

18.1.5 The Destination Cache

As we shall see in Section 18.2.2, processes usually "assign names" to sockets; that is, they specify the remote IP address and port number of the host that should receive the data written onto the socket. The kernel must also make available to the processes reading the socket every packet received from the remote host that carries the proper port number.

Actually, the kernel has to keep in memory a bunch of data about the remote host identified by an in-use socket. To speed up the networking code, this data is stored in a so-called destination cache, whose entries are objects of type dst_entry. Each INET socket stores in the dst_cache field a pointer to a single dst_entry object, which corresponds to the destination host bound to the socket.

A dst_entry object stores a lot of data used by the kernel whenever it sends a packet to the corresponding remote host. For instance, it includes:

· A pointer to a net_device object describing the network device (for instance, a network card) that transmits or receives the packets

· A pointer to a neighbour structure relative to the router that forwards the packets to their final destination, if any (see Section 18.1.6.3)

· A pointer to a hh_cache structure, which describes the common header to be attached to every packet to be transmitted (see Section 18.1.6.3)

· The pointer to a function invoked whenever a packet is received from the remote host

· The pointer to a function invoked whenever a packet is to be transmitted

18.1.6 Routing Data Structures

The most important function of the IP layer consists of ensuring that packets originated by the host or received by the network interface cards are forwarded toward their final destinations. As you might easily guess, this task is really crucial because the routing algorithm should be fast enough to keep up with the highest network loads.

The IP routing mechanism is fairly simple. Each 32-bit integer representing an IP address encodes both a network address, which specifies the network the host is in, and a host identifier, which specifies the host inside the network. To properly interpret the IP address, the kernel must know the network mask of a given IP address — that is, what bits of the IP address encode the network address. For instance, suppose the network mask of the IP address 192.160.80.110 is 255.255.255.0; then 192.160.80.0 represents the network address, while 110 identifies the host inside its network. Nowadays, the network address is almost always stored in the most significant bits of the IP address, so each network mask can also be represented by the number of bits set to 1 (24 in our example).
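The mask arithmetic of the example can be written out in a few lines. The helper names below (ipv4, network_of, host_of, prefix_len) are made up for this sketch, and addresses are kept in host byte order for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four dotted-quad components into a host-order 32-bit value. */
uint32_t ipv4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a < 256 && b < 256 && c < 256 && d < 256);
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/* Bits selected by the mask: the network address. */
uint32_t network_of(uint32_t addr, uint32_t mask)
{
    return addr & mask;
}

/* Remaining bits: the host identifier inside the network. */
uint32_t host_of(uint32_t addr, uint32_t mask)
{
    return addr & ~mask;
}

/* Number of leading bits set to 1 in a contiguous mask. */
int prefix_len(uint32_t mask)
{
    int n = 0;

    while (mask & 0x80000000u) {
        n++;
        mask <<= 1;
    }
    return n;
}
```

For 192.160.80.110 with mask 255.255.255.0, network_of() gives 192.160.80.0, host_of() gives 110, and prefix_len() gives 24, matching the figures in the text.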

The key property of IP routing is that any host in the internetwork needs only to know the address of a computer inside its local area network (a so-called router), which is able to forward the packets to the destination network.

For instance, consider the following routing table shown by the netstat -rn system command:

Destination     Gateway         Genmask        Flags    MSS  Window   irtt Iface

192.160.80.0    0.0.0.0         255.255.255.0  U         40  0           0 eth1

192.160.0.0     0.0.0.0         255.255.0.0    U         40  0           0 eth0

192.50.0.0      192.160.11.1    255.255.0.0    UG        40  0           0 eth0

0.0.0.0         192.160.1.1     0.0.0.0        UG        40  0           0 eth0

This computer is linked to two networks. One of them has the IP address 192.160.80.0 and a netmask of 24 bits, and it is served by the Network Interface Card (NIC) associated with the network device eth1. The other network has the IP address 192.160.0.0 and a netmask of 16 bits, and it is served by the NIC associated with eth0.

Suppose that a packet must be sent to a host that belongs to the local area network 192.160.80.0 and that has the IP address 192.160.80.110. The kernel examines the static routing table starting with the most specific entry (the one with the greatest number of bits set to 1 in the netmask). For each entry, it performs a logical AND between the destination host's IP address and the netmask; if the result equals the entry's destination network address, the kernel uses the entry to route the packet. In our case, the first entry wins and the packet is sent to the eth1 network device.

In this case, the "gateway" field of the static routing table entry is null ("0.0.0.0"). This means the address is on the local network of the sender, so the computer sends packets directly to hosts in the network; it encapsulates the packet in a frame carrying the Ethernet address of the destination host. The frame is physically broadcast to all hosts in the network, but any NIC automatically ignores frames carrying Ethernet addresses different from its own.

Suppose now that a packet must be sent to a host that has the IP address 209.204.146.22. This address belongs to a remote network (not directly linked to our computer). The last entry in the table is a catch-all entry, since the AND logical operation with the netmask 0.0.0.0 always yields the network address 0.0.0.0. Thus, in our case, any IP address still not resolved by higher entries is sent through the eth0 network device to the default router that has the IP address 192.160.1.1, which hopefully knows how to forward the packet toward its final destination. The packet is encapsulated in a frame carrying the Ethernet address of the default router.
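The two lookups just described can be reproduced with a small table scan. This is a sketch of the matching rule only, not of the kernel's data structures: the struct route layout and the function names are invented, and the third entry is written as 192.50.0.0, the network address that the FIB discussion lists for zone 16, which is an assumption about the intended table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

struct route {
    uint32_t dest;       /* destination network address */
    uint32_t gateway;    /* 0 means "directly connected" */
    uint32_t mask;       /* network mask */
    const char *iface;   /* outgoing device name */
};

/* Entries ordered by decreasing number of mask bits set to 1. */
static const struct route table[] = {
    { IP(192, 160, 80, 0), 0,                   IP(255, 255, 255, 0), "eth1" },
    { IP(192, 160, 0, 0),  0,                   IP(255, 255, 0, 0),   "eth0" },
    { IP(192, 50, 0, 0),   IP(192, 160, 11, 1), IP(255, 255, 0, 0),   "eth0" },
    { 0,                   IP(192, 160, 1, 1),  0,                    "eth0" },
};

/* Return the first entry whose network matches dst, or NULL. */
const struct route *route_lookup(uint32_t dst)
{
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if ((dst & table[i].mask) == table[i].dest)
            return &table[i];
    return NULL;
}

/* The frame goes to the gateway if there is one, else to dst itself. */
uint32_t next_hop(const struct route *r, uint32_t dst)
{
    assert(r != NULL);
    return r->gateway != 0 ? r->gateway : dst;
}
```

Looking up 192.160.80.110 returns the first entry with no gateway, so the next hop is the destination itself; looking up 209.204.146.22 falls through to the catch-all entry, whose next hop is 192.160.1.1.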

18.1.6.1 The Forwarding Information Base (FIB)

The Forwarding Information Base (FIB), or static routing table, is the ultimate reference used by the kernel to determine how to forward packets to their destinations. As a matter of fact, if the destination network of a packet is not included in the FIB, then the kernel cannot transmit that packet. As mentioned previously, however, the FIB usually includes a default entry that catches any IP address not resolved by the other entries.

The kernel data structures that implement the FIB are quite sophisticated. In fact, the routing tables of routers might include several hundred entries, most of which refer to the same network devices or to the same gateway. Figure 18-1 illustrates a simplified view of the FIB's data structures when the table includes the four entries of the routing table just shown. You can get a low-level view of the data included in the FIB data structures by reading the /proc/net/route file.
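As a rough illustration of what reading /proc/net/route involves, the sketch below parses one data line of that file. It assumes the commonly seen layout, which is not described in this chapter: whitespace-separated columns Iface, Destination, Gateway, Flags, RefCnt, Use, Metric, Mask, and so on, with addresses printed as eight hexadecimal digits of the 32-bit value as stored in memory (network byte order). The structure and function names are made up.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct route_line {
    char iface[16];
    uint32_t dest, gateway, mask;   /* converted to host byte order */
    unsigned flags;
};

/* Parse one data line (not the header) in the assumed layout.
 * Returns 0 on success, -1 if the line does not match. */
int parse_route_line(const char *line, struct route_line *out)
{
    unsigned dest, gw, flags, mask;

    assert(line != NULL && out != NULL);

    /* RefCnt, Use and Metric are skipped with %*d. */
    if (sscanf(line, "%15s %x %x %x %*d %*d %*d %x",
               out->iface, &dest, &gw, &flags, &mask) != 5)
        return -1;

    /* The file shows the in-memory (network order) words. */
    out->dest = ntohl(dest);
    out->gateway = ntohl(gw);
    out->mask = ntohl(mask);
    out->flags = flags;
    return 0;
}
```

A real reader would open the file, skip the first (header) line, and call the parser on each remaining line.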

Figure 18-1. FIB's main data structures


The main_table global variable points to an fib_table object that represents the static routing table of the IPS architecture. Actually, it is possible to define secondary routing tables, but the table referenced by main_table is the most important one. The fib_table object includes the addresses of some methods that operate on the FIB, and stores the pointer to a fn_hash data structure.

The fn_hash data structure is essentially an array of 33 pointers, one for every FIB zone. A zone includes routing information for destination networks that have a given number of bits in the network mask. For instance, zone 24 includes entries for networks that have the mask 255.255.255.0.

Each zone is represented by a fn_zone descriptor. It references, through a hash table, the set of entries of the routing table that have the given netmask. For instance, in Figure 18-1, zone 16 references the entries 192.160.0.0 and 192.50.0.0.

The data relative to each routing table entry is stored in a fib_node descriptor. A router might have several entries, but it usually has very few network devices. Thus, to avoid wasting space, the fib_node descriptor does not include information about the network interface, but rather a pointer to a fib_info descriptor shared by several entries.
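The zone organization can be sketched in User Mode as follows. The structure names follow the text, but their contents are reduced to the bare minimum and are assumptions: a plain linked list stands in for each zone's hash table, a small pool stands in for dynamic allocation, and fib_insert and fib_lookup are simplified stand-ins rather than the kernel's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIB_ZONES 33                /* mask lengths 0 through 32 */

struct fib_info {                   /* shared by several entries */
    uint32_t gateway;
    const char *dev;
};

struct fib_node {                   /* one routing table entry */
    uint32_t key;                   /* destination network address */
    struct fib_info *info;
    struct fib_node *next;
};

struct fn_zone {
    struct fib_node *nodes;         /* list instead of a hash table */
};

struct fn_hash {
    struct fn_zone *zones[FIB_ZONES];   /* NULL when a zone is empty */
    struct fn_zone pool[FIB_ZONES];     /* backing storage for the sketch */
};

/* Network mask with `order` leading bits set to 1. */
static uint32_t zone_mask(int order)
{
    return order == 0 ? 0 : 0xFFFFFFFFu << (32 - order);
}

/* Add an entry to the zone for its mask length. */
void fib_insert(struct fn_hash *t, int order, struct fib_node *n)
{
    assert(t != NULL && n != NULL && order >= 0 && order < FIB_ZONES);
    if (t->zones[order] == NULL) {
        t->pool[order].nodes = NULL;
        t->zones[order] = &t->pool[order];
    }
    n->next = t->zones[order]->nodes;
    t->zones[order]->nodes = n;
}

/* Walk the zones from the longest mask to the shortest. */
const struct fib_node *fib_lookup(const struct fn_hash *t, uint32_t dst)
{
    int order;

    for (order = FIB_ZONES - 1; order >= 0; order--) {
        const struct fn_zone *z = t->zones[order];
        const struct fib_node *n;

        if (z == NULL)
            continue;
        for (n = z->nodes; n != NULL; n = n->next)
            if ((dst & zone_mask(order)) == n->key)
                return n;
    }
    return NULL;
}
```

Because each fib_node holds only a pointer to its fib_info, entries that leave through the same device and gateway can point to one shared descriptor.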

18.1.6.2 The routing cache

Looking up a route in the static routing table is quite a slow task: the kernel has to walk the various zones in the FIB and, for each entry in a zone, check whether the logical AND between the host destination address and the entry's netmask yields the entry's exact network address. To speed up routing, the kernel keeps the most recently discovered routes in a routing cache. Typically, the cache includes several hundred entries; they are sorted so that more frequently used routes are retrieved more quickly. You can easily get the contents of the cache by reading the /proc/net/rt_cache file.

The main data structure of the routing cache is the rt_hash_table hash table; its hash function combines the destination host's IP address with other information, like the source address of the packet and the type of service required. In fact, the Linux networking code allows you to fine-tune the routing process so that a packet can, for instance, be routed along several paths according to where the packet came from and what kind of data it is carrying.

Each entry of the cache is represented by a rtable data structure, which stores several pieces of information; among them:

· The source and destination IP addresses

· The gateway IP address, if any

· Data relative to the route identified by the entry, stored in a dst_entry embedded in the rtable data structure (see Section 18.1.5)
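A cache keyed on destination, source, and type of service might look like the sketch below. It is only an illustration of the idea: the table size, the hash function, the field names, and the insert and lookup helpers are all invented here and do not reproduce the kernel's actual choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RT_HASH_SIZE 256            /* arbitrary; must be a power of two */

struct dst_entry {
    const char *dev;                /* stand-in for the real contents */
};

struct rtable {
    struct dst_entry dst;           /* embedded destination cache entry */
    uint32_t rt_dst;                /* destination IP address */
    uint32_t rt_src;                /* source IP address */
    uint32_t rt_gateway;            /* gateway IP address, 0 if none */
    uint8_t tos;                    /* type of service */
    struct rtable *next;            /* collision chain */
};

static struct rtable *rt_hash_table[RT_HASH_SIZE];

/* Invented hash: mixes the three lookup keys into a bucket index. */
static unsigned rt_hash(uint32_t dst, uint32_t src, uint8_t tos)
{
    uint32_t h = dst ^ (src << 5) ^ tos;

    h ^= h >> 16;
    h ^= h >> 8;
    return h & (RT_HASH_SIZE - 1);
}

/* Put a freshly resolved route at the head of its bucket. */
void rt_cache_insert(struct rtable *rt)
{
    unsigned h;

    assert(rt != NULL);
    h = rt_hash(rt->rt_dst, rt->rt_src, rt->tos);
    rt->next = rt_hash_table[h];
    rt_hash_table[h] = rt;
}

/* Return the cached route for the key, or NULL on a miss. */
struct rtable *rt_cache_lookup(uint32_t dst, uint32_t src, uint8_t tos)
{
    struct rtable *rt;

    for (rt = rt_hash_table[rt_hash(dst, src, tos)]; rt != NULL; rt = rt->next)
        if (rt->rt_dst == dst && rt->rt_src == src && rt->tos == tos)
            return rt;
    return NULL;
}
```

On a miss, the caller would fall back to the slower FIB lookup and insert the result, so that later packets with the same key are resolved from the cache.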

18.1.6.3 The neighbor cache

Another core component of the networking code is the so-called "neighbor cache," which includes information relative to hosts that belong to the networks directly linked to the computer.

We know that IP addresses are the main host identifiers of the network layer; unfortunately, they are meaningless for the lower data-link layer, whose protocols are essentially hardware-dependent. In practice, when the kernel has to transmit a packet by means of a given network card device, it must encapsulate the data in a frame carrying, among other things, the hardware-dependent identifiers of the source and destination network card devices.

Most local area networks are based on the IEEE 802 standards, and in particular, on the 802.3 standard, which is commercially known as "Ethernet."^[2] The network card identifiers of the 802 standards are 48-bit numbers, which are usually written as 6 bytes separated by colons (such as "00:50:DA:61:A7:83"). No two network card devices share the same identifier (although it would be sufficient to ensure that all network card devices in the same local area network have different identifiers).

^[2] Actually, Ethernet local area networks sprang up before IEEE published its standards; unfortunately, Ethernet and IEEE standards disagree in small but nevertheless crucial details — for instance, in the format of the data link packets. Every host in the Internet is able to operate with both standards, though.

How can the kernel know the identifier of a remote device? It uses an IPS protocol named Address Resolution Protocol (ARP). Basically, the kernel sends a broadcast packet into the local area network carrying the question: "What is the identifier of the network card device associated with IP address X?" As a result, the host identified by the specified IP address sends an answer packet carrying the network card device identifier.

It is a waste of time and bandwidth to repeat the whole process for every packet to be sent. Thus, the kernel keeps the network card device identifier, together with other precious data concerning the physical connection to the remote device, in the neighbor cache (often also called the ARP cache). You can get the contents of this cache by reading the /proc/net/arp file. System administrators may also explicitly set the entries of this cache by means of the arp command.

Each entry of the neighbor cache is an object of type neighbour; the most important field is certainly ha, which stores the network card device identifier. The entry also stores a pointer to a hh_cache object belonging to the hardware header cache; since all packets sent to the same remote network card device are encapsulated in frames having the same header (essentially carrying the source and destination device identifiers), the kernel keeps a copy of the header in memory to avoid having to reconstruct it from scratch for every packet.
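The benefit of caching the hardware header can be seen in a small sketch. It assumes the classic 14-byte Ethernet header (destination identifier, source identifier, 2-byte type field); the structures are reduced to a few fields, the hh_cache is embedded rather than referenced by pointer to keep the code short, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN     6               /* 48-bit card identifier */
#define ETH_HDR_LEN 14              /* dest + source + type */

struct hh_cache {
    uint8_t hh_data[ETH_HDR_LEN];   /* prebuilt hardware header */
};

struct neighbour {
    uint32_t ip;                    /* neighbor's IP address */
    uint8_t ha[MAC_LEN];            /* network card device identifier */
    struct hh_cache hh;             /* embedded here for brevity */
};

/* Find the cache entry for an IP address in a small array. */
struct neighbour *neigh_lookup(struct neighbour *tbl, size_t n, uint32_t ip)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tbl[i].ip == ip)
            return &tbl[i];
    return NULL;                    /* a miss is where ARP would be used */
}

/* Build the header once: destination, source, type (big-endian). */
void neigh_fill_header(struct neighbour *n, const uint8_t src[MAC_LEN],
                       uint16_t type)
{
    assert(n != NULL && src != NULL);
    memcpy(n->hh.hh_data, n->ha, MAC_LEN);
    memcpy(n->hh.hh_data + MAC_LEN, src, MAC_LEN);
    n->hh.hh_data[12] = (uint8_t)(type >> 8);
    n->hh.hh_data[13] = (uint8_t)(type & 0xFF);
}

/* For every packet, just copy the cached bytes into the frame. */
size_t neigh_emit_header(const struct neighbour *n, uint8_t *frame, size_t len)
{
    assert(n != NULL && frame != NULL && len >= ETH_HDR_LEN);
    memcpy(frame, n->hh.hh_data, ETH_HDR_LEN);
    return ETH_HDR_LEN;
}
```

After the first packet, sending to the same neighbor costs one 14-byte copy instead of a fresh header construction.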

18.1.7 The Socket Buffer

Each single packet transmitted through a network device is composed of several pieces of information. Besides the payload — that is, the data whose transmission caused the creation of the packet itself — all network layers, starting from the data link layer and ending at the transport layer, add some control information. The format of a packet handled by a network card device is shown in Figure 18-2.

Figure 18-2. The packet format


The whole packet is built by different functions in several stages. For instance, the UDP/TCP header and the IP header are composed by functions belonging, respectively, to the transport layer and the network layer of the IPS architecture, while the hardware header and trailer, which build the frame encapsulating the IP datagram, are written by a suitable method specific to the network card device.

The Linux networking code keeps each packet in a large memory area called a socket buffer. Each socket buffer is associated with a descriptor, which is a data structure of type sk_buff that stores, among other things, pointers to the following data structures:

· The socket buffer

· The payload — that is, the user data (inside the socket buffer)

· The data link trailer (inside the socket buffer)

· The INET socket (sock object)

· The network device's net_device object

· A descriptor of the transport layer header

· A descriptor of the network layer header

· A descriptor of the data link layer header

· The destination cache entry (dst_entry object)

The sk_buff data structure includes many other fields, like an identifier of the network protocol used for transmitting the packet, a checksum field, and the arrival time for received packets.

As a general rule, the kernel avoids copying data; it simply passes the sk_buff descriptor pointer, and thus the socket buffer, to each networking layer in turn. For instance, when preparing a packet to send, the transport layer starts by copying the payload from the User Mode buffer into the higher portion of the socket buffer; then the transport layer adds its TCP or UDP header before the payload. Next, control is transferred to the network layer, which receives the socket buffer descriptor and adds the IP header before the transport header. Eventually, the data link layer adds its header and trailer, and enqueues the packet for transmission.
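The way each layer adds its header in front of the data already in the buffer, without moving that data, can be sketched with a buffer that reserves some room at its start. The structure below is a toy stand-in for the sk_buff descriptor, not its real layout; the names sk_buff_sketch, skb_init, skb_put_bytes, skb_push_bytes, and skb_len are made up, and the trailer is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 256

struct sk_buff_sketch {
    uint8_t head[BUF_SIZE];     /* the socket buffer itself */
    uint8_t *data;              /* start of the valid bytes */
    uint8_t *tail;              /* one past the last valid byte */
};

/* Leave `headroom` free bytes in front for headers added later. */
void skb_init(struct sk_buff_sketch *skb, size_t headroom)
{
    assert(skb != NULL && headroom <= BUF_SIZE);
    skb->data = skb->tail = skb->head + headroom;
}

/* Append bytes at the tail (used for the payload). */
uint8_t *skb_put_bytes(struct sk_buff_sketch *skb, const void *src, size_t len)
{
    uint8_t *start = skb->tail;

    assert((size_t)(skb->head + BUF_SIZE - skb->tail) >= len);
    memcpy(start, src, len);
    skb->tail += len;
    return start;
}

/* Prepend bytes by moving the data pointer backward (used for headers). */
uint8_t *skb_push_bytes(struct sk_buff_sketch *skb, const void *hdr, size_t len)
{
    assert((size_t)(skb->data - skb->head) >= len);
    skb->data -= len;
    memcpy(skb->data, hdr, len);
    return skb->data;
}

/* Number of valid bytes currently in the buffer. */
size_t skb_len(const struct sk_buff_sketch *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

Copying the payload once and then pushing a transport header, a network header, and a data link header leaves the bytes in wire order, data link header first, while the payload itself never moves.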