Chapter 22. Protocol Control Blocks

Introduction

Protocol control blocks (PCBs) are used at the protocol layer to hold the various pieces of information required for each UDP or TCP socket. The Internet protocols maintain Internet protocol control blocks and TCP control blocks. Since UDP is connectionless, everything it needs for an end point is found in the Internet PCB; there are no UDP control blocks.

The Internet PCB contains the information common to all UDP and TCP end points: foreign and local IP addresses, foreign and local port numbers, IP header prototype, IP options to use for this end point, and a pointer to the routing table entry for the destination of this end point. The TCP control block contains all of the state information that TCP maintains for each connection: sequence numbers in both directions, window sizes, retransmission timers, and the like.

In this chapter we describe the Internet PCBs used in Net/3, saving TCP’s control blocks until we describe TCP in detail. We examine the numerous functions that operate on Internet PCBs, since we’ll encounter them when we describe UDP and TCP. Most of the functions begin with the six characters in_pcb.

Figure 22.1 summarizes the protocol control blocks that we describe and their relationship to the file and socket structures.

Figure 22.1. Internet protocol control blocks and their relationship to other structures.

There are numerous points to consider in this figure.

  • When a socket is created by either socket or accept, the socket layer creates a file structure and a socket structure. The file type is DTYPE_SOCKET and the socket type is SOCK_DGRAM for UDP end points or SOCK_STREAM for TCP end points.

  • The protocol layer is then called. UDP creates an Internet PCB (an inpcb structure) and links it to the socket structure: the so_pcb member points to the inpcb structure and the inp_socket member points to the socket structure.

  • TCP does the same and also creates its own control block (a tcpcb structure) and links it to the inpcb using the inp_ppcb and t_inpcb pointers. In the two UDP inpcbs the inp_ppcb member is a null pointer, since UDP does not maintain its own control block.

  • The four other members of the inpcb structure that we show, inp_faddr through inp_lport, form the socket pair for this end point: the foreign IP address and port number along with the local IP address and port number.

  • Both UDP and TCP maintain a doubly linked list of all their Internet PCBs, using the inp_next and inp_prev pointers. They allocate a global inpcb structure as the head of their list (named udb and tcb) and only use three members in the structure: the next and previous pointers, and the local port number. This latter member contains the next ephemeral port number to use for this protocol.

The Internet PCB is a transport layer data structure. It is used by TCP, UDP, and raw IP, but not by IP, ICMP, or IGMP.

We haven’t described raw IP yet, but it too uses Internet PCBs. Unlike TCP and UDP, raw IP does not use the port number members in the PCB, and raw IP uses only two of the functions that we describe in this chapter: in_pcballoc to allocate a PCB, and in_pcbdetach to release a PCB. We return to raw IP in Chapter 32.

Code Introduction

All the PCB functions are in a single C file and a single header contains the definitions, as shown in Figure 22.2.

Table 22.2. Files discussed in this chapter.

File                Description
netinet/in_pcb.h    inpcb structure definition
netinet/in_pcb.c    PCB functions

Global Variables

One global variable is introduced in this chapter, which is shown in Figure 22.3.

Table 22.3. Global variable introduced in this chapter.

Variable       Datatype          Description
zeroin_addr    struct in_addr    32-bit IP address of all zero bits

Statistics

Internet PCBs and TCP PCBs are both allocated by the kernel’s malloc function with a type of M_PCB. This is just one of the approximately 60 different types of memory allocated by the kernel. Mbufs, for example, are allocated with a type of M_BUF, and socket structures are allocated with a type of M_SOCKET.

Since the kernel can keep counters of the different types of memory buffers that are allocated, various statistics on the number of PCBs can be maintained. The command vmstat -m shows the kernel’s memory allocation statistics and the netstat -m command shows the mbuf allocation statistics.

inpcb Structure

Figure 22.4 shows the definition of the inpcb structure. It is not a big structure, and occupies only 84 bytes.

Table 22.4. inpcb structure.

------------------------------------------------------------------------- in_pcb.h
 42 struct inpcb {
 43     struct inpcb *inp_next, *inp_prev;  /* doubly linked list */
 44     struct inpcb *inp_head;     /* pointer back to chain of inpcb's for
 45                                    this protocol */
 46     struct in_addr inp_faddr;   /* foreign IP address */
 47     u_short inp_fport;          /* foreign port# */
 48     struct in_addr inp_laddr;   /* local IP address */
 49     u_short inp_lport;          /* local port# */
 50     struct socket *inp_socket;  /* back pointer to socket */
 51     caddr_t inp_ppcb;           /* pointer to per-protocol PCB */
 52     struct route inp_route;     /* placeholder for routing entry */
 53     int     inp_flags;          /* generic IP/datagram flags */
 54     struct ip inp_ip;           /* header prototype; should have more */
 55     struct mbuf *inp_options;   /* IP options */
 56     struct ip_moptions *inp_moptions;   /* IP multicast options */
 57 };
------------------------------------------------------------------------- in_pcb.h

43-45

inp_next and inp_prev form the doubly linked list of all PCBs for UDP and TCP. Additionally, each PCB has a pointer to the head of the protocol’s linked list (inp_head). For PCBs on the UDP list, inp_head always points to udb (Figure 22.1); for PCBs on the TCP list, this pointer always points to tcb.

46-49

The next four members, inp_faddr, inp_fport, inp_laddr, and inp_lport, contain the socket pair for this IP end point: the foreign IP address and port number and the local IP address and port number. These four values are maintained in the PCB in network byte order, not host byte order.
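The byte-order point can be illustrated from user space. The following is a minimal sketch, not kernel code: the structure pcb_addr and the helpers make_pcb_addr and msb_first are hypothetical names introduced here, and only the standard inet_addr, htons, and ntohs calls are assumed.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one half of a socket pair, stored the way
 * the text says the PCB stores it: in network byte order. */
struct pcb_addr {
    struct in_addr addr;    /* IP address, network byte order */
    uint16_t       port;    /* port number, network byte order */
};

/* Build the pair from a dotted-decimal string and a host-order port. */
static struct pcb_addr make_pcb_addr(const char *dotted, uint16_t host_port)
{
    struct pcb_addr p;
    p.addr.s_addr = inet_addr(dotted);  /* already network byte order */
    p.port = htons(host_port);          /* convert host -> network */
    return p;
}

/* Return 1 if the len bytes at field hold host_value with the most
 * significant byte first, whatever the byte order of the host. */
static int msb_first(const void *field, size_t len, uint32_t host_value)
{
    const unsigned char *b = field;
    assert(len <= 4);
    for (size_t i = 0; i < len; i++) {
        unsigned shift = 8 * (unsigned) (len - 1 - i);
        if (b[i] != ((host_value >> shift) & 0xff))
            return 0;
    }
    return 1;
}
```

Code that compares a PCB member against a value taken from a received header can therefore compare the two directly, while a value supplied in host byte order must first pass through htons or htonl.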

The Internet PCB is used by both transport layers, TCP and UDP. While it makes sense to store the local and foreign IP addresses in this structure, the port numbers really don’t belong here. The definition of a port number and its size are specified by each transport layer and could differ between different transport layers. This problem was identified in [Partridge 1987], where 8-bit port numbers were used in version 1 of RDP, which required reimplementing several standard kernel routines to use 8-bit port numbers. Version 2 of RDP [Partridge and Hinden 1990] uses 16-bit port numbers. The port numbers really belong in a transport-specific control block, such as TCP’s tcpcb. A new UDP-specific PCB would then be required. While doable, this would complicate some of the routines we’ll examine shortly.

50-51

inp_socket is a pointer to the socket structure for this PCB and inp_ppcb is a pointer to an optional transport-specific control block for this PCB. We saw in Figure 22.1 that the inp_ppcb pointer is used with TCP to point to the corresponding tcpcb, but is not used by UDP. The link between the socket and inpcb is two way because sometimes the kernel starts at the socket layer and needs to find the corresponding Internet PCB (e.g., user output), and sometimes the kernel starts at the PCB and needs to locate the corresponding socket structure (e.g., processing a received IP datagram).
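The two-way links can be sketched with cut-down structures. This is an illustration under stated assumptions: the structures below keep only the link members named in the text, link_pcbs is a hypothetical helper, and the conversion macros sotoinpcb and intotcpcb are modeled on the style used in BSD headers rather than copied from the source discussed here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: only the link members from Figure 22.1. */
struct inpcb;
struct socket { void *so_pcb; };                /* socket -> inpcb */
struct tcpcb  { struct inpcb *t_inpcb; };       /* tcpcb  -> inpcb */
struct inpcb {
    struct socket *inp_socket;                  /* inpcb -> socket */
    void          *inp_ppcb;                    /* inpcb -> tcpcb, or NULL */
};

/* Assumed conversion macros, in the style of the BSD headers. */
#define sotoinpcb(so)   ((struct inpcb *) (so)->so_pcb)
#define intotcpcb(inp)  ((struct tcpcb *) (inp)->inp_ppcb)

/* Wire the structures together. Pass tp == NULL for a UDP end point,
 * which has no per-protocol control block. */
static void link_pcbs(struct socket *so, struct inpcb *inp, struct tcpcb *tp)
{
    assert(so != NULL && inp != NULL);
    so->so_pcb = inp;
    inp->inp_socket = so;
    inp->inp_ppcb = tp;
    if (tp != NULL)
        tp->t_inpcb = inp;
}
```

Output processing would start at the socket and follow so_pcb, while input processing would start at the PCB found by demultiplexing and follow inp_socket.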

52

If IP has a route to the foreign address, it is stored in the inp_route entry. We’ll see that when an ICMP redirect message is received, all Internet PCBs are scanned and all those with a foreign IP address that matches the redirected IP address have their inp_route entry marked as invalid. This forces IP to find a new route to the foreign address the next time the PCB is used for output.

53

Various flags are stored in the inp_flags member. Figure 22.5 lists the individual flags.

Table 22.5. inp_flags values.

inp_flags          Description
INP_HDRINCL        process supplies entire IP header (raw socket only)
INP_RECVOPTS       receive incoming IP options as control information (UDP only, not implemented)
INP_RECVRETOPTS    receive IP options for reply as control information (UDP only, not implemented)
INP_RECVDSTADDR    receive IP destination address as control information (UDP only)
INP_CONTROLOPTS    INP_RECVOPTS | INP_RECVRETOPTS | INP_RECVDSTADDR

54

A copy of an IP header is maintained in the PCB but only two members are used, the TOS and TTL. The TOS is initialized to 0 (normal service) and the TTL is initialized by the transport layer. We’ll see that TCP and UDP both default the TTL to 64. A process can change these defaults using the IP_TOS or IP_TTL socket options, and the new value is recorded in the inpcb.inp_ip structure. This structure is then used by TCP and UDP as the prototype IP header when sending IP datagrams.
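From the process's side, changing the TTL default looks roughly like the following. This is a user-level sketch using the standard sockets calls; the helper name set_and_get_ttl is hypothetical, and the sketch does not show how the kernel records the value in the PCB.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a UDP socket, set its TTL with IP_TTL, and read the value
 * back. Returns the TTL reported by the kernel, or -1 on any error. */
static int set_and_get_ttl(int ttl)
{
    int       got = -1;
    socklen_t len = sizeof got;
    int       fd;

    assert(ttl >= 1 && ttl <= 255);     /* TTL is an 8-bit header field */

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* The new value replaces the transport layer's default for this
     * socket only; other sockets keep their own prototype header. */
    if (setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof ttl) < 0 ||
        getsockopt(fd, IPPROTO_IP, IP_TTL, &got, &len) < 0)
        got = -1;

    close(fd);
    return got;
}
```

The same pattern applies to IP_TOS, with the type-of-service value in place of the TTL.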

55-56

A process can set the IP options for outgoing datagrams with the IP_OPTIONS socket option. A copy of the caller’s options is stored in an mbuf by the function ip_pcbopts and a pointer to that mbuf is stored in the inp_options member. Each time TCP or UDP calls the ip_output function, a pointer to these IP options is passed for IP to insert into the outgoing IP datagram. Similarly, a pointer to a copy of the user’s IP multicast options is maintained in the inp_moptions member.

in_pcballoc and in_pcbdetach Functions

An Internet PCB is allocated by TCP, UDP, and raw IP when a socket is created. A PRU_ATTACH request is issued by the socket system call. In the case of UDP, we’ll see in Figure 23.33 that the resulting call is

struct socket *so;
int  error;
error = in_pcballoc(so, &udb);

Figure 22.6 shows the in_pcballoc function.

Table 22.6. in_pcballoc function: allocate an Internet PCB.

------------------------------------------------------------------------- in_pcb.c
 36 int
 37 in_pcballoc(so, head)
 38 struct socket *so;
 39 struct inpcb *head;
 40 {
 41     struct inpcb *inp;

 42     MALLOC(inp, struct inpcb *, sizeof(*inp), M_PCB, M_WAITOK);
 43     if (inp == NULL)
 44         return (ENOBUFS);
 45     bzero((caddr_t) inp, sizeof(*inp));

 46     inp->inp_head = head;
 47     inp->inp_socket = so;
 48     insque(inp, head);
 49     so->so_pcb = (caddr_t) inp;
 50     return (0);
 51 }
------------------------------------------------------------------------- in_pcb.c

Allocate PCB and initialize to zero

36-45

in_pcballoc calls the kernel’s memory allocator using the macro MALLOC. Since these PCBs are always allocated as the result of a system call, it is OK to wait for one.

Net/2 and earlier Berkeley releases stored both Internet PCBs and TCP PCBs in mbufs. Their sizes were 80 and 108 bytes, respectively. With the Net/3 release, the sizes went to 84 and 140 bytes, so TCP control blocks no longer fit into an mbuf. Net/3 uses the kernel’s memory allocator instead of mbufs for both types of control blocks.

Careful readers may note that the example in Figure 2.6 shows 17 mbufs allocated for PCBs, yet we just said that Net/3 no longer uses mbufs for Internet PCBs or TCP PCBs. Net/3 does, however, use mbufs for Unix domain PCBs, and that is what this counter refers to. The mbuf statistics output by netstat are for all mbufs in the kernel across all protocol suites, not just the Internet protocols.

bzero sets the PCB to 0. This is important because the IP addresses and port numbers in the PCB must be initialized to 0.

Link structures together

46-49

The inp_head member points to the head of the protocol’s PCB list (either udb or tcb), the inp_socket member points to the socket structure, the new PCB is added to the protocol’s doubly linked list (insque), and the socket structure points to the PCB. The insque function puts the new PCB at the head of the protocol’s list.
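The list manipulation performed by insque, and undone by remque when the PCB is later released, can be sketched in user space as follows. This is an illustrative reimplementation on a hypothetical node structure, not the kernel's routines; it assumes a circular list whose head is itself a list element, as udb and tcb are.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list element: the two link pointers come first, as
 * inp_next and inp_prev do in the inpcb structure. */
struct node {
    struct node *next, *prev;
    int          id;
};

/* An empty circular list: the head points to itself both ways. */
static void list_init(struct node *head)
{
    head->next = head->prev = head;
}

/* Insert elem immediately after pred. With pred == head, the new
 * element becomes the first entry on the list. */
static void list_insque(struct node *elem, struct node *pred)
{
    assert(elem != NULL && pred != NULL);
    elem->next = pred->next;
    elem->prev = pred;
    pred->next->prev = elem;
    pred->next = elem;
}

/* Unlink elem from whatever list it is on. */
static void list_remque(struct node *elem)
{
    assert(elem != NULL);
    elem->prev->next = elem->next;
    elem->next->prev = elem->prev;
    elem->next = elem->prev = NULL;
}
```

Because each insertion goes directly after the head, the most recently created PCB is the first one examined by any function that walks the list from the head.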

An Internet PCB is deallocated when a PRU_DETACH request is issued. This happens when the socket is closed. The function in_pcbdetach, shown in Figure 22.7, is eventually called.

Table 22.7. in_pcbdetach function: deallocate an Internet PCB.

------------------------------------------------------------------------- in_pcb.c
252 int
253 in_pcbdetach(inp)
254 struct inpcb *inp;
255 {
256     struct socket *so = inp->inp_socket;

257     so->so_pcb = 0;
258     sofree(so);
259     if (inp->inp_options)
260         (void) m_free(inp->inp_options);
261     if (inp->inp_route.ro_rt)
262         rtfree(inp->inp_route.ro_rt);
263     ip_freemoptions(inp->inp_moptions);
264     remque(inp);
265     FREE(inp, M_PCB);
266 }
------------------------------------------------------------------------- in_pcb.c

252-263

The PCB pointer in the socket structure is set to 0 and that structure is released by sofree. If an mbuf with IP options was allocated for this PCB, it is released by m_free. If a route is held by this PCB, it is released by rtfree. Any multicast options are also released by ip_freemoptions.

264-265

The PCB is removed from the protocol’s doubly linked list by remque and the memory used by the PCB is returned to the kernel.

Binding, Connecting, and Demultiplexing

Before examining the kernel functions that bind sockets, connect sockets, and demultiplex incoming datagrams, we describe the rules imposed by the kernel on these actions.

Binding of Local IP Address and Port Number

Figure 22.8 shows the six different combinations of a local IP address and local port number that a process can specify in a call to bind.

Table 22.8. Combination of local IP address and local port number for bind.

Local IP address        Local port    Description
unicast or broadcast    nonzero       one local interface, specific port
multicast               nonzero       one local multicast group, specific port
*                       nonzero       any local interface or multicast group, specific port
unicast or broadcast    0             one local interface, kernel chooses port
multicast               0             one multicast group, kernel chooses port
*                       0             any local interface, kernel chooses port

The first three lines are typical for servers: they bind a specific port, termed the server’s well-known port, whose value is known by the client. The last three lines are typical for clients: they don’t care what the local port, termed an ephemeral port, is, as long as it is unique on the client host.

Most servers and most clients specify the wildcard IP address in the call to bind. This is indicated in Figure 22.8 by the notation * on lines 3 and 6.

If a server binds a specific IP address to a socket (i.e., not the wildcard address), then only IP datagrams arriving with that specific IP address as the destination IP address (be it unicast, broadcast, or multicast) are delivered to the process. Naturally, when the process binds a specific unicast or broadcast IP address to a socket, the kernel verifies that the IP address corresponds to a local interface.

It is rare, though possible, for a client to bind a specific IP address (lines 4 and 5 in Figure 22.8). Normally a client binds the wildcard IP address (the final line in Figure 22.8), which lets the kernel choose the outgoing interface based on the route chosen to reach the server.

What we don’t show in Figure 22.8 is what happens if the client tries to bind a local port that is already in use by another socket. By default a process cannot bind a port number if that port is already in use. The error EADDRINUSE (address already in use) is returned if this occurs. The definition of “in use” is simply whether a PCB exists with that port as its local port. This notion of “in use” is relative to a given protocol, TCP or UDP, since TCP port numbers are independent of UDP port numbers.

Net/3 allows a process to change this default behavior by specifying one of the following two socket options:

SO_REUSEADDR

Allows the process to bind a port number that is already in use, but the IP address being bound (including the wildcard) must not already be bound to that same port.

For example, if an attached interface has the IP address 140.252.1.29 then one socket can be bound to 140.252.1.29, port 5555; another socket can be bound to 127.0.0.1, port 5555; and another socket can be bound to the wildcard IP address, port 5555. The call to bind for the second and third cases must be preceded by a call to setsockopt, setting the SO_REUSEADDR option.

SO_REUSEPORT

Allows a process to reuse both the IP address and port number, but each binding of the IP address and port number, including the first, must specify this socket option. With SO_REUSEADDR, the first binding of the port number need not specify the socket option.

For example, if an attached interface has the IP address 140.252.1.29 and a socket is bound to 140.252.1.29, port 6666 specifying the SO_REUSEPORT socket option, then another socket can also specify this same socket option and bind 140.252.1.29, port 6666.

Later in this section we describe what happens in this final example when an IP datagram arrives with a destination address of 140.252.1.29 and a destination port of 6666, since two sockets are bound to that end point.

The SO_REUSEPORT option is new with Net/3 and was introduced with the support for multicasting in 4.4BSD. Before this release it was never possible for two sockets to be bound to the same IP address and same port number.

Unfortunately the SO_REUSEPORT option was not part of the original Stanford multicast sources and is therefore not widely supported. Other systems that support multicasting, such as Solaris 2.x, let a process specify SO_REUSEADDR to specify that it is OK to bind multiple sockets to the same IP address and same port number.
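The SO_REUSEADDR example above (a specific address and the wildcard sharing one port) can be sketched from user space as follows. The helper names bind_udp_reuseaddr and local_port are hypothetical. The sketch sets the option on every socket, including the first; Net/3 does not require that for the first binding, but the assumption here is that some systems are stricter, so setting it everywhere is the conservative choice.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a UDP socket to ip:port with SO_REUSEADDR set beforehand.
 * ip == NULL means the wildcard address; port is in host byte order
 * (0 lets the kernel choose). Returns the descriptor, or -1 on error. */
static int bind_udp_reuseaddr(const char *ip, unsigned short port)
{
    struct sockaddr_in sin;
    int on = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    sin.sin_addr.s_addr = ip ? inet_addr(ip) : htonl(INADDR_ANY);

    /* The option must be set before bind for it to affect the check. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) < 0 ||
        bind(fd, (struct sockaddr *) &sin, sizeof sin) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Return the local port bound to fd in host byte order, or 0 on error. */
static unsigned short local_port(int fd)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof sin;

    assert(fd >= 0);
    if (getsockname(fd, (struct sockaddr *) &sin, &len) < 0)
        return 0;
    return ntohs(sin.sin_port);
}
```

A caller would bind the first socket to a specific address, read back the port with local_port, and then bind a second socket to the wildcard address with the same port. Exactly which combinations succeed varies between systems, as the note above about Solaris 2.x already suggests.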

Connecting a UDP Socket

We normally associate the connect system call with TCP clients, but it is also possible for a UDP client or a UDP server to call connect and specify the foreign IP address and foreign port number for the socket. This restricts the socket to exchanging UDP datagrams with that one particular peer.

There is a side effect when a UDP socket is connected: the local IP address, if not already specified by a call to bind, is automatically set by connect. It is set to the local interface address chosen by IP routing to reach the specified peer.
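This side effect can be observed from a process with getsockname. The following is a user-level sketch; the helper name udp_connect_local_addr is hypothetical, and the loopback peer used in any trial run is only a convenient choice.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect an unbound UDP socket to peer_ip:peer_port and return the
 * local IP address (network byte order) that connect filled in.
 * Returns INADDR_NONE on error. */
static in_addr_t udp_connect_local_addr(const char *peer_ip,
                                        unsigned short peer_port)
{
    struct sockaddr_in peer, local;
    socklen_t len = sizeof local;
    in_addr_t result = INADDR_NONE;
    int fd;

    assert(peer_ip != NULL);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return INADDR_NONE;

    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(peer_port);
    peer.sin_addr.s_addr = inet_addr(peer_ip);

    /* For UDP, connect sends nothing on the network: it records the
     * peer and, since bind was never called, chooses the local address
     * from the route to that peer. */
    if (connect(fd, (struct sockaddr *) &peer, sizeof peer) == 0 &&
        getsockname(fd, (struct sockaddr *) &local, &len) == 0)
        result = local.sin_addr.s_addr;

    close(fd);
    return result;
}
```

Connecting to a peer on the loopback network, for instance, should report a loopback local address, since that is the interface the route selects.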

Figure 22.9 shows the three different states of a UDP socket along with the pseudo-code of the function calls to end up in that state.

Table 22.9. Specification of local and foreign IP addresses and port numbers for UDP sockets.

Local socket     Foreign socket     Description
localIP.lport    foreignIP.fport    restricted to one peer:
                                      socket(), bind(*, lport), connect(foreignIP, fport)
                                      socket(), bind(localIP, lport), connect(foreignIP, fport)
localIP.lport    *.*                restricted to datagrams arriving on one local interface: localIP
                                      socket(), bind(localIP, lport)
*.lport          *.*                receives all datagrams sent to lport:
                                      socket(), bind(*, lport)

The first of the three states is called a connected UDP socket and the next two states are called unconnected UDP sockets. The difference between the two unconnected sockets is that the first has a fully specified local address and the second has a wildcarded local IP address.

Demultiplexing of Received IP Datagrams by TCP

Figure 22.10 shows the state of three Telnet server sockets on the host sun. The first two sockets are in the LISTEN state, waiting for incoming connection requests, and the third is connected to a client at port 1500 on the host with an IP address of 140.252.1.11. The first listening socket will handle connection requests that arrive on the 140.252.1.29 interface and the second listening socket will handle all other interfaces (since its local IP address is the wildcard).

Table 22.10. Three TCP sockets with a local port of 23.

Local address    Local port    Foreign address    Foreign port    TCP state
140.252.1.29     23            *                  *               LISTEN
*                23            *                  *               LISTEN
140.252.1.29     23            140.252.1.11       1500            ESTABLISHED

We show both of the listening sockets with unspecified foreign IP addresses and port numbers because the sockets API doesn’t allow a TCP server to restrict either of these values. A TCP server must accept the client’s connection and is then told of the client’s IP address and port number after the connection establishment is complete (i.e., when TCP’s three-way handshake is complete). Only then can the server close the connection if it doesn’t like the client’s IP address and port number. This isn’t a required TCP feature; it is just the way the sockets API has always worked.

When TCP receives a segment with a destination port of 23 it searches through its list of Internet PCBs looking for a match by calling in_pcblookup. When we examine this function shortly we’ll see that it has a preference for the smallest number of wildcard matches. To determine the number of wildcard matches we consider only the local and foreign IP addresses. We do not consider the foreign port number. The local port number must match, or we don’t even consider the PCB. The number of wildcard matches can be 0, 1 (local IP address or foreign IP address), or 2 (both local and foreign IP addresses).

For example, assume the incoming segment is from 140.252.1.11, port 1500, destined for 140.252.1.29, port 23. Figure 22.11 shows the number of wildcard matches for the three sockets from Figure 22.10.

Table 22.11. Incoming segment from {140.252.1.11, 1500} to {140.252.1.29, 23}.

Local address    Local port    Foreign address    Foreign port    TCP state      #wildcard matches
140.252.1.29     23            *                  *               LISTEN         1
*                23            *                  *               LISTEN         2
140.252.1.29     23            140.252.1.11       1500            ESTABLISHED    0

The first socket matches these four values, but with one wildcard match (the foreign IP address). The second socket also matches the incoming segment, but with two wildcard matches (the local and foreign IP addresses). The third socket is a complete match with no wildcards. Net/3 uses the third socket, the one with the smallest number of wildcard matches.

Continuing this example, assume the incoming segment is from 140.252.1.11, port 1501, destined for 140.252.1.29, port 23. Figure 22.12 shows the number of wildcard matches.

Table 22.12. Incoming segment from {140.252.1.11, 1501} to {140.252.1.29, 23}.

Local address    Local port    Foreign address    Foreign port    TCP state      #wildcard matches
140.252.1.29     23            *                  *               LISTEN         1
*                23            *                  *               LISTEN         2
140.252.1.29     23            140.252.1.11       1500            ESTABLISHED

The first socket matches with one wildcard match; the second socket matches with two wildcard matches; and the third socket doesn’t match at all, since the foreign port numbers are unequal. (The foreign port numbers are compared only if the foreign IP address in the PCB is not a wildcard.) The first socket is chosen.
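The wildcard counting in these two examples can be reproduced with a small user-space sketch. It is a simplification for the demultiplexing case only, where the incoming segment always carries specific addresses; the structure endpoint_pcb, the IP4 macro, and both function names are hypothetical, and addresses are kept in host byte order purely for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANY 0u      /* wildcard address or port, shown as * in the tables */

/* Pack a dotted quad into one integer (host byte order, for clarity). */
#define IP4(a, b, c, d) \
    (((uint32_t) (a) << 24) | ((uint32_t) (b) << 16) | \
     ((uint32_t) (c) << 8) | (uint32_t) (d))

struct endpoint_pcb {
    uint32_t laddr; uint16_t lport;     /* local address and port */
    uint32_t faddr; uint16_t fport;     /* foreign address and port */
};

/* The three Telnet sockets of Figure 22.10. */
static const struct endpoint_pcb telnet_pcbs[3] = {
    { IP4(140, 252, 1, 29), 23, ANY, ANY },
    { ANY, 23, ANY, ANY },
    { IP4(140, 252, 1, 29), 23, IP4(140, 252, 1, 11), 1500 },
};

/* Number of wildcard matches (0, 1, or 2) between a PCB and an incoming
 * segment, or -1 if the PCB does not match at all. */
static int wildcard_matches(const struct endpoint_pcb *p,
                            uint32_t src, uint16_t sport,
                            uint32_t dst, uint16_t dport)
{
    int wild = 0;

    if (p->lport != dport)
        return -1;                      /* local port must match */
    if (p->laddr == ANY)
        wild++;
    else if (p->laddr != dst)
        return -1;
    if (p->faddr == ANY)
        wild++;                         /* foreign port not compared */
    else if (p->faddr != src || p->fport != sport)
        return -1;
    return wild;
}

/* Index of the PCB with the fewest wildcard matches, or -1 if none. */
static int best_match(const struct endpoint_pcb *tab, size_t n,
                      uint32_t src, uint16_t sport,
                      uint32_t dst, uint16_t dport)
{
    int best = -1, bestwild = 3;        /* 3 exceeds any possible count */

    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int w = wildcard_matches(&tab[i], src, sport, dst, dport);
        if (w >= 0 && w < bestwild) {
            best = (int) i;
            bestwild = w;
        }
    }
    return best;
}
```

With the segment from port 1500, best_match selects the third entry (index 2); with the segment from port 1501, the third entry is rejected and the first entry (index 0) wins with one wildcard match.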

In these two examples we never said what type of TCP segment arrived: we assume that the segment in Figure 22.11 contains data or an acknowledgment for an established connection since it is delivered to an established socket. We also assume that the segment in Figure 22.12 is an incoming connection request (a SYN) since it is delivered to a listening socket. But the demultiplexing code in in_pcblookup doesn’t care. If the TCP segment is the wrong type for the socket that it is delivered to, we’ll see later how TCP handles this. For now the important fact is that the demultiplexing code only compares the source and destination socket pair from the IP datagram against the values in the PCB.

Demultiplexing of Received IP Datagrams by UDP

The delivery of UDP datagrams is more complicated than the TCP example we just examined, since UDP datagrams can be sent to a broadcast or multicast address. Since Net/3 (and most systems with multicast support) allow multiple sockets to have identical local IP addresses and ports, how are multiple recipients handled? The Net/3 rules are:

  1. An incoming UDP datagram destined for either a broadcast IP address or a multicast IP address is delivered to all matching sockets. There is no concept of a “best” match here (i.e., the one with the smallest number of wildcard matches).

  2. An incoming UDP datagram destined for a unicast IP address is delivered only to one matching socket, the one with the smallest number of wildcard matches. If there are multiple sockets with the same “smallest” number of wildcard matches, which socket receives the incoming datagram is implementation-dependent.

Figure 22.13 shows four UDP sockets that we’ll use for some examples. Having four UDP sockets with the same local port number requires using either SO_REUSEADDR or SO_REUSEPORT. The first two sockets have been connected to a foreign IP address and port number, and the last two are unconnected.

Table 22.13. Four UDP sockets with a local port of 577.

Local address    Local port    Foreign address    Foreign port    Comment
140.252.1.29     577           140.252.1.11       1500            connected, local IP = unicast
140.252.13.63    577           140.252.13.35      1500            connected, local IP = broadcast
140.252.13.63    577           *                  *               unconnected, local IP = broadcast
*                577           *                  *               unconnected, local IP = wildcard

Consider an incoming UDP datagram destined for 140.252.13.63 (the broadcast address on the 140.252.13 subnet), port 577, from 140.252.13.34, port 1500. Figure 22.14 shows that it is delivered to the third and fourth sockets.

Table 22.14. Received datagram from {140.252.13.34, 1500} to {140.252.13.63, 577}.

Local address    Local port    Foreign address    Foreign port    Delivered?
140.252.1.29     577           140.252.1.11       1500            no, local and foreign IP mismatch
140.252.13.63    577           140.252.13.35      1500            no, foreign IP mismatch
140.252.13.63    577           *                  *               yes
*                577           *                  *               yes

The broadcast datagram is not delivered to the first socket because the local IP address doesn’t match the destination IP address and the foreign IP address doesn’t match the source IP address. It isn’t delivered to the second socket because the foreign IP address doesn’t match the source IP address.

As the next example, consider an incoming UDP datagram destined for 140.252.1.29 (a unicast address), port 577, from 140.252.1.11, port 1500. Figure 22.15 shows to which sockets the datagram is delivered.

Table 22.15. Received datagram from {140.252.1.11, 1500} to {140.252.1.29, 577}.

Local address    Local port    Foreign address    Foreign port    Delivered?
140.252.1.29     577           140.252.1.11       1500            yes, 0 wildcard matches
140.252.13.63    577           140.252.13.35      1500            no, local and foreign IP mismatch
140.252.13.63    577           *                  *               no, local IP mismatch
*                577           *                  *               no, 2 wildcard matches

The datagram matches the first socket with no wildcard matches and also matches the fourth socket with two wildcard matches. It is delivered to the first socket, the best match.
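The two delivery rules can be sketched over the four sockets of Figure 22.13. This is a simplified user-space model, not the UDP input code: the names udp_pcb, match_wild, and udp_deliver are hypothetical, the caller states whether the destination is a broadcast or multicast address, and ties between equally good unicast matches are broken by list order, which is only one possible implementation choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANY 0u      /* wildcard, shown as * in the tables */
#define IP4(a, b, c, d) \
    (((uint32_t) (a) << 24) | ((uint32_t) (b) << 16) | \
     ((uint32_t) (c) << 8) | (uint32_t) (d))

struct udp_pcb {
    uint32_t laddr; uint16_t lport;
    uint32_t faddr; uint16_t fport;
};

/* The four UDP sockets of Figure 22.13. */
static const struct udp_pcb udp_pcbs[4] = {
    { IP4(140, 252, 1, 29),  577, IP4(140, 252, 1, 11),  1500 },
    { IP4(140, 252, 13, 63), 577, IP4(140, 252, 13, 35), 1500 },
    { IP4(140, 252, 13, 63), 577, ANY, ANY },
    { ANY, 577, ANY, ANY },
};

/* Wildcard count for a matching PCB, or -1 if it does not match. */
static int match_wild(const struct udp_pcb *p, uint32_t src, uint16_t sport,
                      uint32_t dst, uint16_t dport)
{
    int wild = 0;

    if (p->lport != dport)
        return -1;
    if (p->laddr == ANY)
        wild++;
    else if (p->laddr != dst)
        return -1;
    if (p->faddr == ANY)
        wild++;
    else if (p->faddr != src || p->fport != sport)
        return -1;
    return wild;
}

/* Return a bit mask of receiving sockets (bit i set = tab[i] receives).
 * to_all is nonzero for a broadcast or multicast destination. */
static unsigned udp_deliver(const struct udp_pcb *tab, size_t n,
                            uint32_t src, uint16_t sport,
                            uint32_t dst, uint16_t dport, int to_all)
{
    unsigned mask = 0;
    int best = -1, bestwild = 3;

    assert(n <= 32);
    for (size_t i = 0; i < n; i++) {
        int w = match_wild(&tab[i], src, sport, dst, dport);
        if (w < 0)
            continue;
        if (to_all) {
            mask |= 1u << i;            /* rule 1: every match receives */
        } else if (w < bestwild) {
            best = (int) i;             /* rule 2: remember best match */
            bestwild = w;
        }
    }
    if (!to_all && best >= 0)
        mask = 1u << best;
    return mask;
}
```

For the broadcast datagram of Figure 22.14 the result has bits 2 and 3 set (the third and fourth sockets); for the unicast datagram of Figure 22.15 only bit 0 is set.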

in_pcblookup Function

The function in_pcblookup serves four different purposes.

  1. When either TCP or UDP receives an IP datagram, in_pcblookup scans the protocol’s list of Internet PCBs looking for a matching PCB to receive the datagram. This is transport layer demultiplexing of a received datagram.

  2. When a process executes the bind system call, to assign a local IP address and local port number to a socket, in_pcbbind is called by the protocol to verify that the requested local address pair is not already in use.

  3. When a process executes the bind system call, requesting an ephemeral port be assigned to its socket, the kernel picks an ephemeral port and calls in_pcbbind to check if the port is in use. If it is in use, the next ephemeral port number is tried, and so on, until an unused port is located.

  4. When a process executes the connect system call, either explicitly or implicitly, in_pcbbind verifies that the requested socket pair is unique. (An implicit call to connect happens when a UDP datagram is sent on an unconnected socket. We’ll see this scenario in Chapter 23.)

In cases 2, 3, and 4 in_pcbbind calls in_pcblookup. Two options confuse the logic of the function. First, a process can specify either the SO_REUSEADDR or SO_REUSEPORT socket option to say that a duplicate local address is OK.

Second, sometimes a wildcard match is OK (e.g., an incoming UDP datagram can match a PCB that has a wildcard for its local IP address, meaning that the socket will accept UDP datagrams that arrive on any local interface), while other times a wildcard match is forbidden (e.g., when connecting to a foreign IP address and port number).
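The ephemeral port search of case 3 can be sketched as a loop around an "is this port in use" test. This is a user-space illustration under several assumptions: the range 1024 through 5000 is the traditional BSD ephemeral range and is not stated in the text above, an array lookup stands in for the call to in_pcblookup, the names are hypothetical, and the bound on the number of attempts is an addition of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed ephemeral range (traditional BSD values). */
#define EPHEMERAL_FIRST 1024
#define EPHEMERAL_LAST  5000

/* Stand-in for the PCB list search: is port some PCB's local port? */
static int port_in_use(const uint16_t *used, size_t n, uint16_t port)
{
    for (size_t i = 0; i < n; i++)
        if (used[i] == port)
            return 1;
    return 0;
}

/* Starting at *next_port (the counter kept in the list head), try
 * successive ports until an unused one is found. Updates *next_port
 * and returns the chosen port, or 0 if the whole range is in use. */
static uint16_t pick_ephemeral(uint16_t *next_port,
                               const uint16_t *used, size_t n)
{
    assert(next_port != NULL);
    if (*next_port < EPHEMERAL_FIRST || *next_port > EPHEMERAL_LAST)
        *next_port = EPHEMERAL_FIRST;

    for (int tries = 0; tries <= EPHEMERAL_LAST - EPHEMERAL_FIRST; tries++) {
        uint16_t candidate = *next_port;

        /* Advance the counter, wrapping at the end of the range. */
        *next_port = (candidate == EPHEMERAL_LAST)
            ? EPHEMERAL_FIRST : (uint16_t) (candidate + 1);

        if (!port_in_use(used, n, candidate))
            return candidate;
    }
    return 0;
}
```

Keeping the counter in the list head means consecutive requests on the same protocol start where the previous search stopped rather than rescanning from the bottom of the range.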

In the original Stanford IP multicast code appears the comment that “The logic of in_pcblookup is rather opaque and there is not a single comment.” The adjective opaque is an understatement.

The publicly available IP multicast code for BSD/386, which is derived from the port to 4.4BSD done by Craig Leres, fixed the overloaded semantics of this function by using in_pcblookup only for case 1 above. Cases 2 and 4 are handled by a new function named in_pcbconflict, and case 3 is handled by a new function named in_uniqueport. Dividing the original functionality into separate functions is much clearer, but in the Net/3 release, which we’re describing in this text, the logic is still combined into the single function in_pcblookup.

Figure 22.16 shows the in_pcblookup function.

Table 22.16. in_pcblookup function: search all the PCBs for a match.

------------------------------------------------------------------------- in_pcb.c
405 struct inpcb *
406 in_pcblookup(head, faddr, fport_arg, laddr, lport_arg, flags)
407 struct inpcb *head;
408 struct in_addr faddr, laddr;
409 u_int   fport_arg, lport_arg;
410 int     flags;
411 {
412     struct inpcb *inp, *match = 0;
413     int     matchwild = 3, wildcard;
414     u_short fport = fport_arg, lport = lport_arg;

415     for (inp = head->inp_next; inp != head; inp = inp->inp_next) {
416         if (inp->inp_lport != lport)
417             continue;           /* ignore if local ports are unequal */

418         wildcard = 0;

419         if (inp->inp_laddr.s_addr != INADDR_ANY) {
420             if (laddr.s_addr == INADDR_ANY)
421                 wildcard++;
422             else if (inp->inp_laddr.s_addr != laddr.s_addr)
423                 continue;
424         } else {
425             if (laddr.s_addr != INADDR_ANY)
426                 wildcard++;
427         }

428         if (inp->inp_faddr.s_addr != INADDR_ANY) {
429             if (faddr.s_addr == INADDR_ANY)
430                 wildcard++;
431             else if (inp->inp_faddr.s_addr != faddr.s_addr ||
432                      inp->inp_fport != fport)
433                 continue;
434         } else {
435             if (faddr.s_addr != INADDR_ANY)
436                 wildcard++;
437         }

438         if (wildcard && (flags & INPLOOKUP_WILDCARD) == 0)
439             continue;           /* wildcard match not allowed */

440         if (wildcard < matchwild) {
441             match = inp;
442             matchwild = wildcard;
443             if (matchwild == 0)
444                 break;          /* exact match, all done */
445         }
446     }
447     return (match);
448 }
------------------------------------------------------------------------- in_pcb.c

The function starts at the head of the protocol’s PCB list and potentially goes through every PCB on the list. The variable match remembers the pointer to the entry with the best match so far, and matchwild remembers the number of wildcards in that match. The latter is initialized to 3, which is a value greater than the maximum number of wildcard matches that can be encountered. (Any value greater than 2 would work.) Each time around the loop, the variable wildcard starts at 0 and counts the number of wildcard matches for each PCB.

Compare local port number

416-417

The first comparison is the local port number. If the PCB’s local port doesn’t match the lport argument, the PCB is ignored.

Compare local address

419-427

in_pcblookup compares the local address in the PCB with the laddr argument. If one is a wildcard and the other is not a wildcard, the wildcard counter is incremented. If both are not wildcards, then they must be the same, or this PCB is ignored. If both are wildcards, nothing changes: they can’t be compared and the wildcard counter isn’t incremented. Figure 22.17 summarizes the four different conditions.

Table 22.17. Four scenarios for the local IP address comparison done by in_pcblookup.

PCB local IP    laddr argument    Description
------------    --------------    -------------------------------------------
not *           *                 wildcard++
not *           not *             compare IP addresses, skip PCB if not equal
*               *                 can’t compare
*               not *             wildcard++

Compare foreign address and foreign port number

428-437

These lines perform the same test that we just described, but using the foreign addresses instead of the local addresses. Also, if both foreign addresses are not wildcards then not only must the two IP addresses be equal, but the two foreign ports must also be equal. Figure 22.18 summarizes the foreign IP comparisons.

Table 22.18. Four scenarios for the foreign IP address comparison done by in_pcblookup.

PCB foreign IP    faddr argument    Description
--------------    --------------    -----------------------------------------------------
not *             *                 wildcard++
not *             not *             compare IP addresses and ports, skip PCB if not equal
*                 *                 can’t compare
*                 not *             wildcard++

The additional comparison of the foreign port numbers can be performed for the second line of Figure 22.18 because it is not possible to have a PCB with a nonwildcard foreign address and a foreign port number of 0. This restriction is enforced by connect, which we’ll see shortly requires a nonwildcard foreign IP address and a nonzero foreign port. It is possible, however, and common, to have a wildcard local address with a nonzero local port. We saw this in Figures 22.10 and 22.13.

Check if wildcard match allowed

438-439

The flags argument can be set to INPLOOKUP_WILDCARD, which means a match containing wildcards is OK. If a match is found containing wildcards (wildcard is nonzero) and this flag was not specified by the caller, this PCB is ignored. When TCP and UDP call this function to demultiplex an incoming datagram, INPLOOKUP_WILDCARD is always set, since a wildcard match is OK. (Recall our examples using Figures 22.10 and 22.13.) But when this function is called as part of the connect system call, in order to verify that a socket pair is not already in use, the flags argument is set to 0.

Remember best match, return if exact match found

440-447

These statements remember the best match found so far. Again, the best match is considered the one with the fewest number of wildcard matches. If a match is found with one or two wildcards, that match is remembered and the loop continues. But if an exact match is found (wildcard is 0), the loop terminates, and a pointer to the PCB with that exact match is returned.

Example—Demultiplexing of Received TCP Segment

Figure 22.19 is from the TCP example we discussed with Figure 22.11. Assume in_pcblookup is demultiplexing a received datagram from 140.252.1.11, port 1500, destined for 140.252.1.29, port 23. Also assume that the order of the PCBs is the order of the rows in the figure. laddr is the destination IP address, lport is the destination TCP port, faddr is the source IP address, and fport is the source TCP port.

Table 22.19. laddr = 140.252.1.29, lport = 23, faddr = 140.252.1.11, fport = 1500.

PCB values
Local address    Local port    Foreign address    Foreign port    wildcard
-------------    ----------    ---------------    ------------    --------
140.252.1.29     23            *                  *               1
*                23            *                  *               2
140.252.1.29     23            140.252.1.11       1500            0

When the first row is compared to the incoming segment, wildcard is 1 (the foreign IP address), flags is set to INPLOOKUP_WILDCARD, so match is set to point to this PCB and matchwild is set to 1. The loop continues since an exact match has not been found yet. The next time around the loop, wildcard is 2 (the local and foreign IP addresses) and since this is greater than matchwild, the entry is not remembered, and the loop continues. The next time around the loop, wildcard is 0, which is less than matchwild (1), so this entry is remembered in match. The loop also terminates since an exact match has been found and the pointer to this PCB is returned to the caller.
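The walk-through above can be replayed outside the kernel. The following is a minimal user-space sketch, not the Net/3 code: it keeps the PCBs in an array instead of a linked list, stores addresses as plain integers, and uses the hypothetical names pcb_lookup, demo_lookup, ip4, ADDR_ANY, and LOOKUP_WILDCARD, none of which appear in the kernel source. The comparison logic follows lines 415-446 of the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_ANY        0u   /* stand-in for INADDR_ANY */
#define LOOKUP_WILDCARD 1    /* stand-in for INPLOOKUP_WILDCARD */

/* Simplified PCB for the sketch: only the four values that are compared. */
struct pcb {
    uint32_t laddr, faddr;
    uint16_t lport, fport;
};

/* Build an IPv4 address from its four dotted-decimal components. */
static uint32_t ip4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/* Return the index of the entry with the fewest wildcard matches, or -1. */
int pcb_lookup(const struct pcb *tab, size_t n,
               uint32_t faddr, uint16_t fport,
               uint32_t laddr, uint16_t lport, int flags)
{
    int match = -1, matchwild = 3;

    for (size_t i = 0; i < n; i++) {
        const struct pcb *p = &tab[i];
        int wildcard = 0;

        if (p->lport != lport)
            continue;                /* local ports must be equal */

        if (p->laddr != ADDR_ANY) {
            if (laddr == ADDR_ANY)
                wildcard++;
            else if (p->laddr != laddr)
                continue;
        } else if (laddr != ADDR_ANY) {
            wildcard++;
        }

        if (p->faddr != ADDR_ANY) {
            if (faddr == ADDR_ANY)
                wildcard++;
            else if (p->faddr != faddr || p->fport != fport)
                continue;
        } else if (faddr != ADDR_ANY) {
            wildcard++;
        }

        if (wildcard && (flags & LOOKUP_WILDCARD) == 0)
            continue;                /* wildcard match not allowed */

        if (wildcard < matchwild) {
            match = (int)i;
            matchwild = wildcard;
            if (matchwild == 0)
                break;               /* exact match, all done */
        }
    }
    return match;
}

/* Search the first n rows of Figure 22.19 for the example segment. */
int demo_lookup(size_t n, int flags)
{
    const struct pcb tab[] = {
        { ip4(140, 252, 1, 29), ADDR_ANY,             23, 0    },
        { ADDR_ANY,             ADDR_ANY,             23, 0    },
        { ip4(140, 252, 1, 29), ip4(140, 252, 1, 11), 23, 1500 },
    };

    assert(n <= sizeof tab / sizeof tab[0]);
    return pcb_lookup(tab, n, ip4(140, 252, 1, 11), 1500,
                      ip4(140, 252, 1, 29), 23, flags);
}
```

Calling demo_lookup(3, LOOKUP_WILDCARD) selects the third row (index 2), the exact match. With only the first two rows present, the first row wins because it has one wildcard instead of two, and with a flags value of 0 neither of those two rows is acceptable.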

If in_pcblookup were used by TCP and UDP only to demultiplex incoming datagrams, it could be simplified. First, there’s no need to check whether the faddr or laddr arguments are wildcards, since these are the source and destination IP addresses from the received datagram. Also the flags argument could be removed, along with its corresponding test, since wildcard matches are always OK.

This section has covered the mechanics of the in_pcblookup function. We’ll return to this function and discuss its meaning after seeing how it is called from the in_pcbbind and in_pcbconnect functions.

in_pcbbind Function

The next function, in_pcbbind, binds a local address and port number to a socket. It is called from five functions:

  1. from bind for a TCP socket (normally to bind a server’s well-known port);

  2. from bind for a UDP socket (either to bind a server’s well-known port or to bind an ephemeral port to a client’s socket);

  3. from connect for a TCP socket, if the socket has not yet been bound to a nonzero port (this is typical for TCP clients);

  4. from listen for a TCP socket, if the socket has not yet been bound to a nonzero port (this is rare, since listen is called by a TCP server, which normally binds a well-known port, not an ephemeral port); and

  5. from in_pcbconnect (Section 22.8), if the local IP address and local port number have not been set (typical for a call to connect for a UDP socket or for each call to sendto for an unconnected UDP socket).

In cases 3, 4, and 5, an ephemeral port number is bound to the socket and the local IP address is not changed (in case it is already set).

We call cases 1 and 2 explicit binds and cases 3, 4, and 5 implicit binds. We also note that although it is normal in case 2 for a server to bind a well-known port, servers invoked using remote procedure calls (RPC) often bind ephemeral ports and then register their ephemeral port with another program that maintains a mapping between the server’s RPC program number and its ephemeral port (e.g., the Sun port mapper described in Section 29.4 of Volume 1).

We’ll show the in_pcbbind function in three sections. Figure 22.20 is the first section.

Table 22.20. in_pcbbind function: bind a local address and port number.

------------------------------------------------------------------------- in_pcb.c
 52 int
 53 in_pcbbind(inp, nam)
 54 struct inpcb *inp;
 55 struct mbuf *nam;
 56 {
 57     struct socket *so = inp->inp_socket;
 58     struct inpcb *head = inp->inp_head;
 59     struct sockaddr_in *sin;
 60     struct proc *p = curproc;   /* XXX */
 61     u_short lport = 0;
 62     int     wild = 0, reuseport = (so->so_options & SO_REUSEPORT);
 63     int     error;

 64     if (in_ifaddr == 0)
 65         return (EADDRNOTAVAIL);
 66     if (inp->inp_lport || inp->inp_laddr.s_addr != INADDR_ANY)
 67         return (EINVAL);

 68     if ((so->so_options & (SO_REUSEADDR | SO_REUSEPORT)) == 0 &&
 69         ((so->so_proto->pr_flags & PR_CONNREQUIRED) == 0 ||
 70          (so->so_options & SO_ACCEPTCONN) == 0))
 71         wild = INPLOOKUP_WILDCARD;
------------------------------------------------------------------------- in_pcb.c

64-67

The first two tests verify that at least one interface has been assigned an IP address and that the socket is not already bound. You can’t bind a socket twice.

68-71

This if statement is confusing. The net result sets the variable wild to INPLOOKUP_WILDCARD if neither SO_REUSEADDR nor SO_REUSEPORT is set.

The second test is true for UDP sockets since PR_CONNREQUIRED is false for connectionless sockets and true for connection-oriented sockets.

The third test is where the confusion lies [Torek 1992]. The socket flag SO_ACCEPTCONN is set only by the listen system call (Section 15.9), which is valid only for a connection-oriented server. In the normal scenario, a TCP server calls socket, bind, and then listen. Therefore, when in_pcbbind is called by bind, this socket flag is cleared. Even if the process calls socket and then listen, without calling bind, TCP’s PRU_LISTEN request calls in_pcbbind to assign an ephemeral port to the socket before the socket layer sets the SO_ACCEPTCONN flag. This means the third test in the if statement, testing whether SO_ACCEPTCONN is not set, is always true. The if statement is therefore equivalent to

if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) == 0 &&
    ((so->so_proto->pr_flags & PR_CONNREQUIRED) == 0 || 1))
        wild = INPLOOKUP_WILDCARD;

Since anything logically ORed with 1 is always true, this is equivalent to

if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) == 0)
        wild = INPLOOKUP_WILDCARD;

which is simpler to understand: if either of the REUSE socket options is set, wild is left as 0. If neither of the REUSE socket options is set, wild is set to INPLOOKUP_WILDCARD. In other words, when in_pcblookup is called later in the function, a wildcard match is allowed only if neither of the REUSE socket options is on.
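The equivalence can be checked mechanically. The sketch below is our own illustration, not kernel code: the constants are local stand-ins for the socket and protocol flags (the names OPT_*, PROTO_CONNREQ, and LOOKUP_WILDCARD and their bit values are chosen for the sketch), and the two functions transcribe the original test and the simplified one. Under the assumption argued above, that SO_ACCEPTCONN is never set when in_pcbbind runs, the two forms agree for every combination of the remaining flags.

```c
#include <assert.h>

/* Stand-ins for the kernel flag bits; values are illustrative only. */
#define OPT_ACCEPTCONN  0x0002
#define OPT_REUSEADDR   0x0004
#define OPT_REUSEPORT   0x0200
#define PROTO_CONNREQ   0x04
#define LOOKUP_WILDCARD 1

/* The test as written on lines 68-71 of the listing. */
int wild_original(int so_options, int pr_flags)
{
    int wild = 0;

    if ((so_options & (OPT_REUSEADDR | OPT_REUSEPORT)) == 0 &&
        ((pr_flags & PROTO_CONNREQ) == 0 ||
         (so_options & OPT_ACCEPTCONN) == 0))
        wild = LOOKUP_WILDCARD;
    return wild;
}

/* The simplified form: only the two REUSE options matter. */
int wild_simplified(int so_options)
{
    return (so_options & (OPT_REUSEADDR | OPT_REUSEPORT)) == 0
               ? LOOKUP_WILDCARD : 0;
}

/* Compare both forms over every combination with the accept flag clear.
 * Returns the number of combinations checked. */
int check_equivalence(void)
{
    static const int reuse[] = { 0, OPT_REUSEADDR, OPT_REUSEPORT,
                                 OPT_REUSEADDR | OPT_REUSEPORT };
    static const int proto[] = { 0, PROTO_CONNREQ };
    int checked = 0;

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 2; j++) {
            assert(wild_original(reuse[i], proto[j]) ==
                   wild_simplified(reuse[i]));
            checked++;
        }
    }
    return checked;
}
```

The two forms differ only in a state the text argues is unreachable: with the accept flag set on a connection-oriented socket and no REUSE option, wild_original returns 0 while wild_simplified returns LOOKUP_WILDCARD.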

The next section of the in_pcbbind function, shown in Figure 22.22, processes the optional nam argument.

72-75

The nam argument is a nonnull pointer only when the process calls bind explicitly. For an implicit bind (a side effect of connect, listen, or in_pcbconnect, cases 3, 4, and 5 from the beginning of this section), nam is a null pointer. When the argument is specified, it is an mbuf containing a sockaddr_in structure. Figure 22.21 shows the four cases for the nonnull nam argument.

Table 22.21. Four cases for nam argument to in_pcbbind.

nam argument:         PCB member gets set to:
localIP    lport      inp_laddr    inp_lport         Comment
-------    -------    ---------    --------------    -------------------------------
not *      0          localIP      ephemeral port    localIP must be local interface
not *      nonzero    localIP      lport             subject to in_pcblookup
*          0          *            ephemeral port
*          nonzero    *            lport             subject to in_pcblookup

76-83

The test for the correct address family is commented out, yet the identical test in the in_pcbconnect function (Figure 22.25) is performed. We expect either both to be in or both to be out.

Table 22.22. in_pcbbind function: process optional nam argument.

------------------------------------------------------------------------- in_pcb.c
 72     if (nam) {
 73         sin = mtod(nam, struct sockaddr_in *);
 74         if (nam->m_len != sizeof(*sin))
 75             return (EINVAL);
 76 #ifdef notdef
 77         /*
 78          * We should check the family, but old programs
 79          * incorrectly fail to initialize it.
 80          */
 81         if (sin->sin_family != AF_INET)
 82             return (EAFNOSUPPORT);
 83 #endif
 84         lport = sin->sin_port;  /* might be 0 */
 85         if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr))) {
 86             /*
 87              * Treat SO_REUSEADDR as SO_REUSEPORT for multicast;
 88              * allow complete duplication of binding if
 89              * SO_REUSEPORT is set, or if SO_REUSEADDR is set
 90              * and a multicast address is bound on both
 91              * new and duplicated sockets.
 92              */
 93             if (so->so_options & SO_REUSEADDR)
 94                 reuseport = SO_REUSEADDR | SO_REUSEPORT;
 95         } else if (sin->sin_addr.s_addr != INADDR_ANY) {
 96             sin->sin_port = 0;  /* yech... */
 97             if (ifa_ifwithaddr((struct sockaddr *) sin) == 0)
 98                 return (EADDRNOTAVAIL);
 99         }
100         if (lport) {
101             struct inpcb *t;

102             /* GROSS */
103             if (ntohs(lport) < IPPORT_RESERVED &&
104                 (error = suser(p->p_ucred, &p->p_acflag)))
105                 return (error);
106             t = in_pcblookup(head, zeroin_addr, 0,
107                              sin->sin_addr, lport, wild);
108             if (t && (reuseport & t->inp_socket->so_options) == 0)
109                 return (EADDRINUSE);
110         }
111         inp->inp_laddr = sin->sin_addr;     /* might be wildcard */
112     }
------------------------------------------------------------------------- in_pcb.c

85-94

Net/3 tests whether the IP address being bound is a multicast group. If so, the SO_REUSEADDR option is considered identical to SO_REUSEPORT.

95-99

Otherwise, if the local address being bound by the caller is not the wildcard, ifa_ifwithaddr verifies that the address corresponds to a local interface.

The comment “yech” is probably there because the port number in the socket address structure must be set to 0: ifa_ifwithaddr does a binary comparison of the entire structure, not just a comparison of the IP addresses.

This is one of the few instances where the process must zero the socket address structure before issuing the system call. If bind is called and the final 8 bytes of the socket address structure (sin_zero[8]) are nonzero, ifa_ifwithaddr will not find the requested interface, and in_pcbbind will return an error.
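From the process side, the safe habit is to clear the whole structure before filling in any member. The following sketch shows one way to do that with the standard sockets API; the helper names make_bind_addr and sin_zero_is_clear are ours, not part of any library. Whether a particular kernel other than Net/3 depends on the zeroed bytes is not something the text establishes, so we treat this as a portable precaution.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Prepare *sin for bind(): zero everything first so that sin_zero[]
 * (and any other member we do not set) is all zeros.
 * Returns 0 on success, -1 if ip is not a valid dotted-decimal string. */
int make_bind_addr(struct sockaddr_in *sin, const char *ip,
                   unsigned short port)
{
    assert(sin != NULL && ip != NULL);

    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);         /* network byte order */
    if (inet_pton(AF_INET, ip, &sin->sin_addr) != 1)
        return -1;
    return 0;
}

/* Return 1 if the trailing sin_zero bytes are all zero, else 0. */
int sin_zero_is_clear(const struct sockaddr_in *sin)
{
    for (size_t i = 0; i < sizeof(sin->sin_zero); i++)
        if (sin->sin_zero[i] != 0)
            return 0;
    return 1;
}
```

A caller would pass the resulting structure to bind, cast to struct sockaddr *, with a length of sizeof(struct sockaddr_in).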

100-105

The next if statement is executed when the caller is binding a nonzero port, that is, the process wants to bind one particular port number (the second and fourth scenarios from Figure 22.21). If the requested port is less than 1024 (IPPORT_RESERVED) the process must have superuser privilege. This is not part of the Internet protocols, but a Berkeley convention. A port number less than 1024 is called a reserved port and is used, for example, by the rcmd function [Stevens 1990], which in turn is used by the rlogin and rsh client programs as part of their authentication with their servers.
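Note that lport is held in network byte order, which is why the listing converts it with ntohs before comparing it with IPPORT_RESERVED. A minimal sketch of the same test, using our own names port_needs_privilege and PORT_RESERVED:

```c
#include <assert.h>
#include <arpa/inet.h>

#define PORT_RESERVED 1024   /* stand-in for IPPORT_RESERVED */

/* lport_net is a nonzero port in network byte order, as found in
 * sockaddr_in.sin_port. Returns 1 if the Berkeley convention described
 * above would require superuser privilege to bind it, else 0. */
int port_needs_privilege(unsigned short lport_net)
{
    assert(lport_net != 0);      /* 0 means "choose an ephemeral port" */
    return ntohs(lport_net) < PORT_RESERVED;
}
```

On a little-endian host, comparing the unconverted value would give the wrong answer for most ports, since the two bytes of the port number are swapped relative to host order.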

106-109

The function in_pcblookup (Figure 22.16) is then called to check whether a PCB already exists with the same local IP address and local port number. The second argument is the wildcard IP address (the foreign IP address) and the third argument is a port number of 0 (the foreign port). The wildcard value for the second argument causes in_pcblookup to ignore the foreign IP address and foreign port in the PCB; only the local IP address and local port are compared to sin->sin_addr and lport, respectively. We mentioned earlier that wild is set to INPLOOKUP_WILDCARD only if neither of the REUSE socket options is set.

111

The caller’s value for the local IP address is stored in the PCB. This can be the wildcard address, if that’s the value specified by the caller. In this case the local IP address is chosen by the kernel, but not until the socket is connected at some later time. This is because the local IP address is determined by IP routing, based on the foreign IP address.

The final section of in_pcbbind handles the assignment of an ephemeral port when the caller explicitly binds a port of 0, or when the nam argument is a null pointer (an implicit bind).

Table 22.23. in_pcbbind function: choose an ephemeral port.

------------------------------------------------------------------------- in_pcb.c
113     if (lport == 0)
114         do {
115             if (head->inp_lport++ < IPPORT_RESERVED ||
116                 head->inp_lport > IPPORT_USERRESERVED)
117                 head->inp_lport = IPPORT_RESERVED;
118             lport = htons(head->inp_lport);
119         } while (in_pcblookup(head,
120                             zeroin_addr, 0, inp->inp_laddr, lport, wild));
121     inp->inp_lport = lport;
122     return (0);
123 }
------------------------------------------------------------------------- in_pcb.c

113-122

The next ephemeral port number to use for this protocol (TCP or UDP) is maintained in the head of the protocol’s PCB list: tcb or udb. Other than the inp_next and inp_back pointers in the protocol’s head PCB, the only other element of the inpcb structure that is used is the local port number. Confusingly, this local port number is maintained in host byte order in the head PCB, but in network byte order in all the other PCBs on the list! The ephemeral port numbers start at 1024 (IPPORT_RESERVED) and get incremented by 1 until port 5000 is used (IPPORT_USERRESERVED), then cycle back to 1024. The loop is executed until in_pcblookup does not find a match.
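The wraparound and the two byte orders are easier to see in isolation. The sketch below mirrors the loop on lines 113-120 with the PCB lookup replaced by a caller-supplied predicate; pick_ephemeral, port_in_use_fn, and the two example predicates are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <arpa/inet.h>

#define PORT_RESERVED     1024   /* stand-in for IPPORT_RESERVED */
#define PORT_USERRESERVED 5000   /* stand-in for IPPORT_USERRESERVED */

/* Stand-in for the in_pcblookup call: returns nonzero if the port
 * (network byte order) is already bound. */
typedef int (*port_in_use_fn)(unsigned short lport_net);

/* *last is the most recently tried port in HOST byte order, like the
 * local port kept in the head PCB. The chosen port is returned in
 * NETWORK byte order, as it would be stored in an ordinary PCB. */
unsigned short pick_ephemeral(unsigned short *last, port_in_use_fn in_use)
{
    unsigned short lport;

    assert(last != 0 && in_use != 0);
    do {
        /* The first test sees the old value (post-increment); the
         * second test sees the incremented value. */
        if ((*last)++ < PORT_RESERVED || *last > PORT_USERRESERVED)
            *last = PORT_RESERVED;
        lport = htons(*last);
    } while (in_use(lport));
    return lport;
}

/* Example predicates for the sketch. */
int nothing_in_use(unsigned short lport_net)
{
    (void)lport_net;
    return 0;
}

int port_1024_in_use(unsigned short lport_net)
{
    return ntohs(lport_net) == 1024;
}
```

Starting from a counter of 0, the first port handed out is 1024 and the next is 1025; once the counter reaches 5000 the following call wraps to 1024. As written, this sketch loops forever if the predicate reports every port in the range as in use.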

SO_REUSEADDR Examples

Let’s look at some common examples to see the interaction of in_pcbbind with in_pcblookup and the two REUSE socket options.

  1. A TCP or UDP server normally starts by calling socket and bind. Assume a TCP server that calls bind, specifying the wildcard IP address and its nonzero well-known port, say 23 (the Telnet server). Also assume that the server is not already running and that the process does not set the SO_REUSEADDR socket option.

    in_pcbbind calls in_pcblookup with INPLOOKUP_WILDCARD as the final argument. The loop in in_pcblookup won’t find a matching PCB, assuming no other process is using the server’s well-known TCP port, causing a null pointer to be returned. This is OK and in_pcbbind returns 0.

  2. Assume the same scenario as above, but with the server already running when someone tries to start the server a second time.

    When in_pcblookup is called it finds the PCB with a local socket of {*, 23}. Since the wildcard counter is 0, in_pcblookup returns the pointer to this entry. Since reuseport is 0, in_pcbbind returns EADDRINUSE.

  3. Assume the same scenario as the previous example, but when the attempt is made to start the server a second time, the SO_REUSEADDR socket option is specified.

    Since this socket option is specified, in_pcbbind calls in_pcblookup with a final argument of 0. But the PCB with a local socket of {*, 23} is still matched and returned because wildcard is 0, since in_pcblookup cannot compare the two wildcard addresses (Figure 22.17). in_pcbbind again returns EADDRINUSE, preventing us from starting two instances of the server with identical local sockets, regardless of whether we specify SO_REUSEADDR or not.

  4. Assume that a Telnet server is already running with a local socket of {*, 23} and we try to start another with a local socket of {140.252.13.35, 23}.

    Assuming SO_REUSEADDR is not specified, in_pcblookup is called with a final argument of INPLOOKUP_WILDCARD. When it compares the PCB containing {*, 23}, the counter wildcard is set to 1. Since a wildcard match is allowed, this match is remembered as the best match and a pointer to it is returned after all the TCP PCBs are scanned. in_pcbbind returns EADDRINUSE.

  5. This example is the same as the previous one, but we specify the SO_REUSEADDR socket option for the second server that tries to bind the local socket {140.252.13.35, 23}.

    The final argument to in_pcblookup is now 0, since the socket option is specified. When the PCB with the local socket {*, 23} is compared, the wildcard counter is 1, but since the final flags argument is 0, this entry is skipped and is not remembered as a match.

    After comparing all the TCP PCBs, the function returns a null pointer and in_pcbbind returns 0.

  6. Assume the first Telnet server is started with a local socket of {140.252.13.35, 23} when we try to start a second server with a local socket of {*, 23}. This is the same as the previous example, except we’re starting the servers in reverse order this time.

    The first server is started without a problem, assuming no other socket has already bound port 23. When we start the second server, the final argument to in_pcblookup is INPLOOKUP_WILDCARD, assuming the SO_REUSEADDR socket option is not specified. When the PCB with the local socket of {140.252.13.35, 23} is compared, the wildcard counter is set to 1 and this entry is remembered. After all the TCP PCBs are compared, the pointer to this entry is returned, causing in_pcbbind to return EADDRINUSE.

  7. What if we start two instances of a server, both with a nonwildcard local IP address? Assume we start the first Telnet server with a local socket of {140.252.13.35, 23} and then try to start a second with a local socket of {127.0.0.1, 23}, without specifying SO_REUSEADDR.

    When the second server calls in_pcbbind, it calls in_pcblookup with a final argument of INPLOOKUP_WILDCARD. When the PCB with the local socket of {140.252.13.35, 23} is compared, it is skipped because the local IP addresses are not equal. in_pcblookup returns a null pointer, and in_pcbbind returns 0.

    From this example we see that the SO_REUSEADDR socket option has no effect on nonwildcard IP addresses. Indeed the test on the flags value INPLOOKUP_WILDCARD in in_pcblookup is made only when wildcard is greater than 0, that is, when either the PCB entry has a wildcard IP address or the IP address being bound is the wildcard.

  8. As a final example, assume we try to start two instances of the same server, both with the same nonwildcard local IP address, say 127.0.0.1.

    When the second server is started, in_pcblookup always returns a pointer to the matching PCB with the same local socket. This happens regardless of the SO_REUSEADDR socket option, because the wildcard counter is always 0 for this comparison. Since in_pcblookup returns a nonnull pointer, in_pcbbind returns EADDRINUSE.

From these examples we can state the rules about the binding of local IP addresses and the SO_REUSEADDR socket option. These rules are shown in Figure 22.24. We assume that localIP1 and localIP2 are two different unicast or broadcast IP addresses valid on the local host, and that localmcastIP is a multicast group. We also assume that the process is trying to bind the same nonzero port number that is already bound to the existing PCB.

Table 22.24. Effect of SO_REUSEADDR socket option on binding of local IP address.

                               SO_REUSEADDR
Existing PCB    Try to bind    off      on       Description
------------    ------------   -----    -----    ------------------------------------------------------------------
localIP1        localIP1       error    error    one server per IP address and port
localIP1        localIP2       OK       OK       one server for each local interface
localIP1        *              error    OK       one server for one interface, other server for remaining interfaces
*               localIP1       error    OK       one server for one interface, other server for remaining interfaces
*               *              error    error    can’t duplicate local sockets (same as first example)
localmcastIP    localmcastIP   error    OK       multiple multicast recipients

We need to differentiate between a unicast or broadcast address and a multicast address, because we saw that in_pcbbind considers SO_REUSEADDR to be the same as SO_REUSEPORT for a multicast address.

SO_REUSEPORT Socket Option

The handling of SO_REUSEPORT in Net/3 changes the logic of in_pcbbind to allow duplicate local sockets as long as both sockets specify SO_REUSEPORT. In other words, all the servers must agree to share the same local port.
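The rules for both REUSE options can be condensed into one small function. The sketch below is our own model of the checks in in_pcbbind and in_pcblookup for a single existing PCB, under several assumptions: the existing socket is unconnected (wildcard foreign address), both sockets use the same nonzero port, and addresses are plain host-order integers. The names bind_conflicts, struct bound, IP4, and the OPT_* bits are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_ANY      0u        /* stand-in for INADDR_ANY */
#define OPT_REUSEADDR 0x0004    /* illustrative option bits */
#define OPT_REUSEPORT 0x0200

#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* An existing, unconnected socket bound to the port in question. */
struct bound {
    uint32_t laddr;             /* local address, host byte order */
    int      options;           /* its REUSE option bits */
};

/* Class D test: the high-order four bits are 1110. */
static int is_multicast(uint32_t a)
{
    return (a & 0xf0000000u) == 0xe0000000u;
}

/* Returns 1 if binding laddr with new_opts would fail with EADDRINUSE
 * because of the existing socket *old, else 0. */
int bind_conflicts(const struct bound *old, uint32_t laddr, int new_opts)
{
    int reuseport = new_opts & OPT_REUSEPORT;
    int allow_wild = (new_opts & (OPT_REUSEADDR | OPT_REUSEPORT)) == 0;
    int wildcard;

    assert(old != 0);

    /* For a multicast address, SO_REUSEADDR is treated as SO_REUSEPORT. */
    if (is_multicast(laddr) && (new_opts & OPT_REUSEADDR))
        reuseport = OPT_REUSEADDR | OPT_REUSEPORT;

    /* Two different nonwildcard addresses never match. */
    if (old->laddr != ADDR_ANY && laddr != ADDR_ANY && old->laddr != laddr)
        return 0;

    /* Exactly one wildcard address counts as a wildcard match. */
    wildcard = (old->laddr == ADDR_ANY) != (laddr == ADDR_ANY);
    if (wildcard && !allow_wild)
        return 0;               /* the lookup skips this PCB */

    /* A PCB was found: refuse unless the sockets share a REUSE bit. */
    return (reuseport & old->options) == 0;
}
```

With these assumptions the function reproduces each row of Figure 22.24, and it also shows the SO_REUSEPORT rule: a duplicate local socket is accepted only when the existing socket set the option too.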

in_pcbconnect Function

The function in_pcbconnect specifies the foreign IP address and foreign port number for a socket. It is called from four functions:

  1. from connect for a TCP socket (required for a TCP client);

  2. from connect for a UDP socket (optional for a UDP client, rare for a UDP server);

  3. from sendto when a datagram is output on an unconnected UDP socket (common); and

  4. from tcp_input when a connection request (a SYN segment) arrives on a TCP socket that is in the LISTEN state (standard for a TCP server).

In all four cases it is common, though not required, for the local IP address and local port to be unspecified when in_pcbconnect is called. Therefore one function of in_pcbconnect is to assign the local values when they are unspecified.

We’ll discuss the in_pcbconnect function in four sections. Figure 22.25 shows the first section.

Table 22.25. in_pcbconnect function: verify arguments, check foreign IP address.

------------------------------------------------------------------------- in_pcb.c
130 int
131 in_pcbconnect(inp, nam)
132 struct inpcb *inp;
133 struct mbuf *nam;
134 {
135     struct in_ifaddr *ia;
136     struct sockaddr_in *ifaddr;
137     struct sockaddr_in *sin = mtod(nam, struct sockaddr_in *);
138     if (nam->m_len != sizeof(*sin))
139         return (EINVAL);
140     if (sin->sin_family != AF_INET)
141         return (EAFNOSUPPORT);
142     if (sin->sin_port == 0)
143         return (EADDRNOTAVAIL);
144     if (in_ifaddr) {
145         /*
146          * If the destination address is INADDR_ANY,
147          * use the primary local address.
148          * If the supplied address is INADDR_BROADCAST,
149          * and the primary interface supports broadcast,
150          * choose the broadcast address for that interface.
151          */
152 #define satosin(sa)     ((struct sockaddr_in *)(sa))
153 #define sintosa(sin)    ((struct sockaddr *)(sin))
154 #define ifatoia(ifa)    ((struct in_ifaddr *)(ifa))
155         if (sin->sin_addr.s_addr == INADDR_ANY)
156             sin->sin_addr = IA_SIN(in_ifaddr)->sin_addr;
157         else if (sin->sin_addr.s_addr == (u_long) INADDR_BROADCAST &&
158                  (in_ifaddr->ia_ifp->if_flags & IFF_BROADCAST))
159             sin->sin_addr = satosin(&in_ifaddr->ia_broadaddr)->sin_addr;
160     }
------------------------------------------------------------------------- in_pcb.c

Validate argument

130-143

The nam argument points to an mbuf containing a sockaddr_in structure with the foreign IP address and port number. These lines validate the argument and verify that the caller is not trying to connect to a port number of 0.

Handle connection to 0.0.0.0 and 255.255.255.255 specially

144-160

The test of the global in_ifaddr verifies that an IP interface has been configured. If the foreign IP address is 0.0.0.0 (INADDR_ANY), then 0.0.0.0 is replaced with the IP address of the primary IP interface. This means the calling process is connecting to a peer on this host. If the foreign IP address is 255.255.255.255 (INADDR_BROADCAST) and the primary interface supports broadcasting, then 255.255.255.255 is replaced with the broadcast address of the primary interface. This allows a UDP application to broadcast on the primary interface without having to figure out its IP address: it can simply send datagrams to 255.255.255.255, and the kernel converts this to the appropriate IP address for the interface.
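The two substitutions can be summarized as a pure function of the requested destination and the primary interface. This is a sketch under our own assumptions: struct primary_if and effective_dest are hypothetical names, addresses are host-order integers, and the caller is assumed to have already established that an interface is configured.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_ANY       0x00000000u   /* 0.0.0.0 */
#define ADDR_BROADCAST 0xffffffffu   /* 255.255.255.255 */

/* Summary of the primary interface, standing in for the first entry
 * on the in_ifaddr list. */
struct primary_if {
    uint32_t addr;            /* interface IP address */
    uint32_t broadaddr;       /* interface broadcast address */
    int      can_broadcast;   /* nonzero if broadcasting is supported */
};

/* Return the foreign address that would actually be used for dst. */
uint32_t effective_dest(uint32_t dst, const struct primary_if *pif)
{
    assert(pif != 0);                 /* sketch assumes an interface exists */

    if (dst == ADDR_ANY)
        return pif->addr;             /* peer is on this host */
    if (dst == ADDR_BROADCAST && pif->can_broadcast)
        return pif->broadaddr;        /* interface-specific broadcast */
    return dst;                       /* any other address is unchanged */
}
```

If the primary interface does not support broadcasting, 255.255.255.255 is left as is, which matches the else-if structure on lines 155-159.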

The next section of code, Figure 22.26, handles the case of an unspecified local address. This is the common scenario for TCP and UDP clients, cases 1, 2, and 3 from the list at the beginning of this section.

Table 22.26. in_pcbconnect function: local IP address not yet specified.

------------------------------------------------------------------------- in_pcb.c
161     if (inp->inp_laddr.s_addr == INADDR_ANY) {
162         struct route *ro;

163         ia = (struct in_ifaddr *) 0;
164         /*
165          * If route is known or can be allocated now,
166          * our src addr is taken from the i/f, else punt.
167          */
168         ro = &inp->inp_route;
169         if (ro->ro_rt &&
170             (satosin(&ro->ro_dst)->sin_addr.s_addr !=
171              sin->sin_addr.s_addr ||
172              inp->inp_socket->so_options & SO_DONTROUTE)) {
173             RTFREE(ro->ro_rt);
174             ro->ro_rt = (struct rtentry *) 0;
175         }
176         if ((inp->inp_socket->so_options & SO_DONTROUTE) == 0 &&    /* XXX */
177             (ro->ro_rt == (struct rtentry *) 0 ||
178              ro->ro_rt->rt_ifp == (struct ifnet *) 0)) {
179             /* No route yet, so try to acquire one */
180             ro->ro_dst.sa_family = AF_INET;
181             ro->ro_dst.sa_len = sizeof(struct sockaddr_in);
182             ((struct sockaddr_in *) &ro->ro_dst)->sin_addr =
183                 sin->sin_addr;
184             rtalloc(ro);
185         }
186         /*
187          * If we found a route, use the address
188          * corresponding to the outgoing interface
189          * unless it is the loopback (in case a route
190          * to our address on another net goes to loopback).
191          */
192         if (ro->ro_rt && !(ro->ro_rt->rt_ifp->if_flags & IFF_LOOPBACK))
193             ia = ifatoia(ro->ro_rt->rt_ifa);
194         if (ia == 0) {
195             u_short fport = sin->sin_port;

196             sin->sin_port = 0;
197             ia = ifatoia(ifa_ifwithdstaddr(sintosa(sin)));
198             if (ia == 0)
199                 ia = ifatoia(ifa_ifwithnet(sintosa(sin)));
200             sin->sin_port = fport;
201             if (ia == 0)
202                 ia = in_ifaddr;
203             if (ia == 0)
204                 return (EADDRNOTAVAIL);
205         }
------------------------------------------------------------------------- in_pcb.c

Release route if no longer valid

164-175

If a route is held by the PCB but the destination of that route differs from the foreign address being connected to, or the SO_DONTROUTE socket option is set, that route is released.

To understand why a PCB may have an associated route, consider case 3 from the list at the beginning of this section: in_pcbconnect is called every time a UDP datagram is sent on an unconnected socket. Each time a process calls sendto, the UDP output function calls in_pcbconnect, ip_output, and in_pcbdisconnect. If all the datagrams sent on the socket go to the same destination IP address, then the first time through in_pcbconnect the route is allocated and it can be used from that point on. But since a UDP application can send datagrams to a different IP address with each call to sendto, the destination address must be compared to the saved route and the route released when the destination changes. This same test is done in ip_output, which seems to be redundant.

The SO_DONTROUTE socket option tells the kernel to bypass the normal routing decisions and send the IP datagram to the locally attached interface whose IP network address matches the network portion of the destination address.

Acquire route

176-185

If the SO_DONTROUTE socket option is not set, and a route to the destination is not held by the PCB, try to acquire one by calling rtalloc.

Determine outgoing interface

186-205

The goal in this section of code is to have ia point to an interface address structure (in_ifaddr, Section 6.5), which contains the IP address of the interface. If the PCB holds a route that is still valid, or if rtalloc found a route, and the route is not to the loopback interface, the corresponding interface is used. Otherwise ifa_ifwithdstaddr and ifa_ifwithnet are called to check if the foreign IP address is on the other end of a point-to-point link or on an attached network. Both of these functions require that the port number in the socket address structure be 0, so it is saved in fport across the calls. If this fails, the primary IP address is used (in_ifaddr), and if no interfaces are configured (in_ifaddr is zero), an error is returned.
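The order of preference in this section can be sketched as a chain of fallbacks. In the sketch below the results of the route lookup and of the two interface searches are passed in directly instead of being computed; struct ifaddr_s, struct candidates, and choose_source are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for an interface address structure. */
struct ifaddr_s {
    const char *name;
    int         is_loopback;  /* nonzero if the interface is the loopback */
};

/* Candidate answers, in the order the listing consults them. */
struct candidates {
    const struct ifaddr_s *route_ifa;  /* from the held or new route */
    const struct ifaddr_s *p2p_peer;   /* ifa_ifwithdstaddr result */
    const struct ifaddr_s *on_net;     /* ifa_ifwithnet result */
    const struct ifaddr_s *primary;    /* first configured address */
};

/* Returns the chosen interface address, or NULL for the case in which
 * the listing returns EADDRNOTAVAIL. */
const struct ifaddr_s *choose_source(const struct candidates *c)
{
    const struct ifaddr_s *ia = NULL;

    assert(c != NULL);

    if (c->route_ifa != NULL && !c->route_ifa->is_loopback)
        ia = c->route_ifa;    /* route found and not via loopback */
    if (ia == NULL)
        ia = c->p2p_peer;     /* other end of a point-to-point link */
    if (ia == NULL)
        ia = c->on_net;       /* directly attached network */
    if (ia == NULL)
        ia = c->primary;      /* fall back to the primary address */
    return ia;
}
```

A route through the loopback interface is deliberately passed over, so the later candidates get a chance to supply the source address.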

Figure 22.27 shows the next section of in_pcbconnect, which handles a destination address that is a multicast address.

Table 22.27. in_pcbconnect function: destination address is a multicast address.

------------------------------------------------------------------------- in_pcb.c
206         /*
207          * If the destination address is multicast and an outgoing
208          * interface has been set as a multicast option, use the
209          * address of that interface as our source address.
210          */
211         if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr)) &&
212             inp->inp_moptions != NULL) {
213             struct ip_moptions *imo;
214             struct ifnet *ifp;

215             imo = inp->inp_moptions;
216             if (imo->imo_multicast_ifp != NULL) {
217                 ifp = imo->imo_multicast_ifp;
218                 for (ia = in_ifaddr; ia; ia = ia->ia_next)
219                     if (ia->ia_ifp == ifp)
220                         break;
221                 if (ia == 0)
222                     return (EADDRNOTAVAIL);
223             }
224         }
225         ifaddr = (struct sockaddr_in *) &ia->ia_addr;
226     }
------------------------------------------------------------------------- in_pcb.c

206-223

If the destination address is a multicast address and the process has specified the outgoing interface to use for multicast packets (using the IP_MULTICAST_IF socket option), then the IP address of that interface is used as the local address. A search is made of all IP interfaces for the one matching the interface that was specified with the socket option. An error is returned if that interface is no longer up.

224-225

The code that started at the beginning of Figure 22.26 to handle the case of a wildcard local address is complete. The pointer to the sockaddr_in structure for the local interface ia is saved in ifaddr.
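
The multicast test on line 211 depends on the class D address range. The following is a minimal sketch of that test, assuming the usual definition of class D (high-order four bits 1110, that is, 224.0.0.0 through 239.255.255.255). The names is_class_d and ipv4 are hypothetical helpers for illustration, not the kernel's IN_MULTICAST macro.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not the kernel macro): returns 1 if the IPv4
 * address, given in host byte order, is class D, i.e. its high-order
 * four bits are 1110. */
int is_class_d(uint32_t host_order_addr)
{
    return (host_order_addr & 0xf0000000u) == 0xe0000000u;
}

/* Hypothetical helper: build a host-order address from a dotted quad. */
uint32_t ipv4(int a, int b, int c, int d)
{
    assert(a >= 0 && a <= 255 && b >= 0 && b <= 255);
    assert(c >= 0 && c <= 255 && d >= 0 && d <= 255);
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}
```

The kernel code applies ntohl first because sin_addr is stored in network byte order, while the comparison is done on a host-order value.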

The final section of in_pcbconnect is shown in Figure 22.28.

Table 22.28. in_pcbconnect function: verify that socket pair is unique.

------------------------------------------------------------------------- in_pcb.c
227     if (in_pcblookup(inp->inp_head,
228                      sin->sin_addr,
229                      sin->sin_port,
230                 inp->inp_laddr.s_addr ? inp->inp_laddr : ifaddr->sin_addr,
231                      inp->inp_lport,
232                      0))
233         return (EADDRINUSE);

234     if (inp->inp_laddr.s_addr == INADDR_ANY) {
235         if (inp->inp_lport == 0)
236             (void) in_pcbbind(inp, (struct mbuf *) 0);
237         inp->inp_laddr = ifaddr->sin_addr;
238     }
239     inp->inp_faddr = sin->sin_addr;
240     inp->inp_fport = sin->sin_port;
241     return (0);
242 }
------------------------------------------------------------------------- in_pcb.c

Verify that socket pair is unique

227-233

in_pcblookup verifies that the socket pair is unique. The foreign address and foreign port are the values specified as arguments to in_pcbconnect. The local address is either the value that was already bound to the socket or the value in ifaddr that was calculated in the code we just described. The local port can be 0, which is typical for a TCP client, and we’ll see that later in this section of code an ephemeral port is chosen for the local port.

This test prevents two TCP connections to the same foreign address and foreign port from the same local address and local port. For example, if we establish a TCP connection with the echo server on the host sun and then try to establish another connection to the same server from the same local port (8888, specified with the -b option), the call to in_pcblookup returns a match, causing connect to return the error EADDRINUSE. (We use the sock program from Appendix C of Volume 1.)

bsdi $ sock -b 8888 sun echo &        start first one in the background
bsdi $ sock -A -b 8888 sun echo       then try again
connect () error: Address already in use

We specify the -A option to set the SO_REUSEADDR socket option, which lets the bind succeed, but the connect cannot succeed. This is a contrived example, as we explicitly bound the same local port (8888) to both sockets. In the normal scenario of two different clients from the host bsdi to the echo server on the host sun, the local port will be 0 when the second client calls in_pcblookup from Figure 22.28.

This test also prevents two UDP sockets from being connected to the same foreign address from the same local port. This test does not prevent two UDP sockets from alternately sending datagrams to the same foreign address from the same local port, as long as neither calls connect, since a UDP socket is only temporarily connected to a peer for the duration of a sendto system call.
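
The uniqueness test can be modeled outside the kernel. The sketch below is a simplified stand-in, not the real in_pcblookup: it assumes an array instead of the linked list of PCBs and performs only an exact match on all four elements, which is roughly what the final argument of 0 (no wildcard matching) asks for. The structure mini_pcb and the function pair_in_use are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced PCB: only the socket pair. */
struct mini_pcb {
    uint32_t laddr;   /* local address, 0 = wildcard */
    uint16_t lport;   /* local port, 0 = not yet bound */
    uint32_t faddr;   /* foreign address, 0 = not connected */
    uint16_t fport;   /* foreign port */
};

/* Return 1 if some PCB already holds exactly this socket pair.
 * Exact match only: an unconnected PCB (faddr 0) never conflicts
 * with a fully specified pair. */
int pair_in_use(const struct mini_pcb *tab, size_t n,
                uint32_t faddr, uint16_t fport,
                uint32_t laddr, uint16_t lport)
{
    size_t i;

    assert(tab != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (tab[i].faddr == faddr && tab[i].fport == fport &&
            tab[i].laddr == laddr && tab[i].lport == lport)
            return 1;
    return 0;
}
```

In this model a second connect from local port 8888 to the same foreign address and port finds the first PCB and would fail with EADDRINUSE, while an unconnected UDP PCB on the same local port does not count as a conflict.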

Implicit bind and assignment of ephemeral port

234-238

If the local address is still wildcarded for the socket, it is set to the value saved in ifaddr. This is an implicit bind: cases 3, 4, and 5 from the beginning of Section 22.7. First a check is made as to whether the local port has been bound yet, and if not, in_pcbbind binds an ephemeral port to the socket. The order of the call to in_pcbbind and the assignment to inp_laddr is important, since in_pcbbind fails if the local address is not the wildcard address.
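
To see why the order matters, here is a minimal model of lines 234-240. It is a sketch under stated assumptions: mini_bind_ephemeral stands in for in_pcbbind called with a null mbuf pointer and simply refuses to run unless both the local port and the local address are still unset; the port is supplied by the caller rather than chosen by the kernel. All names are hypothetical, and the addr_first flag exists only to show the wrong ordering.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct mini_pcb {
    uint32_t laddr;   /* 0 = wildcard */
    uint16_t lport;   /* 0 = not yet bound */
    uint32_t faddr;
    uint16_t fport;
};

/* Stand-in for the implicit bind: fails if the PCB is already bound
 * to a port or to a nonwildcard local address. */
int mini_bind_ephemeral(struct mini_pcb *inp, uint16_t port)
{
    assert(inp != NULL && port != 0);
    if (inp->lport != 0 || inp->laddr != 0)
        return EINVAL;
    inp->lport = port;
    return 0;
}

/* Model of the end of in_pcbconnect. With addr_first == 0 the steps
 * run in the order of the figure; with addr_first != 0 the local
 * address is assigned too early and the implicit bind fails. */
int finish_connect(struct mini_pcb *inp, uint32_t ifaddr,
                   uint32_t faddr, uint16_t fport,
                   uint16_t port, int addr_first)
{
    int error = 0;

    assert(inp != NULL);
    if (inp->laddr == 0) {
        if (addr_first)
            inp->laddr = ifaddr;          /* wrong order */
        if (inp->lport == 0)
            error = mini_bind_ephemeral(inp, port);
        inp->laddr = ifaddr;              /* order used by the figure */
    }
    inp->faddr = faddr;
    inp->fport = fport;
    return error;
}
```

With the correct order the PCB ends up with all four elements filled in. With the address assigned first, the stand-in bind returns EINVAL and the local port stays 0.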

Store foreign address and foreign port in PCB

239-240

The final step of this function sets the foreign IP address and foreign port number in the PCB. We are guaranteed, on successful return from this function, that both socket pairs in the PCB (the local and foreign) are filled in with specific values.

IP Source Address Versus Outgoing Interface Address

There is a subtle difference between the source address in the IP datagram and the IP address of the interface used to send the datagram.

The PCB member inp_laddr is used by TCP and UDP as the source address of the IP datagram. It can be set by the process to the IP address of any configured interface by bind. (The call to ifa_ifwithaddr in in_pcbbind verifies the local address desired by the application.) in_pcbconnect assigns the local address only if it is a wildcard, and when this happens the local address is based on the outgoing interface (since the destination address is known).

The outgoing interface, however, is also determined by ip_output based on the destination IP address. On a multihomed host it is possible for the source address to be a local interface that is not the outgoing interface, when the process explicitly binds a local address that differs from the outgoing interface. This is allowed because Net/3 chooses the weak end system model (Section 8.4).

in_pcbdisconnect Function

A UDP socket is disconnected by in_pcbdisconnect. This removes the foreign association by setting the foreign IP address to all 0s (INADDR_ANY) and foreign port number to 0.

This is done after a datagram has been sent on an unconnected UDP socket and when connect is called on a connected UDP socket. In the first case the sequence of steps when the process calls sendto is: UDP calls in_pcbconnect to connect the socket temporarily to the destination, udp_output sends the datagram, and then in_pcbdisconnect removes the temporary connection.

in_pcbdisconnect is not called when a socket is closed since in_pcbdetach handles the release of the PCB. A disconnect is required only when the PCB needs to be reused for a different foreign address or port number.

Figure 22.29 shows the function in_pcbdisconnect.

Table 22.29. in_pcbdisconnect function: disconnect from foreign address and port number.

------------------------------------------------------------------------- in_pcb.c
243 int
244 in_pcbdisconnect(inp)
245 struct inpcb *inp;
246 {

247     inp->inp_faddr.s_addr = INADDR_ANY;
248     inp->inp_fport = 0;
249     if (inp->inp_socket->so_state & SS_NOFDREF)
250         in_pcbdetach(inp);
251 }
------------------------------------------------------------------------- in_pcb.c

If there is no longer a file table reference for this PCB (SS_NOFDREF is set) then in_pcbdetach (Figure 22.7) releases the PCB.

in_setsockaddr and in_setpeeraddr Functions

The getsockname system call returns the local protocol address of a socket (e.g., the IP address and port number for an Internet socket) and the getpeername system call returns the foreign protocol address. These system calls end up issuing a PRU_SOCKADDR request and a PRU_PEERADDR request, respectively. The protocol then calls either in_setsockaddr or in_setpeeraddr. We show the first of these in Figure 22.30.

Table 22.30. in_setsockaddr function: return local address and port number.

------------------------------------------------------------------------- in_pcb.c
267 int
268 in_setsockaddr(inp, nam)
269 struct inpcb *inp;
270 struct mbuf *nam;
271 {
272     struct sockaddr_in *sin;

273     nam->m_len = sizeof(*sin);
274     sin = mtod(nam, struct sockaddr_in *);
275     bzero((caddr_t) sin, sizeof(*sin));
276     sin->sin_family = AF_INET;
277     sin->sin_len = sizeof(*sin);
278     sin->sin_port = inp->inp_lport;
279     sin->sin_addr = inp->inp_laddr;
280 }
------------------------------------------------------------------------- in_pcb.c

The argument nam is a pointer to an mbuf that will hold the result: a sockaddr_in structure that the system call copies back to the process. The code fills in the socket address structure and copies the IP address and port number from the Internet PCB into the sin_addr and sin_port members.

Figure 22.31 shows the in_setpeeraddr function. It is nearly identical to Figure 22.30, but copies the foreign IP address and port number from the PCB.

Table 22.31. in_setpeeraddr function: return foreign address and port number.

------------------------------------------------------------------------- in_pcb.c
281 int
282 in_setpeeraddr(inp, nam)
283 struct inpcb *inp;
284 struct mbuf *nam;
285 {
286     struct sockaddr_in *sin;

287     nam->m_len = sizeof(*sin);
288     sin = mtod(nam, struct sockaddr_in *);
289     bzero((caddr_t) sin, sizeof(*sin));
290     sin->sin_family = AF_INET;
291     sin->sin_len = sizeof(*sin);
292     sin->sin_port = inp->inp_fport;
293     sin->sin_addr = inp->inp_faddr;
294 }
------------------------------------------------------------------------- in_pcb.c
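
From the process's side, the result of in_setsockaddr is what getsockname returns. The following user-level sketch assumes a POSIX sockets environment: it binds a socket to the loopback address with port 0 and then asks for the local address to learn which ephemeral port was assigned. The function name local_port_after_bind is hypothetical, and the function returns -1 if any call fails.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Bind a socket of the given type to 127.0.0.1, port 0, and return
 * the local port reported by getsockname (host byte order), or -1. */
int local_port_after_bind(int type)
{
    struct sockaddr_in sin, out;
    socklen_t len = sizeof(out);
    int fd;

    assert(type == SOCK_DGRAM || type == SOCK_STREAM);
    if ((fd = socket(AF_INET, type, 0)) < 0)
        return -1;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = 0;                    /* let the kernel choose */
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        close(fd);
        return -1;
    }

    memset(&out, 0, sizeof(out));
    if (getsockname(fd, (struct sockaddr *)&out, &len) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return ntohs(out.sin_port);
}
```

Calling getpeername in the same way on a connected socket would return the foreign address and port instead.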

in_pcbnotify, in_rtchange, and in_losing Functions

The function in_pcbnotify is called when an ICMP error is received, in order to notify the appropriate process of the error. The “appropriate process” is found by searching all the PCBs for one of the protocols (TCP or UDP) and comparing the local and foreign IP addresses and port numbers with the values returned in the ICMP error. For example, when an ICMP source quench error is received in response to a TCP segment that some router discarded, TCP must locate the PCB for the connection that caused the error and slow down the transmission on that connection.

Before showing the function we must review how it is called. Figure 22.32 summarizes the functions called to process an ICMP error. The two shaded ellipses are the functions described in this section.

Figure 22.32. Summary of processing of ICMP errors.

When an ICMP message is received, icmp_input is called. Five of the ICMP messages are classified as errors (Figures 11.1 and 11.2):

  • destination unreachable,

  • parameter problem,

  • redirect,

  • source quench, and

  • time exceeded.

Redirects are handled differently from the other four errors. All other ICMP messages (the queries) are handled as described in Chapter 11.

Each protocol defines its control input function, the pr_ctlinput entry in the protosw structure (Section 7.4). The ones for TCP and UDP are named tcp_ctlinput and udp_ctlinput, and we’ll show their code in later chapters. Since the ICMP error that is received contains the IP header of the datagram that caused the error, the protocol that caused the error (TCP or UDP) is known. Four of the five ICMP errors cause that protocol’s control input function to be called. Redirects are handled differently: the function pfctlinput is called, and it in turn calls the control input functions for all the protocols in the family (Internet). TCP and UDP are the only protocols in the Internet family with control input functions.

Redirects are handled specially because they affect all IP datagrams going to that destination, not just the one that caused the redirect. On the other hand, the other four errors need only be processed by the protocol that caused the error.

The final points we need to make about Figure 22.32 are that TCP handles source quenches differently from the other errors, and redirects are handled specially by in_pcbnotify: the function in_rtchange is called, regardless of the protocol that caused the error.

Figure 22.33 shows the in_pcbnotify function. When it is called by TCP, the first argument is the address of tcb and the final argument is the address of the function tcp_notify. For UDP, these two arguments are the address of udb and the address of the function udp_notify.

Table 22.33. in_pcbnotify function: pass error notification to processes.

------------------------------------------------------------------------- in_pcb.c
306 int
307 in_pcbnotify(head, dst, fport_arg, laddr, lport_arg, cmd, notify)
308 struct inpcb *head;
309 struct sockaddr *dst;
310 u_int   fport_arg, lport_arg;
311 struct in_addr laddr;
312 int     cmd;
313 void    (*notify) (struct inpcb *, int);
314 {
315     extern u_char inetctlerrmap[];
316     struct inpcb *inp, *oinp;
317     struct in_addr faddr;
318     u_short fport = fport_arg, lport = lport_arg;
319     int     errno;

320     if ((unsigned) cmd > PRC_NCMDS || dst->sa_family != AF_INET)
321         return;
322     faddr = ((struct sockaddr_in *) dst)->sin_addr;
323     if (faddr.s_addr == INADDR_ANY)
324         return;

325     /*
326      * Redirects go to all references to the destination,
327      * and use in_rtchange to invalidate the route cache.
328      * Dead host indications: notify all references to the destination.
329      * Otherwise, if we have knowledge of the local port and address,
330      * deliver only to that socket.
331      */
332     if (PRC_IS_REDIRECT(cmd) || cmd == PRC_HOSTDEAD) {
333         fport = 0;
334         lport = 0;
335         laddr.s_addr = 0;
336         if (cmd != PRC_HOSTDEAD)
337             notify = in_rtchange;
338     }
339     errno = inetctlerrmap[cmd];
340     for (inp = head->inp_next; inp != head;) {
341         if (inp->inp_faddr.s_addr != faddr.s_addr ||
342             inp->inp_socket == 0 ||
343             (lport && inp->inp_lport != lport) ||
344             (laddr.s_addr && inp->inp_laddr.s_addr != laddr.s_addr) ||
345             (fport && inp->inp_fport != fport)) {
346             inp = inp->inp_next;
347             continue;           /* skip this PCB */
348         }
349         oinp = inp;
350         inp = inp->inp_next;
351         if (notify)
352             (*notify) (oinp, errno);
353     }
354 }
------------------------------------------------------------------------- in_pcb.c

Verify arguments

306-324

The cmd argument and the address family of the destination are verified. The foreign address is checked to ensure it is not 0.0.0.0.

Handle redirects specially

325-338

If the error is a redirect it is handled specially. (The error PRC_HOSTDEAD is an old error that was generated by the IMPs. Current systems should never see this error; it is a historical artifact.) The foreign port, local port, and local address are all set to 0 so that the for loop that follows won’t compare them. For a redirect we want that loop to select the PCBs to receive notification based only on the foreign IP address, because that is the IP address for which our host received a redirect. Also, the function that is called for a redirect is in_rtchange (Figure 22.34) instead of the notify argument specified by the caller.

Table 22.34. in_rtchange function: invalidate route.

------------------------------------------------------------------------- in_pcb.c
391 void
392 in_rtchange(inp, errno)
393 struct inpcb *inp;
394 int     errno;
395 {
396     if (inp->inp_route.ro_rt) {
397         rtfree(inp->inp_route.ro_rt);
398         inp->inp_route.ro_rt = 0;
399         /*
400          * A new route can be allocated the next time
401          * output is attempted.
402          */
403     }
404 }
------------------------------------------------------------------------- in_pcb.c

339

The global array inetctlerrmap maps one of the protocol-independent error codes (the PRC_xxx values from Figure 11.19) into its corresponding Unix errno value (the final column in Figure 11.1).

Call notify function for selected PCBs

340-353

This loop selects the PCBs to be notified. Multiple PCBs can be notified: the loop keeps going even after a match is located. The first if statement combines five tests, and if any one of the five is true, the PCB is skipped: (1) if the foreign addresses are unequal, (2) if the PCB does not have a corresponding socket structure, (3) if the local ports are unequal, (4) if the local addresses are unequal, or (5) if the foreign ports are unequal. The foreign addresses must match, while the other three foreign and local elements are compared only if the corresponding argument is nonzero. When a match is found, the notify function is called.
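
The selection logic can be exercised in isolation. This sketch models the five tests over an array of reduced PCBs, assuming hypothetical names (mini_pcb, mini_notify) and a notified flag in place of the call through the notify function pointer. It returns the number of PCBs selected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mini_pcb {
    uint32_t laddr;
    uint16_t lport;
    uint32_t faddr;      /* 0 = not connected */
    uint16_t fport;
    int      has_socket; /* stands in for inp_socket != 0 */
    int      notified;   /* set instead of calling (*notify)() */
};

/* Mark every PCB that matches. The foreign address must match;
 * fport, laddr, and lport are compared only when nonzero. */
int mini_notify(struct mini_pcb *tab, size_t n,
                uint32_t faddr, uint16_t fport,
                uint32_t laddr, uint16_t lport)
{
    size_t i;
    int count = 0;

    assert(tab != NULL || n == 0);
    if (faddr == 0)                       /* same check as line 323 */
        return 0;
    for (i = 0; i < n; i++) {
        struct mini_pcb *inp = &tab[i];

        if (inp->faddr != faddr ||
            !inp->has_socket ||
            (lport && inp->lport != lport) ||
            (laddr && inp->laddr != laddr) ||
            (fport && inp->fport != fport))
            continue;                     /* skip this PCB */
        inp->notified = 1;
        count++;
    }
    return count;
}
```

Passing zeros for fport, laddr, and lport reproduces the redirect case, in which every PCB connected to the foreign address is selected. A PCB whose foreign address is 0 can never be selected, whatever the other arguments are.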

in_rtchange Function

We saw that in_pcbnotify calls the function in_rtchange when the ICMP error is a redirect. This function is called for all PCBs with a foreign address that matches the IP address that has been redirected. Figure 22.34 shows the in_rtchange function.

If the PCB holds a route, that route is released by rtfree, and the PCB member is marked as empty. We don’t try to update the route at this time, using the new router address returned in the redirect. The new route will be allocated by ip_output when this PCB is used next, based on the kernel’s routing table, which is updated by the redirect, before pfctlinput is called.

Redirects and Raw Sockets

Let’s examine the interaction of redirects, raw sockets, and the cached route in the PCB. If we run the Ping program, which uses a raw socket, and an ICMP redirect error is received for the IP address being pinged, Ping continues using the original route, not the redirected route. We can see this as follows.

We ping the host svr4 on the 140.252.13 network from the host gemini on the 140.252.1 network. The default router for gemini is gateway, but the packets should be sent to the router netb instead. Figure 22.35 shows the arrangement.

Figure 22.35. Example of ICMP redirect.

We expect gateway to send a redirect when it receives the first ICMP echo request.

gemini $ ping -sv svr4
PING 140.252.13.34: 56 data bytes
ICMP Host redirect from gateway 140.252.1.4
  to netb (140.252.1.183) for svr4 (140.252.13.34)
64 bytes from svr4 (140.252.13.34): icmp_seq=0. time=572. ms

ICMP Host redirect from gateway 140.252.1.4
  to netb (140.252.1.183) for svr4 (140.252.13.34)
64 bytes from svr4 (140.252.13.34): icmp_seq=1. time=392. ms

The -s option causes an ICMP echo request to be sent once a second, and the -v option prints every received ICMP message (instead of only the ICMP echo replies).

Every ICMP echo request elicits a redirect, but the raw socket used by ping never notices the redirect to change the route that it is using. The route that is first calculated and stored in the PCB, causing the IP datagrams to be sent to the router gateway (140.252.1.4), should be updated so that the datagrams are sent to the router netb (140.252.1.183) instead. We see that the ICMP redirects are received by the kernel on gemini, but they appear to be ignored.

If we terminate the program and start it again, we never see a redirect:

gemini $ ping -sv svr4
PING 140.252.13.34: 56 data bytes
64 bytes from svr4 (140.252.13.34): icmp_seq=0. time=388. ms
64 bytes from svr4 (140.252.13.34): icmp_seq=1. time=363. ms

The reason for this anomaly is that the raw IP socket code (Chapter 32) does not have a control input function. Only TCP and UDP have a control input function. When the redirect error is received, ICMP updates the kernel’s routing table accordingly, and pfctlinput is called (Figure 22.32). But since there is no control input function for the raw IP protocol, the cached route in the PCB associated with Ping’s raw socket is never released. When we start the Ping program a second time, however, the route that is allocated is based on the kernel’s updated routing table, and we never see the redirects.

ICMP Errors and UDP Sockets

One confusing part of the sockets API is that ICMP errors received on a UDP socket are not passed to the application unless the application has issued a connect on the socket, restricting the foreign IP address and port number for the socket. We now see where this limitation is enforced by in_pcbnotify.

Consider an ICMP port unreachable, probably the most common ICMP error on a UDP socket. The foreign IP address and the foreign port number in the dst argument to in_pcbnotify are the IP address and port number that caused the ICMP error. But if the process has not issued a connect on the socket, the inp_faddr and inp_fport members of the PCB are both 0, preventing in_pcbnotify from ever calling the notify function for this socket. The for loop in Figure 22.33 will skip every UDP PCB.

This limitation arises for two reasons. First, if the sending process has an unconnected UDP socket, the only nonzero element in the socket pair is the local port. (This assumes the process did not call bind.) This is the only value available to in_pcbnotify to demultiplex the incoming ICMP error and pass it to the correct process. Although unlikely, there could be multiple processes bound to the same local port, making it ambiguous which process should receive the error. There’s also the possibility that the process that sent the datagram that caused the ICMP error has terminated, with another process then starting and using the same local port. This is also unlikely since ephemeral ports are assigned in sequential order from 1024 to 5000 and reused only after cycling around (Figure 22.23).

The second reason for this limitation is that the error notification from the kernel to the process (an errno value) is inadequate. Consider a process that calls sendto on an unconnected UDP socket three times in a row, sending a UDP datagram to three different destinations, and then waits for the replies with recvfrom. If one of the datagrams generates an ICMP port unreachable error, and if the kernel were to return the corresponding error (ECONNREFUSED) to the recvfrom that the process issued, the errno value doesn’t tell the process which of the three datagrams caused the error. The kernel has all the information required in the ICMP error, but the sockets API doesn’t provide a way to return this to the process.

Therefore the design decision was made that if a process wants to be notified of these ICMP errors on a UDP socket, that socket must be connected to a single peer. If the error ECONNREFUSED is returned on that connected socket, there’s no question which peer generated the error.

There is still a remote possibility of an ICMP error being delivered to the wrong process. One process sends the UDP datagram that elicits the ICMP error, but it terminates before the error is received. Another process then starts up before the error is received, binds the same local port, and connects to the same foreign address and foreign port, causing this new process to receive the error. There’s no way to prevent this from occurring, given UDP’s lack of memory. We’ll see that TCP handles this with its TIME_WAIT state.

In our preceding example, one way for the application to get around this limitation is to use three connected UDP sockets instead of one unconnected socket, and call select to determine when any one of the three has a received datagram or an error to be read.
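
A user-level sketch of this workaround follows, assuming a POSIX sockets environment. The helper names udp_connected and wait_any are hypothetical. The first creates a UDP socket connected to a single peer; the second uses select to find the first socket that is readable. A pending error on a connected socket also makes it readable, and the following read then fails with errno set, so the process knows which peer the error belongs to.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Create a UDP socket connected to one peer; returns -1 on failure. */
int udp_connected(const char *ip, unsigned short port)
{
    struct sockaddr_in peer;
    int fd;

    assert(ip != NULL);
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &peer.sin_addr) != 1)
        return -1;
    if ((fd = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Wait up to timeout_ms for any of the n sockets to become readable.
 * Returns the index of the first ready socket, or -1 if none. */
int wait_any(const int *fds, int n, int timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    int i, maxfd = -1;

    assert(fds != NULL && n >= 0 && timeout_ms >= 0);
    FD_ZERO(&rset);
    for (i = 0; i < n; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            continue;                     /* ignore unusable entries */
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    if (maxfd < 0)
        return -1;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (select(maxfd + 1, &rset, NULL, NULL, &tv) <= 0)
        return -1;
    for (i = 0; i < n; i++)
        if (fds[i] >= 0 && fds[i] < FD_SETSIZE && FD_ISSET(fds[i], &rset))
            return i;
    return -1;
}
```

A process would call udp_connected once per destination, send one datagram on each socket, and then loop on wait_any, reading from whichever socket it returns.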

Here we have a scenario where the kernel has the information but the API (sockets) is inadequate. With most implementations of Unix System V and the other popular API (TLI), the reverse is true: the TLI function t_rcvuderr can return the peer’s IP address, port number, and an error value, but most SVR4 streams implementations of TCP/IP don’t provide a way for ICMP to pass the error to an unconnected UDP end point.

In an ideal world, in_pcbnotify would deliver the ICMP error to all UDP sockets that match, even if the only nonwildcard match is the local port. The error returned to the process would include the destination IP address and destination UDP port that caused the error, allowing the process to determine if the error corresponds to a datagram sent by the process.

in_losing Function

The final function dealing with PCBs is in_losing, shown in Figure 22.36. It is called by TCP when its retransmission timer has expired four or more times in a row for a given connection (Figure 25.26).

Table 22.36. in_losing function: invalidate cached route information.

------------------------------------------------------------------------- in_pcb.c
361 int
362 in_losing(inp)
363 struct inpcb *inp;
364 {
365     struct rtentry *rt;
366     struct rt_addrinfo info;

367     if ((rt = inp->inp_route.ro_rt)) {
368         inp->inp_route.ro_rt = 0;
369         bzero((caddr_t) & info, sizeof(info));
370         info.rti_info[RTAX_DST] =
371             (struct sockaddr *) &inp->inp_route.ro_dst;
372         info.rti_info[RTAX_GATEWAY] = rt->rt_gateway;
373         info.rti_info[RTAX_NETMASK] = rt_mask(rt);
374         rt_missmsg(RTM_LOSING, &info, rt->rt_flags, 0);

375         if (rt->rt_flags & RTF_DYNAMIC)
376             (void) rtrequest(RTM_DELETE, rt_key(rt),
377                              rt->rt_gateway, rt_mask(rt), rt->rt_flags,
378                              (struct rtentry **) 0);
379         else
380             /*
381              * A new route can be allocated
382              * the next time output is attempted.
383              */
384             rtfree(rt);
385     }
386 }
------------------------------------------------------------------------- in_pcb.c

Generate routing message

361-374

If the PCB holds a route, that route is discarded. An rt_addrinfo structure is filled in with information about the cached route that appears to be failing. The function rt_missmsg is then called to generate a message from the routing socket of type RTM_LOSING, indicating a problem with the route.

Delete or release route

375-384

If the cached route was generated by a redirect (RTF_DYNAMIC is set), the route is deleted by calling rtrequest with a request of RTM_DELETE. Otherwise the cached route is released, causing the next output on the socket to allocate another route to the destination, hopefully a better route.

Implementation Refinements

Undoubtedly the most time-consuming algorithm we’ve encountered in this chapter is the linear searching of the PCBs done by in_pcblookup. At the beginning of Section 22.6 we noted four instances when this function is called. We can ignore the calls to bind and connect, as they occur much less frequently than the calls to in_pcblookup from TCP and UDP, to demultiplex every received IP datagram.

In later chapters we’ll see that TCP and UDP both try to help this linear search by maintaining a pointer to the last PCB that the protocol referenced: a one-entry cache. If the local address, local port, foreign address, and foreign port in the cached PCB match the values in the received datagram, the protocol doesn’t even call in_pcblookup. If the protocol’s data fits the packet train model [Jain and Routhier 1986], this simple cache works well. But if the data does not fit this model and, for example, looks like data entry into an on-line transaction processing system, the one-entry cache performs poorly [McKenney and Dove 1992].

One proposal for a better PCB arrangement is to move a PCB to the front of the PCB list when the PCB is referenced. ([McKenney and Dove 1992] attribute this idea to Jon Crowcroft; [Partridge and Pink 1993] attribute it to Gary Delp.) This movement of the PCB is easy to do since it is a doubly linked list and a pointer to the head of the list is the first argument to in_pcblookup.
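
The move-to-the-front idea can be sketched on a circular doubly linked list with a head entry, the same shape as the PCB list. This is an illustrative model with hypothetical names, not code from Net/3 or from the cited papers, and it matches only fully specified socket pairs (no wildcard handling).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mini_pcb {
    struct mini_pcb *next, *prev;   /* circular list through the head */
    uint32_t faddr, laddr;
    uint16_t fport, lport;
};

void list_init(struct mini_pcb *head)
{
    head->next = head->prev = head;
}

void list_insert_head(struct mini_pcb *head, struct mini_pcb *inp)
{
    inp->next = head->next;
    inp->prev = head;
    head->next->prev = inp;
    head->next = inp;
}

static void list_remove(struct mini_pcb *inp)
{
    inp->prev->next = inp->next;
    inp->next->prev = inp->prev;
}

/* Linear search; a PCB that matches is moved to the front so that
 * the next datagram for the same connection is found at once. */
struct mini_pcb *lookup_mtf(struct mini_pcb *head,
                            uint32_t faddr, uint16_t fport,
                            uint32_t laddr, uint16_t lport)
{
    struct mini_pcb *inp;

    assert(head != NULL);
    for (inp = head->next; inp != head; inp = inp->next) {
        if (inp->faddr == faddr && inp->fport == fport &&
            inp->laddr == laddr && inp->lport == lport) {
            if (head->next != inp) {
                list_remove(inp);
                list_insert_head(head, inp);
            }
            return inp;
        }
    }
    return NULL;
}
```

Because the list is doubly linked, unlinking and reinserting the PCB takes a fixed number of pointer assignments; no second traversal is needed.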

[McKenney and Dove 1992] compare the original Net/1 implementation (no cache), an enhanced one-entry send-receive cache, the move-to-the-front heuristic, and their own algorithm that uses hash chains. They show that maintaining a linear list of PCBs on hash chains provides an order of magnitude improvement over the other algorithms. The only cost for the hash chains is the memory required for the hash chain headers and the computation of the hash function. They also consider adding the move-to-the-front heuristic to their hash-chain algorithm and conclude that it is easier simply to add more hash chains.
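
To make the hash-chain idea concrete, here is a small sketch. It is not the algorithm from [McKenney and Dove 1992]: the hash function, the number of chains (NCHAINS), and all names are assumptions chosen only for illustration, and the model handles only fully specified socket pairs, leaving wildcard (listening) PCBs aside.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCHAINS 16   /* assumed size; more chains give shorter lists */

struct mini_pcb {
    struct mini_pcb *hnext;         /* next PCB on the same chain */
    uint32_t faddr, laddr;
    uint16_t fport, lport;
};

struct pcb_table {
    struct mini_pcb *chain[NCHAINS];
};

/* Illustrative hash of the socket pair; any well-mixed function of
 * the four values would serve. */
unsigned pcb_hash(uint32_t faddr, uint16_t fport,
                  uint32_t laddr, uint16_t lport)
{
    uint32_t h = faddr ^ laddr ^ ((uint32_t)fport << 16) ^ lport;

    h ^= h >> 16;
    h ^= h >> 8;
    return h % NCHAINS;
}

void table_insert(struct pcb_table *t, struct mini_pcb *inp)
{
    unsigned i;

    assert(t != NULL && inp != NULL);
    i = pcb_hash(inp->faddr, inp->fport, inp->laddr, inp->lport);
    inp->hnext = t->chain[i];
    t->chain[i] = inp;
}

/* Only the PCBs on one chain are examined, not the whole list. */
struct mini_pcb *table_lookup(const struct pcb_table *t,
                              uint32_t faddr, uint16_t fport,
                              uint32_t laddr, uint16_t lport)
{
    struct mini_pcb *inp;

    assert(t != NULL);
    for (inp = t->chain[pcb_hash(faddr, fport, laddr, lport)];
         inp != NULL; inp = inp->hnext)
        if (inp->faddr == faddr && inp->fport == fport &&
            inp->laddr == laddr && inp->lport == lport)
            return inp;
    return NULL;
}
```

The memory cost mentioned above corresponds to the chain array, and the computational cost to one call of pcb_hash per lookup.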

Another comparison of the BSD linear search to a hash table search is in [Hutchinson and Peterson 1991]. They show that the time required to demultiplex an incoming UDP datagram is constant as the number of sockets increases for a hash table, but with a linear search the time increases as the number of sockets increases.

Summary

An Internet PCB is associated with every Internet socket: TCP, UDP, and raw IP. It contains information common to all Internet sockets: local and foreign IP addresses, pointer to a route structure, and so on. All the PCBs for a given protocol are placed on a doubly linked list maintained by that protocol.

In this chapter we’ve looked at numerous functions that manipulate the PCBs, and three in detail.

  1. in_pcblookup is called by TCP and UDP to demultiplex every received datagram. It chooses which socket receives the datagram, taking into account wildcard matches.

    This function is also called by in_pcbbind to verify that the local address and local port are unique, and by in_pcbconnect to verify that the combination of a local address, local port, foreign address, and foreign port is unique.

  2. in_pcbbind explicitly or implicitly binds a local address and local port to a socket. An explicit bind occurs when the process calls bind, and an implicit bind occurs when a TCP client calls connect without calling bind, or when a UDP process calls sendto or connect without calling bind.

  3. in_pcbconnect sets the foreign address and foreign port. If the local address has not been set by the process, a route to the foreign address is calculated and the resulting local interface becomes the local address. If the local port has not been set by the process, in_pcbbind chooses an ephemeral port for the socket.

Figure 22.37 summarizes the common scenarios for various TCP and UDP applications and the values stored in the PCB for the local address and port and the foreign address and port. We have not yet covered all the actions shown in Figure 22.37 for TCP and UDP processes, but will examine the code in later chapters.

Table 22.37. Summary of in_pcbbind and in_pcbconnect.

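Application: TCP client

  • connect(foreignIP, fport)

  local address (inp_laddr): in_pcbconnect calls rtalloc to allocate route to foreignIP. Local address is local interface.
  local port (inp_lport): in_pcbconnect calls in_pcbbind to choose ephemeral port.
  foreign address (inp_faddr): foreignIP
  foreign port (inp_fport): fport

Application: TCP client

  • bind(localIP, lport)

  • connect(foreignIP, fport)

  local address (inp_laddr): localIP
  local port (inp_lport): lport
  foreign address (inp_faddr): foreignIP
  foreign port (inp_fport): fport

Application: TCP client

  • bind(*, lport)

  • connect(foreignIP, fport)

  local address (inp_laddr): in_pcbconnect calls rtalloc to allocate route to foreignIP. Local address is local interface.
  local port (inp_lport): lport
  foreign address (inp_faddr): foreignIP
  foreign port (inp_fport): fport

Application: TCP client

  • bind(localIP, 0)

  • connect(foreignIP, fport)

  local address (inp_laddr): localIP
  local port (inp_lport): in_pcbbind chooses ephemeral port.
  foreign address (inp_faddr): foreignIP
  foreign port (inp_fport): fport

Application: TCP server

  • bind(localIP, lport)

  • listen()

  • accept()

  local address (inp_laddr): localIP
  local port (inp_lport): lport
  foreign address (inp_faddr): Source address from IP header.
  foreign port (inp_fport): Source port from TCP header.

Application: TCP server

  • bind(*, lport)

  • listen()

  • accept()

  local address (inp_laddr): Destination address from IP header.
  local port (inp_lport): lport
  foreign address (inp_faddr): Source address from IP header.
  foreign port (inp_fport): Source port from TCP header.

Application: UDP client

  • sendto(foreignIP, fport)

  local address (inp_laddr): in_pcbconnect calls rtalloc to allocate route to foreignIP. Local address is local interface. Reset to 0.0.0.0 after datagram sent.
  local port (inp_lport): in_pcbconnect calls in_pcbbind to choose ephemeral port. Not changed on subsequent calls to sendto.
  foreign address (inp_faddr): foreignIP. Reset to 0.0.0.0 after datagram sent.
  foreign port (inp_fport): fport. Reset to 0 after datagram sent.

Application: UDP client

  • connect(foreignIP, fport)

  • write()

  local address (inp_laddr): in_pcbconnect calls rtalloc to allocate route to foreignIP. Local address is local interface. Not changed on subsequent calls to write.
  local port (inp_lport): in_pcbconnect calls in_pcbbind to choose ephemeral port. Not changed on subsequent calls to write.
  foreign address (inp_faddr): foreignIP
  foreign port (inp_fport): fport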

Exercises

22.1

What happens in Figure 22.23 when the process asks for an ephemeral port and every ephemeral port is in use?

22.1

An infinite loop occurs, waiting for a port to become available. This assumes the process is allowed to open enough descriptors to tie up all ephemeral ports.

22.2

In Figure 22.10 we showed two Telnet servers with listening sockets: one with a specific local IP address and one with the wildcard for its local IP address. Does your system’s Telnet daemon allow you to specify the local IP address, and if so, how?

Answer 22.2

Few, if any, servers support this option. [Cheswick and Bellovin 1994] mention that this would be useful for implementing firewall systems.

22.3

Assume a socket is bound to the local socket {140.252.1.29, 8888}, and this is the only socket using local port 8888. (1) Go through the steps performed by in_pcbbind when another socket is bound to {140.252.13.33, 8888}, without any socket options. (2) Go through the steps performed when another socket is bound to the wildcard IP address, port 8888, without any socket options. (3) Go through the steps performed when another socket is bound to the wildcard IP address, port 8888, with the SO_REUSEADDR socket option.

22.4

What is the first ephemeral port number allocated by UDP?

Answer 22.4

The udb structure is initialized to 0, so udb.inp_lport starts at 0. The first time through in_pcbbind it is incremented to 1, which is less than 1024, so it is set to 1024.
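The wraparound described here, and the endless search asked about in Exercise 22.1, can be modeled in user space. The following is a sketch under stated assumptions, not the kernel source: next_ephemeral, port_in_use_fn, none_in_use, and all_in_use are hypothetical names, the upper bound of 5000 is assumed to be the Net/3 value of IPPORT_USERRESERVED (it is not stated in this section), and the max_tries guard is our addition so that the model terminates where the kernel loop would not.

```c
#include <assert.h>
#include <stdbool.h>

#define PORT_RESERVED     1024  /* lowest ephemeral port (IPPORT_RESERVED) */
#define PORT_USERRESERVED 5000  /* assumed upper bound (IPPORT_USERRESERVED) */

/* Hypothetical callback standing in for the kernel's PCB lookup. */
typedef bool (*port_in_use_fn)(unsigned short port);

static bool none_in_use(unsigned short port) { (void)port; return false; }
static bool all_in_use(unsigned short port)  { (void)port; return true; }

/*
 * Advance *last to the next free ephemeral port and return it.
 * Returns 0 if no free port is found within max_tries attempts;
 * this guard is not part of the behavior described in the text.
 */
static unsigned short next_ephemeral(unsigned short *last,
                                     port_in_use_fn in_use,
                                     unsigned long max_tries)
{
    unsigned long tries = 0;

    do {
        if (tries++ >= max_tries)
            return 0;
        /* Increment, then wrap to 1024 if the value was below the
         * reserved range or has moved past the upper bound. */
        if ((*last)++ < PORT_RESERVED || *last > PORT_USERRESERVED)
            *last = PORT_RESERVED;
    } while (in_use(*last));

    return *last;
}
```

Starting with the counter at 0 and no ports in use, the first call yields 1024 and the next yields 1025. With all_in_use as the predicate, the loop ends only because of the max_tries guard.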

22.5

When a process calls bind, which elements in the sockaddr_in structure must be filled in?

Answer 22.5

Normally the caller sets the address family (sin_family) to AF_INET, but we saw in Figure 22.20 that the test for this is commented out. The caller can set the length member (sin_len), but we saw in Figure 15.20 that the function sockargs always sets this to the third argument to bind, which for a sockaddr_in structure is specified as 16, normally using C’s sizeof operator.

The local IP address (sin_addr) can be specified as a wildcard address or as a local IP address. The local port number (sin_port) can be either 0 (telling the kernel to choose an ephemeral port) or nonzero if the process wants a particular port. Normally a TCP or UDP server specifies a wildcard IP address and a nonzero port, and a UDP client often specifies a wildcard IP address and a port number of 0.
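As an illustration of the UDP client case, the following sketch fills in a sockaddr_in with the wildcard address and a caller-supplied port, then uses getsockname to read back the port the kernel assigned. The function name bind_wildcard_udp is hypothetical. The length member is deliberately not set, both because the text notes that the kernel overwrites it and because not every system's sockaddr_in has one.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a UDP socket to the wildcard address and the given port
 * (0 asks the kernel for an ephemeral port). Returns the bound
 * port in host byte order, or 0 if any call fails.
 */
static unsigned short bind_wildcard_udp(unsigned short port)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    unsigned short bound = 0;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return 0;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;                /* address family */
    sin.sin_addr.s_addr = htonl(INADDR_ANY); /* wildcard local address */
    sin.sin_port = htons(port);              /* 0 = ephemeral */

    /* The third argument supplies the structure length. */
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == 0 &&
        getsockname(fd, (struct sockaddr *)&sin, &len) == 0)
        bound = ntohs(sin.sin_port);

    close(fd);
    return bound;
}
```

Calling bind_wildcard_udp(0) should return a nonzero port chosen by the kernel; passing a nonzero port corresponds to the server case.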

22.6

What happens if a process tries to bind a local broadcast address? What happens if a process tries to bind the limited broadcast address (255.255.255.255)?

Answer 22.6

A process is allowed to bind a local broadcast address, because the call to ifa_ifwithaddr in Figure 22.22 succeeds. That address is used as the source address for IP datagrams sent on the socket. As noted in Section C.2, this behavior is not allowed by RFC 1122.

An attempt to bind 255.255.255.255, however, fails, since that address is not acceptable to ifa_ifwithaddr.
