Protocol control blocks (PCBs) are used at the protocol layer to hold the various pieces of information required for each UDP or TCP socket. The Internet protocols maintain Internet protocol control blocks and TCP control blocks. Since UDP is connectionless, everything it needs for an end point is found in the Internet PCB; there are no UDP control blocks.
The Internet PCB contains the information common to all UDP and TCP end points: foreign and local IP addresses, foreign and local port numbers, IP header prototype, IP options to use for this end point, and a pointer to the routing table entry for the destination of this end point. The TCP control block contains all of the state information that TCP maintains for each connection: sequence numbers in both directions, window sizes, retransmission timers, and the like.
In this chapter we describe the Internet PCBs used in Net/3, saving TCP’s control blocks until we describe TCP in detail. We examine the numerous functions that operate on Internet PCBs, since we’ll encounter them when we describe UDP and TCP. Most of the functions begin with the six characters in_pcb.
Figure 22.1 summarizes the protocol control blocks that we describe and their relationship to the file and socket structures.
There are numerous points to consider in this figure.
When a socket is created by either socket or accept, the socket layer creates a file structure and a socket structure. The file type is DTYPE_SOCKET and the socket type is SOCK_DGRAM for UDP end points or SOCK_STREAM for TCP end points.
The protocol layer is then called. UDP creates an Internet PCB (an inpcb structure) and links it to the socket structure: the so_pcb member points to the inpcb structure and the inp_socket member points to the socket structure.
TCP does the same and also creates its own control block (a tcpcb structure) and links it to the inpcb using the inp_ppcb and t_inpcb pointers. In the two UDP inpcbs the inp_ppcb member is a null pointer, since UDP does not maintain its own control block.
The four other members of the inpcb structure that we show, inp_faddr through inp_lport, form the socket pair for this end point: the foreign IP address and port number along with the local IP address and port number.
Both UDP and TCP maintain a doubly linked list of all their Internet PCBs, using the inp_next and inp_prev pointers. They allocate a global inpcb structure as the head of their list (named udb and tcb) and only use three members in the structure: the next and previous pointers, and the local port number. This latter member contains the next ephemeral port number to use for this protocol.
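The pointer linkage described above can be modeled in user space with stripped-down structures. This is only a sketch: the structure bodies keep just the pointer members named in the text, and the helpers `link_endpoint` and `link_tcp` are hypothetical names introduced for illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down stand-ins for the kernel structures; only the
 * pointer members discussed in the text are kept. */
struct socket {
    void *so_pcb;               /* points to the inpcb */
};

struct inpcb {
    struct socket *inp_socket;  /* back pointer to the socket */
    void          *inp_ppcb;    /* per-protocol PCB; NULL for UDP */
};

struct tcpcb {
    struct inpcb *t_inpcb;      /* back pointer to the inpcb */
};

/* Hypothetical helper: wire a socket to its inpcb (UDP and TCP). */
static void link_endpoint(struct socket *so, struct inpcb *inp)
{
    so->so_pcb = inp;
    inp->inp_socket = so;
    inp->inp_ppcb = NULL;       /* stays NULL for a UDP end point */
}

/* Hypothetical helper: TCP additionally attaches its own control block. */
static void link_tcp(struct inpcb *inp, struct tcpcb *tp)
{
    inp->inp_ppcb = tp;
    tp->t_inpcb = inp;
}
```

Starting from either end, the other structures can be reached by following one or two pointers, which is the property the socket and protocol layers both rely on.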
The Internet PCB is a transport layer data structure. It is used by TCP, UDP, and raw IP, but not by IP, ICMP, or IGMP.
We haven’t described raw IP yet, but it too uses Internet PCBs. Unlike TCP and UDP, raw IP does not use the port number members in the PCB, and raw IP uses only two of the functions that we describe in this chapter: in_pcballoc
to allocate a PCB, and in_pcbdetach
to release a PCB. We return to raw IP in Chapter 32.
All the PCB functions are in a single C file and a single header contains the definitions, as shown in Figure 22.2.
One global variable is introduced in this chapter, which is shown in Figure 22.3.
Internet PCBs and TCP PCBs are both allocated by the kernel’s malloc
function with a type of M_PCB
. This is just one of the approximately 60 different types of memory allocated by the kernel. Mbufs, for example, are allocated with a type of M_BUF
, and socket
structures are allocated with a type of M_SOCKET
.
Since the kernel can keep counters of the different types of memory buffers that are allocated, various statistics on the number of PCBs can be maintained. The command vmstat -m
shows the kernel’s memory allocation statistics and the netstat -m
command shows the mbuf allocation statistics.
Figure 22.4 shows the definition of the inpcb
structure. It is not a big structure, and occupies only 84 bytes.
Figure 22.4. inpcb structure.
------------------------------------------------------------------------- in_pcb.h
42 struct inpcb {
43     struct inpcb *inp_next, *inp_prev;  /* doubly linked list */
44     struct inpcb *inp_head;     /* pointer back to chain of inpcb's for
45                                    this protocol */
46     struct in_addr inp_faddr;   /* foreign IP address */
47     u_short inp_fport;          /* foreign port# */
48     struct in_addr inp_laddr;   /* local IP address */
49     u_short inp_lport;          /* local port# */
50     struct socket *inp_socket;  /* back pointer to socket */
51     caddr_t inp_ppcb;           /* pointer to per-protocol PCB */
52     struct route inp_route;     /* placeholder for routing entry */
53     int     inp_flags;          /* generic IP/datagram flags */
54     struct ip inp_ip;           /* header prototype; should have more */
55     struct mbuf *inp_options;   /* IP options */
56     struct ip_moptions *inp_moptions;   /* IP multicast options */
57 };
------------------------------------------------------------------------- in_pcb.h
43-45
inp_next
and inp_prev
form the doubly linked list of all PCBs for UDP and TCP. Additionally, each PCB has a pointer to the head of the protocol’s linked list (inp_head
). For PCBs on the UDP list, inp_head
always points to udb
(Figure 22.1); for PCBs on the TCP list, this pointer always points to tcb
.
46-49
The next four members, inp_faddr, inp_fport, inp_laddr
, and inp_lport
, contain the socket pair for this IP end point: the foreign IP address and port number and the local IP address and port number. These four values are maintained in the PCB in network byte order, not host byte order.
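Because the PCB keeps these values in network byte order, a port number expressed in host byte order has to be converted before it is compared with a PCB member. A minimal sketch, assuming a hypothetical helper `port_matches` that is not part of the kernel:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical helper: compare a port stored in network byte order
 * (as a PCB stores it) against a port given in host byte order. */
static int port_matches(uint16_t pcb_port_net, uint16_t port_host)
{
    return pcb_port_net == htons(port_host);
}
```

For example, a PCB bound to the Telnet port would hold `htons(23)`, and `port_matches(htons(23), 23)` is true on both big-endian and little-endian hosts.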
The Internet PCB is used by both transport layers, TCP and UDP. While it makes sense to store the local and foreign IP addresses in this structure, the port numbers really don’t belong here. The definition of a port number and its size are specified by each transport layer and could differ between different transport layers. This problem was identified in [Partridge 1987], where 8-bit port numbers were used in version 1 of RDP, which required reimplementing several standard kernel routines to use 8-bit port numbers. Version 2 of RDP [Partridge and Hinden 1990] uses 16-bit port numbers. The port numbers really belong in a transport-specific control block, such as TCP’s
tcpcb
. A new UDP-specific PCB would then be required. While doable, this would complicate some of the routines we’ll examine shortly.
50-51
inp_socket
is a pointer to the socket
structure for this PCB and inp_ppcb
is a pointer to an optional transport-specific control block for this PCB. We saw in Figure 22.1 that the inp_ppcb
pointer is used with TCP to point to the corresponding tcpcb
, but is not used by UDP. The link between the socket
and inpcb
is two way because sometimes the kernel starts at the socket layer and needs to find the corresponding Internet PCB (e.g., user output), and sometimes the kernel starts at the PCB and needs to locate the corresponding socket
structure (e.g., processing a received IP datagram).
52
If IP has a route to the foreign address, it is stored in the inp_route
entry. We’ll see that when an ICMP redirect message is received, all Internet PCBs are scanned and all those with a foreign IP address that matches the redirected IP address have their inp_route
entry marked as invalid. This forces IP to find a new route to the foreign address the next time the PCB is used for output.
53
Various flags are stored in the inp_flags
member. Figure 22.5 lists the individual flags.
Figure 22.5. inp_flags values.
inp_flags | Description |
---|---|
INP_HDRINCL | process supplies entire IP header (raw socket only) |
INP_RECVOPTS | receive incoming IP options as control information (UDP only, not implemented) |
INP_RECVRETOPTS | receive IP options for reply as control information (UDP only, not implemented) |
INP_RECVDSTADDR | receive IP destination address as control information (UDP only) |
54
A copy of an IP header is maintained in the PCB but only two members are used, the TOS and TTL. The TOS is initialized to 0 (normal service) and the TTL is initialized by the transport layer. We’ll see that TCP and UDP both default the TTL to 64. A process can change these defaults using the IP_TOS
or IP_TTL
socket options, and the new value is recorded in the inpcb
.inp_ip
structure. This structure is then used by TCP and UDP as the prototype IP header when sending IP datagrams.
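From the process's side, changing the default TTL is a single `setsockopt` call at the `IPPROTO_IP` level. The following is a sketch of that call; the helper names `set_ttl` and `get_ttl` are hypothetical, and the code says nothing about how a particular kernel records the value internally.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: set the TTL used for datagrams sent on fd.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
static int set_ttl(int fd, int ttl)
{
    return setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
}

/* Hypothetical helper: read the current TTL back; -1 on error. */
static int get_ttl(int fd)
{
    int ttl = 0;
    socklen_t len = sizeof(ttl);

    if (getsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, &len) < 0)
        return -1;
    return ttl;
}
```

The `IP_TOS` option is set the same way, with `IP_TOS` in place of `IP_TTL`.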
55-56
A process can set the IP options for outgoing datagrams with the IP_OPTIONS socket option. A copy of the caller’s options is stored in an mbuf by the function ip_pcbopts and a pointer to that mbuf is stored in the inp_options member. Each time TCP or UDP calls the ip_output function, a pointer to these IP options is passed for IP to insert into the outgoing IP datagram. Similarly, a pointer to a copy of the user’s IP multicast options is maintained in the inp_moptions member.
An Internet PCB is allocated by TCP, UDP, and raw IP when a socket is created. A PRU_ATTACH
request is issued by the socket
system call. In the case of UDP, we’ll see in Figure 23.33 that the resulting call is
    struct socket *so;
    int error;

    error = in_pcballoc(so, &udb);
Figure 22.6 shows the in_pcballoc
function.
Figure 22.6. in_pcballoc function: allocate an Internet PCB.
------------------------------------------------------------------------- in_pcb.c
36 int
37 in_pcballoc(so, head)
38 struct socket *so;
39 struct inpcb *head;
40 {
41     struct inpcb *inp;

42     MALLOC(inp, struct inpcb *, sizeof(*inp), M_PCB, M_WAITOK);
43     if (inp == NULL)
44         return (ENOBUFS);
45     bzero((caddr_t) inp, sizeof(*inp));
46     inp->inp_head = head;
47     inp->inp_socket = so;
48     insque(inp, head);
49     so->so_pcb = (caddr_t) inp;
50     return (0);
51 }
------------------------------------------------------------------------- in_pcb.c
36-45
in_pcballoc
calls the kernel’s memory allocator using the macro MALLOC
. Since these PCBs are always allocated as the result of a system call, it is OK to wait for one.
Net/2 and earlier Berkeley releases stored both Internet PCBs and TCP PCBs in mbufs. Their sizes were 80 and 108 bytes, respectively. With the Net/3 release, the sizes went to 84 and 140 bytes, so TCP control blocks no longer fit into an mbuf. Net/3 uses the kernel’s memory allocator instead of mbufs for both types of control blocks.
Careful readers may note that the example in Figure 2.6 shows 17 mbufs allocated for PCBs, yet we just said that Net/3 no longer uses mbufs for Internet PCBs or TCP PCBs. Net/3 does, however, use mbufs for Unix domain PCBs, and that is what this counter refers to. The mbuf statistics output by
netstat
are for all mbufs in the kernel across all protocol suites, not just the Internet protocols.
bzero
sets the PCB to 0. This is important because the IP addresses and port numbers in the PCB must be initialized to 0.
46-49
The inp_head
member points to the head of the protocol’s PCB list (either udb
or tcb
), the inp_socket
member points to the socket
structure, the new PCB is added to the protocol’s doubly linked list (insque
), and the socket
structure points to the PCB. The insque
function puts the new PCB at the head of the protocol’s list.
An Internet PCB is deallocated when a PRU_DETACH
request is issued. This happens when the socket is closed. The function in_pcbdetach
, shown in Figure 22.7, is eventually called.
Figure 22.7. in_pcbdetach function: deallocate an Internet PCB.
------------------------------------------------------------------------- in_pcb.c
252 int
253 in_pcbdetach(inp)
254 struct inpcb *inp;
255 {
256     struct socket *so = inp->inp_socket;

257     so->so_pcb = 0;
258     sofree(so);
259     if (inp->inp_options)
260         (void) m_free(inp->inp_options);
261     if (inp->inp_route.ro_rt)
262         rtfree(inp->inp_route.ro_rt);
263     ip_freemoptions(inp->inp_moptions);
264     remque(inp);
265     FREE(inp, M_PCB);
266 }
------------------------------------------------------------------------- in_pcb.c
252-263
The PCB pointer in the socket
structure is set to 0 and that structure is released by sofree
. If an mbuf with IP options was allocated for this PCB, it is released by m_free
. If a route is held by this PCB, it is released by rtfree
. Any multicast options are also released by ip_freemoptions
.
264-265
The PCB is removed from the protocol’s doubly linked list by remque
and the memory used by the PCB is returned to the kernel.
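The list manipulation performed when a PCB is allocated and detached can be sketched in user space. The code below is an illustration only: `struct pcb` keeps just the two list pointers, and `pcb_list_init`, `pcb_insque`, and `pcb_remque` are hypothetical stand-ins that mimic the behavior of insque and remque on a circular list whose head points to itself when empty.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct inpcb: only the list pointers. */
struct pcb {
    struct pcb *next, *prev;
};

/* Initialize an empty circular list: the head points to itself. */
static void pcb_list_init(struct pcb *head)
{
    head->next = head->prev = head;
}

/* Insert elem immediately after pred. Passing the list head as
 * pred puts elem at the front of the list. */
static void pcb_insque(struct pcb *elem, struct pcb *pred)
{
    elem->next = pred->next;
    elem->prev = pred;
    pred->next->prev = elem;
    pred->next = elem;
}

/* Unlink elem from the list it is on. */
static void pcb_remque(struct pcb *elem)
{
    elem->prev->next = elem->next;
    elem->next->prev = elem->prev;
    elem->next = elem->prev = NULL;
}
```

Inserting after the head means the most recently created PCB is the first one a list scan encounters, and removal needs no search because each element carries its own neighbors.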
Before examining the kernel functions that bind sockets, connect sockets, and demultiplex incoming datagrams, we describe the rules imposed by the kernel on these actions.
Figure 22.8 shows the six different combinations of a local IP address and local port number that a process can specify in a call to bind
.
Figure 22.8. Combination of local IP address and local port number for bind.
Local IP address | Local port | Description |
---|---|---|
unicast or broadcast | nonzero | one local interface, specific port |
multicast | nonzero | one local multicast group, specific port |
* | nonzero | any local interface or multicast group, specific port |
unicast or broadcast | 0 | one local interface, kernel chooses port |
multicast | 0 | one multicast group, kernel chooses port |
* | 0 | any local interface, kernel chooses port |
The first three lines are typical for servers: they bind a specific port, termed the server’s well-known port, whose value is known by the client. The last three lines are typical for clients: they don’t care what the local port, termed an ephemeral port, is, as long as it is unique on the client host.
Most servers and most clients specify the wildcard IP address in the call to bind
. This is indicated in Figure 22.8 by the notation * on lines 3 and 6.
If a server binds a specific IP address to a socket (i.e., not the wildcard address), then only IP datagrams arriving with that specific IP address as the destination IP address (be it unicast, broadcast, or multicast) are delivered to the process. Naturally, when the process binds a specific unicast or broadcast IP address to a socket, the kernel verifies that the IP address corresponds to a local interface.
It is rare, though possible, for a client to bind a specific IP address (lines 4 and 5 in Figure 22.8). Normally a client binds the wildcard IP address (the final line in Figure 22.8), which lets the kernel choose the outgoing interface based on the route chosen to reach the server.
What we don’t show in Figure 22.8 is what happens if the client tries to bind a local port that is already in use with another socket. By default a process cannot bind a port number if that port is already in use. The error EADDRINUSE
(address already in use) is returned if this occurs. The definition of in use is simply whether a PCB exists with that port as its local port. This notion of “in use” is relative to a given protocol: TCP or UDP, since TCP port numbers are independent of UDP port numbers.
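The default behavior can be observed from a user process. The sketch below binds one UDP socket to an ephemeral port on the loopback address and lets a second socket attempt the same address and port; the helpers `bind_loopback` and `local_port` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind fd to 127.0.0.1 and the given port
 * (host byte order; 0 lets the kernel choose). Returns 0 or -1. */
static int bind_loopback(int fd, unsigned short port)
{
    struct sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = htons(port);
    return bind(fd, (struct sockaddr *)&sin, sizeof(sin));
}

/* Hypothetical helper: local port bound to fd in host byte order,
 * or 0 if it cannot be determined. */
static unsigned short local_port(int fd)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);

    if (getsockname(fd, (struct sockaddr *)&sin, &len) < 0)
        return 0;
    return ntohs(sin.sin_port);
}
```

With neither socket option set, calling `bind_loopback` on a second UDP socket with the port returned by `local_port` for the first is expected to fail with errno set to EADDRINUSE.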
Net/3 allows a process to change this default behavior by specifying one of the following two socket options:
SO_REUSEADDR | Allows the process to bind a port number that is already in use, but the IP address being bound (including the wildcard) must not already be bound to that same port. For example, if an attached interface has the IP address 140.252.1.29 then one socket can be bound to 140.252.1.29, port 5555; another socket can be bound to 127.0.0.1, port 5555; and another socket can be bound to the wildcard IP address, port 5555. The call to |
SO_REUSEPORT | Allows a process to reuse both the IP address and port number, but each binding of the IP address and port number, including the first, must specify this socket option. With |
 | For example, if an attached interface has the IP address 140.252.1.29 and a socket is bound to 140.252.1.29, port 6666 specifying the |
Later in this section we describe what happens in this final example when an IP datagram arrives with a destination address of 140.252.1.29 and a destination port of 6666, since two sockets are bound to that end point.
The SO_REUSEPORT option is new with Net/3 and was introduced with the support for multicasting in 4.4BSD. Before this release it was never possible for two sockets to be bound to the same IP address and same port number.
Unfortunately the SO_REUSEPORT option was not part of the original Stanford multicast sources and is therefore not widely supported. Other systems that support multicasting, such as Solaris 2.x, let a process specify SO_REUSEADDR to indicate that it is OK to bind multiple sockets to the same IP address and same port number.
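Either option is requested with `setsockopt` at the `SOL_SOCKET` level, and it has to be set before the call to bind, since that is when the kernel checks for a conflict. The sketch below shows only how SO_REUSEADDR is requested; `enable_reuseaddr` is a hypothetical helper, and because the exact semantics differ between systems (as noted above), the code makes no claim about which later binds a given kernel will permit.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: turn on SO_REUSEADDR for fd.
 * Call this before bind. Returns 0 on success, -1 on error. */
static int enable_reuseaddr(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
}

/* Hypothetical helper: return nonzero if SO_REUSEADDR is set on fd,
 * 0 if it is not, -1 on error. */
static int reuseaddr_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &val, &len) < 0)
        return -1;
    return val != 0;
}
```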
We normally associate the connect
system call with TCP clients, but it is also possible for a UDP client or a UDP server to call connect
and specify the foreign IP address and foreign port number for the socket. This restricts the socket to exchanging UDP datagrams with that one particular peer.
There is a side effect when a UDP socket is connected: the local IP address, if not already specified by a call to bind
, is automatically set by connect
. It is set to the local interface address chosen by IP routing to reach the specified peer.
Figure 22.9 shows the three different states of a UDP socket along with the pseudo-code of the function calls to end up in that state.
Figure 22.9. Specification of local and foreign IP addresses and port numbers for UDP sockets.
Local socket | Foreign socket | Description |
---|---|---|
localIP.lport | foreignIP.fport | restricted to one peer |
localIP.lport | *.* | restricted to datagrams arriving on one local interface: localIP |
*.lport | *.* | receives all datagrams sent to lport |
The first of the three states is called a connected UDP socket and the next two states are called unconnected UDP sockets. The difference between the two unconnected sockets is that the first has a fully specified local address and the second has a wildcarded local IP address.
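The side effect of connect on a UDP socket can be observed with getsockname. The following is a sketch under stated assumptions: the peer is the loopback address, and `udp_connected` and `local_name` are hypothetical helper names. Connecting a UDP socket sends nothing on the wire, so no peer needs to be listening.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a UDP socket and connect it to
 * 127.0.0.1, port fport (host byte order). Returns fd or -1. */
static int udp_connected(unsigned short fport)
{
    struct sockaddr_in peer;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    peer.sin_port = htons(fport);
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: fetch the local address and port of fd. */
static int local_name(int fd, struct sockaddr_in *sin)
{
    socklen_t len = sizeof(*sin);

    return getsockname(fd, (struct sockaddr *)sin, &len);
}
```

After `udp_connected` returns, `local_name` should report a specific local IP address (the loopback address in this case) and a nonzero ephemeral port, even though the process never called bind.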
Figure 22.10 shows the state of three Telnet server sockets on the host sun
. The first two sockets are in the LISTEN state, waiting for incoming connection requests, and the third is connected to a client at port 1500 on the host with an IP address of 140.252.1.11. The first listening socket will handle connection requests that arrive on the 140.252.1.29 interface and the second listening socket will handle all other interfaces (since its local IP address is the wildcard).
We show both of the listening sockets with unspecified foreign IP addresses and port numbers because the sockets API doesn’t allow a TCP server to restrict either of these values. A TCP server must accept
the client’s connection and is then told of the client’s IP address and port number after the connection establishment is complete (i.e., when TCP’s three-way handshake is complete). Only then can the server close the connection if it doesn’t like the client’s IP address and port number. This isn’t a required TCP feature, it is just the way the sockets API has always worked.
When TCP receives a segment with a destination port of 23 it searches through its list of Internet PCBs looking for a match by calling in_pcblookup
. When we examine this function shortly we’ll see that it has a preference for the smallest number of wildcard matches. To determine the number of wildcard matches we consider only the local and foreign IP addresses. We do not consider the foreign port number. The local port number must match, or we don’t even consider the PCB. The number of wildcard matches can be 0, 1 (local IP address or foreign IP address), or 2 (both local and foreign IP addresses).
For example, assume the incoming segment is from 140.252.1.11, port 1500, destined for 140.252.1.29, port 23. Figure 22.11 shows the number of wildcard matches for the three sockets from Figure 22.10.
The first socket matches these four values, but with one wildcard match (the foreign IP address). The second socket also matches the incoming segment, but with two wildcard matches (the local and foreign IP addresses). The third socket is a complete match with no wildcards. Net/3 uses the third socket, the one with the smallest number of wildcard matches.
Continuing this example, assume the incoming segment is from 140.252.1.11, port 1501, destined for 140.252.1.29, port 23. Figure 22.12 shows the number of wildcard matches.
The first socket matches with one wildcard match; the second socket matches with two wildcard matches; and the third socket doesn’t match at all, since the foreign port numbers are unequal. (The foreign port numbers are compared only if the foreign IP address in the PCB is not a wildcard.) The first socket is chosen.
In these two examples we never said what type of TCP segment arrived: we assume that the segment in Figure 22.11 contains data or an acknowledgment for an established connection since it is delivered to an established socket. We also assume that the segment in Figure 22.12 is an incoming connection request (a SYN) since it is delivered to a listening socket. But the demultiplexing code in in_pcblookup
doesn’t care. If the TCP segment is the wrong type for the socket that it is delivered to, we’ll see later how TCP handles this. For now the important fact is that the demultiplexing code only compares the source and destination socket pair from the IP datagram against the values in the PCB.
The delivery of UDP datagrams is more complicated than the TCP example we just examined, since UDP datagrams can be sent to a broadcast or multicast address. Since Net/3 (and most systems with multicast support) allow multiple sockets to have identical local IP addresses and ports, how are multiple recipients handled? The Net/3 rules are:
An incoming UDP datagram destined for either a broadcast IP address or a multicast IP address is delivered to all matching sockets. There is no concept of a “best” match here (i.e., the one with the smallest number of wildcard matches).
An incoming UDP datagram destined for a unicast IP address is delivered only to one matching socket, the one with the smallest number of wildcard matches. If there are multiple sockets with the same “smallest” number of wildcard matches, which socket receives the incoming datagram is implementation-dependent.
Figure 22.13 shows four UDP sockets that we’ll use for some examples. Having four UDP sockets with the same local port number requires using either SO_REUSEADDR
or SO_REUSEPORT
. The first two sockets have been connected to a foreign IP address and port number, and the last two are unconnected.
Figure 22.13. Four UDP sockets with a local port of 577.
Local address | Local port | Foreign address | Foreign port | Comment |
---|---|---|---|---|
140.252.1.29 | 577 | 140.252.1.11 | 1500 | connected, local IP = unicast |
140.252.13.63 | 577 | 140.252.13.35 | 1500 | connected, local IP = broadcast |
140.252.13.63 | 577 | * | * | unconnected, local IP = broadcast |
* | 577 | * | * | unconnected, local IP = wildcard |
Consider an incoming UDP datagram destined for 140.252.13.63 (the broadcast address on the 140.252.13 subnet), port 577, from 140.252.13.34, port 1500. Figure 22.14 shows that it is delivered to the third and fourth sockets.
The broadcast datagram is not delivered to the first socket because the local IP address doesn’t match the destination IP address and the foreign IP address doesn’t match the source IP address. It isn’t delivered to the second socket because the foreign IP address doesn’t match the source IP address.
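The broadcast rule can be checked against the four sockets of Figure 22.13 with a small user-space model. This is a sketch, not the kernel's UDP input code: `struct udp_pcb`, `udp_matches`, and `udp_deliver_all` are hypothetical, addresses are kept in host byte order for readability, and 0 stands for a wildcard address or port.

```c
#include <assert.h>
#include <stdint.h>

#define ANY 0u   /* wildcard address or port in this sketch */

/* Build an IPv4 address (host byte order) from its four components. */
#define IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

struct udp_pcb {             /* hypothetical, simplified PCB */
    uint32_t laddr; uint16_t lport;
    uint32_t faddr; uint16_t fport;
};

/* The four sockets of Figure 22.13, in the order of its rows. */
static const struct udp_pcb fig_22_13[4] = {
    { IP(140,252,1,29),  577, IP(140,252,1,11),  1500 },
    { IP(140,252,13,63), 577, IP(140,252,13,35), 1500 },
    { IP(140,252,13,63), 577, ANY, ANY },
    { ANY,               577, ANY, ANY },
};

/* Nonzero if a datagram from src:sport to dst:dport matches the PCB. */
static int udp_matches(const struct udp_pcb *p,
                       uint32_t src, uint16_t sport,
                       uint32_t dst, uint16_t dport)
{
    if (p->lport != dport)
        return 0;
    if (p->laddr != ANY && p->laddr != dst)
        return 0;
    if (p->faddr != ANY && (p->faddr != src || p->fport != sport))
        return 0;
    return 1;
}

/* Broadcast/multicast rule: every matching socket is a recipient.
 * Sets delivered[i] for each PCB and returns the number of matches. */
static int udp_deliver_all(const struct udp_pcb *tab, int n,
                           uint32_t src, uint16_t sport,
                           uint32_t dst, uint16_t dport,
                           int *delivered)
{
    int i, count = 0;

    for (i = 0; i < n; i++) {
        delivered[i] = udp_matches(&tab[i], src, sport, dst, dport);
        count += delivered[i];
    }
    return count;
}
```

Calling `udp_deliver_all` with source 140.252.13.34, port 1500, and destination 140.252.13.63, port 577, marks only the third and fourth sockets, in agreement with the discussion above.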
As the next example, consider an incoming UDP datagram destined for 140.252.1.29 (a unicast address), port 577, from 140.252.1.11, port 1500. Figure 22.15 shows to which sockets the datagram is delivered.
Figure 22.15. Received datagram from {140.252.1.11, 1500} to {140.252.1.29, 577}.
Local address | Local port | Foreign address | Foreign port | Delivered? |
---|---|---|---|---|
140.252.1.29 | 577 | 140.252.1.11 | 1500 | yes, 0 wildcard matches |
140.252.13.63 | 577 | 140.252.13.35 | 1500 | no, local and foreign IP mismatch |
140.252.13.63 | 577 | * | * | no, local IP mismatch |
* | 577 | * | * | no, 2 wildcard matches |
The datagram matches the first socket with no wildcard matches and also matches the fourth socket with two wildcard matches. It is delivered to the first socket, the best match.
The function in_pcblookup
serves four different purposes.
1. When either TCP or UDP receives an IP datagram, in_pcblookup scans the protocol’s list of Internet PCBs looking for a matching PCB to receive the datagram. This is transport layer demultiplexing of a received datagram.
2. When a process executes the bind system call, to assign a local IP address and local port number to a socket, in_pcbbind is called by the protocol to verify that the requested local address pair is not already in use.
3. When a process executes the bind system call, requesting an ephemeral port be assigned to its socket, the kernel picks an ephemeral port and calls in_pcbbind to check if the port is in use. If it is in use, the next ephemeral port number is tried, and so on, until an unused port is located.
4. When a process executes the connect system call, either explicitly or implicitly, in_pcbbind verifies that the requested socket pair is unique. (An implicit call to connect happens when a UDP datagram is sent on an unconnected socket. We’ll see this scenario in Chapter 23.)
In cases 2, 3, and 4 in_pcbbind
calls in_pcblookup
. Two options confuse the logic of the function. First, a process can specify either the SO_REUSEADDR
or SO_REUSEPORT
socket option to say that a duplicate local address is OK.
Second, sometimes a wildcard match is OK (e.g., an incoming UDP datagram can match a PCB that has a wildcard for its local IP address, meaning that the socket will accept UDP datagrams that arrive on any local interface), while other times a wildcard match is forbidden (e.g., when connecting to a foreign IP address and port number).
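The ephemeral port search in case 3 amounts to a loop that advances a per-protocol "next port" counter until an unused port turns up. The sketch below models that loop in user space. Everything in it is an assumption made for illustration: the range bounds `PORT_FIRST` and `PORT_LAST`, the `in_use` array standing in for a scan of the PCB list, and the helper `pick_ephemeral`. It also bounds the number of attempts, which is a choice of this sketch rather than something stated in the text.

```c
#include <assert.h>
#include <stdbool.h>

#define PORT_FIRST 1024   /* assumed lower bound of the ephemeral range */
#define PORT_LAST  5000   /* assumed upper bound of the ephemeral range */

/* Hypothetical stand-in for the PCB list: in_use[p] is true when
 * some PCB already has p as its local port. */
static bool in_use[PORT_LAST + 1];

/* Advance *next_port (the counter kept in the list head) until an
 * unused port is found. Returns the port, or 0 if every port in
 * the range is taken. */
static int pick_ephemeral(int *next_port)
{
    int tries;

    for (tries = 0; tries <= PORT_LAST - PORT_FIRST; tries++) {
        (*next_port)++;
        if (*next_port < PORT_FIRST || *next_port > PORT_LAST)
            *next_port = PORT_FIRST;    /* wrap around */
        if (!in_use[*next_port])
            return *next_port;
    }
    return 0;
}
```

Because the counter persists across calls, consecutive clients receive increasing port numbers until the counter wraps back to the start of the range.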
In the original Stanford IP multicast code appears the comment that “The logic of in_pcblookup is rather opaque and there is not a single comment,” The adjective opaque is an understatement.
The publicly available IP multicast code for BSD/386, which is derived from the port to 4.4BSD done by Craig Leres, fixed the overloaded semantics of this function by using in_pcblookup only for case 1 above. Cases 2 and 4 are handled by a new function named in_pcbconflict, and case 3 is handled by a new function named in_uniqueport. Dividing the original functionality into separate functions is much clearer, but in the Net/3 release, which we’re describing in this text, the logic is still combined into the single function in_pcblookup.
Figure 22.16 shows the in_pcblookup
function.
Figure 22.16. in_pcblookup function: search all the PCBs for a match.
------------------------------------------------------------------------- in_pcb.c
405 struct inpcb *
406 in_pcblookup(head, faddr, fport_arg, laddr, lport_arg, flags)
407 struct inpcb *head;
408 struct in_addr faddr, laddr;
409 u_int   fport_arg, lport_arg;
410 int     flags;
411 {
412     struct inpcb *inp, *match = 0;
413     int     matchwild = 3, wildcard;
414     u_short fport = fport_arg, lport = lport_arg;

415     for (inp = head->inp_next; inp != head; inp = inp->inp_next) {
416         if (inp->inp_lport != lport)
417             continue;           /* ignore if local ports are unequal */
418         wildcard = 0;
419         if (inp->inp_laddr.s_addr != INADDR_ANY) {
420             if (laddr.s_addr == INADDR_ANY)
421                 wildcard++;
422             else if (inp->inp_laddr.s_addr != laddr.s_addr)
423                 continue;
424         } else {
425             if (laddr.s_addr != INADDR_ANY)
426                 wildcard++;
427         }
428         if (inp->inp_faddr.s_addr != INADDR_ANY) {
429             if (faddr.s_addr == INADDR_ANY)
430                 wildcard++;
431             else if (inp->inp_faddr.s_addr != faddr.s_addr ||
432                      inp->inp_fport != fport)
433                 continue;
434         } else {
435             if (faddr.s_addr != INADDR_ANY)
436                 wildcard++;
437         }
438         if (wildcard && (flags & INPLOOKUP_WILDCARD) == 0)
439             continue;           /* wildcard match not allowed */
440         if (wildcard < matchwild) {
441             match = inp;
442             matchwild = wildcard;
443             if (matchwild == 0)
444                 break;          /* exact match, all done */
445         }
446     }
447     return (match);
448 }
------------------------------------------------------------------------- in_pcb.c
The function starts at the head of the protocol’s PCB list and potentially goes through every PCB on the list. The variable match
remembers the pointer to the entry with the best match so far, and matchwild
remembers the number of wildcards in that match. The latter is initialized to 3, which is a value greater than the maximum number of wildcard matches that can be encountered. (Any value greater than 2 would work.) Each time around the loop, the variable wildcard
starts at 0 and counts the number of wildcard matches for each PCB.
416-417
The first comparison is the local port number. If the PCB’s local port doesn’t match the lport
argument, the PCB is ignored.
419-427
in_pcblookup
compares the local address in the PCB with the laddr
argument. If one is a wildcard and the other is not a wildcard, the wildcard
counter is incremented. If both are not wildcards, then they must be the same, or this PCB is ignored. If both are wildcards, nothing changes: they can’t be compared and the wildcard
counter isn’t incremented. Figure 22.17 summarizes the four different conditions.
428-437
These lines perform the same test that we just described, but using the foreign addresses instead of the local addresses. Also, if both foreign addresses are not wildcards then not only must the two IP addresses be equal, but the two foreign ports must also be equal. Figure 22.18 summarizes the foreign IP comparisons.
The additional comparison of the foreign port numbers can be performed for the second line of Figure 22.18 because it is not possible to have a PCB with a nonwildcard foreign address and a foreign port number of 0. This restriction is enforced by connect
, which we’ll see shortly requires a nonwildcard foreign IP address and a nonzero foreign port. It is possible, however, and common, to have a wildcard local address with a nonzero local port. We saw this in Figures 22.10 and 22.13.
438-439
The flags
argument can be set to INPLOOKUP_WILDCARD
, which means a match containing wildcards is OK. If a match is found containing wildcards (wildcard
is nonzero) and this flag was not specified by the caller, this PCB is ignored. When TCP and UDP call this function to demultiplex an incoming datagram, INPLOOKUP_WILDCARD
is always set, since a wildcard match is OK. (Recall our examples using Figures 22.10 and 22.13.) But when this function is called as part of the connect
system call, in order to verify that a socket pair is not already in use, the flags
argument is set to 0.
440-447
These statements remember the best match found so far. Again, the best match is considered the one with the fewest number of wildcard matches. If a match is found with one or two wildcards, that match is remembered and the loop continues. But if an exact match is found (wildcard
is 0), the loop terminates, and a pointer to the PCB with that exact match is returned.
Figure 22.19 is from the TCP example we discussed with Figure 22.11. Assume in_pcblookup
is demultiplexing a received datagram from 140.252.1.11, port 1500, destined for 140.252.1.29, port 23. Also assume that the order of the PCBs is the order of the rows in the figure. laddr
is the destination IP address, lport
is the destination TCP port, faddr
is the source IP address, and fport
is the source TCP port.
When the first row is compared to the incoming segment, wildcard
is 1 (the foreign IP address), flags
is set to INPLOOKUP_WILDCARD
, so match
is set to point to this PCB and matchwild
is set to 1. The loop continues since an exact match has not been found yet. The next time around the loop, wildcard
is 2 (the local and foreign IP addresses) and since this is greater than matchwild
, the entry is not remembered, and the loop continues. The next time around the loop, wildcard
is 0, which is less than matchwild
(1), so this entry is remembered in match
. The loop also terminates since an exact match has been found and the pointer to this PCB is returned to the caller.
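The walk-through above can be replayed in user space. The following is only a sketch, assuming a simplified PCB that holds just the four matching values and a plain array instead of the kernel's linked list; the names `sk_pcb`, `sk_lookup`, `SK_IP`, and `sk_fig_rows` are hypothetical and are not part of Net/3.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ANY             0u  /* wildcard address, plays the role of INADDR_ANY */
#define SK_LOOKUP_WILDCARD 1   /* plays the role of INPLOOKUP_WILDCARD */
#define SK_IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical, simplified PCB: only the values used for matching. */
struct sk_pcb {
    uint32_t laddr, faddr;   /* local and foreign IP address */
    uint16_t lport, fport;   /* local and foreign port */
};

/* Return the index of the best match in pcbs[0..n-1], or -1 if none. */
int sk_lookup(const struct sk_pcb *pcbs, size_t n,
              uint32_t faddr, uint16_t fport,
              uint32_t laddr, uint16_t lport, int flags)
{
    int match = -1;
    int matchwild = 3;               /* worse than any real match */

    for (size_t i = 0; i < n; i++) {
        const struct sk_pcb *p = &pcbs[i];
        int wildcard = 0;

        if (p->lport != lport)
            continue;                /* the local port must always match */

        if (p->laddr != SK_ANY) {
            if (laddr == SK_ANY)
                wildcard++;
            else if (p->laddr != laddr)
                continue;
        } else if (laddr != SK_ANY) {
            wildcard++;
        }

        if (p->faddr != SK_ANY) {
            if (faddr == SK_ANY)
                wildcard++;
            else if (p->faddr != faddr || p->fport != fport)
                continue;
        } else if (faddr != SK_ANY) {
            wildcard++;
        }

        if (wildcard && (flags & SK_LOOKUP_WILDCARD) == 0)
            continue;                /* caller did not allow wildcard matches */

        if (wildcard < matchwild) {  /* remember the best match so far */
            match = (int)i;
            matchwild = wildcard;
            if (matchwild == 0)
                break;               /* an exact match ends the search */
        }
    }
    return match;
}

/* The three PCBs of the walk-through, in row order. */
static const struct sk_pcb sk_fig_rows[] = {
    { SK_IP(140, 252, 1, 29), SK_ANY,                 23, 0    },
    { SK_ANY,                 SK_ANY,                 23, 0    },
    { SK_IP(140, 252, 1, 29), SK_IP(140, 252, 1, 11), 23, 1500 },
};
```

Calling `sk_lookup` on these rows with the segment's addresses and ports and `SK_LOOKUP_WILDCARD` follows the same path as the text: the first row is remembered with one wildcard, the second is passed over, and the third ends the loop as an exact match.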
If in_pcblookup
were used by TCP and UDP only to demultiplex incoming datagrams, it could be simplified. First, there’s no need to check whether the faddr
or laddr
arguments are wildcards, since these are the source and destination IP addresses from the received datagram. Also the flags
argument could be removed, along with its corresponding test, since wildcard matches are always OK.
This section has covered the mechanics of the in_pcblookup
function. We’ll return to this function and discuss its meaning after seeing how it is called from the in_pcbbind
and in_pcbconnect
functions.
The next function, in_pcbbind
, binds a local address and port number to a socket. It is called from five functions:
1. from bind for a TCP socket (normally to bind a server’s well-known port);
2. from bind for a UDP socket (either to bind a server’s well-known port or to bind an ephemeral port to a client’s socket);
3. from connect for a TCP socket, if the socket has not yet been bound to a nonzero port (this is typical for TCP clients);
4. from listen for a TCP socket, if the socket has not yet been bound to a nonzero port (this is rare, since listen is called by a TCP server, which normally binds a well-known port, not an ephemeral port); and
5. from in_pcbconnect (Section 22.8), if the local IP address and local port number have not been set (typical for a call to connect for a UDP socket or for each call to sendto for an unconnected UDP socket).
In cases 3, 4, and 5, an ephemeral port number is bound to the socket and the local IP address is not changed (in case it is already set).
We call cases 1 and 2 explicit binds and cases 3, 4, and 5 implicit binds. We also note that although it is normal in case 2 for a server to bind a well-known port, servers invoked using remote procedure calls (RPC) often bind ephemeral ports and then register their ephemeral port with another program that maintains a mapping between the server’s RPC program number and its ephemeral port (e.g., the Sun port mapper described in Section 29.4 of Volume 1).
We’ll show the in_pcbbind
function in three sections. Figure 22.20 is the first section.
Table 22.20. in_pcbbind
function: bind a local address and port number.
------------------------------------------------------------------------- in_pcb.c
52 int
53 in_pcbbind(inp, nam)
54 struct inpcb *inp;
55 struct mbuf *nam;
56 {
57     struct socket *so = inp->inp_socket;
58     struct inpcb *head = inp->inp_head;
59     struct sockaddr_in *sin;
60     struct proc *p = curproc;   /* XXX */
61     u_short lport = 0;
62     int     wild = 0, reuseport = (so->so_options & SO_REUSEPORT);
63     int     error;
64     if (in_ifaddr == 0)
65         return (EADDRNOTAVAIL);
66     if (inp->inp_lport || inp->inp_laddr.s_addr != INADDR_ANY)
67         return (EINVAL);
68     if ((so->so_options & (SO_REUSEADDR | SO_REUSEPORT)) == 0 &&
69         ((so->so_proto->pr_flags & PR_CONNREQUIRED) == 0 ||
70          (so->so_options & SO_ACCEPTCONN) == 0))
71         wild = INPLOOKUP_WILDCARD;
------------------------------------------------------------------------- in_pcb.c
64-67
The first two tests verify that at least one interface has been assigned an IP address and that the socket is not already bound. You can’t bind a socket twice.
68-71
This if statement is confusing. The net result sets the variable wild to INPLOOKUP_WILDCARD if neither SO_REUSEADDR nor SO_REUSEPORT is set.
The second test is true for UDP sockets since PR_CONNREQUIRED
is false for connectionless sockets and true for connection-oriented sockets.
The third test is where the confusion lies [Torek 1992]. The socket flag SO_ACCEPTCONN
is set only by the listen
system call (Section 15.9), which is valid only for a connection-oriented server. In the normal scenario, a TCP server calls socket, bind
, and then listen
. Therefore, when in_pcbbind
is called by bind
, this socket flag is cleared. Even if the process calls socket
and then listen
, without calling bind
, TCP’s PRU_LISTEN
request calls in_pcbbind
to assign an ephemeral port to the socket before the socket layer sets the SO_ACCEPTCONN
flag. This means the third test in the if
statement, testing whether SO_ACCEPTCONN
is not set, is always true. The if
statement is therefore equivalent to
    if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) == 0 &&
        ((so->so_proto->pr_flags & PR_CONNREQUIRED) == 0 || 1))
        wild = INPLOOKUP_WILDCARD;
Since anything logically ORed with 1 is always true, this is equivalent to
    if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) == 0)
        wild = INPLOOKUP_WILDCARD;
which is simpler to understand: if either of the REUSE socket options is set, wild is left as 0. If neither of the REUSE socket options is set, wild is set to INPLOOKUP_WILDCARD. In other words, when in_pcblookup is called later in the function, a wildcard match is allowed only if neither of the REUSE socket options is on.
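The equivalence argued above can be checked mechanically. This is a minimal sketch, assuming SO_ACCEPTCONN is never set when the test runs (as the text explains); the `SK_` constants are illustrative bit values chosen for the sketch, and the function names are hypothetical.

```c
/* Illustrative bit values for the sketch; any distinct bits would do. */
#define SK_SO_ACCEPTCONN   0x0002
#define SK_SO_REUSEADDR    0x0004
#define SK_SO_REUSEPORT    0x0200
#define SK_PR_CONNREQUIRED 0x04
#define SK_LOOKUP_WILDCARD 1

/* The test in the shape it has on lines 68-71. */
int sk_wild_original(int so_options, int pr_flags)
{
    int wild = 0;

    if ((so_options & (SK_SO_REUSEADDR | SK_SO_REUSEPORT)) == 0 &&
        ((pr_flags & SK_PR_CONNREQUIRED) == 0 ||
         (so_options & SK_SO_ACCEPTCONN) == 0))
        wild = SK_LOOKUP_WILDCARD;
    return wild;
}

/* The simplified test: only the two REUSE options matter. */
int sk_wild_simplified(int so_options)
{
    return (so_options & (SK_SO_REUSEADDR | SK_SO_REUSEPORT)) == 0
               ? SK_LOOKUP_WILDCARD : 0;
}

/*
 * Count the combinations of REUSE options and protocol type, with
 * SO_ACCEPTCONN clear, on which the two forms give different answers.
 */
int sk_count_disagreements(void)
{
    static const int opts[] = {
        0, SK_SO_REUSEADDR, SK_SO_REUSEPORT,
        SK_SO_REUSEADDR | SK_SO_REUSEPORT
    };
    static const int protos[] = { 0, SK_PR_CONNREQUIRED };
    int bad = 0;

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 2; j++)
            if (sk_wild_original(opts[i], protos[j]) !=
                sk_wild_simplified(opts[i]))
                bad++;
    return bad;
}
```

If SO_ACCEPTCONN could be set at this point, the two forms would differ for connection-oriented sockets, which is exactly the case the text rules out.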
The next section of the in_pcbbind function, shown in Figure 22.22, processes the optional nam argument.
72-75
The nam
argument is a nonnull pointer only when the process calls bind
explicitly. For an implicit bind (a side effect of connect, listen
, or in_pcbconnect
, cases 3, 4, and 5 from the beginning of this section), nam
is a null pointer. When the argument is specified, it is an mbuf containing a sockaddr_in
structure. Figure 22.21 shows the four cases for the nonnull nam
argument.
76-83
The test for the correct address family is commented out, yet the identical test in the in_pcbconnect
function (Figure 22.25) is performed. We expect either both to be in or both to be out.
Table 22.22. in_pcbbind
function: process optional nam
argument.
------------------------------------------------------------------------- in_pcb.c
 72     if (nam) {
 73         sin = mtod(nam, struct sockaddr_in *);
 74         if (nam->m_len != sizeof(*sin))
 75             return (EINVAL);
 76 #ifdef notdef
 77         /*
 78          * We should check the family, but old programs
 79          * incorrectly fail to initialize it.
 80          */
 81         if (sin->sin_family != AF_INET)
 82             return (EAFNOSUPPORT);
 83 #endif
 84         lport = sin->sin_port;  /* might be 0 */
 85         if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr))) {
 86             /*
 87              * Treat SO_REUSEADDR as SO_REUSEPORT for multicast;
 88              * allow complete duplication of binding if
 89              * SO_REUSEPORT is set, or if SO_REUSEADDR is set
 90              * and a multicast address is bound on both
 91              * new and duplicated sockets.
 92              */
 93             if (so->so_options & SO_REUSEADDR)
 94                 reuseport = SO_REUSEADDR | SO_REUSEPORT;
 95         } else if (sin->sin_addr.s_addr != INADDR_ANY) {
 96             sin->sin_port = 0;  /* yech... */
 97             if (ifa_ifwithaddr((struct sockaddr *) sin) == 0)
 98                 return (EADDRNOTAVAIL);
 99         }
100         if (lport) {
101             struct inpcb *t;
102             /* GROSS */
103             if (ntohs(lport) < IPPORT_RESERVED &&
104                 (error = suser(p->p_ucred, &p->p_acflag)))
105                 return (error);
106             t = in_pcblookup(head, zeroin_addr, 0,
107                              sin->sin_addr, lport, wild);
108             if (t && (reuseport & t->inp_socket->so_options) == 0)
109                 return (EADDRINUSE);
110         }
111         inp->inp_laddr = sin->sin_addr;     /* might be wildcard */
112     }
------------------------------------------------------------------------- in_pcb.c
85-94
Net/3 tests whether the IP address being bound is a multicast group. If so, the SO_REUSEADDR
option is considered identical to SO_REUSEPORT
.
95-99
Otherwise, if the local address being bound by the caller is not the wildcard, ifa_ifwithaddr
verifies that the address corresponds to a local interface.
The comment “yech” is probably there because the port number in the socket address structure must be 0: ifa_ifwithaddr does a binary comparison of the entire structure, not just a comparison of the IP addresses. This is one of the few instances where the process must zero the socket address structure before issuing the system call. If bind is called and the final 8 bytes of the socket address structure (sin_zero[8]) are nonzero, ifa_ifwithaddr will not find the requested interface, and in_pcbbind will return an error.
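The effect of a whole-structure comparison is easy to reproduce outside the kernel. The sketch below assumes a stand-in structure, `sk_sockaddr_in`, whose field names follow the text but which is not the system's definition; `sk_same_sockaddr` and `sk_make_addr` are hypothetical helpers, and the real ifa_ifwithaddr walks the interface list rather than comparing two structures handed to it.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for sockaddr_in: 16 bytes, no padding between members. */
struct sk_sockaddr_in {
    uint8_t  sin_len;
    uint8_t  sin_family;
    uint16_t sin_port;
    uint32_t sin_addr;
    char     sin_zero[8];
};

/* Compare every byte of the two structures, not just the addresses. */
int sk_same_sockaddr(const struct sk_sockaddr_in *a,
                     const struct sk_sockaddr_in *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}

/* Build an address the way a careful caller would: zero everything first. */
struct sk_sockaddr_in sk_make_addr(uint32_t ip)
{
    struct sk_sockaddr_in s;

    memset(&s, 0, sizeof(s));
    s.sin_len = (uint8_t)sizeof(s);
    s.sin_family = 2;            /* illustrative value for the address family */
    s.sin_addr = ip;
    return s;
}
```

Two structures built by `sk_make_addr` for the same address compare equal, but setting any byte of `sin_zero`, or leaving a nonzero `sin_port`, makes the comparison fail even though the IP addresses agree. That is why line 96 clears the port before the call.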
100-105
The next if
statement is executed when the caller is binding a nonzero port, that is, the process wants to bind one particular port number (the second and fourth scenarios from Figure 22.21). If the requested port is less than 1024 (IPPORT_RESERVED
) the process must have superuser privilege. This is not part of the Internet protocols, but a Berkeley convention. A port number less than 1024 is called a reserved port and is used, for example, by the rcmd
function [Stevens 1990], which in turn is used by the rlogin
and rsh
client programs as part of their authentication with their servers.
106-109
The function in_pcblookup
(Figure 22.16) is then called to check whether a PCB already exists with the same local IP address and local port number. The second argument is the wildcard IP address (the foreign IP address) and the third argument is a port number of 0 (the foreign port). The wildcard value for the second argument causes in_pcblookup
to ignore the foreign IP address and foreign port in the PCB; only the local IP address and local port are compared to sin->sin_addr
and lport
, respectively. We mentioned earlier that wild
is set to INPLOOKUP_WILDCARD
only if neither of the REUSE
socket options are set.
111
The caller’s value for the local IP address is stored in the PCB. This can be the wildcard address, if that’s the value specified by the caller. In this case the local IP address is chosen by the kernel, but not until the socket is connected at some later time. This is because the local IP address is determined by IP routing, based on foreign IP address.
The final section of in_pcbbind
handles the assignment of an ephemeral port when the caller explicitly binds a port of 0, or when the nam
argument is a null pointer (an implicit bind).
Table 22.23. in_pcbbind
function: choose an ephemeral port.
------------------------------------------------------------------------- in_pcb.c
113     if (lport == 0)
114         do {
115             if (head->inp_lport++ < IPPORT_RESERVED ||
116                 head->inp_lport > IPPORT_USERRESERVED)
117                 head->inp_lport = IPPORT_RESERVED;
118             lport = htons(head->inp_lport);
119         } while (in_pcblookup(head,
120                     zeroin_addr, 0, inp->inp_laddr, lport, wild));
121     inp->inp_lport = lport;
122     return (0);
123 }
------------------------------------------------------------------------- in_pcb.c
113-122
The next ephemeral port number to use for this protocol (TCP or UDP) is maintained in the head
of the protocol’s PCB list: tcb
or udb
. Other than the inp_next
and inp_back
pointers in the protocol’s head
PCB, the only other element of the inpcb
structure that is used is the local port number. Confusingly, this local port number is maintained in host byte order in the head
PCB, but in network byte order in all the other PCBs on the list! The ephemeral port numbers start at 1024 (IPPORT_RESERVED
) and get incremented by 1 until port 5000 is used (IPPORT_USERRESERVED
), then cycle back to 1024. The loop is executed until in_pcblookup does not find a match.
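The wraparound of the counter can be traced with a small user-space sketch. It assumes that "in use" is decided only by a list of taken port numbers, whereas the kernel calls in_pcblookup with the local address and the wild flag, and it keeps everything in host byte order instead of calling htons. The names `sk_pick_ephemeral` and `sk_port_in_use` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PORT_RESERVED     1024   /* IPPORT_RESERVED in the text */
#define SK_PORT_USERRESERVED 5000   /* IPPORT_USERRESERVED in the text */

/* Return nonzero if port appears in used[0..n-1]. */
static int sk_port_in_use(const uint16_t *used, size_t n, uint16_t port)
{
    for (size_t i = 0; i < n; i++)
        if (used[i] == port)
            return 1;
    return 0;
}

/*
 * Advance *next, the per-protocol counter that the kernel keeps in the
 * head PCB, until an unused port is found, and return that port.
 * Like the original loop, this never terminates if every port in the
 * range is taken.
 */
uint16_t sk_pick_ephemeral(uint16_t *next, const uint16_t *used, size_t n)
{
    uint16_t lport;

    do {
        /* Same shape as lines 115-117: bump, then clamp into the range. */
        if ((*next)++ < SK_PORT_RESERVED || *next > SK_PORT_USERRESERVED)
            *next = SK_PORT_RESERVED;
        lport = *next;
    } while (sk_port_in_use(used, n, lport));
    return lport;
}
```

Starting from a counter of 0, the first call yields 1024. A counter of 4999 yields 5000 and the call after that wraps back to 1024, matching the range described above.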
Let’s look at some common examples to see the interaction of in_pcbbind
with in_pcblookup
and the two REUSE
socket options.
A TCP or UDP server normally starts by calling socket
and bind
. Assume a TCP server that calls bind
, specifying the wildcard IP address and its nonzero well-known port, say 23 (the Telnet server). Also assume that the server is not already running and that the process does not set the SO_REUSEADDR
socket option.
in_pcbbind
calls in_pcblookup
with INPLOOKUP_WILDCARD
as the final argument. The loop in in_pcblookup
won’t find a matching PCB, assuming no other process is using the server’s well-known TCP port, causing a null pointer to be returned. This is OK and in_pcbbind
returns 0.
Assume the same scenario as above, but with the server already running when someone tries to start the server a second time.
When in_pcblookup
is called it finds the PCB with a local socket of {*, 23}. Since the wildcard
counter is 0, in_pcblookup
returns the pointer to this entry. Since reuseport
is 0, in_pcbbind
returns EADDRINUSE
.
Assume the same scenario as the previous example, but when the attempt is made to start the server a second time, the SO_REUSEADDR
socket option is specified.
Since this socket option is specified, in_pcbbind
calls in_pcblookup
with a final argument of 0. But the PCB with a local socket of {*, 23} is still matched and returned because wildcard
is 0, since in_pcblookup
cannot compare the two wildcard addresses (Figure 22.17). in_pcbbind
again returns EADDRINUSE
, preventing us from starting two instances of the server with identical local sockets, regardless of whether we specify SO_REUSEADDR
or not.
Assume that a Telnet server is already running with a local socket of {*, 23} and we try to start another with a local socket of {140.252.13.35, 23}.
Assuming SO_REUSEADDR
is not specified, in_pcblookup
is called with a final argument of INPLOOKUP_WILDCARD
. When it compares the PCB containing *.23
, the counter wildcard
is set to 1. Since a wildcard match is allowed, this match is remembered as the best match and a pointer to it is returned after all the TCP PCBs are scanned. in_pcbbind
returns EADDRINUSE
.
This example is the same as the previous one, but we specify the SO_REUSEADDR
socket option for the second server that tries to bind the local socket {140.252.13.35, 23}.
The final argument to in_pcblookup
is now 0, since the socket option is specified. When the PCB with the local socket {*, 23} is compared, the wildcard
counter is 1, but since the final flags
argument is 0, this entry is skipped and is not remembered as a match.
After comparing all the TCP PCBs, the function returns a null pointer and in_pcbbind
returns 0.
Assume the first Telnet server is started with a local socket of {140.252.13.35, 23} when we try to start a second server with a local socket of {*, 23}. This is the same as the previous example, except we’re starting the servers in reverse order this time.
The first server is started without a problem, assuming no other socket has already bound port 23. When we start the second server, the final argument to in_pcblookup
is INPLOOKUP_WILDCARD
, assuming the SO_REUSEADDR
socket option is not specified. When the PCB with the local socket of {140.252.13.35, 23} is compared, the wildcard
counter is set to 1 and this entry is remembered. After all the TCP PCBs are compared, the pointer to this entry is returned, causing in_pcbbind
to return EADDRINUSE
.
What if we start two instances of a server, both with a nonwildcard local IP address? Assume we start the first Telnet server with a local socket of {140.252.13.35, 23} and then try to start a second with a local socket of {127.0.0.1, 23}, without specifying SO_REUSEADDR
.
When the second server calls in_pcbbind
, it calls in_pcblookup
with a final argument of INPLOOKUP_WILDCARD
. When the PCB with the local socket of {140.252.13.35, 23} is compared, it is skipped because the local IP addresses are not equal. in_pcblookup
returns a null pointer, and in_pcbbind
returns 0.
From this example we see that the SO_REUSEADDR
socket option has no effect on nonwildcard IP addresses. Indeed the test on the flags value INPLOOKUP_WILDCARD
in in_pcblookup
is made only when wildcard
is greater than 0, that is, when either the PCB entry has a wildcard IP address or the IP address being bound is the wildcard.
As a final example, assume we try to start two instances of the same server, both with the same nonwildcard local IP address, say 127.0.0.1.
When the second server is started, in_pcblookup
always returns a pointer to the matching PCB with the same local socket. This happens regardless of the SO_REUSEADDR
socket option, because the wildcard
counter is always 0 for this comparison. Since in_pcblookup
returns a nonnull pointer, in_pcbbind
returns EADDRINUSE
.
From these examples we can state the rules about the binding of local IP addresses and the SO_REUSEADDR
socket option. These rules are shown in Figure 22.24. We assume that localIP1 and localIP2 are two different unicast or broadcast IP addresses valid on the local host, and that localmcastIP is a multicast group. We also assume that the process is trying to bind the same nonzero port number that is already bound to the existing PCB.
Table 22.24. Effect of SO_REUSEADDR
socket option on binding of local IP address.
Existing PCB | Try to bind | SO_REUSEADDR off | SO_REUSEADDR on | Description |
---|---|---|---|---|
localIP1 | localIP1 | error | error | one server per IP address and port |
localIP1 | localIP2 | OK | OK | one server for each local interface |
localIP1 | * | error | OK | one server for one interface, other server for remaining interfaces |
* | localIP1 | error | OK | one server for one interface, other server for remaining interfaces |
* | * | error | error | can’t duplicate local sockets (same as first example) |
localmcastIP | localmcastIP | error | OK | multiple multicast recipients |
We need to differentiate between a unicast or broadcast address and a multicast address, because we saw that in_pcbbind
considers SO_REUSEADDR
to be the same as SO_REUSEPORT
for a multicast address.
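The unicast rows of these rules reduce to a few lines of logic. The following is a sketch only, assuming the existing PCB is unconnected, both sockets use the same nonzero port, and SO_REUSEPORT is not involved; the multicast row is deliberately left out. `sk_bind_conflicts` and `SK_ANY` are hypothetical names.

```c
#include <stdint.h>

#define SK_ANY 0u   /* the wildcard local address */

/*
 * Return 1 if binding new_ip would fail with EADDRINUSE while an
 * existing PCB holds existing_ip on the same port, or 0 if the bind
 * would succeed. reuseaddr is nonzero when the new socket has set
 * SO_REUSEADDR. Unicast and broadcast addresses only.
 */
int sk_bind_conflicts(uint32_t existing_ip, uint32_t new_ip, int reuseaddr)
{
    int wildcard = 0;

    if (existing_ip != SK_ANY) {
        if (new_ip == SK_ANY)
            wildcard++;
        else if (existing_ip != new_ip)
            return 0;     /* two different specific addresses never collide */
    } else if (new_ip != SK_ANY) {
        wildcard++;
    }

    /*
     * With SO_REUSEADDR set the lookup is made without the wildcard
     * flag, so a match that needed a wildcard is ignored.
     */
    if (wildcard && reuseaddr)
        return 0;

    return 1;             /* identical local sockets always conflict */
}
```

The option changes the outcome only when exactly one of the two addresses is the wildcard, which is the pattern of the third and fourth rows.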
The function in_pcbconnect
specifies the foreign IP address and foreign port number for a socket. It is called from four functions:
1. from connect for a TCP socket (required for a TCP client);
2. from connect for a UDP socket (optional for a UDP client, rare for a UDP server);
3. from sendto when a datagram is output on an unconnected UDP socket (common); and
4. from tcp_input when a connection request (a SYN segment) arrives on a TCP socket that is in the LISTEN state (standard for a TCP server).
In all four cases it is common, though not required, for the local IP address and local port to be unspecified when in_pcbconnect
is called. Therefore one function of in_pcbconnect
is to assign the local values when they are unspecified.
We’ll discuss the in_pcbconnect
function in four sections. Figure 22.25 shows the first section.
Table 22.25. in_pcbconnect
function: verify arguments, check foreign IP address.
------------------------------------------------------------------------- in_pcb.c
130 int
131 in_pcbconnect(inp, nam)
132 struct inpcb *inp;
133 struct mbuf *nam;
134 {
135     struct in_ifaddr *ia;
136     struct sockaddr_in *ifaddr;
137     struct sockaddr_in *sin = mtod(nam, struct sockaddr_in *);
138     if (nam->m_len != sizeof(*sin))
139         return (EINVAL);
140     if (sin->sin_family != AF_INET)
141         return (EAFNOSUPPORT);
142     if (sin->sin_port == 0)
143         return (EADDRNOTAVAIL);
144     if (in_ifaddr) {
145         /*
146          * If the destination address is INADDR_ANY,
147          * use the primary local address.
148          * If the supplied address is INADDR_BROADCAST,
149          * and the primary interface supports broadcast,
150          * choose the broadcast address for that interface.
151          */
152 #define satosin(sa)     ((struct sockaddr_in *)(sa))
153 #define sintosa(sin)    ((struct sockaddr *)(sin))
154 #define ifatoia(ifa)    ((struct in_ifaddr *)(ifa))
155         if (sin->sin_addr.s_addr == INADDR_ANY)
156             sin->sin_addr = IA_SIN(in_ifaddr)->sin_addr;
157         else if (sin->sin_addr.s_addr == (u_long) INADDR_BROADCAST &&
158                  (in_ifaddr->ia_ifp->if_flags & IFF_BROADCAST))
159             sin->sin_addr = satosin(&in_ifaddr->ia_broadaddr)->sin_addr;
160     }
------------------------------------------------------------------------- in_pcb.c
130-143
The nam
argument points to an mbuf containing a sockaddr_in
structure with the foreign IP address and port number. These lines validate the argument and verify that the caller is not trying to connect to a port number of 0.
144-160
The test of the global in_ifaddr
verifies that an IP interface has been configured. If the foreign IP address is 0.0.0.0 (INADDR_ANY
), then 0.0.0.0 is replaced with the IP address of the primary IP interface. This means the calling process is connecting to a peer on this host. If the foreign IP address is 255.255.255.255 (INADDR_BROADCAST
) and the primary interface supports broadcasting, then 255.255.255.255 is replaced with the broadcast address of the primary interface. This allows a UDP application to broadcast on the primary interface without having to figure out its IP address; it can simply send datagrams to 255.255.255.255, and the kernel converts this to the appropriate IP address for the interface.
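The two substitutions can be expressed as a small pure function. This is a sketch under the assumption that the primary interface is described by a hypothetical `sk_primary_if` structure; `sk_fix_foreign` is likewise an invented name, and a null pointer stands for the case where no interface has been configured.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ANY       0x00000000u   /* 0.0.0.0, INADDR_ANY */
#define SK_BROADCAST 0xffffffffu   /* 255.255.255.255, INADDR_BROADCAST */

/* Hypothetical description of the primary IP interface. */
struct sk_primary_if {
    uint32_t addr;            /* its IP address */
    uint32_t broadaddr;       /* its broadcast address */
    int      can_broadcast;   /* nonzero if the interface supports broadcast */
};

/* Return the foreign address that would actually be used for dst. */
uint32_t sk_fix_foreign(uint32_t dst, const struct sk_primary_if *pif)
{
    if (pif == NULL)
        return dst;                    /* no interface: leave dst untouched */
    if (dst == SK_ANY)
        return pif->addr;              /* connect to a peer on this host */
    if (dst == SK_BROADCAST && pif->can_broadcast)
        return pif->broadaddr;         /* broadcast on the primary interface */
    return dst;
}
```

Any other destination, including 255.255.255.255 on an interface that cannot broadcast, passes through unchanged.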
The next section of code, Figure 22.26, handles the case of an unspecified local address. This is the common scenario for TCP and UDP clients, cases 1, 2, and 3 from the list at the beginning of this section.
Table 22.26. in_pcbconnect
function: local IP address not yet specified.
------------------------------------------------------------------------- in_pcb.c
161     if (inp->inp_laddr.s_addr == INADDR_ANY) {
162         struct route *ro;
163         ia = (struct in_ifaddr *) 0;
164         /*
165          * If route is known or can be allocated now,
166          * our src addr is taken from the i/f, else punt.
167          */
168         ro = &inp->inp_route;
169         if (ro->ro_rt &&
170             (satosin(&ro->ro_dst)->sin_addr.s_addr !=
171              sin->sin_addr.s_addr ||
172              inp->inp_socket->so_options & SO_DONTROUTE)) {
173             RTFREE(ro->ro_rt);
174             ro->ro_rt = (struct rtentry *) 0;
175         }
176         if ((inp->inp_socket->so_options & SO_DONTROUTE) == 0 &&   /* XXX */
177             (ro->ro_rt == (struct rtentry *) 0 ||
178              ro->ro_rt->rt_ifp == (struct ifnet *) 0)) {
179             /* No route yet, so try to acquire one */
180             ro->ro_dst.sa_family = AF_INET;
181             ro->ro_dst.sa_len = sizeof(struct sockaddr_in);
182             ((struct sockaddr_in *) &ro->ro_dst)->sin_addr =
183                 sin->sin_addr;
184             rtalloc(ro);
185         }
186         /*
187          * If we found a route, use the address
188          * corresponding to the outgoing interface
189          * unless it is the loopback (in case a route
190          * to our address on another net goes to loopback).
191          */
192         if (ro->ro_rt && !(ro->ro_rt->rt_ifp->if_flags & IFF_LOOPBACK))
193             ia = ifatoia(ro->ro_rt->rt_ifa);
194         if (ia == 0) {
195             u_short fport = sin->sin_port;
196             sin->sin_port = 0;
197             ia = ifatoia(ifa_ifwithdstaddr(sintosa(sin)));
198             if (ia == 0)
199                 ia = ifatoia(ifa_ifwithnet(sintosa(sin)));
200             sin->sin_port = fport;
201             if (ia == 0)
202                 ia = in_ifaddr;
203             if (ia == 0)
204                 return (EADDRNOTAVAIL);
205         }
------------------------------------------------------------------------- in_pcb.c
164-175
If a route is held by the PCB but the destination of that route differs from the foreign address being connected to, or the SO_DONTROUTE
socket option is set, that route is released.
To understand why a PCB may have an associated route, consider case 3 from the list at the beginning of this section: in_pcbconnect
is called every time a UDP datagram is sent on an unconnected socket. Each time a process calls sendto
, the UDP output function calls in_pcbconnect, ip_output
, and in_pcbdisconnect
. If all the datagrams sent on the socket go to the same destination IP address, then the first time through in_pcbconnect
the route is allocated and it can be used from that point on. But since a UDP application can send datagrams to a different IP address with each call to sendto
, the destination address must be compared to the saved route and the route released when the destination changes. This same test is done in ip_output
, which seems to be redundant.
The SO_DONTROUTE
socket option tells the kernel to bypass the normal routing decisions and send the IP datagram to the locally attached interface whose IP network address matches the network portion of the destination address.
176-185
If the SO_DONTROUTE
socket option is not set, and a route to the destination is not held by the PCB, try to acquire one by calling rtalloc
.
186-205
The goal in this section of code is to have ia
point to an interface address structure (in_ifaddr
, Section 6.5), which contains the IP address of the interface. If the PCB holds a route that is still valid, or if rtalloc
found a route, and the route is not to the loopback interface, the corresponding interface is used. Otherwise ifa_ifwithdstaddr and ifa_ifwithnet are called to check if the foreign IP address is on the other end of a point-to-point link or on an attached network. Both of these functions require that the port number in the socket address structure be 0, so it is saved in fport
across the calls. If this fails, the primary IP address is used (in_ifaddr
), and if no interfaces are configured (in_ifaddr
is zero), an error is returned.
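The order of the fallbacks is the essential point of this code, and it can be isolated in a sketch. The version below assumes that each lookup has already been done and its result passed in as a possibly null pointer; `sk_ifaddr` and `sk_pick_source` are hypothetical names, and the routing and interface searches themselves are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface address record. */
struct sk_ifaddr {
    uint32_t addr;
};

/*
 * Choose the interface address that supplies the local IP address.
 * A NULL result corresponds to the EADDRNOTAVAIL return on line 204.
 */
const struct sk_ifaddr *
sk_pick_source(const struct sk_ifaddr *route_ifa, int route_is_loopback,
               const struct sk_ifaddr *p2p_peer_ifa,
               const struct sk_ifaddr *same_net_ifa,
               const struct sk_ifaddr *primary)
{
    const struct sk_ifaddr *ia = NULL;

    /* 1. The interface of the route, unless the route is to the loopback. */
    if (route_ifa != NULL && !route_is_loopback)
        ia = route_ifa;
    /* 2. An interface whose point-to-point peer is the destination. */
    if (ia == NULL)
        ia = p2p_peer_ifa;
    /* 3. An interface on the same network as the destination. */
    if (ia == NULL)
        ia = same_net_ifa;
    /* 4. The primary interface address, if any interface is configured. */
    if (ia == NULL)
        ia = primary;
    return ia;
}
```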
Figure 22.27 shows the next section of in_pcbconnect
, which handles a destination address that is a multicast address.
Table 22.27. in_pcbconnect
function: destination address is a multicast address.
------------------------------------------------------------------------- in_pcb.c
206         /*
207          * If the destination address is multicast and an outgoing
208          * interface has been set as a multicast option, use the
209          * address of that interface as our source address.
210          */
211         if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr)) &&
212             inp->inp_moptions != NULL) {
213             struct ip_moptions *imo;
214             struct ifnet *ifp;
215             imo = inp->inp_moptions;
216             if (imo->imo_multicast_ifp != NULL) {
217                 ifp = imo->imo_multicast_ifp;
218                 for (ia = in_ifaddr; ia; ia = ia->ia_next)
219                     if (ia->ia_ifp == ifp)
220                         break;
221                 if (ia == 0)
222                     return (EADDRNOTAVAIL);
223             }
224         }
225         ifaddr = (struct sockaddr_in *) &ia->ia_addr;
226     }
------------------------------------------------------------------------- in_pcb.c
206-223
If the destination address is a multicast address and the process has specified the outgoing interface to use for multicast packets (using the IP_MULTICAST_IF
socket option), then the IP address of that interface is used as the local address. A search is made of all IP interfaces for the one matching the interface that was specified with the socket option. An error is returned if that interface is no longer up.
224-225
The code that started at the beginning of Figure 22.26 to handle the case of a wildcard local address is complete. The pointer to the sockaddr_in
structure for the local interface ia
is saved in ifaddr
.
The final section of in_pcbconnect is shown in Figure 22.28.
Table 22.28. in_pcbconnect
function: verify that socket pair is unique.
------------------------------------------------------------------------- in_pcb.c
227     if (in_pcblookup(inp->inp_head,
228                      sin->sin_addr,
229                      sin->sin_port,
230                      inp->inp_laddr.s_addr ? inp->inp_laddr : ifaddr->sin_addr,
231                      inp->inp_lport,
232                      0))
233         return (EADDRINUSE);
234     if (inp->inp_laddr.s_addr == INADDR_ANY) {
235         if (inp->inp_lport == 0)
236             (void) in_pcbbind(inp, (struct mbuf *) 0);
237         inp->inp_laddr = ifaddr->sin_addr;
238     }
239     inp->inp_faddr = sin->sin_addr;
240     inp->inp_fport = sin->sin_port;
241     return (0);
242 }
------------------------------------------------------------------------- in_pcb.c
227-233
in_pcblookup
verifies that the socket pair is unique. The foreign address and foreign port are the values specified as arguments to in_pcbconnect
. The local address is either the value that was already bound to the socket or the value in ifaddr
that was calculated in the code we just described. The local port can be 0, which is typical for a TCP client, and we’ll see that later in this section of code an ephemeral port is chosen for the local port.
This test prevents two TCP connections to the same foreign address and foreign port from the same local address and local port. For example, if we establish a TCP connection with the echo server on the host sun
and then try to establish another connection to the same server from the same local port (8888, specified with the -b
option), the call to in_pcblookup
returns a match, causing connect
to return the error EADDRINUSE
. (We use the sock
program from Appendix C of Volume 1.)
bsdi $ sock -b 8888 sun echo &          start first one in the background
bsdi $ sock -A -b 8888 sun echo         then try again
connect() error: Address already in use
We specify the -A
option to set the SO_REUSEADDR
socket option, which lets the bind
succeed, but the connect
cannot succeed. This is a contrived example, as we explicitly bound the same local port (8888) to both sockets. In the normal scenario of two different clients from the host bsdi
to the echo server on the host sun
, the local port will be 0 when the second client calls in_pcblookup
from Figure 22.28.
This test also prevents two UDP sockets from being connected to the same foreign address from the same local port. This test does not prevent two UDP sockets from alternately sending datagrams to the same foreign address from the same local port, as long as neither calls connect
, since a UDP socket is only temporarily connected to a peer for the duration of a sendto
system call.
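Because the lookup is made with a flags argument of 0, and the caller's local and foreign addresses are both specific, the uniqueness test amounts to an exact comparison of the four values. The sketch below assumes exactly that; `sk_pcb` and `sk_pair_in_use` are hypothetical names, and the addresses in the usage note are placeholders rather than the real addresses of bsdi and sun.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified PCB holding one socket pair. */
struct sk_pcb {
    uint32_t laddr, faddr;   /* local and foreign IP address */
    uint16_t lport, fport;   /* local and foreign port */
};

/*
 * Return 1 if some PCB already holds exactly this socket pair.
 * Assumes laddr and faddr are specific (nonwildcard) addresses, so a
 * PCB with a wildcard in either position can never match.
 */
int sk_pair_in_use(const struct sk_pcb *pcbs, size_t n,
                   uint32_t faddr, uint16_t fport,
                   uint32_t laddr, uint16_t lport)
{
    for (size_t i = 0; i < n; i++) {
        const struct sk_pcb *p = &pcbs[i];

        if (p->lport == lport && p->laddr == laddr &&
            p->faddr == faddr && p->fport == fport)
            return 1;
    }
    return 0;
}
```

With one connected PCB from local port 8888 to the echo port (7) on the peer, a second attempt from port 8888 reports a conflict, while the usual client case of a local port of 0 does not, since no connected PCB has a local port of 0.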
234-238
If the local address is still wildcarded for the socket, it is set to the value saved in ifaddr
. This is an implicit bind: cases 3, 4, and 5 from the beginning of Section 22.7. First a check is made as to whether the local port has been bound yet, and if not, in_pcbbind
binds an ephemeral port to the socket. The order of the call to in_pcbbind
and the assignment to inp_laddr
is important, since in_pcbbind
fails if the local address is not the wildcard address.
239-240
The final step of this function sets the foreign IP address and foreign port number in the PCB. We are guaranteed, on successful return from this function, that both socket pairs in the PCB, the local and foreign, are filled in with specific values.
There is a subtle difference between the source address in the IP datagram and the IP address of the interface used to send the datagram.
The PCB member inp_laddr
is used by TCP and UDP as the source address of the IP datagram. It can be set by the process to the IP address of any configured interface by bind
. (The call to ifa_ifwithaddr
in in_pcbbind
verifies the local address desired by the application.) in_pcbconnect
assigns the local address only if it is a wildcard, and when this happens the local address is based on the outgoing interface (since the destination address is known).
The outgoing interface, however, is also determined by ip_output
based on the destination IP address. On a multihomed host it is possible for the source address to be a local interface that is not the outgoing interface, when the process explicitly binds a local address that differs from the outgoing interface. This is allowed because Net/3 chooses the weak end system model (Section 8.4).
A UDP socket is disconnected by in_pcbdisconnect
. This removes the foreign association by setting the foreign IP address to all 0s (INADDR_ANY
) and foreign port number to 0.
This is done after a datagram has been sent on an unconnected UDP socket and when connect
is called on a connected UDP socket. In the first case the sequence of steps when the process calls sendto
is: UDP calls in_pcbconnect
to connect the socket temporarily to the destination, udp_output
sends the datagram, and then in_pcbdisconnect
removes the temporary connection.
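The connect, send, disconnect sequence for an unconnected UDP socket can be sketched as follows. This is a toy model, assuming a PCB reduced to its foreign address and port; the "send" step only records where the datagram would have gone, and all names (`sk_pcb`, `sk_connect`, `sk_disconnect`, `sk_sendto`) are hypothetical. Refusing a destination on an already connected socket is a choice made for the sketch.

```c
#include <stdint.h>

#define SK_ANY 0u   /* foreign address of an unconnected socket */

struct sk_pcb {
    uint32_t faddr;        /* foreign IP address, SK_ANY when unconnected */
    uint16_t fport;        /* foreign port, 0 when unconnected */
    uint32_t last_dst;     /* sketch only: destination of the last datagram */
    uint16_t last_dport;
};

/* Set the foreign association (the part of in_pcbconnect modeled here). */
static void sk_connect(struct sk_pcb *inp, uint32_t addr, uint16_t port)
{
    inp->faddr = addr;
    inp->fport = port;
}

/* Remove the foreign association, as in_pcbdisconnect does. */
static void sk_disconnect(struct sk_pcb *inp)
{
    inp->faddr = SK_ANY;
    inp->fport = 0;
}

/* Send one datagram on an unconnected socket; return 0 on success. */
int sk_sendto(struct sk_pcb *inp, uint32_t dst, uint16_t dport)
{
    if (inp->faddr != SK_ANY)
        return -1;                  /* already connected: refuse in this sketch */

    sk_connect(inp, dst, dport);    /* temporary connection */
    inp->last_dst = inp->faddr;     /* stand-in for building and sending */
    inp->last_dport = inp->fport;
    sk_disconnect(inp);             /* socket is unconnected again */
    return 0;
}
```

After each call the PCB is back in the unconnected state, so the next call is free to name a different destination.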
in_pcbdisconnect
is not called when a socket is closed since in_pcbdetach
handles the release of the PCB. A disconnect is required only when the PCB needs to be reused for a different foreign address or port number.
Figure 22.29 shows the function in_pcbdisconnect
.
Table 22.29. in_pcbdisconnect
function: disconnect from foreign address and port number.
------------------------------------------------------------------------- in_pcb.c
243 int
244 in_pcbdisconnect(inp)
245 struct inpcb *inp;
246 {
247     inp->inp_faddr.s_addr = INADDR_ANY;
248     inp->inp_fport = 0;
249     if (inp->inp_socket->so_state & SS_NOFDREF)
250         in_pcbdetach(inp);
251 }
------------------------------------------------------------------------- in_pcb.c
If there is no longer a file table reference for this PCB (SS_NOFDREF
is set) then in_pcbdetach
(Figure 22.7) releases the PCB.
The getsockname
system call returns the local protocol address of a socket (e.g., the IP address and port number for an Internet socket) and the getpeername
system call returns the foreign protocol address. Both system calls end up issuing a PRU_SOCKADDR
request or a PRU_PEERADDR
request. The protocol then calls either in_setsockaddr
or in_setpeeraddr
. We show the first of these in Figure 22.30.
Figure 22.30. in_setsockaddr function: return local address and port number.
------------------------------------------------------------------------- in_pcb.c
int
in_setsockaddr(inp, nam)
struct inpcb *inp;
struct mbuf *nam;
{
    struct sockaddr_in *sin;

    nam->m_len = sizeof(*sin);
    sin = mtod(nam, struct sockaddr_in *);
    bzero((caddr_t) sin, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_len = sizeof(*sin);
    sin->sin_port = inp->inp_lport;
    sin->sin_addr = inp->inp_laddr;
}
------------------------------------------------------------------------- in_pcb.c
The argument nam
is a pointer to an mbuf that will hold the result: a sockaddr_in
structure that the system call copies back to the process. The code fills in the socket address structure and copies the IP address and port number from the Internet PCB into the sin_addr
and sin_port
members.
Figure 22.31 shows the in_setpeeraddr
function. It is nearly identical to Figure 22.30, but copies the foreign IP address and port number from the PCB.
Figure 22.31. in_setpeeraddr function: return foreign address and port number.
------------------------------------------------------------------------- in_pcb.c
int
in_setpeeraddr(inp, nam)
struct inpcb *inp;
struct mbuf *nam;
{
    struct sockaddr_in *sin;

    nam->m_len = sizeof(*sin);
    sin = mtod(nam, struct sockaddr_in *);
    bzero((caddr_t) sin, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_len = sizeof(*sin);
    sin->sin_port = inp->inp_fport;
    sin->sin_addr = inp->inp_faddr;
}
------------------------------------------------------------------------- in_pcb.c
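From the process side, the local address that the kernel stored in the PCB can be read back with getsockname. The following is a minimal sketch using the standard sockets calls; the function name local_port_via_getsockname is ours, and binding to the wildcard address with port 0 is just one convenient way to obtain a kernel-chosen port.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a UDP socket to the wildcard address with port 0 and ask the
 * kernel which local port it chose. Returns the port in host byte
 * order, or -1 if any call fails.
 */
int local_port_via_getsockname(void)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    int port = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(0);            /* 0 asks for an ephemeral port */

    if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) == 0 &&
        getsockname(fd, (struct sockaddr *) &sin, &len) == 0)
        port = ntohs(sin.sin_port);     /* sin_port is in network byte order */

    close(fd);
    return port;
}
```

A call to getpeername on a connected socket has the same shape and returns the foreign address and port instead.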
The function in_pcbnotify
is called when an ICMP error is received, in order to notify the appropriate process of the error. The “appropriate process” is found by searching all the PCBs for one of the protocols (TCP or UDP) and comparing the local and foreign IP addresses and port numbers with the values returned in the ICMP error. For example, when an ICMP source quench error is received in response to a TCP segment that some router discarded, TCP must locate the PCB for the connection that caused the error and slow down the transmission on that connection.
Before showing the function we must review how it is called. Figure 22.32 summarizes the functions called to process an ICMP error. The two shaded ellipses are the functions described in this section.
When an ICMP message is received, icmp_input
is called. Five of the ICMP messages are classified as errors (Figures 11.1 and 11.2):
destination unreachable,
parameter problem,
redirect,
source quench, and
time exceeded.
Redirects are handled differently from the other four errors. All other ICMP messages (the queries) are handled as described in Chapter 11.
Each protocol defines its control input function, the pr_ctlinput
entry in the protosw
structure (Section 7.4). The ones for TCP and UDP are named tcp_ctlinput
and udp_ctlinput
, and we’ll show their code in later chapters. Since the ICMP error that is received contains the IP header of the datagram that caused the error, the protocol that caused the error (TCP or UDP) is known. Four of the five ICMP errors cause that protocol’s control input function to be called. Redirects are handled differently: the function pfctlinput
is called, and it in turn calls the control input functions for all the protocols in the family (Internet). TCP and UDP are the only protocols in the Internet family with control input functions.
Redirects are handled specially because they affect all IP datagrams going to that destination, not just the one that caused the redirect. On the other hand, the other four errors need only be processed by the protocol that caused the error.
The final points we need to make about Figure 22.32 are that TCP handles source quenches differently from the other errors, and redirects are handled specially by in_pcbnotify:
the function in_rtchange
is called, regardless of the protocol that caused the error.
Figure 22.33 shows the in_pcbnotify
function. When it is called by TCP, the first argument is the address of tcb
and the final argument is the address of the function tcp_notify
. For UDP, these two arguments are the address of udb
and the address of the function udp_notify
.
Figure 22.33. in_pcbnotify function: pass error notification to processes.
------------------------------------------------------------------------- in_pcb.c
306 int
307 in_pcbnotify(head, dst, fport_arg, laddr, lport_arg, cmd, notify)
308 struct inpcb *head;
309 struct sockaddr *dst;
310 u_int fport_arg, lport_arg;
311 struct in_addr laddr;
312 int cmd;
313 void (*notify) (struct inpcb *, int);
314 {
315     extern u_char inetctlerrmap[];
316     struct inpcb *inp, *oinp;
317     struct in_addr faddr;
318     u_short fport = fport_arg, lport = lport_arg;
319     int errno;
320     if ((unsigned) cmd > PRC_NCMDS || dst->sa_family != AF_INET)
321         return;
322     faddr = ((struct sockaddr_in *) dst)->sin_addr;
323     if (faddr.s_addr == INADDR_ANY)
324         return;
325     /*
326      * Redirects go to all references to the destination,
327      * and use in_rtchange to invalidate the route cache.
328      * Dead host indications: notify all references to the destination.
329      * Otherwise, if we have knowledge of the local port and address,
330      * deliver only to that socket.
331      */
332     if (PRC_IS_REDIRECT(cmd) || cmd == PRC_HOSTDEAD) {
333         fport = 0;
334         lport = 0;
335         laddr.s_addr = 0;
336         if (cmd != PRC_HOSTDEAD)
337             notify = in_rtchange;
338     }
339     errno = inetctlerrmap[cmd];
340     for (inp = head->inp_next; inp != head;) {
341         if (inp->inp_faddr.s_addr != faddr.s_addr ||
342             inp->inp_socket == 0 ||
343             (lport && inp->inp_lport != lport) ||
344             (laddr.s_addr && inp->inp_laddr.s_addr != laddr.s_addr) ||
345             (fport && inp->inp_fport != fport)) {
346             inp = inp->inp_next;
347             continue;           /* skip this PCB */
348         }
349         oinp = inp;
350         inp = inp->inp_next;
351         if (notify)
352             (*notify) (oinp, errno);
353     }
354 }
------------------------------------------------------------------------- in_pcb.c
306-324
The cmd
argument and the address family of the destination are verified. The foreign address is checked to ensure it is not 0.0.0.0.
325-338
If the error is a redirect it is handled specially. (The error PRC_HOSTDEAD
is an old error that was generated by the IMPs. Current systems should never see this error; it is a historical artifact.) The foreign port, local port, and local address are all set to 0 so that the for
loop that follows won’t compare them. For a redirect we want that loop to select the PCBs to receive notification based only on the foreign IP address, because that is the IP address for which our host received a redirect. Also, the function that is called for a redirect is in_rtchange
(Figure 22.34) instead of the notify
argument specified by the caller.
Figure 22.34. in_rtchange function: invalidate route.
------------------------------------------------------------------------- in_pcb.c
void
in_rtchange(inp, errno)
struct inpcb *inp;
int errno;
{
    if (inp->inp_route.ro_rt) {
        rtfree(inp->inp_route.ro_rt);
        inp->inp_route.ro_rt = 0;
        /*
         * A new route can be allocated the next time
         * output is attempted.
         */
    }
}
------------------------------------------------------------------------- in_pcb.c
339
The global array inetctlerrmap
maps one of the protocol-independent error codes (the PRC_
xxx values from Figure 11.19) into its corresponding Unix errno
value (the final column in Figure 11.1).
340-353
This loop selects the PCBs to be notified. Multiple PCBs can be notified; the loop keeps going even after a match is located. The first if
statement combines five tests, and if any one of the five is true, the PCB is skipped: (1) if the foreign addresses are unequal, (2) if the PCB does not have a corresponding socket
structure, (3) if the local ports are unequal, (4) if the local addresses are unequal, or (5) if the foreign ports are unequal. The foreign addresses must match, while the other three foreign and local elements are compared only if the corresponding argument is nonzero. When a match is found, the notify
function is called.
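The five tests can be restated as a stand-alone predicate. This is a sketch over a simplified, hypothetical toy_pcb structure (addresses and ports as plain integers, a flag in place of the socket pointer), not the kernel's inpcb; the argument order loosely follows in_pcbnotify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified PCB: just the fields the loop compares. */
struct toy_pcb {
    uint32_t laddr, faddr;      /* local and foreign IP address, 0 = wildcard */
    uint16_t lport, fport;      /* local and foreign port, 0 = none */
    bool     has_socket;        /* stands in for inp_socket != 0 */
};

/*
 * Returns true if the PCB should be notified. The foreign address must
 * always match; fport, laddr, and lport are compared only when the
 * corresponding argument is nonzero.
 */
bool toy_notify_match(const struct toy_pcb *p, uint32_t faddr,
                      uint16_t fport, uint32_t laddr, uint16_t lport)
{
    assert(p != NULL);
    if (p->faddr != faddr)
        return false;           /* (1) foreign addresses unequal */
    if (!p->has_socket)
        return false;           /* (2) no corresponding socket */
    if (lport && p->lport != lport)
        return false;           /* (3) local ports unequal */
    if (laddr && p->laddr != laddr)
        return false;           /* (4) local addresses unequal */
    if (fport && p->fport != fport)
        return false;           /* (5) foreign ports unequal */
    return true;
}
```

Because faddr is never 0 when the loop is reached, a PCB whose foreign address is still the wildcard fails test (1) and is never notified.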
We saw that in_pcbnotify
calls the function in_rtchange
when the ICMP error is a redirect. This function is called for all PCBs with a foreign address that matches the IP address that has been redirected. Figure 22.34 shows the in_rtchange
function.
If the PCB holds a route, that route is released by rtfree
, and the PCB member is marked as empty. We don’t try to update the route at this time, using the new router address returned in the redirect. The new route will be allocated by ip_output
when this PCB is used next, based on the kernel’s routing table, which is updated by the redirect, before pfctlinput
is called.
Let’s examine the interaction of redirects, raw sockets, and the cached route in the PCB. If we run the Ping program, which uses a raw socket, and an ICMP redirect error is received for the IP address being pinged, Ping continues using the original route, not the redirected route. We can see this as follows.
We ping the host svr4
on the 140.252.13 network from the host gemini
on the 140.252.1 network. The default router for gemini
is gateway
, but the packets should be sent to the router netb
instead. Figure 22.35 shows the arrangement.
We expect gateway
to send a redirect when it receives the first ICMP echo request.
gemini $ ping -sv svr4
PING 140.252.13.34: 56 data bytes
ICMP Host redirect from gateway 140.252.1.4
to netb (140.252.1.183) for svr4 (140.252.13.34)
64 bytes from svr4 (140.252.13.34): icmp_seq=0. time=572. ms
ICMP Host redirect from gateway 140.252.1.4
to netb (140.252.1.183) for svr4 (140.252.13.34)
64 bytes from svr4 (140.252.13.34): icmp_seq=1. time=392. ms
The -s
option causes an ICMP echo request to be sent once a second, and the -v
option prints every received ICMP message (instead of only the ICMP echo replies).
Every ICMP echo request elicits a redirect, but the raw socket used by ping never notices the redirect to change the route that it is using. The route that is first calculated and stored in the PCB, causing the IP datagrams to be sent to the router gateway
(140.252.1.4), should be updated so that the datagrams are sent to the router netb
(140.252.1.183) instead. We see that the ICMP redirects are received by the kernel on gemini
, but they appear to be ignored.
If we terminate the program and start it again, we never see a redirect:
gemini $ ping -sv svr4
PING 140.252.13.34: 56 data bytes
64 bytes from svr4 (140.252.13.34): icmp_seq=0. time=388. ms
64 bytes from svr4 (140.252.13.34): icmp_seq=1. time=363. ms
The reason for this anomaly is that the raw IP socket code (Chapter 32) does not have a control input function. Only TCP and UDP have a control input function. When the redirect error is received, ICMP updates the kernel’s routing table accordingly, and pfctlinput
is called (Figure 22.32). But since there is no control input function for the raw IP protocol, the cached route in the PCB associated with Ping’s raw socket is never released. When we start the Ping program a second time, however, the route that is allocated is based on the kernel’s updated routing table, and we never see the redirects.
One confusing part of the sockets API is that ICMP errors received on a UDP socket are not passed to the application unless the application has issued a connect
on the socket, restricting the foreign IP address and port number for the socket. We now see where this limitation is enforced by in_pcbnotify
.
Consider an ICMP port unreachable, probably the most common ICMP error on a UDP socket. The foreign IP address and the foreign port number in the dst
argument to in_pcbnotify
are the IP address and port number that caused the ICMP error. But if the process has not issued a connect
on the socket, the inp_faddr
and inp_fport
members of the PCB are both 0, preventing in_pcbnotify
from ever calling the notify
function for this socket. The for
loop in Figure 22.33 will skip every UDP PCB.
This limitation arises for two reasons. First, if the sending process has an unconnected UDP socket, the only nonzero element in the socket pair is the local port. (This assumes the process did not call bind
.) This is the only value available to in_pcbnotify
to demultiplex the incoming ICMP error and pass it to the correct process. Although unlikely, there could be multiple processes bound to the same local port, making it ambiguous which process should receive the error. There’s also the possibility that the process that sent the datagram that caused the ICMP error has terminated, with another process then starting and using the same local port. This is also unlikely since ephemeral ports are assigned in sequential order from 1024 to 5000 and reused only after cycling around (Figure 22.23).
The second reason for this limitation is that the error notification from the kernel to the process, an errno
value, is inadequate. Consider a process that calls sendto
on an unconnected UDP socket three times in a row, sending a UDP datagram to three different destinations, and then waits for the replies with recvfrom
. If one of the datagrams generates an ICMP port unreachable error, and if the kernel were to return the corresponding error (ECONNREFUSED
) to the recvfrom
that the process issued, the errno
value doesn’t tell the process which of the three datagrams caused the error. The kernel has all the information required in the ICMP error, but the sockets API doesn’t provide a way to return this to the process.
Therefore the design decision was made that if a process wants to be notified of these ICMP errors on a UDP socket, that socket must be connected to a single peer. If the error ECONNREFUSED
is returned on that connected socket, there’s no question which peer generated the error.
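A process can observe this behavior with a connected UDP socket aimed at a port where nothing is listening. The sketch below is illustrative only: probe_connected_udp is a hypothetical helper, the loopback destination and the one-second receive timeout are our choices, and exactly which call reports the error (and whether it is ECONNREFUSED) depends on the system.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Send one datagram on a connected UDP socket to 127.0.0.1:port and
 * wait up to one second for a reply. Returns 0 if a datagram arrived,
 * otherwise the errno value of the first call that failed (for example
 * ECONNREFUSED if an ICMP port unreachable was reported, or a timeout
 * error if nothing came back).
 */
int probe_connected_udp(unsigned short port)
{
    struct sockaddr_in peer;
    struct timeval tv = { 1, 0 };       /* receive timeout: 1 second */
    char buf[64];
    int result = 0;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return errno;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* The calls run in order; the chain stops at the first failure. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0 ||
        connect(fd, (struct sockaddr *) &peer, sizeof(peer)) < 0 ||
        send(fd, "x", 1, 0) < 0 ||
        recv(fd, buf, sizeof(buf), 0) < 0)
        result = errno;

    close(fd);
    return result;
}
```

Since the socket is connected to a single peer, any error returned can only refer to that peer, which is the point of the design decision described above.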
There is still a remote possibility of an ICMP error being delivered to the wrong process. One process sends the UDP datagram that elicits the ICMP error, but it terminates before the error is received. Another process then starts up before the error is received, binds the same local port, and connects to the same foreign address and foreign port, causing this new process to receive the error. There’s no way to prevent this from occurring, given UDP’s lack of memory. We’ll see that TCP handles this with its TIME_WAIT state.
In our preceding example, one way for the application to get around this limitation is to use three connected UDP sockets instead of one unconnected socket, and call select to determine when any one of the three has a received datagram or an error to be read.
Here we have a scenario where the kernel has the information but the API (sockets) is inadequate. With most implementations of Unix System V and the other popular API (TLI), the reverse is true: the TLI function
t_rcvuderr
can return the peer’s IP address, port number, and an error value, but most SVR4 streams implementations of TCP/IP don’t provide a way for ICMP to pass the error to an unconnected UDP end point.
In an ideal world,
in_pcbnotify
would deliver the ICMP error to all UDP sockets that match, even if the only nonwildcard match is the local port. The error returned to the process would include the destination IP address and destination UDP port that caused the error, allowing the process to determine if the error corresponds to a datagram sent by the process.
The final function dealing with PCBs is in_losing
, shown in Figure 22.36. It is called by TCP when its retransmission timer has expired four or more times in a row for a given connection (Figure 25.26).
Figure 22.36. in_losing function: invalidate cached route information.
------------------------------------------------------------------------- in_pcb.c
361 int
362 in_losing(inp)
363 struct inpcb *inp;
364 {
365     struct rtentry *rt;
366     struct rt_addrinfo info;
367     if ((rt = inp->inp_route.ro_rt)) {
368         inp->inp_route.ro_rt = 0;
369         bzero((caddr_t) & info, sizeof(info));
370         info.rti_info[RTAX_DST] =
371             (struct sockaddr *) &inp->inp_route.ro_dst;
372         info.rti_info[RTAX_GATEWAY] = rt->rt_gateway;
373         info.rti_info[RTAX_NETMASK] = rt_mask(rt);
374         rt_missmsg(RTM_LOSING, &info, rt->rt_flags, 0);
375         if (rt->rt_flags & RTF_DYNAMIC)
376             (void) rtrequest(RTM_DELETE, rt_key(rt),
377                 rt->rt_gateway, rt_mask(rt), rt->rt_flags,
378                 (struct rtentry **) 0);
379         else
380             /*
381              * A new route can be allocated
382              * the next time output is attempted.
383              */
384             rtfree(rt);
385     }
386 }
------------------------------------------------------------------------- in_pcb.c
361-374
If the PCB holds a route, that route is discarded. An rt_addrinfo
structure is filled in with information about the cached route that appears to be failing. The function rt_missmsg
is then called to generate a message from the routing socket of type RTM_LOSING
, indicating a problem with the route.
375-384
If the cached route was generated by a redirect (RTF_DYNAMIC
is set), the route is deleted by calling rtrequest
with a request of RTM_DELETE
. Otherwise the cached route is released, causing the next output on the socket to allocate another route to the destination, hopefully a better route.
Undoubtedly the most time-consuming algorithm we’ve encountered in this chapter is the linear searching of the PCBs done by in_pcblookup
. At the beginning of Section 22.6 we noted four instances when this function is called. We can ignore the calls to bind
and connect
, as they occur much less frequently than the calls to in_pcblookup
from TCP and UDP, to demultiplex every received IP datagram.
In later chapters we’ll see that TCP and UDP both try to help this linear search by maintaining a pointer to the last PCB that the protocol referenced: a one-entry cache. If the local address, local port, foreign address, and foreign port in the cached PCB match the values in the received datagram, the protocol doesn’t even call in_pcblookup
. If the protocol’s data fits the packet train model [Jain and Routhier 1986], this simple cache works well. But if the data does not fit this model and, for example, looks like data entry into an on-line transaction processing system, the one-entry cache performs poorly [McKenney and Dove 1992].
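The idea of a one-entry cache in front of a linear search can be sketched as follows. The names toy_pcb, toy_table, and toy_demux are hypothetical, the list is singly linked for brevity, and only exact matches are handled (the real lookup also deals with wildcards).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_pcb {
    struct toy_pcb *next;
    uint32_t laddr, faddr;
    uint16_t lport, fport;
};

struct toy_table {
    struct toy_pcb *list;       /* all PCBs for one protocol */
    struct toy_pcb *last;       /* one-entry cache: last PCB that matched */
    unsigned searches;          /* count of full list walks, for illustration */
};

static int same_pair(const struct toy_pcb *p, uint32_t laddr, uint16_t lport,
                     uint32_t faddr, uint16_t fport)
{
    return p->laddr == laddr && p->lport == lport &&
           p->faddr == faddr && p->fport == fport;
}

/* Try the cached PCB first; fall back to the linear search on a miss. */
struct toy_pcb *toy_demux(struct toy_table *t, uint32_t laddr, uint16_t lport,
                          uint32_t faddr, uint16_t fport)
{
    struct toy_pcb *p;

    assert(t != NULL);
    if (t->last != NULL && same_pair(t->last, laddr, lport, faddr, fport))
        return t->last;         /* cache hit: no search at all */

    t->searches++;
    for (p = t->list; p != NULL; p = p->next) {
        if (same_pair(p, laddr, lport, faddr, fport)) {
            t->last = p;        /* remember for the next datagram */
            return p;
        }
    }
    return NULL;
}
```

A run of datagrams for one connection costs a single search, while datagrams that alternate between two connections miss the cache every time, which is the weakness noted for traffic that does not arrive in trains.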
One proposal for a better PCB arrangement is to move a PCB to the front of the PCB list when the PCB is referenced. ([McKenney and Dove 1992] attribute this idea to Jon Crowcroft; [Partridge and Pink 1993] attribute it to Gary Delp.) This movement of the PCB is easy to do since it is a doubly linked list and a pointer to the head of the list is the first argument to in_pcblookup
.
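On a circular doubly linked list with a head entry, moving a PCB to the front takes only a few pointer assignments. A minimal sketch, with hypothetical names and a toy structure rather than the kernel's inpcb:

```c
#include <assert.h>
#include <stddef.h>

struct toy_pcb {
    struct toy_pcb *next, *prev;
    int id;                     /* payload, for illustration only */
};

/* An empty circular list: the head points to itself in both directions. */
void toy_list_init(struct toy_pcb *head)
{
    head->next = head->prev = head;
}

/* Link p immediately after the head. */
void toy_insert_front(struct toy_pcb *head, struct toy_pcb *p)
{
    p->next = head->next;
    p->prev = head;
    head->next->prev = p;
    head->next = p;
}

/* Unlink p from wherever it is and relink it right after the head. */
void toy_move_to_front(struct toy_pcb *head, struct toy_pcb *p)
{
    assert(head != NULL && p != NULL && p != head);
    if (head->next == p)
        return;                 /* already at the front */
    p->prev->next = p->next;
    p->next->prev = p->prev;
    toy_insert_front(head, p);
}
```

A lookup routine would call toy_move_to_front on the PCB it just matched, so that recently used PCBs are found early in the next search.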
[McKenney and Dove 1992] compare the original Net/1 implementation (no cache), an enhanced one-entry send-receive cache, the move-to-the-front heuristic, and their own algorithm that uses hash chains. They show that maintaining a linear list of PCBs on hash chains provides an order of magnitude improvement over the other algorithms. The only cost for the hash chains is the memory required for the hash chain headers and the computation of the hash function. They also consider adding the move-to-the-front heuristic to their hash-chain algorithm and conclude that it is easier simply to add more hash chains.
Another comparison of the BSD linear search to a hash table search is in [Hutchinson and Peterson 1991]. They show that the time required to demultiplex an incoming UDP datagram is constant as the number of sockets increases for a hash table, but with a linear search the time increases as the number of sockets increases.
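To make the hash-chain idea concrete, here is a small sketch that keeps PCBs on short chains selected by a hash of the socket pair. Everything in it is an assumption for illustration: the bucket count, the hash function, and the names are ours, not those of the cited papers, and only fully specified (nonwildcard) PCBs are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 64u        /* arbitrary number of hash chains */

struct toy_pcb {
    struct toy_pcb *hnext;      /* next PCB on the same hash chain */
    uint32_t laddr, faddr;
    uint16_t lport, fport;
};

struct toy_hash {
    struct toy_pcb *bucket[TOY_NBUCKETS];   /* hash chain headers */
};

/* Illustrative hash over the four elements of the socket pair. */
static unsigned toy_hashfn(uint32_t laddr, uint16_t lport,
                           uint32_t faddr, uint16_t fport)
{
    uint32_t h = laddr ^ faddr ^ ((uint32_t) lport << 16) ^ (uint32_t) fport;

    h ^= h >> 16;
    return (unsigned) (h % TOY_NBUCKETS);
}

void toy_hash_insert(struct toy_hash *t, struct toy_pcb *p)
{
    unsigned i;

    assert(t != NULL && p != NULL);
    i = toy_hashfn(p->laddr, p->lport, p->faddr, p->fport);
    p->hnext = t->bucket[i];    /* push onto the front of the chain */
    t->bucket[i] = p;
}

/* Search only the one chain that the socket pair hashes to. */
struct toy_pcb *toy_hash_lookup(const struct toy_hash *t, uint32_t laddr,
                                uint16_t lport, uint32_t faddr, uint16_t fport)
{
    struct toy_pcb *p;

    assert(t != NULL);
    for (p = t->bucket[toy_hashfn(laddr, lport, faddr, fport)];
         p != NULL; p = p->hnext) {
        if (p->laddr == laddr && p->lport == lport &&
            p->faddr == faddr && p->fport == fport)
            return p;
    }
    return NULL;
}
```

With a reasonable hash function each chain stays short as sockets are added, so the lookup cost stays roughly constant instead of growing with the total number of PCBs.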
An Internet PCB is associated with every Internet socket: TCP, UDP, and raw IP. It contains information common to all Internet sockets: local and foreign IP addresses, pointer to a route structure, and so on. All the PCBs for a given protocol are placed on a doubly linked list maintained by that protocol.
In this chapter we’ve looked at numerous functions that manipulate the PCBs, and three in detail.
in_pcblookup
is called by TCP and UDP to demultiplex every received datagram. It chooses which socket receives the datagram, taking into account wildcard matches.
This function is also called by in_pcbbind
to verify that the local address and local port are unique, and by in_pcbconnect
to verify that the combination of a local address, local port, foreign address, and foreign port is unique.
in_pcbbind
explicitly or implicitly binds a local address and local port to a socket. An explicit bind occurs when the process calls bind
, and an implicit bind occurs when a TCP client calls connect
without calling bind
, or when a UDP process calls sendto
or connect
without calling bind
.
in_pcbconnect
sets the foreign address and foreign port. If the local address has not been set by the process, a route to the foreign address is calculated and the resulting local interface becomes the local address. If the local port has not been set by the process, in_pcbbind
chooses an ephemeral port for the socket.
Figure 22.37 summarizes the common scenarios for various TCP and UDP applications and the values stored in the PCB for the local address and port and the foreign address and port. We have not yet covered all the actions shown in Figure 22.37 for TCP and UDP processes, but will examine the code in later chapters.
Figure 22.37. Summary of in_pcbbind and in_pcbconnect.
Application | local address | local port | foreign address | foreign port |
---|---|---|---|---|
TCP client: | | | foreignIP | fport |
TCP client: | localIP | lport | foreignIP | fport |
TCP client: | | lport | foreignIP | fport |
TCP client: | localIP | | foreignIP | fport |
TCP server: | localIP | lport | Source address from IP header. | Source port from TCP header. |
TCP server: | Destination address from IP header. | lport | Source address from IP header. | Source port from TCP header. |
UDP client: | | | foreignIP. Reset to 0.0.0.0 after datagram sent. | fport. Reset to 0 after datagram sent. |
UDP client: | | | foreignIP | fport |
22.1 | What happens in Figure 22.23 when the process asks for an ephemeral port and every ephemeral port is in use? |
22.1 | An infinite loop occurs, waiting for a port to become available. This assumes the process is allowed to open enough descriptors to tie up all ephemeral ports. |
22.2 | In Figure 22.10 we showed two Telnet servers with listening sockets: one with a specific local IP address and one with the wildcard for its local IP address. Does your system’s Telnet daemon allow you to specify the local IP address, and if so, how? |
22.2 | Few, if any, servers support this option. [Cheswick and Bellovin 1994] mention how this would be nice for implementing firewall systems. |
22.3 | Assume a socket is bound to the local socket {140.252.1.29, 8888}, and this is the only socket using local port 8888. (1) Go through the steps performed by |
22.4 | What is the first ephemeral port number allocated by UDP? |
22.4 | The |
22.5 | When a process calls |
22.5 | Normally the caller sets the address family ( The local IP address ( |
22.6 | What happens if a process tries to |
22.6 | A process is allowed to An attempt to |