The User Datagram Protocol, or UDP, is a simple, datagram-oriented, transport-layer protocol: each output operation by a process produces exactly one UDP datagram, which causes one IP datagram to be sent.
A process accesses UDP by creating a socket of type SOCK_DGRAM in the Internet domain. By default the socket is termed unconnected. Each time the process sends a datagram it must specify the destination IP address and port number. Each time a datagram is received for the socket, the process can receive the source IP address and port number from the datagram.
We mentioned in Section 22.5 that a UDP socket can also be connected to one particular IP address and port number. This causes all datagrams written to the socket to go to that destination, and only datagrams arriving from that IP address and port number are passed to the process.
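The difference between the two kinds of socket can be seen from a process with a minimal sketch like the following. It assumes a POSIX sockets environment with a working loopback interface. The function name udp_connect_demo and the 2-second receive timeout are our own choices for illustration, not part of any kernel source.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if both the unconnected and the connected
 * send reach the receiver over the loopback interface, -1 otherwise. */
int udp_connect_demo(void)
{
    struct sockaddr_in dst, src;
    socklen_t len = sizeof(dst), srclen = sizeof(src);
    struct timeval tv = { 2, 0 };       /* assumed timeout: never block forever */
    char buf[8];
    int ok = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    if (rx < 0 || tx < 0)
        goto done;
    setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    /* Receiver: bind to 127.0.0.1 and let the kernel choose the port. */
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(rx, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
        getsockname(rx, (struct sockaddr *)&dst, &len) < 0)
        goto done;

    /* Unconnected socket: the destination accompanies every datagram. */
    if (sendto(tx, "a", 1, 0, (struct sockaddr *)&dst, sizeof(dst)) != 1)
        goto done;
    /* The receiver obtains the sender's address and port with the datagram. */
    if (recvfrom(rx, buf, sizeof(buf), 0, (struct sockaddr *)&src, &srclen) != 1)
        goto done;
    assert(src.sin_family == AF_INET);

    /* Connected socket: the destination is fixed, so plain send() works. */
    if (connect(tx, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
        send(tx, "b", 1, 0) != 1 ||
        recvfrom(rx, buf, sizeof(buf), 0, NULL, NULL) != 1 || buf[0] != 'b')
        goto done;
    ok = 0;
done:
    if (rx >= 0) close(rx);
    if (tx >= 0) close(tx);
    return ok;
}
```

Calling udp_connect_demo() from a test program should return 0 when loopback UDP is available.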
This chapter examines the implementation of UDP.
There are nine UDP functions in a single C file and various UDP definitions in two headers, as shown in Figure 23.1.
Figure 23.2 shows the relationship of the six main UDP functions to other kernel functions. The shaded ellipses are the six functions that we cover in this chapter. We also cover three additional UDP functions that are called by some of these six functions.
Seven global variables are introduced in this chapter, which are shown in Figure 23.3.
Figure 23.3. Global variables introduced in this chapter.
Variable | Datatype | Description |
---|---|---|
udb | struct inpcb | head of the UDP PCB list |
udp_last_inpcb | struct inpcb * | pointer to PCB for last received datagram: one-behind cache |
udpcksum | int | flag for calculating and verifying UDP checksum |
udp_in | struct sockaddr_in | holds sender’s IP address and port on input |
udpstat | struct udpstat | UDP statistics (Figure 23.4) |
udp_recvspace | u_long | default size of socket receive buffer, 41,600 bytes |
udp_sendspace | u_long | default size of socket send buffer, 9216 bytes |
Various UDP statistics are maintained in the global structure udpstat, described in Figure 23.4. We’ll see where these counters are incremented as we proceed through the code.
Figure 23.4. UDP statistics maintained in the udpstat structure.
udpstat member | Description | Used by SNMP |
---|---|---|
udps_badlen | #received datagrams with data length larger than packet | • |
udps_badsum | #received datagrams with checksum error | • |
udps_fullsock | #received datagrams not delivered because input socket full | |
udps_hdrops | #received datagrams with packet shorter than header | • |
udps_ipackets | total #received datagrams | • |
udps_noport | #received datagrams with no process on destination port | • |
udps_noportbcast | #received broadcast/multicast datagrams with no process on dest. port | • |
udps_opackets | total #output datagrams | • |
udpps_pcbcachemiss | #received input datagrams missing pcb cache | |
Figure 23.5 shows some sample output of these statistics, from the netstat -s command.
Figure 23.5. Sample UDP statistics.
The number of UDP datagrams delivered (the second from last line of output) is the number of datagrams received (udps_ipackets) minus the six variables that precede it in Figure 23.5.
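The subtraction can be sketched as follows. We assume here that the six variables are the six receive-side drop counters of the udpstat structure; the structure udp_counters and the function udp_delivered are hypothetical names for this illustration, and the values in the usage note are made up.

```c
#include <assert.h>

/* Hypothetical subset of udpstat: total received plus the six drop counters. */
struct udp_counters {
    unsigned long ipackets;     /* total received */
    unsigned long hdrops;       /* packet shorter than header */
    unsigned long badsum;       /* checksum error */
    unsigned long badlen;       /* data length larger than packet */
    unsigned long noport;       /* no process on destination port */
    unsigned long noportbcast;  /* same, broadcast/multicast */
    unsigned long fullsock;     /* input socket full */
};

/* Datagrams delivered = received minus every datagram that was dropped. */
unsigned long udp_delivered(const struct udp_counters *s)
{
    unsigned long dropped = s->hdrops + s->badsum + s->badlen +
                            s->noport + s->noportbcast + s->fullsock;

    assert(dropped <= s->ipackets);     /* counters must be consistent */
    return s->ipackets - dropped;
}
```

With 1000 datagrams received and drop counts of 1, 2, 3, 40, 5, and 6, the function returns 943.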
Figure 23.6 shows the four simple SNMP variables in the UDP group and which counters from the udpstat structure implement each variable.
Figure 23.6. Simple SNMP variables in udp group.
SNMP variable | udpstat members | Description |
---|---|---|
udpInDatagrams | | #received datagrams delivered to processes |
udpInErrors | | #undeliverable UDP datagrams for reasons other than no application at destination port (e.g., UDP checksum error) |
udpNoPorts | | #received datagrams for which no application process was at the destination port |
udpOutDatagrams | | #datagrams sent |
Figure 23.7 shows the UDP listener table, named udpTable. The values returned by SNMP for this table are taken from a UDP PCB, not the udpstat structure.
Figure 23.8 lists the protocol switch entry for UDP.
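Figure 23.8. The UDP protosw structure.
Member | Value | Description |
---|---|---|
pr_type | SOCK_DGRAM | UDP provides datagram packet services |
pr_domain | &inetdomain | UDP is part of the Internet domain |
pr_protocol | IPPROTO_UDP (17) | appears in the ip_p field of the IP header |
pr_flags | PR_ATOMIC \| PR_ADDR | socket layer flags, not used by protocol processing |
pr_input | udp_input | receives messages from IP layer |
pr_output | 0 | not used by UDP |
pr_ctlinput | udp_ctlinput | control input function for ICMP errors |
pr_ctloutput | ip_ctloutput | respond to administrative requests from a process |
pr_usrreq | udp_usrreq | respond to communication requests from a process |
pr_init | udp_init | initialization for UDP |
pr_fasttimo | 0 | not used by UDP |
pr_slowtimo | 0 | not used by UDP |
pr_drain | 0 | not used by UDP |
pr_sysctl | udp_sysctl | for sysctl(8) system call |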
We describe the five functions that begin with udp_ in this chapter. We also cover a sixth function, udp_output, which is not in the protocol switch entry but is called by udp_usrreq when a UDP datagram is output.
The UDP header is defined as a udphdr structure. Figure 23.9 shows the C structure and Figure 23.10 shows a picture of the UDP header.
Figure 23.9. udphdr structure.
----------------------------------------------------------------------- udp.h
39 struct udphdr {
40     u_short uh_sport;           /* source port */
41     u_short uh_dport;           /* destination port */
42     short   uh_ulen;            /* udp length */
43     u_short uh_sum;             /* udp checksum */
44 };
----------------------------------------------------------------------- udp.h
In the source code the UDP header is normally referenced as an IP header immediately followed by a UDP header. This is how udp_input processes received IP datagrams, and how udp_output builds outgoing IP datagrams. This combined IP/UDP header is a udpiphdr structure, shown in Figure 23.11.
Figure 23.11. udpiphdr structure: combined IP/UDP header.
------------------------------------------------------------------------ udp_var.h
38 struct udpiphdr {
39     struct ipovly ui_i;         /* overlaid ip structure */
40     struct udphdr ui_u;         /* udp header */
41 };
42 #define ui_next     ui_i.ih_next
43 #define ui_prev     ui_i.ih_prev
44 #define ui_x1       ui_i.ih_x1
45 #define ui_pr       ui_i.ih_pr
46 #define ui_len      ui_i.ih_len
47 #define ui_src      ui_i.ih_src
48 #define ui_dst      ui_i.ih_dst
49 #define ui_sport    ui_u.uh_sport
50 #define ui_dport    ui_u.uh_dport
51 #define ui_ulen     ui_u.uh_ulen
52 #define ui_sum      ui_u.uh_sum
------------------------------------------------------------------------ udp_var.h
The 20-byte IP header is defined as an ipovly structure, shown in Figure 23.12.
Figure 23.12. ipovly structure.
------------------------------------------------------------------------- ip_var.h
38 struct ipovly {
39     caddr_t ih_next, ih_prev;   /* for protocol sequence q's */
40     u_char  ih_x1;              /* (unused) */
41     u_char  ih_pr;              /* protocol */
42     short   ih_len;             /* protocol length */
43     struct in_addr ih_src;      /* source internet address */
44     struct in_addr ih_dst;      /* destination internet address */
45 };
------------------------------------------------------------------------- ip_var.h
Unfortunately this structure is not the real IP header that was shown in Figure 8.8. The size is the same (20 bytes) but the fields are different. We’ll return to this discrepancy when we discuss the calculation of the UDP checksum in Section 23.6.
The domaininit function calls UDP’s initialization function (udp_init, Figure 23.13) at system initialization time.
The only action performed by this function is to set the next and previous pointers in the head PCB (udb) to point to itself. This is an empty doubly linked list.
The remainder of the udb PCB is initialized to 0, although the only other field used in this head PCB is inp_lport, the next UDP ephemeral port number to allocate. In the solution for Exercise 22.4 we mention that because this local port number is initialized to 0, the first ephemeral port number will be 1024.
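The self-referencing list head can be illustrated with a minimal sketch. The structure pcb below is a simplified stand-in for the real inpcb, and the function names are hypothetical; only the pointer manipulation is meant to mirror what the text describes.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for an Internet PCB: list links plus the local port. */
struct pcb {
    struct pcb *next, *prev;
    unsigned short lport;
};

/* An empty circular doubly linked list: the head points to itself. */
void pcb_list_init(struct pcb *head)
{
    assert(head != NULL);
    head->next = head->prev = head;
    head->lport = 0;            /* remainder of the head PCB is zero */
}

int pcb_list_empty(const struct pcb *head)
{
    return head->next == head;
}

/* Insert a new PCB right after the head; no NULL checks are needed
 * because a circular list always has valid neighbors. */
void pcb_insert_head(struct pcb *head, struct pcb *p)
{
    p->next = head->next;
    p->prev = head;
    head->next->prev = p;
    head->next = p;
}
```

The benefit of the circular arrangement is that insertion and deletion never need a special case for an empty list.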
UDP output occurs when the application calls one of the five write functions: send, sendto, sendmsg, write, or writev. If the socket is connected, any of the five functions can be called, although a destination address cannot be specified with sendto or sendmsg. If the socket is unconnected, only sendto and sendmsg can be called, and a destination address must be specified. Figure 23.14 summarizes how these five write functions end up with udp_output being called, which in turn calls ip_output.
All five functions end up calling sosend, passing a pointer to a msghdr structure as an argument. The data to output is packaged into an mbuf chain, and an optional destination address and optional control information are also put into mbufs by sosend. A PRU_SEND request is issued.
UDP calls the function udp_output, the first half of which is shown in Figure 23.15. The four arguments are inp, a pointer to the socket Internet PCB; m, a pointer to the mbuf chain for output; addr, an optional pointer to an mbuf with the destination address packaged as a sockaddr_in structure; and control, an optional pointer to an mbuf with control information from sendmsg.
Figure 23.15. udp_output function: temporarily connect an unconnected socket.
---------------------------------------------------------------------- udp_usrreq.c
333 int
334 udp_output(inp, m, addr, control)
335 struct inpcb *inp;
336 struct mbuf *m;
337 struct mbuf *addr, *control;
338 {
339     struct udpiphdr *ui;
340     int     len = m->m_pkthdr.len;
341     struct in_addr laddr;
342     int     s, error = 0;
343     if (control)
344         m_freem(control);       /* XXX */
345     if (addr) {
346         laddr = inp->inp_laddr;
347         if (inp->inp_faddr.s_addr != INADDR_ANY) {
348             error = EISCONN;
349             goto release;
350         }
351         /*
352          * Must block input while temporarily connected.
353          */
354         s = splnet();
355         error = in_pcbconnect(inp, addr);
356         if (error) {
357             splx(s);
358             goto release;
359         }
360     } else {
361         if (inp->inp_faddr.s_addr == INADDR_ANY) {
362             error = ENOTCONN;
363             goto release;
364         }
365     }
366     /*
367      * Calculate data length and get an mbuf for UDP and IP headers.
368      */
369     M_PREPEND(m, sizeof(struct udpiphdr), M_DONTWAIT);
370     if (m == 0) {
371         error = ENOBUFS;
372         goto release;
373     }
        /* remainder of function shown in Figure 23.20 */
409   release:
410     m_freem(m);
411     return (error);
412 }
---------------------------------------------------------------------- udp_usrreq.c
333-344
Any optional control information is discarded by m_freem, without generating an error. UDP output does not use control information for any purpose.
The comment XXX is there because the control information is ignored without generating an error. Other protocols, such as the routing domain and TCP, generate an error if the process passes control information.
345-359
If the caller specifies a destination address for the UDP datagram (addr is nonnull), the socket is temporarily connected to that destination address by in_pcbconnect. The socket will be disconnected at the end of this function. Before doing this connect, a check is made as to whether the socket is already connected, and, if so, the error EISCONN is returned. This is why a sendto that specifies a destination address on a connected socket returns an error.
Before the socket is temporarily connected, IP input processing is stopped by splnet. This is done because the temporary connect changes the foreign address, foreign port, and possibly the local address in the socket’s PCB. If a received UDP datagram were processed while this PCB was temporarily connected, that datagram could be delivered to the wrong process. Setting the processor priority to splnet only stops a software interrupt from causing the IP input routine to be executed (Figure 1.12); it does not prevent the interface layer from accepting incoming packets and placing them onto IP’s input queue.
[Partridge and Pink 1993] note that this operation of temporarily connecting the socket is expensive and consumes nearly one-third of the cost of each UDP transmission.
The local address from the PCB is saved in laddr before temporarily connecting, because if it is the wildcard address it will be changed by in_pcbconnect when it calls in_pcbbind.
The same rules apply to the destination address that would apply if the process called connect, since in_pcbconnect is called for both cases.
360-364
If the process doesn’t specify a destination address, and the socket is not connected, ENOTCONN is returned.
366-373
M_PREPEND allocates room for the IP and UDP headers in front of the data. Figure 1.8 showed one scenario, assuming there is not room in the first mbuf on the chain for the 28 bytes of header. Exercise 23.1 details the other possible scenarios. The flag M_DONTWAIT is specified because if the socket is temporarily connected, IP processing is blocked, and M_PREPEND should not block.
Earlier Berkeley releases incorrectly specified M_WAIT here.
There is a subtle interaction between the M_PREPEND macro and mbuf clusters. If the user data is placed into a cluster by sosend, then 56 bytes (max_hdr from Figure 7.17) are left unused at the beginning of the cluster, allowing room for the Ethernet, IP, and UDP headers. This is to prevent M_PREPEND from allocating another mbuf just to hold these headers. M_PREPEND calls M_LEADINGSPACE to calculate how much space is available at the beginning of the mbuf:
#define M_LEADINGSPACE(m) \
        ((m)->m_flags & M_EXT ? /* (m)->m_data - (m)->m_ext.ext_buf */ 0 : \
            (m)->m_flags & M_PKTHDR ? (m)->m_data - (m)->m_pktdat : \
            (m)->m_data - (m)->m_dat)
The code that correctly calculates the amount of room at the front of a cluster is commented out, and the macro always returns 0 if the data is in a cluster. This means that when the user data is in a cluster, M_PREPEND always allocates a new mbuf for the protocol headers instead of using the room allocated for this purpose by sosend.
The reason for commenting out the correct code in M_LEADINGSPACE is that the cluster might be shared (Section 2.9), and, if it is shared, using the space before the user’s data in the cluster could wipe out someone else’s data.
With UDP data, clusters are not shared, since udp_output does not save a copy of the data. TCP, however, saves a copy of the data in its send buffer (waiting for the data to be acknowledged), and if the data is in a cluster, it is shared. But tcp_output doesn’t call M_LEADINGSPACE, because sosend leaves room for only 56 bytes at the beginning of the cluster for datagram protocols. tcp_output always calls MGETHDR instead, to allocate an mbuf for the protocol headers.
Before showing the last half of udp_output we describe how UDP fills in some of the fields in the IP/UDP headers, calculates the UDP checksum, and passes the IP/UDP headers and the data to IP for output. The way this is done with the ipovly structure is tricky.
Figure 23.16 shows the 28-byte IP/UDP headers that are built by udp_output in the first mbuf in the chain pointed to by m. The unshaded fields are filled in by udp_output and the shaded fields are filled in by ip_output. This figure shows the format of the headers as they appear on the wire.
The UDP checksum is calculated over three areas: (1) a 12-byte pseudo-header containing fields from the IP header, (2) the 8-byte UDP header, and (3) the UDP data. Figure 23.17 shows the 12 bytes of pseudo-header used for the checksum computation, along with the UDP header. The UDP header used for the checksum calculation is identical to the UDP header that appears on the wire (Figure 23.16).
The following three facts are used in computing the UDP checksum. (1) The third 32-bit word in the pseudo-header (Figure 23.17) looks similar to the third 32-bit word in the IP header (Figure 23.16): two 8-bit values and a 16-bit value. (2) The order of the three 32-bit values in the pseudo-header is irrelevant. Actually, the computation of the Internet checksum does not depend on the order of the 16-bit values that are used (Section 8.7). (3) Including additional 32-bit words of 0 in the checksum computation has no effect.
udp_output takes advantage of these three facts and fills in the fields in the udpiphdr structure (Figure 23.11), which we depict in Figure 23.18. This structure is contained in the first mbuf in the chain pointed to by the argument m.
The last three 32-bit words in the 20-byte IP header (the five members ui_x1, ui_pr, ui_len, ui_src, and ui_dst) are used as the pseudo-header for the checksum computation. The first two 32-bit words in the IP header (ui_next and ui_prev) are also used in the checksum computation, but they’re initialized to 0, and don’t affect the checksum.
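The second and third facts are easy to check with a small sketch of the Internet checksum over an array of 16-bit words. The function inet_cksum16 is a hypothetical, simplified routine written for this illustration; it is not the kernel’s in_cksum, which operates on mbuf chains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement sum of n 16-bit words, folded to 16 bits and inverted. */
uint16_t inet_cksum16(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    assert(words != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)                   /* fold carries back into low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Summing the six words 0x0011, 0x0014, 0xc0a8, 0x0001, 0xc0a8, 0x0002 gives the same checksum as summing them in a different order with four additional zero words in front, which is the property udp_output relies on when it overlays the pseudo-header on the zeroed ui_next and ui_prev members.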
Figure 23.19 summarizes the operations we’ve described.
The top picture shown in Figure 23.19 is the protocol definition of the pseudo-header, which corresponds to Figure 23.17.
The middle picture is the udpiphdr structure that is used in the source code, which corresponds to Figure 23.11. (To make the figure readable, the prefix ui_ has been left off all the members.) This is the structure built by udp_output in the first mbuf and then used to calculate the UDP checksum.
The bottom picture shows the IP/UDP headers that appear on the wire, which corresponds to Figure 23.16. The seven fields with an arrow above are filled in by udp_output before the checksum computation. The three fields with an asterisk above are filled in by udp_output after the checksum computation. The remaining six shaded fields are filled in by ip_output.
Figure 23.20 shows the last half of the udp_output function.
Figure 23.20. udp_output function: fill in headers, calculate checksum, pass to IP.
---------------------------------------------------------------------- udp_usrreq.c
374     /*
375      * Fill in mbuf with extended UDP header
376      * and addresses and length put into network format.
377      */
378     ui = mtod(m, struct udpiphdr *);
379     ui->ui_next = ui->ui_prev = 0;
380     ui->ui_x1 = 0;
381     ui->ui_pr = IPPROTO_UDP;
382     ui->ui_len = htons((u_short) len + sizeof(struct udphdr));
383     ui->ui_src = inp->inp_laddr;
384     ui->ui_dst = inp->inp_faddr;
385     ui->ui_sport = inp->inp_lport;
386     ui->ui_dport = inp->inp_fport;
387     ui->ui_ulen = ui->ui_len;
388     /*
389      * Stuff checksum and output datagram.
390      */
391     ui->ui_sum = 0;
392     if (udpcksum) {
393         if ((ui->ui_sum = in_cksum(m, sizeof(struct udpiphdr) + len)) == 0)
394             ui->ui_sum = 0xffff;
395     }
396     ((struct ip *) ui)->ip_len = sizeof(struct udpiphdr) + len;
397     ((struct ip *) ui)->ip_ttl = inp->inp_ip.ip_ttl;    /* XXX */
398     ((struct ip *) ui)->ip_tos = inp->inp_ip.ip_tos;    /* XXX */
399     udpstat.udps_opackets++;
400     error = ip_output(m, inp->inp_options, &inp->inp_route,
401         inp->inp_socket->so_options & (SO_DONTROUTE | SO_BROADCAST),
402         inp->inp_moptions);
403     if (addr) {
404         in_pcbdisconnect(inp);
405         inp->inp_laddr = laddr;
406         splx(s);
407     }
408     return (error);
---------------------------------------------------------------------- udp_usrreq.c
374-387
All the members in the udpiphdr structure (Figure 23.18) are set to their respective values. The local and foreign sockets from the PCB are already in network byte order, but the UDP length must be converted to network byte order. The UDP length is the number of bytes of data (len, which can be 0) plus the size of the UDP header (8). The UDP length field appears twice in the UDP checksum calculation: ui_len and ui_ulen. One of them is redundant.
388-395
The checksum is calculated by first setting it to 0 and then calling in_cksum. If UDP checksums are disabled (a bad idea; see Section 11.3 of Volume 1), 0 is sent as the checksum. If the calculated checksum is 0, 16 one bits are stored in the header instead of 0. (In one’s complement arithmetic, all one bits and all zero bits are both considered 0.) This allows the receiver to distinguish between a UDP packet without a checksum (the checksum field is 0) and a UDP packet with a checksum whose value is 0 (the checksum is 16 one bits).
The variable udpcksum (Figure 23.3) normally defaults to 1, enabling UDP checksums. The kernel can be compiled for 4.2BSD compatibility, which initializes udpcksum to 0.
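The rule for the value stored in the checksum field can be written as a small function. This is only a sketch; udp_cksum_field is a hypothetical helper that takes an already computed checksum, rather than the kernel code that computes it in place.

```c
#include <assert.h>
#include <stdint.h>

/* Value to store in the UDP checksum field.
 * cksum_enabled plays the role of the udpcksum flag;
 * computed is the result of the checksum calculation. */
uint16_t udp_cksum_field(int cksum_enabled, uint16_t computed)
{
    assert(cksum_enabled == 0 || cksum_enabled == 1);
    if (!cksum_enabled)
        return 0;               /* 0 on the wire means "no checksum" */
    if (computed == 0)
        return 0xffff;          /* the other one's complement zero */
    return computed;
}
```

A receiver that sees 0 therefore knows that the sender did not calculate a checksum, while 0xffff means a checksum was calculated and its value happened to be 0.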
396-398
The pointer ui is cast to a pointer to a standard IP header (ip), and three fields in the IP header are set by UDP. The IP length field is set to the amount of data in the UDP datagram, plus 28, the size of the IP/UDP headers. Notice that this field in the IP header is stored in host byte order, not network byte order like the rest of the multibyte fields in the header. ip_output converts it to network byte order before transmission.
The TTL and TOS fields in the IP header are then set from the values in the socket’s PCB. These values are defaulted by UDP when the socket is created, but can be changed by the process using setsockopt. Since these three fields (IP length, TTL, and TOS) are not part of the pseudo-header and not used in the UDP checksum computation, they must be set after the checksum is calculated but before ip_output is called.
400-402
ip_output sends the datagram. The second argument, inp_options, points to the IP options that the process can set using setsockopt. These IP options are placed into the IP header by ip_output. The third argument is a pointer to the cached route in the PCB, and the fourth argument is the socket options. The only socket options that are passed to ip_output are SO_DONTROUTE (bypass the routing tables) and SO_BROADCAST (allow broadcasting). The final argument is a pointer to the multicast options for this socket.
UDP output is driven by a process calling one of the five write functions. The functions shown in Figure 23.14 are all called directly as part of the system call. UDP input, on the other hand, occurs when IP input receives an IP datagram on its input queue whose protocol field specifies UDP. IP calls the function udp_input through the pr_input function in the protocol switch table (Figure 8.15). Since IP input is at the software interrupt level, udp_input also executes at this level. The goal of udp_input is to place the UDP datagram onto the appropriate socket’s buffer and wake up any process blocked for input on that socket.
We’ll divide our discussion of the udp_input function into three sections:
the general validation that UDP performs on the received datagram,
processing UDP datagrams destined for a unicast address: locating the appropriate PCB and placing the datagram onto the socket’s buffer, and
processing UDP datagrams destined for a broadcast or multicast address: the datagram may be delivered to multiple sockets.
This last step is new with the support of multicasting in Net/3, but consumes almost one-third of the code.
Figure 23.21 shows the first section of UDP input.
Figure 23.21. udp_input function: general validation of received UDP datagram.
---------------------------------------------------------------------- udp_usrreq.c
 55 void
 56 udp_input(m, iphlen)
 57 struct mbuf *m;
 58 int     iphlen;
 59 {
 60     struct ip *ip;
 61     struct udphdr *uh;
 62     struct inpcb *inp;
 63     struct mbuf *opts = 0;
 64     int     len;
 65     struct ip save_ip;
 66     udpstat.udps_ipackets++;
 67     /*
 68      * Strip IP options, if any; should skip this,
 69      * make available to user, and use on returned packets,
 70      * but we don't yet have a way to check the checksum
 71      * with options still present.
 72      */
 73     if (iphlen > sizeof(struct ip)) {
 74         ip_stripoptions(m, (struct mbuf *) 0);
 75         iphlen = sizeof(struct ip);
 76     }
 77     /*
 78      * Get IP and UDP header together in first mbuf.
 79      */
 80     ip = mtod(m, struct ip *);
 81     if (m->m_len < iphlen + sizeof(struct udphdr)) {
 82         if ((m = m_pullup(m, iphlen + sizeof(struct udphdr))) == 0) {
 83             udpstat.udps_hdrops++;
 84             return;
 85         }
 86         ip = mtod(m, struct ip *);
 87     }
 88     uh = (struct udphdr *) ((caddr_t) ip + iphlen);
 89     /*
 90      * Make mbuf data length reflect UDP length.
 91      * If not enough data to reflect UDP length, drop.
 92      */
 93     len = ntohs((u_short) uh->uh_ulen);
 94     if (ip->ip_len != len) {
 95         if (len > ip->ip_len) {
 96             udpstat.udps_badlen++;
 97             goto bad;
 98         }
 99         m_adj(m, len - ip->ip_len);
100         /* ip->ip_len = len; */
101     }
102     /*
103      * Save a copy of the IP header in case we want to restore
104      * it for sending an ICMP error message in response.
105      */
106     save_ip = *ip;
107     /*
108      * Checksum extended UDP header and data.
109      */
110     if (udpcksum && uh->uh_sum) {
111         ((struct ipovly *) ip)->ih_next = 0;
112         ((struct ipovly *) ip)->ih_prev = 0;
113         ((struct ipovly *) ip)->ih_x1 = 0;
114         ((struct ipovly *) ip)->ih_len = uh->uh_ulen;
115         if (uh->uh_sum = in_cksum(m, len + sizeof(struct ip))) {
116             udpstat.udps_badsum++;
117             m_freem(m);
118             return;
119         }
120     }
---------------------------------------------------------------------- udp_usrreq.c
55-65
The two arguments to udp_input are m, a pointer to an mbuf chain containing the IP datagram, and iphlen, the length of the IP header (including possible IP options).
67-76
If IP options are present they are discarded by ip_stripoptions. As the comments indicate, UDP should save a copy of the IP options and make them available to the receiving process through the IP_RECVOPTS socket option, but this isn’t implemented yet.
77-88
If the length of the first mbuf on the mbuf chain is less than 28 bytes (the size of the IP header plus the UDP header), m_pullup rearranges the mbuf chain so that at least 28 bytes are stored contiguously in the first mbuf.
89-101
There are two lengths associated with a UDP datagram: the length field in the IP header (ip_len) and the length field in the UDP header (uh_ulen). Recall that ipintr subtracted the length of the IP header from ip_len before calling udp_input (Figure 10.11). The two lengths are compared and there are three possibilities:
ip_len equals uh_ulen. This is the common case.
ip_len is greater than uh_ulen. The IP datagram is too big, as shown in Figure 23.22.
The code believes the smaller of the two lengths (the UDP header length) and m_adj removes the excess bytes of data from the end of the datagram. In the code the second argument to m_adj is negative, which we said in Figure 2.20 trims data from the end of the mbuf chain. It is possible in this scenario that the UDP length field has been corrupted. If so, the datagram will probably be discarded shortly, assuming the sender calculated the UDP checksum, that this checksum detects the error, and that the receiver verifies the checksum. The IP length field should be correct since it was verified by IP against the amount of data received from the interface, and the IP length field is covered by the mandatory IP header checksum.
ip_len is less than uh_ulen. The IP datagram is smaller than possible, given the length in the UDP header. Figure 23.23 shows this case.
Something is wrong and the datagram is discarded. There is no other choice here: if the UDP length field has been corrupted, it can’t be detected with the UDP checksum. The correct UDP length is needed to calculate the checksum.
As we’ve said, the UDP length is redundant. In Chapter 28 we’ll see that TCP does not have a length field in its header; it uses the IP length field, minus the lengths of the IP and TCP headers, to determine the amount of data in the datagram. Why does the UDP length field exist? Possibly to add a small amount of error checking, since UDP checksums are optional.
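The three-way length comparison made on lines 93-101 can be restated as a standalone sketch. The enumeration and the function udp_check_len are hypothetical names for this illustration; both lengths are assumed to be in host byte order, with the IP header length already subtracted from ip_len.

```c
#include <assert.h>
#include <stddef.h>

enum udp_len_action { UDP_LEN_OK, UDP_LEN_TRIM, UDP_LEN_DROP };

/* Compare the IP payload length with the UDP length field.
 * On UDP_LEN_TRIM, *trim is the negative byte count that would be
 * handed to an m_adj-style routine to cut data from the end. */
enum udp_len_action udp_check_len(int ip_len, int udp_len, int *trim)
{
    assert(trim != NULL);
    *trim = 0;
    if (ip_len == udp_len)
        return UDP_LEN_OK;      /* common case */
    if (udp_len > ip_len)
        return UDP_LEN_DROP;    /* datagram smaller than UDP claims */
    *trim = udp_len - ip_len;   /* believe the smaller (UDP) length */
    return UDP_LEN_TRIM;
}
```

For example, an IP payload of 120 bytes with a UDP length of 108 yields UDP_LEN_TRIM with a trim value of -12, while an IP payload of 100 bytes with the same UDP length is dropped.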
102-106
udp_input saves a copy of the IP header before verifying the checksum, because the checksum computation wipes out some of the fields in the original IP header.
110
The checksum is verified only if UDP checksums are enabled for the kernel (udpcksum), and if the sender calculated a UDP checksum (the received checksum is nonzero).
This test is incorrect. If the sender calculated a checksum, it should be verified, regardless of whether outgoing checksums are calculated or not. The variable udpcksum should only specify whether outgoing checksums are calculated. Unfortunately many vendors have copied this incorrect test, although many vendors today finally ship their kernels with UDP checksums enabled by default.
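The difference between the test as coded and the test argued for above can be sketched with two predicates. Both function names are hypothetical; the second is only our reading of what the corrected condition would look like, not code from any release.

```c
#include <assert.h>
#include <stdint.h>

/* The condition on line 110: verify only if checksums are enabled
 * locally AND the sender supplied a checksum. */
int verify_as_coded(int udpcksum, uint16_t uh_sum)
{
    assert(udpcksum == 0 || udpcksum == 1);
    return udpcksum && uh_sum != 0;
}

/* The condition the text argues for: verify whenever the sender
 * supplied a checksum, regardless of the local output setting. */
int verify_as_argued(int udpcksum, uint16_t uh_sum)
{
    (void)udpcksum;             /* would affect output only */
    return uh_sum != 0;
}
```

The two predicates differ only when udpcksum is 0 and the received checksum is nonzero: the coded test silently accepts a datagram that the sender went to the trouble of protecting.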
111-120
Before calculating the checksum, the IP header is referenced as an ipovly structure (Figure 23.18) and the fields are initialized as described in the previous section when the UDP checksum is calculated by udp_output.
At this point special code is executed if the datagram is destined for a broadcast or multicast IP address. We defer this code until later in the section.
Assuming the datagram is destined for a unicast address, Figure 23.24 shows the code that is executed.
Figure 23.24. udp_input function: demultiplex unicast datagram.
--------------------------------------------------------------------- udp_usrreq.c
        /* demultiplex broadcast & multicast datagrams (Figure 23.26) */
206     /*
207      * Locate pcb for unicast datagram.
208      */
209     inp = udp_last_inpcb;
210     if (inp->inp_lport != uh->uh_dport ||
211         inp->inp_fport != uh->uh_sport ||
212         inp->inp_faddr.s_addr != ip->ip_src.s_addr ||
213         inp->inp_laddr.s_addr != ip->ip_dst.s_addr) {
214         inp = in_pcblookup(&udb, ip->ip_src, uh->uh_sport,
215                            ip->ip_dst, uh->uh_dport, INPLOOKUP_WILDCARD);
216         if (inp)
217             udp_last_inpcb = inp;
218         udpstat.udpps_pcbcachemiss++;
219     }
220     if (inp == 0) {
221         udpstat.udps_noport++;
222         if (m->m_flags & (M_BCAST | M_MCAST)) {
223             udpstat.udps_noportbcast++;
224             goto bad;
225         }
226         *ip = save_ip;
227         ip->ip_len += iphlen;
228         icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PORT, 0, 0);
229         return;
230     }
--------------------------------------------------------------------- udp_usrreq.c
206-209
UDP maintains a pointer to the last Internet PCB for which it received a datagram, udp_last_inpcb. Before calling in_pcblookup, which might have to search many PCBs on the UDP list, the foreign and local addresses and ports of that last PCB are compared against the received datagram. This is called a one-behind cache [Partridge and Pink 1993], and it is based on the assumption that the next datagram received has a high probability of being destined for the same socket as the last received datagram [Mogul 1991]. This cache was introduced with the 4.3BSD Tahoe release.
210-213
The order of the four comparisons between the cached PCB and the received datagram is intentional. If the PCBs don’t match, the comparisons should stop as soon as possible. The highest probability is that the destination port numbers are different; this is therefore the first test. The lowest probability of a mismatch is between the local addresses, especially on a host with just one interface, so this is the last test.
Unfortunately this one-behind cache, as coded, is practically useless [Partridge and Pink 1993]. The most common type of UDP server binds only its well-known port, leaving its local address, foreign address, and foreign port wildcarded. The most common type of UDP client does not connect its UDP socket; it specifies the destination address for each datagram using sendto. Therefore most of the time the three values in the PCB (inp_laddr, inp_faddr, and inp_fport) are wildcards. In the cache comparison the four values in the received datagram are never wildcards, meaning the cache entry will compare equal with the received datagram only when the PCB has all four local and foreign values specified to nonwildcard values. This happens only for a connected UDP socket.
On the system bsdi, the counter udpps_pcbcachemiss was 41,253 and the counter udps_ipackets was 42,485. This is less than a 3% cache hit rate.
The netstat -s command prints most of the fields in the udpstat structure (Figure 23.5). Unfortunately the Net/3 version, and most vendors’ versions, never print udpps_pcbcachemiss. If you want to see the value, use a debugger to examine the variable in the running kernel.
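Why the exact-match comparison defeats the cache for a typical server can be shown with a short sketch. The structure pcb4 and the function cache_hit are hypothetical simplifications: addresses and ports are plain integers, and 0 stands for a wildcard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified PCB: 0 in any field means "wildcard". */
struct pcb4 {
    uint32_t laddr, faddr;
    uint16_t lport, fport;
};

/* Exact comparison in the same order as the cache test:
 * local port, foreign port, foreign address, local address. */
int cache_hit(const struct pcb4 *c, uint32_t src, uint16_t sport,
              uint32_t dst, uint16_t dport)
{
    assert(c != NULL);
    return c->lport == dport && c->fport == sport &&
           c->faddr == src && c->laddr == dst;
}
```

A server PCB that binds only local port 53 (all other fields 0) never matches a datagram from 10.0.0.2 port 1025 to 10.0.0.1 port 53, because the wildcards are compared as literal values. A PCB with all four values filled in, as for a connected socket, does match.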
214-218
Assuming the comparison with the cached PCB fails, in_pcblookup searches for a match. The INPLOOKUP_WILDCARD flag is specified, allowing a wildcard match. If a match is found, the pointer to the PCB is saved in udp_last_inpcb, which we said is a cache of the last received UDP datagram’s PCB.
220-230
If a matching PCB is not found, UDP normally generates an ICMP port unreachable error. First the m_flags for the received mbuf chain is checked to see if the datagram was sent to a link-level broadcast or multicast destination address. It is possible to receive an IP datagram with a unicast IP address that was sent to a broadcast or multicast link-level address, but an ICMP port unreachable error must not be generated. If it is OK to generate the ICMP error, the IP header is restored to its received value (save_ip) and the IP length is also set back to its original value.
This check for a link-level broadcast or multicast address is redundant. icmp_error also performs this check. The only advantage in this redundant check is to maintain the counter udps_noportbcast in addition to the counter udps_noport.
The addition of iphlen back into ip_len is a bug. icmp_error will also do this, causing the IP length field in the IP header returned in the ICMP error to be 20 bytes too large. You can tell if a system has this bug by adding a few lines of code to the Traceroute program (Chapter 8 of Volume 1) to print this field in the ICMP port unreachable that is returned when the destination host is finally reached.
Figure 23.25 is the next section of processing for a unicast datagram, delivering the datagram to the socket corresponding to the destination PCB.
Table 23.25. udp_input
function: deliver unicast datagram to socket.
------------------------------------------------------------------- udp_usrreq.c
231     /*
232      * Construct sockaddr format source address.
233      * Stuff source address and datagram in user buffer.
234      */
235     udp_in.sin_port = uh->uh_sport;
236     udp_in.sin_addr = ip->ip_src;
237     if (inp->inp_flags & INP_CONTROLOPTS) {
238         struct mbuf **mp = &opts;
239         if (inp->inp_flags & INP_RECVDSTADDR) {
240             *mp = udp_saveopt((caddr_t) & ip->ip_dst,
241                               sizeof(struct in_addr), IP_RECVDSTADDR);
242             if (*mp)
243                 mp = &(*mp)->m_next;
244         }
245 #ifdef notyet
246         /* IP options were tossed above */
247         if (inp->inp_flags & INP_RECVOPTS) {
248             *mp = udp_saveopt((caddr_t) opts_deleted_above,
249                               sizeof(struct in_addr), IP_RECVOPTS);
250             if (*mp)
251                 mp = &(*mp)->m_next;
252         }
253         /* ip_srcroute doesn't do what we want here, need to fix */
254         if (inp->inp_flags & INP_RECVRETOPTS) {
255             *mp = udp_saveopt((caddr_t) ip_srcroute(),
256                               sizeof(struct in_addr), IP_RECVRETOPTS);
257             if (*mp)
258                 mp = &(*mp)->m_next;
259         }
260 #endif
261     }
262     iphlen += sizeof(struct udphdr);
263     m->m_len -= iphlen;
264     m->m_pkthdr.len -= iphlen;
265     m->m_data += iphlen;
266     if (sbappendaddr(&inp->inp_socket->so_rcv, (struct sockaddr *) &udp_in,
267                      m, opts) == 0) {
268         udpstat.udps_fullsock++;
269         goto bad;
270     }
271     sorwakeup(inp->inp_socket);
272     return;
273   bad:
274     m_freem(m);
275     if (opts)
276         m_freem(opts);
277 }
------------------------------------------------------------------- udp_usrreq.c
231-236
The source IP address and source port number from the received IP datagram are stored in the global sockaddr_in
structure udp_in
. This structure is passed as an argument to sbappendaddr
later in the function.
Using a global to hold the IP address and port number is OK because udp_input
is single threaded. When this function is called by ipintr
it processes the received datagram completely before returning. Also, sbappendaddr
copies the socket address structure from the global into an mbuf.
237-244
The constant INP_CONTROLOPTS
is the combination of the three socket options that the process can set to cause control information to be returned through the recvmsg
system call for a UDP socket (Figure 22.5). The IP_RECVDSTADDR
socket option returns the destination IP address from the received UDP datagram as control information. The function udp_saveopt
allocates an mbuf of type MT_CONTROL
and stores the 4-byte destination IP address in the mbuf. We show this function in Section 23.8.
This socket option appeared with 4.3BSD Reno and was intended for applications such as TFTP, the Trivial File Transfer Protocol, that should not respond to client requests that are sent to a broadcast address. Unfortunately, even if the receiving application uses this option, it is nontrivial to determine if the destination IP address is a broadcast address or not (Exercise 23.6).
When the multicasting changes were added in 4.4BSD, this code was left in only for datagrams destined for a unicast address. We’ll see in Figure 23.26 that this option is not implemented for datagrams sent to a broadcast or multicast address. This defeats the purpose of the option!
Table 23.26.
udp_input
function: demultiplexing of broadcast and multicast datagrams.
--------------------------------------------------------------------- udp_usrreq.c
121     if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr)) ||
122         in_broadcast(ip->ip_dst, m->m_pkthdr.rcvif)) {
123         struct socket *last;
124         /*
125          * Deliver a multicast or broadcast datagram to *all* sockets
126          * for which the local and remote addresses and ports match
127          * those of the incoming datagram. This allows more than
128          * one process to receive multi/broadcasts on the same port.
129          * (This really ought to be done for unicast datagrams as
130          * well, but that would cause problems with existing
131          * applications that open both address-specific sockets and
132          * a wildcard socket listening to the same port -- they would
133          * end up receiving duplicates of every unicast datagram.
134          * Those applications open the multiple sockets to overcome an
135          * inadequacy of the UDP socket interface, but for backwards
136          * compatibility we avoid the problem here rather than
137          * fixing the interface. Maybe 4.5BSD will remedy this?)
138          */
139         /*
140          * Construct sockaddr format source address.
141          */
142         udp_in.sin_port = uh->uh_sport;
143         udp_in.sin_addr = ip->ip_src;
144         m->m_len -= sizeof(struct udpiphdr);
145         m->m_data += sizeof(struct udpiphdr);
146         /*
147          * Locate pcb(s) for datagram.
148          * (Algorithm copied from raw_intr().)
149          */
150         last = NULL;
151         for (inp = udb.inp_next; inp != &udb; inp = inp->inp_next) {
152             if (inp->inp_lport != uh->uh_dport)
153                 continue;
154             if (inp->inp_laddr.s_addr != INADDR_ANY) {
155                 if (inp->inp_laddr.s_addr !=
156                     ip->ip_dst.s_addr)
157                     continue;
158             }
159             if (inp->inp_faddr.s_addr != INADDR_ANY) {
160                 if (inp->inp_faddr.s_addr !=
161                     ip->ip_src.s_addr ||
162                     inp->inp_fport != uh->uh_sport)
163                     continue;
164             }
165             if (last != NULL) {
166                 struct mbuf *n;
167                 if ((n = m_copy(m, 0, M_COPYALL)) != NULL) {
168                     if (sbappendaddr(&last->so_rcv,
169                                      (struct sockaddr *) &udp_in,
170                                      n, (struct mbuf *) 0) == 0) {
171                         m_freem(n);
172                         udpstat.udps_fullsock++;
173                     } else
174                         sorwakeup(last);
175                 }
176             }
177             last = inp->inp_socket;
178             /*
179              * Don't look for additional matches if this one does
180              * not have either the SO_REUSEPORT or SO_REUSEADDR
181              * socket options set. This heuristic avoids searching
182              * through all pcbs in the common case of a non-shared
183              * port. It assumes that an application will never
184              * clear these options after setting them.
185              */
186             if ((last->so_options & (SO_REUSEPORT | SO_REUSEADDR) == 0))
187                 break;
188         }
189         if (last == NULL) {
190             /*
191              * No matching pcb found; discard datagram.
192              * (No need to send an ICMP Port Unreachable
193              * for a broadcast or multicast datgram.)
194              */
195             udpstat.udps_noportbcast++;
196             goto bad;
197         }
198         if (sbappendaddr(&last->so_rcv, (struct sockaddr *) &udp_in,
199                          m, (struct mbuf *) 0) == 0) {
200             udpstat.udps_fullsock++;
201             goto bad;
202         }
203         sorwakeup(last);
204         return;
205     }
--------------------------------------------------------------------- udp_usrreq.c
245-260
This code is commented out because it doesn’t work. The intent of the IP_RECVOPTS
socket option is to return the IP options from the received datagram as control information, and the intent of IP_RECVRETOPTS
socket option is to return source route information. The manipulation of the mp
variable by all three IP_RECV
socket options is to build a linked list of up to three mbufs that are then placed onto the socket’s buffer by sbappendaddr
. The code shown in Figure 23.25 only returns one option as control information, so the m_next
pointer of that mbuf is always a null pointer.
262-272
At this point the received datagram (the mbuf chain pointed to by m
), is ready to be placed onto the socket’s receive queue along with a socket address structure representing the sender’s IP address and port (udp_in
), and optional control information (the destination IP address, the mbuf pointed to by opts
). This is done by sbappendaddr
. Before calling this function, however, the pointer and lengths of the first mbuf on the chain are adjusted to ignore the IP and UDP headers. Before returning, sorwakeup
is called for the receiving socket to wake up any processes asleep on the socket’s receive queue.
273-276
If an error is encountered during UDP input processing, udp_input
jumps to the label bad
. The mbuf chain containing the datagram is released, along with the mbuf chain containing any control information (if present).
We now return to the portion of udp_input
that handles datagrams sent to a broadcast or multicast IP address. The code is shown in Figure 23.26.
121-138
As the comments indicate, these datagrams are delivered to all sockets that match, not just a single socket. The inadequacy of the UDP interface that is mentioned refers to the inability of a process to receive asynchronous errors on a UDP socket (notably ICMP port unreachables) unless the socket is connected. We described this in Section 22.11.
139-145
The source IP address and port number are saved in the global sockaddr_in
structure udp_in
, which is passed to sbappendaddr
. The mbuf chain’s length and data pointer are updated to ignore the IP and UDP headers.
146-164
The large for
loop scans each UDP PCB to find all matching PCBs. in_pcblookup
is not called for this demultiplexing because it returns only one PCB, whereas the broadcast or multicast datagram may be delivered to more than one PCB.
If the local port in the PCB doesn’t match the destination port from the received datagram, the entry is ignored. If the local address in the PCB is not the wildcard, it is compared to the destination IP address and the entry is skipped if they’re not equal. If the foreign address in the PCB is not a wildcard, it is compared to the source IP address and if they match, the foreign port must also match the source port. This last test assumes that if the socket is connected to a foreign IP address it must also be connected to a foreign port, and vice versa. This is the same logic we saw in in_pcblookup
.
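The three tests can be collected into a single predicate. The following is a sketch under stated assumptions, not the kernel code: the structures `demo_pcb` and `demo_dgram`, the constant `DEMO_ADDR_ANY`, and the function `pcb_matches` are hypothetical names that model only the fields examined on lines 152-164.

```c
#include <stdint.h>

#define DEMO_ADDR_ANY 0u   /* hypothetical wildcard, standing in for INADDR_ANY */

struct demo_pcb   { uint32_t laddr, faddr; uint16_t lport, fport; };
struct demo_dgram { uint32_t src, dst;     uint16_t sport, dport; };

/* Return 1 if the PCB should receive the datagram, 0 otherwise. */
int
pcb_matches(const struct demo_pcb *p, const struct demo_dgram *d)
{
    /* The local port must always match the destination port. */
    if (p->lport != d->dport)
        return 0;
    /* A nonwildcard local address must equal the destination address. */
    if (p->laddr != DEMO_ADDR_ANY && p->laddr != d->dst)
        return 0;
    /* A nonwildcard foreign address implies a connected socket:
     * both the foreign address and the foreign port must match. */
    if (p->faddr != DEMO_ADDR_ANY &&
        (p->faddr != d->src || p->fport != d->sport))
        return 0;
    return 1;
}
```

A PCB with only its local port bound matches any datagram sent to that port, while a connected PCB matches only datagrams from its peer.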
165-177
If this is not the first match found (last
is nonnull), a copy of the datagram is placed onto the receive queue for the previous match. Since sbappendaddr
releases the mbuf chain when it is done, a copy is first made by m_copy
. Any processes waiting for this data are awakened by sorwakeup
. A pointer to this matching socket
structure is saved in last
.
This use of the variable last
avoids calling m_copy
(an expensive operation since an entire mbuf chain is copied) unless there are multiple recipients for a given datagram. In the common case of a single recipient, the for
loop just sets last
to the single matching PCB, and when the loop terminates, sbappendaddr
places the mbuf chain onto the socket’s receive queue; a copy is not made.
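The saving can be seen by counting copies in a small model of the loop. The sketch below is hypothetical user-space code, not the kernel's: the array `matches` stands in for the PCB list (a nonzero entry means that PCB matches), and the counter `*copies` stands in for the calls to m_copy.

```c
#include <stddef.h>

/*
 * Deliver one datagram to every matching entry.  Returns the number of
 * deliveries; *copies is set to the number of copies that were needed.
 */
int
deliver_to_all(const int *matches, size_t n, int *copies)
{
    int delivered = 0;
    int have_last = 0;      /* models "last != NULL" */
    size_t i;

    *copies = 0;
    for (i = 0; i < n; i++) {
        if (!matches[i])
            continue;
        if (have_last) {
            /* The previous match receives a copy of the datagram. */
            (*copies)++;
            delivered++;
        }
        have_last = 1;      /* remember this match, deliver it later */
    }
    if (have_last)
        delivered++;        /* the final match receives the original */
    return delivered;
}
```

With k matching entries the loop makes k - 1 copies, so the common case of one recipient makes none.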
178-188
If this matching socket doesn’t have either the SO_REUSEPORT
or the SO_REUSEADDR
socket option set, then there’s no need to check for additional matches and the loop is terminated. The datagram is placed onto the single socket’s receive queue in the call to sbappendaddr
outside the loop.
189-197
If last
is null at the end of the loop, no matches were found. An ICMP error is not generated because the datagram was sent to a broadcast or multicast IP address.
198-204
The final matching entry (which could be the only matching entry) has the original datagram (m
) placed onto its receive queue. After sorwakeup
is called, udp_input
returns, since the processing of the broadcast or multicast datagram is complete.
The remainder of the function (shown previously in Figure 23.24) handles unicast datagrams.
There is a subtle problem when using a connected UDP socket to exchange datagrams with a process on a multihomed host. Datagrams from the peer may arrive with a different source IP address and will not be delivered to the connected socket.
Consider the example shown in Figure 23.27.
Three steps take place.
1. The client on bsdi creates a UDP socket and connects it to 140.252.1.29, the PPP interface on sun, not the Ethernet interface. A datagram is sent on the socket to the server.
2. The server on sun receives the datagram and accepts it, even though it arrives on an interface that differs from the destination IP address. (sun is acting as a router, so whether it implements the weak end system model or the strong end system model doesn’t matter.) The datagram is delivered to the server, which is waiting for client requests on an unconnected UDP socket.
3. The server sends a reply, but since the reply is being sent on an unconnected UDP socket, the source IP address for the reply is chosen by the kernel based on the outgoing interface (140.252.13.33). The destination IP address in the request is not used as the source address for the reply.
   When the reply is received by bsdi it is not delivered to the client’s connected UDP socket since the IP addresses don’t match. bsdi generates an ICMP port unreachable error since the reply can’t be demultiplexed. (This assumes that there is not another process on bsdi eligible to receive the datagram.)
The problem in this example is that the server does not use the destination IP address from the request as the source IP address of the reply. If it did, the problem wouldn’t exist, but this solution is nontrivial; see Exercise 23.10. We’ll see in Figure 28.16 that a TCP server uses the destination IP address from the client as the source IP address from the server, if the server has not explicitly bound a local IP address to its socket.
If a process specifies the IP_RECVDSTADDR socket option to receive the destination IP address from the received datagram, udp_saveopt is called by udp_input:
*mp = udp_saveopt((caddr_t) &ip->ip_dst, sizeof(struct in_addr), IP_RECVDSTADDR);
Figure 23.28 shows this function.
Table 23.28. udp_saveopt
function: create mbuf with control information.
---------------------------------------------------------------------- udp_usrreq.c
278 /*
279  * Create a "control" mbuf containing the specified data
280  * with the specified type for presentation with a datagram.
281  */
282 struct mbuf *
283 udp_saveopt(p, size, type)
284 caddr_t p;
285 int     size;
286 int     type;
287 {
288     struct cmsghdr *cp;
289     struct mbuf *m;
290     if ((m = m_get(M_DONTWAIT, MT_CONTROL)) == NULL)
291         return ((struct mbuf *) NULL);
292     cp = (struct cmsghdr *) mtod(m, struct cmsghdr *);
293     bcopy(p, CMSG_DATA(cp), size);
294     size += sizeof(*cp);
295     m->m_len = size;
296     cp->cmsg_len = size;
297     cp->cmsg_level = IPPROTO_IP;
298     cp->cmsg_type = type;
299     return (m);
300 }
---------------------------------------------------------------------- udp_usrreq.c
278-289
The arguments are p
, a pointer to the information to be stored in the mbuf (the destination IP address from the received datagram); size
, its size in bytes (4 in this example, the size of an IP address); and type
, the type of control information (IP_RECVDSTADDR
).
290-299
An mbuf is allocated, and since the code is executing at the software interrupt layer, M_DONTWAIT
is specified. The pointer cp
points to the data portion of the mbuf, and it is cast into a pointer to a cmsghdr
structure (Figure 16.14). The IP address is copied from the IP header into the data portion of the cmsghdr
structure by bcopy
. The length of the mbuf is then set (to 16 in this example), followed by the remainder of the cmsghdr
structure. Figure 23.29 shows the final state of the mbuf.
The cmsg_len
field contains the length of the cmsghdr
structure (12) plus the size of the cmsg_data
field (4 for this example). If the application calls recvmsg
to receive the control information, it must go through the cmsghdr
structure to determine the type and length of the cmsg_data
field.
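From the application's side, going through the cmsghdr structures might look like the following sketch. It uses the standard CMSG_FIRSTHDR, CMSG_NXTHDR, CMSG_DATA, CMSG_LEN, and CMSG_SPACE macros from `<sys/socket.h>`; the helper names `find_ip_cmsg` and `put_cmsg` are hypothetical, and `put_cmsg` exists only so that the example can be exercised without a socket. The header length is whatever the host defines, so the 12-byte figure above should not be assumed on other systems.

```c
#include <stddef.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Return a pointer to the data of the first IPPROTO_IP control message
 * of the given type, or NULL if none is present.  If datalen is
 * nonnull, the length of the data is stored through it.
 */
void *
find_ip_cmsg(struct msghdr *msg, int type, size_t *datalen)
{
    struct cmsghdr *cm;

    for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
        if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == type) {
            if (datalen != NULL)
                *datalen = cm->cmsg_len - CMSG_LEN(0);
            return CMSG_DATA(cm);
        }
    }
    return NULL;
}

/*
 * Demonstration helper: build one control message in buf, much as the
 * kernel would before handing it to recvmsg.  buf must be suitably
 * aligned for a struct cmsghdr.  Returns the space used, 0 on error.
 */
size_t
put_cmsg(void *buf, size_t buflen, int level, int type,
         const void *data, size_t len)
{
    struct cmsghdr *cm = buf;

    if (buflen < CMSG_SPACE(len))
        return 0;
    memset(buf, 0, CMSG_SPACE(len));
    cm->cmsg_len = CMSG_LEN(len);
    cm->cmsg_level = level;
    cm->cmsg_type = type;
    memcpy(CMSG_DATA(cm), data, len);
    return CMSG_SPACE(len);
}
```

After a successful recvmsg, a process that set IP_RECVDSTADDR would call `find_ip_cmsg` with that type and copy the 4-byte address out of the returned pointer.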
When icmp_input
receives an ICMP error (destination unreachable, parameter problem, redirect, source quench, and time exceeded) the corresponding protocol’s pr_ctlinput
function is called:
if (ctlfunc = inetsw[ ip_protox[icp->icmp_ip.ip_p] ].pr_ctlinput)
    (*ctlfunc)(code, (struct sockaddr *)&icmpsrc, &icp->icmp_ip);
For UDP, Figure 22.32 showed that the function udp_ctlinput
is called. We show this function in Figure 23.30.
Table 23.30. udp_ctlinput
function: process received ICMP errors.
--------------------------------------------------------------------- udp_usrreq.c
314 void
315 udp_ctlinput(cmd, sa, ip)
316 int     cmd;
317 struct sockaddr *sa;
318 struct ip *ip;
319 {
320     struct udphdr *uh;
321     extern struct in_addr zeroin_addr;
322     extern u_char inetctlerrmap[];
323     if (!PRC_IS_REDIRECT(cmd) &&
324         ((unsigned) cmd >= PRC_NCMDS || inetctlerrmap[cmd] == 0))
325         return;
326     if (ip) {
327         uh = (struct udphdr *) ((caddr_t) ip + (ip->ip_hl << 2));
328         in_pcbnotify(&udb, sa, uh->uh_dport, ip->ip_src, uh->uh_sport,
329                      cmd, udp_notify);
330     } else
331         in_pcbnotify(&udb, sa, 0, zeroin_addr, 0, cmd, udp_notify);
332 }
--------------------------------------------------------------------- udp_usrreq.c
314-322
The arguments are cmd
, one of the PRC_
xxx constants from Figure 11.19; sa
, a pointer to a sockaddr_in
structure containing the source IP address from the ICMP message; and ip
, a pointer to the IP header that caused the error. For the destination unreachable, parameter problem, source quench, and time exceeded errors, the pointer ip
points to the IP header that caused the error. But when udp_ctlinput
is called by pfctlinput
for redirects (Figure 22.32), sa
points to a sockaddr_in
structure containing the destination address that should be redirected, and ip
is a null pointer. There is no loss of information in this final case, since we saw in Section 22.11 that a redirect is applied to all TCP and UDP sockets connected to the destination address. The nonnull third argument is needed, however, for other errors, such as a port unreachable, since the protocol header following the IP header contains the unreachable port.
323-325
If the error is not a redirect, and either the PRC_
xxx value is too large or there is no error code in the global array inetctlerrmap
, the ICMP error is ignored. To understand this test we need to review what happens to a received ICMP message.
1. icmp_input converts the ICMP type and code into a PRC_xxx error code.
2. The PRC_xxx error code is passed to the protocol’s control-input function.
3. The Internet protocols (TCP and UDP) map the PRC_xxx error code into one of the Unix errno values using inetctlerrmap, and this value is returned to the process.
Figures 11.1 and 11.2 summarize this processing of ICMP messages.
Returning to Figure 23.30, we can see what happens to an ICMP source quench that arrives in response to a UDP datagram. icmp_input
converts the ICMP message into the error PRC_QUENCH
and udp_ctlinput
is called. But since the errno
column for this ICMP error is blank in Figure 11.2, the error is ignored.
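The filtering test on lines 323-325 can be modeled with a small table. This is a sketch only: the `DEMO_` constants and their numeric values are hypothetical and are not the kernel's PRC_xxx values, and the table holds just two entries, a blank (0) entry for source quench and ECONNREFUSED for port unreachable, which we take from Figure 11.2 as an assumption.

```c
#include <errno.h>

/* Hypothetical subset of the PRC_xxx codes; the values are illustrative. */
enum { DEMO_PRC_QUENCH, DEMO_PRC_UNREACH_PORT, DEMO_PRC_NCMDS };

/* A 0 entry means "no errno for this error": the error is ignored. */
static const int demo_errmap[DEMO_PRC_NCMDS] = {
    [DEMO_PRC_QUENCH]       = 0,
    [DEMO_PRC_UNREACH_PORT] = ECONNREFUSED,
};

/* Return 1 if the error should be passed on to the PCBs, 0 if ignored. */
int
demo_should_notify(int cmd, int is_redirect)
{
    /* The unsigned cast also rejects negative values of cmd. */
    if (!is_redirect &&
        ((unsigned) cmd >= (unsigned) DEMO_PRC_NCMDS || demo_errmap[cmd] == 0))
        return 0;
    return 1;
}
```

A source quench is dropped by this test, a port unreachable is passed on, and a redirect bypasses the table altogether.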
326-331
The function in_pcbnotify
notifies the appropriate PCBs of the ICMP error. If the third argument to udp_ctlinput
is nonnull, the source and destination UDP ports from the datagram that caused the error are passed to in_pcbnotify
along with the source IP address.
The final argument to in_pcbnotify
is a pointer to a function that in_pcbnotify
calls for each PCB that is to receive the error. The function for UDP is udp_notify
and we show it in Figure 23.31.
Table 23.31. udp_notify
function: notify process of an asynchronous error.
--------------------------------------------------------------------- udp_usrreq.c
305 static void
306 udp_notify(inp, errno)
307 struct inpcb *inp;
308 int     errno;
309 {
310     inp->inp_socket->so_error = errno;
311     sorwakeup(inp->inp_socket);
312     sowwakeup(inp->inp_socket);
313 }
--------------------------------------------------------------------- udp_usrreq.c
301-313
The errno
value, the second argument to this function, is stored in the socket’s so_error
variable. By setting this socket variable, the socket becomes readable and writable if the process calls select
. Any processes waiting to receive or send on the socket are then awakened to receive the error.
The protocol’s user-request function is called for a variety of operations. We saw in Figure 23.14 that a call to any one of the five write functions on a UDP socket ends up calling UDP’s user-request function with a request of PRU_SEND
.
Figure 23.32 shows the beginning and end of udp_usrreq
. The body of the switch
is discussed in separate figures following this figure. The function arguments are described in Figure 15.17.
Table 23.32. Body of udp_usrreq
function.
--------------------------------------------------------------------- udp_usrreq.c
417 int
418 udp_usrreq(so, req, m, addr, control)
419 struct socket *so;
420 int     req;
421 struct mbuf *m, *addr, *control;
422 {
423     struct inpcb *inp = sotoinpcb(so);
424     int     error = 0;
425     int     s;
426     if (req == PRU_CONTROL)
427         return (in_control(so, (int) m, (caddr_t) addr,
428                            (struct ifnet *) control));
429     if (inp == NULL && req != PRU_ATTACH) {
430         error = EINVAL;
431         goto release;
432     }
433     /*
434      * Note: need to block udp_input while changing
435      * the udp pcb queue and/or pcb addresses.
436      */
437     switch (req) {

        /* switch cases */

522     default:
523         panic("udp_usrreq");
524     }
525   release:
526     if (control) {
527         printf("udp control data unexpectedly retained\n");
528         m_freem(control);
529     }
530     if (m)
531         m_freem(m);
532     return (error);
533 }
--------------------------------------------------------------------- udp_usrreq.c
417-428
The PRU_CONTROL
request is from the ioctl
system call. The function in_control
processes the request completely.
429-432
The socket pointer was converted to the PCB pointer when inp
was declared at the beginning of the function. The only time a null PCB pointer is allowed is when a new socket is being created (PRU_ATTACH
).
433-436
The comment indicates that whenever entries are being added to or deleted from UDP’s PCB list, the code must be protected by splnet
. This is done because udp_usrreq
is called as part of a system call, and it doesn’t want to be interrupted by UDP input (called by IP input, which is called as a software interrupt) while it is modifying the doubly linked list of PCBs. UDP input is also blocked while modifying the local or foreign addresses or ports in a PCB, to prevent a received UDP datagram from being delivered incorrectly by in_pcblookup
.
We now discuss the individual case
statements. The PRU_ATTACH
request, shown in Figure 23.33, is from the socket
system call.
Table 23.33. udp_usrreq
function: PRU_ATTACH
and PRU_DETACH
requests.
---------------------------------------------------------------------- udp_usrreq.c
438     case PRU_ATTACH:
439         if (inp != NULL) {
440             error = EINVAL;
441             break;
442         }
443         s = splnet();
444         error = in_pcballoc(so, &udb);
445         splx(s);
446         if (error)
447             break;
448         error = soreserve(so, udp_sendspace, udp_recvspace);
449         if (error)
450             break;
451         ((struct inpcb *) so->so_pcb)->inp_ip.ip_ttl = ip_defttl;
452         break;
453     case PRU_DETACH:
454         udp_detach(inp);
455         break;
---------------------------------------------------------------------- udp_usrreq.c
438-447
If the socket structure already points to a PCB, EINVAL
is returned. in_pcballoc
allocates a new PCB, adds it to the front of UDP’s PCB list, and links the socket structure and the PCB to each other.
448-450
soreserve
reserves buffer space for a receive buffer and a send buffer for the socket. As noted in Figure 16.7, soreserve
just enforces system limits; the buffer space is not actually allocated. The default values for the send and receive buffer sizes are 9216 bytes (udp_sendspace
) and 41,600 bytes (udp_recvspace
). The former allows for a maximum UDP datagram size of 9200 bytes (to hold 8 Kbytes of data in an NFS packet), plus the 16-byte sockaddr_in
structure for the destination address. The latter allows for 40 1024-byte datagrams to be queued at one time for the socket. The process can change these defaults by calling setsockopt
.
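The two defaults follow from simple arithmetic, assuming (as the text states) a 16-byte sockaddr_in stored alongside each datagram. The sketch below only reproduces that arithmetic; the function names are hypothetical and are not how the kernel spells these variables.

```c
#include <netinet/in.h>

/* 9200-byte maximum datagram plus the destination sockaddr_in. */
unsigned long
demo_udp_sendspace(void)
{
    return 9200 + sizeof(struct sockaddr_in);
}

/* Room for 40 datagrams of 1024 bytes, each with its sockaddr_in. */
unsigned long
demo_udp_recvspace(void)
{
    return 40 * (1024 + sizeof(struct sockaddr_in));
}
```

On a system where `sizeof(struct sockaddr_in)` is 16, these evaluate to 9216 and 41,600 bytes, matching the values given above.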
451-452
There are two fields in the prototype IP header in the PCB that the process can change by calling setsockopt:
the TTL and the TOS. The TTL defaults to 64 (ip_defttl
) and the TOS defaults to 0 (normal service), since the PCB is initialized to 0 by in_pcballoc
.
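From a process, changing these two fields might look like the following sketch. It uses the standard IP_TTL and IP_TOS options at level IPPROTO_IP; the wrapper name `set_ttl_and_tos` is hypothetical, and the range of values a given system accepts is not something this sketch checks.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Set the TTL and TOS used for datagrams sent on the socket fd.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 */
int
set_ttl_and_tos(int fd, int ttl, int tos)
{
    if (setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
        return -1;
    return 0;
}
```

A process would call this on a UDP socket after creating it, for example `set_ttl_and_tos(fd, 32, 0)`, before sending its first datagram.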
453-455
The close
system call issues the PRU_DETACH
request. The function udp_detach
, shown in Figure 23.34, is called. This function is also called later in this section for the PRU_ABORT
request.
Table 23.34. udp_detach
function: delete a UDP PCB.
------------------------------------------------------------ udp_usrreq.c
534 static void
535 udp_detach(inp)
536 struct inpcb *inp;
537 {
538     int     s = splnet();
539     if (inp == udp_last_inpcb)
540         udp_last_inpcb = &udb;
541     in_pcbdetach(inp);
542     splx(s);
543 }
------------------------------------------------------------ udp_usrreq.c
If the last-received PCB pointer (the one-behind cache) points to the PCB being detached, the cache pointer is set to the head of the UDP list (udb
). The function in_pcbdetach
removes the PCB from UDP’s list and releases the PCB.
Returning to udp_usrreq
, a PRU_BIND
request is the result of the bind
system call and a PRU_LISTEN
request is the result of the listen
system call. Both are shown in Figure 23.35.
Table 23.35. udp_usrreq
function: PRU_BIND
and PRU_LISTEN
requests.
-------------------------------------------------------------- udp_usrreq.c
456     case PRU_BIND:
457         s = splnet();
458         error = in_pcbbind(inp, addr);
459         splx(s);
460         break;
461     case PRU_LISTEN:
462         error = EOPNOTSUPP;
463         break;
-------------------------------------------------------------- udp_usrreq.c
456-460
All the work for a PRU_BIND
request is done by in_pcbbind
.
461-463
The PRU_LISTEN
request is invalid for a connectionless protocol; it is used only by connection-oriented protocols.
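A process can observe this error directly. The sketch below creates a UDP socket and calls listen on it, returning the resulting errno; the function name `udp_listen_errno` is hypothetical. We would expect EOPNOTSUPP on the system described here, and other systems commonly behave the same way, but that is an assumption rather than something the sketch guarantees.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Call listen() on a fresh UDP socket.  Returns the errno from listen,
 * 0 if listen unexpectedly succeeds, or -1 if no socket could be created.
 */
int
udp_listen_errno(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int err;

    if (fd < 0)
        return -1;
    err = (listen(fd, 5) < 0) ? errno : 0;
    close(fd);
    return err;
}
```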
We mentioned earlier that a UDP application, either a client or server (normally a client), can call connect
. This fixes the foreign IP address and port number that this socket can send to or receive from. Figure 23.36 shows the PRU_CONNECT, PRU_CONNECT2
, and PRU_ACCEPT
requests.
Table 23.36. udp_usrreq
function: PRU_CONNECT, PRU_CONNECT2
, and PRU_ACCEPT
requests.
----------------------------------------------------------------- udp_usrreq.c
464     case PRU_CONNECT:
465         if (inp->inp_faddr.s_addr != INADDR_ANY) {
466             error = EISCONN;
467             break;
468         }
469         s = splnet();
470         error = in_pcbconnect(inp, addr);
471         splx(s);
472         if (error == 0)
473             soisconnected(so);
474         break;
475     case PRU_CONNECT2:
476         error = EOPNOTSUPP;
477         break;
478     case PRU_ACCEPT:
479         error = EOPNOTSUPP;
480         break;
----------------------------------------------------------------- udp_usrreq.c
464-474
If the socket is already connected, EISCONN
is returned. The socket should never be connected at this point, because a call to connect
on an already-connected UDP socket generates a PRU_DISCONNECT
request before this PRU_CONNECT
request. Otherwise in_pcbconnect
does all the work. If no errors are encountered, soisconnected
marks the socket structure as being connected.
475-477
The socketpair
system call issues the PRU_CONNECT2
request, which is defined only for the Unix domain protocols.
478-480
The PRU_ACCEPT
request is from the accept
system call, which is defined only for connection-oriented protocols.
The PRU_DISCONNECT
request can occur in two cases for a UDP socket:
When a connected UDP socket is closed, PRU_DISCONNECT
is called before PRU_DETACH
.
When a connect
is issued on an already-connected UDP socket, soconnect
issues the PRU_DISCONNECT
request before the PRU_CONNECT
request.
Figure 23.37 shows the PRU_DISCONNECT
request.
Table 23.37. udp_usrreq
function: PRU_DISCONNECT
request.
------------------------------------------------------------------ udp_usrreq.c
481     case PRU_DISCONNECT:
482         if (inp->inp_faddr.s_addr == INADDR_ANY) {
483             error = ENOTCONN;
484             break;
485         }
486         s = splnet();
487         in_pcbdisconnect(inp);
488         inp->inp_laddr.s_addr = INADDR_ANY;
489         splx(s);
490         so->so_state &= ~SS_ISCONNECTED;    /* XXX */
491         break;
------------------------------------------------------------------ udp_usrreq.c
If the socket is not already connected, ENOTCONN
is returned. Otherwise in_pcbdisconnect
sets the foreign IP address to 0.0.0.0 and the foreign port to 0. The local address is also set to 0.0.0.0, since this PCB variable could have been set by connect
.
A call to shutdown
specifying that the process has finished sending data generates the PRU_SHUTDOWN
request, although it is rare for a process to issue this system call for a UDP socket. Figure 23.38 shows the PRU_SHUTDOWN, PRU_SEND
, and PRU_ABORT
requests.
Table 23.38. udp_usrreq
function: PRU_SHUTDOWN, PRU_SEND
, and PRU_ABORT
requests.
------------------------------------------------------------------- udp_usrreq.c
492     case PRU_SHUTDOWN:
493         socantsendmore(so);
494         break;
495     case PRU_SEND:
496         return (udp_output(inp, m, addr, control));
497     case PRU_ABORT:
498         soisdisconnected(so);
499         udp_detach(inp);
500         break;
------------------------------------------------------------------- udp_usrreq.c
492-494
socantsendmore
sets the socket’s flags to prevent any future output.
495-496
In Figure 23.14 we showed how the five write functions ended up calling udp_usrreq
with a PRU_SEND
request. udp_output
sends the datagram. udp_usrreq
returns, to avoid falling through to the label release
(Figure 23.32), since the mbuf chain containing the data (m
) must not be released yet. IP output appends this mbuf chain to the appropriate interface output queue, and the device driver will release the mbuf when the data has been transmitted.
The only buffering of UDP output within the kernel is on the interface’s output queue. If there is room in the socket’s send buffer for the datagram and destination address, sosend
calls udp_usrreq
, which we see calls udp_output
. We saw in Figure 23.20 that ip_output
is then called, which calls ether_output
for an Ethernet, placing the datagram onto the interface’s output queue (if there is room). If the process calls sendto
faster than the interface can transmit the datagrams, ether_output
can return ENOBUFS
, which is returned to the process.
497-500
A PRU_ABORT
request should never be generated for a UDP socket, but if it is, the socket is disconnected and the PCB detached.
The PRU_SOCKADDR
and PRU_PEERADDR
requests are from the getsockname
and getpeername
system calls, respectively. These two requests, and the PRU_SENSE
request, are shown in Figure 23.39.
Table 23.39. udp_usrreq
function: PRU_SOCKADDR, PRU_PEERADDR
, and PRU_SENSE
requests.
-------------------------------------------------------------------- udp_usrreq.c
501     case PRU_SOCKADDR:
502         in_setsockaddr(inp, addr);
503         break;
504     case PRU_PEERADDR:
505         in_setpeeraddr(inp, addr);
506         break;
507     case PRU_SENSE:
508         /*
509          * fstat: don't bother with a blocksize.
510          */
511         return (0);
-------------------------------------------------------------------- udp_usrreq.c
501-506
The functions in_setsockaddr
and in_setpeeraddr
fetch the information from the PCB, storing the result in the addr
argument.
507-511
The fstat
system call generates the PRU_SENSE
request. The function returns OK, but doesn’t return any other information. We’ll see later that TCP returns the size of the send buffer as the st_blksize
element of the stat
structure.
The remaining seven PRU_
xxx requests, shown in Figure 23.40, are not supported for a UDP socket.
Table 23.40. udp_usrreq
function: unsupported requests.
--------------------------------------------------------------------- udp_usrreq.c
512     case PRU_SENDOOB:
513     case PRU_FASTTIMO:
514     case PRU_SLOWTIMO:
515     case PRU_PROTORCV:
516     case PRU_PROTOSEND:
517         error = EOPNOTSUPP;
518         break;
519     case PRU_RCVD:
520     case PRU_RCVOOB:
521         return (EOPNOTSUPP);    /* do not free mbuf's */
--------------------------------------------------------------------- udp_usrreq.c
There is a slight difference in how the last two are handled because PRU_RCVD doesn’t pass a pointer to an mbuf as an argument (m is a null pointer) and PRU_RCVOOB passes a pointer to an mbuf for the protocol to fill in. In both cases the error is immediately returned, without breaking out of the switch and releasing the mbuf chain. With PRU_RCVOOB the caller releases the mbuf that it allocated.
The sysctl function for UDP supports only a single option, the UDP checksum flag. The system administrator can enable or disable UDP checksums using the sysctl(8) program. Figure 23.41 shows the udp_sysctl function. This function calls sysctl_int to fetch or set the value of the integer udpcksum.
Table 23.41. udp_sysctl function.
--------------------------------------------------------------------- udp_usrreq.c
547 udp_sysctl(name, namelen, oldp, oldlenp, newp, newlen)
548 int    *name;
549 u_int   namelen;
550 void   *oldp;
551 size_t *oldlenp;
552 void   *newp;
553 size_t  newlen;
554 {
555     /* All sysctl names at this level are terminal. */
556     if (namelen != 1)
557         return (ENOTDIR);
558     switch (name[0]) {
559     case UDPCTL_CHECKSUM:
560         return (sysctl_int(oldp, oldlenp, newp, newlen, &udpcksum));
561     default:
562         return (ENOPROTOOPT);
563     }
564     /* NOTREACHED */
565 }
--------------------------------------------------------------------- udp_usrreq.c
In Section 22.12 we talked about some general features of PCB searching and how the code we’ve seen uses a linear search of the protocol’s PCB list. We now tie this together with the one-behind cache used by UDP in Figure 23.24.
The problem with the one-behind cache occurs when the cached PCB contains wildcard values (for either the local address, foreign address, or foreign port): the cached value never matches any received datagram. One solution tested in [Partridge and Pink 1993] is to modify the cache to not compare wildcarded values. That is, instead of comparing the foreign address in the PCB with the source address in the datagram, compare these two values only if the foreign address in the PCB is not a wildcard.
There’s a subtle problem with this approach [Partridge and Pink 1993]. Assume there are two sockets bound to local port 555. One has the remaining three elements wildcarded, while the other has connected to the foreign address 128.1.2.3 and the foreign port 1600. If we cache the first PCB and a datagram arrives from 128.1.2.3, port 1600, we can’t ignore comparing the foreign addresses just because the cached value has a wildcarded foreign address. This is called cache hiding. The cached PCB has hidden another PCB that is a better match in this example.
Getting around cache hiding requires more work when a new entry is added to or deleted from the cache. Those PCBs that hide other PCBs cannot be cached. This is not a problem, however, because the normal scenario is to have one socket per local port. The example we just gave with two sockets bound to local port 555, while possible (especially on a multihomed host), is rare.
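The difference between the two cache tests, and the way the relaxed test leads to cache hiding, can be sketched in a few lines of C. This is an illustration only, not the Net/3 code or the code from [Partridge and Pink 1993]: the structures pcb and dgram, both function names, and the use of 0 as the wildcard value are assumptions made for the sketch, and addresses are held in host byte order for readability.

```c
#include <stdint.h>

#define WILD 0u   /* assumed wildcard value for an address or port */

/* Hypothetical, simplified PCB: only the four demultiplexing fields. */
struct pcb {
    uint32_t laddr;   /* local address */
    uint16_t lport;   /* local port */
    uint32_t faddr;   /* foreign address */
    uint16_t fport;   /* foreign port */
};

/* Hypothetical summary of a received datagram's addressing. */
struct dgram {
    uint32_t src;     /* source address */
    uint16_t sport;   /* source port */
    uint32_t dst;     /* destination address */
    uint16_t dport;   /* destination port */
};

/* Strict test: all four fields must be equal, so a cached PCB that
 * contains any wildcard never matches a received datagram. */
int cache_match_exact(const struct pcb *p, const struct dgram *d)
{
    return p->lport == d->dport && p->fport == d->sport &&
           p->faddr == d->src && p->laddr == d->dst;
}

/* Relaxed test: a field is compared only if it is not a wildcard
 * in the cached PCB. The local port is always compared. */
int cache_match_wild(const struct pcb *p, const struct dgram *d)
{
    if (p->lport != d->dport)
        return 0;
    if (p->laddr != WILD && p->laddr != d->dst)
        return 0;
    if (p->faddr != WILD && p->faddr != d->src)
        return 0;
    if (p->fport != WILD && p->fport != d->sport)
        return 0;
    return 1;
}
```

With the two sockets on local port 555 from the example above, a datagram from 128.1.2.3, port 1600 satisfies cache_match_wild for the fully wildcarded PCB even though the connected PCB is the better match. This is why a PCB that hides another one must be kept out of the cache.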
The next enhancement tested in [Partridge and Pink 1993] is to also remember the PCB of the last datagram sent. This is motivated by [Mogul 1991], who shows that half of all datagrams received are replies to the last datagram that was sent. Cache hiding is a problem here also, so PCBs that would hide other PCBs are not cached.
[Partridge and Pink 1993] measured these two caches on a general-purpose system for around 100,000 received UDP datagrams. The results show a 57% hit rate for the last-received PCB cache and a 30% hit rate for the last-sent PCB cache. The amount of CPU time spent in udp_input is more than halved, compared to the version with no caching.
These two caches still depend on a certain amount of locality: that with a high probability the UDP datagram that just arrived is either from the same peer as the last UDP datagram received or from the peer to whom the last datagram was sent. The latter is typical for request-response applications that send a datagram and wait for a reply. [McKenney and Dove 1992] show that some applications, such as data entry into an online transaction processing (OLTP) system, don’t yield the high cache hit rates that [Partridge and Pink 1993] observed. As we mentioned in Section 22.12, placing the PCBs onto hash chains provided an order of magnitude improvement over the last-received and last-sent caches for a system with thousands of OLTP connections.
The next area for improving the implementation is to combine the copying of data between the process and the kernel with the calculation of the checksum. In Net/3, each byte of data is processed twice during an output operation: once when copied from the process into an mbuf (the function uiomove, which is called by sosend), and again when the UDP checksum is calculated (by the function in_cksum, which is called by udp_output). This happens on input as well as output.
[Partridge and Pink 1993] modified the UDP output processing from what we showed in Figure 23.14 so that a UDP-specific function named udp_sosend is called instead of sosend. This new function calculates the checksum of the UDP header and the pseudo-header in-line (instead of calling the general-purpose function in_cksum) and then copies the data from the process into an mbuf chain using a special function named in_uiomove (instead of the general-purpose uiomove). The in_uiomove function copies the data and updates the checksum in the same pass. The amount of time spent copying the data and calculating the checksum is reduced with this technique by about 40 to 45%.
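The idea of touching each byte only once can be shown with a small user-level sketch. It is not the in_uiomove function from [Partridge and Pink 1993]; the names copy_and_cksum and cksum_fold are hypothetical, and the code is written for clarity rather than speed. It assumes the standard Internet checksum (the 16-bit one's complement sum over big-endian 16-bit words) and that every chunk except the last has an even length when the function is called repeatedly.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst while adding them, as big-endian
 * 16-bit words, to the running 32-bit sum. A trailing odd byte is
 * treated as the high-order byte of a word padded with zero.
 * Returns the updated running sum (not yet folded or complemented).
 */
uint32_t copy_and_cksum(unsigned char *dst, const unsigned char *src,
                        size_t len, uint32_t sum)
{
    size_t i = 0;

    while (i + 1 < len) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
        i += 2;
    }
    if (i < len) {                  /* odd trailing byte */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    return sum;
}

/* Fold the carries back into the low 16 bits and complement. */
uint16_t cksum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)(~sum & 0xffff);
}
```

A caller would start the running sum from the value computed over the pseudo-header and UDP header, pass each chunk of user data through copy_and_cksum, and call cksum_fold once at the end.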
On the receive side the scenario is different. UDP calculates the checksum of the UDP header and the pseudo-header, removes the UDP header, and queues the data for the appropriate socket. When the application reads the data, a special version of soreceive (called udp_soreceive) completes the calculation of the checksum while copying the data into the user’s buffer. If the checksum is in error, however, the error is not detected until the entire datagram has been copied into the user’s buffer. In the normal case of a blocking socket, udp_soreceive just waits for the next datagram to arrive. But if the socket is nonblocking, the error EWOULDBLOCK must be returned if another datagram is not ready to be passed to the process. This implies two changes in the socket interface for a nonblocking read from a UDP socket:
The select function can indicate that a nonblocking UDP socket is readable, yet the error EWOULDBLOCK is unexpectedly returned by one of the read functions if the checksum fails.
Since a checksum error is detected after the datagram has been copied into the user’s buffer, the application’s buffer is changed even though no data is returned by the read.
Even with a blocking socket, if the datagram with the checksum error contains 100 bytes of data and the next datagram without an error contains 40 bytes of data, recvfrom returns a length of 40, but the 60 bytes that follow in the user’s buffer have also been modified.
[Partridge and Pink 1993] compare the timings for a copy versus a copy-with-checksum for six different computers. They show that the checksum is calculated for free during the copy operation on many architectures. This occurs when memory access speeds and CPU processing speeds are mismatched, as is true for many current RISC processors.
UDP is a simple, connectionless protocol, which is why we cover it before looking at TCP. UDP output is simple: IP and UDP headers are prepended to the user’s data, as much of the header is filled in as possible, and the result is passed to ip_output. The only complication is calculating the UDP checksum, which involves prepending a pseudo-header just for the checksum computation. We’ll encounter a similar pseudo-header for the calculation of the TCP checksum in Chapter 26.
When udp_input receives a datagram, it first performs a general validation (the length and checksum); the processing then differs depending on whether the destination IP address is a unicast address or a broadcast or multicast address. A unicast datagram is delivered to at most one process, but a broadcast or multicast datagram can be delivered to multiple processes. A one-behind cache is maintained for unicast datagrams, which maintains a pointer to the last Internet PCB for which a UDP datagram was received. We saw, however, that because of the prevalence of wildcard addressing with UDP applications, this cache is practically useless.
The udp_ctlinput function is called to handle received ICMP messages, and the udp_usrreq function handles the PRU_xxx requests from the socket layer.
23.1 | List the five types of mbuf chains that |
23.1 |
|
23.2 | What happens to the answer for the previous exercise when the process specifies IP options for the outgoing datagram? |
23.2 | IP options are passed to In scenarios 2, 3, 4, and 5, the first mbuf in the chain is always allocated by |
23.3 | Does a UDP client need to call |
23.3 | No. The function |
23.4 | What happens to the processor priority level in |
23.4 | The processor priority level is left at |
23.5 |
|
23.5 | No. |
23.6 | Assuming the |
23.6 | The application must call |
23.7 | Who releases the mbuf that |
23.7 |
|
23.8 | How can a process disconnect a connected UDP socket? That is, the process calls |
23.8 | To disconnect a connected UDP socket, call The manual page for |
23.9 | In our discussion of Figure 22.25 we noted that a UDP application that calls |
23.9 | Since an unconnected UDP socket is temporarily connected to the foreign IP address by |
23.10 | After discussing the problem with Figure 23.27, we mentioned that this problem would not exist if the server used the destination IP address from the request as the source IP address of the reply. Explain how the server could do this. |
23.10 | The server must set the |
23.11 | Implement changes to allow a process to perform path MTU discovery using UDP: the process must be able to set the “don’t fragment” bit in the resulting IP datagram and be told if the corresponding ICMP destination unreachable error is received. |
23.11 | Notice in |
23.12 | Does the variable |
23.12 | No. It is used only in the |
23.13 | Modify |
23.14 | Fix the one-behind cache in Figure 23.24. |
23.15 | Fix |
23.16 | Fix |