We start our detailed description of the TCP source code by looking at the various TCP timers. We encounter these timers throughout most of the TCP functions.
TCP maintains seven timers for each connection. They are briefly described here, in the approximate order of their occurrence during the lifetime of a connection.
A connection-establishment timer starts when a SYN is sent to establish a new connection. If a response is not received within 75 seconds, the connection establishment is aborted.
A retransmission timer is set when TCP sends data. If the data is not acknowledged by the other end when this timer expires, TCP retransmits the data. The value of this timer (i.e., the amount of time TCP waits for an acknowledgment) is calculated dynamically, based on the round-trip time measured by TCP for this connection, and based on the number of times this data segment has been retransmitted. The retransmission timer is bounded by TCP to be between 1 and 64 seconds.
A delayed ACK timer is set when TCP receives data that must be acknowledged, but need not be acknowledged immediately. Instead, TCP waits up to 200 ms before sending the ACK. If, during this 200-ms time period, TCP has data to send on this connection, the pending acknowledgment is sent along with the data (called piggybacking).
A persist timer is set when the other end of a connection advertises a window of 0, stopping TCP from sending data. Since window advertisements from the other end are not sent reliably (that is, ACKs are not acknowledged, only data is acknowledged), a future window update that would allow TCP to send some data can be lost. Therefore, if TCP has data to send and the other end advertises a window of 0, the persist timer is set, and when it expires, 1 byte of data is sent to see if the window has opened. Like the retransmission timer, the persist timer value is calculated dynamically, based on the round-trip time. Its value is bounded by TCP to be between 5 and 60 seconds.
A keepalive timer can be set by the process using the SO_KEEPALIVE
socket option. If the connection is idle for 2 hours, the keepalive timer expires and a special segment is sent to the other end, forcing it to respond. If the expected response is received, TCP knows that the other host is still up, and TCP won’t probe it again until the connection is idle for another 2 hours. Other responses to the keepalive probe tell TCP that the other host has crashed and rebooted. If no response is received to a fixed number of keepalive probes, TCP assumes that the other end has crashed, although it can’t distinguish between the other end being down (i.e., it crashed and has not yet rebooted) and a temporary lack of connectivity to the other end (i.e., an intermediate router or phone line is down).
A FIN_WAIT_2 timer. When a connection moves from the FIN_WAIT_1 state to the FIN_WAIT_2 state (Figure 24.15) and the connection cannot receive any more data (implying the process called close, instead of taking advantage of TCP’s half-close with shutdown), this timer is set to 10 minutes. When this timer expires it is reset to 75 seconds, and when it expires the second time the connection is dropped. The purpose of this timer is to avoid leaving a connection in the FIN_WAIT_2 state forever, if the other end never sends a FIN. (We don’t show this timeout in Figure 24.15.)
A TIME_WAIT timer, often called the 2MSL timer. The term 2MSL means twice the MSL, the maximum segment lifetime defined in Section 24.8. It is set when a connection enters the TIME_WAIT state (Figure 24.15), that is, when the connection is actively closed. Section 18.6 of Volume 1 describes the reasoning for the 2MSL wait state in detail. The timer is set to 1 minute (Net/3 uses an MSL of 30 seconds) when the connection enters the TIME_WAIT state and when it expires, the TCP control block and Internet PCB are deleted, allowing that socket pair to be reused.
TCP has two timer functions: one is called every 200 ms (the fast timer) and the other every 500 ms (the slow timer). The delayed ACK timer is different from the other six: when the delayed ACK timer is set for a connection it means that a delayed ACK must be sent the next time the 200-ms timer expires (i.e., the elapsed time is between 0 and 200 ms). The other six timers are decremented every 500 ms, and only when the counter reaches 0 does the corresponding action take place.
The delayed ACK timer is enabled for a connection when the TF_DELACK flag (Figure 24.14) is set in the TCP control block. The array t_timer in the TCP control block contains four (TCPT_NTIMERS) counters used to implement the other six timers. The indexes into this array are shown in Figure 25.1. We describe briefly how the six timers (other than the delayed ACK timer) are implemented by these four counters.
Each entry in the t_timer array contains the number of 500-ms clock ticks until the timer expires, with 0 meaning that the timer is not set. Since each timer is a short, if 16 bits hold a short, the maximum timer value is 32,767 ticks, which is 16,383.5 seconds, or about 4.5 hours.
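The tick arithmetic can be sketched in a few lines of C. This is only an illustration: PR_SLOWHZ is the constant used later in this chapter, but the two helper functions are hypothetical and do not appear in Net/3.

```c
#include <assert.h>
#include <limits.h>

#define PR_SLOWHZ 2   /* slow-timer ticks per second (one tick = 500 ms) */

/* Hypothetical helper (not part of Net/3): whole seconds -> 500-ms ticks. */
static short secs_to_ticks(int secs)
{
    /* The result must fit in the short used for each t_timer entry. */
    assert(secs >= 0 && secs <= SHRT_MAX / PR_SLOWHZ);
    return (short)(secs * PR_SLOWHZ);
}

/* Longest interval, in seconds, that one short counter can represent. */
static double max_timer_secs(void)
{
    return (double)SHRT_MAX / PR_SLOWHZ;
}
```

With a 16-bit short, secs_to_ticks(7200) gives the 14400 ticks of the 2-hour keepalive idle time, and max_timer_secs() gives 16383.5.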
Notice in Figure 25.1 that four “timer counters” implement six TCP “timers,” because some of the timers are mutually exclusive. We’ll distinguish between the counters and the timers. The TCPT_KEEP counter implements both the keepalive timer and the connection-establishment timer, since the two timers are never used at the same time for a connection. Similarly, the 2MSL timer and the FIN_WAIT_2 timer are implemented using the TCPT_2MSL counter, since a connection is only in one state at a time. The first section of Figure 25.2 summarizes the implementation of the seven TCP timers. The second and third sections of the table show how four of the seven timers are initialized using three global variables from Figure 24.3 and two constants from Figure 25.3. Notice that two of the three globals are used with multiple timers. We’ve already said that the delayed ACK timer is tied to TCP’s 200-ms timer, and we describe how the other two timers are set later in this chapter.
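Figure 25.2. Implementation of the seven TCP timers.
| | conn. estab. | rexmit | delayed ACK | persist | keepalive | FIN_WAIT_2 | 2MSL |
|---|---|---|---|---|---|---|---|
| t_timer[TCPT_REXMT] | | • | | | | | |
| t_timer[TCPT_PERSIST] | | | | • | | | |
| t_timer[TCPT_KEEP] | • | | | | • | | |
| t_timer[TCPT_2MSL] | | | | | | • | • |
| t_flags & TF_DELACK | | | • | | | | |
| tcp_keepidle (2 hr) | | | | | • | | |
| tcp_keepintvl (75 sec) | | | | | • | • | |
| tcp_maxidle (10 min) | | | | | • | • | |
| 2 × TCPTV_MSL (60 sec) | | | | | | | • |
| TCPTV_KEEP_INIT (75 sec) | • | | | | | | |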
Figure 25.3. Fundamental timer values for the implementation.
| Constant | #500-ms clock ticks | #sec | Description |
|---|---|---|---|
| TCPTV_MSL | 60 | 30 | MSL, maximum segment lifetime |
| TCPTV_MIN | 2 | 1 | minimum value of retransmission timer |
| TCPTV_REXMTMAX | 128 | 64 | maximum value of retransmission timer |
| TCPTV_PERSMIN | 10 | 5 | minimum value of persist timer |
| TCPTV_PERSMAX | 120 | 60 | maximum value of persist timer |
| TCPTV_KEEP_INIT | 150 | 75 | connection-establishment timer value |
| TCPTV_KEEP_IDLE | 14400 | 7200 | idle time for connection before first probe (2 hours) |
| TCPTV_KEEPINTVL | 150 | 75 | time between probes when no response |
| TCPTV_SRTTBASE | 0 | | special value to denote no measurements yet for connection |
| TCPTV_SRTTDFLT | 6 | 3 | default RTT when no measurements yet for connection |
Figure 25.3 shows the fundamental timer values for the Net/3 implementation.
Figure 25.4 shows other timer constants that we’ll encounter.
The TCPT_RANGESET macro, shown in Figure 25.5, sets a timer to a given value, making certain the value is between the specified minimum and maximum.

Figure 25.5. TCPT_RANGESET macro (tcp_timer.h).

    #define TCPT_RANGESET(tv, value, tvmin, tvmax) { \
        (tv) = (value); \
        if ((tv) < (tvmin)) \
            (tv) = (tvmin); \
        else if ((tv) > (tvmax)) \
            (tv) = (tvmax); \
    }
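As a usage sketch, the macro can be wrapped in a small function that clamps a computed retransmission timeout. The macro body and the two bounds follow the figures in this chapter; the wrapper clamp_rexmt is a hypothetical name introduced only for this example.

```c
#include <assert.h>

#define TCPTV_MIN       2    /* 1 second, in 500-ms ticks */
#define TCPTV_REXMTMAX  128  /* 64 seconds, in 500-ms ticks */

#define TCPT_RANGESET(tv, value, tvmin, tvmax) { \
    (tv) = (value); \
    if ((tv) < (tvmin)) \
        (tv) = (tvmin); \
    else if ((tv) > (tvmax)) \
        (tv) = (tvmax); \
}

/* Hypothetical wrapper: clamp a computed timeout (in ticks) to the
 * retransmission timer's bounds. */
static int clamp_rexmt(int computed_ticks)
{
    int tv;

    assert(TCPTV_MIN <= TCPTV_REXMTMAX);
    TCPT_RANGESET(tv, computed_ticks, TCPTV_MIN, TCPTV_REXMTMAX);
    return tv;
}
```

Because the macro expands to a bare brace block, writing it as the body of an if that is followed by an else, with the usual trailing semicolon, does not compile. Enclosing the call in its own braces avoids the problem.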
We see in Figure 25.3 that the retransmission timer and the persist timer have upper and lower bounds, since their values are calculated dynamically, based on the measured round-trip time. The other timers are set to constant values.
There is one additional timer that we allude to in Figure 25.4 but don’t discuss in this chapter: the linger timer for a socket, set by the SO_LINGER socket option. This is a socket-level timer used by the close system call (Section 15.15). We will see in Figure 30.12 that when a socket is closed, TCP checks whether this socket option is set and whether the linger time is 0. If so, the connection is aborted with an RST instead of TCP’s normal close.
The function tcp_canceltimers, shown in Figure 25.6, is called by tcp_input when the TIME_WAIT state is entered. All four timer counters are set to 0, which turns off the retransmission, persist, keepalive, and FIN_WAIT_2 timers, before tcp_input sets the 2MSL timer.

Figure 25.6. tcp_canceltimers function (tcp_timer.c).

    void
    tcp_canceltimers(tp)
    struct tcpcb *tp;
    {
        int     i;
        for (i = 0; i < TCPT_NTIMERS; i++)
            tp->t_timer[i] = 0;
    }
The function tcp_fasttimo, shown in Figure 25.7, is called by pr_fasttimo every 200 ms. It handles only the delayed ACK timer.

Figure 25.7. tcp_fasttimo function, which is called every 200 ms (tcp_timer.c).

    41 void
    42 tcp_fasttimo()
    43 {
    44     struct inpcb *inp;
    45     struct tcpcb *tp;
    46     int     s = splnet();
    47     inp = tcb.inp_next;
    48     if (inp)
    49         for (; inp != &tcb; inp = inp->inp_next)
    50             if ((tp = (struct tcpcb *) inp->inp_ppcb) &&
    51                 (tp->t_flags & TF_DELACK)) {
    52                 tp->t_flags &= ~TF_DELACK;
    53                 tp->t_flags |= TF_ACKNOW;
    54                 tcpstat.tcps_delack++;
    55                 (void) tcp_output(tp);
    56             }
    57     splx(s);
    58 }
Each Internet PCB on the TCP list that has a corresponding TCP control block is checked. If the TF_DELACK flag is set, it is cleared and the TF_ACKNOW flag is set instead. tcp_output is called, and since the TF_ACKNOW flag is set, an ACK is sent.
How can TCP have an Internet PCB on its PCB list that doesn’t have a TCP control block (the test at line 50)? When a socket is created (the PRU_ATTACH request, in response to the socket system call) we’ll see in Figure 30.11 that the creation of the Internet PCB is done first, followed by the creation of the TCP control block. Between these two operations a high-priority clock interrupt can occur (Figure 1.13), which calls tcp_fasttimo.
The function tcp_slowtimo, shown in Figure 25.8, is called by pr_slowtimo every 500 ms. It handles the other six TCP timers: connection establishment, retransmission, persist, keepalive, FIN_WAIT_2, and 2MSL.

Figure 25.8. tcp_slowtimo function, which is called every 500 ms (tcp_timer.c).

     64 void
     65 tcp_slowtimo()
     66 {
     67     struct inpcb *ip, *ipnxt;
     68     struct tcpcb *tp;
     69     int     s = splnet();
     70     int     i;
     71     tcp_maxidle = TCPTV_KEEPCNT * tcp_keepintvl;
     72     /*
     73      * Search through tcb's and update active timers.
     74      */
     75     ip = tcb.inp_next;
     76     if (ip == 0) {
     77         splx(s);
     78         return;
     79     }
     80     for (; ip != &tcb; ip = ipnxt) {
     81         ipnxt = ip->inp_next;
     82         tp = intotcpcb(ip);
     83         if (tp == 0)
     84             continue;
     85         for (i = 0; i < TCPT_NTIMERS; i++) {
     86             if (tp->t_timer[i] && --tp->t_timer[i] == 0) {
     87                 (void) tcp_usrreq(tp->t_inpcb->inp_socket,
     88                                   PRU_SLOWTIMO, (struct mbuf *) 0,
     89                                   (struct mbuf *) i, (struct mbuf *) 0);
     90                 if (ipnxt->inp_prev != ip)
     91                     goto tpgone;
     92             }
     93         }
     94         tp->t_idle++;
     95         if (tp->t_rtt)
     96             tp->t_rtt++;
     97 tpgone:
     98         ;
     99     }
    100     tcp_iss += TCP_ISSINCR / PR_SLOWHZ;     /* increment iss */
    101     tcp_now++;                  /* for timestamps */
    102     splx(s);
    103 }
71
tcp_maxidle is initialized to 10 minutes. This is the maximum amount of time TCP will send keepalive probes to another host, waiting for a response from that host. This variable is also used with the FIN_WAIT_2 timer, as we describe in Section 25.6. This initialization statement could be moved to tcp_init, since it only needs to be evaluated when the system is initialized (see Exercise 25.2).
72-89
Each Internet PCB on the TCP list that has a corresponding TCP control block is checked. Each of the four timer counters for each connection is tested, and if nonzero, the counter is decremented. When the timer reaches 0, a PRU_SLOWTIMO request is issued. We’ll see that this request calls the function tcp_timers, which we describe later in this chapter.
The fourth argument to tcp_usrreq is a pointer to an mbuf. But this argument is actually used for different purposes when the mbuf pointer is not required. Here we see the index i is passed, telling the request which timer has expired. The funny-looking cast of i to an mbuf pointer is to avoid a compile-time error.
90-93
Before examining the timers for a control block, a pointer to the next Internet PCB is saved in ipnxt. Each time the PRU_SLOWTIMO request returns, tcp_slowtimo checks whether the next PCB in the TCP list still points to the PCB that’s being processed. If not, the control block has been deleted (perhaps the 2MSL timer expired, or the retransmission timer expired and TCP is giving up on this connection), causing a jump to tpgone, skipping the remaining timers for this control block, and moving on to the next PCB.
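The save-the-next-pointer idiom can be shown on a toy circular list. This is a minimal sketch, not the Net/3 code: the struct, the ttl and visits fields, and all function names are hypothetical, and list_remove stands in for the deletion that tcp_close would perform (without freeing memory).

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next, *prev;
    int ttl;      /* ticks until the node removes itself; 0 = never */
    int visits;   /* incremented only when the node survives a pass */
};

static void list_init(struct node *head)
{
    head->next = head->prev = head;
}

static void list_append(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_remove(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
}

/* One pass over the list; returns how many nodes removed themselves. */
static int slow_pass(struct node *head)
{
    struct node *n, *nxt;
    int removed = 0;

    assert(head != NULL);
    for (n = head->next; n != head; n = nxt) {
        nxt = n->next;              /* saved before n can disappear */
        if (n->ttl > 0 && --n->ttl == 0)
            list_remove(n);         /* stands in for tcp_close() */
        if (nxt->prev != n) {       /* n is no longer on the list */
            removed++;
            continue;               /* do not touch n again */
        }
        n->visits++;
    }
    return removed;
}
```

The check works because unlinking a node rewrites the back pointer of its successor, so the successor no longer points at the node being processed.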
94
t_idle is incremented for the control block. This counts the number of 500-ms clock ticks since the last segment was received on this connection. It is set to 0 by tcp_input when a segment is received on the connection and used for three purposes: (1) by the keepalive algorithm to send a probe after the connection is idle for 2 hours, (2) to drop a connection in the FIN_WAIT_2 state that is idle for 10 minutes and 75 seconds, and (3) by tcp_output to return to the slow start algorithm after the connection has been idle for a while.
95-96
If this connection is timing an outstanding segment, t_rtt is nonzero and counts the number of 500-ms clock ticks until that segment is acknowledged. It is initialized to 1 by tcp_output when a segment is transmitted whose RTT should be timed. tcp_slowtimo increments this counter.
100
tcp_iss was initialized to 1 by tcp_init. Every 500 ms it is incremented by 64,000: 128,000 (TCP_ISSINCR) divided by 2 (PR_SLOWHZ). This is a rate of about once every 8 microseconds, although tcp_iss is incremented only twice a second. We’ll see that tcp_iss is also incremented by 64,000 each time a connection is established, either actively or passively.
RFC 793 specifies that the initial sequence number should increment roughly every 4 microseconds, or 250,000 times a second. The Net/3 value increments at about one-half this rate.
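The two rates can be checked with a few lines of arithmetic. The constants are the ones named in the text; the helper functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define TCP_ISSINCR 128000L  /* ISS increments per second */
#define PR_SLOWHZ   2        /* slow-timer ticks per second */

/* Amount added to tcp_iss on each 500-ms tick. */
static long iss_step_per_tick(void)
{
    return TCP_ISSINCR / PR_SLOWHZ;
}

/* Average spacing, in microseconds, between increments at a given rate. */
static double usec_per_increment(long per_sec)
{
    assert(per_sec > 0);
    return 1e6 / (double)per_sec;
}
```

Here usec_per_increment(128000) is 7.8125, the "about 8 microseconds" of Net/3, while usec_per_increment(250000) is the 4 microseconds of RFC 793.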
101
tcp_now is initialized to 0 on bootstrap and incremented every 500 ms. It is used by the timestamp option defined in RFC 1323 [Jacobson, Braden, and Borman 1992], which we describe in Section 26.6.
75-79
Notice that if there are no TCP connections active on the host (tcb.inp_next is null), neither tcp_iss nor tcp_now is incremented. This would occur only when the system is being initialized, since it would be rare to find a Unix system attached to a network without a few TCP servers active.
The function tcp_timers is called by TCP’s PRU_SLOWTIMO request (Figure 30.10):

    case PRU_SLOWTIMO:
        tp = tcp_timers(tp, (int)nam);

when any one of the four TCP timer counters reaches 0 (Figure 25.8).
The structure of the function is a switch statement with one case per timer, as outlined in Figure 25.9.

Figure 25.9. tcp_timers function: general organization (tcp_timer.c).

    120 struct tcpcb *
    121 tcp_timers(tp, timer)
    122 struct tcpcb *tp;
    123 int     timer;
    124 {
    125     int     rexmt;
    126     switch (timer) {
                /* switch cases */
    256     }
    257     return (tp);
    258 }
We now discuss three of the four timer counters (five of TCP’s timers), saving the retransmission timer for Section 25.11.
TCP’s TCPT_2MSL counter implements two of TCP’s timers.
FIN_WAIT_2 timer. When tcp_input moves from the FIN_WAIT_1 state to the FIN_WAIT_2 state and the socket cannot receive any more data (implying the process called close, instead of taking advantage of TCP’s half-close with shutdown), the FIN_WAIT_2 timer is set to 10 minutes (tcp_maxidle). We’ll see that this prevents the connection from staying in the FIN_WAIT_2 state forever.
2MSL timer. When TCP enters the TIME_WAIT state, the 2MSL timer is set to 60 seconds (TCPTV_MSL times 2).
Figure 25.10 shows the case for the 2MSL timer executed when the timer reaches 0.
Figure 25.10. tcp_timers function: expiration of 2MSL timer counter (tcp_timer.c).

    127     /*
    128      * 2 MSL timeout in shutdown went off. If we're closed but
    129      * still waiting for peer to close and connection has been idle
    130      * too long, or if 2MSL time is up from TIME_WAIT, delete connection
    131      * control block. Otherwise, check again in a bit.
    132      */
    133 case TCPT_2MSL:
    134     if (tp->t_state != TCPS_TIME_WAIT &&
    135         tp->t_idle <= tcp_maxidle)
    136         tp->t_timer[TCPT_2MSL] = tcp_keepintvl;
    137     else
    138         tp = tcp_close(tp);
    139     break;
127-139
The puzzling logic in the conditional is because the two different uses of the TCPT_2MSL counter are intermixed (Exercise 25.4). Let’s first look at the TIME_WAIT state. When the timer expires after 60 seconds, tcp_close is called and the control blocks are released. We have the scenario shown in Figure 25.11.
This figure shows the series of function calls that occurs when the 2MSL timer expires. We also see that setting one of the timers for N seconds in the future (2 x N ticks), causes the timer to expire somewhere between 2 x N - 1 and 2 x N ticks in the future, since the time until the first decrement of the counter is between 0 and 500 ms in the future.
127-139
If the connection state is not TIME_WAIT, the TCPT_2MSL counter is the FIN_WAIT_2 timer. As soon as the connection has been idle for more than 10 minutes (tcp_maxidle) the connection is closed. But if the connection has been idle for less than or equal to 10 minutes, the FIN_WAIT_2 timer is reset for 75 seconds in the future. Figure 25.12 shows the typical scenario.
The connection moves from the FIN_WAIT_1 state to the FIN_WAIT_2 state on the receipt of an ACK (Figure 24.15). Receiving this ACK sets t_idle to 0 and the FIN_WAIT_2 timer is set to 1200 (tcp_maxidle). In Figure 25.12 we show the up arrow just to the right of the tick mark starting the 10-minute period, to reiterate that the first decrement of the counter occurs between 0 and 500 ms after the counter is set. After 1200 ticks the timer expires, but since t_idle is incremented after the test and decrement of the four counters in Figure 25.8, t_idle is 1199. (We assume the connection is idle for this 10-minute period.) The comparison of 1199 as less than or equal to 1200 is true, so the FIN_WAIT_2 timer is set to 150 (tcp_keepintvl). When the timer expires again in 75 seconds, assuming the connection is still idle, t_idle is now 1349, the test is false, and tcp_close is called.
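The tick counts in this scenario can be reproduced with a small simulation. It is a sketch under simplifying assumptions: the struct, the state codes, and the function names are hypothetical, tcp_close is reduced to setting a flag, and the connection is assumed to stay idle throughout. The order of operations in slow_tick (counter first, idle count second) follows Figure 25.8.

```c
#include <assert.h>

enum { ST_FIN_WAIT_2, ST_TIME_WAIT };   /* placeholder state codes */

#define KEEPINTVL 150    /* tcp_keepintvl: 75 s in 500-ms ticks */
#define MAXIDLE   1200   /* tcp_maxidle: 10 min in 500-ms ticks */
#define MAX_LOG   4

struct conn {
    int state;
    int timer_2msl;              /* stands in for t_timer[TCPT_2MSL] */
    int idle;                    /* stands in for t_idle */
    int closed;
    int nexpiry;                 /* number of expirations seen */
    int idle_at_expiry[MAX_LOG]; /* t_idle observed at each expiration */
};

/* Same test as the TCPT_2MSL case, with tcp_close() reduced to a flag. */
static void on_2msl_expiry(struct conn *c)
{
    if (c->nexpiry < MAX_LOG)
        c->idle_at_expiry[c->nexpiry] = c->idle;
    c->nexpiry++;
    if (c->state != ST_TIME_WAIT && c->idle <= MAXIDLE)
        c->timer_2msl = KEEPINTVL;
    else
        c->closed = 1;
}

/* One slow-timer tick: the counter is handled first, then idle grows. */
static void slow_tick(struct conn *c)
{
    if (c->timer_2msl && --c->timer_2msl == 0)
        on_2msl_expiry(c);
    c->idle++;
}

/* Tick an idle connection until it is closed; returns ticks elapsed. */
static int ticks_until_closed(struct conn *c)
{
    int ticks = 0;

    assert(c->timer_2msl > 0 && !c->closed);
    while (!c->closed) {
        slow_tick(c);
        ticks++;
    }
    return ticks;
}
```

Starting in the FIN_WAIT_2 state with the counter at 1200, the connection closes after 1350 ticks (10 minutes and 75 seconds), and the idle counts seen at the two expirations are 1199 and 1349. Starting in the TIME_WAIT state with the counter at 120, it closes at the first expiration.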
The reason for the 75-second timeout after the first 10-minute timeout is as follows: a connection in the FIN_WAIT_2 state is not dropped until the connection has been idle for more than 10 minutes. There’s no reason to test t_idle until at least 10 minutes have expired, but once this time has passed, the value of t_idle is checked every 75 seconds. Since a duplicate segment could be received, say a duplicate of the ACK that moved the connection from the FIN_WAIT_1 state to the FIN_WAIT_2 state, the 10-minute wait is restarted when the segment is received (since t_idle will be set to 0).
Terminating an idle connection after more than 10 minutes in the FIN_WAIT_2 state violates the protocol specification, but this is practical. In the FIN_WAIT_2 state the process has called close, all outstanding data on the connection has been sent and acknowledged, the other end has acknowledged the FIN, and TCP is waiting for the process at the other end of the connection to issue its close. If the other process never closes its end of the connection, our end can remain in the FIN_WAIT_2 state forever. A counter should be maintained for the number of connections terminated for this reason, to see how often this occurs.
Figure 25.13 shows the case for when the persist timer expires.

Figure 25.13. tcp_timers function: expiration of persist timer (tcp_timer.c).

    210     /*
    211      * Persistence timer into zero window.
    212      * Force a byte to be output, if possible.
    213      */
    214 case TCPT_PERSIST:
    215     tcpstat.tcps_persisttimeo++;
    216     tcp_setpersist(tp);
    217     tp->t_force = 1;
    218     (void) tcp_output(tp);
    219     tp->t_force = 0;
    220     break;
210-220
When the persist timer expires, there is data to send on the connection but TCP has been stopped by the other end’s advertisement of a zero-sized window. tcp_setpersist calculates the next value for the persist timer and stores it in the TCPT_PERSIST counter. The flag t_force is set to 1, forcing tcp_output to send 1 byte, even though the window advertised by the other end is 0.
Figure 25.14 shows typical values of the persist timer for a LAN, assuming the retransmission timeout for the connection is 1.5 seconds (see Figure 22.1 of Volume 1).
Once the value of the persist timer reaches 60 seconds, TCP continues sending window probes every 60 seconds. The reason the first two values are both 5, and not 1.5 and 3, is that the persist timer is lower bounded at 5 seconds. It is also upper bounded at 60 seconds. The multiplication of each value by 2 to give the next value is called an exponential backoff, and we describe how it is calculated in Section 25.9.
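The sequence of persist intervals can be sketched as a doubling that is then clamped to the persist bounds. This is an illustration only, not the tcp_setpersist code: the function name is hypothetical, the base value is taken to be the 1.5-second retransmission timeout (3 ticks) assumed above, and capping the multiplier at 64 is an assumption of this sketch.

```c
#include <assert.h>

#define TCPTV_PERSMIN 10    /* 5 seconds, in 500-ms ticks */
#define TCPTV_PERSMAX 120   /* 60 seconds, in 500-ms ticks */

/* Hypothetical helper: persist interval in ticks for the given backoff
 * step, starting from a base timeout of rto_ticks. */
static int persist_ticks(int rto_ticks, int shift)
{
    int mult, tv;

    assert(rto_ticks > 0 && shift >= 0);
    mult = 1 << (shift < 6 ? shift : 6);    /* 1, 2, 4, ..., 64 (assumed cap) */
    tv = rto_ticks * mult;
    if (tv < TCPTV_PERSMIN)
        tv = TCPTV_PERSMIN;
    else if (tv > TCPTV_PERSMAX)
        tv = TCPTV_PERSMAX;
    return tv;
}
```

With a base of 3 ticks, steps 0 through 6 give 10, 10, 12, 24, 48, 96, and 120 ticks, that is, 5, 5, 6, 12, 24, 48, and 60 seconds. The unclamped first two values would have been 1.5 and 3 seconds.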
TCP’s TCPT_KEEP counter implements two timers:
When a SYN is sent, the connection-establishment timer is set to 75 seconds (TCPTV_KEEP_INIT). This happens when connect is called, putting a connection into the SYN_SENT state (active open), or when a connection moves from the LISTEN to the SYN_RCVD state (passive open). If the connection doesn’t enter the ESTABLISHED state within 75 seconds, the connection is dropped.
When a segment is received on a connection, tcp_input resets the keepalive timer for that connection to 2 hours (tcp_keepidle), and the t_idle counter for the connection is reset to 0. This happens for every TCP connection on the system, whether the keepalive option is enabled for the socket or not. If the keepalive timer expires (2 hours after the last segment was received on the connection), and if the socket option is set, a keepalive probe is sent to the other end. If the timer expires and the socket option is not set, the keepalive timer is just reset for 2 hours in the future.
Figure 25.16 shows the case for TCP’s TCPT_KEEP counter.
221-228
If the state is less than ESTABLISHED (Figure 24.16), the TCPT_KEEP counter is the connection-establishment timer. At the label dropit, tcp_drop is called to terminate the connection attempt with an error of ETIMEDOUT. We’ll see that this error is only the default: if, for example, a soft error such as an ICMP host unreachable was received on the connection, the error returned to the process will be changed to EHOSTUNREACH instead of the default.
In Figure 30.4 we’ll see that when TCP sends a SYN, two timers are initialized: the connection-establishment timer as we just described, with a value of 75 seconds, and the retransmission timer, to cause the SYN to be retransmitted if no response is received. Figure 25.15 shows these two timers.
The retransmission timer is initialized to 6 seconds for a new connection (Figure 25.19), and successive values are calculated to be 24 and 48 seconds. We describe how these values are calculated in Section 25.7. The retransmission timer causes the SYN to be transmitted a total of three times, at times 0, 6, and 30. At time 75, 3 seconds before the retransmission timer would expire again, the connection-establishment timer expires, and tcp_drop terminates the connection attempt.
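The interaction of the two timers can be checked with a short sketch that walks the retransmission intervals until the connection-establishment deadline. The intervals 6, 24, and 48 seconds and the 75-second deadline come from the text; the function name and its interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count how many times the SYN is sent before the
 * deadline (seconds). rto[] holds successive retransmission intervals.
 * On return, *next_due is the time of the first retransmission that
 * would fall at or after the deadline (or the last one scheduled). */
static int syn_sends_before(const int *rto, size_t n, int deadline,
                            int *next_due)
{
    int t = 0;
    int sends = 1;      /* the initial SYN at time 0 */
    size_t i;

    assert(rto != NULL && next_due != NULL);
    for (i = 0; i < n; i++) {
        t += rto[i];
        if (t >= deadline)
            break;      /* the deadline fires first */
        sends++;
    }
    *next_due = t;
    return sends;
}
```

With intervals {6, 24, 48} and a deadline of 75, the function returns 3 sends (at times 0, 6, and 30) and reports the next retransmission as due at time 78, which is 3 seconds after the connection-establishment timer expires.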
Figure 25.16. tcp_timers function: expiration of keepalive timer (tcp_timer.c).

    221     /*
    222      * Keep-alive timer went off; send something
    223      * or drop connection if idle for too long.
    224      */
    225 case TCPT_KEEP:
    226     tcpstat.tcps_keeptimeo++;
    227     if (tp->t_state < TCPS_ESTABLISHED)
    228         goto dropit;            /* connection establishment timer */
    229     if (tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE &&
    230         tp->t_state <= TCPS_CLOSE_WAIT) {
    231         if (tp->t_idle >= tcp_keepidle + tcp_maxidle)
    232             goto dropit;
    233         /*
    234          * Send a packet designed to force a response
    235          * if the peer is up and reachable:
    236          * either an ACK if the connection is still alive,
    237          * or an RST if the peer has closed the connection
    238          * due to timeout or reboot.
    239          * Using sequence number tp->snd_una-1
    240          * causes the transmitted zero-length segment
    241          * to lie outside the receive window;
    242          * by the protocol spec, this requires the
    243          * correspondent TCP to respond.
    244          */
    245         tcpstat.tcps_keepprobe++;
    246         tcp_respond(tp, tp->t_template, (struct mbuf *) NULL,
    247                     tp->rcv_nxt, tp->snd_una - 1, 0);
    248         tp->t_timer[TCPT_KEEP] = tcp_keepintvl;
    249     } else
    250         tp->t_timer[TCPT_KEEP] = tcp_keepidle;
    251     break;
    252 dropit:
    253     tcpstat.tcps_keepdrops++;
    254     tp = tcp_drop(tp, ETIMEDOUT);
    255     break;
229-230
This timer expires after 2 hours of idle time on every connection, not just ones with the SO_KEEPALIVE socket option enabled. If the socket option is set, probes are sent only if the connection is in the ESTABLISHED or CLOSE_WAIT states (Figure 24.15). Once the process calls close (the states greater than CLOSE_WAIT), keepalive probes are not sent, even if the connection is idle for 2 hours.
231-232
If the total idle time for the connection is greater than or equal to 2 hours (tcp_keepidle) plus 10 minutes (tcp_maxidle), the connection is dropped. This means that TCP has sent its limit of nine keepalive probes, 75 seconds apart (tcp_keepintvl), with no response. One reason TCP must send multiple keepalive probes before considering the connection dead is that the ACKs sent in response do not contain data and therefore are not reliably transmitted by TCP. An ACK that is a response to a keepalive probe can get lost.
233-248
If TCP hasn’t reached the keepalive limit, tcp_respond sends a keepalive packet. The acknowledgment field of the keepalive packet (the fourth argument to tcp_respond) contains rcv_nxt, the next sequence number expected on the connection. The sequence number field of the keepalive packet (the fifth argument) deliberately contains snd_una minus 1, which is the sequence number of a byte of data that the other end has already acknowledged (Figure 24.17). Since this sequence number is outside the window, the other end must respond with an ACK, specifying the next sequence number it expects.
Figure 25.17 summarizes this use of the keepalive timer.
The nine keepalive probes are sent every 75 seconds, starting at time 0, through time 600. At time 675 (11.25 minutes after the 2-hour timer expired) the connection is dropped. Notice that nine keepalive probes are sent, even though the constant TCPTV_KEEPCNT (Figure 25.4) is 8. This is because the variable t_idle is incremented in Figure 25.8 after the timer is decremented, compared to 0, and possibly handled. When tcp_input receives a segment on a connection, it sets the keepalive timer to 14400 (tcp_keepidle) and t_idle to 0. The next time tcp_slowtimo is called, the keepalive timer is decremented to 14399 and t_idle is incremented to 1. About 2 hours later, when the keepalive timer is decremented from 1 to 0 and tcp_timers is called, the value of t_idle will be 14399. We can build the table in Figure 25.18 to see the value of t_idle each time tcp_timers is called.
Figure 25.18. The value of t_idle when tcp_timers is called for keepalive processing.
| probe # | time in Figure 25.17 | t_idle |
|---|---|---|
| 1 | 0 | 14399 |
| 2 | 75 | 14549 |
| 3 | 150 | 14699 |
| 4 | 225 | 14849 |
| 5 | 300 | 14999 |
| 6 | 375 | 15149 |
| 7 | 450 | 15299 |
| 8 | 525 | 15449 |
| 9 | 600 | 15599 |
| | 675 | 15749 |
The code in Figure 25.16 is waiting for t_idle to be greater than or equal to 15600 (tcp_keepidle + tcp_maxidle) and that only happens at time 675 in Figure 25.17, after nine keepalive probes have been sent.
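The count of nine probes can be reproduced with a small simulation of the keepalive counter on an idle connection whose peer never answers. It is a sketch with hypothetical names: the struct and functions are invented for the example, the probe and the drop are reduced to counters, and the SO_KEEPALIVE option is assumed to be set. The tick values are the ones given in the text.

```c
#include <assert.h>

#define KEEPIDLE  14400          /* tcp_keepidle: 2 hours in ticks */
#define KEEPINTVL 150            /* tcp_keepintvl: 75 s in ticks */
#define MAXIDLE   (8 * 150)      /* tcp_maxidle: TCPTV_KEEPCNT * tcp_keepintvl */

struct ka {
    int timer;          /* stands in for t_timer[TCPT_KEEP] */
    int idle;           /* stands in for t_idle */
    int probes;         /* keepalive probes sent so far */
    int dropped;
    int idle_at_drop;   /* t_idle observed when the drop is decided */
};

/* Keepalive expiration for an established, unresponsive connection. */
static void keep_expiry(struct ka *k)
{
    if (k->idle >= KEEPIDLE + MAXIDLE) {
        k->dropped = 1;
        k->idle_at_drop = k->idle;
        return;
    }
    k->probes++;                /* stands in for tcp_respond() */
    k->timer = KEEPINTVL;
}

/* One slow-timer tick: the counter is handled first, then idle grows. */
static void slow_tick(struct ka *k)
{
    if (k->timer && --k->timer == 0)
        keep_expiry(k);
    k->idle++;
}

/* Run until the connection is dropped; returns the number of probes. */
static int probes_before_drop(struct ka *k)
{
    assert(k->timer > 0 && !k->dropped);
    while (!k->dropped)
        slow_tick(k);
    return k->probes;
}
```

Starting with the counter at 14400 and an idle count of 0, the simulation sends nine probes and decides to drop when the idle count is 15749, matching the last row of Figure 25.18.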
249-250
If the socket option is not set or the connection state is greater than CLOSE_WAIT, the keepalive timer for this connection is reset to 2 hours (tcp_keepidle).
Unfortunately the counter tcps_keepdrops (line 253) counts both uses of the TCPT_KEEP counter: the connection-establishment timer and the keepalive timer.
The timers that we’ve described so far in this chapter have fixed times associated with them: 200 ms for the delayed ACK timer, 75 seconds for the connection-establishment timer, 2 hours for the keepalive timer, and so on. The final two timers that we describe, the retransmission timer and the persist timer, have values that depend on the measured RTT for the connection. Before going through the source code that calculates and sets these timers we need to understand how TCP measures the RTT for a connection.
Fundamental to the operation of TCP is setting a retransmission timer when a segment is transmitted and an ACK is required from the other end. If the ACK is not received when the retransmission timer expires, the segment is retransmitted. TCP requires an ACK for data segments but does not require an ACK for a segment without data (i.e., a pure ACK segment). If the calculated retransmission timeout is too small, it can expire prematurely, causing needless retransmissions. If the calculated value is too large, after a segment is lost, additional time is lost before the segment is retransmitted, degrading performance. Complicating this is that the round-trip times between two hosts can vary widely and dynamically over the course of a connection.
TCP in Net/3 calculates the retransmission timeout (RTO) by measuring the round-trip time (nticks) of data segments and keeping track of the smoothed RTT estimator (srtt) and a smoothed mean deviation estimator (rttvar). The mean deviation is a good approximation of the standard deviation, but easier to compute since, unlike the standard deviation, the mean deviation does not require square root calculations. [Jacobson 1988b] provides additional details on these RTT measurements, which lead to the following equations:
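$$
\begin{aligned}
\mathit{delta} &= \mathit{nticks} - \mathit{srtt} \\
\mathit{srtt} &\leftarrow \mathit{srtt} + g \times \mathit{delta} \\
\mathit{rttvar} &\leftarrow \mathit{rttvar} + h\,(|\mathit{delta}| - \mathit{rttvar}) \\
\mathit{RTO} &= \mathit{srtt} + 4 \times \mathit{rttvar}
\end{aligned}
$$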
delta is the difference between the measured round trip just obtained (nticks) and the current smoothed RTT estimator (srtt). g is the gain applied to the RTT estimator and equals 1/8. h is the gain applied to the mean deviation estimator and equals 1/4. The two gains and the multiplier 4 in the RTO calculation are purposely powers of 2, so they can be calculated using shift operations instead of multiplying or dividing.
[Jacobson 1988b] specified 2 x rttvar in the calculation of RTO, but after further research, [Jacobson 1990d] changed the value to 4 x rttvar, which is what appeared in the Net/1 implementation.
We now describe the variables and calculations used to calculate TCP’s retransmission timer, as we’ll encounter them throughout the TCP code. Figure 25.19 lists the variables in the control block related to the retransmission timer.
Figure 25.19. Control block variables for calculation of retransmission timer.
| tcpcb member | Units | tcp_newtcpcb initial value | #sec | Description |
|---|---|---|---|---|
| t_srtt | ticks × 8 | 0 | | smoothed RTT estimator: srtt × 8 |
| t_rttvar | ticks × 4 | 24 | 3 | smoothed mean deviation estimator: rttvar × 4 |
| t_rxtcur | ticks | 12 | 6 | current retransmission timeout: RTO |
| t_rttmin | ticks | 2 | 1 | minimum value for retransmission timeout |
| t_rxtshift | n.a. | 0 | | index into tcp_backoff array |
We show the tcp_backoff array at the end of Section 25.9. The tcp_newtcpcb function sets the initial values for these variables, and we cover it in the next section. The term shift in the variable t_rxtshift and its limit TCP_MAXRXTSHIFT is not entirely accurate. The former is not used for bit shifting, but as Figure 25.19 indicates, it is an index into an array.
The confusing part of TCP’s timeout calculations is that the two smoothed estimators maintained in the C code (t_srtt and t_rttvar) are fixed-point integers, instead of floating-point values. This is done to avoid floating-point calculations within the kernel, but it complicates the code.
To keep the scaled and unscaled variables distinct, we’ll use the italic variables srtt and rttvar to refer to the unscaled variables in the earlier equations, and t_srtt and t_rttvar to refer to the scaled variables in the TCP control block.
Figure 25.20 shows four constants we encounter, which define the scale factors of 8 for t_srtt and 4 for t_rttvar.
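As a rough illustration of the scaling, the sketch below defines the scale and shift constants and converts scaled values back to whole ticks. The two shift names appear in the code later in this chapter; the two scale names and all four values are written from the scale factors of 8 and 4 stated above and should be checked against the kernel header. The helper functions srtt_ticks and rttvar_ticks are hypothetical and do not exist in Net/3.

```c
#include <assert.h>

#define TCP_RTT_SCALE     8   /* multiplier for srtt: 3 fraction bits */
#define TCP_RTT_SHIFT     3   /* shift for srtt */
#define TCP_RTTVAR_SCALE  4   /* multiplier for rttvar: 2 fraction bits */
#define TCP_RTTVAR_SHIFT  2   /* shift for rttvar */

/* Hypothetical helper: whole ticks of srtt held in a scaled t_srtt. */
int
srtt_ticks(int t_srtt)
{
    assert(t_srtt >= 0);
    return t_srtt >> TCP_RTT_SHIFT;     /* fraction bits are dropped */
}

/* Hypothetical helper: whole ticks of rttvar held in a scaled t_rttvar. */
int
rttvar_ticks(int t_rttvar)
{
    assert(t_rttvar >= 0);
    return t_rttvar >> TCP_RTTVAR_SHIFT;
}
```

A t_rttvar of 24 is therefore 6 ticks, and a t_srtt of 15 is 1 tick with a fraction of 7/8 that the shift discards.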
A new TCP control block is allocated and initialized by tcp_newtcpcb, shown in Figure 25.21. This function is called by TCP’s PRU_ATTACH request when a new socket is created (Figure 30.2). The caller has previously allocated an Internet PCB for this connection, pointed to by the argument inp. We present this function now because it initializes the TCP timer variables.
Figure 25.21. tcp_newtcpcb function: create and initialize a new TCP control block.
------------------------------------------------------------------------ tcp_subr.c
167 struct tcpcb *
168 tcp_newtcpcb(inp)
169 struct inpcb *inp;
170 {
171     struct tcpcb *tp;
172     tp = malloc(sizeof(*tp), M_PCB, M_NOWAIT);
173     if (tp == NULL)
174         return ((struct tcpcb *) 0);
175     bzero((char *) tp, sizeof(struct tcpcb));
176     tp->seg_next = tp->seg_prev = (struct tcpiphdr *) tp;
177     tp->t_maxseg = tcp_mssdflt;
178     tp->t_flags = tcp_do_rfc1323 ? (TF_REQ_SCALE | TF_REQ_TSTMP) : 0;
179     tp->t_inpcb = inp;
180     /*
181      * Init srtt to TCPTV_SRTTBASE (0), so we can tell that we have no
182      * rtt estimate.  Set rttvar so that srtt + 2 * rttvar gives
183      * reasonable initial retransmit time.
184      */
185     tp->t_srtt = TCPTV_SRTTBASE;
186     tp->t_rttvar = tcp_rttdflt * PR_SLOWHZ << 2;
187     tp->t_rttmin = TCPTV_MIN;
188     TCPT_RANGESET(tp->t_rxtcur,
189                   ((TCPTV_SRTTBASE >> 2) + (TCPTV_SRTTDFLT << 2)) >> 1,
190                   TCPTV_MIN, TCPTV_REXMTMAX);
191     tp->snd_cwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
192     tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;
193     inp->inp_ip.ip_ttl = ip_defttl;
194     inp->inp_ppcb = (caddr_t) tp;
195     return (tp);
196 }
------------------------------------------------------------------------ tcp_subr.c
167-175
The kernel’s malloc function allocates memory for the control block, and bzero sets it to 0.
176
The two variables seg_next and seg_prev point to the reassembly queue for out-of-order segments received for this connection. We discuss this queue in detail in Section 27.9.
177-179
The maximum segment size to send, t_maxseg, defaults to 512 (tcp_mssdflt). This value can be changed by the tcp_mss function after an MSS option is received from the other end. (TCP also sends an MSS option to the other end when a new connection is established.) The two flags TF_REQ_SCALE and TF_REQ_TSTMP are set if the system is configured to request window scaling and timestamps as defined in RFC 1323 (the global tcp_do_rfc1323 from Figure 24.3, which defaults to 1). The t_inpcb pointer in the TCP control block is set to point to the Internet PCB passed in by the caller.
180-185
The four variables t_srtt, t_rttvar, t_rttmin, and t_rxtcur, described in Figure 25.19, are initialized. First, the smoothed RTT estimator t_srtt is set to 0 (TCPTV_SRTTBASE), which is a special value that means no RTT measurements have been made yet for this connection. tcp_xmit_timer recognizes this special value when the first RTT measurement is made.
186-187
The smoothed mean deviation estimator t_rttvar is set to 24: 3 (tcp_rttdflt, from Figure 24.3) times 2 (PR_SLOWHZ) multiplied by 4 (the left shift of 2 bits). Since this scaled estimator is 4 times the variable rttvar, this value equals 6 clock ticks, or 3 seconds. The minimum RTO, stored in t_rttmin, is 2 ticks (TCPTV_MIN).
188-190
The current RTO in clock ticks is calculated and stored in t_rxtcur. It is bounded by a minimum value of 2 ticks (TCPTV_MIN) and a maximum value of 128 ticks (TCPTV_REXMTMAX). The value calculated as the second argument to TCPT_RANGESET is 12 ticks, or 6 seconds. This is the first RTO for the connection.
Understanding these C expressions involving the scaled RTT estimators can be a challenge. It helps to start with the unscaled equation and substitute the scaled variables. The unscaled equation we’re solving is
RTO = srtt + 2 × rttvar
where we use the multiplier of 2 instead of 4 to calculate the first RTO.
The use of the multiplier 2 instead of 4 appears to be a leftover from the original 4.3BSD Tahoe code [Paxson 1994].
Substituting the two scaling relationships
t_srtt = 8 × srtt
t_rttvar = 4 × rttvar
we get
RTO = t_srtt / 8 + 2 × (t_rttvar / 4) = (t_srtt / 4 + t_rttvar) / 2
which is the C code for the second argument to TCPT_RANGESET. In this code the variable t_rttvar is not used; the constant TCPTV_SRTTDFLT, whose value is 6 ticks, is used instead, and it must be multiplied by 4 to have the same scale as t_rttvar.
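The initial values of 24 and 12 can be checked with a few lines of C. This is a sketch under stated assumptions: the constant definitions are written from the tick values quoted in the text (2 ticks per second, a default of 6 ticks, bounds of 2 and 128 ticks), and range_set, initial_rttvar, and initial_rxtcur are hypothetical functions standing in for the TCPT_RANGESET macro and lines 186 and 188-190 of Figure 25.21.

```c
#include <assert.h>

#define PR_SLOWHZ        2                  /* slow timeout ticks per second */
#define TCPTV_SRTTBASE   0                  /* "no estimate yet" marker */
#define TCPTV_SRTTDFLT   (3 * PR_SLOWHZ)    /* 6 ticks */
#define TCPTV_MIN        (1 * PR_SLOWHZ)    /* 2 ticks */
#define TCPTV_REXMTMAX   (64 * PR_SLOWHZ)   /* 128 ticks */

/* Function form of a range clamp; hypothetical stand-in for TCPT_RANGESET. */
int
range_set(int value, int tvmin, int tvmax)
{
    assert(tvmin <= tvmax);
    if (value < tvmin)
        return tvmin;
    if (value > tvmax)
        return tvmax;
    return value;
}

/* Scaled mean deviation: seconds x ticks/second x 4. */
int
initial_rttvar(int rttdflt_sec)
{
    return (rttdflt_sec * PR_SLOWHZ) << 2;
}

/* First RTO: (t_srtt / 4 + t_rttvar) / 2 with the default constants. */
int
initial_rxtcur(void)
{
    return range_set(((TCPTV_SRTTBASE >> 2) + (TCPTV_SRTTDFLT << 2)) >> 1,
                     TCPTV_MIN, TCPTV_REXMTMAX);
}
```

With tcp_rttdflt equal to 3, initial_rttvar(3) is 24 and initial_rxtcur() is 12, matching the initial value column of Figure 25.19.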
191-192
The congestion window (snd_cwnd) and slow start threshold (snd_ssthresh) are set to 1,073,725,440 (approximately one gigabyte), which is the largest possible TCP window if the window scale option is in effect. (Slow start and congestion avoidance are described in Section 21.6 of Volume 1.) It is calculated as the maximum value for the window size field in the TCP header (65535, TCP_MAXWIN) times 2^14, where 14 is the maximum value for the window scale factor (TCP_MAX_WINSHIFT). We’ll see that when a SYN is sent or received on the connection, tcp_mss resets snd_cwnd to a single segment.
193-194
The default IP TTL in the Internet PCB is set to 64 (ip_defttl) and the PCB is set to point to the new TCP control block.
Not shown in this code is that numerous variables, such as the shift variable t_rxtshift, are implicitly initialized to 0 since the control block is initialized by bzero.
The next function we look at that uses TCP’s retransmission timeout calculations is tcp_setpersist. In Figure 25.13 we saw this function called when the persist timer expired. This timer is set when TCP has data to send on a connection, but the other end is advertising a window of 0. This function, shown in Figure 25.22, calculates and stores the next value for the timer.
Figure 25.22. tcp_setpersist function: calculate and store a new value for the persist timer.
----------------------------------------------------------------------- tcp_output.c
493 void
494 tcp_setpersist(tp)
495 struct tcpcb *tp;
496 {
497     int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;
498     if (tp->t_timer[TCPT_REXMT])
499         panic("tcp_output REXMT");
500     /*
501      * Start/restart persistance timer.
502      */
503     TCPT_RANGESET(tp->t_timer[TCPT_PERSIST],
504                   t * tcp_backoff[tp->t_rxtshift],
505                   TCPTV_PERSMIN, TCPTV_PERSMAX);
506     if (tp->t_rxtshift < TCP_MAXRXTSHIFT)
507         tp->t_rxtshift++;
508 }
----------------------------------------------------------------------- tcp_output.c
493-499
A check is made that the retransmission timer is not enabled when the persist timer is about to be set, since the two timers are mutually exclusive: if data is being sent, the other side must be advertising a nonzero window, but the persist timer is being set only if the advertised window is 0.
500-505
The variable t is set to the RTO value that was calculated at the beginning of the function. The equation being solved is
RTO = srtt + 2 × rttvar
which is identical to the formula used at the end of the previous section. With substitution we get
RTO = t_srtt / 8 + 2 × (t_rttvar / 4) = (t_srtt / 4 + t_rttvar) / 2
which is the value computed for the variable t.
506-507
An exponential backoff is also applied to the RTO. This is done by multiplying the RTO by a value from the tcp_backoff array:
    int tcp_backoff[TCP_MAXRXTSHIFT + 1] =
        { 1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };
When tcp_output initially sets the persist timer for a connection, the code is
    tp->t_rxtshift = 0;
    tcp_setpersist(tp);
so the first time tcp_setpersist is called, t_rxtshift is 0. Since the value of tcp_backoff[0] is 1, t is used as the persist timeout. The TCPT_RANGESET macro bounds this value between 5 and 60 seconds. t_rxtshift is incremented by 1 until it reaches a maximum of 12 (TCP_MAXRXTSHIFT), since tcp_backoff[12] is the final entry in the array.
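The sequence of persist timeouts that results from repeated calls can be sketched outside the kernel. In the following illustration the structure persist_state and the function next_persist are hypothetical; the bounds of 10 and 120 ticks are derived from the 5-second and 60-second limits stated above at 2 ticks per second, and the function returns the timer value instead of storing it in the control block.

```c
#include <assert.h>
#include <stddef.h>

#define PR_SLOWHZ        2
#define TCP_MAXRXTSHIFT  12
#define TCPTV_PERSMIN    (5 * PR_SLOWHZ)    /* 10 ticks */
#define TCPTV_PERSMAX    (60 * PR_SLOWHZ)   /* 120 ticks */

static const int backoff[TCP_MAXRXTSHIFT + 1] =
    { 1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };

/* Hypothetical stand-in for the three tcpcb members the function uses. */
struct persist_state {
    int t_srtt;         /* srtt x 8 */
    int t_rttvar;       /* rttvar x 4 */
    int t_rxtshift;     /* index into backoff[] */
};

/* Return the next persist timeout in ticks and advance the backoff index. */
int
next_persist(struct persist_state *ps)
{
    int t, value;

    assert(ps != NULL);
    assert(ps->t_rxtshift >= 0 && ps->t_rxtshift <= TCP_MAXRXTSHIFT);
    t = ((ps->t_srtt >> 2) + ps->t_rttvar) >> 1;    /* srtt + 2 x rttvar */
    value = t * backoff[ps->t_rxtshift];
    if (value < TCPTV_PERSMIN)
        value = TCPTV_PERSMIN;
    else if (value > TCPTV_PERSMAX)
        value = TCPTV_PERSMAX;
    if (ps->t_rxtshift < TCP_MAXRXTSHIFT)
        ps->t_rxtshift++;
    return value;
}
```

With t_srtt of 16 and t_rttvar of 4, t is 4 ticks, so successive calls return 10, 10, 16, 32, 64, and then 120 ticks (5, 5, 8, 16, 32, and 60 seconds). The lower bound dominates at first and the upper bound takes over once the backoff has grown.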
The next function we look at, tcp_xmit_timer, is called each time an RTT measurement is collected, to update the smoothed RTT estimator (srtt) and the smoothed mean deviation estimator (rttvar).
The argument rtt is the RTT measurement to be applied. It is the value nticks + 1, using the notation from Section 25.7. It can be from one of two sources:
If the timestamp option is present in a received segment, the measured RTT is the current time (tcp_now) minus the timestamp value. We’ll examine the timestamp option in Section 26.6, but for now all we need to know is that tcp_now is incremented every 500 ms (Figure 25.8). When a data segment is sent, tcp_now is sent as the timestamp, and the other end echoes this timestamp in the acknowledgment it sends back.
If timestamps are not in use and a data segment is being timed, we saw in Figure 25.8 that the counter t_rtt is incremented every 500 ms for the connection. We also mentioned in Section 25.5 that this counter is initialized to 1, so when the acknowledgment is received the counter is the measured RTT (in ticks) plus 1.
Typical code in tcp_input that calls tcp_xmit_timer is
    if (ts_present)
        tcp_xmit_timer(tp, tcp_now - ts_ecr + 1);
    else if (tp->t_rtt && SEQ_GT(ti->ti_ack, tp->t_rtseq))
        tcp_xmit_timer(tp, tp->t_rtt);
If a timestamp was present in the segment (ts_present), the RTT estimators are updated using the current time (tcp_now) minus the echoed timestamp (ts_ecr) plus 1. (We describe the reason for adding 1 below.)
If a timestamp is not present, the RTT estimators are updated only if the received segment acknowledges a data segment that was being timed. There is only one RTT counter per TCP control block (t_rtt), so only one outstanding data segment can be timed per connection. The starting sequence number of that segment is stored in t_rtseq when the segment is transmitted, to tell when an acknowledgment is received that covers that sequence number. If the received acknowledgment number (ti_ack) is greater than the starting sequence number of the segment being timed (t_rtseq), the RTT estimators are updated using t_rtt as the measured RTT.
Before RFC 1323 timestamps were supported, TCP measured the RTT only by counting clock ticks in t_rtt. But this variable is also used as a flag that specifies whether a segment is being timed (Figure 25.8): if t_rtt is greater than 0, then tcp_slowtimo adds 1 to it every 500 ms. Hence when t_rtt is nonzero, it is the number of ticks plus 1. We’ll see shortly that tcp_xmit_timer always decrements its second argument by 1 to account for this offset. Therefore when timestamps are being used, 1 is added to the second argument to account for the decrement by 1 in tcp_xmit_timer.
The greater-than test of the sequence numbers is because ACKs are cumulative: if TCP sends and times a segment with sequence numbers 1-1024 (t_rtseq equals 1), then immediately sends (but can’t time) a segment with sequence numbers 1025-2048, and then receives an ACK with ti_ack equal to 2049, this is an ACK for sequence numbers 1-2048 and the ACK acknowledges the first segment being timed as well as the second (untimed) segment. Notice that when RFC 1323 timestamps are in use there is no comparison of sequence numbers. If the other end sends a timestamp option, it chooses the echo reply value (ts_ecr) to allow TCP to calculate the RTT.
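The choice between the two measurement sources can be sketched as a small function. This is an illustration under stated assumptions, not the tcp_input code: rtt_sample and seq_gt are hypothetical names, the comparison is written for 32-bit sequence numbers in the modular style that a SEQ_GT macro needs, and a return value of 0 is used here to mean that no measurement is available.

```c
#include <assert.h>
#include <stdint.h>

/* Modular "greater than" for 32-bit sequence numbers (assumed form). */
static int
seq_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/*
 * Return the value to pass as the rtt argument (nticks + 1),
 * or 0 if this ACK yields no RTT measurement.
 */
int
rtt_sample(int ts_present, uint32_t tcp_now, uint32_t ts_ecr,
           int t_rtt, uint32_t ti_ack, uint32_t t_rtseq)
{
    assert(t_rtt >= 0);
    if (ts_present)
        return (int)(tcp_now - ts_ecr) + 1;     /* add the offset of 1 */
    if (t_rtt && seq_gt(ti_ack, t_rtseq))
        return t_rtt;                           /* counter already holds nticks + 1 */
    return 0;
}
```

With t_rtseq equal to 1 and t_rtt equal to 3, an ACK of 2049 returns 3 because it covers the timed segment, while an ACK of 1 returns 0. With a timestamp echo of 98 and tcp_now of 100, the function returns 3 regardless of the sequence numbers.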
Figure 25.23 shows the first part of the function that updates the estimators.
Figure 25.23. tcp_xmit_timer function: apply new RTT measurement to smoothed estimators.
----------------------------------------------------------------------- tcp_input.c
1310 void
1311 tcp_xmit_timer(tp, rtt)
1312 struct tcpcb *tp;
1313 short rtt;
1314 {
1315     short delta;
1316     tcpstat.tcps_rttupdated++;
1317     if (tp->t_srtt != 0) {
1318         /*
1319          * srtt is stored as fixed point with 3 bits after the
1320          * binary point (i.e., scaled by 8).  The following magic
1321          * is equivalent to the smoothing algorithm in rfc793 with
1322          * an alpha of .875 (srtt = rtt/8 + srtt*7/8 in fixed
1323          * point).  Adjust rtt to origin 0.
1324          */
1325         delta = rtt - 1 - (tp->t_srtt >> TCP_RTT_SHIFT);
1326         if ((tp->t_srtt += delta) <= 0)
1327             tp->t_srtt = 1;
1328         /*
1329          * We accumulate a smoothed rtt variance (actually, a
1330          * smoothed mean difference), then set the retransmit
1331          * timer to smoothed rtt + 4 times the smoothed variance.
1332          * rttvar is stored as fixed point with 2 bits after the
1333          * binary point (scaled by 4).  The following is
1334          * equivalent to rfc793 smoothing with an alpha of .75
1335          * (rttvar = rttvar*3/4 + |delta| / 4).  This replaces
1336          * rfc793's wired-in beta.
1337          */
1338         if (delta < 0)
1339             delta = -delta;
1340         delta -= (tp->t_rttvar >> TCP_RTTVAR_SHIFT);
1341         if ((tp->t_rttvar += delta) <= 0)
1342             tp->t_rttvar = 1;
1343     } else {
1344         /*
1345          * No rtt measurement yet - use the unsmoothed rtt.
1346          * Set the variance to half the rtt (so our first
1347          * retransmit happens at 3*rtt).
1348          */
1349         tp->t_srtt = rtt << TCP_RTT_SHIFT;
1350         tp->t_rttvar = rtt << (TCP_RTTVAR_SHIFT - 1);
1351     }
----------------------------------------------------------------------- tcp_input.c
1310-1325
Recall that tcp_newtcpcb initialized the smoothed RTT estimator (t_srtt) to 0, indicating that no measurements have been made for this connection. delta is the difference between the measured RTT and the current value of the smoothed RTT estimator, in unscaled ticks. t_srtt is divided by 8 to convert from scaled to unscaled ticks.
1326-1327
The smoothed RTT estimator is updated using the equation
srtt ← srtt + g × delta
Since the gain g is 1/8, this equation is
8 × srtt ← 8 × srtt + delta
which is
t_srtt ← t_srtt + delta
1328-1342
The mean deviation estimator is updated using the equation
rttvar ← rttvar + h × (|delta| − rttvar)
Substituting 1/4 for h and the scaled variable t_rttvar for 4 × rttvar, we get
t_rttvar / 4 ← t_rttvar / 4 + (|delta| − t_rttvar / 4) / 4
which is
t_rttvar ← t_rttvar + |delta| − t_rttvar / 4
This final equation corresponds to the C code.
1343-1350
If this is the first RTT measured for this connection, the smoothed RTT estimator is initialized to the measured RTT. These calculations use the value of the argument rtt, which we said is the measured RTT plus 1 (nticks + 1), whereas the earlier calculation of delta subtracted 1 from rtt.
srtt = nticks + 1
or
t_srtt / 8 = nticks + 1
which is
t_srtt = (nticks + 1) × 8
The smoothed mean deviation is set to one-half of the measured RTT:
rttvar = srtt / 2
which is
t_rttvar / 4 = (nticks + 1) / 2
or
t_rttvar = (nticks + 1) × 2
The comment in the code states that this initial setting for the smoothed mean deviation yields an initial RTO of 3 × srtt. Since the RTO is calculated as
RTO = srtt + 4 × rttvar
substituting for rttvar gives us
RTO = srtt + 4 × (srtt / 2)
which is indeed
RTO = 3 × srtt
Figure 25.24 shows the final part of the tcp_xmit_timer function.
Figure 25.24. tcp_xmit_timer function: final part.
------------------------------------------------------------------------ tcp_input.c
1352     tp->t_rtt = 0;
1353     tp->t_rxtshift = 0;
1354     /*
1355      * the retransmit should happen at rtt + 4 * rttvar.
1356      * Because of the way we do the smoothing, srtt and rttvar
1357      * will each average +1/2 tick of bias.  When we compute
1358      * the retransmit timer, we want 1/2 tick of rounding and
1359      * 1 extra tick because of +-1/2 tick uncertainty in the
1360      * firing of the timer.  The bias will give us exactly the
1361      * 1.5 tick we need.  But, because the bias is
1362      * statistical, we have to test that we don't drop below
1363      * the minimum feasible timer (which is 2 ticks).
1364      */
1365     TCPT_RANGESET(tp->t_rxtcur, TCP_REXMTVAL(tp),
1366                   tp->t_rttmin, TCPTV_REXMTMAX);
1367     /*
1368      * We received an ack for a packet that wasn't retransmitted;
1369      * it is probably safe to discard any error indications we've
1370      * received recently.  This isn't quite right, but close enough
1371      * for now (a route might have failed after we sent a segment,
1372      * and the return path might not be symmetrical).
1373      */
1374     tp->t_softerror = 0;
1375 }
------------------------------------------------------------------------ tcp_input.c
1352-1353
The RTT counter (t_rtt) and the retransmission shift count (t_rxtshift) are both reset to 0 in preparation for timing and transmission of the next segment.
1354-1366
The next RTO to use for the connection (t_rxtcur) is calculated using the macro
    #define TCP_REXMTVAL(tp) \
        (((tp)->t_srtt >> TCP_RTT_SHIFT) + (tp)->t_rttvar)
This is the now-familiar equation
RTO = srtt + 4 × rttvar
using the scaled variables updated by tcp_xmit_timer. Substituting these scaled variables for srtt and rttvar, we have
RTO = t_srtt / 8 + 4 × (t_rttvar / 4) = t_srtt / 8 + t_rttvar
which corresponds to the macro. The calculated value for the RTO is bounded by the minimum RTO for this connection (t_rttmin, which tcp_newtcpcb set to 2 ticks), and 128 ticks (TCPTV_REXMTMAX).
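The two parts of tcp_xmit_timer can be exercised outside the kernel to watch the scaled estimators move. The following is a minimal sketch, assuming a hypothetical structure rtt_state in place of the TCP control block and int members instead of the kernel's shorts. It keeps only the estimator and RTO arithmetic and omits the statistics counter and the resets of t_rtt, t_rxtshift, and t_softerror.

```c
#include <assert.h>
#include <stddef.h>

#define TCP_RTT_SHIFT     3
#define TCP_RTTVAR_SHIFT  2
#define TCPTV_REXMTMAX    128   /* 64 seconds at 2 ticks per second */

/* Hypothetical stand-in for the timer members of the control block. */
struct rtt_state {
    int t_srtt;     /* srtt x 8 */
    int t_rttvar;   /* rttvar x 4 */
    int t_rxtcur;   /* current RTO, in ticks */
    int t_rttmin;   /* minimum RTO, in ticks */
};

/* Apply one measurement; rtt is nticks + 1, as in the kernel function. */
void
xmit_timer(struct rtt_state *s, int rtt)
{
    int delta, rto;

    assert(s != NULL && rtt >= 1);
    if (s->t_srtt != 0) {
        delta = rtt - 1 - (s->t_srtt >> TCP_RTT_SHIFT);
        if ((s->t_srtt += delta) <= 0)
            s->t_srtt = 1;
        if (delta < 0)
            delta = -delta;
        delta -= (s->t_rttvar >> TCP_RTTVAR_SHIFT);
        if ((s->t_rttvar += delta) <= 0)
            s->t_rttvar = 1;
    } else {
        /* First measurement: srtt = rtt, rttvar = rtt / 2. */
        s->t_srtt = rtt << TCP_RTT_SHIFT;
        s->t_rttvar = rtt << (TCP_RTTVAR_SHIFT - 1);
    }
    /* RTO = t_srtt / 8 + t_rttvar, bounded by t_rttmin and the maximum. */
    rto = (s->t_srtt >> TCP_RTT_SHIFT) + s->t_rttvar;
    if (rto < s->t_rttmin)
        rto = s->t_rttmin;
    else if (rto > TCPTV_REXMTMAX)
        rto = TCPTV_REXMTMAX;
    s->t_rxtcur = rto;
}
```

Starting from the tcp_newtcpcb values (t_srtt 0, t_rttvar 24, t_rttmin 2) and applying rtt arguments of 2, 2, 3, and 3 produces t_rxtcur values of 6, 5, 6, and 5 ticks, with t_srtt moving through 16, 15, 16, and 16.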
We now return to the tcp_timers function and cover the final case that we didn’t present in Section 25.6: the one that handles the expiration of the retransmission timer. This code is executed when a data segment that was transmitted has not been acknowledged by the other end within the RTO.
Figure 25.25 summarizes the actions caused by the retransmission timer. We assume that the first timeout calculated by tcp_output is 1.5 seconds, which is typical for a LAN (see Figure 21.1 of Volume 1).
The x-axis is labeled with the time in seconds: 0, 1.5, 4.5, and so on. Below each of these numbers we show the value of t_rxtshift that is used in the code we’re about to examine. Only after 12 retransmissions and a total of 542.5 seconds (just over 9 minutes) does TCP give up and drop the connection.
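The total of 542.5 seconds can be reproduced by summing the backed-off timeouts. The sketch below is an illustration only: ticks_until_drop and clamp are hypothetical functions, the backoff table is copied from the tcp_backoff array shown earlier, and the base RTO is assumed to stay constant across all retransmissions.

```c
#include <assert.h>

#define PR_SLOWHZ        2
#define TCP_MAXRXTSHIFT  12
#define TCPTV_REXMTMAX   (64 * PR_SLOWHZ)   /* 128 ticks */

static const int backoff[TCP_MAXRXTSHIFT + 1] =
    { 1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };

static int
clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Ticks from the first transmission until the connection is dropped:
 * the initial timeout plus the 12 backed-off timeouts that follow
 * each retransmission.
 */
int
ticks_until_drop(int base_rto, int rttmin)
{
    int shift, total;

    assert(base_rto > 0 && rttmin > 0);
    total = clamp(base_rto, rttmin, TCPTV_REXMTMAX);
    for (shift = 1; shift <= TCP_MAXRXTSHIFT; shift++)
        total += clamp(base_rto * backoff[shift], rttmin, TCPTV_REXMTMAX);
    return total;
}
```

For a base RTO of 3 ticks (1.5 seconds) the timeouts are 3, 6, 12, 24, 48, and 96 ticks followed by seven timeouts of 128 ticks, for a total of 1085 ticks, which is 542.5 seconds.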
RFC 793 recommended that an open of a new connection, active or passive, allow a parameter specifying the total timeout period for data sent by TCP. This is the total amount of time TCP will try to send a given segment before giving up and terminating the connection. The recommended default was 5 minutes.
RFC 1122 requires that an application must be able to specify a parameter for a connection giving either the total number of retransmissions or the total timeout value for data sent by TCP. This parameter can be specified as “infinity,” meaning TCP never gives up, allowing, perhaps, an interactive user the choice of when to give up.
We’ll see in the code described shortly that Net/3 does not give the application any of this control: a fixed number of retransmissions (12) always occurs before TCP gives up, and the total timeout before giving up depends on the RTT.
The first half of the retransmission timeout case is shown in Figure 25.26.
Figure 25.26. tcp_timers function: expiration of retransmission timer, first half.
----------------------------------------------------------------------- tcp_timer.c
140         /*
141          * Retransmission timer went off.  Message has not
142          * been acked within retransmit interval.  Back off
143          * to a longer retransmit interval and retransmit one segment.
144          */
145     case TCPT_REXMT:
146         if (++tp->t_rxtshift > TCP_MAXRXTSHIFT) {
147             tp->t_rxtshift = TCP_MAXRXTSHIFT;
148             tcpstat.tcps_timeoutdrop++;
149             tp = tcp_drop(tp, tp->t_softerror ?
150                           tp->t_softerror : ETIMEDOUT);
151             break;
152         }
153         tcpstat.tcps_rexmttimeo++;
154         rexmt = TCP_REXMTVAL(tp) * tcp_backoff[tp->t_rxtshift];
155         TCPT_RANGESET(tp->t_rxtcur, rexmt,
156                       tp->t_rttmin, TCPTV_REXMTMAX);
157         tp->t_timer[TCPT_REXMT] = tp->t_rxtcur;
158         /*
159          * If losing, let the lower level know and try for
160          * a better route.  Also, if we backed off this far,
161          * our srtt estimate is probably bogus.  Clobber it
162          * so we'll take the next rtt measurement as our srtt;
163          * move the current srtt into rttvar to keep the current
164          * retransmit times until then.
165          */
166         if (tp->t_rxtshift > TCP_MAXRXTSHIFT / 4) {
167             in_losing(tp->t_inpcb);
168             tp->t_rttvar += (tp->t_srtt >> TCP_RTT_SHIFT);
169             tp->t_srtt = 0;
170         }
171         tp->snd_nxt = tp->snd_una;
172         /*
173          * If timing a segment in this window, stop the timer.
174          */
175         tp->t_rtt = 0;
----------------------------------------------------------------------- tcp_timer.c
146
The retransmission shift count (t_rxtshift) is incremented, and if the value exceeds 12 (TCP_MAXRXTSHIFT) it is time to drop the connection. This new value of t_rxtshift is what we show in Figure 25.25. Notice the difference between this dropping of a connection because an acknowledgment is not received from the other end in response to data sent by TCP, and the keepalive timer, which drops a connection after a long period of inactivity and no response from the other end. Both report the error ETIMEDOUT to the process, unless a soft error is received for the connection.
147-152
A soft error is one that doesn’t cause TCP to terminate an established connection or an attempt to establish a connection, but the soft error is recorded in case TCP gives up later. For example, if TCP retransmits a SYN segment to establish a connection, receiving nothing in response, the error returned to the process will be ETIMEDOUT. But if during the retransmissions an ICMP host unreachable is received for the connection, that is considered a soft error and stored in t_softerror by tcp_notify. If TCP finally gives up the retransmissions, the error returned to the process will be EHOSTUNREACH instead of ETIMEDOUT, providing more information to the process. If TCP receives an RST on the connection in response to the SYN, that’s considered a hard error and the connection is terminated immediately with an error of ECONNREFUSED (Figure 28.18).
153-157
The next RTO is calculated using the TCP_REXMTVAL macro, applying an exponential backoff. In this code, t_rxtshift will be 1 the first time a given segment is retransmitted, so the RTO will be twice the value calculated by TCP_REXMTVAL. This value is stored in t_rxtcur and as the retransmission timer for the connection, t_timer[TCPT_REXMT]. The value stored in t_rxtcur is used in tcp_input when the retransmission timer is restarted (Figures 28.12 and 29.6).
158-167
If this segment has been retransmitted four or more times, in_losing releases the cached route (if there is one), so when the segment is retransmitted by tcp_output (at the end of this case statement in Figure 25.27) a new, and hopefully better, route will be chosen. In Figure 25.25 in_losing is called each time the retransmission timer expires, starting with the retransmission at time 22.5.
Figure 25.27. tcp_timers function: expiration of retransmission timer, second half.
--------------------------------------------------------------------- tcp_timer.c
176         /*
177          * Close the congestion window down to one segment
178          * (we'll open it by one segment for each ack we get).
179          * Since we probably have a window's worth of unacked
180          * data accumulated, this "slow start" keeps us from
181          * dumping all that data as back-to-back packets (which
182          * might overwhelm an intermediate gateway).
183          *
184          * There are two phases to the opening: Initially we
185          * open by one mss on each ack.  This makes the window
186          * size increase exponentially with time.  If the
187          * window is larger than the path can handle, this
188          * exponential growth results in dropped packet(s)
189          * almost immediately.  To get more time between
190          * drops but still "push" the network to take advantage
191          * of improving conditions, we switch from exponential
192          * to linear window opening at some threshhold size.
193          * For a threshhold, we use half the current window
194          * size, truncated to a multiple of the mss.
195          *
196          * (the minimum cwnd that will give us exponential
197          * growth is 2 mss.  We don't allow the threshhold
198          * to go below this.)
199          */
200         {
201             u_int win = min(tp->snd_wnd, tp->snd_cwnd) / 2 / tp->t_maxseg;
202             if (win < 2)
203                 win = 2;
204             tp->snd_cwnd = tp->t_maxseg;
205             tp->snd_ssthresh = win * tp->t_maxseg;
206             tp->t_dupacks = 0;
207         }
208         (void) tcp_output(tp);
209         break;
--------------------------------------------------------------------- tcp_timer.c
168-170
The smoothed RTT estimator (t_srtt) is set to 0, which is what tcp_newtcpcb did. This forces tcp_xmit_timer to use the next measured RTT as the smoothed RTT estimator. This is done because the retransmitted segment has been sent four or more times, implying that TCP’s smoothed RTT estimator is probably way off. But if the retransmission timer expires again, at the beginning of this case statement the RTO is calculated by TCP_REXMTVAL. That calculation should generate the same value as it did for this retransmission (which will then be exponentially backed off), even though t_srtt is set to 0. (The retransmission at time 42.464 in Figure 25.28 is an example of what’s happening here.)
Figure 25.28. Values of RTT variables and estimators during example.
| xmit time | send | recv | RTT timer | actual delta (ms) | rtt (arg.) | t_srtt (ticks × 8) | t_rttvar (ticks × 4) | t_rxtcur (ticks) | t_rxtshift |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | SYN | | on | | | 0 | 24 | 12 | |
| 0.365 | | SYN,ACK | off | 365 | 2 | 16 | 4 | 6 | |
| 0.365 | ACK | | | | | | | | |
| 0.415 | 1:513 | | on | | | | | | |
| 1.259 | | ack 513 | off | 844 | 2 | 15 | 4 | 5 | |
| 1.260 | 513:1025 | | on | | | | | | |
| 1.261 | 1025:1537 | | | | | | | | |
| 2.206 | | ack 1537 | off | 946 | 3 | 16 | 4 | 6 | |
| 2.206 | 1537:2049 | | on | | | | | | |
| 2.207 | 2049:2561 | | | | | | | | |
| 2.209 | 2561:3073 | | | | | | | | |
| 3.132 | | ack 2049 | off | 926 | 3 | 16 | 3 | 5 | |
| 3.132 | 3073:3585 | | on | | | | | | |
| 3.133 | 3585:4097 | | | | | | | | |
| 3.736 | | ack 2561 | | | | | | | |
| 3.736 | 4097:4609 | | | | | | | | |
| 3.737 | 4609:5121 | | | | | | | | |
| 3.739 | | ack 3073 | | | | | | | |
| 3.739 | 5121:5633 | | | | | | | | |
| 3.740 | 5633:6145 | | | | | | | | |
| 6.064 | 3073:3585 | | off | | | 16 | 3 | 10 | 1 |
| 11.264 | 3073:3585 | | off | | | 16 | 3 | 20 | 2 |
| 21.664 | 3073:3585 | | off | | | 16 | 3 | 40 | 3 |
| 42.464 | 3073:3585 | | off | | | 0 | 5 | 80 | 4 |
| 84.064 | 3073:3585 | | off | | | 0 | 5 | 128 | 5 |
| 150.624 | 3073:3585 | | off | | | 0 | 5 | 128 | 6 |
| 217.184 | 3073:3585 | | off | | | 0 | 5 | 128 | 7 |
| 217.944 | | ack 6145 | | | | | | | |
| 217.944 | 6145:6657 | | on | | | | | | |
| 217.945 | 6657:7169 | | | | | | | | |
| 218.834 | | ack 6657 | off | 890 | 3 | 24 | 6 | 9 | |
| 218.834 | 7169:7681 | | on | | | | | | |
| 218.836 | 7681:8193 | | | | | | | | |
| 219.209 | | ack 7169 | | | | | | | |
| 219.209 | 8193:8705 | | | | | | | | |
| 219.760 | | ack 7681 | off | 926 | 2 | 22 | 7 | 9 | |
| 219.760 | 8705:9217 | | on | | | | | | |
| 220.103 | | ack 8705 | | | | | | | |
| 220.103 | 9217:9729 | | | | | | | | |
| 220.105 | 9729:10241 | | | | | | | | |
| 220.106 | 10241:10753 | | | | | | | | |
| 220.821 | | ack 9217 | off | 1061 | 3 | 22 | 6 | 8 | |
| 220.821 | 10753:11265 | | on | | | | | | |
| 221.310 | | ack 9729 | | | | | | | |
| 221.310 | 11265:11777 | | | | | | | | |
| 221.312 | | ack 10241 | | | | | | | |
| 221.312 | 11777:12289 | | | | | | | | |
| 221.674 | | ack 10753 | | | | | | | |
| 221.955 | | ack 11265 | off | 1134 | 3 | 22 | 5 | 7 | |
To accomplish this the value of t_rttvar is changed as follows. The next time the RTO is calculated, the equation
RTO = t_srtt / 8 + t_rttvar
is evaluated. Since t_srtt will be 0, if t_rttvar is increased by t_srtt divided by 8, RTO will have the same value. If the retransmission timer expires again for this segment (e.g., times 84.064 through 217.184 in Figure 25.28), when this code is executed again t_srtt will be 0, so t_rttvar won’t change.
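That folding t_srtt into t_rttvar leaves the RTO unchanged can be confirmed with a short sketch. The structure rtt_est and the functions rexmtval and clobber_srtt are hypothetical names for this illustration; they mirror the TCP_REXMTVAL macro and lines 168-169 of Figure 25.26.

```c
#include <assert.h>
#include <stddef.h>

#define TCP_RTT_SHIFT 3

/* Hypothetical stand-in for the two scaled estimators. */
struct rtt_est {
    int t_srtt;     /* srtt x 8 */
    int t_rttvar;   /* rttvar x 4 */
};

/* Same arithmetic as the TCP_REXMTVAL macro. */
int
rexmtval(const struct rtt_est *e)
{
    assert(e != NULL);
    return (e->t_srtt >> TCP_RTT_SHIFT) + e->t_rttvar;
}

/* Move the current srtt into rttvar, then mark srtt as unknown. */
void
clobber_srtt(struct rtt_est *e)
{
    assert(e != NULL);
    e->t_rttvar += e->t_srtt >> TCP_RTT_SHIFT;
    e->t_srtt = 0;
}
```

With t_srtt of 16 and t_rttvar of 3, the RTO before the change is 5 ticks. After clobber_srtt the values are 0 and 5, and the RTO is still 5 ticks. Calling the function a second time changes nothing, because t_srtt is already 0.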
171
The next send sequence number (snd_nxt) is set to the oldest unacknowledged sequence number (snd_una). Recall from Figure 24.17 that snd_nxt can be greater than snd_una. By moving snd_nxt back, the retransmission will be the oldest segment that hasn’t been acknowledged.
172-175
The RTT counter, t_rtt, is set to 0, in case the last segment transmitted was being timed. Karn’s algorithm says that even if an ACK of that segment is received, since the segment is about to be retransmitted, any timing of the segment is worthless since the ACK could be for the first transmission or for the retransmission. The algorithm is described in [Karn and Partridge 1987] and in Section 21.3 of Volume 1. Therefore the only segments that are timed using the t_rtt counter and used to update the RTT estimators are those that are not retransmitted. We’ll see in Figure 29.6 that the use of RFC 1323 timestamps overrides Karn’s algorithm.
The second half of this case is shown in Figure 25.27. It performs slow start and congestion avoidance and retransmits the oldest unacknowledged segment.
Since a retransmission timeout has occurred, this is a strong indication of congestion in the network. TCP’s congestion avoidance algorithm comes into play, and when a segment is eventually acknowledged by the other end, TCP’s slow start algorithm will continue the data transmission on the connection at a slower rate. Sections 20.6 and 21.6 of Volume 1 describe the two algorithms in detail.
176-205
win is set to one-half of the current window size (the minimum of the receiver’s advertised window, snd_wnd, and the sender’s congestion window, snd_cwnd) in segments, not bytes (hence the division by t_maxseg). Its minimum value is two segments. This records one-half of the window size when the congestion occurred, assuming one cause of the congestion is our sending segments too rapidly into the network. This becomes the slow start threshold, snd_ssthresh (which is stored in bytes, hence the multiplication by t_maxseg). The congestion window, snd_cwnd, is set to one segment, which forces slow start.
This code is enclosed in braces because it was added between the 4.3BSD and Net/1 releases and required its own local variable (win).
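The threshold arithmetic is easy to try with concrete window sizes. The following is a minimal sketch in which the structure cong_state and the function rexmt_timeout_cong are hypothetical; it performs only the window calculations from lines 201-205 of Figure 25.27 and uses unsigned long members for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the congestion members of the control block. */
struct cong_state {
    unsigned long snd_wnd;      /* receiver's advertised window, bytes */
    unsigned long snd_cwnd;     /* congestion window, bytes */
    unsigned long snd_ssthresh; /* slow start threshold, bytes */
    unsigned long t_maxseg;     /* maximum segment size, bytes */
};

/* Window adjustments made when the retransmission timer expires. */
void
rexmt_timeout_cong(struct cong_state *c)
{
    unsigned long cur, win;

    assert(c != NULL && c->t_maxseg > 0);
    cur = c->snd_wnd < c->snd_cwnd ? c->snd_wnd : c->snd_cwnd;
    win = cur / 2 / c->t_maxseg;        /* half the window, in segments */
    if (win < 2)
        win = 2;                        /* threshold of at least 2 segments */
    c->snd_cwnd = c->t_maxseg;          /* one segment: forces slow start */
    c->snd_ssthresh = win * c->t_maxseg;
}
```

With an advertised window of 4096, a congestion window of 3072, and a segment size of 512, the threshold becomes 1536 bytes (three segments) and the congestion window becomes 512 bytes. With an advertised window of only 700 bytes the computed value of win is 0, so the minimum of two segments applies and the threshold is 1024 bytes.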
206
The counter of consecutive duplicate ACKs, t_dupacks (which is used by the fast retransmit algorithm in Section 29.4), is set to 0. We’ll see how this counter is used with TCP’s fast retransmit and fast recovery algorithms in Chapter 29.
208
tcp_output resends a segment containing the oldest unacknowledged sequence number. This is the retransmission caused by the retransmission timer expiring.
How accurate are these estimators that TCP maintains? At first they appear too coarse, since the RTTs are measured in multiples of 500 ms. The mean and mean deviation are maintained with additional accuracy (factors of 8 and 4 respectively), but LANs have RTTs on the order of milliseconds, and a transcontinental RTT is around 60 ms. What these estimators provide is a solid upper bound on the RTT so that the retransmission timeout can be set without worrying that the timeout is too small, causing unnecessary and wasteful retransmissions.
[Brakmo, O’Malley, and Peterson 1994] describe a TCP implementation that provides higher-resolution RTT measurements. This is done by recording the system clock (which has a much higher resolution than 500 ms) when a segment is transmitted and reading the system clock when the ACK is received, calculating a higher-resolution RTT.
The timestamp option provided by Net/3 (Section 26.6) can provide higher-resolution RTTs, but Net/3 sets the resolution of these timestamps to 500 ms.
We now go through an actual example to see how the calculations are performed. We transfer 12288 bytes from the host bsdi
to vangogh.cs.berkeley.edu
. During the transfer we purposely bring down the PPP link being used and then bring it back up, to see how timeouts and retransmissions are handled. To transfer the data we use our sock
program (described in Appendix C of Volume 1) with the -D
option, to enable the SO_DEBUG
socket option (Section 27.10). After the transfer is complete we examine the debug records left in the kernel’s circular buffer using the trpt
(8) program and print the desired timer variables from the TCP control block.
Figure 25.28 shows the calculations that occur at the various times. We use the notation M:N to mean that sequence numbers M through and including N−1 are sent. Each segment in this example contains 512 bytes. The notation “ack M” means that the acknowledgment field of the ACK is M. The column labeled “actual delta (ms)” shows the time difference between the RTT timer going on and going off. The column labeled “rtt (arg.)” shows the second argument to the tcp_xmit_timer function: the number of clock ticks plus 1 between the RTT timer going on and going off.
The function tcp_newtcpcb initializes t_srtt, t_rttvar, and t_rxtcur to the values shown at time 0.0.
The first segment timed is the initial SYN. When its ACK is received 365 ms later, tcp_xmit_timer is called with an rtt argument of 2. Since this is the first RTT measurement (t_srtt is 0), the else clause in Figure 25.23 calculates the first values of the smoothed estimators.
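To make the arithmetic concrete, here is a minimal sketch of a scaled fixed-point estimator update in the style described in this chapter. It is not a copy of the kernel function: the structure rtt_state, the function rtt_update, and the constant names are hypothetical, the scale factors of 8 and 4 and the bounds of 1 and 64 seconds are taken from the text, and the exact form of the update for later measurements is our assumption of a Jacobson-style smoothing.

```c
#include <assert.h>

#define RTT_SHIFT      3    /* smoothed RTT is stored multiplied by 8 */
#define RTTVAR_SHIFT   2    /* mean deviation is stored multiplied by 4 */
#define RTO_MIN_TICKS  2    /* 1 second, in 500-ms ticks */
#define RTO_MAX_TICKS  128  /* 64 seconds, in 500-ms ticks */

struct rtt_state {          /* hypothetical stand-in for the control block */
    int srtt;               /* smoothed RTT, ticks x 8 */
    int rttvar;             /* smoothed mean deviation, ticks x 4 */
    int rxtcur;             /* current retransmission timeout, ticks */
};

static int clamp_ticks(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* rtt is the measured number of ticks plus 1, as in the text. */
void rtt_update(struct rtt_state *s, int rtt)
{
    assert(rtt >= 1);

    if (s->srtt != 0) {
        /* Error between this sample and the unscaled smoothed RTT. */
        int delta = (rtt - 1) - (s->srtt >> RTT_SHIFT);

        /* Adding the unscaled error to the x8 value applies a gain of 1/8. */
        s->srtt += delta;
        if (s->srtt <= 0)
            s->srtt = 1;

        if (delta < 0)
            delta = -delta;
        /* Same trick for the x4 deviation: a gain of 1/4. */
        delta -= s->rttvar >> RTTVAR_SHIFT;
        s->rttvar += delta;
        if (s->rttvar <= 0)
            s->rttvar = 1;
    } else {
        /* First measurement: the sample becomes the smoothed RTT and
         * half the sample becomes the mean deviation. */
        s->srtt = rtt << RTT_SHIFT;
        s->rttvar = rtt << (RTTVAR_SHIFT - 1);
    }

    /* Timeout = smoothed RTT + 4 x deviation, bounded to 1..64 seconds. */
    s->rxtcur = clamp_ticks((s->srtt >> RTT_SHIFT) + s->rttvar,
                            RTO_MIN_TICKS, RTO_MAX_TICKS);
}
```

Starting from a zeroed structure, a first call with an rtt argument of 2 leaves this sketch with a scaled smoothed RTT of 16, a scaled deviation of 4, and a timeout of 6 ticks. A second call with the same argument moves the timeout down to 5 ticks.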
The data segment containing bytes 1 through 512 is the next segment timed, and the RTT variables are updated at time 1.259 when its ACK is received.
The next three segments show how ACKs are cumulative. The timer is started at time 1.260 when bytes 513 through 1024 are sent. Another segment is sent with bytes 1025 through 1536, and the ACK received at time 2.206 acknowledges both data segments. The RTT estimators are then updated, since the ACK covers the starting sequence number being timed (513).
The segment with bytes 1537 through 2048 is transmitted at time 2.206 and the timer is started. Just that segment is acknowledged at time 3.132, and the estimators updated.
The data segment at time 3.132 is timed and the retransmission timer is set to 5 ticks (the current value of t_rxtcur). Somewhere around this time the PPP link between the routers sun and netb is taken down and then brought back up, a procedure that takes a few minutes. When the retransmission timer expires at time 6.064, the code in Figure 25.26 is executed to update the RTT variables. t_rxtshift is incremented from 0 to 1 and t_rxtcur is set to 10 ticks (the exponential backoff). A segment starting with the oldest unacknowledged sequence number (snd_una, which is 3073) is retransmitted. After 5 seconds the timer expires again, t_rxtshift is incremented to 2, and the retransmission timer is set to 20 ticks.
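The progression of 5, 10, and 20 ticks can be reproduced with a short sketch of the exponential backoff. The function rexmt_timeout and the table backoff are hypothetical names, and the assumption that the multiplier stops growing at 64 is ours. The text confirms only the doubling and the bounds of 1 and 64 seconds (2 and 128 ticks).

```c
#include <assert.h>

#define RTO_MIN_TICKS  2    /* 1 second, in 500-ms ticks */
#define RTO_MAX_TICKS  128  /* 64 seconds, in 500-ms ticks */
#define MAX_SHIFT      12   /* assumed largest backoff index */

/* Assumed multipliers: double on every expiration, then level off. */
static const int backoff[MAX_SHIFT + 1] =
    { 1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };

/*
 * base_ticks is the timeout computed from the RTT estimators and shift is
 * the number of consecutive expirations (t_rxtshift in the text).
 */
int rexmt_timeout(int base_ticks, int shift)
{
    int ticks;

    assert(shift >= 0 && shift <= MAX_SHIFT);
    ticks = base_ticks * backoff[shift];
    if (ticks < RTO_MIN_TICKS)
        ticks = RTO_MIN_TICKS;
    if (ticks > RTO_MAX_TICKS)
        ticks = RTO_MAX_TICKS;   /* never wait longer than 64 seconds */
    return ticks;
}
```

With a base of 5 ticks, shifts of 0, 1, and 2 give 5, 10, and 20 ticks. In this sketch a shift of 5 computes 160 ticks, which the upper bound then reduces to 128.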
When the retransmission timer expires at time 42.464, t_srtt is set to 0 and t_rttvar is set to 5. As we mentioned in our discussion of Figure 25.26, this leaves the calculation of t_rxtcur the same (so the next calculation yields 160), but by setting t_srtt to 0, the next time the RTT estimators are updated (at time 218.834), the measured RTT becomes the smoothed RTT, as if the connection were starting fresh.
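One way to see why the timeout calculation is unchanged is to fold the smoothed RTT into the deviation term before clearing it. The sketch below does this under the assumption that the timeout base is the unscaled smoothed RTT plus the stored deviation. The names rtt_state, rexmt_base, and discard_srtt are hypothetical, and the starting values in the usage note are chosen only so that the result matches the t_rttvar of 5 mentioned in the text.

```c
#include <assert.h>

#define RTT_SHIFT 3             /* smoothed RTT is stored multiplied by 8 */

struct rtt_state {              /* hypothetical stand-in for the control block */
    int srtt;                   /* smoothed RTT, ticks x 8 */
    int rttvar;                 /* smoothed mean deviation, ticks x 4 */
};

/* Timeout base: unscaled smoothed RTT plus 4 x the unscaled deviation. */
int rexmt_base(const struct rtt_state *s)
{
    return (s->srtt >> RTT_SHIFT) + s->rttvar;
}

/* Throw away the smoothed RTT without changing the timeout base. */
void discard_srtt(struct rtt_state *s)
{
    int before = rexmt_base(s);

    s->rttvar += s->srtt >> RTT_SHIFT;  /* move the RTT term into rttvar */
    s->srtt = 0;                        /* next sample starts the estimate over */

    assert(rexmt_base(s) == before);    /* the base is preserved */
}
```

For example, a state with srtt of 16 and rttvar of 3 has a base of 5 ticks. After discard_srtt the state holds srtt of 0 and rttvar of 5, and the base is still 5.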
The rest of the data transfer continues, and the estimators are updated a few more times.
The two functions tcp_fasttimo and tcp_slowtimo are called by the kernel every 200 ms and every 500 ms, respectively. These two functions drive TCP’s per-connection timer maintenance.
TCP maintains the following seven timers for each connection:
a connection-establishment timer,
a retransmission timer,
a delayed ACK timer,
a persist timer,
a keepalive timer,
a FIN_WAIT_2 timer, and
a 2MSL timer.
The delayed ACK timer is different from the other six, since when it is set it means a delayed ACK must be sent the next time TCP’s 200-ms timer expires. The other six timers are counters that are decremented by 1 every time TCP’s 500-ms timer expires. When any one of the counters reaches 0, the appropriate action is taken: drop the connection, retransmit a segment, send a keepalive probe, and so on, as described in this chapter. Since some of the timers are mutually exclusive, the six timers are really implemented using four counters, which complicates the code.
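The counter scheme can be sketched in a few lines. This is a simplified model, not the kernel code: the structure conn and the function slow_tick are hypothetical, the index names other than TCPT_2MSL are assumed for illustration, and the real slow timeout function would take the appropriate action for each expired timer instead of returning a bit mask.

```c
#include <assert.h>

/* Four counters implement the six timers; index names are assumed. */
enum { TCPT_REXMT, TCPT_PERSIST, TCPT_KEEP, TCPT_2MSL, TCPT_NTIMERS };

struct conn {                       /* hypothetical per-connection state */
    int t_timer[TCPT_NTIMERS];      /* 0 means the timer is not running */
};

/*
 * Called once per 500-ms tick. Decrements every running counter and
 * returns a bit mask of the timers that reached 0 on this tick.
 */
unsigned slow_tick(struct conn *c)
{
    unsigned fired = 0;
    int i;

    for (i = 0; i < TCPT_NTIMERS; i++) {
        assert(c->t_timer[i] >= 0);
        if (c->t_timer[i] > 0 && --c->t_timer[i] == 0)
            fired |= 1u << i;
    }
    return fired;
}
```

A counter set to 2 therefore expires on the second call, between 500 ms and 1 second after it was set, depending on where in the tick interval it was started.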
This chapter also introduced the recommended way to calculate values for the retransmission timer. TCP maintains two smoothed estimators for a connection: the round-trip time and the mean deviation of the RTT. Although the algorithms are simple and elegant, these estimators are maintained as scaled fixed-point numbers (to provide adequate precision without using floating-point code within the kernel), which complicates the code.
25.1 | How efficient is TCP’s fast timeout function? (Hint: Look at the number of delayed ACKs in Figure 24.5.) Suggest alternative implementations. |
25.1 | In Figure 24.5 there were 531,285 delayed ACKs over 2,592,000 seconds (30 days). This is an average of about one delayed ACK every 5 seconds, or one delayed ACK every 25 calls of the 200-ms fast timeout function. One alternative implementation would be to set a global flag when a delayed ACK is needed and only go through the list of control blocks when the flag is set. Alternatively, another list could be maintained that contains only the control blocks that require a delayed ACK. See, for example, the variable |
25.2 | Why do you think the initialization of |
25.2 | This allows the variable |
25.3 |
|
25.3 |
|
25.4 | Rewrite the code in Figure 25.10 to separate the logic for the two different uses of the |
25.4 | Here is one way to rewrite the code:
    case TCPT_2MSL:
        if (tp->t_state == TCPS_TIME_WAIT)
            tp = tcp_close(tp);
        else {
            if (tp->t_idle <= tcp_maxidle)
                tp->t_timer[TCPT_2MSL] = tcp_keepintvl;
            else
                tp = tcp_close(tp);
        }
        break;
|
25.5 | 75 seconds after the connection in Figure 25.12 enters the FIN_WAIT_2 state a duplicate ACK is received on the connection. What happens? |
25.5 | When the duplicate ACK is received, |
25.6 | A connection has been idle for 1 hour when the application sets the |
25.6 | The first keepalive probe will be sent 1 hour in the future. When the process sets the option, nothing happens other than setting the |
25.7 | Why is |
25.7 | The value of |
25.8 | Rewrite the code related to Exercise 25.6 to implement the alternate behavior. |