Chapter 25. TCP Timers

Introduction

We start our detailed description of the TCP source code by looking at the various TCP timers. We encounter these timers throughout most of the TCP functions.

TCP maintains seven timers for each connection. They are briefly described here, in the approximate order of their occurrence during the lifetime of a connection.

  1. A connection-establishment timer starts when a SYN is sent to establish a new connection. If a response is not received within 75 seconds, the connection establishment is aborted.

  2. A retransmission timer is set when TCP sends data. If the data is not acknowledged by the other end when this timer expires, TCP retransmits the data. The value of this timer (i.e., the amount of time TCP waits for an acknowledgment) is calculated dynamically, based on the round-trip time measured by TCP for this connection, and based on the number of times this data segment has been retransmitted. The retransmission timer is bounded by TCP to be between 1 and 64 seconds.

  3. A delayed ACK timer is set when TCP receives data that must be acknowledged, but need not be acknowledged immediately. Instead, TCP waits up to 200 ms before sending the ACK. If, during this 200-ms time period, TCP has data to send on this connection, the pending acknowledgment is sent along with the data (called piggybacking).

  4. A persist timer is set when the other end of a connection advertises a window of 0, stopping TCP from sending data. Since window advertisements from the other end are not sent reliably (that is, ACKs are not acknowledged, only data is acknowledged), there’s a chance that a future window update, allowing TCP to send some data, can be lost. Therefore, if TCP has data to send and the other end advertises a window of 0, the persist timer is set and when it expires, 1 byte of data is sent to see if the window has opened. Like the retransmission timer, the persist timer value is calculated dynamically, based on the round-trip time. TCP bounds the persist timer value to be between 5 and 60 seconds.

  5. A keepalive timer can be set by the process using the SO_KEEPALIVE socket option. If the connection is idle for 2 hours, the keepalive timer expires and a special segment is sent to the other end, forcing it to respond. If the expected response is received, TCP knows that the other host is still up, and TCP won’t probe it again until the connection is idle for another 2 hours. Other responses to the keepalive probe tell TCP that the other host has crashed and rebooted. If no response is received to a fixed number of keepalive probes, TCP assumes that the other end has crashed, although it can’t distinguish between the other end being down (i.e., it crashed and has not yet rebooted) and a temporary lack of connectivity to the other end (i.e., an intermediate router or phone line is down).

  6. A FIN_WAIT_2 timer. When a connection moves from the FIN_WAIT_1 state to the FIN_WAIT_2 state (Figure 24.15) and the connection cannot receive any more data (implying the process called close, instead of taking advantage of TCP’s half-close with shutdown), this timer is set to 10 minutes. When this timer expires it is reset to 75 seconds, and when it expires the second time the connection is dropped. The purpose of this timer is to avoid leaving a connection in the FIN_WAIT_2 state forever, if the other end never sends a FIN. (We don’t show this timeout in Figure 24.15.)

  7. A TIME_WAIT timer, often called the 2MSL timer. The term 2MSL means twice the MSL, the maximum segment lifetime defined in Section 24.8. It is set when a connection enters the TIME_WAIT state (Figure 24.15), that is, when the connection is actively closed. Section 18.6 of Volume 1 describes the reasoning for the 2MSL wait state in detail. The timer is set to 1 minute (Net/3 uses an MSL of 30 seconds) when the connection enters the TIME_WAIT state and when it expires, the TCP control block and Internet PCB are deleted, allowing that socket pair to be reused.

TCP has two timer functions: one is called every 200 ms (the fast timer) and the other every 500 ms (the slow timer). The delayed ACK timer is different from the other six: when the delayed ACK timer is set for a connection it means that a delayed ACK must be sent the next time the 200-ms timer expires (i.e., the elapsed time is between 0 and 200 ms). The other six timers are decremented every 500 ms, and only when the counter reaches 0 does the corresponding action take place.

Code Introduction

The delayed ACK timer is enabled for a connection when the TF_DELACK flag (Figure 24.14) is set in the TCP control block. The array t_timer in the TCP control block contains four (TCPT_NTIMERS) counters used to implement the other six timers. The indexes into this array are shown in Figure 25.1. We describe briefly how the six timers (other than the delayed ACK timer) are implemented by these four counters.

Figure 25.1. Indexes into the t_timer array.

Constant        Value   Description
TCPT_REXMT      0       retransmission timer
TCPT_PERSIST    1       persist timer
TCPT_KEEP       2       keepalive timer or connection-establishment timer
TCPT_2MSL       3       2MSL timer or FIN_WAIT_2 timer

Each entry in the t_timer array contains the number of 500-ms clock ticks until the timer expires, with 0 meaning that the timer is not set. Since each timer is a short, if 16 bits hold a short, the maximum timer value is 16,383.5 seconds, or about 4.5 hours.
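The arithmetic behind this limit can be checked with a short sketch. The helper names below (seconds_to_ticks, ticks_to_seconds, max_timer_seconds) are hypothetical and are not part of Net/3. The sketch assumes a 16-bit signed short and a slow-timeout rate of 2 ticks per second.

```c
#include <assert.h>
#include <stdint.h>

#define PR_SLOWHZ 2   /* slow-timeout ticks per second (one tick = 500 ms) */

/* Hypothetical helper: convert whole seconds to 500-ms ticks. */
static int seconds_to_ticks(int seconds)
{
    assert(seconds >= 0);
    return seconds * PR_SLOWHZ;
}

/* Hypothetical helper: convert a tick count to seconds. */
static double ticks_to_seconds(int ticks)
{
    return (double)ticks / PR_SLOWHZ;
}

/* Longest interval a 16-bit signed counter can represent, in seconds. */
static double max_timer_seconds(void)
{
    return ticks_to_seconds(INT16_MAX);   /* 32767 ticks */
}
```

With these helpers, seconds_to_ticks(75) gives the 150 ticks used for a 75-second timer, and max_timer_seconds() gives the 16,383.5-second limit quoted above.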

Notice in Figure 25.1 that four “timer counters” implement six TCP “timers,” because some of the timers are mutually exclusive. We’ll distinguish between the counters and the timers. The TCPT_KEEP counter implements both the keepalive timer and the connection-establishment timer, since the two timers are never used at the same time for a connection. Similarly, the 2MSL timer and the FIN_WAIT_2 timer are implemented using the TCPT_2MSL counter, since a connection is only in one state at a time. The first section of Figure 25.2 summarizes the implementation of the seven TCP timers. The second and third sections of the table show how four of the seven timers are initialized using three global variables from Figure 24.3 and two constants from Figure 25.3. Notice that two of the three globals are used with multiple timers. We’ve already said that the delayed ACK timer is tied to TCP’s 200-ms timer, and we describe how the other two timers are set later in this chapter.

Figure 25.2. Implementation of the seven TCP timers.

                          conn. estab.  rexmit  delayed ACK  persist  keepalive  FIN_WAIT_2  2MSL
t_timer[TCPT_REXMT]                     •
t_timer[TCPT_PERSIST]                                        •
t_timer[TCPT_KEEP]        •                                           •
t_timer[TCPT_2MSL]                                                               •           •
t_flags & TF_DELACK                             •

tcp_keepidle (2 hr)                                                   •
tcp_keepintvl (75 sec)                                                •          •
tcp_maxidle (10 min)                                                  •          •

2 * TCPTV_MSL (60 sec)                                                                       •
TCPTV_KEEP_INIT (75 sec)  •

Figure 25.3. Fundamental timer values for the implementation.

Constant          #500-ms clock ticks   #sec   Description
TCPTV_MSL         60                    30     MSL, maximum segment lifetime
TCPTV_MIN         2                     1      minimum value of retransmission timer
TCPTV_REXMTMAX    128                   64     maximum value of retransmission timer
TCPTV_PERSMIN     10                    5      minimum value of persist timer
TCPTV_PERSMAX     120                   60     maximum value of persist timer
TCPTV_KEEP_INIT   150                   75     connection-establishment timer value
TCPTV_KEEP_IDLE   14400                 7200   idle time for connection before first probe (2 hours)
TCPTV_KEEPINTVL   150                   75     time between probes when no response
TCPTV_SRTTBASE    0                            special value to denote no measurements yet for connection
TCPTV_SRTTDFLT    6                     3      default RTT when no measurements yet for connection

Figure 25.3 shows the fundamental timer values for the Net/3 implementation.

Figure 25.4 shows other timer constants that we’ll encounter.

Figure 25.4. Timer constants.

Constant          Value   Description
TCP_LINGERTIME    120     maximum #seconds for SO_LINGER socket option
TCP_MAXRXTSHIFT   12      maximum #retransmissions waiting for an ACK
TCPTV_KEEPCNT     8       maximum #keepalive probes when no response received

The TCPT_RANGESET macro, shown in Figure 25.5, sets a timer to a given value, making certain the value is between the specified minimum and maximum.

Figure 25.5. TCPT_RANGESET macro.

-------------------------------------------------------------------- tcp_timer.h
102 #define TCPT_RANGESET(tv, value, tvmin, tvmax) { \
103     (tv) = (value); \
104     if ((tv) < (tvmin)) \
105         (tv) = (tvmin); \
106     else if ((tv) > (tvmax)) \
107         (tv) = (tvmax); \
108 }
-------------------------------------------------------------------- tcp_timer.h

We see in Figure 25.3 that the retransmission timer and the persist timer have upper and lower bounds, since their values are calculated dynamically, based on the measured round-trip time. The other timers are set to constant values.
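As a minimal sketch of how the range macro behaves, the following stand-alone fragment repeats the macro from Figure 25.5 and wraps it in a function. The wrapper clamp_ticks is hypothetical, introduced here only so that the macro can be exercised outside the kernel. The two constants are the retransmission bounds from Figure 25.3.

```c
#include <assert.h>

#define TCPTV_MIN       2     /* 1 second, in 500-ms ticks */
#define TCPTV_REXMTMAX  128   /* 64 seconds, in 500-ms ticks */

/* Same logic as the macro in Figure 25.5. */
#define TCPT_RANGESET(tv, value, tvmin, tvmax) { \
    (tv) = (value); \
    if ((tv) < (tvmin)) \
        (tv) = (tvmin); \
    else if ((tv) > (tvmax)) \
        (tv) = (tvmax); \
}

/* Hypothetical wrapper, not part of Net/3: clamp a tick count. */
static short clamp_ticks(int value, int tvmin, int tvmax)
{
    short tv;

    assert(tvmin <= tvmax);
    TCPT_RANGESET(tv, value, tvmin, tvmax);
    return tv;
}
```

A candidate timeout of 0 ticks is raised to TCPTV_MIN, 500 ticks is lowered to TCPTV_REXMTMAX, and anything in between is stored unchanged. Because the macro assigns before it compares, value is first converted to the type of tv, so a caller needs value to fit in that type.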

There is one additional timer that we allude to in Figure 25.4 but don’t discuss in this chapter: the linger timer for a socket, set by the SO_LINGER socket option. This is a socket-level timer used by the close system call (Section 15.15). We will see in Figure 30.12 that when a socket is closed, TCP checks whether this socket option is set and whether the linger time is 0. If so, the connection is aborted with an RST instead of TCP’s normal close.

tcp_canceltimers Function

The function tcp_canceltimers, shown in Figure 25.6, is called by tcp_input when the TIME_WAIT state is entered. All four timer counters are set to 0, which turns off the retransmission, persist, keepalive, and FIN_WAIT_2 timers, before tcp_input sets the 2MSL timer.

Figure 25.6. tcp_canceltimers function.

------------------------------------------------------------ tcp_timer.c
107 void
108 tcp_canceltimers(tp)
109 struct tcpcb *tp;
110 {
111     int     i;

112     for (i = 0; i < TCPT_NTIMERS; i++)
113         tp->t_timer[i] = 0;
114 }
------------------------------------------------------------ tcp_timer.c

tcp_fasttimo Function

The function tcp_fasttimo, shown in Figure 25.7, is called by pr_fasttimo every 200 ms. It handles only the delayed ACK timer.

Figure 25.7. tcp_fasttimo function, which is called every 200 ms.

------------------------------------------------------------------------ tcp_timer.c
 41 void
 42 tcp_fasttimo()
 43 {
 44     struct inpcb *inp;
 45     struct tcpcb *tp;
 46     int     s = splnet();

 47     inp = tcb.inp_next;
 48     if (inp)
 49         for (; inp != &tcb; inp = inp->inp_next)
 50             if ((tp = (struct tcpcb *) inp->inp_ppcb) &&
 51                 (tp->t_flags & TF_DELACK)) {
 52                 tp->t_flags &= ~TF_DELACK;
 53                 tp->t_flags |= TF_ACKNOW;
 54                 tcpstat.tcps_delack++;
 55                 (void) tcp_output(tp);
 56             }
 57     splx(s);
 58 }
------------------------------------------------------------------------ tcp_timer.c

Each Internet PCB on the TCP list that has a corresponding TCP control block is checked. If the TF_DELACK flag is set, it is cleared and the TF_ACKNOW flag is set instead. tcp_output is called, and since the TF_ACKNOW flag is set, an ACK is sent.

How can TCP have an Internet PCB on its PCB list that doesn’t have a TCP control block (the test at line 50)? When a socket is created (the PRU_ATTACH request, in response to the socket system call) we’ll see in Figure 30.11 that the creation of the Internet PCB is done first, followed by the creation of the TCP control block. Between these two operations a high-priority clock interrupt can occur (Figure 1.13), which calls tcp_fasttimo.
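The flag handling can be tried outside the kernel with a small user-space model. This is only a sketch: struct conn and fake_output are hypothetical stand-ins for the TCP control block and tcp_output, the flag values are assumed for illustration, and a null array entry models a PCB whose control block has not been created yet.

```c
#include <assert.h>
#include <stddef.h>

#define TF_ACKNOW 0x0001   /* assumed value: ACK the peer immediately */
#define TF_DELACK 0x0002   /* assumed value: ACK, but try to delay it */

struct conn {              /* hypothetical stand-in for a TCP control block */
    int flags;
    int acks_sent;
};

/* Stand-in for tcp_output: send an ACK if TF_ACKNOW is set. */
static void fake_output(struct conn *c)
{
    if (c->flags & TF_ACKNOW) {
        c->acks_sent++;
        c->flags &= ~TF_ACKNOW;
    }
}

/*
 * One pass of the 200-ms timer. Returns the number of delayed ACKs
 * that were converted into immediate ACKs.
 */
static int fast_timeout(struct conn *conns[], size_t n)
{
    int flushed = 0;

    assert(conns != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct conn *c = conns[i];

        if (c != NULL && (c->flags & TF_DELACK)) {
            c->flags &= ~TF_DELACK;
            c->flags |= TF_ACKNOW;
            fake_output(c);
            flushed++;
        }
    }
    return flushed;
}
```

Only connections with TF_DELACK set are touched, and the null entry is skipped in the same way that the test at line 50 skips a PCB without a control block.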

tcp_slowtimo Function

The function tcp_slowtimo, shown in Figure 25.8, is called by pr_slowtimo every 500 ms. It handles the other six TCP timers: connection establishment, retransmission, persist, keepalive, FIN_WAIT_2, and 2MSL.

Figure 25.8. tcp_slowtimo function, which is called every 500 ms.

------------------------------------------------------------------------ tcp_timer.c
 64 void
 65 tcp_slowtimo()
 66 {
 67     struct inpcb *ip, *ipnxt;
 68     struct tcpcb *tp;
 69     int     s = splnet();
 70     int     i;

 71     tcp_maxidle = TCPTV_KEEPCNT * tcp_keepintvl;
 72     /*
 73      * Search through tcb's and update active timers.
 74      */
 75     ip = tcb.inp_next;
 76     if (ip == 0) {
 77         splx(s);
 78         return;
 79     }
 80     for (; ip != &tcb; ip = ipnxt) {
 81         ipnxt = ip->inp_next;
 82         tp = intotcpcb(ip);
 83         if (tp == 0)
 84             continue;
 85         for (i = 0; i < TCPT_NTIMERS; i++) {
 86             if (tp->t_timer[i] && --tp->t_timer[i] == 0) {
 87                 (void) tcp_usrreq(tp->t_inpcb->inp_socket,
 88                                   PRU_SLOWTIMO, (struct mbuf *) 0,
 89                                   (struct mbuf *) i, (struct mbuf *) 0);
 90                 if (ipnxt->inp_prev != ip)
 91                     goto tpgone;
 92             }
 93         }
 94         tp->t_idle++;
 95         if (tp->t_rtt)
 96             tp->t_rtt++;
 97       tpgone:
 98         ;
 99     }
100     tcp_iss += TCP_ISSINCR / PR_SLOWHZ;     /* increment iss */
101     tcp_now++;                  /* for timestamps */
102     splx(s);
103 }
------------------------------------------------------------------------ tcp_timer.c

71

tcp_maxidle is initialized to 10 minutes. This is the maximum amount of time TCP will send keepalive probes to another host, waiting for a response from that host. This variable is also used with the FIN_WAIT_2 timer, as we describe in Section 25.6. This initialization statement could be moved to tcp_init, since it only needs to be evaluated when the system is initialized (see Exercise 25.2).

Check each timer counter in all TCP control blocks

72-89

Each Internet PCB on the TCP list that has a corresponding TCP control block is checked. Each of the four timer counters for each connection is tested, and if nonzero, the counter is decremented. When the timer reaches 0, a PRU_SLOWTIMO request is issued. We’ll see that this request calls the function tcp_timers, which we describe later in this chapter.

The fourth argument to tcp_usrreq is a pointer to an mbuf. But this argument is actually used for different purposes when the mbuf pointer is not required. Here we see the index i is passed, telling the request which timer has expired. The funny-looking cast of i to an mbuf pointer is to avoid a compile-time error.

Check if TCP control block has been deleted

90-93

Before examining the timers for a control block, a pointer to the next Internet PCB is saved in ipnxt. Each time the PRU_SLOWTIMO request returns, tcp_slowtimo checks whether the next PCB in the TCP list still points to the PCB that’s being processed. If not, the control block has been deleted (perhaps the 2MSL timer expired, or the retransmission timer expired and TCP is giving up on this connection). This causes a jump to tpgone, which skips the remaining timers for this control block and moves on to the next PCB.
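The technique of saving the next pointer and then checking the back pointer can be shown on a small circular list. The sketch below is a user-space model under stated assumptions: struct node, node_link, node_unlink, and on_expire are hypothetical, and the handler simply unlinks the node when counter 3 expires, standing in for a connection that is closed from within the timer handler.

```c
#include <assert.h>
#include <stddef.h>

#define NTIMERS 4

struct node {                  /* hypothetical stand-in for PCB plus control block */
    struct node *next, *prev;
    short timer[NTIMERS];
    int idle;
};

/* Link n directly after the list head. */
static void node_link(struct node *head, struct node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Remove n from the list, as happens when a connection is closed. */
static void node_unlink(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Hypothetical expiry handler: in this sketch counter 3 removes the node. */
static void on_expire(struct node *n, int which)
{
    if (which == 3)
        node_unlink(n);
}

/* One 500-ms pass over the circular list headed by head. */
static void slow_timeout(struct node *head)
{
    struct node *n, *nxt;

    assert(head != NULL);
    for (n = head->next; n != head; n = nxt) {
        int gone = 0;

        nxt = n->next;                    /* saved before any handler runs */
        for (int i = 0; i < NTIMERS; i++) {
            if (n->timer[i] && --n->timer[i] == 0) {
                on_expire(n, i);
                if (nxt->prev != n) {     /* n is no longer on the list */
                    gone = 1;
                    break;
                }
            }
        }
        if (!gone)
            n->idle++;                    /* only for surviving nodes */
    }
}
```

The loop never follows a pointer out of a removed node: the successor was captured before the handler ran, and the back-pointer comparison tells the loop whether the current node is still linked.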

Count idle time

94

t_idle is incremented for the control block. This counts the number of 500-ms clock ticks since the last segment was received on this connection. It is set to 0 by tcp_input when a segment is received on the connection and used for three purposes: (1) by the keepalive algorithm to send a probe after the connection is idle for 2 hours, (2) to drop a connection in the FIN_WAIT_2 state that is idle for 10 minutes and 75 seconds, and (3) by tcp_output to return to the slow start algorithm after the connection has been idle for a while.

Increment RTT counter

95-96

If this connection is timing an outstanding segment, t_rtt is nonzero and counts the number of 500-ms clock ticks until that segment is acknowledged. It is initialized to 1 by tcp_output when a segment is transmitted whose RTT should be timed. tcp_slowtimo increments this counter.

Increment initial send sequence number

100

tcp_iss was initialized to 1 by tcp_init. Every 500 ms it is incremented by 64,000: 128,000 (TCP_ISSINCR) divided by 2 (PR_SLOWHZ). This is a rate of about once every 8 microseconds, although tcp_iss is incremented only twice a second. We’ll see that tcp_iss is also incremented by 64,000 each time a connection is established, either actively or passively.

RFC 793 specifies that the initial sequence number should increment roughly every 4 microseconds, or 250,000 times a second. The Net/3 value increments at about one-half this rate.
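The rates quoted here follow from simple division, as the following sketch shows. The two function names are hypothetical. The constants carry the values given in the text: 128,000 for TCP_ISSINCR and 2 for PR_SLOWHZ.

```c
#include <assert.h>

#define TCP_ISSINCR 128000L   /* ISS increment per second */
#define PR_SLOWHZ   2         /* slow-timeout calls per second */

/* Amount added to tcp_iss on each 500-ms tick. */
static long iss_step_per_tick(void)
{
    return TCP_ISSINCR / PR_SLOWHZ;
}

/* Average time per unit increment of the ISS, in microseconds. */
static double iss_usec_per_increment(long incr_per_sec)
{
    assert(incr_per_sec > 0);
    return 1e6 / (double)incr_per_sec;
}
```

iss_usec_per_increment(TCP_ISSINCR) evaluates to 7.8125, the "about once every 8 microseconds" of the text, while the RFC 793 rate of 250,000 per second gives exactly 4 microseconds.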

Increment RFC 1323 timestamp value

101

tcp_now is initialized to 0 on bootstrap and incremented every 500 ms. It is used by the timestamp option defined in RFC 1323 [Jacobson, Braden, and Borman 1992], which we describe in Section 26.6.

75-79

Notice that if there are no TCP connections active on the host (tcb.inp_next is null), neither tcp_iss nor tcp_now is incremented. This would occur only when the system is being initialized, since it would be rare to find a Unix system attached to a network without a few TCP servers active.

tcp_timers Function

The function tcp_timers is called by TCP’s PRU_SLOWTIMO request (Figure 30.10):

case PRU_SLOWTIMO:
    tp = tcp_timers(tp, (int)nam);

when any one of the four TCP timer counters reaches 0 (Figure 25.8).

The structure of the function is a switch statement with one case per timer, as outlined in Figure 25.9.

Figure 25.9. tcp_timers function: general organization.

----------------------------------------------------------------- tcp_timer.c
120 struct tcpcb *
121 tcp_timers(tp, timer)
122 struct tcpcb *tp;
123 int     timer;
124 {
125     int     rexmt;

126     switch (timer) {
                                                                            
                                   /* switch cases */                       
                                                                            
256     }
257     return (tp);
258 }
----------------------------------------------------------------- tcp_timer.c

We now discuss three of the four timer counters (five of TCP’s timers), saving the retransmission timer for Section 25.11.

FIN_WAIT_2 and 2MSL Timers

TCP’s TCPT_2MSL counter implements two of TCP’s timers.

  1. FIN_WAIT_2 timer. When tcp_input moves from the FIN_WAIT_1 state to the FIN_WAIT_2 state and the socket cannot receive any more data (implying the process called close, instead of taking advantage of TCP’s half-close with shutdown), the FIN_WAIT_2 timer is set to 10 minutes (tcp_maxidle). We’ll see that this prevents the connection from staying in the FIN_WAIT_2 state forever.

  2. 2MSL timer. When TCP enters the TIME_WAIT state, the 2MSL timer is set to 60 seconds (TCPTV_MSL times 2).

Figure 25.10 shows the case for the 2MSL timer executed when the timer reaches 0.

Figure 25.10. tcp_timers function: expiration of 2MSL timer counter.

------------------------------------------------------------------- tcp_timer.c
127         /*
128          * 2 MSL timeout in shutdown went off.  If we're closed but
129          * still waiting for peer to close and connection has been idle
130          * too long, or if 2MSL time is up from TIME_WAIT, delete connection
131          * control block.  Otherwise, check again in a bit.
132          */
133     case TCPT_2MSL:
134         if (tp->t_state != TCPS_TIME_WAIT &&
135             tp->t_idle <= tcp_maxidle)
136             tp->t_timer[TCPT_2MSL] = tcp_keepintvl;
137         else
138             tp = tcp_close(tp);
139         break;
------------------------------------------------------------------- tcp_timer.c

2MSL timer

127-139

The puzzling logic in the conditional is because the two different uses of the TCPT_2MSL counter are intermixed (Exercise 25.4). Let’s first look at the TIME_WAIT state. When the timer expires after 60 seconds, tcp_close is called and the control blocks are released. We have the scenario shown in Figure 25.11.

Figure 25.11. Setting and expiration of 2MSL timer in TIME_WAIT state.

This figure shows the series of function calls that occurs when the 2MSL timer expires. We also see that setting one of the timers for N seconds in the future (2 x N ticks) causes the timer to expire somewhere between 2 x N - 1 and 2 x N ticks in the future, since the first decrement of the counter occurs between 0 and 500 ms after the counter is set.
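The expiration window can be expressed as a one-line calculation. In this sketch, which is not Net/3 code, the hypothetical function expiry_delay_ms takes the counter value and the number of milliseconds remaining until the next call of the slow timer, and returns the elapsed time until the counter reaches 0.

```c
#include <assert.h>

#define TICK_MS 500   /* period of the slow timer, in milliseconds */

/*
 * Elapsed milliseconds from setting a counter to `ticks` until it
 * reaches 0, given the time left until the next slow-timer call.
 */
static long expiry_delay_ms(int ticks, int ms_to_next_tick)
{
    assert(ticks > 0);
    assert(ms_to_next_tick >= 0 && ms_to_next_tick <= TICK_MS);
    /* The first decrement costs ms_to_next_tick, each later one a full tick. */
    return ms_to_next_tick + (long)(ticks - 1) * TICK_MS;
}
```

For the 2MSL timer (120 ticks) the result lies between 59,500 and 60,000 ms, that is, between 2 x N - 1 and 2 x N ticks for N = 60.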

FIN_WAIT_2 timer

127-139

If the connection state is not TIME_WAIT, the TCPT_2MSL counter is the FIN_WAIT_2 timer. As soon as the connection has been idle for more than 10 minutes (tcp_maxidle) the connection is closed. But if the connection has been idle for less than or equal to 10 minutes, the FIN_WAIT_2 timer is reset for 75 seconds in the future. Figure 25.12 shows the typical scenario.

Figure 25.12. FIN_WAIT_2 timer to avoid infinite wait in FIN_WAIT_2 state.

The connection moves from the FIN_WAIT_1 state to the FIN_WAIT_2 state on the receipt of an ACK (Figure 24.15). Receiving this ACK sets t_idle to 0 and the FIN_WAIT_2 timer is set to 1200 (tcp_maxidle). In Figure 25.12 we show the up arrow just to the right of the tick mark starting the 10-minute period, to reiterate that the first decrement of the counter occurs between 0 and 500 ms after the counter is set. After 1200 ticks the timer expires, but since t_idle is incremented after the test and decrement of the four counters in Figure 25.8, t_idle is 1199. (We assume the connection is idle for this 10-minute period.) The comparison of 1199 as less than or equal to 1200 is true, so the FIN_WAIT_2 timer is set to 150 (tcp_keepintvl). When the timer expires again in 75 seconds, assuming the connection is still idle, t_idle is now 1349, the test is false, and tcp_close is called.
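The bookkeeping in this scenario is easy to get wrong by one tick, so a small simulation can help. The following sketch is ours and is not Net/3 code. It models only the TCPT_2MSL counter and t_idle for an idle connection in the FIN_WAIT_2 state, assuming both are set by the same incoming ACK and that no further segment arrives. The struct and function names are hypothetical.

```c
#include <assert.h>

struct fw2_trace {
    int expirations;     /* times the counter reached 0 */
    int idle_at_first;   /* t_idle seen at the first expiration */
    int idle_at_close;   /* t_idle seen when the connection is closed */
    int ticks_to_close;  /* slow-timer calls until the close */
};

/* Simulate an idle FIN_WAIT_2 connection; arguments are in 500-ms ticks. */
static struct fw2_trace simulate_fin_wait_2(int maxidle, int keepintvl)
{
    struct fw2_trace tr = { 0, 0, 0, 0 };
    int timer = maxidle;   /* set when the ACK of our FIN arrives */
    int idle = 0;          /* zeroed by the same segment */

    assert(maxidle > 0 && keepintvl > 0);
    for (;;) {
        tr.ticks_to_close++;
        if (--timer == 0) {
            tr.expirations++;
            if (tr.expirations == 1)
                tr.idle_at_first = idle;
            if (idle <= maxidle) {
                timer = keepintvl;        /* check again later */
            } else {
                tr.idle_at_close = idle;  /* idle too long: close */
                return tr;
            }
        }
        idle++;   /* incremented after the counters are handled */
    }
}
```

Calling simulate_fin_wait_2(1200, 150) replays the scenario of Figure 25.12, with each loop iteration standing for one call to tcp_slowtimo. The fields of the returned structure report the value of t_idle seen at each expiration.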

The reason for the 75-second timeout after the first 10-minute timeout is as follows: a connection in the FIN_WAIT_2 state is not dropped until the connection has been idle for more than 10 minutes. There’s no reason to test t_idle until at least 10 minutes have expired, but once this time has passed, the value of t_idle is checked every 75 seconds. Since a duplicate segment could be received, say a duplicate of the ACK that moved the connection from the FIN_WAIT_1 state to the FIN_WAIT_2 state, the 10-minute wait is restarted when the segment is received (since t_idle will be set to 0).

Terminating an idle connection after more than 10 minutes in the FIN_WAIT_2 state violates the protocol specification, but it is practical. In the FIN_WAIT_2 state the process has called close, all outstanding data on the connection has been sent and acknowledged, the other end has acknowledged the FIN, and TCP is waiting for the process at the other end of the connection to issue its close. If the other process never closes its end of the connection, our end can remain in the FIN_WAIT_2 state forever. A counter should be maintained for the number of connections terminated for this reason, to see how often this occurs.

Persist Timer

Figure 25.13 shows the case for when the persist timer expires.

Figure 25.13. tcp_timers function: expiration of persist timer.

----------------------------------------------------------------------- tcp_timer.c
210         /*
211          * Persistence timer into zero window.
212          * Force a byte to be output, if possible.
213          */
214     case TCPT_PERSIST:
215         tcpstat.tcps_persisttimeo++;
216         tcp_setpersist(tp);
217         tp->t_force = 1;
218         (void) tcp_output(tp);
219         tp->t_force = 0;
220         break;
----------------------------------------------------------------------- tcp_timer.c

Force window probe segment

210-220

When the persist timer expires, there is data to send on the connection but TCP has been stopped by the other end’s advertisement of a zero-sized window. tcp_setpersist calculates the next value for the persist timer and stores it in the TCPT_PERSIST counter. The flag t_force is set to 1, forcing tcp_output to send 1 byte, even though the window advertised by the other end is 0.

Figure 25.14 shows typical values of the persist timer for a LAN, assuming the retransmission timeout for the connection is 1.5 seconds (see Figure 22.1 of Volume 1).

Figure 25.14. Time line of persist timer when probing a zero window.

Once the value of the persist timer reaches 60 seconds, TCP continues sending window probes every 60 seconds. The reason the first two values are both 5, and not 1.5 and 3, is that the persist timer is lower bounded at 5 seconds. It is also upper bounded at 60 seconds. The multiplication of each value by 2 to give the next value is called an exponential backoff, and we describe how it is calculated in Section 25.9.
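The sequence of persist intervals can be reproduced with a simplified model. The sketch below takes the retransmission timeout as given and doubles it once per probe before clamping. It is not the calculation Net/3 performs, which is deferred to Section 25.9. The function name persist_ticks is hypothetical.

```c
#include <assert.h>

#define TCPTV_PERSMIN 10    /* 5 seconds, in 500-ms ticks */
#define TCPTV_PERSMAX 120   /* 60 seconds, in 500-ms ticks */

/*
 * Simplified persist interval: the base timeout doubled `shift`
 * times, then clamped to the persist bounds. All values in ticks.
 */
static int persist_ticks(int base_ticks, int shift)
{
    long t;

    assert(base_ticks > 0 && shift >= 0 && shift < 16);
    t = (long)base_ticks << shift;
    if (t < TCPTV_PERSMIN)
        t = TCPTV_PERSMIN;
    else if (t > TCPTV_PERSMAX)
        t = TCPTV_PERSMAX;
    return (int)t;
}
```

With a base of 3 ticks (1.5 seconds), shifts 0 through 7 give 10, 10, 12, 24, 48, 96, 120, and 120 ticks, which is 5, 5, 6, 12, 24, 48, 60, and 60 seconds.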

Connection Establishment and Keepalive Timers

TCP’s TCPT_KEEP counter implements two timers:

  1. When a SYN is sent, the connection-establishment timer is set to 75 seconds (TCPTV_KEEP_INIT). This happens when connect is called, putting a connection into the SYN_SENT state (active open), or when a connection moves from the LISTEN to the SYN_RCVD state (passive open). If the connection doesn’t enter the ESTABLISHED state within 75 seconds, the connection is dropped.

  2. When a segment is received on a connection, tcp_input resets the keepalive timer for that connection to 2 hours (tcp_keepidle), and the t_idle counter for the connection is reset to 0. This happens for every TCP connection on the system, whether the keepalive option is enabled for the socket or not. If the keepalive timer expires (2 hours after the last segment was received on the connection), and if the socket option is set, a keepalive probe is sent to the other end. If the timer expires and the socket option is not set, the keepalive timer is just reset for 2 hours in the future.

Figure 25.16 shows the case for TCP’s TCPT_KEEP counter.

Connection-establishment timer expires after 75 seconds

221-228

If the state is less than ESTABLISHED (Figure 24.16), the TCPT_KEEP counter is the connection-establishment timer. At the label dropit, tcp_drop is called to terminate the connection attempt with an error of ETIMEDOUT. We’ll see that this error is only the default: if, for example, a soft error such as an ICMP host unreachable was received on the connection, the error returned to the process will be changed to EHOSTUNREACH instead.

In Figure 30.4 we’ll see that when TCP sends a SYN, two timers are initialized: the connection-establishment timer as we just described, with a value of 75 seconds, and the retransmission timer, to cause the SYN to be retransmitted if no response is received. Figure 25.15 shows these two timers.

Figure 25.15. Connection-establishment timer and retransmission timer after SYN is sent.

The retransmission timer is initialized to 6 seconds for a new connection (Figure 25.19), and successive values are calculated to be 24 and 48 seconds. We describe how these values are calculated in Section 25.7. The retransmission timer causes the SYN to be transmitted a total of three times, at times 0, 6, and 30. At time 75, 3 seconds before the retransmission timer would expire again, the connection-establishment timer expires, and tcp_drop terminates the connection attempt.
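The interaction of the two timers amounts to summing the retransmission intervals until the connection-establishment deadline intervenes. The following sketch takes the intervals as given rather than computing them, and the function name syn_transmissions is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many times a SYN is sent: once at time 0, then once after
 * each interval, as long as that happens before the deadline. If a
 * retransmission is preempted by the deadline, *cut_off receives the
 * time at which it would have occurred, otherwise -1.
 */
static int syn_transmissions(const int intervals[], size_t n,
                             int deadline, int *cut_off)
{
    int t = 0;
    int sent = 1;            /* the initial SYN at time 0 */

    assert(intervals != NULL && deadline > 0);
    if (cut_off != NULL)
        *cut_off = -1;
    for (size_t i = 0; i < n; i++) {
        t += intervals[i];
        if (t >= deadline) {
            if (cut_off != NULL)
                *cut_off = t;
            break;
        }
        sent++;
    }
    return sent;
}
```

With intervals of 6, 24, and 48 seconds and a deadline of 75 seconds, the function reports three transmissions and a preempted retransmission at time 78.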

Figure 25.16. tcp_timers function: expiration of keepalive timer.

----------------------------------------------------------------------- tcp_timer.c
221         /*
222          * Keep-alive timer went off; send something
223          * or drop connection if idle for too long.
224          */
225     case TCPT_KEEP:
226         tcpstat.tcps_keeptimeo++;
227         if (tp->t_state < TCPS_ESTABLISHED)
228             goto dropit;        /* connection establishment timer */

229         if (tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE &&
230             tp->t_state <= TCPS_CLOSE_WAIT) {
231             if (tp->t_idle >= tcp_keepidle + tcp_maxidle)
232                 goto dropit;
233             /*
234              * Send a packet designed to force a response
235              * if the peer is up and reachable:
236              * either an ACK if the connection is still alive,
237              * or an RST if the peer has closed the connection
238              * due to timeout or reboot.
239              * Using sequence number tp->snd_una-1
240              * causes the transmitted zero-length segment
241              * to lie outside the receive window;
242              * by the protocol spec, this requires the
243              * correspondent TCP to respond.
244              */
245             tcpstat.tcps_keepprobe++;
246             tcp_respond(tp, tp->t_template, (struct mbuf *) NULL,
247                         tp->rcv_nxt, tp->snd_una - 1, 0);
248             tp->t_timer[TCPT_KEEP] = tcp_keepintvl;
249         } else
250             tp->t_timer[TCPT_KEEP] = tcp_keepidle;
251         break;
252       dropit:
253         tcpstat.tcps_keepdrops++;
254         tp = tcp_drop(tp, ETIMEDOUT);
255         break;
----------------------------------------------------------------------- tcp_timer.c

Keepalive timer expires after 2 hours of idle time

229-230

This timer expires after 2 hours of idle time on every connection, not just ones with the SO_KEEPALIVE socket option enabled. If the socket option is set, probes are sent only if the connection is in the ESTABLISHED or CLOSE_WAIT states (Figure 24.15). Once the process calls close (the states greater than CLOSE_WAIT), keepalive probes are not sent, even if the connection is idle for 2 hours.

Drop connection when no response

231-232

If the total idle time for the connection is greater than or equal to 2 hours (tcp_keepidle) plus 10 minutes (tcp_maxidle), the connection is dropped. This means that TCP has sent its limit of nine keepalive probes, 75 seconds apart (tcp_keepintvl), with no response. One reason TCP must send multiple keepalive probes before considering the connection dead is that the ACKs sent in response do not contain data and therefore are not reliably transmitted by TCP. An ACK that is a response to a keepalive probe can get lost.

Send a keepalive probe

233-248

If TCP hasn’t reached the keepalive limit, tcp_respond sends a keepalive packet. The acknowledgment field of the keepalive packet (the fourth argument to tcp_respond) contains rcv_nxt, the next sequence number expected on the connection. The sequence number field of the keepalive packet (the fifth argument) deliberately contains snd_una minus 1, which is the sequence number of a byte of data that the other end has already acknowledged (Figure 24.17). Since this sequence number is outside the window, the other end must respond with an ACK, specifying the next sequence number it expects.
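Why snd_una minus 1 always falls outside the peer's receive window can be checked with wraparound-safe sequence comparisons. In the sketch below the comparison macros follow the usual signed-difference idiom. The functions in_receive_window and keepalive_seq are hypothetical helpers, and we assume that the peer's rcv_nxt is at least our snd_una, since the peer has acknowledged everything before it.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Sequence-space comparisons that tolerate 32-bit wraparound. */
#define SEQ_LT(a, b)   ((int32_t)((a) - (b)) < 0)
#define SEQ_GEQ(a, b)  ((int32_t)((a) - (b)) >= 0)

/* Is seq inside the receive window [rcv_nxt, rcv_nxt + rcv_wnd)? */
static int in_receive_window(tcp_seq seq, tcp_seq rcv_nxt, uint32_t rcv_wnd)
{
    /* A window must cover less than half the sequence space. */
    assert(rcv_wnd < 0x80000000u);
    return SEQ_GEQ(seq, rcv_nxt) && SEQ_LT(seq, rcv_nxt + rcv_wnd);
}

/* Sequence number placed in a keepalive probe. */
static tcp_seq keepalive_seq(tcp_seq snd_una)
{
    return snd_una - 1;
}
```

If snd_una is 1000 and the peer expects 1000 next, the probe carries 999 and lies just to the left of the window. The same holds when snd_una is 0 and the probe's sequence number wraps to 4,294,967,295.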

Figure 25.17 summarizes this use of the keepalive timer.

Figure 25.17. Summary of keepalive timer to detect unreachability of other end.

The nine keepalive probes are sent every 75 seconds, starting at time 0, through time 600. At time 675 (11.25 minutes after the 2-hour timer expired) the connection is dropped. Notice that nine keepalive probes are sent, even though the constant TCPTV_KEEPCNT (Figure 25.4) is 8. This is because the variable t_idle is incremented in Figure 25.8 after the timer is decremented, compared to 0, and possibly handled. When tcp_input receives a segment on a connection, it sets the keepalive timer to 14400 (tcp_keepidle) and t_idle to 0. The next time tcp_slowtimo is called, the keepalive timer is decremented to 14399 and t_idle is incremented to 1. About 2 hours later, when the keepalive timer is decremented from 1 to 0 and tcp_timers is called, the value of t_idle will be 14399. We can build the table in Figure 25.18 to see the value of t_idle each time tcp_timers is called.

Table 25.18. The value of t_idle when tcp_timers is called for keepalive processing.

probe #   time in Figure 25.17   t_idle
---------------------------------------
1         0                      14399
2         75                     14549
3         150                    14699
4         225                    14849
5         300                    14999
6         375                    15149
7         450                    15299
8         525                    15449
9         600                    15599
          675                    15749

The code in Figure 25.16 is waiting for t_idle to be greater than or equal to 15600 (tcp_keepidle + tcp_maxidle) and that only happens at time 675 in Figure 25.17, after nine keepalive probes have been sent.
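The off-by-one behind the ninth probe can be checked with a small simulation. The following is a minimal sketch, not the Net/3 code: it assumes the slow timer decrements and handles the keepalive timer before incrementing t_idle, as described above, and it assumes the other end never responds. The function name count_keepalive_probes and the enum names are ours.

```c
#include <assert.h>

/* Values from the text, in 500-ms slow-timer ticks. */
enum {
    KEEPIDLE  = 14400,               /* tcp_keepidle: 2 hours */
    KEEPINTVL = 150,                 /* tcp_keepintvl: 75 seconds */
    KEEPCNT   = 8,                   /* TCPTV_KEEPCNT */
    MAXIDLE   = KEEPCNT * KEEPINTVL  /* tcp_maxidle: 10 minutes */
};

/*
 * Simulate the slow timer for an idle connection that never answers.
 * Returns the number of probes sent before the drop and stores the
 * value of t_idle seen at the drop in *idle_at_drop.
 */
int
count_keepalive_probes(int *idle_at_drop)
{
    int timer = KEEPIDLE;   /* keepalive timer, set when last segment arrived */
    int t_idle = 0;
    int probes = 0;

    assert(idle_at_drop != 0);
    for (;;) {
        /* Decrement, compare to 0, and handle expiration ... */
        if (timer && --timer == 0) {
            if (t_idle >= KEEPIDLE + MAXIDLE) {
                *idle_at_drop = t_idle;
                return probes;      /* connection dropped */
            }
            probes++;               /* send a keepalive probe */
            timer = KEEPINTVL;
        }
        /* ... and only then count this tick as idle time. */
        t_idle++;
    }
}
```

Under these assumptions the function reports nine probes, and the drop happens when t_idle is 15749, matching the last row of Figure 25.18.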

Reset keepalive timer

249-250

If the socket option is not set or the connection state is greater than CLOSE_WAIT, the keepalive timer for this connection is reset to 2 hours (tcp_keepidle).

Unfortunately the counter tcps_keepdrops (line 253) counts both uses of the TCPT_KEEP counter: the connection-establishment timer and the keepalive timer.

Retransmission Timer Calculations

The timers that we’ve described so far in this chapter have fixed times associated with them: 200 ms for the delayed ACK timer, 75 seconds for the connection-establishment timer, 2 hours for the keepalive timer, and so on. The final two timers that we describe, the retransmission timer and the persist timer, have values that depend on the measured RTT for the connection. Before going through the source code that calculates and sets these timers we need to understand how TCP measures the RTT for a connection.

Fundamental to the operation of TCP is setting a retransmission timer when a segment is transmitted and an ACK is required from the other end. If the ACK is not received when the retransmission timer expires, the segment is retransmitted. TCP requires an ACK for data segments but does not require an ACK for a segment without data (i.e., a pure ACK segment). If the calculated retransmission timeout is too small, it can expire prematurely, causing needless retransmissions. If the calculated value is too large, after a segment is lost, additional time is lost before the segment is retransmitted, degrading performance. Complicating this is that the round-trip times between two hosts can vary widely and dynamically over the course of a connection.

TCP in Net/3 calculates the retransmission timeout (RTO) by measuring the round-trip time (nticks) of data segments and keeping track of the smoothed RTT estimator (srtt) and a smoothed mean deviation estimator (rttvar). The mean deviation is a good approximation of the standard deviation, but easier to compute since, unlike the standard deviation, the mean deviation does not require square root calculations. [Jacobson 1988b] provides additional details on these RTT measurements, which lead to the following equations:

delta = nticks − srtt

srtt ← srtt + g × delta

rttvar ← rttvar + h(|delta| − rttvar)

RTO = srtt + 4 × rttvar

delta is the difference between the measured round trip just obtained (nticks) and the current smoothed RTT estimator (srtt). g is the gain applied to the RTT estimator and equals 1/8. h is the gain applied to the mean deviation estimator and equals 1/4. The two gains and the multiplier 4 in the RTO calculation are purposely powers of 2, so they can be calculated using shift operations instead of multiplying or dividing.
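Before looking at the scaled integer version used by the kernel, it can help to see the four equations written directly. This is only an illustrative sketch in floating point, which the kernel deliberately avoids; the structure rtt_estimator and the function rtt_update are hypothetical names.

```c
#include <assert.h>

/* Unscaled estimators, kept here as doubles purely for illustration. */
struct rtt_estimator {
    double srtt;     /* smoothed RTT, in ticks */
    double rttvar;   /* smoothed mean deviation, in ticks */
};

/* Apply one measurement (nticks) and return the new RTO in ticks. */
double
rtt_update(struct rtt_estimator *e, double nticks)
{
    const double g = 1.0 / 8.0;   /* gain for srtt */
    const double h = 1.0 / 4.0;   /* gain for rttvar */
    double delta, abs_delta;

    assert(e != 0);
    delta = nticks - e->srtt;
    abs_delta = delta < 0 ? -delta : delta;

    e->srtt += g * delta;
    e->rttvar += h * (abs_delta - e->rttvar);
    return e->srtt + 4.0 * e->rttvar;
}
```

For example, starting from srtt = 4 and rttvar = 1, a measurement of 6 ticks gives delta = 2, srtt = 4.25, rttvar = 1.25, and an RTO of 9.25 ticks.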

[Jacobson 1988b] specified 2 × rttvar in the calculation of RTO, but after further research, [Jacobson 1990d] changed the value to 4 × rttvar, which is what appeared in the Net/1 implementation.

We now describe the variables and calculations used to calculate TCP’s retransmission timer, as we’ll encounter them throughout the TCP code. Figure 25.19 lists the variables in the control block related to the retransmission timer.

Table 25.19. Control block variables for calculation of retransmission timer.

tcpcb member  Units      tcp_newtcpcb initial value  #sec  Description
------------------------------------------------------------------------------------------------------------------
t_srtt        ticks × 8  0                                 smoothed RTT estimator: srtt × 8
t_rttvar      ticks × 4  24                          3     smoothed mean deviation estimator: rttvar × 4
t_rxtcur      ticks      12                          6     current retransmission timeout: RTO
t_rttmin      ticks      2                           1     minimum value for retransmission timeout
t_rxtshift    n.a.       0                                 index into tcp_backoff[] array (exponential backoff)

We show the tcp_backoff array at the end of Section 25.9. The tcp_newtcpcb function sets the initial values for these variables, and we cover it in the next section. The term shift in the variable t_rxtshift and its limit TCP_MAXRXTSHIFT is not entirely accurate. The former is not used for bit shifting, but as Figure 25.19 indicates, it is an index into an array.

The confusing part of TCP’s timeout calculations is that the two smoothed estimators maintained in the C code (t_srtt and t_rttvar) are fixed-point integers, instead of floating-point values. This is done to avoid floating-point calculations within the kernel, but it complicates the code.

To keep the scaled and unscaled variables distinct, we’ll use the italic variables srtt and rttvar to refer to the unscaled variables in the earlier equations, and t_srtt and t_rttvar to refer to the scaled variables in the TCP control block.

Figure 25.20 shows four constants we encounter, which define the scale factors of 8 for t_srtt and 4 for t_rttvar.

Table 25.20. Multipliers and shifts for RTT estimators.

Constant          Value  Description
----------------------------------------------------------
TCP_RTT_SCALE     8      multiplier: t_srtt = srtt × 8
TCP_RTT_SHIFT     3      shift: t_srtt = srtt << 3
TCP_RTTVAR_SCALE  4      multiplier: t_rttvar = rttvar × 4
TCP_RTTVAR_SHIFT  2      shift: t_rttvar = rttvar << 2

tcp_newtcpcb Function

A new TCP control block is allocated and initialized by tcp_newtcpcb, shown in Figure 25.21. This function is called by TCP’s PRU_ATTACH request when a new socket is created (Figure 30.2). The caller has previously allocated an Internet PCB for this connection, pointed to by the argument inp. We present this function now because it initializes the TCP timer variables.

Table 25.21. tcp_newtcpcb function: create and initialize a new TCP control block.

------------------------------------------------------------------------ tcp_subr.c
167 struct tcpcb *
168 tcp_newtcpcb(inp)
169 struct inpcb *inp;
170 {
171     struct tcpcb *tp;

172     tp = malloc(sizeof(*tp), M_PCB, M_NOWAIT);
173     if (tp == NULL)
174         return ((struct tcpcb *) 0);
175     bzero((char *) tp, sizeof(struct tcpcb));
176     tp->seg_next = tp->seg_prev = (struct tcpiphdr *) tp;
177     tp->t_maxseg = tcp_mssdflt;
178     tp->t_flags = tcp_do_rfc1323 ? (TF_REQ_SCALE | TF_REQ_TSTMP) : 0;
179     tp->t_inpcb = inp;
180     /*
181      * Init srtt to TCPTV_SRTTBASE (0), so we can tell that we have no
182      * rtt estimate.  Set rttvar so that srtt + 2 * rttvar gives
183      * reasonable initial retransmit time.
184      */
185     tp->t_srtt = TCPTV_SRTTBASE;
186     tp->t_rttvar = tcp_rttdflt * PR_SLOWHZ << 2;
187     tp->t_rttmin = TCPTV_MIN;
188     TCPT_RANGESET(tp->t_rxtcur,
189                   ((TCPTV_SRTTBASE >> 2) + (TCPTV_SRTTDFLT << 2)) >> 1,
190                   TCPTV_MIN, TCPTV_REXMTMAX);

191     tp->snd_cwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
192     tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;

193     inp->inp_ip.ip_ttl = ip_defttl;
194     inp->inp_ppcb = (caddr_t) tp;
195     return (tp);
196 }
------------------------------------------------------------------------ tcp_subr.c

167-175

The kernel’s malloc function allocates memory for the control block, and bzero sets it to 0.

176

The two variables seg_next and seg_prev point to the reassembly queue for out-of-order segments received for this connection. We discuss this queue in detail in Section 27.9.

177-179

The maximum segment size to send, t_maxseg, defaults to 512 (tcp_mssdflt). This value can be changed by the tcp_mss function after an MSS option is received from the other end. (TCP also sends an MSS option to the other end when a new connection is established.) The two flags TF_REQ_SCALE and TF_REQ_TSTMP are set if the system is configured to request window scaling and timestamps as defined in RFC 1323 (the global tcp_do_rfc1323 from Figure 24.3, which defaults to 1). The t_inpcb pointer in the TCP control block is set to point to the Internet PCB passed in by the caller.

180-185

The four variables t_srtt, t_rttvar, t_rttmin, and t_rxtcur, described in Figure 25.19, are initialized. First, the smoothed RTT estimator t_srtt is set to 0 (TCPTV_SRTTBASE), which is a special value that means no RTT measurements have been made yet for this connection. tcp_xmit_timer recognizes this special value when the first RTT measurement is made.

186-187

The smoothed mean deviation estimator t_rttvar is set to 24: 3 (tcp_rttdflt, from Figure 24.3) times 2 (PR_SLOWHZ) multiplied by 4 (the left shift of 2 bits). Since this scaled estimator is 4 times the variable rttvar, this value equals 6 clock ticks, or 3 seconds. The minimum RTO, stored in t_rttmin, is 2 ticks (TCPTV_MIN).

188-190

The current RTO in clock ticks is calculated and stored in t_rxtcur. It is bounded by a minimum value of 2 ticks (TCPTV_MIN) and a maximum value of 128 ticks (TCPTV_REXMTMAX). The value calculated as the second argument to TCPT_RANGESET is 12 ticks, or 6 seconds. This is the first RTO for the connection.

Understanding these C expressions involving the scaled RTT estimators can be a challenge. It helps to start with the unscaled equation and substitute the scaled variables. The unscaled equation we’re solving is

RTO = srtt + 2 × rttvar

where we use the multiplier of 2 instead of 4 to calculate the first RTO.

The use of the multiplier 2 instead of 4 appears to be a leftover from the original 4.3BSD Tahoe code [Paxson 1994].

Substituting the two scaling relationships

t_srtt = 8 × srtt

t_rttvar = 4 × rttvar

we get

RTO = t_srtt / 8 + 2 × (t_rttvar / 4) = (t_srtt / 4 + t_rttvar) / 2

which is the C code for the second argument to TCPT_RANGESET. In this code the variable t_rttvar is not used; the constant TCPTV_SRTTDFLT, whose value is 6 ticks, is used instead, and it must be multiplied by 4 to have the same scale as t_rttvar.
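The initial values can be reproduced outside the kernel. The sketch below assumes the constant values quoted in the text (TCPTV_SRTTBASE 0, TCPTV_SRTTDFLT 6, TCPTV_MIN 2, TCPTV_REXMTMAX 128, PR_SLOWHZ 2). The helper range_set is a hypothetical function that clamps a value the way the text describes TCPT_RANGESET doing; initial_rttvar and initial_rxtcur are also our names.

```c
#include <assert.h>

/* Constants as given in the text (ticks unless noted). */
enum {
    PR_SLOWHZ      = 2,    /* slow-timer ticks per second */
    TCPTV_SRTTBASE = 0,    /* "no RTT estimate yet" */
    TCPTV_SRTTDFLT = 6,    /* 3 seconds */
    TCPTV_MIN      = 2,    /* 1 second */
    TCPTV_REXMTMAX = 128   /* 64 seconds */
};

/* Hypothetical helper: clamp value to the range [tvmin, tvmax]. */
int
range_set(int value, int tvmin, int tvmax)
{
    assert(tvmin <= tvmax);
    if (value < tvmin)
        return tvmin;
    if (value > tvmax)
        return tvmax;
    return value;
}

/* Initial t_rttvar: seconds -> ticks -> scaled by 4. */
int
initial_rttvar(int tcp_rttdflt)
{
    return (tcp_rttdflt * PR_SLOWHZ) << 2;
}

/*
 * Initial RTO: (t_srtt / 4 + t_rttvar) / 2, with TCPTV_SRTTDFLT << 2
 * standing in for t_rttvar.
 */
int
initial_rxtcur(void)
{
    int rto = ((TCPTV_SRTTBASE >> 2) + (TCPTV_SRTTDFLT << 2)) >> 1;

    return range_set(rto, TCPTV_MIN, TCPTV_REXMTMAX);
}
```

With tcp_rttdflt equal to 3, initial_rttvar returns 24 and initial_rxtcur returns 12, the values listed in Figure 25.19.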

191-192

The congestion window (snd_cwnd) and slow start threshold (snd_ssthresh) are set to 1,073,725,440 (approximately one gigabyte), which is the largest possible TCP window if the window scale option is in effect. (Slow start and congestion avoidance are described in Section 21.6 of Volume 1.) It is calculated as the maximum value for the window size field in the TCP header (65535, TCP_MAXWIN) times 2^14, where 14 is the maximum value for the window scale factor (TCP_MAX_WINSHIFT). We’ll see that when a SYN is sent or received on the connection, tcp_mss resets snd_cwnd to a single segment.
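The arithmetic is easy to confirm. This short sketch uses the two constant values quoted above; the function name max_scaled_window is ours.

```c
#include <assert.h>

#define TCP_MAXWIN        65535   /* largest value of the 16-bit window field */
#define TCP_MAX_WINSHIFT  14      /* largest window scale shift count */

/* Largest window representable with window scaling, in bytes. */
unsigned long
max_scaled_window(void)
{
    unsigned long win = (unsigned long)TCP_MAXWIN << TCP_MAX_WINSHIFT;

    assert(win <= 0xffffffffUL);   /* still fits in 32 bits */
    return win;
}
```

The result is 65535 × 16384 = 1,073,725,440.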

193-194

The default IP TTL in the Internet PCB is set to 64 (ip_defttl) and the PCB is set to point to the new TCP control block.

Not shown in this code is that numerous variables, such as the shift variable t_rxtshift, are implicitly initialized to 0 since the control block is initialized by bzero.

tcp_setpersist Function

The next function we look at that uses TCP’s retransmission timeout calculations is tcp_setpersist. In Figure 25.13 we saw this function called when the persist timer expired. This timer is set when TCP has data to send on a connection, but the other end is advertising a window of 0. This function, shown in Figure 25.22, calculates and stores the next value for the timer.

Table 25.22. tcp_setpersist function: calculate and store a new value for the persist timer.

----------------------------------------------------------------------- tcp_output.c
493 void
494 tcp_setpersist(tp)
495 struct tcpcb *tp;
496 {
497     register t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;

498     if (tp->t_timer[TCPT_REXMT])
499         panic("tcp_output REXMT");
500     /*
501      * Start/restart persistance timer.
502      */
503     TCPT_RANGESET(tp->t_timer[TCPT_PERSIST],
504                   t * tcp_backoff[tp->t_rxtshift],
505                   TCPTV_PERSMIN, TCPTV_PERSMAX);
506     if (tp->t_rxtshift < TCP_MAXRXTSHIFT)
507         tp->t_rxtshift++;
508 }
----------------------------------------------------------------------- tcp_output.c

Check retransmission timer not enabled

493-499

A check is made that the retransmission timer is not enabled when the persist timer is about to be set, since the two timers are mutually exclusive: if data is being sent, the other side must be advertising a nonzero window, but the persist timer is being set only if the advertised window is 0.

Calculate RTO

500-505

The variable t is set to the RTO value that was calculated at the beginning of the function. The equation being solved is

RTO = srtt + 2 × rttvar

which is identical to the formula used at the end of the previous section. With substitution we get

RTO = t_srtt / 8 + 2 × (t_rttvar / 4) = (t_srtt / 4 + t_rttvar) / 2

which is the value computed for the variable t.

Apply exponential backoff

506-507

An exponential backoff is also applied to the RTO. This is done by multiplying the RTO by a value from the tcp_backoff array:

int tcp_backoff[TCP_MAXRXTSHIFT + 1] =
    { 1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };

When tcp_output initially sets the persist timer for a connection, the code is

tp->t_rxtshift = 0;
tcp_setpersist(tp);

so the first time tcp_setpersist is called, t_rxtshift is 0. Since the value of tcp_backoff[0] is 1, t is used as the persist timeout. The TCPT_RANGESET macro bounds this value between 5 and 60 seconds. t_rxtshift is incremented by 1 until it reaches a maximum of 12 (TCP_MAXRXTSHIFT), since tcp_backoff[12] is the final entry in the array.
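To see the sequence of persist timer values this produces, the function can be exercised on its own. The following is a sketch under assumptions: persist_state is a hypothetical, cut-down stand-in for the control block, the bounds of 5 and 60 seconds are expressed as 10 and 120 ticks, and the starting estimator values are chosen only for illustration.

```c
#include <assert.h>

enum {
    TCP_MAXRXTSHIFT = 12,
    TCPTV_PERSMIN   = 10,    /* 5 seconds, in 500-ms ticks */
    TCPTV_PERSMAX   = 120    /* 60 seconds */
};

static const int tcp_backoff[TCP_MAXRXTSHIFT + 1] =
    { 1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };

/* Hypothetical stand-in for the fields of struct tcpcb used here. */
struct persist_state {
    int t_srtt;       /* srtt x 8 */
    int t_rttvar;     /* rttvar x 4 */
    int t_rxtshift;   /* index into tcp_backoff[] */
    int persist;      /* persist timer, in ticks */
};

void
set_persist(struct persist_state *p)
{
    int t = ((p->t_srtt >> 2) + p->t_rttvar) >> 1;   /* srtt + 2 x rttvar */
    int value;

    assert(p->t_rxtshift >= 0 && p->t_rxtshift <= TCP_MAXRXTSHIFT);
    value = t * tcp_backoff[p->t_rxtshift];
    if (value < TCPTV_PERSMIN)
        value = TCPTV_PERSMIN;
    else if (value > TCPTV_PERSMAX)
        value = TCPTV_PERSMAX;
    p->persist = value;
    if (p->t_rxtshift < TCP_MAXRXTSHIFT)
        p->t_rxtshift++;
}
```

With t_srtt equal to 16 and t_rttvar equal to 4, t is 4 ticks, and successive calls set the timer to 10, 10, 16, 32, 64, and 120 ticks: the lower bound applies to the first two values and the upper bound to the sixth and all later ones.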

tcp_xmit_timer Function

The next function we look at, tcp_xmit_timer, is called each time an RTT measurement is collected, to update the smoothed RTT estimator (srtt) and the smoothed mean deviation estimator (rttvar).

The argument rtt is the RTT measurement to be applied. It is the value nticks + 1, using the notation from Section 25.7. It can be from one of two sources:

  1. If the timestamp option is present in a received segment, the measured RTT is the current time (tcp_now) minus the timestamp value. We’ll examine the timestamp option in Section 26.6, but for now all we need to know is that tcp_now is incremented every 500 ms (Figure 25.8). When a data segment is sent, tcp_now is sent as the timestamp, and the other end echoes this timestamp in the acknowledgment it sends back.

  2. If timestamps are not in use and a data segment is being timed, we saw in Figure 25.8 that the counter t_rtt is incremented every 500 ms for the connection. We also mentioned in Section 25.5 that this counter is initialized to 1, so when the acknowledgment is received the counter is the measured RTT (in ticks) plus 1.

Typical code in tcp_input that calls tcp_xmit_timer is

if (ts_present)
    tcp_xmit_timer(tp, tcp_now - ts_ecr + 1);

else if (tp->t_rtt && SEQ_GT(ti->ti_ack, tp->t_rtseq))
    tcp_xmit_timer(tp, tp->t_rtt);

If a timestamp was present in the segment (ts_present), the RTT estimators are updated using the current time (tcp_now) minus the echoed timestamp (ts_ecr) plus 1. (We describe the reason for adding 1 below.)

If a timestamp is not present, the RTT estimators are updated only if the received segment acknowledges a data segment that was being timed. There is only one RTT counter per TCP control block (t_rtt), so only one outstanding data segment can be timed per connection. The starting sequence number of that segment is stored in t_rtseq when the segment is transmitted, to tell when an acknowledgment is received that covers that sequence number. If the received acknowledgment number (ti_ack) is greater than the starting sequence number of the segment being timed (t_rtseq), the RTT estimators are updated using t_rtt as the measured RTT.

Before RFC 1323 timestamps were supported, TCP measured the RTT only by counting clock ticks in t_rtt. But this variable is also used as a flag that specifies whether a segment is being timed (Figure 25.8): if t_rtt is greater than 0, then tcp_slowtimo adds 1 to it every 500 ms. Hence when t_rtt is nonzero, it is the number of ticks plus 1. We’ll see shortly that tcp_xmit_timer always decrements its second argument by 1 to account for this offset. Therefore when timestamps are being used, 1 is added to the second argument to account for the decrement by 1 in tcp_xmit_timer.

The greater-than test of the sequence numbers is because ACKs are cumulative: if TCP sends and times a segment with sequence numbers 1-1024 (t_rtseq equals 1), then immediately sends (but can’t time) a segment with sequence numbers 1025-2048, and then receives an ACK with ti_ack equal to 2049, this is an ACK for sequence numbers 1-2048 and the ACK acknowledges the first segment being timed as well as the second (untimed) segment. Notice that when RFC 1323 timestamps are in use there is no comparison of sequence numbers. If the other end sends a timestamp option, it chooses the echo reply value (ts_ecr) to allow TCP to calculate the RTT.
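The choice between the two RTT sources, including the cumulative-ACK test, can be sketched as a single function. The names seq_gt and rtt_sample are hypothetical; the comparison is written as a signed 32-bit difference so that it still works when sequence numbers wrap, which we assume is the intent of SEQ_GT.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe "a is later than b" for 32-bit sequence numbers. */
int
seq_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/*
 * Return the value to pass as the rtt argument, or 0 if this ACK yields
 * no measurement.  Parameter names follow the text.
 */
int
rtt_sample(int ts_present, uint32_t tcp_now, uint32_t ts_ecr,
           int t_rtt, uint32_t ti_ack, uint32_t t_rtseq)
{
    assert(t_rtt >= 0);
    if (ts_present)
        return (int)(tcp_now - ts_ecr + 1);   /* +1 cancels the later -1 */
    if (t_rtt && seq_gt(ti_ack, t_rtseq))
        return t_rtt;                         /* already ticks + 1 */
    return 0;
}
```

Using the example from the text, an ACK of 2049 with t_rtseq equal to 1 passes the test, while an ACK of 1 does not.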

Figure 25.23 shows the first part of the function that updates the estimators.

Table 25.23. tcp_xmit_timer function: apply new RTT measurement to smoothed estimators.

----------------------------------------------------------------------- tcp_input.c
1310 void
1311 tcp_xmit_timer(tp, rtt)
1312 struct tcpcb *tp;
1313 short   rtt;
1314 {
1315     short   delta;

1316     tcpstat.tcps_rttupdated++;
1317     if (tp->t_srtt != 0) {
1318         /*
1319          * srtt is stored as fixed point with 3 bits after the
1320          * binary point (i.e., scaled by 8).  The following magic
1321          * is equivalent to the smoothing algorithm in rfc793 with
1322          * an alpha of .875 (srtt = rtt/8 + srtt*7/8 in fixed
1323          * point).  Adjust rtt to origin 0.
1324          */
1325         delta = rtt - 1 - (tp->t_srtt >> TCP_RTT_SHIFT);
1326         if ((tp->t_srtt += delta) <= 0)
1327             tp->t_srtt = 1;
1328         /*
1329          * We accumulate a smoothed rtt variance (actually, a
1330          * smoothed mean difference), then set the retransmit
1331          * timer to smoothed rtt + 4 times the smoothed variance.
1332          * rttvar is stored as fixed point with 2 bits after the
1333          * binary point (scaled by 4).  The following is
1334          * equivalent to rfc793 smoothing with an alpha of .75
1335          * (rttvar = rttvar*3/4 + |delta| / 4).  This replaces
1336          * rfc793's wired-in beta.
1337          */
1338         if (delta < 0)
1339             delta = -delta;
1340         delta -= (tp->t_rttvar >> TCP_RTTVAR_SHIFT);
1341         if ((tp->t_rttvar += delta) <= 0)
1342             tp->t_rttvar = 1;
1343     } else {
1344         /*
1345          * No rtt measurement yet - use the unsmoothed rtt.
1346          * Set the variance to half the rtt (so our first
1347          * retransmit happens at 3*rtt).
1348          */
1349         tp->t_srtt = rtt << TCP_RTT_SHIFT;
1350         tp->t_rttvar = rtt << (TCP_RTTVAR_SHIFT - 1);
1351     }
----------------------------------------------------------------------- tcp_input.c

Update smoothed estimators

1310-1325

Recall that tcp_newtcpcb initialized the smoothed RTT estimator (t_srtt) to 0, indicating that no measurements have been made for this connection. delta is the difference between the measured RTT and the current value of the smoothed RTT estimator, in unscaled ticks. t_srtt is divided by 8 to convert from scaled to unscaled ticks.

1326-1327

The smoothed RTT estimator is updated using the equation

srtt ← srtt + g × delta

Since the gain g is 1/8, this equation is

8 × srtt ← 8 × srtt + delta

which is

t_srtt ← t_srtt + delta

1328-1342

The mean deviation estimator is updated using the equation

rttvar ← rttvar + h(|delta| − rttvar)

Substituting 1/4 for h and the scaled variable t_rttvar for 4 × rttvar, we get

t_rttvar / 4 ← t_rttvar / 4 + (|delta| − t_rttvar / 4) / 4

which is

t_rttvar ← t_rttvar + |delta| − t_rttvar / 4

This final equation corresponds to the C code.

Initialize smoothed estimators on first RTT measurement

1343-1350

If this is the first RTT measured for this connection, the smoothed RTT estimator is initialized to the measured RTT. These calculations use the value of the argument rtt, which we said is the measured RTT plus 1 (nticks + 1), whereas the earlier calculation of delta subtracted 1 from rtt.

srtt = nticks + 1

or

t_srtt / 8 = nticks + 1

which is

t_srtt = (nticks + 1) × 8

The smoothed mean deviation is set to one-half of the measured RTT:

rttvar = srtt / 2

which is

t_rttvar / 4 = (nticks + 1) / 2

or

t_rttvar = (nticks + 1) × 2

The comment in the code states that this initial setting for the smoothed mean deviation yields an initial RTO of 3 × srtt. Since the RTO is calculated as

RTO = srtt + 4 × rttvar

substituting for rttvar gives us

RTO = srtt + 4 × (srtt / 2)

which is indeed

RTO = 3 × srtt

Figure 25.24 shows the final part of the tcp_xmit_timer function.

Table 25.24. tcp_xmit_timer function: final part.

------------------------------------------------------------------------ tcp_input.c
1352     tp->t_rtt = 0;
1353     tp->t_rxtshift = 0;

1354     /*
1355      * the retransmit should happen at rtt + 4 * rttvar.
1356      * Because of the way we do the smoothing, srtt and rttvar
1357      * will each average +1/2 tick of bias.  When we compute
1358      * the retransmit timer, we want 1/2 tick of rounding and
1359      * 1 extra tick because of +-1/2 tick uncertainty in the
1360      * firing of the timer.  The bias will give us exactly the
1361      * 1.5 tick we need.  But, because the bias is
1362      * statistical, we have to test that we don't drop below
1363      * the minimum feasible timer (which is 2 ticks).
1364      */
1365     TCPT_RANGESET(tp->t_rxtcur, TCP_REXMTVAL(tp),
1366                   tp->t_rttmin, TCPTV_REXMTMAX);

1367     /*
1368      * We received an ack for a packet that wasn't retransmitted;
1369      * it is probably safe to discard any error indications we've
1370      * received recently.  This isn't quite right, but close enough
1371      * for now (a route might have failed after we sent a segment,
1372      * and the return path might not be symmetrical).
1373      */
1374     tp->t_softerror = 0;
1375 }
------------------------------------------------------------------------ tcp_input.c

1352-1353

The RTT counter (t_rtt) and the retransmission shift count (t_rxtshift) are both reset to 0 in preparation for timing and transmission of the next segment.

1354-1366

The next RTO to use for the connection (t_rxtcur) is calculated using the macro

#define TCP_REXMTVAL(tp) \
        (((tp)->t_srtt >> TCP_RTT_SHIFT) + (tp)->t_rttvar)

This is the now-familiar equation

RTO = srtt + 4 × rttvar

using the scaled variables updated by tcp_xmit_timer. Substituting these scaled variables for srtt and rttvar, we have

RTO = t_srtt / 8 + 4 × (t_rttvar / 4) = t_srtt / 8 + t_rttvar

which corresponds to the macro. The calculated value for the RTO is bounded by the minimum RTO for this connection (t_rttmin, which tcp_newtcpcb set to 2 ticks), and 128 ticks (TCPTV_REXMTMAX).
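The two parts of the function can be combined into a stand-alone sketch and checked against the first rows of Figure 25.28. This is a simplified rendering and not the kernel function: rtt_state is a hypothetical subset of the control block, the statistics counter and the resetting of t_rtt, t_rxtshift, and t_softerror are omitted, and the range check is written out inline.

```c
#include <assert.h>

enum {
    TCP_RTT_SHIFT    = 3,
    TCP_RTTVAR_SHIFT = 2,
    TCPTV_REXMTMAX   = 128    /* 64 seconds, in ticks */
};

/* Hypothetical subset of struct tcpcb holding only the estimator fields. */
struct rtt_state {
    int t_srtt;     /* srtt x 8 */
    int t_rttvar;   /* rttvar x 4 */
    int t_rttmin;   /* minimum RTO, ticks */
    int t_rxtcur;   /* current RTO, ticks */
};

void
xmit_timer(struct rtt_state *s, int rtt)
{
    int delta, rto;

    assert(rtt >= 1);                 /* rtt is always nticks + 1 */
    if (s->t_srtt != 0) {
        delta = rtt - 1 - (s->t_srtt >> TCP_RTT_SHIFT);
        if ((s->t_srtt += delta) <= 0)
            s->t_srtt = 1;
        if (delta < 0)
            delta = -delta;
        delta -= s->t_rttvar >> TCP_RTTVAR_SHIFT;
        if ((s->t_rttvar += delta) <= 0)
            s->t_rttvar = 1;
    } else {
        /* First measurement: srtt = rtt, rttvar = rtt / 2. */
        s->t_srtt = rtt << TCP_RTT_SHIFT;
        s->t_rttvar = rtt << (TCP_RTTVAR_SHIFT - 1);
    }

    /* RTO = srtt + 4 x rttvar = t_srtt / 8 + t_rttvar, then clamp. */
    rto = (s->t_srtt >> TCP_RTT_SHIFT) + s->t_rttvar;
    if (rto < s->t_rttmin)
        rto = s->t_rttmin;
    else if (rto > TCPTV_REXMTMAX)
        rto = TCPTV_REXMTMAX;
    s->t_rxtcur = rto;
}
```

Starting from the tcp_newtcpcb values (t_srtt 0, t_rttvar 24, t_rttmin 2) and applying rtt arguments of 2, 2, 3, and 3 yields (t_srtt, t_rttvar, t_rxtcur) of (16, 4, 6), (15, 4, 5), (16, 4, 6), and (16, 3, 5), which agree with the first four measurements in Figure 25.28.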

Clear soft error variable

1367-1374

Since tcp_xmit_timer is called only when an acknowledgment is received for a data segment that was sent, if a soft error was recorded for this connection (t_softerror), that error is discarded. We describe soft errors in more detail in the next section.

Retransmission Timeout: tcp_timers Function

We now return to the tcp_timers function and cover the final case that we didn’t present in Section 25.6: the one that handles the expiration of the retransmission timer. This code is executed when a data segment that was transmitted has not been acknowledged by the other end within the RTO.

Figure 25.25 summarizes the actions caused by the retransmission timer. We assume that the first timeout calculated by tcp_output is 1.5 seconds, which is typical for a LAN (see Figure 21.1 of Volume 1).

Figure 25.25. Summary of retransmission timer when sending data.

The x-axis is labeled with the time in seconds: 0, 1.5, 4.5, and so on. Below each of these numbers we show the value of t_rxtshift that is used in the code we’re about to examine. Only after 12 retransmissions and a total of 542.5 seconds (just over 9 minutes) does TCP give up and drop the connection.

RFC 793 recommended that an open of a new connection, active or passive, allow a parameter specifying the total timeout period for data sent by TCP. This is the total amount of time TCP will try to send a given segment before giving up and terminating the connection. The recommended default was 5 minutes.

RFC 1122 requires that an application must be able to specify a parameter for a connection giving either the total number of retransmissions or the total timeout value for data sent by TCP. This parameter can be specified as “infinity,” meaning TCP never gives up, allowing, perhaps, an interactive user the choice of when to give up.

We’ll see in the code described shortly that Net/3 does not give the application any of this control: a fixed number of retransmissions (12) always occurs before TCP gives up, and the total timeout before giving up depends on the RTT.

The first half of the retransmission timeout case is shown in Figure 25.26.

Table 25.26. tcp_timers function: expiration of retransmission timer, first half.

----------------------------------------------------------------------- tcp_timer.c
140         /*
141          * Retransmission timer went off.  Message has not
142          * been acked within retransmit interval.  Back off
143          * to a longer retransmit interval and retransmit one segment.
144          */
145     case TCPT_REXMT:
146         if (++tp->t_rxtshift > TCP_MAXRXTSHIFT) {
147             tp->t_rxtshift = TCP_MAXRXTSHIFT;
148             tcpstat.tcps_timeoutdrop++;
149             tp = tcp_drop(tp, tp->t_softerror ?
150                           tp->t_softerror : ETIMEDOUT);
151             break;
152         }
153         tcpstat.tcps_rexmttimeo++;
154         rexmt = TCP_REXMTVAL(tp) * tcp_backoff[tp->t_rxtshift];
155         TCPT_RANGESET(tp->t_rxtcur, rexmt,
156                       tp->t_rttmin, TCPTV_REXMTMAX);
157         tp->t_timer[TCPT_REXMT] = tp->t_rxtcur;
158         /*
159          * If losing, let the lower level know and try for
160          * a better route.  Also, if we backed off this far,
161          * our srtt estimate is probably bogus.  Clobber it
162          * so we'll take the next rtt measurement as our srtt;
163          * move the current srtt into rttvar to keep the current
164          * retransmit times until then.
165          */
166         if (tp->t_rxtshift > TCP_MAXRXTSHIFT / 4) {
167             in_losing(tp->t_inpcb);
168             tp->t_rttvar += (tp->t_srtt >> TCP_RTT_SHIFT);
169             tp->t_srtt = 0;
170         }
171         tp->snd_nxt = tp->snd_una;
172         /*
173          * If timing a segment in this window, stop the timer.
174          */
175         tp->t_rtt = 0;
----------------------------------------------------------------------- tcp_timer.c

Increment shift count

146

The retransmission shift count (t_rxtshift) is incremented, and if the value exceeds 12 (TCP_MAXRXTSHIFT) it is time to drop the connection. This new value of t_rxtshift is what we show in Figure 25.25. Notice the difference between this dropping of a connection because an acknowledgment is not received from the other end in response to data sent by TCP, and the keepalive timer, which drops a connection after a long period of inactivity and no response from the other end. Both report the error ETIMEDOUT to the process, unless a soft error is received for the connection.

Drop connection

147-152

A soft error is one that doesn’t cause TCP to terminate an established connection or an attempt to establish a connection, but the soft error is recorded in case TCP gives up later. For example, if TCP retransmits a SYN segment to establish a connection, receiving nothing in response, the error returned to the process will be ETIMEDOUT. But if during the retransmissions an ICMP host unreachable is received for the connection, that is considered a soft error and stored in t_softerror by tcp_notify. If TCP finally gives up the retransmissions, the error returned to the process will be EHOSTUNREACH instead of ETIMEDOUT, providing more information to the process. If TCP receives an RST on the connection in response to the SYN, that’s considered a hard error and the connection is terminated immediately with an error of ECONNREFUSED (Figure 28.18).
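The choice of error code made when TCP gives up reduces to one conditional expression. The sketch below isolates it; the function name drop_error is ours, and the error constants come from the standard errno.h header.

```c
#include <assert.h>
#include <errno.h>

/* Error reported to the process when retransmissions are exhausted. */
int
drop_error(int t_softerror)
{
    assert(t_softerror >= 0);
    return t_softerror ? t_softerror : ETIMEDOUT;
}
```

If no soft error was recorded the result is ETIMEDOUT; if tcp_notify had stored EHOSTUNREACH, that more specific error is returned instead.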

Calculate new RTO

153-157

The next RTO is calculated using the TCP_REXMTVAL macro, applying an exponential backoff. In this code, t_rxtshift will be 1 the first time a given segment is retransmitted, so the RTO will be twice the value calculated by TCP_REXMTVAL. This value is stored in t_rxtcur and as the retransmission timer for the connection, t_timer[TCPT_REXMT]. The value stored in t_rxtcur is used in tcp_input when the retransmission timer is restarted (Figures 28.12 and 29.6).
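The schedule in Figure 25.25 can be reproduced by iterating this calculation. The following sketch assumes that TCP_REXMTVAL keeps returning the same value for every retransmission (3 ticks, or 1.5 seconds, in the figure) and that no ACK ever arrives. The function name ticks_until_drop is hypothetical.

```c
#include <assert.h>

enum {
    TCP_MAXRXTSHIFT = 12,
    TCPTV_MIN       = 2,     /* 1 second, in 500-ms ticks */
    TCPTV_REXMTMAX  = 128    /* 64 seconds */
};

static const int tcp_backoff[TCP_MAXRXTSHIFT + 1] =
    { 1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };

/*
 * Given the un-backed-off RTO in ticks (what TCP_REXMTVAL would return),
 * return the total ticks that elapse before the connection is dropped.
 * The number of retransmissions is stored in *nrexmt.
 */
int
ticks_until_drop(int base_rto, int *nrexmt)
{
    int t_rxtshift = 0;
    int t_rxtcur = base_rto;    /* first timeout, no backoff yet */
    int elapsed = 0;
    int rexmt;

    assert(base_rto > 0 && nrexmt != 0);
    *nrexmt = 0;
    for (;;) {
        elapsed += t_rxtcur;                    /* timer expires */
        if (++t_rxtshift > TCP_MAXRXTSHIFT)
            return elapsed;                     /* give up */
        rexmt = base_rto * tcp_backoff[t_rxtshift];
        if (rexmt < TCPTV_MIN)
            rexmt = TCPTV_MIN;
        else if (rexmt > TCPTV_REXMTMAX)
            rexmt = TCPTV_REXMTMAX;
        t_rxtcur = rexmt;
        (*nrexmt)++;                            /* segment is retransmitted */
    }
}
```

For a base RTO of 3 ticks the function reports 12 retransmissions and 1085 ticks, which is the 542.5 seconds shown in Figure 25.25.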

Ask IP to find a new route

158-167

If this segment has been retransmitted four or more times, in_losing releases the cached route (if there is one), so when the segment is retransmitted by tcp_output (at the end of this case statement in Figure 25.27) a new, and hopefully better, route will be chosen. In Figure 25.25 in_losing is called each time the retransmission timer expires, starting with the retransmission at time 22.5.

Table 25.27. tcp_timers function: expiration of retransmission timer, second half.

--------------------------------------------------------------------- tcp_timer.c
176         /*
177          * Close the congestion window down to one segment
178          * (we'll open it by one segment for each ack we get).
179          * Since we probably have a window's worth of unacked
180          * data accumulated, this "slow start" keeps us from
181          * dumping all that data as back-to-back packets (which
182          * might overwhelm an intermediate gateway).
183          *
184          * There are two phases to the opening: Initially we
185          * open by one mss on each ack.  This makes the window
186          * size increase exponentially with time.  If the
187          * window is larger than the path can handle, this
188          * exponential growth results in dropped packet(s)
189          * almost immediately.  To get more time between
190          * drops but still "push" the network to take advantage
191          * of improving conditions, we switch from exponential
192          * to linear window opening at some threshhold size.
193          * For a threshhold, we use half the current window
194          * size, truncated to a multiple of the mss.
195          *
196          * (the minimum cwnd that will give us exponential
197          * growth is 2 mss.  We don't allow the threshhold
198          * to go below this.)
199          */
200         {
201             u_int   win = min(tp->snd_wnd, tp->snd_cwnd) / 2 / tp->t_maxseg;
202             if (win < 2)
203                 win = 2;
204             tp->snd_cwnd = tp->t_maxseg;
205             tp->snd_ssthresh = win * tp->t_maxseg;
206             tp->t_dupacks = 0;
207         }
208         (void) tcp_output(tp);
209         break;
--------------------------------------------------------------------- tcp_timer.c
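The window calculation in the braces of this figure can be tried with concrete numbers. This is a sketch only: cc_state is a hypothetical subset of the control block, and the function name on_rexmt_timeout is ours.

```c
#include <assert.h>

/* Hypothetical subset of struct tcpcb: the congestion-control fields. */
struct cc_state {
    unsigned long snd_wnd;       /* window advertised by the peer, bytes */
    unsigned long snd_cwnd;      /* congestion window, bytes */
    unsigned long snd_ssthresh;  /* slow start threshold, bytes */
    unsigned int  t_maxseg;      /* maximum segment size, bytes */
    int           t_dupacks;     /* consecutive duplicate ACKs */
};

void
on_rexmt_timeout(struct cc_state *c)
{
    unsigned long cur, win;

    assert(c->t_maxseg > 0);
    cur = c->snd_wnd < c->snd_cwnd ? c->snd_wnd : c->snd_cwnd;
    win = cur / 2 / c->t_maxseg;    /* half the window, in whole segments */
    if (win < 2)
        win = 2;
    c->snd_cwnd = c->t_maxseg;      /* restart slow start at one segment */
    c->snd_ssthresh = win * c->t_maxseg;
    c->t_dupacks = 0;
}
```

With an advertised window of 8192, a congestion window of 4096, and an MSS of 512, the threshold becomes 2048 bytes (four segments) and the congestion window becomes 512. When the current window is smaller than four segments, the threshold is held at two segments.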

Clear estimators

168-170

The smoothed RTT estimator (t_srtt) is set to 0, which is what tcp_newtcpcb did. This forces tcp_xmit_timer to use the next measured RTT as the smoothed RTT estimator. This is done because the retransmitted segment has been sent four or more times, implying that TCP’s smoothed RTT estimator is probably way off. But if the retransmission timer expires again, at the beginning of this case statement the RTO is calculated by TCP_REXMTVAL. That calculation should generate the same value as it did for this retransmission (which will then be exponentially backed off), even though t_srtt is set to 0. (The retransmission at time 42.464 in Figure 25.28 is an example of what’s happening here.)
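A short check shows that lines 168-169 leave the value of TCP_REXMTVAL unchanged. The sketch uses a hypothetical two-field structure, rtt_pair, in place of the control block; rexmtval evaluates the same expression as the macro.

```c
#include <assert.h>

#define TCP_RTT_SHIFT 3

struct rtt_pair {
    int t_srtt;     /* srtt x 8 */
    int t_rttvar;   /* rttvar x 4 */
};

/* Same expression as the TCP_REXMTVAL macro. */
int
rexmtval(const struct rtt_pair *p)
{
    return (p->t_srtt >> TCP_RTT_SHIFT) + p->t_rttvar;
}

/* Discard srtt but fold it into rttvar so rexmtval() is unchanged. */
void
clobber_srtt(struct rtt_pair *p)
{
    int before = rexmtval(p);

    p->t_rttvar += p->t_srtt >> TCP_RTT_SHIFT;
    p->t_srtt = 0;
    assert(rexmtval(p) == before);
}
```

With t_srtt equal to 16 and t_rttvar equal to 3, the values in effect before the retransmission at time 42.464, the result is t_srtt 0 and t_rttvar 5, and the macro yields 5 ticks both before and after.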

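Table 25.28. Values of RTT variables and estimators during example.

| xmit time | send | recv | RTT timer | actual delta (ms) | rtt arg. | t_srtt (ticks × 8) | t_rttvar (ticks × 4) | t_rxtcur (ticks) | t_rxtshift |
|---|---|---|---|---|---|---|---|---|---|
| 0.0 | SYN | | on | | | 0 | 24 | 12 | |
| 0.365 | | SYN,ACK | off | 365 | 2 | 16 | 4 | 6 | |
| 0.365 | ACK | | | | | | | | |
| 0.415 | 1:513 | | on | | | | | | |
| 1.259 | | ack 513 | off | 844 | 2 | 15 | 4 | 5 | |
| 1.260 | 513:1025 | | on | | | | | | |
| 1.261 | 1025:1537 | | | | | | | | |
| 2.206 | | ack 1537 | off | 946 | 3 | 16 | 4 | 6 | |
| 2.206 | 1537:2049 | | on | | | | | | |
| 2.207 | 2049:2561 | | | | | | | | |
| 2.209 | 2561:3073 | | | | | | | | |
| 3.132 | | ack 2049 | off | 926 | 3 | 16 | 3 | 5 | |
| 3.132 | 3073:3585 | | on | | | | | | |
| 3.133 | 3585:4097 | | | | | | | | |
| 3.736 | | ack 2561 | | | | | | | |
| 3.736 | 4097:4609 | | | | | | | | |
| 3.737 | 4609:5121 | | | | | | | | |
| 3.739 | | ack 3073 | | | | | | | |
| 3.739 | 5121:5633 | | | | | | | | |
| 3.740 | 5633:6145 | | | | | | | | |
| 6.064 | 3073:3585 | | off | | | 16 | 3 | 10 | 1 |
| 11.264 | 3073:3585 | | off | | | 16 | 3 | 20 | 2 |
| 21.664 | 3073:3585 | | off | | | 16 | 3 | 40 | 3 |
| 42.464 | 3073:3585 | | off | | | 0 | 5 | 80 | 4 |
| 84.064 | 3073:3585 | | off | | | 0 | 5 | 128 | 5 |
| 150.624 | 3073:3585 | | off | | | 0 | 5 | 128 | 6 |
| 217.184 | 3073:3585 | | off | | | 0 | 5 | 128 | 7 |
| 217.944 | | ack 6145 | | | | | | | |
| 217.944 | 6145:6657 | | on | | | | | | |
| 217.945 | 6657:7169 | | | | | | | | |
| 218.834 | | ack 6657 | off | 890 | 3 | 24 | 6 | 9 | |
| 218.834 | 7169:7681 | | on | | | | | | |
| 218.836 | 7681:8193 | | | | | | | | |
| 219.209 | | ack 7169 | | | | | | | |
| 219.209 | 8193:8705 | | | | | | | | |
| 219.760 | | ack 7681 | off | 926 | 2 | 22 | 7 | 9 | |
| 219.760 | 8705:9217 | | on | | | | | | |
| 220.103 | | ack 8705 | | | | | | | |
| 220.103 | 9217:9729 | | | | | | | | |
| 220.105 | 9729:10241 | | | | | | | | |
| 220.106 | 10241:10753 | | | | | | | | |
| 220.821 | | ack 9217 | off | 1061 | 3 | 22 | 6 | 8 | |
| 220.821 | 10753:11265 | | on | | | | | | |
| 221.310 | | ack 9729 | | | | | | | |
| 221.310 | 11265:11777 | | | | | | | | |
| 221.312 | | ack 10241 | | | | | | | |
| 221.312 | 11777:12289 | | | | | | | | |
| 221.674 | | ack 10753 | | | | | | | |
| 221.955 | | ack 11265 | off | 1134 | 3 | 22 | 5 | 7 | |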

To accomplish this, the value of t_rttvar is changed as follows. The next time the RTO is calculated, the equation

$$\mathrm{RTO} = \frac{\mathtt{t\_srtt}}{8} + \mathtt{t\_rttvar}$$

is evaluated. Since t_srtt will be 0, if t_rttvar is first increased by the old value of t_srtt divided by 8, RTO will have the same value. If the retransmission timer expires again for this segment (e.g., times 84.064 through 217.184 in Figure 25.28), when this code is executed again t_srtt will already be 0, so t_rttvar won’t change.
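
The effect of clearing the estimator can be checked with a few lines of C. This is only a sketch: the structure and the function names (rtt_state, rexmt_base, clear_srtt) are hypothetical stand-ins rather than kernel code, and the only assumption is the scaling described in the text, with t_srtt kept as ticks × 8 and t_rttvar as ticks × 4.

```c
/* Hypothetical stand-in for the two RTT fields of the TCP control block. */
struct rtt_state {
    int t_srtt;     /* smoothed RTT, in ticks x 8 */
    int t_rttvar;   /* smoothed mean deviation, in ticks x 4 */
};

/* RTO in ticks before any backoff is applied: t_srtt / 8 + t_rttvar. */
static int rexmt_base(const struct rtt_state *s)
{
    return (s->t_srtt >> 3) + s->t_rttvar;
}

/* Fold t_srtt / 8 into t_rttvar, then clear t_srtt. */
static void clear_srtt(struct rtt_state *s)
{
    s->t_rttvar += s->t_srtt >> 3;
    s->t_srtt = 0;
}
```

Starting from t_srtt = 16 and t_rttvar = 3 (the values in effect just before time 42.464), rexmt_base returns 5 ticks both before and after clear_srtt is called, and a second call to clear_srtt leaves t_rttvar at 5.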

Force retransmission of oldest unacknowledged data

171

The next send sequence number (snd_nxt) is set to the oldest unacknowledged sequence number (snd_una). Recall from Figure 24.17 that snd_nxt can be greater than snd_una. By moving snd_nxt back, the retransmission will be the oldest segment that hasn’t been acknowledged.

Karn’s algorithm

172-175

The RTT counter, t_rtt, is set to 0, in case the last segment transmitted was being timed. Karn’s algorithm says that even if an ACK of that segment is received, since the segment is about to be retransmitted, any timing of the segment is worthless since the ACK could be for the first transmission or for the retransmission. The algorithm is described in [Karn and Partridge 1987] and in Section 21.3 of Volume 1. Therefore the only segments that are timed using the t_rtt counter and used to update the RTT estimators are those that are not retransmitted. We’ll see in Figure 29.6 that the use of RFC 1323 timestamps overrides Karn’s algorithm.
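
The bookkeeping behind Karn’s algorithm can be sketched outside the kernel. The structure and the four function names below are hypothetical, the sequence number comparison ignores wraparound, and timestamps are not modeled; the sketch only shows that a retransmission timeout discards the measurement in progress, so a later ACK of the retransmitted data produces no RTT sample.

```c
#include <stdbool.h>

/* Hypothetical, simplified timing state for one connection. */
struct rtt_timing {
    int           t_rtt;    /* 0 = nothing being timed, else ticks + 1 */
    unsigned long t_rtseq;  /* starting sequence number being timed */
};

/* Start timing a newly sent segment unless one is already being timed. */
static void on_new_segment(struct rtt_timing *t, unsigned long seq)
{
    if (t->t_rtt == 0) {
        t->t_rtt = 1;
        t->t_rtseq = seq;
    }
}

/* Called on every 500-ms tick: count only while a segment is being timed. */
static void on_slow_tick(struct rtt_timing *t)
{
    if (t->t_rtt != 0)
        t->t_rtt++;
}

/* Retransmission timeout: throw away the measurement in progress. */
static void on_rexmt_timeout(struct rtt_timing *t)
{
    t->t_rtt = 0;
}

/*
 * ACK arrival: store a sample and return true only if a segment is
 * being timed and the ACK covers its starting sequence number.
 */
static bool on_ack(struct rtt_timing *t, unsigned long ack, int *sample)
{
    if (t->t_rtt != 0 && ack > t->t_rtseq) {
        *sample = t->t_rtt;
        t->t_rtt = 0;
        return true;
    }
    return false;
}
```

If a segment starting at 3073 is timed, the timer expires, and an ACK for 3585 arrives afterward, on_ack returns false. A segment sent later that is acknowledged without retransmission yields a sample as usual.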

Slow Start and Congestion Avoidance

The second half of this case is shown in Figure 25.27. It performs slow start and congestion avoidance and retransmits the oldest unacknowledged segment.

Since a retransmission timeout has occurred, this is a strong indication of congestion in the network. TCP’s congestion avoidance algorithm comes into play, and when a segment is eventually acknowledged by the other end, TCP’s slow start algorithm will continue the data transmission on the connection at a slower rate. Sections 20.6 and 21.6 of Volume 1 describe the two algorithms in detail.

176-205

win is set to one-half of the current window size (the minimum of the receiver’s advertised window, snd_wnd, and the sender’s congestion window, snd_cwnd) in segments, not bytes (hence the division by t_maxseg). Its minimum value is two segments. This records one-half of the window size when the congestion occurred, assuming one cause of the congestion is our sending segments too rapidly into the network. This becomes the slow start threshold, snd_ssthresh (which is stored in bytes, hence the multiplication by t_maxseg). The congestion window, snd_cwnd, is set to one segment, which forces slow start.

This code is enclosed in braces because it was added between the 4.3BSD and Net/1 releases and required its own local variable (win).
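
The arithmetic on lines 201 through 205 can be tried with concrete numbers using the standalone sketch below. The structure cong_state and the function cong_on_timeout are hypothetical names; only the four fields mirror the control block members used in the listing.

```c
/* Hypothetical subset of the send-side fields, all in bytes. */
struct cong_state {
    unsigned long snd_wnd;       /* receiver's advertised window */
    unsigned long snd_cwnd;      /* congestion window */
    unsigned long snd_ssthresh;  /* slow start threshold */
    unsigned int  t_maxseg;      /* maximum segment size */
};

/* Apply the retransmission-timeout adjustment to the window variables. */
static void cong_on_timeout(struct cong_state *c)
{
    unsigned long cur = c->snd_wnd < c->snd_cwnd ? c->snd_wnd : c->snd_cwnd;
    unsigned long win = cur / 2 / c->t_maxseg;  /* half the window, in segments */

    if (win < 2)
        win = 2;                                /* never below two segments */
    c->snd_cwnd = c->t_maxseg;                  /* one segment: forces slow start */
    c->snd_ssthresh = win * c->t_maxseg;        /* convert back to bytes */
}
```

With an advertised window of 4096 bytes, a congestion window of 8192 bytes, and 512-byte segments, the threshold becomes 2048 bytes and the congestion window 512 bytes. With a current window of 1500 bytes, one-half of the window is only one segment, so the two-segment minimum applies and the threshold is 1024 bytes.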

206

The counter of consecutive duplicate ACKs, t_dupacks, is set to 0. We’ll see how this counter is used by TCP’s fast retransmit and fast recovery algorithms in Section 29.4.

208

tcp_output resends a segment containing the oldest unacknowledged sequence number. This is the retransmission caused by the retransmission timer expiring.

Accuracy

How accurate are these estimators that TCP maintains? At first they appear too coarse, since the RTTs are measured in multiples of 500 ms. The mean and mean deviation are maintained with additional accuracy (factors of 8 and 4 respectively), but LANs have RTTs on the order of milliseconds, and a transcontinental RTT is around 60 ms. What these estimators provide is a solid upper bound on the RTT so that the retransmission timeout can be set without worrying that the timeout is too small, causing unnecessary and wasteful retransmissions.

[Brakmo, O’Malley, and Peterson 1994] describe a TCP implementation that provides higher-resolution RTT measurements. This is done by recording the system clock (which has a much higher resolution than 500 ms) when a segment is transmitted and reading the system clock when the ACK is received, calculating a higher-resolution RTT.

The timestamp option provided by Net/3 (Section 26.6) can provide higher-resolution RTTs, but Net/3 sets the resolution of these timestamps to 500 ms.

An RTT Example

We now go through an actual example to see how the calculations are performed. We transfer 12288 bytes from the host bsdi to vangogh.cs.berkeley.edu. During the transfer we purposely bring down the PPP link being used and then bring it back up, to see how timeouts and retransmissions are handled. To transfer the data we use our sock program (described in Appendix C of Volume 1) with the -D option, to enable the SO_DEBUG socket option (Section 27.10). After the transfer is complete we examine the debug records left in the kernel’s circular buffer using the trpt(8) program and print the desired timer variables from the TCP control block.

Figure 25.28 shows the calculations that occur at the various times. We use the notation M:N to mean that sequence numbers M through and including N − 1 are sent. Each segment in this example contains 512 bytes. The notation “ack M” means that the acknowledgment field of the ACK is M. The column labeled “actual delta (ms)” shows the time difference between the RTT timer going on and going off. The column labeled “rtt arg.” shows the second argument to the tcp_xmit_timer function: the number of clock ticks plus 1 between the RTT timer going on and going off.
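
To see why quite different deltas can produce the same rtt argument, the following sketch counts the 500-ms ticks that fall inside a timed interval. The function rtt_arg and its first_tick_ms parameter are hypothetical: the trace does not reveal when the slow timer actually fired, and the sketch assumes a perfectly regular 500-ms clock.

```c
#define TICK_MS 500   /* period of the slow timer */

/*
 * rtt argument for a segment timed from start_ms to end_ms, assuming
 * the slow timer fires at first_tick_ms, first_tick_ms + 500, and so on.
 * The counter starts at 1 and is incremented once for each tick that
 * occurs while the segment is being timed.
 */
static int rtt_arg(long start_ms, long end_ms, long first_tick_ms)
{
    int  t_rtt = 1;
    long tick;

    for (tick = first_tick_ms; tick <= end_ms; tick += TICK_MS)
        if (tick > start_ms)
            t_rtt++;
    return t_rtt;
}
```

Choosing a first tick at 300 ms (an arbitrary value picked only because it fits the data) gives 2, 2, 3, and 3 for the first four timed segments of the example, matching the rtt argument column even though the first two deltas are 365 ms and 844 ms.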

The function tcp_newtcpcb initializes t_srtt, t_rttvar, and t_rxtcur to the values shown at time 0.0.

The first segment timed is the initial SYN. When its ACK is received 365 ms later, tcp_xmit_timer is called with an rtt argument of 2. Since this is the first RTT measurement (t_srtt is 0), the else clause in Figure 25.23 calculates the first values of the smoothed estimators.

The data segment containing bytes 1 through 512 is the next segment timed, and the RTT variables are updated at time 1.259 when its ACK is received.

The next three segments show how ACKs are cumulative. The timer is started at time 1.260 when bytes 513 through 1024 are sent. Another segment is sent with bytes 1025 through 1536, and the ACK received at time 2.206 acknowledges both data segments. The RTT estimators are then updated, since the ACK covers the starting sequence number being timed (513).

The segment with bytes 1537 through 2048 is transmitted at time 2.206 and the timer is started. Just that segment is acknowledged at time 3.132, and the estimators updated.

The data segment at time 3.132 is timed and the retransmission timer is set to 5 ticks (the current value of t_rxtcur). Somewhere around this time the PPP link between the routers sun and netb is taken down and then brought back up, a procedure that takes a few minutes. When the retransmission timer expires at time 6.064, the code in Figure 25.26 is executed to update the RTT variables. t_rxtshift is incremented from 0 to 1 and t_rxtcur is set to 10 ticks (the exponential backoff). A segment starting with the oldest unacknowledged sequence number (snd_una, which is 3073) is retransmitted. After 5 seconds the timer expires again, t_rxtshift is incremented to 2, and the retransmission timer is set to 20 ticks.

When the retransmission timer expires at time 42.464, t_srtt is set to 0 and t_rttvar is set to 5. As we mentioned in our discussion of Figure 25.26, this leaves the calculation of t_rxtcur the same (so the next calculation yields 160 ticks, which is then limited to the maximum of 128 ticks, or 64 seconds), but by setting t_srtt to 0, the next time the RTT estimators are updated (at time 218.834), the measured RTT becomes the smoothed RTT, as if the connection were starting fresh.
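
The sequence of t_rxtcur values during the outage can be reproduced with the sketch below. The function name backed_off_rto is hypothetical, the multiplier table is assumed to match the kernel’s tcp_backoff array, and the bounds of 2 and 128 ticks are assumed from the 1-second and 64-second limits on the retransmission timer.

```c
#define REXMT_MIN   2   /* assumed lower bound: 1 second in 500-ms ticks */
#define REXMT_MAX 128   /* assumed upper bound: 64 seconds in 500-ms ticks */

/* Backoff multipliers indexed by t_rxtshift (assumed values). */
static const int backoff[] = { 1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };

/*
 * Retransmission timeout in ticks for a given base value and shift.
 * rxtshift must be between 0 and 12 inclusive.
 */
static int backed_off_rto(int base_ticks, int rxtshift)
{
    int rto = base_ticks * backoff[rxtshift];

    if (rto < REXMT_MIN)
        rto = REXMT_MIN;
    if (rto > REXMT_MAX)
        rto = REXMT_MAX;
    return rto;
}
```

With a base of 5 ticks, shifts 1 through 7 give 10, 20, 40, 80, 128, 128, and 128 ticks. The value for a shift of 5 would be 160 ticks without the upper bound.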

The rest of the data transfer continues, and the estimators are updated a few more times.
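
The estimator columns of the example can be replayed with a short sketch of the scaled update. It is modeled on the behavior described for tcp_xmit_timer, but it is not the kernel function: the structure estimators, the function xmit_timer_sketch, and the macro names are hypothetical, and the bounds of 2 and 128 ticks are assumed from the 1-second and 64-second limits on the retransmission timer.

```c
#define RTT_SHIFT    3   /* t_srtt is kept as ticks x 8 */
#define RTTVAR_SHIFT 2   /* t_rttvar is kept as ticks x 4 */
#define RTO_MIN      2   /* assumed: 1 second in ticks */
#define RTO_MAX    128   /* assumed: 64 seconds in ticks */

struct estimators {
    int t_srtt;
    int t_rttvar;
    int t_rxtcur;
};

/* Update the estimators with one measurement (ticks plus 1). */
static void xmit_timer_sketch(struct estimators *e, int rtt)
{
    if (e->t_srtt != 0) {
        /* error between this measurement and the smoothed RTT */
        int delta = rtt - 1 - (e->t_srtt >> RTT_SHIFT);

        e->t_srtt += delta;
        if (e->t_srtt <= 0)
            e->t_srtt = 1;
        if (delta < 0)
            delta = -delta;
        delta -= e->t_rttvar >> RTTVAR_SHIFT;
        e->t_rttvar += delta;
        if (e->t_rttvar <= 0)
            e->t_rttvar = 1;
    } else {
        /* first measurement: it becomes the smoothed RTT */
        e->t_srtt = rtt << RTT_SHIFT;
        e->t_rttvar = rtt << (RTTVAR_SHIFT - 1);
    }
    e->t_rxtcur = (e->t_srtt >> RTT_SHIFT) + e->t_rttvar;
    if (e->t_rxtcur < RTO_MIN)
        e->t_rxtcur = RTO_MIN;
    if (e->t_rxtcur > RTO_MAX)
        e->t_rxtcur = RTO_MAX;
}
```

Starting from t_srtt = 0 and t_rttvar = 24 and feeding in the rtt arguments 2, 2, 3, and 3 produces (16, 4, 6), (15, 4, 5), (16, 4, 6), and (16, 3, 5) for t_srtt, t_rttvar, and t_rxtcur. Restarting from t_srtt = 0 and t_rttvar = 5 with the arguments 3, 2, 3, and 3 produces (24, 6, 9), (22, 7, 9), (22, 6, 8), and (22, 5, 7), the values recorded from time 218.834 onward.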

Summary

The two functions tcp_fasttimo and tcp_slowtimo are called by the kernel every 200 ms and every 500 ms, respectively. These two functions drive TCP’s per-connection timer maintenance.

TCP maintains the following seven timers for each connection:

  • a connection-establishment timer,

  • a retransmission timer,

  • a delayed ACK timer,

  • a persist timer,

  • a keepalive timer,

  • a FIN_WAIT_2 timer, and

  • a 2MSL timer.

The delayed ACK timer is different from the other six, since when it is set it means a delayed ACK must be sent the next time TCP’s 200-ms timer expires. The other six timers are counters that are decremented by 1 every time TCP’s 500-ms timer expires. When any one of the counters reaches 0, the appropriate action is taken: drop the connection, retransmit a segment, send a keepalive probe, and so on, as described in this chapter. Since some of the timers are mutually exclusive, the six timers are really implemented using four counters, which complicates the code.

This chapter also introduced the recommended way to calculate values for the retransmission timer. TCP maintains two smoothed estimators for a connection: the round-trip time and the mean deviation of the RTT. Although the algorithms are simple and elegant, these estimators are maintained as scaled fixed-point numbers (to provide adequate precision without using floating-point code within the kernel), which complicates the code.

Exercises

25.1

How efficient is TCP’s fast timeout function? (Hint: Look at the number of delayed ACKs in Figure 24.5.) Suggest alternative implementations.

25.1

In Figure 24.5 there were 531,285 delayed ACKs over 2,592,000 seconds (30 days). This is an average of about one delayed ACK every 5 seconds, or one delayed ACK every 25 times tcp_fasttimo is called. This means 96% of the time (24 times out of every 25) every TCP control block is checked for the delayed-ACK flag, when not one is set. On the large multiuser system in the solution to Exercise 24.3, this involves looking at over 400 control blocks, 5 times a second.

One alternative implementation would be to set a global flag when a delayed ACK is needed and only go through the list of control blocks when the flag is set. Alternatively, another list could be maintained that contains only the control blocks that require a delayed ACK. See, for example, the variable igmp_timers_are_running in Figure 13.14.
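
The global-flag alternative might look like the following sketch. Everything here is hypothetical: the structure cb stands in for the TCP control block, acks_sent stands in for calling the output routine, and the flag value is chosen only for this example.

```c
#include <stddef.h>

#define TF_DELACK 0x02    /* hypothetical flag bit for this sketch */

struct cb {
    int flags;
    int acks_sent;        /* stands in for calling the output routine */
};

static int delack_pending;   /* hypothetical global flag */

/* Mark a control block as needing a delayed ACK. */
static void request_delayed_ack(struct cb *c)
{
    c->flags |= TF_DELACK;
    delack_pending = 1;
}

/*
 * Fast timeout: scan the control blocks only when some block has asked
 * for a delayed ACK. Returns the number of control blocks examined.
 */
static size_t fasttimo_sketch(struct cb *cbs, size_t n)
{
    size_t i;

    if (!delack_pending)
        return 0;         /* common case: nothing to scan */
    delack_pending = 0;
    for (i = 0; i < n; i++) {
        if (cbs[i].flags & TF_DELACK) {
            cbs[i].flags &= ~TF_DELACK;
            cbs[i].acks_sent++;
        }
    }
    return n;
}
```

If a delayed ACK is piggybacked on data before the timer fires, the flag stays set and one scan is wasted, but that scan clears the flag. The cost in the common case drops from a pass over every control block to a single test.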

25.2

Why do you think the initialization of tcp_maxidle is in the tcp_slowtimo function instead of the tcp_init function?

25.2

This allows the variable tcp_keepintvl to be patched in the running kernel, which then changes the value of tcp_maxidle the next time tcp_slowtimo is called.

25.3

tcp_slowtimo increments t_idle, which we said counts the clock ticks since a segment was last received on the connection. Should TCP also count the idle time since a segment was last sent on a connection?

25.3

t_idle actually counts the time since a segment was last received or transmitted. This is because TCP output must be acknowledged by the other end and the receipt of the ACK clears t_idle, as does the receipt of a data segment (Figure 28.8).

25.4

Rewrite the code in Figure 25.10 to separate the logic for the two different uses of the TCPT_2MSL counter.

25.4

Here is one way to rewrite the code:

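case TCPT_2MSL:
    if (tp->t_state == TCPS_TIME_WAIT)
        /* 2MSL timer: the TIME_WAIT state is over */
        tp = tcp_close(tp);
    else {
        /* FIN_WAIT_2 timer: restart it unless the connection has been idle too long */
        if (tp->t_idle <= tcp_maxidle)
            tp->t_timer[TCPT_2MSL] = tcp_keepintvl;
        else
            tp = tcp_close(tp);
    }
    break;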

25.5

75 seconds after the connection in Figure 25.12 enters the FIN_WAIT_2 state a duplicate ACK is received on the connection. What happens?

25.5

When the duplicate ACK is received, t_idle is 150, but it is reset to 0. When the FIN_WAIT_2 timer expires, t_idle will be 1048 (1198 − 150), so the timer is set to 150 ticks. When the timer expires the next time, t_idle will be 1198, so the timer is set to 150 ticks. When the timer expires the next time, t_idle will be 1198 + 150, so the connection is closed. The duplicate ACK extends the time until the connection is closed.

25.6

A connection has been idle for 1 hour when the application sets the SO_KEEPALIVE option. Will the first keepalive probe be sent 1 or 2 hours in the future?

25.6

The first keepalive probe will be sent 1 hour in the future. When the process sets the option, nothing happens other than setting the SO_KEEPALIVE option in the socket structure. When the timer expires 1 hour in the future, since the option is enabled, the code in Figure 25.16 sends the first probe.

25.7

Why is tcp_rttdflt a global variable and not a constant?

25.7

The value of tcp_rttdflt initializes the RTT estimators for every TCP connection. A site can change the default of 3, if desired, by patching the global variable. If the value were a #define constant, it could be changed only by recompiling the kernel.

25.8

Rewrite the code related to Exercise 25.6 to implement the alternate behavior.
