Chapter 29. TCP Input

Introduction

This chapter continues the discussion of TCP input processing, picking up where the previous chapter left off. Recall that the final test in Figure 28.37 checked that the ACK flag was set; if it was not, the segment was dropped.

The ACK flag is handled, the window information is updated, the URG flag is processed, and any data in the segment is processed. Finally the FIN flag is processed and tcp_output is called, if required.

ACK Processing Overview

We begin this chapter with ACK processing, a summary of which is shown in Figure 29.1. The SYN_RCVD state is handled specially, followed by common processing for all remaining states. (Remember that a received ACK in either the LISTEN or SYN_SENT state was discussed in the previous chapter.) This is followed by special processing for the three states in which a received ACK causes a state transition, and for the TIME_WAIT state, in which the receipt of an ACK causes the 2MSL timer to be restarted.

Figure 29.1. Summary of ACK processing.

-----------------------------------------------------------------------------------
    switch (tp->t_state) {

    case TCPS_SYN_RECEIVED:
        complete processing of passive open and process
            simultaneous open or self-connect;
        /* fall into ... */

    case TCPS_ESTABLISHED:
    case TCPS_FIN_WAIT_1:
    case TCPS_FIN_WAIT_2:
    case TCPS_CLOSE_WAIT:
    case TCPS_CLOSING:
    case TCPS_LAST_ACK:
    case TCPS_TIME_WAIT:
        process duplicate ACK;
        update RTT estimators;
        if all outstanding data ACKed, turn off retransmission timer;
        remove ACKed data from socket send buffer;

        switch (tp->t_state) {

        case TCPS_FIN_WAIT_1:
            if (FIN is ACKed) {
                move to FIN_WAIT_2 state;
                start FIN_WAIT_2 timer;
            }
            break;

        case TCPS_CLOSING:
            if (FIN is ACKed) {
                move to TIME_WAIT state;
                start TIME_WAIT timer;
            }
            break;

        case TCPS_LAST_ACK:
            if (FIN is ACKed)
                move to CLOSED state;
            break;

        case TCPS_TIME_WAIT:
            restart TIME_WAIT timer;
            goto dropafterack;
        }
    }
-----------------------------------------------------------------------------------

Completion of Passive Opens and Simultaneous Opens

The first part of the ACK processing, shown in Figure 29.2, handles the SYN_RCVD state. As mentioned in the previous chapter, this handles the completion of a passive open (the common case) and also handles simultaneous opens and self-connects (the infrequent case).

Figure 29.2. tcp_input function: received ACK in SYN_RCVD state.

----------------------------------------------------------------------- tcp_input.c
791     /*
792      * Ack processing.
793      */
794     switch (tp->t_state) {

795         /*
796          * In SYN_RECEIVED state if the ack ACKs our SYN then enter
797          * ESTABLISHED state and continue processing, otherwise
798          * send an RST.
799          */
800     case TCPS_SYN_RECEIVED:
801         if (SEQ_GT(tp->snd_una, ti->ti_ack) ||
802             SEQ_GT(ti->ti_ack, tp->snd_max))
803             goto dropwithreset;
804         tcpstat.tcps_connects++;
805         soisconnected(so);
806         tp->t_state = TCPS_ESTABLISHED;
807         /* Do window scaling? */
808         if ((tp->t_flags & (TF_RCVD_SCALE | TF_REQ_SCALE)) ==
809             (TF_RCVD_SCALE | TF_REQ_SCALE)) {
810             tp->snd_scale = tp->requested_s_scale;
811             tp->rcv_scale = tp->request_r_scale;
812         }
813         (void) tcp_reass(tp, (struct tcpiphdr *) 0, (struct mbuf *) 0);
814         tp->snd_wl1 = ti->ti_seq - 1;
815         /* fall into ... */
----------------------------------------------------------------------- tcp_input.c

Verify received ACK

801-806

For the ACK to acknowledge the SYN that was sent, it must be greater than snd_una (which is set to the ISS for the connection, the sequence number of the SYN, by tcp_sendseqinit) and less than or equal to snd_max. If so, the socket is marked as connected and the state becomes ESTABLISHED.

Since soisconnected wakes up the process that performed the passive open (normally a server), we see that this doesn’t occur until the last of the three segments in the three-way handshake has been received. If the server is blocked in a call to accept, that call now returns; if the server is blocked in a call to select waiting for the listening descriptor to become readable, it is now readable.
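The range check on lines 801-803 depends on sequence number comparisons that remain correct when the 32-bit sequence space wraps. The following is a minimal sketch of that check, not the Net/3 source: SEQ_GT mirrors the comparison used in the listing, while seq_t and syn_rcvd_ack_ok are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t seq_t;     /* hypothetical stand-in for tcp_seq */

/* Wraparound-safe "a is later than b": the signed difference is positive. */
#define SEQ_GT(a, b) ((int32_t)((a) - (b)) > 0)

/*
 * Returns 1 if an ACK received in SYN_RCVD passes the test on lines
 * 801-803 (the connection may become ESTABLISHED), or 0 if the
 * segment would be answered with an RST.
 */
static int syn_rcvd_ack_ok(seq_t snd_una, seq_t snd_max, seq_t ack)
{
    /* Caller invariant: snd_una never runs ahead of snd_max. */
    assert(!SEQ_GT(snd_una, snd_max));

    if (SEQ_GT(snd_una, ack) || SEQ_GT(ack, snd_max))
        return 0;
    return 1;
}
```

With an ISS of 0xFFFFFFFF the SYN consumes the last sequence number before the wrap, so an acknowledgment field of 0 is still accepted: the signed difference, not the raw magnitude, decides the ordering.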

Check for window scale option

807-812

If TCP sent a window scale option and received one, the send and receive scale factors are saved in the TCP control block. Otherwise the default values of snd_scale and rcv_scale in the TCP control block are 0 (no scaling).

Pass queued data to process

813

Any data queued for the connection can now be passed to the process. This is done by tcp_reass with a null pointer as the second argument. This data would have arrived with the SYN that moved the connection into the SYN_RCVD state.

814

snd_wl1 is set to the received sequence number minus 1. We’ll see in Figure 29.15 that this causes the three window update variables to be updated.

Fast Retransmit and Fast Recovery Algorithms

The next part of ACK processing, shown in Figure 29.3, handles duplicate ACKs and determines if TCP’s fast retransmit and fast recovery algorithms [Jacobson 1990c] should come into play. The two algorithms are separate but are normally implemented together [Floyd 1994].

Figure 29.3. tcp_input function: check for completely duplicate ACK.

----------------------------------------------------------------------- tcp_input.c
816         /*
817          * In ESTABLISHED state: drop duplicate ACKs; ACK out-of-range
818          * ACKs.  If the ack is in the range
819          *  tp->snd_una < ti->ti_ack <= tp->snd_max
820          * then advance tp->snd_una to ti->ti_ack and drop
821          * data from the retransmission queue.  If this ACK reflects
822          * more up-to-date window information we update our window information.
823          */
824     case TCPS_ESTABLISHED:
825     case TCPS_FIN_WAIT_1:
826     case TCPS_FIN_WAIT_2:
827     case TCPS_CLOSE_WAIT:
828     case TCPS_CLOSING:
829     case TCPS_LAST_ACK:
830     case TCPS_TIME_WAIT:

831         if (SEQ_LEQ(ti->ti_ack, tp->snd_una)) {
832             if (ti->ti_len == 0 && tiwin == tp->snd_wnd) {
833                 tcpstat.tcps_rcvdupack++;
834                 /*
835                  * If we have outstanding data (other than
836                  * a window probe), this is a completely
837                  * duplicate ack (ie, window info didn't
838                  * change), the ack is the biggest we've
839                  * seen and we've seen exactly our rexmt
840                  * threshold of them, assume a packet
841                  * has been dropped and retransmit it.
842                  * Kludge snd_nxt & the congestion
843                  * window so we send only this one
844                  * packet.
845                  *
846                  * We know we're losing at the current
847                  * window size so do congestion avoidance
848                  * (set ssthresh to half the current window
849                  * and pull our congestion window back to
850                  * the new ssthresh).
851                  *
852                  * Dup acks mean that packets have left the
853                  * network (they're now cached at the receiver)
854                  * so bump cwnd by the amount in the receiver
855                  * to keep a constant cwnd packets in the
856                  * network.
857                  */
----------------------------------------------------------------------- tcp_input.c
  • The fast retransmit algorithm occurs when TCP deduces from a small number (normally 3) of consecutive duplicate ACKs that a segment has been lost and deduces the starting sequence number of the missing segment. The missing segment is retransmitted. The algorithm is mentioned in Section 4.2.2.21 of RFC 1122, which states that TCP may generate an immediate ACK when an out-of-order segment is received. We saw that Net/3 generates the immediate duplicate ACKs in Figure 27.15. This algorithm first appeared in the 4.3BSD Tahoe release and the subsequent Net/1 release. In these two implementations, after the missing segment was retransmitted, the slow start phase was entered.

  • The fast recovery algorithm says that after the fast retransmit algorithm (that is, after the missing segment has been retransmitted), congestion avoidance but not slow start is performed. This is an improvement that allows higher throughput under moderate congestion, especially for large windows. This algorithm appeared in the 4.3BSD Reno release and the subsequent Net/2 release.

Net/3 implements both fast retransmit and fast recovery, as we describe shortly.

In the discussion of Figure 24.17 we noted that an acceptable ACK must be in the range

snd_una < acknowledgment field <= snd_max

This first test of the acknowledgment field compares it only to snd_una. The comparison against snd_max is in Figure 29.5. The reason for separating the tests is so that the following five tests can be applied to the received segment:

  1. If the acknowledgment field is less than or equal to snd_una, and

  2. the length of the received segment is 0, and

  3. the advertised window (tiwin) has not changed, and

  4. TCP has outstanding data that has not been acknowledged (the retransmission timer is nonzero), and

  5. the received segment contains the biggest ACK TCP has seen (the acknowledgment field equals snd_una),

then this segment is a completely duplicate ACK. (Tests 1, 2, and 3 are in Figure 29.3; tests 4 and 5 are at the beginning of Figure 29.4.)
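The five tests can be collapsed into a single classification, which may make the later case analysis easier to follow. This is a sketch under stated assumptions: the enum and the function classify_ack are hypothetical names, and the parameters stand in for the fields that the listings read (ti_ack, snd_una, ti_len, tiwin, snd_wnd, and the retransmission timer).

```c
#include <assert.h>
#include <stdint.h>

#define SEQ_LEQ(a, b) ((int32_t)((a) - (b)) <= 0)

enum ack_class {
    ACK_NEW,            /* test 1 fails: the ACK advances past snd_una */
    ACK_NOT_PURE_DUP,   /* test 1 holds, but test 2 or test 3 fails */
    ACK_DUP_UNCOUNTED,  /* tests 1-3 hold, but test 4 or test 5 fails */
    ACK_DUP_COMPLETE    /* all five tests hold */
};

static enum ack_class classify_ack(uint32_t ack, uint32_t snd_una,
                                   unsigned seg_len,
                                   unsigned long tiwin,
                                   unsigned long snd_wnd,
                                   int rexmt_timer)
{
    assert(rexmt_timer >= 0);

    if (!SEQ_LEQ(ack, snd_una))                 /* test 1 */
        return ACK_NEW;
    if (seg_len != 0 || tiwin != snd_wnd)       /* tests 2 and 3 */
        return ACK_NOT_PURE_DUP;
    if (rexmt_timer == 0 || ack != snd_una)     /* tests 4 and 5 */
        return ACK_DUP_UNCOUNTED;
    return ACK_DUP_COMPLETE;
}
```

Only ACK_DUP_COMPLETE increments the consecutive duplicate counter; the two middle classes reset it to 0, and ACK_NEW continues with normal ACK processing.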

Figure 29.4. tcp_input function: duplicate ACK processing.

----------------------------------------------------------------------- tcp_input.c
858                 if (tp->t_timer[TCPT_REXMT] == 0 ||
859                     ti->ti_ack != tp->snd_una)
860                     tp->t_dupacks = 0;
861                 else if (++tp->t_dupacks == tcprexmtthresh) {
862                     tcp_seq onxt = tp->snd_nxt;
863                     u_int win =
864                         min(tp->snd_wnd, tp->snd_cwnd) / 2 /
865                             tp->t_maxseg;

866                     if (win < 2)
867                         win = 2;
868                     tp->snd_ssthresh = win * tp->t_maxseg;
869                     tp->t_timer[TCPT_REXMT] = 0;
870                     tp->t_rtt = 0;
871                     tp->snd_nxt = ti->ti_ack;
872                     tp->snd_cwnd = tp->t_maxseg;
873                     (void) tcp_output(tp);
874                     tp->snd_cwnd = tp->snd_ssthresh +
875                         tp->t_maxseg * tp->t_dupacks;
876                     if (SEQ_GT(onxt, tp->snd_nxt))
877                         tp->snd_nxt = onxt;
878                     goto drop;
879                 } else if (tp->t_dupacks > tcprexmtthresh) {
880                     tp->snd_cwnd += tp->t_maxseg;
881                     (void) tcp_output(tp);
882                     goto drop;
883                 }
884             } else
885                 tp->t_dupacks = 0;
886             break;              /* beyond ACK processing (to step 6) */
887         }
----------------------------------------------------------------------- tcp_input.c

TCP counts the number of these duplicate ACKs that are received in a row (in the variable t_dupacks), and when the number reaches a threshold of 3 (tcprexmtthresh), the lost segment is retransmitted. This is the fast retransmit algorithm described in Section 21.7 of Volume 1. It works in conjunction with the code we saw in Figure 27.15: when TCP receives an out-of-order segment, it is required to generate an immediate duplicate ACK, telling the other end that a segment might have been lost and telling it the value of the next expected sequence number. The goal of the fast retransmit algorithm is for TCP to retransmit immediately what appears to be the missing segment, instead of waiting for the retransmission timer to expire. Figure 21.7 of Volume 1 gives a detailed example of how this algorithm works.

The receipt of a duplicate ACK also tells TCP that a packet has “left the network,” because the other end had to receive an out-of-order segment to send the duplicate ACK. The fast recovery algorithm says that after some number of consecutive duplicate ACKs have been received, TCP should perform congestion avoidance (i.e., slow down) but need not wait for the pipe to empty between the two connection end points (slow start). The expression “a packet has left the network” means a packet has been received by the other end and has been added to the out-of-order queue for the connection. The packet is not still in transit somewhere between the two end points.

If only the first three tests shown earlier are true, the ACK is still a duplicate and is counted by the statistic tcps_rcvdupack, but the counter of the number of consecutive duplicate ACKs for this connection (t_dupacks) is reset to 0. If only the first test is true, the counter t_dupacks is reset to 0.

The remainder of the fast recovery algorithm is shown in Figure 29.4. When all five tests are true, the fast recovery algorithm processes the segment depending on the number of these consecutive duplicate ACKs that have been received.

  1. t_dupacks equals 3 (tcprexmtthresh). Congestion avoidance is performed and the missing segment is retransmitted.

  2. t_dupacks exceeds 3. Increase the congestion window and perform normal TCP output.

  3. t_dupacks is less than 3. Do nothing.

Number of consecutive duplicate ACKs reaches threshold of 3

861-868

When t_dupacks reaches 3 (tcprexmtthresh), the value of snd_nxt is saved in onxt and the slow start threshold (ssthresh) is set to one-half of the current window (the smaller of the send window and the congestion window), rounded down to a whole number of segments, with a minimum value of two segments. This is what was done with the slow start threshold when the retransmission timer expired in Figure 25.27, but we’ll see later in this piece of code that the fast recovery algorithm does not set the congestion window to one segment, as was done with the timeout.
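The threshold arithmetic on lines 863-868 is easy to misread because of the two integer divisions. The sketch below isolates it; fast_rexmt_ssthresh is a hypothetical helper name, and the parameters correspond to snd_wnd, snd_cwnd, and t_maxseg in the listing.

```c
#include <assert.h>

/*
 * Slow start threshold chosen when the third duplicate ACK arrives:
 * half of min(snd_wnd, snd_cwnd), truncated to whole segments,
 * but never less than two segments.  The result is in bytes.
 */
static unsigned long fast_rexmt_ssthresh(unsigned long snd_wnd,
                                         unsigned long snd_cwnd,
                                         unsigned long maxseg)
{
    unsigned long win;

    assert(maxseg > 0);

    win = (snd_wnd < snd_cwnd ? snd_wnd : snd_cwnd) / 2 / maxseg;
    if (win < 2)
        win = 2;
    return win * maxseg;
}
```

For example, with a send window of 8192, a congestion window of 4096, and a segment size of 512, the result is 2048 bytes (four segments). With a congestion window of only 1024 the two-segment floor applies and the result is 1024 bytes.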

Turn off retransmission timer

869-870

The retransmission timer is turned off and, in case a segment is currently being timed, t_rtt is set to 0.

Retransmit missing segment

871-873

snd_nxt is set to the starting sequence number of the segment that appears to have been lost (the acknowledgment field of the duplicate ACK) and the congestion window is set to one segment. This causes tcp_output to send only the missing segment. (This is shown by segment 63 in Figure 21.7 of Volume 1.)

Set congestion window

874-875

The congestion window is set to the slow start threshold plus the number of segments that the other end has cached. By cached we mean the number of out-of-order segments that the other end has received and generated duplicate ACKs for. These cannot be passed to the process at the other end until the missing segment (which was just sent) is received. Figures 21.10 and 21.11 in Volume 1 show what happens with the congestion window and slow start threshold when the fast recovery algorithm is in effect.

Set snd_nxt

876-878

The value of the next sequence number to send is set to the maximum of its previous value (onxt) and its current value. Its current value was modified by tcp_output when the segment was retransmitted. Normally this causes snd_nxt to be set back to its previous value, which means that only the missing segment is retransmitted, and that future calls to tcp_output carry on with the next segment in sequence.

Number of consecutive duplicate ACKs exceeds threshold of 3

879-883

The missing segment was retransmitted when t_dupacks equaled 3, so the receipt of each additional duplicate ACK means that another packet has left the network. The congestion window is incremented by one segment. tcp_output sends the next segment in sequence, and the duplicate ACK is dropped. (This is shown by segments 67, 69, and 71 in Figure 21.7 of Volume 1.)

884-885

This statement is executed when the received segment contains a duplicate ACK, but either the length is nonzero or the advertised window changed. Only the first of the five tests described earlier is true. The counter of consecutive duplicate ACKs is set to 0.

Skip remainder of ACK processing

886

This break is executed in three cases:

  1. only the first of the five tests described earlier is true, or

  2. only the first three of the five tests is true, or

  3. the ACK is a duplicate, but the number of consecutive duplicates is less than the threshold of 3.

For any of these cases the ACK is still a duplicate and the break goes to the end of the switch that started in Figure 29.2, which continues processing at the label step6.

To understand the purpose of this aggressive window manipulation, consider the following example. Assume the window is eight segments, and segments 1 through 8 are sent. Segment 1 is lost, but the remainder arrive OK and are acknowledged. After the ACKs for segments 2, 3, and 4 arrive, the missing segment (1) is retransmitted. TCP would like the subsequent ACKs for 5 through 8 to allow some of the segments starting with 9 to be sent, to keep the pipe full. But the window is 8, which prevents segments 9 and above from being sent. Therefore, the congestion window is temporarily inflated by one segment each time another duplicate ACK is received, since the receipt of the duplicate ACK tells TCP that another segment has left the pipe at the other end. When the acknowledgment of segment 1 is finally received, the next figure reduces the congestion window back to the slow start threshold. This increase in the congestion window as the duplicate ACKs arrive, and its subsequent decrease when the fresh ACK arrives, can be seen visually in Figure 21.10 of Volume 1.
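The inflation and later deflation of the congestion window can be traced with a small model. This is a simplified sketch, not the kernel code: struct fr_state, on_dup_ack, and on_new_ack are hypothetical names, the temporary one-segment window used while the missing segment is retransmitted is omitted, and the normal window opening that follows a new ACK is left out. The segment size of 1024 bytes and the send window of 16 segments used in the example are assumptions chosen only to give round numbers.

```c
#include <assert.h>

enum { REXMT_THRESH = 3 };      /* plays the role of tcprexmtthresh */

struct fr_state {
    unsigned long snd_wnd;      /* advertised window, bytes */
    unsigned long snd_cwnd;     /* congestion window, bytes */
    unsigned long snd_ssthresh; /* slow start threshold, bytes */
    unsigned long maxseg;       /* segment size, bytes */
    int           dupacks;      /* consecutive duplicate ACKs */
};

/* Process one completely duplicate ACK; returns 1 if it triggers
 * the retransmission of the missing segment. */
static int on_dup_ack(struct fr_state *c)
{
    assert(c->maxseg > 0);

    c->dupacks++;
    if (c->dupacks == REXMT_THRESH) {
        unsigned long w = c->snd_wnd < c->snd_cwnd ? c->snd_wnd
                                                   : c->snd_cwnd;
        unsigned long win = w / 2 / c->maxseg;

        if (win < 2)
            win = 2;
        c->snd_ssthresh = win * c->maxseg;
        /* ssthresh plus one segment per duplicate ACK seen so far */
        c->snd_cwnd = c->snd_ssthresh + c->maxseg * c->dupacks;
        return 1;
    }
    if (c->dupacks > REXMT_THRESH)
        c->snd_cwnd += c->maxseg;   /* another packet left the network */
    return 0;
}

/* Process the first nonduplicate ACK: retract any inflation. */
static void on_new_ack(struct fr_state *c)
{
    if (c->dupacks > REXMT_THRESH && c->snd_cwnd > c->snd_ssthresh)
        c->snd_cwnd = c->snd_ssthresh;
    c->dupacks = 0;
}
```

Starting from a congestion window of 8192 bytes, the third duplicate ACK sets the threshold to 4096 and the window to 7168. Four more duplicates raise the window to 11264, and the ACK that finally covers the retransmitted segment pulls it back to 4096.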

ACK Processing

The ACK processing continues with Figure 29.5.

Figure 29.5. tcp_input function: ACK processing continued.

----------------------------------------------------------------------- tcp_input.c
888         /*
889          * If the congestion window was inflated to account
890          * for the other side's cached packets, retract it.
891          */
892         if (tp->t_dupacks > tcprexmtthresh &&
893             tp->snd_cwnd > tp->snd_ssthresh)
894             tp->snd_cwnd = tp->snd_ssthresh;
895         tp->t_dupacks = 0;

896         if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
897             tcpstat.tcps_rcvacktoomuch++;
898             goto dropafterack;
899         }
900         acked = ti->ti_ack - tp->snd_una;
901         tcpstat.tcps_rcvackpack++;
902         tcpstat.tcps_rcvackbyte += acked;
----------------------------------------------------------------------- tcp_input.c

Adjust congestion window

888-895

If the number of consecutive duplicate ACKs exceeds the threshold of 3, this is the first nonduplicate ACK after a string of four or more duplicate ACKs. The fast recovery algorithm is complete. Since the congestion window was incremented by one segment for every consecutive duplicate after the third, if it now exceeds the slow start threshold, it is set back to the slow start threshold. The counter of consecutive duplicate ACKs is set to 0.

Check for out-of-range ACK

896-899

Recall the definition of an acceptable ACK,

snd_una < acknowledgment field <= snd_max

If the acknowledgment field is greater than snd_max, the other end is acknowledging data that TCP hasn’t even sent yet! This probably occurs on a high-speed connection when the sequence numbers wrap and a missing ACK reappears later. As we can see in Figure 24.5, this rarely happens (since today’s networks aren’t fast enough).

Calculate number of bytes acknowledged

900-902

At this point TCP knows that it has an acceptable ACK. acked is the number of bytes acknowledged.

The next part of ACK processing, shown in Figure 29.6, deals with RTT measurements and the retransmission timer.

Figure 29.6. tcp_input function: RTT measurements and retransmission timer.

----------------------------------------------------------------------- tcp_input.c
903         /*
904          * If we have a timestamp reply, update smoothed
905          * round-trip time.  If no timestamp is present but
906          * transmit timer is running and timed sequence
907          * number was acked, update smoothed round-trip time.
908          * Since we now have an rtt measurement, cancel the
909          * timer backoff (cf., Phil Karn's retransmit alg.).
910          * Recompute the initial retransmit timer.
911          */
912         if (ts_present)
913             tcp_xmit_timer(tp, tcp_now - ts_ecr + 1);
914         else if (tp->t_rtt && SEQ_GT(ti->ti_ack, tp->t_rtseq))
915             tcp_xmit_timer(tp, tp->t_rtt);

916         /*
917          * If all outstanding data is acked, stop retransmit
918          * timer and remember to restart (more output or persist).
919          * If there is more data to be acked, restart retransmit
920          * timer, using current (possibly backed-off) value.
921          */
922         if (ti->ti_ack == tp->snd_max) {
923             tp->t_timer[TCPT_REXMT] = 0;
924             needoutput = 1;
925         } else if (tp->t_timer[TCPT_PERSIST] == 0)
926             tp->t_timer[TCPT_REXMT] = tp->t_rxtcur;
----------------------------------------------------------------------- tcp_input.c

Update RTT estimators

903-915

If either (1) a timestamp option was present, or (2) a segment was being timed and the acknowledgment number is greater than the starting sequence number of the segment being timed, tcp_xmit_timer updates the RTT estimators. Notice that the second argument to this function when timestamps are used is the current time (tcp_now) minus the timestamp echo reply (ts_ecr) plus 1 (since the function subtracts 1).

Delayed ACKs are the reason for the greater-than test of the sequence numbers. For example, suppose TCP sends and times a segment with bytes 1–1024, followed by a segment with bytes 1025–2048. If an ACK of 2049 is returned, the test checks whether 2049 is greater than 1 (the starting sequence number of the segment being timed); since it is, the RTT estimators are updated.
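The choice between the two sources of an RTT sample can be sketched as a small function. The names here are hypothetical (rtt_sample and its out-parameter are not part of the kernel source); the function only decides whether a sample exists and what value would be handed to tcp_xmit_timer, without modeling the estimators themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEQ_GT(a, b) ((int32_t)((a) - (b)) > 0)

/*
 * Returns 1 and stores the sample in *sample if the RTT estimators
 * would be updated for this ACK, otherwise returns 0.
 *   ts_present : a timestamp option was received
 *   tcp_now    : current timestamp clock
 *   ts_ecr     : timestamp echo reply from the received segment
 *   t_rtt      : tick counter for the segment being timed (0 = none)
 *   t_rtseq    : starting sequence number of the timed segment
 */
static int rtt_sample(int ts_present, uint32_t tcp_now, uint32_t ts_ecr,
                      uint32_t t_rtt, uint32_t ack, uint32_t t_rtseq,
                      uint32_t *sample)
{
    assert(sample != NULL);

    if (ts_present) {
        /* plus 1 because the estimator routine subtracts 1 */
        *sample = tcp_now - ts_ecr + 1;
        return 1;
    }
    if (t_rtt != 0 && SEQ_GT(ack, t_rtseq)) {
        *sample = t_rtt;
        return 1;
    }
    return 0;
}
```

In the delayed ACK example from the text, an ACK of 2049 with a timed segment starting at 1 yields a sample. If the segment had been retransmitted and the tick counter cleared to 0, no sample would be produced unless a timestamp option were present.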

Check if all outstanding data has been acknowledged

916-924

If the acknowledgment field of the received segment (ti_ack) equals the maximum sequence number that TCP has sent (snd_max), all outstanding data has been acknowledged. The retransmission timer is turned off and the needoutput flag is set to 1. This flag forces a call to tcp_output at the end of this function. Since there is no more data waiting to be acknowledged, TCP may have more data to send that it has not been able to send earlier because the data was beyond the right edge of the window. Now that a new ACK has been received, the window will probably move to the right (snd_una is updated in Figure 29.8), which could allow more data to be sent.

Unacknowledged data outstanding

925-926

Since there is additional data that has been sent but not acknowledged, if the persist timer is not on, the retransmission timer is restarted using the current value of t_rxtcur.
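The two timer cases on lines 922-926 can be summarized in a few lines. The sketch below uses a hypothetical struct and function name (timers, update_rexmt_timer); a timer value of 0 means the timer is off, as in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct timers {
    int rexmt;      /* retransmission timer, 0 = off */
    int persist;    /* persist timer, 0 = off */
};

/*
 * Adjust the retransmission timer for an acceptable ACK.
 * Returns the value that the needoutput flag would take.
 */
static int update_rexmt_timer(struct timers *t, uint32_t ack,
                              uint32_t snd_max, int t_rxtcur)
{
    assert(t != NULL && t_rxtcur > 0);

    if (ack == snd_max) {
        /* everything sent has been acknowledged */
        t->rexmt = 0;
        return 1;
    }
    if (t->persist == 0)
        t->rexmt = t_rxtcur;    /* data still outstanding: restart */
    return 0;
}
```

The retransmission and persist timers are never meant to run together, which is why the restart is skipped while the persist timer is on.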

Karn’s Algorithm and Timestamps

Notice that timestamps overrule the portion of Karn’s algorithm (Section 21.3 of Volume 1) that says: when a timeout and retransmission occurs, the RTT estimators cannot be updated when the acknowledgment for the retransmitted data is received (the retransmission ambiguity problem). In Figure 25.26 we saw that t_rtt was set to 0 when a retransmission took place, because of Karn’s algorithm. If timestamps are not present and it is a retransmission, the code in Figure 29.6 does not update the RTT estimators because t_rtt will be 0 from the retransmission. But if a timestamp is present, t_rtt isn’t examined, allowing the RTT estimators to be updated using the received timestamp echo reply. With RFC 1323 timestamps the ambiguity is gone since the ts_ecr value was copied by the other end from the segment being acknowledged. The other half of Karn’s algorithm, specifying that an exponential backoff must be used with retransmissions, still holds, of course.

Figure 29.7 shows the next part of ACK processing, updating the congestion window.

Figure 29.7. tcp_input function: open congestion window in response to ACKs.

----------------------------------------------------------------------- tcp_input.c
927         /*
928          * When new data is acked, open the congestion window.
929          * If the window gives us less than ssthresh packets
930          * in flight, open exponentially (maxseg per packet).
931          * Otherwise open linearly: maxseg per window
932          * (maxseg^2 / cwnd per packet), plus a constant
933          * fraction of a packet (maxseg/8) to help larger windows
934          * open quickly enough.
935          */
936         {
937             u_int   cw = tp->snd_cwnd;
938             u_int   incr = tp->t_maxseg;

939             if (cw > tp->snd_ssthresh)
940                 incr = incr * incr / cw + incr / 8;
941             tp->snd_cwnd = min(cw + incr, TCP_MAXWIN << tp->snd_scale);
942         }
----------------------------------------------------------------------- tcp_input.c

Update congestion window

927-942

One of the rules of slow start and congestion avoidance is that a received ACK increases the congestion window. By default the congestion window is increased by one segment for each received ACK (slow start). But if the current congestion window is greater than the slow start threshold, it is increased by 1 divided by the congestion window, plus a constant fraction of a segment. The term

incr * incr / cw

is

t_maxseg * t_maxseg / snd_cwnd

which is 1 divided by the congestion window, taking into account that snd_cwnd is maintained in bytes, not segments. The constant fraction is the segment size divided by 8. The congestion window is then limited by the maximum value of the send window for this connection. Example calculations of this algorithm are in Section 21.8 of Volume 1.

Adding in the constant fraction (the segment size divided by 8) is wrong [Floyd 1994]. But it has been in the BSD sources since 4.3BSD Reno and is still in 4.4BSD and Net/3. It should be removed.
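A worked calculation may help with the units. The sketch below reproduces the arithmetic of lines 937-941 as listed, including the maxseg/8 term discussed above; open_cwnd is a hypothetical helper name, and all windows are in bytes.

```c
#include <assert.h>

#define TCP_MAXWIN 65535UL      /* largest unscaled window */

/* Returns the new congestion window after one acceptable ACK. */
static unsigned long open_cwnd(unsigned long cw, unsigned long ssthresh,
                               unsigned long maxseg, unsigned snd_scale)
{
    unsigned long incr = maxseg;
    unsigned long limit = TCP_MAXWIN << snd_scale;

    assert(cw > 0 && snd_scale <= 14);

    if (cw > ssthresh)
        /* congestion avoidance: about one segment per window */
        incr = incr * incr / cw + incr / 8;
    return cw + incr < limit ? cw + incr : limit;
}
```

With a segment size of 512, a window of 1024 that is below the threshold grows by a full segment to 1536. A window of 4096 that is above a threshold of 2048 grows by 512 * 512 / 4096 + 512 / 8 = 64 + 64 = 128 bytes, to 4224.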

The next part of tcp_input, shown in Figure 29.8, removes the acknowledged data from the send buffer.

Figure 29.8. tcp_input function: remove acknowledged data from send buffer.

----------------------------------------------------------------------------- tcp_input.c
943         if (acked > so->so_snd.sb_cc) {
944             tp->snd_wnd -= so->so_snd.sb_cc;
945             sbdrop(&so->so_snd, (int) so->so_snd.sb_cc);
946             ourfinisacked = 1;
947         } else {
948             sbdrop(&so->so_snd, acked);
949             tp->snd_wnd -= acked;
950             ourfinisacked = 0;
951         }
952         if (so->so_snd.sb_flags & SB_NOTIFY)
953             sowwakeup(so);
954         tp->snd_una = ti->ti_ack;
955         if (SEQ_LT(tp->snd_nxt, tp->snd_una))
956             tp->snd_nxt = tp->snd_una;
----------------------------------------------------------------------------- tcp_input.c

Remove acknowledged bytes from the send buffer

943-946

If the number of bytes acknowledged exceeds the number of bytes in the send buffer, snd_wnd is decremented by the number of bytes in the send buffer and TCP knows that its FIN has been ACKed. That number of bytes is then removed from the send buffer by sbdrop. This method for detecting the ACK of a FIN works only because the FIN occupies 1 byte in the sequence number space.

947-951

Otherwise the number of bytes acknowledged is less than or equal to the number of bytes in the send buffer, so ourfinisacked is set to 0, and acked bytes of data are dropped from the send buffer.
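The FIN detection on lines 943-951 can be sketched without the socket buffer machinery. In this simplified model the send buffer is reduced to a byte count; drop_acked is a hypothetical name, and sb_cc and snd_wnd stand in for so_snd.sb_cc and the send window.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Remove acknowledged bytes from a send buffer that holds *sb_cc
 * bytes and shrink the send window by the same amount.
 * Returns 1 if the ACK covers more than the buffered data, which
 * can only happen when it also acknowledges the FIN.
 */
static int drop_acked(unsigned long *sb_cc, unsigned long *snd_wnd,
                      unsigned long acked)
{
    assert(sb_cc != NULL && snd_wnd != NULL);

    if (acked > *sb_cc) {
        *snd_wnd -= *sb_cc;
        *sb_cc = 0;         /* drop everything that was buffered */
        return 1;           /* the extra sequence number is the FIN */
    }
    *sb_cc -= acked;
    *snd_wnd -= acked;
    return 0;
}
```

With 1000 bytes buffered, an ACK covering 1001 sequence numbers empties the buffer and reports the FIN as acknowledged, while an ACK covering exactly 1000 empties the buffer without reporting it.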

Wakeup processes waiting on send buffer

952-956

sowwakeup awakens any processes waiting on the send buffer. snd_una is updated to contain the oldest unacknowledged sequence number. If this new value of snd_una exceeds snd_nxt, the latter is updated, since the intervening bytes have been acknowledged.

Figure 29.9 shows how snd_nxt can end up with a sequence number that is less than snd_una. Assume two segments are transmitted, the first with bytes 1–512 and the second with bytes 513–1024.

Figure 29.9. Two segments sent on a connection.

The retransmission timer then expires before an acknowledgment is returned. The code in Figure 25.26 sets snd_nxt back to snd_una, slow start is entered, tcp_output is called, and one segment containing bytes 1–512 is retransmitted. tcp_output increases snd_nxt to 513, and we have the scenario shown in Figure 29.10.

Figure 29.10. Continuation of Figure 29.9 after retransmission timer expires.

At this point an ACK of 1025 arrives (either the two original segments or the ACK was delayed somewhere in the network). The ACK is valid since it is less than or equal to snd_max, but snd_nxt will be less than the updated value of snd_una.
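The scenario can be replayed with the two assignments from lines 954-956. This is a minimal sketch; advance_snd_una is a hypothetical name, and SEQ_LT is written out in the same wraparound-safe style as the comparisons in the listings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEQ_LT(a, b) ((int32_t)((a) - (b)) < 0)

/* Record a new acceptable ACK and keep snd_nxt from falling behind. */
static void advance_snd_una(uint32_t *snd_una, uint32_t *snd_nxt,
                            uint32_t ack)
{
    assert(snd_una != NULL && snd_nxt != NULL);

    *snd_una = ack;
    if (SEQ_LT(*snd_nxt, *snd_una))
        *snd_nxt = *snd_una;
}
```

After the timeout, snd_una is 1 and snd_nxt is 513. When the ACK of 1025 arrives, snd_una becomes 1025, and because 513 is less than 1025, snd_nxt is moved forward to 1025 as well.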

The general ACK processing is now complete, and the switch shown in Figure 29.11 handles four special cases.

Figure 29.11. tcp_input function: receipt of ACK in FIN_WAIT_1 state.

----------------------------------------------------------------------- tcp_input.c
957         switch (tp->t_state) {

958             /*
959              * In FIN_WAIT_1 state in addition to the processing
960              * for the ESTABLISHED state if our FIN is now acknowledged
961              * then enter FIN_WAIT_2.
962              */
963         case TCPS_FIN_WAIT_1:
964             if (ourfinisacked) {
965                 /*
966                  * If we can't receive any more
967                  * data, then closing user can proceed.
968                  * Starting the timer is contrary to the
969                  * specification, but if we don't get a FIN
970                  * we'll hang forever.
971                  */
972                 if (so->so_state & SS_CANTRCVMORE) {
973                     soisdisconnected(so);
974                     tp->t_timer[TCPT_2MSL] = tcp_maxidle;
975                 }
976                 tp->t_state = TCPS_FIN_WAIT_2;
977             }
978             break;
----------------------------------------------------------------------- tcp_input.c

Receipt of ACK in FIN_WAIT_1 state

958-971

In this state the process has closed the connection and TCP has sent the FIN. But other ACKs can be received for data segments sent before the FIN. Therefore the connection moves into the FIN_WAIT_2 state only when the FIN has been acknowledged. The flag ourfinisacked is set in Figure 29.8; this depends on whether the number of bytes ACKed exceeds the amount of data in the send buffer or not.

Set FIN_WAIT_2 timer

972-975

We also described in Section 25.6 how Net/3 sets a FIN_WAIT_2 timer to prevent an infinite wait in the FIN_WAIT_2 state. This timer is set only if the process completely closed the connection (i.e., the close system call or its kernel equivalent if the process was terminated by a signal), and not if the process performed a half-close (i.e., the FIN was sent but the process can still receive data on the connection).

Figure 29.12 shows the receipt of an ACK in the CLOSING state.

Figure 29.12. tcp_input function: receipt of ACK in CLOSING state.

----------------------------------------------------------------------- tcp_input.c
979             /*
980              * In CLOSING state in addition to the processing for
981              * the ESTABLISHED state if the ACK acknowledges our FIN
982              * then enter the TIME-WAIT state, otherwise ignore
983              * the segment.
984              */
985         case TCPS_CLOSING:
986             if (ourfinisacked) {
987                 tp->t_state = TCPS_TIME_WAIT;
988                 tcp_canceltimers(tp);
989                 tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
990                 soisdisconnected(so);
991             }
992             break;
----------------------------------------------------------------------- tcp_input.c

Receipt of ACK in CLOSING state

979-992

If the ACK is for the FIN (and not for some previous data segment), the connection moves into the TIME_WAIT state. Any pending timers are cleared (such as a pending retransmission timer), and the TIME_WAIT timer is started with a value of twice the MSL.

The processing of an ACK in the LAST_ACK state is shown in Figure 29.13.

Table 29.13. tcp_input function: receipt of ACK in LAST_ACK state.

----------------------------------------------------------------------- tcp_input.c
 993             /*
 994              * In LAST_ACK, we may still be waiting for data to drain
 995              * and/or to be acked, as well as for the ack of our FIN.
 996              * If our FIN is now acknowledged, delete the TCB,
 997              * enter the closed state, and return.
 998              */
 999         case TCPS_LAST_ACK:
1000             if (ourfinisacked) {
1001                 tp = tcp_close(tp);
1002                 goto drop;
1003             }
1004             break;
----------------------------------------------------------------------- tcp_input.c

Receipt of ACK in LAST_ACK state

993-1004

If the FIN is ACKed, the new state is CLOSED. This state transition is handled by tcp_close, which also releases the Internet PCB and TCP control block.

Figure 29.14 shows the processing of an ACK in the TIME_WAIT state.

Table 29.14. tcp_input function: receipt of ACK in TIME_WAIT state.

----------------------------------------------------------------------- tcp_input.c
1005             /*
1006              * In TIME_WAIT state the only thing that should arrive
1007              * is a retransmission of the remote FIN.  Acknowledge
1008              * it and restart the finack timer.
1009              */
1010         case TCPS_TIME_WAIT:
1011             tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
1012             goto dropafterack;
1013         }
1014     }
----------------------------------------------------------------------- tcp_input.c

Receipt of ACK in TIME_WAIT state

1005-1014

In this state both ends have sent a FIN and both FINs have been acknowledged. If TCP’s ACK of the remote FIN was lost, however, the other end will retransmit the FIN (with an ACK). TCP drops the segment and resends the ACK. Additionally, the TIME_WAIT timer must be restarted with a value of twice the MSL.
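The four ACK-driven cases in Figures 29.11 through 29.14 can be summarized as a pure function of the current state and the ourfinisacked flag. The following is only a sketch: the enum and the function name state_after_ack are hypothetical, chosen for illustration (Net/3 uses the TCPS_xxx constants and updates tp->t_state in place), and the timer and socket side effects are omitted.

```c
#include <stdbool.h>

/* Hypothetical state names for this sketch; Net/3 uses TCPS_xxx constants. */
enum conn_state {
    ST_CLOSED,
    ST_FIN_WAIT_1,
    ST_FIN_WAIT_2,
    ST_CLOSING,
    ST_LAST_ACK,
    ST_TIME_WAIT
};

/*
 * Return the state that follows the receipt of an acceptable ACK.
 * Only the state change is modeled; timers and socket flags are not.
 */
enum conn_state
state_after_ack(enum conn_state state, bool ourfinisacked)
{
    switch (state) {
    case ST_FIN_WAIT_1:
        return ourfinisacked ? ST_FIN_WAIT_2 : ST_FIN_WAIT_1;
    case ST_CLOSING:
        return ourfinisacked ? ST_TIME_WAIT : ST_CLOSING;
    case ST_LAST_ACK:
        return ourfinisacked ? ST_CLOSED : ST_LAST_ACK;
    default:
        /* TIME_WAIT only restarts its timer; the state is unchanged. */
        return state;
    }
}
```

In every case the transition depends on the ACK covering the FIN, not merely on an ACK arriving: an ACK for earlier data leaves the state unchanged.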

Update Window Information

There are two variables in the TCP control block that we haven’t described yet: snd_wl1 and snd_wl2.

  • snd_wl1 records the sequence number of the last segment used to update the send window (snd_wnd).

  • snd_wl2 records the acknowledgment number of the last segment used to update the send window.

Our only encounter with these variables so far was when a connection was established (active, passive, or simultaneous open) and snd_wl1 was set to ti_seq minus 1. We said this was to guarantee a window update, which we’ll see in the following code.

The send window (snd_wnd) is updated from the advertised window in the received segment (tiwin) if any one of the following three conditions is true:

  1. The segment contains new data. Since snd_wl1 contains the starting sequence number of the last segment that was used to update the send window, if

         snd_wl1 < ti_seq

     this condition is true.

  2. The segment does not contain new data (snd_wl1 equals ti_seq), but the segment acknowledges new data. The latter condition is true if

         snd_wl2 < ti_ack

     since snd_wl2 records the acknowledgment number of the last segment that updated the send window.

  3. The segment does not contain new data, and the segment does not acknowledge new data, but the advertised window is larger than the current send window.

The purpose of these tests is to prevent an old segment from affecting the send window, since the send window is not an absolute sequence number, but is an offset from snd_una.
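The three conditions can be written as a stand-alone predicate, which makes the order of the tests easier to see. This is a sketch under stated assumptions: the names seq_lt and should_update_window are hypothetical, the arguments stand in for the control block fields and the received header fields, and the caller is assumed to have already verified that the ACK flag is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a < b" for 32-bit sequence numbers (same idea as SEQ_LT). */
static bool
seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/*
 * Hypothetical helper: true if a segment carrying an ACK should update
 * the send window, following the three conditions in the text.
 */
bool
should_update_window(uint32_t snd_wl1, uint32_t snd_wl2, uint32_t snd_wnd,
                     uint32_t seg_seq, uint32_t seg_ack, uint32_t seg_win)
{
    if (seq_lt(snd_wl1, seg_seq))
        return true;            /* condition 1: segment contains new data */
    if (snd_wl1 != seg_seq)
        return false;           /* older segment: its window is ignored */
    if (seq_lt(snd_wl2, seg_ack))
        return true;            /* condition 2: acknowledges new data */
    /* condition 3: same sequence and ACK numbers, but a larger window */
    return snd_wl2 == seg_ack && seg_win > snd_wnd;
}
```

An old segment (one whose sequence number precedes snd_wl1) fails every test, no matter how large a window it advertises, which is exactly the protection the three conditions are meant to provide.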

Figure 29.15 shows the code that implements the update of the send window.

Table 29.15. tcp_input function: update window information.

----------------------------------------------------------------------- tcp_input.c
1015   step6:
1016     /*
1017      * Update window information.
1018      * Don't look at window if no ACK: TAC's send garbage on first SYN.
1019      */
1020     if ((tiflags & TH_ACK) &&
1021         (SEQ_LT(tp->snd_wl1, ti->ti_seq) || tp->snd_wl1 == ti->ti_seq &&
1022          (SEQ_LT(tp->snd_wl2, ti->ti_ack) ||
1023           tp->snd_wl2 == ti->ti_ack && tiwin > tp->snd_wnd))) {

1024         /* keep track of pure window updates */
1025         if (ti->ti_len == 0 &&
1026             tp->snd_wl2 == ti->ti_ack && tiwin > tp->snd_wnd)
1027             tcpstat.tcps_rcvwinupd++;

1028         tp->snd_wnd = tiwin;
1029         tp->snd_wl1 = ti->ti_seq;
1030         tp->snd_wl2 = ti->ti_ack;
1031         if (tp->snd_wnd > tp->max_sndwnd)
1032             tp->max_sndwnd = tp->snd_wnd;
1033         needoutput = 1;
1034     }
----------------------------------------------------------------------- tcp_input.c

Check if send window should be updated

1015-1023

This if test verifies that the ACK flag is set along with any one of the three previously stated conditions. Recall that a jump was made to step6 after the receipt of a SYN in either the LISTEN or SYN_SENT state, and in the LISTEN state the SYN does not contain an ACK.

The term TAC referred to in the comment is a “terminal access controller.” These were Telnet clients on the ARPANET.

1024-1027

If the received segment is a pure window update (the length is 0 and the ACK does not acknowledge new data, but the advertised window is larger), the statistic tcps_rcvwinupd is incremented.

Update variables

1028-1033

The send window is updated and new values of snd_wl1 and snd_wl2 are recorded. Additionally, if this advertised window is the largest one TCP has received from this peer, the new value is recorded in max_sndwnd. This is an attempt to guess the size of the other end’s receive buffer, and it is used in Figure 26.8. needoutput is set to 1 since the new value of snd_wnd might enable a segment to be sent.

Urgent Mode Processing

The next part of TCP input processing handles segments with the URG flag set. Figure 29.16 shows the first part of this processing.

Table 29.16. tcp_input function: urgent mode processing.

----------------------------------------------------------------------------- tcp_input.c
1035     /*
1036      * Process segments with URG.
1037      */
1038     if ((tiflags & TH_URG) && ti->ti_urp &&
1039         TCPS_HAVERCVDFIN(tp->t_state) == 0) {
1040         /*
1041          * This is a kludge, but if we receive and accept
1042          * random urgent pointers, we'll crash in
1043          * soreceive.  It's hard to imagine someone
1044          * actually wanting to send this much urgent data.
1045          */
1046         if (ti->ti_urp + so->so_rcv.sb_cc > sb_max) {
1047             ti->ti_urp = 0;     /* XXX */
1048             tiflags &= ~TH_URG; /* XXX */
1049             goto dodata;        /* XXX */
1050         }
----------------------------------------------------------------------------- tcp_input.c

Check if URG flag should be processed

1035-1039

These segments must have the URG flag set, a nonzero urgent offset (ti_urp), and the connection must not have received a FIN. The macro TCPS_HAVERCVDFIN is true only for the TIME_WAIT state, so the URG is processed in any other state. This is contrary to a comment appearing later in the code stating that the URG flag is ignored in the CLOSE_WAIT, CLOSING, LAST_ACK, or TIME_WAIT states.

Ignore bogus urgent offsets

1040-1050

If the urgent offset plus the number of bytes already in the receive buffer exceeds the maximum size of a socket buffer, the urgent notification is ignored. The urgent offset is set to 0, the URG flag is cleared, and the rest of the urgent mode processing is skipped.

The next piece of code, shown in Figure 29.17, processes the urgent pointer.

Table 29.17. tcp_input function: processing of received urgent pointer.

------------------------------------------------------------------------ tcp_input.c
1051         /*
1052          * If this segment advances the known urgent pointer,
1053          * then mark the data stream.  This should not happen
1054          * in CLOSE_WAIT, CLOSING, LAST_ACK or TIME_WAIT states since
1055          * a FIN has been received from the remote side.
1056          * In these states we ignore the URG.
1057          *
1058          * According to RFC961 (Assigned Protocols),
1059          * the urgent pointer points to the last octet
1060          * of urgent data.  We continue, however,
1061          * to consider it to indicate the first octet
1062          * of data past the urgent section as the original
1063          * spec states (in one of two places).
1064          */
1065         if (SEQ_GT(ti->ti_seq + ti->ti_urp, tp->rcv_up)) {
1066             tp->rcv_up = ti->ti_seq + ti->ti_urp;
1067             so->so_oobmark = so->so_rcv.sb_cc +
1068                 (tp->rcv_up - tp->rcv_nxt) - 1;
1069             if (so->so_oobmark == 0)
1070                 so->so_state |= SS_RCVATMARK;
1071             sohasoutofband(so);
1072             tp->t_oobflags &= ~(TCPOOB_HAVEDATA | TCPOOB_HADDATA);
1073         }
1074         /*
1075          * Remove out-of-band data so doesn't get presented to user.
1076          * This can happen independent of advancing the URG pointer,
1077          * but if two URG's are pending at once, some out-of-band
1078          * data may creep in... ick.
1079          */
1080         if (ti->ti_urp <= ti->ti_len
1081 #ifdef SO_OOBINLINE
1082             && (so->so_options & SO_OOBINLINE) == 0
1083 #endif
1084             )
1085             tcp_pulloutofband(so, ti, m);
1086     } else {
1087         /*
1088          * If no out-of-band data is expected, pull receive
1089          * urgent pointer along with the receive window.
1090          */
1091         if (SEQ_GT(tp->rcv_nxt, tp->rcv_up))
1092             tp->rcv_up = tp->rcv_nxt;
1093     }
------------------------------------------------------------------------ tcp_input.c

1051-1065

If the starting sequence number of the received segment plus the urgent offset exceeds the current receive urgent pointer, a new urgent pointer has been received. For example, when the 3-byte segment that was sent in Figure 26.30 arrives at the receiver, we have the scenario shown in Figure 29.18.

Figure 29.18. Receiver side when segment from Figure 26.30 arrives.

Normally the receive urgent pointer (rcv_up) equals rcv_nxt. In this example, since the if test is true (4 plus 3 is greater than 4), the new value of rcv_up is calculated as 7.

Calculate receive urgent pointer

1066-1070

The out-of-band mark in the socket’s receive buffer is calculated, taking into account any data bytes already in the receive buffer (so_rcv.sb_cc). In our example, assuming there is no data already in the receive buffer, so_oobmark is set to 2: that is, the byte with the sequence number 6 is considered the out-of-band byte. If this out-of-band mark is 0, the socket is currently at the out-of-band mark. This happens if the send system call that sends the out-of-band byte specifies a length of 1, and if the receive buffer is empty when this segment arrives at the other end. This reiterates that Berkeley-derived systems consider the urgent pointer to point to the first byte of data after the out-of-band byte.
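The arithmetic for rcv_up and so_oobmark can be checked with a small user-space sketch. The structure urg_result and the function process_urgent are hypothetical names introduced here; the arguments correspond to ti_seq, ti_urp, rcv_up, rcv_nxt, and so_rcv.sb_cc, and none of the socket-layer side effects are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

struct urg_result {
    bool     advanced;  /* true if the segment advanced the urgent pointer */
    uint32_t rcv_up;    /* receive urgent pointer after processing */
    uint32_t oobmark;   /* offset of the out-of-band mark in the buffer */
};

/* Hypothetical helper mirroring the calculation at lines 1065-1068. */
struct urg_result
process_urgent(uint32_t seg_seq, uint32_t urp, uint32_t rcv_up,
               uint32_t rcv_nxt, uint32_t sb_cc)
{
    struct urg_result r = { false, rcv_up, 0 };

    /* Same test as SEQ_GT(ti_seq + ti_urp, rcv_up), safe across wraparound. */
    if ((int32_t)(seg_seq + urp - rcv_up) > 0) {
        r.advanced = true;
        r.rcv_up = seg_seq + urp;
        r.oobmark = sb_cc + (r.rcv_up - rcv_nxt) - 1;
    }
    return r;
}
```

With the values from the example (sequence number 4, urgent offset 3, rcv_up and rcv_nxt both 4, empty receive buffer), the sketch yields an urgent pointer of 7 and a mark of 2. An urgent offset of 1 with an empty buffer yields a mark of 0, the case in which the socket is already at the mark.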

Notify process of TCP’s urgent mode

1071-1072

sohasoutofband notifies the process that out-of-band data has arrived for the socket. The two flags TCPOOB_HAVEDATA and TCPOOB_HADDATA are cleared. These two flags are used with the PRU_RCVOOB request in Figure 30.8.

Pull out-of-band byte out of normal data stream

1074-1085

If the urgent offset is less than or equal to the number of bytes in the received segment, the out-of-band byte is contained in the segment. With TCP’s urgent mode it is possible for the urgent offset to point to a data byte that has not yet been received. If the SO_OOBINLINE constant is defined (which it always is for Net/3), and if the corresponding socket option is not enabled, the receiving process wants the out-of-band byte pulled out of the normal stream of data and placed into the variable t_iobc. This is done by tcp_pulloutofband, which we cover in the next section.

Notice that the receiving process is notified that the sender has entered urgent mode, regardless of whether the byte pointed to by the urgent pointer is readable or not. This is a feature of TCP’s urgent mode.

Adjust receive urgent pointer if not urgent mode

1086-1093

When the receiver is not processing an urgent pointer, if rcv_nxt is greater than the receive urgent pointer, rcv_up is moved to the right and set equal to rcv_nxt. This keeps the receive urgent pointer at the left edge of the receive window so that the comparison using SEQ_GT at the beginning of Figure 29.17 will work correctly when an URG flag is received.
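A minimal sketch of this adjustment follows; the names seq_gt and pull_rcv_up are hypothetical. The reason it matters is that sequence number comparisons of this kind are meaningful only while the two values are within 2^31 of each other, so an urgent pointer left far behind rcv_nxt would eventually compare as being ahead of it.

```c
#include <stdint.h>

/* Wraparound-safe "a > b" for 32-bit sequence numbers (same idea as SEQ_GT). */
static int
seq_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Hypothetical helper: value of rcv_up after a segment without urgent data. */
uint32_t
pull_rcv_up(uint32_t rcv_up, uint32_t rcv_nxt)
{
    /* Drag rcv_up along with rcv_nxt, but never move it backward. */
    return seq_gt(rcv_nxt, rcv_up) ? rcv_nxt : rcv_up;
}
```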

If the solution to Exercise 26.6 is implemented, corresponding changes will have to go into Figures 29.16 and 29.17 also.

tcp_pulloutofband Function

This function is called from Figure 29.17 when

  1. urgent mode notification arrives in a received segment, and

  2. the out-of-band byte is contained within the segment (i.e., the urgent pointer points into the received segment), and

  3. the SO_OOBINLINE socket option is not enabled for this socket.

This function removes the out-of-band byte from the normal stream of data (i.e., the mbuf chain containing the received segment) and places it into the t_iobc variable in the TCP control block for the connection. The process reads this variable using the MSG_OOB flag with the recv system call: the PRU_RCVOOB request in Figure 30.8. Figure 29.19 shows the function.

Table 29.19. tcp_pulloutofband function: place out-of-band byte into t_iobc.

------------------------------------------------------------------------- tcp_input.c
1282 void
1283 tcp_pulloutofband(so, ti, m)
1284 struct socket *so;
1285 struct tcpiphdr *ti;
1286 struct mbuf *m;
1287 {
1288     int     cnt = ti->ti_urp - 1;

1289     while (cnt >= 0) {
1290         if (m->m_len > cnt) {
1291             char   *cp = mtod(m, caddr_t) + cnt;
1292             struct tcpcb *tp = sototcpcb(so);

1293             tp->t_iobc = *cp;
1294             tp->t_oobflags |= TCPOOB_HAVEDATA;
1295             bcopy(cp + 1, cp, (unsigned) (m->m_len - cnt - 1));
1296             m->m_len--;
1297             return;
1298         }
1299         cnt -= m->m_len;
1300         m = m->m_next;
1301         if (m == 0)
1302             break;
1303     }
1304     panic("tcp_pulloutofband");
1305 }
------------------------------------------------------------------------- tcp_input.c

1282-1289

Consider the example in Figure 29.20. The urgent offset is 3, therefore the urgent pointer is 7, and the sequence number of the out-of-band byte is 6. There are 5 bytes in the received segment, all contained in a single mbuf.

Figure 29.20. Received segment with an out-of-band byte.

The variable cnt is 2 and since m_len (which is 5) is greater than 2, the true portion of the if statement is executed.

1290-1298

cp points to the shaded byte with a sequence number of 6. This is placed into the variable t_iobc, which contains the out-of-band byte. The TCPOOB_HAVEDATA flag is set and bcopy moves the next 2 bytes (with sequence numbers 7 and 8) left 1 byte, giving the arrangement shown in Figure 29.21.

Figure 29.21. Result from Figure 29.20 after removal of out-of-band byte.

Remember that the numbers 7 and 8 specify the sequence numbers of the data bytes, not the contents of the data bytes. The length of the mbuf is decremented from 5 to 4 but ti_len is left as 5, for sequencing of the segment into the socket’s receive buffer. Both the TCP_REASS macro and the tcp_reass function (which are called in the next section) increment rcv_nxt by ti_len, which in this example must be 5, because the next expected receive sequence number is 9. Also notice in this function that the length field in the packet header (m_pkthdr.len) in the first mbuf is not decremented by 1. This is because that length field is not used by sbappend, which appends the data to the socket’s receive buffer.

Skip to next mbuf in chain

1299-1302

The out-of-band byte is not contained in this mbuf, so cnt is decremented by the number of bytes in the mbuf and the next mbuf in the chain is processed. Since this function is called only when the urgent offset points into the received segment, if there is not another mbuf on the chain, the break causes the call to panic.
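The same walk over a chain of buffers can be tried outside the kernel. The sketch below is an assumption-laden stand-in, not the Net/3 code: struct buf is a hypothetical replacement for an mbuf, pull_oob_byte is a hypothetical name, memmove takes the place of bcopy, and an error return replaces the call to panic.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf: a data pointer, a length, and a link. */
struct buf {
    unsigned char *data;
    size_t         len;
    struct buf    *next;
};

/*
 * Remove the byte at offset urp - 1 from the chain and store it in *oob.
 * Returns 0 on success, or -1 if the offset lies beyond the chain
 * (the kernel function panics in that case).
 */
int
pull_oob_byte(struct buf *b, size_t urp, unsigned char *oob)
{
    size_t cnt;

    if (urp == 0)
        return -1;
    cnt = urp - 1;
    for (; b != NULL; b = b->next) {
        if (b->len > cnt) {
            *oob = b->data[cnt];
            /* Close the gap; the regions overlap, so use memmove. */
            memmove(b->data + cnt, b->data + cnt + 1, b->len - cnt - 1);
            b->len--;
            return 0;
        }
        cnt -= b->len;  /* byte is not in this buffer: skip to the next */
    }
    return -1;
}
```

Applied to a single 5-byte buffer holding the bytes with sequence numbers 4 through 8 and an urgent offset of 3, the sketch extracts the byte numbered 6 and leaves 4, 5, 7, 8 in a buffer of length 4, matching Figures 29.20 and 29.21.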

Processing of Received Data

tcp_input continues by taking the received data (if any) and either appending it to the socket’s receive buffer (if it is the next expected segment) or placing it onto the socket’s out-of-order queue. Figure 29.22 shows the code that performs this task.

Table 29.22. tcp_input function: merge received data into sequencing queue for socket.

----------------------------------------------------------------------- tcp_input.c
1094   dodata:                       /* XXX */
1095     /*
1096      * Process the segment text, merging it into the TCP sequencing queue,
1097      * and arranging for acknowledgment of receipt if necessary.
1098      * This process logically involves adjusting tp->rcv_wnd as data
1099      * is presented to the user (this happens in tcp_usrreq.c,
1100      * case PRU_RCVD).  If a FIN has already been received on this
1101      * connection then we just ignore the text.
1102      */
1103     if ((ti->ti_len || (tiflags & TH_FIN)) &&
1104         TCPS_HAVERCVDFIN(tp->t_state) == 0) {
1105         TCP_REASS(tp, ti, m, so, tiflags);
1106         /*
1107          * Note the amount of data that peer has sent into
1108          * our window, in order to estimate the sender's
1109          * buffer size.
1110          */
1111         len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
1112     } else {
1113         m_freem(m);
1114         tiflags &= ~TH_FIN;
1115     }
----------------------------------------------------------------------- tcp_input.c

1094-1105

Segment data is processed if

  1. the length of the received data is greater than 0 or the FIN flag is set, and

  2. a FIN has not yet been received for the connection.

The macro TCP_REASS processes the data. If the data is in sequence (i.e., the next expected data for this connection), the delayed-ACK flag is set, rcv_nxt is incremented, and the data is appended to the socket’s receive buffer. If the data is out of order, the macro calls tcp_reass to add the data to the connection’s reassembly queue (which might fill a hole and cause already-queued data to be appended to the socket’s receive buffer).

Recall that the final argument to the macro (tiflags) can be modified. Specifically, if the data is out of order, tcp_reass sets tiflags to 0, clearing the FIN flag (if it was set). That’s why the if statement is true if the FIN flag is set even if there is no data in the segment.

Consider the following example. A connection is established and the sender immediately transmits three segments: one with bytes 1–1024, another with bytes 1025–2048, and another with the FIN flag but no data. The first segment is lost, so when the second arrives (bytes 1025–2048) the receiver places it onto the out-of-order list and generates an immediate ACK. When the third segment with the FIN flag is received, the code in Figure 29.22 is executed. Even though the data length is 0, since the FIN flag is set, TCP_REASS is invoked, which calls tcp_reass. Since ti_seq (2049, the sequence number of the FIN) does not equal rcv_nxt (1), tcp_reass returns 0 (Figure 27.23), which in the TCP_REASS macro sets tiflags to 0. This clears the FIN flag, preventing the code that follows (Section 29.10) from processing the FIN flag.

Guess size of other end’s send buffer

1106-1111

The calculation of len is an attempt to guess the size of the other end’s send buffer. Consider the following example. A socket has a receive buffer size of 8192 (the Net/3 default), so TCP advertises a window of 8192 in its SYN. The first segment with bytes 1–1024 is then received. Figure 29.23 shows the state of the receive space after TCP_REASS has incremented rcv_nxt to account for the received segment.

Figure 29.23. Receipt of bytes 1–1024 into an 8192-byte receive window.

The calculation of len yields 1024. The value of len will increase as the other end sends more data into the receive window, but it will never exceed the size of the other end’s send buffer. Recall that the variable max_sndwnd, calculated in Figure 29.15, is an attempt to guess the size of the other end’s receive buffer.
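The value 1024 can be reproduced with a one-line function. The name bytes_sent_into_window is hypothetical, and the test values assume that the window advertised in the SYN covers bytes 1 through 8192, so that rcv_adv is 8193 and rcv_nxt is 1025 after the first segment.

```c
#include <stdint.h>

/* Hypothetical helper mirroring: len = sb_hiwat - (rcv_adv - rcv_nxt). */
uint32_t
bytes_sent_into_window(uint32_t sb_hiwat, uint32_t rcv_adv, uint32_t rcv_nxt)
{
    /* rcv_adv - rcv_nxt is the part of the advertised window still unused. */
    return sb_hiwat - (rcv_adv - rcv_nxt);
}
```

Under those assumptions the unused part of the window is 8193 - 1025 = 7168 bytes, and 8192 - 7168 gives 1024.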

This variable len is never used! It is left over code from Net/1 when the variable max_rcvd was stored in the TCP control block after the calculation of len:

if (len > tp->max_rcvd)
     tp->max_rcvd = len;

But even in Net/1 the variable max_rcvd was never used.

1112-1115

If the length is 0 and the FIN flag is not set, or if a FIN has already been received for the connection, the received mbuf chain is discarded and the FIN flag is cleared.

FIN Processing

The next step in tcp_input, shown in Figure 29.24, handles the FIN flag.

Table 29.24. tcp_input function: FIN processing, first half.

----------------------------------------------------------------------- tcp_input.c
1116     /*
1117      * If FIN is received ACK the FIN and let the user know
1118      * that the connection is closing.
1119      */
1120     if (tiflags & TH_FIN) {
1121         if (TCPS_HAVERCVDFIN(tp->t_state) == 0) {
1122             socantrcvmore(so);
1123             tp->t_flags |= TF_ACKNOW;
1124             tp->rcv_nxt++;
1125         }
1126         switch (tp->t_state) {

1127             /*
1128              * In SYN_RECEIVED and ESTABLISHED states
1129              * enter the CLOSE_WAIT state.
1130              */
1131         case TCPS_SYN_RECEIVED:
1132         case TCPS_ESTABLISHED:
1133             tp->t_state = TCPS_CLOSE_WAIT;
1134             break;
----------------------------------------------------------------------- tcp_input.c

Process first FIN received on connection

1116-1125

If the FIN flag is set and this is the first FIN received for this connection, socantrcvmore marks the socket as write-only, TF_ACKNOW is set to acknowledge the FIN immediately (i.e., it is not delayed), and rcv_nxt steps over the FIN in the sequence space.

1126

The remainder of FIN processing is handled by a switch that depends on the connection state. Notice that the FIN is not processed in the CLOSED, LISTEN, or SYN_SENT states, since in these three states a SYN has not been received to synchronize the received sequence number, making it impossible to validate the sequence number of the FIN. A FIN is also ignored in the CLOSING, CLOSE_WAIT, and LAST_ACK states, because in these three states the FIN is a duplicate.

SYN_RCVD or ESTABLISHED states

1127-1134

From either the ESTABLISHED or SYN_RCVD states, the CLOSE_WAIT state is entered.

The receipt of a FIN in the SYN_RCVD state is unusual, but legal. It is not shown in Figure 24.15. It can happen when a socket is in the LISTEN state and a segment containing both a SYN and a FIN is received. Alternatively, a SYN is received for a listening socket, moving the connection to the SYN_RCVD state, but a FIN is received before the ACK. (We know the segment does not contain a valid ACK, because if it did the code in Figure 29.2 would have moved the connection to the ESTABLISHED state.)

The next part of FIN processing is shown in Figure 29.25.

Table 29.25. tcp_input function: FIN processing, second half.

----------------------------------------------------------------------- tcp_input.c
1135             /*
1136              * If still in FIN_WAIT_1 state FIN has not been acked so
1137              * enter the CLOSING state.
1138              */
1139         case TCPS_FIN_WAIT_1:
1140             tp->t_state = TCPS_CLOSING;
1141             break;

1142             /*
1143              * In FIN_WAIT_2 state enter the TIME_WAIT state,
1144              * starting the time-wait timer, turning off the other
1145              * standard timers.
1146              */
1147         case TCPS_FIN_WAIT_2:
1148             tp->t_state = TCPS_TIME_WAIT;
1149             tcp_canceltimers(tp);
1150             tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
1151             soisdisconnected(so);
1152             break;

1153             /*
1154              * In TIME_WAIT state restart the 2 MSL time_wait timer.
1155              */
1156         case TCPS_TIME_WAIT:
1157             tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
1158             break;
1159         }
1160     }
----------------------------------------------------------------------- tcp_input.c

FIN_WAIT_1 state

1135-1141

Since ACK processing is already complete for this segment, if the connection is in the FIN_WAIT_1 state when the FIN is processed, it means a simultaneous close is taking place—the two FINs from each end have passed in the network. The connection enters the CLOSING state.

FIN_WAIT_2 state

1142-1148

The receipt of the FIN moves the connection into the TIME_WAIT state. In the typical scenario, a segment containing both a FIN and an ACK is received in the FIN_WAIT_1 state. Although Figure 24.15 shows the transition going directly from the FIN_WAIT_1 state to the TIME_WAIT state, the ACK is processed first, in Figure 29.11, moving the connection to the FIN_WAIT_2 state. The FIN processing here then moves the connection into the TIME_WAIT state. Because the ACK is processed before the FIN, the FIN_WAIT_2 state is always passed through, albeit momentarily.

Start TIME_WAIT timer

1149-1152

Any pending TCP timer is turned off and the TIME_WAIT timer is started with a value of twice the MSL. (If the received segment contained a FIN and an ACK, Figure 29.11 started the FIN_WAIT_2 timer.) The socket is disconnected.

TIME_WAIT state

1153-1159

If a FIN arrives in the TIME_WAIT state, it is a duplicate, and similar to Figure 29.14, the TIME_WAIT timer is restarted with a value of twice the MSL.
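The state changes caused by a FIN in Figures 29.24 and 29.25 can be collected into one small function. As before this is only a sketch: the enum and the name state_after_fin are hypothetical, and the timer handling and the call to soisdisconnected are left out.

```c
/* Hypothetical state names for this sketch; Net/3 uses TCPS_xxx constants. */
enum conn_state {
    ST_SYN_RCVD,
    ST_ESTABLISHED,
    ST_CLOSE_WAIT,
    ST_FIN_WAIT_1,
    ST_FIN_WAIT_2,
    ST_CLOSING,
    ST_LAST_ACK,
    ST_TIME_WAIT
};

/* Return the state that follows the processing of a received FIN. */
enum conn_state
state_after_fin(enum conn_state state)
{
    switch (state) {
    case ST_SYN_RCVD:
    case ST_ESTABLISHED:
        return ST_CLOSE_WAIT;
    case ST_FIN_WAIT_1:
        return ST_CLOSING;      /* simultaneous close: our FIN not yet ACKed */
    case ST_FIN_WAIT_2:
        return ST_TIME_WAIT;
    default:
        /* CLOSE_WAIT, CLOSING, LAST_ACK: duplicate FIN, no state change.
         * TIME_WAIT: only the 2MSL timer is restarted. */
        return state;
    }
}
```

Because ACK processing runs before this step, a segment carrying both an ACK of our FIN and the peer's FIN reaches this function with the state already changed from FIN_WAIT_1 to FIN_WAIT_2, and so ends in TIME_WAIT.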

Final Processing

The final part of the slow path through tcp_input along with the label dropafterack is shown in Figure 29.26.

Table 29.26. tcp_input function: final processing.

------------------------------------------------------------------------- tcp_input.c
1161     if (so->so_options & SO_DEBUG)
1162         tcp_trace(TA_INPUT, ostate, tp, &tcp_saveti, 0);

1163     /*
1164      * Return any desired output.
1165      */
1166     if (needoutput || (tp->t_flags & TF_ACKNOW))
1167         (void) tcp_output(tp);
1168     return;

1169   dropafterack:
1170     /*
1171      * Generate an ACK dropping incoming segment if it occupies
1172      * sequence space, where the ACK reflects our state.
1173      */
1174     if (tiflags & TH_RST)
1175         goto drop;
1176     m_freem(m);
1177     tp->t_flags |= TF_ACKNOW;
1178     (void) tcp_output(tp);
1179     return;
------------------------------------------------------------------------- tcp_input.c

SO_DEBUG socket option

1161-1162

If the SO_DEBUG socket option is enabled, tcp_trace appends the trace record to the kernel’s circular buffer. Remember that the code in Figure 28.7 saved both the original connection state and the IP and TCP headers, since these values may have changed in this function.

Call tcp_output

1163-1168

If the needoutput flag was set (Figures 29.6 and 29.15) or an immediate ACK is required, tcp_output is called.

dropafterack

1169-1179

An ACK is generated only if the RST flag was not set. (A segment with an RST is never ACKed.) The mbuf chain containing the received segment is released, and tcp_output generates an immediate ACK.

Figure 29.27 completes the tcp_input function.

Table 29.27. tcp_input function: final processing.

----------------------------------------------------------------------- tcp_input.c
1180   dropwithreset:
1181     /*
1182      * Generate an RST, dropping incoming segment.
1183      * Make ACK acceptable to originator of segment.
1184      * Don't bother to respond if destination was broadcast/multicast.
1185      */
1186     if ((tiflags & TH_RST) || m->m_flags & (M_BCAST | M_MCAST) ||
1187         IN_MULTICAST(ti->ti_dst.s_addr))
1188         goto drop;
1189     if (tiflags & TH_ACK)
1190         tcp_respond(tp, ti, m, (tcp_seq) 0, ti->ti_ack, TH_RST);
1191     else {
1192         if (tiflags & TH_SYN)
1193             ti->ti_len++;
1194         tcp_respond(tp, ti, m, ti->ti_seq + ti->ti_len, (tcp_seq) 0,
1195                     TH_RST | TH_ACK);
1196     }
1197     /* destroy temporarily created socket */
1198     if (dropsocket)
1199         (void) soabort(so);
1200     return;

1201   drop:
1202     /*
1203      * Drop space held by incoming segment and return.
1204      */
1205     if (tp && (tp->t_inpcb->inp_socket->so_options & SO_DEBUG))
1206         tcp_trace(TA_DROP, ostate, tp, &tcp_saveti, 0);
1207     m_freem(m);
1208     /* destroy temporarily created socket */
1209     if (dropsocket)
1210         (void) soabort(so);
1211     return;
1212 }
----------------------------------------------------------------------- tcp_input.c

dropwithreset

1180-1188

An RST is generated unless the received segment also contained an RST, or the received segment was sent as a broadcast or multicast. An RST is never generated in response to an RST, since this could lead to RST storms (a continual exchange of RST segments between two end points).

This code contains the same error that we noted in Figure 28.16: it does not check whether the destination address of the received segment was a broadcast address.

Similarly, the destination address argument to IN_MULTICAST needs to be converted to host byte order.
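The byte-order point can be illustrated in user space. The sketch below assumes a POSIX system that provides ntohl in <arpa/inet.h>; the function name is_multicast_dst is hypothetical. It applies the class D test (high-order 4 bits equal to 1110) only after converting the address, which is stored in network byte order in the IP header, to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: true if an IPv4 destination address, given in
 * network byte order as it appears in the header, is a class D
 * (multicast) address.
 */
bool
is_multicast_dst(uint32_t dst_net_order)
{
    uint32_t host = ntohl(dst_net_order);   /* convert before testing bits */

    return (host & 0xf0000000u) == 0xe0000000u;
}
```

On a big-endian host the conversion is a no-op, which is one reason an omission of this kind can go unnoticed.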

Sequence number and acknowledgment number of RST segment

1189-1196

The values of the sequence number field, the acknowledgment field, and the ACK flag of the RST segment depend on whether the received segment contained an ACK.

Figure 29.28 summarizes these fields in the RST segment that is generated.

Table 29.28. Values of fields in RST segment generated.

received segment        RST segment generated
                        seq#                   ack. field             flags
----------------------  ---------------------  ---------------------  ----------------
contains ACK            received ack. field    0                      TH_RST
does not contain ACK    0                      received seq# field    TH_RST | TH_ACK

Realize that the ACK flag is normally set in all segments except when an initial SYN is sent (Figure 24.16). The fourth argument to tcp_respond is the acknowledgment field, and the fifth argument is the sequence number.

Rejecting connections

1192-1193

If the SYN flag is set, ti_len must be incremented by 1, causing the acknowledgment field of the RST to be 1 greater than the received sequence number of the SYN. This code is executed when a SYN arrives for a nonexistent server. When the Internet PCB is not found in Figure 28.6, a jump is made to dropwithreset. But for the received RST to be acceptable to the other end, the acknowledgment field must ACK the SYN (Figure 28.18). Figure 18.14 of Volume 1 contains an example of this type of RST segment.

Finally note that tcp_respond builds the RST in the first mbuf of the received chain and releases any remaining mbufs in the chain. When that mbuf finally makes its way to the device driver, it will be discarded.
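The choice of sequence number, acknowledgment field, and flags for the RST, including the adjustment for a received SYN, can be sketched as follows. The structure rst_fields and the function rst_for_segment are hypothetical names; the FLAG_xxx constants use the same bit values as TH_SYN, TH_RST, and TH_ACK. Building and sending the segment, which tcp_respond does, is not modeled.

```c
#include <stdint.h>

#define FLAG_SYN 0x02   /* same bit values as TH_SYN, TH_RST, TH_ACK */
#define FLAG_RST 0x04
#define FLAG_ACK 0x10

struct rst_fields {
    uint32_t seq;       /* sequence number field of the RST */
    uint32_t ack;       /* acknowledgment field of the RST */
    int      flags;     /* flags of the RST */
};

/* Hypothetical helper: fields of the RST sent in reply to a segment. */
struct rst_fields
rst_for_segment(uint32_t seg_seq, uint32_t seg_ack, uint32_t seg_len,
                int seg_flags)
{
    struct rst_fields r;

    if (seg_flags & FLAG_ACK) {
        r.seq = seg_ack;        /* what the other end expects next */
        r.ack = 0;
        r.flags = FLAG_RST;
    } else {
        if (seg_flags & FLAG_SYN)
            seg_len++;          /* a SYN occupies one sequence number */
        r.seq = 0;
        r.ack = seg_seq + seg_len;
        r.flags = FLAG_RST | FLAG_ACK;
    }
    return r;
}
```

For a SYN with sequence number 1000 and no data, sent to a port with no listener, the sketch produces an RST with a sequence number of 0, an acknowledgment field of 1001, and both the RST and ACK flags set.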

Destroy temporarily created socket

1197-1199

If a temporary socket was created in Figure 28.7 for a listening server, but the code in Figure 28.16 found the received segment to contain an error, dropsocket will be 1. If so, that socket is now destroyed.

Drop (without ACK or RST)

1201-1206

tcp_trace is called when a segment is dropped without generating an ACK or an RST. If the SO_DEBUG flag is set and an ACK is generated, tcp_output generates a trace record. If the SO_DEBUG flag is set and an RST is generated, a trace record is not generated for the RST.

1207-1211

The mbuf chain containing the received segment is released and the temporary socket is destroyed if dropsocket is nonzero.

Implementation Refinements

The refinements to speed up TCP processing are similar to the ones described for UDP (Section 23.12). Multiple passes over the data should be avoided and the checksum computation should be combined with a copy. [Dalton et al. 1993] describe these modifications.

The linear search of the TCP PCBs is also a bottleneck when the number of connections increases. [McKenney and Dove 1992] address this problem by replacing the linear search with hash tables.

[Partridge 1993] describes a research implementation being developed by Van Jacobson that greatly reduces the TCP input processing. The received packet is processed by IP (about 25 instructions on a RISC system), then by a demultiplexer to locate the PCB (about 10 instructions), and then by TCP (about 30 instructions). These 30 instructions perform header prediction and calculate the pseudo-header checksum. If the segment passes the header prediction test, contains data, and the process is waiting for the data, the data is copied into the process buffer and the remainder of the TCP checksum is calculated and verified (a one-pass copy and checksum). If the TCP header prediction fails, the slow path through the TCP input processing occurs.

Header Compression

We now describe TCP header compression. Although header compression is not part of TCP input, we needed to cover TCP thoroughly before describing header compression. Header compression is described in detail in RFC 1144 [Jacobson 1990a]. It was designed by Van Jacobson and is sometimes called VJ header compression. Our purpose in this section is not to go through the header compression source code (a well-commented version of which is presented in RFC 1144, and which is approximately the same size as tcp_output), but to provide an overview of the algorithm. Be sure to distinguish between header prediction (Section 28.4) and header compression.

Introduction

Most implementations of SLIP and PPP support header compression. Although header compression could, in theory, be used with any data link, it is intended for slow-speed serial links. Header compression works with TCP segments only; it does nothing with other IP datagrams (e.g., ICMP, IGMP, or UDP). Header compression reduces the size of the combined IP/TCP header from its normal 40 bytes to as few as 3 bytes. This reduces the size of a typical TCP segment from an interactive application such as Rlogin or Telnet from 41 bytes to 4 bytes, a big saving on a slow-speed serial link.

Each end of the serial link maintains two connection state tables, one for datagrams sent and one for datagrams received. Each table allows a maximum of 256 entries, but typically there are 16 entries in each table, allowing up to 16 different TCP connections to be compressed at any time. Each entry contains an 8-bit connection ID (hence the limit of 256), some flags, and the complete uncompressed IP/TCP header from the most recent datagram. The 96-bit socket pair that uniquely identifies each connection (the source and destination IP addresses and the source and destination TCP ports) is contained in this uncompressed header. Figure 29.29 shows an example of these tables.

Figure 29.29. A pair of connection state tables at each end of a link (e.g., SLIP link).

Since a TCP connection is full duplex, header compression can be applied in both directions. Each end must implement both compression and decompression. A connection appears in both tables, as shown in Figure 29.29. In this example, the entry with a connection ID of 1 in the top two tables has a source IP address of 128.1.2.3, source TCP port of 1500, destination IP address of 192.3.4.5, and a destination TCP port of 25. The entry with a connection ID of 2 in the bottom two tables is for the other direction of the same connection.

We show these tables as arrays, but the source code defines each entry as a structure, and a connection table is a circular linked list of these structures. The most recently used structure is stored at the head of the list.
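The idea of keeping the most recently used entry at the head of the list can be sketched as follows. This is a simplified illustration under our own assumptions, not the source code: it uses a plain singly linked list rather than a circular one, it stores only the socket pair instead of the complete IP/TCP header, and the names conn_key, conn_slot, conn_table, table_init, and table_lookup are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 16   /* typical table size mentioned in the text */

/* The 96-bit socket pair that identifies a connection. */
struct conn_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

struct conn_slot {
    struct conn_slot *next;  /* next entry, in most-recently-used order */
    uint8_t id;              /* connection ID, fixed for the slot */
    int in_use;
    struct conn_key key;
};

struct conn_table {
    struct conn_slot *head;  /* most recently used entry */
    struct conn_slot slots[NSLOTS];
};

void table_init(struct conn_table *t)
{
    for (int i = 0; i < NSLOTS; i++) {
        t->slots[i].id = (uint8_t)i;
        t->slots[i].in_use = 0;
        t->slots[i].next = (i + 1 < NSLOTS) ? &t->slots[i + 1] : NULL;
    }
    t->head = &t->slots[0];
}

static int key_equal(const struct conn_key *a, const struct conn_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port;
}

/* Finds the entry for key and moves it to the head of the list. On a miss,
 * the least recently used entry (the tail) is recycled and *is_new is set
 * to 1, meaning the next datagram must establish the connection ID. */
struct conn_slot *table_lookup(struct conn_table *t,
                               const struct conn_key *key, int *is_new)
{
    struct conn_slot *prev = NULL, *cur = t->head;

    for (;;) {
        if (cur->in_use && key_equal(&cur->key, key)) {
            *is_new = 0;
            break;
        }
        if (cur->next == NULL) {      /* reached the tail: recycle it */
            cur->key = *key;
            cur->in_use = 1;
            *is_new = 1;
            break;
        }
        prev = cur;
        cur = cur->next;
    }
    if (prev != NULL) {               /* unlink and move to the front */
        prev->next = cur->next;
        cur->next = t->head;
        t->head = cur;
    }
    return cur;
}
```

With this arrangement, a connection that sends several segments in a row is found on the first comparison, and the entry that is recycled on a miss is always the one that has gone unused the longest.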

By saving the most recent uncompressed header at each end, only the differences in various header fields from the previous datagram to the current datagram are transmitted across the link (along with a special first byte indicating which fields follow). Since some header fields don’t change at all from one datagram to the next, and other header fields change by small amounts, this differential coding provides the savings. Header compression works with the IP and TCP headers only—the data contents of the TCP segment are not modified.

Figure 29.30 shows the steps involved at the sending side when it has an IP datagram to send across a link using header compression.

Figure 29.30. Steps involved in header compression at sender side.

Three different types of datagrams are sent and must be recognized at the receiver:

  1. Type IP is specified with the high-order 4 bits of the first byte equal to 4. This is the normal IP version number in the IP header (Figure 8.8). The normal, uncompressed datagram is transmitted across the link.

  2. Type COMPRESSED_TCP is specified by setting the high-order bit of the first byte. This looks like an IP version between 8 and 15 (i.e., the remaining 7 bits of this byte are used by the compression algorithm). The compressed header and uncompressed data are transmitted across the link, as we describe later in this section.

  3. Type UNCOMPRESSED_TCP is specified with the high-order 4 bits of the first byte equal to 7. The normal, uncompressed datagram is transmitted across the link, but the IP protocol field (which equals 6 for TCP) is replaced with the connection ID. This identifies the connection state table entry for the receiver.

The receiver can identify the datagram type by examining its first byte. The code that does this was shown in Figure 5.13. In Figure 5.16 the sender calls sl_compress_tcp to check if a TCP segment is compressible, and the return value of this function is logically ORed into the first byte of the datagram.
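The test on the first byte can be sketched as follows. This is a minimal illustration of the three rules just listed, not the code from Figure 5.13; the enum frame_type and the function classify_first_byte are names of our own.

```c
#include <stdint.h>

enum frame_type {
    FRAME_IP,                /* high-order 4 bits equal 4 */
    FRAME_UNCOMPRESSED_TCP,  /* high-order 4 bits equal 7 */
    FRAME_COMPRESSED_TCP,    /* high-order bit set */
    FRAME_UNKNOWN            /* anything else */
};

/* Classifies a received datagram from its first byte alone. */
enum frame_type classify_first_byte(uint8_t b)
{
    if (b & 0x80)
        return FRAME_COMPRESSED_TCP;   /* remaining 7 bits are flag bits */

    switch (b >> 4) {
    case 4:
        return FRAME_IP;
    case 7:
        return FRAME_UNCOMPRESSED_TCP;
    default:
        return FRAME_UNKNOWN;
    }
}
```

A first byte of 0x45 (version 4, header length 5) is classified as type IP, 0x75 as UNCOMPRESSED_TCP, and any value of 0x80 or greater as COMPRESSED_TCP.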

Figure 29.31 shows an illustration of the first byte that is sent across the link.

Figure 29.31. First byte transmitted across link.

The 4 bits shown as “-” comprise the normal IP header length field. The 7 bits shown as C, I, P, S, A, W, and U indicate which optional fields follow. We describe these fields shortly.

Figure 29.32 shows the complete IP datagram for the various datagrams that are sent.

Figure 29.32. Different types of IP datagrams possible with header compression.

We show two datagrams with a type of IP: one that is not a TCP segment (e.g., a protocol of UDP, ICMP, or IGMP), and one that is a TCP segment. This is to illustrate the differences between the TCP segment sent as type IP and the TCP segment sent as type UNCOMPRESSED_TCP: the first 4 bits are different as is the protocol field in the IP header.

Datagrams are not candidates for header compression if the protocol is not TCP, or if the protocol is TCP but any one of the following conditions is true.

  • The datagram is an IP fragment: either the fragment offset is nonzero or the more-fragments bit is set.

  • Any one of the SYN, FIN, or RST flags is set.

  • The ACK flag is not set.

If any one of these three conditions is true, the datagram is sent as type IP.

Furthermore, even if the datagram is a TCP segment that looks compressible, it is possible to abort the compression and send the datagram as type UNCOMPRESSED_TCP if certain fields have changed between the current datagram and the last datagram sent for this connection. These are fields that normally do not change for a given connection, so the compression scheme was not designed to encode their differences from one datagram to the next. The TOS field and the don’t fragment bit are examples. Also, when the differences in some fields are greater than 65535, the compression algorithm fails and the datagram is sent uncompressed.
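The first decision, whether a datagram is a candidate for compression at all or must be sent as type IP, can be sketched as a predicate. The structure hdr_summary and the function is_compression_candidate are hypothetical; we assume the fragment field has already been converted to host byte order. The later decision to fall back to UNCOMPRESSED_TCP when normally constant fields change is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define PROTO_TCP   6        /* IP protocol number for TCP */
#define IPOFF_MF    0x2000   /* more-fragments bit */
#define IPOFF_MASK  0x1fff   /* fragment offset bits */

#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_ACK 0x10

/* Hypothetical summary of the header fields the test needs. */
struct hdr_summary {
    uint8_t  ip_proto;   /* IP protocol field */
    uint16_t ip_off;     /* flags and fragment offset, host byte order */
    uint8_t  tcp_flags;  /* TCP flags (meaningful only for TCP) */
};

/* Returns false if the datagram must be sent as type IP. */
bool is_compression_candidate(const struct hdr_summary *h)
{
    if (h->ip_proto != PROTO_TCP)
        return false;                       /* not TCP */
    if (h->ip_off & (IPOFF_MF | IPOFF_MASK))
        return false;                       /* an IP fragment */
    if (h->tcp_flags & (TH_SYN | TH_FIN | TH_RST))
        return false;                       /* SYN, FIN, or RST set */
    if (!(h->tcp_flags & TH_ACK))
        return false;                       /* ACK not set */
    return true;
}
```

Note that the don't fragment bit does not make a datagram a fragment, so it does not affect this test; a change in that bit is handled by the later fallback to UNCOMPRESSED_TCP.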

Compression of Header Fields

We now describe how the fields in the IP and TCP headers, shown in Figure 29.33, are compressed. The shaded fields normally don’t change during a connection.

Figure 29.33. Combined IP and TCP headers: shaded fields normally don’t change.

If any of the shaded fields have changed from the previous segment on this connection to the current segment, the segment is sent uncompressed. We don’t show IP options or TCP options in this figure, but if either are present and have changed from the previous segment, the segment is sent uncompressed (Exercise 29.7).

If the algorithm transmitted only the nonshaded fields when the shaded fields do not change from the previous segment, about a 50% savings would result. VJ header compression does even better than this, by knowing which fields in the IP and TCP headers normally don’t change. Figure 29.34 shows the format of the compressed IP/TCP header.

Figure 29.34. Format of compressed IP/TCP header.

The smallest compressed header consists of 3 bytes: the first byte (the flag bits) followed by the 16-bit TCP checksum. For protection against possible link errors, the TCP checksum is always transmitted without any change. (SLIP provides no link-layer checksum, although PPP does provide one.)

The other six fields, connid, urgoff, Δwin, Δack, Δseq, and Δipid, are optional. We show the number of bytes used to encode all the fields to the left of the field in Figure 29.34. The largest compressed header appears to be 19 bytes, but we’ll see shortly that the 4 bits SAWU can never be set at the same time in a compressed header, so the largest size is actually 16 bytes.

Six of the 7 bits in the first byte specify which of the six optional fields are present. The high-order bit of the first byte is always set to 1. This identifies the datagram type as COMPRESSED_TCP. Figure 29.35 summarizes the 7 bits, which we now describe.

Table 29.35. The 7 bits in the compressed header.

Flag  Description           Structure  Meaning if flag = 0          Meaning if flag = 1
bit                         member
C     connection ID                    same connection ID as last   connid = connection ID
I     IP identification     ip_id      ip_id has increased by 1     Δipid = current − previous
P     TCP push flag                    PSH flag off                 PSH flag on
S     TCP sequence#         th_seq     same th_seq as last          Δseq = current − previous
A     TCP acknowledgment#   th_ack     same th_ack as last          Δack = current − previous
W     TCP window            th_win     same th_win as last          Δwin = current − previous
U     TCP urgent offset     th_urg     URG flag not set             urgoff = urgent offset

C

If this bit is 0, this segment has the same connection ID as the previous compressed or uncompressed segment. If this flag is 1, connid is the connection ID, a value between 0 and 255.

I

If this bit is 0, the IP identification field has increased by 1 (the typical case). If this bit is 1, Δipid is the current value of ip_id minus its previous value.

P

This bit is a copy of the PSH flag from the TCP segment. Since the PSH flag doesn’t follow any established pattern, it must be explicitly specified for each segment.

S

If this bit is 0, the TCP sequence number has not changed. If this bit is 1, Δseq is the current value of th_seq minus its previous value.

A

If this bit is 0, the TCP acknowledgment number has not changed (the typical case). If this bit is 1, Δack is the current value of th_ack minus its previous value.

W

If this bit is 0, the TCP window has not changed (the typical case). If this bit is 1, Δwin is the current value of th_win minus its previous value.

U

If this bit is 0, the URG flag in the segment is not set and the urgent offset has not changed from its previous value (the typical case). If this bit is 1, urgoff is the current value of th_urg and the URG flag is set. If the urgent offset changes without the URG flag being set, the segment is sent uncompressed. (This often occurs in the first segment following urgent data.)

The differences are encoded as the current value minus the previous value, because most of these differences will be small positive numbers (with Δwin being an exception) given the way these fields normally change.

We note that five of the optional fields in Figure 29.34 are encoded in 0, 1, or 3 bytes.

0 bytes:

If the corresponding flag is not set, nothing is encoded for the field.

1 byte:

If the value to send is between 1 and 255, a single byte encodes the value.

3 bytes:

If the value to send is either 0 or between 256 and 65535, 3 bytes encode the value: the first byte is 0, followed by the 2-byte value. This always works for the three 16-bit values, urgoff, Δwin, and Δipid; but if the difference to encode for the two 32-bit values, Δack and Δseq, is less than 0 or greater than 65535, the segment is sent uncompressed.
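The 1-byte and 3-byte encodings can be sketched as a pair of functions. This is an illustration rather than the RFC 1144 source: the function names are our own, and we assume the 2-byte value is sent most significant byte first. The third function shows the check for the two 32-bit fields; because the subtraction is done in unsigned arithmetic, a negative difference wraps to a large value and is rejected by the same comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes a field whose flag bit is set. Returns the number of bytes
 * written to buf (1 or 3). The 0-byte case is handled by the caller
 * simply not setting the flag bit. */
size_t encode_delta(uint16_t value, uint8_t *buf)
{
    if (value >= 1 && value <= 255) {
        buf[0] = (uint8_t)value;
        return 1;
    }
    /* 0 or 256..65535: a zero byte followed by the 2-byte value
     * (byte order assumed to be most significant byte first). */
    buf[0] = 0;
    buf[1] = (uint8_t)(value >> 8);
    buf[2] = (uint8_t)(value & 0xff);
    return 3;
}

/* Decodes one field. Returns the number of bytes consumed (1 or 3). */
size_t decode_delta(const uint8_t *buf, uint16_t *value)
{
    if (buf[0] != 0) {
        *value = buf[0];
        return 1;
    }
    *value = (uint16_t)((buf[1] << 8) | buf[2]);
    return 3;
}

/* For the 32-bit sequence and acknowledgment numbers: true if the
 * difference can be encoded, false if the segment must be sent
 * uncompressed. */
bool delta_fits(uint32_t current, uint32_t previous)
{
    uint32_t diff = current - previous;   /* wraps if current < previous */
    return diff <= 65535;
}
```

A difference of 256, for example, is encoded as the 3 bytes 0, 1, 0, which is why a full-sized segment on a link with an MSS of 256 costs 3 bytes for that field.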

If we compare the nonshaded fields in Figure 29.33 with the possible fields in Figure 29.34 we notice that some fields are never transmitted.

  • The IP total length field is not transmitted since most link layers provide the length of a received message to the receiver.

  • Since the only field in the IP header that is being transmitted is the identification field, the IP checksum is also omitted. This is a hop-by-hop checksum that protects only the IP header across any given link.

Special Cases

Two common cases are detected and transmitted as special combinations of the 4 low-order bits: SAWU. Since urgent data is rare, if the URG flag in the segment is set and both the sequence number and window also change (implying that the 4 low-order bits would be 1011 or 1111), the segment is sent uncompressed. Therefore if the 4 low-order bits are sent as 1011 (called *SA) or 1111 (called *S), the following two special cases apply:

*SA

The sequence number and acknowledgment number both increase by the amount of data in the last segment, the window and urgent offset don’t change, and the URG flag is not set. This special case avoids encoding both Δseq and Δack.

This case occurs frequently for both directions of echoed terminal traffic. Figures 19.3 and 19.4 of Volume 1 give examples of this type of data flow across an Rlogin connection.

*S

The sequence number changes by the amount of data in the last segment, the acknowledgment number, window, and urgent offset don’t change, and the URG flag is not set. This special case avoids encoding Δseq.

This case occurs frequently for the sending side of a unidirectional data transfer (e.g., FTP). Figures 20.1, 20.2, and 20.3 of Volume 1 give examples of this type of data transfer. This case also occurs for the sender of nonechoed terminal traffic (e.g., commands that are not echoed by a full-screen editor).
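The two special cases, and the reason the largest compressed header is 16 bytes rather than 19, can be sketched as follows. The bit values follow the layout of the first byte described earlier (high-order bit set, then C, I, P, S, A, W, U); the macro and function names are our own and should be read as assumptions, not as the names used in the source code.

```c
#include <stdint.h>

/* Flag bits in the first byte of a COMPRESSED_TCP datagram. */
#define NEW_C 0x40
#define NEW_I 0x20
#define PUSH_BIT 0x10
#define NEW_S 0x08
#define NEW_A 0x04
#define NEW_W 0x02
#define NEW_U 0x01

#define SAWU_MASK  (NEW_S | NEW_A | NEW_W | NEW_U)
#define SPECIAL_SA (NEW_S | NEW_W | NEW_U)            /* 1011 */
#define SPECIAL_S  (NEW_S | NEW_A | NEW_W | NEW_U)    /* 1111 */

enum special_case { CASE_NONE, CASE_SA, CASE_S };

enum special_case special_case_of(uint8_t flags)
{
    switch (flags & SAWU_MASK) {
    case SPECIAL_SA: return CASE_SA;
    case SPECIAL_S:  return CASE_S;
    default:         return CASE_NONE;
    }
}

/* Receiver side: rebuild seq and ack for a special case, given the
 * amount of data in the previous segment on this connection. */
void apply_special(enum special_case c, uint32_t prev_data_len,
                   uint32_t *seq, uint32_t *ack)
{
    switch (c) {
    case CASE_SA:
        *seq += prev_data_len;
        *ack += prev_data_len;
        break;
    case CASE_S:
        *seq += prev_data_len;
        break;
    default:
        break;
    }
}

/* Largest possible header for a given flag byte, assuming every field
 * that is present needs its 3-byte encoding. */
unsigned worst_case_size(uint8_t flags)
{
    unsigned size = 1 + 2;              /* flag byte + TCP checksum */

    if (flags & NEW_C) size += 1;       /* connid is always 1 byte */
    if (flags & NEW_I) size += 3;
    if (special_case_of(flags) == CASE_NONE) {
        if (flags & NEW_S) size += 3;
        if (flags & NEW_A) size += 3;
        if (flags & NEW_W) size += 3;
        if (flags & NEW_U) size += 3;
    }
    return size;
}
```

Evaluating worst_case_size over all 128 possible flag bytes gives a maximum of 16, reached when C, I, and any three of S, A, W, and U are set without forming one of the two special combinations.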

Examples

Two simple examples were run across the SLIP link between the systems bsdi and slip in Figure 1.17. This SLIP link uses header compression in both directions. The tcpdump program described in Appendix A of Volume 1 was also run on the host bsdi to save a copy of all the frames. This program has an option that outputs the compressed header, showing all the fields in Figure 29.34.

Two traces were obtained: a short portion of an Rlogin connection and a file transfer from bsdi to slip using FTP. Figure 29.36 shows a summary of the different frame types for both connections.

Table 29.36. Counts of different frame types for Rlogin and FTP connections.

frame type               Rlogin            FTP
                         input   output    input   output
IP                           1        1        5        5
UNCOMPRESSED_TCP             3        2        2        3
COMPRESSED_TCP
  • *SA special case        75       75        0        0
  • *S special case         25        1        1      325
  • nonspecial               9       93      337       13
Total                      113      172      345      346

The two entries of 75 verify our claim that this special case often occurs for both directions of echoed terminal traffic. The entry of 325 verifies our claim that this special case occurs frequently for the sending side of a unidirectional data transfer.

The 10 frames of type IP for the FTP example correspond to four segments with the SYN flag set and six segments with the FIN flag set. FTP uses two connections: one for the interactive commands and one for the file transfer.

The UNCOMPRESSED_TCP frame types normally correspond to the first segment following connection establishment, the one that establishes the connection ID. An additional few are seen in these examples when the type of service is set (the Net/3 Rlogin and FTP clients and servers all set the TOS field after the connection is established).

Figure 29.37 shows the distribution of the compressed-header sizes. The average size of the compressed header for the final four columns in Figure 29.37 is 3.1, 4.1, 6.0, and 3.3 bytes, a significant savings compared to the uncompressed 40-byte headers, especially for the interactive connection.

Table 29.37. Distribution of compressed-header sizes.

#bytes    Rlogin            FTP
          input   output    input   output
3           102       44        2      250
4                     94                78
5             7       12        5        2
6                      6      325        5
7                     13        2        1
8                                        1
9                               4        1
Total       109      169      338      338

Almost all of the 325 6-byte headers in the FTP input column contain only a Δack of 256, which being greater than 255 is encoded in 3 bytes. The SLIP MTU is 296, so TCP uses an MSS of 256. Almost all of the 250 3-byte headers in the FTP output column contain the *S special case (sequence number change only) with a change of 256 bytes. But since this change refers to the amount of data in the previous segment, nothing is transmitted other than the flag byte and the TCP checksum. The 78 4-byte headers in the FTP output column are this same special case, but with a change in the IP identification field also (Exercise 29.8).
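The averages quoted for Figure 29.37 and the MSS derived from the SLIP MTU can be checked with a short calculation. The sketch below simply weights each header size by its count; the structure size_count and the function names are our own, and the arrays repeat the nonempty cells of Figure 29.37.

```c
#include <stddef.h>

struct size_count {
    unsigned bytes;   /* compressed-header size */
    unsigned count;   /* number of headers of that size */
};

/* The four columns of Figure 29.37, omitting empty cells. */
static const struct size_count rlogin_input[]  = { {3, 102}, {5, 7} };
static const struct size_count rlogin_output[] = { {3, 44}, {4, 94}, {5, 12},
                                                   {6, 6}, {7, 13} };
static const struct size_count ftp_input[]     = { {3, 2}, {5, 5}, {6, 325},
                                                   {7, 2}, {9, 4} };
static const struct size_count ftp_output[]    = { {3, 250}, {4, 78}, {5, 2},
                                                   {6, 5}, {7, 1}, {8, 1},
                                                   {9, 1} };

/* Weighted average header size for one column. */
double average_header_size(const struct size_count *rows, size_t n)
{
    unsigned long total_bytes = 0, total_headers = 0;

    for (size_t i = 0; i < n; i++) {
        total_bytes += (unsigned long)rows[i].bytes * rows[i].count;
        total_headers += rows[i].count;
    }
    return total_headers ? (double)total_bytes / (double)total_headers : 0.0;
}

/* MSS for a link MTU, allowing 40 bytes for the IP and TCP headers. */
unsigned mss_for_mtu(unsigned mtu)
{
    return mtu - 40;
}
```

For the FTP input column, for example, the total is 2031 bytes over 338 headers, or about 6.0 bytes per header, and mss_for_mtu(296) gives the MSS of 256 that makes Δack need its 3-byte encoding.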

Configuration

Header compression must be enabled on a given SLIP or PPP link. With a SLIP link there are normally two flags that can be set when the interface is configured: enable header compression and autoenable header compression. These two flags are set using the link0 and link2 flags to the ifconfig command, respectively. Normally a client (the dialin host) decides whether to use header compression or not. The server (the host or terminal server to which the client dials in) specifies the autoenable flag only. If header compression is enabled by the client, its TCP will send a datagram of type UNCOMPRESSED_TCP to specify the connection ID. When the server sees this packet it enables header compression (since it was in the autoenable mode). If the server never sees this type of packet, it never enables header compression for this line.

PPP allows the negotiation of options between the two ends of the link when the link is established. One of the options that can be negotiated is whether to use header compression or not.

Summary

This chapter completes our detailed look at TCP input processing. We started with the processing of an ACK in the SYN_RCVD state, which completes a passive open, a simultaneous open, or a self-connect.

The fast retransmit algorithm lets TCP detect a dropped segment after receiving a specified number of consecutive duplicate ACKs and retransmit the segment before the retransmission timer expires. Net/3 combines the fast retransmit algorithm with the fast recovery algorithm, which tries to keep the data flowing from the sender to the receiver, albeit at a slower rate, using congestion avoidance but not slow start.

ACK processing then discards the acknowledged data from the socket’s send buffer and handles a few TCP states specially, when the receipt of an ACK changes the connection state.

The URG flag is processed, if set, and TCP’s urgent mode is mapped into the socket abstraction of out-of-band data. This is complicated because the process can receive the out-of-band byte inline or in a special out-of-band buffer, and TCP can receive urgent notification before the data byte referenced by the urgent pointer has been received.

TCP input processing completes by calling TCP_REASS to merge the received data into either the socket’s receive buffer or the socket’s out-of-order queue, processing the FIN flag, and calling tcp_output if a segment must be generated in response to the received segment.

TCP header compression is a technique used on SLIP and PPP links to reduce the size of the IP and TCP headers from the normal 40 bytes to around 3-6 bytes (typically). This is done by recognizing that most fields in these headers don’t change from one segment to the next on a given connection, and the fields that do change often change by a small amount. This allows a flag byte to be sent indicating which fields have changed, and the changes are encoded as differences from the previous segment.

Exercises

29.1

A client connects to a server and no segments are lost. Which process, the client or server, completes its open of the connection first?

29.1

Assume a 2-second RTT. The server has a passive open pending and the client issues its active open at time 0. The server receives the SYN at time 1 and responds with its own SYN and an ACK of the client’s SYN. The client receives this segment at time 2, and the code in Figure 28.20 completes the active open with the call to soisconnected (waking up the client process) and an ACK will be sent back to the server. The server receives the ACK at time 3, and the code in Figure 29.2 completes the server’s passive open, returning control to the server process. In general, the client process receives control about one-half RTT before the server.

29.2

A Net/3 system receives a SYN for a listening socket and the SYN segment also contains 50 bytes of data. What happens?

29.2

Assume the sequence number of the SYN is 1000 and the 50 bytes of data are numbered 1001–1050. When the SYN is processed by tcp_input, first the case starting in Figure 28.15 is executed, which sets rcv_nxt to 1001, and then a jump is made to step6. Figure 29.22 calls tcp_reass and the data is placed onto the socket’s reassembly queue. But the data cannot be appended to the socket’s receive buffer yet (Figure 27.23) so rcv_nxt is left at 1001. When tcp_output is called to generate the immediate ACK, rcv_nxt (1001) is sent as the acknowledgment field. In summary, the SYN is acknowledged, but not the 50 bytes of data. Since the client will retransmit the 50 bytes of data, there is no advantage in sending data with a SYN generated by an active open.

29.3

Continue the previous exercise assuming that the client does not retransmit the 50 bytes of data; instead the client responds with a segment that acknowledges the server’s SYN/ACK and contains a FIN. What happens?

29.3

The server’s socket is in the SYN_RCVD state when the client’s ACK/FIN arrives, so tcp_input ends up processing the ACK in Figure 29.2. The connection moves to the ESTABLISHED state and tcp_reass appends the already-queued data to the socket’s receive buffer. rcv_nxt is incremented to 1051. tcp_input continues and the FIN is handled in Figure 29.24 where the TF_ACKNOW flag is set and rcv_nxt becomes 1052. socantrcvmore sets the socket’s state so that after the server reads the 50 bytes of data, the server will receive an end-of-file. The server’s socket also moves to the CLOSE_WAIT state. tcp_output will be called to ACK the client’s FIN (since rcv_nxt equals 1052). Assuming the server process closes its socket when it reads the end-of-file, the server will then send a FIN for the client to ACK.

In this example, six segments and three round trips are required to pass the 50 bytes from the client to the server. Reducing the number of segments requires the TCP extensions for transactions [Braden 1994].

29.4

A Net/3 client performs an active open to a listening server. The server’s response to the client’s SYN is a segment with the expected SYN/ACK, but the segment also contains 50 bytes of data and the FIN flag. List the processing steps for the client’s TCP.

29.4

The client’s socket is in the SYN_SENT state when the server’s response is received. Figure 28.20 processes the segment and moves the connection to the ESTABLISHED state. A jump is made to step6 and the data is processed in Figure 29.22. TCP_REASS appends the data to the socket’s receive buffer and rcv_nxt is incremented to acknowledge the data. The FIN is then processed in Figure 29.24, incrementing rcv_nxt again and moving the connection to the CLOSE_WAIT state. When tcp_output is called, the acknowledgment field ACKs the SYN, the 50 bytes of data, and the FIN. The client process then reads the 50 bytes of data, followed by the end-of-file, and then probably closes its socket. This moves the connection to the LAST_ACK state and causes a FIN to be sent by the client, which the server should acknowledge.

29.5

Figure 18.19 in Volume 1 and Figure 14 in RFC 793 both show four segments exchanged during a simultaneous close. But if we trace a simultaneous close between two Net/3 systems, or if we watch the close sequence following a self-connect on a Net/3 system, we see six segments, not four. The extra two segments are a retransmission of the FIN by each end when the other’s FIN is received. Where is the bug and what is the fix?

29.5

The bug is in the entry tcp_outflags[TCPS_CLOSING] shown in Figure 24.16. It specifies the TH_FIN flag, whereas the state transition diagram (Figure 24.15) doesn’t specify that the FIN should be retransmitted. To fix this, remove TH_FIN from the tcp_outflags entry for this state. The bug is relatively harmless—it just causes two extra segments to be exchanged—and a simultaneous close or a close following a self-connect is rare.

29.6

Page 72 of RFC 793 says that when data in the send buffer is acknowledged by the other end “Users should receive positive acknowledgments for buffers which have been sent and fully acknowledged (i.e., send buffer should be returned with ‘ok’ response).” Does Net/3 provide this notification?

29.6

No. An OK return from a write system call only means the data has been copied into the socket buffer. Net/3 does not notify the process when that data is acknowledged by the other end. An application-level acknowledgment is required to obtain this information.

29.7

What effect do the options defined in RFC 1323 have on TCP header compression?

29.7

RFC 1323 timestamps defeat header compression because whenever the timestamps change, the TCP options change, and the segment is sent uncompressed. The window scale option has no effect because the value in the TCP header is still a 16-bit value.

29.8

What effect does the Net/3 assignment of the IP identification field have on TCP header compression?

29.8

IP assigns the ID field from a global variable that is incremented each time any IP datagram is sent. This increases the probability that two consecutive TCP segments sent on the same connection will have ID values that differ by more than 1. A difference other than 1 causes the Δipid field in Figure 29.34 to be transmitted, increasing the size of the compressed header. A better scheme would be for TCP to maintain its own counter for assigning IDs.
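The effect can be illustrated by counting, for a sequence of IP identification values seen on one connection, how many segments would need an explicit Δipid. This is only a sketch of the rule stated for the I bit; the function name count_explicit_ipid is our own, and the first segment is not counted because there is no previous value to compare it with.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts the segments (after the first) whose IP identification did not
 * increase by exactly 1, and which would therefore carry a Δipid field. */
size_t count_explicit_ipid(const uint16_t *ids, size_t n)
{
    size_t count = 0;

    for (size_t i = 1; i < n; i++) {
        /* 16-bit arithmetic so that 65535 -> 0 counts as an increase of 1 */
        uint16_t delta = (uint16_t)(ids[i] - ids[i - 1]);
        if (delta != 1)
            count++;
    }
    return count;
}
```

With a global counter shared by all traffic, one connection might see the values 10, 11, 13, 14, 17, giving two segments that need Δipid. With a per-connection counter the values would be consecutive and none would.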
