This chapter continues the discussion of TCP input processing, picking up where the previous chapter left off. Recall that the final test in Figure 28.37 was that either the ACK flag was set or, if not, the segment was dropped.
The ACK flag is handled, the window information is updated, the URG flag is processed, and any data in the segment is processed. Finally the FIN flag is processed and tcp_output is called, if required.
We begin this chapter with ACK processing, a summary of which is shown in Figure 29.1. The SYN_RCVD state is handled specially, followed by common processing for all remaining states. (Remember that a received ACK in either the LISTEN or SYN_SENT state was discussed in the previous chapter.) This is followed by special processing for the three states in which a received ACK causes a state transition, and for the TIME_WAIT state, in which the receipt of an ACK causes the 2MSL timer to be restarted.
Table 29.1. Summary of ACK processing.
-----------------------------------------------------------------------------------
switch (tp->t_state) {

case TCPS_SYN_RECEIVED:
    complete processing of passive open and process
        simultaneous open or self-connect;
    /* fall into ... */

case TCPS_ESTABLISHED:
case TCPS_FIN_WAIT_1:
case TCPS_FIN_WAIT_2:
case TCPS_CLOSE_WAIT:
case TCPS_CLOSING:
case TCPS_LAST_ACK:
case TCPS_TIME_WAIT:
    process duplicate ACK;
    update RTT estimators;
    if all outstanding data ACKed, turn off retransmission timer;
    remove ACKed data from socket send buffer;

    switch (tp->t_state) {

    case TCPS_FIN_WAIT_1:
        if (FIN is ACKed) {
            move to FIN_WAIT_2 state;
            start FIN_WAIT_2 timer;
        }
        break;

    case TCPS_CLOSING:
        if (FIN is ACKed) {
            move to TIME_WAIT state;
            start TIME_WAIT timer;
        }
        break;

    case TCPS_LAST_ACK:
        if (FIN is ACKed)
            move to CLOSED state;
        break;

    case TCPS_TIME_WAIT:
        restart TIME_WAIT timer;
        goto dropafterack;
    }
}
-----------------------------------------------------------------------------------
The first part of the ACK processing, shown in Figure 29.2, handles the SYN_RCVD state. As mentioned in the previous chapter, this handles the completion of a passive open (the common case) and also handles simultaneous opens and self-connects (the infrequent case).
Table 29.2. tcp_input function: received ACK in SYN_RCVD state.
----------------------------------------------------------------------- tcp_input.c
791     /*
792      * Ack processing.
793      */
794     switch (tp->t_state) {
795         /*
796          * In SYN_RECEIVED state if the ack ACKs our SYN then enter
797          * ESTABLISHED state and continue processing, otherwise
798          * send an RST.
799          */
800     case TCPS_SYN_RECEIVED:
801         if (SEQ_GT(tp->snd_una, ti->ti_ack) ||
802             SEQ_GT(ti->ti_ack, tp->snd_max))
803             goto dropwithreset;
804         tcpstat.tcps_connects++;
805         soisconnected(so);
806         tp->t_state = TCPS_ESTABLISHED;
807         /* Do window scaling? */
808         if ((tp->t_flags & (TF_RCVD_SCALE | TF_REQ_SCALE)) ==
809             (TF_RCVD_SCALE | TF_REQ_SCALE)) {
810             tp->snd_scale = tp->requested_s_scale;
811             tp->rcv_scale = tp->request_r_scale;
812         }
813         (void) tcp_reass(tp, (struct tcpiphdr *) 0, (struct mbuf *) 0);
814         tp->snd_wl1 = ti->ti_seq - 1;
815         /* fall into ... */
----------------------------------------------------------------------- tcp_input.c
801-806
For the ACK to acknowledge the SYN that was sent, it must be greater than snd_una (which is set to the ISS for the connection, the sequence number of the SYN, by tcp_sendseqinit) and less than or equal to snd_max. If so, the socket is marked as connected and the state becomes ESTABLISHED.
Since soisconnected wakes up the process that performed the passive open (normally a server), we see that this doesn’t occur until the last of the three segments in the three-way handshake has been received. If the server is blocked in a call to accept, that call now returns; if the server is blocked in a call to select waiting for the listening descriptor to become readable, it is now readable.
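The range test at lines 801-803 depends on comparisons that stay correct when sequence numbers wrap around. The following standalone sketch illustrates the idea; it is not the kernel code. The helper seq_gt imitates the SEQ_GT macro, and the function name syn_rcvd_ack_ok is hypothetical.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Modular comparison in the style of SEQ_GT: true if a is "after" b,
 * even when the 32-bit sequence space has wrapped. */
int seq_gt(tcp_seq a, tcp_seq b)
{
    return (int32_t)(a - b) > 0;
}

/* Hypothetical helper mirroring the test at lines 801-803.
 * Returns 1 if the ACK passes the test, 0 if the code would
 * jump to dropwithreset. */
int syn_rcvd_ack_ok(tcp_seq snd_una, tcp_seq snd_max, tcp_seq ack)
{
    if (seq_gt(snd_una, ack) || seq_gt(ack, snd_max))
        return 0;
    return 1;
}
```

With snd_una equal to the ISS and snd_max equal to ISS plus 1, an acknowledgment field of ISS plus 1 passes, while a value beyond snd_max fails.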
807-812
If TCP sent a window scale option and received one, the send and receive scale factors are saved in the TCP control block. Otherwise the default values of snd_scale and rcv_scale in the TCP control block are 0 (no scaling).
813
Any data queued for the connection can now be passed to the process. This is done by tcp_reass with a null pointer as the second argument. This data would have arrived with the SYN that moved the connection into the SYN_RCVD state.
814
snd_wl1 is set to the received sequence number minus 1. We’ll see in Figure 29.15 that this causes the three window update variables to be updated.
The next part of ACK processing, shown in Figure 29.3, handles duplicate ACKs and determines if TCP’s fast retransmit and fast recovery algorithms [Jacobson 1990c] should come into play. The two algorithms are separate but are normally implemented together [Floyd 1994].
Table 29.3. tcp_input function: check for completely duplicate ACK.
----------------------------------------------------------------------- tcp_input.c
816         /*
817          * In ESTABLISHED state: drop duplicate ACKs; ACK out-of-range
818          * ACKs. If the ack is in the range
819          *  tp->snd_una < ti->ti_ack <= tp->snd_max
820          * then advance tp->snd_una to ti->ti_ack and drop
821          * data from the retransmission queue. If this ACK reflects
822          * more up-to-date window information we update our window information.
823          */
824     case TCPS_ESTABLISHED:
825     case TCPS_FIN_WAIT_1:
826     case TCPS_FIN_WAIT_2:
827     case TCPS_CLOSE_WAIT:
828     case TCPS_CLOSING:
829     case TCPS_LAST_ACK:
830     case TCPS_TIME_WAIT:
831         if (SEQ_LEQ(ti->ti_ack, tp->snd_una)) {
832             if (ti->ti_len == 0 && tiwin == tp->snd_wnd) {
833                 tcpstat.tcps_rcvdupack++;
834                 /*
835                  * If we have outstanding data (other than
836                  * a window probe), this is a completely
837                  * duplicate ack (ie, window info didn't
838                  * change), the ack is the biggest we've
839                  * seen and we've seen exactly our rexmt
840                  * threshold of them, assume a packet
841                  * has been dropped and retransmit it.
842                  * Kludge snd_nxt & the congestion
843                  * window so we send only this one
844                  * packet.
845                  *
846                  * We know we're losing at the current
847                  * window size so do congestion avoidance
848                  * (set ssthresh to half the current window
849                  * and pull our congestion window back to
850                  * the new ssthresh).
851                  *
852                  * Dup acks mean that packets have left the
853                  * network (they're now cached at the receiver)
854                  * so bump cwnd by the amount in the receiver
855                  * to keep a constant cwnd packets in the
856                  * network.
857                  */
----------------------------------------------------------------------- tcp_input.c
The fast retransmit algorithm occurs when TCP deduces from a small number (normally 3) of consecutive duplicate ACKs that a segment has been lost and deduces the starting sequence number of the missing segment. The missing segment is retransmitted. The algorithm is mentioned in Section 4.2.2.21 of RFC 1122, which states that TCP may generate an immediate ACK when an out-of-order segment is received. We saw that Net/3 generates the immediate duplicate ACKs in Figure 27.15. This algorithm first appeared in the 4.3BSD Tahoe release and the subsequent Net/1 release. In these two implementations, after the missing segment was retransmitted, the slow start phase was entered.
The fast recovery algorithm says that after the fast retransmit algorithm (that is, after the missing segment has been retransmitted), congestion avoidance but not slow start is performed. This is an improvement that allows higher throughput under moderate congestion, especially for large windows. This algorithm appeared in the 4.3BSD Reno release and the subsequent Net/2 release.
Net/3 implements both fast retransmit and fast recovery, as we describe shortly.
In the discussion of Figure 24.17 we noted that an acceptable ACK must be in the range
snd_una < acknowledgment field <= snd_max
This first test of the acknowledgment field compares it only to snd_una. The comparison against snd_max is in Figure 29.5. The reason for separating the tests is so that the following five tests can be applied to the received segment:
1. If the acknowledgment field is less than or equal to snd_una, and
2. the length of the received segment is 0, and
3. the advertised window (tiwin) has not changed, and
4. TCP has outstanding data that has not been acknowledged (the retransmission timer is nonzero), and
5. the received segment contains the biggest ACK TCP has seen (the acknowledgment field equals snd_una),
then this segment is a completely duplicate ACK. (Tests 1, 2, and 3 are in Figure 29.3; tests 4 and 5 are at the beginning of Figure 29.4.)
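The five tests can be collected into a single classification routine. The sketch below is only an illustration of the logic described above, not the Net/3 code: the enum ack_kind, the struct ack_input, and the function classify_ack are hypothetical names, and the retransmission timer is reduced to a plain integer that is nonzero when the timer is running.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Modular "less than or equal", in the style of SEQ_LEQ. */
int seq_leq(tcp_seq a, tcp_seq b)
{
    return (int32_t)(a - b) <= 0;
}

enum ack_kind {
    ACK_NEW,          /* test 1 fails: the ACK is beyond snd_una */
    ACK_OLD_OR_DATA,  /* only test 1 holds: t_dupacks is reset */
    ACK_DUP_COUNTED,  /* tests 1-3 hold, 4 or 5 fails: counted, t_dupacks reset */
    ACK_DUP_COMPLETE  /* all five tests hold: t_dupacks is incremented */
};

/* Hypothetical grouping of the values the tests examine. */
struct ack_input {
    tcp_seq  ack;         /* acknowledgment field of the segment */
    tcp_seq  snd_una;     /* oldest unacknowledged sequence number */
    int      len;         /* data length of the segment */
    uint32_t tiwin;       /* advertised window in the segment */
    uint32_t snd_wnd;     /* current send window */
    int      rexmt_timer; /* nonzero if the retransmission timer is running */
};

enum ack_kind classify_ack(const struct ack_input *in)
{
    if (!seq_leq(in->ack, in->snd_una))               /* test 1 */
        return ACK_NEW;
    if (in->len != 0 || in->tiwin != in->snd_wnd)     /* tests 2 and 3 */
        return ACK_OLD_OR_DATA;
    if (in->rexmt_timer == 0 || in->ack != in->snd_una) /* tests 4 and 5 */
        return ACK_DUP_COUNTED;
    return ACK_DUP_COMPLETE;
}
```

Only the last result advances the count of consecutive duplicate ACKs; the two middle results reset it.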
Table 29.4. tcp_input function: duplicate ACK processing.
----------------------------------------------------------------------- tcp_input.c
858                 if (tp->t_timer[TCPT_REXMT] == 0 ||
859                     ti->ti_ack != tp->snd_una)
860                     tp->t_dupacks = 0;
861                 else if (++tp->t_dupacks == tcprexmtthresh) {
862                     tcp_seq onxt = tp->snd_nxt;
863                     u_int   win =
864                         min(tp->snd_wnd, tp->snd_cwnd) / 2 /
865                         tp->t_maxseg;
866                     if (win < 2)
867                         win = 2;
868                     tp->snd_ssthresh = win * tp->t_maxseg;
869                     tp->t_timer[TCPT_REXMT] = 0;
870                     tp->t_rtt = 0;
871                     tp->snd_nxt = ti->ti_ack;
872                     tp->snd_cwnd = tp->t_maxseg;
873                     (void) tcp_output(tp);
874                     tp->snd_cwnd = tp->snd_ssthresh +
875                         tp->t_maxseg * tp->t_dupacks;
876                     if (SEQ_GT(onxt, tp->snd_nxt))
877                         tp->snd_nxt = onxt;
878                     goto drop;
879                 } else if (tp->t_dupacks > tcprexmtthresh) {
880                     tp->snd_cwnd += tp->t_maxseg;
881                     (void) tcp_output(tp);
882                     goto drop;
883                 }
884             } else
885                 tp->t_dupacks = 0;
886             break;      /* beyond ACK processing (to step 6) */
887         }
----------------------------------------------------------------------- tcp_input.c
TCP counts the number of these duplicate ACKs that are received in a row (in the variable t_dupacks), and when the number reaches a threshold of 3 (tcprexmtthresh), the lost segment is retransmitted. This is the fast retransmit algorithm described in Section 21.7 of Volume 1. It works in conjunction with the code we saw in Figure 27.15: when TCP receives an out-of-order segment, it is required to generate an immediate duplicate ACK, telling the other end that a segment might have been lost and telling it the value of the next expected sequence number. The goal of the fast retransmit algorithm is for TCP to retransmit immediately what appears to be the missing segment, instead of waiting for the retransmission timer to expire. Figure 21.7 of Volume 1 gives a detailed example of how this algorithm works.
The receipt of a duplicate ACK also tells TCP that a packet has “left the network,” because the other end had to receive an out-of-order segment to send the duplicate ACK. The fast recovery algorithm says that after some number of consecutive duplicate ACKs have been received, TCP should perform congestion avoidance (i.e., slow down) but need not wait for the pipe to empty between the two connection end points (slow start). The expression “a packet has left the network” means a packet has been received by the other end and has been added to the out-of-order queue for the connection. The packet is not still in transit somewhere between the two end points.
If only the first three tests shown earlier are true, the ACK is still a duplicate and is counted by the statistic tcps_rcvdupack, but the counter of the number of consecutive duplicate ACKs for this connection (t_dupacks) is reset to 0. If only the first test is true, the counter t_dupacks is reset to 0.
The remainder of the fast recovery algorithm is shown in Figure 29.4. When all five tests are true, the fast recovery algorithm processes the segment depending on the number of these consecutive duplicate ACKs that have been received.
1. t_dupacks equals 3 (tcprexmtthresh). Congestion avoidance is performed and the missing segment is retransmitted.
2. t_dupacks exceeds 3. Increase the congestion window and perform normal TCP output.
3. t_dupacks is less than 3. Do nothing.
861-868
When t_dupacks reaches 3 (tcprexmtthresh), the value of snd_nxt is saved in onxt and the slow start threshold (ssthresh) is set to one-half the current window (the smaller of the send window and the congestion window), with a minimum value of two segments. This is what was done with the slow start threshold when the retransmission timer expired in Figure 25.27, but we’ll see later in this piece of code that the fast recovery algorithm does not set the congestion window to one segment, as was done with the timeout.
869-870
The retransmission timer is turned off and, in case a segment is currently being timed, t_rtt is set to 0.
871-873
snd_nxt is set to the starting sequence number of the segment that appears to have been lost (the acknowledgment field of the duplicate ACK) and the congestion window is set to one segment. This causes tcp_output to send only the missing segment. (This is shown by segment 63 in Figure 21.7 of Volume 1.)
874-875
The congestion window is set to the slow start threshold plus the number of segments that the other end has cached. By cached we mean the number of out-of-order segments that the other end has received and generated duplicate ACKs for. These cannot be passed to the process at the other end until the missing segment (which was just sent) is received. Figures 21.10 and 21.11 in Volume 1 show what happens with the congestion window and slow start threshold when the fast recovery algorithm is in effect.
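To make the arithmetic at lines 863-868 and 874-875 concrete, here is a small standalone sketch. It assumes a simplified state structure (struct cc_state) and a hypothetical function name (on_third_dupack); the retransmission itself, during which the congestion window is temporarily one segment, is represented only by a comment.

```c
#include <stdint.h>

/* Simplified stand-in for the fields of the TCP control block used here. */
struct cc_state {
    uint32_t snd_wnd;      /* send window, in bytes */
    uint32_t snd_cwnd;     /* congestion window, in bytes */
    uint32_t snd_ssthresh; /* slow start threshold, in bytes */
    uint32_t t_maxseg;     /* segment size, in bytes */
    int      t_dupacks;    /* consecutive duplicate ACKs (3 on entry) */
};

uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Hypothetical helper: window adjustments made when the third
 * consecutive duplicate ACK arrives. */
void on_third_dupack(struct cc_state *s)
{
    /* Half the current window, expressed in whole segments. */
    uint32_t win = min_u32(s->snd_wnd, s->snd_cwnd) / 2 / s->t_maxseg;

    if (win < 2)
        win = 2;                       /* minimum of two segments */
    s->snd_ssthresh = win * s->t_maxseg;

    /* (The missing segment would be retransmitted here with the
     * congestion window set to one segment.) */

    /* ssthresh plus one segment per duplicate ACK received. */
    s->snd_cwnd = s->snd_ssthresh + s->t_maxseg * (uint32_t)s->t_dupacks;
}
```

For example, with a send window of 8192, a congestion window of 4096, and 512-byte segments, the threshold becomes 2048 and the congestion window becomes 2048 + 3 × 512 = 3584.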
876-878
The value of the next sequence number to send is set to the maximum of its previous value (onxt) and its current value. Its current value was modified by tcp_output when the segment was retransmitted. Normally this causes snd_nxt to be set back to its previous value, which means that only the missing segment is retransmitted, and that future calls to tcp_output carry on with the next segment in sequence.
879-883
The missing segment was retransmitted when t_dupacks equaled 3, so the receipt of each additional duplicate ACK means that another packet has left the network. The congestion window is incremented by one segment. tcp_output sends the next segment in sequence, and the duplicate ACK is dropped. (This is shown by segments 67, 69, and 71 in Figure 21.7 of Volume 1.)
884-885
This statement is executed when the received segment contains a duplicate ACK, but either the length is nonzero or the advertised window changed. Only the first of the five tests described earlier is true. The counter of consecutive duplicate ACKs is set to 0.
886
This break is executed in three cases:
1. only the first of the five tests described earlier is true, or
2. only the first three of the five tests are true, or
3. the ACK is a duplicate, but the number of consecutive duplicates is less than the threshold of 3.
For any of these cases the ACK is still a duplicate and the break goes to the end of the switch that started in Figure 29.2, which continues processing at the label step6.
To understand the purpose in this aggressive window manipulation, consider the following example. Assume the window is eight segments, and segments 1 through 8 are sent. Segment 1 is lost, but the remainder arrive OK and are acknowledged. After the ACKs for segments 2, 3, and 4 arrive, the missing segment (1) is retransmitted. TCP would like the subsequent ACKs for 5 through 8 to allow some of the segments starting with 9 to be sent, to keep the pipe full. But the window is 8, which prevents segments 9 and above from being sent. Therefore, the congestion window is temporarily inflated by one segment each time another duplicate ACK is received, since the receipt of the duplicate ACK tells TCP that another segment has left the pipe at the other end. When the acknowledgment of segment 1 is finally received, the next figure reduces the congestion window back to the slow start threshold. This increase in the congestion window as the duplicate ACKs arrive, and its subsequent decrease when the fresh ACK arrives, can be seen visually in Figure 21.10 of Volume 1.
The ACK processing continues with Figure 29.5.
Table 29.5. tcp_input function: ACK processing continued.
----------------------------------------------------------------------- tcp_input.c
888         /*
889          * If the congestion window was inflated to account
890          * for the other side's cached packets, retract it.
891          */
892         if (tp->t_dupacks > tcprexmtthresh &&
893             tp->snd_cwnd > tp->snd_ssthresh)
894             tp->snd_cwnd = tp->snd_ssthresh;
895         tp->t_dupacks = 0;
896         if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
897             tcpstat.tcps_rcvacktoomuch++;
898             goto dropafterack;
899         }
900         acked = ti->ti_ack - tp->snd_una;
901         tcpstat.tcps_rcvackpack++;
902         tcpstat.tcps_rcvackbyte += acked;
----------------------------------------------------------------------- tcp_input.c
888-895
If the number of consecutive duplicate ACKs exceeds the threshold of 3, this is the first nonduplicate ACK after a string of four or more duplicate ACKs. The fast recovery algorithm is complete. Since the congestion window was incremented by one segment for every consecutive duplicate after the third, if it now exceeds the slow start threshold, it is set back to the slow start threshold. The counter of consecutive duplicate ACKs is set to 0.
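The inflation of the congestion window by later duplicate ACKs (lines 879-880) and its retraction by the first new ACK (lines 892-895) can be sketched together. This is an illustration under simplifying assumptions: struct fr_state and the two function names are hypothetical, REXMT_THRESH stands in for tcprexmtthresh, and no segments are actually sent.

```c
#include <stdint.h>

#define REXMT_THRESH 3   /* stands in for tcprexmtthresh */

/* Simplified stand-in for the control block fields involved. */
struct fr_state {
    uint32_t snd_cwnd;     /* congestion window, in bytes */
    uint32_t snd_ssthresh; /* slow start threshold, in bytes */
    uint32_t t_maxseg;     /* segment size, in bytes */
    int      t_dupacks;    /* consecutive duplicate ACKs */
};

/* A further duplicate ACK after the third: assumes t_dupacks is
 * already at least REXMT_THRESH on entry. */
void on_additional_dupack(struct fr_state *s)
{
    s->t_dupacks++;
    s->snd_cwnd += s->t_maxseg;   /* one more segment has left the network */
}

/* The first nonduplicate ACK: retract any inflation. */
void on_new_ack(struct fr_state *s)
{
    if (s->t_dupacks > REXMT_THRESH && s->snd_cwnd > s->snd_ssthresh)
        s->snd_cwnd = s->snd_ssthresh;
    s->t_dupacks = 0;
}
```

Starting from a congestion window of 3584 and a threshold of 2048 with 512-byte segments, four further duplicate ACKs raise the window to 5632, and the next new ACK sets it back to 2048.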
896-899
Recall the definition of an acceptable ACK,
snd_una < acknowledgment field <= snd_max
If the acknowledgment field is greater than snd_max, the other end is acknowledging data that TCP hasn’t even sent yet! This probably occurs on a high-speed connection when the sequence numbers wrap and a missing ACK reappears later. As we can see in Figure 24.5, this rarely happens (since today’s networks aren’t fast enough).
900-902
At this point TCP knows that it has an acceptable ACK. acked is the number of bytes acknowledged.
The next part of ACK processing, shown in Figure 29.6, deals with RTT measurements and the retransmission timer.
Table 29.6. tcp_input function: RTT measurements and retransmission timer.
----------------------------------------------------------------------- tcp_input.c
903         /*
904          * If we have a timestamp reply, update smoothed
905          * round-trip time. If no timestamp is present but
906          * transmit timer is running and timed sequence
907          * number was acked, update smoothed round-trip time.
908          * Since we now have an rtt measurement, cancel the
909          * timer backoff (cf., Phil Karn's retransmit alg.).
910          * Recompute the initial retransmit timer.
911          */
912         if (ts_present)
913             tcp_xmit_timer(tp, tcp_now - ts_ecr + 1);
914         else if (tp->t_rtt && SEQ_GT(ti->ti_ack, tp->t_rtseq))
915             tcp_xmit_timer(tp, tp->t_rtt);
916         /*
917          * If all outstanding data is acked, stop retransmit
918          * timer and remember to restart (more output or persist).
919          * If there is more data to be acked, restart retransmit
920          * timer, using current (possibly backed-off) value.
921          */
922         if (ti->ti_ack == tp->snd_max) {
923             tp->t_timer[TCPT_REXMT] = 0;
924             needoutput = 1;
925         } else if (tp->t_timer[TCPT_PERSIST] == 0)
926             tp->t_timer[TCPT_REXMT] = tp->t_rxtcur;
----------------------------------------------------------------------- tcp_input.c
903-915
If either (1) a timestamp option was present, or (2) a segment was being timed and the acknowledgment number is greater than the starting sequence number of the segment being timed, tcp_xmit_timer updates the RTT estimators. Notice that the second argument to this function when timestamps are used is the current time (tcp_now) minus the timestamp echo reply (ts_ecr) plus 1 (since the function subtracts 1).
Delayed ACKs are the reason for the greater-than test of the sequence numbers. For example, if TCP sends and times a segment with bytes 1–1024, followed by a segment with bytes 1025–2048, if an ACK of 2049 is returned, this test will consider whether 2049 is greater than 1 (the starting sequence number of the segment being timed), and since this is true, the RTT estimators are updated.
916-924
If the acknowledgment field of the received segment (ti_ack) equals the maximum sequence number that TCP has sent (snd_max), all outstanding data has been acknowledged. The retransmission timer is turned off and the needoutput flag is set to 1. This flag forces a call to tcp_output at the end of this function. Since there is no more data waiting to be acknowledged, TCP may have more data to send that it has not been able to send earlier because the data was beyond the right edge of the window. Now that a new ACK has been received, the window will probably move to the right (snd_una is updated in Figure 29.8), which could allow more data to be sent.
925-926
Since there is additional data that has been sent but not acknowledged, if the persist timer is not on, the retransmission timer is restarted using the current value of t_rxtcur.
Notice that timestamps overrule the portion of Karn’s algorithm (Section 21.3 of Volume 1) that says: when a timeout and retransmission occurs, the RTT estimators cannot be updated when the acknowledgment for the retransmitted data is received (the retransmission ambiguity problem). In Figure 25.26 we saw that t_rtt was set to 0 when a retransmission took place, because of Karn’s algorithm. If timestamps are not present and it is a retransmission, the code in Figure 29.6 does not update the RTT estimators because t_rtt will be 0 from the retransmission. But if a timestamp is present, t_rtt isn’t examined, allowing the RTT estimators to be updated using the received timestamp echo reply. With RFC 1323 timestamps the ambiguity is gone since the ts_ecr value was copied by the other end from the segment being acknowledged. The other half of Karn’s algorithm, specifying that an exponential backoff must be used with retransmissions, still holds, of course.
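The choice of RTT sample at lines 912-915, including the interaction with Karn’s algorithm, can be expressed as a small function. This is a sketch with assumed simplifications: the function name rtt_sample is hypothetical, it returns the value that would be handed to tcp_xmit_timer, and it uses a return value of 0 to mean that the estimators are not updated.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Modular comparison in the style of SEQ_GT. */
int seq_gt(tcp_seq a, tcp_seq b)
{
    return (int32_t)(a - b) > 0;
}

/* Hypothetical helper: returns the RTT sample (in ticks, offset by 1
 * as the text describes) or 0 if no sample is available. */
uint32_t rtt_sample(int ts_present, uint32_t tcp_now, uint32_t ts_ecr,
                    uint32_t t_rtt, tcp_seq ack, tcp_seq t_rtseq)
{
    if (ts_present)
        return tcp_now - ts_ecr + 1;    /* timestamp echo: always usable */
    if (t_rtt != 0 && seq_gt(ack, t_rtseq))
        return t_rtt;                   /* the timed segment was ACKed */
    return 0;   /* nothing timed, e.g. t_rtt zeroed by a retransmission */
}
```

In the delayed-ACK example above, an ACK of 2049 with t_rtseq equal to 1 yields the counter value; after a retransmission, with t_rtt equal to 0 and no timestamp, no sample is produced.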
Figure 29.7 shows the next part of ACK processing, updating the congestion window.
Table 29.7. tcp_input function: open congestion window in response to ACKs.
----------------------------------------------------------------------- tcp_input.c
927         /*
928          * When new data is acked, open the congestion window.
929          * If the window gives us less than ssthresh packets
930          * in flight, open exponentially (maxseg per packet).
931          * Otherwise open linearly: maxseg per window
932          * (maxseg^2 / cwnd per packet), plus a constant
933          * fraction of a packet (maxseg/8) to help larger windows
934          * open quickly enough.
935          */
936         {
937             u_int   cw = tp->snd_cwnd;
938             u_int   incr = tp->t_maxseg;
939             if (cw > tp->snd_ssthresh)
940                 incr = incr * incr / cw + incr / 8;
941             tp->snd_cwnd = min(cw + incr, TCP_MAXWIN << tp->snd_scale);
942         }
----------------------------------------------------------------------- tcp_input.c
927-942
One of the rules of slow start and congestion avoidance is that a received ACK increases the congestion window. By default the congestion window is increased by one segment for each received ACK (slow start). But if the current congestion window is greater than the slow start threshold, it is increased by 1 divided by the congestion window, plus a constant fraction of a segment. The term
incr * incr / cw
is
t_maxseg * t_maxseg / snd_cwnd
which is 1 divided by the congestion window, taking into account that snd_cwnd is maintained in bytes, not segments. The constant fraction is the segment size divided by 8. The congestion window is then limited by the maximum value of the send window for this connection. Example calculations of this algorithm are in Section 21.8 of Volume 1.
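The increment can be tried out with a few numbers. The sketch below repeats the calculation from lines 937-941 as a standalone function; the name open_cwnd is hypothetical, and TCP_MAXWIN is defined here as 65535, the largest unscaled window.

```c
#include <stdint.h>

#define TCP_MAXWIN 65535u   /* largest window value without scaling */

/* Hypothetical helper: returns the new congestion window (bytes)
 * after one ACK of new data. */
uint32_t open_cwnd(uint32_t cwnd, uint32_t ssthresh,
                   uint32_t maxseg, unsigned snd_scale)
{
    uint32_t incr = maxseg;                        /* slow start: one segment */
    uint32_t limit = (uint32_t)TCP_MAXWIN << snd_scale;

    if (cwnd > ssthresh)                           /* congestion avoidance */
        incr = incr * incr / cwnd + incr / 8;

    return (cwnd + incr < limit) ? cwnd + incr : limit;
}
```

With 512-byte segments, a window of 1024 below the threshold grows by a full segment to 1536. A window of 4096 above the threshold grows by 512 × 512 / 4096 + 512 / 8 = 64 + 64 = 128 bytes, to 4224.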
Adding in the constant fraction (the segment size divided by 8) is wrong [Floyd 1994]. But it has been in the BSD sources since 4.3BSD Reno and is still in 4.4BSD and Net/3. It should be removed.
The next part of tcp_input, shown in Figure 29.8, removes the acknowledged data from the send buffer.
Table 29.8. tcp_input function: remove acknowledged data from send buffer.
----------------------------------------------------------------------------- tcp_input.c
943         if (acked > so->so_snd.sb_cc) {
944             tp->snd_wnd -= so->so_snd.sb_cc;
945             sbdrop(&so->so_snd, (int) so->so_snd.sb_cc);
946             ourfinisacked = 1;
947         } else {
948             sbdrop(&so->so_snd, acked);
949             tp->snd_wnd -= acked;
950             ourfinisacked = 0;
951         }
952         if (so->so_snd.sb_flags & SB_NOTIFY)
953             sowwakeup(so);
954         tp->snd_una = ti->ti_ack;
955         if (SEQ_LT(tp->snd_nxt, tp->snd_una))
956             tp->snd_nxt = tp->snd_una;
----------------------------------------------------------------------------- tcp_input.c
943-946
If the number of bytes acknowledged exceeds the number of bytes on the send buffer, snd_wnd is decremented by the number of bytes in the send buffer and TCP knows that its FIN has been ACKed. That number of bytes is then removed from the send buffer by sbdrop. This method for detecting the ACK of a FIN works only because the FIN occupies 1 byte in the sequence number space.
947-951
Otherwise the number of bytes acknowledged is less than or equal to the number of bytes in the send buffer, so ourfinisacked is set to 0, and acked bytes of data are dropped from the send buffer.
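The way the FIN is detected from the byte counts alone can be shown with a reduced model of the send side. In this sketch the socket buffer is represented only by its byte count; struct send_side and drop_acked are hypothetical names, and the return value plays the role of ourfinisacked.

```c
#include <stdint.h>

/* Reduced model: only the counts that lines 943-951 touch. */
struct send_side {
    uint32_t sb_cc;    /* bytes in the send buffer */
    uint32_t snd_wnd;  /* send window, in bytes */
};

/* Hypothetical helper: removes acknowledged bytes and returns 1 if
 * the ACK also covers our FIN, 0 otherwise. */
int drop_acked(struct send_side *s, uint32_t acked)
{
    if (acked > s->sb_cc) {
        /* More was ACKed than the buffer holds: the extra byte of
         * sequence space can only be the FIN. */
        s->snd_wnd -= s->sb_cc;
        s->sb_cc = 0;
        return 1;
    }
    s->snd_wnd -= acked;
    s->sb_cc -= acked;
    return 0;
}
```

With 100 bytes in the send buffer, an ACK covering 101 bytes empties the buffer and reports the FIN as acknowledged, while an ACK covering 60 bytes leaves 40 bytes queued.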
952-956
sowwakeup awakens any processes waiting on the send buffer. snd_una is updated to contain the oldest unacknowledged sequence number. If this new value of snd_una exceeds snd_nxt, the latter is updated, since the intervening bytes have been acknowledged.
Figure 29.9 shows how snd_nxt can end up with a sequence number that is less than snd_una. Assume two segments are transmitted, the first with bytes 1–512 and the second with bytes 513–1024.
The retransmission timer then expires before an acknowledgment is returned. The code in Figure 25.26 sets snd_nxt back to snd_una, slow start is entered, tcp_output is called, and one segment containing bytes 1–512 is retransmitted. tcp_output increases snd_nxt to 513, and we have the scenario shown in Figure 29.10.
At this point an ACK of 1025 arrives (either the two original segments or the ACK was delayed somewhere in the network). The ACK is valid since it is less than or equal to snd_max, but snd_nxt will be less than the updated value of snd_una.
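The scenario can be replayed with the two assignments from lines 954-956. The sketch is standalone and assumes a hypothetical function name, advance_una; seq_lt imitates the SEQ_LT macro.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Modular comparison in the style of SEQ_LT. */
int seq_lt(tcp_seq a, tcp_seq b)
{
    return (int32_t)(a - b) < 0;
}

/* Hypothetical helper: record the new ACK and, if necessary,
 * pull snd_nxt forward so it is never behind snd_una. */
void advance_una(tcp_seq *snd_una, tcp_seq *snd_nxt, tcp_seq ack)
{
    *snd_una = ack;
    if (seq_lt(*snd_nxt, *snd_una))
        *snd_nxt = *snd_una;
}
```

After the timeout in the example, snd_una is 1 and snd_nxt is 513. When the ACK of 1025 arrives, snd_una becomes 1025 and snd_nxt is moved up to 1025 as well.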
The general ACK processing is now complete, and the switch shown in Figure 29.11 handles four special cases.
Table 29.11. tcp_input function: receipt of ACK in FIN_WAIT_1 state.
----------------------------------------------------------------------- tcp_input.c
957         switch (tp->t_state) {
958             /*
959              * In FIN_WAIT_1 state in addition to the processing
960              * for the ESTABLISHED state if our FIN is now acknowledged
961              * then enter FIN_WAIT_2.
962              */
963         case TCPS_FIN_WAIT_1:
964             if (ourfinisacked) {
965                 /*
966                  * If we can't receive any more
967                  * data, then closing user can proceed.
968                  * Starting the timer is contrary to the
969                  * specification, but if we don't get a FIN
970                  * we'll hang forever.
971                  */
972                 if (so->so_state & SS_CANTRCVMORE) {
973                     soisdisconnected(so);
974                     tp->t_timer[TCPT_2MSL] = tcp_maxidle;
975                 }
976                 tp->t_state = TCPS_FIN_WAIT_2;
977             }
978             break;
----------------------------------------------------------------------- tcp_input.c
958-971
In this state the process has closed the connection and TCP has sent the FIN. But other ACKs can be received for data segments sent before the FIN. Therefore the connection moves into the FIN_WAIT_2 state only when the FIN has been acknowledged. The flag ourfinisacked is set in Figure 29.8; this depends on whether the number of bytes ACKed exceeds the amount of data in the send buffer or not.
972-975
We also described in Section 25.6 how Net/3 sets a FIN_WAIT_2 timer to prevent an infinite wait in the FIN_WAIT_2 state. This timer is set only if the process completely closed the connection (i.e., the close system call or its kernel equivalent if the process was terminated by a signal), and not if the process performed a half-close (i.e., the FIN was sent but the process can still receive data on the connection).
Figure 29.12 shows the receipt of an ACK in the CLOSING state.
Table 29.12. tcp_input function: receipt of ACK in CLOSING state.
----------------------------------------------------------------------- tcp_input.c
979             /*
980              * In CLOSING state in addition to the processing for
981              * the ESTABLISHED state if the ACK acknowledges our FIN
982              * then enter the TIME-WAIT state, otherwise ignore
983              * the segment.
984              */
985         case TCPS_CLOSING:
986             if (ourfinisacked) {
987                 tp->t_state = TCPS_TIME_WAIT;
988                 tcp_canceltimers(tp);
989                 tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
990                 soisdisconnected(so);
991             }
992             break;
----------------------------------------------------------------------- tcp_input.c
979-992
If the ACK is for the FIN (and not for some previous data segment), the connection moves into the TIME_WAIT state. Any pending timers are cleared (such as a pending retransmission timer), and the TIME_WAIT timer is started with a value of twice the MSL.
The processing of an ACK in the LAST_ACK state is shown in Figure 29.13.
Table 29.13. tcp_input function: receipt of ACK in LAST_ACK state.
----------------------------------------------------------------------- tcp_input.c
 993             /*
 994              * In LAST_ACK, we may still be waiting for data to drain
 995              * and/or to be acked, as well as for the ack of our FIN.
 996              * If our FIN is now acknowledged, delete the TCB,
 997              * enter the closed state, and return.
 998              */
 999         case TCPS_LAST_ACK:
1000             if (ourfinisacked) {
1001                 tp = tcp_close(tp);
1002                 goto drop;
1003             }
1004             break;
----------------------------------------------------------------------- tcp_input.c
993-1004
If the FIN is ACKed, the new state is CLOSED. This state transition is handled by tcp_close, which also releases the Internet PCB and TCP control block.
Figure 29.14 shows the processing of an ACK in the TIME_WAIT state.
Table 29.14. tcp_input function: receipt of ACK in TIME_WAIT state.
----------------------------------------------------------------------- tcp_input.c
1005             /*
1006              * In TIME_WAIT state the only thing that should arrive
1007              * is a retransmission of the remote FIN. Acknowledge
1008              * it and restart the finack timer.
1009              */
1010         case TCPS_TIME_WAIT:
1011             tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
1012             goto dropafterack;
1013         }
1014     }
----------------------------------------------------------------------- tcp_input.c
1005-1014
In this state both ends have sent a FIN and both FINs have been acknowledged. If TCP’s ACK of the remote FIN was lost, however, the other end will retransmit the FIN (with an ACK). TCP drops the segment and resends the ACK. Additionally, the TIME_WAIT timer must be restarted with a value of twice the MSL.
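The state changes made by the four special cases can be summarized as a transition function. This is only a summary sketch: the enum conn_state and the function next_state_after_ack are hypothetical, and the timers, the socket notifications, and the release of the control block are all omitted.

```c
/* Hypothetical state names; only those relevant to Figures 29.11-29.14. */
enum conn_state {
    ST_ESTABLISHED,
    ST_FIN_WAIT_1,
    ST_FIN_WAIT_2,
    ST_CLOSING,
    ST_LAST_ACK,
    ST_TIME_WAIT,
    ST_CLOSED
};

/* Returns the state after an acceptable ACK has been processed.
 * ourfinisacked is nonzero if the ACK covers the FIN we sent. */
enum conn_state next_state_after_ack(enum conn_state st, int ourfinisacked)
{
    switch (st) {
    case ST_FIN_WAIT_1:
        return ourfinisacked ? ST_FIN_WAIT_2 : st;
    case ST_CLOSING:
        return ourfinisacked ? ST_TIME_WAIT : st;
    case ST_LAST_ACK:
        return ourfinisacked ? ST_CLOSED : st;
    case ST_TIME_WAIT:
        return st;      /* state unchanged; only the 2MSL timer restarts */
    default:
        return st;      /* no special handling in the other states */
    }
}
```

An ACK that covers only earlier data leaves the connection in its current state in all three FIN-related cases.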
There are two variables in the TCP control block that we haven’t described yet: snd_wl1 and snd_wl2.
snd_wl1 records the sequence number of the last segment used to update the send window (snd_wnd).
snd_wl2 records the acknowledgment number of the last segment used to update the send window.
Our only encounter with these variables so far was when a connection was established (active, passive, or simultaneous open) and snd_wl1 was set to ti_seq minus 1. We said this was to guarantee a window update, which we’ll see in the following code.
The send window (snd_wnd) is updated from the advertised window in the received segment (tiwin) if any one of the following three conditions is true:
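1. The segment contains new data. Since snd_wl1 records the sequence number of the last segment used to update the send window, if snd_wl1 is less than ti_seq this condition is true.
2. The segment does not contain new data (snd_wl1 equals ti_seq), but the segment acknowledges new data. The latter is true if snd_wl2 is less than ti_ack, since snd_wl2 records the acknowledgment number of the last segment used to update the send window.
3. The segment does not contain new data, and the segment does not acknowledge new data, but the advertised window is larger than the current send window.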
The purpose of these tests is to prevent an old segment from affecting the send window, since the send window is not an absolute sequence number, but is an offset from snd_una.
Figure 29.15 shows the code that implements the update of the send window.
Table 29.15. tcp_input function: update window information.
----------------------------------------------------------------------- tcp_input.c
1015   step6:
1016     /*
1017      * Update window information.
1018      * Don't look at window if no ACK: TAC's send garbage on first SYN.
1019      */
1020     if ((tiflags & TH_ACK) &&
1021         (SEQ_LT(tp->snd_wl1, ti->ti_seq) || tp->snd_wl1 == ti->ti_seq &&
1022          (SEQ_LT(tp->snd_wl2, ti->ti_ack) ||
1023           tp->snd_wl2 == ti->ti_ack && tiwin > tp->snd_wnd))) {
1024         /* keep track of pure window updates */
1025         if (ti->ti_len == 0 &&
1026             tp->snd_wl2 == ti->ti_ack && tiwin > tp->snd_wnd)
1027             tcpstat.tcps_rcvwinupd++;
1028         tp->snd_wnd = tiwin;
1029         tp->snd_wl1 = ti->ti_seq;
1030         tp->snd_wl2 = ti->ti_ack;
1031         if (tp->snd_wnd > tp->max_sndwnd)
1032             tp->max_sndwnd = tp->snd_wnd;
1033         needoutput = 1;
1034     }
----------------------------------------------------------------------- tcp_input.c
1015-1023
This if test verifies that the ACK flag is set along with any one of the three previously stated conditions. Recall that a jump was made to step6 after the receipt of a SYN in either the LISTEN or SYN_SENT state, and in the LISTEN state the SYN does not contain an ACK.
The term TAC referred to in the comment is a “terminal access controller.” These were Telnet clients on the ARPANET.
1024-1027
If the received segment is a pure window update (the length is 0 and the ACK does not acknowledge new data, but the advertised window is larger), the statistic tcps_rcvwinupd is incremented.
1028-1033
The send window is updated and new values of snd_wl1 and snd_wl2 are recorded. Additionally, if this advertised window is the largest one TCP has received from this peer, the new value is recorded in max_sndwnd. This is an attempt to guess the size of the other end's receive buffer, and it is used in Figure 26.8. needoutput is set to 1 since the new value of snd_wnd might enable a segment to be sent.
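To see the three conditions in isolation, the following user-space sketch restates the window-update test. The structure win_state and the function update_send_window are hypothetical names introduced only for this illustration, and the structure holds just the three fields involved; SEQ_LT is written here as a 32-bit modular comparison, which is an assumption about how the kernel macro behaves.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Modular "less than" on 32-bit sequence numbers (assumed equivalent
 * to the kernel's SEQ_LT macro). */
#define SEQ_LT(a, b) ((int32_t)((a) - (b)) < 0)

/* Hypothetical stand-in for the window fields of the TCP control block. */
struct win_state {
    tcp_seq  snd_wl1;   /* sequence number of last window-update segment */
    tcp_seq  snd_wl2;   /* acknowledgment number of that segment */
    uint32_t snd_wnd;   /* current send window */
};

/*
 * Apply the three-condition test. Returns 1 and records the new
 * window if the segment is allowed to update it, else returns 0.
 */
static int
update_send_window(struct win_state *w, tcp_seq seq, tcp_seq ack,
                   uint32_t win)
{
    if (SEQ_LT(w->snd_wl1, seq) ||                  /* 1: new data */
        (w->snd_wl1 == seq &&
         (SEQ_LT(w->snd_wl2, ack) ||                /* 2: new ACK */
          (w->snd_wl2 == ack && win > w->snd_wnd)))) {  /* 3: larger window */
        w->snd_wnd = win;
        w->snd_wl1 = seq;
        w->snd_wl2 = ack;
        return 1;
    }
    return 0;
}
```

An old segment (one whose sequence number is less than snd_wl1) fails all three conditions, so its advertised window is ignored no matter how large it is.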
The next part of TCP input processing, shown in Figure 29.16, handles segments with the URG flag set.
Figure 29.16. tcp_input function: urgent mode processing.
----------------------------------------------------------------------------- tcp_input.c
1035     /*
1036      * Process segments with URG.
1037      */
1038     if ((tiflags & TH_URG) && ti->ti_urp &&
1039         TCPS_HAVERCVDFIN(tp->t_state) == 0) {
1040         /*
1041          * This is a kludge, but if we receive and accept
1042          * random urgent pointers, we'll crash in
1043          * soreceive. It's hard to imagine someone
1044          * actually wanting to send this much urgent data.
1045          */
1046         if (ti->ti_urp + so->so_rcv.sb_cc > sb_max) {
1047             ti->ti_urp = 0;         /* XXX */
1048             tiflags &= ~TH_URG;     /* XXX */
1049             goto dodata;            /* XXX */
1050         }
----------------------------------------------------------------------------- tcp_input.c
1035-1039
These segments must have the URG flag set, a nonzero urgent offset (ti_urp), and the connection must not have received a FIN. The macro TCPS_HAVERCVDFIN is true only for the TIME_WAIT state, so the URG is processed in any other state. This is contrary to a comment appearing later in the code stating that the URG flag is ignored in the CLOSE_WAIT, CLOSING, LAST_ACK, or TIME_WAIT states.
1040-1050
If the urgent offset plus the number of bytes already in the receive buffer exceeds the maximum size of a socket buffer, the urgent notification is ignored. The urgent offset is set to 0, the URG flag is cleared, and the rest of the urgent mode processing is skipped.
The next piece of code, shown in Figure 29.17, processes the urgent pointer.
Figure 29.17. tcp_input function: processing of received urgent pointer.
------------------------------------------------------------------------ tcp_input.c
1051         /*
1052          * If this segment advances the known urgent pointer,
1053          * then mark the data stream. This should not happen
1054          * in CLOSE_WAIT, CLOSING, LAST_ACK or TIME_WAIT states since
1055          * a FIN has been received from the remote side.
1056          * In these states we ignore the URG.
1057          *
1058          * According to RFC961 (Assigned Protocols),
1059          * the urgent pointer points to the last octet
1060          * of urgent data. We continue, however,
1061          * to consider it to indicate the first octet
1062          * of data past the urgent section as the original
1063          * spec states (in one of two places).
1064          */
1065         if (SEQ_GT(ti->ti_seq + ti->ti_urp, tp->rcv_up)) {
1066             tp->rcv_up = ti->ti_seq + ti->ti_urp;
1067             so->so_oobmark = so->so_rcv.sb_cc +
1068                 (tp->rcv_up - tp->rcv_nxt) - 1;
1069             if (so->so_oobmark == 0)
1070                 so->so_state |= SS_RCVATMARK;
1071             sohasoutofband(so);
1072             tp->t_oobflags &= ~(TCPOOB_HAVEDATA | TCPOOB_HADDATA);
1073         }
1074         /*
1075          * Remove out-of-band data so doesn't get presented to user.
1076          * This can happen independent of advancing the URG pointer,
1077          * but if two URG's are pending at once, some out-of-band
1078          * data may creep in... ick.
1079          */
1080         if (ti->ti_urp <= ti->ti_len
1081 #ifdef SO_OOBINLINE
1082             && (so->so_options & SO_OOBINLINE) == 0
1083 #endif
1084             )
1085             tcp_pulloutofband(so, ti, m);
1086     } else {
1087         /*
1088          * If no out-of-band data is expected, pull receive
1089          * urgent pointer along with the receive window.
1090          */
1091         if (SEQ_GT(tp->rcv_nxt, tp->rcv_up))
1092             tp->rcv_up = tp->rcv_nxt;
1093     }
------------------------------------------------------------------------ tcp_input.c
1051-1065
If the starting sequence number of the received segment plus the urgent offset exceeds the current receive urgent pointer, a new urgent pointer has been received. For example, when the 3-byte segment that was sent in Figure 26.30 arrives at the receiver, we have the scenario shown in Figure 29.18.
Normally the receive urgent pointer (rcv_up) equals rcv_nxt. In this example, since the if test is true (4 plus 3 is greater than 4), the new value of rcv_up is calculated as 7.
1066-1070
The out-of-band mark in the socket's receive buffer is calculated, taking into account any data bytes already in the receive buffer (so_rcv.sb_cc). In our example, assuming there is no data already in the receive buffer, so_oobmark is set to 2: that is, the byte with the sequence number 6 is considered the out-of-band byte. If this out-of-band mark is 0, the socket is currently at the out-of-band mark. This happens if the send system call that sends the out-of-band byte specifies a length of 1, and if the receive buffer is empty when this segment arrives at the other end. This reiterates that Berkeley-derived systems consider the urgent pointer to point to the first byte of data after the out-of-band byte.
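The arithmetic in this example can be checked with a small user-space sketch. The structure urg_state and the function note_urgent are hypothetical names used only here; they model the four values involved (rcv_nxt, rcv_up, the receive buffer byte count, and the out-of-band mark) and nothing else from the kernel.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Modular "greater than" on 32-bit sequence numbers (assumed
 * equivalent to the kernel's SEQ_GT macro). */
#define SEQ_GT(a, b) ((int32_t)((a) - (b)) > 0)

/* Hypothetical model of the state touched by urgent pointer processing. */
struct urg_state {
    tcp_seq  rcv_nxt;   /* next expected receive sequence number */
    tcp_seq  rcv_up;    /* receive urgent pointer */
    uint32_t sb_cc;     /* bytes already in the receive buffer */
    long     oobmark;   /* offset of the out-of-band mark */
    int      at_mark;   /* nonzero if the socket is at the mark */
};

/* Returns 1 if the segment advances the urgent pointer, else 0. */
static int
note_urgent(struct urg_state *u, tcp_seq ti_seq, uint16_t ti_urp)
{
    if (!SEQ_GT(ti_seq + ti_urp, u->rcv_up))
        return 0;
    u->rcv_up = ti_seq + ti_urp;
    /* Berkeley convention: the pointer is one past the OOB byte. */
    u->oobmark = (long)u->sb_cc + (long)(u->rcv_up - u->rcv_nxt) - 1;
    u->at_mark = (u->oobmark == 0);
    return 1;
}
```

With rcv_nxt and rcv_up both 4, an empty receive buffer, a segment sequence number of 4, and an urgent offset of 3, the sketch yields rcv_up of 7 and a mark of 2, matching the values in the text. An urgent offset of 1 with an empty buffer yields a mark of 0.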
1071-1072
sohasoutofband notifies the process that out-of-band data has arrived for the socket. The two flags TCPOOB_HAVEDATA and TCPOOB_HADDATA are cleared. These two flags are used with the PRU_RCVOOB request in Figure 30.8.
1074-1085
If the urgent offset is less than or equal to the number of bytes in the received segment, the out-of-band byte is contained in the segment. With TCP's urgent mode it is possible for the urgent offset to point to a data byte that has not yet been received. If the SO_OOBINLINE constant is defined (which it always is for Net/3), and if the corresponding socket option is not enabled, the receiving process wants the out-of-band byte pulled out of the normal stream of data and placed into the variable t_iobc. This is done by tcp_pulloutofband, which we cover in the next section.
Notice that the receiving process is notified that the sender has entered urgent mode, regardless of whether the byte pointed to by the urgent pointer is readable or not. This is a feature of TCP’s urgent mode.
1086-1093
When the receiver is not processing an urgent pointer, if rcv_nxt is greater than the receive urgent pointer, rcv_up is moved to the right and set equal to rcv_nxt. This keeps the receive urgent pointer at the left edge of the receive window so that the comparison using SEQ_GT at the beginning of Figure 29.17 will work correctly when an URG flag is received.
If the solution to Exercise 26.6 is implemented, corresponding changes will have to go into Figures 29.16 and 29.17 also.
This function is called from Figure 29.17 when
1. urgent mode notification arrives in a received segment, and
2. the out-of-band byte is contained within the segment (i.e., the urgent pointer points into the received segment), and
3. the SO_OOBINLINE socket option is not enabled for this socket.
This function removes the out-of-band byte from the normal stream of data (i.e., the mbuf chain containing the received segment) and places it into the t_iobc variable in the TCP control block for the connection. The process reads this variable using the MSG_OOB flag with the recv system call: the PRU_RCVOOB request in Figure 30.8. Figure 29.19 shows the function.
Figure 29.19. tcp_pulloutofband function: place out-of-band byte into t_iobc.
------------------------------------------------------------------------- tcp_input.c
1282 void
1283 tcp_pulloutofband(so, ti, m)
1284 struct socket *so;
1285 struct tcpiphdr *ti;
1286 struct mbuf *m;
1287 {
1288     int     cnt = ti->ti_urp - 1;
1289     while (cnt >= 0) {
1290         if (m->m_len > cnt) {
1291             char   *cp = mtod(m, caddr_t) + cnt;
1292             struct tcpcb *tp = sototcpcb(so);
1293             tp->t_iobc = *cp;
1294             tp->t_oobflags |= TCPOOB_HAVEDATA;
1295             bcopy(cp + 1, cp, (unsigned) (m->m_len - cnt - 1));
1296             m->m_len--;
1297             return;
1298         }
1299         cnt -= m->m_len;
1300         m = m->m_next;
1301         if (m == 0)
1302             break;
1303     }
1304     panic("tcp_pulloutofband");
1305 }
------------------------------------------------------------------------- tcp_input.c
1282-1289
Consider the example in Figure 29.20. The urgent offset is 3, therefore the urgent pointer is 7, and the sequence number of the out-of-band byte is 6. There are 5 bytes in the received segment, all contained in a single mbuf.
The variable cnt is 2 and since m_len (which is 5) is greater than 2, the true portion of the if statement is executed.
1290-1298
cp points to the shaded byte with a sequence number of 6. This is placed into the variable t_iobc, which contains the out-of-band byte. The TCPOOB_HAVEDATA flag is set and bcopy moves the next 2 bytes (with sequence numbers 7 and 8) left 1 byte, giving the arrangement shown in Figure 29.21.
Remember that the numbers 7 and 8 specify the sequence numbers of the data bytes, not the contents of the data bytes. The length of the mbuf is decremented from 5 to 4 but ti_len is left as 5, for sequencing of the segment into the socket's receive buffer. Both the TCP_REASS macro and the tcp_reass function (which are called in the next section) increment rcv_nxt by ti_len, which in this example must be 5, because the next expected receive sequence number is 9. Also notice in this function that the length field in the packet header (m_pkthdr.len) in the first mbuf is not decremented by 1. This is because that length field is not used by sbappend, which appends the data to the socket's receive buffer.
1299-1302
The out-of-band byte is not contained in this mbuf, so cnt is decremented by the number of bytes in the mbuf and the next mbuf in the chain is processed. Since this function is called only when the urgent offset points into the received segment, if there is not another mbuf on the chain, the break causes the call to panic.
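The walk over the chain can be tried outside the kernel with a simplified buffer list. In the sketch below, struct buf is a hypothetical stand-in for an mbuf (a link, a length, and a data pointer), pull_oob is a hypothetical name, and a return value of -1 takes the place of the call to panic. memmove is used where the kernel uses bcopy, since the source and destination overlap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf: link, length, data pointer. */
struct buf {
    struct buf    *next;
    size_t         len;
    unsigned char *data;
};

/*
 * Remove the byte at offset (urp - 1) from the chain and store it
 * in *oob. Returns 0 on success, or -1 if the offset lies beyond
 * the end of the chain (the case where the kernel panics).
 */
static int
pull_oob(struct buf *b, unsigned urp, unsigned char *oob)
{
    size_t cnt;

    if (urp == 0)
        return -1;
    cnt = urp - 1;
    for (; b != NULL; b = b->next) {
        if (b->len > cnt) {
            unsigned char *cp = b->data + cnt;

            *oob = *cp;                             /* save OOB byte */
            memmove(cp, cp + 1, b->len - cnt - 1);  /* close the gap */
            b->len--;
            return 0;
        }
        cnt -= b->len;      /* byte is in a later buffer */
    }
    return -1;
}
```

For a single 5-byte buffer and an urgent offset of 3, the third byte is extracted and the two bytes after it move left by one, leaving 4 bytes in the buffer, as in the example above.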
tcp_input continues by taking the received data (if any) and either appending it to the socket's receive buffer (if it is the next expected segment) or placing it onto the socket's out-of-order queue. Figure 29.22 shows the code that performs this task.
Figure 29.22. tcp_input function: merge received data into sequencing queue for socket.
----------------------------------------------------------------------- tcp_input.c
1094   dodata:                      /* XXX */
1095     /*
1096      * Process the segment text, merging it into the TCP sequencing queue,
1097      * and arranging for acknowledgment of receipt if necessary.
1098      * This process logically involves adjusting tp->rcv_wnd as data
1099      * is presented to the user (this happens in tcp_usrreq.c,
1100      * case PRU_RCVD). If a FIN has already been received on this
1101      * connection then we just ignore the text.
1102      */
1103     if ((ti->ti_len || (tiflags & TH_FIN)) &&
1104         TCPS_HAVERCVDFIN(tp->t_state) == 0) {
1105         TCP_REASS(tp, ti, m, so, tiflags);
1106         /*
1107          * Note the amount of data that peer has sent into
1108          * our window, in order to estimate the sender's
1109          * buffer size.
1110          */
1111         len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
1112     } else {
1113         m_freem(m);
1114         tiflags &= ~TH_FIN;
1115     }
----------------------------------------------------------------------- tcp_input.c
1094-1105
Segment data is processed if
1. the length of the received data is greater than 0 or the FIN flag is set, and
2. a FIN has not yet been received for the connection.
The macro TCP_REASS processes the data. If the data is in sequence (i.e., the next expected data for this connection), the delayed-ACK flag is set, rcv_nxt is incremented, and the data is appended to the socket's receive buffer. If the data is out of order, the macro calls tcp_reass to add the data to the connection's reassembly queue (which might fill a hole and cause already-queued data to be appended to the socket's receive buffer).
Recall that the final argument to the macro (tiflags) can be modified. Specifically, if the data is out of order, tcp_reass sets tiflags to 0, clearing the FIN flag (if it was set). That's why the if statement is true if the FIN flag is set even if there is no data in the segment.
Consider the following example. A connection is established and the sender immediately transmits three segments: one with bytes 1–1024, another with bytes 1025–2048, and another with the FIN flag but no data. The first segment is lost, so when the second arrives (bytes 1025–2048) the receiver places it onto the out-of-order list and generates an immediate ACK. When the third segment with the FIN flag is received, the code in Figure 29.22 is executed. Even though the data length is 0, since the FIN flag is set, TCP_REASS is invoked, which calls tcp_reass. Since ti_seq (2049, the sequence number of the FIN) does not equal rcv_nxt (1), tcp_reass returns 0 (Figure 27.23), which in the TCP_REASS macro sets tiflags to 0. This clears the FIN flag, preventing the code that follows (Section 29.10) from processing the FIN flag.
1106-1111
The calculation of len is an attempt to guess the size of the other end's send buffer. Consider the following example. A socket has a receive buffer size of 8192 (the Net/3 default), so TCP advertises a window of 8192 in its SYN. The first segment with bytes 1–1024 is then received. Figure 29.23 shows the state of the receive space after TCP_REASS has incremented rcv_nxt to account for the received segment.
The calculation of len yields 1024. The value of len will increase as the other end sends more data into the receive window, but it will never exceed the size of the other end's send buffer. Recall that the variable max_sndwnd, calculated in Figure 29.15, is an attempt to guess the size of the other end's receive buffer.
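The value 1024 can be reproduced with a one-line calculation. The sketch below assumes the peer's initial sequence number is 0, so that rcv_nxt is 1 after the SYN and rcv_adv (the highest advertised sequence number plus one) is 1 + 8192 = 8193; after bytes 1–1024 arrive, rcv_nxt is 1025. The function name peer_sent_into_window is hypothetical.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/*
 * Bytes the peer has sent into our advertised window:
 * receive buffer high-water mark minus the window still unused.
 */
static long
peer_sent_into_window(uint32_t sb_hiwat, tcp_seq rcv_adv, tcp_seq rcv_nxt)
{
    return (long)sb_hiwat - (long)(int32_t)(rcv_adv - rcv_nxt);
}
```

Under these assumptions, peer_sent_into_window(8192, 8193, 1025) is 8192 − 7168 = 1024, and before any data arrives (rcv_nxt of 1) the result is 0.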
This variable len is never used! It is leftover code from Net/1, when the variable max_rcvd was stored in the TCP control block after the calculation of len:
    if (len > tp->max_rcvd)
        tp->max_rcvd = len;
But even in Net/1 the variable max_rcvd was never used.
1112-1115
If the length is 0 and the FIN flag is not set, or if a FIN has already been received for the connection, the received mbuf chain is discarded and the FIN flag is cleared.
The next step in tcp_input, shown in Figure 29.24, handles the FIN flag.
Figure 29.24. tcp_input function: FIN processing, first half.
----------------------------------------------------------------------- tcp_input.c
1116     /*
1117      * If FIN is received ACK the FIN and let the user know
1118      * that the connection is closing.
1119      */
1120     if (tiflags & TH_FIN) {
1121         if (TCPS_HAVERCVDFIN(tp->t_state) == 0) {
1122             socantrcvmore(so);
1123             tp->t_flags |= TF_ACKNOW;
1124             tp->rcv_nxt++;
1125         }
1126         switch (tp->t_state) {
1127             /*
1128              * In SYN_RECEIVED and ESTABLISHED states
1129              * enter the CLOSE_WAIT state.
1130              */
1131         case TCPS_SYN_RECEIVED:
1132         case TCPS_ESTABLISHED:
1133             tp->t_state = TCPS_CLOSE_WAIT;
1134             break;
----------------------------------------------------------------------- tcp_input.c
1116-1125
If the FIN flag is set and this is the first FIN received for this connection, socantrcvmore marks the socket as write-only, TF_ACKNOW is set to acknowledge the FIN immediately (i.e., it is not delayed), and rcv_nxt steps over the FIN in the sequence space.
1126
The remainder of FIN processing is handled by a switch that depends on the connection state. Notice that the FIN is not processed in the CLOSED, LISTEN, or SYN_SENT states, since in these three states a SYN has not been received to synchronize the received sequence number, making it impossible to validate the sequence number of the FIN. A FIN is also ignored in the CLOSING, CLOSE_WAIT, and LAST_ACK states, because in these three states the FIN is a duplicate.
1127-1134
From either the ESTABLISHED or SYN_RCVD states, the CLOSE_WAIT state is entered.
The receipt of a FIN in the SYN_RCVD state is unusual, but legal. It is not shown in Figure 24.15. It means a socket is in the LISTEN state when a segment containing a SYN and a FIN is received. Alternatively, a SYN is received for a listening socket, moving the connection to the SYN_RCVD state but before the ACK is received a FIN is received. (We know the segment does not contain a valid ACK, because if it did the code in Figure 29.2 would have moved the connection to the ESTABLISHED state.)
The next part of FIN processing is shown in Figure 29.25.
Figure 29.25. tcp_input function: FIN processing, second half.
----------------------------------------------------------------------- tcp_input.c
1135             /*
1136              * If still in FIN_WAIT_1 state FIN has not been acked so
1137              * enter the CLOSING state.
1138              */
1139         case TCPS_FIN_WAIT_1:
1140             tp->t_state = TCPS_CLOSING;
1141             break;
1142             /*
1143              * In FIN_WAIT_2 state enter the TIME_WAIT state,
1144              * starting the time-wait timer, turning off the other
1145              * standard timers.
1146              */
1147         case TCPS_FIN_WAIT_2:
1148             tp->t_state = TCPS_TIME_WAIT;
1149             tcp_canceltimers(tp);
1150             tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
1151             soisdisconnected(so);
1152             break;
1153             /*
1154              * In TIME_WAIT state restart the 2 MSL time_wait timer.
1155              */
1156         case TCPS_TIME_WAIT:
1157             tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
1158             break;
1159         }
1160     }
----------------------------------------------------------------------- tcp_input.c
1135-1141
Since ACK processing is already complete for this segment, if the connection is in the FIN_WAIT_1 state when the FIN is processed, it means a simultaneous close is taking place—the two FINs from each end have passed in the network. The connection enters the CLOSING state.
1142-1148
The receipt of the FIN moves the connection into the TIME_WAIT state. When a segment containing a FIN and an ACK is received in the FIN_WAIT_1 state (the typical scenario), although Figure 24.15 shows the transition directly from the FIN_WAIT_1 state to the TIME_WAIT state, the ACK is processed in Figure 29.11, moving the connection to the FIN_WAIT_2 state. The FIN processing here moves the connection into the TIME_WAIT state. Because the ACK is processed before the FIN, the FIN_WAIT_2 state is always passed through, albeit momentarily.
1149-1152
Any pending TCP timer is turned off and the TIME_WAIT timer is started with a value of twice the MSL. (If the received segment contained a FIN and an ACK, Figure 29.11 started the FIN_WAIT_2 timer.) The socket is disconnected.
1153-1159
If a FIN arrives in the TIME_WAIT state, it is a duplicate, and similar to Figure 29.14, the TIME_WAIT timer is restarted with a value of twice the MSL.
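The state changes made by the two halves of FIN processing can be summarized as a small transition function. This is only a sketch: the enumeration and the function state_after_fin are hypothetical names, and the timer, socket, and rcv_nxt side effects described above are deliberately left out.

```c
/* Hypothetical names for the TCP connection states. */
enum tcp_state {
    CLOSED, LISTEN, SYN_SENT, SYN_RECEIVED, ESTABLISHED,
    CLOSE_WAIT, FIN_WAIT_1, CLOSING, LAST_ACK, FIN_WAIT_2, TIME_WAIT
};

/*
 * State entered when a FIN is processed in state s, after ACK
 * processing for the same segment has already been done.
 */
static enum tcp_state
state_after_fin(enum tcp_state s)
{
    switch (s) {
    case SYN_RECEIVED:
    case ESTABLISHED:
        return CLOSE_WAIT;      /* peer closed first */
    case FIN_WAIT_1:
        return CLOSING;         /* simultaneous close */
    case FIN_WAIT_2:
        return TIME_WAIT;       /* our FIN was already ACKed */
    default:
        return s;               /* no state change (TIME_WAIT only
                                   restarts its timer) */
    }
}
```

Because the ACK is processed before the FIN, a segment carrying both a FIN and an ACK of our FIN in the FIN_WAIT_1 state reaches this function with s already equal to FIN_WAIT_2.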
The final part of the slow path through tcp_input along with the label dropafterack is shown in Figure 29.26.
Figure 29.26. tcp_input function: final processing.
------------------------------------------------------------------------- tcp_input.c
1161     if (so->so_options & SO_DEBUG)
1162         tcp_trace(TA_INPUT, ostate, tp, &tcp_saveti, 0);
1163     /*
1164      * Return any desired output.
1165      */
1166     if (needoutput || (tp->t_flags & TF_ACKNOW))
1167         (void) tcp_output(tp);
1168     return;
1169   dropafterack:
1170     /*
1171      * Generate an ACK dropping incoming segment if it occupies
1172      * sequence space, where the ACK reflects our state.
1173      */
1174     if (tiflags & TH_RST)
1175         goto drop;
1176     m_freem(m);
1177     tp->t_flags |= TF_ACKNOW;
1178     (void) tcp_output(tp);
1179     return;
------------------------------------------------------------------------- tcp_input.c
1161-1162
If the SO_DEBUG socket option is enabled, tcp_trace appends the trace record to the kernel's circular buffer. Remember that the code in Figure 28.7 saved both the original connection state and the IP and TCP headers, since these values may have changed in this function.
1163-1168
If either the needoutput flag was set (Figures 29.6 and 29.15) or if an immediate ACK is required, tcp_output is called.
1169-1179
An ACK is generated only if the RST flag was not set. (A segment with an RST is never ACKed.) The mbuf chain containing the received segment is released, and tcp_output generates an immediate ACK.
Figure 29.27 completes the tcp_input function.
Figure 29.27. tcp_input function: final processing.
----------------------------------------------------------------------- tcp_input.c
1180   dropwithreset:
1181     /*
1182      * Generate an RST, dropping incoming segment.
1183      * Make ACK acceptable to originator of segment.
1184      * Don't bother to respond if destination was broadcast/multicast.
1185      */
1186     if ((tiflags & TH_RST) || m->m_flags & (M_BCAST | M_MCAST) ||
1187         IN_MULTICAST(ti->ti_dst.s_addr))
1188         goto drop;
1189     if (tiflags & TH_ACK)
1190         tcp_respond(tp, ti, m, (tcp_seq) 0, ti->ti_ack, TH_RST);
1191     else {
1192         if (tiflags & TH_SYN)
1193             ti->ti_len++;
1194         tcp_respond(tp, ti, m, ti->ti_seq + ti->ti_len, (tcp_seq) 0,
1195                     TH_RST | TH_ACK);
1196     }
1197     /* destroy temporarily created socket */
1198     if (dropsocket)
1199         (void) soabort(so);
1200     return;
1201   drop:
1202     /*
1203      * Drop space held by incoming segment and return.
1204      */
1205     if (tp && (tp->t_inpcb->inp_socket->so_options & SO_DEBUG))
1206         tcp_trace(TA_DROP, ostate, tp, &tcp_saveti, 0);
1207     m_freem(m);
1208     /* destroy temporarily created socket */
1209     if (dropsocket)
1210         (void) soabort(so);
1211     return;
1212 }
----------------------------------------------------------------------- tcp_input.c
1180-1188
An RST is generated unless the received segment also contained an RST, or the received segment was sent as a broadcast or multicast. An RST is never generated in response to an RST, since this could lead to RST storms (a continual exchange of RST segments between two end points).
This code contains the same error that we noted in Figure 28.16: it does not check whether the destination address of the received segment was a broadcast address.
Similarly, the destination address argument to IN_MULTICAST needs to be converted to host byte order.
1189-1196
The values of the sequence number field, the acknowledgment field, and the ACK flag of the RST segment depend on whether the received segment contained an ACK.
Figure 29.28 summarizes these fields in the RST segment that is generated.
Realize that the ACK flag is normally set in all segments except when an initial SYN is sent (Figure 24.16). The fourth argument to tcp_respond is the acknowledgment field, and the fifth argument is the sequence number.
1192-1193
If the SYN flag is set, ti_len must be incremented by 1, causing the acknowledgment field of the RST to be 1 greater than the received sequence number of the SYN. This code is executed when a SYN arrives for a nonexistent server. When the Internet PCB is not found in Figure 28.6, a jump is made to dropwithreset. But for the received RST to be acceptable to the other end, the acknowledgment field must ACK the SYN (Figure 28.18). Figure 18.14 of Volume 1 contains an example of this type of RST segment.
Finally note that tcp_respond builds the RST in the first mbuf of the received chain and releases any remaining mbufs in the chain. When that mbuf finally makes its way to the device driver, it will be discarded.
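The choice of fields for the RST can be sketched as a pure function of the received header. The structure rst_fields and the function rst_for are hypothetical names; the sketch covers only the selection of the sequence number, acknowledgment number, and flags, and assumes the caller has already decided that an RST should be sent (the received segment had no RST and was not a broadcast or multicast).

```c
#include <stdint.h>

/* TCP header flag bits. */
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_ACK 0x10

/* Hypothetical container for the three fields of the generated RST. */
struct rst_fields {
    uint32_t seq;
    uint32_t ack;
    uint8_t  flags;
};

static struct rst_fields
rst_for(uint8_t tiflags, uint32_t ti_seq, uint32_t ti_ack, uint32_t ti_len)
{
    struct rst_fields r;

    if (tiflags & TH_ACK) {
        /* Received segment had an ACK: reuse it as our sequence number. */
        r.seq = ti_ack;
        r.ack = 0;
        r.flags = TH_RST;
    } else {
        /* No ACK received: acknowledge everything in the segment. */
        if (tiflags & TH_SYN)
            ti_len++;           /* the SYN occupies one sequence number */
        r.seq = 0;
        r.ack = ti_seq + ti_len;
        r.flags = TH_RST | TH_ACK;
    }
    return r;
}
```

For a bare SYN with sequence number 1000 and no data, the sketch produces an RST with the ACK flag set and an acknowledgment number of 1001, which is the case of a SYN arriving for a nonexistent server.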
1197-1199
If a temporary socket was created in Figure 28.7 for a listening server, but the code in Figure 28.16 found the received segment to contain an error, dropsocket will be 1. If so, that socket is now destroyed.
1201-1206
tcp_trace is called when a segment is dropped without generating an ACK or an RST. If the SO_DEBUG flag is set and an ACK is generated, tcp_output generates a trace record. If the SO_DEBUG flag is set and an RST is generated, a trace record is not generated for the RST.
1207-1211
The mbuf chain containing the received segment is released and the temporary socket is destroyed if dropsocket is nonzero.
The refinements to speed up TCP processing are similar to the ones described for UDP (Section 23.12). Multiple passes over the data should be avoided and the checksum computation should be combined with a copy. [Dalton et al. 1993] describe these modifications.
The linear search of the TCP PCBs is also a bottleneck when the number of connections increases. [McKenney and Dove 1992] address this problem by replacing the linear search with hash tables.
[Partridge 1993] describes a research implementation being developed by Van Jacobson that greatly reduces the TCP input processing. The received packet is processed by IP (about 25 instructions on a RISC system), then by a demultiplexer to locate the PCB (about 10 instructions), and then by TCP (about 30 instructions). These 30 instructions perform header prediction and calculate the pseudo-header checksum. If the segment passes the header prediction test, contains data, and the process is waiting for the data, the data is copied into the process buffer and the remainder of the TCP checksum is calculated and verified (a one-pass copy and checksum). If the TCP header prediction fails, the slow path through the TCP input processing occurs.
We now describe TCP header compression. Although header compression is not part of TCP input, we needed to cover TCP thoroughly before describing header compression. Header compression is described in detail in RFC 1144 [Jacobson 1990a]. It was designed by Van Jacobson and is sometimes called VJ header compression. Our purpose in this section is not to go through the header compression source code (a well-commented version of which is presented in RFC 1144, and which is approximately the same size as tcp_output), but to provide an overview of the algorithm. Be sure to distinguish between header prediction (Section 28.4) and header compression.
Most implementations of SLIP and PPP support header compression. Although header compression could, in theory, be used with any data link, it is intended for slow-speed serial links. Header compression works with TCP segments only; it does nothing with other IP datagrams (e.g., ICMP, IGMP, UDP). Header compression reduces the size of the combined IP/TCP header from its normal 40 bytes to as few as 3 bytes. This reduces the size of a typical TCP segment from an interactive application such as Rlogin or Telnet from 41 bytes to 4 bytes, a big saving on a slow-speed serial link.
Each end of the serial link maintains two connection state tables, one for datagrams sent and one for datagrams received. Each table allows a maximum of 256 entries, but typically there are 16 entries in this table, allowing up to 16 different TCP connections to be compressed at any time. Each entry contains an 8-bit connection ID (hence the limit of 256), some flags, and the complete uncompressed IP/TCP header from the most recent datagram. The 96-bit socket pair that uniquely identifies each connection (the source and destination IP addresses and source and destination TCP ports) is contained in this uncompressed header. Figure 29.29 shows an example of these tables.
Since a TCP connection is full duplex, header compression can be applied in both directions. Each end must implement both compression and decompression. A connection appears in both tables, as shown in Figure 29.29. In this example, the entry with a connection ID of 1 in the top two tables has a source IP address of 128.1.2.3, source TCP port of 1500, destination IP address of 192.3.4.5, and a destination TCP port of 25. The entry with a connection ID of 2 in the bottom two tables is for the other direction of the same connection.
We show these tables as arrays, but the source code defines each entry as a structure, and a connection table is a circular linked list of these structures. The most recently used structure is stored at the head of the list.
By saving the most recent uncompressed header at each end, only the differences in various header fields from the previous datagram to the current datagram are transmitted across the link (along with a special first byte indicating which fields follow). Since some header fields don’t change at all from one datagram to the next, and other header fields change by small amounts, this differential coding provides the savings. Header compression works with the IP and TCP headers only—the data contents of the TCP segment are not modified.
Figure 29.30 shows the steps involved at the sending side when it has an IP datagram to send across a link using header compression.
Three different types of datagrams are sent and must be recognized at the receiver:
1. Type IP is specified with the high-order 4 bits of the first byte equal to 4. This is the normal IP version number in the IP header (Figure 8.8). The normal, uncompressed datagram is transmitted across the link.
2. Type COMPRESSED_TCP is specified by setting the high-order bit of the first byte. This looks like an IP version between 8 and 15 (i.e., the remaining 7 bits of this byte are used by the compression algorithm). The compressed header and uncompressed data are transmitted across the link, as we describe later in this section.
3. Type UNCOMPRESSED_TCP is specified with the high-order 4 bits of the first byte equal to 7. The normal, uncompressed datagram is transmitted across the link, but the IP protocol field (which equals 6 for TCP) is replaced with the connection ID. This identifies the connection state table entry for the receiver.
The receiver can identify the datagram type by examining its first byte. The code that does this was shown in Figure 5.13. In Figure 5.16 the sender calls sl_compress_tcp to check if a TCP segment is compressible, and the return value of this function is logically ORed into the first byte of the datagram.
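The receiver's test on the first byte can be sketched as follows. This is an illustration of the three rules just listed, not the code from Figure 5.13; the enumeration and the function vj_classify are hypothetical names, and any first byte that matches none of the three types is reported as an error.

```c
#include <stdint.h>

/* Hypothetical names for the three datagram types plus an error case. */
enum vj_type {
    VJ_IP,
    VJ_UNCOMPRESSED_TCP,
    VJ_COMPRESSED_TCP,
    VJ_ERROR
};

/* Classify a received datagram from its first byte. */
static enum vj_type
vj_classify(uint8_t first)
{
    if (first & 0x80)           /* high-order bit set */
        return VJ_COMPRESSED_TCP;
    switch (first >> 4) {       /* high-order 4 bits */
    case 4:
        return VJ_IP;           /* normal IP version number */
    case 7:
        return VJ_UNCOMPRESSED_TCP;
    default:
        return VJ_ERROR;
    }
}
```

A first byte of 0x45 (IP version 4, header length of five 32-bit words) is type IP, while 0x75 is the same datagram marked as UNCOMPRESSED_TCP.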
Figure 29.31 shows an illustration of the first byte that is sent across the link.
The 4 bits shown as “-” comprise the normal IP header length field. The 7 bits shown as C, I, P, S, A, W, and U indicate which optional fields follow. We describe these fields shortly.
Figure 29.32 shows the complete IP datagram for the various datagrams that are sent.
We show two datagrams with a type of IP: one that is not a TCP segment (e.g., a protocol of UDP, ICMP, or IGMP), and one that is a TCP segment. This is to illustrate the differences between the TCP segment sent as type IP and the TCP segment sent as type UNCOMPRESSED_TCP: the first 4 bits are different, as is the protocol field in the IP header.
Datagrams are not candidates for header compression if the protocol is not TCP, or if the protocol is TCP but any one of the following conditions is true.
1. The datagram is an IP fragment: either the fragment offset is nonzero or the more-fragments bit is set.
2. Any one of the SYN, FIN, or RST flags is set.
3. The ACK flag is not set.
If any one of these three conditions is true, the datagram is sent as type IP.
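These tests can be written as a short predicate. The sketch below is not the code from sl_compress_tcp; the function name vj_candidate is hypothetical, and it assumes the IP fragment field has already been converted to host byte order.

```c
#include <stdint.h>

/* TCP header flag bits. */
#define TH_FIN  0x01
#define TH_SYN  0x02
#define TH_RST  0x04
#define TH_PUSH 0x08
#define TH_ACK  0x10

/* IP fragment field: more-fragments bit and fragment offset mask. */
#define IP_MF      0x2000
#define IP_OFFMASK 0x1fff

/*
 * Returns 1 if the datagram is a candidate for header compression,
 * 0 if it must be sent as type IP. ip_off is in host byte order.
 */
static int
vj_candidate(uint8_t proto, uint16_t ip_off, uint8_t th_flags)
{
    if (proto != 6)                             /* not TCP */
        return 0;
    if (ip_off & (IP_MF | IP_OFFMASK))          /* IP fragment */
        return 0;
    if (th_flags & (TH_SYN | TH_FIN | TH_RST))  /* connection control */
        return 0;
    if ((th_flags & TH_ACK) == 0)               /* ACK flag not set */
        return 0;
    return 1;
}
```

A datagram that passes this predicate may still be sent as type UNCOMPRESSED_TCP, depending on how its header differs from the one saved for the connection.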
Furthermore, even if the datagram is a TCP segment that looks compressible, it is possible to abort the compression and send the datagram as type UNCOMPRESSED_TCP if certain fields have changed between the current datagram and the last datagram sent for this connection. These are fields that normally do not change for a given connection, so the compression scheme was not designed to encode their differences from one datagram to the next. The TOS field and the don't fragment bit are examples. Also, when the differences in some fields are greater than 65535, the compression algorithm fails and the datagram is sent uncompressed.
We now describe how the fields in the IP and TCP headers, shown in Figure 29.33, are compressed. The shaded fields normally don’t change during a connection.
If any of the shaded fields have changed from the previous segment on this connection to the current segment, the segment is sent uncompressed. We don’t show IP options or TCP options in this figure, but if either are present and have changed from the previous segment, the segment is sent uncompressed (Exercise 29.7).
If the algorithm transmitted only the nonshaded fields when the shaded fields do not change from the previous segment, about a 50% savings would result. VJ header compression does even better than this, by knowing which fields in the IP and TCP headers normally don’t change. Figure 29.34 shows the format of the compressed IP/TCP header.
The smallest compressed header consists of 3 bytes: the first byte (the flag bits) followed by the 16-bit TCP checksum. For protection against possible link errors, the TCP checksum is always transmitted without any change. (SLIP provides no link-layer checksum, although PPP does provide one.)
The other six fields, connid, urgoff, Δwin, Δack, Δseq, and Δipid, are optional. We show the number of bytes used to encode all the fields to the left of the field in Figure 29.34. The largest compressed header appears to be 19 bytes, but we’ll see shortly that the 4 bits SAWU can never be set at the same time in a compressed header, so the largest size is actually 16 bytes.
Six of the 7 bits in the first byte specify which of the six optional fields are present. The high-order bit of the first byte is always set to 1. This identifies the datagram type as COMPRESSED_TCP. Figure 29.35 summarizes the 7 bits, which we now describe.
Table 29.35. The 7 bits in the compressed header.
Flag bit | Description | Structure member | Meaning if flag = 0 | Meaning if flag = 1 |
---|---|---|---|---|
C | connection ID | | same connection ID as last | connid = connection ID |
I | IP identification | | increased by 1 | Δipid = current − previous |
P | TCP push flag | | PSH flag off | PSH flag on |
S | TCP sequence# | | same | Δseq = current − previous |
A | TCP acknowledgment# | | same | Δack = current − previous |
W | TCP window | | same | Δwin = current − previous |
U | TCP urgent offset | | URG flag not set | urgoff = urgent offset |
C | If this bit is 0, this segment has the same connection ID as the previous compressed or uncompressed segment. If this flag is 1, connid is the connection ID, a value between 0 and 255. |
I | If this bit is 0, the IP identification field has increased by 1 (the typical case). If this bit is 1, Δipid is the current value of the IP identification field minus its previous value. |
P | This bit is a copy of the PSH flag from the TCP segment. Since the PSH flag doesn’t follow any established pattern, it must be explicitly specified for each segment. |
S | If this bit is 0, the TCP sequence number has not changed. If this bit is 1, Δseq is the current value of the TCP sequence number minus its previous value. |
A | If this bit is 0, the TCP acknowledgment number has not changed (the typical case). If this bit is 1, Δack is the current value of the TCP acknowledgment number minus its previous value. |
W | If this bit is 0, the TCP window has not changed (the typical case). If this bit is 1, Δwin is the current value of the TCP window minus its previous value. |
U | If this bit is 0, the URG flag in the segment is not set and the urgent offset has not changed from its previous value (the typical case). If this bit is 1, urgoff is the current value of the urgent offset. |
The differences are encoded as the current value minus the previous value, because most of these differences will be small positive numbers (with Δwin being an exception) given the way these fields normally change.
We note that five of the optional fields in Figure 29.34 are encoded in 0, 1, or 3 bytes.
0 bytes: | If the corresponding flag is not set, nothing is encoded for the field. |
1 byte: | If the value to send is between 1 and 255, a single byte encodes the value. |
3 bytes: | If the value to send is either 0 or between 256 and 65535, 3 bytes encode the value: the first byte is 0, followed by the 2-byte value. This always works for the three 16-bit values, urgoff, Δwin, and Δipid; but if the difference to encode for the two 32-bit values, Δack and Δseq, is less than 0 or greater than 65535, the segment is sent uncompressed. |
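The 1-byte and 3-byte encodings just described can be sketched as follows. This is a minimal illustration, not the Net/3 code: the function names `encode_field`, `decode_field`, and `delta32_encodable` are hypothetical, and the order of the 2-byte value (high-order byte first) is an assumption that the text above does not state. The 0-byte case is simply the case in which the flag is off and the encoder is never called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one optional field whose flag bit is set.
 * Returns the number of bytes written to out (1 or 3). */
size_t encode_field(uint16_t value, uint8_t *out)
{
    assert(out != NULL);
    if (value >= 1 && value <= 255) {
        out[0] = (uint8_t)value;
        return 1;
    }
    out[0] = 0;                        /* escape: a 2-byte value follows */
    out[1] = (uint8_t)(value >> 8);    /* assumed: high-order byte first */
    out[2] = (uint8_t)(value & 0xff);
    return 3;
}

/* Decode one field; returns the number of bytes consumed (1 or 3). */
size_t decode_field(const uint8_t *in, uint16_t *value)
{
    assert(in != NULL && value != NULL);
    if (in[0] != 0) {
        *value = in[0];
        return 1;
    }
    *value = (uint16_t)((in[1] << 8) | in[2]);
    return 3;
}

/* For the 32-bit fields (delta-ack, delta-seq): the difference can be
 * encoded only if 0 <= cur - prev <= 65535. Unsigned subtraction turns
 * a negative difference into a large value, so one comparison suffices. */
int delta32_encodable(uint32_t cur, uint32_t prev)
{
    return (uint32_t)(cur - prev) <= 65535u;
}
```

A difference of 256, for example, does not fit in the 1-byte form and is therefore sent as the 3 bytes 0, 1, 0 under the byte-order assumption above.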
If we compare the nonshaded fields in Figure 29.33 with the possible fields in Figure 29.34 we notice that some fields are never transmitted.
The IP total length field is not transmitted since most link layers provide the length of a received message to the receiver.
Since the only field in the IP header that is being transmitted is the identification field, the IP checksum is also omitted. This is a hop-by-hop checksum that protects only the IP header across any given link.
Two common cases are detected and transmitted as special combinations of the 4 low-order bits: SAWU. Since urgent data is rare, if the URG flag in the segment is set and both the sequence number and window also change (implying that the 4 low-order bits would be 1011 or 1111), the segment is sent uncompressed. Therefore if the 4 low-order bits are sent as 1011 (called *SA) or 1111 (called *S), the following two special cases apply:
*SA | The sequence number and acknowledgment number both increase by the amount of data in the last segment, the window and urgent offset don’t change, and the URG flag is not set. This special case avoids encoding both Δseq and Δack. This case occurs frequently for both directions of echoed terminal traffic. Figures 19.3 and 19.4 of Volume 1 give examples of this type of data flow across an Rlogin connection. |
*S | The sequence number changes by the amount of data in the last segment, the acknowledgment number, window, and urgent offset don’t change, and the URG flag is not set. This special case avoids encoding Δseq. This case occurs frequently for the sending side of a unidirectional data transfer (e.g., FTP). Figures 20.1, 20.2, and 20.3 of Volume 1 give examples of this type of data transfer. This case also occurs for the sender of nonechoed terminal traffic (e.g., commands that are not echoed by a full-screen editor). |
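The handling of the 4 low-order bits can be sketched as a small function. This is an illustration under stated assumptions, not the Net/3 code: the bit positions follow the SAWU order given above (S is the high-order bit of the four), and the structure `seg_change`, the function `low_four_bits`, and the constant names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIT_S 0x08           /* sequence number changed */
#define BIT_A 0x04           /* acknowledgment number changed */
#define BIT_W 0x02           /* window changed */
#define BIT_U 0x01           /* URG flag set */
#define SPECIAL_SA 0x0b      /* 1011: the *SA special case */
#define SPECIAL_S  0x0f      /* 1111: the *S special case */
#define SEND_UNCOMPRESSED (-1)

/* Hypothetical summary of how this segment differs from the last one. */
struct seg_change {
    uint32_t delta_seq;      /* current - previous sequence number */
    uint32_t delta_ack;      /* current - previous acknowledgment number */
    int      win_changed;    /* nonzero if the window changed */
    int      urg_set;        /* nonzero if the URG flag is set */
    uint32_t prev_data_len;  /* bytes of data in the last segment */
};

/* Returns the 4 low-order flag bits to send, or SEND_UNCOMPRESSED. */
int low_four_bits(const struct seg_change *c)
{
    int bits = 0;

    assert(c != NULL);
    if (c->delta_seq)   bits |= BIT_S;
    if (c->delta_ack)   bits |= BIT_A;
    if (c->win_changed) bits |= BIT_W;
    if (c->urg_set)     bits |= BIT_U;

    /* A genuine 1011 or 1111 would be mistaken for a special case. */
    if (bits == SPECIAL_SA || bits == SPECIAL_S)
        return SEND_UNCOMPRESSED;

    if (bits == (BIT_S | BIT_A) &&
        c->delta_seq == c->prev_data_len &&
        c->delta_ack == c->prev_data_len)
        return SPECIAL_SA;   /* neither delta-seq nor delta-ack is sent */

    if (bits == BIT_S && c->delta_seq == c->prev_data_len)
        return SPECIAL_S;    /* delta-seq is not sent */

    return bits;
}
```

For example, a segment of echoed terminal traffic in which both numbers advance by the 1 byte carried in the last segment yields 1011, while a segment whose sequence number advances by some amount other than the last segment’s data length yields the ordinary value 1000 and Δseq must be encoded explicitly.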
Two simple examples were run across the SLIP link between the systems bsdi and slip in Figure 1.17. This SLIP link uses header compression in both directions. The tcpdump program described in Appendix A of Volume 1 was also run on the host bsdi to save a copy of all the frames. This program has an option that outputs the compressed header, showing all the fields in Figure 29.34.
Two traces were obtained: a short portion of an Rlogin connection and a file transfer from bsdi to slip using FTP. Figure 29.36 shows a summary of the different frame types for both connections.
The two entries of 75 verify our claim that the *SA special case often occurs for both directions of echoed terminal traffic. The entry of 325 verifies our claim that the *S special case occurs frequently for the sending side of a unidirectional data transfer.
The 10 frames of type IP for the FTP example correspond to four segments with the SYN flag set and six segments with the FIN flag set. FTP uses two connections: one for the interactive commands and one for the file transfer.
The UNCOMPRESSED_TCP frame types normally correspond to the first segment following connection establishment, the one that establishes the connection ID. An additional few are seen in these examples when the type of service is set (the Net/3 Rlogin and FTP clients and servers all set the TOS field after the connection is established).
Figure 29.37 shows the distribution of the compressed-header sizes. The average sizes of the compressed header for the final four columns in Figure 29.37 are 3.1, 4.1, 6.0, and 3.3 bytes, a significant savings compared to the uncompressed 40-byte headers, especially for the interactive connection.
Almost all of the 325 6-byte headers in the FTP input column contain only a Δack of 256, which being greater than 255 is encoded in 3 bytes. The SLIP MTU is 296, so TCP uses an MSS of 256. Almost all of the 250 3-byte headers in the FTP output column contain the *S special case (sequence number change only) with a change of 256 bytes. But since this change refers to the amount of data in the previous segment, nothing is transmitted other than the flag byte and the TCP checksum. The 78 4-byte headers in the FTP output column are this same special case, but with a change in the IP identification field also (Exercise 29.8).
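The header sizes quoted above follow directly from the encoding rules: 1 flag byte, 2 bytes of TCP checksum, 1 byte for the connection ID if present, and 1 or 3 bytes for each optional field that is actually transmitted. The following sketch does this arithmetic. The structure `sent_fields`, the `FLD_` constants, and the function `compressed_header_size` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Indices of the five optional fields encoded in 1 or 3 bytes. */
enum { FLD_URGOFF, FLD_DWIN, FLD_DACK, FLD_DSEQ, FLD_DIPID, FLD_COUNT };

/* Hypothetical description of what one compressed header carries. */
struct sent_fields {
    int      has_connid;          /* C bit set: 1-byte connection ID */
    int      present[FLD_COUNT];  /* nonzero if the field is transmitted */
    uint16_t value[FLD_COUNT];    /* value carried by each present field */
};

/* 1 byte for values 1..255, otherwise 3 bytes (0 or 256..65535). */
static size_t field_size(uint16_t v)
{
    return (v >= 1 && v <= 255) ? 1 : 3;
}

size_t compressed_header_size(const struct sent_fields *f)
{
    size_t n = 1 + 2;             /* flag byte + TCP checksum */
    int i;

    assert(f != NULL);
    if (f->has_connid)
        n += 1;
    for (i = 0; i < FLD_COUNT; i++)
        if (f->present[i])
            n += field_size(f->value[i]);
    return n;
}
```

With only a Δack of 256 present, the function returns 6; with no optional fields (the *S special case, where Δseq is implied and not sent) it returns 3; and adding a small Δipid to that case gives 4. These match the 6-byte, 3-byte, and 4-byte headers discussed above.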
Header compression must be enabled on a given SLIP or PPP link. With a SLIP link there are normally two flags that can be set when the interface is configured: enable header compression and autoenable header compression. These two flags are set using the link0 and link2 flags to the ifconfig command, respectively. Normally a client (the dialin host) decides whether to use header compression or not. The server (the host or terminal server to which the client dials in) specifies the autoenable flag only. If header compression is enabled by the client, its TCP will send a datagram of type UNCOMPRESSED_TCP to specify the connection ID. When the server sees this packet it enables header compression (since it was in the autoenable mode). If the server never sees this type of packet, it never enables header compression for this line.
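The autoenable behavior amounts to a one-way switch on the receiving side. A minimal sketch, assuming a hypothetical `slip_link` structure with one boolean per flag (the actual Net/3 interface flags and receive path are not shown here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical frame types as seen by the receive path. */
enum rx_type { RX_IP, RX_UNCOMPRESSED_TCP, RX_COMPRESSED_TCP };

/* Hypothetical per-link state: one boolean per configuration flag. */
struct slip_link {
    bool compress;    /* "enable header compression" */
    bool autoenable;  /* "autoenable header compression" */
};

/* Called for each received frame: in autoenable mode, the first
 * UNCOMPRESSED_TCP frame turns compression on for this line. */
void slip_frame_received(struct slip_link *link, enum rx_type type)
{
    assert(link != NULL);
    if (link->autoenable && type == RX_UNCOMPRESSED_TCP)
        link->compress = true;
}
```

A server configured with only the autoenable flag therefore keeps sending uncompressed headers until the client shows, by sending a frame of type UNCOMPRESSED_TCP, that it is using compression.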
PPP allows the negotiation of options between the two ends of the link when the link is established. One of the options that can be negotiated is whether to use header compression or not.
This chapter completes our detailed look at TCP input processing. We started with the processing of an ACK in the SYN_RCVD state, which completes a passive open, a simultaneous open, or a self-connect.
The fast retransmit algorithm lets TCP detect a dropped segment after receiving a specified number of consecutive duplicate ACKs and retransmit the segment before the retransmission timer expires. Net/3 combines the fast retransmit algorithm with the fast recovery algorithm, which tries to keep the data flowing from the sender to the receiver, albeit at a slower rate, using congestion avoidance but not slow start.
ACK processing then discards the acknowledged data from the socket’s send buffer and handles a few TCP states specially, when the receipt of an ACK changes the connection state.
The URG flag is processed, if set, and TCP’s urgent mode is mapped into the socket abstraction of out-of-band data. This is complicated because the process can receive the out-of-band byte inline or in a special out-of-band buffer, and TCP can receive urgent notification before the data byte referenced by the urgent pointer has been received.
TCP input processing completes by calling TCP_REASS to merge the received data into either the socket’s receive buffer or the socket’s out-of-order queue, processing the FIN flag, and calling tcp_output if a segment must be generated in response to the received segment.
TCP header compression is a technique used on SLIP and PPP links to reduce the size of the IP and TCP headers from the normal 40 bytes to around 3-6 bytes (typically). This is done by recognizing that most fields in these headers don’t change from one segment to the next on a given connection, and the fields that do change often change by a small amount. This allows a flag byte to be sent indicating which fields have changed, and the changes are encoded as differences from the previous segment.
29.1 | A client connects to a server and no segments are lost. Which process, the client or server, completes its open of the connection first? |
29.1 | Assume a 2-second RTT. The server has a passive open pending and the client issues its active open at time 0. The server receives the SYN at time 1 and responds with its own SYN and an ACK of the client’s SYN. The client receives this segment at time 2, and the code in Figure 28.20 completes the active open with the call to |
29.2 | A Net/3 system receives a SYN for a listening socket and the SYN segment also contains 50 bytes of data. What happens? |
29.2 | Assume the sequence number of the SYN is 1000 and the 50 bytes of data are numbered 1001–1050. When the SYN is processed by |
29.3 | Continue the previous exercise assuming that the client does not retransmit the 50 bytes of data; instead the client responds with a segment that acknowledges the server’s SYN/ACK and contains a FIN. What happens? |
29.3 | The server’s socket is in the SYN_RCVD state when the client’s ACK/FIN arrives, so In this example six segments requiring three round trips are required to pass the 50 bytes from the client to server. To reduce the number of segments requires the TCP extensions for transactions [Braden 1994]. |
29.4 | A Net/3 client performs an active open to a listening server. The server’s response to the client’s SYN is a segment with the expected SYN/ACK, but the segment also contains 50 bytes of data and the FIN flag. List the processing steps for the client’s TCP. |
29.4 | The client’s socket is in the SYN_SENT state when the server’s response is received. Figure 28.20 processes the segment and moves the connection to the ESTABLISHED state. A jump is made to |
29.5 | Figure 18.19 in Volume 1 and Figure 14 in RFC 793 both show four segments exchanged during a simultaneous close. But if we trace a simultaneous close between two Net/3 systems, or if we watch the close sequence following a self-connect on a Net/3 system, we see six segments, not four. The extra two segments are a retransmission of the FIN by each end when the other’s FIN is received. Where is the bug and what is the fix? |
29.5 | The bug is in the entry |
29.6 | Page 72 of RFC 793 says that when data in the send buffer is acknowledged by the other end “Users should receive positive acknowledgments for buffers which have been sent and fully acknowledged (i.e., send buffer should be returned with ‘ok’ response).” Does Net/3 provide this notification? |
29.6 | No. An OK return from a write system call only means the data has been copied into the socket buffer. Net/3 does not notify the process when that data is acknowledged by the other end. An application-level acknowledgment is required to obtain this information. |
29.7 | What effect do the options defined in RFC 1323 have on TCP header compression? |
29.7 | RFC 1323 timestamps defeat header compression because whenever the timestamps change, the TCP options change, and the segment is sent uncompressed. The window scale option has no effect because the value in the TCP header is still a 16-bit value. |
29.8 | What effect does the Net/3 assignment of the IP identification field have on TCP header compression? |
29.8 | IP assigns the ID field from a global variable that is incremented each time any IP datagram is sent. This increases the probability that two consecutive TCP segments sent on the same connection will have ID values that differ by more than 1. A difference other than 1 causes the Δipid field in Figure 29.34 to be transmitted, increasing the size of the compressed header. A better scheme would be for TCP to maintain its own counter for assigning IDs. |