Chapter 5

Congestion Control in High-Speed Networks

This chapter discusses congestion control in high-speed networks with long latencies. One of the consequences of the application of control theory to TCP congestion control was the realization that TCP Reno was inherently unstable as the delay-bandwidth product of the network became large or even for very large bandwidths. As a result of this, a number of new congestion control designs were suggested with the objective of solving this problem such as High-Speed TCP (HSTCP), TCP Binary Increase Congestion control (BIC), TCP CUBIC, and Compound TCP (CTCP), which are described in this chapter. Currently, TCP CUBIC serves as the default congestion control algorithm for Linux servers and as a result is as widely deployed as TCP Reno. CTCP is used as the default option for Windows servers and is also very widely deployed. We also describe the eXpress Control Protocol (XCP) and Rate Control Protocol (RCP) algorithms that have been very influential in the design of high-speed congestion control protocols. Finally, we make connections between the algorithms in this chapter and the stability theory from Chapter 3 and give some general guidelines to be used in the design of high-speed congestion control algorithms.

Keywords

Congestion control in high-speed networks; TCP over high-speed links; TCP Response Function; HSTCP; High-Speed TCP; TCP BIC; TCP CUBIC; CTCP; Compound TCP; FAST TCP; XCP Congestion Control; RCP Congestion Control; Scalable TCP; YeAH-TCP; TCP Africa; TCP Illinois

5.1 Introduction

Link speeds have exploded in the past two decades, from tens of megabits to multiple gigabits per second, for both local and long-distance networks. The original congestion control algorithms for TCP, such as Reno and Tahoe, worked well at lower link speeds but did not scale to higher speeds and longer propagation delays, for several reasons:

• In Chapter 3, Section 3.4, by using a control theory framework, it was shown that TCP Reno becomes unstable if either the link capacity, the propagation delay, or both become large. This results in severe oscillations of the transmission rates and buffer occupancies, which has been observed in practice in high-speed networks.

• Consider the following scenario: The link capacity is 10 Gbps, and the Round Trip Latency (RTT) is 100 ms, so the delay bandwidth product is about 100,000 packets (for a packet size of 1250 bytes). For TCP to grow its window from the midpoint (after a multiplicative decrease) to full window size during the congestion avoidance phase, it will require 50,000 RTTs, which is about 5000 seconds (1.4 hours). If a connection finishes before then, the link will be severely underutilized. Hence, when we encounter a combination of high capacity and large end-to-end delay, the multiplicative decrease policy on packet loss is too drastic, and the linear increase is not quick enough.

• The square-root formula for TCP throughput (see Chapter 2) shows that for a round trip latency of T, the packet drop rate has to be less than $1.5/(RT)^2$ to sustain an average throughput of R. Hence, using the same link and TCP parameters as above, the packet drop rate has to be less than about $10^{-10}$ to sustain a throughput of 10 Gbps. This is much smaller than the drop rate that can be supported by most links.
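These numbers are easy to verify. The sketch below (using the same assumed scenario from the bullets: a 10-Gbps link, 100-ms RTT, and 1250-byte packets) computes the delay-bandwidth product, the post-loss recovery time, and the drop rate demanded by the square-root formula:

```python
# Back-of-the-envelope check of the scenario above (assumed values:
# 10 Gbps link, 100 ms RTT, 1250-byte packets).
C_bps = 10e9            # link capacity, bits/sec
T = 0.1                 # round trip time, seconds
pkt_bytes = 1250

bdp_pkts = C_bps * T / (8 * pkt_bytes)   # delay-bandwidth product in packets
ramp_secs = (bdp_pkts / 2) * T           # linear growth from W/2 to W, 1 pkt/RTT

R = bdp_pkts / T                         # target throughput, packets/sec
p_max = 1.5 / (R * T) ** 2               # square-root formula solved for p

print(f"BDP = {bdp_pkts:,.0f} packets")
print(f"Post-loss recovery time = {ramp_secs:,.0f} s ({ramp_secs / 3600:.1f} hours)")
print(f"Required drop rate < {p_max:.1e}")
```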

A number of alternate congestion control algorithms for high-speed networks were suggested subsequently to solve this problem, including Binary Increase Congestion control (BIC) [1], CUBIC [2], High-Speed TCP (HSTCP) [3], ILLINOIS [4], Compound TCP (CTCP) [5], YEAH [6], and FAST TCP [7]. The Linux family of operating systems offers users a choice of one of these algorithms to use (in place of TCP Reno) while using TCP CUBIC as the default algorithm. The Windows family, on the other hand, offers a choice between Reno and CTCP. As a result of the wide usage of Linux on web servers, it has recently been estimated [8] that about 45% of the servers on the Internet use either BIC or CUBIC, about 30% still use TCP Reno, and another 20% use CTCP.

In this chapter, we provide a description and analysis of the most popular high-speed congestion control algorithms, namely HSTCP, BIC, CUBIC, and CTCP. Each of these algorithms modifies the TCP window increase and decrease rules in a way such that the system is able to make better utilization of high-speed and long-distance links. We also discuss two other protocols, namely eXpress Control Protocol (XCP) and Rate Control Protocol (RCP). These need multi-bit network feedback, which has prevented their deployment in Internet Protocol (IP) networks. However, several interesting aspects of their design have influenced subsequent protocol design. Whereas XCP and RCP approach the high-speed congestion control problem by using an advanced Active Queue Management (AQM) scheme, the other algorithms mentioned are server based and designed to work with regular tail-drop or Random Early Detection (RED) routers.

The rest of this chapter is organized as follows: Section 5.2 introduces the concept of the Response Function and its utility in designing and analyzing congestion control algorithms, Sections 5.3 to 5.8 discuss the design and analysis of a number of high-speed congestion control algorithms, including HSTCP (Section 5.3), TCP BIC (Section 5.4.1), TCP CUBIC (Section 5.4.2), Compound TCP (Section 5.5), FAST TCP (Section 5.6), XCP (Section 5.7), and RCP (Section 5.8). Section 5.9 contains some observations on the stability of these algorithms.

An initial reading of this chapter can be done in the sequence 5.1→5.2→5.3 followed by either 5.4 or 5.5 or 5.6 or 5.7 or 5.8 (depending on the particular algorithm the reader is interested in)→5.9.

5.2 Design Issues for High-Speed Protocols

In networks with large link capacities and propagation delays, the congestion control protocol needs to satisfy the following requirements:

1. It should be able to make efficient use of the high-speed link without requiring unrealistically low packet loss rates. As pointed out in the introduction, this is not the case for TCP Reno, because of its conservative window increase and aggressive window decrease rules.

2. In the case of very high-speed links, the requirement of interprotocol fairness between the high-speed TCP protocol and regular TCP is relaxed because regular TCP is not able to make full use of the available bandwidth. Hence, in this situation, it is acceptable for the high-speed TCP variants to behave more aggressively than standard TCP.

3. If connections with different round trip latencies share a link, then they should exhibit good intraprotocol fairness.

Define the Response Function w of a congestion control algorithm as $w = R_{avg}T$ packets per round trip time (ppr), where $R_{avg}$ is the average throughput and T is the minimum round trip latency. This is sometimes also referred to as the average window because it is the average number of packets transmitted per round trip time. A basic tool used to evaluate high-speed designs is the log-log plot of the Response Function versus the packet drop rate p (Figure 5.1). The ability of a protocol to use the large amount of bandwidth available in high-speed networks is determined by whether it can sustain high sending rates under low loss rates; from Figure 5.1, it follows that a protocol becomes more scalable if its sending rate is higher under lower loss rates.

Figure 5.1 Comparison of TCP Reno with additive increase/multiplicative decrease (AIMD) (32, 0.125) and High-Speed TCP (HSTCP).

From Chapter 2, the Response Function for TCP Reno is given by

$$w = \sqrt{\frac{3}{2p}} \quad\text{or}\quad \log w = 0.09 - 0.5\log p \qquad (1)$$

This plot is shown in Figure 5.1, along with the plot for the Response Function of some high-speed algorithms.

• Additive increase/multiplicative decrease AIMD(32, 0.125) corresponds to an AIMD algorithm (introduced in Section 2.3.1 of Chapter 2) with an additive increase of 32 packets and a multiplicative decrease factor of 0.125. From equation 72 in Chapter 2, the response function for AIMD(32, 0.125) is given by

$$\log w = 1.19 - 0.5\log p \qquad (2)$$

Hence, AIMD(32, 0.125) has the same slope as Reno on the log-log plot but achieves a higher window size at both high and low link speeds.

• The HSTCP response function is derived in Section 2.3 (equation 8), and is given by

$$\log w = \begin{cases} -0.82 - 0.82\log p & \text{if } p < 0.0015 \\ 0.09 - 0.5\log p & \text{otherwise} \end{cases} \qquad (3)$$

This response function coincides with that of Reno for larger values of the packet drop rate, but for smaller values, it exhibits a steeper slope, which explains its better performance in high-speed links.
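The three response functions above (equations 1 to 3) can be tabulated directly; the following sketch reproduces the comparison behind Figure 5.1, with the HSTCP crossover at p = 0.0015 as given in the text:

```python
import math

def w_reno(p):
    return math.sqrt(1.5 / p)                      # equation 1

def w_aimd(p, a=32, b=0.125):
    return math.sqrt(a * (2 - b) / (2 * b * p))    # equation 2

def w_hstcp(p):
    # equation 3: steeper slope below the crossover drop rate of 0.0015
    return 0.15 / p ** 0.82 if p < 0.0015 else w_reno(p)

for p in (1e-2, 1e-4, 1e-7):
    print(f"p={p:.0e}: Reno={w_reno(p):,.0f}  "
          f"AIMD(32,0.125)={w_aimd(p):,.0f}  HSTCP={w_hstcp(p):,.0f} ppr")
```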

Even though Figure 5.1 plots the response function versus the packet drop rate, it also implies a relationship between the response function and the link capacity as we show next. We will consider the case when all packet drops are due to buffer overflows, so from equation 53 in Chapter 2, it follows that the number of packets N transmitted during a single cycle for TCP Reno is given by

$$N = \frac{3W_m^2}{8}$$

where $W_m$ is the maximum window size. But $W_m \approx CT$ if we ignore the contribution caused by the buffer occupancy. Hence, it follows that the packet drop probability p is given by

$$p = \frac{1}{N} = \frac{8}{3W_m^2} = \frac{8}{3(CT)^2}$$

which implies that the packet drop probability is inversely proportional to the square of the link speed. This is also illustrated in Figure 5.2, which shows the frequency of packet drops decreasing as the window size increases for two connections that pass through links with different capacities but with the same round trip latency. Hence, it follows that small values of p correspond to a large link capacity and vice versa.

Figure 5.2 Illustrating the relation between packet drop rate and link speed.
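A quick tabulation of this relation (assuming a 100-ms RTT and 1250-byte packets, as in the earlier example) makes the point of Figure 5.2 concrete:

```python
# Drop rate implied by buffer-overflow losses, p = 8 / (3 (CT)^2),
# for two link speeds at the same RTT (the point made by Figure 5.2).
T = 0.1                         # RTT, seconds
pkt_bits = 8 * 1250
for C_bps in (100e6, 10e9):
    W_m = C_bps * T / pkt_bits  # Wm ~ CT, in packets
    p = 8 / (3 * W_m ** 2)
    print(f"C = {C_bps / 1e9:g} Gbps: Wm = {W_m:,.0f} pkts, p = {p:.1e}")
```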

Note the following features for the graph in Figure 5.1:

1. At small values of p (or equivalently for very high-speed links), it is desirable that the Response Function for the high-speed protocol be much larger than for TCP Reno because this implies that for a given interval between packet drops (which is the inverse of the packet loss rate), the congestion window can grow to a much larger value compared with Reno. Note that this implies that if Reno and the high-speed protocol compete for bandwidth in a high-speed link, the latter will end up with a much larger portion of the bandwidth (i.e., the system will not be fair). However, this is deemed to be tolerable because the high-speed protocol is taking up link bandwidth that cannot be used by TCP Reno anyway.

2. At larger values of p (or equivalently, lower speed links), it is desirable for TCP friendliness that the point at which the two curves cross over lie as far to the left as possible. The reason is that for lower speed links, the congestion control algorithm should sustain smaller window sizes, which is better suited to coexisting with TCP Reno without taking away bandwidth from it. To do this, the Response Function for the high-speed algorithm should be equal to or lower than that of TCP Reno after the two curves cross over.

3. To investigate the third requirement regarding intraprotocol RTT fairness, we can make use of the following result that was derived in Chapter 2:
Assume that the throughput of a loss based congestion control protocol with constants A, e, d is given by

$$R = \frac{A}{T^e p^d} \text{ packets/s}$$

Then the throughput ratio between two flows with round trip latencies T1 and T2, whose packet losses are synchronized, is given by

$$\frac{R_1}{R_2} = \left(\frac{T_2}{T_1}\right)^{\frac{e}{1-d}} \qquad (4)$$

Note that as d increases, the slope of the response function and RTT unfairness both increase (i.e., the slope of the response function on the log-log plot determines its RTT unfairness).

Thus, by examining the Response Function graph, one can verify whether the high-speed protocol satisfies all the three requirements mentioned.

To satisfy the first requirement, many high-speed algorithms use a more aggressive version of the Reno additive increase rule. Hence, instead of increasing the window size by one packet per round trip time, they increase it by multiple packets, and the increment size may also be a function of other factors such as the current window size or network congestion. However, to satisfy the second requirement of TCP friendliness, they modify the packet increase rule in one of several ways. For example:

• HSTCP [3] reverts to the TCP Reno packet increment rule when the window size is less than some threshold.

• Compound TCP [5] and TCP Africa [9] keep track of the connections’ queue backlog in the network and switch to a less aggressive packet increment rule when the backlog exceeds a threshold.

There is also a tradeoff between requirements 2 and 3 as follows: Even though it is possible to design a more TCP-friendly protocol by moving the point of intersection with the TCP curve to a lower packet drop rate (i.e., to the left in Figure 5.1), this leads to an increase in the slope of the Response Function line, thus hurting RTT fairness.

5.3 High Speed TCP (HSTCP) Protocol

HSTCP by Floyd et al. [3,10] was one of the first TCP variants designed specifically for high-capacity links with significant propagation delays. The design for HSTCP was based on the plot of the Response Function versus Packet Drop Rate introduced in the previous section and proceeds as follows: The ideal response function for a high-speed congestion control algorithm should have the following properties:

• The HSTCP Response Function should coincide with that of TCP Reno when the packet drop rate exceeds some threshold $P_0$ (or equivalently, when the congestion window is at or below $W_0$ packets). As defaults, HSTCP chooses $P_0 = 0.0015$ and $W_0 = 38$ packets.

• We next choose a large ppr, say, $W_1 = 83{,}000$ packets (which corresponds to a link capacity of 10 Gbps for a packet size of 1500 bytes and round trip time T = 100 ms) and a corresponding packet loss rate of $P_1 = 10^{-7}$. For packet loss rates less than $P_0$, we specify that the HSTCP Response Function should pass through the points $(P_0, W_0)$ and $(P_1, W_1)$ and should also be linear on a log-log scale (Figure 5.3).

Figure 5.3 Comparison of response functions for TCP Reno versus High-Speed TCP (HSTCP).

As a result, the Response Function for HSTCP is given by

$$\log w = S\left(\log p - \log P_0\right) + \log W_0 \qquad (5)$$

where

$$S = \frac{\log W_1 - \log W_0}{\log P_1 - \log P_0} \qquad (6)$$

Equation 5 can be rewritten as

$$w = W_0\left(\frac{p}{P_0}\right)^S \text{ packets} \qquad (7)$$

If we substitute the values of (P0,W0) and (P1,W1) in equation 6, then we obtain S=−0.82, so that the HSTCP Response Function is given by

$$w = \frac{0.15}{p^{0.82}} \text{ packets for } p \le P_0. \qquad (8)$$

We now derive an expression for the AIMD increase–decrease parameters for HSTCP. The window size W in HSTCP uses the following AIMD rules with increase–decrease functions a(w) and b(w) that are functions of the window size (these are implemented on a per-ACK basis):

$$\text{ACK: } W \leftarrow W + \frac{a(W)}{W} \qquad \text{Drop: } W \leftarrow W - b(W)\,W \qquad (9)$$

For the case $W \le W_0$, HSTCP defaults to TCP Reno so that

$$a(W) = 1 \quad\text{and}\quad b(W) = 0.5 \quad\text{for } W \le W_0. \qquad (10)$$

For $W > W_0$, we assume that b(W) varies linearly with respect to log W for $W \in [W_0, W_1]$. With $b(W_0) = 0.5$ and $b(W_1) = B$, where B is the target minimum value of b(W) at $W = W_1$ (Floyd et al. [3] recommend B = 0.1), we obtain

$$b(W) = (B - 0.5)\,\frac{\log W - \log W_0}{\log W_1 - \log W_0} + 0.5 \qquad (11)$$

To obtain the corresponding a(W), we use the following formula for the response function of AIMD congestion control (this was derived in Section 2.3.1 of Chapter 2). We choose a point $W \in [W_0, W_1]$ and compute the corresponding probability P(W) by using equation 5. Substitute (P(W), W) into the following equation

$$W = \sqrt{\frac{a(W)\left(2 - b(W)\right)}{2\,P(W)\,b(W)}} \qquad (12)$$

so that

$$a(W) = \frac{2\,P(W)\,W^2\,b(W)}{2 - b(W)} \qquad (13)$$

Note that there are a couple of heuristics in the use of equation 12: (1) equation 12 is a formula for the response function, not the window size, and (2) equation 12 was derived for the case in which a and b are constants, which is not the case here.

Table 5.1 gives the values of a(W) and b(W) using equations 11 and 13 for some representative values of W between 38 and 84,035.

Table 5.1

Computation of a(W) and b(W)

W a(W) b(W)
38 1 0.5
1058 8 0.33
10661 30 0.21
21013 42 0.17
30975 50 0.15
40808 56 0.14
51258 61 0.13
71617 68 0.11
84035 71 0.1

a(W), defined in equation (13); b(W), defined in equation (11); W, HSTCP Window Size.
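Table 5.1 can be recomputed from equations 5, 6, 11, and 13, as in the sketch below. One caveat: reproducing the published values (to within a unit or two of rounding) requires anchoring equation 5 at a drop rate of $10^{-3}$ for $W_0 = 38$, which is the rate at which Reno's own response function equals 38 ppr and the anchor used in [3], rather than the crossover value 0.0015 quoted earlier; this choice is our inference, not something stated in the text.

```python
import math

W0, P0, W1, P1, B = 38, 1e-3, 83_000, 1e-7, 0.1   # P0 = 1e-3: see note above

S = (math.log(W1) - math.log(W0)) / (math.log(P1) - math.log(P0))  # equation 6

def b(W):
    # equation 11: b(W) interpolated linearly in log W between 0.5 and B
    return (B - 0.5) * (math.log(W) - math.log(W0)) / (math.log(W1) - math.log(W0)) + 0.5

def P(W):
    # equation 5 inverted: the drop rate at which the response function equals W
    return P0 * (W / W0) ** (1 / S)

def a(W):
    # equation 13
    return 2 * P(W) * W ** 2 * b(W) / (2 - b(W))

for W in (38, 1058, 10661, 21013, 40808, 84035):
    print(f"W = {W:6d}   a(W) = {a(W):5.1f}   b(W) = {b(W):.2f}")
```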

Even though HSTCP is not widely used today, it was one of the first high-speed algorithms and served as a benchmark for follow-on designs that aimed to address some of its shortcomings. A significant issue with HSTCP is the fact that the window increment function a(W) grows with W and exceeds 70 for large values of W, as shown in Table 5.1. This implies that at the time when the link buffer approaches capacity, it is subject to a burst of more than 70 packets, all of which can then be dropped. Almost all follow-on high-speed designs reduce the window increase size as the link approaches saturation. HSTCP is more stable than TCP Reno at high speeds because it is less aggressive in reducing its window after a packet loss, which reduces the queue size oscillations at the bottleneck node.

If we apply equation 4 to HSTCP, then substituting e=1, d=0.82, we obtain

$$\frac{R_1}{R_2} = \left(\frac{T_2}{T_1}\right)^{\frac{1}{1-d}} = \left(\frac{T_2}{T_1}\right)^{5.55} \qquad (14)$$

Hence, because of the aggressive window increment policy of HSTCP, combined with small values of the decrement multiplier, connections with a larger round trip latency are at a severe throughput disadvantage. There is a positive feedback loop that causes the connection with the larger latency to lose throughput, which operates as follows: Even if the two connections start with the same window size, the connection with the smaller latency will gain a small advantage after a few round trip delays because it is able to increase its window faster. This causes its window size to increase, which further accelerates the trend. Conversely, the window size of the large-latency connection goes into a downward spiral as reductions in its throughput cause further reductions in its window size.

5.4 TCP BIC and CUBIC

Both TCP BIC and CUBIC came from the High-Speed Networks Research Group at North Carolina State University [1,11]. TCP BIC was invented first and introduced the idea of binary search in congestion control algorithms, and TCP CUBIC was later designed to simplify the design of BIC and improve on some of its weaknesses. Both these protocols have been very successful, and currently TCP CUBIC serves as the default congestion control choice on the Linux operating system. It has been estimated [8] that TCP BIC is used by about 15% of the servers in the Internet, and CUBIC is used by about 30% of the servers.

The main idea behind TCP BIC and CUBIC is to grow the TCP window quickly when the current window size is far from the link saturation point (at which the previous loss happened) and to slow the rate of increase as it gets close to the saturation point (see Figure 5.5 for BIC and Figure 5.8 for CUBIC). This results in a concave increase function, which reduces the rate of increase as the window size grows. The small increments result in a smaller number of packet losses if the window size exceeds the link capacity. This makes BIC and CUBIC very stable and also very scalable. If the available link capacity has increased since the last loss, then both BIC and CUBIC increase the window size exponentially, using a convex function. Because this function shows sublinear increase initially, it also adds to the stability of the algorithm in cases where the link capacity did not increase significantly.

5.4.1 TCP BIC

The objective of BIC’s design is to satisfy the three criteria of RTT fairness, TCP friendliness, and scalability. The algorithm results in a concave response function for BIC, as shown in Figure 5.4, from which we can observe the following:

• Because the BIC response function intersects with that of Reno around the same point as that of HSTCP, it follows that BIC is also TCP friendly at lower link speeds.

• The slope of BIC’s response function at first rises steeply and then flattens out to that of TCP Reno for larger link speeds. This implies that for high-speed links, BIC exhibits the RTT fairness property.

• Because of the fast initial rise in its response function, BIC scales well with link speeds.

Figure 5.4 The concavity of the TCP Binary Increase Congestion control (BIC) response function.

The evolution of BIC’s congestion window proceeds in three stages (Figure 5.5): additive increase, binary search increase, and max probing.

Figure 5.5 Window growth function of TCP BIC.

5.4.1.1 Binary Search Increase

The Binary Search window increase rule in BIC was inspired by the binary search algorithm and operates as follows:

• If Wmax is the maximum window size that was reached just before the last packet loss, and Wmin is the window size just after the packet loss, then after 1 RTT, the algorithm computes Wmid, the midpoint between Wmax and Wmin, and sets the current window size to Wmid.

• After the resulting packet transmissions:

• If there are no packet losses, then Wmin is set to Wmid, or

• If there are packet losses then Wmax is set to Wmid.

For the case of no packet losses the process repeats for every RTT until the difference between Wmax and Wmin falls below a preset threshold Smin. Note that as a result of this algorithm, the window size increases logarithmically.

Just as for binary search, this process allows the bandwidth probing to be more aggressive when the difference between the current window and the target window is large and gradually becomes less aggressive as the current window gets closer to the target. This results in a reduction in the number of lost packets as the saturation point is reached. This behavior contrasts with HSTCP, which increases its rate near the link saturation point, resulting in excessive packet loss.

5.4.1.2 Additive Increase

When the distance from the current Wmin to Wmid is too large, increasing the window directly to that midpoint would inject a large burst of packets into the network, which can result in losses. In this situation, the window size is increased by a configured maximum step value Smax. This continues until the distance between Wmin and Wmid falls below Smax, at which point Wmin is set directly to Wmid. After a large window size reduction, the additive increase rule leads to an initial linear increase in window size followed by a logarithmic increase for the last few RTTs.

For large window sizes, or equivalently for higher speed links, the BIC algorithm behaves mostly as a pure additive-increase algorithm because it spends most of its time in this mode. This implies that for high-speed links, its performance should be close to that of the AIMD algorithm, which is indeed the case from Figure 5.4.

5.4.1.3 Max Probing

When the current window size grows past Wmax, the BIC algorithm switches to probing for the new maximum window, which is not known. It does so in a slow-start fashion by increasing its window size in the following sequence for each RTT: Wmax+Smin, Wmax+2Smin, Wmax+4Smin,…,Wmax+Smax. The reasoning behind this policy is that it is likely that the new saturation point is close to the old point; hence, it makes sense to initially gently probe for available bandwidth before going at full blast. After the max probing phase, BIC switches to additive increase using the parameter Smax.

BIC also has a feature called Fast Convergence that is designed to facilitate faster convergence between a flow holding a large amount of bandwidth and a second flow that is starting from scratch. It operates as follows: If the new Wmax for a flow is smaller than its previous value, then this is a sign of a downward trend. To facilitate the reduction of the flow’s bandwidth, the new Wmax is set to (Wmax+Wmin)/2, which has the effect of reducing the increase rate of the larger window and thus allows the smaller window to catch up.
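The following per-RTT sketch pulls the three growth stages together. It is a simplification of [1]: real BIC updates its window on each ACK and includes the fast convergence heuristic, both of which are omitted here; the Smax and Smin values are the recommended defaults:

```python
S_MIN, S_MAX, BETA = 0.01, 32.0, 0.125   # values recommended in [1]

def bic_increase(w, w_max):
    """Window increment for one RTT without loss."""
    if w < w_max:
        dist = w_max - w
        if dist > S_MAX:
            return S_MAX                    # additive increase phase
        return max(dist / 2, S_MIN)         # binary search phase
    # max probing beyond w_max: start gently, then settle into Smax steps
    dist = w - w_max
    return min(max(2 * dist, S_MIN), S_MAX)

def bic_on_loss(w):
    """Multiplicative decrease; returns (new_window, new_w_max)."""
    return (1 - BETA) * w, w

# Example: recovery toward w_max after a loss at a window of 1000 packets
w, w_max = bic_on_loss(1000)
for _ in range(12):
    w += bic_increase(w, w_max)
print(f"window after 12 RTTs: {w:.1f} (target {w_max})")
```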

We now provide an approximate analysis of BIC using the sample path–based fluid approximation technique introduced in Section 2.3 of Chapter 2. This analysis explains the concave behavior of the BIC response function as illustrated in Figure 5.4. We consider a typical cycle of window increase (see Figure 5.6) that is terminated when a packet is dropped at the end of the cycle.

Figure 5.6 Evolution of TCP Binary Increase Congestion control (BIC) window size.

Define the following:

Wmax: Maximum window size

Wmin: Minimum window size

N1: Number of RTT rounds in the additive increase phase

N2: Number of RTT rounds in the logarithmic increase phase

Y1: Number of packets transmitted during the additive increase phase

Y2: Number of packets transmitted during the logarithmic increase phase

As per the deterministic approximation technique, we make the assumption that the number of packets sent in each cycle of the congestion window is fixed (and equal to 1/p, where p is the packet drop rate). The average throughput Ravg is then given by

$$R_{avg} = \frac{Y_1 + Y_2}{T(N_1 + N_2)} = \frac{1/p}{T(N_1 + N_2)} \qquad (15)$$

We now proceed to compute the quantities Y1, Y2, N1 and N2.

Note that $\beta$ is the multiplicative decrease factor after a packet loss, so that

$$W_{min} = (1-\beta)W_{max} \qquad (16)$$

BIC switches to the logarithmic increase phase when the distance from the current window size to $W_{max}$ is less than $2S_{max}$; hence, if the initial distance is less than this, there is no additive increase phase. Because $W_{max} - W_{min} = \beta W_{max}$, it follows that $N_1$ is given by

$$N_1 = \max\left(\frac{\beta W_{max}}{S_{max}} - 2,\; 0\right) \qquad (17)$$

and the total increase in window size during the logarithmic increase phase is given by $\beta W_{max} - N_1 S_{max}$, which we denote by X. Note that

$$\frac{X}{2} + \frac{X}{2^2} + \cdots + \frac{X}{2^{N_2}} = X - S_{min} \qquad (18)$$

This equation follows from the fact that the remaining distance X to $W_{max}$ is halved in each RTT of the logarithmic increase phase until it becomes less than $S_{min}$. From equation 18, it follows that

$$X\left(1 - \frac{1}{2^{N_2}}\right) = X - S_{min}, \text{ so that}$$

$$N_2 = \log_2\left(\frac{\beta W_{max} - N_1 S_{max}}{S_{min}}\right) + 2 \qquad (19)$$

Note that 2 has been added in equation 19 to account for the first and last round trip latencies of the binary search increase.

We now compute Y1 and Y2. Y1 is given by the formula

$$Y_1 = \frac{1}{T}\int_0^{N_1 T} W(t)\,dt \qquad (20)$$

The integral is given by the area A1 under the W(t) curve for the additive increase phase (Figure 5.6), which is given by

$$A_1 = (1-\beta)W_{max}N_1 T + \frac{1}{2}(N_1 - 1)S_{max}N_1 T, \text{ so that}$$

$$Y_1 = (1-\beta)W_{max}N_1 + \frac{1}{2}(N_1 - 1)S_{max}N_1 \qquad (21)$$

Similarly, to compute $Y_2$, we need to find the area $A_2$ under the W(t) curve for the logarithmic increase phase. Note that from Figure 5.6, $A_3 = W_{max}N_2 - A_2$; hence, it is sufficient to find the area $A_3$. Define $Z = \beta W_{max} - N_1 S_{max}$; then $A_3$ is given by

$$A_3 = \left[Z + \frac{Z}{2} + \frac{Z}{2^2} + \cdots + \frac{Z}{2^m} + \cdots\right] - S_{min} = 2Z - S_{min}$$

It follows that

$$Y_2 = W_{max}N_2 - 2\left(\beta W_{max} - N_1 S_{max}\right) + S_{min} \qquad (22)$$

Because the total number of packets transmitted during a period is also given by 1/p, it follows that

$$\frac{1}{p} = Y_1 + Y_2 \qquad (23)$$

Equation 23 can be used to express $W_{max}$ as a function of p. Unfortunately, a closed-form expression for $W_{max}$ does not exist in general, but it can be computed for the following special cases:

1. $\beta W_{max} \gg 2S_{max}$
This condition implies that the window function is dominated by the linear increase part, so that $N_1 \gg N_2$, and from equations 17, 21, and 23, it can be shown that the average throughput is given by

$$R_{avg} \approx \frac{1}{T}\sqrt{\frac{S_{max}(2-\beta)}{2\beta}}\sqrt{\frac{1}{p}} \qquad (24)$$

which is the same as the throughput for a general AIMD congestion control algorithm that was derived in Chapter 2.

2. $\beta W_{max} > 2S_{max}$ and $\beta W_{max}$ is divisible by $S_{max}$.
From equations 21 to 23, it follows that

$$W_{max} = \frac{-b + \sqrt{b^2 + 4a\left(c + \frac{1}{p}\right)}}{2a} \qquad (25)$$

where

$$a = \frac{\beta(2-\beta)}{2S_{max}} \qquad b = \log\frac{S_{max}}{S_{min}} + \frac{2-\beta}{\beta} \qquad c = S_{max} - S_{min}$$

and the throughput is given by

$$R_{avg} = \frac{1}{T}\cdot\frac{2-\beta}{p}\cdot\frac{1}{\sqrt{b^2 + 4a\left(c + \frac{1}{p}\right) + (1-\beta)b + \frac{\beta(2-\beta)}{2}}} \qquad (26)$$

3. $\beta W_{max} \le 2S_{max}$
This condition implies that $N_1 = 0$, and assuming $1/p \gg S_{min}$, $W_{max}$ is approximately given by

$$W_{max} \approx \frac{1}{p}\cdot\frac{1}{\log_2\left(\frac{\beta W_{max}}{S_{min}}\right) + 2(1-\beta)} \qquad (27)$$


This equation can be solved for $W_{max}$ in terms of the LambertW(y) function (the real solution x of $xe^x = y$) to get

$$W_{max} = \frac{1}{p}\cdot\frac{\ln 2}{\text{LambertW}\left(\frac{4\beta\, e^{-2\beta\ln 2}\,\ln 2}{p\,S_{min}}\right)} \qquad (28)$$

so that

$$R_{avg} \approx \frac{Y_2}{TN_2} = \frac{W_{max}N_2 - 2\beta W_{max}}{TN_2} = \frac{W_{max}}{T}\left[1 - \frac{2\beta}{\log_2\left(\frac{\beta W_{max}}{S_{min}}\right) + 2}\right] \qquad (29)$$

From these equations, we can draw the following conclusions:

• Because large values of $W_{max}$ also correspond to large link capacities (because $W_{max} \approx CT$), it follows by a comparison of equation 24 with equation 72 in Chapter 2 that for high-capacity links, BIC operates like an AIMD protocol with increase parameter $a = S_{max}$, decrease parameter $b = \beta$, and exponent d = 0.5. This follows from the fact that for high-capacity links, the BIC window spends most of its time in the linear increase portion.

• For moderate values of C, we can get some insight into BIC’s behavior by computing the constants in part 2. As recommended by Xu et al. [1], we choose the following values for BIC’s parameters: $\beta = 0.125$, $S_{max} = 32$, $S_{min} = 0.01$. It follows that a = 0.0036, b = 18.5, and c = 31.99, and from equation 26, it follows that

$$R_{avg} = \frac{1.875}{T}\cdot\frac{1}{\sqrt{359.02\,p^2 + 0.014\,p}}$$

This formula implies that if $359p^2 \gg 0.014p$, that is, $p \gg 3.9\times 10^{-5}$, then it follows that

$$R_{avg} = \frac{0.1}{Tp}$$

Hence, it follows that for larger packet drop rates (i.e., smaller values of C), the exponent d=1.

Because d = 0.5 for large link capacities and d = 1 for smaller link capacities, it follows that for intermediate values of capacity, 0.5 < d < 1, which explains the concave shape of the BIC response function in Figure 5.4.
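This can be checked numerically by measuring the local slope $d = -d(\log w)/d(\log p)$ of the closed-form approximation above (T cancels out of $w = R_{avg}T$):

```python
import math

# Local exponent d of BIC's response function w(p) ~ 1/p^d, from the
# approximation Ravg = (1.875/T) / sqrt(359.02 p^2 + 0.014 p).
def w_bic(p):
    return 1.875 / math.sqrt(359.02 * p ** 2 + 0.014 * p)

def slope_d(p, h=1.01):
    return -(math.log(w_bic(p * h)) - math.log(w_bic(p))) / math.log(h)

for p in (1e-2, 3.9e-5, 1e-7):
    print(f"p = {p:.1e}: d = {slope_d(p):.2f}")   # ~1.0, ~0.75, ~0.5
```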

Figure 5.7 from Xu et al. [1] shows the variation of the BIC response function as a function of Smax and Smin. Figure 5.7A shows that for a fixed value of Smin, increasing Smax leads to an increase in throughput for large link capacities. Figure 5.7B shows that for a fixed value of Smax, as Smin increases the throughput decreases for lower link capacities.

Figure 5.7 (A) TCP Binary Increase Congestion control (BIC) response function with varying Smax. (B) TCP BIC response function with varying Smin.
Figure 5.8 Window growth function for TCP CUBIC.

5.4.2 TCP CUBIC

TCP CUBIC is a follow-on design from the same research group that had earlier come up with TCP BIC [11]. Their main motivation in coming up with CUBIC was to improve on BIC in the following areas: (1) BIC’s window increase function is too aggressive toward TCP Reno, especially in short-RTT or low-speed networks, and (2) to reduce the complexity of BIC’s window increment–decrement rules, making the algorithm more analytically tractable and easier to implement.

Recall that BIC implemented the concave and convex functions that govern the window increase by using a binary search technique. To reduce the resulting complexity of the algorithm, CUBIC replaced the binary search by a cubic function, which contains both concave and convex portions. Another significant innovation in CUBIC is that the window growth depends only on the real time between consecutive congestion events, unlike TCP BIC or Reno, in which the growth depends on the rate at which ACKs return to the source. This makes the window growth independent of the round trip latency, so that if multiple TCP CUBIC flows are competing for bandwidth, their windows are approximately equal. This results in a big improvement in RTT fairness for CUBIC, as shown later in this section.

Define the following:

W(t): Window size at time t, given that the last packet loss occurred at t=0

Wm: Window size at which the last packet loss occurred

$\tau$: Average length of a cycle

$\alpha, K$: Parameters of the window increase function

$\beta$: Multiplicative factor by which the window size is decreased after a packet loss, so that the new window is given by $(1-\beta)W_m$

The window size increase function for TCP CUBIC is given by

$$W(t) = \alpha(t - K)^3 + W_m \qquad (30)$$

Note that at $t = 0^+$, the window size is given by $W(0^+) = (1-\beta)W_m$, so that

$$(1-\beta)W_m = \alpha(-K)^3 + W_m, \text{ i.e.,}$$

$$K = \sqrt[3]{\frac{\beta W_m}{\alpha}} \qquad (31)$$

CUBIC runs in one of the following three different modes depending on the value of W(t):

1. CUBIC is in the TCP Mode if W(t) is less than the window size that TCP Reno would reach at time t.
In short-RTT networks, TCP Reno can grow faster than CUBIC because its window increases by one every RTT, whereas CUBIC’s window increase rate is independent of the RTT. To keep CUBIC’s growth rate the same as that of Reno in this situation, CUBIC emulates Reno’s window adjustment after a packet loss event using an equivalent AIMD($\alpha, \beta$) algorithm, as follows: Recall that the throughput for TCP Reno is given by

$$R = \frac{1}{T}\sqrt{\frac{3}{2p}}$$

and that for an AIMD($\alpha, \beta$) algorithm is given by

$$R = \frac{1}{T}\sqrt{\frac{\alpha(2-\beta)}{2\beta}}\sqrt{\frac{1}{p}}$$


Hence, given $\beta$, a choice of

$$\alpha = \frac{3\beta}{2-\beta}$$

will ensure that the AIMD algorithm has the same average throughput as TCP Reno. Because the window size increases by $\alpha$ in every round trip and there are t/T round trips in time t, it follows that the emulated CUBIC window size after t seconds is given by

$$W_{AIMD}(t) = (1-\beta)W_m + \frac{3\beta}{2-\beta}\cdot\frac{t}{T}$$


If WAIMD is larger than WCUBIC from equation 30, then WCUBIC is set equal to WAIMD. Otherwise, WCUBIC is used as the current congestion window size.

2. If W(t) is greater than the corresponding TCP Reno window but less than Wm, then CUBIC is in the concave region.

3. If W(t) is larger than Wm, then CUBIC is in the convex region.
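The sketch below implements the resulting window computation (equations 30 and 31 plus the TCP-mode floor). The constants $\alpha = 0.4$ and $\beta = 0.2$ are assumptions, typical of CUBIC implementations, because the text does not fix their values:

```python
ALPHA, BETA = 0.4, 0.2   # assumed CUBIC constants (not specified in the text)

def cubic_window(t, w_m, rtt):
    """Window at time t seconds after a loss that occurred at window w_m."""
    k = (BETA * w_m / ALPHA) ** (1 / 3)                      # equation 31
    w_cubic = ALPHA * (t - k) ** 3 + w_m                     # equation 30
    # TCP mode: emulated AIMD window matching Reno's average throughput
    w_aimd = (1 - BETA) * w_m + (3 * BETA / (2 - BETA)) * (t / rtt)
    return max(w_cubic, w_aimd)

# Example: concave growth toward w_m = 1000, then convex growth past it
for t in (0.0, 5.0, 10.0, 15.0):
    print(f"t = {t:4.1f} s: W = {cubic_window(t, 1000, rtt=0.1):7.1f}")
```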

We now proceed to do an analysis of the throughput for TCP CUBIC. Again using the deterministic approximation technique from Section 2.3, we consider a typical cycle of window increase (Figure 5.9) that is terminated when a packet is dropped at the end of the cycle.

Figure 5.9 TCP CUBIC congestion window size over time.

During the course of this cycle, the TCP window increases from $(1-\beta)W_m$ to $W_m$, and we will assume that this takes on average $\tau$ seconds. From equation 31, it follows that

$$\tau = \sqrt[3]{\frac{\beta W_m}{\alpha}} \qquad (32)$$

Note that the throughput R(t) at time t is given by

$$R(t) = \frac{W(t)}{T} \qquad (33)$$

Again, using the deterministic assumption from Section 2.3 of Chapter 2, it follows that the average throughput is given by

$$R_{avg} = \frac{N}{\tau} \quad\text{where}\quad N = \frac{1}{p} \qquad (34)$$

is the number of packets transmitted during a cycle. N is also given by

$$N = \int_0^{\tau}\frac{W(t)}{T}\,dt \qquad (35)$$

Following the usual recipe, we equate N to 1/p, where p is the packet drop rate. Hence,

$$\frac{1}{p} = \int_0^{\tau}\frac{W(t)}{T}\,dt \qquad (36)$$

Note that

$$\int_0^{\tau}W(t)\,dt = \int_0^{\tau}\left[\alpha(t-\tau)^3 + W_m\right]dt = \tau W_m + \alpha\int_0^{\tau}(t-\tau)^3\,dt = \tau W_m - \frac{\alpha\tau^4}{4} \qquad (37)$$

Substituting equations 32 and 37 into equation 36, it follows that

$$\frac{T}{p} = W_m\left[\frac{\beta W_m}{\alpha}\right]^{1/3} - \frac{\alpha}{4}\left[\frac{\beta W_m}{\alpha}\right]^{4/3} = W_m^{4/3}\,\frac{4-\beta}{4}\left(\frac{\beta}{\alpha}\right)^{1/3} \qquad (38)$$

This implies that

$$W_m = \sqrt[4]{\left(\frac{T}{p}\right)^3\frac{\alpha}{\beta}\left(\frac{4}{4-\beta}\right)^3} \qquad (39)$$

and

$$\tau = \sqrt[3]{\frac{\beta W_m}{\alpha}} = \sqrt[4]{\frac{4T\beta}{p(4-\beta)\alpha}} \qquad (40)$$

The average throughput is then given by

$$R_{avg} = \frac{N}{\tau} = \frac{1/p}{\tau} = \sqrt[4]{\frac{\alpha(4-\beta)}{4\beta\,T\,p^3}} \qquad (41)$$

From this equation, it follows that the response function for CUBIC is given by

$$\log w = \log R_{avg}T = 0.25\log\frac{\alpha(4-\beta)}{4\beta} + 0.75\log T - 0.75\log p$$

Because CUBIC is forced to follow TCP Reno for smaller window sizes, its response function becomes

$$\log w = \begin{cases} 0.25\log\frac{\alpha(4-\beta)}{4\beta} + 0.75\log T - 0.75\log p & \text{if } p < \bar{p} \\ 0.09 - 0.5\log p & \text{if } p \ge \bar{p} \end{cases}$$

where

$$\log\bar{p} = \log\frac{\alpha(4-\beta)}{4\beta} + 3\log T - 0.36.$$

Figure 5.10 clearly shows this piecewise linearity of the CUBIC response function for T=10 ms and T=100 ms, respectively.

Figure 5.10 (A) Response function for TCP Reno and CUBIC for T=10 ms. (B) Response function for TCP Reno and CUBIC for T=100 ms.

Based on these equations, the response function for CUBIC can be regarded as a family of response functions parameterized by the round trip latency T. As shown in Figure 5.10A, for smaller values of T, the response functions of CUBIC and Reno mostly coincide, but for large values of T (see Figure 5.10B), the CUBIC response function grows faster and looks more like that of HSTCP. This is an interesting contrast with the behavior of HSTCP, in which the switch between HSTCP and TCP Reno happens as a function of the window size W alone, irrespective of the round trip latency T.
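For concreteness, the crossover point $\bar{p}$ can be evaluated from the formula above for the two latencies of Figure 5.10; the CUBIC constants $\alpha = 0.4$ and $\beta = 0.2$ are again assumed:

```python
import math

ALPHA, BETA = 0.4, 0.2   # assumed CUBIC constants, as in the earlier sketch

def p_bar(T):
    log_p = (math.log10(ALPHA * (4 - BETA) / (4 * BETA))
             + 3 * math.log10(T) - 0.36)
    return 10 ** log_p

for T in (0.01, 0.1):
    print(f"T = {T * 1000:3.0f} ms: CUBIC departs from Reno below p = {p_bar(T):.1e}")
```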

The dependency of the Reno–CUBIC cross-over point on T is a very attractive feature because most traffic, especially in enterprise environments, passes over networks with low latency, for which regular TCP Reno works fine. If a high-speed algorithm is used over these networks, we would like it to coexist with Reno without overwhelming it, and CUBIC satisfies this requirement very well. Over high-speed, long-distance networks with a large latency, on the other hand, TCP Reno is not able to make full use of the available bandwidth, but CUBIC naturally scales up its performance to do so.

Figure 5.11 shows the ratio between the throughputs of two flows for CUBIC, BIC, HSTCP, and TCP SACK, for a bottleneck link capacity of 400 Mbps, with one flow having a fixed RTT of 162 ms while the RTT of the other flow is varied between 16 ms and 162 ms. This clearly shows that CUBIC has better RTT fairness than the other protocols. This can also be inferred from equation 4, because if we substitute e = 0.25 and d = 0.75, it follows that

$$\frac{R_1}{R_2} = \left(\frac{T_2}{T_1}\right)^{\frac{0.25}{1-0.75}} = \frac{T_2}{T_1}.$$
Figure 5.11 Intraprotocol fairness as a function of the round trip latency.

Hence, TCP CUBIC has better intraprotocol fairness than any of the other high-speed protocols and even TCP Reno. This is because its window growth does not depend on ACK clocking to time its packet transmissions.

5.5 The Compound TCP (CTCP) Protocol

CTCP was invented by Tan et al. [5,12] from Microsoft Research and serves as the default congestion control algorithm in all operating systems from that company, from Windows Vista onward. In addition to making efficient use of high-speed links, the main design objective of CTCP was to be more TCP friendly compared with the existing alternatives such as HSTCP and CUBIC.

To do so, they came up with an innovative way to combine the fairness of the delay-based congestion control approach (as in TCP Vegas) with the aggressiveness of loss-based approaches. Tan et al. [5,12] made the observation that the reason protocols such as HSTCP and BIC are not very TCP friendly is that they continue to use the loss-based approach to congestion detection and as a result overload the network. To correct this problem, they suggested that the system should keep track of congestion using the TCP Vegas queue size estimator and use an aggressive window increase rule only when the estimated queue size is less than some threshold. When the estimated queue size exceeds the threshold, CTCP switches to a less aggressive window increment rule, which approaches the one used by TCP Reno.

To implement their approach, they used the following decomposition for the window size into two components:

$$W(t) = W_c(t) + W_d(t) \qquad (42)$$

In equation 42, $W_c(t)$ is the part of the window that reacts to loss-based congestion signals and changes its size according to the rules used by TCP Reno. The component $W_d(t)$, on the other hand, reacts to a delay-based congestion signal and uses a new set of rules for changing its window size. Specifically, $W_d(t)$ has a rapid window increase rule when the network is sensed to be underutilized and gracefully reduces its size when the bottleneck queue builds up.

Define the following:

W(t): CTCP Window size at time t

Wc(t): Congestion component of the CTCP window size at time t

Wd(t): Delay component of the CTCP window size at time t

T: Base value of the round trip latency

Ts(t): Smoothed estimate of the round trip latency at time t

RE(t): Expected estimate of the CTCP throughput at time t

R(t): Observed CTCP throughput at time t

$\theta(t)$: Delay-based congestion indicator at time t

$\gamma$: Delay threshold, such that the system is considered to be congested if $\theta(t) \ge \gamma$

We briefly recap the calculation of the delay-based congestion indicator from TCP Vegas (see Chapter 1):

$$R_E(t) = \frac{W(t)}{T} \qquad R(t) = \frac{W(t)}{T_s(t)} \qquad \theta(t) = \left(R_E(t) - R(t)\right)T \qquad (43)$$

Note that the expression for $\theta$ can also be written as

$$\theta(t) = \frac{W(t)}{T_s(t)}\left(T_s(t) - T\right) \qquad (44)$$

which by Little’s law equals the number of queued packets in the network.

CTCP specifies that the congestion window should evolve according to the following equations on a per RTT basis:

$$W \leftarrow W + \alpha W^k \quad \text{when there are no packet losses or queuing delays, and} \qquad (45)$$

$$W \leftarrow (1-\beta)W \quad \text{on one or more packet losses during an RTT} \qquad (46)$$

Given that the standard TCP congestion window evolves according to (per RTT):

$$W_c \leftarrow W_c + 1 \quad \text{with no loss, and} \qquad (47a)$$

$$W_c \leftarrow \frac{W_c}{2} \quad \text{with loss} \qquad (47b)$$

it follows that the delay-based congestion window $W_d$ should follow the rules below (per RTT):

$$W_d \leftarrow W_d + \left[\alpha W^k - 1\right]^+ \quad \text{if } \theta < \gamma \qquad (48a)$$

$$W_d \leftarrow \left[W_d - \zeta\theta\right]^+ \quad \text{if } \theta \ge \gamma \qquad (48b)$$

$$W_d \leftarrow \left[(1-\beta)W - \frac{W_c}{2}\right]^+ \quad \text{with loss} \qquad (48c)$$

Using the fact that W=Wc+Wd, it follows from equations 47a and 47b and 48a to 48c that the CTCP window evolution rules (equations 45 and 46) can be written as

$$W \leftarrow W + \delta \quad \text{if } W + \delta < W_0 \qquad (49a)$$

$$W \leftarrow W_0 \quad \text{if } W + \delta \ge W_0 \text{ and } W_c < W_0 \qquad (49b)$$

$$W \leftarrow W + 1 \quad \text{if } W_c \ge W_0 \qquad (49c)$$

$$W \leftarrow (1-\beta)W \quad \text{with loss} \qquad (49d)$$

where $\delta = \max\{\alpha W^k, 1\}$ and $W_0$ is given by equation 51.

The shape of the window increase function in equations 49a to 49c (Figures 5.12 and 5.13) is explained as follows: When the network queuing backlog is smaller than γimage, the delay-based congestion window grows rapidly as per equation 48a. When the congestion level reaches γimage, the congestion window continues to increase at the rate of one packet per round trip time, and the delay window starts to ramp down as per equation 48b. If the rate of increase of the congestion window Wc equals the rate of decrease of the delay window Wd, then the resulting CTCP window W stays constant at W0 until the size of the congestion window Wc exceeds W0. At that point, the CTCP window W resumes its increase at the same linear rate as the congestion window, and the delay window dwindles to zero (see equation 49c).

Figure 5.12 Evolution of CTCP window size.
Figure 5.13 Evolution of the individual window sizes in CTCP.
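Before deriving the throughput, it may help to restate the update rules of equations 44 and 47 to 48 as a per-RTT procedure, as in the sketch below. The parameter values ($\alpha = 0.125$, $\beta = 0.5$, k = 0.75, $\zeta = 1$, $\gamma = 30$ packets) are assumptions in the spirit of [5]:

```python
ALPHA, BETA, K, ZETA, GAMMA = 0.125, 0.5, 0.75, 1.0, 30.0   # assumed, after [5]

def ctcp_rtt(wc, wd, T, Ts, loss):
    """One round trip of CTCP; returns updated (Wc, Wd)."""
    w = wc + wd
    theta = (w / Ts) * (Ts - T)              # equation 44: queue backlog
    if loss:
        wc = wc / 2                          # equation 47b
        wd = max((1 - BETA) * w - wc, 0.0)   # equation 48c
    else:
        wc = wc + 1                          # equation 47a
        if theta < GAMMA:
            wd = wd + max(ALPHA * w ** K - 1, 0.0)   # equation 48a
        else:
            wd = max(wd - ZETA * theta, 0.0)         # equation 48b
    return wc, wd

# Example: with no queuing (Ts == T), the quartic term accelerates growth
wc, wd = 100.0, 0.0
for _ in range(20):
    wc, wd = ctcp_rtt(wc, wd, T=0.1, Ts=0.1, loss=False)
print(f"CTCP window after 20 RTTs: {wc + wd:.0f} (Reno would reach 120)")
```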

Referring to Figure 5.12, we now derive an expression for the throughput of CTCP using the deterministic sample path–based technique from Section 2.3. The evolution of the window increase is divided into three periods, as explained. Let the duration of these periods be TDE, TEF, and TFG and let NDE, NEF and NFG be the number of packets transmitted over these intervals. Then the average throughput Ravg is given by

$$R_{avg} = \frac{N_{DE} + N_{EF} + N_{FG}}{T_{DE} + T_{EF} + T_{FG}} \qquad (50)$$

Assuming that the CTCP window size at the end of the cycle is given by $W_m$, so that the window size at the start of the next cycle is given by $(1-\beta)W_m$, we now proceed to compute the quantities in equation 50. Noting that the queue backlog during the interval $T_{EF}$ is given by $\gamma$, it follows from Little’s law that

$$\gamma = (T_s - T)\frac{W_0}{T_s}$$

where W0 is the CTCP window size during this interval. Hence, it follows that

$$W_0 = \frac{\gamma T_s}{T_s - T} \qquad (51)$$

During the interval TDE, the CTCP window increases according to

$$\frac{dW}{dt} = \frac{\alpha W^k}{T_s}, \text{ so that} \qquad (52)$$

$$\frac{W^{1-k}}{1-k} = \frac{\alpha t}{T_s} \qquad (53)$$

For the case k=0.75, which is the recommended choice for k as we shall see later, it follows from equation 53 that

$$W(t) = \left(\frac{\alpha t}{4T_s}\right)^4, \quad t_D \le t \le t_E$$

which illustrates the rapid increase in the window size in this region. Because W increases as a power of 4, we refer to this as quartic increase.

Because the window size increases from $(1-\beta)W_m$ to $W_0$ during the interval $T_{DE}$, from equation 53, it follows that

$$T_{DE} = \frac{T_s}{\alpha(1-k)}\left[W_0^{1-k} - (1-\beta)^{1-k}W_m^{1-k}\right] \qquad (54)$$

and the number of packets transmitted during this interval is given by (with $T_{DE} = t_E - t_D$)

$$N_{DE} = \int_{t_D}^{t_E}\frac{W}{T_s}\,dt = \left[\frac{\alpha(1-k)}{T_s}\right]^{\frac{1}{1-k}}\frac{1}{T_s}\int_{t_D}^{t_E}t^{\frac{1}{1-k}}\,dt = \frac{1}{\alpha(2-k)}\left[W_0^{2-k} - (1-\beta)^{2-k}W_m^{2-k}\right] \qquad (55)$$

Note that the interval $T_{EF}$ comes to an end when the size of the congestion window $W_c$ becomes equal to $W_0$. Because $dW_c/dt = 1/T_s$ by equation 47a, it follows that

$$T_{DF} = \left[W_0 - (1-\beta)W_m\right]T_s \quad\text{and} \qquad (56)$$

$$T_{EF} = T_{DF} - T_{DE} = \left[W_0 - (1-\beta)W_m\right]T_s - \frac{T_s}{\alpha(1-k)}\left[W_0^{1-k} - (1-\beta)^{1-k}W_m^{1-k}\right] \qquad (57)$$

$$\text{and } N_{EF} = \frac{W_0\,T_{EF}}{T_s}. \qquad (58)$$

During the interval TFG, the CTCP window increases linearly as for TCP Reno, so that

$$\frac{dW}{dt} = \frac{1}{T_s} \quad\text{so that}\quad \Delta W = \frac{\Delta t}{T_s}.$$

Hence,

$$T_{FG} = (W_m - W_0)T_s \quad\text{and} \qquad (59)$$

$$N_{FG} = \int_{t_F}^{t_G}\frac{W}{T_s}\,dt = \int_{t_F}^{t_G}\frac{t}{T_s^2}\,dt = \frac{t_G^2 - t_F^2}{2T_s^2} = \frac{W_m^2 - W_0^2}{2} \qquad (60)$$

Note that the total number of packets transmitted over the interval TDG is also given by 1/p, so that

$$\frac{1}{p} = N_{DE} + N_{EF} + N_{FG} \qquad (61)$$

Equation 61 can be numerically solved to obtain the value of Wm, which can then be substituted back into equations 54, 57, and 59 to obtain the length of the interval TDG. From equation 50, it follows that

$$R_{avg} = \frac{1/p}{T_{DG}} \qquad (62)$$

If we make the assumption that the packet loss rate is high enough that the loss occurs while the window is still in the interval $T_{DE}$, it is possible to obtain a closed-form expression for $R_{avg}$. In this case, $N_{DE} = 1/p$, so that

$$\frac{1}{\alpha(2-k)}\left[W_m^{2-k} - (1-\beta)^{2-k}W_m^{2-k}\right] = \frac{1}{p}, \text{ so that}$$

$$W_m = \left[\frac{\alpha(2-k)}{1-(1-\beta)^{2-k}}\right]^{\frac{1}{2-k}}\frac{1}{p^{\frac{1}{2-k}}} \quad\text{and} \qquad (63)$$

$$T_{DE} = \frac{T_s W_m^{1-k}}{\alpha(1-k)}\left[1 - (1-\beta)^{1-k}\right] \qquad (64)$$

The average throughput Ravg is then given by

$$R_{avg} = \frac{1}{p}\cdot\frac{\alpha(1-k)}{T_s W_m^{1-k}\left[1-(1-\beta)^{1-k}\right]} = \left(\frac{\alpha}{p}\right)^{\frac{1}{2-k}}\frac{1-k}{T_s\left[1-(1-\beta)^{1-k}\right]}\left[\frac{1-(1-\beta)^{2-k}}{2-k}\right]^{\frac{1-k}{2-k}} \qquad (65)$$

To make the variation of the throughput with p the same as that for HSTCP, Tan et al. [5,12] chose k so that 1/(2−k)=0.833, so k=0.8. To make the implementation easier, they used k=0.75 instead.

We can develop some intuition into CTCP’s performance with the following considerations: With reference to Figure 5.12, note that the demarcation between the quartic increase portion of the CTCP window and the linear increase portion is governed by the window size parameter W0 given by

$$W_0 = \frac{\gamma T_s}{T_s - T}.$$

From this equation, it follows that:

• For networks with large latencies, the end-to-end delay is dominated by the propagation delay; hence, $T_s \approx T$. This leads to a large value of $W_0$.

• For networks with small latencies, the queuing delays may dominate the propagation delay, so that $T_s \gg T$. This leads to small values of $W_0$.

The interesting fact to note is that W0 is independent of the link capacity, which we now use to obtain some insights into CTCP behavior. Also note that in the absence of random link errors, the maximum window size Wm is approximately equal to the delay bandwidth product CT.

Based on the values of the link speed C and the threshold W0, we can have four different scenarios (Figure 5.14):

• As the link speed C increases for a fixed value of end-to-end latency, $W_m$ increases, and as a result, the number of packets transmitted in the linear portion of the window increases (see, e.g., the change from Figure 5.14A to 5.14B or from 5.14C to 5.14D). This can also be seen from the formula for $N_{FG}$ in equation 60. As a result, at higher link speeds, the CTCP throughput converges toward that of TCP Reno.

• As the end-to-end latency increases, $W_0$ increases as explained above, and the number of packets transmitted in the quartic region of the CTCP window increases (see, e.g., the change from Figure 5.14A to 5.14C or from 5.14B to 5.14D). This can also be seen from the formula for $N_{DE}$ in equation 55. As a result, an increase in latency causes the CTCP throughput to increase faster compared with that of Reno. This behavior is reminiscent of CUBIC because CUBIC’s throughput also increases relative to Reno as the end-to-end latency increases. However, the CTCP throughput converges back to that of Reno as the link speed becomes sufficiently large, whereas the CUBIC throughput keeps diverging.

Figure 5.14 Variation of the CTCP window with varying latencies and link capacities.

These two effects are clearly evident in the response function plot in Figure 5.15.

Figure 5.15 Response functions for CTCP, CUBIC, and Reno.

Figure 5.15 plots the response function for CTCP, in which the following features are evident: The response functions for Reno and CTCP are close to each other at lower link speeds. Then, as the link speed increases, CTCP increases at a faster rate and diverges away from Reno. This can be explained by the fact that the quartic increase portion dominates the linear increase in this region. Finally, when the link speed assumes very large values, toward the left part of the graph, CTCP and Reno again converge. This is because the linear portion of the window increase in CTCP dominates at high link speeds, as explained earlier.

As the end-to-end latency increases, we can expect this pattern to persist, with the difference that the portion of the curve where the CTCP response function is larger will move to the left (because of larger values of $W_0$).

Figure 5.15 also plots the response function for CUBIC for comparison purposes. Observe that the response function for CUBIC starts diverging away from that of Reno as the link speed increases, and unlike CTCP’s, the divergence keeps increasing. This implies that at very high link speeds, CTCP will be at a disadvantage with respect to CUBIC or Reno, as has been noted by Munir et al. [13].

From the discussion in the previous few sections, we observe that there is a fundamental difference between the response function for HSTCP on one hand and that for CUBIC or CTCP on the other. In the latter case, the response function changes as the round trip latency changes and tends to be more TCP Reno friendly at lower latencies. This is an attractive property to have because lower latencies correspond to LAN environments, in which most transfers still use TCP Reno. On the other hand, at higher latencies, both CUBIC and CTCP response functions diverge from that of TCP Reno (CUBIC more so than CTCP). This is tolerable because Reno is not able to make full use of the available link capacity in high-latency and high-capacity networks and may even become unstable.

5.6 The Fast TCP Protocol

The design of FAST TCP was inspired by that of TCP Vegas, and in contrast to the other high-speed protocols, it uses end-to-end delay as a measure of congestion rather than dropped packets. Given the description of TCP Vegas in Chapter 1, the following equation describes the evolution of TCP Vegas’ window size (assuming that the parameters $\alpha = \beta$):

$$W_i(t+1) = W_i(t) + \frac{1}{T_i(t)}\,\text{sgn}\!\left(\alpha_i - R_i(t)\theta_i(t)\right) \qquad (66)$$

where $\theta_i(t)$ is the queuing delay that connection i experiences, such that the round trip latency is given by $T_i(t) = T_i + \theta_i(t)$, where $T_i$ is the fixed part. Hence, $R_i(t)\theta_i(t)$ is a measure of the number of queued packets from the connection in the network. The function sgn(z) = −1 if z < 0, 0 if z = 0, and 1 if z > 0. Hence, TCP Vegas adjusts its window up or down by one packet per RTT depending on whether the number of queued packets is smaller or greater than its target $\alpha_i$.

To create a high-speed version of this algorithm, Wei et al. [7] changed the window size evolution equation to

$$W_i(t+1) = W_i(t) + \gamma_i\left(\alpha_i - R_i(t)\theta_i(t)\right), \quad \gamma_i \in (0, 1] \qquad (67)$$

As a result, the window adjustments in FAST TCP depend on the magnitude as well as the sign of the difference $\alpha_i - R_i(t)\theta_i(t)$ and are no longer restricted to a change of at most one packet every round trip time. Hence, FAST can adjust its window by a large amount, either up or down, when the number of buffered packets is far from its target and by a smaller amount when it is closer.
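A minimal sketch of equation 67 driving a single flow over a toy one-bottleneck model is shown below; the capacity, propagation delay, $\alpha$, and $\gamma$ values are illustrative assumptions:

```python
ALPHA, GAMMA = 200.0, 0.5        # alpha: target backlog in packets (assumed)
C, BASE_RTT = 20_000.0, 0.05     # toy bottleneck: 20,000 pkts/s, 50 ms

w = 100.0
for _ in range(50):
    backlog = max(w - C * BASE_RTT, 0.0)    # packets queued at the bottleneck
    rtt = BASE_RTT + backlog / C
    theta = rtt - BASE_RTT                  # queuing delay
    rate = w / rtt
    w = w + GAMMA * (ALPHA - rate * theta)  # equation 67

print(f"equilibrium window ~ {w:.0f} = BDP ({C * BASE_RTT:.0f}) + alpha ({ALPHA:.0f})")
```

Note that the update settles exactly where the flow keeps $\alpha$ packets buffered in the path, consistent with equation 68 below.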

It can be shown that FAST has the following property: The equilibrium throughputs of FAST are the unique optimal vector that maximizes the sum $\sum_i \alpha_i \log R_i$ subject to the link constraint that the aggregate flow rate at any link does not exceed the link capacity. This result was obtained using the Primal-Dual solution to a Lagrangian optimization problem using the methodology described in Chapter 3. Furthermore, the equilibrium throughputs are given by

$$R_i^{avg} = \frac{\alpha_i}{\bar{\theta}_i} \qquad (68)$$

Just as for TCP Vegas, equation 68 shows that FAST does not penalize flows with large round trip latencies, although in practice, it has been observed that flows with longer RTTs obtain smaller throughput than those with shorter RTTs. Equation 68 also implies that in equilibrium, source i maintains $\alpha_i$ packets in the buffers along its path. Hence, the total amount of buffering in the network must be at least $\sum_i \alpha_i$ to reach equilibrium. Wang et al. [14] have also proven a stability result for FAST, which states that it is locally asymptotically stable in general networks when all the sources have a common round trip delay, no matter how large the delay is.

5.7 The eXpress Control Protocol (XCP)

The XCP congestion control algorithm [15] is an example of a “clean slate” design; that is, it is the result of a fundamental rethink of the congestion control problem without worrying about the issue of backward compatibility. As a result, it cannot be deployed in a regular TCP/IP network because it requires multi-bit feedback, but it can be used in a self-contained network that is separated from the rest of the Internet (which can be done by using the split-TCP architecture from Chapter 4, for example).

XCP fundamentally changes the nature of the feedback from the network nodes by having them provide explicit window increase or decrease numbers back to the sources. It reverses TCP’s design philosophy in the sense that all congestion control intelligence is now in the network nodes. As a result, the connection windows can be adjusted in a precise manner so that the total throughput at a node matches its available capacity, thus eliminating rate oscillations. This allows the senders to decrease their windows rapidly when the bottleneck is highly congested while performing smaller adjustments when the sending rate is close to the capacity (this is similar to the BIC and CUBIC designs and conforms to the averaging principle). To improve system stability, XCP reduces the rate at which it makes window adjustments as the round trip latency increases.

Another innovation in XCP is the decoupling of efficiency control (the ability to get to high link utilization) from fairness control (the problem of how to allocate bandwidth fairly among competing flows). This is because efficiency control should depend only on the aggregate traffic behavior, but any fair allocation depends on the number of connections passing through the node. Hence, an XCP-based AQM controller has both an efficiency controller (EC) and a fairness controller (FC), each of which can be modified independently of the other. In regular TCP, the two problems are coupled together because the AIMD window increase–decrease mechanism is used to accomplish both objectives, as was shown in the description of the Chiu-Jain theory in Chapter 1.

5.7.1 XCP Protocol Description

The XCP protocol works as follows:

• Traffic source k maintains a congestion window Wk, and keeps track of the round trip latency Tk. It communicates these numbers to the network nodes via a congestion header in every packet. Note that the rate Rk at source k is given by

$$R_k = \frac{W_k}{T_k}$$

Whenever a new ACK arrives, the following equation is used to update the window:

$$W_k \leftarrow W_k + S_k \qquad (69)$$

where $S_k$ is the explicit window size adjustment that is computed by the bottleneck node and is conveyed back to the source in the ACK packet. Note that $S_k$ can be either positive or negative, depending on the congestion conditions at the bottleneck.

• Network nodes monitor their input traffic rate. Based on the difference between the link capacity C and the aggregate rate $Y = \sum_{k=1}^{K} R_k$, the node tells the connections sharing the link to increase or decrease their congestion windows; this is done once every average round trip latency, for all the connections passing through the node (which automatically reduces the frequency of updates as the latencies increase). This information is conveyed back to the source by a field in the header. Downstream nodes can reduce this number if they are experiencing greater congestion, so that in the end, the feedback ACK contains information from the bottleneck node along the path.

• Operation of the Efficiency Controller (EC): The EC’s objective is to maximize link utilization while minimizing packet drops and persistent queues. The aggregate feedback $\phi$ (in bytes) is computed as follows:

$$\phi = \alpha\, T_{av}\left(C - \sum_{k=1}^{K}R_k\right) - \beta\, b \qquad (70)$$

where $\alpha, \beta$ are constants whose values are set based on the stability analysis, $T_{av}$ is the average round trip latency for all connections passing through the node, and b is the minimum queue seen by an arriving packet during the last round trip interval (i.e., the queue that does not drain at the end of a round trip interval). The first term on the RHS is proportional to the mismatch between the aggregate rate and the link capacity (multiplied by the round trip latency to convert it into bytes). The second term is required for the following reason: when there is a persistent queue even though the rates in the first term match, the second term helps to reduce this queue size.
From the discussion in Chapter 3, the first term on the RHS of equation 70 is proportional to the derivative of the queue length, and the second term is proportional to the difference between the current queue size and the target size of zero. Hence, the XCP controller is equivalent to the proportional-integral (PI) controller discussed in Chapter 3.
Note that the increase in transmission rates in XCP can be very fast because the increase is directly proportional to the spare link bandwidth as per equation 70, which makes XCP suitable for use in high-speed links. This is in contrast to TCP, in which the increase is by at most one packet per window.

• Operation of the Fairness Controller (FC): The job of the FC is to divide the aggregate feedback among the K connections in a fair manner. It achieves fairness by making use of the AIMD principle, so that:

– If $\phi > 0$, the allocation is done so that the increase in throughput of all the connections is the same.

– If $\phi < 0$, the allocation is done so that the decrease in throughput of a connection is proportional to its current throughput.

When $\phi$ is approximately zero, the convergence to fairness comes to a halt. To prevent this, XCP does an artificial deallocation and reallocation of bandwidth among the K connections by using the quantity h given by

$$h = \max\left(0,\; 0.1 \sum_{k=1}^{K} R_k - |\phi|\right), \qquad (71)$$

so that on every RTT, at least 10% of the traffic is redistributed according to AIMD. This process is called "bandwidth shuffling" and is the principal fairness mechanism in XCP for redistributing bandwidth among connections when a new connection starts up.

The per-packet feedback to source k can be written as

$$\phi_k = p_k - n_k \qquad (72)$$

where $p_k$ is the positive feedback and $n_k$ is the negative feedback.

First consider the case when $\phi > 0$. This quantity needs to be divided equally among the throughputs of the K connections. The corresponding change in window size of the kth connection is given by $\Delta W_k = \Delta R_k T_k \propto T_k$ because the throughput deltas are equal. This change needs to be split equally among the packets from connection k that pass through the node during time $T_k$, say $m_k$ of them. Note that $m_k$ is proportional to $W_k / MSS_k$ and inversely proportional to $T_k$. It follows that the change in window size per packet for connection k is given by

$$p_k = K_p\, \frac{T_k^2\, MSS_k}{W_k} \qquad (73)$$

where $K_p$ is a constant. Because the total increase in the aggregate traffic rate is $\frac{h + \max(\phi,\, 0)}{T_{av}}$, it follows that

$$\frac{h + \max(\phi,\, 0)}{T_{av}} = \sum_{L} \frac{p_k}{T_k} \qquad (74)$$

where the summation is over the set L of packets seen by the node during an average RTT. Substituting equation 73 into equation 74, we obtain

$$K_p = \frac{h + \max(\phi,\, 0)}{T_{av} \sum_{L} \frac{T_k\, MSS_k}{W_k}} \qquad (75)$$

When $\phi < 0$, the decrease in throughput should be proportional to the current throughput. It follows that $\Delta W_k = \Delta R_k T_k \propto R_k T_k = W_k$. Splitting up this change in window size among all the $m_k$ packets that pass through the node during an RTT, it follows that

$$n_k = K_n\, \frac{W_k}{m_k} = K_n\, T_k\, MSS_k \qquad (76)$$

where the constant $K_n$ is given by

$$K_n = \frac{h + \max(-\phi,\, 0)}{T_{av} \sum_{L} MSS_k} \qquad (77)$$
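To make the router-side bookkeeping concrete, the sketch below consolidates equations 70 through 77 into a single per-control-interval computation. It is a minimal illustration in Python, not the reference XCP implementation: the names XCPRouter and PacketInfo are invented, the shuffled traffic h is computed on the $Y \cdot T_{av}$ bytes that arrived during the interval (one consistent reading of equation 71), and the defaults α = 0.4 and β = 0.226 anticipate the stability analysis of the next section.

from dataclasses import dataclass

@dataclass
class PacketInfo:
    rtt: float    # T_k: round trip latency from the congestion header (sec)
    cwnd: float   # W_k: congestion window from the congestion header (bytes)
    size: float   # MSS_k: packet size (bytes)

class XCPRouter:
    def __init__(self, capacity, alpha=0.4, beta=0.226):
        self.C = capacity      # link capacity (bytes/sec)
        self.alpha = alpha
        self.beta = beta

    def control_interval(self, packets, Y, b, T_av):
        """Run once per average RTT. packets: headers seen during the last
        interval; Y: measured input rate (bytes/sec); b: persistent queue
        (bytes); T_av: average round trip latency (sec)."""
        # Efficiency controller: aggregate feedback in bytes (equation 70)
        phi = self.alpha * T_av * (self.C - Y) - self.beta * b
        # Bandwidth shuffling (equation 71): redistribute at least 10% of
        # the traffic seen during the interval (Y * T_av bytes)
        h = max(0.0, 0.1 * Y * T_av - abs(phi))
        # Fairness controller: scale factors K_p and K_n (equations 75, 77)
        sum_pos = sum(p.rtt * p.size / p.cwnd for p in packets)
        sum_neg = sum(p.size for p in packets)
        Kp = (h + max(phi, 0.0)) / (T_av * sum_pos) if sum_pos else 0.0
        Kn = (h + max(-phi, 0.0)) / (T_av * sum_neg) if sum_neg else 0.0
        # Per-packet feedback S_k = p_k - n_k (equations 72, 73, and 76);
        # the source applies W_k <- W_k + S_k on each ACK (equation 69)
        return [Kp * p.rtt ** 2 * p.size / p.cwnd - Kn * p.rtt * p.size
                for p in packets]

Note how the positive feedback favors connections with small windows and long latencies, whereas the negative feedback scales with the bytes a connection pushes through the node; this is what produces the AIMD behavior described above.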

5.7.2 XCP Stability Analysis

In this section, we show that if the parameters $\alpha, \beta$ (which appear in equation 70) satisfy the conditions

$$0 < \alpha < \frac{\pi}{4\sqrt{2}} \quad \text{and} \quad \beta = \alpha^2 \sqrt{2} \qquad (78)$$

then the system is stable, independently of delay, capacity, and number of connections. This analysis is done using the techniques introduced in Chapter 3, whereby the system dynamics with feedback are captured in the fluid limit using ordinary differential equations, to which the Nyquist stability criterion is applied.

In this section, we will assume that the round trip latencies of all K connections are equal, so that $T_k = T_{av} = T,\ k = 1, \ldots, K$. Using the notation of the previous section, note that the total traffic input rate into the bottleneck queue is given by

$$Y(t) = \sum_{k=1}^{K} R_k(t) = \sum_{k=1}^{K} \frac{W_k(t)}{T} \qquad (79)$$

Because the aggregate feedback $\phi$ is divided up among the K connections every T seconds, it follows that the total rate of change of their window sizes is given by $\phi / T$ bytes/sec. Using equation 70, it follows that

$$\sum_{k=1}^{K} \frac{dW_k}{dt} = -\frac{1}{T}\left[\alpha T\,\big(Y(t-T) - C\big) + \beta\, b(t-T)\right] \qquad (80)$$

From equations 79 and 80, it follows that

$$\frac{dY(t)}{dt} = -\frac{1}{T^2}\left[\alpha T\,\big(Y(t-T) - C\big) + \beta\, b(t-T)\right] \qquad (81)$$

Hence, the equations that govern the input traffic rate and the buffer dynamics are the following:

$$\frac{db(t)}{dt} = \begin{cases} Y(t) - C & \text{if } b(t) > 0 \\ \max\big(0,\, Y(t) - C\big) & \text{if } b(t) = 0 \end{cases} \qquad (82)$$

$$\frac{dY}{dt} = -\frac{\alpha}{T}\,\big(Y(t-T) - C\big) - \frac{\beta}{T^2}\, b(t-T) \qquad (83)$$

These form a feedback loop, as shown in Figure 5.16. In this figure, the queue dynamics are governed by equation 82, and the rate allocation is done via equation 83. These equations have the equilibrium point $b_{eq} = 0$ and $Y_{eq} = C$. If we linearize the equations around this equilibrium point by making the substitution $x(t) = Y(t) - C$, we get

$$\frac{db(t)}{dt} = x(t) \qquad (84)$$

$$\frac{dx(t)}{dt} = -K_1\, x(t-T) - K_2\, b(t-T) \qquad (85)$$

where

$$K_1 = \frac{\alpha}{T} \quad \text{and} \quad K_2 = \frac{\beta}{T^2} \qquad (86)$$
Figure 5.16 Rate allocation and queue size dynamics in eXpress Control Protocol (XCP).
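Before linearizing, the dynamics can also be sanity-checked numerically. The following sketch (illustrative constants only; a forward-Euler loop stands in for a proper delay-differential-equation solver) integrates equations 82 and 83 and shows the input rate converging to the link capacity while the queue drains:

import numpy as np

C = 1.25e9                  # link capacity, bytes/sec (10 Gbps)
T = 0.1                     # common round trip latency, sec
alpha, beta = 0.4, 0.226    # values satisfying equation 78
dt = 1e-4                   # Euler step, sec
steps = int(20 * T / dt)    # simulate 20 round trip times
lag = int(T / dt)           # one round trip delay, in steps

Y = np.full(steps, 0.5 * C) # input rate, starting at half capacity
b = np.zeros(steps)         # queue backlog, bytes

for n in range(lag, steps - 1):
    # Equation 83: rate adjustment driven by the delayed mismatch and queue
    dY = -(alpha / T) * (Y[n - lag] - C) - (beta / T ** 2) * b[n - lag]
    Y[n + 1] = max(0.0, Y[n] + dY * dt)
    # Equation 82: the queue integrates the rate mismatch, floored at zero
    db = (Y[n] - C) if b[n] > 0 else max(0.0, Y[n] - C)
    b[n + 1] = max(0.0, b[n] + db * dt)

print(f"utilization {Y[-1] / C:.3f}, queue {b[-1]:.0f} bytes")

Rerunning the loop with α well outside the range in equation 78 (with β = α²√2) produces sustained rate oscillations instead of convergence, which is what the analysis below rules out.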

Passing to the Laplace transform space, it follows from equations 84 and 85 that the open loop transfer function for the system is given by

$$G(s) = \frac{s K_1 + K_2}{s^2}\, e^{-sT} \qquad (87)$$

To apply the Nyquist criterion, we express equation 87 in the complex plane by making the substitution $s = j\omega$, so that

$$G(j\omega) = |G(j\omega)|\, e^{j \arg(G(j\omega))}$$

where

$$|G(j\omega)| = \frac{\sqrt{\omega^2 K_1^2 + K_2^2}}{\omega^2} \qquad (88)$$

and

$$\arg(G(j\omega)) = -\left(\pi - \tan^{-1}\frac{\omega K_1}{K_2} + \omega T\right) \qquad (89)$$

We now choose the critical frequency $\omega_c$ at which $|G(j\omega_c)| = 1$, as a function of the parameters $K_1$ and $K_2$, as follows:

$$\omega_c = \frac{K_2}{K_1} \qquad (90)$$

Using equation 88, it follows that $K_2 / K_1^2 = \sqrt{2}$, which can be expressed using equation 86 as $\beta = \alpha^2 \sqrt{2}$.
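The algebra behind this step is worth spelling out. Since $\omega_c K_1 = K_2$, equation 88 evaluated at $\omega_c$ gives

$$1 = |G(j\omega_c)| = \frac{\sqrt{\omega_c^2 K_1^2 + K_2^2}}{\omega_c^2} = \frac{\sqrt{2}\, K_2}{(K_2/K_1)^2} = \frac{\sqrt{2}\, K_1^2}{K_2} \;\Longrightarrow\; K_2 = \sqrt{2}\, K_1^2$$

and because $K_2/K_1^2 = (\beta/T^2)\,/\,(\alpha/T)^2 = \beta/\alpha^2$, the delay T cancels, leaving $\beta = \alpha^2\sqrt{2}$.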

To prove stability of the closed-loop system, the Nyquist criterion requires that $|\arg(G(j\omega_c))| < \pi$ radians. Substituting equation 90 into equation 89, and noting that $\omega_c T = \beta/\alpha$, it follows that

$$|\arg(G(j\omega_c))| = \pi - \frac{\pi}{4} + \frac{\beta}{\alpha} < \pi \quad \text{i.e.,} \quad \alpha < \frac{\pi}{4\sqrt{2}}$$

which completes the proof.
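The proof is easy to check numerically. In the sketch below (illustrative parameter values; any T gives the same margin, since $\omega_c T = \beta/\alpha$), the crossover gain comes out as 1 and the phase margin as $\pi/4 - \beta/\alpha$, which is positive exactly when $\alpha < \pi/(4\sqrt{2})$:

import math

alpha = 0.4
beta = math.sqrt(2) * alpha ** 2     # equation 78: beta = alpha^2 * sqrt(2)
T = 0.1                              # round trip latency, sec (cancels out)
K1, K2 = alpha / T, beta / T ** 2    # equation 86
wc = K2 / K1                         # equation 90: crossover frequency

gain = math.sqrt((wc * K1) ** 2 + K2 ** 2) / wc ** 2      # equation 88
phase = -(math.pi - math.atan(wc * K1 / K2) + wc * T)     # equation 89

print(f"|G(j wc)|    = {gain:.6f}")                # 1.0 by construction
print(f"phase margin = {math.pi + phase:.4f} rad") # pi/4 - beta/alpha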

This analysis shows that unlike traditional congestion control and AQM schemes, XCP's stability criterion is independent of the link capacity, round trip latency, and number of connections. It was later shown by Balakrishnan et al. [16] that the linearization around the point $b_{eq} = 0$ may not be a good approximation because the system dynamics are clearly nonlinear at this point (see equation 82). They did a more exact stability analysis using Lyapunov functionals and showed that the exact stability region for $\alpha, \beta$ varies as a function of T and is in fact much larger than the region given by equation 78. However, it is not a strict superset: there are points near the boundary of the curve $\beta = \alpha^2\sqrt{2}$ that lie in the stable region according to the linear analysis but in the unstable region according to the nonlinear analysis (see Balakrishnan et al. [16]).

5.8 The Rate Control Protocol (RCP)

The RCP protocol [17] was inspired by XCP and builds on it to assign traffic rates to connections in such a way that they quickly converge to the ideal Processor Sharing (PS) rate. RCP is an entirely rate-based protocol (i.e., no windows are used). Network nodes compute the ideal PS-based rate and pass this information back to the source, which immediately changes its rate to the minimum of the PS rates computed by the nodes along its path.

RCP was designed to overcome the following issue with XCP: When a new connection is initiated in XCP and the nodes along its path are already at their maximum link utilizations, XCP does a gradual redistribution of the link capacity using the bandwidth-shuffling process (see equation 71). Dukkipati and McKeown [17] have shown that this process can be quite slow because new connections start with a small window and approach their fair share of the bandwidth gradually via the AIMD principle. It takes several round trips before convergence happens, and for smaller file sizes, the connection may never get to a fair bandwidth distribution. Hence, one of the main innovations in RCP, compared with XCP, is that the bandwidth reallocation happens in one round trip time, in which a connection gets its equilibrium rate.

The RCP algorithm operates as follows:

• Every network node periodically computes a fair-share rate R(t) that it communicates to all the connections that pass through it. This computation is done approximately once per round trip delay.

• RCP mandates two extra fields in the packet header: Each network node fills field 1 with its fair-share value R(t), and when the packet gets to the destination, the destination copies this field into the ACK packet and sends it back to the source. Field 2, as in XCP, is used to communicate the current average round trip latency $T_k$ from a connection to all the network nodes on its path. Each node uses this information to compute the average round trip latency $T_{av}$ of all the connections that pass through it.

• Each source transmits at rate $R_k$, which is the smallest offered rate along its path.

• Each network node updates its local rate R(t) according to the equation below:

$$R(t) = R(t - T_{av}) + \frac{\alpha\,\big(C - Y(t)\big) - \beta\,\dfrac{b(t)}{T_{av}}}{\hat{N}(t)} \qquad (91)$$

where $\hat{N}(t)$ is the node's estimate of the number of connections that pass through it, and the other variables are as defined for XCP.

The basic idea behind equation 91 is that if there is spare bandwidth available (equal to C − Y(t)), then it is shared equally among all the connections; on the other hand, if C − Y(t) < 0, then the link is oversubscribed, and each connection is asked to decrease its rate evenly. If there is a queue buildup of b(t), then a rate decrease of $b(t)/T_{av}$ will bring it down to zero within a round trip interval.

RCP replaces the interval at which the node recomputes R(t) with a user-configurable value T′ and uses the following estimate for the number of connections:

$$\hat{N}(t) = \frac{C}{R(t - T')}$$

so that equation 91 can be written as

$$R(t) = R(t - T')\left[1 + \frac{T'}{T_{av}} \cdot \frac{\alpha\,\big(C - Y(t)\big) - \beta\,\dfrac{b(t)}{T_{av}}}{C}\right] \qquad (92)$$

where the rate change has been scaled down by $T'/T_{av}$ because it is done more than once per $T_{av}$ seconds.
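A minimal router-side sketch of this update is given below. The class name and the measurement plumbing are invented for illustration, the α, β defaults are carried over from the XCP analysis, and the clamp on the advertised rate is an added safeguard rather than part of equation 92:

class RCPRouter:
    def __init__(self, capacity, alpha=0.4, beta=0.226, update_interval=0.01):
        self.C = capacity               # link capacity (bytes/sec)
        self.alpha, self.beta = alpha, beta
        self.Tprime = update_interval   # user-configurable interval T'
        self.R = capacity               # advertised fair-share rate R(t)

    def update(self, Y, b, T_av):
        """Run every T' seconds with the measured input rate Y (bytes/sec),
        queue backlog b (bytes), and average round trip latency T_av (sec)."""
        spare = self.alpha * (self.C - Y) - self.beta * b / T_av
        # Equation 92: scale by T'/T_av since the update runs more often
        # than once per RTT; dividing by C folds in N_hat = C / R(t - T')
        self.R *= 1.0 + (self.Tprime / T_av) * spare / self.C
        self.R = max(min(self.R, self.C), 1.0)  # safeguard, not in eq. 92
        return self.R

Each source then transmits at the smallest R(t) advertised along its path, so a newly arriving connection is handed its processor-sharing rate within a single round trip.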

It can easily be shown that the system dynamic equations for RCP are exactly the same as those for XCP (see equations 82 and 83). Hence, the stability conditions derived for XCP in the previous section hold in this case as well.

5.9 Stability of High-Speed TCP Algorithms

The algorithms described in this chapter all operate in the region where regular TCP has been shown to exhibit poor performance because of oscillatory behavior. From the analysis that was described in Chapter 3, the following techniques were found to increase the stability of congestion control algorithms in networks with high link capacity or end-to-end latency:

• Sophisticated AQM schemes that feed back the buffer occupancy value, as well as the first or higher derivatives of the buffer occupancy. Note that the first derivative of the buffer occupancy is given by

$$\frac{db(t)}{dt} = \sum_i R_i - C$$

so that it provides the same information as the difference between the link capacity and the current traffic load (i.e., how close the link is to saturation).

• Window size increment rules that incorporate the Averaging Principle (AP) to reduce the rate of window size increase as the link approaches saturation.

The natural question to ask is how the algorithms described in this chapter fare in this regard. We have not seen any work that applies control theory to models of HSTCP, BIC, CUBIC, or CTCP, but simulations show that they are much more stable than Reno in environments with high link speeds and large propagation delays.

A reason why this is the case can be gleaned from the principles enunciated earlier. By the Averaging Principle, algorithms that reduce their window increment size as the link reaches saturation are more stable than those that do not. BIC, CUBIC, CTCP, and FAST all fall into this category. BIC, CUBIC, and FAST reduce their window increment size in proportion to how far they are from the link saturation point, while CTCP changes to a linear rate of increase (from quartic) when the queue backlog increases beyond a threshold. XCP and RCP, on the other hand, use a sophisticated AQM scheme in which the network nodes feed back the value of the buffer occupancy as well as the difference between the link capacity and the total traffic load. The Quantized Congestion Notification (QCN) algorithm that we will meet in Chapter 8 uses a combination of the averaging principle and first-derivative feedback to achieve stability.

The HSTCP algorithm, on the other hand, does not fall cleanly into either of these categories. But note that HSTCP decreases its window size decrement value (on detecting packet loss) as the window size increases, and in general, algorithms that do a less drastic reduction of their window size than Reno have better stability properties (because this has the effect of reducing rate oscillations). This is also true of TCP Westwood (see Chapter 4), whose window size reduction is proportional to the amount of queuing delay. The Data Center TCP (DCTCP) algorithm (see Chapter 7) also falls into this class because it reduces its window size in proportion to the amount of time that the queue size exceeds a threshold.

5.10 Further Reading

Other TCP algorithms designed for high-speed long latency links include Scalable TCP [18], H-TCP [19], Yeah-TCP [6], TCP Africa [9], and TCP Illinois [4]. All of these are deployed on fewer than 2% of the servers on the Internet, according to the study done by Yang et al. [8].

References

1. Xu L, Harfoush K, Rhee I. Binary increase congestion control (BIC) for fast long distance networks. IEEE INFOCOM. 2004;4:2514–2524.

2. Bao W, Wong VWS, Leung VCM. A model for steady state throughput of TCP CUBIC. IEEE Globecom 2010;1–6.

3. Floyd S, Ratnasamy S, Shenker S. Modifying TCP's congestion control for high speed networks. 2002. http://www.icir.org/floyd/papers/hstcp.pdf.

4. Liu S, Basar T, Srikant R. TCP-Illinois: a loss and delay based congestion control algorithm for high speed networks. VALUETOOLS 2006.

5. Tan K, Song J, Zhang Q, Sridharan M. A compound TCP approach for high speed and long distance networks. IEEE INFOCOM 2006.

6. Baiocchi A, Castellani AP, Vacirca F. YeAH-TCP: yet another highspeed TCP. PFLDNet. 2007;7:37–42.

7. Wei DX, Jin C, Low SH, Hegde S. FAST TCP: motivation, architecture, algorithms, performance. IEEE/ACM ToN. 2006;14(6):1246–1259.

8. Yang P, Luo W, Xu L, et al. TCP congestion avoidance algorithm identification. International Conference on Distributed Computing Systems. 2013;22(4):1311–1324.

9. King R, Baraniuk R, Riedi R. TCP Africa: an adaptive and fair rapid increase rule for scalable TCP. IEEE INFOCOM. 2005;3:1838–1848.

10. Floyd S. Highspeed TCP for large congestion windows. RFC 3649.

11. Ha S, Rhee I, Xu L. CUBIC: a new TCP friendly high speed TCP variant. ACM Sigops Oper Syst Rev. 2008;42(5):64–74.

12. Tan K, Song J, Zhang Q, Sridharan M. Compound TCP: a scalable and TCP friendly congestion control for high speed networks. PFLDNet 2006.

13. Munir K, Welzl M, Damjanovic D. Linux beats Windows!—or the worrying evolution of TCP in common operating systems. PFLDNet 2007.

14. Wang J, Wei DX, Low S. Modeling and stability of FAST TCP. INFOCOM. 2005;2:938–948.

15. Katabi D, Handley M, Rohrs C. Congestion control for high bandwidth-delay product networks. ACM SIGCOMM. 2002;32(4):89–102.

16. Balakrishnan H, Dukkipati N, McKeown N, Tomlin CJ. Stability analysis of explicit congestion control protocols. IEEE Commun Lett. 2007;11(10):823–825.

17. Dukkipati N, McKeown N. Processor sharing flows in the Internet. Stanford HPNG TR 2005;271–285.

18. Kelly T. Scalable TCP: Improving performance in highspeed wide area networks. ACM SIGCOMM CCR. 2003;33(2):83–91.

19. Leith D, Shorten R. H-TCP: TCP for high speed and long distance networks. PFLDNet 2004.

Suggested Reading

1. Blanc A, Collange C, Avrachenkov K. Comparing some high speed TCP versions under Bernoulli losses. INRIA Res Rep 2009:9.

2. Blanc A, Avrachenkov K, Collange D, Neglia G. Compound TCP with random losses. Networking 2009:482–494.

3. Carofiglio G, Muscariello L, Rossi D. Rethinking Low Extra Delay Background Transport (LEDBAT) Protocols. Computer Networks. 2013;57(8):1838–1852.

4. Li Y, Leith D, Shorten RN. Experimental evaluation of TCP protocols for high-speed networks. IEEE/ACM ToN. 2007;15(5):1109–1122.
