7
Queuing and Scheduling

Throughout this book, we have always referred to the queuing and scheduling tool as the star of the QOS realm, the tool that makes crystal clear the principle of benefiting some at the expense of others. In this chapter, we analyze the internals of the queuing and scheduling mechanism that allow such differentiation. But first, we analyze the parameters associated with any queuing and scheduling mechanism and their common ground. Only then do we present the different queuing and scheduling mechanisms. The most commonly deployed mechanism is PB-DWRR, which stands for priority-based deficit weighted round robin. However, we also present all other mechanisms that have preceded PB-DWRR because analyzing their pros and cons provides the necessary clues to understand the roots and goals of PB-DWRR and why the industry has moved toward it.

7.1 Queuing and Scheduling Concepts

Queuing and scheduling applied to an interface allows traffic to be split into multiple queues so that the scheduler can decide which type of treatment the traffic inside each queue receives. If the traffic mapped to each queue belongs to a specific class of service, the scheduler can apply differentiated behavior to different classes of service, as illustrated in Figure 7.1.

Schematic of interface split into three queues Q0, Q1, and Qn serviced by round-robin scheduling corresponding to three types of Behavior1, Behavior2, and BehaviorN.

Figure 7.1 Queuing and scheduling applying different behaviors

The above is a quick recap of what has been previously discussed throughout this book, so let us now dive more deeply into the internals of queuing and scheduling.

The two most important parameters associated with the queuing and scheduling mechanism are buffers and bandwidth. Buffering is the length of the queue, that is, how much memory is available to store packets. However, the entire packet does not always need to be queued; as we see later, sometimes what is stored is a notification, a representation of the packet contents. The buffering value is a quota of available memory, and it can be defined either as a time value, the number of milliseconds of traffic accepted on the interface on which queuing is enabled, or as a physical size, the number of packets or packet notifications that can reside in the queue at the same time.
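
To make these two ways of expressing the buffering value concrete, here is a small sketch that converts a time-based buffer definition into bytes and packets; the 1 Gb/s rate and 1500-byte packet size are hypothetical values chosen for illustration:

```python
def buffer_size_bytes(rate_bps: float, buffer_ms: float) -> int:
    """Bytes needed to hold buffer_ms worth of traffic arriving at rate_bps."""
    return int(rate_bps / 8 * buffer_ms / 1000)

# Hypothetical example: 100 ms of buffering on a 1 Gb/s interface.
size = buffer_size_bytes(1e9, 100)   # 12,500,000 bytes
packets = size // 1500               # roughly 8333 full-size 1500-byte packets
print(size, packets)
```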

The bandwidth parameter refers to the scheduling part of the equation. A total amount of bandwidth is made available to the queuing and scheduling mechanism. Scheduling determines how much is allocated to each queue. The total amount of bandwidth can be either the interface speed or the shaping rate if a shaper is applied after the scheduler, as illustrated in Figure 7.2.

Figure 7.2 Bandwidth parameter in queuing and scheduling. I, interface bandwidth; S, shaping rate

The queuing and scheduling discipline used determines how the resources are allocated. The requirement for the presence of queuing and scheduling is typically controlled by the presence of congestion. If resources are sufficient and there is no competition for resources, there is no need for queuing. One way to create congestion is to place more traffic on an interface than the outgoing line speed can support. Congestion can also be created artificially by applying a shaping rate to the interface that imposes a speed limit lower than the maximum interface line speed. The leftover traffic is throttled or back pressured into memory, which is then partitioned across the actual queues. The scheduler then services the queues and is responsible for the rate at which packets from each queue are transmitted.

As previously discussed in Chapter 2, a queue has a head and a tail. A packet enters a queue at the tail, remains in the queue until it reaches the head, and then leaves the queue. In a queuing system, packets can be dropped from either the tail or the head of the queue and can even be dropped from both at the same time. Most commonly, packets are dropped from the tail. When the queue buffer fill rate is much faster than the removal rate, the buffer fills up completely. The result is that no more packets can be placed in the buffer and any newly arrived packet needs to be dropped. But a queue can also drop packets from the head. Packets at the head of the queue are the ones that have moved from the tail to the head and thus are those that have stayed in the queue for the longest amount of time. In the case of extreme congestion and resource starvation, the queue receives no scheduling slots. To avoid stalling the queue and having a hopelessly long delay for the traffic inside the queue, a maximum time is enforced on all packets in a queue for how long they are allowed to remain in the queue, waiting to be scheduled. The term for this is packet aging, meaning that the packets are aged out from the queue buffer because there is no point in trying to deliver them.

Tail drops and packet aging are not mutually exclusive. If, to drain the queue, the rate of dropping packets at the head of the queue cannot keep up with the rate at which packets are entering the queue at the tail, packet dropping can occur from both the tail and the head as a result of packet aging, as shown in Figure 7.3.

Figure 7.3 Tail and head aging drops in a queue
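
As a rough sketch of the two drop points just described, the queue below tail-drops when its buffer is full and ages out packets at the head that have waited longer than a maximum time. It illustrates the concept only; the limits are hypothetical and no vendor implementation is implied:

```python
import time
from collections import deque

class AgingQueue:
    def __init__(self, max_packets: int, max_wait_s: float):
        self.q = deque()                 # holds (enqueue_time, packet) pairs
        self.max_packets = max_packets   # buffer length
        self.max_wait_s = max_wait_s     # packet-aging threshold

    def enqueue(self, packet) -> bool:
        if len(self.q) >= self.max_packets:
            return False                 # tail drop: the buffer is full
        self.q.append((time.monotonic(), packet))
        return True

    def dequeue(self):
        # Head drops: age out packets that have waited longer than allowed.
        while self.q and time.monotonic() - self.q[0][0] > self.max_wait_s:
            self.q.popleft()
        return self.q.popleft()[1] if self.q else None
```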

7.2 Packets and Cellification

Routers can handle packet queuing in two different ways, either queuing the entire packet or splitting it into fixed-size cells, a process commonly called cellification. Queuing the entire packet is the older method and is conceptually simpler. It is commonly used in CPU processing-based forwarding routers. In this scenario, there is no split between the control and forwarding planes, which typically does not become a limiting factor if the supported interfaces are low-speed ones. However, queuing the entire packet has its drawbacks, which have led to the appearance and popularity of cellification. A packet has a variable length, where the minimum and maximum values are specified by the technology or optionally by the interface configuration. Also, a packet can contain several different headers. These possible variations in the packet length and in the number of possible headers create the challenge of how to manage memory resources. Also, the router CPU is affected, because every time the packet is processed, it needs to be read bit by bit. So different packet lengths imply different processing delays. In Chapter 3, we discussed this topic briefly when we analyzed several different types of delay factors inherent to packet transmission. As we stated in Chapter 3, we ignore processing delay, which can be done if the control and forwarding planes of the router are independent.

The cellification process chops a packet into fixed-size cells, which makes it easier for the router’s memory management to deal with the packets. It also provides consistency in terms of transmission times and slots for the buffering required. The cellification process is shown in Figure 7.4, which shows a packet containing several headers that is split into 64-byte cells.

Schematic of the cellification process, showing a packet composed of an application payload, a TCP/UDP header, and an IP header that is split into 64-byte cells.

Figure 7.4 Cellification of a packet

Cellification also uses a special cell called the notification, also called the “cookie.” The cookie contains the pieces of the packet that are relevant for its processing inside the router. So instead of processing the entire packet every single time it needs to be evaluated by some tool, the router evaluates only the cookie. The idea is to create a representation of the packet that carries only the required information; for example, if an IP packet carries a DSCP value of X in its header, that value is copied into the cookie, while other information, such as the packet payload, is not. The cookie contents are dynamic, so if the packet is assigned to a certain class of service, this information is written to the cookie. Also, if information contained in the packet header, such as the DSCP field, is changed, the cookie is updated as well. The cellification process is typically completely transparent in terms of both the router’s configuration and its management. Nevertheless, it is important to understand its basics and the benefits it offers.
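
As a minimal sketch of these two ideas, the code below chops a packet into 64-byte cells and builds a cookie that carries only the fields needed for processing. The cell size follows Figure 7.4, but the cookie fields are illustrative assumptions, not any vendor's format:

```python
CELL_SIZE = 64  # bytes, as in Figure 7.4

def cellify(packet: bytes) -> list:
    """Chop a variable-length packet into fixed-size cells, padding the last one."""
    cells = [packet[i:i + CELL_SIZE] for i in range(0, len(packet), CELL_SIZE)]
    if cells and len(cells[-1]) < CELL_SIZE:
        cells[-1] = cells[-1].ljust(CELL_SIZE, b"\x00")
    return cells

def build_cookie(dscp: int, length: int, egress_queue=None) -> dict:
    """Notification ('cookie'): only the information needed to process the packet."""
    return {"dscp": dscp, "length": length, "egress_queue": egress_queue}

cells = cellify(b"\x45" * 200)     # a 200-byte packet becomes four 64-byte cells
cookie = build_cookie(dscp=46, length=200)
cookie["egress_queue"] = 0         # the cookie is updated as classification decisions are made
```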

7.3 Different Types of Queuing Disciplines

Over the years, several mathematical studies have led to the design and creation of a variety of queuing algorithms. Describing all possible queuing mechanisms is impossible. However, the major well-known scheduling disciplines are detailed in the next sections of this chapter:

  • First in, first out (FIFO) queuing
  • Fair queuing (FQ)
  • Priority queuing (PQ)
  • Weighted fair queuing (WFQ)
  • Weighted round robin (WRR)
  • Deficit weighted round robin (DWRR)
  • Priority-based deficit weighted round robin (PB-DWRR)

The most powerful of these and perhaps the most interesting is PB-DWRR. However, this chapter discusses them all, because it is interesting to evaluate the pros and cons of each one and because doing so illuminates the evolutionary history of queuing algorithms and the reasons behind the popularity of PB-DWRR.

7.4 FIFO

FIFO is a well-known acronym for first in, first out and is probably the most basic queuing scheduling discipline. The principle behind FIFO queuing is that all packets are treated equally by all being placed in the same queue. So in a nutshell, there is one queue and the scheduler only serves this queue. This mechanism implies that the removal rate is directly inherited from either the interface speed or from the shaping rate if there is a shaper applied. Because with any queuing mechanism there is no overtaking within a queue, the packets are serviced in the same order in which they were placed into the queue. This is illustrated in Figure 7.5.

Figure 7.5 FIFO scheduling

For most router vendors, FIFO is probably the default behavior, with one queue for most transit traffic plus another one for locally generated control plane packets such as routing protocol packets.

Generally speaking, most hardware implementations need at least one buffer per interface to be able to build a packet before transmitting it. The function of the queue is to be a placeholder buffer that is used when the Layer 1 and Layer 2 headers are added to the packet. Providing a buffer the size of two to four packets helps the system utilize the interface efficiently and at line rate. FIFO service is predictable and the algorithm is simple to implement for a router or a switch. However, the FIFO algorithm is perhaps not a true “queuing and scheduling” solution because it does not provide any form of differentiated services, since traffic belonging to different classes of service shares the same road lane (effectively the same queue) to the scheduler.

Under congestion conditions, FIFO queuing first introduces delay as the queue starts to fill, and when the queue becomes full, all newly arrived packets are discarded. Here, implementations can differ in their drop behavior.

One alternative is to drop everything in the buffer to avoid having the queue collapse under the burden of stalled sessions. The reason for this behavior is to avoid a situation similar to the TCP silly window syndrome, in which the receiving window gets smaller and smaller because most of the packets in the buffer memory are waiting to be processed. The result is that all sessions suffer service degradation and the packets are stuck in the queue in the order in which they arrived. Despite the possible availability of large buffer memory, the buffers cannot be utilized properly because only a small portion of each sender’s congestion window (CWND) can be used.

Another drop alternative is to use the more intelligent ways described earlier to clear the queue by dropping from the tail and by using packet aging.

FIFO queuing has the following benefits:

  • The FIFO algorithm is simple to implement.
  • It provides acceptable performance if the queue saturation is low and short. In fact, FIFO can be seen as more of a burst-absorbing implementation than a true queuing strategy.
  • If applications are TCP based and not delay sensitive, the process of removal from the queue is well suited because packet ordering is preserved at the cost of introducing delay. If moderate congestion occurs, TCP slows down because of the increased RTT, but retransmissions are kept to a minimum.

FIFO queuing has the following limitations:

  • FIFO provides no differentiated services. In cases of extreme congestion and buffer usage, all services are equally bad.
  • Delay and jitter cannot be controlled because the queue depth usage can vary. Therefore, FIFO is not an appropriate solution for real-time applications. For example, a voice flow consisting of many packets might be stuck behind a 1500-byte TCP transfer with a large CWND, a concept discussed in Chapter 4.
  • Greedy flows can take up most of the queue depth, and thus bursty flows can consume all available buffer space. Because TCP, by nature, tries to scale up the size of the transmitting window, session control is very difficult with a single queue or buffer.

7.5 FQ

The main disadvantage of FIFO queuing is that flows consisting of many packets can take up most of the bandwidth, leaving little for less bandwidth-intensive applications, because FIFO does not separate flows or streams of packets. FQ, also commonly called the fairness algorithm, is a scheduling algorithm that addresses this basic limitation of FIFO queuing. FQ classifies packet flows into multiple queues, offering a fair scheduling scheme for the flows to access the link. In this way, FQ separates traffic and flows and prevents applications that consume less bandwidth from being starved by applications that consume more bandwidth.

The scheduling is very much a statistical multiplexing process among the queues, with queues buffering packets that belong to different flows. FQ, defined by John Nagle in 1985, takes into account packet size to ensure that each flow receives an equal opportunity to transmit an equal amount of data. Each queue is assigned the same weight; that is, the queues are scheduled with the same amount of bandwidth. This scheme avoids the problems of dealing with small and large packets in the queue, so the speed of removal from the queue is a function of a number of bits, not a function of a number of packets. So, for example, a queue with large packets has access to the same bandwidth as a queue with small packets, because in each scheduling turn the queue is serviced in terms of number of bits, not number of packets.

The main issue with the fairness algorithm is the fact that it is indeed fair. This may seem an odd complaint, but fairness defeats the desired QOS goal of providing an uneven distribution of resources across different classes of service. If many queues must be visited in a fair order, flows that require low delay can suffer.

For routers to implement the FQ model, a hash function needs to separate flows from each other to map packets to a particular session. The hash function can compute the session or flow using a rich set of variables, such as a combination of source and destination addresses, protocol, and port numbers, and possibly even higher-layer information beyond the TCP or UDP header. But once this classification is done, dynamically allocating memory and creating logical queues that exist only while the flow is active is a very resource-intensive task that is hard to implement when most of the packet forwarding takes place within a hardware-based architecture that has little interaction with processing resources. As a result, classifying and dynamically creating queues for each active hashed flow is an unscalable process.
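
The hash step itself is straightforward; the scaling problem lies in the per-flow queue state. The sketch below hashes a flow's 5-tuple onto a fixed set of queues, a common compromise rather than a queue per flow; the queue count and field choice are assumptions made for illustration:

```python
import zlib
from collections import deque

NUM_QUEUES = 1024                      # fixed, hypothetical number of flow queues
queues = [deque() for _ in range(NUM_QUEUES)]

def flow_queue(src_ip, dst_ip, proto, src_port, dst_port) -> int:
    """Map a flow's 5-tuple to a queue index with a stable hash."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % NUM_QUEUES

idx = flow_queue("192.0.2.1", "198.51.100.7", "tcp", 49152, 443)
queues[idx].append(b"packet bytes")    # all packets of the flow land in the same queue
```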

The fairness algorithm is a common tactic to handle oversubscribed backplanes on routers and switches. A common hardware architecture has a one- or two-stage switch fabric that connects line modules. In this scenario, incoming traffic from multiple ports is forwarded over the fabric to a single destination port. Here, fairness can ensure that the incoming ports on the line modules have equal access to the fabric so that their traffic can be delivered to the ports on the outgoing module, as illustrated in Figure 7.6.

Schematic of traffic from a white port and a black port arriving at the ingress line card. The traffic flows through FQ and the fabric switch and is delivered to the white and black ports on the egress line card.

Figure 7.6 Fairness algorithm

However, the fairness algorithm does not take into account the fact that the backplane is not aware of individual flows, which are the basis for the original idea of using FQ to create bandwidth fairness. So what is indeed implemented is a slight variation of the original FQ algorithm.

The FQ algorithm implies that the data rate may fluctuate for short intervals, because the packets are delivered sequentially to achieve fairness in each scheduling cycle. This behavior can possibly lower the throughput, especially when compared to algorithms that focus more on maximum throughput. On the plus side, however, FQ is very effective in avoiding starvation of flows under heavy loads.

FQ offers the following benefit:

  • The FQ algorithm isolates each flow into its own logical queue. Thus, in theory, a greedy flow cannot affect other queues.

FQ has the following limitations:

  • The FQ algorithm is extremely complicated to implement, and no example of a vendor implementation exists on high-speed routers or routers created to handle large numbers of sessions and large amounts of traffic. FQ is more of a theoretical construct than a practical paradigm.
  • It is resource intensive because many states and hashes must be computed and because memory must be allocated and reallocated based on changes in the session state.
  • Delay and jitter can still be issues because each session hash is seen as a queue entity. If many sessions need to be scheduled, the multiplexing scheme used to determine the session slot interval can vary, depending on how many active sessions are actively transmitting packets. The wait time to the next scheduled service can be long if many sessions are active.

7.6 PQ

For applications that are sensitive to delay and that are not able to handle packet loss, the PQ scheduling algorithm provides a simple method of supporting differentiated service classes. After a classification scheme has classified the packets and placed them into different queues with different priorities, the PQ algorithm handles the scheduling of the packets following a priority-based model. Packets are scheduled to be transmitted from the head of a given queue only if all queues of higher priority are empty. So the different levels of priority assigned to queues introduce the desired unfairness in how queues are serviced. However, inside a single priority queue, packets are still scheduled in FIFO order, as with any other queuing and scheduling method. The PQ algorithm operation is illustrated in Figure 7.7.

Schematic of two queues, Q0 (labeled high priority and consists of three packets) and Q1 (labeled low priority and consists of three packets) serviced by round-robin scheduling and delivered to the egress order. Four white packets are atop the egress order.

Figure 7.7 PQ scheduling

In Figure 7.7, queue zero (Q0) has the highest priority, so as long as there are packets in Q0, the scheduler serves only this queue. Q1 has a low priority, so packets are removed from this queue only when Q0 is empty.
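
As a minimal sketch of this strict-priority decision (an illustration only, with hypothetical packet labels), the scheduler below always drains the highest-priority non-empty queue first and falls back to lower priorities only when the higher queues are empty:

```python
from collections import deque

# Queues ordered from highest to lowest priority; Q0 first, as in Figure 7.7.
queues = [deque(), deque()]

def pq_dequeue():
    """Serve the first non-empty queue in priority order; FIFO within each queue."""
    for q in queues:
        if q:
            return q.popleft()
    return None   # every queue is empty

queues[1].extend(["best-effort-1", "best-effort-2"])   # low-priority traffic in Q1
queues[0].append("voice-1")                            # high-priority traffic in Q0
print(pq_dequeue())   # 'voice-1': Q0 is always drained before Q1 is considered
```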

PQ offers benefits for traffic that requires no packet loss and low delay. With PQ, such applications can be selectively scheduled, and their service can be differentiated from the bulk of the best-effort flows. The PQ requirements for queuing and scheduling are minimal and not very complex compared with other more elaborate queuing disciplines that also offer differentiated service.

However, PQ has its own set of limitations. If the volume of high-priority traffic becomes excessive, lower-priority queues may face complete resource starvation. If their queuing rate remains constant but the rate of removal from the queue decreases, packet drops start to occur, either from the tail or the head of the queue or possibly both. The drop rate for traffic placed in low-priority queues increases as the buffer space allocated to the low-priority queues starts to overflow. The result is not only packet drops but also increased latency. Regarding TCP sessions, queued TCP traffic may become stale and have to be retransmitted. Also, introducing such a conservative queuing and scheduling mechanism can affect the network service to such an extent that the service for the whole network decreases because the majority of traffic and applications are starved. As discussed in Chapter 4, some real-time applications require several packet types to achieve a real-time service. For example, for Session Initiation Protocol (SIP) traffic, the PQ algorithm can prioritize SIP traffic. However, if DHCP offers cannot reach users and DNS servers cannot resolve queries to reach SIP servers and users, the gains obtained by favoring SIP traffic are limited because no users are available to establish SIP sessions. In addition, misbehaving high-priority flows can add delay and jitter to other high-priority flows that share the same queue. Effectively, a real-time service comprises several different applications and packet types, and all of them must receive equivalent service levels for the real-time service to operate properly. Having all one’s eggs in the same basket is not a design for today’s Internet and for large-scale enterprise networks.

PQ can be implemented both in a strict mode and in a rate-controlled mode. For example, the priority queue rate can be policed to drop or reclassify traffic if the rate increases beyond certain thresholds. Such thresholds can be specified as a percentage of the link bandwidth or an explicit rate. This method avoids starvation of lower-priority queues and can also provide control over the delay inserted in traffic in high-priority queues by establishing a limit on the maximum amount of traffic that can be queued.

PQ has the following benefits:

  • The PQ algorithm provides a simple method of supporting differentiated service classes, in which different queues are serviced differently according to the priority assigned. The buffering and scheduling computational requirements are low and not very complex, even when implementing PQ on network equipment with high-speed links.
  • Low-delay and loss-sensitive applications such as real-time traffic can be effectively protected from other greedy traffic flows such as TCP sessions, and low delay and jitter can be maintained for the traffic classified as high priority.

PQ has the following limitations:

  • Because of its absolute allocation of resources and scheduling services to the priority-classified traffic, the PQ algorithm can result in network malfunction because low-priority traffic becomes stalled. Giving certain traffic a higher priority comes at the expense of penalizing lower-priority traffic.
  • High-priority traffic must be very well provisioned and rate controlled. Otherwise, service breakdowns can result for all traffic that is not high priority.

7.7 WFQ

WFQ is commonly referred to as “bit-by-bit round robin” because it implements a queuing and scheduling mechanism in which queue servicing is based on bits instead of packets. WFQ was developed in 1989 by Demers, Keshav, and Shenker and emulates the Generalized Processor Sharing (GPS) concept of virtual processing time. In WFQ, each queue or flow is allocated a weight that is a proportion of the interface rate or the shaping rate. WFQ is aware of packet sizes and can thus support variable-sized packets. The benefit is that sessions with big packets do not get more scheduling time than sessions with smaller packets, because the focus of WFQ is bits, not packets. So there is no unfairness in the scheduling for sessions with smaller packet sizes. With WFQ, each queue is scheduled based on a computation performed on the bits of each packet at the head of the queue. Because this computation is done on a stream of bits rather than on packets, while what the router receives and transmits are indeed packets, WFQ implicitly increases complexity. Figure 7.8 illustrates the WFQ scheduling principle.

Schematic of three queues, Q0 (has four packets), Q1 (has two packets), and Q2 (has two packets), all of the same weight, serviced by round-robin scheduling. Q0 removes three packets with a total value of X bytes, Q1 removes one packet with a total value of Y bytes, and Q2 removes two packets with a total value of Z bytes.

Figure 7.8 WFQ scheduling

In Figure 7.8, all queues have the same weight, and thus the number of bytes scheduled in each scheduling turn is the same for all queues and reflects the weight value. Q0 removes three packets with a total value of X bytes, Q1 removes one packet with Y bytes, and Q2 removes two packets with Z bytes. The weight factor is effectively an allowance for how many resources can be assigned and used.
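
The following sketch approximates the WFQ idea with per-packet virtual finish tags; it is closer to a self-clocked approximation than to an exact GPS emulation, and the queue names and weights are hypothetical:

```python
import heapq

class WfqSketch:
    """Serve the packet with the smallest virtual finish tag.
    A finish-tag approximation of WFQ, not an exact GPS emulation."""
    def __init__(self, weights):
        self.weights = weights                       # e.g. {"Q0": 1.0, "Q1": 1.0}
        self.last_finish = {q: 0.0 for q in weights}
        self.virtual_time = 0.0
        self.heap = []                               # (finish_tag, queue, size)

    def enqueue(self, queue, size_bytes):
        start = max(self.virtual_time, self.last_finish[queue])
        finish = start + size_bytes / self.weights[queue]
        self.last_finish[queue] = finish
        heapq.heappush(self.heap, (finish, queue, size_bytes))

    def dequeue(self):
        if not self.heap:
            return None
        finish, queue, size = heapq.heappop(self.heap)
        self.virtual_time = finish                   # self-clocked virtual time
        return queue, size

wfq = WfqSketch({"Q0": 1.0, "Q1": 1.0})
wfq.enqueue("Q0", 1500)
wfq.enqueue("Q1", 64)
print(wfq.dequeue())   # ('Q1', 64): under equal weights the small packet finishes first
```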

The drawback of WFQ is that it is very resource intensive because of the bit computations. The original WFQ idea also consumes many resources because the flows are not aggregated into classes with a limited number of queues. Instead, each flow or stream gets its own queue or buffer quota, similar to FQ. Because of these high resource demands and the complex computation needed to check the state for each flow and its packets, WFQ has been implemented mostly on CPU-based platforms with bus-based architectures.

WFQ has the following benefit:

  • Implementations based on the WFQ algorithm provide service differentiation between classes and their aggregated traffic, rather than merely differentiating between individual flows. A weight allocated to each class divides the scheduling and bandwidth ratio for each class. Also, because WFQ is bit aware, it can handle packets of variable length.

WFQ has the following limitations:

  • The original WFQ design is more of a queuing theory. The existing implementations do not follow the original concept in which each flow is allocated a weight. Instead, flows are aggregated by being classified into different service classes, and these classes are then assigned to queues.
  • WFQ is extremely complicated to implement, as is FQ. Maintaining state information and computing the hash table are resource-intensive tasks.

7.8 WRR

WRR is a scheduling discipline that addresses the shortcomings of PQ and FQ. The basic concept of WRR is that it handles the scheduling for classes that require different amounts of bandwidth. WRR accomplishes this by allowing several packets to be removed from a queue each time that queue receives a scheduling turn. WRR also addresses the issue with PQ in which one queue can starve queues that are not high-priority queues. WRR does this by allowing at least one packet to be removed from each queue containing packets in each scheduling turn. At first glance, it may seem that WRR is very similar to WFQ. The difference between the two is that WFQ services bits at each scheduling turn, whereas WRR handles packets in each scheduling turn. The number of packets to be serviced in each scheduling turn is decided by the weight of the queue. The weight is usually a percentage of the interface bandwidth, thereby reflecting the service differences between the queues and the traffic classes assigned to those queues. Figure 7.9 illustrates WRR.

Schematic of three queues, Q0 (has four white packets), Q1 (has two black packets), and Q2 (has two gray packets), of weights 60%, 20%, and 20%, respectively, serviced by round-robin scheduling. Q0 removes three packets, Q1 removes one packet, and Q2 removes one packet.

Figure 7.9 WRR scheduling

As illustrated in Figure 7.9, three packets are removed from Q0 and one packet each from Q1 and Q2. The number of packets removed reflects the weight of the queue. As seen, Q0 has three times the weight of Q1 and Q2, and thus it removes three times as many packets in each scheduling turn.
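
A small sketch of the packet-count-based turn shown in Figure 7.9 follows. The translation from a weight percentage to a number of packets per turn is an assumption made for illustration; real implementations derive it differently:

```python
from collections import deque

queues = {"Q0": deque(), "Q1": deque(), "Q2": deque()}
weights = {"Q0": 60, "Q1": 20, "Q2": 20}   # percent of interface bandwidth, as in Figure 7.9

# Assumed translation: 20% of bandwidth corresponds to one packet per turn.
packets_per_turn = {q: max(1, w // 20) for q, w in weights.items()}   # 3, 1, 1

def wrr_turn():
    """One WRR turn: up to packets_per_turn packets from each non-empty queue."""
    sent = []
    for name, q in queues.items():
        for _ in range(packets_per_turn[name]):
            if not q:
                break
            sent.append(q.popleft())   # packet count only; packet size is ignored
    return sent
```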

WRR has no knowledge of the true sizes of the packets in the buffers that are to be scheduled. The queues and scheduling are generally optimized for an average packet size. However, the sizes are all just estimates and have no true meaning with regard to the actual traffic mix in each queue. This behavior of WRR is both an advantage and an issue. Because WRR has no resource-demanding state computation, as WFQ does when it transforms bits into bandwidth scheduling, it is fairly simple to implement. The result is a solution well suited for handling a large number of flows and sessions, making WRR something of a core QOS solution that can deal with large volumes of traffic and with congestion. The drawback of WRR is that it is unaware of bandwidth because it does not handle variable-sized packets: in terms of receiving a scheduling turn, a 1500-byte packet is equivalent to a 64-byte packet. In practice, the weight translates only into a number of packets per round of service scheduling. When the traffic mix is roughly the same across classes and queues, WRR is, over time, acceptably fair in its scheduling. However, in the short term, the differences can be great. And if traffic classes have large variations in traffic mix and traffic types, and thus large differences in packet sizes, the queuing situation can become quite unfair, favoring classes that are dominated by large packets. For example, a TCP-based traffic class can gain the advantage over a real-time traffic class whose packets are relatively small, as illustrated in Figure 7.10.

Schematic of three queues, Q0 (has four white packets), Q1 (has two black packets), and Q2 (has two gray packets), of weights 50%, 25%, and 25%, respectively, serviced by round-robin scheduling. Q0 removes two packets, Q1 removes one packet, and Q2 removes one packet.

Figure 7.10 Comparing WRR with large and small packets

In Figure 7.10, Q1 and Q2 have the same weight. However, because the packets in Q1 are twice the size of packets in Q2, in reality Q1’s bandwidth rate is twice that of Q2.

WRR has the following benefits:

  • The WRR algorithm is easy to implement.
  • To implement differentiated service, each queue is allocated a weight that reflects the interface’s bandwidth. This scheme is fair because it avoids having one queue starve another queue and because it is possible to provide service priority by allocating various weights to the queues.

WRR has the following limitations:

  • Because WRR is unaware of packet lengths, scheduling can be unfair when queues hold different-sized packets. When one queue has mostly small packets while another has mostly big packets, more bandwidth is allocated to the queue with big packets.
  • Services that have a very strict demand on delay and jitter can be affected by the scheduling order of other queues. WRR offers no priority levels in its scheduling.

7.9 DWRR

With WRR, for each scheduling turn, the number of packets that are granted service is based on a weight that reflects the bandwidth allocation for the queue. As discussed in the previous section, bandwidth allocation can be unfair when the average packet sizes differ between the queues and their flows. This behavior can result in service degradation for queues with smaller average packet sizes. Deficit Round Robin (DRR), or Deficit Weighted Round Robin (DWRR), is a modified weighted round-robin scheduling discipline that addresses the limitations of WRR. Deficit algorithms can handle packets of variable size without knowing the mean size: each queue accumulates a byte allowance per scheduling turn, the packet length is subtracted from that allowance, and packets that exceed the remaining allowance are held back until the next scheduling turn.

While WRR serves every nonempty queue, DWRR serves packets at the head of every nonempty queue whose deficit counter is greater than the size of the packet at the head of the queue. If the deficit counter is lower, the queue is skipped and its unused credit is carried over; a per-turn allowance, called the quantum, is added again before the scheduler next examines the queue, and the sum is used to calculate the deficit counter. If the queue is served, the credit is decremented by the size of the packet being served.

Following are the key elements and parameters in implementing DWRR that affect the scheduling service for the queues:

  • Weight, which is similar to WRR algorithm weight, reflects a proportion of the bandwidth on the outgoing interface.
  • The quantum translates the weight value into bytes. With each scheduling turn, a value of quantum is added in proportion to the weight. Thus, the quantum is the throttle mechanism that allows the scheduling to be aware of the bandwidth.
  • The value of credits can be positive or negative. Positive credits accrue when, at the end of a scheduler turn, there are leftover bytes that were not used in the queue’s scheduling turn. This value is deferred to the queue’s next scheduling turn. Negative credits accrue when the queue has transmitted more than its bandwidth value in a scheduling turn, and thus the queue is in debt when the next scheduling turn comes around.
  • The deficit counter, which provides bandwidth fairness, is the sum of the quantum and the credits. The scheduler removes packets until the deficit counter reaches a value of zero or until the size of the packets is larger than the remaining deficits. But if the queue does not have enough deficits to schedule a packet, the value of the deficit is retained until the next scheduling round, and the scheduler moves to the next queue.
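
Putting these elements together, the following is a minimal sketch of one DWRR scheduling turn. It follows the description above (quantum added per turn, leftover deficit carried while the queue stays non-empty, deficit reset to zero when the queue empties); the byte values match the worked example that follows, but the code itself is only an illustration, not any vendor's implementation:

```python
from collections import deque

class DwrrQueue:
    def __init__(self, quantum_bytes: int):
        self.quantum = quantum_bytes   # per-turn byte allowance derived from the weight
        self.deficit = 0               # quantum plus carried-over credits
        self.packets = deque()         # packet sizes in bytes

def dwrr_turn(queues):
    """One full scheduling turn over all queues; returns (queue index, size) pairs sent."""
    sent = []
    for i, q in enumerate(queues):
        if not q.packets:
            q.deficit = 0              # an empty queue accumulates no credit
            continue
        q.deficit += q.quantum
        while q.packets and q.packets[0] <= q.deficit:
            size = q.packets.popleft()
            q.deficit -= size
            sent.append((i, size))
        if not q.packets:
            q.deficit = 0
    return sent

# The values of the worked example below: Q0 quantum 1000, Q1 and Q2 quantum 500.
q0, q1, q2 = DwrrQueue(1000), DwrrQueue(500), DwrrQueue(500)
q0.packets.extend([300, 500, 300])
q1.packets.extend([1000, 1000])
q2.packets.extend([300, 300])
print(dwrr_turn([q0, q1, q2]))   # turn 1: [(0, 300), (0, 500), (2, 300)]
```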

Let us go through DWRR in practice to illustrate its attributes. Consider the setup shown in Figure 7.11, with three queues whose different weights reflect the proportion of bandwidth of the outgoing interface. Q0 has 50% of the bandwidth, and Q1 and Q2 each have 25%. To simplify the example, all queues have an equal length. The queues hold variable-sized packets and different numbers of packets. At each scheduling turn, each queue receives a quantum of credits that reflects its weight as a quota of the bandwidth.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling.

Figure 7.11 DWRR scheduling configuration

Figure 7.12 shows the first scheduling turn and the dequeuing process for Q0. In Figure 7.12, we see that Q0 has three packets, two that are 300 bytes and one that is 500 bytes. Each scheduling turn, a quantum value of 1000 is added to Q0. In this turn, then, the deficit counter is 1000, which means that 1000 bytes can be removed from the queue. The 300-byte and 500-byte packets can be transmitted because they consume 800 credits from the 1000-credit bucket. But the last 300-byte packet cannot be transmitted in this turn, because 300 is bigger than the remaining 200 credits (1000 − 800 = 200). The result is that the scheduling jumps, in order, to the next queue with packets inside, and the 200 credits for Q0 are saved for the next scheduling turn. Now let’s look at the next queue in order, Q1, as illustrated in Figure 7.13.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling. Q0 has three packets, two of which are 300 bytes and one of which is 500 bytes. Q1 has two packets of 1000 bytes each, and Q2 has two packets of 300 bytes each.

Figure 7.12 DWRR scheduling, turn 1, Q0

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling. Q0 has one packet of 300 bytes. Q1 has two packets of 1000 bytes each, and Q2 has two packets of 300 bytes each.

Figure 7.13 DWRR scheduling, turn 1, Q1

Q1 contains two 1000-byte packets, but the quantum per scheduling turn is 500 credits. This means that no packets can be transmitted in this scheduling cycle because not enough credits exist (500 < 1000). The result is that the deficit counter for Q1 is incremented with 500 credits for the next scheduling round. This example illustrates the power of deficit-style algorithms compared with plain round-robin systems. With DWRR, Q1 is actually punished because the available quantum and credit values are smaller than the actual packet sizes. This situation is proof that DWRR scheduling is aware of variable packet sizes, not just the number of packets to be queued. Let us move to the next queue in order, Q2, shown in Figure 7.14.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling. Q0 has one packet of 300 bytes. Q1 has two packets of 1000 bytes each, and Q2 has one packet of 300 bytes.

Figure 7.14 DWRR scheduling, turn 1, Q2

Q2 has a quantum value of 500, and because the packet at the head of the queue is 300 bytes, it is transmitted. The value of 300 is subtracted from the 500, resulting in 200 credits that are saved for the next scheduling turn. The scheduling wheel has now completed one full turn, and we are back at Q0, as illustrated in Figure 7.15.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling. Q1 has two packets of 1000 bytes each and Q2 has one packet of 300 bytes.

Figure 7.15 DWRR scheduling, turn 2, Q0

The Q0 deficit counter is now 1200. The 300-byte packet is transmitted and its value is subtracted from the deficit counter, resulting in a new value of 900 credits (see Figure 7.15). While implementations can vary, a common behavior is to set the deficit counter to zero if no packets are in the queue. Queues that behave this way are honored by the algorithm, but they receive no extra points for no work. You cannot get fame and glory unless the queue is filled with packets. Now the scheduler returns to the troubled Q1, from which no packets were transmitted in turn 1, as illustrated in Figure 7.16.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling. Q1 has one packet of 1000 bytes and Q2 has one packet of 300 bytes.

Figure 7.16 DWRR scheduling, turn 2, Q1

Finally, there is some good news for Q1. The updated quantum value is 500, and with the 500 credits saved from cycle 1, for a total of 1000 credits, Q1 can transmit one of its 1000-byte packets. Moving to Q2, the news is all good. The deficit counter is far bigger than the number of bytes in the queue, so the last packet is transmitted, as illustrated in Figure 7.17.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling. Q1 has one packet of 1000 bytes.

Figure 7.17 DWRR scheduling, turn 2, Q2

Let us look at a third scheduling turn, which illustrates an interesting DWRR scenario. Q0 has no packets to schedule, so it is not serviced and the scheduler moves to Q1. This queue has the same problem, with big packets eating up the deficit counter, and Q1 cannot transmit its packet, as illustrated in Figure 7.18.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling. Q1 has one packet of 1000 bytes.

Figure 7.18 DWRR scheduling, turn 3, Q1

Scheduling turns, that is, bandwidth that has not been used by one queue, can be shared with other queues. So for poor Q1, this is good news. The extra available quantum of credits from Q0 is shared with the other queues that have packets to be transmitted, with the leftover quantum divided between the queues based on their weight: the more weight, the more of the extra quantum credits are allocated. Because Q2 had no packets in its buffer, the entire leftover quantum from Q0 is given to Q1. This is illustrated in Figure 7.19.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling.

Figure 7.19 DWRR scheduling, turn 3, leftover bandwidth

How does a deficit-based system drop packets? The previous example did not explicitly examine this. In the packet-dropping schemes described earlier in this chapter, packets can be dropped at the head of the queue when packets remain in the queue too long because of congestion, and packets can also be dropped from the tail when they cannot be placed into a full queue. This is also the process for DWRR. In the previous example, for ease of understanding, the credits were set to a low value. For the most part, the credit value reflects the MTU size of the outgoing interface. Thus, the credit refill rate should keep up with the queue refill rate to avoid a complete forwarding stalemate in the queue.

An alternative way to describe the state in which no scheduling occurs is with the concept of negative versus positive credit states. When a queue enters a negative state, it does not receive any service because it exceeded its rate credit in its last scheduling turn. A queue can enter negative credit territory if the last packet could not be transmitted because of insufficient credits, resulting in a deficit counter that is negative. Once the queue is in this negative credit state, it cannot schedule packets until the quantum of credits added returns the currently negative deficit counter value to a positive number. The only way to get into positive credit territory is to have the queue build up enough credits to be able to schedule packets again. In summary, a queue does not receive any scheduling slots unless the deficit counter is greater than zero.

Let us consider the earlier scenario with Q1. Here, Q1 is implicitly punished because it has attempted to use more than its quantum of credit. The result is that in scheduling turn 1, no packets are transmitted because the quantum value is less than the size of the packet at the head of the queue. In such a scenario, packet drops occur at the head of the queue because the packets stay in the queue too long and age out. In the case of severe congestion, clearing of the queue cannot keep up with the rate at which the queue is filled.

With the logic of negative credits, a queue is allowed to have a negative value for the deficit counter. Figure 7.20 shows that Q1 receives a negative credit value during the first scheduling turn because the 1000-byte packet is bigger than the quantum of 500. The result is a deficit counter value of −500.

Schematic of three queues, Q0, Q1, and Q2 of weights 50%, 25%, and 25% serviced by round-robin scheduling. Q0 has one packet of 300 bytes, Q1 has one packet of 1000 bytes, and Q2 has two packets of 300 bytes each.

Figure 7.20 DWRR scheduling, turn 1, Q1 with negative credits

At the next scheduling turn for Q1, the state is negative, and adding the quantum of 500 only brings the deficit counter back to zero, so no packets are scheduled. Because the deficit counter is not greater than zero, the scheduling algorithm skips to the next queue. The good news for Q1, however, is that at the third scheduling turn, the queue has a positive credit state and can be scheduled again. Whether Q1 actually drops the 1000-byte packet while it is waiting to receive a scheduling turn depends on the buffer depth and the fill rate of the queue. The negative versus positive credit state is a way to honor well-behaved queues and punish poorly behaved ones.

DWRR has the following benefit:

  • The DWRR algorithm overcomes the shortcomings of WRR; it is in effect a more practical form of WFQ, with scheduling that is aware of both bytes and packets.

DWRR has the following limitation:

  • Services that have very strict demands on delay and jitter can be affected by the scheduling order of the other queues. DWRR has no way to prioritize its scheduling.

7.10 PB-DWRR

Introducing the deficit counter allows the WRR algorithm to be aware of bandwidth and improves fairness. But for certain traffic types, fairness is not the desired behavior. What is needed is priority scheduling similar to PQ that preserves the benefits of DWRR. To achieve predictable service for sensitive, real-time traffic, a priority level for scheduling needs to be introduced. By enabling strict priority, or by offering several priority levels and using DWRR scheduling between queues with the same priority level, service assurance with regard to delay and loss protection can be achieved for demanding traffic types, such as voice and real-time broadcasting. To illustrate this, consider three queues, one of which has a priority of strict high, as shown in Figure 7.21.

Schematic of three queues: Q0 and Q1, labeled Priority Low, with weights of 25% and 50%, and Q2, labeled Priority Strict-High, with a weight of 100%. Q0 has three packets of sizes 300, 500, and 300. Q1 has two packets of size 1000 each and Q2 has two packets of size 200 each.

Figure 7.21 PB-DWRR scheduling, with one strict-high priority queue

When service cycles are scheduled, Q2, which has strict-high priority, always has preference over the two low-priority queues. In each scheduling turn, Q2 is drained first. If the strict-high queue is filled faster than it is cleared, the result is that the packets in Q0 and Q1 become stale and can age out, because these queues are no longer serviced according to their defined weights. This aging out of packets is a side effect of the fact that a strict-high queue never goes into negative credits. However, a strict-high priority queue system can work if the rate of packets entering Q2 is controlled, limited, for example, to 20% of the total bandwidth. Then the credit state is reflected accordingly. One way to control the rate into a queue is to add a policer that limits the rate by dropping packets that exceed a defined rate limit. This scheme avoids having the strict-priority queue enter a runaway state in which it monopolizes all the scheduling cycles. Another way to control the rate into the queue is to control the delay for the queue. Figure 7.22 illustrates rate control. Assigning a weight of 25% of the scheduling bandwidth to Q2 prevents more than one packet from being in Q2 at a time; as a result, Q2 drops one of the 200-byte packets at the tail of the queue.

Schematic of three queues: Q0 and Q1, labeled Priority Low, with weights of 25% and 50%, and Q2, labeled Priority Strict-High, with a weight of 100%. Q0 has three packets of sizes 300, 500, and 300. Q1 has two packets of size 1000 each and Q2 has one packet of size 200. One 200-byte packet is discarded.

Figure 7.22 PB-DWRR scheduling, policed priority-high queue
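
One way to express the "police the strict-high queue" idea is a token bucket in front of that queue, as sketched below. The rate and burst values are hypothetical, and the sketch ignores the reclassification option mentioned above:

```python
import time

class Policer:
    """Token bucket: admit a packet only if enough tokens (bytes) are available."""
    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate = rate_bps / 8        # refill rate in bytes per second
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def admit(self, size_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False                    # exceeds the configured rate: drop (or reclassify)

# Hypothetical limit of 20% of a 1 Gb/s link for the strict-high queue.
strict_high_policer = Policer(rate_bps=0.2 * 1e9, burst_bytes=15000)
if strict_high_policer.admit(200):
    pass   # only then is the 200-byte packet placed in the strict-high queue
```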

Instead of using a strict-high priority queue, an alternative way to rate limit the queue is to use an alternating priority scheme. This method allows the priority-high queue to be scheduled between every other queue. For example:

Queue 0, Queue 2, Queue 1, Queue 2, Queue 0, Queue 2, …

An alternating priority mode can, however, cause some unpredictable delay because the other queues could introduce delay depending on the traffic that they schedule. Also, with alternating priority, the rate of the queue is not controlled. Another way to handle fairness is to have queues behave according to their weight, without regard to priority if they exceed their weight. This solution prevents the previously illustrated scenario of runaway higher-priority queues causing packets in lower-priority queues to become stale. Consider the PB-DWRR configuration and queue status in Figure 7.23.

Schematic of four queues serviced by round-robin scheduling: Q0 and Q1, labeled Priority Low, with weights of 20% and 50%, and Q2 and Q3, labeled Priority High, with weights of 20% and 10%. Q0 has three packets of sizes 300, 500, and 300. Q1 has two packets of size 1000 each, Q2 has two packets of size 200 each, and Q3 has two packets of sizes 100 and 1000.

Figure 7.23 PB-DWRR scheduling with two priority-high queues

This scenario shows four queues with two different priority levels. The different priority levels mean that there are multiple levels of scheduling. Q2 and Q3 have a high priority level. If there are packets in these two queues and if they have enough credit, they are serviced before the two low-priority queues.

Let us evaluate the first scheduling turn. Because DWRR arbitrates among queues with the same priority level, Q2 is scheduled to receive service in the first cycle. The quantum for Q2 is 200. Thus, its credit state is 200 and it can clear one 200-byte packet from the queue. Q2’s deficit counter value is now 0, and as explained earlier, the deficit counter must be greater than zero for the queue to receive service in the same scheduling cycle. The next queue visited is the other priority-high queue, Q3. This queue schedules one packet but then enters a negative credit state because of its quantum. Q3’s credit state is 100. When 1000 bytes are removed from the queue, the value of the deficit counter becomes −900 (see Figure 7.24).

Schematic of four queues serviced by round-robin scheduling: Q0 and Q1, labeled Priority Low, with weights of 20% and 50%, and Q2 and Q3, labeled Priority High, with weights of 20% and 10%. Q0 has three packets of sizes 300, 500, and 300. Q1 has two packets of size 1000 each, Q2 has one packet of size 200, and Q3 has one packet of size 100.

Figure 7.24 PB-DWRR scheduling, turn 1, Q2 and Q3

At the next scheduling turn, Q2 is visited again because its priority level is high and because it contains packets. The quantum is 200 bytes, which is also the number of bytes waiting to be cleared from the queue, as illustrated in Figure 7.25.

Schematic of four queues serviced by round-robin scheduling: Q0 and Q1, labeled Priority Low, with weights of 20% and 50%, and Q2 and Q3, labeled Priority High, with weights of 20% and 10%. Q0 has three packets of sizes 300, 500, and 300. Q1 has two packets of size 1000 each and Q3 has one packet of size 100.

Figure 7.25 PB-DWRR scheduling, turn 2, Q2 and Q3

Now something interesting happens: Q3 is in negative credit state. This means it gets no scheduling cycles because the deficit counter is not greater than zero. The result is that turn 2 in the scheduling now jumps to the low-priority queues. Because Q1 has a higher weight than Q0, the scheduling cycle services Q1, which removes one 1000-byte packet from the queue, leaving its deficit counter at a value of 0. The next queue to be scheduled in turn 2 is Q0, and it removes 800 bytes before running into a negative credit state, as illustrated in Figure 7.26.

Schematic of four queues serviced by round-robin scheduling: Q0 and Q1, labeled Priority Low, with weights of 20% and 50%, and Q2 and Q3, labeled Priority High, with weights of 20% and 10%. Q0 has one packet of size 300, Q1 has one packet of size 1000, and Q3 has one packet of size 100.

Figure 7.26 PB-DWRR scheduling, turn 2, Q0 and Q1

Now we have a scenario in which one of the priority-high queues, Q3, and the low-priority Q0 are in a negative credit state. If we honor the state of the queues, in the next scheduling turn the queue to get scheduling service is Q1, because it is not in a negative credit state. Because Q2 has no packets in its queue, the two queues in a negative credit state can use Q2’s weight and thereby update their credit states based on the available weights. Because a queue can receive service only once its deficit counter is greater than zero, such queues need to be bypassed in some scheduling rounds so that their credits can build up. If no other queues with a positive credit state have packets, this refill of credits happens faster, because the punished queues can reuse the weights not being used by other queues.
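
Pulling the pieces together, the following is a compact sketch of the PB-DWRR idea: priority levels are visited from highest to lowest in every turn, and DWRR runs among the queues that share a level. It follows the chapter's description only in outline; real implementations add rate control of the priority queues, negative-credit handling, and leftover-bandwidth sharing:

```python
from collections import deque

class Queue:
    def __init__(self, priority: int, quantum: int):
        self.priority = priority   # higher value is visited first in every turn
        self.quantum = quantum     # per-turn byte allowance within the level
        self.deficit = 0
        self.packets = deque()     # packet sizes in bytes

def pb_dwrr_turn(queues):
    """Visit priority levels from highest to lowest; run DWRR among queues in a level."""
    sent = []
    for level in sorted({q.priority for q in queues}, reverse=True):
        for q in (x for x in queues if x.priority == level):
            if not q.packets:
                q.deficit = 0
                continue
            q.deficit += q.quantum
            while q.packets and q.packets[0] <= q.deficit:
                q.deficit -= q.packets[0]
                sent.append(q.packets.popleft())
    return sent

queues = [Queue(0, 1000), Queue(0, 500), Queue(1, 200)]   # two low-priority, one high-priority
queues[2].packets.extend([200, 200])
queues[0].packets.extend([300, 500])
print(pb_dwrr_turn(queues))   # [200, 300, 500]: the high-priority packet is served first
```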

What is the most efficient way to schedule queues with the same priority level? A good design is probably a combination of strict-high priority and DWRR scheduling between queues with the same level of priority. This is a good design in the core of the network for situations in which the strict-priority queue may need more bandwidth for a short time because of a network failure and the resultant reconvergence. Limiting real-time traffic should instead be done at the edge of the network, and there a combination of strict-high and DWRR scheduling is likely also a good design. But if there are multiple priority-high queues, some restrictions or control, for example DWRR, are needed to avoid possible resource clashes between the high-priority queues.

PB-DWRR has the following benefit:

  • PB-DWRR incorporates the lessons learned from most other queuing algorithms and most of the features they introduced, such as byte deficit scheduling and priority levels.

PB-DWRR has the following limitation:

  • It is not a standard, so routers and switches that implement it may not all act in the same fashion. PB-DWRR is highly dependent on the hardware it runs on, the available resources, and, of course, the actual vendor implementation.

7.11 Conclusions about the Best Queuing Discipline

As of this writing, the most popular queuing and scheduling mechanism is PB-DWRR. The reason behind its success and acceptance in the industry is that PB-DWRR is built on the lessons learned from its predecessors. This chapter has described PB-DWRR concepts. However, actual vendor implementations may vary slightly. As with any queuing and scheduling mechanism, there is always a dependency on resources and hardware. However, one piece of the queuing and scheduling puzzle is still missing: the WRED profiles. We discuss these in the next chapter.

Further Reading

  1. Clarke, P., Saka, F., Li, Y.-T. and Di Donato, A. (2004) Using QoS for High Throughput TCP Transport over Fat Long Pipes. London: University College London.
  2. Croll, A. and Packman, E. (2000) Managing Bandwidth: Deploy QOS in Enterprise Networks. Upper Saddle River, NJ: Prentice Hall.
  3. Goralski, W. (2010) JUNOS Internet Software Configuration Guide: “Class of Service Configuration Guide,” Release 10.1, Published 2010. www.juniper.net (accessed September 8, 2015).
  4. Heinanen, J., Baker, F., Weiss, W. and Wroclawski, J. (1999) RFC 2597, Assured Forwarding PHB Group, June 1999. https://tools.ietf.org/html/rfc2597 (accessed August 18, 2015).
  5. Jacobson, V., Nichols, K. and Poduri, K. (1999) RFC 2598, An Expedited Forwarding PHB, June 1999. https://tools.ietf.org/html/rfc2598 (accessed August 18, 2015).
  6. Nichols, K., Blake, S., Baker, F. and Black, D. (1998) RFC 2474, Definition of the Differentiated Services Field, December 1998. https://tools.ietf.org/html/rfc2474 (accessed August 18, 2015).
  7. Postel, J. (1981) RFC 793, Transmission Control Protocol—DARPA Internet Protocol Specification, September 1981. https://tools.ietf.org/rfc/rfc793.txt (accessed August 18, 2015).
  8. Semeria, C. (2000) Internet Backbone Routers and Evolving Internet Design, White Paper, Juniper Networks. www.juniper.net (accessed August 18, 2015).
  9. Semeria, C. (2001) Supporting Differentiated Service Classes: Queue Scheduling Disciplines, White Paper, Juniper Networks. www.juniper.net (accessed August 18, 2015).