CHAPTER 4

Routing Algorithms

In this chapter we study routing algorithms. Routing algorithms establish the path followed by each message or packet. The list of routing algorithms proposed in the literature is almost endless. We clearly cannot do justice to all of these algorithms developed to meet many distinct requirements. We will focus on a representative set of approaches, being biased toward those being used or proposed in modern and future multiprocessor interconnects. Thus, we hope to equip you with an understanding of the basic principles that can be used to study the spectrum of existing algorithms. Routing algorithms for wormhole switching are also valid for other switching techniques. Thus, unless explicitly stated, the routing algorithms presented in this chapter are valid for all the switching techniques. Specific proposals for some switching techniques will also be presented. Special emphasis is given to design methodologies because they provide a simple and structured way to design a wide variety of routing algorithms for different topologies.

Many properties of the interconnection network are a direct consequence of the routing algorithm used. Among these properties we can cite the following:

image Connectivity. Ability to route packets from any source node to any destination node. This property was introduced in Chapter 3.

image Adaptivity. Ability to route packets through alternative paths in the presence of contention or faulty components.

image Deadlock and livelock freedom. Ability to guarantee that packets will not block or wander across the network forever. This issue was discussed in depth in Chapter 3.

image Fault tolerance. Ability to route packets in the presence of faulty components. Although it seems that fault tolerance implies adaptivity, this is not necessarily true. Fault tolerance can be achieved without adaptivity by routing a packet in two or more phases, storing it in some intermediate nodes. Fault tolerance also requires some additional hardware mechanisms, as will be detailed in Chapter 6.

The next section presents a taxonomy of routing algorithms. The most interesting classes in the taxonomy are studied in the remaining sections of this chapter.

4.1 Taxonomy of Routing Algorithms

Figure 4.1 presents a taxonomy of routing algorithms that extends an earlier classification scheme [125]. Routing algorithms can be classified according to several criteria. Those criteria are indicated in the left column in italics. Each row contains the alternative approaches that can be followed for each criterion. Arrows indicate the relations between different approaches. An overview of the taxonomy is presented first, developing it in greater detail later. Routing algorithms can be first classified according to the number of destinations. Packets may have a single destination (unicast routing) or multiple destinations (multicast routing). Multicast routing will be studied in depth in Chapter 5 and is included here for completeness.

image

Figure 4.1 A taxonomy for routing protocols.

Routing algorithms can also be classified according to the place where routing decisions are taken. Basically, the path can be either established by a centralized controller (centralized routing) at the source node prior to packet injection (source routing) or determined in a distributed manner while the packet travels across the network (distributed routing). Hybrid schemes are also possible. We call these hybrid schemes multiphase routing. In multiphase routing, the source node computes some destination nodes. The path between them is established in a distributed manner. The packet may be delivered to all the computed destination nodes (multicast routing) or only to the last destination node (unicast routing). In this case, intermediate nodes are used to avoid congestion or faults.

Routing algorithms can be implemented in different ways. The most interesting ways proposed up to now consist of either looking at a routing table (table lookup) or executing a routing algorithm in software or hardware according to a finite-state machine. In both cases, the routing algorithm can be either deterministic or adaptive. Deterministic routing algorithms always supply the same path between a given source/destination pair. Adaptive routing algorithms use information about network traffic and/or channel status to avoid congested or faulty regions of the network. Routing algorithms designed to tolerate faults will be studied in depth in Chapter 6.

Adaptive routing algorithms can be classified according to their progressiveness as progressive or backtracking. Progressive routing algorithms move the header forward, reserving a new channel at each routing operation. Backtracking algorithms also allow the header to backtrack, releasing previously reserved channels. Backtracking algorithms are mainly used for fault-tolerant routing.

At a lower level, routing algorithms can be classified according to their minimality as profitable or misrouting. Profitable routing algorithms only supply channels that bring the packet closer to its destination. They are also referred to as minimal. Misrouting algorithms may also supply channels that send the packet away from its destination. They are also referred to as nonminimal. At the lowest level, routing algorithms can be classified according to the number of alternative paths as completely adaptive (also known as fully adaptive) or partially adaptive.

In this chapter we focus on unicast routing algorithms for multiprocessors and multicomputers. Centralized routing requires a central control unit. This is the kind of routing algorithm used in single-instruction multiple-data (SIMD) machines [162]. In source routing, the source node specifies the routing path on the basis of a deadlock-free routing algorithm (either using table lookup or not). The computed path is stored in the packet header, being used at intermediate nodes to reserve channels. The routing algorithm may use only the addresses of current and destination nodes to compute the path (deterministic routing) or may also use information collected from other nodes about traffic conditions in the network (adaptive routing). Note that collecting information from other nodes may produce a considerable overhead. Additionally, that information may be obsolete. Thus, adaptive source routing is only interesting if traffic conditions change very slowly. Source routing has been mainly used in computer networks with irregular topologies [338]. Myrinet [30], a high-performance LAN supporting irregular topologies, also uses source routing. The first few flits of the packet header contain the address of the switch ports on intermediate switches. See Section 7.2.8 for a description of Myrinet.

Source routing has also been proposed for multicomputer interconnection networks. As traffic conditions may change very quickly in these networks, adaptive source routing is not interesting. Since the header itself must be transmitted through the network, thereby consuming network bandwidth, it is important to minimize header length. One source routing method that achieves this goal is called street-sign routing. The header is analogous to a set of directions given to a driver in a city. Only the names of the streets that the driver must turn on, along with the direction of the turn, are needed. More precisely, packets arriving at an intermediate node have a default output channel in the same dimension and direction as the current channel. For each turn, the header must contain the node address at which the turn will take place and the direction of the turn. Furthermore, this information must be stored in the header according to the order in which nodes are reached. Upon receiving a header flit, the router compares the node address in the flit to the local node address. If they match, the packet either turns or has reached its destination, as specified in the header flit. Otherwise, the packet will be forwarded through the default output channel. Street-sign routing was proposed in iWarp [39]. See Exercise 4.1 for an example of street-sign routing.

For efficiency reasons most hardware routers use distributed routing. Pipelined switching techniques route the header of the packet as soon as it is received at an intermediate node. With distributed routing, the header is very compact. It only requires the destination address and a few implementation-dependent control bits. In distributed routing, each intermediate node has to make a routing decision based on the local knowledge of the network. By repeating this process at each intermediate node, the packet should be able to reach its destination. Note that distributed routing algorithms route packets at intermediate nodes without requiring a global knowledge of the network. This can be done because the designer knows the topology of the whole network. Distributed routing algorithms are mainly used in regular topologies so that the same routing algorithm can be used in all the nodes. Almost all commercial topologies can be seen as consisting of several orthogonal dimensions. In these topologies, it is easy to compute the distance between current and destination nodes as the sum of the offsets in all the dimensions. As a consequence, routing decisions are much simpler than in other topologies.

It is also possible to route a packet in two or more phases. In multiphase unicast routing the source node computes an intermediate node. The packet is routed toward this node using distributed routing. On reception of the packet, the intermediate node reinjects the packet into the network, using either another intermediate node or the final destination as the next destination for the packet. An example of a multiphase routing algorithm is the random routing algorithm [345]. In this algorithm, the source node computes a random intermediate destination. On reception, the intermediate node forwards the packet toward its final destination. Random routing was proposed to reduce contention by randomizing the paths followed by a set of packets transferred between every source/destination pair. However, random routing destroys all the communication locality existing in network traffic. Multiphase routing has also been proposed for fault-tolerant routing [333]. In this case, intermediate destinations are used to break dependencies between channels and avoid deadlocks in the presence of faults. This algorithm will be analyzed in Chapter 6.

Regardless of the place where the routing algorithm is computed, it should be able to deliver the packet to its destination node. In most parallel computers, the designer chooses the topology of the interconnection network. As indicated above, most machines use topologies that can be decomposed into orthogonal dimensions. This is the case for hypercubes, meshes, and tori. In these topologies, it is possible to use simple routing algorithms based on finite-state machines like e-cube (dimension-order routing) [334]. This routing algorithm routes packets by crossing dimensions in increasing order, nullifying the offset in one dimension before routing in the next one. A routing example is shown in Figure 4.2. Note that dimension-order routing can be executed at the source node, storing information about turns (changes of dimension) in the header. This is the street-sign routing algorithm described above. Dimension-order routing can also be executed in a distributed manner. At each intermediate node, the routing algorithm supplies an output channel crossing the lowest dimension for which the offset is not null.

image

Figure 4.2 Routing example for dimension-order routing on a 2-D mesh.

Some manufacturers supply building blocks for parallel computers, so that different topologies can be built by using the same chips [227]. Also, some parallel computers feature reconfigurable topologies, so that the user can change the topology even dynamically at run time [112]. Finally, some manufacturers allow the final users to build the topology that best fits their needs, allowing the use of irregular topologies [257]. In all these cases, especially for irregular topologies, it is very difficult to derive a routing algorithm based on a finite-state machine. An alternative implementation approach consists of using table lookup [338]. This is a traditional approach used in computer networks.

An obvious implementation of table-lookup routing is to place a routing table at each node, with the number of entries in the table equal to the number of nodes in the network. Once again, routing can be performed either at the source node or at each intermediate node. In the first case, given a destination node address, the corresponding entry in the table indicates the whole path to reach that node. In the second case, each table entry indicates which outgoing channel should be used to forward the packet toward its destination. However, such an implementation is only practical for very small systems because table size increases linearly with network size. One way to reduce the table size for distributed routing is to define a range of addresses to be associated with each output channel. In this case, each routing table requires only one entry per output channel. Each entry contains an interval of destination addresses, specified by indicating its bounds. This routing technique is called interval routing [199] and has been implemented in the Inmos T9000 transputer and its associated router C104 [227, 228]. An important issue in interval routing is how to assign appropriate labels to nodes so that a single interval per output channel is enough to route all the packets and the resulting routing algorithm is minimal and deadlock-free. See Exercise 4.2 for an example of interval routing.

While establishing a path between source and destination nodes, the routing algorithm may supply a single path (deterministic routing). When source routing is used, this path is computed at the source node without considering network traffic. When deterministic routing is implemented in a distributed way, the routing algorithm supplies a single routing choice at each intermediate node, based on current and destination node addresses. In this case, channel status is not considered while computing the output channel to be used. Deterministic routing algorithms for regular topologies are simple. When implemented in hardware using a finite-state machine, routing logic is compact and fast. Most commercially available multicomputers use distributed deterministic routing. Deterministic routing algorithms usually perform well under uniform traffic. However, performance is poor when traffic is not uniform, especially when some pairs of nonneighbor nodes exchange information very frequently.

Alternatively, adaptive routing algorithms consider network state while making a decision. As indicated above, it is not interesting to combine source routing and adaptive routing. Thus, in what follows, we only consider distributed routing. Although some authors considered nonlocal information [289], most proposals only use local information for efficiency reasons. As seen in Chapter 3, adaptive routing algorithms can be decomposed into two functions: routing and selection. The routing function supplies a set of output channels based on the current node or buffer and the destination node. A selection from this set is made by the selection function based on the status of output channels at the current node. This selection is performed in such a way that a free channel (if any) is supplied. As a consequence, adaptive routing algorithms are able to follow alternative paths instead of waiting on busy channels. Thus, these algorithms increase routing flexibility at the expense of a more complex and slower hardware. Several experimental parallel computers use adaptive routing [74]. Also, commercial machines with adaptive routing are being developed [257] or are even available in the market [313]. Example 3.8 describes a distributed adaptive routing algorithm.

Adaptive routing algorithms can be classified as progressive or backtracking. Progressive routing algorithms move the header forward, reserving a new channel at each routing operation. Backtracking algorithms allow the header to backtrack, releasing previously reserved channels. Backtracking algorithms systematically search the network, backtracking as needed, using history information to ensure that no path is searched more than once. Note that in most switching techniques, data immediately follow the packet header. In these switching techniques, backtracking is not possible without a very complex hardware support. However, limited backtracking (one channel) is possible with some hardware support [181]. On the other hand, pipelined circuit switching is very well suited for backtracking because data flits do not follow the header immediately, giving more freedom to the header to search the network. Backtracking algorithms are mainly used for fault-tolerant routing, as will be seen in Chapter 6.

At a lower level, routing algorithms can be classified as profitable or misrouting. Profitable routing algorithms only supply channels that bring the packet closer to its destination. Misrouting algorithms may also supply channels that send the packet away from its destination. Misrouting algorithms are based on an optimistic view of the network: taking an unprofitable channel is likely to bring the header to another set of profitable channels that will allow further progress to the destination. Although misrouting algorithms are more flexible, they usually consume more network resources. As a consequence, misrouting algorithms usually exhibit a lower performance when combined with pipelined switching techniques. Also, misrouting algorithms may suffer from livelock, as seen in Chapter 3. Misrouting algorithms are usually proposed for fault-tolerant routing because they are able to find alternative paths when all the minimal paths are faulty. These algorithms will also be studied in Chapter 6.

At the lowest level, routing algorithms can be completely adaptive (also known as fully adaptive) or partially adaptive. A fully adaptive algorithm can use all the physical paths in its class. For example, a profitable algorithm that is fully adaptive is able to choose among all the minimal paths available in the network. These algorithms are also called fully adaptive minimal routing algorithms. It should be noted that although all the physical paths are available, a given routing algorithm may restrict the use of virtual channels in order to avoid deadlock. A routing algorithm that maximizes the number of routing options while avoiding deadlock is referred to as maximally adaptive. An even higher flexibility in the use of virtual channels can be achieved by using deadlock recovery techniques. In this case, there is no restriction on the use of virtual channels, and the corresponding routing algorithm is referred to as true fully adaptive. A completely adaptive backtracking algorithm is also called exhaustive. Partially adaptive algorithms are only able to use a subset of the paths in their class.

Note that deterministic routing algorithms should be progressive and profitable. Backtracking makes no sense because the same path will be reserved again. Also, misrouting is not interesting because some bandwidth is wasted without any benefit.

This chapter is organized as follows. Section 4.2 studies some deterministic routing algorithms as well as a basic design methodology. Section 4.3 presents some partially adaptive routing algorithms and a design methodology. Section 4.4 analyzes fully adaptive routing algorithms and their evolution, also presenting design methodologies. Section 4.5 describes some routing algorithms that maximize adaptivity or minimize the routing resources required for fully adaptive routing. Section 4.6 presents some nonminimal routing algorithms. Section 4.7 describes some backtracking algorithms. As backtracking algorithms have interesting properties for fault-tolerant routing, these algorithms will also be analyzed in Chapter 6. Sections 4.8 and 4.9 study some routing algorithms for switch-based networks, focusing on multistage interconnection networks and irregular topologies, respectively. Finally, Section 4.10 presents several selection functions as well as some resource allocation policies. The chapter ends with some engineering issues and commented references.

4.2 Deterministic Routing Algorithms

Deterministic routing algorithms establish the path as a function of the destination address, always supplying the same path between every pair of nodes. Deterministic routing is distinguished from oblivious routing. Although both concepts are sometimes considered to be identical, in the latter the routing decision is independent of (i.e., oblivious to) the state of the network. However, the choice is not necessarily deterministic. For example, a routing table may include several options for an output channel based on the destination address. A specific option may be selected randomly, cyclically, or in some other manner that is independent of the state of the network. A deterministic routing algorithm will always provide the same output channel for the same destination. While deterministic algorithms are oblivious, the converse is not necessarily true.

Deterministic routing became very popular when wormhole switching was invented [77]. Wormhole switching requires very small buffers. Wormhole routers are compact and fast. However, pipelining does not work efficiently if one of the stages is much slower than the remaining stages. Thus, wormhole routers have the routing algorithm implemented in hardware. It is not surprising that designers chose the simplest routing algorithms in order to keep routing hardware as compact and fast as possible. Most commercial multicomputers (Intel Paragon [164], Cray T3D [174], nCUBE-2/3 [248]) and experimental multiprocessors (Stanford DASH [203], MIT J-Machine [256]) use deterministic routing.

In this section, we present the most popular deterministic routing algorithms as well as a design methodology. Obviously, the most popular routing algorithms are the simplest ones. Some topologies can be decomposed into several orthogonal dimensions. This is the case for hypercubes, meshes, and tori. In these topologies, it is easy to compute the distance between current and destination nodes as the sum of the offsets in all the dimensions. Progressive routing algorithms will reduce one of those offsets in each routing step. The simplest progressive routing algorithm consists of reducing an offset to zero before considering the offset in the next dimension. This routing algorithm is known as dimension-order routing. This routing algorithm routes packets by crossing dimensions in strictly increasing (or decreasing) order, reducing to zero the offset in one dimension before routing in the next one.

For n-dimensional meshes and hypercubes, dimension-order routing produces deadlock-free routing algorithms. These algorithms are very popular and receive several names, like XY routing (for 2-D mesh) or e-cube (for hypercubes) [334]. These algorithms are described in Figures 4.3 and 4.4, respectively, where FirstOne() is a function that returns the position of the first bit set to one, and Internal is the channel connecting to the local node. Although these algorithms assume that the packet header carries the absolute address of the destination node, the first few sentences in each algorithm compute the offset from the current node to the destination node. This offset is the value carried by the header when relative addressing is used. So, the remaining sentences in each algorithm describe the operations for routing using relative addressing. Note that relative addressing would also require updating the header at each intermediate node. Exercises 3.1 and 4.3 show that the channel dependency graphs for dimension-order routing in n-dimensional meshes and hypercubes are acyclic. However, the channel dependency graph for tori has cycles. This topology was analyzed by Dally and Seitz [77], who proposed a design methodology for deadlock-free deterministic routing algorithms.

image

Figure 4.3 The XY routing algorithm for 2-D meshes.

image

Figure 4.4 The dimension-order routing algorithm for hypercubes.

The methodology starts by considering some connected routing function and its channel dependency graph D. If it is not acyclic, routing is restricted by removing arcs from the channel dependency graph D to make it acyclic. If it is not possible to make D acyclic without disconnecting the routing function, arcs can be added to D by splitting physical channels into a set of virtual channels. As proposed in [77], this methodology establishes a total order among virtual channels, labeling them accordingly. Every time a cycle is broken by splitting a physical channel into two virtual channels, a new channel index is introduced to establish the ordering between virtual channels. Also, every time a cycle is broken by adding a virtual channel to each physical channel, the new set of virtual channels is assigned a different value for the corresponding index. The next example shows the application of this methodology to unidirectional rings and k-ary n-cubes.

EXAMPLE 4.1

Consider a unidirectional ring with four nodes denoted ni, i = {0, 1, 2, 3}, and a unidirectional channel connecting each pair of adjacent nodes. Let ci, i = {0, 1, 2, 3}, be the outgoing channel from node ni. In this case, it is easy to define a connected routing function. It can be stated as follows: If the current node ni is equal to the destination node nj, store the packet. Otherwise, use ci,∀ji. Figure 4.5(a) shows the network. Figure 4.5(b) shows that the channel dependency graph for this routing function contains a cycle. Thus, every physical channel ci is split into two virtual channels, c0i and c1i, as shown in Figure 4.5(c). Virtual channels are ordered according to their indices. The routing function is redefined in such a way that virtual channels are used in strictly increasing order. The new routing function can be stated as follows: If the current node ni is equal to the destination node nj, store the packet. Otherwise, use c0i if j < i, or c1i if j > i. Figure 4.5(d) shows the channel dependency graph for this routing function. As can be seen, the cycle has been removed because after using channel c03, node n0 is reached. Thus, all the destinations have a higher index than n0, and it is not possible to request c00. Note that channels c00 and c13 are not in the graph because they are never used.

image

Figure 4.5 Unidirectional rings and their channel dependency graphs.

It is possible to extend the routing function for unidirectional rings so that it can be used for unidirectional k-ary n-cubes. As above, each physical channel is split into two virtual channels. Additionally, a new index is added to each virtual channel. Channels are labeled as cdvi, where d, d = {0, …, n − 1}, is the dimension traversed by the channel, v, v = {0, 1}, indicates the virtual channel, and i, i = {0, …, k − 1}, indicates the position inside the corresponding ring. The routing function routes packets in increasing dimension order. Inside each dimension, the routing function for rings is used. It is easy to see that this routing function routes packets in strictly increasing order of channel indices. Figure 4.6 shows the dimension-order routing algorithm for unidirectional k-ary 2-cubes. Note that the third subindex of each channel is not indicated because it is only required to distinguish between channels from different routers. So, there is no ambiguity.

image

Figure 4.6 The dimension-order routing algorithm for unidirectional 2-D tori.

Although dimension-order routing is usually implemented in a distributed way using a finite-state machine, it can also be implemented using source routing and distributed table lookup. See Exercises 4.1 and 4.2 to see how dimension-order routing can be implemented using source routing (street-sign routing) and table-lookup routing (interval routing), respectively.

Finally, as will be seen in Chapter 7, dimension-order routing is very simple to implement in hardware. Additionally, switches can be decomposed into smaller and faster switches (one for each dimension), thus increasing the speed.

4.3 Partially Adaptive Algorithms

Several partially adaptive routing algorithms have been proposed. Partially adaptive routing algorithms represent a trade-off between flexibility and cost. They try to approach the flexibility of fully adaptive routing at the expense of a moderate increase in complexity with respect to deterministic routing. Most partially adaptive algorithms proposed up to now rely upon the absence of cyclic dependencies between channels to avoid deadlock. Some proposals aim at maximizing adaptivity without increasing the resources required to avoid deadlocks. Other proposals try to minimize the resources needed to achieve a given level of adaptivity.

4.3.1 Planar-Adaptive Routing

Planar-adaptive routing aims at minimizing the resources needed to achieve a given level of adaptivity. It has been proposed by Chien and Kim for n-dimensional meshes and hypercubes [58]. The idea in planar-adaptive routing is to provide adaptivity in only two dimensions at a time. Thus, a packet is routed adaptively in a series of 2-D planes. Routing dimensions change as the packet advances toward its destination.

Figure 4.7 shows how planar-adaptive routing works. A fully adaptive routing algorithm allows a packet to be routed in the m-dimensional subcube defined by the current and destination nodes, as shown in Figure 4.7(a) for three dimensions. Planar-adaptive routing restricts packets to be routed in plane A0, then moving to plane A1, and so on. This is depicted in Figure 4.7(b) and (c) for three and four dimensions, respectively. All the paths within each plane are allowed. The number of paths in a plane depends on the offsets in the corresponding dimensions.

image

Figure 4.7 Allowed paths in (a) fully adaptive routing in three-dimensional networks, (b) planar-adaptive routing in three-dimensional networks, and (c) planar-adaptive routing in four-dimensional networks.

Each plane Ai is formed by two dimensions, di and di + 1. There are a total of (n − 1) adaptive planes. The order of dimensions is arbitrary. However, it is important to note that planes Ai and Ai+1 share dimension di + 1. If the offset in dimension di is reduced to zero, then routing can be immediately shifted to plane Ai + 1. If the offset in dimension di + 1 is reduced to zero while routing in plane Ai, no adaptivity will be available while routing in plane Ai+1. In this case, plane Ai+1 can be skipped. Moreover, if in plane Ai, the offset in dimension di+1 is reduced to zero first, routing continues in dimension di exclusively, until the offset in dimension di is reduced to zero. Thus, in order to offer alternative routing choices for as long as possible, a higher priority is given to channels in dimension di while routing in plane Ai.

As defined, planar-adaptive routing requires three virtual channels per physical channel to avoid deadlocks in meshes and six virtual channels to avoid deadlocks in tori. In what follows, we analyze meshes in more detail. Channels in the first and last dimension need only one and two virtual channels, respectively. Let di, j be the set of virtual channels j crossing dimension i of the network. This set can be decomposed into two subsets, one in the positive direction and one in the negative direction. Let di, j + and di, j– denote the positive and negative direction channels, respectively.

Each plane Ai is defined as the combination of several sets of virtual channels:

image

In order to avoid deadlock, the set of virtual channels in Ai is divided into two classes: increasing and decreasing networks (Figure 4.8). The increasing network is formed by di,2+ and di+1,0 channels. The decreasing network is formed by di,2– and di+1,1 channels. Packets crossing dimension di in the positive direction are routed in the increasing network of plane Ai. Similarly, packets crossing dimension di in the negative direction are routed in the decreasing network of Ai. As there is no coupling between increasing and decreasing networks of Ai, and planes are crossed in sequence, it is easy to see that there are no cyclic dependencies between channels. Thus, planar-adaptive routing is deadlock-free.

image

Figure 4.8 (a) Increasing and (b) decreasing networks in plane Ai for planar-adaptive routing.

4.3.2 Turn Model

The turn model proposed by Glass and Ni [129] provides a systematic approach to the development of partially adaptive routing algorithms, both minimal and nonminimal, for a given network. As shown in Figure 3.2, deadlock occurs because the packet routes contain turns that form a cycle. As indicated in Chapter 3, deadlock cannot occur if there is not any cyclic dependency between channels. In many topologies, channels are grouped into dimensions. Moving from one dimension to another one produces a turn in the packet route. Changing direction without moving to another dimension can be considered as a 180-degree turn. Also, when physical channels are split into virtual channels, moving from one virtual channel to another one in the same dimension and direction can be considered as a 0-degree turn. Turns can be combined into cycles. The fundamental concept behind the turn model is to prohibit the smallest number of turns such that cycles are prevented. Thus, deadlock can be avoided by prohibiting just enough turns to break all the cycles. The following six steps can be used to develop maximally adaptive routing algorithms for n-dimensional meshes and k-ary n-cubes:

1. Classify channels according to the directions in which they route packets.

2. Identify the turns that occur between one direction and another.

3. Identify the simple cycles that these turns can form.

4. Prohibit one turn in each cycle.

5. In the case of k-ary n-cubes, incorporate as many turns as possible that involve wraparound channels, without reintroducing cycles.

6. Add 0-degree and 180-degree turns without reintroducing cycles. These turns are needed if there are multiple channels in the same direction and for nonminimal routing algorithms.

In order to illustrate the use of the turn model, we may consider the case of a 2-D mesh. There are eight possible turns and two possible abstract cycles, as shown in Figure 4.9(a). The deterministic XY routing algorithm prevents deadlock by prohibiting four of the turns, as shown in Figure 4.9(b). The remaining four turns cannot form a cycle, but neither do they allow any adaptiveness.

image

Figure 4.9 The turn model in 2-D mesh: (a) abstract cycles in 2-D mesh; (b) four turns (solid arrows) allowed in XY routing; (c) six turns (solid arrows) allowed in west-first routing.

However, prohibiting fewer than four turns can still prevent cycles. In fact, for a 2-D mesh, only two turns need to be prohibited. Figure 4.9(c) shows six turns allowed, suggesting the corresponding west-first routing algorithm: route a packet first west, if necessary, and then adaptively south, east, and north. The two turns prohibited in Figure 4.9(c) are the two turns to the west. Therefore, in order to travel west, a packet must begin in that direction. Figure 4.10 shows the minimal west-first routing algorithm for 2-D meshes, where Select() is the selection function defined in Section 3.1.2. This function returns a free channel (if any) from the set of channels passed as parameters. See Exercise 4.5 for a nonminimal version of this algorithm. Three example paths for the west-first algorithm are shown in Figure 4.11. The channels marked as unavailable are either faulty or being used by other packets. One of the paths shown is minimal, while the other two paths are nonminimal, resulting from routing around unavailable channels. Because cycles are avoided, west-first routing is deadlock-free. For minimal routing, the algorithm is fully adaptive if the destination is on the right-hand side (east) of the source; otherwise, it is deterministic. If nonminimal routing is allowed, the algorithm is adaptive in either case. However, it is not fully adaptive.

image

Figure 4.10 The minimal west-first routing algorithm for 2-D meshes.

image

Figure 4.11 Examples of west-first routing in an 8 × 8 2-D mesh.

There are other ways to select six turns so as to prohibit cycles. However, the selection of the two prohibited turns may not be arbitrary [129]. If turns are prohibited as in Figure 4.12, deadlock is still possible. Figure 4.12(a) shows that the three remaining left turns are equivalent to the prohibited right turn, and Figure 4.12(b) shows that the three remaining right turns are equivalent to the prohibited left turn. Figure 4.12(c) illustrates how cycles may still occur. Of the 16 different ways to prohibit two turns, 12 prevent deadlock and only 3 are unique if symmetry is taken into account. These three combinations correspond to the west-first, north-last, and negative-first routing algorithms. The north-last routing algorithm does not allow turns from north to east or from north to west. The negative-first routing algorithm does not allow turns from north to west or from east to south.

image

Figure 4.12 Six turns that complete the abstract cycles: (a) the three remaining left turns are equivalent to the prohibited right turn; (b) the three remaining right turns are equivalent to the prohibited left turn; (c) cycle formed by combining (a) and (b).

In addition to 2-D mesh networks, the turn model can be used to develop partially adaptive routing algorithms for n-dimensional meshes, for k-ary n-cubes, and for hypercubes [129]. By applying the turn model to the hypercube, an adaptive routing algorithm, namely, P-cube routing, can be developed. Let s = sn −1 sn −2s0 and d = dn −1 dn −2d0 be the source and destination nodes, respectively, in a binary n-cube. The set E consists of all the dimension numbers in which s and d differ. The size of E is the Hamming distance between s and d. Thus, iE if sidi. E is divided into two disjoint subsets, E0 and E1, where iE0 if Si = 0 and di = 1, and jE1 if sj = 1 and dj = 0. The fundamental concept of P-cube routing is to divide the routing selection into two phases. In the first phase, a packet is routed through the dimensions in E0 in any order. In the second phase, the packet is routed through the dimensions in E1 in any order. If E0 is empty, then the packet can be routed through any dimension in E1. Figure 4.13 shows the pseudocode for the P-cube routing algorithm. The for loop computes the sets E0 and E1. This can be done at each intermediate node as shown or at the source node. In the latter case, the selected channel should be removed from the corresponding set. The digit() function computes the value of the digit in the given position.

image

Figure 4.13 The minimal P-cube routing algorithm for hypercubes.

Note that cycles cannot exist only traversing dimensions in E0 since they represent channels from a node to a higher-numbered node. In a cycle, at least one channel must be from a node to a lower-numbered node. For similar reasons, packets cannot form cycles by only traversing dimensions in set E1. Finally, since packets only use dimensions in E1after traversing all of the dimensions in E0, deadlock freedom is preserved. In effect, the algorithm prohibits turns from dimensions in E1 to dimensions in E0. This is sufficient to prevent cycles. By partitioning the set of dimensions to be traversed in other ways that preserve the acyclic properties, you can derive other variants of this algorithm.

Let the sizes of E, E0, and E1 be k, k0, and k1, respectively; k = k0 + k1. There exist k! shortest paths between s and d. Using P-cube routing, a packet may be routed through any of (k0!) (k1!) of those shortest paths. A similar algorithm was proposed in [184]; however, the P-cube routing algorithm can be systematically generalized to handle nonminimal routing as well [129].

4.4 Fully Adaptive Algorithms

This section presents several methodologies for the design of fully adaptive routing algorithms as well as some specific routing algorithms. Where possible, the presentation follows a chronological order, showing the evolution of design methodologies. First, we present some algorithms developed for computer networks in Section 4.4.1, as well as a methodology to adapt those algorithms for wormhole switching in Section 4.4.2. Then we present some methodologies based on the concept of the virtual network in Section 4.4.3. Both the methodologies presented in Sections 4.4.2 and 4.4.3 are based on Dally and Seitz’s theorem [77] (Corollary 3.1), thus requiring the absence of cyclic dependencies between channels. The resulting routing algorithms require a large number of virtual channels. When the restriction on cyclic dependencies is relaxed, the number of resources needed to avoid deadlock is reduced considerably. In Section 4.4.4 we first present a routing algorithm that is not based on Theorem 3.1, showing the transition. It is followed by a design methodology based on Theorem 3.1. Although this design methodology presents a practical way to design efficient fully adaptive routing algorithms for a variety of topologies, some more adaptivity can be obtained by proposing specific routing algorithms for some topologies. This is done in Section 4.5.

4.4.1 Algorithms Based on Structured Buffer Pools

These routing algorithms were designed for SAF networks using central queues. Deadlocks are avoided by splitting buffers into several classes and restricting packets to move from one buffer to another in such a way that buffer class is never decremented. Gopal proposed several fully adaptive minimal routing algorithms based on buffer classes [133]. These algorithms are known as hop algorithms.

The simplest hop algorithm starts by injecting a packet into the buffer of class 0 at the current node. Every time a packet stored in a buffer of class i takes a hop to another node, it moves to a buffer of class i + 1. This routing algorithm is known as the positive-hop algorithm. Deadlocks are avoided by using a buffer of a higher class every time a packet requests a new buffer. By doing so, cyclic dependencies between resources are prevented. A packet that has completed i hops will use a buffer of class i. Since the routing algorithm only supplies minimal paths, the maximum number of hops taken by a packet is limited by the diameter of the network. If the network diameter is denoted by D, a minimum of D + 1 buffers per node are required to avoid deadlock. The main advantage of the positive-hop algorithm is that it is valid for any topology. However, the number of buffers required for fully adaptive deadlock-free routing is very high, and this number depends on network size.

The minimum number of buffers per node can be reduced by allowing packets to move between buffers of the same class. In this case, classes must be defined such that packets moving between buffers of the same class cannot form cycles. In the negative-hop routing algorithm, the network is partitioned into several subsets in such a way that no subset contains two adjacent nodes. If S is the number of subsets, then subsets are labeled 0, 1, …, S − 1, and nodes in subset i are labeled i. Hops from a node with a higher label to a node with a lower label are negative. Otherwise, hops are nonnegative. When a packet is injected, it is stored into the buffer of class 0 at the current node. Every time a packet stored in a buffer of class i takes a negative hop, it moves to a buffer of class i + 1. If a packet stored in a buffer of class i takes a nonnegative hop, then it requests a buffer of the same class. Thus, a packet that has completed i negative hops will use a buffer of class i.

There is not any cyclic dependency between buffers. Effectively, a cycle starting at node A must return to node A and contains at least another node B. If B has a lower label than A, some hop between A and B (possibly through intermediate nodes) is negative, and the buffer class is increased. If B has a higher label than A, some hop between B and A (possibly through intermediate nodes) is negative, and the buffer class is increased. As a consequence, packets cannot wait for buffers cyclically, thus avoiding deadlocks. If D is the network diameter and S is the number of subsets, then the maximum number of negative hops that can be taken by a packet is HN = [D(S − 1)/S]. The minimum number of buffers per node required to avoid deadlock is HN + 1. Figure 4.14 shows a partition scheme for k-ary 2-cubes with even k. Black and white circles correspond to nodes of subsets 0 and 1, respectively.

image

Figure 4.14 Partition scheme for k-ary 2-cubes with even k.

Although the negative-hop routing algorithm requires approximately half the buffers required by the positive-hop algorithm in the best case, this number is still high. It can be improved by partitioning the network into subsets and numbering the partitions, such that there are no cycles in any partition. This partitioning scheme does not require that adjacent nodes belong to different subsets. In this case, a negative hop is a hop that takes a packet from a node in a higher-numbered partition to a node in a lower-numbered partition. The resulting routing algorithm still requires a relatively large number of buffers, and the number of buffers depends on network size. In general, hop routing algorithms require many buffers. However, these algorithms are fully adaptive and can be used in any topology. As a consequence, these routing algorithms are suitable for SAF switching when buffers are allocated in main memory.

4.4.2 Algorithms Derived from SAF Algorithms

Boppana and Chalasani proposed a methodology for the design of fully adaptive minimal routing algorithms for networks using wormhole switching [36]. This methodology starts from a hop algorithm, replacing central buffers by virtual channels. The basic idea consists of splitting each physical channel into as many virtual channels as there were central buffers in the original hop algorithm, and assigning virtual channels in the same way that central buffers were assigned.

More precisely, if the SAF algorithm requires m classes of buffers, the corresponding algorithm for wormhole switching requires m virtual channels per physical channel. Let the virtual channels corresponding to physical channel ci be denoted as ci,1, ci,2, , ci,m. If a packet occupies a buffer bk of class k in the SAF algorithm and can use channels ci, cj, … to move to the next node, the corresponding wormhole algorithm will be able to use virtual channels ci,k,cj,k, …. This situation is depicted in Figure 4.15, where a single physical channel ci has been drawn.

image

Figure 4.15 Derivation of wormhole routing algorithms from SAF algorithms.

This scheme produces an unbalanced use of virtual channels because all the packets start using virtual channel 0 of some physical channel. However, very few packets take the maximum number of hops. This scheme can be improved by giving each packet a number of bonus cards equal to the maximum number of hops minus the number of hops it is going to take [36]. At each node, the packet has some flexibility in the selection of virtual channels. The range of virtual channels that can be selected for each physical channel is equal to the number of bonus cards available plus one. Thus, when no bonus cards are available, a single virtual channel per physical channel can be selected. If the packet uses virtual channel j after using virtual channel i, it consumes ji − 1 bonus cards.

Using this methodology, the hop algorithms presented in Section 4.4.1 can be redefined for wormhole switching. The routing algorithms resulting from the application of this methodology have the same advantages and disadvantages as the original hop algorithms for SAF networks: they provide fully adaptive minimal routing for any topology at the expense of a high number of virtual channels. Additionally, the number of virtual channels depends on network size, thus limiting scalability.

4.4.3 Virtual Networks

A useful concept to design routing algorithms consists of splitting the network into several virtual networks. A virtual network is a subset of channels that are used to route packets toward a particular set of destinations. The channel sets corresponding to different virtual networks are disjoint. Depending on the destination, each packet is injected into a particular virtual network, where it is routed until it arrives at its destination. In some proposals, packets traveling in a given virtual network have some freedom to move to another virtual network. Virtual networks can be implemented by using disjoint sets of virtual channels for each virtual network and mapping those channels over the same set of physical channels. Of course it is also possible to implement virtual networks by using separate sets of physical channels.

The first design methodologies for fully adaptive routing algorithms were based on the concept of virtual networks [167, 213]. This concept considerably eases the task of defining deadlock-free routing functions. Effectively, deadlocks are only possible if there exist cyclic dependencies between channels. Cycles are formed by sets of turns, as shown in Section 4.3.2. By restricting the set of destinations for each virtual network, it is possible to restrict the set of directions followed by packets. Thus, each virtual network can be defined in such a way that the corresponding routing function has no cyclic dependencies between channels. However, this routing function is not connected because it is only able to deliver packets to some destinations. By providing enough virtual networks so that all the destinations can be reached, the resulting routing function is connected and deadlock-free. Packet transfers between virtual networks are not allowed or restricted in such a way that deadlocks are avoided.

In this section, we present some fully adaptive routing algorithms based on virtual networks. Jesshope, Miller, and Yantchev proposed a simple way to avoid deadlock in n-dimensional meshes. It consists of splitting the network into several virtual networks in such a way that packets injected into a given virtual network can only move in one direction for each dimension [167]. Figure 4.16 shows the four virtual networks corresponding to a 2-D mesh. Packets injected into the X + Y + virtual network can only move along the positive direction of dimensions X and Y. Packets injected into the X + Y – virtual network can only move along the positive direction of dimension X and the negative direction of dimension Y, and so on. Packets are injected into a single virtual network, depending on their destination. Once a packet is being routed in a given virtual network, all the channels belonging to minimal paths can be used for routing. However, the packet cannot be transferred to another virtual network. It is obvious that there are no cyclic dependencies between channels, thus avoiding deadlock.

image

Figure 4.16 Virtual networks for a 2-D mesh: (a) XY + virtual network, (b) X + Y + virtual network, (c) XY – virtual network, and (d) X + Y – virtual network.

This strategy can be easily extended for meshes with more than two dimensions. For example, a 3-D mesh requires eight virtual networks, corresponding to the following directions: XYZ –, XYZ +, XY + Z –, XY + Z +, X + YZ –, X + YZ +, X + Y + Z —, and X + Y + Z +. In general, an n-dimensional mesh requires 2n virtual networks, each one consisting of n unidirectional virtual or physical channels per node (except some nodes in the border of the mesh). Thus, from a practical point of view, this routing algorithm can only be used for a very small number of dimensions (two or three at most). Also, a similar strategy can be applied to torus (k-ary n-cube) networks, as shown in [136] for a 2-D torus.

Linder and Harden showed that the number of virtual networks required for n-dimensional meshes can be reduced to half [213]. Instead of having a unidirectional channel in each dimension, each virtual network has channels in both directions in the 0th dimension and only one direction in the remaining n − 1 dimensions. This is shown in Figure 4.17 for a 2-D mesh. With this reduction, an n-dimensional mesh requires 2n–l virtual networks. However, channels in the 0th dimension are bidirectional instead of unidirectional. Therefore, the number of virtual channels per node is equal to n2n −1, in addition to an extra virtual channel in dimension 0, for a total of (n + 1)2n −1 per node. The number of virtual channels across each physical channel is 2n −1, except for dimension 0, which has 2n virtual channels due to the bidirectional nature. Again, this strategy can only be used for a very small number of dimensions. Note that the virtual networks for a 2-D mesh (see Figure 4.17) are equivalent to the virtual networks for a plane in planar-adaptive routing (see Figure 4.8). Figures 4.18 and 4.19 show the routing algorithm for 2-D meshes.

image

Figure 4.17 Reducing the number of virtual networks for a 2-D mesh: (a) virtual network 0 and (b) virtual network 1.

image

Figure 4.18 The Linder-Harden and planar-adaptive routing algorithms for 2-D meshes (virtual network 0). (PAR = planar-adaptive routing.)

image

Figure 4.19 The Linder-Harden and planar-adaptive routing algorithms for 2-D meshes (virtual network 1). (PAR = planar-adaptive routing.)

The reduction in the number of virtual networks does not introduce cyclic dependencies between channels, as long as packets are routed following only minimal paths. Effectively, as shown in Section 4.3.2, cycles are formed by sets of turns. The restriction to minimal paths eliminates all the 180-degree turns. At least two dimensions with channels in both directions are required to form a cycle. Thus, no virtual network has cyclic dependencies between channels.

Linder and Harden also applied the concept of a virtual network to k-ary n-cube networks [213]. The basic idea is the same as for meshes: each virtual network has channels in both directions in the 0th dimension and only one direction in the remaining n − 1 dimensions. However, the existence of wraparound channels makes it more difficult to avoid cyclic dependencies. Thus, each virtual network is split into several levels, each level having its own set of virtual channels. Every time a packet crosses a wraparound channel, it moves to the next level. If only minimal paths are allowed, the wraparound channels in each dimension can only be crossed once by each packet. In the worst case, a packet will need to cross wraparound channels in all the dimensions. Thus, one level per dimension is required, in addition to the initial level, for a total of n + 1 levels. Figure 4.20 shows the virtual networks and their levels for a 2-D torus. For reference purposes, one of the levels is enclosed in a dashed box. This fully adaptive routing algorithm for k-ary n-cube networks requires 2n −1 virtual networks and n + 1 levels per virtual network, resulting in (n + 1)2n −1 virtual channels per physical channel across each dimension except dimension 0, which has (n + 1)2n virtual channels. Thus, fully adaptive routing requires many resources if cyclic dependencies between channels are to be avoided. As we will see in Section 4.4.4, the number of resources required for fully adaptive routing can be drastically reduced by relying on Theorem 3.1 for deadlock avoidance.

image

Figure 4.20 Virtual networks for a 2-D torus: (a) virtual network 0 and (b) virtual network 1.

Some recent proposals allow cyclic dependencies between channels, relying on some version of Theorem 3.1 to guarantee deadlock freedom [95]. In general, virtual networks are defined in such a way that the corresponding routing functions are deadlock-free. Packet transfers between virtual networks are restricted in such a way that deadlocks are avoided. By allowing cyclic dependencies between channels and cyclic transfers between virtual networks, guaranteeing deadlock freedom is much more difficult [95, 217]. However, virtual networks still have proven to be useful to define deadlock-free routing algorithms. See Exercise 3.3 for an example.

4.4.4 Deterministic and Adaptive Subnetworks

In this section we first present a routing algorithm that is not based on Theorem 3.1, followed by a design methodology based on that theorem.

Dally and Aoki proposed two adaptive routing algorithms based on the concept of dimension reversal [73]. The most interesting one is the dynamic algorithm. This routing algorithm allows the existence of cyclic dependencies between channels, as long as packets do not wait for channels in a cyclic way. Before describing the dynamic algorithm, let us define the concept of dimension reversal. The dimension reversal (DR) number of a packet is the count of the number of times a packet has been routed from a channel in one dimension, p, to a channel in a lower dimension, q < p.

The dynamic algorithm divides the virtual channels of each physical channel into two nonempty classes: adaptive and deterministic. Packets injected into the network are first routed using adaptive channels. While in these channels, packets may be routed in any direction without a maximum limit on the number of dimension reversals a packet may make. Whenever a packet acquires a channel, it labels the channel with its current DR number. To avoid deadlock, a packet with a DR of p cannot wait on a channel labeled with a DR of q if pq. A packet that reaches a node where all output channels are occupied by packets with equal or lower DRs must switch to the deterministic class of virtual channels. Once on the deterministic channels, the packet must be routed in dimension order using only the deterministic channels and cannot reenter the adaptive channels.

The dynamic algorithm represents a first step toward relaxing the restrictions for deadlock avoidance. Instead of requiring the absence of cyclic dependencies between channels as the algorithms presented in Section 4.4.3, it allows those cyclic dependencies as long as packets do not wait for channels in a cyclic way. When a packet may produce a cyclic waiting, it is transferred to the deterministic class of virtual channels, where it is routed in dimension order. The dynamic algorithm provides the maximum routing flexibility when packets are routed using adaptive channels. However, that flexibility is lost when packets are transferred to the deterministic channels.

More routing flexibility can be obtained by using Theorem 3.1. Fully adaptive routing algorithms described in Sections 4.4.1, 4.4.2, and 4.4.3 do not allow cyclic dependencies between resources to avoid deadlocks. However, those algorithms require a large set of buffer resources. Using Theorem 3.1, it is possible to avoid deadlocks even when cyclic dependencies between resources are allowed. The resulting routing algorithms require a smaller set of buffers or channels.

Defining routing algorithms and checking if they are deadlock-free is a tedious task. In order to simplify this task, Duato proposed some design methodologies [86, 91, 98]. In general, design methodologies do not supply optimal routing algorithms. However, they usually supply near-optimal routing algorithms with much less effort. In this section, we present a general methodology for the design of deadlock-free fully adaptive routing algorithms that combines the methodologies previously proposed by Duato. For VCT and SAF switching, it automatically supplies deadlock-free routing algorithms. For wormhole switching, a verification step is required.

The design methodology presented in this section is based on the use of edge buffers. For SAF switching, a similar methodology can be defined for networks using central queues. This methodology describes a way to add channels to an existing network, also deriving the new routing function from the old one. Channels are added following a regular pattern, using the same number of virtual channels for all the physical channels. This is important from the implementation point of view because bandwidth sharing and propagation delay remain identical for all the output channels.

METHODOLOGY 4.1

This methodology supplies fully adaptive minimal and nonminimal routing algorithms, starting from a deterministic or partially adaptive routing algorithm. When restricted to minimal paths, the resulting routing algorithms have been referred to as Duato’s protocol (DP) [125]. The steps are the following:

1. Given an interconnection network I1, select one of the existing routing functions for it. Let R1 be this routing function. It must be deadlock-free and connected. It can be deterministic or adaptive. For wormhole switching, it is recommended that R1 is minimal. Let C1 be the set of channels at this point.

2. Split each physical channel into a set of additional virtual channels. Let C be the set of all the (virtual) channels in the network. Let Cxy be the set of output channels from node x belonging to a path (minimal or not) from node x to node y. Define the new routing function R as follows:

    

image (4.1)

    That is, the new routing function can use any of the new channels or, alternatively, the channels supplied by R1. The selection function can be defined in any way. However, it is recommended to give a higher priority to the new channels belonging to minimal paths. For wormhole switching, it is recommended that R is restricted to use only minimal paths.

3. For wormhole switching, verify that the extended channel dependency graph for R1 is acyclic. If it is, the routing algorithm is valid. Otherwise, it must be discarded, returning to step 1.

Step 1 establishes the starting point. We can use either a deterministic or adaptive routing function as the basic one. All the algorithms proposed in previous sections are candidates for selection. However, algorithms proposed in Sections 4.2 and 4.3.2 should be preferred because they require a small amount of resources.

Step 2 indicates how to add more (virtual) channels to the network and how to define a new fully adaptive routing function from the basic one. As defined, this methodology supplies nonminimal routing functions. It is possible to define minimal routing functions by restricting Cxy to contain only the output channels belonging to minimal paths from x to y. This methodology can also be applied by adding physical channels instead of virtual ones.

For wormhole switching, step 3 verifies whether the new routing function is deadlock-free or not. If the verification fails, the above-proposed methodology may lead to an endless cycle. Thus, it does not supply a fully automatic way to design fully adaptive routing algorithms. Many minimal routing functions pass the verification step. However, very few nonminimal routing functions pass it. This is the reason why we recommend the use of minimal routing functions for wormhole switching.

For VCT and SAF switching, step 3 is not required. The methodology supplies deadlock-free routing functions. Effectively, according to the definition for R given in expression (4.1), R1(x, y) = R(x, y) C1, x, yN. Thus, there exists a subset of channels C1C that defines a routing subfunction R1 that is connected and deadlock-free. Taking into account Theorem 3.2, it is easy to see that R is deadlock-free. This is indicated in the next lemma.

LEMMA 4.1

For VCT and SAF switching, the routing functions supplied by Methodology 4.1 are deadlock-free.

Methodology 4.1 is not defined for routing functions whose domain includes the current channel or queue containing the packet header. In fact, this methodology cannot be applied to such routing functions because it extends the range of the routing function but not its domain. If the domain is not extended, it is not possible to consider the newly added channels, and the packets stored in them cannot be routed.

Note that if the domain of R were extended by considering the newly added channels, the initial routing function R1 would be modified. For example, consider the positive-hop algorithm described in Section 4.4.1 and adapted to wormhole switching in Section 4.4.2. Routing decisions are taken based on the class of the channel occupied by the packet. Knowledge of the class is necessary to ensure deadlock freedom. If the packet uses a channel supplied by R1, and subsequently a newly added channel, it is not possible for the receiving router to know the class of the channel previously occupied by the packet.

So, this methodology cannot be directly applied to routing functions that consider the current channel or queue in their domain. However, a closer examination will reveal that in practice this is not as restrictive as it may initially appear because most of those routing functions can be redefined so that they do not require the current channel or queue. For example, the class of the channel occupied by a packet in the positive-hop algorithm can be made available by providing a field containing the class in the packet header. If this field is added, the input channel is no longer required in the domain of the routing function.

Examples 4.2 and 4.3 show the application of Methodology 4.1 to a binary n-cube using wormhole switching and a 2-D mesh using SAF switching, respectively. Exercise 4.4 shows the application of this methodology to k-ary n-cubes as well as some optimizations.

EXAMPLE 4.2

Consider a binary n-cube using wormhole switching. For step 1 we can select the e-cube routing algorithm. It forwards packets crossing the channels in order of decreasing dimensions. This routing function is connected and deadlock-free. For step 2, consider that each physical channel ci has been split into k virtual channels, namely, ai, 1, ai, 2, …, ai, k −1, bi. Let C1 be the set of b channels. The algorithm obtained applying step 2 can be stated as follows: Route over any useful dimension using any of the a channels. If all of them are busy, route over the highest useful dimension using the corresponding b channel. A useful dimension is one that forwards a packet nearer to its destination.

Figure 4.21 shows the extended channel dependency graph for R1 on a 3-cube. Black circles represent the unidirectional channels belonging to C1 and are labeled as bij, where i and j are the source and destination nodes, respectively. As a reference, channels are also represented by thin lines, horizontal and vertical ones corresponding to dimensions 0 and 1, respectively. Also, the nodes of the 3-cube have been labeled as nk, where k is the node number. Arcs represent channel dependencies, dashed arcs corresponding to indirect dependencies. It can be seen that the graph is acyclic. Then, R is deadlock-free.

image

Figure 4.21 Extended channel dependency graph for R1.

EXAMPLE 4.3

Consider a 2-D mesh using SAF switching. For step 1 we select dimension-order routing (XY routing). Let R1 denote this routing function. R1 is connected and deadlock-free. Split each physical channel ci into two virtual channels, namely, ai and bi. Let C1 be the set of bi channels. Thus, CC1 is the set of ai channels. According to expression (4.1), the routing function R supplies all the outgoing a channels from the current node, also supplying one b channel according to dimension-order routing (XY routing). Taking into account Lemma 4.1, R is deadlock-free. When restricted to minimal paths, R is also deadlock-free for wormhole switching, as shown in Example 3.8. The resulting routing algorithm is shown in Figure 4.22. In this figure, Xa+ and Xb+ denote the a and b channels, respectively, in the positive direction of the X dimension. A similar notation is used for the other direction and dimension.

image

Figure 4.22 Duato’s fully adaptive minimal routing algorithm for 2-D meshes (Duato’s protocol).

4.5 Maximally Adaptive Routing Algorithms

In this section, we present some routing algorithms based on deadlock avoidance that either maximize adaptivity for a given set of resources (channels or buffers) or minimize the use of resources for fully adaptive routing. We present below some routing algorithms based on deadlock recovery that accomplish both.

Some authors have proposed metrics to measure adaptivity [129]. However, such metrics are not related to performance. In fact, although maximally adaptive routing algorithms are optimal or near optimal in the use of resources for deadlock-avoidance-based routing, they do not necessarily achieve the highest performance. The reason is that the use of resources may be unbalanced. For instance, different physical channels may have a different number of virtual channels, or some virtual channels may be more heavily used than other ones. This may produce an uneven traffic distribution, as shown in [344]. This is not the case for true fully adaptive routing based on deadlock recovery. This algorithm maximizes adaptivity for a given set of resources while balancing the use of resources. As a consequence, this algorithm usually achieves the highest performance.

4.5.1 Algorithms with Maximum Adaptivity

The fully adaptive routing algorithm proposed by Linder and Harden requires two and one virtual channels per physical channel for the first and second dimensions, respectively, when applied to 2-D meshes (see Section 4.4.3 and Figure 4.17). If dimensions are exchanged, the resulting routing algorithm requires one and two virtual channels per physical channel for the X and Y dimensions, respectively. This algorithm is called double-y. The double-y routing algorithm uses one set of Y channels, namely, Y1, for packets traveling X–, and the second set of Y channels, namely, Y2, for packets traveling X+.

Based on the turn model, Glass and Ni analyzed the double-y routing algorithm, eliminating the unnecessary restrictions [130]. The resulting algorithm is called maximally adaptive double-y (mad-y). It improves adaptivity with respect to the double-y algorithm. Basically, mad-y allows packets using Y1 channels to turn to the X+ direction and packets using X– channels to turn and use Y2 channels. Figures 4.23 and 4.24 show the turns allowed by the double-y and mad-y algorithms, respectively.

image

Figure 4.23 Turns allowed (solid lines) by the double-y algorithm.

image

Figure 4.24 Turns allowed (solid lines) by the mad-y algorithm.

The mad-y algorithm has the maximum adaptivity that can be obtained without introducing cyclic dependencies between channels. However, as shown in Theorem 3.1, cycles do not necessarily produce deadlock. Thus, the mad-y algorithm can be improved. It was done by Schwiebert and Jayasimha, who proposed the opt-y algorithm [306, 308]. This algorithm is deadlock-free and optimal with respect to the number of routing restrictions on the virtual channels for deadlock-avoidance-based routing. Basically, the opt-y algorithm allows all the turns between X and Y2 channels as well as turns between X+ and Y1 channels. Turns from Y1 to X– channels are prohibited. Turns from X– to Y1 channels as well as 0-degree turns between Y1 and Y2 channels are restricted. These turns are only allowed when the packet has completed its movement along X– channels (the X-offset is zero or positive). Figure 4.25 shows the turns allowed by the opt-y algorithm.

image

Figure 4.25 Turns allowed (solid lines) by the opt-y algorithm. Dotted lines are prohibited turns. Dashed lines are restricted turns.

Defining a routing algorithm by describing the allowed and prohibited turns makes it difficult to understand how routing decisions are taken at a given node. The opt-y algorithm is described in Figure 4.26 using pseudocode.

image

Figure 4.26 The opt-y fully adaptive routing algorithm for 2-D meshes.

The opt-y algorithm can be generalized to n-dimensional meshes by using the following steps [308]:

image Assign a channel to both directions of each dimension.

image Number the dimensions in some order, and add a second virtual channel to both directions of all dimensions except the first dimension.

image Allow packets to route along the second virtual channel at any time.

image For each dimension except the last, select one of the two directions as the chosen direction of that dimension. Prohibit a packet from routing on the first virtual channel of any direction until it has completed routing in the chosen direction of all lower dimensions.

image Allow a packet to make a 0-degree turn between the two virtual channels of a direction only after the packet has completed routing in the chosen direction of all lower dimensions.

Basically, the generalized opt-y algorithm allows fully adaptive minimal routing in one set of virtual channels. If packets have completed their movement along the chosen direction in all the dimensions, then fully adaptive routing is also allowed on the second set of virtual channels. Otherwise the second set of virtual channels only allows partially adaptive routing across the dimensions for which packets have completed their movement along the chosen direction in all the lower dimensions.

4.5.2 Algorithms with Minimum Buffer Requirements

Cypher and Gravano [65] proposed two fully adaptive routing algorithms for torus networks using SAF switching. These algorithms have been proven to be optimal with respect to buffer space [63] for deadlock-avoidance-based routing. Thus, there is no deadlock-avoidance-based fully adaptive routing algorithm for tori that requires less buffer space. Obviously, these routing algorithms have cyclic dependencies between queues or channels. The first algorithm (Algorithm 1) only requires three central buffers or queues to avoid deadlock, regardless of the number of dimensions of the torus. The second algorithm (Algorithm 2) uses edge buffers, requiring only two buffers per input channel to avoid deadlock. These routing algorithms are also valid for VCT switching but not for wormhole switching.

The routing algorithms are based on four node orderings. The first ordering, which is called the right-increasing ordering, is simply a standard row-major ordering of the nodes. The second ordering, which is called the left-increasing ordering, is the reverse of the right-increasing ordering. The third ordering, which is called the inside-increasing ordering, assigns the smallest values to nodes near the wraparound edges of the torus and the largest values to nodes near the center of the torus. The fourth ordering, which is called the outside-increasing ordering, is the reverse of the inside-increasing ordering. Tables 4.1 through 4.4 show the node orderings for an 8 × 8 torus. A transfer of a packet from a node a to an adjacent node b will be said to occur to the right (similarly, left, inside, or outside) if and only if node a is smaller than node b when they are numbered in right-increasing (similarly, left-increasing, inside-increasing, or outside-increasing) ordering.

Table 4.1

The right-increasing ordering for an 8 × 8 torus.

image

Table 4.2

The left-increasing ordering for an 8 × 8 torus.

image

Table 4.3

The inside-increasing ordering for an 8 × 8 torus.

image

Table 4.4

The outside-increasing ordering for an 8 × 8 torus.

image

Algorithm 1: Three queues per node are required, denoted A, B, and C. Additionally, each node has an injection queue and a delivery queue. Each packet moves from its injection queue to the A queue in the source node, and it remains using A queues as long as it is possible for it to move to the right along at least one dimension following a minimal path. When a packet cannot move to the right following a minimal path, it moves to the B queue in its current node, and it remains using B queues as long as it is possible for it to move to the left along at least one dimension following a minimal path. When a packet cannot move to the left following a minimal path, it moves to the C queue in its current node, and it remains moving to the inside using C queues until it arrives at its destination node. It then moves to the delivery queue in its destination node.

Note that a packet in an A queue may move to an A queue in any adjacent node. It may not actually move to the right, but this option must exist on at least one of the minimal paths to the destination. Similarly, a packet in a B queue may move to a B queue in any adjacent node. Also, note that a packet entering a C queue cannot move to an A or B queue. However, the routing algorithm is fully adaptive because a packet in a C queue only needs to move to the inside [65].

EXAMPLE 4.4

Consider a packet p that is routed from node (5, 7) to node (3, 2) in an 8 × 8 torus using Algorithm 1 (Figure 4.27). The packet p will first be stored in the injection queue in node (5, 7). Then, it moves to the A queue in node (5, 7). At this point, p cannot move to the right. Thus, it moves to the B queue in node (5, 7). At this point, p is allowed to move to the left (moving to the B queue in node (5, 0) or in node (4, 7)). Thus, all the B queues in neighboring nodes along minimal paths can be used. Assume that p moves to the B queue in node (4, 7), and then p moves to the B queue in node (4, 0). At this point, p is still allowed to move to the left (moving to the B queue in node (3, 0)). Assume that p moves to the B queue in node (4, 1) and then to the B queue in node (3, 1). At this point, p can no longer move to the left. Thus, it moves to the C queue in node (3, 1). Then, p moves to the C queue in node (3, 2), reaching its destination node, and moving to the delivery queue in node (3, 2). Note that p moved to the inside when moving from node (3, 1) to node (3, 2).

image

Figure 4.27 Routing example for Algorithms 1 and 2.

Algorithm 2: Two edge queues per input channel are required, denoted A and C if the channel moves packets to the outside, and denoted B and D if the channel moves packets to the inside. Additionally, each node has an injection queue and a delivery queue. Assuming that a packet has not arrived at its destination, a packet in an injection, A, or B queue moves to an A or B queue of an outgoing channel along a minimal path if it is possible for it to move to the inside along at least one dimension. Otherwise, it moves to the C queue of an outgoing channel along a minimal path. A packet in a C queue moves to a B or C queue of an outgoing channel along a minimal path if it is possible for it to move to the outside along at least one dimension. Otherwise, it moves to the D queue of an outgoing channel along a minimal path. A packet in a D queue can only move to the D queue of an outgoing channel along a minimal path. In any of the previous cases, when a packet arrives at its destination node, it moves to the delivery queue in that node.

Note that Algorithm 2 allows packets to move from C queues back to B queues and from B queues back to A queues. However, when a packet enters a D queue, it must remain using D queues until delivered. Once again, a packet entering a D queue can only move to the inside. However, the routing algorithm is fully adaptive because a packet in a D queue only needs to move to the inside [65].

EXAMPLE 4.5

Consider a packet p that is routed from node (5, 7) to node (3, 2) in an 8 × 8 torus using Algorithm 2. Assume that p follows the same path as in Example 4.4 (see Figure 4.27). The packet p will first be stored in the injection queue in node (5, 7). At this point, p is allowed to move to the inside (moving to node (4, 7)). Thus, p can select between the A queue in node (5, 0) and the B queue in node (4, 7). Assume that p moves to the B queue in node (4, 7). At this point, p can no longer move to the inside. Thus, it can move to the C queues in neighboring nodes along minimal paths. Assume that p moves to the C queue in node (4, 0). At this point, p is allowed to move to the outside (moving to the C queue in node (3, 0)). Thus, p can select between the B queue in node (4, 1) and the C queue in node (3, 0). Assume that p moves to the B queue in node (4, 1). At this point, p is allowed to move to the inside (moving to the B queue in node (4, 2)). Thus, p can select between the A queue in node (3, 1) and the B queue in node (4, 2).

Assume that p moves to the A queue in node (3, 1). As p is still allowed to move to the inside, it moves to the B queue in node (3, 2), reaching its destination node, and moving to the delivery queue in node (3, 2).

4.5.3 True Fully Adaptive Routing Algorithms

All the routing algorithms described in previous sections use avoidance techniques to handle deadlocks, therefore restricting routing. Deadlock recovery techniques do not restrict routing to avoid deadlock. Hence, routing strategies based on deadlock recovery allow maximum routing adaptivity (even beyond that proposed in [306, 308]) as well as minimum resource requirements. In particular, progressive deadlock recovery techniques, like Disha (see Section 3.6), decouple deadlock-handling resources from normal routing resources by dedicating minimum hardware to efficient deadlock recovery in order to make the common case (i.e., no deadlocks) fast. As proposed, sequential recovery from deadlocks requires only one central flit-sized buffer applicable to arbitrary network topologies [8, 9], and concurrent recovery requires at most two central buffers for any topology on which a Hamiltonian path or a spanning tree can be defined [10].

When routing is not restricted, no virtual channels are dedicated to avoid deadlocks. Instead, virtual channels are used for the sole purpose of improving channel utilization and adaptivity. Hence, true fully adaptive routing is permitted on all virtual channels within each physical channel, regardless of network topology. True fully adaptive routing can be minimal or nonminimal, depending on whether routing is restricted to minimal paths or not. Note that fully adaptive routing used in the context of avoidance-based algorithms connotes full adaptivity across all physical channels but only partial adaptivity across virtual channels within a given physical channel. On the other hand, true fully adaptive routing used in the context of recovery-based algorithms connotes full adaptivity across all physical channel dimensions as well as across all virtual channels within a given physical channel. Routing restrictions on virtual channels are therefore completely relaxed so that no ordering among these resources is enforced.

As an example, Figure 4.28 shows a true fully adaptive minimal routing algorithm for 2-D meshes. Each physical channel is assumed to be split into two virtual channels a and b. In this figure, Xa + and Xb + denote the a and b channels, respectively, in the positive direction of the X dimension. A similar notation is used for the other direction and dimension. As can be seen, no routing restrictions are enforced, except for paths to be minimal.

image

Figure 4.28 True fully adaptive minimal routing algorithm for 2-D meshes (Disha).

Figure 4.28 only shows the routing algorithm. It does not include deadlock handling. This issue was covered in Section 3.6. For the sake of completeness, Figure 4.29 shows a flow diagram of the true fully adaptive nonminimal routing algorithm implemented by Disha. The shaded box corresponds to the routing algorithm described in Figure 4.28 (extended to handle nonminimal routing). If after a number of tries a packet cannot access any virtual channel along any minimal path to its destination, it is allowed to access any misrouting channel except those resulting in 180-degree turns. If all minimal and misrouting channels remain busy for longer than the timeout for deadlock detection, the packet is eventually suspected of being deadlocked. Once this determination is made, its eligibility to progressively recover using the central deadlock buffer recovery path is checked. As only one of the packets involved in a deadlock needs to be eliminated from the dependency cycle to break the deadlock, a packet either uses the recovery path (is eligible to recover) or will eventually use one of the normal edge virtual channel buffers for routing (i.e., is not eligible to recover, but the deadlock is broken by some other packet that is eligible). Hence, Disha aims at optimizing routing performance in the absence of deadlocks and efficiently dealing with the rare cases when deadlock may be impending. If deadlocks are truly rare, substantial performance benefits can be gleaned (see Section 9.4.1).

image

Figure 4.29 Flow diagram of the true fully adaptive nonminimal routing algorithm implemented by Disha.

4.6 Nonminimal Routing Algorithms

As indicated in the previous section, routing algorithms based on deadlock recovery can be designed to use nonminimal paths. Some methodologies proposed in previous sections can also be used for the design of avoidance-based nonminimal routing algorithms. This is the case for the turn model, Dally and Aoki’s algorithm, and the methodology proposed in Section 4.4.4 for VCT switching. Also, PAR, hop algorithms, most algorithms based on virtual networks, and the methodology proposed in Section 4.4.4 for wormhole switching can be extended so as to consider nonminimal routing.

For networks using wormhole switching, nonminimal routing algorithms usually degrade performance because packets consume more network resources. In particular, blocked packets occupy more channels on average, reducing the bandwidth available to the remaining packets. As a consequence, nonminimal routing algorithms are usually proposed for fault-tolerant routing because they are able to find alternative paths when all the minimal paths are faulty. These algorithms will be studied in Chapter 6.

However, blocked packets are completely removed from the network when VCT switching is used. In this case, taking an unprofitable channel is likely to bring the packet to another set of profitable channels that will allow further progress to the destination. As indicated in Chapter 3, deadlock can be avoided in VCT switching by using deflection routing. This routing technique uses nonminimal paths to avoid deadlock when all the minimal paths are busy.

The Chaos router [188, 189] implements a fully adaptive nonminimal routing algorithm and uses deflection routing to avoid deadlock. This router uses VCT switching and splits messages into fixed-length packets. It has a central packet queue implemented in hardware to remove packets from the network when they are blocked. The Chaos routing algorithm routes packets following a minimal path whenever possible. If all the minimal paths for a packet are busy, routing is retried for a fixed period of time. When waiting time exceeds a threshold, the packet is stored in the packet queue. To prevent starvation, packets in the queue have the highest priority in accessing a free output channel belonging to a minimal path when it becomes available. Queue overflow is prevented by derouting (misrouting) packets. When a queue slot is requested and the queue is full, a queued packet is randomly selected and forwarded along a nonminimal path.

Derouting packets requires guaranteeing that a nonminimal path is always available. This is equivalent to guaranteeing that some output channel is free. As indicated in Chapter 3, this can be ensured because the number of input channels (including injection channels) is equal to the number of output channels (including delivery channels). However, the Chaos router implements a more elaborate protocol to select the output channel for derouting. The protocol and formal proofs of deadlock freedom can be found in [189].

4.7 Backtracking Protocols

Backtracking protocols work on the premise that it is better to be searching for alternative paths than to be waiting for a channel to become available. This premise is especially true when the channel is faulty because it will not be available until repaired. The application of backtracking protocols to fault-tolerant routing will be studied in Chapter 6. In this section, we only consider the use of backtracking protocols to improve performance.

From a performance point of view, backtracking protocols are suited to circuit switching or some variant of it. Backtracking protocols search the network for a path in a depth-first manner. Potential paths from source to destination are searched by routing a header flit or probe through the network. The header acquires (virtual) channels as it moves toward its destination. When the header cannot continue onward, it backtracks over the last acquired channel, releases it, and continues its search from the previous node. As seen in Chapter 3, deadlock is prevented in circuit switching by reserving all the required resources before starting packet transmission. During the reservation phase, the header backtracks as needed instead of blocking when a channel is not available. Also, livelock is not a problem because backtracking protocols can use history information to avoid searching the same path repeatedly.

Although backtracking protocols can also be implemented with SAF switching, the performance overhead can be substantial compared to the use of deadlock avoidance techniques. From the behavioral point of view, the actual switching technique is unimportant and will not be referenced unless necessary in the remainder of this discussion.

Most backtracking protocols use the routing header to store history information [52, 62]. This significantly increases the size of the header over progressive protocols and, consequently, increases the time required to route the probe through the network. This is particularly a problem for misrouting backtracking protocols, since the number of links traversed during path setup can be very high. To overcome this problem, the history information can be distributed throughout the nodes of the network, reducing the header to a size comparable to that of e-cube [125, 126]. At each node in the network, each input link has a history bit vector h with as many bits as the node has channels. This history vector is associated with the circuit probe that came in on that channel. As each candidate outgoing channel is searched, the corresponding bit is set to “remember” that the channel has been searched. Each node also has a history bit vector h, for the node itself, since the node (not a channel) may be the source of the probe.

In this section, we briefly describe several backtracking protocols. Exhaustive profitable backtracking (EPB) performs a straightforward depth-first search of the network using only profitable links [139]. It is guaranteed to find a minimal path if one exists. This protocol is completely adaptive, profitable, and backtracking. Although the EPB protocol does not repeatedly search the same paths, it can visit a specific node several times (see Example 4.6). This can lead to unnecessary backtracking and longer setup times. The k-family routing paradigm is a family of partially adaptive, profitable, backtracking protocols proposed for binary hypercubes that use a heuristic to help minimize redundancy in the search for a path [62].

k-family protocols are two-phased, using a heuristic search in the first phase and an exhaustive search in the second. Each protocol is distinguished by a parameter k that determines when the heuristic is used and when exhaustive backtracking is used. When the probe is at a distance from the destination greater than k, the heuristic is used; when the distance is less than or equal to k, an exhaustive profitable search is used. If k = 1, the protocol is a strictly heuristic search. As k grows to the distance between the source and destination, the search becomes more and more exhaustive. A k protocol makes use of a history mask contained in the circuit probe. At each level in the search tree, the cumulative history mask of the ancestor nodes determines which links might be explored. The history mask records all the dimensions explored at each link and all dimensions explored at all ancestor nodes. The heuristic limits exploration to dimensions not marked in the history mask. In this manner, the search tree in Example 4.6 is pruned of links that are likely to lead into areas of the network that have been searched already.

The multilink paradigm is a family of profitable backtracking protocols that only make a partial reservation of the path [297]. Instead of sending a probe to set up a path from source to destination, the multilink protocol uses wormhole switching in the absence of contention. When the requested output channel is busy, the router switches to multilink mode. In this mode, it sends a probe to search the network for a path of a given maximum size (multilink). The maximum number of channels that can be reserved at once is a fixed parameter for each protocol (multilink size). Once the path has been established, the router returns to normal mode and data flits advance. If another busy channel is found, the router switches to multilink mode again, sending another probe. By limiting the number of channels that can be reserved at once, the search tree is pruned and the setup phase is much faster. However, the resulting routing algorithm is only partially adaptive. Also, deadlock freedom must be ensured by establishing an ordering between channels and multilinks. Additionally, once the data flits advance up to a given node, the probe can only backtrack up to that node. Although this protocol was initially proposed as profitable, it can be easily modified to use misrouting.

The exhaustive misrouting backtracking (EMB) protocol does a depth-first search of the network using both profitable and unprofitable links [125]. It uses the “best first” heuristic, taking profitable links over unprofitable ones. Although, on the average, the established circuits are longer than with a purely profitable protocol, EMB’s probability of finding a path is greater. The drawback of this algorithm is that it cannot detect that a packet is undeliverable until it searches every path it can reach in the network. In the worst case, a single probe can tie up large amounts of network resources searching in vain for a nonexistent path. The protocol is just “too optimistic.” Also, reserved paths may be too long, unnecessarily wasting channel bandwidth.

Misrouting freedom can be limited by using a partially adaptive two-phase protocol analogous to the k-family protocol where the probe is free to misroute only if it is within a certain distance u of its destination [125]. The first phase of two-phase backtracking (TPB-u) performs an exhaustive profitable search when the probe is at a distance greater than u from the destination. When TPB-u is within u of the destination, it enters the second phase, where it performs an exhaustive misrouting search. A TPB-u probe can switch phases several times during a single route.

The misrouting backtracking protocol with m misroutes (MB-m) is very much like the exhaustive misrouting backtracking protocol, except that it limits to m the maximum number of misroutes allowed at any time [126]. This protocol will be studied in Chapter 6.

In general, backtracking protocols do not improve performance over progressive ones when packets are short because the overhead to set up a path is very high. For very long packets or messages, backtracking protocols may perform better than progressive ones, especially when the number of alternative paths offered by the network is high.

EXAMPLE 4.6

To illustrate backtracking protocols, we present the network shown in Figure 4.30. We will examine routing from node 000 to node 111 for EPB, k-family with k = 1, EMB, TPB-1, MB-1, and multilinks. Figure 4.30 shows the search tree for a profitable backtracking protocol routing a packet from node 000 to node 111 in a binary 3-cube. EPB will perform a depth-first search of the tree (left to right). Each path from the root node to a leaf node corresponds to a path from the source node (000) to the destination node (111) in the network. In Figure 4.30, links (011, 111) and (101, 111) are busy. In routing from 000 to 111, the EPB probe would visit the following nodes in sequence:

image

Figure 4.30 Search tree for a profitable backtracking protocol.

image

The final circuit the probe established would be

image

Note that the probe had to backtrack all the way to the source node and visited node 011 twice.

The k-family probe would visit the following node sequence:

image

The resulting circuit would be the same as for EPB. Note that the k-family protocol pruned the second visit to 011 because the history mask passed to 010 from the source indicated that dimension 0 had already been searched.

For EMB, TPB-1, and MB-1, the probe’s search would proceed as follows:

image

The final circuit would be

image

In this case, since misrouting occurs at a node only one hop from the destination, TPB-1 behaves identically to EMB. If it were not possible to route at node 010, EMB would be free to misroute again, but TPB-1 would have to backtrack. MB-1 also behaves identically to EMB because only one misroute is required to search an alternative free path. Again, if it were not possible to route at node 010, MB-1 would have to backtrack. The final circuits established by the profitable protocols will always be the minimal length. The circuits established by the misrouting protocols can be much longer.

The multilinks protocol uses wormhole switching in the absence of contention. Thus, the header would reserve the following path:

image

As data flits immediately follow the header, it cannot backtrack at node 011. Thus, the header will have to wait for link (011, 111) to become free. The partially optimistic behavior of the multilinks protocol prevents it from backtracking. However, if this protocol were modified to support misrouting, a probe would be sent from node 011 instead of waiting for a free channel. This probe would visit the following node sequence:

image

The resulting circuit would be the same as for MB-1.

4.8 Routing in MINs

In this section, we describe several issues concerning routing in MINs. For array processors, a central controller establishes the path from input to output. In cases where the number of inputs equals the number of outputs, each input synchronously transmits a message to one output, and each output receives a message from exactly one input. Computing the switch settings to realize such a permutation is a complex task. Furthermore, some permutations may not be realizable in one pass through the network. In this case, multiple passes of the data through the network may be required, and the goal is to minimize the number of passes. The complexity of the off-line computation of the switch settings is proportional to the number of switches. This section and most multiprocessor applications consider only networks with the same number of inputs and outputs. Contention-free centralized routing will be addressed in Section 4.8.1. Also, see [263] for a brief survey.

On the other hand, in asynchronous multiprocessors, centralized control and permutation routing are infeasible. So, a routing algorithm is required to establish the path across the stages of a MIN. The simplest solution consists of using source routing. In this case, the source node specifies the complete path. As this solution produces a considerable overhead, we will focus on distributed routing. Routing algorithms for MINs will be described in Section 4.8.2.

There are many ways to interconnect adjacent stages. See Sections 1.7.4 and 1.7.5 for a definition of some connection patterns and MIN topologies, respectively.

4.8.1 Blocking Condition in MINs

The goal of this section is to derive necessary and sufficient conditions for two circuits to block, that is, require the same intermediate link. In an N-processor system, there are exactly N links between every stage of k × k switches. The network consists of n = logk N stages, where each stage is comprised of image switches. The intermediate link patterns connect an output of switch stage i to an input of switch stage i + 1. Blocking occurs when two packets must traverse the same output at a switch stage i, i = 0, 1, …, n − 1.

At any stage i, we can address all of the outputs of that stage from 0 to N − 1. Let us refer to these as the intermediate outputs at stage i. If we take a black box view of the network, we can represent it as shown in Figure 4.31 for an Omega network with 2 × 2 switches. Each stage of switches and each stage of links can be represented as a function that permutes the inputs. An example is shown in the figure of a path from network input 6 to network output 1. The intermediate outputs on this path are marked as shown. They are 4, 0, and 1 at the output of switch stages 0, 1, and 2, respectively. Our initial goal is the following. Given an input/output pair, generate the addresses of all of the intermediate outputs on this path. Since these networks have a single path from input to output, these addresses will be unique. Two input/output paths conflict if they traverse a common intermediate link or share a common output at any intermediate stage.

image

Figure 4.31 Functional view of the structure of a multistage Omega network.

In the following, we derive the blocking condition for the Omega network. Equivalent conditions can be derived for other networks in a similar manner. All input/output addresses in the following are represented in base k. For ease of discussion, we will assume a circuit-switched network. Consider establishing a circuit from sn −1 sn −2s1s0 to dn −1 dn −2d1d0. Consider the first link stage in Figure 4.31. This part of the network functionally establishes the following connection between its input and output.

image (4.2)

The right-hand side of expression (4.2) is the address of the output of the k-shuffle pattern and the address of the input to the first stage of switches. As mentioned in the previous section, each switch is only able to change the least significant digit of the current address. After the shuffle permutation, this is sn–1. So the output of the first stage of switches will be such that the least significant digit of the address will be equal to dn−1. Thus, the input/output connection established at the first stage of switches should be such that the input is connected to the output as follows:

image (4.3)

The right-hand side of the above expression is the address of the input to the second link stage. It is also the output of the first stage of switches and is therefore the first intermediate output. An example is shown in Figure 4.31 for a path from input 6 to output 1, which has a total of three intermediate outputs. Similarly, we can write the expression for the address of the intermediate output at stage i as

image (4.4)

This is assuming that stages are numbered from 0 to n − 1. We can now write the blocking condition. For any two input/output pairs (S, D) and (R, T), the two paths can be set up in a conflict-free manner if and only if,∀i, 0 ≤ in − 1,

image (4.5)

Testing for blocking between two input/output pairs is not quite as computationally demanding as it may first seem. Looking at the structure of the blocking condition we can see that if it is true and the circuits do block, then ni − 1 least significant digits of the source addresses are equal and i + 1 most significant digits of the destination addresses are equal. Let us assume that we have a function φ(S, R) that returns the largest integer l such that the l least significant digits of S and R are equal. Similarly, let us assume that we have a function ψ (D, T) that returns the largest integer m such that the m most significant digits of D and T are equal. Then two paths (S, D) and (R, T) can be established in a conflict-free manner if and only if

image (4.6)

where N = kn. As a practical matter, blocking can be computed by sequences of shift and compare operations, as follows:

image

The addresses of the two input/output pairs are concatenated. Figuratively, a window of size n slides over both pairs and the contents of both windows are compared. If they are equal at any point, then there is a conflict at some stage. This will take O(logk N) steps to perform. To determine if all paths can be set up in a conflict-free manner, we will have to perform O(N2) comparisons with each taking O(logk N) steps, resulting in an O(N2 logk N) algorithm. By comparison, the best-known algorithm for centralized control to set up all of the switches in advance takes O(N logk N) time. However, when using the above formulation of blocking, it is often not necessary to perform the worst-case number of comparisons. Often the structure of the communication pattern can be exploited in an off-line determination of whether patterns can be set up in a conflict-free manner.

4.8.2 Self-Routing Algorithms for MINs

A unique property of Delta networks is their self-routing property [273]. The self-routing property of these MINs allows the routing decision to be determined by the destination address, regardless of the source address. Self-routing is performed by using routing tags. For a k × k switch, there are k output ports. If the value of the corresponding routing tag is i (0 ≤ ik − 1), the corresponding packet will be forwarded via port i. For an n-stage MIN, the routing tag is T = tn −1t1t0, where ti controls the switch at stage Gi.

As indicated in Section 1.7.5, each switch is only able to change the least significant digit of the current address. Therefore, routing tags will take into account which digit is the least significant one at each stage, replacing it by the corresponding digit of the destination address. For a given destination dn −1 dn −2d0, in a butterfly MIN the routing tag is formed by having ti = di+ 1 for 0 ≤ in − 2 and tn −1 = d0. In a cube MIN, the routing tag is formed by having ti = dni−1 for 0 ≤ in − 1. Finally, in an Omega network, the routing tag is formed by having ti = dni−1 for 0 ≤ in − 1. Example 4.7 shows the paths selected by the tag-based routing algorithm in a 16-node butterfly MIN.

EXAMPLE 4.7

Figure 4.32 shows a 16-node butterfly MIN using 2 × 2 switches, and the paths followed by packets from node 0100 to node 0011, and from node 1010 to node 1000. As indicated above, the routing tag for a given destination dn −1 dn −2d0 is formed by having ti = di+1 for 0 ≤ in − 2 and tn −1 = d0. Thus, the routing tag for destination 0011 is 1001. This tag indicates that the packet must take the upper switch output (port 0) at stages G2 and G1 and the lower switch output (port 1) at stages G3 and G0. The routing tag for destination 1000 is 0100. This tag indicates that the packet must take the upper switch output at stages G3, G1, and G0 and the lower output at stage G2.

image

Figure 4.32 Paths selected by the tag-based routing algorithm in a 16-node butterfly MIN.

One of the nice features of the traditional MINs (TMINs) above is that there is a simple algorithm for finding a path of length logk N between any input/output pair. However, if a link becomes congested or fails, the unique path property can easily disrupt the communication between some input and output pairs. The congestion of packets over some channels causes the known hot spot problem [275]. Many solutions have been proposed to resolve the hot spot problem. A popular approach is to provide multiple routing paths between any source and destination pair so as to reduce network congestion as well as to achieve fault tolerance. These methods usually require additional hardware, such as extra stages or additional channels.

The use of additional channels gives rise to dilated MINs. In a d-dilated MIN (DMIN), each switch is replaced by a d-dilated switch. In this switch, each port has d channels. By using replicated channels, DMINs offer substantial network throughput improvement [191]. The routing tag of a DMIN can be determined by the destination address as mentioned for TMINs. Within the network switches, packets destined for a particular output port are randomly distributed to one of the free channels of that port. If all channels are busy, the packet is blocked.

Another approach for the design of MINs consists of allowing for bidirectional communication. In this case, each port of the switch has dual channels. In a butterfly bidirectional MIN (BMIN) built with k × k switches, source address S and destination address D are represented by k-ary numbers sn −1s1s0, and dn −1d1d0, respectively. The function FirstDifference(S, D) returns t, the position where the first (leftmost) different digit appears between sn −1s1s0 and dn −1d1d0.

A turnaround routing path between any source and destination pair is formally defined as follows. A turnaround path is a route from a source node to a destination node. The path must meet the following conditions:

image The path consists of some forward channel(s), some backward channel(s), and exactly one turnaround connection.

image The number of forward channels is equal to the number of backward channels.

image No forward and backward channels along the path are the channel pair of the same port.

Note that the last condition is to prevent redundant communication from occurring. To route a packet from source to destination, the packet is first sent forward to stage Gt. It does not matter which switch (at stage Gt) the packet reaches. Then, the packet is turned around and sent backward to its destination. As it moves forward to stage Gt, a packet may have multiple choices as to which forward output channel to take. The decision can be resolved by randomly selecting from among those forward output channels that are not blocked by other packets. After the packet has attained a switch at stage Gt, it takes the unique path from that switch backward to its destination. The backward routing path can be determined by the tag-based routing algorithm for TMINs.

Note that the routing algorithms described above are distributed routing algorithms, in which each switch determines the output channel based on the address information carried in the packet. Example 4.8 shows the paths available in an eight-node butterfly BMIN.

EXAMPLE 4.8

Figure 4.33 shows the paths available in an eight-node butterfly BMIN using 2 × 2 switches. Figure 4.33(a) shows the case when the source and destination nodes are 000 and 111, respectively. In this case, FirstDifference(S, D) = 2. Thus, packets turn at stage G2, and there are four different shortest paths available. In Figure 4.33(b), the source and destination nodes are 001 and 011, respectively. FirstDifference(S, D) = 1, and packets turn at stage G1.

image

Figure 4.33 Paths available in an eight-node butterfly BMIN: (a) FirstDifference(S, D) = 2; (b) FirstDifference (S, D) = 1.

4.9 Routing in Switch-Based Networks with Irregular Topologies

Recently, switch-based interconnects like Autonet [305], Myrinet [30], and ServerNet [156] have been proposed to build NOWs for cost-effective parallel computing. Typically, these switches support networks with irregular topologies. Such irregularity provides the wiring flexibility required in LANs and also allows the design of scalable systems with incremental expansion capability. The irregularity also makes the routing on such systems quite complicated.

Switch-based networks consist of a set of switches where each switch can have a set of ports. Each port consists of one input and one output link. A set of ports in each switch are either connected to processors or left open. The remaining ports are connected to ports of other switches to provide connectivity between the processors. Such connectivity is typically irregular, and the only guarantee is that the network is connected. Typically, all links are bidirectional full duplex, and multiple links between two switches are allowed. Such a configuration allows a system with a given number of processors to be built using fewer switches than a direct network while allowing a reasonable number of external communication ports per processor. Figure 4.34(a) shows a typical NOW using switch-based interconnect with irregular topology. In this figure, it is assumed that switches have eight ports and each processor has a single port.

image

Figure 4.34 (a) An example system with switch-based interconnect and irregular topology. (b) The corresponding graph G.

The switch may implement different switching techniques: wormhole switching, VCT, or ATM. However, wormhole switching is used in most cases. Several deadlock-free routing schemes have been proposed in the literature for irregular networks [30, 156, 285, 305]. Routing in irregular topologies can be performed by using source routing or distributed routing. In the former case, each processor has a routing table that indicates the sequence of ports to be used at intermediate switches to reach the destination node. That information is stored in the packet header [30]. In the latter case, processors and switches require routing tables. The use of table lookup routing allows the use of the same switch fabric for different topologies. However, some network mapping algorithm must be executed in order to fill those tables before routing can be performed. The details of the mapping algorithm greatly depend on the underlying hardware support. Typically, switches support some broadcasting capability in the absence of routing tables.

In this section, we briefly describe the deadlock-free routing scheme used in DEC AN1 (Autonet) networks [305]. This routing scheme is valid for all the switching techniques. In addition to providing deadlock freedom, it provides adaptive communication between some nodes in an irregular network. Also, we describe a fully adaptive routing algorithm for irregular topologies [319, 320] that considerably improves performance over the routing scheme proposed in [305].

Once a packet reaches a switch directly connected to its destination processor, it can be delivered as soon as the corresponding link becomes free. Therefore, we are going to focus on routing packets between switches. As indicated in Chapter 3, the interconnection network I between switches can be modeled by a multigraph I = G(N, C), where N is the set of switches and C is the set of bidirectional links between the switches. It should be noted that bidirectional links were considered as two unidirectional channels in Chapter 3 but not in this section. Figure 4.34(b) shows the graph for the irregular network in Figure 4.34(a).

The Autonet routing algorithm is distributed and is implemented using table lookup. When a packet reaches a switch, the destination address stored in the packet header is concatenated with the incoming port number, and the result is used to index the routing table at that switch. The table lookup returns the outgoing port number that the packet should be routed through. When multiple routes exist from the source to the destination, the routing table entries return alternative outgoing ports. In case multiple outgoing ports are free, the routing scheme selects one port randomly.

In order to fill the routing tables, a breadth-first spanning tree (BFS) on the graph G is computed first using a distributed algorithm. This algorithm has the property that all nodes will eventually agree on a unique spanning tree. Routing is based on an assignment of direction to the operational links. In particular, the up end of each link is defined as (1) the end whose switch is closer to the root in the spanning tree, or (2) the end whose switch has the lower ID if both ends are at switches at the same tree level. Links looped back to the same switch are omitted from the configuration. The result of this assignment is that each cycle in the network has at least one link in the up direction and one link in the down direction.

To eliminate deadlocks while still allowing all links to be used, this routing uses the following up/down rule: a legal route must traverse zero or more links in the up direction followed by zero or more links in the down direction. Thus, cyclic dependencies between channels are avoided because a packet may never traverse a link along the up direction after having traversed one in the down direction. Such routing not only allows deadlock freedom but also adaptivity. The lookup tables can be constructed to support both minimal and nonminimal adaptive routing. However, in many cases, up/down routing is not able to supply any minimal path between some pairs of nodes, as shown in the following example.

EXAMPLE 4.9

Figure 4.35 shows the link direction assignment for the example irregular network shown in Figure 4.34(a). Switches are arranged in such a way that all the switches at the same tree level in the BFS spanning tree (rooted at switch 6) are at the same vertical position in the figure. The assignment of the up direction to the links in this network is illustrated. The down direction is along the reverse direction of the link. Note that every cycle has at least one link in the up direction and one link in the down direction. It can be observed that all the alternative minimal paths are allowed in some cases. For example, a packet transmitted from switch 5 to switch 4 can be routed through either switch 6 or switch 1. In some other cases, however, only some minimal paths are allowed. For example, a packet transmitted from switch 2 to switch 1 can be routed through switch 4, but it cannot be routed through switch 7. In this example, the packet can also be routed through switches 4 and 6, thus following a nonminimal path. It should be noted that any transmission between adjacent switches is always allowed to use the link(s) connecting them, regardless of the direction assigned to that link. However, when two switches are located two or more links away, it may happen that all the minimal paths are forbidden. This is the case for packets transmitted from switch 3 to switch 2. The shortest path (through switch 8) is not allowed. All the allowed paths (through switches 5, 1, 4; through switches 5, 6, 4; through switches 5, 1, 6, 4; and through switches 5, 6, 1, 4) are nonminimal.

image

Figure 4.35 Link direction assignment for the irregular network shown in Figure 4.34(a).

The use of nonminimal paths consumes more channel bandwidth than strictly required to transmit packets. This drawback becomes more pronounced as network size increases. Nonminimal paths are imposed by the need to avoid deadlock. However, as indicated in Section 3.1.3, it is not necessary to remove all the cyclic dependencies between channels in order to avoid deadlock.

The design methodology presented in Section 4.4.4 cannot be directly applied to irregular networks implementing wormhole switching. The reason is that most routing functions defined for irregular networks only provide nonminimal paths between some pairs of nodes. As a consequence, the routing functions supplied by that methodology do not pass the verification step (step 3) in most cases.

A general methodology for the design of adaptive routing algorithms for networks with irregular topology was proposed in [320]. It is similar to the design methodology presented in Section 4.4.4, but it provides less routing flexibility. That methodology can be summarized as follows. Given an interconnection network and a deadlock-free routing function defined on it, it is possible to duplicate all the physical channels in the network, or to split them into two virtual channels. In both cases, the graph representation of the new network contains the original and the new channels. Then, the routing function is extended so that newly injected messages can use the new channels without any restriction as long as the original channels can only be used in the same way as in the original routing function. However, once a message reserves one of the original channels, it can no longer reserve any of the new channels again. This constraint is the only difference between this methodology and Methodology 4.1 presented in Section 4.4.4. Indeed, the resulting routing algorithms are very much like the dynamic dimension reversal routing algorithm described in Section 4.4.4, except that packets do not need to carry the dimension reversal number.

It is easy to see that the design methodology described above supplies deadlock-free routing algorithms. Effectively, consider a routing subfunction R1 identical to the original routing function. The channels supplied by R1 are used exactly in the same way in the original routing function and in the extended routing function. Hence there are no (direct or indirect) cross-dependencies between channels supplied by R1. Also, packets using channels supplied by R1 can no longer use other channels not supplied by R1. Hence, there are no indirect dependencies between channels supplied by R1. Therefore, the extended channel dependency graph for R1 is identical to the channel dependency graph for the original routing function, which is acyclic. Thus, the extended routing function is deadlock-free.

According to the extended routing function defined above, new channels provide more routing flexibility than original channels. Besides, they can be used to route messages through minimal paths. However, once a message reserves an original channel, it is routed through the original paths, which, in most cases, are nonminimal. Also, routing through original paths produces a loss of adaptivity.

Following this reasoning, the general methodology proposed in [320] can be refined by restricting the transition from new channels to original channels, since they provide less adaptivity and nonminimal paths in most cases. The extended routing function can be redefined in the following way. Newly injected messages can only leave the source switch using new channels belonging to minimal paths, and never using original channels. When a message arrives at a switch from another switch through a new channel, the routing function gives a higher priority to the new channels belonging to minimal paths. If all of them are busy, then the routing algorithm selects an original channel belonging to a minimal path (if any). To ensure that the new routing function is deadlock-free, if none of the original channels provides minimal routing, then the original channel that provides the shortest path will be used. Once a message reserves an original channel, it will be routed using this kind of channel according to the original routing function until it is delivered.

This enhanced design methodology was proposed in [319] and can be applied to any routing algorithm that avoids deadlock by prohibiting cyclic dependencies between channels. In particular, it can be applied to networks using up/down routing simply by splitting physical channels into two virtual channels (or duplicating each physical channel) and using up/down routing as the original routing function. By restricting the use of original channels as indicated above, most messages are allowed to follow minimal paths, and therefore a more efficient use of resources is made. Also, adaptivity is considerably increased with respect to the original up/down routing algorithm. As a result, latency decreases significantly, and the network is able to deliver a throughput several times higher than the one achieved by the original up/down routing algorithm [319] (see Section 9.8).

4.10 Resource Allocation Policies

In this section we briefly describe several policies proposed in the literature for the allocation of network resources. Resource allocation policies can be grouped into two classes: those required to select a resource for a packet when several resources are available, and those required to arbitrate between several packets contending for the same resource. The first class of policies is usually required to select a channel or buffer among the options offered by an adaptive routing function. The function that performs this selection is usually referred to as a selection function. The second class of policies is usually required to allocate resources like the routing control unit (the circuit that computes the routing algorithm) or the bandwidth of a physical channel when it is multiplexed among several virtual channels. Both classes of policies are studied in the next sections.

4.10.1 Selection Function

Without loss of generality, let us assume that when a packet is routed, the network resources requested by that packet are the output channels at the current node. As indicated in Section 4.1, adaptive routing algorithms can be decomposed into two functions: routing and selection. The routing function supplies a set of output channels based on the current node or buffer and the destination node. The selection function selects an output channel from the set of channels supplied by the routing function. This selection can be random. In this case, network state is not considered (oblivious routing). However, the selection is usually done taking into account the status of output channels at the current node. Obviously, the selection is performed in such a way that a free channel (if any) is supplied. However, when several output channels are available, some policy is required to select one of them. Policies may have different goals, like balancing the use of resources, reserving some bandwidth for high-priority packets, or even delaying the use of resources that are exclusively used for deadlock avoidance. Regardless of the main goal of the policy, the selection function should give preference to channels belonging to minimal paths when the routing function is nonminimal. Otherwise the selection function may produce livelock. In this section we present some selection functions proposed in the literature for several purposes.

In [73], three different selection functions were proposed for n-dimensional meshes using wormhole switching with the goal of maximizing performance: minimum congestion, maximum flexibility, and straight lines. In minimum congestion, a virtual channel is selected in the dimension and direction with the most available virtual channels. This selection function tries to balance the use of virtual channels in different physical channels. The motivation for this selection function is that packet transmission is pipelined. Hence, flit transmission rate is limited by the slowest stage in the pipeline. Balancing the use of virtual channels also balances the bandwidth allocated to different virtual channels. In maximum flexibility, a virtual channel is selected in the dimension with the greatest distance to travel to the destination. This selection function tries to maximize the number of routing choices as a packet approaches its destination. In meshes, this selection function has the side effect of concentrating traffic in the central part of the network bisection, therefore producing an uneven channel utilization and degrading performance. Finally, in straight lines, a virtual channel is selected in the dimension closest to the current dimension. So, the packet will continue traveling in the same dimension whenever possible. This selection function tries to route packets in dimension order unless the requested channels in the corresponding dimension are busy. In meshes, this selection function achieves a good distribution of traffic across the network bisection. These selection functions were evaluated in [73] for 2-D meshes, showing that minimum congestion achieves the lowest latency and highest throughput. Straight lines achieves similar performance. However, maximum flexibility achieves much worse results. These results may change for other topologies and routing functions, but in general minimum congestion is a good choice for the reason mentioned above.

For routing functions that allow cyclic dependencies between channels, the selection function should give preference to adaptive channels over channels used to escape from deadlocks [91]. By doing so, escape channels are only used when all the remaining channels are busy, therefore increasing the probability of escape channels being available when they are required to escape from deadlock. The selection among adaptive channels can be done by using any of the strategies described above. Note that if escape channels are not free when requested, it does not mean that the routing function is not deadlock-free. Escape channels are guaranteed to become free sooner or later. However, performance may degrade if packets take long to escape from deadlock. In order to reduce the utilization of escape channels and increase their availability to escape from deadlock, it is possible to delay the use of escape channels by using a timeout. In this case, the selection function can only select an escape channel if the packet header is waiting for longer than the timeout. The motivation for this kind of selection function is that there is a high probability of an adaptive channel becoming available before the timeout expires. This kind of selection function is referred to as a time-dependent selection function [88, 94]. In particular, the behavior of progressive deadlock recovery mechanisms (see Section 3.6) can be modeled by using a time-dependent selection function.

The selection function also plays a major role when real-time communication is required. In this case, best-effort packets and guaranteed packets compete for network resources. If the number of priority classes for guaranteed packets is small, each physical channel may be split into as many virtual channels as priority classes. In this case, the selection function will select the appropriate virtual channel for each packet according to its priority class. When the number of priority classes is high, the set of virtual channels may be split into two separate virtual networks, assigning best-effort packets and guaranteed packets to different virtual networks. In this case, scheduling of guaranteed packets corresponding to different priority classes can be achieved by using packet switching [294]. In this switching technique, packets are completely buffered before being routed in intermediate nodes, therefore allowing the scheduling of packets with the earliest deadline first. Wormhole switching is used for best-effort packets. Latency of guaranteed packets can be made even more predictable by using an appropriate policy for channel bandwidth allocation, as indicated in the next section.

4.10.2 Policies for Arbitration and Resource Allocation

There exist some situations where several packets contend for the use of resources, therefore requiring some arbitration and allocation policy. These situations are not related to the routing algorithm but are described here for completeness. The most interesting cases of conflict in the use of resources arise when several virtual channels belonging to the same physical channel are ready to transfer a flit, and when several packet headers arrive at a node and need to be routed. In the former case, a virtual channel allocation policy is required, while the second case requires a routing control unit allocation policy.

Flit-level flow control across a physical channel involves allocating channel bandwidth among virtual channels that have a flit ready to transmit and have space for this flit at the receiving end. Any arbitration algorithm can be used to allocate channel bandwidth including random, round-robin, or priority schemes. For random selection, an arbitrary virtual channel satisfying the above-mentioned conditions is selected. For round-robin selection, virtual channels are arranged in a circular list. When a virtual channel transfers a flit, the next virtual channel in the list satisfying the above-mentioned conditions is selected for the next flit transmission. This policy is usually referred to as demand-slotted round-robin and is the most frequently used allocation policy for virtual channels.

Priority schemes require some information to be carried in the packet header. This information should be stored in a status register associated with each virtual channel reserved by the packet. Deadline scheduling can be implemented by allocating channel bandwidth based on a packet’s deadline or age (earliest deadline or oldest age first) [72]. Scheduling packets by age reduces the variance of packet latency. For real-time communication, the latency of guaranteed packets can be made more predictable by assigning a higher priority to virtual channels reserved by those packets. By doing so, best-effort packets will not affect the latency of guaranteed packets because flits belonging to best-effort packets will not be transmitted unless no flit belonging to a guaranteed packet is ready to be transmitted. This approach, however, may considerably increase the latency of best-effort packets. A possible solution consists of allowing up to a fixed number of best-effort flits to be transmitted every time a guaranteed packet is transmitted [294].

When traffic consists of messages of very different lengths, a few long messages may reserve all of the virtual channels of a given physical channel. If a short message requests that channel, it will have to wait for a long time. A simple solution consists of splitting long messages into fixed-length packets. This approach reduces waiting time for short messages but introduces some overhead because each packet carries routing information and has to be routed. Channel bandwidth is usually allocated to virtual channels at the granularity of a flit. However, it can also be allocated at the granularity of a block of flits, or even a packet, thus reducing the overhead of channel multiplexing. The latter approach is followed in the T9000 transputer [228]. In this case, buffers must be large enough to store a whole packet.

A different approach to support messages of very different lengths has been proposed in the Segment Router [185, 186]. This router has different buffers (virtual channels) for short and long messages. Additionally, it provides a central queue for short messages. The Segment Router implements VCT switching for short messages and a form of buffered wormhole switching for long messages. Instead of allocating channel bandwidth at the granularity of a flit, the Segment Router allocates channel bandwidth to short messages for the transmission of the whole message. For long messages, however, channel bandwidth is allocated for the transmission of a segment. A segment is the longest fragment of a long message that can be stored in the corresponding buffer. Segments do not carry header information and are routed in order following the same path. If a physical channel is assigned to a long message, after the transmission of each segment it checks for the existence of waiting short messages. If such a message exists, it is transmitted over the physical channel followed by the next segment of the long message. In this channel allocation scheme, a channel is only multiplexed between one long and (possibly) many short messages. It is never multiplexed between two long messages. However, no priority scheme is used for channel allocation. It should be noted that once the transmission of a short message or segment starts over a channel, it is guaranteed to complete because there is enough buffer space at the receiving end.

A router is usually equipped with a routing control unit that computes the routing algorithm when a new packet header is received, eventually assigning an output channel to that packet. In this case, the routing control unit is a unique resource, requiring arbitration when several packets contend for it. Alternatively, the circuit computing the routing function can be replicated at each input virtual channel, allowing the concurrent computation of the sets of candidate output channels when several packet headers need to be routed at the same time. This is the approach followed in the Reliable Router [74]. However, the selection function cannot be fully replicated because the execution of this function requires updating the channel status register. This register is a centralized resource. Otherwise, several packets may simultaneously reserve the same output channel. So when two or more packets compete for the same output channel, some arbitration is required.

The traditional scheduling policy for the allocation of the routing control unit (or for granting access to the channel status register) is round-robin among input channels. In this scheduling policy, input channel buffers form a logical circular list. After routing the header stored in a buffer, the pointer to the current buffer is advanced to the next input buffer where a packet header is ready to be routed. The routing function is computed for that header, supplying a set of candidate output channels. Then, one of these channels is selected, if available, and the packet is routed to that channel. This scheduling policy is referred to as input driven [116]. It is simple, but when the network is heavily loaded a high percentage of routing operations may fail because all of the requested output channels are busy.

An alternative scheduling policy consists of selecting a packet for which there is a free output channel. In this strategy, output channels form a logical circular list. After routing a packet, the pointer to the current output channel is advanced to the next free output channel. The router tries to find a packet in an input buffer that needs to be routed to the current output channel. If packets are found, it selects one to be routed to the current output channel. If no packet is found, the pointer to the current output channel is advanced to the next free output channel. This scheduling policy is referred to as output driven [116]. Output-driven strategies may be more complex than input-driven ones. However, performance is usually higher when the network is heavily loaded [116]. Output-driven scheduling can be implemented by replicating the circuit computing the routing function at each input channel, and adding a register to each output channel to store information about the set of input channels requesting that output channel. Although output-driven scheduling usually improves performance over input-driven scheduling, the difference between them is much smaller when channels are randomly selected [116]. Indeed, a random selection usually performs better than round-robin when input-driven scheduling is used. Finally, it should be noted that most selection functions described in the previous section cannot be implemented with output-driven scheduling.

4.11 Engineering Issues

From an engineering point of view, the “best” routing algorithm is either the one that maximizes performance or the one that maximizes the performance/cost ratio, depending on the design goals. A detailed quantitative analysis must consider the impact of the routing algorithm on node design. Even if we do not consider the cost, a more complex routing algorithm may increase channel utilization at the expense of a higher propagation delay and a lower clock frequency. A slower clock also implies a proportional reduction in channel bandwidth if channels are driven synchronously. Thus, router design must be considered to make performance evaluation more realistic. This is the reason why we postpone performance evaluation until a detailed study of the routing hardware is presented.

In this section, we present a brief qualitative performance study. It would be nice if we could conclude that some routing algorithm is the “best” one. However, even if we do not consider the cost, the optimal routing algorithm may vary depending on network traffic conditions. For wormhole switching, fully adaptive routing algorithms that require many virtual channels are not interesting in general because they drastically slow down clock frequency. This is the case for the algorithms presented in Sections 4.4.2 and 4.4.3.

For low-dimensional meshes and a uniform distribution of packet destinations, the performance of deterministic routing algorithms is similar to or even higher than that of partially or fully adaptive algorithms. This is because meshes are not symmetric and adaptive algorithms concentrate traffic in the central part of the network bisection. However, deterministic algorithms distribute traffic across the bisection more evenly. For symmetric networks, like tori and hypercubes, fully adaptive algorithms that require few resources to avoid deadlock (like the ones presented in Section 4.4.4) usually outperform deterministic routers, even for uniform traffic.

For switch-based networks with irregular topology, adaptive routing algorithms that allow cyclic dependencies between channels (like the ones presented in Section 4.9) also increase performance over deterministic or partially adaptive routing. This increment is usually much higher than in symmetric networks. The reason is that deterministic or partially adaptive routing algorithms for irregular networks usually route many messages across nonminimal paths, thus offering more chances for improvement.

When traffic is not uniformly distributed, the behavior may differ completely. On the one hand, when there are hot spots, partially and fully adaptive algorithms usually outperform deterministic algorithms considerably. This is the case for some synthetic loads like bit reversal, perfect shuffle, and butterfly [100]. On the other hand, when traffic locality is very high, adaptive algorithms do not help, and the lower clock frequency affects performance negatively [100].

Some optimizations like the ones presented in Sections 4.5.1 and 4.5.2 have a small impact on performance. They may actually reduce performance because the use of resources is less balanced than in routing algorithms like the ones designed with Methodology 4.1. On the other hand, the true fully adaptive routing algorithm presented in Section 4.5.3 provides full adaptivity and allows a better balanced use of resources, usually achieving the highest performance. However, this routing algorithm must be combined with deadlock detection and recovery techniques. Currently available detection techniques only work efficiently when all the packets are short and have a similar length.

Traffic pattern is not the only parameter that influences performance. Packet size affects performance considerably. If messages or packets are very short, as is the case in distributed shared-memory multiprocessors, adaptive minimal and even nonminimal routing algorithms usually increase performance, but the additional complexity introduced by the arrival of packets out of order may not be worth the additional performance. Additionally, as indicated above, if traffic locality is high, adaptive algorithms are not interesting at all.

For long messages that are not split into small packets, nonminimal routing algorithms are not interesting because bandwidth is wasted every time a long message reserves a nonminimal path. Note that this is not the case for short packets because misrouting is an alternative to waiting for a free minimal path and channels are reserved for a short period of time.

However, profitable backtracking algorithms may outperform progressive algorithms when messages are long because they look for alternative minimal paths instead of waiting for a free channel. The main drawback of backtracking algorithms is the overhead produced by the circuits required to keep track of the header history, thus avoiding searching a path twice. Also, note that backtracking algorithms cannot be easily combined with wormhole or VCT switching. They are usually used on networks using some variant of circuit switching. As data do not follow the header immediately, some additional performance degradation is produced.

In summary, the optimal routing algorithm depends on traffic conditions. For uniform or local traffic, deterministic routing algorithms are the best choice. For irregular traffic or hot spots, adaptive routers usually outperform deterministic ones. However, as we will see in Chapter 9, the impact of the routing algorithm on performance is small when compared with other parameters like traffic characteristics.

4.12 Commented References

This chapter aims at describing the most interesting routing algorithms and design methodologies proposed up to now. References to the original work have been included along with the descriptions. Also, some proposals were presented in Chapter 3 while describing deadlock-handling mechanisms. Although the number of routing algorithms proposed in the literature is very high, most of them have some points in common with other proposals. For example, several algorithms proposed in the literature [136, 331] are equivalent or very similar to those obtained by using Methodology 4.1. Other proposals tried to maximize adaptivity, very much like the algorithms proposed in Section 4.5.1. This is the case for the mesh-route algorithm proposed in [42]. Also, some proposals combine previously proposed methodologies in an interesting way. The positive-first, negative-first routing algorithm for 2-D meshes proposed in [344] combines two algorithms from the turn model in a balanced way, trying to distribute traffic more evenly.

A detailed description of every proposed routing algorithm is beyond the scope of this book. The interested reader can refer to [318] for a survey on multistage interconnection networks and to [263] for a brief survey on switch-based networks. Routing algorithms for direct interconnection networks are surveyed in [125, 241, 255]. These surveys mainly focus on distributed routing using wormhole switching. However, source routing and table-lookup routing are also covered in [255]. Additionally, backtracking protocols are covered in [125]. Nevertheless, this chapter covers several proposals that have not been surveyed elsewhere.

Exercises

4.1. Indicate the information stored in the packet header for the paths shown in Figure 4.36 when the routing algorithm is implemented using street-sign routing.

image

Figure 4.36 Node labeling and routing example for street-sign routing on a 2-D mesh.

    Solution To implement the routing algorithm using street-sign routing, the packet header must encode the information indicated in Table 4.5 for the packets destined for nodes 10 and 8.

Table 4.5

Routing actions for the packets destined for nodes 10 (left) and 8 (right).

image

4.2. Indicate the destination interval for each channel in each node to implement XY routing using interval routing on a 4 × 4 mesh.

    Solution It is possible to implement XY routing using interval routing. However, the implementation is somewhat tricky and requires labeling the nodes in a different way. Figure 4.37 shows a different node labeling as well as the paths followed by two packets using XY routing. Table 4.6 shows the destination intervals for all the output channels in the network. The output channel has been indicated in the first column for all the nodes in each row. Empty boxes indicate that packets cannot use that channel (the channel does not exist). Note that intervals do not overlap. Also, the union of the intervals for all the output channels of each node is the whole set of nodes, excluding the current one.

Table 4.6

Routing tables for the output channels in a 4 × 4 mesh.

image

image

Figure 4.37 Node labeling and routing example for interval routing on a 2-D mesh.

4.3. Draw the channel dependency graph for dimension-order routing on a binary 3-cube.

    Solution Figure 4.38 shows the channel dependency graph for dimension-order routing on a 3-cube. This graph assumes that dimensions are crossed in decreasing order. Black circles represent the unidirectional channels and are labeled as cij, where i and j are the source and destination nodes, respectively. As a reference, channels are also represented by thin lines, horizontal and vertical ones corresponding to dimensions 0 and 1, respectively. Also, the nodes of the 3-cube have been labeled as nk, where k is the node number. Arcs represent channel dependencies. It can be seen that the graph is acyclic.

image

Figure 4.38 Channel dependency graph for dimension-order routing on a 3-cube.

4.4. Define a fully adaptive minimal routing algorithm for bidirectional k-ary n-cubes using Methodology 4.1.

    Solution For step 1, it is possible to use the deterministic routing algorithm proposed in Example 4.1. Although this algorithm was proposed for unidirectional k-ary n-cubes, extending it for bidirectional channels is straightforward. Simply, a similar routing algorithm is used for both directions of each ring. As packets only follow minimal paths, there is not any dependency between channels in one direction and channels in the opposite direction. Thus, the routing algorithm is deadlock-free.

    Step 2 adds a new virtual channel to each physical channel (one in each direction). The new virtual channels can be used for fully adaptive minimal routing. The resulting routing algorithm is deadlock-free because the extended channel dependency graph has no cycles (drawing this graph is left as an exercise).

    As a consequence of extending the routing algorithm for bidirectional rings, some virtual channels are never used and can be removed. This optimization was proposed in [24, 136]. However, this optimization requires the use of different routers depending on the position in the network. A more efficient optimization was proposed in [100]. Instead of using the routing algorithm described in Example 4.1 for step 1, it uses the routing algorithm proposed in Example 3.3 for unidirectional rings, extended for bidirectional k-ary n-cubes by using dimension-order routing.

    Again, step 2 adds a new virtual channel to each physical channel (one in each direction) that can be used for fully adaptive minimal routing. In this case, no optimization is required after applying Methodology 4.1 because the resulting routing algorithm uses all the virtual channels.

4.5. Consider the minimal-path west-first routing algorithm. Using the turn model, include as many 180-degree turns as possible and describe the resulting algorithm using pseudocode.

    Solution For each dimension, it is possible to include a single 180-degree turn. In dimension X, the only possible turn is from west to east. In dimension Y, there is a choice. Let us assume that the turn from north to south is allowed. Figure 4.39 shows the pseudocode for the resulting nonminimal west-first routing algorithm. It is assumed that the Select() function uses a static priority to select channels. For the list of parameters of this function, priority decreases from left to right. By doing so, a higher priority is assigned to the channels belonging to minimal paths. Also, note that the routing function requires the use of the current input channel buffer InChannel to select the output channel. This is necessary because a given packet may traverse a node twice.

image

Figure 4.39 The nonminimal west-first routing algorithm for 2-D meshes.

Problems

4.1. The cube-connected cycle [284] can be defined as a binary hypercube of rings. It is an interconnection network based on the binary n-cube. Each node of a binary n-cube is replaced by an n-cycle, and the cube connection in the nth dimension is attached to the nth node in the cycle. Define a deterministic routing algorithm for the cube-connected cycle using the methodology proposed in Section 4.2.

4.2. The shuffle-exchange interconnection network [326] has two outgoing channels per node: a shuffle channel and an exchange channel. The shuffle channel from node ni is connected to node nj where the binary representation of j is the left rotation of the binary representation of i. The exchange channel from node ni is connected to node nk where the binary representations of i and k differ in the least significant bit. Define a deterministic routing algorithm for the shuffle-exchange network using the methodology proposed in Section 4.2.

4.3. Using the turn model, propose minimal-path partially adaptive routing algorithms for 3-D meshes that have as few routing restrictions as possible.

4.4. Apply Methodology 4.1 to design fully adaptive minimal routing algorithms for 2-D meshes, using the north-last, west-first, and negative-first algorithms as the starting point (step 1). Verify whether the resulting algorithms are deadlock-free if wormhole switching is used.

4.5. Prove that the opt-y routing algorithm described in Section 4.5.1 is deadlock-free by drawing the extended channel dependency graph for a suitable routing subfunction and showing the absence of cycles.

    Hint The routing subfunction can be defined by restricting R to use channels X and Y1.

4.6. Show that Algorithm 1 in Section 4.5.2 is deadlock-free by using Theorem 3.1.

    Hint The routing subfunction R1 can be defined as follows: When a packet is in an A queue and it is possible for it to move to the right along at least one dimension, R1 supplies A queues that move the packet to the right. When a packet is in an A queue and it cannot move to the right, R1 supplies the B queue at the current node. When a packet is in a B queue and it is possible for it to move to the left along at least one dimension, R1 supplies B queues that move the packet to the left. When a packet is in a B queue and it cannot move to the left, R1 supplies the C queue at the current node. When a packet is in a C queue and it has not arrived at its destination, R1 supplies C queues that move the packet to the inside. Obviously, when a packet has arrived at its destination, R1 supplies no external channel.

4.7. Show that Algorithm 2 in Section 4.5.2 is deadlock-free by using Theorem 3.1.

    Hint The routing subfunction R1 can be defined as follows: When a packet is in an injection, A, or B queue and it is possible for it to move to the inside along at least one dimension, R1 supplies channels containing B queues. When a packet is in an injection, A, or B queue and it can only move to the outside, R1 supplies channels containing C queues. When a packet is in a C queue and it is possible for it to move to the outside along at least one dimension, R1 supplies channels containing C queues. When a packet is in a C queue and it can only move to the inside, R1 supplies channels containing D queues. When a packet is in a D queue and it has not arrived at its destination, R1 supplies channels containing D queues. Obviously, when a packet is in any queue and it has arrived at its destination, R1 supplies no external channel.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.80.173