CHAPTER 5

Collective Communication Support

Historically, the programmer of a distributed-memory multiprocessor has invoked various system primitives to send messages among processes executing on different nodes, resulting in a message-passing program. In order to simplify the programmer’s task and improve code portability, an alternative approach has been pursued whereby a sophisticated compiler generates data movement operations from shared-memory parallel programs. In order to support a portable and scalable software design across different platforms, the data parallel programming model, which is characterized by executing the same program on different processors with different data, is considered the most promising programming model for multicomputers. Several data parallel languages have been proposed, including Fortran D [113], Vienna Fortran [357], Distributed Fortran 90 [236], Cray MPP Fortran [272], CM Fortran [340], and High Performance Fortran (HPF) [151].

These languages support a variety of global data movement and process control operations. Such operations include replication, reduction, segmented scan, and permutation. Data movement operations are often applied to different dimensions of data arrays. Replication is needed in sending a single datum to many other processors for use in a computation. For example, in Gaussian elimination, a copy of the pivot element must be sent to all the processors to which the pivot row is distributed. Reduction is the opposite of replication and may be used to implement such functions as array summation and to determine the maximum and minimum values of an array. Segmented scan operations are useful for solving linear algebra problems using representations of sparse matrices, such as banded and tridiagonal matrices. Parallel prefix is a special case of segmented scan, which is useful for solving dense linear algebra problems. Permutation involves rearrangement of data for different purposes, such as transposing a matrix, rotating data blocks, and exchanging data in certain dimensions [272]. In addition to data manipulation operations, control operations, such as barrier synchronization and global conditionals, are an essential part of data parallel programming [341, 355].

EXAMPLE 5.1

The following Gaussian elimination example illustrates the use of several data parallel operations. The algorithm is used in solving the equation Ax = b, where A is an n × n matrix and x and b are n-element vectors. For efficient data manipulation, in addition to storing the matrix A in A[1:N, 1:N], the vector b is stored in A[1:N, N+1], and the vector x is stored in A[0, 1:N]. The following code segment, based on a pseudo-high-level data parallel language, demonstrates one approach to solving the problem.

[Code segment not reproduced: a data parallel Gaussian elimination program consisting of statements S1 through S9, discussed below.]

Although the above code is by no means optimal parallel code, it serves to show how the data parallel operations mentioned above are fundamental to parallel programming. The function MAX in statement S2 is a reduction operation, which finds the location of the pivot element. The function EXCHANGE in S3 is a permutation operation, which exchanges the ith column and the column with the pivot element. S4 performs row-wise normalization with respect to the pivot element. Since one operand has a higher dimension than the other operand, a replication operation of the lower-dimension data is implied. Using data dependence analysis, both loops in S5 and S6 can be parallelized. Furthermore, in S7 the replication of A[J, I] and A[I, K] across multiple processors is implied because one of the dimensions is not a function of the loop indices. The ENDDO statements in S8 and S9 imply barrier synchronization when the corresponding DO is parallelized.
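Since the original code segment is not reproduced above, the following serial NumPy sketch (an illustration of ours, not the book's listing) shows how statements S2 through S7 map onto array operations; the comments mark where the reduction, permutation, and replication operations arise.

```python
import numpy as np

def data_parallel_gauss(A, b):
    """Serial NumPy sketch of the elimination phase discussed above
    (an illustration only, not the book's listing)."""
    n = len(b)
    M = np.hstack([A.astype(float), b.reshape(-1, 1)])   # b stored as column N+1
    perm = list(range(n))                                 # record column exchanges
    for i in range(n):
        # S2 -- reduction (MAX): locate the pivot element in row i
        p = i + int(np.argmax(np.abs(M[i, i:n])))
        # S3 -- permutation (EXCHANGE): swap column i with the pivot column
        M[:, [i, p]] = M[:, [p, i]]
        perm[i], perm[p] = perm[p], perm[i]
        # S4 -- normalization; the scalar pivot is replicated across row i
        M[i, i:] /= M[i, i]
        # S5/S6/S7 -- elimination; M[j, i] and row i are conceptually
        # replicated across the processors that own row j
        for j in range(i + 1, n):
            M[j, i:] -= M[j, i] * M[i, i:]
    return M, perm   # upper triangular system plus the column permutation

A = np.array([[2.0, 1.0], [4.0, 3.0]])
b = np.array([3.0, 7.0])
M, perm = data_parallel_gauss(A, b)
```

Back-substitution on M, together with perm to unscramble the unknowns, would complete the solver; the point here is only where the collective operations appear.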

The above operations that involve global data movement and global control are known as collective communication as many processes are collectively involved in performing such operations. As indicated in [114], many scientific applications exhibit the need for such communication patterns, and providing collective communication services can simplify the programming of multicomputers. Details of those frequently used collective communication operations will be described in the next section. This chapter will emphasize efficient support, both hardware and software, for implementing multicast communication, which is essential to the efficient implementation of collective communication operations.

5.1 Collective Communication Services

Collective communication involves a group of processes. In order to simplify programming and allow for efficient implementation, these communicating processes are usually defined within a context called the process group. A unique group ID is associated with each distinct process group. The membership of a process group need not be fixed: during different phases of a computation, new members may be added to the process group, and old members may be removed from it.

Various collective communication services have been identified for processes within a process group. Providing such services can simplify the programming effort and facilitate efficient implementation. Consider a process group, G, with n processes, P1, P2, …, Pn. We assume that all processes will be involved in collective communication, although it is possible that some processes are disabled or masked off from participation. Four basic types of collective communication services within the context of a process group are described below.

5.1.1 Multiple One-to-One Communication

In this category, each process can send at most one message and receive at most one message. If each process has to send exactly one message and receive exactly one message, there are n! different permutation or communication patterns. Figure 5.1(a) shows a circular shift communication pattern in which Pi sends a message to Pi+1 for 1 ≤ i ≤ n − 1 and Pn delivers its message to P1. In some applications, with no need for wraparound, a shift communication pattern can be supported in which P1 only sends a message and Pn only receives a message, as shown in Figure 5.1(b).


Figure 5.1 Two multiple one-to-one communication patterns: (a) circular shift permutation communication and (b) shift communication.

5.1.2 One-to-All Communication

In one-to-all communication, one process is identified as the sender (called the root) and all processes in the group are receivers. There are some variations: in some designs the sender may or may not be a member of the process group, and if it is a member, it may or may not also be a receiver. Here we assume that the sender is a member of the process group and is also a receiver. There are two distinct services in this category:

• Broadcast. The same message is delivered from the sender to all receivers.

• Scatter. The sender delivers different messages to different receivers. This is also referred to as personalized broadcast.

Figure 5.2 shows the communication patterns of these two services.


Figure 5.2 Two one-to-all communication patterns: (a) broadcast communication and (b) scatter communication.

5.1.3 All-to-One Communication

In all-to-one communication, all processes in a process group are senders and one process (called the root) is identified as the sole receiver. Again, there are two distinct services:

• Reduce. Different messages from different senders are combined together to form a single message for the receiver. The combining operator is usually commutative and associative, such as addition, multiplication, maximum, minimum, and the logical OR, AND, and exclusive OR operators. This service is also referred to as personalized combining or global combining.

• Gather. Different messages from different senders are concatenated together for the receiver. The order of concatenation is usually dependent on the ID of the senders.

Figure 5.3 shows the communication patterns of these two services.


Figure 5.3 Two all-to-one communication patterns: (a) reduce communication and (b) gather communication.

5.1.4 All-to-All Communication

In all-to-all communication, all processes in a process group perform their own one-to-all communication. Thus, each process will receive n messages from n different senders in the process group. Again, there are two distinct services:

• All-broadcast. All processes perform their own broadcast. Usually, the received n messages are concatenated together based on the ID of the senders. Thus, all processes have the same set of received messages. This service is also referred to as gossiping or total exchange.

• All-scatter. All processes perform their own scatter. The n concatenated messages are different for different processes. This service is also referred to as personalized all-to-all broadcast, index, or complete exchange.

Figure 5.4 shows the communication patterns of these two services.


Figure 5.4 Two all-to-all communication patterns: (a) all-broadcast communication and (b) all-scatter communication.

5.1.5 Convenient Collective Communication Services

In addition to the four basic types of collective communication services, some collective communication services require the combination of these basic services. Some of these frequently used collective communication services, referred to as convenient or composite collective communication services, are listed below:

• All combining. The result of a reduce operation is available to all processes. This is also referred to as a reduce and spread operation. The result may be broadcast to all processes after the reduce operation, or multiple reduce operations may be performed with each process as a root.

• Barrier synchronization. A synchronization barrier is a logical point in the control flow of an algorithm at which all processes in a process group must arrive before any of the processes in the group are allowed to proceed further. Obviously, barrier synchronization involves a reduce operation followed by a broadcast operation.

• Scan. A scan operation performs a parallel prefix with respect to a commutative and associative combining operator on messages in a process group. Figure 5.5(a) shows a parallel prefix operation in a four-member process group with respect to the associative combining operator f. A scan operation thus involves multiple reduce operations. The reverse (or downward) of parallel prefix is called parallel suffix, as shown in Figure 5.5(b); a short code sketch of both operations follows the figure.


Figure 5.5 Scan communication patterns: (a) parallel prefix communication and (b) parallel suffix communication.
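For reference, the following short sketch (plain Python, an illustration only) computes the values each member of a four-process group would hold after the parallel prefix and parallel suffix operations of Figure 5.5, using addition as the combining operator f.

```python
from operator import add

def parallel_prefix(messages, f=add):
    """Scan: process i ends up with f(m1, m2, ..., mi)."""
    out, acc = [], None
    for m in messages:
        acc = m if acc is None else f(acc, m)
        out.append(acc)
    return out

def parallel_suffix(messages, f=add):
    """Reverse (downward) scan: process i ends up with f(mi, ..., mn)."""
    return parallel_prefix(messages[::-1], f)[::-1]

print(parallel_prefix([3, 1, 4, 1]))   # [3, 4, 8, 9]
print(parallel_suffix([3, 1, 4, 1]))   # [9, 6, 5, 1]
```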

Collective communication services are required by many scientific applications. Such services have been supported by several communication packages for multicomputers. However, the efficient implementation of the various collective communication services is machine dependent. The next section will describe system support for collective communication.
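As a concrete point of reference for how these services appear to the programmer, the sketch below expresses them with the mpi4py binding of MPI (the choice of MPI and mpi4py is ours for illustration; the text does not prescribe a particular package). The communicator plays the role of the process group.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD            # the process group
rank = comm.Get_rank()
size = comm.Get_size()

# One-to-all: broadcast and scatter (root = process 0)
data  = comm.bcast("pivot row" if rank == 0 else None, root=0)
piece = comm.scatter(list(range(size)) if rank == 0 else None, root=0)

# All-to-one: reduce (combining) and gather (concatenation by sender rank)
total     = comm.reduce(rank, op=MPI.SUM, root=0)
collected = comm.gather(rank, root=0)

# All-to-all: all-broadcast (gossiping) and all-scatter (complete exchange)
everyone  = comm.allgather(rank)
exchanged = comm.alltoall([rank] * size)

# Convenient services: all combining, barrier synchronization, scan
combined = comm.allreduce(rank, op=MPI.SUM)
comm.Barrier()
prefix = comm.scan(rank, op=MPI.SUM)   # parallel prefix of the ranks
```

Run, for example, with mpiexec -n 4 python collectives.py (a hypothetical file name); process i then ends up with prefix equal to 0 + 1 + … + i, matching the scan pattern above.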

5.2 System Support for Collective Communication

From the programmer’s perspective, collective communication services are provided in the context of a process group. In a multicomputer, each individual application is usually allocated a subset of processors, called a processor cluster, in order to achieve the best performance (performance may degrade when more processors are allocated because of the increased communication overhead) and to increase system throughput. The scheduling of processors in a multicomputer should carefully consider the trade-off between space sharing and time sharing, which is beyond the scope of this book.

From the viewpoint of the system or the processors, a process group involves only a subset of processors, as shown in Figure 5.6 with two process groups. The four processes of process group 1 are each assigned to different processors, as are the three processes in process group 2. Process 2 from group 1 and process 0 from group 2 share the same processor.


Figure 5.6 Process view versus processor view.

Obviously, as indicated in Figure 5.6, the “all” communication within a process group becomes, from the system’s point of view, “many” communication involving an arbitrary subset of processors. In order to support collective communication efficiently, it is desirable that, at the processor level, the system support “one-to-one” (unicast), “one-to-many,” “many-to-one,” and “many-to-many” communication primitives in hardware. All multiprocessors directly support various forms of unicast communication, such as blocking versus nonblocking, synchronous versus asynchronous, and direct remote memory access. One-to-many communication, mainly in the form of multicast, in which the same message is delivered to all the destinations, has received increasing attention. Some vendors, aware of the importance of collective communication, have facilitated it by implementing broadcast and multicast support in hardware, as in the nCUBE-2, or by implementing barrier synchronization in hardware, as in the Cray T3D. However, in many-to-one and many-to-many communication, the multiple senders may start at different times. Thus, certain combining and buffering mechanisms are required to support such services efficiently. This may be expensive in hardware design.

5.3 Preliminary Considerations

In the remainder of this chapter we will mainly focus on the efficient implementation of multicast, both hardware and software, in parallel computers. We will also describe some hardware approaches designed to support barrier synchronization, reduction, and global combining. Before describing those mechanisms, we present some preliminary considerations.

5.3.1 The Need for Multicast Communication

Figure 5.7 shows a multicast communication pattern with three destinations, where process P0 has to send the same message to three processes: P1, P2, and P3 (here each process is assumed to reside in a separate node).


Figure 5.7 A multicast communication pattern with three destinations.

If multicast communication primitives are not supported, the program structure of the communication pattern in Figure 5.7 is

P0: send (msg, P1); send (msg, P2); send (msg, P3)
P1: recv (msg, P0)
P2: recv (msg, P0)
P3: recv (msg, P0)

As shown above, a multicast communication can be supported by many one-to-one communications. Assume that both send and recv are blocking operations. If P0 is executing send (msg, P1), and P1 has not yet executed the recv statement, P0 is blocked.

Meanwhile, P2 is executing the recv statement and is blocked because P0 has not yet executed the statement send (msg, P2). Obviously, system resources are wasted due to the unnecessary blocking. Because of the nondeterministic and asynchronous properties of multicomputers, there is no way for P0 to predetermine a proper message-passing sequence. However, if P0 executed a single multicast (msg, P1, P2, P3) statement instead of three send statements, P2 would proceed as soon as it executed the recv statement.

Even if operations are not blocking, providing multicast operations may reduce communication latency considerably. In current multicomputers, software overhead accounts for a high percentage of communication latency [214]. Thus, replacing several send operations by a single multicast operation reduces software overhead. Therefore, supporting collective communication operations in software is useful even if those operations are not supported in hardware.

Furthermore, when a node sends the same message toward several destinations, some of these replicated messages may traverse the same communication channels, creating more traffic than needed, as shown in the next example.

EXAMPLE 5.2

Figure 5.8 shows different ways to perform multicast in a 3 × 5 mesh network. Node 1 wants to deliver the same message to nodes 9, 10, 13, and 14. Depending on the path followed, the total traffic in the network has different values. As shown in the figure, traffic can be considerably reduced by supporting multicast in hardware because the message is transmitted only once across each channel in the path.


Figure 5.8 Different ways to perform multicast in a mesh network: (a) minimizing multicast traffic and (b) minimizing the number of hops.

The main problem here is that of determining which path(s) should be used to deliver a message from the source to all its destination(s). Since there are potentially many paths joining pairs of nodes in a parallel computer, different routes can be found depending on the criteria employed.

5.3.2 Evaluation Criteria

Two major routing design parameters are traffic and time. For a given multicast communication, the parameter “traffic” is quantified as the number of channels used to deliver the source message to all its destinations. This parameter takes into account the repeated use of some channels. The parameter “time” is the message communication latency. In an asynchronous multiprocessing environment, time should be considered from the destination node’s point of view because the receiving process can continue its execution as soon as the message is received. It is desirable to develop a routing mechanism that completes communication while minimizing both traffic and time. However, these two parameters are not, in general, independent; achieving the lower bound for one may prevent achieving it for the other. As illustrated in Figure 5.8, if multicast communication is supported, multiple messages traversing a channel can be reduced to one message transmission. Figure 5.8 shows two possible multicast communication patterns. In Figure 5.8(a), the total traffic created is 7; in Figure 5.8(b), it is increased to 8. However, the distance between nodes 1 and 9 is reduced from 6 to 4.
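Both parameters can be computed mechanically once a multicast route is fixed. The small sketch below (node labels and paths are hypothetical, not those of Figure 5.8) counts traffic as the number of distinct channels used and reports the hop count to each destination as a proxy for time under SAF switching.

```python
def traffic(paths):
    """Number of (directed) channels carrying the message; a channel shared
    by several destination paths is counted once."""
    return len({ch for path in paths.values() for ch in zip(path, path[1:])})

def hops(paths):
    """Hop count to each destination (a proxy for time under SAF switching)."""
    return {dest: len(path) - 1 for dest, path in paths.items()}

# Hypothetical multicast: source 1, destinations 9 and 13, sharing a prefix.
route = {9: [1, 2, 3, 8, 9],
         13: [1, 2, 3, 8, 13]}
print(traffic(route))   # 5 channels in total
print(hops(route))      # {9: 4, 13: 4}
```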

The communication latency is dependent on the underlying switching technique. For SAF switching, the communication latency is linearly proportional to the number of channels between two nodes. Thus, the parameter time is usually represented by the number of channels traversed. In this case, minimizing time implies that for each destination node, the message should be delivered through a shortest path to that destination. In wormhole switching, the communication latency is almost independent of the number of hops between two nodes if there is no contention in the channels. However, deadlock-free routing becomes a critical issue.

An important metric used to evaluate a network is its communication latency, which is the sum of three component values: start-up latency, network latency, and blocking time. The start-up latency, Ts, is the time required for message framing/unframing, memory/buffer copying, validation, and so on, at both source and destination nodes. The start-up latency is mainly dependent on the design of system software within the nodes and the interface between nodes and routers. The network latency equals the elapsed time after the head of a message has entered the network at the source until the tail of the message emerges from the network at the destination. Given a source and destination node, the start-up and network latencies are static values, frequently used to characterize contention-free networks. The blocking time includes all possible delays encountered during the lifetime of a message. These delays are mainly due to conflicts over the use of shared resources, for example, a message encountering a busy channel or a full buffer. Blocking time reflects the dynamic behavior of the network due to the passing of multiple messages and may be high if the network traffic is heavy or unevenly distributed.

Multicast latency refers to the elapsed time from when the source sends out its first copy of the message until the last destination has received its copy of the message. Multicast latency can be critical to program speedup because, as in the case of barrier synchronization and data replication, the multicast operation may be performed in the serial component of the parallel algorithm.

5.4 Models for Multicast Communication

We shall use graphs to model the underlying topology of multicomputers. Let graph G(V, E) denote a graph with node set V and edge set E. When G is known from context, the sets V(G) and E(G) will be referred to as V and E, respectively. A path with length n is a sequence of edges e1, e2, …, en such that

1. ei ≠ ej if i ≠ j.

2. ei and ei+1 have a common end node.

3. If ei is not the first or last edge, then it shares one of its end nodes with ei–1 and the other with ei+1.

Suppose ei = (vi, vi+1) for 1 ≤ i ≤ n. In the following discussion, a path with length n will be represented by its node-visiting sequence (v1, v2, …, vn, vn+1). A cycle is a path whose starting and ending nodes are the same (i.e., v1 = vn+1). Furthermore, we assume that every pair of nodes in the path, except v1 and vn+1, are different. A graph is said to be connected if every pair of its nodes are joined by a path. A tree is a connected graph that contains no cycles. A graph F(V, E) is a subgraph of another graph G(V, E) if V(F) ⊆ V(G) and E(F) ⊆ E(G). A subgraph that is a tree is referred to as a subtree. For a pair of nodes u, v in V(G), dG(u, v) denotes the length (the number of edges) of a shortest path from u to v in G.

The interconnection topology of a multicomputer is denoted by a host graph G(V, E), where each vertex in V corresponds to a node and each edge in E corresponds to a communication channel (link). For a multicast communication, let u0 denote the source node and u1, u2, …, uk denote k destination nodes, where k ≥ 1. The set K = {u0, u1, …, uk}, which is a subset of V(G), is called a multicast set. Depending on the underlying communication paradigm and the routing method, the multicast communication problem in a multicomputer can be formulated as different graph-theoretical problems.
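The definitions above translate directly into a small computational model. The sketch below (illustrative only) represents the host graph G as an adjacency list, computes dG(u, v) by breadth-first search, and sets up a multicast set K on a 2 × 3 mesh.

```python
from collections import deque

def d_G(adj, u, v):
    """Length of a shortest path from u to v in the host graph G,
    given as an adjacency list (node -> list of neighbors)."""
    dist, frontier = {u: 0}, deque([u])
    while frontier:
        x = frontier.popleft()
        if x == v:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                frontier.append(y)
    return float("inf")      # unreachable: G is not connected

# A 2 x 3 mesh as host graph and a multicast set K = {u0, u1, u2}.
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
       3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
K = [0, 2, 5]                # source u0 = 0, destinations u1 = 2, u2 = 5
print([d_G(adj, K[0], u) for u in K[1:]])   # [2, 3]
```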

Multicast Path Problem

In some communication mechanisms, replicating an incoming message in order to forward it to multiple neighboring nodes may involve too much overhead and is usually undesirable. Thus, the routing method does not allow a processor to replicate messages passing through it. Also, the multicast path model provides better performance than the tree model (to be described below) when there is contention in the network. From a switching technology point of view, the multicast path model is more suitable for wormhole switching.

The multicast communication problem becomes the problem of finding a shortest path starting from u0 and visiting all k destination nodes. This optimization problem, finding an optimal multicast path (OMP), is formally defined below.

DEFINITION 5.1

A multicast path (v1, v2, …, vn) for a multicast set K in G is a subgraph P(V, E) of G, where V(P) = {v1, v2, …, vn} and E(P) = {(vi, vi+1) : 1 ≤ i ≤ n − 1}, such that v1 = u0 and K ⊆ V(P). An OMP is a multicast path with the shortest total length.

Multicast Cycle Problem

Reliable communication is essential to a message-passing system. Usually, a separate acknowledgment message is sent from every destination node to the source node on receipt of a message. One way to avoid the sending of |K| separate acknowledgment messages is to have the source node itself receive a copy of the message it initiated after all destination nodes have been visited. Acknowledgments are provided in the form of error bits flagged by intermediate nodes when a transmission error is detected. Thus, the multicast communication problem is the problem of finding a shortest cycle, called the optimal multicast cycle (OMC), for K.

DEFINITION 5.2

A multicast cycle (v1, v2, …, vn, v1) for K is a subgraph C(V, E) of G, where V(C) = {v1, v2, …, vn} and E(C) = {(vn, v1)} ∪ {(vi, vi+1) : 1 ≤ i ≤ n − 1}, such that K ⊆ V(C). An OMC is a multicast cycle with the shortest total length.

Steiner Tree Problem

Both OMC and OMP assume that the message will not be replicated by any node during transmission. However, message replication can be implemented by using some hardware approach [197]. If the major concern is to minimize traffic, the multicast problem becomes the well-known Steiner tree problem [121]. Formally, we restate the Steiner tree problem as follows:

DEFINITION 5.3

A Steiner tree, S(V, E), for a multicast set K is a subtree of G such that K ⊆ V(S). A minimal Steiner tree (MST) is a Steiner tree with a minimal total length.

Multicast Tree Problem

In the Steiner tree problem, it is not necessary to use a shortest path from the source to a destination. If the distance between two nodes is not a major factor in the communication time, such as in VCT, wormhole, and circuit switching, the above optimization problem is appropriate. However, if the distance is a major factor in the communication time, such as in SAF switching, then we may like to minimize time first, then traffic. The multicast communication problem is then modeled as an optimal multicast tree (OMT). The OMT problem was originally defined in [195].

DEFINITION 5.4

An OMT, T(V, E), for K is a subtree of G such that (a) K ⊆ V(T), (b) dT(u0, ui) = dG(u0, ui) for 1 ≤ i ≤ k, and (c) |E(T)| is as small as possible.

The above graph optimization problems can be stated as follows:

Given a host graph G, a multicast set K, and an integer l, does there exist an OMP (OMC, MST, OMT) for K with total length less than or equal to l?

Apparently, the complexity of each of the above optimization problems is directly dependent on the underlying host graph. The above graph optimization problems for the popular hypercube and 2-D mesh topologies were studied in [210, 212], showing that the OMC and OMP problems are NP-complete for those topologies. Also, it was shown that the MST and OMT problems are NP-complete for the hypercube topology [60, 135]. The MST problem for the 2-D mesh topology is equivalent to the rectilinear Steiner tree problem, which is NP-complete [120].

The NP-completeness results indicate the necessity to develop heuristic multicast communication algorithms for popular interconnection topologies. Multicast communication may be supported in hardware, software, or both. Sections 5.5 and 5.7 will address hardware and software implementations, respectively, of multicast.
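To make the flavor of such heuristics concrete, the sketch below builds a multicast path by repeatedly extending it, along a shortest path, to the nearest still-unvisited destination. This nearest-destination-first rule is our own illustration of a simple greedy heuristic; it does not guarantee an optimal multicast path and is not one of the algorithms cited above. The host graph is assumed connected.

```python
from collections import deque

def shortest_path(adj, u, v):
    """One shortest path from u to v (BFS over an adjacency-list graph)."""
    parent, frontier = {u: None}, deque([u])
    while frontier:
        x = frontier.popleft()
        if x == v:
            break
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                frontier.append(y)
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]

def greedy_multicast_path(adj, source, destinations):
    """Nearest-destination-first heuristic for a multicast path."""
    route, current, remaining = [source], source, set(destinations)
    while remaining:
        legs = {d: shortest_path(adj, current, d) for d in remaining}
        nxt = min(legs, key=lambda d: len(legs[d]))   # closest destination
        route += legs[nxt][1:]                        # extend the path
        remaining.discard(nxt)
        current = nxt
    return route

adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
       3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
print(greedy_multicast_path(adj, 0, [5, 2]))   # [0, 1, 2, 5]
```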

5.5 Hardware Implementations of Multicast

Hardware support of multicast communication in multicomputers requires increased functionality within the routers. This functionality may include interpretation of multicast addresses (or group ID) and forwarding of messages onto multiple outgoing channels (replication). The result is the capability for one local processor to efficiently send the same message to a specific set of destinations without requiring assistance from any other processor. The approaches used to implement such functionality are highly dependent on the network topology and may affect the design of the switching strategy used in the network. Before studying those approaches, some schemes to encode multiple destination addresses are described in the next section.

5.5.1 Multiaddress Encoding Schemes

The header of multidestination messages must carry the addresses of the destination nodes. The header information is an overhead to the system, increasing message latency and reducing the effective network bandwidth. A good multiaddress encoding scheme should minimize the message header length, also reducing the header processing time.

In wormhole and VCT switching, the routing algorithm is executed before the whole message arrives at the router. As the header may require several flits to encode the destination addresses, it is desirable that the routing decision in each router can be made as soon as possible to reduce message latency. Ideally, a message header should be processed on the fly as header flits arrive. When the number of destination addresses is variable, it is inefficient to use a counter to indicate the number of destinations. Such a counter would have to be placed at the beginning of the message header, and since its value may change at a router when the destination set is split into several subsets, it would prevent message headers from being processed on the fly. An alternative approach is to have an end-of-header (EOH) flit to indicate the end of a header. Another approach consists of using 1 bit in each flit to distinguish between header and data flits.

Figure 5.9 shows five different encoding schemes, namely, all-destination encoding, bit string encoding, multiple-region broadcast encoding, multiple-region stride encoding, and multiple-region bit string encoding. These schemes were proposed in [54].


Figure 5.9 Multiaddress encoding schemes: (a) all-destination encoding, (b) bit string encoding, (c) multiple-region broadcast encoding, (d) multiple-region stride encoding, and (e) multiple-region bit string encoding.

The all-destination encoding is a simple scheme in which all destination addresses are carried by the header. This encoding scheme has two important advantages. First, the same routing hardware used for unicast messages can be used for multidestination messages. Second, the message header can be processed on the fly as address flits arrive. This scheme is good for a small number of addresses because the header length is proportional to the number of addresses. However, it produces a significant overhead when the number of destinations is large.

One way to limit the size of the header is to encode the destination set as a bit string in which each bit corresponds to one node in the range from node b to node e, as shown in Figure 5.9(b). Since the length of the string in a system is predefined, the EOH field is not required. This encoding scheme is good when the average number of destinations is large. However, it is inefficient when the system is large and the number of destinations is small. The main drawback of the bit string encoding scheme is that a router usually has to buffer the entire bit string in order to make the routing decision and to generate the output bit string(s). Additionally, address decoding cannot be done with the same routing hardware as for unicast messages. Finally, the length of the string usually depends on network size, limiting the scalability of the system.

The remaining encoding schemes try to optimize the header length by considering ranges of addresses or regions. In Figure 5.9(c), each region is specified by two fields: the beginning and ending addresses of the region. Within each region, the message is broadcast to all the addresses in the range. In some applications, a node may send a message to a set of destination addresses that have a constant distance between two adjacent addresses. A suitable encoding scheme for those applications consists of adding a stride to the definition of each region, as shown in Figure 5.9(d). Finally, if the destination addresses are irregularly distributed but can be grouped into regions, each region can be specified by a bit string, in addition to the beginning and ending addresses (see Figure 5.9(e)). The main drawback of encoding schemes based on regions is that the routing hardware required to decode addresses is complex. Also, several header flits may be required to encode each region. Those flits should reach the router before starting a routing operation.
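The trade-off between the first two schemes is easy to quantify. The sketch below (flit representations are hypothetical) builds an all-destination header, one address flit per destination plus an EOH flit, and a fixed-length bit string header for a 64-node system.

```python
def all_destination_header(dests):
    """All-destination encoding: one address flit per destination plus EOH."""
    return [("addr", d) for d in sorted(dests)] + [("EOH", None)]

def bit_string_header(dests, num_nodes):
    """Bit string encoding: one bit per node; no EOH flit is needed because
    the string length is fixed by the network size."""
    bits = 0
    for d in dests:
        bits |= 1 << d
    return format(bits, f"0{num_nodes}b")[::-1]   # character i is '1' if node i is a destination

dests = {3, 7, 21, 22, 60}
print(len(all_destination_header(dests)))   # 6 flits: grows with the destination count
print(len(bit_string_header(dests, 64)))    # 64 bits: fixed by the system size
```

With 5 destinations the all-destination header is the shorter of the two; with, say, 40 destinations the 64-bit string clearly wins, which is the trade-off described above.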

5.5.2 Tree-Based Multicast Routing

One approach to multicast routing is to deliver the message along a common path as far as possible, then replicate the message and forward each copy on a different channel bound for a unique set of destination nodes. The path followed by each copy may further branch in this manner until the message is delivered to every destination node. In such tree-based routing, the destination set is partitioned at the source, and separate copies are sent on one or more outgoing links. A message may be replicated at intermediate nodes and forwarded along multiple outgoing links toward disjoint subsets of destinations. The destination nodes can be either leaf nodes or intermediate nodes in the tree. In this approach, each node in the network should be able to replicate messages by sending copies out through different output channels. Message replication can be easily done by software in networks using SAF switching. This has been the traditional approach to support collective communication. Hardware replication of messages is much more complex, being suitable for networks using VCT or wormhole switching.

In order to replicate messages, the routing hardware requires several changes with respect to the router model described in Section 2.1. We assume that each destination address is encoded in a different flit, and that each message contains a set of destination address flits followed by data flits. Also, we assume that a single bit is used in each flit to distinguish between destination addresses and data flits. All the address flits are routed at each intermediate node, either reserving a new output channel or being transmitted through a previously reserved channel. Some status registers are required at the routing and arbitration unit to keep track of all the output channels reserved by a message arriving on a given input channel. Once all the address flits have been routed, incoming data flits are simultaneously forwarded to all the output channels previously reserved by the address flits. Note that if the switch is a crossbar and it is implemented as a set of multiplexors (one for each switch output), it is possible to select the same input from several outputs. Therefore, the switch can be easily configured so that flits are simultaneously forwarded to several output channels from the same input channel. However, flow control is more complex than in unicast routing. Flit acknowledgment signals returned across all the output channels reserved by a message must be ANDed because flits can only be forwarded when all the routers that should receive them have confirmed buffer availability. Finally, the delivery buffer at each router should be designed in such a way that flits containing destination addresses are automatically discarded.

Although this section focuses on hardware support of multicast communication, we first describe a simple algorithm for broadcasting in hypercubes. This algorithm can be implemented in either hardware or software. It will serve to illustrate some drawbacks of tree-based multicast routing.

Broadcast Tree for Hypercube

Consider an n-cube topology. An original version of the following broadcast algorithm, which produces a spanning binomial tree based on the concept of recursive doubling, was proposed by Sullivan and Bashkow [334]. Each node in the system will receive the broadcast message exactly once, within n time steps. Let s be the address of the source node and v be the node receiving a broadcast message. The broadcast algorithm is listed in Figure 5.10.


Figure 5.10 The broadcast algorithm for hypercubes.

The function FirstOne(v) indicates the position of the least significant 1 in an n-bit binary number v. If vk = 1 and vj = 0 for all 0 ≤ j ≤ k − 1, then FirstOne(v) = k. If v = 0, then FirstOne(v) = n.
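The listing of Figure 5.10 is not reproduced here. The following simulation sketch is our reading of the description above: every node v holding the message forwards it along dimensions FirstOne(v ⊕ s) − 1 down to 0, one dimension per time step, so that the whole n-cube is covered in n steps.

```python
def first_one(v, n):
    """Position of the least significant 1 in the n-bit number v; n if v == 0."""
    for k in range(n):
        if (v >> k) & 1:
            return k
    return n

def binomial_broadcast(n, s):
    """Simulate the spanning binomial tree broadcast rooted at node s."""
    sends = []                       # (step, sender, receiver)
    holders = {s: 0}                 # node -> step at which it received the message
    for step in range(1, n + 1):
        for v, t in list(holders.items()):
            k = first_one(v ^ s, n)
            dim = k - (step - t)     # next lower dimension to cover this step
            if dim >= 0:
                u = v ^ (1 << dim)
                sends.append((step, v, u))
                holders[u] = step
    return sends

# Broadcast from node 0000 in a 4-cube: 15 sends, done after 4 steps.
log = binomial_broadcast(4, 0b0000)
print(len(log), max(step for step, _, _ in log))   # 15 4
```

The tree this produces for a 4-cube rooted at node 0000 should match the bracketed time steps of Figure 5.11.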

EXAMPLE 5.3

Figure 5.11 shows a broadcast tree rooted at node 0000 in a 4-cube. The numbers between square brackets indicate the time step for the corresponding message transmission.


Figure 5.11 Broadcast tree in a 4-cube rooted at node 0000.

Deadlock in Tree-Based Multicast Wormhole Switching

The spanning binomial tree is suitable for networks supporting SAF or VCT switching techniques. When combined with wormhole switching, tree-based multicast routing suffers from several drawbacks. Since there is no message buffering at routers, if one branch of the tree is blocked, then all are blocked (Figure 5.12). For the sake of clarity, input and output buffers in the figure have capacity for a single flit. Flit A at node N5 cannot advance because the next requested channel is busy. As a consequence, flit C cannot be replicated and forwarded to node N5. Blockage of any branch of the tree can prevent delivery of the message even to those destination nodes to which paths have been successfully established. For example, flits A and B at nodes N3 and N2 can advance and reach the destination node N4. However, flits at node N1 cannot advance. Tree-based multicast routing may cause a message to hold many channels for extended periods, thereby increasing network contention. Moreover, deadlock can occur using such a routing scheme. The nCUBE-2 [248], a wormhole-switched hypercube, uses a spanning binomial tree approach to support broadcast and a restricted form of multicast in which the destinations form a subcube.


Figure 5.12 Message blocking in tree-based multicast routing.

Figure 5.13 shows a deadlocked configuration on a 3-cube. Suppose that nodes 000 and 001 simultaneously attempt to transmit broadcast messages A and B, respectively. The broadcast message A originating at host 000 has acquired channels [000, 001], [000, 010], and [000, 100]. The header flit has been duplicated at router 001 and is waiting on channels [001, 011] and [001, 101]. The B broadcast has already acquired channels [001, 011], [001, 101], and [001, 000] but is waiting on channels [000, 010] and [000, 100]. The two broadcasts will block forever.


Figure 5.13 Deadlock in a 3-cube.

In a similar manner, one may attempt to extend deadlock-free unicast routing on a 2-D mesh to encompass multicast. An extension of the XY routing method to include multicast is shown in Figure 5.14, in which the message is delivered to each destination in the manner described. As in the hypercube example, the progress of the tree requires that all branches be unblocked. For example, suppose that the header flit in Figure 5.14 is blocked due to the busy channel [(4, 2), (4, 3)]. Node (4, 2) cannot buffer the entire message. As a result of this constraint, progress of the entire routing tree must stop. In turn, other messages requiring segments of this tree are also blocked. Network congestion may increase, thereby degrading the performance of the network. Moreover, this routing algorithm can lead to deadlock. Figure 5.15 shows a deadlocked configuration with two multicasts, A and B.


Figure 5.14 An XY multicast routing pattern.


Figure 5.15 A deadlock situation in a 3 × 4 mesh: (a) two multicasts in deadlock; (b) detailed representation.

In Figure 5.15(a), A has acquired channels [(1, 1), (0, 1)] and [(1, 1), (2, 1)], and requests channel [(2, 1), (3, 1)], while B has acquired channels [(2, 1), (3, 1)] and [(2, 1), (1, 1)], and requests channel [(1, 1), (0, 1)]. Figure 5.15(b) shows the details of the situation for the nodes (0, 1), (1, 1), (2, 1), and (3, 1). Because neither node (1, 1) nor node (2, 1) buffers the blocked message, channels [(1, 1), (0, 1)] and [(2, 1), (3, 1)] cannot be released. Deadlock has occurred.

The next section presents a multicast wormhole routing algorithm based on the concept of network partitioning, which is equivalent to the concept of virtual networks presented in Section 4.4.3. A multicast operation is implemented as several submulticasts, each destined for a proper subset of the destinations and each routed in a different subnetwork. Because the subnetworks are disjoint and acyclic, no cyclic resource dependency can exist. Thus, the multicast routing algorithms are deadlock-free.

Double-Channel XY Multicast Wormhole Routing

The following double-channel XY multicast routing algorithm for wormhole switching was proposed by Lin, McKinley, and Ni [206] for 2-D meshes.

The algorithm uses an extension of the XY routing algorithm, which was shown above to be susceptible to deadlock. In order to avoid cyclic channel dependencies, each channel in the 2-D mesh is doubled, and the network is partitioned into four subnetworks, N+X,+Y, N+X,–Y, N–X,+Y, and N–X,–Y, as in Section 4.4.3. Subnetwork N+X,+Y contains the unidirectional channels with addresses [(i, j), (i + 1, j)] and [(i, j), (i, j + 1)], subnetwork N+X,–Y contains channels with addresses [(i, j), (i + 1, j)] and [(i, j), (i, j − 1)], and so on. Figure 5.16 shows the partitioning of a 3 × 4 mesh into the four subnetworks.


Figure 5.16 Network partitioning for 3 × 4 mesh.

For a given multicast, the destination node set D is divided into at most four subsets, D+X,+Y, D+X,–Y, D–X,+Y, and D–X,–Y, according to the relative positions of the destination nodes and the source node u0. Set D+X,+Y contains the destination nodes to the upper right of u0, D+X,–Y contains the destinations to the lower right of u0, and so on.

The multicast is thus partitioned into at most four submulticasts from u0 to each of D+X,+Y, D+X,–Y, D–X,+Y, and D–X,–Y. The submulticast to D+X,+Y will be implemented in subnetwork N+X,+Y using XY routing, the one to D+X,–Y in subnetwork N+X,–Y, and so on. The message routing algorithm is given in Figure 5.17. Example 5.4 illustrates the operation of the algorithm.


Figure 5.17 Double-channel XY routing algorithm.
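The listing of Figure 5.17 is likewise not reproduced. The sketch below shows only the destination-set partitioning step, with node addresses as (x, y) pairs; how destinations that share a coordinate with the source are assigned is an assumption on our part, chosen so that the output reproduces the subsets of Example 5.4.

```python
def partition_destinations(u0, D):
    """Split the destination set D into the four quadrant subsets used by
    the double-channel XY scheme (a sketch, not the listing of Figure 5.17)."""
    x0, y0 = u0
    subsets = {"+X,+Y": [], "+X,-Y": [], "-X,+Y": [], "-X,-Y": []}
    for x, y in D:
        half_x = "+X" if x >= x0 else "-X"   # tie on x goes to +X (assumption)
        half_y = "+Y" if y > y0 else "-Y"    # tie on y goes to -Y (assumption)
        subsets[half_x + "," + half_y].append((x, y))
    return subsets

# Source (3, 2) and the destination set of Example 5.4:
D = [(0, 0), (0, 2), (0, 5), (1, 3), (4, 5), (5, 0), (5, 1), (5, 3), (5, 4)]
for quadrant, nodes in partition_destinations((3, 2), D).items():
    print(quadrant, nodes)
```

Each nonempty subset then becomes a submulticast routed with XY routing inside the corresponding subnetwork.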

EXAMPLE 5.4

Consider the 6 × 6 mesh given in Figure 5.14. At the source node (3, 2), the destination set

D = {(0, 0), (0, 2), (0, 5), (1, 3), (4, 5), (5, 0), (5, 1), (5, 3), (5, 4)}

is divided into four subsets:

D+X,+Y = {(4, 5), (5, 3), (5, 4)}

D+X,–Y = {(5, 0), (5, 1)}

D–X,+Y = {(0, 5), (1, 3)}

D–X,–Y = {(0, 0), (0, 2)}

The message will be sent to the destinations in D+X,+Y through subnetwork N+X,+Y, to the nodes in D+X,–Y using subnetwork N+X,–Y, to the nodes in D–X,+Y using subnetwork N–X,+Y, and to the nodes in D–X,–Y using subnetwork N–X,–Y, respectively. The routing pattern is shown with solid arrows in Figure 5.18.


Figure 5.18 The routing pattern of the double-channel XY multicast tree.

While this multicast tree approach avoids deadlock, a major disadvantage is the need for double channels. It may be possible to implement double channels with virtual channels; however, the signaling for multicast communication is more complex. Moreover, the number of subnetworks grows exponentially with the number of dimensions of the mesh, increasing the number of channels between every pair of nodes accordingly.

Tree-Based Multicast with Pruning

Tree-based multicast routing is more suitable for SAF or VCT switching than for wormhole switching. The reason is that when a branch of the tree is blocked, the remaining branches cannot advance if wormhole switching is used. There is a special case in which tree-based multicast routing is suitable for networks using wormhole switching: the implementation of invalidation or update commands in DSMs with coherent caches [224]. Taking into account the growing interest in these machines, it is worth studying this special case.

Message data only require a few flits. Typically, a single 32-bit flit containing a memory address is enough for invalidation commands. The command itself can be encoded in the first flit together with the destination node address. An update command usually requires one or two additional flits to carry the value of the word to be updated. Hence, it is possible to design compact hardware routers with buffers deep enough to store a whole message. However, when multicast routing is considered, the message header must encode the destination addresses. As a consequence, message size could be several times the data size. Moreover, messages have a very different size depending on the number of destinations, therefore preventing the use of fixed-size hardware buffers to store a whole message. A possible solution consists of encoding destination addresses as a bit string (see Section 5.5.1) while limiting the number of destination nodes that can be reached by each message to a small value. This approach will be studied in Section 5.6 and applied to the implementation of barrier synchronization and reduction.

An alternative approach consists of using a fixed-size input buffer for flit pipelining (as in wormhole switching) and a small fixed-size data buffer to store the whole message data (as in VCT switching, except that destination addresses are not stored) [224]. This approach combines the advantages of wormhole switching and VCT switching. Its most important advantage is that a fixed and small buffer size is required for every channel, regardless of the number of destinations of the message. As a consequence, routers are compact and fast. As buffers are small, flits containing the destination addresses cannot be buffered at a single node, usually spanning several nodes. The main limitation of this mechanism is that it only works when message data can be completely buffered at each node.

In order to allow data flits to be buffered at a given node even when the next channel requested by the message is busy, a special message format is used. Figure 5.19 shows the format of a multicast message. The first flit of each message, d1, encodes the first destination address. It is followed by the data flits of the message and the remaining destination addresses, d2, d3, …, dn. As we will see, this message format will be kept while the message advances and produces new branches.


Figure 5.19 Message format for tree-based multicast routing.

The architecture of a router supporting the message format described above is very similar to the one for tree-based multicast routing. However, it requires a few changes. When the first flit of a message reaches a node, it is routed as usual, reserving the requested output channel if it is free. The next few flits contain the message data. As these flits arrive, they are copied to the data buffer of the input channel and transferred across the switch. The remaining flits, if any, contain additional destination addresses. These flits are routed in the same way as the first flit of the message. If the output channel requested by a destination address flit di was already reserved by a previously routed flit dj, 1 ≤ j < i, then di is simply forwarded across the switch. Otherwise, di reserves a new output channel. In this case, di is forwarded across the switch, immediately followed by the message data flits previously stored in the data buffer. By doing so, the message forwarded across the new branch conforms to the format indicated in Figure 5.19. While data flits are being appended to the new branch of the message, incoming address flits are stored in the input buffer. Note that data only require a few flits. Also, routing each additional destination address flit di can be done in parallel with the transmission of di–1 across the switch.

Tree-based multicast routing is susceptible to deadlock. However, when message data only require a few flits, deadlock can be recovered from by pruning branches from the tree when a branch is blocked [224]. This mechanism relies on the use of data buffers to hold the message data, as described above. As shown in Figure 5.12, when a channel is busy, flit buffers are filled and flow control propagates backward, reaching the source node nb of the branch. When this occurs, flits cannot be forwarded across any branch of the tree starting at nb. The pruning mechanism consists of pruning all the branches of the tree at nb except the one that was blocked. Pruning is only performed at the source node of the blocked branch. However, if flow control continues stopping flits at other nodes, pruning is also performed at those nodes.

Note that when using the described pruning mechanism, each pruned branch contains a message according to the format described in Figure 5.19. In particular, that message carries one or more destination address flits and all the data flits. Also, after pruning a branch it is possible to resume message transmission toward the remaining destination nodes because the data buffer contains a copy of all the data flits of the message.

The basic idea behind the pruning mechanism is that, once a tree has been completely pruned, the corresponding multicast message becomes a set of multiple unicast messages. Assuming that the base routing function for unicast routing is deadlock-free, multicast routing is also deadlock-free. However, not all the branches of a blocked tree need to be pruned. A deadlocked configuration requires a set of messages, each one waiting for a resource held by another message in the set. Therefore, only branches that cannot forward flits need to be pruned. Note that a branch may be unable to forward flits because another branch was blocked (see Figure 5.12). Moreover, it is useless to prune a blocked branch because flits will not be able to advance. As a result, pruning is only performed on branches that cannot forward flits but will be able to forward them as soon as they are pruned. By doing so, pruned branches will advance, freeing resources that may be needed by blocked branches of other multicast messages.

EXAMPLE 5.5

Figure 5.20 shows an example of deadlock recovery by pruning branches in tree-based multicast routing. The routing algorithm is XY routing. As can be seen, message A cannot proceed because channel [9, 5] was previously reserved by message B. Similarly, message B cannot proceed because channel [8, 4] was previously reserved by message A. When node 9 routes the flit of message A containing the address of node 1, it finds that there is no free output channel for it. Then, a pruning of all the other branches of message A is performed at this node. In this case, the branch destined for node 4 is pruned, so that it can freely advance toward its destination. Such a pruning will release channels [9, 8] and [8, 4], which are blocking message B. Then, message B will advance, eventually releasing channel [9, 5], which is requested by message A. As a result, the deadlock has been recovered from. In the event that flow control stopped the advance of flits at node 13, message B would also prune its branch destined for node 5. This pruning is redundant, but it shows that nodes performing a pruning do not need to synchronize.


Figure 5.20 Deadlock recovery by pruning branches in tree-based multicast routing.

Note that this example of deadlock recovery only serves to illustrate the pruning mechanism. It would not produce pruning in a real network unless the network were larger and each message were destined for several additional nodes. The reason is that pruning would only take place at node 9 if there were more destination address flits in message A after the one destined for node 1. Similarly, pruning could only take place at node 13 if message B contained enough destination address flits to fill the buffers at nodes 8 and 12.

Tree-Based Multicast on Multistage Networks

The most natural way of implementing multicast/broadcast routing in multistage interconnection networks (MINs) is by using tree-based routing. Multicast can be implemented in a single pass through the network by simultaneously forwarding flits to several outputs at some switches. Broadcast can be implemented in a single pass by simultaneously forwarding flits to all the outputs at all the switches traversed by a message.

The replication of messages at some switches can be synchronous or asynchronous. In synchronous replication, the branches of a multidestination message can only forward if all the requested output channels are available. Hence, at a given time, all the message headers are at different switches of the same stage of the network. Synchronous replication requires a complex hardware signaling mechanism, usually slowing down flit propagation. In asynchronous replication, each branch can forward independently without coordinating with other branches. As a consequence, hardware design is much simpler. However, bubbles may arise when some requested output channels are not available. The reason is that branches finding a free output channel are able to advance while branches finding a busy output channel are blocked. A similar situation for direct networks was shown in Figure 5.12.

The propagation of multidestination messages may easily lead to deadlock, regardless of whether synchronous or asynchronous message replication is performed, as shown in the following example.

EXAMPLE 5.6

Figure 5.21(a) shows a deadlocked configuration in a 16-node butterfly MIN using asynchronous message replication. Figure 5.21(b) shows the same deadlocked configuration when messages are replicated synchronously. In each part of the figure, two multidestination messages A and B, each destined for nodes 1000 and 1111, are sent by nodes 1010 and 1110, respectively (see Example 4.7 for how these messages are routed). In Figure 5.21(a), the upper branch of message A successfully reserved the required output channel at stage G2, proceeding toward node 1000. However, the upper branch of message B is blocked at stage G2, requesting the same channel previously reserved by message A. On the other hand, the lower branch of message A is blocked at stage G2, requesting a channel previously reserved by the lower branch of message B. The situation in Figure 5.21(b) is identical, except that the branches that successfully reserved the required output channels do not forward because all the branches of each tree must forward synchronously. Note that all the branches reached stage G2 at the same time and are routed simultaneously. However, the situation depicted in Figure 5.21(b) may occur depending on the priority of input channels at each switch.


Figure 5.21 Deadlock in a 16-node butterfly MIN involving multidestination messages: (a) with asynchronous message replication; (b) with synchronous message replication.

The tree-based multicast mechanism with pruning described in the previous subsection and the associated message format also work on MINs. Indeed, that mechanism was first developed for MINs, and then adapted to direct networks. The behavior of the pruning mechanism is more intuitive when applied to a MIN, as can be seen in Example 5.7. Note that pruning a single branch from each deadlocked configuration is enough to recover from deadlock. However, switches would have to synchronize. Additionally, recovery from deadlock is faster if all the switches detecting blocked branches perform a pruning operation.

EXAMPLE 5.7

Consider the deadlocked configuration shown in Figure 5.21(a). Flow control propagates backward, stopping flit advance for the blocked branches at stages G2, G1, and G0. As a consequence, the branch of message A that is not blocked is pruned at stage G0. Similarly, the branch of message B that is not blocked is also pruned at stage G0. The pruned branches advance, as shown in Figure 5.22(a). Once the pruned branches have released the output channels they were occupying at stage G2, the blocked branches are able to proceed reserving channels, as shown in Figure 5.22(b). Note that in this example, the blocked branches have the same destinations as the pruned branches.


Figure 5.22 Pruning branches to recover from deadlock in a 16-node butterfly MIN: (a) messages A and B detected at stage G0 that one of their branches is blocked, pruning the branch that is not blocked; (b) the pruned branches have released the channels they were occupying, and the blocked branches are able to proceed.

Alternative approaches have been proposed in [55, 322]. In [55], a scheme with synchronous replication of multidestination messages that uses a deadlock avoidance scheme at each intermediate switch has been proposed. This scheme is the only one up to now that has been developed for MINs using pure wormhole switching. However, this scheme is complex and requires considerable hardware support. In [322], deadlock is simply avoided by using VCT switching. However, this approach requires the use of buffers deep enough to store whole messages, including destination addresses. Therefore, this approach requires more buffering capacity than the pruning mechanism described in the previous subsection.

In order to alleviate the overhead of encoding many destination addresses, a multiport encoding mechanism has been proposed in [322]. In a MIN with n stages and k × k switches, this encoding requires n k-bit strings, one for each stage. Each k-bit string indicates the output ports to which the message must be forwarded at the corresponding switch. A 1 in the ith position of the k-bit string denotes that the message should be forwarded to output port i. Figure 5.23(b) shows an example of a message format using multiport encoding. Figure 5.23(a) shows the path followed by this message when it is injected at node 6.


Figure 5.23 Multicast on multistage networks using multiport encoding: (a) path followed by a multidestination message; (b) message format using multiport encoding.

The main drawback of this encoding scheme is that the same jth k-bit string must be used at all the switches in stage j that route the message. For example, in Figure 5.23(a), all the switches in stage G2 forward the message to the lower output of the switch. As a consequence, several passes through the network are usually required to deliver a message to a given destination set. However, the number of passes decreases beyond a certain number of destinations [322]. In particular, broadcast requires a single pass. So, this encoding scheme and the multicast mechanism proposed in [322] are more appropriate when the number of destinations is very high. For a small number of destinations, the message format and the pruning mechanism described in the previous subsection usually perform better, also requiring smaller buffers. In any case, the main limitation of both techniques is that message length is limited by the buffer capacity.
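A small sketch of the multiport encoding follows. It assumes a k-ary n-fly in which the switch output used at stage j is selected by the jth most significant radix-k digit of the destination address; that digit ordering, and the node numbering, are assumptions for illustration rather than details taken from [322].

```python
def multiport_header(destinations, n, k=2):
    """One k-bit string per stage; bit i is set if the message must leave
    the switch through output port i at that stage."""
    header = []
    for stage in range(n):
        ports = 0
        for d in destinations:
            digit = (d // k ** (n - 1 - stage)) % k   # digit consumed at this stage
            ports |= 1 << digit
        header.append(format(ports, f"0{k}b"))
    return header

# 16-node network, four stages of 2 x 2 switches:
print(multiport_header([8, 9], n=4))    # ['10', '01', '01', '11'] -> reaches exactly {8, 9}
print(multiport_header([1, 14], n=4))   # ['11', '11', '11', '11'] -> would flood all 16 nodes
```

With destinations 1 and 14 every stage string is '11', so a single pass would deliver the message to the whole network; restricting delivery to just those two nodes would take more than one pass, which is the drawback noted above.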

5.5.3 Path-Based Multicast Communication

To support deadlock-free multicast or broadcast wormhole routing, the tree-based communication pattern does not perform well unless messages are very short, because the entire tree is blocked if any of its branches is blocked. A solution is to prohibit branching at intermediate nodes, leading to a multicast path pattern. To reduce the length of the multicast path, the destination node set can be divided into several disjoint subsets, sending a copy of the source message across several separate multicast paths, one path for each subset of the destination nodes. This multidestination routing scheme will be referred to as path-based routing.

In path-based routing, the header of each copy of a message consists of multiple destinations. The source node arranges these destinations as an ordered list, depending on their intended order of traversal. As soon as the message is injected into the network, it is routed based on the address in the leading header flit corresponding to the first destination. Once the message header reaches the router of the first destination node, the flit containing this address is removed by the router. Now the message is routed to the node whose address is contained in the next header flit. This address corresponds to the second destination in the ordered list. While the flits are being forwarded by the router of the first destination node to its adjacent router, they are also copied flit by flit to the delivery buffer of this node. This process is carried out in each intermediate destination node of the ordered list. When the message reaches the last destination node, it is not routed any further and is completely consumed by that node.
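
The consumption process can be summarized by the following behavioral sketch (a software model of the header handling, not of the router hardware): the header is an ordered list of address flits, and each destination strips its own address, copies the payload to its delivery buffer, and forwards whatever remains.

    def path_based_delivery(payload, ordered_dests):
        """Model of path-based consumption along a single multicast path."""
        delivered_at = []
        header = list(ordered_dests)       # one address flit per destination
        while header:
            current = header.pop(0)        # this router discards its own address flit
            delivered_at.append(current)   # payload copied flit by flit to the delivery buffer
            # if header is not empty, the same flits keep advancing toward header[0]
        return delivered_at                # the last destination consumes the message

    print(path_based_delivery("data", [5, 9, 14]))   # [5, 9, 14]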

The routing hardware requires a few changes with respect to the router model described in Section 2.1 to support path-based routing. We assume that each destination address is encoded in a different flit. Also, we assume that a single bit is used in each flit to distinguish between destination addresses and data flits. Some control logic is required to discard the current destination address and transmit the next destination flit to the routing and arbitration unit. Sending a single clock pulse to the corresponding input channel buffer so that flits advance one position is enough to discard the previous destination address. Then, the routing and arbitration unit is reset so that it looks for a destination address flit at the header of a buffer and starts a routing operation again. Also, note that if the switch is a crossbar and it is implemented as a set of multiplexors (one for each switch output), it is possible to select the same input from several outputs. Therefore, the switch can be easily configured so that flits are simultaneously forwarded to the next router and copied to the delivery buffer of the current node. Finally, the delivery buffer should be designed in such a way that flits containing destination addresses are automatically discarded and no flow control is required. This is to avoid conflicts with flow control signals from the next node when messages are simultaneously forwarded to the next router and copied to the delivery buffer.

A simple analysis shows that the probability that a message is blocked in path-based routing is lower than that for tree-based routing. Suppose that the probability that a message is blocked in each channel is p. Assume that at a certain level of the tree-based routing the total number of branches is k. The tree requires that all k channels be available at the same time. The probability that a message is blocked at this level would be 1 − (1 − p)^k. On the other hand, the probability of blocking for path-based routing is p since each multicast path only requests one channel at one moment. An important property of path-based routing is that, when one path is blocked, it will not block the message delivery on the other paths. For example, consider a 6-cube and suppose that p is 0.1. The second level of the broadcast routing tree has 6 branches. The probability that a message is blocked at this level is 1 − (1 − 0.1)^6, which is 0.47, while the probability is only 0.1 for path-based routing.
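
The figures quoted above can be reproduced directly, as the following snippet shows for the 6-cube example.

    p = 0.1                            # per-channel blocking probability
    k = 6                              # branches at the second level of the broadcast tree
    tree_block = 1 - (1 - p) ** k      # all k channels must be free at the same time
    path_block = p                     # a path requests only one channel at a time
    print(round(tree_block, 2), path_block)   # 0.47 0.1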

On the other hand, path-based routing requires establishing an ordered list of destination addresses for each copy of a message. However, tree-based routing does not require any ordering among destinations. In many cases, the ordered destination list(s) can be computed at compile time. In some other cases, the compiler does not help because messages are dynamically generated by hardware. For example, messages are generated by the cache controller in DSMs with coherent caches. In this case, a clever organization of the cache directory may considerably reduce the time required to prepare the ordered destination list [69]. Finally, when the list of destinations has to be explicitly ordered for each multidestination message, path-based routing should be preferred over tree-based routing only if the additional cost of computing the ordered destination list(s) is compensated by a reduction in message latency.

In order to reduce the number of channels used for a given multicast, the subpath between the source and one of the destinations in a multicast path is not necessarily a shortest path. Therefore, in path-based routing the average distance between the source and each destination is generally longer than that of tree-based routing. Note that in wormhole switching the network latency is almost independent of the length of a path. Moreover, path-based routing does not require replicating messages at each intermediate node, implying a less complex router.

Routing Function Based on Hamiltonian Paths

A suitable network partitioning strategy for path-based routing is based on Hamiltonian paths. A Hamiltonian path visits every node in a graph exactly once [146]; a 2-D mesh has many Hamiltonian paths. Each node u in the network is assigned a label l(u). In a network with N nodes, the assignment of the label to a node is based on the position of that node in a Hamiltonian path, where the first node in the path is labeled 0 and the last node in the path is labeled N − 1. Figure 5.24(a) shows a possible labeling in a 4 × 3 mesh, in which each node is represented by its integer coordinates (x, y). The labeling effectively divides the network into two subnetworks. The high-channel subnetwork contains all of the channels whose direction is from lower-labeled nodes to higher-labeled nodes, and the low-channel subnetwork contains all of the channels whose direction is from higher-labeled nodes to lower-labeled nodes.

image

Figure 5.24 The labeling of a 4 × 3 mesh: (a) physical network, (b) high-channel network, and (c) low-channel network.

Unicast as well as multicast communication will use the labeling for routing. That is, a unicast message will follow a path based on the labeling instead of using XY routing. If the label of the destination node is greater than the label of the source node, the routing always takes place in the high-channel network; otherwise, it will take place in the low-channel network.

The label assignment function l for an m × n mesh can be expressed in terms of the x and y coordinates of nodes as

image
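
As a rough illustration of such a labeling, the following function implements one plausible boustrophedon (snake-order) assignment, numbering even rows left to right and odd rows right to left; this is an assumed form, shown only to make the high-channel/low-channel partition concrete.

    def label(x, y, num_cols):
        """Snake-order label of node (x, y) in a mesh with num_cols columns
        (an assumed labeling, for illustration only)."""
        if y % 2 == 0:
            return y * num_cols + x                   # even rows: left to right
        return y * num_cols + (num_cols - 1 - x)      # odd rows: right to left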

Let V be the node set of the 2-D mesh. The first step in finding a deadlock-free multicast algorithm for the 2-D mesh is to define a routing function R : V × V → V that uses the two subnetworks in such a way as to avoid channel cycles. One such routing function, for a source node u and destination node v, is defined as R(u, v) = w, such that w is a neighboring node of u and

image

Given a source and a destination node, it can be observed from Figure 5.24 that a message is always routed along a shortest path. This is important because the same routing function is used for unicast and multicast communication. It was proved in [211] that for two arbitrary nodes u and v in a 2-D mesh, the path selected by the routing function R is a shortest path from u to v. Furthermore, if l(u) < l(v), then the nodes along the path are visited in increasing order. If l(u) > l(v), then the nodes along the path are visited in decreasing order. However, if the Hamiltonian path is defined in a different way, unicast communication may not follow shortest paths.

Next, two path-based multicast routing algorithms that use the routing function R are defined [211]. In these two algorithms, the source node partitions the set of destinations according to the subnetworks and sends one copy of the message into each subnetwork that contains one or more destinations. The message visits destination nodes sequentially according to the routing function R.

Dual-Path Multicast Routing

The first heuristic routing algorithm partitions the destination node set D into two subsets, DH and DL, where every node in DH has a higher label than that of the source node u0, and every node in DL has a lower label than that of u0. Multicast messages from u0 will be sent to the destination nodes in DH using the high-channel network and to the destination nodes in DL using the low-channel network.

The message preparation algorithm executed at the source node of the dual-path routing algorithm is given in Figure 5.25. The destination node set is divided into the two subsets, DH and DL, which are then sorted in ascending order and descending order, respectively, with the label of each node used as its key for sorting. Although the sorting algorithm may require O(|D| log |D|) time, it often needs to be executed only once for a given set of destinations, the cost being amortized over multiple messages. In fact, for algorithms with regular communication patterns, it may be possible to do the sorting at compile time. The path routing algorithm, shown in Figure 5.26, uses a distributed routing method. Upon receiving the message, each node first determines whether its address matches that of the first destination node in the message header. If so, the address is removed from the message header and the message is delivered to the host node. At this point, if the address field of the message header is not empty, the message is also forwarded toward the first destination node in the message header using the routing function R. It was proved in [211] that the dual-path multicast routing algorithm is deadlock-free.

image

Figure 5.25 Message preparation for the dual-path routing algorithm.

image

Figure 5.26 The path routing algorithm.
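
A minimal sketch of the message preparation step, assuming a node-to-label function such as the one above, splits the destination set by label relative to the source and sorts each subset.

    def dual_path_prepare(src, dests, label):
        """Split dests into D_H (higher labels, ascending order) and
        D_L (lower labels, descending order); label maps a node to its label."""
        d_h = sorted((d for d in dests if label(d) > label(src)), key=label)
        d_l = sorted((d for d in dests if label(d) < label(src)), key=label, reverse=True)
        return d_h, d_l

One copy of the message is then injected into each subnetwork whose list is nonempty, with the corresponding sorted list placed in the header.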

EXAMPLE 5.8

Figure 5.27 shows an example of dual-path multicast routing in a 6 × 6 mesh. Each node x is labeled with l(x) and its integer coordinates. First, the destination set is divided into two sets DH and DL at source node (3, 2), with DH = {(5, 3), (1, 3), (5, 4), (4, 5), (0, 5)} and DL = {(0, 2), (5, 1), (5, 0), (0, 0)}. Then, two copies of the message are sent along the paths shown in the figure. Note that the path between consecutive destinations is always minimal. However, the path may not be minimal when destinations are not consecutive.

image

Figure 5.27 An example of dual-path multicast routing in a 6 × 6 mesh.

The performance of the dual-path routing algorithm is dependent on the distribution of destination nodes in the network. The total number of channels used to deliver the message in Example 5.8 is 33 (18 in the high-channel network and 15 in the low-channel network). The maximum distance from the source to a destination is 18 hops. In order to reduce the average length of multicast paths and the number of channels used for a multicast, an alternative is to use a multipath multicast routing algorithm, in which the restriction of having at most two paths is relaxed.

Multipath Multicast Routing

In a 2-D mesh, most nodes have outgoing degree 4, so up to four paths can be used to deliver a message, depending on the locations of the destinations relative to the source node. The only difference between multipath and dual-path routing concerns message preparation at the source node. Figure 5.28 gives the message preparation of the multipath routing algorithm, in which the destination sets DH and DL of the dual-path algorithm are further partitioned. The set DH is divided into two sets, one containing the nodes whose x coordinates are greater than or equal to that of u0 and the other containing the remaining nodes in DH. The set DL is partitioned in a similar manner.

image

Figure 5.28 Message preparation for the multipath routing algorithm.
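
A sketch of the further partitioning, applied to the already sorted lists DH and DL produced by the dual-path preparation, is given below; nodes are assumed to be (x, y) tuples, and splitting both lists at the x coordinate of the source reproduces the four subsets of Example 5.9, although the exact tie rules depend on the labeling, as discussed next.

    def multipath_prepare(src, d_h, d_l):
        """Split D_H and D_L into up to four sorted sublists by the x coordinate
        of the source (one plausible convention; tie rules are network dependent)."""
        d_h1 = [d for d in d_h if d[0] >= src[0]]
        d_h2 = [d for d in d_h if d[0] < src[0]]
        d_l1 = [d for d in d_l if d[0] >= src[0]]
        d_l2 = [d for d in d_l if d[0] < src[0]]
        return d_h1, d_h2, d_l1, d_l2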

The rules by which ties are broken in partitioning the destination nodes depend on the location of the source node in the network and the particular labeling method used. For example, Figure 5.29(a) shows the partitioning of the destinations in the high-channel network when the source is the node labeled 15. When the node labeled 8 is the source, the high-channel network is partitioned as shown in Figure 5.29(b).

image

Figure 5.29 Multipath destination address partitioning: (a) partitioning with source node 15 and (b) partitioning with source node 8.

EXAMPLE 5.9

Figure 5.30 shows an example of multipath multicast routing in a 6 × 6 mesh. The source node and the destination set are the same as in Figure 5.27. The destination set is first divided into two sets DH and DL at source node (3, 2), with DH = {(5, 3), (1, 3), (5, 4), (4, 5), (0, 5)} and DL = {(0, 2), (5, 1), (5, 0), (0, 0)}. DH is further divided into two subsets DH1 and DH2 at step 3, with DH1 = {(5, 3), (5, 4), (4, 5)} and DH2 = {(1, 3), (0, 5)}. DL is also divided into DL1 = {(5, 1), (5, 0)} and DL2 = {(0, 2), (0, 0)}. Then, the multicast is performed using four multicast paths, as shown in Figure 5.30. In this example, multipath routing requires only 21 channels, and the maximum distance from source to destination is 6 hops.

image

Figure 5.30 An example of multipath multicast routing in a 6 × 6 mesh.

Example 5.9 shows that multipath routing can offer advantages over dual-path routing in terms of generated traffic and the maximum distance between the source and destination nodes. Also, multipath routing usually requires fewer channels than dual-path routing. Because the destinations are divided into four sets rather than two, they are reached more efficiently from the source, which is approximately centrally located among the sets. Thus, when the network load is not high, multipath routing offers a slight improvement over dual-path routing because it introduces less traffic into the network.

However, a potential disadvantage of multipath routing is not revealed until both the load and number of destinations are relatively high. When multipath routing is used to reach a relatively large set of destinations, the source node will likely send on all of its outgoing channels. Until this multicast transmission is complete, any flit from another multicast or unicast message that routes through that source node will be blocked at that point. In essence, the source node becomes a hot spot. In fact, every node currently sending a multicast message is likely to be a hot spot. If the load is very high, these hot spots may throttle system throughput and increase message latency. Hot spots are less likely to occur in dual-path routing, accounting for its stable behavior under high loads with large destination sets. Although all the outgoing channels at a node can be simultaneously busy, this can only result from two or more messages routing through that node. A detailed performance study can be found in [206, 208].

Multicast Channel Dependencies

As indicated in Section 3.1.3, there is a dependency between two channels when a packet or message is holding one of them, and then it requests the other channel. In path-based multicast routing, the delivery of the same message to several destinations may produce additional channel dependencies, as we show in the next example.

Let us consider a 2-D mesh using XY routing. This routing algorithm prevents the use of horizontal channels after using a vertical channel. Figure 5.31 shows an example of multicast routing on a 2-D mesh. A message is sent by node 0, destined for nodes 9 and 14. The XY routing algorithm first routes the message through horizontal channels until it reaches node 1. Then, it routes the message through vertical channels until the first destination is reached (node 9). Now, the message must be forwarded toward its next destination (node 14). The path requested by the XY routing algorithm contains a horizontal channel. So, there is a dependency from a vertical channel to a horizontal one. This dependency does not exist in unicast routing. It is due to the inclusion of multiple destinations in the message header. More precisely, after reaching an intermediate destination the message header is routed toward the next destination. As a consequence, it is forced to take a path that it would not follow otherwise. For this reason, this dependency is referred to as multicast dependency [90, 96].

image

Figure 5.31 Multicast routing on a 2-D mesh with XY routing.

We studied four particular cases of channel dependency (direct, direct cross-, indirect, and indirect cross-dependencies) in Sections 3.1.4 and 3.1.5. For each of them we can define the corresponding multicast dependency, giving rise to direct multicast, direct cross-multicast, indirect multicast, and indirect cross-multicast dependencies. The only difference between these dependencies and the dependencies defined in Chapter 3 is that multicast dependencies are due to multicast messages reaching an intermediate destination. In other words, there is an intermediate destination in the path between the reserved channel and the requested channel.

The extended channel dependency graph defined in Section 3.1.3 can be extended by including multicast dependencies. The resulting graph is the extended multicast channel dependency graph [90, 96]. Similarly to Theorem 3.1, it is possible to define a condition for deadlock-free multicast routing based on that graph.

Before proposing the condition, it is necessary to define a few additional concepts. The message preparation algorithm executed at the source node splits the destination set for a message into one or more destination subsets, possibly reordering the nodes. This algorithm has been referred to as a split-and-sort function SS [90, 96]. The destination subsets supplied by this function are referred to as valid.

A split-and-sort function SS and a connected routing function R form a compatible pair (SS, R) if and only if, when a given message destined for the destination set D is being routed, the destination subset containing the destinations that have not been reached yet is a valid destination set for the node containing the message header. This definition imposes restrictions on both SS and R because compatibility can be achieved either by defining SS according to this definition and/or by restricting the paths supplied by the routing function. Also, if (SS, R) is a compatible pair and R1 is a connected routing subfunction of R, then (SS, R1) is also a compatible pair.

The following theorem proposes a sufficient condition for deadlock-free, path-based multicast routing [90, 96]. Whether it is also a necessary condition for deadlock-free multicast routing remains as an open problem.

THEOREM 5.1

A compatible pair (SS, R) for an interconnection network I, where R is connected, is deadlock-free if there exists a connected routing subfunction R1 and the pair (SS, R1) has no cycles in its extended multicast channel dependency graph.

Adaptive Multicast Routing

This section describes two adaptive multicast routing functions based on the extension of the routing function presented in Section 5.5.3. The label-based dual-path (LD) adaptive multicast routing algorithm for 2-D meshes [205] is similar to the dual-path routing algorithm presented in Section 5.5.3. In addition to minimal paths between successive destination nodes, the LD algorithm also allows nonminimal paths as long as nodes are crossed in strictly increasing or decreasing label order.

The message preparation algorithm executed at the source node is identical to the one given in Figure 5.25 for the dual-path routing algorithm. The destination node set is divided into two subsets, DH and DL, which are then sorted in ascending order and descending order, respectively, with the label of each node used as its key for sorting. The LD algorithm for message routing is shown in Figure 5.32.

image

Figure 5.32 The LD algorithm for message routing.

EXAMPLE 5.10

Figure 5.33 shows an example for the LD multicast routing algorithm in a 6 × 6 mesh. The destination set is divided into two sets DH and DL at source node (3, 2), with DH = {(4, 3), (1, 3), (4, 5), (0, 5)} and DL = {(1, 1), (5, 1), (0, 0)}. Then, two copies of the message are sent along the paths shown in the figure. In this example, nonminimal paths are used from node (3, 2) to node (4, 3), from (1, 3) to (4, 5), and from (3, 2) to (1, 1).

image

Figure 5.33 An example of LD multicast routing in a 6 × 6 mesh.

Another adaptive routing function was proposed in [90, 96]. It allows messages to use the alternative minimal paths between successive destinations offered by the interconnection network. This routing function can be used for meshes and hypercubes, once a suitable node-labeling scheme is defined. It will serve as an example of the application of Theorem 5.1.

It is possible to define dual-path and multipath routing algorithms based on the routing function described below. This can be easily done by using one of the message preparation algorithms presented in Figures 5.25 and 5.28. For the sake of brevity, we will focus on the dual-path algorithm.

Hamiltonian path-based routing can be made adaptive by following a methodology similar to the one proposed in Section 4.4.4. This design methodology supplies a way to add channels following a regular pattern, also deriving the new routing function from the old one.

The routing function defined on page 238 is a minimal-path connected deterministic routing function, which we denote R1. It does not require virtual channels. From a logical point of view, we can consider that there is a single virtual channel per physical channel. Let C1 be the set of (virtual) channels supplied by R1. Now, each physical channel is split into two virtual channels. This is equivalent to adding one virtual channel to each physical channel. The additional virtual channels will be used for adaptive routing. Let C be the set of all the virtual channels in the network. Let C1H be the set of virtual channels belonging to C1 that connect lower-labeled nodes to higher-labeled nodes, and let C1L be the set of virtual channels belonging to C1 that connect higher-labeled nodes to lower-labeled nodes. Let CxyH be the set of output virtual channels from node x such that each channel belongs to a minimal path from x to y and the label of its destination node is not higher than the label of y. Let CxyL be the set of output virtual channels from node x such that each channel belongs to a minimal path from x to y and the label of its destination node is not lower than the label of y. The adaptive routing function R is defined as follows:

image

In other words, the adaptive routing function can use any of the additional virtual channels belonging to a minimal path between successive destinations, except those leading to a node with a label higher (for l(x) < l(y)) or lower (for l(x) > l(y)) than the label of the next destination. Alternatively, the channels supplied by R1 can also be used. This routing function can easily be extended to n-dimensional meshes as well as to support nonminimal paths. The extension for 3-D meshes is presented in [216]. The higher the number of dimensions, the greater the benefit from using adaptivity.
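
The channel selection can be sketched as follows, assuming two hypothetical helpers: minimal_neighbors(x, y), returning the neighbors of x that lie on some minimal path toward y, and r1_channel(x, y), returning the channel supplied by the deterministic subfunction R1.

    def adaptive_channels(x, y, label, minimal_neighbors, r1_channel):
        """Channels offered by R from node x toward the next destination y:
        the additional (adaptive) virtual channel of any minimal-path output
        whose far-end label does not overshoot l(y), plus the channel of R1."""
        extra = []
        for w in minimal_neighbors(x, y):
            if label(x) < label(y) and label(w) <= label(y):
                extra.append(("adaptive", x, w))
            elif label(x) > label(y) and label(w) >= label(y):
                extra.append(("adaptive", x, w))
        return extra + [r1_channel(x, y)]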

EXAMPLE 5.11

Figure 5.34 shows the label assignment and a routing example for the adaptive routing function R in an 8 × 8 mesh. In what follows, we will refer to nodes using their labels instead of their integer coordinates. The destination set for the example has been split into two subsets, DH = {41, 52, 53, 62} and DL = {1}. Solid arrows show the path supplied by the deterministic routing function R1. Dashed arrows show the additional paths offered by the adaptive routing function R. Inside each path, all the additional virtual channels can be used by R.

image

Figure 5.34 Label assignment and routing example for an 8 × 8 mesh.

As can be seen, R does not use all the minimal paths between successive destinations. For instance, when the next destination is node 41, the message is not allowed to reach nodes 44, 43, or 42 because they have a higher label value than the next destination node. However, R may route a message destined for a node with a higher label value through nodes with decreasing label values. This is the case when the next destination is node 41, allowing the message to be routed across channels [12, 11], [11, 10], [10, 9], [28, 27], [27, 26], or [26, 25]. In these cases, R always uses virtual channels belonging to C − C1. A similar situation arises when the next destination node has a lower label value than the source node. This is the case when the next destination is node 1.

Channels belonging to C1 are used in exactly the same way by R and R1. Therefore, there is no cross-dependency between channels belonging to C1. Let us analyze the indirect dependencies that appear in the example shown in Figure 5.34 by considering some paths supplied by R. For instance, if the header of the message destined for node 41 is at node 19 and is routed to node 28 using a channel of C1, then to node 27 using a channel of C − C1, and then to node 36 using a channel of C1, there is an indirect dependency between the channels of C1. This dependency arises because R may route a message destined for a node with a higher label value through nodes with decreasing label values. If node 28 were a destination node, this example would correspond to an indirect multicast dependency.

Also, some indirect multicast dependencies exist between horizontal and vertical channels, due to messages that double back along their path after reaching an intermediate destination node. An example can be seen in Figure 5.34 when the message header is at node 52 and, for instance, it is routed to node 53 using a channel of C1, then to nodes 52 and 51 using channels of C − C1, and finally to node 60 using a channel of C1.

The indirect and indirect multicast dependencies shown in the previous examples cannot be obtained by composing other dependencies. However, some indirect and indirect multicast dependencies can also be obtained by composing other direct and/or direct multicast dependencies. These dependencies do not add information to the channel dependency graphs.

Taking into account the definition of the split-and-sort function SS (message preparation algorithm) for the dual-path algorithm, a valid destination set for a source node ns only contains nodes with label values higher (alternatively, lower) than l(ns). When the next destination node has a label value higher (lower) than the current node, the routing function cannot forward the message to a node with a label value higher (lower) than the next destination node. Thus, the pair (SS, R) is compatible. However, if all the minimal paths between successive destination nodes were allowed, the pair would not be compatible. For instance, if after reaching node 41 the message of the example given in Figure 5.34 were allowed to reach node 54, the remaining destination set {52, 53, 62} would not be a valid destination set for node 54.

There is no dependency between channels belonging to C1H and channels belonging to C1L. However, if all the minimal paths between successive destinations were allowed, such dependencies would exist. The extended multicast channel dependency graph for (SS, R1) consists of two subgraphs, each one containing the dependencies between channels belonging to C1H and C1L, respectively. Figure 5.35 shows part of the extended multicast channel dependency graph for (SS, R1) and a 4 × 4 mesh. For the sake of clarity, the dependencies that do not add information to the graph have been removed. Also, only the subgraph for C1H is displayed. The subgraph for C1L is identical, except for channel labeling. Channels have been labeled as Hs or Vs, where H and V indicate that the channel is horizontal and vertical, respectively, and s is the label assigned to the source node of the channel. It must be noted that there are two horizontal output channels belonging to C1 for most nodes, but only one of them belongs to C1H. The same consideration applies to vertical channels.

image

Figure 5.35 Extended multicast channel dependency graph for the pair (SS, R1) and a 4 × 4 mesh. Only the channels in C1H have been represented. (Hs = horizontal channel from node s; Vs = vertical channel from node s.)

Solid arrows represent direct or direct multicast dependencies. Dashed arrows represent indirect or indirect multicast dependencies. Dotted arrows represent indirect multicast dependencies. The graph has been represented in such a way that it is easy to see that there are no cycles. Obviously, the routing subfunction R1 is connected. As the pair (SS, R1) has no cycles in its extended multicast channel dependency graph, by Theorem 5.1 the pair (SS, R) is deadlock-free.

Base Routing Conformed Path

Deadlock avoidance is considerably simplified if unicast and multicast routing use the same routing algorithm. Moreover, using the same routing hardware for unicast and multicast routing allows the design of compact and fast routers. The Hamiltonian path-based routing algorithms proposed in previous sections improve performance over multiple unicast routing. However, their development has followed a different track from e-cube and adaptive routing. Moreover, it makes no sense to sacrifice the performance of unicast messages to improve the performance of multicast messages, which usually represent a much smaller percentage of network traffic. Thus, as indicated in [269], it is unlikely that a system in the near future will be able to take advantage of Hamiltonian path-based routing.

The base routing conformed path (BRCP) model [269] defines multicast routing algorithms that are compatible with existing unicast routing algorithms. This model also uses path-based routing. The basic idea consists of allowing a multidestination message to be transmitted through any path in the network as long as it is a valid path conforming to the base routing scheme. For example, on a 2-D mesh with XY routing, a valid path can be any row, column, or row-column.

It is important to note that the BRCP model, as defined, cannot be directly applied on top of adaptive routing algorithms that allow cyclic dependencies between channels, like the ones described in Section 4.4.4. These algorithms divide the virtual channels of each physical channel into two classes: adaptive and escape. Adaptive channels allow fully adaptive routing, while escape channels allow messages to escape from potential deadlocks. Escape channels can be selected by a message at any node. However, if the BRCP model allows all the valid paths conforming to the base routing scheme, destination nodes may be arranged along a path that can only be followed by using adaptive channels. As a consequence, messages may be forced to select adaptive channels to reach the next destination node, therefore preventing some messages from selecting escape channels to escape from a potential deadlock.

However, this does not mean that the BRCP model cannot be combined with fully adaptive routing algorithms that allow cyclic dependencies between channels. Simply, destination nodes should be arranged in such a way that multidestination messages could be transmitted through any valid path consisting of escape channels. It is not necessary to restrict multidestination messages to use only escape channels. Those messages are allowed to use adaptive channels. However, by arranging destinations as indicated above, we guarantee that any message can select an escape channel at any node. For example, consider a 2-D mesh using wormhole switching and the routing algorithm described in Example 3.8. In this algorithm, one set of virtual channels is used for fully adaptive minimal routing. The second set of virtual channels implements XY routing and is used to escape from potential deadlocks. The BRCP model can be applied to the second set of virtual channels exactly in the same way that it is applied to a 2-D mesh with XY routing. By doing so, multicast messages can take advantage of the BRCP model, while unicast messages can benefit from fully adaptive routing.

The main goal of the BRCP model is providing support for multidestination messages without introducing additional channel dependencies with respect to the base routing algorithm used for unicast routing [269]. As indicated in Theorem 5.1, a compatible pair (SS, R) is deadlock-free if there exists a connected routing subfunction R1 and the pair (SS, R1) has no cycles in its extended multicast channel dependency graph. Instead of restricting the routing function R to satisfy this condition, the BRCP model restricts the split-and-sort function SS (message preparation algorithm) so that multicast dependencies do not produce any new channel dependency. Therefore, the extended multicast channel dependency graph and the extended channel dependency graph (defined in Section 3.1.3) are identical, and the condition proposed by Theorem 5.1 for deadlock-free routing is identical to the one for unicast routing (Theorem 3.1). As a consequence, if the base routing algorithm is deadlock-free, the multicast routing algorithm is also deadlock-free. The BRCP model can be formally defined as follows:

DEFINITION 5.5

A multidestination message from a source s with an ordered destination list {d1, d2, …, dn–1, dn} in a network supporting the routing function R conforms to this base routing if and only if the destination set {d1, d2,…, dn–1} can be covered as intermediate nodes on one of the possible paths from s to dn under the routing constraint R1, where R1 is a connected routing subfunction of R that has no cycles in its extended channel dependency graph.

The routing subfunction R1 supplies the escape channels mentioned above. If the routing function R has no cyclic dependencies between channels, then R1 is made equal to R. In this case, a multidestination message can be transmitted through any path in the network as long as it is a valid path conforming to the base routing scheme defined by R.

EXAMPLE 5.12

Figure 5.36 shows examples of multidestination messages on a 2-D mesh under the BRCP model. If the network uses XY routing, a multidestination message can cover a set of destinations in row/column/row-column order. It should be noted that a set of destinations ordered in a column-row manner is an invalid path under the BRCP model for XY routing. Similarly, in a network using planar-adaptive routing (see Section 4.3.1), a multidestination message can cover a set of destinations along any diagonal in addition to the flexibility supported by XY routing. Such additional paths are shown as bold lines in the figure. If the underlying routing scheme supports the west-first routing algorithm (see Section 4.3.2), it can provide further flexibility in covering many destinations using a single message. In this example, a nonminimal west-first routing algorithm is assumed. If the base routing scheme supports nonminimal routing, then multidestination messages can also use nonminimal paths. This multicast routing algorithm conforming to nonminimal west-first routing was first proposed in [205]. Finally, if the base routing scheme supports fully adaptive routing, the set of destinations that can be covered by multidestination messages depends on the routing subfunction implemented by escape channels, as indicated above. Possible choices include XY routing and west-first routing.

image

Figure 5.36 Examples of multidestination messages on a 2-D mesh under the BRCP model.

Once the set of valid paths for multidestination messages has been determined, it is necessary to define routing algorithms for collective communication. The hierarchical leader-based (HL) scheme has been proposed in [269] to implement multicast and broadcast. Given a multicast destination set, this scheme tries to group the destinations in a hierarchical manner so that the minimum number of messages is needed to cover all the destinations. Since the multidestination messages conform to paths supported by the base routing, the grouping scheme takes into account this routing scheme and the spatial positions of destination nodes to achieve the best grouping. Once the grouping is achieved, multicast and broadcast take place by traversing nodes from the source in a reverse hierarchy.

Consider a multicast pattern from a source s with a destination set D. Let L0 denote the set D ∪ {s}. The hierarchical scheme, in its first step, partitions the set L0 into disjoint subsets with a leader node representing each subset. The leader node is chosen in such a way that it can forward a message to the members of its set using a single multidestination message under the BRCP model. For example, if a 2-D mesh implements XY routing, then the partitioning is done such that the nodes in each set lie on a row, column, or row-column.

Let the leaders obtained by the above first-step partitioning be termed level-1 leaders and identified by a set L1. This set L1 can be further partitioned into disjoint subsets with a set of level-2 leaders. This process of hierarchical grouping is continued as long as it is profitable, as indicated below. Assuming that the grouping is carried out for m steps, there will be m sets of leaders, satisfying Lm ⊆ Lm−1 ⊆ … ⊆ L1 ⊆ L0.

After the grouping is achieved, the multicast takes place in two phases. In the first phase, the source node performs unicast-based multicast (see Section 5.7) to the set Lm. The second phase involves m steps of multidestination message passing. It starts with the leaders in the set Lm and propagates down the hierarchical grouping in a reverse fashion to cover the lower-level leaders and, finally, all the nodes in destination set D.

As mentioned above, the hierarchical grouping is continued as long as it is profitable in order to reduce the multicast latency. As start-up latency dominates communication latency in networks using pipelined data transmission, the minimum multicast latency is usually achieved by minimizing the number of communication steps. As will be seen in Section 5.7, unicast-based multicast requires ⌈log2(|Lm| + 1)⌉ steps to reach all the leaders in the set Lm, assuming that the source node does not belong to that set. Therefore, the size of the set Lm should be reduced. However, while going from Lm−1 to Lm, one additional step of multidestination communication is introduced into the multicast latency. Hence, it can be seen that when ⌈log2(|Lm−1| + 1)⌉ > ⌈log2(|Lm| + 1)⌉ + 1, it is profitable to go through an additional level of grouping. The previous expression assumes that the source node does not belong to Lm−1. If the source node belongs to Lm−1 or Lm, then the cardinality of the corresponding set must be reduced by one.
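
The stopping rule can be evaluated directly, as the following helper shows (it assumes, as in the expression above, that the source belongs to neither leader set).

    from math import ceil, log2

    def extra_level_profitable(num_prev_leaders, num_new_leaders):
        """True when one more level of grouping saves at least two unicast steps
        while adding a single multidestination step."""
        return ceil(log2(num_prev_leaders + 1)) > ceil(log2(num_new_leaders + 1)) + 1

    print(extra_level_profitable(8, 2))   # True: 4 unicast steps versus 2 + 1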

EXAMPLE 5.13

Figure 5.37 shows an example of multicast routing in a 2-D mesh using XY routing under the hierarchical leader-based scheme. Given a destination set, the first-level grouping can be done along rows (dimension 0) to obtain the level-1 leaders as shown in the figure. The level-1 leaders can be grouped along columns (dimension 1) to form L2 with two level-2 leaders. Note that each level-2 leader is also a level-1 leader. The multicast takes place in two phases. In the first phase, the source node uses unicast-based multicast to send unicast messages to the two level-2 leaders. In the next phase, these two level-2 leaders send multidestination messages along dimension 1 to cover the level-1 leaders. Finally, all the level-1 leaders (including the level-2 leaders) send multidestination messages along dimension 0 to cover the remaining destinations. In this example, the multicast takes four communication steps.

image

Figure 5.37 Example of multicast routing in a 2-D mesh using XY routing under the hierarchical leader-based scheme.

As the degree of multicasting increases, better grouping can be achieved while reducing L1 to L2. The best case is achieved for broadcast, as shown in Figure 5.38. In this case, all the level-1 leaders are reduced to a single level-2 leader. This indicates that broadcast from any arbitrary source can be done in three steps. A more efficient scheme to perform broadcast in two steps in a 2-D mesh using XY routing is shown in Figure 5.39(a) [268]. The number of steps to perform broadcasting can be further reduced by using the nonminimal west-first routing algorithm as the base routing, as shown in Figure 5.39(b) [268]. In this case, the message is first routed to the lower-left corner of the network without delivering it to any destination. From that point the message crosses all the nodes in the network, delivering copies of the message as it advances. Although this scheme reduces the number of steps to one, the path traversed by the message is very long, usually resulting in a higher broadcast latency than two-step broadcast.

image

Figure 5.38 Broadcast routing in a 2-D mesh using XY routing under the hierarchical leader-based scheme (notation is the same as in Figure 5.37).

image

Figure 5.39 Broadcast routing in a 2-D mesh: (a) two-step broadcast using XY routing under the hierarchical leader-based scheme; (b) one-step broadcast using nonminimal west-first routing.

As defined, the HL scheme may produce considerable contention when several nodes send multicast messages concurrently. A method to reduce node contention is to make each multicast choose unique intermediate nodes, as different as possible from the rest. However, with dynamic multicast patterns, all concurrent multicasts are unaware of one another. This means that a multicast has no information whatsoever about the source and destinations of the other multicasts.

A good multicast algorithm should only use some local information to make its tree as unique as possible. A possible solution consists of using the position of the source node in the system [173]. This is unique for each multicast message. If each multicast constructs its tree based on the position of its corresponding source, then all the multicasts will end up with trees as unique as possible.

In the HL algorithm the multicast tree is generated using primarily the destination (distribution) information. It depends, to a very small degree, on the position of the source. So, it can be improved by choosing the leader sets for a multicast depending on the position of its source. In the source quadrant-based hierarchical leader (SQHL) scheme [173], groupings are done exactly as in the HL scheme. However, the leader nodes are chosen in a different way. Consider a k-ary n-mesh. Let si be the ith coordinate of the source node s of a multicast. Consider the set of destinations or leaders of the previous level that only differ in the ith coordinate. They will be reached by a single multidestination message. If si < k/2 (i.e., the source lies in the lower half of dimension i), then the leader node will be the one in the set with the lowest coordinate in dimension i. Otherwise, the leader node will be the one with the highest coordinate in dimension i. For example, when grouping is performed along a given row in a 2-D mesh, the leader will be the leftmost destination in that row if the source node is located in the left half of the network. Otherwise, the leader will be the rightmost destination in that row. The remaining steps are the same as in the original HL scheme. The SQHL scheme allows multicast messages from source nodes in different quadrants of a mesh to proceed concurrently with minimal interference.
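
The leader choice can be sketched as follows, assuming nodes are coordinate tuples and that the lower half of dimension i means a coordinate below k/2.

    def sqhl_leader(group, dim, src, k):
        """SQHL leader of a group of nodes that differ only in dimension dim:
        the lowest coordinate if the source lies in the lower half, else the highest."""
        key = lambda node: node[dim]
        return min(group, key=key) if src[dim] < k / 2 else max(group, key=key)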

In the source-centered hierarchical leader (SCHL) scheme [173], each group g from the HL scheme is partitioned into at most two groups g1 and g2 based on the coordinates of the source node. Let si be the ith coordinate of the source node s of a multicast. Let us assume that destination nodes in g only differ in the coordinate corresponding to dimension i. All the nodes in that group with an ith coordinate lower than or equal to si will be in g1. The remaining nodes will be in g2. Leader nodes are chosen so that they have an ith coordinate as close as possible to si. Both the SQHL and SCHL schemes improve performance over the HL scheme, as will be seen in Chapter 9.

Deadlocks in Delivery Channels

Path-based multicast routing functions avoid deadlock by transmitting messages in such a way that nodes are crossed either in increasing or decreasing label order. However, deadlock is still possible because messages transmitted on the high-channel and low-channel subnetworks (see Figure 5.24) use the same delivery channels at every destination node [216, 269].

Figure 5.40 shows a deadlocked configuration in which two multidestination messages M1 and M2 are destined for nodes A and B. M1 and M2 are traveling in the high-channel and low-channel subnetworks, respectively. Assuming that l(A) < l(B), M1 first reached node A, then node B. Also, M2 first reached node B, then node A. As there is a single delivery channel at each node, each message has reserved one delivery channel and is waiting for the other delivery channel to become free. Note that M1 cannot be completely delivered to node A because wormhole switching is used.

image

Figure 5.40 Deadlock produced by using a single delivery channel.

Deadlocks may arise because messages traveling in the high-channel and low-channel subnetworks share the same delivery channels, thus producing cyclic dependencies between them. In effect, message M1 in Figure 5.40 has reserved the delivery channel cDA at node A, then requested the delivery channel cDB at node B. As a consequence, there is a dependency from cDA to cDB. Similarly, there is a dependency from cDB to cDA. This cyclic dependency can be easily broken by using different delivery channels for each subnetwork. In this case, two delivery channels at each node are enough to avoid deadlocks.

This situation may also arise when multidestination routing is based on the BRCP model. Consider an n-dimensional mesh with dimension-order routing for unicast messages. In this case, multidestination messages conforming to the base routing may be destined for sets of nodes along any dimension. As a consequence, the situation depicted in Figure 5.40 may occur in all the dimensions simultaneously. Figure 5.41 shows a deadlocked configuration for a 2-D mesh using XY routing where each node has three delivery channels. Note that each node has four input channels, which allow up to four messages requesting delivery channels at the same time.

image

Figure 5.41 Deadlock produced in a 2-D mesh with three delivery channels per node. Multidestination routing uses the BRCP model and XY routing as the base routing.

As indicated in [268], the basic idea to avoid deadlock is to consider a given topology and routing function R, and determine the maximum number of multidestination messages that can enter a node simultaneously under the BRCP model. For example, if a base routing function R in an n-dimensional mesh requires v virtual channels per physical channel to support deadlock-free unicast routing, then 2nv delivery channels per node are sufficient for deadlock-free multidestination communication [268]. Note that each node has 2n input physical channels. Thus, each input virtual channel is associated with a dedicated delivery channel, therefore breaking all the dependencies between delivery channels.

It should be noted that virtual channels can also be used to reduce congestion. In this case, they are usually referred to as virtual lanes. Virtual lanes do not require additional delivery channels. Instead, each set of virtual lanes shares a single delivery channel [268]. When a multidestination message arrives on a virtual lane and reserves a delivery channel, and another multidestination message arrives on another virtual lane in the same set, it must wait. This waiting cannot produce deadlock because both messages follow the same direction. Also, it should be noted that adaptive routing algorithms that allow cyclic dependencies between channels only require the escape channels to avoid deadlock. Since the BRCP model restricts routing for multidestination messages according to the paths defined by escape channels, only the escape channels need to be considered when computing the number of delivery channels required for those algorithms.

The high number of delivery channels required to implement deadlock-free multidestination communication under the BRCP model may restrict the applicability of this model. Fortunately, current trends in network topologies recommend the use of low-dimensional meshes or tori (two or three dimensions at most). Also, as shown in Sections 4.2 and 4.4.4, it is possible to design deterministic and fully adaptive routing algorithms with a very small number of virtual channels. One and two virtual channels per physical channel are enough for deterministic routing in n-dimensional meshes and k-ary n-cubes, respectively. For fully adaptive routing, the requirements for escape channels are identical to those for deterministic routing. As indicated in [268], four delivery channels are enough to support deadlock-free multidestination communication under the BRCP model in 2-D meshes when the base routing is either XY routing, nonminimal west-first, or fully adaptive routing based on escape channels (described in Section 4.4.4). Moreover, the probability of all delivery channels being accessed simultaneously is very low. Hence, delivery channels can be implemented as virtual channels by multiplexing the available bandwidth at the network interface.

5.6 Hardware Support for Barrier Synchronization and Reduction

In this section we describe algorithms and architectural support to perform barrier synchronization, reduction, and global combining. These algorithms have been proposed by Panda in [265, 266] and are based on the BRCP model described in Section 5.5.3.

5.6.1 Barrier Synchronization on a Linear Array

Barrier synchronization can be performed in two phases by using multidestination messages. The first phase implements reporting by using gather messages. The second phase implements wake-up by using broadcasting messages.

Consider a linear array of six processors as shown in Figure 5.42. Assume that four processors (P0, P1, P2, and P4) participate in a barrier. The rightmost processor P4, after reaching its barrier point, can send a multidestination gather message. The header of this message consists of an ordered destination list (P2, P1, and P0) with P0 as the final destination. As this message propagates toward P0, it can gather information from processors P2 and P1 regarding whether they have reached the barrier or not. If this information is available at the router interface, the gather message need not be consumed and retransmitted at intermediate destinations.

image

Figure 5.42 Two-phase implementation of a barrier on a linear array of processors: (a) propagation of a gather message; (b) propagation of a broadcasting message. (from [265])

As the message propagates, it checks for the arrival of information at every router interface of intermediate destinations. If the intermediate destination has already arrived at the barrier, the message proceeds. Otherwise, it gets blocked at the respective router interface until the associated processor arrives at the barrier and provides information to the router interface. If a processor is not participating in a barrier, the message moves ahead at its router interface. Finally, the message is consumed at its final destination. In the example in Figure 5.42, the gather message from P4 will check the arrival of processors P2 and P1 at the barrier on its way and will finally get consumed at P0.

With the consumption of the gather message, processor P0 has information that all other participating processors (P1, P2, and P4) have arrived at the barrier. To implement the wake-up phase, P0 initiates a broadcasting message with P1, P2, and P4 as destinations. The propagation of this message, as shown in Figure 5.42(b), is unconditional. It is very similar to the movement of a multidestination multicast/broadcast message as described in Section 5.5.3. The only difference is that, as this message passes through the routers of the intermediate participating processors (P1 and P2), it activates the wake-up signal to these processors. Processors P0 and P4 get woken up after sending and receiving the broadcast message, respectively.

Architectural Support

In order to implement multidestination gather/broadcasting messages, some architectural support is required at the router interface. Similar to the concept of barrier registers [260], a set of buffers can be provided at each router interface of the system. Figure 5.43(a) shows a possible router interface organization with m buffers. Each buffer has a few bits for synchronization id, a flag participate/not participate (P), a flag arrived/not arrived (A), and some space for holding an incoming gather message. These buffers can be accessed by the associated processor. Note that by supporting m buffers at every router interface, a system can implement m concurrent barriers for an application at a given time.

image

Figure 5.43 Architectural supports for implementing multidestination gather/broadcasting messages: (a) synchronization buffers at a router interface; (b) message format. (A = arrived/not arrived; id = synchronization id; P = participate/not participate.) (from [265])

Figure 5.43(b) shows a suitable format for gather/broadcasting messages. Each message carries a msg type field (2–3 bits) indicating whether it is a gather, broadcast, or unicast message. The width of the synchronization id field depends on how many buffers can be incorporated into the router interface. The destination addresses are encoded as a bit string (see Section 5.5.1). For the linear array example being considered, 6 bits are sufficient to encode the addresses. Assuming processor 0 is identified as bit 0 of this address, the destination bit string of the gather message initiated by P4 will be 000111. Similarly, the broadcast message initiated by P0 will have a bit string of 010110.
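
These bit strings are straightforward to construct; the following snippet sets bit i for each destination processor i and reproduces the two strings quoted above, most significant bit first.

    def dest_bitstring(dests, num_nodes):
        """Bit-string destination encoding: bit i is set if processor i is a destination."""
        bits = 0
        for d in dests:
            bits |= 1 << d
        return format(bits, "0{}b".format(num_nodes))

    print(dest_bitstring([2, 1, 0], 6))   # 000111  (gather message from P4 to P2, P1, P0)
    print(dest_bitstring([1, 2, 4], 6))   # 010110  (wake-up broadcast from P0)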

Communication Sequence

Let us consider the communication sequence for the processors participating in a barrier using gather and broadcast messages. We assume static barriers [260] so that the processors participating in a barrier are known at compile time and the associated communication sequence can be generated before the program execution. A split-phase synchronization scheme [141, 261] with separate report and wake-up phases is assumed. Figure 5.44 shows the communication sequence to implement a barrier on a linear array. Before the execution of barrier x, 0 ≤ x < m, each intermediate processor specifies its desire for participating in the barrier by grabbing buffer x at its router interface, setting the associated participate flag to 1 and resetting the arrived flag to 0. These flags are reset by default, indicating that the associated processor does not participate. The arrived flag of buffer x at an intermediate router is set to 1 when the associated processor reaches its barrier point after computation.

image

Figure 5.44 Communication sequence of processors on a linear array while participating in a barrier with synchronization id = x, using gather and broadcast messages. (from [265])

The rightmost participating processor initiates a gather message with synchronization id = x. This gather message, while passing through the router of an intermediate destination, checks for the participate and arrived flags of the associated buffer x. If the processor is not participating, then the message keeps on moving ahead. If the processor is participating and it has already arrived at the barrier, then the message also proceeds. If the processor is participating and it has not arrived at the barrier, then the message gets blocked at the router interface. In this case, the message is stored in the message field of buffer x until the arrived flag is set to 1 by the processor. Then, the gather message is forwarded again into the network. Finally, this gather message is consumed by the leftmost participating processor. Note that gather messages require the use of VCT switching. When a gather message is stored in the corresponding buffer at the router interface, it is removed from the network. This is required to avoid deadlock and to reduce network contention because a gather message may be waiting for a long time until a given processor reaches the barrier.
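
The check performed at each intermediate router interface can be summarized by the following behavioral sketch, operating on an assumed record holding the participate and arrived flags of buffer x (Figure 5.43(a)).

    def gather_action(buf, is_final_destination):
        """Decide what a gather message does at a router interface."""
        if not buf["participate"]:
            return "forward"                       # node is not in this barrier
        if buf["arrived"]:
            return "consume" if is_final_destination else "forward"
        return "block"                             # held in the buffer until the processor arrives

    print(gather_action({"participate": True, "arrived": False}, False))   # block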

After the gather phase is over, the leftmost processor initiates a broadcast message to the rightmost processor with intermediate processors as intermediate destinations. This message, as it passes through the routers of intermediate destinations, wakes up the associated processors, also resetting the participate and arrived flags of buffer x. The leftmost processor is done with the synchronization after sending the broadcast message. Similarly, the rightmost processor is done when it receives the broadcast message.

5.6.2 Barrier Synchronization on Meshes

In this section we study the complete barrier synchronization as well as the synchronization of an arbitrary subset of nodes in a 2-D mesh.

Complete Barrier Synchronization

Consider a k × k mesh as shown in Figure 5.45. Complete barrier synchronization can be achieved by considering the mesh as a set of linear arrays [265]. A basic scheme using four steps is shown in Figure 5.45(a). The first step uses a gather message in all rows to collect information toward the first column. The second step uses a gather message on the first column. At the end of these two steps, the upper-left corner processor has information that all other processors have arrived at the barrier. Now it can initiate a broadcast message along the first column. During the final step, the nodes in the first column initiate broadcast messages along their respective rows. This requires four communication steps to implement barrier synchronization on a k × k mesh. For any k-ary n-cube system, similar dimensionwise gather and broadcast can be done to implement barrier synchronization with 2n communication steps. Note that under this basic scheme, the multidestination gather and broadcast messages need to move along a single dimension of the system only. As each multidestination message is destined for the nodes in a row or column, destination addresses can be encoded as a bit string of k bits. Hence, using 16-bit or 32-bit flits, destination addresses can be encoded into a few (1–3) flits for current network sizes.

image

Figure 5.45 Complete barrier synchronization on a k × k mesh using (a) a four-step basic scheme and (b) an enhanced scheme with three steps. (from [265])

An enhancement to this basic scheme can be done as shown in Figure 5.45(b). The gather message initiated by the bottom row does not stop at the leftmost node of this row but continues along the first column to gather information from the leftmost leaders of all other rows. This will combine the first two gather steps into a single one. Hence, using such a multidimensional gather message, complete barrier synchronization can be implemented on a 2-D mesh in three steps. This enhancement can be easily extended to a higher number of dimensions. Note that such a multidimensional message conforms to the BRCP model discussed in Section 5.5.3. The headers of such multidimensional messages are longer in size because they need to carry more destinations in their headers. Although such a scheme is feasible, it requires additional logic at a router interface to wait for the arrival of multiple gather messages.
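
The step counts of the two schemes can be summarized with a small sketch; the dimension numbering (rows taken as dimension 0) is our own assumption for illustration.

def basic_barrier_phases(n):
    # One gather step per dimension toward the leaders, followed by one
    # broadcast step per dimension back out: 2n steps in total.
    gathers = [("gather", d) for d in range(n)]
    broadcasts = [("broadcast", d) for d in reversed(range(n))]
    return gathers + broadcasts

# 2-D mesh, taking rows as dimension 0: row gather, column gather,
# column broadcast, row broadcast -- the four steps of Figure 5.45(a).
assert len(basic_barrier_phases(2)) == 4
# The enhanced scheme of Figure 5.45(b) merges the two gathers into a single
# multidimensional gather message, leaving three steps on a 2-D mesh.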

Arbitrary Set Barrier Synchronization

Consider a subset of nodes in a 2-D mesh trying to barrier-synchronize, as shown in Figure 5.46. If all processors belong to a single task, then this subset barrier can be implemented as a complete barrier by forcing all processors to participate in each barrier. However, such an assumption is very restrictive. In this section, we describe a general scheme that allows for multiple subset barriers to be executed concurrently in a system, and the operation for a given barrier involves only the processors participating in that barrier [265].

image

Figure 5.46 Barrier synchronization for an arbitrary subset of processors in a 2-D mesh using multidestination gather and broadcasting messages together with unicast message passing: (a) gather within rows; (b) gather within columns across row leaders; (c) unicast-based gather across column leaders; (d) unicast-based broadcast to column leaders; (e) broadcast to row leaders within columns; and (f) broadcast within rows. (from [265])

For a 2-D mesh, the scheme uses six phases as shown in Figure 5.46. The first phase consists of using gather messages within rows. For every row having at least two participating processors, a gather message is initiated by the rightmost participating processor and is consumed by the leftmost participating one. Let us designate the leftmost participating processors as row leaders. Now for every column having at least two row leaders, a gather message can be initiated by the bottom row leader to gather information along that column and finally be consumed by the top row leader. Let these top row leaders be designated as column leaders. After these two phases, information needs only to be gathered from the column leaders. It can be easily seen that these column leaders fall into disjoint rows and columns based on the grouping done. Hence no further reduction can be achieved with multidestination messages on systems supporting XY routing. If the base routing algorithm supports adaptivity, then further reduction can still be achieved by using adaptive paths.

The third phase consists of unicast-based message passing. This phase implements gather among column leaders in a tree-like manner using unicast-based message passing in every step. Once this reduction phase is over, the participating upper-left corner processor has information that all other processors have reached their respective barriers. This completes the report phase of barrier synchronization. Now the wake-up phase of barrier synchronization can be achieved by a series of broadcast phases. The fourth phase involves broadcast from the upper-left corner processor to column leaders in a tree-like manner using unicast-based multicast (see Section 5.7). The fifth phase involves broadcast by column leaders to row leaders using multidestination messages. During the final step, the row leaders use broadcast messages in their respective rows to wake up the associated processors.

In this six-phase scheme, every phase except the third and the fourth needs only one communication step. The number of communication steps needed in the third and the fourth phases depends on the number of column leaders involved. This issue was discussed in Section 5.5.3.
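
A small sketch of the leader selection behind the first two phases follows; the handling of rows or columns with a single participant is our own simplification, not part of the scheme in [265].

def barrier_leaders(participants):
    # participants: set of (row, col) coordinates, column 0 on the left and
    # row 0 at the top. The leftmost participant of each row is taken as its
    # row leader (a lone participant trivially leads its own row); the
    # topmost row leader in each column becomes a column leader.
    by_row = {}
    for r, c in participants:
        by_row.setdefault(r, []).append(c)
    row_leaders = {(r, min(cs)) for r, cs in by_row.items()}

    by_col = {}
    for r, c in row_leaders:
        by_col.setdefault(c, []).append(r)
    column_leaders = {(min(rs), c) for c, rs in by_col.items()}
    return row_leaders, column_leaders

# Only the column leaders remain for the unicast-based third phase; since each
# row contributes at most one row leader, the column leaders always fall into
# disjoint rows and disjoint columns.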

5.6.3 Reduction and Global Combining

The reduction operation can be implemented by using multidestination gather messages. As its name indicates, this message gathers information from multiple processors as it propagates. The reduction operation can be any associative and commutative function (sum, max, min, or user defined) as defined under collective communication by the Message Passing Interface (MPI) standard [242]. Global combining can be achieved by broadcasting the result of the reduction operation to all the participating processors. This can be implemented by using multidestination broadcast/multicast messages. Barrier synchronization is a special case of global combining where there are no data (just an event). Hardware support for the broadcast/multicast operation has been described in Section 5.5.3. Hence, in this section we will mainly focus on the reduction operation.

Unlike gather messages for barrier synchronization, the reduction operation requires messages to carry some data. Also, the reduction operation to be performed on data must be encoded in the message. Figure 5.47(a) shows a suitable format for a gather message [266]. The msg type and id fields have already been described in Section 5.6.1. Also, destination addresses are encoded as bit strings. The function field indicates the type of reduction operation (sum, max, min, etc.). Finally, the message contains some data flits on which the reduction operation will be performed.

image

Figure 5.47 Architectural support for implementing the reduction operation: (a) message format; (b) message buffers and logic at the router interface. (id = synchronization id; ma = message arrived/not arrived; sa = self arrived/not arrived; sdata = self data.) (from [265])

The movement of gather messages through the network was described in Section 5.6.1. However, the tasks performed when a multidestination gather message arrives at an intermediate destination depend on the operation to be performed. In this section we give a more general description.

When several processors participate in a reduction operation, it may happen that a multidestination gather message reaches an intermediate processor and this processor has not reached the point when the reduction operation is executed. Similar to barrier synchronization, the message must be stored in a buffer of the router interface, removing it from the network. Therefore, VCT switching is assumed.

The router interface organization required to implement the reduction operation also differs from the one presented in Section 5.6.1. Figure 5.47(b) shows a possible router interface organization with m buffers [266]. Each buffer has a few bits for id, a flag sa to indicate whether its associated processor has arrived at the reduction point during its execution or not, a flag ma indicating whether the message for the corresponding id has arrived or not, a buffer sdata to hold the data supplied by its associated processor, a buffer message to hold the incoming message, and a buffer result to hold the result. These buffers can be accessed by the associated processor.

A multidestination gather message, after arriving at the router interface of an intermediate destination, checks for the flag sa to be set on the buffer carrying the same id as that of itself. If this flag is set, it indicates that the processor has also arrived at its reduction execution point and has supplied data in the buffer sdata. Now the appropriate logic (indicated by the function field of the message) gets activated and operates on sdata and the data portion of the message to produce result. If the flag sa is not set, the processor has not supplied its data to the router interface yet. In this case, the message is stored in the message field of the buffer carrying the same id and the flag ma is set. The logic operation will start as soon as the processor arrives at its reduction execution point. Once the logic operation is over at the current router, the message is forwarded to the next destination while replacing data of the message with the result. Like this, the message moves ahead step by step while gathering results on its way. Finally, the message gets consumed by the router of the last destination, and the gathered result is made available to the corresponding processor. The operation of a gather message traversing k processors on a path can be expressed more formally as follows:

gather[0, k − 1] = sdata0 ⊗ sdata1 ⊗ … ⊗ sdatak−1

where sdatai is the data item associated with processor pi, the operation ⊗ specifies the required reduction function, and gather[0, k − 1] is the result gathered by the message. The result computed at the router interface can also be made available to each intermediate processor. Note that this result is the parallel prefix computation of operation ⊗ over data items sdata associated with the processors already traversed by the message.
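
The following Python sketch models the buffer handling at an intermediate destination. The flag and field names follow Figure 5.47(b); the function interfaces, and the modeling of a message as a dictionary with a data field, are purely illustrative.

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ReduceBuffer:
    sa: bool = False               # self arrived: local data has been supplied
    ma: bool = False               # a gather message is waiting for local data
    sdata: Optional[Any] = None    # data supplied by the local processor
    message: Optional[dict] = None # a stored gather message (VCT switching)
    result: Optional[Any] = None   # partial result, also readable by the processor

def on_gather_message(buf, msg, op):
    # Combine the incoming data with local data if available; return the
    # message to forward (data replaced by the result) or None if it must wait.
    if buf.sa:
        buf.result = op(buf.sdata, msg["data"])
        return dict(msg, data=buf.result)
    buf.message, buf.ma = msg, True
    return None

def on_reduction_point(buf, value, op):
    # The local processor reaches its reduction point and supplies its data;
    # a waiting message, if any, is combined and returned for re-injection.
    buf.sdata, buf.sa = value, True
    if buf.ma:
        buf.ma = False
        msg, buf.message = buf.message, None
        return on_gather_message(buf, msg, op)
    return None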

As mentioned above, global combining can be achieved by broadcasting the result of the reduction operation to all the participating processors. When performance is critical, global combining can be performed in half the number of steps by using exchange messages and a slightly different buffer structure at the router interface [266]. In particular, the single flag ma gets substituted by two flags (pma for positive message and nma for negative message). Similarly, the message field of each buffer gets replaced by two fields (pmessage and nmessage).

Figure 5.48 shows the implementation of the global combining operation on a linear array with a pair of positive and negative exchange messages. Each exchange message acts very much like a gather message. However, the router interface executes the reduction operation (sum, max, min, etc.) when the processor has arrived at its reduction execution point and both the positive and negative exchange messages have already arrived. It can be easily seen that the result of this operation at each router interface is the result of the global combining operation.

image

Figure 5.48 Implementation of the global combining operation on a linear array with a pair of positive and negative exchange messages. (from [266])

This scheme can be easily extended to meshes. Global combining can be achieved on a 2-D mesh by considering it as a set of linear arrays. As shown in Figure 5.49, exchange is first performed along all rows in parallel. At the end of this step, each processor has the result of all the processors in the same row. The second step involves a similar exchange operation along the columns in parallel and operates on the result of the first step. It can be easily seen that this two-step algorithm implements global combining over all processors with the final result being available to all processors.

image

Figure 5.49 Global combining on a 2-D mesh in two steps using exchange messages. (from [266])
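
The two-step result can be checked with a small model that computes only the values each processor ends up holding, not the exchange messages themselves; it is an illustrative sketch, not a description of the hardware mechanism.

from functools import reduce

def global_combine_mesh(data, op):
    # data[r][c] is the value at processor (r, c); the returned matrix gives
    # the value every processor holds after the two exchange steps.
    rows = len(data)
    cols = len(data[0])
    # Step 1: exchange along each row; every processor gets its row's result.
    row_result = [reduce(op, data[r]) for r in range(rows)]
    # Step 2: exchange along each column, operating on the row results.
    final = reduce(op, row_result)
    return [[final] * cols for _ in range(rows)]

# Global sum over a 2 x 3 mesh: every processor ends up with 21.
assert global_combine_mesh([[1, 2, 3], [4, 5, 6]], lambda a, b: a + b)[1][2] == 21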

5.7 Software Implementations of Multicast

Hardware implementations of multicast communication are desirable as they would offer better performance. However, most existing wormhole-switched multicomputers support only unicast communication in hardware. In these environments, all communication operations must be implemented in software by sending one or more unicast messages. For instance, the multicast operation may be implemented by sending a separate copy of the message from the source to every destination. Depending on the number of destinations, such separate addressing may require excessive time, particularly in a one-port architecture in which a local processor may send only one message at a time. Assuming that the start-up latency dominates the communication latency, separate addressing achieves a communication latency that increases linearly with the number of destinations.

Performance may be improved by organizing the unicast messages as a multicast tree, whereby the source node sends the message directly to a subset of the destinations, each of which forwards the message to one or more other destinations. Eventually, all destinations will receive the message.

The potential advantage of tree-based communication is apparent from the performance of various broadcast methods. For example, in the spanning binomial tree algorithm described in Section 5.5.2, the number of nodes that already received the broadcast message is doubled after each step. Hence, assuming a one-port architecture, communication latency increases logarithmically with the number of destinations.
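
The contrast between the two growth rates is easy to quantify. The following sketch counts message-passing steps under the stated one-port, start-up-dominated assumption.

import math

def separate_addressing_steps(destinations):
    # One unicast start-up per destination on a one-port architecture.
    return destinations

def binomial_tree_steps(destinations):
    # The set of informed nodes doubles each step, so ceil(log2(d + 1)) steps
    # suffice to cover the source plus d destinations.
    return math.ceil(math.log2(destinations + 1))

# Broadcasting to 63 destinations: 63 start-ups versus 6 tree steps.
assert separate_addressing_steps(63) == 63
assert binomial_tree_steps(63) == 6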

It is possible to reduce latency even more if the router interface has several ports, allowing nodes to inject several messages simultaneously. However, in this section we will only consider algorithms for a one-port architecture because most commercial multicomputers have a single port. See [232] and [153, 231, 295, 342] for a survey and detailed descriptions of multicast/broadcast algorithms for multiport architectures, respectively.

5.7.1 Desirable Features in Multicast Trees

Which types of multicast trees should be used depends on the switching technique and unicast routing algorithm. The following features are desirable in the software implementation of a multicast tree:

1. No local processors other than the source and destination processors should be involved in the implementation of the multicast tree.

2. The implementation should exploit the distance insensitivity of wormhole switching.

3. The height of the multicast tree should be minimal. Specifically, for m − 1 destination nodes, the minimum height is k = ⌈log2 m⌉.

4. There should be no channel contention among the constituent messages of the multicast. In other words, the unicast messages involved should not simultaneously require the same channel. Note that this feature does not eliminate contention with other unicast or multicast messages.

How to achieve these goals depends on the switching technique and unicast routing algorithm of the network. Although the user has no control over the routing of individual messages, the designer may be able to reduce or eliminate channel contention by accounting for the routing algorithm in defining the set of unicast messages and the order in which they are transmitted. The following (small-scale) example illustrates the issues and difficulties involved in implementing efficient multicast communication in wormhole-switched networks that use dimension-order routing.

EXAMPLE 5.14

Consider the 4 × 4 2-D mesh in Figure 5.50, and suppose that a multicast message is to be sent from node (2, 1) to seven destinations {(0, 0), (1, 0), (2, 0), (1, 1), (1, 2), (3, 2), (1, 3)}.

image

Figure 5.50 An example of multicast in a 4 × 4 mesh.

In early direct network systems using SAF switching, the procedure shown in Figure 5.51(a) could be used. At step 1, the source sends the message to node (1, 1). At step 2, nodes (2, 1) and (1, 1) inform nodes (2, 0) and (1, 2), respectively. Continuing in this way, this implementation requires four steps to reach all destinations. Node (3, 1) is required to relay the message, even though it is not a destination. Using the same routing strategy in a wormhole-switched network also requires four steps, as shown in Figure 5.51 (b). In this case, however, only the router at node (3, 1) is involved in forwarding the message. Hence, the message may be passed from (2, 1) to (3, 2) in one step, and no local processors other than the source and destinations are involved in sending the message.

image

Figure 5.51 Unicast-based software multicast trees: (a) a multicast tree based on store-and-forward switching; (b) a multicast tree based on wormhole switching; (c) collision occurs in step 3 at the channel between (1, 1) and (1, 2); (d) collision may occur at the channel between (1, 1) and (1, 2) if the sending and receiving latencies are large and approximately equal in value; and (e) collision-free multicast tree.

In Figure 5.51(c), the branches of the tree are rearranged to take advantage of the distance insensitivity of wormhole switching. The local processor at each destination receives the message exactly once. Using this method, the number of steps is apparently reduced to three. However, closer inspection reveals that the message sent from node (1, 0) to node (1, 2) and the message sent from node (2, 1) to node (1, 3) in step 3 use a common channel, namely, the [(1, 1), (1, 2)] channel. Consequently, these two unicasts cannot take place during the same step, and again four steps are actually required.

This situation is rectified in Figure 5.51(d), where only three steps are required. No local processors other than the source and destinations are involved, and the messages sent within a particular step do not contend for common channels. In practice, however, the message-passing steps of the multicast operation may not be ideally synchronized, and contention may arise among messages sent in different steps. As indicated in Section 5.2, start-up latency includes system call time at both the source and destination nodes; these latencies are termed the sending latency and receiving latency, respectively. If these latencies are large relative to the network latency, messages can be sent concurrently, but in different steps. For example, assuming that both sending latency and receiving latency have the same value, denoted by t, and that network latency is negligible, the labels in Figure 5.51(d) indicate when a copy of the message will enter and leave each node. Assuming a one-port architecture, the latency between two consecutive sends at a particular node is t. Leaf nodes in the tree do not send messages, and therefore encounter only receiving latency. For other destinations (intermediate nodes in the tree), both receiving latency and sending latency are incurred. Under these conditions, node (1, 0) may not have finished receiving the message from node (2, 1) until after node (2, 1) has finished sending to node (3, 2) and started sending to node (1, 3). If node (1, 0) sends to node (1, 2) at this time, 3t, then contention will occur for the [(1, 1), (1, 2)] channel. The multicast tree in Figure 5.51(e), which is based on the methods presented in the following sections, is contention-free regardless of message length or receiving latency.
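
The timing argument can be made concrete with a toy model. The sketch below uses our own simplifying assumptions (a node forwards only after it has finished receiving; network latency is ignored), so its numbers are not the labels of Figure 5.51(d), but it shows how consecutive sends at a one-port node stagger the receive-completion times by t.

def receive_times(tree, root, t):
    # tree maps a node to the ordered list of children it sends to; each send
    # occupies the sender for t, each receive costs the receiver another t,
    # and a node starts forwarding only after it has finished receiving.
    done = {root: 0}

    def visit(node):
        busy = done[node]
        for child in tree.get(node, ()):
            busy += t                 # sending latency at the parent
            done[child] = busy + t    # receiving latency at the child
            visit(child)

    visit(root)
    return done

# A source with three children completes its sends at t, 2t, and 3t, so the
# children finish receiving at 2t, 3t, and 4t under this model.
assert receive_times({"s": ["a", "b", "c"]}, "s", 1) == {"s": 0, "a": 2, "b": 3, "c": 4}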

5.7.2 Dimension-Ordered Chains

Developing an algorithm that produces minimum-time, contention-free multicast implementations for a specific system requires a detailed understanding of potential conflicts among messages, which in turn are dependent on the routing algorithm used. This section formulates a method to avoid contention among unicast messages under the most common routing algorithm for wormhole-switched n-dimensional meshes, namely, dimension-order routing. This method was presented in [234].

A few preliminaries are in order. A node address x in a finite n-dimensional mesh is represented by σn−1(x)σn−2(x) … σ0(x). Under a minimal deterministic routing algorithm, all messages transmitted from a node x to a node y will follow a unique shortest path between the two nodes. Let such a path be represented as P(x, y) = (x; z1, z2, …, zk; y), where the zi are the sequence of intermediate routers visited by the message. In order to simplify the presentation, we let z0 = x and zk+1 = y.

In order to characterize contention among messages transmitted under dimension-order routing, an ordering on nodes in an n-dimensional mesh is needed. The multicast algorithms described herein are based on lexicographic ordering of the source and destination nodes according to their address components. Actually, two such orderings are possible: one in which the subscripts of address components increase from right to left, and another in which the subscripts are reversed. Which ordering is appropriate for multicasting in a given system depends on whether addresses are resolved, under dimension-order routing, in decreasing or increasing dimension order. Here it is assumed that addresses are resolved in increasing dimension order, and we will refer to the ordering relation as dimension order.

DEFINITION 5.6

The binary relation dimension order, denoted <d, is defined between two nodes x and y as follows: x <d y if and only if either x = y or there exists an integer j such that σj(x) < σj(y) and σi(x) = σi(y) for all i, 0 ≤ i ≤ j − 1.

Since <d is just lexicographic ordering, it is a total ordering on the nodes in an n-dimensional mesh. Therefore, it is reflexive, antisymmetric, and transitive. Given a set of node addresses, they can be arranged in a unique, ordered sequence according to the <d relation.
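
In code, the relation amounts to a lexicographic comparison read from dimension 0 upward, as the following sketch (with our own helper names) makes explicit.

def d_key(addr):
    # addr lists the components high dimension first, (sigma_{n-1}, ..., sigma_0);
    # dimension 0 is compared first, so the tuple is reversed.
    return tuple(reversed(addr))

def d_less(x, y):
    # x <d y for distinct nodes (the relation as defined also allows x = y).
    return d_key(x) < d_key(y)

# With components written (sigma_1, sigma_0):
assert d_less((1, 0), (0, 1))   # they differ first in dimension 0
assert d_less((0, 1), (2, 1))   # equal in dimension 0, ordered by dimension 1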

DEFINITION 5.7

A sequence of nodes x1, x2, …, xm is a dimension-ordered chain if and only if all the elements are distinct and either (1) xi <d xi+1 for 1 ≤ i < m or (2) xi <d xi−1 for 1 < i ≤ m.

The following lemmas address contention among messages sent between nodes whose addresses are arranged as a dimension-ordered chain [235].

LEMMA 5.1

If u <d v <d x <d y, then dimension-ordered routes P(u, v) and P(x, y) are arc disjoint.

LEMMA 5.2

If u <d v <d x <d y, then dimension-ordered routes P(y, x) and P(v, u) are arc disjoint.

Lemmas 5.1 and 5.2 are critical to the development of efficient multicast algorithms because they indicate how channel contention may be avoided. The chain algorithm is a distributed algorithm that can be used to multicast a message from a source node to one or more destinations. The algorithm applies to situations in which the address of the source node is either less than or greater than those of all the destinations, according to the <d relation. Figure 5.52 gives the chain algorithm executed at each node. The source address and the destination addresses are arranged as a dimension-ordered chain in either increasing or decreasing order, with the source node occupying the position at the left end of the chain. The source node sends first to the destination node halfway across the chain, then to the destination node one-quarter of the way across the chain, and so on. Each destination receives a copy of the message from its parent in the tree and may be responsible for forwarding the message to other destinations. The message carries the addresses of those nodes to be in the subtree rooted at the receiving node. The chain algorithm is designed to produce minimum-time multicast implementations on top of dimension-ordered unicast routing. Although some messages are passed through multiple routers before reaching their destinations, it turns out that channel contention will not occur among the messages, regardless of message length or start-up latency—referred to as depth-contention-free. The following theorem forms the basis for developing software-based multicast algorithms.

image

Figure 5.52 The chain algorithm for multicast.
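
The following Python sketch of the chain algorithm is only a reading aid for Figure 5.52; it assumes the executing node occupies the left end of its chain, and send(dest, subchain) models a unicast carrying the addresses of the subtree rooted at dest.

def chain_multicast(chain, send):
    # chain: the dimension-ordered chain with the executing node first,
    # followed by the destinations it is responsible for.
    while len(chain) > 1:
        half = (len(chain) + 1) // 2   # keep the half containing this node
        upper = chain[half:]
        send(upper[0], upper)          # the receiver handles the rest of `upper`
        chain = chain[:half]

# With a source and seven destinations, the source sends log2(8) = 3 messages:
# to the node halfway across the chain, then a quarter of the way, then to its
# immediate successor in the chain.
sends = []
chain_multicast(["s", "d1", "d2", "d3", "d4", "d5", "d6", "d7"],
                lambda dest, sub: sends.append(dest))
assert sends == ["d4", "d2", "d1"]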

THEOREM 5.2

A multicast implementation resulting from the chain algorithm is a depth-contention-free, minimum-time implementation.

EXAMPLE 5.15

Figure 5.53 shows the steps of a multicast implementation resulting from the chain algorithm in a 4 × 4 2-D mesh that uses XY routing. The set of nodes involved is the same as in Figure 5.50; however, node (0, 0) is the source rather than node (2, 1). The source and the seven destinations have been arranged as a dimension-ordered chain. In this case, the X dimension is considered the low-order dimension, and the Y dimension is considered the high-order dimension.

image

Figure 5.53 Multicast chain example in a 4 × 4 mesh.

5.7.3 Multicast in Hypercubes

As presented above, the chain algorithm is only applicable to those cases in which the source address is less than or greater than (according to <d) all the destination addresses. Clearly, this situation is not true in general. For a hypercube network in which e-cube routing is used, it is straightforward to construct a depth-contention-free multicast algorithm using the chain algorithm. Specifically, the symmetry of the hypercube effectively allows the source node to play the role of the first node in a dimension-ordered chain. The exclusive-OR operation, denoted ⊕, is used to carry out this task.

DEFINITION 5.8

A sequence d1, d2, …, dm−1 of hypercube addresses is called a d0-relative dimension-ordered chain if and only if d0 ⊕ d1, d0 ⊕ d2, …, d0 ⊕ dm−1 is a dimension-ordered chain.

Let d0 be the address of the source of a multicast with m − 1 destinations. The source can easily sort the m − 1 destinations into a d0-relative dimension-ordered chain, Φ = d1, d2, …, dm−1. The source may then execute the chain algorithm using Φ instead of the original addresses. The multicast tree resulting from this method is called a Unicast-cube, or U-cube, tree. An interesting and useful property of the U-cube tree involves broadcast: the well-known binomial tree [334] is a special case of the U-cube tree when the source node and all destinations form a subcube. Also, the implementation constituting a U-cube tree is a depth-contention-free, minimum-time implementation.
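
A short sketch of the chain construction follows (the helper names are ours); the assertion reproduces the source-relative chain worked out in Exercise 5.3 at the end of the chapter.

def u_cube_chain(source, destinations, n):
    # Addresses are integers; bit i is dimension i, so the <d comparison of
    # the XORed (source-relative) addresses reads from bit 0 upward.
    def rel_key(addr):
        rel = addr ^ source
        return tuple((rel >> i) & 1 for i in range(n))
    return [source] + sorted(destinations, key=rel_key)

# The (01010)-relative chain of Exercise 5.3:
# 01010, 11010, 11110, 01011, 00111, 10001, 01101.
chain = u_cube_chain(0b01010,
                     [0b00111, 0b01011, 0b01101, 0b10001, 0b11010, 0b11110], 5)
assert chain == [0b01010, 0b11010, 0b11110, 0b01011, 0b00111, 0b10001, 0b01101]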

5.7.4 Multicast in Meshes

Unlike hypercubes, n-dimensional meshes are not symmetric. The source address may lie in the middle of a dimension-ordered chain of destination addresses, but the exclusive-OR operation is not applicable in the implementation of depth-contention-free communication. However, another relatively simple method may be used, again based on the chain algorithm, to address this problem.

The U-mesh algorithm is given in Figure 5.54. The source and destination addresses are sorted into a dimension-ordered chain, denoted Φ, at the time when multicast is initiated by calling the U-mesh algorithm. The source node successively divides Φ in half. If the source is in the lower half, then it sends a copy of the message to the smallest node (with respect to <d) in the upper half. That node will be responsible for delivering the message to the other nodes in the upper half, using the same U-mesh algorithm. If the source is in the upper half, then it sends a copy of the message to the largest node in the lower half. In addition to the data, each message carries the addresses of the destinations for which the receiving node is responsible. At each step, the source deletes from Φ the receiving node and those nodes in the half not containing the source. The source continues this procedure until Φ contains only its own address. Note that if the source happens to lie at the beginning or end of Φ, then the U-mesh algorithm degenerates to the chain algorithm.

image

Figure 5.54 The U-mesh algorithm.

In addition, when executed at an intermediate node in the tree, the U-mesh algorithm is simply the chain algorithm.
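
The source's part of the U-mesh algorithm can be sketched as follows. This is an illustrative reading of Figure 5.54 (the odd-length split is chosen to match the example that follows); the assertion reproduces the sequence of sends worked out in Example 5.16 below.

def u_mesh_source(source, chain, send):
    # chain: the dimension-ordered chain of the source plus all destinations.
    # send(dest, responsible) models a unicast carrying the addresses the
    # receiver must deliver to (it then behaves as in the chain algorithm).
    phi = list(chain)
    while len(phi) > 1:
        half = (len(phi) + 1) // 2           # odd-length split as in Example 5.16
        lower, upper = phi[:half], phi[half:]
        if source in lower:
            send(upper[0], upper)            # smallest node of the upper half
            phi = lower
        else:
            send(lower[-1], lower)           # largest node of the lower half
            phi = upper

# Example 5.16: source (3, 3) sends, in order, to (2, 3), (3, 4), (3, 0), (3, 2).
sent = []
u_mesh_source((3, 3),
              [(0, 1), (0, 2), (0, 4), (1, 0), (1, 3), (1, 5), (2, 0), (2, 2),
               (2, 3), (2, 5), (3, 0), (3, 2), (3, 3), (3, 4), (3, 5), (4, 1), (5, 2)],
              lambda dest, resp: sent.append(dest))
assert sent == [(2, 3), (3, 4), (3, 0), (3, 2)]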

EXAMPLE 5.16

Figure 5.55 depicts a multicast in a 6 × 6 mesh. Node (3, 3) is the source of a multicast message destined for the 16 shaded nodes. Figure 5.56 shows the result of the U-mesh algorithm for this example; intermediate routers are not shown. The source begins with a dimension-ordered chain Φ = (0, 1), (0, 2), (0, 4), (1, 0), (1, 3), (1, 5), (2, 0), (2, 2), (2, 3), (2, 5), (3, 0), (3, 2), (3, 3), (3, 4), (3, 5), (4, 1), (5, 2). As shown in Figure 5.56, the source (3, 3) first sends to node (2, 3), the node with the highest address in the lower half of Φ. The lower half is deleted from Φ, and therefore the nodes remaining in Φ are (2, 5), (3, 0), (3, 2), (3, 3), (3, 4), (3, 5), (4, 1), (5, 2). Node (3, 3) next sends to node (3, 4), the node with the lowest address in the upper half. The new sequence Φ becomes (2, 5), (3, 0), (3, 2), (3, 3). The next recipient is the node with the highest address in the lower half of Φ, namely, (3, 0). Finally, node (3, 3) sends to node (3, 2). Each of the receiving nodes is likewise responsible for delivering the message to the nodes in its subtree using the chain algorithm. As shown in Figure 5.56, the multicast operation requires five steps.

image

Figure 5.55 U-mesh regions for 16 destinations in a 2-D mesh.

image

Figure 5.56 U-mesh tree for 16 destinations in a 2-D mesh.

Inspection of Figures 5.55 and 5.56 shows that if the constituent unicast messages follow XY routing, then no contention is possible among them. Two regions, low and high, are defined on either side (with respect to <d) of the source node. By the construction of the U-mesh algorithm, any message sent by a node i in the high region will be destined for another node j, i <d j, in the high region. Similarly, any message sent by a node i in the low region will be destined for another node j, j <d i, in the low region. Stated in other terms, any reachable set includes nodes in either the low region or the high region, but not both. This property can be used to prove depth-contention-free message transmission within each region and, furthermore, that no channel contention can exist on the boundary between the two regions. This is stated by the following theorem:

THEOREM 5.3

The multicast implementation constituting a U-mesh tree is a minimum-time, depth-contention-free implementation.

However, contention may arise between concurrent multicast operations initiated by different nodes, especially when destination sets are identical. Contention can be reduced by modifying the dimension-ordered chain according to the position of the source node in the chain. The source-partitioned U-mesh (SPUmesh) algorithm [173] performs a rotate-left operation on the dimension-ordered chain Φ for a source and set of destinations. The rotation produces a new chain Φ1 whose first element is the source of the multicast. Then, the U-mesh algorithm is performed on Φ1.

5.8 Engineering Issues

When designing a parallel computer, the designer faces many trade-offs. One of them concerns providing support for collective communication. Depending on the architecture of the machine, different issues should be considered.

Multicomputers usually rely on message passing to implement communication and synchronization between processes executing on different processors. As indicated in Section 5.3.1, supporting collective communication operations may reduce communication latency even if those operations are not supported in hardware. The reason is that system calls and software overhead account for a large percentage of communication latency. Therefore, replacing several unicast message-passing operations by a single collective communication operation usually reduces latency significantly. For example, when a processor needs to send the same message to many different processors, a single multicast operation can replace many unicast message transmissions. Even if multicast is not supported in hardware, some steps like system call, buffer reservation in kernel space, and message copy to the system buffer are performed only once. Also, when multicast is not supported in hardware, performance can be considerably improved by using the techniques described in Section 5.7 to organize the unicast messages as a multicast tree. Using those techniques, communication latency increases logarithmically with the number of destinations. Otherwise it would increase linearly. Obviously, implementing some hardware support for collective communication operations will speed up the execution of those operations even more.

On the other hand, communication between processes is usually performed in shared-memory multiprocessors by accessing shared variables. However, synchronization typically requires some hardware support. Barrier synchronization involves a reduce operation followed by a broadcast operation. Moreover, distributed shared-memory multiprocessors with coherent caches rely on a cache coherence protocol. Different copies of the same cache line are kept coherent by using write-invalidate or write-update protocols. Both invalidate and update commands may benefit from implementing hardware support for collective communication operations. Invalidations can be performed by sending a multicast message to all the caches having a copy of the block that is to be written to. Acknowledgments can be gathered from those caches by performing a reduce operation. Updates can also be performed by sending a multicast message to all the caches having a copy of the block.

Adding hardware support for collective communication increases cost and hardware complexity, possibly slowing down the routing hardware. Now, the question is whether or not it is useful to provide hardware support for collective communication. Some parallel computers provide support for a few operations. The nCUBE-2 (wormhole-switched hypercube) [248] supports broadcast within each subcube. The NEC Cenju-3 (wormhole-switched unidirectional MIN) [183] supports broadcast within each contiguous region. The TMC CM-5 [202] supports one multicast at a time via the control network. Unfortunately, in some cases that support was not properly designed. In the nCUBE-2 and the NEC Cenju-3 deadlock is possible if there are multiple multicasts.

One of the reasons for the lack of efficient hardware support for collective communication is that most collective communication algorithms proposed in the literature focused on networks with SAF switching. However, current multiprocessors and multicomputers implement wormhole switching. Path-based routing (see Section 5.5.3) was the first mechanism specifically developed to support multicast communication in direct networks implementing wormhole switching. However, the first path-based routing algorithms were based on Hamiltonian paths [211]. These algorithms are not compatible with the most common routing algorithm for unicast messages, namely, dimension-order routing. Therefore, it is unlikely that a system will take advantage of Hamiltonian path-based routing. Note that it makes no sense to sacrifice the performance of unicast messages to improve the performance of multicast messages, which usually represent a smaller percentage of network traffic. Additionally, path-based routing requires a message preparation phase, splitting the destination set into several subsets and ordering those subsets. This overhead may outweigh the benefits of using hardware-supported multicast. Fortunately, in some cases it is possible to perform the message preparation phase at compile time.

More recently, the BRCP model (see Section 5.5.3) has been proposed [269]. In this model, the paths followed by multidestination messages conform to the base routing scheme, being compatible with unicast routing. Moreover, this model allows the implementation of multicast routing on top of both deterministic and adaptive unicast routing, and therefore is suitable for current and future systems. So, it is likely we will see some hardware implementations of multicast routing based on the BRCP model in future systems. However, more detailed performance evaluation studies are required to assess the benefits of hardware-supported multicast. Also, as indicated in Section 5.5.3, several delivery ports are required to avoid deadlock. This constraint may limit the applicability of the BRCP model.

Efficient barrier synchronization is critical to the performance of many parallel applications. Some parallel computers implement barrier synchronization in hardware. For example, the Cray T3D [259] uses a dedicated tree-based network with barrier registers to provide fast barrier synchronization. Instead of using a dedicated network, it is possible to use the same network as for unicast messages by implementing the hardware mechanisms described in Section 5.6.1. Again, these mechanisms have been proposed very recently, and a detailed evaluation is still required to assess their impact on performance.

As indicated above, protocols for cache coherence may improve performance if multicast and reduce operations are implemented in hardware. In this case, latency is critical. Multicast can be implemented by using the BRCP model described in Section 5.5.3. The multidestination gather messages introduced in Section 5.6.3 can be used to collect acknowledgments with minimum hardware overhead. This approach has been proposed and evaluated in [69]; up to a 15% reduction in overall execution time was obtained.

The cost of the message preparation phase can be reduced by using tree-based multicast routing because it is not necessary to order the destination set. Tree-based routing usually produces more channel contention than path-based routing and is also prone to deadlock. However, the average number of copies of each cache line is small, reducing contention considerably. Also, the pruning mechanism proposed in Section 5.5.2 can be used to recover from deadlock. Tree-based multicast routing with pruning has been specifically developed to support invalidations and updates efficiently [224]. This mechanism has some interesting advantages: it requires a single start-up regardless of the number of destinations, therefore achieving a very small latency. It requires a single delivery channel per node. Also, it is able to deliver a message to all its destinations using only minimal paths. However, this mechanism only supports multicast. Support for the reduce operation is yet to be developed.

Most hardware mechanisms to support collective communication operations have been developed for direct networks. Recently some researchers focused on MINs. As shown in Section 5.5.2, MINs are very prone to deadlock when multicast is supported in hardware. Current proposals to avoid deadlock require either complex signaling mechanisms or large buffers to implement VCT switching. Up to now, no general solution has been proposed to this problem. Therefore, direct networks should be preferred if collective communication operations are going to be supported in hardware.

Finally, note that efficient mechanisms to support collective communication operations in hardware have been proposed very recently. Including hardware support for collective communication in the router may increase performance considerably if collective communications are requested frequently. However, in practice, most collective communication operations involve only a few nodes. In these cases, the performance gain achieved by supporting those operations in hardware is small. The only exception is barrier synchronization. In this case, hardware support reduces synchronization time considerably. Some manufacturers include dedicated hardware support for barrier synchronization. Whether this support should be extended to other operations remains an open issue.

5.9 Commented References

Since the proposal of the spanning binomial tree broadcast algorithm [334] for hypercubes, broadcast communication and its extensions (gossiping and personalized broadcast) have been extensively studied [147, 152, 168]. Multicast communication was first studied in [195, 196] for hypercubes based on VCT switching. The theoretical foundation of optimal multicast problems was laid out in [60, 210, 212]. The multicast deadlock problem was studied in [47] for networks using VCT switching. Three multicast protocols, multi-unicast, resumable, and restricted-branch multicast, were proposed to deal with the deadlock problem.

Hardware-supported multicast routing in wormhole-switched networks was first studied in [211] for hypercubes and 2-D meshes, proposing the path-based multicast routing scheme and the dual-path and multipath algorithms. Detailed performance study of path-based multicast routing was reported in [206, 208]. These multicast algorithms were based on Hamiltonian paths. Path-based adaptive multicast routing was first studied in [205], where three adaptive multicast routing algorithms for 2-D meshes, called PM, FM, and LD, were proposed. In [204], a path-based multicast algorithm was proposed for the hypercube, and different strategies to order the destination sets were evaluated. The first sufficient condition for deadlock-free adaptive multicast routing in wormhole-switched networks that allows cyclic dependencies between channels was proposed in [90, 96], where an adaptive multicast routing algorithm was also proposed. This algorithm was evaluated in [216] for 3-D meshes. The above-mentioned sufficient condition was used to propose efficient adaptive multicast routing algorithms for 2-D meshes in [109]. The path-based multicast mechanism was generalized in [343], where a trip-based model was proposed. This model can be applied to any connected network of arbitrary topology. Using this model, it is possible to construct trips such that at most two virtual channels per physical channel are needed to support multiple multicast on arbitrary network topologies.

The BRCP model for path-based multicast routing was first proposed in [269], where two schemes to implement multicast and broadcast, the hierarchical leader-based scheme and the multiphase greedy scheme, were also proposed. The BRCP model was applied to the development of mechanisms for barrier synchronization [265], reduction, and global combining [266]. For this purpose, gather and exchange messages were proposed in addition to multicast messages. Gather and multicast messages were also proposed to support write-invalidate protocols for DSMs in [69], which presented a detailed performance evaluation study.

The problem of deadlocks produced by delivery channels was independently reported in [38, 216, 269]. A formal solution to this problem was proposed in [268].

Tree-based multicast routing algorithms for wormhole-switched 2-D meshes were first proposed in [206]. Since then, tree-based multicast routing has been discarded for direct networks implementing wormhole switching. However, tree-based multicast routing has been recently proposed for direct networks to support write-invalidate and write-update protocols for DSMs [224]. This routing scheme relies on a pruning mechanism to recover from deadlock.

Tree-based multicast routing has been recently studied on MINs with wormhole switching in [55], where a deadlock avoidance scheme for synchronous message replication was proposed. Asynchronous tree-based multicast routing was proposed in [346], avoiding deadlock by serializing the initiations of multicast operations in groups of switches. In [322], a multiport encoding mechanism was proposed for MINs using VCT switching.

Several approaches have been proposed to implement fast barrier synchronization in hardware [20, 107, 141, 260, 261, 265, 355].

Unicast-based multicast communication for direct networks implementing wormhole switching was studied in [233, 234], where the chain algorithm as well as the U-cube and the U-mesh algorithms were proposed. This work focused on one-port architectures. For multiport and all-port architectures, several multicast/broadcast algorithms have been proposed [153, 231, 295, 342]. Optimal multicast trees based on a parameterized communication model have been studied in [271]. Unicast-based multicast algorithms have also been proposed for bidirectional [353] and unidirectional multistage interconnection networks [56].

Several schemes to encode multiple destination addresses in the message header were proposed in [54]. Finally, see [252] and [267] for a discussion of the issues in designing efficient and practical algorithms for collective communication, and [232] for a survey of collective communication in wormhole-switched networks.

Exercises

5.1. Consider a hypercube implementing VCT switching, and the multicast set K = {u0, u1, …, uk}, where u0 is the source node and u1, u2, …, uk (k ≥ 1) are k destination nodes. Propose a tree-based greedy multicast algorithm that works by sending a message through the dimension that can reach the greatest number of remaining destinations and repeating the procedure until all destinations have been processed. If there is a tie in the number of destinations that can be reached, any selection policy may be used.

    Solution Assume that for a node ui in an n-cube, its binary address is represented as ui,n−1 … ui,1 ui,0. Figure 5.57 shows a greedy multicast algorithm, referred to as a LEN tree, that was proposed by Lan, Esfahanian, and Ni (LEN) [195].

image

Figure 5.57 The LEN tree algorithm.

5.2. Consider a 5-cube and a multicast communication, where the source node is 00110 and the six destination nodes are 00111, 10100, 11101, 10010, 00001, and 00000. Show the multicast tree for the greedy multicast algorithm.

    Solution Figure 5.58 shows the multicast tree obtained when using the greedy multicast algorithm.

image

Figure 5.58 The multicast tree for the greedy multicast algorithm.

5.3. Consider a 5-cube and a multicast communication, where source node 01010 is sending to a set of six destinations {00111, 01011, 01101, 10001, 11010, 11110}.

    Compute the source-relative, dimension-ordered chain and show the corresponding U-cube tree.

    Solution Taking the exclusive-OR of each destination address with 01010 and sorting the results produces the (01010)-relative dimension-ordered chain Φ = 01010, 11010, 11110, 01011, 00111, 10001, 01101. The corresponding U-cube tree is shown in Figure 5.59. It takes three steps for all destination processors to receive the message.

image

Figure 5.59 Multicast chain example in a 5-cube.

5.4. Consider a unidirectional MIN using wormhole switching. Suppose that path-based multicast routing is implemented in that network. Show that path-based multicast routing produces deadlock in MINs.

    Solution Figure 5.60 shows an example of deadlock for a path-based multicast: a path from node 0 to nodes 4 and 5.

image

Figure 5.60 An example of deadlock for a path-based multicast.

5.5. Consider an implementation of tree-based multicast routing in a MIN using multiport encoding. The network has n stages and k × k switches. How many different destination sets can be reached in a single pass through the network?

    Solution Multiport encoding requires n k-bit strings, one for each stage. Each k-bit string indicates the output ports to which the message must be forwarded at the corresponding switch. The same jth k-bit string must be used at all the switches in stage j that route the message. So, a string with all 0s means that the message will not be forwarded through any port in any switch of the corresponding stage. Every other combination is a valid string and will deliver the message to one or more destinations. Therefore, if the network had a single stage, it would be possible to reach 2^k − 1 different destination sets. For n stages, the number of different destination sets is (2^k − 1)^n.

Problems

5.1. Consider a 5-cube and a multicast communication, where node 01010 is the source and nodes 00000, 10111, 00111, 10101, 11111, 10000, and 00001 are destinations.

image Show the corresponding LEN tree.

image Show the corresponding U-cube tree.

5.2. Consider a 7 × 7 mesh and a multicast communication, where node (1, 3) is the source and nodes (0, 0), (0, 6), (6, 0), (6, 6), (3, 3), and (3, 0) are six destinations.

image Show the corresponding U-mesh tree and the path followed by each message.

image Show the paths used by dual-path routing.

image Show the paths used by multipath routing.

5.3. Consider an implementation of multicast communication on a 2-D mesh based on the BRCP model. The base routing algorithm is north-last (see Section 4.3.2). Define a grouping strategy for destination nodes that minimizes the number of communication steps.

5.4. Consider a 2-D mesh using XY routing. Multicast communication is based on the BRCP model using XY routing as the base routing. Each node has four delivery channels to avoid deadlock, each one associated with the corresponding input channel. Now consider only the dependencies between delivery channels in adjacent nodes and draw the dependency graph for those channels on a 3 × 3 mesh. Compare this graph with the channel dependency graph for XY unicast routing (shown in Figure 3.20).

5.5. In a k-port communication model, each processor can simultaneously send messages (same or different) to k processors and simultaneously receive messages from k processors. Modify the chain algorithm for a two-port architecture.

5.6. Consider the architectural support for the implementation of barrier synchronization described in Section 5.6.1. Now consider the implementation of complete barrier synchronization in 2-D meshes as described in Section 5.6.2. Assume that the base routing is the west-first routing algorithm. Define a communication scheme to perform complete barrier synchronization in two steps. How does this scheme affect the architectural support?

5.7. In the study of various communication supports, most research has assumed a certain interconnect topology, such as mesh and hypercube. It is beneficial to study communication supports for a general communication model in order to obtain bounds for the communication latency. One such model is LogP, which assumes a fully connected topology. Thus, it is equally efficient to send a message between any pair of processors. Furthermore, the communication is organized in steps. During each step, each processor is able to send k messages to k other processors and to receive k messages from k other processors, assuming a k-port architecture. Now consider a parallel machine with n processors. One processor has to broadcast m messages to the other n − 1 processors. What is the minimum number of steps required to do such a broadcast for each of the following cases?

1. n = 2^i for some i, m = 1, k = 1

2. Arbitrary n, m = 1, k = 1

3. Arbitrary n and m, k = 1

4. n = 2^i for some i, m = 1, k = 2

5. Arbitrary n, m = 1, k = 2

5.8. Consider a fully connected topology with n processors and k ports per processor. How many steps are needed to perform a reduce operation? Show the corresponding algorithm as well.

5.9. Consider the implementation of path-based multicast routing in unidirectional MINs using wormhole switching. Even if deadlocks did not occur, would it be efficient?

5.10. Path-based routing was designed to support multicast communication. What modifications should be made in order to support scatter communication? What are the advantages and disadvantages of the proposed mechanism over multiple unicast routing?

5.11. Many of the proposed multicast routing algorithms were for mesh topologies. Consider the dual-path algorithm. Is it possible to modify this algorithm so that it works for torus topologies? If the answer is negative, explain the reasons.

    Otherwise, define a Hamiltonian path that takes advantage of wraparound channels and a multicast routing algorithm based on it.

5.12. Consider a 2-D torus network with two virtual unidirectional channels per physical unidirectional channel.

image Propose a dimension-ordered, deadlock-free unicast routing algorithm.

image Modify the above algorithm to support unicast-based multicast communication.

5.13. Discuss the advantages and disadvantages of using a group ID (GID) instead of a list of destination addresses in delivering a multicast message.
