CHAPTER 8

Messaging Layer Software

The majority of this text has concentrated on issues facing the design of the interconnection network: switching, routing, and supporting router architectures. However, from the user’s perspective, the latency/throughput performance experienced is a function not only of the network, but also of the operations performed at the source and destination processors to inject messages into and receive messages from the network. End-to-end performance can be measured from data located in the memory of the sending process to data in the memory of the receiving process. This includes the process of preparing data for transmission (e.g., computation of packet headers and checksums) and injecting/ejecting messages into/out of the network. In most systems, a large segment of this functionality is implemented in software: the messaging layer. With the dramatic improvements in raw link bandwidths, the time messages spend on the wire has become overshadowed by the software overheads at the sender and receiver.

These software overheads are determined by the functionality to be provided by the messaging layer and the hardware features available in the network and node interfaces to support this functionality. For example, the overhead in transferring message data from local memory to the network interface is determined by the internal node design and the availability of services such as DMA, interrupts, or memory-mapped interfaces. This chapter first discusses the major functions provided by current messaging layers and then examines how the implementation of these functions is affected by the services provided by the network hardware and the internal node architecture. The services of this messaging layer are the basis on which user-level message-passing libraries can be constructed. The chapter focuses on the message-passing programming model and does not include issues concerning the support of shared-memory abstractions on distributed-memory architectures. The data transfer requirements for implementing shared address spaces have similarities with the requirements of message-passing implementations, and several current research efforts are studying the integration of shared-memory and message-passing communication within a common framework. However, there appears to be a greater consensus on the requirements for supporting message-passing models, and this chapter focuses on examples from this domain.

Generally, an application programming interface (API) provides access to the services of the messaging layer through a set of library procedures and functions. A desirable API is one that presents clear semantics, is independent of a specific platform, and presents high-level abstractions that are easy to use. The Message Passing Interface (MPI) represents a standard message-passing API that has evolved out of the collective efforts of many people in academia, industry, and government over the past several years. It captures many of the desirable features of a message-passing API gleaned from collective experience with the development of several message-passing libraries. It is rapidly becoming the API of choice for portable parallel message-passing programming. Therefore, a brief introduction to MPI is provided in the latter half of this chapter. A full introduction to the MPI standard can be found in many excellent texts and papers (e.g., [138, 325]). An understanding of the design and implementation of the messaging layer can help motivate and clarify the semantics of the MPI interface and aid in developing efficient MPI-based programs.

8.1 Functionality of the Messaging Layer

The design of the messaging layer follows from an identification of the functions that must be performed in the delivery of a message. An example of the flow of data from the location in the source node’s memory to the destination node’s memory location is illustrated in Figure 8.1. While this represents only one implementation, it is fairly generic and captures functions found in many implementations. Let us suppose that a user process starts with a single word to be transmitted to a process being executed on another node. Transmission of this word is initiated via a call to a message-passing procedure such as send(buf, nbytes, dest), where buf contains the nbytes of data to be transmitted to node dest. A message packet must be created (packetization) with a header containing information required to correctly route the packet to its destination. This packet may also include other information (e.g., CRC codes) in addition to the data. Access to the network interface may not be immediately available. Therefore, the message may be buffered in system memory (copying and buffering) prior to injection into the network while the send() call returns to the main program. Until recently, network interfaces were largely treated as I/O devices. Traditionally, drivers that control such devices (interface control) were privileged and were available only through a system call (user/kernel transitions). More recently, the high overhead of such an approach has driven the evolution of more efficient schemes for transferring control to message handlers that execute in the user address space. Once the network interface has the message packet, it is injected into the network, where the routers cooperate in delivering the message to the destination node interface. When the message is received at the node, there must be some way to invoke the messaging layer software (e.g., interrupts, polled access, etc.). Similar device driver services may then be invoked to transfer the message from the network interface into temporary system buffers (to be copied later into user buffers) or directly into user buffers. Message transfer is now complete.
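
As a concrete illustration, the following C sketch traces the send side of this flow. The packet layout and the helpers (crc32(), ni_try_inject(), enqueue_system_buffer()) are hypothetical placeholders rather than any particular machine's interface.

/* A minimal sketch of the send path described above. All types and
 * helpers here are illustrative, not part of any real messaging layer.
 * Assumes nbytes <= MAX_PAYLOAD. */
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 1024

typedef struct {
    uint32_t dest;     /* destination node, used to compute the route */
    uint32_t length;   /* payload length in bytes */
    uint32_t crc;      /* integrity check over the payload */
    uint8_t  data[MAX_PAYLOAD];
} packet_t;

extern uint32_t crc32(const void *buf, uint32_t nbytes);
extern int  ni_try_inject(const packet_t *p);         /* nonzero if NI accepted it */
extern void enqueue_system_buffer(const packet_t *p); /* stage for later injection */

void send_msg(const void *buf, uint32_t nbytes, uint32_t dest)
{
    packet_t p;

    /* Packetization: build the header and compute the checksum. */
    p.dest = dest;
    p.length = nbytes;
    memcpy(p.data, buf, nbytes);
    p.crc = crc32(p.data, nbytes);

    /* Inject if the interface is free; otherwise copy into a system
     * buffer so that the nonblocking send() can return immediately. */
    if (!ni_try_inject(&p))
        enqueue_system_buffer(&p);
}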


Figure 8.1 A model for message transmission/reception.

We can view each of the above functions as essential to the process of transmitting messages. Collectively they determine the minimum latency of a message, and the slowest component (invariably a software component) determines the maximum bandwidth. Consider the issues in each of these steps.

8.1.1 Packetization

Computation of message headers, sequence numbers, parity, CRCs, checksums, and so on is pure overhead. When these operations are implemented in software, they can exact significant performance penalties. Most interfaces now implement the majority, if not all, of the packetization functions in hardware. For example, in the Cray T3D the interface hardware performs a table lookup to generate the routing tag [259], while it is possible to compute the CRC while copying data between buffers or during a DMA transfer [30]. Packetization overheads have been largely minimized in modern machines.
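
The overlap of checksumming with copying can be illustrated with a short sketch. A simple additive checksum stands in here for a real CRC, but the single-pass structure is the same for a table-driven CRC.

/* Fold checksum computation into the buffer copy so the data is touched
 * only once. */
#include <stddef.h>
#include <stdint.h>

uint32_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];   /* the copy the message layer must do anyway */
        sum += src[i];     /* checksum accumulated on the same pass */
    }
    return sum;
}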

8.1.2 Network Interface Control

As an I/O device, control of the network interface can take one of many forms. Data may be transferred to/from the interface using direct memory access (DMA). In this case the software must initialize the appropriate DMA channel and initiate DMA access. This can be as simple as requiring two instructions: one to load the starting address and one to load the counter, as on the nCUBE-2 [105]. If DMA is only allowed into certain portions of memory, then interaction with the operating system may be required. Pages used as targets of DMA or used to hold network interface data structures such as queues should be pinned down to prevent them from being swapped out.

Alternatively, the interface may be memory mapped and accessible from user space. The messaging software may initialize the interface with stores to memory locations corresponding to control registers. Message packets may be similarly transferred to interface memory. When a message is received, the messaging software may be invoked via interrupts or polling. Programmed I/O is yet another way to access the network interface if it is treated as an I/O device. Note that the software overheads of network interface control are a function of the mechanisms the node design provides for transferring data between memory and I/O devices.
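
The following sketch illustrates this style of interface control. The register layout and base address are hypothetical; a real driver would take them from the hardware manual and map the region with the operating system's help. Note that DMA initiation reduces to two stores, as in the nCUBE-2 description above.

#include <stdint.h>

#define NI_BASE 0x80000000UL   /* hypothetical base of the mapped interface */

typedef struct {
    volatile uint32_t dma_addr;  /* physical address of the (pinned) buffer */
    volatile uint32_t dma_count; /* writing the count starts the transfer */
    volatile uint32_t status;    /* bit 0: message waiting in receive queue */
} ni_regs_t;

static ni_regs_t *const ni = (ni_regs_t *)(uintptr_t)NI_BASE;

/* DMA initiation can be as simple as two stores. */
static inline void ni_start_dma(uint32_t phys_addr, uint32_t nbytes)
{
    ni->dma_addr  = phys_addr;
    ni->dma_count = nbytes;      /* side effect: transfer begins */
}

/* Polled reception: the message layer checks a status register instead
 * of taking an interrupt. */
static inline int ni_message_pending(void)
{
    return ni->status & 0x1;
}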

The network interface need not be on the I/O bus, however; it may sit higher in the hierarchy, such as on the memory bus [240], or be integrated into the processor [40]. Such tightly coupled interfaces are capable of delivering a higher percentage of the physical network bandwidth and eliminate the performance bottleneck of (relatively) slow I/O buses, but they remain specialized and are less likely to be compatible with evolving node architecture designs. Examples of alternatives for integrating the network interface into a common node architecture are illustrated in Figure 8.2. In practice, interfacing via the I/O bus is becoming increasingly popular due to the ease with which it can be used with commercial off-the-shelf processor and memory systems, as well as widely used bus and interface standards.


Figure 8.2 Examples of the placement of network interfaces. (NI = network interface; PCI = peripheral component interconnect.)

8.1.3 User/Kernel Interface

A major performance issue is whether the message software must execute in the kernel context or whether it can be run from the user level. Early message layer implementations had the message handlers execute in the kernel, necessitating expensive context swaps on each invocation. Messages from the network were first read into system buffers and then copied into user memory data structures. There are active efforts in the design of several current-generation message layers seeking implementations where the message handlers execute in the user context and have direct access to the network interfaces from the user level. In addition to the substantial savings in buffering and copying, the high overhead of frequent context switches is avoided.

Some node architectures (e.g., Intel Paragon) make use of a coprocessor to execute all message-passing functions. The interaction of the coprocessor with the compute processor may be interrupt driven or polled. This permits significant overlap between the message processing and computation; however, it may do little for latency. Since the message processor controls the network interface, similar considerations about protected communications arise when seeking to run message handlers in user context rather than the kernel on the coprocessor.

8.1.4 Copying and Buffering

Buffering policies are extremely important to the design of the message layer. They are crucial to correctness as well as performance. Network routing protocols remain deadlock-free under the consumption assumption: all messages destined for a node are eventually consumed. Deadlock freedom proofs that rely on this assumption (as virtually all such proofs do) implicitly assume memory of infinite extent. In reality, memory is necessarily limited, and therefore some flow control between senders and receivers is necessary to ensure that a fixed amount of storage can be allocated, deallocated, and reused over large numbers of messages while avoiding the loss of messages. Credit-based flow control, windowing schemes, and so on are examples of techniques for managing a limited amount of memory.

To ensure the availability of buffer space, the Intel iPSC/2 machines employed a three-trip protocol for sending long messages (> 100 bytes). An initial request message is transmitted to the receiver to allocate buffer space. When the reply message is received, the message itself can be transmitted. Various token protocols can also be employed, where each node has a number of tokens corresponding to buffers. To send a message, a processor must have a token from the destination. Tokens can be returned by piggybacking on other messages; a sketch of such a scheme follows this paragraph. The Fast Messages library [264] uses a return-to-sender optimistic flow control protocol for buffer management. Packets are optimistically transmitted after allocating buffer space at the source for the packet. If the packet cannot be received due to the lack of buffer space, it is returned to the sender, where buffer space is guaranteed to be available to receive the rejected packet and from where it can be retransmitted. If the packet is successfully delivered, acknowledgments are used to free buffer space at the source. This scheme has the advantage of requiring buffer space proportional to the number of outstanding message packets rather than having to preallocate space proportional to the number of nodes.
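
A minimal sketch of a credit (token) scheme of this kind appears below. Each counter tracks how many receive buffers a destination still guarantees to this sender; credits come back piggybacked on incoming messages. The constants and names are illustrative.

#include <stdint.h>

#define NUM_NODES        64
#define CREDITS_PER_DEST  4   /* preallocated receive buffers per sender */

static int credits[NUM_NODES];

void credits_init(void)
{
    for (int i = 0; i < NUM_NODES; i++)
        credits[i] = CREDITS_PER_DEST;
}

/* Returns nonzero if the message may be sent now; the caller queues the
 * message or retries otherwise. */
int try_consume_credit(int dest)
{
    if (credits[dest] == 0)
        return 0;             /* no guaranteed buffer: must wait */
    credits[dest]--;
    return 1;
}

/* Called when an incoming message carries piggybacked credit returns. */
void credits_returned(int src, int n)
{
    credits[src] += n;
}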

The overhead of buffer management can be measured as the time spent in the message handlers to acquire and release buffers, performing status updates such as marking buffers as allocated and free, and updating data structures that track available buffers. Generally memory is statically allocated to keep these management costs down.

In addition to the above functions, interprocessor communication is often expected to preserve other properties that may require additional functionality within the message layer. A detailed study of the source of software overheads in the message layer by Karamcheti and Chien [171] identified several functions implemented in the messaging layer to provide services that are not provided by the network, but are expected by the user-level programs. Detailed breakdowns of the cost of the message layer were also presented by Martin [225]. The analysis presented in these studies provides the following insights.

8.1.5 In-Order Message Delivery

Very large messages must be segmented into multiple packets for transmission and delivery. If the network does not guarantee delivery order, then some mechanism is necessary to reassemble the message at the destination. For example, adaptive routing may cause packets to arrive out of order since they may be routed along different paths. The use of virtual channels and virtual lanes may cause packets to overtake one another in transit. Reordering message packets may entail the use of sequence numbers within the message packet headers to enable reconstruction at the destination. The messaging layer may have to buffer out-of-order packets so that packets may be delivered to the user-level programs in order, incurring additional buffering delays. The costs occur as (1) time to place sequence numbers within packet headers, (2) latency/bandwidth effects due to larger headers, (3) buffering costs at the destination, and (4) checks for sequencing.
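
The following sketch illustrates sequence-number-based reordering at the destination. The packet format, window size, and deliver_to_user() are illustrative; arrivals ahead of the expected sequence number are parked in a small reorder window until the gap is filled, which is exactly the buffering cost noted above.

#include <stdbool.h>
#include <stdint.h>

#define WINDOW 16   /* out-of-order packets buffered per source */

typedef struct {
    uint32_t seq;
    uint32_t len;
    uint8_t  data[256];
} pkt_t;

typedef struct {
    uint32_t next_seq;   /* next sequence number owed to the user */
    bool     full[WINDOW];
    pkt_t    slot[WINDOW];
} reorder_t;

extern void deliver_to_user(const pkt_t *p);

/* Assumes sequence numbers stay within WINDOW of next_seq. */
void on_packet_arrival(reorder_t *r, const pkt_t *p)
{
    if (p->seq != r->next_seq) {
        /* Early packet: buffer it (costs memory and a copy). */
        r->slot[p->seq % WINDOW] = *p;
        r->full[p->seq % WINDOW] = true;
        return;
    }
    /* In-order packet: deliver it, then drain any buffered successors. */
    deliver_to_user(p);
    r->next_seq++;
    while (r->full[r->next_seq % WINDOW] &&
           r->slot[r->next_seq % WINDOW].seq == r->next_seq) {
        r->full[r->next_seq % WINDOW] = false;
        deliver_to_user(&r->slot[r->next_seq % WINDOW]);
        r->next_seq++;
    }
}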

8.1.6 Reliable Message Delivery

Most message-passing programs are predicated upon the reliable delivery of messages to ensure correct execution. Reliable delivery may be ensured by acknowledgments from the destination and may require buffering messages until such acknowledgments are received. The cost of managing these buffers and the processing of acknowledgments can be charged as overhead due to reliable delivery. The time to compute various checks (e.g., CRC) if implemented in software is also charged to this category. Further, messages may be ejected from the network and rerouted in software in the presence of failures. These rerouting decisions are made by the messaging layer at the intermediate nodes. To detect messages that have been dropped in the network, the handlers may maintain timestamps of outstanding messages that were recorded when the message was injected into the network. Subsequent query of the timestamp and comparison with the current time involves a few handler instructions to detect lost packets.
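
The timestamp-based loss detection described above might be structured as in the following sketch, where now() and retransmit() are hypothetical helpers. Each injected message is held, with its injection time, until acknowledged; a periodic sweep from the handler retransmits anything that has timed out.

#include <stdbool.h>
#include <stdint.h>

#define MAX_OUTSTANDING 32
#define TIMEOUT 1000000   /* in the same units as now(); illustrative */

typedef struct {
    bool     in_use;
    uint64_t sent_at;     /* timestamp recorded at injection */
    uint32_t msg_id;
} pending_t;

static pending_t pending[MAX_OUTSTANDING];

extern uint64_t now(void);
extern void retransmit(uint32_t msg_id);

void on_ack(uint32_t msg_id)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (pending[i].in_use && pending[i].msg_id == msg_id)
            pending[i].in_use = false;   /* buffer may now be reused */
}

/* Run from the message handler or a periodic poll. */
void sweep_for_lost_packets(void)
{
    uint64_t t = now();
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (pending[i].in_use && t - pending[i].sent_at > TIMEOUT) {
            retransmit(pending[i].msg_id);
            pending[i].sent_at = t;      /* restart the timer */
        }
}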

8.1.7 Protection

The state of messages may be maintained in the network interface or in special pages in memory. In a multiprogrammed environment, programs must be prevented from interfering with the message-passing state of other programs. When message handlers operate in the kernel context, this can be achieved via traditional mechanisms. When handlers operate in the user address space, this can be more difficult to do. One approach [225] is to have a user scheduler that manages the network interface. Each process now must have a unique ID that is part of the message. When the message handlers receive a message, this ID is checked. If the message is not destined for the currently active process, it is stored in a separate queue. This queue is accessed by the scheduler and stored as part of the saved network interface state for that process. From the point of view of assessing software overheads, the instructions for extracting and checking process IDs and storing messages can be attributed to process protection mechanisms.
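
The per-message protection check reduces to a few instructions in the dispatch path, as in the following sketch; all names here are illustrative.

#include <stdint.h>

typedef struct msg {
    uint32_t target_pid;   /* process the message is destined for */
    /* ... payload, handler identifier, etc. ... */
} msg_t;

extern uint32_t current_pid(void);
extern void run_handler(msg_t *m);
extern void enqueue_for_scheduler(msg_t *m);  /* saved as part of the NI
                                                 state for that process */

void dispatch(msg_t *m)
{
    if (m->target_pid == current_pid())
        run_handler(m);
    else
        enqueue_for_scheduler(m);  /* the cost attributed to protection */
}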

The preceding discussion was the result of experiences with the design and implementation of messaging layers that focused solely on message transport. If we consider the support of message-passing APIs with additional semantics, other operations may have to be supported that will further contribute to the software overheads. The MPI standard presented in Section 8.4 provides several examples of such useful semantics for message-passing applications.

8.2 Impact of Message Processing Delays

The aggregation of the various overheads produced by the software messaging layer as described above has a considerable impact on the performance of the interconnection network. An increase in software overhead directly increases the communication latency. Moreover, it can even limit the throughput achieved by the interconnection network, since the injection rate decreases as software overhead increases.

Consider an interconnection network with N nodes. Assuming that each node has a single injection/delivery port, a new message cannot be injected until the currently injected message has left the node. We are going to compute an upper bound for channel utilization as a function of communication start-up latency. In what follows, we will use the following notation:

N Number of nodes in the network
BN Network bisection bandwidth
Bc Channel bandwidth
ts Start-up latency
tI Injection time
L Message length
pB Probability of a message crossing the bisection
u Channel utilization in the network bisection

Assuming that there is no contention in the network and there is no pipelining in the network interface, the time required at the source node to inject a message is given by tI = ts + L/Bc. As each node has a single port, the maximum injection rate is 1/tI.

For the sake of simplicity let us assume a message destination distribution in which the probability of a message crossing the network bisection is pB. Assuming that there is no contention in the network, the average utilization of channels in the network bisection is

u = (N pB L) / (tI BN) = (N pB L) / ((ts + L/Bc) BN)

Now, let us compute channel utilization using values representative of a real machine. We assume the following values:

N = 64 (an 8 × 8 mesh, so that the bisection comprises 8 channels and BN = 8Bc)
ts Bc = 10,000 bytes (e.g., ts = 25 μs with Bc = 400 Mbytes/s)

It should be noted that channel utilization decreases as communication locality increases. So, for this study we assume a uniform distribution of message destinations. For this distribution, we have pB ≈ 0.5. For these values, channel utilization is

u = (64 × 0.5 × L) / ((10,000 + L) × 8) = 4L / (10,000 + L), with L expressed in bytes

Table 8.1 shows the average channel utilization for different message lengths. Note that these values are upper bounds because they assume that there is no contention in the network. Obviously, channel utilization cannot be higher than one. Thus, the value displayed in the table for channel utilization when message length is 4 Kbytes means that the start-up latency is not the limiting factor in this case. Instead, channel utilization is limited by channel contention.

Table 8.1

Upper bound for the average channel utilization as a function of message length.

Message length (bytes)    16      64      256     1,024   4,096
Channel utilization       0.006   0.025   0.100   0.372   1.16

As can be seen in Table 8.1, channel utilization is very low for messages shorter than 256 bytes. Therefore, when messages are short, network performance is limited by software overhead. In this case, there is no benefit in increasing channel bandwidth. The minimum message length required to achieve the maximum channel utilization in the absence of network contention is

Lmin = (ts BN) / (N pB − BN/Bc)

This length is equal to 3,333 bytes for a uniform distribution. This expression also indicates that when communication locality increases, the minimum message length required to achieve the maximum channel utilization also increases. Moreover, for traffic exhibiting a high degree of locality, full channel utilization cannot be achieved regardless of message length. This occurs when pB ≤ 0.125 for this example.
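
The following short program evaluates this model for the example values above (N = 64, BN = 8Bc, pB = 0.5, ts Bc = 10,000 bytes); the specific message lengths printed are illustrative.

#include <stdio.h>

int main(void)
{
    const double N = 64.0, pB = 0.5;
    const double BN_over_Bc = 8.0;    /* bisection / channel bandwidth */
    const double ts_Bc = 10000.0;     /* start-up latency x channel BW, bytes */

    int lengths[] = { 16, 64, 256, 1024, 4096 };
    for (int i = 0; i < 5; i++) {
        double L = lengths[i];
        /* u = N pB L / ((ts + L/Bc) BN), rewritten in terms of ts Bc. */
        double u = N * pB * L / ((ts_Bc + L) * BN_over_Bc);
        if (u > 1.0) u = 1.0;         /* utilization saturates at one */
        printf("L = %4d bytes: u = %.3f\n", lengths[i], u);
    }

    /* Minimum length for full utilization: Lmin = ts BN / (N pB - BN/Bc). */
    double Lmin = ts_Bc * BN_over_Bc / (N * pB - BN_over_Bc);
    printf("Lmin = %.0f bytes\n", Lmin);   /* ~3,333 bytes */
    return 0;
}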

8.3 Implementation of the Messaging Layer

Traditionally, interprocessor communication has been treated in the same manner as input/output operations. The network interface was essentially a fast I/O device. Access to this device was protected and was granted through the operating system. This is still true for intercomputer communication in a LAN environment. However, the resulting high latency is difficult to absorb in parallel programs. This has spurred activity in the implementation of low-overhead messaging layers. Two notable examples are the Active Message layer [104, 105] and the Fast Message layer [264]. To motivate the implementation of these approaches, consider the operational model of previous (and some current-generation) message-passing machines illustrated in Figure 8.3.


Figure 8.3 Messaging layer implementation. (NI = network interface; OS = operating system.)

While the user program’s nonblocking send() and receive() procedures execute asynchronously, the message layer is responsible for synchronizing the transport of messages. A nonblocking send() call immediately returns to the user program, but the message layer may need to buffer the message until the network interface is available. In this case the message is copied into system buffers to allow the program execution to proceed. Reception of a message will generate an interrupt to invoke the operating system handler, which extracts the message from the network interface and buffers the message until a matching receive() is executed by the user program. At this point the message is copied into the user program data structures. The advantage of this approach is that messages can be consumed in an arbitrary order and at arbitrary points in time.

The designers of the first generation of streamlined messaging layers point out that this programming model is mismatched to the hardware capabilities, leading to high overheads due to repeated transitions across the user/operating system interface and excessive copying and buffer management [105]. A closer look reveals that messages are transmitted by handlers and received by handlers. If these handlers were integrated into the programs themselves and user-level access provided to the network interface, communication could be smoothly integrated with computation. Buffering/copying can be minimized, as can the overheads due to system calls. This streamlined implementation can result in over an order of magnitude reduction in software overhead. Active Messages [104, 105] and Fast Messages [264] represent this approach to the implementation of the messaging layer.

Such messaging layer implementations are necessarily at a lower level of abstraction, resembling interrupt handlers in invocation and use. Therefore, these procedures tend to serve better as targets for compilers and library developers than as a user-level API. Further, the use of industry-standard APIs is very important for portability and development cost. Nonetheless, these lean message layers are available for the implementation of finely tuned, efficient parallel programs as well as serving as a target for portable message-passing APIs. The following two sections describe examples of implementations of Active Messages and Fast Messages.

8.3.1 Example: Active Messages

The basic principle underlying Active Messages is that the message itself contains the address of a user-level handler that is to be executed on message reception [105]. Consider how this might work for the transmission of a one-word message. Let us suppose that the network interface is mapped into the user address space so that user programs have access to the control registers and buffer memory within the interface. The network interface receives a message into the local buffer within the interface. The message interrupts the processor, and the handler specified within the message is invoked. This handler may read the message data in the buffer and assign it to a local program variable. Alternatively, if the message data provide a request for a value, the handler can immediately transmit a reply. In either case, the message need not be extracted from the interface and buffered locally in memory. Rather than using interrupts, the user program may be periodically polling the interface seeking messages. In this case, no interrupt is generated. There is no buffering and no invocation of the operating system. Execution of the handler is extremely fast. Compare this scenario with the one described in conjunction with Figure 8.3.
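
The core mechanism can be sketched as follows: the packet itself names the user-level handler to run on reception, so a poll loop dispatches it directly, with no operating system involvement and no intermediate buffering. The packet layout and ni_poll() are illustrative; real implementations typically use a validated handler-table index rather than a raw function pointer.

#include <stdint.h>

typedef struct am_pkt {
    void (*handler)(struct am_pkt *); /* named in the message by the sender */
    uint32_t src;                     /* needed to address a reply */
    uint32_t arg[4];                  /* small payload, e.g., a remote value */
} am_pkt_t;

extern am_pkt_t *ni_poll(void);  /* next packet in the NI buffer, or NULL */

void am_poll(void)
{
    am_pkt_t *p;
    while ((p = ni_poll()) != NULL)
        p->handler(p);   /* runs to completion in the user's address space */
}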

Given this basic model, the following presents a synopsis of the design of an Active Message layer for HP workstations (HPAM) using the Medusa network interface [225]. The network interface is mapped into the address space of the processor and comprises 1 Mbyte of VRAM partitioned into 8 Kbyte blocks (one per message). Four queues control the buffering and transmission of messages as shown in Figure 8.4. Queue entries consist of a memory block number and message length pair. Transmission proceeds with the construction of a message in a block and insertion of the corresponding queue entry in the TX_READY queue to initiate transmission. Completion of the transmission causes the queue entry to move to the TX_FREE queue. Messages are received by removing entries from the RX_READY queue and accessing VRAM. When the message block is empty, the queue entry moves to the RX_FREE queue to be used by the interface to place incoming network packets in free blocks.


Figure 8.4 A model of the Medusa network interface.

Active Message implementations are based on a request-reply model of communication [105]. Every message generates a reply. Some messages naturally generate a reply such as remote read operations. However, other messages do not, in which case a reply must be automatically generated. The HPAM implementation is unique in that it supports an all-to-all communication model. All buffers and bookkeeping storage are statically allocated. The following describes a sequence of operations for implementing point-to-point communication.

The single request-reply protocol requires two pairs of buffers for communicating with each processor as shown in Figure 8.5. The outbound pair is used to transmit a message and store the reply. Similarly the inbound pair is used to store an incoming message and the locally generated reply. A descriptor table contains status information for each buffer (e.g., busy/free, time busy, etc.). When a message is to be transmitted, the outbound buffer is marked as busy, and the timestamp of the message is stored in the descriptor table entry. The buffer remains allocated until a reply is received. Busy buffers cause handlers to block while still receiving other messages: to preserve deadlock freedom active message handlers cannot be preempted. Retransmission occurs after a timeout period. When a message is received (on an inbound channel), it contains a pointer into the descriptor table to enable checks prior to invoking the handler specified in the message (e.g., does it belong to the currently active process?). This handler may now read the data in the inbound buffer, copying it to local data structures. The handler generates a reply if necessary (or one is generated automatically). All HPAM functions automatically poll the interface to avoid the high cost of the interrupts (approximately 10 μs in this case).


Figure 8.5 An example of the operation of Active Messages.

The preceding description is simplified, omitting many important checks and features unique to the specific implementation. However, the description does cover important aspects of the Active Message implementation: the implementation of a request-reply protocol, buffer management, and reliable message delivery mechanisms. Detailed issues deal with retransmission protocols, handling lost packets, and synchronized access to the interface in a multiprogramming environment. This latter issue is of particular concern in the design of the messaging layer. What happens when a message arrives for a process that has been swapped out? In this case the handler cannot be invoked. The HPAM implementation solves this problem by enforcing protection via an external scheduling daemon that is responsible for saving and restoring the network interface card state on each context switch. When a message arrives for a blocked process, it is stored in a special queue for the blocked process. When the process is restored the HPAM layer first checks for messages in this queue.

The preceding example has provided an implementation of a message layer that permits user-level messaging with very low latencies: on the order of 29 μs per message. In other environments (e.g., networks using ATM communication), it may not be feasible to be granted user-level access to the network interfaces. In this case, careful streamlined implementation of the user/kernel interface is necessary for efficient implementations. Buffering/copying costs can still be significantly reduced.

EXAMPLE 8.1

Figure 8.6 illustrates the logical flow of information through handlers in an implementation of the Active Message paradigm. Process 0 makes a request for remote data. The source handler initializes a completion flag (request_complete) and calls the Active Message procedure am_request_4(). The source now polls waiting for completion of the remote read. The message contains the name of the handler to be invoked by the destination process on message reception. This handler responds to the request message with a read operation to return the data value. This handler runs to completion. Note that no buffering is necessary at this point. Similarly, reception of the message at the host invokes the reply handler, which assigns the message contents to a local memory location. Buffering and copying of message contents are avoided. The request handler within the destination process is only permitted access to the network to reply to messages, while the reply handler at the source process is prevented from accessing the network. This helps prevent cyclic dependencies and certain deadlock situations.
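
A sketch of this remote-read sequence follows. The procedure am_request_4() and the flag request_complete are named in the example; their exact signatures, and the handler signatures, are illustrative assumptions here rather than the actual API.

#include <stdint.h>

static volatile int      request_complete;  /* completion flag */
static volatile uint32_t result;

/* Hypothetical Active Message primitives. */
extern void am_request_4(int dest,
                         void (*req_handler)(int src, uint32_t addr),
                         uint32_t remote_addr);
extern void am_poll(void);
extern void read_handler(int src, uint32_t addr); /* runs at the destination;
                                                     may access the network
                                                     only to reply */

/* Reply handler: invoked by the message layer at the source. It assigns
 * the message contents directly to program variables (no buffering or
 * copying) and is not permitted to access the network. */
void reply_handler(uint32_t value)
{
    result = value;
    request_complete = 1;
}

uint32_t remote_read(int dest, uint32_t remote_addr)
{
    request_complete = 0;                    /* initialize completion flag */
    am_request_4(dest, read_handler, remote_addr);
    while (!request_complete)
        am_poll();                           /* poll until the reply arrives */
    return result;
}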


Figure 8.6 An example of the use of Active Messages.

8.3.2 Example: Illinois Fast Messages

A second example of a streamlined messaging layer implementation is the Fast Messages (FM) library from the University of Illinois [264]. The FM procedures are similar to Active Messages in that a handler is specified within the message. However, there is no notion of request-reply message pairs. There are also no restrictions on the actions that a handler can take. Thus, program properties such as deadlock freedom are the responsibility of, and must be guaranteed by, the programmer. The reduced constraints can lead to very efficient implementations. In this regard FM can be viewed as a low-level messaging library suitable as a target for compilers and library developers, rather than as a general-purpose, user-level API.

The following description is of the FM implementation on Myrinet using Sun Sparcstations [264]. The Myrinet interface (the LANai) incorporates a processor that executes a control program for injecting/ejecting messages from the network interface and setting up transfers to and from the host. The LANai memory is mapped into the host address space. Communication between the host and the LANai is across the memory and I/O buses of the Sparc. The implementation of the message function is via four queues as shown in Figure 8.7. The send and receive queues are in the network interface, and the reject queue and a second receive queue are in host memory. The FM send functions write directly into the send queues in the network interfaces, initiating transmission. However, on a receive, the interface will DMA the message into the receive queue in kernel memory (DMA is allowed only to/from kernel pages). DMA on reception permits overlap of computation and communication, while direct stores to the interface on send calls remove the overhead of copying to the DMA region. This is a good example of how the precise trade-offs are clearly a function of the node architecture and operating system constraints.


Figure 8.7 Implementation of FM on Myrinet.

The reject queue is used for implementation of the optimistic buffer allocation policy. Since there is no request-reply coupling, there is no running knowledge of the availability of buffer space on remote nodes. Therefore, when a message is sent, buffer space is first allocated locally (in the reject queue) prior to transmission. When a message is received and no buffer space is available, the message is rejected. The returned message is guaranteed to be received. In the current implementation, the distinction between regular and rejected packets is not made in the interface but at the host. This choice was influenced by the relatively slow speed of the network interface processor and the ability to make use of block DMA transfers if messages were simply packed together without being interpreted at the interface. The user programs must ensure that the handlers that receive messages are invoked often enough to reduce the number of rejected packets. This approach requires acknowledgment of received packets to release buffers at the source. The current FM implementation optimizes acknowledgments by piggybacking them on top of normal traffic. When invoked, the FM handler extracts all of the messages from the queue, and programs are expected to poll regularly to remove pending messages.
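
Typical FM usage might look as follows. The calls shown follow the spirit of the published FM interface [264], but the exact signatures here should be treated as assumptions; on_data() and the polling placement are illustrative.

#include <stdio.h>

extern void FM_send(unsigned dest, void (*handler)(void *buf, int size),
                    void *buf, int size);
extern void FM_extract(void);  /* drain pending messages, running handlers */

void on_data(void *buf, int size)
{
    printf("received %d bytes\n", size);
}

void compute_loop(unsigned peer)
{
    int x = 42;
    FM_send(peer, on_data, &x, sizeof(x));
    for (;;) {
        /* ... computation ... */
        FM_extract();  /* programs must poll often enough to keep the
                          receive queue from filling and rejecting packets */
    }
}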

8.4 Application Programming Layer: The Message Passing Interface

The early success of the message-passing programming model in constructing applications led to several commercial as well as public domain efforts toward the implementation of message-passing libraries. Examples include Intel’s NX/2 [276], PVM [335], and p4 [45, 46]. These concurrent efforts naturally shared many common attributes by virtue of support for a common programming model, while necessarily presenting distinctions in syntax, semantics, and constraints on their usage due to their developmental and evolutionary heritage. The ability to provide a portable implementation of message-passing libraries was demonstrated in principle with several of these packages. However, the diversity of available libraries remained a hindrance to the development of truly portable message-passing programs. Thus, a forum was created with members from academia, industry, and government. Through an open participatory process, the first Message Passing Interface (MPI) standard evolved and was completed in June 1994. A second revision subsequently appeared in June 1995 as the MPI version 1.1 standard.

The MPI standard is an application programming interface, not an implementation. “The MPI standard defines the user interface and functionality for a wide range of message-passing capabilities” [325]. In defining the syntax and semantics of the interface, the designers had to balance the impact of the specification on portability, efficiency, and ease of use. For example, data type information appears in the interface to permit support across heterogeneous environments where the value of an operand may have distinct representations on different machines (e.g., byte ordering). The standard refrains from specifying how operations must be performed, focusing on the logical behavior and what must be performed. The stated goals of MPI are illustrative and summarized from [325] below:

- The interface should reflect the needs of applications.

- The interface should not constrain the implementation (i.e., permit optimizations that may eliminate buffering, use extensive buffering, support concurrent communication and computation, etc.).

- The specification should permit support for a heterogeneous environment.

- The semantics should be language independent.

- The user should be relieved from the responsibility for reliable communication (i.e., checking or creating acknowledgments).

- Usage should not require much deviation from well-understood current practice.

- The interface design should be supportive of thread safety.

The standard remains focused on the logical aspects of message passing and avoids specifying or addressing system-specific operations such as I/O operations, task management, and other operating-system-specific features. The standard also avoids explicitly addressing shared-memory operations, although MPI can be (and has been) implemented on shared-memory machines. Some of these (non-MPI) features could conceivably be offered as extensions with a specific vendor implementation. There are several excellent texts and publications that describe in some detail the rationale and specification of the MPI standard [138, 325], as well as tutorial descriptions on the use of MPI [138]. There is also a continually growing set of Web pages that provides information on the standard, updates of evolving efforts, and links to a variety of tutorials, texts, and papers. An excellent place to start is at [243]. The purpose of this chapter is to merely highlight major features of the standard and, where possible, relate the interface features to the implementation of the messaging layer. As a result, we briefly describe some of the novel features of MPI. This includes general concepts, the structure of point-to-point communication, support for collective communication, and the availability of virtual process topologies.

8.4.1 General Concepts

An MPI program consists of autonomous processes, executing their own code in either the MIMD or SPMD programming paradigms. Each process usually executes in its own address space and communicates via MPI communication primitives. The basic communication mechanism within MPI is point-to-point communication between pairs of processes. The model assumes the use of a reliable user-level message transmission protocol. Furthermore, messages are nonovertaking. The implication of these semantics is that if single-threaded programs are written with point-to-point message-passing calls with source and destination process addresses explicitly specified, program execution is deterministic.

The community experience with the development of many successful message-passing library packages led to the precise definition and adoption of several logical concepts as an aid in thinking about and writing correct, efficient, parallel programs. At the highest level MPI specifies the notion of a communication domain. This domain is a set of n processes, which are implementation-dependent objects. From the programmer’s perspective, each process is assigned a rank in this group between 0 … (n − 1). Message-passing programs are written to send messages to processes identified by their rank. The group of processes is identified by an object called a communicator. Each process within a group can be thought of as possessing a communicator for the communication domain within which it is a member. The communicator is logically a set of links to other processes within the same group (i.e., communication domain) and is referred to as an intracommunicator. The relationship between intracommunicators in a communication domain is illustrated in Figure 8.8.


Figure 8.8 Representation of intracommunicators.

The notion of a communicator provides a very simple concept with which to structure groups of processes. Often it is desirable to allocate one subgroup of processes to a specific subtask while another subgroup of processes is tasked with a distinct computation. Processes within subgroups will communicate among themselves. There will also usually be communication between processes in distinct subgroups. In MPI terminology, each subgroup will have intracommunicators for communication within the subgroup. Each process must also possess intercommunicators for communication between subgroups. Logically the intercommunicator used by processes within one group can be thought of as links to processes within the other group and vice versa. This information captured within intercommunicators is illustrated in Figure 8.9.


Figure 8.9 Representation of intercommunicators.

The linear rank of a process within a group provides no information about the structure of the problem, and a great deal of mental bookkeeping is often required to orchestrate interprocessor communication to follow the structure of the problem. We often keep a mental map or even a physical drawing of how processes must communicate. To facilitate such thinking, MPI provides for the specification and use of virtual topologies of processes within a group. This virtual topology can be used for realizing efficient assignments of processes to processors or for writing scalable programs. A substantial body of work exists on algorithms for mapping parallel programs onto parallel architectures. Optimal mapping algorithms exist for many structured programs and networks. The specification of virtual topologies effectively provides hints to the run-time system about the communication structure of the program, enabling the optimization of program mappings. The fact that a logical communication topology is specified does not preclude communication between any pair of processes not connected within this topology. It simply means that the existence of communication between these two processes cannot be made known to the run-time system for use in computing process assignments. The use of topologies can also make it easier to write scalable parallel programs. Many parallel algorithms, particularly those that involve domain decomposition over large data sets, are structured with regular communication patterns. If programs are written parameterized by the topology, implementations are easily scaled to larger data sets. The availability of MPI functions to query and extract the rank of neighboring processes makes the development of such general-purpose programs relatively easy. While explicit support is available for various multidimensional orthogonal topologies, MPI functions facilitate the specification and use of arbitrary logical process communication structures as well.

8.4.2 Point-to-Point Communication

With process groups, communicators, and logical topologies we have a set of logical concepts with which to describe the point-to-point and collective communication semantics of MPI. The basic constructs of MPI are best illustrated by examining an example MPI function. Consider the following blocking send procedure [325]:

MPI_SEND(buf, count, datatype, dest, tag, comm)

The above procedure transmits count entries starting at address buf to the process with rank dest. Note that the message size is specified as a number of entries and not as a number of bytes. Usage in conjunction with the datatype argument promotes a portable implementation. The language support provides definitions for a host of predefined data types (e.g., for the C language there exist MPI_CHAR, MPI_FLOAT, MPI_BYTE, etc.). Facilities are also available for the user to define new data types.

In addition to specifying the source of message data, the call also specifies the dest, tag, and comm fields, which with the addition of the source address form the message envelope. The comm argument is the communicator, and dest is the rank of the receiving process within this communication domain. If the communicator is an intercommunicator, then dest is the rank of the process in the destination domain (we must be careful since processes can belong to more than one domain and have different ranks in each domain). Thus, comm and dest serve to uniquely identify the destination process. The tag is simply an integer-valued argument that can be used by the receiving processes to distinguish between messages.

The blocking semantics are with respect to buffer usage rather than the temporal relationship between send and matching receive operations. When a blocking send procedure returns, this does not necessarily imply that the matching receive procedure has started executing. A blocking send call returns as soon as the send buffer can be safely reused. Depending upon the implementation, this may be the case after the message has been copied directly into the receive buffer, or after it has been copied into a system buffer for subsequent transmission. In the latter case the send may return before the matching receive has even begun executing, which may run counter to our expectations for blocking operations. A blocking receive returns when the buffer contains the message data. The blocking receive is shown below:

MPI_RECV(buf, count, datatype, source, tag, comm, status)

The receiver uses the message envelope to specify the sources from which messages will be received. Only messages whose envelopes match the receiver’s request will be accepted. Receivers also control message reception via type matching. The source field may be a wild card (e.g., the predefined string MPI_ANY_SOURCE). In this instance, a message may be received from any source, and it may not be possible to know a priori the exact size of the message. Therefore, the buffer size is an upper bound on the required storage. Query functions are available to obtain information about received messages; for example, MPI_GET_COUNT() returns the number of received entries in the receive buffer. The status field is a structure that contains additional information for the receiver.
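
For example, the following two-process program uses a wildcard source, a receive buffer sized as an upper bound, and MPI_Get_count() to determine how much data actually arrived:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = { 1, 2, 3, 4 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf[1024], count;          /* upper bound on the message size */
        MPI_Status status;
        MPI_Recv(buf, 1024, MPI_INT, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);  /* entries received */
        printf("got %d ints from rank %d\n", count, status.MPI_SOURCE);
    }
    MPI_Finalize();
    return 0;
}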

There are a host of nonblocking calls that return immediately to permit overlap of communication and computation. These calls are supported by query functions that are used to determine if the operation has actually completed. Such calls are a necessary prelude to the reuse of buffers used in the nonblocking calls. These functions include MPI_WAIT() and MPI_TEST(). A common structure is to initiate a nonblocking send operation and continue with computation that is overlapped with the transmission. When the program reaches a point where the send buffer must be reused, queries are used to establish the completion of the previously issued send operation.
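
The pattern might look as follows, where do_independent_work() is a placeholder for computation that does not touch the send buffer:

#include <mpi.h>

extern void do_independent_work(void);

void overlapped_send(double *buf, int n, int dest)
{
    MPI_Request req;
    MPI_Isend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    do_independent_work();   /* overlapped with the transmission */

    /* The buffer may not be reused until completion is confirmed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    buf[0] = 0.0;            /* now safe to reuse */
}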

It is clear that the choice of buffering strategy has a significant impact on performance, particularly if there are known features of the application that can be exploited. To enable the programmer to influence the buffering strategy and thus performance, MPI offers several modes for send/receive operations. The preceding descriptions were that of standard mode send/receive calls. In this case, no assumption can be made with regard to whether the message is buffered or not. In buffered mode, buffering can be provided within the user program. Therefore, the user can guarantee that some buffering is available. Synchronous mode realizes the semantics of a rendezvous operation and ensures that a matching receive has executed. Finally, in ready mode, the user can assert that the matching receive has been posted. These distinct modes enable the user to exploit some knowledge of the application implementation in influencing the choice of the message transfer protocol used in a communication operation.

When communication repeatedly occurs between processes (e.g., in loops), some of the overhead involved in the message-passing implementation may be shared across multiple messages by using features for persistent communication. Calls to MPI functions are used to set up and retain the local parameters used for communication. The setup procedures (e.g., MPI_SEND_INIT() and MPI_RECV_INIT()) return handles bound to arguments. These handles are used in subsequent MPI_START() calls. The semantics of each pair of persistent communication procedure calls is identical to that of a nonblocking send or receive.
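
For example, a loop that repeatedly sends the same buffer to the same destination can pay the setup cost once:

#include <mpi.h>

void repeated_sends(double *buf, int n, int dest, int iters)
{
    MPI_Request req;
    MPI_Send_init(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    for (int i = 0; i < iters; i++) {
        /* ... fill buf for this iteration ... */
        MPI_Start(&req);                 /* same semantics as a nonblocking send */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);              /* release the persistent handle */
}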

It is important to understand that while the send and receive call semantics are well defined, this does not preclude the programmer from constructing erroneous programs. In particular, errant use of blocking receive calls can lead to deadlocked programs. There is also no notion of fairness in how receive calls match messages. These considerations should be taken into account when constructing programs.

8.4.3 Collective Communication

A specific class of communication patterns that has received increasing attention in the recent past is the class of collective communication operations. As the name suggests, collective communication involves the aggregation or dissemination of data from/to multiple processes. The importance of collective communication is derived from the fact that many frequently used parallel algorithms, such as sorting, searching, and matrix manipulation, share data among groups of processes. Transmission of data to multiple destinations can be implemented with multiple calls for point-to-point transmission. However, these patterns of sharing data are very regular and are important enough to merit special procedures.

In general, collective communication involves one or more senders and one or more receivers. Examples include broadcast of a single data item from one process to all other processes, broadcast of unique items from one process to all other processes, and the inverse operation: gathering data from a group of processes. There are also other operations that are collective in nature although no data are communicated (i.e., barrier synchronization). In general, we can identify the following common collective communication operations:

- Broadcast: A source process sends identical data to all other processes.

- Scatter: A source process sends a distinct message to all other processes.

- Gather: This operation is the reverse of scatter.

- All-to-all broadcast: Every process communicates the same data to every other process.

- All-to-all personalized exchange: Every process communicates unique data to each other process.

The MPI standard specifies support for all of these collective communication operations as well as other global operations such as reduce and scan operations. As an example, consider the following MPI call:

MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)

This operation is executed by all of the processes in the group. Each process sends a distinct block of data to each other process. The jth block in process i is sent to process j, where it is stored in the ith block. The receive buffer must be large enough to store sendcount elements from each process. This is the most general collective communication primitive. Suitable restrictions on messages, senders, and receivers realize the remaining collective communication operations. Corresponding MPI calls exist for each of these operations.

Another useful operation that does not involve the transfer of any data between processes is the barrier synchronization operation. The call appears as follows:

MPI_BARRIER(comm)

This procedure blocks until all processes within the group have made the call. The semantics of the collective communication operations are consistent with those of point-to-point operations, but are more restrictive in several ways. The message sizes must be known and match the receiver-specified buffer sizes. These calls are all blocking calls and therefore must appear in the same order in all processes. Note that the tag does not appear in the above (or other) calls. These calls are also only available in the equivalent of standard mode. When a collective communication call returns, buffers can be reused (as in point-to-point calls). However, no assertion can be made about the status of other processes. Even though the operations are collective in nature, and logically may be thought of as occurring simultaneously, a user cannot assume that these operations synchronize processes in the same way that a barrier synchronization operation does.

The preceding description provides a brief overview of the major concepts underlying the MPI standard. Both Active Messages and Fast Messages present candidate implementation techniques for MPI. We can see that the abstractions supported by a user-level API are richer and more complex than the performance-driven, memory-to-memory data transfer view of lower-level messaging layers. However, understanding their differences and relationships can be helpful in writing efficient programs. The complete list of MPI procedures can be found in [325], and many useful links to MPI-related material can be found on the Web at [243].

The following discussion presents an example of an MPI program template that contains the basic elements typically found in a message-passing parallel program. This template is for creating a parallel program based on a master-slave model of parallel program execution. This excellent template is presented at [245] and is shown below with a few additions. A more detailed exposition of such a template as well as the description of the local area multicomputer (LAM) implementation and development environment for MPI can be found at [245]. By filling in portions of this template with correct code, we can create an executable MPI program.

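The following sketch reconstructs the essential structure of that template from the description in Example 8.2; the work-management helpers (get_next_work_item(), do_work(), process_result()) and the tag values are placeholders to be filled in with application code rather than part of the original template at [245].

#include <mpi.h>
#include <stdio.h>

#define WORKTAG 1
#define DIETAG  2

extern int  get_next_work_item(int *work);  /* nonzero while work remains */
extern int  do_work(int work);              /* a slave's unit of work */
extern void process_result(int result);

static void master(void);
static void slave(void);

int main(int argc, char *argv[])
{
    int myrank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* begin MPI processing */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* processes in this domain */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* rank of this process */
    MPI_Get_processor_name(name, &namelen);  /* node we execute on */

    if (myrank == 0)
        master();
    else
        slave();

    MPI_Finalize();                          /* end MPI processing */
    return 0;
}

static void master(void)
{
    int size, rank, work, result;
    MPI_Status status;

    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Seed every slave with one unit of work. */
    for (rank = 1; rank < size; rank++) {
        get_next_work_item(&work);
        MPI_Send(&work, 1, MPI_INT, rank, WORKTAG, MPI_COMM_WORLD);
    }
    /* Receive results and hand out more work while any remains. */
    while (get_next_work_item(&work)) {
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, WORKTAG,
                 MPI_COMM_WORLD, &status);
        process_result(result);
        MPI_Send(&work, 1, MPI_INT, status.MPI_SOURCE, WORKTAG,
                 MPI_COMM_WORLD);
    }
    /* Collect final results and tell each slave to terminate. */
    for (rank = 1; rank < size; rank++) {
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, WORKTAG,
                 MPI_COMM_WORLD, &status);
        process_result(result);
        MPI_Send(&work, 0, MPI_INT, status.MPI_SOURCE, DIETAG,
                 MPI_COMM_WORLD);
    }
}

static void slave(void)
{
    int work, result;
    MPI_Status status;

    for (;;) {
        MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == DIETAG)
            return;                /* all work completed */
        result = do_work(work);
        MPI_Send(&result, 1, MPI_INT, 0, WORKTAG, MPI_COMM_WORLD);
    }
}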

EXAMPLE 8.2

This example utilizes the master-slave paradigm for implementing parallel computations. A master process farms out units of work to a set of slave processes. When all units of work have been completed, the master process sends messages that cause the slaves to terminate execution. This simple example contains elements that are representative of a variety of MPI programs. All MPI programs begin with an MPI_Init() and end with an MPI_Finalize(). Any MPI-related processing must be contained within the program between these two calls. Since each process executes the same code, the next two MPI calls determine the number of processes in this communication domain and the rank of the current process in this pool. The first three MPI calls above are generally found at the beginning of every MPI program and enable process-specific execution. The fourth call can be used to obtain the processor node name of the executing process.

A process now first checks to determine if it is serving as the master or as a slave. We set the process of rank 0 to serve as the master. Accordingly one of the two procedures is called. Consider the procedure that implements the behavior of the master. The master process sends units of work to all of the slaves. Note that the first call in the master procedure determines the total number of processes in the domain. The next iterative set of calls uses point-to-point communication to send units of work to all slave processes. As long as there is work to be done, the master process receives results of work from the slaves and continues to provide each such slave an additional unit of work. When all units of work have been farmed out, the master process waits for the results from each slave and sends a termination message to each slave as their final results come in. The slave processes simply receive messages and return results until they are informed that all work is completed and they may terminate. Note that on procedure return the last call made by both master and slave processes is MPI_Finalize().

All communication is performed using point-to-point send and receive procedures. We can also think of an alternative approach where the master process utilizes an MPI_Bcast() call to terminate all slave processes after all results have been received.

The preceding example illustrates how processes can be written to have different behaviors as a function of their rank and of the environment in general. Care must be taken in reasoning about the concurrent execution of MPI programs. It is usually not possible to make assertions about what events in different processes are taking place concurrently. Rather it is usually only possible to make assertions about relative orders of events based on the partial sequencing produced by message-passing calls.

8.5 Engineering Issues

The main goal is to get as close to the “bandwidth on the wire” as possible. The messaging layer software is a major determinant of how close we can actually get. The design of the messaging layer software is strongly influenced by the node architecture and the set of services that the message layer is to provide to user programs. First consider the network interface. Part of the message handler is devoted to interaction with the network interface in the transfer of messages to and from the interface. The cost of this interaction and consequent software overhead depends on how this interface is accessed. Interfaces may be memory mapped, register mapped, or tightly integrated within the memory hierarchy. Coordination of transfers between the interface and memory may involve polling, direct memory access, or interrupts.

In most commodity systems interrupts are expensive. Avoiding interrupts typically requires the processor to poll the interface to receive messages. User programs must ensure that the polling frequency is tied to the available buffer space in order to prevent buffer overflow and/or substantial buffer management traffic between nodes. Flow control at this level must avoid the loss of message packets due to lack of buffer space. Given the varying speeds of processors, interfaces, and networks, sufficient buffer space serves to decouple the operation of all three components, and larger buffers support lower polling frequencies. If packets are buffered at the source for retransmission, increasing the buffer memory at the receiver can actually reduce the source's buffering requirements, since fewer packets are dropped and need to be retransmitted.
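
A back-of-the-envelope relation makes the coupling between buffer size and polling frequency explicit. The symbols below are our own rather than drawn from the text: let λ be the peak packet arrival rate in packets per second, B the interface buffer capacity in packets, and T the interval between successive polls. To avoid overflow we need

    λ T ≤ B,   that is,   T ≤ B / λ

so doubling the buffer capacity permits the polling interval to double as well, which is the sense in which larger buffers support lower polling frequencies.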

It is desirable that access to the network interface not pass through the operating system, since invoking the operating system for services is expensive. While these costs can be amortized over large messages, short messages incur a significant penalty. The traditional justification for operating system mediation of the network interface is to maintain protection between programs that share the interface. Recent techniques have evolved to minimize operating system interaction while still providing protection between programs. One approach [29] is for programs to access the interface through system calls that set up protected information used on subsequent user-level transfers; for a sequence of transfers between a pair of nodes, these system calls need be made only once. A second approach is to implement protection mechanisms outside of the operating system and tag messages with what are effectively process IDs. The handlers that access the interface perform the appropriate checks before modifying any of the interface state [225].
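
The second approach can be pictured with the following sketch, in which arriving messages carry a tag that is effectively a process ID and the user-level handler validates it before touching interface state. All names here are illustrative and are not taken from [225].

    /* Illustrative user-level protection check: reject (or divert to
     * the OS) any message not tagged for the process this endpoint
     * was set up for. */
    typedef struct {
        unsigned int pid;          /* tag: effectively a process ID   */
        unsigned int handler_id;   /* which handler should service it */
        unsigned int payload[8];
    } tagged_msg_t;

    extern unsigned int current_pid;  /* owner of this endpoint        */
    extern void run_handler(unsigned int id, unsigned int *payload);

    int dispatch(tagged_msg_t *m)
    {
        if (m->pid != current_pid)    /* check before touching state   */
            return -1;                /* reject: not for this process  */
        run_handler(m->handler_id, m->payload);
        return 0;
    }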

However, in some systems there may be constraints derived from the operating system design that make it necessary to access interfaces through the operating system. For example, DMA mechanisms may be permitted only between the interface and kernel-mapped portions of memory [264]. In the implementation of Fast Messages over Myrinet, this presented a choice between copying messages to kernel memory to use the DMA mechanism and performing memory-mapped writes across the I/O bus to the interface. Pakin, Lauria, and Chien [264] argue that asymmetry in the manner in which send and receive operations are supported may be desirable. Any involvement by the node processor is time taken away from useful computation. However, rather than copying messages into system buffers for concurrent transmission, it may be just as efficient for the node processor to be involved in send operations, particularly for short messages. Receive operations, on the other hand, should use interface mechanisms such as DMA, since reception can then be overlapped with useful computation on the node; the node processor and handler software need not be invoked until the message is completely in memory. The utility of such asymmetry is a function of the node architecture and operating system constraints. These may be less of an issue if the system is being designed from the bottom up rather than built on existing commercial systems.
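
The asymmetry can be sketched as follows: short sends are performed by programmed I/O directly from the host processor, while receives post a DMA descriptor and return so that reception overlaps computation. The register names and DMA interface are hypothetical, not those of the Myrinet interface.

    /* Hypothetical interface registers for this sketch. */
    typedef struct {
        volatile unsigned int  data[64];   /* send staging area        */
        volatile unsigned int  length;     /* write launches the send  */
        volatile unsigned long dma_addr;   /* receive DMA target       */
        volatile unsigned int  dma_arm;    /* nonzero arms the channel */
    } io_nic_t;

    /* Send side: for short messages the host copies the data across
     * the I/O bus itself; no kernel buffer or DMA setup is involved. */
    void pio_send(io_nic_t *nic, const unsigned int *msg, unsigned int n)
    {
        for (unsigned int i = 0; i < n; i++)
            nic->data[i] = msg[i];
        nic->length = n;          /* writing length launches the packet */
    }

    /* Receive side: arm DMA and return; the handler need not run until
     * the message is completely in memory. */
    void post_dma_recv(io_nic_t *nic, unsigned int *buf)
    {
        nic->dma_addr = (unsigned long)buf;  /* physical address in a
                                                real implementation    */
        nic->dma_arm  = 1;
    }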

In designing the message layer, there is a question of where the responsibility is placed for correctness (e.g., deadlock freedom). It is possible to design buffer management protocols that are fast but prone to deadlock if their use is not carefully orchestrated. Alternatively, buffer management protocols can be designed to prevent the loss of messages due to lack of buffer space at the expense of additional checks and/or messages. For example, the request-reply protocol followed by Active Messages serves to prevent forms of deadlock. The choice is often resolved based on the user of the messaging layer. If the implementation is to serve as a target for library developers and compilers, implementors may choose faster options at the expense of relying on correctness by construction. If the implementation is to serve as an application-level programming interface, preventive techniques may be used to ensure correctness with some relative loss in performance.
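
The flavor of the Active Messages request-reply restriction can be conveyed by the following sketch; the function names are illustrative rather than the actual Active Messages API. A request handler may issue at most one reply, and a reply handler may not communicate at all, which bounds the buffering the layer must provide and rules out cyclic waits through the handlers.

    /* Illustrative request-reply discipline (am_reply is not the real
     * Active Messages interface). */
    extern void am_reply(int src, void (*h)(int, void *), void *arg);

    void reply_handler(int src, void *arg)
    {
        /* consume the result; no further sends are permitted here */
    }

    void request_handler(int src, void *arg)
    {
        void *result = arg;             /* ... service the request ... */
        am_reply(src, reply_handler, result); /* the only send allowed */
    }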

In addition to interface and buffer management, Karamcheti and Chien [171] identified several classes of services that may be enforced within the messaging layer and therefore incur overhead. These include checks for in-order message packet delivery (reordering out-of-sequence packets) and reliable message delivery (handling and generating acknowledgments). If we were to closely examine the code for a typical message handler, we would find instructions devoted to one or more of the following functions (a handler sketch annotated with these categories follows the list):

• Network interface control for transmission and reception

• Kernel interaction

• Buffer allocation/deallocation

• Copying

• Reliable message delivery

• Protected access to the network interface

• In-order delivery of message packets

• Deadlock freedom
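
The sketch below is a hypothetical receive-side handler with each step annotated by the cost category from the list above that it falls under. Every type and helper name is illustrative, and the helper bodies are omitted.

    #include <string.h>

    /* Illustrative types and helpers; none of these names come from
     * the text. */
    typedef struct { unsigned src, src_pid, seq, len;
                     char payload[256]; } msg_t;
    typedef struct { unsigned len; char data[256]; } buf_t;
    typedef struct { volatile unsigned status; msg_t *rx_slot; } rx_nic_t;
    #define RX_READY 0x1u

    extern int      valid_pid(unsigned pid);
    extern unsigned expected_seq(unsigned src);
    extern void     stash_out_of_order(const msg_t *m);
    extern buf_t   *alloc_buffer(unsigned len);
    extern void     send_ack(unsigned src, unsigned seq);
    extern void     release_rx_slot(rx_nic_t *nic);
    extern void     deliver(buf_t *b);

    void handle_incoming(rx_nic_t *nic)
    {
        while (nic->status & RX_READY) {          /* interface control */
            msg_t *m = nic->rx_slot;
            if (!valid_pid(m->src_pid)) {         /* protected access  */
                release_rx_slot(nic);
                continue;
            }
            if (m->seq != expected_seq(m->src)) { /* in-order delivery */
                stash_out_of_order(m);            /* reorder later     */
                release_rx_slot(nic);
                continue;
            }
            buf_t *b = alloc_buffer(m->len);      /* buffer allocation */
            memcpy(b->data, m->payload, m->len);  /* copying           */
            b->len = m->len;
            send_ack(m->src, m->seq);             /* reliable delivery */
            release_rx_slot(nic);                 /* interface control */
            deliver(b);                           /* kernel interaction
                                                     if upcall traps   */
        }
    }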

When user-level handlers are used in multiprogrammed systems, messages may arrive for a process that is swapped out. The maximum performance from implementations such as Active Messages and Fast Messages is obtained when process scheduling across multiple nodes is coordinated with messaging operations [264]; this continues to be an active area of research. The continuing gap between memory and processor speeds and the interest in shared-memory parallel architectures have led to extensive analysis and characterization of the memory hierarchy. There are well-understood techniques for managing this hierarchy and for integrating that management within compilers and run-time systems. A comparable set of techniques does not yet exist for the hierarchy of delays encountered by messages within the network. Interprocessor communication is still treated more like I/O than like memory access, although attention has been drawn to this fact, and we can expect to see increased activity in this area, particularly as workstation clusters become viable parallel computing platforms.

8.6 Commented References

It is clear that software costs will be determined by the architecture of the network interface and the manner in which it will be integrated into the processing nodes. Several research projects have been studying efficient communication support by focusing on the design of the network interface. Central themes have included support for both the shared-memory and message-passing paradigms, low-overhead message initiation, overlapping communication and computation, and integrating I/O and interprocessor communication. Developments in these areas certainly will impact the design of future messaging layers.

An emerging class of flexible network interface architectures is based on the use of an integer processing unit in the network interface to run message-handling software. Commercial examples include Myrinet [30] and ServerNet [155], while research examples include the Stanford FLASH [148] and the Wisconsin Typhoon [290]. The Stanford FLASH project uses a special-purpose chip, the MAGIC chip, for handling communication, I/O, and memory transfers. The MAGIC chip comprises a cache-based processor operating in parallel with a fast pipelined data path that handles data transfers. Protocol handlers for synchronization, message passing, and other operations execute within MAGIC out of local memory. The control path is faster than the data path, so the handlers themselves are not a performance bottleneck. The important property here is that the network is optimized to handle cache-line-size transfers; message passing is effectively implemented on top of an optimized shared-memory communication mechanism. Typhoon is a similar network interface and supports the Tempest user interface for both shared-memory and message-passing operations. Both projects explore the support of multiple communication paradigms and the customization of communication policies. We can think of Active Message handlers executing within such an interface, although the FLASH interface does not support user-level handlers within MAGIC. Given the increasingly high wire bandwidth of commercial networks, such data paths must be quite fast so as not to become the bottleneck in injecting/ejecting messages to/from the network.

More tightly coupled register-mapped interfaces have been proposed in [150], which provides an abstract machine optimized for the support of short messages, such as those found in shared-memory operations. Longer messages can also be handled, although in this case the tight coupling with the processor data path may encourage more efficient alternatives. The SHRIMP project [29] supports communication by mapping virtual memory pages of the sending process into the address space of the destination processes. Accesses to these pages can now be transparently captured and transmitted to the remote process, or can be buffered and explicitly transmitted with send() operations. System calls set up this mapping information and enforce protection policies in a multiprogramming environment. A tightly coupled network interface unit enables fast message initiation, and support is readily provided for user-level messaging.

The availability of such interfaces serves to provide more efficient support for message passing. User programs still expect a certain set of services in support of the semantics of the message-passing libraries. The Active Messages [105] and Fast Messages [264] paradigms are really independent of the specific node architecture, although the efficiency of the implementations certainly depends on the node architectures.

Problems

8.1. Similar to the metric employed in the evaluation of vector machines, we can define the n_{1/2} metric as the message length required to achieve 50% of the maximum channel utilization. Write an expression for this metric in terms of the model parameters provided in Section 8.2.

8.2. Assume the network is an N = k^2 node torus and that each physical link is implemented as two unidirectional, single-flit-wide channels. With uniformly distributed message destinations and B-flit messages, write an expression for the injection rate in flits/node/cycle that will saturate the bisection bandwidth. Compute this value for a 1,024-node network with 32-flit messages.

8.3. Processor speeds have historically outpaced network speeds although that is changing. Suppose the processor speed doubles between system generations while the network bisection bandwidth increases by 50%, and the memory-network interface bandwidth remains constant. Discuss the effects on channel utilization and network throughput from one generation of the system to the next.

8.4. Often message-passing programs exhibit a great deal of temporal locality: many messages are transferred between a pair of processes with short intermessage intervals. What overheads in managing this message transmission can be shared across multiple messages? That is, what benefits can be gained from MPI’s concept of persistent communication?

8.5. Develop a simple model to quantify the benefits of persistent communication in terms of its effect on channel utilization.

    Hint Persistent communication effectively reduces the value of t_s in proportion to the number of messages that can share this overhead (Section 8.2).

8.6. Construct an example of the occurrence of deadlock using Active Message handlers that can be preempted.

8.7. The HPAM Active Message layer for the HP workstations statically allocates a request-reply buffer pair for every potential destination processor since the communication model is based on all-to-all communication. An outbound channel reserves a buffer for the expected reply. However, why is a reply buffer used for an inbound channel? You would expect that when a request message arrived, a reply would be automatically generated and transmitted by the handler. Why does this reply have to be saved in a buffer? When can this buffer be freed?

    Hint Consider the issue of reliable communication.
