Chapter 5. Packet In: Cluster Networking Basics and Example Devices

Chapter Objectives

  • Discuss basic Ethernet network technologies and topologies

  • Introduce the Ethernet hardware components in a cluster

  • Cover the design of Ethernet networks in a cluster

Active elements in a cluster are tied together by communication networks that allow the sharing of data and control information. A cluster may have one or more networks, each with its own characteristics: a given cluster may have management and control networks, data networks, SANs, and the HSI. This chapter covers the information needed to begin the Ethernet network design discussion in the next chapter.

A Short View of Ethernet Networking History

Today's low-cost network switching components provide us with a lot of design choices for connecting cluster components. Ethernet is one of the primary network types used for management, data, and inexpensive high-speed connections between compute slices, file servers, administrative nodes, and master nodes. The commodity status of Ethernet networking equipment makes it a prime candidate for low-cost, high-performance networks, and hard to ignore unless other design criteria dominate.

From a humble beginning in teletypes, acoustic modems, RS-232, RS-422, and other proprietary serial physical links running synchronous data link control (SDLC) and high-level data link control (HDLC) protocols, networking technology has settled on Ethernet as one of the primary interconnect standards. Gone are the proprietary serial point-to-point links that had more to do with interfacing to RS-232 terminals, paper-tape readers, and card punches than connecting peer-to-peer systems. (Ah, the days of Heathkit terminal kits, extra charges for read-only memories [ROMs] containing the upper-case character set, and point-to-point terminal cables running at a whopping 1200 Baud [with the wind at their backs].) Except in a very limited and aging technical population, they shan't be remembered.

Ethernet is a LAN technology based on a contention protocol called carrier sense multiple access with collision detection (CSMA/CD). The approach was pioneered in the 1970s by ALOHAnet, which used packet radio links to communicate between computers at the University of Hawaii, over the “ether.”

Ethernet was developed at Xerox PARC in the mid 1970s and described by Metcalfe and Boggs in 1976; DEC, Intel, and Xerox later published the joint “DIX” Ethernet standard. The original experimental network was a bus topology that supported roughly three megabits per second. Connections to the cable were made with “vampire” taps that pierced the coaxial cable's insulating jacket with pointed contacts that even looked like “fangs.” The ten-megabit-per-second version of Ethernet technology became an Institute of Electrical and Electronics Engineers (IEEE) standard, designated 802.3, in 1983.

Current Ethernet standards support 1-Mb, 10-Mb, 100-Mb, and 1000-Mb (GbE) per-second data rates, in full- or half-duplex operation, over both fiber-optic cable and twisted-pair copper physical connections. At the time of this writing, 10 GbE is available and becoming a standard.

One may take the view that today's networking technology inhabits a twilight zone between software and hardware. It is hard to distinguish where the software influence stops and hardware takes over. Once a packet is created by a software protocol stack and is given to the network interface, the intervening hardware delivers it to the destination interface as if by magic.

Just to complicate things, complete Transmission Control Protocol/Internet Protocol (TCP/IP) software “stacks” are now available implemented in hardware. The addition of protocols like remote direct memory access (RDMA), which allows low-latency movement of data between individual systems, also blurs the line. Because it is hard to draw a direct line between hardware and software in the networking world, we cover a mixture of both in this chapter.

The Open Systems Interconnection (OSI) Communication Model

The hardware and software composing one side of a network link may be viewed as a series of layers. The most common model is the Open Systems Interconnection (OSI) model, originated in the early 1980s, which comprises seven levels. This model is shown in Figure 5-1 along with the associated TCP/IP layers.

Figure 5-1. The seven-level OSI communication model

The models help us discuss where particular actions are taken in the sequence of sending frames, network packets, or datagrams from one system to another. The OSI model includes both hardware and software aspects of the communications interface in its layers. Most of our discussion about networking equipment will be associated with levels 1, 2, and 3.

Ethernet Network Topologies

The original Ethernet network was a single segment of cable, and true to the nature of a bus, only one station could be transmitting data at a time. To send a frame (OSI level 2) across the network medium (OSI level 1), a station began to broadcast the frame, then checked to see whether there was a collision with another station. If a collision was detected, the stations involved would each “back off” for a random period of time and then try again.

It is not hard to see the packet radio origins of the original Ethernet technology and CSMA/CD protocol. The more systems on the bus, and the busier those systems become, the less likely it is that a particular station's data will make it through to its destination. The term broadcast storm, describing the complete overload caused by too many networked systems trying to send data at once, soon appeared in the networking dictionary.

As the limitations of a single physical cable became obvious, new interconnection devices appeared that allowed extending the number of physical connections and the area of coverage. Active and passive hubs extended the reach and scope of the LAN. As Ethernet transmission became possible over twisted-pair cables, the scope and breadth of the network grew.

Early networking devices performed their duties at OSI level 1, the physical layer. They merely replicated the properly conditioned electrical signals to multiple attached physical media. You should note that all stations (or interfaces) attached to an Ethernet segment or LAN receive every frame sent across it. A frame is discarded by any station whose address does not match the frame's destination address. This is the default behavior, although it is possible to put an Ethernet interface into “promiscuous” mode, which captures all frames from the segment to which it is attached.

Ethernet Frames

Ethernet frames in a LAN are addressed between media access control (MAC) addresses that are unique for every Ethernet interface on the LAN (or had darned well better be unique to avoid nasty problems). IEEE 802 networks further divide the OSI “link level” into two sublayers: logical link control (LLC) and MAC. The LLC sublayer sits on top of the MAC sublayer and handles frame detection, flow control, and error checking. The MAC sublayer sits between the LLC sublayer and the physical layer, and controls access to the network medium and the interface's ability to transmit data.

An Ethernet MAC address is 48 bits wide, and is usually specified in hexadecimal digits, such as 0x00010203EB9B (the 0x prefix specifies a hexadecimal number), or in pairs of hexadecimal digits, like 00:01:02:03:EB:9B. The first 24 bits of the MAC address are unique to a manufacturer, being the IEEE organizationally unique identifier (OUI) assigned to that organization. There may be more than one OUI associated with a given manufacturer, as a result of mergers, acquisitions, or exhaustion of the remaining 24 bits of address space. The Ethernet interface's MAC address is only valid and “visible” on the LAN segment to which it is directly attached.
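As a quick illustration, a minimal Python sketch can split the example address above into its two halves (the address itself is arbitrary):

    # Split a 48-bit MAC address into its OUI and device-specific halves.
    mac = "00:01:02:03:EB:9B"
    octets = mac.split(":")

    oui = ":".join(octets[:3])      # first 24 bits: the manufacturer's OUI
    device = ":".join(octets[3:])   # last 24 bits: assigned by the manufacturer

    print("OUI:", oui)              # OUI: 00:01:02
    print("Device:", device)        # Device: 03:EB:9B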

An Ethernet frame propagating across a physical LAN segment is shown in Figure 5-2. When viewed at OSI level 1, the frame is a set of signals propagating across the media (ether, copper cable, fiber-optic cable, and so on). At OSI level 2, the frame contains start and stop information that “frames” the actual data, a header that contains source and destination MAC addresses, a data “payload,” and a checksum for error detection. The maximum data payload, also called the maximum transmission unit (MTU), is normally 1500 bytes for Ethernet. A nonstandard MTU size is implemented by some networking equipment vendors. Called jumbo frames, it allows data payloads of up to nine kilobytes per frame.

Figure 5-2. Single-cable Ethernet LAN segment
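As a concrete sketch of the frame layout just described, the following Python fragment packs the 14-byte Ethernet II header: destination MAC, source MAC, and a two-byte type field. The preamble and trailing checksum are handled by the interface hardware, so they are omitted here, and the MAC addresses are arbitrary examples.

    import struct

    def build_header(dst_mac: str, src_mac: str, ethertype: int) -> bytes:
        """Pack destination MAC, source MAC, and EtherType into 14 bytes."""
        dst = bytes.fromhex(dst_mac.replace(":", ""))
        src = bytes.fromhex(src_mac.replace(":", ""))
        return struct.pack("!6s6sH", dst, src, ethertype)

    # EtherType 0x0800 marks an encapsulated IPv4 datagram as the payload.
    header = build_header("00:01:02:03:EB:9B", "00:01:02:04:AA:BB", 0x0800)
    print(len(header))  # 14 header bytes, followed by up to 1500 payload bytes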

Ethernet frames are capable of containing or “encapsulating” data packets (datagrams) from other protocols like TCP/IP. Once a frame is received from the Ethernet network, the computer's operating system “network stack” will remove the Ethernet-specific information. The operating system kernel then examines the content of the data from the frame and determines the proper software destination for the information.

Some of today's modern Ethernet interface cards allow the interface's MAC address to be changed or programmed by the operating system software or driver. This feature allows one interface to take the place of another at the network link level (OSI level 2). This ability is used by high-availability software that allows one server system to assume transparently (to client systems) the identity of another.

Ethernet Hubs

Hubs allowed extending the number of LAN connections and the area covered by the network, but also extended the “collision domain” to the connected segments. A passive hub merely broadcasts Ethernet frames received to all attached ports. An active hub performs signal amplification before broadcasting the frames. Hubs perform the signal regeneration and broadcasting at the physical level of the network.

In some literature, passive hubs and active hubs may also be called concentrators or repeaters respectively. As the number of ports available on hubs increased, the ability to control the port parameters and to monitor behavior of the network and the hub itself became more important. A hub transmits a frame on all connected LAN segments, as shown in Figure 5-3.

Figure 5-3. Ethernet hub and attached LAN segments

Ethernet hubs are still available and are low cost, but they are not suitable for high-performance cluster networks because of their collision domain characteristics. The next technological breakthrough was the network “router.”

Network Routers

Hubs allowed expanding the number of shared segments in a single physical LAN, but they also extended the LAN's collision domain. As the number of network segments and attached systems in a LAN grows, the collisions take their toll. Systems spend more and more time trying to communicate, and less time actually communicating. Because sharing data is the primary reason that networks exist, a solution was needed.

The network router appeared in the late 1980s. It extended the rudimentary filtering abilities of hubs and provided the ability to isolate traffic and separate LAN segments from each other. Isolating LAN segments effectively reduced the size of the collision domains, reducing the number of systems per physical segment.

Routers also provided (and still provide) additional protocol and media interface capabilities beyond just Ethernet. Thus, connections to different network media types are possible with routers, and this can enable LAN data to be sent over longer distances—the wide area network (WAN). Routers are still commonly found at the “edge” of a network, where either protocol or physical medium translation, or both, is needed.

Figure 5-4 shows multiple LAN collision domains isolated by a router. The router in the diagram also provides a connection to the WAN along with a path between collision domains or separate LAN segments belonging to the same LAN. This was the first point in the evolution of LAN topology where a LAN could span multiple isolated segments separated by an arbitrary distance.

Figure 5-4. LAN collision domains connected by a router

A router works its magic at OSI level 3 or higher. Notice that this level is above level 2 in the OSI model. If this is the case, we have a slight issue when talking about communicating between physical Ethernet segments. How does an Ethernet interface with one MAC address, on an electrically and physically isolated segment, communicate with an interface and MAC address on another?

The answer is that it can't and it doesn't—directly. The communication takes place at a higher level in the OSI model, facilitated by the router, using a protocol and datagram, or “packet,” format defined for just that purpose. The Ethernet frame's information is encapsulated in a datagram that is sent, or “routed,” across the network “backbone” or WAN to its final destination, where it is placed into a new Ethernet frame to continue its journey. The Ethernet communication no longer relies on a direct, electrical connection to a LAN segment to be properly delivered.

A router also implements packet filtering. It does this by inspecting the contents of the datagrams it receives from the attached networks. The need to start looking into the payload of the datagrams increases the overhead on a per-datagram basis and can slow transmission even when this behavior is implemented in hardware. In other words, there usually is both a dollar cost and performance price to be paid for the generality of a router.

Up to this point we have avoided details of the protocol that is used to transport data between networks and physically separated segments. To continue our discussion, however, we must cover some of the basics of this “internetwork” protocol.

Internet Protocol and Addressing

As you will recall, the MAC addresses used to send and receive Ethernet frames are only meaningful and “visible” on the physical LAN segment to which the interface is connected. Referring to Figure 5-4, notice that for a system in LAN 1 to communicate with a system in LAN 2, there must be a protocol to bridge the addressing between the MAC addresses on separate LANs.

Also, remember that we separated the LAN collision domains with “smart” hubs, and then routers, to cut down on the intersystem “chatter” or broadcast storms caused by normal Ethernet behavior. However, this does not mean that we want to filter all packets traveling between the LANs or collision domains. We need a way for useful packets to travel between networks—a method that augments the MAC addressing used in the LAN.

TCP/IP is the most widely used protocol suite for addressing and routing packets, or datagrams, between networks. There are several good books on the TCP/IP protocols, so I cover only the fundamentals here.

IP and TCP/UDP

The IP specification defines a datagram or “packet” format, an address format, and rules for routing packets between IP addresses on separate networks. Looking back at Figure 5-1, you can now see the correlation between levels in the OSI model and the model used for TCP/IP. IP corresponds to the network layer (level 3) of the OSI model. The two other protocols shown, TCP and the user datagram protocol (UDP), correspond to the transport layer (level 4) of the OSI model.

The TCP specification defines a connection-oriented method of exchanging packets, with provisions for flow control, error recovery, and out-of-order packet arrival. These features are most useful for unreliable connections, connections that may route data across multiple pathways, or connections that have low bandwidth and long delays—the WAN or Internet.

The UDP protocol specification defines a mechanism that is not connection oriented and has no flow control, error recovery, or provisions for out-of-order arrival. UDP does have a packet-level checksum available to detect packet corruption. UDP is most suitable for reliable LAN traffic, low-overhead communications, or as a low-level transport for services that are willing to implement their own error detection and recovery mechanisms. It does not have the extra “overhead” associated with TCP, but neither is it a reliable transport.

IP Addressing

An IP address is 32 bits, usually specified as four eight-bit decimal numbers (or “octets”) separated by dots. An example of this IP address notation is 192.168.0.103,[1] which may be called dotted quad or dotted octet notation. Originally, an IP address was envisioned as belonging to one of three “classes”—A, B, or C. This was to make it easy to determine the network address portion from the system address portion for routing purposes.

A class A address had one byte of network address content and three bytes of system address content. The high-order bit in a class A address is set to 0, allowing a range for the first octet from 1 to 126 (network 0 is reserved, and network 127 is reserved for loopback). This allows 126 networks of 16,777,214 host addresses. (The host address that is all zeroes is the network address, and the host address that is all ones is the broadcast address. This subtracts two from the total number of available hosts.)

A class B network had two bytes of network address and two bytes of system address. To identify a class B address, the two most significant bits in the first byte were set to 10, yielding first-octet values from 128 to 191. Two bytes of network address, ranging from 128.0 to 191.255, allow 16,384 networks of 65,534 hosts.

A class C address had three bytes of network address and one byte of system address. The first three bits in the highest byte were set to 110, so first-octet values ranged from 192 to 223 and network addresses from 192.0.0 to 223.255.255. This arrangement produced 2,097,152 networks of 254 hosts each.

If you look at these conventions, you will see that the binary prefix (the first binary bits of the first byte) for class A addresses is 0, for class B addresses is 10, and for class C addresses is 110. Are there more classes? The answer to this question is yes: Class D addresses, those with a prefix of 1110, are used for multicast addresses, and class E addresses, with a prefix of 1111, are reserved for future use. We will not explore any further classes, but we will revisit the term multicast later. The three network classes and their address ranges are shown in Figure 5-5.

Figure 5-5. IP network classes and address ranges
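Because the class is encoded in the leading bits alone, determining it is a simple bitwise test, which is what made class-based routing cheap. A minimal sketch of the test, following the prefixes listed above:

    def address_class(first_octet: int) -> str:
        """Classify an IPv4 address by the leading bits of its first octet."""
        if first_octet & 0b10000000 == 0:              # prefix 0
            return "A"
        if first_octet & 0b11000000 == 0b10000000:     # prefix 10
            return "B"
        if first_octet & 0b11100000 == 0b11000000:     # prefix 110
            return "C"
        if first_octet & 0b11110000 == 0b11100000:     # prefix 1110
            return "D (multicast)"
        return "E (reserved)"

    for octet in (15, 172, 192, 224, 240):
        print(octet, address_class(octet))  # A, B, C, D (multicast), E (reserved)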

This class-based division of IP addresses worked well enough in the early days of the Internet, but the allocation scheme eventually started to show strain with the growing number of connections. For example, the practical number of systems that could be attached to a physical Ethernet segment is certainly less than 1024, but a class B network allowed more than 65,000 system addresses. It was, therefore, difficult to use all the system addresses in an assigned class B network efficiently. The network address space became fragmented with used and unused addresses, reducing the actual number of available addresses. (The network address space is the range of all possible network and host addresses contained in the allowed 32 bits. Without the reserved addresses, this would total 2^32, or 4,294,967,296, addresses.)

IP Subnetting

To solve this issue, IP “subnetting” was introduced in 1984. Subnetting allowed specifying a “subnet mask” that divides a class A, B, or C network into smaller pieces by masking the address into two pieces that are not constrained by the byte boundaries of the original class-based addressing scheme. The bits in the left-hand portion of the subnet mask are set to 1 to specify the corresponding network portion of the IP address, and the remaining 0 bits specify the host portion of the address.

To find the network portion, perform the logical “AND” of the binary value for the address and the binary value of the net mask. To find the host portion of the address, first take the one's complement of the binary net mask, exchanging zeroes for ones and ones for zeroes, and then perform the logical “AND” of this complemented mask and the IP address. To find the broadcast address, perform the logical “OR” of the complemented net mask and the IP address.

These calculated values are not always straightforward and intuitive, so be careful. You can think of the subnet mask as being “implied” by the byte positions in the earlier class A, B, and C address class schemes. If you need a refresher on binary arithmetic, there are a number of good references on the Internet, such as http://www.learntosubnet.com. A very useful network address calculator is available at http://www.telusplanet.net/public/sparkman/netcalc.htm.
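The bitwise operations themselves are easy to sketch in a few lines of Python. Here the example address 192.168.0.103 is paired with the mask 255.255.255.224, anticipating the class C example below, with both treated as 32-bit integers:

    def to_int(dotted: str) -> int:
        """Convert dotted-octet notation to a 32-bit integer."""
        a, b, c, d = (int(x) for x in dotted.split("."))
        return (a << 24) | (b << 16) | (c << 8) | d

    def to_dotted(value: int) -> str:
        return ".".join(str((value >> s) & 0xFF) for s in (24, 16, 8, 0))

    addr = to_int("192.168.0.103")
    mask = to_int("255.255.255.224")
    inverse = ~mask & 0xFFFFFFFF            # one's complement of the mask

    print(to_dotted(addr & mask))           # network:   192.168.0.96
    print(addr & inverse)                   # host:      7
    print(to_dotted(addr | inverse))        # broadcast: 192.168.0.127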

As an example, let's take a class A network: 15.0.0.0. This network allows three bytes, or 24 bits, of host address—that's more than 16 million hosts, ranging from 15.0.0.1 to 15.255.255.254. To make efficient use of this address range for a large, distributed enterprise with multiple physical networks, we may wish to divide the address into subnets, each with a more usable number of hosts. If we chose a subnet mask of 255.255.248.0, we would produce 8192 networks (13 bits of subnet) that may have 2046 hosts (11 bits, minus the network and broadcast addresses) each. This example, and the next, are shown in Figure 5-6.

Figure 5-6. IP subnetting example

As another example, let us return to our class C IP address, 192.168.0.103, which has a network address of 192.168.0.0 and a system address of 103, based on the implied, class-based net mask of 255.255.255.0. Another common notation for the address and network portions is 192.168.0.103/24. The /24 indicates that the first 24 bits of the IP address specify the network portion. The network portion of this example address is 24 bits, and the system portion is eight bits, using the class C scheme, allowing 254 hosts.

We may decrease the number of hosts and increase the number of subnets available in this class C address by using a subnet mask of 255.255.255.224. With this choice, the IP address is divided into eight networks (with decimal addresses in the fourth octet of 0, 32, 64, 96, 128, 160, 192, and 224) with 30 hosts each. Another way of specifying this address scheme is 192.168.0.0/27, and the example is shown in Figure 5-6. The IP address 192.168.0.103 would have a network portion of 192.168.0.96 and a system portion of 7. (By subnetting in this way, we have actually decreased the potential number of hosts from 254 [2^8 minus 2] in the pure class C scheme to 240, or 8 times [2^5 minus 2], because of the additional reserved broadcast and network addresses.) The network broadcast address would be 192.168.0.127.
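These results can be checked with Python's standard ipaddress module, which performs the same mask arithmetic:

    import ipaddress

    iface = ipaddress.ip_interface("192.168.0.103/27")
    print(iface.network)                       # 192.168.0.96/27
    print(iface.network.broadcast_address)     # 192.168.0.127
    print(iface.network.num_addresses - 2)     # 30 usable host addresses

    # The eight /27 subnets that divide the original class C network:
    for subnet in ipaddress.ip_network("192.168.0.0/24").subnets(new_prefix=27):
        print(subnet)   # 192.168.0.0/27, 192.168.0.32/27, ... 192.168.0.224/27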

The pure class-and-subnet approach to addressing was modified in the early 1990s and replaced with classless interdomain routing (CIDR), which specifies the network portion of the IP address as a bit prefix that is easier for networking equipment to route. Based on a given network prefix in the IP address, a datagram may be routed to a single location by a global Internet router, and the local Internet service provider's (ISP's) routers then determine the correct “local” or “internal” destination. This simplifies the global routing tables by treating the addresses as if they were hierarchical.

Figure 5-7. An IP supernetting example

IP Supernetting

With the net mask freed from the byte boundaries defined by the IP address classes, it is now possible, using CIDR, to have subnets that contain more addresses than allowed by the more restrictive class scheme. Combining four class C networks—192.168.0.0, 192.168.1.0, 192.168.2.0, and 192.168.3.0—along with adjusting the net mask to 255.255.252.0 or /22 allows a single network with 1022 hosts (2^10 minus 2). This combined set of networks is called a supernet, and the process is called IP supernetting.
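The ipaddress module can verify the aggregation as well:

    import ipaddress

    # The four adjacent class C networks from the example above.
    nets = [ipaddress.ip_network(f"192.168.{i}.0/24") for i in range(4)]

    supernet = next(ipaddress.collapse_addresses(nets))
    print(supernet)                      # 192.168.0.0/22
    print(supernet.netmask)              # 255.255.252.0
    print(supernet.num_addresses - 2)    # 1022 usable host addresses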

Supernetting allows the aggregation of smaller network address spaces into larger, more usable address space. This address space aggregation also allows a decrease in the complexity of global routing tables, because where there would have been multiple entries, there need only be one. Although it is unlikely that we will use supernetting, per se, in the design of clusters, it does help us understand the networking tools that we are likely to use.

Ethernet Unicast, Multicast, and Broadcast Frames

Ethernet frames may be categorized into three types: unicast, multicast, and broadcast. Each type of frame serves a different purpose and is accepted by a different number of systems. Unicast frames are the most restricted in terms of target audience; broadcast frames are the least restricted. The sending interface places its own MAC address into the source address field of each of the three frame types.

A unicast frame is used to send an Ethernet frame from one system to a single destination system. The frame is placed on the Ethernet medium with the destination MAC address of the target interface as part of the frame's header. Although every attached station on the LAN can “see” the frame, other interfaces will ignore the frame if the destination MAC address does not match theirs. Unicast traffic is a “one-to-one” communication.

Multicast frames are somewhat more complicated than unicast or broadcast frames, and a full treatment is beyond the scope of this discussion. Suffice it to say that systems may participate in multicast traffic, and frames are sent to all participants. Multicasting is a useful feature, particularly when sending installation data to multiple systems at once. Multicast traffic is “one-to-many” communication. Multicasting is similar to broadcasting, but systems must “subscribe” to a multicast channel on an interface to receive the traffic, which allows the network interface to ignore multicast traffic that the system has not explicitly requested. This is in contrast to broadcast traffic.

A broadcast frame on an Ethernet segment contains a destination MAC address consisting of all one bits (48 ones, or FF:FF:FF:FF:FF:FF). An Ethernet interface that detects a frame with the broadcast address will accept it, passing the frame contents up the network stack to be interpreted. All physically attached interfaces have access to broadcast frames. A broadcast frame may be considered a “one-to-all” communication.
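The frame type can be read directly from the destination MAC address: the least significant bit of the first octet (the individual/group, or I/G, bit) marks group traffic, and the all-ones address marks a broadcast. A minimal sketch:

    def frame_type(dst_mac: str) -> str:
        """Classify an Ethernet destination MAC address."""
        if dst_mac.lower() == "ff:ff:ff:ff:ff:ff":
            return "broadcast"                     # all 48 bits set to one
        if int(dst_mac.split(":")[0], 16) & 0x01:
            return "multicast"                     # I/G (group) bit set
        return "unicast"

    print(frame_type("00:01:02:03:EB:9B"))   # unicast
    print(frame_type("01:00:5E:00:00:01"))   # multicast (an IPv4 multicast MAC)
    print(frame_type("FF:FF:FF:FF:FF:FF"))   # broadcast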

Address Resolution Protocol (ARP)

Once a system has an IP address associated with its Ethernet interface, it must be able to determine the MAC address of the interface that owns a desired destination IP address. On an Ethernet segment, ARP provides this translation mechanism. A system on an Ethernet segment that wishes to send a datagram to another sends a broadcast Ethernet frame that contains the IP address of the desired target.

The ARP request frame is seen by all systems and devices directly connected to the physical Ethernet segment. Only the system that owns the desired IP address replies, sending a frame directly to the requestor. The sender then caches the reply in the system ARP cache, expiring the entry after a preset amount of time.

A network device connected to the Ethernet segment, such as a router, may answer the ARP request with its own MAC address if the sender's data needs to transition between network segments. The sender makes an ARP request, and the router responds with its own information. The sender then transfers its data to the router's interface, and the packet “escapes” the local segment into the willing hands of the router.
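On a Linux system, the kernel's ARP cache is visible in /proc/net/arp, so the IP-to-MAC translations a system currently holds can be inspected with a few lines of Python (or simply by viewing the file):

    # Print the current IP-to-MAC translations from the Linux ARP cache.
    with open("/proc/net/arp") as arp_cache:
        next(arp_cache)                           # skip the column headings
        for line in arp_cache:
            fields = line.split()
            ip_addr, mac_addr = fields[0], fields[3]
            print(f"{ip_addr:16} -> {mac_addr}")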

IPv4 and IPv6

The current version of the IP protocol, IPv4, is rapidly running out of available address space. Several stopgap measures have delayed the exhaustion of the 32-bit address space provided by IPv4: supernetting, reclaiming unused addresses, the Dynamic Host Configuration Protocol (DHCP), and network address translation (NAT) have all been used to postpone the inevitable.

TIP

Do not use IPv6 addressing inside your cluster unless you have a good reason for the added complexity.

A new version of the IP protocol, IPv6, is making an appearance in the networking world. Many systems, Linux included, support features of IPv6 as they become available. It is expected that network devices, such as routers, will also begin supporting the new protocol. The IPv6 protocol provides the following features and advantages (among others):

  • 128-bit IP addresses

  • Additional security (IP Security Protocol [IPsec] is required)

  • Help with routing scalability

The availability of IPv6 is something of which we must be aware, just in case we need a cluster to attach to an IPv6 network. For an in-depth treatment of IPv6, see Loshin [2004]. I consider IPv6 beyond the scope of further discussion.

Private, Nonroutable Network Addresses

Several network address ranges within each class (A, B, C) are reserved for private, nonroutable networks.[2] These network address ranges are considered unavailable for public use and may not be registered by any person or company. Routing equipment on the public Internet is designed to drop packets using these addresses.

TIP

Using private, nonroutable network addresses inside your cluster will help to “hide” the internal resources from the outside world. This adds to security and the illusion that the cluster is a single entity.

The private, nonroutable networks are: class A addresses 10.0.0.0 to 10.255.255.255, class B addresses 172.16.0.0 to 172.31.255.255, and class C addresses 192.168.0.0 to 192.168.255.255. We will be using a selection of these special addresses for the internal cluster networks.
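Python's ipaddress module knows these RFC 1918 ranges, which gives a quick way to confirm that a planned internal cluster address is indeed private:

    import ipaddress

    for candidate in ("10.1.2.3", "172.16.0.10", "192.168.0.103", "15.1.1.1"):
        addr = ipaddress.ip_address(candidate)
        print(f"{candidate:14} private: {addr.is_private}")
    # The first three are private (nonroutable); 15.1.1.1 is publicly routable.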

Ethernet Switching Technology

The Ethernet switch was introduced in the early 1990s. Ethernet switches are low-latency, high-bandwidth, learning bridges that switch frames at the MAC level (OSI layer 2). The switch may be thought of as providing a direct hardware connection between any two ports that are communicating, with no collisions. A switch allows multiple connections between ports to be active at the same time without interference, and because the switching is done at the MAC level, they are protocol independent.

The switch maintains a table of MAC address and port relationships, and will send frames to the appropriate port or ports on the switch based on the destination MAC address and the MAC addresses in the table. Any frame containing a destination that is not in the table is forwarded to the switch's backbone port. Frames arriving from the backbone port are checked against the local switch tables and are forwarded to the correct local port if appropriate.
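The learning behavior is simple enough to sketch in a few lines. The fragment below is a simplification, flooding frames with unknown destinations to all ports rather than forwarding them to a dedicated backbone port, but the table-driven logic is the same:

    # MAC address -> switch port, learned from the source address of each frame.
    mac_table = {}

    def handle_frame(in_port: int, src_mac: str, dst_mac: str) -> None:
        mac_table[src_mac] = in_port               # learn the sender's location
        out_port = mac_table.get(dst_mac)
        if dst_mac == "ff:ff:ff:ff:ff:ff" or out_port is None:
            print(f"flood to all ports except {in_port}")
        elif out_port != in_port:
            print(f"forward to port {out_port} only")
        # If out_port == in_port, the destination is local to that segment: drop.

    handle_frame(1, "00:01:02:03:eb:9b", "ff:ff:ff:ff:ff:ff")  # flood (broadcast)
    handle_frame(2, "00:01:02:04:aa:bb", "00:01:02:03:eb:9b")  # forward to port 1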

To handle multiple simultaneous connections, the switch's backplane is typically rated in terms of tens to hundreds of gigabits per second. Much of the switching functionality is implemented in hardware to maintain high throughput. To support frame switching, MTU conversion, flow control, and other features, a switch usually incorporates an intelligent switching matrix and a shared-memory architecture, along with specialized firmware.

Some switches share some packet routing capabilities with routers and are able to route packets at OSI level 3. A fully switched LAN allows moving slower, more expensive routers to the “edge” of the network. (There is considerable confusion between terms like layer-2 router and layer-3 switch. These terms were introduced when filtering and “learning” technology was added to layer 2 bridges in an attempt to differentiate the new product from the more limited bridging technology. The confusion is still rampant.) You will rarely, if ever, find a router in the midst of a cluster. Note that a LAN is formed by devices connected at OSI levels 1 and 2 (hubs and bridges). Once a router is involved, the LAN ends.

Half and Full Duplex Operation

One feature inherent in switched Ethernet is the ability to send and receive data simultaneously on the same connection. The original Ethernet CSMA/CD protocol, which allowed transmission by one interface in one direction at a time, is now termed half duplex and is still supported by switches for backward compatibility. The ability to send and receive data on the network link simultaneously is called full duplex operation.

There are obvious performance advantages to full-duplex operation. Half-duplex operation requires the line to be “turned around” to transmit data in the opposite direction. If collisions are present on a link, then it is in half-duplex operation. Collisions and one-way traffic reduce the potential throughput on a link, so full-duplex operation is desirable and may provide a substantial performance improvement.

During link initialization, information may be exchanged between the link client and the switch regarding the capabilities of each entity and the desired mode of operation. It is possible to set the speed and duplex of a particular link explicitly at either or both ends, or to allow the switch and network interface card (NIC) to “autonegotiate” the behavior. Ensuring that both the switch and the client interface agree is important to proper operation of the network link.

Although it is usually best to allow autonegotiation to handle the link configuration, it is not always successful. The standards for the various implementations of Ethernet define how the autonegotiation process works, and determine the default behavior in the event of disagreement between the switch and the client NIC. The link components must still be able to pick a working operating mode for the physical link, even if one end's autonegotiation implementation is limited.
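On a Linux client, the negotiated result can be checked without any vendor tools, because the kernel exposes it under /sys/class/net. The interface name eth0 below is an assumption; substitute your own (the files are only readable while the link is up):

    from pathlib import Path

    iface = Path("/sys/class/net/eth0")       # assumed interface name

    speed = (iface / "speed").read_text().strip()     # megabits per second
    duplex = (iface / "duplex").read_text().strip()   # "full" or "half"
    print(f"eth0 negotiated {speed} Mb/s, {duplex} duplex")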

For unknown reasons, some 100base-TX Ethernet links seem to have inherent autonegotiation problems. I have seen frequent disagreements between the switch and the client NIC regarding the current operating mode of the link. When one party thinks it is in half duplex and the other thinks it is in full duplex, data errors occur on the link. If I had a nickel for every link (and associated unexplained performance issues) I have encountered in this state, I could retire.

Some Ethernet implementations, such as GbE, are inherently full duplex. In other words, the standard specifies two sets of fibers or shielded twisted pairs, one set for transmit and another for receive.

Store and Forward versus Cut-through Switching

Two major classes of data switching are circuit switching and packet switching. In circuit switching, a control packet creates or destroys the circuit. Once established, data flows along the established circuit. If a port is allocated to a circuit, it will block other attempts to create a circuit using it as an end point until the existing circuit is destroyed.

In packet switching, packets are routed based on their header contents. Two classes of packet switching are “store and forward” and “cut-through.” A store-and-forward device first receives all portions of the frame, checks the frame for validity, then sends it to the destination ports. Checking for a complete and valid frame eliminates the forwarding of malformed frames, but also increases the switch's latency.

A cut-through switch detects the incoming packet and immediately begins sending it to the destination ports as soon as the destination address is detected. Busy ports may force even a cut-through switch to buffer the packet. (Some vendors' switches will not buffer UDP packets by default, instead dropping the data if the switch gets congested. Although UDP has no built-in guarantee of datagram delivery, the acronym should not stand for “U Drop Packets.” This behavior is a particular issue if you are using UDP as the primary transport for NFS traffic.)

Collision Domains and Switching

A switch with each port dedicated to a single system creates a collision-free “switching domain.” Connecting switches together with a backbone enables systems with the same LAN addresses to communicate, even if they are not on the same switch. Again, this communication takes place without collisions.

An example of four LAN segments, physically separate and located on two different switches, is shown in Figure 5-8. Each switch allows systems belonging to the same LAN to communicate with each other, even though they are not on the same physical LAN segment. Any frame destined for a system not on the current switch is forwarded across the backbone to the other switch in the switching domain.

Figure 5-8. A multiple switch domain

To allow connections between switches without creating bottlenecks, links with higher bandwidth than the individual switch ports are needed. Additionally, high-bandwidth connections to servers and other local resources may require more bandwidth than a single switch port. We look at the method for interconnecting switches with higher bandwidth links in the next section.

Link Aggregation

The ability to “gang,” “team,” or “aggregate” multiple network links together is a way to increase available bandwidth and network availability. Multiple full-duplex links operating in parallel provide the aggregated bandwidth of the component links. If one of the aggregated links should fail, the remaining links continue to operate, with the overall link operating at reduced capability.

Four full-duplex 100base-FX links, for example, provide the equivalent of a single 400-Mb link—400 Mb per second in each direction. Another term for this functionality is trunking. The ability to aggregate ports originated with proprietary protocols and functionality from several switch vendors. Today, the IEEE 802.3ad specification defines an industry-standard way of aggregating switch ports.

There is typically a limit to the number of individual links that may participate in an aggregate, and the limit may depend on the switch's manufacturer. To the systems using it, an aggregated link must behave just as if it were a discrete link, with its own MAC address and network behavior. The switch must be able to balance traffic across the link transparently, and there are several load-balancing approaches.
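One simple and widely used balancing approach hashes the source and destination MAC addresses to select a member link, which keeps each conversation on a single link and therefore keeps its frames in order. A sketch of the idea:

    def choose_link(src_mac: str, dst_mac: str, link_count: int) -> int:
        """Pick a member link by hashing the two MAC addresses."""
        src = int(src_mac.replace(":", ""), 16)
        dst = int(dst_mac.replace(":", ""), 16)
        return (src ^ dst) % link_count

    # With four aggregated links, every src/dst pair maps to one of links 0-3.
    print(choose_link("00:01:02:03:EB:9B", "00:01:02:04:AA:BB", 4))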

An example of link aggregation is shown in Figure 5-9. Two core switches are linked together by four 10-GbE links, providing 40 Gbps in each direction. One of the core switches is tied to an edge switch with four one-Gbps links, providing four Gbps in each direction.

Figure 5-9. An Ethernet link aggregation example

Link aggregation will be an important capability in the switches we choose for our cluster's network. Without it, we would be unable to build a network free of the bottlenecks that result from bandwidth imbalances between switch ports and uplinks. We delve deeper into aggregation later.

Virtual LANs

An Ethernet switch will normally forward, or “flood,” all broadcast and multicast frames it receives to all ports. The larger a switched LAN grows, the more this type of flooding occurs across ports on interconnected switches. The ability to control which frames arrive at a switch's ports is an important switch capability.

Switches with virtual LAN, or VLAN, capability allow a system manager to assign individual switch ports to one or more numbered VLANs. The port's VLAN assignment determines which broadcast and multicast frames are sent to the port by the switch. The switch maintains a table that maps MAC address and port location to the VLAN.

Broadcast and multicast frames are kept within their port's originating VLANs by the switch. Ethernet frames are tagged with VLAN information so that VLAN-capable switches can properly deliver them to participating ports. Using VLANs can

  • Provide extra security by limiting the ports that “see” restricted frames

  • Improve performance by limiting switch broadcast domains

  • Isolate multicast participants from the rest of the LAN

  • Drive you crazy configuring them without software help

Not all switches are capable of handling VLANs. Even though there is an IEEE standard, 802.1Q, that defines the operation of VLANs, there are many proprietary features and potential interoperability issues. You should carefully check the capabilities of any switch before committing to it. An example network with switched VLANs is shown in Figure 5-10.

Figure 5-10. An example switched VLAN
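The 802.1Q VLAN tag itself is a four-byte field inserted after the source MAC address: a tag protocol identifier (TPID) of 0x8100, followed by a priority field and a 12-bit VLAN ID. A minimal parsing sketch:

    import struct

    def parse_vlan_tag(frame: bytes):
        """Return (priority, vlan_id) for an 802.1Q-tagged frame, else None."""
        tpid, tci = struct.unpack_from("!HH", frame, 12)  # after the two MACs
        if tpid != 0x8100:
            return None                                   # untagged frame
        return tci >> 13, tci & 0x0FFF     # 3-bit priority, 12-bit VLAN ID

    # A fabricated header: two zeroed MACs, then a tag for VLAN 10, priority 0.
    header = bytes(12) + struct.pack("!HH", 0x8100, 10)
    print(parse_vlan_tag(header))   # (0, 10)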

Systems that are members of one VLAN are not able to communicate with systems in other VLANs without the intervention of layer 3 routing. This is one reason that some switches implement both layer 2 switching along with layer 3 routing. The routing capability is needed to interconnect VLANs, just as if they were physically separate LAN segments.

Jumbo Frames

The default MTU, or data payload, for Ethernet frames is 1500 bytes, but some switches allow this value to be increased to as much as nine kilobytes. One term applied to this feature is jumbo frames, and it can increase the performance of bulk data transfers between Ethernet devices. Increasing the size of the data payload in each frame reduces the amount of fragmentation that occurs when sending large streams of data. Jumbo frames are extremely useful for file server connections over Ethernet, especially for NFS.
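The arithmetic behind the improvement is straightforward: fewer, larger frames mean fewer per-frame headers, checksums, and interrupts for the same amount of data. For example:

    import math

    data = 1_000_000                    # one megabyte of bulk data to send
    for mtu in (1500, 9000):
        frames = math.ceil(data / mtu)
        print(f"MTU {mtu}: {frames} frames")
    # MTU 1500: 667 frames;  MTU 9000: 112 frames, roughly six times fewer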

It is extremely important to ensure that both the switch and the NICs are capable of handling jumbo frames before enabling this functionality. Along with the hardware capabilities, the operating system, drivers, and utilities must be able to handle jumbo frames. A lack of complete support will make the use of jumbo frames difficult or impossible.

In some switches, a system manager may use the device management interface to create a VLAN within which jumbo frames are used. Ports assigned to that VLAN must usually be attached to NICs that have jumbo frames enabled. To interface with devices that cannot support jumbo frames, the switches in the network must be configured to fragment and route the traffic to ports in VLANs outside the “jumbo frame” VLAN. Careful consideration is needed to trade off the additional complexity against the potential for performance increases.

At the time of this writing, there is no standard for implementing jumbo frames across all hardware and software that might be involved, so each switch vendor has implemented it in their own special ways. One would hope that all implementations would interoperate, but this may or may not be the case. Carefully investigate all capabilities before committing to, or depending on, this functionality.

Managed versus Unmanaged Switches

As switches get more complicated and add more features like VLANs, they become more difficult to monitor and manage. Most modern switches have some level of management interface, including Web-based interfaces, serial port consoles, or Simple Network Management Protocol (SNMP). This is not to say that “dumb” unmanaged switches are not available.

Remember that your ability to debug switch issues like autonegotiation conflicts, VLAN assignments, routing tables, and other essential switch behavior depends on your ability to access the internal switch information through some form of interface. It is essential to select switches that may be configured and monitored remotely for your cluster.

TIP

Use only managed switches in your cluster. The ability to examine and control the operation of individual ports and global switch parameters is essential to proper administration (and debugging) of your cluster's networks. Selecting unmanaged switches to save money is a false economy.

Example Switches

As a final examination of switched Ethernet, let's examine some example switches from Hewlett-Packard's ProCurve line: a 1U “edge” switch and two “core” switches, meaning switches intended to serve as central connection points. All have a good price-to-performance ratio.

A GbE Edge Switch

First, let's examine the Hewlett-Packard ProCurve Switch 2848, shown in Figure 5-11, which provides 48 ports of 10/100/1000base-TX. (This is not to be considered an endorsement of this particular manufacturer or product; it is used only as an example.) Other features include

  • Secure shell (SSH) command-line interface

  • Secure socket layer (SSL) encrypted Web management access

  • 96-Gb-per-second backplane

  • Optional redundant power supply

  • Support of up to 60 port-based VLANs

  • Support of the Simple Network Time Protocol (SNTP)

  • DHCP relay

  • Serial RS-232 out-of-band management port

  • Cisco Fast EtherChannel (FEC)

  • 802.3ad link aggregation

  • Latency of 11.8 microseconds for 100 Mb per second, 4.4 microseconds for 1000 Mb per second

  • Support of remote monitoring (RMON) and extended RMON for switch statistics, alarms, and events

Figure 5-11. An example edge switch

These features, and many others, give us a very flexible switch for data connections or even for HSIs. The number of ports supports the number of 1U or 2U systems one would expect in a 42U rack, while the switch itself occupies only 1U.

One way of measuring the price performance of a switch is the cost per port, which is useful when evaluating switch characteristics for design decisions. Based on the list price for this switch (as of January 19, 2004), the cost per port is $82.99, which is quite reasonable for the performance. There is also a 24-port version, the ProCurve 2824, with a per-port cost of $84.66.

One thing to steadfastly avoid is a switch in which the backplane is “oversubscribed,” meaning that the sum of the port bandwidths exceeds the backplane performance. For this switch, the backplane throughput is 96 Gb per second, which equals the total bidirectional bandwidth of all 48 ports (one Gb per second in and one Gb per second out in full-duplex 1000base-TX operation).

One issue with this switch is the possibility of the four uplinks becoming a bottleneck. If four ports serve as uplinks, the sum of the remaining 44 ports' bandwidth is 88 Gb per second, whereas the four uplinks are capable of handling a total of only 8 Gb per second, an 11:1 oversubscription of the uplinks.
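The bottleneck arithmetic is worth making explicit, counting both directions of each full-duplex GbE port:

    edge_ports, uplink_ports, gbps_per_port = 44, 4, 1.0

    edge_bw = edge_ports * gbps_per_port * 2       # 88 Gb/s, both directions
    uplink_bw = uplink_ports * gbps_per_port * 2   # 8 Gb/s, both directions

    print(f"uplink oversubscription: {edge_bw / uplink_bw:.0f}:1")   # 11:1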

Ethernet Core Switches

A “core” Ethernet switch can be thought of as a collapsed network backbone. It is a central connection point for the edge switches in the cluster's racks, which connect directly to devices like compute slices. Depending on the size of the cluster, we can choose a switch with sufficient ports for direct connections to the systems in other racks, or for trunked connections or “uplinks” from the edge switches in individual racks.

The core switch needs to have routing capabilities to tie together any VLANs in the cluster. It should also be able to sustain the level of data transmission necessary to keep all attached ports active simultaneously. As a cluster grows in size, the requirements for the core switch also grow.

Let's examine two switches with more capabilities than an edge switch. As with the edge switch in the previous section, these are Hewlett-Packard ProCurve switches: the 5304xl and 5308xl. The smaller model, the 5304xl, shown in Figure 5-12, has four slots that accept modules providing various types of network connections. Our example switch has four 20-port 10/100/1000base-TX modules.

Figure 5-12. A small core switch

Some of the other features of this switch are

  • Layers 2, 3, and 4 (TCP/IP port-based) routing

  • 76.8-Gb-per-second backplane

  • Hot-swappable modules

  • Static NAT capability

  • Secure management with SSH and SSL

  • Up to 256 simultaneous VLANs

  • Link aggregation with Cisco FEC and 802.3ad

  • Optional redundant power supply

Support for VLANs and link aggregation, among other features, allows us to connect uplinks from the edge switches in our cluster and provides aggregated links for file servers. Routing capabilities allow connection of any VLANs that we might define.

As with other switches, we need to be careful not to oversubscribe the backplane. The current configuration, with 80 GbE ports, could require 160 Gb per second if all ports operate simultaneously in both directions, which is more than twice the 76.8-Gb-per-second backplane capacity. Provided that all ports are not saturated at once, this configuration is certainly suitable for smaller clusters.

Finally, let's look at a much larger switch chassis, the ProCurve 9308m. The switch chassis shown has eight slots available for switching “blades.” The configuration in Figure 5-13 shows six 16-port 10/100/1000base-TX blades and a dual-port 10-GbE blade. This switch also comes in a smaller four-slot version and a larger 15-slot chassis.

Figure 5-13. A mid-size core switch

Some of the switch features are

  • 256-Gb-per-second backplane

  • Less than 7-microsecond latency

  • Hot-swappable modules

  • VLAN support

  • 802.3ad link aggregation

The ability to handle larger numbers of direct connections, or trunked links, makes this switch a choice for larger cluster configurations. If the number of connections in a chassis is exhausted, the switches may be trunked together to provide an internal backbone for all the Ethernet networks in the cluster.

Ethernet Networking Summary

In this chapter we covered some networking basics that will be used again and again as we design the internal networks of a cluster and make system administration choices about network configurations. We covered the various networking technologies associated with Ethernet and Internet routing. Finally, example hardware from Hewlett-Packard illustrated the features and configurations available in various sizes, and at various prices, of network switching hardware. Products from other switch manufacturers, like Extreme Networks and Cisco Systems, also offer an abundance of features and price points, and make good choices as networking infrastructure for clusters.



[1] Please notice the absence of any leading zeros in these decimal numbers. On a UNIX or Linux system, any number that starts with a zero is interpreted as an octal number. Thus if a misinformed system administrator enters 015.026.102.34 as an address, it will most likely be interpreted as 13.22.102.34, which is not the intent.

[2] There is a specific request for comment (RFC) that defines these networks and their use. Please see RFC 1918 at http://www.ietf.org/rfcs/rfc1918.txt for the complete definition.
