Chapter 6. Tying It Together: Cluster Data, Management, and Control Networks

Chapter Objectives

  • Provide an example of serial port console management equipment

  • Describe remote system management cards and their features

  • Investigate the design of management and data networks

In addition to the HSI, there may be one or more Ethernet networks in a cluster. Here I present some of the design characteristics and choices for these networks, along with an example IP addressing scheme for a cluster.

Networked System Management and Serial Port Access

There are at least two classes of management access possible with compute slices, network devices, and other active elements in your cluster. How you access the devices will depend on their capabilities and the software support for their remote interfaces. Because of differences in operating system support, some devices may have a very nice automated management interface for Microsoft Windows that is not available on Linux.

Irrespective of the availability of graphical remote tools, the device itself may allow out-of-band management via a network port, access to a character-based interface via a serial port, Web-based access, secure shell (SSH) access, or a combination of all these methods. (“Out-of-band” management typically means the device provides a completely separate network connection that is used only for management purposes.) The cluster's designer may wish to restrict access to management devices associated with systems, switches, and other manageable devices. The most secure way is to use a separate network to access only the management devices.

Careful planning of management network details like addressing, default routes, which devices need to be attached, and attachment methods can save a lot of trouble later. Because an Ethernet switch places all attached devices in a single broadcast domain (unless VLANs are used to partition it), multiple logical IP networks may coexist on the same physical network without too much effort. In some security-conscious environments, a physically separate, secure network for management access is an absolute design requirement.

Remote System Management Access

Selecting a compute slice that has a built-in system management interface can add to the manageability of your cluster. A system that is designed as a stand-alone server is more likely to have this functionality. Management or remote access devices, whether built in or added as separate PCI interfaces, allow functionality such as remote access to the system console in graphics mode, remote power on and power off, and remote upgrades of firmware and BIOS, among other capabilities. The usefulness of these features becomes more apparent as your cluster grows.

The remote management capabilities are usually available from a separate, “out-of-band” network port integrated into the system. This means that the management capability is available from the external port, but that network is not available to the operating system. Think of the management device as having direct access to the state of the system hardware (even with the system shut down, in some cases), but with no direct access from “inside.” An example of a built-in management interface is the Integrated Lights-Out (iLO) processor available in the ProLiant servers from Hewlett-Packard. The management network connection normally supports 10/100Base-TX.

In addition to the integrated management interfaces, there are third-party PCI interface cards that provide system management capability. The software and drivers for these cards tend to be more focused on the Microsoft Windows operating system, rather than Linux, so beware of assuming driver support. Investigate the availability of drivers and functionality carefully before choosing a particular vendor's management interface card. Also, be aware that the add-in card may take up a much-needed PCI slot in your compute slices.

The management interface on each system will most likely need to be configured with an IP address before it can be remotely accessed over the management network. There are various approaches to configuring the management address, depending on the manufacturer. As shipped from the factory, the device may come with a default password and IP address. For security's sake, never use the default configuration shipped from the manufacturer.

Many of the more intelligent management devices are capable of getting their address and network information from DHCP. Collecting the device's MAC address for the DHCP configuration and setting the access password will still require an initial connection to the device. You should plan on making a temporary connection to each device through its serial or network port to configure the IP address and the management password.
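
As an illustration of this step, the short Python sketch below generates static host entries for an ISC DHCP server from a list of collected management-card MAC addresses. The host names, MAC addresses, fixed addresses, and output file name are hypothetical placeholders; adjust the template to whatever DHCP software is actually deployed on the management network.

    # Sketch: emit ISC dhcpd "host" stanzas for collected management cards.
    # The names, MAC addresses, and fixed addresses below are hypothetical.
    mgmt_cards = [
        ("cs001-mgmt", "00:11:22:33:44:01", "10.1.0.1"),
        ("cs002-mgmt", "00:11:22:33:44:02", "10.1.0.2"),
    ]

    stanza = (
        "host {name} {{\n"
        "    hardware ethernet {mac};\n"
        "    fixed-address {ip};\n"
        "}}\n"
    )

    with open("dhcpd-mgmt.conf", "w") as conf:
        for name, mac, ip in mgmt_cards:
            conf.write(stanza.format(name=name, mac=mac, ip=ip))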

Once configured, the management interface to the system will allow operations like remote power on and off, console access, BIOS configuration and updates, and monitoring of system parameters like chassis intrusion, fan speeds, and internal temperature. Some of the more advanced system management interfaces allow access to the integrated graphics device in the system via a Web-based Java interface.

Keyboard, Video, and Mouse Switches

The lowest common denominator for accessing a system console in graphics mode is the video graphics array (VGA) standard, which is supported to some level by virtually all PC graphics adapters. This minimal level of support still gives installation and configuration utilities a usable way to present their interfaces in text, color, and even graphics. Some systems include reasonable integrated graphics capabilities, such as the ATI Rage XL device shown in Figure 4-1, which depicts a Hewlett-Packard ProLiant DL-360 system.

A VGA interface provides a 720-pixel by 400-pixel text mode, which is sufficient for a bit-mapped 25-line by 80-character display. In graphics mode, VGA displays may be 640 pixels by 480 pixels with 16 colors (from a palette of 262,144 colors) or 320 pixels by 200 pixels with 256 colors. There are other graphics modes, such as super VGA (SVGA), that we will not cover here. What graphics functionality is available, and whether you can access it via the remote management card, depends on the system's manufacturer.

To allow graphics access to the system consoles in a cluster, a single keyboard, mouse, and graphics monitor may be shared between multiple systems with a keyboard–video–mouse (KVM) switch. Doing this requires a connection between the KVM switch and the VGA output, keyboard input (PS/2 or USB), and mouse input (PS/2 or USB) on each system. Because of the number of connections (three per system), cable distance limitations, and KVM port limitations, this approach should be limited to the master nodes or other important systems that need local access in the computer room.

Serial Port Concentrators or Switches

Many commodity server systems still have serial ports that allow access to external, character-based console I/O. This feature is intended to allow a character-based RS-232 terminal, with its keyboard, to be attached directly for use as a system console. The vast majority of systems in a cluster, however, will not have physical consoles attached to them.

Because system console access is necessary for troubleshooting and reporting purposes, a way of getting a minimal connection to the console (and logging console output) is needed. Connections may be made via a “crash cart,” which has a terminal, monitor, keyboard, and mouse available for direct temporary connection in the event of a problem. The crash cart is wheeled up and connected to the system via PS/2, USB, VGA, or serial connectors.

The console output from a system is very useful in tracing kernel panics, application errors, power-on self-test (POST) results, and operating system boot history. Collecting this information requires a permanent connection to the serial port and a way to access the output. Additionally, it is not possible to get “in-band” management information from the system's built-in network interfaces when the system is down, because a live operating system is required for those interfaces to be active.

What is needed is a connection to the serial port from an external device that can access and log the information. Serial port concentrators, sometimes called serial port switches, can provide this access over a network connection. An example serial port switch, the Cyclades TS-2000 16-port Console Server is shown in Figure 6-1. (Information on the complete set of products is available at http://www.cyclades.com.)

Figure 6-1. Cyclades TS-2000 serial port switch

This device runs Linux as the internal operating system and has a number of very useful features for our purposes:

  • Secure SSH access via an in-band or out-of-band network connection

  • Event notification tied to console output

  • Remote output buffering

  • 8-, 16-, 32-, and 48-port configurations in a 1U form factor

  • Support for trivial file transfer protocol (TFTP), DHCP, and network time protocol (NTP) protocols and functionality

With this device, we can monitor and log the output of the compute slice's serial console to either the “syslog” daemon or an NFS mount point on a remote system. If necessary, a login to an attached system's serial console is possible by accessing the switch via SSH.
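
For a rough sense of what such console logging amounts to, independent of any vendor's hardware, here is a minimal Python sketch that reads a single serial console and forwards each line to the local syslog daemon. It assumes the pyserial package and a console wired to /dev/ttyS0 at 9600 baud; it is a concept sketch, not a description of how the console server itself is implemented.

    # Concept sketch: read a serial console and forward each line to syslog.
    # Assumes the pyserial package and a console on /dev/ttyS0 at 9600 baud.
    import logging
    import logging.handlers
    import serial

    log = logging.getLogger("console.cs001")          # hypothetical node name
    log.setLevel(logging.INFO)
    log.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

    port = serial.Serial("/dev/ttyS0", baudrate=9600, timeout=60)
    while True:
        line = port.readline()                        # b"" when the timeout expires
        if line:
            log.info(line.decode(errors="replace").rstrip())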

We can eliminate the need for crash carts or physical access to the systems while still controlling who can access the management ports of the compute slices and Ethernet switches. At the same time, a single Ethernet connection gives us access to as many serial port consoles as are attached to the device. Although this level of management access is not needed for every cluster, the larger the cluster becomes, the more the system managers require this type of functionality.

Cluster Ethernet Network Design

To begin making design decisions about your cluster's Ethernet networks, you must first decide which networks will be present from the possibilities of data, management, and HSI. The size of your cluster will dictate whether you have a management LAN that is physically or logically separate from the data network. You might also be using Ethernet as the HSI for your cluster to save cost.

Whether you separate the networks into discrete management, HSI, and data networks will depend on your budget and the required performance of your cluster. Time-sensitive HSI communications, bulk data transmissions for the NFS or cluster file system, and console logging and management traffic do not necessarily play well together on the same physical link. Depending on your security considerations, it may be a requirement that the management network be physically separate in terms of links, switches, and access points.

The first step is to gather all the user requirements for security, access, and performance together in one place. With this information collected, you will have a clear picture of which types of networks are required and how to implement them with Ethernet devices. We look at specialized HSI networks in a separate chapter.

Choosing a Clusterwide IP Address Scheme

Having an idea of how many networks are required and what IP address ranges to use for your cluster's Ethernet networks is a good starting point. The size of the cluster, coupled with any plans for future additions, will help you to select the proper type and range of IP addresses. Choose wisely, because it is very difficult to change IP addresses once the cluster goes “live.”

Using the class-based IP scheme as an example, it would be a disaster to choose a “pure” class C network address and subnet mask for a cluster of 1,024 nodes, unless you want to deal with subnetting within the cluster. Because a class C address is limited to 254 hosts, you might choose to supernet multiple class C networks instead, but you should carefully consider the impact on system administration effort.

Like it or not, we still tend to think in terms of the old “pure” class-based schemes, and a lot of software is still written to display and assume them, with network and host addresses divided evenly on octet boundaries. Clean address boundaries and subnets make the networks easier to picture, and therefore easier to administer.

IP Addressing Conventions

No matter what networking scheme you select, the host address with all zeros and the address with all ones in the host portion of the IP address are reserved. The network address (at the low end of the range) and the broadcast address (at the high end of the range) flank the set of available host addresses.

To make automation of system administration easier, you can choose a set of conventions for the remaining address range that divides it into two or more “blocks” of addresses assigned to specific types of equipment. Allowing scripts (and system administrators) to make assumptions about important addresses can lead to improved automation of administrative and start-up tasks.

One example of this is to use the high addresses on a subnet for devices like switches, assigning them in decreasing order. The lower range of addresses would then be reserved for connections to systems, assigned in increasing order. The default gateway for the subnet might always be the broadcast address minus one.
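
A minimal sketch of one such convention, using Python's standard ipaddress module: hosts are assigned from the bottom of the range upward, infrastructure devices from the top downward, and the default gateway sits at the broadcast address minus one. The particular ordering and the number of switch addresses are assumptions for illustration only.

    # Convention sketch: low addresses for systems, high addresses for infrastructure.
    import ipaddress

    subnet = ipaddress.ip_network("10.1.0.0/16")
    hosts = list(subnet.hosts())              # usable addresses, lowest to highest

    default_gateway = hosts[-1]               # broadcast minus one: 10.1.255.254
    switch_addresses = hosts[-2:-10:-1]       # next eight, counting downward
    first_compute_slice = hosts[0]            # 10.1.0.1

    print(default_gateway, first_compute_slice)
    print([str(addr) for addr in switch_addresses])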

Using Nonroutable Network Addresses

To “hide” the internal networks in the cluster, you can choose network addresses from the private, nonroutable networks. Refer to Section 5.4.8 for more information on the available nonroutable network address ranges. By making the master nodes in the cluster the access point, and by disabling routing between any internal networks and the external network, you can prevent packets from “leaking” to the outside world.

Using the nonroutable private addresses does not by itself guarantee that packets with those addresses won't “escape” the cluster; it relies on the routing equipment being configured to drop packets bearing the private network addresses. Additional security measures may be necessary to prevent access to internal cluster resources from the “outside” world. Some of the internal cluster networks are shown in Figure 6-2.

Figure 6-2. “Hidden” networks in a cluster

An Example Cluster Ethernet Network Design

Instead of talking about a design, let's do one. Our example cluster will have 1,024 nodes, and Ethernet management, data, and HSI networks. I chose 1,024 nodes because it adds some intrigue to the design.

The first thing to note is that the IP network address (all zeros in the host part) and the broadcast address (all ones in the host part) cost us two addresses out of the total range right out of the gate, so to speak. Network devices like the default gateway will also take addresses from the available range. We therefore need a host address range that allows for more than the expected 1,024 host addresses to avoid disappointment.

Another quick note is in order with regard to this example. This configuration is more suited for a scientific cluster than for either a database or a Web serving cluster. The internal and external designs of the cluster networks are driven by the type of cluster and the required access to its internal components. Your application will determine the final requirements for the network design.

Choosing the Type of Network and Address Ranges

I like to have the network and host portions of the address readily visible without a lot of binary gymnastics. Let's choose the private, nonroutable 10.0.0.0 network as the basis of the design, with a subnet mask of 255.255.0.0. This gives us room for 65,534 devices on each of the subnets we create.

Because our slate is clean, let's initially choose 10.1.0.0 for the management network, 10.2.0.0 for the data network, and 10.3.0.0 for the HSI network. As you can see, we are using the second octet to designate a “subnet,” just to make things readable. We can now start mapping expected device addresses. The network address scheme is shown in Figure 6-3.

Figure 6-3. Three example cluster networks and addresses
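
The same layout can be written down directly with the ipaddress module, which also confirms the size of each host range. The network choices simply mirror Figure 6-3; nothing here is specific to any particular switch or operating system.

    # The three example /16 networks from Figure 6-3.
    import ipaddress

    networks = {
        "management": ipaddress.ip_network("10.1.0.0/16"),
        "data":       ipaddress.ip_network("10.2.0.0/16"),
        "hsi":        ipaddress.ip_network("10.3.0.0/16"),
    }

    for name, net in networks.items():
        # Each /16 offers 65,534 usable hosts: plenty of room for 1,024 compute slices.
        print(f"{name:12s} {net.with_netmask}  usable hosts: {net.num_addresses - 2}")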

Notice that with this scheme it is possible to add subnets to the three networks, but they will not be contiguous. Subnetting the 10.1.0.0 network would involve adding the next network at 10.4.0.0, for example. If you have to add subnets, say to support VLANs among groups of systems, it might be better to leave some “growth room” between the network addresses. For example, the management network might consist of 10.1.0.0 through 10.4.0.0; the data network, 10.5.0.0 through 10.8.0.0; and so on.

Another consideration in network design is the use of a private network for the management network. With this design choice, there must be a way for the system managers to access the management network, either through the master nodes or from a separate, multihomed system connected to the management network. Using a separate management system, in addition to the master nodes, is a good choice and provides multiple paths to the management network in the event of a master node failure.

Thinking ahead and working out the network design before the systems need to be installed and configured can save headaches at the last minute. Working out difficulties with subnets, host addressing, and other details is easier if you write down the addresses and draw pictures. “Measure twice and cut once” applies to crafts other than carpentry.

Device Addressing Schemes

In our example 1,024-compute slice cluster, using the previously described example networks and netmasks, we have a host address range from 1 to 65,534 in all three networks. If we assume that no subnetting is needed within those networks, we will need 1,024 host addresses for the compute slices, one address for the file server in the data network (possibly more if there are multiple servers), and additional addresses for switch management and default routes.

We will start the compute slice addresses in each network at 1 and run up through 1,024. The default gateway and other special addresses will start at the highest usable host address (65,534) and work down. Knowing that the compute slice addresses in all three networks have the same host value allows us to “map” an address to a physical node in a rack, if we wish to carry the convention that far.

We can also view the addresses as being arranged by physical rack, with hosts 1 through 32 in the first rack (assuming 1U systems), hosts 33 through 64 in the second rack, and so on. Maintaining this mapping across the other networks makes it easier to physically locate the system associated with an IP address. Of course, some rack-mounted systems have “locator lights” that can be activated from the management interface, but a regular IP scheme can make system administration scripts and automation easier to implement.
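
Here is a minimal sketch of that mapping, assuming 32 systems per rack, node numbers 1 through 1,024, and the convention that a compute slice keeps the same host value on the management, data, and HSI networks. The function name and numbering details are illustrative assumptions.

    # Map a compute slice number (1..1024) to its rack, slot, and three addresses.
    # Assumes 32 systems per rack and the 10.1/10.2/10.3 networks chosen earlier.
    NODES_PER_RACK = 32
    NETWORKS = {"mgmt": "10.1", "data": "10.2", "hsi": "10.3"}

    def locate(node):
        rack = (node - 1) // NODES_PER_RACK + 1
        slot = (node - 1) % NODES_PER_RACK + 1
        third, fourth = divmod(node, 256)     # host value split across two octets
        addrs = {net: f"{prefix}.{third}.{fourth}" for net, prefix in NETWORKS.items()}
        return rack, slot, addrs

    print(locate(1))       # rack 1, slot 1:   10.1.0.1 / 10.2.0.1 / 10.3.0.1
    print(locate(1024))    # rack 32, slot 32: 10.1.4.0 / 10.2.4.0 / 10.3.4.0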

There are other devices that will require addresses. A cluster of this size usually has spare compute slices, administrative systems, head nodes, and test/development systems in the cluster networks. All these “extra” systems must fit into the clusterwide IP addressing scheme.

The Management and Control Networks

Performance in the management LAN is not as big an issue as in the Ethernet HSI or the data network. The primary need is to provide intermittent access to the devices for configuration and troubleshooting. Just having the connectivity to all the management ports, at a reasonable speed, is our goal.

The main reasons to separate the management LAN from the other cluster LANs are security and isolation from networks requiring bulk data transport and low-latency communication. Although separation is always desirable, it is possible to share the same physical network to reduce costs.

The compute slice connections to the management cards will require one IP address each, for a total of 1,024. Using 1U compute slices, with 32 systems per 42U rack, the example 1,024 compute slice cluster has 32 compute racks. If each of these racks has one 48-port 10/100 switch for the management network, then we must allocate 32 IP addresses across the cluster for the switch out-of-band management access.

The management and control network allows remote configuration (ports, VLANs, routes, and so forth), monitoring, and troubleshooting of the switches; remote console access and logging; and remote power on and off of the systems that have management capability. The management LAN topology is shown in Figure 6-4.

Figure 6-4. Management LAN topology

In addition to the Ethernet switches, serial port management and console logging may require thirty-two 32-port serial port switches, one per rack. Each of these requires an IP address for access to its attached serial ports, and the console switches will need access to an NFS server on the same network for remote buffering and console logging. The console switch's own out-of-band management connection may be attached to a port in the rack's 48-port management network switch.

All these management connections will need host names to match their IP addresses on the management LAN. Using a naming scheme that helps physically locate the device is a good idea. Once the IP addresses, all 64 or more of them, are assigned, a judicious choice of host names is in order. But we are getting a little ahead of ourselves.
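
As an example of such a naming scheme (the format itself is purely an assumption), the following sketch emits /etc/hosts-style entries for the 32 rack switches and 32 console switches, encoding the rack number in each name so the device can be located on the computer room floor.

    # Emit /etc/hosts entries for the management-LAN switches, one pair per rack.
    # The "rNNsw"/"rNNcon" naming convention is an assumption for illustration.
    for rack in range(1, 33):
        print(f"10.1.10.{rack}\tr{rack:02d}sw\t# rack {rack} Ethernet management switch")
        print(f"10.1.20.{rack}\tr{rack:02d}con\t# rack {rack} serial console switch")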

Controlling physical access to the management LAN is important in some clusters. Keeping the management LAN on a completely isolated physical switch with special external connections is one possibility; requiring access through a physically separate system is another. If physical separation is not an issue, then the head nodes may be configured to access the management LAN through the core switch. The latter configuration is shown in Figure 6-4.

The Data Network

In our example data network, let's choose NFS as an expedient file-serving technology. No matter what file-serving technology you choose, the data LAN needs to be optimized for bulk data transfer between the file servers and the cluster's compute slices and other clients. Switched GbE is a good low-cost choice for this connection.

Providing bandwidth with no bottlenecks will usually involve port aggregation from the server to the switch. To facilitate bulk data transport, Ethernet jumbo frames are also an option. Although link aggregation is commonly available, only certain types of switches and Ethernet interfaces support jumbo frames.

To avoid having 32 separate GbE data network connections running from each rack to the master rack, it is possible to locate a properly sized GbE switch in each rack. In this configuration, the rack switch allows a “fan out” to the systems in the rack, with a trunked link back to the core switch. Care must be taken not to create a bottleneck at the trunk or to oversubscribe the rack switch's backplane. An example data network topology is shown in Figure 6-5.

Figure 6-5. Example data network topology

The network shown in Figure 6-5 uses a four-link trunk between the NFS file server and the core switch. This aggregated link provides four gigabits per second in each direction, which is a theoretical maximum throughput of 500 MB per second[1] each way (read and write) for the full-duplex link. This is roughly 488 KB per second for each node in the cluster.

Along with trunking, the configuration of Ethernet jumbo frames may introduce the need for VLANs. If a system's NIC is incapable of handling jumbo frames, it is necessary for the switching equipment to handle fragmenting the frames into the standard size that the interface can handle. Some switching equipment handles this by allowing the network manager to place jumbo frame-capable systems in one VLAN, placing “normal” systems in another VLAN, and allowing the switch to route the jumbo frames to the “normal” VLAN, fragmenting them as necessary.

Designing the back-end storage to sustain this level of writes, especially synchronous writes to RAID 5 storage, is extremely difficult and expensive. It would take multiple RAID arrays and multiple Fibre Channel loops to permit this level of I/O to and from the cluster. It is the write performance that is the primary issue.

Sustained, high-bandwidth I/O across all nodes in the cluster is a difficult file system and storage design problem. If every node in the cluster simultaneously required a checkpoint of only two gigabytes of RAM, we would need to write more than two terabytes of data. At the 500 MB per second theoretical maximum quoted earlier, it would take one hour nine minutes to complete the checkpoint. I cover file system issues later in this book.

If each of the switches shown in Figure 6-5 had four trunks to the core switch, each of the 32 switches could simultaneously pump a theoretical 500 MB per second into the core switch while receiving an equal amount. The backplane of the core switch must be able to sustain 32 GB per second or 256 Gb per second to avoid being oversubscribed—from just the data network traffic alone.
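
Because these figures drive the core switch and storage requirements, the arithmetic is worth writing down once. A back-of-the-envelope calculation in Python, using the same theoretical (not measured) link rates; the small differences from the numbers quoted in the text come down to decimal versus binary units.

    # Back-of-the-envelope data-path arithmetic for the 1,024-node example.
    NODES = 1024
    RACKS = 32
    TRUNK_LINKS = 4                        # GbE links in each aggregated trunk
    LINK_MB_S = 1000 / 8                   # 1 Gb/s is 125 MB/s, theoretical

    trunk_mb_s = TRUNK_LINKS * LINK_MB_S               # 500 MB/s each way
    per_node_kb_s = trunk_mb_s * 1000 / NODES          # ~488 KB/s per node

    checkpoint_mb = NODES * 2 * 1000                   # 2 GB of RAM per node
    checkpoint_min = checkpoint_mb / trunk_mb_s / 60   # ~68 min with decimal units

    core_gb_s = RACKS * trunk_mb_s * 2 / 1000          # send plus receive: 32 GB/s
    print(trunk_mb_s, per_node_kb_s, checkpoint_min, core_gb_s)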

If this switch is also used by the management and control LAN, the traffic may easily exceed the capability of a single switch chassis. It becomes necessary to split the network across multiple switches to avoid saturating the backplane. Even then, the storage backend is the bottleneck. As the cluster gets larger, the load placed on the networking equipment forces larger and more expensive switch configurations to keep up with necessary traffic.

Similar design issues occur when the back-end storage is on a SAN. The number of simultaneous Fibre Channel fabric logins may become an issue as the number of systems accessing the data grows. It is a very good idea to calculate the theoretical maximums for the I/O media, the storage channels, and the switches when designing an acceptable data network.

The design decisions you make about bulk data transport will depend on your characterization of the applications running in your cluster. Depending on the expected utilization of the cluster, not all compute slices may need simultaneous access to the file server or database storage. That assumption makes the network bandwidth and design parameters easier to meet; otherwise, partitioning the data, partitioning the network, or both may be used to reduce the aggregate burden on the file server and storage back ends.

Example IP Address Assignments

Using the 255.255.0.0 subnet mask, each of the three IP networks in our example cluster provides 65,534 usable host addresses, and the 10.0.0.0 private address space has room for up to 256 such networks. This range ought to be sufficient for our 1,024-compute slice cluster. There are as many different ways to assign the addresses as there are people to do the design, but an example assignment for the management network devices is shown in Table 6-1.

Table 6-1. Example Management LAN IP Address Assignments

Description                          Addresses
Compute slice management ports       10.1.0.1 to 10.1.4.0
Rack switch addresses                10.1.10.1 to 10.1.10.32
Console switch addresses             10.1.20.1 to 10.1.20.32
Data switch addresses                10.1.30.1 to 10.1.30.32
Core switch address                  10.1.40.1
Storage device addresses             10.1.50.1 to 10.1.50.6
Master node management ports         10.1.60.1 to 10.1.60.4
Default gateway                      10.1.255.254
Broadcast address                    10.1.255.255
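
One practical use for a table like this is to record the plan in machine-readable form and sanity-check it before any software is configured. The short sketch below simply restates the Table 6-1 blocks and verifies that every block falls inside 10.1.0.0/16 and that no two blocks overlap; the block names and checking logic are illustrative only.

    # Sanity-check the Table 6-1 address plan: inside 10.1.0.0/16, no overlaps.
    import ipaddress

    MGMT_NET = ipaddress.ip_network("10.1.0.0/16")
    blocks = {
        "compute slice mgmt ports": ("10.1.0.1",  "10.1.4.0"),
        "rack switches":            ("10.1.10.1", "10.1.10.32"),
        "console switches":         ("10.1.20.1", "10.1.20.32"),
        "data switches":            ("10.1.30.1", "10.1.30.32"),
        "core switch":              ("10.1.40.1", "10.1.40.1"),
        "storage devices":          ("10.1.50.1", "10.1.50.6"),
        "master node mgmt ports":   ("10.1.60.1", "10.1.60.4"),
        "default gateway":          ("10.1.255.254", "10.1.255.254"),
    }

    ranges = {}
    for name, (lo, hi) in blocks.items():
        lo, hi = ipaddress.ip_address(lo), ipaddress.ip_address(hi)
        assert lo in MGMT_NET and hi in MGMT_NET, name
        ranges[name] = (int(lo), int(hi))

    items = sorted(ranges.items(), key=lambda kv: kv[1])
    for (n1, r1), (n2, r2) in zip(items, items[1:]):
        assert r1[1] < r2[0], f"{n1} overlaps {n2}"
    print("address plan is consistent")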

The IP address assignments for the data network are a little easier than the assignments for the management LAN. Example data LAN IP address assignments are shown in Table 6-2. One of the reasons behind the relative simplicity of this network is the design decision to avoid adding devices that might interfere with bulk data transfer.

Table 6-2. Example Data LAN IP Address Assignments

Description                          Addresses
Compute slice data connections       10.2.0.1 to 10.2.4.0
Head node data connections           10.2.10.1 to 10.2.10.4
NFS server address                   10.2.20.1
Default gateway                      10.2.255.254
Broadcast address                    10.2.255.255

Cluster Network Design Summary

In this chapter's network examples we examined some of the issues surrounding the choice of an IP addressing scheme for a large cluster. Having an adequate “IP address space” for the expected connections, and an overall scheme for allocating the addresses, can save headaches as software configuration starts. The primary message is, no matter what the expected IP addressing scheme is, you should do a paper design prior to committing to it.

Because of their low cost, Ethernet switches make a good basis for these networks. In Ethernet LANs built from this equipment, the TCP and UDP protocols run over an IP base. To design the LANs properly, we need intimate knowledge of the features, capabilities, and limitations of the switching hardware we will use.

To keep the example figures simple, I omitted any redundancy with regard to the core Ethernet switches. For realistic cluster designs, it is essential to consider multiple core switches to provide the proper level of redundancy for your cluster. Adding extra switches to each of these designs covers the potential for hardware failure or loss of power to the primary core switch.

Do not forget to plan for redundant network paths, or a single failure can disable the entire cluster. Adding the redundancy is not without its costs, however. Additional switches mean added complexity in the form of more cables and interconnections to troubleshoot, not to mention the doubling of cost for the core switch components.

Although I waved my hands over some of the issues involved in designing the data and management LANs found in clusters, I cannot possibly cover every conceivable situation, nor can I cover them to exhaustive depth. The design and equipment considerations you face will depend entirely on the size and intent of your cluster networks, but my hope is that these example considerations will help you avoid some of the pitfalls.



[1] Four links at one gigabit per second each is a total of four gigabits per second. Dividing four gigabits per second by eight bits per byte yields a 500 MB per second aggregate, or 125 MB per second for each link. In reality, the link may only be 80% to 90% efficient if jumbo frames are used with a good quality switch, so a total of 400 to 450 MB per second in each direction is a more reasonable expectation.
