Chapter 17. Dollars and Sense: Cluster Economics

Chapter Objectives

  • Examine the costs associated with cluster hardware

  • Look for trends in cluster hardware costs

This chapter discusses the hardware expenses associated with building small to midsize clusters. An analysis of the costs is presented, based on list hardware prices for three compute slice models and two HSI technologies.

Initial Perceptions

When I first became involved with clusters, it was on a rather huge implementation, by anyone's standards. The complexity of the environment was familiar to me, and neither I nor anybody else on the team thought that the job would be easy. By the time the target installation was complete, more than two years had elapsed. Even with our initial presumptions of difficulty, the job turned out to be more complex, difficult, and expensive than we expected.

Working on smaller clusters for other customers, I encounter the attitude that clusters are automatically cheaper than other comparable solutions, merely because they use commodity hardware and software. (I have come to call this the pile o' hardware syndrome.) As with any complex solution, there are a lot of details requiring attention when building a cluster, and I wondered whether the rosy assumptions were actually true. After all, I had worked on an exceptional project. Maybe the economies of scale were different for different-size clusters. I decided to see if there was an economic analysis available for clusters comprising 8 to 128 compute slices.

To date, I have found lots of agreement across the industry that clusters are a growing solution and that they can save money for certain types of applications. This opinion is repeated over and over again in articles and whitepapers. What is not discussed, however, is the actual cost breakdown associated with building a cluster, with regard to hardware, software, or labor. Because labor costs are so variable, such an analysis is beyond the scope of this book. (Indeed, I frequently run into clusters that are built “for free” by graduate students or interns at educational institutions. In these cases, somebody is paying; I'm just not sure who.)

Setting the Ground Rules

As part of my job, I analyze and recommend new technologies that may be turned into solutions for my company's customers. Because application-specific clusters are of particular interest to our customers, I started researching the topic. Because I was unable to find an analysis of the hardware costs associated with the size of clusters that we target, I decided to build my own model clusters and analyze the hardware component cost.

Building even a model cluster from scratch involves generating a design, researching the available components, generating a parts list, and finally producing a hardware cost estimate. Starting the analysis, I made some discoveries fairly quickly.

  • There is a difference between using 1U compute slices and 2U compute slices in the cluster. A cluster of a given size will take twice as many compute racks with 2U compute slices as with 1U systems.

  • The complexity of a cluster is a function of the number of active components and the required number of interconnections. Minimizing the number of interconnections can improve reliability, but will increase cost.

  • Cluster interconnect switches tend to provide a number of ports that is a power of two, whereas Ethernet switches follow different rules. This is important because once you run out of ports on a switch, you need to add another one or move to the next bigger model, driving up the costs.

Each of these discoveries plays a part in the overall cost of the cluster's hardware.

Our example configurations have 8, 16, 32, 64, and 128 compute slices in 1U and 2U packages. Three HSI technologies are used in the designs: Ethernet, Myrinet, and Quadrics QsNet. All price calculations are based on published US list prices from the companies' Web sites. The cost of the HSI interface card for the compute slice is included in the HSI cost breakdown.

The systems considered were all dual-processor servers from Hewlett-Packard: the DL-360 G3, the zx6000, and the rx2600. The DL-360 is a dual-processor 3.2-GHz Intel Pentium 4 Xeon system, whereas both the zx6000 and rx2600 systems are dual-processor 1.5-GHz Intel Itanium 2 systems. The zx6000 is a workstation system, with graphics capability and an optional management card, whereas the rx2600 is intended as a commercial server and includes management functionality as standard.

All systems were given similar configurations: two processors, 2 GB of system RAM, and one 36.4-GB SCSI disk. The zx6000 was configured with the optional management card to match the functionality of the DL-360 and rx2600. The list prices were generated from this set of configurations.
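
The tables later in this chapter were built by totaling three buckets per configuration (compute slices, HSI, and infrastructure) and expressing each bucket as a percentage of the total hardware cost. The following is a minimal sketch of that bookkeeping; the component breakdown and every dollar figure in it are placeholders for illustration, not the actual list prices used for the analysis.

# Sketch of the cost bookkeeping behind the tables in this chapter.
# All dollar figures are placeholders, not the list prices used in the analysis.

def cost_breakdown(compute_total, hsi_total, infrastructure_total):
    """Return (infrastructure %, compute %, HSI %) of total hardware cost."""
    total = compute_total + hsi_total + infrastructure_total
    return (100.0 * infrastructure_total / total,
            100.0 * compute_total / total,
            100.0 * hsi_total / total)

# A hypothetical eight-compute slice configuration:
nodes = 8
compute = nodes * 7_000                 # per-slice price (placeholder)
hsi = nodes * (1_200 + 200) + 12_000    # NIC + cable per node, plus one switch
infra = 2 * 3_500 + 2 * 2_500 + 8_000   # racks, LAN switches, master node

infra_pct, compute_pct, hsi_pct = cost_breakdown(compute, hsi, infra)
print(f"infrastructure {infra_pct:.1f}%, compute {compute_pct:.1f}%, HSI {hsi_pct:.1f}%")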

Cluster Cabling and Complexity

There is a size beyond which a cluster's complexity skyrockets. Ask anyone who builds clusters and they will tell you it is somewhere between 128 and 512 compute slices. If you take a look at the number of cables in the configurations, you can see an exponential increase in the number of cables, and therefore interconnections, as a cluster gets larger.

Figure 17-1 shows the total number of cables in a cluster based on a 1U compute slice for various sizes of clusters, up to 128 compute slices. The figure shows cable configurations for a cluster using Infiniband as the HSI, along with a management LAN and data LAN. Also depicted is a cluster that has only Ethernet management and data LANs, and finally a “full configuration” with Ethernet being used for management and data LANs, and a separate HSI.

Figure 17-1. Total cables in a cluster with 1U compute slices

Figure 17-2 shows the number of cables for a cluster based on 2U compute slices. There is a slight increase in the number of cables, resulting from the increased number of racks. Even if the designer uses switches inside the racks, the 2U cluster still has more connections than one built with 1U compute slices, mainly because there are more racks to connect.

Figure 17-2. Total cables in a cluster with 2U compute slices

Infiniband allows connecting systems with a single cable that provides high-bandwidth, low-latency, multiplexed communications. It is possible to run the HSI, the management LAN, and the data LAN over a single cable. This can drastically reduce the number of cables per system, and therefore the overall system complexity.
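
To see roughly where the cable counts in Figures 17-1 and 17-2 come from, you can tally the per-system cables and the rack-to-rack uplinks. The sketch below is a simplified model of that tally; the per-rack capacity and uplink count are my own assumptions for illustration, not the exact values behind the graphs.

import math

def total_cables(nodes, slice_height_u, cables_per_node, uplinks_per_rack=4):
    """Rough cable count: per-node cables plus rack-to-master-rack uplinks.

    Assumes a 42U compute rack filled only with compute slices; the uplink
    count per rack is also an assumption, chosen only for illustration.
    """
    slices_per_rack = 42 // slice_height_u
    compute_racks = math.ceil(nodes / slices_per_rack)
    return nodes * cables_per_node + compute_racks * uplinks_per_rack

for n in (8, 16, 32, 64, 128):
    full = total_cables(n, 1, 3)          # management LAN + data LAN + HSI per node
    consolidated = total_cables(n, 1, 1)  # one Infiniband cable carrying all three
    print(f"{n:4d} slices: full ~{full} cables, consolidated ~{consolidated}")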

Eight-Compute Slice Cluster Hardware Costs

Starting with the least expensive cluster configuration, one with eight compute slices, let's examine the contribution of the various hardware elements. The costs are broken into

  • Compute slice hardware

  • HSI

  • Infrastructure (management and data LANs)

Figure 17-3 shows that the major component of the total cost is the compute slices. The networking components and HSI are implemented with the smallest possible switches, which lowers the overall cost. When multiple switches, or multiple levels of HSI switches, are required, as for larger clusters, the cost increases substantially.

Figure 17-3. Hardware component cost, eight-compute slice cluster

Both 1U (DL-360) and 2U (rx2600 and zx6000) systems are included in the graphs. The infrastructure costs, as a percentage of the total, are expected to increase for the 2U systems over the 1U systems, because more racks and switches are required as the cluster gets larger. The associated cost percentages with the first graph are listed in Table 17-1.

Table 17-1. Eight-Compute Slice Hardware Cost Percentages

  System and HSI Type      Infrastructure Cost, %   Compute Cost, %   HSI Cost, %
  HP DL-360 G3, Myrinet    29.57                    51.80             18.63
  HP DL-360 G3, Quadrics   27.20                    47.65             25.16
  HP rx2600, Myrinet       20.05                    74.39              5.55
  HP rx2600, Quadrics      19.06                    72.95              8.00
  HP zx6000, Myrinet       20.17                    71.85              7.98
  HP zx6000, Quadrics      19.00                    69.65             11.36

Sixteen-Compute Slice Cluster Hardware Costs

Table 17-2 lists the cost breakdowns for a 16-compute slice cluster. At 16 systems, the compute slices will still fit into a single rack. The infrastructure costs are slightly higher than for the eight-node cluster, and the HSI costs, which include interface cards and cables, are higher because of the increased number of systems.

Table 17-2. Sixteen-Compute Slice Hardware Cost Percentages

  System and HSI Type      Infrastructure Cost, %   Compute Cost, %   HSI Cost, %
  HP DL-360 G3, Myrinet    18.76                    62.84             18.40
  HP DL-360 G3, Quadrics   14.91                    51.40             33.68
  HP rx2600, Myrinet        7.19                    87.50              5.32
  HP rx2600, Quadrics      10.07                    82.47              7.46
  HP zx6000, Myrinet       11.59                    73.53             14.88
  HP zx6000, Quadrics      11.42                    77.97             10.61

Note that as the size of the cluster grows, the cost of the infrastructure tends to drop as a percentage of the total cost. Part of this is the result of more efficient filling of the racks. Sixteen 1U compute slices require only one compute rack and one master rack, whereas 16 2U compute slices require two compute racks and one master rack. Hardware costs for this configuration are shown in Figure 17-4.

Figure 17-4. Hardware component cost for a 16-compute slice cluster

Network switches tend to have ports in units of eight or 12, so moving to 16 compute slices means that one switch will be filled and a second is needed, increasing the cost. The HSI switches also begin to fill up, requiring either additional chassis or additional blades in the existing chassis to support the added systems.
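
The step behavior is simple integer arithmetic: once a switch is full, an entire additional switch (or a larger chassis) must be purchased, whether or not all of its ports end up in use. A quick sketch, assuming a hypothetical 12-port switch and a placeholder price:

import math

def switch_cost(nodes, ports_per_switch, price_per_switch):
    """Switch count is driven by port count, so cost rises in steps."""
    switches = math.ceil(nodes / ports_per_switch)
    return switches, switches * price_per_switch

for n in (8, 16, 32):
    count, cost = switch_cost(n, ports_per_switch=12, price_per_switch=2_500)
    print(f"{n:3d} slices: {count} switch(es), ${cost:,} (placeholder price)")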

Thirty-two-Compute Slice Hardware Costs

As the cluster grows to 32 compute slices, the smaller Quadrics HSI switch chassis needs to be replaced with a larger one. This increases the HSI cost for this size of cluster, but leaves additional capacity for expansion. The compute slice portion of the cluster's cost continues to grow in relation to the infrastructure and HSI costs. This cost relationship is shown in Figure 17-5.

Figure 17-5. Hardware component cost for a 32-compute slice cluster

At 32 compute slices, the cluster is producing close to 274 GFLOP peak for the Intel Pentium 4 Xeon systems and roughly 378 GFLOP peak for the Itanium 2 systems. This calculation is the sum of the capabilities of the individual systems, and does not take into account the scaling of the parallel Linpack benchmark across the cluster. Producing 70 to 80% scaling is considered very good.
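
The peak figures come from multiplying the per-processor theoretical peak across all the CPUs in the cluster. A sketch of that arithmetic follows; the assumption of four floating-point operations per clock for the Itanium 2 is mine, and the equivalent figure for the Xeon depends on how many operations per cycle you credit the core with.

def peak_gflops(nodes, cpus_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak: the sum of individual CPU peaks, ignoring scaling."""
    return nodes * cpus_per_node * clock_ghz * flops_per_cycle

# Itanium 2 at 1.5 GHz, assuming four floating-point operations per cycle:
print(peak_gflops(32, 2, 1.5, 4))         # 384 GFLOP peak, near the value quoted

# Delivered Linpack performance is lower; 70 to 80% of peak is very good:
print(0.75 * peak_gflops(32, 2, 1.5, 4))  # roughly 288 GFLOP sustained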

The infrastructure costs of our 32-compute slice clusters are dropping toward the single-digit percentages as the total number of compute slices increases. A 32-compute slice cluster requires two compute racks and one master rack for the 2U compute slices. The 1U compute slices require one compute rack and one master rack.

The total costs for the 32-compute slice clusters range from a low of $224,000 to a high of $903,000. The cost percentages for this cluster configuration are shown in Table 17-3.

Table 17-3. Thirty-two-Compute Slice Hardware Cost Percentages

  System and HSI Type      Infrastructure Cost, %   Compute Cost, %   HSI Cost, %
  HP DL-360 G3, Myrinet    13.47                    68.11             18.42
  HP DL-360 G3, Quadrics   10.66                    53.40             35.94
  HP rx2600, Myrinet        7.69                    87.40              4.91
  HP rx2600, Quadrics       7.17                    81.45             11.38
  HP zx6000, Myrinet        8.51                    84.43              7.05
  HP zx6000, Quadrics       7.70                    76.41             15.88

Sixty-four-Compute Slice Hardware Costs

At 64 compute slices, the cluster now requires two compute racks for 1U compute slices and four compute racks for 2U systems. Sixty-four HSI ports do not fill either the Myrinet or the Quadrics switch chassis. The cost breakdown is shown in Figure 17-6.

Figure 17-6. Hardware component cost for a 64-compute slice cluster

Because the HSI chassis do not require expansion, the overall percentage contribution to the cluster cost stays in a range similar to the 32-compute slice cluster. The cost of the compute slices dwarfs the infrastructure and HSI expenses (Table 17-4).

Table 17-4. Sixty-four-Compute Slice Hardware Cost Percentages

  System and HSI Type      Infrastructure Cost, %   Compute Cost, %   HSI Cost, %
  HP DL-360 G3, Myrinet     8.54                    72.62             18.84
  HP DL-360 G3, Quadrics    8.01                    56.91             35.08
  HP rx2600, Myrinet        4.73                    90.40              4.87
  HP rx2600, Quadrics       4.43                    84.72             10.84
  HP zx6000, Myrinet        5.50                    87.49              7.01
  HP zx6000, Quadrics       5.01                    79.80             15.19

One Hundred Twenty-eight-Compute Slice Hardware Costs

Our final example hardware configuration is 128 compute slices (Figure 17-7). This cluster requires four compute racks for 1U compute slices and eight racks for 2U systems. The number of interrack cables begins to grow and there may be a need for additional management racks to hold HSI or other switch equipment.

Figure 17-7. Hardware component cost for a 128-compute slice cluster

With 128 compute slices in the cluster, the HSI switch chassis for both Myrinet and Quadrics are completely full. Expanding the HSI beyond this point requires additional levels of switches to maintain the required connectivity and low latency, and a specialized HSI begins to add substantial cost to the cluster. The cost percentages are shown in Table 17-5.

Table 17-5. One Hundred Twenty-eight-Compute Slice Hardware Cost Percentages

  System and HSI Type      Infrastructure Cost, %   Compute Cost, %   HSI Cost, %
  HP DL-360 G3, Myrinet     7.90                    73.46             18.65
  HP DL-360 G3, Quadrics    5.75                    53.29             40.96
  HP rx2600, Myrinet        3.31                    94.15              2.54
  HP rx2600, Quadrics       2.95                    83.70             13.36
  HP zx6000, Myrinet        3.99                    89.03              6.98
  HP zx6000, Quadrics       3.50                    77.99             18.51

As we might expect, the majority of the cost is still in the compute slice hardware, ranging from a low of 74% to a high of 94%. In the clusters with a specialized HSI, the HSI cost percentage ranges from a low of 3% to a high of 41%. The low value comes from the lower cost IA-32 systems with Myrinet, and the high value from the more expensive Itanium 2 systems with Quadrics. The performance and latency that you require will affect the hardware cost relationships in your cluster.

The Land beyond 128 Compute Slices

I mentioned earlier in this chapter that there is a point at which most cluster builders expect the complexity to become very difficult to manage. Certainly, if you look at the cost breakdown for the 128-compute slice cluster and compare it with the smaller clusters you can see that the hardware costs are following an exponential curve as the cluster gets larger. This is shown in Figure 17-8.

Figure 17-8. Growth of total cluster hardware costs (list prices)

Part of this exponential rise in cost comes from the fact that the cluster sizes are powers of two, so each step doubles the number of compute slices. Some of the cost increase, however, is the result of the limits on the number of available ports on the HSI and Ethernet switches. When one chassis is filled, it takes multiple identical switches (in the case of Ethernet) or multiple layers of switches (in the case of the HSI) to provide the necessary connections and to maintain bandwidth.

I am not considering software complexity here, but that tends to rise exponentially with the number of interconnected systems. Along with the software and hardware costs, the amount of labor required to build the cluster grows in a similar fashion. Taken together, these factors all seem to indicate that the next major step, to 256 compute slices, is the point where organizations with limited time and budget will have to start making trade-offs in cluster size and hardware selection.

Buying the cheapest possible “white box” system for your compute slices may not be the correct answer to the cost increases. (A “white box” system is brandless hardware from cottage-industry manufacturers with low operating overheads, which can therefore offer the lowest possible prices.) The systems must be manageable, must be properly designed for the thermal conditions inside a rack, must have chip sets that deliver adequate memory and CPU performance, and, finally, the company manufacturing the systems has to stay in business long enough to support your cluster's hardware. Initial hardware price is not the only consideration for your cluster's success.

Another way to look at the compute slice expenses is in terms of cost per GFLOP for scientific and engineering clusters or in terms of cost per transaction for commercial application clusters. Many of the industry-standard transaction benchmarks also include the cost per transaction in the reporting. An example of the cost per GFLOP is shown in Figure 17-9.
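
The calculation itself is just the total hardware cost divided by the delivered capability, and it is worth derating the theoretical peak by an expected scaling factor rather than dividing by peak alone. A minimal sketch with purely illustrative numbers:

def cost_per_gflop(total_hw_cost, peak_gflops, scaling_efficiency=0.75):
    """Cost per delivered GFLOP, derating peak by a parallel-scaling factor."""
    return total_hw_cost / (peak_gflops * scaling_efficiency)

# Illustrative only: a $400,000 cluster with a 384 GFLOP theoretical peak.
print(f"${cost_per_gflop(400_000, 384):,.0f} per delivered GFLOP")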

Figure 17-9. Cost per GFLOP based on cluster size

Hardware Cost Trends and Analysis

The preceding sections presented the cost breakdown percentages of several cluster configurations, from 8 to 128 compute slices. It is possible to analyze the hardware costs in terms of “cost per system” for racking hardware, or “cost per port” for networking equipment. An example of cost-per-system calculations for the Hewlett-Packard Proliant racks is shown in Figure 17-10, broken down by 1U and 2U compute slice.

Figure 17-10. Hewlett-Packard Proliant rack costs per system (list prices)

The increase in price at the 2U 16-system point is the result of the need for two racks instead of one. For configurations that are small enough, rack sizes other than the 42U racks are possible. This is reflected in the expenses for the smaller 1U and 2U clusters.
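
The step behavior in the per-system rack cost is again a matter of whole racks being divided across the node count, as in the hedged sketch below. The per-rack capacity and rack price are placeholders, because the real capacity depends on what else (switches, power gear) shares the rack.

import math

def rack_cost_per_system(nodes, slices_per_rack, rack_price=3_500):
    """Racking cost spread across the compute slices the racks hold.

    slices_per_rack and rack_price are placeholders; real per-rack capacity
    is lower than raw rack height divided by slice height, because switches
    and other equipment share the rack.
    """
    racks = math.ceil(nodes / slices_per_rack)
    return racks * rack_price / nodes

# 2U compute slices, assuming roughly 16 fit in each compute rack:
for n in (8, 16, 32, 64, 128):
    print(n, round(rack_cost_per_system(n, slices_per_rack=16), 2))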

For Ethernet data and management LAN costs, see Figure 17-11, which details the per-system cost for a GbE data LAN and a 10/100 management LAN. All equipment used was Hewlett-Packard ProCurve network switches.

Figure 17-11. Per system cost for data and management LANs (list price)

Total cost for these LANs includes the required switches in the individual compute racks and the core switching equipment in the master rack. Note that in Figure 17-11, the 32-compute slice example with 1U systems can use a single switch with a higher port count, because there is only one compute rack; the 32-compute slice cluster with 2U systems has two compute racks.

The number of unused or “wasted” ports is calculated into the total cost. A very useful feature of the compute slices is their multiple built-in LAN interfaces, which reduce cost by eliminating the need for added PCI interface cards. In systems with limited PCI connectivity, this also leaves room for the HSI interface card.

When choosing an HSI technology, the per-port cost is a useful way of looking solely at price differences. The per-port costs for Myrinet and Quadrics are shown in Figure 17-12 for the 8-, 16-, 32-, 64-, and 128-compute slice clusters. The total cluster cost of the HSI technologies goes up exponentially with the number of compute slices, because they try to maintain complete connectivity and constant latency across the network.

Figure 17-12. Per-port HSI costs (list prices)

At 128 ports of either Myrinet or Quadrics, the first chassis is completely filled, with all ports in use. To expand the network beyond this point requires additional levels of switch chassis, which increases the HSI cost considerably. This would be reflected in an increase in the total cost per port.
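
A rough per-port model for a specialized HSI has three pieces: an interface card and a cable for every compute slice, plus the switch chassis shared across all ports. The sketch below uses placeholder prices, not Myrinet or Quadrics list prices, and its single-tier chassis model understates the jump that the extra switching levels cause beyond 128 ports.

import math

def hsi_cost_per_port(nodes, nic_price, cable_price, chassis_price,
                      ports_per_chassis=128):
    """Single-tier model: per-node NIC and cable, plus shared switch chassis."""
    chassis = math.ceil(nodes / ports_per_chassis)
    total = nodes * (nic_price + cable_price) + chassis * chassis_price
    return total / nodes

for n in (8, 32, 128):
    print(n, round(hsi_cost_per_port(n, nic_price=1_000, cable_price=150,
                                     chassis_price=30_000)))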

Cluster Economics Summary

In this chapter we analyzed a limited set of hardware across a limited set of cluster sizes. Additionally, we used the list prices for all of the hardware, although there are usually quantity or “special” discounts available from the hardware vendors. Dealing with “street price” for the components, however, is very tricky, because the street price varies with the source, situation, and vendor relationship.

By using list prices, we have been able to draw some conclusions to use as a baseline for making hardware configuration and purchasing projections, even if discounts are involved. It is likely that different components will be discounted at different rates in a real-world situation (for example, RAM, disk, system processing units, networking equipment, and HSI equipment).

To account for all the permutations here is beyond the scope of our discussion. To maintain some level of sanity in your situation, use the published list prices as a starting point, but allow for discount percentages in whatever hardware purchasing models you build. You can then adjust total costs as the quotations come in from your vendors.
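
One way to follow that advice is to carry a discount rate for each component category in your purchasing model, starting from list prices and adjusting as real quotations arrive. A minimal sketch, with purely illustrative prices and discount rates:

# Illustrative list prices and discount rates only, not vendor figures.
list_prices = {"compute": 350_000, "hsi": 60_000, "infrastructure": 40_000}
discounts = {"compute": 0.15, "hsi": 0.05, "infrastructure": 0.10}

def projected_cost(prices, rates):
    """Apply a per-category discount percentage to list prices."""
    return sum(price * (1.0 - rates.get(category, 0.0))
               for category, price in prices.items())

print(f"${projected_cost(list_prices, discounts):,.0f} projected after discounts")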

Based on our tables, graphs, and observations, we have a somewhat better idea of what to expect for hardware costs in a given-size cluster:

  • Constantly increasing hardware cost, with the total percentage contribution from compute slices increasing as the cluster grows, as shown in Figure 17-8

  • Decreasing cost per GFLOP or transaction, approaching a constant value, as shown in Figure 17-9, provided scaling is linear across the cluster

  • Increased complexity, associated with the overall size of the cluster and the number of interconnections between active elements

  • Infrastructure costs that increase in step with the cluster size, but decrease toward a constant value on a per-system basis

Taking all these observations into account, it is difficult to see how the automatic impression that “clusters are cheaper” has taken hold, until you consider the cost of a similar-sized SMP server, as shown in Table 17-6.

Table 17-6. Comparison of Four-Way SMP Server versus Two-Way Compute Slices

  Qty   System Description                              Number of CPUs   Total RAM, GB   Total Cost[a]
  2     HP Proliant DL-360 G3, 3.2-GHz Pentium 4 Xeon   4                 8              $14,494.00
  1     HP Proliant DL-580, 2.8-GHz Pentium 4 Xeon      4                 8              $28,046.00
  2     HP rx6000, 9 GB RAM                             4                18              $63,020.00
  1     HP rx5670, 17 GB RAM                            4                17              $83,694.00

  [a] All prices are US list price as of January 31, 2004.

The costs for large SMP systems escalate with memory capacity and total number of available CPUs. By the time an SMP server can support 64 CPUs, it must have an internal crossbar and possibly a special interconnect. Using smaller SMP systems to build a cluster, then, can improve performance and reduce cost, provided the application you are using supports a cluster environment.

Figure 17-13 shows a hardware-only comparison between large SMP configurations and equivalent compute slice-only hardware costs for a number of standard CPU configurations. All systems were configured with 4 GB of RAM per CPU. Prices are based on US list price as of May 14, 2004, and will change. Before making any assumptions about your particular situation, you should seek updated information.

Figure 17-13. Comparison of large SMP configurations and cluster hardware

There are several labels on the graph to notice. The leftmost dotted line marks the largest SMP configuration available for an Itanium 2 processor at the time of this writing. The rightmost dotted line marks the largest SMP configuration available for a Hewlett-Packard PA-RISC processor, the PA-8800. This processor has a dual-core design that puts two CPUs in each module, which allows the system's CPU capacity to be expanded without a hardware design refresh.

The bottom two lines in Figure 17-13 show a compute slice configuration with equivalent numbers of CPUs built with dual-CPU systems: one line representing Itanium 2 and the other, IA-32 processors. The arrow to the right indicates the ability of the cluster solutions to scale the number of processors beyond the limits of the SMP configuration. For applications that can take advantage of this scaling, there is a potential cost advantage and it is possible to scale out beyond 128 processors, something the SMP configurations cannot do.

The commodity IA-32 processor also suffers from scaling issues in SMP configurations. The line representing the IA-32 SMP configurations stops at eight processors because that is the largest configuration I could find in the configuration guides. To scale the IA-32 solution, one has to adopt a cluster approach.

As a final note, the price scale on the left of the graph is logarithmic, so be sure to take that into account when reading the prices from the graph. This comparison is an artificial exercise, but it does illustrate the potential scaling and hardware cost savings provided by cluster configurations. A realistic model would take all aspects of the final solution into account, but such a model is beyond the scope of this discussion.
