Chapter 4. Any Way You Slice It: Work and Master Nodes in a Cluster

Chapter Objectives

  • Examine the characteristics of “worker nodes” or “compute slices” in a cluster

  • Discuss criteria for choosing a compute slice

  • Describe several example system architectures[1]

  • Analyze expected performance and potential bottlenecks in a given system

The individual systems in a cluster that actually perform parallel work may be viewed as “slices” of a larger computer. Because each of these systems is usually in an individual package that includes a motherboard, CPUs, RAM, internal disks, and other components, it is important to consider what characteristics make a particular system suitable for use in a given cluster. This entails understanding underlying hardware “compute slices” and their architecture, and then being able to apply that knowledge to the cluster applications.

Criteria for Selecting Compute Slices

There is a wide range of systems available that are suitable for use as compute slices in a cluster. Which system you select depends on your budget, the desired characteristics for your cluster, and the applications it will run. There are a number of system characteristics that should be considered when selecting the proper vendor and model for your cluster's compute slices:

  • Total cost per individual system

  • Number and type of CPUs

  • CPU floating-point and integer performance characteristics

  • Transaction processing capabilities

  • Maximum supported RAM

  • Error check/correct (ECC) memory

  • 32-bit versus 64-bit memory addressing capabilities

  • Memory to CPU bandwidth

  • Memory to I/O bandwidth

  • Internal disk subsystem performance

  • Hot-plug disk support

  • Number of available I/O slots

  • Presence of serial port for console management

  • Built-in system management capabilities (remote power on/off)

  • Power and heat ratings

  • Built-in LAN capabilities

  • Operating system support for hardware and accessories

  • Overall system design quality

These characteristics may be grouped into performance-related, capacity-related, quality-related, and management-related groups. Which sets are more important depends on your application performance expectations, budget, cluster availability, and reliability requirements for the whole cluster.

An Example Compute Slice from Hewlett-Packard

The example clusters shown in Figure 3-1 and Figure 3-2 are built with compute slices from Hewlett-Packard: the Proliant model DL-360g3 server. As of this writing, this system is available with dual Intel 2.4/2.8/3.06/3.2 GHz Pentium 4 Xeon processors (IA-32, or “Intel architecture-32,” meaning 32 bits) with 1-MB level-three cache, and up to eight gigabytes of system RAM in a 1 EIA unit (1U) package. (See http://www.hp.com for more information on this product family.)

Understanding the capabilities of the compute slices in your cluster includes taking a close look at the packaging and built-in connections. In the case of the DL-360, which is shown in Figure 4-1, there are two integrated 10/100/1000base-TX network connections, a serial port, either PS/2 or USB keyboard and mouse connections, and an integrated ATI Rage XL graphics controller with VGA graphics output. (VGA is a legacy output format that is supported by most Windows- and Linux-compatible hardware platforms. It allows you to treat the graphics device as an 80-column by 24-row character device with character attributes like color, underlining, inverse video, and so forth.) Two hot-plug disk bays are attached to an internal redundant array of independent disks (RAID) controller (capable of RAID mode 0 or 1—striped or mirrored—operation) with a 64-MB cache. The disk bays may contain two 18/36/73/146-GB disk drives.

Figure 4-1. Hewlett-Packard Proliant DL-360 front and back view

There are two PCI interface slots available; one is occupied if the optional redundant power supply is installed. The interface slots are PCI-X compatible, which enables 133-MHz, 64-bit I/O with burst rates of more than one gigabyte per second, and they are backward compatible with older PCI standards. It is important to look at the system block diagram to see where bottlenecks might exist in the I/O path.

A very useful feature is the integrated lights-out (iLO) management port, which allows connecting the system to an out-of-band management network for monitoring and management. This enables a secure, text-based console and remote power on and off, along with BIOS upgrades with the associated software tool. Most hardware vendors have some models with integrated management capability, and the software to do basic operations may be free. You need to check on the availability of Linux management software and Linux agents to enable full functionality for your systems. In the case of Hewlett-Packard, the software package for the Proliant server family is called “Insight Manager.” A basic feature that you should look for in any package is the ability to perform unattended BIOS upgrades via the management tool. This can save an immense amount of time, even if you have to use a proprietary software package.

The eight gigabytes of main memory are PC2100 ECC 266-MHz double data rate (DDR) synchronous dynamic random access memory (SDRAM), interleaved 2:1. Interleaving is a very important feature that can increase the bandwidth of memory access and relieve potential conflicts between multiple processors on the same memory bus. In this case, the bus between memory and the processors is called the front-side bus (FSB), and it runs at 533 MHz. Most modern processors access memory in units of a cache line. The size of the cache line varies by processor type and system design. Memory accesses to sequential cache lines are spread across two independent memory channels.
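
To see why the designers matched two memory channels to this FSB, you can compare the peak bandwidths directly. The following back-of-the-envelope sketch (in Python) assumes 64-bit (8-byte) data paths per memory channel and treats the quoted 533-MHz and 266-MHz figures as effective transfer rates; it is an illustration, not a vendor specification.

    # Rough comparison of front-side bus and interleaved memory bandwidth for
    # the example compute slice. Widths and rates are assumptions noted above.

    def peak_gb_per_s(effective_mhz, bytes_wide):
        """Peak bandwidth in GB/s for a bus moving bytes_wide on each transfer."""
        return effective_mhz * 1e6 * bytes_wide / 1e9

    fsb = peak_gb_per_s(533, 8)           # 533 MT/s front-side bus, 8 bytes wide
    memory = 2 * peak_gb_per_s(266, 8)    # two interleaved channels of PC2100 DDR

    print("FSB peak:    %.1f GB/s" % fsb)       # about 4.3 GB/s
    print("Memory peak: %.1f GB/s" % memory)    # about 4.3 GB/s, matched to the FSB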

TIP

Memory bandwidth is an important feature for performance, no matter what the number of processors in the compute slice or whether it is a technical or commercial cluster. Memory performance becomes increasingly important as the speed and number of CPUs, or the amount of floating-point computations, increases.

The single power supply is rated at 325 W. In evaluating the power utilization for a given system from any manufacturer, you need to be careful with the published specifications. The rating may be the absolute maximum for the power supply input, it could be a minimal configuration of RAM and disk, it could be for a fully loaded system, or it could just plain be wrong. It is very likely that you will get conflicting and inconsistent information. If possible, measure the actual power utilization of your final configuration under full load. (Strange as it may seem, a CPU that is performing intensive floating-point calculations may run hotter than one that is running the operating system's idle loop. I swear I am not kidding.)

This is not intended as an advertisement for this particular system, only an example of the kinds of specifications you are likely to wade through in choosing the proper compute slice for your cluster. This system is intended for use as a small to medium-sized business server, so it has features that you may not need (the RAID controller, for example), which add to the complexity and cost. Remember that cheapest is not always best. It is a false sense of savings, for example, to buy the cheapest systems if they frequently fail because they were not designed to withstand the extra heat of a racked configuration.

Analysis of the Example Compute Slice

Let's take a quick look at a schematic of the “guts” of a dual-processor Intel Pentium Xeon server. This is a fictitious design, based on architectural information on the Broadcom ServerWorks GC-LE dual-processor server chip set and the Hewlett-Packard specifications for the Proliant Model DL-360. It is only an approximation of what is really inside the system on the motherboard, so don't hold me to the details too closely. For the sake of this analysis, let's just assume that we started by looking at this system because we were familiar with the particular vendor.

Figure 4-2 details the major components and buses that tie them together. You should notice the similarity between this figure and the generalized SMP system architecture pictured in Figure 1-2. Looking at the internal design of a system in this way may be instructive. It can help gauge expected performance of the compute slice or explain observations about performance characteristics.

Figure 4-2. Dual Intel Pentium Xeon server architecture

Whether you want raw floating-point performance for scientific computing, or high levels of integer performance and I/O for transaction processing, there are common, desirable design characteristics. I typically look at the design of the system components and interconnects, examining the following.

  • Are there any obvious bottlenecks?

  • What is the CPU-to-memory bandwidth?

  • What is the disk-to-memory bandwidth?

  • What is the I/O-to-memory bandwidth?

  • What is the maximum memory capacity and controller configuration?

  • How many I/O slots are available in the system?

  • What is the network I/O capacity?

When examining this system, you can see that there is a lot of capability built into a 1U package. There are no obvious problems with the configuration (such as bottlenecks between the network and RAM). The next step is to begin characterizing the system's performance and compare it with alternatives. This is unfortunately easier to do for technical applications than for commercial processing.

One measure of system floating-point ability is the “theoretical peak” floating-point performance. This is simply calculated by multiplying the CPU clock rate by the number of floating-point operations that may be completed per cycle, and by the number of CPUs in the system. It sets an upper bound on what is possible for the CPUs in the system and allows us to determine the efficiency of the system implementation. For our example, a dual 3.2-GHz Pentium Xeon compute slice, the calculation for theoretical peak floating-point operations per second (FLOPS) is

2 CPUs × 3.2 GHz × 2 floating-point operations per cycle = 12.8 GFLOPS

In reality, however, the system performance on benchmarks such as Linpack,[2] which is a measure of CPU-bound floating-point performance, is likely to yield somewhere around 8.53 GFLOPs. Why? The location of the potential bottleneck is left as an exercise for you, the reader, but a good guess is probably the memory subsystem.[3] In other types of benchmarks, such as Web serving or transaction processing, the likely bottlenecks are elsewhere in the system.

A similar calculation for the Hewlett-Packard dual 1.5-GHz Itanium 2 processor system shown in Figure 4-3 is

2 CPUs × 1.5 GHz × 4 floating-point operations per cycle = 12.0 GFLOPS

Figure 4-3. Dual Itanium 2 6M system architecture

The system in question, a Hewlett-Packard rx2600 (similar to the workstation version, the zx6000) examined in the next section, yields 11.42 GFLOPS in Linpack benchmarks, close to its theoretical peak. This is because the memory is multiple banks of interleaved dual in-line memory modules (DIMMs), with a bandwidth of 6.4 GB/s to each bank.

The theoretical peak calculation for the dual-processor 2.0-GHz Opteron system is

2 CPUs × 2.0 GHz × 2 floating-point operations per cycle = 8.0 GFLOPS

This system is interesting because each CPU has its own integrated memory controller and local RAM. A system with RAM distributed in this manner has a penalty to be paid for accessing non-local RAM on another processor. Taking full advantage of this architecture would seem to require an operating system that is ccNUMA capable. The fact is, the details of memory location are hidden by the hardware of the memory controller. The question to consider is: Just what is the system memory bandwidth for a hardware configuration like the system shown in Figure 4-4?

Figure 4-4. Theoretical dual AMD Opteron system architecture

The “it depends” answer applies to this situation. Note that bandwidth to local processor memory is 5.3 GB/s across two channels, and bandwidth to nonlocal processor memory is 6.4 GB/s. Looking closer, you can see that the 6.4 GB/s is actually divided between separate read and write channels between the processors. So, for a problem that fits into the local RAM, you can expect local bandwidth and latency, and for problems that span local and remote RAM, you can expect a mixture of local and remote performance characteristics. Indications are that the architecture scales well to at least four processors. (Because this configuration is so new, I could not find any four-processor systems using the Opteron processor. They are sure to appear two days after this book is published.)

As we have seen, looking at the system hardware diagrams can sometimes make a difference when it comes to understanding benchmark numbers. Note that the theoretical peak performances for the Itanium and Xeon systems are roughly equivalent, even though the Itanium system operates at less than half of the clock rate of the Xeon system. Clock rates alone are not a good performance comparison point. The other surprise is the memory configuration of the Opteron system.
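
The theoretical peak figures used above can be reproduced with a few lines of arithmetic. The sketch below assumes two double-precision floating-point operations per clock for the Xeon and Opteron and four per clock for the Itanium 2 (two fused multiply-add units); these per-cycle counts are architectural assumptions you should verify against the processor documentation for your own candidates.

    # Theoretical peak = number of CPUs x clock rate x floating-point
    # operations completed per clock cycle.

    def peak_gflops(cpus, clock_ghz, flops_per_cycle):
        return cpus * clock_ghz * flops_per_cycle

    examples = [
        # name, CPUs, clock (GHz), assumed FLOPs per cycle
        ("3.2-GHz Pentium 4 Xeon", 2, 3.2, 2),   # SSE2: 2 double-precision FLOPs/cycle
        ("1.5-GHz Itanium 2",      2, 1.5, 4),   # two FMA units: 4 FLOPs/cycle
        ("2.0-GHz Opteron 246",    2, 2.0, 2),
    ]

    for name, cpus, ghz, fpc in examples:
        print("%-24s %5.1f GFLOPS theoretical peak" % (name, peak_gflops(cpus, ghz, fpc)))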

Comparing the Example Compute Slice with Similar Systems

One way of comparing systems is to examine the performance characteristics that are important to your application. The major categories of performance are integer and floating point when looking at a particular CPU and system configuration. Some representative numbers for the systems we are examining are shown in Table 4-1. We need to make sure that the numbers we are examining are “apples-to-apples” comparisons, and that they have something to do with the real world.

Table 4-1. Price and Single-CPU Performance for Three Compute Slices

System Type                                       Base Price[a]   Final Price   SPECint_2000[b]   SPECfp_2000
Hewlett-Packard rx2600, 1.5-GHz Itanium 2 6M      $5,730          $22,990       1322              2119
Hewlett-Packard DL-360g3, 3.2-GHz Pentium Xeon    $4,448          $7,546        1319              1197
IBM eServer 325, 2.0-GHz AMD Opteron              $5,959          $8,746        1226              1231

[a] Base and final prices are U.S. list prices from the respective manufacturer's Web site as of December 30, 2004.

[b] All SPEC values are from the SPEC Web site as of December 30, 2004.

A very important parameter in the choice of compute slices is total system cost. After all, you are building a cluster from many individual systems (tens to hundreds). Three individual systems were configured and priced: a 2U Hewlett-Packard dual 1.5-GHz Intel Itanium 2 6M, a 1U Hewlett-Packard dual 3.2-GHz Intel Pentium 4 Xeon, and a 1U IBM dual 2.0-GHz AMD Opteron 246 system. Each system was configured with dual 36-GB disks, 4 GB of RAM, two processors, and a RAID controller (if available). The relative floating-point and integer performance is shown in Figure 4-5.

Figure 4-5. Performance for three dual-processor compute slices

The system list prices[4] are shown in Table 4-1, along with single-CPU performance numbers from the Standard Performance Evaluation Corporation (SPEC) CPU2000 benchmark.[5] The SPECint_2000 and SPECfp_2000 benchmarks are subcomponents of the SPEC CPU2000 benchmark.

The SPEC CPU2000 suite includes the CINT2000 and CFP2000 benchmarks, which produce the single-CPU metrics (SPECint_2000 and SPECfp_2000, respectively) and the multi-CPU throughput metrics (SPECint_rate_2000 and SPECfp_rate_2000). The CINT2000 benchmark is a composite of 12 individual application benchmarks and the CFP2000 benchmark comprises 14. (Just to make things more interesting, there are also “base” benchmarks, such as SPECfp_base2000 and SPECfp_rate_base2000, that have stringent rules on compilation options and other environmental factors. Numbers are reviewed by the SPEC organization before they are published.) Multi-CPU versions of the benchmarks are presented in Table 4-2. Multi-CPU benchmarks run multiple copies of the individual applications and measure the relative system scaling.

Table 4-2. Multiple CPU Performance for Three Compute Slices

System Type                                       SPECint_rate_2000   SPECfp_rate_2000
Hewlett-Packard rx2600, 1.5-GHz Itanium 2 6M      30.5                42.4
Hewlett-Packard DL-360g3, 3.2-GHz Pentium Xeon    28.2                14.0
IBM eServer 325, 2.0-GHz AMD Opteron              27.0                27.5

With prices in hand, we may calculate price-to-performance ratios for the three compute slices to compare their relative costs. This ratio represents how much you pay for a given unit of performance, either integer or floating point. The calculation also works for transaction processing and Web operations, and some industry benchmarks specify their results in dollars per transaction per second.
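
As an illustration of the arithmetic, the short sketch below divides the prices from Table 4-1 by the SPEC rate results from Table 4-2 to get dollars per unit of benchmark performance; it assumes the final (as-configured) prices are the ones you care about.

    # Price-to-performance: dollars paid per unit of SPEC rate result.
    # Prices are the final configured prices from Table 4-1; benchmark
    # values are the SPEC rate numbers from Table 4-2.

    systems = [
        # name, final price (US$), SPECint_rate_2000, SPECfp_rate_2000
        ("rx2600 Itanium 2 1.5 GHz", 22990, 30.5, 42.4),
        ("DL-360g3 Xeon 3.2 GHz",     7546, 28.2, 14.0),
        ("eServer 325 Opteron 2.0",   8746, 27.0, 27.5),
    ]

    for name, price, int_rate, fp_rate in systems:
        print("%-26s $%6.0f per int_rate unit, $%6.0f per fp_rate unit"
              % (name, price / int_rate, price / fp_rate))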

So how did our three example systems compare? As you can see in Table 4-2, the Itanium 2-based system from Hewlett-Packard wins in terms of both integer and floating-point performance, based on the SPECint_rate_2000 and SPECfp_rate_2000 benchmarks. The IBM eServer 325 Opteron 246-based system[6] outperforms the Hewlett-Packard Xeon-based system in terms of floating-point performance, but is almost identical in terms of integer performance. The integer performance of all three systems is pretty close, but there is a substantial difference in floating-point performance when the Opteron and Xeon are compared with the Itanium 2 system.

Figure 4-6 shows the price-to-performance ratios for the systems, based on list price and the SPECint_rate and SPECfp_rate benchmarks. Notice that the high cost of the Itanium system drives the cost per operation up for both floating-point and integer operations. There are some reasons for the extra cost in the Itanium system.

Figure 4-6. List price-to-performance ratio for three example compute slices

The Itanium's explicitly parallel instruction computing (EPIC) architecture is new and fundamentally different from reduced instruction set computing (RISC) designs, the system is 64-bit and 32-bit capable (as is the Opteron), and the maximum RAM capacity is 24 GB. The memory latency and bandwidth, along with the I/O capabilities, make the system the fastest compute slice available when coupled with the Quadrics HSI. There are specific applications and situations in which the Itanium systems are simply unparalleled (please excuse the pun). The price-to-performance ratio of the Intel Xeon system is lower than that of the AMD Opteron for integer performance, but is higher in terms of floating-point performance. The choice, based solely on performance or price-to-performance, is not clear-cut.

Example Clusters Using Our Compute Slices

Up to this point, we have taken a look at single systems and their architecture and performance characteristics. Before we jump into a choice of compute slice for our cluster, however, we need to compare some other factors that involve the full complement of compute slices. But just how many is that?

Let's say we have a requirement to provide one teraflop of floating-point performance in our cluster. How many compute slices would that be? Now we run into a little “issue” with benchmarks.

We have been comparing systems based on their SPEC performance characteristics, but the SPEC benchmarks don't map cleanly to teraflops. The Linpack benchmark, however, does map to teraflops, provided someone has run the benchmark for the particular processor and system you are evaluating. In our case, two of the systems did not have current Linpack numbers as of this writing.

Another factor to keep in mind is that benchmark and application scaling across the cluster is not perfect. The numbers in Table 4-3 are for single systems. To get one teraflop across the cluster, we would need to measure the actual performance and adjust the number of systems in the cluster. If 80% scaling across the cluster is good (80% scaling is darned good), then we would actually need 102 Itanium 2 systems, 144 Pentium Xeon systems, and 185 Opteron systems to meet our goal. This increases the size of the cluster to seven Itanium compute racks, five Pentium Xeon compute racks, and six Opteron compute racks.

Table 4-3. Calculations for a One-Teraflop Cluster

System Type                                       Theoretical Peak GFLOPS   2-CPU Linpack NxN GFLOPS   Number of Systems for One TFLOP[a]   List Price       Racks[b]
Hewlett-Packard rx2600, 1.5-GHz Itanium 2 6M      12.0                      11.83                      85                                   $1,954,150.00    6
Hewlett-Packard DL-360g3, 3.2-GHz Pentium Xeon    12.8                      8.38[c]                    119                                  $897,974.00      4
IBM eServer 325, 2.0-GHz AMD Opteron              8.0                       6.50[d]                    154                                  $1,346,884.00    4

[a] Rounded up to the nearest whole system.

[b] Compute racks, using 32 EIA units per 41U rack.

[c] Estimated from 3.06-GHz Linpack numbers.

[d] Estimated from 81% of 8 GFLOPS.
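
A minimal sizing sketch along these lines is shown below. The Linpack numbers come from Table 4-3, the 20% margin mirrors the 80% scaling assumption discussed above, and the rack packing assumes 32 EIA units of compute per 41U rack; counts may differ by a system or two from the table depending on how the rounding is done.

    import math

    # Rough cluster sizing: systems needed for one TFLOP of Linpack
    # performance, with a margin added for imperfect scaling.

    TARGET_GFLOPS = 1000.0    # one teraflop
    SCALING_MARGIN = 1.20     # add 20% more systems for ~80% cluster scaling
    RACK_UNITS = 32           # usable EIA units of compute per 41U rack

    systems = [
        # name, 2-CPU Linpack GFLOPS, EIA units per system
        ("rx2600 Itanium 2", 11.83, 2),
        ("DL-360g3 Xeon",     8.38, 1),
        ("eServer 325",       6.50, 1),
    ]

    for name, gflops, u in systems:
        ideal = math.ceil(TARGET_GFLOPS / gflops)       # perfect-scaling count
        adjusted = math.ceil(ideal * SCALING_MARGIN)    # with the scaling margin
        racks = math.ceil(adjusted * u / RACK_UNITS)
        print("%-18s %4d systems (%d ideal), %d compute racks"
              % (name, adjusted, ideal, racks))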

To make the best choice for a given cluster, we would have to examine other factors like the status of 64-bit applications, total memory capacity in the system, the cooling “budget” for the cluster's location, vendor price discounts, and integer versus floating-point performance requirements for real applications.

Thirty-two Bit and 64-Bit Compute Slices

It is important not to get confused about the 32-bit versus 64-bit functionality when considering compute slices. Many modern 32-bit systems and their processors already incorporate features that can be considered 64-bit in nature: wide internal data buses, wide registers, and extra physical memory address lines. Many of these features have been in 32-bit hardware for years now. We need to pay close attention to what is being discussed when “32-bit” and “64-bit” are used as adjectives.

Physical RAM Addressing

If you go back to Figure 4-2, you will notice that the two 32-bit Intel Pentium Xeon processors in the system are capable of addressing a total of eight gigabytes of physical RAM. For this system to be an SMP system, the processors need to share all eight gigabytes of physical system RAM uniformly (in other words, have a uniform addressing scheme). This requires at least 33 hardware address bits on the memory bus for both processors, because the maximum value of a 32-bit binary number is 4,294,967,295 (2^32 minus 1), which limits a 32-bit address to four gigabytes.

Intel 32-bit processors incorporate physical address extension (PAE) features to allow the system hardware to address up to 64 GB of physical RAM. This means that there may be up to 36 bits in the physical RAM address used by the processors, if the extra lines are implemented by the hardware designers on the system's motherboard. This hardware feature allows a system's hardware to accommodate more than the four gigabytes of physical RAM that would be possible with 32 address bits.

So, to provide more than the four gigabytes of physical RAM that are addressable by a 32-bit processor, the underlying processors and chip set must support a larger physical address with more than 32 address lines, and the hardware designers must take advantage of this by implementing the extra address lines. The operating system also must be able to manage the larger amount of physical RAM and properly parcel it out to 32-bit processes.

Even the two example compute slice systems that have 64-bit-capable processors (the Itanium 2 and the Opteron) do not fully implement the complete 64 possible hardware memory address lines, mainly for practical reasons of cost. The Opteron systems, for example, allow up to 40 bits of physical memory address (taking up space on the motherboard as circuit connections), which allows addressing up to one terabyte of physical RAM. This amount of RAM, even at today's prices, would be very expensive, and would fit into any system enclosure only with great difficulty. (This would be 512 DIMMs at two gigabytes each. That's a lot of memory slots on the motherboard!)

The 64-bit processors could address up to 16,384 (2^14) petabytes (a petabyte is 2^50 bytes) if the full 64 bits were used for physical RAM addresses, but because of physical constraints, hardware designers usually implement only a subset of the possible address bits. Once the hardware can address all the physical RAM, it is up to the operating system's virtual memory system to provide a “virtual address space” in which to execute a process.
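
The addressable-memory limits quoted here are just powers of two; a quick sketch for the address widths mentioned in this section:

    # Maximum addressable memory for various physical address widths.

    GB = 2 ** 30
    examples = [
        (32, "plain 32-bit addressing"),
        (36, "IA-32 with PAE"),
        (40, "Opteron physical addressing"),
        (64, "a full 64-bit address"),
    ]

    for bits, label in examples:
        size_gb = 2 ** bits / GB
        print("%2d address bits: %18.0f GB (%s)" % (bits, size_gb, label))

    # 32 bits = 4 GB, 36 bits = 64 GB, 40 bits = 1024 GB (1 TB),
    # 64 bits = 17,179,869,184 GB (16,384 PB)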

Process Virtual Address Space

It is the job of the operating system and its virtual memory subsystem to manage physical memory assigned to a given process. This is done by dividing physical memory into fixed-size “pages” that are allocated to processes when needed from a page pool in main memory. The operating system keeps track of which physical page is associated with a particular “virtual” page being used by a process and manages the system's page pool. Once this relationship is determined, the hardware keeps frequently used physical-to-virtual page translations in a set of registers called the translation lookaside buffer or TLB.

How large a program can get, in terms of code and data size, is determined by the maximum virtual address space available (32 or 64 bits) from the hardware and the operating system. The total number of processes that may run on the system simultaneously, without “bumping” another process' physical pages, is determined by the total size of physical RAM. Two processes running on a dual-processor system, each requiring four gigabytes of memory, will not interfere as long as the system has eight gigabytes of physical RAM or more (ignoring operating system memory requirements).

The IA-32 architecture uses four-kilobyte (2^12-byte) physical pages. With four gigabytes of physical RAM, this yields a total of 1,048,576 pages (or 2^20 pages). Notice that

  • 2^12 × 2^20 = 2^(12 + 20) = 2^32 bytes = 4 GB

An address consisting of 32 bits may be divided into a page number (20 bits) and an offset into that page (12 bits). The IA-64 architecture supports multiple page sizes, but the Linux kernel may be compiled to use one of 4, 8, 16, or 64 kilobyte pages.
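
On a running Linux system, you can query the page size and the number of physical pages directly. The sketch below assumes a Linux host where sysconf exposes SC_PAGE_SIZE and SC_PHYS_PAGES; it simply repeats the page arithmetic above with live values.

    import os

    # Report the page size and physical memory of the current Linux system.
    # (Assumes SC_PAGE_SIZE and SC_PHYS_PAGES are available via sysconf.)

    page_size = os.sysconf("SC_PAGE_SIZE")     # typically 4096 bytes on IA-32
    phys_pages = os.sysconf("SC_PHYS_PAGES")   # physical pages of RAM

    print("Page size:    %d bytes" % page_size)
    print("Physical RAM: %d pages = %.1f GB"
          % (phys_pages, phys_pages * page_size / 2.0 ** 30))

    # With 4-KB pages, a 4-GB address space holds 2**32 / 2**12 = 2**20 pages.
    print("4-KB pages in a 4-GB address space: %d" % (2 ** 32 // 2 ** 12))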

Not all pages in a given process' “virtual address space” need to be allocated or “mapped” to physical pages, but the total number of pages that a process can access in its virtual address space, fully mapped or not, differs between a 32-bit and a 64-bit process. A 32-bit process can address at most four gigabytes of virtual memory divided between instruction code and data. A 64-bit process may address multiple petabytes of virtual memory divided between code and data.

Figure 4-7 shows two situations: The first is a system with two four-gigabyte process address spaces sharing four gigabytes of physical RAM, and the second is the same two processes sharing 12 GB of physical RAM. In the first case, there are not enough physical pages available to satisfy the two processes completely, so virtual memory “paging” will occur. In the second case, the 12 GB of physical RAM completely fills the needs of both processes' address spaces, so some physical pages are left over. In both cases, the operating system pages mapped into the processes' address spaces are shared through the magic of virtual memory.

Figure 4-7. Processes sharing physical memory

If you are going to run multiple processes on your compute slices, then it is important to have enough physical RAM to allow the processes to coexist without “fighting” for physical pages. If the operating system and the virtual memory system have to share physical pages between processes, virtual memory “paging” occurs. Performance suffers greatly when a system begins paging.

Before your eyes glaze over completely, let's discuss why all this is important in choosing a compute slice for your cluster. If you are processing chunks or pieces of data that will never grow beyond the four-gigabyte limit imposed by 32-bit process virtual addresses,[7] then you don't need 64-bit hardware, 64-bit operating systems, or 64-bit applications. If, however, you are going to be working with extremely large data sets (beyond four gigabytes) on a per-process basis, then you have no option but to use a 64-bit application, operating system, and hardware or to subdivide the problem into smaller pieces.

Software Implications of 64-Bit Hardware

The hardware compute slice and the software applications you will run on it are not independent of each other. For example, both the Opteron and the Itanium processors in our example compute slices are designed to allow 32- and 64-bit applications to coexist. Just because the hardware is capable of this behavior does not mean that it will be available in your chosen environment. The operating system must also provide facilities to support simultaneous 32- and 64-bit operation.

But why would you want to be able to run both “flavors” of applications on the same hardware? There are a number of answers, including legacy code that cannot easily be ported to 64-bit operation. Making an application “64-bit clean” is not an easy task, particularly if the original programmers did not adhere to portability standards.

Several things happen in 64-bit code that do not occur “naturally” in 32-bit code.

  • Code becomes larger because of “immediate” values.

  • Data becomes larger, because data types like pointers become 64 bits instead of 32 bits.

  • The operating system must provide both 32- and 64-bit versions of libraries and interfaces, and applications must be linked against the proper libraries.

If you have complete control of your application and the source code, then it becomes your job to port to any new environment. If you use third-party applications, you must check to ensure that the application is available for the type of environment (and compute slice) you are choosing.

In addition to the issues we have discussed so far, there is the consideration of the programming model chosen by the operating system. In the integer-long-pointer 32-bit (ILP-32) model, the data types integer, long, and pointer are represented by 32-bit values. There are two competing models in the 64-bit world: integer-long-pointer 64-bit (ILP-64) and long-pointer 64-bit (LP-64). In the first, the integer, long, and pointer data types are all 64 bits in length. This tends to break code that was originally written to the ILP-32 model, because the integer type changes size from 32 to 64 bits.

The LP-64 model represents long and pointer data types as 64 bits, and leaves integers at 32 bits. Although this model does not guarantee trouble-free 64-bit porting efforts, it minimizes the number of changes made to the original 32-bit programming assumptions. Which hardware and operating system you choose for your compute slice will determine whether you have this issue.
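
You can check which data model your platform and toolchain use by looking at the sizes of the underlying C types. The sketch below uses Python's ctypes module as a stand-in for a small C program; on an LP-64 system it reports int at 4 bytes and long and pointer at 8 bytes, while an ILP-32 system reports 4 bytes for all three.

    import ctypes

    # Print the sizes of the C int, long, and pointer types for this platform.

    for name, ctype in [("int", ctypes.c_int),
                        ("long", ctypes.c_long),
                        ("pointer", ctypes.c_void_p)]:
        print("%-8s %d bytes" % (name, ctypes.sizeof(ctype)))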

As of this writing, the IA-64 Linux maintainers have decided that it is not necessary to use the hardware's ability to run 32-bit applications at the same time as 64-bit applications on Itanium. You will need to compile your code for 64 bits for it to perform well on this hardware. The Itanium hardware allows execution of 32-bit IA-32 (Pentium III) code, and Linux provides this functionality only to a limited extent. A good deal of performance is given up in the process, and code compiled for later Intel processors may not run under this facility. This attitude may change in the future; we will see. We do not discuss the software issues further here.

Memory Bandwidth

Commodity systems are based on widely available chip sets and other standardized parts, and RAM is no exception. There is a wide range of manufacturers and sources of system RAM, but the type and speed of the RAM will dictate its performance to a large extent. In addition to a plethora of types and speeds, there are different packages: dual in-line memory modules (DIMMs), single in-line memory modules (SIMMs), ad infinitum. We won't go any deeper into memory terminology, because we need to stay awake and focused.

For purposes of discussion, the type of RAM we have been encountering in our example compute slices is ECC DDR SDRAM.

The ECC designation means that the RAM has extra bits available that are used to detect and correct single-bit errors, and to detect multiple-bit errors in the RAM. This is an essential feature of RAM that is to be used in production systems. We would not want a stray bit error to cause improper results in the cluster's calculations. The message is: Don't buy systems without ECC RAM. Most of the example compute slices have RAM buses that are 128 bits wide for data outside the RAM controller, but have a total of 144 bits between the DIMMs and the controller. The extra 16 bits are being used for ECC check bits by the controller.

The SDRAM uses a clock (the S stands for synchronous, meaning in time with a clock signal) to determine valid times to read data from and write data to the storage devices in the RAM chips. DDR indicates that data may be moved at two times the clock frequency, as opposed to “single data rate” RAM.

To understand the designators for the RAM we are using in our compute slices, examine the following calculations and then look at Table 4-4:

200 MHz × 64 bits ÷ 8 bits per byte = 1600 MB/s

Notice that the data rate, 1600 MB/s, matches the numeric part of the “PC1600” designator. But this is double data rate RAM, so the calculation is really

100 MHz × 2 transfers per clock × 64 bits ÷ 8 bits per byte = 1600 MB/s

See Table 4-4 for more calculations—and you thought this was mysterious, right? So did I, until I started researching this book. Well, before we get overconfident, notice the rounding that takes place with PC2100 and PC2700 RAM. Oh, well, it was the thought that counts.

Table 4-4. Dual-Channel DDR SDRAM Peak Performance

RAM Designator   Effective Clock Rate   Data Width[a]   Data Rate   Peak Bandwidth
PC100            100 MHz                64 bits         SDR         800 MB/s
PC133            133 MHz                64 bits         SDR         1.1 GB/s
PC1600           200 MHz                128 bits        DDR         3.2 GB/s
PC2100           266 MHz                128 bits        DDR         4.2 GB/s
PC2700           333 MHz                128 bits        DDR         5.3 GB/s

[a] The data width used here does not include ECC bits.
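
The peak bandwidth column in Table 4-4 follows from the same arithmetic used above. The sketch below treats the listed clock rates as effective (post-DDR) transfer rates and assumes a 64-bit single channel for the SDR entries and a 128-bit dual channel for the DDR entries.

    # Peak memory bandwidth = effective transfers per second x data width.
    # Values below mirror Table 4-4 (ECC bits excluded from the width).

    modules = [
        # designator, effective rate (MHz), data width (bits)
        ("PC100",  100,  64),
        ("PC133",  133,  64),
        ("PC1600", 200, 128),
        ("PC2100", 266, 128),
        ("PC2700", 333, 128),
    ]

    for name, rate_mhz, width_bits in modules:
        bytes_per_second = rate_mhz * 1e6 * (width_bits // 8)
        print("%-7s %4.1f GB/s peak" % (name, bytes_per_second / 1e9))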

Memory and Cache Latency

Memory latency is another factor we need to look at if we are going to be performing lots of memory-intensive computations with our compute slices. An extra nanosecond here and there for each memory access tends to add up when you are performing billions of calculations a second over long periods of time. Although memory latency may be less critical for commercial computing, it is still important for the overall performance of the system.

Everything, of course, starts with main memory. Information like that covered in the last section can help you understand the overall memory architecture and the bandwidth available to supply the processor caches. Some example main memory latencies are supplied in Table 4-5, and some specific values for the Itanium 2 processor (1.5 GHz) are shown in Table 4-6.

Table 4-5. Main Memory Latency Examples

System Type                                      Main Memory Latency
Two-processor 1.5-GHz Itanium 2 6M (HP zx1)      110 ns
Two-processor 3.2-GHz Pentium Xeon               243 ns
Two-processor 2.0-GHz Opteron 246 (1 “hop”)      89 ns
Two-processor 2.0-GHz Opteron 246 (2 “hops”)     115 ns

Table 4-6. Itanium 2 1.5-GHz Cache Latencies[a]

Description                          Latency in Cycles   Latency in Nanoseconds[b]
Level 1 instruction and data cache   1 cycle             0.67
Level 2 integer                      5 cycles            3.33
Level 2 floating point               6 cycles            4.00
Level 3 integer                      12 cycles           8.00
Level 3 floating point               13 cycles           8.67

[a] These are estimates only. There are exceptions and special cases not shown.

[b] At 1.5 GHz, a single cycle is 0.67 nanoseconds.

If your application is sensitive to memory latency, you will need to investigate the characteristics of the candidate hardware. A good place to start is by finding diagrams or descriptions of the processor caches and system chip set characteristics. It turns out that today's processor memory architecture is a hierarchy, potentially with many levels of caches. A diagram of this hierarchy for the processors in our example compute slices is shown in Figure 4-8.

Figure 4-8. Processor memory hierarchies

As you can see from Figure 4-8, there are several cache levels involved, and each architecture has a different approach to keeping the processor “fed” with instructions and data. Keeping the processor supplied with instructions and data is important. “Processor stalls,” or pauses in processing, occur when the processor must wait for items to be fetched from main memory. Notice that the Opteron processor must deal with both processor local and “remote” memory, which means that there are two “main” memory latencies involved. This detail is “hidden” from the caches and the processor by the memory controller.

Each level of the memory hierarchy has its own penalty for a cache “miss,” which occurs when the needed item is not present. The miss information is usually specified in terms of cycles, so you need to know the specific frequency at which the particular cache is running. Because this changes with every generation and tweak of the processor, it can be difficult to find current information. Caveat inquisitor.
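
Converting latencies quoted in cycles to nanoseconds is a matter of dividing by the clock rate. A small sketch using the 1.5-GHz Itanium 2 values from Table 4-6:

    # Convert cache latencies from cycles to nanoseconds at a given clock rate.

    CLOCK_GHZ = 1.5                 # Itanium 2 clock from Table 4-6
    ns_per_cycle = 1.0 / CLOCK_GHZ  # about 0.67 ns per cycle at 1.5 GHz

    latencies = [("Level 1", 1), ("Level 2 integer", 5), ("Level 2 float", 6),
                 ("Level 3 integer", 12), ("Level 3 float", 13)]

    for level, cycles in latencies:
        print("%-16s %2d cycles = %.2f ns" % (level, cycles, cycles * ns_per_cycle))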

Number of Processors in a Compute Slice

A common number of processors found in a compute slice for scientific and engineering clusters is two. This may be due in part to packaging issues for high-performance processors (it is hard to fit four or more processors and RAM into a 1U or 2U package within a heat budget), but most likely is a result of the demand placed on the system bus by scientific computation. More than two processors on today's commodity buses tend to run out of memory bandwidth for scientific applications.

TIP

If you find a commodity SMP system with more than two CPUs that provides the proper scaling for your application, take a closer look at it. More CPUs per package reduces the number of compute slices to manage, and may also reduce the power and heat requirements for your cluster.

A cluster for Web serving or database serving may benefit by using compute slices that have more CPUs and a larger physical RAM capacity. There are newer architectures, like Opteron, that show promise of good scaling beyond two processors per system. In addition to the ability to add more physical RAM, larger packages also tend to allow more I/O connections to peripheral devices.

I/O Interface Capacity and Performance

Data I/O to peripherals is an important performance criterion for your cluster's compute slices. If your compute slices have local scratch disks, or if you are using an HSI, the I/O capabilities and performance will be important to the overall performance of the cluster. The ability to move data efficiently to and from device interface cards and main memory involves the direct memory access (DMA) controller, PCI bus controller, and the PCI interface card itself.

Care must be taken to select a compute slice that is capable of sustaining the necessary I/O rates for the type of work your cluster performs. The most common interfaces in commodity computer systems are built around the PCI standard. Along with the I/O rates, the number and type of the PCI interface slots must be sufficient to allow adding interface cards for the HSI and other required peripherals.

We must be wary of relying on the I/O performance of “core I/O” devices, because these devices tend to be low-cost PCI interfaces that are integrated on the system's motherboard. Although integrated devices may save on PCI slots, an analysis of the compute slice's core I/O architecture may be necessary to avoid encountering bottlenecks. In some cases, it may be better to add an interface card, if there are available slots, rather than use the built-in functionality.

PCI Implementation

We can think of the PCI bus as being implemented by an I/O converter or bridge with a 64-bit input bus running at 66.6 MHz. At this speed, the maximum throughput to the bridge is 532 MB per second. Underneath the bridge (no, we aren't looking for trolls) there may be individual buses running 32 bits at 66.6 MHz or 64 bits at 33.3 MHz. The throughput on these buses is 266 MB per second.

The PCI extended (PCI-X) standard allows for a 64-bit main bus running at 133.33 MHz, yielding a throughput of 1.06 GB per second. Subbuses are specified at 66.66 MHz. A PCI-X interface card is backward compatible with PCI, but the data rate will drop to the 33.33-MHz value associated with the 64-bit PCI bus. PCI bus data rates are shown in Table 4-7.

Table 4-7. PCI and PCI-X Data Rates

Bus Type     Data Size, bits   Clock Rate, MHz   Data Rate, MB/s
PCI main     64                66.66             532
PCI          32/64             66.66/33.33       266
PCI-X main   64                133.33            1066
PCI-X        64                66.66             533
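
The data rates in Table 4-7 are simply bus width times clock rate. A short sketch of that arithmetic, using the widths and clocks from the table:

    # Peak PCI and PCI-X throughput = (bus width in bytes) x (clock rate).

    buses = [
        # description, width (bits), clock (MHz)
        ("PCI, 32-bit at 66 MHz",    32,  66.66),
        ("PCI, 64-bit at 33 MHz",    64,  33.33),
        ("PCI-X, 64-bit at 66 MHz",  64,  66.66),
        ("PCI-X, 64-bit at 133 MHz", 64, 133.33),
    ]

    for name, width_bits, clock_mhz in buses:
        mb_per_second = (width_bits / 8) * clock_mhz
        print("%-26s %6.0f MB/s peak" % (name, mb_per_second))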

In addition to the PCI interface bus type, there are two sizes of PCI interface cards defined: full-size and short (sometimes called half size). You should make sure which types of slots are available in your compute slice hardware before making assumptions about interface cards. Sometimes one or more of the available PCI slots are half size, and this can really cramp your I/O style.

For examples of PCI bus implementations and speeds, see the example systems in Figure 4-2 through Figure 4-4. All three of these systems have PCI-X interface slots and use PCI for internal core I/O devices.

Accelerated Graphics Port

If your cluster is doing graphics rendering, then you may wish to investigate compute slices with an accelerated graphics port (AGP) slot. An AGP bus provides 32-bit data at various clock rates. AGP was originally intended to provide higher throughput than that of the first implementation of the PCI bus, to support high-performance two- and three-dimensional graphics interface cards.

The AGP 1.0 standard was introduced in 1996, and it provided for both AGP 1x (264 MB per second) and AGP 2x (528 MB per second). The AGP 2.0 standard provides AGP 4x (one gigabyte per second), and the more recent AGP 3.0 specification allows AGP 8x at two gigabytes per second. The “speeds and feeds” for AGP are listed in Table 4-8.

Table 4-8. Accelerated Graphics Port Data Rates

AGP Interface Type   Clock Rate, MHz   Data Rate, MB/s
AGP 1x               66.66             266.6
AGP 2x               133.33            533.3
AGP 4x               266.67            1066.6
AGP 8x               533.33            2133.3

Compute Slice Operating System Support

Because this is a book about building Linux clusters, any hardware we discuss must be capable of running the Linux operating system. Intel processors, including Itanium, all run versions of Linux, as does the Opteron processor family. Don't automatically assume, however, that a particular distribution or version of a distribution supports your choice of compute slice hardware.

Master Node Characteristics

The master nodes in a cluster are the access points for users. As such, the master nodes must be multiuser systems, with enough resources to support the expected number of simultaneous cluster users. In addition to the multiuser aspects of the master nodes, it is likely that the master node needs to be highly available to allow users uninterrupted access to the cluster resources. The loss of an unprotected master node will make the cluster unavailable until it is repaired or replaced.

Because it is a highly available, multiuser system, the master node will most likely be a different system model from the compute slices in your cluster. It will need more RAM, potentially more CPUs, and also more I/O slots to support redundant peripheral connections. How much you need to scale up the resources in the master nodes will depend on the number of simultaneous users expected and the activities that are allowed.

Another point worth mentioning concerns using a master node as a file server for the cluster. The tendency is to allow users to store data and software local to the cluster, to avoid the necessity of copying information between the cluster's local storage and external file servers. Because of this, and because of cost constraints, the temptation is merely to attach the cluster's storage to the master nodes and export it, via NFS, to the remainder of the compute slices in the cluster.

The performance of an NFS server, much like that of a disk array, is dependent on its ability to cache frequently used data in the system page cache to prevent physical I/O to the disk. If memory and CPU resources are shared between NFS and the users on the master node, NFS performance for the whole cluster will be severely reduced. Although we will look at file system issues later in this book, it is best to keep a separate set of master and file server nodes in mind when choosing the hardware.

TIP

Do not ever seriously consider using the master nodes in your cluster as NFS file servers. This is bad, bad, bad. The quickest way to impact the performance of the whole cluster is to “save money” by combining the master node and the NFS server.

Figure 4-9 depicts a possible hardware choice for a master node in our example cluster. This particular system is a Hewlett-Packard DL-580, which incorporates up to four 2.8-GHz Xeon processors, 32 GB of RAM, dual hot-plug power supplies, redundant fans, and six PCI-X slots. Interesting hardware features of this system include on-line spare memory and the ability to mirror memory (in other words, 16 GB replicating another 16 GB).

Figure 4-9. An example master node from Hewlett-Packard

This hardware configuration is possibly overkill for smaller clusters. In large clusters that require high availability on the master nodes, the PCI slots become a limiting factor in providing the necessary redundant connections. Support for additional processors and memory to accommodate multiple users is another important feature.

TIP

Choosing a different processor type for a cluster's master node versus the compute slices can lead to issues. If software development is done on the master nodes, remember that many compilers default to compiling and optimizing for the type of system on which they are currently executing.

Although more expensive than a compute slice by a factor of ten, this system provides enough PCI connections to allow redundant network connections to the external network (two dual-GbE interface cards), and redundant connections to the internal management and data networks. It would also be possible to add a storage area network (SAN) connection for management purposes.

Compute Slice and Master Node Summary

What type of hardware you select for your cluster compute slices and master nodes will depend to a certain extent on what type of cluster you are building. CPU, memory, and I/O capabilities, and I/O performance are important to all types of clusters. Understanding the design strengths and weaknesses of a particular system's hardware can help avoid surprises and disappointment.

TIP

I need to make one very important point about cluster hardware: If you don't need something, leave it out of the systems you select. Extra hardware in the form of RAID controllers, graphics cards, dual power supplies, disks, and other unused hardware consume power and generate heat. For a single system, the additional usage may be negligible, but when you multiply the effects over tens, hundreds, or thousands of systems, it becomes a substantial liability. Extra hardware also conspires to lower the reliability of your cluster. When in doubt, leave it out.

System specification alone cannot guarantee that a particular hardware choice will be the correct one, but the more you understand about your application and its requirements, the better. Whenever possible, you should learn from the experience of other cluster users and run your own benchmarks. The information in this chapter can help you understand some of the issues involved in selecting the proper compute slice and master node hardware for your cluster.



[1] To do the level of analysis performed in this chapter, you need to have some familiarity with the hardware manufacturer or perform the necessary research into the hardware designs. I chose example hardware that I had previously encountered, to provide realistic approaches and information. For this reason, you will find that I focus on hardware from Hewlett-Packard, which I actually used to build clusters. This is a matter of practicality, and not a fixation on a particular vendor.

[2] The Linpack benchmark, developed by Jack Dongarra, currently at the University of Tennessee at Knoxville, is a solver that operates on an N by N matrix representing a dense system of linear equations. For vendor-specific benchmark values, see http://Performance.Netlib.Org/performance/html/PDSbrowse.html. The top 500 computers in the world are rated using this set of benchmarks and a parallel variation, and may be viewed at http://Www.Top500.Org.

[3] Paraphrasing one of the many laws attributed to Gene Amdahl, “There must be one megabyte per second on the system bus for every MIP (million instructions per second) of CPU power.” Because a double-precision floating-point calculation moves 16 bytes between the floating-point processor and main memory, we can restate the law as, “For every megaFLOP, there must be 16 megabytes per second on the system bus.”

[4] All prices are list values from the associated company's Web site as of December 30, 2003. Actual prices may vary with time and other factors. Every effort was made to compare similar configurations.

[5] Competitive benchmark results stated in the comparison reflect results published on www.spec.org as of December 30, 2003. The comparisons presented are based on the best performing systems currently shipping by Hewlett-Packard and IBM. For the latest SPEC CPU2000 benchmark results, visit www.spec.org.

[6] Information on Opteron may be found at http://Www.Amd.Com. Additional information from presentation titled “AMD Opteron: Performance Issues, Application Programmers [sic] View” [Rich and Cownie 2003]

[7] Because the maximum number of bytes addressed with 32 bits is 4 GB, this is the absolute maximum available to a process, unless the operating system can “play games” with the virtual address space. What this means is that the space is shared between the three major sections in your application: code, data, and the stack. If the combination of code + data + stack is larger than 4 GB, you are out of virtual address space (and luck).
