Chapter 7. Distributed Computing


Distributed computing is the ability to run jobs across the processing power of primarily heterogeneous environments. Although distributed computing can just as easily run on homogeneous systems, what differentiates it from parallel computing is that it isn't necessarily tied to one type of machine or to a dedicated cluster.

To illustrate this fact, imagine that you’re going to the store to get some milk. It’s right after work and the store is crowded. Not only do you have to fight your way through the back forty to find an open spot in the parking lot, but you have to walk all the way to the back of the store to get your single bottle of milk. When you get back to the front of the store you realize that there’s only one checkout stand! So, there you are, with your one bottle of milk, behind all these people who have crammed their carts chock full of food. The line crawls along.

Enter the distributed computing model. With distributed checkout, the store has numerous checkout lines, which enables each person to go to the shortest line and speed through without worrying about their milk spoiling.

Distributed computing works in exactly the same way. Instead of a single, overly large computer doing all the processing, the data is broken into chunks for several systems to process. The data is distributed throughout the computer farm according to each machine's ability to process units of data. As soon as a computer finishes processing its data, it sends the results back to the master server and is offered a new packet. Of course, the faster the computer, the faster it burns through data and receives new packets.

Distributed computing speeds up processing by enabling computers to share the workload of one or more jobs. Unlike parallel computing, some distributed models are designed to harvest the unused CPU cycles of many machines, most of which aren't dedicated to a single task. Other distributed models resemble parallel computing methodologies, in which the environments are primarily homogeneous and dedicated. In theory, you can harness the power of every spare computer in your workforce after hours to play Quake.

The business models that profit most from this form of computing are those that require more processing power than a single dedicated machine can provide. This is what makes distributed computing so attractive; you don't have to purchase a million-dollar server. Chances are, your company already has enough computing resources right under its nose. You just have to learn how to take advantage of all the wasted CPU cycles. Financial institutions, oil companies, pharmaceutical companies, and even the census bureau can all benefit from this technology.

The beauty of distributed computing is that it can perform its processing in the background, or it can wait until a specified period of idle time has elapsed before spawning. A common method of harnessing idle processing is to embed the distributed application in a screensaver. Not only does this ensure that the computer is idle, but it also gives the user a sense that their computer is actually doing something productive. When the user returns to the workstation, the application stops, returns control to the user, and waits again for the specified idle time.

One For All, and All For One

The massive growth of the Internet and the availability of low-cost, high-speed processing power have spawned a multitude of projects that follow the distributed model. Any project that requires a massive amount of number crunching can be turned into a popular distributed project. All a project needs is a large number of volunteers who are willing to share their idle processing time, a way of transmitting the data back and forth, and a server that's dedicated to putting all the units together.

The projects shared here are indicative of how individuals, companies, and non-profit organizations can use the Internet as a massive distributed computer.

Distributed Net

Taking part in a project with a cow as a mascot might seem like a silly idea at first, but the folks at Distributed Net run many of the Internet's more popular distributed projects, such as cracking encryption. The goal of Distributed Net is to popularize and further the notion of distributed computing by offering several projects at once and, optionally, awarding prizes to the individual or team that processes the winning data.

Distributed Net works on such projects as the RC5 challenge, which is currently searching a keyspace of 2^64 combinations. The Optimal Golomb Ruler project, which searches for sets of marks whose pairwise distances are all distinct, is running at over 182 billion nodes a second.

You can download the Distributed Net client for almost any operating system, including Linux, BeOS, VMS, IBM OS/390, and QNX. You can find and download the client at www.distributed.net/download/clients.html.

Search for Extraterrestrial Life

You too can join your friends in the search for extraterrestrial life by using the wonders of distributed computing technology. No longer do you have to invest millions in your own radio telescope. With SETI at Home, you can use your computer to analyze seemingly random noise in the hopes of discovering an entirely new civilization that is too far away for any physical contact.

The SETI at Home project works in a distributed manner by taking data from the Arecibo radio telescope in Puerto Rico and sending packets out to the host computers that are participating in the program. The host program runs as a screensaver while it analyzes the data. Most of the processing power needed by the distributed clients is spent managing the data and separating known signals from background noise. Unfortunately, the Arecibo telescope is only able to sample a fraction of the sky, and because of the popularity of the program, the same data has been sent out multiple times for analysis.

You can find the SETI at Home project at http://setiathome.ssl.berkeley.edu/, including a client for Unices. The current client isn't multithreaded, so it doesn't take advantage of symmetric multiprocessor (SMP) machines, and there's no benefit to an individual user in distributing it further or running the data in parallel.

Using Distributed Computing to Fight AIDS

Although it's not a Linux distributed computing program, the folks at fightaids@home use distributed computing methods to process data for a greater understanding of the role genetics plays in our lives. As the title states, one of the projects uses distributed computing to assist in AIDS research.

Fightaids@home uses distributed computing software from Entropia (www.entropia.com), which processes data to collect information for potential drug discoveries. The resulting data is taken by the Olson Laboratory at the Scripps Research Institute and modeled with a program called AutoDock (www.scripps.edu/pub/olson-web/doc/autodock/index.html). This modeling program helps scientists understand how molecules dock with one another, which in turn helps them understand how certain medicines can best be synthesized.

Entropia's software occasionally devotes a portion of the donated CPU cycles to other, for-profit applications in order to fund its research on the AIDS project. The client software is only available for Windows platforms with more than 98MB of memory. It can be downloaded securely from www.entropia.com/FAAH/join-FAAH.asp.

Distributed File Sharing

Instead of using distributed computing algorithms to solve complex problems, file-sharing programs such as Napster and Gnutella use distributed files and databases to serve music and video files. These programs are essentially a group of distributed databases that are linked to provide a file-sharing network. Within these databases are lists of files (most notably mp3 songs) that are served from the clients' computers.

Each file-sharing system has its own protocol, which lies at the heart of the application. When a user installs the client, the program uploads the list of shared songs to the server, which enters the file information into the database. Subsequent client requests for those files can then determine exactly where the files are and from which client computer they can be retrieved the quickest. This provides a quick and efficient means of file distribution.

Distributed Denial of Service

The darker side of distributed computing lies in the ability of crackers to perpetrate Distributed Denial of Service (DDOS) attacks and viruses.

Traditionally, the malicious cracker uses a network scanner to rapidly locate a large number of hosts running an easily penetrable service on which to install a client. Common culprits include systems that haven't been hardened yet, or that contain the latest Microsoft IIS vulnerability. After a system is compromised, the cracker installs the distributed application and moves on to the next system. After enough systems have been compromised, the cracker sets them off, perhaps running a simple Internet Control Message Protocol (ICMP) flood or a smurf attack.

Today's crackers can write viruses that seek out other hosts to infect on their own. After several hosts have been infected, they can begin their rampage by launching a DDOS attack, or the program can start sending malicious packets as soon as a host is infected.

Condor

Condor is a program that runs on many UNIX-type platforms, including Linux. Essentially a sophisticated front end for batch jobs, Condor has been under development by a team at the University of Wisconsin-Madison for more than 10 years. Condor is designed to take any number of jobs and distribute them across any number of machines that are also running Condor. This distributed application maintains a database of participating machines and their specifications. Users can specify which type of machine to run their jobs on, and Condor distributes the jobs accordingly. Like most distributed applications, Condor suspends its processing when user activity is detected and regains control after a certain amount of idle time.

Condor implements checkpoints, which give the distributed application the ability to restart jobs if a remote computer loses communication or goes down. The server takes snapshots of the remote job so that it isn't lost and can be rolled back if the user requires it. This is an important idea behind distributed computing: the master node must be able to track its data and resend it to different computers if the data is lost or can't be sent back to the originating node.

The architecture of Condor is laid out around a master server, which is referred to as the Central Manager. This Central Manager serves as the information collector and negotiates resources and resource requests. The Central Manager shares information with the submit server, which organizes the jobs and maintains all the checkpoint information. Each job that runs on a remote machine creates a process on the submit machine, so you need a great deal of random-access memory (RAM) or swap space to manage a large environment. All checkpoint data lives on this machine, unless you decide to dedicate a server to the task. The submit machine can also reside on the Central Manager. Each machine, including the Central Manager, can run jobs. If there's a problem on a machine, the job is stored on the local disk until it can be sent back to the submit machine.

Condor jobs have to run in the background, without user input. Standard input and output can be redirected to files, depending on the needs of the user. This means that you must plan in advance for batch jobs to read from and write to files instead of stdin and stdout.

Installing Condor

Condor can be downloaded from the binaries page at www.cs.wisc.edu/condor/downloads/. You can get binaries for Linux, Solaris, Digital UNIX, SGI, and Windows NT. Condor recommends Red Hat Linux, and although other distributions might work, they’re not supported or tested.

It's a good idea to plan the topology before installing Condor for the first time so that you have a plan of attack. Knowing which machine in your environment can best handle the resources assigned to it helps in the long run. It also helps if you can assign dedicated machines to specific tasks, if your environment is large. Condor can be installed with either statically or dynamically linked binaries. The source code is not available publicly, only on a case-by-case basis.

You can use Condor to handle certain applications differently in four different environments, which are called universes: a standard universe, a vanilla universe, and universes for use with Parallel Virtual Machine (PVM) and with Globus. The vanilla universe can run any job, although it doesn't support checkpointing; jobs run in the standard universe must be linked with Condor, but they do support checkpointing. Shell scripts are a good example of jobs for the vanilla universe; they can run, but they can't be linked with Condor.

Although you can install Condor as any user, it's suggested that you run it as root because of the lack of permission restrictions. The manual notes that running as root allows the program to read certain parameters directly from the kernel, rather than calling on other programs to get them. By default, Condor installs in /usr/local/condor; the setup script allows you to change this location. In general, running things as root isn't such a great idea, but it's a trade-off you make on a system that is dedicated to Condor.

The first step to running Condor is to grab the binaries and extract them to a temporary directory. This initial install is done on the Central Manager.

Make a condor account on the local install machine; the install script fails if it's not set up. You can get around making a condor account by setting the CONDOR_IDS environment variable to the uid.gid values that Condor should use.
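
A minimal sketch of that workaround follows; the numeric UID and GID here are placeholders for whatever account you actually want Condor to run under:

# CONDOR_IDS=507.507 
# export CONDOR_IDS 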

Run condor_install. The setup script asks you a set of self-explanatory questions; answer them to suit your environment. The setup presents a list of default values if you're not sure. The Condor install script places its binaries and scripts in /usr/local/condor/bin, although it can link in other existing directories if you want to include those. The installation asks whether you want to perform a full install, a submit-only install, or define a Central Manager. Remember that you cannot run jobs on a submit-only machine. You must run a full install on each machine, unless you want only a Central Manager. You can define a shared directory from which all Condor machines retrieve their files, including a set of scripts installed to manage your environment.

Condor has to get the files it needs to run jobs from somewhere, so you might consider a Network File System (NFS) shared directory specifically for this use. Step two of the install asks that you run the script on the machine that you are using as your file server; in this case, it is also the Central Manager. You don't have to set it up this way; it's just for demonstration purposes. You also must enter the names of the machines that you are setting up.

After the script finishes, you must run condor_init on all machines in your pool before you can run Condor jobs on them, even the master. The script reminds you of this as it goes through its list of default questions, which you can adjust depending on where you installed the binary files. Although the initialization script populates condor_config for you, you might want to double-check that everything fits your environment. That information is placed in /<condor root>/etc/condor_config. Don't forget to also create the startup scripts; a sample script is placed in /<condor root>/etc/examples/condor.boot. Edit the script to your taste and merge it with your current startup scripts.
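
On a Red Hat-style system, putting the sample script into service might look like the following sketch; the paths assume the default install location and a System V init layout, so adjust them to your distribution:

# cp /usr/local/condor/etc/examples/condor.boot /etc/rc.d/init.d/condor 
# chmod 755 /etc/rc.d/init.d/condor 
# ln -s /etc/rc.d/init.d/condor /etc/rc.d/rc3.d/S98condor 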

Starting Condor is done by executing /usr/local/condor/sbin/condor_master. If everything goes fine, you see five processes running as follows:

condor   18650     1  0 22:44 ?        00:00:00 ./condor_master 
condor   18651 18650  0 22:44 ?        00:00:00 condor_collector -f 
condor   18652 18650  0 22:44 ?        00:00:00 condor_negotiator -f 
condor   18653 18650  1 22:44 ?        00:00:06 condor_startd -f 
condor   18654 18650  0 22:44 ?        00:00:00 condor_schedd -f 

If you want to run Condor jobs on more than one machine, you have to run condor_install on each, unless you created a shared directory that contains all the executable files.

Allocating Resources with ClassAds

Condor's ClassAds mechanism acts as a gatekeeper for resources, mediating between job submissions and the servers that advertise their resources. ClassAds enable clients to send specific jobs to the servers that match the resource requirements of each job.

When Condor is installed on the remote servers, they advertise their resources so that they can best serve the needs of the server farm. The ClassAds mechanism displays the relevant attributes of the machines in the pool, such as memory, CPU usage, and load average, so that each job can go to the machine that best suits its needs. As a machine owner (one whose machine accepts jobs), you can tweak the ClassAd to suit your machine, the jobs that it accepts, and your own computing resources. You can specify whether the machine accepts only certain jobs, runs jobs only when there's no keyboard activity or only at certain times, and other similar restrictions.
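
Much of this tuning happens through the START expression in the condor_config file. The following line mirrors the Start expression shown in the condor_status -l output later in this section; it accepts jobs only when the non-Condor load is low and the keyboard has been idle for 15 minutes, and the thresholds are examples you can adjust to taste:

START = ((LoadAvg - CondorLoadAvg) <= 0.300000) && (KeyboardIdle > 15 * 60) 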

Running condor_status gives you a good idea of the state of the Condor environment. If Condor is set up correctly, it displays something similar to the following:

Name          OpSys       Arch   State      Activity   LoadAv Mem  ActvtyTime 

matrix       LINUX       INTEL  Owner      Idle       0.000   123  0+00:15:04 
stomith      LINUX       INTEL  Owner      Idle       0.000   123  0+00:00:04 

Running condor_status -l <server> gives you configuration details of the individual servers and shows their status. The output is similar to the following:

MyType = "Machine" 
TargetType = "Job" 
Name = "matrix.stomith.com" 
Machine = "matrix.stomith.com" 
Rank = 0.000000 
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000) 
CondorVersion = "$CondorVersion: 6.2.1 Jul 27 2001 $" 
CondorPlatform = "$CondorPlatform: INTEL-LINUX-GLIBC21 $" 
VirtualMachineID = 1 
VirtualMemory = 2088108 
Disk = 10518096 
CondorLoadAvg = 0.000000 
LoadAvg = 0.000000 
KeyboardIdle = 2359 
ConsoleIdle = 9597550 
Memory = 123 
Cpus = 1 
StartdIpAddr = "<172.16.0.6:1035>" 
Arch = "INTEL" 
OpSys = "LINUX" 
UidDomain = "matrix.etopian.net" 
FileSystemDomain = "etopian.net" 
Subnet = "172.16.0" 
TotalVirtualMemory = 2088108 
TotalDisk = 10518096 
KFlops = 30068 
Mips = 368 
LastBenchmark = 1006382711 
TotalLoadAvg = 0.000000 
TotalCondorLoadAvg = 0.000000 
ClockMin = 1040 
ClockDay = 3 
TotalVirtualMachines = 1 
CpuBusyTime = 0 
CpuIsBusy = FALSE 
State = "Unclaimed" 
EnteredCurrentState = 1006390812 
Activity = "Idle" 
EnteredCurrentActivity = 1006390812 
Start = ((LoadAvg - CondorLoadAvg) <= 0.300000) && KeyboardIdle > 15 * 60 
Requirements = START 
CurrentRank = -1.000000 
LastHeardFrom = 1006392016 

To see which machines are currently running jobs, use condor_status -run.

After you get a listing of your environment and the types of machines in it, you can select which server, or type of server, your jobs run on.

The condor_status command keeps you up to date with the current environment. The only thing that you have to do is build your job and submit it.

Submitting Condor Jobs

Submitting jobs to Condor is done in four easy steps. First, you must prepare your code to actually work with Condor. Remember that these jobs have to work independently in the background with no user input, although you can redirect data to and from Condor jobs. Second, you choose which universe the code runs in, and relink it if necessary. Third, you set up a control file to tell Condor what to do with the code it's dealing with. Lastly, you submit the job to Condor with the control file.

Remember that Condor can run jobs in four different types of universes: standard, vanilla, PVM, and Globus. To run standard universe jobs, you must link them to Condor by relinking them with condor_compile. To do this, prepend condor_compile to the gcc command (or whatever compiler you use), as in the following example:

# condor_compile gcc foo.o bar.o -o myprogram 

Vanilla jobs are designed for programs that can't be relinked, and for shell scripts. The only difference between standard and vanilla jobs is that vanilla jobs can't use checkpoints or remote system calls. If your script calls certain files, remember that they must be accessible from the remote machines, so set your file and NFS permissions correctly.

Submitting jobs through Condor is done by preparing a control file, which is called a submit description file. Condor uses this file to determine the requirements of the specific job. With the control file, you can specify such things as the class requirements of the job, the number of times the job is to be run, and stdout (standard output) and stdin (standard input) attributes. Following is a basic example from the documentation:

#################### 
# 
# Example 1 
# Simple condor job description file 
# 
#################### 

Executable     = foo 
Log            = foo.log 
Queue 

In this example, the executable is named foo, and Condor records the job's events in foo.log. Without a count after the Queue command, the job is run only once. You don't have to have a log file, but it's recommended.
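
If you want several instances of the same job, give Queue a count. For example, ending the submit description file with the following line queues 10 copies of the executable:

Queue 10 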

As a second example, a Mathematica job is run through Condor:

#################### 
# 
# Example 2: demonstrate use of multiple 
# directories for data organization. 
# 
#################### 

Executable     = mathematica 
Universe = vanilla 
input   = test.data 
output  = loop.out 
error   = loop.error 
Log     = loop.log 
Initialdir     = run1 
Queue 

Initialdir     = run2 
Queue 

The executable here is mathematica, and it's a vanilla universe job. If a job is submitted as in example one, with no universe given, the standard universe is assumed. If there are no machine-specific requirements, a machine of the same architecture and operating system as the submit machine is assumed to be requested. The data file that mathematica reads is test.data, stdout goes to loop.out, and stderr goes to loop.error. In this scenario, two different sets of data are required, as evidenced by the two Queue commands, and each needs its own directory to store data in. Separate data directories aren't absolutely necessary; it depends on the job. The directories are specified with the Initialdir attribute and are named run1 and run2.

You can specify the type of machine that your job runs on with the Requirements attribute. Remember that the attributes are case-sensitive; your job might not run if you're passing information that doesn't match up with a machine resource. Continuing the previous case, you can tell the job to run only on a Sun UltraSPARC machine running Solaris 2.8 with at least 128MB of RAM by specifying the following:

Requirements   = Memory >= 128 && OpSys == "SOLARIS28" && Arch == "SUN4u" 

For a complete list of attributes, see Appendix D, “Condor ClassAd Machine Attributes.”

Submitting the job is done with the condor_submit command. The syntax is condor_submit <control-file>. For example,

# condor_submit -v control-file 

runs the job according to the control file. The -v switch forces condor_submit to show the job's ClassAd attributes.

Even though Condor can also link jobs for use with PVM and Globus, that’s beyond the scope of this book. More information on PVM and parallel clustering is covered in Chapter 8, “Parallel Computing,” and Chapter 9, “Programming a Parallel Cluster.” For a quick reference to Condor’s class attributes, see Appendix D.

Mosix, Kernel-Based Distributed Computing

Taking distributed computing a step in a different direction, the Multicomputer Operating System for UNIX (Mosix) builds job processing directly into the kernel, which makes the operating system itself part of the cluster. Approaching distributed computing in this manner allows jobs to run without being linked or submitted. Amaze your friends as you run simple commands such as ps and ls across a large cluster.

Mosix is designed to run on any Intel platform computer as a true adaptive distributed clustering environment. This allows it to run in any number of configurations, similar to Condor. You can run it with a dedicated pool of machines, or have machines join the cluster during off hours without intruding on users’ work.

Mosix takes care of its load balancing transparently to the user. It assigns jobs on its own, based on its own resource monitoring, without intervention. Basically, Mosix treats the cluster as an extended SMP machine, forking processes out to other Mosix-enabled kernels.

Mosix is designed to work with a dedicated cluster; it currently doesn't have the ability to sense whether or not a workstation is idle. You have to tell a Mosix node when to join the cluster, or write a script that detects the lack of workstation usage and joins the cluster for you.

Installing Mosix

You can get the latest version of Mosix from www.mosix.org/txt_distribution.html. Because Mosix is kernel-dependent, you want to get the version of Mosix that's designed for the specific kernel sources, and you must recompile the kernel to include Mosix. Don't use the kernel sources that come with the distribution that you're using; they've often been modified for that version of Linux. Instead, spend the time and download the vanilla sources from kernel.org.

Mosix can be installed by hand or, for the rushed administrator, with a script that does the installation for you. The first thing you must do is download the recommended kernel and unpack it into /usr/src/linux or, optimally, /usr/src/linux-<kernelversion>. Then uncompress Mosix into its own directory and install it.
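
The steps look something like the following; the version numbers are only placeholders, so substitute the kernel and Mosix releases that you actually downloaded:

# cd /usr/src 
# tar xzf /tmp/linux-2.4.x.tar.gz 
# ln -s linux-2.4.x linux 
# mkdir /tmp/mosix 
# cd /tmp/mosix 
# tar xzf /tmp/MOSIX-1.x.x.tar.gz 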

If you choose to run the install script that comes with the Mosix package, you must run the mosix.install script in the directory that you just uncompressed.

The script first has you point it to the correct kernel source directory. The script also asks whether it should add the new kernel to the lilo boot loader and which default run levels it should be started in. If you're using a distribution without run levels, be sure to choose the setting for none and add Mosix to your startup scripts yourself. Mosix also asks you to set a mount point for the Mosix File System (MFS), which is currently experimental.

If you look at the install FAQ, you can see that the following files are replaced and that the old files are saved with a .pre_mosix extension:

/etc/inittab 
/etc/inetd.conf and/or /etc/xinetd.d/* 
/etc/lilo 
/etc/rc.d/init.d/atd 
/etc/cron.daily/slocate.cron 

The install script then asks which kernel configuration method you'd like to use (namely config, xconfig, or menuconfig), patches the kernel, and runs the configuration method that you've chosen. Take advantage of the kernel configuration step; Mosix can only incorporate the options that it needs for itself, and it can't guess which other options you want. Mosix recommends that you keep the kernel configuration options CONFIG_BINFMT_ELF and CONFIG_PROC_FS. After you save the kernel configuration, Mosix attempts to compile the kernel for you based on the options that you've given it.

If you need other kernel options besides the ones that Mosix gives you, you must patch the kernel before running mosix.install. If you're running alternate file systems, such as ext3 or ReiserFS, be sure to configure your kernel with them; Mosix does not include these options by default.
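
In terms of the resulting .config, that means keeping at least the options Mosix expects and adding the file systems you rely on. A fragment might look like the following; the ext3 and ReiserFS lines are needed only if you actually use those file systems:

CONFIG_BINFMT_ELF=y 
CONFIG_PROC_FS=y 
CONFIG_EXT3_FS=y 
CONFIG_REISERFS_FS=y 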

Manual installation isn't that much harder, but you do have to install everything by hand rather than have the Mosix scripts do everything for you. If you're brave enough to install Mosix by hand, you gain a much greater level of control and finer granularity. You can find detailed instructions in the man pages that are created from manuals.tar.

After configuring the kernel, be sure that the entry is placed within lilo and reboot.

Configuring Your Mosix Cluster

After you let Mosix compile the kernel by using your choices, you need to edit or create /etc/mosix.map; otherwise, Mosix prompts you for it right after you boot the new kernel. This file has three fields per line: the node number of the first node in a range, the IP address of that node, and the number of consecutive hosts in the range. Following is an example configuration:

1      172.16.0.1   6 
7      172.16.5.1   3 
10     172.16.10.1  7 
17     172.16.30.5  2 

In this configuration, the first column represents the node number of the first node in each range, the second column represents the IP address of that node, and the third represents the number of hosts in the range. Each subnet has to have its own entry. If a Mosix node is a gateway between subnets, you use the word ALIAS in place of the third column. If your cluster is separated by subnets, you have to create the file /etc/mosgates and enter the number 1. If there are two or more gateways between a node and the others, enter the number 2.

If all the nodes are on the same subnet, you don’t need an /etc/mosgates entry. However, if the node number of the local node isn’t in the list, that node number must be written to /etc/mospe.

After that is set up and Mosix is started, Mosix takes over and shares processes across the different computers in the cluster. You can run complex scripts or simple commands, such as ls, and have them spread over the entire cluster. Well, ls might not require the entire cluster, but you get the idea.

Mosix at the Command Line

Mosix has several utilities that you can use to configure and tune Mosix and to gain information about the running kernel. You can use the setpe command to configure a Mosix cluster. With the -w switch, setpe reads a new configuration, which is how you add hosts to the cluster. The setpe command with the -r switch rereads the current configuration from /etc/mosix.map; this resets the configuration if you want to start over and return the cluster to its original state. The setpe -o command shuts down Mosix by removing the configuration.
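
Put together, a typical setpe session might look like the following sketch. The first line loads the map file and brings the node into the cluster, the second rereads /etc/mosix.map, and the third removes the configuration; the -f argument naming the map file is taken from the Mosix documentation, and exact switch spellings can vary between releases, so check the setpe man page on your system:

# setpe -w -f /etc/mosix.map 
# setpe -r 
# setpe -o 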

The Mosix distribution also includes utilities called tune, mtune, tune_kernel, and prep_tune. These utilities produce output that you can use in the kernel configuration to optimize Mosix.

The Mosix tool mosctl gives the administrator greater control over how Mosix works and over its processes. For example, mosctl stay forces the current processes to stay on the current node and not be migrated. This is cancelled by using mosctl nostay or mosctl -stay.

You can use the migrate utility to move a process to any machine in the cluster, back to the home machine that spawned it, or perhaps to a more powerful machine that is better suited for the task. The syntax for migrating a job is migrate <pid> <mosix machine id>|home|balance. You can specify the machine ID to move the process to; home, which sends the process back to its home machine; or balance, which finds the best machine on which to balance the process's load. For example,

# migrate 1024 16 

migrates pid 1024 to node 16.

Administrating Mosix

The system administrator can make changes to the running Mosix configuration and to the running jobs. The administrator can maintain the kernel parameters with the setpe command, and manipulate /proc directly.

The file /proc/mosix/admin/gateways handles whether processes from guest nodes can migrate into the current node. By writing a 1 to that file, you tell Mosix to block guest processes from entering. If that number is 0, guest processes are free to enter.

By entering a 1 into /proc/mosix/admin/expel, Mosix expels the guest processes that are currently running in the host machine.

To run the tune_kernel command, you must first enter a 1 into /proc/mosix/admin/quiet. It is recommended that you do this kernel tuning directly after installing the Mosix cluster, or whenever you change the network or CPUs.
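
From the shell, these administrative controls are just writes to the /proc files described here: the first line below blocks new guest processes, the second expels guests that are already running, and the third quiets the node before you run tune_kernel:

# echo 1 > /proc/mosix/admin/gateways 
# echo 1 > /proc/mosix/admin/expel 
# echo 1 > /proc/mosix/admin/quiet 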

More information about the current processes is available in /proc/<pid>/. This tells you which process runs where, with 0 being the home node. The entry /proc/<pid>/lock states whether or not the process has been locked against automatic migration from its home node. The entry /proc/<pid>/nmigs shows the number of times that the process has been migrated.

Entries in /proc/<pid>/cantmove list the reasons why a process can't migrate from its home node. Some possible reasons are that the process is using files as shared memory, the process is using device memory, the process is in 8086 mode, or the process is a kernel daemon.

Information about all the configured nodes in the cluster is available in /proc/mosix/nodes/<node-number>.

Using Diskless Clients with Mosix

A fully functional Linux workstation or cluster node can easily run without hard drives, CD-ROMs, or floppies, which saves administration time and maintenance. One of the benefits of these diskless clients is that they can also contribute their spare CPU cycles to a distributed cluster.

Diskless appliances can be bought for less than half the price of today's cheapest workstations, or easily built from commodity parts. The only moving part in such a workstation is the fan in the power supply. Having few moving parts greatly reduces the amount of equipment that can fail, which reduces downtime. Diskless clients share in the processing power of distributed and parallel clusters by contributing CPU cycles to the master nodes.

A diskless client can ease administration time and maintenance not only by reducing the number of parts in a computer, but also by centralizing all the necessary files on a single server. The single-server environment allows for such things as simplified printer management, centralized backup and restore of user files, and the elimination of costly trips to the user's desktop. With Linux as the choice of OS, the risk of virus infection is greatly reduced, and with the advent of programs such as Star Office, the cost of the applications is almost nil. Add solitaire and a chat program or two, and your users have the same functionality as with Microsoft Windows. Both the server and the users are managed more effectively.

Although diskless appliances and XServers can reduce maintenance costs, the price of disks has dropped so much recently that implementing diskless workstations might not be worth the effort. It's a judgment call, based mainly on the ability of the support staff to manage multiple systems.

Installing a Diskless Client with Mosix

Diskless clients are simple machines that rely on a network card with a pre-installed boot read-only memory (ROM), a kernel, and an NFS share. After the system is turned on, the network card gets its IP address from a Bootstrap Protocol (BOOTP) or Dynamic Host Configuration Protocol (DHCP) server. Remember that the BOOTP or DHCP server has to be on the same subnet, or the routers must relay BOOTP/DHCP broadcasts between subnets. The server transfers the kernel image to the client over Trivial File Transfer Protocol (TFTP) and, upon boot, offers up a root file system that the client mounts through NFS. Optionally, the client loads an XServer into memory and contacts X Display Manager (XDM) for a remote login to the server. These services are often offered from the same server, although it's entirely possible to separate them according to function.
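
If the same server holds the root file system, a single read-only NFS export is enough for the clients to mount at boot. A minimal /etc/exports entry, assuming the /tftpboot/lts/ltsroot path used by the LTSP later in this section and the 192.168.0.0/24 client network that LTSP assumes by default, might look like this:

/tftpboot/lts/ltsroot   192.168.0.0/255.255.255.0(ro,no_root_squash) 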

The Linux Terminal Server Project (LTSP) is a concrete example of diskless clients. It's an implementation of an X terminal server in which the client runs an XServer, although most operations run on the server itself. Because of the limited client requirements, the client can be as modest as a 486 with 16MB of RAM.

The LTSP makes use of the Etherboot (http://etherboot.sourceforge.net) program to boot the individual clients with a kernel fetched through TFTP. The program allows the transfer of boot images and other programs through the network.

For quick installation, LTSP requires Red Hat, Mandrake, or a compatible distribution. Debian is functional, although support is still in beta. LTSP assumes that your server uses the IP address 192.168.0.254, and it populates the DHCP configuration with a range that includes that address. If your server uses anything different, you have to edit a few files to reflect your original IP scheme. You also have to be running DHCP on the server, as described earlier in this chapter.

Remember that for this implementation, all programs run on the server and therefore use the server's CPU. A decent SMP-based server with a hefty bit of RAM (about a gig or so) can easily handle upwards of 80-100 clients. You can download the RPM packages from http://sourceforge.net/project/showfiles.php?group_id=17723. You need the most recent core package, lts_core-2.xx-xx.i386.rpm, and the corresponding kernel package made for your network card. If you're using the terminal server with Mosix support, you can get the client kernel from www.nl.linux.org/~jelmer/vmlinuz-mosix.ne2000 or www.nl.linux.org/~jelmer/vmlinuz-mosix.all. Otherwise, you can grab a client kernel such as lts_kernel_eepro100-2.2-0.i386.rpm. You also need an XServer to match your video card, such as lts_xsvga-2.xx-xx.i386.rpm. Install the three files with rpm -i <package>.
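
As a concrete example, installing the core package, an Intel EtherExpress Pro/100 client kernel, and the SVGA XServer package looks something like the following; the exact version numbers depend on the releases you downloaded:

# rpm -i lts_core-2.xx-xx.i386.rpm 
# rpm -i lts_kernel_eepro100-2.2-0.i386.rpm 
# rpm -i lts_xsvga-2.xx-xx.i386.rpm 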

The core RPM package installs its files in /tftpboot/lts/. Make sure that the kernel you downloaded lies within this directory. Execute the main install script with the following:

/tftpboot/lts/templates/ltsp_initialize 

The initialization script creates an /etc/dhcpd.conf.example file. Edit this file to reflect your environment and copy it to /etc/dhcpd.conf. Edit the dhcpd.conf file to include the Media Access Control (MAC) address of each workstation, and then add the workstation names and associated IPs to /etc/hosts. For instance, if /etc/dhcpd.conf contains a host ws001 entry with a fixed-address of 192.168.0.1, you add

192.168.0.1   ws001  ws001.fqdn.com 

to /etc/hosts. Edit the /tftpboot/lts/ltsroot/etc/lts.conf file to reflect changes to your configuration.
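
A bare-bones lts.conf needs little more than the server's address and the XServer to load. The fragment below is only a sketch; the parameter names follow the LTSP 2.x examples, so verify them against the sample lts.conf shipped with your release:

[Default] 
        SERVER    = 192.168.0.254 
        XSERVER   = XF86_SVGA 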

You also need to make sure that the XServer process on the client isn’t migrated from there. To the beginning of /tftpboot/lts/ltsroot/etc/rc.local, add

echo 1 > /proc/mosix/admin/stay 

after the line with PATH=. You also need to link the mosix.map file so that the clients can access it.

# ln -s /etc/mosix.map /tftpboot/lts/ltsroot/etc/mosix.map 

Make sure that the client is able to use the Mosix files by copying them into the tftpboot directories, and add the Mosix startup to the client init scripts.

Reboot the server, turn on the client, and watch as it comes up cleanly.

Following is an example /etc/dhcpd.conf file that is set up with two workstations for the LTSP:

default-lease-time            21600; 
max-lease-time                21600; 
ddns-update-style ad-hoc; 
option subnet-mask            255.255.255.0; 
option broadcast-address      172.16.0.255; 
option routers                172.16.0.1; 
option domain-name-servers    24.130.1.32; 
option domain-name            "domain.com"; 
option root-path              "/tftpboot/lts/ltsroot"; 

shared-network WORKSTATIONS {
    subnet 172.16.0.0 netmask 255.255.255.0 {
    } 
} 

group {
    use-host-decl-names       on; 
    option log-servers        172.16.0.6; 

    host ws001 {
        hardware ethernet     00:E0:06:E8:00:84; 
        fixed-address         192.168.0.1; 
        filename              "lts/vmlinuz.eepro100"; 
    } 
    host ws002 {
        hardware ethernet     00:D0:B7:23:C0:5E; 
        fixed-address         172.16.0.50; 
        filename              "lts/vmlinuz.eepro100"; 
    } 
} 

If you have problems with the installation of LTSP, you can check a few places to make sure that you configured everything correctly. Make sure that tcp_wrappers is configured to accept connections, and that hosts.deny and hosts.allow are set up correctly for the clients. Make sure that the TFTP server is set up correctly; some tweaks might be needed. Error messages are kept in /var/log/messages, as usual. If you can’t get an X display, remember to look in /var/log/xdm-error. If all else fails, the LTSP has a quick troubleshooting FAQ. Check out www.ltsp.org/documentation/lts_ig_v2.4/lts_ig_v2.4-15.html for more information.

Summary

Distributed computing takes information from one computer and shares the data or processing load throughout a cluster of computers. Distributed computing can process large amounts of data in a short amount of time, much faster than any single computer, or perhaps even a supercomputer, because of the number of potential nodes in each cluster.

The cluster doesn’t have to be dedicated; sometimes, it only uses nodes during times when the user is not at their desk.

Distributed computing is similar to parallel computing; in fact, sometimes the lines between them overlap. The distributed computing environment takes more advantage of heterogeneous environments, but it isn't necessarily limited to them (distributed models do just fine on homogeneous networks, for example). A distributed computing environment can be preferable if you have to use your computers both as workstations and as compute nodes.
