LSF Workload Configuration

The following diagram depicts an LSF cluster configuration.

Figure 10-2. LSF Cluster Configuration


The submission host is the node from which the user, or operator, submits a job. Typically, the user accounts established on these computers grant the user permission to write data files on the execution hosts' storage. There is nothing special about being a submission host, since any node in the cluster can submit jobs. In fact, jobs can also be submitted from computers outside the cluster, either by opening an xterm session on one of the cluster nodes or by using the Motif-based tool provided with the LSF software.
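A job submission from any cluster node might look like the following sketch. The queue name normal and the script run_sim.sh are placeholders, not names from this configuration:

```
# Submit run_sim.sh to the queue named "normal"; %J expands to the
# job ID, so stdout lands in a per-job file. The job runs on whichever
# execution host LSF selects, under the submitting user's identity.
bsub -q normal -o sim.%J.out ./run_sim.sh

# Check the job's status and which host it was dispatched to
bjobs
```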

The master host is the node where the LSF batch queues reside. When the LSF software initializes, one of the nodes in the cluster is elected to be the master host. This election is based on the order of nodes listed in a configuration file. If the first node listed in the configuration file is inoperative, the next node is chosen, and so forth.
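You can confirm which node currently holds the master role with the lsid command. The cluster and host names below are illustrative, and the output is abbreviated:

```
$ lsid
My cluster name is prod_cluster
My master name is hosta
```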

The execution hosts are the nodes where jobs actually run. A master host can also function as an execution host, and usually does. If a node fails while a job is running, that job's work is lost, but no other nodes are affected. The failed node is marked as offline in the cluster, no further jobs are sent to it, and the interrupted jobs are rescheduled on another available host.

Setting up an LSF cluster requires some planning. Besides collecting configuration data on each execution host, care must be taken to ensure that batch jobs have access to, and the correct permissions for, the data files they read and write. To avoid a single point of failure, it is also a good idea to implement the LSF software's high-availability features.

Setting Correct Permissions

Correct permissions are required at several levels to run batch jobs.

File Access within a Cluster

At some point, the batch jobs that run in the LSF cluster will need to write data to a file system. Since a batch job runs with the same access rights as the user who submits it, that user must have the appropriate permissions on every file the job requires. The required data files must also be accessible from the host that runs the job.

The easiest way to ensure file accessibility is to NFS-mount the file systems containing program data on all systems in the cluster and to use NIS for user authentication. Alternatively, the LSF software can be configured to copy the required files to the host running the job, then copy them back when the job is complete.
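When no shared file system is available, this per-job file staging can be requested at submission time with LSF's -f file-transfer option. A sketch, with placeholder file names: the > operator copies the local file to the execution host before the job starts, and < copies the remote file back after it completes.

```
# Stage input.dat out to the execution host, run the job,
# then copy result.dat back to the submission host.
bsub -f "input.dat > input.dat" \
     -f "result.dat < result.dat" \
     ./run_sim.sh
```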

Note

If the LSF software is used in a mixed environment of Solaris and Windows NT servers, NFS can still be used as long as the NFS client on Windows NT supports the Universal Naming Convention (UNC). Because drive-letter mounts are not established until a user logs in, that type of mount will not work in an LSF cluster.


Host Authentication Considerations

When a batch job or a remote execution request is received, the LSF software first determines the user's identity. Once the user's identity is known, the LSF software decides whether it can trust the host the request came from.

The LSF software normally allows remote execution by all users except root, because configuring an LSF cluster effectively turns a network of machines into a single computer. Users must have valid accounts on all hosts, which allows any user to run a job with his or her own permissions on any host in the cluster. Remote execution requests and batch job submissions are rejected if they come from a host that is not in the LSF cluster.

User Account Mapping

By default, the LSF software assumes uniform user accounts throughout the cluster. This means that a job will be executed on any host with exactly the same user ID and user login name.

The LSF software has a mechanism to allow user account mapping across dissimilar name spaces. Account mapping can be done at the individual user level or system level.
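At the individual level, a user can typically declare such a mapping in a .lsfhosts file in the home directory on each system. The sketch below assumes the account usera on a Solaris host hosta corresponds to user_a on a host named hostnt; all of these names are hypothetical, and the exact file format should be checked against the documentation for the LSF version in use.

```
# In ~usera/.lsfhosts on hosta, naming the equivalent remote account:
hostnt  user_a

# In ~user_a/.lsfhosts on hostnt, naming the account back on hosta:
hosta   usera
```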

High Availability Features

The LSF software has several features that assure high availability. The following diagram shows the failover features of the LSF software.

Figure 10-3. LSF Failover Features


In the above diagram there are four LSF servers, or hosts, connected to a file server. Although only a single file server is shown, in practice this could be a pair of file servers in an HA cluster. The file server contains two key files: lsf.cluster and lsb.events.

The file lsf.cluster is the main configuration file LSF uses. All the names of the hosts in the LSF cluster are listed in this file along with the resources each has. By default, the host that appears first in lsf.cluster becomes the master host. As the master host, this node is responsible for maintaining the batch queues and launching jobs.
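The Host section of this file lists the member hosts in order. A minimal sketch follows; the host names, model, and type values are placeholders, and the full set of columns varies by LSF version:

```
Begin Host
HOSTNAME   model     type     server  RESOURCES
hosta      SparcU2   SUNSOL   1       ()
hostb      SparcU2   SUNSOL   1       ()
hostc      SparcU2   SUNSOL   1       ()
hostd      SparcU2   SUNSOL   1       ()
End Host
```

With this ordering, hosta becomes the master host; if it is down, hostb is next in line.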

The file lsb.events is a log of jobs that have run and their current status. This file is updated only by the master host. It is usually kept on a separate file server or replicated if it is kept on a local disk.

In the event of a master host failure, the LIMs (Load Information Managers) on the surviving nodes hold an election to determine which node should become the new master. The rule is to choose the next operative host listed in the lsf.cluster file. Once a new master host is established, control of the lsb.events log is transferred to it. The LSF software also has an option to keep a second copy of lsb.events on the master host's local disk; if the file server fails, this duplicate copy prevents the event log from becoming a single point of failure.
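In LSF versions that support it, this duplicate event logging is enabled with the LSB_LOCALDIR parameter in lsf.conf. The directory path below is a placeholder:

```
# lsf.conf -- keep a duplicate copy of lsb.events on the
# master host's local disk, in addition to the shared copy
LSB_LOCALDIR=/var/lsf/events
```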
