IBM Platform Load Sharing Facility add-ons and examples
This appendix describes examples of how to use IBM Platform Load Sharing Facility (LSF) and its add-ons.
The following tasks are illustrated in this appendix:
Submitting jobs with bsub
bsub is the command to submit batch jobs to LSF. The IBM Platform LSF Command Reference, Version 8.3, SC22-5349-00, has a 38-page description of bsub. We highlight the few noteworthy specifics about using bsub.
Upon job completion, the default action for bsub is to mail any job output and any error messages. The default mail destination is defined by LSB_MAILTO in lsf.conf.
To send STDOUT and STDERR from the run to files instead, submit the job with the -o and -e flags, which append to existing files. To overwrite the output and error files, submit with the -oo and -oe flags instead. In Example C-1, the error is in the file job.err.
Example C-1 Showing the error that is produced
[yjw@i05n45 test]$ ls -l job.*
-rw-r--r-- 1 yjw itso 104 Aug 20 18:02 job.sh
[yjw@i05n45 test]$ bsub -o job.out -e job.err job.sh
Job <1868> is submitted to default queue <normal>.
[yjw@i05n45 test]$ ls -l job.*
-rw-r--r-- 1 yjw itso 72 Aug 20 18:34 job.err
-rw-r--r-- 1 yjw itso 943 Aug 20 18:34 job.out
-rw-r--r-- 1 yjw itso 104 Aug 20 18:02 job.sh
[yjw@i05n45 test]$ cat job.err
/home/yjw/.lsbatch/1345502084.1868: line 8: ./job.sh: Permission denied
To send STDOUT and STDERR from the run to the same file, use the same file name for the -o and -e flags. In Example C-2, the error is appended to the file job.out.
Example C-2 Error appended to the job.out file
[yjw@i05n45 test]$ bsub -o job.out -e job.out job.sh
Job <1869> is submitted to default queue <normal>.
[yjw@i05n45 test]$ ls -l job*
-rw-r--r-- 1 yjw itso 956 Aug 20 18:38 job.out
-rw-r--r-- 1 yjw itso 104 Aug 20 18:02 job.sh
[yjw@i05n45 test]$ tail job.out
CPU time : 0.01 sec.
Max Memory : 1 MB
Max Swap : 36 MB
 
Max Processes : 1
Max Threads : 1
 
The output (if any) follows:
 
/home/yjw/.lsbatch/1345502295.1869: line 8: ./job.sh: Permission denied
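The append and overwrite variants differ only in the flag names, so a submission wrapper can choose between them with a small helper. The following is a minimal sketch. The function name bsub_log_opts and its mode keywords are our own illustration and are not part of LSF.

```shell
# Hypothetical helper (not part of LSF): print the bsub output options
# for the requested mode.
#   append    -> -o/-e   (append to existing files)
#   overwrite -> -oo/-oe (overwrite existing files)
bsub_log_opts() {
    case "$1" in
        append)    printf '%s\n' "-o $2 -e $3" ;;
        overwrite) printf '%s\n' "-oo $2 -oe $3" ;;
        *)         echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Usage (requires LSF; the file names must not contain spaces):
#   bsub $(bsub_log_opts overwrite job.out job.err) ./job.sh
```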
Submit with command-line argument (bsub job.sh):
The file job.sh is not spooled. At run time, the job uses the most recent version of job.sh, so changes that are written to the file after submission but before the job starts are picked up.
The script job.sh, like any Linux shell script, can parse command-line arguments at run time.
The user must have execute permission for the script job.sh to run, which is why the job in Example C-2 failed.
The job script job.sh is executable and is submitted with argument 10 as a command-line argument to bsub in Example C-3. There is no error. The file job.err is empty. The job sleeps for 10 seconds (command-line argument), instead of 1 second.
Example C-3 Listing the job script executable
[yjw@i05n45 test]$ ls -l job.*
-rwxr-xr-x 1 yjw itso 104 Aug 20 18:02 job.sh
[yjw@i05n45 test]$ cat job.sh
echo $SHELL
export sleep_time=1
if [ $# -eq 1 ] ; then
sleep_time=$1
fi
date
sleep ${sleep_time}
date
[yjw@i05n45 test]$ bsub -o job.out -e job.err job.sh 10
 
[yjw@i05n45 test]$ ls -l job*
-rw-r--r-- 1 yjw itso 0 Aug 20 18:54 job.err
-rw-r--r-- 1 yjw itso 1015 Aug 20 18:54 job.out
-rwxr-xr-x 1 yjw itso 104 Aug 20 18:02 job.sh
[yjw@i05n45 test]$ cat job.out
Sender: LSF System <[email protected]>
Subject: Job 1870: <job.sh 10> Done
 
Job <job.sh 10> was submitted from host <i05n45.pbm.ihost.com> by user <yjw> in cluster <cluster1>.
Job was executed on host(s) <i05n49.pbm.ihost.com>, in queue <normal>, as user <yjw> in cluster <cluster1>.
</home/yjw> was used as the home directory.
</home/yjw/test> was used as the working directory.
Started at Mon Aug 20 18:54:27 2012
Results reported at Mon Aug 20 18:54:37 2012
 
Your job looked like:
 
------------------------------------------------------------
# LSBATCH: User input
job.sh 10
------------------------------------------------------------
 
Successfully completed.
 
Resource usage summary:
 
CPU time : 0.03 sec.
Max Memory : 1 MB
Max Swap : 36 MB
 
Max Processes : 1
Max Threads : 1
 
The output (if any) follows:
 
/bin/bash
Mon Aug 20 18:54:27 EDT 2012
Mon Aug 20 18:54:37 EDT 2012
 
 
PS:
 
Read file <job.err> for stderr output of this job.
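The argument handling in job.sh can be written more defensively. The sketch below is our own variation, not the script that is used in the examples. It wraps the logic in a function with the hypothetical name pick_sleep_time, falls back to 1 second when no argument is given, and rejects values that are not whole numbers.

```shell
# Sketch of the argument handling in job.sh, with validation added.
# pick_sleep_time is a hypothetical name chosen for this illustration.
pick_sleep_time() {
    if [ $# -eq 0 ]; then
        echo 1                      # default that job.sh uses
        return 0
    fi
    case "$1" in
        ''|*[!0-9]*) echo "not a whole number: $1" >&2; return 1 ;;
        *)           echo "$1" ;;
    esac
}

# In a job script:
#   sleep_time=$(pick_sleep_time "$@") || exit 1
#   date; sleep "$sleep_time"; date
```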
Submit jobs with redirection (bsub < job.sh):
The job script job.sh is spooled when the command bsub < job.sh runs.
If the first line of the script specifies a shell, the commands run under that shell. Otherwise, the jobs run under the system shell /bin/sh. On our test cluster, which runs RHEL 6.2, /bin/sh is linked to /bin/bash, so the default shell is Bash.
Because the commands are spooled, the file job.sh is not run and does not need to have execution permission.
With redirection, the shell feeds job.sh to bsub on standard input, and any token that remains on the command line is interpreted by bsub as the command to run. Therefore, no command-line argument can be passed to job.sh.
In Example C-4, job.sh is redirected to bsub. Although the job script job.sh is not executable, there is no error and the file job.err is empty.
Example C-4 Redirecting the job.sh to bsub
[yjw@i05n45 test]$ ls -l job.*
-rw-r--r-- 1 yjw itso 104 Aug 20 18:02 job.sh
[yjw@i05n45 test]$ cat job.sh
echo $SHELL
export sleep_time=1
if [ $# -eq 1 ] ; then
sleep_time=$1
fi
date
sleep ${sleep_time}
date
[yjw@i05n45 test]$ bsub -o job.out -e job.err < job.sh
Job <1872> is submitted to default queue <normal>.
[yjw@i05n45 test]$ cat job.err
[yjw@i05n45 test]$ cat job.out
Sender: LSF System <[email protected]>
Subject: Job 1872: <echo $SHELL;export sleep_time=1;if [ $# -eq 1 ] ; then; sleep_time=$1;fi;date;sleep ${sleep_time};date> Done
 
Job <echo $SHELL;export sleep_time=1;if [ $# -eq 1 ] ; then; sleep_time=$1;fi;date;sleep ${sleep_time};date> was submitted from host <i05n45.pbm.ihost.com> by user <yjw> in cluster <cluster1>.
Job was executed on host(s) <i05n49.pbm.ihost.com>, in queue <normal>, as user <yjw> in cluster <cluster1>.
</home/yjw> was used as the home directory.
</home/yjw/test> was used as the working directory.
Started at Mon Aug 20 19:11:51 2012
Results reported at Mon Aug 20 19:11:52 2012
 
Your job looked like:
 
------------------------------------------------------------
# LSBATCH: User input
echo $SHELL
export sleep_time=1
if [ $# -eq 1 ] ; then
sleep_time=$1
fi
date
sleep ${sleep_time}
date
 
------------------------------------------------------------
 
Successfully completed.
 
Resource usage summary:
 
CPU time : 0.03 sec.
Max Memory : 1 MB
Max Swap : 36 MB
 
Max Processes : 1
Max Threads : 1
 
The output (if any) follows:
 
/bin/bash
Mon Aug 20 19:11:51 EDT 2012
Mon Aug 20 19:11:52 EDT 2012
 
 
PS:
Read file <job.err> for stderr output of this job.
In Example C-5, the last token on the command line (10) is interpreted by bsub as the job command instead of as an argument to job.sh. The job failed with an error in the file job.err.
Example C-5 Job output with error
 
[yjw@i05n45 test]$ ls -l job.*
-rw-r--r-- 1 yjw itso 104 Aug 20 18:02 job.sh
[yjw@i05n45 test]$ bsub -o job.out -e job.err < job.sh 10
Job <1873> is submitted to default queue <normal>.
[yjw@i05n45 test]$ ls -l job*
-rw-r--r-- 1 yjw itso 66 Aug 20 19:15 job.err
-rw-r--r-- 1 yjw itso 931 Aug 20 19:15 job.out
-rw-r--r-- 1 yjw itso 104 Aug 20 18:02 job.sh
[yjw@i05n45 test]$ cat job.out
Sender: LSF System <[email protected]>
Subject: Job 1873: <10> Exited
 
Job <10> was submitted from host <i05n45.pbm.ihost.com> by user <yjw> in cluster <cluster1>.
Job was executed on host(s) <i05n49.pbm.ihost.com>, in queue <normal>, as user <yjw> in cluster <cluster1>.
</home/yjw> was used as the home directory.
</home/yjw/test> was used as the working directory.
Started at Mon Aug 20 19:15:58 2012
Results reported at Mon Aug 20 19:15:58 2012
 
Your job looked like:
 
------------------------------------------------------------
# LSBATCH: User input
10
------------------------------------------------------------
 
Exited with exit code 127.
 
Resource usage summary:
 
CPU time : 0.01 sec.
Max Memory : 1 MB
Max Swap : 36 MB
 
Max Processes : 1
Max Threads : 1
 
The output (if any) follows:
 
 
 
PS:
 
Read file <job.err> for stderr output of this job.
 
[yjw@i05n45 test]$ cat job.err
/home/yjw/.lsbatch/1345504556.1873: line 8: 10: command not found
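Because a spooled script cannot receive positional arguments, one way to parameterize it is to generate the script text and pipe it to bsub, which reads the job from standard input in the same way as with bsub < job.sh. The sketch below assumes that approach. The helper name make_job_script is hypothetical, and the generated body is a shortened version of job.sh.

```shell
# Hypothetical generator: emit a job script with the sleep time baked in,
# because a script that is spooled through stdin gets no positional
# arguments. Defaults to 1 second, like job.sh.
make_job_script() {
    cat <<EOF
echo \$SHELL
date
sleep ${1:-1}
date
EOF
}

# Usage (requires LSF):
#   make_job_script 10 | bsub -o job.out -e job.err
```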
The directive #BSUB can be used in the job script job.sh to set options for bsub. This approach is convenient when the job script is used as a template for running applications in LSF.
However, the bsub directives (#BSUB) are interpreted by the LSF scheduler only if the job script is spooled, that is, redirected to bsub. When the job.sh is submitted as a command-line argument, except for the first line starting with #!, all lines that start with # in the shell script are ignored as in a normal script. The options that are set by #BSUB are ignored.
In Example C-6, the directive option #BSUB -n 2 in the job script job.sh is ignored because the job script is submitted as a command-line argument. The job runs with the default of one slot.
Example C-6 Job script that is submitted as a command-line argument
[yjw@i05n45 test]$ ls -l job*
-rwxr-xr-x 1 yjw itso 237 Aug 20 19:42 job.sh
[yjw@i05n45 test]$ cat job.sh
#BSUB -n 2
 
env | grep LSB_DJOB_HOSTFILE
env | grep LSB_DJOB_NUMPROC
cat $LSB_DJOB_HOSTFILE
echo LSB_DJOB_NUMPROC=$LSB_DJOB_NUMPROC
 
echo $SHELL
export sleep_time=1
if [ $# -eq 1 ] ; then
sleep_time=$1
fi
date
sleep ${sleep_time}
date
[yjw@i05n45 test]$ bsub -o job.out -e job.err job.sh
Job <1879> is submitted to default queue <normal>.
[yjw@i05n45 test]$ ls -l job*
-rw-r--r-- 1 yjw itso 0 Aug 20 19:49 job.err
-rw-r--r-- 1 yjw itso 1127 Aug 20 19:49 job.out
-rwxr-xr-x 1 yjw itso 237 Aug 20 19:42 job.sh
[yjw@i05n45 test]$ cat job.out
Sender: LSF System <[email protected]>
Subject: Job 1879: <job.sh> Done
 
Job <job.sh> was submitted from host <i05n45.pbm.ihost.com> by user <yjw> in cluster <cluster1>.
Job was executed on host(s) <i05n49.pbm.ihost.com>, in queue <normal>, as user <yjw> in cluster <cluster1>.
</home/yjw> was used as the home directory.
</home/yjw/test> was used as the working directory.
Started at Mon Aug 20 19:49:17 2012
Results reported at Mon Aug 20 19:49:19 2012
 
Your job looked like:
 
------------------------------------------------------------
# LSBATCH: User input
job.sh
------------------------------------------------------------
 
Successfully completed.
 
Resource usage summary:
 
CPU time : 0.03 sec.
Max Memory : 1 MB
Max Swap : 36 MB
 
Max Processes : 1
Max Threads : 1
 
The output (if any) follows:
 
LSB_DJOB_HOSTFILE=/home/yjw/.lsbatch/1345506555.1879.hostfile
LSB_DJOB_NUMPROC=1
i05n49.pbm.ihost.com
LSB_DJOB_NUMPROC=1
/bin/bash
Mon Aug 20 19:49:17 EDT 2012
Mon Aug 20 19:49:19 EDT 2012
 
 
PS:
 
Read file <job.err> for stderr output of this job.
 
[yjw@i05n45 test]$ cat job.err
In Example C-7, the job script is submitted by using redirection. The bsub directive (#BSUB -n 2) is interpreted and the job runs with two job slots. The environment variable LSB_DJOB_NUMPROC is set to 2.
Example C-7 Job script that is submitted by using redirection
[yjw@i05n45 test]$ ls -l job.sh
-rwxr-xr-x 1 yjw itso 237 Aug 20 20:19 job.sh
[yjw@i05n45 test]$ cat job.sh
#BSUB -n 2
 
env | grep LSB_DJOB_HOSTFILE
env | grep LSB_DJOB_NUMPROC
cat $LSB_DJOB_HOSTFILE
echo LSB_DJOB_NUMPROC=$LSB_DJOB_NUMPROC
 
echo $SHELL
export sleep_time=1
if [ $# -eq 1 ] ; then
sleep_time=$1
fi
date
sleep ${sleep_time}
date
[yjw@i05n45 test]$ bsub -o job.out -e job.err < job.sh
Job <1889> is submitted to default queue <normal>.
[yjw@i05n45 test]$ ls -l job*
-rw-r--r-- 1 yjw itso 0 Aug 20 20:20 job.err
-rw-r--r-- 1 yjw itso 2339 Aug 20 20:20 job.out
-rwxr-xr-x 1 yjw itso 237 Aug 20 20:19 job.sh
[yjw@i05n45 test]$ cat job.out
Sender: LSF System <[email protected]>
Subject: Job 1889: <#BSUB -n 2; env | grep LSB_DJOB_HOSTFILE;env | grep LSB_DJOB_NUMPROC;cat $LSB_DJOB_HOSTFILE;echo LSB_DJOB_NUMPROC=$LSB_DJOB_NUMPROC; echo $SHELL;export sleep_time=1;if [ $# -eq 1 ] ; then; sleep_time=$1;fi;date;sleep ${sleep_time};date> Done
 
Job <#BSUB -n 2; env | grep LSB_DJOB_HOSTFILE;env | grep LSB_DJOB_NUMPROC;cat $LSB_DJOB_HOSTFILE;echo LSB_DJOB_NUMPROC=$LSB_DJOB_NUMPROC; echo $SHELL;export sleep_time=1;if [ $# -eq 1 ] ; then; sleep_time=$1;fi;date;sleep ${sleep_time};date> was submitted from host <i05n45.pbm.ihost.com> by user <yjw> in cluster <cluster1>.
Job was executed on host(s) <2*i05n49.pbm.ihost.com>, in queue <normal>, as user <yjw> in cluster <cluster1>.
</home/yjw> was used as the home directory.
</home/yjw/test> was used as the working directory.
Started at Mon Aug 20 20:20:27 2012
Results reported at Mon Aug 20 20:20:28 2012
 
Your job looked like:
 
------------------------------------------------------------
# LSBATCH: User input
#BSUB -n 2
 
env | grep LSB_DJOB_HOSTFILE
env | grep LSB_DJOB_NUMPROC
cat $LSB_DJOB_HOSTFILE
echo LSB_DJOB_NUMPROC=$LSB_DJOB_NUMPROC
 
echo $SHELL
export sleep_time=1
if [ $# -eq 1 ] ; then
sleep_time=$1
fi
date
sleep ${sleep_time}
date
 
------------------------------------------------------------
 
Successfully completed.
 
Resource usage summary:
 
CPU time : 0.03 sec.
Max Memory : 1 MB
Max Swap : 36 MB
 
Max Processes : 1
Max Threads : 1
 
The output (if any) follows:
 
LSB_JOBNAME=#BSUB -n 2; env | grep LSB_DJOB_HOSTFILE;env | grep LSB_DJOB_NUMPROC;cat $LSB_DJOB_HOSTFILE;echo LSB_DJOB_NUMPROC=$LSB_DJOB_NUMPROC; echo $SHELL;export sleep_time=1;if [ $# -eq 1 ] ; then; sleep_time=$1;fi;date;sleep ${sleep_time};date
LSB_DJOB_HOSTFILE=/home/yjw/.lsbatch/1345508425.1889.hostfile
LSB_JOBNAME=#BSUB -n 2; env | grep LSB_DJOB_HOSTFILE;env | grep LSB_DJOB_NUMPROC;cat $LSB_DJOB_HOSTFILE;echo LSB_DJOB_NUMPROC=$LSB_DJOB_NUMPROC; echo $SHELL;export sleep_time=1;if [ $# -eq 1 ] ; then; sleep_time=$1;fi;date;sleep ${sleep_time};date
LSB_DJOB_NUMPROC=2
i05n49.pbm.ihost.com
i05n49.pbm.ihost.com
LSB_DJOB_NUMPROC=2
/bin/bash
Mon Aug 20 20:20:27 EDT 2012
Mon Aug 20 20:20:28 EDT 2012
 
PS:
 
Read file <job.err> for stderr output of this job.
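Because #BSUB lines are silently ignored when the script is passed as a command-line argument, it can help to check a script for directives before choosing how to submit it. The following sketch uses only grep and assumes that directives start at the beginning of a line, as in job.sh. The function name list_bsub_directives is our own.

```shell
# Hypothetical helper: print the #BSUB directive lines found in a script.
# The exit status is 0 if at least one directive exists, 1 otherwise.
list_bsub_directives() {
    grep '^#BSUB' "$1"
}

# Usage (requires LSF): submit by redirection when directives exist,
# so that they are honored.
#   if list_bsub_directives job.sh >/dev/null; then
#       bsub -o job.out -e job.err < job.sh
#   else
#       bsub -o job.out -e job.err ./job.sh
#   fi
```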
The file path (including the directory and the file name) for Linux can be up to 4,094 characters when expanded with the values of the special characters.
Use the special characters that are listed in Table C-1 in job scripts to identify unique names for input files, output files, and error files.
Table C-1 Special characters for job scripts
Special characters   bsub options     Description
%job_limit           -J               Job array job limit
%job_slot_limit      -J               Job array job slot limit
%J                   -e -oe -o -oo    Job ID of the job
%I                   -e -oe -o -oo    Index of job in the job array. Replaced by 0, if job is not a member of any array.
%T                   -e -oe -o -oo    Task ID
%X                   -e -oe -o -oo    Task array index
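To make the substitution concrete, the following sketch imitates how %J and %I in a file name pattern are replaced. It only illustrates the naming result: LSF performs the real substitution itself, and expand_lsf_pattern is a hypothetical function, not an LSF command.

```shell
# Illustration only: imitate the replacement of %J (job ID) and %I (job
# array index) in a file name pattern. LSF does this internally for the
# -o, -e, -oo, and -oe options.
#   $1 = pattern, $2 = job ID, $3 = array index (0 if not in an array)
expand_lsf_pattern() {
    printf '%s\n' "$1" | sed -e "s/%J/$2/g" -e "s/%I/${3:-0}/g"
}

# Usage with LSF (one output and error file per job):
#   bsub -o 'job.%J.out' -e 'job.%J.err' < job.sh
```

For example, expand_lsf_pattern 'job.%J.%I.out' 1889 0 prints job.1889.0.out.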
The IBM Platform LSF Configuration Reference, SC22-5350-00, includes a reference of the environment variables that are set in IBM Platform LSF. A number of the environment variables (see Table C-2) can be used in job scripts to manage and identify the run.
Table C-2 Common IBM Platform LSF environment variables for job scripts
Environment variable   Description
LSB_JOBID              Job ID that is assigned by IBM Platform LSF.
LSB_JOBINDEX           Index of the job that belongs to a job array.
LSB_JOBNAME            Name of the job that is assigned by the user at submission time.
LSB_JOBPID             Process IDs of the job.
LSB_INTERACTIVE        If job is interactive, LSB_INTERACTIVE=Y. Otherwise, it is not set.
LSB_HOSTS              A list of hosts (tokens in a line) that are selected by IBM Platform LSF to run the job.
LSB_DJOB_HOSTFILE      Name of the file with a list of hosts (one host per line) that are selected by IBM Platform LSF to run the job.
LSB_DJOB_NUMPROC       The number of processors (slots) that is allocated to the job.
LSB_JOBFILENAME        The path to the batch executable job file that invokes the batch job.
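Inside a job script, these variables can be combined into a short run header. The sketch below also counts the distinct hosts in LSB_DJOB_HOSTFILE, which in Example C-7 lists the same host twice for a two-slot job. The function name job_summary is our own, and the fallback values are assumptions so that the function also runs outside LSF.

```shell
# Hypothetical helper for job scripts: print a one-line summary built from
# LSF environment variables. The ${VAR:-...} fallbacks are our assumption
# so that the function also works when it runs outside LSF.
job_summary() {
    hosts=0
    if [ -r "${LSB_DJOB_HOSTFILE:-}" ]; then
        # The host file can repeat a host name; count distinct hosts.
        hosts=$(sort -u "$LSB_DJOB_HOSTFILE" | wc -l | tr -d ' ')
    fi
    echo "job=${LSB_JOBID:-none} slots=${LSB_DJOB_NUMPROC:-0} hosts=$hosts"
}
```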
The following examples show how to use bsub to submit and manage jobs and how to use some other basic commands to control jobs and follow up with job execution status. Example C-8 shows how to stop and resume jobs.
Example C-8 IBM Platform LSF - Stopping and resuming jobs
[alinegds@i05n45 ~]$ bsub -J testjob < job.sh
Job <1142> is submitted to default queue <normal>.
[alinegds@i05n45 ~]$ bjobs -a
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1142 alinegd RUN normal i05n45.pbm. i05n46.pbm. testjob Aug 9 15:09
1141 alinegd DONE normal i05n45.pbm. i05n49.pbm. testjob Aug 9 15:06
[alinegds@i05n45 ~]$ bstop -J testjob
Job <1142> is being stopped
[alinegds@i05n45 ~]$ bjobs -a
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1142 alinegd USUSP normal i05n45.pbm. i05n46.pbm. testjob Aug 9 15:09
1141 alinegd DONE normal i05n45.pbm. i05n49.pbm. testjob Aug 9 15:06
[alinegds@i05n45 ~]$ bresume -J testjob
Job <1142> is being resumed
[alinegds@i05n45 ~]$ bjobs -a
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1142 alinegd SSUSP normal i05n45.pbm. i05n46.pbm. testjob Aug 9 15:09
1141 alinegd DONE normal i05n45.pbm. i05n49.pbm. testjob Aug 9 15:06
[alinegds@i05n45 ~]$ bjobs -a
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1142 alinegd RUN normal i05n45.pbm. i05n46.pbm. testjob Aug 9 15:09
1141 alinegd DONE normal i05n45.pbm. i05n49.pbm. testjob Aug 9 15:06
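When scripting around bstop and bresume, it is often useful to read the state of a single job from the bjobs listing. The sketch below filters that listing with awk, assuming the column layout that is shown in Example C-8 (job ID in the first column, STAT in the third). The function name job_stat is hypothetical.

```shell
# Hypothetical filter: read a bjobs listing on stdin and print the STAT
# column for the given job ID (assumes JOBID is column 1, STAT column 3).
job_stat() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Usage (requires LSF):
#   bjobs -a | job_stat 1142
```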
Example C-9 shows a script that runs two jobs: testjob1 and testjob2. The testjob1 job is submitted with the -K option, which makes bsub wait for the job to complete, so the script testjob.sh does not submit the next job until then. The testjob2 job is only scheduled for execution after testjob1 completes.
Example C-9 IBM Platform LSF - Submitting a job and waiting for the job to complete (Part 1 of 2)
[alinegds@i05n49 ~]$ cat testjob.sh
#!/bin/bash
bsub -J testjob1 -K sleep 10
bsub -J testjob2 sleep 10
[alinegds@i05n49 ~]$ ./testjob.sh
Job <1314> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on i05n48.pbm.ihost.com>>
<<Job is finished>>
Job <1315> is submitted to default queue <normal>.
Example C-10 shows what happens when you run a script to submit jobs without the -K option.
Example C-10 IBM Platform LSF - Submitting a job and waiting for job to complete (Part 2 of 2)
[alinegds@i05n49 ~]$ cat testjob.sh
#!/bin/bash
bsub -J testjob1 sleep 10
bsub -J testjob2 sleep 10
[alinegds@i05n49 ~]$ ./testjob.sh
Job <1322> is submitted to default queue <normal>.
Job <1323> is submitted to default queue <normal>.
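The -K behavior can be used to run a list of commands strictly one after another. The following sketch relies on bsub -K returning a nonzero exit status when the job fails, which is our understanding of the documented behavior and should be verified on your cluster. The function name submit_in_order is hypothetical.

```shell
# Hypothetical helper: submit each command with bsub -K, one after another,
# and stop at the first submission whose bsub call returns nonzero.
submit_in_order() {
    for cmd in "$@"; do
        bsub -K "$cmd" || { echo "stopped at: $cmd" >&2; return 1; }
    done
}

# Usage (requires LSF):
#   submit_in_order "sleep 10" "sleep 10"
```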
Example C-11 shows how to change the priority of the jobs. In this example, the jobs are submitted in the status PSUSP (a job is suspended by its owner or the LSF administrator while it is in the pending state) with the -H option, so that we can manipulate them before they start to run.
Example C-11 IBM Platform LSF - Changing the priority of the jobs
[alinegds@i05n45 ~]$ bsub -J testjob1 -H < job.sh
Job <1155> is submitted to default queue <normal>.
[alinegds@i05n45 ~]$ bsub -J testjob2 -H < job.sh
Job <1156> is submitted to default queue <normal>.
[alinegds@i05n45 ~]$ bsub -J testjob3 -H < job.sh
Job <1157> is submitted to default queue <normal>.
[alinegds@i05n45 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1155 alinegd PSUSP normal i05n45.pbm. testjob1 Aug 9 15:41
1156 alinegd PSUSP normal i05n45.pbm. testjob2 Aug 9 15:41
1157 alinegd PSUSP normal i05n45.pbm. testjob3 Aug 9 15:41
[alinegds@i05n45 ~]$ btop 1157
Job <1157> has been moved to position 1 from top.
[alinegds@i05n45 ~]$ bjobs -a
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1157 alinegd PSUSP normal i05n45.pbm. testjob3 Aug 9 15:41
1155 alinegd PSUSP normal i05n45.pbm. testjob1 Aug 9 15:41
1156 alinegd PSUSP normal i05n45.pbm. testjob2 Aug 9 15:41
Example C-12 shows how to move jobs to other queues and resume jobs. The jobs are submitted in the status PSUSP (a job is suspended by its owner or the LSF administrator while it is in the pending state) with the -H option.
Example C-12 IBM Platform LSF - Moving jobs to other queues and resuming jobs
[alinegds@i05n49 ~]$ bsub -J testjob1 -H < job.sh
Job <1327> is submitted to default queue <normal>.
[alinegds@i05n49 ~]$ bsub -J testjob2 -H < job.sh
Job <1328> is submitted to default queue <normal>.
[alinegds@i05n49 ~]$ bsub -J testjob3 -H < job.sh
Job <1329> is submitted to default queue <normal>.
[alinegds@i05n49 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1327 alinegd PSUSP normal i05n49.pbm. testjob1 Aug 10 11:25
1328 alinegd PSUSP normal i05n49.pbm. testjob2 Aug 10 11:25
1329 alinegd PSUSP normal i05n49.pbm. testjob3 Aug 10 11:25
[alinegds@i05n49 ~]$ bswitch priority 1329
Job <1329> is switched to queue <priority>
[alinegds@i05n49 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1329 alinegd PSUSP priority i05n49.pbm. testjob3 Aug 10 11:25
1327 alinegd PSUSP normal i05n49.pbm. testjob1 Aug 10 11:25
1328 alinegd PSUSP normal i05n49.pbm. testjob2 Aug 10 11:25
[alinegds@i05n49 ~]$ bresume -q normal 0
Job <1327> is being resumed
Job <1328> is being resumed
[alinegds@i05n49 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1327 alinegd RUN normal i05n49.pbm. i05n49.pbm. testjob1 Aug 10 11:25
1328 alinegd RUN normal i05n49.pbm. i05n49.pbm. testjob2 Aug 10 11:25
1329 alinegd PSUSP priority i05n49.pbm. testjob3 Aug 10 11:25
[alinegds@i05n49 ~]$ bresume 1329
Job <1329> is being resumed
[alinegds@i05n49 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1329 alinegd RUN priority i05n49.pbm. i05n49.pbm. testjob3 Aug 10 11:25
1327 alinegd RUN normal i05n49.pbm. i05n49.pbm. testjob1 Aug 10 11:25
1328 alinegd RUN normal i05n49.pbm. i05n49.pbm. testjob2 Aug 10 11:25
Example C-13 shows how to submit jobs with dependencies. The testjob2 job does not run until testjob (job 1171) ends in the EXIT state, which it does here by exiting with status 1.
Example C-13 IBM Platform LSF - Submitting jobs with dependencies
[alinegds@i05n45 ~]$ bsub -J testjob "sleep 100; exit 1"
Job <1171> is submitted to default queue <normal>.
[alinegds@i05n45 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1171 alinegd RUN normal i05n45.pbm. i05n49.pbm. testjob Aug 9 16:05
[alinegds@i05n45 ~]$ bsub -w "exit(1171)" -J testjob2 < job.sh
Job <1172> is submitted to default queue <normal>.
[alinegds@i05n45 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1171 alinegd RUN normal i05n45.pbm. i05n49.pbm. testjob Aug 9 16:05
1172 alinegd PEND normal i05n45.pbm. testjob2 Aug 9 16:05
[alinegds@i05n45 ~]$
[alinegds@i05n45 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1172 alinegd RUN normal i05n45.pbm. i05n49.pbm. testjob2 Aug 9 16:05
Example C-14 shows what happens when a job is submitted with a dependency that is not met. The job stays in the pending (PEND) status and does not execute.
Example C-14 IBM Platform LSF - Jobs with unmet dependencies
[alinegds@i05n45 ~]$ bsub -J testjob "sleep 100"
Job <1173> is submitted to default queue <normal>.
[alinegds@i05n45 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1173 alinegd RUN normal i05n45.pbm. i05n49.pbm. testjob Aug 9 16:08
[alinegds@i05n45 ~]$ bsub -w "exit(1173)" -J testjob2 < job.sh
Job <1174> is submitted to default queue <normal>.
[alinegds@i05n45 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1173 alinegd RUN normal i05n45.pbm. i05n49.pbm. testjob Aug 9 16:08
1174 alinegd PEND normal i05n45.pbm. testjob2 Aug 9 16:09
[alinegds@i05n45 ~]$ bjdepinfo 1174
JOBID PARENT PARENT_STATUS PARENT_NAME LEVEL
1174 1173 DONE testjob 1
[alinegds@i05n45 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1174 alinegd PEND normal i05n45.pbm. testjob2 Aug 9 16:09
Job never runs
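Dependency expressions need the ID of the parent job, which bsub prints in its confirmation line. The sketch below extracts the ID with sed and shows how it might be used with done(), the dependency condition that is satisfied when the parent finishes successfully, in contrast to the exit() condition in the previous examples. The helper name job_id_from_bsub is hypothetical.

```shell
# Hypothetical helper: read the bsub confirmation line on stdin and print
# the numeric job ID, for example "Job <1171> is submitted ..." -> 1171.
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Usage (requires LSF): run job.sh only after the parent job completes
# successfully.
#   parent=$(bsub -J testjob "sleep 100" | job_id_from_bsub)
#   bsub -w "done($parent)" -J testjob2 < job.sh
```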
Adding and removing nodes from an LSF cluster
Example C-15 shows how to remove a host from the LSF cluster. Host i05n49 is being removed from the LSF cluster in this example. There are two parts for removing hosts. The first part consists of closing the host that is being removed and shutting down the daemons. The second part consists of changing the LSF configuration files, reconfiguring load information manager (LIM), and restarting mbatchd on the master hosts.
Example C-15 IBM Platform LSF - Removing a host from the LSF cluster (Part 1 of 2)
[root@i05n49 ~]# bhosts -l i05n49
HOST i05n49.pbm.ihost.com
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 60.00 - 12 0 0 0 0 0 -
 
CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem
Total 0.0 0.0 0.0 0% 0.0 6 1 0 3690M 4G 46G
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M
 
 
LOAD THRESHOLD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
 
 
[root@i05n49 ~]# badmin hclose
Close <i05n49.pbm.ihost.com> ...... done
[root@i05n49 ~]# bhosts -l i05n49
HOST i05n49.pbm.ihost.com
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
closed_Adm 60.00 - 12 0 0 0 0 0 -
 
CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem
Total 0.0 0.0 0.0 0% 0.0 6 1 0 3690M 4G 46G
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M
 
 
LOAD THRESHOLD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
 
 
[root@i05n49 ~]# badmin hshutdown
Shut down slave batch daemon on <i05n49.pbm.ihost.com> ...... done
[root@i05n49 ~]# lsadmin resshutdown
Shut down RES on <i05n49.pbm.ihost.com> ...... done
[root@i05n49 ~]# lsadmin limshutdown
Shut down LIM on <i05n49.pbm.ihost.com> ...... done
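The first part of the removal can be collected into a short helper. The sketch below only prints the commands from Example C-15 in order, so that they can be reviewed before they are piped to a shell on the host that is being removed. The function name lsf_host_shutdown_cmds is hypothetical.

```shell
# Hypothetical helper: print, in order, the commands from Example C-15
# that close the local host and shut down its LSF daemons.
lsf_host_shutdown_cmds() {
    cat <<'EOF'
badmin hclose
badmin hshutdown
lsadmin resshutdown
lsadmin limshutdown
EOF
}

# Usage (as root on the host that is being removed; requires LSF):
#   lsf_host_shutdown_cmds            # review the sequence
#   lsf_host_shutdown_cmds | sh -e    # run it, stopping at the first failure
```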
Before we reconfigure LIM and restart mbatchd in Example C-16, we edit the configuration file lsf.cluster.<clustername> and remove any entries regarding host i05n49. In your environment, you might need to change additional files, depending on your cluster configuration (see the section “Remove a host” in Administering IBM Platform LSF, SC22-5346-00, for details about other configuration files that you might need to change).
Example C-16 IBM Platform LSF - Removing a host from the LSF cluster (Part 2 of 2)
[root@i05n45 ~]# lsadmin reconfig
 
Checking configuration files ...
No errors found.
 
Restart only the master candidate hosts? [y/n] y
Restart LIM on <i05n45.pbm.ihost.com> ...... done
Restart LIM on <i05n46.pbm.ihost.com> ...... done
[root@i05n45 ~]# bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
i05n45.pbm.ihost.c ok - 12 0 0 0 0 0
i05n46.pbm.ihost.c closed - 12 0 0 0 0 0
i05n47.pbm.ihost.c closed - 12 0 0 0 0 0
i05n48.pbm.ihost.c closed - 12 0 0 0 0 0
 
[lsfadmin@i05n45 ~]$ badmin mbdrestart
 
Checking configuration files ...
 
No errors found.
 
MBD restart initiated
[lsfadmin@i05n45 ~]$ bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
i05n45.pbm.ihost.c ok - 12 0 0 0 0 0
i05n46.pbm.ihost.c ok - 12 0 0 0 0 0
i05n47.pbm.ihost.c ok - 12 0 0 0 0 0
i05n48.pbm.ihost.c ok - 12 0 0 0 0 0
Example C-18 shows how to add hosts to the LSF cluster. This task consists of changing the configuration of the LSF cluster, reconfiguring LIM, restarting mbatchd, and then activating the new host.
Before we reconfigure LIM and restart mbatchd in Example C-18, we edit the configuration file lsf.cluster.<clustername> and add the entry that is shown in bold in Example C-17 for host i05n49.
Example C-17 IBM Platform LSF - Adding the host to the configuration file
Begin Host
HOSTNAME model type server r1m mem swp RESOURCES #Keywords
i05n45 ! ! 1 3.5 () () (mg)
i05n46 ! ! 1 3.5 () () (mg)
i05n47 ! ! 1 3.5 () () ()
i05n48 ! ! 1 3.5 () () ()
i05n49 ! ! 1 3.5 () () ()
End Host
After changing the configuration file in Example C-17, continue by adding the node as shown in Example C-18.
Example C-18 IBM Platform LSF - Adding new host to LSF cluster (Part 1 of 2)
[lsfadmin@i05n45 conf]$ lsadmin reconfig
 
Checking configuration files ...
No errors found.
 
Restart only the master candidate hosts? [y/n] y
Restart LIM on <i05n45.pbm.ihost.com> ...... done
Restart LIM on <i05n46.pbm.ihost.com> ...... done
[lsfadmin@i05n45 conf]$ badmin mbdrestart
 
Checking configuration files ...
 
No errors found.
 
MBD restart initiated
The second part consists of enabling the LSF daemons on the new host as shown in Example C-19.
Example C-19 IBM Platform LSF - Adding new host to the IBM Platform LSF cluster (Part 2 of 2)
[root@i05n49 lsf8.3_lsfinstall]# ./hostsetup --top="/gpfs/fs1/lsf"
Logging installation sequence in /gpfs/fs1/lsf/log/Install.log
 
------------------------------------------------------------
L S F H O S T S E T U P U T I L I T Y
------------------------------------------------------------
This script sets up local host (LSF server, client or slave) environment.
 
Setting up LSF server host "i05n49" ...
Checking LSF installation for host "i05n49" ... Done
LSF service ports are defined in /gpfs/fs1/lsf/conf/lsf.conf.
Checking LSF service ports definition on host "i05n49" ... Done
You are installing IBM Platform LSF - 8.3 Standard Edition.
 
... Setting up LSF server host "i05n49" is done
... LSF host setup is done.
 
[root@i05n49 lsf8.3_lsfinstall]# lsadmin limstartup
Starting up LIM on <i05n49.pbm.ihost.com> ...... done
[root@i05n49 lsf8.3_lsfinstall]# lsadmin resstartup
Starting up RES on <i05n49.pbm.ihost.com> ...... done
[root@i05n49 lsf8.3_lsfinstall]# badmin hstartup
Starting up slave batch daemon on <i05n49.pbm.ihost.com> ...... done
[root@i05n49 lsf8.3_lsfinstall]# bhosts
Failed in an LSF library call: Slave LIM configuration is not ready yet
[root@i05n49 lsf8.3_lsfinstall]# bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
i05n45.pbm.ihost.c ok - 12 0 0 0 0 0
i05n46.pbm.ihost.c ok - 12 0 0 0 0 0
i05n47.pbm.ihost.c ok - 12 0 0 0 0 0
i05n48.pbm.ihost.c ok - 12 0 0 0 0 0
i05n49.pbm.ihost.c ok - 12 0 0 0 0 0
[root@i05n49 lsf8.3_lsfinstall]# chkconfig --add lsf
[root@i05n49 lsf8.3_lsfinstall]# chkconfig --list | grep lsf
lsf 0:off 1:off 2:off 3:on 4:off 5:on 6:off
The documentation suggests that you can use the command ./hostsetup --top="/gpfs/fs1/lsf" --boot=y to enable IBM Platform LSF daemons to start on system boot-up. In Example C-19, we enable IBM Platform LSF daemons manually (by using chkconfig --add lsf) because the option --boot=y was not accepted by the script hostsetup. Also, the script hostsetup is not in LSF_HOME. You can find it in the directory from which you installed IBM Platform LSF.
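As Example C-19 shows, bhosts can fail for a short time after the daemons start because the LIM configuration is not ready yet. A script that adds hosts can wait for that with a small retry loop. The following sketch is generic shell. The helper name retry is hypothetical, and the attempt count and delay in the usage line are arbitrary.

```shell
# Hypothetical helper: run a command until it succeeds.
#   $1 = maximum attempts, $2 = seconds between attempts, rest = command
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    attempts=$1; delay=$2; shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage (requires LSF): wait up to about a minute for bhosts to respond.
#   retry 12 5 bhosts
```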
Creating a threshold on IBM Platform RTM
To create a threshold, click Add in the upper-right area of the Threshold Management section as shown in Figure C-1.
Figure C-1 IBM Platform RTM - Creating the threshold (Step 1 of 6)
Select whether you want to create the threshold from a Graph Template or Host. In Figure C-2, we select the Graph Template option.
 
Types of graphs: Different types of graphs are available, depending on the option that you choose. Also, the graph options that you see when you select the option Host depend on the types of graphs that you have already generated. Not all graph templates are available to choose from, only the templates that you have already used to generate graphs from the Graphs tab of IBM Platform RTM.
Figure C-2 IBM Platform RTM - Creating the threshold (Step 2 of 6)
After you select the type of threshold that you want to create, select the graph template that you want to use and click Create. In this example, we are creating a threshold to indicate when there are more than 20 jobs pending for more than 300 seconds in the normal queue, so we select the template Alert: Jobs Pending for X Seconds as shown in Figure C-3.
Figure C-3 IBM Platform RTM - Creating the threshold (Step 3 of 6)
After you select the graph template, enter any custom data that is required as shown in Figure C-4 and click Create.
Figure C-4 IBM Platform RTM - Creating the threshold (Step 4 of 6)
After you click Create, you see a page with several customizable options for this threshold. In this example, we configured the threshold to trigger an alarm when there are more than 20 jobs in pending status for more than 300 seconds (see option High Threshold in Figure C-5) and configured the alarm message to be sent to syslog (see the Syslogging check box in Figure C-6). You can also configure emails to be sent to administrators when the alarm is triggered, among several other options.
Figure C-5 IBM Platform RTM - Creating the threshold (Step 5 of 6)
Figure C-6 shows how to configure alarms to send messages to the syslog.
Figure C-6 IBM Platform RTM - Creating the threshold (Step 6 of 6)
 