This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Distributed Resource Management
problem starts when you add multiple users. In a small group, it’s possible for users
to cooperate with one another without adding extra software. Sending email saying
“hey, stay off blast-server5 until I say so” works surprisingly well. But if you have a
large group or irresponsible users, you’ll want some kind of distributed resource
management (DRM) software.
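To make the idea concrete, here is a minimal sketch of the bookkeeping a DRM does in place of those emails: users submit jobs to a queue, and the scheduler starts each one only on an idle host. It is a toy under stated assumptions, not how any real DRM package is implemented. The ToyScheduler class, its method names, the one-job-per-host rule, the host name blast-server4, and the job labels are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy FIFO scheduler (hypothetical): at most one job per host at a time."""
    hosts: list
    queue: deque = field(default_factory=deque)   # waiting (user, command) pairs
    running: dict = field(default_factory=dict)   # host -> (user, command)

    def submit(self, user, command):
        """Queue a job, then try to start it right away."""
        self.queue.append((user, command))
        self.dispatch()

    def dispatch(self):
        """Start queued jobs, oldest first, on whichever hosts are idle."""
        for host in self.hosts:
            if not self.queue:
                break
            if host not in self.running:
                self.running[host] = self.queue.popleft()

    def finish(self, host):
        """Mark a host's job as done and hand the host to the next queued job."""
        job = self.running.pop(host)
        self.dispatch()
        return job


# Hypothetical usage: three users, two hosts.
sched = ToyScheduler(hosts=["blast-server4", "blast-server5"])
sched.submit("alice", "job-a")   # starts on blast-server4
sched.submit("bob", "job-b")     # starts on blast-server5
sched.submit("carol", "job-c")   # both hosts busy, so this job waits
sched.finish("blast-server5")    # bob's job ends; carol's job takes the host
```

Nobody has to ask anyone to stay off a machine here, because the queue enforces it. Real DRMs add what this sketch leaves out, such as priorities, scheduling policies, resource monitoring, and recovery from failures.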
Several DRM software packages are available, both free and commercial. Even the
free ones cost time to install and maintain, and users need training before they can
use the system. Table 12-3 lists some of the most popular packages in the
bioinformatics community. Condor is an established DRM that can be downloaded for
free. Unusually, it supports both Windows and Unix. LSF is a mature product with many
bioinformatics users. It is expensive, but for large groups its robustness makes the
cost justifiable. Parasol is purpose-built for the UCSC kilocluster and gives up some
generality in exchange for increased performance. PBS and ProPBS are popular DRMs,
and if you’re an academic user, you can get ProPBS for free. SGE is a relative
newcomer but has a strong following, partly because it is an open source project.
Table 12-3. DRM software
Product Description (as advertised)
Condor
    Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured
    batch systems, Condor provides a job-queuing mechanism, scheduling policy, priority scheme, resource
    monitoring, and resource management. Users submit their serial or parallel jobs to Condor; Condor then places
    them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their
    progress, and ultimately informs the user upon completion.
    http://www.cs.wisc.edu/condor
LSF
    • Platform LSF 5 is built on a grid-enabled, robust architecture for open, scalable, and modular environments.
    • Platform LSF 5 is engineered for enterprise deployment. It provides unlimited scalability with support for
      over 100 clusters, more than 200,000 CPUs, and 500,000 active jobs.
    • With more than 250,000 licenses spanning 1,500 customer sites, Platform LSF 5 has industrial-strength
      reliability to process mission-critical jobs reliably and on time.
    • A web-based interface puts the convenience and simplicity of global access to resources into the hands of
      your administrators and users.
    • Platform LSF 5, with its open, plug-in architecture, seamlessly integrates with third-party applications and
      heterogeneous technology platforms.
    http://www.platform.com
Parasol
    Parasol provides a convenient way for multiple users to run large batches of jobs on computer clusters of up to
    thousands of CPUs. Parasol was developed initially by Jim Kent, and extended by other members of the
    Genome Bioinformatics Group at the University of California Santa Cruz. Parasol is currently a fairly minimal
    system, but what it does, it does well. It can start up 500 jobs per second. It restarts jobs in response to the
    inevitable systems failures that occur on large clusters. If some of your jobs die because of your program bugs,
    Parasol can also help manage restarting the crashed jobs after you fix your program.
    http://www.soe.ucsc.edu/~donnak/eng/parasol.htm