One obvious improvement is to employ faster networking. Doing so increases the
cost of each compute node a little and significantly increases the cost of network
switches because gigabit network switches are still quite expensive. However, it is
possible to use a hybrid solution in which the database server is connected to a
hybrid network switch via a gigabit line and the compute nodes are connected to the
switch via the more common 100-Mb interface. This is much cheaper than using
gigabit everywhere, and, because a compute node rarely needs more than the 12.5 MBps its 100-Mb link provides, it doesn’t hinder performance much.
When building file servers, people often neglect to put in enough RAM. For BLAST
database servers, though, you really want as much RAM as possible. Caching applies
on the file-server end, too, and if several computers request data from the file server, it’s much better if that data can be served from memory rather than from disk. If you’re thinking of using an autonomous network-attached server as a BLAST database server, think again. Most don’t have gigabit networking or enough RAM.
Local databases
Keeping local copies of your BLAST databases on each node of the cluster will make
access to the data very fast. Most hard disks can read data at 20 to 30 MB per second, or about double what you could get from common networking. If your network is slow, your cluster is large, or your searches are relatively insensitive (and therefore I/O-bound rather than CPU-bound), it’s much better
to have local copies of databases. The main concern with this approach is keeping
the files synchronized and updated with respect to a master copy. This can be done
via rsync or other means. However, if all the nodes update their databases at the
same time across a thin pipe, this operation could take a long time, and the compute
nodes may sit idle.
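If you do go the local-copy route, the sketch below shows one way a staggered update might be scripted. It pushes the master directory to the nodes a few at a time so a thin pipe isn’t saturated by every node updating at once; the node names, paths, and rsync options are assumptions for illustration, not a prescription.

#!/usr/bin/env python
# Sketch: mirror the master BLAST databases onto each compute node, a few
# nodes at a time, so a single thin pipe isn't saturated by simultaneous
# updates. Node names, paths, and rsync options are illustrative assumptions.

import subprocess
from concurrent.futures import ThreadPoolExecutor

MASTER_DB_DIR = "/data/blastdb/"                 # trailing slash: copy contents
NODES = ["node%02d" % i for i in range(1, 17)]   # node01 .. node16 (hypothetical)
MAX_CONCURRENT = 4                               # cap on simultaneous transfers

def sync_node(node):
    """Run rsync over ssh to bring one node's local copy up to date."""
    cmd = ["rsync", "-a", "--delete", MASTER_DB_DIR,
           "%s:/scratch/blastdb/" % node]
    status = subprocess.run(cmd).returncode
    return node, status

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    for node, status in pool.map(sync_node, NODES):
        print("%s: %s" % (node, "ok" if status == 0 else "FAILED"))

Capping the number of concurrent transfers is the crude but effective part: the nodes still update in parallel, just not all at once, so searches on the rest of the cluster can continue while the copies trickle out.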
A lesser concern is the disks themselves. They cost money and are a potential source
of hardware failure (for this reason, some people advocate running the compute
nodes diskless). When discussing disks, there’s a great deal of debate over IDE ver-
sus SCSI. Drives using the IDE interface are generally slower and less reliable, but are
much less expensive. Experts on both sides of the debate will argue convincingly that
buying one type of drive makes more sense than buying the other. However, for opti-
mal performance, you really should access the database from cache rather than disk,
and therefore the disk shouldn’t really matter. Those who choose IDE or SCSI aren’t
necessarily fools, but people who fail to put enough RAM in their boxes are.
Distributed Resource Management
If you’re running a lot of BLAST jobs, one problem to consider is how to manage
them to minimize idle time without overloading your computers. Being organized is
the simplest way to schedule jobs. If you’re the only user, you can use simple scripts (like the sketch following this paragraph) to iterate over the various searches and keep your computer comfortably busy. The
problem starts when you add multiple users. In a small group, it’s possible for users
to cooperate with one another without adding extra software. Sending email saying
“hey, stay off blast-server5 until I say so” works surprisingly well. But if you have a
large group or irresponsible users, you’ll want some kind of distributed resource
management (DRM) software.
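For the single-user case, a driver script of the sort mentioned above can be as minimal as the following sketch; the query layout, database name, and legacy blastall command line are assumptions for illustration.

#!/usr/bin/env python
# Sketch: iterate over a directory of query files and run one search at a
# time, keeping the machine busy without overloading it. The database name,
# file layout, and blastall-style command line are illustrative assumptions.

import glob
import subprocess

DATABASE = "nr"                                 # hypothetical database name
queries = sorted(glob.glob("queries/*.fa"))     # hypothetical query layout

for query in queries:
    out = query + ".blast"
    cmd = ["blastall", "-p", "blastp", "-d", DATABASE, "-i", query, "-o", out]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)             # serial: one search at a time

Running the searches serially is the crudest possible scheduler, but for one user on one machine it is usually enough; the packages in Table 12-3 exist for the harder case of many users sharing many machines.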
There are a number of DRM software packages, both free and commercial. But even
the free ones will cost you time to install and maintain, and users need training to
use the system. Table 12-3 lists some of the most popular packages in the bioinformatics community. Condor is an established DRM that can be downloaded for free; it is rare in that it supports both Windows and Unix. LSF is a mature product with many bioinformatics users. It is expensive, but for large groups its robustness can justify the cost. Parasol is purpose-built for the UCSC kilocluster and trades away some generality for increased performance. PBS and ProPBS are popular DRMs, and if you’re an academic user, you can get ProPBS for free. SGE is a relative newcomer but has a strong following, partly because it is an open source project.
Table 12-3. DRM software

Product: Condor
Description (as advertised): Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job-queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor; Condor then places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.
URL: http://www.cs.wisc.edu/condor

Product: LSF
Description (as advertised): Platform LSF 5 is built on a grid-enabled, robust architecture for open, scalable, and modular environments. Platform LSF 5 is engineered for enterprise deployment. It provides unlimited scalability with support for over 100 clusters, more than 200,000 CPUs, and 500,000 active jobs. With more than 250,000 licenses spanning 1,500 customer sites, Platform LSF 5 has industrial-strength reliability to process mission-critical jobs reliably and on time. A web-based interface puts the convenience and simplicity of global access to resources into the hands of your administrators and users. Platform LSF 5, with its open, plug-in architecture, seamlessly integrates with third-party applications and heterogeneous technology platforms.
URL: http://www.platform.com

Product: Parasol
Description (as advertised): Parasol provides a convenient way for multiple users to run large batches of jobs on computer clusters of up to thousands of CPUs. Parasol was developed initially by Jim Kent, and extended by other members of the Genome Bioinformatics Group at the University of California Santa Cruz. Parasol is currently a fairly minimal system, but what it does, it does well. It can start up 500 jobs per second. It restarts jobs in response to the inevitable systems failures that occur on large clusters. If some of your jobs die because of your program bugs, Parasol can also help manage restarting the crashed jobs after you fix your program.
URL: http://www.soe.ucsc.edu/~donnak/eng/parasol.htm