This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 12: Hardware and Software Optimizations
Software Tricks
In addition to choosing appropriate BLAST parameters and optimizing your hardware
set, you can use a few software tricks to increase your BLAST performance.
Most of these tricks involve splitting or concatenating sequences into optimal-sized
pieces because very large and very small sequences are inefficiently processed by
BLAST.
Multiplexing/Query Packing
Input and output (I/O) can account for a large fraction of the overall run time when
the search parameters are insensitive, such as when running BLASTN. If you find
yourself running a lot of BLASTN searches, you can pack multiple queries together and
reduce the overhead of reading the database repeatedly. For example, let’s say you
have a collection of 100,000 ESTs from your favorite organism and you want to
search them against all other ESTs in the public database. If you search them one at a
time, you will perform 100,000 BLAST searches and therefore have to read the
database 100,000 times. It should go without saying that caching is essential in such a
task.
Table 12-3. DRM software (continued)

Product  Description (as advertised)

PBS      The Portable Batch System (PBS) is a flexible batch queuing and workload
         management system originally developed by Veridian Systems for NASA. It
         operates on networked, multiplatform UNIX environments, including
         heterogeneous clusters of workstations, supercomputers, and massively
         parallel systems. Development of PBS is provided by the PBS Products
         Department of Veridian Systems.
         http://www.openpbs.org

ProPBS   The PBS Pro Version 5.2 workload management solution is the professional
         version of the Portable Batch System. Built on the success of OpenPBS,
         PBS Pro goes well beyond it with the features and support you expect in
         a mission-critical commercial product, such as:
           Shrink-wrapped, easy-to-install binary distributions
           Support on every major version of Unix and Linux
           Enhanced fault tolerance and scalability
           Enhanced scheduling algorithms
           Computational grid support
           Direct support from the team that created PBS
           New, rewritten documentation
           Source code availability
         http://www.propbs.com

SGE      The Grid Engine project is an open source community effort to facilitate
         the adoption of distributed computing solutions. Sponsored by Sun
         Microsystems and hosted by CollabNet, the Grid Engine project provides
         enabling distributed resource management software for wide-ranging
         requirements from compute farms to grid computing.
         http://gridengine.sunsource.net
But what if you glue the sequences together in groups of 100? Well, you’ve just cut
your database I/O down to 1 percent of what it used to be, which can be a
significant savings. For ESTs and other sequences of this length, the speedup is typically
tenfold. This technique is called multiplexing or query packing. It isn’t as simple as it
sounds because there must be a way to prevent alignments from bridging the
sequences, the coordinates must be remapped, and the statistics need to be
recalculated. MegaBLAST, part of the NCBI-BLAST distribution, is a specialized version of
BLASTN that multiplexes queries and includes a variety of other optimizations. It’s
really fast, and anyone doing a lot of BLASTN searches should use this program. You
can find more information about MegaBLAST in Chapters 9 and 13. Query packing
can also be accomplished with a single, sophisticated Perl script (see MPBLAST at
http://blast.wustl.edu).
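The bookkeeping behind query packing can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not how MegaBLAST or MPBLAST work internally: the helper names pack_queries and unpack_coord are hypothetical, and the spacer of N's is just one plausible way to discourage alignments from bridging neighboring sequences. The sketch covers gluing and coordinate remapping only; recalculating the statistics is left out.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: join sequences into one packed query, separated by a
# spacer of N's, and record where each original sequence lies (1-based).
sub pack_queries {
    my ($spacer_len, @records) = @_;    # each record is [name, sequence]
    my $spacer = 'N' x $spacer_len;
    my $packed = '';
    my @index;
    foreach my $rec (@records) {
        my ($name, $seq) = @$rec;
        $packed .= $spacer if length $packed;
        push @index, {
            name  => $name,
            start => length($packed) + 1,
            end   => length($packed) + length($seq),
        };
        $packed .= $seq;
    }
    return ($packed, \@index);
}

# Hypothetical helper: map a coordinate in the packed query back to the
# original sequence name and its local coordinate. Returns an empty list
# if the position falls inside a spacer.
sub unpack_coord {
    my ($index, $pos) = @_;
    foreach my $entry (@$index) {
        next if $pos < $entry->{start} || $pos > $entry->{end};
        return ($entry->{name}, $pos - $entry->{start} + 1);
    }
    return;
}

# Pack two toy sequences and remap position 15 of the packed query.
my ($packed, $index) = pack_queries(5, ['est1', 'ACGTACGT'], ['est2', 'GGCC']);
my ($name, $local) = unpack_coord($index, 15);
print "$name $local\n";
```

An alignment whose start and end map to different names (or to a spacer) has bridged two sequences and would need to be split or discarded.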
Query Chopping
Larger sequences require more memory to search and align. This can blow away
your cached database, or worse, cause the computer to start swapping (using the
disk for RAM). In addition, for a variety of reasons, larger query sequences are
processed less efficiently. One way to solve this problem is to divide the query sequence
into several segments, search them independently, and then merge the results back
together. This is called query chopping and is effectively the opposite of query
packing. The main difficulty with query chopping is dealing with alignments that cross
the boundaries between segments.
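One common way to cope with boundary-crossing alignments is to make neighboring segments overlap. The sketch below only computes the segment coordinates; the overlap approach and the helper name segments are assumptions for illustration, not something the text prescribes. Hits found twice in an overlap would still have to be merged afterward.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: return [start, end] pairs (1-based, inclusive) that
# cover a sequence of $length letters in windows of $size letters.
# Consecutive windows share $overlap letters, so an alignment no longer
# than the overlap always falls entirely inside at least one window.
sub segments {
    my ($length, $size, $overlap) = @_;
    die "overlap must be smaller than segment size\n" if $overlap >= $size;
    my @windows;
    for (my $start = 1; $start <= $length; $start += $size - $overlap) {
        my $end = $start + $size - 1;
        $end = $length if $end > $length;
        push @windows, [$start, $end];
        last if $end == $length;    # the sequence is fully covered
    }
    return @windows;
}

# Chop a 250,000-letter query into 100,000-letter windows, 10,000 overlap.
foreach my $w (segments(250_000, 100_000, 10_000)) {
    print "$w->[0]..$w->[1]\n";
}
```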
Both NCBI-BLAST and WU-BLAST let you specify that only a subsequence of a
large query sequence is to be searched (see the -L parameter in Chapter 13 and the
nwstart and nwlen parameters in Chapter 14). Currently, this works a little better
for WU-BLAST because alignments seeded in a restricted region can extend outside
this region, so there’s no need to stitch together the alignments between neighboring
segments. The following Perl script searches chromosome-sized sequences in 100-KB
segments using WU-BLAST. All coordinates and statistics are identical to a search
with an entire chromosome. Note that complexity filters are currently applied to the
whole sequence, so apply these filters ahead of time.
#!/usr/bin/perl -w
use strict;

# Arguments: <blast program> <database> <query FASTA> [WU-BLAST parameters]
die "usage: $0 <wu-blast command line>\n" unless @ARGV >= 3;
my ($BLAST, $DB, $Q, @P) = @ARGV;
die "ERROR ($0): single FASTA files only\n" if `grep -c ">" $Q` > 1;
my $params = "@P";
die "ERROR ($0): filter ahead of time\n" if $params =~ /filter|wordmask/;

# Count the letters in the query, skipping the definition line
open(FASTA, $Q) or die "ERROR ($0): can't open $Q\n";
my $def = <FASTA>;
my $count = 0;
while (<FASTA>) {chomp; $count += length($_)}
close FASTA;

# Search the query one segment at a time, passing along any extra parameters
my $segment = 100000;
for (my $i = 1; $i <= $count; $i += $segment) {
    system("$BLAST $DB $Q nwstart=$i nwlen=$segment $params");
}
Database Splitting
If you have a computer cluster and a lot of individual BLAST jobs to run, you can
easily split the jobs among the nodes of your cluster. But what if you have a single,
slow BLAST job that you want to spread out over several computers? If your
sequence is very large, you can use query chopping as described earlier and assign
each computer a separate segment. But what if your sequence isn’t so large? A good
solution is to have each computer search only part of the database. You’ll need to do
a little statistical manipulation to set the effective search space to the entire
database, as well as some post-processing to merge all the reports together, but overall
the process is pretty simple. The hard part is making sure the database is properly
segmented on the various computers.
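To see why the statistical adjustment matters, recall that an E-value is proportional to the size of the search space, so a hit found in a slice looks more significant than it would against the whole database. The sketch below rescales E-values after the fact and merges hits from several slices. It is an approximation under stated assumptions: it ignores the edge-effect corrections BLAST applies to effective lengths, and the helper names and the input layout are hypothetical. In practice you would set the effective database size on the BLAST command line so the reports are already correct.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: scale an E-value computed against a slice of
# $slice_letters up to a database of $total_letters. Relies only on the
# E-value being proportional to database size.
sub rescale_evalue {
    my ($evalue, $slice_letters, $total_letters) = @_;
    die "slice size must be positive\n" if $slice_letters <= 0;
    return $evalue * $total_letters / $slice_letters;
}

# Hypothetical helper: merge hits from several slices into one list sorted
# by rescaled E-value. Each slice is a hash reference of the form
# {letters => N, hits => [[name, evalue], ...]}.
sub merge_hits {
    my ($total_letters, @slices) = @_;
    my @merged;
    foreach my $slice (@slices) {
        foreach my $hit (@{ $slice->{hits} }) {
            my ($name, $evalue) = @$hit;
            push @merged,
                [$name, rescale_evalue($evalue, $slice->{letters}, $total_letters)];
        }
    }
    return sort { $a->[1] <=> $b->[1] } @merged;
}

# Two slices of a 25-million-letter database (made-up numbers).
my @hits = merge_hits(25_000_000,
    {letters => 10_000_000, hits => [['seqA', 1e-20], ['seqB', 1e-3]]},
    {letters => 15_000_000, hits => [['seqC', 1e-12]]},
);
printf "%s %.2g\n", @$_ foreach @hits;
```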
If you’re using NCBI-BLAST, you can create database slices using alias databases as
described previously. This allows a great deal more flexibility than physically
splitting the databases into various parts. But remember that alias databases require that
you use GI numbers in the FASTA identifier.
If you’re using WU-BLAST, you can split the database dynamically. WU-BLAST has
command-line parameters called dbrecmin and dbrecmax that describe the minimum
and maximum database records. You can assign each node of the cluster a different
subsection of the database by simply assigning dbrecmin and dbrecmax. For example,
if your database contains 100 records and you have 10 nodes, node 1 gets records 1
to 10, node 2 gets records 11 to 20, etc. To benefit from caching, assign each node
the same database slice every time, so that the slice stays in that node’s cache from
one search to the next.
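The record ranges are easy to compute. The following sketch divides the records as evenly as possible among the nodes and prints the dbrecmin and dbrecmax settings for each one, written in the name=value style used for WU-BLAST parameters elsewhere in this chapter. The helper name slice_ranges is hypothetical, and how you deliver each setting to its node depends on your cluster.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: split records 1..$records into at most $nodes
# contiguous ranges of nearly equal size. Returns [min, max] pairs.
sub slice_ranges {
    my ($records, $nodes) = @_;
    my $base  = int($records / $nodes);    # minimum records per node
    my $extra = $records % $nodes;         # first $extra nodes get one more
    my @ranges;
    my $min = 1;
    for my $n (1 .. $nodes) {
        my $size = $base + ($n <= $extra ? 1 : 0);
        last if $size == 0;                # more nodes than records
        my $max = $min + $size - 1;
        push @ranges, [$min, $max];
        $min = $max + 1;
    }
    return @ranges;
}

# The example from the text: 100 records spread over 10 nodes.
my $node = 1;
foreach my $r (slice_ranges(100, 10)) {
    print "node $node: dbrecmin=$r->[0] dbrecmax=$r->[1]\n";
    $node++;
}
```

Because the ranges depend only on the record count and the number of nodes, rerunning the calculation gives every node the same slice each time, which is what keeps the caches useful.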
Serial BLAST Searching
As discussed in Chapter 5, the best way to speed up BLAST searches is by making
the seeding more stringent. The only problem is that low-scoring alignments may be
lost. High-scoring alignments, however, are relatively resistant to changes in seeding
parameters. The serial strategy takes advantage of this property; it uses an
insensitive search to identify database matches and then a sensitive search to generate the
alignments. An intuitive way to think about this with genomic sequence is “if I can
hit just one exon, I can get the whole gene.” The procedure has three steps and can
be carried out with a simple script:
1. Run BLAST with insensitive parameters.
2. Build a BLAST database from the matches.
3. Run BLAST with sensitive parameters on just the matches.
NCBI-BLAST doesn’t currently offer a wide range of word sizes, so serial searching is
best carried out with WU-BLAST. Example 12-1 shows a script that wraps up the
entire procedure.
To demonstrate the performance of the serial strategy, the script in Example 12-1
performs a search of a Caenorhabditis briggsae genomic fragment
(c009500587.Contig4) against all C. elegans proteins (wormpep97). To minimize the effect of
chance similarities, only alignments with at least 30 amino acids and 35 percent
identity are analyzed. The search parameters, search speed, and number of HSPs found
are displayed in Table 12-4. The first two rows correspond to standard, nonserial
searches. Using the parameters recommended in Chapter 9 (row 2), BLASTX runs
seven times faster than the very sensitive WU-BLAST default parameters (row 1).
This speed is paid for by a loss in sensitivity (number of HSPs). The serial searches
(rows 3 and above) offer varying levels of speed and sensitivity. Only a few
combinations of W and T are presented; there are many useful combinations. Of particular
interest is row 4, which has approximately the same sensitivity as row 1, but runs 18
times faster. Not bad for a short script. Because BLAST is under active development,
perhaps you’ll see serial searching become a standard part of BLAST software.
Example 12-1. A script for serial BLAST searching
#!/usr/bin/perl -w
use strict;

die "usage: $0 <database> <query> <wordsize> <hitdist>\n" unless @ARGV == 4;
my ($DB, $Q, $W, $H) = @ARGV;
$H = $H ? "hitdist=$H" : "";    # a hitdist of 0 leaves the option off
my $tmpdir = "/tmp/tt-blastx.tmpdir";
END {system("rm -rf $tmpdir") if defined $tmpdir}
system("mkdir $tmpdir") == 0 or die "ERROR ($0): can't create $tmpdir\n";
my $STD = "B=100000 V=100000 wordmask=seg";

# search
system("blastx $DB $Q W=$W T=999 $H $STD > $tmpdir/search") == 0 or die;

# collect names (one database identifier per line)
open(NAME, ">$tmpdir/names") or die;
open(SEARCH, "$tmpdir/search") or die;
while (<SEARCH>) {print NAME "$1\n" if /^>(\S+)/}
close SEARCH;
close NAME;

# build second stage database
system("xdget -p -f $DB $tmpdir/names > $tmpdir/database") == 0 or die;
system("xdformat -p $tmpdir/database") == 0 or die;

# align
system("blastx $tmpdir/database $Q $STD") == 0 or die;