This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 12: Hardware and Software Optimizations
for (my $i = 1; $i <= $count; $i += $segment) {
    system("$BLAST $DB $Q nwstart=$i nwlen=$segment");
}
Database Splitting
If you have a computer cluster and a lot of individual BLAST jobs to run, you can
easily split the jobs among the nodes of your cluster. But what if you have a single,
slow BLAST job that you want to spread out over several computers? If your
sequence is very large, you can use query chopping as described earlier and assign
each computer a separate segment. But what if your sequence isn’t so large? A good
solution is to have each computer search only part of the database. You’ll need to do
a little statistical manipulation to set the effective search space to the entire
database, as well as some post-processing to merge all the reports together, but overall
the process is pretty simple. The hard part is making sure the database is properly
segmented on the various computers.
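The statistical adjustment and the merge step can be illustrated with a small Perl sketch. It rests on a simplifying assumption: an E-value grows linearly with the size of the search space, so an E-value computed against one slice can be scaled up by the ratio of the whole database length to the slice length. The corrections BLAST applies for effective sequence lengths are ignored here. The function names, the hit structure, and all numbers are invented for illustration; in practice it is better to have BLAST report statistics for the full search space directly.

```perl
use strict;
use warnings;

# Rescale an E-value computed against one slice to the whole database.
# Assumption: E is proportional to database length; the effective-length
# corrections that BLAST applies are ignored in this sketch.
sub rescale_evalue {
    my ($evalue, $slice_len, $total_len) = @_;
    return $evalue * $total_len / $slice_len;
}

# Merge the hit lists from several slices into one list, best E-value first.
# Each slice is a hash with the slice length (len) and a list of hits (hits).
sub merge_hits {
    my ($total_len, @slices) = @_;
    my @merged;
    for my $slice (@slices) {
        for my $hit (@{ $slice->{hits} }) {
            push @merged, {
                id     => $hit->{id},
                evalue => rescale_evalue($hit->{evalue}, $slice->{len}, $total_len),
            };
        }
    }
    my @sorted = sort { $a->{evalue} <=> $b->{evalue} } @merged;
    return @sorted;
}

# Made-up example: two nodes searched slices of 1 and 3 million letters.
my @slices = (
    { len  => 1_000_000,
      hits => [ { id => 'seq7', evalue => 1e-20 }, { id => 'seq3', evalue => 0.5 } ] },
    { len  => 3_000_000,
      hits => [ { id => 'seq42', evalue => 1e-30 } ] },
);
for my $hit (merge_hits(4_000_000, @slices)) {
    printf "%s\t%.2g\n", $hit->{id}, $hit->{evalue};
}
```

Note that a hit that looks marginal within a small slice (E = 0.5 here) becomes less significant once it is judged against the entire database.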
If you’re using NCBI-BLAST, you can create database slices using alias databases as
described previously. This allows a great deal more flexibility than physically splitting
the databases into various parts. But remember that alias databases require that
you use GI numbers in the FASTA identifier.
If you’re using WU-BLAST, you can split the database dynamically. WU-BLAST has
command-line parameters called dbrecmin and dbrecmax that specify the first and
last database records to search. You can assign each node of the cluster a different
subsection of the database simply by setting dbrecmin and dbrecmax. For example,
if your database contains 100 records and you have 10 nodes, node 1 gets records 1
to 10, node 2 gets records 11 to 20, and so on. To benefit from caching, each node
should be assigned the same database slice on every search.
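The record arithmetic above can be sketched in Perl as follows. The helper slice_ranges is a hypothetical name, and the program, database, and query names are placeholders. The sketch only prints the command line each node would run, written in the same name=value style as the nwstart and nwlen example, and does not set the effective search space.

```perl
use strict;
use warnings;

# Divide $records database records into $nodes contiguous, nearly equal
# slices. Returns a list of [dbrecmin, dbrecmax] pairs (1-based, inclusive).
sub slice_ranges {
    my ($records, $nodes) = @_;
    die "need at least one record and one node\n"
        if $records < 1 or $nodes < 1;
    $nodes = $records if $nodes > $records;   # never hand out an empty slice
    my $base  = int($records / $nodes);       # minimum slice size
    my $extra = $records % $nodes;            # first $extra slices get one more
    my @ranges;
    my $start = 1;
    for my $n (1 .. $nodes) {
        my $size = $base + ($n <= $extra ? 1 : 0);
        push @ranges, [$start, $start + $size - 1];
        $start += $size;
    }
    return @ranges;
}

# Placeholder names, not real paths: substitute your own program,
# database, and query file.
my ($BLAST, $DB, $Q) = ('blastn', 'mydb', 'query.fa');

my $node = 0;
for my $range (slice_ranges(100, 10)) {
    $node++;
    print "node $node: $BLAST $DB $Q dbrecmin=$range->[0] dbrecmax=$range->[1]\n";
}
```

Because the ranges depend only on the record count and the number of nodes, rerunning the script gives every node the same slice each time, which is what lets the cache help.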
Serial BLAST Searching
As discussed in Chapter 5, the best way to speed up BLAST searches is by making
the seeding more stringent. The only problem is that low-scoring alignments may be
lost. High-scoring alignments, however, are relatively resistant to changes in seeding
parameters. The serial strategy takes advantage of this property; it uses an insensitive
search to identify database matches and then a sensitive search to generate the
alignments. An intuitive way to think about this with genomic sequence is “if I can
hit just one exon, I can get the whole gene.” The procedure has three steps and can
be carried out with a simple script:
1. Run BLAST with insensitive parameters.
2. Build a BLAST database from the matches.
3. Run BLAST with sensitive parameters on just the matches.
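The bookkeeping between steps 1 and 3 can be sketched in Perl as follows. This is only an illustration under stated assumptions: the insensitive search is assumed to have produced tab-delimited output with the subject identifier in the second column, and the database is assumed to be available as FASTA text whose identifiers are the first word after the > character. The function names hit_ids and select_fasta are hypothetical, the sample data is invented, and the two BLAST runs and the database formatting step are left as comments because their exact command lines depend on your BLAST installation.

```perl
use strict;
use warnings;

# Step 1 (not shown): run BLAST with insensitive parameters and capture
# tab-delimited output.

# Collect the unique subject identifiers from tab-delimited hit lines.
# Assumption: the subject identifier is the second column.
sub hit_ids {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        chomp(my $l = $line);
        next if $l =~ /^#/ or $l !~ /\S/;      # skip comments and blank lines
        my (undef, $subject) = split /\t/, $l;
        $seen{$subject} = 1 if defined $subject;
    }
    return sort keys %seen;
}

# Step 2: keep only the FASTA records whose identifier is in the hit list.
# The returned text would then be formatted as a BLAST database.
sub select_fasta {
    my ($fasta, @ids) = @_;
    my %want = map { $_ => 1 } @ids;
    my ($out, $keep) = ('', 0);
    for my $line (split /^/, $fasta) {
        if ($line =~ /^>(\S+)/) {
            $keep = $want{$1} ? 1 : 0;         # new record: keep it or skip it
        }
        $out .= $line if $keep;
    }
    return $out;
}

# Invented sample data standing in for real search output and a real database.
my @hits  = ("query1\tseqB\t91.2\n", "query1\tseqB\t88.0\n", "query1\tseqD\t85.5\n");
my $fasta = ">seqA first\nACGT\n>seqB second\nGGCC\nTTAA\n>seqC\nAAAA\n>seqD\nCCCC\n";

print select_fasta($fasta, hit_ids(@hits));

# Step 3 (not shown): run BLAST with sensitive parameters against the
# small database built from the selected records.
```

Because the second search sees only the sequences that matched in the first, the sensitive parameters cost far less than they would against the full database.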