Software Tricks

This is the Title of the Book, eMatter Edition

Chapter 12: Hardware and Software Optimizations

Software Tricks

In addition to choosing appropriate BLAST parameters and optimizing your hardware setup, you can use a few software tricks to increase your BLAST performance. Most of these tricks involve splitting or concatenating sequences into optimal-sized pieces because very large and very small sequences are inefficiently processed by BLAST.

Multiplexing/Query Packing

Input and output (I/O) can become a large fraction of the overall run time when the search parameters are insensitive, such as when running BLASTN. If you find yourself running a lot of BLASTN searches, you can pack multiple queries together and reduce the overhead of reading the database repeatedly. For example, let’s say you have a collection of 100,000 ESTs from your favorite organism and you want to search them against all other ESTs in the public database. If you search them one at a time, you will perform 100,000 BLAST searches and therefore have to read the database 100,000 times. It should go without saying that caching is essential in such a task.

Table 12-3. DRM software (continued)

Product: PBS
Description (as advertised): The Portable Batch System (PBS) is a flexible batch queuing and workload management system originally developed by Veridian Systems for NASA. It operates on networked, multiplatform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems. Development of PBS is provided by the PBS Products Department of Veridian Systems.
http://www.openpbs.org

Product: ProPBS
Description (as advertised): The PBS Pro Version 5.2 workload management solution is the professional version of the Portable Batch System. Built on the success of OpenPBS, PBS Pro goes well beyond it with the features and support you expect in a mission-critical commercial product, such as:
• Shrink-wrapped, easy-to-install binary distributions
• Support on every major version of Unix and Linux
• Enhanced fault tolerance and scalability
• Enhanced scheduling algorithms
• Computational grid support
• Direct support from the team that created PBS
• New, rewritten documentation
• Source code availability
http://www.propbs.com

Product: SGE
Description (as advertised): The Grid Engine project is an open source community effort to facilitate the adoption of distributed computing solutions. Sponsored by Sun Microsystems and hosted by CollabNet, the Grid Engine project provides enabling distributed resource management software for wide-ranging requirements from compute farms to grid computing.
http://gridengine.sunsource.net

But what if you glue the sequences together in groups of 100? Well, you’ve just cut your database I/O down to 1 percent of what it used to be (1,000 passes over the database instead of 100,000), which can be a significant savings. For ESTs and other sequences of this length, the speedup is typically tenfold. This technique is called multiplexing or query packing. It isn’t as simple as it sounds because there must be a way to prevent alignments from bridging the sequences, the coordinates must be remapped, and the statistics need to be recalculated. MegaBLAST, part of the NCBI-BLAST distribution, is a specialized version of BLASTN that multiplexes queries and includes a variety of other optimizations. It’s really fast, and anyone doing a lot of BLASTN searches should use this program. You can find more information about MegaBLAST in Chapters 9 and 13. Query packing can also be accomplished with a single, sophisticated Perl script (see MPBLAST at http://blast.wustl.edu).
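To make the bookkeeping concrete, here is a minimal sketch of the packing and coordinate-remapping steps. It is not how MegaBLAST or MPBLAST work internally. The spacer of 50 Ns, the subroutine names pack_queries and remap, and the toy sequences are all assumptions made for illustration, and the sketch does nothing about recalculating statistics.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical spacer: a run of Ns placed between queries so that an
# alignment is unlikely to extend from one query into the next.
my $SPACER = 'N' x 50;

# Concatenate named sequences and record where each one starts and ends
# (1-based, inclusive) in the packed sequence.
sub pack_queries {
    my @records = @_;    # list of [name, sequence]
    my ($packed, @index) = ('');
    for my $rec (@records) {
        my ($name, $seq) = @$rec;
        $packed .= $SPACER if length($packed);
        my $start = length($packed) + 1;
        $packed .= $seq;
        push @index, {name => $name, start => $start, end => length($packed)};
    }
    return ($packed, \@index);
}

# Map a 1-based coordinate in the packed sequence back to the original
# query. Returns an empty list if the coordinate falls inside a spacer.
sub remap {
    my ($index, $pos) = @_;
    for my $entry (@$index) {
        next if $pos < $entry->{start} or $pos > $entry->{end};
        return ($entry->{name}, $pos - $entry->{start} + 1);
    }
    return;
}

# Toy example: two short "ESTs" packed into one query
my ($packed, $index) = pack_queries(['est1', 'ACGTACGT'], ['est2', 'GGCCTTAA']);
my ($name, $local) = remap($index, 60);
print "$name:$local\n";    # position 60 of the packed query is est2:2
```

An alignment whose two ends remap to different names, or to a spacer, has bridged two queries and must be discarded or trimmed.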

Query Chopping

Larger sequences require more memory to search and align. This can blow away your cached database, or worse, cause the computer to start swapping (using the disk for RAM). In addition, for a variety of reasons, larger query sequences are processed less efficiently. One way to solve this problem is to divide the query sequence into several segments, search them independently, and then merge the results back together. This is called query chopping and is effectively the opposite of query packing. The main difficulty with query chopping is dealing with alignments that cross the boundaries between segments.
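The segment arithmetic can be sketched as follows, assuming each segment is cut out and searched as a standalone sequence, so that reported coordinates are relative to the segment. The subroutine names segments and to_global are hypothetical, and the sketch doesn't attempt to stitch together alignments that cross a boundary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return [start, length] pairs (1-based) that tile a sequence of
# $total letters in windows of at most $size letters.
sub segments {
    my ($total, $size) = @_;
    my @seg;
    for (my $start = 1; $start <= $total; $start += $size) {
        my $len = $total - $start + 1;
        $len = $size if $len > $size;
        push @seg, [$start, $len];
    }
    return @seg;
}

# Convert a coordinate reported relative to a segment into a coordinate
# on the full-length query.
sub to_global {
    my ($seg_start, $local) = @_;
    return $seg_start + $local - 1;
}

# A 250,000-letter query in 100,000-letter segments
for my $seg (segments(250_000, 100_000)) {
    printf "start=%d length=%d\n", @$seg;
}
```

The last segment is simply shorter than the others; a hit at position 1 of the segment starting at 100,001 maps back to position 100,001 of the full query.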

Both NCBI-BLAST and WU-BLAST let you specify that only a subsequence of a large query sequence is to be searched (see the -L parameter in Chapter 13 and the nwstart and nwlen parameters in Chapter 14). Currently, this works a little better for WU-BLAST because alignments seeded in a restricted region can extend outside this region, so there’s no need to stitch together the alignments between neighboring segments. The following Perl script searches chromosome-sized sequences in 100-KB segments using WU-BLAST. All coordinates and statistics are identical to a search with an entire chromosome. Note that complexity filters are currently applied to the whole sequence, so apply these filters ahead of time.

#!/usr/bin/perl -w
use strict;

# Arguments: <blast program> <database> <query FASTA> [WU-BLAST parameters]
die "usage: $0 <wu-blast command line>\n" unless @ARGV >= 3;
my ($BLAST, $DB, $Q, @P) = @ARGV;
die "ERROR ($0): single FASTA files only\n" if `grep -c ">" $Q` > 1;
my $params = "@P";
die "ERROR ($0): filter ahead of time\n" if $params =~ /filter|wordmask/;

# Count the letters in the query, skipping the definition line
open(FASTA, $Q) or die "ERROR ($0): can't open $Q\n";
my $def = <FASTA>;
my $count = 0;
while (<FASTA>) {chomp; $count += length($_)}
close FASTA;

# Search one segment at a time, passing any extra parameters through
my $segment = 100000;
for (my $i = 1; $i <= $count; $i += $segment) {
    system("$BLAST $DB $Q nwstart=$i nwlen=$segment $params");
}

Database Splitting

If you have a computer cluster and a lot of individual BLAST jobs to run, you can easily split the jobs among the nodes of your cluster. But what if you have a single, slow BLAST job that you want to spread out over several computers? If your sequence is very large, you can use query chopping as described earlier and assign each computer a separate segment. But what if your sequence isn’t so large? A good solution is to have each computer search only part of the database. You’ll need to do a little statistical manipulation to set the effective search space to the entire database, as well as some post-processing to merge all the reports together, but overall the process is pretty simple. The hard part is making sure the database is properly segmented on the various computers.
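If the slices were searched without adjusting the search space, one rough way to repair the statistics afterward is to scale each E-value by the ratio of the whole database to the slice. The sketch below assumes the E-value is proportional to database length (as in E = Kmn·exp(-λS)) and ignores edge-effect corrections, so treat the results as approximate. The subroutine names, the data layout, and the numbers in the example are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scale an E-value computed against one slice up to the whole database.
# Assumes E is proportional to database length; edge effects are ignored.
sub rescale_evalue {
    my ($e, $slice_letters, $total_letters) = @_;
    return $e * $total_letters / $slice_letters;
}

# Merge per-slice hit lists into one list sorted by rescaled E-value.
# Each slice is {letters => N, hits => [[name, evalue], ...]}.
sub merge_slices {
    my @slices = @_;
    my $total = 0;
    $total += $_->{letters} for @slices;
    my @merged;
    for my $slice (@slices) {
        for my $hit (@{$slice->{hits}}) {
            my ($name, $e) = @$hit;
            push @merged, [$name, rescale_evalue($e, $slice->{letters}, $total)];
        }
    }
    return sort {$a->[1] <=> $b->[1]} @merged;
}

# Illustrative values only: two slices of different sizes
my @hits = merge_slices(
    {letters => 1_000_000, hits => [['seqA', 1e-10], ['seqB', 0.5]]},
    {letters => 3_000_000, hits => [['seqC', 3e-6]]},
);
printf "%s\t%.2g\n", @$_ for @hits;
```

Setting the effective search space at search time, as described above, is cleaner because the E-value cutoffs are then applied consistently on every node.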

If you’re using NCBI-BLAST, you can create database slices using alias databases as described previously. This allows a great deal more flexibility than physically splitting the databases into various parts. But remember that alias databases require that you use GI numbers in the FASTA identifier.

If you’re using WU-BLAST, you can split the database dynamically. WU-BLAST has command-line parameters called dbrecmin and dbrecmax that describe the minimum and maximum database records. You can assign each node of the cluster a different subsection of the database by simply assigning dbrecmin and dbrecmax. For example, if your database contains 100 records and you have 10 nodes, node 1 gets records 1 to 10, node 2 gets records 11 to 20, etc. To benefit from caching, each node should be assigned the same database slice from one search to the next.
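The record ranges are easy to compute. Here is a minimal sketch that divides the records as evenly as possible when the count isn't a multiple of the number of nodes. The subroutine name slices is hypothetical, and the sketch assumes there are at least as many records as nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Divide $records database records as evenly as possible among $nodes
# nodes. Returns one [first, last] pair (1-based, inclusive) per node.
sub slices {
    my ($records, $nodes) = @_;
    my $base  = int($records / $nodes);
    my $extra = $records % $nodes;    # the first $extra nodes get one more
    my ($first, @slice) = (1);
    for my $n (1 .. $nodes) {
        my $size = $base + ($n <= $extra ? 1 : 0);
        push @slice, [$first, $first + $size - 1];
        $first += $size;
    }
    return @slice;
}

# 100 records on 10 nodes: node 1 gets 1 to 10, node 2 gets 11 to 20, ...
my $node = 1;
for my $s (slices(100, 10)) {
    printf "node %d: dbrecmin=%d dbrecmax=%d\n", $node++, @$s;
}
```

Because the ranges depend only on the record count and the number of nodes, every run hands a given node the same slice, which keeps that slice in the node's cache.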

Serial BLAST Searching

As discussed in Chapter 5, the best way to speed up BLAST searches is by making the seeding more stringent. The only problem is that low-scoring alignments may be lost. High-scoring alignments, however, are relatively resistant to changes in seeding parameters. The serial strategy takes advantage of this property; it uses an insensitive search to identify database matches and then a sensitive search to generate the alignments. An intuitive way to think about this with genomic sequence is “if I can hit just one exon, I can get the whole gene.” The procedure has three steps and can be carried out with a simple script:

1. Run BLAST with insensitive parameters.

2. Build a BLAST database from the matches.

3. Run BLAST with sensitive parameters on just the matches.

NCBI-BLAST doesn’t currently offer a wide range of word sizes, so serial searching is best carried out with WU-BLAST. Example 12-1 shows a script that wraps up the entire procedure.

To demonstrate the performance of the serial strategy, the script in Example 12-1 performs a search of a Caenorhabditis briggsae genomic fragment (c009500587.Contig4) against all C. elegans proteins (wormpep97). To minimize the effect of chance similarities, only alignments with at least 30 amino acids and 35 percent identity are analyzed. The search parameters, search speed, and number of HSPs found are displayed in Table 12-4. The first two rows correspond to standard, nonserial searches. Using the parameters recommended in Chapter 9 (row 2), BLASTX runs seven times faster than the very sensitive WU-BLAST default parameters (row 1). This speed is paid for by a loss in sensitivity (number of HSPs). The serial searches (rows 3 and above) offer varying levels of speed and sensitivity. Only a few combinations of W and T are presented; there are many useful combinations. Of particular interest is row 4, which has approximately the same sensitivity as row 1, but runs 18 times faster. Not bad for a short script. Because BLAST is under active development, perhaps you’ll see serial searching become a standard part of BLAST software.

Example 12-1. A script for serial BLAST searching

#!/usr/bin/perl -w
use strict;

die "usage: $0 <database> <query> <wordsize> <hitdist>\n" unless @ARGV == 4;
my ($DB, $Q, $W, $H) = @ARGV;
$H = $H ? "hitdist=$H" : "";    # a hitdist of 0 turns the option off
my $tmpdir = "/tmp/tt-blastx.tmpdir";
END {system("rm -rf $tmpdir") if defined $tmpdir}
system("mkdir $tmpdir") == 0 or die "ERROR ($0): can't create $tmpdir\n";
my $STD = "B=100000 V=100000 wordmask=seg";

# search
system("blastx $DB $Q W=$W T=999 $H $STD > $tmpdir/search") == 0 or die;

# collect names (the first word of each ">" line in the report)
open(NAME, ">$tmpdir/names") or die;
open(SEARCH, "$tmpdir/search") or die;
while (<SEARCH>) {print NAME "$1\n" if /^>(\S+)/}
close SEARCH;
close NAME;

# build second stage database
system("xdget -p -f $DB $tmpdir/names > $tmpdir/database") == 0 or die;
system("xdformat -p $tmpdir/database") == 0 or die;

# align
system("blastx $tmpdir/database $Q $STD") == 0 or die;
