This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
BLAST Databases
The mechanics of creating BLAST databases are quite simple; just run formatdb or
xdformat with the proper syntax. Chapter 10 discussed this topic, and you’ll find the
command summaries in Chapters 13 and 14. There are, however, subtleties that
make this process more complicated than it may appear.
Large Databases
One of the most common database complications occurs with large files. Most
computers today use 32-bit operating systems and 32-bit filesystems. This puts a
physical limit of 4 GB on the amount of RAM and 4 GB on the size of any particular
file. (You may find that you are actually limited to less than 4 GB in both cases,
and a 2-GB limit is quite common.) Most computers these days don’t have or need
4 GB of RAM. However, most hard disks are quite a bit larger than 4 GB, and files
can sometimes exceed these limits. Therefore, many operating systems have the
option of using 64-bit filesystems. Unfortunately, you can’t just change the
filesystem and expect everything to work. Making software applications aware of
large files often means recompiling them with special flags, and the process of
migrating to a 64-bit filesystem can be painful because the applications don’t tell
you useful things like “I’m not large-file-aware.” Instead, they just sit there
quietly burning CPU time while they run in endless loops.
Large NCBI databases
The standard protocol for formatting a database is to run formatdb on a FASTA
database:
formatdb -p F -i fasta_db -o
NCBI-BLAST databases are physically limited to 4 GB of sequence, which corresponds
to about 4 billion amino acids or 16 billion nucleotides (nucleotides are
compressed 4:1). On a 32-bit filesystem, the previous approach won’t let you use all
this space because the FASTA file can’t contain more than 2 or 4 billion letters.
Creating a database larger than 2 or 4 billion letters requires piping sequence to
formatdb:
cat fasta1 fasta2 fasta3 | formatdb -p F -i stdin -n my_db -o
But what if you happen to have more than 16 billion letters? This isn’t a problem
because formatdb automatically segments individual BLAST databases to files
containing 16 billion nucleotides and creates something called an alias database
that stitches them all together. This is really convenient because it means that you
can search enormous databases even on 32-bit filesystems. Alias databases are
discussed in more detail later in this chapter.
It’s still possible to run into file size issues by piping FASTA files to formatdb because
the filesystem maximum may be 2 GB and the implicit BLAST maximum is 4 GB.
Chapter 11: BLAST Databases
Fortunately, formatdb lets you set the size of each database volume with the -v
parameter. The following example sets this size to 2 billion and includes a bit
more realism by piping the FASTA files from a compressed format:
zcat file*.gz | formatdb -i stdin -p F -o -n my_db -v 2000000000
A word of caution: be precise. If you accidentally leave off one of the zeroes, you can
create 10 times as many files. For large databases, this can be a problem because the
maximum number of volumes is 100.
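One way to guard against a dropped zero is to estimate the volume count before running formatdb. The following is a minimal sketch, not part of the BLAST distribution: the function names volumes_needed and check_volume_size are hypothetical, the total letter count is something you supply yourself, and the result is only an estimate because volumes break at sequence boundaries. It assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# volumes_needed TOTAL_LETTERS LETTERS_PER_VOLUME
# Prints the estimated number of volumes (ceiling division).
volumes_needed() {
    total=$1
    per_volume=$2
    echo $(( (total + per_volume - 1) / per_volume ))
}

# check_volume_size TOTAL_LETTERS LETTERS_PER_VOLUME
# Succeeds only if the estimate stays within the 100-volume maximum
# mentioned in the text.
check_volume_size() {
    n=$(volumes_needed "$1" "$2")
    if [ "$n" -gt 100 ]; then
        echo "too many volumes: $n (maximum is 100)" >&2
        return 1
    fi
    echo "$n volumes"
}
```

For example, check_volume_size 30000000000 2000000000 reports 15 volumes, whereas the same call with 200000000 (one zero missing) estimates 150 volumes and fails the check.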
Large WU-BLAST databases
WU-BLAST doesn’t use alias databases, so the only way to create a database larger
than 4 GB is on a 64-bit filesystem. If you’re accessing or distributing databases
over a network, the network must also be 64-bit aware. To index a large number of
compressed files, use a command such as the following:
zcat file* | xdformat -n -I -o ESTs -- -
In the typical Unix command line syntax, the double-dash indicates the end of the
command line options, and the single-dash denotes standard input rather than a file.
If you’re stuck with a 32-bit filesystem and need to search large BLAST databases,
you can use virtual databases, which are explained next. If you use the free version of
WU-BLAST, there is no large file support and no virtual database mechanism. Your
best solution is to create several databases within the limits of your filesystem, search
each independently and then merge the results.
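The first step of that workaround, cutting the sequence data into pieces that fit the filesystem, might be sketched as follows. This is an illustration under stated assumptions rather than a WU-BLAST feature: split_fasta and the output prefix are hypothetical names, the limit applies to the size of each FASTA piece (used here as a stand-in for the size of the formatted database), and a single record larger than the limit still ends up in a piece of its own.

```shell
#!/bin/sh
# split_fasta LIMIT PREFIX
# Reads FASTA on standard input and writes PREFIX.1, PREFIX.2, ...,
# breaking only between records so that no sequence is split.
split_fasta() {
    awk -v limit="$1" -v prefix="$2" '
        # Write the buffered record, starting a new piece first if this
        # record would push the current piece past the limit.
        function emit() {
            if (rec == "") return
            if (size > 0 && size + length(rec) > limit) {
                close(out)
                part++
                size = 0
            }
            out = prefix "." part
            printf "%s", rec > out
            size += length(rec)
            rec = ""
        }
        BEGIN { part = 1; size = 0; rec = "" }
        /^>/  { emit() }              # a header line closes the previous record
              { rec = rec $0 "\n" }   # buffer the current line
        END   { emit() }
    '
}
```

A call such as zcat file*.gz | split_fasta 2000000000 ESTs would produce ESTs.1, ESTs.2, and so on, each of which can then be formatted and searched separately. Because each record is buffered in memory before it is written, this approach suits collections of many modest sequences better than a few chromosome-sized ones.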
Virtual Databases
Virtual databases let you combine multiple databases and use them as if they were
one. It’s as simple as including the various databases in quotes on the command line.
Here’s how it looks for NCBI-BLAST:
blastall -p blastp -d "db1 db2 db3" -i query
And here is the equivalent WU-BLAST command line:
blastp "db1 db2 db3" query
Virtual databases are useful for grouping related searches. Let’s say you want to
search individually against an EST and an mRNA database, as well as a transcripts
database that combines the two. You can create each database individually, but there
is some duplication in data. Alternatively, you can just create the EST and mRNA
databases and use the virtual database "EST mRNA" for transcripts. Virtual
databases behave just like normal databases. You can even retrieve sequences from
virtual databases with fastacmd (this feature isn’t yet available for xdget, but see
the end of this chapter for a workaround).
One thing you probably don’t want to do is to combine databases with redundant
sequences. For example, you wouldn’t want to group the NCBI nr database with
SWISSPROT because nr already includes SWISSPROT. Duplicated sequences
decrease the statistical significance of matches and can be confusing in the output.
Alias Databases
Alias databases are a unique and powerful feature of NCBI-BLAST. You’ve already
seen that formatdb creates alias databases when splitting large files, but alias
databases also have other uses. Alias databases can be used as static virtual
databases for any combination of databases; all you have to do is create a file with
the proper name and syntax. Here’s a simple alias file, transcripts.nal, that
combines the previous ESTs and mRNAs example to create a transcripts database:
TITLE transcripts
DBLIST ESTs mRNAs
The TITLE is the name of the database and the DBLIST is simply a list of the databases
to merge. Using alias databases you can, for example, organize sequences by
organism. All you have to do is create the individual databases and combine them in
various ways with alias files to create more comprehensive sets.
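Since an alias file of this kind is just two lines of text, generating one for each combination is easy to script. The following is a minimal sketch that assumes nucleotide databases (hence the .nal extension used above) and writes into the current directory; write_alias is a hypothetical helper name, not a BLAST command.

```shell
#!/bin/sh
# write_alias NAME DB [DB...]
# Creates NAME.nal in the current directory with a TITLE line and a
# DBLIST line naming the databases to merge.
write_alias() {
    name=$1
    shift
    {
        printf 'TITLE %s\n' "$name"
        printf 'DBLIST %s\n' "$*"    # "$*" joins the names with spaces
    } > "$name.nal"
}
```

Running write_alias transcripts ESTs mRNAs reproduces the transcripts.nal file shown earlier.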
Not only can you join databases, but you can also use alias files to restrict searches to
particular sequences from a database. Let’s say you create a comprehensive EST
database and then want to create a human-only EST database. The alias file for such
a database looks something like this:
TITLE humanESTs
DBLIST ESTs
GILIST human.gi
The GILIST specifies a list of files that contain the GI numbers of the sequences to
search. There are a few complexities when working with GI lists. First, using a GI list
assumes that your sequences have GI numbers. If your original FASTA identifiers
don’t include GI numbers, you can’t use this feature. It is unfortunate that the file of
GI numbers isn’t a file of accession numbers, but that’s the way it is. If you want to
use GI lists for sequences without GI numbers, you have to add fake GI numbers to
your identifiers. The following script counts backward from the maximum possible
GI number and thus minimizes the potential conflict with real GI numbers:
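#!/usr/bin/perl
# fake-gi.pl: prepend a fake GI number to every FASTA definition line.
use strict;

# 2147483648 is 2**31, so the first fake GI issued is 2147483647 and
# later ones count down from there.
my $i = 2147483648;
while (<>) {
    if (/^>/) {
        $i--;
        # The original definition line follows the fake identifier.
        print ">gi|$i (fake-gi) ", substr($_, 1);
    }
    else {
        print;
    }
}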
To use this script, fake-gi.pl, simply place it upstream of formatdb in your pipe:
zcat file*.gz | fake-gi.pl | formatdb -i stdin -p F -o -n my_db -v 2000000000
Creating an alias database restricted to GI numbers is a bit more complicated than
just merging databases. First you need to get a list of GI numbers for your sequences
of interest. There are a number of ways you can do this, but the easiest is to use
NCBI Entrez. If you have a GI list called human.gi.list, the next step is to convert
it to binary form using formatdb (this step isn’t actually required, but it does
improve performance).
formatdb -F human.gi.list -B human.gi
Finally, create the alias file. You can do this yourself, but you might as well let
formatdb do it for you because it also adds the number of sequences and their length
to the information in the alias file.
formatdb -p F -i ESTs -L humanESTs -F human.gi
-L identifies the name of the alias database to create, and -F is the name of the binary
file of GI numbers.
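If Entrez isn’t convenient, a GI list can sometimes be pulled straight from the FASTA definition lines. The sketch below is an alternative to the Entrez route, not the method described above. It assumes definition lines that begin with >gi|NUMBER, that the organism name appears in the definition line (which is not always the case), and that the text GI list holds one GI number per line. The function name gi_list is hypothetical.

```shell
#!/bin/sh
# gi_list PATTERN
# Reads FASTA on standard input and prints the GI number of every
# definition line that starts with ">gi|" and matches PATTERN.
gi_list() {
    awk -v pat="$1" '
        /^>gi\|/ && $0 ~ pat {
            split($0, f, "|")            # f[1] is ">gi", f[2] holds the GI
            sub(/[^0-9].*$/, "", f[2])   # drop anything after the digits
            if (f[2] != "") print f[2]
        }
    '
}
```

For instance, zcat file*.gz | gi_list 'Homo sapiens' > human.gi.list would produce the text list used in the formatdb commands above. Matching on free text is crude, so it is worth inspecting the result before relying on it.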
As you can see, alias databases are a powerful way to join and split databases. As
BLAST development continues, you should expect to see more structure in BLAST
databases and greater control of sequence subsets.
Removing Redundancy
One way to improve the efficiency of your BLAST searches is to remove redundant
sequences. Consider what happens when you search a redundant database. The
statistical significance of a database hit depends on the size of the database. Each
redundant sequence artificially increases the size of the database and therefore
reduces the statistical significance of any hit. In addition, redundancy makes the
search slower.
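The size effect can be made concrete with the Karlin-Altschul expectation, given here as general background rather than something derived in this section. The redundancy factor r is a hypothetical symbol introduced only for illustration, and BLAST in practice works with effective rather than raw lengths, so treat this as an approximation.

```latex
E = K\,m\,n\,e^{-\lambda S}
\qquad\Longrightarrow\qquad
E' = K\,m\,(r\,n)\,e^{-\lambda S} = r\,E
```

Here m is the query length, n the database length, S the alignment score, and K and λ the statistical parameters of the scoring system. If redundancy inflates the database by a factor r, the same alignment with the same score receives an expectation r times larger; a database in which every sequence appears twice (r = 2) doubles every E-value.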
The nrdb program that comes with the licensed version of WU-BLAST concatenates
the definition lines of all identical sequences. This program isn’t available with the
free version, and the NCBI doesn’t include such a program in its BLAST
distribution. WU-BLAST also includes patdb, which is a bit more aggressive because
it concatenates identical subsequences. Both programs are very efficient. You can
find examples of their use in Chapter 10. You can also write your own database
purifier using Bioperl tools. (Some entertaining discussions about the “best” way to
do this may be found in the Bioperl mailing list archives at http://bioperl.org.)
Here’s one way:
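#!/usr/bin/perl
# Collapse identical sequences, joining their definition lines.
use strict;
use Bio::SeqIO;
my %NR;    # sequence => { definition line => 1 }
# Read FASTA from the files named on the command line, or from stdin.
my $file = Bio::SeqIO->new(-fh => \*ARGV, -format => 'fasta');
while (my $fasta = $file->next_seq) {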
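    # Key on the sequence so that identical sequences share one entry.
    my $def = $fasta->id . " " . $fasta->desc;
    $NR{$fasta->seq}{$def} = 1;
}
# Print each unique sequence once, with its definition lines joined by
# Ctrl-A (chr(1)).
for my $seq (keys %NR) {
    print ">", join(chr(1), keys %{$NR{$seq}}), "\n", $seq, "\n";
}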
It’s more common to collapse redundant proteins than redundant nucleotide
sequences. One reason is that nucleotide sequences are rarely identical to one
another. Another is that nucleotide sequences can be very large, and the procedure
becomes impractical with off-the-shelf hardware. Most importantly, it is better to
assemble genomic fragments into chromosomes and ESTs into full-length
transcripts. Because these tasks are complicated and compute-intensive, these feats
of bioinformagic are best left to the experts.
Standard BLAST Databases
Every BLAST search is an experiment and should be planned as such. Just as you
wouldn’t want to use the same query for every BLAST search, you wouldn’t want to
use the same database for every BLAST search. However, a few databases are used so
frequently that they have become standards. All databases described in this section
are available from the NCBI at ftp://ftp.ncbi.nih.gov/blast/db/. The most important is
the nonredundant protein database, nr. This database combines all translations from
GenBank records (including RefSeq) with proteins from the SWISS-PROT, PIR, and
PDB databases. If you want to do a comprehensive search against all known
proteins, this is the database to use. Not all of the protein sequences have been
verified experimentally, so you should expect some errors.
There are several essential nucleotide databases. The ecoli and vector databases may
sound like uninteresting databases, but they’re actually quite important. The
procedures used to sequence DNA require various molecular biology techniques that
rely on the E. coli bacterium and various vector sequences for carrying DNA. Because
of this, many common sources of data contamination are from E. coli and vector
sequences. Screening nucleotide sequences against these databases is a good way to
detect these pollutants. Another database that is useful for detecting contaminants in
genomic DNA is the mito database of mitochondrial sequences. The est database is
also one of the most popular ones. It contains all the expressed sequence tags from
DDBJ/EMBL/GenBank. Many undiscovered proteins lurk in the est database.
The NCBI FTP site includes several other databases. For those interested in the
business side of bioinformatics, the pataa and patnt databases contain patented amino
acid and nucleotide sequences. The sts database contains sequence tagged sites,
which are mostly PCR amplimers that uniquely identify a region of a genome and are
used in genome mapping. The gss sequences correspond to genome survey
sequences. These are random-ish sequences from various organisms. Some