BLAST Databases

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

This is the Title of the Book, eMatter Edition

BLAST Databases

193

BLAST Databases

The mechanics of creating BLAST databases is quite simple; just run formatdb or

xdformat with the proper syntax. Chapter 10 discussed this topic, and you’ll find the

command summaries in Chapters 13 and 14. There are, however, subtleties that

make this process more complicated than it may appear.

Large Databases

One of the most common database complications occurs with large files. Most com-

puters today use 32-bit operating systems and 32-bit filesystems. This puts a physi-

cal limit of 4 GB on the amount of RAM and 4 GB on the size of any particular file.

(You may find that you are actually limited to less than 4 GB in both cases, and a 2-

GB limit is quite common.) Most computers these days don’t have or need 4-GB

RAM. However, most hard disks are quite a bit larger than 4 GB, and files can some-

times exceed these limits. Therefore many operating systems have the option of using

64-bit filesystems. Unfortunately you can’t just change the filesystem and expect

everything to work. Making software applications aware of large files often means

recompiling them with special flags, and the process of migrating to a 64-bit filesys-

tem can be painful because the applications don’t tell you useful things like “I’m not

large-file-aware.” Instead, they just sit there quietly burning CPU time while they run

in endless loops.

Large NCBI databases

The standard protocol for formatting a database is to run formatdb on a FASTA

database:

formatdb -p F -i fasta_db -o

NCBI-BLAST databases are physically limited to 4 GB of sequence, which corre-

sponds to about 4 billion amino acids or 16 billion nucleotides (nucleotides are com-

pressed 4:1). On a 32-bit filesystem, the previous approach won’t let you use all this

space because the FASTA file can’t contain more than 2 or 4 billion letters. Creating

a database larger than 2 or 4 billion letters requires piping sequence to formatdb.

cat fasta1 fasta2 fasta3 | formatdb -p F -i stdin -n my_db -o

But what if you happen to have more than 16 billion letters? This isn’t a problem

because formatdb automatically segments individual BLAST databases to files con-

taining 16 billion nucleotides and creates something called an alias database that

stitches them all together. This is really convenient because it means that you can

search enormous databases even on 32-bit filesystems. Alias databases are discussed

in more detail later in this chapter.

It’s still possible to run into file size issues by piping FASTA files to formatdb because

the filesystem maximum may be 2 GB and the implicit BLAST maximum is 4 GB.

This is the Title of the Book, eMatter Edition

194

Chapter 11: BLAST Databases

Fortunately, formatdb lets you set the size of each database volume with the -v

parameter. The following example sets this size this to 2 billion and includes a bit

more realism by piping the FASTA files from a compressed format.

zcat file*.gz | formatdb -i stdin -p F -o -n my_db -v 2000000000

A word of caution: be precise. If you accidentally leave off one of the zeroes, you can

create 10 times as many files. For large databases, this can be a problem because the

maximum number of volumes is 100.

Large WU-BLAST databases

WU-BLAST doesn’t use alias databases, so the only way to create a database larger

than 4 GB is on a 64-bit filesystem. If you’re accessing or distributing databases over

a network, the network must also be 64-bit aware. To index a large number of com-

pressed files, use a command such as the following:

zcat file* | xdformat -n -I -o ESTs -- -

In the typical Unix command line syntax, the double-dash indicates the end of the

command line options, and the single-dash denotes standard input rather than a file.

If you’re stuck with a 32-bit filesystem and need to search large BLAST databases,

you can use virtual databases, which are explained next. If you use the free version of

WU-BLAST, there is no large file support and no virtual database mechanism. Your

best solution is to create several databases within the limits of your filesystem, search

each independently and then merge the results.

Virtual Databases

Virtual databases let you combine multiple databases and use them as if they were

one. It’s as simple as including the various databases in quotes on the command line.

Here’s how it looks for NCBI-BLAST:

blastall -p blastp -d "db1 db2 db3" -i query

And here is the equivalent WU-BLAST command line:

blastp "db1 db2 db3" query

Virtual databases are useful for grouping related searches. Let’s say you want to

search individually against an EST and an mRNA database, as well as a transcripts

database that combines the two. You can create each database individually, but there

is some duplication in data. Alternatively, you can just create the EST and mRNA

databases and use the virtual database

EST mRNA for transcripts. Virtual databases

behave just like normal databases. You can even retrieve sequences from virtual data-

bases with fastacmd (this feature isn’t yet available for xdget, but see the end of this

chapter for a workaround).

One thing you probably don’t want to do is to combine databases with redundant

sequences. For example, you wouldn’t want to group the NCBI nr database with

This is the Title of the Book, eMatter Edition

BLAST Databases

195

SWISSPROT because nr already includes SWISSPROT. Duplicated sequences

decrease the statistical significance of matches and can be confusing in the output.

Alias Databases

Alias databases are a unique and powerful feature of NCBI-BLAST. You’ve already

seen that formatdb creates alias databases when splitting large files, but alias data-

bases also have other uses. Alias databases can be used as static virtual databases for

any combination of databases; all you have to do is create a file with the proper name

and syntax. Here’s a simple alias file, transcripts.nal, that combines the previous

ESTs and mRNAs example to create a transcripts databases:

TITLE transcripts

DBLIST ESTs mRNAs

The TITLE is the name of the database and the DBLIST is simply a list of the databases

to merge. Using alias databases you can, for example, organize sequences by organ-

ism. All you have to do is create the individual databases and combine them in vari-

ous ways with alias files to create more comprehensive sets.

Not only can you join databases, but you can also use alias files to restrict searches to

particular sequences from a database. Let’s say you create a comprehensive EST

database and then want to create a human-only EST database. The alias file for such

a database looks something like this:

TITLE humanESTs

DBLIST ESTs

GILIST human.gi

The GILIST specifies a list of files that contains the GI numbers of the sequences to

search. There are a few complexities when working with GI lists. First, using a GI list

assumes that your sequences have GI numbers. If your original FASTA identifiers

don’t include GI numbers, you can’t use this feature. It is unfortunate that the file of

GI numbers isn’t a file of accession numbers, but that’s the way it is. If you want to

use GI lists for sequences without GI numbers, you have to add fake GI numbers to

your identifiers. The following script counts backward from the maximum possible

GI number and thus minimizes the potential conflict with real GI numbers:

#!/usr/bin/perl

$i = 2147483648;

while (<>) {

if (/^>/) {

$i--;

print ">gi|$i (fake-gi) ", substr($_, 1);

}

else {

print;

}

This is the Title of the Book, eMatter Edition

196

Chapter 11: BLAST Databases

To use this script, fake-gi.pl, simply place it upstream of formatdb in your pipe:

zcat file*.gz | fake-gi.pl | formatdb -i stdin -p F -o -n my_db -v 2000000000

Creating an alias database restricted to GI numbers is a bit more complicated than

just merging databases. First you need to get a list of GI numbers for your sequences

of interest. There are a number of ways you can do this, but the easiest is to use

NCBI Entrez. If you have a GI list called human.gi.list, the next step is to convert it to

binary form using formatdb (this step isn’t actually required, but it does improve per-

formance).

formatdb -F human.gi.list -B human.gi

Finally, create the alias file. You can do this yourself, but you might as well let

formatdb do it for you because it also adds the number of sequences and their length

to the information in the alias file.

formatdb -p F -i ESTs -L humanESTs -F human.gi

-L identifies the name of the alias database to create, and -F is the name of the binary

file of GI numbers.

As you can see, alias databases are a powerful way to join and split databases. As

BLAST development continues, you should expect to see more structure in BLAST

databases and greater control of sequences subsets.

Removing Redundancy

One way to improve the efficiency of your BLAST searches is to remove redundant

sequences. Consider what happens when you search a redundant database. The sta-

tistical significance of a database hit depends on the size of the database. Each redun-

dant sequence artificially increases the size of the database and therefore reduces the

statistical significance of any hit. In addition, redundancy makes the search slower.

The nrdb program that comes with the licensed version of WU-BLAST concatenates

the definition lines of all identical sequences. This program isn’t available with the

free version, and the NCBI distribution doesn’t include such a program in the BLAST

distribution. WU-BLAST also includes patdb, which is a bit more aggressive because

it concatenates identical subsequences. Both programs are very efficient. You can

find examples of their use in Chapter 10. You can also write your own database

purifier using Bioperl tools. (Some entertaining discussions about the “best” way to

do this may be found in the Bioperl mailing list archives at http://bioperl.org). Here’s

one way:

#!/usr/bin/perl

use Bio::SeqIO;

my %NR;

my $file = Bio::SeqIO->new(-fh => *ARGV);

while (my $fasta = $file->next_seq){

This is the Title of the Book, eMatter Edition

BLAST Databases

197

my $def = $fasta->id ." " .$fasta->desc;

$NR{$fasta->seq}{$def} = 1;

}

for my $seq(keys %NR){

print ">",join(chr(1),keys

%{$NR{$seq}})," ",$seq," ";

}

It’s more common to collapse redundant proteins than redundant nucleotide

sequences. One reason is because nucleotide sequences are rarely identical to one

another. Another is that nucleotide sequences can be very large, and the procedure

becomes impractical with off-the-shelf hardware. Most importantly, it is better to

assemble genomic fragments into chromosomes and ESTs into full-length tran-

scripts. Because these tasks are complicated and compute-intensive, these feats of

bioinformagic are best left to the experts.

Standard BLAST Databases

Every BLAST search is an experiment and should be planned as such. Just as you

wouldn’t want to use the same query for every BLAST search, you wouldn’t want to

use the same database for every BLAST search. However, a few databases are used so

frequently that they have become standards. All databases described in this section

are available from the NCBI at ftp://ftp.ncbi.nih.gov/blast/db/. The most important is

the nonredundant protein database, nr. This database combines all translations from

GenBank records (including RefSeq) with proteins from the SWISS-PROT, PIR, and

PDB databases. If you want to do a comprehensive search against all known pro-

teins, this is the database to use. Not all of the protein sequences have been verified

experimentally, so you should expect some errors.

There are several essential nucleotide databases. The ecoli and vector databases may

sound like uninteresting databases, but they’re actually quite important. The proce-

dures used to sequence DNA require various molecular biology techniques that rely

on the E. coli bacterium and various vector sequences for carrying DNA. Because of

this, many common sources of data contamination are from E. coli and vector

sequences. Screening nucleotide sequences against these databases is a good way to

detect these pollutants. Another database that is useful for detecting contaminants in

genomic DNA is the mito database of mitochondrial sequences. The est database is

also one of the most popular ones. It contains all the expressed sequence tags from

DDBJ/EMBL/GenBank. Many undiscovered proteins lurk in the est database.

The NCBI FTP site includes several other databases. For those interested in the busi-

ness side of bioinformatics, the pataa and patnt databases contain patented amino

acid and nucleotide sequences. The sts database contains sequence tagged sites,

which are mostly PCR amplimers that uniquely identify a region of a genome and are

used in genome mapping. The gss sequences correspond to genome survey

sequences. These are random-ish sequences from various organisms. Some

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for BLAST Databases

Create new playlist

Sign In

Sign Up

Table of Contents for
BLAST Databases