This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
    my $def = $fasta->id . " " . $fasta->desc;
    $NR{$fasta->seq}{$def} = 1;
}
for my $seq (keys %NR) {
    print ">", join(chr(1), keys %{$NR{$seq}}), "\n", $seq, "\n";
}
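The fragment above depends on BioPerl and begins partway through its read loop. As a rough, self-contained illustration of the same idea, the following sketch collapses identical sequences using only core Perl. The subroutine names collapse_fasta and format_nr and the sample records are hypothetical, and the keys are sorted only to make the output reproducible.

```perl
use strict;
use warnings;

# Collapse identical sequences from FASTA text.
# Returns a hash reference: sequence => { definition line => 1 }.
sub collapse_fasta {
    my ($text) = @_;
    my %nr;
    my ($def, $seq);
    for my $line (split /\n/, $text) {
        if ($line =~ /^>(.*)/) {
            # A new record starts: store the previous one, if any.
            $nr{$seq}{$def} = 1 if defined $def;
            ($def, $seq) = ($1, '');
        }
        elsif (defined $def) {
            $line =~ s/\s+//g;    # sequence may span several lines
            $seq .= $line;
        }
    }
    $nr{$seq}{$def} = 1 if defined $def;    # store the last record
    return \%nr;
}

# Format collapsed records as FASTA, joining definitions with Ctrl-A.
sub format_nr {
    my ($nr) = @_;
    my $out = '';
    for my $seq (sort keys %$nr) {
        $out .= '>' . join(chr(1), sort keys %{ $nr->{$seq} }) . "\n$seq\n";
    }
    return $out;
}

# Hypothetical input: two identical proteins and one distinct protein.
my $fasta = <<'END';
>P1 kinase [Homo sapiens]
MKVLA
AGT
>P2 kinase [Pan troglodytes]
MKVLAAGT
>P3 unrelated protein
MSSQ
END

print format_nr(collapse_fasta($fasta));
```

The first two records share one sequence, so they come out as a single entry whose definition line holds both descriptions separated by chr(1).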
It’s more common to collapse redundant proteins than redundant nucleotide
sequences. One reason is that nucleotide sequences are rarely identical to one
another. Another is that nucleotide sequences can be very large, and the procedure
becomes impractical on off-the-shelf hardware. Most importantly, it is better to
assemble genomic fragments into chromosomes and ESTs into full-length transcripts.
Because these tasks are complicated and compute-intensive, these feats of
bioinformagic are best left to the experts.
Standard BLAST Databases
Every BLAST search is an experiment and should be planned as such. Just as you
wouldn’t want to use the same query for every BLAST search, you wouldn’t want to
use the same database for every BLAST search. However, a few databases are used so
frequently that they have become standards. All databases described in this section
are available from the NCBI at ftp://ftp.ncbi.nih.gov/blast/db/. The most important is
the nonredundant protein database, nr. This database combines all translations from
GenBank records (including RefSeq) with proteins from the SWISS-PROT, PIR, and
PDB databases. If you want to do a comprehensive search against all known proteins,
this is the database to use. Not all of the protein sequences have been verified
experimentally, so you should expect some errors.
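Because nr merges identical proteins from several sources, a single record can carry several descriptions. The sketch below shows one way to take such a definition line apart again, assuming the descriptions are joined with the same chr(1) (Ctrl-A) separator used by the collapsing script earlier in this chapter. The subroutine name split_defline and the identifiers in the sample line are hypothetical.

```perl
use strict;
use warnings;

# Split one merged definition line into its individual entries.
# Each entry is returned as an array reference: [identifier, description].
sub split_defline {
    my ($line) = @_;
    $line =~ s/^>//;    # drop the leading FASTA marker, if present
    return map {
        my ($id, $desc) = split /\s+/, $_, 2;
        [ $id, defined $desc ? $desc : '' ];
    } split /\x01/, $line;
}

# Hypothetical merged definition line with two entries for one sequence.
my $defline = ">sp|P00001|EXMP_HUMAN example protein" . chr(1)
            . "pdb|1ABC|A example protein, chain A";

for my $entry (split_defline($defline)) {
    printf "%-22s %s\n", @$entry;
}
```

Each entry keeps its own identifier, so a hit to the merged record can still be traced back to every source database that contributed it.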
There are several essential nucleotide databases. The ecoli and vector databases may
sound uninteresting, but they’re actually quite important. The procedures used to
sequence DNA rely on molecular biology techniques that use the E. coli bacterium and
various vector sequences for carrying DNA. Because of this, E. coli and vector
sequences are among the most common sources of data contamination. Screening
nucleotide sequences against these databases is a good way to detect these
pollutants. Another database that is useful for detecting contaminants in
genomic DNA is the mito database of mitochondrial sequences. The est database is
also one of the most popular. It contains all the expressed sequence tags from
DDBJ/EMBL/GenBank. Many undiscovered proteins lurk in the est database.
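As an illustration of contaminant screening, the following sketch reads the results of a search against a database such as vector and reports which parts of each query are covered by hits. It assumes the 12-column tab-delimited report (the -m 8 option of blastall, or -outfmt 6 in the newer BLAST+ programs), where columns 7 and 8 give the query start and end and column 11 gives the E-value. The subroutine name contaminated_regions, the E-value cutoff, and the sample hits are hypothetical.

```perl
use strict;
use warnings;

# Collect the query intervals covered by hits to a contaminant database.
# Arguments: maximum E-value, then lines of 12-column tabular BLAST output.
# Returns a hash reference: query id => list of merged [start, end] intervals.
sub contaminated_regions {
    my ($max_evalue, @lines) = @_;
    my %hits;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^#/ or $line !~ /\S/;    # skip comments and blanks
        my @f = split /\t/, $line;
        next unless @f >= 12;
        my ($query, $qstart, $qend, $evalue) = @f[0, 6, 7, 10];
        next if $evalue > $max_evalue;
        ($qstart, $qend) = ($qend, $qstart) if $qstart > $qend;
        push @{ $hits{$query} }, [ $qstart, $qend ];
    }

    # Merge overlapping or adjacent intervals for each query.
    my %regions;
    for my $query (keys %hits) {
        my @merged;
        for my $iv (sort { $a->[0] <=> $b->[0] } @{ $hits{$query} }) {
            if (@merged and $iv->[0] <= $merged[-1][1] + 1) {
                $merged[-1][1] = $iv->[1] if $iv->[1] > $merged[-1][1];
            }
            else {
                push @merged, [@$iv];
            }
        }
        $regions{$query} = \@merged;
    }
    return \%regions;
}

# Hypothetical hits: clone1 has two overlapping vector matches,
# clone2 has only a weak match that fails the E-value cutoff.
my @report = (
    join("\t", qw(clone1 vecA 99.00 50 0 0 1 50 200 249 1e-20 95.0)),
    join("\t", qw(clone1 vecB 98.00 40 1 0 41 80 10 49 1e-15 70.0)),
    join("\t", qw(clone2 vecA 85.00 20 3 0 300 319 5 24 2.0 22.0)),
);

my $regions = contaminated_regions(1e-5, @report);
for my $query (sort keys %$regions) {
    print "$query: ",
        join(", ", map { "$_->[0]-$_->[1]" } @{ $regions->{$query} }), "\n";
}
```

Merging the intervals gives one range per contaminated stretch, which is convenient if the next step is to mask or trim those bases before further analysis.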
The NCBI FTP site includes several other databases. For those interested in the
business side of bioinformatics, the pataa and patnt databases contain patented amino
acid and nucleotide sequences. The sts database contains sequence tagged sites,
which are mostly PCR amplimers that uniquely identify a region of a genome and are
used in genome mapping. The gss database contains genome survey
sequences. These are random-ish sequences from various organisms. Some