This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 11: BLAST Databases
Sequence Database Management Strategies
There are many useful public sequence databases, and you may have access to some
private ones as well. Because this is a book about BLAST, we assume you want to use
these collections of sequences in BLAST searches. Some sequences may be used as
queries, and others in databases. How are you going to manage them all in a rational
way? Several possible strategies exist, and the correct one for you depends on your
needs and resources. To demonstrate some of the issues, let’s review a typical
sequence analysis scenario.
Suppose a colleague of yours has just found the gene that makes cats go crazy for cat-
nip. She wants to learn more about this gene and comes to you for help because you
are a BLAST expert. The first thing she wants to do is a BLAST search to find out
what vertebrate proteins are similar to this one. Where are you going to get such a
database of proteins? Once you perform the BLAST search, you find several interest-
ing similarities. Your colleague tells you that these are probably all part of a family of
proteins, and she would like to build a phylogenetic tree to determine their relation-
ships to one another. How are you going to get the individual sequences? Finally, she
decides she wants more information about the human sequences, and to do that, she
would like references to the scientific literature like the ones she would find in a
DDBJ/EMBL/GenBank report. How are you going to retrieve such information? You
could just refuse to help her because these aren’t really BLAST problems, but these
are the kinds of tasks many BLAST users must face. Let’s take a look at how they can
be solved.
Each of these questions has basically two solutions: the first is to use tools
available on the Internet; the second is to build the tools yourself. In general, it is
much easier to use the Internet, but for high speed or high-throughput operations
you’ll want a local solution. After you read this chapter, you may decide that you
want some services to be provided locally, while others are Internet-only operations.
This section begins with a brief review of databases.
Queries, Indexes, and Reports
The most common database operation is a query. One person may want to retrieve a
particular sequence. Another may want all human sequences. As you have seen,
sequence records have quite a bit of useful information, and a user may request non-
sequence information such as all the MEDLINE references for all sequences with the
word disease in the description.
The efficiency with which a query is executed depends a lot on how the database is
indexed. If there is no indexing, a query must operate on every record of the data-
base. So, for example, if you want to find all the coelacanth sequences, you would
have to look through millions of records to find the handful whose sequences origi-
nate from the coelacanth. Clearly, this isn’t going to be efficient, so databases usually
have indexes that, for example, keep lists of species and all the sequences for each
species.
The most straightforward kind of indexing occurs when there is a unique relation-
ship between a property and a sequence. This is called a one-to-one mapping, and an
example would be an accession number. A more complex indexing occurs when a
property points to many sequences. This is called a one-to-many mapping, and an
example is a species name that is shared by millions of records.
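The two kinds of mapping can be sketched with ordinary in-memory dictionaries. This is only an illustration in Python, not how any particular sequence database implements its indexes; the record list and the function name accessions_for are made up for the example.

```python
from collections import defaultdict

# Hypothetical records: (accession, species) pairs used only for illustration.
records = [
    ("A", "Homo sapiens"),
    ("B", "Homo sapiens"),
    ("AF287139", "Latimeria chalumnae"),
]

# One-to-one index: each accession identifies exactly one record.
by_accession = {acc: (acc, species) for acc, species in records}

# One-to-many index: each species name points to a list of accessions.
by_species = defaultdict(list)
for acc, species in records:
    by_species[species].append(acc)


def accessions_for(species):
    """Return every accession for a species without scanning all records."""
    return by_species.get(species, [])
```

With the index in place, finding the coelacanth sequences is a single dictionary lookup rather than a pass over every record.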
Once a query is executed, the data must be reported in some format. For sequences,
this is usually the FASTA format. For other kinds of data, there are other appropri-
ate formats, such as lists, tables, and graphs.
Local Database Considerations
Having a local sequence database has some real advantages. First, local databases are
faster and more reliable because they don’t rely on an Internet connection. If you’re
involved in high-throughput research, these reasons are sufficient. Another compel-
ling reason is that you can combine several databases, and even include your own
sequences that aren’t in the public databases. The downside to creating a local
sequence database is the amount of work it takes. Depending on the scale of the
operation, it can be a full-time job. Here are six important issues to address when
building a local sequence database:
Downloading
Each database you support must be downloaded from time to time to keep the
data current. For example, GenBank has five to six major releases each year, as
well as daily updates. Other databases have their own update schedule. Manag-
ing updates can be a chore if you download a lot of databases, so automating the
procedure is a good idea. In addition, you may want to take measures to ensure
that during updates, which can take some time, the database that’s presented to
users isn’t actually changing. This may require keeping a mirror of some data.
Notice that having a local database doesn’t mean you can completely insulate
yourself from the Internet.
Processing records
Each database you support must have a parser to read the various fields of each
record. This may be as simple as pulling out the accession number for a
sequence, or it may be much more complicated, such as when you record specific
keywords. You can build your own parsers, but it takes less time to use one
already created, such as a parser from the Bioperl project.
Storing data
Your database schema will determine how each record is stored and what kinds
of relationships exist between various pieces of data. Designing an appropriate
schema is a difficult problem because it takes people who understand the data
(biologists), the data models (software engineers), and the storage/backup of the
data (systems administrators).
Indexing
The efficiency of queries will largely depend on what data is indexed. You may
choose to index everything, but your indexes could grow much larger than your
data. So you may have to make compromises. This is another place where users
and engineers must interact to determine the appropriate solution.
Querying
Not all databases are queried in the same way. Relational databases usually
employ SQL as the query language, but many popular databases have their own
querying mechanisms. The details of how you interact with the database may
depend on what kind of database you use. Regardless of the underlying architec-
ture, you may decide to present a different interface to users, such as a form in a
web browser or a script/program interface that connects directly to the database.
Formatting
You’ll definitely want to create FASTA files, but what other report formats will
you want to support? The DDBJ/EMBL/GenBank flat file formats are some-
times used to exchange data, so this would be useful, as would tabular format
and some kind of HTML that looks good in browsers. For each output format,
you may need some specialized code to generate the report.
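The downloading concern in the list above, keeping the database that users see stable while an update is in progress, might be handled by writing each new release into its own directory and switching a "current" link only once the release is complete. The following Python sketch assumes a POSIX filesystem with symbolic links; the directory layout and the names publish_release, current, and current.tmp are hypothetical.

```python
import os


def publish_release(root, release_name, files):
    """Write a new release beside the live one, then switch to it atomically.

    root         -- directory holding every release plus the 'current' link
    release_name -- hypothetical label for the new release, such as 'r2'
    files        -- dict mapping filename -> bytes that were already downloaded
    """
    release_dir = os.path.join(root, release_name)
    os.makedirs(release_dir)
    for name, data in files.items():
        with open(os.path.join(release_dir, name), "wb") as handle:
            handle.write(data)

    # Create the new link under a temporary name, then rename it over
    # 'current'. On POSIX systems the rename is atomic, so users see either
    # the old release or the new one, never a partially written directory.
    tmp_link = os.path.join(root, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_name, tmp_link)
    os.replace(tmp_link, os.path.join(root, "current"))
    return release_dir
```

Searches and retrievals would then always open files through the current link, and older release directories can be deleted once nothing is using them.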
As you can see, building a local database isn’t trivial. But it doesn’t have to be a full-
time job if you only want a subset of the information. For example, if all you want is
to retrieve records by accession number, you don't need to invest more than a couple
of hours of work. The following section explores the common techniques for managing
sequence data.
Retrieving FASTA Files by Accession
The task of retrieving FASTA files by accession number is so common and has such
an easy solution that it should be a local resource. If you’re using NCBI-BLAST, the
fastacmd program retrieves sequences from BLAST databases singly or in batches. If
you’re using WU-BLAST, the xdget program does the same thing. To use these fea-
tures, you must index the databases when you format them, which is as simple as
including the -o option for formatdb or the -I option for xdformat (see the
command-line tutorial in Chapter 10, the reference sections for formatdb and
fastacmd in Chapter 13, and xdformat and xdget in
Chapter 14). One limitation of this approach is that the sequences are stored in a
case-insensitive format in the database. If you use lowercase to denote regions con-
taining repeats, for example, that information will be lost. If this is a serious prob-
lem for you, use one of the flat-file indexing schemes described later.
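If you want to call fastacmd from a program rather than from the command line, a thin wrapper is enough. The Python sketch below assumes that fastacmd is on your PATH and that the database was formatted with indexing turned on; the function names are hypothetical, and only the -d (database) and -s (comma-delimited identifiers) options are used.

```python
import subprocess


def fastacmd_argv(database, accessions):
    """Build the fastacmd command line for one or more accessions."""
    # -d names the BLAST database; -s takes a comma-delimited identifier list.
    return ["fastacmd", "-d", database, "-s", ",".join(accessions)]


def fetch_fasta(database, accessions):
    """Run fastacmd and return its FASTA output as text."""
    result = subprocess.run(
        fastacmd_argv(database, accessions),
        capture_output=True,
        text=True,
        check=True,  # raise an error if fastacmd reports a failure
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when a database name contains unusual characters.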
NCBI-BLAST users take note that unless you use the NCBI FASTA definition line
format discussed earlier in this chapter, your definition lines may not look exactly
the same when they come out of the database. For example, suppose you have a definition
line such as this:
>FOO
When you retrieve it with fastacmd, it looks like:
>lcl|FOO no definition found
You can easily avoid such inconsistencies by using the recommended identifier for-
mat and by including descriptions on the definition line.
WU-BLAST users take note: xdget doesn’t support virtual databases. You can work
around this limitation with a simple script, such as this one:
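#!/usr/bin/perl -w
use strict;

# Find the first argument that contains whitespace and treat it as a
# virtual database: a quoted, space-separated list of database names.
my (@DB, $i);
for ($i = 0; $i < @ARGV; $i++) {
    if ($ARGV[$i] =~ /\s/) {
        @DB = split(/\s+/, $ARGV[$i]);
        last;
    }
}

# No virtual database was given: hand everything to xdget unchanged.
exec("xdget @ARGV") unless @DB;

# Otherwise run xdget once per database, keeping the other arguments.
my @pre  = splice(@ARGV, 0, $i);  # arguments before the database list
my @post = splice(@ARGV, 1);      # arguments after the database list
foreach my $db (@DB) {
    system("xdget @pre $db @post");
}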
Flat File Indexing
One of the most common procedures used to manage sequence data is called flat file
indexing. In this approach, you keep concatenated sequence reports in their native
format and store the starting position of each record in a separate file. One advan-
tage of this approach is that you don’t have to do any work when you want to repro-
duce the data in flat file format. Another reason why flat file indexing is so common
is that it is simple to implement, at least for one-to-one mappings. To illustrate the
process, we’ll show you how to index identifiers in FASTA files. Here is an example
of a very short FASTA file:
>FOO
GAATTC
>BAR
ATAGCGAAT
This file has two records with identifiers FOO and BAR, and they begin at bytes 0 and
12, respectively (count the letters and don’t forget to add one for the end of line—in
Windows, the end of line is actually two characters, and this will change the
positions to 0 and 14). You can now create an index file that tells where each record
begins in the file:
BAR 12
FOO 0
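Building such an index takes only a few lines of code. The following Python sketch is one possible approach, not a reference implementation; the function names are hypothetical, and it assumes the identifier is the first word after the > character.

```python
def build_index(fasta_path):
    """Return a sorted list of (identifier, byte_offset) pairs."""
    index = []
    offset = 0
    # Binary mode keeps offsets as true byte positions on every platform,
    # including files with two-character Windows line endings.
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                # Assume the identifier is the first word after '>'.
                words = line[1:].split()
                identifier = words[0].decode("ascii") if words else ""
                index.append((identifier, offset))
            offset += len(line)
    index.sort()
    return index


def write_index(index, index_path):
    """Write one 'identifier offset' pair per line, sorted by identifier."""
    with open(index_path, "w") as out:
        for identifier, offset in index:
            out.write(f"{identifier} {offset}\n")
```

For the two-record file shown earlier, build_index returns BAR at offset 12 and FOO at offset 0, or 14 and 0 if the file uses Windows line endings.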
To use this index file, simply find the identifier of interest in the index and seek to
the appropriate position in the FASTA file. Note that the index file is sorted
alphabetically by identifier. This makes it much more efficient to find a record
because you can use a binary search to find the identifier. If you have an index file
containing 1 million records, a linear search examines 500,000 records on average,
but a binary search examines at most about 20, because 2 to the power of 20 is
slightly more than 1 million.
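The lookup and seek steps can be sketched in Python with the standard bisect module. The sketch assumes the index has already been loaded into memory as a sorted list of (identifier, offset) pairs; the function names find_offset and fetch_record are hypothetical.

```python
import bisect


def find_offset(index, identifier):
    """Binary-search a sorted list of (identifier, offset) pairs."""
    # The one-element tuple (identifier,) sorts just before any
    # (identifier, offset) pair that has the same identifier.
    pos = bisect.bisect_left(index, (identifier,))
    if pos < len(index) and index[pos][0] == identifier:
        return index[pos][1]
    return None


def fetch_record(fasta_path, index, identifier):
    """Seek to the start of a record and read up to the next '>' line."""
    offset = find_offset(index, identifier)
    if offset is None:
        return None
    lines = []
    with open(fasta_path, "rb") as handle:
        handle.seek(offset)
        lines.append(handle.readline())  # the definition line
        for line in handle:
            if line.startswith(b">"):    # the next record begins here
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

Only the requested record is read from disk, so retrieval time depends on the size of the record rather than the size of the FASTA file.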
You can make a couple of improvements to this simplistic indexing scheme. The first
is to allow the index file to support more than one FASTA file. This is a trivial modi-
fication because you can just add a filename to your index file:
BAR file-A 12
FOO file-A 0
XYZ file-B 0
Another easy improvement is to use a persistent indexed data structure such as a Perl
tied hash. The Bioperl project uses this strategy in its Bio::Index classes.
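To give a feel for a persistent, multi-file index, here is a sketch that uses Python's standard shelve module as a rough analogue of a Perl tied hash. It is not how Bio::Index is implemented; the function names and the shelf path are hypothetical.

```python
import shelve


def save_index(shelf_path, entries):
    """Store identifier -> (filename, offset) in a persistent shelf."""
    with shelve.open(shelf_path) as shelf:
        for identifier, filename, offset in entries:
            shelf[identifier] = (filename, offset)


def locate(shelf_path, identifier):
    """Return (filename, offset) for an identifier, or None if absent."""
    with shelve.open(shelf_path, flag="r") as shelf:
        return shelf.get(identifier)
```

Because the shelf lives on disk, the index is built once and later lookups don't need to read the whole index into memory.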
A slightly more complicated approach manages the indices with one of the many free
or commercial database applications, such as MySQL, PostgreSQL, FileMaker,
Microsoft Access, or whatever you happen to be familiar with. If you’re going to do
this, you might as well store a bit more data. For illustrative purposes, imagine you
create a schema like that in Table 11-3. In addition to the accession number, file, and
offset, this schema provides for a species and a molecule type (moltype). The schema
doesn't store the actual sequence because some applications can't handle
data this large. If you wish to store sequences as well, test the system with realistic
data to see whether it scales well.
Using such a database, you can do not only the simple accession number retrievals, but
also the one-to-many relationships such as all human sequences or all DNA
sequences. All you have to do is query the database and seek to the appropriate place
in the appropriate file for every record. Organizing the data this way has a number of
advantages over just downloading DDBJ/EMBL/GenBank by division. For example,
if you want to make a database of all human transcripts, you need to identify the

Table 11-3. Sequence database example

Accession  Species              Moltype  File    Offset
A          Homo sapiens         AA       file-1  12024
B          Homo sapiens         AA       file-1  250
C          Homo sapiens         DNA      file-2  28223
AF287139   Latimeria chalumnae  cDNA     file-3  0
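
As an illustration of the schema in Table 11-3, the sketch below uses Python's built-in sqlite3 module as a stand-in for whichever database application you prefer. The table name seq, the column names filename and byte_offset, and the function names are assumptions made for the example.

```python
import sqlite3


def make_database(rows):
    """Create an in-memory table shaped like Table 11-3 and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE seq (
               accession   TEXT PRIMARY KEY,  -- one-to-one lookup
               species     TEXT,
               moltype     TEXT,
               filename    TEXT,
               byte_offset INTEGER
           )"""
    )
    # Secondary indexes support the one-to-many queries.
    conn.execute("CREATE INDEX seq_species ON seq (species)")
    conn.execute("CREATE INDEX seq_moltype ON seq (moltype)")
    conn.executemany("INSERT INTO seq VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def locations(conn, species):
    """Return (accession, filename, byte_offset) for every matching record."""
    cur = conn.execute(
        "SELECT accession, filename, byte_offset FROM seq "
        "WHERE species = ? ORDER BY filename, byte_offset",
        (species,),
    )
    return cur.fetchall()
```

Ordering the results by file and offset means that each flat file can be read in a single forward pass when you retrieve the records.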