This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 11: BLAST Databases
Sequence Database Management Strategies
There are many useful public sequence databases, and you may have access to some
private ones as well. Because this is a book about BLAST, we assume you want to use
these collections of sequences in BLAST searches. Some sequences may be used as
queries, and others in databases. How are you going to manage them all in a rational
way? Several possible strategies exist, and the correct one for you depends on your
needs and resources. To demonstrate some of the issues, let’s review a typical
sequence analysis scenario.
Suppose a colleague of yours has just found the gene that makes cats go crazy for
catnip. She wants to learn more about this gene and comes to you for help because
you are a BLAST expert. The first thing she wants to do is a BLAST search to find out
what vertebrate proteins are similar to this one. Where are you going to get such a
database of proteins? Once you perform the BLAST search, you find several
interesting similarities. Your colleague tells you that these are probably all part of a
family of proteins, and she would like to build a phylogenetic tree to determine their
relationships to one another. How are you going to get the individual sequences? Finally, she
decides she wants more information about the human sequences, and to do that, she
would like references to the scientific literature like the ones she would find in a
DDBJ/EMBL/GenBank report. How are you going to retrieve such information? You
could just refuse to help her because these aren’t really BLAST problems, but these
are the kinds of tasks many BLAST users must face. Let’s take a look at how they can
be solved.
Each question in this example has basically two solutions: use tools available on the
Internet, or build the tools yourself. In general, it is much easier to use the Internet,
but for high-speed or high-throughput operations you’ll want a local solution. After
you read this chapter, you may decide that you want some services to be provided
locally, while others are Internet-only operations. This section begins with a brief
review of databases.
Queries, Indexes, and Reports
The most common database operation is a query. One person may want to retrieve a
particular sequence. Another may want all human sequences. As you have seen,
sequence records have quite a bit of useful information, and a user may request
non-sequence information such as all the MEDLINE references for all sequences with
the word “disease” in the description.
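The three kinds of query just described can be sketched as simple filters over a list of records. This is only an illustration under stated assumptions: the Record class, its field names, the toy data, and the three function names are all hypothetical, and none of it reflects a real database schema or a real flat-file parser.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical, heavily simplified sequence record (illustration only)."""
    accession: str
    organism: str
    description: str
    sequence: str
    medline_ids: list = field(default_factory=list)


# Made-up toy records; accessions, sequences, and reference IDs are not real.
RECORDS = [
    Record("X0001", "Homo sapiens", "disease-associated kinase", "MKT", [111, 222]),
    Record("X0002", "Felis catus", "olfactory receptor", "MSA", [333]),
    Record("X0003", "Homo sapiens", "ribosomal protein", "MGR", []),
]


def fetch_by_accession(records, accession):
    """Retrieve one particular sequence record, or None if it is absent."""
    for rec in records:
        if rec.accession == accession:
            return rec
    return None


def fetch_by_organism(records, organism):
    """Retrieve every record from the given organism."""
    return [rec for rec in records if rec.organism == organism]


def references_for_word(records, word):
    """Collect reference IDs from records whose description contains word."""
    ids = []
    for rec in records:
        if word.lower() in rec.description.lower():
            ids.extend(rec.medline_ids)
    return ids


if __name__ == "__main__":
    print(fetch_by_accession(RECORDS, "X0002"))
    print(len(fetch_by_organism(RECORDS, "Homo sapiens")))
    print(references_for_word(RECORDS, "disease"))
```

Note that every function here walks the whole list from start to finish. That is harmless for three records, but the cost grows with the size of the collection.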
The efficiency with which a query is executed depends heavily on how the database is
indexed. If there is no indexing, a query must examine every record of the database.
So, for example, if you want to find all the coelacanth sequences, you would have to
look through millions of records to find the handful whose sequences originate from
the coelacanth. Clearly, this isn’t going to be efficient, so databases usually