Sequence Database Management Strategies (1/2)


This is the Title of the Book, eMatter Edition

Chapter 11: BLAST Databases

Sequence Database Management Strategies

There are many useful public sequence databases, and you may have access to some private ones as well. Because this is a book about BLAST, we assume you want to use these collections of sequences in BLAST searches. Some sequences may be used as queries, and others in databases. How are you going to manage them all in a rational way? Several possible strategies exist, and the correct one for you depends on your needs and resources. To demonstrate some of the issues, let’s review a typical sequence analysis scenario.

Suppose a colleague of yours has just found the gene that makes cats go crazy for catnip. She wants to learn more about this gene and comes to you for help because you are a BLAST expert. The first thing she wants to do is a BLAST search to find out what vertebrate proteins are similar to this one. Where are you going to get such a database of proteins? Once you perform the BLAST search, you find several interesting similarities. Your colleague tells you that these are probably all part of a family of proteins, and she would like to build a phylogenetic tree to determine their relationships to one another. How are you going to get the individual sequences? Finally, she decides she wants more information about the human sequences, and to do that, she would like references to the scientific literature like the ones she would find in a DDBJ/EMBL/GenBank report. How are you going to retrieve such information? You could just refuse to help her because these aren’t really BLAST problems, but these are the kinds of tasks many BLAST users must face. Let’s take a look at how they can be solved.

Each question in this example has basically two solutions: use tools available on the Internet, or build the tools yourself. In general, it is much easier to use the Internet, but for high-speed or high-throughput operations you’ll want a local solution. After you read this chapter, you may decide that you want some services to be provided locally, while others are Internet-only operations. This section begins with a brief review of databases.

Queries, Indexes, and Reports

The most common database operation is a query. One person may want to retrieve a particular sequence. Another may want all human sequences. As you have seen, sequence records have quite a bit of useful information, and a user may request non-sequence information such as all the MEDLINE references for all sequences with the word disease in the description.

The efficiency with which a query is executed depends a lot on how the database is indexed. If there is no indexing, a query must operate on every record of the database. So, for example, if you want to find all the coelacanth sequences, you would have to look through millions of records to find the handful whose sequences originate from the coelacanth. Clearly, this isn’t going to be efficient, so databases usually have indexes that, for example, keep lists of species and all the sequences for each species.

The most straightforward kind of indexing occurs when there is a unique relationship between a property and a sequence. This is called a one-to-one mapping, and an example would be an accession number. A more complex indexing occurs when a property points to many sequences. This is called a one-to-many mapping, and an example is a species name that is shared by millions of records.
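The two kinds of mapping can be sketched with ordinary Perl hashes. This is only an illustration: the accession values ACC1 through ACC3 and the variable names are made up for the example, and a real index would live on disk rather than in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records: [accession, species]. Values are illustrative only.
my @records = (
    ['ACC1', 'Homo sapiens'],
    ['ACC2', 'Latimeria chalumnae'],
    ['ACC3', 'Homo sapiens'],
);

my (%by_accession, %by_species);
for my $i (0 .. $#records) {
    my ($acc, $species) = @{ $records[$i] };
    $by_accession{$acc} = $i;               # one-to-one: a single record per key
    push @{ $by_species{$species} }, $i;    # one-to-many: a list of records per key
}

print "ACC2 is record $by_accession{ACC2}\n";
print "Homo sapiens records: @{ $by_species{'Homo sapiens'} }\n";
```

The one-to-one index stores a single value per key, while the one-to-many index stores a list that grows with every matching record, which is why the second kind takes more space and more care to maintain.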

Once a query is executed, the data must be reported in some format. For sequences, this is usually the FASTA format. For other kinds of data, there are other appropriate formats, such as lists, tables, and graphs.

Local Database Considerations

Having a local sequence database has some real advantages. First, local databases are faster and more reliable because they don’t rely on an Internet connection. If you’re involved in high-throughput research, these reasons are sufficient. Another compelling reason is that you can combine several databases, and even include your own sequences that aren’t in the public databases. The downside to creating a local sequence database is the amount of work it takes. Depending on the scale of the operation, it can be a full-time job. Here are six important issues to address when building a local sequence database:

Downloading

Each database you support must be downloaded from time to time to keep the data current. For example, GenBank has five to six major releases each year, as well as daily updates. Other databases have their own update schedule. Managing updates can be a chore if you download a lot of databases, so automating the procedure is a good idea. In addition, you may want to take measures to ensure that during updates, which can take some time, the database that’s presented to users isn’t actually changing. This may require keeping a mirror of some data. Notice that having a local database doesn’t mean you can completely insulate yourself from the Internet.

Processing records

Each database you support must have a parser to read the various fields of each record. This may be as simple as pulling out the accession number for a sequence, or it may be much more complicated, such as when you record specific keywords. You can build your own parsers, but it takes less time to use one already created, such as a parser from the Bioperl project.

Storing data

Your database schema will determine how each record is stored and what kinds of relationships exist between various pieces of data. Designing an appropriate schema is a difficult problem because it takes people who understand the data (biologists), the data models (software engineers), and the storage/backup of the data (systems administrators).

Indexing

The efficiency of queries will largely depend on what data is indexed. You may choose to index everything, but your indexes could grow much larger than your data, so you may have to make compromises. This is another place where users and engineers must interact to determine the appropriate solution.

Querying

Not all databases are queried in the same way. Relational databases usually employ SQL as the query language, but many popular databases have their own querying mechanisms. The details of how you interact with the database may depend on what kind of database you use. Regardless of the underlying architecture, you may decide to present a different interface to users, such as a form in a web browser or a script/program interface that connects directly to the database.

Formatting

You’ll definitely want to create FASTA files, but what other report formats will you want to support? The DDBJ/EMBL/GenBank flat file formats are sometimes used to exchange data, so these would be useful, as would a tabular format and some kind of HTML that looks good in browsers. For each output format, you may need some specialized code to generate the report.

As you can see, building a local database isn’t trivial. But it doesn’t have to be a full-time job if you only want a subset of the information. For example, if all you want is to retrieve records by accession number, you don’t need to invest more than a couple hours of work. The following section explores the common techniques for managing sequence data.

Retrieving FASTA Files by Accession

The task of retrieving FASTA files by accession number is so common and has such an easy solution that it should be a local resource. If you’re using NCBI-BLAST, the fastacmd program retrieves sequences from BLAST databases singly or in batches. If you’re using WU-BLAST, the xdget program does the same thing. To use these features, you must index the databases when you format them, which is as simple as including the -o or -I option (see the command-line tutorial in Chapter 10, the reference sections for formatdb and fastacmd in Chapter 13, and xdformat and xdget in Chapter 14). One limitation of this approach is that the sequences are stored in a case-insensitive format in the database. If you use lowercase to denote regions containing repeats, for example, that information will be lost. If this is a serious problem for you, use one of the flat-file indexing schemes described later.

NCBI-BLAST users take note: unless you use the NCBI FASTA definition line format discussed earlier in this chapter, your definition lines may not look exactly the same when they come out of the database. For example, if you have a definition line such as this:

>FOO

When you retrieve it with fastacmd, it looks like:

>lcl|FOO no definition found

You can easily avoid such inconsistencies by using the recommended identifier format and by including descriptions on the definition line.

WU-BLAST users take note: xdget doesn’t support virtual databases. You can work around this limitation with a simple script, such as this one:

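#!/usr/bin/perl -w
use strict;

# Find the first argument containing whitespace: the virtual database,
# given as a quoted, space-separated list of database names.
my (@DB, $i);
for ($i = 0; $i < @ARGV; $i++) {
    if ($ARGV[$i] =~ /\s/) {
        @DB = split(/\s+/, $ARGV[$i]);
        last;
    }
}

# No virtual database: hand everything to xdget unchanged.
exec("xdget @ARGV") unless @DB;

# Otherwise run xdget once per database, keeping the other arguments.
my @pre  = splice(@ARGV, 0, $i);
my @post = splice(@ARGV, 1);
foreach my $db (@DB) {
    system("xdget @pre $db @post");
}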

Flat File Indexing

One of the most common procedures used to manage sequence data is called flat file indexing. In this approach, you keep concatenated sequence reports in their native format and store the starting position of each record in a separate file. One advantage of this approach is that you don’t have to do any work when you want to reproduce the data in flat file format. Another reason why flat file indexing is so common is that it is simple to implement, at least for one-to-one mappings. To illustrate the process, we’ll show you how to index identifiers in FASTA files. Here is an example of a very short FASTA file:

>FOO
GAATTC
>BAR
ATAGCGAAT

This file has two records with identifiers FOO and BAR, and they begin at bytes 0 and 12, respectively (count the letters and don’t forget to add one for the end of line; in Windows, the end of line is actually two characters, which changes the positions to 0 and 14). You can now create an index file that tells where each record begins in the file:

BAR 12
FOO 0
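An index like the one above can be generated with a few lines of Perl. The following is a minimal sketch, not a production indexer: it writes the two-record example to a temporary file (so the script is self-contained), then records the position reported by tell before each definition line is read. The variable names are our own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two-record example to a temporary file. binmode keeps each
# end of line as a single "\n" byte on every platform.
my ($out, $fasta) = tempfile(UNLINK => 1);
binmode $out;
print {$out} ">FOO\nGAATTC\n>BAR\nATAGCGAAT\n";
close $out or die "close: $!";

# Scan the file, noting the byte offset at which each definition line starts.
my %offset;
open my $in, '<', $fasta or die "open $fasta: $!";
binmode $in;
while (1) {
    my $pos  = tell $in;       # offset of the line about to be read
    my $line = <$in>;
    last unless defined $line;
    $offset{$1} = $pos if $line =~ /^>(\S+)/;
}
close $in;

# Emit the index sorted by identifier.
print "$_ $offset{$_}\n" for sort keys %offset;
```

Reading in binary mode matters here: the offsets must count the bytes actually stored in the file, including any two-character line endings.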

To use this index file, simply find the identifier of interest in the index and seek to the appropriate position in the FASTA file. Note that the index file is sorted alphabetically by identifier. This makes it much more efficient to find the record because you can use a binary search to find the identifier. If you have an index file containing 1 million records, a linear search looks through 500,000 records on average, but a binary search looks at only about 20 (the base-2 logarithm of 1 million is just under 20).
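The lookup just described might be sketched as follows. The subroutine names find_offset and fetch_record are hypothetical, and the index is held in memory as a sorted array for brevity; with a large index you would binary-search the index file itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The example FASTA file and its index, sorted by identifier.
my ($out, $fasta) = tempfile(UNLINK => 1);
binmode $out;
print {$out} ">FOO\nGAATTC\n>BAR\nATAGCGAAT\n";
close $out or die "close: $!";
my @index = (['BAR', 12], ['FOO', 0]);

# Binary search: halve the candidate range until the identifier is found.
sub find_offset {
    my ($id, $index) = @_;
    my ($lo, $hi) = (0, $#$index);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $id cmp $index->[$mid][0];
        return $index->[$mid][1] if $cmp == 0;
        if ($cmp < 0) { $hi = $mid - 1 } else { $lo = $mid + 1 }
    }
    return;    # identifier not in the index
}

# Seek to the record and read up to the next definition line.
sub fetch_record {
    my ($file, $offset) = @_;
    open my $fh, '<', $file or die "open $file: $!";
    binmode $fh;
    seek $fh, $offset, 0 or die "seek: $!";
    my $record = <$fh>;                  # the definition line itself
    while (defined(my $line = <$fh>)) {
        last if $line =~ /^>/;
        $record .= $line;
    }
    close $fh;
    return $record;
}

my $offset = find_offset('BAR', \@index);
print fetch_record($fasta, $offset) if defined $offset;
```

The comparison uses cmp, which matches the order produced by Perl's default sort, so the search and the sorted index agree on what "alphabetical" means.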

You can make a couple of improvements to this simplistic indexing scheme. The first is to allow the index file to support more than one FASTA file. This is a trivial modification because you can just add a filename to your index file:

BAR file-A 12
FOO file-A 0
XYZ file-B 0

Another easy improvement is to use a persistent indexed data structure such as a Perl tied-hash. The Bioperl project uses this strategy in its Bio::Index classes.
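As a rough illustration of the tied-hash idea, the sketch below uses SDBM_File, a DBM module distributed with most Perl installations. This is our own choice for the example and is not necessarily what the Bio::Index classes use internally; the index name fasta_index and the "file offset" value layout are likewise assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;                        # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;                    # DBM implementation bundled with Perl
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $base = "$dir/fasta_index";    # hypothetical index name

# Build the index once; entries are stored on disk as they are assigned.
my %index;
tie(%index, 'SDBM_File', $base, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $base: $!";
$index{FOO} = "file-A 0";
$index{BAR} = "file-A 12";
untie %index;

# A later run can reopen the same index without rebuilding it.
my %lookup;
tie(%lookup, 'SDBM_File', $base, O_RDONLY, 0666)
    or die "Cannot tie $base: $!";
my ($file, $offset) = split ' ', $lookup{BAR};
print "BAR is in $file at byte $offset\n";
untie %lookup;
```

The appeal of this design is that the lookup code is ordinary hash access; the DBM library takes care of finding the key on disk, so there is no need to write or maintain a binary search yourself.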

A slightly more complicated approach manages the indices with one of the many free or commercial database applications, such as MySQL, PostgreSQL, FileMaker, Microsoft Access, or whatever you happen to be familiar with. If you’re going to do this, you might as well store a bit more data. For illustrative purposes, imagine you create a schema like that in Table 11-3. In addition to the accession number, file, and offset, this schema provides for a species and a molecule type (moltype). The schema doesn’t store the actual sequence because some applications can’t handle data this large. If you wish to store sequences as well, test the performance of the system with realistic data to see if the system scales well.

Using such a database, you can do not only the simple accession number retrievals, but also the one-to-many relationships such as all human sequences or all DNA sequences. All you have to do is query the database and seek to the appropriate place in the appropriate file for every record. Organizing the data this way has a number of advantages over just downloading DDBJ/EMBL/GenBank by division. For example, if you want to make a database of all human transcripts, you need to identify the

Table 11-3. Sequence database example

Accession  Species              Moltype  File    Offset
A          Homo sapiens         AA       file-1  12024
B          Homo sapiens         AA       file-1  250
C          Homo sapiens         DNA      file-2  28223
AF287139   Latimeria chalumnae  cDNA     file-3  0
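To make the one-to-many retrieval concrete, here is a small sketch that holds the rows of Table 11-3 in memory and selects all human sequences. In practice the selection would be a query against your database application; the SQL shown in the comment, including the table name seq, is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rows copied from Table 11-3.
my @rows = (
    { acc => 'A',        species => 'Homo sapiens',        moltype => 'AA',   file => 'file-1', offset => 12024 },
    { acc => 'B',        species => 'Homo sapiens',        moltype => 'AA',   file => 'file-1', offset => 250   },
    { acc => 'C',        species => 'Homo sapiens',        moltype => 'DNA',  file => 'file-2', offset => 28223 },
    { acc => 'AF287139', species => 'Latimeria chalumnae', moltype => 'cDNA', file => 'file-3', offset => 0     },
);

# One-to-many query, the in-memory stand-in for something like
#   SELECT file, offset FROM seq WHERE species = 'Homo sapiens'
# (table and column names are hypothetical).
my @human = grep { $_->{species} eq 'Homo sapiens' } @rows;

# Group hits by file and sort by offset so each file is read front to back.
my %by_file;
push @{ $by_file{ $_->{file} } }, $_ for @human;
for my $file (sort keys %by_file) {
    for my $row (sort { $a->{offset} <=> $b->{offset} } @{ $by_file{$file} }) {
        print "$row->{acc}: seek to $row->{offset} in $file\n";
    }
}
```

Sorting the hits by file and offset is an optional refinement: it lets each flat file be opened once and read in a single forward pass instead of jumping back and forth.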
