The BLAST Algorithm (2/3)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

This is the Title of the Book, eMatter Edition

The BLAST Algorithm

all potential neighborhood words. The two-hit distance is set to 40 amino acids by

default, so words don’t have to be clustered as closely as they appear in the figures.

In principle, setting the two-hit distance to a smaller value also increases speed, but

in practice, its effect is insubstantial.

In WU-BLAST, you may set W to any value for any program. If W is set to 5 or more,

neighborhood word scores aren’t used; they are computed only by explicitly assign-

ing a value for T. High values of W in conjunction with moderate values of T can

lead to immense memory requirements, so it is best not to set T when W is 6 or

more. To alter the speed/sensitivity of WU-BLAST you can use a variety of combina-

tions of W, and T, and you can also employ the two-hit algorithm.

The statistical model underlying BLAST assumes the letters are independent of one

another so that the words MQV and MVQ have the same probability of occurring.

However, certain combinations occur in biological sequences much more often than

expected. These are usually low-complexity sequences—for example, FFF (see Chap-

ters 2 and 4). Low-complexity sequences are often of little biological interest and

aligning them wastes CPU cycles. Masking these regions is therefore common. Both

NCBI-BLAST and WU-BLAST let you replace these regions with N’s or X’s, which

have negative scores when aligned with a nucleotide or amino acid. A more useful

technique, termed soft masking, prevents seeding in such regions, but lets align-

ments extend through them.

Extension

Once the search space is seeded, alignments can be generated from the individual

seeds. We’ve drawn the alignment as arrows extending in both directions from the

seeds (Figure 5-5), and you’ll see why this is appropriate. In the Smith-Waterman

algorithm (Chapter 3), the endpoints of the best alignment are determined only

after the entire search space is evaluated. However, because BLAST searches only a

subset of the space, it must have a mechanism to know when to stop the extension

procedure.

To make this clearer, we’ll try aligning two sentences using a scoring scheme in

which identical letters score +1 and mismatches score -1. To keep the example sim-

ple, ignore spaces and don’t allow gaps in the alignment. Although only extension to

the right is shown, it also occurs to the left of the seed. Here are the two sentences:

The quick brown fox jumps over the lazy dog.

The quiet brown cat purrs when she sees him.

Assume the seed is the capital T, and you’re extending it to the right. You can extend

the alignment without incident until you get to the first mismatch:

The quic

The quie

This is the Title of the Book, eMatter Edition

Chapter 5: BLAST

At this point, a decision must be made about whether to continue the alignment or

stop. Looking ahead, it’s clear that more letters can be aligned, so it would be fool-

ish to stop now. The ends of the sentences, however, aren’t at all similar, so we

should stop if there are too many mismatches. To do this, we create a variable, X,

that represents the recent alignment history. Specifically, X represents how much the

score is allowed to drop off since the last maximum. Let’s set X to 5 and see what

happens. We’ll keep track of the sum score and the drop off score as we go.

Figure 5-6 shows the graphical interpretation.

The quick brown fox jump

The quiet brown cat purr

123 45654 56789 876 5654 <- score

000 00012 10000 123 4345 <- drop off score

The maximum score for this alignment is 9, and the extension is terminated when

the score drops to 4. After terminating, the alignment is trimmed back to the maxi-

mum score. If you set X to 1 or 2, the best alignment has a score of 6. If you set X to

Figure 5-5. Generating alignments by extending seeds

Figure 5-6. Attenuating extension with X

Extension

Alignment

Score

Length of extension

Trim to max

X = 5

This is the Title of the Book, eMatter Edition

The BLAST Algorithm

3, you can retrieve the longer alignment and save a little computation. A very large

value of X doesn’t increase the score and requires more computation. So what is the

proper value for X? It’s generally a good idea to use a large value, which reduces the

risk of premature termination and is a better way to increase speed than with the

seeding parameters. However, W, I, and 2-hit are better for controlling speed than X.

The gapless extension algorithm just demonstrated is similar to what was used in the

original version of BLAST. The algorithms in the current versions of BLAST allow

gaps and are related to the dynamic programming techniques described in Chapter 3.

Therefore, X not only depends on substitution scores, but also gap initiation and

extension costs.

Implementation details

Extension in BLASTN is a little different from BLASTP and the other protein-based

programs. The reason for this has to do with how nucleotide sequences are stored in

BLAST databases. Because there are only four common nucleotide symbols, nucle-

otide databases can be stored in a highly compressed state with only two bits per

nucleotide. What happens if the sequence contains an N or other ambiguous nucle-

otide? A random canonical nucleotide is substituted. For example, an N can be ran-

domly changed to A, C, G, or T and a Y changed to a C or T. It is possible, especially

with long stretches of ambiguous nucleotides, that the two-bit approximation termi-

nates extension prematurely.

NCBI-BLAST and WU-BLAST take very different approaches to the gapped exten-

sion procedure. NCBI-BLAST has three values for X (parameters

-X -y -Z) and WU-

BLAST has two (parameters

-X and -gapX). Some differences, such as the presence of

a floating bandwidth (NCBI) rather than a fixed bandwidth, are interesting from an

academic viewpoint but less so from a user’s perspective. What is important is that

altering the extension parameters from their defaults is generally not an effective way

to increase the speed or sensitivity of a BLAST search. You might consider adjusting

the parameters in two situations:

• If you align sequences that are nearly identical and you want to prevent the

extensions from straying into a nonidentical sequence, you can set the various X

values very low.

• If you try to align very distant sequences and have already adjusted W, T, and

the scoring matrix to allow additional sensitivity, it makes sense to also increase

the various X values. That said, when searching for remote homologies, it is

often better to employ an algorithm with a position-specific scoring matrix, such

as PSI-BLAST or HMMER.

Since gapped extension also depends on the gap initiation and extension costs, you

should note that these parameters are interpreted differently in NCBI-BLAST and

WU-BLAST. In NCBI-BLAST, the total cost for a gap is the gap opening cost (

-G)

plus the gap extension cost (

-E) times the length of the gap. In WU-BLAST, the total

This is the Title of the Book, eMatter Edition

Chapter 5: BLAST

cost of a gap is the cost of the first gap character (-Q) plus all remaining gap charac-

ters (

-R). The NCBI parameters -G 1 -E 1 are identical to -Q2-R1 in WU-BLAST.

Evaluation

Once seeds are extended in both directions to create alignments, the alignments are

evaluated to determine if they are statistically significant (Chapter 4). Those that are

significant are termed HSPs. At the simplest level, evaluating alignments is easy; just

use a score threshold, S, to sort alignments into low and high scoring. Because S and

E are directly related through the Karlin-Altschul equation, a score threshold is syn-

onymous with a statistical threshold. In practice, evaluating alignments isn’t as sim-

ple, which is due to complications that result from multiple HSPs.

Consider the alignment between a eukaryotic protein and its genomic source.

Because most coding regions are broken up by introns, an alignment between the

protein and the DNA is expected to produce several HSPs, one for each exon. In

assessing the statistical significance of the protein-DNA match, should each exon

alignment be forced to stand on its own against the statistical threshold, or does it

make more sense to combine the scores of the various exons? The latter is generally

more appropriate, especially if some exons are short and may be thrown out if not

aided in some way. However, determining the significance of multiple HSPs isn’t as

simple as summing all the alignment scores because many alignments are expected to

be extensions from fortuitous word hits and not all groups of HSPs make sense.

An alignment threshold is an effective way to remove many random, low-scoring

alignments (Figure 5-7). However, if the threshold is set too high, (Figure 5-7c), it

may also remove real alignments. This alignment threshold is based on score and

therefore doesn’t consider the size of the database. There are, of course, E-value and

P-value interpretations, if you consider the size of individual sequences or a constant

theoretical search space.

Qualitatively, the relationship between HSPs should resemble the relationship

between ungapped alignments. That is, the lines in the graph should start from the

upper left and continue to the lower right, the lines shouldn’t overlap, and there

should be a penalty for unaligned sequence. Groups of HSPs that behave this way are

considered consistent. Figure 5-8 shows consistent and inconsistent HSPs. From a

biological perspective, you expect the 5´ end of a coding sequence to match the

N-terminus of a protein and the 3´ end to match the C-terminus—not vice versa.

The algorithm for defining groups of consistent HSPs compares the coordinates of all

HSPs to determine if there are overlaps (a little overlap is actually allowed to account

for extensions that may have strayed too far). This computation is quadratic in the

number of HSPs and therefore can be costly if there are many HSPs (e.g., when the

sequences are long, and the alignment threshold is low).

This is the Title of the Book, eMatter Edition

The BLAST Algorithm

Figure 5-7. Increasing alignment thresholds remove low scoring alignments

Figure 5-8. Consistent and inconsistent alignment groups

Low

Medium

High

Consistent

Inconsistent

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for The BLAST Algorithm (2/3)

Create new playlist

Sign In

Sign Up

Table of Contents for
The BLAST Algorithm (2/3)