This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 5: BLAST
cost of a gap is the cost of the first gap character (-Q) plus all remaining gap
characters (-R). The NCBI parameters -G 1 -E 1 are identical to -Q 2 -R 1 in WU-BLAST.
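
The equivalence can be checked with a little arithmetic. The sketch below assumes the usual NCBI convention, in which a gap costs G once plus E for every gap character, and uses the WU-BLAST convention described above. The function names are hypothetical and exist only for this illustration.

```python
def ncbi_gap_cost(G, E, length):
    """Assumed NCBI convention: opening cost G plus E for every gap character."""
    return G + E * length


def wu_gap_cost(Q, R, length):
    """WU-BLAST convention: Q for the first gap character, R for each remaining one."""
    return Q + R * (length - 1)


# Compare the two parameterizations for gaps of 1 to 5 characters.
for length in range(1, 6):
    print(length, ncbi_gap_cost(1, 1, length), wu_gap_cost(2, 1, length))
```

Under these assumptions, both functions charge length + 1 for a gap of any length, which is why the two parameter sets describe the same gap penalty.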
Evaluation
Once seeds are extended in both directions to create alignments, the alignments are
evaluated to determine if they are statistically significant (Chapter 4). Those that are
significant are termed HSPs. At the simplest level, evaluating alignments is easy; just
use a score threshold, S, to sort alignments into low and high scoring. Because S and
E are directly related through the Karlin-Altschul equation, a score threshold is
synonymous with a statistical threshold. In practice, evaluating alignments isn’t that
simple because of complications that arise when there are multiple HSPs.
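
To make the relationship between S and E concrete, here is a minimal sketch of the Karlin-Altschul equation, E = K m n e^(-lambda S), and its inverse. The values of K, lambda, m, and n below are illustrative assumptions rather than values taken from any particular search, and the function names are hypothetical.

```python
import math


def evalue_from_score(S, m, n, K, lam):
    """Expected number of chance alignments scoring at least S (Karlin-Altschul)."""
    return K * m * n * math.exp(-lam * S)


def score_from_evalue(E, m, n, K, lam):
    """Raw score threshold that corresponds to an E-value threshold."""
    return math.log(K * m * n / E) / lam


# Illustrative, assumed parameters: query length m, database length n,
# and scoring-system constants K and lambda.
m, n, K, lam = 300, 1_000_000, 0.134, 0.318

S = score_from_evalue(1e-5, m, n, K, lam)
print(round(S, 1), evalue_from_score(S, m, n, K, lam))
```

For a fixed search space (m and n) and scoring system (K and lambda), each score maps to exactly one E-value and a higher score always gives a lower E-value, so choosing one threshold fixes the other.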
Consider the alignment between a eukaryotic protein and its genomic source.
Because most coding regions are broken up by introns, an alignment between the
protein and the DNA is expected to produce several HSPs, one for each exon. In
assessing the statistical significance of the protein-DNA match, should each exon
alignment be forced to stand on its own against the statistical threshold, or does it
make more sense to combine the scores of the various exons? The latter is generally
more appropriate, especially if some exons are short and would be thrown out if
judged on their own. However, determining the significance of multiple HSPs isn’t as
simple as summing all the alignment scores because many alignments are expected to
be extensions from fortuitous word hits and not all groups of HSPs make sense.
An alignment threshold is an effective way to remove many random, low-scoring
alignments (Figure 5-7). However, if the threshold is set too high (Figure 5-7c), it
may also remove real alignments. This alignment threshold is based on score and
therefore doesn’t consider the size of the database. There are, of course, E-value and
P-value interpretations, if you consider the size of individual sequences or a constant
theoretical search space.
Qualitatively, the relationship between HSPs should resemble the relationship
between ungapped alignments. That is, the lines in the graph should start from the
upper left and continue to the lower right, the lines shouldn’t overlap, and there
should be a penalty for unaligned sequence. Groups of HSPs that behave this way are
considered consistent. Figure 5-8 shows consistent and inconsistent HSPs. From a
biological perspective, you expect the 5´ end of a coding sequence to match the
N-terminus of a protein and the 3´ end to match the C-terminus—not vice versa.
The algorithm for defining groups of consistent HSPs compares the coordinates of all
HSPs to determine if there are overlaps (a little overlap is actually allowed to account
for extensions that may have strayed too far). This computation is quadratic in the
number of HSPs and therefore can be costly if there are many HSPs (e.g., when the
sequences are long and the alignment threshold is low).
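
The pairwise coordinate comparison can be sketched as follows. This is a simplified illustration, not the code BLAST uses: the HSP record, the inclusive coordinates, the function names, and the max_overlap parameter are all assumptions made for the example.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical HSP record with inclusive query (q) and subject (s) coordinates.
HSP = namedtuple("HSP", "name q_start q_end s_start s_end")


def is_consistent(a, b, max_overlap=0):
    """True if one HSP lies upstream of the other in both query and subject."""
    first, second = (a, b) if a.q_start <= b.q_start else (b, a)
    q_overlap = first.q_end - second.q_start + 1  # <= 0 means no overlap
    s_overlap = first.s_end - second.s_start + 1
    return (first.s_start < second.s_start
            and q_overlap <= max_overlap
            and s_overlap <= max_overlap)


def consistent_pairs(hsps, max_overlap=0):
    """Compare every pair of HSPs: n * (n - 1) / 2 checks, quadratic in n."""
    return [(a.name, b.name)
            for a, b in combinations(hsps, 2)
            if is_consistent(a, b, max_overlap)]


hsps = [
    HSP("A", 1, 50, 100, 249),
    HSP("B", 51, 120, 400, 609),
    HSP("C", 121, 200, 900, 1139),
    HSP("D", 60, 110, 10, 160),  # downstream of A in the query, upstream in the subject
]
print(consistent_pairs(hsps))
```

In this example, A, B, and C are consistent with one another, as three exons of one gene would be, whereas D conflicts with both A and B. Because every HSP is compared with every other HSP, doubling the number of HSPs roughly quadruples the number of checks.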