This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
40
Chapter 3
CHAPTER 3
Sequence Alignment
BLAST finds statistically significant similarities between sequences by evaluating
alignments, but how are sequences aligned? In principle, there are many ways to align
two sequences, but in practice, one method is used more often than any other. This
chapter explains this technique with the biologist in mind, without using the mathe-
matical notation and jargon that is usually employed to describe such algorithm.
Divested of unfamiliar language and notation, these algorithms are quite simple.
Finding the optimal alignment between two sequences can be a computationally
complex task. Fortunately, a technique called dynamic programming (DP) makes
sequence alignment tractable as long as you follow a few rules. Rather than have you
struggle with a confusing definition of DP, this chapter demonstrates how the tech-
nique works for sequence alignment and then gets back to the generalities. There are
fundamentally two kinds of alignment: global and local. In global alignment, both
sequences are aligned along their entire lengths and the best alignment is found. In
local alignment, the best subsequence alignment is found. For example, if you want
to find the two most similar sentences between two books, you use local alignment.
If you want to compare the sentences end to end, use global alignment. This chapter
describes global alignment, then local alignment. The example uses English words
instead of biological sequences and the algorithms are quite general.
Global Alignment: Needleman-Wunsch
The global alignment algorithm described here is called the Needleman-Wunsch
algorithm. We will explain it in a way that seems natural to biologists, that is, it tells
the end of the story first, and then fills in the details. (This is why biologists make
terrible comedians; they always tell the punch line first.) We will align the words
COELANCANTH and PELICAN using a simple scoring scheme: +1 for letters that
match, –1 for mismatches, and –1 for gaps. The alignment will eventually look like
one of the following, which are equivalent given our scoring scheme: