Sequence Alignment

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Previous Chapter

Further Reading

Next Chapter

Page 65

This is the Title of the Book, eMatter Edition

Chapter 3

CHAPTER 3

Sequence Alignment

BLAST finds statistically significant similarities between sequences by evaluating

alignments, but how are sequences aligned? In principle, there are many ways to align

two sequences, but in practice, one method is used more often than any other. This

chapter explains this technique with the biologist in mind, without using the mathe-

matical notation and jargon that is usually employed to describe such algorithm.

Divested of unfamiliar language and notation, these algorithms are quite simple.

Finding the optimal alignment between two sequences can be a computationally

complex task. Fortunately, a technique called dynamic programming (DP) makes

sequence alignment tractable as long as you follow a few rules. Rather than have you

struggle with a confusing definition of DP, this chapter demonstrates how the tech-

nique works for sequence alignment and then gets back to the generalities. There are

fundamentally two kinds of alignment: global and local. In global alignment, both

sequences are aligned along their entire lengths and the best alignment is found. In

local alignment, the best subsequence alignment is found. For example, if you want

to find the two most similar sentences between two books, you use local alignment.

If you want to compare the sentences end to end, use global alignment. This chapter

describes global alignment, then local alignment. The example uses English words

instead of biological sequences and the algorithms are quite general.

Global Alignment: Needleman-Wunsch

The global alignment algorithm described here is called the Needleman-Wunsch

algorithm. We will explain it in a way that seems natural to biologists, that is, it tells

the end of the story first, and then fills in the details. (This is why biologists make

terrible comedians; they always tell the punch line first.) We will align the words

COELANCANTH and PELICAN using a simple scoring scheme: +1 for letters that

match, –1 for mismatches, and –1 for gaps. The alignment will eventually look like

one of the following, which are equivalent given our scoring scheme:

This is the Title of the Book, eMatter Edition

Global Alignment: Needleman-Wunsch

COELACANTH COELACANTH

P-ELICAN-- -PELICAN--

Note that every letter of each sequence is aligned to a letter or a gap. In local align-

ments, discussed later, this isn’t the case.

The alignment takes place in a two-dimensional matrix in which each cell corre-

sponds to a pairing of one letter from each sequence. To get an intuitive understand-

ing of the alignment algorithm, look at Figure 3-1, which shows where the maximum

scoring alignment lies in the matrix. The alignment starts at the upper left and fol-

lows a mostly diagonal path down and to the right. When two letters are aligned, the

path follows a diagonal trajectory. There are several places in which the letters from

COELACANTH are paired to gap characters. In this case, the graph is followed hori-

zontally. Although not shown here, the path may be also be followed vertically when

the letters from PELICAN are paired with gap characters. Gap characters can never

be paired to each other. Note that the first row and column are blank. The reason for

this will become clear shortly.

In reality, you don’t store letters in the matrix as shown in Figure 3-1. Each cell of

the matrix actually contains two values: a score and a pointer. The score is derived

from the scoring scheme. Here, this means +1 or –1, but when aligning biological

sequences, the values come from a scoring matrix (a topic of the next chapter). The

pointer is a directional indicator (an arrow) that points up, left, or diagonally up and

left. The pointer navigates the matrix, and its use will become clearer later in the

chapter. Now, let’s look at the algorithm in detail. There are three major phases: ini-

tialization, fill, and trace-back.

Figure 3-1. Example of an alignment matrix

COELACANTH

–

This is the Title of the Book, eMatter Edition

Chapter 3: Sequence Alignment

Initialization

In the initialization phase, you assign values for the first row and column

(Figure 3-2). The next stage of the algorithm depends on this. The score of each cell

is set to the gap score multiplied by the distance from the origin. Gaps may be

present at the beginning of either sequence, and their cost is the same as anywhere

else. The arrows all point back to the origin, which ensures that alignments go all the

way back to the origin (a requirement for global alignment).

Fill

In the fill phase (also called induction), the entire matrix is filled with scores and

pointers using a simple operation that requires the scores from the diagonal, vertical,

and horizontal neighboring cells. You will compute three scores: a match score, a

vertical gap score, and a horizontal gap score. The match score is the sum of the diag-

onal cell score and the score for a match (+1 or –1). The horizontal gap score is the

sum of the cell to the left and the gap score (–1), and the vertical gap score is com-

puted analogously. Once you’ve computed these scores, assign the maximum value

to the cell and point the arrow in the direction of the maximum score. Continue this

operation until the entire matrix is filled, and each cell contains the score and pointer

to the best possible alignment at that point.

If you look at the initialized matrix, you’ll find that there’s only one cell where you

can compute the maximum score because there’s only one cell with the required 3

neighbors. Now you can see why you needed to leave one extra column and row and

why you needed the initialization phase. Without them, you wouldn’t be able to

Figure 3-2. Initialization of the alignment matrix

COELACANTH

-1

-2

-3

-4

-5

-6

-7

-1 -2 -3 -4 -5 -6 -7 -8 -9 -100

This is the Title of the Book, eMatter Edition

Global Alignment: Needleman-Wunsch

start the fill. Ignoring the rest of the table, compute this cell (the one that matches C

to P).

The match score is the sum of the preceding diagonal cell (score = 0) and the score

for aligning C to P (–1). The total match score is –1. The horizontal gap score is the

sum of the score to the left (–1) and the gap score (–1). The horizontal gap score is

therefore –2. The same is true for the vertical gap score. Your maximum score is

therefore the diagonal score (–1), and the pointer is set to the diagonal (Figure 3-3).

Now that this first cell is computed, you can compute the cell to the right or the cell

below. Calculate one more cell, the right neighbor (the one that aligns O and P).

The match score is the sum of the preceding diagonal cell (–1) and the mismatch

score (–1), which equals –2. The horizontal gap score is the score of the cell to the

left (–1) and the gap penalty (–1), which is also –2. The vertical gap score is the cell

above (–2) and the gap penalty (–1), which totals –3. The maximum score is –2, but

this can come from the diagonal or from the left. This is where you must make a con-

sistent, arbitrary choice—for example, always choose the diagonal over a gap.

The matrix up to this point is shown in Figure 3-4. It may seem rather trivial, but the

maximum global alignment between CO and P can be calculated. It has a score of –2

and has either of the following equivalent forms:

CO CO

-P P-

Using the same maximizing procedure for each cell, you can fill the entire matrix. At

the end of the fill, the matrix appears as in Figure 3-5.

Figure 3-3. Beginning to fill the alignment matrix

Figure 3-4. Second cell filled in the alignment matrix

COELACANTH

-1

-1 -2 -3 -4 -5 -6 -7 -8 -9 -100

-1

COELACANTH

-1

-1 -2 -3 -4 -5 -6 -7 -8 -9 -100

-1 -2

This is the Title of the Book, eMatter Edition

Chapter 3: Sequence Alignment

Trace-Back

The trace-back lets you recover the alignment from the matrix. Like the other parts

of this algorithm, it’s pretty simple. Start at the bottom-right corner and follow the

arrows until you get to the beginning. To produce the alignment, at each cell, write

out the corresponding letters or a hyphen for the gap symbol. Since you’re following

it from the end to the start, the alignment will be backward, and you just reverse it.

The final alignment looks like this:

COELACANTH

-PELICAN-

Example 3-1 shows a Perl script.

Figure 3-5. Filled alignment matrix

Example 3-1. Trace-back with Needleman-Wunsch algorithm

# Needleman-Wunsch Algorithm

# usage statement

die "usage: $0 <sequence 1> <sequence 2> " unless @ARGV == 2;

# get sequences from command line

my ($seq1, $seq2) = @ARGV;

# scoring scheme

my $MATCH = 1; # +1 for letters that match

my $MISMATCH = -1; # -1 for letters that mismatch

my $GAP = -1; # -1 for any gap

# initialization

my @matrix;

COELACANTH

-1

-2

-3

-4

-5

-6

-7

-1 -2 -3 -4 -5 -6 -7 -8 -9 -100

-1 -2 -3 -4 -5 -6 -7 -8 -9 -10

-2 -2 -1

-3 -3

-2

-5 -5

-4 -5

-4

-3 -4

-4

-2

-3

-4

-0 -8

-6

-4

-2

-0

-7

-5

-3

-1

-6

-4

-2

-0

-5

-3

-1

-0

-4-3

-2-1

-1

-0-2

-2

-1-1

-5 -4

-4 -3

-2 -2 -2

-1

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Sequence Alignment

Create new playlist

Sign In

Sign Up

Table of Contents for
Sequence Alignment