CHAPTER 9
Needleman–Wunsch Algorithm (Global Alignment)

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

9.1 INTRODUCTION

The Needleman–Wunsch algorithm (NWA; Needleman and Wunsch, 1970) is used for global alignment. We compare homologous molecular sequences character by character to achieve sequence alignment. Global alignment is the end‐to‐end alignment between two sequences; hence, it introduces gaps that represent insertions/deletions. This is useful for identifying “InDels” (Insertion and Deletions), and for overall comparison of two or more comparable (i.e., similar) sequences. Phylogenetically close sequences of the same length are the most suited for global alignment.

The best alignment can be identified by quantifying or scoring the possible alternative alignments. Scoring matrices are used to award the match(es) and penalize the mismatch(es) and gap(s), so that the best alignment with the highest score can be identified. The scores in the matrix are integer values (e.g., +1, 0, –1).

9.2 OBJECTIVE

To align two comparable sequences (nucleotide or amino acid) for obtaining a global alignment using the NWA.

9.3 PROCEDURE

Let us start with two short sequences which are to be globally aligned using an NWA:

  • Seq1: CTAGTAG
  • Seq2: CAGGTAGTG

If the sequences are of length “m” and “n”, respectively, we will obtain one scoring matrix of dimensions m × n.

9.3.1 Step 1: define a scoring scheme

A scoring scheme is first defined, and then the dynamic scoring is done, starting from the top left to the bottom right of the matrix:

  • Match score (Sij) = 2
  • Mismatch score (Sij) = –1
  • Gap penalty (d) = –2

Here, “i” (varying from 1, 2, …, m) and “j” (varying from 1, 2, …, n) refer to the indexes for the row and column numbers, respectively, of the cell of the scoring matrix (described in Step 2). Please note that the scoring scheme can differ according to the evolutionary relatedness of the input sequences. A gap extension penalty is also introduced in advanced methods of such dynamic algorithms, to direct proper alignment and select the best one, based on the scores.

9.3.2 Step 2: initiation of matrix construction

This step starts with creating a matrix that represents the score for each pair of residues belonging to the pair of sequences.

  1. Seq1: written along the vertical axis (or Y‐axis) of the matrix (i.e., along the rows).
  2. Seq2: written along the horizontal axis (or X‐axis) of the matrix (i.e., along the columns).

TABLE 9.1

← Sequence 1 i” varying from 1, 2, 3 …, m Sequence 2 →
j” varying from 1, 2, 3 …, n
C A G G T A G T G
C
T
A
G
T
A
G

Finally, we will get a matrix with each of its cells filled up with three scores obtained from three different types of movements, as shown below:

  • Horizontal movement: represents a gap in the sequence written vertically (along the Y‐axis, row numbers have been denoted by “i”)
  • Vertical movement: represents a gap in the sequence written horizontally (along the X‐axis, column numbers are indicated by “j”)
  • Diagonal movement: representing either match or mismatch between the two sequences (cell positions indexed by “i” and “j”, respectively) (Figure 9.1).
A four-square grid with a rightward, diagonal, and downward arrows having a common vertex at the top left square representing the horizontal, diagonal, and vertical movements, respectively.

FIGURE 9.1 Three types of movement along the matrix in dynamic programming.

Annex one row at the top and one column at the left of the original m × n dimensional matrix (to make it an (m + 1) × (n + 1) dimensional matrix). These are termed the 0th row and 0th column, respectively. Next, fill the first cell (i.e., the cell located at the top left‐most corner of the 0th row and 0th column) with a zero.

9.3.2.1 Scoring method: dynamic programming

The score of each cell can be determined from three different movements towards the cell. The formula used to calculate the scores for each cell is shown in Figure 9.2.

Twenty-rectangle grid with (i,j) on the second row of the third column with rightward, diagonal, and downward arrows pointing to (i, j+1), (i+1, j + 1), and (i + 1, j), respectively, of the surrounding cells.

FIGURE 9.2 Increment in the respective indexes of the cells (denoting row and column numbers, respectively) of the matrix, to indicate the movement along the cells.

[9.1] images

Select the highest score(s): Now, after obtaining these three scores in the cell being studied, the highest value is selected, which is to be used as the score of that cell. This highest score is used for calculating the respective scores of the cells located just right (hence, pertaining to horizontal movement), just bottom (vertical movement) and just at the bottom‐right (diagonal movement) positions of the current cell. If more than one score shows the highest value, then that highest value will be considered for calculating the score of the adjacent cells. However, the backtracking will be through all these cells contributing to the same highest values (see Figures 9.3 and 9.4).

Tabular representation with 10 columns for C, A, G, G, T, A, G, T, and G and 7 rows for C, T, A, G, T, A, and G, with arrows indicating back‐tracing based on the highest score out of the three scores.

FIGURE 9.3 Each cell is assigned three scores obtained from three possible movements – namely, horizontal, diagonal and vertical. The arrows indicate back‐tracing based on the highest score out of the three scores.

Tabular representation with 10 columns for C, A, G, G, T, A, G, T, and G and 7 rows for C, T, A, G, T, A, and G, with arrows indicating trace-back starting from the bottom right cell towards the top left cell.

FIGURE 9.4 Trace‐back starts from the bottom right cell towards the top left cell, according to the highest score(s) obtained in the previous step. There could be more than one path at a point (i.e., cell), if that cell has been awarded more than one highest score, due to two or three movements in the previous step.

Let us clarify this with the first few cells of the matrix shown in Step 3:

  1. Put “0” value in the first cell (utmost top left cell).
  2. The value of the next cell (right side) will be 0 + (–2) = –2; since it is a horizontal movement, so the gap penalty of –2 will be given. The next cell at the right side will have the score of –2 + (–2) = –4, due to the gap penalty for the same horizontal movement. Likewise, the subsequent cells on the right side will be awarded –6, –8, –10, –12, –14, –16, and the last one –18.
  3. Now, for the downward movement in the very first column, the gap penalty will be consecutively awarded to each cell. The cells will have the scores of –2, –4, … –14.
  4. The diagonal movement starts from 0 to the bottom right cell; we will check whether there is a match or mismatch. In this case, there is a match of “C” to “C”. Hence, a score of +2 will be awarded.
  5. Please note that, except for the first row and first column (which were annexed to the original matrix), every cell of the matrix can have three values, due to three possible movements towards that cell:
    • vertical movement from the cell just above,
    • horizontal movement from the cell just at left, and
    • diagonal movement from the adjacent cell just at the top‐left position).
    In the following matrix, three colors have been used: Blue for vertical movement (Gap), Yellow highlight for diagonal movement (Match or Mismatch), and Red for horizontal movement (Gap).
  6. Now, out of these three values, the highest one will be selected as the score for that cell. The chosen score(s) has/have been highlighted in yellow in the matrix below.

9.3.3 Step 3: trace‐back step

Finally, the trace‐back step starts from the last cell at the right side of the bottom row in a way such that arrow(s) will be drawn to the cell(s) from which the current score (highest of three scores of the present cell) has been obtained. If there are two or three cells which contribute to the highest value, then two (or three, respectively) arrows will be drawn to indicate the previous cells.

The arrows are drawn from the bottom right cell towards the top left cell. If certain cell(s) have more than one arrow, this will lead to more than one path at every such junction.

Tracing back is done from the bottom right cell towards the top left cell. The path is determined based on the source of the cell which contributes to the highest score in the present cell.

The dynamic programming thus proceeds from one cell to the other (in the defined directions), until the whole matrix obtains the score for each of the cells.

All possible paths are obtained from the scoring matrix during the process of global alignment.

9.3.4 Step 4: calculating the scores for each alignment

The score for each of these alignments is calculated according to the scoring scheme set at the beginning of scoring. The alignment score with the highest value is considered as the best global alignment. One could get more than one highest score (same value) when there are multiple (more than one) pair‐wise alignments. In such a situation, all those alignments with the highest score are equally good, and any one of these can be accepted as the global alignment.

In the example shown above, we get seven possible global alignments, all of which have the highest score (i.e., equal to 4). Here, we can select any one of these alignments as the best alignment (Figure 9.5).

7 Tabular representations with 10 columns for C, A, G, G, T, A, G, T, and G and 7 rows for C, T, A, G, T, A, and G with arrows, displaying global alignments (by Needleman-Wunsch algorithm).

FIGURE 9.5 Global alignment (by NWA) has yielded seven equally good (same alignment score of 4) alignments.

9.4 QUESTIONS

  1. 1. Briefly describe the steps of the NWA.
  2. 2. What are the features of a dynamic programming? Why is the NWA considered to be dynamic programming?
  3. 3. Align the following pairs of sequences manually, using the Needleman–Wunsch dynamic algorithm (assuming scores and penalties to be the same as in the given example):
    1. Seq1: ACTGTGCGT
    2. Seq2: GACGCGTG
    3. Seq1: GTCACACATGT
    4. Seq2: GACCGTATTCGAGT
  4. 4. Let the Match score (Sij) = 1, Mismatch score (Sij) = –1 and Gap penalty (d) = –3. Align the following pairs of sequences globally,
    1. Seq1: GTACG
    2. Seq2: AGTATGCGA
    3. Seq1: GCTGTAGTGG
    4. Seq2: CGTGCA
  5. 5. Assume three different sets of weights for the match, mismatch and gap:

    TABLE 9.2

    SetMatchMismatchGap
    11–1–3
    22 0–2
    31 0–1

    Now, globally align the following sequences and determine the best alignments for each of the sets of scores and compare:

    1. Seq1: GACTTAC
    2. Seq2: CGTTGAATTCAC
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.119.177