CHAPTER 10
Smith–Waterman Algorithm (Local Alignment)

CS Mukhopadhyay and RK Choudhary

School of Animal Biotechnology, GADVASU, Ludhiana

10.1 INTRODUCTION

The Smith–Waterman algorithm (Smith and Waterman, 1981) is a dynamic programming tool that is used for local alignment, to compare molecular sequences of any length with an aim to identify the conserved region(s). It is a modified form of the Needleman–Wunsch algorithm, where tracing back is stopped as soon as a score of zero (0) is encountered in the path. The scoring is done by replacing the negative values with zeroes in a cell.

10.2 OBJECTIVE

To align two sequences (nucleotide or amino acid) to find out the local region(s) of similarity, using the Smith–Waterman algorithm.

10.3 PROCEDURE

Two sequences will be utilized for local alignment using the Smith–Waterman algorithm:

  • Seq1: CTAGTAG
  • Seq2: CAGGTAGTG

10.3.1 Step 1: scoring scheme

The award for match and penalty for mismatch and gaps (horizontal and vertical) may be modified.

  • Match score (Sij) = 2
  • Mismatch penalty (Sij) = –1
  • Gap penalty (d) = –2

Here, “i” (varying from 1, 2, …, m) and “j” (varying from 1, 2, …, n) refer to the indexes for the row and column numbers of the cell of the scoring matrix, respectively.

Like the NW algorithm, dynamic scoring starts from the top left to the bottom right.

10.3.2 Step 2: matrix construction

  1. Write the sequences (Seq1 and Seq2) along the Y‐axis (along the rows) and the X‐axis (along the columns), respectively.

    Scoring Method: Dynamic Programming

  2. An award or penalty is determined by the movement along the cells, starting from the top left cell:
[10.1] images

The highest of these three scores is used for calculating the respective scores of the adjacent cells towards the right, bottom and diagonally bottom right (Figure 10.1).

Image described by caption and surrounding text.

FIGURE 10.1 The scores in each cell are obtained from the movements from three directions – namely, horizontal, diagonal and vertical. The arrows indicate back‐tracing based on the highest score out of the three scores.

10.3.3 Step 3: trace‐back step

  1. Round off all the negative values to zeroes.
  2. Identify the cell having the highest score in the entire matrix.
  3. Start tracing back from the cell (not necessarily at the bottom left of the matrix) with the highest score. Move to the cell located at the left, or top or top left (diagonal), based on the highest score among these three cells.
  4. Proceed backward (towards top left) in a similar manner to that described for the Needleman–Wunsch algorithm.
  5. Stop as soon as the highest score is encountered in the path is 0 (zero).
  6. Draw the path of tracing using arrows from the starting cell and finally to the stopping cell.
  7. There may be more than one highest score (of the same value). If a cell has initiated more than one arrow, then it will lead to more than one path at such a junction.
Image described by caption.

FIGURE 10.2 Trace‐back step: starting with the highest score in the matrix, moving towards the top left and stopping at the last positive score.

10.3.4 Step 4: calculating the scores for local alignment

The score for each of the local alignment is computed in a similar manner, as shown in global alignment.

Seq1: GTAG

| | | |Score: 2 + 2 + 2 + 2 = 8

Seq2: GTAG

TABLE 10.1 Similarities and differences between NW and SW algorithms.

SN Particulars NWA SWA
1 Application Global alignment: screens for the region(s) of high similarity, considering the whole sequences (from start to end). Local alignment: identifies the discrete region(s) of high similarity (only similar fragment(s) of the sequences).
2 Scoring system Award for ‘match’.
Penalty for ‘mismatch’ and ‘gap’.
The scores may vary depending on the sequence similarities
The scoring system is the same as NWA. The values may change accordingly
3 Algorithm type Dynamic algorithm to construct the similarity (or scoring) matrix in an iterative manner. Same as NWA.
4 Tracing Back No modification of scoring matrix obtained before starting trace‐back step from the bottom right corner. All the negative scores are set to zero, and trace‐back step may start from some cells with higher/highest positive values.
5 Number of trace back paths Multiple (but linked as if branched with each other) paths can be obtained, due to same scores in a cell obtained from those three movements (horizontal, vertical and diagonal) during dynamic scoring of the similarity matrix Multiple discrete paths can be obtained, due to the presence of higher positive scores in more than one cell. Each of the higher values gives a fresh start. However, in some cases, branch‐like multi‐paths could be found, like NWA.
6 Multiple results More than one global alignment could be found if two or more alignments have the highest scores There could be more than one local alignment.

10.4 QUESTIONS

  1. 1. Construct the scoring matrix using the Smith–Waterman algorithm and find out the local alignment based on the score, using the following pairs of sequences:
    1. Seq1: CAGTCAGT; Seq2: AGTTGCA
    2. Seq1: AGGCATGAA; Seq2: GGTTAA
    3. Seq1: CAGTCC; Seq2: AGTCCGCTAC
  2. 2. What are the applications of local alignment of nucleotide as well as amino acid sequences?
  3. 3. Given the pair of nucleotide sequences: Seq1: GATCGTCATG and Seq2: GACGTCACTG, for the following set of awards and penalty values, determine the local alignments:
    1. Match score (Sij) = 2; Mismatch score (Sij) = –1; Gap penalty (d) = –2
    2. Match score (Sij) = 1; Mismatch score (Sij) = –1; Gap penalty (d) = –2
    3. Match score (Sij) = 4; Mismatch score (Sij) = –3; Gap penalty (d) = –6

    Is there any change in the local alignment? If yes, why is this change observed? If no, what could be the reasons for no change in the best alignment?

  4. 4. Enumerate the differences between NW‐algorithm and SW‐algorithm. Let two similar sequences (with a couple of mismatches only) be aligned by both the methods; how far the results will differ? Please take the following example and work out:
    • Seq1: MACTAPRD
    • Seq2: MACCAPRE
  5. 5. Justify the following comment: “The SW algorithm is a database search algorithm”.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.22.42.99