The Smith–Waterman algorithm (Smith and Waterman, 1981) is a dynamic programming tool that is used for local alignment, to compare molecular sequences of any length with an aim to identify the conserved region(s). It is a modified form of the Needleman–Wunsch algorithm, where tracing back is stopped as soon as a score of zero (0) is encountered in the path. The scoring is done by replacing the negative values with zeroes in a cell.
10.2 OBJECTIVE
To align two sequences (nucleotide or amino acid) to find out the local region(s) of similarity, using the Smith–Waterman algorithm.
10.3 PROCEDURE
Two sequences will be utilized for local alignment using the Smith–Waterman algorithm:
Seq1: CTAGTAG
Seq2: CAGGTAGTG
10.3.1 Step 1: scoring scheme
The award for match and penalty for mismatch and gaps (horizontal and vertical) may be modified.
Match score (Sij) = 2
Mismatch penalty (Sij) = –1
Gap penalty (d) = –2
Here, “i” (varying from 1, 2, …, m) and “j” (varying from 1, 2, …, n) refer to the indexes for the row and column numbers of the cell of the scoring matrix, respectively.
Like the NW algorithm, dynamic scoring starts from the top left to the bottom right.
10.3.2 Step 2: matrix construction
Write the sequences (Seq1 and Seq2) along the Y‐axis (along the rows) and the X‐axis (along the columns), respectively.
Scoring Method: Dynamic Programming
An award or penalty is determined by the movement along the cells, starting from the top left cell:
[10.1]
The highest of these three scores is used for calculating the respective scores of the adjacent cells towards the right, bottom and diagonally bottom right (Figure 10.1).
10.3.3 Step 3: trace‐back step
Round off all the negative values to zeroes.
Identify the cell having the highest score in the entire matrix.
Start tracing back from the cell (not necessarily at the bottom left of the matrix) with the highest score. Move to the cell located at the left, or top or top left (diagonal), based on the highest score among these three cells.
Proceed backward (towards top left) in a similar manner to that described for the Needleman–Wunsch algorithm.
Stop as soon as the highest score is encountered in the path is 0 (zero).
Draw the path of tracing using arrows from the starting cell and finally to the stopping cell.
There may be more than one highest score (of the same value). If a cell has initiated more than one arrow, then it will lead to more than one path at such a junction.
10.3.4 Step 4: calculating the scores for local alignment
The score for each of the local alignment is computed in a similar manner, as shown in global alignment.
Seq1: GTAG
| | | |
Score: 2 + 2 + 2 + 2 = 8
Seq2: GTAG
TABLE 10.1Similarities and differences between NW and SW algorithms.
SN
Particulars
NWA
SWA
1
Application
Global alignment: screens for the region(s) of high similarity, considering the whole sequences (from start to end).
Local alignment: identifies the discrete region(s) of high similarity (only similar fragment(s) of the sequences).
2
Scoring system
Award for ‘match’. Penalty for ‘mismatch’ and ‘gap’. The scores may vary depending on the sequence similarities
The scoring system is the same as NWA. The values may change accordingly
3
Algorithm type
Dynamic algorithm to construct the similarity (or scoring) matrix in an iterative manner.
Same as NWA.
4
Tracing Back
No modification of scoring matrix obtained before starting trace‐back step from the bottom right corner.
All the negative scores are set to zero, and trace‐back step may start from some cells with higher/highest positive values.
5
Number of trace back paths
Multiple (but linked as if branched with each other) paths can be obtained, due to same scores in a cell obtained from those three movements (horizontal, vertical and diagonal) during dynamic scoring of the similarity matrix
Multiple discrete paths can be obtained, due to the presence of higher positive scores in more than one cell. Each of the higher values gives a fresh start. However, in some cases, branch‐like multi‐paths could be found, like NWA.
6
Multiple results
More than one global alignment could be found if two or more alignments have the highest scores
There could be more than one local alignment.
10.4 QUESTIONS
1. Construct the scoring matrix using the Smith–Waterman algorithm and find out the local alignment based on the score, using the following pairs of sequences:
Seq1: CAGTCAGT; Seq2: AGTTGCA
Seq1: AGGCATGAA; Seq2: GGTTAA
Seq1: CAGTCC; Seq2: AGTCCGCTAC
2. What are the applications of local alignment of nucleotide as well as amino acid sequences?
3. Given the pair of nucleotide sequences: Seq1: GATCGTCATG and Seq2: GACGTCACTG, for the following set of awards and penalty values, determine the local alignments:
Match score (Sij) = 2; Mismatch score (Sij) = –1; Gap penalty (d) = –2
Match score (Sij) = 1; Mismatch score (Sij) = –1; Gap penalty (d) = –2
Match score (Sij) = 4; Mismatch score (Sij) = –3; Gap penalty (d) = –6
Is there any change in the local alignment? If yes, why is this change observed? If no, what could be the reasons for no change in the best alignment?
4. Enumerate the differences between NW‐algorithm and SW‐algorithm. Let two similar sequences (with a couple of mismatches only) be aligned by both the methods; how far the results will differ? Please take the following example and work out:
Seq1: MACTAPRD
Seq2: MACCAPRE
5. Justify the following comment: “The SW algorithm is a database search algorithm”.