CS Mukhopadhyay and RK Choudhary
School of Animal Biotechnology, GADVASU, Ludhiana
The Needleman–Wunsch algorithm (NWA; Needleman and Wunsch, 1970) is used for global alignment. We compare homologous molecular sequences character by character to achieve sequence alignment. Global alignment is the end‐to‐end alignment between two sequences; hence, it introduces gaps that represent insertions/deletions. This is useful for identifying “InDels” (Insertion and Deletions), and for overall comparison of two or more comparable (i.e., similar) sequences. Phylogenetically close sequences of the same length are the most suited for global alignment.
The best alignment can be identified by quantifying or scoring the possible alternative alignments. Scoring matrices are used to award the match(es) and penalize the mismatch(es) and gap(s), so that the best alignment with the highest score can be identified. The scores in the matrix are integer values (e.g., +1, 0, –1).
To align two comparable sequences (nucleotide or amino acid) for obtaining a global alignment using the NWA.
Let us start with two short sequences which are to be globally aligned using an NWA:
If the sequences are of length “m” and “n”, respectively, we will obtain one scoring matrix of dimensions m × n.
A scoring scheme is first defined, and then the dynamic scoring is done, starting from the top left to the bottom right of the matrix:
Here, “i” (varying from 1, 2, …, m) and “j” (varying from 1, 2, …, n) refer to the indexes for the row and column numbers, respectively, of the cell of the scoring matrix (described in Step 2). Please note that the scoring scheme can differ according to the evolutionary relatedness of the input sequences. A gap extension penalty is also introduced in advanced methods of such dynamic algorithms, to direct proper alignment and select the best one, based on the scores.
This step starts with creating a matrix that represents the score for each pair of residues belonging to the pair of sequences.
← Sequence 1 | “i” varying from 1, 2, 3 …, m | Sequence 2 → | |||||||||
“j” varying from 1, 2, 3 …, n | |||||||||||
C | A | G | G | T | A | G | T | G | |||
C | |||||||||||
T | |||||||||||
A | |||||||||||
G | |||||||||||
T | |||||||||||
A | |||||||||||
G |
Finally, we will get a matrix with each of its cells filled up with three scores obtained from three different types of movements, as shown below:
Annex one row at the top and one column at the left of the original m × n dimensional matrix (to make it an (m + 1) × (n + 1) dimensional matrix). These are termed the 0th row and 0th column, respectively. Next, fill the first cell (i.e., the cell located at the top left‐most corner of the 0th row and 0th column) with a zero.
The score of each cell can be determined from three different movements towards the cell. The formula used to calculate the scores for each cell is shown in Figure 9.2.
Select the highest score(s): Now, after obtaining these three scores in the cell being studied, the highest value is selected, which is to be used as the score of that cell. This highest score is used for calculating the respective scores of the cells located just right (hence, pertaining to horizontal movement), just bottom (vertical movement) and just at the bottom‐right (diagonal movement) positions of the current cell. If more than one score shows the highest value, then that highest value will be considered for calculating the score of the adjacent cells. However, the backtracking will be through all these cells contributing to the same highest values (see Figures 9.3 and 9.4).
Let us clarify this with the first few cells of the matrix shown in Step 3:
Finally, the trace‐back step starts from the last cell at the right side of the bottom row in a way such that arrow(s) will be drawn to the cell(s) from which the current score (highest of three scores of the present cell) has been obtained. If there are two or three cells which contribute to the highest value, then two (or three, respectively) arrows will be drawn to indicate the previous cells.
The arrows are drawn from the bottom right cell towards the top left cell. If certain cell(s) have more than one arrow, this will lead to more than one path at every such junction.
Tracing back is done from the bottom right cell towards the top left cell. The path is determined based on the source of the cell which contributes to the highest score in the present cell.
The dynamic programming thus proceeds from one cell to the other (in the defined directions), until the whole matrix obtains the score for each of the cells.
All possible paths are obtained from the scoring matrix during the process of global alignment.
The score for each of these alignments is calculated according to the scoring scheme set at the beginning of scoring. The alignment score with the highest value is considered as the best global alignment. One could get more than one highest score (same value) when there are multiple (more than one) pair‐wise alignments. In such a situation, all those alignments with the highest score are equally good, and any one of these can be accepted as the global alignment.
In the example shown above, we get seven possible global alignments, all of which have the highest score (i.e., equal to 4). Here, we can select any one of these alignments as the best alignment (Figure 9.5).
Now, globally align the following sequences and determine the best alignments for each of the sets of scores and compare:
18.217.40.118