Longest Common Subsequence

Now, let's look at a different problem which is solvable by a dynamic programming algorithm. The problem we're now interested in is the longest common subsequence problem.


The difference between a subsequence and a substring is that a substring is a consecutive subsequence. For example, [a, d, f] is a subsequence of [a, b, c, d, e, f], but not a substring. [b, c, d] is a substring and a subsequence of [a, b, c, d, e, f].

We're interested in finding similarities between two given sequences by computing the Longest Common Subsequence (LCS) between them. A common subsequence, S3, of two given sequences, S1 and S2, is a sequence whose elements appear in both S and S2 in the same order, but not necessarily consecutively. This problem is usually applicable when finding DNA similarities of different organisms.

For example, if we have two strands, S1 and S2, as follows:

S1 = ACCGGTCGAGTGCGCGGAGCCGGCCGAA
S2 = GTCGTTCGGAATGCCGTTGCTCTGTAAAA

Then the longest common strand between those two, let's call it S3, would be as follows:

S3 = GTCGTCGGAAGCCGGCCGAA

Figure 4.4: Calculating the longest common subsequence

This problem is solvable using dynamic programming. If we go for a brute force approach, we can enumerate all subsequences of S1 and check each subsequence to see if it is also a subsequence of S2, keeping track of the longest one we find.

However, if |S1| = m, then S1 has 2m subsequences, making it impractical for long sequences. The LCS exhibits the optimal-substructure property.

One way for this to become evident is to think in terms of prefixes. Let's assume that we have the following two sequences:

X = {x1, x2… xm}
Y = {y1, y2… yn}

Let Z be any LCS of X and Y that can be represented as follows:

Z = {z1, z2… zk}

Then, the following cases are possible:

  1. If xm = yn, then zk = xm = yn, and therefore Zk-1 is a LCS of Xm-1 and Yn-1
  2. If xm != yn, then zk != xm implies that Z is a LCS of Xm-1 and Y
  3. If xm != yn, then zk != yn implies that Z is a LCS of X and Yn-1

This tells us that an LCS of two sequences contains the LCS of prefixes of the two sequences, exhibiting the optimal-substructure property. If we define c[i][j] to be the length of a LCS of sequences Xi and Yj, then we can arrive at the following recursive formula, which guides us toward the dynamic programming solution for the problem that can be represented as follows:

Looking at the recurrence, the property of overlapping subproblems immediately pops out.

To find a LCS of X and Y, we may need to find the LCS's of X and Yn-1 and Xm-1 and Y, but each of these have the problem of finding the LCS of Xm-1 and Yn-1. Many other subproblems share sub-subproblems.

Using this recurrence, in a bottom-up fashion (where we have solutions for subproblems readily computed), we can produce the following dynamic programming algorithm to compute the length of the longest common subsequence of two strings:

public int length(String x, String y) {
int m = x.length();
int n = y.length();
int[][] c = new int[m + 1][n + 1];
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (x.charAt(i - 1) == y.charAt(j - 1))
c[i][j] = c[i - 1][j - 1] + 1;
else
c[i][j] = Math.max(c[i - 1][j], c[i][j - 1]);
}
}
return c[m][n];
}

Snippet 4.9: Computing the length of the longest common subsequence of two Strings using dynamic programming algorithm. Source class name: LongestCommonSubsequence

Go to https://goo.gl/4TdNVQ to access this code.

It is possible to also compute the longest common subsequence, and not only the length of it if we keep track of the direction we go in the c matrix on each step (either up or left), taking into account that we only add a new character to the optimal subsequence when xi = yj.

We don't cover the solution to this problem in this book. If you are interested in it, the Wikipedia page for the LCS problem has a detailed walkthrough on the implementation of it: https://en.wikipedia.org/wiki/Longest_common_subsequence_problem.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.141.37.10