Rabin-Karp

In 1987, Richard M. Karp and Michael O. Rabin proposed a string matching algorithm that performs well in practice and generalizes string matching against a set of patterns. The Rabin-Karp algorithm takes O(m) time in its preprocessing stage and its worst-case running time is O((n - m + 1)m), similar to Boyer-Moore's.

To better introduce the Rabin-Karp algorithm, let's assume that our alphabet ∑ is composed only of decimal digits (∑ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}), so that we can view a string of k characters as a decimal number with length k. Therefore, string 12345 corresponds to number 12345. Given a pattern P[0...m] and a substring from text T[i...i + m], if we convert both those strings to their correspondent decimal number, so that we have numbers p and ti, then p = ti only if P[0...m] = T[i...i + m], and therefore i is only a valid shift if p = ti.

If we could compute p in O(m) time and all the ti values in O(n - m + 1) time, then we could determine all valid shifts in O(n) time by comparing p with each of the ti values. The problem with this is when p and ti are too large to work with. If the numbers are too large, then we can work with them modulo q, for a suitable modulus q.

Let's leave the choice of a suitable modulus q for later. How we can generalize this to work with other alphabets? For example, what if we want to use characters that are not decimal digits?

Consider that, in the case of our original alphabet, to convert a string 12345 into a number, we would perform the operation 104*1+103*2+102*3+10¹*4+100*5. If we have a D-ary alphabet {0, 1… d - 1}, then we could use the same strategy, but replace 10 by d. One other consideration to have is that, when we have computed ti and we want to compute ti+1, then we can simply remove the leftmost digit, shift everything to the left, and add the newest digit, that is, ti+1 = ((ti - T[i]*dm-1)*d + T[i + 1]) % q.

One final consideration to have is that working with modulo q is not perfect. ti = p (mod q) does not imply that ti = p. But if ti != p (mod q), then ti != p. We can therefore use this as a fast heuristic test to rule out invalid shifts.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.147.59.219