Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

2. Classical algorithms for exact search

Thomas Mailund¹

(1)

Aarhus N, Denmark

We kick the book off by looking at classical algorithms for exact search, that is, finding positions in a string where a pattern string matches precisely. This problem is so fundamental that it received much attention in the very early days of computing, and by now, there are tens if not hundreds of approaches. In this chapter, we see a few classics.

Recall that we use iterators whenever we have an algorithm that loops over results that should be reported. All iterators must be initialized, and the resources they hold must be deallocated when we no longer need the iterator. When we loop, we have a function that returns true when there is something to report and false when the loop is done. The values the iterator reports are put in a structure that we pass along to the function that iterates to the next value to report. For the algorithms in this chapter, we initialize the iterators with the string in which we search, the pattern we search for, and the lengths of the two strings. Iterating over all occurrences of the pattern follows this structure:

struct iterator iter;

struct match match;

iter_init(iter, x, strlen(x), p, strlen(p));

while (next_func(&iter, &match)) {

// Process occurrence

}

iter_dealloc(&iter);

When we report an occurrence, we get the position of the match, so the structure the iterator use for reporting is this:

struct match {

uint32_t pos;

};

Naïve algorithm

The simplest way imaginable for exact search is to iteratively move through the string x, with an index j that conceptually runs the pattern p along x, and at each index start matching the pattern against the string using another index, i (see Figure 2-1). The algorithm has two loops, one that iterates j through x and one that iterates i through p, matching x[i + j] against p[i] along the way. We run the inner loop until we see a mismatch or until we reach the end of the pattern. In the former case, we move p one step forward and try matching again. In the second case, we report an occurrence at position j and then increment the index so we can start matching at the next position. We stop the outer loop when index j is greater than n − m. If it is, there isn’t room for a match that doesn’t run past the end of x.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig1_HTML.png — Figure 2-1
Exact search with the naïve approach

We terminate the comparison of x[i + j] and p[i] when we see a mismatch, so in the best case, where the first character in p never matches a character in x, the algorithm runs in time O(n) where n is the length of x. In the worst case, we match all the way to the end of p at each position, and in that case, the running time is O(nm) where m is the length of p.

To implement the algorithm using an iterator, the iterator needs to remember the string to search in and the pattern to search for—so we do not need to pass these along each time we increment the iterator with potentials for errors if we use the wrong strings—and we keep track of how far into the string we have searched.

struct naive_match_iter {

const uint8_t *x; uint32_t n;

const uint8_t *p; uint32_t m;

uint32_t current_index;

};

When we initialize the iterator, we remember the two strings and set the current index to zero—before we start iterating we are at the beginning of the string.

void init_naive_match_iter(

struct naive_match_iter *iter,

const uint8_t *x, uint32_t n,

const uint8_t *p, uint32_t m

) {

iter->x = x; iter->n = n;

iter->p = p; iter->m = m;

iter->current_index = 0;

}

When we increment the iterator, we follow the algorithm as described earlier except that we start the outer loop at the index saved in the iterator. We search from this index in an outer loop, and at each new index, we try to match the pattern with an inner loop. We break the inner loop if we see a mismatching character, and if the inner loop reaches the end, we have a match and report it. Before we return, we set the iterator index and store the matching position in the match structure.

bool next_naive_match(

struct naive_match_iter *iter,

struct match *match

) {

uint32_t n = iter->n, m = iter->m;

const uint8_t *x = iter->x;

const uint8_t *p = iter->p;

if (m > n) return false;

if (m == 0) return false;

for (uint32_t j = iter->current_index; j <= n - m; j++) {

uint32_t i = 0;

while (i < m && x[j+i] == p[i]) {

i++;

}

if (i == m) {

iter->current_index = j + 1;

match->pos = j;

return true;

}

return false;

}

The code

if (m > n) return false;

if (m == 0) return false;

makes sure that it is possible to match the pattern at all and that the pattern isn’t empty. This is something we could also test when we initialize the iterator. However, we do not have a way of reporting that we do not have a possible match there, so we put the test in the “next” function.

We do not allocate any resources when we initialize the iterator, so we do not need to do anything when deallocating it either. We still need the deallocator function, however, so we always use the same design pattern when we use iterators. To make sure that if we, at some point in the future, need to free something that we put in an iterator, then all users of the iterator (should) have added code for this.

void dealloc_naive_match_iter(

struct naive_match_iter *iter

) {

// Nothing to do here...

}

Border array and border search

It is possible to get O(n + m) running times for both best and worst case, and several algorithms exist for this. We will see several in the following sections. The first one is based on the so-called borders of strings.

Borders and border arrays

A border of a string is any substring that is both a prefix and a suffix of the said string; see Figure 2-2. For example, the string x = ababa has borders aba, a, and the empty string. There is always at least one border per string—the empty string. It is possible to list all borders by brute force. For each index i in x, test if the substrings x[0, i] matches the string x[n − i, n]. This approach makes time O(n) per comparison, and we need it for all possible borders which means that we end up with a running time of O(n²). It is possible to compute the longest border in linear time, as we shall see. The way we compute it shows that sometimes more is less; we will compute more than the length of the longest suffix. What we will compute is the border array . This is an array that for each index i holds the length of the longest border of string x[0, i]. Consider x = ababa. For index 0 we have string a which has border a, so the first element in the border array is 1. The string ab only has the trivial, nonempty border, so the border array value is zero. The next string is aba with border a, so we get 1 again. Now abab has borders ab, so the border array holds 2. The full string x = ababa with border aba so its border array looks like ba = [1, 0, 1, 2, 3].

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig2_HTML.png — Figure 2-2
A string with three borders

We can make the following observation about borders and border arrays: The longest border of x[0, i] is either the empty string or an extension of a border of x[0, i − 1]. If the letter at x[i] is a, the border of x[0, i] is some string y followed by a. The y string must be both at the beginning and end of x[0, i − 1] (see Figure 2-3), so it is a border of x[0, i − 1]. The longest border for x[0, i] is the longest border of x[0, i − 1] that is followed by a (which may be the empty border if the string x begins with a) or the empty border if there is no border we can extend with a.

Another observation is that if you have two borders to a string, then the shorter of the two is a border of the longer; see Figure 2-4.

The two observations combined gives us an approach to computing the border array. The first string has the empty border as its only border, and after that, we can use the border array up to i − 1 to compute the length of the longest border of x[0, i]. We start by testing if we can extend the longest border with x[i], and if so, ba[i] = ba[i − 1] + 1. Otherwise, we look at the second-longest border, which must be the longest border of x[0, ba[i − 1]]. If the character after this border is x[i], then ba[i] = ba[ba[i − 1]] + 1. We continue this way until we have found a border we can extend (see Figure 2-5). If we reach the empty border, we have a special case—either we can extend the empty border because x[0] = x[i], in which case ba[i] = 1, or we cannot extend the border because x[0] ≠ x[i], in which case ba[i] = 0.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig3_HTML.png — Figure 2-3
Extending a border

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig4_HTML.png — Figure 2-4
A shorter border is always a border of a longer border

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig5_HTML.png — Figure 2-5
Searching for the longest border we can extend with letter a

An implementation of the border array construction algorithm can look like this:

ba[0] = 0;

for (uint32_t i = 1; i < m; ++i) {

uint32_t b = ba[i - 1];

while (b > 0 && x[i] != x[b])

b = ba[b - 1];

ba[i] = (x[i] == x[b]) ? b + 1 : 0;

}

The running time is m for a string x of length m. It is straightforward to see that the outer loop only runs m iterations but perhaps less easy to see that the inner loop is bounded by m iterations in total. But observe that in the outer loop, we at most increment b by one per iteration. We can assign b + 1 to ba[i] in the last statement in the inner loop and then get that value in the first line of the next iteration, but at no other point do we increment a value. In the inner loop, we always decrease b—when we get the border of b − 1, we always get a smaller value than b. We don’t allow b to go below zero in the inner loop, so the total number of iterations of that loop is bounded by how much the outer loop increase b. That is at most one per iteration, so we can decrement b by at most m, and therefore the total number of iterations of the inner loop is bounded by O(m).

Exact search using borders

The reason we built the border array was to do an exact search, so how does the array help us? Imagine we build a string consisting of the pattern we search for, p, followed by the string we search in, x, separated by a sentinel, $ character not found elsewhere in the two strings, y = p$x. The sentinel ensures that all borders are less than the length of p, m, and anywhere we have a border of length m, we must have an occurrence of p (see Figure 2-6). In the figure, the indices are into the p$x string and not into x, but you can remap this by subtracting m + 1. The start index of the match is i − m + 1 rather than the more natural i − m because index i is at the end of the match and not one past it.

We can construct the string p$x in linear time and compute the border array—and report occurrences in the process—in linear time, O(m + n). You don’t need to create the concatenated string, though. You can build the border array for p and use that when computing the border array for x. You pretend that p is prepended to x. When you do this, the sentinel between p and x is the null-termination sentinel in the C-string p.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig6_HTML.png — Figure 2-6
Searching using a border array

An iterator that searches a string with this algorithm must contain the border array of p, the index into x we have reached, and the b variable from the border array construction algorithm.

struct border_match_iter {

const uint8_t *x; uint32_t n;

const uint8_t *p; uint32_t m;

uint32_t *border_array;

uint32_t i; uint32_t b;

};

When we initialize the iterator, we set its index to zero. That, after all, is where we start searching in x. We also set the iterator’s b variable to zero. We imagine that we start the search after a sentinel, so the longest border at the start index for x has length zero. We then allocate and compute the border array.

void init_border_match_iter(

struct border_match_iter *iter,

const uint8_t *x, uint32_t n,

const uint8_t *p, uint32_t m

) {

iter->x = x; iter->n = n;

iter->p = p; iter->m = m;

iter->i = iter->b = 0;

uint32_t *ba = malloc(m * sizeof(uint32_t));

compute_border_array(p, m, ba);

iter->border_array = ba;

}

Since we allocated the border array when we initialized the iterator, we need to free it again when we deallocate it.

void dealloc_border_match_iter(

struct border_match_iter *iter

) {

free(iter->border_array);

}

A third of my implementation for incrementing the following iterator is setting up aliases for the variables in the iterator, so I don’t have to write iter->b and iter->m for variables b and m, respectively. Other than that, there are the tests for whether it is possible at all to have a match, that we also saw in the previous section, and then there is the border array construction algorithm again, except that we never update an array but instead report when we get a border of length m.

bool next_border_match(

struct border_match_iter *iter,

struct match *match

) {

const uint8_t *x = iter->x;

const uint8_t *p = iter->p;

uint32_t *ba = iter->border_array;

uint32_t b = iter->b;

uint32_t m = iter->m;

uint32_t n = iter->n;

if (m > n) return false;

if (m == 0) return false;

for (uint32_t i = iter->i; i < iter->n; ++i) {

while (b > 0 && x[i] != p[b])

b = ba[b - 1];

b = (x[i] == p[b]) ? b + 1 : 0;

if (b == m) {

iter->i = i + 1;

iter->b = b;

match->pos = i - m + 1;

return true;

}

return false;

}

When we report an occurrence, we naturally set the position we matched in the report structure, and we remember the border and index positions.

Knuth-Morris-Pratt

The Knuth-Morris-Pratt (KMP) algorithm also uses borders to achieve a best- and worst-case running time of O(n + m), but it uses the borders in a slightly different way. Before we get to the algorithm, however, I want to convince you that we can, conceptually, move the pattern p through x in two different ways; see Figure 2-7. We can let j be an index into x and imagine p starting there. When we test if p matches there, we use a pointer into p, i, and test x[ j + i] against p[i] for increasing i. To move p to another position in x, we change j, for example, to slide p one position to the right we increment j by one. Alternatively, we can imagine p aligned at position j − i for some index j in x and an index i into p. If we change i, we move j − i so we move p. If, for example, we want to move p one step to the right, we can decrement i by one. To understand how the KMP algorithm works, it is useful to think about moving p in the second way. We will increment the j and i indices when matching characters, but when we get a mismatch, we move p by decrementing i.

The idea in KMP is to move p along x as we would in the naïve algorithm, but move a little faster when we have a mismatch. We use index j to point into x and i to point into p. We match x[ j] against p[i] as we scan along x and the pattern is aligned against x at index j − i. We can move p’s position by modifying either i or j. Consider a place in the algorithm where we have matched p[0, i] against x[ j − i, j] and see a mismatch. In the naïve algorithm, we would move p one step to the right and start matching p again at that position. We would set i to zero to start matching from the beginning of p, and we would decrement j to j − i + 1 to match at the new position at which we would match p. With KMP we will skip positions where we know that p cannot match, and we use borders to do this.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig7_HTML.png — Figure 2-7
Two ways to conceptually look at matching

If we have matched p up to index i and then had a mismatch, we know that the only next position at which we could possibly have a match is one where we match a border of p[0, i − 1] against a suffix of the string we already matched x[ j − i, j − 1]; see Figure 2-8. It has to be a border of p[0, i − 1] and not p[0, i], although that might look like a better choice from the figure. However, we know that p[0, i] doesn’t match at the last index, so we need a border of the pattern up to index i − 1. When we move p, we must be careful not to slide it past possible matches, but if we pick the longest border of p[0, i − 1], then this cannot happen. Aligning the longest border moves the pattern the shortest distance where a border matches a suffix of x[ j − i, j − 1]. When we have a mismatch at index i, we move p up to the next possible match by decreasing i to ba[i − 1]. See Figure 2-9 for a visualization of this idea.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig8_HTML.png — Figure 2-8
Borders and the observation underlying the KMP algorithm

When we move p to match a border to the string we already matched, there is a chance that the character following the border doesn’t match x[j] either. It is not straightforward to pick a border where we are sure that the next character matches, but we can easily avoid that we mismatch on exactly the same character as before. We need to modify the border array, so we do not include borders where the next character matches the character that follows the border. A border array where we have removed the borders p[0, ba[i]] and p[i − ba[i], i] where p[ba[i] + 1] = p[i + 1] is what we call a restricted border array . We can compute this restricted array by scanning the border array from left to right and skipping a border if the characters match. The longest border of the skipped border will not be followed by the same character as the border we skip. If it did, we would have skipped past it when we processed the longer border.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig9_HTML.png — Figure 2-9
Calculations for how much we should jump at a mismatch

for (uint32_t i = 0; i < m - 1; i++) {

if (ba[i] > 0 && pattern[ba[i]] == pattern[i + 1])

ba[i] = ba[ba[i] - 1];

}

An iterator for the KMP algorithm needs to hold the border array and the indices i and j.

struct kmp_match_iter {

const uint8_t *x; uint32_t n;

const uint8_t *p; uint32_t m;

uint32_t *ba;

uint32_t j, i;

};

The initialization consists of computing the border array and modifying it to avoid borders that are followed by the same characters.

void compute_border_array(

const uint8_t *x,

uint32_t m,

uint32_t *ba

) {

ba[0] = 0;

for (uint32_t i = 1; i < m; ++i) {

uint32_t b = ba[i - 1];

while (b > 0 && x[i] != x[b])

b = ba[b - 1];

ba[i] = (x[i] == x[b]) ? b + 1 : 0;

}

void computed_restricted_border_array(

const uint8_t *x,

uint32_t m,

uint32_t *ba

) {

compute_border_array(x, m, ba);

for (uint32_t i = 0; i < m - 1; i++) {

if (ba[i] > 0 && x[ba[i]] == x[i + 1])

ba[i] = ba[ba[i] - 1];

}

void init_kmp_match_iter(

struct kmp_match_iter *iter,

const uint8_t *x, uint32_t n,

const uint8_t *p, uint32_t m

) {

iter->x = x; iter->n = n;

iter->p = p; iter->m = m;

iter->j = 0; iter->i = 0;

uint32_t *ba = malloc(m * sizeof(uint32_t));

ba[0] = 0;

computed_restricted_border_array(p, m, ba);

iter->ba = ba;

}

Since we allocate memory for the border array, we must also free it in the deallocation function.

void dealloc_kmp_match_iter(

struct kmp_match_iter *iter

) {

free(iter->ba);

}

The next function gets its information from the iterator. It then iterates as long as index j hasn’t reached a point where no more matches can occur, or until we have a match that we report. We scan the text and pattern, by increasing i and j as long as we have a match, and if i reaches m, we know that we have a match. To move p to the next position, we increase j by one if i is zero (so we need to match from the beginning of p), or we decrease i if i is not zero using the border array. If we have a match to report, we update the iterator with the current loop state and return the position where we had the match, j − m.

bool next_kmp_match(

struct kmp_match_iter *iter,

struct match *match

) {

// Aliases to make the code easier to read...

uint32_t j = iter->j;

uint32_t i = iter->i;

uint32_t m = iter->m;

uint32_t n = iter->n;

const uint8_t *x = iter->x;

const uint8_t *p = iter->p;

if (m > n) return false;

if (m == 0) return false;

// Remember that j matches the first i

// items into the string, so + i.

while (j <= n - m + i) {

// Match as far as we can

while (i < m && x[j] == p[i]) {

i++; j++;

}

// We need to check this

// before we update i.

bool we_have_a_match = i == m;

// Update indices

if (i == 0) j++;

else i = iter->ba[i - 1];

// If we have a hit...

if (we_have_a_match) {

// ...yield new match

iter->j = j; iter->i = i;

match->pos = j - m;

return true;

}

return false;

}

The reason we have a Boolean for when we have a match is that we need to update the indices whether we have a match or not. I chose this solution to avoid duplicated code, but you could also have the update code twice instead.

To see that the algorithm runs in linear time, we need two observations. First, the index j never decreases and this bounds it to maximal n steps. This variable does not increase in each iteration, but when j doesn’t increase, i decreases instead. When i decreases, we conceptually move p toward the right by increasing j −i, where the beginning of p sits under x. We never move p to the left, so the number of steps we can move p forward is also bounded by n. So, in each iteration we either increment j or move p, and both operations are bounded by n steps. This means that we have a linear bound on the KMP algorithm.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig10_HTML.png — Figure 2-10
Matching from right to left

Boyer-Moore-Horspool

With the Boyer-Moore-Horspool (BMH) algorithm , we are back to a worst-case running time of O(nm), but now with a best-case running time of O(n/m + m), that is, a potential sublinear running time. The trick to going faster than linear is to match the pattern from right to left (see Figure 2-10), which lets us use information about the match after the current position of p. If we have a mismatch before we reach the first character in p, that is, before we reach index j in x, we might be able to skip past the remaining prefix of x[ j, j+m] without looking at it.

The idea in the BMH algorithm is straightforward. To have a match, at the very least, the last character of p, p[m − 1], should match the last character in the substring we are trying to match p against, that is, x[ j + m − 1]. If we see a mismatch, we do not simply increment j by one. Instead, we move p to the next position where the rightmost occurrence of x[ j + m − 1] occurs in p; see Figure 2-11. We cannot include the last character since if that matches we will not move anywhere, so “rightmost” really means the rightmost that is not the last character. As a preprocessing step, we want to build a “jump table” that, whenever we have a mismatch (or get to the beginning of p), moves us to the next position where we match x[ j + m − 1]. If the rightmost occurrence of this last character is at index k in p, we want to move p m − 1 − k positions; see Figure 2-12.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig11_HTML.png — Figure 2-11
Observation that lets us jump ahead after a mismatch

When we build the jump table, we cannot include the last character in p, p[m − 1], since jumping to align this one might leave us at the same position as we are already at. Excluding it means that p will always jump at least one position to the right, that is, when j is updated with the jump table, it will always increase. When we build the jump table, we start by setting all entries to m. If a character is not in p, this will let us jump entirely past the current j +m −1 position. We then iterate through p and insert the position where we see a character into the table. If a character occurs more than once, it will be the last position that is in the table because we update the table from left to right.

for (uint32_t k = 0; k < 256; k++) {

jump_table[k] = m;

}

for (uint32_t k = 0; k < m - 1; k++) {

jump_table[pattern[k]] = m - k - 1;

}

It should be obvious that the preprocessing is done in O(m).

In our BMH iterator, we need to store the current position of p in x, j, and we need to store the jump table.

struct bmh_match_iter {

const uint8_t *x; uint32_t n;

const uint8_t *p; uint32_t m;

uint32_t jump_table[256];

uint32_t j;

};

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig12_HTML.png — Figure 2-12
The length to jump at a mismatch

When we initialize the iterator, we set j to the first position in x and we compute the jump table for the pattern.

void init_bmh_match_iter(

struct bmh_match_iter *iter,

const uint8_t *x, uint32_t n,

const uint8_t *p, uint32_t m

) {

iter->j = 0;

iter->x = x; iter->n = n;

iter->p = p; iter->m = m;

for (uint32_t k = 0; k < 256; k++) {

iter->jump_table[k] = m;

}

for (uint32_t k = 0; k < m - 1; k++) {

iter->jump_table[pattern[k]] = m - k - 1;

}

The jump table is not heap allocated, we know its size at compile time, so we do not need to free it when we deallocate the iterator.

void dealloc_bmh_match_iter(

struct bmh_match_iter *iter

) {

// Nop

}

When we increment the iterator, we search from the current j, but instead of incrementing the index by one in each iteration, we increment it with the value in the jump table. For each position of j, we try to match p starting from the last character and moving toward the first. If we have a match, we report the matching position and increment j to the position where the next search should start.

bool next_bmh_match(

struct bmh_match_iter *iter,

struct match *match

) {

// Aliasing to make the code easier to read...

const uint8_t *x = iter->x;

const uint8_t *p = iter->p;

uint32_t n = iter->n;

uint32_t m = iter->m;

uint32_t *jump_table = iter->jump_table;

if (m > n) return false;

if (m == 0) return false;

for (uint32_t j = iter->j;

j < n - m + 1;

j += jump_table[x[j + m - 1]]) {

uint32_t i = m - 1;

while (i > 0 && p[i] == x[j + i]) {

i--;

}

if (i == 0 && p[0] == x[j]) {

match->pos = j;

iter->j = j + jump_table[text[j + m - 1]];

return true;

}

return false;

}

To see that the worst-case running time is O(nm), consider a string and a pattern that has only one character, x = aaa · · · a, p = aaa · · · a. With these two strings, we never have a mismatch, the rightmost occurrence of a (excluding the last character in p as we do) is m − 2, so we jump m − 1 − k with k = m – 2, which moves us one position to the right. So at each position in x, we match m characters, which gives us a running time of O(nm). For the best-case running time, consider x = aaa · · · a again but now the pattern p = bbb · · · b. We will always get a mismatch at the first character we see, and then we need to move to the rightmost occurrence of a in p. There isn’t any a in p, so we move p all the way past position j + m − 1. This means that we only look at every mth character and we, therefore, get a running time of O(n/m +m), where the m is for the preprocessing. These two examples are, of course, extreme, but with random strings over a large alphabet, or with natural languages, the rightmost occurrence can be far to the left or even not in the string, and we achieve running times close to the best case bound.

There is another observation we can make that won’t change the worst-case running time but might let us jump faster along x. If we have matched to index i and have a mismatch there, there is no reason to place p at a location where the same character will mismatch. We have the mismatching character and know where the rightmost occurrences of characters are in p, so we can jump p to a position where p[i] and x[ j + i] will match. If the rightmost occurrence of x[ j + i] is at position k in p, then we can jump by i − k; see Figure 2-13. If the rightmost occurrence of x[ j + i] is to the right of i in p, then the jump would be negative, which we do not want since that moves our search backward and can in the worst case lead to an infinite loop. But if we jump the maximal length given both of the rules above, then the jump rules will always move us at least one character forward.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig13_HTML.png — Figure 2-13
Second jump rule for BMH

We cannot make a jump table for this rule since the length we have to jump depends on i as well as the character where the mismatch occurs. So instead we store the “rightmost occurrence” array in the iterator. We can compute both jump rules from this. We need a way to handle characters that are not in the pattern, so we use a signed value. That way, these characters can have index -1. Here, we assume that the length of the pattern cannot be more than half of the string we are searching in, but this is not an unreasonable assumption and is unlikely ever to be a problem.

struct bmh_match_iter {

const uint8_t *x; uint32_t n;

const uint8_t *p; uint32_t m;

// Signed so we can indicate no occurrence

int32_t rightmost[256];

uint32_t j;

};

Computing the array is straightforward. We initialize the array with -1 which is what the entries should be if a character is not found in p. We then run from left to right and insert indices by their character.

void init_bmh_match_iter(

struct bmh_match_iter *iter,

const uint8_t *x, uint32_t n,

const uint8_t *p, uint32_t m

) {

iter->j = 0;

iter->x = x; iter->n = n;

iter->p = p; iter->m = m;

for (uint32_t k = 0; k < 256; k++) {

iter->rightmost[k] = -1;

}

for (uint32_t k = 0; k < m - 1; k++) {

iter->rightmost[pattern[k]] = k;

}

The expression for jumping occurs twice in the algorithm but is quite cumbersome to write and not informative about what is really happing, so it is a good idea to define a macro to handle it.

static inline uint32_t MAX(uint32_t a, uint32_t b) {

return (((a) > (b)) ? (a) : (b));

}

#define BMH_JUMP()

MAX(i - iter->rightmost[x[j + i]],

(int32_t)m - iter->rightmost[x[j + m - 1]] - 1)

The various variables in the macro are not arguments but hardwired to be used inside the algorithm. This makes it easier to see what the intent of the macro is inside the function and is an approach I will often take in this book.

The function that increments the iterator uses the macro instead of the jump table from earlier, it uses a signed value for i so we can handle -1 when we get it from the rightmost array, but otherwise, there are no changes compared to the version from earlier.

bool next_bmh_match(

struct bmh_match_iter *iter,

struct match *match

) {

// Aliasing to make the code easier to read...

const uint8_t *x = iter->x;

const uint8_t *p = iter->p;

uint32_t n = iter->n;

uint32_t m = iter->m;

int32_t *rightmost = iter->rightmost;

if (m > n) return false;

if (m == 0) return false;

// We need to handle negative numbers, and we have already

// assumed that indices into the pattern can fit into

// this type.

int32_t i = m - 1;

for (uint32_t j = iter->j;

j < n - m + 1;

j += BMH_JUMP()) {

i = m - 1;

while (i > 0 && p[i] == x[j+i]) {

i--;

}

if (i == 0 && p[0] == x[j]) {

match->pos = j;

iter->j = j + BMH_JUMP();

return true;

}

return false;

}

Extended rightmost table

The two components work well for jumping along x, but the first only looks at the last match of p in x and the other only contributes to jumping if the rightmost occurrence of the mismatched character is to the left of i. We can do better than this and jump to the rightmost position to the left of i every time we have a mismatch. To do this, we need a table where we can look up by character and by index. If the alphabet has size k, we would have a k × m table. We can do this as still be in O(nm) worst case and O(n/m + m) running time, but we can save some space by having a linked list per character, and look up indices in it. Let us add such an array of lists to the iterator:

struct index_linked_list {

struct index_linked_list *next;

uint32_t data;

};

struct bmh_match_iter {

const uint8_t *x; uint32_t n;

const uint8_t *p; uint32_t m;

int32_t rightmost[256];

struct index_linked_list *rightmost_table[256];

uint32_t j;

};

There is nothing unique in how a linked list is implemented. See the Appendix. We will need prepend to lists and to free them, so we write functions for that.

static inline struct index_linked_list *

new_index_link(

uint32_t val,

struct index_linked_list *tail

) {

struct index_linked_list *link =

malloc(sizeof(struct index_linked_list));

link->data = val; link->next = tail;

return link;

}

void free_index_list(

struct index_linked_list *list

) {

while (list) {

struct index_linked_list *next = list->next;

free(list);

list = next;

}

When we initialize an iterator, we append each position to the list found at the position’s character. When we are done, each list will contain all the occurrences of the character they are associated with.

void init_bmh_match_iter(

struct bmh_match_iter *iter,

const uint8_t *x, uint32_t n,

const uint8_t *p, uint32_t m

) {

iter->j = 0;

iter->x = x; iter->n = n;

iter->p = p; iter->m = m;

for (uint32_t k = 0; k < 256; k++) {

iter->rightmost[k] = -1;

iter->rightmost_table[k] = 0;

}

for (uint32_t k = 0; k < m - 1; k++) {

iter->rightmost[p[k]] = k;

iter->rightmost_table[p[k]] =

new_index_link(k,

iter->rightmost_table[p[k]]);

}

We allocate list links in the initializer, so we must free them again in the deallocator.

void dealloc_bmh_match_iter(

struct bmh_match_iter *iter

) {

for (uint32_t k = 0; k < 256; k++) {

free_index_list(iter->rightmost_table[k]);

}

The positions in each list are in descending order, so if we search for the rightmost occurrence to the left of an index i, we scan until we find a position that is less than i.

static int32_t find_rightmost(

struct index_linked_list *list,

int32_t i

) {

while (list) {

if (list->data < i) {

return list->data;

}

list = list->next;

}

return -1;

}

The iteration function doesn’t change, but the BMH_JUMP() macro does. Instead of the table lookup to find the rightmost occurrence in the entire string, we use the find_rightmost() function . Otherwise, nothing is new.

#define BMH_JUMP()

MAX(i - find_rightmost(

iter->rightmost_table[text[j+i]], i),

(int32_t)m - iter->rightmost[text[j+m-1]] - 1)

You might object, now, that the search in the lists is not constant time, so the running time now potentially exceeds O(nm + m) in the worst case. To see that this isn’t so, consider how many links we have to search through. The only indices that are larger than i are those m − i we scanned past before a jump. The first link after that will have an index smaller than i and we return that immediately. This means that the search in the list is not more expensive than the scan we just did in the string, so the running time is at most twice as many operations as without the lists, so worst case O(nm + m) and best case O(n/m + m).

Boyer-Moore

The Boyer-Moore (BM) algorithm adds two additional jump rules to the BMH algorithm. These exploit that we have information about a suffix of the pattern that we have matched against a substring of x. If we have matched the pattern suffix p[i, m] against x[ j + i, j + m], then we can use knowledge about where p[i, m] occurs, or partly occurs, in the pattern. We have one rule for when p[i, m] occurs somewhere in p at some p[k, k + m − i]. If one or more of such substrings exist, then we should move the rightmost occurrence such that it aligns with the matched part. For p to match in x, it must at least match on the p[i, m] = p[k, k + m − i]. Picking the rightmost matching substring means that we move the minimal distance where such a match is possible, guaranteeing that we do not skip past a potential match. Since we have a mismatch at p[i − 1] ≠ x[ j + i − 1], we will also require that p[i − 1] ≠ p[k − 1]. Without this requirement, we might shift to a position where we get exactly the same mismatch in the next comparison. See Figure 2-14 for the intuition for the first jump rule.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig14_HTML.png — Figure 2-14
Jump rule one

As you can see from the figure, we shift p to the right to match up a substring with a suffix. This suffix is a border of the string p[k, n]. The border array we used in the first, nontrivial, algorithm gives us, for each i the longest border of p[0, i + 1], and this border is a prefix of the string. We need a suffix, so we use a corresponding border array computed from the right. We call it the reversed border array . Since we want the preceding characters to mismatch, we modify the reversed border array the same way as for the usual border array, but from the right to take care of preceding characters. Let us call this the restricted reversed border array , just to have something to call it. We shall use a variant of this array to compute the first jump table. There will not always be an occurrence of p[i, m] to the left of the suffix p[i, m], in which case this jump rule cannot be used. When this is the case, we set the jump rule to zero. Since the character rules in the BMH algorithm ensure that we always step at least one position to the right, and we will take the largest step possible between the two rules in this section and the two rules in the previous section, we will always move forward.

The second jump table is used when the string p[i, m] does not occur to the left of the suffix. If we cannot match such an occurrence against x[ j + i, j + m], then we can try to match a suffix of x[ j + i, j + m] against a border of p; see Figure 2-15. If there is no nontrivial border of p, we set the jump table distance to zero. The character-based rules will ensure that we always move forward by at least one character.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig15_HTML.png — Figure 2-15
Jump rule two

Jump rule one

To build the jump table for the first case, it might be tempting to try to use a border array to build a jump table. For each position i in the string p, the border array tells us how long a match we have with a prefix, that is, the length of the longest string that is both a prefix and a suffix of p[0, i]. If we build a border array from the right instead of from the left, we know, for each position i, the length of the longest string that is both a prefix and a suffix of p[i, m].

void compute_reverse_border_array(

uint32_t *rba,

const uint8_t *x,

uint32_t m

) {

rba[m - 1] = 0;

for (int32_t i = m - 2; i >= 0; --i) {

uint32_t b = rba[i+1];

while (b > 0 && x[i] != x[m - 1 - b])

b = rba[m - b];

rba[i] = (x[i] == x[m - 1 - b]) ? b + 1 : 0;

}

Here, it is easy to modify the algorithm to run from right to left instead of left to right, but in general, if you want to have a border-like structure where you can compute the left-to-right version and want the right-to-left version, you can reverse the string, build the left-to-right array, and then reverse that array. For the reverse border array , it would look like this:

void compute_border_array(

const uint8_t *x,

uint32_t m,

uint32_t *ba

) {

ba[0] = 0;

for (uint32_t i = 1; i < m; ++i) {

uint32_t b = ba[i - 1];

while (b > 0 && x[i] != x[b])

b = ba[b - 1];

ba[i] = (x[i] == x[b]) ? b + 1 : 0;

}

static void intarray_rev_n(uint32_t *x, uint32_t n)

{

uint32_t *y = x + n - 1;

while (x < y) {

uint32_t tmp = *y;

*y = *x;

*x = tmp;

x++ ; y--;

}

void compute_reverse_border_array(

const uint8_t *x,

uint32_t m,

uint32_t *rba

) {

uint8_t x_copy[m];

strncpy((char *)x_copy, (char *)x, m);

str_inplace_rev_n(x_copy, m);

computed_border_array(x_copy, m, rba);

intarray_rev_n(rba, m);

}

Here, I have split the computation into multiple functions to make it clear what each step is. I prefer the reverse-compute-reverse strategy in most cases, because though it is slightly less efficient it greatly minimizes the work necessary to implement both directions and reduces the risk of errors since there are fewer lines of code.

You can also calculate the reversed restricted border array this way. Recall that the restricted border array is the border array where we exclude from the array the borders that are followed by the character p[i + 1]; we only keep borders where the letter that follows them differs. If we reverse the string, compute the restricted border array, and then reverse the result, we get the restricted reversed border array :

void computed_restricted_border_array(

const uint8_t *x,

uint32_t m,

uint32_t *ba

) {

compute_border_array(x, m, ba);

for (uint32_t i = 0; i < m - 1; i++) {

if (ba[i] > 0 && x[ba[i]] == x[i + 1])

ba[i] = ba[ba[i] - 1];

}

void compute_reverse_restricted_border_array(

const uint8_t *x,

uint32_t m,

uint32_t *rba

) {

uint8_t x_copy[m];

strncpy((char *)x_copy, (char *)x, m);

str_inplace_rev_n(x_copy, m);

computed_restricted_border_array(x_copy, m, rba);

intarray_rev_n(rba, m);

}

Now, when we have the restricted reverse border array, we can scan from left to right and make a pointer from where a suffix ends (where we might have a mismatch in the algorithm) to the position of the border inside the pattern—perhaps something like this (see Figure 2-16):

for (uint32_t i = 0; i < m - 1; i++) {

jump[n - xrba[i] - 1] = n - xrba[i] - i;

}

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig16_HTML.png — Figure 2-16
Jump table based on reverse borders

We go from left to right and let each suffix that also occurs internally in p and know about the position where it occurs. We do not add the matching part of the suffix to the jump table since we want to jump when we have a mismatch, so we get the index at the position left of the matching suffix. At the matching suffix, we insert in the jump table the length we should jump, n − xrba[i] − i. Because we do this from left to right, if there is more than one occurrence, we will get the rightmost one.

That this algorithm computes the jump table sounds convincing, and I have seen this implemented numerous times, which is why I mention it. There is a problem, however: several borders can end at the same rightmost index. Consider the string dabcacabca. Figure 2-17 shows the reverse border array on the left and the restricted reverse border array on the right (the reverse border array where the previous character differs between the suffix and the border). If we build a jump table from the reverse border array, we would get a jump for the two nonzero values; see the two arrows on the top of Figure 2-18. We would not see the jump at the bottom because it jumps to an index where a longer border starts.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig17_HTML.png — Figure 2-17
Example of backward border array and restricted backward border arrays

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig18_HTML.png — Figure 2-18
Border array jump table

Each index in our pattern can be the endpoint of multiple borders. If we follow the preceding idea, we only set a jump pointer to the longest border. If there are several suffixes of the pattern where the rightmost occurrence in the pattern starts at the same position, then only the longest match will get a jump rule. These borders, however, will have different endpoints; see Figure 2-19.¹ These are unique—two different borders can’t have the same endpoint. Imagine two borders with the same endpoint and consider their start points. Where the shortest starts, there must be a mismatch between the start point and the border at the suffix. If not, the border would be longer and the start point further to the left. This, however, contradicts that the longer border must match, or it would be shorter. See Figure 2-20. If the longer light-gray string is a border, then it must match the “a” in the suffix. The character that precedes the shorter string must therefore also be “a”, contradicting that it could be another character. So while start points for rightmost occurrences of suffixes are not unique, the endpoints are.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig19_HTML.png — Figure 2-19
Point to border endpoints rather than start points

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig20_HTML.png — Figure 2-20
Uniqueness of endpoints

There is an array, called the Z array , that captures the essence of the start point/endpoint difference. The array is very similar to the border array, but at each index i, it stores the length, k, of the longest string p[i, i + k] that starts in index i (not ends, as the border array) and is a prefix of p, that is, p[0, k] = p[i, i + k]; see Figure 2-21. We do not want an array of strings that matches prefixes but rather one that matches suffixes, that is, we want an array Z' where Z'[i] contains the length of the longest substring [i − k, i] that matches a suffix of p, that is, p[i − k, i] = p[n − k, n]. This is the reverse of the Z array of the reversed string p, so we can compute it assuming we have a function compute_z_array() that computes the Z array.

void compute_reverse_z_array(

const uint8_t *x,

uint32_t m,

uint32_t *Z

) {

uint8_t x_copy[m + 1];

strncpy((char *)x_copy, (char *)x, m); x_copy[m] = 0;

str_inplace_rev_n(x_copy, m);

compute_z_array(x_copy, m, Z);

intarray_rev_n(Z, m);

}

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig21_HTML.png — Figure 2-21
The Z array

In the Z array, p[0, Z[i]] = p[i, i + Z[i]] is the longest border of p[0, i + Z[i]]. We cannot extend it to the longest border of p[0, i +Z[i]+1] because p[Z[i]+1] ≠ p[i +Z[i]+1] since otherwise the longest string starting in i that matches a prefix of p would be at least one longer. With the Z array, we get the “restricted border” effect for free in this sense. For the reversed Z array, the same is true but for the letter that precedes the border strings.

Computing the Z array

The trivial algorithm for computing the Z array simply matches the string at each position against the first part of our string, giving us an O(n²) running time.

uint32_t match(

const uint8_t * s1,

const uint8_t * s2

) {

uint32_t n = 0;

while (*s1 && *s2 && (*s1 == *s2)) {

++s1;

++s2;

++n;

}

return n;

}

void trivial_compute_z_array(

const uint8_t *x,

uint32_t n,

uint32_t *Z

) {

Z[0] = 0;

for (uint32_t i = 1; i < n; ++i) {

Z[i] = match(x, x + i);

}

The match() function works on pointers to strings and compares them as long as they haven’t reached the null sentinel and as long as they agree on the current character. If our string consists of only one character, x = aⁿ, then match() will run through the entire string x[i, n] which averaged over all the indices as O(n), giving us a total running time of O(n²). We can do better!

In a linear-time construction algorithm, we iteratively consider indices k = 0, 1, ..., n − 1 and compute Z[k] using the previously computed values. We let l and r denote the leftmost index and one past the right index, respectively, of the rightmost string we have seen so far. As invariants in the algorithm, we have for all k' < k that Z[k'] is computed and available and that l and r index the rightmost string x[l, l + Z[l]] seen so far.

There are three cases to consider; see Figure 2-22. The first is when the index we are computing is to the right of r. We can get in this situation if we have seen a rightmost string pointed to by l and r, but the following ks gave us empty borders. To get Z[k] we must compare x with x + k to get the matching length. If this length is zero, we set Z[k] to zero and move to k + 1. If the match result is greater than zero, we set Z[k] to the value and move l to k and r to k + Z[k]; the rightmost string we have seen is now then one starting in k, and the updated indices will point at it.

In the other two cases, k is between l and r. This means that the string x[l, r] contains information about the string starting in k that we can exploit. Let k' = k − l and r' = r − l. If we look at Z[k']—which, by the invariant, is available—there are two possibilities. Either Z[k'] < r’-k’ = r-k in which case the string starting in k' stops before index r'. This means that there is a mismatch between x[Z[k'] + 1] and x[k' + Z[k'] + 1]; see the middle case in Figure 2-22. Since x[l, r] is a prefix of x, the mismatching character will also follow the string x[k, k + Z[z']] which means that the longest string matching a prefix and starting in k will have the length Z[k']. We update Z[k] = Z[k'] and leave l and r alone; the string pointed to by l and r is still the rightmost.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig22_HTML.png — Figure 2-22
The three cases for the Z array construction algorithm

Case three is when the string starting at index k' continues past index r'. In this case, we know that a prefix of the string matches a suffix of the string x[l, r], but we do not know how much further the prefix will match the string starting at index k. To find out where, we must do a character-by-character match. We do not need to start this search at index 1 and k, however. We know that the first r − k characters match, so we can start our match at indices r − k and r; see case three in Figure 2-22. When we have found the right string, we update the left pointer to point at the start of it, l = k, and we set the right pointer to the end of the string r = k + Z[k].

An implementation can look like this:

void compute_z_array(

const uint8_t *x,

uint32_t n,

uint32_t *Z

) {

Z[0] = 0;

if (n == 1) return; // special case

Z[1] = match(x, x + 1);

uint32_t l = 1;

uint32_t r = l + Z[1];

for (uint32_t k = 2; k < n; ++k) {

// Case 1:

if (k >= r) {

Z[k] = match(x, x + k);

if (Z[k] > 0) { l = k; r = k + Z[k]; }

} else {

uint32_t kk = k - l;

if (Z[kk] < r - k) {

// Case 2:

Z[k] = Z[kk];

} else {

// Case 3

Z[k] = r - k + match(x + r - k, x + r);

l = k;

r = k + Z[k];

}

To see that the running time is linear, first ignore the calls to match() . If we do, we can see that there are a constant number of operations in each case, so without matching, we clearly have a linear running time. To handle match(), we do not consider how much we match in each iteration—something that depends on the string and that we do not have much control over. Instead, we consider the sum of all matches in a run of the algorithm. We never call match() in the second case, so we need only to consider cases one and three. Here, the key observation is that we only search to the right of r and never twice with the same starting position.

In case one, we either have a mismatch on the first character, or we get a new string back. In the first case, we leave l and r alone and immediately move to the next k, which we will use as the starting point for the next match. As long as we are not getting a string back from our matching, we use constant time for match() calls for each k, and we never call match() on the same index twice. If we get a string, we move the right pointer, r, to the end of this string. We will never see the indices to the left of r again because both case one and three never search to the left of r. In the search in case three and a nontrivial result in case one, we always move r past all indices we have started a search in earlier.

Z-based jump table

The jump rule needs to slide p to the rightmost position where we potentially have a match. If i is where the rightmost occurrence sits, then index i − rZ[i] is the character just before the start of the occurrence. If there is no occurrence, we set the jump distance to zero, which leaves it up to one of the other jump rules. The prefix it matches starts at position n − rZ[i], and it is when we have a mismatch p[i − rZ[i]] ≠ p[n − rZ[i] − 1] we should move p. Therefore, we want to store the jump distance for index n−rZ[i]−1 in our jump table. The distance we need to jump is the one that places i −rZ[i] at position n −rZ[i]−1, so n −rZ[i]−q −(i −rZ[i]) = n − i − 1. See Figure 2-23 for an illustration of the jump rule and Figure 2-24 for a concrete example.

We build the jump table as follows: We first set all the entries to zero—the default that will give the other jump rules control of the jump—and after that, we compute the reverse Z array and set the jump values as described earlier.

uint32_t rZ[m];

compute_reverse_z_array(iter->pattern, m, rZ);

uint32_t jump1[m];

for (uint32_t i = 0; i < m; i++) {

jump1[i] = 0;

}

for (uint32_t i = 0; i < m; i++) {

jump1[m - rZ[i] - 1] = m - i - 1;

}

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig23_HTML.png — Figure 2-23
First jump rule for Boyer-Moore

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig24_HTML.png — Figure 2-24
Boyer-Moore jump table one example

We do this in the iterator initialization . We will see the full initialization later when we have seen the second jump table as well.

We do not test for whether the reverse Z value is zero. In this case, we want the jump to be zero (so we will use one of the other jump rules), but whenever Z is zero, the loop will update the last index, n − rZ[i] − 1 = n − 1. In the final iteration, we will write n − i − 1 = n − (n − 1) − 1 = 0 there, exactly as desired.

Second jump table

The second jump rule is used where there are no occurrences of the matched string inside the pattern. When this is the case, we should move the minimal amount necessary to match a prefix of the pattern with a suffix of the matched text; see Figure 2-15. If we have the border array for the pattern, then the longest border of the entire string is ba[m − 1], the second-longest border is ba[ba[m − 1]], and so on. When we have a mismatch, we need to jump such that the longest border possible matches the longest suffix of the text we already matched. If we have matched more than ba[m − 1] characters, then we should align the longest border with the matched string. If we have matched less than ba[m − 1], then we should not align this border; it will only give us a mismatch at the same position we just mismatched on. Instead, we should use the second-longest border for the jump. In general, every time we have a mismatch between borders b_i and b_i+1 (with b_i+1 the shortest), then we should use the shorter of the two, b_i+1; see Figure 2-25. The distance we need to move when using border b is m minus its length, that is, m − b.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig25_HTML.png — Figure 2-25
Jump ranges for jump rule two

We include the border of size zero in this preprocessing. It will guarantee us that if we have matched a string that cannot be matched anywhere else in the pattern, then we skip the entire string past the current attempted match.

Computing this jump table, once we have the border array, is straightforward.

uint32_t ba[m];

compute_border_array(iter->pattern, m, ba);

uint32_t jump2[m];

uint32_t b = ba[m - 1];

uint32_t jump = m - b;

for (uint32_t i = 0; i < m; i++) {

if (i > b) {

b = ba[b];

jump = m - b;

}

jump2[i] = jump;

}

Combining the jump rules

We cannot take the maximum jump of the two jump tables here, unlike in the BMH algorithm where we can jump the maximum number of characters given by the two rules. We should not move to a border of the full string if there is an internal string that matches. This means that we should only use the second rule if we cannot use the first, that is, we use jump table two when jump table one is zero. We can combine the two jump tables, and only use the second if the first is zero, in this way:

// Combine the jump tables

iter->jump = malloc(m * sizeof(uint32_t));

for (uint32_t i = 0; i < m; ++i) {

iter->jump[i] = jump1[i] ? jump1[i] : jump2[i];

}

With all these jump tables, the Boyer-Moore algorithm is more complicated than the previous algorithms. Still, if we put the border and Z array functionality in separate functions, then the BM iterator is not overly complex to initialize and use.

Let us combine everything we have seen. We add a jump table to the iterator:

struct bm_match_iter {

const uint8_t *x; uint32_t n;

const uint8_t *p; uint32_t m;

int32_t rightmost[256];

struct index_linked_list *rightmost_table[256];

uint32_t *jump;

uint32_t j;

};

When we initialize the iterator, we compute the two tables from the BMH algorithm; then we compute the two jump tables, using a reversed Z array and a reversed border array, respectively; and finally we combine the two tables.

void init_bm_match_iter(

struct bm_match_iter *iter,

const uint8_t *x, uint32_t n,

const uint8_t *p, uint32_t m

) {

iter->j = 0;

iter->x = x; iter->n = n;

iter->p = p; iter->m = m;

for (uint32_t k = 0; k < 256; k++) {

iter->rightmost[k] = -1;

iter->rightmost_table[k] = 0;

}

for (uint32_t k = 0; k < m - 1; k++) {

iter->rightmost[p[k]] = k;

iter->rightmost_table[p[k]] =

new_index_link(k,

iter->rightmost_table[p[k]]);

}

uint32_t jump1[m];

uint32_t jump2[m];

for (uint32_t i = 0; i < m; i++) {

jump1[i] = 0;

}

uint32_t rZ[m];

compute_reverse_z_array(iter->p, m, rZ);

for (uint32_t i = 0; i < m; i++) {

// We don't have to check if rZ[i] = 0.

// There, we will always write into n-0-1,

// i.e., the last character in the string.

// For the last index we set this to n - i - 1

// which is zero. When this jump is zero,

// one of the other rules will be used.

jump1[m - rZ[i] - 1] = m - i - 1;

}

for (uint32_t i = 0; i < m; i++) {

jump2[i] = 0;

}

uint32_t ba[m];

compute_border_array(iter->p, m, ba);

// Combine the jump tables

iter->jump = malloc(m * sizeof(uint32_t));

for (uint32_t i = 0; i < m; ++i) {

iter->jump[i] = jump1[i] ? jump1[i] : jump2[i];

}

We use a macro for jumping; in this case, we want the maximum of the jump table increment and the increment we get from the BMH tables. The function that increments the iterator should use this macro instead of BMH_JUMP(), but otherwise there are no changes.

#define BM_JUMP() MAX(iter->jump[i], BMH_JUMP())

This is the only change we need to make to the next_bmh_match() function to get next_bm_match() :

bool next_bm_match(

struct bm_match_iter *iter,

struct match *match

) {

// Aliasing to make the code easier to read...

const uint8_t *x = iter->x;

const uint8_t *p = iter->p;

uint32_t n = iter->n;

uint32_t m = iter->m;

if (m > n) return false;

if (m == 0) return false;

// We need to handle negative numbers, and we have already

// assumed that indices into the pattern can fit into

// this type.

int32_t i = m - 1;

for (uint32_t j = iter->j; j < n - m + 1; j += BM_JUMP()) {

i = m - 1;

while (i > 0 && p[i] == x[j + i]) {

i--;

}

if (i == 0 && p[0] == x[j]) {

match->pos = j;

iter->j = j + BM_JUMP();

return true;

}

return false;

}

Since we allocate the jump table in the initialize, we need to free it in the deallocation function.

void dealloc_bm_match_iter(

struct bm_match_iter *iter

) {

for (uint32_t k = 0; k < 256; k++) {

free_index_list(iter->rightmost_table[k]);

}

free(iter->jump);

}

Aho-Corasick

The Aho-Corasick algorithm differs from the previous algorithms in that it does not only search for a single pattern but search simultaneously for a set of patterns. The algorithm uses a data structure, a trie (short for retrieval), that can store a set of strings and provide efficient lookups to determine if a given string is in the set.

Tries

A trie, also known as a prefix tree , is a tree where each edge holds a letter and where no child has more than one out edge with the same letter. Going from the root and down, you can read off all prefixes of one or more of the strings in the set, and when you have seen a full string from the set, you reach a node that is tagged with the string label from the set; see Figure 2-26.

If you want all your strings to sit in leaves—something that can be useful in certain algorithms—then you can add a sentinel to the strings; see Figure 2-27. This will place all strings in leaves, and if the strings are unique, you will get a one-to-one mapping between the leaves of the tree and the strings it contains. If you have duplicated strings, they will still end up in leaves, but there will no longer be a one-to-one mapping. No amount of trickery will prevent identical strings from ending up in the same node, so some leaves will correspond to multiple strings.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig26_HTML.png — Figure 2-26
The trie data structure

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig27_HTML.png — Figure 2-27
A trie with sentinel strings

One way to represent a trie is to have nodes that store the letter on their incoming edges, a string label if the node corresponds to a string in the set, a pointer to its parent, and pointers to its siblings and its children.

struct trie {

uint8_t in_edge_label;

int string_label;

struct trie *parent;

struct trie *sibling;

struct trie *children;

};

The parent pointer isn’t needed in many algorithms, but we need it in the Aho-Corasick algorithm, so I have included it here. As an example, Figure 2-28 is the representation of the trie in Figure 2-26. Notice that the child pointer points to tries. This is because all sub-tries are also tries themselves.

When we initialize a trie, we set the edge label to zero—we will change it to the correct label later—and we set the string label to minus one, the default number that indicates that the path to the node is not a string in the set the trie stores. The three pointers are set to default values: null.

void init_trie(

struct trie *trie

) {

trie->in_edge_label = '';

trie->string_label = -1;

trie->parent = 0;

trie->sibling = 0;

trie->children = 0;

}

struct trie *alloc_trie(void)

{

struct trie *trie =

malloc(sizeof(struct trie));

init_trie(trie);

return trie;

}

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig28_HTML.png — Figure 2-28
Representing a trie in C

I have written two functions for initializing tries, one that assumes that you have already allocated a root struct trie structure and one that does it for you. When you want a new trie, you create its root using one of these functions. The root has no siblings, but it will have children once we add strings to it.

When we deallocate a trie, we recursively delete children and siblings. We free them because the construct algorithms—see the following function—heap allocate the subnodes.

void dealloc_trie(

struct trie *trie

) {

// Depth-first traversal freeing the trie.

if (trie->children)

free_trie(trie->children);

if (trie->sibling)

free_trie(trie->sibling);

}

void free_trie(

struct trie *trie

) {

dealloc_trie(trie);

free(trie);

}

To construct a trie, we insert the input set string by string. When we add the string p_i, we search down the trie, T, and either find an existing node at the end of p_i, or we find that we cannot continue searching beyond some point k, that is, p_i[0, k] matches, but there is no edge with character label p_i[k]. In the first case, we set the node’s string_label to i, and in the second case, we must insert the string p_i[k, m] in the node we reached before we couldn’t go any longer. So the straightforward approach is to find out where the string sits or where it branches off the trie and insert it there.

Whenever we need to add a substring to a node, the trie we add is a string of nodes with no branching. We can build such a trie using the following function:

static struct trie *

string_to_trie(

const uint8_t *str,

int string_label

) {

const uint8_t *s = str;

while (*s) s++;

struct trie *trie = 0;

do {

s--;

struct trie *new_node = alloc_trie();

new_node->in_edge_label = *s;

new_node->string_label = string_label;

new_node->children = trie;

if (trie) trie->parent = new_node;

trie = new_node;

string_label = -1; // so we only label the leaf...

} while (s != str);

return trie;

}

It starts from the end of the string and iteratively constructs a node for each letter, and sets that node’s child to the previous node we created. It sets the string label in the root and then updates the string_label variable , so the remaining nodes will be set to -1 indicating that they are not representing a string.

Constructing a trie from a string is useful when we build a trie from a single sequence or when we need to add the suffix of a string to an existing node. To search, we need to find which edge to follow for each node in the trie. The following function does that by iterationg through all children w. The w variable is set to the child of the node, and we iterate through the children via their sibling pointers.

struct trie *out_link(

struct trie *v,

uint8_t label

) {

for (struct trie *w = v->children;

w; w = w->sibling) {

if (w->in_edge_label == label)

return w;

}

return 0;

}

With these two helper functions, we can write a function for adding a string to a trie. It will first check if the trie has any children. If not, we build a trie for the string and insert it as the child of the original node. If there are children, we start the search. For each letter in the string, we get the out edge of the current node and update the current node to be the output child. We abort the loop if we find a mismatch or reach the end of our string. If we reach the end of the string—this happens when *str == 0—then we set the string label. If we do not reach the end of the string, we have found a node that doesn’t have the next character as an edge label. We can take the suffix of the pattern after the mismatch and build a (single string) trie from it and then add it to the children of the node where we got the mismatch.

void add_string_to_trie(

struct trie *trie,

const uint8_t *str,

int string_label

) {

if (!trie->children) { // first string is a special case

trie->children = string_to_trie(str, string_label);

trie->children->parent = trie;

return;

}

while (*str) {

struct trie *child = out_link(trie, *str);

if (!child) {

break;

} else {

trie = child;

str++;

}

if (*str == '') {

// The string was already in the trie --

// update with label.

// We only allow this when the

// string wasn't already inserted!

assert(trie->string_label < 0);

trie->string_label = string_label;

} else {

// Insert new suffix as a child of parent

struct trie *new_suffix =

string_to_trie(str, string_label);

new_suffix->sibling = trie->children;

trie->children = new_suffix;

new_suffix->parent = trie;

}

If you want to know if a string is in the trie, then you can search down in it. If there is a node where you cannot continue, the string is not in the trie. On the other hand, if you reach the end of the string, you are in a node. If this node has a string label, then the string is in the trie; otherwise, it is not. The following two functions implement this idea:

struct trie *get_trie_node(

struct trie *trie,

const uint8_t *str

) {

if (!trie->children) return 0;

while (*str) {

struct trie *child = out_link(trie, *str);

if (!child) {

return 0; // we can't find the string

} else {

trie = child;

str++;

}

return trie;

}

bool string_in_trie(

struct trie *trie,

const uint8_t *str

) {

struct trie *t = get_trie_node(trie, str);

return t && (t->string_label >= 0);

}

I split them into two functions because you sometimes want to get the node where a string is so you can do something with it. You cannot use get_trie_node() directly as a test. It returns a node or null, so you can test if there is a node on the path given by the string, but to test if it is in the trie’s set of strings, you must also check the string label. The string_in_trie() function does that.

First, assume that we can always find the out edge with a given symbol in constant time. We use linked lists for this, so on the surface this doesn’t seem to be the case, but we will always assume that the alphabet is of constant size, making a list search a constant time operation. When we insert pattern p_i of length m_i, we will in worst case search down m_i steps down the trie, so if m = ∑_im_i, then the running time for constructing a trie is O(m).

The Aho-Corasick (AC) algorithm works similarly to the KMP algorithm. It scans along the string x (using an index j) and searches down the trie at the same time. We increment the index j every time we see a match in the trie—and never decrement it—and we move the trie along x every time we have a mismatch in the trie; see Figure 2-29.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig29_HTML.png — Figure 2-29
Matching and mismatching in Aho-Corasick

When we match along x, we move down the trie, matching character by character. All strings below the current position in the trie can potentially match x. If we have a mismatch, we need to move the trie to the right to the next position where we can potentially match. Consider any string you can read by concatenating the edge labels from the root and down the trie. That is, consider any string in the trie but not necessarily one labelled with the strings the trie contains. For such a string, p, let f (p) denote the longest proper suffix of p that you can also find in the trie. We call this mapping from p to f (p) the failure link of p because we will use it when we fail at matching deeper into the trie than p.

This definition of failure links sounds more complicated than the function really is. It matches the borders we used in the algorithms earlier, except that it uses a trie form of borders. When we have a mismatch, we want to move the trie to the right such that we have a match between the prefix of some of the strings in the trie and the part of x we are matching against. If you take p and all its suffixes, p[1, m], p[2, m], p[3, m], …, p[m, m], then f (p) is the longest that you can find in the trie, that is, the longest of the suffixes such that if you search down the trie, you will reach the end of the string before you see a mismatch. For each node in the trie, v, we have a mapping from the string it represents and that string’s failure link. In the following, I have listed failure links for the leaves in the trie in Figure 2-29, and for the inner nodes, we jump from when we mismatch.

p	f (p)
aac	?
abbcb	b
ba	a
bba	ba
bbc	c
ca	a
abb	bb
bb	b
B	?
abba	bba

Recall that ? denotes the empty string.

In the Aho-Corasick algorithm, we will represent failure links as pointers from each node to its failure link node. For nodes where the failure link is empty, we will set the pointer to point at the root of the trie. We will compute the failure nodes in a preprocessing step. Whenever we have a mismatch in the algorithm, we will move from the current node to its failure link—conceptually moving the trie further along x.

A version of the algorithm could look like this:

void aho_corasick_match(

const char *x,

uint32_t n,

struct trie *patterns

) {

uint32_t j = 0;

struct trie *v = patterns;

while (j < n) {

struct trie *w = out_link(v, x[j]);

while (w) {

// The matching part

if (w->string_label >= 0) {

// String hits->string_label ends in

// index j. If we know the length

// of hits->string_label, we could

// report the beginning.

// We will do so in the iterator

// code.

REPORT(w->string_label, j);

}

v = w;

j++;

w = out_link(v, x[j]);

}

// When we get here, we do not match

// any longer

if (is_trie_root(v)) {

j++;

} else {

v = v->failure_link;

}

We search down the trie by, at each step, getting the out edge that matches the x[ j] we are looking at. The function we use is this:

struct trie *out_link(

struct trie *v,

uint8_t label

) {

for (struct trie *w = v->children; w; w = w->sibling) {

if (w->in_edge_label == label)

return w;

}

return 0;

}

When we see a match to one of the strings in the trie, that is, a node with a nonnegative string label, we report a match. When it happens, j points to the end of the match and that is what we report (we will change that at the end of the chapter). When we have a mismatch, that is, we cannot find a node with the out label we want, then the matching phase ends and we need to move the matching trie. We have a special case if we are at the root. There, the failure link goes back to the root again, and we would not move if we used it. So if we are at the root, we increase j instead of jumping in the trie. We test if we are at the root by testing whether the trie has a parent:

static inline bool is_trie_root(

struct trie *trie

) {

return trie->parent == 0;

}

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig30_HTML.png — Figure 2-30
Problem with missing matches when one string is a substring of another

If there are no strings that are substrings of others, this algorithm works. However, when one string is a substring of another, the algorithm can skip the string completely. Consider Figure 2-30. We have matched abbcb, and we will then have a mismatch (because we cannot go any further from that node in the trie). The mismatch means we jump to the failure link, which is the string b. It is from here we will continue our search. We are now in the sub-trie with the root b. Not all nodes in the trie will have a string label but assume for the example that bb has. That means that we should report occurrences of it, and if the next character is b we will do so. However, we have already encountered it in the string we have scanned, but we never reported it there because we were elsewhere in the trie when we encountered it, specifically in the node abb. With the shift from the failure link, we have moved past that position entirely, and we are not coming back. Substrings are a problem.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig31_HTML.png — Figure 2-31
Output lists and matches

What we want to do is output all strings that match at a position, and all such strings can be found by traversing the failure links of the nodes; see Figure 2-31. All strings that match from a node we have reached in the trie must be suffixes of the current string in the trie, and we can get all of them by running through the failure link.

When we match the first character in abbcb, we do not match any additional string in the trie, but when we match ab, we also have a match of b that ends in this position. We can get that from following the failure link. When we match abb, we should also report a match of bb and b, and we can get those by following the failure link twice, first for bb and then for b. When we match c, we should report c, and again we can see this using the failure link. Finally, when we match the last b in the string, we should report b as well as the full string. Unlike in this example, we do not want to output all nodes we can find from the failure links but only those with a nonnegative string label. So we skip those that do not by pointing to the next that does. The list we get from doing this is what we call the output list .

For each string v in the tree, you can go through the failure links v, f (v), f²(v), ..., fⁿ (v) from the node up to the root. Each link, by definition, is closer to the root than the previous, so we will eventually get there. From the nodes you see on this path, extract those with a nonnegative string label. That is the output list.

To see that we output all matches that end at index j when we run through the output list, observe that when we move through the failure links, we get all the strings that match a suffix of the string we are looking at. To see that we do not miss any strings we should emit a match for, observe that every time we jump a failure link, we get the longest possible match, and therefore we cannot jump over another match.

Adding the output lists to the search algorithm, we get this:

void aho_corasick_match(

const uint8_t *x,

uint32_t n,

struct trie *patterns

) {

uint32_t j = 0;

struct trie *v = patterns;

while (j < n) {

struct trie *w = out_link(v, x[j]);

while (w) {

for (struct output_list *hits = w->output;

hits != 0;

hits = hits->next) {

// hits->string_label ends in j

REPORT(hits->string_label, j);

}

v = w;

j++;

w = out_link(v, x[j]);

}

if (is_trie_root(v)) {

j++;

} else {

v = v->failure_link;

}

This time we do the traversal as before, but in each node, we run through the output list before we do anything else. We do not check the string label except for reporting the output; we know that we only have hits in the output list. Except for the loop through the output list, there is nothing new in the function.

Preprocessing

Before we see the final version of the algorithm, we will see how to add the failure link and the output list to the trie data structure. First, of course, we need to add them to the trie structure :

struct output_list {

int string_label;

struct output_list *next;

};

struct trie {

uint8_t in_edge_label;

int string_label;

struct trie *parent;

struct trie *sibling;

struct trie *children;

// For Aho-Corasick

struct trie *failure_link;

struct output_list *output;

};

We initialize them with null pointers:

void init_trie(struct trie *trie)

{

trie->in_edge_label = '';

trie->string_label = -1;

trie->parent = 0;

trie->sibling = 0;

trie->children = 0;

// For Aho-Corasick

trie->failure_link = 0;

trie->output = 0;

}

To set all failure links and the output lists, we will traverse the trie in a breadth-first manner. In that way, whenever we see a node in the trie, its parent and all the nodes closer to the root will already have their failure link and output list set.

Consider a node, v, and let p(v) be its parent and f (p(v)) be the failure link of v’s parent. Node v is the string p(v) followed by some letter a (see Figure 2-32 a). The failure link of v must be a suffix of p(v) followed by a. It cannot be a longer string since this would contradict that f (p(v)) is the longest suffix of p(v) that is in the trie; we would be able to get a longer one by adding the first part of the failure link of v (see Figure 2-32 b).

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig32_HTML.png — Figure 2-32
Relationship between v and p(v)

Therefore, f (v) must start with a suffix of p(v) (Figure 2-32 c). It might not be the longest suffix of p(v) with an a concatenated—there might not be an out edge of f (p(v)) with label a, but it will be some suffix, and we can get all suffixes of p(v) following failure links, and we want to pick the longest one.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig33_HTML.png — Figure 2-33
Jumping the failure link of a parent to get the failure link of a child

To set the failure link for node v, we first exploit that the parent of the node will have a failure link, so we can jump to the node it points to, f (p(v)). If we can find an out edge there that matches a, then the node at the end of the out edge will be v’s failure link. We have the longest suffix of p(v) that is in the trie, and we must have a suffix of p(v) as the initial sequence to the failure link of v, f (v). If we can extend it with a, we have the longest suffix of p(v) plus a so the longest suffix of v in the trie; see Figure 2-33. If we cannot extend f (p(v)), then we take the next longest suffix of p(v) in the trie, f ( f (p(v))), and try there, and we continue following failure links until we find one that we can extend. If we handle nodes in a breadth-first manner, we are guaranteed that all nodes closer to the root than v will have their failure link set, so we have f (p(v)) and all fⁿ (p(v)) set since these will be higher in the trie; they are suffixes of p(v) and must, therefore, be shorter strings and thus higher in the trie. If there are no places we can extend, then the only option we have is to set the failure link to the root.

To implement a breadth-first traversal of the trie, we need a queue structure. The pointer_queue data structure can be found in the Appendix; the operations you can do on it are what you would expect of a queue, that is, it is a first-in, first-out data structure, enqueue_pointer() adds a link to a pointer to the back of the queue, pointer_queue_front() gives you the front element of the queue, and dequeue_pointer() removes the first element from the queue. When we need to do a breadth-first traversal of the trie, we need to add all siblings of a node when we reach a new node, and for that, we use a function, enqueue_siblings() :

void enqueue_siblings(

struct pointer_queue *queue,

struct trie *siblings

) {

for (struct trie *s = siblings; s; s = s->sibling)

enqueue_pointer(queue, (void*)s);

}

To insert all children of a node, you can call enqueue_siblings() on its first child.

We use it in the function compute_failure_link_for_node() that handles the breadth-first traversal. It creates and later frees the queue, inserts the children of the root to start the traversal with, and then continues to handle nodes as long as there are nodes in the queue.

void compute_failure_links(

struct trie *trie

) {

// We don't want to recompute them if we

// already have set up the failure links.

if (trie->failure_link) return;

// Make the root its own failure link.

trie->failure_link = trie;

struct pointer_queue *nodes = alloc_pointer_queue();

enqueue_siblings(nodes, trie->children);

while (!is_pointer_queue_empty(nodes)) {

struct trie *v =

(struct trie *)pointer_queue_front(nodes);

dequeue_pointer(nodes);

compute_failure_link_for_node(v, trie, nodes);

}

free_pointer_queue(nodes);

}

It is in the compute_failure_link_node() we do the real work—setting the failure link and output list for a specific node. This is where we search the failure links of the parent to find one we can extend and also where we will set the output list.

void compute_failure_link_for_node(

struct trie *v,

struct trie *root,

struct pointer_queue *queue

) {

// Breadth-first traversal...

enqueue_siblings(queue, v->children);

if (is_trie_root(v->parent)) {

// Special case: immediate children of the

// root should have the root as parent.

v->failure_link = v->parent;

} else {

uint8_t label = v->in_edge_label;

struct trie *w = v->parent->failure_link;

struct trie *out = out_link(w, label);

while (!out && !is_trie_root(w)) {

w = w->failure_link;

out = out_link(w, label);

}

if (out) {

v->failure_link = out;

} else {

v->failure_link = root;

}

// Compute output list

if (v->string_label >= 0) {

v->output = new_output_link(v->string_label,

v->failure_link->output);

} else {

v->output = v->failure_link->output;

}

For the output list, observe that we will have to output all the strings in the output link of the failure link of v—those are the strings that are suffixes of v with a string label. If v has a string label we must also output it, so in that case we prepend v’s label to the list; otherwise, we take the output of f (v).

In the traversal, we use two helper functions: is_trie_root() and new_output_link(). We have seen is_trie_root() before but not new_output_link(). It looks like this:

struct output_list *

new_output_link(

int label,

struct output_list *next

) {

struct output_list *link =

malloc(sizeof(struct output_list));

link->string_label = label;

link->next = next;

return link;

}

When deallocating a trie, we need to handle the outlink as well as children and siblings. We do not need to scan through the list nodes, however. The output list is a linked list, but there is at most one link per string label, and that is associated with the trie node with that label. We don’t need to handle the rest of the output list since those will be handled when their corresponding trie nodes are deleted.

void dealloc_trie(

struct trie *trie

) {

// Depth-first traversal freeing the trie.

if (trie->children) free_trie(trie->children);

if (trie->sibling) free_trie(trie->sibling);

if (trie->output && trie->string_label >= 0) {

free(trie->output);

}

When we examine the running time for the preprocessing, we will assume that the trie is already built; if you want to include it, just add the construction to this running time. Building the trie can be done in time equal to the total sum of the lengths of strings in it, O(m). Constructing the failure links can be done in the same time.

It isn’t obvious that we can construct the failure links in linear time. For each node v at depth d(v), we can in principle follow d(v) failure links, giving us a running time of the square of the number of nodes in the trie. This, however, is not a tight bound. To see this, consider a node v and the path down to it. When we compute the failure links, we do it breadth-first, but for now consider what amounts to a depth-first traversal. If the failure link is set for v and we need to compute it for a child of v, w, then the node depth of f (w) can at most be one more than the node depth of f (v). When we compute f (w), we might decrease the failure depth by a number of steps but we can only increase it by one. As we move down a depth-first path, we can increase the failure link depth by one at each step, but we cannot decrease it more than we have increased it so far. So the total search for suffix links on such a path is bounded by the depth of the path.

The algorithm with iterators

If we have a global REPORT() function (which we will avoid later), then the search algorithm can look like this:

void aho_corasick_match(

const uint8_t *x,

uint32_t n,

struct trie *patterns

) {

uint32_t j = 0;

struct trie *v = patterns;

while (j < n) {

struct trie *w = out_link(v, x[j]);

while (w) {

for (struct output_list *hits = w->output;

hits != 0;

hits = hits->next) {

// String hits->string_label ends in

// index j. If we know the length

// of hits->string_label, we could

// report the beginning.

// We will do so in the iterator

// code.

REPORT(hits->string_label, j);

}

v = w;

j++;

w = out_link(v, x[j]);

}

if (is_trie_root(v)) {

j++;

} else {

v = v->failure_link;

}

We do not want this type of global reporting function, of course, nor do we want a callback. They make it hard for others to use our code. Again we want an iterator. We will initialize the iterator with a trie that is already constructed, in case the user of the algorithm needs to use the trie on several strings or for other purposes and does not want to create the trie anew each time. We also want to know the pattern lengths so we can report the beginning of matches rather than the ends of patterns.

void init_ac_iter(

struct ac_iter *iter,

const uint8_t *x,

uint32_t n,

const uint32_t *pattern_lengths,

struct trie *patterns_trie

) {

assert(iter);

iter->x = x; iter->n = n;

iter->pattern_lengths = pattern_lengths;

iter->patterns_trie = patterns_trie;

iter->nested = true;

iter->j = 0;

iter->v = patterns_trie;

iter->w = 0;

iter->hits = 0;

// We need these for this algorithm.

compute_failure_links(patterns_trie);

}

The resources for the iterator are all handled outside of the iterator so we do not have to free any resources.

void dealloc_ac_iter(struct ac_iter *iter)

{

// Nop

}

The function for incrementing the iterator is more involved. We have nested loops in the algorithm where we have an outer loop that runs through the string x, and then we have a nested loop that matches down the trie using failure links and then yet another nested loop that iterates through the output list. We need to leave the iterator in any of these loops and resume in the same loop when we increment the iterator. Therefore, the iterator has the variable hits that is nonnull if we are in the process of outputting hits. We check if it is null and return a match if it isn’t. We use another variable in the iterator, nested, that is true if we are in the nested loop over failure links and matches, the while (w) look from the implementation earlier. If nested is true, we get the outlink form w, and if there is one, we update the various values in the iterator so we can return to the beginning of the loop. For restarting the loop, we call next_ac_match() recursively. If there isn’t an outgoing edge with the right label, we should leave the nested loop instead, so here we set nested to false and continue to the next part of the function that handles the updates in the outer loop. After the updated variables, we continue the loop by a recursive call again. The recursive calls are likely to be tail-optimized by the compiler so we will not pay a runtime penalty and we do not need to worry about exceeding the stack space.

bool next_ac_match(

struct ac_iter *iter,

struct ac_match *match

) {

if (iter->hits) {

match->string_label = iter->hits->string_label;

// We use the pattern length to output

// the start of a match instead of the end.

match->index = iter->j -

iter->pattern_lengths[match->string_label];

iter->hits = iter->hits->next;

return true;

}

if (iter->nested) {

iter->w = out_link(iter->v, iter->x[iter->j]);

if (iter->w) {

iter->hits = iter->w->output;

iter->v = iter->w;

iter->j++;

iter->w = out_link(iter->v, iter->x[iter->j]);

return next_ac_match(iter, match);

} else {

iter->nested = false;

}

if (iter->j < iter->n) {

if (is_trie_root(iter->v)) {

iter->j++;

} else {

iter->v = iter->v->failure_link;

}

iter->nested = true;

return next_ac_match(iter, match);

}

return false;

}

For the running time of the main algorithm, we can reason similarly to how we did for the KMP algorithm. We never decrease j, but we increase it for each match. We never move the trie to the left but move it to the right every time we have a mismatch. Both j and trie cannot move past the end of x, so the running time is (n) plus the total number of matches we output, z, which is not a constant, so the total time is O(n + z).

Comparisons

Theoretical analysis of algorithms is one thing, and actual running times another—and more important property. I have simulated random strings with three different alphabets: EQUAL, all symbols are the same; DNA, an alphabet of size four (A, C, G, T); and a full 8-bit character set. If our analysis (and implementation) is correct, the naïve algorithms, BMH and BM, should have worst-case complexity on EQUAL, but when we consider random strings, the performance should get better the larger the alphabet. The other algorithms should run in linear time regardless of the alphabet. The relative performance of the two classes of algorithms depends on the complexity of the implementation. So consider Figure 2-34. In the figure, I have used m = 200. The lines are loess fitted to the time measurements. The behavior is as expected: The first three algorithms perform poorly with the EQUAL alphabet but comparable to the other algorithms on the DNA alphabet. The naïve algorithm is a little faster because it is much simpler. With random strings, the BMH and BM algorithms outcompete the others, with BM (the more complex of the two) the fastest. With the largest alphabet, the naïve algorithm, simple as it is, is faster than border and KMP, and the BMH and BM algorithms dramatically faster.

The algorithms also depend on m, and in Figure 2-35, you can see the running time of the linear-time algorithms for different m. In Figure 2-36 you can see the same for the worst-case quadratic time algorithms. Notice that the linear-time algorithms hardly depend on m. There is some dependency from the preprocessing step, but it is very minor. The worst-case quadratic time algorithms depend on both n and m. When we fix m, we always get a straight line for O(nm) algorithms, but the growth depends on m as we see. They are all faster for smaller m and faster for larger alphabets (the y axes are not on the same scale so you can see the running time for the fastest cases).

You are unlikely to run into truly random strings, but genomic DNA data is close enough that the running time will be the same. Natural languages are hardly random, but BMH and BM are known to perform sublinear there as well. The best algorithm depends on your use case, but with reasonably large alphabets, BM and BMH are likely to be good choices.

You can find the code I used for the experiments on GitHub: https://github.com/mailund/stralg/blob/master/performance/match_search.c.

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig34_HTML.png — Figure 2-34
Performance of the search algorithms

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig35_HTML.png — Figure 2-35
Dependency on m for the linear algorithms

../images/490521_1_En_2_Chapter/490521_1_En_2_Fig36_HTML.png — Figure 2-36
Performance of the worst-case quadratic time algorithms for different m

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 2. Classical algorithms for exact search

Create new playlist

Sign In

Sign Up

2. Classical algorithms for exact search

Naïve algorithm

Border array and border search

Borders and border arrays

Exact search using borders

Knuth-Morris-Pratt

Boyer-Moore-Horspool

Extended rightmost table

Boyer-Moore

Jump rule one

Computing the Z array

Z-based jump table

Second jump table

Combining the jump rules

Aho-Corasick

Tries

Preprocessing

The algorithm with iterators

Comparisons

Table of Contents for
2. Classical algorithms for exact search