Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

3. Suffix trees

Thomas Mailund¹

(1)

Aarhus N, Denmark

The suffix tree is a fundamental data structure when it comes to string algorithms. It captures the structure of a string via all the string’s suffixes and uses this structure as the basis of many algorithms. We will only use it for searching, where it provides linear search for a pattern after a linear preprocessing of the string we search in.

Imagine that you have listed all the suffixes of a string x$ with the sentinel $ (which in C is usually, and always in this chapter, zero). The sentinel is important here and you must always include it in suffix trees. So, we have the suffixes x$[0, n + 1], x$[1, n], x$[2, n], ..., x$[n, n + 1], which obviously contain all the information that the string does (the first suffix is the entire string). If we want to search for a pattern p in x, then we can find the suffixes where p is a prefix, that is, suffixes x[ j, n] where p = x[ j, j + m]. Iteratively matching p against all suffixes is not efficient, it would take time O(nm), and if we explicitly list all suffixes, we would use O(n²) time and space on top of this. If, however, we construct a trie of all the suffixes, we can search in time O(m). Consider the trie in Figure 3-1. It contains all the suffixes of the string mississippi$. The sentinel guarantees us that there is a one-to-one mapping between leaves and suffixes. To search for a pattern, move down the trie until there is a mismatch or we reach the end of the pattern. If we reach the end of the pattern, then the leaves below that point are the positions where the pattern can be found in the string. If we, for example, search for the string “ss” from the root, we get to the white node in Figure 3-1. The leaves below this node in the tree, two and five, are the indices where “ss” occur in “mississippi$”, that is, indices 2 and 5.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig1_HTML.png — Figure 3-1
Suffix trie for “mississippi”

Constructing this “suffix trie” in the standard way—one string at a time—takes time proportional to the sum of the lengths of sequences we add to the trie, so building this suffix trie costs us O(n²) space and time usage for the preprocessing. A suffix tree is a tree containing all suffixes of a string but exploits the structure in a set of suffixes to reduce both space and time complexity to O(n + m).

Compacted trie and suffix representation

A suffix tree is a trie containing all suffixes , but it is compacted. This means that we do not represent each character as an edge in the trie, but rather, we merge edges where nodes have out-degree one; see the left tree in Figure 3-2.¹ Since we have a single string, which all the edges are a subsequence of, we can represent the edge strings efficiently as indices or pointers into the string. The tree on the right in Figure 3-2 shows the actual representation of a suffix tree. The notation [i, j] means from i to j with i included and j not included, that is, j is one past the last index.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig2_HTML.png — Figure 3-2
Suffix tree, conceptual and actual

The sentinel guarantees that there is a one-to-one correspondence between the string’s suffixes and the suffix tree’s leaves. Since each inner node has at least two children, the total number of nodes cannot exceed 2n − 1 and neither can the number of edges, so the suffix tree can be stored in O(n) space.

In the implementation, we will not use indices but pointers into the string. This is more convenient in many cases, and we can always get an index by subtracting the pointer by the string. We represent an edge with a range structure that we put in each node. The range in node v represents the edge label between v and its parent p(v).

struct range {

const uint8_t *from;

const uint8_t *to;

};

static inline uint32_t range_length(struct range r) {

return (uint32_t)(r.to - r.from);

}

The range_length() is a convenience function and shows how we can go from pointers to a length, and in this case, it will be an offset from the string pointed to by r.to.

In the simplest construction algorithm, we need a range, a sibling list, and a child list in each node, plus a suffix label if the node is a leaf. In the more advanced two algorithms, we also need a parent pointer and a suffix link pointer. What the parent pointer does should be self-evident, and the suffix link pointer will be explained when we get to McCreight’s algorithm later in the chapter.

With all the information we need in a node, the structure looks like this:

struct suffix_tree_node {

uint32_t leaf_label;

struct range range;

struct suffix_tree_node *parent;

struct suffix_tree_node *sibling;

struct suffix_tree_node *child;

struct suffix_tree_node *suffix_link;

};

static inline uint32_t edge_length(

struct suffix_tree_node *n

) {

return range_length(n->range);

}

The edge_length() function is just another helper function we can use to, not surprisingly, get the length of the edge leading to the node.

We could use a root note as the type of a suffix tree—as we did for tries—but we will usually need access to the string that the tree is built from. We have pointers into the string on all edges, but in some of the algorithms in this chapter, we will need to consider them as indices, and that is easily done by subtracting the pointer to the string from the pointers on an edge. So we use a suffix_tree structure and store a pointer to the string in it.

struct suffix_tree {

const uint8_t *string;

uint32_t length;

struct suffix_tree_node *root;

struct suffix_tree_node_pool pool;

};

The pool variable is used to allocate nodes efficiently. We have an upper bound on the size of the number of nodes we can have, so we can preallocate a pool of nodes for the tree instead of using malloc() and free() for each node. This speeds up the construction and makes it easier to free the tree—we can free the pool and do not have to traverse the tree to free individual nodes.

struct suffix_tree_node_pool {

struct suffix_tree_node *nodes;

struct suffix_tree_node *next_node;

};

When we need a new node, we can get it from the pool. The new_node() function constructs a node from a suffix tree (where it can get the pool) and the two pointers that represent the edge label.

static struct suffix_tree_node *

new_node(

struct suffix_tree *st,

const uint8_t *from,

const uint8_t *to

) {

struct suffix_tree_node *v = st->pool.next_node++;

v->leaf_label = 0;

v->range.from = from;

v->range.to = to;

v->parent = 0;

v->sibling = 0;

v->child = 0;

v->suffix_link = 0;

return v;

}

This function should remain in the .c file and not be part of the public interface. We don’t want the user to insert nodes willy-nilly. Suffix trees should be built using a construction algorithm.

To free a suffix tree, we first need to free the nodes. This is a trivial task because we have the nodes pool that we can deallocate with a single free() call.

void free_suffix_tree(struct suffix_tree *st)

{

// Do not free string; we are not managing it.

free(st->pool.nodes);

free(st);

}

We should not free the string when we free the suffix tree. That is the responsibility of the user and part of the interface; the string is declared const, and we will use const strings for all our construction algorithms.

When allocating a new tree, we allocate the struct to set the string and length variables. Then we allocate the array we use as the node pool. Finally, we create a root node and set its parent to itself and its suffix link to itself (forget the suffix link for now, we get to it later). Adding a root to the tree when we construct it makes all other functions easier to write since they avoid handling special cases where a node is null.

static struct suffix_tree *

alloc_suffix_tree(

const uint8_t *string

) {

struct suffix_tree *st =

malloc(sizeof(struct suffix_tree));

st->string = string;

uint32_t slen = (uint32_t)strlen((char *)string);

st->length = slen + 1; // We are using '' as sentinel.

// This is the max number of nodes in a tree where all

// nodes have at least degree two. There is a special case

// when the string is empty -- it should really only happen

// in testing, but never the less.

// In that case, there should be

// two and not one node (the root and a single child).

uint32_t pool_size = st->length == 1

? 2 : (2 * st->length - 1);

st->pool.nodes =

malloc(pool_size * sizeof(struct suffix_tree_node));

st->pool.next_node = st->pool.nodes;

st->root = new_node(st, 0, 0);

st->root->parent = st->root;

st->root->suffix_link = st->root;

return st;

}

Naïve construction algorithm

The simplest way to build a suffix tree is to consider it a trie and insert one string at the time, starting with the first suffix. We cannot quite consider it a trie since it is compacted, but we can look at one character at a time as we scan along edges in effect doing the same work as we would for building a trie. This naïve approach is implemented in the naive_suffix_tree() function :

struct suffix_tree *naive_suffix_tree(

const uint8_t *string

) {

struct suffix_tree *st = alloc_suffix_tree(string);

// We insert the first suffix manually to

// ensure that all inner nodes have at least one child.

// The root will be a special case

// for the first suffix otherwise,

// and we do not want to deal with that

// in the rest of the code.

struct suffix_tree_node *first =

new_node(st, st->string, st->string + st->length);

st->root->child = first;

first->parent = st->root;

const uint8_t *xend = st->string + st->length;

for (uint32_t i = 1; i < st->length; ++i) {

struct suffix_tree_node *leaf =

naive_insert(st, st->root, string + i, xend);

leaf->leaf_label = i;

}

return st;

}

First, we ensure that there is at least one edge out of the root—otherwise, we would need to handle the case where, and there isn’t a special case in all the other functions we will write. After setting the first suffix as the first child of the root (and setting the parent pointer of it to the root), we iterate through all the remaining suffixes and insert them using the following naive_insert() function . This function will return the new leaf representing the suffix, and we set its label accordingly.

The naive_insert() function takes the suffix tree, a node to search out from (the root in naive_suffix_tree() ), and a string given by a start pointer (x + i for the start of suffix i) and an end pointer (the end of the string in naive_suffix_tree().

First, naive_insert() checks if there is an out edge from the node it should search from, v. If there isn’t, then this is where we should add the string, so we create a new node and insert it as a child of v using insert_child() (listed in following code). If there is an out edge, we scan along it. We get a pointer to the start of the interval of the edge, s, and we have the pointer x pointing to the beginning of the string we want to insert. We scan along the edge as long as s and x point to the same character; see Figure 3-3 for an illustration of how scanning an edge maps to scanning an interval in the suffix tree’s string. In the figure, variables from and to define the interval of the edge we scan, s the point in the edge we have scanned to, x the position in the string we insert that we have scanned so far, and xend the end of the string we are inserting.

If we find a mismatch, we need to break the edge in two and add an edge to the leaf. We do that using the function split_edge() described later in this section. If we reach the end of the edge’s interval, we must continue from the node w at the end of the edge. We do this by a recursive call.

static struct suffix_tree_node *

naive_insert(

struct suffix_tree *st,

struct suffix_tree_node *v,

const uint8_t *x,

const uint8_t *xend

) {

// Find child that matches *x.

struct suffix_tree_node *w = find_outgoing_edge(v, x);

if (!w) {

// There is no outgoing edge that matches

// so we must insert here.

struct suffix_tree_node *leaf = new_node(st, x, xend);

insert_child(v, leaf);

return leaf;

} else {

// We have an edge to follow!

const uint8_t *s = w->range.from;

for (; s != w->range.to; ++s, ++x) {

if (*s != *x) {

struct suffix_tree_node *u =

split_edge(st, w, s);

struct suffix_tree_node *leaf =

new_node(st, x, xend);

insert_child(u, leaf);

return leaf;

}

// We made it through the edge, so continue

// from the next node.

// The call is tail-recursive, so the compiler

// will usually optimize it to a loop.

return naive_insert(st, w, x, xend);

}

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig3_HTML.png — Figure 3-3
Scanning along an edge by scanning in the string

The find_outgoing_edge() function does what you would expect. It scans through the children of a node, tests each edge for whether it matches the character we are looking for, and returns the node if it finds one. If it gets all the way through the children, that is, to a node where the sibling pointer is null, then it returns null.

static struct suffix_tree_node *

find_outgoing_edge(

struct suffix_tree_node *v,

const uint8_t *x

) {

struct suffix_tree_node *w = v->child;

while (w) {

if (*(w->range.from) == *x) break;

w = w->sibling;

}

return w;

}

The simplest way to add a child to a node is to prepend it to the linked list structure. This, however, will leave the children list reversely sorted according to when we insert edges, which for all intents and purposes means randomly sorted. There are algorithms where we need to traverse the tree such that we see the suffixes in lexicographical order. If we keep the children sorted according to their start symbol, this will simply be a depth-first traversal where, for each node, we iterate through the nodes in the order they have in the children’s list. It requires a little more work to insert a child at its correct position, but it is mostly straightforward except for one special case. If the new child’s symbol is less than the first node’s, we prepend the child. Otherwise, we scan through the existing children until we find a letter that is larger than the new child’s symbol, and we insert the child there. If we reach the last node, recognizable by it having null for a sibling, we prepend the new child. After inserting the child, we must remember to set its parent to the node we added it to.

static void insert_child(

struct suffix_tree_node *parent,

struct suffix_tree_node *child

) {

// We need this when we split edges.

if (!parent->child) {

parent->child = child;

return;

}

const char x = *child->range.from;

struct suffix_tree_node *w = parent->child;

if (x < out_letter(w)) {

// Special case for the first child.

child->sibling = parent->child;

parent->child = child;

} else {

// Find w such that it is the first chain

// with an outgoing edge that is larger

// than the new.

while (w->sibling && x > out_letter(w->sibling))

w = w->sibling;

child->sibling = w->sibling;

w->sibling = child;

}

child->parent = parent;

}

The out_letter() gives us the first symbol of an edge:

inline static char out_letter(

struct suffix_tree_node *v

) {

return *(v->range.from);

}

Finally, we come to the function for splitting an edge on a mismatch. Consider Figure 3-4 where on the left we have the edge we have scanned down and the position of the mismatch. We have to split the edge at the mismatch position, which is where the variable s points. The edge going down to the new node must, therefore, go from the pointer from to s and the edge below the new node must go from s to the pointer to. We will insert a new leaf as the second edge out of the new node, and this must go from the pointer x to xend in the calling function but we only break the edge with the new node, u, in this function. We add the other edge outside of this function. In the implementation of the function split_edge() , the node w is an argument as is s. We get the node v from w’s parent pointer, we create node u, and we set the start of its edge to be the from of the edge to w and the end to be s. Now u’s parent should be v and its (so far only) child should be w. We update w’s start position to be s (it already has the right endpoint) and update the child and parent pointers for the nodes. We have to remove w from v’s children—it is now a child of u instead—and we insert u as a new child of v. We return the new node so the caller can insert the leaf as a child of it.

static struct suffix_tree_node *

split_edge(

struct suffix_tree *st,

struct suffix_tree_node *w,

const uint8_t *s

) {

struct suffix_tree_node *v = w->parent;

struct suffix_tree_node *u =

new_node(st, w->range.from, s);

u->parent = v;

u->child = w;

w->range.from = s;

w->parent = u;

remove_child(v, w);

insert_child(v, u);

return u;

}

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig4_HTML.png — Figure 3-4
Splitting an edge

The return value of split_edge() is the new node and we use that to add the new leaf. We only need the s pointer when we split the edge if we do this. We cannot create the edge to the leaf with a range x to xend from the w edge and the pointer s, so if we wanted split edge to insert the leaf as well, we would need more arguments to the function. In naive_insert() we already know x and xend so we create the leaf there and insert it as the second child of our new node. This is done with the following code:

struct suffix_tree_node *u = split_edge(st, w, s);

struct suffix_tree_node *leaf = new_node(st, x, xend);

insert_child(u, leaf);

The remaining function we need to implement to complete the algorithm is the one for removing a child from a node’s child list. We have three cases: If the list is empty—in which case, we do nothing. If it is the first child—in which case, we change the children list to the first sibling, which will also remove the child. In the final case, the node is somewhere in the middle of the list, so we scan through the list and if we find the node we unlink it.

static void remove_child(

struct suffix_tree_node *v,

struct suffix_tree_node *w

) {

if (!v->child) return;

if (v->child == w) {

v->child = w->sibling;

w->sibling = 0;

} else {

struct suffix_tree_node *u = v->child;

while (u->sibling) {

if (u->sibling == w) {

u->sibling = w->sibling;

w->sibling = 0;

return;

}

u = u->sibling;

}

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig5_HTML.png — Figure 3-5
Suffix array and longest common prefix array

Suffix trees and the SA and LCP arrays

In this section, we briefly cover the close relationship between two special arrays and a suffix tree. These arrays are the topic of the entire next chapter, but in this section, we only consider how we can use them to build a suffix tree. If you take a list of all the suffixes of a string and then sort them (see Figure 3-5), the suffix array (SA) is the suffix indices in this sorted order. The longest common prefix (LCP) array is the longest prefix of a suffix and the suffix above it in the sorted order—the underlined prefixes in Figure 3-5. In this section, we shall see that we can construct the arrays from a suffix tree in linear time and that we can construct the suffix tree from the two arrays in linear time.

Constructing the SA and LCP arrays

If we depth-first traverse a suffix tree with children sorted alphabetically, we will see all leaves in sorted order, an observation that should be immediately obvious. Thus, if we keep a counter of how many leaves we have seen so far and use it to index into the suffix array, we can build the suffix array simply by putting leaf nodes at the index of the counter as we traverse the tree.

Getting the LCP array is only slightly more involved. If we let branch length be the total length of the string, we read from the root down to a node v. All children of v will share a prefix of exactly this length, so if we iterate through all but the first child, we will have at least this longest common prefix. Children of the children will share longer prefixes, but we can handle that by recursion. The problem with the first child is that though it shares the prefix with the other children, it does not share it with the previous string in the suffix array.

Consider Figure 3-6. The letters represent branch lengths and the numbers the leaves in the tree. Notice how the leftmost node in any of the subtrees has a lower LCP than its siblings. If we traverse the tree and keep track of how much we share to the left, we can output that number for each leaf. For the first child of a node, we send this left-shared number unchanged down the tree, while at the remaining trees, we update the value by adding the edge length of the children’s parent node.

Onto the implementation of the algorithm, we use this structure to reference the two arrays and to keep track of which index we need to update in the array next time we see a leaf.

struct sa_lcp_data {

uint32_t *sa;

uint32_t *lcp;

uint32_t idx;

};

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig6_HTML.png — Figure 3-6
Branch lengths and LCP array

In the function st_compute_sa_and_lcp() , we create and initialize an instance of the structure and call a recursive function, lcp_traverse() , with the root as the node and both the left-shared branch length and the node’s branch length as its input.²

void st_compute_sa_and_lcp(

struct suffix_tree *st,

uint32_t *sa,

uint32_t *lcp

) {

struct sa_lcp_data data; // type defined earlier

data.sa = sa; data.lcp = lcp; data.idx = 0;

uint32_t shared_depth = 0;

uint32_t branch_depth = 0;

lcp_traverse(st->root, &data,

shared_depth, branch_depth);

}

In the traversal , we update the SA and LCP array each time we see a leaf; we propagate the shared left-depth to the first node and use the current node’s depth as both the left and the node depth in the recursive calls.

static void lcp_traverse(

struct suffix_tree_node *v,

struct sa_lcp_data *data,

uint32_t left_depth,

uint32_t node_depth

) {

if (!v->child) {

// Leaf

data->sa[data->idx] = v->leaf_label;

data->lcp[data->idx] = left_depth;

data->idx++;

} else {

// Inner node

// The first child should be treated differently than

// the rest; it has a different branch depth because

// the LCP is relative to the last node in the previous

// leaf in v's previous sibling.

struct suffix_tree_node *child = v->child;

uint32_t this_depth

= node_depth + edge_length(v);

lcp_traverse(child, data,

left_depth, this_depth);

for (child = child->sibling;

child;

child = child->sibling) {

// Handle the remaining children

lcp_traverse(child, data,

this_depth, this_depth);

}

Since the entire algorithm is a depth-first traversal where we do constant time work at each node, the running time is the same as the size of the tree, so O(n).

Constructing the suffix tree from the SA and LCP arrays

We can construct a suffix tree from the SA and LCP arrays by, conceptually, doing a depth-first traversal of the tree while constructing it at the same time. We insert the suffixes in a different order than for the naïve approach; we insert them according to the suffix array, so we first insert sa[0], then sa[1], and so on. When inserting the first suffix, we add an edge from the root to a leaf labelled sa[0]. Then we split that edge to insert sa[1] and an edge to it. For sa[2] we start in sa[1] and figure out where we should break an edge and insert the new leaf. This can either be on the edge down to sa[1] or the edge above it (see Figure 3-7), but it cannot be on the edge down to sa[0] since sa[1] is lexicographically smaller than sa[2] which means it must be to the right or above the edge to sa[1]. We continue inserting this way by inserting sa[3], sa[4], and so on by first moving up the tree to find the place where they should be inserted and then inserting the new leaf. You can think of this as a depth-first traversal. When you insert a new leaf, you move down the recursion (very quickly, of course, since you only insert a single edge), and when you move up the tree to find an edge to split, you in effect return from the depth-first recursion.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig7_HTML.png — Figure 3-7
Possible placements of sa[2]

The main function for the algorithm looks like this:

struct suffix_tree *

lcp_suffix_tree(

const uint8_t *string,

uint32_t *sa,

uint32_t *lcp

) {

struct suffix_tree *st = alloc_suffix_tree(string);

uint32_t first_label = sa[0];

struct suffix_tree_node *v =

new_node(st, st->string + sa[0],

st->string + st->length);

v->leaf_label = first_label;

st->root->child = v;

v->parent = st->root;

for (uint32_t i = 1; i < st->length; ++i) {

v = lcp_insert(st, i, sa, lcp, v);

}

return st;

}

We first add sa[0] to the tree and set the variable v to the new leaf. Then we iteratively add the remaining leaves with lcp_insert() that returns the new leaf inserted. This leaf is used by the function as the starting point for inserting the next leaf.

For the function doing the hard labor, lcp_insert(), consider Figure 3-8. If we are at the leaf for sa[i-1] and the longest prefix it shares with sa[i] is lcp[i], then the new leaf should be inserted lcp[i] symbols down the path from the root to sa[i-1], or n-sa[i-1]-lcp[i] up from the sa[i-1] leaf. We can search up that amount from sa[i-1] and break the edge there (or insert the new leaf in a node if we do not hit an edge).

The pointers we need to insert on the new edge to the leaf are lcp[i] into the suffix, that is, sa[i]+lcp[i], since this is what remains of the suffix after we have shared lcp[i]; see Figure 3-9. We don’t know where the pointers on the edge we break are, they need not be anywhere sa[i-1], sa[i], or lcp[i], but we will know how far up the edge are we from how much we had left to search when we started climbing up the edge. If we call that amount length_up, then the break pointer is at to-length_up.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig8_HTML.png — Figure 3-8
Relationship between sa[i-1], lcp[i], and the suffix tree

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig9_HTML.png — Figure 3-9
Splitting the edge when inserting sa[i]

With all the observations earlier, the actual implementation is straightforward. We get the length we have to move up from the calculations we just did, and then we get the length of the current edge we need to climb. Then we enter a loop that moves us up a number of edges until we have traversed all the way to our insertion point. In the loop, we each iteration subtract the edge length from the length we need to climb up—we have just moved up that amount, after all—we move to the parent node, and we get the length of the next edges. We do this until we hit an edge longer or equal to the next edge or until the new edge length is zero. In the first case, we have found the insertion point on an edge, and in the second case, we hit a node and need to insert the new leaf as a child of that node.

static struct suffix_tree_node *

lcp_insert(

struct suffix_tree *st,

uint32_t i,

uint32_t *sa,

uint32_t *lcp,

struct suffix_tree_node *v

) {

struct suffix_tree_node *new_leaf =

new_node(st,

st->string + sa[i] + lcp[i],

st->string + st->length);

new_leaf->leaf_label = sa[i];

uint32_t length_up = st->length - sa[i-1] - lcp[i];

uint32_t v_edge_len = edge_length(v);

while ((length_up >= v_edge_len)

&& (length_up != 0)) {

length_up -= v_edge_len;

v = v->parent;

v_edge_len = edge_length(v);

}

if (length_up == 0) {

append_child(v, new_leaf);

} else {

struct suffix_tree_node *u =

split_edge(st, v, v->range.to - length_up);

// Append leaf to the new node

// (it has exactly one other child).

u->child->sibling = new_leaf;

new_leaf->parent = u;

}

return new_leaf;

}

There is one new function, append_child() . It adds a child to the children’s list. It could be done with the insert_child() function as well, but it is simpler since we know the edge we are inserting should go to the back of the list.

static void append_child(

struct suffix_tree_node *v,

struct suffix_tree_node *w

) {

struct suffix_tree_node *child = v->child;

while (child->sibling) {

child = child->sibling;

}

child->sibling = w;

w->parent = v;

}

To see that the running time is linear, observe that we construct the tree in what is essentially a depth-first traversal. We do not traverse nodes going down the recursion—that only makes the running time faster—but each time we move up the tree, it corresponds to returning from the recursion in the traversal. If this is not clear, I encourage you to work out a few examples on a piece of paper until you see that this is the case.

An alternative argument for the running time uses an amortization argument, not unlike those we used for, for example, border arrays, KMP, and Aho-Corasick. Consider the depth of the leaf we start an iteration from, v with node depth d(v). The node depth d(v) is the number of nodes on the path from node v to the root. When we search upward, we decrease the depth, but we cannot decrease it more than d(v). After we find the node or edge, we potentially increase the depth of all the nodes lexicographically smaller than the suffix we are inserting (when we split an edge), or we leave them alone (when we insert a child to node). However, we will never explore that part of the tree when we insert lexicographically larger suffixes. We will look at nodes and edges closer to the root than that point, but we will never return to that subtree since we will never insert a lexicographically smaller string in the algorithm. So we can ignore those increases in depth and only consider the increase when we insert a new leaf. That increase is at most one. The algorithm can at max increase the depth by one (for the relevant part of the tree) in each iteration, and we cannot decrease the depth more than we have increased it, so since there are O(n) leaves in the tree, we have a linear running time.

If we use the SA and LCP arrays to build the suffix tree, and need a suffix tree to produce the arrays, we have a circular problem. We shall see in the next chapter, however, that we can build the arrays in linear time without using a suffix tree. We shall also see, in the next section, that we can build a suffix tree in linear time without the two arrays. A benefit of using the SA and LCP algorithm to construct suffix trees is that it will be easier to preprocess a string using a suffix tree and then serialize it to a file when you expect many searches in the same string over time. The two arrays are trivial to write to a string, while serializing the tree structure itself means saving a structure with pointers and reconstruct them when the tree is read from file.

McCreight’s algorithm

McCreight’s algorithm lets us construct suffix trees in linear time without any additional data structure beyond the string itself. It inserts suffixes in the order they appear in the string, x[0, n]$, x[1, n]$, …, $, just like the naïve algorithm, except that it inserts the next suffix faster than the naïve algorithm. It keeps track of the last leaf inserted, like the SA and LCP algorithm, but the two tricks it uses to achieve linear construction time are different. From the latest leaf that we insert, we use a pointer to jump to a subtree somewhere else in the tree, where the next leaf should be inserted. Then we search for the insertion point in that tree, first using a search method that is faster than the naïve one and then searching the last piece of the suffix using the slow scanning method from the naïve algorithm.

Before we start we need to get some terminology and notation defined: Any suffix y = x[i, n]$ can be split into two, potentially empty, substrings y = h(y)t(y) where h(y) is the longest string that matches a prefix of a longer suffix x[ j, n]$, j < i. This is just another way to say that h(y) is how far down the tree we go when inserting suffix i, that is, it is the point in the tree where we need to insert a new leaf for suffix i. We call the two strings the head, h(y), and tail, t(y), of suffix i. For convenience, we will use i for the strings for suffix x[i, n]$, that is, if y = x[i, n]$, then h(i) = h(y) and t(i) = t(y).

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig10_HTML.png — Figure 3-10
Inserting suffix i by finding its head and appending its tail

What we did in the naïve algorithm was searching one character by one character from the root to the point where suffix i would branch off the existing tree, that is, we searched for h(i), and then we inserted a new leaf with label t(i). We searched with the naïve search algorithm, one character at a time, and when we could not continue down the tree, we had found h(i). We did not need to insert that string—by definition, it is already in the tree—but we needed a new leaf and the remaining part of the suffix, t(i), on the edge to the leaf; see Figure 3-10. In McCreight’s algorithm, we do essentially the same, but we exploit the structure of suffixes to search for h(i) faster.

For a string ay let its suffix link s(ay) be defined as the string with the first symbol removed, that is, s(ay) = y, with the special case s(ε) = ε for the empty string. If v is a node in the tree where ay is the string we would read from the root to v, then s(v) is the node where we would find y when reading from the root and down. We will see in the algorithm that all nodes in a tree have a suffix link that is a node so we can represent them as pointers from nodes to nodes. Jumping these pointers is the first trick to McCreight’s algorithm.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig11_HTML.png — Figure 3-11
The head of a suffix is longer to or equal to the previous suffix’ suffix link

First, observe that s(h(i − 1)) is a prefix of h(i). Consider h(i − 1). This string is how far down the tree we could scan before we found a mismatch when we inserted suffix i − 1. Therefore there was a longer suffix, k < i − 1, whose prefix matches h(i − 1) and then had a mismatch (there might be strings that match more of suffix i − 1, but by definition, this was the longest at the time we inserted i − 1). Now consider h(i). This is the longest prefix matching a prefix of a string we already inserted in the string. If we look at suffix k + 1, we can see that this will match s(h(i-1)); see Figure 3-11. Because we match at least suffix k + 1 to this point, s(h(i − 1)) must be a prefix of h(i). If we only have suffix k + 1 that matches to this point, then we would have exactly h(i)=s(h(i-1)), but there might be longer strings matching a prefix of suffix i; after all, this suffix starts with a different character, and there might be suffixes that do the same and matches longer prefixes. Regardless, we know that s(h(i-1)) is a prefix of h(i).

If we have a pointer from h(i − 1) to s(h(i − 1)), then we can jump it and skip past this prefix in the search for h(i). Essentially, this is what we will do except that we do not have this suffix link when we need it. An invariant in the algorithm will be that we have suffix links for all nodes except possibly the parent of the last leaf we inserted. Not to worry, we can do something almost as good. While h(i − 1) might not have a suffix link pointer, its parent does p(h(i − 1)) ↦ s(p(h(i − 1))). This suffix link is also a prefix of h(i). See Figure 3-12. The dashed arrow is a pointer we are guaranteed to have, while the dotted arrow is the pointer we wished we had but might not.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig12_HTML.png — Figure 3-12
Head and suffix links and their “jump pointers”

The overall steps in the algorithm are this: We start by creating the first leaf and connecting it to the root. After that, we go through each suffix in order and get the parent of the last leaf we inserted (h(i − 1) in Figure 3-13 and the code). As an invariant of the algorithm, we will have that all nodes, with the possible exception of h(i − 1), will have a suffix link pointer. This is vacuously true after we have inserted the first leaf. For the root, though, it has a suffix link. We set the root’s suffix link to the root itself in alloc_suffix_tree() , so although we need to set the suffix link when we increment i, to ensure the invariant, it is already satisfied at this point.

From the parent of leaf i − 1, called v in the code and in Figure 3-13, we call a function, suffix_search() , that returns s(h(i − 1)), called w in the code. This is the suffix link of v, so we set the pointer ensuring the invariant for the next iteration of the loop. The h(i) string is somewhere in the subtree of s(h(i − 1)) since s(h(i − 1)) is a prefix of it. So we need to search a little more from node w. We do this in two steps. We jump to s(p(h(i − 1))) using the suffix link and then we search for w from there. In the search, called “scan 1” in the figure, we can move faster than the naïve search. We know that the string from s(p(h(i − 1))) to s(h(i − 1)) is already in the tree which means that we can jump from node to node rather than scan along an edge. At each node we need to find which outgoing edge matches the current symbol in s(h(i − 1)) to choose the right path in the suffix tree, but we do not need to compare characters when we move along an edge. We do not know how far down h(i) is from s(h(i − 1)), so in this part of the search, we need to use the naïve scan, called “scan 2” in the figure. The details of suffix_search() are described in the following texts.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig13_HTML.png — Figure 3-13
Searching for the head of suffix i

The implementation of the main function looks like this:

struct suffix_tree *

mccreight_suffix_tree(

const uint8_t *x

) {

struct suffix_tree *st = alloc_suffix_tree(x);

uint32_t n = st->length;

struct suffix_tree_node *leaf =

new_node(st, x, x + st->length);

leaf->parent = st->root; st->root->child = leaf;

leaf->leaf_label = 0;

for (uint32_t i = 1; i < st->length; ++i) {

// Get the suffix of p(i-1) = h(i-1) = v

struct suffix_tree_node *v = leaf->parent;

struct suffix_tree_node *w = suffix_search(st, v);

v->suffix_link = w;

// Find head for the remaining suffix

// using the naïve search.

if (leaf->parent != st->root) {

const uint8_t *y = leaf->range.from;

const uint8_t *z = leaf->range.to;

leaf = naive_insert(st, w, y, z);

} else {

// Search from the top for

// the entire suffix.

leaf = naive_insert(st, w, x + i, x + n);

}

// Move on to the next suffix.

leaf->leaf_label = i;

}

return st;

}

For the “naive search” and “scan 2” from Figure 3-13, there are two cases depending on whether we need to search from the root or not. The general case is when h(i − 1) is nonempty (which also means that it is not the root). With suffix_search() we have searched for h(i) to the point s(h(i − 1)). The string we have to continue our search with is t(i − 1) (see Figure 3-14 A). We search for t(i − 1) because it makes the code slightly easier and in any case is the same string we would search for if we searched s(h(i − 1)) toward the end of the string (see the figure). We will continue searching for t(i − 1) until we get a mismatch, at which point we have found h(i).

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig14_HTML.png — Figure 3-14
Cases for scan 2

If h(i − 1) is empty (i.e., the root), we cannot search from t(i − 1). The string t(i − 1) is the entire suffix x[i − 1, n]$ (see Figure 3-14 B). We need to scan for the entire suffix x[i, n]$ to find h(i).

Now, the suffix_search() function is responsible for finding s(v) = w for node v; see Figure 3-13 and Figure 3-15. For this it uses two operations, it goes up to p(v) and then jumps to s(p(v)), and then it searches from there to w in the “scan 1” step. If the edge label on the edge from p(v) to v is the string from x to y (i.e., the pointers in the node representing v have the range from x to y), then it is this string we must search for once we have made the parent and suffix jumps.

There are four cases, depending on how v and p(v) sit in the tree; see Figure 3-16. (A) It might be the case that v is the root. Since the suffix of the root is the root itself, we have that w is the root so our function can return that. (B) v might be a child of the root with a single symbol as its edge label. The suffix of a single symbol is the empty string which is the root, so again we can return the root. (C) It is also possible that the parent of v is the root but with a longer edge label, the string from pointer x to pointer y. In this case, we must search from the root for the string x + 1 to y where we add one to x for the same reason that we had to add one when searching with the naïve algorithm from the root (see earlier texts). Finally, (D), we have that p(v) is not the root, so we can jump from v to s(p(v)) following the parent and suffix link pointer, respectively. From this point we must search for the string x to y since this is the missing string between p(v) and v.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig15_HTML.png — Figure 3-15
General case of suffix search

The code for handling the four cases can look like this:

static struct suffix_tree_node *

suffix_search(

struct suffix_tree *st,

struct suffix_tree_node *v

) {

if (v == st->root) {

// Case A

return v;

} else if (v->parent == st->root

&& range_length(v->range) == 1) {

// Case B

return st->root;

} else if (v->parent == st->root) {

// Case C

const uint8_t *x = v->range.from + 1;

const uint8_t *y = v->range.to;

return fast_scan(st, st->root, x, y);

} else {

// The general case, case D

const uint8_t *x = v->range.from;

const uint8_t *y = v->range.to;

struct suffix_tree_node *w =

v->parent->suffix_link;

return fast_scan(st, w, x, y);

}

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig16_HTML.png — Figure 3-16
Four cases for finding the suffix of node v

The fast_scan() function handles the “scan 1” part of the algorithm, and as the name suggests, it is faster than the naïve search we use for “scan 2”. We call it with a node and a string range. Call these v, x, and y. We want it to return the node w below v where the path label spells out the string from x to y or create such a node if it doesn’t exist (so we always get a node from a call to the function). It is the same behavior as the naïve search in the tree has, but we will exploit that when we use fast_scan(), we call it with a string that we know is in the tree below node v. It is guaranteed by the observation that s(h(i − 1)) is a prefix of h(i). If we know that a string is in the tree, we do not need to compare every symbol down an edge against it. If we know which edge to follow, we can jump directly from one node to the next. This is what fast_scan() does.³

Let w be the node at the end of the edge where the first letter matches the first symbol we are searching for, and let s and t be the pointers that define the edge label from v to w. Let n be the length of the string, n = t − s, and let z = x + n, that is, z is the point we get to if we move x forward by the length of the v to w edge.

We have three cases for fast_scan(), depending on how the string s to t compares to the string from x to y; see Figure 3-17. (A) The two strings could match exactly. This means that we have found the string we are looking for so we can return w. (B) The string x to y might be shorter than then string from s to t—which we can recognize by z < y. In this case, we need to create a new node for where the x to y string ends and return the new node, node u in the figure. Finally, (C) the x to y string might be longer than the string from s to t. If this is the case, we need to search for the remainder of x to y, the string from z to y, starting in node w and we do this recursively.

The implementation can look like this:

static struct suffix_tree_node *

fast_scan(

struct suffix_tree *st,

struct suffix_tree_node *v,

const uint8_t *x,

const uint8_t *y

){

// Find child that matches *x.

struct suffix_tree_node * w = find_outgoing_edge(v, x);

assert(w); // must be here when we search for a suffix

// Jump down the edge.

uint32_t n = edge_length(w);

const uint8_t *z = x + n;

if (z == y) {

// Found the node we should end in.

return w; // We are done now.

} else if (z > y) {

// We stop before we reach the end node, so we

// need to split the edge.

// We need to split at distance k from

// s on the edge from v to w (with label [s,t])

// |---n----|

// v o--------o w (s,t)

// x *---*----* z

// y

// |-k-|

uint32_t k = (uint32_t)(y - x);

assert(k > 0);

const uint8_t *s = w->range.from;

const uint8_t *split_point = s + k;

return split_edge(st, w, split_point);

} else {

// We made it through the edge,

// so continue from the next node.

// The call is tail-recursive,

/// so the compiler will optimize

// it to a loop.

return fast_scan(st, w, z, y);

}

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig17_HTML.png — Figure 3-17
Fast scan cases

The second case in the fast scan, where we split an edge, leaves a node with a single child, violating an invariant of suffix trees. This should worry us, but it isn’t a problem because we immediately after the fast scan search with a naïve insert. This naïve search will see a mismatch on the existing edge and insert t(i) on an out edge of node u, returning us to a tree that satisfies the invariant.

To see that this is the case, see Figure 3-18. By definition, h(i − 1) is the longest prefix of suffix i − 1 before we have a mismatch, so there must be some longer suffix, k < i − 1, that shares prefix h(i − 1) and then mismatches. Let the symbol at the mismatch be a for suffix i − 1 and b for suffix k. When we insert suffix i, suffix k + 1 must have been inserted. Suffixes k + 1 and i share the prefix s(h(i − 1)) and the next character in k + 1 after s(h(i − 1)) is b, and since we broke a single edge, so there is only one symbol that continues from this point, we must conclude that the symbol after the point where we broke the edge must be b. Since t(i − 1)—which is the string we will search for in after calling fast_scan()—begins with symbol a, we will get a mismatch immediately and conclude that s(h(i − 1)) is indeed h(i).

To analyze the running time of McCreight’s algorithm, we split it into three parts: (1) The total work we do when jumping to parent and then the suffix of the parent, (2) the total work we do when using fast_scan() to handle “scan 1”, and (3) the total work we do with naive_insert() to handle “scan 2”.

Of the three, (1) is easy to handle. For each leaf we insert, of which there are n, we move along two pointers which take constant time, so (1) takes time O(n).

For (2) we will use an amortization argument. Let d(v) be the node depth of node v, that is, the number of nodes on the path from the root to v. Moving to the parent of v, p(v), cannot decrease the depth more than one. If v is the root, which is its own parent, then d(p(v)) = d(v) and otherwise d(p(v)) = d(v) − 1. Moving along a suffix pointer can also only decrease the depth by one. Consider Figure 3-19. For each path from the root down to v, we see a number of nodes, and each node has a suffix link (in the algorithm, node h(i − 1) might not have a suffix link, but we never jump from this node, so this has no consequence for the argument). In the following text, we shall argue that each of these suffix links is unique. In the general case, (A) in Figure 3-19, this means that d(s(v)) = d(v). (B) An exception is when the first edge on the path has a single symbol as its label. In that case, the first node has the root as its suffix and d(s(v)) = d(v) − 1.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig18_HTML.png — Figure 3-18
Splitting an edge in fast scan

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig19_HTML.png — Figure 3-19
Relationship between a path and the path of suffix links

Now, to see that the suffix link nodes are unique, recall that all nodes in the tree correspond to a prefix of at least one suffix and that nodes on the same paths will have the node with the smallest depth be a prefix of the other. Let k be such a suffix for nodes v_i and v_i−1 = p(v_i ); see Figure 3-20. Since v_i−1 and v_i are different nodes, the edge label between them, y, is nonempty. If we now look at suffix k + 1, we get the suffix of v_i−1 and v_i instead, and these are also separated by the nonempty string y; thus, they must be different nodes.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig20_HTML.png — Figure 3-20
Uniqueness of suffix links

In each iteration of (2), the pointer jumps, thus decreasing the depth by at most two; after which “scan 1” (or fast scan) increases the depth by a number of nodes. It cannot, however, increase the depth by more than O(n)—there simply aren’t paths long enough. In total, it can increase the depth by O(n) plus the number of decreases in the algorithm which is bounded by O(n). In total, the time we spend on (B) is linear.

Finally, for (C) consider the search for h(i) from h(i − 1). First we make a jump to s(p(h(i − 1))), then a fast scan down to s(h(i − 1)), and then a slow scan down to h(i). When we insert h(i + 1), we jump and fast scan down to s(h(i)). If you consider these indices in the string we build our suffix over, rather than nodes in the tree, you will see that we always use a slow scan from one head to the other, that is, we slow scan from h(i − 1) to h(i) when inserting h(i), from h(i) to h(i + 1) when inserting h(i + 1), and so on; see Figure 3-21. These intervals do not overlap, and in each iteration, we move the pointer where a slow scan will start to the right. The total time we can spend in “scan 2” is thus O(n).

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig21_HTML.png — Figure 3-21
Slow scan (scan 2) time usage

Searching with suffix trees

To search for a pattern p in a suffix tree of a string x, we use the strategy from the naïve insert we have seen before. We scan down edges in the tree until we either find the pattern or find a mismatch. If the first is the case, then the leaf of the subtree under the edge or the node where we had the hit will be the positions in x where p occurs. The function below implements it. Here the pattern is null-terminated, as C strings are, but do not confuse this null with the sentinel in x$. If we reach the end of the string, we have a null character, and that is how it is used. The function doesn’t directly give us the positions where the pattern matches. It gives us the smallest subtree where the pattern matches the path label to it. All leaves in this subtree are positions where the pattern can be found in x.

The function searches for p from the node v. If p is empty, then we have a match at node v, and we return it. Otherwise, we find the out edge (the edge to a node w) where the first symbol matches the first symbol of p—it is along this edge and its subtree that p might be found. If we do not have a matching out edge, we cannot have a match, and we return a null pointer to indicate that. Assume that we do have an edge to scan along. Then we set s to its beginning and t to its end. The way we scan down the edge is by incrementing s and p and comparing what they point to. This means that p is not the original pattern we are searching for when we run the algorithm, rather it is a pointer to how far into the pattern we have matched so far. Similarly, s is a pointer into how far along the edge we have matched; see Figure 3-22. If we reach the end of the pattern, that is, p points to the null symbol that terminates the string, then we have a match. Although Figure 3-22 shows a single substring of x where the pattern matches, the substring that is the edge label, the pattern will also match at all other leaves in the subtree rooted in w, so we return this node. If we see a mismatch between s and p, we do not have a match in the string, and we return a null pointer. It is important that we test for the end of string p before we check for a mismatch between the characters at s and p. If p points to the termination symbol and s does not, we will have a mismatch in a comparison between the strings, but we want this to be a complete match. If we get to the end of the edge without exhausting the pattern, we continue searching recursively from w. The pattern pointer is already incremented to the point in the pattern where we should continue our search, and the recursion continues the search. The tail recursion will be translated into a loop by most compilers, so the runtime penalty of recursion is minimal.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig22_HTML.png — Figure 3-22
Scanning along an edge when searching for a pattern

static struct suffix_tree_node *

st_search_internal(

struct suffix_tree *st,

struct suffix_tree_node *v,

const uint8_t *p

) {

if (*p == '')

// We are searching from an empty string,

// so we must already be at the right node.

return v;

// Find child that matches *x.

struct suffix_tree_node *w = v->child;

while (w) {

if (*(w->range.from) == *p) break;

w = w->sibling;

}

if (!w) return 0; // The pattern is not here.

// We have an edge to follow!

const uint8_t *s = w->range.from;

const uint8_t *t = w->range.to;

for (; s != t; ++s, ++p) {

if (*p == '') return w; // End of the pattern

if (*s != *p) return 0; // Mismatch

}

// We made it through the edge,

// so continue from the next node.

return st_search_internal(st, w, p);

}

To search in the entire tree, we need to search from the root which is what st_search() does:

struct suffix_tree_node *

st_search(

struct suffix_tree *st,

const uint8_t *p

) {

return st_search_internal(st, st->root, p);

}

Leaf iterators

It is not hard to implement a depth-first traversal of a tree to extract the indices where we have matches, but for a user, iterators are easier to use, which was also our rationale for using them in the matching algorithms in the last chapter. Iterators can be hard to implement, but it is worth it to make the code more usable. An iterator for a depth-first traversal is more complex than the ones we saw in the last chapter, however, because a depth-first traversal is recursive in nature which means that we need a stack, and for an iterator, we need an explicit stack. An explicit stack also solves another problem. A recursive traversal of the tree can have a deep recursion stack and might exceed the available stack space.

A simple way to implement a stack is to use a linked list where we push frames to the front of the list and pop them from the front as well (naturally). Our recursion is over nodes so that is what we put in our stack frames.

struct st_leaf_iter_frame {

struct st_leaf_iter_frame *next;

struct suffix_tree_node *node;

};

static struct st_leaf_iter_frame *

new_frame(struct suffix_tree_node *node)

{

struct st_leaf_iter_frame *frame =

malloc(sizeof(struct st_leaf_iter_frame));

frame->node = node;

frame->next = 0;

return frame;

}

An iterator contains a stack and the results of iterations are leaf nodes.

struct st_leaf_iter {

struct st_leaf_iter_frame *stack;

};

struct st_leaf_iter_result {

struct suffix_tree_node *leaf;

};

When we initialize a new iterator from a node—the root in the tree that we wish to iterate over—we put the node in a frame and make that frame the list. There is a special case when the node is null, that is, the tree is empty. There we leave the stack empty. This way, we do not need to worry about frames with null nodes and the empty stack is the obvious indicator that we are done with the iteration.

void init_st_leaf_iter(

struct st_leaf_iter *iter,

struct suffix_tree *st,

struct suffix_tree_node *node

) {

if (node) iter->stack = new_frame(node);

else iter->stack = 0;

}

If the stack were always empty when we deallocate an iterator, we wouldn’t need to do anything, but it might not be. In that case we need to free all the frames in the stack.

void dealloc_st_leaf_iter(

struct st_leaf_iter *iter

) {

struct st_leaf_iter_frame *frame = iter->stack;

while (frame) {

struct st_leaf_iter_frame *next = frame->next;

free(frame);

frame = next;

}

It is, not surprisingly, in next_st_leaf() the real work is done. Here we follow the steps a recursion would take. We get the next frame from the stack if there are any. If the node in the frame has children, it is not a leaf, so we push its children onto the stack (we get to reverse_push() and why we want it below). If it is a leaf, we free the frame and return it. If it wasn’t a leaf, we also free the frame—just a few lines later in the function—and pop the next frame in the stack.

bool next_st_leaf(

struct st_leaf_iter *iter,

struct st_leaf_iter_result *res

) {

struct st_leaf_iter_frame *frame = iter->stack;

while (frame) {

// Pop the frame.

iter->stack = frame->next;

struct suffix_tree_node *node = frame->node;

if (node->child) {

// We have to push in reverse order to get

// an in-order depth-first traversal.

reverse_push(iter, node->child);

} else {

// Leaf

// clean up and return result

free(frame);

res->leaf = node;

return true;

}

// Get rid of the frame and pop the next.

free(frame);

frame = iter->stack;

}

return false;

}

The order in which we see a node’s children in the recursion depends on the order in which we add them to the stack. In a standard recursive implementation, we can call the function on each child in turn, but with the explicit stack, we need to push all of them to the stack before we pop the first again. If we push the children from the first child and follow its sibling pointers to the last, then the first child will be below the second that is below the third and so forth; see Figure 3-23. If we push the last child first, then the second last, and so on, then we have a stack that, when we pop off and process nodes, will give us the same traversal order as a direct depth-first traversal. If you do not care about which order you traverse the tree, you can use either, but the reverse_push() function is not substantially more complicated than the direct approach:

static void reverse_push(

struct st_leaf_iter *iter,

struct suffix_tree_node *child

) {

if (child->sibling)

reverse_push(iter, child->sibling);

struct st_leaf_iter_frame *child_frame =

new_frame(child);

child_frame->next = iter->stack;

iter->stack = child_frame;

}

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig23_HTML.png — Figure 3-23
Pushing children, direct or reversed

If we want an iterator through the positions where the pattern match, we can wrap the leaf iterator in another iterator for that. The implementation is quite simple. We iterate through leaves—using the iterator earlier—and for each, we set the position and return true. When there are no more leaves, there are no more matches either.

struct st_search_iter {

struct st_leaf_iter leaf_iter;

};

struct st_search_match {

uint32_t pos;

};

void init_st_search_iter(

struct st_search_iter *iter,

struct suffix_tree *st,

const uint8_t *p

) {

struct suffix_tree_node *match = st_search(st, p);

init_st_leaf_iter(&iter->leaf_iter, st, match);

}

bool next_st_match(

struct st_search_iter *iter,

struct st_search_match *match

) {

struct st_leaf_iter_result res;

if (!next_st_leaf(&iter->leaf_iter, &res))

return false;

match->pos = res.leaf->leaf_label;

return true;

}

void dealloc_st_search_iter(

struct st_search_iter *iter

) {

dealloc_st_leaf_iter(&iter->leaf_iter);

}

Comparisons

The naïve implementation runs in worst-case O(n²), while the other two algorithms run in O(n), but how do they compare in practice? The worst input for the naïve algorithm is input where it has to search completely through every suffix to insert it, but for random strings, we expect mismatches early in the suffixes, so here we should get a better running time. It is harder to reason about best- and worst-case data for the other algorithms. For the McCreight’s algorithm, one could argue that it will also benefit from early mismatches since it has a naïve search algorithm as the final step in each iteration. The LCP algorithm doesn’t scan in any way but depends on the SA and LCP arrays (which in turn depends on the string underlying the suffix tree, of course). The algorithm corresponds closely to a depth-first traversal of the suffix tree, so we would expect it to be faster when there are fewer nodes, that is, when the branch-out of the inner nodes is high. We can experiment to see if these intuitive analyses are correct—it turns out that the fan-out of nodes is a key factor and the larger it is, the slower the algorithms get.

In Figure 3-24 you can see the running time of the three algorithms with three types of strings: Equal, strings consisting only on a single symbol; DNA, random⁴ sequences over four symbols (A,C,G,T); and 8-bit characters. Figure 3-25 zooms in on the smaller string sizes and leaves out the time measurements for the naïve algorithm on the Equal alphabet.

Some of the results are as we would expect. The naïve algorithm performs very poorly on single-symbol strings but better on random strings. It is still slower than the other two algorithms on random DNA sequences. With the 8-bit alphabet, the naïve and McCreight’s algorithm run equally fast—because mismatches occur faster with a larger alphabet—but notice two things: McCreight’s algorithm is not running in linear time as it should be, and while the two algorithms run in the same time, they are both slower than when we use the two smaller alphabets.

The culprit is the linked lists we use for node’s children. We have assumed that the alphabet is of a constant size—and generally, it is—but we are comparing running times for different alphabet sizes. The larger the alphabet, the longer it takes to insert nodes and to search for children. Profiling the algorithms reveals that most of the time is spent exactly on traversing these lists, and the larger the alphabet, the larger fraction of time this is. With the single-letter alphabet, Equal, we have the smallest fan-out of nodes—all inner node has two children (the letter and the sentinel). This is the optimal situation, with respect to children list traversal, for McCreight and LCP (but still the worst case for the naïve algorithm where the search time dominates). McCreight does run in linear time for the smallest alphabet, but it will not run in linear time for the larger alphabets until all nodes have most of the alphabet as out edges. The time depends on the degrees of the nodes the algorithms search through, which affects all through algorithms. The time it takes to reach a fan-out close to the maximal, a degree equal to the alphabet size, depends on the length of the string. For those shown in the figures, we are still seeing this effect.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig24_HTML.png — Figure 3-24
The three algorithms on the three different alphabets

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig25_HTML.png — Figure 3-25
Zoom in on short strings

If the linked lists slow down our algorithm, we can consider an alternative.⁵ We can use an array for the children, indexed by the alphabet symbols. With this, we can look up the edge for any symbol in constant time. I have shown the running times of the array-based and linked list–based algorithms in Figure 3-26. The array-based algorithms run in linear time and are not affected by the growing linked lists, but they are still affected by the alphabet size. The larger the alphabet, the larger arrays we need to store in each node, and the poorer IO efficiency we will see.

The fastest construction algorithm appears to be the LCP algorithm, linked list or array-based. We can examine their running time in Figure 3-27. The list-based is slightly faster for the smaller alphabets, and for the 8-bit alphabet, it is faster up to around string length 90,000, where the extra memory usage of the array-based implementation pays off. For smaller alphabets, it seems that the LCP algorithm with linked lists is the way to go.

The LCP algorithm gets its speed from not doing any searches. It does not build the suffix tree using only a string, however, but it needs the suffix array, SA, and the longest common prefix array, LCP. When we measure the running time of the algorithms, we should take into account that we need to compute these arrays before we can use the LCP algorithms. In this chapter, we have seen how to compute the SA and LCP arrays from a suffix tree, but of course, we do not want to include this construction on top of the LCP construction algorithm. We would never build a suffix tree so we could rebuild it with another algorithm. In the next chapter, we will see how to build the two arrays directly from a string, and in Figure 3-28, I have shown how LCP and LCP with the array construction compare to McCreight. The suffix array construction is potentially expensive, and its time usage depends on the alphabet size. The larger the alphabet, the faster it is, maybe counterintuitive but true. In the next chapter, I show several algorithms for computing the suffix array, two of them have this property, and it is one of those I have used for the experiments in this chapter. If we include the array construction, we need a large alphabet for the LCP algorithm to outcompete McCreight’s algorithm. For the DNA alphabet, the two algorithms are equally effective, and for the single-symbol alphabet—the worst case for the array construction algorithm—McCreight’s algorithm is much faster.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig26_HTML.png — Figure 3-26
Array-based children

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig27_HTML.png — Figure 3-27
Array- and list-based LCP constructions

There is a scenario where we would always use the LCP construction algorithm based on the experimental results here. It is not uncommon to preprocess a string we will use for many queries. In bioinformatics, for example, we have strings that are millions of characters long, and we query them with millions of short patterns. In such scenarios, we build a search structure like a suffix array and save it to a file, so it is available every time we have new patterns to search for. It is hard to serialize a suffix tree but trivial to write two arrays to a file and read them in again. If we can read the arrays from a file, we can use the fast LCP algorithm to construct a suffix tree from them.

../images/490521_1_En_3_Chapter/490521_1_En_3_Fig28_HTML.png — Figure 3-28
Adding the array computations to LCP

If we do not have the arrays precomputed, then McCreight’s algorithm or even the naïve algorithm is competitive. The naïve algorithm is, obviously, only worth considering with a large alphabet, but its simplicity and a running time compatible with McCreight might make it a good choice. Further, if memory is a problem, and each word counts, you can implement it without the parent and suffix link pointer; McCreight needs both (and LCP needs the parent pointer but not the suffix link).

You can find the code I used for the measurements on GitHub: https://github.com/mailund/stralg/blob/master/performance/suffix_tree_construction.c.

Footnotes

This is also called the PATRICIA tree, but compacted trie is easier to remember since it is a compact representation of a trie.

The traversal is easier to follow with a recursive function, but for large trees, it will hit the stack limit and crash the program. Using an explicit heap-allocated stack solves the problem. I do not show that implementation here, but you can find it at https://github.com/mailund/stralg.

By definition, we know that h(i) is in the tree, so you might argue that we could use fast_scan( ) to find it. The reason that we cannot is that although we know the string is in the tree, we do not know what the string is. We don’t know when we are done with the scan. For this reason, the naive search is necessary for “scan 2”.

Real DNA sequences are not random, but in my simulated data they are, and it is not that far from real DNA.

I have not listed the implementation in this book since it closely resembles the linked list suffix trees. You can find the implementation at https://github.com/mailund/stralg.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 3. Suffix trees

Create new playlist

Sign In

Sign Up

3. Suffix trees

Compacted trie and suffix representation

Naïve construction algorithm

Suffix trees and the SA and LCP arrays

Constructing the SA and LCP arrays

Constructing the suffix tree from the SA and LCP arrays

McCreight’s algorithm

Searching with suffix trees

Leaf iterators

Comparisons

Table of Contents for
3. Suffix trees