© Thomas Mailund 2021
T. Mailund, Pointers in C Programming, https://doi.org/10.1007/978-1-4842-6927-5_12

12. Search Trees

Thomas Mailund, Aarhus N, Denmark

Search trees , or in this chapter specifically binary search trees, are also recursively defined. A binary search tree is either an empty tree or a node containing a value and two trees, a left and a right subtree. Furthermore, we require that the value in a node is larger than all the values in its left subtree and smaller than all the values in its right subtree. It is the last property that makes search trees interesting for many algorithms. If we know that the smaller values are to the left, and the larger values are to the right, we can efficiently search in a tree, as we shall see.

Figure 12-1 shows three different search trees containing the integers 1, 3, 6, and 8. I have shown empty trees as “ground” symbols, but traditionally we do not show them in figures, and in the following figures, I will leave them out. In A), the root holds the value 1, it has an empty left tree, and its right tree’s root holds the value 3. In B), the root holds 6 and has both a left and a right subtree. On the left, the subtree’s root holds the value 3, and on the right, it holds 8. All the trees in C) have empty right subtrees. As should be clear, the tree representation for a set of values is not unique; there are many ways to organize the values so that they satisfy the properties. The organization affects performance when we search, though. When you want to find the node (if any) that holds the value v, you start in the root of the tree. If the value there is v, you are done and can return that node. If not, you determine whether v is smaller than the value in the root, in which case you should search in the left subtree (where the smaller values are). If, on the other hand, v is larger, you search in the right subtree. If you reach an empty tree, the value wasn’t in the tree.
Figure 12-1

Three different search trees containing the numbers 1, 3, 6, and 8

The time this search takes is proportional to the height of the tree, which can be proportional to the number of elements the tree holds (as is the case for trees A) and C) in Figure 12-1). If the tree is balanced, so the depths of the left and right subtrees are roughly the same for all nodes, then the height is logarithmic in the number of elements. There are many strategies for keeping a search tree balanced, but I will refer you to a textbook on algorithms and data structures to explore them. In this chapter, we will only look at strategies for implementing the data structure. For simplicity, we will only consider trees that hold integers, without duplicates.

Tree Operations

We want to implement the following operations:

1. Determine whether a value is in a tree

2. Insert a new element into a tree

3. Delete a value from a tree

and of course we should have code for freeing a tree once we are done with it.

Contains

If we search for a value v in a tree t, then

1. If t is empty, we report that v is not in it.

2. If the value in t is v, then we report that it is.

3. Otherwise, if v is smaller than the value in t, we look in t’s left subtree, and if it is larger, we look in t’s right subtree.

It is a recursive operation, where we have two base cases: t is empty or it holds the value we are looking for. The recursive case involves searching a subtree, where we can determine which subtree to search from the value in the root of t.

Insert

If we want to insert a value, v, in a tree t, then

1. If t is empty, we should create a tree with a single leaf, a node that holds the value v and has empty left and right subtrees.

2. If t has a value, and it is v, then we leave t as it is. It already has the value, and the order property from the definition prevents a search tree from holding more than one of the same value.

3. Otherwise, if v is smaller than the value in t, insert it into t’s left subtree, and if v is greater, insert it into t’s right subtree.

Again, we have a recursive operation. The base cases are again an empty tree or a tree with v in its root, and the recursive cases are insertions into the left or right subtree.

Delete

Deletion is the most complex of the operations. The overall procedure is this:

1. If t is empty, we are done.

2. If t’s value is v, then delete t and replace it with a subtree (see the following).

3. Otherwise, delete v from the left or right subtree, depending on whether v is smaller than or greater than the value in t.

Step 2 is where the operation is more complicated than the others. To delete the value in the root of a tree, we need to construct a tree that still has all the remaining values. There are two cases here. One, when t has at most one non-empty subtree, we can immediately replace it, and two, if both its subtrees are non-empty, we replace the value in t with the largest value in its left subtree and then remove that value from the left subtree.

Consider case one, where t has at least one empty subtree, and consider Figure 12-2. In the figure, t is the tree rooted in the black node, and the gray triangle represents the (possible) non-empty tree. If we replace t with its non-empty subtree, we have deleted the value, since any value appears at most once in a tree. All values in t’s subtree are smaller than the value in t’s parent, if t is a left subtree, or greater than the value in t’s parent, if t is a right subtree. So replacing t with its subtree will satisfy the order property of search trees.

If both of t’s subtrees are non-empty, we cannot replace t by one of them, but we can do this: find the largest value in the entire tree smaller than v. Since it must be smaller than v, that value must be in t’s left subtree, and since it should be the largest, it must sit in the rightmost node in that subtree; see Figure 12-3. To get this value, we go to t’s left subtree, and then we run down the tree as long as there is a non-empty right subtree. When we reach a node with an empty right subtree, we have what we are looking for. We take that value, and we update t’s value to that. Now we have deleted v from the tree, but we have two copies of the same value, the value in the rightmost node. We need to get rid of one of them. However, the rightmost node has, by definition, an empty right subtree, so we can delete that node following the previous case. So, we delete t’s new value from t’s left subtree, and then we are done.
Figure 12-2

Deleting in a tree with at most one non-empty subtree

Figure 12-3

Deleting a value in a tree with two non-empty subtrees

The procedure works just as well if we get the leftmost value in the right subtree, but we choose the rightmost in the left more by tradition than anything else.
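For completeness, the symmetric helper, finding the leftmost value in a tree, would be a mirror image of the rightmost search. This is a sketch only; it assumes the node representation defined later in the chapter, and the chapter itself only uses the rightmost version:

```c
#include <assert.h>

// The node representation used later in the chapter.
struct node {
  int value;
  struct node *left;
  struct node *right;
};
typedef struct node *stree;

// Mirror image of the rightmost search: follow left pointers until
// there is no left subtree, then report the value found there.
int leftmost_val(stree t)
{
  assert(t); // must not be called with an empty tree
  if (!t->left) return t->value;
  else return leftmost_val(t->left);
}
```

Like its rightmost counterpart, the function is tail-recursive, so an optimizing compiler will turn it into a loop.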

Free

When we free the memory held in a tree:

1. To free an empty tree, do nothing; it is already freed.

2. Otherwise, free the left and the right subtree, and then free the tree.

Recursive Data Structures and Recursive Functions

We didn’t talk about it when we implemented lists, because lists are particularly simple to work with, but with recursive data structures it is often useful to think about the operations on them in terms of recursion as well. Recursive solutions are usually much more straightforward than the corresponding iterative solutions, as they closely match the data structure. However, they come at a cost. If we solve a problem recursively, the recursive calls can fill the call stack, which will crash our program, if we are lucky, or corrupt our memory. With search trees, this is usually not an issue. If we keep a search tree balanced, then the recursion depth is logarithmic in the number of nodes in the tree. That means that we can double the number of nodes and only have to go one recursive call deeper. Unfortunately, we are not writing balanced trees in this chapter, so that will do us no good. And even if we did have balanced trees, there is some overhead in function calls that we can avoid if we do not use recursion.

So, what can we do when the best solution, from a programming perspective, is clean recursive code, but where execution constraints prevent us from recursion? We will explore this with the operations we need on trees in this chapter, but the short answer is that, sometimes, recursion isn’t a problem, and you can safely use recursive functions. This is the case with so-called tail recursion. If a function calls itself but immediately returns the result of the recursive call—so it doesn’t do any further processing with the result of the recursive call, it just returns it—the function is said to be tail-recursive. Then recursion isn’t necessary at all. Most compilers, if you turn on optimization, will translate such functions into loops, and calling them involves a single stack frame only. You do not risk filling the stack, and you do not get any function call overhead beyond the first call. If your compiler doesn’t do this for you, it is also trivial to rewrite such a function yourself. We will see several examples of such recursions.
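As a small illustration (not from this chapter), here is a tail-recursive function that sums the numbers 1 through n, together with the loop a compiler's tail-call optimization effectively turns it into:

```c
// Tail-recursive: the result of the recursive call is returned
// directly, with no further processing after the call.
long sum_to(long n, long acc)
{
  if (n == 0) return acc;
  return sum_to(n - 1, acc + n); // tail call
}

// The loop a compiler's optimization amounts to: the parameters
// simply get new values, and we jump back to the top.
long sum_to_loop(long n, long acc)
{
  while (n != 0) {
    acc = acc + n;
    n = n - 1;
  }
  return acc;
}
```

Both versions compute the same result, but the loop version uses a single stack frame no matter how large n is, which is exactly what the optimizer achieves for the tail-recursive form.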

We cannot always get tail-recursive solutions.1 In a tree, if we need to recurse on both subtrees of a node, for example, to free memory, then at most one of the recursive calls can be the result of the function itself. When you are in this situation, you might have to allocate an explicit stack on the heap. There is more available memory on the heap than on the stack, so this alleviates the problem with exceeding the stack space. It isn’t always trivial to replace a call stack with an explicit stack, however. Call stacks do more than pushing new calls on the stack; when you are done with a recursive call, you also need to return to the correct position in the calling function. You must emulate function calls in your own code, and while sometimes easy, it can get complicated. We will see a few examples of using explicit stacks and discuss why it isn’t advisable in the cases where we need them for our trees.

If we consider using an explicit stack for something like freeing the nodes in a tree, there is another issue. If the stack requires that we allocate heap memory—which it must, if we should be able to handle arbitrarily large trees—then the operation can fail. Should malloc() return NULL, we have an error. We have to deal with that, but if we need the memory to complete the recursive tree traversal, then we are in trouble. Something like freeing memory should never fail—must never fail—because it is practically impossible to recover from. And even if we can recover from a failure, we likely will leak memory. It is far better to have a solution where we can reuse memory we already have to solve the problem. There is, of course, not a general solution for this. What memory we might be able to reuse and how we can arrange existing memory depend on the data structure we have. There is usually a solution if you work at it a little bit. We will see a way to modify search trees that lets us traverse and delete trees without using recursion, an explicit stack, or any additional memory.

There is much to cover, so read on.

Direct Implementation

We start by implementing a direct translation of the operations into C. It will have some issues, in particular with error handling and efficiency, but we will soon fix those.

The data structure for a tree is what you would expect from the examples with lists. We have a node with a value and two subtrees, and the subtrees must be pointers to the node type’s struct. We also define a type stree to be a pointer to a struct node:
struct node {
  int value;
  struct node *left;
  struct node *right;
};
typedef struct node *stree;
As with lists, we write a function for allocating the fundamental struct. There it was a struct link; now it is a struct node. The function takes the member values as input, but in the insertion function, we will only insert new leaves, so we add a macro for that case.
struct node *node(int value, stree left, stree right)
{
  struct node *n = malloc(sizeof *n);
  if (n) *n = (struct node) {
    .value = value, .left = left, .right = right
  };
  return n;
}
#define leaf(V) node(V, 0, 0)
The if-statement in the function checks if we got a non-NULL value from malloc(), and if we did, we assign the values of a struct node into n. You could also have implemented the function as
struct node *node(int value, stree left, stree right)
{
  struct node *n = malloc(sizeof *n);
  if (n) {
    n->value = value;
    n->left = left;
    n->right = right;
  }
  return n;
}

There is no particularly good reason to prefer one version over the other; I just happen to like the first one.

A direct translation of the contains() operation will look like this:
bool contains(stree t, int val)
{
  if (!t)              return false;
  if (val == t->value) return true;
  if (val < t->value)  return contains(t->left, val);
  else                 return contains(t->right, val);
}

This is a good solution for that operation, and there is nothing I would change about it. It is simple, it follows the definition of the operation directly, and it is tail-recursive: in the two recursive cases, the return value of contains() is the direct result of the recursive call. This means that your compiler will likely translate the recursion into a loop in the generated code. If you are not afraid to look at assembly code, you can inspect what is generated using, for example, godbolt.org, where I have put this function at the link https://godbolt.org/z/3adTr3. There, you can see what different compilers generate for it. Without optimization, the recursive function will contain one or more call instructions; if you turn on optimization, the calls disappear and you get a loop (in various forms, depending on the compiler). Godbolt.org is an excellent resource if you want to know what code your compiler is generating. In general, to check whether a tail-recursive function has been translated into a loop, look for call instructions in its generated code. If your compiler isn’t supported by godbolt.org, you will have to generate the assembly yourself (the compiler’s documentation will tell you how) and then check. Compilers generally translate tail-recursive functions into loops; the C standard doesn’t require it, but it is usually a safe assumption.
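If you would rather not rely on the optimizer, the tail recursion in contains() translates mechanically into a loop. Here is a sketch of such a version (the chapter performs the same transformation for other functions later):

```c
#include <stdbool.h>

struct node {
  int value;
  struct node *left;
  struct node *right;
};
typedef struct node *stree;

// Loop version of contains(): instead of calling recursively on a
// subtree, we make t point to that subtree and continue the loop.
bool contains_iter(stree t, int val)
{
  while (t) {
    if (val == t->value) return true;
    t = (val < t->value) ? t->left : t->right;
  }
  return false;
}
```

Reaching a NULL pointer corresponds to the empty-tree base case, and updating t corresponds to the recursive call getting a new argument.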

The insert() operation looks like this:
stree insert(stree t, int val)
{
  if (!t) return leaf(val); // can fail, but we don't handle it
  if (val < t->value) {
    t->left = insert(t->left, val);
  } else if (val > t->value) {
    t->right = insert(t->right, val);
  }
  return t;
}

We return a new leaf with the value if t is empty. If val is smaller than t->value, update t’s left subtree with an insertion, and if it is greater, we update t’s right subtree with an insertion. If t->value == val, we don’t do anything; we just return t at the end of the function.

This function is more problematic than contains(). First, we might have an allocation error in leaf(), in which case we return an empty tree that shouldn’t be empty. This is easy to capture if we start out with an empty tree since we will probably notice that adding a value to an empty tree shouldn’t give us an empty tree back. However, if we insert into a non-empty tree, we call recursively down the tree structure, and somewhere down there, we insert an empty subtree that should have been a leaf. We do not get any information about that back from the call. We get a pointer to the input tree back regardless of whether there was an error or not.

One fix to this problem is allocating the new leaf before we recursively insert it into the tree. That way, we can return NULL in case of an error, and since a successful insertion can never return an empty tree, we can recognize this as an error.
stree insert_node(stree t, struct node *n)
{
  if (!t) return n;
  if (n->value == t->value) {
    free(n); // it was already here
  } else if (n->value < t->value) {
    t->left = insert_node(t->left, n);
  } else {
    t->right = insert_node(t->right, n);
  }
  return t;
}
stree insert(stree t, int val)
{
  struct node *n = leaf(val);
  if (!n) return 0;
  return insert_node(t, n);
}

Of course, if the value is already in the tree, we have allocated a new node for no reason, but that is the cost of handling allocation errors correctly here. Upfront allocation, even if you risk deleting again, is often an acceptable solution to problems such as these. If you want to avoid an extra allocation, you can always call contains() first (at the cost of an extra search).

Another issue is that the function is not tail-recursive. When we insert val in a subtree, we need to update one of t’s subtrees accordingly. When we have more work to do after a recursive call, we cannot get tail recursion. Consequently, there will be function calls here, with the overhead they incur and the risk of exceeding the stack space.

To delete, we need a function for finding the value in the rightmost node in a tree. We can implement a function for that like this:
int rightmost_val(stree t)
{
  assert(t);
  if (!t->right) return t->value;
  else return rightmost_val(t->right);
}

We will not call it with an empty tree, so we assert that. Otherwise, t->right would be dereferencing a NULL pointer, which we always want to avoid. We test if there is a right subtree, and if there isn’t, we return t’s value. Otherwise, we continue searching in t’s right subtree. The function is tail-recursive, and with an optimizing compiler, you get an efficient looping function.
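Being tail-recursive, rightmost_val() also has a direct loop form, should you prefer not to rely on the optimizer. A sketch:

```c
#include <assert.h>

struct node {
  int value;
  struct node *left;
  struct node *right;
};
typedef struct node *stree;

// Loop version of rightmost_val(): follow right pointers until
// the right subtree is empty, then return the value there.
int rightmost_val_iter(stree t)
{
  assert(t); // must not be called with an empty tree
  while (t->right)
    t = t->right;
  return t->value;
}
```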

Now we can delete():
stree delete(stree t, int val)
{
  if (!t) return t;
  if (val == t->value) {
    if (!(t->left && t->right)) {
      stree subtree = t->left ? t->left : t->right;
      free(t);
      return subtree;
    } else {
      t->value = rightmost_val(t->left);
      t->left = delete(t->left, t->value);
    }
  } else if (val < t->value) {
    t->left = delete(t->left, val);
  } else if (val > t->value) {
    t->right = delete(t->right, val);
  }
  return t;
}
If we have an empty tree, the result is the tree itself. If t’s value is the one we are deleting, we have the two cases discussed earlier. When t doesn’t have both subtrees, we remove t and return its subtree. The expression
t->left ? t->left : t->right

will give us the left subtree if it isn’t empty and otherwise give us the right subtree. If both trees are empty, we get the right subtree, but that doesn’t matter, since they are both NULL. If t has both subtrees, we get the rightmost value, put it in t->value, and then we delete it from t->left.

Finally, if t->value is greater or smaller than val, we delete recursively, updating the left or right subtree accordingly.

This function, like insert(), is not tail-recursive. Updating the subtrees after the recursive calls prevents this.

When we free a tree, we must free both subtrees as well. The function can look like this:
void free_stree(stree t)
{
  if (!t) return;
  free_stree(t->left);
  free_stree(t->right);
  free(t);
}

This, obviously, isn’t tail-recursive either. We cannot make it tail-recursive because there are two recursive calls involved.
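We can, however, eliminate one of the two calls by hand, recursing on the left subtree and looping along the right spine. This is only a partial remedy (the remaining recursion depth is the largest number of left-edges on any root-to-leaf path, which is still unbounded for unbalanced trees), but it is a common trick. A sketch, not the chapter's code:

```c
#include <stdlib.h>

struct node {
  int value;
  struct node *left;
  struct node *right;
};
typedef struct node *stree;

// Recurse on the left subtree, but loop along the right spine.
// Only one of the two original recursive calls remains.
void free_stree_semi(stree t)
{
  while (t) {
    free_stree_semi(t->left);
    stree right = t->right; // save before freeing t
    free(t);
    t = right; // continue with the right subtree in the loop
  }
}
```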

If you want to make a tree from an array, you could write a function such as this:
stree make_stree(int n, int array[n])
{
  stree t = 0;
  for (int i = 0; i < n; i++) {
    t = insert(t, array[i]);
  }
  return t;
}

We don’t have recursion issues here, at least not directly. We iteratively call insert() (which does have recursion issues).

If you want to print a tree, you want to print the left and right subtrees as well, so here we need recursion, and again we cannot get tail recursion because of the two recursive calls:
void print_stree(stree t)
{
  if (!t) return;
  putchar('(');
    print_stree(t->left);
    putchar(',');
    printf("%d", t->value);
    putchar(',');
    print_stree(t->right);
  putchar(')');
}
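For experimenting, the direct implementation condenses into a small self-contained unit. This sketch repeats the chapter's functions (with the insert() that ignores duplicates) so it compiles on its own:

```c
#include <stdbool.h>
#include <stdlib.h>

struct node {
  int value;
  struct node *left;
  struct node *right;
};
typedef struct node *stree;

struct node *node(int value, stree left, stree right)
{
  struct node *n = malloc(sizeof *n);
  if (n) *n = (struct node){ .value = value, .left = left, .right = right };
  return n;
}
#define leaf(V) node(V, 0, 0)

stree insert(stree t, int val)
{
  if (!t) return leaf(val); // allocation errors not handled here
  if (val < t->value)      t->left  = insert(t->left, val);
  else if (val > t->value) t->right = insert(t->right, val);
  return t; // val == t->value: already present, nothing to do
}

bool contains(stree t, int val)
{
  if (!t)              return false;
  if (val == t->value) return true;
  return contains(val < t->value ? t->left : t->right, val);
}

void free_stree(stree t)
{
  if (!t) return;
  free_stree(t->left);
  free_stree(t->right);
  free(t);
}

stree make_stree(int n, int array[n])
{
  stree t = 0;
  for (int i = 0; i < n; i++)
    t = insert(t, array[i]);
  return t;
}
```

With int a[] = { 6, 3, 8, 1 }; and stree t = make_stree(4, a);, the tree gets 6 at the root, 3 and 8 as its children, and 1 under 3, essentially the shape of tree B) in Figure 12-1; contains(t, 3) is true, contains(t, 5) is false, and free_stree(t) releases it.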

Pass by Reference

The problems we have with insertion and deletion are caused by the same design flaw we had in the first list implementation in Chapter 11. We have designed the functions such that they return a new tree instead of modifying an existing one. That means that the recursive calls give us the trees we then need to store in one of t’s subtrees. The changes we made to lists will also work here. We want the functions to take a tree that we can update as input. That means that an empty tree cannot be a NULL pointer but must be something else, or at least that if it is a NULL pointer, it is passed by reference. We can use either of the solutions from the lists: make a pointer to a pointer our representation, or use a dummy node. The first solution will work for us here, so that is what we will do.

We will still have the type stree be a pointer to struct node, but change the operations, so they work with pointers to stree instead of stree. So the functions take pointers to pointers to nodes. An empty tree is a pointer to a NULL pointer.

An stree * pointer on the stack will work fine for a tree; we just have to pass it by reference in function calls, but should you want a heap-allocated tree, we can provide a function for that as well:
stree *new_stree(void)
{
  stree *t = malloc(sizeof *t);
  if (t) *t = 0;
  return t;
}

It allocates an stree * pointer and, if successful, sets the pointed-to value to NULL, making it an empty tree.

If we heap-allocate trees, we should also have a function for freeing them again. So let us change the name of the free_stree() function from before to one that makes it explicit that it is only freeing nodes:
void free_nodes(struct node *n)
{
  if (!n) return;
  free_nodes(n->left);
  free_nodes(n->right);
  free(n);
}
and provide a second function for freeing a tree as well.
static inline
void free_stree(stree *t)
{
  free_nodes(*t);
  free(t);
}
Now we make the operations take an stree * argument, which means that we must dereference the argument to get the node. For contains(), the function changes to this:
bool contains(stree *tp, int val)
{
  assert(tp);
  stree t = *tp;
  if (!t)              return false;
  if (val == t->value) return true;
  if (val < t->value)  return contains(&t->left, val);
  else                 return contains(&t->right, val);
}

To make the code more readable, I assign *tp to a variable t of type stree, so I can refer to the tree as t. Little else changes in the function, but in the recursive calls we must provide the address of a tree, not the tree itself, so we call recursively with &t->left (the address of the left subtree) or &t->right (the address of the right subtree). The & operator binds such that &t->left means &(t->left), the address of t’s left member, not (&t)->left, which would be the left member of whatever structure the address of t is. Since &t isn’t a pointer to a structure, but a pointer to a pointer, it hasn’t got a left member, and the expression (&t)->left would give you a type error.

When we use a pointer to a tree in insert(), we will interpret it as the target position where we should insert. If the target points to NULL, we have an empty tree, and if we write a node into the target, we have put a tree there. Thus, we can write insert() as
bool insert(stree *target, int val)
{
  assert(target);
  stree t = *target;
  if (!t) {
    return !!(*target = leaf(val));
  }
  if (val == t->value) {
    return true; // already there; nothing to insert
  } else if (val < t->value) {
    return insert(&t->left, val);
  } else {
    return insert(&t->right, val);
  }
}

We, once again, get the tree that target points to, to make the code more readable. If it is NULL, then target is an address that holds an empty tree, and we want that address to hold a new leaf instead. We, therefore, allocate a leaf and put it in *target. An assignment is also an expression, so (*target = leaf(val)) gives us a value. It is the right-hand side of the assignment, which is the new leaf. If the allocation failed, we get NULL. If we turn a pointer into a truth value (with the !! trick we have seen before), then we turn the assignment into a value that is true if the allocation was successful and false otherwise, which we will return to indicate whether the insertion was successful or not.

The recursive cases didn’t change, except that we pass the address of the subtrees along in the call, instead of just the subtrees. That means that we can modify the tree we use in the recursion, so we don’t need to update the tree after the call. The function is now tail-recursive.

Be careful when you use an address as a parameter to a function call. You can get into trouble if it is the address of a stack object. There is nothing wrong with it per se. If we put a tree (an stree object) on the stack and use it with our operations, then we have a tree on the stack, and as long as we do not use it after we have returned (which we can’t, as it is gone by then), all is well. But if we, for example, took the address of a local variable t and passed it along in a recursive call in insert(), we could write into the address, but we would be writing into a local variable, and what we put there is lost once the call to insert() is done. The compiler would also see that we use the address of a local variable, one that gets fresh memory in each recursive call, so the tail recursion optimization would not be possible. That, though, is a minor problem, since the function would already be broken if we passed addresses of local variables along in the call. It is not what we do here, of course: the addresses we pass along in the recursive calls are addresses inside heap-allocated nodes.
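A correct use of a stack-allocated tree with the pass-by-reference interface might look like the demo() function below. The block repeats the definitions so it is self-contained; it is a sketch, and the insert() here skips values already in the tree:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct node {
  int value;
  struct node *left;
  struct node *right;
};
typedef struct node *stree;

struct node *node(int value, stree left, stree right)
{
  struct node *n = malloc(sizeof *n);
  if (n) *n = (struct node){ .value = value, .left = left, .right = right };
  return n;
}
#define leaf(V) node(V, 0, 0)

bool insert(stree *target, int val)
{
  assert(target);
  stree t = *target;
  if (!t) return !!(*target = leaf(val));
  if (val == t->value) return true; // already there
  if (val < t->value)  return insert(&t->left, val);
  else                 return insert(&t->right, val);
}

bool contains(stree *tp, int val)
{
  stree t = *tp;
  if (!t)              return false;
  if (val == t->value) return true;
  return contains(val < t->value ? &t->left : &t->right, val);
}

void free_nodes(struct node *n)
{
  if (!n) return;
  free_nodes(n->left);
  free_nodes(n->right);
  free(n);
}

// The tree lives in a local stree variable. We pass the variable's
// address to the operations, but the addresses passed further down
// the recursion are all inside heap-allocated nodes, so nothing
// dangles and the tail calls remain safe.
bool demo(void)
{
  stree t = 0; // empty tree on the stack
  bool ok = insert(&t, 6) && insert(&t, 3) && insert(&t, 8);
  ok = ok && contains(&t, 3) && !contains(&t, 5);
  free_nodes(t);
  return ok;
}
```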

The updated delete() is also tail-recursive:
stree *rightmost(stree *t)
{
  assert(t && *t);
  if (!(*t)->right) return t;
  else return rightmost(&(*t)->right);
}
void delete(stree *target, int val)
{
  assert(target);
  stree t = *target;
  if (!t) return;
  if (val == t->value) {
    if (!(t->left && t->right)) {
      stree subtree = t->left ? t->left : t->right;
      *target = subtree;
      free(t);
    } else {
      stree *rm_ref = rightmost(&t->left);
      stree rm = *rm_ref;
      t->value = rm->value;
      *rm_ref = rm->left;
      free(rm);
    }
  } else if (val < t->value) {
    delete(&t->left, val);
  } else if (val > t->value) {
    delete(&t->right, val);
  }
}
Figure 12-4

Return value of rightmost()

Our new rightmost() returns the address that holds the pointer to the rightmost tree; see Figure 12-4. We search down a tree’s right trees until we get to a node where right is empty. However, the variable we use in the function is the address that holds the tree, not the tree itself. So, when we reach the rightmost node, the parameter to the rightmost() call is the address in the parent node. Dereferencing the result of the rightmost() call gives us the rightmost node, and writing into the result changes the right tree in the parent. We exploit this to delete the rightmost value from the left tree. We dereference the rightmost reference to put the value into t, and then we replace the rightmost tree with its left child directly, no second call to delete() this time.

Our remaining functions, make_stree() and print_stree(), do not change much. We will heap-allocate a new tree pointer in make_stree(), and we will pass the tree by reference in print_stree():
stree *make_stree(int n, int array[n])
{
  stree *t = new_stree();
  if (!t) return 0;
  for (int i = 0; i < n; i++) {
    if (!insert(t, array[i])) {
      free_stree(t);
      return 0;
    }
  }
  return t;
}
void print_stree(stree *t)
{
  if (!*t) return;
  putchar('(');
    print_stree(&(*t)->left);
    putchar(',');
    printf("%d", (*t)->value);
    putchar(',');
    print_stree(&(*t)->right);
  putchar(')');
}

In print_stree(), I didn’t bother with a variable that holds the dereferenced value of the parameter. Here, t is the pointer to a tree. That means that I need to dereference t to get a tree, which is why you see expressions such as &(*t)->left. We dereference t to get a tree, *t, then get the left tree in that, (*t)->left, and then we get the address of that tree, &(*t)->left.

Refactoring

The recursive search in contains(), insert(), and delete() (except for rightmost()) is the same, so we can extract it into a separate function and refactor the code. We can write a function, find_loc() (find location), that returns the address that points to the node with a given value or, if the value isn’t in the tree, the address where it should sit. It would look like this:
stree *find_loc(stree *t, int val)
{
  if (!*t || (*t)->value == val)
    return t;
  else if (val < (*t)->value)
    return find_loc(&(*t)->left, val);
  else
    return find_loc(&(*t)->right, val);
}
To see if a tree has a value, find the location, and check if there is a tree there. If there is, the tree contains the value. If there isn’t, it doesn’t. Since find_loc() returns the address where the tree should be, we must dereference it and turn the pointer into a truth value:
bool contains(stree *t, int val)
{
  return !! *find_loc(t, val);
}
In insert() and delete() , instead of searching, we call find_loc(), and then we update the tree according to what we find:
bool insert(stree *t, int val)
{
  stree *target = find_loc(t, val);
  if (*target) return true; // already there
  else return !!(*target = leaf(val));
}
void delete(stree *t, int val)
{
  stree *target = find_loc(t, val);
  if (*target) {
    stree t = *target;
    if (!(t->left && t->right)) {
      *target = t->left ? t->left : t->right;
      free(t);
    } else {
      stree *rm_ref = rightmost(&t->left);
      stree rm = *rm_ref;
      t->value = rm->value;
      *rm_ref = rm->left;
      free(rm);
    }
  }
}

Iterative Functions

The contains(), insert(), and delete() operations are no longer recursive, but they rely on find_loc() and rightmost() that are. Both of those, however, are tail-recursive, so the code that the compiler generates, when you enable optimization, will not involve function calls. Of course, if you are not comfortable relying on the mercy of your compiler’s optimization, you can translate the functions into iterative ones yourself. With tail-recursive functions, that is usually straightforward.

Take find_loc():
stree *find_loc(stree *t, int val)
{
  if (!*t || (*t)->value == val)
    return t;
  else if (val < (*t)->value)
    return find_loc(&(*t)->left, val);
  else
    return find_loc(&(*t)->right, val);
}
We want to replace recursion with a loop, so we make a loop condition that should terminate the loop when we would directly return from the recursion’s base case. The inversion of the base case condition will usually do.
stree *find_loc(stree *t, int val)
{
  while (*t && (*t)->value != val) {
    // loop body
  }
  return t;
}
Then, inside the loop body, where you would normally call recursively, you update the function arguments instead. A recursive call will give the parameters new values; we give them those values directly. So
find_loc(&(*t)->left, val);
would become
t = &(*t)->left; val = val;
where the assignment to val is obviously not necessary here. The looping version of find_loc() looks like this:
stree *find_loc(stree *t, int val)
{
  while (*t && (*t)->value != val) {
    if (val < (*t)->value) {
      t = &(*t)->left;
    } else {
      t = &(*t)->right;
    }
  }
  return t;
}
Similarly, the loop version of rightmost(), translated using the same procedure, looks like this:
stree *rightmost(stree *t)
{
  while ((*t)->right)
    t = &(*t)->right;
  return t;
}

Explicit Stacks

The freeing and printing functions still use recursion. If we want to avoid exceeding the stack space, we can move the recursion from the call stack to an explicit stack. We shall see that this approach carries its own problems, but let us implement it first.

One way to represent a stack is as a singly linked list. Stack frames hold the information we need for the recursions and then a next pointer to the next stack frame. A stack is empty if it is NULL. For free_nodes(), we can implement the stack and stack frames like this:
struct free_frame {
  struct node       *node;
  struct free_frame *next;
};
void free_push(struct free_frame **stack, struct node *node)
{
  if (!node) return;
  struct free_frame *frame = malloc(sizeof *frame);
  if (!frame) abort(); // We do not tolerate errors!
  *frame = (struct free_frame){ .node = node, .next = *stack };
  *stack = frame;
}
struct node *free_pop(struct free_frame **stack)
{
  struct free_frame *top = *stack;
  struct node *node = top->node;
  *stack = top->next;
  free(top);
  return node;
}

The free_push() and free_pop() functions take the stack by reference, a pointer to a pointer to a stack frame, so they can modify the stack. When we push, we allocate a new stack frame, initialize it with a node and the current top of the stack, and then point the stack at the new frame, making it the new top. If the allocation fails, we terminate the program. Freeing data should never fail, and if it does, we don't know how to handle the situation—we are probably leaving the program in an inconsistent state, and we will most likely leak memory. We could attempt to recover, so we don't crash the entire program, but here I give up and call abort(). In the following printing code, we will recover slightly more gracefully.

Now the free_nodes() function pushes its input on the stack, and then it loops as long as there are stack frames. It pops the top frame, pushes the two subtrees, and frees the node:
void free_nodes(struct node *n)
{
  struct free_frame *stack = 0;
  free_push(&stack, n);
  while (stack) {
    n = free_pop(&stack);
    free_push(&stack, n->left);
    free_push(&stack, n->right);
    free(n);
  }
}

When we push the left child first, we process it last, since stacks work in a last-in, first-out fashion. If we wanted to process the left subtree first, as we do in the recursion, we would have to push the subtrees in the opposite order. The order doesn't matter when we are deleting the nodes, though, so deleting right subtrees before left subtrees is fine.

When we print a tree, we need the left subtree printed before the right subtree, so there we must push the right subtree before the left subtree. However, handling printing is slightly more complicated than that. There are multiple operations we need to do before and after the recursive calls. We must print a left parenthesis before we handle the left subtree. We must print a comma, the node’s value, and another comma after the recursive call to the left; only then can we recurse on the right, and once we are done on the right, we must print a right parenthesis. Replacing a call stack with an explicit stack is often complicated if you need to return to different states in your function after the recursive calls. There are different techniques and strategies for dealing with a function’s state when we move to an explicit stack. Still, although the printing function is more complex than free_nodes(), it is quite simple, and the solution we implement later just pushes the various operations we do in the function. In our stack frames, we put an operation, which can be LPAR for printing “(”, TREE for handling a tree recursively, COMMA for printing a comma, VALUE for printing a node’s value, and RPAR for printing “)”. For TREE and VALUE, we need an associated node, so we have a struct node pointer in the stack frame as well.
enum print_op {
  LPAR, TREE, COMMA, VALUE, RPAR
};
struct print_frame {
  enum print_op op;
  struct node *node;
  struct print_frame *next;
};

The stack is a pointer to a frame as well, but I will also put a jmp_buf to handle allocation errors. This is a buffer in which we can store the program’s state, which in practice means its registers. If we call the function setjmp(), we store the registers. If we later call longjmp(), we restore them. Restoring means that we reset the stack and instruction pointers, so the program goes to the position on the stack where we called setjmp(), and the instruction pointer starts executing right after the call to setjmp(). The only difference between the call to setjmp() when we stored the registers and when/if we returned to it from a longjmp() is that setjmp() returns zero the first time and a value we give to longjmp() the second time.

The setjmp()/longjmp() functions are a crude version of raising and catching exceptions. We can return to an arbitrary point on the stack, with the program's registers restored, but there is no checking for whether it is a valid state. You can call longjmp() with an environment that moves you to a stack frame that is long gone. The mechanism does nothing to clean up resources, and if you lose the only pointers to heap-allocated memory in a jump, that memory is leaked. You need to be careful to free resources when you use this mechanism, just as you must be with normal returns from functions. It is highly unsafe, but it will work for our purposes here. We get a way to return to our printing function from nested function calls that push to the stack.

So, we define a stack like this:
struct print_stack {
  struct print_frame *frames;
  jmp_buf env;
};
and implement push as
void print_push(struct print_stack *stack,
                enum print_op op, struct node *node)
{
  struct print_frame *frame = malloc(sizeof *frame);
  if (!frame) longjmp(stack->env, 1); // bail!
  *frame = (struct print_frame){
    .op = op, .node = node, .next = stack->frames
  };
  stack->frames = frame;
}

An allocation failure results in a longjmp() which we will catch in the main function. The second argument to longjmp(), here 1, is what setjmp() will have returned in the restored state. As long as it is not zero, we can recognize that we have returned from a longjmp(). We won’t implement a pop() operation because we will deal with popping directly in the printing function.

The following two functions push the operations we need to do for one node and execute an operation we have popped from the stack:
void schedule_print(struct print_stack *stack,
                    struct node *node)
{
  print_push(stack, RPAR, 0);
  if (node->right) print_push(stack, TREE, node->right);
  print_push(stack, COMMA, 0);
  print_push(stack, VALUE, node);
  print_push(stack, COMMA, 0);
  if (node->left) print_push(stack, TREE, node->left);
  print_push(stack, LPAR, 0);
}
void handle_print_op(enum print_op op, struct node *node,
                     struct print_stack *stack)
{
  switch (op) {
    case LPAR: putchar('('); break;
    case RPAR: putchar(')'); break;
    case COMMA: putchar(','); break;
    case VALUE: printf("%d", node->value); break;
    case TREE: schedule_print(stack, node); break;
  }
}
The print function initializes the stack and stores the register states. Then we schedule the first node and iteratively get the top of the stack, take the operation and node from it, and free the top frame before we handle the operation.
void print_stree(stree *t)
{
  if (!*t) return;
  enum print_op op;
  struct node *n = 0;
  struct print_stack stack = { .frames = 0 };
  if (setjmp(stack.env) != 0) goto error;
  schedule_print(&stack, *t);
  while (stack.frames) {
    struct print_frame *top = stack.frames;
    op = top->op; n = top->node;
    stack.frames = top->next;
    free(top);
    handle_print_op(op, n, &stack);
  }
  return;
error:
  while (stack.frames) {
    struct print_frame *top = stack.frames;
    stack.frames = top->next;
    free(top);
  }
}

If we get an allocation error, the push function will call longjmp(), which results in us returning from setjmp() a second time, but now with a non-zero return value. If that happens, we goto the error label and free the stack. In the while-loop, we must free top before we call handle_print_op(). We might not return from handle_print_op()—the longjmp() error handling will send us directly back to the setjmp()—so we must leave the function in a state where the code at the error label can clean up all resources. If we free top before we handle the operation, the remaining frames are all on the stack, and the error handling code will free them.

Here, we can handle an error, but we cannot undo the damage we have done by writing parts of the tree to output and then bailing. What we print, we cannot undo. And when we use an explicit stack, we can get allocation errors we will have to recover from (unless we give up and abort, as with free_nodes()). There are many cases where we need to dynamically allocate memory for stacks and queues and whatnot when our programs run, but for freeing memory, or for operations that we cannot undo, we should try hard to avoid allocations. This isn’t always possible, but for traversing trees, there are tricks.

If you want to traverse a tree, and you don't want to allocate new memory doing it, you have several options. A general option you always have, regardless of whether we consider trees or other objects, is to set aside memory for a stack in your data structures. Add one pointer to the struct node, and you can chain the nodes in a stack if you want to. Pushing and popping work as before, but the stack pointer sits embedded in the struct.
struct node {
  int value;
  struct node *left;
  struct node *right;
  struct node *stack;
};
stree node(int value, stree left, stree right)
{
  stree t = malloc(sizeof *t);
  if (t) *t = (struct node){
    .value = value,
    .left = left, .right = right,
    .stack = 0
  };
  return t;
}
void push_node(struct node **stack,
               struct node *node)
{
  if (node) {
    node->stack = *stack;
    *stack = node;
  }
}
struct node *pop_node(struct node **stack)
{
  struct node *top = *stack;
  *stack = top->stack;
  top->stack = 0;
  return top;
}
Then you can free the nodes with
void free_nodes(struct node *n)
{
  struct node *stack = 0;
  push_node(&stack, n);
  while (stack) {
    n = pop_node(&stack);
    push_node(&stack, n->left);
    push_node(&stack, n->right);
    free(n);
  }
}

For more complex recursions, though, a node must hold all the information necessary for that algorithm, so the structure can grow in complexity beyond what we are willing to accept.

Morris Traversal

Many algorithms can traverse a tree without using an implicit or explicit stack, and we will see one in this section. This algorithm, Morris traversal, doesn’t require that we add any additional data to the nodes. Instead, it uses a clever way to modify the tree as we traverse it, to simulate recursion, and then restores it to its initial state as the simulated traversal returns.

The algorithm simulates an in-order traversal, that is, one where we first visit all the nodes in the left subtree of a node, then we handle the node, and then we visit all the nodes in the right subtree. Or in C, it simulates this function:
void inorder(struct node *n)
{
  if (!n) return;
  inorder(n->left);
  printf("%d ", n->value);
  inorder(n->right);
}
The Morris traversal algorithm is closely related to so-called threaded trees (also invented by Morris). These are search trees with an additional property. If a node doesn't have a left or a right subtree, the empty subtree is replaced by a pointer to the node holding the previous value (left) or the next value (right) in the tree. Figure 12-5 shows an example, where I have only drawn the thread pointers for right subtrees, as those are the only ones we need for the traversal we will implement. The tree holds the values 1, 3, 4, 5, 6, 7, and 8. Node 1 is a leaf, so it does not have left and right subtrees, but it has a pointer to the node with the next value in the tree, 3. Node 4, likewise, doesn't have subtrees, so it has a pointer to the node with value 5. Node 5 does have a left subtree, but not a right subtree, so its (right) thread pointer points to the node with the next value, 6, and so on. Node 8 doesn't have a right subtree, but there is no node with a higher value, so its threaded right pointer is NULL.
../images/500158_1_En_12_Chapter/500158_1_En_12_Fig5_HTML.png
Figure 12-5

A (right) threaded tree

Imagine that we need to do an in-order traversal of such a tree. We can recurse by following left pointers as long as we have them, which in the example will take us to node 1. Here, we cannot continue further left, so we visit 1 (we have, trivially, visited all its left children). Now we recurse to the right, that is, we follow its right pointer. Node 1 doesn't have a right child, but we have a "threaded" pointer instead that works like one, so going right from 1 corresponds to returning from the recursion to node 3. When we enter node 3, we would recurse to the left again, unless we can recognize that we have already been there, so let us, for now, assume that we tag the nodes so we can tell that we have already been to node 3 once. If we can see that, we know that we came back from a left recursion, so we visit node 3 and recurse to the right. In 5, a node we haven't seen before, we recurse to the left, and we get to node 4, where we cannot go further left. We visit 4 and go right. That takes us back to 5, where we realize that we have been before, so we visit 5 and go right instead of left this time, which takes us up to 6. Recognizing that we have been here before, we visit 6 and go right, to 8, and recurse from here. We have not been here, so we go left, to 7. We cannot continue left, so we visit 7 and go right, back to 8. Since we have been in 8 once before, we visit the node, and then we go right. This time, there is no right child, threaded or otherwise, so the traversal ends.

If we have a threaded tree, the “recursion” works as follows: If we have a node we haven’t seen before, we go left. If we have a node that we have seen before, we visit it and go right. “Returning” from the recursion happens whenever a step to the right is along a threaded pointer rather than a real subtree.

How can we get the threaded pointers, and how can we recognize that we are in a node that we have seen before? Notice that if you have a node n, and another node m has a threaded pointer to it, then n holds the next value after m in the tree. That means that m is the rightmost node in n's left subtree. So we can go from n to m with our rightmost() function. Once there, we can, of course, write a pointer to n into the empty right subtree. So when we enter a node for the first time, we can run down and find the node where we should insert a threaded pointer. How do we recognize that we have visited a node before? If we search for the rightmost node in the left subtree, and we encounter the node itself, then we inserted the threaded pointer the first time we visited the node! So the same rightmost search will tell us whether we are seeing a node for the first or the second time.

We can modify the rightmost() function to get a rightmost_to() function that stops early in the recursion if we encounter a given node. If we didn’t have that stop condition, we would recurse forever, as following the right subtrees gives us a loop when we make the right subtree point to an ancestor. Such a function can look like this:
stree *rightmost_to(stree *t, struct node *n)
{
  if ((*t)->right == 0 || (*t)->right == n) {
    return t;
  } else {
    return rightmost_to(&(*t)->right, n);
  }
}
#define rightmost(t) rightmost_to(t, 0)

The rightmost() macro is there, so we can use rightmost() in deletion, without the weird extra argument.

An in-order traversal that uses right thread pointers now looks like this:
void morris(stree *t)
{
  struct node *curr = *t;
  while (curr) {
    if (!curr->left) {
      // Visit
      printf("%d ", curr->value);
      curr = curr->right;
    } else {
      stree rm = *rightmost_to(&curr->left, curr);
      assert(rm->right == 0 || rm->right == curr);
      if (rm->right == 0) {
        rm->right = curr;
        curr = curr->left;
      } else {
        // Visit
        printf("%d ", curr->value);
        rm->right = 0;
        curr = curr->right;
      }
    }
  }
}

If you don’t have a left subtree, you visit the node and go right. Otherwise, you figure out if you can find yourself as the rightmost node on the left. If not, you update the rightmost node’s right subtree to point to the current node, and you recurse to the left. If yes, you restore the right subtree in the rightmost node by setting it to NULL, you visit the current node, and then you go right.

The running time is proportional to the size of the tree. You search for the rightmost node in the left subtree twice in every node (that has a left subtree), but the nodes you run through in such searches do not overlap between different nodes, so each node is visited at most twice in such searches.

You don’t quite get the recursion in print_stree() here, although you visit the nodes in the same order. It is simple to output the left parentheses, the commas, and the node values, but when you return from the recursion by going through a right pointer, you go up a number of recursive calls in one step, and you can’t see how many. You can annotate the nodes with depth information and get print_stree() behavior, but I will not bother here. Traversing the nodes in order suffices for most applications, and we rarely need to know the exact tree structure.

Freeing Nodes Without Recursion and Memory Allocation

For freeing the nodes, we can simplify the Morris traversal a bit. If we are deleting the tree anyway, we are allowed to modify it (and we don't need to worry about restoring it during the traversal). So, when we process a node with a left subtree, we store a pointer to the node in the subtree's rightmost node as before. We don't need to check whether the rightmost node is the node itself, because it never will be: right before we recurse to the left, we remove the left subtree. We can modify the tree, since we are deleting it, and removing the subtree prevents us from attempting a second recursion to the left.
void free_nodes(struct node *n)
{
  struct node *curr = n;
  while (curr) {
    if (!curr->left) {
      struct node *right = curr->right;
      free(curr);
      curr = right;
    } else {
      // Add pointer to rightmost so we can go back
      (*rightmost(&curr->left))->right = curr;
      // Recurse left, but make sure we never go left again
      struct node *left = curr->left; curr->left = 0;
      curr = left;
    }
  }
}

It simplifies the algorithm, and we save one rightmost search per node. And we free the tree without using any additional memory, so freeing cannot fail: not from stack exhaustion due to recursion, and not from the memory allocation errors we would risk with an explicit stack.

Adding a Parent Pointer

A problem with both embedding a stack in the nodes and with Morris traversal is that we can at most run one traversal at a time—because we are using shared memory embedded in the trees. With Morris traversal, we must also complete the traversal before the tree is restored to a consistent state (with the embedded stack, it isn’t a problem as we overwrite the stack in a new traversal). Concurrency is out of the question with these strategies (and with many related strategies that rely on modifying trees for traversal).

If we are willing to add a pointer to the parent of each node (with NULL in the root), then we can traverse a tree without modifying it and without dynamically allocating memory while we do it. This does mean extending the node struct, of course, but for many of the strategies for balancing trees, a parent pointer is necessary in any case, or at least makes the code vastly faster, so it is not a high price to pay.

It is trivial to add the pointer to the struct:
typedef struct node *stree;
struct node {
  int value;
  struct node *parent;
  struct node *left;
  struct node *right;
};
stree node(int value, stree parent,
           stree left, stree right)
{
  stree t = malloc(sizeof *t);
  if (t) *t = (struct node){
    .value = value, .parent = parent,
    .left = left, .right = right
  };
  return t;
}
#define leaf(V,P) node(V, P, 0, 0)
but then we also need to add it to the operations for modifying the tree. That means that find_loc() must set the parent pointer. We can track the pointer going down the recursion/loop, and if we have a parent pointer argument, passed by reference, we can return it that way:
stree *find_loc(stree *t, int val, stree *p)
{
  while (*t && (*t)->value != val) {
    *p = *t;
    if (val < (*t)->value) t = &(*t)->left;
    else                   t = &(*t)->right;
  }
  return t;
}
bool contains(stree *t, int val)
{
  stree parent = 0;
  return !! *find_loc(t, val, &parent);
}
bool insert(stree *t, int val)
{
  stree parent = 0;
  stree *target = find_loc(t, val, &parent);
  if (*target) return true; // already there
  else return !!(*target = leaf(val, parent));
}
We do the same with rightmost(), although there we do not set the initial value for p, as we do not always call rightmost() with the root.
stree *rightmost(stree *t, stree *p)
{
  while ((*t)->right) {
    *p = *t;
    t = &(*t)->right;
  }
  return t;
}
When we delete, we get the parent from the first call to find_loc(), but if we are in the second case, where the node to delete has both subtrees, we find the rightmost node, starting with the node to delete as the parent (it is the parent of its left child, after all):
void delete(stree *t, int val)
{
  stree parent = 0;
  stree *loc = find_loc(t, val, &parent);
  if (*loc) {
    stree t = *loc;
    if (!(t->left && t->right)) {
      *loc = t->left ? t->left : t->right;
      // if there was a subtree, update its parent
      if (*loc) (*loc)->parent = parent;
      free(t);
    } else {
      parent = t; // t is t->left's parent
      stree *rm_ref = rightmost(&t->left, &parent);
      stree rm = *rm_ref;
      t->value = rm->value;
      *rm_ref = rm->left;
      // if there was a subtree, update its parent
      if (*rm_ref) (*rm_ref)->parent = parent;
      free(rm);
    }
  }
}

You can avoid the checks for empty children, before you set the parent pointer, by having a dummy node for empty trees. You can share the same node with all empty trees because you will never need to take the parent of an empty tree. You will just need to replace the tests for NULL empty trees with tests for whether a tree points to the dummy.

With parent pointers in place, we can traverse the tree—and this time get a proper "recursion" where we can add the right parentheses for printing the tree structure. We keep track of which direction we are moving, down or up the tree, and by comparing the parent's left child with the tree itself, we can determine whether we are returning from the left or the right subtree, and thus whether we should now go right or keep going up.
#define left_child(t) \
  ((t)->parent && (t)->parent->left == (t))
void parent_traverse(stree t)
{
  enum { DOWN, UP } state = DOWN;
  while (t) {
    switch (state) {
      case DOWN:
        // Recurse as far left as we can...
        while (t->left) { putchar('('); t = t->left; }
        // Emit the leaf we find there
        printf("(,%d,", t->value); // VISIT
        // Then go right, if we can, or up if we can't.
        if (t->right) { t = t->right; }
        else          { putchar(')'); state = UP; }
        break;
      case UP:
        if (!t->parent) return; // back at the root; we're done
        if (left_child(t)) {
          // Returning from a left child, we emit the parent
          t = t->parent;
          printf(",%d,", t->value); // VISIT
          // Then we go right if we can, or continue up
          // (t is already the parent) if we cannot.
          if (t->right) { t = t->right; state = DOWN; }
          else          { putchar(')'); }
        } else {
          // Returning from a right child just means going up
          putchar(')'); t = t->parent;
        }
        break;
    }
  }
}
Freeing the nodes in the same type of traversal is trivial. Do it when you return from a node in the virtual recursion:
void parent_free(stree t)
{
  struct node *p;
  enum { DOWN, UP } state = DOWN;
  while (t) {
    switch (state) {
      case DOWN:
        while (t->left) { t = t->left; }
        if (t->right)   { t = t->right; }
        else            { state = UP; }
        break;
      case UP:
        if (!t->parent) { free(t); return; }
        if (left_child(t)) {
          p = t->parent; free(t); t = p;
          if (t->right)
            { t = t->right; state = DOWN; }
        } else {
          p = t->parent; free(t); t = p;
        }
        break;
    }
  }
}

There are plenty more algorithms for traversing trees, with or without modifying them, and if you want to get more experience with trees and pointer manipulation, it is a good place to start. For the book, however, it is time to move on to the next topic.
