CHAPTER 7

Searching

The preceding chapter explained how you can sort data. Algorithms such as quicksort and heapsort let you sort fairly large amounts of data quickly. Algorithms such as countingsort and bucketsort let you sort data almost as quickly as a program can examine it but only under certain special circumstances.

One of the advantages of sorted data is that it lets you find specific items relatively quickly. For example, you can locate a particular person in a telephone directory containing tens of thousands of names in just a minute or two because all the names are arranged in sorted order. (Imagine trying to find a name if the directory wasn't sorted!)

This chapter explains algorithms that you can use to find a particular piece of data in a sorted array.

NOTE The algorithms described in this chapter work with simple arrays, not more specialized data structures. Specialized data structures such as trees also let you quickly find an item with a specific value. Algorithms for working with trees are discussed in Chapter 10.

NOTE Some programming libraries include searching tools that locate items in a sorted array. For example, the .NET Framework's Array class provides a BinarySearch method. These methods generally are fast, so in practice you may want to use those tools to save time writing and debugging the sorting code.

It's still important to understand how searching algorithms work, however, because sometimes you can do even better than the tools. For example, interpolation search is much faster than binary search when it is applicable.

Linear Search

As you may be able to guess, a linear or exhaustive search simply loops through the items in the array, looking for the target item. Figure 7-1 shows a linear search for the value 77.

images

Figure 7-1: A linear search examines every item in the array until it finds the target item.

Unlike binary search and interpolation search, linear search works on linked lists, where you cannot easily jump from one part of the list to another, as you can in an array.

Linear search also works on unsorted lists. But if the items are sorted, the algorithm can stop if it ever comes to an item with a value greater than the target value.. That lets the algorithm stop early and save a little time if the target value isn't in the list.

The following pseudocode shows the linear search algorithm for an array:

// Find the target item's index in the sorted array. // If the item isn't in the array, return −1. Integer: LinearSearch(Data values[], Data target) For i = 0 To <length of values> - 1 // See if this is the target. If (values[i] == target) Then Return i // See if we have passed where the target would be. If (values[i] > target) Then Return −1 Next i // If we get here, the target is not in the array. Return −1 End LinearSearch

This algorithm may need to loop through the entire array to conclude that an item isn't there, so its worst-case behavior is O(N).

Even in the average case, the algorithm's run time is O(N). If you add up the number of steps required to search for every item in the array, you get 1 + 2 + 3 + ... + N = N * (N + 1) / 2. If you divide that total by N to get the average search time for all the N items, you get (N + 1) / 2, which is still O(N).

This algorithm is much slower than binary search or interpolation search, but it has the advantage that it works on linked lists and unsorted lists.

Binary Search

In a binary search, the algorithm keeps track of the largest and smallest indices the target item might have in the array. Initially those bounds (call them min and max) are 0 and the largest index in the array.

The algorithm then calculates the index halfway between min and max (call it mid). If the target is less than the array's value at mid, the algorithm resets max to search the left half of the array and starts over. If the target is greater than the array's value at mid, the algorithm resets min to search the right half of the array and starts over. If the target equals the array's value at mid, the algorithm returns the index mid.

Figure 7-2 shows a binary search for the value 77.

images

Figure 7-2: A binary search repeatedly divides the part of the array that might contain the target item into two halves and then searches the appropriate half.

The following pseudocode shows the algorithm:

// Find the target item's index in the sorted array. // If the item isn't in the array, return −1. Integer: BinarySearch(Data values[], Data target) Integer: min = 0 Integer: max = <length of values> - 1 While (min <= max) // Find the dividing item. Integer: mid = (min + max) / 2 // See if we need to search the left or right half. If (target < values[mid]) Then max = mid - 1 Else If (target > values[mid]) Then min = mid + 1 Else Return mid End While // If we get here, the target is not in the array. Return −1 End BinarySearch

At each step, this algorithm divides in half the number of items that might contain the target. If the array contains N items, after O(log N) steps, the section of the array that might hold the target contains only one item, so the algorithm either finds the item or concludes that it isn't in the array. That means the algorithm has O(log N) run time.

Interpolation Search

At every step, binary search examines the item in the middle of the section of the array that it is considering. In contrast, interpolation search uses the value of the target item to guess where in the array it might lie and achieve much faster search times.

For example, suppose the array contains 1,000 items with values between 1 and 100. If the target value is 30, it lies about 30 percent of the way from the smallest to the largest value so you can guess that the item may be somewhere near index 300. Depending on the distribution of the numbers in the array, this may not be exactly correct, but it should get you fairly close to the target item's position.

Figure 7-3 shows an interpolation search for the value 77.

images

Figure 7-3: Interpolation search uses the target item's value to calculate where it should be in the remaining part of the array.

The following pseudocode shows the algorithm at a high level:

Integer: InterpolationSearch(Data values[], Data target) Integer: min = 0 Integer: max = values.Length - 1 While (min <= max) // Find the dividing item. Integer: mid = min + (max - min) * (target - values[min]) / (values[max] - values[min]) If (values[mid] == target) Then Return mid <Set min or max to search the left or right half.> End While Return −1 End InterpolationSearch

This high-level description leaves a couple of problems unsolved. The mid calculation can result in an overflow or a value of mid that is not between min and max. Solving those problems is left as part of Exercise 6.

The trickiest part of this algorithm is the statement that calculates mid. The value is set to the current value of min plus the distance between min and max when scaled by the expected fraction of the distance between values[min] and values[max] where target should lie.

For example, if values[min] is 100, values[max] is 200, and target is 125, then you would use the following calculation to decide where to look for the target value:

(target - values[min]) / (values[max] - values[min]) = (125 - 100) / (200 - 100) = 25 / 100 = 0.25

That puts the new value for mid one quarter of the way from min to max. In the worst case, if the data is extremely unevenly distributed, and you're looking for the worst possible target value, this algorithm has O(N) performance. If the distribution is reasonably uniform, the expected performance is O(log(log N)). (But proving that is outside the scope of this book.)

Summary

Table 7-1 shows the values of log N and log(log N) for different values of N so that you can compare the speeds of linear search, binary search, and interpolation search.

Table 7-1: Algorithm Characteristics

images

Linear search is useful only for relatively small arrays. Table 7-1 shows that binary search works well even for very large arrays. It can search an array containing 1 trillion items in only about 40 steps.

Interpolation search works well for arrays of any size that you can reasonably fit on a computer. It can search an array containing 1 trillion items in only about five steps. In fact, an array would need to hold more than 1×10154 items before interpolation search would require an expected number of steps greater than nine.

However, the exact number of steps for interpolation search depends on the distribution of the values. Sometimes the algorithm gets lucky and finds the target in one or two steps, and other times it might need four or five. On average, however, it is extremely fast.

Exercises

If you don't know what recursion is yet, skip exercises 2, 5, and 7 and come back to them after you read Chapter 8, which introduces recursion.

  1. Write a program that implements linear search.
  2. Write a program that implements linear search recursively. Does this version have any advantages or disadvantages compared to the nonrecursive version?
  3. Write a program that implements linear search with sorted linked lists.
  4. Write a program that implements binary search.
  5. Write a program that implements binary search recursively. Does this version have any advantages or disadvantages compared to the nonrecursive version?
  6. Write a program that implements interpolation search.
  7. Write a program that implements interpolation search recursively. Does this version have any advantages or disadvantages compared to the nonrecursive version?
  8. Which sorting algorithm described in Chapter 6 uses a technique reminiscent of the technique used by interpolation search?
  9. If an array contains duplicates, the binary search and interpolation search algorithms described in this chapter don't guarantee that they return the first instance of the target item. How could you modify them to return the first occurrence of the target item? What is the run time for the modified version?
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.93.141