Binary Search

Consider a list of 1 million sorted values. Linear search starts at the beginning of the list and asks, “Is this value what I’m looking for?” If it isn’t, the same is asked about the second value, and then the third. Up to 1 million questions are asked. This algorithm doesn’t take advantage of the list being sorted.

Here’s a new algorithm, called binary search, that relies on the list being sorted: look at the middle value and ask, “Is this value bigger than or smaller than the one I’m looking for?” With that one question, we can eliminate 500,000 values! That leaves a list of 500,000 values to search. We’ll do it again: look at the middle value, ask the same question, and eliminate another 250,000 values. We have eliminated 3/4 of the list with only two questions! Asking only 20 questions, we can locate a particular value in a list of 1 million sorted values.

To figure out how fast it is, we’ll think about how big a list we can search with a certain number of questions. With only one question, we can determine whether a list of length 1 contains a value. With two questions, we can search a list of length 2. With three questions, we can search a list of length 4. Four questions, length 8. Five questions, length 16. Every time we get to ask another question, we can search a list twice as big.

Using logarithmic notation, N sorted values can be searched in ceiling(log2 N) steps, where ceiling() is the ceiling function that rounds a value up to the nearest integer. As shown in Table 20, this increases much less quickly than the time needed for linear search.


Table 20. Logarithmic Growth

Searching N Items

Worst Case—Linear Search

Worst Case—Binary Search

100

100

7

1000

1000

10

10,000

10,000

14

100,000

100,000

17

1,000,000

1,000,000

20

10,000,000

10,000,000

24


The key to binary search is to keep track of three parts of the list: the left part, which contains values that are smaller than the value we are searching for; the right part, which contains values that are equal to or larger than the value we are searching for; and the middle part, which contains values that we haven’t yet examined—the unknown section. If there are duplicate values, we will return the index of the leftmost one, which is why the “equal to” section belongs on the right.

We’ll use two variables to keep track of the boundaries: i will mark the index of the first unknown value, and j will mark the index of the last unknown value:

images/searchsort/binarysearch_invariant.png

At the beginning of the algorithm, the unknown section makes up the entire list, so we will set i to 0 and j to the length of the list minus one as shown in the figure.

images/searchsort/binarysearch_invariant_start.png

We are done when that unknown section is empty—when we’ve examined every item in the list. This happens when i == j + 1—when the values cross. (When i == j, there is still one item left in the unknown section.) Here is a picture of what the values are when the unknown section is empty:

images/searchsort/binarysearch_invariant_finish.png

To make progress, we will set either i or j to near the middle of the range between them. Let’s call this index m, which is at (i + j) // 2. (Notice the use of integer division: we are calculating an index, so we need an integer.)

Think for a moment about the value at m. If it is less than v, we need to move i up, while if it is greater than v, we should move j down. But where exactly do we move them?

When we move i up, we don’t want to set it to the midpoint exactly, because L[m] isn’t included in the range; instead, we set it to one past the middle—in other words, to m + 1.

images/searchsort/binarysearch_invariant_start2.png

Similarly, when we move j down, we move it to m - 1:

images/searchsort/binarysearch_invariant_start3.png

The completed function is as follows:

 from​ typing ​import​ Any
 
 def​ binary_search(L: list, v: Any) -> int:
 """Return the index of the first occurrence of value in L, or return
  -1 if value is not in L.
 
  >>> binary_search([1, 3, 4, 4, 5, 7, 9, 10], 1)
  0
  >>> binary_search([1, 3, 4, 4, 5, 7, 9, 10], 4)
  2
  >>> binary_search([1, 3, 4, 4, 5, 7, 9, 10], 5)
  4
  >>> binary_search([1, 3, 4, 4, 5, 7, 9, 10], 10)
  7
  >>> binary_search([1, 3, 4, 4, 5, 7, 9, 10], -3)
  -1
  >>> binary_search([1, 3, 4, 4, 5, 7, 9, 10], 11)
  -1
  >>> binary_search([1, 3, 4, 4, 5, 7, 9, 10], 2)
  -1
  >>> binary_search([], -3)
  -1
  >>> binary_search([1], 1)
  0
  """
 
 # Mark the left and right indices of the unknown section.
  i = 0
  j = len(L) - 1
 
 while​ i != j + 1:
  m = (i + j) // 2
 if​ L[m] < v:
  i = m + 1
 else​:
  j = m - 1
 
 if​ 0 <= i < len(L) ​and​ L[i] == v:
 return​ i
 else​:
 return​ -1
 
 if​ __name__ == ​'__main__'​:
 import​ doctest
  doctest.testmod()

There are a lot of tests because the algorithm is quite complicated and we wanted to test pretty thoroughly. Our tests cover these cases:

  • The value is the first item.
  • The value occurs twice. We want the index of the first one.
  • The value is in the middle of the list.
  • The value is the last item.
  • The value is smaller than everything in the list.
  • The value is larger than everything in the list.
  • The value isn’t in the list, but it is larger than some and smaller than others.
  • The list has no items.
  • The list has one item.

In Chapter 15, Testing and Debugging, you’ll learn a different testing framework that allows you to write tests in a separate Python file (thus making docstrings shorter and easier to read; only a couple of examples are necessary), and you’ll learn strategies for coming up with your own test cases.

Binary Search Running Time

Binary search is much more complicated to write and understand than linear search. Is it fast enough to make the extra effort worthwhile? To find out, we can compare it to list.index. As before, we search for the first, middle, and last items in a list with about ten million elements as shown in Table 21.


Table 21. Running Times for Binary Search

Case

list.index

binary_search

Ratio

First

0.007

0.02

0.32

Middle

105

0.02

5910

Last

211

0.02 (Wow!)

11661


The results are impressive. Binary search is up to several thousand times faster than its linear counterpart when searching ten million items. Most importantly, if we double the number of items, binary search takes only one more iteration, whereas the time for list.index nearly doubles.

Note also that although the time taken for linear search grows in step with the index of the item found, there is no such pattern for binary search. No matter where the item is, it takes the same number of steps.

Built-In Binary Search

The Python standard library’s bisect module includes binary search functions that are slightly faster than our binary search. Function bisect_left returns the index where an item should be inserted in a list to keep it in sorted order, assuming it is sorted to begin with. insort_left actually does the insertion.

The word left in the name signals that these functions find the leftmost (lowest index) position where they can do their jobs; the complementary functions bisect_right and insort_right find the rightmost.

There is one minor drawback to binary search: the algorithm assumes that the list is sorted, and sorting is time and memory intensive. We’ll look at that next.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.119.160.181