Sorting

Now let’s look at a slightly harder problem. The following table[8] shows the number of acres burned in forest fires in Canada from 1918 to 1987. What were the worst years?


Table 22. Acres Lost to Forest Fires in Canada (in thousands), 1918–1987

563

7590

1708

2142

3323

6197

1985

1316

1824

472

1346

6029

2670

2094

2464

1009

1475

856

3027

4271

3126

1115

2691

4253

1838

828

2403

742

1017

613

3185

2599

2227

896

975

1358

264

1375

2016

452

3292

538

1471

9313

864

470

2993

521

1144

2212

2212

2331

2616

2445

1927

808

1963

898

2764

2073

500

1740

8592

10856

2818

2284

1419

1328

1329

1479


One way to find out how much forest was destroyed in the N worst years is to sort the list and then take the last N values, as shown in the following code:

 def​ find_largest(n: int, L: list) -> list:
 """Return the n largest values in L in order from smallest to largest.
 
  >>> L = [3, 4, 7, -1, 2, 5]
  >>> find_largest(3, L)
  [4, 5, 7]
  """
 
  copy = sorted(L)
 return​ copy[-n:]

This algorithm is short, clean, and easy to understand, but it relies on a bit of black magic. How does function sorted (and also method list.sort) work? And how efficient are they?

It turns out that many sorting algorithms have been developed over the years, each with its own strengths and weaknesses. Broadly speaking, they can be divided into two categories: those that are simple but inefficient and those that are efficient but harder to understand and implement. We’ll examine two of the former kind. The rest rely on techniques that are more advanced; we’ll show you one of these, rewritten to use only material seen so far.

Both of the simple sorting algorithms keep track of two sections in the list being sorted. The section at the front contains values that are now in sorted order; the section at the back contains values that have yet to be sorted. Here is the main part of the invariant that we will use for our two simple sorts:

images/searchsort/sort_invariant.png

One of the two algorithms has an additional property in its invariant: the items in the sorted section must be smaller than all the items in the unknown section.

Both of these sorting algorithms will work their way through the list, making the sorted section one item longer on each iteration. We’ll see that there are two ways to do this. Here is an outline for our code:

 i = 0 ​# The index of the first unknown item in lst; lst[:i] is sorted
 while​ i != len(L):
 # Do something to incorporate L[i] into the sorted section
 
  i = i + 1

Most Python programmers would probably write the loop header as for i in range(len(L)) rather than incrementing i explicitly in the body of the loop. We’re doing the latter here to explicitly initialize i (to set up the loop invariant) and to show the increment separately from the work this particular algorithm is doing. The “do something…” part is where the two simple sorting algorithms will differ.

Selection Sort

Selection sort works by searching the unknown section for the smallest item and moving it to the index i. Here is our algorithm:

 i = 0 ​# The index of the first unknown item in lst
 
 # lst[:i] is sorted and those items are smaller than those in list[i:]
 while​ i != len(L):
 # Find the index of the smallest item in lst[i:]
 # Swap that smallest item with the item at index i
  i = i + 1

As you can probably guess from this description, selection sort works by repeatedly selecting the smallest item in the unsorted section and placing it just after the sorted section. This works because we are selecting the items in order. On the first iteration, i is 0, and lst[0:] is the entire list. That means that on the first iteration we select the smallest item and move it to the front. On the second iteration we select the second-smallest item and move it to the second spot, and so on:

images/searchsort/selection.png

In a file named sorts.py, we have started writing a selection sort function, partially in English, as shown in the following code:

 def​ selection_sort(L: list) -> None:
 """Reorder the items in L from smallest to largest.
 
  >>> L = [3, 4, 7, -1, 2, 5]
  >>> selection_sort(L)
  >>> L
  [-1, 2, 3, 4, 5, 7]
  """
 
  i = 0
 while​ i != len(L):
 # Find the index of the smallest item in L[i:]
 # Swap that smallest item with L[i]
  i = i + 1

We can replace the second comment with a single line of code.

 def​ selection_sort(L: list) -> None:
 """Reorder the items in L from smallest to largest.
 
  >>> L = [3, 4, 7, -1, 2, 5]
  >>> selection_sort(L)
  >>> L
  [-1, 2, 3, 4, 5, 7]
  """
 
  i = 0
 while​ i != len(L):
 # Find the index of the smallest item in L[i:]
  L[i], L[smallest] = L[smallest], L[i]
  i = i + 1

Now all that’s left is finding the index of the smallest item in L[i:]. This is complex enough that it’s worth putting it in a function of its own:

 def​ find_min(L: list, b: int) -> int:
 """Precondition: L[b:] is not empty.
  Return the index of the smallest value in L[b:].
 
  >>> find_min([3, -1, 7, 5], 0)
  1
  >>> find_min([3, -1, 7, 5], 1)
  1
  >>> find_min([3, -1, 7, 5], 2)
  3
  """
 
  smallest = b ​# The index of the smallest so far.
  i = b + 1
 while​ i != len(L):
 if​ L[i] < L[smallest]:
 # We found a smaller item at L[i].
  smallest = i
 
  i = i + 1
 
 return​ smallest
 
 def​ selection_sort(L: list) -> None:
 """Reorder the items in L from smallest to largest.
 
  >>> L = [3, 4, 7, -1, 2, 5]
  >>> selection_sort(L)
  >>> L
  [-1, 2, 3, 4, 5, 7]
  """
 
  i = 0
 while​ i != len(L):
  smallest = find_min(L, i)
  L[i], L[smallest] = L[smallest], L[i]
  i = i + 1

Function find_min examines each item in L[b:], keeping track of the index of the minimum item so far in variable smallest. Whenever it finds a smaller item, it updates smallest. (Because it is returning the index of the smallest value, it won’t work if L[b:] is empty; hence the precondition.)

This is complicated enough that a couple of doctests may not test enough. Here’s a list of test cases for sorting:

  • An empty list
  • A list of length 1
  • A list of length 2 (this is the shortest case where items can move)
  • An already-sorted list
  • A list with all the same values
  • A list with duplicates

Here are our expanded doctests:

 def​ selection_sort(L: list) -> None:
 """Reorder the items in L from smallest to largest.
 
  >>> L = [3, 4, 7, -1, 2, 5]
  >>> selection_sort(L)
  >>> L
  [-1, 2, 3, 4, 5, 7]
  >>> L = []
  >>> selection_sort(L)
  >>> L
  []
  >>> L = [1]
  >>> selection_sort(L)
  >>> L
  [1]
  >>> L = [2, 1]
  >>> selection_sort(L)
  >>> L
  [1, 2]
  >>> L = [1, 2]
  >>> selection_sort(L)
  >>> L
  [1, 2]
  >>> L = [3, 3, 3]
  >>> selection_sort(L)
  >>> L
  [3, 3, 3]
  >>> L = [-5, 3, 0, 3, -6, 2, 1, 1]
  >>> selection_sort(L)
  >>> L
  [-6, -5, 0, 1, 1, 2, 3, 3]
  """
 
  i = 0
 
 while​ i != len(L):
  smallest = find_min(L, i)
  L[i], L[smallest] = L[smallest], L[i]
  i = i + 1

As with binary search, the doctest is so long that, as documentation for the function, it obscures rather than helps clarify. Again, we’ll see how to fix this in Chapter 15, Testing and Debugging.

Insertion Sort

Like selection sort, insertion sort keeps a sorted section at the beginning of the list. Rather than scan all of the unsorted section for the next smallest item, though, it takes the next item from the unsorted section—the one at index i—and inserts it where it belongs in the sorted section, increasing the size of the sorted section by one.

 i = 0 ​# The index of the first unknown item in lst; lst[:i] is sorted
 
 while​ i != len(L):
 # Move the item at index i to where it belongs in lst[:i + 1]
 
  i = i + 1

The reason why we use lst[i + 1] is because the item at index i may be larger than everything in the sorted section, and if that is the case then the current item won’t move.

In outline, this is as follows (save this in sorts.py as well):

 def​ insertion_sort(L: list) -> None:
 """Reorder the items in L from smallest to largest.
 
  >>> L = [3, 4, 7, -1, 2, 5]
  >>> insertion_sort(L)
  >>> L
  [-1, 2, 3, 4, 5, 7]
  """
 
  i = 0
 while​ i != len(L):
 # Insert L[i] where it belongs in L[0:i+1].
  i = i + 1

This is the same approach as selection sort; the difference is the comment in the loop. Like we did with selection sort, we’ll write a helper function to do the work:

 def​ insert(L: list, b: int) -> None:
 """Precondition: L[0:b] is already sorted.
  Insert L[b] where it belongs in L[0:b + 1].
 
  >>> L = [3, 4, -1, 7, 2, 5]
  >>> insert(L, 2)
  >>> L
  [-1, 3, 4, 7, 2, 5]
  >>> insert(L, 4)
  >>> L
  [-1, 2, 3, 4, 7, 5]
  """
 
 # Find where to insert L[b] by searching backwards from L[b]
 # for a smaller item.
  i = b
 while​ i != 0 ​and​ L[i - 1] >= L[b]:
  i = i - 1
 
 # Move L[b] to index i, shifting the following values to the right.
  value = L[b]
 del​ L[b]
  L.insert(i, value)
 
 def​ insertion_sort(L: list) -> None:
 """Reorder the items in L from smallest to largest.
 
  >>> L = [3, 4, 7, -1, 2, 5]
  >>> insertion_sort(L)
  >>> L
  [-1, 2, 3, 4, 5, 7]
  """
 
  i = 0
 
 while​ i != len(L):
  insert(L, i)
  i = i + 1

How does insert work? It works by finding out where L[b] belongs and then moving it. Where does it belong? It belongs after every value less than or equal to it and before every value that is greater than it. We need the check i != 0 in case L[b] is smaller than every value in L[0:b], which will place the current item at the beginning of the list. This passes all the tests we wrote earlier for selection sort. The following illustrates the process:

images/searchsort/insertion.png

Performance

We now have two sorting algorithms. Which should we use? Because both are not too difficult to understand, it’s reasonable to decide based on how fast they are.

It’s easy enough to write a program to compare their running times, along with that for list.sort:

 import​ time
 import​ random
 from​ sorts ​import​ selection_sort
 from​ sorts ​import​ insertion_sort
 
 def​ built_in(L: list) -> None:
 """Call list.sort --- we need our own function to do this so that we can
  treat it as we treat our own sorts.
  """
 
  L.sort()
 
 def​ print_times(L: list) -> None:
 """Print the number of milliseconds it takes for selection sort, insertion
  sort, and list.sort to run.
  """
 
 print​(len(L), end=​'​​ ​​'​)
 for​ func ​in​ (selection_sort, insertion_sort, built_in):
 if​ func ​in​ (selection_sort, insertion_sort) ​and​ len(L) > 10000:
 continue
 
  L_copy = L[:]
  t1 = time.perf_counter()
  func(L_copy)
  t2 = time.perf_counter()
 print​(​"{0:7.1f}"​.format((t2 - t1) * 1000.), end=​'​​ ​​'​)
 
 print​() ​# Print a newline.
 
 for​ list_size ​in​ [10, 1000, 2000, 3000, 4000, 5000, 10000]:
  L = list(range(list_size))
  random.shuffle(L)
  print_times(L)

The results are shown in Table 23, Running Times for Selection, Insertion, and list.sort (in milliseconds).


Table 23. Running Times for Selection, Insertion, and list.sort (in milliseconds)

List Length

Selection Sort

Insertion Sort

list.sort

1000

148

64

0.3

2000

583

268

0.6

3000

1317

594

0.9

4000

2337

1055

1.3

5000

3699

1666

1.6

10000

14574

6550

3.5


Something is very clearly wrong, because our sorting functions are thousands of times slower than the built-in function. What’s more, the time required by our routines is growing faster than the size of the data. On a thousand items, for example, selection sort takes about 0.15 milliseconds per item, but on ten thousand items, it needs about 1.45 milliseconds per item—far more than a tenfold increase! What is going on?

To answer this, we examine what happens in the inner loops of our two algorithms. On the first iteration of selection sort, the inner loop examines every element to find the smallest. On the second iteration, it looks at all but one; on the third, it looks at all but two, and so on.

If there are N items in the list, then the number of iterations of the inner loop, in total, is roughly N + (N - 1) + (N - 2) + … + 1, or N(N + 1)/2. Putting it another way, the number of steps required to sort N items is roughly proportional to N2 + N. For large values of N, we can ignore the second term and say that the time needed by selection sort grows as the square of the number of values being sorted. And indeed, examining the timing data further shows that doubling the size of the list increases the running time by four.

The same analysis can be used for insertion sort, since it also examines one element on the first iteration, two on the second, and so on. (It’s just examining the already sorted values rather than the unsorted values.)

So why is insertion sort slightly faster? The reason is that, on average, only half of the values need to be scanned in order to find the location in which to insert the new value, while with selection sort, every value in the unsorted section needs to be examined in order to select the smallest one. But, wow, list.sort is so much faster!

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.148.144.228