Sorting

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Sorting

Now let’s look at a slightly harder problem. The following table^[8] shows the number of acres burned in forest fires in Canada from 1918 to 1987. What were the worst years?

Table 22. Acres Lost to Forest Fires in Canada (in thousands), 1918–1987

563	7590	1708	2142	3323	6197	1985	1316	1824	472
1346	6029	2670	2094	2464	1009	1475	856	3027	4271
3126	1115	2691	4253	1838	828	2403	742	1017	613
3185	2599	2227	896	975	1358	264	1375	2016	452
3292	538	1471	9313	864	470	2993	521	1144	2212
2212	2331	2616	2445	1927	808	1963	898	2764	2073
500	1740	8592	10856	2818	2284	1419	1328	1329	1479

One way to find out how much forest was destroyed in the N worst years is to sort the list and then take the last N values, as shown in the following code:

	def find_largest(n: int, L: list) -> list:
	"""Return the n largest values in L in order from smallest to largest.

	>>> L = [3, 4, 7, -1, 2, 5]
	>>> find_largest(3, L)
	[4, 5, 7]
	"""

	copy = sorted(L)
	return copy[-n:]

This algorithm is short, clean, and easy to understand, but it relies on a bit of black magic. How does function sorted (and also method list.sort) work? And how efficient are they?

It turns out that many sorting algorithms have been developed over the years, each with its own strengths and weaknesses. Broadly speaking, they can be divided into two categories: those that are simple but inefficient and those that are efficient but harder to understand and implement. We’ll examine two of the former kind. The rest rely on techniques that are more advanced; we’ll show you one of these, rewritten to use only material seen so far.

Both of the simple sorting algorithms keep track of two sections in the list being sorted. The section at the front contains values that are now in sorted order; the section at the back contains values that have yet to be sorted. Here is the main part of the invariant that we will use for our two simple sorts:

One of the two algorithms has an additional property in its invariant: the items in the sorted section must be smaller than all the items in the unknown section.

Both of these sorting algorithms will work their way through the list, making the sorted section one item longer on each iteration. We’ll see that there are two ways to do this. Here is an outline for our code:

	i = 0 # The index of the first unknown item in lst; lst[:i] is sorted
	while i != len(L):
	# Do something to incorporate L[i] into the sorted section

	i = i + 1

Most Python programmers would probably write the loop header as for i in range(len(L)) rather than incrementing i explicitly in the body of the loop. We’re doing the latter here to explicitly initialize i (to set up the loop invariant) and to show the increment separately from the work this particular algorithm is doing. The “do something…” part is where the two simple sorting algorithms will differ.

Selection Sort

Selection sort works by searching the unknown section for the smallest item and moving it to the index i. Here is our algorithm:

	i = 0 # The index of the first unknown item in lst

	# lst[:i] is sorted and those items are smaller than those in list[i:]
	while i != len(L):
	# Find the index of the smallest item in lst[i:]
	# Swap that smallest item with the item at index i
	i = i + 1

As you can probably guess from this description, selection sort works by repeatedly selecting the smallest item in the unsorted section and placing it just after the sorted section. This works because we are selecting the items in order. On the first iteration, i is 0, and lst[0:] is the entire list. That means that on the first iteration we select the smallest item and move it to the front. On the second iteration we select the second-smallest item and move it to the second spot, and so on:

In a file named sorts.py, we have started writing a selection sort function, partially in English, as shown in the following code:

	def selection_sort(L: list) -> None:
	"""Reorder the items in L from smallest to largest.

	>>> L = [3, 4, 7, -1, 2, 5]
	>>> selection_sort(L)
	>>> L
	[-1, 2, 3, 4, 5, 7]
	"""

	i = 0
	while i != len(L):
	# Find the index of the smallest item in L[i:]
	# Swap that smallest item with L[i]
	i = i + 1

We can replace the second comment with a single line of code.

	def selection_sort(L: list) -> None:
	"""Reorder the items in L from smallest to largest.

	>>> L = [3, 4, 7, -1, 2, 5]
	>>> selection_sort(L)
	>>> L
	[-1, 2, 3, 4, 5, 7]
	"""

	i = 0
	while i != len(L):
	# Find the index of the smallest item in L[i:]
	L[i], L[smallest] = L[smallest], L[i]
	i = i + 1

Now all that’s left is finding the index of the smallest item in L[i:]. This is complex enough that it’s worth putting it in a function of its own:

	def find_min(L: list, b: int) -> int:
	"""Precondition: L[b:] is not empty.
	Return the index of the smallest value in L[b:].

	>>> find_min([3, -1, 7, 5], 0)
	1
	>>> find_min([3, -1, 7, 5], 1)
	1
	>>> find_min([3, -1, 7, 5], 2)
	3
	"""

	smallest = b # The index of the smallest so far.
	i = b + 1
	while i != len(L):
	if L[i] < L[smallest]:
	# We found a smaller item at L[i].
	smallest = i

	i = i + 1

	return smallest

	def selection_sort(L: list) -> None:
	"""Reorder the items in L from smallest to largest.

	>>> L = [3, 4, 7, -1, 2, 5]
	>>> selection_sort(L)
	>>> L
	[-1, 2, 3, 4, 5, 7]
	"""

	i = 0
	while i != len(L):
	smallest = find_min(L, i)
	L[i], L[smallest] = L[smallest], L[i]
	i = i + 1

Function find_min examines each item in L[b:], keeping track of the index of the minimum item so far in variable smallest. Whenever it finds a smaller item, it updates smallest. (Because it is returning the index of the smallest value, it won’t work if L[b:] is empty; hence the precondition.)

This is complicated enough that a couple of doctests may not test enough. Here’s a list of test cases for sorting:

An empty list
A list of length 1
A list of length 2 (this is the shortest case where items can move)
An already-sorted list
A list with all the same values
A list with duplicates

Here are our expanded doctests:

	def selection_sort(L: list) -> None:
	"""Reorder the items in L from smallest to largest.

	>>> L = [3, 4, 7, -1, 2, 5]
	>>> selection_sort(L)
	>>> L
	[-1, 2, 3, 4, 5, 7]
	>>> L = []
	>>> selection_sort(L)
	>>> L
	[]
	>>> L = [1]
	>>> selection_sort(L)
	>>> L
	[1]
	>>> L = [2, 1]
	>>> selection_sort(L)
	>>> L
	[1, 2]
	>>> L = [1, 2]
	>>> selection_sort(L)
	>>> L
	[1, 2]
	>>> L = [3, 3, 3]
	>>> selection_sort(L)
	>>> L
	[3, 3, 3]
	>>> L = [-5, 3, 0, 3, -6, 2, 1, 1]
	>>> selection_sort(L)
	>>> L
	[-6, -5, 0, 1, 1, 2, 3, 3]
	"""

	i = 0

	while i != len(L):
	smallest = find_min(L, i)
	L[i], L[smallest] = L[smallest], L[i]
	i = i + 1

As with binary search, the doctest is so long that, as documentation for the function, it obscures rather than helps clarify. Again, we’ll see how to fix this in Chapter 15, Testing and Debugging.

Insertion Sort

Like selection sort, insertion sort keeps a sorted section at the beginning of the list. Rather than scan all of the unsorted section for the next smallest item, though, it takes the next item from the unsorted section—the one at index i—and inserts it where it belongs in the sorted section, increasing the size of the sorted section by one.

	i = 0 # The index of the first unknown item in lst; lst[:i] is sorted

	while i != len(L):
	# Move the item at index i to where it belongs in lst[:i + 1]

	i = i + 1

The reason why we use lst[i + 1] is because the item at index i may be larger than everything in the sorted section, and if that is the case then the current item won’t move.

In outline, this is as follows (save this in sorts.py as well):

	def insertion_sort(L: list) -> None:
	"""Reorder the items in L from smallest to largest.

	>>> L = [3, 4, 7, -1, 2, 5]
	>>> insertion_sort(L)
	>>> L
	[-1, 2, 3, 4, 5, 7]
	"""

	i = 0
	while i != len(L):
	# Insert L[i] where it belongs in L[0:i+1].
	i = i + 1

This is the same approach as selection sort; the difference is the comment in the loop. Like we did with selection sort, we’ll write a helper function to do the work:

	def insert(L: list, b: int) -> None:
	"""Precondition: L[0:b] is already sorted.
	Insert L[b] where it belongs in L[0:b + 1].

	>>> L = [3, 4, -1, 7, 2, 5]
	>>> insert(L, 2)
	>>> L
	[-1, 3, 4, 7, 2, 5]
	>>> insert(L, 4)
	>>> L
	[-1, 2, 3, 4, 7, 5]
	"""

	# Find where to insert L[b] by searching backwards from L[b]
	# for a smaller item.
	i = b
	while i != 0 and L[i - 1] >= L[b]:
	i = i - 1

	# Move L[b] to index i, shifting the following values to the right.
	value = L[b]
	del L[b]
	L.insert(i, value)

	def insertion_sort(L: list) -> None:
	"""Reorder the items in L from smallest to largest.

	>>> L = [3, 4, 7, -1, 2, 5]
	>>> insertion_sort(L)
	>>> L
	[-1, 2, 3, 4, 5, 7]
	"""

	i = 0

	while i != len(L):
	insert(L, i)
	i = i + 1

How does insert work? It works by finding out where L[b] belongs and then moving it. Where does it belong? It belongs after every value less than or equal to it and before every value that is greater than it. We need the check i != 0 in case L[b] is smaller than every value in L[0:b], which will place the current item at the beginning of the list. This passes all the tests we wrote earlier for selection sort. The following illustrates the process:

Performance

We now have two sorting algorithms. Which should we use? Because both are not too difficult to understand, it’s reasonable to decide based on how fast they are.

It’s easy enough to write a program to compare their running times, along with that for list.sort:

	import time
	import random
	from sorts import selection_sort
	from sorts import insertion_sort

	def built_in(L: list) -> None:
	"""Call list.sort --- we need our own function to do this so that we can
	treat it as we treat our own sorts.
	"""

	L.sort()

	def print_times(L: list) -> None:
	"""Print the number of milliseconds it takes for selection sort, insertion
	sort, and list.sort to run.
	"""

	print(len(L), end=' ')
	for func in (selection_sort, insertion_sort, built_in):
	if func in (selection_sort, insertion_sort) and len(L) > 10000:
	continue

	L_copy = L[:]
	t1 = time.perf_counter()
	func(L_copy)
	t2 = time.perf_counter()
	print("{0:7.1f}".format((t2 - t1) * 1000.), end=' ')

	print() # Print a newline.

	for list_size in [10, 1000, 2000, 3000, 4000, 5000, 10000]:
	L = list(range(list_size))
	random.shuffle(L)
	print_times(L)

The results are shown in Table 23, Running Times for Selection, Insertion, and list.sort (in milliseconds).

Table 23. Running Times for Selection, Insertion, and list.sort (in milliseconds)

List Length	Selection Sort	Insertion Sort	list.sort
1000	148	64	0.3
2000	583	268	0.6
3000	1317	594	0.9
4000	2337	1055	1.3
5000	3699	1666	1.6
10000	14574	6550	3.5

Something is very clearly wrong, because our sorting functions are thousands of times slower than the built-in function. What’s more, the time required by our routines is growing faster than the size of the data. On a thousand items, for example, selection sort takes about 0.15 milliseconds per item, but on ten thousand items, it needs about 1.45 milliseconds per item—far more than a tenfold increase! What is going on?

To answer this, we examine what happens in the inner loops of our two algorithms. On the first iteration of selection sort, the inner loop examines every element to find the smallest. On the second iteration, it looks at all but one; on the third, it looks at all but two, and so on.

If there are N items in the list, then the number of iterations of the inner loop, in total, is roughly N + (N - 1) + (N - 2) + … + 1, or N(N + 1)/2. Putting it another way, the number of steps required to sort N items is roughly proportional to N² + N. For large values of N, we can ignore the second term and say that the time needed by selection sort grows as the square of the number of values being sorted. And indeed, examining the timing data further shows that doubling the size of the list increases the running time by four.

The same analysis can be used for insertion sort, since it also examines one element on the first iteration, two on the second, and so on. (It’s just examining the already sorted values rather than the unsorted values.)

So why is insertion sort slightly faster? The reason is that, on average, only half of the values need to be scanned in order to find the location in which to insert the new value, while with selection sort, every value in the unsorted section needs to be examined in order to select the smallest one. But, wow, list.sort is so much faster!

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Sorting

Create new playlist

Sign In

Sign Up

Sorting

Selection Sort

Insertion Sort

Performance

Table of Contents for
Sorting