Chapter 5. Searching and Sorting

Introduction

Credit: Tim Peters, PythonLabs

Computer manufacturers of the 1960s estimated that more than 25 percent of the running time on their computers was spent on sorting, when all their customers were taken into account. In fact, there were many installations in which the task of sorting was responsible for more than half of the computing time. From these statistics we may conclude that either (i) there are many important applications of sorting, or (ii) many people sort when they shouldn’t, or (iii) inefficient sorting algorithms have been in common use.

Donald Knuth

The Art of Computer Programming, vol. 3, Sorting and Searching, page 3

Professor Knuth’s masterful work on the topics of sorting and searching spans nearly 800 pages of sophisticated technical text. In Python practice, we reduce it to two imperatives (we read Knuth so you don’t have to):

  • When you need to sort, find a way to use the built-in sort method of Python lists.

  • When you need to search, find a way to use built-in dictionaries.

Many recipes in this chapter illustrate these principles. The most common theme is using the decorate-sort-undecorate (DSU) pattern, a general approach to transforming a sorting problem by creating an auxiliary list that we can then sort with the default, speedy sort method. This technique is the single most useful one to take from this chapter. In fact, DSU is so useful that Python 2.4 introduced new features to make it easier to apply. Many recipes can be made simpler in 2.4 as a result, and the discussions of older recipes have been updated to show how.

DSU relies on an unusual feature of Python’s built-in comparisons: sequences are compared lexicographically. Lexicographical order is a generalization to tuples and lists of the everyday rules used to compare strings (e.g., alphabetical order). The built-in cmp(s1, s2), when s1 and s2 are sequences, is equivalent to this Python code:

def lexcmp(s1, s2):
     # Find leftmost nonequal pair.
     i = 0
     while i < len(s1) and i < len(s2):
         outcome = cmp(s1[i], s2[i])
         if outcome:
             return outcome
         i += 1
     # All equal, until at least one sequence was exhausted.
     return cmp(len(s1), len(s2))

This code looks for the first unequal corresponding elements. If such an unequal pair is found, that pair determines the outcome. Otherwise, if one sequence is a proper prefix of the other, the prefix is considered to be the smaller sequence. Finally, if these cases don’t apply, the sequences are identical and are considered equal. Here are some examples:

>>> cmp((1, 2, 3), (1, 2, 3))  # identical
0
>>> cmp((1, 2, 3), (1, 2))     # first larger because second is a prefix
1
>>> cmp((1, 100), (2, 1))      # first smaller because 1<2
-1
>>> cmp((1, 2), (1, 3))        # first smaller because 1==1, then 2<3
-1

An immediate consequence of lexicographical comparison is that if you want to sort a list of objects by a primary key, breaking ties by comparing a secondary key, you can simply build a list of tuples, in which each tuple contains the primary key, secondary key, and original object, in that order. Because tuples are compared lexicographically, this automatically does the right thing. When comparing tuples, the primary keys are compared first, and if (and only if) the primary keys are equal, the secondary keys are compared.
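
For example, here is a minimal sketch (the records and names are hypothetical) of sorting by a primary key, breaking ties on a secondary key:

# sort records by last name, breaking ties by first name
people = [('Smith', 'Zoe', 35), ('Adams', 'John', 28), ('Smith', 'Ann', 40)]
aux = [ (p[0], p[1], p) for p in people ]    # decorate: (primary, secondary, record)
aux.sort( )                                  # lexicographic comparison of the tuples
people = [ p for last, first, p in aux ]     # undecorate
# people is now [('Adams', 'John', 28), ('Smith', 'Ann', 40), ('Smith', 'Zoe', 35)]

In Python 2.4, people.sort(key=lambda p: (p[0], p[1])) produces the same ordering without building the auxiliary list by hand.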

The examples of the DSU pattern in this chapter show many applications of this idea. The DSU technique applies to any number of keys. You can add to the tuples as many keys as you like, in the order in which you want the keys compared. In Python 2.4, you can get the same effect with the new key= optional argument to sort, as several recipes point out. Using the sort method’s key= argument is easier, more memory-efficient, and runs faster than building an auxiliary list of tuples by hand.

The other 2.4-introduced innovation in sorting is a convenient shortcut: a sorted built-in function that sorts any iterable, not in-place, but by first copying it into a new list. In Python 2.3 (apart from the new optional keyword arguments, which apply to the sorted built-in function as well as to list.sort), you can code the same functionality quite easily:

def sorted_2_3(iterable):
    alist = list(iterable)
    alist.sort( )
    return alist

Because copying a list and sorting it are both nontrivial operations, and the built-in sorted needs to perform those operations too, no speed advantage is gained in making sorted a built-in. Its advantage is just the convenience. Having something always around and available, rather than having to code even just four simple lines over and over, does make a difference in practice. On the other hand, few tiny functions are used commonly enough to justify expanding the set of built-ins. Python 2.4 added sorted and reversed because those two functions were requested very frequently over the years.

The biggest change in Python sorting since the first edition of this book is that Python 2.3 moved to a new implementation of sorting. The primary visible consequences are increased speed in many common cases, and the fact that the new sort is stable (meaning that when two elements compare equal in the original list, they retain their relative order in the sorted list). The new implementation was so successful, and the chances of improving on it appeared so slim, that Guido was persuaded to proclaim that Python’s list.sort method will always be stable. This guarantee started with Python 2.4 but was actually realized in Python 2.3. Still, the history of sorting cautions us that better methods may yet be discovered. A brief account of Python’s sorting history may be instructive in this regard.

A Short History of Python Sorting

In early releases of Python, list.sort used the qsort routine from the underlying platform's C library. This didn't work out for several reasons, primarily because the quality of qsort varied widely across machines. Some versions were extremely slow when given a list with many equal values or in reverse-sorted order. Some versions even dumped core because they weren't reentrant. A user-defined __cmp__ function can also invoke list.sort, so that one list.sort can invoke others as a side effect of comparing. Some platform qsort routines couldn't handle that. A user-defined __cmp__ function can also (if it's insane or malicious) mutate the list while it's being sorted, and many platform qsort routines dumped core when that happened.

Python then grew its own implementation of the quicksort algorithm. This was rewritten with every release, as real-life cases of unacceptable slowness were discovered. Quicksort is a delicate algorithm indeed!

In Python 1.5.2 the quicksort algorithm was replaced by a hybrid of samplesort and binary insertion sort, and that implementation remained unchanged for more than four years, until Python 2.3. Samplesort can be viewed as a variant of quicksort that uses a very large sample size to pick the partitioning element, also known as the pivot (it recursively samplesorts a large random subset of the elements and picks the median of those). This variant makes quadratic-time behavior almost impossible and brings the number of comparisons in the average case much closer to the theoretical minimum.

However, because samplesort is a complicated algorithm, it has too much administrative overhead for small lists. Therefore, small lists (and small slices resulting from samplesort partitioning) were handled by a separate binary insertion sort, which is an ordinary insertion sort, except that it uses binary search to determine where each new element belongs. Most sorting texts say this isn’t worth the bother, but that’s because most texts assume that comparing two elements is as cheap as or cheaper than swapping them in memory, which isn’t true for Python’s sort! Moving an object is very cheap, since what is copied is just a reference to the object. Comparing two objects is expensive, though, because all of the object-oriented machinery for finding the appropriate code to compare two objects and for coercion gets reinvoked each time. This made binary search a major win for Python’s sort.

On top of this hybrid approach, a few common special cases were exploited for speed. First, already-sorted or reverse-sorted lists were detected and handled in linear time. For some applications, these kinds of lists are very common. Second, if an array was mostly sorted, with just a few out-of-place elements at the end, the binary insertion sort handled the whole job. This was much faster than letting samplesort have at it and occurred often in applications that repeatedly sort a list, append a few new elements, then sort it again. Finally, special code in the samplesort looked for stretches of equal elements, so that the slice they occupy could be marked as done early.

In the end, all of this yielded an in-place sort with excellent performance in all known real cases and supernaturally good performance in some common special cases. It spanned about 500 lines of complicated C code, which gives special poignancy to Recipe 5.11.

Over the years samplesort was in use, I made a standing offer to buy dinner for anyone who could code a faster Python sort. Alas, I ate alone. Still, I kept my eye on the literature because several aspects of the samplesort hybrid were irritating:

  • While no case of quadratic-time behavior appeared in real life, I knew such cases could be contrived, and it was easy to contrive cases two or three times slower than average ones.

  • The special cases to speed sorting in the presence of extreme partial order were valuable in practice, but my real data often had many other kinds of partial order that should be exploitable. In fact, I came to believe that random ordering in input lists almost never exists in real life (i.e., not outside of timing harnesses for testing sorting algorithms!).

  • There is no practical way to make samplesort stable without grossly increasing memory use.

  • The code was very complex and complicated in ugly ways by the special cases.

Current Sorting

It was always clear that a mergesort would be better on several counts, including guaranteed worst-case n log n time, and that mergesort is easy to make stable. The problem was that half a dozen attempts to code a mergesort for Python yielded a sort that ran slower (mergesort does much more data movement than samplesort) and consumed more memory.

A large and growing literature concerns adaptive sorting algorithms, which attempt to detect order of various kinds in the input. I coded a dozen of them, but they were all much slower than Python’s samplesort except on the cases they were designed to exploit. The theoretical bases for these algorithms were simply too complex to yield effective practical algorithms. Then I read an article pointing out that list merging naturally reveals many kinds of partial order, simply by paying attention to how often each input list “wins” in a row. This information was simple and general. When I realized how it could be applied to a natural mergesort, which would obviously exploit all the kinds of partial order I knew and cared about, I got obsessed enough to solve the speed problem for random data and to minimize the memory burden.

The resulting “adaptive, natural, stable” mergesort implemented for Python 2.3 was a major success, but also a major engineering effort—the devil is in the details. There are about 1,200 lines of C code, but unlike the code in the samplesort hybrid, none of these lines are coding for special cases, and about half implement a technical trick allowing the worst-case memory burden to be cut in half. I’m quite proud of it, but the margins of this introduction lack the space for me to explain the details. If you’re curious, I wrote a long technical description that you can find in a Python source distribution: Objects/listsort.txt under the main directory (say, Python-2.3.5 or Python-2.4) where you unpacked Python’s source distribution archive. In the following list, I provide examples of the partial order Python 2.3’s mergesort naturally exploits, where “sorted” means in either forward-sorted or reverse-sorted order:

  • The input is already sorted.

  • The input is mostly sorted but has random elements appended at either end, or both, or inserted in the middle.

  • The input is the concatenation of two or more sorted lists. In fact, the fastest way to merge multiple sorted lists in Python now is to join them into one long list and run list.sort on that, as the short sketch after this list shows.

  • The input is mostly sorted but has some scattered elements that are out of order. This is common, for example, when people manually add new records to a database sorted by name: people aren’t good at maintaining strict alphabetic order but are good at getting close.

  • The input has many keys with the same value. For example, when sorting a database of American companies by the stock exchange they’re listed on, most will be associated with the NYSE or NASDAQ exchanges. This is exploitable for a curious reason: records with equal keys are already in sorted order, by the definition of “stable”! The algorithm detects that naturally, without code especially looking for equal keys.

  • The input was in sorted order but got dropped on the floor in chunks; the chunks were reassembled in random order, and to fight boredom, some of the chunks were riffle-shuffled together. While that’s a silly example, it still results in exploitable partial order and suggests how general the method is.
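
For instance, here is what the third point above looks like in practice, as a minimal sketch (the list names are hypothetical):

a_sorted = [1, 4, 9]
b_sorted = [2, 3, 8]
c_sorted = [5, 6, 7]
merged = a_sorted + b_sorted + c_sorted    # join the sorted lists into one long list
merged.sort( )                             # the natural mergesort exploits the sorted runs
# merged is now [1, 2, 3, 4, 5, 6, 7, 8, 9]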

In short, Python 2.3’s timsort (well, it has to have some brief name) is stable, robust, and preternaturally fast in many real-life cases: use it any time you can!

5.1. Sorting a Dictionary

Credit: Alex Martelli

Problem

You want to sort a dictionary. This probably means that you want to sort the keys and then get the values in that same sorted order.

Solution

The simplest approach is exactly the one expressed by the problem statement: sort the keys, then pick the corresponding values:

def sortedDictValues(adict):
    keys = adict.keys( )
    keys.sort( )
    return [adict[key] for key in keys]
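
For instance, with a small hypothetical dictionary, the values come back in the order of the sorted keys:

phone = {'Cecil': '3258', 'Alice': '2341', 'Beth': '9102'}
print sortedDictValues(phone)       # emits: ['2341', '9102', '3258']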

Discussion

The concept of sorting applies only to a collection that has an order—in other words, a sequence. A mapping, such as a dictionary, has no order, so it cannot be sorted. And yet, “How do I sort a dictionary?” is a frequent, though literally meaningless, question on the Python lists. More often than not, the question is in fact about sorting some sequence composed of keys and/or values from the dictionary.

As for the implementation, while one could think of more sophisticated approaches, it turns out (not unusually, for Python) that the one shown in the solution, the simplest one, is also essentially the fastest one. A further slight increase in speed, about 20%, can be squeezed out in Python 2.3 by replacing the list comprehension with a map call in the return statement at the end of the function. For example:

    return map(adict.get, keys)

Python 2.4, however, is already measurably faster than Python 2.3 with the version in the “Solution” and gains nothing from this further step. Other variants, such as using adict.__getitem__ instead of adict.get, offer no further increase in speed, or they even slow performance down a little, in both Python 2.3 and 2.4.

See Also

Recipe 5.4 for sorting a dictionary based on its values rather than on its keys.

5.2. Sorting a List of Strings Case-Insensitively

Credit: Kevin Altis, Robin Thomas, Guido van Rossum, Martin V. Lewis, Dave Cross

Problem

You want to sort a list of strings, ignoring case differences. For example, you want 'a', although it's lowercase, to sort before 'B', although the latter is uppercase. By default, however, string comparison is case-sensitive (e.g., all uppercase letters sort before all lowercase ones).

Solution

The decorate-sort-undecorate (DSU) idiom is simple and fast:

def case_insensitive_sort(string_list):
    auxiliary_list = [(x.lower( ), x) for x in string_list]    # decorate
    auxiliary_list.sort( )                                     # sort
    return [x[1] for x in auxiliary_list]                     # undecorate

In Python 2.4, DSU is natively supported, so (assuming the items of string_list are indeed strings, and not, e.g., Unicode objects), you can use the following even shorter and faster approach:

def case_insensitive_sort(string_list):
    return sorted(string_list, key=str.lower)

Discussion

An obvious alternative to this recipe’s Solution is to code a comparison function and pass it to the sort method:

def case_insensitive_sort_1(string_list):
    def compare(a, b): return cmp(a.lower( ), b.lower( ))
    string_list.sort(compare)

However, in this way the lower method gets called twice for every comparison, and the number of comparisons needed to sort a list of n items is typically proportional to n log(n).

The DSU idiom builds an auxiliary list, whose items are tuples where each item of the original list is preceded by a “key”. The sort then takes place on the key, because Python compares tuples lexicographically (i.e., it compares the tuples’ first items first). With DSU, the lower method gets called only n times to sort a list of n strings, which saves enough time to cover the small costs of the first, decorate step and the final, undecorate step, with a big net increase in speed.

DSU is also sometimes known, not quite correctly, as the Schwartzian Transform, by somewhat imprecise analogy with a well-known idiom of the Perl language. (If anything, DSU is closer to the Guttman-Rosler Transform; see http://www.sysarch.com/perl/sort_paper.html.)

DSU is so important that Python 2.4 supports it directly: you can optionally pass to the sort method of a list an argument named key, which is the callable to use on each item of the list to obtain the key for the sort. If you pass such an argument, the sorting internally uses DSU. So, in Python 2.4, string_list.sort(key=str.lower) is essentially equivalent to function case_insensitive_sort, except the sort method sorts the list in-place (and returns None) instead of returning a sorted copy and leaving the original list alone. If you want function case_insensitive_sort to sort in-place, by the way, just change its return statement into an assignment to the list's body:

string_list[:] = [x[1] for x in auxiliary_list]

Vice versa, if, in Python 2.4, you want to get a sorted copy and leave the original list alone, you can use the new built-in function sorted. For example, in Python 2.4:

for s in sorted(string_list, key=str.lower): print s

prints each string in the list, sorted case-insensitively, without affecting string_list itself.

The use of str.lower as the key argument in the Python 2.4 Solution restricts you to specifically sorting strings (not, e.g., Unicode objects). If you know you’re sorting a list of Unicode objects, use key=unicode.lower instead. If you need a function that applies just as well to strings and Unicode objects, you can import string and then use key=string.lower; alternatively, you could use key=lambda s: s.lower( ).

If you need case-insensitive sorting of lists of strings, you might also need dictionaries and sets using case-insensitive strings as keys, lists behaving case-insensitively regarding such methods as index and count, case-insensitive results from needle in haystack, and so on. If that is the case, then your real underlying need is a subtype of str that behaves case-insensitively in comparison and hashing—a clearly better factoring of the issue, compared to implementing many container types and functions to get all of this functionality. To see how to implement such a type, see Recipe 1.24.

See Also

The Python Frequently Asked Questions http://www.python.org/cgi-bin/faqw.py?req=show&file=faq04.051.htp; Recipe 5.3; Python 2.4 Library Reference about the sorted built-in function and the key argument to sort and sorted; Recipe 1.24.

5.3. Sorting a List of Objects by an Attribute of the Objects

Credit: Yakov Markovitch, Nick Perkins

Problem

You need to sort a list of objects according to one attribute of each object.

Solution

The DSU idiom shines, as usual:

def sort_by_attr(seq, attr):
    intermed = [ (getattr(x, attr), i, x) for i, x in enumerate(seq) ]
    intermed.sort( )
    return [ x[-1] for x in intermed ]
def sort_by_attr_inplace(lst, attr):
    lst[:] = sort_by_attr(lst, attr)

In Python 2.4, DSU is natively supported, so your code can be even shorter and faster:

import operator
def sort_by_attr(seq, attr):
    return sorted(seq, key=operator.attrgetter(attr))
def sort_by_attr_inplace(lst, attr):
    lst.sort(key=operator.attrgetter(attr))

Discussion

Sorting a list of objects by an attribute of each object is best done using the DSU idiom introduced previously in Recipe 5.2. In Python 2.3 and 2.4, DSU is no longer needed, as it used to be, to ensure that a sort is stable (sorting is always stable in Python 2.3 and later), but DSU’s speed advantages still shine.

Sorting, in the general case and with the best algorithms, is O(n log n) (as is often the case in mathematical formulas, the juxtaposition of terms, here n and log n, indicates that the terms are multiplied). DSU's speed comes from maximally accelerating the O(n log n) part, which dominates sorting time for sequences of substantial length n, by using only Python's native (and maximally fast) comparison. The preliminary decoration step, which prepares an intermediate auxiliary list of tuples, and the successive undecoration step, which extracts the important item from each tuple after the intermediate list is sorted, are only O(n). Therefore any minor inefficiencies in these steps contribute negligible overhead if n is large enough, and reasonably little even for many practical values of n.

This recipe puts index i, in each tuple that is an item of list intermed, ahead of the corresponding x (where x is the i-th item in seq). This placement ensures that two items of seq will never be compared directly, even if they have the same value for the attribute named attr. Even in that case, their indices will still differ, and thus Python’s lexicographic comparison of the tuples will never get all the way to comparing the tuples’ last items (the original items from seq). Avoiding object comparisons may save us from performing extremely slow operations, or even from attempting forbidden ones. For example, we could sort a list of complex numbers by their real attribute: we would get an exception if we ever tried to compare two complex numbers directly, because no ordering is defined on complex numbers. But thanks to the precaution described in this paragraph, such an event can never occur, and the sorting will therefore proceed correctly.
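
For instance, here is a minimal sketch (the numbers are hypothetical) of the situation just described, using this recipe's Python 2.3 sort_by_attr on complex numbers:

zs = [3+1j, 1+5j, 2-2j]
# zs.sort( ) would raise TypeError: no ordering is defined on complex numbers
print sort_by_attr(zs, 'real')      # emits: [(1+5j), (2-2j), (3+1j)]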

As mentioned earlier in Recipe 5.2, Python 2.4 supports DSU directly. You can pass to sort an optional keyword argument key, which is the callable to use on each item to get the sort key. Standard library module operator has two new functions, attrgetter and itemgetter, that exist specifically to return callables suitable for this purpose. In Python 2.4, the ideal solution to this problem therefore becomes:

import operator
seq.sort(key=operator.attrgetter(attr))

This snippet performs the sort in-place, which helps make it blazingly fast—on my machine, three times faster than the Python 2.3 function shown first in this recipe. If you need a sorted copy, without disturbing seq, you can get it using Python 2.4’s new built-in function sorted:

sorted_copy = sorted(seq, key=operator.attrgetter(attr))

While not quite as fast as an in-place sort, this latest snippet is still over 2.5 times faster than the function shown first in this recipe. Python 2.4 also guarantees that, when you pass the optional key named argument, list items will never be accidentally compared directly, so you need not take any special safeguards. Moreover, stability is also guaranteed.

See Also

Recipe 5.2; Python 2.4’s Library Reference docs about the sorted built-in function, operator module’s attrgetter and itemgetter functions, and the key argument to .sort and sorted.

5.4. Sorting Keys or Indices Based on the Corresponding Values

Credit: John Jensen, Fred Bremmer, Nick Coghlan

Problem

You need to count the occurrences of various items and present the items in order of their number of occurrences—for example, to produce a histogram.

Solution

A histogram, apart from graphical issues, is based on counting the occurrences of items (easy to do with a Python list or dictionary) and then sorting the keys or indices in an order based on corresponding values. Here is a subclass of dict that adds two methods for the purpose:

class hist(dict):
    def add(self, item, increment=1):
        ''' add 'increment' to the entry for 'item' '''
        self[item] = increment + self.get(item, 0)
    def counts(self, reverse=False):
        ''' return list of keys sorted by corresponding values '''
        aux = [ (self[k], k) for k in self ]
        aux.sort( )
        if reverse: aux.reverse( )
        return [k for v, k in aux]

If the items you’re counting are best modeled by small integers in a compact range, so that you want to keep item counts in a list, the solution is quite similar:

class hist1(list):
    def __init__(self, n):
        ''' initialize this list to count occurrences of n distinct items '''
        list.__init__(self, n*[0])
    def add(self, item, increment=1):
        ''' add 'increment' to the entry for 'item' '''
        self[item] += increment
    def counts(self, reverse=False):
        ''' return list of indices sorted by corresponding values '''
        aux = [ (v, k) for k, v in enumerate(self) ]
        aux.sort( )
        if reverse: aux.reverse( )
        return [k for v, k in aux]

Discussion

The add method of hist embodies the normal Python idiom for counting occurrences of arbitrary (but hashable) items, using a dict to hold the counts. In class hist1, based on a list, we take a different approach, initializing all counts to 0 in __init__, so the add method is even simpler.

The counts methods produce the lists of keys, or indices, sorted in the order given by the corresponding values. The problem is very similar in both classes, hist and hist1; therefore, the solutions are also almost identical, using in each case the DSU approach already shown in Recipe 5.2 and Recipe 5.3. If we need both classes in our program, the similarity is so close that we should surely factor out the commonalities into a single auxiliary function _sorted_keys:

def _sorted_keys(container, keys, reverse):
    ''' return list of 'keys' sorted by corresponding values in 'container' '''
    aux = [ (container[k], k) for k in keys ]
    aux.sort( )
    if reverse: aux.reverse( )
    return [k for v, k in aux]

and then implement the counts methods of each class as thin wrappers over this _sorted_keys function:

class hist(dict):
    ...
    def counts(self, reverse=False):
        return _sorted_keys(self, self, reverse)
class hist1(list):
    ...
    def counts(self, reverse=False):
        return _sorted_keys(self, xrange(len(self)), reverse)

DSU is so important that in Python 2.4, as shown previously in Recipe 5.2 and Recipe 5.3, the sort method of lists and the new built-in function sorted offer a fast, intrinsic implementation of it. Therefore, in Python 2.4, function _sorted_keys can become much simpler and faster:

def _sorted_keys(container, keys, reverse):
    return sorted(keys, key=container.__getitem__, reverse=reverse)

The bound method container.__getitem__ performs exactly the same operation as the indexing container[k] in the Python 2.3 implementation, but it's a callable to call on each k of the sequence that we're sorting, namely keys—exactly the right kind of value to pass as the key keyword argument to the sorted built-in function. Python 2.4 also affords an easy, direct way to get a list of a dictionary's items sorted by value:

from operator import itemgetter
def dict_items_sorted_by_value(d, reverse=False):
    return sorted(d.iteritems( ), key=itemgetter(1), reverse=reverse)

The operator.itemgetter higher-order function, also new in Python 2.4, is a handy way to supply the key argument when you want to sort a container whose items are subcontainers, keying on a certain item of each subcontainer. This is exactly the case here, since a dictionary’s items are a sequence of pairs (two-item tuples), and we want to sort the sequence keying on the second item of each tuple.
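
For instance, in Python 2.4, with a hypothetical dictionary of listing counts:

d = {'NYSE': 1623, 'NASDAQ': 3178, 'AMEX': 512}
print dict_items_sorted_by_value(d)
# emits: [('AMEX', 512), ('NYSE', 1623), ('NASDAQ', 3178)]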

Getting back to this recipe’s main theme, here is a usage example for the class hist shown in this recipe’s Solution:

sentence = ''' Hello there this is a test.  Hello there this was a test,
           but now it is not. '''
words = sentence.split( )
c = hist( )
for word in words: c.add(word)
print "Ascending count:"
print c.counts( )
print "Descending count:"
print c.counts(reverse=True)

This code snippet produces the following output:

Ascending count:
['but', 'it', 'not.', 'now', 'test,', 'test.', 'was', 'Hello', 'a', 'is', 'there', 'this']
Descending count:
['this', 'there', 'is', 'a', 'Hello', 'was', 'test.', 'test,', 'now', 'not.', 'it', 'but']

See Also

“Special Method Names” in the Language Reference and the OOP chapter in Python in a Nutshell, about special method __getitem__; Library Reference docs for Python 2.4 sorted built-in and the key= argument to sort and sorted.

5.5. Sorting Strings with Embedded Numbers

Credit: Sébastien Keim, Chui Tey, Alex Martelli

Problem

You need to sort a list of strings that contain substrings of digits (e.g., a list of postal addresses) in an order that looks good. For example, 'foo2.txt' should come before 'foo10.txt'. However, Python's default string comparison is alphabetical, so, by default, 'foo10.txt' instead comes before 'foo2.txt'.

Solution

You need to split each string into sequences of digits and nondigits, and transform each sequence of digits into a number. This gives you a list that is just the right comparison key for the sort you want, and you can then use DSU for the sort itself—that is, code two functions, shorter than this description:

import re
re_digits = re.compile(r'(\d+)')
def embedded_numbers(s):
    pieces = re_digits.split(s)             # split into digits/nondigits
    pieces[1::2] = map(int, pieces[1::2])   # turn digits into numbers
    return pieces
def sort_strings_with_embedded_numbers(alist):
    aux = [ (embedded_numbers(s), s) for s in alist ]
    aux.sort( )
    return [ s for __, s in aux ]           # convention: __ means "ignore"

In Python 2.4, use the native support for DSU, with the same function embedded_numbers to get the sort key:

def sort_strings_with_embedded_numbers(alist):
    return sorted(alist, key=embedded_numbers)

Discussion

Say you have an unsorted list of filenames, such as:

files = 'file3.txt file11.txt file7.txt file4.txt file15.txt'.split( )

If you just sort and print this list, for example in Python 2.4 with print ' '.join(sorted(files)), your output looks like file11.txt file15.txt file3.txt file4.txt file7.txt, since, by default, strings are sorted alphabetically (to use a fancier word, the sort order is described as lexicographical). Python cannot just guess that you mean to treat in a different way those substrings that happen to be made of digits; you have to tell Python precisely what you want, and this recipe shows how.

Using this recipe, you can get a nicer-looking result:

print ' '.join(sort_strings_with_embedded_numbers(files))

The output is now file3.txt file4.txt file7.txt file11.txt file15.txt, which is probably just what you want in this case.

The implementation relies on the DSU idiom. We need to code DSU explicitly if we want to support Python 2.3, while if our code is Python 2.4-only, we just rely on the native implementation of DSU. We do so by passing an argument named key (a function to be called on each item to get the right comparison key for the sort) to the new built-in function sorted.

Function embedded_numbers in the recipe is how we get the right comparison key for each item: a list alternating substrings of nondigits, and the int obtained from each substring of digits. re_digits.split(s) gives us a list of alternating substrings of nondigits and digits (with the substrings of digits at odd-numbered indices); then, we use built-in functions map and int (and extended-form slices that get and set all items at odd-numbered indices) to turn sequences of digits into integers. Lexicographical comparison on this list of mixed types now produces just the right result.
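
For instance, here are the comparison keys that embedded_numbers builds, and how they compare:

>>> embedded_numbers('file15.txt')
['file', 15, '.txt']
>>> embedded_numbers('file3.txt') < embedded_numbers('file15.txt')
True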

See Also

Library Reference and Python in a Nutshell docs about extended slicing and about module re; Python 2.4 Library Reference about the sorted built-in function and the key argument to sort and sorted; Recipe 5.3; Recipe 5.2.

5.6. Processing All of a List’s Items in Random Order

Credit: Iuri Wickert, Duncan Grisby, T. Warner, Steve Holden, Alex Martelli

Problem

You need to process, in random order, all of the items of a long list.

Solution

As usual in Python, the best approach is the simplest one. If we are allowed to change the order of items in the input list, then the following function is simplest and fastest:

import random

def process_all_in_random_order(data, process):
    # first, put the whole list into random order
    random.shuffle(data)
    # next, just walk over the list linearly
    for elem in data: process(elem)

If we must preserve the input list intact, or if the input data may be some iterable that is not a list, just insert as the first statement of the function the assignment data = list(data).

Discussion

While it’s a common mistake to be overly concerned with speed, don’t make the opposite mistake of ignoring the different performances of various algorithms. Suppose we must process all of the items in a long list in random order, without repetition (assume that we’re allowed to mangle or destroy the input list). The first idea to suggest itself might be to repeatedly pick an item at random (with function random.choice), removing each picked item from the list to avoid future repetitions:

import random
def process_random_removing(data, process):
    while data:
        elem = random.choice(data) 
        data.remove(elem) 
        process(elem)

However, this function is painfully slow, even for input lists of just a few hundred elements. Each call to data.remove must linearly search through the list to find the element to delete. Since the cost of each of n steps is O(n), the whole process is O(n²)—time proportional to the square of the length of the list (and with a large multiplicative constant, too).

Minor improvements to this first idea could focus on obtaining random indices, using the pop method of the list to get and remove an item at the same time, low-level fiddling with indices to avoid the costly removal in favor of swapping the picked item with the last yet-unpicked one towards the end, or using dictionaries or sets instead of lists. This latest idea might be based on a hope of using a dict’s popitem method (or the equivalent method pop of class sets.Set and Python 2.4’s built-in type set), which may look like it’s designed exactly to pick and remove a random item, but, beware! dict.popitem is documented to return and remove an arbitrary item of the dictionary, and that’s a far cry from a random item. Check it out:

>>> d=dict(enumerate('ciao'))
>>> while d: print d.popitem( )

It may surprise you, but in most Python implementations this snippet will print d's items in a far from random order, typically (0,'c') then (1,'i') and so forth. In short, if you need pseudo-random behavior in Python, you need the standard library module random: popitem is not an alternative.

If you thought about using a dictionary rather than a list, you are definitely on your way to “thinking Pythonically”, even though it turns out that dictionaries wouldn't provide a substantial performance boost for this specific problem. However, an approach that is even more Pythonic than choosing the right data structure is best summarized as: let the standard library do it! The Python Standard Library is large, rich, and chock full of useful, robust, fast functions and classes for a wide variety of tasks. In this case, the key intuition is realizing that, to walk over a sequence in a random order, the simplest approach is to first put that sequence into random order (known as shuffling the sequence, by analogy with shuffling a deck of cards) and then walk over the shuffled sequence linearly. Function random.shuffle performs the shuffling, and the function shown in this recipe's Solution just uses it.

Performance should always be measured, never guessed at, and that's what standard library module timeit is for. Using a null process function and a list of length 1,000 as data, process_all_in_random_order is almost 10 times faster than process_random_removing; with a list of length 2,000, the performance ratio grows to almost 20. While an improvement of, say, 25%, or even a constant factor of 2, usually can be neglected without really affecting the performance of your program as a whole, the same does not apply to an algorithm that is 10 or 20 times as slow as it could be. Such terrible performance is likely to make that program fragment a bottleneck, all by itself. Moreover, this risk increases when we're talking about O(n²) versus O(n) behavior: with such differences in big-O behavior, the performance ratio between bad and good algorithms keeps increasing without bounds as the size of the input data grows.

See Also

The documentation for the random and timeit modules in the Library Reference and Python in a Nutshell.

5.7. Keeping a Sequence Ordered as Items Are Added

Credit: John Nielsen

Problem

You want to maintain a sequence, to which items are added, in a sorted state, so that at any time, you can easily examine or remove the smallest item currently present in the sequence.

Solution

Say you start with an unordered list, such as:

the_list = [903, 10, 35, 69, 933, 485, 519, 379, 102, 402, 883, 1]

You could call the_list.sort( ) to make the list sorted and then result=the_list.pop(0) to get and remove the smallest item. But then, every time you add an item (say with the_list.append(0)), you need to call the_list.sort( ) again to keep the list sorted.

Alternatively, you can use the heapq module of the Python Standard Library:

import heapq
heapq.heapify(the_list)

Now the list is not necessarily fully sorted, but it does satisfy the heap property (meaning if all indices involved are valid, the_list[i]<=the_list[2*i+1] and the_list[i]<=the_list[2*i+2])—so, in particular, the_list[0] is the smallest item. To keep the heap property valid, use result=heapq.heappop(the_list) to get and remove the smallest item and heapq.heappush(the_list, newitem) to add a new item. When you need to do both—add a new item while getting and removing the previously smallest item—you can use result=heapq.heapreplace(the_list, newitem).
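
For instance, picking up the_list from above (a minimal sketch; the exact arrangement of the list after heapify may vary, but the heap property always holds):

import heapq
the_list = [903, 10, 35, 69, 933, 485, 519, 379, 102, 402, 883, 1]
heapq.heapify(the_list)
print the_list[0]                       # emits: 1, the smallest item
print heapq.heappop(the_list)           # emits: 1, removing it as well
heapq.heappush(the_list, 5)             # add a new item
print heapq.heapreplace(the_list, 42)   # emits: 5, popping it while pushing 42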

Discussion

When you need to retrieve data in an ordered way (at each retrieval getting the smallest item among those you currently have at hand), you can pay the runtime cost for the sorting when you retrieve the data, or you can pay for it when you add the data. One approach is to collect your data into a list and sort the list. Now it’s easy to get your data in order, smallest to largest. However, you have to keep calling sort each time you add new data during the retrieval, to make sure you can later keep retrieving from the smallest current item after each addition. The method sort of Python lists is implemented with a little-known algorithm called Natural Mergesort, which minimizes the runtime cost of this approach. Yet the approach can still be burdensome: each addition (and sorting) and each retrieval (and removal, via pop) takes time proportional to the number of current items in the list (O(N), in common parlance).

An alternative approach is to use a data organization known as a heap, a type of binary tree implemented compactly, yet ensuring that each “parent” is always less than its “children”. The best way to maintain a heap in Python is to use a list and have it managed by the heapq library module, as shown in this recipe's Solution. The list does not get fully sorted, yet you can be sure that, whenever you heappop an item from the list, you always get the lowest item currently present, and all others will be adjusted to ensure the heap property is still valid. Each addition with heappush, and each removal with heappop, takes a short time proportional to the logarithm of the current length of the list (O(log N), in common parlance). You pay as you go, a little at a time (and not too much in total, either).

A good occasion to use this heap approach, for example, is when you have a long-running queue with new data periodically arriving, and you always want to be able to get the most important item off the queue without having to constantly re-sort your data or perform full searches. This concept is known as a priority queue, and a heap is an excellent way to implement it. Note that, intrinsically, the heapq module supplies you with the smallest item at each heappop, so make sure to arrange the way you encode your items’ priority values to reflect this. For example, say that you receive incoming items each accompanied by a cost, and the most important item at any time is the one with the highest cost that is currently on the queue; moreover, among items of equal cost, the most important one is the one that arrived earliest. Here’s a way to build a “priority queue” class respecting these specs and based on functions of module heapq:

class prioq(object):
    def __init__(self):
        self.q = [  ]
        self.i = 0
    def push(self, item, cost):
        heapq.heappush(self.q, (-cost, self.i, item))
        self.i += 1
    def pop(self):
        return heapq.heappop(self.q)[-1]

The main idea in this snippet is to push on the heap tuples whose first item is the cost with changed sign, so that higher costs result in smaller tuples (by Python’s natural comparison); right after the cost, we put a progressive index, so that, among items with equal cost, the one arriving earliest will be in a smaller tuple.
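
A quick usage sketch of the class above (the items and costs are hypothetical):

q = prioq( )
q.push('routine backup', 1)
q.push('pager alert', 10)
q.push('second pager alert', 10)
print q.pop( )      # emits: pager alert (highest cost, arrived first)
print q.pop( )      # emits: second pager alert
print q.pop( )      # emits: routine backup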

In Python 2.4, module heapq has been reimplemented and optimized; see Recipe 5.8 for more information about heapq.

See Also

Docs for module heapq in the Library Reference and Python in a Nutshell; heapq.py in the Python sources contains a very interesting discussion of heaps; Recipe 5.8 for more information about heapq; Recipe 19.14 for merging sorted sequences using heapq.

5.8. Getting the First Few Smallest Items of a Sequence

Credit: Matteo Dell’Amico, Raymond Hettinger, George Yoshida, Daniel Harding

Problem

You need to get just a few of the smallest items from a sequence. You could sort the sequence and just use seq[:n], but is there any way you can do better?

Solution

Perhaps you can do better, if n, the number of items you need, is small compared to the sequence's length N. sort is very fast, but it still takes O(N log N) time, while we can get the first n smallest elements in time O(N) when n is small compared to N. Here is a simple and practical generator for this purpose, which works equally well in Python 2.3 and 2.4:

import heapq
def isorted(data):
    data = list(data)
    heapq.heapify(data)
    while data:
        yield heapq.heappop(data)

In Python 2.4 only, you can use an even simpler and faster way to get the smallest n items of data when you know n in advance:

import heapq
def smallest(n, data):
    return heapq.nsmallest(n, data)

Discussion

data can be any bounded iterable; the recipe’s function isorted starts by calling list on it to ensure that. You can remove the statement data = list(data) if all the following conditions hold: you know that data is a list to start with, you don’t mind the fact that the generator reorders data’s items, and you want to remove items from data as you fetch them.

As shown previously in Recipe 5.7, the Python Standard Library contains module heapq, which supports the data structures known as heaps. Generator isorted in this recipe's Solution relies on making a heap at the start (via heapq.heapify) and then yielding and removing the heap's smallest remaining item at each step (via heapq.heappop).

In Python 2.4, module heapq has also grown two new functions. heapq.nlargest(n, data) returns a list of the n largest items of data; heapq.nsmallest(n, data) returns a list of the n smallest items. These functions do not require that data satisfy the heap condition; indeed, they do not even require data to be a list—any bounded iterable whose items are comparable will do. Function smallest in this recipe's Solution just lets heapq.nsmallest do all the work.

To judge speed, we must always measure it—guessing about relative speeds of different pieces of code is a mug’s game. So, how does isorted’s performance compare with Python 2.4’s built-in function sorted, when we’re only looping on the first few (smallest) items? To help measure timing, I wrote a top10 function that can use either approach, and I also made sure I had a sorted function even in Python 2.3, where it’s not built in:

try:
    sorted
except NameError:
    def sorted(data):
        data = list(data)
        data.sort( )
        return data
import itertools
def top10(data, howtosort):
    return list(itertools.islice(howtosort(data), 10))

On my machine running Python 2.4 on thoroughly shuffled lists of 1,000 integers, top10 takes about 260 microseconds with isorted, while it takes about 850 microseconds with the built-in sorted. However, Python 2.3 is much slower and gives vastly different results: about 12 milliseconds with isorted, about 2.7 milliseconds with sorted. In other words, Python 2.3 is 3 times slower than Python 2.4 for sorted, but it’s 50 times slower for isorted. Lesson to retain: whenever you optimize, measure. You shouldn’t choose optimizations based on first principles, since the performance numbers can vary so widely, even between vastly compatible “point releases”. A secondary point can be made: if you care about performance, move to Python 2.4 as soon as you can. Python 2.4 has been vastly optimized and accelerated over Python 2.3, particularly in areas related to searching and sorting.
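
If you want to reproduce such measurements, standard library module timeit does the work; here is a minimal sketch (run it as a script that also defines top10 and isorted; the numbers on your machine will differ):

import random, timeit
data = range(1000)
random.shuffle(data)
# time 1,000 calls of the isorted-based top10, reporting the per-call average
t = timeit.Timer('top10(data, isorted)',
                 'from __main__ import top10, isorted, data')
print t.timeit(number=1000) / 1000, 'seconds per call'

Timing the sorted-based variant works the same way, with sorted substituted for isorted in the statement (and, under Python 2.3, also added to the import).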

If you know that your code need only support Python 2.4, then, as this recipe’s Solution indicates, using heapq’s new function nsmallest is faster, as well as simpler, than doing your own coding. To implement top10 in Python 2.4, for example, you just need:

import heapq
def top10(data):
    return heapq.nsmallest(10, data)

This version takes about half the time of the previously shown isorted-based top10, when called on the same thoroughly shuffled lists of 1,000 integers.

See Also

Library Reference and Python in a Nutshell docs about method sort of type list, and about modules heapq and timeit; Chapter 19 for more about iteration in Python; Python in a Nutshell’s chapter on optimization; heapq.py in the Python sources contains a very interesting discussion of heaps; Recipe 5.7 for more information about heapq.

5.9. Looking for Items in a Sorted Sequence

Credit: Noah Spurrier

Problem

You need to look for a lot of items in a sequence.

Solution

If list L is sorted, module bisect from the Python Standard Library makes it easy to check if some item x is present in L:

import bisect
x_insert_point = bisect.bisect_right(L, x)
x_is_present = L[x_insert_point-1:x_insert_point] == [x]

Discussion

Looking for an item x in a list L is very easy in Python: to check whether the item is there at all, if x in L; to find out where exactly it is, L.index(x). However, if L has length n, these operations take time proportional to n—essentially, they just loop over the list’s items, checking each for equality to x. If L is sorted, we can do better.

The classic algorithm to look for an item in a sorted sequence is known as binary search, because at each step it roughly halves the range it's still searching on—it generally takes about log₂n steps. It's worth considering when you're going to look for items many times, so you can amortize the cost of sorting over many searches. Once you've decided to use binary search for x in L, after calling L.sort( ), module bisect from the Python Standard Library makes the job easy.

Specifically, we need function bisect.bisect_right, which returns the index where an item should be inserted, to keep the sorted list sorted, but doesn’t alter the list; moreover, if the item already appears in the list, bisect_right returns an index that’s just to the right of any items with the same value. So, after getting this “insert point” by calling bisect.bisect_right(L, x), we need only to check the list immediately before the insert point, to see if an item equal to x is already there.

The way we compute x_is_present in the “Solution” may not be immediately obvious. If we know that L is not empty, we can use a simpler and more obvious approach:

x_is_present = L[x_insert_point-1] == x

However, the indexing in this simpler approach raises an exception when L is empty. When the slice boundaries are invalid, slicing is less “strict” than indexing, since it just produces an empty slice without raising any exception. In general, somelist[i:i+1] is the same one-item list as [somelist[i]] when i is a valid index in somelist: it’s an empty list [ ] when the indexing would raise IndexError. The computation of x_is_present in the recipe exploits this important property to avoid having to deal with exceptions and handle empty and nonempty cases for L in one uniform way. An alternative approach is:

x_is_present = L and L[x_insert_point-1] == x

This alternative approach exploits and’s short-circuiting behavior to guard the indexing, instead of using slicing.
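
Putting the pieces together, here is the whole check wrapped into a small function (a minimal sketch; the name sorted_contains is mine, not a standard one):

import bisect
def sorted_contains(L, x):
    ''' return True if x is present in the sorted list L '''
    i = bisect.bisect_right(L, x)
    return L[i-1:i] == [x]          # slicing, not indexing: safe for empty L

L = [1, 3, 3, 7, 9]
print sorted_contains(L, 3)         # emits: True
print sorted_contains(L, 4)         # emits: False
print sorted_contains([ ], 4)       # emits: False, with no exception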

An auxiliary dict, as shown in Recipe 5.12, is also a possibility as long as items are hashable (meaning that items can be used as keys into a dict). However, the approach in this recipe, based on a sorted list, may be the only useful one when the items are comparable (otherwise, the list could not be sorted) but not hashable (so a dict can’t have those items as its keys).

When the list is already sorted, and the number of items you need to look up in it is not extremely large, it may in any case be faster to use bisect than to build an auxiliary dictionary, since the investment of time in the latter operation might not be fully amortized. This is particularly likely in Python 2.4, since bisect has been optimized very effectively and is much faster than it was in Python 2.3. On my machine, for example, bisect.bisect_right for an item in the middle of a list of 10,000 integers is about four times faster in Python 2.4 than it was in Python 2.3.

See Also

Documentation for the bisect module in the Library Reference and Python in a Nutshell; Recipe 5.12.

5.10. Selecting the nth Smallest Element of a Sequence

Credit: Raymond Hettinger, David Eppstein, Shane Holloway, Chris Perkins

Problem

You need to get from a sequence the nth item in rank order (e.g., the middle item, known as the median). If the sequence were sorted, you would just use seq[n]. But the sequence isn't sorted, and you wonder if you can do better than just sorting it first.

Solution

Perhaps you can do better, if the sequence is big, has been shuffled enough, and comparisons between its items are costly. Sort is very fast, but in the end (when applied to a thoroughly shuffled sequence of length n) it always takes O(n log n) time, while there exist algorithms that can be used to get the nth smallest element in time O(n). Here is a function with a solid implementation of such an algorithm:

import random
def select(data, n):
    " Find the nth rank ordered element (the least value has rank 0). "
    # make a new list, deal with <0 indices, check for valid index
    data = list(data)
    if n<0:
        n += len(data)
    if not 0 <= n < len(data):
        raise ValueError, "can't get rank %d out of %d" % (n, len(data))
    # main loop, quicksort-like but with no need for recursion
    while True:
        pivot = random.choice(data)
        pcount = 0
        under, over = [  ], [  ]
        uappend, oappend = under.append, over.append
        for elem in data:
            if elem < pivot:
                uappend(elem)
            elif elem > pivot:
                oappend(elem)
            else:
                pcount += 1
        numunder = len(under)
        if n < numunder:
            data = under
        elif n < numunder + pcount:
            return pivot
        else:
            data = over
            n -= numunder + pcount
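
A quick usage sketch of select (the data is hypothetical):

data = [9, 1, 7, 3, 5]
print select(data, 0)     # emits: 1, the smallest item
print select(data, 2)     # emits: 5, the median
print select(data, -1)    # emits: 9, the largest item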

Discussion

This recipe is meant for cases in which repetitions count. For example, the median of the list [1, 1, 1, 2, 3] is 1 because that is the third one of the five items in rank order. If, for some strange reason, you want to discount duplications, you need to reduce the list to its unique items first (e.g., by applying Recipe 18.1), after which you may want to come back to this recipe.

Input argument data can be any bounded iterable; the recipe starts by calling list on it to ensure that. The algorithm then loops, implementing at each leg a few key ideas: randomly choosing a pivot element; slicing up the list into two parts, made up of the items that are “under” and “over” the pivot respectively; continuing work for the next leg on just one of the two parts, since we can tell which one of them the nth element will be in, and the other part can safely be ignored. These ideas are very close to those in the classic algorithm known as quicksort (except that quicksort cannot ignore either part, and thus must use recursion, or recursion-removal techniques such as keeping an explicit stack, to make sure it deals with both parts).

The random choice of pivot makes the algorithm robust against unfavorable data orderings (the kind that wreak havoc with naive quicksort); this implementation decision costs about log₂N calls to random.choice. Another implementation issue worth pointing out is that the recipe counts the number of occurrences of the pivot: this precaution ensures good performance even in the anomalous case where data contains a high number of repetitions of identical values.

Extracting the bound methods .append of lists under and over as local variables uappend and oappend may look like a pointless, if tiny, complication, but it is, in fact, a very important optimization technique in Python. To keep the compiler simple, straightforward, unsurprising, and robust, Python does not hoist constant computations out of loops, nor does it “cache” the results of method lookup. If you call under.append and over.append in the inner loop, you pay the cost of lookup each and every time. If you want something hoisted, hoist it yourself. When you’re considering an optimization, you should always measure the code’s performance with and without that optimization, to check that the optimization does indeed make an important difference. According to my measurements, removing this single optimization slows performance down by about 50% for the typical task of picking the 5000th item of range(10000). Considering the tiny amount of complication involved, a difference of 50% is well worth it.

A natural idea for optimization, which just didn’t make the grade once carefully measured, is to call cmp(elem, pivot) in the loop body, rather than making separate tests for elem < pivot and elem > pivot. Unfortunately, measurement shows that cmp doesn’t speed things up; in fact, it slows them down, at least when the items of data are of elementary types such as numbers and strings.

So, how does select’s performance compare with the simpler alternative of:

def selsor(data, n):
    data = list(data)
    data.sort( )
    return data[n]

On thoroughly shuffled lists of 3,001 integers on my machine, this recipe’s select takes about 16 milliseconds to find the median, while selsor takes about 13 milliseconds; considering that sort could take advantage of any partial sortedness in the data, for this kind of length, and on elementary data whose comparisons are fast, it’s not to your advantage to use select. For a length of 30,001, performance becomes very close between the two approaches—around 170 milliseconds either way. When you push the length all the way to 300,001, select provides an advantage, finding the median in about 2.2 seconds, while selsor takes about 2.5.

The break-even point will be smaller if the items in the sequence have costly comparison methods, since the key difference between the two approaches is in the number of comparisons performed—select takes O(n), selsor takes O(n log n). For example, say we need to compare instances of a class designed for somewhat costly comparisons (simulating four-dimensional points that will often coincide on the first few dimensions):

import random

class X(object):
    def __init__(self):
        self.a = self.b = self.c = 23.51
        self.d = random.random()
    def _dats(self):
        return self.a, self.b, self.c, self.d
    def __cmp__(self, oth):
        # compare the tuples of attributes (note: _dats must be called)
        return cmp(self._dats(), oth._dats())

Here, select already becomes faster than selsor when we're computing the median of a list of just 201 such instances.

In other words, although select carries more overhead than the wondrously efficient coding of lists’ sort method, if n is large enough and each comparison is costly enough, select is still well worth considering.

See Also

Library Reference and Python in a Nutshell docs about method sort of type list, and about module random.

5.11. Showing off quicksort in Three Lines

Credit: Nathaniel Gray, Raymond Hettinger, Christophe Delord, Jeremy Zucker

Problem

You need to show that Python’s support for the functional programming paradigm is better than it might seem at first sight.

Solution

Functional programming languages, of which Haskell is a great example, are splendid animals, but Python can hold its own in such company:

def qsort(L):
    if len(L) <= 1: return L
    return qsort([lt for lt in L[1:] if lt < L[0]]) + L[0:1] + \
           qsort([ge for ge in L[1:] if ge >= L[0]])

In my humble opinion, this code is almost as pretty as the Haskell version from http://www.haskell.org:

qsort [] = []
qsort (x:xs) = qsort elts_lt_x ++ [x] ++ qsort elts_greq_x
                 where
                   elts_lt_x = [y | y <- xs, y < x]
                   elts_greq_x = [y | y <- xs, y >= x]

Here’s a test function for the Python version:

def qs_test(length):
    import random
    joe = range(length)
    random.shuffle(joe)
    qsJoe = qsort(joe)
    for i in range(len(qsJoe)):
        assert qsJoe[i] == i, 'qsort is broken at %d!' % i

Discussion

This rather naive implementation of quicksort illustrates the expressive power of list comprehensions. Do not use this approach in real code! Python lists have an in-place sort method that is much faster and should always be preferred; in Python 2.4, the new built-in function sorted accepts any finite sequence and returns a new sorted list with the sequence’s items. The only proper use of this recipe is for impressing friends, particularly ones who (quite understandably) are enthusiastic about functional programming, and particularly about the Haskell language.
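
For instance, in Python 2.4:

>>> sorted('ginkgo')          # any finite sequence is fine
['g', 'g', 'i', 'k', 'n', 'o']
>>> sorted([3, 1, 2])         # the result is always a new list
[1, 2, 3]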

I cooked up this function after finding the wonderful Haskell quicksort (which I’ve reproduced in the “Solution”) at http://www.haskell.org/aboutHaskell.html. After marveling at the elegance of this code for a while, I realized that list comprehensions made the same approach possible in Python. Not for nothing did we steal list comprehensions right out of Haskell, just Pythonizing them a bit by using keywords rather than punctuation!

Both implementations pivot on the first element of the list and thus have worst-case O(n²) performance for the very common case of sorting an already sorted list. You would never want to do so in production code! Because this recipe is just a propaganda piece, though, it doesn’t really matter.

You can write a less compact version with similar architecture in order to use named local variables and functions for enhanced clarity:

def qsort(L):
    if not L: return L
    pivot = L[0]
    def lt(x): return x<pivot
    def ge(x): return x>=pivot
    return qsort(filter(lt, L[1:]))+[pivot]+qsort(filter(ge, L[1:]))

Once you start going this route, you can easily move to a slightly less naive version, using random pivot selection to make worst-case performance less likely, and counting pivots to handle degenerate cases with many equal elements:

import random
def qsort(L):
    if not L: return L
    pivot = random.choice(L)
    def lt(x): return x<pivot
    def gt(x): return x>pivot
    return qsort(filter(lt, L))+[pivot]*L.count(pivot)+qsort(filter(gt, L))

Despite the enhancements, these variants are meant essentially for fun and demonstration purposes. Production-quality sorting code is quite another thing: these little jewels, no matter how much we dwell on them, will never match the performance and solidity of Python’s own built-in sorting approaches.

Rather than going for clarity and robustness, we can move in the opposite direction to make this last point most obvious, showing off the obscurity and compactness that one can get with Python’s lambda:

q=lambda x:(lambda o=lambda s:[i for i in x if cmp(i,x[0])==s]:
            len(x)>1 and q(o(-1))+o(0)+q(o(1)) or x)()

At least, with this beauty (a single logical line, although it needs to be split into two physical lines due to its length), it should be absolutely obvious that this approach is not meant for real-world use. The equivalent, using more readable def statements rather than opaque lambdas, would still be pretty obscure:

def q(x):
    def o(s): return [i for i in x if cmp(i,x[0])==s]
    return len(x)>1 and q(o(-1))+o(0)+q(o(1)) or x

but a little more clarity (and sanity) can be recovered by opening up the pithy len(x)>1 and . . . or x into an if/else statement and introducing sensible local names again:

def q(x):
    if len(x)>1:
        lt = [i for i in x if cmp(i,x[0]) == -1 ]
        eq = [i for i in x if cmp(i,x[0]) == 0 ]
        gt = [i for i in x if cmp(i,x[0]) == 1 ]
        return q(lt) + eq + q(gt)
    else:
        return x

Fortunately, in the real world, Pythonistas are much too sensible to write convoluted, lambda-filled horrors such as this. In fact, many (though admittedly not all) of us feel enough aversion to lambda itself (partly from having seen it abused this way) that we go out of our way to use readable def statements instead. As a result, the ability to decode such “bursts of line noise” is not a necessary survival skill in the Python world, as it might be for other languages. Any language feature can be abused by programmers trying to be “clever” . . . as a result, some Pythonistas (though a minority) feel a similar aversion to features such as list comprehensions (since it’s possible to cram too many things into a list comprehension, where a plain for loop would be clearer) or to the short-circuiting behavior of operators and/or (since they can be abused to write obscure, terse expressions where a plain if statement would be clearer).

See Also

The Haskell web site, http://www.haskell.org.

5.12. Performing Frequent Membership Tests on a Sequence

Credit: Alex Martelli

Problem

You need to perform frequent tests for membership in a sequence. The O(n) behavior of repeated in operators hurts performance, but you can’t switch to using just a dictionary or set instead of the sequence, because you also need to keep the sequence’s order.

Solution

Say you need to append items to a list only if they’re not already in the list. One sound approach to this task is the following function:

def addUnique(baseList, otherList):
    auxDict = dict.fromkeys(baseList)
    for item in otherList:
        if item not in auxDict:
            baseList.append(item)
            auxDict[item] = None

If your code has to run only under Python 2.4, you can get exactly the same effect with an auxiliary set rather than an auxiliary dictionary.
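
Here is a sketch of that set-based variant (set became a built-in type in Python 2.4):

def addUnique(baseList, otherList):
    auxSet = set(baseList)
    for item in otherList:
        if item not in auxSet:
            baseList.append(item)
            auxSet.add(item)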

Discussion

A simple (naive?) approach to this recipe’s task looks good:

def addUnique_simple(baseList, otherList):
    for item in otherList:
        if item not in baseList:
            baseList.append(item)

and it may be sort of OK, if the lists are very small.

However, the simple approach can be quite slow if the lists are not small. When you check if item not in baseList, Python can implement the in operator in only one way: an internal loop over the elements of baseList, ending with a result of True as soon as an element compares equal to item, with a result of False if the loop terminates without having found any equality. On average, executing the in-operator takes time proportional to len(baseList). addUnique_simple executes the in-operator len(otherList) times, so, in all, it takes time proportional to the product of the lengths of the two lists.
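
In effect, for a list, the in-operator must do roughly the work of this sketch:

def linear_in(item, baseList):
    # roughly the work Python's in-operator performs on a list
    for element in baseList:
        if element == item:
            return True
    return False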

In the addUnique function shown in the “Solution”, we first build the auxiliary dictionary auxDict, a step that takes time proportional to len(baseList). Then, the in-operator inside the loop checks for membership in a dict—a step that makes all the difference because checking for membership in a dict takes roughly constant time, independent of the number of items in the dict! So, the for loop takes time proportional to len(otherList), and the entire function takes time proportional to the sum of the lengths of the two lists.

The analysis of the running times should in fact go quite a bit deeper, because the length of baseList is not constant in addUnique_simple; baseList grows each time an item is processed that was not already there. But the gist of the (surprisingly complicated) analysis is not very different from what this simplified version indicates. We can check this by measuring. When each list holds 10 integers, with an overlap of 50%, the simple version is about 30% slower than the one shown in the “Solution”, the kind of slowdown that can normally be ignored. But with lists of 100 integers each, again with 50% overlap, the simple version is twelve times slower than the one shown in the “Solution”—a level of slowdown that can never be ignored, and it only gets worse if the lists get really substantial.

Sometimes, you could obtain even better overall performance for your program by permanently placing the auxiliary dict alongside the sequence, encapsulating both into one object. However, in this case, you must maintain the dict as the sequence gets modified, to ensure it stays in sync with the sequence’s current membership. This maintenance task is not trivial, and it can be architected in many different ways. Here is one such way, which does the syncing “just in time,” rebuilding the auxiliary dict when a membership test is required and the dictionary is possibly out of sync with the list’s contents. Since it costs very little, the following class optimizes the index method, as well as membership tests:

class list_with_aux_dict(list):
    def __init__(self, iterable=()):
        list.__init__(self, iterable)
        self._dict_ok = False
    def _rebuild_dict(self):
        self._dict = {}
        for i, item in enumerate(self):
            if item not in self._dict:
                self._dict[item] = i
        self._dict_ok = True
    def __contains__(self, item):
        if not self._dict_ok:
            self._rebuild_dict()
        return item in self._dict
    def index(self, item):
        if not self._dict_ok:
            self._rebuild_dict()
        try: return self._dict[item]
        except KeyError: raise ValueError
def _wrapMutatorMethod(methname):
    _method = getattr(list, methname)
    def wrapper(self, *args):
        # Reset 'dictionary OK' flag, then delegate to the real mutator method
        self._dict_ok = False
        return _method(self, *args)
    # in Python 2.4, only: wrapper.__name__ = _method.__name__
    setattr(list_with_aux_dict, methname, wrapper)
for meth in 'setitem delitem setslice delslice iadd'.split():
    _wrapMutatorMethod('__%s__' % meth)
for meth in 'append insert pop remove extend'.split():
    _wrapMutatorMethod(meth)
del _wrapMutatorMethod        # remove auxiliary function, not needed any more

The list_with_aux_dict class extends list and delegates to it every method, except __contains__ and index. Every method that can modify list membership is wrapped in a closure that resets a flag asserting that the auxiliary dictionary is OK. Python’s in-operator calls the __contains__ method. list_with_aux_dict’s __contains__ method rebuilds the auxiliary dictionary, unless the flag is set (when the flag is set, rebuilding is unnecessary); the index method works the same way.

Instead of building and installing wrapping closures for all the mutating methods of the list into the list_with_aux_dict class with a helper function, as the recipe does, we could write all the def statements for the wrapper methods in the body of list_with_aux_dict. However, the code for the class as presented has the important advantage of minimizing boilerplate (repetitious plumbing code that is boring and voluminous, and thus a likely home for bugs). Python’s strengths at introspection and dynamic modification give you a choice: you can build method wrappers, as this recipe does, in a smart and concise way; or, you can choose to code the boilerplate anyway, if you prefer to avoid what some would call the black magic of introspection and dynamic modification of class objects.

The architecture of class list_with_aux_dict caters well to a rather common pattern of use, where sequence-modifying operations happen in bunches, followed by a period of time in which the sequence is not modified, but several membership tests may be performed. However, the addUnique_simple function shown earlier would not get any performance benefit if argument baseList was an instance of this recipe’s list_with_aux_dict rather than a plain list: the function interleaves membership tests and sequence modifications. Therefore, too many rebuilds of the auxiliary dictionary for list_with_aux_dict would impede the function’s performance. (Unless a typical case was for a vast majority of the items of otherList to be already contained in baseList, so that very few modifications occurred compared to the number of membership tests.)

An important requisite for any of these membership-test optimizations is that the values in the sequence must be hashable (otherwise, of course, they cannot be keys in a dict, nor items in a set). For example, a list of tuples might be subjected to this recipe’s treatment, but for a list of lists, the recipe as it stands is just not applicable.
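
For example (doctest-style; the exact wording of the error message may vary across Python versions):

>>> auxDict = dict.fromkeys([(1, 2), (3, 4)])    # tuples: hashable, fine
>>> dict.fromkeys([[1, 2], [3, 4]])              # lists: not hashable
Traceback (most recent call last):
  ...
TypeError: list objects are unhashable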

See Also

The Library Reference and Python in a Nutshell sections on sequence types and mapping types.

5.13. Finding Subsequences

Credit: David Eppstein, Alexander Semenov

Problem

You need to find occurrences of a subsequence in a larger sequence.

Solution

If the sequences are strings (plain or Unicode), Python strings’ find method and the standard library’s re module are the best approach. Otherwise, use the Knuth-Morris-Pratt algorithm (KMP):

def KnuthMorrisPratt(text, pattern):
    ''' Yields all starting positions of copies of subsequence 'pattern'
        in sequence 'text' -- each argument can be any iterable.
        At the time of each yield, 'text' has been read exactly up to and
        including the match with 'pattern' that is causing the yield. '''
    # ensure we can index into pattern, and also make a copy to protect
    # against changes to 'pattern' while we're suspended by `yield'
    pattern = list(pattern)
    length = len(pattern)
    # build the KMP "table of shift amounts" and name it 'shifts'
    shifts = [1] * (length + 1)
    shift = 1
    for pos, pat in enumerate(pattern):
        while shift <= pos and pat != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift
    # perform the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == length or matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == length: yield startPos

Discussion

This recipe implements the Knuth-Morris-Pratt algorithm for finding copies of a given pattern as a contiguous subsequence of a larger text. Since KMP accesses the text sequentially, it is natural to implement it in a way that allows the text to be an arbitrary iterator. After a preprocessing stage that builds a table of shift amounts and takes time that’s directly proportional to the length of the pattern, each text symbol is processed in constant amortized time. Explanations and demonstrations of how KMP works can be found in all good elementary texts about algorithms. (A recommendation is provided in See Also.)
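
For instance, here is a quick doctest-style demonstration (each argument may be any iterable, even a one-shot iterator, and overlapping occurrences are reported):

>>> list(KnuthMorrisPratt([5, 1, 2, 3, 1, 2, 3, 4], [1, 2, 3]))
[1, 4]
>>> list(KnuthMorrisPratt(iter('aaaa'), 'aa'))    # overlaps included
[0, 1, 2]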

If text and pattern are both Python strings, you can get a faster solution by suitably applying Python built-in search methods:

def finditer(text, pattern):
    pos = -1
    while True:
        pos = text.find(pattern, pos+1)
        if pos < 0: break
        yield pos
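
Thanks to the pos+1 restart, finditer reports overlapping occurrences too, just like KnuthMorrisPratt:

>>> list(finditer('aaaa', 'aa'))
[0, 1, 2]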

For example, using an alphabet of length 4 ('ACGU' . . .), finding all occurrences of a pattern of length 8 in a text of length 100000, on my machine, takes about 4.3 milliseconds with finditer, but the same task takes about 540 milliseconds with KnuthMorrisPratt (that’s with Python 2.3; KMP is faster with Python 2.4, taking about 480 milliseconds, but that’s still over 100 times slower than finditer). So remember: this recipe is useful for searches on generic sequences, including ones that you cannot keep in memory all at once, but if you’re searching on strings, Python’s built-in searching methods rule.

See Also

Many excellent books cover the fundamentals of algorithms; among such books, a widely admired one is Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, Introduction to Algorithms, 2d ed. (MIT Press).

5.14. Enriching the Dictionary Type with Ratings Functionality

Credit: Dmitry Vasiliev, Alex Martelli

Problem

You want to use a dictionary to store the mapping between some keys and a current score value for each key. You frequently need to access the keys and scores in natural order (meaning, in order of ascending scores) and to check on a key’s current ranking in that order, so that using just a dict isn’t quite enough.

Solution

We can subclass dict and add or override methods as needed. By using multiple inheritance, placing base UserDict.DictMixin before base dict and carefully arranging our various delegations and overrides, we can achieve a good balance between getting good performance and avoiding the need to write “boilerplate” code.

By enriching our class with many examples in its docstring, we can use the standard library’s module doctest to give us unit-testing functionality, as well as ensuring the accuracy of all the examples we write in the docstring:

#!/usr/bin/env python
''' An enriched dictionary that holds a mapping from keys to scores '''
from bisect import bisect_left, insort_left
import UserDict
class Ratings(UserDict.DictMixin, dict):
    """ class Ratings is mostly like a dictionary, with extra features: the
        value corresponding to each key is the 'score' for that key, and all
        keys are ranked in terms of their scores.  Values must be comparable; keys,
        as well as being hashable, must be comparable if any two keys may ever
        have the same corresponding value (i.e., may be "tied" on score).
        All mapping-like behavior is just as you would expect, such as:
        >>> r = Ratings({"bob": 30, "john": 30})
        >>> len(r)
        2
        >>> r.has_key("paul"), "paul" in r
        (False, False)
        >>> r["john"] = 20
        >>> r.update({"paul": 20, "tom": 10})
        >>> len(r)
        4
        >>> r.has_key("paul"), "paul" in r
        (True, True)
        >>> [r[key] for key in ["bob", "paul", "john", "tom"]]
        [30, 20, 20, 10]
        >>> r.get("nobody"), r.get("nobody", 0)
        (None, 0)
        In addition to the mapping interface, we offer rating-specific
        methods.  r.rating(key) returns the ranking of a "key" in the
        ratings, with a ranking of 0 meaning the lowest score (when two
        keys have equal scores, the keys themselves are compared, to
        "break the tie", and the lesser key gets a lower ranking):
        >>> [r.rating(key) for key in ["bob", "paul", "john", "tom"]]
        [3, 2, 1, 0]
        getValueByRating(ranking) and getKeyByRating(ranking) return the
        score and key, respectively, for a given ranking index:
        >>> [r.getValueByRating(rating) for rating in range(4)]
        [10, 20, 20, 30]
        >>> [r.getKeyByRating(rating) for rating in range(4)]
        ['tom', 'john', 'paul', 'bob']
        An important feature is that the keys() method returns keys in
        ascending order of ranking, and all other related methods return
        lists or iterators fully consistent with this ordering:
        >>> r.keys()
        ['tom', 'john', 'paul', 'bob']
        >>> [key for key in r]
        ['tom', 'john', 'paul', 'bob']
        >>> [key for key in r.iterkeys()]
        ['tom', 'john', 'paul', 'bob']
        >>> r.values()
        [10, 20, 20, 30]
        >>> [value for value in r.itervalues()]
        [10, 20, 20, 30]
        >>> r.items()
        [('tom', 10), ('john', 20), ('paul', 20), ('bob', 30)]
        >>> [item for item in r.iteritems()]
        [('tom', 10), ('john', 20), ('paul', 20), ('bob', 30)]
        An instance can be modified (adding, changing and deleting
        key-score correspondences), and every method of that instance
        reflects the instance's current state at all times:
        >>> r["tom"] = 100
        >>> r.items()
        [('john', 20), ('paul', 20), ('bob', 30), ('tom', 100)]
        >>> del r["paul"]
        >>> r.items()
        [('john', 20), ('bob', 30), ('tom', 100)]
        >>> r["paul"] = 25
        >>> r.items()
        [('john', 20), ('paul', 25), ('bob', 30), ('tom', 100)]
        >>> r.clear()
        >>> r.items()
        []
        """
    ''' the implementation carefully mixes inheritance and delegation
        to achieve reasonable performance while minimizing boilerplate,
        and, of course, to ensure semantic correctness as above.  All
        mappings' methods not implemented below get inherited, mostly
        from DictMixin, but, crucially!, __getitem__ from dict. '''
    def __init__(self, *args, **kwds):
        ''' This class gets instantiated just like 'dict' '''
        dict.__init__(self, *args, **kwds)
        # self._rating is the crucial auxiliary data structure: a list
        # of all (value, key) pairs, kept in "natural"ly-sorted order
        self._rating = [ (v, k) for k, v in dict.iteritems(self) ]
        self._rating.sort()
    def copy(self):
        ''' Provide an identical but independent copy '''
        return Ratings(self)
    def __setitem__(self, k, v):
        ''' besides delegating to dict, we maintain self._rating '''
        if k in self:
            del self._rating[self.rating(k)]
        dict.__setitem__(self, k, v)
        insort_left(self._rating, (v, k))
    def __delitem__(self, k):
        ''' besides delegating to dict, we maintain self._rating '''
        del self._rating[self.rating(k)]
        dict.__delitem__(self, k)
    ''' delegate some methods to dict explicitly to avoid getting
        DictMixin's slower (though correct) implementations instead '''
    __len__ = dict.__len__
    __contains__ = dict.__contains__
    has_key = __contains__
    ''' the key semantic connection between self._rating and the order
        of self.keys() -- DictMixin gives us all other methods 'for
        free', although we could implement them directly for slightly
        better performance. '''
    def __iter__(self):
        for v, k in self._rating:
            yield k
    iterkeys = __iter__
    def keys(self):
        return list(self)
    ''' the three ratings-related methods '''
    def rating(self, key):
        item = self[key], key
        i = bisect_left(self._rating, item)
        if item == self._rating[i]:
            return i
        raise LookupError, "item not found in rating"
    def getValueByRating(self, rating):
        return self._rating[rating][0]
    def getKeyByRating(self, rating):
        return self._rating[rating][1]
def _test():
    ''' we use doctest to test this module, which must be named
        rating.py, by validating all the examples in docstrings. '''
    import doctest, rating
    doctest.testmod(rating)
if __name__ == "__main__":
    _test()

Discussion

In many ways, a dictionary is the natural data structure for storing a correspondence between keys (e.g., names of contestants in a competition) and the current “score” of each key (e.g., the number of points a contestant has scored so far, or the highest bid made by each contestant at an auction, etc.). If we use a dictionary for such purposes, we will probably want to access it often in natural order—the order in which the keys’ scores are ascending—and we’ll also want fast access to the rankings (ratings) implied by the current scores (e.g., the contestant currently in third place, the score of the contestant who is in second place, etc.).

To achieve these purposes, this recipe subclasses dict to add the needed functionality that is completely missing from dict (methods rating, getValueByRating, getKeyByRating), and, more subtly and crucially, to modify method keys and all other related methods so that they return lists or iterators with the required order (i.e., the order in which scores are ascending; if we have to break ties when two keys have the same score, we implicitly compare the keys themselves). Most of the detailed documentation is in the docstring of the class itself—a crucial issue because by keeping the documentation and examples there, we can use module doctest from the Python Standard Library to provide unit-testing functionality, as well as ensuring that our examples are correct.

The most interesting aspect of the implementation is that it takes good care to minimize boilerplate (meaning repetitious and boring code, and therefore code where bugs are most likely to hide) without seriously impairing performance. class Ratings multiply inherits from dict and DictMixin, with the latter placed first in the list of bases, so that all methods come from the mixin, if it provides them, unless explicitly overridden in the class.

Raymond Hettinger’s DictMixin class was originally posted as a recipe to the online version of the Python Cookbook and later became part of Python 2.3’s standard library. DictMixin provides all the methods of a mapping except __init__, copy, and the four fundamental methods: __getitem__, __setitem__, __delitem__, and, last but not least, keys. If you are coding a mapping class and want to ensure that your class supports all of the many methods that a full mapping provides to application code, you should subclass DictMixin and supply at least the fundamental methods (depending on your class’ semantics—e.g., if your class has immutable instances, you need not supply the mutator methods __setitem__ and __delitem__). You may optionally implement other methods for performance purposes, overriding the implementation that DictMixin provides. The whole DictMixin architecture can be seen as an excellent example of the classic Template Method Design Pattern, applied pervasively in a useful mix-in variant.
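
As a tiny illustration of that pattern, consider this hypothetical read-only mapping (not part of the recipe, just a sketch): supply only __getitem__ and keys, and DictMixin furnishes get, items, __contains__, iteration, and the other mapping methods:

import UserDict
class ConstantMap(UserDict.DictMixin):
    ''' maps each key in a fixed collection to one constant value '''
    def __init__(self, keys, value):
        self._keys = list(keys)
        self._value = value
    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError, key
        return self._value
    def keys(self):
        return list(self._keys)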

In this recipe’s class, we inherit __getitem__ from our other base (namely, the built-in type dict), and we also delegate explicitly to dict everything we can for performance reasons. We have to code the elementary mutator methods (__setitem__ and __delitem__) because, in addition to delegating to our base class dict, we need to maintain our auxiliary data structure self._rating—a list of (score, key) pairs that we keep in sorted order with the help of standard library module bisect. We implement keys ourselves (and while we’re at it, we implement __iter__—i.e., iterkeys—as well, since clearly keys is easiest to implement by using __iter__) to exploit self._rating and return the keys in the order we need. Finally, we add the obvious implementations for __init__ and copy, in addition to the three ratings-specific methods that we supply.

The result is quite an interesting example of balancing concision, clarity, and well-advised reuse of the enormous amount of functionality that the standard Python library places at our disposal. If you use this module in your applications, profiling may reveal that a method that this recipe’s class inherits from DictMixin has somewhat unsatisfactory performance—after all, the implementations in DictMixin are, of necessity, somewhat generic. If this is the case, by all means add a direct implementation of whatever further methods you need to achieve maximum performance! For example, if your application performs a lot of looping on the result of calling r.iteritems() for some instance r of class Ratings, you may get slightly better performance by adding to the body of the class the direct implementation of the method:

    def iteritems(self):
        for v, k in self._rating:
            yield k, v

See Also

Library Reference and Python in a Nutshell documentation about class DictMixin in module UserDict, and about module bisect.

5.15. Sorting Names and Separating Them by Initials

Credit: Brett Cannon, Amos Newcombe

Problem

You want to write a directory for a group of people, and you want that directory to be grouped by the initials of their last names and sorted alphabetically.

Solution

Python 2.4’s new itertools.groupby function makes this task easy:

import itertools
def groupnames(name_iterable):
    sorted_names = sorted(name_iterable, key=_sortkeyfunc)
    name_dict = {}
    for key, group in itertools.groupby(sorted_names, _groupkeyfunc):
        name_dict[key] = tuple(group)
    return name_dict
pieces_order = { 2: (-1, 0), 3: (-1, 0, 1) }
def _sortkeyfunc(name):
    ''' name is a string with first and last names, and an optional middle
        name or initial, separated by spaces; returns a string in order
        last-first-middle, as wanted for sorting purposes. '''
    name_parts = name.split()
    return ' '.join([name_parts[n] for n in pieces_order[len(name_parts)]])
def _groupkeyfunc(name):
    ''' returns the key for grouping, i.e. the last name's initial. '''
    return name.split()[-1][0]
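
Here is a quick doctest-style session (with made-up names, purely for illustration):

>>> d = groupnames(['John Smith', 'Jane Q Public', 'Joan Smith'])
>>> d['P']
('Jane Q Public',)
>>> d['S']
('Joan Smith', 'John Smith')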

Discussion

In this recipe, name_iterable must be an iterable whose items are strings containing names in the form first - middle - last, with middle being optional and the parts separated by whitespace. The result of calling groupnames on such an iterable is a dictionary whose keys are the last names’ initials, and the corresponding values are the tuples of all names with that last name’s initial.

Auxiliary function _sortkeyfunc splits a name that’s a single string, either “first last” or “first middle last,” and reorders the parts into a list that starts with the last name, followed by the first name, plus the middle name or initial, if any, at the end. Then, the function returns this list rejoined into a string. The resulting string is the key we want to use for sorting, according to the problem statement. Python 2.4’s built-in function sorted takes just this kind of function (to call on each item to get the sort key) as the value of its optional parameter named key.

Auxiliary function _groupkeyfunc takes a name in the same form and returns the last name’s initial—the key on which, again according to the problem statement, we want to group.

This recipe’s primary function, groupnames, uses the two auxiliary functions and Python 2.4’s sorted and itertools.groupby to solve our problem, building and returning the required dictionary.

If you need to code this task in Python 2.3, you can use the same two support functions and recode function groupnames itself. In 2.3, it is more convenient to do the grouping first and the sorting separately on each group, since no groupby function is available in Python 2.3’s standard library:

def groupnames(name_iterable):
    name_dict = {}
    for name in name_iterable:
        key = _groupkeyfunc(name)
        name_dict.setdefault(key, []).append(name)
    for k, v in name_dict.iteritems():
        aux = [(_sortkeyfunc(name), name) for name in v]
        aux.sort()
        name_dict[k] = tuple([n for _, n in aux])
    return name_dict

See Also

Recipe 19.21; Library Reference (Python 2.4) docs on module itertools.
