Chapter 19. Iterators and Generators

Introduction

Credit: Raymond Hettinger

Lather, Rinse, Repeat

—Docs for my bottle of shampoo

The Iterator Protocol

After namespaces, iterators and generators emerged as the next “honking great ideas” in Python. Since their introduction in Python 2.2, they have come to pervade and unify the language. They encourage a loosely coupled programming style that is simple to write, easy to read, flexible, and extendable.

Simply put, the iterator protocol has two halves, a producer and a consumer. An iterable object says, “I know how to supply data one element at a time,” and the consumer says “please give me data one element at a time and say Stop when you’re done.”

The producer/consumer connection can appear in a number of guises. The simplest is where a function or constructor wraps around an iterable object. For example, sorted(set('simsalabim')) has the set constructor looping over the elements of the iterable string and a sorted function wrapping around the resulting iterable set object.

In addition to functions and constructors, regular Python statements can use the in operator to loop over iterable objects. for line in myfile: print line loops over lines of an iterable file object. Likewise, if token in sequence loops over elements of a sequence until it finds a match (or until it reaches the end with no matches).

Both guises of the consumer side of the iterator protocol use the protocol implicitly. In addition, an explicit form is more flexible but used less often. The iterator object is saved as a variable, it = iter(mystring). Then, the iterator’s next method is called to retrieve a data element, elem = it.next( ). Such calls are usually wrapped in try/except statements to catch the StopIteration exception that the iterator raises when the data stream is exhausted.
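For instance, here is a minimal sketch of the explicit form (the sample string and the printing are purely illustrative):

it = iter('simsalabim')
while True:
    try:
        elem = it.next()
    except StopIteration:
        break          # the data stream is exhausted
    print elem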

All of these approaches provide the full range of iterator benefits, including loose coupling and memory friendliness. The loose coupling is evident in the first example, where the independently created and maintained sorted function, set data type, and string objects were readily combined. The memory friendliness derives from the one-at-a-time structure of the iterator protocol. Programs using iterators are likely to be less resource intensive and more scalable than their list-based counterparts.

Iterators and Generators

An object that wants to be iterable should implement an __iter__ method, which returns an iterator object. Ideally, the iterator should be a distinct object from the iterable, to make it possible to have multiple iterators over the same iterable container. There are exceptions to this general recommendation: for example, a sequential file object does not readily lend itself to multiple iterations; therefore, it is more appropriate in this case to have the file object be its own iterator rather than return a separate iterator object; given any file instance f, indeed, iter(f) is f.

Any iterator object must implement a next method and an __iter__ method. The next method should raise StopIteration when the iteration is complete. Care should be taken that subsequent calls to next continue to raise StopIteration (once stopped, it stays stopped). The __iter__ method of an iterator object should always return the iterator itself (__iter__ is idempotent on iterators). This simplifies client code by allowing it to treat iterators and iterables the same way (i.e., both return an iterator in response to the iter function).

To be useful, most iterators have a stored state that enables them to return a new data element on each call. The previously described responsibilities and the need to have a stored state both suggest that classes are the way to create iterable objects. That approach is obvious, explicit, and rarely used (only two recipes in this chapter make any use of classes).
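For concreteness, here is a small illustrative class-based iterator (not taken from any recipe in this chapter) that meets the responsibilities just described:

class CountDown(object):
    ''' Illustrative iterator: counts down from n to 1, then stays stopped. '''
    def __init__(self, n):
        self.remaining = n
    def __iter__(self):
        # an iterator's __iter__ returns the iterator itself
        return self
    def next(self):
        if self.remaining <= 0:
            # once stopped, it stays stopped
            raise StopIteration
        self.remaining -= 1
        return self.remaining + 1

for x in CountDown(3): print x,
# emits: 3 2 1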

Instead of writing classes, two alternate approaches dominate. Starting with the observation that many functions and types both accept iterable inputs and return iterable outputs, an obvious approach is to link them together in a “pipes and filters” style to create new tools. For example, def uniq(seq): return sorted(set(seq)) is a way to create a new tool directly from existing functions and types. Like functional programming, the resulting code is terse, readable, trivial to debug, and often runs at the speed of compiled C code. The economy of this approach motivated the creation of an entire module of iterator building blocks, the itertools module. Indeed, many of the brilliant, effective recipes in this chapter make frequent use of itertools components.

If no combination of building blocks solves the problem, the next best approach is to write a generator. Recipe 19.1 shows how trivially easy it is to write a generator. By introducing a yield keyword, the responsibilities of creating an iterator are handled automatically. The iterator objects obtained by calling a generator are distinct, save their state, have an idempotent __iter__ method, and have a next method that raises StopIteration when complete and stays stopped if called again afterwards. Python internally takes care of all of these details. Because of generators’ compelling simplicity, most of the recipes in this chapter make use of generators.

Starting with version 2.4, Python continued its evolution toward using iterators everywhere by introducing generator expressions (genexps for short). Genexps can be likened to a memory-efficient, scalable form of list comprehensions. Simply by replacing brackets with parentheses, an expression will yield one element at a time rather than filling memory all at once. Used correctly (i.e., in a context where they are consumed immediately, one item at a time), genexps can offer remarkable clarity and economy: sum(x*x for x in xrange(10)) is a great way to express the sum of the squares of the first ten natural numbers.

Thinking Out of the Box

Paradoxically, the simpler and more general an idea, the more likely that people will find extraordinary and unexpected ways of using it. Here is a brief sampling of the ways that iterators and generators have been pushed to their outer limits.

Observing that the yield keyword has the unique capability of stopping execution, saving state, and later resuming, it is not surprising that techniques have been discovered for using generators to simulate co-routines and continuations. The core idea is to implement each routine as a generator and to have a dispatch function launch the routines in orderly succession. Whenever a task switch is needed, the routines yield back to the dispatcher, which then launches or resumes the next routine by calling its next method. Small complications are involved for startup, termination, and data sharing, but each is solvable without much ado and presents fewer challenges than equivalent thread-based solutions. See Recipe 9.8 for an example.
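As a minimal sketch of the dispatching idea (the task and dispatch names here are purely illustrative, not the code of Recipe 9.8):

def task(name, count):
    for i in xrange(count):
        print '%s: step %d' % (name, i)
        yield None                 # task switch: give control back to the dispatcher

def dispatch(routines):
    routines = list(routines)
    while routines:
        for routine in routines[:]:
            try:
                routine.next()     # launch or resume this routine
            except StopIteration:
                routines.remove(routine)   # routine terminated: drop it

dispatch([task('A', 3), task('B', 2)])
# emits the steps of A and B interleaved in round-robin order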

Observing that some tools can be both producers and consumers, it is natural to want to stack them together like pipes and filters. While that analogy can lead to useful decoupling, be aware that underlying models are different. Iterators do not run independently from start to finish; instead, an outermost layer is always in control, requesting data elements one at a time, so that nothing runs until the outer layer starts making requests.

When stacking tools together (as in the first example with sorted, set, and a string), the code takes on the appearance of a functional programming language. The resemblance is not shallow: iterators do fulfill some of the promise of lazy languages. So, it is natural to borrow some of the most successful techniques from those languages, such as Haskell and SML.

One such technique is to write innermost iterators to yield infinite streams and concentrate the control logic in an outermost driver function. For instance, in numerical programming, write a generator that yields successively better approximations to a desired result and call it from a function that stops whenever two successive approximations fall within a tolerance value. Separating the control logic from the calculation decouples the two, making them easier to write, test, and debug, and makes them more reusable in other contexts.
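For instance, a small sketch of this style for square roots by Newton's method (the function names and the tolerance value are illustrative):

def sqrt_approximations(x):
    ''' Unbounded generator of successively better approximations to sqrt(x). '''
    guess = float(x)        # illustrative: assumes x > 0
    while True:
        yield guess
        guess = (guess + x / guess) / 2.0

def within(tolerance, approximations):
    ''' Driver: stop when two successive approximations are close enough. '''
    approximations = iter(approximations)
    previous = approximations.next()
    for current in approximations:
        if abs(current - previous) < tolerance:
            return current
        previous = current

print within(1e-9, sqrt_approximations(2.0))    # prints a value close to 1.41421356...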

Odds and Ends

Here are some instructive snippets. Consider each of them carefully, study how they work, and you’ll be well on your way towards understanding how best to link iterators together to solve practical problems. Each of the following lines is independent from the others:

result = dict(enumerate(myseq))
result = set(word for line in page for word in line.split( ))
def dotproduct(v1, v2): return sum(itertools.imap(operator.mul, v1, v2))
def dotproduct(v1, v2): return sum(x*y for x,y in itertools.izip(v1, v2))
randgen = itertools.starmap(random.random, itertools.repeat(( )))
randgen = iter(random.random, -1.0)

The idea for restartable iterators surfaces every so often and then drowns in quicksand. sys.stdin is a plain example of an iterable that cannot logically be restarted unless an entire session is saved in memory. A craving for restartability should be taken as a cue that a list might well be a more appropriate data structure.
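A minimal illustration of that cue (the data is arbitrary):

stream = iter([3, 1, 4, 1, 5, 9])    # any bounded iterable (illustrative data)
data = list(stream)                   # capture the items once...
print max(data), min(data)            # ...then reuse them as often as needed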

Just because iterators cannot be restarted doesn’t mean they cannot be abandoned in mid-stream. The lazy, just-in-time style of production is a key feature of iterators. Take advantage of it. That’s why the for statement supports a break keyword, after all.

The core itertools and their derivatives (see the recipes in the itertools docs that are part of the Python Library Reference) all run at nearly the speed of compiled code. When Python 2.4 introduced a native set data type, I timed it against the pure-Python version, sets.py, and learned that some of the set logic (intersection, union, etc.) achieved only a two to one increase in speed. The reason was that sets.py used itertools, and itertools performance was exceptional. So, when performance is an issue, consider an itertools solution before turning to more labor-intensive optimizations or native language extensions.

19.1. Writing a range-like Function with Float Increments

Credit: Dinu Gherman, Paul Winkler, Stephen Levings

Problem

You need an arithmetic progression, like the built-in xrange but with float values (xrange works only on integers).

Solution

Although float arithmetic progressions are not available as built-ins, it’s easy to code a generator to supply them:

import itertools
def frange(start, end=None, inc=1.0):
    "An xrange-like generator which yields float values"
    # imitate range/xrange strange assignment of argument meanings
    if end is None:
        end = start + 0.0     # Ensure a float value for 'end'
        start = 0.0
    assert inc                # sanity check
    for i in itertools.count( ):
        next = start + i * inc
        if (inc>0.0 and next>=end) or (inc<0.0 and next<=end):
            break
        yield next

Discussion

Sadly missing in the Python Standard Library, the generator in this recipe lets you use arithmetic progressions, similarly to the built-in xrange but with float values.

Many theoretical restrictions apply, but this generator is more useful in practice than in theory. People who work with floating-point numbers all the time tell many war stories about billion-dollar projects that failed because someone did not take into consideration the strange things that modern hardware can do, at times, when comparing floating-point numbers. But for pedestrian cases, simple approaches like this recipe generally work.

This observation by no means implies that you can afford to ignore the fundamentals of numerical analysis, if your programs need to do anything at all with floating-point numbers! For example, in this recipe, we rely on a single multiplication and one addition to compute each item, to avoid accumulating error by repeated additions. Precision would suffer in a potentially dangerous way if we “simplified” the first statement in the loop body to something like:

    next += inc

as might appear very tempting, were it not for those numerical analysis considerations.
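Here is a small, hedged illustration of the difference (the exact digits may vary with your platform's floating-point hardware, but the drift is typical):

inc = 0.1
# repeated addition, the "tempting" but error-prone way
accumulated = 0.0
for i in xrange(10):
    accumulated += inc
# direct computation from the loop index, as frange does
direct = 10 * inc
print repr(accumulated), repr(direct)
print accumulated == direct       # typically False: repeated addition has drifted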

One variation you may want to consider is based on pre-computing the number of items that make up the bounded arithmetic progression:

import math
def frange1(start, end=None, inc=1.0):
    if end is None:
        end = start + 0.0     # Ensure a float value for 'end'
        start = 0.0
    nitems = int(math.ceil((end-start)/inc))
    for i in xrange(nitems):
        yield start + i * inc

This frange1 version may or may not be faster than the frange version shown in the solution; if the speed of this particular generator is crucial to your programs, it’s best to try both versions and measure resulting times. In my limited benchmarking, on most of the hardware I have at hand, frange1 does appear to be consistently faster.
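One hedged way to run such a measurement with the standard library timeit module (the module name frange_recipes below is an assumption; substitute whatever module holds your copies of the two functions):

import timeit
for name in ('frange', 'frange1'):
    timer = timeit.Timer('list(%s(0.0, 1000.0, 0.1))' % name,
                         'from frange_recipes import %s' % name)
    print name, min(timer.repeat(repeat=3, number=100))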

Talking about speed—believe it or not, looping with for i in itertools.count( ) is measurably faster than apparently obvious lower-level alternatives such as:

    i = 0
    while True:
        ...loop body unchanged...
        yield next
        i += 1

Do consider using itertools any time you want speed, and you may be in for more of these pleasant surprises.

If you work with floating-point numbers, you should definitely take a look at Numeric and other third-party extension packages that make Python such a powerful language for floating-point computations. For example, with Numeric, you could code something like:

import math, Numeric
def frange2(start, end=None, inc=1.0, typecode=None):
    if end is None:
        end = start + 0.0     # Ensure a float value for 'end'
        start = 0.0
    nitems = math.ceil((end-start)/inc)
    return Numeric.arange(nitems) * inc + start

This one is definitely going to be faster than both frange and frange1 if you need to collect all of the progression’s items into a sequence.

See Also

Documentation for the xrange built-in function, and the itertools and math modules, in the Library Reference; Numeric Python (http://www.pfdubois.com/numpy/).

19.2. Building a List from Any Iterable

Credit: Tom Good, Steve Alexander

Problem

You have an iterable object x (it might be a sequence or any other kind of object on which you can iterate, such as an iterator, a file, a dict) and need a list object y, with the same items as x and in the same order.

Solution

When you know that iterable object x is bounded (so that, e.g., a loop for item in x would surely terminate), building the list object you require is trivial:

y = list(x)

However, when you know that x is unbounded, or when you are not sure, then you must ensure termination before you call list. In particular, if you want to make a list with no more than n items from x, then standard library module itertools' function islice does exactly what you need:

import itertools
y = list(itertools.islice(x, N))

Discussion

Python’s generators, iterators, and sundry other iterables are wondrous things, as this entire chapter strives to point out. The powerful and generic concept of iterable is a great way to represent all sorts of sequences, including unbounded ones, in ways that can potentially save you huge (and even infinite!) amounts of memory. With the standard library module itertools, generators you can code yourself, and, in Python 2.4, generator expressions, you can perform many manipulations on completely general iterables.

However, once in a while, you need to build a good old-fashioned full-fledged list object from such a generic iterable. For example, building a list is the simplest way to sort or reverse the items in the iterable, and lists have many other useful methods you may want to apply. As long as you know for sure that the iterable is bounded (i.e., has a finite number of items), just call list with the iterable as the argument, as the “Solution” points out. In particular, avoid the goofiness of misusing a list comprehension such as [ i for i in x ], when list( x ) is faster, cleaner, and more readable!

Calling list won’t help if you’re dealing with an unbounded iterable. The need to ensure that some iterable x is bounded also arises in many other contexts, besides that of calling list( x ): all “accumulator” functions (sum( x ), max( x ), etc.) intrinsically need a bounded-iterable argument, and so does a statement such as for i in x (unless you have appropriate conditional breaks within the loop’s body), a test such as if i in x, and so on.

If, as is frequently the case, all you want is to ensure that no more than n items of iterable x are taken, then itertools.islice, as shown in the “Solution”, does just what you need. The islice function of the standard library itertools module offers many other possibilities (essentially equivalent to the various possibilities that slicing offers on sequences), but out of all of them, the simple “truncation” functionality (i.e., take no more than n items) is by far the most frequently used. The programming language Haskell, from which Python took many of the ideas underlying its list comprehensions and generator expression functionalities, has a built-in take function to cater to this rather frequent need, and itertools.islice is most often used as an equivalent to Haskell’s built-in take.

In some cases, you cannot specify a maximum number of items, but you are able to specify a generic condition that you know will eventually be satisfied by the items of iterable x and can terminate the proceedings. itertools.takewhile lets you deal with such cases in a very general way, since it accepts the controlling predicate as a callable argument. For example:

y = list(itertools.takewhile((11).__cmp__, x))

binds name y to a new list made up of the sequence of items in iterable x up to, but not including, the first one that equals 11. (The reason we need to code (11).__cmp__ with parentheses is a somewhat subtle one: if we wrote 11.__cmp__ without parentheses, Python would parse 11. as a floating-point literal, and the entire construct would be syntactically invalid. The parentheses are included to force the tokenization we mean, with 11 as an integer literal and the period indicating an access to its attribute, in this case, bound method __cmp__.)
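If you find that trick too subtle, a plain lambda predicate does the same job, a bit more verbosely but perhaps more readably (illustrative data shown for concreteness):

import itertools
x = [1, 6, 3, 5, 7, 11, 2, 9]
y = list(itertools.takewhile(lambda item: item != 11, x))
print y          # emits: [1, 6, 3, 5, 7]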

For the special and frequent case in which the terminating condition is the equality of an item to some given value, a useful alternative is to use the two-argument variant of the built-in function iter:

y = list(iter(iter(x).next, 11))

Here, the iter(x) call (which is innocuous if x is already an iterator) gives us an object on which we can surely access callable (bound method) next—which is necessary, because iter in its two-argument form requires a callable as its first argument. The second argument is the sentinel value, meaning the value that terminates the iteration as soon as an item equal to it appears. For example, if x were a sequence with items 1, 6, 3, 5, 7, 11, 2, 9, ..., y would now be the list [1, 6, 3, 5, 7]. (The sentinel value itself is excluded: from the beginning, included, to the end, excluded, is the normal Python convention for just about all loops, implicit or explicit.)
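To make that last example concrete (the list is just illustrative data):

x = [1, 6, 3, 5, 7, 11, 2, 9]
y = list(iter(iter(x).next, 11))
print y          # emits: [1, 6, 3, 5, 7]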

See Also

Library Reference documentation on built-ins list and iter, and module itertools.

19.3. Generating the Fibonacci Sequence

Credit: Tom Good, Leandro Mariano Lopez

Problem

You want an unbounded generator that yields, one item at a time, the entire (infinite) sequence of Fibonacci numbers.

Solution

Generators are particularly suitable for implementing infinite sequences, given their intrinsically “lazy evaluation” semantics:

def fib( ):
    ''' Unbounded generator for Fibonacci numbers '''
    x, y = 0, 1
    while True:
        yield x
        x, y = y, x + y
if __name__ == "__main__":
    import itertools
    print list(itertools.islice(fib( ), 10))
#outputs: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

Discussion

Generators make it quite easy to work with unbounded (infinite) sequences. In this recipe, we show a generator that produces all of the (infinitely many) Fibonacci numbers one after the other. (If you want the variant in which the sequence starts with 1, 1, 2, ..., rather than the one, implemented here, which starts with 0, 1, 1, ..., just interchange the two statements in the loop’s body.)

It’s worth reflecting on why a generator is so perfectly suitable for implementing an unbounded sequence and letting you work with it. Syntactically, a generator is “just” a function containing the keyword yield. When you call a generator, however, the function body does not yet execute. Rather, calling the generator gives you a special anonymous iterator object that wraps the function’s body, the function’s local variables (including arguments, which, for any function, are local variables that happen to be initialized by the caller), and the current point of execution, which is initially the start of the function.

When you call this anonymous iterator object’s next method, the function body executes up to the next yield statement. yield’s argument is returned as the result of the iterator’s next method, and the function is “frozen”, with its execution state intact. When you call next again on the same iterator object, the function “thaws” and continues from where it left off, again up to the next yield statement.
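You can watch this freeze/thaw behavior directly by driving the recipe's fib generator by hand (an illustrative sketch, continuing the Solution's code):

it = fib()         # no body execution yet: we just obtained an iterator object
print it.next()    # runs up to the first yield, emits 0
print it.next()    # thaws, runs to the next yield, emits 1
print it.next()    # emits 1
print it.next()    # emits 2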

If the function body “falls off the end”, or executes a return, the iterator object raises StopIteration to indicate the end of the sequence. But, of course, if the sequence that the generator is producing is not bounded, the iterator never raises StopIteration. That’s okay, as long as you don’t rely on such an exception as the only way to terminate a loop. In this recipe, for example, the anonymous iterator object is passed as an argument to itertools.islice: as shown in Recipe 19.2, islice is the most typical way in which an unbounded iterator is made finite (truncated at an externally imposed boundary).

The main point to retain is that it’s all right to have infinite sequences represented by generators, since generators are computed lazily (in other words, each item gets computed just in time, when required), as long as some control structure ensures that only a finite number of items are required from the generator. The answer to our curiosity as to why generators are so excellently suitable for this use is in the anonymous iterator object which a generator returns when we call it: that anonymous iterator wraps some code (the generator’s function body) and some state (the function’s local variables, and, crucially, the point at which the function’s execution is to resume) in just the way that’s most convenient for the computation of most sequences, be they bounded or unbounded.

Leonardo Pisano (meaning “from Pisa”), most often called Leonardo Bigollo (the traveler or “the good for nothing”) during his lifetime in the 12th and 13th centuries, and occasionally Leonardo Fibonacci (for his connection to the Bonacci family), must look down with considerable pride from his place in the mathematicians’ Empyrean. Although his most notable contributions were the introduction of decimal notation (Arabic numerals) in the West, and the codification of the rules for double-entry bookkeeping, these monumental achievements are not usually connected to his name. The one that is, however—from the third problem in his Liber Abaci, which he originally expressed in terms of a rabbit-raising farm—still provides interesting applications for the distant successors of the abacus, modern computers.

See Also

Recipe 19.2 shows how to make bounded iterators from unbounded (or “potentially unbounded”) ones.

19.4. Unpacking a Few Items in a Multiple Assignment

Credit: Brett Cannon, Oren Tirosh, Alex Martelli

Problem

Python’s multiple unpacking assignment is very handy when you are unpacking all the items from a sequence and binding each to a name. However, you often need to unpack (and bind to names) only some items from the “front” of a sequence, and bind another name to “the rest” of the sequence (the part you didn’t unpack).

Solution

A generator provides an elegant solution to this problem:

def peel(iterable, arg_cnt=1):
    """ Yield each of the first arg_cnt items of the iterable, then
        finally an iterator for the rest of the iterable. """
    iterator = iter(iterable)
    for num in xrange(arg_cnt):
        yield iterator.next( )
    yield iterator
if __name__ == '__main__':
    t5 = range(1, 6)
    a, b, c = peel(t5, 2)
    print a, b, list(c)
# emits: 1 2 [3, 4, 5]

Discussion

Python supports the handy idea of multiple unpacking assignment. Say that t5 is any sequence of five items. Then, you can code:

a, b, c, d, e = t5

to bind a name to each item of t5.

However, you often do not know (nor care) exactly how many items a certain sequence t holds: you want to bind (say) two names, one to each of the first two items, and a third name to “the rest” of t (this requirement does imply that t must hold at least two items). If you know that t is a “proper” sequence, with support for slicing, not just an arbitrary iterable, you can code:

a, b = t[:2]
c = t[2:]

but this is nowhere as elegant or handy as the multiple unpacking assignment. Moreover, if you are not certain about the nature of t (i.e., if t can be any iterable, not necessarily supporting slice syntax), the task becomes more cumbersome:

c = iter(t)
a = c.next( )
b = c.next( )

Given these issues, the Python Development mailing list once discussed a new syntax for generalized multiple unpacking assignment, such that:

a, b, *c = t

would perform exactly this task—bind names a and b to the first two items of t and name c to “the rest”.

I didn’t like the idea of making the Python language bigger by adding this extra functionality to assignment statements, so I came up with this recipe’s generator. This generator provides this functionality fully and without any need to add any new syntax to Python.

Just one caveat: you must make sure that you pass the arg_cnt argument properly. If you pass a wrong value for arg_cnt, or if the sequence you pass to peel is shorter than arg_cnt, you get an exception at runtime. But then, you also get a Python exception at runtime if you try to perform a multiple assignment and the number of names you have on the left of the = sign is not identical to the number of items of the sequence you have on the right. Therefore, this recipe isn’t any different from normal, multiple unpacking assignment in this respect. If you think it is important to relax some parts of this requirement, see Recipe 19.5.

See Also

Language Reference and Python in a Nutshell about multiple unpacking assignments; Recipe 19.5.

19.5. Automatically Unpacking the Needed Number of Items

Credit: Sami Hangaslammi, Peter Cogolo

Problem

You want to unpack (and bind to names) some items from the “front” of a sequence and bind another name to “the rest” of the sequence (the part you didn’t unpack). You want to obtain the number of items to unpack automatically, based on how many names are on the left of the = sign in a multiple unpacking assignment.

Solution

The previous approach in Recipe 19.4 is clean and elegant, but you have to “manually” pass the number of items to unpack. If you’re willing to stoop to a little black magic, peering into stack frames and bytecodes, you may be able to bypass that requirement:

import inspect, opcode
def how_many_unpacked( ):
    f = inspect.currentframe( ).f_back.f_back
    if ord(f.f_code.co_code[f.f_lasti]) == opcode.opmap['UNPACK_SEQUENCE']:
        return ord(f.f_code.co_code[f.f_lasti+1])
    raise ValueError, "Must be a generator on RHS of a multiple assignment!"
def unpack(iterable):
    iterator = iter(iterable)
    for num in xrange(how_many_unpacked( )-1):
        yield iterator.next( )
    yield iterator
if __name__ == '__main__':
    t5 = range(1, 6)
    a, b, c = unpack(t5)
    print a, b, list(c)

Discussion

While arguably spiffy, this recipe is a bit fragile, as you could well expect from a function relying on introspection on bytecode: while the recipe works in Python 2.3 and 2.4, any future release of Python might easily generate bytecode for a multiple unpacking assignment in a somewhat different way, and thus break the recipe.

Moreover, as presented, the recipe relies on how_many_unpacked being called specifically from a generator; if you call it from an ordinary function, it does not work, since in that case the UNPACK_SEQUENCE bytecode in the caller’s caller happens to fall at offset f.f_lasti+3 instead of f.f_lasti.

For example, the following code doesn’t work with the recipe’s Solution because enumfunc is an ordinary function, not a generator:

def enumfunc( ):
    return xrange(how_many_unpacked( ))
a, b, c, d, e = enumfunc( )

However, the following code does work:

def enumgen( ):
    for x in xrange(how_many_unpacked( )): yield x
a, b, c, d, e = enumgen( )

because enumgen is a generator.

In other words, this recipe is a hack—arguably a neat hack (to the point that one of the editors of this Cookbook successfully lobbied the “other” two and managed to obtain the recipe’s inclusion in this volume), but, nevertheless, a hack. Therefore, you probably do not want to use this approach in “production code”, meaning code that must stay around for a long time and will be maintained across future versions of Python.

Nevertheless, you could make how_many_unpacked work in both contexts by making it a little bit more complicated:

def how_many_unpacked( ):
    f = inspect.currentframe( ).f_back.f_back
    bytecode = f.f_code.co_code
    ups_code = opcode.opmap['UNPACK_SEQUENCE']
    if ord(bytecode[f.f_lasti]) == ups_code:
        return ord(bytecode[f.f_lasti+1])
    elif ord(bytecode[f.f_lasti+3]) == ups_code:
        return ord(bytecode[f.f_lasti+4])
    else:
        raise ValueError, "Must be on the RHS of a multiple assignment!"

With this more complicated variant, how_many_unpacked would work when called from either a generator or an ordinary function. However, I recommend sticking with the simpler version presented in this recipe’s Solution, and calling how_many_unpacked only from the given unpack generator, or a few other specific generators.

Even such a limited use can be considered debatable, since most Pythonistas prefer clarity and simplicity to the risky kind of “convenience” that can be obtained by such shortcuts. After all, this recipe’s only advantage, in comparison to Recipe 19.4, is that you save yourself the trouble of passing to unpack the number of items required, which is nice, but clearly, not all that crucial.

See Also

Recipe 19.4; Language Reference and Python in a Nutshell about multiple unpacking assignments; Library Reference and Python in a Nutshell about library modules inspect and opcode.

19.6. Dividing an Iterable into n Slices of Stride n

Credit: Gyro Funch, Alex Martelli

Problem

You have an iterable p and need to get the n non-overlapping extended slices of stride n, which, if the iterable was a sequence supporting extended slicing, would be p[0::n], p[1::n], and so on up to p[n-1::n].

Solution

While extended slicing would return sequences of the same type we start with, it’s much more sensible to specify a strider function that, instead, solves this problem by returning a list of lists:

def strider(p, n):
    """ Split an iterable p into a list of n sublists, repeatedly taking
        the next element of p and adding it to the next sublist.  Example:
        >>> strider('abcde', 3)
        [['a', 'd'], ['b', 'e'], ['c']]
        In other words, strider's result is equal to:
            [list(p[i::n]) for i in xrange(n)]
        if iterable p is a sequence supporting extended-slicing syntax.
    """
    # First, prepare the result, a list of n separate lists
    result = [ [  ] for x in xrange(n) ]
    # Loop over the input, appending each item to one of
    # result's lists, in "round robin" fashion
    for i, item in enumerate(p):
        result[i % n].append(item)
    return result

Discussion

The function in this recipe takes an iterable p and pulls it apart into a user-defined number n of pieces (specifically, function strider returns a list of sublists), distributing p’s items into what would be the n extended slices of stride n if p were a sequence.

If we were willing to sacrifice generality, forcing argument p to be a sequence supporting extended slicing, rather than a generic iterable, we could use a very different approach, as the docstring of strider indicates:

def strider1(p, n):
    return [list(p[i::n]) for i in xrange(n)]

Depending on our exact needs, with such a strong constraint on p, we might omit the list call to make each subsequence into a list, and/or code a generator to avoid consuming extra memory to materialize the whole list of results at once:

def strider2(p, n):
    for i in xrange(n):
        yield p[i::n]

or, equivalently:

import itertools
def strider3(p, n):
    return itertools.imap(lambda i: p[i::n], xrange(n))

or, in Python 2.4, with a generator expression:

def strider4(p, n):
    return (p[i::n] for i in xrange(n))

However, none of these alternatives accepts a generic iterable as p—each demands a full-fledged sequence.

Back to this recipe’s exact specs, the best way to enhance the recipe is to recode it to avoid low-level fiddling with indices. While doing arithmetic on indices is conceptually quite simple, it can get messy and indeed is notoriously error prone. We can do better by a generous application of module itertools from the Python Standard Library:

import itertools
def strider5(p, n):
    result = [ [  ] for x in itertools.repeat(0, n) ]
    resiter = itertools.cycle(result)
    for item, sublist in itertools.izip(p, resiter):
        sublist.append(item)
    return result

This strider5 version uses three functions from module itertools—all of the functions in module itertools return iterable objects, and, as we see in this case, their results are therefore typically used in for loops. Function repeat yields an object, repeatedly, a given number of times, and here we use it instead of the built-in function xrange to control the list comprehension that builds the initial value for result. Function cycle takes an iterable object and returns an iterator that walks over that iterable object repeatedly and cyclically—in other words, cycle performs exactly the round-robin effect that we need in this recipe. Function izip is essentially like the built-in function zip, except that it returns an iterator and thus avoids the memory-consumption overhead that zip incurs by building its whole result list in memory at once.

This version achieves deep elegance and conceptual simplicity (although you may need to gain some familiarity with itertools before you agree that this version is simple!) by foregoing all index arithmetic and leaving all of the handling of the round-robin issues to itertools.cycle. resiter, per se, is a nonterminating iterator, but the function deals effortlessly with that. Specifically, since we use resiter together with p as arguments to izip, termination is assured (assuming, of course, that p does terminate!) by the semantics of izip, which, just like built-in function zip, stops iterating as soon as any one of its arguments is exhausted.

See Also

The itertools module is part of the Python Standard Library and is documented in the Library Reference portion of Python’s online documentation; the Library Reference and Python in a Nutshell docs about the built-ins zip and xrange, and extended-form slicing of sequences.

19.7. Looping on a Sequence by Overlapping Windows

Credit: Peter Cogolo, Steven Bethard, Ian Bicking

Problem

You have an iterable s and need to make another iterable whose items are sublists (i.e., sliding windows), each of the same given length, over the items of s, with successive windows overlapping by a specified amount.

Solution

We can combine built-in function iter and function islice from the standard library module itertools to code a generator to solve our problem:

import itertools
def windows(iterable, length=2, overlap=0):
    it = iter(iterable)
    results = list(itertools.islice(it, length))
    while len(results) == length:
        yield results
        results = results[length-overlap:]
        results.extend(itertools.islice(it, length-overlap))
    if results:
        yield results
if __name__ == '__main__':
    seq = 'foobarbazer'
    for length in (3, 4):
        for overlap in (0, 1):
            print '%d %d: %s' % (length, overlap,
                    map(''.join, windows(seq, length, overlap)))

This module, when run as a main script, emits:

3 0: ['foo', 'bar', 'baz', 'er']
3 1: ['foo', 'oba', 'arb', 'baz', 'zer', 'r']
4 0: ['foob', 'arba', 'zer']
4 1: ['foob', 'barb', 'baze', 'er']

When you know you don’t need any overlap, a fast and concise alternative is available:

def chop(iterable, length=2):
    return itertools.izip(*(iter(iterable),)*length)

The body of this concise alternative may be a bit confusing until you realize that the two occurrences of the asterisk (*) there play different roles: the first one is part of a *args syntax form (passing the elements of a sequence as separate positional arguments), the second one indicates that a sequence (the singleton tuple (iter(iterable),)) must be repeated length times.

Discussion

In many cases, we need a sequence of sub-sequences of a given length, and we have to start with a “flat” iterable. For example, we can build a dictionary with given keys and values by calling dict with a sequence of two-item sequences—but what if we start with a “flat” sequence where keys and values just alternate? The function windows in this recipe meets this need:

the_dict = dict(windows(flat_alternating_keys_and_values))

Or, say we have an iterable whose items are the amounts of sales on each day. To turn it into an iterable whose items are the amounts of sales in each week (seven days):

weekly_sales = itertools.imap(sum, windows(daily_sales, 7))

The two use cases just presented are examples of how windows can be useful when called without overlap (in other words, with an overlap argument of 0, its default value), so the alternative chop function also presented in the recipe would be just as good (and faster). However, overlap is often useful when you deal with iterables that are signals, or time series. For example, if you have a function average such as:

def average(sequence):
    return sum(sequence)/float(len(sequence))

then you can apply a simple low-pass filter to a signal:

filtered = itertools.imap(average, windows(raw_signal, 5, 2))

or get the moving average daily sales from the iterable of daily sales:

mvavg_daily_sales = itertools.imap(average, windows(daily_sales, 7, 6))

The implementation of the windows generator in this recipe is quite straightforward, if you’re familiar with itertools.islice (and you should be, if you deal with iterables!). For the first “window”, we must clearly fill list results with the appropriate number of items (islice does that for us). At each subsequent step, we must throw away a certain number of items from the “front” of results (we do that conveniently by list slicing, since results is, indeed, a list) and replenish the same number at the back (we do that by calling the extend method of results, with islice providing the needed “new” items). That number, as a little reasoning shows, is exactly that given by the expression length-overlap. The loop terminates, if ever, only when results cannot be entirely replenished. (The loop never executes if results cannot even be filled entirely in the first place.)

When the loop terminates, we may be left with a “tail” in results, a “last window” that is shorter than length. What to do in that case depends on your application’s exact needs. The recipe, as given above, just yields the shorter window as the last item of the generator, which is what we’d probably want in all of the previously mentioned use cases. In other cases, we might want to drop the last, too-short window (just omit the last two statements in function windows as given in the recipe), raise an exception (when we know that such a situation should never occur), or pad the last window to the required length with a pad value such as None, by changing the last two statements in function windows to something like:

    if results:
        results.extend(itertools.repeat(None, length-len(results)))
        yield results

One important implementation detail: function windows, as coded in the recipe, yields a new list object at each step. It takes some time to generate all of these objects. In some cases, it may be convenient to the caller to know that each object it gets is a separate and independent list. Such knowledge enables the caller to store or modify the objects it gets, without having to make explicit copies. However, none of the use cases we discussed gets any benefit from this feature. So, you could optimize, by yielding the same list object every time. If you want that optimization, just change the statement:

        results = results[length-overlap:]

into:

        del results[:length-overlap]

If you’re applying this optimization, and you’re using Python 2.4, you should also consider using the new type collections.deque instead of list. In order to do so, you need to add the statement:

import collections

at the start of your code, change the only occurrence of list in the recipe into collections.deque, and further change the updating of results to avoid slicing, using, instead:

        for i in xrange(length-overlap): results.popleft( )

If your windows are long, and if they overlap substantially, using deque instead of list might give you better performance, since deque is optimized to support adding and removing items at either end, while lists, being compact arrays in memory, are inherently slow (specifically, O(n) for a list of length n) when you add or remove items at the beginning.
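Putting those suggested changes together, a hedged sketch of the deque-based variant (Python 2.4 only; the name windows_deque is arbitrary, and, as with the del optimization, each yielded object is the same container, mutated in place) might look like this:

import itertools, collections
def windows_deque(iterable, length=2, overlap=0):
    it = iter(iterable)
    step = length - overlap
    results = collections.deque(itertools.islice(it, length))
    while len(results) == length:
        yield results
        for i in xrange(step):
            results.popleft()                        # drop items from the front...
        results.extend(itertools.islice(it, step))   # ...and replenish at the back
    if results:
        yield results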

When you want windows of some length n that overlap specifically by n-1 items, function itertools.tee, new in Python 2.4, offers an elegant alternative approach. Say that you want to look at each item of the iterable, with access to a few neighboring items and some padding at both ends, so that you get just as many windows as there are items in the iterable. In Python 2.4, you could then code:

import itertools as IT
def windowed(iterable, pre=1, post=1, padding=None):
    # tee-off one iterator for each index in the window
    copies = IT.tee(iterable, pre + 1 + post)
    pre_copies, copy, post_copies = copies[:pre], copies[pre], copies[pre+1:]
    # iterators before the element have their start filled in with the
    # padding value.  no need to slice off the ends, izip will do that.
    pre_copies = [IT.chain(IT.repeat(padding, pre - i), itr)
                  for i, itr in enumerate(pre_copies)]
    # iterators after the element have their starts sliced off and their
    # end filled in with the padding value, endlessly repeated.
    post_copies = [IT.chain(IT.islice(itr, i + 1, None), IT.repeat(padding))
                   for i, itr in enumerate(post_copies)]
    # zip the elements with their preceding and following elements
    return IT.izip(*(pre_copies + [copy] + post_copies))

For example:

>>> print list(windowed(xrange(4), 1, 2, 'x'))
[('x', 0, 1, 2), (0, 1, 2, 3), (1, 2, 3, 'x'), (2, 3, 'x', 'x')]

If you use Python 2.4 and want this flavor of “sliding windows” over the iterable, with specified “padding” at both ends, you might prefer this windowed function to the recipe’s windows generator.

See Also

Library Reference documentation on built-in iter and module itertools.

19.8. Looping Through Multiple Iterables in Parallel

Credit: Andy McKay, Hamish Lawson, Corey Coughlin

Problem

You need to loop through every item of multiple iterables in parallel, meaning that you first want to get a tuple with all of the first items of each iterable, next, a tuple with all of the “second items”, and so forth.

Solution

Say you have two iterables (lists, in this case) such as:

a = ['a1', 'a2', 'a3']
b = ['b1', 'b2']

If you want to loop “in parallel” over them, the most general and effective approach is:

import itertools
for x, y in itertools.izip(a, b): 
    print x, y

This snippet outputs two lines:

a1 b1
a2 b2

Discussion

The most general and effective way to loop “in parallel” over multiple iterables is to use function izip of standard library module itertools, as shown in the “Solution”. The built-in function zip is an alternative that is almost as good:

for x, y in zip(a, b):
    print x, y

However, zip has one downside that can hurt your performance if you’re dealing with long sequences: it builds the list of tuples in memory all at once (preparing and returning a list), while you need only one tuple at a time for pure looping purposes.

Both zip and itertools.izip, when you iterate in parallel over iterables of different lengths, stop as soon as the “shortest” such iterable is exhausted. This approach to termination is normally what you want. For example, it lets you have one or more non-terminating iterables in the zipping, as long as at least one of the iterables does terminate—or (in the case of izip only) as long as you use some control structure, such as a conditional break within a for statement, to ensure you always require a finite number of items and do not loop endlessly.

In some cases, when iterating in parallel over iterables of different lengths, you may want shorter iterables to be conceptually “padded” with None up to the length of the longest iterable in the zipping. For this special need, you can use the built-in function map with a first argument of None:

for x, y in map(None, a, b):
    print x, y

map, like zip, builds and returns a whole list. If that is a problem, you can reproduce map’s pad-with-None behavior by coding your own generator. Coding your own generator is also a good approach when you need to pad shorter iterables with some value that is different from None.

If you need to deal only with specifically two sequences, your iterator’s code can be quite straightforward and linear:

import itertools
def par_two(a, b, padding_item=None):
    a, b = iter(a), iter(b)
    # first, deal with both iterables via izip until one is exhausted:
    for x in itertools.izip(a, b):
        yield x
    # only one of the following two loops, at most, will execute, since
    # either a or b (or both!) are exhausted at this point:
    for x in a:
        yield x, padding_item
    for x in b:
        yield padding_item, x

Alternatively, you can code a more general function, one that is able to deal with any number of sequences:

import itertools
def par_loop(padding_item, *sequences):
    iterators = map(iter, sequences)
    num_remaining = len(iterators)
    result = [padding_item] * num_remaining
    while num_remaining:
        for i, it in enumerate(iterators):
            try: 
                 result[i] = it.next( )
            except StopIteration:
                 iterators[i] = itertools.repeat(padding_item)
                 num_remaining -= 1
                 result[i] = padding_item
        if num_remaining:
            yield tuple(result)

Here’s an example of use for generator par_loop:

print map(''.join, par_loop('x', 'foo', 'zapper', 'ui'))
# emits: ['fzu', 'oai', 'opx', 'xpx', 'xex', 'xrx']

Both par_two and par_loop start by calling the built-in function iter on all of their arguments and thereafter use the resulting iterators. This is important, because the functions rely on the state that these iterators maintain. The key idea in par_loop is to keep count of the number of iterators as yet unexhausted, and replace each exhausted iterator with a nonterminating iterator that yields the padding_item ceaselessly; num_remaining counts unexhausted iterators, and both the yield statement and the continuation of the while loop are conditional on some iterators being as yet unexhausted.

Alternatively, if you know in advance which iterable is the longest one, you can wrap every other iterable x as itertools.chain(iter( x ), itertools.repeat(padding)) and then call itertools.izip. You can’t do this wrapping on all iterables because the resulting iterators are nonterminating—if you izip iterators that are all nonterminating, izip itself cannot terminate! Here, for example, is a version that works as intended only when the longest (but terminating!) iterable is the very first one:

import itertools
def par_longest_first(padding_item, *sequences):
    iterators = map(iter, sequences)
    for i, it in enumerate(iterators):
        if not i: continue
        iterators[i] = itertools.chain(it, itertools.repeat(padding_item))
    return itertools.izip(*iterators)

See Also

The itertools module is part of the Python Standard Library and is documented in the Library Reference portion of Python’s online documentation; the Library Reference and Python in a Nutshell docs about built-ins zip, iter, and map.

19.9. Looping Through the Cross-Product of Multiple Iterables

Credit: Attila Vàsàrhelyi, Raymond Hettinger, Steven Taschuk

Problem

You need to loop through every item of multiple iterables cross-productwise, meaning that you first want to get the first item of the first iterable paired with all the others, next, the second item of the first iterable paired with all the others, and so forth.

Solution

Say you have two iterables (lists, in this case) such as:

a = ['a1', 'a2', 'a3']
b = ['b1', 'b2']

If you want to loop over their cross-product, the simplest approach is often just a couple of nested for loops:

for x in a:
    for y in b:
        print x, y

This snippet’s output is six lines:

a1 b1
a1 b2
a2 b1
a2 b2
a3 b1
a3 b2

However, in many cases, you’d rather get all items in the “cross-product” of multiple iterables as a single, linear sequence, suitable for using in a single for or for passing onwards to other sequence manipulation functions, such as those supplied by itertools. For such needs, you may put the nested fors in a list comprehension:

for x, y in [(x,y) for x in a for y in b]:
    print x, y

Discussion

A list comprehension lets you easily generate (as a single, linear sequence) all the pairings of several iterables (also known as the cross-product, product set, or Cartesian product of these iterables). However, the number of items in such a cross-product is the arithmetic product (multiplication) of the lengths of all the iterables involved, a number that may easily get quite large. A list comprehension, by definition, builds the entire list at once, which means that it may consume substantial amounts of memory. Also, you get to start iterating only when the whole cross-product list is entirely built.

Python 2.4 offers one obvious way to solve this problem: the newly introduced construct of generator expressions:

for x, y in ((x,y) for x in a for y in b): print x, y

A generator expression looks just like a list comprehension, except that it uses parentheses rather than brackets: it returns an iterator, suitable for looping on, rather than building and returning a list. Thus, a generator expression can save substantial amounts of memory, if you are iterating over a very long sequence. Also, you start executing the loop’s body very soon, since each successive element gets generated iteratively, before each iteration of the loop’s body. If your loop’s body contains conditional breaks, so that execution terminates as soon as some conditions are met, using a generator expression rather than a list comprehension can mean a potentially substantial improvement in performance.

If you need to support Python 2.3, and yet you want to achieve the kind of advantages that generator expressions can afford over list comprehensions, the best approach may be to code your own generator. This is quite simple if you only need to deal with a known number of sequences, such as two:

def cross_two(a, b):
    for x in a:
        for y in b:
            yield x, y

Dealing with an arbitrary number of sequences is a bit more complicated, but not terribly so, particularly if we use recursion to help:

def cross_loop(*sequences):
    if sequences:
        for x in sequences[0]:
            for y in cross_loop(*sequences[1:]):
                yield (x,) + y
    else:
        yield ( )

We can also do it without recursion. It’s not hard if we’re willing to build the entire result list in memory at once before returning it, just as a list comprehension would:

def cross_list(*sequences):
    result = [[  ]]
    for seq in sequences:
        result = [sublist+[item] for sublist in result for item in seq]
    return result

Alternatively, you can return map(tuple, result) if you need to ensure that each item of the sequence you return is a tuple, not a list.

Recursion-free iterative (incremental) generation of the “cross-product” sequence is also feasible, even though it’s nowhere as simple as either the recursive or the nonincremental versions:

def cross(*sequences):
    # visualize an odometer, with "wheels" displaying "digits"...:
    wheels = map(iter, sequences)
    digits = [it.next( ) for it in wheels]
    while True:
        yield tuple(digits)
        for i in range(len(digits)-1, -1, -1):
            try:
                digits[i] = wheels[i].next( )
                break
            except StopIteration:
                wheels[i] = iter(sequences[i])
                digits[i] = wheels[i].next( )
        else:
            break

In Python 2.4, you might express the for statement more clearly as for i in reversed(range(len(digits))).

To repeat, it is important to remember that all of these solutions should be considered only if you do have the problem—that is, if and only if you do need to view all items in the “cross-product” of multiple iterables as a single, linear sequence. Many cases have no such requirement, and simply coding multiple nested for loops inline is quite acceptable, simpler, and more readable. In many cases, getting all items in the “cross-product” as a single sequence is preferable, so it’s worth knowing how to do that. However, do keep in mind that simplicity is an important virtue, and do not lose sight of it in pursuit of a cool (but complicated) solution. All the cool tools, constructs, and library modules that Python offers exist strictly to serve you, to let you build and maintain your applications with minimal effort. Don’t go out of your way to use the new shiny tools if you can solve your application’s problems with less effort in simpler ways!

See Also

The Library Reference and Python in a Nutshell docs about built-ins iter, enumerate, map, and (Python 2.4 only) reversed; the Language Reference and Python in a Nutshell docs about list comprehensions and (Python 2.4 only) generator expressions.

19.10. Reading a Text File by Paragraphs

Credit: Alex Martelli, Magnus Lie Hetland, Terry Reedy

Problem

You need to read a text file (or any other iterable whose items are lines of text) paragraph by paragraph, where a “paragraph” is defined as a sequence of nonwhite lines (i.e., paragraphs are separated by lines made up exclusively of whitespace).

Solution

A generator is quite suitable for bunching up lines this way:

def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    paragraph = [  ]
    for line in lines:
        if is_separator(line):
            if paragraph:
                yield joiner(paragraph)
                paragraph = [  ]
        else:
            paragraph.append(line)
    if paragraph:
        yield joiner(paragraph)
if __name__ == '__main__':
    text = 'a first\nparagraph\n\nand a\nsecond one\n\n'
    for p in paragraphs(text.splitlines(True)): print repr(p)

Discussion

Python doesn’t directly support paragraph-oriented file reading, but it’s not hard to add such functionality. We define a “paragraph” as the string formed by joining a nonempty sequence of nonseparator lines, separated from any adjoining paragraphs by nonempty sequences of separator lines. A separator line is one that satisfies the predicate passed in as argument is_separator. (A predicate is a function whose result is taken as a logical truth value, and we say a predicate is satisfied when the predicate returns a result that is true.) By default, a line is a separator if it is made up entirely of whitespace characters (e.g., space, tab, newline, etc.).

The recipe’s code is quite straightforward. The state of the generator during iteration is entirely held in local variable paragraph, a list to which we append the nonseparator lines that make up the current paragraph. Whenever we meet a separator in the body of the for statement, we test if paragraph to check whether the list is currently empty. If the list is empty, we’re already skipping a run of separators and need do nothing special to handle the current separator line. If the list is not empty, we’ve just met a separator line that terminates the current paragraph, so we must join up the list, yield the resulting paragraph string, and then set the list back to empty.

This recipe implements a special case of sequence adaptation by bunching: an underlying iterable is “bunched up” into another iterable with “bigger” items. Python’s generators let you express sequence adaptation tasks very directly and linearly. By passing as arguments, with reasonable default values, the is_separator predicate, and the joiner callable that determines what happens to each “bigger item” when we’re done bunching it up, we achieve a satisfactory amount of generality without any extra complexity. To see this, consider a snippet such as:

import operator
numbers = [1, 2, 3, 0, 0, 6, 5, 3, 0, 12]
bunch_up = paragraphs
for s in bunch_up(numbers, operator.not_, sum): print 'S', s
for l in bunch_up(numbers, bool, len): print 'L', l

In this snippet, we use the paragraphs generator (under the name of bunch_up, which is clearer in this context) to get the sums of “runs” of nonzero numbers separated by runs of zeros, then the lengths of the runs of zeros—applications that, at first sight, might appear to be quite different from the recipe’s stated purpose. That’s the magic of abstraction: when appropriately and tastefully applied, it can easily turn the solution of a problem into a family of solutions for many other apparently unrelated problems.
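
For concreteness, here is the output you should expect from that snippet (hand-traced from the code above, not quoted from the original text): the first loop sums each run of nonzero numbers, the second measures the length of each run of zeros.

S 6
S 14
S 12
L 2
L 1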

An elementary issue, but a crucial one for getting good performance in the “main” use case of this recipe, is that the paragraphs generator builds up each resulting paragraph as a list of strings, then concatenates all strings in the list with ''.join to obtain each result it yields. The alternative approach, where a large string is built up by repeated application of += or +, is never the right one in Python: it is both slow and clumsy. Good Pythonic style absolutely demands that we use a list as the intermediate accumulator, whenever we are building a long string by concatenating a number of smaller ones. Python 2.4 has diminished the performance penalty of the wrong approach. For example, to join a list of 52 one-character strings into a 52-character string on my machine, Python 2.3 takes 14.2 microseconds with the right approach, 73.6 with the wrong one; but Python 2.4 takes 12.7 microseconds with the right approach, 41.6 with the wrong one, so the penalty in this case has decreased from over five times to over three. Nevertheless, there is no reason to choose to pay such a performance penalty without any returns, even the lower penalty that Python 2.4 manages to extract!

Python 2.4 offers a new itertools.groupby function that is quite suitable for sequence-bunching tasks. Using it, we could express the paragraphs generator in a really tight and concise way:

from itertools import groupby
def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    for separator_group, lineiter in groupby(lines, key=is_separator):
        if not separator_group:
            yield joiner(lineiter)

itertools.groupby, like SQL’s GROUP BY clause, which inspired it, is not exactly trivial to use, but it can be quite useful indeed for sequence-bunching tasks once you have mastered it thoroughly.
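
If you are new to groupby, a tiny experiment of my own (not part of the original recipe) may help show what it yields: each maximal run of items sharing the same key value, paired with an iterator over that run.

from itertools import groupby
sample = ['a\n', 'b\n', ' \n', '\n', 'c\n']
for is_sep, group in groupby(sample, key=str.isspace):
    print is_sep, list(group)
# emits:
# False ['a\n', 'b\n']
# True [' \n', '\n']
# False ['c\n']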

See Also

Recipe 19.11; Chapter 1 for general issues about handling text; Chapter 2 for general issues about handling files; Recipe 19.21; Library Reference documentation on Python 2.4’s itertools.groupby.

19.11. Reading Lines with Continuation Characters

Credit: Alex Martelli

Problem

You have a file that includes long logical lines split over two or more physical lines, with backslashes to indicate that a continuation line follows. You want to process a sequence of logical lines, “rejoining” those split lines.

Solution

As usual, our first idea for a problem involving sequences should be a generator:

def logical_lines(physical_lines, joiner=''.join):
    logical_line = [  ]
    for line in physical_lines:
        stripped = line.rstrip( )
        if stripped.endswith('\\'):
            # a line which continues w/the next physical line
            logical_line.append(stripped[:-1])
        else:
            # a line which does not continue, end of logical line
            logical_line.append(line)
            yield joiner(logical_line)
            logical_line = [  ]
    if logical_line:
        # end of sequence implies end of last logical line
        yield joiner(logical_line)
if __name__ == '__main__':
    text = 'some\\\n', 'lines\\\n', 'get\n', 'joined\\\n', 'up\n'
    for line in text:
        print 'P:', repr(line)
    for line in logical_lines(text, ' '.join):
        print 'L:', repr(line)

When run as a main script, this code emits:

P: 'some\\\n'
P: 'lines\\\n'
P: 'get\n'
P: 'joined\\\n'
P: 'up\n'
L: 'some lines get\n'
L: 'joined up\n'

Discussion

This problem is about sequence-bunching, just like the previous Recipe 19.10. It is therefore not surprising that this recipe, like the previous, is a generator (with an internal structure quite similar to the one in the “other” recipe): today, in Python, sequences are often processed most simply and effectively by means of generators.

In this recipe, the generator can encompass just a small amount of generality without introducing extra complexity. Determining whether a line is a continuation line, and how to proceed when it is, is slightly too idiosyncratic to generalize in a simple and transparent way. I have therefore chosen to code that functionality inline, in the body of the logical_lines generator, rather than “factoring it out” into separate callables. Remember, generality is good, but simplicity is even more important. However, I have kept the simple and transparent generality obtained by passing the joiner function as an argument, and the snippet of code under the if __name__ == '__main__' test demonstrates how we may want to use that generality, for example, to join continuation lines with a space rather than with an empty string.

If you are certain that the file you’re processing is sufficiently small to fit comfortably in your computer’s memory, with room to spare for processing, and you don’t need the feature (offered in the version of logical_lines shown in the “Solution”) of ignoring whitespace to the right of a terminating \, a solution using a plain function rather than a generator is simpler than the one shown in this recipe’s Solution:

def logical_lines(physical_lines, joiner=''.join, separator=''):
    return joiner(physical_lines).replace('\\\n', separator).splitlines(True)

In this variant, we join all of the physical lines into one long string, then we replace the “canceled” line ends (line ends immediately preceded by a backslash) with nothing (or any other separator we’re requested to use), and finally split the resulting long string back into lines (keeping the line ends—that’s what the True argument to method splitlines is for). This approach is a very different one from that suggested in this recipe but possibly worthwhile, if physical_lines is small enough that you can afford the memory for it. I prefer the Solution’s approach because giving semantic significance to trailing whitespace is a poor user interface design choice.
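
As a quick sanity check of this variant (my own example, with made-up physical lines), note how the canceled line end disappears while ordinary line ends survive:

physical = ['alpha \\\n', 'beta\n', 'gamma\n']
print logical_lines(physical)
# emits ['alpha beta\n', 'gamma\n']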

See Also

Recipe 19.10; Perl Cookbook recipe 8.1; Chapter 1 for general issues about handling text; Chapter 2 for general issues about handling files.

19.12. Iterating on a Stream of Data Blocks as a Stream of Lines

Credit: Scott David Daniels, Peter Cogolo

Problem

You want to loop over all lines of a stream, but the stream arrives as a sequence of data blocks of arbitrary size (e.g., from a network socket).

Solution

We need to code a generator that gets blocks and yields lines:

def ilines(source_iterable, eol='\r\n', out_eol='\n'):
    tail = ''
    for block in source_iterable:
        pieces = (tail+block).split(eol)
        tail = pieces.pop( )
        for line in pieces:
            yield line + out_eol
    if tail:
        yield tail
if __name__ == '__main__':
    s = 'one\r\ntwo\r\n,\r\nthree,four,five\r\n,six,\r\nseven\r\nlast'.split(',')
    for line in ilines(s): print repr(line)

When run as a main script, this code emits:

'one\n'
'two\n'
'\n'
'threefourfive\n'
'six\n'
'seven\n'
'last'

Discussion

Many data sources produce their data in fits and starts—sockets, RSS feeds, the results of expanding compressed text, and (at its heart) most I/O. The data often doesn’t arrive at convenient boundaries, but you nevertheless want to consume it in logical units. For text, the logical units are often lines.

This recipe shows generator ilines, a simple way to consume a source_iterable, which yields blocks of data, producing an iterator that yields lines of text instead. ilines is vastly simplified by assuming that lines are separated, on input, by a known end-of-line (EOL) string—by default '\r\n', which is the standard EOL marker in most Internet protocols. ilines' implementation is further simplified by taking a high-level approach, relying on the split method of Python’s string types to do most of the work. This basically leaves ilines with the single task of “buffering” data between successive input blocks, on all occasions when a line starts in one block and ends in a following one (including those occasions in which block boundaries “split” an EOL marker).

ilines easily accomplishes its buffering task through its local variable tail, which starts empty and, at each leg of the loop, holds that which followed the latest EOL marker seen so far. When tail+block ends with an EOL marker, the expression (tail+block).split(eol) produces a list whose last item is an empty string (''), exactly what we need; otherwise, the last item of the list is that which followed the last EOL, which again is exactly what we need.

Python’s built-in file objects are even more powerful than ilines, since they support a universal newlines reading mode (mode 'U'), which is able to recognize and deal with all common EOL markers (even when different markers are mixed within the same stream!). However, ilines is more flexible, since you may apply it in many situations where you have a stream of arbitrary blocks of text and want to process it as a stream of lines, with a known EOL marker.
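
For instance, here is one plausible way (a sketch under my own assumptions; the file name and block size are made up) to apply ilines to a file read in fixed-size blocks, the same shape of input you would get from a socket:

def blocks(f, size=8192):
    # read a file-like object as a sequence of fixed-size data blocks
    while True:
        data = f.read(size)
        if not data:
            break
        yield data

f = open('mailbox.dat', 'rb')
for line in ilines(blocks(f)):
    print repr(line)
f.close( )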

See Also

Library Reference and Python in a Nutshell docs about built-in file objects; Chapter 2 for general issues about handling files.

19.13. Fetching Large Record Sets from a Database with a Generator

Credit: Christopher Prinos

Problem

You want to fetch a result set from a database (using the Python DB API) and easily iterate over each record in the result set. However, you don’t want to use the DB cursor’s method fetchall: it could consume a lot of memory and would force you to wait until the whole result set comes back before you can start iterating.

Solution

A generator is the ideal solution to this problem:

def fetchsome(cursor, arraysize=1000):
    ''' A generator that simplifies the use of fetchmany '''
    while True:
        results = cursor.fetchmany(arraysize)
        if not results: break
        for result in results:
            yield result

Discussion

In applications that use the Python DB API, you often see code that goes somewhat like (where cursor is a DB API cursor object):

cursor.execute('select * from HUGE_TABLE')
for result in cursor.fetchall( ):
    doSomethingWith(result)

This simple approach is “just” fine, as long as fetchall returns a small result set, but it does not work very well if the query result is very large. A large result set can take a long time to return. Also, cursor.fetchall( ) needs to allocate enough memory to store the entire result set in memory at once. Further, with this simple approach, the doSomethingWith function isn’t going to get called until the entire query’s result finishes coming over from the database to our program.

An alternative approach is to rely on the cursor.fetchone method:

for result in iter(cursor.fetchone, None):
    doSomethingWith(result)

However, this alternative approach does not allow the database to optimize the fetching process: most databases can exhibit better efficiency when returning multiple records for a single query, rather than returning records one at a time as fetchone requires.

To let your applications obtain greater efficiency than fetchone allows, without the risks of unbounded memory consumption and delay connected to the use of fetchall, Python’s DB API’s cursors also have a fetchmany method. However, the direct use of fetchmany makes your iterations somewhat more complicated than the simple for statements such as those just shown. For example:

while True:
    results = cursor.fetchmany(1000)
    if not results: break
    for result in results:
        doSomethingWith(result)

Python’s generators are a great way to encapsulate complicated iteration logic so that application code can just about always loop with simple for statements. With this recipe’s fetchsome generator, you get the same efficiencies and safeguards as with the native use of the fetchmany method in the preceding snippet but with the same crystal-clear simplicity as in the snippets that used either fetchall or fetchone, namely:

for result in fetchsome(cursor):
    doSomethingWith(result)

By default, fetchsome fetches up to 1,000 records at a time, but you can change that number, depending on your requirements. Optimal values can depend on the schema, the database type, and the choice of Python DB API module. In general, you’re best advised to experiment with a few different values in your specific settings if you need to optimize this specific aspect. (Such experimentation is often a good idea for any optimization task.)

This recipe is clearly an example of a more general case: a subsequence unbuncher generator that you can use when you have a sequence of subsequences (each subsequence being obtained through some call, and the end of the whole sequence being indicated by an empty subsequence) and want to flatten it into a simple, linear sequence of items. You can think of this unbunching task as the reverse of the sequence-bunching tasks covered earlier in Recipe 19.10 and Recipe 19.11, or as a simpler variant of the sequence-flattening task covered in Recipe 4.6. A generator for unbunching might be:

def unbunch(next_subseq, *args):
    ''' un-bunch a sequence of subsequences into a linear sequence '''
    while True:
        subseq = next_subseq(*args)
        if not subseq: break
        for item in subseq:
            yield item

As you can see, the structure of unbunch is basically identical to that of the recipe’s fetchsome. Usage would also be just about the same:

for result in unbunch(cursor.fetchmany, 1000):
    doSomethingWith(result)

However, while it is important and instructive to consider this kind of generalization, when you’re writing applications you’re often better off using specific generators that directly deal with your application’s specific needs. In this case, for example, calling fetchsome(cursor) is more obvious and direct than calling unbunch(cursor.fetchmany, 1000), and fetchsome usefully hides the usage of fetchmany as well as the specific choice of 1,000 as the subsequence size to fetch at each step.

See Also

Recipe 19.10; Recipe 19.11; Recipe 4.6; Python’s DB API is covered in Chapter 7 and in Python in a Nutshell.

19.14. Merging Sorted Sequences

Credit: Sébastien Keim, Raymond Hettinger, Danny Yoo

Problem

You have several sorted sequences (iterables) and need to iterate on the overall sorted sequence that results from “merging” these sequences.

Solution

A generator is clearly the right tool for the job, in the general case (i.e., when you might not have enough memory to comfortably hold all the sequences). Implementing the generator is made easy by the standard library module heapq, which supplies functions to implement the “heap” approach to priority queues:

import heapq
def merge(*subsequences):
    # prepare a priority queue whose items are pairs of the form
    # (current-value, iterator), one each per (non-empty) subsequence
    heap = [  ]
    for subseq in subsequences:
        iterator = iter(subseq)
        for current_value in iterator:
            # subseq is not empty, therefore add this subseq's pair
            # (current-value, iterator) to the list
            heap.append((current_value, iterator))
            break
    # make the priority queue into a heap
    heapq.heapify(heap)
    while heap:
        # get and yield lowest current value (and corresponding iterator)
        current_value, iterator = heap[0]
        yield current_value
        for current_value in iterator:
            # subseq is not finished, therefore add this subseq's pair
            # (current-value, iterator) back into the priority queue
            heapq.heapreplace(heap, (current_value, iterator))
            break
        else:
            # subseq has been exhausted, therefore remove it from the queue
            heapq.heappop(heap)

Discussion

The need for “merging” sorted subsequences into a larger sorted sequence is reasonably frequent. If the amount of data is small enough to fit entirely in memory without problems, then the best approach is to build a list by concatenating all subsequences, then sort the list:

def smallmerge(*subsequences):
    result = [  ]
    for subseq in subsequences: result.extend(subseq)
    result.sort( )
    return result

The sort method of list objects is based on a sophisticated natural merge algorithm, able to take advantage of existing sorted subsequences in the list you’re sorting; therefore, this approach is quite fast, as well as simple (and general, since this approach’s correctness does not depend on all subsequences being already sorted). If you can choose this approach, it has many other advantages. For example, smallmerge works fine even if one of the subsequences isn’t perfectly sorted to start with; and in Python 2.4, you may add a generic **kwds argument to smallmerge and pass it right along to the result.sort( ) step, to achieve the flexibility afforded in that version by the cmp=, key=, and reverse= arguments to list’s sort method.

However, you sometimes deal with large sequences, which might not comfortably fit in memory all at the same time (e.g., your sequences might come from files on disk, or be computed on the fly, item by item, by other generators). When this happens, this recipe’s generator will enable you to perform your sequence merging while consuming a very moderate amount of extra memory (dependent only on the number of subsequences, not on the number of items in the subsequences).
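
A quick check of the generator’s behavior on small in-memory lists (my own example; in real use the subsequences would typically be files or other generators):

print list(merge([1, 4, 9], [2, 3, 10], [5]))
# emits [1, 2, 3, 4, 5, 9, 10]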

The recipe’s implementation uses a classic sequence-merging algorithm based on a priority queue, which, in turn, lets it take advantage of the useful heapq module in the Python Standard Library. heapq offers functions to implement a priority queue through the data structure known as a heap.

A heap is any list H such that, for any valid index 0<=i<len(H), H[i]<=H[2*i+1], and H[i]<=H[2*i+2] (if 2*i+1 and 2*i+2 are also valid indices into H). This heap property is fast to establish on an arbitrary list (function heapify does that) and very fast to re-establish after altering or removing the smallest item (and functions heapreplace and heappop do that). The smallest item is always H[0] (it’s easy to see that the “heap” property implies this), and being able to find the smallest item instantly makes heaps an excellent implementation of priority queues.
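
If you have not used heapq before, this tiny hand-traced example (mine, not the recipe’s) shows the three primitives the recipe relies on:

import heapq
H = [5, 1, 4, 2]
heapq.heapify(H)          # rearrange H in place so the heap property holds
print H[0]                # emits 1, the smallest item
heapq.heapreplace(H, 7)   # pop the smallest item and push 7, in one step
print H[0]                # emits 2, the new smallest item
heapq.heappop(H)          # remove and return the smallest item
print len(H)              # emits 3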

In this recipe, we use as items in the “heap” a “pair” (i.e., a two-item tuple) for each subsequence that is not yet exhausted (i.e., each subsequence through which we have not yet fully iterated). As its first item, each pair has the “current item” in the corresponding subsequence and, as its second item, an iterator over that subsequence. At each iteration step, we yield the smallest “current item”, then we advance the corresponding iterator and re-establish the “heap” property; when an iterator is exhausted, we remove the corresponding pair from the “heap” (so that, clearly, we’re finished when the “heap” is emptied). Note the idiom that we use to advance an iterator by one step, dealing with the possibility that the iterator is exhausted:

for current_value in iterator:
    # if we get here the iterator was not empty, current_value was
    # its first value, and the iterator has been advanced one step
    #   ...use pair (current_value, iterator)...
    # we break at once as we only wanted the first item of iterator
    break
else:
    # if we get here the break did not execute, so the iterator
    # was empty (exhausted)
    #   ...deal with the case of iterator being exhausted...

We use this idiom twice in the recipe, although in the first of the two uses we do not need the else clause since we can simply ignore iterators that are immediately exhausted (they correspond to empty subsequences, which can be ignored for merging purposes).

If you find this idiom confusing or tricky (because it uses a for statement whose body immediately breaks—i.e., a statement that looks like a loop but is not really a loop because it never executes more than once!), you may prefer a different approach:

try:
    current_value = iterator.next( )
except StopIteration:
    # if we get here the iterator was empty (exhausted)
    #   deal with the case of iterator being exhausted...
else:
    # if we get here the iterator was not empty, current_value was
    # its first value, and the iterator has been advanced one step
    #  use pair (current_value, iterator)...

I slightly prefer the idiom using for; in my view, it gains in clarity by putting the normal case (i.e., an unexhausted iterator) first and the rare case (an exhausted iterator) later. A variant of the try/except idiom that has the same property is:

try:
    current_value = iterator.next( )
    # if we get here the iterator was not empty, current_value was
    # its first value, and the iterator has been advanced one step
    #   ...use pair (current_value, iterator)...
except StopIteration:
    # if we get here the iterator was empty (exhausted)
    #   ...deal with the case of iterator being exhausted...

However, I somewhat dislike this variant (even though it’s quite OK for the two specific uses of this recipe) because it crucially depends on the code indicated as "use pair" never raising a StopIteration exception. As a general principle, it’s best to use a try clause’s body that is as small as possible—just the smallest fragment that you do expect to possibly raise the exception you’re catching in the following handlers (except clauses), not the follow-on code that must execute only if the exception was not raised. The follow-on code goes in the else clause of the try statement, in properly defensive Pythonic coding style. In any case, as long as you are fully aware of the tradeoffs in clarity and defensiveness between these three roughly equivalent idioms, you’re welcome to develop your own distinctive Pythonic style and, in particular, to choose freely among them!

If you do choose either of the idioms that explicitly call iterator.next( ), a further “refinement” (i.e., a tiny optimization) is to keep as the second item of each pair, rather than the iterator object, the bound-method iterator.next directly, ready for you to call. This optimization is not really tricky at all (it is quite common in Python to stash away bound methods and other such callables), but it may nevertheless result in code of somewhat lower readability. Once again, the choice is up to you!
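
Schematically, that refinement might look like the following fragments (my own illustration of the idea, using the try/except idiom; only the relevant lines are shown):

it_next = iter(subseq).next          # stash the bound method once
try:
    current_value = it_next( )
except StopIteration:
    pass                             # empty subsequence, nothing to add
else:
    heap.append((current_value, it_next))
# ...later, advancing that subsequence is just:  current_value = it_next( )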

See Also

Chapter 5 for general issues about sorting and Recipe 5.7 and Recipe 5.8 about heapq specifically; Library Reference and Python in a Nutshell documentation on module heapq and lists’ sort method; Robert Sedgewick, Algorithms (Addison-Wesley) (heaps are covered starting on p. 178 in the 2d edition); heapq.py in the Python sources contains an interesting discussion of heaps.

19.15. Generating Permutations, Combinations, and Selections

Credit: Ulrich Hoffmann, Guy Argo, Danny Yoo, Carl Bray, Doug Zongker, Gagan Saksena, Robin Houston, Michael Davies

Problem

You need to iterate on the permutations, combinations, or selections of a sequence. The fundamental rules of combinatorial arithmetic indicate that the lengths of these derived sequences are very large even if the starting sequence is of moderate size: for example, there are over 6 billion permutations of a sequence of length 13. So you definitely do not want to compute (and keep in memory) all items in a derived sequence before you start iterating.

Solution

Generators enable you to compute needed objects one by one as you iterate on them. The loop inevitably takes a long time if there are vast numbers of such objects and you really need to examine each one. But at least you do not waste memory storing all of them at once:

def _combinators(_handle, items, n):
    ''' factored-out common structure of all following combinators '''
    if n==0:
        yield [  ]
        return
    for i, item in enumerate(items):
        this_one = [ item ]
        for cc in _combinators(_handle, _handle(items, i), n-1):
            yield this_one + cc
def combinations(items, n):
    ''' take n distinct items, order matters '''
    def skipIthItem(items, i):
        return items[:i] + items[i+1:]
    return _combinators(skipIthItem, items, n)
def uniqueCombinations(items, n):
    ''' take n distinct items, order is irrelevant '''
    def afterIthItem(items, i):
        return items[i+1:]
    return _combinators(afterIthItem, items, n)
def selections(items, n):
    ''' take n (not necessarily distinct) items, order matters '''
    def keepAllItems(items, i):
        return items
    return _combinators(keepAllItems, items, n)
def permutations(items):
    ''' take all items, order matters '''
    return combinations(items, len(items))
if __name__ == "__main__":
    print "Permutations of 'bar'"
    print map(''.join, permutations('bar'))
# emits ['bar', 'bra', 'abr', 'arb', 'rba', 'rab']
    print "Combinations of 2 letters from 'bar'"
    print map(''.join, combinations('bar', 2))
# emits ['ba', 'br', 'ab', 'ar', 'rb', 'ra']
    print "Unique Combinations of 2 letters from 'bar'"
    print map(''.join, uniqueCombinations('bar', 2))
# emits ['ba', 'br', 'ar']
    print "Selections of 2 letters from 'bar'"
    print map(''.join, selections('bar', 2))
# emits ['bb', 'ba', 'br', 'ab', 'aa', 'ar', 'rb', 'ra', 'rr']

Discussion

The generators in this recipe accept any sequence as the items argument and always yield lists of length n, where n is the second argument to the generator (permutations accepts only one argument, and n is by definition equal to len(items)).

You can modify the recipe so the generators yield tuples (or instances of another sequence type), instead of lists, by changing two lines of code in _combinators. The yield [ ] must become yield ( ) (more generally, this statement must yield the empty sequence of whatever sequence type you wish to use), and name this_one must be bound to a singleton sequence of that same type. For example, to yield tuples, change the statement that assigns to name this_one into:

    this_one = items[i],

(A subtle, often-forgotten point of Python syntax is that the comma identifies the right side of the assignment as a tuple. Placing parentheses around the right-hand side would be both insufficient and superfluous.)

Another way to modify this recipe is to have the generators yield sequences of the same type as argument items. (As long as this type is indeed a sequence: specifically, it must support slicing, as well as the use of the plus sign, +, for concatenation). If that is what you want, change the yield of the empty sequence into:

    yield items[:0]

and change the assignment to name this_one into:

    this_one = items[i:i+1]

The definition of distinct items for this recipe’s purposes is: “items that occur at different indices in the input sequence.” If your input sequence has duplicates (i.e., the same item occurring at multiple indices), none of the functions in this recipe will care about removing them: rather, all functions will treat the duplicates as “distinct items” for all purposes.
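
For example (my own tiny check), the two 'a' characters in 'aab' occur at different indices, so they count as distinct items:

print map(''.join, uniqueCombinations('aab', 2))
# emits ['aa', 'ab', 'ab']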

See Also

Recipe 19.16 for another combinatorics building block; Recipe 18.1 and Recipe 18.2.

19.16. Generating the Partitions of an Integer

Credit: David Eppstein, Jan Van lent, George Yoshida

Problem

You want to generate all partitions of a given positive integer, that is, all the ways in which that integer can be represented as a sum of positive integers (for example, the partitions of 4 are 1+1+1+1, 1+1+2, 2+2, 1+3, and 4).

Solution

A recursive generator offers the simplest approach for this task, as is often the case with combinatorial computations:

def partitions(n):
    # base case of the recursion: zero is the sum of the empty tuple
    if n == 0:
        yield ( )
        return
    # modify the partitions of n-1 to form the partitions of n
    for p in partitions(n-1):
        yield (1,) + p
        if p and (len(p) < 2 or p[1] > p[0]):
            yield (p[0] + 1,) + p[1:]

Discussion

Partitions, like permutations, combinations and selections, are among the most basic primitives of combinatorial arithmetic. In other words, such constructs, besides being useful on their own, are building blocks for generating other kinds of combinatorial objects.

This recipe works along classic recursive lines. If you have a partition of a positive integer n, you can reduce it to a partition of n-1 in a canonical way by subtracting one from the smallest item in the partition. For example, you can build partitions of 5 from partitions of 6 by such transformation steps as 1+2+3 => 2+3, 2+4 => 1+4, and so forth. The algorithm in this recipe reverses the process: for each partition p of n-1, the algorithm finds the partitions of n that would be reduced to p by this canonical transformation step. Therefore, each partition p of n is output exactly once, at the step when we are considering the partition p1 of n-1 to which p canonically reduces.
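
A quick check against the example in this recipe’s Problem statement (hand-traced from the code above):

print list(partitions(4))
# emits [(1, 1, 1, 1), (1, 1, 2), (2, 2), (1, 3), (4,)]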

Be warned: the number of partitions of n grows fast when n itself grows. Ramanujan’s upper bound for the number of partitions of a positive integer k is:

    int(exp(pi*sqrt(2.0*k/3.0))/(4.0*k*sqrt(3.0)))

(where names exp, pi and sqrt are all taken from module math, in Python terms). For example, the number 200 has about 4,100 billion partitions.
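
Spelled out as a small function (my own phrasing of the formula just given):

from math import exp, pi, sqrt
def ramanujan_bound(k):
    ''' upper bound on the number of partitions of the positive integer k '''
    return int(exp(pi*sqrt(2.0*k/3.0)) / (4.0*k*sqrt(3.0)))
print ramanujan_bound(200)     # emits a number around four thousand billion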

This recipe generates each partition as a tuple of integers in ascending order. If it’s handier for your application to deal with partitions as tuples of integers in descending order, you need only change the body of the for loop in the recipe to:

    yield p + (1,)
    if p and (len(p) < 2 or p[-2] > p[-1]):
        yield p[:-1] + (p[-1] + 1,)

Creating a new tuple per item in the output stream, as this recipe does, may result in performance issues, if you’re dealing with a very large n. One way to optimize this aspect would be to return lists instead of tuples, and specifically to return the same list object at each step (with the descending-order modification, and append and pop operations rather than list concatenation):

def partfast(n):
    # base case of the recursion: zero is the sum of the empty tuple
    if n == 0:
        yield [  ]
        return
    # modify the partitions of n-1 to form the partitions of n
    for p in partfast(n-1):
        p.append(1)
        yield p
        p.pop( )
        if p and (len(p) < 2 or p[-2] > p[-1]):
            p[-1] += 1
            yield p

This optimization is not worth the bother—not so much because of the modest extra complication in partfast’s own code, but mostly because yielding the same list object at each step means that code using partfast must take precautions. For example, list(partfast(4)) is a potentially surprising list of five empty sublists, while list(partitions(4)) is exactly the expected list of the five partitions of the number 4.

On the “other” hand, a different approach using an auxiliary parameter can actually produce a simplification for the descending-order case:

def partitions_descending(num, lt=None):
    if lt is None: lt = num
    if not num: yield ( )
    for i in xrange(min(num, lt), 0, -1):
        for parts in partitions_descending(num-i, i):
            yield (i,) + parts

This code is simpler than the variant given in the recipe and could be made even clearer in Python 2.4 by changing its outer loop into:

    for i in reversed(xrange(1, min(num, lt)+1)):

See Also

Recipe 19.15 for more combinatorics building blocks.

19.17. Duplicating an Iterator

Credit: Heiko Wundram, Raymond Hettinger

Problem

You have an iterator (or other iterable) object x, and need to iterate twice over x’s sequence of values.

Solution

In Python 2.4, solving this problem is the job of function tee in the standard library module itertools:

import itertools
x1, x2 = itertools.tee(x)
# you can now iterate on x1 and x2 separately

In Python 2.3, you can code tee yourself:

import itertools
def tee(iterable):
    def yield_with_cache(next, cache={  }):
        pop = cache.pop
        for i in itertools.count( ):
            try:
                yield pop(i)
            except KeyError:
                cache[i] = next( )
                yield cache[i]
    it = iter(iterable)
    return yield_with_cache(it.next), yield_with_cache(it.next)

Discussion

The need to iterate repeatedly over the same sequence of values is a reasonably common one. If you know that the sequence comes from a list, or some other container that intrinsically lets you iterate over its items repeatedly, then you simply perform the iteration twice. However, sometimes your sequence may come from a generator, a sequential file (which might, e.g., wrap a stream of data coming from a network socket—data that you can read only once), or some other iterator that is not intrinsically re-iterable. Even then, in some cases, the best approach is the simplest one—first save the data into a list in memory, then repeatedly iterate over that list:

saved_x = list(x)
for item in saved_x: do_something(item)
for item in saved_x: do_something_else(item)

The simple approach of first saving all data from the iterator into a list is not feasible for an infinite sequence x, and may not be optimal if x is very large and your separate iterations over it never get far out-of-step from each other. In these cases, the tee function shown in this recipe can help. For example, say that the items of x are either numbers or operators (the latter being represented as strings such as '+', '*', etc.). Whenever you encounter an operator, you must output the result of applying that operator to all numbers immediately preceding it (since the last operator). Using tee, you could code:

def is_operator(item):
    return isinstance(item, str)
def operate(x):
    x1, x2 = tee(iter(x))
    while True:
        for item in x1:
            if is_operator(item): break
        else:
            # we get here when there are no more operators in the input
            # stream, thus the operate function is entirely done
            return
        if item == '+':
            total = 0
            for item in x2:
                if is_operator(item): break
                total += item
            yield total
        elif item == '*':
            total = 1
            for item in x2:
                if is_operator(item): break
                total *= item
            yield total
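
A quick hand-traced check of operate on a small made-up input:

print list(operate([1, 2, 3, '+', 4, 5, '*']))
# emits [6, 20]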

This kind of “look-ahead” usage is pretty typical of many of the common use cases of tee. Even in this case, you might choose the alternative approach of accumulating items in a list:

def operate_with_auxiliary_list(x):
    aux = [  ]
    for item in x:
        if is_operator(item):
            if item == '+':
                yield sum(aux)
            elif item == '*':
                total = 1
                for item in aux:
                    total *= item
                yield total
            aux = [  ]
        else:
            aux.append(item)

Having tee available lets you freely choose between these different styles of look-ahead processing.

Function itertools.tee as implemented in Python 2.4 is faster and more general than the pure Python version given in this recipe for version 2.3 usage. However, the pure Python version is quite instructive and deserves study for the sake of the techniques it demonstrates, even if you’re lucky enough to be using Python 2.4 and therefore don’t need to use this pure Python version of tee.

In the pure Python version of tee, the nested generator yield_with_cache makes use of the fact (which some consider a “wart” in Python but is actually quite useful) that the default values of arguments get computed just once, at the time the def statement executes. Thus, both calls to the nested generator in the return statement of tee implicitly share the same initially empty dict as the value of the cache argument.
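
Here is a minimal demonstration of that mechanism (my own, separate from the recipe): each execution of a def statement produces one default dict, shared by every later call of that particular function object.

def make_counter( ):
    def counter(cache={  }):        # the default dict is created right now,
        # once per execution of this def statement
        cache['n'] = cache.get('n', 0) + 1
        return cache['n']
    return counter
c = make_counter( )
print c( ), c( ), c( )              # emits 1 2 3: the same dict each time
d = make_counter( )                 # re-executes the def: a brand-new dict
print d( )                          # emits 1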

itertools.count returns non-negative integers, 0 and up, one at a time. yield_with_cache uses each of these integers as a key into the cache dictionary. The call to pop(i) (the argument of the yield statement in the try clause) simultaneously returns and removes the entry corresponding to key i, if that entry was present—that is, in this case, if the “other” instance of the generator had already reached that point in the iteration (and cached the item for our benefit). Otherwise, the except clause executes, computes the item (by calling the object bound to name next, which in this case is the next bound method of an iterator over the iterable object, which tee is duplicating), and caches the item (for the “other” instance’s future benefit) before yielding it.

So, in practice, cache is being used as a FIFO queue. Indeed, were it not for the fact that we don’t need a pure-Python tee in Python 2.4, we could code an equivalent implementation of it in Python 2.4 using the new type deque in standard library module collections:

import collections
def tee_just_an_example(iterable):
    def yield_with_cache(it, cache, other_cache):
        while True:
            if cache:
                # an item the other twin already fetched and saved for us
                yield cache.popleft( )
            else:
                # fetch a new item, save it for the other twin, then yield it
                result = it.next( )
                other_cache.append(result)
                yield result
    it = iter(iterable)
    cache1, cache2 = collections.deque( ), collections.deque( )
    return (yield_with_cache(it, cache1, cache2),
            yield_with_cache(it, cache2, cache1))

This latest version is meant purely as an illustrative example, and therefore, it’s simplified by not using any of the bound-method extraction idioms shown in the version in the “Solution” (which is intended for “production” use in Python 2.3).

Once you’ve called tee on an iterator, you should no longer use the original iterator anywhere else; otherwise, the iterator could advance without the knowledge of the tee-generated objects, and those objects would then “get out of sync” with the original. Be warned that tee requires auxiliary storage that is proportional to how much the two tee-generated objects get “apart” from each other in their separate iterations. In general, if one iterator is going to walk over most or all of the data from the original before the “other” one starts advancing, you should consider using list instead of tee. Both of these caveats apply to the itertools.tee function of Python 2.4 just as well as they apply to the pure Python versions of tee presented in this recipe. One more caveat: again both for the versions in this recipe, and the itertools.tee function in Python 2.4, there is no guarantee of thread safety: to access the tee’d iterators from different threads, you need to guard those iterators with a single lock!

See Also

The itertools module is part of the Python Standard Library and is documented in the Library Reference portion of Python’s online documentation; Recipe 19.2 shows how to turn an iterator into a list.

19.18. Looking Ahead into an Iterator

Credit: Steven Bethard, Peter Otten

Problem

You are using an iterator for some task such as parsing, which requires you to be able to “look ahead” at the next item the iterator is going to yield, without disturbing the iterator state.

Solution

The best solution is to wrap your original iterator into a suitable class, such as the following one (Python 2.4-only):

import collections
class peekable(object):
    """ An iterator that supports a peek operation.  Example usage:
    >>> p = peekable(range(4))
    >>> p.peek( )
    0
    >>> p.next(1)
    [0]
    >>> p.peek(3)
    [1, 2, 3]
    >>> p.next(2)
    [1, 2]
    >>> p.peek(2)
    Traceback (most recent call last):
      ...
    StopIteration
    >>> p.peek(1)
    [3]
    >>> p.next(2)
    Traceback (most recent call last):
      ...
    StopIteration
    >>> p.next( )
    3
    """
    def __init__(self, iterable):
        self._iterable = iter(iterable)
        self._cache = collections.deque( )
    def __iter__(self):
        return self
    def _fillcache(self, n):
        if n is None:
            n = 1
        while len(self._cache) < n:
            self._cache.append(self._iterable.next( ))
    def next(self, n=None):
        self._fillcache(n)
        if n is None:
            result = self._cache.popleft( )
        else:
            result = [self._cache.popleft( ) for i in range(n)]
        return result
    def peek(self, n=None):
        self._fillcache(n)
        if n is None:
            result = self._cache[0]
        else:
            result = [self._cache[i] for i in range(n)]
        return result

Discussion

Many iterator-related tasks, such as parsing, require the ability to “peek ahead” (once or a few times) into the sequence of items that an iterator is yielding, in a way that does not alter the iterator’s observable state. One approach is to use the new Python 2.4 function itertools.tee to get two independent copies of the iterator, one to be advanced for peeking purposes and the “other” one to be used as the “main” iterator. It’s actually handier to wrap the incoming iterator once and for all, at the start, with the class peekable presented in this recipe; afterwards, a peek method, which is safe and effective, can be counted on. A little added sweetener is the ability to call peek (and, as long as we’re at it, the standard next method too) with a specific number argument n, to request a list of the next n items of the iterator (without disturbing the iterator’s state when you call peek(n), with iterator state advancement when you call next(n)—just like for normal calls without arguments to the same methods).

The obvious idea used in this recipe for implementing peekable is to have it keep a cache of items already fetched from the underlying iterator but not yet consumed. Since the cache must grow at the tail and get consumed from the front, a natural choice is to make the cache a collections.deque, a new type introduced in Python 2.4. However, if you need this code to run under version 2.3 as well, make self._cache a list instead—you only need to change method next’s body a little bit, making it:

        if n is None:
            result = self._cache.pop(0)
        else:
            result, self._cache = self._cache[:n], self._cache[n:]

As long as you’re caching only one or just a few items of lookahead at a time, performance won’t suffer much by making self._cache a list rather than a deque.

An interesting characteristic of the peekable class presented in this recipe is that, if you request too many items from the iterator, you get a StopIteration exception but that does not throw away the last few values of the iterator. For example, if p is an instance of peekable with just three items left, when you call p.next(5), you get a StopIteration exception. You can later call p.next(3) and get the list of the last three items.

A subtle point is that the n argument to methods peek and next defaults to None, not to 1. This gives you two distinct ways to peek at a single item: the default way, calling p.peek( ), just gives you that item, while calling p.peek(1) gives you a list with that single item in it. This behavior is quite consistent with the way p.peek behaves when called with different arguments: any call p.peek(n) with any non-negative integer n returns a list with n items (or raises StopIteration if p has fewer than n items left). This approach even supports calls such as p.next(0), which in practice always returns an empty list [ ] without advancing the iterator’s state. Typically, you just call p.peek( ), without arguments, and get one look-ahead item without problems.

As an implementation detail, note that the docstring of the class peekable presented in this recipe is essentially made up of examples of use with expected results. Besides being faster to write, and arguably to read for an experienced Pythonista, this style of docstring is perfect for use with the Python Standard Library module doctest.

See Also

collections.deque and doctest in the Python Library Reference (for Python 2.4).

19.19. Simplifying Queue-Consumer Threads

Credit: Jimmy Retzlaff, Paul Moore

Problem

You want to code a consumer thread which gets work requests off a queue one at a time, processes each work request, and eventually stops, and you want to code it in the simplest possible way.

Solution

This task is an excellent use case for the good old Sentinel idiom. The producer thread, when it’s done putting actual work requests on the queue, must finally put a sentinel value, that is, a value that is different from any possible work request. Schematically, the producer thread will do something like:

for input_item in stuff_needing_work:
    work_request = make_work_request(input_item)
    queue.put(work_request)
queue.put(sentinel)

where sentinel must be a “well-known value”, different from any work_request object that might be put on the queue in the first phase.

The consumer thread can then exploit the built-in function iter:

for work_request in iter(queue.get, sentinel):
    process_work_request(work_request)
cleanup_and_terminate( )

Discussion

Were it not for built-in function iter, the consumer thread would have to use a slightly less simple and elegant structure, such as:

while True:
    work_request = queue.get( )
    if work_request == sentinel:
        break
    process_work_request(work_request)
cleanup_and_terminate( )

However, the Sentinel idiom is so useful and important that Python directly supports it with built-in function iter. When you call iter with just one argument, that argument must be an iterable object, and iter returns an iterator for it. But when you call iter with two arguments, the first one must be a callable which can be called without arguments, and the second one is an arbitrary value known as the sentinel. In the two-argument case, iter repeatedly calls the first argument. As long as each call returns a value !=sentinel, that value becomes an item in the iteration; as soon as a call returns a value ==sentinel, the iteration stops.

If you had to code this yourself as a generator, you could write:

def iter_sentinel(a_callable, the_sentinel):
    while True:
        item = a_callable( )
        if item == the_sentinel: break
        yield item

But the point of this recipe is that you don’t have to code even this simple generator: just use the power that Python gives you as part of the functionality of the built-in function iter!
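
As another everyday application of two-argument iter (my own example; the file name is made up), you can read a binary file in fixed-size chunks until read returns the empty string, which here plays the role of the sentinel:

f = open('some_big_file.bin', 'rb')
for chunk in iter(lambda: f.read(8192), ''):
    print len(chunk)        # or do any real processing of each chunk
f.close( )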

Incidentally, Python offers many ways to make sentinel values—meaning values that compare equal only to themselves. The simplest and most direct way, and therefore the one I suggest you always use for this specific purpose, is:

sentinel = object( )

See Also

Documentation for iter in the Library Reference and Python in a Nutshell.

19.20. Running an Iterator in Another Thread

Credit: Garth Kidd

Problem

You want to run the code of a generator (or any other iterator) in its own separate thread, so that the iterator’s code won’t block your main thread even if it contains time-consuming operations, such as blocking calls to the operating system.

Solution

This task is best tackled by wrapping a subclass of threading.Thread around the iterator:

import sys, threading, Queue
class SpawnedGenerator(threading.Thread):
    def __init__(self, iterable, queueSize=0):
        threading.Thread.__init__(self)
        self.iterable = iterable
        self.queueSize = queueSize
    def stop(self):
        "Ask iteration to stop as soon as feasible"
        self.stopRequested = True
    def run(self):
        "Thread.start runs this code in another, new thread"
        put = self.queue.put
        try:
            next = iter(self.iterable).next
            while True:
                # report each result, propagate StopIteration
                put((False, next( )))
                if self.stopRequested:
                    raise StopIteration
        except:
            # report any exception back to main thread and finish
            put((True, sys.exc_info( )))
    def execute(self):
        "Yield the results that the 'other', new thread is obtaining"
        self.queue = Queue.Queue(self.queueSize)
        get = self.queue.get
        self.stopRequested = False
        self.start( )              # executes self.run( ) in other thread
        while True:
            iterationDone, item = get( )
            if iterationDone: break
            yield item
        # propagate any exception (unless it's just a StopIteration)
        exc_type, exc_value, traceback = item
        if exc_type is not StopIteration:
            raise exc_type, exc_value, traceback
    def __iter__(self):
        "Return an iterator for our executed self"
        return iter(self.execute( ))

Discussion

Generators (and other iterators) are a great way to package the logic that controls an iteration and obtains the next value to feed into a loop’s body. The code of a generator (and, equivalently, the code of the next method of another kind of iterator) usually runs in the same thread as the code that’s iterating on it. The “calling” code can therefore block, each time around the loop, while waiting for the generator’s code to do its job.

Sometimes, you want to use a generator (or other kind of iterator) in a “non-blocking” way, which means you need to arrange things so that the generator’s body runs in a new, separate thread. This recipe shows a class which supplies exactly this kind of functionality: this recipe’s SpawnedGenerator class subclasses threading.Thread and uses Thread’s start/run mechanism to ensure the generator’s body always executes in a separate thread from that of the calling code.

All communication between the two threads occurs through a single instance of the Queue.Queue class; each of the two communicating methods (the generator named execute, which runs in the calling thread, and the method named run, which runs in a separate thread) holds the relevant bound method of that queue, put or get, in a local variable. The “calling” code may also call method stop on the SpawnedGenerator instance to ask for the iteration to stop as soon as feasible. Optionally, you may also specify a queue size when you instantiate SpawnedGenerator, if you want to limit how far ahead of the calling thread the spawned thread can get.

The main use case for this recipe is for wrapping iterators that make blocking calls to the operating system (e.g., walking a directory tree), when you need to use such iterators in an application where the “main” thread cannot be allowed to block for a long time. The typical examples of applications whose main thread must not block are event-driven applications, a description that applies to applications with a GUI, as well as to networking applications built on asynchronous frameworks, such as Twisted or the asyncore module of the Python Standard Library.
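
For instance, a directory-walking generator built on os.walk makes blocking calls into the operating system, so it is a natural candidate for spawning (the starting path is made up, and this usage example is mine, not part of the original recipe):

import os
def all_directories(top):
    # a generator that blocks on OS calls while walking a directory tree
    for dirpath, dirnames, filenames in os.walk(top):
        yield dirpath
spawned = SpawnedGenerator(all_directories('/usr/share'), queueSize=100)
for dirpath in spawned:
    print dirpath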

See Also

Library Reference and Python in a Nutshell docs about modules threading and asyncore; Twisted is at http://www.twistedmatrix.com/; Chapter 9 for general issues about threading; Chapter 11 for general issues about user interfaces; Chapter 13 and Chapter 14 for general issues about network and web programming, including asynchronous approaches to such programs.

19.21. Computing a Summary Report with itertools.groupby

Credit: Paul Moore, Raymond Hettinger

Problem

You have a list of data grouped by a key value, typically read from a spreadsheet or the like, and want to generate a summary of that information for reporting purposes.

Solution

The itertools.groupby function introduced in Python 2.4 helps with this task:

from itertools import groupby
from operator import itemgetter
def summary(data, key=itemgetter(0), field=itemgetter(1)):
    """ Summarise the given data (a sequence of rows), grouped by the
        given key (default: the first item of each row), giving totals
        of the given field (default: the second item of each row).
        The key and field arguments should be functions which, given a
        data record, return the relevant value.
    """
    for k, group in groupby(data, key):
        yield k, sum(field(row) for row in group)
if __name__ == "__main__":
    # Example: given a sequence of sales data for city within region,
    # _sorted on region_, produce a sales report by region
    sales = [('Scotland', 'Edinburgh', 20000),
             ('Scotland', 'Glasgow', 12500),
             ('Wales', 'Cardiff', 29700),
             ('Wales', 'Bangor', 12800),
             ('England', 'London', 90000),
             ('England', 'Manchester', 45600),
             ('England', 'Liverpool', 29700)]
    for region, total in summary(sales, field=itemgetter(2)):
        print "%10s: %d" % (region, total)

Discussion

In many situations, data is available in tabular form, with the information naturally grouped by a subset of the data values (e.g., recordsets obtained from database queries and data read from spreadsheets—typically with the csv module of the Python Standard Library). It is often useful to be able to produce summaries of the detail data.

The new groupby function (added in Python 2.4 to the itertools module of the Python Standard Library) is designed exactly for the purpose of handling such grouped data. It takes as arguments an iterator, whose items are to be thought of as records, along with a function to extract the key value from each record. itertools.groupby yields each distinct key from the iterator in turn, each along with a new iterator that runs through the data values associated with that key.

The groupby function is often used to generate summary totals for a dataset. The summary function defined in this recipe shows one simple way of doing this. For a summary report, two extraction functions are required: one function to extract the key, which is the function that you pass to the groupby function, and another function to extract the values to be summarized. The recipe uses another innovation of Python 2.4 for these purposes: the operator.itemgetter higher-order function. Called with an index i as its argument, itemgetter produces a function f such that f(x) extracts the i-th item from x, operating just like the indexing x[i].
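
Concretely (my own two-line check), itemgetter applied to one of the rows from the example gives:

from operator import itemgetter
row = ('Scotland', 'Edinburgh', 20000)
print itemgetter(0)(row)     # emits Scotland
print itemgetter(2)(row)     # emits 20000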

The input records must be sorted by the given key; if you’re uncertain about that condition, you can use groupby(sorted(data, key=key), key) to ensure it, exploiting the built-in function sorted, also new in Python 2.4. It’s quite convenient that the same key-extraction function can be passed to both sorted and groupby in this idiom. The groupby function itself does not sort its input, which gains extra flexibility that may come in handy—although most of the time you will want to use groupby only on sorted data. See Recipe 19.10 for a case in which it’s quite handy to use groupby on nonsorted data.

For example, if the sales data were in a CSV file sales.csv, the usage example in the recipe’s if __name__ == '__main__' section might become:

    import csv
    sales = sorted(csv.reader(open('sales.csv', 'rb')),
                   key=itemgetter(0))
    for region, total in summary(sales, field=itemgetter(2)):
        print "%10s: %d" % (region, total)

Overall, this recipe provides a vivid illustration of how the new Python 2.4 features work well together: in addition to the groupby function, the operator.itemgetter used to provide field extraction functions, and the potential use of the built-in function sorted, the recipe also uses a generator expression as the argument to the sum built-in function. If you need to implement this recipe’s functionality in Python 2.3, you can start by implementing your own approximate version of groupby, for example as follows:

class groupby(dict):
    def __init__(self, seq, key):
        for value in seq:
            k = key(value)
            self.setdefault(k, [  ]).append(value)
    __iter__ = dict.iteritems

This version doesn’t include all the features of Python 2.4’s groupby, but it’s very simple and may be sufficient for your purposes. Similarly, you can write your own simplified versions of functions itemgetter and sorted, such as:

def itemgetter(i):
    def getter(x): return x[i]
    return getter
def sorted(seq, key):
    aux = [(key(x), i, x) for i, x in enumerate(seq)]
    aux.sort( )
    return [x for k, i, x in aux]

As for the generator expression, you can simply use a list comprehension in its place—just call sum([field(row) for row in group]) where the recipe has the same call without the additional square brackets, [ ]. Each of these substitutions will cost a little performance, but, overall, you can build the same functionality in Python 2.3 as you can in version 2.4—the latter is just slicker, simpler, faster, neater!

See Also

itertools.groupby, operator.itemgetter, sorted, and csv in the Library Reference (for Python 2.4).



[1] The Python Development mailing list is the list on which all discussion regarding the development of Python itself is held; see http://mail.python.org/pipermail/python-dev/2002-November/030380.html for this specific subject.
