8
The Itertools Module

Functional programming emphasizes stateless objects. In Python, this leads us to work with generator expressions, generator functions, and iterables, instead of large, mutable collection objects. In this chapter, we’ll look at elements of the itertools library. This library has numerous functions to help us work with iterable sequences of objects, as well as collection objects.

We introduced iterator functions in Chapter 3, Functions, Iterators, and Generators. In this chapter, we’ll expand on that superficial introduction. We used some related functions in Chapter 5, Higher-Order Functions.

There are a large number of iterator functions in the itertools module. We’ll examine the combinatoric functions in the next chapter. In this chapter, we’ll look at the following three broad groupings of the remaining iterator functions:

  • Functions that work with potentially infinite iterators. These can be applied to any iterable or an iterator over any collection. For example, the enumerate() function doesn’t require an upper bound on the number of items in the iterable.

  • Functions that work with finite iterators. Often, these are used to create a reduction of the source. For example, grouping the items produced by an iterator reduces the source to groups of items with a common key.

  • The tee() iterator function clones an iterator into several copies that can each be used independently. This provides a way to overcome the primary limitation of Python iterators: they can be used only once. This is memory-intensive, however, and redesign is often required.

We need to emphasize an important limitation of iterators that we’ve touched on in other places: they can be used only once.

Iterators can be used only once.

This can be astonishing because no exception is raised by an attempt to reuse an iterator that’s been fully consumed. Once exhausted, an iterator appears to have no elements; it raises the StopIteration exception every time it’s used.
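We can see this behavior in a short interactive session:

>>> it = iter([1, 2, 3])
>>> list(it)
[1, 2, 3]
>>> list(it)
[]

The second list() call silently produces an empty list: the exhausted iterator raises StopIteration internally, which list() treats as the end of the data.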

There are some other features of iterators that don’t involve such profound limitations. Note that many Python functions, as well as the for statement, will use the built-in iter() function to create as many iterators as required from a collection object.

Other features of iterators include:

  • There’s no len() function for an iterator.

  • Iterators, a subclass of iterables, support the next() operation; containers do not. We’ll often use the built-in iter() function to create an iterator, with its next() operation, from a collection.

  • The for statement makes the distinction between containers and other iterables invisible by evaluating the built-in iter() function. A container object, for example, a list, responds to this function by producing an iterator over its items. An iterable object that’s not a collection, for example, a generator, returns itself, since it’s designed to follow the Iterator protocol.

These points will provide some necessary background for this chapter. The idea of the itertools module is to leverage what iterables can do to create succinct, expressive applications without the complicated-looking overheads associated with the details of managing the iterables.

8.1 Working with the infinite iterators

The itertools module provides a number of functions that we can use to enhance or enrich an iterable source of data. We’ll look at the following three functions:

  • count(): This is an unlimited version of the range() function. An upper bound must be imposed by the consumer of this sequence.

  • cycle(): This will reiterate a cycle of values. The consumer must decide when enough values have been produced.

  • repeat(): This can repeat a single value an indefinite number of times. The consumer must end the repetition.

Our goal is to understand how these various iterator functions can be used in generator expressions and with generator functions.

8.1.1 Counting with count()

The built-in range() function is defined by an upper limit: the lower limit and step values are optional. The count() function, on the other hand, has a start and optional step, but no upper limit.

This function can be thought of as the primitive basis for a function such as the built-in enumerate() function. We can define the enumerate() function in terms of zip() and count() functions, as follows:

>>> from itertools import count 
>>> enumerate = lambda x, start=0: zip(count(start), x)

The enumerate() function behaves as if it’s a zip() function that uses the count() function to generate the values associated with some iterable source of objects.

Consequently, the following two expressions are equivalent to each other:

>>> list(zip(count(), iter('word')))
[(0, 'w'), (1, 'o'), (2, 'r'), (3, 'd')]
>>> list(enumerate(iter('word')))
[(0, 'w'), (1, 'o'), (2, 'r'), (3, 'd')]

Both will emit a sequence of two-tuples. The first item in each tuple is an integer counter. The second item comes from the iterator. In this example, the iterator is built from a string of characters.

Here’s something we can do with the count() function that’s difficult to do with the enumerate() function:

>>> list(zip(count(1, 3), iter('word')))
[(1, 'w'), (4, 'o'), (7, 'r'), (10, 'd')]

The value of count(b, s) is the sequence of values b, b+s, b+2s, b+3s, .... In this example, it will provide values of 1, 4, 7, 10, and so on as the identifiers for each value from the iterator. The enumerate() function doesn’t provide a way to change the step.

We can, of course, combine generator functions to achieve this result. Here’s how changing the step can be done with the enumerate() function:

>>> source = iter('word')
>>> gen3 = ((1+3*e, x) for e, x in enumerate(source))
>>> list(gen3)
[(1, 'w'), (4, 'o'), (7, 'r'), (10, 'd')]

This shows how a new value, 1 + 3e, is computed from the source enumeration value of e. This behaves as if the sequence started at 1 and is incremented by 3.

8.1.2 Counting with float arguments

The count() function permits non-integer values. We can use an expression such as count(0.5, 0.1) to provide floating-point values. This will accumulate an error if the increment value doesn’t have an exact binary representation. It’s generally better to use integer count() arguments and apply a scaling factor, as in (0.5 + x*0.1 for x in count()), to ensure that representation errors don’t accumulate.

Here’s a way to examine the accumulating error. This exploration of the float approximation shows some interesting functional programming techniques.

We’ll define a function that will evaluate items from an iterator until some condition is met. This is a way to find the first item that meets some criteria defined by a function. Here’s how we can define a find_first() function:

from collections.abc import Callable, Iterator 
from typing import TypeVar 
T = TypeVar("T") 
 
def find_first( 
    terminate: Callable[[T], bool], 
    iterator: Iterator[T] 
) -> T: 
    i = next(iterator) 
    if terminate(i): 
        return i 
    return find_first(terminate, iterator)

This function starts by getting the next value from the iterator object. No specific type is provided; the type variable T tells mypy that the source iterator and the target result will be the same type. If the chosen item passes the test, that is, this is the desired value, iteration stops and the return value will be of the given type associated with the type variable, T. Otherwise, we’ll evaluate this function recursively to search for a subsequent value that passes the test.

Because the tail-call recursion is not replaced with an optimized for statement, this is limited to iterables with about 1,000 items.

If we have some series of values computed by a generator, this will consume items from the iterator. Here’s a silly example. Let’s say we have an approximation that is a sum of a series of values. One example is this:

π = 4 arctan(1) = 4 (1 − 1/3 + 1/5 − 1/7 + ⋯)

The terms of this series can be created by a generator function like this:

 
>>> from fractions import Fraction

>>> def term_iter():
...     d = 1 
...     sgn = 1 
...     while True: 
...         yield Fraction(sgn, d) 
...         d += 2 
...         sgn = -1 if sgn == 1 else 1

This will yield values like Fraction(1, 1), Fraction(-1, 3), Fraction(1, 5), and Fraction(-1, 7). It will yield an infinite number of them. We want values up until the first value that meets some criteria. For example, we may want to know the first value with an absolute value less than 1∕100 (this is pretty easy to work out with pencil and paper to check the results):

 
>>> find_first(lambda v: abs(v) < 1E-2, term_iter()) 
Fraction(1, 101)

Our goal is to compare counting with float values against counting with integer values and then applying a scaling factor. We want to define a source that has both sequences as pairs. As an introduction to the concept, we’ll look at generating pairs from two parallel sources. Then we’ll return to the computation shown above.

In the following example, the source object is a generator of the pairs of pure float and int-to-float values:

from itertools import count 
from collections.abc import Iterator 
from typing import NamedTuple, TypeAlias 
 
Pair = NamedTuple('Pair', [('flt_count', float), ('int_count', float)])
Pair_Gen: TypeAlias = Iterator[Pair] 
 
source: Pair_Gen = ( 
  Pair(fc, ic) for fc, ic in 
  zip(count(0, 0.1), (.1*c for c in count())) 
) 
 
def not_equal(pair: Pair) -> bool: 
  return abs(pair.flt_count - pair.int_count) > 1.0E-12

The Pair tuple will have two float values: one generated by summing float values, and the other generated by counting integers and multiplying by a floating-point scaling factor.

The generator, source, has a type hint on the assignment statement to show that it iterates over Pair objects.

When we evaluate find_first(not_equal, source), we’ll repeatedly compare pairs of float approximations of decimal values until they differ. One is a sum of 0.1 values: ∑ₓ 0.1. The other is a count of integer values multiplied by 0.1: 0.1 × ∑ₓ 1. Viewed as abstract mathematical definitions, there’s no distinction.

We can formalize it as follows:

∑_{x∈ℕ} 0.1 ≡ 0.1 × ∑_{x∈ℕ} 1

With concrete approximations of the abstract numbers, however, the two values will differ. The result is as follows:

>>> find_first(not_equal, source)
Pair(flt_count=92.799999999999, int_count=92.80000000000001)

After about 928 iterations, the accumulated representation error exceeds 10⁻¹². Neither value has an exact binary representation.

The find_first() function example is close to the Python recursion limit. We’d need to rewrite the function to replace the tail call with an explicit for statement to locate examples with a larger cumulative error value.

We’ve left this change as an exercise for the reader.

The smallest detectable difference can be computed as follows:

>>> source: Pair_Gen = map(Pair, count(0, 0.1), (.1*c for c in count())) 
 
>>> find_first(lambda pair: pair.flt_count != pair.int_count, source) 
Pair(flt_count=0.6, int_count=0.6000000000000001)

This uses a simple equality check instead of an error range. After six steps, the count(0, 0.1) generator has accumulated a tiny, but measurable, error of about 10⁻¹⁶. While small, these error values can accumulate to become more significant and visible in a longer computation. Representing 1∕10 as a binary value would require an infinite binary expansion; it’s truncated, leaving an error of about 10⁻¹⁶ ≈ 2⁻⁵³ from the conceptual value. The magic number 53 is the number of bits of precision available in the IEEE standard 64-bit floating-point value.

This is why we generally count things with ordinary integers and apply a weighting to compute a floating-point value.

8.1.3 Re-iterating a cycle with cycle()

The cycle() function repeats a sequence of values. This can be used when partitioning data into subsets by cycling among the dataset identifiers.

We can imagine using it to solve silly fizz-buzz problems. Visit http://rosettacode.org/wiki/FizzBuzz for a comprehensive set of solutions to a fairly trivial programming problem. Also see https://projecteuler.net/problem=1 for an interesting variation on this theme.

We can use the cycle() function to emit sequences of True and False values as follows:

>>> from itertools import cycle 
 
>>> m3 = (i == 0 for i in cycle(range(3))) 
>>> m5 = (i == 0 for i in cycle(range(5)))

These two generator expressions can produce infinite sequences with a pattern of [True, False, False, True, False, False, ...] or [True, False, False, False, False, True, False, False, False, False, ...]. These are iterators and can only be consumed once. They maintain internal state between uses: if we don’t consume a multiple of 15 values (the least common multiple of the two cycle lengths), the next time we consume values, the generators will be in an unexpected, in-between state.
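We can see the pattern by building a fresh generator and using islice() to impose an upper bound (this leaves m3 and m5 untouched):

>>> from itertools import cycle, islice
>>> list(islice((i == 0 for i in cycle(range(3))), 7))
[True, False, False, True, False, False, True]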

If we zip together a finite collection of numbers and these two derived values, we’ll get a set of three-tuples with a number, the multiple of three true-false condition, and the multiple of five true-false condition. It’s important to introduce a finite iterable to create a proper upper bound on the volume of data being generated. Here’s a sequence of values and their multiplier conditions:

>>> multipliers = zip(range(10), m3, m5)

This is a generator; we can use list(multipliers) to see the resulting object. It looks like this:

>>> list(multipliers)
[(0, True, True), (1, False, False), (2, False, False), ..., (9, True,
False)]

We can now decompose the triples and use a filter to pass numbers that are multiples and reject all others. Because the earlier list(multipliers) expression consumed values from m3 and m5, we re-create those generators along with a fresh zip() object:

>>> m3 = (i == 0 for i in cycle(range(3)))
>>> m5 = (i == 0 for i in cycle(range(5)))
>>> multipliers = zip(range(10), m3, m5)
>>> total = sum(i
...     for i, *multipliers in multipliers
...     if any(multipliers)
... )

The for clause decomposes each triple into two parts: the value, i, and the flags, multipliers. If any of the multipliers are true, the value is passed; otherwise, it’s rejected.
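For this range of ten values, the multiples of three or five are 0, 3, 5, 6, and 9:

>>> total
23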

The cycle() function has another, more valuable, use for exploratory data analysis.

Using cycle() for data sampling

We often need to work with samples of large sets of data. The initial phases of cleansing and model creation are best developed with small sets of data and tested with larger and larger sets of data. We can use the cycle() function to fairly select rows from within a larger set. This is distinct from making random selections and trusting the fairness of the random number generator. Because this approach is repeatable and doesn’t rely on a random number generator, it can be applied to very large datasets processed by multiple computers.

Given a population size, Np, and the desired sample size, Ns, this is the required size of the cycle, c, that will produce appropriate subsets:

c = Np ∕ Ns

We’ll assume that the data can be parsed with a common library like the csv module. This leads to an elegant way to create subsets. Given a value for the cycle_size and two open files, source_file and target_file, we can create subsets using the following function definition:

from collections.abc import Iterable, Iterator 
from itertools import cycle 
from typing import TypeVar 
DT = TypeVar("DT") 
 
def subset_iter( 
         source: Iterable[DT], cycle_size: int 
) -> Iterator[DT]: 
    chooser = (x == 0 for x in cycle(range(cycle_size))) 
    yield from ( 
        row 
        for keep, row in zip(chooser, source) 
        if keep 
    )

The subset_iter() function uses a cycle() function based on the selection factor, cycle_size. For example, we might have a population of ten million records; a 1,000-record subset would be built with cycle_size set to c = 10⁷ ∕ 10³ = 10,000. We’d keep one record in ten thousand.
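A small example shows the selection pattern; with cycle_size set to 3, every third item is kept:

>>> list(subset_iter(range(10), 3))
[0, 3, 6, 9]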

The subset_iter() function can be used by a function that reads from a source file and writes a subset to a destination file. This processing is part of the following function definition:

import csv 
from pathlib import Path 
 
def csv_subset( 
        source: Path, target: Path, cycle_size: int = 3 
) -> None: 
    with ( 
            source.open() as source_file, 
            target.open('w', newline='') as target_file
    ): 
        rdr = csv.reader(source_file, delimiter='\t')
        wtr = csv.writer(target_file) 
        wtr.writerows(subset_iter(rdr, cycle_size))

We can use this generator function to filter the data using the cycle() function and the source data that’s available from the csv reader. Since the chooser expression and the expression used to write the rows are both non-strict, there’s little memory overhead from this kind of processing.

We can also rewrite this method to use compress(), filter(), and islice() functions, as we’ll see later in this chapter.

This design can also be used to reformat a file from any non-standard CSV-like format into a standardized CSV format. As long as we define a parser function that returns consistently defined tuples of strings and write consumer functions that write tuples to the target files, we can do a great deal of cleansing and filtering with relatively short, clear scripts.

8.1.4 Repeating a single value with repeat()

The repeat() function seems like an odd feature: it returns a single value over and over again. It can serve as an alternative for the cycle() function when a single value is needed.

The difference between selecting all of the data and selecting a subset of the data can be expressed with this. The expression (x == 0 for x in cycle(range(size))) emits a pattern of [True, False, False, ...], suitable for picking a subset. The expression (x == 0 for x in repeat(0)) emits a pattern of [True, True, True, ...], suitable for selecting all of the data.

We can think of the following kinds of commands:

from itertools import cycle, repeat
from collections.abc import Iterable, Iterator
from typing import TypeVar

DT = TypeVar("DT")
 
def subset_rule_iter( 
        source: Iterable[DT], rule: Iterator[bool] 
) -> Iterator[DT]: 
    return ( 
        v 
        for v, keep in zip(source, rule) 
        if keep 
    ) 
 
all_rows = lambda: repeat(True) 
subset = lambda n: (i == 0 for i in cycle(range(n)))

This allows us to make a single parameter change, which will either pick all data or pick a subset of data. We can also use cycle([True]) instead of repeat(True); the results are identical.

This pattern can be extended to randomize the subset chosen. The following technique adds an additional kind of choice:

import random 
 
def randomized(limit: int) -> Iterator[bool]: 
    while True: 
        yield random.randrange(limit) == 0

The randomized() function generates a potentially infinite sequence of random numbers over a given range. This fits the pattern of cycle() and repeat().

This allows code such as the following:

>>> import random 
>>> random.seed(42) 
>>> data = [random.randint(1, 12) for _ in range(12)] 
>>> data 
[11, 2, 1, 12, 5, 4, 4, 3, 12, 2, 11, 12] 
 
>>> list(subset_rule_iter(data, all_rows())) 
[11, 2, 1, 12, 5, 4, 4, 3, 12, 2, 11, 12] 
>>> list(subset_rule_iter(data, subset(3))) 
[11, 12, 4, 2] 
 
>>> random.seed(42) 
>>> list(subset_rule_iter(data, randomized(3))) 
[2, 1, 4, 4, 3, 2]

This provides us the ability to use a variety of techniques for selecting subsets. A small change among the available functions, all_rows(), subset(), and randomized(), lets us change our sampling approach in a way that is succinct and expressive.

8.2 Using the finite iterators

The itertools module provides a number of functions that we can use to produce finite sequences of values. We’ll look at 10 functions in this module, plus some related built-in functions:

  • enumerate(): This function is actually a built-in function (part of the builtins module), but it works with an iterator and is very similar to functions in the itertools module.

  • accumulate(): This function returns a sequence of reductions of the input iterable. It’s a higher-order function and can do a variety of clever calculations.

  • chain(): This function combines multiple iterables serially.

  • groupby(): This function uses a function to decompose a single iterable into a sequence of iterables over subsets of the input data.

  • zip_longest(): This function combines elements from multiple iterables. The built-in zip() function truncates the sequence at the length of the shortest iterable. The zip_longest() function pads the shorter iterables with the given fill value.

  • compress(): This function filters one iterable based on a second, parallel iterable of Boolean values.

  • islice(): This function is the equivalent of a slice of a sequence when applied to an iterable.

  • dropwhile() and takewhile(): Both of these functions use a Boolean function to filter items from an iterable. Unlike filter() or filterfalse(), these functions rely on a single True or False value to change their filter behavior for all subsequent values.

  • filterfalse(): This function applies a filter function to an iterable. This complements the built-in filter() function.

  • starmap(): This function maps a function to an iterable sequence of tuples, using each tuple as the *args argument to the given function. The map() function does a similar thing using multiple parallel iterables.

We’ll start with functions that could be seen as useful for grouping or arranging items of an Iterator. After that, we’ll look at functions that are more appropriate for filtering and mapping the items.

8.2.1 Assigning numbers with enumerate()

In the Using enumerate() to include a sequence number section of Chapter 4, Working with Collections, we used the enumerate() function to make a naive assignment of rank numbers to sorted data. We can do things such as pairing up a value with its position in the original sequence, as follows:

>>> raw_values = [1.2, .8, 1.2, 2.3, 11, 18] 
 
>>> tuple(enumerate(sorted(raw_values))) 
((0, 0.8), (1, 1.2), (2, 1.2), (3, 2.3), (4, 11), (5, 18))

This will sort the items in raw_values, create two-tuples with an ascending sequence of numbers, and materialize an object we can use for further calculations.

In Chapter 7, Complex Stateless Objects, we implemented an alternative form of the enumerate() function, the rank() function, which handles ties in a more statistically useful way.

Enumerating rows of data is a common feature that is added to a parser to record the source data row numbers. In many cases, we’ll create some kind of row_iter() function to extract the string values from a source file. This may iterate over the string values in tags of an XML file or in columns of a CSV file. In some cases, we may even be parsing data presented in an HTML file parsed with Beautiful Soup.

In Chapter 4, Working with Collections, we parsed an XML file to create a simple sequence of position tuples. We then created legs with a start, end, and distance. We did not, however, assign an explicit leg number. If we ever sorted the trip collection, we’d be unable to determine the original ordering of the legs.

In Chapter 7, Complex Stateless Objects, we expanded on the basic parser to create named tuples for each leg of the trip. The output from this enhanced parser looks as follows:

>>> from textwrap import wrap 
>>> from pprint import pprint 
 
>>> trip[0] 
LegNT(start=PointNT(latitude=37.54901619777347, longitude=-76.33029518659048), ... 
 
>>> pprint(wrap(str(trip[0]))) 
['LegNT(start=PointNT(latitude=37.54901619777347,',
 'longitude=-76.33029518659048), end=PointNT(latitude=37.840832,',
 'longitude=-76.273834), distance=17.7246)']
>>> pprint(wrap(str(trip[-1])))
['LegNT(start=PointNT(latitude=38.330166, longitude=-76.458504),',
 'end=PointNT(latitude=38.976334, longitude=-76.473503),',
 'distance=38.8019)']

The value of trip[0] is quite wide, too wide for the book. To keep the output in a form that fits in this book’s pages, we’ve wrapped the string representation of the value, and used pprint to show the individual lines. The first Leg object is a short trip between two points on the Chesapeake Bay.

We can add a function that will build a more complex tuple with the input order information as part of the tuple. First, we’ll define a slightly more complex version of the Leg class:

from typing import NamedTuple 
 
class Point(NamedTuple): 
    latitude: float 
    longitude: float 
 
class Leg(NamedTuple): 
    order: int 
    start: Point 
    end: Point 
    distance: float

The Leg definition is similar to the variations shown in Chapter 7, Complex Stateless Objects, specifically the LegNT definition. We’ll define a function that decomposes pairs and creates Leg instances as follows:

from collections.abc import Iterator
from Chapter04.ch04_ex1 import haversine 
 
def numbered_leg_iter( 
    pair_iter: Iterator[tuple[Point, Point]] 
) -> Iterator[Leg]: 
    for order, pair in enumerate(pair_iter): 
        start, end = pair 
        yield Leg( 
            order, 
            start, 
            end, 
            round(haversine(start, end), 4) 
        )

We can use this function to enumerate each pair of start and end points. We’ll decompose the pair and then re-assemble the order, start, and end parameters and the haversine(start,end) parameter’s value as a single Leg instance. This generator function will work with an iterable sequence of pairs.

In the context of the preceding explanation, it is used as follows:

>>> from Chapter06.ch06_ex3 import row_iter_kml
>>> from Chapter07.ch07_ex1 import float_lat_lon
>>> from Chapter04.ch04_ex1 import legs, haversine
>>> import urllib.request 
 
>>> source_url = "file:./Winter%202012-2013.kml" 
>>> with urllib.request.urlopen(source_url) as source: 
...     path_iter = float_lat_lon(row_iter_kml(source)) 
...     pair_iter = legs(path_iter) 
...     trip_iter = numbered_leg_iter(pair_iter) 
...     trip = list(trip_iter)

We’ve parsed the original file into the path points, created start-end pairs, and then created a trip that was built of individual Leg objects. The enumerate() function ensures that each item in the iterable sequence is given a unique number that increments from the default starting value of 0. A second argument value to the enumerate() function can be given to provide a different starting value.
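For example:

>>> list(enumerate("ab", start=1))
[(1, 'a'), (2, 'b')]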

8.2.2 Running totals with accumulate()

The accumulate() function folds a given function into an iterable, accumulating a series of reductions. This will iterate over the running totals from another iterator; the default function is operator.add(). We can provide alternative functions to change the essential behavior from sum to product. The Python library documentation shows a particularly clever use of the max() function to create a sequence of maximum values so far.
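For example, accumulating with max() produces a running maximum:

>>> from itertools import accumulate
>>> list(accumulate([3, 1, 4, 1, 5, 9, 2], max))
[3, 3, 4, 4, 5, 9, 9]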

One application of running totals is quartiling data. The quartile is one of many measures of position. The general approach is to divide a sample’s value by a scaling factor to convert it to a quartile number. If values range over 0 ≤ vᵢ < N, we can divide by the scaling factor N∕4 to convert any value, vᵢ, to a value in the range 0 to 3, which maps to the four quartiles. The math.ceil() function is used to round the scaling factor up to the next higher integer. This will ensure that no value will produce a scaled result of 4, an impossible fifth quartile.

If the minimum value of vi is not zero, we’ll need to subtract this from each value before multiplying by the scaling factor.

In the Assigning numbers with enumerate() section, we introduced a sequence of latitude-longitude coordinates that describe a sequence of legs on a voyage. We can use the distances as a basis for quartiling the waypoints. This allows us to determine the midpoint in the trip.

See the previous section for the value of the trip variable. The value is a sequence of Leg instances. Each Leg object has a start point, an end point, and a distance. The calculation of quartiles looks like the following code:

>>> from itertools import accumulate 
>>> import math 
 
>>> distances = (leg.distance for leg in trip) 
>>> distance_accum = list(accumulate(distances)) 
>>> scale = math.ceil(distance_accum[-1] / 4) 
 
>>> quartiles = list(int(d / scale) for d in distance_accum)

We extracted the distance values and computed the accumulated distances for each leg. The last of the accumulated distances is the total. The value of the quartiles variable is as follows:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

We can use the zip() function to merge this sequence of quartile numbers with the original data points. We can also use functions such as groupby() to create distinct collections of the legs in each quartile.

8.2.3 Combining iterators with chain()

A collection of iterators can be unified into a single sequence of values via the chain() function. This can be helpful to combine data that was decomposed via the groupby() function. We can use this to process a number of collections as if they were a single collection.

Python’s contextlib offers a clever class, ExitStack(), which can be used to perform a number of operations at the end of the context in a with statement. This permits an application to create any number of sub-contexts, all of which will have a proper __enter__() and __exit__() evaluated. This is particularly useful when we have an indefinite number of files to open.

In this example, we can combine the itertools.chain() function with a contextlib.ExitStack object to process—and properly close—a collection of files. Further, the data from all of these files will be processed as a single iterable sequence of values. Instead of wrapping each individual file operation in a with statement, we can wrap all of the operations in a single with context.

We can create a single context for multiple files like this:

import csv
from collections.abc import Iterator
from contextlib import ExitStack
from itertools import chain
from pathlib import Path
from typing import TextIO
 
def row_iter_csv_tab(*filepaths: Path) -> Iterator[list[str]]: 
    with ExitStack() as stack: 
        files: list[TextIO] = [ 
            stack.enter_context(path.open()) 
            for path in filepaths 
        ] 
        readers = map( 
            lambda f: csv.reader(f, delimiter='\t'),
            files) 
        yield from chain(*readers)

We’ve created an ExitStack object that can hold a number of individual open contexts. When the with statement finishes, all items in the ExitStack object will be closed properly. In the above function, a sequence of open file objects is assigned to the files variable. The stack.enter_context() method enters these objects into the ExitStack object to be closed properly.

Given the sequence of files in the files variable, we created a sequence of CSV readers in the readers variable. In this case, all of our files have a common tab-delimited format, which makes it very pleasant to open them with a simple, consistent application of a function to the sequence of files.

Finally, we chained all of the readers into a single iterator with chain(*readers). This was used to yield the sequence of rows from all of the files.

It’s important to note that we can’t return the chain(*readers) object. If we do, this would exit the with statement context, closing all the source files. Instead, we must yield individual rows from the generator so that the with statement context is kept active until all the rows are consumed.

8.2.4 Partitioning an iterator with groupby()

We can use the groupby() function to partition an iterator into smaller iterators. This works by evaluating the given key function for each item in the given iterable. If the key value matches the previous item’s key, the two items are part of the same partition. If the key does not match the previous item’s key, the previous partition is ended and a new partition is started. Because the matching is done on adjacent items in the iterable, the values must be sorted by the key.

The output from the groupby() function is a sequence of two-tuples. Each tuple has the group’s key value and an iterable over the items in the group, something like [(key, iter(group)), (key, iter(group)), ...]. Each group’s iterator can then be processed to create a materialized collection, or perhaps reduce it to some summary value.
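A small example shows this structure; note that it’s the adjacent matching items that form the groups:

>>> from itertools import groupby
>>> [(k, list(g)) for k, g in groupby("aabbbc")]
[('a', ['a', 'a']), ('b', ['b', 'b', 'b']), ('c', ['c'])]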

In the Running totals with accumulate() section, earlier in the chapter, we showed how to compute quartile values for an input sequence. We’ll extend that to create groups based on the distance quartiles. Each group will be an iterator over legs that fit into the range of distances.

Given the trip variable with the raw data and the quartile variable with the quartile assignments, we can group the data using the following commands:

>>> from itertools import groupby 
>>> from Chapter07.ch07_ex1 import get_trip 
 
>>> source_url = "file:./Winter%202012-2013.kml" 
>>> trip = get_trip(source_url) 
>>> quartile = quartiles(trip) 
>>> group_iter = groupby(zip(quartile, trip), key=lambda q_raw: q_raw[0])
>>> for group_key, group in group_iter:
...    print(f"Group {group_key+1}: {len(list(group))} legs")
Group 1: 23 legs 
Group 2: 14 legs 
Group 3: 19 legs 
Group 4: 17 legs

This will start by zipping the quartile numbers with the raw trip data, creating an iterator over two-tuples with quartile number and leg. The groupby() function will use the given lambda object to group by the quartile number, q_raw[0], in each q_raw tuple. We used a for statement to examine the results of the groupby() function. This shows how we get a group key value and an iterator over members of each individual group.

The input to the groupby() function must be sorted by the key values. This will ensure that all of the items in a group will be adjacent. For very large datasets, this may force us to use the operating system’s sort in the rare cases of a file being too large to fit into memory.

Note that we can also create groups using a defaultdict(list) object. This avoids a sort step, but can build a large, in-memory dictionary of lists. The function can be defined as follows:

from collections import defaultdict
from collections.abc import Callable, Hashable, Iterable, Iterator
from typing import TypeVar
 
DT = TypeVar("DT") 
KT = TypeVar("KT", bound=Hashable) 
 
def groupby_2( 
    iterable: Iterable[DT], 
    key: Callable[[DT], KT] 
) -> Iterator[tuple[KT, Iterator[DT]]]: 
    groups: dict[KT, list[DT]] = defaultdict(list) 
    for item in iterable: 
        groups[key(item)].append(item) 
    for g in groups: 
        yield g, iter(groups[g])

We created a defaultdict object that will use list() as the default value associated with each new key. The type hints clarify the relationship between the key function, which emits objects of some arbitrary type associated with the type variable KT, and the dictionary, which uses the same type, KT, for the keys.

Each item will have the given key() function applied to create a key value. The item is appended to the list in the defaultdict object with the given key.

Once all of the items are partitioned, we can then return each partition as an iterator over the items that share a common key. This will retain all of the original values in memory, and introduce a dictionary and a list for each unique key value. For very large datasets, this may require more memory than is available on the processor.

The type hints clarify that the source is some arbitrary type, associated with the variable DT. The result will be an iterator that includes iterators of the type DT. This makes a strong statement that no transformation is happening: the range type matches the input domain type.
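Here’s a small sketch of how this variant can be used; unlike itertools.groupby(), the source doesn’t need to be sorted:

>>> pairs = [("a", 1), ("b", 2), ("a", 3)]
>>> [(k, list(g)) for k, g in groupby_2(pairs, key=lambda p: p[0])]
[('a', [('a', 1), ('a', 3)]), ('b', [('b', 2)])]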

8.2.5 Merging iterables with zip_longest() and zip()

We saw the zip() function in Chapter 4, Working with Collections. The zip_longest() function differs from the zip() function in an important way: whereas the zip() function stops at the end of the shortest iterable, the zip_longest() function pads short iterables with a given value, and stops at the end of the longest iterable.

The fillvalue= keyword parameter allows filling with a value other than the default value, None.
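For example:

>>> from itertools import zip_longest
>>> list(zip_longest("ab", "xyz", fillvalue="-"))
[('a', 'x'), ('b', 'y'), ('-', 'z')]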

For most exploratory data analysis applications, padding with a default value is statistically difficult to justify. The Python Standard Library documentation includes the grouper recipe, which is built with the zip_longest() function. It’s difficult to expand on this without drifting far from our focus on data analysis.

8.2.6 Creating pairs with pairwise()

The pairwise() function consumes a source iterator, emitting the items in pairs. See the legs() function in Chapter 4, Working with Collections, for an example of creating pairs from a source iterable.

Here’s a small example of transforming a sequence of characters into adjacent pairs of characters:

>>> from itertools import pairwise 
 
>>> text = "hello world" 
>>> list(pairwise(text)) 
[('h', 'e'), ('e', 'l'), ('l', 'l'), ...]

This kind of analysis locates letter pairs, called "bigrams" or "digraphs." This can be helpful when trying to understand a simple letter substitution cipher. The frequency of bigrams in encoded text can suggest possible ways to break the cipher.
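For example, a collections.Counter object can tabulate the bigram frequencies:

>>> from collections import Counter
>>> from itertools import pairwise
>>> Counter(pairwise("banana")).most_common(2)
[(('a', 'n'), 2), (('n', 'a'), 2)]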

In Python 3.10, this function was moved from being a recipe to being a proper itertools function.

8.2.7 Filtering with compress()

The built-in filter() function uses a predicate to determine whether an item is passed or rejected. Instead of a function that calculates a value, we can use a second, parallel iterable to determine which items to pass and which to reject.

In the Re-iterating a cycle with cycle() section of this chapter, we looked at data selection using a simple generator expression. Its essence was as follows:

from collections.abc import Iterable, Iterator
from typing import TypeVar
 
DataT = TypeVar("DataT") 
 
def subset_gen( 
        data: Iterable[DataT], rule: Iterable[bool] 
) -> Iterator[DataT]: 
    return ( 
        v 
        for v, keep in zip(data, rule) 
        if keep 
    )

Each value for the rule iterable must be a Boolean value. To choose all items, it can repeat a True value. To pick a fixed subset, it can cycle among a True value followed by copies of a False value. To pick 1/4 of the items, we could use cycle([True] + 3*[False]).

The generator function can be replaced by compress(some_source, selectors), using an iterable of Boolean values for the selectors argument. If we make that change, the processing is simplified:

>>> import random 
>>> random.seed(1) 
>>> data = [random.randint(1, 12) for _ in range(12)] 
 
>>> from itertools import compress 
 
>>> copy = compress(data, all_rows()) 
>>> list(copy) 
[3, 10, 2, 5, 2, 8, 8, 8, 11, 7, 4, 2] 
 
>>> cycle_subset = compress(data, subset(3)) 
>>> list(cycle_subset) 
[3, 5, 8, 7] 
 
>>> random.seed(1) 
>>> random_subset = compress(data, randomized(3)) 
>>> list(random_subset) 
[3, 2, 2, 4, 2]

These examples rely on the alternative selection rules all_rows(), subset(), and randomized(), as shown previously. The subset() and randomized() functions must be given a parameter value, c, to pick 1∕c of the rows from the source. The selectors expression must build an iterable of True and False values based on one of the selection rule functions. The rows to be kept are selected by zipping the source iterable with the row-selection iterable.

Since all of this is done as a lazy evaluation, rows are not read from the source until required. This allows us to process very large sets of data efficiently. Also, the relative simplicity of the Python code means that we don’t really need a complex configuration file and an associated parser to make choices among the selection rules. We have the option to use this bit of Python code as the configuration for a larger data-sampling application.

We can think of the filter() function as having the following definition:

from itertools import compress, tee 
from collections.abc import Iterable, Iterator, Callable 
from typing import TypeVar 
 
SrcT = TypeVar("SrcT") 
 
def filter_concept( 
        function: Callable[[SrcT], bool], 
        source: Iterable[SrcT] 
) -> Iterator[SrcT]: 
    i1, i2 = tee(source, 2) 
    return compress(i1, map(function, i2))

We cloned the iterable using the tee() function. We’ll look at this function in detail later. The map() function will generate results of applying the filter predicate function, function(), to each value in the iterable, yielding a sequence of True and False values. The sequence of Booleans is used to compress the original sequence, passing only items associated with True. This builds the features of the filter() function from the compress() function.

The function’s hint can be broadened to Callable[[SrcT], Any]. This is because the compress() function will make use of the truthiness or falsiness of the values returned. It seems helpful to emphasize that the values will be understood as Booleans, hence the use of bool in the type hint, not Any.

8.2.8 Picking subsets with islice()

In Chapter 4, Working with Collections, we looked at slice notation to select subsets from a collection. Our example was to pair up items sliced from a list object. The following is a simple list:

>>> from Chapter04.ch04_ex5 import parse_g 
 
>>> with open("1000.txt") as source: 
...    flat = list(parse_g(source)) 
 
>>> flat[:10] 
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29] 
 
>>> flat[-10:] 
[7841, 7853, 7867, 7873, 7877, 7879, 7883, 7901, 7907, 7919]

We can create pairs using list slices as follows:

>>> list(zip(flat[0::2], flat[1::2])) 
[(2, 3), (5, 7), (11, 13), ...]

The islice() function gives us similar capabilities without the overhead of materializing a list object. This will work with an iterable of any size. The islice() function accepts an Iterable source, and the three parameters that define a slice: the start, stop, and step values. This means islice(source, 1, None, 2) is similar to source[1::2]. Instead of the slice-like shorthand using :, optional parameter values are used; the rules match the built-in range() function. The important difference is that source[1::2] only works for a Sequence object like a list or tuple. The islice(source, 1, None, 2) function works for any iterable, including an iterator object, or a generator expression.

The following example will create pairs of values of an iterable using the islice() function:

>>> flat_iter_1 = iter(flat)
>>> flat_iter_2 = iter(flat)
>>> pairs = list(zip(
... islice(flat_iter_1, 0, None, 2),
... islice(flat_iter_2, 1, None, 2)
... ))
>>> len(pairs)
500
>>> pairs[:3]
[(2, 3), (5, 7), (11, 13)]
>>> pairs[-3:]
[(7877, 7879), (7883, 7901), (7907, 7919)]

We created two independent iterators over a collection of data points in the flat variable. These could be two separate iterators over an open file or a database result set. The two iterators need to be independent to ensure a change in one islice() source doesn’t interfere with the other islice() source.

This will produce a sequence of two-tuples from the original sequence:

[(2, 3), (5, 7), (11, 13), (17, 19), (23, 29), 
... 
(7883, 7901), (7907, 7919)]

Since islice() works with an iterable, this kind of design can work with extremely large sets of data. We can use this to pick a subset out of a larger set of data. In addition to using the filter() or compress() functions, we can also use islice(source, 0, None, c) to pick a 1∕c-sized subset from a larger set of data.
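For example, keeping one item in five from a source:

>>> from itertools import islice
>>> list(islice(range(20), 0, None, 5))
[0, 5, 10, 15]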

8.2.9 Stateful filtering with dropwhile() and takewhile()

The dropwhile() and takewhile() functions are stateful filter functions. They start in one mode; the given predicate function is a kind of flip-flop that switches the mode. The dropwhile() function starts in reject mode; when the function becomes False, it switches to pass mode. The takewhile() function starts in pass mode; when the given function becomes False, it switches to reject mode. As filters, dropwhile() will consume the entire iterable; takewhile() stops consuming at the first value that fails the predicate.
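A small example shows the mode switch; note that takewhile() stops at the first failing item, even though a later item would pass:

>>> from itertools import dropwhile, takewhile
>>> list(takewhile(lambda x: x < 3, [1, 2, 5, 2]))
[1, 2]
>>> list(dropwhile(lambda x: x < 3, [1, 2, 5, 2]))
[5, 2]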

We can use these to skip header or footer lines in an input file. We use the dropwhile() function to reject header rows and pass the remaining data. We use the takewhile() function to pass data and reject trailer rows. We’ll return to the simple GPL file format shown in Chapter 3, Functions, Iterators, and Generators. The file has a header that looks as follows:

GIMP Palette 
Name: Crayola 
Columns: 16 
#

This is followed by rows that look like the following example data:

255 73 108 Radical Red

Note that there’s an invisible tab character between the RGB color triple and the color name. To make it more visible, we can typeset the example with an explicit \t escape, like this:

255 73 108\tRadical Red

This little typesetting technique seems a little misleading, since it doesn’t look like that in most programming editors.

We can locate the final line of the headers—the # line—using a parser based on the dropwhile() function, as follows:

>>> import csv
>>> from itertools import dropwhile, islice
>>> from pathlib import Path
 
>>> source_path = Path("crayola.gpl") 
>>> with source_path.open() as source: 
...     rdr = csv.reader(source, delimiter='\t')
...     row_iter = dropwhile( 
...         lambda row: row[0] != '#', rdr
...     ) 
...     color_rows = islice(row_iter, 1, None) 
...     colors = list( 
...         (color.split(), name) for color, name in color_rows 
...     )

We created a CSV reader to parse the lines based on tab characters. This will neatly separate the color three-tuple from the name. The three-tuple will need further parsing. This will produce an iterator that starts with the # line and continues with the rest of the file.

We can use the islice() function to discard the first item of an iterable. The islice(rows, 1, None) expression is similar to asking for a rows[1:] slice: the first item is quietly discarded. Once the last of the heading rows have been discarded, we can parse the color tuples and return more useful color objects.

For this particular file, we can also use the number of columns located by the CSV reader() function. Header rows only have a single column, allowing the use of the dropwhile(lambda row: len(row) == 1, rdr) expression to discard header rows. This isn’t a good approach in general, because locating the last line of the headers is often easier than trying to define some general pattern that distinguishes all header (or trailer) lines from the meaningful file content. In this case, the header rows were distinguishable by the number of columns; this is a rarity.

8.2.10 Two approaches to filtering with filterfalse() and filter()

In Chapter 5, Higher-Order Functions, we looked at the built-in filter() function. The filterfalse() function from the itertools module could be defined from the filter() function, as follows:

filterfalse_concept = ( 
    lambda pred, iterable: 
    filter(lambda x: not pred(x), iterable) 
)

As with the filter() function, the predicate function can be the None value. The value of the filter(None, iterable) method is all the True values in the iterable. The value of the filterfalse(None, iterable) method is all the False values from the iterable:

>>> from itertools import filterfalse 
 
>>> source = [0, False, 1, 2] 
>>> list(filter(None, source)) 
[1, 2] 
 
>>> filterfalse(None, source) 
<itertools.filterfalse object at ...> 
>>> list(_) 
[0, False]

The point of having the filterfalse() function is to promote reuse. If we have a succinct function that makes a filter decision, we should be able to use that function to partition input to pass as well as reject groups without having to fiddle around with logical negation.

The idea is to execute the following commands:

>>> from itertools import tee

>>> iter_1, iter_2 = tee(iter(raw_samples), 2)
 
>>> rule_subset_iter = filter(rule, iter_1) 
>>> not_rule_subset_iter = filterfalse(rule, iter_2)

This kind of processing into two subsets will include all items from the source. The rule() function is unchanged, and we can’t introduce a subtle logic bug through improper negation of this function.

8.2.11 Applying a function to data via starmap() and map()

The built-in map() function is a higher-order function that applies a function to items from an iterable. We can think of the simple version of the map() function as follows:

map_concept = ( 
    lambda function, arg_iter: 
    (function(a) for a in arg_iter) 
)

This works well when the arg_iter parameter is an iterable that provides individual values. The actual map() function is quite a bit more sophisticated than this, and can also work with a number of iterables.

The starmap() function in the itertools module is the *args version of the map() function. We can imagine the definition as follows:

starmap_concept = ( 
    lambda function, arg_iter: 
    (function(*a) for a in arg_iter) 
             #^-- Adds this * to decompose tuples 
)

This reflects a small shift in the semantics of the map() function to properly handle an iterable-of-tuples structure. Each tuple is decomposed and applied to the various positional parameters.
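For example, applying the built-in pow() function to an iterable of argument tuples:

>>> from itertools import starmap
>>> list(starmap(pow, [(2, 3), (3, 2), (10, 2)]))
[8, 9, 100]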

When we look at the trip data, from the preceding commands, we can redefine the construction of a Leg object based on the starmap() function.

We could use the starmap() function to assemble the Leg objects, as follows:

from itertools import starmap
from Chapter04.ch04_ex1 import legs, haversine
from Chapter06.ch06_ex3 import row_iter_kml
from Chapter07.ch07_ex1 import float_lat_lon, LegNT, PointNT
import urllib.request
from collections.abc import Callable

def get_trip_starmap(url: str) -> list[LegNT]:
    make_leg: Callable[[PointNT, PointNT], LegNT] = ( 
        lambda start, end: 
        LegNT(start, end, haversine(start, end)) 
    ) 
    with urllib.request.urlopen(url) as source: 
        path_iter = float_lat_lon( 
            row_iter_kml(source) 
        ) 
        pair_iter = legs(path_iter) 
        trip = list(starmap(make_leg, pair_iter)) 
                   #-------- Used here 
    return trip

Here’s how it looks when we apply this get_trip_starmap() function to read source data and iterate over the created Leg objects:

>>> from pprint import pprint 
>>> source_url = "file:./Winter%202012-2013.kml" 
>>> trip = get_trip_starmap(source_url) 
>>> len(trip) 
73 
>>> pprint(trip[0]) 
LegNT(start=PointNT(latitude=37.54901619777347, longitude=-76.33029518659048), end=PointNT(latitude=37.840832, longitude=-76.273834), distance=17.724564798884984) 
 
>>> pprint(trip[-1]) 
LegNT(start=PointNT(latitude=38.330166, longitude=-76.458504), end=PointNT(latitude=38.976334, longitude=-76.473503), distance=38.801864781785845)

The make_leg() function accepts a pair of Point objects, and returns a Leg object with the start point, end point, and distance between the two points. The legs() function from Chapter 4, Working with Collections, creates pairs of Point objects that reflect the start and end of a leg of the voyage. The pairs created by legs() are provided as input to make_leg() to create proper Leg objects.

The map() function can also accept multiple iterables. When we use map(f, iter1, iter2, ...), it behaves as if the iterators are zipped together, and the starmap() function is applied.

We can think of the map(function, iter1, iter2, iter3) function as if it were starmap(function, zip(iter1, iter2, iter3)).

The benefit of the starmap(function, some_list) method is to replace a potentially wordy (function(*args) for args in some_list) generator expression with something that avoids the potentially overlooked * operator applied to the function argument values.

8.3 Cloning iterators with tee()

The tee() function gives us a way to circumvent one of the important Python rules for working with iterables. The rule is so important, we’ll repeat it here:

Iterators can be used only once.

The tee() function allows us to clone an iterator. This seems to free us from having to materialize a sequence so that we can make multiple passes over the data. Because tee() can use a lot of memory, it is sometimes better to materialize a list and process it multiple times, rather than trying to use the potential simplification of the tee() function.

For example, a simple average for an immense dataset could be written in the following way:

from collections.abc import Iterable
from itertools import tee
 
def mean_t(source: Iterable[float]) -> float: 
    it_0, it_1 = tee(iter(source), 2) 
    N = sum(1 for x in it_0) 
    sum_x = sum(x for x in it_1) 
    return sum_x/N

This would compute an average without appearing to materialize the entire dataset in memory. Note that the type hint of float doesn’t preclude integers. The mypy program is aware of the numeric processing rules, and this definition provides a flexible way to specify that either int or float will work.
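The function works with a one-shot iterator as well as a materialized sequence:

>>> mean_t(iter([1, 2, 3, 4]))
2.5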

8.4 The itertools recipes

Within the itertools chapter of the Python library documentation, there’s a subsection called Itertools Recipes, which contains outstanding examples of ways to use the various itertools functions. Since there’s no reason to reproduce these, we’ll reference them here. They should be considered as required reading on functional programming in Python.

For more information, visit https://docs.python.org/3/library/itertools.html#itertools-recipes.

It’s important to note that these aren’t importable functions in the itertools modules. A recipe needs to be read and understood and then, perhaps, copied or modified before it’s included in an application.

Some of the recipes involve some of the more advanced techniques shown in the next chapter; they’re not in the following list. We’ve preserved the ordering of items in the Python documentation, which is not alphabetical. The following list summarizes the recipes that show functional programming design patterns built from the itertools basics:

  • take(n, iterable): Returns the first n items of the iterable as a list. This wraps a use of islice() in a simple name.

  • tabulate(function, start=0): Yields function(start), function(start+1), and so on. This is based on map(function, count(start)).

  • consume(iterator, n): Advances the iterator n steps ahead. If n is None, it consumes all of the values from the iterator.

  • nth(iterable, n, default=None): Returns only the nth item, or a default value. This wraps a use of islice() in a simple name.

  • quantify(iterable, pred=bool): Returns the count of how many times the predicate is true. This uses sum() and map(), and relies on the way a Boolean predicate is effectively 1 when converted to an integer value.

  • padnone(iterable): Yields the iterable’s elements and then yields None indefinitely. This can create functions that behave like zip_longest() or map().

  • ncycles(iterable, n): Yields the sequence elements n times.

  • dotproduct(vec1, vec2): Multiplies the two vectors’ values, element by element, and finds the sum of the results.

  • flatten(listOfLists): Flattens one level of nesting. This chains the various lists together into a single list.

  • repeatfunc(func, times=None, *args): Calls the given function, func, repeatedly with the specified arguments.

  • grouper(iterable, n, fillvalue=None): Yields the iterable’s elements as a sequence of fixed-length chunks or blocks.

  • roundrobin(*iterables): Yields values taken from each of the iterables in turn. For example, roundrobin('ABC', 'D', 'EF') is 'A', 'D', 'E', 'B', 'F', 'C'.

  • partition(pred, iterable): Uses a predicate to partition entries into False entries and True entries. The return value is a pair of iterators.

  • unique_everseen(iterable, key=None): Yields the unique elements of the source iterable, preserving order. It remembers all elements ever seen.

  • unique_justseen(iterable, key=None): Yields unique elements, preserving order. It remembers only the element most recently seen. This is useful for deduplicating or grouping a sorted sequence.

  • iter_except(func, exception, first=None): Yields the results of calling a function repeatedly until an exception is raised. The exception is silenced. This can be used to iterate until a KeyError or IndexError is raised.

8.5 Summary

In this chapter, we’ve looked at a number of functions in the itertools module. This library module helps us to work with iterators in sophisticated ways.

We’ve looked at the infinite iterators; they repeat without terminating. They include the count(), cycle(), and repeat() functions. Since they don’t terminate, the consuming function must determine when to stop accepting values.

We’ve also looked at a number of finite iterators. Some of them are built-in, and some of them are a part of the itertools module. They work with a source iterable, so they terminate when that iterable is exhausted. These functions include enumerate(), accumulate(), chain(), groupby(), zip_longest(), zip(), pairwise(), compress(), islice(), dropwhile(), takewhile(), filterfalse(), filter(), starmap(), and map(). These functions allow us to replace possibly complex generator expressions with simpler-looking functions.

We’ve noted that functions like the tee() function are available, and can create a helpful simplification. It has the potential cost of using a great deal of memory, and needs to be considered carefully. In some cases, materializing a list may be more efficient than applying the tee() function.

Additionally, we looked at the recipes from the documentation, which provide yet more functions we can study and copy for our own applications. The recipes list shows a wealth of common design patterns.

In Chapter 9, Itertools for Combinatorics – Permutations and Combinations, we’ll continue our study of the itertools module, focusing on permutations and combinations. These operations can produce voluminous results. For example, enumerating all possible orderings of 5 cards drawn from a deck of 52 cards will yield over 3.12 × 10⁸ permutations. For small domains, however, it can be helpful to examine all possible orderings to understand how well observed samples match the domain of possible values.

8.6 Exercises

This chapter’s exercises are based on code available from Packt Publishing on GitHub. See https://github.com/PacktPublishing/Functional-Python-Programming-3rd-Edition.

In some cases, the reader will notice that the code provided on GitHub includes partial solutions to some of the exercises. They serve as hints, allowing the reader to explore alternative solutions.

In many cases, exercises will need unit test cases to confirm they actually solve the problem. These are often identical to the unit test cases already provided in the GitHub repository. The reader should replace the book’s example function name with their own solution to confirm that it works.

8.6.1 Optimize the find_first() function

In Counting with float arguments, we defined a find_first() function to locate the first pair of an iterator that passed a given test criterion. In most of the examples, the test was a comparison between values to see if the difference between them was larger than 10⁻¹².

The definition of the find_first() function used a simpler recursion. This limits the size of the iterable that can be examined: only about 1,000 values can be consumed before hitting the stack size limitation.

First, create a comparison function that will consume enough values to fail with a recursion limit exception.

Then, rewrite the find_first() function to replace the tail call with iteration using the for statement.

Using the comparison function found earlier, demonstrate that the revised function will readily pass 1,000 elements, looking for the first that matches the revised criteria.

8.6.2 Compare Chapter 4 with the itertools.pairwise() recipe

In Chapter 4, Working with Collections, the legs() function created overlapping pairs from a source iterable. Compare the implementation provided in this book with the pairwise() function.

Create a very, very large iterable and compare the performance of the legs() function and the pairwise() function. Which is faster?

8.6.3 Compare Chapter 4 with itertools.tee() recipe

In the Using sums and counts for statistics section of Chapter 4, Working with Collections, a mean() function was defined that had a limitation of only working with sequences. If itertools.tee() is used, a mean() function can be written that will apply to iterators in general, without being limited to collection objects that can produce multiple iterators. Define a mean_i() function based on the itertools.tee() function that works with any iterator. Which variant of mean computations is easier to understand?

Create a very, very large iterable and compare the performance of the mean_i() function and the mean() function shown in the text. Which is faster? It takes some time to explore, but locating a collection that breaks the itertools.tee() function while still working with a materialized list object is an interesting thing to find.

8.6.4 Splitting a dataset for training and testing purposes

Given a pool of samples, it’s sometimes necessary to partition the data into a subset used for building (or “training”) a model, and a separate subset used to test the model’s predictive ability. It’s common practice to use subsets of 20%, 25%, or even 33% of the source data for testing. Develop a set of functions to partition the data into subsets with ratios of 1 : 3, 1 : 4, or 1 : 5 for test vs. training data.

8.6.5 Rank ordering

In Chapter 7, Complex Stateless Objects, we looked at ranking items in a set of data. The approach shown in that chapter was to build a dictionary of items with the same key value. This made it possible to create a rank that was the mean of the various items. For example, the sequence [0.8, 1.2, 1.2, 2.3, 18] should have rank values of 1, 2.5, 2.5, 4, 5. The two matching key values in positions 1 and 2 of the sequence should have the midpoint value of 2.5 as their common rank.

This can be computed using itertools.groupby(). Each group will have some number of members, provided by the groupby() function. The sequence of rank values for a group of n items with matching keys is r₀, r₀+1, r₀+2, ..., r₀+n−1, where r₀ is the starting rank for the group. The mean of this sequence is r₀ + (n−1)∕2. This processing requires creating a temporary sequence of values in order to emit each item from the group of values with the same key with their matching ranks.

Write this rank() function, using the itertools.groupby() function. Compare the code with the examples in Chapter 7, Complex Stateless Objects. What advantages does the itertools variant offer?

Join our community Discord space

Join our Python Discord workspace to discuss and know more about the book: https://packt.link/dHrHU
