Have you ever noticed that many Python objects know how to behave inside of a for
loop? That’s not an accident. Iteration is so useful, and so common, that Python makes it easy for an object to be iterable. All it has to do is implement a handful of behaviors, known collectively as the iterator protocol.
In this chapter, we’ll explore that protocol and how we can use it to create iterable objects. We’ll do this in three ways:
We’ll create our own iterators via Python classes, directly implementing the protocol ourselves.
We’ll create generators, objects that implement the protocol, based on something that looks very similar to a function. Not surprisingly, these are known as generator functions.
We’ll also create generators using generator expressions, which look quite a bit like list comprehensions.
Even newcomers to Python know that if you want to iterate over the characters in a string, you can write
for i in 'abcd':
    print(i)     ❶
❶ Prints a, b, c, and d, each on a separate line
This feels natural, and that’s the point. What if you just want to execute a chunk of code five times? Can you iterate over the integer 5? Many newcomers to Python assume that the answer is yes and write the following:
for i in 5:     ❶
    print(i)

❶ Raises TypeError: 'int' object is not iterable
From this, we can see that while strings, lists, and dicts are iterable, integers aren’t. They aren’t because they don’t implement the iterator protocol, which consists of three parts:

- An __iter__ method, which returns an iterator object
- A __next__ method on that iterator, which returns one value each time it’s invoked
- The StopIteration exception, which __next__ raises to signal that no more values remain
Sequences (strings, lists, and tuples) are the most common form of iterables, but a large number of other objects, such as files and dicts, are also iterable. Best of all, when you define your own classes, you can make them iterable. All you have to do is make sure that the iterator protocol is in place on your object.
Given those three parts, we can now understand what a for loop really does:
- It asks an object whether it’s iterable, using the iter built-in function (http://mng.bz/jgja). This function invokes the __iter__ method on the target object. Whatever __iter__ returns is called the iterator.
- If the object is iterable, then the for loop invokes the next built-in function on the iterator that was returned. That function invokes __next__ on the iterator.
- If __next__ raises a StopIteration exception, then the loop exits.
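We can watch those three steps by hand, without a for loop, by calling the iter and next built-ins directly. Here’s a quick sketch you can run yourself:

```python
# Manually performing what a for loop does behind the scenes
i = iter('ab')        # invokes 'ab'.__iter__(), returning a string iterator
print(next(i))        # invokes i.__next__(), returning 'a'
print(next(i))        # returns 'b'

try:
    next(i)           # the iterator is exhausted...
except StopIteration: # ...so __next__ raises StopIteration
    print('done')
```

A for loop does exactly this, catching StopIteration for us and exiting quietly.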
This protocol explains a couple things that tend to puzzle newcomers to Python:
Why don’t we need any indexes? In C-like languages, we need a numeric index for our iterations. That’s so the loop can go through each of the elements of the collection, one at a time. In those cases, the loop is responsible for keeping track of the current location. In Python, the object itself is responsible for producing the next item. The for loop doesn’t know whether we’re on the first item or the last one. But it does know when we’ve reached the end.
How is it that different objects behave differently in for loops? After all, strings return characters, but dicts return keys, and files return lines. The answer is that the iterator object can return whatever it wants. So string iterators return characters, dict iterators return keys, and file iterators return the lines in a file.
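A quick way to see this for yourself is to ask different objects for their iterators and pull a single value from each. A small sketch:

```python
# Each type supplies its own iterator class, so next returns
# whatever that iterator decides: characters, keys, and so on.
print(next(iter('xyz')))             # 'x' -- a string iterator returns characters
print(next(iter({'a': 1, 'b': 2})))  # 'a' -- a dict iterator returns keys
print(type(iter('xyz')))             # a str_iterator object
```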
If you’re defining a new class, you can make it iterable as follows:
- Define an __iter__ method that takes only self as an argument and returns self. In other words, when Python asks your object, “Are you iterable?” the answer will be, “Yes, and I’m my own iterator.”
- Define a __next__ method that takes only self as an argument. This method should either return a value or raise StopIteration. If it never raises StopIteration, then any for loop on this object will never exit.
There are some more sophisticated ways to do things, including returning a separate, different object from __iter__. I demonstrate and discuss that later in this chapter.
Here’s a simple class that implements the protocol, wrapping itself around an iterable object but indicating when it reaches each stage of iteration:
class LoudIterator():
    def __init__(self, data):
        print('    Now in __init__')
        self.data = data                    ❶
        self.index = 0                      ❷

    def __iter__(self):
        print('    Now in __iter__')
        return self                         ❸

    def __next__(self):
        print('    Now in __next__')
        if self.index >= len(self.data):    ❹
            print(f'    self.index ({self.index}) is too big; exiting')
            raise StopIteration
        value = self.data[self.index]       ❺
        self.index += 1                     ❻
        print(f'    Got value {value}, incremented index to {self.index}')
        return value

for one_item in LoudIterator('abc'):
    print(one_item)
❶ Stores the data in an attribute, self.data
❷ Creates an index attribute, keeping track of our current position
❸ Our __iter__ does the simplest thing, returning self.
❹ Raises StopIteration if our self.index has reached the end
❺ Grabs the current value, but doesn’t return it yet
❻ Increments self.index, so the next call returns the next element
If we execute this code, we’ll see the following output:
    Now in __init__
    Now in __iter__
    Now in __next__
    Got value a, incremented index to 1
a
    Now in __next__
    Got value b, incremented index to 2
b
    Now in __next__
    Got value c, incremented index to 3
c
    Now in __next__
    self.index (3) is too big; exiting
This output walks us through the iteration process that we’ve already seen, starting with a call to __iter__ and then repeated invocations of __next__. The loop exits when the iterator raises StopIteration.
Adding such methods to a class works when you’re creating your own new types. There are two other ways to create iterators in Python:
You can use a generator expression, which we’ve already seen and used. As you might remember, generator expressions look and work similarly to list comprehensions, except that you use round parentheses rather than square brackets. But unlike list comprehensions, which return lists that might consume a great deal of memory, generator expressions return one element at a time.
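As a quick reminder of the difference, here’s a small sketch comparing the two forms:

```python
# A list comprehension builds the entire list in memory, all at once.
squares_list = [n * n for n in range(5)]
print(squares_list)        # [0, 1, 4, 9, 16]

# The equivalent generator expression returns a generator object;
# nothing is computed until we ask for a value.
squares_gen = (n * n for n in range(5))
print(next(squares_gen))   # 0 -- computed on demand
print(list(squares_gen))   # [1, 4, 9, 16] -- the remaining elements
```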
You can use a generator function: something that looks like a function but, when executed, acts like an iterator. For example:
def foo():
    yield 1
    yield 2
    yield 3
When we execute foo, the function’s body doesn’t execute. Rather, we get a generator object back; that is, something that implements the iterator protocol. We can thus put it in a for loop:
g = foo()

for one_item in g:
    print(one_item)
This loop will print 1, 2, and 3. Why? Because with each iteration (i.e., each time we call next on g), the function executes through the next yield statement, returns the value it got from yield, and then goes to sleep, waiting for the next iteration. When the generator function exits, it automatically raises StopIteration, thus ending the loop.
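We can see this at work by calling next on the generator ourselves, rather than letting a for loop do it. A small sketch:

```python
def foo():
    yield 1
    yield 2

g = foo()
print(next(g))   # 1 -- runs the body up to the first yield
print(next(g))   # 2 -- resumes just after the first yield

try:
    next(g)      # the function body has ended...
except StopIteration:
    print('generator exhausted')
```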
Iterators are pervasive in Python because they’re so convenient--and in many ways, they’ve been made convenient because they’re pervasive. In this chapter, you’ll practice writing all of these types of iterators and getting a feel for when each of these techniques should be used.
The built-in enumerate function allows us to get not just the elements of a sequence, but also the index of each element, as in
for index, letter in enumerate('abc'):
    print(f'{index}: {letter}')
Create your own MyEnumerate class, such that someone can use it instead of enumerate. It will need to return a tuple with each iteration, with the first element in the tuple being the index (starting with 0) and the second element being the current element from the underlying data structure. Trying to use MyEnumerate with a noniterable argument will result in an error.
In this exercise, we know that our MyEnumerate class will take a single iterable object. With each iteration, we’ll get back not one of that argument’s elements, but rather a two-element tuple.
This means that at the end of the day, we’re going to need a __next__ method that will return a tuple. Moreover, it’ll need to keep track of the current index. Since __next__, like all methods and functions, loses its local scope between calls, we’ll need to store the current index in another place. Where? On the object itself, as an attribute.
Thus, our __init__ method will initialize two attributes: self.data, where we’ll store the object over which we’re iterating, and self.index, which will start with 0 and be incremented with each call to __next__. Our implementation of __iter__ will be the standard one that we’ve seen so far, namely return self.
Finally, __next__ checks to see if self.index has gone past the length of self.data. If so, then we raise StopIteration, which causes the for loop to exit.
class MyEnumerate():
    def __init__(self, data):                          ❶
        self.data = data                               ❷
        self.index = 0                                 ❸

    def __iter__(self):
        return self                                    ❹

    def __next__(self):
        if self.index >= len(self.data):               ❺
            raise StopIteration
        value = (self.index, self.data[self.index])    ❻
        self.index += 1                                ❼
        return value                                   ❽

for index, letter in MyEnumerate('abc'):
    print(f'{index} : {letter}')
❶ Initializes MyEnumerate with an iterable argument, “data”
❷ Stores “data” on the object as self.data
❸ Initializes self.index with 0
❹ Because our object will be its own iterator, returns self
❺ Are we at the end of the data? If so, then raises StopIteration.
❻ Sets the value to be a tuple, with the index and value
❼ Increments self.index for the next iteration
❽ Returns the tuple
You can work through a version of this code in the Python Tutor at http://mng.bz/JydQ.
Note that the Python Tutor sometimes displays an error message when StopIteration
is raised.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
Now that you’ve created a simple iterator class, let’s dig in a bit deeper:
- Rewrite MyEnumerate such that it uses a helper class (MyEnumerateIterator), as described in the “Discussion” section. In the end, MyEnumerate will have the __iter__ method that returns a new instance of MyEnumerateIterator, and the helper class will implement __next__. It should work the same way, but will also produce results if we iterate over it twice in a row.
- The built-in enumerate method takes a second, optional argument: an integer, representing the first index that should be used. (This is particularly handy when numbering things for nontechnical users, who believe that things should be numbered starting with 1, rather than 0.) Add this capability to MyEnumerate.
- Redefine MyEnumerate as a generator function, rather than as a class.
From the examples we’ve seen so far, it might appear as though an iterable simply goes through the elements of whatever data it’s storing and then exits. But an iterator can do anything it wants, and can return whatever data it wants, until the point when it raises StopIteration. In this exercise, we see just how that works.
Define a class, Circle, that takes two arguments when defined: a sequence and a number. The idea is that the object will then return elements the defined number of times. If the number is greater than the number of elements, then the sequence repeats as necessary. You should define the class such that it uses a helper (which I call CircleIterator). Here’s an example:
c = Circle('abc', 5)
print(list(c)) ❶

❶ Prints ['a', 'b', 'c', 'a', 'b']
In many ways, our Circle class is a simple iterator, going through each of its values. But we might need to provide more outputs than we have inputs, circling around to the beginning one or more times.
The trick here is to use the modulus operator (%), which returns the integer remainder from a division operation. Modulus is often used in programs to ensure that we can wrap around as many times as we need.
In this case, we’re retrieving from self.data, as per usual. But the element won’t be self.data[self.index], but rather self.data[self.index % len(self.data)].
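Here’s a tiny sketch showing how the modulus operator wraps a growing index back into a sequence’s valid range:

```python
# As the index grows past the end of the data, index % len(data)
# wraps it back to the beginning, over and over.
data = 'abc'
for index in range(5):
    print(index, data[index % len(data)])
# 0 a
# 1 b
# 2 c
# 3 a  -- wrapped around
# 4 b
```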
Since self.index will likely end up being bigger than len(self.data), we can no longer use that as a test for whether we should raise StopIteration. Rather, we’ll need to have a separate attribute, self.max_times, which tells us how many iterations we should execute.
Once we have all of this in place, the implementation becomes fairly straightforward. Our Circle class remains with only __init__ and __iter__, the latter of which returns a new instance of CircleIterator. Note that we have to pass both self.data and self.max_times to CircleIterator, and thus need to store them as attributes in our instance of Circle.
Our iterator then uses the logic we described in its __next__ method to return one element at a time, until we have returned self.max_times items.
class CircleIterator():
    def __init__(self, data, max_times):
        self.data = data
        self.max_times = max_times
        self.index = 0

    def __next__(self):
        if self.index >= self.max_times:
            raise StopIteration
        value = self.data[self.index % len(self.data)]
        self.index += 1
        return value

class Circle():
    def __init__(self, data, max_times):
        self.data = data
        self.max_times = max_times

    def __iter__(self):
        return CircleIterator(self.data, self.max_times)

c = Circle('abc', 5)
print(list(c))
You can work through a version of this code in the Python Tutor at http://mng.bz/wBjg.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
I hope you’re starting to see the potential for iterators, and how they can be written in a variety of ways. Here are some additional exercises to get you thinking about what those ways could be:
- Rather than write a helper, you could also define iteration capabilities in a class and then inherit from it. Reimplement Circle as a class that inherits from CircleIterator, which implements __init__ and __next__. Of course, the parent class will have to know what to return in each iteration; add a new attribute in Circle, self.returns, a list of attribute names that should be returned.
- Implement Circle as a generator function, rather than as a class.
- Implement a MyRange class that returns an iterator that works the same as range, at least in for loops. (Modern range objects have a host of other capabilities, such as being subscriptable. Don’t worry about that.) The class, like range, should take one, two, or three integer arguments.
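As a hint for the final item, here’s one possible sketch of a MyRange class. (It’s not the only reasonable design, and it skips the argument validation the real range performs.)

```python
# A sketch of MyRange, handling one, two, or three arguments
# the way the built-in range does.
class MyRange:
    def __init__(self, first, second=None, step=1):
        if second is None:                  # MyRange(stop)
            self.current, self.stop = 0, first
        else:                               # MyRange(start, stop[, step])
            self.current, self.stop = first, second
        self.step = step

    def __iter__(self):
        return self                         # we are our own iterator

    def __next__(self):
        if (self.step > 0 and self.current >= self.stop) or \
           (self.step < 0 and self.current <= self.stop):
            raise StopIteration
        value = self.current
        self.current += self.step
        return value

print(list(MyRange(3)))         # [0, 1, 2]
print(list(MyRange(2, 5)))      # [2, 3, 4]
print(list(MyRange(1, 10, 3)))  # [1, 4, 7]
```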
File objects, as we’ve seen, are iterators; when we put them in a for loop, each iteration returns the next line from the file. But what if we want to read through a number of files? It would be nice to have an iterator that goes through each of them.
In this exercise, I’d like you to create just such an iterator, using a generator function. That is, this generator function will take a directory name as an argument. With each iteration, the generator should return a single string, representing one line from one file in that directory. Thus, if the directory contains five files, and each file contains 10 lines, the generator will return a total of 50 strings: each of the lines from file 0, then each of the lines from file 1, then each of the lines from file 2, until it gets through all of the lines from file 4.
If you encounter a file that can’t be opened--because it’s a directory, because you don’t have permission to read from it, and so on--you should just ignore the problem altogether.
Let’s start the discussion by pointing out that if you really wanted to do this the right way, you would likely use the os.walk function (http://mng.bz/D2Ky), which goes through each of the files in a directory and then descends into its subdirectories. But we’ll ignore that and work to understand the all_lines generator function that I’ve created here.
First, we run os.listdir on path. This returns a list of strings. It’s important to remember that os.listdir only returns the filenames, not the full path of the file. This means that we can’t just open the filename; we need to combine path with the filename.
We could use str.join, or even just + or an f-string. But there’s a better approach, namely os.path.join (http://mng.bz/oPPM), which takes any number of parameters (thanks to the *args) and then joins them together with the value of os.sep, the directory-separation character for the current operating system. Thus, we don’t need to think about whether we’re on a Unix or Windows system; Python can do that work for us.
What if there’s a problem reading from the file? We then trap that with an except OSError clause, in which we have nothing more than pass. The pass keyword means that Python shouldn’t do anything; it’s needed because of the structure of Python’s syntax, which requires something indented following a colon. But we don’t want to do anything if an error occurs, so we use pass.
And if there’s no problem? Then we simply return the current line using yield. Immediately after the yield, the function goes to sleep, waiting for the next time a for loop invokes next on it.
Note Using except without specifying which exception you might get is generally frowned upon, all the more so if you pair it with pass. If you do this in production code, you’ll undoubtedly encounter problems at some point, and because you haven’t trapped specific exceptions or logged the errors, you’ll have trouble debugging the problem as a result. For a good (if slightly old) introduction to Python exceptions and how they should be used, see http://mng.bz/VgBX.
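If you do want to follow the note’s advice in your own code, the pattern might look something like this sketch, which traps a specific exception and logs it. (The function name and message here are my own, for illustration.)

```python
# A sketch of the more defensive pattern: trap a specific
# exception and log it, rather than a bare except with pass.
import logging

def read_first_line(filename):
    try:
        with open(filename) as f:
            return f.readline()
    except OSError as e:
        # Record what went wrong, so problems remain debuggable
        logging.warning('Skipping %s: %s', filename, e)
        return None

print(read_first_line('no-such-file.txt'))  # logs a warning, returns None
```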
import os

def all_lines(path):
    for filename in os.listdir(path):                   ❶
        full_filename = os.path.join(path, filename)    ❷
        try:
            for line in open(full_filename):            ❸
                yield line                              ❹
        except OSError:
            pass                                        ❺
❶ Gets a list of files in path
❷ Uses os.path.join to create a full filename that we’ll open
❸ Opens and iterates over each line in full_filename
❹ Returns the line using yield, needed in iterators
❺ Ignores file-related problems silently
The Python Tutor site doesn’t work with files, so there’s no link to it. But you could see all of the lines from all files in the /etc/ directory on your computer with
for one_line in all_lines('/etc/'):
    print(one_line)
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
If something you want to do as an iterator doesn’t align with an existing class but can be defined as a function, then a generator function will likely be a good way to implement it. Generator functions are particularly useful in taking potentially large quantities of data, breaking them down, and returning their output at a pace that won’t overwhelm the system. Here are some other problems you can solve using generator functions:
- Modify all_lines such that it doesn’t return a string with each iteration, but rather a tuple. The tuple should contain four elements: the name of the file, the current number of the file (from all those returned by os.listdir), the line number within the current file, and the current line.
- The current version of all_lines returns all of the lines from the first file, then all of the lines from the second file, and so forth. Modify the function such that it returns the first line from each file, and then the second line from each file, until all lines from all files are returned. When you finish printing lines from shorter files, ignore those files while continuing to display lines from the longer files.
- Modify all_lines such that it takes two arguments: a directory name and a string. Only those lines containing the string (i.e., for which you can say s in line) should be returned. If you know how to work with regular expressions and Python’s re module, then you could even make the match conditional on a regular expression.
Note In generator functions, we don’t need to explicitly raise StopIteration. That happens automatically when the generator reaches the end of the function. Indeed, raising StopIteration from within the generator is something that you should not do. If you want to exit from the function prematurely, it’s best to use a return statement. It’s not an error to use return with a value (e.g., return 5) from a generator function, but the value will be ignored. In a generator function, then, yield indicates that you want to keep the generator going and return a value for the current iteration, while return indicates that you want to exit completely.
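Here’s a small sketch demonstrating that behavior: return ends the generator early, and the for loop (or list) simply stops, with no visible StopIteration. (The function here is my own example, not from the exercises.)

```python
# Yield the first n even numbers from data, then exit early
# with a plain return -- no explicit StopIteration needed.
def first_n_evens(data, n):
    count = 0
    for item in data:
        if count == n:
            return            # exit the generator prematurely
        if item % 2 == 0:
            yield item
            count += 1

print(list(first_n_evens(range(20), 3)))  # [0, 2, 4]
```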
Sometimes, the point of an iterator is not to change existing data, but rather to provide data in addition to what we previously received. Moreover, a generator doesn’t necessarily provide all of its values in immediate succession; it can be queried on occasion, whenever we need an additional value. Indeed, the fact that generators retain all of their state while sleeping between iterations means that they can just hang around, as it were, waiting until needed to provide the next value.
In this exercise, write a generator function whose argument must be iterable. With each iteration, the generator will return a two-element tuple. The first element in the tuple will be an integer indicating how many seconds have passed since the previous iteration. The tuple’s second element will be the next item from the passed argument.
Note that the timing should be relative to the previous iteration, not when the generator was first created or invoked. Thus the timing number in the first iteration will be 0.
You can use time.perf_counter, which returns the number of seconds since the program was started. You could use time.time, but perf_counter is considered more reliable for such purposes.
The solution’s generator function takes a single piece of data and iterates over it. However, it returns a two-element tuple for each item it returns, in which the first element is the time since the previous iteration ran.
For this to work, we need to always know when the previous iteration was executed. Thus, we always calculate and set last_time before we yield the current values of delta and item.
However, we need to have a value for delta the first time we get a result back. This should be 0. To get around this, we set last_time to None at the top of the function. Then, with each iteration, we calculate delta to be the difference between current_time and last_time or current_time. If last_time is None, then we’ll get the value of current_time, and delta will be 0. This should only occur once; after the first iteration, last_time will never be None.
Normally, invoking a function multiple times means that the local variables are reset with each invocation. However, a generator function works differently: it’s only invoked once, and thus has a single stack frame. This means that the local variables, including parameters, retain their values across calls. We can thus set such values as last_time and use them in future iterations.
import time

def elapsed_since(data):
    last_time = None                                        ❶
    for item in data:
        current_time = time.perf_counter()                  ❷
        delta = current_time - (last_time or current_time)  ❸
        last_time = time.perf_counter()
        yield (delta, item)                                 ❹

for t in elapsed_since('abcd'):
    print(t)
    time.sleep(2)
❶ Initializes last_time with None
❷ Gets the current time via time.perf_counter
❸ Calculates the delta between the last time and now
❹ Yields a tuple with the delta and the current item
You can work through a version of this code in the Python Tutor at http://mng.bz/qMjz.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
In this exercise, we saw how we can combine user-supplied data with additional information from the system. Here are some more exercises you can try to get additional practice writing such generator functions:
- The existing function elapsed_since reported how much time passed between iterations. Now write a generator function that takes two arguments: a piece of data and a minimum amount of time that must elapse between iterations. If the next element is requested via the iterator protocol (i.e., next), and the time elapsed since the previous iteration is greater than the user-defined minimum, then the value is returned. If not, then the generator uses time.sleep to wait until the appropriate amount of time has elapsed.
- Write a generator function, file_usage_timing, that takes a single directory name as an argument. With each iteration, we get a tuple containing not just the current filename, but also the three reports that we can get about a file’s most recent usage: its access time (atime), modification time (mtime), and creation time (ctime). Hint: all are available via the os.stat function.
- Write a generator function that takes two arguments: an iterable and a function. With each iteration, the function is invoked on the current element. If the result is True, then the element is returned as is. Otherwise, the next element is tested, until the function returns True. Alternative: implement this as a regular function that returns a generator expression.
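As a hint for the final item, here’s one possible sketch, in both the generator-function form and the generator-expression alternative. (The function names are my own.)

```python
# A sketch of a filtering generator: yield only those elements
# for which func returns True.
def my_filter(data, func):
    for item in data:
        if func(item):
            yield item

print(list(my_filter(range(10), lambda n: n % 3 == 0)))  # [0, 3, 6, 9]

# The alternative: a regular function returning a generator expression
def my_filter_expr(data, func):
    return (item for item in data if func(item))

print(list(my_filter_expr(range(10), lambda n: n % 3 == 0)))  # [0, 3, 6, 9]
```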
As you can imagine, iterator patterns tend to repeat themselves. For this reason, Python comes with the itertools module (http://mng.bz/NK4E), which makes it easy to create many types of iterators. The classes in itertools have been optimized and debugged across many projects, and often include features that you might not have considered. It’s definitely worth keeping this module in the back of your mind for your own projects.
One of my favorite objects in itertools is called chain. It takes any number of iterables as arguments and then returns each of their elements, one at a time, as if they were all part of a single iterable. For example:
from itertools import chain

for one_item in chain('abc', [1,2,3], {'a':1, 'b':2}):
    print(one_item)
a
b
c
1
2
3
a
b
The final 'a' and 'b' come from the dict we passed, since iterating over a dict returns its keys.
While itertools.chain is convenient and clever, it’s not that hard to implement. For this exercise, that’s precisely what you should do: implement a generator function called mychain that takes any number of arguments, each of which is an iterable. With each iteration, it should return the next element from the current iterable, or the first element from the subsequent iterable, unless you’re at the end, in which case it should exit.
It’s true that you could create this as a Python class that implements the iterator protocol, with __iter__ and __next__. But, as you can see, the code is so much simpler, easier to understand, and more elegant when we use a generator function.
Our function takes *args as a parameter, meaning that args will be a tuple when our function executes. Because it’s a tuple, we can iterate over its elements, no matter how many there might be.
We’ve stated that each argument passed to mychain should be iterable, which means that we should be able to iterate over those arguments as well. Then, in the inner for loop, we simply yield the current item. This returns the current value to the caller, but also holds onto the current place in the generator function. Thus, the next time we invoke __next__ on our iteration object, we’ll get the next item in the series.
def mychain(*args):          ❶
    for arg in args:         ❷
        for item in arg:     ❸
            yield item

for one_item in mychain('abc', [1,2,3], {'a':1, 'b':2}):
    print(one_item)
❶ args is a tuple of iterables
❷ Iterates over each of the iterable arguments
❸ Loops over each element of each iterable, and yields it
You can work through a version of this code in the Python Tutor at http://mng.bz/7Xv4.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
In this exercise, we saw how we can better understand some built-in functionality by reimplementing it ourselves. In particular, we saw how we can create our own version of itertools.chain as a generator function. Here are some additional challenges you can solve using generator functions:
- The built-in zip function returns an iterator that, given iterable arguments, returns tuples taken from those arguments’ elements. The first iteration will return a tuple from the arguments’ index 0, the second iteration will return a tuple from the arguments’ index 1, and so on, stopping when the shortest of the arguments ends. Thus zip('abc', [10, 20, 30]) returns the iterator equivalent of [('a', 10), ('b', 20), ('c', 30)]. Write a generator function that reimplements zip in this way.
- Reimplement the all_lines function from exercise 49 using mychain.
- In the “Beyond the exercise” section for exercise 48, you implemented a MyRange class, which mimics the built-in range class. Now do the same thing, but using a generator expression.
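As a hint for the first item, here’s one possible sketch of a zip reimplementation. (It’s not the only approach, and it skips the optional strict-length checking that modern zip supports.)

```python
# A sketch of zip as a generator function, stopping as soon as
# the shortest iterable is exhausted.
def myzip(*args):
    iterators = [iter(arg) for arg in args]
    if not iterators:      # no arguments means nothing to yield
        return
    while True:
        result = []
        for i in iterators:
            try:
                result.append(next(i))
            except StopIteration:   # shortest iterable is done
                return
        yield tuple(result)

print(list(myzip('abc', [10, 20, 30, 40])))  # [('a', 10), ('b', 20), ('c', 30)]
```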
In this chapter, we looked at the iterator protocol and how we can both implement and use it in a variety of ways. While we like to say that there’s only one way to do things in Python, you can see that there are at least three different ways to create an iterator:

- Writing a class that implements the __iter__ and __next__ methods
- Writing a generator function, whose body contains one or more yield statements
- Writing a generator expression, which looks like a list comprehension in round parentheses
The iterator protocol is both common and useful in Python. By now, it’s a bit of a chicken-and-egg situation--is it worth adding the iterator protocol to your objects because so many programs expect objects to support it? Or do programs use the iterator protocol because so many programs support it? The answer might not be clear, but the implications are. If you have a collection of data, or something that can be interpreted as a collection, then it’s worth adding the appropriate methods to your class. And if you’re not creating a new class, you can still take advantage of iterables with generator functions and expressions.
After doing the exercises in this chapter, I hope that you can see how to do the following:
Add the iterator protocol to a class via a helper iterator class
Write generator functions that filter, modify, and add to iterators that you would otherwise have created or used
Use generator expressions for greater efficiency than list comprehensions
Congratulations! You’ve reached the end of the book, which (if you’re not peeking ahead) means that you’ve finished a large number of Python exercises. As a result, your Python has improved in a few ways.
First, you’re now more familiar with Python syntax and techniques. Like someone learning a foreign language, you might previously have had the vocabulary and grammar structures in place, but now you can express yourself more fluently. You don’t need to think quite as long when deciding what word to choose. You won’t be using constructs that work but are considered un-Pythonic.
Second, you’ve seen enough different problems, and used Python to solve them, that you now know what to do when you encounter new problems. You’ll know what questions to ask, how to break the problems down into their elements, and what Python constructs will best map to your solutions. You’ll be able to compare the trade-offs between different options and then integrate the best ones into your code.
Third, you’re now more familiar with Python’s way of doing things and the vocabulary that the language uses to describe them. This means that the Python documentation, as well as the community’s ecosystem of blogs, tutorials, articles, and videos, will be more understandable to you. The descriptions will make more sense, and the examples will be more powerful.
In short, being more fluent in Python means being able to write better code in less time, while keeping it readable and Pythonic. It also means being able to learn more as you continue on your path as a developer.
I wish you the best of success in your Python career and hope that you’ll continue to find ways to practice your Python as you move forward.