At this point in the book, you have been exposed to a fairly complete survey of the more formal aspects of the language (the syntax, the data types, etc.). In this chapter, we’ll “step out of the classroom” by looking at a set of basic computing tasks and examining how Python programmers typically solve them, hopefully helping you ground the theoretical knowledge with concrete results.
Python programmers don’t like to reinvent wheels when they already have access to nice, round wheels in their garage. Thus, the most important content in this chapter is the description of selected tools that make up the Python standard library—built-in functions, library modules, and their most useful functions and classes. While you most likely won’t use all of these in any one program, no useful program avoids all of these. Just as Python provides a list object type because sequence manipulations occur in all programming contexts, the library provides a set of modules that will come in handy over and over again. Before designing and writing any piece of generally useful code, check to see if a similar module already exists. If it’s part of the standard Python library, you can be assured that it’s been heavily tested; even better, others are committed to fixing any remaining bugs—for free.
The goal of this chapter is to expose you to a lot of different tools, so that you know that they exist, rather than to teach you everything you need to know in order to use them. There are very good sources of complementary knowledge once you’ve finished this book. If you want to explore more of the standard library, the definitive reference is the Python Library Reference, currently over 600 pages long. It is the ideal companion to this book; it provides the completeness we don’t have the room for, and, being available online, is the most up-to-date description of the standard Python toolset. Three other O’Reilly books provide excellent additional information: the Python Pocket Reference, written by Mark Lutz, which covers the most important modules in the standard library, along with the syntax and built-in functions in compact form; Fredrik Lundh’s Python Standard Library, which takes on the formidable task of both providing additional documentation for each module in the standard library as well as providing an example program showing how to use each module; and finally, Alex Martelli’s Python in a Nutshell provides a thorough yet eminently readable and concise description of the language and standard library. As we’ll see in Section 27.1, Python comes with tools that make self-learning easy as well.
Just as we can’t cover every standard module, the set of tasks covered in this chapter is necessarily limited. If you want more, check out the Python Cookbook (O’Reilly), edited by David Ascher and Alex Martelli. This Cookbook covers many of the same problem domains we touch on here but in much greater depth and with much more discussion. That book, leveraging the collective knowledge of the Python community, provides a much broader and richer survey of Pythonic approaches to common tasks.
This chapter limits itself to tools available as part of standard Python distributions. The next two chapters expand the scope to third party modules and libraries, since many such modules can be just as valuable to the Python programmer.
This chapter starts by covering common tasks which apply to fundamental programming concepts—types, data structures, strings, moving on to conceptually higher-level topics like files and directories, Internet-related operations and process launching before finishing with some nonprogramming tasks such as testing, debugging, and profiling.
Before digging into specific tasks, we should say a brief word about self-exploration. We have not been exhaustive in coverage of object attributes or module contents in order to focus on the most important aspects of the objects under discussion. If you’re curious about what we’ve left out, you can look it up in the Library Reference, or you can poke around in the Python interactive interpreter, as shown in this section.
The dir built-in function returns a list of all of the attributes of an object, and, along with the type built-in, provides a great way to learn about the objects you’re manipulating. For example:
>>> dir([ ])                  # What are the attributes of lists?
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
'__delslice__', '__doc__', '__eq__', '__ge__', '__getattribute__', '__getitem__',
'__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__',
'__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__',
'__repr__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__str__',
'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
What this tells you is that the empty list object has a few methods: append, count, extend, index, insert, pop, remove, reverse, sort, and a lot of “special methods” that start with one underscore (_) or two (__). These are used under the hood by Python when performing operations like +. Since these special methods are not needed very often, we’ll write a simple utility function that will not display them:
>>> def mydir(obj):
...     orig_dir = dir(obj)
...     return [item for item in orig_dir if not item.startswith('_')]
...
>>>
Using this new function on the same empty list yields:
>>> mydir([ ])                # What are the attributes of lists?
['append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
You can then explore any Python object:
>>> mydir(( ))                # What are the attributes of tuples?
[ ]                           # Note: no "normal" attributes
>>> import sys
>>> mydir(sys.stdin)          # What are the attributes of files?
['close', 'closed', 'fileno', 'flush', 'isatty', 'mode', 'name', 'read',
'readinto', 'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate',
'write', 'writelines', 'xreadlines']
>>> mydir(sys)                # Modules are objects too.
['argv', 'builtin_module_names', 'byteorder', 'copyright', 'displayhook',
'dllhandle', 'exc_info', 'exc_type', 'excepthook', 'exec_prefix', 'executable',
'exit', 'getdefaultencoding', 'getrecursionlimit', 'getrefcount', 'hexversion',
'last_traceback', 'last_type', 'last_value', 'maxint', 'maxunicode', 'modules',
'path', 'platform', 'prefix', 'ps1', 'ps2', 'setcheckinterval', 'setprofile',
'setrecursionlimit', 'settrace', 'stderr', 'stdin', 'stdout', 'version',
'version_info', 'warnoptions', 'winver']
>>> type(sys.version)         # What kind of thing is 'version'?
<type 'string'>
>>> print repr(sys.version)   # What is the value of this string?
'2.3a1 (#38, Dec 31 2002, 17:53:59) [MSC v.1200 32 bit (Intel)]'
Recent versions of Python also contain a built-in that is very helpful to beginners, named (appropriately enough) help:
>>> help(sys)
Help on built-in module sys:
NAME
sys
FILE
(built-in)
DESCRIPTION
This module provides access to some objects used or maintained by the
interpreter and to functions that interact strongly with the
interpreter.
Dynamic objects:
argv--command line arguments; argv[0] is the script pathname if known
path--module search path; path[0] is the script directory, else ''
modules--dictionary of loaded modules
displayhook--called to show results in an interactive session
excepthook--called to handle any uncaught exception other than
SystemExit
To customize printing in an interactive session or to install a
custom top-level exception handler, assign other functions to replace
these.
...
There is quite a lot to the online help system. We recommend that you start it first in its “modal” state, just by typing help( ). From then on, any string you type will yield its documentation. Type quit to leave the help mode.
>>> help( )

Welcome to Python 2.2!  This is the online help utility.
...
help> socket
Help on module socket:

NAME
    socket

FILE
    c:\python22\lib\socket.py

DESCRIPTION
    This module provides socket operations and some related functions.
    On Unix, it supports IP (Internet Protocol) and Unix domain sockets.
    On other systems, it only supports IP.

    Functions:

    socket( ) -- create a new socket object
    fromfd( ) -- create a socket object from an open file descriptor [*]
    gethostname( ) -- return the current hostname
    gethostbyname( ) -- map a hostname to its IP number
    gethostbyaddr( ) -- map an IP number or hostname to DNS info
    getservbyname( ) -- map a service name and a protocol name to a port
    ...
help> keywords

Here is a list of the Python keywords.  Enter any keyword to get more help.

and                 elif                global              or
assert              else                if                  pass
break               except              import              print
class               exec                in                  raise
continue            finally             is                  return
def                 for                 lambda              try
del                 from                not                 while

help> topics

Here is a list of available topics.  Enter any topic name to get more help.

ASSERTION           DEBUGGING           LITERALS            SEQUENCEMETHODS1
ASSIGNMENT          DELETION            LOOPING             SEQUENCEMETHODS2
ATTRIBUTEMETHODS    DICTIONARIES        MAPPINGMETHODS      SEQUENCES
ATTRIBUTES          DICTIONARYLITERALS  MAPPINGS            SHIFTING
AUGMENTEDASSIGNMENT ELLIPSIS            METHODS             SLICINGS
BACKQUOTES          EXCEPTIONS          MODULES             SPECIALATTRIBUTES
BASICMETHODS        EXECUTION           NAMESPACES          SPECIALIDENTIFIERS
BINARY              EXPRESSIONS         NONE                SPECIALMETHODS
BITWISE             FILES               NUMBERMETHODS       STRINGMETHODS
BOOLEAN             FLOAT               NUMBERS             STRINGS
CALLABLEMETHODS     FORMATTING          OBJECTS             SUBSCRIPTS
CALLS               FRAMEOBJECTS        OPERATORS           TRACEBACKS
CLASSES             FRAMES              PACKAGES            TRUTHVALUE
CODEOBJECTS         FUNCTIONS           POWER               TUPLELITERALS
COERCIONS           IDENTIFIERS         PRECEDENCE          TUPLES
COMPARISON          IMPORTING           PRINTING            TYPEOBJECTS
COMPLEX             INTEGER             PRIVATENAMES        TYPES
CONDITIONAL         LISTLITERALS        RETURNING           UNARY
CONVERSIONS         LISTS               SCOPING             UNICODE

help> TYPES

3.2 The standard type hierarchy

Below is a list of the types that are built into Python.  Extension
modules written in C can define additional types.  Future versions of
Python may add types to the type hierarchy (e.g., rational numbers,
efficiently stored arrays of integers, etc.).
...
help> quit
>>>
While we’ve covered data types, one of the common issues when dealing with any type system is how one converts from one type to another. These conversions happen in a myriad of contexts—reading numbers from a text file, computing integer averages, interfacing with functions that expect different types than the rest of an application, etc.
We’ve seen in previous chapters that we can create a string from a nonstring object by simply passing the nonstring object to the str string constructor. Similarly, unicode converts any object to its Unicode string form and returns it.[1] In addition to the string creation functions, we’ve seen list and tuple, which take sequences and return list and tuple versions of them, respectively. int, complex, float, and long take any number and convert it to their respective types.
int, long, and float have additional features that can be confusing. First, int and long truncate their numeric arguments, if necessary, to perform the operation, thereby losing information and performing a conversion that may not be what you want (the round built-in rounds numbers the standard way and returns a float). Second, int, long, and float can also convert strings to their respective types, provided the strings are valid integer (or long, or float) literals. Literals are the text strings that are converted to numbers early in the Python compilation process. So, the string 1244 in your Python program file (which is necessarily a string) is a valid integer literal, but def foo( ): isn’t.
>>> int(1.0), int(1.4), int(1.9), round(1.9), int(round(1.9))
(1, 1, 1, 2.0, 2)
>>> int("1")
1
>>> int("1.2")                      # This doesn't work.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: invalid literal for int( ): 1.2
What’s a little odd is that the rule about conversion (if it’s a valid integer literal) is more important than the feature about truncating numeric arguments, thus:
>>> int("1.0")                      # Neither does this,
Traceback (most recent call last):  # since 1.0 is also not a valid
  File "<stdin>", line 1, in ?      # integer literal.
ValueError: invalid literal for int( ): 1.0
Given the behavior of int, it may make sense in some cases to use a custom variant that does only conversion, refusing to truncate:
>>> def safeint(candidate):
...     converted = float(candidate)
...     rounded = round(converted)
...     if converted == rounded:
...         return int(converted)
...     else:
...         raise ValueError, "%s would lose precision when cast" % candidate
...
>>> safeint(3.0)
3
>>> safeint("3.0")
3
>>> safeint(3.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 8, in safeint
ValueError: 3.1 would lose precision when cast
Converting numbers to strings can be done in a variety of ways. In addition to using str or unicode, one can use hex or oct, which take integers (whether int or long) as arguments and return string representations of them in hexadecimal or octal format, respectively.
>>> hex(1000), oct(1000)
('0x3e8', '01750')
The abs built-in returns the absolute value of scalars (integers, longs, floats) and the magnitude of complex numbers (the square root of the sum of the squared real and imaginary parts):
>>> abs(-1), abs(-1.2), abs(-3+4j)
(1, 1.2, 5.0) # 5 is sqrt(3*3 + 4*4).
The ord and chr functions convert single characters to their ASCII values and vice versa:
>>> map(ord, "test")                # Remember that strings are sequences
[116, 101, 115, 116]                # of characters, so map can be used.
>>> chr(64)
'@'
>>> ord('@')
64
>>> map(chr, (83, 112, 97, 109, 33))   # map returns a list of single
['S', 'p', 'a', 'm', '!']              # characters, so it needs to be
                                       # "joined" into a str.
>>> [chr(x) for x in (83, 112, 97, 109, 33)]   # Can also be spelled using
['S', 'p', 'a', 'm', '!']                      # a list comprehension.
>>> ''.join([chr(x) for x in (83, 112, 97, 109, 33)])
'Spam!'
The cmp built-in returns a negative integer, 0, or a positive integer, depending on whether its first argument is less than, equal to, or greater than its second one. It’s worth emphasizing that cmp works with more than just numbers; it compares characters using their ASCII values, and sequences are compared by comparing their elements. Comparisons can raise exceptions, so the comparison function is not guaranteed to work on all objects, but all reasonable comparisons will work.[2] The comparison process used by cmp is the same as that used by the sort method of lists. It’s also used by the built-ins min and max, which return the smallest and largest elements of the objects they are called with, dealing reasonably with sequences:
>>> min("pif", "paf", "pof")        # When called with multiple arguments,
'paf'                               # return the appropriate one.
>>> min("ZELDA!"), max("ZELDA!")    # When called with a single sequence,
('!', 'Z')                          # return the min/max element of it.
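cmp was removed from the language in Python 3; if you want the same three-way result on a modern interpreter, it is easy to sketch by hand. The helper name three_way below is our own, not a standard function:

```python
def three_way(a, b):
    # Reproduces Python 2's cmp(a, b): negative, zero, or positive
    # depending on the ordering of the two arguments.
    return (a > b) - (a < b)

# Strings compare by character values, sequences element by element,
# just as described for cmp above.
print(three_way(1, 2))            # negative
print(three_way("paf", "pif"))    # negative: 'a' sorts before 'i'
print(three_way([1, 2], [1, 2]))  # zero: equal sequences
```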
Table 27-1 summarizes the built-in functions dealing with type conversions. Many of these can also be called with no argument to return a false value; for example, str( ) returns the empty string.
The vast majority of programs perform string operations. We’ve covered most of the properties and variants of string objects in Chapter 5, but there are two areas that we haven’t touched on thus far: the string module and regular expressions. As we’ll see, the first is simple and mostly a historical note, while the second is complex and powerful.
The string module is somewhat of a historical anomaly. If Python were being designed today, the string module would not exist; it is mostly a remnant of a less civilized age before everything was a first-class object. Nowadays, string objects have methods like split and join, which replace the functions that are still defined in the string module. The string module does define a convenient function, maketrans, used with the translate method of string objects to do string “mapping” operations. maketrans/translate is useful when you want to translate several characters in a string at once: for example, replacing all occurrences of the space character with an underscore, changing underscores to minus signs, and changing minus signs to plus signs. Doing so with repeated replace( ) calls is in fact quite tricky, but doing it with maketrans is trivial:
>>> import string
>>> conversion = string.maketrans(" _-", "_-+")
>>> input_string = "This is a two_part - one_part"
>>> input_string.translate(conversion)
'This_is_a_two-part_+_one-part'
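On later versions of Python, the same table-driven translation is spelled with the str.maketrans class method rather than the string module; a minimal sketch of the equivalent call:

```python
# str.maketrans builds the same space->underscore, underscore->minus,
# minus->plus mapping used in the example above.
conversion = str.maketrans(" _-", "_-+")
input_string = "This is a two_part - one_part"
print(input_string.translate(conversion))  # prints This_is_a_two-part_+_one-part
```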
In addition, the string module defines a few useful constants, which haven’t been implemented as string attributes yet. These are shown in Table 27-2.
Constant name    Value
digits           '0123456789'
octdigits        '01234567'
hexdigits        '0123456789abcdefABCDEF'
lowercase        'abcdefghijklmnopqrstuvwxyz'
uppercase        'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
letters          lowercase + uppercase
whitespace       the space, tab, newline, carriage return, vertical tab, and form feed characters
punctuation      !"#$%&'( )*+,-./:;<=>?@[\]^_`{|}~
The constants in Table 27-2 are useful to test whether specific characters fit a criterion; for example, x in string.whitespace returns true only if x is one of the whitespace characters. Note that the values given above aren’t always the values you’ll find: the definition of 'uppercase', for example, depends on the locale; if you’re running a French operating system, string.lowercase will include ç and ê.
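As a small illustration of such membership tests, here is a hypothetical classify helper built on the module’s constants (note that later Python versions renamed the locale-independent alphabetic constants to ascii_lowercase and ascii_uppercase):

```python
import string

def classify(ch):
    # Test a single character against the string module's constant pools.
    if ch in string.whitespace:
        return "whitespace"
    if ch in string.digits:
        return "digit"
    if ch in string.punctuation:
        return "punctuation"
    return "other"

print(classify(" "))   # whitespace
print(classify("7"))   # digit
print(classify("!"))   # punctuation
```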
If strings and their methods aren’t enough (and they do get clumsy in many perfectly normal use cases), Python provides a specialized string-processing tool in the form of a regular expression engine.
Regular expressions are strings that let you define complicated pattern matching and replacement rules for strings. The syntax for regular expressions emphasizes compact notation over mnemonic value. For example, the single character . means “match any single character.” The character + means “one or more of what just preceded me.” Table 27-3 lists some of the most commonly used regular expression symbols and their meanings in English. Describing the full set of regular expression tokens and their meanings would take quite a few pages; instead, we’ll cover a simple use case and walk through how to solve the problem using regular expressions.
Special character   Meaning
.                   Matches any character except newline by default
^                   Matches the start of the string
$                   Matches the end of the string
*                   “Any number of occurrences of what just preceded me”
+                   “One or more occurrences of what just preceded me”
|                   “Either the thing before me or the thing after me”
\w                  Matches any alphanumeric character
\d                  Matches any decimal digit
tomato              Matches the string tomato
Suppose you need to write a program to replace the strings “green pepper” and “red pepper” with “bell pepper” if and only if they occur together in a paragraph before the word “salad” and not if they are followed (with no space) by the string “corn.” Although the specific requirements are silly, the general kind (conditional replacement of subparts of text based on specific contextual constraints) is surprisingly common in computing. We will explain each step of the program that solves this task.
Assume that the file you need to process is called pepper.txt. Here’s an example of such a file:
This is a paragraph that mentions bell peppers multiple times. For one,
here is a red pepper and dried tomato salad recipe. I don't like to use
green peppers in my salads as much because they have a harsher flavor.

This second paragraph mentions red peppers and green peppers but not
the "s" word (s-a-l-a-d), so no bells should show up.

This third paragraph mentions red peppercorns and green peppercorns,
which aren't vegetables but spices (by the way, bell peppers really
aren't peppers, they're chilies, but would you rather have a good cook
or a good botanist prepare your salad?).
The first task is to open the file and read in the text:
file = open('pepper.txt')
text = file.read( )
We read the entire text at once and avoid splitting it into lines, since we will assume that paragraphs are defined by two consecutive newline characters. This is easy to do using the split method of strings:
paragraphs = text.split('\n\n')
At this point we’ve split the text into a list of paragraph strings, and all there is left to do is perform the actual replacement operation. Here’s where regular expressions come in:
import re
matchstr = re.compile(
    r"""\b(red|green)    # 'red' or 'green' starting new words,
    (\s+                 # followed by whitespace,
    pepper               # the word 'pepper',
    (?!corn)             # if not followed immediately by 'corn',
    (?=.*salad))""",     # and if followed at some point by 'salad'.
    re.IGNORECASE |      # Allow pepper, Pepper, PEPPER, etc.
    re.DOTALL |          # Allow dots to match newlines as well.
    re.VERBOSE)          # This allows the comments and the newlines above.
for paragraph in paragraphs:
    fixed_paragraph = matchstr.sub(r'bell\2', paragraph)
    print fixed_paragraph + '\n'
The first line is simple but key: all of Python’s regular expression smarts are in the re module. The re.compile statement is the hardest one; it creates a regular expression pattern, which is like a program (that’s the raw string), and compiles it. Such a pattern specifies two things: which parts of the strings we’re interested in, and how they should be grouped. Let’s go over these in turn. The re.compile( ) call takes a string (although the syntax of that string is quite particular) and returns an object called a compiled regular expression object, which corresponds to that string.
Defining which parts of the string we’re interested in is done by specifying a pattern of characters that defines a match. This is done by concatenating smaller patterns, each of which specifies a simple matching criterion (e.g., “match the string 'pepper',” “match one or more whitespace characters,” “don’t match 'corn',” etc.). We’re looking for the words “red” or “green” followed by the word “pepper” that is itself followed by the word “salad,” as long as “pepper” isn’t followed immediately by “corn.” Let’s take each line of the re.compile(...) expression in turn.
The first thing to notice about the string in the re.compile( ) is that it’s a “raw” string (the quotation marks are preceded by an r). Prepending such an r to a string (single- or triple-quoted) turns off the interpretation of the backslash characters within the string.[3] We could have used a regular string instead and written \\b instead of \b and \\s instead of \s. In this case, it makes little difference; for complicated regular expressions, raw strings allow for much clearer syntax than escaped backslashes.
The first line in the pattern is \b(red|green). \b stands for “the empty string, but only at the beginning or end of a word”; using it here prevents matches that have red or green as the final part of a word (as in “tired pepper”). The (red|green) pattern specifies an alternation: either 'red' or 'green'. Ignore the left parenthesis that follows for now. \s is a special symbol that means “any whitespace character,” and + means “one or more occurrences of whatever comes before me,” so, put together, \s+ means “one or more whitespace characters.” Then, pepper just means the string 'pepper'.
(?!corn) prevents matches of “patterns that have 'corn' at this point,” so we prevent the match on 'peppercorn'. Finally, (?=.*salad) says that for the pattern to match, it must be followed by any number of arbitrary characters (that’s what .* means), followed by the word salad. The ?= specifies that while the pattern should determine whether the match occurs, it shouldn’t be “used up” by the match process; it’s a subtle point that we won’t cover in detail here. At this point we’ve defined the pattern corresponding to the substring.
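The two lookaheads can be seen in isolation with a toy pattern of our own (this is not the book’s full expression, just the pepper(?!corn)(?=.*salad) core):

```python
import re

# (?!corn) refuses the match when 'corn' immediately follows 'pepper';
# (?=.*salad) demands that 'salad' appear somewhere later, without
# consuming any of the text it looks at.
pattern = re.compile(r"pepper(?!corn)(?=.*salad)")

print(bool(pattern.search("red pepper salad")))      # True
print(bool(pattern.search("red peppercorn salad")))  # False: blocked by 'corn'
print(bool(pattern.search("red pepper soup")))       # False: no 'salad' ahead
```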
Now, note that there are two more parentheses: the one before \s+ and the last one. What these two do is define a “group,” which starts after the red or green and goes to the end of the pattern. We’ll use that group in the next operation, the actual replacement. The three flags are joined by the | symbol (the bitwise “or” operation) to form the second argument to re.compile. These specify kinds of pattern matches. The first, re.IGNORECASE, says that the text comparisons should ignore whether the text and the match have similar or different cases. The second, re.DOTALL, specifies that the . character should match any character, including the newline character (that’s not the default behavior). The third, re.VERBOSE, allows us to insert extra newlines and # comments in the regular expression, making it easier to read and understand. We could have written the statement more compactly as:
matchstr = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))", re.I | re.S)
The actual replacement operation is done with the line:
fixed_paragraph = matchstr.sub(r'bell\2', paragraph)
We’re calling the sub method of the matchstr object. That object is a compiled regular expression object, meaning that some of the processing of the expression has already been done (in this case, outside the loop), thus speeding up the total program execution. We use a raw string again to write the first argument to the method. The \2 is a reference to group 2 in the regular expression (the second group of parentheses in the regular expression): in our case, everything starting with whitespace followed by 'pepper' and up to and including the word 'salad'. Therefore, this line means, “Replace the occurrences of the matched substring with the string that is 'bell' followed by whatever starts with whitespace followed by 'pepper' and goes up to the end of the matched string, throughout the paragraph string.”
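A stripped-down, self-contained version of the replacement (using the compact form of the pattern and a single sentence of our own) shows group 2 being pasted back in by \2:

```python
import re

# Group 1 is (red|green); group 2 runs from the whitespace through the
# lookaheads.  The replacement keeps group 2 and substitutes 'bell' for
# the color word.
matchstr = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))",
                      re.IGNORECASE | re.DOTALL)
text = "here is a red pepper and dried tomato salad recipe"
print(matchstr.sub(r"bell\2", text))  # prints: here is a bell pepper and dried tomato salad recipe
```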
So, does it work? The pepper.txt file had three paragraphs: the first satisfied the requirements of the match twice; the second didn’t, because it didn’t mention the word 'salad'; and the third didn’t, because the words 'red' and 'green' come before peppercorn, not pepper. As it was supposed to, our program (saved in a file called pepper.py) modifies only the first paragraph:
/home/David/book$ python pepper.py
This is a paragraph that mentions bell peppers multiple times. For
one, here is a bell pepper and dried tomato salad recipe. I don't like
to use bell peppers in my salads as much because they have a harsher
flavor.
This second paragraph mentions red peppers and green peppers but not
the "s" word (s-a-l-a-d), so no bells should show up.
This third paragraph mentions red peppercorns and green peppercorns,
which aren't vegetables but spices (by the way, bell peppers really
aren't peppers, they're chilies, but would you rather have a good cook
or a good botanist prepare your salad?).
This example, while artificial, shows how regular expressions can compactly express complicated matching rules. If this kind of problem occurs often in your line of work, mastering regular expressions can be a worthwhile investment of time and effort.
A more thorough coverage of regular expressions is beyond the scope of this book. Jeffrey Friedl provides excellent coverage of regular expressions in his book Mastering Regular Expressions (O’Reilly). This book is a must-have for anyone doing serious text processing. For the casual user, the descriptions in the Library Reference or Python in a Nutshell do the job most of the time. Be sure to use the re module, not the regex or regsub modules, which are deprecated (they probably won’t be around in a later version of Python):
>>> import regex
__main__:1: DeprecationWarning: the regex module is deprecated; please use the re
module
One of Python’s greatest features is that it provides the list, tuple, and dictionary built-in types. They are so flexible and easy to use that once you’ve grown used to them, you’ll find yourself reaching for them automatically. While we covered all of the operations on each data structure as we introduced them, now’s a good time to go over tasks that can apply to all data structures, such as how to make copies, sort objects, randomize sequences, etc. Many functions and algorithms (theoretical procedures describing how to implement a complex task in terms of simpler basic tasks) are designed to work regardless of the type of data being manipulated. It is therefore useful to know how to do generic things for all data types.
Making copies of objects is a reasonable task in many programming contexts. Often, the only kind of copy that’s needed is just another reference to an object, as in:
x = 'tomato'
y = x                    # y is now 'tomato'.
x = x + ' and cucumber'  # x is now 'tomato and cucumber', but y is unchanged.
Due to Python’s reference management scheme, the statement a = b doesn’t make a copy of the object referenced by b; instead, it makes a new reference to that same object. When the object being copied is an immutable object (e.g., a string), there is no real difference. When dealing with mutable objects like lists and dictionaries, however, sometimes a real, new copy of the object, not just a shared reference, is needed. How to do this depends on the type of the object in question. The simplest way of making a copy is to use the list( ) or tuple( ) constructors:
newList = list(myList)
newTuple = tuple(myTuple)
As opposed to the simplest, the most common way to make copies of sequences like lists and tuples is somewhat odd. If myList is a list, then to make a copy of it, you can use:
newList = myList[:]
which you can read as “slice from beginning to end,” since the default index for the start of a slice is the beginning of the sequence (0), and the default index for the end of a slice is the end of the sequence (see Chapter 3). Since tuples support the same slicing operation as lists, this same technique can also be applied to tuples, except that if x is a tuple, then x[:] is the same object as x, since tuples are immutable. Dictionaries, on the other hand, don’t support slicing. To make a copy of a dictionary myDict, you can either use:
newDict = myDict.copy( )
or the dict( ) constructor:
newDict = dict(myDict)
For a different kind of copying, if you have a dictionary oneDict and want to update it with the contents of a different dictionary otherDict, simply type oneDict.update(otherDict). This is the equivalent of:
for key in otherDict.keys( ):
    oneDict[key] = otherDict[key]
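A quick check (with throwaway dictionaries of our own) confirms that update( ) and the explicit loop agree, including the overwriting of shared keys:

```python
oneDict = {'a': 1, 'b': 2}
otherDict = {'b': 20, 'c': 30}

# The explicit loop...
looped = dict(oneDict)
for key in otherDict:
    looped[key] = otherDict[key]

# ...and the equivalent update() call.
updated = dict(oneDict)
updated.update(otherDict)

print(looped == updated)  # True
print(updated['b'])       # 20: the shared key was overwritten
```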
If oneDict shared some keys with otherDict before the update( ) operation, the old values associated with those keys in oneDict are obliterated by the update. This may be what you want to do (it usually is). If it isn’t, the right thing to do might be to raise an exception. To do this, make a copy of one dictionary, then loop over each entry in the second. If we find shared keys, we raise an exception; if not, we just add the key-value mapping to the new dictionary:
def mergeWithoutOverlap(oneDict, otherDict):
    newDict = oneDict.copy( )
    for key in otherDict:
        if key in oneDict:
            raise ValueError, "the two dictionaries share keys!"
        newDict[key] = otherDict[key]
    return newDict
or, alternatively, combine the values of the two dictionaries, with a tuple, for example. Using the same logic as in mergeWithoutOverlap, but combining the values instead of throwing an exception:
def mergeWithOverlap(oneDict, otherDict):
    newDict = oneDict.copy( )
    for key in otherDict:
        if key in oneDict:
            newDict[key] = oneDict[key], otherDict[key]
        else:
            newDict[key] = otherDict[key]
    return newDict
To illustrate the differences between the preceding three algorithms, consider the following two dictionaries:
phoneBook1 = {'michael': '555-1212', 'mark': '554-1121', 'emily': '556-0091'}
phoneBook2 = {'latoya': '555-1255', 'emily': '667-1234'}
If phoneBook1 is possibly out of date, and phoneBook2 is more up to date but less complete, the right usage is probably phoneBook1.update(phoneBook2). If the two phone books are supposed to have nonoverlapping sets of keys, using newBook = mergeWithoutOverlap(phoneBook1, phoneBook2) lets you know if that assumption is wrong. Finally, if one is a set of home phone numbers and the other a set of office phone numbers, chances are newBook = mergeWithOverlap(phoneBook1, phoneBook2) is what you want, as long as the subsequent code that uses newBook can deal with the fact that newBook['emily'] is the tuple ('556-0091', '667-1234').
The [:] and .copy( ) tricks will get you copies in 90% of the cases. If you are writing functions that, in true Python spirit, can deal with arguments of any type, it’s sometimes necessary to make copies of x, regardless of what x is. In comes the copy module. It provides two functions, copy and deepcopy. The first is just like the [:] sequence slice operation or the copy method of dictionaries. The second is more subtle and has to do with deeply nested structures (hence the term deepcopy). Take the example of copying a list (listOne) by slicing it from beginning to end using the [:] construct. This technique makes a new list that contains references to the same objects contained in the original list. If the contents of that original list are immutable objects, such as numbers or strings, the copy is as good as a “true” copy. However, suppose that the first element in listOne is itself a dictionary (or any other mutable object). The first element of the copy of listOne is a new reference to the same dictionary. So if you then modify that dictionary, the modification is evident in both listOne and the copy of listOne. An example makes it much clearer:
>>> import copy
>>> listOne = [{"name": "Willie", "city": "Providence, RI"}, 1, "tomato", 3.0]
>>> listTwo = listOne[:]                 # Or listTwo = copy.copy(listOne)
>>> listThree = copy.deepcopy(listOne)
>>> listOne.append("kid")
>>> listOne[0]["city"] = "San Francisco, CA"
>>> print listOne, listTwo, listThree
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0, 'kid']
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0]
[{'name': 'Willie', 'city': 'Providence, RI'}, 1, 'tomato', 3.0]
As you can see, modifying listOne directly modified only listOne. Modifying the first entry of the list referenced by listOne led to changes in listTwo, but not in listThree; that’s the difference between a shallow copy ([:]) and a deep copy. The copy module functions know how to copy all the built-in types that are reasonably copyable,[4] including classes and instances.
Lists have a sort method that does an in-place sort. Sometimes you want to iterate over the sorted contents of a list without disturbing those contents. Or you may want to list the sorted contents of a tuple. Because tuples are immutable, an operation such as sort, which modifies the tuple in place, is not allowed. The only solution is to make a list copy of the elements, sort the list copy, and work with the sorted copy, as in:
listCopy = list(myTuple)
listCopy.sort()
for item in listCopy:
    print item              # Or whatever needs doing
This solution is also the way to deal with data structures that have no inherent order, such as dictionaries. One of the reasons that dictionaries are so fast is that the implementation reserves the right to change the order of the keys in the dictionary. It’s really not a problem, however, given that you can iterate over the keys of a dictionary using an intermediate copy of the keys of the dictionary:
keys = myDict.keys()        # Returns an unsorted list of the keys in the dict.
keys.sort()
for key in keys:            # Print key/value pairs sorted by key.
    print key, myDict[key]
The sort method on lists uses the standard Python comparison scheme. Sometimes, however, that scheme isn’t what’s needed, and you need to sort according to some other procedure. For example, when sorting a list of words, case (lower versus UPPER) may not be significant. The standard comparison of text strings, however, says that all uppercase letters come before all lowercase letters, so 'Baby' is less than 'apple', but 'baby' is greater than 'apple'. In order to do a case-independent sort, you need to define a comparison function that takes two arguments, and returns -1, 0, or 1 depending on whether the first argument is smaller than, equal to, or greater than the second argument. So, for case-independent sorting, you can use:
>>> def caseIndependentSort(something, other):
...     something, other = something.lower(), other.lower()
...     return cmp(something, other)
...
>>> testList = ['this', 'is', 'A', 'sorted', 'List']
>>> testList.sort()
>>> print testList
['A', 'List', 'is', 'sorted', 'this']
>>> testList.sort(caseIndependentSort)
>>> print testList
['A', 'is', 'List', 'sorted', 'this']
We’re using the built-in function cmp, which does the hard part of figuring out that 'a' comes before 'b', 'b' before 'c', etc. Our sort function simply converts both items to lowercase and compares the lowercase versions. Also note that the conversion to lowercase is local to the comparison function, so the elements in the list aren’t modified by the sort.
What about randomizing a sequence, such as a list of lines? The easiest way to randomize a sequence is to call the shuffle function in the random module, which randomizes a sequence in place:[5]
random.shuffle(myList)
If you need to shuffle a nonlist object, it’s usually easiest to convert that object to a list and shuffle the list version of the same data, rather than come up with a new strategy for each data type. This might seem a wasteful strategy, given that it involves building intermediate lists that might be quite large. In general, however, what seems large to you probably won’t seem so to the computer, thanks to the reference system. Also, consider the time saved by not having to come up with a different strategy for each data type! Python is designed to save programmer time; if that means running a slightly slower or bigger program, so be it. If you’re handling enormous amounts of data, it may be worthwhile to optimize. But never optimize until the need for optimization is clear; that would be a waste of your time.
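The convert-to-list-and-shuffle strategy looks like this in practice. A short sketch, scrambling the characters of a string (the word chosen is arbitrary):

```python
import random

word = 'sequence'
letters = list(word)            # Make a list copy of the string's characters.
random.shuffle(letters)         # Shuffle the list in place.
scrambled = ''.join(letters)    # Rebuild a string from the shuffled list.

# Same characters, possibly in a new order.
assert sorted(scrambled) == sorted(word)
```

The same pattern works for tuples or any other sequence: list() makes the mutable copy, shuffle randomizes it, and a constructor or join puts the result back in the original type if needed.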
This chapter emphasizes the silliness involved in reinventing wheels. This point is especially important when it comes to data structures. For example, Python lists and dictionaries might not be the lists and dictionaries or mappings you’re used to, but you should avoid designing your own data structure if these structures will suffice. The algorithms they use have been tested under wide ranges of conditions, and they’re fast and stable. Sometimes, however, the interface to these algorithms isn’t convenient for a particular task.
For example, computer science textbooks often describe algorithms in terms of other data structures such as queues and stacks. To use these algorithms, it may make sense to come up with a data structure that has the same methods as these data structures (such as pop and push for stacks or enqueue/dequeue for queues). However, it also makes sense to reuse the built-in list type in the implementation of a stack. In other words, you need something that acts like a stack but is based on a list. A simple solution is to use a class wrapper around a list. For a minimal stack implementation, you can do this:
class Stack:
    def __init__(self, data):
        self._data = list(data)
        self.push = self._data.append
        self.pop = self._data.pop
The following is simple to write, to understand, to read, and to use:
>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
>>> thingsToDo.push('do the dishes')
>>> print thingsToDo.pop()
do the dishes
>>> print thingsToDo.pop()
wash the kid
Two standard Python naming conventions are used in the Stack class above. The first is that class names start with an uppercase letter, to distinguish them from functions. The other is that the _data attribute starts with an underscore. This is a halfway point between public attributes (which don’t start with an underscore), private attributes (which start with two underscores; see Chapter 7), and Python-reserved identifiers (which both start and end with two underscores). What it means is that _data is an attribute of the class that shouldn’t be needed by clients of the class. The class designer expects such pseudo-private attributes to be used only by the class methods and by the methods of any eventual subclass.
The Stack class presented earlier does its minimal job just fine. It assumes a fairly minimal definition of what a stack is, specifically, something that supports just two operations, a push and a pop. Some of the features of lists would be nice to use, such as the ability to iterate over all the elements using the for ... in ... construct. While you could continue in the style of the previous class and delegate to the “inner” list object, at some point it makes more sense to simply reuse the implementation of list objects directly, through subclassing. In this case, you should derive a class from the list base class. The dict base class can also be used to create dictionary-like classes.
# Subclass the list class.
class Stack(list):
    push = list.append
This Stack is a subclass of the list class. The list class implements the pop method, among others. You don’t need to define your own __init__ method because list defines a perfectly good default. The push method is defined just by saying that it’s the same as list’s append method. Now we can do list-like things as well as stack-like things:
>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
>>> print thingsToDo                 # Inherited from list base class
['write to mom', 'invite friend over', 'wash the kid']
>>> thingsToDo.pop()
'wash the kid'
>>> thingsToDo.push('change the oil')
>>> for chore in thingsToDo:         # We can also iterate over the contents.
...     print chore
...
write to mom
invite friend over
change the oil
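The same subclassing trick works for dict. As a sketch of our own invention (CaseBlindDict is not a standard library type), here is a dictionary subclass that normalizes its keys to lowercase:

```python
class CaseBlindDict(dict):
    """A dict whose keys are matched without regard to case."""
    def __setitem__(self, key, value):
        dict.__setitem__(self, key.lower(), value)
    def __getitem__(self, key):
        return dict.__getitem__(self, key.lower())

phones = CaseBlindDict()
phones['Emily'] = '556-0091'
assert phones['emily'] == '556-0091'   # Any capitalization finds the entry.
assert phones['EMILY'] == '556-0091'
```

As with the Stack subclass, everything not overridden (len(), iteration, keys, and so on) comes from the dict base class for free.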
So far so good—we know how to create objects, we can convert between different data types, and we can perform various kinds of operations on them. In practice, however, as soon as one leaves the computer science classroom, one is faced with tasks that involve manipulating data that lives outside of the program and performing processes that are external to Python. That’s when it becomes very handy to know how to talk to the operating system, explore the filesystem, and read and modify files.
The os module provides a generic interface to the operating system’s most basic set of tools. Different operating systems have different behaviors, and this is true at the programming interface as well, which makes it hard to write so-called “portable” programs that run well regardless of the operating system. Having generic interfaces independent of the operating system helps, as does using an interpreted language like Python. The specific set of calls the os module defines depends on which platform you use. (For example, the permission-related calls are available only on platforms that support them, such as Unix and Windows.) Nevertheless, it’s recommended that you always use the os module, instead of the platform-specific versions of the module (called by such names as posix, nt, and mac). Table 27-4 lists some of the most often used functions in the os module.
When referring to files in the context of the os module, one is referring to filenames, not file objects.
With just these functions, you can find out a lot about the current state of the filesystem, as well as modify it:
>>> print os.getcwd()            # Where am I?
C:\Python22
>>> print os.listdir('.')        # What's here?
['DLLs', 'Doc', 'include', 'Lib', 'libs', 'License.txt', ...]
>>> os.chdir('Lib')              # Let's go explore the library.
>>> print os.listdir('.')        # What's here?
['aifc.py', 'anydbm.py', 'anydbm.pyc', 'asynchat.py', 'asyncore.py',
 'atexit.py', 'atexit.pyc', 'atexit.pyo', 'audiodev.py', 'base64.py', ...]
>>> os.remove('atexit.pyc')      # We can remove .pyc files safely.
>>>
There are many other functions in the os module; in fact, just about any function that’s part of the POSIX standard and widely available on most Unix and Unix-like platforms is supported by Python on Unix. The interfaces to these routines follow the POSIX conventions. You can retrieve and set UIDs, PIDs, and process groups; control nice levels; create pipes; manipulate file descriptors; fork processes; wait for child processes; send signals to processes; use the execv variants; etc. (If you don’t know what half of the words in this paragraph mean, don’t worry; you probably don’t need to.)
The os module also defines some important attributes that aren’t functions:
The os.name attribute defines the current version of the platform-specific operating-system interface. Registered values for os.name are 'posix', 'nt', 'dos', and 'mac'. It’s different from sys.platform, primarily in that it’s less specific: for example, Solaris and Linux will have the same value ('posix') for os.name, but different values of sys.platform.
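A quick way to see the difference is to print both values; the exact strings depend on the machine you run this on:

```python
import os, sys

print(os.name)        # The coarse answer: 'posix', 'nt', etc.
print(sys.platform)   # The specific answer: e.g. 'linux', 'darwin', 'win32'.
```

On a Linux box, for instance, os.name is 'posix' while sys.platform carries the finer-grained platform name.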
os.error defines an exception class used when calls in the os module raise errors. It’s the same thing as OSError, one of the built-in exception classes. When this exception is raised, the value of the exception object contains two variables. The first is the number corresponding to the error (known as errno), and the second is a string message explaining it (known as strerror):
>>> os.rmdir('nonexistent_directory')       # How it usually shows up
Traceback (innermost last):
  File "<stdin>", line 1, in ?
os.error: (2, 'No such file or directory')
>>> try:                                    # We can catch the error and take
...     os.rmdir('nonexistent directory')   # it apart.
... except os.error, value:
...     print value[0], value[1]
...
2 No such file or directory
The os.environ dictionary contains key/value pairs corresponding to the environment variables of the shell from which Python was started. Because this environment is inherited by the commands that are invoked using the os.system call, modifying the os.environ dictionary modifies the environment:
>>> print os.environ['SHELL']
/bin/sh
>>> os.environ['STARTDIR'] = 'MyStartDir'
>>> os.system('echo $STARTDIR')     # 'echo %STARTDIR%' on DOS/Win
MyStartDir                          # Printed by the shell
0                                   # Return code from echo
The os module also includes a set of strings that define portable ways to refer to directory-related parts of filename syntax, as shown in Table 27-5.
Attribute name | Meaning and values
os.curdir      | A string that denotes the current directory: '.' on Unix, DOS, and Windows
os.pardir      | A string that denotes the parent directory: '..' on Unix, DOS, and Windows
os.sep         | The character that separates pathname components: '/' on Unix, '\' on DOS and Windows
os.altsep      | An alternate character to sep when available; set to '/' on DOS and Windows, None otherwise
os.pathsep     | The character that separates path components: ':' on Unix, ';' on DOS and Windows
These strings are used by the functions in the os.path module, which manipulate file paths in portable ways (see Table 27-6). Note that the os.path module is an attribute of the os module, not a submodule of an os package; it’s imported automatically when the os module is loaded, and (unlike packages) you don’t need to import it explicitly. The outputs of the examples in Table 27-6 correspond to code run on a Windows or DOS machine. On another platform, the appropriate path separators would be used instead. A useful relevant bit of knowledge is that the forward slash (/) can be used safely in Windows to indicate directory traversal, even though the native separator is the backslash (\); Python and Windows both do the right thing with it.
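A few of the os.path functions in action; the path used here is made up for illustration, and the separator in the joined result depends on the platform the code runs on:

```python
import os.path

p = os.path.join('Lib', 'xml', 'dom.py')  # 'Lib/xml/dom.py' on Unix

print(os.path.split(p))     # (head, tail): directory part and final component
print(os.path.splitext(p))  # (root, extension) pair

assert os.path.basename(p) == 'dom.py'
assert os.path.splitext(p)[1] == '.py'
```

Because these functions build and take apart paths using os.sep internally, the same code runs unchanged on Unix, Windows, and DOS.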
The keen-eyed reader might have noticed that the os module, while it provides lots of file-related functions, doesn’t include a copy function. In DOS, copying a file is basically the same thing as opening one file in read/binary mode, reading all its data, opening a second file in write/binary mode, and writing the data to the second file. On Unix and Windows, making that kind of copy fails to copy the stat bits (permissions, modification times, etc.) associated with the file. On the Mac, that operation won’t copy the resource fork, which contains data such as icons and dialog boxes. In other words, copying files is just more complicated than one could reasonably believe. Nevertheless, often you can get away with a fairly simple function that works on Windows, DOS, Unix, and Mac, as long as you’re manipulating just data files with no resource forks. That function, called copyfile, lives in the shutil module. This module includes a few generally useful functions, shown in Table 27-7.
Function name        | Behavior
copyfile(src, dest)  | Makes a copy of the file src and calls it dest
copymode(src, dest)  | Copies mode information (permissions) from src to dest
copystat(src, dest)  | Copies all stat information (mode, utime) from src to dest
copy(src, dest)      | Copies data and mode information from src to dest
copy2(src, dest)     | Copies data and stat information from src to dest
copytree(src, dest)  | Copies a directory recursively using copy2
rmtree(path)         | Recursively deletes the directory indicated by path; if ignore_errors is true, errors are ignored
While the previous section lists common functions for working with files, many tasks require more than a single function call.
Let’s take a typical example: you have lots of files, all of which have a space in their name, and you’d like to replace the spaces with underscores. All you need is the os.curdir attribute (which returns an operating-system-specific string that corresponds to the current directory), the os.listdir function (which returns the list of filenames in a specified directory), and the os.rename function:
import os, sys

if len(sys.argv) == 1:                   # If no filenames are specified,
    filenames = os.listdir(os.curdir)    # use current dir;
else:                                    # otherwise, use files specified
    filenames = sys.argv[1:]             # on the command line.
for filename in filenames:
    if ' ' in filename:
        newfilename = filename.replace(' ', '_')
        print "Renaming", filename, "to", newfilename, "..."
        os.rename(filename, newfilename)
This program works fine, but it reveals a certain Unix-centrism. That is, if you call it with wildcards, such as:
python despacify.py *.txt
you find that on Unix machines, it renames all the files with names with spaces in them and that end with .txt. In a DOS-style shell, however, this won’t work because the shell normally used in DOS and Windows doesn’t convert from *.txt to the list of filenames; it expects the program to do it. This is called globbing, because the * is said to match a glob of characters. Luckily, Python helps us make the code portable.
The glob module exports a single function, also called glob, which takes a filename pattern and returns a list of all the filenames that match that pattern (in the current working directory):
import sys, glob
print sys.argv[1:]
sys.argv = [item for arg in sys.argv for item in glob.glob(arg)]
print sys.argv[1:]
Running this on Unix and DOS shows that on Unix, the Python glob didn’t do anything because the globbing was done by the Unix shell before Python was invoked, and in DOS, Python’s globbing came up with the same answer:
/usr/python/book$ python showglob.py *.py
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']

C:\python\book> python showglob.py *.py
['*.py']
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']
It’s worth looking at the list comprehension line in showglob.py (the line that reassigns sys.argv) and understanding exactly what happens there, especially if you’re new to the list comprehension concept (discussed in Chapter 14).
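That nested comprehension unrolls into two ordinary loops. The equivalent form below uses a stand-in expand function in place of glob.glob, so the example doesn't depend on the filesystem; the point is the order of the for clauses, which reads left to right like the loops:

```python
def expand(arg):
    # Stand-in for glob.glob: pretend each pattern matches two names.
    return [arg + '1', arg + '2']

args = ['a', 'b']

# The comprehension...
flat = [item for arg in args for item in expand(arg)]

# ...is equivalent to:
result = []
for arg in args:
    for item in expand(arg):
        result.append(item)

assert flat == result == ['a1', 'a2', 'b1', 'b2']
```

In showglob.py, the outer loop runs over the words of sys.argv and the inner loop runs over whatever each word globs to, flattening everything into one list.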
If you’ve ever written a shell script and needed to use intermediary files for storing the results of some intermediate stages of processing, you probably suffered from directory litter. You started out with 20 files called log_001.txt, log_002.txt, etc., and all you wanted was one summary file called log_sum.txt. In addition, you had a whole bunch of log_001.tmp, log_001.tm2, etc. files that, while they were labeled temporary, stuck around. To put order back into your directories, use temporary files in specific directories and clean them up afterwards.
To help with this temporary file management problem, Python provides a nice little module called tempfile that publishes two functions: mktemp() and TemporaryFile(). The former returns the name of a file not currently in use in a directory on your computer reserved for temporary files (such as /tmp on Unix or C:\TEMP on Windows). The latter returns a new file object directly. For example:
# Read input file
inputFile = open('input.txt', 'r')

import tempfile
# Create temporary file
tempFile = tempfile.TemporaryFile()              # We don't even need to
first_process(input=inputFile, output=tempFile)  # know the filename...

# Create final output file
outputFile = open('output.txt', 'w')
second_process(input=tempFile, output=outputFile)
Using tempfile.TemporaryFile() works well in cases where the intermediate steps manipulate file objects. One of its nice features is that when the file object is deleted, it automatically deletes the file it created on disk, thus cleaning up after itself. One important use of temporary files, however, is in conjunction with the os.system call, which means using a shell, hence using filenames, not file objects. For example, let’s look at a program that creates form letters and mails them to a list of email addresses (on Unix only):
import os, tempfile

formletter = """Dear %s,
I'm writing to you to suggest that ..."""            # etc.

myDatabase = [('Michael Jackson', '[email protected]'),
              ('Bill Gates', '[email protected]'),
              ('Bob', '[email protected]')]

for name, email in myDatabase:
    specificLetter = formletter % name
    tempfilename = tempfile.mktemp()
    tempFile = open(tempfilename, 'w')
    tempFile.write(specificLetter)
    tempFile.close()
    os.system('/usr/bin/mail %(email)s -s "Urgent!" < %(tempfilename)s' % vars())
    os.remove(tempfilename)
The first line in the for loop returns a customized version of the form letter based on the name it’s given. That text is then written to a temporary file that’s emailed to the appropriate email address using the os.system call. Finally, to clean up, the temporary file is removed.
The vars() function is a built-in function that returns a dictionary corresponding to the variables defined in the current local namespace. The keys of the dictionary are the variable names, and the values of the dictionary are the variable values. vars() comes in quite handy for exploring namespaces. It can also be called with an object as an argument (such as a module, a class, or an instance), and it will return the namespace of that object. Two other built-ins, locals() and globals(), return the local and global namespaces, respectively. In all three cases, modifying the returned dictionaries doesn’t guarantee any effect on the namespace in question, so view these as read-only and you won’t be surprised. You can see that the vars() call creates a dictionary that is used by the string interpolation mechanism; it’s thus important that the names inside the %(...)s bits in the string match the variable names in the program.
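A tiny demonstration of vars() feeding the %(name)s interpolation mechanism (the variable names here are invented for the example):

```python
user = 'Willie'
city = 'Providence'

namespace = vars()                    # Maps variable names to their values.
assert namespace['user'] == 'Willie'

# The %(...)s names must match variable names in the namespace.
message = '%(user)s lives in %(city)s.' % vars()
assert message == 'Willie lives in Providence.'
```

This is exactly what the form-letter program above relies on: email and tempfilename are looked up by name in the dictionary that vars() returns.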
The argv attribute of the sys module holds one set of inputs to the current program: the command-line arguments, more precisely a list of the words input on the command line, excluding the reference to Python itself if it exists. In other words, if you type at the shell:
csh> python run.py a x=3 foo
then when run.py starts, the value of the sys.argv attribute is ['run.py', 'a', 'x=3', 'foo']. The sys.argv attribute is mutable (after all, it’s just a list). Common usage involves iterating over the arguments of the Python program, that is, sys.argv[1:]; slicing from index 1 till the end gives all of the arguments to the program itself, but doesn’t include the name of the program (module) stored in sys.argv[0]. There are two modules that help you process command-line options. The first, an older module called getopt, is replaced in Python 2.3 by a similar but more powerful module called optparse. Check the library reference for further details on how to use them.
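As a taste, here is a minimal optparse sketch; the option names (-v/--verbose, -o/--output) are invented for illustration, and an explicit argument list is passed to parse_args so the example doesn't depend on how the script is launched:

```python
import optparse

parser = optparse.OptionParser()
parser.add_option('-v', '--verbose', action='store_true', default=False)
parser.add_option('-o', '--output', default='out.txt')

# Normally parse_args() reads sys.argv[1:]; here we supply a list directly.
options, args = parser.parse_args(['-v', '-o', 'log.txt', 'input1.txt'])

assert options.verbose is True
assert options.output == 'log.txt'
assert args == ['input1.txt']       # Leftover positional arguments.
```

optparse also generates a --help message from the option declarations for free, which is a big part of its appeal over hand-rolled sys.argv parsing.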
Experienced programmers will know that there are other inputs to a program, especially the standard input stream, with siblings for output and error messages. Python lets the programmer access and modify these through three file attributes in the sys module: sys.stdin, sys.stdout, and sys.stderr. Standard input is generally associated by the operating system with the user’s keyboard; standard output and standard error are usually associated with the console. The print statement in Python outputs to standard output (sys.stdout), while error messages such as exceptions are output on the standard error stream (sys.stderr). Python lets you modify these on the fly: you can redirect the output of a Python program to a file simply by assigning to sys.stdout:
sys.stdout = open('log.out', 'w')
After this line, any output will be written to the file log.out instead of showing up on the console. Note that if you don’t save it first, the reference to the “original” standard output stream is lost. It’s generally a good idea to save a reference before reallocating any of the standard streams, as in:
old_stdout = sys.stdout
sys.stdout = open('log.out', 'w')
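Here is the complete redirect-and-restore round trip as a runnable sketch, writing to a scratch file so it can be run safely anywhere:

```python
import os, sys, tempfile

logname = os.path.join(tempfile.mkdtemp(), 'log.out')

old_stdout = sys.stdout              # Save the original stream first.
sys.stdout = open(logname, 'w')
print('this goes to the file')
sys.stdout.close()
sys.stdout = old_stdout              # Restore the original stream.

print('this goes to the console again')
assert open(logname).read() == 'this goes to the file\n'
```

Forgetting the restore step is a classic mistake: everything the rest of the program prints, including tracebacks' companions on stdout, silently ends up in the log file.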
Why have a standard input stream? After all, it’s not that hard to type open('input.txt') in the program. The major argument for reading and writing with standard streams is that you can chain programs so that the standard output from one becomes the standard input of the next, with no file used in the transfer. This facility, known as piping, is at the heart of the Unix philosophy. Using standard I/O this way means that you can write a program to do a specific task once, and then use it to process files or the intermediate results of other programs at any time in the future. As an example, a simple program that counts the number of lines in a file could be written as:
import sys
data = sys.stdin.readlines()
print "Counted", len(data), "lines."
On Unix, you could test it by doing something like:
% cat countlines.py | python countlines.py
Counted 3 lines.
On Windows or DOS, you’d do:
C:\> type countlines.py | python countlines.py
Counted 3 lines.
You can get each line in a file simply by iterating over a file object. This comes in very handy when implementing simple filter operations. Here are a few examples of such filter operations.
# Show comment lines (lines that start with a #, like this one).
import sys
for line in sys.stdin:
    if line[0] == '#':
        print line,
Note that a final comma is added after the print statement to indicate that the print operation should not add a newline, which would result in double-spaced output since the line string already includes a newline character as its last character.
The last two programs can easily be combined using pipes to combine their power. To count the number of comment lines in commentfinder.py:
C:\> type commentfinder.py | python commentfinder.py | python countlines.py
Counted 1 lines.
Some other filtering tasks that take from standard input and write to standard output follow.
import sys
for line in sys.stdin:
    words = line.split()
    if len(words) >= 4:
        print words[3]
We look at the length of the words list to find out if there are indeed at least four words. The last two lines could also be replaced by the try/except statement, which is quite common in Python:
try:
    print words[3]
except IndexError:          # There aren't enough words.
    pass
import sys
for line in sys.stdin:
    words = line.split(':')
    if len(words) >= 4:
        print words[3].lower()
If iterating over all of the lines isn’t what you want, just use the readlines() or read() methods of file objects.
import sys
lines = sys.stdin.readlines()
sys.stdout.writelines(lines[:10])             # First 10 lines
sys.stdout.writelines(lines[-10:])            # Last 10 lines
for lineIndex in range(0, len(lines), 2):     # Get 0, 2, 4, ...
    sys.stdout.write(lines[lineIndex])        # Get the indexed line.
text = open(fname).read()
print text.count('Python')
In this more complicated example, the task is to transpose a file; imagine you have a file that looks like:
Name:   Willie  Mark  Guido  Mary  Rachel  Ahmed
Level:  5       4     3      1     6       4
Tag#:   1234    4451  5515   5124  1881    5132
And you really want it to look like the following instead:
Name:    Level:  Tag#:
Willie   5       1234
Mark     4       4451
...
You could use code like the following:
import sys
lines = sys.stdin.readlines()
wordlists = [line.split() for line in lines]
for row in zip(*wordlists):
    print ' '.join(row)
Of course, you should really use much more defensive programming techniques to deal with the possibility that not all lines have the same number of words in them, that there may be missing data, etc. Those techniques are task-specific and are left as an exercise to the reader.
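One hedged sketch of such defensiveness, padding short rows with a placeholder so that zip doesn't silently drop columns (the '-' filler and the transpose helper name are our own choices, not part of any library):

```python
def transpose(rows, filler='-'):
    # Pad every row to the length of the longest one, then zip them together.
    width = max(len(row) for row in rows)
    padded = [row + [filler] * (width - len(row)) for row in rows]
    return zip(*padded)

rows = [['Name:', 'Willie', 'Mark'],
        ['Level:', '5'],                 # Missing data on this row.
        ['Tag#:', '1234', '4451']]
for row in transpose(rows):
    print(' '.join(row))
```

Without the padding, zip would stop at the shortest row and Mark's tag number would vanish from the output; with it, the gap shows up explicitly as '-'.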
All the preceding examples assume you can read the entire file at once. In some cases, however, that’s not possible, for example, when processing really huge files on computers with little memory, or when dealing with files that are constantly being appended to (such as log files). In such cases, you can use a while/readline combination, where some of the file is read a bit at a time, until the end of file is reached. In dealing with files that aren’t line-oriented, you must read the file a character at a time:
# Read character by character.
import sys
while 1:
    next = sys.stdin.read(1)        # Read a one-character string
    if not next:                    # or an empty string at EOF.
        break
    # Process character 'next'.
Notice that the read() method on file objects returns an empty string at end of file, which breaks out of the while loop. Most often, however, the files you’ll deal with consist of line-based data and are processed a line at a time:
# Read line by line.
import sys
while 1:
    next = sys.stdin.readline()     # Read a one-line string
    if not next:                    # or an empty string at EOF.
        break
    # Process line 'next'.
Being able to read stdin is a great feature; it’s the foundation of the Unix toolset. However, one input is not always enough: many tasks need to be performed on sets of files. This is usually done by having the Python program parse the list of arguments sent to the script as command-line options. For example, if you type:
% python myScript.py input1.txt input2.txt input3.txt output.txt
you might think that myScript.py wants to do something with the first three input files and write a new file, called output.txt. Let’s see what the beginning of such a program could look like:
import sys

inputfilenames, outputfilename = sys.argv[1:-1], sys.argv[-1]
for inputfilename in inputfilenames:
    inputfile = open(inputfilename, "r")
    do_something_with_input(inputfile)
    inputfile.close()
outputfile = open(outputfilename, "w")
write_results(outputfile)
outputfile.close()
The second line extracts parts of the argv attribute of the sys module. Recall that it’s a list of the words on the command line that called the current program. It starts with the name of the script. So, in the example above, the value of sys.argv is:

['myScript.py', 'input1.txt', 'input2.txt', 'input3.txt', 'output.txt']
The script assumes that the command line consists of one or more input files and one output file. So the slicing of the input file names starts at 1 (to skip the name of the script, which isn’t an input to the script in most cases), and stops before the last word on the command line, which is the name of the output file. The rest of the script should be pretty easy to understand (but won’t work until you provide the do_something_with_input() and write_results() functions).
Note that the preceding script doesn’t actually read in the data from the files, but passes the file object down to a function to do the real work. A generic version of do_something_with_input() is:
def do_something_with_input(inputfile):
    for line in inputfile:
        process(line)
The combination of this idiom with the preceding one regarding opening each file in the sys.argv[1:] list is so common that there is a module, fileinput, to do just this task:
import fileinput
for line in fileinput.input():
    process(line)
The fileinput.input() call parses the arguments on the command line, and if there are no arguments to the script, uses sys.stdin instead. It also provides several useful functions that let you know which file and line number you’re currently manipulating, as we can see in the following script:
import fileinput, sys

# Take the first argument out of sys.argv and assign it to searchterm.
searchterm, sys.argv[1:] = sys.argv[1], sys.argv[2:]
for line in fileinput.input():
    num_matches = line.count(searchterm)
    if num_matches:                   # A nonzero count means there was a match.
        print "found '%s' %d times in %s on line %d." % (
            searchterm, num_matches, fileinput.filename(), fileinput.filelineno())
Running mygrep.py on a few Python files produces:
% python mygrep.py in *.py
found 'in' 2 times in countlines.py on line 2.
found 'in' 2 times in countlines.py on line 3.
found 'in' 2 times in mygrep.py on line 1.
found 'in' 4 times in mygrep.py on line 4.
found 'in' 2 times in mygrep.py on line 5.
found 'in' 2 times in mygrep.py on line 7.
found 'in' 3 times in mygrep.py on line 8.
found 'in' 3 times in mygrep.py on line 12.
A file is considered a binary file if it’s not a
text file or a file written in a format based on text, such as HTML
and XML. Image and sound files are prototypical examples of binary
files. A frequent question about file manipulation is
“How do I process binary files in
Python?” The answer to that question usually
involves the struct
module. It has a simple
interface, since it exports just three functions:
pack, unpack, and calcsize.
Let’s start with the task of decoding
a
binary file. Imagine a binary file
bindat.dat that contains data in a specific
format: first there’s a float corresponding to a
version number, then a long integer corresponding to the size of the
data, and then that many unsigned bytes making up the
actual data. The key to using the struct
module is
to define a format string, which corresponds to
the format of the data you wish to read, and find out which subset of
the file corresponds to that data. For example:
import struct
data = open('bindat.dat', 'rb').read( )   # binary mode, so no newline translation
start, stop = 0, struct.calcsize('fl')
version_number, num_bytes = struct.unpack('fl', data[start:stop])
start, stop = stop, stop + struct.calcsize('B'*num_bytes)
bytes = struct.unpack('B'*num_bytes, data[start:stop])
'f' is a format string for a single floating-point
number (a C float, to be precise), 'l' is for a
long integer, and 'B' is a format string for an
unsigned char. The available unpack format strings
are listed in Table 27-8. Consult the library
reference manual for usage details.
Format | C type | Python
x | pad byte | No value
c | char | String of length 1
b | signed char | Integer
B | unsigned char | Integer
h | short | Integer
H | unsigned short | Integer
i | int | Integer
I | unsigned int | Integer
l | long | Integer
L | unsigned long | Integer
f | float | Float
d | double | Float
s | char[] | String
p | char[] | String
P | void * | Integer
At this point, bytes is a tuple of
num_bytes Python integers. If we know that the
data is in fact storing characters, we could use
chars = map(chr, bytes). To be more efficient, we
could change the last unpack to use 'c' instead
of 'B', which would do the conversion and return a
tuple of num_bytes single-character strings. More
efficiently still, we could use a format string that specifies a
string of characters of a specified length, such as:
chars = struct.unpack(str(num_bytes)+'s', data[start:stop])
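The three alternatives just described can be compared side by side. This sketch uses byte strings, as required by current Python versions; in the Python 2 of this book, plain strings would appear instead:

```python
import struct

data = b'abcde'
ints = struct.unpack('B' * 5, data)    # tuple of small integers
chars = struct.unpack('c' * 5, data)   # tuple of length-1 strings
whole, = struct.unpack('5s', data)     # one 5-character string
```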
The packing operation (struct.pack) is the exact
converse; instead of taking a format string and a data string, and
returning a tuple of unpacked values, it takes a format string and a
variable number of arguments and packs those arguments using that
format string into a new packed string.
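A short round trip illustrates the symmetry, reusing the 'fl' header layout from the bindat.dat example (the payload here is made up, and byte strings are used as current Python requires):

```python
import struct

payload = b'hello'
# Pack a version number (float) and a byte count (long), then append
# the raw data -- the converse of the unpack calls shown earlier.
record = struct.pack('fl', 1.0, len(payload)) + payload

# Unpacking recovers the original values.
header_size = struct.calcsize('fl')
version, num_bytes = struct.unpack('fl', record[:header_size])
data, = struct.unpack(str(num_bytes) + 's',
                      record[header_size:header_size + num_bytes])
```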
Note that the struct
module can process data
that’s encoded with either kind of
byte-ordering,[6] thus
allowing you to write platform-independent binary file manipulation
code. For large files, also consider using the array module.
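For example, prefixing the format string with '<' or '>' selects little- or big-endian ordering explicitly (and the standard, alignment-free sizes), so the packed bytes come out identical on every platform:

```python
import struct

# The same integer packed with each byte ordering; with these
# prefixes, 'l' always occupies exactly four bytes.
little = struct.pack('<l', 1)
big = struct.pack('>l', 1)
```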
Python is used in a wide variety of Internet-related tasks, from making web servers to crawling the Web to “screen-scraping” web sites for data. This section briefly describes the modules most often used for such tasks that ship with Python’s core. For more detailed examples of their use, we recommend Lundh’s Python Standard Library and Martelli and Ascher’s Python Cookbook (O’Reilly). There are many third-party add-ons worth knowing about before embarking on a significant web- or Internet-related project.
Python programs often process forms from web pages. To make this task
easy, the standard Python distribution includes a module called
cgi
. Chapter 28 includes an
example of a Python script that uses the cgi module, so we
won’t cover it any further here.
Uniform resource locators (URLs) are strings such as http://www.python.org that are now
ubiquitous. Three modules—urllib,
urllib2, and
urlparse—provide tools for processing URLs.
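For instance, urlparse splits a URL into its components (scheme, network location, path, and so on). The import below is for current Python 3, where the module has moved into the urllib.parse package; in the Python 2 of this book you would write import urlparse instead:

```python
from urllib.parse import urlparse  # in Python 2: import urlparse

# Break a URL into named pieces.
parts = urlparse('http://www.python.org/doc/current/lib/lib.html')
```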
The urllib
module defines a few functions for
writing programs that must be active users of the Web (robots,
agents, etc.). These are listed in Table 27-9.