Chapter 27. Common Tasks in Python

At this point in the book, you have been exposed to a fairly complete survey of the more formal aspects of the language (the syntax, the data types, etc.). In this chapter, we’ll “step out of the classroom” by looking at a set of basic computing tasks and examining how Python programmers typically solve them, hopefully helping you ground the theoretical knowledge with concrete results.

Python programmers don’t like to reinvent wheels when they already have access to nice, round wheels in their garage. Thus, the most important content in this chapter is the description of selected tools that make up the Python standard library—built-in functions, library modules, and their most useful functions and classes. While you most likely won’t use all of these in any one program, no useful program avoids all of these. Just as Python provides a list object type because sequence manipulations occur in all programming contexts, the library provides a set of modules that will come in handy over and over again. Before designing and writing any piece of generally useful code, check to see if a similar module already exists. If it’s part of the standard Python library, you can be assured that it’s been heavily tested; even better, others are committed to fixing any remaining bugs—for free.

The goal of this chapter is to expose you to a lot of different tools, so that you know that they exist, rather than to teach you everything you need to know in order to use them. There are very good sources of complementary knowledge once you’ve finished this book. If you want to explore more of the standard library, the definitive reference is the Python Library Reference, currently over 600 pages long. It is the ideal companion to this book; it provides the completeness we don’t have the room for, and, being available online, is the most up-to-date description of the standard Python toolset. Three other O’Reilly books provide excellent additional information: the Python Pocket Reference, written by Mark Lutz, which covers the most important modules in the standard library, along with the syntax and built-in functions in compact form; Fredrik Lundh’s Python Standard Library, which takes on the formidable task of providing both additional documentation for each module in the standard library and an example program showing how to use each module; and finally, Alex Martelli’s Python in a Nutshell, which provides a thorough yet eminently readable and concise description of the language and standard library. As we’ll see in Section 27.1, Python comes with tools that make self-learning easy as well.

Just as we can’t cover every standard module, the set of tasks covered in this chapter is necessarily limited. If you want more, check out the Python Cookbook (O’Reilly), edited by David Ascher and Alex Martelli. This Cookbook covers many of the same problem domains we touch on here but in much greater depth and with much more discussion. That book, leveraging the collective knowledge of the Python community, provides a much broader and richer survey of Pythonic approaches to common tasks.

This chapter limits itself to tools available as part of standard Python distributions. The next two chapters expand the scope to third-party modules and libraries, since many such modules can be just as valuable to the Python programmer.

This chapter starts by covering common tasks that apply to fundamental programming concepts (types, data structures, strings), moving on to conceptually higher-level topics such as files and directories, Internet-related operations, and process launching, before finishing with some nonprogramming tasks such as testing, debugging, and profiling.

Exploring on Your Own

Before digging into specific tasks, we should say a brief word about self-exploration. We have not been exhaustive in coverage of object attributes or module contents in order to focus on the most important aspects of the objects under discussion. If you’re curious about what we’ve left out, you can look it up in the Library Reference, or you can poke around in the Python interactive interpreter, as shown in this section.

The dir built-in function returns a list of all of the attributes of an object, and, along with the type built-in, provides a great way to learn about the objects you’re manipulating. For example:

>>> dir([  ])                             # What are the attributes of lists?
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', 
'__delslice__', '__doc__', '__eq__', '__ge__', '__getattribute__', '__getitem__',
'__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', 
'__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', 
'__repr__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__str__', 
'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

What this tells you is that the empty list object has a few methods: append, count, extend, index, insert, pop, remove, reverse, sort, and a lot of “special methods” that start with an underscore (_) or two (__). These are used under the hood by Python when performing operations like +. Since these special methods are not needed very often, we’ll write a simple utility function that will not display them:

>>> def mydir(obj):
...     orig_dir = dir(obj)
...     return [item for item in orig_dir if not item.startswith('_')]
...     
>>>

Using this new function on the same empty list yields:

>>> mydir([  ])                             # What are the attributes of lists?
['append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

You can then explore any Python object:

>>> mydir((  ))                           # What are the attributes of tuples?
[  ]                                      # Note: no "normal" attributes
>>> import sys
>>> mydir(sys.stdin)                      # What are the attributes of files?
['close', 'closed', 'fileno', 'flush', 'isatty', 'mode', 'name', 'read', 'readinto', 
'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate', 'write',
'writelines', 'xreadlines']
>>> mydir(sys)                            # Modules are objects too.
['argv', 'builtin_module_names', 'byteorder', 'copyright', 'displayhook', 
'dllhandle', 'exc_info', 'exc_type', 'excepthook', 'exec_prefix', 'executable',
'exit', 'getdefaultencoding', 'getrecursionlimit', 'getrefcount', 'hexversion',
'last_traceback', 'last_type', 'last_value', 'maxint', 'maxunicode', 'modules',
'path', 'platform', 'prefix', 'ps1', 'ps2', 'setcheckinterval', 'setprofile', 
'setrecursionlimit', 'settrace', 'stderr', 'stdin', 'stdout', 'version', 
'version_info', 'warnoptions', 'winver'] 
>>> type(sys.version)                    # What kind of thing is 'version'?
<type 'string'>
>>> print repr(sys.version)              # What is the value of this string?
'2.3a1 (#38, Dec 31 2002, 17:53:59) [MSC v.1200 32 bit (Intel)]'

Recent versions of Python also contain a built-in that is very helpful to beginners, named (appropriately enough) help:

>>> help(sys)
Help on built-in module sys:
NAME
    sys

FILE
     (built-in)

DESCRIPTION
    This module provides access to some objects used or maintained by the
    interpreter and to functions that interact strongly with the 
    interpreter.

    Dynamic objects:

    argv--command line arguments; argv[0] is the script pathname if known
    path--module search path; path[0] is the script directory, else ''
    modules--dictionary of loaded modules

    displayhook--called to show results in an interactive session
    excepthook--called to handle any uncaught exception other than 
    SystemExit

    To customize printing in an interactive session or to install a 
    custom top-level exception handler, assign other functions to replace
    these.
    ...

There is quite a lot to the online help system. We recommend that you start it first in its “modal” state, just by typing help( ). From then on, any string you type will yield its documentation. Type quit to leave the help mode.

>>> help(  )
Welcome to Python 2.2! This is the online help utility
...
help> socket
Help on module socket:

NAME
    socket

FILE
    c:\python22\lib\socket.py

DESCRIPTION
    This module provides socket operations and some related functions.
    On Unix, it supports IP (Internet Protocol) and Unix domain sockets.
    On other systems, it only supports IP. Functions specific for a
    socket are available as methods of the socket object.

    Functions:

    socket(  ) -- create a new socket object
    fromfd(  ) -- create a socket object from an open file descriptor [*]
    gethostname(  ) -- return the current hostname
    gethostbyname(  ) -- map a hostname to its IP number
    gethostbyaddr(  ) -- map an IP number or hostname to DNS info
    getservbyname(  ) -- map a service name and a protocol name to a port 
        ...
help> keywords

Here is a list of the Python keywords.  Enter any keyword to get more help.

and                 elif                global              or
assert              else                if                  pass
break               except              import              print
class               exec                in                  raise
continue            finally             is                  return
def                 for                 lambda              try
del                 from                not                 while
help> topics

Here is a list of available topics.  Enter any topic name to get more help.

ASSERTION           DEBUGGING           LITERALS         SEQUENCEMETHODS1
ASSIGNMENT          DELETION            LOOPING          SEQUENCEMETHODS2
ATTRIBUTEMETHODS    DICTIONARIES        MAPPINGMETHODS   SEQUENCES
ATTRIBUTES          DICTIONARYLITERALS  MAPPINGS         SHIFTING
AUGMENTEDASSIGNMENT ELLIPSIS            METHODS          SLICINGS
BACKQUOTES          EXCEPTIONS          MODULES          SPECIALATTRIBUTES
BASICMETHODS        EXECUTION           NAMESPACES       SPECIALIDENTIFIERS
BINARY              EXPRESSIONS         NONE             SPECIALMETHODS
BITWISE             FILES               NUMBERMETHODS    STRINGMETHODS
BOOLEAN             FLOAT               NUMBERS          STRINGS
CALLABLEMETHODS     FORMATTING          OBJECTS          SUBSCRIPTS
CALLS               FRAMEOBJECTS        OPERATORS        TRACEBACKS
CLASSES             FRAMES              PACKAGES         TRUTHVALUE
CODEOBJECTS         FUNCTIONS           POWER            TUPLELITERALS
COERCIONS           IDENTIFIERS         PRECEDENCE       TUPLES
COMPARISON          IMPORTING           PRINTING         TYPEOBJECTS
COMPLEX             INTEGER             PRIVATENAMES     TYPES
CONDITIONAL         LISTLITERALS        RETURNING        UNARY
CONVERSIONS         LISTS               SCOPING          UNICODE
help> TYPES
  3.2 The standard type hierarchy

  Below is a list of the types that are built into Python. Extension
  modules written in C can define additional types. Future versions of
  Python may add types to the type hierarchy (e.g., rational numbers,
  efficiently stored arrays of integers, etc.).
  ...

help> quit
>>>

Conversions, Numbers, and Comparisons

We’ve covered Python’s data types; one of the common issues when dealing with any type system is how to convert values from one type to another. These conversions happen in a myriad of contexts—reading numbers from a text file, computing integer averages, interfacing with functions that expect different types than the rest of an application, etc.

We’ve seen in previous chapters that we can create a string from a nonstring object by simply passing the nonstring object to the str string constructor. Similarly, unicode converts any object to its Unicode string form and returns it.[1]

In addition to the string creation functions, we’ve seen list and tuple, which take sequences and return list and tuple versions of them, respectively. int, complex, float, and long take any number and convert it to their respective types. int, long, and float have additional features that can be confusing. First, int and long truncate their numeric arguments, if necessary, to perform the operation, thereby losing information and performing a conversion that may not be what you want (the round built-in rounds numbers the standard way and returns a float). Second, int, long, and float can also convert strings to their respective types, provided the strings are valid integer (or long, or float) literals. Literals are the text strings that are converted to numbers early in the Python compilation process. So, the string 1244 in your Python program file (which is necessarily a string) is a valid integer literal, but def foo( ): isn’t.

>>> int(1.0), int(1.4), int(1.9), round(1.9), int(round(1.9))
(1, 1, 1, 2.0, 2)
>>> int("1")
1
>>> int("1.2")                             # This doesn't work.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: invalid literal for int(  ): 1.2

What’s a little odd is that, for string arguments, the conversion rule (the string must be a valid integer literal) takes precedence over the truncating behavior, thus:

>>> int("1.0")                               # Neither does this
Traceback (most recent call last):           # since 1.0 is also not a valid 
  File "<stdin>", line 1, in ?               # integer literal.
ValueError: invalid literal for int(  ): 1.0

Given the behavior of int, it may make sense in some cases to use a custom variant that does only conversion, refusing to truncate:

>>> def safeint(candidate):
...   converted = float(candidate)
...   rounded = round(converted)
...   if converted == rounded:
...         return int(converted)
...   else: 
...         raise ValueError, "%s would lose precision when cast"%candidate
...
>>> safeint(3.0)
3
>>> safeint("3.0")
3
>>> safeint(3.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 8, in safeint
ValueError: 3.1 would lose precision when cast

Converting numbers to strings can be done in a variety of ways. In addition to using str or unicode, one can use hex or oct, which take integers (whether int or long) as arguments and return string representations of them in hexadecimal or octal format, respectively.

>>> hex(1000), oct(1000)
('0x3e8', '01750')

The abs built-in returns the absolute value of scalars (integers, longs, floats) and the magnitude of complex numbers (the square root of the sum of the squared real and imaginary parts):

>>> abs(-1), abs(-1.2), abs(-3+4j)
(1, 1.2, 5.0)                               # 5 is sqrt(3*3 + 4*4).

The ord and chr functions return the ASCII value of single characters and vice versa:

>>> map(ord, "test")    # Remember that strings are sequences
[116, 101, 115, 116]    # of characters, so map can be used.
>>> chr(64)
'@'
>>> ord('@')
64
# map returns a list of single characters, so it
# needs to be "joined" into a str.
>>> map(chr, (83, 112, 97, 109, 33))
['S', 'p', 'a', 'm', '!']
# Can also be spelled using list comprehensions
>>> [chr(x) for x in (83, 112, 97, 109, 33)]
['S', 'p', 'a', 'm', '!']
>>> ''.join([chr(x) for x in (83, 112, 97, 109, 33)])
'Spam!'

The cmp built-in returns a negative integer, 0, or a positive integer, depending on whether its first argument is less than, equal to, or greater than its second one. It’s worth emphasizing that cmp works with more than just numbers; it compares characters using their ASCII values, and sequences are compared by comparing their elements. Comparisons can raise exceptions, so the comparison function is not guaranteed to work on all objects, but all reasonable comparisons will work.[2] The comparison process used by cmp is the same as that used by the sort method of lists. It’s also used by the built-ins min and max, which return the smallest and largest elements of the objects they are called with, dealing reasonably with sequences:

>>> min("pif", "paf", "pof")        # When called with multiple arguments,
'paf'                               # return the appropriate one.
>>> min("ZELDA!"), max("ZELDA!")    # When called with a single sequence,
('!', 'Z')                          # return the min/max element of it.
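The three-way protocol of cmp can be sketched with a small helper (three_way is a name invented here for illustration; it mirrors the contract that sort, min, and max rely on):

```python
def three_way(a, b):
    # negative if a < b, zero if a == b, positive if a > b,
    # mirroring the contract of the cmp built-in
    return (a > b) - (a < b)

# numbers, strings (by character value), and sequences all compare
results = (three_way(1, 2), three_way('a', 'a'), three_way([1, 2], [1, 1]))
```

Here results is (-1, 0, 1): 1 sorts before 2, 'a' equals 'a', and [1, 2] sorts after [1, 1] because the second elements differ.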

Table 27-1 summarizes the built-in functions dealing with type conversions. Many of these can also be called with no argument to return a false value; for example, str( ) returns the empty string.

Table 27-1. Type conversion built-in functions

Function name

Behavior

str(object), unicode(object)

Returns the string representation of any object:

>>> str(dir(  ))
"['__builtins__', '__doc__', '__name__']"
>>> unicode('tomato')
u'tomato'

list(seq)

Returns the list version of a sequence:

>>> list("tomato")
['t', 'o', 'm', 'a', 't', 'o']
>>> list((1,2,3))
[1, 2, 3]

tuple(seq)

Returns the tuple version of a sequence:

>>> tuple("tomato")
('t', 'o', 'm', 'a', 't', 'o')
>>> tuple([0])
(0,)

dict( ), dict(mapping), dict(seq), dict(**kwargs)

Creates a dictionary from its argument, which can be a mapping, a sequence, keyword arguments or nothing (yielding the empty dictionary):

>>> dict(  )
{  }
>>> dict([('a', 2), ('b', 5)])
{'a': 2, 'b': 5}
>>> dict(a=2, b=5) # In Python 2.3 or later
{'a': 2, 'b': 5}

int(x)

Converts a string or number to a plain integer; truncates floating-point values. The string needs to be a valid integer literal (i.e., no decimal point).

>>> int("3") 
3

long(x)

Converts a string or number to a long integer; truncates floating-point values:

>>> long("3")
3L

float(x)

Converts a string or a number to floating point:

>>> float("3")
3.0

complex(real,imag)

Creates a complex number with the value real + imag*j:

>>> complex(3,5)
(3+5j)

hex(i)

Converts an integer number (of any size) to a hexadecimal string:

>>> hex(10000)
'0x2710'

oct(i)

Converts an integer number (of any size) to an octal string:

>>> oct(10000)
'023420'

ord(char)

Returns the numeric value of a string of one character (using the current default encoding, often ASCII):

>>> ord('A')
65

chr(i)

Returns a string of one character whose numeric code in the current encoding (often ASCII) is the integer i:

>>> chr(65)
'A'

min(i [, i]*)

Returns the smallest item of a nonempty sequence:

>>> min([5,1,2,3,4])
1
>>> min(5,1,2,3,4)
1

max(i [, i]*)

Returns the largest item of a nonempty sequence:

>>> max([5,1,2,3,4])
5
>>> max(5,1,2,3,4)
5

file(name [, mode [, buffering]])

Opens a file.

>>> data = file('contents.txt', 'r').read(  )

Manipulating Strings

The vast majority of programs perform string operations. We’ve covered most of the properties and variants of string objects in Chapter 5, but there are two areas that we haven’t touched on thus far: the string module and regular expressions. As we’ll see, the first is simple and mostly a historical note, while the second is complex and powerful.

The string Module

The string module is somewhat of a historical anomaly. If Python were being designed today, the string module would not exist—it is mostly a remnant of a less civilized age before everything was a first-class object. Nowadays, string objects have methods like split and join, which replace the functions that are still defined in the string module. The string module does define a convenient function, maketrans, used to automatically do string “mapping” operations with the translate method of string objects. maketrans/translate is useful when you want to translate several characters in a string at once—for example, to replace all occurrences of the space character with an underscore, change underscores to minus signs, and change minus signs to plus signs. Doing so with repeated .replace( ) operations is in fact quite tricky, but doing it with maketrans is trivial:

>>> import string
>>> conversion = string.maketrans(" _-", "_-+")
>>> input_string = "This is a two_part - one_part"
>>> input_string.translate(conversion)
'This_is_a_two-part_+_one-part'
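To see why the repeated-replace approach is tricky, here is a naive chain of .replace( ) calls (a deliberately broken sketch):

```python
s = "This is a two_part - one_part"
# Each replace sees the output of the previous one: the underscores
# created by the first step are clobbered by the second step, and the
# minus signs created by the second are clobbered by the third, so
# every separator ends up as '+'.
bad = s.replace(" ", "_").replace("_", "-").replace("-", "+")
```

maketrans/translate avoids this problem because all three substitutions happen in a single pass over the string.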

In addition, the string module defines a few useful constants, which haven’t been implemented as string attributes yet. These are shown in Table 27-2.

Table 27-2. String module constants

Constant name

Value

digits

'0123456789'

octdigits

'01234567'

hexdigits

'0123456789abcdefABCDEF'

lowercase

'abcdefghijklmnopqrstuvwxyz'

uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

letters

lowercase + uppercase

whitespace

' \t\n\r\v\f' (all whitespace characters)

The constants in Table 27-2 are useful to test whether specific characters fit a criterion—for example, x in string.whitespace returns true only if x is one of the whitespace characters. Note that the values given above aren’t always the values you’ll find—for example, the definitions of the letter constants depend on the locale: if you’re running a French operating system, string.lowercase will include ç and ê.
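For instance, a small membership-testing helper built on these constants might look like this (is_blank is a made-up name for illustration; a simple character loop keeps it explicit):

```python
import string

def is_blank(s):
    # True when s is nonempty and consists only of whitespace characters.
    if not s:
        return False
    for c in s:
        if c not in string.whitespace:
            return False
    return True
```

Real code could also use the isspace method of string objects; the point here is simply the membership test against a string-module constant.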

Complicated String Matches with Regular Expressions

If strings and their methods aren’t enough (and they do get clumsy in many perfectly normal use cases), Python provides a specialized string-processing tool in the form of a regular expression engine.

Regular expressions are strings that let you define complicated pattern matching and replacement rules for strings. The syntax for regular expressions emphasizes compact notation over mnemonic value. For example, the single character . means “match any single character.” The character + means “one or more of what just preceded me.” Table 27-3 lists some of the most commonly used regular expression symbols and their meanings in English. Describing the full set of regular expression tokens and their meaning would take quite a few pages—instead, we’ll cover a simple use case and walk through how to solve the problem using regular expressions.

Table 27-3. Common elements of regular expression syntax

Special character

Meaning

.

Matches any character except newline by default

^

Matches the start of the string

$

Matches the end of the string

*

“Any number of occurrences of what just preceded me”

+

“One or more occurrences of what just preceded me”

|

“Either the thing before me or the thing after me”

\w

Matches any alphanumeric character

\d

Matches any decimal digit

tomato

Matches the string tomato
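
A few of these symbols in action with the re module (the sample strings here are made up):

```python
import re

# '.' matches any single character, so 'to.ato' matches 'tomato'
dot_match = re.match(r'to.ato', 'tomato') is not None
# '\d' matches a decimal digit; '+' asks for one or more of them
digits = re.findall(r'\d+', 'room 101, floor 3')
# '|' is alternation: match either 'red' or 'green'
colors = re.findall(r'red|green', 'red and green peppers')
```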

A real regular expression problem

Suppose you need to write a program to replace the strings “green pepper” and “red pepper” with “bell pepper” if and only if they occur together in a paragraph before the word “salad” and not if they are followed (with no space) by the string “corn.” Although the specific requirements are silly, the general kind (conditional replacement of subparts of text based on specific contextual constraints) is surprisingly common in computing. We will explain each step of the program that solves this task.

Assume that the file you need to process is called pepper.txt. Here’s an example of such a file:

This is a paragraph that mentions bell peppers multiple times. For
one, here is a red pepper and dried tomato salad recipe. I don't like
to use green peppers in my salads as much because they have a harsher
flavor.

This second paragraph mentions red peppers and green peppers but not
the "s" word (s-a-l-a-d), so no bells should show up.

This third paragraph mentions red peppercorns and green peppercorns,
which aren't vegetables but spices (by the way, bell peppers really
aren't peppers, they're chilies, but would you rather have a good cook
or a good botanist prepare your salad?).

The first task is to open the file and read in the text:

file = open('pepper.txt')
text = file.read(  )

We read the entire text at once and avoid splitting it into lines, since we will assume that paragraphs are defined by two consecutive newline characters. This is easy to do using the split method of string objects:

paragraphs = text.split('\n\n')

At this point we’ve split the text into a list of paragraph strings, and all there is left to do is perform the actual replacement operation. Here’s where regular expressions come in:

import re
matchstr = re.compile(
                     r"""\b(red|green)    # 'red' or 'green' starting new words
        (\s+               # followed by whitespace
         pepper            # The word 'pepper',
         (?!corn)          # if not followed immediately by 'corn'
         (?=.*salad))""",  # and if followed at some point by 'salad',
      re.IGNORECASE |      # allow pepper, Pepper, PEPPER, etc.
      re.DOTALL |          # Allow dots to match newlines as well.
      re.VERBOSE)          # This allows the comments and the newlines above.
for paragraph in paragraphs:
    fixed_paragraph = matchstr.sub(r'bell\2', paragraph)
    print fixed_paragraph + '\n'

The first line is simple but key: all of Python’s regular expression smarts are in the re module.

The re.compile statement is the hardest one; it creates a regular expression pattern, which is like a program (that’s the raw string), and compiles it. Such a pattern specifies two things: which parts of the strings we’re interested in and how they should be grouped. Let’s go over these in turn. The re.compile( ) call takes a string (although the syntax of that string is quite particular) and returns an object called a compiled regular expression object, which corresponds to that string.

Defining which parts of the string we’re interested in is done by specifying a pattern of characters that defines a match. This is done by concatenating smaller patterns, each of which specifies a simple matching criterion (e.g., “match the string 'pepper',” “match one or more whitespace characters,” “don’t match 'corn',” etc.). We’re looking for the words “red” or “green” followed by the word “pepper” that is itself followed by the word “salad,” as long as “pepper” isn’t followed immediately by “corn.” Let’s take each line of the re.compile( . . . ) expression in turn.

The first thing to notice about the string in the re.compile( ) is that it’s a “raw” string (the quotation marks are preceded by an r). Prepending such an r to a string (single- or triple-quoted) turns off the interpretation of the backslash characters within the string.[3] We could have used a regular string instead, writing \\b instead of \b and \\s instead of \s. In this case, it makes little difference; for complicated regular expressions, raw strings allow for much clearer syntax than escaped backslashes.
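The difference is easy to check interactively:

```python
# In a regular string, '\n' is interpreted as one newline character;
# in a raw string, the backslash and the 'n' both survive.
plain = '\n'
raw = r'\n'
# A raw r'\b...' pattern is the same text as the escaped '\\b...' form.
same = (r'\b(red|green)' == '\\b(red|green)')
```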

The first line in the pattern is \b(red|green). \b stands for “the empty string, but only at the beginning or end of a word”; using it here prevents matches that have red or green as the final part of a word (as in “tired pepper”). The (red|green) pattern specifies an alternation: either 'red' or 'green'. Ignore the left parenthesis that follows for now. \s is a special symbol that means “any whitespace character,” and + means “one or more occurrences of whatever comes before me,” so, put together, \s+ means “one or more whitespace characters.” Then, pepper just means the string 'pepper'. (?!corn) prevents matches of “patterns that have 'corn' at this point,” so we prevent the match on 'peppercorn'. Finally, (?=.*salad) says that for the pattern to match, it must be followed by any number of arbitrary characters (that’s what .* means), followed by the word salad. The ?= specifies that while the pattern should determine whether the match occurs, it shouldn’t be “used up” by the match process; it’s a subtle point that we won’t cover in detail here. At this point we’ve defined the pattern corresponding to the substring.
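The two lookahead constructs can also be seen in isolation (a small sketch, separate from the pepper program):

```python
import re

# (?!corn) is a negative lookahead: 'pepper' matches only when
# 'corn' does not come immediately after it.
no_corn = re.search(r'pepper(?!corn)', 'peppercorn')       # no match
with_salad = re.search(r'pepper(?!corn)', 'pepper salad')  # matches
# (?=...) is a positive lookahead: the context must follow, but it
# is not consumed as part of the match.
m = re.search(r'pepper(?=s)', 'peppers')
```

Note how m covers only the text 'pepper', even though the trailing 's' had to be present for the match to succeed.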

Now, note that there are two extra parentheses—the one before \s+ and the last one. What these two do is define a “group,” which starts after the red or green and goes to the end of the pattern. We’ll use that group in the next operation, the actual replacement. The three flags are joined by the | symbol (the bitwise “or” operation) to form the second argument to re.compile. These specify kinds of pattern matches. The first, re.IGNORECASE, says that the text comparisons should ignore whether the text and the match have similar or different cases. The second, re.DOTALL, specifies that the . character should match any character, including the newline character (that’s not the default behavior). The third, re.VERBOSE, allows us to insert extra newlines and # comments in the regular expression, making it easier to read and understand. We could have written the statement more compactly as:

matchstr = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))", re.I | re.S)

The actual replacement operation is done with the line:

fixed_paragraph = matchstr.sub(r'bell\2', paragraph)

We’re calling the sub method of the matchstr object. That object is a compiled regular expression object, meaning that some of the processing of the expression has already been done (in this case, outside the loop), thus speeding up the total program execution. We use a raw string again to write the first argument to the method. The \2 is a reference to group 2 in the regular expression—the second group of parentheses in the regular expression—in our case, the whitespace followed by 'pepper' (the trailing lookaheads check the surrounding context but consume no text). Therefore, this line means, “Replace each occurrence of the matched substring with 'bell' followed by whatever whitespace and 'pepper' the second group matched, throughout the paragraph string.”
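Group references in a replacement string are easier to see in a smaller setting (the pattern and sample text here are made up for illustration):

```python
import re

# Two parenthesized groups; \1 and \2 in the replacement refer back
# to whatever each group matched.
pat = re.compile(r'(\w+)\s+(\w+)')
swapped = pat.sub(r'\2, \1', 'red pepper')
```

Here swapped is 'pepper, red': group 1 matched 'red', group 2 matched 'pepper', and the replacement string reassembled them in the opposite order.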

So, does it work? The pepper.txt file had three paragraphs: the first satisfied the requirements of the match twice, the second didn’t because it didn’t mention the word 'salad', and the third didn’t because the 'red' and 'green' words are before peppercorn, not pepper. As it was supposed to, our program (saved in a file called pepper.py) modifies only the first paragraph:

/home/David/book$ python pepper.py
This is a paragraph that mentions bell peppers multiple times. For
one, here is a bell pepper and dried tomato salad recipe. I don't like
to use bell peppers in my salads as much because they have a harsher
flavor.

This second paragraph mentions red peppers and green peppers but not
the "s" word (s-a-l-a-d), so no bells should show up.

This third paragraph mentions red peppercorns and green peppercorns,
which aren't vegetables but spices (by the way, bell peppers really
aren't peppers, they're chilies, but would you rather have a good cook
or a good botanist prepare your salad?).

This example, while artificial, shows how regular expressions can compactly express complicated matching rules. If this kind of problem occurs often in your line of work, mastering regular expressions can be a worthwhile investment of time and effort.

A more thorough coverage of regular expressions is beyond the scope of this book. Jeffrey Friedl provides excellent coverage of regular expressions in his book Mastering Regular Expressions (O’Reilly). This book is a must-have for anyone doing serious text processing. For the casual user, the descriptions in the Library Reference or Python in a Nutshell do the job most of the time. Be sure to use the re module, not the regex or regsub modules, which are deprecated (they probably won’t be around in a later version of Python):

>>> import regex
__main__:1: DeprecationWarning: the regex module is deprecated; please use the re 
module

Data Structure Manipulations

One of Python’s greatest features is that it provides the list, tuple, and dictionary built-in types. They are so flexible and easy to use that once you’ve grown used to them, you’ll find yourself reaching for them automatically. While we covered all of the operations on each data structure as we introduced them, now’s a good time to go over tasks that can apply to all data structures, such as how to make copies, sort objects, randomize sequences, etc. Many functions and algorithms (theoretical procedures describing how to implement a complex task in terms of simpler basic tasks) are designed to work regardless of the type of data being manipulated. It is therefore useful to know how to do generic things for all data types.

Making Copies

Making copies of objects is a common task in many programming contexts. Often, the only kind of copy that’s needed is just another reference to an object, as in:

x = 'tomato'
y = x                   # y is now 'tomato'.
x = x + ' and cucumber' # x is now 'tomato and cucumber', but y is unchanged.

Due to Python’s reference management scheme, the statement a = b doesn’t make a copy of the object referenced by b; instead, it makes a new reference to that same object. When the object being copied is an immutable object (e.g., a string), there is no real difference. When dealing with mutable objects like lists and dictionaries, however, sometimes a real, new copy of the object, not just a shared reference, is needed. How to do this depends on the type of the object in question. The simplest way of making a copy is to use the list( ) or tuple( ) constructors:

newList = list(myList)
newTuple = tuple(myTuple)

While those constructors are the simplest way, the most common way to make copies of sequences like lists and tuples is somewhat odd. If myList is a list, then to make a copy of it, you can use:

newList = myList[:]

which you can read as “slice from beginning to end,” since the default index for the start of a slice is the beginning of the sequence (0), and the default index for the end of a slice is the end of sequence (see Chapter 3). Since tuples support the same slicing operation as lists, this same technique can also be applied to tuples, except that if x is a tuple, then x[:] is the same object as x, since tuples are immutable. Dictionaries, on the other hand, don’t support slicing. To make a copy of a dictionary myDict, you can either use:

newDict = myDict.copy(  )

or the dict( ) constructor:

newDict = dict(myDict)
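A quick check confirms that these constructors give you an independent top-level copy (the variable names here are just for illustration):

```python
myList = ['a', 'b']
newList = list(myList)
newList.append('c')                  # Only the copy grows...
assert myList == ['a', 'b']          # ...the original is untouched.

myDict = {'a': 1}
newDict = dict(myDict)
newDict['b'] = 2                     # Again, only the copy changes.
assert myDict == {'a': 1}
```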

For a different kind of copying, if you have a dictionary oneDict, and want to update it with the contents of a different dictionary otherDict, simply type oneDict.update(otherDict). This is the equivalent of:

for key in otherDict.keys(  ):
    oneDict[key] = otherDict[key]

If oneDict shared some keys with otherDict before the update( ) operation, the old values associated with those keys in oneDict are obliterated by the update. This may be what you want to do (it usually is). If it isn’t, the right thing to do might be to raise an exception. To do this, make a copy of one dictionary, then loop over each entry in the second. If we find shared keys, we raise an exception; if not, we just add the key/value mapping to the new dictionary:

def mergeWithoutOverlap(oneDict, otherDict):
    newDict = oneDict.copy(  )
    for key in otherDict:
        if key in oneDict:
            raise ValueError, "the two dictionaries share keys!"
        newDict[key] = otherDict[key]
    return newDict

Alternatively, you can combine the values of the two dictionaries into a tuple, using the same logic as in mergeWithoutOverlap but combining the values instead of raising an exception:

def mergeWithOverlap(oneDict, otherDict):
    newDict = oneDict.copy(  )
    for key in otherDict:
        if key in oneDict:
            newDict[key] = oneDict[key], otherDict[key]
        else:
            newDict[key] = otherDict[key]
    return newDict

To illustrate the differences between the preceding three algorithms, consider the following two dictionaries:

phoneBook1 = {'michael': '555-1212', 'mark': '554-1121', 'emily': '556-0091'}
phoneBook2 = {'latoya': '555-1255', 'emily': '667-1234'}

If phoneBook1 is possibly out of date, and phoneBook2 is more up to date but less complete, the right usage is probably phoneBook1.update(phoneBook2). If the two phoneBooks are supposed to have nonoverlapping sets of keys, using newBook = mergeWithoutOverlap(phoneBook1, phoneBook2) lets you know if that assumption is wrong. Finally, if one is a set of home phone numbers and the other a set of office phone numbers, chances are newBook = mergeWithOverlap(phoneBook1, phoneBook2) is what you want, as long as the subsequent code that uses newBook can deal with the fact that newBook['emily'] is the tuple ('556-0091', '667-1234').
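To make the comparison concrete, here is a self-contained sketch running all three approaches on the two phone books (the merge helpers from above are redefined, using the function-call form of raise so the sketch also runs on newer Pythons):

```python
def mergeWithoutOverlap(oneDict, otherDict):
    newDict = oneDict.copy()
    for key in otherDict:
        if key in oneDict:
            raise ValueError("the two dictionaries share keys!")
        newDict[key] = otherDict[key]
    return newDict

def mergeWithOverlap(oneDict, otherDict):
    newDict = oneDict.copy()
    for key in otherDict:
        if key in oneDict:
            newDict[key] = oneDict[key], otherDict[key]
        else:
            newDict[key] = otherDict[key]
    return newDict

phoneBook1 = {'michael': '555-1212', 'mark': '554-1121', 'emily': '556-0091'}
phoneBook2 = {'latoya': '555-1255', 'emily': '667-1234'}

merged = dict(phoneBook1)
merged.update(phoneBook2)            # phoneBook2 wins for shared keys.
assert merged['emily'] == '667-1234'

combined = mergeWithOverlap(phoneBook1, phoneBook2)
assert combined['emily'] == ('556-0091', '667-1234')

try:
    mergeWithoutOverlap(phoneBook1, phoneBook2)
except ValueError:
    pass                             # The shared key 'emily' triggers the exception.
```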

The copy module

The [:] and .copy( ) tricks will get you copies in 90% of the cases. If you are writing functions that, in true Python spirit, can deal with arguments of any type, it’s sometimes necessary to make copies of x, regardless of what x is. In comes the copy module. It provides two functions, copy and deepcopy. The first is just like the [:] sequence slice operation or the copy method of dictionaries. The second is more subtle and has to do with deeply nested structures (hence the term deepcopy). Take the example of copying a list (listOne) by slicing it from beginning to end using the [:] construct. This technique makes a new list that contains references to the same objects contained in the original list. If the contents of that original list are immutable objects, such as numbers or strings, the copy is as good as a “true” copy. However, suppose that the first element in listOne is itself a dictionary (or any other mutable object). The first element of the copy of listOne is a new reference to the same dictionary. So if you then modify that dictionary, the modification is evident in both listOne and the copy of listOne. An example makes it much clearer:

>>> import copy
>>> listOne = [{"name": "Willie", "city": "Providence, RI"}, 1, "tomato", 3.0]
>>> listTwo = listOne[:]                   # Or listTwo=copy.copy(listOne)
>>> listThree = copy.deepcopy(listOne)
>>> listOne.append("kid")
>>> listOne[0]["city"] = "San Francisco, CA"
>>> print listOne, listTwo, listThree
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0, 'kid']
[{'name': 'Willie', 'city': 'San Francisco, CA'}, 1, 'tomato', 3.0]
[{'name': 'Willie', 'city': 'Providence, RI'}, 1, 'tomato', 3.0]

As you can see, modifying listOne directly modified only listOne. Modifying the first entry of the list referenced by listOne led to changes in listTwo, but not in listThree; that’s the difference between a shallow copy ([:]) and a deep copy. The copy module functions know how to copy all the built-in types that are reasonably copyable,[4] including classes and instances.

Sorting

Lists have a sort method that does an in-place sort. Sometimes you want to iterate over the sorted contents of a list, without disturbing the contents of this list. Or you may want to list the sorted contents of a tuple. Because tuples are immutable, an operation such as sort, which modifies it in place, is not allowed. The only solution is to make a list copy of the elements, sort the list copy, and work with the sorted copy, as in:

listCopy = list(myTuple)
listCopy.sort(  )
for item in listCopy:
    print item                             # Or whatever needs doing

This solution is also the way to deal with data structures that have no inherent order, such as dictionaries. One of the reasons that dictionaries are so fast is that the implementation reserves the right to change the order of the keys in the dictionary. It’s really not a problem, however, given that you can iterate over the keys of a dictionary using an intermediate copy of the keys of the dictionary:

keys = myDict.keys(  )                          # Returns an unsorted list of
                                           # the keys in the dict.
keys.sort(  )
for key in keys:                           # Print key/value pairs 
    print key, myDict[key]                 # sorted by key.

The sort method on lists uses the standard Python comparison scheme. Sometimes, however, that scheme isn’t what’s needed, and you need to sort according to some other procedure. For example, when sorting a list of words, case (lower versus UPPER) may not be significant. The standard comparison of text strings, however, says that all uppercase letters come before all lowercase letters, so 'Baby' is less than 'apple', but 'baby' is greater than 'apple'. In order to do a case-independent sort, you need to define a comparison function that takes two arguments, and returns -1, 0, or 1 depending on whether the first argument is smaller than, equal to, or greater than the second argument. So, for case-independent sorting, you can use:

>>> def caseIndependentSort(something, other):
...    something, other = something.lower(  ), other.lower(  )
...    return cmp(something, other)
... 
>>> testList = ['this', 'is', 'A', 'sorted', 'List']
>>> testList.sort(  )
>>> print testList
['A', 'List', 'is', 'sorted', 'this']
>>> testList.sort(caseIndependentSort)
>>> print testList
['A', 'is', 'List', 'sorted', 'this']

We’re using the built-in function cmp, which does the hard part of figuring out that 'a' comes before 'b', 'b' before 'c', etc. Our sort function simply converts both items to lowercase and compares the lowercase versions. Also note that the conversion to lowercase is local to the comparison function, so the elements in the list aren’t modified by the sort.
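As an aside, Python 2.4 added a key argument to sort, which is usually a simpler (and faster) way to get the same effect; a minimal sketch:

```python
testList = ['this', 'is', 'A', 'sorted', 'List']
testList.sort(key=str.lower)    # Sort by the lowercased version of each item...
assert testList == ['A', 'is', 'List', 'sorted', 'this']   # ...items themselves unchanged.
```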

Randomizing

What about randomizing a sequence, such as a list of lines? The easiest way to randomize a sequence is to call the shuffle function in the random module, which randomizes a sequence in-place:[5]

import random
random.shuffle(myList)

If you need to shuffle a nonlist object, it’s usually easiest to convert that object to a list and shuffle the list version of the same data, rather than come up with a new strategy for each data type. This might seem a wasteful strategy, given that it involves building intermediate lists that might be quite large. In general, however, what seems large to you probably won’t seem so to the computer, thanks to the reference system. Also, consider the time saved by not having to come up with a different strategy for each data type! Python is designed to save programmer time; if that means running a slightly slower or bigger program, so be it. If you’re handling enormous amounts of data, it may be worthwhile to optimize. But never optimize until the need for optimization is clear; that would be a waste of your time.
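For instance, to shuffle the characters of a string, you can convert it to a list, shuffle that, and rebuild the string; a small sketch:

```python
import random

word = 'tomato'
letters = list(word)                 # Convert to a list...
random.shuffle(letters)              # ...shuffle it in place...
shuffled = ''.join(letters)          # ...and rebuild the string.
assert sorted(shuffled) == sorted(word)   # Same letters, possibly new order.
```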

Making New Data Structures

This chapter emphasizes the silliness involved in reinventing wheels. This point is especially important when it comes to data structures. For example, Python lists and dictionaries might not be quite the lists and mappings you’re used to, but you should avoid designing your own data structure if these structures will suffice. The algorithms they use have been tested under wide ranges of conditions, and they’re fast and stable. Sometimes, however, the interface to these algorithms isn’t convenient for a particular task.

For example, computer science textbooks often describe algorithms in terms of other data structures such as queues and stacks. To use these algorithms, it may make sense to come up with a data structure that has the same methods as these data structures (such as pop and push for stacks or enqueue/dequeue for queues). However, it also makes sense to reuse the built-in list type in the implementation of a stack. In other words, you need something that acts like a stack but is based on a list. A simple solution is to use a class wrapper around a list. For a minimal stack implementation, you can do this:

class Stack:
    def __init__(self, data):
        self._data = list(data)
        self.push = self._data.append
        self.pop = self._data.pop

The following is simple to write, to understand, to read, and to use:

>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
>>> thingsToDo.push('do the dishes')
>>> print thingsToDo.pop(  )
do the dishes
>>> print thingsToDo.pop(  )
wash the kid

Two standard Python naming conventions are used in the Stack class above. The first is that class names start with an uppercase letter, to distinguish them from functions. The other is that the _data attribute starts with an underscore. This is a half-way point between public attributes (which don’t start with an underscore), private attributes (which start with two underscores; see Chapter 7), and Python-reserved identifiers (which both start and end with two underscores). What it means is that _data is an attribute of the class that shouldn’t be needed by clients of the class. The class designer expects such pseudo-private attributes to be used only by the class methods and by the methods of any eventual subclass.
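The same wrapper technique covers the queue mentioned earlier; a minimal sketch (the names Queue, enqueue, and dequeue are just the conventional ones, not a standard class):

```python
class Queue:
    def __init__(self, data):
        self._data = list(data)
        self.enqueue = self._data.append     # Add at the back...

    def dequeue(self):
        return self._data.pop(0)             # ...remove from the front.

line = Queue(['first', 'second'])
line.enqueue('third')
assert line.dequeue() == 'first'             # First in, first out.
assert line.dequeue() == 'second'
```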

Making New Lists and Dictionaries

The Stack class presented earlier does its minimal job just fine. It assumes a fairly minimal definition of what a stack is, specifically, something that supports just two operations, a push and a pop. Some of the features of lists would be nice to use, such as the ability to iterate over all the elements using the for . . . in . . . construct. While you could continue in the style of the previous class and delegate to the “inner” list object, at some point it makes more sense to simply reuse the implementation of list objects directly, through subclassing. In this case, you should derive a class from the list base class. The dict base class can also be used to create dictionary-like classes.

# Subclass the list class.
class Stack(list):
    push = list.append

This Stack is a subclass of the list class. The list class implements the pop method, among others. You don’t need to define your own __init__ method because list defines a perfectly good default. The push method is defined just by saying that it’s the same as list’s append method. Now we can do list-like things as well as stack-like things:

>>> thingsToDo = Stack(['write to mom', 'invite friend over', 'wash the kid'])
>>> print thingsToDo                  # Inherited from list base class
['write to mom', 'invite friend over', 'wash the kid']
>>> thingsToDo.pop(  )        
'wash the kid'
>>> thingsToDo.push('change the oil')
>>> for chore in thingsToDo:          # We can also iterate over the contents.
...    print chore 
...
write to mom
invite friend over
change the oil

Manipulating Files and Directories

So far so good—we know how to create objects, we can convert between different data types, and we can perform various kinds of operations on them. In practice, however, as soon as one leaves the computer science classroom one is faced with tasks that involve manipulating data that lives outside of the program and performing processes that are external to Python. That’s when it becomes very handy to know how to talk to the operating system, explore the filesystem, and read and modify files.

The os and os.path Modules

The os module provides a generic interface to the operating system’s most basic set of tools. Different operating systems have different behaviors, and this is true at the programming interface as well, which makes it hard to write so-called “portable” programs that run well regardless of the operating system. Having generic interfaces independent of the operating system helps, as does using an interpreted language like Python. The specific set of calls the os module defines depends on which platform you use. (For example, the permission-related calls are available only on platforms that support them, such as Unix and Windows.) Nevertheless, it’s recommended that you always use the os module, instead of the platform-specific versions of the module (called by such names as posix, nt, and mac). Table 27-4 lists some of the most often used functions in the os module. When referring to files in the context of the os module, one is referring to filenames, not file objects.

Table 27-4. Most frequently used functions from the os module

Function name

Behavior

getcwd( )

Returns a string referring to the current working directory (cwd):

>>> print os.getcwd(  )
h:\David\book

listdir(path)

Returns a list of all of the files in the specified directory:

>>> os.listdir(os.getcwd(  ))
['preface.doc', 'part1.doc', 'part2.doc']

chown(path, uid, gid)

Changes the owner ID and group ID of specified file

chmod(path, mode)

Changes the permissions of specified file with numeric mode mode (e.g., 0644 means read/write for owner, read for everyone else)

rename(src, dest)

Renames file named src with name dest

remove(path) or unlink(path)

Deletes specified file (see rmdir( ) to remove directories)

rmdir(path)

Deletes specified directory

removedirs(path)

Works like rmdir( ) except that if the leaf directory is successfully removed, directories corresponding to rightmost path segments will be pruned away.

mkdir(path[, mode])

Creates a directory named path with numeric mode mode (see os.chmod):

>>> os.mkdir('newdir')

makedirs(path[, mode])

Like mkdir( ), but makes all intermediate-level directories needed to contain the leaf directory:

>>> os.makedirs('newdir/newsubdir/newsubsubdir')

system(command)

Executes the shell command in a subshell; the return value is the return code of the command

symlink(src, dest)

Creates soft link from file src to file dest

link(src, dest)

Creates hard link from file src to file dest

stat(path)

Returns data about the file, such as size, last modified time, and ownership:

>>> os.stat('TODO.txt')              # It returns something like a tuple.
(33206, 0L, 3, 1, 0, 0, 1753L, 1042186004,
1042186004, 1042175785)
>>> os.stat('TODO.txt').st_size      # Just look at the size.
1753L
>>> time.asctime(time.localtime(os.stat('TODO.txt').st_mtime))
'Fri Jan 10 00:06:44 2003'

walk(top, topdown=True, onerror=None) (Python 2.3 and later)

For each directory in the directory tree rooted at top (including top itself, but excluding '.' and '..'), yields a 3-tuple:

dirpath, dirnames, filenames

With just these modules, you can find out a lot about the current state of the filesystem, as well as modify it:

>>> print os.getcwd(  )        # Where am I?
C:\Python22
>>> print os.listdir('.')    # What's here?
['DLLs', 'Doc', 'include', 'Lib', 'libs', 'License.txt', ...]
>>> os.chdir('Lib')          # Let's go explore the library.
>>> print os.listdir('.')    # What's here?
['aifc.py', 'anydbm.py', 'anydbm.pyc', 'asynchat.py',
'asyncore.py', 'atexit.py', 'atexit.pyc', 'atexit.pyo',
'audiodev.py', 'base64.py', ...]
>>> os.remove('atexit.pyc')  # We can remove .pyc files safely.
>>>

There are many other functions in the os module; in fact, just about any function that’s part of the POSIX standard and widely available on most Unix and Unix-like platforms is supported by Python on Unix. The interfaces to these routines follow the POSIX conventions. You can retrieve and set UIDs, PIDs, and process groups; control nice levels; create pipes; manipulate file descriptors; fork processes; wait for child processes; send signals to processes; use the execv variants; etc (if you don’t know what half of the words in this paragraph mean, don’t worry, you probably don’t need to).

The os module also defines some important attributes that aren’t functions:

  • The os.name attribute defines the current version of the platform-specific operating-system interface. Registered values for os.name are 'posix', 'nt', 'dos', and 'mac'. It’s different from sys.platform, primarily in that it’s less specific—for example, Solaris and Linux will have the same value ('posix') for os.name, but different values of sys.platform.

  • os.error defines an exception class used when calls in the os module raise errors. It’s the same thing as OSError, one of the built-in exception classes. When this exception is raised, the value of the exception object contains two variables. The first is the number corresponding to the error (known as errno), and the second is a string message explaining it (known as strerror):

    >>> os.rmdir('nonexistent_directory')      # How it usually shows up
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    os.error: (2, 'No such file or directory')
    >>> try:                                   # We can catch the error and take
    ...    os.rmdir('nonexistent directory')   # it apart.
    ... except os.error, value:
    ...     print value[0], value[1]
    ...
    2 No such file or directory
  • The os.environ dictionary contains key/value pairs corresponding to the environment variables of the shell from which Python was started. Because this environment is inherited by the commands that are invoked using the os.system call, modifying the os.environ dictionary modifies the environment:

    >>> print os.environ['SHELL']
    /bin/sh
    >>> os.environ['STARTDIR'] = 'MyStartDir'
    >>> os.system('echo $STARTDIR')           # 'echo %STARTDIR%' on DOS/Win
    MyStartDir                                # Printed by the shell
    0                                         # Return code from echo

The os module also includes a set of strings that define portable ways to refer to directory-related parts of filename syntax, as shown in Table 27-5.

Table 27-5. String attributes of the os module

Attribute name

Meaning and values

curdir

A string that denotes the current directory: '.' on Unix, DOS, and Windows; ':' on the Mac

pardir

A string that denotes the parent directory: '..' on Unix, DOS, and Windows; '::' on the Mac

sep

The character that separates pathname components: '/' on Unix, '\\' on DOS and Windows, ':' on the Mac

altsep

An alternate character to sep when available; set to None on all systems except DOS and Windows, where it’s '/'

pathsep

The character that separates path components: ':' on Unix, ';' on DOS and Windows

These strings are used by the functions in the os.path module, which manipulate file paths in portable ways (see Table 27-6). Note that the os.path module is an attribute of the os module, not a sub-module of an os package; it’s imported automatically when the os module is loaded, and (unlike packages) you don’t need to import it explicitly. The outputs of the examples in Table 27-6 correspond to code run on a Windows or DOS machine. On another platform, the appropriate path separators would be used instead. A useful relevant bit of knowledge is that the forward slash (/) can be used safely in Windows to indicate directory traversal, even though the native separator is the backslash (\); Python and Windows both do the right thing with it.

Table 27-6. Most frequently used functions from the os.path module

Function name

Behavior

split(path) is equivalent to the tuple: (dirname(path), basename(path))

Splits the given path into a pair consisting of a head and a tail; the head is the path up to the directory, and the tail is the filename:

>>> os.path.split("h:/David/book/part2.doc")
('h:/David/book', 'part2.doc')

splitdrive(p)

Splits a pathname into drive and path specifiers:

>>> os.path.splitdrive(r"C:\foo\bar.txt")
('C:', '\\foo\\bar.txt')

splitext(p)

Splits the extension from a pathname:

>>> os.path.splitext(r"C:\foo\bar.txt")
('C:\\foo\\bar', '.txt')

splitunc(p)

Splits a pathname into UNC mount point and relative path specifiers:

>>> os.path.splitunc(r"\\machine\mount\directory\file.txt")
('\\\\machine\\mount', '\\directory\\file.txt')

join(path, ...)

Joins path components intelligently:

>>> print os.path.join(os.getcwd(  ),
...                    os.pardir, 'backup', 'part2.doc')
h:\David\book\..\backup\part2.doc

exists(path)

Returns true if path corresponds to an existing path

expanduser(path)

Expands the argument with an initial argument of ~ followed optionally by a username:

>>> print os.path.expanduser('~/mydir')
h:\David\mydir

expandvars(path)

Expands the path argument with the variables specified in the environment:

>>> print os.path.expandvars('$TMP')
C:\TEMP

isfile(path), isdir(path), islink(path), ismount(path), isabs(path)

Returns true if the specified path is a file, directory, link, mount point, or an absolute path, respectively

getatime(filename), getmtime(filename), getsize(filename)

Gets the last access time, last modification time, and size of a file, respectively

normpath(path)

Normalizes the given path, collapsing redundant separators and uplevel references:

>>> print os.path.normpath("/foo/bar\../tmp")
\foo\tmp

normcase(s)

Normalizes case of pathname; makes all characters lowercase and all slashes into backslashes:

>>> print os.path.normcase(r'c:/foo\BAR.txt')
c:\foo\bar.txt

samefile(p, q)

Returns true if both arguments refer to the same file

walk(p, visit, arg)

Calls the function visit with arguments (arg, dirname, names) for each directory in the directory tree rooted at p (including p itself, if it’s a directory); the argument dirname specifies the visited directory; the argument names lists the files in the directory:

>>> def test_walk(arg, dirname, names):
...     print arg, dirname, names
...
>>> os.path.walk('..', test_walk, 'show')
show ..\logs ['errors.log', 'access.log']
show ..\cgi-bin ['test.cgi']
...
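As a runnable illustration of the portable approach, join builds a path with whatever separator and parent-directory string the current platform uses (the path pieces below are invented):

```python
import os

# Build a path one portable piece at a time instead of hardcoding '/' or '\\'.
path = os.path.join('data', os.pardir, 'backup', 'part2.doc')
assert os.sep in path          # The right separator for this platform.
assert os.pardir in path       # '..' on Unix, DOS, and Windows.
assert path.endswith('part2.doc')
```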

Copying Files and Directories: The shutil Module

The keen-eyed reader might have noticed that the os module, while it provides lots of file-related functions, doesn’t include a copy function. In DOS, copying a file is basically the same thing as opening one file in read/binary mode, reading all its data, opening a second file in write/binary mode, and writing the data to the second file. On Unix and Windows, making that kind of copy fails to copy the stat bits (permissions, modification times, etc.) associated with the file. On the Mac, that operation won’t copy the resource fork, which contains data such as icons and dialog boxes. In other words, copying files is just more complicated than one could reasonably believe. Nevertheless, often you can get away with a fairly simple function that works on Windows, DOS, Unix, and Mac, as long as you’re manipulating just data files with no resource forks. That function, called copyfile, lives in the shutil module. This module includes a few generally useful functions, shown in Table 27-7.

Table 27-7. Functions of the shutil module

Function name

Behavior

copyfile(src, dest)

Makes a copy of the file src and calls it dest (straight binary copy).

copymode(src, dest)

Copies mode information (permissions) from src to dest.

copystat(src, dest)

Copies all stat information (mode, utime) from src to dest.

copy(src, dest)

Copies data and mode information from src to dest (doesn’t include the resource fork on Macs).

copy2(src, dest)

Copies data and stat information from src to dest (doesn’t include the resource fork on Macs).

copytree(src, dest, symlinks=0)

Copies a directory recursively using copy2. The symlinks flag specifies whether symbolic links in the source tree must result in symbolic links in the destination tree, or whether the files being linked to must be copied. The destination directory must not already exist.

rmtree(path, ignore_errors=0, onerror=None)

Recursively deletes the directory indicated by path. If ignore_errors is true, errors are ignored. Otherwise, if onerror is set, it’s called to handle the error; if not, an exception is raised on error.
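A self-contained sketch of copy2, copy, copytree, and rmtree at work, confined to a scratch directory so nothing real is touched (all filenames below are made up):

```python
import os, shutil, tempfile

scratch = tempfile.mkdtemp()                  # A private scratch directory.
src = os.path.join(scratch, 'original.txt')
f = open(src, 'w')
f.write('important data\n')
f.close()

dest = os.path.join(scratch, 'backup.txt')
shutil.copy2(src, dest)                       # Copies data plus stat info.
assert open(dest).read() == 'important data\n'

tree = os.path.join(scratch, 'tree')
os.mkdir(tree)
shutil.copy(src, tree)                        # Copying into a directory works too.
shutil.copytree(tree, os.path.join(scratch, 'tree2'))   # Destination must not exist yet.
assert os.path.exists(os.path.join(scratch, 'tree2', 'original.txt'))

shutil.rmtree(scratch)                        # Remove the whole tree at once.
```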

Filenames and Directories

While the previous section lists common functions for working with files, many tasks require more than a single function call.

Let’s take a typical example: you have lots of files, all of which have a space in their name, and you’d like to replace the spaces with underscores. All you need is the os.curdir attribute (which returns an operating-system specific string that corresponds to the current directory), the os.listdir function (which returns the list of filenames in a specified directory), and the os.rename function:

import os, sys
if len(sys.argv) == 1:                     # If no filenames are specified,
    filenames = os.listdir(os.curdir)      # use current dir;
else:                                      # otherwise, use files specified
    filenames = sys.argv[1:]               # on the command line.
for filename in filenames:
    if ' ' in filename:
        newfilename = filename.replace(' ', '_')
        print "Renaming", filename, "to", newfilename, "..."
        os.rename(filename, newfilename)

This program works fine, but it reveals a certain Unix-centrism. That is, if you call it with wildcards, such as:

python despacify.py *.txt

you find that on Unix machines, it renames all the files with names with spaces in them and that end with .txt. In a DOS-style shell, however, this won’t work because the shell normally used in DOS and Windows doesn’t convert from *.txt to the list of filenames; it expects the program to do it. This is called globbing, because the * is said to match a glob of characters. Luckily, Python helps us make the code portable.

Matching Sets of Files

The glob module exports a single function, also called glob, which takes a filename pattern and returns a list of all the filenames that match that pattern (in the current working directory):

import sys, glob
print sys.argv[1:]
sys.argv = [item for arg in sys.argv for item in glob.glob(arg)]
print sys.argv[1:]

Running this on Unix and DOS shows that on Unix, the Python glob didn’t do anything because the globbing was done by the Unix shell before Python was invoked, and in DOS, Python’s globbing came up with the same answer:

/usr/python/book$ python showglob.py *.py
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']

C:\python\book> python showglob.py *.py
['*.py']
['countlines.py', 'mygrep.py', 'retest.py', 'showglob.py', 'testglob.py']

It’s worth looking at the list-comprehension line in showglob.py (the one that rebinds sys.argv) and understanding exactly what happens there, especially if you’re new to the list comprehension concept (discussed in Chapter 14).
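If that line looks opaque, it reads left to right like a pair of nested loops; the following sketch shows the equivalence, using a hypothetical expand function as a stand-in for glob.glob:

```python
def expand(arg):
    # Stand-in for glob.glob: pretend each argument matches two files.
    return [arg + '.1', arg + '.2']

args = ['a', 'b']

# The comprehension from showglob.py...
flat = [item for arg in args for item in expand(arg)]

# ...is equivalent to this pair of nested loops.
result = []
for arg in args:
    for item in expand(arg):
        result.append(item)

assert flat == result == ['a.1', 'a.2', 'b.1', 'b.2']
```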

Using Temporary Files

If you’ve ever written a shell script and needed to use intermediary files for storing the results of some intermediate stages of processing, you probably suffered from directory litter. You started out with 20 files called log_001.txt, log_002.txt, etc., and all you wanted was one summary file called log_sum.txt. In addition, you had a whole bunch of log_001.tmp, log_001.tm2, etc. files that, while they were labeled temporary, stuck around. To put order back into your directories, use temporary files in specific directories and clean them up afterwards.

To help in this temporary file management problem, Python provides a nice little module called tempfile that publishes two functions: mktemp( ) and TemporaryFile( ). The former returns the name of a file not currently in use in a directory on your computer reserved for temporary files (such as /tmp on Unix or C:\TEMP on Windows). The latter returns a new file object directly. For example:

# Read input file
inputFile = open('input.txt', 'r')

import tempfile
# Create temporary file
tempFile = tempfile.TemporaryFile(  )                   # We don't even need to 
first_process(input = inputFile, output = tempFile)   # know the filename...

# Create final output file
outputFile = open('output.txt', 'w')
second_process(input = tempFile, output = outputFile)
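Since first_process and second_process above are placeholders, here is a concrete round trip with TemporaryFile; note that you must rewind with seek(0) before reading back what you wrote (bytes mode shown, as newer Pythons require):

```python
import tempfile

tempFile = tempfile.TemporaryFile()         # No visible filename needed.
tempFile.write(b'intermediate results\n')   # Bytes here; plain strings in old Pythons.
tempFile.seek(0)                            # Rewind before reading it back.
data = tempFile.read()
tempFile.close()                            # The on-disk file vanishes here.
assert data == b'intermediate results\n'
```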

Using tempfile.TemporaryFile( ) works well in cases where the intermediate steps manipulate file objects. One of its nice features is that when the file object is deleted, it automatically deletes the file it created on disk, thus cleaning up after itself. One important use of temporary files, however, is in conjunction with the os.system call, which means using a shell, hence using filenames, not file objects. For example, let’s look at a program that creates form letters and mails them to a list of email addresses (on Unix only):

import os, tempfile

formletter = """Dear %s,
I'm writing to you to suggest that ..."""    # etc. 
myDatabase = [('Michael Jackson', '[email protected]'),
              ('Bill Gates', '[email protected]'),
              ('Bob', '[email protected]')]
for name, email in myDatabase:
    specificLetter = formletter % name
    tempfilename = tempfile.mktemp(  )
    tempFile = open(tempfilename, 'w')
    tempFile.write(specificLetter)
    tempFile.close(  )
    os.system('/usr/bin/mail %(email)s -s "Urgent!" < %(tempfilename)s' % vars(  )) 
    os.remove(tempfilename)

The first line in the for loop returns a customized version of the form letter based on the name it’s given. That text is then written to a temporary file that’s emailed to the appropriate email address using the os.system call. Finally, to clean up, the temporary file is removed.

The vars( ) function is a built-in function that returns a dictionary corresponding to the variables defined in the current local namespace. The keys of the dictionary are the variable names, and the values of the dictionary are the variable values. vars( ) comes in quite handy for exploring namespaces. It can also be called with an object as an argument (such as a module, a class, or an instance), and it will return the namespace of that object. Two other built-ins, locals( ) and globals( ), return the local and global namespaces, respectively. In all three cases, modifying the returned dictionaries doesn’t guarantee any effect on the namespace in question, so view these as read-only and you won’t be surprised. You can see that the vars( ) call creates a dictionary that is used by the string interpolation mechanism; it’s thus important that the names inside the %(...)s bits in the string match the variable names in the program.
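
As a minimal sketch of this interpolation idiom (the function and names here are purely illustrative), vars( ) can feed string interpolation directly:

```python
def greet(name, city):
    # vars() with no arguments returns the local namespace as a dictionary,
    # so the %(...)s keys must match the local variable names exactly.
    return "Hello %(name)s from %(city)s!" % vars()

greeting = greet("Guido", "Amsterdam")
```

This is exactly the mechanism the mail-merge loop above relies on: the dictionary returned by vars( ) supplies the values for each %(...)s slot.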

Modifying Inputs and Outputs

The argv attribute of the sys module holds one set of inputs to the current program—the command-line arguments, more precisely a list of the words typed on the command line, excluding the reference to the Python interpreter itself, if present. In other words, if you type at the shell:

csh> python run.py a x=3 foo

then when run.py starts, the value of the sys.argv attribute is ['run.py', 'a', 'x=3', 'foo']. The sys.argv attribute is mutable (after all, it’s just a list). Common usage involves iterating over the arguments of the Python program, that is, sys.argv[1:]; slicing from index 1 to the end gives all of the arguments to the program itself, but doesn’t include the name of the program (module) stored in sys.argv[0]. Two modules help you process command-line options: the older getopt, and, new in Python 2.3, a similar but more powerful module called optparse. Check the library reference for further details on how to use them.
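
As a hedged sketch of optparse usage (the option names here are invented for illustration): parse_args( ) normally reads sys.argv[1:], but an explicit list can be passed instead, which is convenient for testing:

```python
import optparse

parser = optparse.OptionParser()
parser.add_option("-v", "--verbose", action="store_true", default=False,
                  help="print extra output")
parser.add_option("-o", "--output", dest="output", default="out.txt",
                  help="name of the output file")

# Parse a sample argument list instead of the real command line.
options, args = parser.parse_args(["-v", "-o", "log.txt", "extra.txt"])
```

After the call, options carries the parsed flags (options.verbose, options.output) and args holds the leftover positional arguments.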

Experienced programmers will know that there are other inputs to a program, especially the standard input stream, with siblings for output and error messages. Python lets the programmer access and modify these through three file attributes in the sys module: sys.stdin, sys.stdout, and sys.stderr. Standard input is generally associated by the operating system with the user’s keyboard; standard output and standard error are usually associated with the console. The print statement in Python outputs to standard output (sys.stdout), while error messages such as exceptions are output on the standard error stream (sys.stderr). Python lets you modify these on the fly: you can redirect the output of a Python program to a file simply by assigning to sys.stdout:

sys.stdout = open('log.out', 'w')

After this line, any output will be written to the file log.out instead of showing up on the console. Note that if you don’t save it first, the reference to the “original” standard out stream is lost. It’s generally a good idea to save a reference before reallocating any of the standard streams, as in:

old_stdout = sys.stdout
sys.stdout = open('log.out', 'w')
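
A small sketch of this save-and-restore pattern; an in-memory StringIO buffer stands in for the log file so nothing is written to disk:

```python
import sys
try:
    from StringIO import StringIO      # Python 2
except ImportError:
    from io import StringIO            # Python 3

old_stdout = sys.stdout                # Save the original stream first.
sys.stdout = StringIO()                # Redirect output to a buffer.
sys.stdout.write("this goes to the buffer\n")
captured = sys.stdout.getvalue()
sys.stdout = old_stdout                # Restore the real standard output.
```

Restoring sys.stdout in this way means later print statements behave normally again.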

Using Standard I/O to Process Files

Why have a standard input stream? After all, it’s not that hard to type open('input.txt') in the program. The major argument for reading and writing with standard streams is that you can chain programs so that the standard output from one becomes the standard input of the next, with no file used in the transfer. This facility, known as piping, is at the heart of the Unix philosophy. Using standard I/O this way means that you can write a program to do a specific task once, and then use it to process files or the intermediate results of other programs at any time in the future. As an example, a simple program that counts the number of lines in a file could be written as:

import sys
data = sys.stdin.readlines(  )
print "Counted", len(data), "lines."

On Unix, you could test it by doing something like:

% cat countlines.py | python countlines.py 
Counted 3 lines.

On Windows or DOS, you’d do:

C:> type countlines.py | python countlines.py 
Counted 3 lines.

You can get each line in a file simply by iterating over a file object. This comes in very handy when implementing simple filter operations. Here are a few examples of such filter operations.

Finding all lines that start with a #

# Show comment lines (lines that start with a #, like this one).
import sys
for line in sys.stdin:
    if line[0] == '#':
        print line,

Note that a final comma is added after the print statement to indicate that the print operation should not add a newline, which would result in double-spaced output since the line string already includes a newline character as its last character.

The last two programs can easily be chained together with a pipe. To count the number of comment lines in commentfinder.py:

C:> type commentfinder.py | python commentfinder.py | python countlines.py
Counted 1 lines.

Some other filtering tasks that take from standard input and write to standard output follow.

Extracting the fourth column of a file (where columns are defined by whitespace)

import sys
for line in sys.stdin:
    words = line.split(  ) 
    if len(words) >= 4:
        print words[3]

We look at the length of the words list to find out whether there are indeed at least four words. The last two lines could also be replaced by a try/except statement, which is quite common in Python:

try:
    print words[3]
except IndexError:                     # There aren't enough words.
    pass

Extracting the fourth column of a file, where columns are separated by colons, and making it lowercase

import sys, string
for line in sys.stdin:
    words = line.split(':') 
    if len(words) >= 4:
        print words[3].lower(  )

If iterating over all of the lines isn’t what you want, just use the readlines( ) or read( ) methods of file objects.

Printing the first 10 lines, the last 10 lines, and every other line

import sys
lines = sys.stdin.readlines(  )
sys.stdout.writelines(lines[:10])          # First 10 lines
sys.stdout.writelines(lines[-10:])         # Last 10 lines
for lineIndex in range(0, len(lines), 2):  # Get 0, 2, 4, ...
    sys.stdout.write(lines[lineIndex])     # Get the indexed line.

Counting the number of times the word “Python” occurs in a file

text = open(fname).read(  )
print text.count('Python')

Changing a list of columns into a list of rows

In this more complicated example, the task is to transpose a file; imagine you have a file that looks like:

Name:   Willie   Mark   Guido   Mary  Rachel   Ahmed
Level:    5       4      3       1     6        4
Tag#:    1234   4451   5515    5124   1881    5132

And you really want it to look like the following instead:

Name:  Level:  Tag#:
Willie 5       1234
Mark   4       4451
...

You could use code like the following:

import sys
lines = sys.stdin.readlines(  )
wordlists = [line.split(  ) for line in lines]
for row in zip(*wordlists):
    print '\t'.join(row)

Of course, you should really use much more defensive programming techniques to deal with the possibility that not all lines have the same number of words in them, that there may be missing data, etc. Those techniques are task-specific and are left as an exercise to the reader.
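
The core of the transposition is the zip(*...) trick; here is a small self-contained illustration with made-up data:

```python
rows = [['Name:', 'Willie', 'Mark'],
        ['Level:', '5', '4'],
        ['Tag#:', '1234', '4451']]

# zip(*rows) groups the nth item of every row together,
# turning columns into rows.
transposed = [list(column) for column in zip(*rows)]
```

Note that zip stops at the shortest row, which is one reason ragged input needs the defensive handling mentioned above.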

Choosing chunk sizes

All the preceding examples assume you can read the entire file at once. In some cases, however, that’s not possible, for example, when processing really huge files on computers with little memory, or when dealing with files that are constantly being appended to (such as log files). In such cases, you can use a while/readline combination, where some of the file is read a bit at a time, until the end of file is reached. In dealing with files that aren’t line-oriented, you must read the file a character at a time:

# Read character by character.
while 1:
    next = sys.stdin.read(1)            # Read a one-character string
    if not next:                        # or an empty string at EOF.
        break
    # Process character 'next'.

Notice that the read( ) method on file objects returns an empty string at end of file, which breaks out of the while loop. Most often, however, the files you’ll deal with consist of line-based data and are processed a line at a time:

# Read line by line.
while 1:
    next = sys.stdin.readline(  )            # Read a one-line string
    if not next:                        # or an empty string at EOF.
        break
    # Process line 'next'.
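
For files that are neither character- nor line-oriented, a fixed chunk size between the two extremes often works well. This sketch uses an in-memory stream so it can run anywhere:

```python
try:
    from StringIO import StringIO      # Python 2
except ImportError:
    from io import StringIO            # Python 3

stream = StringIO("abcdefghij")        # Stands in for a large file.
chunks = []
while 1:
    chunk = stream.read(4)             # Read at most 4 characters at a time.
    if not chunk:                      # read() returns '' at end of file.
        break
    chunks.append(chunk)
```

The same loop works unchanged on a real file object; only the chunk size needs tuning to the data and available memory.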

Doing Something to a Set of Files Specified on the Command Line

Being able to read stdin is a great feature; it’s the foundation of the Unix toolset. However, one input is not always enough: many tasks need to be performed on sets of files. This is usually done by having the Python program parse the list of arguments sent to the script as command-line options. For example, if you type:

% python myScript.py input1.txt input2.txt input3.txt output.txt

you might think that myScript.py wants to do something with the first three input files and write a new file, called output.txt. Let’s see what the beginning of such a program could look like:

import sys
inputfilenames, outputfilename = sys.argv[1:-1], sys.argv[-1]
for inputfilename in inputfilenames:
    inputfile = open(inputfilename, "r")
    do_something_with_input(inputfile)
    inputfile.close(  )
outputfile = open(outputfilename, "w")
write_results(outputfile)
outputfile.close(  )

The second line extracts parts of the argv attribute of the sys module. Recall that it’s a list of the words on the command line that called the current program. It starts with the name of the script. So, in the example above, the value of sys.argv is:

['myScript.py', 'input1.txt', 'input2.txt', 'input3.txt', 'output.txt'].

The script assumes that the command line consists of one or more input files and one output file. So the slicing of the input file names starts at 1 (to skip the name of the script, which isn’t an input to the script in most cases), and stops before the last word on the command line, which is the name of the output file. The rest of the script should be pretty easy to understand (but won’t work until you provide the do_something_with_input( ) and write_results( ) functions).

Note that the preceding script doesn’t actually read in the data from the files, but passes the file object down to a function to do the real work. A generic version of do_something_with_input( ) is:

def do_something_with_input(inputfile):
    for line in inputfile:
        process(line)

Processing Each Line of One or More Files

The combination of this idiom with the preceding one regarding opening each file in the sys.argv[1:] list is so common that there is a module, fileinput, to do just this task:

import fileinput
for line in fileinput.input(  ):
    process(line)

The fileinput.input( ) call parses the arguments on the command line, and if there are no arguments to the script, uses sys.stdin instead. It also provides several useful functions that let you know which file and line number you’re currently manipulating, as we can see in the following script:

import fileinput, sys
# Take the first argument out of sys.argv and assign it to searchterm.
searchterm, sys.argv[1:] = sys.argv[1], sys.argv[2:]
for line in fileinput.input(  ):
   num_matches = line.count(searchterm)
   if num_matches:                     # A nonzero count means there was a match.
       print "found '%s' %d times in %s on line %d." % (searchterm, num_matches, 
           fileinput.filename(  ), fileinput.filelineno(  ))

Running this script (saved as mygrep.py) on a few Python files produces:

% python mygrep.py in *.py
found 'in' 2 times in countlines.py on line 2.
found 'in' 2 times in countlines.py on line 3.
found 'in' 2 times in mygrep.py on line 1.
found 'in' 4 times in mygrep.py on line 4.
found 'in' 2 times in mygrep.py on line 5.
found 'in' 2 times in mygrep.py on line 7.
found 'in' 3 times in mygrep.py on line 8.
found 'in' 3 times in mygrep.py on line 12.

Dealing with Binary Data: The struct Module

A file is considered a binary file if it’s not a text file or a file written in a format based on text, such as HTML and XML. Image and sound files are prototypical examples of binary files. A frequent question about file manipulation is “How do I process binary files in Python?” The answer to that question usually involves the struct module. It has a simple interface, since it exports just three functions: pack, unpack, and calcsize.

Let’s start with the task of decoding a binary file. Imagine a binary file bindat.dat that contains data in a specific format: first there’s a float corresponding to a version number, then a long integer corresponding to the size of the data, and then that many unsigned bytes corresponding to the actual data. The key to using the struct module is to define a format string that corresponds to the format of the data you wish to read, and find out which subset of the file corresponds to that data. For example:

import struct
data = open('bindat.dat').read(  )
start, stop = 0, struct.calcsize('fl')
version_number, num_bytes = struct.unpack('fl', data[start:stop])
start, stop = stop, stop + struct.calcsize('B'*num_bytes)
bytes = struct.unpack('B'*num_bytes, data[start:stop])

'f' is a format string for a single floating-point number (a C float, to be precise), 'l' is for a long integer, and 'B' is a format string for an unsigned char. The available unpack format strings are listed in Table 27-8. Consult the library reference manual for usage details.

Table 27-8. Common format codes used by the struct module

Format

C type

Python

x

pad byte

No value

c

char

String of length 1

b

signed char

Integer

B

unsigned char

Integer

h

short

Integer

H

unsigned short

Integer

i

int

Integer

I

unsigned int

Integer

l

long

Integer

L

unsigned long

Integer

f

float

Float

d

double

Float

s

char[ ]

String

p

char[ ]

String

P

void *

Integer

At this point, bytes is a tuple of num_bytes Python integers. If we know that the data is in fact storing characters, we could use chars = map(chr, bytes). To be more efficient, we could change the last unpack to use 'c' instead of 'B', which would do the conversion and return a tuple of num_bytes single-character strings. More efficiently still, we could use a format string that specifies a string of characters of a specified length, such as:

(chars,) = struct.unpack(str(num_bytes)+'s', data[start:stop])

The packing operation (struct.pack) is the exact converse; instead of taking a format string and a data string, and returning a tuple of unpacked values, it takes a format string and a variable number of arguments and packs those arguments using that format string into a new packed string.

Note that the struct module can process data that’s encoded with either kind of byte-ordering,[6] thus allowing you to write platform-independent binary file manipulation code. For large files, also consider using the array module.
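
As a quick sketch of pack and unpack round-tripping values, the '<' format prefix forces little-endian byte order (use '>' for big-endian), which is what makes binary file code platform-independent:

```python
import struct

# Pack a float and a long in explicit little-endian order...
packed = struct.pack('<fl', 1.5, 1000)
# ...and recover the same values on any platform.
version_number, num_bytes = struct.unpack('<fl', packed)
```

With an explicit byte-order prefix, struct also uses standard sizes, so calcsize('<fl') is the same on every machine.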

Internet-Related Modules

Python is used in a wide variety of Internet-related tasks, from making web servers to crawling the Web to “screen-scraping” web sites for data. This section briefly describes the most commonly used modules for such tasks that ship with Python’s core. For more detailed examples of their use, we recommend Lundh’s Python Standard Library and Martelli and Ascher’s Python Cookbook (O’Reilly). There are many third-party add-ons worth knowing about before embarking on a significant web- or Internet-related project.

The Common Gateway Interface: The cgi Module

Python programs often process forms from web pages. To make this task easy, the standard Python distribution includes a module called cgi. Chapter 28 includes an example of a Python script that uses the CGI, so we won’t cover it any further here.

Manipulating URLs: The urllib and urlparse Modules

Uniform resource locators (URLs) are strings such as http://www.python.org that are now ubiquitous. Three modules—urllib, urllib2, and urlparse—provide tools for processing URLs.

The urllib module defines a few functions for writing programs that must be active users of the Web (robots, agents, etc.). These are listed in Table 27-9.

Table 27-9. Functions of the urllib module

Function name

Behavior

urlopen(url [, data])

Opens (for reading) a network object denoted by a URL; it can also open local files:

>>> page = urlopen('http://www.python.org')
>>> page.readline(  )
'<HTML>\012'
>>> page.readline(  )
'<!-- THIS PAGE IS AUTOMATICALLY GENERATED. DO NOT EDIT. -->\012'

urlretrieve(url [, filename][, hook])

Copies a network object denoted by a URL to a local file (uses a cache):

>>> urllib.urlretrieve('http://www.python.org/', 'wwwpython.html')

urlcleanup( )

Cleans up the cache used by urlretrieve.

quote(string[, safe])

Replaces special characters in string using the %xx escape. The optional safe parameter specifies additional characters that shouldn’t be quoted; its default value is '/':

>>> quote('this & that @ home')
'this%20%26%20that%20%40%20home'

quote_plus(string[, safe])

Like quote( ), but also replaces spaces by plus signs.

unquote(string)

Replaces %xx escapes by their single-character equivalent:

>>> unquote('this%20%26%20that%20%40%20home')
'this & that @ home'

urlencode(dict)

Converts a dictionary to a URL-encoded string, suitable to pass to urlopen( ) as the optional data argument:

>>> locals(  )
{'urllib': <module 'urllib'>, '__doc__': None, 'x':
3, '__name__': '__main__', '__builtins__': <module
'__builtin__'>}
>>> urllib.urlencode(locals(  ))
'urllib=%3cmodule+%27urllib%27%3e&__doc__=None&x=3&
__name__=__main__&__builtins__=%3cmodule+%27
__builtin__%27%3e'

The module urllib2 focuses on the tasks of opening URLs that the simpler urllib doesn’t know how to deal with, and provides an extensible framework for new kinds of URLs and protocols. It is what you should use if you want to deal with passwords, digest authentication, proxies, HTTPS URLs, and other fancy URLs.

The module urlparse defines a few functions that simplify taking URLs apart and putting new URLs together. These are listed in Table 27-10.

Table 27-10. Functions of the urlparse module

Function name

Behavior

urlparse(urlstring[, default_scheme[, allow_fragments]])

Parses a URL into six components, returning a six tuple (addressing scheme, network location, path, parameters, query, fragment identifier):

>>> urlparse('http://www.python.org/FAQ.html')
('http', 'www.python.org', '/FAQ.html', '', '', '')

urlunparse(tuple)

Constructs a URL string from a tuple as returned by urlparse( )

urljoin(base, url[, allow_fragments])

Constructs a full (absolute) URL by combining a base URL (base) with a relative URL (url):

>>> urljoin('http://www.python.org', 'doc/lib')
'http://www.python.org/doc/lib'

Specific Internet Protocols

The most commonly used protocols built on top of TCP/IP are supported with modules named after them. The telnetlib module lets you act like a Telnet client. The httplib module lets you talk to web servers with the HTTP protocol. The ftplib module is for transferring files using the FTP protocol. The gopherlib module is for browsing Gopher servers (now fairly rare). In the domains of mail and news, you can use the poplib and imaplib modules for reading mail on POP3 and IMAP servers, respectively; the smtplib module for sending mail; and the nntplib module for reading and posting Usenet news from NNTP servers.

There are also modules that can build Internet servers, specifically a generic socket-based IP server (SocketServer), a simple web server (SimpleHTTPServer), a CGI-compliant HTTP server (CGIHTTPServer), and a module for building asynchronous socket handling services (asyncore).

Support for web services currently consists of a core library to process XML-RPC client-side calls (xmlrpclib), as well as a simple XML-RPC server implementation (SimpleXMLRPCServer). Support for SOAP is likely to be added when the SOAP standard becomes more stable.

Processing Internet Data

Once you use an Internet protocol to obtain files from the Internet (or before you serve them to the Internet), you often must process these files. They come in many different formats. Table 27-11 lists each module in the standard library that processes a specific kind of Internet-related file format (there are others for sound and image format processing; see the library reference manual).

Table 27-11. Modules dedicated to Internet file processing

Module name

File format

sgmllib

A simple parser for SGML files.

htmllib

A parser for HTML documents.

formatter

Generic output formatter and device interface.

rfc822

Parse RFC-822 mail headers (i.e., “Subject: hi there!”).

mimetools

Tools for parsing MIME-style message bodies (a.k.a. file attachments).

multifile

Support for reading files that contain distinct parts.

binhex

Encode and decode files in binhex4 format.

uu

Encode and decode files in uuencode format.

binascii

Convert between binary and various ASCII-encoded representations.

xdrlib

Encode and decode XDR data.

mailcap

Mailcap file handling.

mimetypes

Mapping of filename extensions to MIME types.

base64

Encode and decode MIME base64 encoding.

quopri

Encode and decode MIME quoted-printable encoding.

mailbox

Read various mailbox formats.

mimify

Convert mail messages to and from MIME format.

email

A package for parsing, handling, and generating email messages.

XML Processing

Python comes with a rich set of XML-processing tools. These include parsers, DOM interfaces, SAX interfaces, and more, as shown in Table 27-12.

Table 27-12. Some of the XML modules in the core distribution

Module name

Description

xml.parsers.expat

An interface to the Expat nonvalidating XML parser

xml.dom

Document Object Model (DOM) API for Python

xml.dom.minidom

Lightweight DOM implementation

xml.dom.pulldom

Support for building partial DOM trees from SAX events

xml.sax

Package containing SAX2 base classes and convenience functions

xml.sax.handlers

Base classes for SAX event handlers

xml.sax.saxutils

Convenience functions and classes for use with SAX

xml.sax.xmlreader

Interface that SAX-compliant XML parsers must implement

xmllib

A parser for XML documents

See the standard library reference for details, or the Python Cookbook (O’Reilly) for example tasks easily solved using the standard XML libraries. The XML facilities are developed by the XML Special Interest Group, which publishes versions of the XML package in-between Python releases. See http://www.python.org/topics/xml for details and the latest version of the code. For expanded coverage, consider Python and XML, by Christopher A. Jones and Fred L. Drake, Jr. (O’Reilly).

Executing Programs

The last set of built-in functions in this section has to do with creating, manipulating, and executing Python code. See Table 27-13.

Table 27-13. Ways to execute Python code

Name

Behavior

import

Executes the code in a module as part of importing it and binds the module to a name in the scope in which the import statement occurs. You can control which name the module is bound to by using the import modulename as name form.

exec code [ in globaldict [, localdict]]

Executes the specified code (string, file, or compiled code object) in the optionally specified global and local namespaces. This is sometimes useful when executing user-entered code, as in an interactive shell or “macro” window.

compile(string, filename, kind)

Compiles the string into a code object. This function is only useful as an optimization.

execfile(filename[, globaldict[, localdict]])

Executes the program in the specified filename, using the optionally specified global and local namespaces. This function is sometimes useful in systems which use Python as an extension language for the users of the system.

eval(code[, globaldict[, localdict]])

Evaluates the specified expression (string or compiled code object) in the optionally specified global and local namespaces, and returns the expression’s result. Calculator-type programs that ask the users for an expression to compute often use eval.

It’s a simple matter to write programs that run other programs. Shortly, we’ll talk about ways to call any program from within a Python program. There are several mechanisms that let you execute arbitrary Python code. The most important is one statement we’ve used throughout the book, import, which executes code existing in files on the Python path. There are several other ways to execute Python code in-process. The first is the exec statement:

exec code [ in globaldict [, localdict]]

exec takes between one and three arguments. The first argument must contain Python code—either in a string, as in the following example; in an open file object; or in a compiled code object. For example:

>>> code = "x = 'Something'"
>>> x = "Nothing"                            # Sets the value of x
>>> exec code                                # Modifies the value of x!
>>> print x
Something

exec can take optional arguments. If a single dictionary argument is provided (after the in keyword), it’s used as both the global and local namespace for the execution of the specified code. If two dictionary arguments are provided, they are used as the global and local namespaces, respectively. If both are omitted, as in the previous example, the current global and local namespaces are used.
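
The dictionary form is handy for keeping executed code out of your own namespace. A brief sketch (the tuple-style call shown here is accepted by the Python 2 exec statement as well as the Python 3 exec function):

```python
namespace = {}                          # An isolated namespace for the code.
exec("x = 3; y = x * 2", namespace)     # Runs without touching our variables.
result = namespace['y']                 # Read results back out of the dict.
```

Names defined by the executed code end up as keys in the dictionary rather than in the caller's scope.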

When exec is called, Python needs to parse the code that is being executed. This can be a computationally expensive process, especially if a large piece of code needs to be executed thousands of times. If this is the case, it’s worth compiling the code first (once), and executing it as many times as needed. The compile function takes a string containing the Python code and returns a compiled code object, which can then be processed efficiently by the exec statement.

compile takes three arguments. The first is the code string. The second is the filename corresponding to the Python source file (or '<string>' if it wasn’t read from a file); it’s used in the traceback in case an exception is generated when executing the code. The third argument is one of 'single', 'exec', or 'eval', depending on whether the code is a single statement whose result would be printed (just as in the interactive interpreter), a set of statements, or an expression (creating a compiled code object for use by the eval function).
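
A brief sketch of the compile-once, run-many pattern; here the code is an expression, so it is compiled with the 'eval' kind and fed to the eval function:

```python
code = compile("x * 2 + 1", "<string>", "eval")     # Parse and compile once.
results = [eval(code, {"x": n}) for n in range(3)]  # Evaluate it many times.
```

Each eval call reuses the compiled code object, skipping the parsing step entirely.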

A related function is the execfile built-in function. Its first argument must be the filename of a Python script instead of a file object or string (remember that file objects are the things the open built-in returns when it’s passed a filename). Thus, if you want your Python script to start by running its arguments as Python scripts, you can do something like:

import sys
for argument in sys.argv[1:]:          # We'll skip ourselves, or it'll go forever!
    execfile(argument)                 # Do whatever.

Two warnings are warranted with respect to execfile. First, it is logical but sometimes surprising that execfile executes by default in its local scope—thus, calling execfile from inside a function will often have much more localized effects than users expect. Second, execfile is almost never the right answer—if you’re writing the code being executed, you should really put it in a module and import it. The behavior you’ll get will be much more predictable, safe, and maintainable—it’s just too easy for execfile( ) code to wreak havoc with the module calling execfile.

Two more functions can execute Python code. The first is the eval function, which takes a code string (and the by now expected optional pair of dictionaries) or a compiled code object and returns the evaluation of that expression. For example:

>>> word = 'xo'
>>> z = eval("word*10")
>>> print z
xoxoxoxoxoxoxoxoxoxo

The eval function can’t work with statements, as shown in the following example, because expressions and statements are different syntactically:

>>> z = eval("x = 3")
Traceback (innermost last):
File "<stdin>", line 1, in ?
File "<string>", line 1
x = 3
      ^
SyntaxError: invalid syntax

The last function that executes code is apply. It’s called with a callable object, an optional tuple of positional arguments, and an optional dictionary of keyword arguments. A callable object is any function (standard functions, methods, etc.), any class object (they create an instance when called), or any instance of a class that defines a __call__ method. apply is slowly being deprecated because it is no longer necessary or the most efficient way of doing what it does—you can always replace it with a simple call, using the * and ** argument markers. As we saw in the chapter on OOP, one can call a method defined in a base class using:

class Derived(Base):
    def __init__(self, arg, *args, **kw):
        Base.__init__(self, *args, **kw)

This code used to be written using apply as in:

class Derived(Base):
    def __init__(self, arg, *args, **kw):
        apply(Base.__init__, (self,) + args, kw)

As you can see, the new variant is much cleaner, which is why apply is quickly becoming obsolete.

If you’re not sure if an object is callable (e.g., if it’s an argument to a function), test it using the callable built-in, which returns true if the object it’s called with is callable.[7]

>>> callable(sys.exit), type(sys.exit)
(1, <type 'builtin_function_or_method'>)
>>> callable(sys.version), type(sys.version)
(0, <type 'string'>)

There are other built-in functions we haven’t covered; if you’re curious, check a reference source such as the library reference manual (Section 2.3).

Debugging, Testing, Timing, Profiling

To wrap up our overview of common Python tasks, we’ll cover some tasks that are common for Python programmers even though they’re not programming tasks per se—debugging, testing, timing, and optimizing Python programs.

Debugging with pdb

The first task is, not surprisingly, debugging. Python’s standard distribution includes a debugger called pdb. Using pdb is fairly straightforward. You import the pdb module and call its run method with the Python code the debugger should execute. For example, if you’re debugging the program in spam.py, do this:

>>> import spam                        # Import the module we want to debug.
>>> import pdb                         # Import pdb.
>>> pdb.run('instance = spam.Spam(  )') # Start pdb with a statement to run.
> <string>(0)?(  )
(Pdb) break spam.Spam.__init__                # We can set break points.
(Pdb) next
> <string>(1)?(  )
(Pdb) n                                        # 'n' is short for 'next'.
> spam.py(3)__init__(  )
-> def __init__(self):
(Pdb) n
> spam.py(4)__init__(  )
-> Spam.numInstances = Spam.numInstances + 1
(Pdb) list                                     # Show the source code listing.
  1    class Spam:
  2        numInstances = 0
  3 B      def __init__(self):                  # Note the B for Breakpoint.
  4  ->        Spam.numInstances = Spam.numInstances + 1  # Where we are
  5        def printNumInstances(self):
  6            print "Number of instances created: ", Spam.numInstances
  7
[EOF]
(Pdb) where                                    # Show the calling stack.
<string>(1)?(  )
> spam.py(4)__init__(  )
-> Spam.numInstances = Spam.numInstances + 1
(Pdb) Spam.numInstances = 10          # Note that we can modify variables
(Pdb) print Spam.numInstances         # while the program is being debugged.
10
(Pdb) continue                        # This continues until the next break-
--Return--                            # point, but there is none, so we're
-> <string>(1)?(  )->None                 # done.
(Pdb) c                               # This ends up quitting Pdb.
<spam.Spam instance at 80ee60>        # This is the returned instance.
>>> instance.numInstances             # Note that the change to numInstances
11                                    # was made before the increment op.

As the session above shows, with pdb you can list the current code being debugged (with an arrow pointing to the line about to be executed), examine variables, modify variables, and set breakpoints. Chapter 9 in the Library Reference covers the debugger in detail. Alternative debuggers abound, from the one in IDLE, to the more full-featured debuggers you’ll find in commercial IDEs for Python.
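To reproduce the session above, you need the spam.py module it debugs. The following version is reconstructed from the source listing shown by the list command; the print is written in function form so it also runs on current Pythons:

```python
# spam.py -- reconstructed from the debugger's source listing above;
# print is written as a function call so this also runs on Python 3.

class Spam:
    numInstances = 0                   # Class attribute shared by all instances.

    def __init__(self):
        Spam.numInstances = Spam.numInstances + 1

    def printNumInstances(self):
        print("Number of instances created: ", Spam.numInstances)
```

Save it as spam.py in the current directory before starting the pdb session shown above.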

Testing with unittest

Testing software is, in the general case, a very hard problem. For software that takes user input or otherwise interacts with the outside world, comprehensive testing of even a medium-sized program quickly becomes impractical. Luckily, nonexhaustive testing still yields many benefits. The easiest kind of testing to do is called unit testing, and it is supported in Python by the unittest module. In unit testing, you write very small scripts, each of which checks one fact about the program being tested. The trick is to write lots of these simple tests, to learn how to write useful unit tests as opposed to trivial ones, and to run the tests after every change to the program. If you have a test suite with good coverage, you’ll gain confidence that each change you make is not going to break another part of the system.
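As a minimal sketch (the function and test names here are invented for illustration, not part of the standard library), a unit test file might look like this, written so it also runs on current Pythons:

```python
import unittest

def make_zeros(n):
    # Hypothetical function under test: build a list of n zeros.
    return [0] * n

class MakeZerosTest(unittest.TestCase):
    # Each test method checks exactly one fact about make_zeros.
    def test_length(self):
        self.assertEqual(len(make_zeros(10000)), 10000)

    def test_contents(self):
        for item in make_zeros(100):
            self.assertEqual(item, 0)
```

Adding a call to unittest.main() at the bottom (or, in current Pythons, running the file with python -m unittest) executes every method whose name starts with test and reports any failures.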

unittest is documented as part of the standard library, as well as on the PyUnit web site (http://pyunit.sourceforge.net).

Timing

Even when a program is working, it can sometimes be too slow. If you know what the bottleneck in your program is, and you know of alternative ways to code the same algorithm, then you might time the various alternatives to find out which is fastest. The time module, which is part of the standard distribution, provides many time-manipulation routines. We’ll use just one, clock, which returns a timing measurement with the highest precision available on your machine. Since we’ll use only relative times to compare algorithms, absolute precision isn’t all that important. Here are two different ways to create a list of 10,000 zeros:

def lots_of_appends(  ):
    zeros = [  ]
    for i in range(10000):
        zeros.append(0)

def one_multiply(  ):
    zeros = [0] * 10000

How can we time these two solutions? Here’s a simple way:

import time, makezeros
def do_timing(num_times, *funcs):
    totals = {  }
    for func in funcs:
        totals[func] = 0.0
        starttime = time.clock(  )         # Record starting time.
        for x in range(num_times):
            func(  )                       # Call the function being timed.
        stoptime = time.clock(  )          # Record ending time.
        elapsed = stoptime - starttime     # Difference yields time elapsed.
        totals[func] = totals[func] + elapsed
    for func in funcs:
        print "Running %s %d times took %.3f seconds" % (func.__name__, num_times,
                                                         totals[func])

do_timing(100, makezeros.lots_of_appends, makezeros.one_multiply)

And running this program yields:

csh> python timings.py
Running lots_of_appends 100 times took 7.891 seconds
Running one_multiply 100 times took 0.120 seconds

As you might have suspected, a single list multiplication is much faster than lots of appends. Note that in timings, it’s always a good idea to compare lots of runs of functions instead of just one. Otherwise, the timings are likely to be heavily influenced by things that have nothing to do with the algorithm, such as network traffic on the computer or GUI events. Python 2.3 introduces a new module called timeit that provides a very simple way to do precise code timing correctly.
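The same comparison can be sketched with timeit. The statements below are inline equivalents of lots_of_appends and one_multiply, and the module-level timeit function shown here is the form available in current Pythons (in Python 2.3 you would use the timeit.Timer class instead):

```python
import timeit

# Inline equivalents of the two list-building approaches being compared.
append_stmt = """\
zeros = []
for i in range(10000):
    zeros.append(0)
"""
multiply_stmt = "zeros = [0] * 10000"

# timeit runs each statement `number` times and returns the total seconds.
t_append = timeit.timeit(append_stmt, number=100)
t_multiply = timeit.timeit(multiply_stmt, number=100)
print("appends:  %.3f seconds" % t_append)
print("multiply: %.3f seconds" % t_multiply)
```

timeit takes care of details such as picking the most precise clock on your platform, which is why it is preferred over hand-rolled timing loops.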

What if you’ve written a complex program, and it’s running slower than you’d like, but you’re not sure where the problem spot is? In that case, what you need to do is profile the program: determine which parts of the program are the time-sinks and see if they can be optimized, or if the program structure can be modified to even out the bottlenecks. The Python distribution includes just the right tools for that, the profile module, documented in the Library Reference, and another module, hotshot, which is unfortunately not well documented as of this writing. Assuming that you want to profile a given function in the current namespace, do this:

>>> import profile
>>> from timings import *
>>> from makezeros import *
>>> profile.run('do_timing(100, lots_of_appends, one_multiply)')
Running lots_of_appends 100 times took 8.773 seconds
Running one_multiply 100 times took 0.090 seconds
         203 function calls in 8.823 CPU seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    8.574    0.086    8.574    0.086 makezeros.py:1(lots_of_appends)
      100    0.101    0.001    0.101    0.001 makezeros.py:6(one_multiply)
        1    0.001    0.001    8.823    8.823 profile:0(do_timing(100, lots_of_appends, one_multiply))
        0    0.000             0.000          profile:0(profiler)
        1    0.000    0.000    8.821    8.821 python:0(194.C.2)
        1    0.147    0.147    8.821    8.821 timings.py:2(do_timing)

As you can see, this gives a fairly complicated listing, which includes such things as per-call time spent in each function and the number of calls made to each function. In complex programs, the profiler can help find surprising inefficiencies. Optimizing Python programs is beyond the scope of this book; if you’re interested, however, check the Python newsgroup: periodically, a user asks for help speeding up a program and a spontaneous contest starts up, with interesting advice from expert users.
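More recent Pythons also ship cProfile, a faster drop-in replacement for profile, along with the pstats module for formatting the collected data. Here is a sketch (using an inline copy of lots_of_appends, since the makezeros module above is assumed rather than installed):

```python
import cProfile
import io
import pstats

def lots_of_appends():
    zeros = []
    for i in range(10000):
        zeros.append(0)

# Collect profiling data across 100 calls of the function.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    lots_of_appends()
profiler.disable()

# Format the statistics into a string and show the top five entries,
# sorted by cumulative time spent in each function.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report has the same columns as the profile listing above (ncalls, tottime, cumtime, and so on), so the reading skills transfer directly.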

Exercises

This chapter is full of programs we encourage you to type in and play with. However, here are a few more challenging exercises:

See Section B.8.1 for the solutions.

  1. Avoiding regular expressions. Write a program that obeys the same requirements as pepper.py but doesn’t use regular expressions to do the job. This is somewhat difficult, but a useful exercise in building program logic.

  2. Wrapping a text file with a class. Write a class that takes a filename and reads the data in the corresponding file as text. Give the class three methods: paragraph, line, and word, each of which takes an integer argument, so that if mywrapper is an instance of this class, printing mywrapper.paragraph(0) prints the first paragraph of the file, mywrapper.line(-2) prints the next-to-last line in the file, and mywrapper.word(3) prints the fourth word in the file.

  3. Describing a directory. Write a function that takes a directory name and describes the contents of the directory, recursively (in other words, for each file, print the name and size, and proceed down any eventual directories).

  4. Modifying the prompt. Modify your interpreter so that the prompt is, instead of the >>> string, a string describing the current directory and the number of lines entered in the current Python session. Two hints: the prompt variable sys.ps1 doesn’t have to be a string but can be any object, and displaying such an object calls its __str__ method (falling back to __repr__), which can have side effects.

  5. Writing a shell. Using the Cmd class in the cmd module and the functions described in this chapter for manipulating files and directories, write a little shell that accepts the standard Unix commands (or DOS commands): ls (dir) for listing the current directory, cd for changing directory, mv (or ren) for moving/renaming a file, and cp (copy) for copying a file.

  6. Redirecting stdout. Modify the mygrep.py script to output to the last file specified on the command line instead of to the console.



[1] As we’re not going to be subclassing from built-in types in this chapter, it makes no difference to us whether these conversion calls are functions (which they were until recent versions of Python) or class creators (which they are in Python 2.2 or later)—either way, they take objects as input and return new objects of the appropriate type (assuming the specific conversion is allowed). In this section we’ll refer to them as functions as a matter of convenience.

[2] For a variety of mostly historical reasons, even some unreasonable comparisons (1 > "2") will yield a value.

[3] Raw strings can’t end with an odd number of backslash characters. That’s unlikely to be a problem when using raw strings for regular expressions, however, since regular expressions can’t end with backslashes.

[4] Some objects don’t qualify as “reasonably copyable,” such as modules, file objects, and sockets. Remember that file objects are different from files on disk as they are opened at a particular point, and are possibly not even fully written to disk yet. For copying files on disk, the shutil module is introduced later in this chapter.

[5] Another useful function in the random module is the choice function, which returns a random element in the sequence passed in as argument.

[6] The order with which computers list multibyte words depends on the chip used (so much for standards). Intel and DEC systems use little-endian ordering, while Motorola and Sun-based systems use big-endian ordering. Network transmissions also use big-endian ordering, so the struct module comes in handy when doing network I/O on PCs.

[7] You can find out many things about callable objects, such as how many arguments they expect and what the names and default values of their arguments are, by checking the Language Reference, especially Section 3.2, which describes all attributes for each type. Even easier is using the inspect module, which is designed for exactly that.
